Imbalanced Classes FAQ
January 5th, 2017
We previously published a post on imbalanced classes by Tom Fawcett. The response was impressive, and we’ve found a good deal of value in the discussion that took place in the comments. Below are some additional questions and resources offered by readers, with Tom’s responses where appropriate.
Questions and clarifications
Which technique would be best when working with a Predictive Maintenance (PdM) model?
This is kind of a vague question, so I'll have to make some assumptions about what you're asking. Usually the dominant problem with predictive maintenance is the FP rate, since faults happen so rarely. You have so many negatives that you need a very low (e.g., < 0.05) FP rate or you'll spend most of the effort dealing with false alarms.
My advice is:
- Try some of these techniques (especially the downsampled-bagged approach that I showed) to learn the best classifier you can; see the sketch after this list.
- Use a very conservative (high threshold) operating point to keep FPs down.
- If neither of those gets you close enough, you could see whether it's possible to break the problem in two such that the "easy" false alarms can be disposed of cheaply, and you only need more expensive (human) intervention for the remaining ones.
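For concreteness, here is a minimal sketch of the downsample-and-bag idea. It assumes a binary problem where y == 1 is the rare fault class; the base classifier, ensemble size, and the 0.9 threshold are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def downsampled_bagging(X, y, n_estimators=10, base_clf=None, seed=0):
    """Fit n_estimators models, each on every positive example plus an
    equal-sized random sample of negatives; combine by averaging scores."""
    rng = np.random.RandomState(seed)
    base_clf = base_clf if base_clf is not None else DecisionTreeClassifier()
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        neg_sample = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, neg_sample])
        models.append(clone(base_clf).fit(X[idx], y[idx]))
    return models

def ensemble_scores(models, X):
    # Mean positive-class probability across the bagged models.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# A conservative operating point (point 2 above): only raise an alarm
# when the ensemble is very confident, to keep false positives down.
# scores = ensemble_scores(models, X_new)
# alarms = scores >= 0.9   # threshold chosen by validation, not a fixed rule
```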
Can you suggest a modeling framework that targets high recall at low FP rates?
I’m not quite sure what “modeling framework” refers to here. High recall at low FP rates amounts to (near) perfect performance, so it sounds like you’re asking for a framework that will produce an ideal classifier on any given dataset. I’m afraid there’s no single framework (or regimen) that can do that.
If you’re asking something else, please clarify.
Why did over- and undersampling affect variance as they did in the post? Shouldn't the (biased) sample variance stay the same when duplicating the dataset, while there'd be no asymptotic difference when using undersampling?
A fellow reader stepped in to help with this question:
“You’re very close; the key is the n − 1 in the denominator. When you duplicate every point in a dataset, the mean stays the same and the numerator of the variance exactly doubles, but the variance itself decreases, because the denominator grows from n − 1 to 2n − 1, which is more than a factor of two.
Mathematically, variance is defined as E([X − E(X)]^2), where E() is the mean (sum everything up and divide by n). When estimating the variance from a sample, however, the last step divides the sum of squared differences by n − 1 rather than n, because dividing by n underestimates the true variance on average.
Suppose some dataset consists of 5 points, and say the numerator is sum([X − E(X)]^2) = Y, so the variance is Y/4. Now duplicate the data, creating dataset Z: you have 10 points and the numerator of the formula is sum([Z − E(Z)]^2) = 2Y. But now the variance is 2Y/9, which is smaller than Y/4.”
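A quick numeric check of this example (the five points below are made up purely for illustration; np.var with ddof=1 computes the n − 1 sample variance):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])  # 5 points, mean = 5
z = np.concatenate([x, x])                # duplicate the whole dataset

print(np.var(x, ddof=1))  # Y/4  = 66/4  -> 16.5
print(np.var(z, ddof=1))  # 2Y/9 = 132/9 -> 14.666...
```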
With well-behaved data, this effect is small enough that it rarely matters in practical terms.
Additional tools and references
- The Python toolbox imbalanced-learn, as well as an associated Jupyter notebook (a short usage sketch follows this list).
- Angle-based subspace/outlier methods are alternatives to neighbor-based approaches. Examples can be found here and here.
- “Using Random Forest to Learn Imbalanced Data” by Chen et al. is another reference on the topic.
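For readers who want to try imbalanced-learn, here is a minimal sketch of random undersampling. The synthetic dataset and its roughly 10% positive rate are made up for illustration; depending on your installed version, the resampling method may be named fit_sample rather than fit_resample.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.1).astype(int)  # ~10% positives, 90% negatives

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)   # older versions use fit_sample
print(np.bincount(y_res))               # classes are now balanced
```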