Class imbalance

Often in machine learning, the data has a strong class imbalance – e.g., far more negative labels than positive ones. This is tricky to handle both during model training and when evaluating the result. For instance, if only 1% of the training data is positive, a model that simply predicts everything as negative achieves 99% accuracy, so plain accuracy says little about how useful the model actually is.
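
A minimal sketch of this accuracy trap (the 1% positive rate and the sample size are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: roughly 1% positive, 99% negative.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

# Roughly 0.99, even though the model catches zero positives.
print(accuracy_score(y_true, y_pred))
```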

Not only do we need metrics that take the imbalance into account (see ROC vs. precision-recall), but we also need to think about how the test dataset is prepared. Should performance be measured on a balanced sample, or on data with the original imbalance?
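
As a hedged illustration of the metric choice, the sketch below compares ROC AUC with average precision (the area under the precision-recall curve) on synthetic imbalanced data; the ~1% positive rate and the logistic regression model are assumptions, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a strong imbalance (~1% positives).
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# ROC AUC can look flattering under heavy imbalance; average precision
# is usually the more honest number for a rare positive class.
print("ROC AUC:          ", roc_auc_score(y_te, scores))
print("Average precision:", average_precision_score(y_te, scores))
```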

Furthermore, strong imbalance may make it difficult for the model to learn from the data. There is also the question of whether a model trained on a balanced dataset can perform well in the wild, where the strong class imbalance remains.
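
One common mitigation (an assumption here, not something this note prescribes) is cost-sensitive training: reweight the loss instead of rebalancing the data, then evaluate on a held-out set that keeps the original imbalance to approximate "in the wild" performance. A sketch with synthetic data (the ~2% positive rate is made up):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumed setup: ~2% positives, mirroring a strong real-world imbalance.
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" upweights the rare class in the loss instead of
# resampling. The test set keeps the original imbalance.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

print("recall (plain):   ", recall_score(y_te, plain.predict(X_te)))
print("recall (weighted):", recall_score(y_te, weighted.predict(X_te)))
```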

Over/Undersampling Methods
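
A minimal sketch of the two naive variants, using the imbalanced-learn library (the toy data with ~5% positives is an assumption; SMOTE and related methods live in the same package):

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Assumed toy data: ~5% positives.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)
print("original:    ", Counter(y))

# Oversampling: duplicate minority rows until the classes match.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled: ", Counter(y_over))

# Undersampling: drop majority rows until the classes match.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```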

Libraries

Articles