Class imbalance

Often in machine learning, the data has a strong class imbalance – e.g., far more negative labels than positive ones. This is tricky to handle both during model training and when evaluating the result. For instance, if only 1% of the training data is positive, a model that simply predicts everything as negative achieves 99% accuracy, so plain accuracy says little about how useful the model actually is.
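
A minimal sketch of this accuracy trap (the 1% positive rate and the sample size are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: roughly 1% positive, 99% negative.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

# Roughly 0.99, even though the model catches zero positives.
print(accuracy_score(y_true, y_pred))
```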

Not only do we need metrics that take the imbalance into account (see ROC vs. precision-recall), but we also need to think about how the test dataset is prepared. Should performance be measured on a balanced sample, or on data with the original imbalance?
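
As a hedged illustration of the metric choice, the sketch below compares ROC AUC with average precision (the area under the precision-recall curve) on synthetic imbalanced data; the ~1% positive rate and the logistic regression model are assumptions, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a strong imbalance (~1% positives).
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# ROC AUC can look flattering under heavy imbalance; average precision
# is usually the more honest number for a rare positive class.
print("ROC AUC:          ", roc_auc_score(y_te, scores))
print("Average precision:", average_precision_score(y_te, scores))
```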

Furthermore, strong imbalance may make it difficult for the model to learn from the data. There is also the question of whether a model trained on a balanced dataset can perform well in the wild, where the strong class imbalance remains.
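
One common mitigation (an assumption here, not something this note prescribes) is cost-sensitive training: reweight the loss instead of rebalancing the data, then evaluate on a held-out set that keeps the original imbalance to approximate "in the wild" performance. A sketch with synthetic data (the ~2% positive rate is made up):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumed setup: ~2% positives, mirroring a strong real-world imbalance.
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" upweights the rare class in the loss instead of
# resampling. The test set keeps the original imbalance.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

print("recall (plain):   ", recall_score(y_te, plain.predict(X_te)))
print("recall (weighted):", recall_score(y_te, weighted.predict(X_te)))
```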

Over/Undersampling Methods
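
A minimal sketch of the two naive variants, using the imbalanced-learn library (the toy data with ~5% positives is an assumption; SMOTE and related methods live in the same package):

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Assumed toy data: ~5% positives.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)
print("original:    ", Counter(y))

# Oversampling: duplicate minority rows until the classes match.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled: ", Counter(y_over))

# Undersampling: drop majority rows until the classes match.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```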

Libraries

Articles