Different Ways to Handle an Imbalanced Data Set in ML
A common problem encountered when training machine learning models is imbalanced data. An imbalanced dataset can lead to inaccurate results even when the best models are used. If the data is biased, the results will also be biased: if a data set is skewed towards one class, the algorithm will tend to be biased towards that same class.
An imbalanced data set typically refers to a classification problem where the number of observations per class is not equally distributed: you have a large number of observations for one class (the majority class) and far fewer observations for one or more other classes (the minority classes).
For example, suppose you have 100k data points for a two-class classification problem. Out of these, 10k data points belong to class A and 90k belong to class B.
In classification we work with pairs of data points and labels, where the label is the class associated with each data point. If the distribution of the labels is not reasonably uniform, the dataset is called imbalanced.
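A minimal sketch of what such a dataset looks like in code (this uses scikit-learn's make_classification as an illustrative assumption, not part of the original example; the later snippets assume X and y are defined in a similar way):
from collections import Counter
from sklearn.datasets import make_classification

# Hypothetical two-class dataset with roughly a 90/10 split, mirroring the example above.
X, y = make_classification(n_samples=100_000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
print(sorted(Counter(y).items()))  # approximately [(0, 90000), (1, 10000)]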
Challenges faced with Imbalanced datasets
- ML algorithms can report misleadingly high accuracy because of the unequal distribution of the target variable (see the sketch after this list).
- The performance of standard classifiers becomes biased towards the majority class.
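To see why plain accuracy is misleading, consider a baseline that always predicts the majority class. The sketch below assumes the 90/10 X and y from above and uses scikit-learn's DummyClassifier purely as an illustration:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A classifier that always predicts the majority class still scores ~90% accuracy,
# even though it never predicts the minority class at all.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # roughly 0.9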
Sampling-based approaches to handling imbalanced data
- Oversampling, by adding more of the minority class so it has more effect on the machine learning algorithm
- Under-sampling, by removing some of the majority class so it has less effect on the machine learning algorithm
- Hybrid, a mix of oversampling and under-sampling
Naive random over-sampling
This technique over-samples the minority classes, increasing the number of minority observations until the dataset is balanced.
Random over-sampling simply repeats some of the existing minority samples to balance the class counts.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Naive random over-sampling: duplicate minority samples at random until the classes are balanced.
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

# Define the over-sampling strategy explicitly: resample the minority class
# until it has half as many samples as the majority class.
oversample = RandomOverSampler(sampling_strategy=0.5)
X_resampled, y_resampled = oversample.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
This means that if the majority class had 1,000 examples and the minority class had 100, the transformed dataset would have 500 examples of the minority class.
Over-sampling using SMOTE, ADASYN and SMOTETomek
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.combine import SMOTETomek

# SMOTE: create synthetic minority samples by interpolating between existing
# minority samples and their nearest neighbours.
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

# ADASYN: like SMOTE, but generates more samples where the minority class is harder to learn.
X_resampled, y_resampled = ADASYN().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

# SMOTETomek: a hybrid method that over-samples with SMOTE, then removes Tomek links.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
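In practice these resamplers should only be applied to the training data, not to the data used for evaluation. One way to arrange that, shown here as a sketch with LogisticRegression as an arbitrary choice of classifier, is imbalanced-learn's Pipeline, which resamples inside each cross-validation fold:
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Resampling happens inside each training fold only, so the test folds keep
# their original class distribution.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())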
Under-sampling
The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset. This is referred to as random undersampling.
This technique balances the dataset by reducing the size of the over-represented class: it "consists of reducing the data by eliminating samples belonging to the majority class with the objective of equalizing the number of samples of each class".
NearMiss
The method calculates the distances between all instances of the majority class and the instances of the minority class. Then the k instances of the majority class with the smallest distances to the minority class are selected. If there are n instances in the minority class, the "nearest" method results in k*n instances of the majority class.
NearMiss implements three different heuristics, which can be selected with the version parameter:
NearMiss-1 (version=1) selects the positive samples for which the average distance to the N closest samples of the negative class is the smallest.
NearMiss-2 (version=2) selects the positive samples for which the average distance to the N farthest samples of the negative class is the smallest.
NearMiss-3 is a 2-step algorithm. First, for each negative sample, its M nearest neighbours are kept. Then, the positive samples selected are the ones for which the average distance to the N nearest neighbours is the largest.
from imblearn.under_sampling import NearMiss

# NearMiss-1: keep the majority samples that are closest (on average) to the minority class.
nm1 = NearMiss(version=1)
X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
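The other two variants can be selected the same way. This is a sketch using the same X and y; the n_neighbors_ver3 value is an arbitrary choice for illustration:
# NearMiss-2: use the N farthest minority samples when computing the average distance.
nm2 = NearMiss(version=2)
X_resampled_nm2, y_resampled_nm2 = nm2.fit_resample(X, y)

# NearMiss-3: first keep the M nearest neighbours of each minority sample
# (controlled by n_neighbors_ver3), then select the majority samples as described above.
nm3 = NearMiss(version=3, n_neighbors_ver3=3)
X_resampled_nm3, y_resampled_nm3 = nm3.fit_resample(X, y)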
RandomUnderSampler
This is a fast and easy way to balance the data by randomly selecting a subset of samples from the targeted majority classes.
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until all classes have equal counts.
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
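As with over-sampling, a sampling_strategy can be passed so the majority class is only partially reduced. A short sketch using the same X and y:
# Under-sample the majority class until the minority class is half its size
# (a 2:1 ratio), rather than fully balancing the classes.
undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_resampled, y_resampled = undersample.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))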
References
https://imbalanced-learn.readthedocs.io/en/stable/under_sampling.html