
A Comparative Study of Various Methods of Handling Imbalanced Datasets in Binary and Multiclass Classification

Introduction:-

Balance is of utmost importance in a person's life. It is this balance that keeps a person fit both physically and mentally, which in turn brings consistency in life and efficiency at work.

Similarly, in any ML/DL framework a balanced dataset is of tremendous importance, as most algorithms are designed to handle a properly balanced dataset. Now you must be wondering: what is an imbalanced dataset?

Imbalanced data typically refers to classification problems where the classes are not represented equally.

For eg:- You may have a binary classification problem with 100 data points where the frequency of one class is 80 and that of the other is just 20, hence a biased distribution of points.

A balanced dataset will lead to a consistent and accurate model when put in production. So in this blog I will show how we can perform a classification task (both binary and multi-class) on imbalanced datasets.

Now let me give you a general introduction to imbalanced classification:-

As we all know, classification in predictive modelling means predicting the class or category to which a data point belongs. Imbalanced classification is a classification problem where the distribution of data points across the known classes is highly biased or skewed. In other words, for a single example of category A there may be hundreds or thousands of examples of category B.

Now, most ML/DL models are designed assuming that the classes to be predicted occur with comparable frequency, i.e. they are approximately equal in number (say in a 1:2 or 2:3 ratio). So when these models are fitted on imbalanced data the result is poor predictive performance. In particular the minority class is predicted incorrectly, and it is often this class that matters most in classification problems.

Examples of Imbalanced Classification problem:-

  1. Fraud Detection.
  2. Claim Prediction.
  3. Default Prediction.
  4. Churn Prediction.
  5. Spam Detection.
  6. Anomaly Detection.
  7. Outlier Detection.

By now you might have understood what imbalanced classification is and why it matters, and are probably wondering what causes an imbalanced dataset.

So here they are:-

Causes of class imbalance:-

Apparently there are many causes behind the occurrence of an imbalanced dataset. Among them the two most dominant ones are:-

  1. Data Sampling
  2. Properties of Domain

Now let me briefly describe these two causes:-

Data Sampling:-

It is quite natural to get imbalanced data by incorrectly choosing the procedure by which data are collected from the problem domain. This might include biased sampling and measurement errors. For eg:- If data are collected from a narrow geographical region and varying slices of time, it might lead to a varying distribution of classes. The same situation might arise if we change the method of data collection according to region and time.

Properties of the Domain:-

It may happen that the natural occurrence of one class dominates the others. This may be because the process that generates observations in one class is more expensive in time, cost, computation or other resources. In such cases it is not enough to simply collect more samples of that class; rather the model should be trained to learn the difference between the classes.

Enough theoretical facts for now; let's dive into the programming.

Here I will first fit a tuned SVM classifier to get the baseline accuracy that we can derive from the imbalanced fraudulent transaction detection dataset.

Dataset Link:- https://www.kaggle.com/mlg-ulb/creditcardfraud

I have taken a fraction of the data from the original dataset for fast analysis.

Extracting the data from the working directory and visualizing the imbalanced classes.

#importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#reading data from directory
data=pd.read_csv('creditcard.csv')
data.head()

#creating an imbalanced dataset by taking a fraction of data from the original dataset
df=data.loc[data['Class']==1]
df1=data.loc[data['Class']==0]

#taking less fraction of data from majority class
df1=df1.sample(frac=0.00282)

#taking more fraction of data from minority class
df=df.sample(frac=0.204)

#concatenating both to make a new imbalanced dataset
data=pd.concat([df1,df],axis=0)

#to see the number of classes
data['Class'].value_counts()

#to see counts of classes as barplots
plot1=sns.countplot(x='Class',data=data)
plt.savefig('plot1.png')


Here 0:- Normal Transaction and 1:- Fraudulent Transaction.
Now we will fit the SVM model on our dataset to get a baseline accuracy.

#splitting into dependent and independent variables
X=data.iloc[:,:-1]
y=data.iloc[:,-1]

#splitting into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)

#tuning and fitting the model
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
params={'kernel':['linear','rbf','sigmoid'],'C':[1,10,100,1000],'gamma':[1,0.1,0.01,0.001]}
model=SVC(random_state=101)
random_search=RandomizedSearchCV(model,params,cv=4,scoring='accuracy',n_iter=4,n_jobs=-1,verbose=1)
random_search.fit(X_train,y_train)

#prediction and evaluation
predictions=random_search.best_estimator_.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(predictions,y_test))
print(classification_report(predictions,y_test))

So we get a baseline accuracy of 0.92. After seeing this you might think: wow, such a great model! But here comes the main catch with imbalanced datasets: we must also look at the precision and recall of the minority class to infer how accurate the model is while predicting minority classes.

For the minority class (class = 1) I got a precision of 0.48 and a recall of 0.89. This clearly shows that the precision with which the fraudulent cases are predicted is low compared to class 0, for which it is 0.99.

Now the confusion matrix:-

[[191  17]
 [  2  16]]

Now our target is to build a model that predicts the minority class more accurately. For this I am going to discuss several techniques by which it can be done.

Resampling Techniques:-

So before understanding the logic behind various oversampling and undersampling techniques let’s understand the goal behind the resampling procedures.

Resampling methods are designed to change the composition of the training dataset for an imbalanced classification task. Most of the techniques aim at oversampling the minority class, but there are also algorithms that focus on undersampling the majority class, and several others that combine both over- and undersampling.

Undersampling for Imbalanced classification:-

In this technique we remove data points from the majority class in the training set so as to bring its count closer to that of the minority class. This reduces the skew of the distribution, say from 1:1000 to 1:50, or from 1:10 to 1:2. It is the opposite of oversampling, where data points are added to the minority class of the training set.
Undersampling techniques can be applied directly on the training dataset, and the resampled dataset is then used for model fitting. Another approach is to combine both over- and undersampling to get more accurate predictions; I will discuss that later.
The library used for all resampling techniques here is "imbalanced-learn".
How to install:- pip install imbalanced-learn
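Before proceeding, you can verify the installation; note that although the package is named "imbalanced-learn", it is imported as imblearn:

#sanity check that imbalanced-learn is installed
import imblearn
print(imblearn.__version__)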

Various types of Under Sampling:-

  1. NearMiss Under Sampling
  2. Condensed Nearest Neighbors
  3. Tomek Links for Under Sampling
  4. Edited Nearest Neighbors

But here I will discuss only about NearMiss under sampling and its application on my data set.

For reference you can refer to this link :- https://arxiv.org/pdf/1608.06048.pdf

NearMiss Under Sampling:-

The driving concept of the NearMiss undersampling technique is nearest-neighbour distance: we select majority class data points based on their distance to minority class data points.

Now there are three versions of NearMiss sampling:-

  • Near Miss 1:- selects majority class examples with the minimum average distance to the 3 (by default) closest minority class examples.
  • Near Miss 2:- selects majority class examples with the minimum average distance to the 3 (by default) farthest minority class examples.
  • Near Miss 3:- selects majority class examples with the minimum distance to each minority class example.

(examples = data points)

Whenever we apply the NearMiss algorithm, Near Miss 1 is implemented by default. However, there is no fixed recipe for any particular problem; you have to perform trial and error to find the version that suits your data best.
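As a hedged illustration on toy data (generated with sklearn's make_classification, not our fraud dataset), the version parameter of imblearn's NearMiss is what switches between the three variants:

#minimal sketch: trying all three NearMiss versions on toy imbalanced data
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X_toy,y_toy=make_classification(n_samples=1000,weights=[0.9,0.1],random_state=42)
print('Before:',Counter(y_toy))
for version in (1,2,3):
    nm=NearMiss(version=version,n_neighbors=3)
    X_res,y_res=nm.fit_resample(X_toy,y_toy)
    print('NearMiss-%d:'%version,Counter(y_res))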

Advantages:-

  • Since the sample size of the majority class is decreased by undersampling, the training time and the complexity of the model are reduced to a large extent.

Disadvantages:-

  1. Since data points are removed, many important points may get deleted in the process, which decreases the predictive capability of the fitted model due to poor training.

Important attributes for coding:-

  • n_neighbors:- If int, the size of the neighbourhood to consider when computing the average distance to the minority samples. By default it is set to 3.
  • sampling_strategy:- It corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as \alpha_{us} = N_{m} / N_{rM}, where N_{m} is the number of samples in the minority class and N_{rM} is the number of samples in the majority class after resampling.
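For example (hypothetical counts): with N_{m} = 100 minority samples and 1,600 majority samples, setting sampling_strategy = 0.125 asks NearMiss to keep N_{rM} = N_{m} / \alpha_{us} = 100 / 0.125 = 800 majority samples after resampling.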

Now let's get our hands dirty by coding it in Python.
Since I have already imported the data earlier, I will start my code from the splitting of the data into train and test sets.

#perform train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)

#performing undersampling
from imblearn.under_sampling import NearMiss
nm=NearMiss(n_jobs=-1,sampling_strategy=0.125)
X_train_ns,y_train_ns=nm.fit_resample(X_train,y_train)

#fitting model on undersampled dataset
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(random_state=101)
classifier.fit(X_train_ns,y_train_ns)

#predicting and evaluating metrics
predictions=classifier.predict(X_test)
print(confusion_matrix(predictions,y_test))
print(classification_report(predictions,y_test))


Comment:-
By performing undersampling I got a precision of 0.85 and a recall of 1.00 for the minority class. This is quite a good improvement over the baseline as far as precision is concerned. It implies that the model is far more trustworthy than the previous one when it predicts that a particular data point belongs to the minority class.

The confusion matrix is:-

[[193   5]
 [  0  28]]

Comment:-
There is also a significant drop in fraudulent points misclassified as normal (17 down to 5) and a rise in correctly detected frauds, hence the accuracy increased from 0.92 to 0.98.
So we conclude that undersampling performed better than no sampling.
You can fine-tune the model on the resampled data points to see if you get a better precision or recall than mine.
Now that we have covered undersampling in detail, let's dive into our next resampling technique, the RandomOverSampler.

Random Over Sampler

The driving concept behind random oversampling is that it randomly duplicates examples from the minority class and adds them to the training dataset. Examples from the training dataset are selected randomly with replacement, which means the same minority example can be chosen and added to the new, more balanced training set multiple times: an example is picked from the old training set, added to the new training set, and then returned (replaced) to the original pool so it can be picked again.
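Here is a minimal sketch of that mechanism on hypothetical toy arrays (not the fraud dataset); since sampling is with replacement, the two minority rows are duplicated until the requested ratio is met:

#minimal sketch: random oversampling duplicates minority rows with replacement
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

X_toy=np.arange(10).reshape(-1,1)
y_toy=np.array([0]*8+[1]*2)
X_res,y_res=RandomOverSampler(random_state=0).fit_resample(X_toy,y_toy)
print(Counter(y_res))   #Counter({0: 8, 1: 8}): the minority class was duplicated up to 8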

This technique can work well with algorithms that are affected by skewed or duplicated distributions, such as artificial neural networks (ANNs) and stochastic gradient descent (SGD). Decision trees and SVMs are also a good fit for this algorithm, as they are based on splitting the data.
A good practice when applying this algorithm is to monitor the performance on both the train and test datasets after oversampling and compare the results with the original dataset.

Advantages:-

  • Since the number of data points in minority class increases there is a high possibility for the model to learn from the data points in minority class of training set.

Disadvantages:-

  • Uncontrolled oversampling can lead to more training time and model complexity thus diminishing the predictive capability of the model.

Important Coding Parameters:-

  • sampling_strategy:- When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling, i.e. \alpha_{os} = N_{rm} / N_{M}, where N_{rm} is the number of samples in the minority class after resampling and N_{M} is the number of samples in the majority class.
  • random_state:- Controls the randomization of the resampling so that results are reproducible.
Enough of theory; let's code it in Python.

#splitting dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)

#performing oversampling
from imblearn.over_sampling import RandomOverSampler
ros=RandomOverSampler(sampling_strategy=0.125,random_state=101)
X_train_os,y_train_os=ros.fit_resample(X_train,y_train)

#fitting model
classifier=RandomForestClassifier(random_state=101)
classifier.fit(X_train_os,y_train_os)

#predicting and evaluating metrics
from sklearn.metrics import classification_report,confusion_matrix
predictions=classifier.predict(X_test)
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))

Comment:-
By performing random oversampling I got a precision of 0.94 and a recall of 1.00 for the minority class. This is quite a good improvement over the previous undersampling as far as precision is concerned. It implies that the random oversampling model is even more trustworthy than the undersampling one when it predicts that a particular data point belongs to the minority class.

The confusion matrix is given by:-

[[193   2]
 [  0  31]]

Comment:-
There is a further drop in fraudulent points misclassified as normal (5 down to 2) and an increase in correctly classified points of both classes.
Hence we can conclude that the RandomOverSampler performed slightly better than the previous undersampling.
Now that we have understood and coded the RandomOverSampler, let's dive into the next and last technique, the Easy Ensemble Classifier.

Easy Ensemble Classifier

In this type of classifier a random subset of the majority class is sampled along with all the examples from the minority class. A model or weak learner is then fit on this dataset. The process is repeated multiple times, and the average prediction across all the models is used to give the final predictions.
One of the main advantages of this type of classifier is that, since multiple subsamples are generated from the majority class, there is less chance of losing valuable information through downsampling.

Now let me explain the concept of Easy Ensemble in detail.
The Easy Ensemble technique involves creating balanced samples of the training dataset by selecting all samples from the minority class and a subset of the majority class. In this algorithm boosted decision trees, specifically AdaBoost, are fit on each subset rather than pruned decision trees.

Now you must be thinking: what is AdaBoost?
AdaBoost is a boosting algorithm that works by fitting decision trees on the dataset sequentially. In this technique the misclassified observations are given more weight than the correctly classified ones, and the next decision tree is fitted to correct those misclassifications. This process is repeated multiple times to correct the errors.
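As a hedged sketch of AdaBoost on its own (toy data; sklearn's AdaBoostClassifier uses a depth-1 decision tree as its default weak learner), independent of any resampling:

#minimal sketch: plain AdaBoost on toy data
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X_toy,y_toy=make_classification(n_samples=500,random_state=0)
ada=AdaBoostClassifier(n_estimators=50,random_state=0)
ada.fit(X_toy,y_toy)
print('training accuracy:',ada.score(X_toy,y_toy))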
Advantages of Easy Ensemble:-

  • There is less chance of losing data, as multiple subsamples are generated from the majority class to perform the classification.

Disadvantages of Easy Ensemble:-

  • Since multiple decision trees are being fitted there can be a significant increase in the variance.

Important coding attributes:-

  • n_estimators:- Number of AdaBoost learners in the ensemble.
  • sampling_strategy:- When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as \alpha_{us} = N_{m} / N_{rM} where N_{m} is the number of samples in the minority class and N_{rM} is the number of samples in the majority class after resampling.
    Refer to:- https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.EasyEnsembleClassifier.html

Enough of theory; now it's coding time:-

#splitting into train and test data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)

#fitting the model
from imblearn.ensemble import EasyEnsembleClassifier
model=EasyEnsembleClassifier(n_estimators=500,sampling_strategy=0.125,random_state=101,n_jobs=-1)
model.fit(X_train,y_train)

#predicting and evaluating metrics
predictions=model.predict(X_test)
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))

Comment:-
With the Easy Ensemble Classifier I got a precision of 0.79 and a recall of 0.93 for the minority class. This is not an improvement over the previous under- and oversampling as far as precision and recall are concerned. Hence the RandomOverSampler remains the most trustworthy of the three when it predicts that a particular data point belongs to the minority class.

The confusion matrix is given by:-

[[191   7]
 [  2  26]]

Comment:-
There is a noticeable rise in misclassified points of both classes and a drop in correctly classified points, which brings the accuracy of the model down to 0.96.
Hence we can conclude that the Easy Ensemble Classifier performed better than the baseline model but not better than the previous resampling techniques.

With this I conclude my discussion on handling imbalanced binary classification problems. I hope I have been able to give a clear and concise idea of how it is done through this comparative study.
Now I will explain how to handle imbalanced data in the case of multi-class classification.

Multi Class Classification using imbalanced data

The only difference between binary and multi-class classification is that in the latter we need to categorize our data points into more than two classes.

There are various techniques to perform imbalanced multi-class classification, like the following (see the import sketch after this list):-

  • SVM SMOTE
  • Borderline SMOTE
  • ADASYN (Adaptive Synthetic Sampling)
  • SMOTE Tomek
  • SMOTE ENN
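All of these ship with the same imbalanced-learn library used above. As a quick orientation (class names as in the imblearn API), the corresponding imports are:

#where these techniques live in imbalanced-learn
from imblearn.over_sampling import SVMSMOTE,BorderlineSMOTE,ADASYN
from imblearn.combine import SMOTETomek,SMOTEENN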

Among all the above methods I will discuss SMOTE ENN, as I have applied it to my dataset.
So first let's understand the algorithm in detail.

SMOTE ENN (Synthetic Minority Oversampling Technique with Edited Nearest Neighbours)

SMOTE ENN is a hybrid of two imbalanced-data handling techniques, SMOTE and ENN.
So to understand SMOTE ENN, I will first try to give a clear idea of the two techniques (SMOTE and ENN) separately, and club them together at the end to conclude the discussion.

So let’s first understand the SMOTE Algorithm in detail.

SMOTE(Synthetic Minority Oversampling Technique)

As I discussed earlier, one effective way of handling an imbalanced dataset is to increase the number of data points in the minority class. The driving concept behind this algorithm is the creation of synthetic data points in the minority class to match the majority class. Unlike simple duplication, these synthetic points are new, plausible examples interpolated between existing minority points. This type of data augmentation for the minority class is referred to as SMOTE.

Algorithm:-
A single data point is selected from the minority class and its k nearest minority neighbours are identified. A random neighbour is then chosen among them, and a synthetic data point is generated at a random position between the selected point and that neighbour.
This procedure can generate as many synthetic minority data points as required. It is an effective approach because all the synthetic points are close in feature space to existing data points of the minority class.
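A hedged sketch of SMOTE on toy data (again make_classification, not the wine dataset), where k_neighbors is the k discussed above:

#minimal sketch: SMOTE synthesizing minority points on toy data
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_toy,y_toy=make_classification(n_samples=500,weights=[0.9,0.1],random_state=42)
print('Before:',Counter(y_toy))
X_res,y_res=SMOTE(k_neighbors=5,random_state=42).fit_resample(X_toy,y_toy)
print('After: ',Counter(y_res))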
So by now I think I have been able to give a crisp and concise idea of SMOTE. Now let's understand the ENN algorithm.

Edited Nearest Neighbor(ENN)

In this procedure we find ambiguous (misclassified) or noisy data points in the dataset.
Again, the driving concept of this algorithm is k-nearest neighbours. With k = 3, the 3 nearest neighbours of each data point are found and the point is checked against them: if it is misclassified by its neighbours, it is deleted.
When this technique is applied as an undersampling procedure it is applied to every data point of the majority class, so all misclassified majority points are removed and only the correctly classified points are kept.
It can also be applied to the minority class to delete misclassified minority points whose nearest neighbours lie in the majority class.
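A minimal sketch of ENN used this way on toy data, restricted to cleaning the majority class via sampling_strategy:

#minimal sketch: ENN removes majority points misclassified by their 3 nearest neighbours
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X_toy,y_toy=make_classification(n_samples=500,weights=[0.9,0.1],random_state=42)
enn_clean=EditedNearestNeighbours(n_neighbors=3,sampling_strategy='majority')
X_res,y_res=enn_clean.fit_resample(X_toy,y_toy)
print(Counter(y_toy),'->',Counter(y_res))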
There are various techniques which are hybrids of both over- and undersampling, like:-

  1. Condensed Nearest Neighbor + Tomek Links
  2. SMOTE + Tomek Links
  3. SMOTE +ENN

Now that I have discussed both SMOTE and ENN, it will be easier to understand their hybrid, SMOTE ENN.

SMOTE ENN

In SMOTE ENN we undersample the majority class and oversample the minority classes in order to balance the dataset, and the model is then fitted on the balanced data to get the required result. ENN is found to be more aggressive than Tomek Links and performs a more in-depth cleaning of the dataset.
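As a hedged illustration of that claim, the two hybrids can be compared side by side on toy data (both live in imblearn.combine); SMOTEENN typically ends up removing more points than SMOTETomek:

#minimal sketch: comparing the two hybrid resamplers on toy data
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek,SMOTEENN

X_toy,y_toy=make_classification(n_samples=500,weights=[0.9,0.1],random_state=42)
for sampler in (SMOTETomek(random_state=42),SMOTEENN(random_state=42)):
    X_res,y_res=sampler.fit_resample(X_toy,y_toy)
    print(type(sampler).__name__,Counter(y_res))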

Important coding parameters:-

  • sampling_strategy:- When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as \alpha_{os} = N_{rm} / N_{M} where N_{rm} is the number of samples in the minority class after resampling and N_{M} is the number of samples in the majority class.
  • smote:- The SMOTE configuration can be set as a SMOTE object via the "smote" argument, i.e. the imblearn.over_sampling.SMOTE object to use. If not given, a SMOTE object with default parameters is used. SMOTE defaults to balancing the distribution, followed by ENN, which by default removes misclassified examples from all classes.
  • enn:- The ENN configuration can be set via an EditedNearestNeighbours object through the "enn" argument. We can make ENN remove examples only from the majority class by passing an EditedNearestNeighbours instance with its sampling_strategy argument set to 'majority'.

So by now I have given a lot of theoretical concepts; now it's time to code it in Python.
Here I have applied three approaches, Random Forest, Logistic Regression and SMOTE ENN with Random Forest, to do a comparative study.
Dataset Link:- https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
Here I have recategorized the 6 wine quality scores into 3 categories:-
1 -> Bad Quality
2 -> Average Quality
3 -> Good Quality
Code for the Random Forest application:-

#reading file from directory
dat=pd.read_csv('winequality_red.csv')
dat.head()

#to see the number of categories
dat['quality'].value_counts()

# recategorizing the quality scores into the three review categories
review=[]
for i in dat['quality']:
    if i >= 1 and i <= 3:
        review.append(1)
    elif i >= 4 and i <= 6:
        review.append(2)
    elif i >= 7 and i <= 10:
        review.append(3)
dat['review']=review

#plotting categories frequency
import seaborn as sns
import matplotlib.pyplot as plt
plot=sns.countplot(x='review',data=dat)
plt.savefig('plot.png')

#to look for any missing values
dat.isnull().sum()

#splitting to dependent and independent set
X=dat.iloc[:,:-1]
y=dat.iloc[:,-1]

#performing train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=101)

#fitting a model to get the baseline accuracy
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(random_state=101)
model.fit(X_train,y_train)
predictions=model.predict(X_test)

#evaluation metrics
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))

The countplot above (saved as plot.png) shows the imbalanced categories.
Here the minority classes are 1 and 3, while class 2 is the majority class.
Comment:-
After applying random forest I got the following result:-

               precision    recall    f1-score
1                0.00         0.00       0.00
2                1.00         0.98       0.99
3                0.25         0.50       0.33

So from the above table it is clear that class 2 is predicted with maximum accuracy, followed by class 3, which has a much lower accuracy than class 2. Class 1 is not predicted at all.
Hence the model is not capable of identifying wines of bad quality (class 1), as its recall and precision are zero due to insufficient training of the model.
The confusion matrix is given by:-

[[  0    0    0]
 [  5  390    3]
 [  0    1    1]]
From the confusion matrix it is clear that the accuracy of 0.98 is achieved only because of the average-quality wines. That is not good enough, since the model shows below-average performance when wines of good (class 3) and bad (class 1) quality are to be predicted.

Now let’s apply Logistic Regression to see if there is any improvement in the prediction of minority classes.

#splitting into test and train
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=101)

#scaling the data points
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

#fitting the model
from sklearn.linear_model import LogisticRegression
model1=LogisticRegression(class_weight='balanced',solver='sag',penalty='l2',multi_class='multinomial',
                          random_state=101,C=2.0,max_iter=1500)
model1.fit(X_train,y_train)

#predicting by this model
predictions=model1.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))


After applying Logistic Regression I got the following result:-

               precision    recall    f1-score
1                0.67         0.33       0.44
2                0.77         0.93       0.85
3                0.64         0.32       0.43

So from the above table it is clear that class 2 is predicted with maximum accuracy, followed by class 3 and class 1.
Hence the model is not completely capable of identifying wines of bad (class 1) and good (class 3) quality, as their recall and precision are much lower due to insufficient training of the model.
But the notable change with the Logistic Regression model is that the previous model was not able to detect class 1 wines at all, whereas Logistic Regression detects them, though at the cost of overall model accuracy.
The confusion matrix is given by:-

[[  2    4    0]
 [  1  213   15]
 [  0   58   27]]

From the confusion matrix it is clear that an accuracy of 0.76 is achieved. This is not good enough if we compare it with the previous model in terms of accuracy, but the interesting fact is that this model predicts classes 1 and 3 better than the Random Forest model.
Now I will try the SMOTE ENN sampling technique, the last on my list.

#performing train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=101)

#applying SMOTEENN
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
enn = SMOTEENN(random_state=101,smote=SMOTE(),enn=EditedNearestNeighbours(sampling_strategy='majority'))
X_res, y_res =enn.fit_resample(X_train,y_train)

#to see the oversampled values
from collections import Counter
print(sorted(Counter(y_res).items()))

#fitting model
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(random_state=101)
model.fit(X_res,y_res)

#predicting by model
predictions=model.predict(X_test)
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))

#tuning the model
n_estimators=[int(i) for i in np.linspace(100,1000,10)]
max_features=['auto','sqrt','log2']
max_depth=[int(i) for i in np.linspace(10,1000,10)]
min_samples_split=[2,5,10,14]
min_samples_leaf=[1,2,4,6,8]
params={'n_estimators':n_estimators,'max_features':max_features,'max_depth':max_depth,'min_samples_split':min_samples_split
       ,'min_samples_leaf':min_samples_leaf}
from sklearn.model_selection import RandomizedSearchCV
random_search=RandomizedSearchCV(model,params,n_iter=5,scoring='accuracy',n_jobs=-1,cv=5,verbose=1)
random_search.fit(X_res,y_res)

#predicting by fitted model
predictions=random_search.predict(X_test)
print(classification_report(predictions,y_test))
print(confusion_matrix(predictions,y_test))

After applying SMOTE ENN I got the following result:-

               precision    recall    f1-score
1                0.00         0.00       0.00
2                0.89         0.93       0.91
3                0.57         0.44       0.50

So from the above table it is clear that class 2 is predicted with maximum accuracy, followed by class 3, which has a much lower accuracy. Class 1 is not predicted at all.
Hence the model is not capable of identifying wines of bad quality (class 1), as its recall and precision are zero due to insufficient training of the model.

The confusion matrix is given by:-

[[  0    3    0]
 [  3  242   18]
 [  0   30   24]]

From the confusion matrix it is clear that an accuracy of 0.83 is achieved, which is better than Logistic Regression but worse than the baseline Random Forest model. The interesting fact is that although it beats Logistic Regression on accuracy, it could not predict classes 1 and 3 together the way Logistic Regression does.

So with this I have come to the end of my blog. Here I have tried to cover various techniques that are used in handling imbalanced datasets in both binary and multi-class classification.

  • For binary classification I can clearly conclude that the RandomOverSampler performed better than the other two methods.

  • But in the case of multi-class classification there is no conclusive evidence for me to decide which technique best suits the data, since with imbalanced datasets accuracy is not our only criterion. We also need to see how well the model predicts each class, which in turn depends on the precision and recall of the individual classes.
    So taking this into consideration, it is up to the domain expert and the client to discuss and decide what is of more importance to them (i.e. which class's prediction is most important) and choose the model accordingly.

Conclusion:-


So by now it is clear to us how important it is to balance the dataset before interpreting our model in a data science project.
Similarly, in our lives we also need a work-life balance to perform efficiently at work and look after our family, to lead a good life altogether.
