
Comparative Study of Automatic Outlier Detection Methods

 

Introduction:-

Nowadays, Machine Learning and Deep Learning are used in almost every domain to extract valuable insights from data, which in turn helps companies make better decisions. Before taking these decisions, we apply various types of algorithms to the data set and then choose the one that suits us best.

Now imagine a situation where the algorithm you implemented gives inconsistent or misleading results when put into production, even though you have pre-processed the data, fine-tuned your model, and done all the necessary things required in the life cycle of a Data Science project. Then where did it go wrong?

This is where outliers, or anomalous observations, come into play. The treatment of outlying observations plays a very crucial role in every data science project: by treating them properly, we can improve the efficiency of a model by reducing training time and increasing accuracy, precision and recall. So by now, this question has surely come to your mind:

What is an outlier?

Outliers are observations that are exceptionally far from the mainstream of data.

There is no specific or definite way to identify outliers in general, because it is dataset specific and mainly decided by a domain expert.

So now the question arises: how can we identify the outliers?

The process of identifying outliers is called anomaly detection, outlier modelling or novelty detection. There are three main ways to detect outliers:

  1. Standard Deviation method
  2. Interquartile Range method
  3. Automatic Outlier Detection
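
For quick context, here is a minimal sketch (my own illustration on toy data, not part of this blog's workflow) of how the first two approaches are typically applied:

#Toy sketch of the standard deviation and IQR methods (illustrative only)
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 200), [120.0])   # one injected outlier

#1. Standard deviation method: flag points more than 3 sigma from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

#2. Interquartile range method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])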

But here in this blog, I will discuss Automatic Outlier Detection methods only.

Before diving deep into the various techniques of automatic outlier detection, let me introduce the families of models that outlier detection methods are based on.

  • Extreme Value Analysis:-
    For example, statistical methods like z-scores on univariate data.
  • Probabilistic and statistical models:-
    For example, Gaussian mixture models optimized using expectation-maximization.
  • Linear Models:-
    For example, principal component analysis, where data points with large residual errors may be outliers.
  • Proximity-based models:-
    For example, Local Outlier Factor, where data points isolated from the mass of the data, as determined by local density, are flagged.
  • High-dimensional outlier detection models:-
    For example, Minimum Covariance Determinant, which searches subspaces for outliers and addresses the breakdown of distance-based measures in higher dimensions.
  • Tree-based models:-
    For example, the Isolation Forest algorithm, which detects outliers based on the formation of decision trees and distinguishes anomalies by an anomaly score assigned to each data instance.
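
As a quick taste of one family that is not revisited later in this blog, here is a minimal sketch (my own illustration on toy data) using scikit-learn's EllipticEnvelope, which fits a robust Minimum Covariance Determinant estimate and flags points whose Mahalanobis distance from the fitted Gaussian is too large:

#Toy sketch of a Minimum Covariance Determinant based detector (illustrative only)
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(100, 2)), [[7.0, 7.0]]])

labels = EllipticEnvelope(contamination=0.01, random_state=0).fit_predict(X_toy)   # -1 = outlier
print(np.where(labels == -1)[0])   # the injected point (index 100) should be flagged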

Now that we have basic knowledge of what an outlier is, how it is detected and how it affects the results of a project, let's dive into the most interesting part of the blog.

Implementation in Python:-

I will do a comparative study of various Automatic Outlier Detection techniques on a medical use case.

I will cover the following topics in my blog:-

  • Isolation Forest Algorithm
  • Local Outlier Factor
  • One Class SVM

Dataset link:-http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Objective:-

Here we need to predict whether a patient will die of heart failure or not, based on various attributes like age, diabetes, sex, platelets etc.

Steps followed:-

First, I will apply a Random Forest Classifier on the dataset, perform hyperparameter tuning and get the cross-validated accuracy. Then I will compare this accuracy with the accuracy obtained after applying each outlier detection technique.

Let’s import the necessary libraries and dataset to be used for this project

#Import necessary libraries
import pandas as pd
import numpy as np
#read the data from working directory
data=pd.read_csv('heart_failure_clinical_records_dataset.csv')
data.head()

Now we do the train test split along with the model building part

#Getting dependent and independent variables
X=data.iloc[:,:-1]
y=data['DEATH_EVENT']
#splitting into training set and testing set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)


#fitting a model to get baseline accuracy
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(random_state=101)
model.fit(X_train,y_train)
#predicting on test set
predictions=model.predict(X_test)

#create a classification report
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions));

Fine tuning the model

#creating dictionary of parameters for tuning
n_estimators=[int(i) for i in np.linspace(200,2000,10)]
max_features=['auto','sqrt','log2']
max_depth=[int(i) for i in np.linspace(10,1000,10)]
min_samples_split=[2,5,10,14]
min_samples_leaf=[1,2,4,6,8]
params={'n_estimators':n_estimators,'max_features':max_features,'max_depth':max_depth,'min_samples_split':min_samples_split
       ,'min_samples_leaf':min_samples_leaf}
       
#performing the random_search       
from sklearn.model_selection import RandomizedSearchCV
random_search=RandomizedSearchCV(model,params,n_iter=5,scoring='accuracy',n_jobs=-1,cv=5,verbose=1)
random_search.fit(X,y)

#to get the fine tuned parameters
random_search.best_params_
#to get the fine tuned model
random_search.best_estimator_

#to get the cross validated score by performing k-fold CV using the fine tuned model to get baseline accuracy
from sklearn.model_selection import cross_val_score
score=cross_val_score(random_search.best_estimator_,X,y,cv=10,scoring='accuracy')
print(f'The accuracy of the model is {np.mean(score)}');

So after implementing this model I got the baseline accuracy as 0.7757471264367817.

Now we will compare the above model with models fitted after removing the anomalous points.

First, let’s understand how the above 3 automatic outlier detection algorithms work.

1. Isolation Forest Algorithm

As the name suggests, this algorithm is based on the concept of isolating anomalous data points from the genuine ones in order to classify them as anomalies.

The driving concept behind this algorithm is that anomalous points are few in number and different, so they are more prone to getting isolated than genuine points. This method is algorithmically different from, and more efficient than, the other existing methods. It is efficient because it can be used for datasets with many dimensions without worrying about the curse of dimensionality, and it is algorithmically different because it applies an isolation technique to detect anomalies rather than the commonly used distance and density measures.
 
One of the main advantages of this algorithm is its low, linear time complexity and small memory requirement.

Core principle:-

In this algorithm, decision trees are created by selecting random attributes to find anomalous points. When several trees are formed and combined, they form an isolation forest (just like a random forest), hence the name.

The random partitioning of the dataset creates shorter paths for anomalous data points since:-

  1. Anomalies are distinctly fewer in number, which leads to smaller partitions.
  2. Distinguishable data points are prone to being separated early in the partitioning process.

This process is repeated a fixed number of times and the anomaly score is noted at each isolation level for each data point. After all the iterations are completed, we generate an anomaly score for each data point and decide on that basis whether it is anomalous or not.

Now you must be thinking what is this anomaly score?

So the anomaly score is a function of the average depth at which the point gets isolated. The top ‘k’ points ranked by this score are labelled as anomalies.
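
For reference, scikit-learn exposes this score on a fitted IsolationForest through score_samples (the lower the score, the easier the point was to isolate, i.e. the more anomalous it is). A small sketch on toy data, my own illustration rather than the heart-failure data used later:

#Peeking at the anomaly scores of an IsolationForest (illustrative only)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense inlier cloud
                   [[8.0, 8.0], [-7.0, 9.0]]])        # two obvious outliers

iso = IsolationForest(random_state=101).fit(X_toy)
scores = iso.score_samples(X_toy)          # lower (more negative) = more anomalous
print(X_toy[np.argsort(scores)[:2]])       # the two injected points should surface here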

So for your better understanding let me provide you with a pseudo code for the algorithm.

Steps:-

  1. Select a particular data point that you want to isolate.
  2. For every feature of the data set, set a range between the minimum and maximum of that feature.
  3. Now select a feature at random.
  4. Here comes the iterative step:
    a) Select a value within the range decided in step 2. If the data point selected in the first step is larger than the chosen value, set the minimum of the range to that value.
    b) If the data point is less than the selected value, set the maximum of the range to that value.
    c) Repeat steps 3 and 4 until the point selected in step 1 is isolated, i.e. until it is the one and only point which lies inside the selected range for all the features of the data set.
  5. The number of times we have to repeat steps 3 and 4 is called the isolation number.

Now, if a point is an outlier, its isolation number will be small (as it gets isolated easily). Since the algorithm uses random numbers to select features and split values, the procedure is repeated several times and the final isolation number, on the basis of which the decision is taken, is a combination of all the individual isolation numbers.
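
To make the pseudo code above concrete, here is a rough sketch of the isolation procedure for a single point. This is my own illustrative toy implementation of the idea (names like isolation_number are mine), not the library's actual code:

#Toy implementation of the isolation idea described above (illustrative only)
import numpy as np

def isolation_number(X, idx, rng):
    """Number of random splits needed to isolate row `idx` of X."""
    lo, hi = X.min(axis=0).astype(float), X.max(axis=0).astype(float)
    target = X[idx]
    inside = np.arange(len(X))        # indices of points still inside the current box
    splits = 0
    while len(inside) > 1:
        f = rng.integers(X.shape[1])                  # step 3: pick a random feature
        split = rng.uniform(lo[f], hi[f])             # step 4a: random value in the current range
        if target[f] > split:                         # keep the half that contains our point
            lo[f] = split
            inside = inside[X[inside, f] > split]
        else:
            hi[f] = split
            inside = inside[X[inside, f] <= split]
        splits += 1
    return splits

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])

#The obvious outlier (last row) should need far fewer splits on average
print(np.mean([isolation_number(X_toy, 200, rng) for _ in range(50)]))
print(np.mean([isolation_number(X_toy, 0, rng) for _ in range(50)]))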

I think I have been able to give an outline of what isolation forest is and how the algorithm works.

For further studies refer to:-https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest

Before diving deep into the coding, let's look at some important parameters of the Isolation Forest algorithm.

Parameters:-

  • contamination = The proportion of outliers in the data set (in my code snippet I have run the code for different values).
  • n_estimators = The number of base estimators in the ensemble.
  • max_features = The number of features sampled from X for training each estimator.
  • max_samples = The number of instances sampled from X to train each estimator.

I will use the same fine tuned Random Forest classifier that I have used previously.

Let’s do some  coding now!!

Python code for Isolation Forest Algorithm

After applying the Isolation Forest model, the anomalous points will be categorized as -1 and inliers are categorized as 1.

from sklearn.ensemble import IsolationForest
# list of contamination ratios to check
list1=[0.01,0.02,0.03,0.04,0.05,0.1,0.2]
scores=[]
for i in list1:  
    X=data.iloc[:,:-1]
    y=data['DEATH_EVENT']
# identify outliers in the training dataset
    iso = IsolationForest(contamination=i,random_state=101)
    yhat = iso.fit_predict(X)
    mask=pd.DataFrame(yhat)

#Concating with original data
    X=pd.concat([X,mask],axis=1)
    y=pd.concat([y,mask],axis=1)

#Naming the new column as anomaly
    X.columns=['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time','anomaly']
    y.columns=['DEATH_EVENT','anomaly']
      
#Selecting only the non-anomalous rows(rows with anomaly value equal to 1)
    X=X.loc[X['anomaly']==1,X.columns[:-1]]
    y=y.loc[y['anomaly']==1,'DEATH_EVENT']
#model fitting
    score=cross_val_score(random_search.best_estimator_,X,y,scoring='accuracy',cv=10,n_jobs=-1)
    scores.append(np.mean(score))
print(scores)    
print(f'The maximum accuracy attained by the model is {max(scores)}');

After running the above  code I got the following output:-

[0.7929885057471265, 0.7819540229885057, 0.7758620689655172, 0.783743842364532, 0.7852216748768474, 0.7843304843304844, 0.7405797101449274]

The maximum accuracy attained by the model is 0.7929885057471265.

So we conclude that the model attains its maximum accuracy for an outlier fraction of 0.01.
Clearly, the accuracy of the model has increased by about 1.7 percentage points after removing the anomalous points flagged by Isolation Forest.

So by this we have come to the end of Isolation Forest Algorithm.

Now let’s understand the second technique, called the Local Outlier Factor.

Local Outlier Factor

Introduction:-

Local Outlier Factor (LOF) is an unsupervised outlier detection technique based on the concept of k-nearest neighbors. Let me briefly explain the intuition behind LOF.

LOF is a density-based outlier detection technique derived from concepts used in DBSCAN. The intuition behind the approach is that the density around an outlying point will be significantly different from the density around its neighbors. LOF is a float value which tells us how likely it is for a data point to be considered an outlier.

Suppose we consider point C as the centre of a cloud of data points. According to a distance-based measure, points O3 and O4 would both be termed outliers, whereas in reality O4 is not an outlier but O3 is.

Now consider points O1 and O2: they are termed outliers based on their local neighborhood, but when the global neighborhood is taken into consideration they are not outliers.

There is a deciding criterion for LOF:-

  • LOF >> 1:- lower density than its neighbors, hence an outlier
  • LOF < 1:- higher density than its neighbors, hence not an outlier
  • LOF~1:- similar density as its neighbors

Now let me introduce to you how the algorithm works and how the LOF for each data point is calculated by defining some important terms:-

Important Terms:-

  • k:- It is the number of neighboring points the LOF algorithm takes into consideration. The LOF algorithm takes into account the density of a particular point and compares it with the density of other points. Getting the right value of k is not an easy task, because for a very small value of k more importance is given to local points and hence there is a chance of missing out on global outliers. The problem intensifies when there is a large amount of noise in the dataset.

The red point’s kth-distance is illustrated by the red circle when k = 3.

  • kth-distance:- The kth-distance is the distance between the point and its kth neighbor. By kth neighbor I mean the point which comes in kth place (third place for k = 3) when we rank all the points in ascending order according to their distance from the specific point.

  • Reachability Distance:- This is where the kth-distance comes into play: it is used to calculate the reachability distance, which is the maximum of the kth-distance of O’ and the actual distance between the two points (O, O’).

          reach-dist(O,O’) = max{kth-dist(O’), dist(O,O’)}

If O is far away from O’ then the reachability distance is simply the distance between the two points.

If O is sufficiently close to O’ then the actual distance is replaced by the kth-distance of O’. This is done to significantly reduce the statistical fluctuations of dist(O,O’) for all O’ close to O. The strength of this smoothing effect is controlled by the parameter ‘k’: the higher the value of ‘k’, the more similar the reachability distances for objects within the same neighborhood.

  • Local Reachability Density (LRD):- It is measured with the help of the reachability distance. It gives us an idea of how dense the set of points is with respect to a given point, say ‘O’: lower values mean the points are less dense, higher values indicate greater density. LRD is nothing but the inverse of the average of the reachability distances to all of its k-nearest neighbors, LRD(O) = ( ∑ reach-dist(O,O’) / k )^(-1). So to sum up the whole concept, LRD tells us how far we have to travel from a point to reach the next point or cluster of points. As an example, consider a point at (5,5) whose three nearest neighbors are (2,2), (1.5,1.5) and (2,1): its lrd is the average reachability distance to these neighbors, while the neighbors themselves have different lrd’s because their own nearest neighbors don’t include (5,5).
  • Local Outlier Factor (LOF):- Now we have finally reached our goal, i.e. the quantity we were trying to find. Suppose we have four points A, B, C, D and we need to find LOF(A), where we consider B, C, D as the 3 nearest points (i.e. k = 3).

                                      LOF(A) = (1/3)*(LRD(B)+LRD(C)+LRD(D))*(LRD(A))^(-1)

LOF is basically the average of the ratio of the local reachability densities of A’s nearest neighbors to the local reachability density of A itself. It follows that the lower the value of LRD(A), the higher LOF(A) is, and hence the greater the chance of treating ‘A’ as an outlier, and vice versa.
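
To connect this to the library used below: after fitting, scikit-learn’s LocalOutlierFactor stores the negative of each training point’s LOF in its negative_outlier_factor_ attribute, so values far below -1 correspond to LOF >> 1. A small sketch on toy data (my own illustration):

#Looking at raw LOF values with scikit-learn (illustrative only)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense cluster
                   [[6.0, 6.0]]])                     # isolated point

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_toy)

lof_values = -lof.negative_outlier_factor_    # back to the LOF scale
print(lof_values[-1])                         # isolated point: LOF well above 1
print(lof_values[:100].mean())                # cluster points: LOF close to 1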

Disadvantages:-

1) Scalability is a big issue, as the computations are memory-intensive and hence a good computing infrastructure is required for bigger datasets.

2) The user has to provide the value of ‘k’ themselves for the algorithm to perform efficiently.

Overcoming these drawbacks:-

1) The drawbacks of LOF can be overcome using some of the latest advancements in algorithms and Spark- and Hadoop-based infrastructure, like DILOF (Effective and Memory Efficient Local Outlier Detection in Data Streams).

2) Scalable top-n local outlier detection uses pruning to eliminate most points from the set of potential outlier candidates without computing LOF or even calculating the k nearest neighbors.

So with this we have come to end of another method of detecting outliers.

For further studies refer:-https://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf

Important model parameters for coding:-

  1. n_neighbors:- This is nothing but the value of k which I mentioned previously. It is used to find the LRD, reachability distance and kth-distance. It should be chosen carefully for proper training of the model.
  2. contamination:- It is the proportion of outliers present in the dataset.

You can tune these hyperparameters to obtain a fine-tuned model. In the code below I have shown it for various contamination values.

Python code for Local Outlier Factor(LOF)

After applying the LOF model, the anomalous points will be categorized as -1 and inliers are categorized as 1.

from sklearn.neighbors import LocalOutlierFactor
list1=[0.01,0.02,0.03,0.04,0.05,0.1,0.2]
scores=[]
for i in list1:
    X=data.iloc[:,:-1]
    y=data['DEATH_EVENT']
    lof = LocalOutlierFactor(contamination=i)
    yhat = lof.fit_predict(X)
    mask=pd.DataFrame(yhat)

#Concating with original data
    X=pd.concat([X,mask],axis=1)
    y=pd.concat([y,mask],axis=1)

#Naming the new column as anomaly
    X.columns=['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time','anomaly']
    y.columns=['DEATH_EVENT','anomaly']
      
#Selecting only the non-anomalous rows(rows with anomaly value equal to 1)
    X=X.loc[X['anomaly']==1,X.columns[:-1]]
    y=y.loc[y['anomaly']==1,'DEATH_EVENT']
#model fitting
    score=cross_val_score(random_search.best_estimator_,X,y,scoring='accuracy',cv=10)
    scores.append(np.mean(score))
print(scores)
print(f'The maximum accuracy attained by the model is {max(scores)}');

The scores obtained are:-

[0.7862068965517242, 0.7820689655172414, 0.7793103448275863, 0.7798029556650248, 0.7923645320197045, 0.7951566951566952, 0.7820652173913043]

The maximum accuracy attained by the model is 0.7951566951566952, for an outlier fraction of 0.1.

Thus we see that by applying LocalOutlierFactor also we have been able to get an accuracy which is better than the baseline one.

Though LOF gave a slightly better accuracy than Isolation Forest, the improvement is not significant.
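
One experiment I have not run here, but which you might try, is to sweep n_neighbors alongside the contamination ratio. A rough sketch of how that could look, reusing the data, cross_val_score and random_search.best_estimator_ objects defined earlier (an assumption of this sketch, not code from the blog):

#Possible extension: tune n_neighbors together with contamination (not run here)
from sklearn.neighbors import LocalOutlierFactor

results = {}
for k in [10, 20, 30, 50]:
    for c in [0.01, 0.05, 0.1]:
        X = data.iloc[:, :-1]
        y = data['DEATH_EVENT']
        keep = LocalOutlierFactor(n_neighbors=k, contamination=c).fit_predict(X) == 1
        score = cross_val_score(random_search.best_estimator_,
                                X[keep], y[keep], scoring='accuracy', cv=10)
        results[(k, c)] = score.mean()
print(max(results, key=results.get), max(results.values()))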

So now we have come to the last outlier detection technique of this blog called the One Class SVM.

 

One-Class SVM (OC-SVM)

Introduction:-

As the name suggests, it is a subclass of Support Vector Machines (SVM). But unlike a normal SVM, OC-SVM is an unsupervised machine learning technique. So before diving deep into the algorithm, let me give you an idea of what an SVM is and how it works.

SVM is one of the most popular and efficient ML algorithms. It is a supervised ML algorithm used for both classification and regression.

SVM carries out the classification procedure by creating a separating hyperplane in the feature space. It is used for both binary and multi-class classification and can handle both continuous and categorical values. The most interesting feature of SVM is the kernel trick: in most cases the data points are not linearly separable, so it is not possible to separate them with a linear boundary in the original space.

That is where the kernel trick comes into play: a non-linear decision boundary is created by projecting the data points through a non-linear function φ (the kernel function) into a higher-dimensional space. This makes the separation of data points into their respective classes possible.
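
As a tiny illustration of this (my own sketch, separate from the main workflow): on concentric circles a linear SVM cannot separate the two classes, while an RBF kernel implicitly maps the points to a space where it can.

#Toy sketch of the kernel trick with a supervised SVM (illustrative only)
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_toy, y_toy = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

for kern in ['linear', 'rbf']:
    clf = SVC(kernel=kern).fit(X_toy, y_toy)
    print(kern, clf.score(X_toy, y_toy))   # linear stays near 0.5, rbf gets close to 1.0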

Now let me give you an overview of what one-class classification is and how the algorithm works.

Analogy behind OC-SVM:-

In one-class classification, data is sampled from only one class, i.e. the target class, while there is little or no data from the other class, also called the outlier class, hence the name. It is an unsupervised ML algorithm.

The basic difference between One-Class SVM and conventional SVM is that in the former we assume that information about only one class is available to us, whereas in the latter we have well-defined binary or multi-class labels available. Thus in OC-SVM the boundary, or the hyperplane between the two classes, needs to be constructed on the basis of a single class only (the target class).

(+1 class = inliers, -1 class = outliers)

The idea is to define a boundary around the target class such that it accepts as many target data points as possible and leaves out the anomalous or outlying points.
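
As a small sketch of that idea (my own toy example, separate from the heart-failure code later), scikit-learn’s OneClassSVM learns such a boundary from the target class alone, with nu bounding the fraction of training points allowed to fall outside it:

#Toy sketch of a one-class boundary around a single target class (illustrative only)
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
target = rng.normal(0, 1, size=(200, 2))      # the single "target" class

ocs = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(target)

#+1 = accepted as target class, -1 = rejected as an outlier
print(ocs.predict(np.array([[0.0, 0.5],       # inside the cloud: expect +1
                            [6.0, 6.0]])))    # far away from it: expect -1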

There are three simple methods that are widely used for the task of one-class classification, namely:-

  1. k-means clustering
  2. Support Vector Domain Description
  3. k-center method

In Support Vector Data Description a spherically shaped decision boundary is constructed around the data points by the support vectors. By applying the kernel trick the model becomes more flexible, and mapping the data to a new feature space becomes possible without much computational cost. Here, instead of the target density, the distance to the centre of the hypersphere is used.

In the k-means or k-center method the data points are clustered, and the distances to the nearest cluster centres are calculated and compared to categorize the normal and anomalous points.

OC-SVM can be used to detect outliers for both classification and regression problems.

Difference between SVM and One Class SVM

  • OC-SVM contains data only about the target class, whereas SVM contains data about two or more classes.
  • OC-SVM’s goal is to create a description of one type of object, the target class, and distinguish it from outliers, whereas SVM’s aim is to create a hyperplane with maximum margin between the data points.
  • In OC-SVM the decision boundary is formed in all directions in the feature space around the target class, whereas in SVM a hyperplane is created between data points to separate them.
  • OC-SVM is used only for outlier detection, but SVM is used for both classification and regression.
  • OC-SVM requires fewer parameters and less training data, whereas SVM requires more training data and more parameters.

For further studies refer to:-http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5565&rep=rep1&type=pdf#:~:text=The%20primary%20difference%20between%20a,training%20of%20a%20classification%20functions.&text=The%20one%2Dclass%20SVM%20approach,to%20learn%20a%20decision%20function.

Important attributes for coding:-

  • kernel:- The kernel (rbf, linear, poly, sigmoid) to be used for detecting outliers.
  • nu:- An upper bound on the fraction of training errors (the outlier fraction) and a lower bound on the fraction of support vectors.

One-Class SVM in Python:-

After applying the OC-SVM model the anomalous points will be categorized as -1 and inliers are categorized as 1.

Here I will run the model for various values of nu and find out the nu for which the classifier attains maximum accuracy.

from sklearn.svm import OneClassSVM
list1=[0.01,0.02,0.03,0.04,0.05,0.1,0.2]
scores=[]
for i in list1:
    X=data.iloc[:,:-1]
    y=data['DEATH_EVENT']
    ocs = OneClassSVM(nu=i)
    yhat = ocs.fit_predict(X)
    mask=pd.DataFrame(yhat)

#Concating with original data
    X=pd.concat([X,mask],axis=1)
    y=pd.concat([y,mask],axis=1)

#Naming the new column as anomaly
    X.columns=['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time','anomaly']
    y.columns=['DEATH_EVENT','anomaly']
      
#Selecting only the non-anomalous rows(rows with anomaly value equal to 1)
    X=X.loc[X['anomaly']==1,X.columns[:-1]]
    y=y.loc[y['anomaly']==1,'DEATH_EVENT']
#model fitting
    score=cross_val_score(random_search.best_estimator_,X,y,scoring='accuracy',cv=10)
    scores.append(np.mean(score))
print(scores)
print(f'The maximum accuracy attained by the model is {max(scores)}');

The accuracies obtained are

[0.816931216931217, 0.7883004926108375, 0.7959359605911331, 0.8100985221674877, 0.799384236453202, 0.7951566951566952, 0.7655797101449275]

The maximum accuracy attained by the model is 0.816931216931217 for an outlier fraction of 0.01.

So there is a 4% increase in the accuracy of the model as compared with the baseline model.

Conclusion

So with this I have completed all the facts and implementation that I wanted to share on Automatic Outlier Detection methods. I found that among the 3 outlier detection algorithms, One-Class SVM attains the maximum accuracy of 81.69%, followed by Local Outlier Factor (LOF) with 79.51% accuracy and Isolation Forest with 79.29% accuracy.

So One-Class SVM outclasses the other two algorithms here by flagging the outliers more effectively, which led to the increase in the accuracy of the model. But this doesn’t mean that this algorithm will always outperform the other two; it is completely dataset specific. So you need to try all the algorithms to see which suits the data best. One thing I will recommend is to fine-tune the outlier detection model and see if you can get more accuracy than I did.

Lastly, I would like to say that outlier detection plays an integral role in any data science project so it needs to be handled properly. Just like washing our hands properly from time to time has become a part of our lifestyle amidst this pandemic to stay healthy. Similarly, checking, understanding and removing outliers from time to time should also be incorporated in every data science project to get reliable and consistent results.
