# Introduction:-

Nowadays, machine learning and deep learning are used in almost every domain to extract valuable insights from data, which in turn helps companies make better decisions. Before making these decisions, we apply several algorithms to the dataset and then choose the one that suits the problem best.

Now imagine a situation where the algorithm you implemented gives inconsistent or misleading results in production, even though you pre-processed the data, fine-tuned your model, and did everything required in the life cycle of a data science project. So where did it go wrong?

This is where **outliers**, or **anomalous observations**, come into play. Treating outlying observations plays a crucial role in every data science project: by handling them properly, we can improve the efficiency of a model by reducing training time and increasing accuracy, precision and recall. So by now, this question has surely come to your mind:

### What is an outlier?

Outliers are observations that are exceptionally far from the mainstream of data.

There is no single, definite way to identify outliers in general, because it is dataset-specific and usually decided by a domain expert.

### So now the question arises: how can we identify outliers?

The process of identifying outliers is called anomaly detection, outlier modelling or novelty detection. The main ways to detect outliers are:

- **Standard Deviation method**
- **Interquartile Range method**
- **Automatic Outlier Detection**

But here in this blog, I will discuss **Automatic Outlier Detection** methods only.
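Although this blog covers only the automatic methods, the two simpler techniques listed above are easy to sketch. Below is a minimal, illustrative example on synthetic numbers (the cut-offs of 3 standard deviations and 1.5×IQR are just common conventions, not fixed rules):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 100), [95.0, 2.0]])  # two planted outliers

# Standard Deviation method: flag points more than 3 standard deviations from the mean
mean, std = x.mean(), x.std()
sd_outliers = x[np.abs(x - mean) > 3 * std]

# Interquartile Range method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(sd_outliers)
print(iqr_outliers)
```

Both methods flag the two planted values, 95.0 and 2.0 (the IQR rule may also pick up a few borderline genuine values, since its fences are tighter).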

Before diving deep into the various techniques of automatic outlier detection, let me introduce the families of models that outlier detection is based on.

**Extreme Value Analysis:-** For example, statistical methods like z-scores on univariate data.

**Probabilistic and statistical models:-** For example, Gaussian mixture models optimized using expectation-maximization.

**Linear models:-** For example, principal component analysis, where data points with large residual errors may be outliers.

**Proximity-based models:-** For example, Local Outlier Factor, where data points isolated from the mass of the data, as determined by local density, are flagged.

**High-dimensional outlier detection models:-** For example, Minimum Covariance Determinant, which searches subspaces for outliers and addresses the breakdown of distance-based measures in higher dimensions.

**Tree-based models:-** For example, the Isolation Forest algorithm, which detects outliers based on the formation of decision trees and distinguishes anomalies using an anomaly score for each data instance.
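To make two of these categories concrete, here is a small sketch of my own (on synthetic 2-D data) using scikit-learn's `EllipticEnvelope`, which fits a Minimum Covariance Determinant estimate, and `LocalOutlierFactor`; both label outliers as -1 and inliers as 1:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0], [-9.0, 7.0]]])  # two planted outliers

# Minimum Covariance Determinant (robust covariance): points far from the robust fit get -1
mcd_labels = EllipticEnvelope(contamination=0.05, random_state=42).fit_predict(X)

# Local Outlier Factor: points whose local density is much lower than their neighbours' get -1
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

print(mcd_labels[-2:], lof_labels[-2:])  # both planted points should be flagged as -1
```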

Now that we have basic knowledge of what an outlier is, how it is detected, and its effect on the results of a project, let's dive into the most interesting part of the blog.

# Implementation in Python:-

I will do a comparative study of various Automatic Outlier Detection techniques on a medical use case.

I will cover the following topics in my blog:-

- **Isolation Forest Algorithm**
- **Local Outlier Factor**
- **One Class SVM**

Dataset link:- http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

### Objective:-

**Here we need to predict whether a patient will die of heart failure or not, based on various attributes like age, diabetes, sex, platelets etc.**

### Steps followed:-

First, I will apply a **Random Forest Classifier** on the dataset, perform **hyperparameter tuning**, and get the **cross-validated accuracy**. Then I will compare the obtained accuracy with the accuracy obtained after applying each outlier detection technique.

Let’s import the necessary libraries and dataset to be used for this project

```
#Import necessary libraries
import pandas as pd
import numpy as np
#read the data from working directory
data=pd.read_csv('heart_failure_clinical_records_dataset.csv')
data.head()
```

Now we do the **train test split along with the model building** part

```
#Getting dependent and independent variables
X=data.iloc[:,:-1]
y=data['DEATH_EVENT']
#splitting into training set and testing set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=101)
#fitting a model to get baseline accuracy
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(random_state=101)
model.fit(X_train,y_train)
#predicting on test set
predictions=model.predict(X_test)
#create a classification report
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
```

**Fine tuning** the model

```
#creating dictionary of parameters for tuning
n_estimators=[int(i) for i in np.linspace(200,2000,10)]
max_features=['auto','sqrt','log2']
max_depth=[int(i) for i in np.linspace(10,1000,10)]
min_samples_split=[2,5,10,14]
min_samples_leaf=[1,2,4,6,8]
params={'n_estimators':n_estimators,'max_features':max_features,'max_depth':max_depth,'min_samples_split':min_samples_split
,'min_samples_leaf':min_samples_leaf}
#performing the random_search
from sklearn.model_selection import RandomizedSearchCV
random_search=RandomizedSearchCV(model,params,n_iter=5,scoring='accuracy',n_jobs=-1,cv=5,verbose=1)
random_search.fit(X,y)
#to get the fine tuned parameters
random_search.best_params_
#to get the fine tuned model
random_search.best_estimator_
#to get the cross validated score by performing k-fold CV using the fine tuned model to get baseline accuracy
from sklearn.model_selection import cross_val_score
score=cross_val_score(random_search.best_estimator_,X,y,cv=10,scoring='accuracy')
print(f'The accuracy of the model is {np.mean(score)}')
```

So after implementing this model, I got a baseline accuracy of **0.7757471264367817**.

Now we will compare the above model with models fitted after removing the anomalous points.

First, let's understand how the above three automatic outlier detection algorithms work.

# 1. Isolation Forest Algorithm

As the name suggests, this algorithm is based on the concept of **isolating anomalous** data points from the **genuine** points in order to classify them as anomalies.

The driving concept behind this algorithm is that anomalous points are few in number and different, so they are more prone to being isolated than genuine points. This method is algorithmically different from, and more efficient than, the other existing methods. It is efficient because it can be used on datasets with many dimensions without worrying about the **curse of dimensionality**. It is algorithmically different because it applies an isolation technique to detect anomalies rather than the commonly used **distance** and **density measures**.

One of the main advantages of this algorithm is that it has **linear time complexity** and a **small memory requirement**.

## Core principle:-

In this algorithm, decision trees are created by selecting random attributes to find anomalous points. When several such trees are formed and combined, they form an isolation forest (just like a random forest), hence the name.

The random partitioning of the dataset creates shorter paths for anomalous data points because:-

- Anomalies are distinctly fewer in number, so they end up in smaller partitions
- Distinguishable data points tend to be separated early in the partitioning process.

This process is repeated a fixed number of times, and the anomaly score is noted at each isolation level for each data point. After all the iterations are completed, we generate an anomaly score for each data point and decide on that basis whether it is anomalous or not.

**Now you must be thinking: what is this anomaly score?**

The anomaly score is a function of the average depth at which a point is isolated. The top ‘k’ points, ranked by this score, are labelled as anomalies.
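In scikit-learn's implementation, this score is exposed through `score_samples`: the lower the score, the shorter the average isolation path and the more anomalous the point. A small hedged illustration on synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[10.0, 10.0]]])  # one planted outlier at index 200

iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)   # lower score = shorter average path = more anomalous
labels = iso.predict(X)         # -1 for outliers, 1 for inliers

print(scores.argmin())          # index of the most anomalous point
```

The planted point should receive the lowest score and a -1 label.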

So, for your better understanding, let me provide a pseudo-code outline of the algorithm.

### Steps:-

- Select a particular data point that you want to isolate.
- For every feature of the dataset, set a range between the maximum and minimum of that feature.
- Now select a feature at random.
- Here comes the iterative step:

a) Select a value within the range decided in step 2. If the data point selected in step 1 is larger than the chosen value, set the minimum of the range to that value.

b) If the data point is less than the selected value, set the maximum of the range to that value.

c) Repeat steps 3 and 4 until the point selected in step 1 is isolated; in other words, until that point is the only point that lies inside the selected range for all the features of the dataset.

- If we count the number of times we have to repeat steps 3 and 4, we get a number which we call the isolation number.

Now, if a point is an outlier, the isolation number will be small (as it gets isolated easily). Internally, the algorithm uses random numbers to select split values, so the procedure is repeated several times, and the final isolation number on which the decision is based is a combination of all the isolation numbers.
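The isolation procedure above can be written as a short toy function. Everything here (the `isolation_number` helper, the synthetic data, the averaging over seeds) is my own illustrative sketch, not the actual Isolation Forest implementation:

```python
import random

def isolation_number(data, point, seed=0):
    """Count the random splits needed until `point` is alone inside the range box."""
    rng = random.Random(seed)
    n_features = len(point)
    # step 2: per-feature range between the minimum and maximum of that feature
    lo = [min(row[j] for row in data) for j in range(n_features)]
    hi = [max(row[j] for row in data) for j in range(n_features)]
    splits = 0
    while sum(all(lo[j] <= row[j] <= hi[j] for j in range(n_features)) for row in data) > 1:
        j = rng.randrange(n_features)          # step 3: pick a random feature
        value = rng.uniform(lo[j], hi[j])      # step 4a: pick a split value in the range
        if point[j] > value:
            lo[j] = value                      # keep the half that still contains `point`
        else:
            hi[j] = value
        splits += 1
    return splits

cluster = [(random.Random(i).gauss(0, 1), random.Random(i + 99).gauss(0, 1)) for i in range(30)]
data = cluster + [(9.0, 9.0)]                  # one far-away point

# average over several random seeds, since a single run is noisy
avg = lambda p: sum(isolation_number(data, p, seed=s) for s in range(50)) / 50
print(avg((9.0, 9.0)), avg(cluster[0]))        # the outlier needs fewer splits on average
```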

I think I have been able to give an outline of what isolation forest is and how the algorithm works.

For further study, refer to:- https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest

Before diving deep into the coding, let's look at some important parameters of the Isolation Forest algorithm.

### Parameters:-

- contamination = the proportion of outliers in the dataset (in my code snippet I have run the code for different values)
- n_estimators = the number of base estimators in the ensemble
- max_features = the number of features sampled from X to train each estimator
- max_samples = the number of instances sampled from X to train each estimator
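Putting those parameters together, a typical instantiation looks like the sketch below; the values are illustrative, not tuned for this dataset:

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(
    n_estimators=100,     # number of base trees in the ensemble
    max_samples='auto',   # instances drawn per tree: min(256, n_samples)
    max_features=1.0,     # fraction of features used to train each tree
    contamination=0.05,   # assumed proportion of outliers in the dataset
    random_state=101,
)
print(iso.get_params()['contamination'])
```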

I will use the same fine tuned Random Forest classifier that I have used previously.

Let’s do some coding now!!

## Python code for Isolation Forest Algorithm

After applying the Isolation Forest model, the **anomalous points** will be labelled **-1** and the **inliers** will be labelled **1**.

```
from sklearn.ensemble import IsolationForest
# list of contamination ratios to try
list1=[0.01,0.02,0.03,0.04,0.05,0.1,0.2]
scores=[]
for i in list1:
    X=data.iloc[:,:-1]
    y=data['DEATH_EVENT']
    # identify outliers in the dataset
    iso=IsolationForest(contamination=i,random_state=101)
    yhat=iso.fit_predict(X)
    mask=pd.DataFrame(yhat)
    # concatenating with the original data
    X=pd.concat([X,mask],axis=1)
    y=pd.concat([y,mask],axis=1)
    # naming the new column as anomaly
    X.columns=['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
               'ejection_fraction', 'high_blood_pressure', 'platelets',
               'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'anomaly']
    y.columns=['DEATH_EVENT','anomaly']
    # selecting only the non-anomalous rows (rows with anomaly value equal to 1)
    X=X.loc[X['anomaly']==1,X.columns[:-1]]
    y=y.loc[y['anomaly']==1,'DEATH_EVENT']
    # cross-validated accuracy with the fine-tuned model
    score=cross_val_score(random_search.best_estimator_,X,y,scoring='accuracy',cv=10,n_jobs=-1)
    scores.append(np.mean(score))
print(scores)
print(f'The maximum accuracy attained by the model is {max(scores)}')
```

After running the above code I got the following output:-
