Complete Understanding on Outlier Analysis
Outlier Analysis – What is it in layman’s terms?
Suppose someone has collected the following data:
- The height (cm) of 100 randomly selected students in Xth class of a school.
- The weight (kg) of 100 randomly selected people working in a Government office.
- The total sale amount (INR) of gold jewelry during working hours on 100 randomly selected days in a year at a jewelry showroom.
- The number of passengers traveling by road on a route by Government transport on each day of a year.
- The table-wise quantity (gm) of food waste in a restaurant during working hours on a randomly selected day.
Read these examples carefully. In each case, care has been taken over several aspects of data collection. For example:
- While measuring height, the unit of measurement (cm) is fixed, the class (Xth) is fixed, the school is fixed, the students are selected at random, and the data set is sizeable.
- While measuring weight, the unit of measurement (kg) is fixed, the group of people (those working in the Government office) is fixed, the people are selected at random, and the data set is sizeable.
- While recording the total sale amount, the currency unit (INR) is fixed, the product (gold jewelry) is fixed, the days in the year are selected at random, the data set is sizeable, and the period is the working hours of the selected days.
- While recording the number of passengers, the travel mode (by road) is fixed, the route is fixed, the transport provider (Government transport) is fixed, and the data covers 365 days.
- While measuring food waste, the unit (gm) is fixed, measurements are taken from all serviced tables, the restaurant is fixed, the day is selected at random, and the data is collected over the working hours.
Putting restrictions on these aspects reduces the variation in the data, a sizeable data set gives every possible value a sufficient chance to appear, and random selection avoids personal bias. Under these conditions most observations fall within a certain range, with exceptions such as the following:
- A student in Xth class at the school may have a height of 175 cm or 110 cm.
- An individual working in the selected office may weigh 45 kg or 102 kg.
- The total sale amount of gold jewelry on a day may be 80,000 INR or 50,00,000 INR.
- The number of passengers who travelled on a day may be 105 or 3000.
- Food waste at a table may be as low as 5 gm or as high as 400 gm.
Data points that are observed but fall outside a certain range are usually referred to as outliers. They are uncommon observations that do not fit with the rest of the underlying data set. They are highly extreme observations with a definite meaning attached to them. For example,
- The student with a height of 175 cm is the tallest, and the one with a height of 110 cm is unusually short. If the teacher asks the students to line up by height, both will draw the attention of onlookers because they look uncommon.
- Among the individuals, the one who weighs 45 kg is very thin, whereas the one who weighs 102 kg is overweight. Both will have some reason for being uncommon. A doctor would advise one to gain weight and the other to lose it.
- A total gold jewelry sale of just 80,000 INR suggests that the day was unusual in some way. It may have been a day considered not good for purchasing gold, such as “Amavsya” or “Pitraweek”. There may have been law and order problems, heavy rain or high temperatures, insufficient stock, a particular Government policy, and so on. A total of 50,00,000 INR suggests that the day had a powerful Nakshtra, fell in the marriage season or on a big festival, or that the nearby community had extra income, or that the showroom offered special schemes or a rich collection of designs.
- If the number of passengers on the route by Government road transport on a day is just 105, it points to the off season or to some local reason. If it is 3000, it may point to vacations, a festival, the marriage season or rituals; in short, a good season.
- If the waste at a table on the selected day is just 5 gm, there is almost no waste: the customer may respect food, the food may have been very tasty, or the order may have been the right size. If the waste is 400 gm, the food may have been tasteless, the customer may have ordered more than needed, or the customer may have been careless and disrespectful of food.
In short, outliers are highly extreme observations that carry a notable meaning.
Outlier Analysis – Identifying the outliers:
A convenient way to identify outliers is the box plot. In this blog, the data used to catch outliers is Iris.xls. The box plots of Sepal Length, Sepal Width, Petal Length and Petal Width of Iris-Setosa flowers are shown in the following figure.
In the figure, one flower shows a comparatively longer Sepal Length and two flowers show a longer Petal Width. These three flowers seem to be slightly different from the rest of the flowers in the basket. Hence they are treated as outliers.
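A box plot usually marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The blog does not say which tool or quartile method produced its figure, so the following is only a minimal Python sketch of that common convention. The function name `iqr_outliers`, the multiplier `k` and the height values are all hypothetical and chosen for illustration; they are not taken from Iris.xls.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: quartiles use linear interpolation over the sample
    (method="inclusive"); other tools may compute quartiles slightly
    differently and flag different points.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Keep the original order of the flagged observations.
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical heights (cm), made up to mirror the student example above.
heights_cm = [148, 150, 151, 152, 153, 154, 155, 156, 157, 158, 110, 175]

print(iqr_outliers(heights_cm))  # prints [110, 175]
```

Here the quartiles are 150.75 and 156.25, so the fences fall at 142.5 cm and 164.5 cm, and only the two extreme heights are flagged. A flagged point is a candidate for a closer look rather than a proven error, in line with the idea that outliers carry a meaning of their own.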
To read more on Anomaly detection techniques, follow this blog