Dimensionality Reduction Techniques in Machine Learning

Introduction:

Have you ever worked on a dataset with more than a thousand variables? It is a very challenging task, especially when you do not know where to start! Having a large number of variables is both a boon and a curse: more variables can carry more information, but they also make the data harder to explore and model. We therefore need a better way to handle high-dimensional data so that we can quickly extract patterns and insights from it. So, how do we approach such a dataset?

Using dimensionality reduction techniques, we can reduce the number of features without losing too much information, which often improves the model’s performance.

TABLE OF CONTENTS

1. What is Dimensionality Reduction?
2. Why is Dimensionality Reduction Important in Machine Learning?
3. Methods of Dimensionality Reduction
4. Conclusion

What is Dimensionality Reduction?

Dimensionality reduction is usually treated as an unsupervised learning technique, although some methods, such as LDA, also make use of the class labels. In a machine learning classification problem, the final classification often depends on a large number of factors. These factors are variables called features. The more features there are, the harder the data is to handle and analyze, and in practice many features are redundant or correlated with one another. This is where dimensionality reduction comes into the picture: it is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.

Dimensionality reduction refers to techniques for reducing the number of input variables in training data. It is a data preprocessing step performed before modeling: it comes after data cleaning and data scaling and before training a predictive model. It can be divided into feature selection and feature extraction.
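To make the ordering of the steps concrete, here is a minimal sketch of scaling, then reducing, then training, chained in a scikit-learn pipeline. The built-in wine dataset, the choice of 5 components, and the logistic regression model are assumptions made purely for illustration; they are not part of the examples later in this post.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in example data (assumption for illustration): 178 samples, 13 features
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Order of the steps: scale -> reduce dimensionality -> train the model
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),          # 5 is an arbitrary choice for this sketch
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

n_original = X_train.shape[1]
n_reduced = pipeline.named_steps["pca"].n_components_
test_accuracy = pipeline.score(X_test, y_test)

print("Original features:", n_original)
print("Features after PCA:", n_reduced)
print("Test accuracy:", test_accuracy)
```

Putting the steps in a pipeline means the scaler and the reduction are fitted on the training split only and then reused unchanged on the test split.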

Why is Dimensionality Reduction Important in Machine Learning?

Here are some benefits of applying dimensionality reduction to a dataset.

• As the number of dimensions comes down, the space required to store the data is reduced as well.
• Fewer dimensions mean less training time.
• Some algorithms do not perform well when the dataset has a large number of dimensions, so the dimensions need to be reduced for the algorithm to be useful.
• It helps with multicollinearity by removing redundant features.
• It helps in visualizing the data.

The figure below illustrates this concept: a 3-D feature space is split into two 1-D feature spaces, and, if these are later found to be correlated, the number of features can be reduced even further.
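As a rough sketch of how correlated features can be spotted and dropped, the snippet below builds a small synthetic dataset in which one column is almost a copy of another. The column names, the synthetic data, and the 0.95 threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): height in cm, the same height in inches, and a weight
height_cm = rng.normal(170, 10, size=n)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, size=n)  # nearly redundant
weight_kg = rng.normal(70, 12, size=n)                     # independent of height here

X = np.column_stack([height_cm, height_in, weight_kg])

# Pairwise correlation between the columns
corr = np.corrcoef(X, rowvar=False)

# Look only at the upper triangle so that each pair is considered once,
# and mark the later column of every highly correlated pair for removal
threshold = 0.95
upper = np.triu(np.abs(corr), k=1)
to_drop = sorted(set(np.where(upper > threshold)[1].tolist()))

X_reduced = np.delete(X, to_drop, axis=1)
print("Columns dropped:", to_drop)
print("Shape before:", X.shape, "after:", X_reduced.shape)
```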

There are two components of dimensionality reduction.

1. Feature Selection: Here we try to find a subset of the original set of variables that can be used to model the problem. It usually involves one of three approaches.

• Filter
• Wrapper
• Embedded

2. Feature Extraction: This transforms the data from a high-dimensional space into a space with fewer dimensions.
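The difference between the two components can be sketched in a few lines. In this illustration, a filter-style selector keeps 4 of the original columns unchanged, while PCA builds 4 new columns out of all of them. The wine dataset, the ANOVA F-score filter, and the number 4 are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Built-in example data (assumption for illustration): 13 features
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection (filter approach): keep the 4 original columns
# with the highest ANOVA F-score with respect to the labels
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X_scaled, y)
kept_columns = selector.get_support(indices=True)

# Feature extraction: create 4 new columns, each one a linear
# combination of all 13 original columns
pca = PCA(n_components=4)
X_extracted = pca.fit_transform(X_scaled)

print("Selected original columns:", kept_columns)
print("Selected shape:", X_selected.shape)
print("Extracted shape:", X_extracted.shape)
```

Both results have 4 columns, but the selected ones are still original features and remain directly interpretable, whereas the extracted ones are new variables.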

Methods of Dimensionality Reduction

There are various methods used for Dimensionality Reduction:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be linear or non-linear, depending on the method used.

Principal Component Analysis:

This method was first introduced by Karl Pearson. It works on the condition that, when data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be as large as possible.

It involves the following steps:

• Construct the covariance matrix of the data.
• Compute the eigenvectors and eigenvalues of this matrix.
• Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of the variance of the original data.

Hence, we are left with a smaller number of eigenvectors, and some information may be lost in the process. However, the most important variance should be retained by the remaining eigenvectors.
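The three steps above can be sketched directly in NumPy. This is a minimal illustration on synthetic data (an assumption) rather than a replacement for a library implementation. The third feature is deliberately built as a near copy of the first, so two components are enough to keep almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): the third feature is almost a copy of the first
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Step 1: center the data and construct the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: compute eigenvalues and eigenvectors
# (eigh is meant for symmetric matrices such as a covariance matrix)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: sort by decreasing eigenvalue and keep the top k eigenvectors
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 2
components = eigenvectors[:, :k]

# Project the data onto the retained eigenvectors
X_reduced = X_centered @ components

# Fraction of the total variance kept by the k components
explained_ratio = eigenvalues[:k].sum() / eigenvalues.sum()
print("Reduced shape:", X_reduced.shape)
print("Fraction of variance retained:", explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the largest eigenvalues identify the directions worth keeping.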

PCA Implementation Example:

We will use the Mushroom Classification dataset to illustrate PCA.

First, we need to load all the libraries:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

Now we can load the data and check for null values:

m_data = pd.read_csv('mushrooms.csv')
print(m_data.isnull().sum())

Here, we need to convert the string (categorical) features into integers:

encoder = LabelEncoder()

# Transform every column, including the class label
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

# Column 0 holds the class label; columns 1 to 22 are the features
X_features = m_data.iloc[:, 1:23]
y_label = m_data.iloc[:, 0]

# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

We’ll apply PCA and plot how much of the variance each principal component explains. It looks like around 17 or 18 of the components explain the majority, almost 95%, of the variance in our data:

# Fit PCA with all components and visualize the share of variance of each one
pca = PCA()
pca.fit(X_features)
pca_variance = pca.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
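Rather than reading the number of components off the bar chart, the cut-off can also be computed from the cumulative explained variance. The sketch below uses scikit-learn’s built-in breast cancer dataset as a stand-in (an assumption, since the mushroom CSV is not bundled with the library), with 95% as the target.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption for illustration): 569 samples, 30 features
X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95
n_needed = int(np.searchsorted(cumulative, target, side="right") + 1)

# scikit-learn can make the same choice itself: a float n_components
# (with svd_solver="full") is interpreted as the variance to retain
pca_95 = PCA(n_components=target, svd_solver="full").fit(X_scaled)

print("Components needed for 95% variance:", n_needed)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```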

Let’s project the data onto the top 17 principal components. We will then draw a scatter plot of the data points along two of these components:

pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_17d = pca2.transform(X_features)

# Plot the first component against the sixth, colored by class
plt.figure(figsize=(8, 6))
plt.scatter(x_17d[:, 0], x_17d[:, 5], c=m_data['class'])
plt.show()

Let’s also do this for the top 2 principal components and see how the separation of the classes changes:

pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_2d = pca3.transform(X_features)

# Plot the two components against each other, colored by class
plt.figure(figsize=(8, 6))
plt.scatter(x_2d[:, 0], x_2d[:, 1], c=m_data['class'])
plt.show()

Linear Discriminant Analysis (LDA):

LDA is used to project data from a multidimensional space onto a lower-dimensional one, such as a line. The easiest way to picture this is a 2-D graph filled with data points from two different classes. Suppose there is no line that neatly separates the data into two classes. We can then convert the 2-D graph into a 1-D graph, chosen so that it hopefully achieves the best possible separation of the data points.

When LDA is used, there are two primary goals: minimizing the variance within each of the two classes and maximizing the distance between the means of the two classes.
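These two goals can be combined into a single score: the squared distance between the projected class means divided by the sum of the projected class variances. The following is a minimal NumPy sketch for two synthetic 2-D classes (the data and the helper name fisher_score are assumptions for illustration). It computes the direction that maximizes this score and compares it with simply projecting onto the x-axis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic, overlapping 2-D classes with correlated features (assumption)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
class0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
class1 = rng.multivariate_normal([2.0, 0.0], cov, size=300)

mean0, mean1 = class0.mean(axis=0), class1.mean(axis=0)

# Within-class scatter: spread of each class around its own mean, summed
S_w = (class0 - mean0).T @ (class0 - mean0) + (class1 - mean1).T @ (class1 - mean1)

# Direction that maximizes the separation score: S_w^{-1} (mean1 - mean0)
w = np.linalg.solve(S_w, mean1 - mean0)
w /= np.linalg.norm(w)


def fisher_score(direction):
    """Squared distance between projected means over the sum of projected variances."""
    p0, p1 = class0 @ direction, class1 @ direction
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())


score_lda = fisher_score(w)
score_x_axis = fisher_score(np.array([1.0, 0.0]))

print("Score along the LDA direction:", score_lda)
print("Score along the x-axis:", score_x_axis)
```

Because the two features are correlated, the best 1-D projection is not the axis along which the means differ, which is exactly the situation LDA is designed to handle.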

LDA Implementation Example:

Finally, let’s see how LDA can be used for dimensionality reduction.

We will use the Titanic dataset for the following example:

# Let's import all the libraries
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Let’s drop the Name, Cabin, and Ticket columns, as they don’t carry a lot of useful information, fill in the missing values, and encode the categorical columns.

# Load the training data (the file name "train.csv" is an assumption; use the path to your copy)
training_data = pd.read_csv("train.csv")

# Drop the Cabin and Ticket columns
training_data.drop(labels=['Cabin', 'Ticket'], axis=1, inplace=True)

# Fill in missing values (assigning back avoids in-place edits on a column copy)
training_data["Age"] = training_data["Age"].fillna(training_data["Age"].median())
training_data["Embarked"] = training_data["Embarked"].fillna("S")

# Fit an encoder on the Sex column, then transform and replace the data
encoder_1 = LabelEncoder()
encoder_1.fit(training_data["Sex"])
training_data["Sex"] = encoder_1.transform(training_data["Sex"])

# Do the same for the Embarked column
encoder_2 = LabelEncoder()
encoder_2.fit(training_data["Embarked"])
training_data["Embarked"] = encoder_2.transform(training_data["Embarked"])

# Drop the Name column as well
training_data.drop("Name", axis=1, inplace=True)

We need to scale the values, but the scaler takes 2-D arrays, so the columns we want to scale need to be reshaped into arrays first. After that, we can scale the data:

# Remember that the scaler takes 2-D arrays
ages_train = np.array(training_data["Age"]).reshape(-1, 1)
fares_train = np.array(training_data["Fare"]).reshape(-1, 1)

scaler = StandardScaler()

# ravel() flattens the scaled (n, 1) array back into a single column
training_data["Age"] = scaler.fit_transform(ages_train).ravel()
training_data["Fare"] = scaler.fit_transform(fares_train).ravel()

# Now select our features and labels
features = training_data.drop(labels=['PassengerId', 'Survived'], axis=1)
labels = training_data['Survived']

Use train_test_split to create our training and validation data. It’s easy to do classification with LDA:

X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2, random_state=27)

model = LDA()
model.fit(X_train, y_train)

preds = model.predict(X_val)

acc = accuracy_score(y_val, preds)
f1 = f1_score(y_val, preds)

print("Accuracy: {}".format(acc))
print("F1 Score: {}".format(f1))

Accuracy: 0.8100558659217877

F1 Score: 0.734375

After fitting the model, we get an accuracy of about 0.81 and an F1 score of about 0.73.

After that, we transform the features by specifying the desired number of components for LDA and fitting the model on the features and labels. We then simply transform the features and save the result in a new variable. Let’s print the original and the reduced number of features.

LDA_transform = LDA(n_components=1)
LDA_transform.fit(features, labels)
features_new = LDA_transform.transform(features)

# Print the number of features
print('Original feature #:', features.shape[1])
print('Reduced feature #:', features_new.shape[1])

# Print the ratio of explained variance
print(LDA_transform.explained_variance_ratio_)

# Output of the above code

Original feature #: 7

Reduced feature #: 1

[1.]
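The reason a single component is used above is that LDA can produce at most min(number of features, number of classes - 1) components, and the Titanic labels have only two classes. Here is a small sketch of that limit using the three-class iris dataset (a stand-in chosen for illustration, not part of the Titanic example):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Built-in example data (assumption for illustration): 4 features, 3 classes
X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
n_classes = len(set(y))

# LDA cannot produce more than this many components
max_components = min(n_features, n_classes - 1)

lda = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda.fit_transform(X, y)
print("Maximum components:", max_components)
print("Reduced shape:", X_lda.shape)

# Asking for one component more than the limit is rejected when fitting
try:
    LinearDiscriminantAnalysis(n_components=max_components + 1).fit(X, y)
    too_many_rejected = False
except ValueError:
    too_many_rejected = True
print("Extra component rejected:", too_many_rejected)
```

With two classes the limit is one component, so that single component necessarily accounts for all of the between-class variance, which is why the explained variance ratio printed above is [1.].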

Generalized Discriminant Analysis (GDA):

The Generalized Discriminant Analysis (GDA) approach is used for extracting non-linear features. It is a dimensionality reduction technique that projects a data matrix from a high-dimensional space into a low-dimensional space by maximizing the ratio of between-class scatter to within-class scatter. By selecting the most discriminating features, this not only reduces the number of input features but can also improve classification accuracy and reduce the training and testing time of the classifiers.
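One common way to write the ratio described above is shown below. The notation is an assumption introduced here for illustration: \phi is the non-linear mapping into the feature space, \mu_c^{\phi} is the mean of class c after mapping, \mu^{\phi} is the overall mean, n_c is the number of samples in class c, and W is the projection being sought. Exact formulations vary between presentations of the method, so treat this as a sketch of the idea rather than a definitive statement.

```latex
J(W) = \frac{\left| W^{\top} S_B^{\phi} W \right|}{\left| W^{\top} S_W^{\phi} W \right|},
\qquad
S_B^{\phi} = \sum_{c=1}^{C} n_c \left( \mu_c^{\phi} - \mu^{\phi} \right)\left( \mu_c^{\phi} - \mu^{\phi} \right)^{\top},
\qquad
S_W^{\phi} = \sum_{c=1}^{C} \sum_{x \in \mathcal{X}_c} \left( \phi(x) - \mu_c^{\phi} \right)\left( \phi(x) - \mu_c^{\phi} \right)^{\top}
```

In practice the mapping \phi is not computed explicitly; the quantities are expressed through a kernel function evaluated on pairs of data points.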

Conclusion:

In this blog, we discussed three dimensionality reduction techniques: Principal Component Analysis, Linear Discriminant Analysis, and Generalized Discriminant Analysis. These are statistical techniques you can use to help your machine learning models perform better, combat overfitting, and assist in data analysis.

February 13, 2021