
Text Classification

In this lesson, we will focus on text classification.

Example: Sentiment Analysis

Input: text of reviews

Output: the sentiment class, i.e. positive or negative

Positive example: The hotel is really beautiful, it was a nice stay at the hotel.
Negative example: WiFi wasn’t working, lights went off during the night hours, it was an awful experience staying at this hotel.


import pandas as pd

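def ingest_train():
    # Load the raw reviews; the file is Latin-1 encoded.
    data = pd.read_csv('data/dataset.csv', encoding="ISO-8859-1")
    # Keep only labelled rows and store the labels as integers.
    data = data[data.Sentiment.notnull()].copy()
    data['Sentiment'] = data['Sentiment'].map(int)
    # Drop rows with missing review text.
    data = data[data['SentimentText'].notnull()]
    # Renumber the rows after filtering.
    data = data.reset_index(drop=True)
    return data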

train = ingest_train()

Data Preparation

Let’s do some data cleaning.

Let’s first define the data cleaning function, then apply it to the whole dataset. The function removes URLs, @-mentions, hashtags and HTML tags, expands contracted negations into two words (for example, "isn't" becomes "is not"), converts the text to lower case, and replaces every non-letter character with a space. URLs, mentions, tags and punctuation are very common, but they carry little semantic information for this task.

import re

# Mentions (@user) and URLs starting with http:// or https://
pat_1 = r"(?:\@|https?\://)\S+"
# Hashtags
pat_2 = r'#\w+ ?'
combined_pat = r'|'.join((pat_1, pat_2))
# URLs starting with www (the dot is escaped so that it matches a literal '.')
www_pat = r'www\.[^ ]+'
html_tag = r'<[^>]+>'
# Contracted negations and their expanded forms
negations_ = {
    "isn't": "is not", "can't": "can not", "couldn't": "could not", "hasn't": "has not",
    "hadn't": "had not", "won't": "will not",
    "wouldn't": "would not", "aren't": "are not",
    "haven't": "have not", "doesn't": "does not", "didn't": "did not",
    "don't": "do not", "shouldn't": "should not", "wasn't": "was not", "weren't": "were not",
    "mightn't": "might not",
    "mustn't": "must not"}
negation_pattern = re.compile(r'\b(' + '|'.join(negations_.keys()) + r')\b')

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

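def data_cleaner(text):
    try:
        # Remove mentions, URLs and hashtags, then HTML tags.
        stripped = re.sub(combined_pat, '', text)
        stripped = re.sub(www_pat, '', stripped)
        cleantags = re.sub(html_tag, '', stripped)
        lower_case = cleantags.lower()
        # Expand negations before the apostrophes are stripped out.
        neg_handled = negation_pattern.sub(lambda x: negations_[x.group()], lower_case)
        letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
        # Tokenize and re-join to collapse repeated whitespace.
        tokens = tokenizer.tokenize(letters_only)
        return (" ".join(tokens)).strip()
    except TypeError:
        # Non-string input (for example NaN) cannot be cleaned.
        return 'NC'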

The function returns the cleaned text, or ‘NC’ when the input cannot be cleaned; rows marked ‘NC’ should then be removed from the dataset.
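To make the order of the cleaning steps concrete, here is a minimal, self-contained sketch that uses only the standard library. It is a simplified stand-in for data_cleaner, not the lesson's function: the names clean_sketch, URL_OR_MENTION, HTML_TAG and NEGATIONS are hypothetical, the negation table is cut down to three entries, hashtag removal is left out, and str.split stands in for the NLTK tokenizer.

```python
import re

# Hypothetical, reduced versions of the lesson's patterns (assumptions for illustration).
URL_OR_MENTION = re.compile(r"(?:@|https?://)\S+|www\.\S+")
HTML_TAG = re.compile(r"<[^>]+>")
NEGATIONS = {"wasn't": "was not", "isn't": "is not", "don't": "do not"}
NEGATION_RE = re.compile(r"\b(" + "|".join(map(re.escape, NEGATIONS)) + r")\b")


def clean_sketch(text):
    """Apply the cleaning steps in the same order as the lesson describes."""
    text = URL_OR_MENTION.sub("", text)   # 1. drop URLs and @-mentions
    text = HTML_TAG.sub("", text)         # 2. drop HTML tags
    text = text.lower()                   # 3. lower-case
    # 4. expand negations while the apostrophes are still present
    text = NEGATION_RE.sub(lambda m: NEGATIONS[m.group()], text)
    text = re.sub(r"[^a-z]", " ", text)   # 5. replace non-letters with spaces
    return " ".join(text.split())         # 6. collapse repeated whitespace


sample = "<p>WiFi wasn't working, see https://example.com/review </p>"
cleaned = clean_sketch(sample)
print(cleaned)  # wifi was not working see
```

Expanding the negations before stripping non-letters matters: if the apostrophe were removed first, "wasn't" would become "wasn t" and the negation would be lost.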

Next, let’s define a helper that applies the cleaner to the DataFrame while showing a progress bar, then look at our cleaned data.

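from tqdm import tqdm

# Register progress_map / progress_apply on pandas objects.
tqdm.pandas(desc="progress-bar")

def post_process(data, n=1000000):
    # Work on a copy of the first n rows so the input is not modified.
    data = data.head(n).copy()
    data['SentimentText'] = data['SentimentText'].progress_map(data_cleaner)
    # Remove the rows that could not be cleaned, then renumber.
    data = data[data['SentimentText'] != 'NC']
    data = data.reset_index(drop=True)
    return data

train = post_process(train)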

progress-bar: 100%|████████████████████████████████████████████████████████████| 25000/25000 [00:08<00:00, 2940.12it/s]

Let’s save the cleaned data:

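clean_data = pd.DataFrame(train, columns=['SentimentText'])
clean_data['Sentiment'] = train.Sentiment

# Write the cleaned data to disk, then read it back to check the file.
csv = 'clean_data.csv'
clean_data.to_csv(csv, encoding='utf-8')
data = pd.read_csv(csv, index_col=0)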

Data visualization

Before proceeding to the classification step, let’s visualize our textual data. A word cloud is a good fit here: it displays a set of words, with the importance of each word (here, its frequency) shown by font size or color. This format is useful for quickly perceiving the most prominent terms.

For this visualization, we use the Python library wordcloud.

Let’s begin with the word cloud of negative terms.

import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from wordcloud import WordCloud, STOPWORDS

neg_tweets = train[train.Sentiment == 0]
# Collect the negative reviews and join them into one string.
neg_string = []
for t in neg_tweets.SentimentText:
    neg_string.append(t)
neg_string = pd.Series(neg_string).str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_string)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Next, the word cloud for the positive terms.

pos_tweets = train[train.Sentiment == 1]
# Collect the positive reviews and join them into one string.
pos_string = []
for t in pos_tweets.SentimentText:
    pos_string.append(t)
pos_string = pd.Series(pos_string).str.cat(sep=' ')
wordcloud = WordCloud(width=1600, height=800, max_font_size=200, colormap='magma').generate(pos_string)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
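Since a word cloud sizes each word by how often it occurs, the same information can be checked numerically. The following is a minimal sketch using only the standard library; the helper name top_terms and the two already-cleaned sample sentences (adapted from the negative example at the start of the lesson) are assumptions for illustration, not part of the lesson's code.

```python
from collections import Counter

# Two already-cleaned sample reviews (illustrative data, not the real dataset).
negative_reviews = [
    "wifi was not working lights went off during the night hours",
    "it was an awful experience staying at this hotel",
]


def top_terms(texts, k=3):
    """Return the k most frequent words across all texts with their counts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(k)


print(top_terms(negative_reviews))
```

Very common words such as "was" tend to dominate raw counts, which is why stop-word filtering is usually applied before reading much into the largest terms.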

Building the models

Before proceeding to the training phase, let’s split our data into training and validation sets.

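# Splitting the data
# (train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
SEED = 2000

# Hold out 20% of the rows for validation; the fixed seed makes the split reproducible.
x_train, x_validation, y_train, y_validation = train_test_split(train.SentimentText, train.Sentiment, test_size=.2, random_state=SEED)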
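To illustrate what a seeded hold-out split does, here is a pure-Python sketch. It is not how train_test_split is implemented and will not reproduce its exact shuffle; the function name holdout_split and the toy data are hypothetical. With test_size=0.2, a dataset of 25,000 rows would yield 20,000 training rows and 5,000 validation rows.

```python
import random


def holdout_split(texts, labels, test_size=0.2, seed=2000):
    """Shuffle row indices with a fixed seed, then hold out a test_size fraction."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    # Index texts and labels with the same indices so the pairs stay aligned.
    texts_train = [texts[i] for i in train_idx]
    texts_val = [texts[i] for i in val_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_val = [labels[i] for i in val_idx]
    return texts_train, texts_val, labels_train, labels_val


texts = ["review {}".format(i) for i in range(10)]
labels = [i % 2 for i in range(10)]
texts_train, texts_val, labels_train, labels_val = holdout_split(texts, labels)
print(len(texts_train), len(texts_val))  # 8 2
```

Because the generator is seeded, calling holdout_split again with the same arguments returns the same partition, which keeps accuracy figures comparable across runs.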

Feature Extraction

In this part, we use a feature extraction technique called TF-IDF (term frequency-inverse document frequency) vectorization, limited to 100,000 features and including n-grams up to trigrams. This technique converts textual data into numeric form.
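As a rough illustration of what TF-IDF with n-grams computes, here is a pure-Python sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is how scikit-learn documents its default behaviour; the function names ngrams and tfidf are hypothetical, and no feature limit is applied.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, max_n=3):
    """Map each document to {term: weight} using smoothed IDF and L2 normalization."""
    term_counts = [Counter(ngrams(doc.split(), max_n)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for counts in term_counts:
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["nice stay at the hotel", "awful stay at the hotel"]
vectors = tfidf(docs)
# Terms unique to the first document get the highest weights.
print(sorted(vectors[0], key=vectors[0].get, reverse=True)[:3])
```

Terms that appear in both documents ("hotel", "at the") receive a lower weight than terms that appear in only one ("nice", "nice stay"), which is what makes the representation useful for telling the classes apart.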

The classifier_comparator function below uses a custom helper, acc_summary, which reports the validation accuracy and the time taken to train and evaluate.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
import numpy as np
from time import time

def acc_summary(pipeline, x_train, y_train, x_test, y_test):
    # Time the fit and the prediction together.
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    return accuracy, train_test_time

from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer()

from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.feature_selection import SelectFromModel

names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection","Multinomial NB", 
"Bernoulli NB", "Ridge Classifier", "AdaBoost", "Perceptron","Passive-Aggresive", "Nearest Centroid"]
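classifiers = [
    LogisticRegression(), LinearSVC(),
    # Linear SVC trained on the features selected by an L1-penalized Linear SVC
    Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(), BernoulliNB(), RidgeClassifier(), AdaBoostClassifier(),
    Perceptron(), PassiveAggressiveClassifier(), NearestCentroid()]
# Materialize the pairs as a list so they can be iterated more than once.
zipped_clf = list(zip(names, classifiers))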

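def classifier_comparator(vectorizer=tvec, n_features=10000, stop_words=None, ngram_range=(1, 1), classifier=zipped_clf):
    result = []
    vectorizer.set_params(stop_words=stop_words, max_features=n_features, ngram_range=ngram_range)
    for n, c in classifier:
        # Chain the vectorizer and the classifier so both are fitted on the training data only.
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print("Validation result for {}".format(n))
        print(c)
        clf_acc, tt_time = acc_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, clf_acc, tt_time))
    return result

# 100,000 features with unigrams, bigrams and trigrams, as described above.
trigram_result = classifier_comparator(n_features=100000, ngram_range=(1, 3))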



Validation result for Logistic Regression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
accuracy score: 89.36%
train and test time: 72.07s
Validation result for Linear SVC
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
accuracy score: 90.48%
train and test time: 73.73s
Validation result for LinearSVC with L1-based feature selection
     steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
        norm_order=1, prefit…ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
accuracy score: 89.62%
train and test time: 68.06s
Validation result for Multinomial NB
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
accuracy score: 87.86%
train and test time: 63.91s
Validation result for Bernoulli NB
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
accuracy score: 87.82%
train and test time: 61.73s
Validation result for Ridge Classifier
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
accuracy score: 90.56%
train and test time: 68.91s
Validation result for AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
accuracy score: 80.72%
train and test time: 96.33s
Validation result for Perceptron
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)
C:\Users\pattn\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
accuracy score: 89.02%
train and test time: 63.27s
Validation result for Passive-Aggresive
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)
C:\Users\pattn\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
accuracy score: 90.08%
train and test time: 66.22s
Validation result for Nearest Centroid
NearestCentroid(metric='euclidean', shrink_threshold=None)
accuracy score: 81.50%
train and test time: 66.27s

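A summary of the results for comparison is given below.

Classifier                                   Accuracy   Train and test time
Logistic Regression                          89.36%     72.07s
Linear SVC                                   90.48%     73.73s
LinearSVC with L1-based feature selection    89.62%     68.06s
Multinomial NB                               87.86%     63.91s
Bernoulli NB                                 87.82%     61.73s
Ridge Classifier                             90.56%     68.91s
AdaBoost                                     80.72%     96.33s
Perceptron                                   89.02%     63.27s
Passive-Aggresive                            90.08%     66.22s
Nearest Centroid                             81.50%     66.27s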

Thus, Ridge Classifier (90.56%) and Linear SVC (90.48%) are the best-performing classifiers in our case, while AdaBoost (80.72%) and Nearest Centroid (81.50%) trail the rest by a wide margin.
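The ranking can be checked directly from the figures in the validation log above. The short sketch below copies those numbers into a list and sorts them by accuracy; the variable names results and ranked are chosen here for illustration.

```python
# (name, validation accuracy in %, train and test time in seconds), copied from the validation log
results = [
    ("Logistic Regression", 89.36, 72.07),
    ("Linear SVC", 90.48, 73.73),
    ("LinearSVC with L1-based feature selection", 89.62, 68.06),
    ("Multinomial NB", 87.86, 63.91),
    ("Bernoulli NB", 87.82, 61.73),
    ("Ridge Classifier", 90.56, 68.91),
    ("AdaBoost", 80.72, 96.33),
    ("Perceptron", 89.02, 63.27),
    ("Passive-Aggresive", 90.08, 66.22),
    ("Nearest Centroid", 81.50, 66.27),
]

# Sort from most to least accurate and print the top three.
ranked = sorted(results, key=lambda r: r[1], reverse=True)
for name, acc, secs in ranked[:3]:
    print("{:<45}{:.2f}%  {:.2f}s".format(name, acc, secs))
```

The gap between the top few classifiers is well under one percentage point on a single validation split, so the ordering among them should be read as indicative rather than conclusive.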

© BeingDatum. All rights reserved.