
String Matching on Large Dataset

Matching strings across large datasets is difficult with traditional libraries such as FuzzyWuzzy, ftfy and python-Levenshtein. For example, suppose we are matching two files of 10,000 entries each: we take only 100 entries from the first file, compare each of them against the entire second file, and store the percent similarity and the matched entry back in the first file. FuzzyWuzzy took about 15 seconds for this on my system (an i7 8th-generation processor with 8 GB RAM and a 4 GB NVIDIA GeForce GTX 1050 Ti). If we take the whole first file, the time consumed is roughly 1,500 seconds, and if the data runs into lakhs of rows the time grows far higher.

Instead of FuzzyWuzzy, we can match the strings with an NLP approach based on n-grams and tf-idf, which matches both files in one go in roughly 8-10 seconds. That is far faster than the traditional string-matching packages mentioned above.

Importing some Important packages

import pandas as pd
import re
import time
from ftfy import fix_text
from tqdm import tqdm

# Loading the first file
data_1 = pd.read_csv('data_1.csv', encoding='latin')
data_1

Output:

      Unnamed: 0   LINE_NO  ENGL_CUST_NAME
0           1000  25475066  Lee Siu Lam
1           1001  27634906  Chiu Siu Ying Betty
2           1002  27155866  Lee Chi Pang
3           1003  27740669  Wong Yuk Sim
4           1004  24408814  Chow Kwok Leung
...          ...       ...  ...
9995       10995  23321078  Hang Feng (HK) Co Ltd
9996       10996  26546388  Eleven Lounge & Rest
9997       10997  21647388  Eleven Lounge & Rest
9998       10998  23983136  Hk Great Wealth Internatl Trdg Co Ltd
9999       10999  26318121  Xh Internatl Greative Group Co Ltd

10000 rows × 3 columns

ngrams
An n-gram is basically a contiguous sequence of n items from a given sample of text or speech.

For example, taking n = 3:

For the word Data (padded with a space on each side), the sequence generated is ' Da', 'Dat', 'ata', 'ta ', and these trigrams are what we use for string matching.

t1 = time.time()

def ngram(string, n=3):
    s = fix_text(string)                             # fix text
    s = s.encode("ascii", errors="ignore").decode()  # remove non-ASCII chars
    s = s.lower()
    chars_to_remove = [")", "(", ".", "|", "[", "]", "{", "}", "'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    s = re.sub(rx, '', s)                            # remove the unwanted characters
    s = s.replace('&', 'and')
    s = s.replace(',', ' ')
    s = s.replace('-', ' ')
    s = s.title()                                    # normalise case - capital at start of each word
    s = re.sub(' +', ' ', s).strip()                 # collapse multiple spaces into one
    s = ' ' + s + ' '                                # pad names for ngrams
    s = re.sub(r'[,-./]|\sBD', r'', s)
    ng = zip(*[s[i:] for i in range(n)])
    return [''.join(ng1) for ng1 in ng]
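As a quick illustration (not part of the original script), the function reproduces the trigrams shown above for the word Data:

print(ngram('Data'))
# [' Da', 'Dat', 'ata', 'ta ']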

tf-idf (Term frequency and Inverse document frequency)
The aim of tf-idf is to measure the importance of a keyword or phrase within a document or a web page. We can pass the ngram function as the analyzer of the tf-idf vectorizer: it turns each entry into its sequence of trigrams, and the vectorizer then builds a sparse matrix that is later used to score how similar the strings are. Algorithms like this deal with numbers, while natural language is text, so the text has to be transformed into numbers. Vectorizing a document means taking its text and creating one of these vectors, whose numbers represent the content of the text. TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document.

  • Term Frequency is simply the raw count of a term in a document or the number of times a
    word occurs in the document. There are ways to adjust the frequency, by length of a
    document, or by the raw frequency of the most frequent word in a document.
  • Inverse document frequency is the measure of how much information the word provides,
    i.e., if it’s common or rare across all documents. The closer it is to 0, the more common a word
    is. This metric can be calculated by taking the total number of documents, dividing it by the
    number of documents that contain a word, and calculating the logarithm.
  • So, if the word is very common and appears in many documents, this number will approach 0;
    for rare words it grows larger.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the
score, the more relevant that word is in that particular document.
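As a small, hedged illustration of how these two factors interact (a toy corpus of three names, not the article's data, and scikit-learn's default word analyzer rather than the custom ngram function):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["lee siu lam", "lee chi pang", "wong yuk sim"]  # toy corpus: three tiny "documents"
vec = TfidfVectorizer()
tfidf_toy = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# "lee" appears in two of the three documents, so its idf (and hence its tf-idf weight)
# is lower than that of rarer terms such as "wong" or "yuk".
print(dict(zip(vec.get_feature_names_out(), vec.idf_)))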

Applications of TF-IDF

  • Information retrieval
  • Keyword Extraction

from sklearn.feature_extraction.text import TfidfVectorizer as TV

# Loading the second dataset
data_2 = pd.read_csv('data_2.csv')
data_2

Output:

      Unnamed: 0  CUST_NAME
0           1000  LEE TAI TRANSPORT
1           1001  WAI YIP MAINTENANCE ENGINEERING LTD
2           1002  SANFIELD-GAMMON CONSTRUCTION JV COMPANY LIMITED
3           1003  ASMAC LIMITED
4           1004  268 HAIR SALOON
...          ...  ...
9995       10995  COPPERBURG DEVELOPMENT LTD
9996       10996  LUK HUP PLASTIC MATERIACS CO.
9997       10997  KI MEE KITCHEWARE LTD
9998       10998  LEE TAK GROCERY COMPANY
9999       10999  NICEWORK CONSTRUCTION ENGINEERING LIMITED

10000 rows × 2 columns

new_data_2 = data_2['CUST_NAME'].values.astype('U')

Vectorizing the data

While vectorizing the data we set lowercase to False (the ngram function already normalises case) and use ngram as the analyzer, then fit-transform the data of the second file into a sparse matrix, which is used for matching the strings of the first file.

vectorizer = TV(min_df=1, analyzer=ngram, lowercase=False)
tfidf = vectorizer.fit_transform(new_data_2)
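A quick way to see what fit_transform produced (the exact number of columns depends on how many distinct trigrams occur in the file, so it is not shown here):

print(tfidf.shape)  # (10000, <number of distinct trigrams>): one row per CUST_NAME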

Using KNN to find the nearest neighbour of each string of file one within the vectorized matrix
of the second file.

from sklearn.neighbors import NearestNeighbors

NN = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
data_1_new = data_1['ENGL_CUST_NAME'].values.astype('U')

def getNearestNeigh(query):
    query_1 = vectorizer.transform(query)
    distances, indices = NN.kneighbors(query_1)
    return distances, indices
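As a hedged sanity check (not in the original post), the helper can be called on a single name; kneighbors returns one distance and one index per query row:

d, idx = getNearestNeigh(['Lee Siu Lam'])      # any sample name from the first file
print(d.shape, idx.shape)                      # both (1, 1): one query, one nearest neighbour
print(data_2['CUST_NAME'].iloc[idx[0][0]])     # the closest CUST_NAME in the second file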

Calculating the distances and indices for each string of file one, then enumerating over the
indices to find the closest match in file two for each entry of file one.

distances, indices = getNearestNeigh(data_1_new)
# unique_org = list(data_1_new)  # need to convert back to a list
# print('finding matches...')
final = []
for i, j in enumerate(indices):
    dist = round(distances[i][0], 2)
    temp = [dist, data_2.values[j][0][1]]
    final.append(temp)
t = time.time() - t1
print("Time Consumed for matching using ngrams and tfidf is ", t)

 Time Consumed for matching using ngrams and tfidf is 8.168980836868286

final = pd.DataFrame(final, columns=['Match confidence', 'Matched name'])
data_1['Ratio'] = final['Match confidence']
data_1['Matched Name'] = final['Matched name']

This is the required result
data_1

      Unnamed: 0   LINE_NO  ENGL_CUST_NAME                         Ratio  Matched Name
0           1000  25475066  Lee Siu Lam                             0.83  KWAN SIU LAM
1           1001  27634906  Chiu Siu Ying Betty                     1.10  SIU YUN TING
2           1002  27155866  Lee Chi Pang                            0.73  KONG CHI PANG
3           1003  27740669  Wong Yuk Sim                            0.98  NG YUK SANG
4           1004  24408814  Chow Kwok Leung                         1.04  CHOW KAN WAN
...          ...       ...  ...                                      ...  ...
9995       10995  23321078  Hang Feng (HK) Co Ltd                   0.91  YU FENG (HK) TRADING CO LTD
9996       10996  26546388  Eleven Lounge & Rest                    1.09  CORNER LOUNGE LTD
9997       10997  21647388  Eleven Lounge & Rest                    1.09  CORNER LOUNGE LTD
9998       10998  23983136  Hk Great Wealth Internatl Trdg Co Ltd   1.06  WEALTH INDUSTRY (HK) CLTD
9999       10999  26318121  Xh Internatl Greative Group Co Ltd      1.03  CHEUNG WO INT'L (GROUP)CO LTD

10000 rows × 5 columns
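One point worth noting about the result: the Ratio (match confidence) column is a distance, not a percentage. With the settings used above, TfidfVectorizer L2-normalises every vector and NearestNeighbors uses Euclidean distance by default, so smaller values mean closer matches. If a similarity score is preferred, the distance can be converted, since for unit-length vectors cos = 1 - d^2/2 (a convenience step, not part of the original script):

# Optional: convert the Euclidean distance between L2-normalised tf-idf vectors
# into a cosine similarity in the range [0, 1] (1 = identical n-gram profile).
data_1['Cosine similarity'] = 1 - (data_1['Ratio'] ** 2) / 2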

Now using Fuzzy Wuzzy to perform the same matching

As mentioned above, Fuzzy Wuzzy takes a long time to match strings, so I am using only 100
entries from the first file while the second file is consumed fully.

data1_copy = data_1.copy().head(100)
data_1_copy = list(data1_copy['ENGL_CUST_NAME'])
data_2_copy = list(data_2['CUST_NAME'].unique())

Importing the Fuzzy Wuzzy Package

from fuzzywuzzy import fuzz
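To see what token_sort_ratio returns (a small illustration, not from the original post): it case-folds and alphabetically sorts the tokens of both strings before comparing them, so reordered or differently cased names still score 100.

print(fuzz.token_sort_ratio('Lee Siu Lam', 'LAM SIU LEE'))   # 100: same tokens, different order and case
print(fuzz.token_sort_ratio('Lee Siu Lam', 'Lee Chi Pang'))  # lower score: only one token in common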

t1 = time.time()
save_ratio = []
save_cust = []
final_ratio = []
final_cust = []
count = 0
save_both = []
for i in range(len(data_1_copy)):
    save_ratio = []
    save_cust = []
    save_both = []
    for j in data_2_copy:
        Ratio = fuzz.token_sort_ratio(data_1_copy[i], j)
        inter_save = str(Ratio) + "|" + j
        new_save = list(inter_save.split("|"))
        save_both.append(new_save)
    if save_both == []:
        final_ratio.append(None)
        final_cust.append(None)
    elif len(save_both) > 0:
        val_1 = [int(row[0]) for row in save_both]
        save = val_1.index(max(val_1))
        ans1 = save_both[save]
        final_ratio.append(ans1[0])
        final_cust.append(ans1[1])
t = time.time() - t1
print("Time consumed by Fuzzy Wuzzy for 100 entries:", t)

Time consumed by Fuzzy Wuzzy for 100 entries: 11.957054615020752

data1_copy['Names'] = final_cust
data1_copy['Ratios'] = final_ratio

data1_copy

    Unnamed: 0   LINE_NO  ENGL_CUST_NAME                 Ratio  Matched Name                      Names                     Ratios
0         1000  25475066  Lee Siu Lam                     0.83  KWAN SIU LAM                      LEE SHUK HA               73
1         1001  27634906  Chiu Siu Ying Betty             1.10  SIU YUN TING                      CHOI SUI YING             69
2         1002  27155866  Lee Chi Pang                    0.73  KONG CHI PANG                     LEE HUNG PANG             80
3         1003  27740669  Wong Yuk Sim                    0.98  NG YUK SANG                       WONG YEUK SZE             80
4         1004  24408814  Chow Kwok Leung                 1.04  CHOW KAN WAN                      LEUNG KIN HOP             71
..         ...       ...  ...                              ...  ...                               ...                       ...
95        1095  27746782  Hilton Fur & Leather (HK) Ltd   1.12  MAN FOOK LEATHER COMPANY LIMITED  GOLDEN LEAT DEV (HK) LTD  64
96        1096  21235961  Hilton Fur & Leather (HK) Ltd   1.12  MAN FOOK LEATHER COMPANY LIMITED  GOLDEN LEAT DEV (HK) LTD  64
97        1097  23471785  Cheng Ho See                    0.82  CHENG SEE WING                    YEE HING HO               78
98        1098  31516380  Ng Wing Chiu                    0.99  NG CHING LAM                      CHING YING                82
99        1099  23318803  Tse Pui Kei                     1.02  PUI KEE TRADING CO                MUI ON KEI                67

100 rows × 7 columns

The conclusion of the above comparison is that matching all 10,000 entries with the NLP
(n-gram and tf-idf) approach takes roughly as long as Fuzzy Wuzzy needs for just 100 entries,
which makes Fuzzy Wuzzy really very slow. For large datasets it is therefore better to use the
NLP approach, because it consumes far less time than the Fuzzy Wuzzy package.

May 12, 2020
