
# Fuzzy Wuzzy at a glance

Fuzzy Wuzzy is a string matching library that uses Levenshtein distance to calculate the
difference between two sequences, that is, between two strings.
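To make the idea of Levenshtein distance concrete, here is a minimal sketch of the classic dynamic-programming computation. The function name `levenshtein` is a hypothetical helper for illustration and is not part of the fuzzywuzzy API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means the strings are closer; a similarity score turns that distance into a percentage.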

Requirements

The basic requirements for using the fuzzywuzzy package are listed below:

• Python 2.7 or higher
• difflib, a standard-library module that provides classes and functions for comparing sequences
and producing diffs in several formats, including HTML, context, and unified diffs
• python-Levenshtein, which is optional but can provide a 4-10x speedup in string matching
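As a rough illustration of what difflib contributes, the sketch below builds a 0-100 similarity score on `difflib.SequenceMatcher`. The name `simple_ratio` is hypothetical, and this is an assumption about the general approach rather than the library's exact code; scores may differ when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity of two strings as an integer percentage (0-100)."""
    # SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    # matched characters and T is the combined length of both strings.
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


print(simple_ratio("abcd", "bcde"))  # 75
```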

Testing relies on the following tools:

• pycodestyle, a tool that checks Python code against some of the PEP 8 style conventions
• hypothesis, a property-based testing library that generates test inputs automatically
• pytest, a testing framework for writing and running test code

Installation

• using pip
pip install fuzzywuzzy
pip install python-Levenshtein
• or directly by
pip install fuzzywuzzy[speedup]
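After installing, a quick check like the following sketch can confirm which packages are importable. The helper `is_installed` is hypothetical; `Levenshtein` is the import name of the python-Levenshtein package.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


for name in ("fuzzywuzzy", "Levenshtein"):
    print(name, "installed" if is_installed(name) else "missing")
```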

How to use Fuzzy Wuzzy Package

from fuzzywuzzy import fuzz as f

Simple Ratio

It returns the percent similarity between two strings, based on the Levenshtein distance between them.

print(f.ratio("Being Dautm The Data Society!","being Dautm the data society"))
print(f.ratio("being dautm The data dociety!","being dautm the data society"))
print(f.ratio("being dautm!","being dautm the data society"))

Output:

84
91
55

However, the simple ratio gives a low score when a substring of length x is matched against a
longer string of length y (x < y). In that case, another Fuzzy Wuzzy function can be used.
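The length penalty can be checked by hand for the third example above. Assuming the score follows the `2*M/T` formula used by `difflib.SequenceMatcher` (M matched characters, T combined length), the sketch below reproduces the arithmetic using only the standard library.

```python
from difflib import SequenceMatcher

short = "being dautm!"
full = "being dautm the data society"

matcher = SequenceMatcher(None, short, full)
# Total number of characters that line up between the two strings.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(short) + len(full)

# 11 matched characters out of 12 + 28 = 40 gives 2 * 11 / 40 = 0.55.
print(matched, total, round(100 * 2 * matched / total))  # 11 40 55
```

Almost all of the short string matches, yet the unmatched remainder of the long string pulls the score down to 55.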

Partial Ratio

It is very useful for matching a substring against a longer string.

print(f.partial_ratio("Being Dautm The Data Society!","being Dautm the data socie
print(f.partial_ratio("Being Dautm The Data Society!".lower(),"being Dautm the da
print(f.partial_ratio("Being Dautm The Data Society!".lower(),"the data society,

Output:

86

100

55
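The idea behind partial matching can be sketched as sliding the shorter string along the longer one and keeping the best window score. This is a simplified brute-force illustration built on difflib; `partial_ratio_sketch` is a hypothetical name, and the library may choose candidate windows differently.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any equally long
    window of the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # nothing to match against
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        score = SequenceMatcher(None, shorter, candidate).ratio()
        best = max(best, score)
    return int(round(100 * best))


print(partial_ratio_sketch("data", "the data society"))  # 100
```

Because each window is compared character by character, reordering the words still lowers the score.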

However, partial ratio also fails when the order of the words changes. To solve this problem,
Fuzzy Wuzzy provides another function, called Token Sort Ratio.

Token Sort Ratio
This function helps where partial ratio fails, because it ignores word order. However, it works
best when both strings contain the same number of words: it gives a low score when a substring of
length x is matched against a string of length y where x < y. To overcome this limitation, there is
another function named token set ratio.

print(f.token_sort_ratio("Being Dautm The Data Society","the data society, being
print(f.token_sort_ratio("Being Dautm!","the data society is being Dautm"))

Output:

100

52
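A minimal sketch of the token sort idea: normalize both strings, sort their words, and compare the rejoined results. The helpers `normalize_tokens` and `token_sort_ratio_sketch` are hypothetical, the normalization assumes ASCII text, and difflib stands in for the library's scorer.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(text: str) -> list:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after sorting their words alphabetically."""
    sorted1 = " ".join(sorted(normalize_tokens(s1)))
    sorted2 = " ".join(sorted(normalize_tokens(s2)))
    return int(round(100 * SequenceMatcher(None, sorted1, sorted2).ratio()))


print(token_sort_ratio_sketch("The Data Society", "society, data THE"))  # 100
```

Sorting removes the effect of word order, but extra words in one string still remain in the comparison and reduce the score.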

Token Set Ratio
This function gives more flexibility than the token sort function because it performs a set
operation, intersection, to find the words common to both strings and then applies the fuzzy ratio
to compare them.

print(f.token_set_ratio("Being Dautm!","the data society is being Dautm"))

Output:

100
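The set-based comparison can be sketched as follows: split both strings into word sets, build strings from the intersection and from each remainder, and take the best pairwise score. The names below are hypothetical, and difflib stands in for the library's scorer, so this is an approximation of the approach rather than the exact implementation.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(text: str) -> list:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ratio(s1: str, s2: str) -> int:
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    tokens1, tokens2 = set(normalize_tokens(s1)), set(normalize_tokens(s2))
    # Words shared by both strings, then each string's leftover words.
    common = " ".join(sorted(tokens1 & tokens2))
    combined1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    # The best of the three comparisons is the final score.
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )


print(token_set_ratio_sketch("Being Dautm!", "the data society is being Dautm"))  # 100
```

When every word of one string appears in the other, the intersection equals that string and the score reaches 100.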

Process
Process is one of the most powerful modules in Fuzzy Wuzzy. It matches a query string against a
list of strings, returning either the ranked matches or the single string with the highest percent
similarity.

from fuzzywuzzy import process as p
key=['being dautm!! The Data Society','Being Dautm! the data society','The data s
print(p.extract("being dautm the data society",key))
print(p.extractOne("being dautm the data society",key))

Output:

[('Being Dautm The Data Society', 100), ('Being Dautm! the data society', 98),
('being dautm!! The Data Society', 97), ('The data society Being dautm!', 95),
('Being Dautm is The Data Society', 95)]
('Being Dautm The Data Society', 100)

We can also limit how many strings are returned. The matches are listed in decreasing order of
percent similarity, and only the top ones are extracted from the list.

print(p.extract("being dautm the data society",key,limit=2))

Output:

[('Being Dautm The Data Society', 100), ('Being Dautm! the data society', 98)]
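The ranking behaviour can be sketched with the standard library alone: score every candidate, sort by score in decreasing order, and keep the first `limit` entries. The functions `extract_sketch` and `extract_one_sketch` are hypothetical stand-ins, and the difflib-based scorer is a simplification, so the scores will not match the library's output exactly.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive 0-100 similarity (simplified stand-in scorer)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best match first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None if no choices."""
    best = extract_sketch(query, choices, limit=1)
    return best[0] if best else None


choices = ["apple pie", "apple", "banana"]
print(extract_sketch("apple", choices, limit=2))
print(extract_one_sketch("apple", choices))
```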

June 4, 2020