Building our Text Matching Model using Cosine Similarity in Flask

What is Cosine Similarity?

Cosine similarity, as its name suggests, measures the similarity between two vectors as the cosine of the angle between them. It is a similarity measure that can be converted to a distance measure and then used in any distance-based classifier, such as nearest neighbor classification.

After completing this tutorial, you will know:

  • What is Vector Theory?
  • What is Cosine Similarity?
  • How to build our Text Matching model to detect plagiarism using Cosine Similarity

Meanwhile, you can go through another blog to understand the various operations in mathematical structures here.

Vector Theory

A vector has length and direction. The vector's absolute position is not relevant. Two vectors may be the same even though their starting positions are different. For example, the vectors in the first slide have different starting positions. They can all be moved to the same starting position at the (0,0) origin by focusing only on their movements in X and Y space.

For example, Vector b moves 1 unit in the x direction and 1 unit in the y direction, so this vector can be translated to (1,1) with respect to the (0,0) origin.

Vector Length

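The length of a vector in two-dimensional space is the length of the hypotenuse of the right triangle formed by its x and y movements. The length of a vector is called its "norm". For a vector v = (x, y), the norm is ||v|| = sqrt(x^2 + y^2). For example, the norm of b (1,1) is sqrt(2), or about 1.4142.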
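
As a minimal sketch, the norm can be computed in a few lines of Python. The function name norm is a hypothetical helper chosen for illustration; it is not part of the project code.

```python
import math

def norm(vector):
    """Return the Euclidean length (norm) of a vector given as a sequence of numbers."""
    # norm is a hypothetical helper name used only for this illustration
    return math.sqrt(sum(component ** 2 for component in vector))

b = (1, 1)
print(norm(b))  # about 1.4142, the hypotenuse of a right triangle with sides 1 and 1
```

The same function works for vectors with more than two components, which is what the text matching model relies on later.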

Unit Vectors

To scale different vectors to a common length, each vector is divided by its length. This ensures that the resultant vector has a length of one. The result is called the unit vector: u = v / ||v||. For example, the unit vector of b (1,1) is approximately (0.7071, 0.7071).
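
The scaling step could be sketched as follows. The helper name unit_vector is an assumption made for this example, and the zero-vector check is an added safeguard that the text above does not discuss.

```python
import math

def unit_vector(vector):
    """Scale a non-zero vector to length one by dividing each component by the norm."""
    # unit_vector is a hypothetical helper name used only for this illustration
    length = math.sqrt(sum(component ** 2 for component in vector))
    if length == 0:
        # Assumption: the zero vector has no direction, so we refuse to normalize it
        raise ValueError("cannot normalize the zero vector")
    return tuple(component / length for component in vector)

b_hat = unit_vector((1, 1))
print(b_hat)  # both components are about 0.7071
```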

Dot Product

The dot product of two vectors is computed by multiplying the corresponding components of the vectors and then adding the results. For a = (a1, a2) and b = (b1, b2), the dot product is (a1 x b1) + (a2 x b2).

Assuming that z is (2,0), then the dot product of b (1,1) and z (2,0) would be (1 x 2) + (1 x 0) = 2.
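
The worked example above can be reproduced with a short sketch. The name dot_product is a hypothetical helper, and the length check is an assumption added so that mismatched vectors fail loudly.

```python
def dot_product(u, v):
    """Multiply corresponding components of two equal-length vectors and add the results."""
    # dot_product is a hypothetical helper name used only for this illustration
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return sum(a * b for a, b in zip(u, v))

b = (1, 1)
z = (2, 0)
print(dot_product(b, z))  # 2
```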

Cosine Similarity

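Now that the dot product and norm have been defined, the cosine similarity of two vectors is simply the dot product of their unit vectors. Equivalently, it is the dot product of the two vectors divided by the product of their norms: cos(θ) = (a . b) / (||a|| x ||b||).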

Given that vector b moves up and to the right by equal amounts, it would be expected that this vector is at 45 degrees to the x axis. If the x axis is represented by z (2,0), the cosine similarity between b and z is 2 / (1.4142 x 2) = 0.7071. The inverse cosine of this value is 0.7854 radians, or 45 degrees.
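
To check these numbers, here is a small sketch that combines the dot product and the norms. The function name cosine_similarity is a hypothetical helper, and it assumes both vectors are non-zero and of equal length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    # cosine_similarity is a hypothetical helper name used only for this illustration
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity((1, 1), (2, 0))
angle = math.acos(similarity)  # angle in radians

print(round(similarity, 4))           # 0.7071
print(round(angle, 4))                # 0.7854
print(round(math.degrees(angle), 1))  # 45.0
```

Note that the length of z does not affect the result: (2,0) and (1,0) point in the same direction, so both give the same cosine similarity with b.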

 

 

Enough of the theory part, and let’s move on to build our first text matching model based on the concept of Cosine Similarity 🙂

In this example, we will use existing text stored in a file named existingQuery.txt. We will match the text sent from the front end against the contents of existingQuery.txt and calculate the match percentage. We are going to build our model using Flask; you can get started with Flask here

Importing Libraries and Loading the Page

Importing the flask module in the project is mandatory. An object of the Flask class is our WSGI application. The Flask constructor takes the name of the current module (__name__) as its argument. The route() function of the Flask class is a decorator that tells the application which URL should call the associated function.

from flask import Flask, request, render_template
import re
import math

app = Flask(__name__)

q = ""

@app.route("/")
def loadPage():
	return render_template('home.html', query="")

Defining the POST method to be called

@app.route("/", methods=['POST'])
def cosineSim():

We will now go through the contents inside our method “cosineSim” line by line for a better understanding.

Here, we are defining the variables

    #List of unique words
	uniqueWords = []
    #m implies Percentage of matching between Input text & Database text
	m = 0

Next, we read the input query from the UI, convert it to lower case, replace the punctuation with spaces, and split it into words. We then create a list of unique words from the input query.

    #inputQuery: It is the input query which is entered in the frontend.   
	inputQuery = request.form['query']
    #Converting the input query to lower case
	lowercaseQuery = inputQuery.lower()
    #Replace punctuation by space and split
	queryWordList = re.sub(r"[^\w]", " ", lowercaseQuery).split()

    #Creating a list of uniqueWords from the input query
	for word in queryWordList:
		if word not in uniqueWords:
			uniqueWords.append(word)
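
The tokenizing and de-duplicating steps can also be tried on their own, outside Flask. The following is a standalone sketch, not part of cosineSim; the helper names tokenize and unique_words are hypothetical, and dict.fromkeys is swapped in for the explicit loop because it also keeps first-seen order (Python 3.7+).

```python
import re

def tokenize(text):
    """Lower-case the text, replace every non-word character with a space, and split."""
    # tokenize is a hypothetical helper name used only for this illustration
    return re.sub(r"[^\w]", " ", text.lower()).split()

def unique_words(*word_lists):
    """Merge the word lists and keep each word once, in first-seen order."""
    # unique_words is a hypothetical helper name used only for this illustration
    merged = [word for words in word_lists for word in words]
    return list(dict.fromkeys(merged))

query_words = tokenize("Hello, world! Hello again.")
print(query_words)                # ['hello', 'world', 'hello', 'again']
print(unique_words(query_words))  # ['hello', 'world', 'again']
```

For long texts, dict.fromkeys avoids the repeated list scans that the "if word not in uniqueWords" check performs, while producing the same list.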

Here, we read the existing text from existingQuery.txt, convert it to lower case, replace the punctuation with spaces, and split it into words. We then append any new words to the uniqueWords list. The result is a universal list that contains every unique word from both existingQuery.txt and the text passed as input from the UI.

    #Reading the existing text present in existingQuery.txt and converting it to lower case
	with open("existingQuery.txt", "r") as fd:
		existingQuery = fd.read().lower()
    #Replace punctuation by space and split
	existingQueryList = re.sub(r"[^\w]", " ", existingQuery).split()

    #Appending more words to the uniqueWords list to create a universal list
	for word in existingQueryList:
		if word not in uniqueWords:
			uniqueWords.append(word)

Here, we build the term-frequency vectors for the input query and the existing query by counting how often each unique word occurs in each text. We then calculate the dot product of the two vectors and the vector magnitude of each.



	queryTF = []
	existingQueryTF = []

	for word in uniqueWords:
		queryTfCounter = 0
		existingQueryTFCounter = 0

		for word2 in queryWordList:
			if word == word2:
				queryTfCounter += 1
		queryTF.append(queryTfCounter)

		for word2 in existingQueryList:
			if word == word2:
				existingQueryTFCounter += 1
		existingQueryTF.append(existingQueryTFCounter)

	dotProduct = 0
	for i in range (len(queryTF)):
		dotProduct += queryTF[i]*existingQueryTF[i]

	queryVectorMagnitude = 0
	for i in range (len(queryTF)):
		queryVectorMagnitude += queryTF[i]**2
	queryVectorMagnitude = math.sqrt(queryVectorMagnitude)

	existingQueryMagnitude = 0
	for i in range (len(existingQueryTF)):
		existingQueryMagnitude += existingQueryTF[i]**2
	existingQueryMagnitude = math.sqrt(existingQueryMagnitude)
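
As a standalone sketch (not part of cosineSim), the counting loops above could also be expressed with collections.Counter from the standard library. The helper name term_frequency_vector and the sample words are assumptions made for this example.

```python
from collections import Counter

def term_frequency_vector(words, vocabulary):
    """Count how often each vocabulary word occurs in words, in vocabulary order."""
    # term_frequency_vector is a hypothetical helper name used only for this illustration
    counts = Counter(words)  # a Counter returns 0 for words it has not seen
    return [counts[word] for word in vocabulary]

# Sample data invented for the illustration
vocabulary = ["the", "cat", "sat", "ran"]
query_tf = term_frequency_vector(["the", "cat", "sat"], vocabulary)
existing_tf = term_frequency_vector(["the", "cat", "ran"], vocabulary)

print(query_tf)     # [1, 1, 1, 0]
print(existing_tf)  # [1, 1, 0, 1]
```

Both vectors have one entry per unique word, so they always have the same length and can be compared component by component.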

Calculating the matching percentage

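    #m stays 0 when either text has no words, which avoids a division by zero
	if queryVectorMagnitude and existingQueryMagnitude:
		m = dotProduct / (queryVectorMagnitude * existingQueryMagnitude) * 100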
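
For a concrete feel of the percentage, here is a standalone worked example, separate from cosineSim. The function name match_percentage and the two sample vectors are hypothetical; returning 0.0 for an empty text is an assumption about the desired behavior.

```python
import math

def match_percentage(query_tf, existing_tf):
    """Cosine similarity of two term-frequency vectors, expressed as a percentage."""
    # match_percentage is a hypothetical helper name used only for this illustration
    dot = sum(q * e for q, e in zip(query_tf, existing_tf))
    query_norm = math.sqrt(sum(q * q for q in query_tf))
    existing_norm = math.sqrt(sum(e * e for e in existing_tf))
    if query_norm == 0 or existing_norm == 0:
        # Assumption: a text with no words matches nothing
        return 0.0
    return dot / (query_norm * existing_norm) * 100

# Two texts sharing 2 of their 3 words: dot = 2, each magnitude = sqrt(3)
print("%.2f%%" % match_percentage([1, 1, 1, 0], [1, 1, 0, 1]))  # 66.67%
```

Identical texts give 100%, and texts with no words in common give 0%.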

Finally, we build the output message. Keep all of the code shown so far inside the cosineSim method.

	output = "Input text matches %0.02f%% with existing data."%m

	return render_template('home.html', query=inputQuery, output=output)

Finally, the run() method of the Flask class runs the application on the local development server.

app.run()

The above Python script is executed from the command line.

python textmatch.py

A message in the terminal informs you that the server is running:

* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Open the above URL (localhost:5000) in the browser, and you can test your application.

You can download the entire project here
