In this lesson, we will learn how to load and clean text data so that it is ready for modelling.
Usually, we cannot feed raw text directly into a machine learning or deep learning model. Cleaning the data is the first task, which means splitting it into words and normalizing issues such as:
- Upper & lower case characters
- Spelling mistakes & regional variations
- Unicode characters
- Numbers such as amounts and dates
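As a quick sketch of what such normalization can look like using only the standard library (the function name and the exact rules are illustrative choices, not a fixed recipe):

```python
import re
import unicodedata

def normalize(text):
    # Fold Unicode characters into a canonical decomposed form,
    # then drop combining marks (accents)
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase to remove upper/lower case distinctions
    text = text.lower()
    # Replace digit runs (amounts, dates) with a placeholder token
    text = re.sub(r'\d+', '<num>', text)
    return text

print(normalize('Café opened on 12 May 2020'))
# cafe opened on <num> may <num>
```

Spelling mistakes and regional variations are harder to normalize mechanically and usually need a dictionary or a dedicated library.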
Tokenization is the process of segmenting running text into words and sentences. Electronic text is a linear sequence of symbols (characters, words, or phrases), so before any real text processing can be done, the text needs to be segmented into linguistic units such as words, punctuation, numbers, and alpha-numerics.
In English, words are often separated from each other by blanks (white space), but not all white space is equal. Both "Los Angeles" and "rock 'n' roll" are single units of meaning despite containing multiple words and spaces. Conversely, we may need to split a single token like "I'm" into the separate words "I" and "am".
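One simple way to handle such contractions is a lookup table consulted while splitting on whitespace. This is a minimal sketch; the table below covers only a couple of illustrative entries, and a real system would use a much larger map or a proper tokenizer:

```python
# A tiny, illustrative contraction map; real lists are much larger
CONTRACTIONS = {"i'm": ["i", "am"], "don't": ["do", "not"]}

def split_words(text):
    tokens = []
    for tok in text.lower().split():
        # Expand known contractions, keep everything else as-is
        tokens.extend(CONTRACTIONS.get(tok, [tok]))
    return tokens

print(split_words("I'm sure you don't mind"))
# ['i', 'am', 'sure', 'you', 'do', 'not', 'mind']
```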
Tokenization is a kind of pre-processing: the identification of the basic units to be processed. It is conventional to concentrate on pure analysis or generation while taking these basic units for granted, yet without clearly segmented units it is impossible to carry out any analysis or generation at all.
Identifying the units that do not need to be decomposed further for subsequent processing is therefore extremely important. Errors made at this stage are very likely to induce more errors at later stages of text processing, which makes them especially dangerous.
Let's write some Python code to clean text manually. This is often a good approach, since each text dataset may need to be tokenized in its own way. For example, the code below loads a text file, splits the tokens on whitespace, and converts each token to lowercase.
First, create a file named filename.txt containing the text "Hello World".
```python
filename = 'filename.txt'

# Load the file as text
file = open(filename, 'rt')
text = file.read()
file.close()

# Split into words by white space
words = text.split()

# Convert each word to lowercase
words = [word.lower() for word in words]
print(words)
```
Output: ['hello', 'world']
The Natural Language Toolkit (NLTK) is a Python library that, among other things, provides tokenizers. Let's install it and use it.

Installation (run from a terminal or the Anaconda Prompt):

```
pip install nltk
python -m nltk.downloader all
```
```python
from nltk.tokenize import word_tokenize

# Load the same file as before
filename = 'filename.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

# Split into words with NLTK's tokenizer
tokens = word_tokenize(text)
print(tokens)
```
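If NLTK is not available, a rough approximation of word-level tokenization can be sketched with the standard library's `re` module. This is a simplification for illustration only; NLTK's `word_tokenize` handles contractions, abbreviations, and many other cases that a single regex cannot:

```python
import re

def simple_tokenize(text):
    # Match runs of word characters, or any single
    # non-space, non-word character (punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, World!"))
# ['Hello', ',', 'World', '!']
```

Unlike plain `str.split()`, this keeps punctuation as separate tokens instead of leaving it attached to words.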