Tokenizer in Python

As we all know, a huge amount of text data is available on the internet. However, many of us may not be familiar with the methods needed to start working with it. Moreover, text is tricky to handle in Machine Learning because machines work with numbers, not letters.

So, how is text data manipulated and cleaned before a model is built? To answer this question, let us explore some concepts from Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. First of all, we have to clean the unstructured text data before moving to the modeling stage. Data cleaning includes some key steps, which are as follows:

Word Tokenization
Parts of Speech prediction for every token
Text Lemmatization
Stop Words Identification and Removal, and a lot more.
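
To make the cleaning steps listed above a little more concrete, here is a minimal sketch of two of them, word tokenization followed by stop word removal, in plain Python. The STOP_WORDS set is a deliberately tiny, hypothetical list chosen for illustration; it is not taken from any library.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for illustration);
# real projects normally rely on a larger curated list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

text = "Tokenization is the first step in the cleaning of text data"

# Step 1: a very simple word tokenization (lowercase, then split on whitespace).
tokens = text.lower().split()

# Step 2: stop word removal, keeping only tokens that are not in the list.
content_tokens = [token for token in tokens if token not in STOP_WORDS]

print(content_tokens)
```

Only the words that carry most of the meaning remain after the second step.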

In the following tutorial, we will learn more about the first of these steps, known as Tokenization. We will understand what Tokenization is and why it is necessary for Natural Language Processing (NLP). We will also look at some methods to perform Tokenization in Python.

Understanding Tokenization

Tokenization is the process of dividing a large quantity of text into smaller fragments known as Tokens. These Tokens are useful for finding patterns and are considered the foundation step for stemming and lemmatization. In data security, the term Tokenization also refers to the substitution of sensitive data elements with non-sensitive ones.

Natural Language Processing (NLP) is used to create applications like Text Classification, Sentiment Analysis, Intelligent Chatbots, Language Translation, and many more. Thus, it becomes important to understand the patterns in the text to achieve these purposes.

But for now, consider stemming and lemmatization as the primary steps for cleaning text data in Natural Language Processing (NLP). Tasks like Text Classification or Spam Filtering use NLP along with deep learning libraries like Keras and TensorFlow.

Understanding the Significance of Tokenization in NLP

In order to understand the significance of Tokenization, let us consider the English Language as an example. Let us pick up any sentence and keep it in mind while understanding the following section.

Before processing a natural language, we have to identify the words that make up a string of characters. Thus, Tokenization turns out to be the most fundamental step in Natural Language Processing (NLP).

This step is necessary because the actual meaning of a text can be interpreted by analyzing each word present within it.

Now, let us consider the following string as an example:

My name is Jamie Clark.

After performing the Tokenization on the above string, we would be getting output as shown below:

['My', 'name', 'is', 'Jamie', 'Clark']

There are various uses for performing the operation. We can utilize the tokenized form in order to:

Count the total number of words in the text.
Count the frequency of a word, i.e., the number of times a specific word occurs, and a lot more.
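
As a minimal sketch of both uses, the snippet below counts the tokens with len() and the word frequencies with collections.Counter from the standard library. The token list is a hypothetical one: it extends the example sentence above with a few repeated words so that the frequencies are not all 1.

```python
from collections import Counter

# Hypothetical token list: the example sentence plus a few repeated words.
tokens = ['My', 'name', 'is', 'Jamie', 'Clark', 'my', 'friend', 'is', 'Sam']

# Total number of words in the text.
total_words = len(tokens)

# Frequency of each word; lowercasing makes 'My' and 'my' count as one word.
word_counts = Counter(token.lower() for token in tokens)

print(total_words)
print(word_counts['is'])
print(word_counts['jamie'])
```

Lowercasing before counting is a design choice; skip it if the distinction between 'My' and 'my' matters for the task.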

Now, let us understand several ways to perform Tokenization in Natural Language Processing (NLP) in Python.

Some Methods to perform Tokenization in Python

There are various methods of performing Tokenization on textual data. Some of them are described below:

Tokenization using the split() function in Python

The split() method is one of the basic ways to split a string. It returns a list of strings obtained by splitting the given string at a particular separator. By default, split() breaks a string at every run of whitespace (spaces, tabs, and newlines). However, we can specify a different separator as needed.
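
The difference between the default behaviour and an explicit separator can be seen in a short sketch. The sample strings here are made up for illustration.

```python
# Made-up sample strings for illustration.
record = "apple,banana,,cherry"
spaced = "one  two"  # two spaces between the words

# Explicit separator: the string is cut at every comma, and empty strings are kept.
comma_fields = record.split(",")

# No separator: runs of whitespace are treated as a single separator.
default_words = spaced.split()

# Explicit single space: every space is a separator, so an empty string appears.
space_words = spaced.split(" ")

print(comma_fields)   # ['apple', 'banana', '', 'cherry']
print(default_words)  # ['one', 'two']
print(space_words)    # ['one', '', 'two']
```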

Let us consider the following examples:

Example 1.1: Word Tokenization using the split() function

my_text = """Let's play a game, Would You Rather! It's simple, you have to pick one or the other. Let's get started. Would you rather try Vanilla Ice Cream or Chocolate one? Would you rather be a bird or a bat? Would you rather explore space or the ocean? Would you rather live on Mars or on the Moon? Would you rather have many good friends or one very best friend? Isn't it easy though? When we have less choices, it's easier to decide. But what if the options would be complicated? I guess, you pretty much not understand my point, neither did I, at first place and that led me to a Bad Decision."""

print(my_text.split())

Output:

["Let's", 'play', 'a', 'game,', 'Would', 'You', 'Rather!', "It's", 'simple,', 'you', 'have', 'to', 'pick', 'one', 'or', 'the', 'other.', "Let's", 'get', 'started.', 'Would', 'you', 'rather', 'try', 'Vanilla', 'Ice', 'Cream', 'or', 'Chocolate', 'one?', 'Would', 'you', 'rather', 'be', 'a', 'bird', 'or', 'a', 'bat?', 'Would', 'you', 'rather', 'explore', 'space', 'or', 'the', 'ocean?', 'Would', 'you', 'rather', 'live', 'on', 'Mars', 'or', 'on', 'the', 'Moon?', 'Would', 'you', 'rather', 'have', 'many', 'good', 'friends', 'or', 'one', 'very', 'best', 'friend?', "Isn't", 'it', 'easy', 'though?', 'When', 'we', 'have', 'less', 'choices,', "it's", 'easier', 'to', 'decide.', 'But', 'what', 'if', 'the', 'options', 'would', 'be', 'complicated?', 'I', 'guess,', 'you', 'pretty', 'much', 'not', 'understand', 'my', 'point,', 'neither', 'did', 'I,', 'at', 'first', 'place', 'and', 'that', 'led', 'me', 'to', 'a', 'Bad', 'Decision.']

Explanation:

In the above example, we have used the split() method to break the paragraph into smaller fragments, that is, words. Similarly, we can break the paragraph into sentences by passing a separator as the parameter of the split() method. A sentence generally ends with a full stop ("."), so we can use a full stop followed by a space as the separator to split the string.

Let us consider the same in the following example:

Example 1.2: Sentence Tokenization using the split() function

my_text = """Dreams. Desires. Reality. There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality. Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us. We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try."""

print(my_text.split('. '))

Output:

['Dreams', 'Desires', 'Reality', 'There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality', 'Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us', 'We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try.']

Explanation:

In the above example, we have used the split() method with a full stop followed by a space ('. ') as its parameter to break the paragraph at the end of each sentence. A major disadvantage of split() is that it accepts only one separator at a time, so we cannot split on full stops, question marks, and exclamation marks in a single call. Moreover, split() does not treat punctuation marks as separate fragments: they either stay attached to the neighbouring word or are removed along with the separator.
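
Because split() leaves punctuation attached to the words, one possible workaround is to strip it from each token afterwards. The following is a minimal sketch using string.punctuation from the standard library; the sample sentence is a shortened, made-up variant of the text from Example 1.1.

```python
import string

# Made-up sample sentence for illustration.
sentence = "Would you rather be a bird, or a bat?"

# Plain whitespace split: punctuation stays attached ('bird,' and 'bat?').
raw_tokens = sentence.split()

# Strip leading and trailing punctuation from every token.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]

# Drop tokens that consisted of punctuation only and are now empty.
clean_tokens = [token for token in clean_tokens if token]

print(raw_tokens)
print(clean_tokens)
```

Note that strip() only touches the ends of a token, so an apostrophe inside a word such as "Isn't" is kept.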

Tokenization using RegEx (Regular Expressions) in Python

Before moving on to the next method, let us understand regular expressions in brief. A Regular Expression, also known as RegEx, is a special sequence of characters that serves as a pattern for finding or matching other strings or sets of strings.

To work with RegEx (Regular Expressions), Python provides a module known as re. The re module is part of the Python standard library, so it does not need to be installed separately.

Let us consider the following examples based on word tokenization and sentence tokenization using the RegEx method in Python.

Example 2.1: Word Tokenization using the RegEx method in Python

import re

my_text = """Joseph Arthur was a young businessman. He was one of the shareholders at Ryan Cloud's Start-Up with James Foster and George Wilson. The Start-Up took its flight in the mid-90s and became one of the biggest firms in the United States of America. The business was expanded in all major sectors of livelihood, starting from Personal Care to Transportation by the end of 2000. Joseph was used to be a good friend of Ryan."""

# find every run of word characters and apostrophes
my_tokens = re.findall(r"[\w']+", my_text)
print(my_tokens)

Output:

['Joseph', 'Arthur', 'was', 'a', 'young', 'businessman', 'He', 'was', 'one', 'of', 'the', 'shareholders', 'at', 'Ryan', "Cloud's", 'Start', 'Up', 'with', 'James', 'Foster', 'and', 'George', 'Wilson', 'The', 'Start', 'Up', 'took', 'its', 'flight', 'in', 'the', 'mid', '90s', 'and', 'became', 'one', 'of', 'the', 'biggest', 'firms', 'in', 'the', 'United', 'States', 'of', 'America', 'The', 'business', 'was', 'expanded', 'in', 'all', 'major', 'sectors', 'of', 'livelihood', 'starting', 'from', 'Personal', 'Care', 'to', 'Transportation', 'by', 'the', 'end', 'of', '2000', 'Joseph', 'was', 'used', 'to', 'be', 'a', 'good', 'friend', 'of', 'Ryan']

Explanation:

In the above example, we have imported the re library in order to use its functions. We have then used the findall() function of the re library. This function finds all the substrings that match the pattern given as its first parameter and returns them in a list.

Moreover, "\w" represents any word character, that is, any alphanumeric character (letters and digits) or the underscore (_), and "+" means one or more occurrences. Thus, the [\w']+ pattern matches every run of word characters and apostrophes and stops as soon as it encounters any other character, such as a space, a hyphen, or a full stop.
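
The [\w']+ pattern discards punctuation entirely. As a hedged variation, not part of the example above, the sketch below uses an alternation so that each punctuation mark is returned as its own token. The pattern and the sample sentence are assumptions chosen for illustration.

```python
import re

# Made-up sample sentence for illustration.
sentence = "It's simple, isn't it?"

# Either a word (optionally with an apostrophe part such as 's or n't),
# or a single character that is neither a word character nor whitespace.
pattern = r"\w+(?:'\w+)?|[^\w\s]"

tokens = re.findall(pattern, sentence)
print(tokens)  # ["It's", 'simple', ',', "isn't", 'it', '?']
```

The order of the alternatives matters: the word alternative is tried first, so an apostrophe inside a word stays part of that word.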

Now, let's have a look at sentence tokenization with the RegEx method.

Example 2.2: Sentence Tokenization using the RegEx method in Python

import re

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

my_sentences = re.compile('[.!?] ').split(my_text)
print(my_sentences)

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America', 'The product became so successful among the people that the production was increased', 'Two new plant sites were finalized, and the construction was started', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories', 'Many popular magazines were started publishing Critiques about him.']

Explanation:

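In the above example, we have used the compile() function of the re library with the pattern '[.!?] ' (a full stop, exclamation mark, or question mark followed by a space) and then used the split() method of the compiled pattern to split the string at every match. As a result, the program splits the text wherever one of these characters is followed by a space, and the matched punctuation is removed from the result. The last sentence keeps its full stop because no space follows it.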
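
If the closing punctuation should stay with each sentence, one option is to split on the whitespace that follows a sentence terminator instead of on the terminator itself. The following is a minimal sketch using a lookbehind assertion; the pattern and the sample text are assumptions for illustration, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

# Made-up sample text for illustration.
text = "Is it easy? Yes! Let's get started."

# Split on whitespace that is preceded by '.', '!' or '?'.
# The lookbehind (?<=...) checks the preceding character without consuming it,
# so the punctuation mark remains part of the sentence.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(sentences)  # ['Is it easy?', 'Yes!', "Let's get started."]
```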

Tokenization using Natural Language ToolKit in Python

Natural Language ToolKit, also known as NLTK, is a library written in Python. The NLTK library is generally used for symbolic and statistical Natural Language Processing and works well with textual data.

Natural Language ToolKit (NLTK) is a third-party library that can be installed by running the following command in a command shell or terminal:

pip install nltk

In order to verify the installation, one can import the nltk library in a program and execute it as shown below:

import nltk

If the program does not raise an error, then the library has been installed successfully. Otherwise, it is recommended to follow the above installation procedure again and read the official documentation for more details.

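Natural Language ToolKit (NLTK) has a module named tokenize. Among other things, this module provides two functions: Word Tokenize and Sentence Tokenize. Both rely on the Punkt tokenizer data, which may have to be downloaded once through nltk.download() before the examples below will run (the exact resource name depends on the NLTK version).

Word Tokenize: The word_tokenize() function is used to split a string into tokens, that is, words and punctuation marks.
Sentence Tokenize: The sent_tokenize() function is used to split a string or paragraph into sentences.

Let us consider some examples based on these two functions: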

Example 3.1: Word Tokenization using the NLTK library in Python

from nltk.tokenize import word_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(word_tokenize(my_text))

Output:

['The', 'Advertisement', 'was', 'telecasted', 'nationwide', ',', 'and', 'the', 'product', 'was', 'sold', 'in', 'around', '30', 'states', 'of', 'America', '.', 'The', 'product', 'became', 'so', 'successful', 'among', 'the', 'people', 'that', 'the', 'production', 'was', 'increased', '.', 'Two', 'new', 'plant', 'sites', 'were', 'finalized', ',', 'and', 'the', 'construction', 'was', 'started', '.', 'Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms', 'and', 'the', 'mass', 'producer', 'in', 'all', 'major', 'sectors', ',', 'from', 'transportation', 'to', 'personal', 'care', '.', 'Director', 'of', 'The', 'Cloud', 'Enterprise', ',', 'Ryan', 'Cloud', ',', 'was', 'now', 'started', 'getting', 'interviewed', 'over', 'his', 'success', 'stories', '.', 'Many', 'popular', 'magazines', 'were', 'started', 'publishing', 'Critiques', 'about', 'him', '.']

Explanation:

In the above program, we have imported the word_tokenize() function from the tokenize module of the NLTK library. The function has broken the string into tokens and returned them in a list, which we have then printed. Note that this function treats full stops and other punctuation marks as separate tokens.

Example 3.2: Sentence Tokenization using the NLTK library in Python

from nltk.tokenize import sent_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(sent_tokenize(my_text))

Output:

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America.', 'The product became so successful among the people that the production was increased.', 'Two new plant sites were finalized, and the construction was started.', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care.", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories.', 'Many popular magazines were started publishing Critiques about him.']

Explanation:

In the above program, we have imported the sent_tokenize() function from the tokenize module of the NLTK library. The function has broken the paragraph into sentences and returned them in a list, which we have then printed. Unlike the RegEx approach, each sentence keeps its closing punctuation mark.

Conclusion

In the above tutorial, we have covered the concept of Tokenization and its role in the overall Natural Language Processing (NLP) pipeline. We have also discussed a few methods of Tokenization (including word tokenization and sentence tokenization) for a given text or string in Python.
