Python FlashText Module
In this tutorial, we will learn about the FlashText module and how to use it to replace words in text. It provides an efficient way of replacing a large set of words in a textual document.
Working of FlashText Algorithm
The FlashText module is built upon its own algorithm, also called FlashText. In essence, it is a Python implementation of a trie-based keyword search inspired by the Aho-Corasick algorithm.
The advantage of this algorithm is that it reduces the time spent searching for a large number of keywords in a text. The FlashText algorithm stores all of the keywords, paired with their corresponding replacement words, in a dictionary. Instead of scanning the text once for every keyword in the dictionary, it scans the text only once. Whenever a word in the text matches one of the stored keywords, it is replaced with the corresponding replacement word.
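To illustrate the single-pass idea, here is a simplified sketch in plain Python. It is not FlashText's actual implementation: the names build_trie and replace_once are hypothetical, and the sketch assumes case-insensitive, whole-word matching on ASCII word characters.

```python
import string

_END = "_end_"  # marker key that stores the replacement at the end of a keyword
_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(replacements):
    """Build a character trie from a {keyword: replacement} mapping."""
    root = {}
    for keyword, replacement in replacements.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = replacement
    return root


def replace_once(text, trie):
    """Walk the text left to right, replacing the longest whole-word match."""
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = i == 0 or text[i - 1] not in _WORD_CHARS
        if at_word_start:
            node, j = trie, i
            best_end, best_value = None, None
            # Follow the trie for as long as the text keeps matching.
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Accept a match only if it ends on a word boundary.
                if _END in node and (j == n or text[j] not in _WORD_CHARS):
                    best_end, best_value = j, node[_END]
            if best_end is not None:
                out.append(best_value)
                i = best_end
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


trie = build_trie({"grim": "bad", "excellent": "good", "new delhi": "NCR region"})
result = replace_once("A grim day but excellent food in New Delhi.", trie)
print(result)
```

The cost of the scan depends mainly on the length of the text rather than on the number of keywords, because every keyword shares the same trie.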
How to Install FlashText
We can install it using the pip command: pip install flashtext
Working with FlashText
Let's have an introduction to the FlashText module's classes.
The KeywordProcessor Class
The KeywordProcessor class is the main class, responsible for processing keywords. Let's import it directly from flashtext and initialize it.
Creating a KeywordProcessor with no arguments gives an object that works in case-insensitive mode.
However, we can also create a KeywordProcessor instance in case-sensitive mode by passing case_sensitive=True.
Defining the Keyword Dictionary
In this module, keywords are the words that need to be replaced. The KeywordProcessor object holds a dictionary of all the defined keywords.
There are two ways of adding keywords to the dictionary: in bulk or one-by-one.
First, keywords can be added one by one with the add_keyword() method, which takes the keyword and, optionally, its replacement word.
If we have many keywords, adding them one by one can take a lot of time. Alternatively, we can collect the keywords in a dictionary and add them in bulk with the add_keywords_from_dict() method.
Each key in the dictionary is a replacement word, given as a string. Each value must be a list of the keywords that should be replaced by that key. Alternatively, you can provide keywords as a list.
This approach adds the keywords without replacement words. Or, we can specify the pairs in a text file, one per line, in the form keyword=>replacement.
We can import the file using the add_keyword_from_file() method -
It is a popular approach because it is flexible and highly readable. It is also the most natural match for the algorithm, given that everything ultimately ends up in a dictionary.
Let's take a real-world example. Suppose we have a textual document and we want to minimize the use of synonyms in order to standardize its vocabulary. In this example, we want to replace all occurrences of the words "grim", "nasty", and "unpleasant" with the word "bad" (the replacement word), and all occurrences of the words "excellent", "quality", and "fine" with the word "good".
We need to create a dictionary, keyword_dictionary, that maps each replacement word to the list of keywords it replaces.
Then we add this keyword_dictionary to the keyword_processor object with the add_keywords_from_dict() method.
Replace Keywords with Replacement Words
Once we have added the keywords and their respective replacement words to the KeywordProcessor instance, we can use the replace_keywords() method, which scans the provided text and performs the replacement.
The provided text is parsed, all of the keywords within it are replaced with their matched values, and a new string is returned. Rather than working with string literals, we will work with documents. We will open the document in read mode, read the lines within it, and pass each line as a string to the replace_keywords() function.
Now, let's add some text to new.txt.