Python Tutorial

In this tutorial, we will learn about the FlashText module and how to use it to replace words in text. FlashText provides an efficient way of replacing a large set of words in a textual document.

Working of FlashText Algorithm

The FlashText module is built on its own algorithm, also called FlashText. In essence, it is a trie-based Python implementation in the spirit of the Aho-Corasick algorithm.

The advantage of this algorithm is that it reduces the time spent searching for a large number of keywords in a text. FlashText stores all of the keywords, paired with their corresponding replacement words, in a dictionary. Instead of scanning the text once for every keyword in the dictionary, it scans the text only once. Whenever a word in the text matches one of the dictionary's keys, it is replaced with that key's value.
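The single-scan idea can be illustrated with a small, self-contained sketch. This is a simplified illustration and not FlashText's actual implementation: the names build_trie, replace_once, WORD_CHARS and END are hypothetical, and the sample keywords are made up for the demonstration. The keywords are stored in a nested dictionary (a character trie), and the text is walked from left to right, consulting the trie at the start of each word.

```python
import string

# Characters that count as part of a word; anything else is a boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical marker key that stores the replacement word


def build_trie(replacements):
    """Build a character trie from a {keyword: replacement} mapping."""
    root = {}
    for keyword, replacement in replacements.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = replacement
    return root


def replace_once(text, trie):
    """Walk the text left to right, replacing the longest keyword at each word start."""
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = i == 0 or text[i - 1] not in WORD_CHARS
        if at_word_start and text[i] in WORD_CHARS:
            node, j, best = trie, i, None
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # A match only counts if it ends on a word boundary.
                if END in node and (j == n or text[j] not in WORD_CHARS):
                    best = (j, node[END])
            if best is not None:
                out.append(best[1])
                i = best[0]
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


trie = build_trie({"grim": "bad", "nasty": "bad", "fine": "good"})
result = replace_once("A nasty day, but a fine evening. Grimly said.", trie)
print(result)  # A bad day, but a good evening. Grimly said.
```

The work done per word depends on the length of the word and of the keywords, not on how many keywords are stored, which is why adding more keywords does not require more passes over the text.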

How to Install FlashText

We can install it with pip by running pip install flashtext.

Working with FlashText

Let's have an introduction to the FlashText module's classes.

The KeywordProcessor Class

The KeywordProcessor class is the main class, responsible for processing keywords. Let's import it directly from flashtext and initialize it.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

The last line creates a KeywordProcessor object that works in case-insensitive mode, which is the default.

However, we can also create a KeywordProcessor instance in case-sensitive mode by passing case_sensitive=True to the constructor.

Defining the Keyword Dictionary

In this module, keywords are the words that need to be replaced. The KeywordProcessor object holds a dictionary containing all of the defined keywords.

There are two ways of adding keywords to the dictionary: in bulk or one-by-one.

Firstly, let's take a look at how to add keywords one-by-one. The add_keyword() method takes the keyword and a replacement word, in that order.

The replacementWord (named clean_name in FlashText) is an optional argument. If it is not specified, the keyword is added to the dictionary mapped to itself, so matching it does not substitute a different word. So if you want to replace the word, we recommend passing the replacementWord argument.

If we have many keywords, adding them one-by-one can take a lot of time. Alternatively, we can build a dictionary of keywords and add them in bulk.

Example -

keyword_dictionary = {
    'replacementWord1': ['list', 'of', 'keywords', 'for', 'replacementWord1'],
    'replacementWord2': ['list', 'of', 'keywords', 'for', 'replacementWord2'],
    # ...
    'replacementWordN': ['list', 'of', 'keywords', 'for', 'replacementWordN']
}

keyword_processor.add_keywords_from_dict(keyword_dictionary)

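Each key in the dictionary is a replacement word, given as a string. Each value must be a list of the keywords to be replaced by that word. Alternatively, you can provide keywords as a list, using the add_keywords_from_list() method.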

The list approach only adds the keywords, without replacement words. Or, we can specify keyword-value pairs in a text file, one pair per line, as below.

keyword1=>value1
keyword2=>value2

We can import the file using the add_keyword_from_file() method, which takes the path to the file.

The dictionary approach is popular because it is the most flexible and the most readable. It is also the most natural match for the algorithm, given the fact that all the keywords ultimately end up in a dictionary.

Let's take a real-world example. Suppose we have a textual document and we want to minimize the usage of synonyms in order to standardize the vocabulary used. In this example, we want to replace all occurrences of the words "grim", "nasty", and "unpleasant" with the word "bad" (the replacement word), and all occurrences of the words "fine", "excellent", and "great" with the word "good".

We need to create a dictionary, keyword_dictionary, that maps each replacement word to its list of keywords.

Example -

keyword_dictionary = {
    "bad": ["grim", "nasty", "unpleasant"],
    "good": ["fine", "excellent", "great"]
}

Then we add this keyword_dictionary to the keyword_processor object using add_keywords_from_dict().

Replace Keywords with Replacement Words

Once we have added the keywords and their respective replacement words to the KeywordProcessor instance, we can use the replace_keywords() function, which scans the provided text and executes the replacement.

The provided text is parsed, all of the keywords within it are replaced with their matched values, and the new string is returned. Rather than working with string literals, we will work with documents. We will open the document in read mode, read its content, and pass the resulting string to the replace_keywords() function.

Example -

with open('data.txt', 'r+') as file:
    # Load the content of `data.txt` into a variable as a string
    content = file.read()
    # Replace all desired keywords and store the result in a new variable
    new_content = keyword_processor.replace_keywords(content)
    # Clear the old content
    file.seek(0)
    file.truncate()
    # Write the altered content back to the original file
    file.write(new_content)

To try this out, add some text containing the keywords to data.txt and run the code above.
