Aho-Corasick Algorithm for Pattern Searching Using Python

Aho-Corasick is a dictionary-matching algorithm. It searches a text for every occurrence of the words in a given keyword set and reports each word together with its location. The algorithm builds on the trie data structure.

The technique is implemented with a tree data structure. Once the trie has been built, it is converted into an automaton, which lets the search run in linear time.

Time Complexity of the Aho-Corasick Algorithm

The Aho-Corasick algorithm searches words in O(X + Y + Z) time, where X is the text length, Y is the total length of all the keywords, and Z is the number of times a keyword occurs in the text.

Problem Statement of the Aho-Corasick Algorithm

Let's assume we have an input text and an array of z words, a[ ]. We need to find the occurrences of these words in the input text.

Let x be the text length, and let y be the number of characters in all the words combined. That is, y = len(a[0]) + len(a[1]) + len(a[2]) + … + len(a[z - 1]), where z is the number of input words.
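To make the three quantities concrete, here is a minimal sketch that computes them. The text and keyword list are hypothetical sample values chosen for illustration; they are not the example input of this article.

```python
# Hypothetical sample input (an assumption, not the article's example).
text = "ahishers"
a = ["he", "she", "hers", "his"]

x = len(text)                    # text length
z = len(a)                       # number of input words
y = sum(len(word) for word in a) # total characters across all words

print(x, y, z)  # 8 12 4
```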

To understand the algorithm, consider an example that takes an input string and a set of words to search for in that string.

Example:

Input:

Output:

The Word "he" is found at index 0 to 1.
The Word "he" is found at index 6 to 7.
The Word "he" is found at index 11 to 12.
The Word "hello" is found at index 0 to 4.
The Word "the" is found at index 5 to 7.
The Word "their" is found at index 7 to 11.
The Word "she" is found at index 10 to 12.
The Word "here" is found at index 11 to 14.

Preprocessing of the algorithm

The preprocessing of the Aho-Corasick algorithm is divided into three stages, each of which builds one function:

  • Go To
  • Output
  • Failure

Go To: This stage builds the trie from the keywords fed into the algorithm. Starting at the root, it follows the edges for the characters of every word in the array a[ ], creating a new state wherever an edge is missing. A 2-D array, gt[ ][ ], represents the goto function: for each state and character, it stores the next state.
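The goto stage can be sketched as follows. This is a minimal illustration, assuming the keywords contain only lowercase letters a to z and using -1 to mark a missing edge; the helper name build_goto and the sample keywords are hypothetical.

```python
ALPHABET = 26  # assumption: keywords use lowercase a-z only


def build_goto(words):
    """Return gt, where gt[state][c] is the next state, or -1 if no edge exists."""
    gt = [[-1] * ALPHABET]  # state 0 is the root
    for word in words:
        state = 0
        for ch in word:
            c = ord(ch) - ord("a")
            if gt[state][c] == -1:
                # No edge for this character yet: create a new state.
                gt.append([-1] * ALPHABET)
                gt[state][c] = len(gt) - 1
            state = gt[state][c]
    return gt


gt = build_goto(["he", "she", "hers", "his"])
print(len(gt))  # 10 states, including the root
```

Shared prefixes reuse existing states, which is why "he", "hers", and "his" need only six states beyond the root between them.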

Output: This stage records which words end at a particular state. The function stores the indexes of all the words that end at the current state. A 1-D array, op[ ], represents the output function: each entry is a bitmap in which bit i is set when word a[i] ends at that state.
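The bitmap encoding of one op[ ] entry might look like the sketch below. The keyword list and the helper words_in_mask are hypothetical names introduced only for illustration.

```python
words = ["he", "she", "hers", "his"]  # hypothetical keyword array a[]


def words_in_mask(mask, words):
    """Decode one op[] entry: return the words whose bit is set."""
    return [w for i, w in enumerate(words) if mask & (1 << i)]


# Suppose a state is reached after reading "she". Both "she" (index 1)
# and its suffix "he" (index 0) end at that state.
mask = 0
mask |= 1 << 1  # mark "she"
mask |= 1 << 0  # mark "he"

print(bin(mask), words_in_mask(mask, words))  # 0b11 ['he', 'she']
```

Storing a bitmap per state keeps the lookup cheap, at the cost of one bit per keyword for every state.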

Failure: This stage defines where the search continues when the current character has no edge in the trie from the current state. Instead of restarting from the root, the automaton falls back to the state that represents the longest proper suffix of the text matched so far that is also a prefix of some keyword. The 1-D array fl[ ] represents the failure function, in which we record this fallback state for each state.
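Putting the three stages together, one possible sketch is shown below. It is not the article's original listing: it assumes lowercase a to z input, reuses the array names gt, op, and fl from the description above, and the function names build_automaton and search as well as the sample input are hypothetical.

```python
from collections import deque

ALPHABET = 26  # assumption: text and keywords use lowercase a-z only


def build_automaton(words):
    gt = [[-1] * ALPHABET]  # goto function; state 0 is the root
    op = [0]                # output function: one bitmap per state
    for i, word in enumerate(words):
        state = 0
        for ch in word:
            c = ord(ch) - ord("a")
            if gt[state][c] == -1:
                gt.append([-1] * ALPHABET)
                op.append(0)
                gt[state][c] = len(gt) - 1
            state = gt[state][c]
        op[state] |= 1 << i  # word i ends at this state

    # Characters with no edge at the root loop back to the root.
    for c in range(ALPHABET):
        if gt[0][c] == -1:
            gt[0][c] = 0

    # Failure function, computed breadth-first from the root.
    fl = [0] * len(gt)
    queue = deque(s for s in gt[0] if s != 0)  # depth-1 states fail to root
    while queue:
        state = queue.popleft()
        for c in range(ALPHABET):
            nxt = gt[state][c]
            if nxt == -1:
                continue
            f = fl[state]
            while gt[f][c] == -1:  # climb failure links until an edge exists
                f = fl[f]
            fl[nxt] = gt[f][c]
            op[nxt] |= op[fl[nxt]]  # inherit words ending at the fallback state
            queue.append(nxt)
    return gt, op, fl


def search(text, words):
    """Return (word, start, end) for every occurrence, in scan order."""
    gt, op, fl = build_automaton(words)
    state, matches = 0, []
    for pos, ch in enumerate(text):
        c = ord(ch) - ord("a")
        while gt[state][c] == -1:
            state = fl[state]
        state = gt[state][c]
        for i, word in enumerate(words):
            if op[state] & (1 << i):
                matches.append((word, pos - len(word) + 1, pos))
    return matches


# Hypothetical sample input, not the article's example.
for word, start, end in search("ahishers", ["he", "she", "hers", "his"]):
    print(f'The Word "{word}" is found at index {start} to {end}.')
```

With this sample input the sketch reports "his" at 1 to 3, "he" at 4 to 5, "she" at 3 to 5, and "hers" at 4 to 7. Growing the tables dynamically is a design choice of this sketch; a fixed maximum number of states would work equally well.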

Implementation of the Aho-Corasick algorithm for Pattern searching in Python

Here is the implementation of the Aho-Corasick Algorithm in Python:

Code:

Output:

The Word "he" is found at index 0 to 1.
The Word "he" is found at index 6 to 7.
The Word "he" is found at index 11 to 12.
The Word "hello" is found at index 0 to 4.
The Word "the" is found at index 5 to 7.
The Word "their" is found at index 7 to 11.
The Word "she" is found at index 10 to 12.
The Word "here" is found at index 11 to 14.

Explanation:

In this implementation, a default dictionary (defaultdict) stores the output, and a maximum number of characters and states is fixed in advance. The three stages of the algorithm preprocess the keywords: loops and a queue build the automaton, which is then used to search the text string for the set of words.
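As a rough illustration of how a default dictionary can collect the results per word, consider the sketch below. The matches list is hypothetical sample data in (word, start, end) form, not output taken from the article's program.

```python
from collections import defaultdict

# Hypothetical matches as (word, start, end), in the order a search finds them.
matches = [("his", 1, 3), ("he", 4, 5), ("she", 3, 5), ("hers", 4, 7)]

# Group the positions by word; missing words default to an empty list.
found = defaultdict(list)
for word, start, end in matches:
    found[word].append((start, end))

# Report the results word by word, in keyword order.
for word in ["he", "she", "hers", "his"]:
    for start, end in found[word]:
        print(f'The Word "{word}" is found at index {start} to {end}.')
```

Grouping by word in this way would explain why the printed results appear keyword by keyword rather than in text order.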