KMP Algorithm in C++

In this tutorial, we will study the KMP algorithm in C++ with code implementation. Other algorithms used for pattern matching include the Naive Algorithm and the Rabin Karp Algorithm. If we compare the algorithms, the Naive method and Rabin Karp have a time complexity of O((n-m)*m); however, Rabin Karp generally outperforms the Naive method. KMP algorithms, on the other hand, operate with O(n) time complexity and O(m) additional space. The pattern string and text string lengths, respectively, are m and n in this example.

The naïve approach for pattern matching, where n is the text length and m is the pattern length, runs in O(n.m) time, as shown. This is due to the algorithm's lack of memory for the previously matched characters. In essence, it repeatedly matches a character with a different pattern character.

The ingenious use of the prior comparison data is how the KMP method (or Knuth, Morris, and Pratt string searching method) operates. Because it never compares a text symbol that has already matched a pattern symbol again, it can find patterns in O(n) time. To examine the pattern structure, it uses a partial match table. It takes O(m) time to build a partial match table. As a result, the KMP algorithm's overall time complexity is O(m + n).

Write a function search(char pat[], char txt[]) that outputs all instances of pat[] in txt[] given a text txt[0... N-1] and a pattern pat[0... M-1]. N > M is a reasonable assumption.

Examples:

Enter "THIS IS A TEST TEXT" and "TEST" as the inputs.
Pattern discovered at index 10
Enter this value for txt[]: "AABAACAADAABAABA"
pat[] equals "AABA."

Output:

A pattern at index 0 was detected, followed by patterns at index 9 and index 12.

An important issue in computer science is pattern searching. Pattern-searching methods display the search results when we perform a string search in a database, browser, or notepad/word file.

In the last post, we spoke about the Naive pattern-searching algorithm. The Naive algorithm's worst-case complexity is O(m(n-m+1)). The worst-case time complexity of the KMP algorithm is O(n+m).

Pattern-Searching using KMP (Knuth Morris Pratt)

When several matching characters are followed by a mismatching character, the Naive pattern-searching algorithm struggles to find patterns.

For instance,

txt[] = "AAAAAAAAAAAAAB," and 2) pat[] = "AAAAB."
pat[] = "ABABAC", txt[] = "ABABABCABABABCABABABC" (Not the worst-case scenario, but a poor one for Naive)

The degenerating property of the pattern, which occurs when the same sub-patterns appear more than once in the pattern, is used by the KMP matching algorithm to reduce the worst-case complexity to O(n+m).

The fundamental tenet of KMP's method is that we already know some of the letters in the text of the following window if we see a discrepancy (after a few matches). Using this knowledge, we can avoid matching characters we know will already match.

Comparison Overview of the text "AAAAABAAABA."

Pat is "AAAA"

We contrast the first text window with the pat

pat = "AAAA" and txt = "AAAAABAAABA" [Starting place]

We locate a match. Similar to Naive String Matching, this.

Next, we compare the next window of the text file with the pat.

the text "AAAAABAAABA"

pat equals "AAAA" [One positional shift in the pattern]

KMP performs optimization over Naive in this situation. To determine if the current text window fits the pattern, we merely check the fourth A of the pattern with the fourth letter in the second window. We omitted matching the first three characters since we knew they would match.

Preprocessing Required?

As mentioned earlier, the explanation raises a crucial query: how can I determine how many characters should be skipped? We preprocess the pattern to determine this and create an integer array called lps[] that contains the number of characters that will be skipped.

Overview Of Preprocessing:

To skip characters during matching, the KMP algorithm preprocesses the pattern and creates an auxiliary lps of size m (the same as the pattern size).

The longest appropriate prefix, also a suffix, is denoted by the name lps. An appropriate prefix contains an entire string that is forbidden. Prefixes for "ABC" include "", "A", "AB," and "ABC," for instance. The appropriate prefixes are ", "A", and "AB". The string has the suffixes "", "C", "BC", and "ABC".

We look for subpatterns that include lps. We concentrate more explicitly on sub-strings of patterns that function as prefixes and suffixes.
The length of the greatest matched proper prefix, which is also a suffix of the sub-pattern pat[0..i], is stored in lps[i] for each sub-pattern pat[0..i] where i = 0 to m-1.

The longest valid prefix of pat[0..i] and a suffix of pat[0..i] is lps[i].

It should be noted that lps[i] is also the longest prefix that is a legitimate suffix. We must use it appropriately in one location to ensure that the entire substring is not considered.

lps[] construction examples:

The lps[] value for the pattern "AAAA" is [0, 1, 2, 3].

The lps[] value for the pattern "ABCDE" is [0, 0, 0, 0, 0].

Lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5] for the pattern "AABAACAABAA".

The value of lps[] for the pattern "AAACAAAAAC" is [0, 1, 2, 0, 1, 2, 3, 3, 4, 5].

The lps[] value for the pattern "AAABAAA" is [0, 1, 2, 0, 1, 2, 3].

Preprocessing Algorithm:

During this phase, we calculate values in lps[]. To achieve that, we maintain track of the longest prefix suffix value length for the preceding index using the len variable.

We set lps[0] and len to zero.
If pat[len] and pat[i] are equal, len is increased by 1, and the resulting value is sent to lps[i].
If len is not 0 and pat[i] and pat[len] are not equal, we update len to lps[len-1].
For further information, see computeLPSArray() in the code below.

Preprocessing (or lps[] Creation) Example:

len = 0, i = 0, pat[] = "AAACAAAA":

Since lps[0] is always 0 and i = 1 follows len = 0, i = 1:
Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
If len = 1 and lps[1] = 1, then len = 1 and i = 2 is obtained:
Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
If len = 2 and lps[2] = 2 and i = 3, then len = 2 and i = 3:
Since len > 0 and pat[len] and pat[i] are incompatible,
Make len equal to lps[len-1] = lps[1] = 1.
- len = 1, i = 3:
Since len > 0 and pat[len] and pat[i] do not match, len = lps[len-1] = lps[0] = 0 and therefore len = 0, i = 3:
Given that len = 0 and that pat[len] and pat[i] are not equal, set lps[3] = 0 and i = 4 to obtain len = 0, i = 4:
Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
If len = 1 and lps[4] = 1, then i = 5 and len = 1 as well:
Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
Set len to 2, lps[5] to 2, and i to 6.
If len = 2 and i = 6, perform the following steps: perform len++, save the result in lps[i], and perform i++.
Len = 3, lps[6] = 3, and i = 7 such that Len = 3, i = 7
Since len > 0 and pat[len] and pat[i] are not equal,
If len = lps[len-1] = lps[2] = 2, then i = 7 and len = 2:
Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
Len = 3, lps [7] = 3, and i = 8
We end here because the entire LP has been built.

Implementation Of KMP Algorithm:

Rather than sliding the pattern by one and comparing all characters at each shift, like in the Naive method, we utilize a value from lps[] to determine which characters must be matched next. The goal is to avoid matching a character that will, in all likelihood, match.

How can I use lps[] to choose the next locations (or determine how many characters need to be skipped)?

We begin comparing the pat[j] characters from the currently active text window, where j = 0.
While pat[j] and txt[i] continue to match, we continue to match the characters txt[i] and pat[j] and increment i and j.
Characters pat[0..j-1] match characters txt[i-j...i-1] when we notice a mismatch. (Remember that j increments its value only when there is a match; otherwise, it starts at 0).
From the definition above, we also know that lps[j-1] is the number of letters in pat[0...j-1] that are both a valid prefix and a proper suffix.
We may infer from the two considerations mentioned earlier that we do not need to match these lps[j-1] characters with txt[i-j...i-1] since we know they will already match. Let's use the example mentioned earlier to grasp this further.

The Method as Mentioned Earlier Is Shown In The Following Way:

Consider txt = "AAAAABAAABA", pat = "AAAA"

If we use the above LPS building process, then lps[] = 0, 1, 2, 3 -> i = 0, j = 0: txt[i] and pat[j] match, do i++, j++ -> i = 1, j = 1: txt[i] and pat[j] match, do i++, j++ -> i = 3, j = 3: txt[i] and pat j = lps[j-1] = lps[3] = 3 since j = M, print pattern was discovered, and j was reset.

We do not match the first three characters of this window here, in contrast to the Naive algorithm. In the previous phase, the value of lps[j-1] provided us with the index of the next character to match.

i = 4, j = 3: do i++, j++ when txt[i] and pat[j] match.
i = 5, j = 4: Since j == M, the print pattern is detected and reset to j, which is then equal to lps[j-1] + lps[3] = 3.
We do not match the first three characters of this window, once again in contrast to the Naive method. In the previous phase, the value of lps[j-1] provided us with the index of the next character to match.
- i = 5, j = 3: txt[i] and pat[j] are not equal, and j > 0; just j has to be changed. j = lps[j-1] = lps[2] = 2.
- j = lps[j-1] = lps[1] = 1 when i = 5, j = 2, and txt[i] and pat[j] do not match.
- j = lps[j-1] = lps[0] = 0 when i = 5 and j = 1 when txt[i] and pat[j] do not match.
- i = 5, j = 0: Since j is 0 and txt[i] and pat[j] do not match, we do i++.
- If i = 6 and j = 0, then i++ and j++ must be performed.
- If i = 7 and j = 1, then i++ and j++ must be performed.

This process is continued until there are enough characters in the text to compare to those in the pattern.

The application of the strategy, as mentioned earlier, is seen below:

Implementation of KMP Pattern Searching in C++

// C++ program for implementation of KMP pattern searching
// algorithm

#include <bits/stdc++.h>

void computeLPSArray(char* pat, int M, int* lps);

// Prints occurrences of txt[] in pat[]
void KMPSearch(char* pat, char* txt)
{
	int M = strlen(pat);
	int N = strlen(txt);

	// create lps[] that will hold the longest prefix suffix
	// values for pattern
	int lps[M];

	// Preprocess the pattern (calculate lps[] array)
	computeLPSArray(pat, M, lps);

	int i = 0; // index for txt[]
	int j = 0; // index for pat[]
	while ((N - i) >= (M - j)) {
		if (pat[j] == txt[i]) {
			j++;
			i++;
		}

		if (j == M) {
			printf("Found pattern at index %d ", i - j);
			j = lps[j - 1];
		}

		// mismatch after j matches
		else if (i < N && pat[j] != txt[i]) {
			// Do not match lps[0..lps[j-1]] characters,
			//They will match anyway
			if (j != 0)
				j = lps[j - 1];
			else
				i = i + 1;
		}
	}
}

// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
	// length of the previous longest prefix suffix
	int len = 0;

	lps[0] = 0; // lps[0] is always 0

	// the loop calculates lps[i] for i = 1 to M-1
	int i = 1;
	while (i < M) {
		if (pat[i] == pat[len]) {
			len++;
			lps[i] = len;
			i++;
		}
		else // (pat[i] != pat[len])
		{
			// This is tricky. Consider the example.
			// AAACAAAA and i = 7. The idea is similar
			// to search step.
			if (len != 0) {
				len = lps[len - 1];

				// Also, note that we do not increment
				// i here
			}
			else // if (len == 0)
			{
				lps[i] = 0;
				i++;
			}
		}
	}
}

// Driver code
int main()
{
	char txt[] = "ABABDABACDABABCABAB";
	char pat[] = "ABABCABAB";
	KMPSearch(pat, txt);
	return 0;
}

Output

Found pattern at index 10

Explanation

In the above code, we used the KMP algorithm in the C++ programming language. KMP algorithm is used for pattern matching. Hence, we found the pattern at index 10 from the given input, as shown in the output.

Time Complexity:

O(N+M), where N is the text length, and M is the pattern length that has to be discovered.

Auxiliary Space:

O(M)

Next TopicAbstract data types in C++

← prev next →