KMP Algorithm in C++In this tutorial, we will study the KMP algorithm in C++ with code implementation. Other algorithms used for pattern matching include the Naive Algorithm and the Rabin Karp Algorithm. If we compare the algorithms, the Naive method and Rabin Karp have a time complexity of O((n-m)*m); however, Rabin Karp generally outperforms the Naive method. KMP algorithms, on the other hand, operate with O(n) time complexity and O(m) additional space. The pattern string and text string lengths, respectively, are m and n in this example. The naïve approach for pattern matching, where n is the text length and m is the pattern length, runs in O(n.m) time, as shown. This is due to the algorithm's lack of memory for the previously matched characters. In essence, it repeatedly matches a character with a different pattern character. The ingenious use of the prior comparison data is how the KMP method (or Knuth, Morris, and Pratt string searching method) operates. Because it never compares a text symbol that has already matched a pattern symbol again, it can find patterns in O(n) time. To examine the pattern structure, it uses a partial match table. It takes O(m) time to build a partial match table. As a result, the KMP algorithm's overall time complexity is O(m + n). Write a function search(char pat[], char txt[]) that outputs all instances of pat[] in txt[] given a text txt[0... N-1] and a pattern pat[0... M-1]. N > M is a reasonable assumption. Examples:Output: A pattern at index 0 was detected, followed by patterns at index 9 and index 12. An important issue in computer science is pattern searching. Pattern-searching methods display the search results when we perform a string search in a database, browser, or notepad/word file. In the last post, we spoke about the Naive pattern-searching algorithm. The Naive algorithm's worst-case complexity is O(m(n-m+1)). The worst-case time complexity of the KMP algorithm is O(n+m). Pattern-Searching using KMP (Knuth Morris Pratt)When several matching characters are followed by a mismatching character, the Naive pattern-searching algorithm struggles to find patterns. For instance,
The degenerating property of the pattern, which occurs when the same sub-patterns appear more than once in the pattern, is used by the KMP matching algorithm to reduce the worst-case complexity to O(n+m). The fundamental tenet of KMP's method is that we already know some of the letters in the text of the following window if we see a discrepancy (after a few matches). Using this knowledge, we can avoid matching characters we know will already match. Comparison Overview of the text "AAAAABAAABA."Pat is "AAAA" We contrast the first text window with the pat pat = "AAAA" and txt = "AAAAABAAABA" [Starting place] We locate a match. Similar to Naive String Matching, this. Next, we compare the next window of the text file with the pat. the text "AAAAABAAABA" pat equals "AAAA" [One positional shift in the pattern] KMP performs optimization over Naive in this situation. To determine if the current text window fits the pattern, we merely check the fourth A of the pattern with the fourth letter in the second window. We omitted matching the first three characters since we knew they would match. Preprocessing Required?As mentioned earlier, the explanation raises a crucial query: how can I determine how many characters should be skipped? We preprocess the pattern to determine this and create an integer array called lps[] that contains the number of characters that will be skipped. Overview Of Preprocessing:To skip characters during matching, the KMP algorithm preprocesses the pattern and creates an auxiliary lps of size m (the same as the pattern size). The longest appropriate prefix, also a suffix, is denoted by the name lps. An appropriate prefix contains an entire string that is forbidden. Prefixes for "ABC" include "", "A", "AB," and "ABC," for instance. The appropriate prefixes are ", "A", and "AB". The string has the suffixes "", "C", "BC", and "ABC".
The longest valid prefix of pat[0..i] and a suffix of pat[0..i] is lps[i]. It should be noted that lps[i] is also the longest prefix that is a legitimate suffix. We must use it appropriately in one location to ensure that the entire substring is not considered. lps[] construction examples: The lps[] value for the pattern "AAAA" is [0, 1, 2, 3]. The lps[] value for the pattern "ABCDE" is [0, 0, 0, 0, 0]. Lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5] for the pattern "AABAACAABAA". The value of lps[] for the pattern "AAACAAAAAC" is [0, 1, 2, 0, 1, 2, 3, 3, 4, 5]. The lps[] value for the pattern "AAABAAA" is [0, 1, 2, 0, 1, 2, 3]. Preprocessing Algorithm:During this phase, we calculate values in lps[]. To achieve that, we maintain track of the longest prefix suffix value length for the preceding index using the len variable.
Preprocessing (or lps[] Creation) Example:len = 0, i = 0, pat[] = "AAACAAAA":
Implementation Of KMP Algorithm:Rather than sliding the pattern by one and comparing all characters at each shift, like in the Naive method, we utilize a value from lps[] to determine which characters must be matched next. The goal is to avoid matching a character that will, in all likelihood, match. How can I use lps[] to choose the next locations (or determine how many characters need to be skipped)?
The Method as Mentioned Earlier Is Shown In The Following Way:Consider txt = "AAAAABAAABA", pat = "AAAA" If we use the above LPS building process, then lps[] = 0, 1, 2, 3 -> i = 0, j = 0: txt[i] and pat[j] match, do i++, j++ -> i = 1, j = 1: txt[i] and pat[j] match, do i++, j++ -> i = 3, j = 3: txt[i] and pat j = lps[j-1] = lps[3] = 3 since j = M, print pattern was discovered, and j was reset. We do not match the first three characters of this window here, in contrast to the Naive algorithm. In the previous phase, the value of lps[j-1] provided us with the index of the next character to match.
This process is continued until there are enough characters in the text to compare to those in the pattern. The application of the strategy, as mentioned earlier, is seen below: Implementation of KMP Pattern Searching in C++Output Found pattern at index 10 Explanation In the above code, we used the KMP algorithm in the C++ programming language. KMP algorithm is used for pattern matching. Hence, we found the pattern at index 10 from the given input, as shown in the output. Time Complexity: O(N+M), where N is the text length, and M is the pattern length that has to be discovered. Auxiliary Space: O(M) Next TopicAbstract data types in C++ |