Suffix Array nLogn Algorithm:
A suffix array lists all the suffixes of a string in lexicographic order, storing each suffix by its starting position. The concept is comparable to the suffix tree, which is a compressed trie of all the text's suffixes. The suffix array is a fundamental data structure used by numerous string algorithms. It can be constructed in O(n log n) time, where n is the length of the input text, and linear-time constructions also exist.

Construction of Suffix Array using nlogn Algorithm:
The brute-force approach sorts the suffixes with a comparison sort and takes O(n^2 log n) time, since each of the O(n log n) comparisons may inspect up to n characters. Optimized approaches reduce this to O(n log n) or better. The "Skew Algorithm", based on the DC3 (Difference Cover modulo 3) technique, is a well-known method for building the suffix array; it runs in O(n) time, which is within the O(n log n) bound. Here is a general description of the algorithm:
1. Preprocessing:
2. Construct Initial Suffix Array:
3. Induced Sorting:
4. Merge Step:
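As a rough illustration of where the Skew/DC3 algorithm starts, the sketch below collects its sample suffixes: those whose start index i satisfies i mod 3 != 0, each represented by its first three characters as integer codes. It covers only this sampling step, not the radix sort, recursion, or merge. The names struct triple and collect_dc3_sample are hypothetical, chosen for this example, and the triples are emitted in index order for readability.

```c
#include <assert.h>
#include <stddef.h>

/* One sampled suffix: its start position and its first three characters
 * as integer codes (0 is used as padding past the end of the text). */
struct triple {
    int pos;
    int code[3];
};

/*
 * Collect the DC3 sample of text[0..n-1]: every start index i with
 * i % 3 != 0. The caller must provide room for at least n entries in out
 * (the sample holds roughly 2n/3 of them).
 * Returns the number of triples written.
 */
int collect_dc3_sample(const char *text, int n, struct triple *out)
{
    assert(text != NULL && out != NULL && n >= 0);

    int count = 0;
    for (int i = 0; i < n; i++) {
        if (i % 3 == 0)
            continue;               /* non-sample position, handled later */
        out[count].pos = i;
        for (int k = 0; k < 3; k++) {
            /* Pad with 0 when the triple runs past the end of the text. */
            out[count].code[k] =
                (i + k < n) ? (unsigned char)text[i + k] : 0;
        }
        count++;
    }
    return count;
}
```

For "banana$" this yields four triples, at positions 1, 2, 4 and 5. The full algorithm would sort these triples, recurse if any two are equal, and then use the resulting ranks to order the remaining positions.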
Although it can be difficult to implement the skew method from scratch, there are open-source tools and implementations that you can use. SuffixArray (C++), SuffixArray.jl (Julia), and pysuffixarray (Python) are a few well-known libraries. SA-IS (Suffix Array Induced Sorting) is another algorithm for building a suffix array; like the skew algorithm it runs in O(n) time, although the skew algorithm is frequently favored for its relative simplicity.

Example:
Let's look at an example to better grasp how a suffix array for a given string is created. Take the string "banana$". To mark the end of the string, we append the special character "$", which is smaller than every other character. The suffix array for this string is built as follows:

Step 1: Preprocessing
Using their ASCII values, the characters in the string "banana$" can be represented as integers:
98 97 110 97 110 97 36

Step 2: Create the initial suffix array
Using radix sort, order the string's suffixes. The suffixes are listed below, along with their initial positions:
Suffixes:
98 97 110 97 110 97 36 (Starting from index 0)
97 110 97 110 97 36 (Starting from index 1)
110 97 110 97 36 (Starting from index 2)
97 110 97 36 (Starting from index 3)
110 97 36 (Starting from index 4)
97 36 (Starting from index 5)
36 (Starting from index 6)
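The suffixes are rearranged after sorting as follows:
36 (Starting from index 6)
97 36 (Starting from index 5)
97 110 97 36 (Starting from index 3)
97 110 97 110 97 36 (Starting from index 1)
98 97 110 97 110 97 36 (Starting from index 0)
110 97 36 (Starting from index 4)
110 97 110 97 36 (Starting from index 2)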
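The sorted order of the suffixes of "banana$" can be reproduced with the brute-force approach, which is a handy reference when checking a faster construction. The sketch below is a minimal version under stated assumptions: the function name build_suffix_array_naive is made up for this example, the text is a NUL-terminated C string, and a file-scope pointer is used because qsort passes no user data to the comparator (so the function is not thread-safe).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text shared with the comparator; qsort has no user-data argument. */
static const char *g_text;

/* Compare two suffixes, identified by their starting positions. */
static int compare_suffixes(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(g_text + i, g_text + j);
}

/*
 * Brute-force construction: fill sa[0..n-1] with the starting positions
 * of the suffixes of text, in lexicographic order.
 * Each comparison may scan up to n characters, giving O(n^2 log n) overall.
 */
void build_suffix_array_naive(const char *text, int n, int *sa)
{
    assert(text != NULL && sa != NULL && n >= 0);

    for (int i = 0; i < n; i++)
        sa[i] = i;                  /* start with positions 0..n-1 */

    g_text = text;
    qsort(sa, (size_t)n, sizeof sa[0], compare_suffixes);
}
```

Called with "banana$" and n = 7, it fills sa with 6, 5, 3, 1, 0, 4, 2.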
Step 3: Induced Sorting
The suffixes are sorted according to their types (S and L). A position is S-type if the suffix starting there is lexicographically smaller than the suffix starting one position later, and L-type if it is larger; when the two leading characters differ, comparing those characters is enough. For the sake of simplicity, the special character '$' is regarded as S-type. (S/L typing is the classification used by SA-IS; DC3 itself samples positions by their index modulo 3.)
36 (at index 6) and 97 (at index 1, 3) are S-type (S) characters.
98 (at index 0), 110 (at index 2, 4), and 97 (at index 5) are L-type (L) characters.
We sort the S-type and L-type suffixes, beginning at the end of the string. The following are the ordered suffixes as a result:
Suffixes:
36 (Starting from index 6)
97 36 (Starting from index 5)
97 110 97 36 (Starting from index 3)
97 110 97 110 97 36 (Starting from index 1)
98 97 110 97 110 97 36 (Starting from index 0)
110 97 36 (Starting from index 4)
110 97 110 97 36 (Starting from index 2)

Step 4: Merge Step
The two sorted arrays resulting from the induced sorting are combined in the last phase while accounting for the suffixes' initial positions.
Merged Suffix:
6 (Starting from index 6)
5 6 (Starting from index 5)
3 4 5 6 (Starting from index 3)
1 2 3 4 5 6 (Starting from index 1)
0 1 2 3 4 5 6 (Starting from index 0)
4 5 6 (Starting from index 4)
2 3 4 5 6 (Starting from index 2)

The final suffix array of the string "banana$" is [6, 5, 3, 1, 0, 4, 2]. These numbers are the starting positions of the sorted suffixes in the original string. Notably, the suffix consisting only of '$' comes first in the sorted order, because '$' is smaller than every other character. This is a simple illustration of how a suffix array for a given string is created. In practice, the technique remains efficient even for very long strings.

Implementation of nlogn Algorithm:
A complete implementation of SA-IS or the Skew algorithm in C is outside the scope of a single response. I can provide you with a condensed version of the algorithm in C, though.
Although this implementation might not be as efficient as optimized libraries, it will help you understand the structure of the algorithm.
C Code:
Output:
Suffix Array for the string "banana$":
6: $
5: a$
3: ana$
1: anana$
0: banana$
4: na$
2: nana$
Time Complexity: O(n log n).
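One way a condensed O(n log n) construction could look is sketched below. It is neither the Skew algorithm nor SA-IS; it uses prefix doubling with counting sort, a simpler technique that sorts by the first 1, 2, 4, ... characters in successive rounds. The function name build_suffix_array and its signature are assumptions made for this example, and the code assumes the text ends with a sentinel such as '$' that is strictly smaller than every other character.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 256

/*
 * Prefix doubling over cyclic shifts of text[0..n-1].
 * Assumes text[n-1] is a unique smallest sentinel (e.g. '$'), so that
 * sorting cyclic shifts gives the same order as sorting suffixes.
 * Fills sa[0..n-1]; returns 0 on success, -1 on allocation failure.
 */
int build_suffix_array(const char *text, int n, int *sa)
{
    assert(text != NULL && sa != NULL && n > 0);

    size_t cnt_size = (size_t)(n > ALPHABET ? n : ALPHABET);
    int *rank = malloc((size_t)n * sizeof *rank);
    int *new_rank = malloc((size_t)n * sizeof *new_rank);
    int *shifted = malloc((size_t)n * sizeof *shifted);
    int *cnt = malloc(cnt_size * sizeof *cnt);
    if (!rank || !new_rank || !shifted || !cnt) {
        free(rank); free(new_rank); free(shifted); free(cnt);
        return -1;
    }

    /* Round 0: counting sort of positions by their first character. */
    memset(cnt, 0, ALPHABET * sizeof *cnt);
    for (int i = 0; i < n; i++)
        cnt[(unsigned char)text[i]]++;
    for (int c = 1; c < ALPHABET; c++)
        cnt[c] += cnt[c - 1];
    for (int i = n - 1; i >= 0; i--)
        sa[--cnt[(unsigned char)text[i]]] = i;

    /* Equal first characters share a rank (equivalence class). */
    int classes = 1;
    rank[sa[0]] = 0;
    for (int i = 1; i < n; i++) {
        if (text[sa[i]] != text[sa[i - 1]])
            classes++;
        rank[sa[i]] = classes - 1;
    }

    /* Each round doubles the sorted prefix length: len -> 2 * len. */
    for (int len = 1; len < n; len *= 2) {
        /* sa is sorted by the first len characters. Moving every start
         * back by len gives positions already ordered by second half. */
        for (int i = 0; i < n; i++) {
            shifted[i] = sa[i] - len;
            if (shifted[i] < 0)
                shifted[i] += n;
        }

        /* Stable counting sort by the rank of the first half. */
        memset(cnt, 0, (size_t)classes * sizeof *cnt);
        for (int i = 0; i < n; i++)
            cnt[rank[shifted[i]]]++;
        for (int c = 1; c < classes; c++)
            cnt[c] += cnt[c - 1];
        for (int i = n - 1; i >= 0; i--)
            sa[--cnt[rank[shifted[i]]]] = shifted[i];

        /* Recompute ranks from the (first half, second half) rank pairs. */
        classes = 1;
        new_rank[sa[0]] = 0;
        for (int i = 1; i < n; i++) {
            int cur = sa[i], prev = sa[i - 1];
            if (rank[cur] != rank[prev] ||
                rank[(cur + len) % n] != rank[(prev + len) % n])
                classes++;
            new_rank[cur] = classes - 1;
        }
        int *swap = rank;
        rank = new_rank;
        new_rank = swap;
    }

    free(rank); free(new_rank); free(shifted); free(cnt);
    return 0;
}
```

Calling build_suffix_array("banana$", 7, sa) fills sa with 6, 5, 3, 1, 0, 4, 2. There are O(log n) rounds and each round does O(n) work, so the total is O(n log n) for a fixed alphabet. Replacing the counting sort with a comparison sort would be shorter but would raise the cost to O(n log^2 n).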