Suffix Array nLogn Algorithm:

A suffix array holds all the suffixes of a string arranged in sorted order. The concept is closely related to the suffix tree, which is a compressed trie of all the text's suffixes.

The suffix array is a fundamental data structure used by many string algorithms. In practice it stores the starting indices of all suffixes of a given string, ordered lexicographically by the suffixes they point to. Linear-time construction algorithms exist, but simple and practical approaches typically run in O(n log n) time, where n is the length of the input text.

Construction of Suffix Array using nlogn Algorithm:

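The brute-force approach sorts the suffixes with a comparison sort and compares them character by character. Each comparison can cost O(n), so the time complexity is O(n^2 log n).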

This approach can be optimized to run in O(n log n) time.

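A well-known construction method is the "Skew Algorithm", also called DC3 (Difference Cover modulo 3). It actually runs in O(n) time, which is within the O(n log n) bound discussed here. The outline below is a simplified description that also borrows the S-type/L-type terminology of induced sorting (SA-IS):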

1. Preprocessing:

  • Create an array of integers from the given string, one per character. Typically, each character is mapped to an integer such as its ASCII code.
  • Append a special sentinel character to the end of the string. The sentinel must be smaller than every other character in the string. You could use '\0' (the null character), for instance, or another character with a very low ASCII value such as '$'.

2. Construct Initial Suffix Array:

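  • Sort a sample of the suffixes (in DC3, those starting at positions i with i mod 3 ≠ 0) by radix sorting on their leading characters, which takes O(n) time. If the resulting ranks are not unique, the algorithm recurses on the shorter string formed by those ranks.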

3. Induced Sorting:

  • Classify each suffix as type S (smaller than the suffix that follows it) or type L (larger). While scanning the partially sorted array, the position of each remaining suffix is derived ("induced") from the suffixes that are already sorted.
  • This step can be completed in O(n) time.

4. Merge Step:

  • To create the final suffix array, merge the two sorted groups of suffixes obtained in the previous steps.

Although implementing the skew algorithm from scratch can be difficult, open-source implementations are available. SuffixArray (C++), SuffixArray.jl (Julia), and pysuffixarray (Python) are a few examples.

The SA-IS (Suffix Array Induced Sorting) approach is another construction algorithm. Like the skew algorithm, it runs in O(n) time. The skew algorithm is often chosen for explanation because its structure is comparatively simple.

Example:

Let's look at an example to better grasp how a suffix array for a given string is created. Take the word "banana" as an example. To mark the end of the string, we append the special character "$" (which is smaller than every other character), giving "banana$". The suffix array for this string is created as follows:

Step 1: Preprocessing

Using their ASCII values (b = 98, a = 97, n = 110, $ = 36), the characters in the string "banana$" can be represented as integers:

  • 98 97 110 97 110 97 36

Step 2: Create the initial suffix array

Using radix sort, order the string's suffixes. The suffixes are listed below, along with their starting positions:

Suffixes:

  • 98 97 110 97 110 97 36 (Starting from index 0)
  • 97 110 97 110 97 36 (Starting from index 1)
  • 110 97 110 97 36 (Starting from index 2)
  • 97 110 97 36 (Starting from index 3)
  • 110 97 36 (Starting from index 4)
  • 97 36 (Starting from index 5)
  • 36 (Starting from index 6)

After sorting, the suffixes are ordered as follows (note that 98 sorts before 110):

  • 36 (Starting from index 6)
  • 97 36 (Starting from index 5)
  • 97 110 97 36 (Starting from index 3)
  • 97 110 97 110 97 36 (Starting from index 1)
  • 98 97 110 97 110 97 36 (Starting from index 0)
  • 110 97 36 (Starting from index 4)
  • 110 97 110 97 36 (Starting from index 2)

Step 3: Induced Sorting

The suffixes are classified by type (S and L) and then placed accordingly. A character is S-type if it is smaller than the character after it and L-type if it is larger. If the two characters are equal, it takes the type of the character after it. The special character '$' is S-type by definition.

36 (at index 6) and 97 (at index 1, 3) are S-type (S) characters.

98 (at index 0), 110 (at index 2, 4), and 97 (at index 5) are L-type (L) characters. The 97 at index 5 is L-type because it is followed by the smaller character 36.

The S-type and L-type suffixes are then placed into their sorted positions. The resulting order is:

Suffixes:

  • 36 (Starting from index 6)
  • 97 36 (Starting from index 5)
  • 97 110 97 36 (Starting from index 3)
  • 97 110 97 110 97 36 (Starting from index 1)
  • 98 97 110 97 110 97 36 (Starting from index 0)
  • 110 97 36 (Starting from index 4)
  • 110 97 110 97 36 (Starting from index 2)

Step 4: Merge Step

In the last phase, the two sorted groups from the previous steps are merged while keeping track of each suffix's starting position. Each line below lists the text positions covered by one suffix.

Merged Suffix:

  • 6 (Starting from index 6)
  • 5 6 (Starting from index 5)
  • 3 4 5 6 (Starting from index 3)
  • 1 2 3 4 5 6 (Starting from index 1)
  • 0 1 2 3 4 5 6 (Starting from index 0)
  • 4 5 6 (Starting from index 4)
  • 2 3 4 5 6 (Starting from index 2)

The final suffix array of the string "banana$" is [6, 5, 3, 1, 0, 4, 2]. These numbers are the starting positions of the sorted suffixes in the original string. Notably, the suffix consisting only of the special character '$', which is also the shortest suffix, comes first in the sorted order.

This is a simple illustration of how a suffix array for a given string is created. In actual use, the technique is effective even for very long strings.

Implementation of nlogn Algorithm:

A complete implementation of SA-IS or the skew algorithm in C is lengthy and outside the scope of this article. A condensed version is given instead. Although it might not be as fast as optimized libraries, it helps in understanding the structure of the construction.

C Code:

Output:

Suffix Array for the string "banana$":
6: $
5: a$
3: ana$
1: anana$
0: banana$
4: na$
2: nana$

Time Complexity: O(nlogn).




