string - Finding the longest double suffix in linear time

Given a string `s`, find the longest double suffix in time complexity `O(|s|)`.

Example: for string `banana`, the LDS is `na`. For `abaabaa` it's `baa`.

Obviously I thought about using a suffix tree, but I'm having trouble to find double suffix in it.

Reverse the string and build sparse array `P[i][j]`, where `i` is from `0` to `log(n)`, `j` is from `0` to `n-1`, `n` is the length of the string. `P[i][j]` refers to the rank of the suffix starting from position `j` and length `2^i`. So if `P[i][j]=P[i][k]`, the first `2^i` chars of the suffixes at indexes `j` and `k` are equal.

Now your problem reduces to finding a Longest Common Prefix for `0`(start of the reversed string) and another suffix at index `i`, such that `LCP >= i`. Where LCP can be computed by simply using `P` array in `log(n)` time, by comparing first `2^x` chars of these two suffixes and gradually reducing `x`.

Total complexity is `n*log(n)*log(n)`. Here is the working C++ source code: https://ideone.com/aJCAYG

I think that Gene's solution is the simpler to implement and since it does not rely on an arborescent structures but on arrays, it is likely more hardware friendly as well.

But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your `abaabaa` example:

``````\$ - ##
b a a - \$ - ## // Longest double suffix: P is the first dash, N the second
b a a \$ - ## // N' is the dash
a - \$ - ##
a - \$ - ##
b a a \$ - ##
b a a - \$ - ##
b a a \$ - ##
``````

When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.

How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:

• P has a transition to the suffix node N that only contains the end token (`\$` above) of the string.
• Let suffix be the substring represented by the node P with an appended end token (`baa\$` in your example). If we walk down the tree from P, using suffix, we end up in another suffix node N' (walking down the tree won't be actually needed)
• The substring represented by the node P is the double suffix (`baa` in our case).
• We have the equalities |N'| = 2.|P| + 1 and |N| = |P| + 1

Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.

Note that we can stop our search after having inspected the suffix of length |S|/2 and only iterate over suffixes of odd length (do not forget we add an end token to the string)

Complexity analysis

Building the suffix tree is `O(|S|)`.
Let N' be a suffix node and `N` be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:

• The suffixes can be stored in an array/vector in increasing order because the creation of the tree adds them in increasing order of length (at least with the Ukkonen's algorithm).
• Thus accessing the suffix of length k is `O(1)`
• Accessing the substring represented by a node of the tree is `O(1)`, in particular, this applies to P the parent node of N and N'
• Finding out if the transition from P to N only contains the end token (`\$`) is `O(1)`
• Checking if |N'| = 2.|P| + 1 is indeed `O(1)`

Since we are iterating over the suffix in decreasing order of length, we necessarily focus on the `N'` suffixes (the doubled suffix, ie `baabaa\$` in your example), so we just have to:

• Get N the suffix node such that `|N'| = 2.|N| - 1`: `O(1)`
• Get P the parent of the suffix node `N`: `O(1)`
• Check that the transition from P to N contains only the end token `\$`: `O(1)`

Proof: (We ignore the end token in the following proof)

The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.

Since we will do this step for at most `(|S|/2 + 1)/2` suffixes, the identification step is therefore `O(|S|)` in the worst case.

The overall complexity is thus `O(|S|)`.