Given a string
s, find the longest double suffix in time complexity
Example: for string
banana, the LDS is
Obviously I thought about using a suffix tree, but I'm having trouble to find double suffix in it.
Reverse the string and build sparse array
i is from
j is from
n is the length of the string.
P[i][j] refers to the rank of the suffix starting from position
j and length
2^i. So if
P[i][j]=P[i][k], the first
2^i chars of the suffixes at indexes
k are equal.
Now your problem reduces to finding a Longest Common Prefix for
0(start of the reversed string) and another suffix at index
i, such that
LCP >= i.
Where LCP can be computed by simply using
P array in
log(n) time, by comparing first
2^x chars of these two suffixes and gradually reducing
Total complexity is
Here is the working C++ source code: https://ideone.com/aJCAYG
I think that Gene's solution is the simpler to implement and since it does not rely on an arborescent structures but on arrays, it is likely more hardware friendly as well.
But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your
$ - ## b a a - $ - ## // Longest double suffix: P is the first dash, N the second b a a $ - ## // N' is the dash a - $ - ## a - $ - ## b a a $ - ## b a a - $ - ## b a a $ - ##
When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.
How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:
$above) of the string.
baa$in your example). If we walk down the tree from P, using suffix, we end up in another suffix node N' (walking down the tree won't be actually needed)
baain our case).
Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.
Note that we can stop our search after having inspected the suffix of length |S|/2 and only iterate over suffixes of odd length (do not forget we add an end token to the string)
Building the suffix tree is
Let N' be a suffix node and
N be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:
O(1), in particular, this applies to P the parent node of N and N'
Since we are iterating over the suffix in decreasing order of length, we necessarily focus on the
N' suffixes (the doubled suffix, ie
baabaa$ in your example), so we just have to:
|N'| = 2.|N| - 1:
Proof: (We ignore the end token in the following proof)
The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.
Since we will do this step for at most
(|S|/2 + 1)/2 suffixes, the identification step is therefore
O(|S|) in the worst case.
The overall complexity is thus