Thread: sorting rotations of a string - detecting an never-ending comparision

1. sorting rotations of a string - detecting an never-ending comparision

Hello,
I would like to sort the rotations of a string:
encode.ru
ncode.rue
code.ruen
ode.ruenc
de.ruenco
e.ruencod
.ruencode
ruencode.
uencode.r
.ruencode
code.ruen
de.ruenco
e.ruencod
encode.ru
ncode.rue
ode.ruenc
ruencode.
uencode.r
With some strings the sorting is not so easy because a comparision between 2 rotations never ends:
codecode
odecodec
decodeco
ecodecod
codecode
odecodec
decodeco
ecodecod
codecode
codecode
decodeco
decodeco
ecodecod
ecodecod
odecodec
odecodec
As you can see there are rotations which don't distinguish from each other. So if the sorting algorithm compares 2 such rotations to find out which rotation to place above and which below it will compare character for character untill one rotation reaches the end of the input string and then continues at the beginning and goes on like this forever.

The simplest way to fix the problem is to count the amount of compared characters or to check wether we have reached the character again from which we have started. Both ways need 1 comparision extra for each character we compare to distinguish the rotations. I am already using 4 characters each comparision to make better use of a 32 bit processor but there is still the annoying extra comparision.

An other way is to check first wether the input string is one of the problematic ones and if so, then execute special sorting code which has the extra comparison or reduces the input string, sort it and then expand the sorted result afterwards.

Any other suggestions?

2. An other way is to check first wether the input string is one of the problematic ones and if so, then execute special sorting code which has the extra comparison or reduces the input string, sort it and then expand the sorted result afterwards.
Finding whether the input is a cyclic repetition of some string would be the best approach IMHO. It can be done in linear time, look for KMP table-building algorithm or complete answer here: http://stackoverflow.com/questions/2...-in-bit-string. It should be superfast compared to the sorting itself.

If you find out that the input is cyclic, you can sort only one cycle and then expand the result in linear time.

Eg if there are N cycles each C elements long and if the result of sorting a single cycle is (XY is an starting index of rotation):
X1, X2,...,XC
then the final result should be:
X1,X1+C,X1+2*C,...,X1+(N-1)*C,X2,X2+C,....,XC+(N-1)*C

3. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

just a worm (30th May 2015)

4. The check for string length should be branch predicted, so no big impact. Did you benchmark? In any case, libdivsufsort should be faster.

5. This problem is usually solved (BWT, suffix sorting) by inserting a unique character at the end of the string to avoid your situation. The unique character may be '\$' for a string or, if all 256 symbols are present, all symbols are increased by 1 and 0 is used as End-Of-String marker.

6. Originally Posted by Matt Mahoney
The check for string length should be branch predicted, so no big impact.
Well depending on how it's implemented it still costs 1 or 2 machine cycles per comparision.

Originally Posted by Matt Mahoney
Did you benchmark?
Nope. I haven't implemented anything yet. But so far Piotr Tarsas suggestion seems to be the fastest. Maybe there is something faster than the KMP algorithm but it's the fastest concept I know so far.

Originally Posted by hexagone
This problem is usually solved (BWT, suffix sorting) by inserting a unique character at the end of the string to avoid your situation.
My function is supposed to sort the full rotations. Support to sort prefixes or suffixes will come later.

But since we are already talking about it, I would like to ask how this increment works. Let's say I use 1 byte per character and all 256 symbols can be present. If I increment all symbols by 1 then I would have a wrap-around when increasing the value 255. But how can I correctly sort the pre-/suffixes? Do I need 9 bits per character?

these strings
\5\255\2...
\5\255\10...
\5\255\3...
should sort to:
\5\255\2...
\5\255\3...
\5\255\10...
But if I increase the values with a wrap-around and append \0 then I have these strings:
\6\0\3...\0...
\6\0\11...\0...
\6\0\4...\0...
Now the function could not distinguish anymore wether \0 is a character or a end-of-string-mark (\$).

7. "My function is supposed to sort the full rotations."

Why not use the EOS token method first, then recursively process the ties ? It should be fast.

"Do I need 9 bits per character?"

Yes, you do need to increase the number of bits per symbol (say from char to short or int) when all characters are present.

8. I think it can be approximated, i.e. in case of vertical comparison, if only partial, gives same results of same character counts, which are next to each other, as in the first column (after it's sort), but I'd put there escape symbol.

Or another way could be, i.e. if you ranksort first two columns, you can check characters at boundaries of same characters from previous column, if there are the same, there should be no need for further sorting.

9. The BBB compressor sorts rotations without an end marker for BWT. The steps are:
1. compare the shorter suffix with same length of the longer suffix.
2. if equal, then compare the rest of the longer suffix with the beginning of the other string.
3. if still equal, then compare the rest.

The comparison is done with std::stable_sort(). Of course this has the problem of being very slow O(n^2 log n) worst case on highly redundant data. Early BWT algorithms preprocessed with LZP or inserting random bits to avoid the problem of comparing long matches. Now we have much faster algorithms like divsufsort which does not have this problem.

In any case, you can still use 8 bit characters with an implicit end marker. You just have to tell the comparison function how many bytes to compare. If all of the bytes are equal up to the end of the shorter suffix, then the shorter suffix is first.

10. The Following User Says Thank You to xezz For This Useful Post:

just a worm (5th June 2015)

11. xezz: it will take me some time to check and figure out how the code works. But thank you, I will look into it.

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•