Results 1 to 10 of 10

Thread: sorting rotations of a string - detecting an never-ending comparision

  1. #1
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts

    sorting rotations of a string - detecting an never-ending comparision

    Hello,
    I would like to sort the rotations of a string:
    encode.ru
    ncode.rue
    code.ruen
    ode.ruenc
    de.ruenco
    e.ruencod
    .ruencode
    ruencode.
    uencode.r
    .ruencode
    code.ruen
    de.ruenco
    e.ruencod
    encode.ru
    ncode.rue
    ode.ruenc
    ruencode.
    uencode.r
    With some strings the sorting is not so easy because a comparision between 2 rotations never ends:
    codecode
    odecodec
    decodeco
    ecodecod
    codecode
    odecodec
    decodeco
    ecodecod
    codecode
    codecode
    decodeco
    decodeco
    ecodecod
    ecodecod
    odecodec
    odecodec
    As you can see there are rotations which don't distinguish from each other. So if the sorting algorithm compares 2 such rotations to find out which rotation to place above and which below it will compare character for character untill one rotation reaches the end of the input string and then continues at the beginning and goes on like this forever.

    The simplest way to fix the problem is to count the amount of compared characters or to check wether we have reached the character again from which we have started. Both ways need 1 comparision extra for each character we compare to distinguish the rotations. I am already using 4 characters each comparision to make better use of a 32 bit processor but there is still the annoying extra comparision.

    An other way is to check first wether the input string is one of the problematic ones and if so, then execute special sorting code which has the extra comparison or reduces the input string, sort it and then expand the sorted result afterwards.

    Any other suggestions?
    Last edited by just a worm; 30th May 2015 at 20:46.

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    An other way is to check first wether the input string is one of the problematic ones and if so, then execute special sorting code which has the extra comparison or reduces the input string, sort it and then expand the sorted result afterwards.
    Finding whether the input is a cyclic repetition of some string would be the best approach IMHO. It can be done in linear time, look for KMP table-building algorithm or complete answer here: http://stackoverflow.com/questions/2...-in-bit-string. It should be superfast compared to the sorting itself.

    If you find out that the input is cyclic, you can sort only one cycle and then expand the result in linear time.

    Eg if there are N cycles each C elements long and if the result of sorting a single cycle is (XY is an starting index of rotation):
    X1, X2,...,XC
    then the final result should be:
    X1,X1+C,X1+2*C,...,X1+(N-1)*C,X2,X2+C,....,XC+(N-1)*C
    Last edited by Piotr Tarsa; 30th May 2015 at 21:26.

  3. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    just a worm (30th May 2015)

  4. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The check for string length should be branch predicted, so no big impact. Did you benchmark? In any case, libdivsufsort should be faster.

  5. #4
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    This problem is usually solved (BWT, suffix sorting) by inserting a unique character at the end of the string to avoid your situation. The unique character may be '$' for a string or, if all 256 symbols are present, all symbols are increased by 1 and 0 is used as End-Of-String marker.

  6. #5
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    Quote Originally Posted by Matt Mahoney
    The check for string length should be branch predicted, so no big impact.
    Well depending on how it's implemented it still costs 1 or 2 machine cycles per comparision.

    Quote Originally Posted by Matt Mahoney
    Did you benchmark?
    Nope. I haven't implemented anything yet. But so far Piotr Tarsas suggestion seems to be the fastest. Maybe there is something faster than the KMP algorithm but it's the fastest concept I know so far.

    Quote Originally Posted by hexagone
    This problem is usually solved (BWT, suffix sorting) by inserting a unique character at the end of the string to avoid your situation.
    My function is supposed to sort the full rotations. Support to sort prefixes or suffixes will come later.

    But since we are already talking about it, I would like to ask how this increment works. Let's say I use 1 byte per character and all 256 symbols can be present. If I increment all symbols by 1 then I would have a wrap-around when increasing the value 255. But how can I correctly sort the pre-/suffixes? Do I need 9 bits per character?

    these strings
    \5\255\2...
    \5\255\10...
    \5\255\3...
    should sort to:
    \5\255\2...
    \5\255\3...
    \5\255\10...
    But if I increase the values with a wrap-around and append \0 then I have these strings:
    \6\0\3...\0...
    \6\0\11...\0...
    \6\0\4...\0...
    Now the function could not distinguish anymore wether \0 is a character or a end-of-string-mark ($).
    Last edited by just a worm; 31st May 2015 at 14:56.

  7. #6
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    "My function is supposed to sort the full rotations."

    Why not use the EOS token method first, then recursively process the ties ? It should be fast.

    "Do I need 9 bits per character?"

    Yes, you do need to increase the number of bits per symbol (say from char to short or int) when all characters are present.




  8. #7
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    I think it can be approximated, i.e. in case of vertical comparison, if only partial, gives same results of same character counts, which are next to each other, as in the first column (after it's sort), but I'd put there escape symbol.

    Or another way could be, i.e. if you ranksort first two columns, you can check characters at boundaries of same characters from previous column, if there are the same, there should be no need for further sorting.
    Last edited by quadro; 1st June 2015 at 09:17.

  9. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The BBB compressor sorts rotations without an end marker for BWT. The steps are:
    1. compare the shorter suffix with same length of the longer suffix.
    2. if equal, then compare the rest of the longer suffix with the beginning of the other string.
    3. if still equal, then compare the rest.

    The comparison is done with std::stable_sort(). Of course this has the problem of being very slow O(n^2 log n) worst case on highly redundant data. Early BWT algorithms preprocessed with LZP or inserting random bits to avoid the problem of comparing long matches. Now we have much faster algorithms like divsufsort which does not have this problem.

    In any case, you can still use 8 bit characters with an implicit end marker. You just have to tell the comparison function how many bytes to compare. If all of the bytes are equal up to the end of the shorter suffix, then the shorter suffix is first.

  10. #9

  11. The Following User Says Thank You to xezz For This Useful Post:

    just a worm (5th June 2015)

  12. #10
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    xezz: it will take me some time to check and figure out how the code works. But thank you, I will look into it.

Similar Threads

  1. DataSmoke: detecting text, multimedia and incompressible data
    By Bulat Ziganshin in forum Data Compression
    Replies: 33
    Last Post: 8th March 2014, 00:33
  2. Archon sorting algorithm
    By just a worm in forum Data Compression
    Replies: 2
    Last Post: 21st August 2013, 13:50
  3. Directory hash as one string
    By FatBit in forum The Off-Topic Lounge
    Replies: 6
    Last Post: 17th January 2012, 00:29
  4. Detecting non-immediate data correlations
    By GerryB in forum Data Compression
    Replies: 14
    Last Post: 5th December 2010, 10:30
  5. Forum's Notifications ending up in spam/junk folders
    By Scientist in forum The Off-Topic Lounge
    Replies: 3
    Last Post: 11th December 2009, 02:15

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •