Results 1 to 19 of 19

Thread: An Elegant Algorithm for the Construction of Suffix Arrays

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    An Elegant Algorithm for the Construction of Suffix Arrays

    A friend pointed me at this:

    http://arxiv.org/abs/1307.1417

  2. The Following 3 Users Say Thank You to willvarfar For This Useful Post:

    Bulat Ziganshin (8th July 2013),Matt Mahoney (9th July 2013),nburns (8th July 2013)

  3. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I've been playing with the sample. I've never seen it be faster than divsufsort, but on some files, it's close, especially larger ones, like enwik8. On some smaller files, it's multiple times slower. There are probably some optimizations in divsufsort that this does not have.

    It looks promising.

    I needed to make some changes to the source code to get it to compile and behave in reasonable ways for testing. I attached my version, in case it helps anyone. There's a Mac OSX binary included.
    Attached Files Attached Files
    Last edited by nburns; 8th July 2013 at 23:24.

  4. The Following 3 Users Say Thank You to nburns For This Useful Post:

    Bulat Ziganshin (8th July 2013),Matt Mahoney (9th July 2013),willvarfar (9th July 2013)

  5. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    afaiu, divsufsort cannot multithread. if this one algo can, it would be a great breakthrough

  6. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I tried compiling both the original and modified code in Windows using make and g++. In both cases I get a RadixSA.exe which crashes when run. Haven't tried to debug it further.

    Anyway, the paper says the algorithm can be run in parallel. But the "simple" version is IMHO worthless. It is to radix sort on the first l characters into buckets, then use another sorting algorithm on the buckets. That would work fine on random data (as they prove) but on real, highly compressible data the bucket distribution would be highly uneven. For example, on a file of all zero bytes, all of the suffixes would be in one bucket.

  7. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I tried compiling both the original and modified code in Windows using make and g++. In both cases I get a RadixSA.exe which crashes when run. Haven't tried to debug it further.
    See if this binary works. I went into radix.h and #undef'd WORD64 and WORD64BUCK. I'm not sure what that will do to performance.
    Attached Files Attached Files

  8. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks. The binary works. I tested it on the artificial files in http://encode.ru/threads/1711-Testin...ll=1#post32943
    All of the 1 MB files ran in less than 0.3 seconds (2.0 GHz T3200). On the file w8002 (2 copies of a 50 MB random string), process time was 45 seconds and task manager shows 1 GB memory used (10x input size).

    Edit: One file gives an error:
    Code:
    radixSA.exe d6016 d6016.sa
    Read 1000000 bytes from file d6016
    Computing Suffix Array...
          2 [main] radixSA 4404 exception::handle: Exception: STATUS_ACCESS_VIOLATION
        719 [main] radixSA 4404 open_stackdumpfile: Dumping stack trace to radixSA.exe.stackdump
    d6016 is a 1 MB file containing the repeating 16 byte sequence (00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0).
    Last edited by Matt Mahoney; 9th July 2013 at 03:28.

  9. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I think the issue with building on Windows has to do with building a 32-bit binary. Cygwin seems to give you a 32-bit compiler by default.

    I've seen random crashes on a few files, so that's probably normal.

    I've noticed that it's faster than divsufsort on DNA and protein, which matches the results in the paper. I haven't seen it beat divsufsort on anything else.

  10. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Someone should create a competition for suffix sorting, along the lines of http://sortbenchmark.org/. The state of the art doesn't seem to have moved much in the past few years. Suffix sorting is inherently useful, and it's an interesting problem.

  11. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    afaiu, divsufsort cannot multithread. if this one algo can, it would be a great breakthrough
    I wonder how efficiently you could merge two suffix arrays into one. If that's efficient, then you could break up the data and use divsufsort or any sort in parallel and then merge.

  12. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    libdivsufsort will multi-thread if you compile with -fopenmp

  13. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Quote Originally Posted by Matt Mahoney View Post
    libdivsufsort will multi-thread if you compile with -fopenmp
    But is the whole algorithm parallelized or only a part of it? I suspect that there's a fairly large sequential part.

  14. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Matt Mahoney View Post
    That would work fine on random data (as they prove) but on real, highly compressible data the bucket distribution would be highly uneven. For example, on a file of all zero bytes, all of the suffixes would be in one bucket.
    i think that for practical purposes, we should run LZP preprocessing before bwt/st, so all-zeroes performance isn't important. what about real files such as enwik? and with multithreading?

  15. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i think that for practical purposes, we should run LZP preprocessing before bwt/st, so all-zeroes performance isn't important. what about real files such as enwik? and with multithreading?
    On enwik8 it's slightly slower than divsufsort for me. There doesn't appear to be support for multithreading in the sample program. I guess the parallel version is meant to be an exercise for the reader.

    Code:
    ~/src$ radixSA enwik8 SA
    Read 100000000 bytes from file enwik8
    Computing Suffix Array...
    RadixSA took [11.62s]
    Writing output to file SA...
    ~/src$ mksary enwik8 SA
    enwik8: 100000000 bytes ... 11.0841 sec
    ~/src$
    Last edited by nburns; 9th July 2013 at 21:27.

  16. #14
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    But is the whole algorithm parallelized or only a part of it? I suspect that there's a fairly large sequential part.
    I believe only the main sort is done in parallel. The second stage is not. MSufSort is fully parallel for both first and second stage but the work isn't complete yet because there aren't enough hours in a day.

  17. The Following User Says Thank You to michael maniscalco For This Useful Post:

    Bulat Ziganshin (9th July 2013)

  18. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by michael maniscalco View Post
    I believe only the main sort is done in parallel. The second stage is not. MSufSort is fully parallel for both first and second stage but the work isn't complete yet because there aren't enough hours in a day.
    I wanted to try MSufSort, but I had a hard time compiling it for Linux/OSX. I do all my development and experimenting in a posix environment. I figure it's sort of the least common denominator.

  19. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    As to this:

    I've been playing with the sample. I've never seen it be faster than divsufsort, but on some files, it's close, especially larger ones, like enwik8. On some smaller files, it's multiple times slower. There are probably some optimizations in divsufsort that this does not have.
    The catch is here:

    Furthermore, a large set of SACAs are collected in the jSuffixArrays library
    [27] under a unified interface. This library contains Java implementations
    of: DivSufSort [28] , QsufSort [25], SAIS [11], skew [9] and DeepShallow
    [24]. We include them in the comparison with the note that these Java algorithms
    may incur a performance penalty compared to their C counterparts.



    MSufSort is fully parallel for both first and second stage but the work isn't complete yet because there aren't enough hours in a day.
    I would then say MSufSort is parallelizable, not parallel :]

  20. #17
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I would then say MSufSort is parallelizable, not parallel :]
    Yes ... an important distinction! (^:

  21. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts

    Post

    The catch is here:


    Furthermore, a large set of SACAs are collected in the jSuffixArrays library
    [27] under a unified interface. This library contains Java implementations
    of: DivSufSort [28] , QsufSort [25], SAIS [11], skew [9] and DeepShallow
    [24]. We include them in the comparison with the note that these Java algorithms
    may incur a performance penalty compared to their C counterparts.


    So, basically, they're comparing C++ to Java. That's a good trick.

  22. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I've skimmed through the paper and looks somewhat ridiculous. Most of the technical talk (ie besides comparisons to others and presenting experimental results) is about radix sort which everyone understands and I think lot of people know that for random inputs radix sort will do very well. What they propose is:
    - radix sort on first log2 n bits,
    - repetitions handling,
    - relaxed prefix doubling (ie not a prefix doubling anymore, but looks similar),
    They also didn't say what should be parallelized. Instead we can read in the paper:
    The for loop traverses
    the suffixes from the last to the first position in the text. This order ensures
    that after each step, suffix i will be found in a singleton bucket.
    So the main for loop is not parallelizable as a whole, but instead we can try to parallelize the small steps within. It seems somewhat doubtful for me that those small steps would parallelize well.

    The devil is in the implementation details, I think. They are talking about optimizations related to processor cache and probably that's where most of the performance comes from.

    So for me there aren't any new useful theory that wasn't present in papers describing the tested SACAs and the algorithm is only interesting if it performs relatively wel - if yes, then we can learn which optimizations bring a healthy gain in practice.

Similar Threads

  1. Fast CRC table construction and rolling CRC hash calculation
    By Bulat Ziganshin in forum Data Compression
    Replies: 10
    Last Post: 27th May 2018, 03:32
  2. Suffix Tree's internal representation
    By Piotr Tarsa in forum Data Compression
    Replies: 4
    Last Post: 18th December 2011, 08:37
  3. Suffix Trees, Suffix Arrays, or What?
    By incblitz in forum Data Compression
    Replies: 47
    Last Post: 12th July 2011, 21:54
  4. Little question about Suffix Sorting techniques
    By Piotr Tarsa in forum Data Compression
    Replies: 0
    Last Post: 23rd May 2011, 20:01
  5. PREDICTOR algorithm
    By encode in forum Forum Archive
    Replies: 30
    Last Post: 16th February 2008, 19:28

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •