
Thread: Suffix arrays

  1. #1
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Suffix arrays

    I want to use suffix arrays to find previous matches in a text.

    The simplest way is to generate all the suffixes, like

    mississippi
    ississippi
    ssissippi
    sissippi
    issippi
    ssippi
    sippi
    ippi
    ppi
    pi
    i

    and then sort them alphabetically.
    This has one disadvantage: we only sort once, but if the structure has to be updated after every character, it will be slow (?)
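
    For illustration, a minimal sketch of this naive approach (C++; only meant for small inputs):

    Code:
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Naive suffix array: sort the suffix start positions by comparing the
    // suffixes character by character. O(n^2 log n) worst case - sketch only.
    std::vector<int> naive_suffix_array(const std::string& s) {
        std::vector<int> sa(s.size());
        for (size_t i = 0; i < s.size(); i++) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return std::lexicographical_compare(s.begin() + a, s.end(),
                                                s.begin() + b, s.end());
        });
        return sa;
    }

    int main() {
        std::string s = "mississippi";
        for (int p : naive_suffix_array(s))
            printf("%2d: %s\n", p, s.c_str() + p);  // suffixes in sorted order
    }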


    I tried another solution:
    an array of 256 characters, each pointing to a linked list (doubly linked if we want to remove the oldest elements); each incoming character adds an element at the head of its character's list (see the sketch below).
    This is good for most cases, but if we have cycles like aaaaa...aaaaaaaa or abababab.....abab,
    there will be a lot of elements for one letter and the search will be slow.
    How can this be solved?
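
    For clarity, a minimal sketch of the structure I mean (positions must be added in increasing order, starting from 0):

    Code:
    #include <cstdint>
    #include <vector>

    // 256 list heads, one per byte value; prev[pos] = previous position with
    // the same leading byte, or -1. In effect a hash chain with a 1-byte "hash".
    struct ByteChains {
        std::vector<int32_t> head = std::vector<int32_t>(256, -1);
        std::vector<int32_t> prev;  // one entry per indexed position

        void add(const uint8_t* buf, int32_t pos) {
            prev.push_back(head[buf[pos]]);  // link the new node to the old head
            head[buf[pos]] = pos;            // the new node becomes the head
        }
        // Search: for (int32_t p = head[c]; p >= 0; p = prev[p]) { ... }
        // On aaaa... every position lands on one chain, hence the slowdown.
    };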

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    An SA is not an efficient representation when it has to be updated on each symbol -
    half of the pointers would have to be shifted and/or incremented on each symbol, and that's just the SA.
    To actually use an SA for matchfinding, it's helpful to build some additional tables,
    like a common prefix length array.

    One option would be to use a blockwise approach - once you have, e.g., 64kb of data, build an SA from it.
    Within a block we can just use hash chains or other common matchfinding structures.
    Blockwise SAs can be linked together with a merge pass and some extra pointers.

    Alternatively, we can build an SA from a 2*winsize block, with overlap (0..2*winsize, 1*winsize..3*winsize, ...)

    And, well, of course the simplest solution is to just build the SA for the whole file at once.
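
    Roughly like this - a sketch of the overlapped-window variant; build_suffix_array() here is a naive stand-in for a real SA constructor like divsufsort:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Naive stand-in for a real SA constructor (divsufsort etc.) - sketch only.
    static std::vector<int> build_suffix_array(const unsigned char* p, size_t n) {
        std::vector<int> sa(n);
        for (size_t i = 0; i < n; i++) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return std::lexicographical_compare(p + a, p + n, p + b, p + n);
        });
        return sa;
    }

    // Overlapped windows: each block covers up to 2*W bytes and starts W bytes
    // after the previous one, so every position has at least W bytes of history
    // indexed by the SA of some block.
    std::vector<std::vector<int>> index_with_overlap(const unsigned char* data,
                                                     size_t n, size_t W) {
        std::vector<std::vector<int>> block_sa;
        for (size_t start = 0; start < n; start += W)
            block_sa.push_back(
                build_suffix_array(data + start, std::min(n - start, 2 * W)));
        return block_sa;
    }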

  3. #3
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There are two problems:
    1. Updating
    2. Cycles.

    1. Updating: I can use blocks with overlaps, but not for the first block; the first block must be read byte by byte.
    2. There are still problems with cycles like aaaaa... or abababab......
    How do I use the additional tables like the common prefix length array? The SA plus this additional table?

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > 1. Updating: I can use blocks with overlaps, but not for the first block; the first block must be read byte by byte.

    a) if you want to apply it for LZ77 matchfinding, it's not really a problem - you can sort a block ahead of time,
    and then filter out matches with pos>=cur.

    b) if it's for a symmetric algorithm, you'd have to use some other method for incremental indexing
    of data within a block - the SA can still be used for long-range matchfinding.

    > 2. There are still problems with cycles like aaaaa... or abababab......

    Yes, it's a common problem for BWT compressors.
    Good BWT implementations like divsufsort or msufsort handle it much better than a naive string sort,
    but in the end it might still be better to have a dedup preprocessor to handle very long RLE patterns.
    https://github.com/michaelmaniscalco/gauntlet_corpus

    > How do I use the additional tables like the common prefix length array? The SA plus this additional table?

    Yes. We can use a byte array for common prefix lengths up to 255; it can be built with a single pass over the SA.
    With it we can skip comparing this prefix when scanning for matches.
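
    Something like this - Kasai's single-pass construction, with the values clamped so they fit in a byte array (sketch):

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Kasai's algorithm: one pass over the text positions, using the inverse SA.
    // lcp[r] = common prefix length of the suffixes SA[r-1] and SA[r], clamped
    // to 255; lcp[0] = 0 by convention.
    std::vector<uint8_t> build_lcp8(const uint8_t* s, const std::vector<int>& sa) {
        int n = (int)sa.size();
        std::vector<int> rank(n);
        for (int r = 0; r < n; r++) rank[sa[r]] = r;  // inverse SA
        std::vector<uint8_t> lcp(n, 0);
        int h = 0;
        for (int i = 0; i < n; i++) {
            if (rank[i] > 0) {
                int j = sa[rank[i] - 1];  // preceding suffix in sorted order
                while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
                lcp[rank[i]] = (uint8_t)std::min(h, 255);
                if (h > 0) h--;  // LCP drops by at most 1 as i advances by 1
            } else {
                h = 0;
            }
        }
        return lcp;
    }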

  5. #5
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    A preprocessor would be a good thing - I think it should emit a code >= 256 for a cycle, but how do I detect a cycle that is not aaaaaa... but abcdabcdabcd, or even aabaabaabaab?

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    DCE talks about this "problem of long common prefixes" here: http://mattmahoney.net/dc/dce.html#Section_551

    > I think it should emit a code >= 256 for a cycle

    It's more practical to use escape symbols or an extra stream, since there are no efficient BWT implementations for non-byte alphabets
    and it would be hard to make your own.

    > how do I detect a cycle that is not aaaaaa... but abcdabcdabcd, or even aabaabaabaab?

    I suppose the way LZ usually works - find matches and see whether some of them are overlapped (len > dist).

    Seriously though, don't. BWT has the same problem with any long matches, not just overlapped ones.
    So it makes sense to run a dedup preprocessor, like srep.

    https://github.com/IlyaGrebnov/libbs...ter/libbsc/lzp (LZP preprocessor)
    https://github.com/facebook/zstd/blo...std_ldm.c#L206 (CDC matchfinder)
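
    For what it's worth, once a matchfinder reports (dist, len), the check itself is trivial - a sketch:

    Code:
    #include <cstddef>

    // A match with len >= dist overlaps its own source, so the region is
    // periodic with period dist: aaaa... has dist 1, abab... dist 2,
    // aabaab... dist 3. extend_period() greedily grows such a run.
    // Sketch only - as said above, a dedup/LZP pass is the saner approach.
    size_t extend_period(const unsigned char* buf, size_t n,
                         size_t pos, size_t period) {
        size_t end = pos;
        while (end < n && buf[end] == buf[end - period]) end++;
        return end - pos;  // how far the cycle continues past pos
    }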

  7. #7
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Here is the SA-IS algorithm: https://zork.net/~st/jottings/sais.html

  8. #8
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How can the LCP help?
    I generated the LCP and got:
    65535: i
    1: ippi
    1: issippi
    4: ississippi
    0: mississippi
    0: pi
    1: ppi
    0: sippi
    2: sissippi
    1: ssippi
    3: ssissippi

    That is OK, but my procedure is:
    start range = the whole SA
    new range = lower_bound(letter, pos, range)
    lower_bound(next letter, pos, range)
    each new range is a narrowed version of the old range (see the sketch below).
    If I want to use the LCP, I would have to walk through the whole range rather than binary search - so how can the LCP help?
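
    The search looks roughly like this (a sketch):

    Code:
    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Narrow an SA range one character at a time: after the call for depth d,
    // [lo, hi) holds exactly the suffixes whose first d+1 characters match the
    // pattern so far. lower_bound/upper_bound binary-search on the d-th char.
    std::pair<int,int> narrow(const std::string& s, const std::vector<int>& sa,
                              int lo, int hi, int depth, char c) {
        auto lo_it = std::lower_bound(sa.begin() + lo, sa.begin() + hi, c,
            [&](int suf, char ch) {
                return suf + depth >= (int)s.size() || s[suf + depth] < ch;
            });
        auto hi_it = std::upper_bound(lo_it, sa.begin() + hi, c,
            [&](char ch, int suf) {
                return suf + depth < (int)s.size() && ch < s[suf + depth];
            });
        return { (int)(lo_it - sa.begin()), (int)(hi_it - sa.begin()) };
    }
    // Usage: lo = 0, hi = n; for each pattern char c at depth d:
    //        std::tie(lo, hi) = narrow(s, sa, lo, hi, d, c);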


  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    And what's the point of finding that range?
    LZ normally needs the whole set of matches for a given pos, not just the longest match (though an SA kind of provides that automatically).
    Sometimes coding two short matches is better than one long match - because of the entropy code.

    In any case, my idea was that the current position in the data can be synced to the SA the same way inverse BWT works.
    Then we can scan up and down the SA to estimate the coding cost of matches.
    The SA would give us match offsets, but lengths would have to be estimated by comparing the data at the current pos and at the pos pointed to by the SA element.
    And this can be optimized by having a precalculated array of LCP(SA[i],SA[i+1]).
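
    A sketch of the upward scan (isa = inverse SA; lcp8 = the clamped LCP array from the earlier post; both names assumed):

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Match { int offset; int len; };

    // Scan upward from the current position's SA slot; the match length with
    // each neighbour is the running minimum of the LCP values crossed on the
    // way. The downward scan is symmetric, taking the min of lcp8[j] instead.
    void scan_up(const std::vector<int>& sa, const std::vector<int>& isa,
                 const std::vector<uint8_t>& lcp8, int pos, int max_steps,
                 std::vector<Match>& out) {
        int len = 255;  // running min; 255 means "clamped, recheck in the data"
        for (int j = isa[pos] - 1, steps = 0; j >= 0 && steps < max_steps;
             j--, steps++) {
            len = std::min<int>(len, lcp8[j + 1]);  // LCP(SA[j], SA[j+1])
            if (len == 0) break;                    // no common prefix left
            if (sa[j] < pos)                        // previous occurrences only
                out.push_back({ pos - sa[j], len });
        }
    }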

    bsdiff seems to work more like you suggest: https://github.com/adobe/chromium/bl...create.cc#L165

    How do you want to apply this? LZ and something like m03 would be quite different.

  10. #10
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    In any case, my idea was that the current position in the data can be synced to the SA the same way inverse BWT works.
    What is the relationship between them?
    For example, we have the SA and must match "sis":
    mississippi$sisa
    Can inverse BWT help locate it in the SA quickly?

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Inverse BWT uses the SA to walk through all data positions in their original sequence.

    I meant that we can compute the SA of a data block, then walk through it byte by byte (like inverse BWT)
    and locate matches for each position by scanning up and down from the current SA index.

  12. #12
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How do I merge two SAs? The first is big, the second 2x-16x smaller.
    There is http://web.stanford.edu/class/archiv...04/Small04.pdf
    from page 79, but I don't understand why
    "Comparing any two suffixes requires at most O(1) work because we can use the existing ranking of the suffixes in T1 and T2 to “truncate” long suffixes."
    Removing old items is simple - we just remove the smaller position numbers.

  13. #13
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Is there an algorithm like this: we take the index from the SA whose buffer[index] is smaller,
    and if the characters are equal, we take the one whose next rank is smaller.
    BUT there is a difficult situation: on the boundary between the two SAs we would have to compare the letter, the next letter, and the next rank.
    How do I distinguish this situation?
    In "Step 3: Merge" of the article https://www.cs.helsinki.fi/u/tpkarkk...05-revised.pdf
    there is "i belongs to B1 or i belongs to B2", but this is part of a larger algorithm, not only the merge.
    In the previous paper: "Key idea: We know the relative ordering of the suffixes at positions that are congruent to 1 or 2 mod 3". Why? If a position is divisible by 3 without remainder, is it a special case?

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I don't think it makes sense to implement incremental BWT even by block merging.
    Even if the merge step is O(N), you'd still end up with O(N^2) by repeating it.
    My idea was to link only the SAs of adjacent single blocks (#0 and #1, #1 and #2, ...),
    so that it would turn into linked lists, like the usual hash chains in LZ.

    Of course, the overall matchfinding algorithm remains O(N^2) even with my idea,
    but the cost is delayed from the SA construction step to the matchfinding step,
    and it also makes it possible to stop following the links after a set number of steps
    (like it's done with hash chains).
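
    As a sketch of what I mean - link[] here is a hypothetical per-position pointer produced by joining adjacent block SAs:

    Code:
    #include <cstdint>
    #include <vector>

    // Hypothetical structure: link[pos] = position of a match for pos found
    // via the previous block's SA, or -1. Following link[] hops backwards
    // block by block; max_steps caps the walk like a hash-chain depth limit.
    int walk_links(const std::vector<int32_t>& link, int32_t pos, int max_steps) {
        int steps = 0;
        for (int32_t p = link[pos]; p >= 0 && steps < max_steps; p = link[p]) {
            // examine candidate p here: verify the match, record offset/length
            steps++;
        }
        return steps;  // number of candidates visited
    }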

  15. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    If you want to have an on-line algorithm, then suffix trees are much more useful than suffix arrays. Search for "sliding window suffix tree". You can have a suffix tree implementation with constant amortized time for single-symbol insertion at the front and single-symbol removal at the back of the window. Of course suffix trees have more memory overhead than plain suffix arrays, but if you're going to have multiple overlapping suffix arrays and other auxiliary data, then the memory consumption advantage of suffix arrays will diminish or even completely vanish.

