
Thread: Block sorting for LZ compression

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    what do you think about sorting a data block (like BWT does) in order to find matches? this technique looks crazy, but it was successfully used in IMP, and that was a very fast compressor

  2. #2
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    I tried this out once.
    It has some advantages over BWT sorting. You don't need to maintain a full comparison - e.g. if the first 256 bytes of two strings are equal, then the two strings are equal (for BWT this limit is the buffer size). Still, you'd have a guarantee to find every maximum match up to 256 bytes. Additionally, you don't need to use a stable sort.
    The problems arise if you're looking for a best match. In such a sorted table, maximum matches are neighbours. But in general you're not looking for a maximum match, but for the most cost-effective one. Additionally, the sorting order makes it very expensive to find a match of length n with the best index.

    aaaaa - index 10000
    ...
    aaaay - index 1
    aaaaz - index 10001
    baaaa - ...

    There can be hundreds of entries in between aaaaa and aaaaz. All of these have a 4-byte match with aaaaz, but aaaaa has the best index. I know this example isn't really good, but it should make the problem clear.
    There might be many ways to improve this. But IMO a nice search tree or a hash table does this job just fine.
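The trade-off described above can be shown in a minimal Python sketch (an editor's illustration, not code from the thread; the function names are invented): the nearest earlier neighbours in the sorted suffix table always yield the longest match, while the cheapest index for a given match length may sit far away in the table.

```python
def longest_match_via_sort(buf, pos, depth=256):
    # Sort every suffix start by its first `depth` bytes (the bounded
    # comparison mentioned above); quadratic here, fine for a sketch.
    order = sorted(range(len(buf)), key=lambda p: buf[p:p + depth])
    rank = order.index(pos)

    def lcp(a, b):
        # length of the common prefix of the suffixes starting at a and b
        n = 0
        while a + n < len(buf) and b + n < len(buf) and buf[a + n] == buf[b + n]:
            n += 1
        return n

    best = (0, -1)  # (match length, source position)
    for step in (-1, 1):
        r = rank + step
        # skip later positions; the first *earlier* position in each
        # direction already has the maximal match length that direction offers
        while 0 <= r < len(order) and order[r] >= pos:
            r += step
        if 0 <= r < len(order):
            best = max(best, (lcp(order[r], pos), order[r]))
    return best  # longest match, but not necessarily the cheapest index
```

For `buf = b"abcabcabd"` and `pos = 3` this finds the 5-byte match at position 0; picking the cost-optimal offset among all shorter candidates is exactly what the sorted order makes awkward.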

  3. #3
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    I'm not very interested in the illusory and not very useful BWT technique, since it ruins the structure of the data, and with some files it makes the situation worse. I'm very interested instead in the LZP index system outside of BWT. I wanted to know: is there a fast way to discard the worse indices and use a smaller number of indices? (for Christian and Bulat) Hi!!

  4. #4
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    I'm sorry Nania. I do not understand what you're asking for.

  5. #5
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Seems to me that Nania wanted to know if there is a fast way to discard unusable indexes and decrease the number of indexes, or something similar

    EDIT: After some thinking and playing with Google Translator I changed "matches" into "indexes"

  6. #6
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    I apologize again for my English! I wanted to say exactly this...
    but this time with a practical example:
    abaaababcaaaaa
    1......xx1............
    xx = an index that is of no use to us!
    1 = an index for the copy
    In this example of mine I mean that if the position of the second "ab" were memorized, it would produce the smaller copy of equal bytes. I wanted to know if there is some algorithm that does this automatically, or at least in a precise way?

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    christian:
    i have some ideas to make bwt based pattern matching fast and efficient.

    firstly, when you have entries like:
    aaa 5
    aab 7
    aac 90
    aad 45
    ...
    aaz 34

    then you don't need all those entries but only one:
    aa? 5

    because there are no two entries that start with aa and have the same third letter. 5 is the index of the earliest entry starting with aa.

    then if you find that match, you update the index to the next value. in the above case the updated entry would look like:
    aa? 7

    additionally you can add skip indexes.

    assume you have following entries:
    a?
    aa?
    abc?
    abcccdef?
    abd?
    abea?
    abeb?
    ac?
    ad?

    and you search for ad?

    you start at a? . okay. then you go to aa? . you notice that it didn't enlarge your match. so you go to abc? . again no gain. so you go to abcccdef? . hey!! why are we checking entries that start with ab? ??? we can skip directly to ac? !!! so we skip and continue the search. this way we find ad? quickly.

    the skip index points to the first entry after the current one that has the same number of (or fewer) letters in common with the previous entry.

    this way you would never check more than (size of alphabet) * (longest match) entries.

    additionally you could reorganize entries, i.e. move more popular prefixes before less popular ones (of course without breaking the structure).

    this way bwt based pattern matching can be very fast.

    additionally, while walking you can remember the longest matches for different distances (e.g. for 8-bit distance, 9-bit distance, ..., 20-bit distance) and use that for more optimal parsing.

    the techniques i described above make bwt based pattern matching as fast as tree based pattern matching (or even faster due to linear scanning) and it can have an even shorter building time (if you use some fast bwt sorting algorithm like msufsort http://www.michael-maniscalco.com/msufsort.htm ).

    the disadvantages are high memory requirements and the offline nature, i.e. you must fetch the entire block before compression and use a fixed window (a sliding window would be impractically inefficient with bwt based pattern matching).
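donkey7's skip idea can be sketched over a plain sorted list with an LCP array (a hedged editor's illustration, not his actual array layout; function names are invented): once an entry matches the pattern to length m, every following entry that shares more than m characters with it diverges from the pattern at the same position, so the whole run can be skipped.

```python
def lcp_array(entries):
    # lcp[i] = common-prefix length of entries[i-1] and entries[i]
    # (entries must already be sorted)
    lcp = [0] * len(entries)
    for i in range(1, len(entries)):
        a, b = entries[i - 1], entries[i]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[i] = n
    return lcp

def search_with_skips(entries, lcp, pattern):
    best_len, best_i, i = 0, -1, 0
    while i < len(entries):
        s, m = entries[i], 0
        while m < len(s) and m < len(pattern) and s[m] == pattern[m]:
            m += 1
        if m > best_len:
            best_len, best_i = m, i
        i += 1
        # skip every entry sharing more than m characters with the one just
        # checked: it diverges from the pattern at position m the same way
        while i < len(entries) and lcp[i] > m:
            i += 1
    return best_len, best_i
```

Replaying the walkthrough above: searching for "ad" checks a, aa and abc, then hops over abcccdef, abd, abea and abeb in one skip, lands on ac, and finds ad.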

  8. #8
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    Thank you for your interesting post, Donkey!

    I still have some problems with your idea, so I'll just ask. The goal is to find the longest match with the best index? e.g.

    00000*
    aaaaa* index i
    aaaab*
    ...
    aaazy*
    aaazz* <- best index

    Let's say we're at index i. Obviously, we're not allowed to match index i with itself. Thus, we're looking through the neighbourhood for a longest match with the best index (regarding cost). The further we go in one direction (in the neighbourhood of aaaaa), the shorter the matches get (at least they don't get longer, since they're sorted).
    So we start at aaaba (the direct neighbour). Here we have a match of length 3. We know that this match is the longest we can get. But how do we reach aaazz efficiently?

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    hmm, i have little time now. i will explain it later.

    but for now you can think of my method as an array implementation of a prefix tree. the skip index in my array is equivalent to the right sibling in a tree implementation. and the compacting method i described in my first post is equivalent to removing nodes that have no leaves.

    since the array is built once at the start and the structure does not change during compression (only the indexes are updated), you can optimize it once at the start (it should be relatively quick) and gain compression speed while compressing.

    the optimization i'm talking about here is equivalent to sorting the children of a particular node in frequency order, so more popular patterns will be found faster than less popular ones.

    i've discovered one more trick that would save time by not needing to compare the buffer at indexes (i.e. no need to perform computelongestcommonprefix(buffer[index], buffer[current]) ), but comparing only one byte per entry (this byte would be stored in the array). but this will require some special structure of the array (i.e. in a block of entries with a common prefix, pairs of entries should be sorted by the size of the longest common prefix - okay, i'm a bit muddled here but it can work )

  10. #10
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    Quote Originally Posted by donkey7
    hmm, i have little time now. i will explain it later.
    Thank you Donkey! It'd be great if you use my last example.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Nania Francesco Antonio

    i still can't understand your question. as a variant, you can try to write it in Italian and ask someone to translate it

    of course, sorting doesn't allow us to use a sliding window (although we can sort 2 times more data to get this effect)

    afaik, imp used a suffix sort, and at each step we have results analogous to STn sorting, so we can easily use it to find a match of length = n (just use the previous string). on the next sorting step we have results analogous to an ST(n+1) sort, so we can easily find matches of length n+1, and so on. the details were described in fido7.ru.compress 7-8 years ago by Dmitry Subbotin, but i didn't understand them at the time

    so, we see 3 schemes
    1) described by donkey7 (which i don't understand)
    2) imp-like
    3) just make an STn sort where n is the minimal match length we are interested in. then scan through this array searching for matches among the last, say, 64 entries. this is very much like LZ77/ROLZ searching, but cache misses will be much less likely! moreover, we can easily prefetch data that will be scanned a bit later. so, this scheme may completely omit cache miss expenses (except the ones in the sorting phase)! also, it should be easily multithreaded up to hundreds or thousands of threads (btw, do you know that $100 32-core processors start shipping this month?)

    moreover, because the search is each time limited to a very small number of strings, we can use highly complex search speedup schemes that require a large amount of data to be saved (per each node searched), because the total number of searched nodes would anyway be no more than, say, 1000-10k - old data simply must be automatically discarded (shifted out) from these search acceleration data structures

    so, the final result is - we run completely inside the L2 cache (or even L1). we find matches for all strings (i.e. 5-10 times more than really required), so we can use OP in its simple form - using the longest matches (not the cheapest ones)

    this method should be especially great for a huge number of cores, slow memory access, and binary files (because for these files the percentage of strings actually used is bigger than for text files)

    4) after ST4 sorting we scan the data in the usual LZ manner, but use the STn-sorted data as a very fast way to find strings similar to the current one. i.e. if Ptr is the current pointer into the input buffer, then ST_Index[Ptr] gives us some ST_Ptr such that ST_Strings[ST_Ptr]==Ptr, and ST_Strings[ST_Ptr-1], ST_Strings[ST_Ptr-2] and so on give us pointers to previous strings with the same first 4 bytes. not bad? it looks like the current QUAD scheme, but it allows us to scan more than 64 (or any other fixed value) possible matches and gives us only the strings that have the same first 4 bytes - no hash collisions at all!

    i'm not sure, but maybe such a scheme can also be used with BWT sorting. alternatively, one can try to organize a binary tree (like lzma/cabarc do) inside each sub-array of nodes that have the same first 4 bytes. this may allow implementing a rather fast OP with weighting
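Scheme 4 can be sketched roughly like this in Python (an editor's sketch under assumptions: `ST_Strings`/`ST_Index` follow Bulat's naming, and a stable sort on the first 4 bytes stands in for a real ST4 sort):

```python
def build_st4(buf):
    # sort positions by their first 4 bytes; Python's sort is stable, so
    # positions sharing the same 4-byte key stay in buffer order
    st_strings = sorted(range(len(buf)), key=lambda p: buf[p:p + 4])
    st_index = [0] * len(buf)        # inverse: st_index[p] = slot of p
    for r, p in enumerate(st_strings):
        st_index[p] = r
    return st_strings, st_index

def match_candidates(buf, st_strings, st_index, ptr, limit=64):
    # walk backwards from ptr's slot: thanks to the stable sort, every
    # preceding slot with the same 4-byte key is an *earlier* occurrence -
    # a match candidate with no hash collisions
    out, r = [], st_index[ptr] - 1
    while (r >= 0 and len(out) < limit
           and buf[st_strings[r]:st_strings[r] + 4] == buf[ptr:ptr + 4]):
        out.append(st_strings[r])
        r -= 1
    return out  # candidates, nearest in sorted order first
```

For `buf = b"abcdXabcdYabcd"` and `ptr = 10`, the candidates are positions 5 and 0, i.e. exactly the previous occurrences of `abcd` and nothing else.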

  12. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    i have started writing explanations of my methods. they are here: http://asembler.republika.pl/misc/bwt-lz.txt . it's not finished yet and it's not as polished as it could be. i'm studying theoretical computer science at jagiellonian university, so i'm learning new algorithms, and i'm discovering more and more effective methods for pattern matching.

    when i complete the course on algorithms and data structures (it's in the second year) and fully understand fast bwt sorting algorithms (like msufsort or others, look at http://www.michael-maniscalco.com/msufsort.htm ) i will make some ground-breaking matching algo

  13. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    i've updated my document. i will be happy if anyone comments on it (especially christian).

    christian:
    if you're searching for aaazz then you can't go into aaaaa*. the furthest you can go is aaaa*, and you see that the common prefix is shorter than the length of aaaa*, so you go to the next sibling, which is aaab*, and so on till aaaz*. for details look at my document above. (the next sibling has a different last letter)

  14. #14
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    Quote Originally Posted by Christian
    00000*
    aaaaa* index i
    <u>aaaba</u>*
    aaabb*
    ...
    aaazy*
    aaazz* <- best index
    Quote Originally Posted by donkey7
    if you're searching for aaazz then you can't go into aaaaa*
    In my previous post I wasn't searching for aaazz.

    Quote Originally Posted by Christian
    Let's say we're at index i. Obviously, we're not allowed to match index i with itself.
    I was searching for a longest match with the best index for aaaaa. In the example, aaazz qualifies for this because it's a longest match and does have the best (=cheapest) index.

    Quote Originally Posted by donkey7
    i've updated my document. i will be happy if anyone comments on it (especially christian).
    I'm sorry, I do not have enough time to read all of this. But I'm sure some of the other guys can give you some feedback.

    EDIT: Oops, I just saw that there was a tiny mistake in the example. I've underlined the difference in this post.

  15. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    christian:
    if you read my document you will get the picture

    firstly, your example is impossible, i.e. you have:
    00000*
    aaaaa* index i
    aaaba*
    aaabb*
    ...
    aaazy*
    aaazz* <- best index
    but you don't have:
    aaba*
    aabb*
    ..
    aazy*
    aazz*
    aba*
    abb*
    ..
    azy*
    azz*
    and so on.

    but if we convert your suffix-sorted array to my preorder binary tree representation pattern search array (yup, what a long name ), it will look like:
    ordinal  letter  compressed.nodes  right.sibling
    1        a       2                 31
    2        b       0                 3
    ......
    30       z       0                 null
    31       b       ....

    *(30 is just a hand-picked number here)
    so you will basically get the three (i.e. all) cheapest matches for lengths 1, 2 and 3 by checking only one entry

    my pattern matching function finds the cheapest index for every match length and requires constant time per execution (if the maximum match length is constant; precisely, it takes o(k) time where k is the longest match at the current position and the alphabet size is fixed). but you must execute it for every position in the buffer if you want to get the cheapest matches all the time, because it updates the indexes only with the current position.



    my searching scheme *always* (i.e. if searching & updating is performed at every position in the buffer) guarantees finding the cheapest indexes for every match of size up to the longest match, in constant time. it will give the same results as executing a naive pattern search at every position in the buffer (which would have quadratic time complexity).
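As a rough stand-in for this update-as-you-go guarantee (an editor's sketch: a Python dict replaces the compacted prefix-tree array, and the names and `max_len` cap are assumptions), every lookup returns the most recent previous occurrence for each match length, and the table is then updated with the current position:

```python
def cheapest_matches(buf, pos, table, max_len=8):
    # table maps a prefix (bytes) to its most recent occurrence; a dict
    # stands in for the compacted prefix-tree array described above
    out = {}
    for m in range(1, max_len + 1):
        key = bytes(buf[pos:pos + m])
        if len(key) < m:          # ran off the end of the buffer
            break
        if key in table:
            out[m] = table[key]   # cheapest (nearest) previous index
        table[key] = pos          # update with the current position
    return out                    # {match length: previous index}
```

As the post says, this must be executed at every position in the buffer so the stored indexes stay current; calling it only occasionally would return stale (more expensive) indexes.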

    with binary trees you can't achieve such a property unless the binary tree is several times larger than the buffer (up to average_longest_match_size times larger), because you can't dynamically compress nodes the way i do statically when creating my repetition search array.

    memory requirements for my repetition search array are 15 N:
    - 4 N for the last index (= cheapest index),
    - 4 N for indexes (= position in buffer),
    - 4 N for right siblings,
    - 2 N for compressed nodes & letters,
    - 2 N for commons & suffix lengths,
    (the two above can be represented as a union, taking only 2 N in total, because they are independent)
    - 1 N for the input buffer,
    plus some fraction of N for the sorting algorithm (or that overhead can be held in another array, e.g. last.index, because it's not needed at the sorting stage).

  16. #16
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    Quote Originally Posted by donkey7
    firstly, your example is impossible, i.e. you have...but don't have:...
    I thought it was obvious that I left out those uninteresting strings.

    Quote Originally Posted by donkey7
    if you read my document you will get the picture
    I do trust you that it works, but I hope you understand that I don't have time to do this - it's not really a short read.
