
Thread: Re-pair compression

  1. #1
    Member
    Join Date
    Jun 2010
    Location
    Portugal
    Posts
    22
    Thanks
    1
    Thanked 1 Time in 1 Post

    Re-pair compression

    Hi everybody,
    I'm no compression expert at all, but I desperately need an ultra-fast decompression algorithm that also has a good compression ratio. I'll leave that requirement and its details for another post, or maybe I'll add them to this one at a later stage.
    For now I would like to know whether you have experimented with the "Re-pair" idea from Larsson and Moffat.
    I searched this forum and haven't found anything on the topic, so I decided to put the idea to the compression gurus here.
    http://www.larsson.dogma.net/dcc99.pdf
    https://code.google.com/p/re-pair/
    http://www.cbrc.jp/~rwan/en/restore.html

    best regards,
    Alvaro

  2. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    There has been some discussion of Re-pair here, mostly by me -- I recommend using Google or Bing to search this site (https://www.google.com/search?q=site...pair+algorithm). The discussions haven't been all that substantive or analytic, but at least one other forum regular besides myself has read the paper. Discussion of fast compression usually revolves around LZ77-style algorithms, and there are several mature algorithms/compressors out there (some by forum regulars). I'm not sure how they compare with Re-pair in compression efficiency, but one possible advantage of Re-pair is the absence of a sliding window (which can also be a disadvantage). A few forum regulars have also written experimental compressors that are similar in principle to Re-pair and have fast decompression speeds (e.g. Kennon Conrad's Tree). The problem with those is that, so far, compression is too slow to be useful.

  3. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by nburns View Post
    The problem with those is that, so far, compression is too slow to be useful.
    I tried to overcome that problem in krc by storing the dictionary in a local or remote cloud, so that the next time anybody in an organization (local cloud) or anybody in the world (remote cloud) compresses the same file, compression is as fast as decompression, with the option of a smaller output in the remote file format (without the dictionary) or the same output size in the stand-alone file format (with the dictionary). This cloud support system could be built into Tree or Diqs or any other algorithm where building the dictionary is slower than decompressing. It's also a possible way to earn money, by charging a fee for remote cloud dictionary storage. I thought I had invented and built something useful for at least Tree and Diqs, but have had no response so far. Or are there papers I missed where somebody did the same thing years ago?
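    In rough C, the idea is something like the sketch below (hash_file, cache_get, cache_put and build_dictionary are placeholder names, not functions from krc, Tree or Diqs): key the stored dictionary by a hash of the input, so a second compression of the same file skips the slow dictionary-building step entirely:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint8_t *data; size_t size; } Blob;

    /* placeholders for the cloud-dictionary idea, not real APIs */
    extern uint64_t hash_file(const uint8_t *buf, size_t len);         /* e.g. a strong content hash   */
    extern int      cache_get(uint64_t key, Blob *dict);               /* local or remote cloud lookup */
    extern void     cache_put(uint64_t key, const Blob *dict);
    extern Blob     build_dictionary(const uint8_t *buf, size_t len);  /* the slow step                */

    Blob get_dictionary(const uint8_t *buf, size_t len)
    {
        uint64_t key = hash_file(buf, len);
        Blob dict;
        if (cache_get(key, &dict))           /* hit: compression is ~as fast as decompression */
            return dict;
        dict = build_dictionary(buf, len);   /* miss: pay the full cost once...               */
        cache_put(key, &dict);               /* ...then share the result via the cloud        */
        return dict;
    }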
    Last edited by Sportman; 14th August 2014 at 06:48.

  4. #4
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    I thought I had invented and built something useful for at least Tree, but have had no response so far. Or are there papers I missed where somebody did the same thing years ago?
    It's an interesting idea, but I have been too buried in the details of compression to think about big-picture things. I think you are right that storing the dictionary and reusing it would save a tremendous amount of time on subsequent compressions of the same or similar files. However, I think compression for Tree would still be slower than decompression, because the dictionary would need to be applied recursively in the same order it was created (at least that's what the simple algorithm would do).

    The problem I see with all of these grammar transformation algorithms, in terms of speed, is that they compress recursively, which means many cycles of data analysis and symbol substitution. That limits their usefulness to cases where compression speed is not a concern, which might only be data distribution.
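    For illustration, a deliberately naive version of that loop could look like the sketch below (this is not how Tree or Re-pair are actually implemented, and count_best_pair/replace_pair are placeholder names). Every new dictionary rule costs another scan and rewrite of the whole symbol stream, while decompression only has to expand the rules once:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t Sym;

    /* placeholders for the per-cycle analysis and substitution steps */
    extern size_t count_best_pair(const Sym *s, size_t n, Sym pair[2]); /* returns the pair's frequency */
    extern size_t replace_pair(Sym *s, size_t n, const Sym pair[2], Sym new_sym);

    size_t pair_compress(Sym *s, size_t n, Sym next_free_sym)
    {
        for (;;) {                                 /* one cycle per new dictionary rule */
            Sym pair[2];
            if (count_best_pair(s, n, pair) < 2)   /* nothing left worth replacing */
                return n;
            n = replace_pair(s, n, pair, next_free_sym++);  /* full rewrite, again */
        }
    }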

    I have not experimented with Re-pair, but I did experiment with code I wrote that used a very similar concept. The issue I have with pairing is that it ignores higher-order contexts when making pairing decisions. This can cause some undesirable pairs to be created (compared to higher-order matching algorithms) and can also miss profitable higher-order strings when combining pairs within the string is not profitable.

  5. #5
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    This version implements CBT coding. It gives a bit better compression. Enjoy!
    Attached Files

  6. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    What does CBT stand for? (I mean besides "Cognitive-Behavioral Therapy" and "Cock and Ball Torture", which is what a few minutes googling got me, and I had to stop... I think I might need the former after exposure to the latter...)

  7. The Following User Says Thank You to Paul W. For This Useful Post:

    encode (29th December 2014)

  8. #7
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    CBT means Complete Binary Tree.
    It is implemented in http://homepage3.nifty.com/wpage/junk/m99coder.c.bz2, at lines 134 and 147.
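    If "Complete Binary Tree" here is the usual truncated-binary / phased-in code (an assumption; m99coder.c has not been checked against this), a minimal encoder looks like the sketch below, with put_bits standing in for whatever bit writer the coder uses. Values in [0, n) get either floor(log2 n) or floor(log2 n)+1 bits, with the shorter codes assigned to the smallest values:

    #include <stdint.h>

    extern void put_bits(uint32_t code, int nbits);    /* hypothetical bit writer */

    void cbt_encode(uint32_t v, uint32_t n)            /* encode v, where 0 <= v < n */
    {
        int k = 0;
        while ((2u << k) <= n) k++;                    /* k = floor(log2(n))            */
        uint32_t u = (2u << k) - n;                    /* number of short (k-bit) codes */
        if (v < u)
            put_bits(v, k);                            /* short code                    */
        else
            put_bits(v + u, k + 1);                    /* long code, k+1 bits           */
    }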

  9. The Following User Says Thank You to xezz For This Useful Post:

    encode (29th December 2014)

  10. #8
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Compression is improved for small text files.
    Attached Files

  11. #9
    Member
    Join Date
    Jan 2015
    Location
    Canada
    Posts
    1
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Alvaro View Post
    Hi everybody,
    I'm no compression expert at all, but I desperately need an ultra-fast decompression algorithm that also has a good compression ratio. I'll leave that requirement and its details for another post, or maybe I'll add them to this one at a later stage.
    For now I would like to know whether you have experimented with the "Re-pair" idea from Larsson and Moffat.
    I searched this forum and haven't found anything on the topic, so I decided to put the idea to the compression gurus here.
    http://www.larsson.dogma.net/dcc99.pdf
    https://code.google.com/p/re-pair/
    http://www.cbrc.jp/~rwan/en/restore.html

    best regards,
    Alvaro
    Hi Alvaro,

    You should head over to talkchess and search for Ronald de Man's posts about how he compressed his Syzygy bases.
    IIRC he used a Re-pair style of compression in them. Here are a couple of posts of particular interest:
    http://www.talkchess.com/forum/viewt...=513651#513651
    http://kirill-kryukov.com/chess/disc...p=76955#p76955

    I know it's a bit thin on details, but the impression I got was that Re-pair is first run on the long sequence of bitbase values (he uses 5 different values, plus "don't care", which can be encoded as whichever of the other values is best for compression). After that, a Huffman-like coding is applied to the symbols, and finally they are grouped into 64-byte blocks of compressed data. While doing this grouping, he can count how many positions have been consumed at each block boundary; that allows him to build some sort of index structure for random access (which I guess is stored at the beginning of the file and mmap'd into memory like the rest of it, but I don't remember whether he described the operation of the index at all). Anyway, follow the links from the Syzygy Bases CPW page and you can probably find more info. Or just ask him directly, I suppose.
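    A guess at the kind of index that implies, in rough C (not de Man's actual format): store, for each 64-byte block, the running count of positions covered by all earlier blocks; a lookup is then a binary search in that table plus decoding of a single block:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        const uint8_t  *blocks;    /* compressed data, 64 bytes per block             */
        const uint64_t *cum_pos;   /* cum_pos[b] = positions covered by blocks 0..b-1 */
        size_t          nblocks;
    } BlockIndex;

    /* Find the block containing uncompressed position pos (pos must be in range). */
    size_t find_block(const BlockIndex *ix, uint64_t pos, uint64_t *offset_in_block)
    {
        size_t lo = 0, hi = ix->nblocks;           /* answer lies in [lo, hi) */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (ix->cum_pos[mid] <= pos) lo = mid; else hi = mid;
        }
        *offset_in_block = pos - ix->cum_pos[lo];  /* skip this many positions inside the block */
        return lo;                                 /* then decode ix->blocks + 64 * lo          */
    }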

    -------

    And now an anecdote. I have used Re-pair myself, for compressing the value half (i.e. the strings) of a sorted associative array (int32 -> unicode string). So basically I had a large array of 16-bit chars, with null-char separators after each string. I used re-pair to remap the used literal chars to new symbols 1..K, and then replace the most common pairs with new symbols K+1..M. My application required random-access decompression of the strings, and with no additional storage, so I stored the dictionary as a simple array of M 4-byte entries, where each entry contained two 16-bit values. For non-terminal symbols (i.e. ones that were introduced to replace a pair of other symbols), these two values were the left and right children in the dictionary. For terminal symbols (representing a literal char), one value was zero and the other was the literal char value. Since my strings could only have a null terminator at the end, I never needed to actually encode those, so this worked fine and was easy/fast to decode.
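    In rough C, that layout and the decoder look something like the sketch below (a reconstruction of what is described above, not the actual code; index 0 is assumed to be reserved so that a zero left field can mark a terminal, matching the 1..K / K+1..M numbering):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint16_t left;    /* 0 for a terminal, else index of the left child entry */
        uint16_t right;   /* literal char for a terminal, else right child entry  */
    } Entry;              /* 4 bytes per dictionary entry                         */

    /* Expand one symbol into out[] (assumed large enough); returns chars written. */
    size_t expand(const Entry *dict, uint16_t sym, uint16_t *out)
    {
        Entry e = dict[sym];
        if (e.left == 0) {                             /* terminal: emit the literal char */
            out[0] = e.right;
            return 1;
        }
        size_t n = expand(dict, e.left, out);          /* non-terminal: expand left child */
        return n + expand(dict, e.right, out + n);     /* then the right child            */
    }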

    I then counted the frequency of each dictionary entry's use in the symbol text, and remapped the order of the indices from most- to least-frequently-used in the symbol text (obviously this requires building a new dictionary with each entry moved to its new index, and for non-terminal entries the child indices also need to be translated to the new ordering). In this new ordering, entries referenced only from the dictionary (not from the symbol text) had the highest indices. I assigned unique prefix codes to each entry referenced at least once from the symbol text, using a simple greedy algorithm that groups clusters of symbols with the same code length. I did it this way because it could choose the code lengths extremely fast, and because I could supply a set of acceptable bit lengths for the codes. If I allowed all bit lengths, it would produce something equivalent to a canonical Huffman code. If I allowed just 8/16/24, it produced a byte-based code exactly equivalent to what I had used before in a similar application. But this time, I chose a set of even code lengths that was something like (6, 8, 10, 12, 14, 16, 18, 24). I chose that because I had to encode the length of each string somewhere in order to detect where the next one would start, and making them all even lengths was helpful there. It hurt the compression ratio a bit, but I didn't mind.
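    Continuing the same sketch (again a reconstruction with made-up names, reusing the Entry layout above), the reordering step might look like this: entries 1..M are renumbered from most- to least-used in the symbol text, and the child links inside non-terminal entries are translated to the new numbering, so rarely used entries naturally end up with the highest indices:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { uint16_t left, right; } Entry;   /* as in the previous sketch; left == 0 marks a terminal */

    static const uint32_t *g_freq;   /* g_freq[i] = uses of entry i in the symbol text (qsort needs a global) */

    static int by_freq_desc(const void *a, const void *b)
    {
        uint32_t fa = g_freq[*(const uint16_t *)a], fb = g_freq[*(const uint16_t *)b];
        return (fa < fb) - (fa > fb);
    }

    /* dict[1..m] is rewritten in place; freq[1..m] gives each entry's use count. */
    void remap_dictionary(Entry *dict, uint32_t m, const uint32_t *freq)
    {
        uint16_t *order   = malloc((m + 1) * sizeof *order);    /* old indices, sorted by frequency */
        uint16_t *new_idx = malloc((m + 1) * sizeof *new_idx);  /* old index -> new index           */
        Entry    *out     = malloc((m + 1) * sizeof *out);

        for (uint32_t i = 1; i <= m; i++) order[i - 1] = (uint16_t)i;
        g_freq = freq;
        qsort(order, m, sizeof *order, by_freq_desc);           /* unused entries sink to the end   */

        for (uint32_t n = 0; n < m; n++) new_idx[order[n]] = (uint16_t)(n + 1);

        for (uint32_t n = 0; n < m; n++) {
            Entry e = dict[order[n]];
            if (e.left != 0) {                                  /* non-terminal: translate children */
                e.left  = new_idx[e.left];
                e.right = new_idx[e.right];
            }
            out[n + 1] = e;
        }
        memcpy(dict + 1, out + 1, m * sizeof *dict);
        free(order); free(new_idx); free(out);
    }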

    So for the Re-pair part, I implemented the O(N) compression algorithm given in the paper, and was very happy with the compression speed. Choosing code lengths was very fast and my entropy coder also ran in O(N) time, so the whole thing was fast enough on inputs that were a few MB in size. (I never tried to run it on anything larger than 20 or 30 MB; the Re-pair algorithm needs around 12 bytes per input symbol for good performance, though I guess 10 bytes could be made to work.) After some tuning of my Re-pair code (the priority queue part and the hash table stuff), I was able to compress about 3 MB of input data into 500-600 KB in around half a second on one core of an i7. I remember compressing the same data with WinRar or WinZip and getting something in the 375-425 KB range, which is not surprising because (1) they were allowed to compress across string boundaries and (2) my data contained around 125-150 KB of symbol dictionary, in a not-very-compressed format. Anyway, it didn't have the best possible compression ratio, but it met my main requirements, which were to load the compressed binary blob into memory once at startup and then randomly decompress specific strings at runtime using little or no additional storage. Decompression was simple and fast enough. My indexing structures giving the start and end of each string (and associating it with its 32-bit integer key) were the most complicated part, and decoding those cost a handful of microseconds. Then decoding the string itself would take anywhere from 10 to 100 microseconds, depending on how long it was. It involved a lot of L1 cache misses, because the dictionary didn't fit in L1 and the rest of my application tended to clobber L1 anyway between requests for strings.
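    For what it's worth, the ~12 bytes per symbol roughly corresponds to a layout like the one below (a guess at where the memory goes in a Larsson/Moffat-style implementation, not the actual code): each sequence position carries the symbol plus two links threading together the positions that currently start the same pair, so the winning pair can be replaced without rescanning the input. The hash table of active pairs and the priority queue come on top of this, but they scale with the number of distinct pairs rather than with N:

    #include <stdint.h>

    typedef struct {
        uint32_t sym;    /* symbol currently at this position (or an empty-slot marker) */
        uint32_t prev;   /* previous position that starts the same active pair          */
        uint32_t next;   /* next position that starts the same active pair              */
    } SeqPos;            /* 3 * 4 = 12 bytes per input symbol                           */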

    [Edit: I forgot to mention, but in my implementation, I did not allow the null char to pair with any other character. So that ensured that compressed symbols always stayed within a single string. I could probably have got better compression by not doing that, but the indexing would have been more complicated. oh well. I also dropped the null chars when entropy-coding it, since I knew where the end of the compressed text of each string was, and the decompressor would just insert one at the end of the string anyway.]

    Now in my case I knew exactly which string to decode and I needed all the symbols of it; but if you were e.g. compressing data from an endgame database and you only needed one specific result from the block, you could speed up the decompressor by including in each dictionary entry a count of how many uncompressed symbols it produces in the final output. Then when "decompressing" a block, you can skip over all the compressed symbols before the one that you really need. (I vaguely recall that Ronald de Man's version does something like that, but I can't remember where I read it, sorry.)
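    In rough C, that skip trick could look like the sketch below (a hypothetical layout building on the earlier dictionary sketch, not de Man's code): each entry carries the number of terminals it expands to, so whole compressed symbols are skipped until the one covering the wanted position is reached, and then only a single root-to-leaf path is walked:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint16_t left, right;   /* as in the earlier sketch: left == 0 marks a terminal     */
        uint32_t expands_to;    /* terminals produced when fully expanded (1 for terminals) */
    } CountedEntry;

    /* Return the single uncompressed value at position target within one block
       (target must be inside the block's output range). */
    uint16_t decode_one(const CountedEntry *dict, const uint16_t *syms,
                        size_t nsyms, size_t target)
    {
        size_t i = 0;
        while (i < nsyms && dict[syms[i]].expands_to <= target) {
            target -= dict[syms[i]].expands_to;        /* skip this whole compressed symbol */
            i++;
        }
        uint16_t sym = syms[i];                        /* the symbol that covers target     */
        while (dict[sym].left != 0) {                  /* descend through non-terminals     */
            uint32_t left_len = dict[dict[sym].left].expands_to;
            if (target < left_len) {
                sym = dict[sym].left;                  /* target falls in the left half     */
            } else {
                target -= left_len;
                sym = dict[sym].right;                 /* or in the right half              */
            }
        }
        return dict[sym].right;                        /* terminal: the literal value       */
    }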

    So yeah. TL;DR: if you want the maximum possible compression, Re-pair may disappoint. But if you need random access, especially at specific boundary points, it may be worth checking out.
    Last edited by wgarvin; 5th January 2015 at 13:47. Reason: change N to K to avoid confusion with O(N), fix 1 link, fix a typo

  12. #10
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
    Quote Originally Posted by Paul W. View Post
    What does CBT stand for? (I mean besides "Cognitive-Behavioral Therapy" and "Cock and Ball Torture", which is what a few minutes googling got me, and I had to stop... I think I might need the former after exposure to the latter...)

    Lol, I'm going to quote you every time I see people do the "let me google that for you" thing to anyone.
    Btw, do you know you can now get waffles in the colour blue? (Hint: do not google it.)
