
Thread: Calgary compression challenge

  1. #1
    Matt Mahoney

    Calgary compression challenge

    I was looking at Alexander Rhatushnyak's winning entry in the Calgary Compression Challenge in 2010.
    http://mailcom.com/challenge/

    The entry has 2 parts, a compressed file of size 572,465 bytes and a decompression program as C++ source code in a ppmd archive with size 7,700 bytes. The code is stripped of comments and most white space and uses short variable names to make it more compressible. Nevertheless, it is somewhat readable. It runs in about 15 minutes on my 2.0 GHz T3200 under Win32 using about 580 MB memory.

    It appears to be based on paq8, unlike his earlier entries which were based on paqar, derived from paq6. The major difference is that context models are mixed by weighted averaging in the logistic domain ln(p/(1-p)), rather than by weighted summation of evidence (bit counts). It has most of the models that were in use in the paq8px series, such as order-13, word (order 5), sparse, record, distance, indirect, nest, and dmc. Useless models like jpeg, exe, and bmp are taken out, of course. Some new ones were added. For example, it calculates average prediction errors and uses that as context. But mostly it is not too different. It has hundreds of models, combined with several mixers and a chain of 6 APM (SSE) stages before the final bit prediction.
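
    To illustrate the difference, here is a minimal sketch of a logistic mixer in the style of paq8 (not code from the entry; the names and learning rate are made up). Each model's probability p is stretched with ln(p/(1-p)), the stretched values are combined with trained weights, and the sum is squashed back into a probability:

    Code:
    #include <cmath>
    #include <vector>

    // Sketch of paq8-style logistic mixing (illustrative only).
    struct Mixer {
        std::vector<double> w, st;          // weights and stretched inputs
        double rate = 0.002;                // learning rate (arbitrary choice)
        explicit Mixer(int n) : w(n, 0.0), st(n, 0.0) {}

        static double stretch(double p) { return std::log(p / (1 - p)); }
        static double squash(double x)  { return 1 / (1 + std::exp(-x)); }

        // Combine model predictions p[i] = P(next bit = 1) in the logistic domain.
        double mix(const std::vector<double>& p) {
            double dot = 0;
            for (size_t i = 0; i < w.size(); ++i) {
                st[i] = stretch(p[i]);
                dot += w[i] * st[i];
            }
            return squash(dot);
        }

        // After the bit is known, adjust the weights to reduce coding cost.
        void update(double p_mixed, int bit) {
            double err = bit - p_mixed;     // gradient of log loss w.r.t. the sum
            for (size_t i = 0; i < w.size(); ++i)
                w[i] += rate * err * st[i];
        }
    };
    In paq6/paqar the mixer instead adds up weighted bit counts and predicts p = n1/(n0+n1), which is the weighted summation of evidence that this replaces.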

    The head of the compressed file looks like a normal paq8 archive header with some bits stripped out. It shows the order of decompression. File sizes are stored as usual as decimal strings ending with a tab (a parsing sketch follows the list):

    768771 book1
    610856 book2
    53161 paper1
    82199 paper2
    377109 news
    111261 bib
    93695 trans
    39611 progc
    71646 progl
    49379 progp
    21504 obj1
    246814 obj2
    102400 geo
    513216 pic
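
    For illustration, such a header could be read with something like the sketch below (my guess at the layout only: a decimal size, a tab, then the file name on each line; it is not taken from the actual decompressor):

    Code:
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Entry { long size; std::string name; };

    // Sketch: read "size<TAB>name" records until a line does not start with a digit.
    std::vector<Entry> read_header(std::FILE* f) {
        std::vector<Entry> entries;
        int c;
        while ((c = std::getc(f)) >= '0' && c <= '9') {
            long size = c - '0';
            while ((c = std::getc(f)) >= '0' && c <= '9')
                size = size * 10 + (c - '0');
            // c is now the tab that terminates the size
            std::string name;
            while ((c = std::getc(f)) != EOF && c != '\n')
                if (c != '\r') name += (char)c;
            entries.push_back({size, name});
        }
        return entries;
    }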

    All of the files except pic use the same context model. pic has its own model consisting of just 3 sliding window contexts surrounding the predicted pixel (1 bit). It differs from the other models in that context bits are shifted out every bit rather than on byte boundaries like the others.
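
    Roughly, such a model might look like the following sketch (not his code; the window shapes here are invented, only the 1728-bit row width of pic is real). Each context is a small window of already-decoded pixels around the one being predicted, and the history advances after every bit:

    Code:
    #include <cstdint>
    #include <vector>

    // Sketch of bit-level sliding-window contexts for a 1-bpp image like pic.
    struct PicContexts {
        static const int W = 1728;          // pic row width in pixels
        std::vector<uint8_t> buf;           // ring buffer of decoded bits
        size_t pos = 0;
        PicContexts() : buf(2 * W, 0) {}

        uint8_t bit(size_t back) const {    // pixel 'back' positions before pos
            return buf[(pos + buf.size() - back) % buf.size()];
        }
        uint32_t cx1() const {              // 8 pixels to the left
            uint32_t c = 0;
            for (int i = 1; i <= 8; ++i) c = c * 2 + bit(i);
            return c;
        }
        uint32_t cx2() const {              // 8 pixels from the row above
            uint32_t c = 0;
            for (int i = -4; i <= 3; ++i) c = c * 2 + bit(W + i);
            return c;
        }
        uint32_t cx3() const {              // both windows combined
            return (cx1() << 8) | cx2();
        }
        void update(int b) {                // called after every decoded bit
            buf[pos] = (uint8_t)b;
            pos = (pos + 1) % buf.size();
        }
    };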

    The major differences are in post-processing. For the text files, there is a 235-word dictionary, grouped by file, to encode common words as single bytes. There are 2 symbols to indicate that the next word should be capitalized or upper case. The last 5702 bytes of the archive are a separate stream used to model line endings. This was not obvious from the code, but when I removed this section (effectively replacing the compressed data with zeros) and ran the decompressor, many of the linefeeds in book1, book2, paper1, paper2, and news (but not the other text files like bib, progc, progl, progp, trans) were replaced with spaces. The encoding seems to be done heuristically. Not all linefeeds are encoded, just those where a space doesn't change the meaning. In book1, paragraphs, quotes, hyphens (+ rather than -) and markup like <P58> still end with linefeeds. In book2, paper1, and paper2, paragraphs and LaTeX markup lines (starting with .) still end with linefeeds. In news, paragraphs, headers, and quoted text end with linefeeds.
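
    The decode side of such a dictionary transform could look something like this. It is only a sketch: the reserved byte values, the code assignments, and the way the case flags work here are assumptions, not the actual encoding:

    Code:
    #include <cctype>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative word-dictionary expansion (not the real transform).
    // Bytes >= 128 index a table of common words; two reserved codes mark
    // the next dictionary word as Capitalized or ALL UPPER CASE.
    const int CAPITALIZE = 1;               // assumed reserved code
    const int UPPERCASE  = 2;               // assumed reserved code

    std::string expand(const std::vector<uint8_t>& in,
                       const std::vector<std::string>& dict) {
        std::string out;
        int casemode = 0;
        for (uint8_t c : in) {
            if (c == CAPITALIZE || c == UPPERCASE) {
                casemode = c;                       // applies to the next word
            } else if (c >= 128) {
                std::string w = dict[c - 128];      // dictionary word
                if (casemode == CAPITALIZE && !w.empty())
                    w[0] = (char)std::toupper((unsigned char)w[0]);
                if (casemode == UPPERCASE)
                    for (char& ch : w) ch = (char)std::toupper((unsigned char)ch);
                casemode = 0;
                out += w;
            } else {
                out += (char)c;                     // literal byte
            }
        }
        return out;
    }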

    The arithmetic coder seems to have been modified to normalize 1 bit at a time instead of 1 byte at a time, probably to increase range precision. There are also other mystery transforms such as:

    if(tf&&(c-43&235)==0)c^=16;

    where c is the decoded byte and tf is true for the first 10 (text) files. I know about tricks to group white space characters together in the alphabet, but I'm guessing. It's hard to tell without the compressor. There is a lot of other code whose purpose is not clear either, but I guess that's the nature of compression. It's there because it makes the file smaller.
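
    On the renormalization point: a carry-free binary decoder that shifts one bit at a time might look like the sketch below (illustrative only, not the coder from the entry). paq8 uses the same carry-free scheme but shifts whole bytes; shifting per bit presumably keeps more range precision, as guessed above:

    Code:
    #include <cstdint>
    #include <cstdio>

    // Sketch of a binary arithmetic decoder with bit-at-a-time renormalization.
    struct Decoder {
        uint32_t lo = 0, hi = 0xFFFFFFFF, x = 0;
        std::FILE* in;
        int byte = 0, cnt = 0;

        explicit Decoder(std::FILE* f) : in(f) {
            for (int i = 0; i < 32; ++i) x = (x << 1) | get_bit();
        }
        int get_bit() {                     // next bit of the compressed stream
            if (cnt == 0) { byte = std::getc(in); cnt = 8; if (byte == EOF) byte = 0; }
            return (byte >> --cnt) & 1;
        }
        int decode(int p) {                 // p = P(bit = 1), scaled to 0..65535
            uint32_t mid = lo + (uint32_t)(((uint64_t)(hi - lo) * p) >> 16);
            int bit = (x <= mid);
            if (bit) hi = mid; else lo = mid + 1;
            while ((lo ^ hi) < (1u << 31)) {  // top bits agree: shift out one bit
                lo <<= 1;
                hi = (hi << 1) | 1;
                x = (x << 1) | get_bit();
            }
            return bit;
        }
    };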

  2. #2
    Matt Mahoney
    Here are the individual compressed sizes. I instrumented the code to show the number of bytes read from the two archive streams (main and newline models). This is a solid archive, so files after book1 are of course smaller than they would be if compressed separately.

    Edit: added paq8l, paq8px_v69, and zpaq for comparison. paq8px_v69 compresses the best of these 3 except for pic because the pic model from paq8l was removed. zpaq compressed the files alphabetically (bib, book1, ..., trans), not in the order shown.

    Code:
             cc580  LF     paq8l -8 paq8px_v69 -8 zpaq -m4
             -----------   -------- ------------- --------
    book1    181689 2885     192268     191580      198702
    book2    110762 2110     118027     116956      123693
    paper1     9935 81        10434      10323       11195
    paper2    15890 157       16733      16628       17128
    news      79665 469       83491      82559       90668
    bib       17935 0         18464      18255       23010
    trans      9808 0          9942       9791       11619
    progc      8055 0          8356       8203        9114
    progl      9316 0          9756       9434       11107
    progp      6520 0          6872       6678        7935
    obj1       7337 0          7256       7189        8811
    obj2      43805 0         44306      43755       56065
    geo       43739 0         44215      44259       46759
    pic       22129 0         23008      30887       28730
    
    Total    572465          593322     596607      644446
    
    c.dat header  182
    prog.pmd     7700
    file sizes      3
    file names      2
    Total      580170
    Also, here are the sizes of the binaries compressed separately. This improves compression over a solid archive except for obj1.

    43,865 geo.paq8px
    7,384 obj1.paq8px
    43,703 obj2.paq8px
    22,530 pic.paq8l
    Last edited by Matt Mahoney; 24th March 2012 at 00:49.

  3. #3
    valdmann
    Quote Originally Posted by Matt Mahoney
    There are also other mystery transforms such as:

    if(tf&&(c-43&235)==0)c^=16;

    where c is the decoded byte and tf is true for the first 10 (text) files. I know about tricks to group white space characters together in the alphabet, but I'm guessing. It's hard to tell without the compressor. There is a lot of other code whose purpose is not clear either, but I guess that's the nature of compression. It's there because it makes the file smaller.
    This appears to swap '+'<-->';' and '/'<-->'?'...
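
    A quick check over all byte values confirms it: the test (c-43&235)==0 only passes for '+', '/', ';' and '?', and ^16 pairs them up.

    Code:
    #include <cstdio>

    int main() {
        for (int c = 0; c < 256; ++c)
            if (((c - 43) & 235) == 0)      // same test as in the decompressor
                std::printf("'%c' (%d) -> '%c' (%d)\n", c, c, c ^ 16, c ^ 16);
    }
    // prints exactly four lines: '+' -> ';', '/' -> '?', ';' -> '+', '?' -> '/'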

    Quote Originally Posted by Matt Mahoney
    In book2, paper1, and paper2, paragraphs and LaTeX markup lines (starting with .) still end with linefeeds. In news, paragraphs, headers, and quoted text end with linefeeds.
    It's troff markup.

  4. #4
    Member (Kraków, Poland)
    Quote Originally Posted by valdmann
    This appears to swap '+'<-->';' and '/'<-->'?'...
    Well, most CMs nowadays use bitwise coding, so grouping symbols with similar characteristics under a common bit prefix should improve compression. Of course, a symbol's characteristics depend on the file content.
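
    For example, printing the bit patterns shows the swap moves characters between the 0x2x prefix group (space and common punctuation) and the 0x3x group (digits, ':' .. '?'), so after the remapping they share leading bits with different neighbours. Just an illustration of the prefix idea, not an analysis of the actual files:

    Code:
    #include <bitset>
    #include <cstdio>

    // Bit patterns of the four swapped characters; under bitwise coding the
    // leading bits of a byte act as a prefix context for the remaining bits.
    int main() {
        for (char c : { '+', ';', '/', '?' })
            std::printf("'%c' = %s   <->   '%c' = %s\n",
                        c, std::bitset<8>((unsigned char)c).to_string().c_str(),
                        c ^ 16, std::bitset<8>((unsigned char)(c ^ 16)).to_string().c_str());
    }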

