I was looking at Alexander Rhatushnyak's winning entry in the Calgary Compression Challenge in 2010.
The entry has two parts: a compressed file of 572,465 bytes and the decompression program, C++ source code in a ppmd archive of 7,700 bytes. The code is stripped of comments and most white space and uses short variable names to make it more compressible. Nevertheless, it is somewhat readable. It runs in about 15 minutes on my 2.0 GHz T3200 under Win32, using about 580 MB of memory.
It appears to be based on paq8, unlike his earlier entries, which were based on paqar, derived from paq6. The major difference is that context models are mixed by weighted averaging in the logistic domain, ln(p/(1-p)), rather than by weighted summation of evidence (bit counts). It has most of the models that were used in the paq8px series, such as order-13, word (order 5), sparse, record, distance, indirect, nest, and dmc. Useless models like jpeg, exe, and bmp are taken out, of course. Some new ones were added: for example, it calculates average prediction errors and uses them as context. But mostly it is not too different. It has hundreds of models, combined with several mixers and a chain of 6 APM (SSE) stages before the final bit prediction.
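To make the difference concrete, here is a minimal sketch of logistic-domain mixing in the paq8 style. The stretch/squash names follow paq8; the fixed weights are placeholders, since in the real program they are trained online:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stretch and squash: the logistic transform and its inverse, as in paq8.
double stretch(double p) { return std::log(p / (1.0 - p)); }
double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

// Mix model predictions by weighted averaging in the logistic domain.
// p[i] is model i's probability that the next bit is 1, w[i] its weight.
double mix_logistic(const std::vector<double>& p, const std::vector<double>& w) {
    double x = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) x += w[i] * stretch(p[i]);
    return squash(x);
}
```

Averaging stretched probabilities gives confident models (p near 0 or 1) far more influence than linear averaging of counts does, which is largely why the paq8-style mixer wins.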
The head of the compressed file looks like a normal paq8 archive header with some bits stripped out. It shows the order of decompression. File sizes are stored, as usual, as decimal strings ending with a tab:
All of the files except pic use the same context model. pic has its own model, consisting of just 3 sliding-window contexts surrounding the predicted pixel (1 bit). It differs from the other models in that context bits are shifted out after every bit rather than on byte boundaries like the other models.
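I can only guess at the details, but a bitwise sliding-window context along these lines is plausible. The window offsets and sizes below are illustrative, not taken from the entry; pic is 1728 bits (216 bytes) wide, so the row above the current pixel sits 1728 bits back in the stream:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative bitwise context for a 1-bpp image: combine the last few
// bits on the current row with a window of bits from the row above.
// ROW_BITS = 1728 matches pic's width; the window sizes are my guesses.
constexpr int ROW_BITS = 1728;

struct PixelContext {
    std::vector<uint8_t> bits;  // all bits decoded so far, oldest first

    // Context = 8 bits to the left + 8 bits centered above the next pixel.
    // It changes after every decoded bit, not just on byte boundaries.
    uint32_t context() const {
        uint32_t cx = 0;
        for (int off = 1; off <= 8; ++off)                       // left neighbors
            cx = (cx << 1) | bit_at(off);
        for (int off = ROW_BITS - 4; off < ROW_BITS + 4; ++off)  // row above
            cx = (cx << 1) | bit_at(off);
        return cx;
    }
    void push(int b) { bits.push_back(static_cast<uint8_t>(b)); }

private:
    int bit_at(int back) const {  // bit 'back' positions before the current one
        int i = static_cast<int>(bits.size()) - back;
        return i >= 0 ? bits[i] : 0;
    }
};
```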
The major differences are in post-processing. For the text files, there is a 235-word dictionary, grouped by file, that encodes common words as single bytes. There are two symbols to indicate that the next word should be capitalized or upper case. The last 5702 bytes of the archive are a separate stream used to model line endings. This was not obvious from the code, but when I removed this section (effectively replacing the compressed data with zeros) and ran the decompresser, many of the linefeeds in book1, book2, paper1, paper2, and news (but not the other text files like bib, progc, progl, progp, trans) were replaced with spaces. The encoding seems to be done heuristically: not all linefeeds are encoded, just those where a space doesn't change the meaning. In book1, paragraphs, quotes, hyphens (+ rather than -), and markup like <P58> still end with linefeeds. In book2, paper1, and paper2, paragraphs and LaTeX markup lines (starting with .) still end with linefeeds. In news, paragraphs, headers, and quoted text still end with linefeeds.
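As a rough illustration of how such a dictionary stage could work on the decode side (the escape codes, the >= 0x80 convention, and the word table here are my assumptions; the entry's actual 235-word dictionary and byte assignments are different):

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical decoder for a word-dictionary post-processing stage:
// bytes >= 0x80 index a table of common words, and two escape codes
// mark the next token as Capitalized or ALL CAPS.
const unsigned char ESC_CAP = 0x01, ESC_UPPER = 0x02;

std::string decode_words(const std::string& in,
                         const std::vector<std::string>& dict) {
    std::string out;
    int mode = 0;  // 0 = as-is, 1 = capitalize first letter, 2 = all caps
    for (unsigned char c : in) {
        if (c == ESC_CAP)   { mode = 1; continue; }
        if (c == ESC_UPPER) { mode = 2; continue; }
        // No bounds checking: a sketch, not production code.
        std::string w = (c >= 0x80) ? dict[c - 0x80] : std::string(1, (char)c);
        if (mode == 1 && !w.empty())
            w[0] = (char)std::toupper((unsigned char)w[0]);
        else if (mode == 2)
            for (char& ch : w) ch = (char)std::toupper((unsigned char)ch);
        mode = 0;  // escapes affect only the next token
        out += w;
    }
    return out;
}
```

Encoding goes the other way: the compressor replaces each dictionary word with its byte and emits an escape when the original was capitalized, so the exact text is restored.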
The arithmetic coder seems to have been modified to normalize one bit at a time instead of one byte at a time, probably to increase range precision. There are also other mystery transforms such as:
where c is the decoded byte and tf is true for the first 10 (text) files. I know about tricks to group white-space characters together in the alphabet, but I'm guessing. It's hard to tell without the compressor. There is a lot of other code whose purpose is not clear either, but I guess that's the nature of compression: it's there because it makes the file smaller.