# Thread: Hutter Prize, 4.17% improvement is here

1. "Also, you can provide your own dictionary ..." -- that was a 3-minute answer, and here's the 30-minute one:

Why would I need someone else's code if I were young and smart and looking for a problem to solve, a challenge to take on?
Doesn't the fact that a fast, small and simple SSE is still pretty useful (after all the enormous work done in cmix and similar)
tell you that PAQ-like algorithms (including phda and cmix) don't recognize any patterns in the
two-dimensional model_number / probability_from_previous_models matrix?
They only have a one-dimensional view of it, first horizontal, then vertical.
But your algorithm does not have to look at this matrix at all!
Come on, guys, forget about phda and cmix, invent something radically new!
BWT and ANS were amazingly new solutions to old and boring problems; do something like that.
And for super-brief introductions to a couple of problems that look really important,
search for my last name on youtube.com (2nd half of the talk).
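
For readers who have not met SSE (secondary symbol estimation, called APM in the PAQ family), here is a minimal sketch of the idea, not phda's or cmix's actual code. A small table, indexed by a short context and by the quantized input probability, learns how far the primary prediction tends to be off and corrects it. The class name, the bucket count and the learning rate are illustrative assumptions; real implementations typically work in the stretched (logit) domain with fixed-point arithmetic, whereas this sketch interpolates directly in probability space for readability.

```python
class SSE:
    """Toy secondary symbol estimation stage with linear interpolation."""

    def __init__(self, n_contexts, n_buckets=33, rate=0.02):
        self.n_buckets = n_buckets
        self.rate = rate
        # Every context starts as the identity map: bucket j -> j / (n_buckets - 1).
        self.table = [
            [j / (n_buckets - 1) for j in range(n_buckets)]
            for _ in range(n_contexts)
        ]
        self._last = None

    def refine(self, p, ctx):
        """Return a corrected P(bit = 1) for input probability p in context ctx."""
        p = min(max(p, 0.0), 1.0)
        pos = p * (self.n_buckets - 1)
        lo = min(int(pos), self.n_buckets - 2)  # left bucket of the pair
        w = pos - lo                            # weight of the right bucket
        self._last = (ctx, lo, w)
        row = self.table[ctx]
        return (1.0 - w) * row[lo] + w * row[lo + 1]

    def update(self, bit):
        """Nudge the two buckets used by the last refine() toward the coded bit."""
        ctx, lo, w = self._last
        row = self.table[ctx]
        row[lo] += self.rate * (1.0 - w) * (bit - row[lo])
        row[lo + 1] += self.rate * w * (bit - row[lo + 1])


# Usage: the primary model keeps predicting 0.5, but the bit is always 1.
sse = SSE(n_contexts=1)
first = sse.refine(0.5, 0)
for _ in range(200):
    sse.refine(0.5, 0)
    sse.update(1)
later = sse.refine(0.5, 0)
print(first, later)
```

The table starts out as a pass-through and only drifts away from it where the primary prediction is systematically wrong, which is why such a stage can be both cheap and useful.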

2. Here are the results for version 1.6:

enwik9 compressed size: 117039346 bytes
size of decompression program in .zip: 41911 bytes
total size (compressed file + decompression program): 117081257 bytes
compression time: 84713.758 seconds
decompression time: 88401.702 seconds
compression memory: 4996672 KiB
decompression memory: 4993904 KiB

enwik8 compressed size: 15040647 bytes
size of decompression program in .zip: 564616 bytes
total size (compressed file + decompression program): 15605263 bytes
compression time: 8946.996 seconds
decompression time: 8904.394 seconds
compression memory: 3799820 KiB
decompression memory: 3836272 KiB

Description of test machine:
processor: Intel Core i7-7700K
memory: 32GB DDR4
OS: Ubuntu 16.04
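
The totals above are simply the compressed file plus the zipped decompressor. A small sketch that re-checks them and also expresses the results in bits per input byte, assuming the usual input sizes of 10^9 bytes for enwik9 and 10^8 bytes for enwik8 (the helper names are made up for illustration):

```python
# Figures copied from the version 1.6 results above, all in bytes:
# (compressed size, zipped decompressor size, reported total).
RESULTS_V1_6 = {
    "enwik9": (117039346, 41911, 117081257),
    "enwik8": (15040647, 564616, 15605263),
}
# Assumed uncompressed input sizes.
INPUT_BYTES = {"enwik9": 10**9, "enwik8": 10**8}


def total_size(compressed, decompressor_zip):
    """Total size = compressed file + decompression program."""
    return compressed + decompressor_zip


def bits_per_byte(compressed, input_bytes):
    """Compressed bits spent per byte of input."""
    return 8.0 * compressed / input_bytes


totals_match = {}
for name, (compressed, decompressor, reported) in RESULTS_V1_6.items():
    totals_match[name] = total_size(compressed, decompressor) == reported
    bpb = bits_per_byte(compressed, INPUT_BYTES[name])
    print(f"{name}: total matches = {totals_match[name]}, {bpb:.4f} bits per byte")
```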

4. The set of enwik-specific transforms
from the 2017 Hutter Prize winning entry
goes open-source today.
Archive is attached, read_me is included.

6. What size difference do these transforms make to phda? I'm curious to know how well it does as a generic text compressor vs dedicated to this one corpus.

I guess the flip-side of this is to test what these transforms do to the performance of e.g. cmix. It's already matching phda9 for size, albeit considerably off for CPU and memory. It'd take a long while to know though!

Anyway, thanks for making them public.

7. Originally Posted by byronknoll
Here are the results for version 1.6:
...
enwik8 compressed size: 15040647 bytes

If you compress preprocessed_enwik8 instead of enwik8: 15015895 bytes.
The difference is much bigger for the compressor that may use only 1 GiB of RAM.

9. @Alex, thanks for posting these transforms!

Originally Posted by JamesB
I guess the flip-side of this is to test what these transforms do to the performance of e.g. cmix. It's already matching phda9 for size, albeit considerably off for CPU and memory. It'd take a long while to know though!

cmix currently compresses enwik8 to 14965334 bytes.
preprocessed_enwik8 compresses to 14953755 bytes.

10. 36445248 enwik8.gz --- 103.44%
35233985 preprocessed_enwik8.gz

29008758 enwik8.bz2 --- 101.78%
28502591 preprocessed_enwik8.bz2

24861205 enwik8.7z --- 101.41%
24515695 preprocessed_enwik8.7z

gzip 1.6 -9
bzip2 1.0.6 -9
7z 9.20 -t7z -mx=9
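
The percentages in the list above appear to be the original archive size divided by the preprocessed archive size. A short sketch that recomputes them from the byte counts (the dictionary layout and function name are just for illustration):

```python
# (size of enwik8 archive, size of preprocessed_enwik8 archive), in bytes.
ARCHIVES = {
    "gzip 1.6 -9": (36445248, 35233985),
    "bzip2 1.0.6 -9": (29008758, 28502591),
    "7z 9.20 -t7z -mx=9": (24861205, 24515695),
}


def ratio_percent(original, preprocessed):
    """Original size as a percentage of the preprocessed size."""
    return 100.0 * original / preprocessed


ratios = {tool: ratio_percent(*sizes) for tool, sizes in ARCHIVES.items()}
for tool, value in ratios.items():
    print(f"{tool}: {value:.2f}%")
```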

Update:
Also, with phda9 the size of enwik8 before DRT decreases by ~5.1% (as in the read_me),
but the size after DRT decreases by ~8.36%: 60520510 ==> 55852090 bytes.
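
A note on how the ~8.36% figure relates to the two byte counts: it matches the saving measured against the smaller (preprocessed) size, the same convention as the ratios above. Measured against the larger size, the same saving comes out lower. A quick check, with variable names chosen only for this example:

```python
after_drt_plain = 60520510         # DRT output for enwik8, bytes
after_drt_preprocessed = 55852090  # DRT output for preprocessed_enwik8, bytes

saved = after_drt_plain - after_drt_preprocessed

# Saving relative to the smaller size (matches the ~8.36% quoted above).
relative_to_smaller = 100.0 * saved / after_drt_preprocessed
# Saving relative to the larger size (the conventional "decrease by" figure).
relative_to_larger = 100.0 * saved / after_drt_plain

print(saved, f"{relative_to_smaller:.2f}%", f"{relative_to_larger:.2f}%")
```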

11. Using the instructions in enwik8preproc.zip, I extracted the two files from the phda November 2017 release:
1) preprocessed enwik8
2) phda dictionary

I tried testing these with cmix:
cmix currently compresses enwik8 to 14947555 bytes.
phda preprocessed enwik8 compresses to 14866787 bytes.
cmix using the phda dictionary compresses enwik8 to 14878492 bytes.

The dictionary used by phda is amazing! @Alex, I am hoping to use this dictionary for cmix - let me know if this is not OK. Recently I have been focusing on the DRT, including trying to improve the dictionary. I am learning a lot through trying different experiments, but I think I am a long way from being able to generate a dictionary even close to this level of performance.

12. Originally Posted by byronknoll
I am hoping to use this dictionary for cmix - let me know if this is not OK.

OK, but if you tried building a dictionary by yourself, that would be even better.

Originally Posted by byronknoll
phda preprocessed enwik8 compresses to 14866787 bytes.
cmix using the phda dictionary compresses enwik8 to 14878492 bytes.

So the gain from enwik-specific preprocessing is 11705 bytes,
very close to what was observed earlier:
cmix currently compresses enwik8 to 14965334.
preprocessed_enwik8 compresses to 14953755.
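
The arithmetic behind this comparison, spelled out as a small sketch using the cmix figures quoted in this thread (variable names are illustrative only):

```python
# Newer cmix run (post 11), sizes in bytes.
cmix_plain = 14947555            # enwik8, cmix's own dictionary
cmix_phda_dictionary = 14878492  # enwik8, phda dictionary
cmix_phda_preprocessed = 14866787  # phda-preprocessed enwik8

# Earlier cmix run (post 9), sizes in bytes.
earlier_plain = 14965334
earlier_preprocessed = 14953755

dictionary_gain = cmix_plain - cmix_phda_dictionary
preprocessing_gain = cmix_phda_dictionary - cmix_phda_preprocessed
earlier_preprocessing_gain = earlier_plain - earlier_preprocessed

print("gain from phda dictionary:", dictionary_gain)
print("gain from enwik-specific preprocessing:", preprocessing_gain)
print("gain observed earlier:", earlier_preprocessing_gain)
```

The two preprocessing gains differ by only about a hundred bytes, while the dictionary accounts for most of the total improvement in the newer run.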

14. Originally Posted by Alexander Rhatushnyak
OK, but if you tried building a dictionary by yourself, that would be even better

Thanks! Yeah, I will definitely continue trying to build a dictionary and improve DRT.

15. Originally Posted by Alexander Rhatushnyak
Version 1.6 is here: http://qlic.altervista.org/phda9.zip
As usual, four executables are barely tested,
hopefully they will work as expected.

Compressed size of enwik9 should be 117'039'xxx.

UPDATE: 117'039'346.

New version from Alexander (same link)!

16. Version 1.7 is here: http://qlic.altervista.org/phda9.zip
Improvement is below 0.1% on enwik9,
but bigger on smaller text files,
e.g. 0.4% on book1 from the Calgary Corpus (CC).

18. Here are the results for version 1.7:

enwik9 compressed size: 116940874 bytes
size of decompression program in .zip: 43274 bytes
total size (compressed file + decompression program): 116984148 bytes
compression time: 83712.733 seconds
decompression time: 87596.519 seconds
compression memory: 4999504 KiB
decompression memory: 4996880 KiB

enwik8 compressed size: 15023870 bytes
size of decompression program in .zip: 565352 bytes
total size (compressed file + decompression program): 15589222 bytes
compression time: 8907.486 seconds
decompression time: 8868.092 seconds
compression memory: 3802892 KiB
decompression memory: 3838836 KiB

Description of test machine:
processor: Intel Core i7-7700K
memory: 32GB DDR4
OS: Ubuntu 18.04
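
To put these numbers next to the version 1.6 results from earlier in the thread, here is a short sketch that computes the relative improvement in compressed size (the function name is an assumption made for this example):

```python
# Compressed sizes in bytes, copied from the result posts in this thread.
V1_6 = {"enwik9": 117039346, "enwik8": 15040647}
V1_7 = {"enwik9": 116940874, "enwik8": 15023870}


def improvement_percent(old, new):
    """Reduction in size as a percentage of the old size."""
    return 100.0 * (old - new) / old


gains = {name: improvement_percent(V1_6[name], V1_7[name]) for name in V1_6}
for name, gain in gains.items():
    print(f"{name}: {V1_6[name] - V1_7[name]} bytes smaller, {gain:.3f}%")
```

The enwik9 figure stays under the 0.1% mentioned in the release announcement, and the smaller file gains slightly more in relative terms.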

21. Originally Posted by Matt Mahoney
How do we get the large window brotli results on LTCB?

These results have been available since 2016. https://groups.google.com/forum/m/#!...li/aq9f-x_fSY4

LTCB reports 223'597'884 bytes

Actually delivered with large window: 199'118'013 bytes
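
For scale, the gap between the two figures can be worked out as follows (a sketch using only the two sizes quoted above; variable names are illustrative):

```python
ltcb_reported = 223597884  # bytes, as listed on LTCB
large_window = 199118013   # bytes, with the large window setting

saved = ltcb_reported - large_window
saved_percent = 100.0 * saved / ltcb_reported

print(f"{saved} bytes smaller ({saved_percent:.2f}% of the listed size)")
```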

22. Version 1.8 is here: http://qlic.altervista.org/phda9.zip
Memory usage is higher now, 6 GiB rather than 4.75 GiB.
No other news.
As usual, barely tested executables.

24. Here are the results for version 1.8:

enwik9 compressed size: 116544849 bytes
size of decompression program in .zip: 42944 bytes
total size (compressed file + decompression program): 116587793 bytes
compression time: 86182.993 seconds
decompression time: 86305.520 seconds
compression memory: 6319256 KiB
decompression memory: 6316396 KiB

enwik8 compressed size: 15010414 bytes
size of decompression program in .zip: 558298 bytes
total size (compressed file + decompression program): 15568712 bytes
compression time: 9162.225 seconds
decompression time: 9258.572 seconds
compression memory: 4800208 KiB
decompression memory: 4836472 KiB

Description of test machine:
processor: Intel Core i7-7700K
memory: 32GB DDR4
OS: Ubuntu 18.04
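
The memory figures are reported in KiB, while the release note for version 1.8 speaks of GiB. A small conversion sketch, comparing the enwik9 compression memory measured for versions 1.7 and 1.8 (the dictionary keys and helper name are made up for this example):

```python
KIB_PER_GIB = 1024 * 1024

# Peak compression memory on enwik9, in KiB, from the result posts.
peak_kib = {
    "version 1.7": 4999504,
    "version 1.8": 6319256,
}


def kib_to_gib(kib):
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / KIB_PER_GIB


peak_gib = {version: kib_to_gib(kib) for version, kib in peak_kib.items()}
for version, gib in peak_gib.items():
    print(f"{version}: {gib:.2f} GiB")
```

Both measurements land close to the rounded values given in the announcement.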
