
Thread: New benchmark for generic compression

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

New benchmark for generic compression

    I have created a new benchmark to test generic compression capability.
    http://cs.fit.edu/~mmahoney/compression/uiq/

    This is the first benchmark using synthetic data. It is designed to test general purpose compression techniques not tied to any popular format like text or images. Programs that solve the general problem do better on this set than programs that are tuned to benchmarks or that specialize in certain file types.

    The data is 1 million random binary strings (length 1 to 32 bytes) ranging from easy to compress (e.g. all 0 bits) to hard (all random bits) and everything in between. They are generated by small random programs. This kind of data has been proposed as a test for universal intelligence.
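
    As a toy illustration only (this is NOT the uiq2 format, which the page above describes in terms of small Turing machines), the idea of "strings generated by small random programs" can be sketched like this: fill a tiny program table with random bytes and run it as an automaton. Tables that get stuck in one state emit constant, easy strings; tables that wander emit nearly random, hard ones.
    Code:
    // Toy sketch of the idea only -- NOT the uiq2 format.  Each string comes
    // from running a tiny random "program": prog[0..7] is an output table,
    // prog[8..15] a transition table.
    #include <cstdio>
    #include <random>
    #include <string>

    int main() {
        std::mt19937 rng(12345);                 // stand-in for uiq2's cryptographic RNG
        std::uniform_int_distribution<int> byte(0, 255);
        for (int i = 0; i < 10; ++i) {           // 10 sample strings (uiq2 makes 1,000,000)
            unsigned char prog[16];              // the small random program
            for (unsigned char& b : prog) b = (unsigned char)byte(rng);
            int len = 1 + byte(rng) % 32;        // string length 1..32 bytes
            std::string out;
            int state = 0;
            for (int step = 0; step < len; ++step) {
                out.push_back((char)prog[state]);    // output rule
                state = prog[8 + state] % 8;         // transition rule
            }
            std::printf("string %2d (%2d bytes): ", i, (int)out.size());
            for (unsigned char c : out) std::printf("%02x", c);
            std::printf("\n");
        }
        return 0;
    }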

    A problem with synthetic data is that you can compress it very small if you know the program that created it. The test data generator uses a cryptographic random number generator (RC4) with a hardware generated key (using CPU instruction timing variation as an entropy source).

    This is an open benchmark where anyone can contribute results. Programs are generally tested on different data sets that you generate yourself. In my experiments, this introduces a small uncertainty, about 0.1%. This can be further reduced by taking the ratio to a standard compressor. Using ppmonstr as the standard, the average error is 0.02% to 0.05%, generally smaller for the better compressors. This can be reduced further by averaging over several tests.
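
    A minimal sketch of that scoring protocol (my code; the example pair uses sizes like those reported later in this thread, not new measurements): compress each freshly generated test file with both your program and ppmonstr, take the ratio of the two sizes, and average the ratios over several runs.
    Code:
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // (your_size, ppmonstr_size) pairs, one per independently generated file
        std::vector<std::pair<double, double>> runs = {
            {3064510, 3001790}
            /* ... add one pair per test run ... */
        };
        double sum = 0;
        for (const auto& r : runs) sum += r.first / r.second;
        std::printf("average ratio vs. ppmonstr: %.4f\n", sum / runs.size());
        return 0;
    }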

    I could have used private data, but I prefer to allow others to contribute and verify results. I also could have used a fixed, public data set, but then I would have to add the size of the decompressor as with LTCB. I designed a protocol that has the advantages of both. Anyone can submit or verify results (with a small uncertainty), but it is not possible to hide the test set in the decompressor because the data will be generated randomly, different for every test.

    Test files are fairly small (16.5 MB). I don't believe that programs should need lots of memory to compress them, because the million strings are all independent of each other.

    I have posted a few results and will be adding more later. Currently, ppmonstr is on top.
    Last edited by Matt Mahoney; 24th December 2008 at 04:40.

  2. #2
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
Great! I once had this idea, but never felt up to implementing it in any way. Thank you

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks. BTW this is still experimental. I am still trying different Turing machines so hold off on submitting benchmarks for a few days until I have the final version of the generator program.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
Matt, I could be wrong, but your test would seem to give compressors that work on byte fields the best chance of doing well, since each string is a random number of bytes, 1 to 32. Any chance of a similar test using random bit lengths? You could still use ASCII test files where only the characters '0' and '1' are used; the lengths could be the same as in your other files if you use 1 to 32 BITS.

    I think I might write a full BWTS program to test it, but for the test above I would have to use the BYTE version. I was hoping to get a good test set where the BIT BWTS could be used.

    In some ways this is also a test of RC4. I hope someone accidentally stumbles on a compressor that does better than what theory says should be possible; that would indicate a serious break in RC4. I don't think someone will actually stumble onto it with your tests, but it would make life more interesting.

    Thanks for your great work.

    David Scott

    P.S. I like the idea. I just hope you create other test sets, for binary or maybe even for DNA-type data.

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Merry Christmas. I have updated the test file generator (uiq2). Hopefully this will be the final update to the file format. The new version uses Turing machines with a more expressive format, a smaller file size (6.5 MB instead of 16.5 MB), and outputs self-delimiting strings that simple compressors ought to be able to model. On the new benchmark, rings and flzp each move down 2 spots. The rest did not change ranking. I will add other compressors later but it is late now.

    To answer your question, I experimented with other output formats like hexadecimal and 1 to 7 bits per byte (add to '0') to create text file formats with strings delimited by newlines. In theory these should compress the same, since the information is the same. In practice, most programs compressed 8 bit binary the best, although a few did better on base 64 or hex. Almost all programs did the worst on strings of ASCII '0' and '1'.

    Another reason for using binary files with all 256 possible values is to discourage tuning to the benchmark by assuming certain alphabets. I know the 0 byte has special meaning (no correlation between bytes on opposite sides), but most simple compressors should model this.

    I also updated the description of the file format and the justification so hopefully it is more clear.
    http://cs.fit.edu/~mmahoney/compression/uiq/

    If you already downloaded uiq.zip, throw it away and get uiq2.zip. It also has one new feature: you can give it - as the output file name and it will output in hex to the screen with each string on a separate line. In theory you should be able to compress this to the same size as the binary file since it has the same information.

I know RC4 has some weaknesses but I don't think I used it in a way where they could be exploited. Later I might change the PRNG to something more secure like SHA-256 or AES. But I think the biggest weakness would be an easy-to-guess seed: the compressor could do a brute force key search, trying a copy of UIQ2 with different seeds to look for a match to the input.
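
    To make that concrete, a brute-force seed search could look roughly like the sketch below (my code; it assumes only the uiq2 command-line form shown elsewhere in this thread, "uiq2 <outfile> <seed> <count>", and the candidate seeds are purely hypothetical).
    Code:
    // Sketch of the brute-force seed search described above -- not a practical
    // attack.  Regenerate the data with each candidate seed and compare it to
    // the test file; a match means the whole file compresses to just the seed.
    #include <cstdio>
    #include <cstdlib>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    static bool same_file(const std::string& a, const std::string& b) {
        std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
        if (!fa || !fb) return false;
        std::string da((std::istreambuf_iterator<char>(fa)), std::istreambuf_iterator<char>());
        std::string db((std::istreambuf_iterator<char>(fb)), std::istreambuf_iterator<char>());
        return da == db;
    }

    int main(int argc, char** argv) {
        if (argc != 2) { std::fprintf(stderr, "usage: %s testfile\n", argv[0]); return 1; }
        std::vector<std::string> candidates = {       // hypothetical "easy" seeds
            "AAAAAAAAAAAAAAAAAAAAAAAAAAAA", "ABABABABABABABABABABABABABAB"};
        for (const std::string& seed : candidates) {
            std::string cmd = "uiq2 guess.tmp " + seed + " 1000000";
            (void)std::system(cmd.c_str());           // regenerate with this seed
            if (same_file(argv[1], "guess.tmp")) {
                std::printf("seed found: %s\n", seed.c_str());
                return 0;
            }
        }
        std::printf("no candidate seed matched\n");
        return 0;
    }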
    Last edited by Matt Mahoney; 25th December 2008 at 08:53.

  6. #6
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts


    Thanks Matt!

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
Actually, I'm very skeptical about such a test. The file is too small, even the 16 MB one; a model needs some time to learn. It's not real data like the ENWIKs...
    Thanks anyway for an original idea...

  8. #8
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have tested BIT 0.7 vs PPMonstr.

    Code:
    C:\>uiq2 x
    Creating random seed...
    3416/2048 deltas, 1666/1536 locations, 1578/1536 intervals in 7.375 s
    to re-create: uiq2 x REAIRUSVMEGTYUGTRVRUOJQHSGVG 1000000
    292515 programs halted, 633282 timed out, 74203 output premature NUL
    Created file x
    Code:
    ppmonstr J -> 3,001,790 bytes
    BIT 0.7    -> 3,064,510 bytes
    Ratio      -> 1.0208
    So, BIT 0.7 took rank #4
    BIT Archiver homepage: www.osmanturan.com

  9. #9
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
OMG! I've lost my rank with the latest updates! BTW, I want to note that I used the -p=5 option, if you want to include it in your chart. Thanks for updating.
    BIT Archiver homepage: www.osmanturan.com

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks. I added your result and a few more of my own.
    http://cs.fit.edu/~mmahoney/compression/uiq/

As for the small size of the data, the 1 million strings in the file are independent of each other, so it doesn't matter. Modeling correlations between bits in the string is a harder problem. These strings don't need to be big, but you need a lot of them to get reproducible results.
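
    A rough back-of-the-envelope note on that last point (my reasoning, not from the benchmark page): if the per-string compressed sizes are independent with a coefficient of variation of roughly 1, then the relative run-to-run spread of the total is
    Code:
    % relative spread of the total compressed size over N independent strings,
    % assuming a per-string coefficient of variation of about 1
    \frac{\sigma_{\mathrm{total}}}{\mu_{\mathrm{total}}} \approx \frac{1}{\sqrt{N}}
      = \frac{1}{\sqrt{10^6}} = 0.1\%
    which is in the same ballpark as the roughly 0.1% uncertainty quoted in the first post.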

    Anyway I think you can see the results correlate pretty well with real data. I like the fact that programs are quick to test. I could do about 2 per day on LTCB when I first started it. I did about 40 in just the last few hours, but I am also not doing speed or memory testing.

  11. #11
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
Speed may be an important thing in this chart, because currently there is no way to gain from file-type-dependent filters (I'm especially talking about the speed side). Anyway, here are my small ideas for increasing the readability of the chart:
    - Ranks as numbers in a column (I prefer leftmost)
    - Method column
    - Options column
    BTW, I'm not sure whether you should list only the best result of a compressor (all options vs. only the best options, like in LTCB)

    ADDED:
- AFAIK, RAR at Best mode (-m5) uses block-wise LZH/PPM switching. You can note it as LZ77/PPM instead of PPM?.
    - CMM4's memory usage doesn't match the listed option, 96: 96 means a 2^9 = 512 MiB sliding window for the match model and a 12x2^6 = 768 MiB context model size.
    - NanoZip in both optimum and optimum2 mode uses LZ77/BWT switching.
    Last edited by osmanturan; 25th December 2008 at 23:20. Reason: Spelling + Extra Things
    BIT Archiver homepage: www.osmanturan.com

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Well I didn't think it was possible, but I benchmarked most of the 100 or so programs in LTCB in just one day. So it looks like I am going to stay with this data format.
    http://cs.fit.edu/~mmahoney/compression/uiq/

    I skipped a few commercial programs that would be a lot of work to install and I skipped some wrapper programs that have no new algorithms like freearc, peazip, etc. A few require Linux and I don't have it installed. The tests I did took about 5 minutes each to download, figure out the command line, run the program, record the results, etc. A few took longer, like when there was only source code and I had to make minor changes to compile. I saved a lot of time by using the LTCB notes so I wouldn't have to research it again.

    About the rank order, it is approximate (about 0.0005 error) but really, how important is an exact number? Just because one program compresses 1% better on some benchmark doesn't mean it will on your files. So going from #50 to #49 is really kind of pointless. Maybe this will cut down on the number of incompatible releases every time you do 0.1% better on some benchmark. That's why people still use zip.

    I was surprised how well ocamyd did (a DMC coder). Perhaps others might start writing programs tuned to the data. I tried to make that difficult, but it probably won't be long before we see new top programs. I could probably try it but I would like to give others a shot at it first. I still think that such programs will also do well on real data. So far that seems to be the case.

  13. #13
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
I mean, if I look at your chart, it's not easy to see how distant one compressor is from another. A row number would be good.

    A DMC codec's success is a surprise for me too. Possibly the data's structure fits the way DMC handles it.

    BTW, as I see it, you miswrote something again

    Code:
    rar 2.50          -m5  (LZ77/BWT max comp.)
    (...)
    cmm4 v0.1e        96   (CM 768 MB+256MB window)
Seems the CCM/CCMX memory usage is wrong too (it should be around 1300 MiB)
    Last edited by osmanturan; 26th December 2008 at 09:31. Reason: Added CCM/CCMX hint...
    BIT Archiver homepage: www.osmanturan.com

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Matt Mahoney View Post
    This is the first benchmark using synthetic data. It is designed to test general purpose compression techniques not tied to any popular format like text or images. Programs that solve the general problem do better on this set than programs that are tuned to benchmarks or that specialize in certain file types.
I can't agree with you. Every algorithm models some data source, be it a Markov chain or an order-0 source. The data you provide isn't somehow "generic" - that is a meaningless term; it is just generated by a model that doesn't exist in the real world. So what do you actually test? How close the various models used in the real world, and therefore implemented in real compressors, are to your artificial, randomly chosen data source. It should be obvious that by changing the method of data generation to something more LZ-like you will change your ranks.

    Now DMC and PPM are superior just because you used a Markov model to generate the data.

  15. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Actually CM compressors do the best, followed by DMC, PPM, and BWT which are restricted to Markov processes (can't model sparse data). All of these in turn do better than LZ77, LZW, LZP, and ROLZ because unlike other algorithms, they can code the same string more than one way, which wastes space. You can't make a data set more "LZ-like" because the algorithms are fundamentally inferior.
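
    To illustrate the "more than one way to code the same string" point, here is a toy LZ77 decoder (mine, not any real compressor's format) fed two different parses that decode to the same input; the probability an LZ coder spreads over such alternatives is coding space wasted.
    Code:
    #include <cassert>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Token { char lit; int dist, len; };   // len == 0: literal; else copy a match

    static std::string decode(const std::vector<Token>& ts) {
        std::string s;
        for (const Token& t : ts) {
            if (t.len == 0) s.push_back(t.lit);
            else for (int i = 0; i < t.len; ++i) s.push_back(s[s.size() - t.dist]);
        }
        return s;
    }

    int main() {
        std::vector<Token> parse1 = {{'b',0,0}, {'a',0,0}, {'n',0,0}, {0,2,3}};  // b a n + match
        std::vector<Token> parse2 = {{'b',0,0}, {'a',0,0}, {'n',0,0},
                                     {'a',0,0}, {'n',0,0}, {'a',0,0}};           // all literals
        assert(decode(parse1) == "banana" && decode(parse2) == "banana");
        std::printf("both parses decode to: %s\n", decode(parse1).c_str());
        return 0;
    }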

    Ocamyd may be bit-aligned DMC. We usually tune our compressors for byte aligned data because a lot of file types use it. But you can't assume this for every file. Some data formats like deflate (in pdf), jpeg, mp3, and mpeg use variable length codes. A good compressor should recognize patterns that are not byte aligned.

    As for justification, see the papers linked from my page. Compression is a prediction problem. Sources can be arbitrarily complex, so it is also an AI problem. All machine learning algorithms follow the principle of Occam's Razor: choose the simplest hypothesis that fits the data. But why this worked wasn't always well understood. Kolmogorov, Solomonoff, and Hutter showed why and proved it mathematically. The "simplicity" of a hypothesis description is the length of the shortest program that outputs it. Kolmogorov proved that the measure is independent of the programming language except for a small constant that doesn't depend on the string being measured. Next, Solomonoff's universal prior assigns a relative probability to a string x as 2^-|p| summed over all the programs p that output x and |p| is the length of p in bits. This sum is dominated by the shortest such program. Thus, what Solomonoff proved is that the best predictor of the next bit in a string from an unknown source is to find the shortest program that outputs what was seen so far. Finally, Hutter extended the idea to interactive, goal seeking agents (which would include humans), thus explaining Occam's Razor.
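
    For reference, the universal prior described in that paragraph can be written out as a formula (standard notation; the wording is mine, not copied from the linked papers):
    Code:
    % Solomonoff's universal prior: the probability of a string x is a sum over
    % all programs p that make a fixed universal Turing machine U output
    % something beginning with x (written x*), each weighted by its length |p| in bits.
    M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
    % The shortest such program dominates the sum, which is why "find the shortest
    % program consistent with the data so far" is the best next-bit predictor.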

    Hutter and Legg proposed sampling a Solomonoff distribution of random programs (Turing machines) as a test for general intelligence. In practice, we can only approximate a Solomonoff distribution because programs need reasonable time and memory limits. Since a general solution to Solomonoff induction is exponential in the size of the program, we can make this number small, as well as the time and memory limits, without making the compression problem too easy.

    So all that is left to do is choose a simple language with a small Kolmogorov constant, make the programs complex enough (128 bits) so they are just beyond our ability to guess them, to use enough samples (1 million) to make the experimental error small, and test the benchmark to verify it agrees with tests on real data (which it does).

  16. #16
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Actually CM compressors do the best, followed by DMC, PPM, and BWT which are restricted to Markov processes (can't model sparse data). All of these in turn do better than LZ77, LZW, LZP, and ROLZ because unlike other algorithms, they can code the same string more than one way, which wastes space. You can't make a data set more "LZ-like" because the algorithms are fundamentally inferior.
I think your idea is a good test, but the previous writer misses the point, because the best GENERIC coder for all files is the IDENTITY transform. We have to assume some sort of model that allows for compression. I agree that there is more than one model and that the choice will change the rankings.

    Example: take any of Matt's test files, do a full UNBWTS (the 256-symbol version), and use those as input test files. The ones that do well on Matt's data will do badly, since compressor writers will tune to either one of the two sets of data. It's not likely one could do well on both sets, so the rankings would change a lot.

    I disagree with two points, and I think your tests will prove the point as to which method is best. First, I think that all the methods, when tuned, will more or less fit the data quite well; between PPM and BWTS I suspect the only real difference after tuning will be the time to run and the space used.

    Second, it is possible to make LZW bijective so that no space is wasted, and even it, when tuned, will compress these kinds of files about the same; all it takes is someone willing to do creative thinking. But a bijective LZW, at least from my point of view, is going to be a slow space hog.

  17. #17
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
Matt, the problem here is that you don't compare the AI of various algorithms; you just measure how well one model, which models some type of data, fits another type of data. It's easy to produce data that is well compressible with LZ but not PPM - just repeat a large sequence of data many times.

    And one more thing - AFAIK it was Hutter's idea to compare computers to humans by compressing enwik9. But you should compare the compression level of the first 1 GB of data with the compression of the first 1.1 GB, since the experiments were made on trained *adults* who already have 1 GB in their minds.

  18. #18
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
I don't see why arbc2z failed; when I tested it, it did OK. However, I also did a test based on what would happen if another test set was used. You could take Matt's set and transform it into a set where all of Matt's files are either BWTSed or UNBWTSed.

    I checked with ARB255, which should compress all three about the same, and with PPMONSTR J. I also tested arbc2z, which failed on Matt's test set but not on my random test.

    Then I looked at how, on the new set which is the UNBWTS of Matt's, PPMONSTR J compares to the procedure where you do a BWTS and then apply arbc2z, which happens to be the same as applying arbc2z to Matt's test file by itself.

    The test files
    to re-create: uiq2 t3.tmp DDTYGBWUMDIIPMZARRERNTUOOQND 1000000
    T3.TMP   6527098
    T3U.TMP  6527098  which is UNBWTS of T3.TMP
    T3B.TMP  6527098  which is BWTS of T3.TMP

    The results from ARB255
    T3.255   5694751
    T3U.255  5694750
    T3B.255  5694751

    The results from PPMONSTR J
    T3.PMM   3005458
    T3U.PMM  5329593
    T3B.PMM  3168556

    The result of BWTS applied to T3U.TMP, followed by arbc2z
    T3.C2Z   3204258

    Ratio of (BWTS followed by ARBC2Z) / (PPMONSTR J) on T3U.TMP:
    0.60122

    Hey, it was fun.

  19. #19
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts


I like this benchmark. I generated a file and uploaded it as uiq2:
    http://www.metacompressor.com/uploads.aspx

  20. #20
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    To Bulat: the problem with compressing 1.1 GB - 1 GB of text is that compressors can cheat and compress both to the same size by adding garbage for the smaller file.

    Actually the idea for using random Turing machines to generate data is from Legg and Hutter in http://www.idsia.ch/idsiareport/IDSIA-04-05.pdf
    It was my idea to use text compression in place of the Turing test. Hutter liked my idea and sponsored the prize.

Unfortunately, UIQ2 output is not completely generic. The more I think about it, the more ways there are to tune for it: don't use any contexts with a 0 byte in the middle, use bitwise contexts, don't look for long matches, and train the model on data generated by a copy of UIQ2 before you start compressing. I haven't yet figured out how to solve these problems, but I am sure it won't be long before someone beats paq8k.
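
    For illustration only, here is a minimal sketch (mine, not from any listed compressor) of the first trick: a rolling context of recent bytes that is dropped whenever a 0 byte appears, since the strings on either side of a 0 are independent and should not share context.
    Code:
    #include <cstdint>
    #include <cstdio>

    struct UiqContext {
        uint64_t h = 0;                     // hash of the bytes seen in the current string
        void update(uint8_t byte) {
            if (byte == 0) h = 0;           // string delimiter: forget the old context
            else h = h * 0x100000001b3ULL + byte + 1;   // otherwise extend the context
        }
    };

    int main() {
        const uint8_t data[] = {'a', 'b', 0, 'a', 'b', 0};
        UiqContext ctx;
        for (uint8_t b : data) {
            std::printf("context before byte %3d: %016llx\n",
                        (int)b, (unsigned long long)ctx.h);  // the same contexts recur after each 0
            ctx.update(b);
        }
        return 0;
    }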

    Also, you are right it is possible to tune data for LZ (use many repeating long sequences).

    To David: arbc2z failure was a bug in the g++ optimizer, not the program (after much debugging). Compile without -O and it works fine.

    Also, I am not surprised that BWTS and UNBWTS would have no effect on ARB255. It is an order 0 coder. BWTS and UNBWTS keep the same bytes but just reorder them. UNBWTS takes an order 0 data stream and converts it to high order, so I'm not surprised it makes compression worse on ppmonstr.
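
    A tiny check of that last point (my illustration, not ARB255's actual model): an order-0 coder's ideal code length depends only on the byte counts, which any reordering of the file, BWTS or UNBWTS included, leaves unchanged.
    Code:
    #include <cmath>
    #include <cstdio>
    #include <string>

    // ideal order-0 code length in bits, given the file's own byte frequencies
    static double order0_bits(const std::string& s) {
        double cnt[256] = {0}, bits = 0;
        for (unsigned char c : s) cnt[c]++;
        for (unsigned char c : s) bits += -std::log2(cnt[c] / s.size());
        return bits;
    }

    int main() {
        std::string a = "abracadabra";
        std::string b(a.rbegin(), a.rend());   // any permutation of the same bytes
        std::printf("%.2f bits vs %.2f bits\n", order0_bits(a), order0_bits(b));
        return 0;
    }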

    To Sportman: thanks for the benchmark. I posted a link to it from my site.
    Last edited by Matt Mahoney; 29th December 2008 at 05:34.

  21. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
Quote Originally Posted by Matt Mahoney View Post
    To Bulat: the problem with compressing 1.1 GB - 1 GB of text is that compressors can cheat and compress both to the same size by adding garbage for the smaller file.
Yes, but this *practical* problem doesn't make other ways of comparing computer and human correct.
