I have created a new benchmark to test generic compression capability.
This is the first benchmark using synthetic data. It is designed to test general-purpose compression techniques that are not tied to any popular format such as text or images. Programs that solve the general problem do better on this set than programs that are tuned to benchmarks or that specialize in certain file types.
The data is 1 million random binary strings (length 1 to 32 bytes) ranging from easy to compress (e.g. all 0 bits) to hard (all random bits) and everything in between. They are generated by small random programs. This kind of data has been proposed as a test for universal intelligence.
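To give a feel for how small random programs can yield strings of varying compressibility, here is a toy sketch in Python. It is not the benchmark's actual generator: the four-instruction machine, the function name random_string, and the parameters prog_len and max_steps are all hypothetical choices made for illustration only.

```python
import random

def random_string(rng, max_len=32, prog_len=8, max_steps=256):
    """Run one small random program and return its output (1..max_len bytes).

    Hypothetical toy machine, NOT the benchmark's real generator.
    """
    # A program is a short list of (opcode, argument) pairs chosen at random.
    program = [(rng.randrange(4), rng.randrange(256)) for _ in range(prog_len)]
    target_len = rng.randint(1, max_len)
    acc, pc, out = 0, 0, bytearray()
    for _ in range(max_steps):          # step limit guards against endless loops
        if len(out) >= target_len:
            break
        op, arg = program[pc]
        if op == 0:                     # load a constant
            acc = arg
        elif op == 1:                   # add a constant modulo 256
            acc = (acc + arg) & 0xFF
        elif op == 2:                   # emit the accumulator as one output byte
            out.append(acc)
        else:                           # jump to another instruction
            pc = arg % prog_len
            continue
        pc = (pc + 1) % prog_len
    if not out:                         # ensure at least one byte of output
        out.append(acc)
    return bytes(out)

rng = random.Random(1)                  # fixed seed only to make the demo repeatable
sample = [random_string(rng) for _ in range(5)]
```

Programs that loop over "emit" with a constant accumulator give highly repetitive strings, while programs that mix additions and jumps give less regular ones. A toy like this does not reach the fully random end of the range described above.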
A problem with synthetic data is that you can compress it very small if you know the program that created it. To prevent this, the test data generator uses a cryptographic random number generator (RC4) with a hardware-generated key (using CPU instruction timing variation as an entropy source), so the generator's output cannot be reproduced without the key.
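The idea of keying RC4 from timing variation can be sketched as follows. The RC4 key schedule and output loop are the standard published algorithm, but jitter_key is only a crude, hypothetical stand-in for the generator's real entropy source: it assumes that the low bit of time.perf_counter_ns() differences varies unpredictably, which depends on the platform's timer resolution.

```python
import time

def jitter_key(n_bytes=16):
    """Collect key bytes from timing jitter (hypothetical, illustration only)."""
    key = bytearray()
    for _ in range(n_bytes):
        byte = 0
        for _ in range(8):
            t0 = time.perf_counter_ns()
            sum(range(100))                       # small workload to be timed
            elapsed = time.perf_counter_ns() - t0
            byte = (byte << 1) | (elapsed & 1)    # keep only the lowest bit
        key.append(byte)
    return bytes(key)

def rc4_keystream(key, n):
    """Return the first n bytes of the RC4 keystream for the given key."""
    s = list(range(256))
    j = 0
    for i in range(256):                          # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):                            # pseudo-random generation
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)

random_bytes = rc4_keystream(jitter_key(), 32)    # differs on every run
```

Because the key is different on every run, two testers never generate the same data, and no fixed key can be built into a decompressor.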
This is an open benchmark where anyone can contribute results. Because each tester generates their own data set, programs are generally tested on different data. In my experiments, this introduces a small uncertainty of about 0.1%. The uncertainty can be reduced by reporting the ratio to the result of a standard compressor on the same data: using ppmonstr as the standard, the average error is 0.02% to 0.05%, and is generally smaller for the better compressors. It can be reduced further by averaging over several tests.
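The ratio-and-average procedure might look like the following sketch. The function names and the make_data callback are hypothetical, and since ppmonstr is not available as a Python library, the usage comment uses bz2 purely as a stand-in for the standard compressor.

```python
import bz2
import zlib
from statistics import mean

def relative_size(candidate, standard, data):
    """Compressed size of `candidate` divided by that of `standard` on the same data."""
    return len(candidate(data)) / len(standard(data))

def averaged_relative_size(candidate, standard, make_data, trials=5):
    """Average the size ratio over several independently generated data sets."""
    return mean(
        relative_size(candidate, standard, make_data())
        for _ in range(trials)
    )

# Example call, assuming make_data() returns a freshly generated test file as bytes:
#   score = averaged_relative_size(zlib.compress, bz2.compress, make_data)
```

Dividing by the standard compressor's size on the same file cancels much of the variation between data sets, since a file that happens to be slightly harder tends to be harder for both programs.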
I could have used private data, but I prefer to allow others to contribute and verify results. I also could have used a fixed, public data set, but then I would have to add the size of the decompressor as with LTCB. I designed a protocol that has the advantages of both. Anyone can submit or verify results (with a small uncertainty), but it is not possible to hide the test set in the decompressor because the data is generated randomly and is different for every test.
Test files are fairly small (16.5 MB). I don't believe that programs should need lots of memory to compress them, because the million strings are all independent of each other.
I have posted a few results and will be adding more later. Currently, ppmonstr is on top.