Just very useful compression corpus:
Nice one. Waiting for some results, I guess it could be compressed down to about 20-30 MB or even better.
The PDF is uncompressed, by the way, so Precomp doesn't help here.
It helps with the mozilla and samba testsets, though:
mozilla: pcf+pcf+bzip2 ->16,685,982 bytes instead of just bzip2 -> 17,914,392
samba: pcf+bzip2 -> 4,310,438 bytes instead of just bzip2 -> 4,549,790
EDIT: First results with CCM 1.30c (setting 4):
Code:
CCM            45.223.851
Precomp + CCM  43.377.536
Last edited by schnaader; 10th January 2009 at 12:36.
Quick test on a 2.0 GHz T3200, Win32. zip and nanozip -cc used 1 core. zpaq, 7zip, and nanozip in default mode used 2 cores. Times are real times to compress and decompress.
zip and zpaq compress files separately. nanozip and 7zip produce solid archives, so it takes as long to extract just the last file as to extract all of them. 7zip stops when it reaches the file you want, but nanozip needlessly decompresses to the end and discards the rest.
Code:
Size        Ctime  Dtime
39,452,564    621    620  silesia-cc.nz
40,568,503    776         silesia-m4.zpaq
43,422,818    279    303  silesia-m3.zpaq
44,841,467    107     21  silesia.nz
47,183,792     79     93  silesia-m2-b64.zpaq
49,729,056    137     10  silesia.7z
49,982,874     49     54  silesia-m1-b64.zpaq
50,533,746     48     56  silesia-m1.zpaq
67,633,896     41    2.5  silesia-9.zip
68,229,871     19    2.5  silesia.zip
Completely forgot about this one.
Some up-to-date results (Precomp 0.4.3 developer version using PackJPG 2.5, zpaq 3.01, reflate v0a), M 520 @ 2.4 GHz:
Unfortunately, the current version of reflate doesn't restore the whole testset correctly, but it works very well, especially on the mozilla part, as it is able to decompress ZIP files that Precomp fails on.
Code:
38.831.249  silesia.pcf.zpaq_m4_b48   (54 s + 638 s comp)
13.366.403  mozilla.7z                (55 s comp)
12.393.857  mozilla.zpaq_m4           (317 s comp)
11.913.640  mozilla.pcf.7z            (73 s + 44 s comp)
11.740.249  mozilla.reflate           (193 s comp, 92 s decomp)
11.121.041  mozilla.pcf.paq8px_v69_3  (39 s + 841 s comp)
10.664.503  mozilla.pcf.zpaq_m4       (73 s + 325 s comp)
 8.906.579  mozilla.pcf.paq8px_v69_4  (39 s + 8340 s comp)
 3.764.255  samba.7z                  (19 s comp)
 3.435.432  samba.reflate             (150 s comp, 10 s decomp)
 3.067.085  samba.zpaq_m4             (181 s comp)
 2.614.968  samba.pcf.zpaq_m4         (7 s + 224 s comp)
I also had a closer look at the x-ray file: it has a 240 byte header, followed by 1900*2230 grayscale pixels (12 bit, but stored as 16 bit values 0x0000-0x0FFF). I'm sure there's some way to convert it to one big or two separate BMP files and process it with some image compressor or paq8px_v69, but the best results I got so far with conversion aren't much better than without:
Code:
3.678.159         (paq8px_v69 -3, 139 s comp)
3.627.031 + ~240  (interpreting file as 16 bit BGR, 6:5:5, conversion to 1900*2230 24-bit BMP + paq8px_v69 -4, 274 s comp)
3.620.831 + ~240  (interpreting file as 8 bit grayscale, conversion to 3800*2230 8-bit BMP + paq8px_v69 -4, 371 s comp)
3.613.619         (paq8px_v69 -4, 1801 s comp)
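The plane-splitting idea mentioned above can be sketched as follows. This is a hedged illustration, not the code used for the tests: the header size and dimensions come from the description above, and the samples are assumed to be stored low byte first (if not, the two planes simply swap roles and the round trip is unaffected).

```python
# Sketch: split the 16-bit samples of the x-ray file into two 8-bit
# planes (low bytes and high bytes) so an 8-bit image compressor can
# model each plane separately. 240-byte header, then 1900*2230 16-bit
# grayscale values, as described above.

HEADER = 240
W, H = 1900, 2230  # image dimensions from the post above

def split_planes(data: bytes):
    """Return (header, low_plane, high_plane) from the raw file bytes."""
    header, body = data[:HEADER], data[HEADER:]
    low = body[0::2]   # low bytes of each 16-bit sample
    high = body[1::2]  # high bytes (only 0x00-0x0F for 12-bit data)
    return header, low, high

def join_planes(header: bytes, low: bytes, high: bytes) -> bytes:
    """Inverse of split_planes: interleave the planes back losslessly."""
    body = bytearray(len(low) * 2)
    body[0::2] = low
    body[1::2] = high
    return header + bytes(body)
```

Each plane could then be wrapped in an 8-bit grayscale BMP header for an image model to pick up, which is essentially what the conversions in the table above do.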
Last edited by schnaader; 16th December 2011 at 01:08.
reflate v0b is able to recompress and restore the whole testset completely:
Code:
211.938.968  silesia.7z_store
 46.025.705  silesia.7z_store.reflate  (1225 s comp, 213 s decomp)
That's cool, but is the -6 mode the best there? (You can change it in the "set level=6" lines in c.bat/d.bat.)
Also, I suppose that's a plzma result? Can you test with some other compression, too (like shar+ccm)?
I'm not sure if level 6 is the best. Precomp says the used compression levels are 1, 5 and 6, so those would be other possibilities to test, but the .hif files aren't that big anyway. Some test results and comments:
Tests with other compression methods:
Code:
46.025.705  silesia.7z_store.reflate  (level 6, 1880 .hif files, total .hif size  76.128 bytes)
46.125.184  silesia.7z_store.reflate  (level 9, 1880 .hif files, total .hif size 168.068 bytes)
Results are very good. Precomp is a bit better as it also processes non-zlib streams (994 GIF, 4 JPG, 7 bZip2). When using switches to disable those streams and recursion, the comparison is fairer:
Code:
219.754.380  silesia.reflate.shar
 43.337.208  silesia.reflate.shar.ccm          (CCM 1.30a level 7)
 38.923.300  silesia.reflate.shar.zpaq_m4_b48
 38.831.249  silesia.pcf.zpaq_m4_b48           (from a post above, for comparison)
Code:
218.655.017  silesia.pcf  (-t-jbf -d0)
 39.039.373  silesia.pcf.zpaq_m4_b48
Last edited by schnaader; 18th December 2011 at 03:50.
With 46M for reflate and 38M for precomp, it didn't look very convincing before.
These gifs and bzip2 streams are troublesome for me though, as handling them is not even in the plan yet
Also, is it necessary to buffer a few MBs ahead to recompress bzip2?..
Also note that the GIF and bZip2 implementations are quite basic, so it wouldn't be that hard to create your own with similar or better results.
Code:
225.744.709  silesia.shar.pcf
 38.721.067  silesia.shar.pcf.zpaq_m4_b48
Combining the other way round (.pcf.shar) would be possible, too, but as Precomp partially processes some streams that it can't completely recompress, I guess the result is likely to get worse or to make reflate fail.
In theory, I think buffering one of the 100-900 KB blocks ahead should be enough, but I haven't looked at the bzlib internals close enough yet to be sure.
The bZip2 format is much better suited for recompression: it has a 3 byte file header ("BZh") followed by the compression "level" ("1"-"9" for block sizes of 100-900 KB), and even the blocks start with a magic string (BCD(pi), 0x314159265359).
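The layout described above is easy to probe. Here's a hedged sketch (my own, not part of any of the tools discussed) that reads the header and scans for block magics; note that blocks after the first are only bit-aligned, so the scan has to try every bit offset:

```python
# Sketch: locate bZip2 block starts. The 4-byte file header is "BZh"
# plus the level '1'-'9' (block size 100-900 KB); each block then
# starts with the 48-bit magic 0x314159265359 (BCD digits of pi).
# A sliding 48-bit window checks every bit offset in one linear pass.

BLOCK_MAGIC = 0x314159265359  # 48-bit block magic

def bzip2_info(data: bytes):
    """Return (block_size, bit offsets where block magics start)."""
    assert data[:3] == b"BZh" and b"1" <= data[3:4] <= b"9"
    block_size = (data[3] - ord("0")) * 100_000  # uncompressed block size
    window, starts, bitpos = 0, [], 0
    for byte in data:
        for i in range(7, -1, -1):               # feed bits MSB first
            window = ((window << 1) | ((byte >> i) & 1)) & 0xFFFFFFFFFFFF
            bitpos += 1
            # only report magics at/after bit 32 (past the file header)
            if bitpos >= 48 + 32 and window == BLOCK_MAGIC:
                starts.append(bitpos - 48)       # bit offset of magic start
    return block_size, starts
```

For recompression, finding the next magic bounds how far ahead one has to buffer: on the order of one compressed block, consistent with the guess above.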
I decided to create a new benchmark. To encourage open source, only open source programs will be listed. http://mattmahoney.net/dc/silesia.html
Right now there are just a few programs listed. More will come.
7-zip has its own bzip2 codec. Most of the streams will be recompressed 100% identically, although I already found some that differ.
The BTRFS people have been benchmarking LZ4 vs Snappy & Snappy-C on this corpus here. Given that they're putting them into kernel modules, they should be open source.
Yeah, still need to test some Linux compressors.
Have you considered adding timings for the total corpus, either compression, decompression or both? Maybe memory too?
I know it is more work, however right now a novice reader would wonder why anyone would use, say, gzip over paq8, although we know there is one large and compelling reason.
Since the largest file is only ~50 MB and the files are compressed individually, measuring performance shouldn't be a big problem on a laptop with 2 GiB RAM, except for the fastest programs.
The large text benchmark measures speed and memory usage. For most compressors, speed depends mostly on the input size, so there is no need to repeat the speed tests. Yes, it would be nice to have it all in one table. But ultimately it is not practical because when I upgrade to a new computer, all of the tests would need to be repeated. Compression ratios are machine independent.
Speed doesn't depend only on the input size, though. Counterexamples:
-having special modes for special data, i.e. skipping incompressible parts or using different models for different data.
-using algorithms whose pessimistic complexity is much different from the optimistic one: trees, sorting, etc.
And, in general, no, compression ratios are not machine-independent. For some compressors, more cores = splitting into more chunks. For some others, more free RAM = larger windows. Sadly, neither is likely to show up in SCC split into individual files, because each input is too small. This shows one flaw of the benchmark: an area where its results are going to differ from what happens in real use. Or maybe it's not a flaw, and the benchmark is just not supposed to show the strength of codecs in archiver usage. But then I don't really see what it is supposed to show.
I don't think any of the programs I tested have machine dependent compression ratios. At least if anyone finds one, they can report a discrepancy. If it depends on free memory or number of threads, there are usually options to set them.
Anyway, as I said in the other thread, my goal is to encourage research in extreme compression. Not because extreme compression is useful by itself, but because it leads to better practical compressors. It's why we have lots of CM compressors and not just PPM, even though PAQ is too slow to be useful.
Thus, open source only, and rank by size. paq8px_v69 is 2 years old. I predict that in a few months somebody will beat it.
This newsgroup is dedicated to image compression:
Can you suggest some options to improve on -m9?
What do you mean? Make it independent of memory/cores?
A lot of the freearc options are not well documented (at least in English), so I was hoping you could find some combination that would give better compression than just -m9. I'd prefer those that don't depend on memory or cores so results are repeatable. I would be testing in 32 bit Windows using 2 cores with 3 GB or 64 bit Linux with 4 cores and 4 GB.
Experimented with filtering/converting the file MR. It's a DICOM file with some header bytes followed by 19 grayscale pictures (512*512, 16-bit).
Renaming the file to mr.raw, opening it with IrfanView (8 bit grayscale, 1024*9737) and saving it as BMP gives a slightly better compression ratio and much better speed in paq8px:
But as this interprets the high and low bytes of the 16-bit grayscale values as "mixed" 8-bit grayscale values, I tried to split them into 2 "planes" and saved the result as a big 512*19474 image (containing the "even" bytes in the upper image half, the "odd" bytes in the lower one). This improves the compression ratio again. Results for paq8px and paq8pxd_v4:
Code:
mr                                   9.970.564
mr_8_bit_1024_9737.bmp               9.971.766
mr.paq8px_v69_4                      2.112.410  (2690 s)
mr_8_bit_1024_9737.bmp.paq8px_v69_4  2.038.773  ( 604 s)
I attached the BMP and two C++ source files (split.cpp, join.cpp) that do the mr <-> mr_split conversion, although the conversion is quite trivial. BMP conversion isn't done, but would be easy to add: as the image width is a multiple of 4, no scanline padding is needed, and only a ~1000 byte header and some trailing bytes have to be added.
Code:
mr_split_8_bit_512_19474.bmp               9.971.766
mr_split_8_bit_512_19474.bmp.paq8px_v69_4  1.922.196  (610.24 s)
mr_split_8_bit_512_19474.bmp.paq8pxd_v4_4  1.905.203  (649.02 s)
mr_split_8_bit_512_19474.bmp.paq8pxd_v4_8  1.894.843  (657.51 s)
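To make the "~1000 byte header" concrete: a minimal 8-bit grayscale BMP needs a 14-byte file header, a 40-byte BITMAPINFOHEADER and a 1024-byte grayscale palette, i.e. 1078 bytes. A hedged sketch of building one (my own code, not the attached sources; note that BMP stores rows bottom-up, which doesn't matter for compression experiments as long as split and join agree):

```python
# Sketch: build the header needed to wrap raw 8-bit grayscale data as
# a BMP. Width must be a multiple of 4 so no scanline padding is
# needed, as noted above.
import struct

def bmp8_header(width: int, height: int) -> bytes:
    assert width % 4 == 0, "scanline padding not handled in this sketch"
    # 256-entry grayscale palette: B, G, R, reserved per entry
    palette = b"".join(struct.pack("<BBBB", g, g, g, 0) for g in range(256))
    offset = 14 + 40 + len(palette)              # pixel data offset = 1078
    size = offset + width * height               # total file size
    file_hdr = struct.pack("<2sIHHI", b"BM", size, 0, 0, offset)
    # BITMAPINFOHEADER: size, w, h, planes, bpp, compression (0 = none),
    # image size, x/y pixels-per-meter, colors used, colors important
    info_hdr = struct.pack("<IiiHHIIiiII", 40, width, height, 1, 8,
                           0, width * height, 2835, 2835, 256, 0)
    return file_hdr + info_hdr + palette
```

Prepending `bmp8_header(512, 19474)` to the split planes (plus any trailing bytes) would give the kind of file used in the table above.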
Splitting the DICOM header and the image data would perhaps improve results slightly, but I haven't tried it so far.
Had no luck with other approaches, e.g. (simple) delta compression didn't work - the images are similar, but it seems there's too much noise.
Not sure about how to add those results to Matt's benchmark, or if they should be added at all. The benchmark is about compressing the mr file itself, not a filtered version. Perhaps creating (yet another) PAQ branch specialized on Silesia would be a good solution.
Last edited by schnaader; 28th April 2012 at 05:08. Reason: 19 grayscale pictures, not 38 - thanks Ethatron :)
Yes, this is exactly the kind of research I was hoping to encourage.
I was looking at the file osdb. The silesia docs say it comes from the file hundred.med from the OSDB benchmark. Actually it is a binary version of the comma separated text file asap.hundred from data-40mb from http://sourceforge.net/projects/osdb/files/
The text version has 10 fields filled with random values of different types like this:
Code:
99347,86340864,174,-368686869.00,616161616,616161616,7/17/1944,FhXTub:ZQN,5mbIXWAmfRRmuV5:ef6y,8WJ
27507,178321784,159,308080808.00,90909091,90909091,12/9/1990,oIPGrilOHc,t3FNs4t1EPl 2oifg1wf,ihZPaYgRcPtLx3
21036,832348324,113,-136363636.00,-555555556,-555555556,12/27/1900,EGE5CsbnqQ,DTpl1xgzarZEV:7oxO2D,yU
24466,281962820,101,-176767677.00,777777778,777777778,12/5/1908,Nnj.fJbN8Y,87:N7qBywYOxu pscbqM,822 aTeKPA08DTfyk1munPPNI6FZvfhb8TWoFxdJiG:ef28Fe6OBN675zVlVVXbDjvW3Q3Qr4WPFEYnf
asap.report describes it thus:
Code:
Hundred Relation:
Attribute  Dist.    Min          Max         # Uniques  Comments
=========  =======  ===========  ==========  =========  ========
key        Uniform  0            100000      100000     '1' missing
int        Uniform  0            1000000000  100000     '1' missing
signed     Uniform  100          199         100
float      Uniform  -500000000   500000000   100
double     Uniform  -1000000000  1000000000  100
decim      Uniform  -1000000000  1000000000  100        same as "double"
date       Uniform  1/1/1900     12/31/1999  36524
code       Uniform  1            1000000001  100000     min/max = rnd seed
name       Uniform  1            1000000001  100        min/max = rnd seed
address    Uniform  5            995         100        min/max = rnd seed, no fill
The binary version codes numbers as 4 or 8 bytes. Dates are strings. Record lengths are variable. Each record has a 6 byte header apparently giving the string lengths. I didn't completely analyze how to parse it.
The OSDB data distribution omits the code that generated the data, but I found an old version here: http://www.koders.com/c/fid99EBA91CB...B1E.aspx?s=sql
It differs in that it generates strings containing only digits and upper case letters. As you can see, the actual data contains lower case, ":", and spaces. Also, the date range is slightly different. The distribution in some cases is not random, but rather a permutation to avoid duplicates.
It apparently calls a random number function ran1() which might be http://ciks.cbt.nist.gov/bentz/flyash/node14.html
I calculated the entropy of osdb. It is 1910 KB if you assume a true random number generator. The best known compression is 2069 KB by paq8pxd_v4 -8.
As I mentioned earlier, osdb is a synthetic database file. It is the binary conversion of asap.hundred from the OSDB benchmark. I could not find any documentation on the binary format so I reverse engineered it. The file consists of 100,000 records with 10 fields each. Each record begins on a 4 byte boundary and is padded with trailing 0 bytes as needed. Each record is preceded by a header in one of the two formats:
1 0 length fmt1 fmt0 (if length = 1 (mod 4))
3 0 length pad fmt1 fmt0 (otherwise)
pad is 0, 1, or 2 as required to make the record length including header a multiple of 4. The length byte is the number of bytes that follow, including the rest of the header but not including padding 0 bytes. I believe the second byte (0) is the high byte of the 16 bit length, but it is always 0. fmt1, fmt0 is a 16 bit format as follows:
The 3 x bits indicate the format of the key, int, and address fields. The 10 fields are as follows:
key: 4 byte integer, range 0,2..99999 (LSB first).
int: 4 byte integer, range 0,2..1000000000.
signed: 4 byte integer, range 100..199.
float: 4 byte IEEE float, range -5e8..5e8.
double: 4 byte IEEE float, range -1e9..1e9.
decim: variable string, range "-1000000000.00"..."1000000000.00".
date: variable string, range "1/1/1900"..."12/31/1999".
code: 10 byte string.
name: 20 byte string.
address: 80 byte or variable length string.
Bits 9, 8, and 1 of the fmt1-fmt0 are usually 0, 0, 1, respectively. If bit 9 is 1 then the key is omitted and assumed 0. If bit 8 is 1 then int is omitted and assumed 0. These happen only once each out of 100000 records. If bit 1 is 0 then address has a length of 80. This happens 1000 times. Otherwise it is a variable length string. Variable length strings are preceded by a byte that gives the number of bytes to follow. Integers and floats are 32 bits, LSB first (Intel x86 format).
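The header and field-presence logic above can be sketched as code. This is an illustrative Python rendering of the reverse-engineered layout described in this post, not a verified parser; the function and variable names are my own:

```python
# Sketch: parse one osdb record header. Header formats (from above):
#   1 0 length fmt1 fmt0        (if length = 1 (mod 4))
#   3 0 length pad fmt1 fmt0    (otherwise)
# length counts the bytes that follow, including the rest of the
# header but not the padding 0 bytes.

def parse_header(buf: bytes, pos: int):
    """Return (length, pad, fmt, has_key, has_int, fixed_address)."""
    kind = buf[pos]
    if kind == 1:                        # no pad byte in the header
        length, pad, fmt_off = buf[pos + 2], 0, pos + 3
    elif kind == 3:                      # explicit pad byte
        length, pad, fmt_off = buf[pos + 2], buf[pos + 3], pos + 4
    else:
        raise ValueError("unknown header type %d" % kind)
    fmt = (buf[fmt_off] << 8) | buf[fmt_off + 1]    # fmt1, fmt0
    has_key       = ((fmt >> 9) & 1) == 0  # bit 9 set -> key omitted (0)
    has_int       = ((fmt >> 8) & 1) == 0  # bit 8 set -> int omitted (0)
    fixed_address = ((fmt >> 1) & 1) == 0  # bit 1 clear -> 80-byte address
    return length, pad, fmt, has_key, has_int, fixed_address
```

A full record parser would continue from here: read the 4-byte LSB-first integers and floats, and length-prefixed bytes for the variable strings, per the field list above.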
Each key is unique. There are 100000 possible keys. Thus, the entropy is log2(100000!) = 1,516,704 bits.
Each int is unique. Since there would be only a few expected collisions, these can be modeled as random, giving 100000 * log2(1000000000) = 2,989,735 bits. An exact calculation with the uniqueness constraint is log2(1e9! / (1e9-1e5)!) which is 7 bits less.
There are 100 possible values of signed, float, double, decim, name, and address. Each appears exactly 1000 times. The double and decim fields always have the same value. However the other 5 fields are independent. Thus, the entropy is 5 log2(100000! / 1000!^100) = 5 * 663764 = 3,318,820 bits.
Each date appears either 2 or 3 times. There are 36524 days from 1/1/1900 to 12/31/1999 (in month/day/year format) including leap years 1904..1996 every 4 years. Thus, 9572 dates appear twice and 26952 dates appear 3 times. The two sets are not random. Dates occurring twice are spaced every 36524/9572 = 3.8157 days. Thus, the entropy is log2(100000!/(2^9572 6^26952)) = 1,437,462 bits.
Codes are 10 bytes chosen randomly from a 64 character alphabet [A-Za-z.:]. Thus, each field is 60 bits, or 6,000,000 bits total.
99 of the 100 names are 20 bytes chosen randomly from a 64 character alphabet like codes except that it uses spaces instead of periods. One name is "THE+ASAP+BENCHMARKS+". Thus, the entropy is 20*100*6 = 12000 bits.
99 of the 100 addresses are chosen from the same alphabet as names. The other is "SILICON VALLEY". The length varies from 2 to 80. Only one address has length 80. The average length is 19.99 bytes for a total of 11994 bits.
The 100 values for float, double, and decim are uniformly distributed at intervals of 1/99 of their range and rounded to the nearest integer. Thus, no additional information is needed to represent them. Note that double is not really a double, but a 32 bit float. A 32 bit float does not have enough precision to represent the values to within the nearest integer as it would appear in asap.hundred. Furthermore, the decim values are string representations of the exact integer value followed by ".00". Thus, when the number 90909091 appears in the double and decim fields of asap.hundred, the representation in osdb is 90909088.0 (as a 32 bit float) and "90909091.00" (as a string of length 11).
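The precision claim is easy to verify: at magnitudes around 9e7, a 32-bit float (24-bit significand) has a ulp of 8, so values round to the nearest multiple of 8. A quick round-trip check:

```python
# Round-trip a value through IEEE 754 single precision to see what a
# 32-bit float actually stores.
import struct

def to_float32(x: float) -> float:
    return struct.unpack("<f", struct.pack("<f", x))[0]
```

Here `to_float32(90909091.0)` returns `90909088.0`, matching the osdb double field value mentioned above.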
Thus, the total entropy of osdb is 15,286,715 bits = 1,910,839 bytes.
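As a sanity check on the arithmetic, the per-field figures above can be re-derived numerically. This just re-traces the calculations in this post (log2 n! = lgamma(n+1)/ln 2); the tolerances absorb the per-term rounding:

```python
# Re-derive the osdb entropy estimate from the field analysis above.
from math import lgamma, log, log2

def log2_fact(n: int) -> float:
    """log2(n!) via the log-gamma function."""
    return lgamma(n + 1) / log(2)

assert 2 * 9572 + 3 * 26952 == 100_000  # date multiplicities add up

key   = log2_fact(100_000)                       # permutation of all keys
ints  = 100_000 * log2(1_000_000_000)            # ~random unique ints
# which of 100 values each record gets, for the 5 independent fields
five  = 5 * (log2_fact(100_000) - 100 * log2_fact(1000))
dates = log2_fact(100_000) - 9572 * log2_fact(2) - 26952 * log2_fact(3)
codes = 100_000 * 10 * 6                         # 10 chars, 6 bits each
names = 100 * 20 * 6                             # the 100 name strings
addrs = 11_994                                   # avg 19.99 bytes * 6 bits

total_bits = key + ints + five + dates + codes + names + addrs
```

The sum agrees with the 15,286,715 bits = 1,910,839 bytes stated above, up to the rounding of the individual terms.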
Last edited by Matt Mahoney; 1st May 2012 at 23:57. Reason: "length = 3 (mod 4)" -> "1 (mod 4)"
Who can argue with that