I've been working on benchmarking lately, and it seems to me that our current data sets are pretty inadequate. That thought has been gnawing at me ever since I read cbloom's post at http://encode.ru/threads/1117-LZHAM?...ll=1#post42326, and I think it's time to stop thinking about it and actually do something, so I'm considering putting together a new corpus. I'd like to do this openly, since I'm sure many of you will have thoughts on what should be included.
First off, everything would have to be redistributable. I think requiring an open-source style license could be detrimental here, and I don't think it is necessary. As long as people can redistribute the files for benchmarking purposes, I think they will be happy, and if we don't grant the right to incorporate the content into other products, we might be able to get some decent data which would otherwise not be released.
So the question is: what should be included? I'm really looking for suggestions here, but here are some initial thoughts:
Plain Text and/or HTML
It's pretty easy to find freely available text we could incorporate, and I think some should be included (maybe a book or two from Project Gutenberg, one in English and one not; Chinese, maybe?), but current benchmarks tend to be very text-heavy, and I don't want to fall into that trap.
The only log file I can currently think of in a compression benchmark is fp.log from the Maximum Compression benchmark, but I'm very uncomfortable using data from that benchmark: I don't see a license, and it includes things like a DLL from MS Office…
I just spoke to a friend about using log data from valadoc.org, and he is willing to share it. I think this would be good since there isn't really much of a privacy concern, though I was thinking I would still anonymize IP addresses just to be sure. I may also be able to get data from one of the GNOME sites; I believe developer.gnome.org is the most popular, and it should be similarly free of privacy concerns.
I was thinking something like 16 MiB of data here would be about right. What does everyone else think?
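For the IP anonymization mentioned above, here is a minimal sketch of one way it could be done. It assumes the logs are plain text containing dotted-quad IPv4 addresses; the function name `anonymize_ips` and the choice of a keyed hash mapped into 10.0.0.0/8 are my own illustration, not an agreed procedure. Each real address maps to the same pseudonym every time, so the repetition structure that a compressor sees is roughly preserved.

```python
import hashlib
import hmac
import re

# Matches dotted-quad IPv4 addresses. IPv6 is not handled, and anything that
# merely looks like an address (e.g. a four-part version string) also matches.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize_ips(line, key):
    """Replace each IPv4 address in `line` with a stable pseudonym.

    `key` is a secret byte string (hypothetical; it should be discarded after
    the corpus is built so the mapping cannot be recomputed).
    """
    def pseudonym(match):
        digest = hmac.new(key, match.group(0).encode("ascii"),
                          hashlib.sha256).digest()
        # Only 24 bits of the digest are used, so rare collisions are possible.
        return "10.{}.{}.{}".format(digest[0], digest[1], digest[2])

    return IPV4_RE.sub(pseudonym, line)
```

Usage would be something like `anonymize_ips(line, key)` applied to every line of the log before it goes into the corpus. Note that the pseudonyms can differ in length from the originals, so the file size will shift slightly.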
Binary log files
I was thinking about an uncompressed binary log file from systemd. It might be a bit of a pain to go through it and remove any personally identifiable information, but I'd be willing to do it. Currently, my system log is 16 MiB (though I also have an old 32 MiB backup) and my user log is 48 MiB, though they are compressed; I'm not sure which algorithm is in use.
The idea here is to download a disk image (like maybe one for the Raspberry Pi), mount it, then create a tarball of all the files in it. This would give us a tarball of lots of small files of various types (executables, configuration data, documentation), again with no privacy concerns.
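As a rough sketch of the tarball step, assuming the image has already been mounted somewhere: the function name `make_corpus_tarball` and the decision to zero out ownership and timestamps are my own assumptions, made so that two people building from the same image get a byte-identical file.

```python
import os
import tarfile


def _normalize(info):
    # Strip metadata that varies by machine so the tarball is reproducible.
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    return info


def make_corpus_tarball(root, out_path):
    """Write every non-directory entry under `root` to an uncompressed tar.

    Entries are added in sorted order. Directory entries themselves (and
    symlinks that point to directories) are skipped in this sketch.
    """
    with tarfile.open(out_path, "w") as tar:  # "w" means no compression
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # makes the traversal order deterministic
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, root)
                tar.add(full, arcname=arcname, recursive=False,
                        filter=_normalize)
```

Something like `make_corpus_tarball("/mnt/image", "image.tar")` would do it, where both paths are placeholders.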
A decently sized executable… Firefox or LibreOffice, maybe? Not the installer, the program itself. And/or a DLL/.so/.dylib or two?
I'm not really sure what game data entails, so I could definitely use help here. I'm guessing:
- Compressed textures
- Raw textures?
- Map data
If you're in the game development community, please speak up. Even if you can't release any data, if you can tell us about the data you are interested in compressing maybe we can find some from another source (like an open-source game).
Word Processing, Spreadsheets, etc.
A PDF. I believe most of these are compressed with deflate; I'm not sure if we would want to decompress them first.
MS Office/LibreOffice/OpenOffice.org files, etc. AFAIK ODFs are zip files, so maybe unzip them and create a tarball of the contents.
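The unzip-then-tar idea could look roughly like the sketch below. The name `zip_to_tar` is hypothetical, and it simply assumes the input is a valid ZIP container; it inflates each member and stores the raw bytes in an uncompressed tar, so the benchmark sees the XML and other content rather than deflate output.

```python
import io
import tarfile
import zipfile


def zip_to_tar(zip_path, tar_path):
    """Repack a ZIP container (e.g. an .odt file) as an uncompressed tar."""
    with zipfile.ZipFile(zip_path) as zf, tarfile.open(tar_path, "w") as tar:
        for member in zf.infolist():
            if member.is_dir():
                continue  # only file contents matter for this purpose
            data = zf.read(member)  # decompresses the member
            info = tarfile.TarInfo(name=member.filename)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Members are written in the order they appear in the ZIP, and timestamps are left at the `TarInfo` default of zero, so the result should not depend on when it was built.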
For databases, there are two interesting use cases (that I can think of). The obvious one is backups for things like MySQL, PostgreSQL, etc. The MusicBrainz database could be a good source here.
The other use case is a piece of data from a database which stores its contents compressed on disk. For example, LevelDB compresses chunks of data (IIRC they are each 4 KiB, but I'm not certain) before persisting them to disk.
RPC requests/responses / data packets
Small chunks of data which are often compressed. Some JSON, and maybe some Protocol Buffers?
Data from a mobile app
Maybe just the contents of an APK. We should talk to some mobile app developers to get a feel for what would be useful for them.