View Poll Results: Do we need a new corpus?

Voters: 12
  • Yes: 9 (75.00%)
  • No: 3 (25.00%)

Thread: Compiling a new corpus

  #1 nemequ (Member, United States)

    Compiling a new corpus

    I've been working on benchmarking lately, and it seems to me that our current data sets are pretty inadequate. That thought has been gnawing at me ever since I read cbloom's post at http://encode.ru/threads/1117-LZHAM?...ll=1#post42326, so I've decided to stop just thinking about it and actually do something about it: put together a new corpus. I'd like to do this openly, since I'm sure many of you will have thoughts on what should be included.

    First off, everything would have to be redistributable. I don't think an open-source-style license is necessary here, and requiring one could even be detrimental. As long as people can redistribute the files for benchmarking purposes I think they will be happy, and by not granting the right to incorporate the content into other products we might be able to get some decent data which would otherwise not be released.

    So the question is: what should be included? I'm really looking for suggestions here, but here are some initial thoughts:

    Plain Text and/or HTML

    It's pretty easy to find freely available text we could incorporate, and I think some should be included (maybe a book or two from Project Gutenberg, one in English and one not… Chinese, maybe?), but current benchmarks tend to be very text-heavy, and I don't want to fall into that trap.

    Log files

    The only log file I can currently think of in a compression benchmark is fp.log from the maximum compression benchmark, but I'm very uncomfortable using data from that benchmark—I don't see a license, and it includes things like a DLL from MS Office…

    I just spoke to a friend about using data from valadoc.org, and he is willing to share it. I think this would be good since there isn't really much of a privacy concern, though I was thinking I would still anonymize IP addresses just to be sure. I may also be able to get data from one of the GNOME sites—I believe developer.gnome.org is the most popular, and it should be similarly lacking in privacy concerns.

    I was thinking something like 16 MiB of data here would be about right; what does everyone else think?
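
    For the anonymization step I mentioned above, I'm picturing something like this rough, untested sketch (the file names are placeholders, and it assumes the client address shows up as a plain IPv4 field in each line):

    Code:
    # Rough sketch of an IP-anonymization pass over a text web log.
    # "access.log" / "access.anon.log" are placeholder names.
    import hashlib
    import re

    IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

    def pseudonymize(ip: str) -> str:
        # Stable pseudonym: per-client request patterns survive,
        # the original address does not.
        return "ip-" + hashlib.sha1(ip.encode()).hexdigest()[:8]

    with open("access.log") as src, open("access.anon.log", "w") as dst:
        for line in src:
            dst.write(IPV4.sub(lambda m: pseudonymize(m.group(0)), line))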

    Binary log files

    I was thinking about an uncompressed binary log file from systemd. It might be a bit of a pain to go through it and strip out any personally identifiable information, but I'd be willing to do it. Currently, my system log is 16 MiB (though I also have an old 32 MiB backup) and my user log is 48 MiB, though they are compressed—I'm not sure which algorithm is in use.

    Linux distribution

    The idea here is to download a disk image (like maybe one for the Raspberry Pi), mount it, then create a tarball of all the files in it. This would give us a tarball of lots of small files of various types (executables, configuration data, documentation), but again no privacy concerns.
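
    Once the image is mounted, producing the tarball could look roughly like this sketch (the mount point is a placeholder; normalizing owner and mtime just keeps the result from depending on who built it):

    Code:
    # Sketch: pack a mounted distribution image into a reproducible tarball.
    # "/mnt/raspbian" is a placeholder mount point.
    import tarfile

    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Strip host-specific metadata so the tarball doesn't depend on
        # the machine it was created on.
        info.uid = info.gid = 0
        info.uname = info.gname = "root"
        info.mtime = 0
        return info

    with tarfile.open("distro.tar", "w") as tar:
        tar.add("/mnt/raspbian", arcname="distro", filter=normalize)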

    Executables

    A decently sized executable… Firefox or LibreOffice, maybe? Not the installer, the program itself. And/or a DLL/.so/.dylib or two?

    Game data

    I'm not really sure what this entails; I could definitely use help here. I'm guessing:

    • Compressed textures
    • Raw textures?
    • Map data

    If you're in the game development community, please speak up. Even if you can't release any data, if you can tell us about the data you are interested in compressing maybe we can find some from another source (like an open-source game).

    Word Processing, Spreadsheets, etc.

    A PDF—I believe most of these are compressed with deflate; I'm not sure whether we would want to decompress them first.

    MS Office/LibreOffice/OpenOffice.org files, etc. AFAIK ODFs are zip files, so maybe unzip them and create a tarball of the contents.
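
    The unzip-and-tar step for those could be as simple as this sketch (the document name is a placeholder):

    Code:
    # Sketch: turn a zip-based office document (ODF/OOXML) into an
    # uncompressed tarball of its members, so the corpus entry isn't
    # already deflated. "report.odt" is a placeholder filename.
    import io
    import tarfile
    import zipfile

    with zipfile.ZipFile("report.odt") as z, \
         tarfile.open("report.odt.tar", "w") as tar:
        for name in z.namelist():
            if name.endswith("/"):          # skip directory entries
                continue
            data = z.read(name)
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))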

    Databases

    There are two interesting subsets here (that I can think of). The obvious one is backups for things like MySQL, PostgreSQL, etc. The MusicBrainz database could be a good source here.

    The other use case is a piece of data from a database which stores its contents compressed on disk. For example, LevelDB compresses chunks of data (IIRC they are each 4 KiB, but I'm not certain) before persisting them to disk.
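
    To illustrate why that matters for a benchmark, the access pattern is block-at-a-time rather than whole-file, roughly like this (zlib and the 4 KiB block size are only stand-ins, not what LevelDB actually uses internally):

    Code:
    # Illustration of block-at-a-time compression: each small chunk is
    # compressed independently, so codec startup cost and small-input
    # behaviour matter much more than on a whole file.
    import zlib

    BLOCK = 4096

    def compress_blocks(data: bytes) -> list[bytes]:
        return [zlib.compress(data[i:i + BLOCK])
                for i in range(0, len(data), BLOCK)]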

    RPC requests/responses / data packets

    Small chunks of data which are often compressed. Some JSON, and maybe some Protocol Buffers?

    Data from a mobile app

    Maybe just the contents of an APK. We should talk to some mobile app developers to get a feel for what would be useful for them.


  #2 m^2 (Member, Ślůnsk, PL)
    I'd start from scc and extend it as there's nothing outdated in it.

    I've been thinking about asking my employer for some test server logs. They have a somewhat unusual structure and I'd like to see how compressors work on them.
    You definitely need structured data, I suggest having both html/xml and json in large sizes. Extra small json for packets is a fine choice.
    I'd add some scientific data. DNA is in fashion. If you can get it, I suggest medical images: they are 3D, heavily compressible, and users are often wary of lossy compression (actually I've encountered fear of lossless too) because of legalese.

  #3 nemequ
    Quote Originally Posted by m^2 View Post
    I've been thinking about asking my employer for some test server logs. They have a somewhat unusual structure and I'd like to see how compressors work on them.
    This brings up a really good point. I'm certainly open to discussion, but I think this is the wrong way to go. I don't want to include interesting data; I want uninteresting data. I want data that is representative of what people are commonly trying to compress.

    Quote Originally Posted by m^2 View Post
    I'd start from scc and extend it as there's nothing outdated in it.
    Keeping the above in mind, I'm not sure how good Silesia is:

    • dickens — I think something more modern would be more appropriate. I'm not sure how different this would really be since Dickens isn't that different from more modern English, but if possible it would be nice to go with something published in the last few years. Maybe Free Culture by Lawrence Lessig (one of the few recent books which we could redistribute)?
    • mozilla — Tru64 UNIX? Also, AFAICT this is SeaMonkey not Firefox—why not go with something more modern, like 64-bit FF for Windows or Linux?
    • mr — I really don't know enough about medical data to be trusted here, but I think this is pretty dated… http://mridata.org looks interesting, but we should talk to some people who really know this stuff to see what is common these days.
    • nci — Pretty much the same comment as for mr
    • ooffice — Probably pretty accurate still, but why not update to a more modern version of either OO.o or LibreOffice?
    • osdb — I would have to look at this in more detail, but I'm wary of synthetic data. MusicBrainz is free, available, and definitely real-world data.
    • reymont — Love the idea of non-English text, but I think it would be wise to choose Chinese (Mandarin, I guess?) instead of Polish. A few more people speak Mandarin than Polish, so there should be a lot more Mandarin data in need of compression.
    • samba — Probably a good idea to include something like this. As a C developer it pains me to say it, but maybe a C++ project would be a better idea.
    • sao — Seems like a bit of a niche, though maybe not a bad representation of structured data.
    • webster — A dictionary might not be bad, but something more structured would be nice.
    • xml — Probably ok
    • x-ray — same situation as mr

    Quote Originally Posted by m^2 View Post
    You definitely need structured data, I suggest having both html/xml and json in large sizes.
    Agreed. Maybe some geographic data? OpenStreetMap.org?

    Quote Originally Posted by m^2 View Post
    Extra small json for packets is a fine choice.
    Agreed—that's what I meant by the "RPC requests/responses / data packets" item above. This shouldn't be hard to find for JSON—any free software with a RESTful API could probably be a good source. Protobuf might be a bit harder, but I think it would be a good idea.

    Quote Originally Posted by m^2 View Post
    I'd add some scientific data. DNA is in fashion. If you can get it, I suggest medical images: they are 3D, heavily compressible, and users are often wary of lossy compression (actually I've encountered fear of lossless too) because of legalese.
    Sounds okay; we should find out what the common data formats are and try to get hold of some good sample data. That said, I'm a bit worried about this being more of a niche that isn't all that interesting to most people. AFAIK there are algorithms designed specifically for compressing DNA; is there really any purpose in trying to compete with those? Are there also algorithms designed specifically for medical images? If so, same question.

  #4 SvenBent (Member, Denmark)
    I use an 800 MB Final Fantasy XI online log file, but it contains personal conversations, so I can't share it without going through it and snipping those out.
    Maybe look at http://www.ffxiah.com/ though: it has a shout box that shows shouts from the game, which have nearly the same structure as the game log. You might find a way to get that data out of the site.

    I like to use the log file because it's kind of a dirty text: it has typical typing/spelling errors, a lot of the monster names are anagrams of real names from myth, it is mostly English but sometimes has foreign languages mixed in, and since it is made up of thousands of different sources (humans plus system messages) it has a somewhat more random structure. So it favors more adaptable compression schemes.

  #5 gpnuma (Member, France)
    Not much to say, except I completely agree with you guys and support the initiative.

  #6 nemequ
    Quote Originally Posted by gpnuma View Post
    Not much to say, except I completely agree with you guys and support the initiative.
    Thanks, that's really quite helpful to hear. I've added a poll to the thread—anyone who doesn't really have anything to add to the discussion but either supports or opposes the idea, please take part.

  #7 nemequ
    Quote Originally Posted by SvenBent View Post
    I use an 800 MB Final Fantasy XI online log file, but it contains personal conversations, so I can't share it without going through it and snipping those out.
    Maybe look at http://www.ffxiah.com/ though: it has a shout box that shows shouts from the game, which have nearly the same structure as the game log. You might find a way to get that data out of the site.

    I like to use the log file because it's kind of a dirty text: it has typical typing/spelling errors, a lot of the monster names are anagrams of real names from myth, it is mostly English but sometimes has foreign languages mixed in, and since it is made up of thousands of different sources (humans plus system messages) it has a somewhat more random structure. So it favors more adaptable compression schemes.
    I don't think we can use that since I don't see anything about licensing. That said, it looks pretty similar to an IRC log, and should also be similar to other chat-like things (IM, facebook, twitter, SMS, etc.), so it could be an interesting piece of data to include. I'm thinking a day from a reasonably popular chat room—something like ##javascript on freenode, maybe?

  #8 Member from Bothell, Washington, USA
    While my main work is in designing medical imaging hardware, I know enough to say that DICOM (.dcm) is used by almost all hospitals and supported by all major suppliers. It tends to leverage existing compression standards such as JPEG and MPEG, but there are also other formats and the data can be stored uncompressed. Some companies use lossy compression, although that is more the exception than the rule.

  #9 dnd (Member, Worldwide)

    The Encode Compression Corpus

    It would be very nice if testers could use a common corpus when posting to encode.ru.
    - Very often testers are using private or non-public datasets. Ex. MOC or benchmarks from cbloom.
    - A corpus should not be dominated by one or two files like silesia (mozilla/webster) or 10GB benchmark (30% DNA).
    - A corpus should not give too much advantage to some compressors with special filters (Ex. wav files or lots of duplicate files).
    - Data files should be large enough, especially for performance testing (Ex. 100MB-1GB)
    - Avoid using incompressible files (Ex. MOV files) or highly compressible files. A web log can be an exception.
    In general a ratio of 20-40% obtained with 7z is ok.
    - Try using data files from academia

    We can split a corpus into categories:
    - Text files incl. xml/html. 50-70% English (like enwik9) and the rest in other languages.
    - Binary files like the files used by cbloom.
    - Popular applications (installed). Ex. acroread, libre-office, chrome
    - Popular file types: ms-office, Android, exe, ...
    - web logs
    - source code: C/C++/Java/JavaScript, Python, ...
    - VM image. Ex. an Ubuntu distribution
    - Games (installed) with public download.

    - special category: dna, databases, images, audio, ...

    - Some pointers:
    - https://github.com/caesar0301/awesome-public-datasets
    - Pizza&Chili Text Corpus: http://pizzachili.dcc.uchile.cl/texts.html
    - Census, DNA, MingW, Wikipedia: http://acube.di.unipi.it/bc-zip/
    - Compressionratings + Squeeze chart : http://compressionratings.com/download.html
    - enwik8/enwik9: http://mattmahoney.net/dc/textdata.html
    - DNA Corpus: http://people.unipmn.it/manzini/dnacorpus/
    - "Classic reference file sets" + "misc executables and media files": http://peazip.sourceforge.net/peazip...sors_benchmark
    - Use google to search for public domain file types: Ex. "filetype:doc site:gov"
    https://www.google.de/search?q=filet...doc+site%3Agov
    Last edited by dnd; 27th March 2015 at 14:06.


  #10 Gonzalo (Member, Argentina)
    As a matter of fact, I absolutely agree with the initiative, with one *very* big side comment:
    Just as there's no "universal compressor", there can't be a "universal corpus" or benchmark, because most algorithms / libraries / archivers are oriented toward different uses.
    Some are targeted at ordinary people who never see medical or scientific data like DNA. Others are coded just with the aim of getting the very best possible ratio. Yet others, to win a competition.
    So a corpus with just 5% DNA would introduce an unwanted slant in a test of WinRar, while a corpus with a lot of exe code would be very interesting for 7z or FA but unfair to a library such as density.
    So I don't think a single corpus can be "just fine" for everybody. Therefore, we should consider building a few of them.

  #11 Gonzalo
    When I have to decide which compressor to use in a project, I test on real folders from my computer. For example, I like to see whether a packer will do well on several program installations. The more types of data the better. I like to put together logs, jpgs, exes, some wav, text in different languages, some non-compressible files, bmps, and so on. It is also very important for me to see whether a compressor is "smart" enough to recognize the data and take a different approach for each type. For example, a precomp'd distro has a huge amount of different kinds of data inside while being just one file. A good compressor is capable of choosing a good method for each block... Some *PAQ variants are a good example.


    Of course this is "just me" and may not be representative at all. Perhaps it is for the end user, though.
    Last edited by Gonzalo; 27th March 2015 at 18:26.

  #12 Matt Mahoney (Expert, Melbourne, Florida, USA)
    It really depends on your target application. Are you using compression to make backups or software installers or something else? In the first case, compression speed is more important than decompression speed. In the second case, it is the opposite. Also for backups you are going to have a lot of duplicate files and a lot of already compressed files. In the second case, that's less likely. Most benchmarks like Calgary and Silesia and maximumcompression are not targeted to performance for backups. I designed 10GB to look more like a typical backup, although it still has a lot more text and DNA than most disks.

    In my experience at Ocarina/Dell, we had a lot of customers who were producing terabytes of mostly one kind of data. These all required specialized compressors. These included:

    - Oil companies producing seismic data in SEG-Y or Steim-1/2 format. The data consisted of 32 bit floats in either IEEE or the older IBM (base 16) format like calgary/geo. Steim-1/2 was lightly compressed by delta coding and packing into 8, 16, or 32 bit values as appropriate.

    - Movie/TV studios producing massive sets of high resolution uncompressed video either from a camera or animation. These were in a variety of formats: MXF, RLA, SPM, TIFF, MJPEG.

    - Genomics companies producing massive data sets from DNA sequencers. The first stage is taking huge 8 bit grayscale TIFF images of the arrays containing tens of millions of dots about 6 pixels across, each one representing one base of one DNA read. These are refined to FASTQ or SRF (base, quality, header) then FASTA (sequenced DNA), and BAM/SAM (alignment to reference genomes).

    - Photo sharing and social networking sites with trillions of JPEG images. JPEG can be compressed losslessly by about 30%. Some customers also wanted lossy compression. They also need to respond quickly to requests for images in different sizes.

    - Medical images in DICOM and lossless JPEG2000 format.

    - Speech in compressed GSM format.

    - I didn't work on database compression, but of course this has its own special set of requirements.

    For a typical job, I would get a portable hard disk containing hundreds of GB of proprietary data, sometimes in a proprietary or undocumented format, or sometimes in a standard format where you had to buy the documentation from some standards group. I would then have to analyze the data, writing experimental programs to run statistical analysis to determine where the redundancies were. For a standard benchmark, it is likely that others have already done this analysis. For example, you can read about my analyses of the Calgary corpus and enwik9. For the type of work I was doing, you have to do the analysis yourself. You end up writing something very specialized. A SEG-Y or FASTQ or TIFF compressor that works on data from one company won't work on data from another company, even if they adhere to the standards, because the compression depends on the redundancies introduced by their work flow.
    Last edited by Matt Mahoney; 27th March 2015 at 19:51.


  #13 Paul W. (Member)

    corpus questions and suggestions

    One basic question is how much you want the corpus to reflect specific data formats that are in common use nowadays, vs. a broad and varied enough sample that you'll see all the usual kinds of regularities, even if the specifics and details are different.

    So, for example, most of the time it doesn't bother me much that the Mozilla distribution in Silesia is for a RISC architecture that isn't common nowadays---and in some ways I like it that way. I'm usually less interested in algorithms tuned very specifically to particular instruction sets (e.g., ARM V7 with NEON) or very specific data formats (e.g., DICOM or FASTA) than in algorithms that can find basic and common regularities for themselves and model them reasonably.

    One of the things that's nice about LZMA, for example, is that it's pretty good on most texty data AND on a lot of "binary" structured data---it detects the usual Markov regularities and some very common non-Markov regularities (common strides, etc.), too, WITHOUT being tuned to specific data formats.

    The fact that you can beat LZMA on any particular kind of data if you have an algorithm tuned to that specific kind of data isn't surprising, but if you're doing that sort of thing, you probably shouldn't be using a "general" compression corpus for comparisons.

    I realize there's a counterargument to that---ideally, the corpus would be very representative of real data in real computers and have the right mix of all the most common data formats and instruction sets, plus the right number of curveballs (unusual data formats), etc. Then you could use it to compare "heroic" algorithms that have a bunch of ad hoc features for modeling e.g., specifically English text or 32- vs 64-bit x86 or ARM code, decompressing compressed images and recompressing them better, etc.

    In general there's a basic issue of what a corpus is for---is it to make bottom-line overall comparisons and declare a winner, or is it to provide a reasonable varied sample so that you can tell what algorithms are good for what kinds of data. The latter is easier, and in general the former is impossible to do really well. We don't have the data to tell what a truly representative corpus would look like, and even if we did, few users' actual needs would be representative.

    ---

    In terms of text, I think it would be good to have something like a third of the text samples in English and other Latin-1 European languages, a third in common non-Western alphabetic languages (e.g., Russian, Korean (hangul), Hindi, and Arabic), and a third in ideographic languages, e.g., Mandarin and Japanese.

    The 12 most-spoken languages (and their written scripts, language families, and millions of speakers) are

    1. Mandarin (Chinese Characters, Sino-Tibetan, 1151M)
    2. English (Latin, Indo-European, 1000M)
    3. Spanish (Latin, Indo-European, 500M)
    4. Hindi (devanagari, Indo-European, 490M)
    5. Russian (Cyrillic, Indo-European, 270M)
    6. Arabic (Arabic, Afro-Asiatic, 255M)
    7. Portuguese (Latin, Indo-European, 240M)
    8. Bengali (Bengali, Indo-European, 215M)
    9. French (Latin, Indo-European, 200M)
    10. Malay, Indonesian (Latin, Malayo-Polynesian, 175M)
    11. German (Latin, Indo-European, 166M)
    12. Japanese (Chinese Characters and 2 alphabets, Altaic, 132M)

    (A list of the top 30 is here: http://www.krysstal.com/spoken.html )

    I think that's a pretty nice mix of Indo-European and non-Indo-European languages, Germanic and Romance (European) languages, alphabetic vs. ideographic scripts, etc. It also covers the majority of people in the world, and all languages whose popularity is within a factor of 10 of Mandarin Chinese.

    It's not representative of what's in actual computers now, though, because languages like English, Japanese and Korean are way overrepresented among computer-users, and many people use English or Russian on the Internet even if their native language is Finnish or Polish or whatever. I'm not sure what to do about that, but one thing to keep in mind is that as computers get cheaper and more available, that will change somewhat. (People may still use the most popular languages when communicating with foreigners, but also have more people online to talk to in their native languages. And more people will get online who don't know any of the popular "International" languages.)

    ---

    When it comes to executables, I think it would be good to have some files that are only the actual code segments from exe or dll or apk files, and others that are the whole files---that will make it clearer whether an algorithm is compressing actual code well, or is mostly compressing all the other stuff in an executable. (Many executable files are only about half actual code, and most of the compression you get is actually from compressing numeric tables, string literals, link tables, jump tables, etc. For some kinds of things, like games and other graphical programs, there may only be a small amount of code relative to all the canned images, maps, audio samples, and so on.)
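
    For ELF binaries, pulling out just the code bytes is straightforward with binutils, something like this sketch (it assumes GNU objcopy is installed; the file names are only examples):

    Code:
    # Sketch: dump only the .text section of an ELF executable so that
    # "code only" and "whole file" corpus entries can be compared.
    # Requires GNU binutils; file names are just examples.
    import subprocess

    subprocess.run(
        ["objcopy", "-O", "binary", "--only-section=.text",
         "some_binary", "some_binary.text"],
        check=True,
    )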

    ---

    I agree that we need more structured data, like cbloom keeps talking about but doesn't make available. (Presumably because he can't.)

    Several kinds of structured data are common among various junk I've looked at with fv and other tools:

    1. strictly strided arrays of numbers
    2. strictly strided arrays of structs
    3. variably-strided arrays of structs with a variable-length part (e.g. if you write an array of structs to a file, and each struct contains a variable-length string)
    4. multidimensional arrays, i.e., with nested strides, with or without variable-length parts that make the "stride" vary at some granularity

    --

    One way to get some of this is to take an "interesting" program using sophisticated data structures (and real data) and make it dump core, then do something reasonable with the core dump. (E.g., of an in-memory geographic database at or near its point of maximum memory usage.)

    ---

    A variety of log files would be good, and several kinds---e.g., in-memory structy logs of MP synchronization info, blocked and stripped but not heroically compressed memory reference traces, texty written-to-files logs.

    ---

    In general I think it would be good to separate stuff that's mostly already compressed (like jpg and mov files) from stuff that's relatively raw, if you include compressed stuff at all. Recompression is a very different task than normal compression.

    Including some of that is likely good for some purposes, e.g., when comparing fast algorithms they should be smart enough to skip random-looking data and not waste time trying to compress it.

    ---

    I think it's also probably good to include but distinguish some DNA and protein sequence data. You don't want them overrepresented, but you don't want to just leave them out because they're important in the real world and more importantly because they stress-test a compressor in ways that other random formats will too---anything where there's a small alphabet and a lot of apparent randomness within that small alphabet.

    --

    In general I think it's good to have reasonably-defined subsets, e.g., text vs. texty files vs. executables vs. structured data. And you can make a few more subtle distinctions within each of those, e.g., relatively simply-formatted text vs. marked-up text, executable files vs. code segments, etc.

    That's the kind of thing people end up doing for meaningful comparisons anyway---e.g., comparing different algorithms on the subset of the Calgary corpus that's actual text, and leaving out the image and executable files.

    For a corpus to be reasonable for various purposes, you'd ideally do a good job of predefining those qualitatively different categories, in hopes that people would use the well-motivated "standard" subsets rather than having to improvise their own for a given paper.

    ---

    One way of addressing the issue of some files necessarily being huge and being overrepresented in the total input volume---like the DNA sequence being 30 percent of the 10GB corpus---is to define the benchmark to use a specific weighting. The disadvantage there is that you have to convince people it's worth the trouble to actually compute the properly weighted average rather than just running the whole input through a compressor and reporting a single overall number. (One way to deal with that is to provide a little script and/or a spreadsheet that does the right weighting for them.)
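
    Such a script really could be tiny; here's a sketch with made-up categories, sizes, and weights, just to show the idea:

    Code:
    # Weighted-ratio helper: the categories, sizes, and weights below are
    # made up purely for illustration.
    per_category = {          # category -> (original bytes, compressed bytes)
        "text": (100_000_000, 28_000_000),
        "dna":  (300_000_000, 75_000_000),
        "exe":  ( 50_000_000, 22_000_000),
    }
    weights = {"text": 0.5, "dna": 0.2, "exe": 0.3}   # defined by the benchmark

    weighted_ratio = sum(weights[c] * comp / orig
                         for c, (orig, comp) in per_category.items())
    print(f"weighted compression ratio: {weighted_ratio:.3f}")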
    Last edited by Paul W.; 27th March 2015 at 21:24.

  #14 nemequ
    Quote Originally Posted by dnd View Post
    - A corpus should not be dominated by one or two files like silesia (mozilla/webster) or 10GB benchmark (30% DNA).
    I have mixed feelings about this. Yes, in a perfect world things would be balanced, but how do you balance it? People have very different use cases.

    I think it would be better to make the corpus a series of files which are representative of different use cases without too much regard for *overall* composition. Yes, it means you don't have one number for the entire corpus, but I think having one number is an oversimplification which would best be avoided.

    Another problem is that not all files are similarly sized. One of the things I would really want to see is a small JSON and/or Protobuf, like you might see for RPC. I don't want a 1 MiB JSON file to "balance out" a 1 MiB image, because that's not a realistic test case—it would grossly exaggerate the compression ratio, and give an unfair advantage to codecs with slow initialization times.

    Quote Originally Posted by dnd View Post
    - A corpus should not give too much advantage to some compressors with special filters (Ex. wav files or lots of duplicate files).
    This comes back to the same point about trying to balance the entire dataset. If someone has a special filter for wav files they should have great performance for the wav file, but not for a jpeg.

    Quote Originally Posted by dnd View Post
    - Data files should be large enough, especially for performance testing (Ex. 100MB-1GB)
    Completely disagree. Files should be as large as they will be in real life. If you want to do performance testing on a small file, do multiple iterations and present the average. If initialization/deinitialization time dominates the result, that's fine—the software you're benchmarking probably isn't appropriate for that kind of data.
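
    Something along these lines is what I mean, with zlib standing in for the codec under test and a placeholder input file:

    Code:
    # Benchmark a small payload by averaging many iterations.
    # zlib is a stand-in codec; the input file name is a placeholder.
    import time
    import zlib

    payload = open("small_rpc_request.json", "rb").read()
    iterations = 10_000

    start = time.perf_counter()
    for _ in range(iterations):
        zlib.compress(payload, 6)
    elapsed = time.perf_counter() - start
    print(f"average compress time: {elapsed / iterations * 1e6:.1f} us")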

    Quote Originally Posted by dnd View Post
    - Avoid using incompressible files (Ex. MOV files) or highly compressible files. A web log can be an exception.
    In general a ratio of 20-40% obtained with 7z is ok.
    People have highly compressible data. I'm not saying we should include artificial highly compressible data, but I think things like logs, JSON, etc. are appropriate.

    Data which is difficult to compress is also useful… people try to compress stuff like that all the time—maybe they're just using an archive as a container so it's easier to e-mail that folder of JPEGs, maybe it's code that is content-agnostic (a cache, transparently compressed filesystem, etc.). Some codecs handle compressed data *very* quickly, some handle it *very* slowly, and it's pretty useful to know which ones do what.

    Quote Originally Posted by dnd View Post
    - Try using data files from academia
    Why?

  #15 nemequ
    Quote Originally Posted by Paul W. View Post
    In general there's a basic issue of what a corpus is for---is it to make bottom-line overall comparisons and declare a winner, or is it to provide a reasonable varied sample so that you can tell what algorithms are good for what kinds of data. The latter is easier, and in general the former is impossible to do really well. We don't have the data to tell what a truly representative corpus would look like, and even if we did, few users' actual needs would be representative.
    As you can probably guess from my reply to dnd's comment, I think the latter. I don't think the former is feasible.

    Quote Originally Posted by Paul W. View Post
    One way to get some of this is to take an "interesting" program using sophisticated data structures (and real data) and make it dump core, then do something reasonable with the core dump. (E.g., of an in-memory geographic database at or near its point of maximum memory usage.)
    I'm uneasy about the idea of artificial data like that. I'd rather try to get some representative samples.

    I think games could be a really good source of information here, but some people from the industry need to tell us what type of data they want to compress. Even if they can't release the actual data, maybe we can locate some samples which we can redistribute—for example, from open source games.

    Quote Originally Posted by Paul W. View Post
    In general I think it's good to have reasonably-defined subsets, e.g., text vs. texty files vs. executables vs. structured data. And you can make a few more subtle distinctions withing each of those, e.g., relatively simply-formatted text vs. marked-up text, executable files vs. code segments, etc.

    That's the kind of thing people end up doing for meaningful comparisons anyway---e.g., comparing different algorithms on the subset of the Calgary corpus that's actual text, and leaving out the image and executable files.

    For a corpus to be reasonable for various purposes, you'd ideally do a good job of predefining those qualitatively different categories, in hopes that people would use the well-motivated "standard" subsets rather than having to improvise their own for a given paper.
    I like this idea, but I'm worried about the size of the corpus exploding. I don't want to overwhelm people with too much information. I think no more than maybe 20 or so files in total (15 would be better) would be a good idea. I just want something which is good for most people; it doesn't have to include every use case there is.
    Last edited by nemequ; 27th March 2015 at 22:09. Reason: double post, repurposed to reply to Paul W.'s post

  #16 Gonzalo

    IDEA:

    1) Build very specific sets. ONLY geo data, ONLY x86 code, ONLY text in English, ANOTHER in Spanish, and so on... Well defined, standardized. To have every need covered.

    2) When you run your benchmark, choose what percentage of each set you will consider, and state it clearly so that anyone else who wants to run the same bench (for example, on another architecture) can reproduce the mix.

    3) Everybody's happy.

    What do you think? The programmer can decide which mix is the most desirable for his app... Then everybody runs in that direction.

    For example, Fu Siyuan probably would exclude DNA or pick a small amount of it for his CSA.
    Last edited by Gonzalo; 27th March 2015 at 21:59. Reason: typo correction

  #17 Matt Mahoney
    Ideally, it would be nice to have a small benchmark that could predict performance on larger data sets. Sadly, that goal is elusive. As one of my tests on the 10GB benchmark, I made 1GB and 100MB sets by randomly sampling files. Then I tested on several compressors. As one example of how badly this test predicts, 7zip compresses 100MB to 38 MB and zpaq to 45 MB both at default settings. But on 10GB, zpaq does better. Compression of large sets depends much more on using lots of memory or dedupe to find long range matches.

    Synthetic data is useful for detecting edge cases. For example, 10gb/zeropad is 52 MB of zero bytes. It will break naive suffix sorting algorithms like BBB. 10gb/benchmarks/simple/w8002 is a 100 MB file consisting of two copies of 50 MB of random data. Compressors will give vastly different results depending on their ability to detect this redundancy. The simple directory contains harder cases, like alternating zero bytes with random bytes, or alternating random bytes with a copy of the previous byte.
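
    Smaller files in the same spirit are easy to generate if you want to reproduce these edge cases at reduced scale (the sizes and names here are arbitrary, not the actual 10GB files):

    Code:
    # Generate small synthetic edge cases: a long run of zeros and a block
    # of random data repeated twice. Sizes and names are arbitrary and much
    # smaller than the real 10GB benchmark files.
    import os

    with open("zeropad.bin", "wb") as f:
        f.write(b"\0" * (8 << 20))            # 8 MiB of zero bytes

    half = os.urandom(8 << 20)                # 8 MiB of random data
    with open("double_random.bin", "wb") as f:
        f.write(half + half)                  # trivial for long-range match
                                              # detection, hopeless otherwise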

    One of my goals in designing 10GB was to approximate the distribution of files found in https://www.usenix.org/legacy/event/...pers/Meyer.pdf including the high percentage of duplicate files and already compressed files. However, I also included a lot more text and DNA than in the Microsoft study because I think these are interesting problems. Compressing text well requires a model of semantics and grammar and is ultimately a hard AI problem. Modeling DNA well requires a similar level of understanding of biochemistry. Both of these will be active areas of research for decades to come.

    The data is divided roughly equally into text, DNA, software, and already compressed files, but I also added a number of hard cases:

    - 10gb/benchmarks/act-jpeg contains 195 files of the same 3 images at different JPEG quality levels and chroma downsampling in baseline and progressive mode, as well as BMP, GIF, and PNG formats.

    - 10gb/benchmarks/gimp.tar and benchmarks/gimp-2.0.0 (4143 files) are two copies of the same 78 MB as a test of dedupe granularity.

    - 10gb/mingw has 7 versions of the MinGW compiler from 3.4.5 to 4.8.0.

    - 10gb/www.mattmahoney.net/dc and 10gb/2011/www.mattmahoney.net/dc are two backups of my website from 2 years apart. There are hundreds of small HTML files, but most of it is already compressed (zip, pmd, zpaq, pdf, jpg).

    - 10gb/benchmarks/primes is a 56 MB text file of prime numbers from 2 to 99999989.

    - 40 MB of uncompressed BMP and TIF images that were not converted from JPEG, and 51 MB of WAV files. There probably should be more, but lossy compression is the more interesting problem. Lossless is severely limited by low level noise.

    Of course if you want specific data types you can use subdirectories of 10GB.


  #18 Paul W.
    nemequ,

    Re the core dump idea... I wasn't talking about artificial or synthetic data, but about in-memory data from real programs operating on real data. For a 3D game or a geographical information system, you might have heap-allocated and/or memory-mapped R-trees, Binary Space Partitioning trees, various indexing tables and graph structures, buffers full of rendered tiles, etc... all of it real data.

    I agree that it's a worry that the corpus would explode and there'd be all sorts of finessing the subsets, but the situation now isn't any better, and I think it would be good to make a stab at a reasonably large corpus with reasonable subsets at the outset.

    For example, for the text subset, it'd be good to have something that isn't almost all in English, and that clearly distinguishes between relatively plain and fairly marked-up text.

    I, for one, am curious which algorithms that are good "for text" are good for specifically English text, or for alphabetic text, or for pretty much any kind of actual text, and which ones are pretty good at handling markup and which aren't.

    One reason I suggested text in a dozen different languages with varying language families and orthographies is that it would show whether an algorithm like PPMD or LZMA works pretty well for all of them, vs. an algorithm with an English-specific preprocessor that's going to fall on its face when it encounters Mandarin. (Most people don't mostly speak English.) An algorithm fine-tuned to handle all dozen languages well would be interesting---it might be a hack, but if it works, it can probably be extended to handle 30 or 80 languages and get excellent coverage of all of the non-rare natural language text. Likewise you'd want files with different kinds of markup (xml, html, wiki, LaTex, etc.) so that you could tell the difference between an algorithm that's specific to a particular kind of markup vs. one that detects fairly general regularities common to marked-up text but not to plain text.

    For most purposes, all of those things would be counted toward the "good for text files" score, which is all most people would care about, but if you wanted to ask subtler questions, the information would be right there. And right now it isn't---I wouldn't know where to look to find out what algorithm is good for (say) Japanese html pages or a Latex document in Arabic.

    What you don't want is the situation right now, where a lot of algorithms check for a very few important special cases---e.g., English text and x86 executables---and the benchmarks make them look good. A corpus should encourage people to write more robust algorithms that may do ad hoc things, but do them in a more general, principled, and extensible way.

    For example, a part-of-speech tagger for English may be a worthy thing, but a sensible and open framework for part-of-speech tagging in various languages would be much better---people could come up with part-of-speech tables for the common phrases in their languages, and plug them right in.

    Likewise, an algorithm or preprocessor that's good for x86 code is a worthy thing, but an algorithm that's good for variable-length instructions in general (e.g., Java or Python bytecodes as well as x86) would be better, maybe with pluggable tables to guide the parse. And similarly, a RISC code compressor that can be tuned to different RISC instruction sets is better than a hardcoded ARM V7 compressor.

    It seems to me that a "general purpose" compression algorithm should usually be able to play "name that tune" and quickly classify the data it's given and compress it appropriately.

    For example, given two hundred bytes of a text file, you should easily be able to tell that it's text, and if so, what language it's in, and pick a text compression algorithm and a preloaded dictionary (with frequencies and part of speech tags for all the common words and phrases in that language), and compress it very well from the get-go, even for small files, and generally better for large files than you could if you never realize what language you're parsing and what its parts of speech are. For example, just looking at very common word bigrams like "of the" (vs. "de la", "della", "von der", "van die", "af", etc.) will tell you very quickly what language you're looking at. (Same goes for common markup syntax/tags, if it's very marked up.)
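
    A toy version of that check fits in a few lines; the marker bigrams below are purely illustrative, not a trained model:

    Code:
    # Toy "name that tune" language guess from a short prefix.
    # The marker strings are illustrative only.
    MARKERS = {
        "english": ("of the", " and ", " to "),
        "spanish": ("de la", " que ", " los "),
        "german":  ("von der", " und ", " nicht "),
    }

    def guess_language(sample: bytes) -> str:
        text = sample[:200].decode("utf-8", errors="replace").lower()
        scores = {lang: sum(text.count(m) for m in markers)
                  for lang, markers in MARKERS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "unknown"

    # A compressor front end could key a preloaded dictionary off
    # guess_language(first_block).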

    It's surprising to me that compression algorithms that are supposed to be good for "text compression" don't already do this, and I suspect that a big reason they don't is that benchmarks don't reward it.

  #19 SvenBent
    Quote Originally Posted by nemequ View Post
    I don't think we can use that since I don't see anything about licensing. That said, it looks pretty similar to an IRC log, and should also be similar to other chat-like things (IM, facebook, twitter, SMS, etc.), so it could be an interesting piece of data to include. I'm thinking a day from a reasonably popular chat room—something like ##javascript on freenode, maybe?
    I would say it looks more like a hybrid of a system log file and a chat, as you have a lot of repeated sequences of system messages (e.g. battle information: attacks/spells).

    Here I'm just emptying my inbox:
    [11:57:09]yYou take the 12 plates of crab sushi out of delivery slot 5.
    [11:57:12]yYou take the 12 plates of crab sushi out of delivery slot 6.
    [11:57:15]yYou take the 12 plates of crab sushi out of delivery slot 7.
    [11:57:18]yYou take the 12 plates of crab sushi out of delivery slot 8.
    [11:57:20]yYou take the 12 plates of crab sushi out of delivery slot 1.
    [11:57:23]yYou take the 12 plates of crab sushi out of delivery slot 2.
    [11:57:31]yYou take the 12 plates of crab sushi out of delivery slot 3.
    [11:57:34]yYou take the 12 plates of crab sushi out of delivery slot 4.

    Or fishing on my wife's character:
    [16:42:25]Lessfilling caught a black sole!
    [16:42:53]You didn't catch anything.
    [16:43:19]You didn't catch anything.
    [16:43:44]You didn't catch anything.
    [16:44:10]You didn't catch anything.
    [16:44:36]Something caught the hook!
    [16:44:36]You don't know if you have enough skill to reel this one in.
    [16:44:45]Lessfilling caught a black sole!
    [16:45:14]You feel something pulling at your line.
    [16:45:14]You have a good feeling about this one!
    [16:45:16]You give up and reel in your line.
    [16:45:44]Something clamps onto your line ferociously!
    [16:45:44]You have a good feeling about this one!
    [16:45:45]You give up and reel in your line.
    [16:46:09]You didn't catch anything.

    Both are very system-log'ish.

    And an example of Japanese shouts:
    [00:10:32]Hideti[LowJeuno]: ‚u‚vƒEƒvƒ^ƒ‰‚P‚Q˜Aí‚¢‚«‚ Ü‚¹‚ñ‚©` 10/12-18
    [00:13:43]Mirge[LowJeuno]: VWƒ‚ƒ‹ƒ^12˜Aí‚ɍs‚«‚Ü‚¹‚ ñ‚©H ƒTƒ`ƒRƒŽQÆ‚Åtell‚*‚¾‚³‚¢ B 14/18
    [00:13:58]Mirge[LowJeuno]: VWƒ‚ƒ‹ƒ^12˜Aí‚ɍs‚«‚Ü‚¹‚ ñ‚©H ƒTƒ`ƒRƒŽQÆ‚Åtell‚*‚¾‚³‚¢ B 16/18
    [00:14:47]Hideti[LowJeuno]: ‚u‚vƒEƒvƒ^ƒ‰‚P‚Q˜Aí‚¢‚«‚ Ü‚¹‚ñ‚©` 10/12-18

    Lots of "noise", still very repeatable, but it makes dict preprocessing harder.

    Anyway, you might want to look into MMO game log files, as they consist of a very mixed setup: part system log and part chat-ish.

  #20 Gonzalo
    Quote Originally Posted by Paul W. View Post
    [Paul W.'s post #18 quoted in full; see above.]
    Answer:
    Quote Originally Posted by Gonzalo View Post

    [Gonzalo's "IDEA" post #16 quoted in full; see above.]

    That is the most flexible way to get exactly what you want, IMHO. In your example, it would be just fine if you simply ignored everything but the sets in the languages you are working on, plus some natural and marked-up "text" like enwik, and made a relatively big corpus to see how your packer deals with it.
    Some other programmer might want to include the "incompressible" set in his corpus to see whether his program handles it well while remaining competitive at text compression (just a random example).
    And the forum regulars can do the testing by just picking the same sets as you or him.
    Last edited by Gonzalo; 28th March 2015 at 04:42.

  #21 Paul W.
    I'm not sure if we're agreeing or disagreeing. I agree that one important use case is exactly what you describe: have a set of files with various properties and combinations of properties, and pick exactly the files that address the particular questions you're trying to answer (e.g., whether an algorithm is good for text in Spanish marked up with html). That is a natural and inevitable use of a corpus, and a good thing when it's actually appropriate.

    I also think it's worthwhile to discuss ahead of time what kinds of questions people are commonly going to want to answer, and have predefined "standard" subsets that address those common questions. It's better to have most comparisons, e.g., of text-oriented compression algorithms, use the standard subset for that kind of test than to have each paper pick a somewhat different subset and make one-to-one comparisons across papers harder.

    (E.g., if you claim your algorithm is good for "text" but you leave out the Mandarin and Japanese files instead of using the standard "text" set, you've got some explaining to do.)

  #22 Gonzalo
    Quote Originally Posted by Paul W. View Post
    I'm not sure if we're agreeing or disagreeing.
    Both

    Quote Originally Posted by Paul W. View Post
    I also think it's worthwhile to discuss ahead of time what kinds of questions people are commonly going to want to answer, and have predefined "standard" subsets that address those common questions. It's better to have most comparisons, e.g., of text-oriented compression algorithms, use the standard subset for that kind of test than to have each paper pick a somewhat different subset and make one-to-one comparisons across papers harder.
    This takes my idea a step farther. And I completely agree. Let's see if I understood...

    A not-so-clever example:

    Subset 01: Intended for still-image packers
    Subset 01.1 Abstract art
    Subset 01.2 Photographs
    Subset 01.3 Artificially made images (Fractals, vectors converted to bitmap, plots)

    Subset 02: Intended for text packers
    Subset 02.1 Natural modern English text
    Subset 02.2 Natural modern Chinese text
    Subset 02.3 Natural modern Russian text
    ...
    Subset 02.4 Artificial: LOG
    Subset 02.5 XML
    Subset 02.6 C++ code

    and so forth...
    Last edited by Gonzalo; 28th March 2015 at 06:15.

  #23 Mat Chartier (Member, Canada)
    It should be possible to use very specialized test results to find the "best" compressor for arbitrary use cases by using dot product.

    E.g. if your data composition is: Exe 60% Text 20% Wav 20%

    Then you want the compressor c that minimizes (average_ratio(c, exe)*0.6 + average_ratio(c, text)*0.2 + average_ratio(c, wav)*0.2). This assumes that the ratio on test data will match the ratio on your data for that data type. To calculate this result you just need to have the 2D array of [compressor]x[data type] calculated.
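
    As a sketch, with made-up ratio numbers:

    Code:
    # Dot-product ranking over a [compressor] x [data type] table of average
    # ratios (compressed/original). All numbers here are made up.
    ratios = {
        "codecA": {"exe": 0.45, "text": 0.30, "wav": 0.85},
        "codecB": {"exe": 0.50, "text": 0.25, "wav": 0.60},
    }
    mix = {"exe": 0.6, "text": 0.2, "wav": 0.2}     # your data composition

    def expected_ratio(codec: str) -> float:
        return sum(mix[t] * r for t, r in ratios[codec].items())

    best = min(ratios, key=expected_ratio)
    print(best, round(expected_ratio(best), 3))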


  #24 dnd
    I'll try to explain my suggestions:
    - A corpus should not be dominated by one or two files
    Ex. on the Silesia corpus, ppm and bwt compressors dominate the rankings, but we know that ppm or bwt compressors are only good for text or special data. My conclusion is that Silesia is not appropriate for benchmarking.
    - A corpus should not give too much advantage to some compressors:
    Ex. at compressionratings.com, compressors with a good wav filter in general have a better ranking.
    - Data files should be large enough, especially for performance testing (Ex. 100MB-1GB)
    I don't mean a single file, but a category as a tar file (text, binary, games, apps, ...)
    - Avoid using incompressible files...
    Well, the goal of compression is primarily to save significant space.
    - Try using data files from academia
    In general it is better to use data already used in research papers, or a recent corpus already used by others, facilitating comparisons and not reinventing the wheel.

    Quote Originally Posted by Mat Chartier View Post
    It should be possible to use very specialized test results to find the "best" compressor for arbitrary use cases by using a dot product.
    Actually, all benchmarks sort first by the total compressed size. Rank aggregation is another metric, one that is independent of file size.
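    As an illustration only, here is a small sketch contrasting the two metrics, assuming a table of per-file compressed sizes is available; the codec names and sizes below are invented placeholders:
    Code:
    # Hypothetical per-file compressed sizes in bytes for two codecs.
    results = {
        "codec_a": {"small.txt": 900, "big.tar": 5000},
        "codec_b": {"small.txt": 1000, "big.tar": 4000},
    }
    files = ["small.txt", "big.tar"]

    # Metric 1: total compressed size -- the larger file dominates the ordering.
    by_total_size = sorted(results, key=lambda c: sum(results[c].values()))

    # Metric 2: rank aggregation -- rank the codecs on each file, then average
    # the ranks, so every file counts equally regardless of its size.
    def average_rank(codec):
        ranks = []
        for f in files:
            order = sorted(results, key=lambda c: results[c][f])
            ranks.append(order.index(codec) + 1)
        return sum(ranks) / len(ranks)

    by_rank = sorted(results, key=average_rank)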

  30. #25
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Mat Chartier View Post
    It should be possible to use very specialized test results to find the "best" compressor for arbitrary use cases by using a dot product.

    E.g. if your data composition is: Exe 60% Text 20% Wav 20%

    Then you want the compressor c that minimizes (average_ratio(c, exe)*0.6 + average_ratio(c, text)*0.2 + average_ratio(c, wav)*0.2). This assumes that the ratio on test data will match the ratio on your data for that data type. To calculate this result you just need to have the 2D array of [compressor]x[data type] calculated.
    Agreed. In fact, one could name such rankings officially, e.g. NewCorpus-60-20-20
    Quote Originally Posted by dnd View Post
    - Data files should be large enough, especially for performance testing (ex. 100 MB - 1 GB).
    I'd say "Data files should be small enough specially for performance testing". When you tweak your algorithm, smaller data = lots of time saved. When you compare algorithms, you compress the same data many times, smaller data = lots of time saved.
    If you restrict yourself to open source codecs, there's no lower limit to corpus size other than that it has to show results comparable to *something*.
    If you don't, you indeed need more data. But 1 GB?
    * For me, many codecs are too slow to compare with such size as I'm not willing to wait for a result overnight
    * I wouldn't be able to run it on many CPUs as only some desktop / server platforms have enough RAM to fit such file. Oh well, not a problem with closed source codecs as you won't be able to run them on non-x86 anyway, but limits testset utility for open source.
    Quote Originally Posted by dnd View Post
    - Avoid using incompressible files...
    Well, the goal of compression is primarily to save significant space.
    But system designers often don't know whether a piece of data is compressible until they try to compress it, so speed on incompressible data is important. And I tend to test on highly compressible data too; quite a few codecs blow up when there are too many parsing choices.

  31. #26
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    @dnd:
    - A corpus should not be dominated by one or two files.
    Ex. Silesia rankings are dominated by ppm and bwt compressors, but we know that ppm or bwt compressors are only good for text or special data. My conclusion is that Silesia is not appropriate for benchmarking.
    - A corpus should not give too much advantage to some compressors.
    Ex. At compressionratings.com, compressors with a good wav filter generally have a better ranking.
    IMHO, a corpus must be targeted at programs in a specific niche, simply because you can't compare apples to oranges. There will always be a "bad" corpus for your program, one that makes it look bad too. You must compare what's comparable. So I think something like Silesia is just fine, although obviously not for every program.
    If you have a general-purpose compressor, like 7z or Zpaq or FreeArc, then you can choose a large set of very different types of data... and see what happens.
    This approach lets everyone play their game on their own territory. Why should a PPM developer want to test his program on anything but text???

  32. #27
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    - Data files should be large enough, especially for performance testing
    Here again, I mean a category. There is no sense in benchmarking performance with small files that fit into the L1/L2 cache.
    Additionally, there is no need for such a new corpus; we already have maximumcompression (sfc), Calgary, ...
    - Avoid using incompressible files...
    You can always consider incompressible files in a special category.

    Quote Originally Posted by Gonzalo View Post
    Why should a PPM developer want to test his program on anything but text
    Yes, you're right, in this case you can use the text files from the corpus, but as noted Silesia is not appropriate as a corpus for benchmarking different algorithms (cm, lz, bwt, ppm, ...).

    Well, the most important thing is to encourage people to use a common public corpus.
    I've posted some pointers, but I have the feeling that this thread will again simply continue as an endless discussion without results.
    Judging by the poll results alone, this does not seem to be an important subject.
    Last edited by dnd; 29th March 2015 at 21:34.

  33. #28
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    A good case can be made for a small-files benchmark too.
    Actually, in many "real world" scenarios, the amount of data to compress is not that large.
    Think about small messages, little blocks, columns, etc.

    Some algorithms can take a lot of time to "ramp up", and can only show their strength if the amount of data to compress is large enough.
    That's fair enough when the task is indeed to compress some kind of large file or large archive.
    But not every scenario fits that mold.

  34. The Following User Says Thank You to Cyan For This Useful Post:

    nemequ (30th March 2015)

  35. #29
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    I've posted some pointers, but I have the feeling that this thread will again simply continue as an endless discussion without results.
    Maybe, maybe not. Why don't you post your own corpus and see what others say? Not just the idea, but the real set...
    I will see if I can find some time to do the same.

  36. #30
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Lots of people here seem to want a *huge* corpus which includes several versions of every type of data imaginable, which makes sense considering most of you are looking at things from a codec development point of view. I'm not opposed to the idea, but what I'm more interested in is something geared more towards the 99% of users. Even with limited demand here, feedback from the Squash Benchmark has convinced me this is necessary. Several people have asked me to add exactly the type of data I was lamenting the lack of, especially small JSON snippets (like those used in RPC APIs).

    I am planning on putting together a corpus targeted at users, which will be added to the Squash Benchmark as soon as it is done. No more than 25 pieces of data (preferably 15-20), hopefully generally representative of the types of things the vast majority of people care about. I've created a project on GitHub to help organize this, and I would very much appreciate any thoughts you have (with that goal in mind).

    That doesn't mean I intend to ignore the sentiment which has been expressed here. If people are willing to help, I think we can create a huge corpus with lots of interesting, niche, and/or pathological data: multiple examples of many different types of data. If people want this, I think the best way to do it would be to create an organization on GitHub, then create a repository for each general type of data (text, medical, games, etc.) as well as one repository which has them all as submodules (which could also be used to coordinate overall activity). If people contact me and tell me they are willing to help curate one (or more) part of something like that, I would be happy to start putting everything together.
    Last edited by nemequ; 1st April 2015 at 21:51. Reason: typo

  37. The Following 3 Users Say Thank You to nemequ For This Useful Post:

    Cyan (1st April 2015),Gonzalo (1st April 2015),Matt Mahoney (1st April 2015)
