
Thread: Benchmark results on GitHub

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    Benchmark results on GitHub

    Hi,

    I've quickly set up a basic benchmark results page on GitHub. It's at https://github.com/tarsa/lossless-benchmark (project) or https://tarsa.github.io/lossless-benchmark/ (running website). For now it's a poor rip-off of Matt's LTCB. Update: it now looks quite different.

    I think that issue tracking, pull requests, restricted rights and grouping people into organizations are good tools for maintaining projects collaboratively, including this project. What do you think? Many benchmarks are abandoned, but this approach allows taking over stewardship when the necessity arises - just fork the repository and tell people that you will maintain it from now on.
    Last edited by Piotr Tarsa; 27th May 2018 at 20:52.

  2. The Following 3 Users Say Thank You to Piotr Tarsa For This Useful Post:

    Darek (10th May 2018),Jyrki Alakuijala (10th May 2018),khavish (10th May 2018)

  3. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Hi,

    I've quickly set up a basic benchmark results page on GitHub. It's here: https://github.com/tarsa/lossless-benchmark For now it's a poor rip-off of Matt's LTCB.

    I think that issue tracking, pull requests, restricted rights and grouping people into organizations are good tools for maintaining projects collaboratively, including this project. What do you think? Many benchmarks are abandoned, but this approach allows taking over stewardship when the necessity arises - just fork the repository and tell people that you will maintain it from now on.
    While some people may actually compress gigabytes at once, it is probably not the best practice in computing. From the viewpoint of random-access decoding, reduced decoding resource use, error recovery and error correction, it may be more practical to compress smaller chunks -- for example two megabytes at a time. If we are talking about data transmission rather than storage, the units are even smaller, often 50 kB to 1 MB uncompressed.

    A modern data compression benchmark should, instead of having two identical and relatively obscure huge files, have a collection of about 500 small files that are individually compressed.

  4. #3
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    For me, a very good idea. Especially since there is no single, open, lossless compression database.

    The most important question is how to keep things orderly.
    And maybe there is a need to set some rules - like a definition of decompressor size (exe or source? is zip compression allowed or not? which version? etc.)

    p.s. and many thanks for putting my score there

  5. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    If you're thinking about multiple data sets, from small to large, from text to binary to structured, then the Squash benchmark already has a good starting point:

    https://github.com/quixdb/squash
    https://quixdb.github.io/squash-benchmark/
    https://quixdb.github.io/squash-benchmark/unstable/

    What it lacks is a way to add third-party results (presumably deliberately, in order to keep the numbers stable), but perhaps those could be marked up, as the sizes shouldn't change even if the speeds become rather arbitrary. Maybe track them under an additional machine type "generic", where speed comparisons carry an enormous caveat. It also lacks lots of compressors, since the benchmark builds from source rather than simply being a git project for the results. Maybe there are some ideas to pick up from it, though.

  6. #5
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    While some people may actually compress gigabytes at once, it is probably not the best practice in computing. From the viewpoint of random-access decoding, reduced decoding resource use, error recovery and error correction, it may be more practical to compress smaller chunks -- for example two megabytes at a time. If we are talking about data transmission rather than storage, the units are even smaller, often 50 kB to 1 MB uncompressed.

    A modern data compression benchmark should, instead of having two identical and relatively obscure huge files, have a collection of about 500 small files that are individually compressed.
    +1

    How about we have the best of both worlds: results on two datasets, one for small files and one for huge files?

  7. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks for responses.

    We need to balance many things:
    1. collaborative development - it should be easy to contribute new results
    2. verifiability - compressed files need to be public, and the same goes for compression programs (ideally no software that expires after some time); we also need a standardized way to compress the compression program if we're going to add its size to the compressed file size (as Matt does)
    3. overtuning resistance - Matt's BARF is a perfect example that with enough creativity you can easily win benchmarks if decompressor size isn't factored into the result
    4. features - the benchmark page should allow users to present the data the way they want it presented. This has a much lower priority than the prior items, however, because adding features makes the benchmark more complex to contribute to.


    I think Matt's rules for his LTCB benchmark are pretty good. Focusing just on sizes allows contributing results from very different machines. OTOH having just one homogeneous file (enwik8 is a prefix of enwik9) means that compressors' abilities to adapt to the data contents aren't exercised.

    Jyrki Alakuijala:
    Small files pose a problem. If you don't factor decompressor size into the result, then people can cheat by excessive pretraining. If you add decompressor size to the result, then the decompressor size will have too high an impact on that result. There's also a third option - solid compression, but not every compressor supports that. Having big files means that decompressor size has a lower impact on the result, and therefore adding it to the result will be less controversial.

    Darek:
    > Most important question is how to keep orderliness.
    What exactly do you mean by orderliness? Generally, GitHub repositories have a system of privileges, so while you can give everyone read-only access, the admin rights can be restricted to a few people. If someone's contributions aren't accepted, they can continue with a forked repository.

    Decompressor size should be (easily) verifiable, just like compressed file size. The ZIP format is very common and a zip utility is available on practically any operating system. OTOH the zip format is relatively weak (not counting relatively recent extensions like optional LZMA compression). The 7z format allows for much higher compression and is also available on many operating systems (thanks to p7zip). 7z limited to the LZMA algorithm therefore seems like a very plausible idea.

    JamesB:
    The Squash benchmark looks pretty advanced and complete. Lots of interesting data and many ways to visualize it. The controlled environment makes it sensible to compare performance, but OTOH it restricts collaboration. Designing a benchmark website with as many features as yours while also allowing easy collaboration would probably require a lot of effort, and I'm occupied with other tasks (like designing my own context-mixing based compressor). Therefore I would rather stay with the classic approach of having a table sorted by compressed file size.

    all:
    I'm open to suggestions as to what file sets should be in the benchmark. But I don't want to bother with selecting and hosting the files used in the benchmark.

    Currently the benchmark page consists of just one self-contained HTML file. That is pretty good for collaboration, as HTML is a relatively easy format to learn, and having a self-contained file makes it easy to copy, share, view, etc. I will extend the idea further over the weekend, e.g. by adding JavaScript-based table sorting so that manual reordering isn't needed after updates.
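    Roughly, this is the kind of sorting I have in mind - just a sketch, nothing final, and the assumption that sizes are stored as comma-grouped strings is mine:
    Code:
    // Sort the rows of a results table by one numeric column.
    // Assumes cell text holds sizes like "16,038,418".
    function parseSize(text) {
        return parseInt(text.replace(/,/g, ""), 10);
    }

    function sortTableByColumn(table, columnIndex) {
        var tbody = table.tBodies[0];
        var rows = Array.from(tbody.rows);
        rows.sort(function (a, b) {
            return parseSize(a.cells[columnIndex].textContent)
                 - parseSize(b.cells[columnIndex].textContent);
        });
        // appendChild moves existing nodes, so this re-orders the table in place.
        rows.forEach(function (row) { tbody.appendChild(row); });
    }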

    I don't know which benchmark results I can freely use (include in my benchmark page), so I'll search for Darek's results and hope he's not angry.

  8. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    Gotty (11th May 2018)

  9. #7
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Thanks for the answers.
    Regarding "7z limited to the LZMA algorithm seems like a very plausible idea" - great, for me it's clear and simple, and I have 7-Zip to prepare the decompressor archives myself.
    One more question about the decompressor - can it be stored as source or as an executable (whichever is smaller), as in the LTCB rules?

  10. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Squash benchmark isn't mine btw, I just happen to like it.

    I agree that size is the primary purpose of compression and is clearly the best starting point! It's also easy to replicate, compared to the plethora of problems with speeds. However, Matt's LTCB does report speed and memory, and this can be very useful despite the myriad caveats that come with it, so it's nice to be able to store it. We can perhaps just accept the numbers reported by anyone contributing. Maybe two sets in fact - their new tool plus a standard universal tool that isn't changing regularly, to act as a *crude* normalisation.

    E.g. user 1 submits MyLZ: size 123,456, encode time 22.0s, decode time 1.5s. They also submit gzip: size 133,444, encode time 20.0s, decode time 2.1s. So we can report their times as-is and also as factors of how much faster/slower than gzip they are. (I don't care much for gzip, but *something* will do.) This isn't perfect by any stretch, but it makes the data more usable IMO and it's still a low entry barrier for user submissions.
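    To illustrate with the numbers above, a rough sketch of that normalisation (the function and field names are just placeholders, not from any real benchmark code):
    Code:
    // Express a submission's timings as factors relative to a gzip baseline
    // measured by the same contributor on the same machine.
    function relativeToBaseline(submission, baseline) {
        return {
            encodeFactor: submission.encodeTime / baseline.encodeTime,
            decodeFactor: submission.decodeTime / baseline.decodeTime
        };
    }

    var myLz = { size: 123456, encodeTime: 22.0, decodeTime: 1.5 };
    var gzip = { size: 133444, encodeTime: 20.0, decodeTime: 2.1 };

    // encodeFactor = 1.10 (10% slower than gzip),
    // decodeFactor ~= 0.71 (about 1.4x faster than gzip).
    console.log(relativeToBaseline(myLz, gzip));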

  11. #9
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Jyrki Alakuijala:
    Small files pose a problem. If you don't factor decompressor size into the result then people can cheat by excessive pretraining. If you add decompressor size to the result then the decompressor size will have too high impact on that result. There's also a third option - solid compression, but not every compressor supports that. Having big files means that decompressor size has lower impact on the result and therefore adding it to the result will be less controversial.
    I'm for including small files. Let's simply compress a large set of small files individually and present sum(size_of_compressed_files) + size_of_compressed_exe as the two-component result.
    The individual sizes are also important to see which file types are compressed better or worse. When I test a paq8px contribution, most of the time I also test it on my test set of 12,125 small files. It IS useful, I can tell.
    File compressors that compress the filenames will have a disadvantage, though. So a flag indicating whether that is the case would be useful to compare compressors fairly.
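    A rough sketch of what I mean, assuming the per-file compressed sizes are already known (the names here are just placeholders):
    Code:
    // Two-component result for a small-file corpus: the sum of the individually
    // compressed files, plus the compressed decompressor added once per corpus.
    function corpusResult(compressedFileSizes, compressedExeSize) {
        var filesTotal = compressedFileSizes.reduce(
            function (sum, size) { return sum + size; }, 0);
        return {
            filesTotal: filesTotal,
            decompressor: compressedExeSize,
            total: filesTotal + compressedExeSize
        };
    }

    // e.g. three files of 10, 20 and 30 bytes plus a 100-byte decompressor
    // give { filesTotal: 60, decompressor: 100, total: 160 }.
    console.log(corpusResult([10, 20, 30], 100));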

  12. The Following User Says Thank You to Gotty For This Useful Post:

    Jyrki Alakuijala (12th May 2018)

  13. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Small files pose a problem. If you don't factor decompressor size into the result then people can cheat by excessive pretraining. If you add decompressor size to the result then the decompressor size will have too high impact on that result.
    This problem is easily solved: You add the decompressor size only once per corpus, not once per file. The whole corpus should be something like 100 MB to 1 GB, so the decompressor size gets the right proportion.

    Quote Originally Posted by Piotr Tarsa View Post
    There's also a third option - solid compression, but not every compressor supports that. Having big files means that decompressor size has lower impact on the result and therefore adding it to the result will be less controversial.
    What is solid compression? Why not just benchmark the usual use cases?

  14. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Darek:
    > One question more for Decompressor - could be stored as source or executable (which is smaller) as is in LTCB rules?

    Yes. Both are accepted - this way open-source programs will usually have an advantage (source code compressed with 7-Zip should be smaller than the packed executable).

    JamesB:
    The idea of comparing speed to gzip is interesting, but in reality there are many gzip versions and builds (i.e. executables compiled with different compiler options). We would need to point users to some standardized builds hosted on a trusted website (so that they won't worry the binaries are malicious).

    Gotty & Jyrki:
    Yep. Adding the decompressor size once per set of many small files solves the problem. Now we need publicly available corpora of many small files.

    As for progress on the project, I'm reorganizing the benchmark data (despite its tiny size - it's faster to refactor while there's less text to change) to make contributions easier and the display more flexible. Right now a single entry is almost as simple as a single line in Matt's LTCB.
    https://github.com/tarsa/lossless-be...e97/index.html
    Code:
                var enwik_results = [
                    enwik_result(
                            "Paq8pxd_v47_3", "-s15",
                            "16,038,418", "126,749,584",
                            "???", "???",
                            "83,435", "86,671", "27,600", "CM", "1")
                ];
    The next thing to do will be support for series of compression programs, e.g. the PAQ series has multiple program versions. Matt shows just a single entry for the entire series in the main table and then shows more entries below the notes (for that particular series), but I think that makes comparison between all entries more difficult. I'll try to come up with something smarter.
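    Roughly, the kind of grouping I'm considering - only a sketch, and the entries and field names below are placeholders:
    Code:
    // Group results by compressor series and keep only the best (smallest)
    // total per series, for a compact main table.
    function bestPerSeries(results) {
        var best = {};
        results.forEach(function (entry) {
            var current = best[entry.series];
            if (!current || entry.total < current.total) {
                best[entry.series] = entry;
            }
        });
        return Object.keys(best).map(function (series) { return best[series]; });
    }

    var sample = [
        { series: "PAQ", name: "Paq8pxd_v47_3", total: 126899794 },
        { series: "CMV", name: "CMV v00.01.01", total: 149436687 }
    ];
    console.log(bestPerSeries(sample));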

  15. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    Darek (13th May 2018)

  16. #12
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    OTOH having just one homogeneous file (enwik8 is a prefix of enwik9) means that compressors' abilities to adapt to the data contents aren't exercised.
    Let's be honest here, it makes for a pointless and worthless exercise in overtuning/overfitting.

    Quote Originally Posted by Piotr Tarsa View Post
    I'm open to suggestions as what file sets should be in the benchmark. But I don't want to bother with selecting and hosting files used in benchmark
    There are a lot of publicly available testsets (Calgary, Canterbury, Silesia, MaximumCompression, etc).
    SqueezeChart from Stephan Busch has a lot of nice testsets for many specific areas, but most are several hundred MBs (or more) in size, so I don't think a lot of people would report results from CM compressors for those.

    Quote Originally Posted by Piotr Tarsa View Post
    Currently the benchmark page consists of just one self-contained HTML file. That is pretty good for collaboration, as HTML is a relatively easy format to learn, and having a self-contained file makes it easy to copy, share, view, etc. I will extend the idea further over the weekend, e.g. by adding JavaScript-based table sorting so that manual reordering isn't needed after updates.
    Why not have the results for each section stored in JSON format, so anyone wanting access to the raw data could have it in an easily parseable format?

    Quote Originally Posted by JamesB View Post
    I agree that size is the primary purpose of compression and is clearly the best starting point! It's also easy to replicate, compared to the plethora of problems with speeds. However, Matt's LTCB does report speed and memory, and this can be very useful despite the myriad caveats that come with it, so it's nice to be able to store it. We can perhaps just accept the numbers reported by anyone contributing. Maybe two sets in fact - their new tool plus a standard universal tool that isn't changing regularly, to act as a *crude* normalisation.

    E.g. user 1 submits MyLZ: size 123,456, encode time 22.0s, decode time 1.5s. They also submit gzip: size 133,444, encode time 20.0s, decode time 2.1s. So we can report their times as-is and also as factors of how much faster/slower than gzip they are. (I don't care much for gzip, but *something* will do.) This isn't perfect by any stretch, but it makes the data more usable IMO and it's still a low entry barrier for user submissions.
    Regarding memory usage, a lot of the results on the LTCB are wrong, as is the only result posted on GitHub by Piotr. Current physical memory usage is not the same as total allocated virtual memory.
    I agree that using a common, widely available tool as a performance baseline would be nice, but if we're relying on user-submitted results, we need peer validation. Is anyone willing to validate a cmix submission on a 2GB testset?

    Quote Originally Posted by Jyrki Alakuijala View Post
    What is solid compression? Why not just benchmark the usual use cases?
    Solid compression means that we keep the compressor's internal state between blocks/files. Worst case scenario: if you compress N files in solid mode and you wish to decompress the last one, you have to decompress all the data from the other N-1 files first.



    Would chained encoding workflows be allowed? The only reason cmix is listed as #1 on the Silesia Open Source Compression benchmark is because it's used after precomp, otherwise paq8pxd would lead.

    In small testsets where sizes for each individual file are reported, would only a single combination of parameters be allowed? If the idea is to test how well a compressor can do, shouldn't the best possible results be shown?

  17. #13
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by mpais View Post
    Solid compression means that we keep the compressor's internal state between blocks/files. Worst case scenario: if you compress N files in solid mode and you wish to decompress the last one, you have to decompress all the data from the other N-1 files first.
    Normally data is compressed in smaller fragments. Ideally benchmarks should replicate that.

  18. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Let's be honest here, it makes for a pointless and worthless exercise in overtuning/overfitting.
    You always test your compressor on some data. If you care about winning on some benchmark then you mainly test on that benchmark. You can't be 100% unbiased (i.e. cater for all usage scenarios), as you don't even know what type of data the users of your program will provide as input. There needs to be a compromise somewhere. Creators of commercial compressors probably don't care about compressing enwiks.

    There are a lot of publicly available testsets (Calgary, Canterbury, Silesia, MaximumCompression, etc).
    I've looked at the Silesia benchmark. Some files are a concatenation of many small files (e.g. the HTML ones), but I haven't found the unconcatenated file set.

    Why not have the results for each section stored in JSON format, so anyone wanting access to the raw data could have it in an easily parseable format?
    The data is already defined in JavaScript. I want to have a single self-contained HTML file because that's the easiest thing for anyone to handle. Splitting it into multiple files would AFAIK require running a server on localhost to bypass browser restrictions - HTML pages can't load data directly from disk even if you hardcode the paths.

    Regarding memory usage, a lot of the results on the LTCB are wrong, as is the only result posted on GitHub by Piotr. Current physical memory usage is not the same as total allocated virtual memory.
    I agree that using a common, widely available tool as a performance baseline would be nice, but if we're relying on user-submitted results, we need peer validation. Is anyone willing to validate a cmix submission on a 2GB testset?
    What's the error in megabytes / percent? If it's small enough then it shouldn't be a problem when drawing conclusions about e.g. the space efficiency of the data structures used in compression programs.

    Would chained encoding workflows be allowed? The only reason cmix is listed as #1 on the Silesia Open Source Compression benchmark is because it's used after precomp, otherwise paq8pxd would lead.
    Yes, chaining is allowed, mostly because integrating preprocessors is a matter of adding some logic, not changing the existing logic. It shouldn't matter whether you integrate precomp at the C++ level or at the script level.

    In small testsets where sizes for each individual file are reported, would only a single combination of parameters be allowed? If the idea is to test how well a compressor can do, shouldn't the best possible results be shown?
    Different options for each file should be allowed, but the decompressor must still be self-contained, i.e. decompression must not require the user to enter any options.

    Next thing to do will be a support for series of compression programs, e.g. PAQ series has multiple program versions. Matt shows in the main table just a single entry for entire series and then shows more entries below notes (for that particular series) but I think that makes comparison between all entries more difficult. I'll try to come up with something smarter.
    I've implemented the idea and it's visible on the GH page. The next feature is sorting by (some) columns.

    Update:
    And to summarize: the goals of this project are ease of contribution, focus on compression strength, and overall simplicity. Speed and feature comparisons while keeping the benchmark open for contributions would require something much more sophisticated.

  19. #15
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    You always test your compressor on some data. If you care about winning on some benchmark then you mainly test on that benchmark. You can't be 100% unbiased (i.e. cater for all usage scenarios) as you don't even know what type of data user of your program will provide as an input. But there needs to be some compromise somewhere. Creators of commercial compressors probably don't care about compressing enwiks.
    Is the purpose of this benchmark to use a single test set for all submissions, i.e., only general purpose compressors apply? Or do you plan on having sections dedicated to image, audio, text, recompression, etc, like in SqueezeChart?

    Quote Originally Posted by Piotr Tarsa View Post
    Data is already defined in JavaScript. I want to have a single self-contained HTML file because that's easiest to handle by anyone. Splitting the file to multiple ones will AFAIK require you to run a server on localhost to bypass browsers restrictions. HTML pages can't load data directly from disk even if you hardcode paths into it.
    You're assuming anyone would want to download the HTML file as is? If we wish to just see the results we can go to the benchmark page, but if we want to run some statistical analysis on them and produce some graphs, we download the raw data. You shouldn't mix content and presentation.

    Quote Originally Posted by Piotr Tarsa View Post
    What's the error in megabytes / percent? If it's small enough then it shouldn't be a problem when drawing conclusions about e.g. the space efficiency of the data structures used in compression programs.
    For the submission you currently have? Over 7GB. Darek tested on a machine with 32GB of physical RAM and the value reported is most likely the amount of physical memory used (about 27.600 MB), but the real amount of virtual memory allocated by paq8pxd_v47_3 is over 35.400MB. It's even worse for cmix. Matt lists cmix v14 as using "just" ~28.300MB. I'm currently running 2 instances of cmix v15 on enwik8, one is the original version published by Byron and another one is modified with the latest changes from paq8px_v142. Currently each is using about 22.200MB of physical memory, but their actual allocated virtual memory is over 38.400MB each.

  20. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Is the purpose of this benchmark to use a single test set for all submissions, i.e., only general purpose compressors apply? Or do you plan on having sections dedicated to image, audio, text, recompression, etc, like in SqueezeChart?
    I plan multiple test sets that are already publicly available, e.g. the Silesia corpus, the MaximumCompression corpus, the Calgary corpus, etc. By using the MaximumCompression corpus I would be competing with its author on his own data, but as it hasn't been updated in years I think it's OK. SqueezeChart is a different case - Stephan actively maintains it.

    I have to admit I'm mostly interested in compressing textual data. It also happens that most benchmarks are dominated by text files. Multimedia compression is limited by the amount of noise in the input data, which is usually quite high when the sampling precision is high, so the results are pretty flat.

    You're assuming anyone would want to download the HTML file as is? If we wish to just see the results we can go to the benchmark page, but if we want to run some statistical analysis on them and produce some graphs, we download the raw data. You shouldn't mix content and presentation.
    OK. I'll add the possibility to dump raw JSON data directly from the page.
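    Something along these lines - just a sketch, the element id and variable name are placeholders:
    Code:
    // Dump the raw results as pretty-printed JSON into a <pre> element,
    // so they can be copied out of the page without scraping the HTML.
    function dumpRawJson(results, targetElementId) {
        var pre = document.getElementById(targetElementId);
        pre.textContent = JSON.stringify(results, null, 2);
    }

    // e.g. dumpRawJson(enwik_results, "raw-json-dump");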

    For the submission you currently have? Over 7GB. Darek tested on a machine with 32GB of physical RAM and the value reported is most likely the amount of physical memory used (about 27.600 MB), but the real amount of virtual memory allocated by paq8pxd_v47_3 is over 35.400MB. It's even worse for cmix. Matt lists cmix v14 as using "just" ~28.300MB. I'm currently running 2 instances of cmix v15 on enwik8, one is the original version published by Byron and another one is modified with the latest changes from paq8px_v142. Currently each is using about 22.200MB of physical memory, but their actual allocated virtual memory is over 38.400MB each.
    The question is: what really matters? I would want to know the answer to: how much free RAM do I need to run the program at the advertised speed? I think the value Matt shows is a pretty accurate answer when it comes to slow programs that don't rely much on e.g. the amount of free RAM for caching files.

    Update:
    Raw JSON data is displayed directly on the page. Is that OK?

  21. #17
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I have to admit I'm mostly interested in compressing textual data. It also happens that most benchmarks are dominated by text files. Multimedia compression is limited by the amount of noise in input data which usually is quite high when sampling precision is high, so results are pretty flat.
    But most real-world usage cases are dominated by non-textual data. Sure, serving web content means having good textual compression even on small files, so that would be of interest for Jyrki, but consider what data a normal user is most likely to have: software (executables, other binary data), photos (mostly JPEGs or maybe some raw camera format), videos, music (not so much nowadays with the rise in popularity of streaming services), documents (pdf, docx, etc).

    Quote Originally Posted by Piotr Tarsa View Post
    The question is what really matters? For me I would want to know the answer to: how much free RAM I need to have to run the program with advertised speed? I think the value Matt shows is a pretty accurate answer when it comes to slow programs that don't rely much on e.g. amount of free RAM for caching files.
    Sure, cmix and paq8 are slow no matter what, but the thing is, Matt uses those values to determine Pareto efficiency ("An underlined value means that no better compressor uses less memory"). Maybe using the actual memory usage wouldn't change anything in regards to that, but I know of a situation where it does make a difference: if cmix only used about 27.5GB of memory, I could run 3 concurrent instances in this machine with 64GB of RAM (~89.6GB of virtual memory), but I can't. Which is probably why a lot of people can't run a single instance with 32GB of RAM either.

  22. #18
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    But most real-world usage cases are dominated by non-textual data. Sure, serving web content means having good textual compression even on small files, so that would be of interest for Jyrki, but consider what data a normal user is most likely to have: software (executables, other binary data), photos (mostly JPEGs or maybe some raw camera format), videos, music (not so much nowadays with the rise in popularity of streaming services), documents (pdf, docx, etc).
    If we want to produce the perfect compressor for home PC users then context mixing is out of the question - it's too slow for the ordinary user. A typical user would also prefer an asymmetric compressor. Decompression speed is much more important, because you can use the uncompressed data during compression, but during decompression the uncompressed data isn't there until decompression finishes. And users use lots of space for long-term storage, i.e. they add music, photos, documents, etc. and then read them many times. Overall, LZ schemes with filters will be the preferred choice for most home PC users. WinRAR removed the PPMd algorithm from new RAR format versions because of slow decompression and lack of scalability to multiple cores. WinRAR also removed its specialized multimedia filters, claiming that uncompressed multimedia data is very unpopular.

    Applications are distributed in various forms. Not all languages are compiled to native binaries. We have JavaScript, Python, Java, shell scripts, etc. Scripting languages are distributed in textual form, so textual compression algorithms apply to them. Java bytecode is like another ISA alongside x86, ARM, Itanium and others. Adding special preprocessing for Java bytecode could probably improve compression a lot and allow compressing Java programs like IDEs (and their plugins) much more strongly than today.

    DiskZIP boasts tremendous space savings, so they probably know where the biggest opportunities for reducing file sizes are, i.e. compressing which types of data will bring the most benefit for home PC users.

    Sure, cmix and paq8 are slow no matter what, but the thing is, Matt uses those values to determine Pareto efficiency ("An underlined value means that no better compressor uses less memory"). Maybe using the actual memory usage wouldn't change anything in regards to that, but I know of a situation where it does make a difference: if cmix only used about 27.5GB of memory, I could run 3 concurrent instances in this machine with 64GB of RAM (~89.6GB of virtual memory), but I can't. Which is probably why a lot of people can't run a single instance with 32GB of RAM either.
    But Darek and Byron managed to finish testing cmix on 32 GiB machines.

    Does the amount of allocated virtual memory matter? On my 32 GiB Linux machine I can allocate terabytes of memory and nothing breaks until I actually try to write a lot of data to that allocated space.

    The Pareto frontier on LTCB doesn't make much sense to me, because there is only one entry in the main table for each compressor and it's rarely the configuration that has the best chance of being on the Pareto frontier.

    if cmix only used about 27.5GB of memory, I could run 3 concurrent instances in this machine with 64GB of RAM (~89.6GB of virtual memory)
    cmix used 27.5 GB of physical memory. 64 GiB is less than 3x that, so no wonder it wouldn't work with 3 copies running simultaneously.

  23. #19
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    If we want to produce the perfect compressor for home PC users then context mixing is out of the question - it's too slow for the ordinary user. A typical user would also prefer an asymmetric compressor. Decompression speed is much more important, because you can use the uncompressed data during compression, but during decompression the uncompressed data isn't there until decompression finishes. And users use lots of space for long-term storage, i.e. they add music, photos, documents, etc. and then read them many times. Overall, LZ schemes with filters will be the preferred choice for most home PC users. WinRAR removed the PPMd algorithm from new RAR format versions because of slow decompression and lack of scalability to multiple cores. WinRAR also removed its specialized multimedia filters, claiming that uncompressed multimedia data is very unpopular.
    Well, you just listed 2 of the slowest compressors known (paq8pxd and CMV). If you want to list only practical, real-world usage compressors, why list those?

    I agree that uncompressed multimedia data is very rare nowadays, but multimedia content itself is more popular than ever, just in already-compressed formats, which is why recompression is unfortunately a necessity. So, for example, a test set of JPEG images, so a user can see what gains Lepton/PackJPG can provide and at what speed, is always nice to have. Not to mention that anyone working on a new JPEG recompression engine would have a baseline with which to judge their progress.

    Having generic test sets with maybe just a dozen files is a recipe for overtuning of codec parameters. I think it's better to have separate sections for specific data types with their own test sets, consisting of a large, diverse and accurate representation of that data, and then, if you so wish, list compressors by aggregate totals. This way you get an idea of the best general-purpose compressors, but for specific data you can also see what the state of the art is and how the general-purpose compressors fare against it.

    Quote Originally Posted by Piotr Tarsa View Post
    Applications are distributed in various forms. Not all languages are compiled to native binaries. We have JavaScript, Python, Java, shell scripts, etc. Scripting languages are distributed in textual form, so textual compression algorithms apply to them. Java bytecode is like another ISA alongside x86, ARM, Itanium and others. Adding special preprocessing for Java bytecode could probably improve compression a lot and allow compressing Java programs like IDEs (and their plugins) much more strongly than today.
    Distribution and actual application contents are very different. On most platforms, applications are distributed in compressed form, even on mobile, so recompression would be needed. But it's probably more interesting to research better methods to actually compress the instruction stream itself, in which case, as you correctly stated, ISA-dependent modelling brings many benefits, and is an area still in its infancy, where research is, imho, actually still interesting.

    Quote Originally Posted by Piotr Tarsa View Post
    But Darek and Byron managed to finish testing cmix on 32 GiB machines.

    Does the amount of allocated virtual memory matter? On my 32 GiB Linux machine I can allocate terabytes of memory and nothing breaks until I actually try to write a lot of data to that allocated space.

    Pareto Frontier on LTCB doesn't make much sense to me because there is only one entry in the main table for each compressor and it's rarely for configuration that has the best chance to be on the Pareto Frontier.

    cmix used 27.5 GB of physical memory. 64 GiB is less than 3x that, so no wonder it wouldn't work with 3 copies running simultaneously.
    Memory management isn't so simple Piotr (see here). If each instance of cmix is allocating ~38GB of memory, by your logic, I shouldn't be able to run even 2 instances.

    My virtual memory pool on this machine is about 89.6GB, consisting of 64GB of physical RAM and about 26GB for the page file (some memory is reserved, so the totals don't always add up neatly). If I increase my page file, I can probably run that 3rd instance, but there will probably be a lot of swapping. On a machine with 32GB of RAM, if your page file is too small or you already have other memory intensive applications running (*cough* Chrome *cough*), the system may not be able to allocate the 38GB of memory that cmix asks for.

    [Attached image: cmix v15 memory usage.png]

  24. #20
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Well, you just listed 2 of the slowest compressors known (paq8pxd and CMV). If you want to list only practical, real-world usage compressors, why list those?
    I didn't intend to list only practical compressors. Actually the main motivation for this project was to create a benchmark that won't easily die, i.e. one that could be maintained and contributed to by anyone (but is a little more sophisticated than a static table). For it to be useful it must have some reproducible data. Compressed file sizes have the best chance of being reproducible (actually I think they have a 100% chance now, as I don't recall any compressor whose output size varies from run to run).

    As I've written before - if I wanted reliable measurements of speed or memory usage then I would need something more complex than a single static HTML page. A dedicated benchmark runner would be needed for that. Such runners are already available - we have TurboBench, lzbench, the Squash Benchmark, etc. I don't want to spend too much time on this project anyway, as I want to work on demixer.

    There are many benchmarks that are practically dead, e.g. the Black Fox benchmark and MaximumCompression, and ones that are rarely updated, e.g. LTCB. They were very interesting to watch despite focusing almost exclusively on file size.
    Last edited by Piotr Tarsa; 13th May 2018 at 22:15.

  25. #21
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    @Piotr, mpais
    > Current physical memory usage is not the same as total allocated virtual memory.
    That's obviously right - however, we need to decide which value should be reported and how users or testers can obtain the proper information. In some cases the memory really used differs from the memory allocated, and in some cases a file doesn't need all the allocated memory - which is how I've been able to test some cmix versions (6, 7) on part of my files on a 16GB machine. Which figure do you want to have in the benchmark - really used or theoretically needed?
    Allocation of memory is a tricky thing. I have 2 laptops with 32GB of RAM. On the first I can use only a small amount more memory than the physical amount - up to 40GB in total, but on the second laptop (newer, maybe a faster SSD, but still SATA?) I can easily run two instances of cmix without a big visible system slowdown...


    >
    Next thing to do will be a support for series of compression programs, e.g. PAQ series has multiple program versions. Matt shows in the main table just a single entry for entire series and then shows more entries below notes (for that particular series) but I think that makes comparison between all entries more difficult. I'll try to come up with something smarter.
    Maybe we could use a color scheme - the best version could be reported normally (black font, white background) or bolded, while other versions could have a lighter color - dark or pure grey, and italic - so they look faded or older.
    A second idea is to separate the compressor name from the version number and put them into two columns - that could be cleaner for reading, filtering and sorting.


    >
    There's also a third option - solid compression, but not every compressor supports that. Having big files means that decompressor size has lower impact on the result and therefore adding it to the result will be less controversial.
    I think we could also use a TAR container. As I understand it, files inside such a package are sorted by file extension, so compression of the TAR file is very often better than the sum of the individual files' scores. On the other hand, there are lots of other containers which standardize the files inside and then work as if solid compression were "on". If a compressor doesn't recognize such formats or repack the files inside, then we are not intentionally using solid compression as the default. I think we should allow such an option.


    >
    Would chained encoding workflows be allowed? The only reason cmix is listed as #1 on the Silesia Open Source Compression benchmark is because it's used after precomp, otherwise paq8pxd would lead.
    In my opinion workflows should be allowed in general, on the condition that all programs used must be included in the unpacking package. Every trick which gives us better compression in total should be permitted. A second idea is to add a column with information about preprocessing, with two possible values: "pure" or "preprocessed".


    Maybe an idea to make entering information even easier is to use e.g. an Excel file with a flat table - such a file is very easy to use, convert, import, export, etc.

    And a question - I have some scores for Calgary, Canterbury, MaximumCompression, Silesia and enwik8/9, though often without timings or memory usage (I recorded memory usage and timings only for the enwik files) - maybe they could be useful to fill your databases.

    LBNL - here are the decompressor sizes for the first two records, packed by 7-Zip with LZMA:
    paq8pxd_47_3 -> 150'210 bytes -> total archive = 126'899'794
    CMVx64 v00.01.01 -> 78'922 bytes -> total archive = 149'436'687




  26. #22
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Darek View Post
    @Piotr, mpais
    > Current physical memory usage is not the same as total allocated virtual memory.
    That's obviously right - however, we need to decide which value should be reported and how users or testers can obtain the proper information. In some cases the memory really used differs from the memory allocated, and in some cases a file doesn't need all the allocated memory - which is how I've been able to test some cmix versions (6, 7) on part of my files on a 16GB machine. Which figure do you want to have in the benchmark - really used or theoretically needed?
    Allocation of memory is a tricky thing. I have 2 laptops with 32GB of RAM. On the first I can use only a small amount more memory than the physical amount - up to 40GB in total, but on the second laptop (newer, maybe a faster SSD, but still SATA?) I can easily run two instances of cmix without a big visible system slowdown...

    I've finished running cmix and made a few quick tests on paq8pxd, it seems it has a bug in memory reporting (to complicate things even more).
    Code:
    File: alice29.txt
    
    paq8pxd_v47: Time 15.16 sec, used 33245 MB (500257668 bytes) of memory
    Actual allocated virtual memory: ~26400MB
    
    paq8pxd_v47_3(bwt2): Time 16.92 sec, used 35481 MB (2845017218 bytes) of memory
    Actual allocated virtual memory: ~28600MB

    So there's a bug that makes it report more memory than is actually allocated (aside from a 32-bit overflow in the reported number of bytes). Looking at the code for v47_3 it's clear where the extra memory is being used, and it's the reason for the good improvement on enwik.

    Now, as for what should be reported, imho, it should always be the total memory allocated, because that is what the models actually request. Sure, if you use a 30GB page file on a machine with just 16GB of RAM, you can run paq8pxd -s15 and if you look at the physical memory used, you'll probably see just 14GB. Does that mean it only requires 14 GB of memory? No.

  27. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    OK. So the question is now: how to reliably measure memory usage and have a metric that is consistent across operating systems?

    Maybe we could use a color scheme - the best version could be reported normally (black font, white background) or bolded, while other versions could have a lighter color - dark or pure grey, and italic - so they look faded or older.
    A second idea is to separate the compressor name from the version number and put them into two columns - that could be cleaner for reading, filtering and sorting.
    Compressors are already divided into series, and the series filters are controlled using the multiselects visible above the table.

    I think we could also use a TAR container. As I understand it, files inside such a package are sorted by file extension, so compression of the TAR file is very often better than the sum of the individual files' scores. On the other hand, there are lots of other containers which standardize the files inside and then work as if solid compression were "on". If a compressor doesn't recognize such formats or repack the files inside, then we are not intentionally using solid compression as the default. I think we should allow such an option.
    There was an idea to have the decompressor size included once per test set and have each file compressed separately. This way we could have something like:
    name    file1    file2    file3    total    decompressor size    total + decompressor size
    paq8    10       20       30       60       100                  160
    That's my current plan for displaying results on the various corpora, like the Calgary corpus, Canterbury corpus, Silesia corpus, MaximumCompression SFC corpus, etc.

    Maybe an idea to make entering information even easier is to use e.g. an Excel file with a flat table - such a file is very easy to use, convert, import, export, etc.
    It should be feasible. I've done some quick googling and found the CDN-hosted library https://cdnjs.com/libraries/jquery-csv for CSV files. CSV is a format that Excel can import and export.

    And a question - I have some scores for Calgary, Canterbury, MaximumCompression, Silesia and enwik8/9, though often without timings or memory usage (I recorded memory usage and timings only for the enwik files) - maybe they could be useful to fill your databases.

    LBNL - here are the decompressor sizes for the first two records, packed by 7-Zip with LZMA:
    paq8pxd_47_3 -> 150'210 bytes -> total archive = 126'899'794
    CMVx64 v00.01.01 -> 78'922 bytes -> total archive = 149'436'687
    Thanks for the results. For now I'm mostly interested in enwik scores, as I already have a presentation for them. For me the compressed sizes and decompressor sizes are crucial. Memory usage and timings are welcome, but not strictly required.

    As a side note:
    I've implemented sorting. Now you can click on the columns "enwik8", "enwik9" and "enwik9 + decompressor" and the table will be sorted according to that column. The next step is to introduce a notion of outdated results, i.e. if I have the following results for a single series of compressors:
    compressor    options    size    outdated
    zip 1.2       best       100     false
    zip 1.2       fast       150     false
    zip 1.1       best       110     true
    zip 1.1       fast       160     true
    Then I would be able to either:
    • show them all (already implemented)
    • show only the first result (already implemented)
    • show all results that are not outdated (to be implemented as the next step; a rough sketch follows below)
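    A minimal sketch of that third mode, assuming each entry carries an outdated flag stored either as a boolean or as a "yes"/"no" string:
    Code:
    // Hide entries marked as outdated while keeping the historical
    // results in the data set.
    function currentResults(results) {
        return results.filter(function (entry) {
            return entry.outdated !== true && entry.outdated !== "yes";
        });
    }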

  28. #24
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I have implemented the notion of outdated results (so we can show only the up-to-date ones, but keep the historical ones in the result set) and also changed the format of the results table (from function calls) to CSV. Now the results for enwiks are embedded in the HTML page as:
    Code:
    var enwik_results_csv = String.raw`
    "program_series","program_name","program_options","compressed_enwik8_size","compressed_enwik9_size","decompressor_size","total_compressed_size","compression_time","decompression_time","memory_usage_in_mb","algorithm_type","note_id","outdated"
    "CMV","CMV v00.01.01","-m2,3,0x03ed7dfb","18,122,372","149,357,765","78,922","149,436,687","426,163","394,855","3,335","CM","1","yes"
    "PAQ","Paq8pxd_v47_3","-s15","16,038,418","126,749,584","150,210","126,899,794","83,435","86,671","27,600","CM","1","no"
    "fake","fake v1","--fast","93,123,123","923,123,123","123,123","923,246,246","23,456","23,456","1,234","FAKE","0","no"
    "fake","fake v1","--test","3,123,123","23,123,123","123,123","23,246,246","823,456","823,456","81,234","FAKE","0","yes"
    "fake","fake v1","--best","20,000,000","100,000,000","123,123","100,123,123","123,456","123,456","12,345","FAKE","0","no"
    `.trim();
    That CSV content can be copied to a file and imported using Microsoft Excel, LibreOffice Calc, etc. Be aware that you need to use a locale in which a dot separates fractions and a comma separates thousands. In other words, select e.g. the UK locale when importing and exporting data.
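    For anyone who would rather process that CSV inside the page (or in Node) instead of a spreadsheet, here is a rough parsing sketch; it relies on every field being double-quoted, as in the sample above, and the helper names are mine:
    Code:
    // Parse the fully-quoted CSV embedded in the page into plain objects.
    function parseResultsCsv(csvText) {
        var lines = csvText.trim().split("\n").filter(function (line) {
            return line.trim().length > 0;
        });
        var rows = lines.map(function (line) {
            // Every field is quoted, so extract the quoted chunks and strip the quotes.
            return line.match(/"([^"]*)"/g).map(function (field) {
                return field.slice(1, -1);
            });
        });
        var header = rows[0];
        return rows.slice(1).map(function (fields) {
            var entry = {};
            header.forEach(function (name, i) { entry[name] = fields[i]; });
            return entry;
        });
    }

    // Convert comma-grouped sizes like "16,038,418" to numbers.
    function sizeToNumber(text) {
        return parseInt(text.replace(/,/g, ""), 10);
    }

    // e.g. sizeToNumber(parseResultsCsv(enwik_results_csv)[0].compressed_enwik8_size)
    //      === 18122372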

    The next step is to implement support for more corpora, but you can already send me results for the enwiks.

  29. #25
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Support for more corpora has been added. Now there are 5:
    - enwik8 & enwik9
    - Calgary corpus
    - Canterbury corpus
    - Maximum Compression SFC test files
    - Silesia compression corpus

    Data is still kept inside the HTML as CSV content, so you can copy it to a new file and then open it in Microsoft Excel or LibreOffice Calc (or anything else that can open CSV files). Because I'm using some new features of JavaScript, the page will only work in modern browsers like recent versions of Google Chrome or Mozilla Firefox.

    Tables can be sorted by clicking on the columns named after files or named like "total size without decomp" or "total size". By default tables are sorted by the "total size" column, just like on Matt's LTCB.

  30. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    Darek (23rd May 2018)

  31. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    measuring memory usage on windows: https://encode.ru/threads/1838-Comma...ll=1#post35998

    other utilities are also available

  32. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Piotr Tarsa (22nd May 2018)

  33. #27
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks. I've added it to the Q&A section. I will write a usage and contribution guide soon and maybe merge it with the questions and answers section.

  34. #28
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Two more questions:
    1) is it allowed to use different compressor settings for particular files within one corpus?

    I mean that for some compressors with sophisticated options, using only one setting for the whole corpus gives much worse scores than the possible maximum - e.g. EMMA, or the newest paq8px versions (e/t/a options). Without such a possibility, the benchmark list will contain only sub-optimal scores and we lose some opportunity to compress the entire corpus better. That's my opinion.

    In such a case, all the settings would need to be listed somewhere. A second idea is to use the same rule as for precompressors - in this case a compress.bat file could be added to the decompressor package. This file would define the particular settings simply by listing the command lines used to compress each file, and could be stored instead of a list of settings.

    2) It's hard to put information about the sophisticated, custom-defined sets of settings mentioned above into the list. EMMA is a good example - its settings can define 35 parameters. Maybe the multi-setting idea above could help here too?

  35. #29
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Here is my test input for the database - a small exe file. Can I put it on the page myself, or can only the DB administrator do that?
    Attached Files

  36. #30
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Support for more corpora has been added. Now there are 5:
    - enwik8 & enwik9
    - Calgary corpus
    - Canterbury corpus
    - Maximum Compression SFC test files
    - Silesia compression corpus
    You need a small-file benchmark that has about 1000 files of 100 kB to 1 MB each. If your small-file benchmark corpora are small (< 100 MB), they will emphasize the decoder size too much. It is nowadays unusual for a 10 MB file to ship with its own decoder.

    For example, your browser contains decompressors for HTTP content encoding. Such a decompressor will decode more than 10 MB of data during its lifetime (most likely gigabytes+++), and ranking them using just a few MB would be strange. Still, all they do is decode small files.

    If you cannot get a 100++ MB small-file benchmark otherwise, one possibility is to take a random sample from all of the other corpora, but keep each sample short, i.e. grab slices from files in the other corpora. For a small-file corpus, I'd recommend taking about 500-5000 files depending on the size of the benchmark.

    It would also be practical and respectful to users to have Chinese and Spanish UTF-8 text in addition to the English and Polish text (is the Polish sample in Silesia in UTF-8?)...
