So there are (very roughly) about 20,000 packages in the repo, and for each package its metadata record is a few KB in size (starting at about 1 KB, I believe). If they compress each record individually, that's gonna hurt the compression ratio tremendously, because there isn't much redundancy within a single record, but there is a lot of similarity between records, especially if they are sorted by package name.
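To see the effect concretely, here is a rough sketch: compress a pile of synthetic, mutually similar ~1 KB records one by one and then as a single sorted stream. The record layout is invented, and stdlib zlib merely stands in for whatever codec the repo actually uses.

```python
# Rough sketch: many small, similar metadata records compress far better
# together than one at a time. The record layout below is made up purely
# for illustration.
import zlib

# Fake per-package metadata records, roughly 1 KB each, highly similar.
records = [
    (f"name: package-{i}\nversion: 1.{i % 10}.0\narch: x86_64\n"
     f"summary: an example package used only for this test\n"
     f"requires: libc, libfoo, libbar\n" + "description: lorem ipsum " * 30)
    .encode()
    for i in range(20000)
]

# Compress each record individually (no cross-record redundancy exploited).
individual = sum(len(zlib.compress(r, 9)) for r in records)

# Compress all records as one stream, sorted by name, so similar records
# sit next to each other inside the compressor's window.
combined = len(zlib.compress(b"".join(sorted(records)), 9))

print(f"individual: {individual} bytes, combined: {combined} bytes")
```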
Zchunk is some sort of delta scheme. I haven't dug into the details, but is it trivially easy to use brotli for delta compression cases? Is there any reason Zstd would be preferred for such cases?
This is thoroughly explained in the link provided by Cyan. "Delta" here refers to chunk-based diffs, not the delta filter we know from some LZ compressors. See the parts where the author talks about the relation of zchunk to rsync and zsync. The format itself is pretty straightforward, and I believe the algorithm of choice has little to nothing to do with the functionality of the program. The author probably went for zstd because it is better known, a little more stable, and pretty much state of the art for its niche. Maybe brotli is better for some cases, maybe not, but the author probably doesn't even know, or he knows that nobody on his team would volunteer to help him add it. Remember that zstd is rapidly gaining adopters in the Linux world; it is even included in the kernel. See this post for some uses of it.
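For anyone who hasn't read the link, a minimal sketch of the chunk-based idea: split the file into chunks, hash each chunk, and fetch only the chunks whose hashes you don't already have locally. This is not the actual zchunk format; the fixed-size chunking and helper names below are just for illustration (zchunk uses content-defined boundaries and compresses each chunk independently, which is why the choice of codec is mostly orthogonal).

```python
# Minimal sketch of the chunk-based "delta" idea behind tools like zchunk,
# zsync and rsync: hash chunks, reuse the ones you already have, fetch the
# rest. NOT the real zchunk format; fixed-size chunks are used for brevity.
import hashlib

CHUNK = 4096

def chunk_hashes(data: bytes) -> list[tuple[str, bytes]]:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]

def delta_fetch(old: bytes, new: bytes) -> bytes:
    """Rebuild `new`, reusing chunks of `old` whose hashes match."""
    have = {h: c for h, c in chunk_hashes(old)}
    rebuilt, downloaded = [], 0
    for h, c in chunk_hashes(new):
        if h in have:
            rebuilt.append(have[h])   # reuse local chunk
        else:
            rebuilt.append(c)         # "download" only this chunk
            downloaded += len(c)
    print(f"downloaded {downloaded} of {len(new)} bytes")
    return b"".join(rebuilt)

old = b"A" * 40000 + b"B" * 40000
new = old[:8192] + b"C" * 4096 + old[12288:]   # one chunk changed
assert delta_fetch(old, new) == new
```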
Comparing Brotli and zstd, Brotli is very likely more widely available: it is in .NET, in iOS (including the TV and phone versions of iOS), in Windows, in Android, and in every browser that is still being developed. Brotli is even included in my LG TV.
Sure, it is not in the Linux kernel. I think zstd got its foot in the door by benchmarking brotli 11 with a 4-16 MB window against zstd 22 with a 128 MB window. For quite a while zstd's Wikipedia page claimed faster compression and decompression for zstd together with a better compression ratio. Even today, the LTCB does not include the large-window Brotli results that would make it generally favorable over zstd. Large-window mode is slowly fixing that, and a meta-analysis of recent benchmarking efforts shows Brotli compressing roughly 5 % more densely than zstd (when both are run at maximal settings on large corpora where the static dictionary doesn't help Brotli).
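For concreteness, a rough harness for a "maximal settings" comparison might look like the sketch below. It assumes the third-party brotli and zstandard Python packages are installed, sticks to brotli's standard 16 MiB window (large-window mode is, as far as I know, a separate CLI option / build), leaves zstd's window and long-distance matching at the level defaults, and ignores the static dictionary question entirely, so it is only a starting point, not an official benchmark.

```python
# Rough benchmarking harness, run as: python bench.py <corpus-file>
# Assumes third-party packages: pip install brotli zstandard
import sys
import time

import brotli       # reference brotli bindings
import zstandard    # python-zstandard bindings

def bench(name, compress, data):
    start = time.perf_counter()
    out = compress(data)
    secs = time.perf_counter() - start
    print(f"{name:10s} {len(out):>12d} bytes  "
          f"{len(out) / len(data):6.3f} ratio  {secs:6.2f} s")

data = open(sys.argv[1], "rb").read()

# Brotli at quality 11 with its standard maximum window (lgwin=24, 16 MiB).
# Large-window brotli (up to ~1 GiB windows) is a separate mode and, as far
# as I know, needs the CLI's --large_window option or a build enabling it.
bench("brotli-11", lambda d: brotli.compress(d, quality=11, lgwin=24), data)

# Zstd at its maximum level 22; window size and long-distance matching are
# left at that level's defaults here.
bench("zstd-22", zstandard.ZstdCompressor(level=22).compress, data)
```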
For compressed filesystem use, zstd may have a better balance of features -- there, decompression speed can be really critical -- but I consider brotli more balanced for the general use case.
That I didn't know. Thanks for the info!
Jyrki, do you really believe that the current situation is the result of a benchmark publication on LTCB?
The filesystem, or possibly the datacenter network fabric, seems to be a better fit for the properties of zstd.
My bad. I stand corrected.
A compression benchmark needs to use the same hardware for all codecs, and ideally the latest version of each codec.
What is a streaming implementation, anyway? All compressors have some lag during compression and decompression. If we pair a compressor and a decompressor (piping the compressor output directly to the decompressor), then before the decompressor can output a given original byte, the compressor will have had to process many of the following input bytes.
A streaming data format is one where you don't need a lot of future data to decide how to decode the current bits.
An encoding-streaming-friendly data format is one where you don't need to have a lot of data available before you can compress. For example, ANS is a minute step away from encoding streaming friendliness, since data needs to be encoded backwards.
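As a concrete illustration of what that buys you (using stdlib zlib merely as an example of a streaming interface, nothing specific to zstd, brotli or ANS): the encoder can be flushed mid-stream, and the decoder recovers everything seen so far without waiting for any future bytes.

```python
# Small illustration of "streaming friendly" in practice: flush the encoder
# mid-stream and the decoder can already reproduce everything seen so far,
# with no future input needed.
import zlib

enc = zlib.compressobj()
dec = zlib.decompressobj()

recovered = b""
for message in [b"first chunk of a long feed, ", b"second chunk, ", b"third."]:
    # Z_SYNC_FLUSH emits all pending output but keeps the stream open.
    piece = enc.compress(message) + enc.flush(zlib.Z_SYNC_FLUSH)
    recovered += dec.decompress(piece)
    print(f"decoder has {len(recovered)} bytes after this chunk")

assert recovered == b"first chunk of a long feed, second chunk, third."
```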