
Thread: Fedora 29 to use Zchunk to compress repository metadata

  #1 SolidComp (USA)

    The news is at Phoronix. I had never even heard of Zchunk. Apparently it's based on Zstd. It's replacing XZ, which uses LZMA2.


  #2 Cyan (France)
    Its author posted a description here: https://www.jdieter.net/posts/2018/0...hat-is-zchunk/

  #3 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by SolidComp View Post
    The news is at Phoronix. I had never even heard of Zchunk. Apparently it's based on Zstd. It's replacing XZ, which uses LZMA2.
    With large-window Brotli they would get a 5x decoding speed-up over LZMA and a 0.6% size increase; with Zstd, a 10x speed-up but a 5% size increase. My thinking is that they would likely have been better off with Brotli.

  #4 Member (Selo Bliny-S'edeny)
    So there are (very roughly) about 20,000 packages in the repo, and each package's metadata record is a few KB in size (starting at about 1 KB, I believe). If they compress each record individually, that's going to hurt the compression ratio tremendously, because there isn't much redundancy within a record but there is a lot of similarity between records, especially if they are sorted by package name.
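
    A quick way to see the effect, using zlib from the Python standard library as a stand-in for whichever codec the repo uses (the record contents below are made up purely for illustration):

    Code:
    import zlib

    # Hypothetical metadata records: little redundancy inside each record,
    # but a lot of similarity from one record to the next.
    records = [
        f"name=pkg-{i:05}\nversion=1.0.{i}\narch=x86_64\nlicense=GPLv2\n".encode()
        for i in range(20000)
    ]

    # Compressing each record on its own cannot exploit cross-record similarity.
    individual = sum(len(zlib.compress(r, 9)) for r in records)

    # Compressing the concatenation lets the codec reuse matches across records.
    combined = len(zlib.compress(b"".join(records), 9))

    print(f"separately: {individual:,} bytes")
    print(f"together:   {combined:,} bytes")

    Per-chunk dictionaries, which both zstd and brotli support, are the usual way to claw back some of that loss while keeping chunks independently decodable.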

  #5 SolidComp (USA)
    Quote Originally Posted by Jyrki Alakuijala View Post
    With large-window Brotli they would get a 5x decoding speed-up over LZMA and a 0.6% size increase; with Zstd, a 10x speed-up but a 5% size increase. My thinking is that they would likely have been better off with Brotli.
    If your estimates are accurate, I agree with you, unless there's some catch regarding CPU or memory use, or encoding speed turns out to matter more. What Brotli and Zstd levels are you basing your estimates on?

    Zchunk is some sort of delta scheme. I haven't dug into the details, but is it trivially easy to use Brotli for delta compression? Is there any reason Zstd would be preferred for such cases?
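
    For what it's worth, both codecs can treat the previous version of a file as a dictionary, which is one way to get delta-style compression. A minimal sketch with the third-party python-zstandard binding (file names are made up, and zchunk itself gets its delta effect by re-downloading changed chunks, not by computing a dictionary delta like this):

    Code:
    import zstandard  # pip install zstandard

    old = open("repodata-old.xml", "rb").read()   # version the client already has
    new = open("repodata-new.xml", "rb").read()   # version on the server

    # Treat the old file as a raw-content dictionary: matches against it are
    # nearly free, so only the changed parts cost bytes on the wire.
    d = zstandard.ZstdCompressionDict(old, dict_type=zstandard.DICT_TYPE_RAWCONTENT)
    delta = zstandard.ZstdCompressor(level=19, dict_data=d).compress(new)

    # The receiver needs the same dictionary (the old file) to decode.
    restored = zstandard.ZstdDecompressor(dict_data=d).decompress(delta)
    assert restored == new

    Brotli offers shared-dictionary support as well, though how it is exposed varies by binding, so on this axis the two codecs look interchangeable.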

  #6 Gonzalo (Argentina)
    This is thoroughly explained at the link Cyan provided. "Delta" here refers to chunk-based diffs, not the delta filter implemented in some LZ compressors; see the parts where the author talks about the relation of zchunk to rsync and zsync. The format itself is pretty straightforward, and I believe the choice of algorithm has little to nothing to do with the functionality of the program. The author probably went for Zstd because it is better known, a little more stable, and pretty much state of the art for its niche. Maybe Brotli is better for some cases, maybe not, but the author probably doesn't know, or he knows that nobody on his team would volunteer to help him integrate it. Remember that Zstd is rapidly gaining adoption in the Linux world. It is even included in the kernel. See this post for some uses of it.
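
    The chunk-based idea can be sketched in a few lines: split the file into chunks, hash each chunk, and fetch only the chunks whose hashes you don't already have. (Fixed-size chunks and made-up file names here; the real format picks chunk boundaries more cleverly, e.g. per package, so a small edit doesn't shift every later chunk.)

    Code:
    import hashlib

    CHUNK = 4096  # fixed-size chunks, purely for illustration

    def digests(data):
        return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
                for i in range(0, len(data), CHUNK)]

    old = open("repodata-old.xml", "rb").read()  # what the client already has
    new = open("repodata-new.xml", "rb").read()  # described by the server's index

    have = set(digests(old))
    needed = [i for i, d in enumerate(digests(new)) if d not in have]
    print(f"download {len(needed)} of {len(digests(new))} chunks")

    Since each chunk is compressed independently, any self-contained codec slots in, which supports the point that the zstd-vs-brotli choice is orthogonal to the format.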

  #7 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Gonzalo View Post
    Remember that Zstd is rapidly gaining adoption in the Linux world. It is even included in the kernel. See this post for some uses of it.
    The recent penetration of modern compression technologies has been a fantastic step!

    Comparing Brotli and Zstd, Brotli is very likely more widely available: it is in .NET, in iOS (including the TV and phone versions), in Windows, in Android, and in every browser that is still being developed. Brotli is even included in my LG TV.

    Sure, it is not in the Linux kernel. I think Zstd got its foot in the door by benchmarking Brotli 11 with a 4-16 MB window against Zstd 22 with a 128 MB window. For quite a while Zstd's Wikipedia page claimed faster compression and decompression for Zstd along with a better compression ratio. Even today, the LTCB does not include the large-window Brotli results that would make it generally favorable over Zstd. Large-window mode is slowly fixing that, and a meta-analysis of recent benchmarking efforts shows Brotli compressing ~5% more densely than Zstd (when both are run at maximal settings in large-corpus scenarios where the static dictionary doesn't help Brotli).

    For compressed filesystem use, Zstd may have a better balance of features, since decompression speed can be really critical there; but I consider Brotli more balanced for the general use case.
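
    To make the window-size point concrete, here is roughly what matched max-settings runs look like from Python, using the third-party brotli and zstandard bindings (the file name is made up; note the standard Brotli binding caps the window at lgwin=24, i.e. 16 MB, so a true large-window run may need the command-line tool instead):

    Code:
    import brotli       # pip install brotli
    import zstandard    # pip install zstandard

    data = open("repodata.xml", "rb").read()

    # Brotli at maximum quality; lgwin=24 is the largest window this binding
    # exposes (the large-window extension goes up to lgwin=30, ~1 GiB).
    br = brotli.compress(data, quality=11, lgwin=24)

    # Zstd at its maximum level, which implies a much larger window on big inputs.
    zs = zstandard.ZstdCompressor(level=22).compress(data)

    print(f"brotli -q 11: {len(br):,} bytes")
    print(f"zstd -22:     {len(zs):,} bytes")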

  #8 Gonzalo (Argentina)
    That I didn't know. Thanks for the info!

  #9 Cyan (France)
    Jyrki, do you really believe that the current situation is the result of a benchmark publication on LTCB?


  #10 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Cyan View Post
    Jyrki, do you really believe that the current situation is the result of a benchmark publication on LTCB?
    Yeah. I believe that the people basing over-the-wire or over-the-air internet-scale solutions on zstd are doing so based on faulty benchmarking scenarios. It seems to me that zchunk is in this category.

    The filesystem, or possibly the datacenter network fabric, seems a better fit for zstd's properties.

  #11 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Gonzalo View Post
    ... a little more stable ...
    This is rather confusing. At the time when zstd was still being modified at the format level (before zstd 1.0), brotli was already in use in its final form, a trillion times per day, as part of WOFF 2.0 and the 'br' content encoding.


  #12 Gonzalo (Argentina)
    My bad. I stand corrected.


  #13 SolidComp (USA)
    Quote Originally Posted by Jyrki Alakuijala View Post
    The recent penetration of modern compression technologies has been a fantastic step!

    Comparing Brotli and Zstd, Brotli is very likely more widely available: it is in .NET, in iOS (including the TV and phone versions), in Windows, in Android, and in every browser that is still being developed. Brotli is even included in my LG TV.

    Sure, it is not in the Linux kernel. I think Zstd got its foot in the door by benchmarking Brotli 11 with a 4-16 MB window against Zstd 22 with a 128 MB window. For quite a while Zstd's Wikipedia page claimed faster compression and decompression for Zstd along with a better compression ratio. Even today, the LTCB does not include the large-window Brotli results that would make it generally favorable over Zstd. Large-window mode is slowly fixing that, and a meta-analysis of recent benchmarking efforts shows Brotli compressing ~5% more densely than Zstd (when both are run at maximal settings in large-corpus scenarios where the static dictionary doesn't help Brotli).

    For compressed filesystem use, Zstd may have a better balance of features, since decompression speed can be really critical there; but I consider Brotli more balanced for the general use case.
    I really wish Matt would fix the LTCB. It's not a valid benchmark. In addition to the window-size issues, the benchmarks are run on different machines with different compiler settings, and often use very old versions of a codec. The gzip he uses is more than 10 years old, and I have no idea how it compares to the latest releases of zlib, GNU gzip, or libdeflate. The Brotli and Zstd versions are two years old. There's no report of decompression memory usage, and no report of CPU load for either compression or decompression.

    A compression benchmark needs to use the same hardware for all codecs, and ideally the latest version of each codec.
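
    A minimal sketch of a fairer harness, timing every codec on the same machine in the same process (standard-library codecs shown; the input file name is made up):

    Code:
    import time
    import zlib
    import lzma

    def bench(name, compress, decompress, data, runs=5):
        # Compress once, then take the best of several decompression timings.
        blob = compress(data)
        best = float("inf")
        for _ in range(runs):
            t0 = time.perf_counter()
            decompress(blob)
            best = min(best, time.perf_counter() - t0)
        print(f"{name:8} {len(blob):>10,} bytes  {len(data) / best / 1e6:7.1f} MB/s decode")

    data = open("repodata.xml", "rb").read()

    # Same machine, same process, same measurement loop for every codec.
    bench("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress, data)
    bench("xz-9", lambda d: lzma.compress(d, preset=9), lzma.decompress, data)

    Peak memory and CPU load would still need extra accounting (e.g. resource.getrusage on Unix), but at least sizes and speeds become comparable.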


  #14 pothos2 (Berlin)
    Also off-topic, but I think the Squash library is a good building block for such benchmarks; it needs more support to cover more formats and test data: https://quixdb.github.io/squash-benchmark/

  #15 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by pothos2 View Post
    Also off-topic, but I think the Squash library is a good building block for such benchmarks; it needs more support to cover more formats and test data: https://quixdb.github.io/squash-benchmark/
    Squash is close to perfect. The only problem I know of is its ignorance of streaming vs. non-streaming implementations: all codecs are assumed non-streaming on both ends, which favors non-streaming implementations in the transmission-speed rankings.

  #16 Piotr Tarsa (Kraków, Poland)
    What is a streaming implementation anyway? All compressors have some lag during compression and decompression. If we pair a compressor and a decompressor (piping the compressor's output directly into the decompressor), the compressor has to process many subsequent input bytes before the decompressor can output a given original byte.

  #17 Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Piotr Tarsa View Post
    What is a streaming implementation anyway? All compressors have some lag during compression and decompression.
    A streaming implementation is one that can decode all the bytes that are decodable given the information received so far. In my experience a streaming decoder is 10-35% slower in CPU terms than a non-streaming one, but the computation system built on top of a streaming implementation works faster overall.

    A streaming data format is one where you don't need a lot of future data to decide how to decode the current bits.

    An encoding-streaming-friendly data format is one where you don't need to have a lot of data available before you can start compressing. For example, ANS is one small step away from encoding-streaming friendliness, since its data needs to be encoded backwards.
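
    The distinction is easy to see with an incremental API such as zlib's decompressobj from the Python standard library (used here as a stand-in; the brotli and zstd bindings offer analogous streaming interfaces):

    Code:
    import zlib

    # A compressed payload arriving over the network in small pieces.
    payload = zlib.compress(b"<html>... lots of page ...</html>" * 2000)

    d = zlib.decompressobj()
    decoded = bytearray()
    for i in range(0, len(payload), 512):
        # Take whatever is decodable from the bytes received so far,
        # instead of waiting for the final byte of the stream.
        decoded += d.decompress(payload[i:i + 512])
        # len(decoded) grows here; a consumer could already parse this prefix.
    decoded += d.flush()
    assert bytes(decoded) == b"<html>... lots of page ...</html>" * 2000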

    Quote Originally Posted by Piotr Tarsa View Post
    All compressors have some lag during compression and decompression.
    Yes, just different amounts, and those amounts can affect user experience. In the browser use case, streaming means the browser can issue fetches for images or JavaScript earlier (if the URLs appear in the already-decoded part), or show some sort of preview of the half-loaded page sooner. Also, more CPU-heavy processing such as DOM building or JavaScript parsing and execution can start while the content is still loading.

