
Thread: Brotli literal and offset encoding

  1. #1 - geza (Member)

    Brotli literal and offset encoding

    Hi,

    This is my first post here, so pleased to meet you!

    I have some questions about brotli (maybe the authors are here, as this is THE forum for compression topics).

    All the other LZ+Huffman compressors I know use codes where the literal vs. match decision is made one symbol at a time. I mean, a possible decompressor looks like:

    for (;;) {
        int mainCode = get();
        if (mainCode < 256) {
            // got a literal
        } else {
            // got a match
        }
    }


    On the other hand, brotli uses literal runs (like LZ4 does, if I remember correctly):


    for (;;) {
        int command = get();
        int literalRunLen = getLiteralRunLen(command);

        for (int i = 0; i < literalRunLen; i++) {
            // get literal
        }

        // process match
    }


    Why is this coding used in brotli? Does it have better compression? Better decompression speed?

    The other thing is that brotli doesn't entropy-code the low offset bits the way LZX does. Why is that?
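    To clarify what I mean, here is a rough sketch of LZX-style "aligned offset" decoding as I understand it (helper names and tables are mine, not the real LZX code):

    /* The offset is split into a Huffman-coded slot, verbatim extra bits, and,
       when there are at least 3 extra bits, the lowest 3 bits come from a
       separate 8-symbol "aligned offset" Huffman tree, so heavily skewed low
       bits cost less than 3 raw bits. */
    extern unsigned huff_decode(const void *tree);    /* assumed Huffman decoder */
    extern unsigned read_bits(unsigned n);            /* assumed raw bit reader  */
    extern const void *offset_slot_tree, *aligned_tree;
    extern const unsigned extra_bits_for_slot[], base_for_slot[];

    unsigned decode_offset(void)
    {
        unsigned slot  = huff_decode(offset_slot_tree);
        unsigned nbits = extra_bits_for_slot[slot];
        unsigned base  = base_for_slot[slot];

        if (nbits >= 3)        /* high bits verbatim, low 3 bits entropy coded */
            return base + (read_bits(nbits - 3) << 3) + huff_decode(aligned_tree);
        return base + read_bits(nbits);
    }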

    Have you guys discussed previously whether it is possible to enhance the LZX compression ratio (without significant decompression speed loss - I mean algorithm changes, not just increasing the window size)?

    All I know of is LZMH, which is just a little better than LZX (as far as I analyzed it, it is basically LZX, but with 4 low offset bits instead of 3, and it has a feature that lets it completely omit some low offset bits if their distribution is highly skewed towards 0).

    Thanks!

  2. #2 - Jyrki Alakuijala (Member)
    Quote Originally Posted by geza View Post
    Why is this coding used in brotli? Does it have better compression? Better decompression speed?
    Yes. Yes.

    Quote Originally Posted by geza View Post
    The other thing, that brotli doesn't use low offset bits entropy coding, like LZX. Why is that?
    Several reasons, none of which were strong enough alone to help us make the decision:

    The way we do context modeling handles around 6 context bits well. We used those bits for the two previous bytes with some LUT magic, so no bits were left to use on addresses (a rough sketch is below, after these points).

    Brotli has 'joint entropy codes', i.e., the same entropy code defines both the length of the literal run and the length of the copy. This partially helps with data that has a 4- or 8-byte repetitive pattern (or a 3- or 11-byte repetition).

    Brotli was already "sufficiently complex" and adding more complexity didn't feel right.

    More regular data such as audio, pictures, video, triangle meshes, etc. should probably be encoded with a dedicated compressor or pre-processed.
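    Roughly, the literal context is a 6-bit value derived from the two previous bytes, something like the simplified sketch below (not the exact brotli code or tables; byte_class() here is a made-up classifier):

    /* A 6-bit literal context from the previous two bytes p1 (most recent)
       and p2; each of the 64 contexts selects which entropy code is used for
       the next literal. */
    static unsigned byte_class(unsigned char b)       /* made-up 3-bit classes */
    {
        if (b == ' ')  return 0;
        if (b <  ' ')  return 1;                      /* control characters    */
        if (b <  0x80) return 2 + (b >> 5);           /* a few ASCII ranges    */
        return b < 0xc0 ? 6 : 7;                      /* UTF-8 cont. vs. lead  */
    }

    static unsigned literal_context(unsigned char p1, unsigned char p2, int mode)
    {
        switch (mode) {
        case 0:  return p1 & 0x3f;                    /* low 6 bits of p1      */
        case 1:  return p1 >> 2;                      /* high 6 bits of p1     */
        default: return (byte_class(p1) << 3) | byte_class(p2);  /* LUT-style  */
        }
    }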

    Quote Originally Posted by geza View Post
    Have you guys discussed previously whether it is possible to enhance LZX compression ratio (without significant decompression speed lost - I mean algorithm changes not just increasing window size)?
    I don't know LZX, but usually it is possible with LZ77 to do iterative optimal parsing like in Zopfli.
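    The core of such an iterative parse is a shortest-path pass over the block under a bit-cost model; here is a minimal sketch (toy code, not Zopfli; the match finder and cost functions are stand-ins, and Zopfli re-derives the cost model from the statistics of the previous pass):

    #include <float.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_MATCHES 16

    typedef struct { uint32_t len, dist; } match_t;

    /* assumed helpers; matches never run past 'n' */
    extern unsigned find_matches(const uint8_t *data, size_t pos, size_t n,
                                 match_t out[MAX_MATCHES]);
    extern double literal_cost(uint8_t byte);               /* bits for a literal */
    extern double match_cost(uint32_t len, uint32_t dist);  /* bits for a match   */

    /* best[i] = cheapest cost to encode the first i bytes; from_len/from_dist
       record the last step so the chosen parse can be read off backwards. */
    void optimal_parse(const uint8_t *data, size_t n,
                       double *best, uint32_t *from_len, uint32_t *from_dist)
    {
        for (size_t i = 1; i <= n; i++) best[i] = DBL_MAX;
        best[0] = 0.0;

        for (size_t i = 0; i < n; i++) {
            if (best[i] == DBL_MAX) continue;

            double c = best[i] + literal_cost(data[i]);      /* emit a literal */
            if (c < best[i + 1]) {
                best[i + 1] = c; from_len[i + 1] = 1; from_dist[i + 1] = 0;
            }

            match_t m[MAX_MATCHES];                          /* or emit a match */
            unsigned k = find_matches(data, i, n, m);
            for (unsigned j = 0; j < k; j++) {
                double mc = best[i] + match_cost(m[j].len, m[j].dist);
                if (mc < best[i + m[j].len]) {
                    best[i + m[j].len]      = mc;
                    from_len[i + m[j].len]  = m[j].len;
                    from_dist[i + m[j].len] = m[j].dist;
                }
            }
        }
        /* backtrack from position n via from_len/from_dist to emit the parse */
    }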

  3. #3 - Zyzzyva (Member)
    Quote Originally Posted by geza View Post
    Have you guys discussed previously whether it is possible to enhance LZX compression ratio (without significant decompression speed lost - I mean algorithm changes not just increasing window size)?
    If you mean without changing the format, wimlib includes an open source LZX compressor which is close to optimal: https://wimlib.net/compression.html#LZX. Basically, for each block it uses a hash table of binary trees to find matches, then it does iterative optimization to choose the match/literal sequence. There are a few improvements needed which I have been working on, but they only result in slight improvements to the compression ratio.
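    For the match-finding part: wimlib uses a hash table of binary trees; the stripped-down hash-chain version below (my sketch, not wimlib's code) shows the basic idea with less bookkeeping:

    #include <stdint.h>
    #include <string.h>

    #define HASH_BITS 15
    #define HASH_SIZE (1u << HASH_BITS)
    #define BUF_SIZE  (1u << 20)              /* buffer size for this sketch  */
    #define NO_POS    UINT32_MAX
    #define MAX_CHAIN 64                      /* bound the work per position  */

    static uint32_t head[HASH_SIZE];          /* newest position per hash     */
    static uint32_t chain[BUF_SIZE];          /* previous position, same hash */

    static void matchfinder_init(void) { memset(head, 0xff, sizeof head); }

    static uint32_t hash3(const uint8_t *p)   /* needs 3 readable bytes at p  */
    {
        uint32_t x = p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16;
        return (x * 2654435761u) >> (32 - HASH_BITS);
    }

    /* Returns the best match length at 'pos' (0 if none), its distance in
       *dist, and links 'pos' into its hash chain. */
    static uint32_t find_match(const uint8_t *data, uint32_t pos, uint32_t end,
                               uint32_t *dist)
    {
        uint32_t h = hash3(data + pos), best_len = 0, cand = head[h];

        for (unsigned steps = 0; cand != NO_POS && steps < MAX_CHAIN; steps++) {
            uint32_t len = 0;
            while (pos + len < end && data[cand + len] == data[pos + len])
                len++;
            if (len > best_len) { best_len = len; *dist = pos - cand; }
            cand = chain[cand];
        }
        chain[pos] = head[h];
        head[h] = pos;
        return best_len;
    }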

    If you mean with changing the format, there are certainly things that could be done to improve it. None of them are trivial, though, and it wouldn't be LZX anymore if you changed the format.

  4. #4 - cbloom (Member)
    Quote Originally Posted by geza View Post
    The other thing, that brotli doesn't use low offset bits entropy coding, like LZX. Why is that?
    I think it does. See NPOSTFIX
    Last edited by cbloom; 3rd August 2016 at 19:07.

  5. #5 - cbloom (Member)
    Quote Originally Posted by geza View Post
    Have you guys discussed previously whether it is possible to enhance LZX compression ratio (without significant decompression speed lost - I mean algorithm changes not just increasing window size)?
    The only really big, obvious improvement to LZX would be a larger window.

    Other than that, everyone doing modern LZ-Huff stuff is working very hard for very minimal gains over LZX (ZStd, LZHAM, LZMH, Brotli, Oodle, etc.).

    On generic data we haven't really found much. On certain data types there are some big things: delta-matches for some types of binary, pos-bits correlation on some types of binary, and special contexts, dictionaries, delta-reps, stuff like that for text.
    Last edited by cbloom; 3rd August 2016 at 19:07.

  6. #6 - geza (Member)
    Thank you for the answers!

    Quote Originally Posted by Jyrki Alakuijala View Post
    Yes. Yes.
    Is this a well-known fact? How much compression/speed gain have you achieved by using this encoding? It's weird, then, that no other LZ+Huffman compressor uses this technique.

    Quote Originally Posted by Jyrki Alakuijala View Post
    Several reasons, none of which were strong enough alone to help us make the decision: [...] Brotli was already "sufficiently complex" and adding more complexity didn't feel right.
    LZX uses a separate Huffman tree for these low bits (at the cost of slightly slower decode speed). But I can understand your decision; brotli is already pretty complex.

  7. #7 - geza (Member)
    Quote Originally Posted by Zyzzyva View Post
    If you mean without changing the format, wimlib includes an open source LZX compressor which is close to optimal: https://wimlib.net/compression.html#LZX.
    Thanks for this, I didn't know about wimlib.

    Quote Originally Posted by Zyzzyva View Post
    If you mean with changing the format, there are certainly things that could be done to improve it. None of them are trivial, though, and it wouldn't be LZX anymore if you changed the format.
    I mean with changing the format. My goal is to create a compressor which has LZX-like decompression speed but a better compression ratio. I've already created one, which has a slightly different encoding than LZX (low offset bits merged with the offset slot) and better parsing (the usual iterative optimal parsing), but these only give 1.5% better compression on my test data.

    I was thinking: "LZX is old, there must be some new simple technique" (like the difference between deflate and LZX), but apparently this is a harder problem.

  8. #8 - geza (Member)
    Quote Originally Posted by cbloom View Post
    I think it does. See NPOSTFIX
    Oops, it seems you're right. Don't know how I missed that...
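    For anyone else reading: as far as I understand RFC 7932, NPOSTFIX moves the low bits of the distance into the distance symbol itself, so those bits do go through the entropy coder. Roughly (my paraphrase of the spec, not the reference code):

    #include <stdint.h>

    extern uint32_t read_bits(unsigned n);        /* assumed bit reader */

    /* dcode is the decoded distance symbol; symbols 0..15 reuse recent
       distances and are not shown. lcode, the NPOSTFIX low bits, is part of
       the symbol, so it is entropy coded along with it. */
    uint32_t decode_distance(uint32_t dcode, uint32_t npostfix, uint32_t ndirect)
    {
        if (dcode < 16 + ndirect)
            return dcode - 16 + 1;                /* direct distance codes */

        uint32_t d         = dcode - ndirect - 16;
        uint32_t ndistbits = 1 + (d >> (npostfix + 1));
        uint32_t hcode     = d >> npostfix;
        uint32_t lcode     = d & ((1u << npostfix) - 1);
        uint32_t offset    = ((2 + (hcode & 1)) << ndistbits) - 4;

        return ((offset + read_bits(ndistbits)) << npostfix) + lcode + ndirect + 1;
    }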

  9. #9 - geza (Member)
    Quote Originally Posted by geza View Post
    My goal is to create a compressor which has LZX-like decompression speed but a better compression ratio. [...] I was thinking: "LZX is old, there must be some new simple technique", but apparently this is a harder problem.
    But thinking further, creating a compressor like this should be possible. There are a lot of things that current compressors don't take into account. With careful analysis, we should be able to "destructure" the file. Like:
    * what if the file is not structured around a power of 2 (so low offset bits don't help)? For example, it contains an array of a 25-byte struct.
    * what if the file contains interleaved streams? We should detect that and deinterleave them.

    So it boils down to data transformation/filtering, something like the sketch below. But doing this well doesn't seem to be easy/simple.
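    Something along these lines is what I have in mind (a toy sketch; real detection would need to be much more robust than a plain delta score):

    #include <stdint.h>
    #include <stdlib.h>

    /* Pick the period whose byte-wise delta is "flattest"; for an array of
       25-byte structs with similar records this would hopefully land on 25. */
    static size_t guess_stride(const uint8_t *data, size_t n, size_t max_stride)
    {
        size_t best = 1;
        double best_score = 1e300;
        for (size_t s = 1; s <= max_stride && s < n; s++) {
            double sum = 0;
            for (size_t i = s; i < n; i++)
                sum += abs((int)data[i] - (int)data[i - s]);
            double score = sum / (double)(n - s);   /* mean |delta| at period s */
            if (score < best_score) { best_score = score; best = s; }
        }
        return best;
    }

    /* Gather column j = {src[j], src[j+stride], ...} into contiguous runs so
       each field/stream can be modeled (or delta-filtered) on its own. */
    static void deinterleave(const uint8_t *src, uint8_t *dst, size_t n, size_t stride)
    {
        size_t out = 0;
        for (size_t j = 0; j < stride; j++)
            for (size_t i = j; i < n; i += stride)
                dst[out++] = src[i];
    }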

  10. #10 - Jyrki Alakuijala (Member)
    Quote Originally Posted by cbloom View Post
    I think it does. See NPOSTFIX
    That is only for distances. There are no LSB address bits for literal context generation.

  11. #11 - Jyrki Alakuijala (Member)
    Quote Originally Posted by geza View Post
    How much compression/speed gain have you achieved by using this encoding?
    It is a hard question, since it is nearly impossible to benchmark two such solutions that would otherwise be equal. If I had to guess, it is about 3% in decoding speed and 0.25% in compression density.

    I think it gives a tiny bit more instruction-level parallelism: the information about whether the next symbol is a literal or a backward reference is available earlier -- particularly so for a cluster of literals. One nice consequence of this encoding in brotli is the joint entropy code for literal insert length and copy length.
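    To make the joint code concrete, a toy version could look like the sketch below (not brotli's actual 704-symbol insert-and-copy alphabet, just the idea of one symbol carrying both lengths):

    #include <stdint.h>

    extern uint32_t read_symbol(void);        /* assumed Huffman decoder */
    extern uint32_t read_bits(unsigned n);    /* assumed bit reader      */

    #define N_BUCKETS 8   /* toy: 8 insert buckets x 8 copy buckets = 64 symbols */

    static const uint32_t base[N_BUCKETS]  = {0, 1, 2, 4, 8, 16, 32, 64};
    static const uint32_t nbits[N_BUCKETS] = {0, 0, 1, 2, 3,  4,  5, 24};

    /* One symbol tells the decoder both the literal-run length bucket and the
       copy length bucket, so it knows the shape of the whole command up front. */
    void read_command(uint32_t *insert_len, uint32_t *copy_len)
    {
        uint32_t sym = read_symbol();
        uint32_t ib  = sym / N_BUCKETS;       /* insert-length bucket */
        uint32_t cb  = sym % N_BUCKETS;       /* copy-length bucket   */
        *insert_len = base[ib] + read_bits(nbits[ib]);
        *copy_len   = base[cb] + read_bits(nbits[cb]);
    }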

