Results 1 to 7 of 7

Thread: init fix for masked literals in lzma

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    init fix for masked literals in lzma

    http://nishi.dreamhosters.com/u/lzma922.rar
    http://nishi.dreamhosters.com/u/lzma922x.rar
    http://pastebin.com/iwLG025x

    First archive is an updated version of http://nishi.dreamhosters.com/u/lzma.rar (which is based on 4.65)
    ie just sources necessary to build lzmautil without weird subdirs.
    And 2nd is modified version, which breaks compatibility, but compresses a little better.
    The 3rd link is the patch for 9.22 vs 9.22x.

    This patch fixes the only real design error in lzma which I was able to find -
    its quite unlikely that masked literals would be equal to their mask
    (because otherwise we could simply increment the length of previous match
    and not encode the current literal at all)
    so the probability of last bit of masked literal being zero should be
    low if mask's bit0 is 0, and high if bit0 is 1 (to make it likely that
    literal's bit0 is different from mask's bit0).
    The probabilities of last bits correspond to indexes 128..255 in litprob table
    because its actually a binary tree.
    Also we can't set them to 0 because, although unlikely, sometimes literals
    in lzma do match their mask (for example, in case of matches longer than 273 bytes).

    But still, this fix should slightly improve the compression in all cases without
    any speed impact (because only stats init is affected).
    So, I suppose, despite incompatibility, it should still have some uses, like in freearc.
    Code:
    book1       768771  wcc386       536624
    book1.lzma  260967  wcc386.lzma  274206
    book1.lzmax 260917  wcc386.lzmax 273912

  2. The Following User Says Thank You to Shelwien For This Useful Post:

    encode (6th May 2017)

  3. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    If i do remember, masked literal is applied only to the next literal immediately after a match ?

    Btw, i guess it is nonetheless possible for LZMA to have 2 immediately consecutives matches with no literal in between ?

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > If i do remember, masked literal is applied only to the next literal immediately after a match ?

    Yes, literal after a match is diff'ed with the symbol following the matched string.

    > Btw, i guess it is nonetheless possible for LZMA to have 2 immediately consecutives matches with no literal in between ?

    Sure, but in binary data masked literals are frequent enough to affect the results.

  5. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    While investigating non compatible alternatives :

    Could there be any relation between the diff"ed symbol and :
    1) the match length ?
    2) the offset distance ?
    3) the "repeat" distance ?

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    As to your question, any of these things could be useful as a context in an unrestricted CM model
    (ie not limited by speed/memory considerations).
    But in short, no, it doesn't make much sense.

    Here's an example of how masked literals improve compression:
    http://nishi.dreamhosters.com/u/lzma_book1__a0_a1.png
    In the picture, look at the lines
    "<P 58>\n"
    "<P 59>\n"
    As markup shows, they're encoded with 3 tokens - "<P 5" as a match, "9" in context of "8" as masked literal,
    then ">\n" as rep-match.
    So in a sense, both masked literals and rep-matches are parts of fuzzy match handling.

  7. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I would have thought that, in case of rep matches,
    the statistics for diff'ed bits could be different from generic matches.
    Last edited by Cyan; 17th May 2012 at 12:21.

  8. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Actually the statistics would be different in context of any specific matched string.
    But so what, it would be a PPM if we'd try to really use that.

    In any case, the mask byte is quite likely to have the same prefix bits as the masked literal,
    whether its followed by rep-match or not.

    And as to different statistics, I actually tried adding a skew to bit1 probabilities for masked literals as well.
    It didn't work, which means that matches up to bit1 are quite likely (its like that with '8' and '9' btw).

Similar Threads

  1. lzma recompressor
    By Shelwien in forum Data Compression
    Replies: 33
    Last Post: 25th February 2016, 23:40
  2. LZMA markup tool
    By Shelwien in forum Data Compression
    Replies: 13
    Last Post: 29th November 2015, 23:05
  3. lzma's binary tree matchfinder
    By Shelwien in forum Data Compression
    Replies: 7
    Last Post: 13th October 2011, 08:38
  4. LZMA algorithm question
    By incblitz in forum Data Compression
    Replies: 8
    Last Post: 25th July 2011, 18:51
  5. LZMA source
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 29th March 2010, 18:45

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •