Thread: LZMA algorithm question

  1. #1
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts

    LZMA algorithm question

Hi. Thanks in advance to anyone who responds.

Are the rep codes used in LZMA absolute offsets, or relative distances?

I mean, is it trying to take advantage of exactly repeated strings, or
duplicate strings with gaps in between?

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
Relative distances. See the example in http://encode.ru/threads/1288-LZMA-markup-tool
(e.g. how <P 58>\n is encoded there).
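A minimal sketch of that rep-distance bookkeeping (my reading of the scheme, not the actual LZMA source; the function name is hypothetical): the four slots hold recent distances, a rep match moves its slot to the front, a normal match pushes a new distance and drops the oldest.

```python
# Sketch of LZMA's four recent-distance slots (rep0..rep3).
# On a normal match the new distance is pushed to the front and the
# oldest slot falls off; on a rep-i match, slot i moves to the front.

def use_distance(reps, d):
    """Return the updated rep list after coding a match at distance d."""
    if d in reps:
        i = reps.index(d)                 # rep-i match: move slot i to front
        return [d] + reps[:i] + reps[i + 1:]
    return [d] + reps[:3]                 # normal match: push, drop oldest

reps = [1, 2, 3, 4]
reps = use_distance(reps, 7)   # new distance -> [7, 1, 2, 3]
reps = use_distance(reps, 2)   # rep match    -> [2, 7, 1, 3]
```

So a "rep0 match" is simply a match that reuses the front slot, i.e. the same relative distance as the previous match.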

  3. #3
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I see.

So, given:

    abcTUVWXYZabcRSVWXYZ

"abc" at position 10 is a match (distance 10),
and "VWXYZ" at position 15 matches position 5, also at distance 10
--> so that one is a rep0?

For the past couple of days I was thinking of combining
LZ77 with LZW-like behavior (different from LZFG).

Where:

1. You code matches as in LZ77.
2. Every time you get a match, you add the offset and length to an array of pointers.
3. When you find a match with a string pointed to by the array, you output the index
instead of an offset/length pair.
4. You can tell the decoder to remove strings from that array if they will not occur again
in the future. This can be done in two ways:
a. if a string is a first occurrence but will not occur again, tell the decoder outright;
b. if this is a string's last occurrence, tell the decoder it is the last, so it can remove
it from the array.


I might map it this way:

(in bits)
0 + byte
1 + 0 + offset/length

1 + 1 + 0 + index

where index uses the minimum number of bits required to code it,
so if the array contains 1 entry, it's just 1+1+0;
if the array contains 2, it's 1+1+0 + 0/1;
if the array contains 4, it's 1+1+0 + bb.
(Then update array[index] with the current offset.)

1 + 1 + 1 + 0 + offset/length
--> an LZ77 pair, without adding it to the array

1 + 1 + 1 + 1 + index
--> offset/length = array[index], removing it from the
array and moving array[last] to array[index]


(I was also thinking of using the stored offsets and coding
a displacement from them, but got lost in thought and circled
back to this ^^)

But since LZMA uses relative distances, I'm not sure this
would be an improvement in the same direction.
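The mapping above could be sketched like this (the token names and the cost helper are my own, hypothetical, just to make the bit accounting concrete):

```python
import math

def index_bits(array_size):
    """Minimum bits needed to code an index into an array of this size."""
    return 0 if array_size <= 1 else math.ceil(math.log2(array_size))

def token_cost(kind, array_size=0, offlen_bits=0):
    """Bit cost of each token type in the proposed mapping."""
    if kind == "literal":        # 0 + byte
        return 1 + 8
    if kind == "match":          # 1 0 + offset/length (added to array)
        return 2 + offlen_bits
    if kind == "rep":            # 1 1 0 + index
        return 3 + index_bits(array_size)
    if kind == "match_noadd":    # 1 1 1 0 + offset/length
        return 4 + offlen_bits
    if kind == "last_rep":       # 1 1 1 1 + index (entry removed after use)
        return 4 + index_bits(array_size)
    raise ValueError(kind)
```

With a single-entry array a repeat match costs just the 3 prefix bits (the "1+1+0" case above); with 4 entries it costs 3 + 2 bits.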
    Last edited by incblitz; 22nd July 2011 at 22:28.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Every time you get a match, you add the offset and length to an array of pointers

Now that's ROLZ. That way you can get better compression than LZ77, but decoding is much more complex.

But don't forget about parsing optimization. LZ schemes gain a lot from it (~10% better than lazy matching),
so you might gain 5% with a complex model, then be unable to write an optimizer for it (because it's so complex).

    > But since LZMA is using a relative distance, I'm not sure this might be an improvement in the same direction.

Sure, rep0 is mostly used for sparse matches. As for rep1-3, I think their use is more or less random - the only relevant
example I can think of is table/image compression, where a few multiples of the row size can be frequent distance values.

    But even lzma's parser is not really aware of rep codes. It just checks whether it can use one (and whether it improves compression),
    but doesn't explicitly try to find longer sparse matches or anything like that.

  5. #5
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    > Now that's ROLZ


Hmmm... I believe in ROLZ you first look at the preceding 2 bytes as context,
then output an index into that context's offset list.

Whereas here, the array works just like in LZW, where it represents
unique strings, but rather than storing the string itself, it stores just a pointer,

i.e. the offset and length, since you are encoding a la LZ77 with
a sliding window.

    So it looks like so:

array[0..max]

array_size, initialized to 0.

When you code an LZ77 offset/length pair, you increment array_size
to state that you have added a string to the array.

    Encoding will look like this:

abc123abc456abc789abc

- "abc" at position 6: coded as an offset/length pair (6, 3);
  stored in the array as index[0] = (absolute offset 0, length 3)
- "abc" at position 12: coded as index[0];
  then update index[0] with the latest offset of the repeated string, i.e. 12
- "abc" at position 18: coded as index[0] again


In LZW, you add every single unique string into the allotted code space, i.e.

0..255 for the initial single characters,
256..4095 for unique strings seen so far (in the 12-bit GIF
implementation, I believe).

Instead of doing that, I keep (in the array) only a list of unique
strings that will occur again.

If a string won't occur again within the boundary of the window
size, the encoder tells the decoder so.

If it's the last instance within the window, it tells the decoder
too.

So the array doesn't grow in size, and thus you can output fewer
bits, since both encoder and decoder can keep an updated size
of that array.

>> but decoding is much more complex.

I don't believe so (I'll try to code it later, I'm still sleepy),

since it behaves just like LZW, only with the list of strings kept smaller.

And the decoder will look like an LZMA decoder, except instead of rep0..3
I have a whole array of whatever maximum size I'd like, representing
duplicate strings instead of sparse matches.

>> But don't forget about parsing optimization

I won't. That's where my original idea about different types of
matches will come in. But I'd test just this first, using optimal
parsing.
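The removal step (moving array[last] into the freed slot, so encoder and decoder stay in sync on the array size) could be sketched as (names hypothetical):

```python
# Sketch of removing an entry from the pointer array: the final slot
# is swapped into the freed position, so the array stays dense and
# both encoder and decoder track the same array_size.

def remove_entry(array, i):
    """Remove entry i by moving the last entry into its slot."""
    array[i] = array[-1]
    array.pop()
    return array

# e.g. removing index 0 from three (offset, length) entries:
entries = [(0, 3), (6, 2), (9, 4)]
remove_entry(entries, 0)        # -> [(9, 4), (6, 2)]
```

Because indices are reassigned identically on both sides, the index of each surviving entry stays agreed upon without any extra signaling.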


  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
ROLZ = reduced-offset LZ, i.e. where only match offsets from some kind of subset are allowed.
It doesn't necessarily have to use the previous symbols as context.
Anyway, because you need fewer bits to encode the distance choice, it can compress better than LZ77,
but then you also need to maintain the offset tables in the decoder, so it's slower.

And I'm not saying that it's bad... for example, xwrt+lzma would afaik be similar to what you describe,
just slower and more complicated.

Also, sparse matches are actually fairly important, e.g. for structured texts (logs, tables, etc.) and most binaries.
So I think you won't beat lzma without some kind of sparse matches... though the rep codes may not be the best
way to implement them.

  7. #7
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Noted. ^^

By the way, I don't just reduce the distance, but the length as well.

I don't see the point of saving the offset but not the length.

Of course the string can be longer, but that suffix can be saved
in another entry.

(Also, reduction isn't done on the first instance of a string being
encoded.)

Thus it's a little different from ROLZ.

    Maybe it's ROLLZ.

    LOL
    Last edited by incblitz; 23rd July 2011 at 08:05.

  8. #8
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
I'm having some problems with the implementation.

I'm trying to use the optimal parsing described in

"Optimal Parsing in Dictionary-Symbolwise Data Compression Schemes"
by G. Della Penna, A. Langiu, F. Mignosi, A. Ulisse.

But it assumes that the cost function for a dictionary match
is greater than that of coding a literal,

whereas I have repeat matches where coding a match
from a saved index costs less than a literal.

    Haha, think mode again...
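A generic shortest-path parse over input positions doesn't need that assumption about relative costs; a sketch (the costs and the match-table shape are illustrative, not from the paper):

```python
# Sketch of optimal parsing as a shortest path over positions.
# cost[p] holds the cheapest bit count to reach position p; edges are
# a literal (9 bits here: 1 flag + 8 data) or any available match.
# Nothing requires a match to cost more than a literal, so cheap
# repeat-index matches are handled naturally.

def optimal_parse(n, matches, lit_cost=9):
    """matches: dict mapping position -> list of (length, cost_bits)."""
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    for p in range(n):
        if cost[p] == INF:
            continue
        # literal edge to the next position
        if cost[p] + lit_cost < cost[p + 1]:
            cost[p + 1] = cost[p] + lit_cost
        # match edges
        for length, c in matches.get(p, []):
            q = min(n, p + length)
            if cost[p] + c < cost[q]:
                cost[q] = cost[p] + c
    return cost[n]   # cheapest total cost for the whole input

# e.g. a 3-byte match costing 5 bits beats three 9-bit literals:
optimal_parse(3, {0: [(3, 5)]})   # -> 5
```

A backward pass over the recorded best edges would then recover the actual token sequence; only the forward cost computation is shown here.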

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
lzma's encoder does optimal parsing with precise codelength estimation and rep codes, so I guess you can look at the source.
There's nothing especially smart there though: its structures include the values of the rep distances and the context ("state")
per offset in the parsing buffer.
Also it "cheats" a little by estimating token combinations (like match-literal-repmatch) - I suppose the normal context/rep tracking
isn't very precise, and such multi-token estimation improves it
(price estimation is based on the current parsing, but the real final parsing is unknown until the backward pass).

