Page 2 of 2
Results 31 to 38 of 38

Thread: lzpm 0.01 is here!

  1. #31 - Uwe Herklotz (Member)
    Quote Originally Posted by encode
    So, with 16 MB dictionary a single next[] needs 4X buffer size!
    With a 16 MB dictionary, 3x the dictionary size is sufficient, since 3 bytes are enough to store values < 16 MB. For practical reasons 4x is OK.
    Not sure what you mean by a "single" next[]. There is just one.

  2. #32 - encode (The Founder)
    Quote Originally Posted by Uwe Herklotz
    Not sure what you mean by a "single" next[]. There is just one.
    I meant that we also need head[].
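The head[]/next[] layout being discussed can be sketched as follows (a hedged illustration, not lzpm's actual code; the hash width and chain-walk limit are arbitrary choices of mine):

```python
# Sketch of a hash-chain match finder: head[h] holds the newest position
# whose 3-byte hash is h; next[p] links each position to the previous one
# with the same hash.  next[] needs one entry per buffer position (hence
# "4x buffer size" with 32-bit entries, or 3x if positions < 16 MB are
# packed into 3 bytes), while head[] can be sized freely.

HASH_BITS = 16
HASH_SIZE = 1 << HASH_BITS

def hash3(buf, p):
    """Hash of the 3 bytes starting at position p."""
    return ((buf[p] << 10) ^ (buf[p + 1] << 5) ^ buf[p + 2]) & (HASH_SIZE - 1)

def build_chains(buf):
    head = [-1] * HASH_SIZE          # arbitrary size, independent of buffer
    nxt = [-1] * len(buf)            # exactly one slot per buffer position
    for p in range(len(buf) - 2):
        h = hash3(buf, p)
        nxt[p] = head[h]             # link to previous position, same hash
        head[h] = p
    return head, nxt

def longest_match(buf, p, head, nxt, max_tries=64):
    """Walk the chain behind position p; return (best_len, best_pos)."""
    best_len, best_pos = 0, -1
    cand = nxt[p]                    # p itself sits at the head of its chain
    while cand != -1 and max_tries > 0:
        l = 0
        while p + l < len(buf) and buf[cand + l] == buf[p + l]:
            l += 1
        if l > best_len:
            best_len, best_pos = l, cand
        cand = nxt[cand]
        max_tries -= 1
    return best_len, best_pos
```

For example, on b"abcabcabc" the chain walk for position 3 finds the match at position 0 with length 6.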

  3. #33 - Bulat Ziganshin (Programmer)
    encode
    ... yes, you got the idea - head[] may be arbitrary-sized, but the next[] size is fixed. BT uses the same idea - it uses head[], left[] and right[], where the last two arrays are dict-sized, so the minimum memreqs are 8x

    for each buffer position p, left[p] points to a previous buffer position with the same hash value and a lexicographically lesser string, and right[p] to one with a larger string

    so searching for the best match is straightforward:

    hash = hash(p) // where p is the current pointer
    match = head[hash]
    while match do
    --if( memcmp(match, p) < 0 ) // match's string is lexicographically smaller
    ----then match = right[match] // closer strings are in the right subtree
    ----else match = left[match]

    but the updating is a really smart trick. i will try to recall it

    also take notice that on each step of the loop you reduce the range of strings that match may point to. for example, if the current string is "abcd" and you compared it to "abce", then all subsequent matches you check will also be less than "abce". otoh, if on another step you've found that this "abcd" is greater than "abb", then you may be sure that all subsequent matches will be larger than "abb" - it's the usual property of a binary search tree. so, after checking these two matches, all matches you will check will be between "abb" and "abce"

    and this means that you may reduce the compare cost - if the best smaller match was 2 bytes long and the best larger match was 3 bytes long, you may be sure that all matches you check will share at least min(2,3)=2 first chars with your current string, and you may skip checking these bytes. in this example, you may be sure that all subsequent matches start with "ab"
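The search with this prefix-skipping trick can be sketched like so (a sketch under assumptions of my own: a fixed comparison window W, positions inserted only where a full window exists; not Bulat's actual sources):

```python
# Binary-tree match finder sketch: left[] / right[] are the two dict-sized
# arrays; len_l / len_r record the verified common prefix with the best
# smaller / larger string seen so far, so min(len_l, len_r) bytes are known
# equal and get skipped in every comparison.

W = 8  # fixed comparison window (an assumption to keep the sketch short)

def bt_insert(buf, p, root, left, right):
    """Plain BST insertion of position p, ordered by the W bytes at p."""
    if root < 0:
        return p
    node = root
    while True:
        if buf[p:p + W] < buf[node:node + W]:
            if left[node] < 0:
                left[node] = p
                return root
            node = left[node]
        else:
            if right[node] < 0:
                right[node] = p
                return root
            node = right[node]

def bt_search(buf, p, root, left, right):
    """Longest match for buf[p:p+W]; returns (best_len, best_pos)."""
    best_len, best_pos = 0, -1
    len_l = len_r = 0
    node = root
    while node >= 0:
        l = min(len_l, len_r)            # these bytes are guaranteed equal
        while l < W and buf[node + l] == buf[p + l]:
            l += 1
        if l > best_len:
            best_len, best_pos = l, node
        if l == W:
            break                        # full-window match, cannot improve
        if buf[node + l] < buf[p + l]:   # node's string is the smaller one
            len_l = l
            node = right[node]           # closer strings lie to the right
        else:
            len_r = l
            node = left[node]
    return best_len, best_pos
```

The min(len_l, len_r) line is exactly the "all subsequent matches start with "ab"" observation above, turned into code.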

  4. #34 - Bulat Ziganshin (Programmer)
    Quote Originally Posted by encode
    The scheme I make use
    i use the same scheme in tor and think that it is much better than hash chains. as i said, hash chains were optimal in the 32-64 kb dictsize ages

    my tests show that having 128k hash entries is optimal, at least for 4-byte hashing and my test files. one can easily figure out that for NUMNODES=64 (i call it "hash row length") we need only 32-64 mb of memory regardless of dictsize

    so, using a linear hash is advantageous with large dicts, and your lzpm proves it by outperforming lzma:fast with a 16 mb dict, which needs about 160 mb for operation

    but afair, my searching implementation is much faster. i specially published the sources because you may be interested in using it together with your compression engine
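A hedged sketch of this fixed-row hashing (the two parameter values are the ones quoted in the post; RowHash and hash4 are my own illustrative names, and Python lists stand in for the flat 32-64 mb arrays a real implementation would use):

```python
# Each of HASH_ENTRIES buckets keeps the ROW most recent positions with
# that hash, so total memory is HASH_ENTRIES * ROW * 4 bytes in a real
# implementation -- fixed regardless of dictionary size, unlike next[]
# in a hash-chain scheme.

HASH_ENTRIES = 1 << 17   # 128k hash entries, the figure quoted above
ROW = 64                 # "hash row length" (NUMNODES)

def hash4(buf, p):
    """Multiplicative hash of the 4 bytes at position p."""
    v = buf[p] | (buf[p + 1] << 8) | (buf[p + 2] << 16) | (buf[p + 3] << 24)
    return ((v * 2654435761) & 0xFFFFFFFF) >> (32 - 17)

class RowHash:
    def __init__(self):
        self.rows = [[] for _ in range(HASH_ENTRIES)]

    def insert(self, buf, p):
        row = self.rows[hash4(buf, p)]
        row.insert(0, p)             # newest candidate goes first
        if len(row) > ROW:
            row.pop()                # the oldest one falls off the row

    def find(self, buf, p):
        """Check every candidate in p's row; return (best_len, best_pos)."""
        best_len, best_pos = 0, -1
        for cand in self.rows[hash4(buf, p)]:
            l = 0
            while p + l < len(buf) and buf[cand + l] == buf[p + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, cand
        return best_len, best_pos
```

Bounding the row at ROW entries is what caps both the memory and the search time per position.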

  5. #35 - encode (The Founder)
    Quote Originally Posted by Bulat Ziganshin
    but afair, my searching implementation is much faster. i specially published the sources because you may be interested in using it together with your compression engine
    I already had a brief look at it.

    By the way, reading papers about LZCB I've found a nice idea for my LZPM!

    It's some kind of context coding of match positions.

    And this surprisingly gives some improvement not only on text files!

    Check out some digits on SFC for LZPM 0.02:

    Layout:
    <number of context bits> - <compressed size>
    0 - 13,355,017 bytes
    1 - 13,352,362 bytes
    2 - 13,330,503 bytes
    3 - 13,321,907 bytes
    4 - 13,311,877 bytes
    5 - 13,302,937 bytes
    6 - 13,292,131 bytes
    7 - 13,287,209 bytes
    8 - 13,287,724 bytes

    All in all, the decompression speed stays the same. For binary files the best number of context bits is 5. Currently, I'm deciding whether or not to add such a thing.

    Also, in parallel I've been digging into the LZMA sources, and now I understand the meaning of almost all the switches.

  6. #36 - encode (The Founder)
    However, I think I should keep the arith coder as simple as possible, avoiding redundant context tricks.
    For example, with QUAD I tried many things that improve compression, but they weren't included in the end. For example, we can code the match index with exclusions. Say we have only 11 available match indexes out of 64, so we exclude all indexes above 11 when coding.

  7. #37 - Bulat Ziganshin (Programmer)
    well, i don't know what context bits are, and if you have a good idea please share its description with us (at least with me; i think that two lz compressors are enough for a forum with ~10 active readers )

    btw, lzma uses the lc, pb and lp parameters - isn't it one of them? anyway, i don't know their exact meaning

    Quote Originally Posted by encode
    I already had a brief look at it.
    i specially moved each piece into a separate class, making them easier to steal btw, Igor does the same. actually, it seems that you can't reuse my sources as-is because they are still missing separate hashes for 2-3 byte strings. otoh, it seems that you also don't have multiple hashes in your program.

    it's another idea, pioneered in the late 90s - instead of hashing by 3 bytes and searching all the strings via one hash, we make small separate hashes specially for searching 2-byte, 3-byte and 4-byte strings, and one large hash which holds only 5-byte matches. details may differ; afair ER said that the 5-3-2 scheme turned out best for him, while IP uses a 4-3-2 scheme
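The multiple-hash idea might look roughly like this (a sketch of the 5-3-2 flavour; the hash functions and dict-based tables are illustrative assumptions of mine, each table only remembers the most recent position, and the caller must still verify candidates byte-by-byte):

```python
def h2(buf, p):                     # exact 16-bit key: 64k slots suffice
    return buf[p] | (buf[p + 1] << 8)

def h3(buf, p):                     # small dedicated hash for 3-byte strings
    return (h2(buf, p) ^ (buf[p + 2] << 13)) & 0x3FFFF

def h5(buf, p):                     # one large hash, 5-byte matches only
    v = 0
    for i in range(5):
        v = (v * 131 + buf[p + i]) & 0xFFFFFFFF
    return v % (1 << 20)

class MultiHash:
    def __init__(self):
        # dicts stand in for the flat position arrays a real coder would use
        self.t2, self.t3, self.t5 = {}, {}, {}

    def insert(self, buf, p):
        self.t2[h2(buf, p)] = p
        if p + 3 <= len(buf):
            self.t3[h3(buf, p)] = p
        if p + 5 <= len(buf):
            self.t5[h5(buf, p)] = p

    def candidates(self, buf, p):
        """Most recent candidate for each minimum match length (2, 3, 5)."""
        out = {2: self.t2.get(h2(buf, p), -1)}
        if p + 3 <= len(buf):
            out[3] = self.t3.get(h3(buf, p), -1)
        if p + 5 <= len(buf):
            out[5] = self.t5.get(h5(buf, p), -1)
        return out
```

The point of the split is that short matches get exact (or nearly exact) small tables, while the big table never wastes its slots on strings too short to be worth a long-range search.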

  8. #38 - encode (The Founder)
    For myself, I call my LZPM a reverse-engineered ROLZ2.

    Firstly, about LZMA parameters:

    -lc:
    set number of literal context bits - [0, 8], default: 3

    These higher bits of the previous byte will be used as the context for a literal.

    8 bits = pure order-1 coder

    with 4 bits we will use the higher nibble of the previous byte as the context

    ROLZ2 also has such a parameter, which is hidden from the masses! Of course, LZPM has this parameter too.
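In code, the -lc rule just described amounts to a one-line sketch:

```python
def literal_context_lc(prev_byte, lc):
    """Top lc bits of the previous byte select the literal model."""
    return prev_byte >> (8 - lc)

literal_context_lc(0xAB, 4)   # high nibble: 0xA
literal_context_lc(0xAB, 8)   # whole previous byte: pure order-1
```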

    -lp:
    set number of literal pos bits - [0, 4], default: 0

    This one extends the context for literals. It may improve compression of MM data and tables.

    Here we include the lower bits of the current position in the literal context.

    Example:

    i - current position
    x - current context for literal

    context = (x << 2) | (i & 3);

    for -lp = 2.

    Theoretically this will work. Anyway, by default LZMA does not use such a thing. LZPM doesn't use it either, and AFAIK neither does ROLZ2.
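Combining -lc and -lp, the full literal context can be sketched as below; the exact packing order of the two parts is an assumption of this sketch (it only matters that the encoder and decoder agree):

```python
def literal_context(prev_byte, pos, lc=3, lp=0):
    """Literal model index built from -lc and -lp together."""
    pos_part = pos & ((1 << lp) - 1)   # low lp bits of the current position
    lit_part = prev_byte >> (8 - lc)   # high lc bits of the previous byte
    return (pos_part << lc) | lit_part
```

With lp = 0 (the default) the position part vanishes and only the -lc context remains.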

    -pb:
    set number of pos bits - [0, 4], default: 2

    With this one, LZMA will use the same kind of position context for some structures, like the Flags (IsMatch) and RepGX coders.

    Again, AFAIK, ROLZ2 and LZPM do not use that.

    Quote Originally Posted by Bulat Ziganshin
    instead of hashing by 3 bytes and searching all the strings via one hash, we make small separate hashes specially for searching 2-byte, 3-byte, 4-byte strings and one large hash which holds only 5-byte matches. details may differ, afair ER said that 5-3-2 scheme turns to the best for him while IP use 4-3-2 scheme
    I tried this, and it works with QUAD-like schemes, but with LZPM things are more complicated and we are unable to use such a thing.

    Quote Originally Posted by Bulat Ziganshin
    well, i dont know what is context bits and if you have a good idea please share its description with us
    Well, this scheme works best with ROLZ-like coders; however, it works with modified LZ77 coders as well.

    We encode the match position (or index) via an order-0 arithmetic encoder. Additionally, we can include a few bits of the previous byte as a context (the higher bits, or the whole byte).

    Thus we encode match positions via an order-1 arithmetic encoder. Read the first paper about LZP/LZCB for more info and ideas.
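A sketch of that order-1 index model (the class and parameter names are my own; LZPM's real coder adapts probabilities inside an arithmetic coder, while this sketch only reports ideal code lengths):

```python
import math

class IndexModel:
    """Adaptive model for match indexes, keyed by bits of the previous byte."""

    def __init__(self, num_indexes=64, bits=5):
        self.bits = bits
        # one frequency table per context value
        self.freq = [[1] * num_indexes for _ in range(1 << bits)]

    def context(self, prev_byte):
        return prev_byte >> (8 - self.bits)   # top `bits` bits of prev byte

    def cost_bits(self, prev_byte, index):
        """Ideal code length for `index` under the current statistics."""
        t = self.freq[self.context(prev_byte)]
        return -math.log2(t[index] / sum(t))

    def update(self, prev_byte, index, inc=32):
        self.freq[self.context(prev_byte)][index] += inc
```

With bits=5 this mirrors the best setting reported for binary files above: coding the same index repeatedly in the same context quickly drops below the flat cost of an unconditioned model.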


