
Thread: New LZW variant

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    While playing with UNWHAP-like stuff, I developed a new LZW variant.
    First of all, consider the following figures:
    doom3.exe (5,427,200 bytes)
    TC 3.4 (Generic LZW): 3,130,168 bytes
    COMPRESS 4.2.4 (LZW with flat codes): 2,986,623 bytes
    LZW1 (LZW with "sliding dictionary"): 2,944,286 bytes
    LZW2 (LZW1 + Improved dictionary loading): 2,879,424 bytes
    LZW3 (LZW2 + Kind of lazy matching): 2,873,584 bytes

    LZW1 through LZW3 are my new LZW variants. All three use fixed 16-bit codes.

    As I already posted, while playing with LZRW5/UNWHAP-like stuff I found that there is no reason to reset or rebuild the dictionary when it becomes full. With baseline LZW we must do that, because during compression and decompression we use prefix trees - i.e. each new entry depends on a previous one. With UNWHAP each element is independent - since we keep a position and a length - and with the buffer we are able to restore any phrase at any time. Thus we may replace the oldest entry with the new one (circular table update). Furthermore, the decoder stays untouched!! As a result, we can hold up to 2**N active phrases constantly.
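    The circular table update can be sketched as the following minimal fragment. This is a hypothetical illustration, not code from the thread - the TNode field layout is an assumption (only its 8-byte size is mentioned later in this thread):
    Code:

    #include <stdint.h>

    #define TABSIZE 65536  /* 2^16 entries -> fixed 16-bit codes */

    /* Each phrase is independent: just a position and a length
       into the already-processed buffer. */
    typedef struct { uint32_t pos; uint32_t len; } TNode;

    static TNode table[TABSIZE];
    static uint32_t next_slot = 0;

    /* Circular update: once the table is full, each new phrase
       simply overwrites the oldest one - no reset, no rebuild. */
    static void Add(uint32_t pos, uint32_t len) {
        table[next_slot].pos = pos;
        table[next_slot].len = len;
        next_slot = (next_slot + 1) & (TABSIZE - 1);  /* wrap around */
    }

    Since a phrase is restored from the buffer via its (pos, len) pair, overwriting an old slot never invalidates the remaining entries - which is why the decoder needs no changes.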

    The second improvement is improved dictionary loading. Since neither encoder nor decoder has prefix trees, we cannot do what other LZW modifications can. Anyway, instead of adding just:
    Code:
     
    Add(Pos, Len + 1); 
    // we may additionally add 
    Add(Pos + 1, Len); 
    // Pos - current position 
    // Len - length of the encoded match 
    // ... 
    // if we encode a single char (Len == 1) 
    // we just add 
    Add(Pos, 2); 
    // as in the previous variant

    In other words, Len + 1 means that we add the current coded string plus the next char. In effect we advance by one char - i.e. we have one extra char. So, if we code AB, we may add:
    ABC
    BC

    Cool, isn't it?

    Regarding parsing. I've found that LZW codes are not as redundant as LZ77 ones. However, the algorithm adapts to new data much more slowly. In addition, we deal with fixed strings - i.e. only the lengths which are in the dictionary are available. Furthermore, if we output a string which is shorter than the one we've found, we duplicate entries - since the dictionary update depends on what we output: if we output a shorter string, we will duplicate the longer one.
    Anyway, I tried many variants, including Flexible Parsing. Only one variant works so far:
    Code:
     
    // getmatch(p) - length of the longest dictionary match starting at p
    cur = len + getmatch(i + len);  // greedy match plus what follows it
    if (getmatch(i + 1) > cur)      // a better parse starts one char later
      code = buf[i];                // drop a char instead

    However, such a trick gives a really tiny compression gain, and at the cost of a significant compression speed loss. (Currently, I'm experimenting with my brute-force encoder, which is slower than PAQ8.)

    Additional results:
    world95.txt (2,988,578 bytes)
    COMPRESS: 1,133,483 bytes
    LZW1: 1,059,886 bytes
    LZW2: 1,036,410 bytes
    LZW3: 1,025,744 bytes

    rafale.bmp (4,149,414 bytes)
    COMPRESS: 1,444,893 bytes
    LZW1: 1,367,220 bytes
    LZW2: 1,336,280 bytes
    LZW3: 1,335,882 bytes

    calgary.tar (3,152,896 bytes)
    COMPRESS: 1,325,806 bytes
    LZW1: 1,295,680 bytes
    LZW2: 1,264,140 bytes
    LZW3: 1,261,778 bytes

    Anyway, concluding: the catch is in the tiny and super-fast decoder only.


  2. #2
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I had a similar idea some time ago, but didn't investigate further. Your results are promising, especially compared to compress, which did a fairly good job.
    To improve adaptation speed, add all possible prefixes to the dictionary, e.g. for the match abcd add ab, abc, abcd, bc, cd - but this will require decoder modification. What is your maximum match length limit? You could select an LZW dictionary from a small context, maybe order-1.
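    One possible reading of this suggestion, sketched as a hypothetical helper (Add() here is just a stand-in that records the inserted phrases; note that this reading adds every substring of length >= 2, including bcd, which the example list above skips):
    Code:

    #include <stdint.h>

    #define MAXADDS 1024
    static uint32_t added_pos[MAXADDS], added_len[MAXADDS];
    static uint32_t added = 0;

    /* Stand-in for the dictionary insert; here it only records the call. */
    static void Add(uint32_t pos, uint32_t len) {
        added_pos[added] = pos;
        added_len[added] = len;
        added++;
    }

    /* After coding a match at `pos` of length `len`, add every
       substring of length >= 2 to the dictionary. */
    static void AddAllSubstrings(uint32_t pos, uint32_t len) {
        for (uint32_t start = 0; start + 2 <= len; start++)  /* each suffix */
            for (uint32_t l = 2; l <= len - start; l++)      /* each prefix of it */
                Add(pos + start, l);
    }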
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  3. #3
    The Founder encode's Avatar
    Quote Originally Posted by toffer
    What is your maximum match length limit?
    No limit. I think that with such a circular dictionary update the match length can grow indefinitely.

    Quote Originally Posted by toffer
    You could select a lzw dictionary from a small context, maybe order1.
    I know that trick - it's called LZW-PM. But it would kill the decoder's simplicity, and in addition we would need a large amount of memory.
    Currently the decoder needs:
    sizeof(TNode) * TABSIZE = 8 * 65536 = 524,288 bytes (+ output buffer)

    With an order-1 context that becomes:
    256 * 524,288 = 134,217,728 bytes = 128 MB!

    With smaller dictionaries we would lose the beauty of pure 16-bit codes...

  4. #4
    Programmer toffer's Avatar
    Why not use something like 12-bit dictionaries (two codes form 3 bytes)? You don't need ones that large. I don't think it will make things more complex - just an additional array access.
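    The "two codes form 3 bytes" packing can be sketched like this (a hypothetical helper, not code from the thread):
    Code:

    #include <stdint.h>

    /* Pack two 12-bit codes into 3 bytes. */
    static void pack12(uint32_t a, uint32_t b, uint8_t out[3]) {
        out[0] = a >> 4;                       /* high 8 bits of a          */
        out[1] = ((a & 0xF) << 4) | (b >> 8);  /* low 4 of a, high 4 of b   */
        out[2] = b & 0xFF;                     /* low 8 bits of b           */
    }

    /* Recover the two 12-bit codes from 3 bytes. */
    static void unpack12(const uint8_t in[3], uint32_t *a, uint32_t *b) {
        *a = ((uint32_t)in[0] << 4) | (in[1] >> 4);
        *b = ((uint32_t)(in[1] & 0xF) << 8) | in[2];
    }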

  5. #5
    The Founder encode's Avatar
    This evening I will try some. Additionally, I will try an idea with 8-bit dictionaries:
    0..254 - phrases
    255 - special case for a literal

    By the way, in the past I experimented with LZW + arithmetic encoding, i.e. instead of using flat codes, use:
    rc.Encode(code, 1, dictsize);

    Anyway, with my current variant this won't help much - since we do not reset the dictionary.


  6. #6
    The Founder encode's Avatar
    A quick test with LZW-PM and different code sizes (9 through 14 bits/code). This modification makes use of an order-1 context. Note that the accelerated dictionary loading only hurts with LZW-PM.

    calgary.tar
    09: 1,329,506 bytes
    10: 1,242,114 bytes
    11: 1,220,145 bytes [!]
    12: 1,229,075 bytes
    13: 1,258,298 bytes
    14: 1,306,836 bytes

    world95.txt
    09: 1,224,857 bytes
    10: 1,044,599 bytes
    11: 962,991 bytes
    12: 926,131 bytes
    13: 920,737 bytes [!]
    14: 937,397 bytes

    rafale.bmp
    09: 1,360,424 bytes
    10: 1,219,651 bytes
    11: 1,177,076 bytes
    12: 1,177,033 bytes [!]
    13: 1,207,772 bytes
    14: 1,266,021 bytes

    acrord32.exe
    09: 2,119,816 bytes [!]
    10: 2,142,966 bytes
    11: 2,206,979 bytes
    12: 2,295,547 bytes
    13: 2,409,435 bytes
    14: 2,542,992 bytes

    fp.log
    09: 2,253,066 bytes
    10: 1,856,707 bytes
    11: 1,715,243 bytes
    12: 1,668,445 bytes [!]
    13: 1,678,905 bytes
    14: 1,725,310 bytes

    a10.jpg
    09: 932,295 bytes [!]
    10: 1,019,174 bytes
    11: 1,101,334 bytes
    12: 1,189,126 bytes
    13: 1,287,909 bytes
    14: 1,386,979 bytes


  7. #7
    Programmer toffer's Avatar
    Have you tried 8-bit dictionaries (with an escape to code normal symbols)?

  8. #8
    The Founder encode's Avatar
    Quote Originally Posted by toffer
    Have you tried 8 bit dictionaries?
    Yep. It's a bad idea - it works only with an LZW-PM-styled encoder. Anyway, fixed-bit codes are better!

  9. #9
    Programmer toffer's Avatar
    And what about 8-bit dictionaries, using flags to distinguish between a code and a literal?

  10. #10
    The Founder encode's Avatar
    9-bit dictionaries are equivalent to flag + 8-bit:

    0..255 - char
    256..511 - code

    is equal to

    0, 0..255 - char
    1, 0..255 - code
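    The equivalence is easy to see if the flag is treated as the token's top bit - a tiny illustrative sketch (names are hypothetical):
    Code:

    #include <stdint.h>

    /* A 9-bit token: 0..255 = char, 256..511 = dictionary code. */
    static uint32_t pack9(uint32_t is_code, uint32_t payload) {
        return (is_code << 8) | payload;   /* flag becomes the top bit */
    }

    static uint32_t flag_of(uint32_t token)    { return token >> 8; }
    static uint32_t payload_of(uint32_t token) { return token & 0xFF; }

    Either way, each token costs exactly 9 bits and carries the same information - the only difference is bookkeeping.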


  11. #11
    Programmer toffer's Avatar
    Yes, but it saves memory.

  12. #12
    The Founder encode's Avatar
    Added a new improvement:
    Now we add:
    pos, len+1
    pos+1, len
    pos-1, len+2

    In other words, if we code BC, we will add:
    BCD
    CD
    ABCD
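    The triple update can be sketched as follows. Add() is a recording stand-in for the actual circular-table insert, and the guard for pos == 0 is my assumption - the post doesn't say how the start of the buffer is handled:
    Code:

    #include <stdint.h>

    static uint32_t rec_pos[8], rec_len[8], rec = 0;

    /* Stand-in for the circular-table insert; records the call. */
    static void Add(uint32_t pos, uint32_t len) {
        rec_pos[rec] = pos;
        rec_len[rec] = len;
        rec++;
    }

    /* Triple update after coding a match of length `len` at `pos`:
       if the match is "BC", this adds "BCD", "CD" and "ABCD"
       as (position, length) pairs. */
    static void UpdateDictionary(uint32_t pos, uint32_t len) {
        Add(pos, len + 1);          /* BCD:  same start, one char longer   */
        Add(pos + 1, len);          /* CD:   start one later, same length  */
        if (pos > 0)
            Add(pos - 1, len + 2);  /* ABCD: start one earlier, two longer */
    }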


    Testing results:
    world95.txt
    COMPRESS: 1,133,483 bytes
    LZW: 1,012,516 bytes

    rafale.bmp
    COMPRESS: 1,444,893 bytes
    LZW: 1,316,020 bytes

    fp.log
    COMPRESS: 2,699,855 bytes
    LZW: 1,877,820 bytes

    acrord32.exe
    COMPRESS: 2,407,918 bytes
    LZW: 2,309,342 bytes

    mso97.dll
    COMPRESS: 2,922,165 bytes
    LZW: 2,809,882 bytes


  13. #13
    The Founder encode's Avatar
    In other words, we make a logical extension to generic LZW:

    instead of only concatenating an existing phrase with a new symbol, we additionally concatenate the new symbol with the (i-1)-th and (i+1)-th phrases.

    i.e.:

    E - a new, probably unseen symbol
    CD - existing phrase

    Thus, if we code CD we will add:
    CDE
    DE
    BCDE

    Concatenation of a new symbol with available phrases...

  14. #14
    The Founder encode's Avatar
    Quote Originally Posted by toffer
    What is your maximum match length limit?
    By the way, if we compress a file of 100,000,000 repeated bytes, the longest match will be 19,998 bytes. That means at some point we will pack 19,998 bytes into 2 bytes, and the whole file will be packed into 20,002 bytes. Another LZW property.

  15. #15
    The Founder encode's Avatar
    ENWIKs:

    ENWIK8: 42,528,858 bytes
    ENWIK9: 380,341,672 bytes


