Results 1 to 7 of 7

Thread: Reduced Length LZ (RLLZ): One way to output LZ77 codes

  1. #1
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts

    Lightbulb Reduced Length LZ (RLLZ): One way to output LZ77 codes

    Hi all!

    This is my first post, though i have been visiting this site recently every now and then, watching professional and expert programmers exchange lively ideas and suggestions to come up with the most powerful compressors.

    Let me share here my one way to output LZ77 codes, which may improve LZ77/LZSS/ or LZW.

    LZ77:

    LZ77 coding transmits <offset, length> codes. Many times in the compression process, the same matching string is coded with the same <length> code. We can avoid transmitting the <length> code for many strings as well as not outputting a bit to identify a literal (as in LZSS) by the following:

    Instead, read the whole input or block (or history buffer) and gather matching strings and encode the <<the whole string>> e.g., <<offset>> plus <<length of the string (n>=2)>> only this one time, <<the number of the same matching strings (ie., number of following offset codes)>>, and its succeeding occurrences in the input block by transmitting only the said <<offset>> codes. This is actually a "Reduced Length LZ (RLLZ)".

    The literals are outputted last *without bit flags* since they fall into the block buffers not covered or "not activated" by encoded strings.

    During decoding, the strings are written first in the output buffer using the offsets, and the literals "fill in" the unwritten positions in the output buffer.

    Thanks,

    -- Gerald R. Tamayo

  2. #2
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    No need to output bit flags for literals or the number of following consecutive literals in some algorithms.

    Then the completely filled output buffer is written to file.

    That is, e.g. after gathering all distinct repeated strings on a block,

    1) no transmission of <length> code for the next occurrences of the same string (but, in the simplest way, you have to output the number of succeeding strings);

    2) transmitting the literals last (in the output buffer) means no need to transmit *bit flags* for literals (and matches). This is the novel idea here: deferred literals output;

    3) if you know LZT (2008 ) algorithm where it was demonstrated complete exclusion of the length code, this might as well be "LZT2". Should improve LZ77/LZSS/LZW based compressors. Decoding is also straight-forward.

    Sorry that only now after a decade of LZT (2008 ) i am releasing this. I stopped coding compression in 2010 or 2011.

    My question is: Is this new, or already in some ROLZ implementations?
    Last edited by compgt; 10th October 2018 at 14:14.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    All popular LZ implementations these days seem to use "rep-codes" for match distance.
    I think I'd seen also attempts to use rep-codes for match length too, but it didn't show much improvement.
    Thus, reusing a whole distance;length pair probably won't be very helpful - well, maybe on logs or some such.

    Btw, the further generalization would be secondary LZ compression of LZ tokens.
    It was tested before at least in form of bsdiff+lzma, and does show some potential.

  4. #4
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    Yes, I have similar ideas as "rep-codes" in 2000s. My compression ideas are from the late 90s when i became interested again in data compression.

    The ones i call here "holes" they call "gaps" even in LZW. But the idea of deferring transmission of literals for an output buffer avoids bit flags for both literals and matches which is still used in some explanations of ROLZ. Other algorithms still need to output the number of following literals. Deferred, meaning this is not an "online" algorithm.

    Yes, i understand LZ recompression for LZ tokens needs well-crafted tokens; this might be a huge algorithm.

    Thanks, Shelwien!

  5. #5
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    I experimented with this back idea in March 2018 but the problem I encountered was enumerating all subsets of the string since the starting point would end up fixed after a match was found, but optimal parsing requires multiple rotations of the string to be able to find the shortest path.

    I never could get my implementation to beat LZ77, it was pretty much on par. I found that the odds of re-using a whole word is as probable as lz77 inserting a new word into the dictionary. By new word I mean a word which is either 1 byte shorter or longer than the next longest matching entry.

    I used a 3 symbol flag: literal (no match), lz77 match (insert word), and reuse word via offset code. I guess it'll be worth revisiting but it'll need a global match finder like a suffix array to find the best matches across the whole input unlike the binary tree I was using.

    There may be potential with this system for text but it is quite a challenge to get it working well.

  6. #6
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Lucas View Post
    ...

    I never could get my implementation to beat LZ77, it was pretty much on par. I found that the odds of re-using a whole word is as probable as lz77 inserting a new word into the dictionary. By new word I mean a word which is either 1 byte shorter or longer than the next longest matching entry.

    I used a 3 symbol flag: literal (no match), lz77 match (insert word), and reuse word via offset code. I guess it'll be worth revisiting but it'll need a global match finder like a suffix array to find the best matches across the whole input unlike the binary tree I was using.

    There may be potential with this system for text but it is quite a challenge to get it working well.
    I guess, you're thinking about lz77's history buffer as "dictionary"?

    > a global match finder like a suffix array to find the best matches

    Indeed. Matt Mahoney did something like this in analyzing enwik9 strings...

  7. #7
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Yes I used the LZ history buffer as my dictionary, except I used the most recent 1MB of history to reference words, it was implemented as a 1MB array of uint32_t's to store the length which was referenced in the history, while the rest of the history buffer could grow to 512MB, this was done mainly to constrain decode memory requirements to 1N + 4MB

    You've got me wondering if we were both onto something with this encode technique because in theory it can flip between traditional LZ77 and an LZ78 style compressor almost freely, at least I haven't seen it ever compress worse than LZ77, it might just be a matter of determining proper switching between the 2 methods.

Similar Threads

  1. Best Compressor software to compress pure C source codes?
    By paqfan in forum Data Compression
    Replies: 3
    Last Post: 9th August 2016, 22:45
  2. Is there any cruncher with statistical output?
    By Crush in forum Data Compression
    Replies: 8
    Last Post: 18th September 2015, 18:44
  3. Replies: 7
    Last Post: 19th August 2015, 10:08
  4. Code generation in LZ decoder / Branchless LZ77 Decoder
    By Shelwien in forum Data Compression
    Replies: 1
    Last Post: 30th September 2010, 20:48
  5. Fast arithcoder for compression of LZ77 output
    By Bulat Ziganshin in forum Forum Archive
    Replies: 13
    Last Post: 15th April 2007, 17:40

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •