Just while playing with UNWHAP-like stuff, I developed a new LZW variant.
First of all, consider the following figures:
doom3.exe (5,427,200 bytes)
TC 3.4 (Generic LZW): 3,130,168 bytes
COMPRESS 4.2.4 (LZW with flat codes): 2,986,623 bytes
LZW1 (LZW with "sliding dictionary"): 2,944,286 bytes
LZW2 (LZW1 + Improved dictionary loading): 2,879,424 bytes
LZW3 (LZW2 + Kind of lazy matching): 2,873,584 bytes
LZW1 thru LZW3 are my new LZW variants. All three use fixed 16-bit codes.
As I already posted, while playing with LZRW5/UNWHAP-like stuff I found that there is no reason to reset or rebuild the dictionary when it becomes full. With baseline LZW we must do that, because during compression and decompression we use prefix trees - i.e. each new entry depends on a previous one. With UNWHAP each element is independent - since we keep a position and a length - and with the buffer we are able to restore any phrase at any time. Thus we may replace the oldest entry with the new one (circular table update). Furthermore, the decoder stays untouched!! As a result, we can constantly hold up to 2**N active phrases.
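The circular table update can be sketched like this (a minimal sketch of the idea, not the actual LZW1 code; the entry layout, table size and names are my assumptions):

```c
#include <stdint.h>

#define DICT_BITS 16
#define DICT_SIZE (1 << DICT_BITS) /* up to 2**N active phrases */

/* Each entry is independent: just a position and a length,
   so any phrase can be restored from the buffer at any time. */
typedef struct {
    uint32_t pos; /* phrase start in the input buffer */
    uint32_t len; /* phrase length */
} Entry;

static Entry dict[DICT_SIZE];
static uint32_t next_slot = 0; /* wraps around: circular table update */

/* Replace the oldest entry with the new one; no reset, no rebuild. */
static void add(uint32_t pos, uint32_t len) {
    dict[next_slot].pos = pos;
    dict[next_slot].len = len;
    next_slot = (next_slot + 1) & (DICT_SIZE - 1);
}
```

Since the decoder also keeps only (position, length) pairs, it performs the very same wrap-around update and needs no special handling for a full table.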
The second improvement is improved dictionary loading. Since both encoder and decoder keep no prefix trees, we cannot do what we can with other LZW modifications. Anyway, instead of adding just:
Add(Pos, Len + 1);
// we may additionally add
Add(Pos + 1, Len);
// Pos - current position
// Len - length of the coded match
// if we code a literal char (Len == 1),
// we add just one entry, as with the previous variant
In other words, Len + 1 means that we add the current coded string plus the next char. OK, we kind of advance by one char - i.e. we have one extra char. So, if we code AB and the next char is C, we may add both ABC and BC.
Cool, isn't it!
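The loading rule can be sketched as follows (a self-contained sketch; the circular add() helper and all names are my assumptions, not the actual LZW2 code):

```c
#include <stdint.h>

#define DICT_SIZE (1 << 16)

typedef struct { uint32_t pos, len; } Entry;

static Entry dict[DICT_SIZE];
static uint32_t next_slot = 0;

/* circular table update: the oldest entry gets replaced */
static void add(uint32_t pos, uint32_t len) {
    dict[next_slot].pos = pos;
    dict[next_slot].len = len;
    next_slot = (next_slot + 1) % DICT_SIZE;
}

/* Improved loading: after coding a match of length len at position pos,
   add the coded string plus the next char, and additionally the same
   string advanced by one char. For a literal (len == 1) only one entry
   is added, as with the previous variant. */
static void update_dictionary(uint32_t pos, uint32_t len) {
    add(pos, len + 1);     /* e.g. coded "AB" -> add "ABC" */
    if (len > 1)
        add(pos + 1, len); /* ... and also add "BC" */
}
```

This works only because entries are (position, length) pairs: a prefix-tree dictionary could not add the shifted string, since it has no parent node for it.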
Regarding parsing. I've found that LZW codes are not as redundant as LZ77 ones. However, the algorithm adapts to new data much more slowly. In addition, we deal with fixed strings - i.e. only lengths that are already in the dictionary are available. Furthermore, if we output a string which is shorter than the one we've found, we duplicate entries - since the dictionary update also depends on what we output: if we output a shorter string, we will add the longer string again.
Anyway, I tried many variants, including Flexible Parsing. Only one variant works so far:
cur = len + getmatch(i + len);
if (getmatch(i + 1) > cur)
    code = buf[i]; // drop a char instead
However, such a trick gives a really tiny compression gain, and at the cost of a significant compression speed loss. (Currently, I'm experimenting with my brute-force encoder, which is slower than PAQ8.)
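Spelled out, the check looks roughly like this (a sketch under my own assumptions: the toy getmatch() below is a plain longest-substring matcher standing in for the real dictionary lookup, where only whole dictionary phrases would be available):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the real match finder: length of the longest match
   of buf + i against any earlier position in buf. The real encoder
   searches the (position, length) dictionary instead. */
static size_t getmatch(const uint8_t *buf, size_t n, size_t i) {
    size_t best = 1; /* a single char always "matches" */
    for (size_t j = 0; j < i; j++) {
        size_t l = 0;
        while (i + l < n && buf[j + l] == buf[i + l]) l++;
        if (l > best) best = l;
    }
    return best;
}

/* Lazy-matching check: drop a char (emit a literal) at position i
   if the match starting at i + 1 beats the current match of length
   len plus whatever can be matched right after it. */
static int drop_char(const uint8_t *buf, size_t n, size_t i, size_t len) {
    size_t cur = len + getmatch(buf, n, i + len);
    return getmatch(buf, n, i + 1) > cur;
}
```

With a substring matcher the condition almost never fires; the point of the heuristic is specific to LZW, where the match at i + 1 can be a long dictionary phrase while the follow-up match after the current string is short.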
world95.txt (2,988,578 bytes)
COMPRESS: 1,133,483 bytes
LZW1: 1,059,886 bytes
LZW2: 1,036,410 bytes
LZW3: 1,025,744 bytes
rafale.bmp (4,149,414 bytes)
COMPRESS: 1,444,893 bytes
LZW1: 1,367,220 bytes
LZW2: 1,336,280 bytes
LZW3: 1,335,882 bytes
calgary.tar (3,152,896 bytes)
COMPRESS: 1,325,806 bytes
LZW1: 1,295,680 bytes
LZW2: 1,264,140 bytes
LZW3: 1,261,778 bytes
Anyway, to conclude: the catch is the tiny and super-fast decoder only.