I recently decided to start exercising my C skills again by implementing some of the data compression algorithms I learned over the years.
The first one I implemented I don't remember the name of: keep a rolling hash of the recent data and look it up in a hash table; if the table's stored byte matches the next input byte, output a 1 flag; if not, output a 0 flag plus the literal. To keep the output byte-aligned, gather 8 flags and emit them together as one byte.
That actually performed a lot better than I expected once I expanded the hash table to 1 MB: even better than `lz4 -1` on enwik8.
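For what it's worth, that flag-plus-literal scheme sounds very close to the "Predictor" algorithm from RFC 1978 (PPP Predictor Compression). Here's a minimal sketch of the encoder side as I understand the description; the function names, the particular rolling hash, and the table handling are my assumptions, not the post's actual code:

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE (1u << 20)   /* 1 MB prediction table, as in the post */

/* Compress n bytes from in[] into out[]; returns compressed size.
   Per group of up to 8 input bytes: one flag byte, then the literals
   that were not predicted. out[] must allow for worst-case expansion
   (n + n/8 + 1 bytes). */
static size_t compress_block(const uint8_t *in, size_t n, uint8_t *out)
{
    static uint8_t table[TABLE_SIZE];
    uint32_t hash = 0;
    size_t o = 0, i = 0;

    memset(table, 0, sizeof table);
    while (i < n) {
        size_t flag_pos = o++;          /* reserve room for the flag byte */
        uint8_t flags = 0;
        for (int bit = 0; bit < 8 && i < n; bit++, i++) {
            uint8_t c = in[i];
            if (table[hash] == c) {
                flags |= (uint8_t)(1u << bit);  /* predicted: flag only */
            } else {
                table[hash] = c;                /* mispredict: emit literal */
                out[o++] = c;
            }
            /* rolling hash over recent bytes (one possible choice;
               20 bits at 5 bits/byte means roughly a 4-byte context) */
            hash = ((hash << 5) ^ c) & (TABLE_SIZE - 1);
        }
        out[flag_pos] = flags;
    }
    return o;
}
```

On repetitive input the table's predictions start hitting after the first pass over each context, so the output collapses toward one flag byte per 8 input bytes.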
Next, I implemented a simple LZP or ROLZ scheme.
Using a table (8k entries at first, then 64k) of arrays of references (initially 8 deep, then 16), I hash the last 2 bytes (later 3) and scan the list of references for the longest match. I then output a match flag, the index, and the length.
If the length is short enough, all of that fits in a single byte, so even match lengths of 2 are worthwhile.
Sequential literals are gathered into blocks with a (0, len) byte.
In either case, if the length hits the maximum for its field size, an additional extension byte follows and is added on, allowing match lengths up to 2 + 15 + 255 in the original format, and 2 + 7 + 255 in the 16-deep version. Literal runs can be up to 128 + 255 bytes long.
Literal run byte: 0lll llll (l = length - 1)
Original match flag byte: 1iii llll (i = index, l = length - 2)
Revised match flag byte: 1lll iiii
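The token layouts above can be sketched as a few packing/unpacking helpers. This follows the revised `1lll iiii` layout (3-bit length minus 2, 4-bit reference index) and the `0lll llll` literal-run byte; the names and the saturate-then-extend convention for long matches are my reading of the description:

```c
#include <stdint.h>

#define MATCH_MIN_LEN 2
#define MATCH_LEN_MAX 7     /* 3-bit field: lengths 2..9 before extension */

/* Match token: 1lll iiii (l = length - 2, i = reference index).
   If the length field saturates at 7, an extension byte follows. */
static uint8_t pack_match(unsigned len, unsigned index)
{
    unsigned l = len - MATCH_MIN_LEN;
    if (l > MATCH_LEN_MAX)
        l = MATCH_LEN_MAX;          /* caller emits an extension byte */
    return (uint8_t)(0x80u | (l << 4) | (index & 0x0Fu));
}

/* Literal-run token: 0lll llll (l = run length - 1, runs of 1..128). */
static uint8_t pack_literal_run(unsigned run_len)
{
    return (uint8_t)(run_len - 1);  /* high bit clear marks a literal run */
}

static int      is_match(uint8_t t)    { return (t & 0x80u) != 0; }
static unsigned match_len(uint8_t t)   { return ((t >> 4) & 0x07u) + MATCH_MIN_LEN; }
static unsigned match_index(uint8_t t) { return t & 0x0Fu; }
```

Keeping the high bit as the match/literal discriminator means the decoder can dispatch on a single test per token.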
Again, the results are not bad. I've done no optimisation for speed, but it compresses enwik9 in ~24 seconds to 485,910,956 bytes, and decompresses in ~12 seconds.
For enwik8 it compresses to 55,343,408 bytes.
1. Would this count as LZP or ROLZ, or is LZP typically considered a special case of ROLZ?
2. I currently update the hash tables for every byte [up to -3 from the current] - is there a better scheme?
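To make question 2 concrete, here is a minimal sketch of one possible per-position table update, assuming the 64k-bucket, 16-deep configuration with a 3-byte context; the struct layout, the multiplicative hash, and the round-robin replacement are all my assumptions:

```c
#include <stdint.h>

#define BUCKETS         65536u   /* 64k buckets, as in the post */
#define REFS_PER_BUCKET 16u      /* 16-deep reference lists */

/* Bucket table: each bucket holds a small ring of recent positions
   that shared the same 3-byte context hash. */
typedef struct {
    uint32_t refs[BUCKETS][REFS_PER_BUCKET];
    uint8_t  head[BUCKETS];      /* next slot to overwrite, per bucket */
} rolz_table;

/* Hash the 3 bytes at p down to a 16-bit bucket index
   (Knuth-style multiplicative hash; one possible choice). */
static uint32_t hash3(const uint8_t *p)
{
    uint32_t h = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (h * 2654435761u) >> 16;
}

/* Called once per input position pos (pos >= 3): file the position
   under the hash of the 3 bytes immediately before it, evicting the
   oldest reference in that bucket. */
static void table_update(rolz_table *t, const uint8_t *buf, uint32_t pos)
{
    uint32_t b = hash3(buf + pos - 3);
    t->refs[b][t->head[b]] = pos;
    t->head[b] = (uint8_t)((t->head[b] + 1u) % REFS_PER_BUCKET);
}
```

A common cheaper alternative to updating at every byte is to insert only at match boundaries (the match start, or start and end), trading some match coverage for speed.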
The longest match I've managed in enwik8 was 264 bytes. I guess my next step is to start gathering statistics to see how it's really performing.