Encode, since you're into LZ lately, I would be interested if you play with Morphing Match Chains (by Cyan & me) or try to convert BWT output or Suffix Trees to some simple trees that would help greatly exhaustive searches.
Encode, since you're into LZ lately, I would be interested if you play with Morphing Match Chains (by Cyan & me) or try to convert BWT output or Suffix Trees to some simple trees that would help greatly exhaustive searches.
Basically I do plan to release something that will strike. A program that will be very practical, unique, people will compare it against own software, refer to it, and so on. I need an explosion. So, I already put many work on CRUSH and tested countless number of ideas, including Morphing Chains (Guess the name is not that correct), Binary Tree structures, etc. A key feature of the program is a Fast compression mode - 50 MB/sec on modern PC with a compression on par with Deflate's best modes. And probably the best data structure for that mode is Hash Chains. The main benefit of such solution is extremely fast update of a structure - we just skip to next position briefly. Many more advanced algorithms need a special handling, a special update that is slower. Anyway, I continue experimenting...
hi encode
A quick clarification :The main benefit of such solution is extremely fast update of a structure - we just skip to next position briefly. Many more advanced algorithms need a special handling, a special update that is slower.
MMC starts being a classic Hash Chain.
The main idea driving its development was to remove from the chain elements which we know are no longer useful (ie cannot produce better length) in the current search context.
Consequently, its update method is exactly the same as classic Hash chain : elements are quickly added without further handling. This is, as you say, in stark contrast with other complex data structure, such as BST, which requires sorting data on insertion. MMC doesn't bring such penalty (neither classic Hash Chain).
In Greedy parsing strategy, it makes a huge difference, since a lot of data will never be sorted, or will only be partly sorted later on, just as needed.
This conclusion still holds true with lazy match strategy.
However, for Optimal parsing, BST is very likely to take the lead, since each and every position is searched and inserted. I haven't test it yet, but it sounds logical comparing both behaviors.
That being said, MMC is still about a full "no compromise" search strategy. For faster speed, a Hash Table with short rows is going to provide faster performance, at the cost of reduced match search capability.
Regards
Actually, It's hard to read other's code. Probably I'm not completely understood the MMC idea. Can you provide a few-line pseudo-code. For hash chains, the code will look like:
How to transform it to MMC?Code:// Search ptr=head[hash]; while (ptr) { // compare string at ptr ptr=prev[ptr&W_MASK]; } // Update // if literal, len=1 while (len--) { prev[pos&W_MASK]=head[hash]; head[hash]=pos; ++pos; hash=UpdateHash(pos); }
You need 2 pointers per position :
The "bad one" will simply go to next position, as usual.
The "good one" means i'm happy with the current position, which improves match length. The "good pointer" will link directly to another "good position", skipping a lot of "bad" positions in the process.
So, in pseudo code, it looks like this :
As you see, there is a lot in common with regular hash chain.Code:// Search ptr=head[hash]; while (ptr) { // compare string at ptr if (matchlength>currentlevel) { if (goodprev[ptr&W_MASK] == NULL) { ptr=badprev[ptr&W_MASK]; rememberGoodPrev = &goodprev[ptr&W_MASK]; } else { *rememberGoodPrev = ptr; ptr=goodprev[ptr&W_MASK]; currentlevel++; } } else ptr=badprev[ptr&W_MASK]; } // Update // if literal, len=1 while (len--) { goodprev[pos&W_MASK]=NULL; badprev[pos&W_MASK]=head[hash]; head[hash]=pos; ++pos; hash=UpdateHash(pos); }
The open-source implementation proposed at google code (http://code.google.com/p/mmc/) is actually more complex, because it integrates a few more tricks to improve speed (such as Secondary promotions, proposed by Piotr).
There is a more complete explanation of algorithm and tricks used at this page :
http://fastcompression.blogspot.com/...tch-chain.html
Regards
Last edited by Cyan; 23rd April 2011 at 01:26. Reason: Modified source code for more completeness
In your pseudo code above, you never update goodprev[]! Meaning it is always NULL.
Is "currentlevel" the best match length found so far?
Indeed. The goodprev update is done during the search.
I've omitted it into pseudo code, for better clarity. It's explained in the picture below :
We need these concepts :
==> Linking : The first time we find a "good position", by construction its pointer is NULL. We keep that in memory. On the next "good position" found, we populate the "good pointer".
==> Promotion : When a position is linked by a "good pointer", it is no longer necessary to keep it linked by "bad ones". So we remove it from the chain of bad pointers.
[Edit] :
"currentlevel" is a guaranteed minimum number N of common characters we'll get by continuing on the current chain, using bad pointers. Following a "good pointer" guarantees to find chains with at least "N+1" characters from now on. It makes the search more and more precise.Is "currentlevel" the best match length found so far?
Last edited by Cyan; 22nd April 2011 at 16:01.
BWT, why listed mmc.h only? (http://code.google.com/p/mmc/downloads/list?can=1&q=)
Also, can you please update the pseudo-code? Many users will unable to read source code and it's important that users will read a small pseudo code to understand the idea. The picture is not that clear to understand what's going on...
Mmmh, that's a part of google code that i don't understand completely.
"Downloads" category seems to be about files to download on top of source files. Typically binary. I will have a look into it again.
The correct place to read and download the source code is here :
http://code.google.com/p/mmc/source/browse/#svn%2Ftrunk
I will update the front page so that it's easier to find.
Sure, edited.Also, can you please update the pseudo-code?
Last edited by Cyan; 2nd May 2011 at 19:09.
Slightly stuck in a middle with a Match Finder experiments. Just because a bigger dictionary = better compression with no decompression speed loss and the Match Finder speed/performance is the only actual thing that limits the dictionary size.
In addition, we should define what we want get - a Deflate competitor or an LZMA competitor with LZOP decompression speeds. Anyway, CRUSH will have at least 1 MB dictionary which is definitely not a Deflate level.
Did many optimizations and experiments with my Optimizer. It's my first LZ that was optimized/tested thru my Parameter Optimizer!
Another question, should I add an integrity checking natively? As example ADLER32 or CRC32. Currently, CRUSH has some simple LZ-stream checking that most likely may detect corrupted files.
> CRUSH will have at least 1 MB dictionary which is definitely not a Deflate level
There's also rar/lzh and cab/lzx
> should I add an integrity checking natively?
Its very easy to add if there's such a requirement, so it doesn't seem as much of a benefit.
And its certainly bad for benchmark ratings.
I think archivers should support any hash/ checksum algorithm user want. Today's CPUs are getting hardware crypto engines, like SPARC T3 - look at: http://en.wikipedia.org/wiki/SPARC_T3 . It mentions that T3 has hardware to compute: DES, 3DES, AES, RC4, SHA1, SHA256/384/512, Kasumi, Galois Field, MD5, RSA to 2048 key, ECC, CRC32. Some hashes are also easily computable on GPGPU.
Not really, but VIA Nano has support for AES encryption, SHA-1 and SHA-256 hashing![]()
QuickLZ encodes a hash table position (level 1 and level 2) or an offset to previous data (level 3). What about ULZ v0.02?
ULZ 0.02 is a pure LZ77. It encodes Match Length/Match Offset pair. The compression levels just determine how deep the string searching will be... Decompressor is the same for all modes, as expected.
Still have no luck with MMC. Will continue trying...
Anyway, some testing results with current match finder:
ENWIK8
Normal, 1 MB - 100,000,000 -> 33,942,665 in 15.600 sec
Normal, 2 MB - 100,000,000 -> 33,297,420 in 30.127 sec
Normal, 4 MB - 100,000,000 -> 32,837,273 in 54.194 sec
Max, 1 MB - 100,000,000 -> 32,983,850 in 61.455 sec
Max, 2 MB - 100,000,000 -> 32,260,938 in 142.421 sec
Max, 4 MB - 100,000,000 -> 31,659,247 in 396.854 sec
bookstar
Normal, 1 MB - 35,594,240 -> 13,612,045 in 6.688 sec
Normal, 2 MB - 35,594,240 -> 13,390,039 in 13.071 sec
Normal, 4 MB - 35,594,240 -> 13,245,085 in 24.075 sec
Max, 1 MB - 35,594,240 -> 13,208,069 in 23.649 sec
Max, 2 MB - 35,594,240 -> 12,969,102 in 57.161 sec
Max, 4 MB - 35,594,240 -> 12,780,865 in 156.668 sec
Taking into account that these timings are for Sandybridge madness...
Well, since, like I said, at any compression mode/dictionary size, the decompression speed is on par with LZOP or even faster, the only question is how much time we may spend on compression vs compression gained. A better match finder may help here for sure.
If you're processing the data blockwise, why don't you use a BWT-based match finder? On the other hand, you may use a data structure called prefix-list, which allows you to build a sorted list of all strings appearing in the input (similar to BWT). Some modifications to that data structure even allow a dynamic update mechanism (you can delete and insert strings, i.e., implement a sliding window).
http://callisto.nsu.ru/documentation...yooko_2000.pdf
M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk
Well, probably the most advantageous is a Fast mode (not Normal or Max) that compresses enwik8 to 37,xxx,xxx bytes within 2 secs. Most likely I will unable go that fast with BWT... In addition, we must search strings on the fly to obtain specific optimizations - i.e. longest match might be not that best - a shorter match with a smaller offset can be preferable.
Did you have trouble using MMC source code as available at Google Code (http://code.google.com/p/mmc/)Still have no luck with MMC. Will continue trying...
or do you mean you're currently trying to code your own MMC implementation ?
encode, did you actually test decompression speed recently?
I wonder how does large dictionary affect cache efficiency.
Unfortunately, these days I have a small amount of spare time. So, I just briefly tested/implemented the MMC idea from the pseudo code above. The speed is not improved, most likely something wrong. I need time to get what's going on inside the source, that's why I asked for proper pseudo code. As a note, the hash chains' pseudo code will work!
Moving towards. At some point I should stop fighting for the biggest dictionary possible, since CRUSH is not about the max.compression possible - I even abandoned the Huffman for a faster decompression. Anyway, here are the new results (CRUSH with 1 MB window):
ENWIK8
100,000,000 -> 32,687,373
ENWIK9
1,000,000,000 -> 289,163,252
world95.txt
2,988,578 -> 698,153
3200.txt
16,013,962 -> 5,636,735
calgary.tar
3,152,896 -> 999,605
![]()
loving your work, following your posts closely
can you also include compression and decompression timings when noting the ratios?
For a long time the compression/decompression time is the same since I keep the same match finder and my integer codes are closely to be transparent at decompression. Many recent compression improvements came from better match parsing/optimization and more optimized integer codes.
BTW, thanks for your words, this gives me more energy and enthusiasm to work more and harder on CRUSH!
Improved variable-length integer codes. Now I use prefix codes similar to static Huffman ones.
New results:
calgary.tar
3,152,896 -> 988,685
world95.txt
2,988,578 -> 693,449
ENWIK8
100,000,000 -> 32,577,338
ENWIK9
1,000,000,000 -> 287,333,602
Nice improvement, taking into account that only the match length coding was changed. Also, the complexity of such new codes are similar to previous variable-length integer codes, thus decompression speed is the same. Also, such codes are faster than static Huffman and equally efficient in terms of compression, thanks to my brute-force optimizer. And it's interesting direction, since such codes may replace static Huffman in many situations...
Do I understand this correctly? The compression comes purely from LZ coding, right? The literals are not further compressed, and the literals are all byte-aligned?
Do you have stats on how many literal bytes are in the output of, to pick one, ENWIK8?
Faster VLCs than Huffman sound pretty good! When do you plan to release?
I am... Black_Fox... my discontinued benchmark