Thanks Chris!
Mirror: Download
Hi everyone!
I'm pleased to release another compression utility - a ROLZ based compression engine. RZM is very alpha and is more or less a proof of concept. So let the bug reports come in. Compression is slow, I'm sorry - but decompression is pretty fast. Therefore, it's highly asymmetric. RZM is the result of my experiments with near-optimal parsing and will most likely not be developed any further.
Its scheme reduces offsets with the previous byte - 64k per context. The symbol coder is only o1. The sliding window is 64MB - better safe than sorry. Generally speaking, ROLZ has one major drawback: long-distance-matches are nearly impossible because offsets get discared too fast (even with 64k offsets per context). Btw, RZM has got a simple exe-filter. So, it's more or less pure compression 'power'. Before this one gets lost on my hard disk, I thought I share it with you.
Download RZM 0.02
@Schnaader: I'm sorry that there hasn't been any mail. But I'm very, very busy right now. I'll contact you regarding precomp if some free time comes up again.
Thanks Chris!
Mirror: Download
Quick test...
A10.jpg > 836,229
AcroRd32.exe > 1,276,453
english.dic > 676,478
FlashMX.pdf > 3,694,219
FP.LOG > 537,828
MSO97.DLL > 1,714,316
ohs.doc > 794,281
rafale.bmp > 949,920
vcfiu.hlp > 600,403
world95.txt > 544,074
Total = 11,624,201 bytes
Nice!![]()
Another quick test...
Test machine: Intel PIII (Coppermine) @750 MHz, 512 MB RAM, Windows 2000 Pro SP4
Test File: ENWIK8 (100,000,000 bytes)
Timed with AcuTimer v1.2
ENWIK8 > 24,930,240 bytes
Elapsed Time: 00:14:50.858 (890.858 Seconds)
Hello Christian
RZM is really amazing!!
Damn, Chris, you again done something impossible - crazy work!
Can you describe how do you encode LZ output in detail.
Do you use flags or single alphabet for match lengths/literals?
How do you encode offsets?
![]()
And the main question is - how do you implemented an optimal parsing with ROLZ - as far as I understand you update offsets per each byte, not per phrase.![]()
@all:
Thanks! Heres an updated version. I improved the parsing a tiny, tiny bit. It should be compatible with the old version - not that it matters.
RZM 0.03
I use a flag for literal/match decision. Combining lengths&literals in one alphabet is a poor choice, IMO. When using a flag, you can use different contexts for the flag, length and literals.Originally Posted by encode
The flag, lengths and distances are using the previous byte as context. The symbol coder uses the previous byte and a small state as context.Originally Posted by encode
Im quite sure, that the contexts can be improved further to improve compression (use some length-info in the offset context, ...), but I somehow lost interest.
I dont understand this question. I simply code the distance with some bucket-coder.Originally Posted by encode
<div class=""jscript""><pre>distance(prev,x)=(root[prev]-x-1)&(KEY_MAX-1)</pre>[/QUOTE]
AFAIR, I already posted something about bucket/bin-coding. In a nutshell, it is like gamma coding with improvements for small distances (extra buckets).
[QUOTE]<div class=""quoting"">Quoting: encode</div>as far as I understand you update offsets per each byte, not per phrase.</div>
Yes, you do. But why is this a problem? For a given index i, the offset table offsets[win[i-1]][...] is always the same, no matter how you reach offset i.
Thank you!
Its interesting. Small state what is?Originally Posted by Christian
In other words how do you get an additional context?
As you know all my ROLZ implementations add an offset per phrase. At least in this case decoder is faster.Originally Posted by Christian
I just remembered why I didnt do this. It improves compression, Im quite sure, but then, the offset cost gets variable during cost optimization - which slows down the parsing a bit. I think it wasnt worth it.Originally Posted by Christian
I see. This complicates parsing a lot.Originally Posted by encode
This is some magical number, which alters based on the coding decision and the context i. Thus, it can be easily propagated during parsing.Originally Posted by encode
<div class=""jscript""><pre>state=get_state(st,i,decisi on)</pre></div>
Am I right?Originally Posted by Christian
i - whole previous byte (order-1 context)
decision - 0/1 - i.e. literal or match
st - ?? (previous state?)
And how do you mix that?
<div class=""jscript""><pre>
(st<<9)+(decision<<8)+i
</pre></div>
Thanks in advance!![]()
RZM vs 7ZIP
7zip 4.42 [LZMA]
ENWIK8 -> 24,996,113
ENWIK9 -> 213,490,979
RZM 0.02
ENWIK8-> 24,930,240
ENWIK9->216399070
Last one for today. Decompression speed is improved a bit.
RZM 0.04
Yes, you are - but i is the current file offset. Sorry, it shouldve looked like this.Originally Posted by encode
<div class=""jscript""><pre>new_state=get_state(state,i ,decision)</pre></div>
Regarding the state update. You can remember in that tiny state, what youve done before. Or you can use (i & 3) as part of a new state. Or you can analyse win(i-n,i-1) which is known to the decoder, too. You see, there are many, many possibilities. It all comes down to how your coders are designed in detail.
Finally, I almost forgot. Or you can just use stronger coders.
Thanks Chris!
Mirror: Download
Looking awesome!This is another good job story after CCM. ROLZ is interesting thing for Ilia, Nania and at last me. But, AFAIK none of us didn't merge a good parsing method with ROLZ (at least me, of course
). So, can you give us more detail?
- ROLZ match context? Order-1 or order-2?
- What about the detail of ROLZ parsing?
- Do you use LZMA like state decision?
At last, here is the test results with my test file: valley.cmb (19,776,230 bytes)
<div class="jscript"><pre>7-Zip (LZMA-Ultra) -> 7,507,360 bytes
PAQ9a (-9) -> 7,528,983 bytes
WinRK (ROLZ3-Best) -> 7,552,372 bytes
RZM 0.04 -> 8,157,401 bytes
WinRAR (Best) -> 8,511,238 bytes
LZPM 0.15 (-9) -> 9,367,718 bytes
BIT 0.1 -> 9,741,439 bytes
ZIP (Max) -> 9,887,176 bytes
Quad (-x) -> 10,017,449 bytes</pre></div>
BIT Archiver homepage: www.osmanturan.com
Thanks Chris! It looks like you find just some spare minutes a day and each few months there's new compressor using classic algorithm with ground-breaking results![]()
By the way, can you upload this file. Im curious about my new LZ77s performance on this file!Originally Posted by osmanturan
![]()
Ive posted the reason why Im using this file:
Ive just uploaded the file at my site. The link: http://www.osmanturan.com/valley.7zOriginally Posted by osmanturan
Im curious about your test results, too!![]()
BIT Archiver homepage: www.osmanturan.com
The page cannot be foundOriginally Posted by osmanturan
![]()
I can't figure out what's happening in the foolish server or internet explorerWhen I renamed it "test.zip" in FTP, there is no problem. You can download the file: http://www.osmanturan.com/test.zip
Do not forget to change it's extension! It's actually a 7-zip archive (extension: 7z)
BIT Archiver homepage: www.osmanturan.com
Got it!
OK, my new LZ77 with basic configuration compresses valley.cmb down to:
8,904,441 bytes
![]()
Better than LZPM 0.15 (-9)! Cool![]()
BIT Archiver homepage: www.osmanturan.com
Thanks again! I have not expected such positive feedback. I mean, RZM is slow and ratio is nothing new - Malcolms ROLZ seems to be even stronger. Additioally, RZM is missing serious data filters or a CM-backend.
Order 1.Originally Posted by osmanturan
No. State update is based on the content of the sliding window at the current position, too. But you shouldnt overestimate the impact of an additional state context - at least for RZM.Originally Posted by osmanturan
Parsing is not much different from LZ77 parsing. I do forward parsing - it makes state propagation and string searching easier. I dont know what you want to know exactly?Originally Posted by osmanturan
Can you share us a psuedo code for making it clear? I think, ilia wonders what is happenning in your parsing, tooOriginally Posted by Christian
My current implementation is:Originally Posted by Christian
- Order-1 ROLZ match context
- Match length based flexible parsing (same as Quads)
- 32768 ROLZ offset table (Ive used Deflate distance coding schema with Order-0 model which exactly fits)
- Segmented match length coding with a small state context. I have notice most of files match length avarage around 8, and some of them 16
- Literals encoded with an approximation of neural network context mixer which mixes Order 0-1-2 model by updating weight in linear space (Actually not a neural network)
- 16 MB non-sliding dictionary. Because I want to implement a multithreading task manager which manages of compressing each 16 MB block (currrently not implemented)
Why Im giving these informations?Because, my experience about CM back-end doesnt really gain as expected for ROLZ. You have to know, my current unreleased implementation is not powerful as yours
So, popular question becomes parsing. As we see, parsing is really a matter for ROLZ
![]()
BIT Archiver homepage: www.osmanturan.com
So, this can be true for my implementation, tooOriginally Posted by Christian
Because, my match length encoding based on the same thing as state: position & 3
This gives me about %1-2 gain on calgary corpus.
BIT Archiver homepage: www.osmanturan.com
I wonder how you do that. If you ripped off Matts NN leaving stretch & squash you are doing nonsense. Could you post a formula for weight updating?Originally Posted by osmanturan
BTW: Christian, which is your weighting approach? I have tried several different and found gradient descent / NN (gradient descent with a nonlinear transform) to be very robust (and efficient). The NN method requires no divisions, but needs more table lookups.
Shelwiens method didnt work for me, at least not as good as both mentioned above.
Newtons method for root search seems to be inappropriate, since the gradient of a weighted sum doesnt have a zero analytically (only at w->infinity), so it diverges.
At the moment im trying some modification of the simplex algorithm (nelder&mead) to minimize coding cost.
To increase robustness one might change the optimization metric, e.g. minimizing an averaged coding error.
Somehow Im out of luck here...
ps: gr??e von deutschland nach deutschland![]()