This thread discusses my compressor CSC and also LZ77. It also records my progress.
I became interested in data compression several months ago; after a lot of experimenting I have finally worked out a simple file compressor.
The compressor uses LZP plus Order-1 (freq[context][c]) range coding; a rough sketch of that model follows.
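For reference, here is roughly what such a model looks like (a minimal sketch; RC_Encode() is an assumed range-coder primitive, and all names are illustrative, not my exact code):

[code]
/* Minimal sketch of an Order-1 byte model driving a range coder.
   RC_Encode(cumFreq, freq, totFreq) is an assumed coder primitive. */
#include <stdint.h>

void RC_Encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq);

static uint16_t freq[256][256];           /* freq[context][symbol] */

void Model_Init(void) {
    for (int ctx = 0; ctx < 256; ctx++)
        for (int c = 0; c < 256; c++)
            freq[ctx][c] = 1;             /* all symbols start possible */
}

void EncodeByte(uint8_t ctx, uint8_t c) {
    uint32_t cum = 0, tot = 0;            /* real code caches these sums */
    for (int i = 0; i < c; i++)   cum += freq[ctx][i];
    for (int i = 0; i < 256; i++) tot += freq[ctx][i];
    RC_Encode(cum, freq[ctx][c], tot);
    freq[ctx][c] += 32;                   /* adapt toward recent data */
    if (tot + 32 > 65536)                 /* rescale to avoid overflow */
        for (int i = 0; i < 256; i++)
            freq[ctx][i] = (freq[ctx][i] + 1) >> 1;
}
[/code]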
I chose LZP because I thought it would be fast at compression, but I have several questions:
First, do all LZP-based compressors decompress more slowly than they compress? My decompression speed is about 15% slower than compression; how can I avoid this?
Second, my program originally put literals and match_len into one alphabet. Later I separated them and used an order-4 match-flag state machine, but it gives worse compression than before on almost all files. Why?
Third, are there any samples showing how to implement an Order-2 or Order-3 arithmetic coder that is not as complicated as PPM?
I'm looking forward to your advice!
This is my first thread (topic?) in English, so sorry for my bad English.
This time I made the coder bitwise. Literal, match_length and match_flag now use separate simple models (sketched below). It shows worse compression on text (34.1% --> 34.8% on enwik) but an obviously better ratio on binary files. Compression speed also improved slightly, while decompression speed improved a lot; the two procedures now seem completely symmetric.
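Each "simple model" is just an adaptive bit probability in the fpaq style, roughly like this (a sketch; RC_EncodeBit() is an assumed coder primitive):

[code]
/* Sketch of a bitwise adaptive model: a 12-bit probability per context,
   updated with a shift after each bit, as in fpaq-style coders. */
#include <stdint.h>

void RC_EncodeBit(uint16_t p0, int bit);  /* assumed coder primitive */

typedef struct { uint16_t p0; } BitModel; /* p0 = P(bit==0), scale 4096 */

void BM_Init(BitModel *m) { m->p0 = 2048; }      /* start at 0.5 */

void BM_Encode(BitModel *m, int bit) {
    RC_EncodeBit(m->p0, bit);
    if (bit == 0) m->p0 += (4096 - m->p0) >> 5;  /* drift toward 0 */
    else          m->p0 -= m->p0 >> 5;           /* drift toward 1 */
}
[/code]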
2009 05 06 Some modifications -- CountrySide Compressor 2.
CountrySide Compressor 3 (why the name CountrySide? It stands for "copycatting" in Chinese).
I've finished all my exams this term, so I have some time these days. Over the last two days I have totally rewritten the LZ part of my program. It achieves better speed than my previous version but is still a long way from my expectations.
So far I have done the following:
LZ77 with a sliding window that can be any size (it does not have to be 2^n, and there is no look-ahead buffer)
an fpaq-like range coder with good speed
match_pos & match_len are coded in buckets, much like in deflate (see the sketch after this list)
the match finder uses a 4-byte hash chain
lazy match evaluation -- it looks 2 matches ahead
memory usage: 7 MB + 5*WND_SIZE (I/O buffers = 1 MB * 2)
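The bucket coding works like deflate's length/distance slots: the model codes only a slot number, and the low bits, which are nearly random, are sent as direct p=0.5 bits. A sketch (the coder calls are assumed, and the slot rule is simplified):

[code]
/* Sketch of deflate-like bucket coding for a match position (pos >= 1):
   slot = floor(log2(pos)) is coded with a model, then the remaining
   'slot' low bits go out as direct bits. Coder calls are assumed. */
#include <stdint.h>

#define POS_ALPHABET 0                          /* illustrative model id */
void RC_EncodeSymbol(int alphabet, int sym);    /* assumed */
void RC_EncodeDirect(uint32_t bits, int nbits); /* assumed */

void EncodeMatchPos(uint32_t pos) {
    int slot = 0;
    while ((pos >> slot) > 1) slot++;           /* slot = floor(log2(pos)) */
    RC_EncodeSymbol(POS_ALPHABET, slot);        /* modeled bucket number   */
    if (slot > 0)                               /* low bits: almost random */
        RC_EncodeDirect(pos & ((1u << slot) - 1), slot);
}
[/code]

The decoder rebuilds pos as (1 << slot) | direct_bits.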
This version is still quite immature. There are two serious problems that need to be solved. First, the match finder is too slow -- even slower than my previous one. Does anyone have ideas about how to deal with incompressible data quickly? The match finder always wastes a lot of time on that kind of data.
Second, I think I need to evaluate precisely how many bits each symbol or match will be encoded into; a sketch of such a price function follows.
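By "precisely" I mean something like this: a choice the model gives probability p costs about -log2(p) bits, which can be made cheap with a table (a sketch; the 1/16-bit fixed-point scale is just an illustrative choice):

[code]
/* Sketch of a price function: cost of coding a bit whose probability
   (12-bit scale) is p, in 1/16-bit fixed-point units. */
#include <math.h>
#include <stdint.h>

static uint16_t priceTable[4096];

void Price_Init(void) {
    for (int p = 1; p < 4096; p++)
        priceTable[p] = (uint16_t)(-log2(p / 4096.0) * 16.0 + 0.5);
    priceTable[0] = priceTable[1];            /* guard against log(0) */
}

/* cost of 'bit' under a model with P(bit==0) = p0/4096, p0 in [1,4095] */
uint32_t Price_Bit(uint16_t p0, int bit) {
    return priceTable[bit ? 4096 - p0 : p0];
}
[/code]

Summing these prices over a run of literals versus a match is how the parser could compare the two choices.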
Finally, I will upload what I have done so far (with an 8 MB window size).
Type csc3 file1 file2 [x=0..9] to run the program; it will identify by itself whether the input is a csc3-compressed file or not.
2009 07 12 Better parsing, especially for binary files. Fixed a serious bug in the hash function. The design is still hash-chain match finder & flexible parsing & range coding.
2009 07 13 Fixed some bugs that slowed down the speed. Speed doubles on some files compared to the previous edition. Modes 0..9 all became faster, with better compression than the initial CSC3.
2009 07 19 CSC3 seems to have no serious bugs now. However, I still don't know why it doesn't work on wikipedia.tar from SqueezeChart. I think the performance is good, even for practical use. Maybe my next edition, ECS (Experimental Compressor for Study), which combines several things I have studied, will work on that file.
2009 08 03 Better parsing. Significant improvement on most files.
2009 08 06 Slightly modified parsing. Added a simple 2-channel .wav filter. Redesigned the search-depth rule. The compressor is now more practical.
2009 08 11 Some last modifications lead to CSC3 Final:
1. Search depth is dynamic now, so compression of bad data (WAV) won't be too slow (see the sketch after this list).
2. Redesigned the command-line format. There are only modes m1..m3 now, and the dictionary size can be controlled by the user.
3. Added a 1-byte match rule, which brings big progress on files such as valley.cmb.
4. Fixed a serious bug in lazy parsing that slowed down the speed.
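The idea behind the dynamic search depth, in sketch form (the adaptation rule here is illustrative, not necessarily what CSC3 actually does):

[code]
/* Sketch of dynamic search depth: the hash-chain walk is capped at
   'depth' nodes, and 'depth' adapts to recent success, so incompressible
   data cannot stall the match finder. The rule is illustrative only. */
static int depth = 64;                               /* current chain cap */

void AdaptDepth(int foundMatch) {
    if (foundMatch) { if (depth < 256) depth += 4; } /* matchy data   */
    else            { if (depth > 8)   depth -= 1; } /* likely random */
}
[/code]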
I will post the mature version (one I'm more or less satisfied with), together with the whole source code, in a new thread after hunting down some remaining bugs.
BT (binary tree) match finding + optimal parsing is a big project. It will be done in XXX (CSC?) 4.
2009 09 06 Some modifications based on CSC3 Final --> CSC 3.1.
I've experimented a lot with a hash table (HT) match finder. However, it disappointed me a little: compared to a hash chain (HC), HT only wins on long search cycles. So in the end I went back to the hash chain. It is now Hash3+Hash5 instead of Hash3+Hash4 (roughly as sketched below), which achieves a very obvious improvement on text such as enwik8 or Chinese text. With a small change to the parsing, performance on other types of data, such as binaries, doesn't lose much. I'm still very interested in detectors & filters, but I have only a little spare time since the new term has begun.
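Hash3+Hash5 means two structures: a plain head table on a 3-byte hash that catches short, near matches, and a chained table on a 5-byte hash for the long ones. Roughly like this (the hash functions, multipliers and table sizes are illustrative, not my exact ones):

[code]
/* Sketch of a Hash3+Hash5 match finder: head-only table for 3-byte
   hashes plus a hash chain for 5-byte hashes. Sizes/hashes illustrative. */
#include <stdint.h>
#include <string.h>

#define H3_BITS 14
#define H5_BITS 20

static uint32_t head3[1 << H3_BITS];   /* newest pos per 3-byte hash  */
static uint32_t head5[1 << H5_BITS];   /* chain head per 5-byte hash  */
static uint32_t *chain5;               /* one entry per window position:
                                          chain5[pos] = previous pos   */

static uint32_t Hash3(const uint8_t *p) {
    return ((p[0] << 8) ^ (p[1] << 4) ^ p[2]) & ((1u << H3_BITS) - 1);
}

static uint32_t Hash5(const uint8_t *p) {
    uint32_t x;
    memcpy(&x, p, 4);                  /* first 4 bytes */
    return ((x * 2654435761u) ^ ((uint32_t)p[4] << 3)) >> (32 - H5_BITS);
}

void MF_Insert(const uint8_t *buf, uint32_t pos) {
    head3[Hash3(buf + pos)] = pos;     /* 3-byte table: keep newest only */
    uint32_t h = Hash5(buf + pos);
    chain5[pos] = head5[h];            /* 5-byte table: link the chain   */
    head5[h] = pos;
}
[/code]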
The new version also reports its memory usage very precisely.
2009 09 17 Some modifications again. It is more impressive than CSC3 Final now.
The match finder is Hash3+Hash6 now; I tried many combinations and this is the best so far.
Redesigned the hash function; memory usage is now 6*WND_SIZE + 4 MB for compression and under WND_SIZE + 3 MB for decompression.
Slightly improved parsing.
The biggest improvement was made this morning: I switched the range coder's p=0.5 (direct bits) path from bit-by-bit into byte-oriented operation (sketched below). It is only a little faster when compressing, but decompression is much faster on some files. Now the kernel time of decompression on enwik9 is 29 s! Very competitive with Tornado now.
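The change, in sketch form: instead of feeding each p=0.5 bit through the coder one at a time, all n direct bits shrink the range in one shot (this assumes a carry-less coder with a 64-bit low; the names are illustrative):

[code]
/* Sketch: coding n direct (p=0.5) bits in one range-coder step rather
   than bit by bit. 'low', 'range' and Normalize() belong to an assumed
   carry-less range coder; nbits must be small enough (e.g. <= 16 for a
   32-bit range) so the range does not collapse to zero. */
#include <stdint.h>

extern uint64_t low;
extern uint32_t range;
void Normalize(void);                  /* assumed: emits bytes until
                                          range is large again */

void EncodeDirect(uint32_t bits, int nbits) {
    range >>= nbits;                   /* one shift instead of n halvings */
    low   += (uint64_t)bits * range;
    Normalize();
}
[/code]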
I'm always thinking about filters these days. Actually it isn't completely about filters: I'm thinking about a multi-coder for data of different entropy. For example, Order-1 coding would be very slow on bad data, so we should encode such data raw. Also, as osman said, on binary data the Order-1 context should be ctx>>4 (a tiny sketch follows). What's more, I also plan to separate the range coder out of the LZ77 part. I'm not sure, but I hope this will benefit memory-access optimization.
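The ctx>>4 point is just context quantization, e.g.:

[code]
/* Order-1 context for binary data: keep only the high nibble of the
   previous byte, so similar bytes share statistics (16 contexts, not 256). */
unsigned BinaryContext(unsigned char prevByte) {
    return prevByte >> 4;
}
[/code]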
2009 09 22 This is nearly the final version. The EXE auto-detector works now, and some other filters also work. Bad data is stored uncompressed now, so it decompresses very fast. I'm checking for bugs these days.
2009 09 03 Fixed a memory bug. Fixed a bit-operation bug that only occurs at large dictionary sizes.