Hi guys, I've been writing a compressor that hybridizes LZ77 and BWT. It essentially generalizes a bunch of CM-derived models into bytewise-equivalent filters and compression stages, which are brute-forced during compression but known during decoding, so performance is very asymmetric (typically 3-10 MB/s compression and 30-50 MB/s decompression).
Source code: https://github.com/loxxous/Jampack
As expected it reaches BWT-level ratios on text, but with much better performance on structured and especially embedded data (e.g. videogame DLC patches) compared to something like bsc. But I need to write one more model to generalize the fast-adapting nature of LZ or PAQ for structured inputs like Charles Bloom's lzt24 test file.
Currently it's dedupe + adaptive channel filtering + localized prefix modeling + LZ77 to catch long structured repeats + Burrows-Wheeler and entropy coding. I thought that should be enough stages to handle most data, but it turns out it's still missing the LZ-esque compression on static binary data.
I've already got a local prefix model to capture nearby matches bijectively and induce a more optimal sort order, but the literals are the tricky part. Any suggestions on how to synthesize fast adaptive modeling while still compressing with a Burrows-Wheeler core?
Nice work! Better ratio than LZMA on some files, even in this alpha state. However, Jampack keeps crashing on my system. I will try to find a pattern and figure out a cause. If not, I'll just upload one of the files involved in the crashes.
> Any suggestions on how to synthesize fast adaptive modeling while still compressing with a Burrows-Wheeler core?
Can you clarify this?
Technically most bwt postcoders already count as "fast adaptive modeling"?
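For example, a minimal move-to-front pass (just a demo sketch, not any particular coder's implementation) already adapts within a few symbols of seeing them, which is "fast adaptive" in that sense:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Move-to-front: rank each symbol by how recently it was seen.
// After BWT groups similar contexts together, locally frequent
// symbols collapse to small ranks within a few steps.
std::vector<uint8_t> mtf(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> table(256);
    std::iota(table.begin(), table.end(), 0);   // identity ranking
    std::vector<uint8_t> out;
    out.reserve(in.size());
    for (uint8_t c : in) {
        auto it = std::find(table.begin(), table.end(), c);
        out.push_back((uint8_t)(it - table.begin()));
        table.erase(it);                        // move symbol to front
        table.insert(table.begin(), c);
    }
    return out;
}
```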
LZMA and PAQ are inline: they compress contiguously and can adapt then and there, but BWT can't take the locality of contexts into account the way PAQ and LZMA can. The point of the prefix model is to track consecutive local symbol matches up to order-3 before initializing a string-wise delta coder; this lowers the entropy of local contexts and forces BWT to sort the 'new' data better, since local contexts become static differences spread throughout the block. By "fast adaptive modeling" I'm referring to inducing a better sort order on non-stationary data, not the entropy coder. It's possible to induce better sort orders with transformations, but I'm not sure how to model short (order-0 or order-1) local contexts.
The prefix model is like a generalized version of your 31.cpp if that helps explain any better.
I don't think it's a good idea.
What BWT does, is writing down max-order prefix context histories one by one.
Sure, some kinds of data are not compatible with that.
But what's the point in trying to force BWT use in these cases?
Can't you just use BWT SA as a matchfinding structure for LZ or something?
Do you know about Maniscalco's m03 btw?
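To illustrate the "prefix context histories" point: a naive BWT by rotation sort (O(n^2 log n), demo only) makes it visible that the output is literally the symbol preceding each sorted context:

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive BWT: sort all cyclic rotations of the input, then take each
// rotation's last byte -- i.e. the symbol that precedes each sorted
// max-order context. Demo only; real coders use a suffix array.
std::string bwt_naive(const std::string& s) {
    size_t n = s.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t i = 0; i < n; ++i) {
            char ca = s[(a + i) % n], cb = s[(b + i) % n];
            if (ca != cb) return ca < cb;
        }
        return false;  // equal rotations
    });
    std::string out(n, '\0');
    for (size_t i = 0; i < n; ++i)
        out[i] = s[(rot[i] + n - 1) % n];  // byte preceding the context
    return out;
}
```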
It's not forcing bwt to model the data, it's changing the data to make bwt more efficient, and it does help. But for very low orders I haven't got anything in place to handle it since it's never been done before and I'm just looking for some ideas.
1. You can try different contexts, like gathering bits 6-7 from a few previous bytes.
2. First, I don't understand when you're supposed to transform literals.
Because there can only exist up to 64k symbols with different order2 contexts,
and after that you're going to break 3-byte matches if you transform any.
Maybe just detect these areas in BWT output? That's why I suggested m03 actually.
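For (1), something like this (the bit choice and lookback depth are just examples):

```cpp
#include <cstdint>

// Sparse context: instead of whole previous bytes, keep only bits 6-7 of
// each of the last n_bytes bytes. The high bits roughly classify a byte
// (control / digits / lowercase / high range), so the context stays
// informative while using far fewer states (2 bits per position).
uint32_t sparse_ctx(const uint8_t* p, int n_bytes) {
    uint32_t ctx = 0;
    for (int i = 1; i <= n_bytes; ++i)        // look back n_bytes positions
        ctx = (ctx << 2) | ((p[-i] >> 6) & 3);
    return ctx;
}
```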
You can go into Jampack.cpp, function comp, comment out every stage except the local prefix model, and see what it does to the output. It uses a heuristic to find the strength of word boundaries before even trying to delta-encode matching suffixes of a leading prefix, so it automatically switches on and off depending on whether it benefits the situation. Like I said, it isn't a standard model: it's not trying to delta-encode every third-order context, it's a new idea for modeling complex strings that would normally hurt BWT.
I think I understand your idea; preprocessing before BWT was implemented in all popular BWT compressors anyway, including for structured data etc.
Just that at some level of complexity, preprocessing stops making sense.
For example, you can get exactly the same compression as BWT with PPM, while still processing the data sequentially.
Or you can use the BWT SA for LZ matchfinding, possibly during decoding too, if it's ROLZ or something.
But as to locality, one idea is to do small-block BWT, like on 64k blocks.
And then use a 2D postcoder, which would have access not only to one BWT output, but to multiple.
Kinda like a merge-sort.
It can even track contexts m03-style.
And another idea is obvious: BWT sorts symbols by context, so to preserve locality after sorting,
you have to make sure that info about it is a part of context.
Which means either making some ST-like BWT generalization (which I guess you don't like),
or transforming the data to make that context.
For example, we can expand bytes to nibbles, then use remaining bits to store some
24th June 2017, 00:24
Thank you Gonzalo!
Originally Posted by Gonzalo
Is it a crash or does it throw an error? The builds are compiled without a specific CPU target, so it was likely compiled for the i5-4690k (the CPU it was built on) and its supported instructions and higher; if you have an older CPU, that might be the cause of the errors. But then again, you mentioned that you also got it to work on some test files; would you mind posting the file that caused the crash?
24th June 2017, 02:58
I considered using ST since Hamid brought up that the patent actually expired years ago, but decoding ST is quite a bit slower than my current BWT decoder, so I'm sticking with BWT for speed, as well as the fact that BWT is usually stronger than ST. From the looks of it, ST usually does a little worse than BWT on binary data; I guess local low-order contexts aren't a big deal then. Maybe with a more aggressive LZ77 stage I can get it near the ratios of LZMA.
27th June 2017, 01:10
It was a bunch of crashes... Nevertheless, I couldn't reproduce them, so I guess it must have been an OS failure. After that, I successfully compressed literally thousands of files with Jampack, even those that had failed before. So, false alarm. It's happened before, so I'd better check my system.