Recent In Memory Benchmark with different file types: Text, Binary, Games, Logs, Silesia Corpus
All compressors with latest version. Oodle dll from a game.
Jarek (27th December 2016)
Many like me agree that lzturbo is amazing, years ahead of its times. It is a waste that it is nearly not used - it could take place of zstd if open sourced a few years ago.
However, I understand you want to earn from it and I completely agree that you deserve it ... but as we are well aware of, data compression is no longer a sexy topic (e.g. I didn't get a dime from ANS, cannot even find a collaboration in data compression), nearly nobody is interested, even less would pay for improving parameters a bit ... especially now with "free" and marketed zstd greatly rising the bar - making the life of e.g. Oodle much harder ... and your chances for directly selling lzturbo.
Corporations are much more about branding than performance (e.g. Brotli) - personally, the most realistic way I see for you to earn something from lzturbo, simultaneously making it widely used, is the Yann's way: hiring to a big known corporation and open-source it as their product - you could continue to do what you currently do but for nice money and some marketing, spotlight, help - e.g. Microsoft, Amazon, IBM, Samsung, Dropbox ... or Google to repair the Brotli disappointment.
What are your plans regarding lzturbo?
inikep (27th December 2016)
thanks for your nice words and your support.
have the majority of their harddisks unused.
This is why, LzTurbo has been designed for speed from the beginning.
First of all, actually there are no plans to release LzTurbo as open source or as a free library.What are your plans regarding lzturbo?
I'm not working constantly on LzTurbo, only if time permits. Last time, mostly tweaking for speed was done.
I'm not interested in a new job, but I'll think to contact some companies and look if they are interested.
Like you've already said, most companies are more interested in branding or market share.
Indeed one reason data compression is no longer sexy is that the dominant philosophy is just buy more disks, transmission lines etc ... I have talked with someone from a big company and he said that they nearly use no compression as it means paying a person focused on it, it means risks - that it's cheaper and simpler to just buy more disks ... there is no discussion.
But the more important reason is that people don't longer value the foundations, what is now interesting is machine learning - even if what they usually need is just linear regression, you have to write deep NN if you want to e.g. get a grant.
For example there is this recent great GST ( https://github.com/GammaUNC/GST ) - twice better compression for textures in nearly the same time - seems important revolution e.g. for games, VR, but there is zero interest ... so what can be seen interesting in data compression?
If you would like to sell lzturbo, it is not just a one time transaction - they would need you for support and further development ...
I think the main reason (or large part of it) is somewhat different.
Most of popular file formats (especially videos) already have built-in compression, so it doesn't matter what you try to use to compress them again.
But if we'd manage to significantly increase the compression gain to like 50% (eg. with recompression and dedup), it could suddenly become interesting.
The main problem with that is the fact that most of the storage is used by videos, and its hard to write video codecs even without recompression.
Sure, video compression is extremely important, and so all the big players have gathered into AOM to get a better and "free"(?) alternative for HEVC, which is rarely used due to high license and processing costs.
However, in the Big Data world, general purpose and other specialized compressors are also very important.
But the general issue is that the fashion for bit-scale thinking has passed a few decades ago - for modern computer scientists these are just black boxes - they rarely think what is happening inside, nor have to think about some optimality as you can just buy more disks/processors ... currently the most known algorithm is probably the "middle-out compression" ... as even HBO "Silicon Valley" didn't make people interested in data compression, it seems just hopeless.
Or one could think that ANS will bring some spotlight there as it suggests that we might still be missing some basic concepts ... but academia pretends that it doesn't exist ... people who have built their careers on papers about Huffman and AC.
Firstly, well done; it's an interesting comparison. Agreed with others that it'll be hard to get a non-open source tool to gain any sort of market penetration. However it's more likely that Hamid will be using compression *expertise* rather than formats to monetise the tools, and Lzturbo is a great advertisement in that regard.
There are huge sectors other than video that are relatively naive at present with regards to compression technologies; eg Bioinformatics. (Although I hope I've been pushing back on that naivity a bit wrt Bioinformatics.) Indeed the video coding experts out there persuaded MPEG to get involved in DNA sequencing compression, which is a little bit bizarre. http://mpeg.chiariglione.org/standar...ompression-and
Putting numbers to things; where I work (Wellcome Trust Sanger Institute) I've been told we have more spinning disk than CERN, perhaps in part because they have some extremely agressive data filtering techniques. Even so, there has been a lot of reluctance to adopt modern compression techniques unless they're also not noticably slower than their old counterparts (based on Deflate; now based on very simplistic modelling + rANS amongst other things). There's quite a bit of room for lossy compression of some data types too.
I think this is mostly just confused benchmarkers allowing a different sized decompression contexts for different decompressors, and a failure to understand what matters at a system level.
My experience is that Brotli usually compresses 5-10 % more than the other relatively fast decompressing algorithms when we are within the memory budget that is acceptable for browser use (somewhere around 512 kB to 4 MB). People like to compare algorithms with an unlimited budget -- a game that brotli just doesn't play, and which is not really realistic in a practical system with multiples of streams and possible adversaries.
In brotli's use case about 1-5 seconds are used for transmission and about 1 ms for decompression. It is not advantageous to transfer 5-10 % compression (50-500 ms) to 0.5 ms in decompression speed. In addition to this brotli is fully streamed, so the decompression is fully shadowed in a properly implemented streamed transmission system like Chrome.
Brotli achieves the increase in compression density by using the last two decoded bytes as context for the next byte. While this gives an increase in density, it creates a dependency of completing those bytes before next bytes can be decoded, i.e., reduces parallel computation and slows down the computation. Still, this is an advantageous compromise up to speeds of reading around 1000 mbps -- about thousand times more than the current internet speeds.
Last edited by Jyrki Alakuijala; 29th December 2016 at 03:33. Reason: gbps -> mbps
My apologies, I am just a theoretician - only summarizing benchmarks and user opinions I have seen, like the cited the Guardian article.
Unfortunately, I still haven't seen any containing both which would favor Brotli, but I would gladly see some?
This is intriguing - so LZTurbo beats all other implementations at decompression speed, by a significant amount, and yet the author refuses to sell or open it ?
My gut feeling is the speed comes through a combination of lots of hand tuned SIMD work to maximise speed and some through interleaving and interesting algorithms. It was quite clear as I worked on more aggressive interleaving and SIMD implementations of rANS I was leap-frogging PowTurbo's entropy encoder, which then got updates to leapfrog mine again. I suspect his initial table generation or encoding is faster which is why he always ended up on top again, but there is likely to be a lot of similarity in the ANS side. I don't know if he uses tANS or rANS either, but tANS is even more suitable for SIMD implementations than rANS and we haven't really seen the best it can get yet IMO. If I had more spare time I'd go for it as I'm curious just how fast 32 interleaved tANS streams can get.
I don't know much about the LZ side of things, but there are quite a few very fast public LZ codes out there to start from.
Thanks, James. Can you tell me a bit more about your implementation? My feeling is that adding interleaved ANS to LZ , and running on GPU, is going to crush all other implementations.
So, Ooodle is used primarily for games: would it not make sense to do decompression on GPU (less data to send over PCI bus) rather than CPU ? Of course, GPU is needed for the actual graphics, but
on AMD hardware, at least, it is possible to asynchronously carve out some compute resources for this sort of thing.
I think RAD is at least experimenting with GPU implementations.
But GPUs have simpler memory arch (less caches etc), so running LZ there is not very efficient.
GPUs have caches primarily to reduce bandwidth requirements, not for latency (unlike CPUs); a cache hit is on a GPU is usually not appreciably lower latency than a cache miss, and we're talking hundreds of cycles per memory access. That's not a great environment to run LZs on.
I implemented a GPU version of Selkie on the PS4 a while back. We're not shipping it. I got it down to about 1.25 clocks/decompressed byte on a single GCN compute unit under ideal circumstances (and with tuned GCN assembly code), _if_ you have 40 independent 64k blocks to decompress. That's comparable to Selkie's throughput on a CPU, but the "needs 40 independent blocks" part (required to hide the latency of memory ops in the Selkie decoder) is the clincher.
Because to rephrase it in other words, I can decode 40 independent 64k blocks in 4.8ms on a single GPU compute unit. However, if I only want to decode a single 64k block, that's barely any faster at all... around 4.4ms. And if I want any block to reference the contents of a previous block, that's now a serial dependency chain, the blocks need to be decoded in order, and it takes those 4.4ms for every 64k, which would give you a breezy 14.5MB/s worth of progress on a single stream with non-independent chunks, _if_ you could run the shader continuously, which you can't. (You need to yield the GPU to the system software periodically.)
Long story short, if you fill up the GPU on a regular (non-pro) PS4 with these tasks, that gives you a throughput of about 11GB/s, which is kind of meh to begin with - for comparison, filling up 7 CPU cores with that type of workload yields about 7GB/s on this HW, and the Jaguar cores are narrow (2-wide decode/retire) and have low clock rate (1.6GHz). But you also need 720 _completely independent_ 64k chunks to do so. Which you just never have.
And the shader is already kind of inefficient in absolute terms to make that number come out at "only" 720 streams. It could be faster in throughput terms if it was written to assume even more independent streams but you just never have that many completely independent decompression jobs at once, not even close.
FWIW, the reason I looked at GPU Selkie decode has nothing to do with PCIe bandwidth (not ever a limiter for our decompression workloads, that I've seen anyway). The reason for investigating GPU decode was that data intended for the GPU on console is generally in write-combined memory, which is uncached for reads. The single most common issue our customers run into on consoles is trying to run one of our decoders with the destination in WC memory, which completely tanks performance for any LZ. Would have been nice to have a GPU-side decoder for that kind of use case (when the data goes to the GPU anyway), but the combination of high latency, high memory requirements for buffering, "how do we expose this sanely?" API woes and requirement for the data to be relatively finely chunked (independent 128k chunks with two dependent 64k halves is still sort of acceptable latency, but that's still a tiny chunk size...) made us not pursue it further. (So far, anyway.)
Same thing with GPU rANS etc. We have an implementation but no actual interesting use case for it (so far).
One idea I had for GPUs was to use compressed window (entropy-coded mainly).
We'd have to use a lot of arithmetics to extract a string from arithmetic-coded window, but the caches could be used more efficiently, etc.
Second idea is something like Bulat's future-LZ. Its where we encode where to put a given string in the future data, instead
of using backward references. Random-access writes may be more efficient than reads.
Third idea is about code-generation. Normally LZ decoder has to deal with at least 3 memory locations - compressed data, match data, uncompressed data.
So instead of doing this sequentially, we can first read compressed data and generate the LZ copy routine (code) from it. Then run the generated code.
Not sure how easy is it to do this on GPU though.
Regarding my implementation - it started as fgiesen's implementation and got tweaked from there. My initial tweaks were 4-way interleaving instead of 2-way as that's what I already had in my 4-way static-frequency arithmetic coder. It gives a small gain over 2-way. I also added order-1 frequency tables which work OK for the data I was dealing with which has a limited size alphabet (so the table isn't huge) and block sizes of the order of ~1Mb. It may not be appropriate, at least in a static frequency table way, for other data types. It also had optimised code, particularly the renormalisation, to be considerably faster.
Much later though I experimented with using SIMD instructions and found it just too poor with only 4 interleaved encoders. I experimented with an aggressive 16 and 32 interleaving, which has high overheads per block but offers huge speed benefits when combined with SIMD instructions, in particular AVX2 and AVX-512. This was all still based on the original rANS implementation by Fabian.
The same logic though of interleaving many independent streams (multiple ANS states) and using SIMD to updated many simulatenously will work fine for tANS as well as rANS. The only reason I chose rANS over tANS was because I'd just been working on a byte-wise arithmetic coder with a range-coder style interface and I could understand how rANS worked more than tANS. With hindsight, a table driven rANS (which is what both Fabian's original and my derivations do) has much the same memory costs as tANS but is likely not as fast, particularly for the encoder. I haven't had time to explore SIMD tANS though.
boxerab (27th June 2017)
I think James' decoder for pre-AVX512 is very close to the SSE4.1 ryg_rans decoder I posted a few years ago (https://github.com/rygorous/ryg_rans...s_word_sse41.h), except with more interleaving (the example in main_simd is 8-way; it keeps two of these decoders around).
Haven't looked at the encoder closely. Decoder-wise, SIMD rANS and SIMD tANS end up _very_ similar (as long as you're interleaving sufficiently to be throughput rather than latency bound) because tANS still needs a separate bit output stage, and you basically end up implementing a power-of-2 rANS decoder for that part anyway. Either way, both are dominated by the memory logistics, how you do the state update isn't that big a deal.
One potential advantage of SIMD rANS is that you can do alias rANS (that was originally meant for a GPU implementation, where the small alphabet-size-dependent table can then be kept in thread group shared memory without hitting occupancy limits, a problem with larger tANS/rANS tables; even if you don't plan to run tons of instances at the same time, you still don't want to hog more GPU resources from concurrent work than strictly necessary). That said, alias rANS essentially forces your encoder to use a big table. It's fairly asymmetric that way, and probably not worth it if you also care about encode perf.
JamesB (28th June 2017)
My SIMD work though was largely experimental and not the current decoder we use in production, which has extra hardening steps against malicious input bit streams where order-1 symbol combinations are used that have zero frequencies or the original decoded size is simply a lie. These are real things you have to do unless you control the entire pipeline from end to end, but they also add a little bit more burden. I haven't got around to updating everything with those extra checks. (So just a warning!)
My current rANS related work has been non-SIMD again (for now), but wrapping up the encoder with automatic byte-to-bit packing for small alphabets and RLE encoding so a single codec does everything for ease. In theory it could be rewritten to do some of this on-the-fly during the decoder itself without changing the data format, but that's an optimisation for later.
Yeah, we spend a lot more time on hardening our decoders against corruption (which has derailed several promising ideas when we couldn't figure out how to make them safe) than we spend writing them, or tweaking the compression format.
http://lcamtuf.coredump.cx/afl/) which seems pretty decent. A suggestion though of my colleague was even when it passes, to take the queue directory from AFL output and pass all of those files (which have been selected to exercise as many code paths as possible) and pass all of that data through valgrind and the gcc/clang address-sanitizer to look for uninitialised memory accesses. This will then spot the cases where AFL doesn't identify a crash or a hang, but could potentially be leaking internal memory. Eg a malloc (rather than calloc) which hasn't been entirely filled out before returning it.
I think it may also be possible to make the address sanitizer it's warnings as reason for abort() and then use in conjunction with AFL so it automatically detects such cases, but it's far slower to do this.
@JamesB and @fgiesen you guys are wizards
We do fuzz testing (internal tool, pre-AFL), which is good for things like headers etc., but it's mostly a combination of proofs (primarily manual right now - not a big deal since they're usually short - but I've experimented a bit with Z3) and directed testing with adversarial test cases. The adversarial testing is important because the hard edge cases are indirect enough (and not easily discovered even with full path coverage!) that a fuzzer has very low chances of discovering them, especially when the loops in question are mostly branchless so the fuzzer has nearly nothing to latch onto to guide it towards more promising test cases. E.g. with separate streams, some of the cases to watch out for in fast decoder loops are when some of the sub-streams have very low entropy and others have very high entropy. That's one of the cases that's easy to find in a manual proof ("what happens if...") but really hard to discover by fuzzing.
JamesB (29th June 2017)