
Thread: The future of lossless data compression

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts

    The future of lossless data compression

    Recently I was wondering how lossless data compression has progressed in the last 10 years and what is the future of lossless data compression. My thoughts:

    1. General-purpose lossless compression.
    The old compressors: zip/gzip, bzip2, lzma are still widely used. We have new compressors: lz4, zstd and brotli but they aim to improve compression and decompression speed (lzma's compression ratio is still better). More powerful compressors based on PPM and CM (PAQ) are considered too slow and used mainly by enthusiasts. Is improvement of speed the only future of mainstream lossless general-purpose compression?

    2. Specialized lossless compression.
    Ten years ago we had lossless audio compressors (FLAC, WavPack), lossless image compressors (PNG, JPEG-LS, lossless JPEG-2000) and genomic data compressors. There has been significant progress in image compression (lossless WebP, lossless BPG, FLIF) and in genomic data compression. In practice, however, the amount of audio and images compressed losslessly is very small compared to lossy audio and image compression. The field of genomic data compression seems the most promising for the future. There are a lot of other custom file formats that could be compressed more densely with tailored compression algorithms, but these specialized algorithms will probably be created only at the request of specific people or companies.

    3. Lossless recompression.
    Ten years ago we had lossless recompressors for JPEG, MP3, ZLIB/GZIP/BZIP and OGG. Since then we have seen new recompressors for MPEG1/MPEG2, LZMA and GIF. But recompression has not hit the mainstream (except maybe WinZip and StuffIt). The main reason is that the improvement in compression ratio is too small compared to the recompression time. The future of recompression does not look promising.

  2. The Following 9 Users Say Thank You to inikep For This Useful Post:

    anormal (1st January 2018),anterus (12th February 2017),encode (6th February 2017),JamesB (6th February 2017),Jarek (6th February 2017),Matt Mahoney (6th July 2017),Randolf (6th February 2017),SolidComp (8th February 2017),yh.deng (17th October 2018)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > (lzma's compression ratio is still better).

    There are also plzma and the oodle codecs (LZA/LZNA are better than lzma for some game resources).
    Also, some ROLZ coders are still in use by repackers (rzm etc).

    BWT coders (which you didn't mention at all) are quietly used in some places,
    like zpaq or nanozip. Also libbsc is still interesting enough, and supports GPUs.
    Then there's a recent new version of msufsort.

    Anyway, imho there're still many ways to improve lzma (compression-wise).

    1. lzma parsing optimization is relatively good, but still far from perfect -
    only certain patterns like match-literal-rep_match are explicitly checked by the optimizer,
    while others may be missed.
    I think lzma really needs a matchfinder based on fuzzy matching - and explicit
    price estimation for these matches, instead of treating rep-matches as something independent.

    2. lzma already has 7 token types. Why not add some more useful ones? (A rough sketch of such an extended token set follows this list.)
    - delta matches, like in bsdiff
    - dictionary matches, where dictionary is not a part of the common window
    - support for multiple data windows (we can have different types of preprocessing in these)
    - multi-dimensional matches (for 2D+ and structured data)
    - LZ78 matches (by index (potentially multiple sets), implicit length)
    - LZP matches (order choice, explicit length)
    - ROLZ matches (contextual distance, explicit length)
    - PPM matches (no distance, unary coding of length, context tracking)
    - long-range matches (like from a rep preprocessor)
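    To make the list above concrete, here is a rough C++ sketch of what such an extended token set could look like as a tagged token structure. All names and fields are illustrative assumptions for this post, not part of lzma's actual format.
    Code:
    #include <cstdint>

    // Illustrative only: a tagged-token sketch of the extended token set listed above.
    enum class TokenKind : uint8_t {
        Literal, Match, RepMatch,      // roughly what lzma encodes today
        DeltaMatch,                    // bsdiff-style copy plus per-byte deltas
        DictMatch,                     // match into a dictionary outside the window
        WindowMatch,                   // match into an alternate (preprocessed) data window
        Match2D,                       // multi-dimensional match for structured data
        Lz78Match,                     // phrase index, implicit length
        LzpMatch,                      // order choice, explicit length
        RolzMatch,                     // contextual distance, explicit length
        PpmMatch,                      // no distance, context-tracked, coded length
        LongRangeMatch                 // rep-preprocessor style far match
    };

    struct Token {
        TokenKind kind;
        uint32_t  length   = 0;  // unused for Literal and Lz78Match (length is implicit)
        uint32_t  distance = 0;  // window offset, dictionary offset, phrase index,
                                 // or contextual distance, depending on kind
        uint16_t  aux      = 0;  // window id / dictionary id / LZP order / 2D row delta
        uint8_t   literal  = 0;  // only for Literal
    };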

    3. lzma2 added block support, but there's no proper optimization for it.
    For example, it's possible to change the lc/lp/pb options per block, but no compressor
    makes use of this (a brute-force search over these options is sketched below).
    More ideas could be taken from zstd (RLE blocks, separation of literals/matches).
    Some more can be added after that (rep/MTF/PPM/BWT/huffman blocks, a flag for not adding the block to the window).
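    A brute-force version of that per-block lc/lp/pb search could look like the sketch below. The compress_block callable is a hypothetical stand-in for whatever encoder entry point is available; only the lc+lp <= 4 constraint comes from lzma2 itself.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <functional>

    struct BlockProps { int lc, lp, pb; size_t size; };

    // Try a subset of the legal (lc, lp, pb) combinations on one block and keep the
    // cheapest. compress_block(data, n, lc, lp, pb) -> compressed size (hypothetical).
    BlockProps pick_block_props(
        const uint8_t* data, size_t n,
        const std::function<size_t(const uint8_t*, size_t, int, int, int)>& compress_block)
    {
        BlockProps best{3, 0, 2, SIZE_MAX};               // lzma defaults as the starting point
        for (int lc = 0; lc <= 4; ++lc)
            for (int lp = 0; lp <= 2; ++lp) {
                if (lc + lp > 4) continue;                // lzma2 requires lc + lp <= 4
                for (int pb = 0; pb <= 2; ++pb) {
                    size_t s = compress_block(data, n, lc, lp, pb);
                    if (s < best.size) best = {lc, lp, pb, s};
                }
            }
        return best;
    }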

    4. Secondary LZ. I mainly noticed it when using bsdiff+lzma in an app update system,
    but in some kinds of data there are visible patterns in literals/matches after the first LZ pass.
    (Bsdiff used delta-matches, so there were secondary matches within literal blocks too).
    For example, page tags in book1 are encoded by lzma as <match:len=4><literal><rep_match:len=2>,
    and only the literal changes there.


    > More powerful compressors based on PPM and CM (PAQ) are considered too slow and used mainly by enthusiasts.

    PPM/BWT can actually compress a few times faster than LZ, so they have their uses.
    Also, LZMA basically uses a bitwise CM for entropy coding.
    So I think it's more about the compression/speed tradeoff rather than CM/PPM being slow as such.
    For many kinds of data (eg. exes), lzma compression is actually better than PPMd/BWT/fast CM,
    so we just need PPM/BWT/CM codecs with better support for popular filetypes (ie not just enwik).

    In fact, we're having a hard time with actual use of PPMd in an archiver - not because of its speed,
    but because it needs specific order settings for specific files, ranging from o4 to o100+.
    When a single default order is used for everything, it gets easily beaten by lzma _in compression_.


    > Is improvement of speed the only future of mainstream lossless general-purpose compression?

    Hopefully not. Oodle codecs showed that it's possible to provide 10x the speed while mostly
    keeping up the ratio, and I don't see why the same can't apply to other codec types.

    At the very least, it's quite possible to apply the same vector/MT/branchless/out-of-order
    tricks to CM (compression only), because the encoder has access to all of the data.
    In other words, it should be possible to make a version of paq that would compress 100s
    of bytes in parallel, with 10x+ faster actual processing speed.
    It could still be useful for eg. backups; 1-2GB of memory usage is not that scary compared to
    lzma's 10.5*winsize.

    Also, it's not like estimating the probability for every bit of the original data
    is the only solution. Fast coders can still use CM inside, just
    for coding of preprocessed/transformed data, eg. LZ tokens or BWT run lengths (like in bsc).

    > But recompression does not hit the mainstream (maybe except WinZIP and StuffIt).

    Yes, winzip currently has a jpeg codec and packmp3, PA now has some too,
    dropbox develops lepton and that one h264 coder, etc.

    > The main reason is that the improvement in compression ratio is too small compared to recompression time.
    > The future of recompression does not look promising.

    I don't agree with this at all. Jpg/mp3 recompression speed can be pretty reasonable -
    maybe 5 MB/s per thread, and with MT processing available.
    Sure, it's far from the ~1 GB/s processing speed advertised by zstd/oodle authors,
    but let's also consider the real speed of widely available storage.

    Also there's my idea about steganography-based recompression of lossy formats,
    which could hopefully improve the recompression gain by another 10-20%.

  4. The Following 5 Users Say Thank You to Shelwien For This Useful Post:

    encode (6th February 2017),kassane (7th December 2017),Mike (6th February 2017),RamiroCruzo (6th February 2017),xinix (6th February 2017)

  5. #3
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    Sure, we have hundreds of other compressors, but they are used by a very small number of users. Let's look at compressors from a commercial point of view:
    A) open-source mainstream compressors (gzip, bzip2, 7z/xz/LZMA, ppmd, lz4, zstd, brotli) - free, well-tested and widely used
    B) closed-source compressors (ccm, nanozip, lzturbo) - not well-tested, sometimes better than open-source alternatives, but too risky to be widely used
    C) enthusiast compressors (paq, zpaq, cmix, lzham, glza and many more) - sometimes very promising, but they don't hit the mainstream for various reasons
    D) commercial compressors (Oodle, Ocarina) - not free, frequently specialized for given kind of data, used by companies that pay for it

    Most compressors on this forum fall into groups B or C. Their authors are hobbyists/enthusiasts who spend their free time creating or testing compression algorithms. But after 15 years spent in data compression I feel that I need my compressors to be widely used (group A) or to allow me to earn a living (group D). There were plenty of data compression experts who sadly left the field because they were in groups B or C.

    @Shelwien:
    LZMA can be improved, but keeping compatibility with the LZMA format is usually more important than a 10% ratio improvement. In my experience, a 25-30% ratio improvement is required by companies to justify a migration to a new format.

    I think the issue with PPMd is not selecting an optimal PPM order for specific files, but slow decompression times caused by the symmetric nature of the PPM algorithm.

    Moreover storage is cheap. Popularity of LZ4 shows that for many applications speed is more important than ratio.

  6. The Following 2 Users Say Thank You to inikep For This Useful Post:

    Jyrki Alakuijala (10th February 2017),khavish (20th August 2017)

  7. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > But after 15 years spent in data compression I feel that I need my
    > compressors to be widely used (group A) or allow me earn a living (group D).

    "Widely used" is normally unlikely, because compression applications are moving
    from user side to server side (mostly already moved).
    Also OS developers are not helping either - for example, win10 now has built-in
    LZX support in FS.
    But I still see at least one possible application for advanced compression -
    transferring files on the net. It can be just a proxy/VPN service with
    integrated compression, or a whole new filesharing protocol (like torrent) -
    internet speeds are not getting better with time.
    I think that recompression and anchor hashing (aka content-based segmentation)
    can improve filesharing efficiency by a lot, because atm there's no fileshare app
    that could merge sources from repacked file versions (for example, same video
    with different audio tracks or added subtitles).

    As to "earn a living" - I think recompression is most interesting in that sense,
    because there's no perfect solution even for already supported formats like jpeg
    (Stirner's coders are not streamable, can't detect formats and don't have built-in
    fallback for invalid files, also compression can be better; paq model works with streams,
    is relatively safe, and has detection, but is too slow).
    Anyway, there're many companies now that store petabytes of data, so it should
    be easily possible to turn this into a job. Of course, naive people who imagine
    selling patents and making millions can go dream further, but it would be certainly
    enough to "earn a living".

    Well, another relevant area is compression of game resources, but it's too hardware-dependent imho.
    Ie you'd need access to all the most recent hardware, like game consoles, smartphones, tablets etc.
    But, I guess, it should still be possible to make something like a fast recompressor for DXT textures
    and sell it to RAD, or something.

    The main problem here is that for commercial use you'd need carefully written libraries -
    clean, portable, MT-friendly etc - ie not something like paq code.


    > LZMA can be improved but keeping compatibility with LZMA format is usually
    > more important than 10% ratio improvement.

    I don't think that's true. Rar5 was implemented quite recently, for example.

    Also, I suspect that it's possible to squeeze these 10-30% out of lzma itself,
    by improving parsing optimization, lzma2 block optimization and encoder memory usage.

    > In my experience, 25-30% ratio improvement is required by companies to
    > justify a migration to a new format.

    It's actually easily reachable by adding recompression, dedup preprocessing,
    better preprocessing for some formats (eg .exe), etc.

    But anyway, individual users and small companies are doing quite well with just .zip.
    It's file hostings etc. where compression (including relatively slow compression) is relevant.


    > I think the issue with PPMd is not selecting an optimal PPM order for
    > specific files but slow decompression times because of symmetric
    > characteristics of PPM algorithm.

    No, I was talking from my experience here. I added ppmd_sh to the codec set
    and we were looking for ways to improve compression with it, like by assigning it
    to some file extension or something. But we still don't have an automated
    method to do this.

    As to speed - there are still plenty of users with 20-30 MB/s storage speed
    (cheap notebooks). Sure, developers here like gigabyte speeds these days,
    but network speeds and real sustained storage speeds are nothing like that.

  8. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    The difficulty to "earn for living" from data compression and the general "it is easier to just buy more disks" argument makes data compression field mainly enthusiasts only.
    There is a chance with compressors for specialized data ... but they usually already have hard to beat open-source:
    - genetic data - CRAM, part of SAMtools, the practically default library for working with DNA sequences ( https://en.wikipedia.org/wiki/SAMtools ),
    - textures - GST (~2x better compression than DXT at similar decompression speed): http://gamma.cs.unc.edu/GST
    - 3D mesh and point clouds - Google Draco: https://github.com/google/draco/

    Ps. LZNA is ~2.5x faster than LZMA - that seems a big enough difference to compensate for the lack of compatibility - it could get into category A if an open-source alternative existed ...
    https://encode.ru/threads/2206-new-c...ll=1#post44122 :
    Code:
       size     ratio%  C MB/s  D MB/s  (bold = Pareto; MB = 1,000,000)
       9069473  36.7    1.21    77.92   lzna -z7 (Optimal3)
       9154343  37.1    1.68    78.99   lzna -z6 (Optimal2)
       9207584  37.3    2.29    77.58   lzna -z5 (Optimal1)
       9329982  37.8    2.17    32.19   lzma high

  9. The Following User Says Thank You to Jarek For This Useful Post:

    inikep (6th February 2017)

  10. #6
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    The next factor we should take into account is application of data compression:
    1. Real-time data compression (network traffic, databases) requires very fast compression algorithms and is limited mainly to LZ77-type algorithms like LZ4, snappy, zip, brotli and zstd. We have seen huge progress in compression speed in the last years. I expect more progress here, that is: better speed or a better ratio at the same speed.
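    As an illustration of point 1, the real-time case usually reduces to a simple block API like the liblz4 one used in the C++17 sketch below (error handling and framing of the original size are left out; link with -llz4).
    Code:
    #include <lz4.h>          // liblz4 block API
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Compress one message/page; the caller must store the original size separately.
    std::vector<char> compress_block(const std::string& in)
    {
        int bound = LZ4_compressBound(static_cast<int>(in.size()));
        std::vector<char> out(bound);
        int n = LZ4_compress_default(in.data(), out.data(), static_cast<int>(in.size()), bound);
        if (n <= 0) throw std::runtime_error("LZ4 compression failed");
        out.resize(n);
        return out;
    }

    std::string decompress_block(const std::vector<char>& in, int original_size)
    {
        std::string out(original_size, '\0');
        int n = LZ4_decompress_safe(in.data(), out.data(), static_cast<int>(in.size()), original_size);
        if (n < 0) throw std::runtime_error("corrupt LZ4 block");
        out.resize(n);
        return out;
    }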

    2. Content distribution (static web content, linux packages, installers) allows slow compression but requires fast decompression, as content is compressed only once and decompressed many times. We can observe that XZ/lzma is taking over from gzip/bzip2 for linux packages. Brotli -11 was created for static web content. This is a place for improvements over LZMA, like plzma and LZNA.

    3. Backup/archiving (7zip, WinRAR, zpaq) tries to achieve the best ratio in an acceptable amount of time. Here we can use everything we have: specialized lossless compression and lossless recompression. Computers are getting faster, so it will become possible to use more complex algorithms, but I do not expect any breakthrough in ratio compared to general-purpose lossless compression. There will be progress, but in small steps.

  11. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Real-time data compression (network traffic, databases) requires very fast compression algorithms
    > and it is limited mainly to LZ77-type algorithms like LZ4, snappy, zip, brotli, zstd.

    1. It doesn't really apply to p2p and paid connections (like VPN).

    2. There's currently more free cpu than memory, because memory tech progress is relatively slow.
    So it's quite possible that for heavily-loaded servers CM (with a small window, but
    lots of mixing/extrapolation/interpolation) would actually be more efficient.
    The fact that fast LZ coders are currently used for this is not necessarily justified.

    > We have seen a huge progress in compression speed in the last years.

    I think it's more about intel cpu progress.
    These new optimizations don't really scale that well - on AMD cpus LZNA speed is the same as LZMA's
    (decoding too).

    > I expect more progress here that is: better speed or better ratio at the same speed.

    Vector optimizations are actually compatible only with very simple models.
    Sure, that vectorized nibble coder with linear counters turned out well,
    but even trying to use fsm counters instead of linear ones can break it.
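    For context, a "linear counter" here means an update that is a single branch-free arithmetic expression, identical for every counter, which is why a whole vector of them maps cleanly onto SIMD; a rough sketch (the scale and rate are illustrative values, not from any particular codec):
    Code:
    #include <cstdint>

    // Linear bit counter: pure arithmetic, no table lookup, so many counters can be
    // updated with the same SIMD instructions. An fsm counter instead needs a
    // per-counter state-table lookup (a gather), which breaks this kind of optimization.
    struct LinearCounter {
        int32_t p = 1 << 11;                 // P(bit == 1) in 12-bit fixed point

        void update(int bit, int rate = 5) {
            p += ((bit << 12) - p) >> rate;  // move p a fraction of the way toward 0 or 4096
        }
    };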

    > I do not expect any breakthrough in ratio in comparison to general-purpose
    > lossless compression.

    Well, I do expect some. In recompression, we don't have even one properly
    supported format. For example, we have ~8 jpeg codecs, but apparently I'd
    have to write another one to process data after reflate.
    And reflate itself still doesn't have level/winsize autodetection.

    Also imagine what h264/aac recompression with 20%+ gain would do? Or LZX?

    Practical WRT filters and disasm-based exe preprocessing could also do a lot.
    And some new multi-dictionary preprocessing seems possible:
    https://encode.ru/threads/?p=50025&pp=1
    If that's hard to understand: it allows compressing the fp.log file
    from http://www.maximumcompression.com/data/log.php
    to 332385 bytes with lzma, instead of the normal 703341.
    Note that the best listed 7z result is 515865, which is actually ppmd with an optimal order.

    And as to "universal" algorithms, I already listed quite a few for lzma improvement.
    But there're other potential breakthoughs too, like async entropy coding with
    secondary model, probability translation, steganography workarounds.

    As to CM, we still don't have a perfect mixer implementation, automated model
    optimization could do a lot better, and there's no good parametric fsm generator
    for counters. Generated context fsms are only used in bsc, I guess - lzma
    has something similar as "state", but it's probably made by hand.

    In short, we are very far from a perfect solution on all fronts.
    It's just that strong CMs became boring because of paq, and constructing fast CMs
    for structured data (including recompression) requires tools and know-how
    which can't simply be found on the net.

    Well, it's all good for me, I suppose :)

  12. #8
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Quote Originally Posted by Shelwien View Post
    Vector optimizations are actually compatible only with very simple models.
    Indeed, the basic application of vectorization requires using many independent streams, like the crazy 32 streams needed to get 1500 MB/s/core rANS decoding ( https://github.com/jkbonfield/rans_static ).
    However, in LZNA vectorization is also very beneficial:
    1) for adapting the probability distribution of all 16 symbols at once (a vector of CDFs),
    2) for the symbol search during decoding - finding s such that CDF[s] <= x < CDF[s+1] - where vectorization does all 16 comparisons at once (a sketch of this search follows below).
    https://github.com/powzix/kraken/blob/master/lzna.cpp
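    A minimal SSE2 sketch of that symbol search (point 2) for a 16-symbol alphabet: it assumes the CDF total fits in 14 bits so signed 16-bit compares are safe, and uses the GCC/Clang __builtin_popcount. This only illustrates the idea; it is not the actual LZNA/kraken code.
    Code:
    #include <cstdint>
    #include <emmintrin.h>   // SSE2

    // Find s such that cdf[s] <= x < cdf[s+1]; cdf has 17 entries with cdf[0] = 0 and
    // cdf[16] = total (assumed <= 1 << 14, so values fit in signed 16-bit lanes).
    static int find_symbol(const uint16_t cdf[17], uint16_t x)
    {
        __m128i vx = _mm_set1_epi16(static_cast<short>(x));
        __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cdf + 1)); // cdf[1..8]
        __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cdf + 9)); // cdf[9..16]
        // Lanes where cdf[i] > x become all-ones.
        int mask = _mm_movemask_epi8(_mm_cmpgt_epi16(lo, vx))
                 | (_mm_movemask_epi8(_mm_cmpgt_epi16(hi, vx)) << 16);
        int gt_count = __builtin_popcount(mask) / 2;  // each 16-bit lane contributes 2 mask bits
        return 16 - gt_count;                         // = number of entries cdf[1..16] that are <= x
    }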

    ps. A bit offtopic - I believe there is a coming revolution in image compression with GANs ( https://en.wikipedia.org/wiki/Genera...arial_networks ) - these NNs allow really smart upsampling of images, learned on a library ... see for example: https://arxiv.org/pdf/1609.04802.pdf
    For high resolution photos we don't need the details to be exactly as in the original photo, only to look realistic.

  13. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > However, in LZNA vectorization is also very beneficial

    That's what I was talking about. Now tell me how to vectorize logistic mixing of two vectors of fsm counters into a CDF for rANS decoding.
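    For reference, the scalar version of the mixing step being discussed looks roughly like the paq-style sketch below (binary case, illustrative constants). The pain points for SIMD are the per-counter table lookups that produce each p[i] and the nonlinear stretch/squash.
    Code:
    #include <cmath>
    #include <cstddef>

    static double stretch(double p) { return std::log(p / (1.0 - p)); }
    static double squash (double t) { return 1.0 / (1.0 + std::exp(-t)); }

    // Logistic mixing of n model probabilities, plus the usual online weight update
    // toward the observed bit (learning rate lr is an illustrative value).
    double mix_and_update(const double* p, double* w, std::size_t n, int bit, double lr = 0.02)
    {
        double t = 0.0;
        for (std::size_t i = 0; i < n; ++i) t += w[i] * stretch(p[i]);
        double pm  = squash(t);
        double err = bit - pm;                                  // gradient of the coding cost
        for (std::size_t i = 0; i < n; ++i) w[i] += lr * err * stretch(p[i]);
        return pm;
    }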

    > I believe there is coming revolution in image compression with GAN

    Interesting, but that's just a trick to avoid manually checking the generated images.
    In lossless compression, by contrast, we do have a single objective metric - compressed size.

    Well, I guess it could lead to a new image quality metric which would rate whether image details "look natural",
    rather than how similar to original image they are.

    Not sure that such a metric could become common though, even in lossy compression.

  14. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    1. General purpose lossless compression: agreed mostly, but it is promising that lz4, zstd and brotli are in there. Some of this is basic *clout* (Facebook & Google): it means people have faith that the format will still be supported in 10 years' time. This is the key difference between, say, zstd / brotli and, say, lzham. LZ4 gained traction without a significant company backing it, most probably because the step forward over previous efforts (LZO etc) was so significant that it's hard to dismiss.

    (Personally I've been impressed with the size vs speed tradeoff of bsc, so I feel it's a pity it hasn't got wider acceptance. It does very well for some genomic data too.)

    2. Specialized lossless compression: agreed too. That's really the field I'm in, with DNA sequencing data files. Such specialisations don't always lead to significant size reductions over the slower general-purpose tools, but typically they get there with vastly lower CPU time, as you don't need a complex NN or mixing algorithm if you already know the data characteristics. I'm sure the same is true of medical imaging, climate data, particle physics data sets, etc. I don't see these ending up in general-purpose tools though; the only specialist formats in general-purpose tools are likely to be the existing media-related ones, as most people have a lot of that data type.

    3. Lossless recompression.
    Right now deflate is ubiquitous, which is why so many recompressors target it. As lz4, or even low-compression-level zstd & brotli, gain traction, I expect we'll see them being used commonly in file formats, along with a corresponding raft of recompressors for them based on CM or PPM. However, such tools still very much fit in the hobbyist end of the market.


    A goal oriented view makes a lot of sense. Is this just something to improve bandwidth / transmission (so symmetric is good, low CPU cost is good)? Write once read many (lzma, brotli -11, etc)? Write once probably read never (archiving)? Or somewhere in-between such as a file format - saving space is good, but with practical save and load times.

  15. #11
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Quote Originally Posted by Shelwien View Post
    That's what I was talking about. Now tell me how to vectorize logistic mixing of two vectors of fsm counters into a CDF for rANS decoding.
    To get CM benefiting from vectorization, maybe it is worth considering a larger-than-binary alphabet, like the size-16 one in LZNA - you might be able to get a compression ratio between lzma and zpaq, at a decoding speed similar to lzma ...

    Quote Originally Posted by Shelwien View Post
    Well, I guess it could lead to a new image quality metric which would rate whether image details "look natural",
    rather than how similar to original image they are.
    Not sure that such a metric could become common though, even in lossy compression.
    Indeed - the main priority of image compression is to make the image look realistic and nice; maintaining pixel-level details is not that crucial, especially in 4K or 8K video.

  16. #12
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    7
    Thanks
    6
    Thanked 2 Times in 1 Post
    This is an interesting and inspiring discussion, thanks to all of you!

    For the moment, I don't have much to add here. But recently I gave an overview talk about general purpose lossless data compression at a Chaos Computer Club event. My talk also briefly addresses some of the above subtopics and is tagged with a couple of links. You can find my slides at

    https://www.researchgate.net/publica...d_Applications

  17. The Following 2 Users Say Thank You to Kai Sandfort For This Useful Post:

    Cyan (23rd June 2017),Shelwien (22nd June 2017)

  18. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Thanks for posting the link, it's an interesting presentation, though not very precise.
    For example, I don't see anything about recompression (= compression of already compressed data, like .docx files or pdfs or jpegs).
    Also, zstd wasn't developed by facebook.

  19. #14
    Member
    Join Date
    Dec 2014
    Location
    Berlin
    Posts
    29
    Thanks
    35
    Thanked 26 Times in 12 Posts
    Concerning deduplication (or matching as in the slides) there is a new sync/archive tool which makes interesting use of a rolling hash for content adressable parts. This allows to e.g. seed from your hard disk any kind of similar content when downloading an image:
    http://0pointer.net/blog/casync-a-to...em-images.html
    The "chunking" algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks.

  20. The Following User Says Thank You to pothos2 For This Useful Post:

    Kai Sandfort (23rd June 2017)

  21. #15
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    7
    Thanks
    6
    Thanked 2 Times in 1 Post
    Quote Originally Posted by Shelwien View Post
    Thanks for posting the link, it's an interesting presentation, though not very precise.
    For example, I don't see anything about recompression (= compression of already compressed data, like .docx files or pdfs or jpegs).
    Also, zstd wasn't developed by facebook.
    I didn't address recompression, right, and omitted a few other interesting things as well. It's a subjective selection for an overview talk. Anyway, I'm thinking of preparing a follow-up talk later on, to keep track of recent progress and go into further facets.

    Regarding Zstandard: I am and was aware that it was developed by Yann Collet - and I just see that "inikep" is a contributor, too. Yet, I decided to drop names of individuals (definitely, there are quite some whose ideas and works I appreciate on their very own) and mention a capturing affiliation, linking to http://facebook.github.io/zstd/. It's not very precise, granted. The exact origin is easy to trace back for the excited audience.

  22. #16
    Member
    Join Date
    Nov 2015
    Location
    Ślůnsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Quote Originally Posted by Kai Sandfort View Post
    Regarding Zstandard: I am and was aware that it was developed by Yann Collet - and I just see that "inikep" is a contributor, too. Yet, I decided to drop names of individuals (definitely, there are quite some whose ideas and works I appreciate on their very own) and mention a capturing affiliation, linking to http://facebook.github.io/zstd/. It's not very precise, granted. The exact origin is easy to trace back for the excited audience.
    I believe that it's better to omit some information than to post incorrect information. zstd was not developed by facebook.

  23. #17
    Member nikkho's Avatar
    Join Date
    Jul 2011
    Location
    Spain
    Posts
    542
    Thanks
    214
    Thanked 163 Times in 104 Posts
    Quote Originally Posted by inikep View Post
    3. Lossless recompression.
    Ten years ago we had lossless recompressors for JPEG, MP3, ZLIB/GZIP/BZIP and OGG. Since then we have seen new recompressors for MPEG1/MPEG2, LZMA and GIF. But recompression has not hit the mainstream (except maybe WinZip and StuffIt). The main reason is that the improvement in compression ratio is too small compared to the recompression time. The future of recompression does not look promising.
    Do not forget about PowerArchiver: https://encode.ru/threads/130-PowerA...ll=1#post52953

  24. #18
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    7
    Thanks
    6
    Thanked 2 Times in 1 Post
    zstd wasn't developed by facebook.
    Cyan, would

    "authored by Yann Collet (nowadays working at Facebook)"

    be OK as a correction in my slide on Zstd, adding the date on the first slide?

    By the way, thanks for your "Thank You".

  25. #19
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Shelwien View Post
    BWT coders (which you didn't mention at all) are quietly used in some places,
    like zpaq or nanozip. Also libbsc is still interesting enough, and supports GPUs.
    Then there's a recent new version of msufsort.
    For what it's worth, I plan on publishing a follow-up to my earlier M03 work this summer, once I return home from Bulgaria. That, and finishing up msufsort 4. I think the new work on M03 is going to open up a whole new world for BWT-style compression, including the ability to construct actual PPM-style adaptive statistical models for BWT using smaller block sizes. Hopefully I can keep BWT interesting for at least a few more years.

  26. The Following User Says Thank You to michael maniscalco For This Useful Post:

    Shelwien (27th June 2017)

  27. #20
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Sure @Kai !

  28. #21
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    7
    Thanks
    6
    Thanked 2 Times in 1 Post
    Quote Originally Posted by Kai Sandfort View Post
    This is an interesting and inspiring discussion, thanks to all of you!

    For the moment, I don't have much to add here. But recently I gave an overview talk about general purpose lossless data compression at a Chaos Computer Club event. My talk also briefly addresses some of the above subtopics and is tagged with a couple of links. You can find my slides at

    https://www.researchgate.net/publica...d_Applications
    I've put a minor update of my slides online, primarily fixing the Zstd authorship issue. See

    https://www.researchgate.net/publica...d_Applications
    Last edited by Kai Sandfort; 30th June 2017 at 03:34.

  29. #22
    Member
    Join Date
    May 2012
    Location
    Germany
    Posts
    39
    Thanks
    25
    Thanked 103 Times in 24 Posts
    If you want, you could also use the compression benchmarks I did some weeks ago with 7-Zip.
    They include RAM usage, timings and ratio for LZ4, LZ5, Lizard, Brotli, Zstandard + all compression types that 7-Zip has by default.

    The site with the diagrams is located here: https://github.com/mcmilk/7-Zip-Zstd#benchmarks

  30. #23
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    46
    Thanks
    14
    Thanked 11 Times in 7 Posts

    Question

    Quote Originally Posted by inikep View Post
    D) commercial compressors (Oodle, Ocarina) - not free, frequently specialized for given kind of data, used by companies that pay for it
    On radgametools.com/oodle.htm we read:
    "Oodle Mermaid offers mid-level compression (slightly better than zip) with insanely fast decodes - 7-12X faster than Zlib. Basically, near LZ4 speed but with zip-level compression."

    "Oodle Selkie offers lower compression ratios but the very crazy-fastest decodes, 1.5-2X faster than LZ4 but with slightly better compression (ratio falls between LZ4 and zip). Eliminate decode time - Oodle Selkie is the fastest decompressor in the world, close to memcpy speed!"

    How can this be?

  31. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Fast implementations are also possible for deflate, so it's mostly marketing hype -
    they're comparing memory-to-memory use only, and probably only against the standard zlib implementation.

    But there's been some recent progress with entropy coding (see FSE/tANS, zstd etc), vectorized EC implementations,
    plus new cpus have this feature called "out-of-order execution", which basically lets you run multiple threads of code in parallel
    within the same physical cpu thread. But you have to design the compressed format specifically to make use of it -
    it's much harder to apply the same optimization to already existing formats, though some implementations for deflate seem to exist.

  32. The Following User Says Thank You to Shelwien For This Useful Post:

    lz77 (5th July 2017)

  33. #25
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by lz77 View Post
    How can this be?
    It is practically impossible for a single-threaded lz77 to be close to or faster than memcpy.

    Selkie has the same decompressor as Mermaid. The optimization on the compressor side consists of selecting lz77 matches that have
    small (L1/L2 cache) offsets or a large match length. This can lead to a worse compression ratio, depending on the data files.
    For its compression ratio, Selkie's decompression can be considered very fast.

    In other words, fast decompression = L1/L2 cache optimization. There is no magic algorithm.
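    A rough illustration of such a compressor-side match filter (the thresholds are made up for this example, not Oodle's actual values):
    Code:
    #include <cstdint>

    struct Match { uint32_t offset; uint32_t length; };

    // Keep a candidate lz77 match only if its source is likely still in cache,
    // or if the match is long enough to amortize the cache miss.
    static bool keep_match(const Match& m,
                           uint32_t cache_friendly_offset = 256 * 1024,  // roughly L2-sized
                           uint32_t long_match_len = 64)
    {
        if (m.length < 4) return false;                       // too short to encode profitably
        if (m.offset <= cache_friendly_offset) return true;   // cheap: source likely cached
        return m.length >= long_match_len;                    // a far match must pay for the miss
    }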

    Blosc also claims to be faster than memcpy.
    In their benchmarks they simply compare an 8-thread blosc against a single-threaded memcpy!

    In my experiments the only compressors that can be faster than memcpy are run-length encoding
    and SIMD integer compression.


    CPU i5-6300HQ 2.30 GHz. pd3d.tar game data files from RAD Game Tools
    Code:
          C Size  ratio%     C MB/s  D MB/s   Name
        10778010    33.7       8.00     568   libdeflate 12                    
        11061382    34.6       6.38     319   zlib 9                           
        11463934    35.9       3.52    2428   oodle 119 (Selkie)                      
        14237227    44.6       3.72    3350   lzturbo 19                       
        14279732    44.7      46.72    2616   lz4 9                            
        31952900   100.0    6354.24    6489   memcpy
    Note that all oodle compressors are optimized for game data and can perform poorly on other data types. see this benchmark
    Last edited by dnd; 5th July 2017 at 22:30.

  34. The Following User Says Thank You to dnd For This Useful Post:

    lz77 (6th July 2017)

  35. #26
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    46
    Thanks
    14
    Thanked 11 Times in 7 Posts
    Maybe in the future Intel will introduce CPU instructions like lz77, huf, bwt, and data compression professionals will be left without work?

  36. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Intel kinda already has all that as part of their IPP library.
    There're also hardware deflate encoders etc.
    But there's no one perfect compression solution for all contexts, so nothing much would change even if they could do that.

    Well, we'd probably have a problem if intel (or anybody else) made a paq8 chip with 100MB/s speed.
    But it's the nature of compression that it's not easy to parallelize - there's a tradeoff: if we want
    better compression, then we have to use more information about already processed data,
    but such a dependency breaks parallelism.

    And aside from extra parallelism, there's nothing special in hardware implementations
    that we wouldn't have on x86, so there'd always be some work - more file formats to support etc.

    It would be really funny if we started getting 2nd-level recompression though.
    (Like recompression for winzip's jpeg recompression).

  37. #28
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    7
    Thanks
    6
    Thanked 2 Times in 1 Post
    Quote Originally Posted by milky View Post
    If you want, you could also use the compression benchmarks I did some weeks ago with 7-Zip.
    They include RAM usage, timings and ratio for LZ4, LZ5, Lizard, Brotli, Zstandard + all compression types that 7-Zip has by default.

    The site with the diagrams is located here: https://github.com/mcmilk/7-Zip-Zstd#benchmarks
    Sorry for my response being late. Yeah, I'll consider this for any following talk then. Thank you!

  38. #29
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by milky View Post
    The site with the diagrams is located here: https://github.com/mcmilk/7-Zip-Zstd#benchmarks
    Something is off -- brotli decoding of any quality shouldn't use more than ~17 MB. The data reports 40 MB here.

  39. #30
    Member
    Join Date
    Feb 2016
    Location
    USA
    Posts
    80
    Thanks
    30
    Thanked 8 Times in 8 Posts
    A really interesting thread!

    New to the compression world, working in the data storage domain, where compression/decompression speed is important and should be carefully traded against compression ratio too:

    1. In high-end storage systems the goal is to increase I/O throughput and reduce latency: a high compression ratio is welcome since it means less data to I/O; on the other hand, high speed is needed for low latency; it is hard to quantify the balance between compression ratio and speed.

    2. JPEG recompression is very interesting. Are there any better options than Lepton (which is derived from packJPG, and packJPG well deserves credit) in either speed or compression ratio? Any Lepton implementation for Android? JPEG/MPEG recompression will have a bright future if the speed is high enough for Android and it is integrated into mobile devices. Better yet, of course, would be a new format directly supported by cameras on mobile devices. As the quality would be exactly the same as current JPEG, many recompression operations could be saved, so the speed would be higher.

    Anyone more knowledgeable about lossless JPEG recompression is welcome to give an overview of the field, which is very interesting.

    Quote Originally Posted by inikep View Post
    Recently I was wondering how lossless data compression has progressed in the last 10 years and what is the future of lossless data compression. My thoughts:

    1. General-purpose lossless compression.
    The old compressors: zip/gzip, bzip2, lzma are still widely used. We have new compressors: lz4, zstd and brotli but they aim to improve compression and decompression speed (lzma's compression ratio is still better). More powerful compressors based on PPM and CM (PAQ) are considered too slow and used mainly by enthusiasts. Is improvement of speed the only future of mainstream lossless general-purpose compression?

    2. Specialized lossless compression.
    Ten years ago we had lossless audio compressors (FLAC, WavPack), lossless image compressors (PNG, JPEG-LS, lossless JPEG-2000) and genomic data compressors. There has been significant progress in image compression (lossless WebP, lossless BPG, FLIF) and in genomic data compression. In practice, however, the amount of audio and images compressed losslessly is very small compared to lossy audio and image compression. The field of genomic data compression seems the most promising for the future. There are a lot of other custom file formats that could be compressed more densely with tailored compression algorithms, but these specialized algorithms will probably be created only at the request of specific people or companies.

    3. Lossless recompression.
    Ten years ago we had lossless recompressors for JPEG, MP3, ZLIB/GZIP/BZIP and OGG. Since then we have seen new recompressors for MPEG1/MPEG2, LZMA and GIF. But recompression has not hit the mainstream (except maybe WinZip and StuffIt). The main reason is that the improvement in compression ratio is too small compared to the recompression time. The future of recompression does not look promising.


