Thread: LZ4 chunks

  1. #1
    Administrator Shelwien (Kharkov, Ukraine)

    LZ4 chunks

    Here's a problem: we have a large number of small files/chunks (potentially millions) and we want to compress them with lz4 -12.
    Working with individual files is slow, and in the end we want to produce a single file with the concatenated compressed chunks.
    So the idea was to pad each file to an appropriate block size (e.g. 1M), concatenate, and compress the result as a single stream
    with the corresponding block-size option (-B6?).
    I made a utility for that - http://nishi.dreamhosters.com/u/pad_v0.rar .
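    For illustration, here's a rough sketch of what the padding amounts to (this is not the actual pad_v0 source, just the gist of it):
    Code:
    /* sketch: pad each chunk with zeroes up to a fixed block size, then concatenate */
    #include <stdio.h>

    enum { BLOCK = 1 << 20 };                    /* 1M blocks (roughly -B6) */

    void append_padded(FILE* out, const unsigned char* chunk, size_t len) {
        static const unsigned char zeros[4096];  /* zero-initialized */
        fwrite(chunk, 1, len, out);
        size_t pad = (BLOCK - len % BLOCK) % BLOCK;
        while (pad) {
            size_t n = pad < sizeof(zeros) ? pad : sizeof(zeros);
            fwrite(zeros, 1, n, out);
            pad -= n;
        }
    }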

    Turns out, lz4 -12 is very slow at compressing zeroes - slower than unpacking the archive, compressing the individual files and then concatenating them.
    I also tried different paddings (all-random, an 8-byte repeating pattern), but they turned out even worse.

    Of course, one solution is to implement an explicit "archive" format and a batch LZ4 compressor for it.
    Problem is, this is all about recompression, so reusing an existing compressor is much less work,
    as there are all kinds of LZ4 versions with different outputs.

    Are there any options? Like an already existing batch compressor, or some settings like lzma's fb/mc for lz4,
    to let it compress zeroes faster?

  2. #2
    Member (Italy)
    > I also tried different paddings
    If it is slow on both well-compressible and badly-compressible data, try to stay in the middle; some more ideas:
    - Repeat the original file up to 1M.
    - Pad with data taken from other files.

  3. #3
    Administrator Shelwien (Kharkov, Ukraine)
    Unfortunately that won't work either.
    These paddings are pretty large, so we need a way to process them faster than normal data.

    Atm we have:
    1) 7z with integrated lz4; time=16s; but it lacks options and support for older LZ4 versions
    2) batch-lz4 which parses the archive and processes files with external lz4.exe via stdio; time=42s
    3) pad <input | lz4 >output; takes minutes, even more with random padding

  4. #4
    Member (Bothell, Washington, USA)
    Could you have a "header" with the size of each file, followed by the concatenated data?

  5. #5
    Administrator Shelwien (Kharkov, Ukraine)
    That's exactly what I did in (2).
    Problem is, there are many versions of LZ4 in use, with various subtle differences in matchfinder, APIs and stream formats.
    I can port one version, though it's not too easy because of the unfriendly APIs, but dealing with 10 different versions is too much.

    Btw, another fun feature is that LZ4 output depends on malloc behavior.
    It's always decodable, but not deterministic.
    I have to override malloc/free to use it for recompression.

  6. #6
    Member (France)
    Maybe some of these issues could be fixed directly in the library?

    I'm not aware of a speed issue when compressing blocks padded with zeroes at maximum level 12.
    I tested `/dev/zero` with `lz4 -12` and got a speed of ~12 GB/s.
    I then tested a mixture of `/dev/random` followed by `/dev/zero`, and compression speed drops to ~60 MB/s, which still feels reasonable for the max `-12` compression level.
    Maybe there are corner cases that this simple test failed to detect?

    I'm also not aware of malloc-dependent results.
    I believe `lz4` should already be protected against "uninitialized memory" effects (`malloc()` doesn't guarantee that the provided memory is filled with zeroes).
    A more subtle issue may happen when the memory address starts at a low position < 64K, but even that is expected to be fixed by now.
    Maybe this issue is present in an older library version?

  7. #7
    Administrator Shelwien (Kharkov, Ukraine)
    > Maybe some of these issues could be fixed directly in the library?

    Problem is, there are various games that use various versions of the LZ4 lib.
    Just yesterday I was asked to build LZ4 r131 to recompress a specific game.
    But if we used a different version to recompress the data, the diff would be large.

    > I then tested a mixture of `/dev/random` followed by `/dev/zero`, and
    > compression speed drops to ~60 MB/s, which still feels reasonable for the
    > max `-12` compression level.
    > Maybe there are corner cases that this simple test failed to detect?

    Just did a test like this:

    1) take enwik8
    2) take LZ4 1.8.3 exe from github
    3) lz4.exe -12 -B7 -BD enwik8 1 ==> 5.531s, 18,079,913 b/s
    4) pad.exe enwik8 1 "text xml:" 65536 ==> 813,116,306 bytes
    5) lz4.exe -12 -B7 -BD 1 2 ==> 15.812s, 51,424,001 b/s

    So it's 3x slower than plain encoding.
    I guess with 100 GBs of source data it adds up.

    > I'm also not aware of malloc-dependent results.

    I tested 1.8.3 and it seems to be stable.
    But 1.7.3, for example (exe from github):
    Code:
    >for /L %a in (1,1,100) do lz4 -12 FG_228584.tbx e.%a
    
    >md5sum e.*
    3741d328dbf7a1edefb8389c37afab57 *e.1
    2b279c0b1e0328585fe4611e9699a404 *e.2
    3741d328dbf7a1edefb8389c37afab57 *e.3
    3741d328dbf7a1edefb8389c37afab57 *e.4
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.5
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.6
    2b279c0b1e0328585fe4611e9699a404 *e.7
    2b279c0b1e0328585fe4611e9699a404 *e.8
    2b279c0b1e0328585fe4611e9699a404 *e.9
    2b279c0b1e0328585fe4611e9699a404 *e.10
    3741d328dbf7a1edefb8389c37afab57 *e.11
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.12
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.13
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.14
    3741d328dbf7a1edefb8389c37afab57 *e.15
    b16d18d2a09bfd5a25be8ee7cb0ff6aa *e.16
    2b279c0b1e0328585fe4611e9699a404 *e.17
    2b279c0b1e0328585fe4611e9699a404 *e.18
    3741d328dbf7a1edefb8389c37afab57 *e.19
    3741d328dbf7a1edefb8389c37afab57 *e.20
    
    >lz4.exe -d e.1 1
    e.1                  : decoded 22282424 bytes
    
    >lz4.exe -d e.2 2
    e.2                  : decoded 22282424 bytes
    
    >md5sum 1 2
    5aef760f32e5803e2621999dd6965fdb *1
    5aef760f32e5803e2621999dd6965fdb *2

    > I believe `lz4` should already be protected against "uninitialized memory" effects
    > (`malloc()` doesn't guarantee that the provided memory is filled with zeroes).

    gcc/mingw gets hit by hxxps://stackoverflow.com/questions/1232081/heap-randomization-in-windows
    and that somehow affected the LZ4 encoding results.

    I work around it by providing my own implementations of malloc/free with a fixed memory pool.
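    For reference, a minimal sketch of the fixed-pool idea (the names are made up, this isn't the actual hook code) - allocations always come from the same addresses with the same contents, so the encoder output becomes repeatable. Whether it gets wired in as a macro override in the source or as a link-time replacement of malloc/free depends on the LZ4 version.
    Code:
    /* sketch only: bump allocator over a static pool */
    #include <stddef.h>
    #include <string.h>

    static unsigned char g_pool[64 << 20];   /* static = fixed address, zero-filled */
    static size_t g_used;

    void pool_reset(void) { g_used = 0; memset(g_pool, 0, sizeof(g_pool)); }

    void* my_malloc(size_t n) {
        n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
        if (g_used + n > sizeof(g_pool)) return NULL;
        void* p = g_pool + g_used;
        g_used += n;
        return p;
    }

    void my_free(void* p) { (void)p; }       /* everything is released by pool_reset() */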

    > Maybe this issue is present in an older library version?

    Sure, but at least the API problem still remains - functions
    that lz4cli uses to compress files are only implemented for FILE*,
    so batch recompression of chunks is hard to do.

    I ended up hooking fopen/fclose/fread/fwrite and using lz4cli's main() as the API.
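    Roughly like this (a sketch with made-up names - the real hooks also need fseek/ftell/feof and error handling). lz4cli is rebuilt with its stdio calls redirected to in-memory buffers, then its main() is called once per chunk with the usual argv:
    Code:
    /* sketch: "files" are really the current chunk buffers */
    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned char* buf; size_t size, pos; } MemFile;
    static MemFile g_in, g_out;              /* set per chunk by the batch driver */

    static FILE* hook_fopen(const char* name, const char* mode) {
        (void)name;
        return (FILE*)(mode[0] == 'r' ? &g_in : &g_out);
    }
    static size_t hook_fread(void* dst, size_t sz, size_t n, FILE* f) {
        MemFile* m = (MemFile*)f;
        size_t want = sz * n, left = m->size - m->pos;
        if (want > left) want = left;
        memcpy(dst, m->buf + m->pos, want);  m->pos += want;
        return want / sz;                    /* number of complete items read */
    }
    static size_t hook_fwrite(const void* src, size_t sz, size_t n, FILE* f) {
        MemFile* m = (MemFile*)f;            /* assumes buf is large enough */
        memcpy(m->buf + m->pos, src, sz * n);  m->pos += sz * n;
        if (m->pos > m->size) m->size = m->pos;
        return n;
    }
    static int hook_fclose(FILE* f) { (void)f; return 0; }

    #define fopen  hook_fopen
    #define fread  hook_fread
    #define fwrite hook_fwrite
    #define fclose hook_fclose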

  8. #8
    Member (France)
    Code:
    3) lz4.exe -12 -B7 -BD enwik8 1 ==> 5.531s, 18,079,913 b/s
    4) pad.exe enwik8 1 "text xml:" 65536 ==> 813,116,306 bytes
    5) lz4.exe -12 -B7 -BD 1 2 ==> 15.812s, 51,424,001 b/s
    Do I read correctly that `pad.exe` transforms the 100M-byte enwik8 into an 813M-byte file?
    In which case, a logical consequence is that there is 8x more data to compress.
    Does it sound unreasonable that the compressor needs 3x more time for 8x more data?

    > the API problem still remains - functions
    > that lz4cli uses to compress files are only implemented for FILE*,
    > so batch recompression of chunks is hard to do.
    What kind of API would you consider more suitable for your use case?

  9. #9
    Administrator Shelwien (Kharkov, Ukraine)
    > Do I read correctly that `pad.exe` transforms the 100M-byte enwik8 into an 813M-byte file?

    Yes. I was wrong to use "-BD" there, but the idea was to pad the chunks
    and use lz4's already existing block mode to compress many individual chunks at once.

    > In which case, a logical consequence is that there is 8x more data to compress.
    > Does it sound unreasonable that the compressor needs 3x more time for 8x more data?

    I don't have the data myself, but they said the difference was
    15s using 7z-with-integrated-lz4 (wrong lz4 version though) with chunks in files,
    vs "minutes" using my pad utility, so at least 10x?

    I also agree that it's reasonable, but we needed a universal solution
    for lz4 recompression, one that didn't involve patching the source of every version
    to support the new API.

    > What kind of API would you consider more suitable for your use case?

    Something like this, I guess?
    https://github.com/Shelwien/ppmd_sh/...er/pmd.cpp#L48

    I know that LZ4 already has a streaming API,
    but I think it requires too much work on the user side,
    at least based on the LZ4IO_compressFilename_extRess() source.

    Although in any case it's not a solution for the batch chunk compression problem
    (something like non-solid archive compression),
    since I doubt you'd backport the new API to all existing LZ4 releases :).
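    For comparison, per-chunk buffer-to-buffer compression with the lz4frame API would look roughly like this (a sketch assuming a liblz4 that ships lz4frame.h; it still doesn't solve the multi-version problem):
    Code:
    /* sketch: compress one chunk into one LZ4 frame, memory to memory */
    #include <lz4frame.h>
    #include <stdlib.h>
    #include <string.h>

    size_t compress_chunk(const void* src, size_t srcSize, void** dstOut, int level) {
        LZ4F_preferences_t prefs;
        memset(&prefs, 0, sizeof(prefs));
        prefs.compressionLevel = level;              /* e.g. 12 */
        prefs.frameInfo.blockSizeID = LZ4F_max1MB;   /* roughly -B6 */

        size_t cap   = LZ4F_compressFrameBound(srcSize, &prefs);
        void*  dst   = malloc(cap);
        size_t csize = LZ4F_compressFrame(dst, cap, src, srcSize, &prefs);
        if (LZ4F_isError(csize)) { free(dst); return 0; }
        *dstOut = dst;                               /* caller appends it to the output file */
        return csize;
    }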

  10. #10
    Member (Cambridge, UK)
    I don't get what the aim is here.

    If your concatenation of chunks padded out with nuls is larger than the original input data, then why are you using lz4 anyway? It seems the padding is extreme!

    In the past, when I've had thousands of small files, I just used Unix tar along with a custom "hash_tar" command that indexed it and created a new tar entry holding a hash table of filename to offset, permitting some hackery to open foo/bar.tar/name and pretend the tar is simply a subdirectory. That's old legacy code, but a similar plan could work here: block the chunks up in a simple format, index it, and use LD_PRELOAD, FUSE or equivalent hacks to override the filesystem interface so that programs think they're still separate files.
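    For what it's worth, a minimal sketch of the "block them up and index" idea (the record layout here is made up, it's not hash_tar's format); the open() override then looks the name up in this table and serves reads from the right offset:
    Code:
    /* sketch: one record of the name->offset table appended to the packed file */
    #include <stdint.h>

    typedef struct {
        char     name[256];   /* original file name */
        uint64_t offset;      /* where its data starts inside the pack */
        uint64_t size;        /* its length */
    } IndexEntry;

    /* layout: [file data ...][IndexEntry x count][uint64 count][magic] */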

  11. #11
    Administrator Shelwien (Kharkov, Ukraine)
    > I don't get what the aim is here.

    Games use their own custom "archive" formats for resources.
    These are usually non-solid for obvious reasons,
    so after unpacking game containers we get millions of files
    (or just independently compressed chunks, if a stream detector
    is used rather than an unpacker for a specific container format).

    After that it's possible to compress the game much better,
    using solid mode and stronger codecs.
    But first we need to find a way to recompress the unpacked chunks
    to mostly the same compressed data
    ("mostly" is ok, because we can use diffs).

    > If your concatenation of chunks padded out with nuls is larger than the
    > original input data, then why are you using lz4 anyway?
    > It seems the padding is extreme!

    Yes, but zero padding doesn't add _too_ much, and in the end
    we just need to provide a file containing the specific compressed data
    to the diff utility - it doesn't really matter if it's a bit larger.

    And in any case, keeping chunks in files, compressing each file
    individually with lz4.exe, then concatenating, is apparently
    even worse than the pad | lz4 method.

    > In the past when I've had thousands of small files I just used Unix tar
    > along with a custom "hash_tar" command that indexed it and created a new
    > tar entry holding a hash table of filename to offset,

    That's interesting, but the question here is
    how to compress a non-solid archive with lz4, with support for all lz4 library versions
    (they have some API differences).

    We found a good pre-existing option in the form of a modified 7z with lz4 support,
    but it uses the lz4 dll and its API, so versions before 1.8 are not supported.

    When posting this thread, I just hoped that maybe there's some archiver with lz4 support
    that is old enough to support most library versions.

