
Thread: Some idea of multithreaded lzma without using blocks and fragmenting dictionary

  1. #1
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    33
    Thanks
    18
    Thanked 15 Times in 7 Posts

    Some idea of multithreaded lzma without using blocks and fragmenting dictionary

    So as you all know, today's typical way to do multithreaded compression is to divide the input data into blocks and feed them to parallel compression engines, where each engine has its own "core" with its own dictionary and related state. The engines are generally independent. One typical problem (among others) with this is that because data is fed to each engine in blocks, the next block a given core receives will not be contiguous with the previous one:

    [Attachment: blcklzma.png]
    ^ So while blocks 1 and 2 might still contain similar data that could benefit from the same dictionary, they are fed to different independent engines, and the next block core 1 sees is block 5, which may already contain very different data. Blocks 1 + 5 in sequence will therefore not benefit the way blocks 1 + 2 would.

    And that does not even count the data fragmentation problem: for example, 10 GB of data with a 64 MB block size means 160 parts, each passed to an lzma core in essentially the wrong order. I think you get the idea. How much this matters depends on the data itself, and of course tools like srep really ease this issue, but it is still not perfect, and for that reason many people use 1T compression regardless of the time such a task needs, while some compressors don't even support MT at all.

    So I was wondering whether a better idea would be to offload the fragmentation to file seeks and keep it at a constant level regardless of the input size:
    [Attachment: filemt.png]

    Ignore the global dictionary for a moment. First, I am aware that this method can't work with <stdio>. Secondly, the output would probably have to be written to separate files first and then joined into a single file upon completion - i.e. disk thrashing. Also, the input file would be read simultaneously at different offsets, but here I am confident it would not be an issue: even with 4+ cores doing freads, reading would always be quicker than lzma itself. Each part would also be read in small chunks as it is passed to lzma - there is no need to load a whole part at once, which is better than loading an entire block into memory (memory that could be better used for the dictionary).

    The idea is that the input size is known and is divided across the N cores used. In the 4T case that means core 1 is assigned only the first 1/4 of the file (up to the beginning of the 2nd part), core 2 handles 2/4 to 3/4, and so on. The benefit is that no matter how big the input is, the division stays constant - a 100 GB file can be fully multithreaded while still being split into no more than 4 contiguous parts (in the 4T case), which I am sure would be an extremely negligible loss compared to 1T.
    Another benefit is that most of the data would arrive contiguously, so compression quality should (at least in theory) be better, and a proper sorting mechanism would also help. The negatives are more input seeks (no big deal IMO), no <stdio>, and output-file thrashing when the final parts are joined.
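
    Here is a minimal sketch of the layout I have in mind, assuming a hypothetical feed_to_lzma_core() per worker (all names are made up, just to illustrate the split and the chunked reads):
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical per-core worker: it opens its own file handle, seeks to its part,
    // reads small chunks and feeds them to its own lzma instance, writing the
    // compressed output to a temporary per-part file (joined at the end).
    static void compress_range(const std::string& in_path, uint64_t offset, uint64_t length,
                               const std::string& tmp_out_path)
    {
        std::FILE* in  = std::fopen(in_path.c_str(), "rb");
        std::FILE* out = std::fopen(tmp_out_path.c_str(), "wb");
        if (!in || !out) return;                    // error handling omitted in this sketch
        std::fseek(in, (long)offset, SEEK_SET);     // real code needs a 64-bit seek here

        std::vector<uint8_t> buf(1 << 20);          // small chunks only, never a whole part
        uint64_t left = length;
        while (left > 0) {
            size_t want = (size_t)std::min<uint64_t>(buf.size(), left);
            size_t got  = std::fread(buf.data(), 1, want, in);
            if (got == 0) break;
            // feed_to_lzma_core(buf.data(), got, out);  // per-core encoder, own dictionary
            left -= got;
        }
        std::fclose(in);
        std::fclose(out);
    }

    // Split the input into N contiguous parts, one per core, then join the temp files.
    void compress_file_mt(const std::string& in_path, uint64_t file_size, unsigned cores)
    {
        std::vector<std::thread> workers;
        const uint64_t part = file_size / cores;
        for (unsigned i = 0; i < cores; ++i) {
            uint64_t off = (uint64_t)i * part;
            uint64_t len = (i == cores - 1) ? file_size - off : part;  // last part takes the remainder
            workers.emplace_back(compress_range, in_path, off, len,
                                 in_path + ".part" + std::to_string(i));
        }
        for (std::thread& t : workers) t.join();
        // ...then concatenate the .partN files into the final output (that is the disk thrashing part).
    }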

    I was wondering whether this has been discussed before and what you guys think about such an approach. Would it be worth it or not?

    Finally, I want to touch on the global dictionary, but here I know much less, so it is just an idea. If it could work, though, it would obviously be a huge deal, but I am aware of at least some issues. The idea is of course that because the dictionary would be single and shared, it could be significantly bigger and all cores could benefit from each other's match finds. That could theoretically be fantastic. However, write access would obviously have to be mutexed, and I suspect the MT costs are too great, although my theory is that most of the time the dictionary is read (and the whole of it at that), which on big dictionaries means a lot of reads with only occasional writes. Reads are not a problem for MT, so only the writes would be.
    One idea I had was to give each core a small exclusive buffer (say a few KB), and at certain intervals some sort of "central lock" would trigger and the results would be synchronized into the big dictionary for every core to pick up. Sort of like how the CPU L2->L3 cache hierarchy works.
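
    As a very rough sketch of that L2/L3 idea (a plain hash table standing in for the real dictionary; note that a real implementation would also have to make the concurrent reads safe, which this glosses over):
    Code:
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Shared "dictionary": hash of the next few bytes -> position in the input.
    // Writes are batched per core and merged in under a lock, so the lock is
    // taken only rarely.
    struct SharedDict {
        std::unordered_map<uint32_t, uint64_t> table;
        std::mutex write_lock;
    };

    // Per-core staging buffer, playing roughly the role of an L2 cache in front
    // of the shared L3 (the big dictionary).
    struct LocalBuffer {
        std::vector<std::pair<uint32_t, uint64_t>> pending;

        void add(uint32_t hash, uint64_t pos) { pending.emplace_back(hash, pos); }

        // The "central lock" moment: merge the local entries into the shared dictionary.
        void flush(SharedDict& dict) {
            std::lock_guard<std::mutex> g(dict.write_lock);
            for (const auto& e : pending)
                dict.table[e.first] = e.second;
            pending.clear();
        }
    };

    Each core would call flush() every few KB of input, so most of the time it only reads the shared table and appends to its own local buffer.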

    I also saw somewhere on the forum that Bulat designed what is called the "HT4" match finder, which seems to scale very well with memory (~1.6x) compared to BT4. Imagine a single huge dictionary (say 8-10 GB on 16 GB systems) that could potentially render srep obsolete in many cases. That would also make the disk thrashing less of a problem, since srep needs to do full passes anyway.

    Anyway, just thinking out loud - you guys know this stuff better than me, so I wonder what you have to say.
    regards

  2. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Well, parallel LZ has existed for a while; LZHAM has a parallel match finder, and I also wrote one last year, thread: https://encode.ru/threads/2921-clz-C...ll=1#post56308

    LZHAM builds multiple trees in parallel while CLZ builds 1 tree in parallel.

    That CLZ codec led me to make a generalized threading system which can make almost anything run in parallel - hash tables, search engines - with any number of threads, though I haven't had any time to document the algorithm.

    Locks, blocking and merging are quite a naive approach to parallelism; however, in most cases they are significantly faster because they are more cache friendly. Radyx is a good example of fast parallel LZMA compression, though it must simulate a sliding window with overlapped chunks, which hurts compression a little bit.

    There's always going to be pros and cons to each method, sacrificing compression ratio for speed or vice versa.

  3. The Following User Says Thank You to Lucas For This Useful Post:

    elit (15th February 2019)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > could potentially render srep obsolete for many cases

    1) The lzma entropy model only supports distances up to 4G.
    2) A dedup matchfinder is completely different from a normal LZ mf.
    There are more efficient algorithms for finding longer strings,
    and there's no point in allowing long-distance matches with short lengths,
    since the match code would be pretty long too.
    Of course, it's possible to implement a dedup mf with output to lzma format
    (like they did in zstd), but there's little point, since there are
    modern ANS-based LZ codecs with 10x faster decoding than lzma.

    I can suggest a possible solid-stream MT scheme with unmodified lzma2 though.
    We can simply start compressing the whole stream with multiple lzma2 instances,
    but only set one to strong compression, while running the others with the fastest
    settings and discarding their output (that's easy since lzma2 has block headers).
    Then, once some instance reaches the start of the 2nd block, we can start
    storing its output too (and change its settings to stronger compression), and so on.
    Of course, there'd be a significant overlap, but in theory this scheme
    should compress the whole stream faster than a single instance and without
    any compression loss.
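
    Roughly like this, as a sketch (Lzma2Encoder below is a stand-in wrapper with stub bodies, not the actual SDK API, and the stream is statically pre-cut into one part per instance for simplicity - the real scheme would switch over dynamically at lzma2 block boundaries):
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Stand-in for one lzma2 encoder instance; the bodies are stubs so the sketch
    // compiles - a real implementation would wrap the actual codec.
    struct Lzma2Encoder {
        bool strong = false;
        void set_fast()   { strong = false; }   // fastest settings, output discarded
        void set_strong() { strong = true; }    // strong settings, output stored
        std::vector<uint8_t> encode(const std::vector<uint8_t>& part) { return part; }
    };

    // Instance i runs over parts 0..i-1 with the fastest settings and throws that
    // output away (it only serves to fill the dictionary), then switches to strong
    // settings at the start of part i and keeps the output from there.
    std::vector<std::vector<uint8_t>> solid_mt(const std::vector<std::vector<uint8_t>>& parts)
    {
        const size_t n = parts.size();
        std::vector<std::vector<uint8_t>> out(n);
        std::vector<std::thread> threads;
        for (size_t i = 0; i < n; ++i) {
            threads.emplace_back([&parts, &out, i]() {
                Lzma2Encoder enc;
                enc.set_fast();
                for (size_t p = 0; p < i; ++p)
                    enc.encode(parts[p]);        // discarded: dictionary warm-up only
                enc.set_strong();
                out[i] = enc.encode(parts[i]);   // stored: lzma2 chunk headers make the
                                                 // per-instance outputs safe to concatenate
            });
        }
        for (std::thread& t : threads) t.join();
        return out;                              // concatenate out[0..n-1] for the archive
    }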

    Of course, a better approach would be to run the matchfinder separately
    from the parsing optimizer and entropy coder, but that's hard to do
    with the original lzma source.

    Anyway, I think it's better to write a specific task spec first.
    What exactly do you want to create?
    Why do you care about compression speed at all?
    If it's repacks or official game containers, lzma encoding speed
    won't matter there at all.
    Well, encoding speed does matter in repacks actually -
    in recompression schemes, when restoring the original game's compression -
    but that'd be encoding of lots of small chunks and mostly not lzma.

    Depending on the task, the solution can be completely different though.
    Like whether you can use temp files (and how large), whether we have
    random access to the file data or can only process streams, etc.
    Or maybe the actual question is how to implement solid MT decoding
    rather than encoding.

  5. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    elit (15th February 2019),Mike (15th February 2019)

  6. #4
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    14
    Thanks
    6
    Thanked 3 Times in 2 Posts
    Running the match finder in a separate thread is already implemented in the original lzma code.
    LzmaEnc.c: p->numThreads
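
    For reference, it is controlled through the encoder properties; a minimal setup (assuming the standard LZMA SDK headers) looks roughly like this:
    Code:
    #include "LzmaEnc.h"   // LZMA SDK

    // Minimal encoder property setup; numThreads = 2 lets the match finder
    // run in its own thread inside the encoder.
    void setup_lzma_props(CLzmaEncProps* props)
    {
        LzmaEncProps_Init(props);
        props->level      = 9;
        props->dictSize   = 1u << 26;   // 64 MB dictionary
        props->numThreads = 2;          // 1 = single-threaded, 2 = MT match finder
        // ...then pass 'props' to LzmaEnc_SetProps() / LzmaEncode() as usual.
    }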

  7. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, it's even better if you can implement a "dictionary mode" for lzma, where it would index some data
    but only start actually coding after a given offset.
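
    Conceptually something like this (illustrative stubs only, not the real encoder structure):
    Code:
    #include <cstddef>
    #include <cstdint>

    // Toy stand-ins for the two halves of an LZ encoder.
    struct MatchFinder  { void insert(const uint8_t* data, size_t pos) { /* index this position */ } };
    struct RangeEncoder { void code(const uint8_t* data, size_t pos)   { /* emit compressed output */ } };

    // "Dictionary mode": walk the whole buffer and index every position, but only
    // start producing output at start_offset; everything before that offset exists
    // purely as match-finder history.
    void encode_with_preset_window(const uint8_t* data, size_t size, size_t start_offset)
    {
        MatchFinder  mf;
        RangeEncoder rc;
        for (size_t pos = 0; pos < size; ++pos) {
            mf.insert(data, pos);          // always index
            if (pos >= start_offset)
                rc.code(data, pos);        // only code from the given offset onward
        }
    }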

  8. #6
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Brotli averages +0.6 % more bytes but decodes 5x faster on huge files. On normal files of less than a megabyte, brotli compresses more since it has less need to warm up its statistics. Why bother with lzma at all?

  9. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Because brotli encoding is 10x slower? https://sites.google.com/site/powtur...sion-benchmark
    And I don't think it implements solid MT compression anyway (which is the topic here).

    As to lzma decoding speed, even without switching to ANS, I think it can be improved a lot.
    I'm not sure whether that's possible while keeping compatibility, but it's just a matter of rewriting
    the decoder in a branchless/interleaved style.
    Decoding in modern LZ codecs didn't become this fast because of some great new invention
    (ANS is cool, but it's just as well possible to vectorize a range coder).
    It's the result of recent (i7+) CPU development and of optimizations for these new CPUs -
    on CPUs without speculative execution the lzma/brotli decoding speed difference is much smaller.

  10. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    encode (15th February 2019),joerg (15th February 2019)

  11. #8
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    33
    Thanks
    18
    Thanked 15 Times in 7 Posts
    Ok, having read the replies (which I admit are all interesting and certainly have their uses), I think that for avoiding input fragmentation and random (or wrong) block order, the simple input-file split per core still seems the least complicated and easiest to implement, even with the higher IO cost.

    As for the shared dictionary, Lucas's CLZ looks closest to it, but if the efficiency is that bad (16T for ~2.4x gain is definitely not good IMO) then that's not an option. Shelwien, your MT idea is very interesting and I like it, but it requires better understanding and a more thorough implementation than I can manage ATM.
    Thanks for explaining the lzma limits, btw. It looks like it may actually be fine to have independent lzma cores, each with its own dictionary, as long as the file is split into as few parts as possible. So for, say, a 10 GB input and 4T - each core processing 2.5 GB - the result really should come close to a 1T continuous stream. At least compared to the number of typical 64 MB blocks (which will almost certainly be processed in the wrong order on top of that, rendering a sorting mechanism almost useless), there are only 4 places where data cannot rely on what came before it.

    The question is: is it worth it (losing <stdio>), and would more orderly, continuous compression gain enough over a random block order? If yes, I may consider implementing it in xnlz one day, but I need more data/tests. And yes, big games are my main concern.

  12. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > is it worth it (to lose <stdio>)

    There are different cases:
    1) You're making a portable library which can be used in a random
    context on a random platform.
    Then it's better not to rely on temp files and other resources that
    may not be available.
    2) It's just an API requirement, e.g. you want to compress
    the output of your algorithm with "7z a -si -txz".
    Then you can use memory caches or temp files for out-of-order data
    and do whatever is necessary for interface compatibility.
    Whether it's useful depends on the actual performance of your implementation.

    > And yes, big games are my main concern.

    Then you're doing it wrong.
    1) Most game resources are already compressed,
    so the output sizes of srep and an imaginary lzma -d64g won't be much different.
    2) Recompression is very important, but it's better to treat its output
    as a set of files rather than a single large file/stream.
    Resource compression in games usually works with small chunks,
    and there's usually a significant number of duplicate chunks.
    So dedup at the chunk level works more efficiently than global srep (see the sketch after this list).
    3) LZ is not the best fit for the most common file formats in game resources,
    e.g. DDS and 3D models. It's better to design something more specialized,
    at least preprocessors.
    4) Diff/patching is usually required when recompression is used... i.e. always.
    E.g. a game uses lzma for resource compression. We can unpack it, and we found
    the right version of the encoder and the settings to reproduce the compression... mostly.
    The EOF coding is different, so there are 3-4 bytes per chunk which don't match.
    Do we modify the lzma encoder to match the game?
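
    To illustrate point 2, bare-bones chunk-level dedup looks something like this (fixed-size chunks and exact matching to keep the sketch simple; a real tool would use content-defined chunk boundaries and strong hashes):
    Code:
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Split the input into fixed-size chunks, keep the first occurrence of each
    // distinct chunk, and replace repeats with a reference to the stored copy.
    struct DedupResult {
        std::vector<std::string> unique_chunks;  // each distinct chunk stored once
        std::vector<size_t>      sequence;       // per input chunk: index into unique_chunks
    };

    DedupResult dedup(const std::string& data, size_t chunk_size)
    {
        DedupResult r;
        std::unordered_map<std::string, size_t> seen;  // chunk content -> index
        for (size_t off = 0; off < data.size(); off += chunk_size) {
            std::string chunk = data.substr(off, chunk_size);
            auto it = seen.find(chunk);
            if (it == seen.end()) {
                seen.emplace(chunk, r.unique_chunks.size());
                r.sequence.push_back(r.unique_chunks.size());
                r.unique_chunks.push_back(std::move(chunk));
            } else {
                r.sequence.push_back(it->second);       // duplicate chunk: just a reference
            }
        }
        return r;
    }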

  13. The Following User Says Thank You to Shelwien For This Useful Post:

    elit (16th February 2019)

  14. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Shelwien View Post
    Because brotli encoding is 10x slower?
    When your encoding is speed critical but you still want it rather dense, use brotli 10d29. It is perhaps another 0.5 % worse. There are also many other possible improvements in encoding speed for large files. I don't think there are any format-related reasons why it should be slower than lzma.

