So as you all know, the typical way to do multithreaded compression today is to divide the input data into blocks and feed them to parallel compression engines, each with its own "core", dictionary and all that. They are generally independent. One typical problem (among others) with this is that because data is fed to each engine in blocks, the next block a core receives will not be contiguous with its previous one:
^ So while blocks 1 and 2 might still contain similar data that could benefit from the same dictionary, they get fed to different independent engines, and the next block core1 sees after block 1 is block 5, which may already contain very different data. So blocks 1 + 5 in sequence will not benefit the way blocks 1 + 2 would.
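The dispatch above can be sketched like this (a minimal illustration assuming simple round-robin scheduling; real codecs may queue blocks differently):

```c
/* Round-robin dispatch: block i goes to core (i % num_cores).
 * With 4 cores, core 0 sees blocks 0, 4, 8, ... and never two
 * adjacent blocks, which is exactly the discontinuity problem. */
static int core_for_block(int block, int num_cores) {
    return block % num_cores;
}

/* The next block a core receives after `block` is num_cores away. */
static int next_block_for_core(int block, int num_cores) {
    return block + num_cores;
}
```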
That doesn't even count the data fragmentation problem: for example, 10 GB of data with a block size of 64 MB means 160 parts, each passed to an lzma core in completely the wrong order. I think you get the idea. How much this matters depends on the data itself, and of course things like srep really ease this issue, but it is still not perfect, which is why many will use 1T compression regardless of the time such a task needs - and some compressors don't even support MT.
So I was thinking whether a better idea would be to offload the fragmentation to file seeks and keep it a constant value regardless of input size:
Ignore the global dictionary for a moment. First, I am aware that this method can't work with <stdio>. Secondly, the output would likely have to be written to separate files first and then joined into a single file on completion. Aka disk thrashing. Also, the input file would be read simultaneously at different offsets, but here I am confident it would not be an issue, because even with 4+ cores doing freads, the reads would always be quicker than lzma itself. Also, each part would be read in small chunks as needed, as it is passed to lzma - no need to load the whole part at once, which is better than loading a whole block into memory (memory that could be better used for the dictionary).
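A rough sketch of such a worker's read loop (the names, the 64 KB chunk size, and feed_to_lzma() are hypothetical; real code would want fseeko/64-bit offsets):

```c
#include <stdio.h>

/* Each worker seeks to its part's start offset and pulls small chunks
 * on demand, so only CHUNK bytes sit in memory at a time.
 * Returns the number of bytes actually streamed. */
enum { CHUNK = 64 * 1024 };

static long long stream_part(FILE *f, long long start, long long end) {
    unsigned char buf[CHUNK];
    long long done = 0;
    if (fseek(f, (long)start, SEEK_SET) != 0)
        return -1;
    while (done < end - start) {
        long long left = end - start - done;
        size_t want = left < CHUNK ? (size_t)left : (size_t)CHUNK;
        size_t got = fread(buf, 1, want, f);
        if (got == 0)
            break;  /* EOF or read error */
        /* feed_to_lzma(buf, got);  -- hand the chunk to the encoder here */
        done += (long long)got;
    }
    return done;
}
```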
The idea is that the input size is known and is divided across the N cores used. In the 4T case that means core1 is assigned to process only the first 1/4 of the file (up to the beginning of the 2nd part), core2 from 1/4 to 2/4, and so on. The benefit is that no matter how big the input is, the division is always constant - a 100 GB file can be fully multithreaded while still being split into only 4 fragments (in the 4T case), which I am sure would be an extremely negligible loss compared to 1T.
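The division could look something like this (a sketch; part_t and part_for_core are made-up names):

```c
/* Constant division: core k of n gets the byte range
 * [size*k/n, size*(k+1)/n), so there are always exactly n parts
 * no matter how large the input is. */
typedef struct { long long start, end; } part_t;

static part_t part_for_core(long long file_size, int k, int n) {
    part_t p;
    p.start = file_size * k / n;
    p.end   = file_size * (k + 1) / n;
    return p;
}
```

For a 100 GB file and 4 threads this always yields exactly 4 parts with 3 interior seams, however large the input grows, and sizes that don't divide evenly still tile the whole file.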
Another benefit is that most of the data would arrive contiguously, so the quality of compression should (at least in theory) be better, further helped by a proper sorting mechanism. The negatives are more input seeks (no big deal IMO), no <stdio>, and disk thrashing when the final parts are joined.
I was wondering whether this has been discussed before and what you guys think about such an approach. Would it be worth it or not?
Finally, I want to touch on the global dictionary, but here I know much less, so it's just an idea. If it could work, though, it would obviously be a huge deal, but I am aware of at least some issues. The idea is of course that because the dictionary would be single and shared, it could be significantly bigger, and all cores could benefit from each other's "match finds". This could theoretically be fantastic. However, write access would obviously have to be "mutexed", and I suspect the costs of MT here are too great - although my theory is that most of the time the dictionary is read (and the whole of it, on top of that), which on big dictionaries means a lot of reads with only occasional writes. And reads are not a problem for MT, so only the writes would be.
One idea I had is that there could be a small buffer (say a few KB) exclusive to each core, and at certain intervals some sort of "central lock" would trigger and the results would be synchronized into the big dictionary for every core to pick up. Maybe sort of like how the CPU L2->L3 cache works.
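A minimal sketch of that buffering scheme, assuming a flat shared table and made-up sizes (a real match-finder would need proper hash-chain updates, not a plain append):

```c
#include <pthread.h>
#include <stddef.h>

/* Each core batches new dictionary entries in a small private buffer and
 * only takes the central lock when the buffer fills, so the shared table
 * sees one locked flush per LOCAL_BUF inserts instead of one per insert. */
enum { LOCAL_BUF = 1024, DICT_CAP = 1 << 16 };

typedef struct { unsigned long pos, hash; } entry_t;

static pthread_mutex_t dict_lock = PTHREAD_MUTEX_INITIALIZER;
static entry_t shared_dict[DICT_CAP];  /* the single big dictionary */
static size_t shared_count = 0;

typedef struct {
    entry_t buf[LOCAL_BUF];
    size_t count;
} core_buf_t;

/* Flush a core's private buffer into the shared table under the lock. */
static void core_flush(core_buf_t *c) {
    pthread_mutex_lock(&dict_lock);
    for (size_t i = 0; i < c->count && shared_count < DICT_CAP; i++)
        shared_dict[shared_count++] = c->buf[i];
    pthread_mutex_unlock(&dict_lock);
    c->count = 0;
}

/* Record a new entry; only touches the lock once per LOCAL_BUF inserts. */
static void core_add(core_buf_t *c, entry_t e) {
    c->buf[c->count++] = e;
    if (c->count == LOCAL_BUF)
        core_flush(c);
}
```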
I also saw somewhere on the forum that Bulat designed what's called the "HT4" matchfinder, which seems to scale very well with memory (~1.6x) compared to BT4. So imagine a single huge dictionary (say 8-10 GB on 16 GB systems) that could potentially render srep obsolete in many cases. And that means the disk thrashing would also be less of a problem, since srep needs to do full passes anyway.
Anyway, just thinking out loud - you guys know this stuff better than I do, so I wonder what you have to say.