
Thread: [LZ] M/T & GPU compression/decompression

  #1 Bulat Ziganshin (Programmer)

    [LZ] M/T & GPU compression/decompression

    I want to discuss the schemes that zstd, lz5 and other LZ algorithms may use to implement M/T or GPU operation.

    1. M/T parsing

    The simple approach to M/T compression, used by 4x4/lzma2/pzstd/mtzstd, is to break the data into several [overlapping] blocks and compress them in parallel. Unfortunately, it raises memory requirements to O(threads*dictsize). Here I describe a scheme that limits memory usage to O(dictsize).

    This requires splitting the work into two independent tasks, match finding and parsing, as lzma has already done for years. Since match finding precedes parsing, we can't find matches only for the positions where the parser stops; we have to find matches for every position in the file. This makes fast parsing strategies (lazy/greedy) less interesting, since an important part of their speedup is exactly the ability to skip match searching for most positions.

    So, the first group of threads should find matches for all positions and the second group should use [optimal] parsing to build the LZ output. The list of found matches should be split into multiple parts, each part parsed and compressed independently, and then the results are concatenated. This requires some support from the compressed data format, though - it should allow such streams to be concatenated, i.e. either reset the entropy coding model (for lzma-like codecs) or encode the entropy model as part of the compressed data itself (as deflate/zstd/lz5 do). Note that for lzma-like codecs it's preferable to use larger parsing blocks, since we lose compression ratio on each reset, while for deflate-like codecs it's fine to concatenate individual compressed blocks, so we may split the input data, e.g., into 128 KB blocks plus the points where extra-large matches are found (longer than fast-bytes characters in lzma terminology).
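
    To make the splitting rule concrete, here is a minimal sketch (not taken from any existing codec) of choosing parse-block boundaries from an already-built match list; the Match struct, the chooseBlockBoundaries name and the 128 KB / fast-bytes constants are illustrative assumptions:
    Code:
    #include <cstdint>
    #include <vector>

    struct Match { uint32_t pos, len, dist; };   // found matches, assumed sorted by pos

    // Cut the input into ~blockSize parse blocks, and also cut right after any
    // extra-large match so that such a match never straddles two independent blocks.
    std::vector<uint32_t> chooseBlockBoundaries(const std::vector<Match>& matches,
                                                uint32_t inputSize,
                                                uint32_t blockSize = 128 * 1024,
                                                uint32_t fastBytes = 273)
    {
        std::vector<uint32_t> cuts{0};
        uint32_t nextCut = blockSize;
        for (const Match& m : matches) {
            uint32_t cut = m.len > fastBytes ? m.pos + m.len
                         : m.pos >= nextCut  ? m.pos
                         : 0;
            if (cut > cuts.back() && cut < inputSize) {
                cuts.push_back(cut);
                nextCut = cut + blockSize;
            }
        }
        cuts.push_back(inputSize);   // each [cuts[i], cuts[i+1]) is parsed and encoded independently
        return cuts;
    }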


    2. M/T matchfinder

    The first group of threads in this scheme should fill the match list. They can do that cooperatively by assigning work based on the hash value at each position. E.g., if we use a matchfinder with minlen=4, we compute hash(pos) from the bytes pos[0]..pos[3]. Then, depending on 3 bits of this hash, we assign the position to one of 8 threads.
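
    A minimal sketch of such an assignment; the multiplicative hash, its 20-bit width and the choice of the top 3 bits are illustrative only:
    Code:
    #include <cstdint>
    #include <cstring>

    // Hash of the 4 bytes pos[0]..pos[3] (matchfinder with minlen=4).
    static inline uint32_t hash4(const uint8_t* p) {
        uint32_t x;
        std::memcpy(&x, p, 4);
        return (x * 2654435761u) >> 12;    // 20-bit hash value
    }

    // The top 3 bits of the hash select one of 8 match-finder threads.
    static inline unsigned threadOf(uint32_t hash) {
        return hash >> 17;
    }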

    There are two ways to calculate the assignment. The first way is to scan every position in the main thread, calculate the hash and push the {pos,hash} pair to the queue allocated for the corresponding thread. This means a lot of memory traffic, so ideally the queue blocks should be kept small enough to restrict this traffic to the L3 CPU cache.

    The second approach is that every sub-thread scans every position, filtering out all positions that belong to other threads. This means we do N times more work (where N is the number of sub-threads), but since this work is extremely simple (just a few ALU operations), it may be unimportant for desktop CPUs (i.e. as long as N is limited to 8..16). There is also a combined approach where the main thread sends only pos to the queue and the hash is recomputed by the sub-thread.
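
    A sketch of the second variant, reusing the hypothetical hash4()/threadOf() helpers from the sketch above: each worker scans the whole buffer but keeps only the positions whose hash bits select it.
    Code:
    #include <cstddef>
    #include <cstdint>

    // Every worker scans every position; the few ALU operations below are the
    // duplicated work, anything heavier happens only for the positions it owns.
    void matchFinderWorker(const uint8_t* buf, size_t size, unsigned myId) {
        for (size_t pos = 0; pos + 4 <= size; pos++) {
            uint32_t h = hash4(buf + pos);
            if (threadOf(h) != myId)
                continue;                  // this position belongs to another thread
            // ... search/update this thread's segment of the hash table for pos ...
        }
    }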

    To improve speed, sub-threads can prefetch data from the hash table using hash values a few positions ahead. This is easier to implement with the first approach, although it's also possible with the remaining ones.
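
    For illustration, a sketch of a worker draining its {pos,hash} queue while prefetching the hash-table row of an entry a fixed distance ahead, using the GCC/Clang __builtin_prefetch intrinsic; the queue layout and the distance of 16 are assumptions:
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct QueueEntry { uint32_t pos, hash; };

    // Drain one thread's queue; prefetch the hash-table row needed PREFETCH_DIST
    // entries from now so it is already in cache by the time we reach it.
    void drainQueue(const std::vector<QueueEntry>& queue, const uint32_t* hashTable) {
        const size_t PREFETCH_DIST = 16;
        for (size_t i = 0; i < queue.size(); i++) {
            if (i + PREFETCH_DIST < queue.size())
                __builtin_prefetch(&hashTable[queue[i + PREFETCH_DIST].hash]);
            // ... search/update the row hashTable[queue[i].hash] for position queue[i].pos ...
        }
    }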


    2.1 Searching the data

    By assigning work depending on the hash value we effectively split the hash tables into several segments, each segment belonging to one thread and searched/updated only by that thread. The details depend on the hash organization:

    The HashTable matchfinder (i.e. the one from Tornado) is the simplest. All pointers are saved into HashTable[hash(pos)][i] slots, where i is a small number, e.g. 0..7, so we can divide the work between threads based on the higher bits of the hash. All required data from the HashTable can be prefetched once we know the hash value.
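
    A rough sketch of such a row search/update; the row size, the 0-means-empty convention and the plain byte-compare loop are illustrative assumptions, not Tornado's actual code:
    Code:
    #include <cstdint>
    #include <cstring>

    constexpr int ROW_SLOTS = 8;   // candidate positions kept per hash bucket

    // Check every candidate stored in one HashTable row against the current position,
    // then shift the row and insert the current position as the newest entry.
    uint32_t searchRow(uint32_t* row, const uint8_t* buf, uint32_t bufSize,
                       uint32_t pos, uint32_t* bestLen) {
        uint32_t bestPos = 0;
        *bestLen = 0;
        for (int i = 0; i < ROW_SLOTS; i++) {
            uint32_t cand = row[i];
            if (cand == 0 || cand >= pos) continue;        // empty or invalid slot
            uint32_t len = 0;
            while (pos + len < bufSize && buf[cand + len] == buf[pos + len]) len++;
            if (len > *bestLen) { *bestLen = len; bestPos = cand; }
        }
        std::memmove(row + 1, row, (ROW_SLOTS - 1) * sizeof(uint32_t));
        row[0] = pos;                                      // newest candidate goes first
        return bestPos;
    }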


    The HashChain matchfinder is a bit more tricky. It maintains a list of positions with the head stored in HeadTable[hash(pos)] and next pointers stored in ChainTable[pos]. So, to avoid enormous traffic between cores, ChainTable should be updated in the main thread:
    Code:
    for (uint32_t pos = 0; pos + 4 <= bufSize; pos++) {
        uint32_t hash = hashFunc(&buf[pos]);     // hash of the 4 bytes at pos
        uint32_t next = HeadTable[hash];         // old head of this hash chain
        HeadTable[hash] = pos;                   // the current position becomes the new head
        ChainTable[pos] = next;                  // link it to the old head
        Queue[thread(pos)].push({pos, next});    // the search itself is done by a worker thread
    }
    Since the subsequent search in ChainTable doesn't modify the table and doesn't depend on the hash value, the thread(pos) function doesn't need to depend on the hash value either and may be as simple as pos%THREADS.
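
    A sketch of the corresponding worker-side loop (the names, the 0-terminated chain and the step limit are assumptions): the worker only reads ChainTable, so chain walking causes no write traffic between cores.
    Code:
    #include <cstdint>

    struct WorkItem { uint32_t pos, next; };   // as pushed by the main-thread loop above

    // Walk the chain starting from item.next; the worker never writes ChainTable,
    // it only follows the links already built by the main thread.
    uint32_t searchChain(const uint8_t* buf, uint32_t bufSize, const uint32_t* chainTable,
                         WorkItem item, uint32_t maxSteps, uint32_t* bestLen) {
        uint32_t bestPos = 0;
        *bestLen = 0;
        for (uint32_t cand = item.next, step = 0; cand != 0 && step < maxSteps; step++) {
            uint32_t len = 0;
            while (item.pos + len < bufSize && buf[cand + len] == buf[item.pos + len]) len++;
            if (len > *bestLen) { *bestLen = len; bestPos = cand; }
            cand = chainTable[cand];           // read-only traversal of the shared chain
        }
        return bestPos;
    }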


    For the BinaryTree, the first step of searching is pretty close to the HashTable case, i.e. each thread can deal with its own part of the HeadTable.

    Igor Pavlov already tried a naive implementation where multiple threads share a single lzma-style BinaryTree and found it too slow due to massive traffic between cores. So, the only way to use a single BinaryTree is with HyperThreaded threads on a single core, which can be forced by OS calls.

    Alternatively, we can use multiple BinaryTrees - one per thread. This raises memory usage by 50% compared to the lzma-style BT, since it's no longer possible to use "natural indexing" where index=pos. Instead, every node should keep a {pos,left_index,right_index} tuple.
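
    The memory difference is easy to see from the node layouts (a sketch; the field names are illustrative):
    Code:
    #include <cstdint>

    // lzma-style BT with "natural indexing": the node for buffer position p lives at
    // index p, so only the two children are stored - 2 x uint32 per position.
    struct BtNodeNatural  { uint32_t leftSon, rightSon; };

    // A per-thread tree holds only ~1/N of all positions, so index != position and the
    // position must be stored explicitly - 3 x uint32 per position, i.e. the extra 50%.
    struct BtNodeExplicit { uint32_t pos, leftSon, rightSon; };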


    Finally, I considered a B-Tree organization - it's a natural choice for reducing memory traffic. While B-Trees have larger memory requirements than a single BinaryTree, and are thus avoided by current s/t algorithms, they may be the best choice for m/t ones with exhaustive matchfinders. Like with the BT, on the first level we should have a large, directly indexed HashTable, segmented between threads. On lower levels, each node should occupy 64-256 bytes, since each memory transaction usually reads 128 bytes. We may keep only buffer positions on the lowest level, while on higher levels we also need to store indexes of B-Tree nodes. The organization of the B-Tree should be similar to that of the lzma BinaryTree, i.e. each entry can point only to entries with smaller buffer positions. It still needs more research - can anyone look into that?
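
    As an illustration only, one possible node layout sized around the 128-byte memory transaction; the slot counts and field types are assumptions, not a worked-out design:
    Code:
    #include <cstdint>

    // Leaf nodes keep only buffer positions; internal nodes also keep child indexes.
    // 30 x 4-byte positions plus a small header is roughly one 128-byte transaction.
    constexpr int LEAF_KEYS     = 30;
    constexpr int INTERNAL_KEYS = 15;

    struct BTreeLeaf {
        uint16_t count;                         // used slots
        uint32_t positions[LEAF_KEYS];          // match candidates, ordered by suffix
    };

    struct BTreeInternal {
        uint16_t count;
        uint32_t positions[INTERNAL_KEYS];      // separator keys (buffer positions)
        uint32_t children [INTERNAL_KEYS + 1];  // indexes of lower-level nodes
    };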


    PS: M/T decompression and GPU compression will be described later.


  #2 Member (San Diego)
    Do you need to maintain matches in trees? Maybe it would be better to sort.

  #3 Bulat Ziganshin (Programmer)
    The main problem with a sorted match list is how to use it. I see two ways:
    1) scan matches in sorted order, build a global match list, then use it to do the parsing. This needs a lot of memory, but it seems to be the only way to run the entire match finding on a GPU
    2) maintain a separate HeadTable in each match-finding thread and update it as we go through the data. In this case, each MF thread can process positions in natural order and even employ non-optimal strategies if sorting is fast enough. But memory usage is O(threads*dictsize), even if with a smaller coefficient

    What's your idea?

    Do you need to maintain matches in trees?
    Either it's a rhetorical question or I don't get it.

  #4 Member (San Diego)
    Quote Originally Posted by Bulat Ziganshin
    The main problem with a sorted match list is how to use it. I see two ways:
    1) scan matches in sorted order, build a global match list, then use it to do the parsing. This needs a lot of memory, but it seems to be the only way to run the entire match finding on a GPU
    2) maintain a separate HeadTable in each match-finding thread and update it as we go through the data. In this case, each MF thread can process positions in natural order and even employ non-optimal strategies if sorting is fast enough. But memory usage is O(threads*dictsize), even if with a smaller coefficient

    What's your idea?


    Either it's a rhetorical question or I don't get it.
    You mentioned B-trees, and B-trees are a compromise between search trees and sorted arrays. So, I wondered if you had considered going treeless. You probably need trees if you're doing online modifications; if so, disregard my suggestion.

  #5 Bulat Ziganshin (Programmer)
    I considered it and wrote in my last answer what the problems with a sorted list are. In short - how do we find the matching range in the sorted list? If we process positions sequentially, it's hard. If we process them in sorted order, we need to store a lot of matches before we can start parsing.

    I will write in more detail later; unfortunately, for the GPU it's the only way I know.

  #6 Member (Kraków, Poland)
    if we process in sorted order, we need to store a lot of matches before we can start parsing
    Is that really a problem? We could store matches to disk if we manage to make the disk access linear enough. If the parsing is strong and computationally expensive, like typical LZMA parsing, then the disk shouldn't be a bottleneck.

    Before outputting matches they can be filtered. E.g. if we have a match that looks like this:
    Match(from=123, to=234, length=45)
    and Input[match.from - 1] == Input[match.to - 1], then we can be sure that we will also have:
    Match(from=122, to=233, length=46)
    thus we can skip saving the former match, as it can be trivially computed from the latter.
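
    In code the filter is a one-line test (a sketch; the Match struct and the canSkip name just mirror the notation above):
    Code:
    #include <cstdint>

    struct Match { uint32_t from, to, length; };   // from < to: source and target positions

    // If the bytes just before both occurrences are equal, the previous position yields
    // a match that is one byte longer and implies this one, so this match can be skipped.
    bool canSkip(const uint8_t* input, const Match& m) {
        return m.from > 0 && input[m.from - 1] == input[m.to - 1];
    }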

    Also, we can use a stable radix sort to output all tuples (matchLength, matchDistance, targetPosition) where the pair (matchLength, targetPosition) is unique and matchDistance is the lowest possible one for the tuple. Here's how to do it. Assume that we're sorting a segment of suffixes that all start with "th".

    The segment is like that (sourcePosition, suffixBeginning):
    12 "their"
    26 "thy"
    33 "those"
    57 "thorn"
    68 "there"
    79 "thing"
    99 "thirst"

    When we sort by the third letter we get the following subsegments:
    12 "their"
    68 "there"

    79 "thing"
    99 "thirst"

    33 "those"
    57 "thorn"

    26 "thy"

    When splitting into subsegments we should output the following matches:

    Match(from=12, to=26, length=2)
    Match(from=26, to=33, length=2)
    Match(from=57, to=68, length=2)
    Match(from=68, to=79, length=2)

    because those pairs of suffixes are next to each other (remember that we're talking about a stable sort here) and have differing 3rd letters. All other adjacent pairs have identical third letters, so for them we wait for the next nested radix sort pass, as we can surely produce a longer match. Those matches are:

    Match(from=33, to=57, length>2)
    Match(from=79, to=99, length>2)

    Don't forget about the filtering of matches I described at the beginning of my post. So if we are about to output:
    Match(from=57, to=68, length=2)
    and we determine that Input[56] == Input[67], then we can skip outputting this match, as it can be trivially recomputed from the corresponding match for the previous position, i.e. the following match (or an even longer one) will be output somewhere else:
    Match(from=56, to=67, length=3)
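
    Putting the whole post together, here is a rough sketch of one recursive MSB radix pass that emits matches at the split points. To keep it short it uses plain recursion, per-level bucket vectors, no maximum-match limit and no "previous position" filtering, and it assumes the buffer ends with a unique sentinel so the buf[pos + depth] reads stay in bounds - all of these are my assumptions, not the actual TarsaMatchFinder code:
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Match { uint32_t from, to, length; };

    // 'segment' holds suffix positions that all share a common prefix of 'depth' bytes,
    // in stable order. Neighbours whose next byte differs get a match of the current
    // length emitted now; neighbours with an equal next byte are deferred to the
    // recursive call, which will emit a longer match for them.
    void radixMatchPass(const uint8_t* buf, const std::vector<uint32_t>& segment,
                        uint32_t depth, std::vector<Match>& out)
    {
        if (segment.size() < 2) return;

        for (size_t i = 1; i < segment.size(); i++) {
            uint32_t from = segment[i - 1], to = segment[i];
            if (buf[from + depth] != buf[to + depth]) {
                if (from > to) std::swap(from, to);   // report the earlier position as source
                out.push_back({from, to, depth});
            }
        }

        // Stable partition by the byte at offset 'depth', then recurse into each bucket.
        std::vector<uint32_t> bucket[256];
        for (uint32_t pos : segment)
            bucket[buf[pos + depth]].push_back(pos);
        for (int b = 0; b < 256; b++)
            radixMatchPass(buf, bucket[b], depth + 1, out);
    }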

  #7 Member (Kraków, Poland)
    BTW:
    Has anyone already made a radix-sort-based match finder? I'm working on my own, based on the idea I've described in my previous post. It's still highly unfinished and has bugs when match lengths exceed the (user-selectable) maximum match limit - the bug is that too many matches are filtered out and then can't easily be recreated.

    By radix-sort-based I mean that the match finder should emit matches during the radix sort, not after it, since after a radix sort we effectively have a sorted suffix array, and there are more efficient algorithms for building sorted SAs than plain radix sort.

  #8 Bulat Ziganshin (Programmer)
    For radix sort, I suggest using my optimized code - it's 5x faster than the naive code.

    LZ compressors with a sorting MF so far:
    * the 20-year-old IMP
    * radyx-7z from the same author (encode.ru thread)
    * the BWT-based MF in zpaq, looking only for the longest match

    It would be great if you put this project on GitHub so we can see the progress and your approach.

  #9 Member (Kraków, Poland)
    Thanks. I've published my project on GitHub as I've written in another thread.

    Sadly, radyx-7z only finds the longest match during the radix-sort phase. The documentation says:
    The radix match finder only finds the longest match (up to the fast length
    value) which slightly decreases compression performance when using optimized
    encoding (level 5 and up). To compensate, Radyx has a hybrid mode which adds a
    small sliding window dictionary and hash chain. This is enabled at level 7 and
    above.
    Your radix sort is an LSB one, whereas mine is an MSB one (it would be rather unwise to do LSB radix sorting with a sorting depth limit of e.g. 100 bytes), so very different optimization techniques will apply (I think). And I'm still coding in Scala - coding in a closer-to-the-metal language will come later, after I finish with the algorithmic improvements.

  #10 Bulat Ziganshin (Programmer)
    It seems that my memory needs a service: more than a year ago I started a similar thread and wrote quite a lot about sorting MFs: https://encode.ru/threads/2407-LZ-M-...de)compression

  #11 Member (Kraków, Poland)
    Have you implemented the idea of a multithreaded binary tree match finder?

    For the BinaryTree, the first step of searching is pretty close to the HashTable case, i.e. each thread can deal with its own part of the HeadTable.

    Igor Pavlov already tried a naive implementation where multiple threads share a single lzma-style BinaryTree and found it too slow due to massive traffic between cores. So, the only way to use a single BinaryTree is with HyperThreaded threads on a single core, which can be forced by OS calls.

    Alternatively, we can use multiple BinaryTrees - one per thread. This raises memory usage by 50% compared to the lzma-style BT, since it's no longer possible to use "natural indexing" where index=pos. Instead, every node should keep a {pos,left_index,right_index} tuple.
    7-Zip with LZMA2 just splits the file into chunks and runs separate compressions in parallel. That raises memory usage proportionally to the number of threads, so that's much more than 50%, especially on four (and more) core CPUs. Only 50% overhead is a big win - I wonder why Igor hasn't tried that approach yet.

    OTOH I'm busy reducing the memory requirements of TarsaMatchFinder.

  #12 Bulat Ziganshin (Programmer)
    Only 50% overhead is a big win - I wonder why Igor hasn't tried that approach yet.
    For a real effect you need to make the parser M/T as well, and that's much more work than just running multiple threads of old code.

    I do the same in 4x4, Cyan in zstd. Except for lzham and radyx, there are no real M/T LZ coders yet.

  #13 Bulat Ziganshin (Programmer)
    Just found our first thread on this topic.
