
Thread: Index-Compress-Update: parallel LZ match finder algo

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    Index-Compress-Update: parallel LZ match finder algo

The main problem with a parallel implementation of LZ match searching is that we need different match indices for the different blocks processed simultaneously. I propose to solve this problem by saving small "incremental" search indices. For example, consider compression of a 64 kb block with an 8 mb dictionary in the following steps:

    1. we split the 64 kb into four 16 kb subblocks and index them all
    2. we compress all subblocks simultaneously, searching for matches both in the index of the previous 8 mb and in the indices of the subblocks preceding the current one
    3. we update the main 8 mb index with the small indices of the subblocks

    Of course, a real algorithm should use a sliding dictionary. Every subblock is then processed in 3 steps:
    1. build its own small index
    2. compress using the large index built N blocks before plus the small indices of the previous N blocks
    3. update the large index with our small index data once all previous subblocks have finished compressing

    This algorithm requires searching N small indices for matches. Alternatively, we can use one small index that holds the last 64 kb. Each subblock then starts from a copy of the previous subblock's index, removes the obsolete data for the oldest 16 kb and adds data for the new 16 kb.


    The third step, Update, should be done while all threads performing the Compress step are paused (unless we have an index that can't be broken by a concurrent update and we are able to skip duplicate matches found by both the small and the large index). The Update procedure may itself be multithreaded. A rough sketch of the whole flow is given below.
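    To make the flow concrete, here is a minimal single-threaded sketch of the three phases (Index, Compress, Update). The index type, the hash and the subblock bookkeeping are my own illustrative choices, not part of the proposal; a real version would run the Compress calls of step 2 in worker threads and synchronize before step 3, and the "compression" below is only a placeholder that counts match candidates:
    Code:
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    // hash of 4 bytes -> list of positions; stands in for both index kinds
    typedef std::unordered_map<uint32_t, std::vector<uint32_t> > Index;

    static uint32_t Hash4(const char* p) {
      uint32_t x; std::memcpy(&x, p, 4);
      return (x * 123456791u) >> 16;
    }

    static void IndexRange(Index& idx, const char* base, uint32_t from, uint32_t to) {
      for (uint32_t i = from; i + 4 <= to; i++)
        idx[Hash4(base + i)].push_back(i);
    }

    // placeholder for real compression: counts candidates found in the large index
    // and in the small indices of the subblocks preceding the current one
    static size_t Compress(const char* base, uint32_t from, uint32_t to,
                           const Index& big, const std::vector<Index>& small, int nPrev) {
      size_t candidates = 0;
      for (uint32_t i = from; i + 4 <= to; i++) {
        uint32_t h = Hash4(base + i);
        Index::const_iterator it = big.find(h);
        if (it != big.end()) candidates += it->second.size();
        for (int s = 0; s < nPrev; s++) {
          it = small[s].find(h);
          if (it != small[s].end()) candidates += it->second.size();
        }
      }
      return candidates;
    }

    int main() {
      std::vector<char> data(64 * 1024, 'a');   // toy 64 kb input block
      Index big;                                // stands in for the 8 mb index
      const uint32_t SUB = 16 * 1024;
      std::vector<Index> small(4);

      for (int i = 0; i < 4; i++)               // 1. Index each 16 kb subblock
        IndexRange(small[i], data.data(), i * SUB, (i + 1) * SUB);

      for (int i = 0; i < 4; i++)               // 2. Compress (each call is independent)
        Compress(data.data(), i * SUB, (i + 1) * SUB, big, small, i);

      for (int i = 0; i < 4; i++)               // 3. Update the large index
        for (Index::iterator it = small[i].begin(); it != small[i].end(); ++it)
          big[it->first].insert(big[it->first].end(),
                                it->second.begin(), it->second.end());
      return 0;
    }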


    PS: I've seen the process of multiplication of hash indices 10 years ago when we switched from using single index on 3 bytes to 3+4 and likewise schemes. It would be fanny to see its again, now due to reqs of multicore cpus. World is still waiting for real multi-core optimized software, including LZ compression algos
    Last edited by Bulat Ziganshin; 13th August 2011 at 12:14.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. Re "indices"
    http://en.wiktionary.org/wiki/index#Noun
    Personally, I prefer to avoid weird plural forms (like indices, unices),
    because they're not English (they're carried over from the original language) and thus confusing.
    "Indexes" may be considered "American English" (i.e. not UK English),
    but it's still considered valid too, and easier to read and write.

    2. I think that parallel LZ encoding has more threading potential than
    what's normally used.
    For example, these are possible threads in the _sequential_ encoder:
    - long-distance LZ (i.e. rep); can also be applied after filters
    - preprocessing (disasm,wrt,delta,mm) - likely multiple threads
    - matchfinding (building of a data structure that links the matches)
    (can be BWT, including MT ones)
    - parsing optimization
    - context optimization / expansion control
    - arithmetic coding (can be 2+ threads)

    3. In fast LZs we might not be able to afford maintaining multiple dictionaries,
    while in stronger LZs there's no real need to consider block threading at the moment,
    because of all the other threading options.
    Also note that in lzma the parsing is purely sequential (it takes into account
    match context and a distance stack), so it's only possible to apply block
    threading to lzma2.

    4. As to parallel matchfinding, I think it's quite possible to keep using
    a single data structure with some branches at the processing area.
    I mean, the popular matchfinding structures these days seem to be built
    over hash chains, so we can "skip" parts allocated to other threads with
    a fast hc pass - it shouldn't be a problem to link multiple hc "tail" blocks to
    the same base structure (a rough sketch follows below).
    Hashtable cloning would be a problem though... I wonder if it's acceptable to
    do it at the hardware level, by mapping the same memory to multiple locations
    and cloning pages on write.
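    For illustration only, here is one way such a per-thread "tail" could be layered on a shared base structure; the types and the rule for deciding which next[] array to follow are my own assumptions, not something taken from lzma or any existing coder:
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct HashChain {
      std::vector<int32_t> head;   // hash -> most recent position, -1 if none
      std::vector<int32_t> next;   // position -> previous position with the same hash
      HashChain(size_t bufSize, int hashBits)
        : head(size_t(1) << hashBits, -1), next(bufSize, -1) {}
    };

    static uint32_t Hash3(const uint8_t* p, int hashBits) {
      uint32_t x = uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
      return (x * 2654435761u) >> (32 - hashBits);
    }

    // Thread-local "tail" covering [regionStart, end); 'base' stays frozen while
    // threads run. A real version would size the tail's arrays to its region only.
    struct Tail {
      const HashChain* base;
      HashChain        local;
      int32_t          regionStart;
      Tail(const HashChain* b, size_t bufSize, int hashBits, int32_t start)
        : base(b), local(bufSize, hashBits), regionStart(start) {}

      void Insert(const uint8_t* buf, int32_t pos, int hashBits) {
        uint32_t h = Hash3(buf + pos, hashBits);
        // the first local entry for this hash links straight into the base chain
        local.next[pos] = (local.head[h] != -1) ? local.head[h] : base->head[h];
        local.head[h]   = pos;
      }

      int32_t FirstCandidate(const uint8_t* buf, int32_t pos, int hashBits) const {
        uint32_t h = Hash3(buf + pos, hashBits);
        return (local.head[h] != -1) ? local.head[h] : base->head[h];
      }

      // chain links for our own region live in local.next, older ones in base->next
      int32_t NextCandidate(int32_t pos) const {
        return (pos >= regionStart) ? local.next[pos] : base->next[pos];
      }
    };

    int main() {
      std::vector<uint8_t> buf(1 << 16, 'a');
      const int HB = 16;
      HashChain base(buf.size(), HB);
      int32_t half = int32_t(buf.size() / 2);
      for (int32_t pos = 0; pos + 3 <= half; pos++) {   // shared base: first half
        uint32_t h = Hash3(&buf[pos], HB);
        base.next[pos] = base.head[h];
        base.head[h]   = pos;
      }
      Tail t(&base, buf.size(), HB, half);              // one thread's tail: second half
      for (int32_t pos = half; pos + 3 <= int32_t(buf.size()); pos++)
        t.Insert(buf.data(), pos, HB);
      return t.FirstCandidate(buf.data(), int32_t(buf.size()) - 3, HB) != -1;
    }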

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I want to mention that these days, LZ matchfinder efficiency in most cases can be measured just by the number of memory access operations - everything else goes much faster. And in this respect, small indices will be "free".

    For LZMA, we probably need one global 4+ index, plus a small 2-3-4 index that covers only the last 64 kb (for processing the 4*16 kb subblocks). The entire process for one 16 kb subblock will be:

    1. get the small index (already updated for the previous subblock) and duplicate it
    2. update the small index with this block's data
    3. send the updated small index to the next subblock
    4. compress using both the large index and the saved copy of the small index, updating the small index
    5. update the large index with the small index contents

    There are LOTS of additional operations, but they don't access RAM outside of the last-level cache. The number of RAM accesses remains the same as with a sequential match finder.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I want to mention that these days, LZ matchfinder efficiency in
    > most cases can be measured just by the number of memory access
    > operations - everything else goes much faster. And in this respect,
    > small indices will be "free".

    I'd agree, but only if it's about fast LZs.
    Otherwise I don't quite see how explicit caching (via multiple indexes)
    would be better than hardware caching (single index + L1-L3).
    This reminds me of a possible optimization though - add a flag to
    hashtable entries and avoid the hc lookup for unique strings; then maybe
    store up to 8 offsets in the hashtable's cache lines (a sketch of such a layout is below).
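    As an illustration only (my own layout, not lzma's): each hash bucket is one 64-byte cache line holding a counter and the 8 most recent offsets, so unique strings and short chains are resolved with a single memory access, and only overfull buckets would fall back to the real hash chain:
    Code:
    #include <cstdint>
    #include <vector>

    struct alignas(64) Bucket {
      uint32_t count;       // 0 = hash unseen so far; <= 8 = all offsets are right here
      uint32_t offs[8];     // most recent offsets for this hash, newest first
      uint32_t pad[7];      // fill out the 64-byte cache line
    };

    struct CachedHashTable {
      std::vector<Bucket> buckets;                     // value-initialized to zero
      explicit CachedHashTable(int hashBits) : buckets(size_t(1) << hashBits) {}

      // returns candidates available without touching chains; if count > 8 the
      // caller would continue into the ordinary hash chain (not shown here)
      int Lookup(uint32_t h, uint32_t* out) const {
        const Bucket& b = buckets[h];
        int n = b.count < 8 ? int(b.count) : 8;
        for (int i = 0; i < n; i++) out[i] = b.offs[i];
        return n;
      }

      void Insert(uint32_t h, uint32_t off) {
        Bucket& b = buckets[h];
        for (int i = 7; i > 0; i--) b.offs[i] = b.offs[i - 1];   // drop the oldest
        b.offs[0] = off;
        b.count++;
      }
    };

    int main() {
      CachedHashTable t(16);
      t.Insert(12345, 100);
      t.Insert(12345, 200);
      uint32_t cand[8];
      return t.Lookup(12345, cand);   // 2 candidates, newest (200) first
    }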

    Also ppmd encoding is much faster than lzma, while ppmd certainly
    uses more memory and looks up more data.

    > The entire process for one 16 kb subblock will be:

    A thread processes a 16k block, then syncs, skips 48k and continues, right?
    If so, I think 16k is too small, especially for faster coders.
    Thread sync is slow.

    If the dictionary is not 1G, but more reasonable, like 4-8M,
    I think it would be better to process much larger blocks in threads,
    like 8-16M maybe. That way we'd be able to init the tables, then
    index the "skipped" data in each thread, then start processing.
    There'd be duplicate work, but I still think it would
    be more efficient; now there are some core-specific caches too, not
    only global ones.

    > 4. compress using both the large index and the saved copy of the small index, updating the small index

    There's a design where it would certainly make sense - when BWT's SA is used for matchfinding.
    Computing one global BWT is not a solution in general, and any kind of "sliding" can only be
    done at the blockwise level, so there's no way around some local index.

    > The number of RAM accesses remains the same as with a sequential match finder

    I just made a test (see attachment) and it says that an uncached 32-bit RAM read here
    (Q9450 @ 3.52) takes ~600 clocks, which is certainly huge.
    But instead of local indexes, I'd consider prefetching first - we can
    hash and look up a few n-grams ahead (rough sketch below)... might be an idea too, as I didn't
    see anything like that in lzma at least.
    Attached Files
    • File Type: zip 1.zip (1.4 KB, 172 views)
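    For illustration, a rough sketch of that kind of lookahead (my own code, not lzma's; the distance and hash are arbitrary): while matching at position pos, hash position pos+DIST and prefetch the corresponding head[] entry so the table read is already in flight by the time we get there.
    Code:
    #include <cstddef>
    #include <cstdint>

    static const int DIST = 4;   // lookahead distance, would need tuning

    static inline uint32_t Hash4(const uint8_t* p, int hashBits) {
      uint32_t x = uint32_t(p[0]) | (uint32_t(p[1]) << 8)
                 | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
      return (x * 2654435761u) >> (32 - hashBits);
    }

    void FindMatches(const uint8_t* buf, size_t n, uint32_t* head, int hashBits) {
      for (size_t pos = 0; pos + 4 + DIST <= n; pos++) {
        // issue the read for the iteration DIST steps ahead (gcc/icl intrinsic)
        __builtin_prefetch(&head[Hash4(buf + pos + DIST, hashBits)]);

        uint32_t h = Hash4(buf + pos, hashBits);
        uint32_t candidate = head[h];     // hopefully already on its way to cache
        head[h] = uint32_t(pos);
        (void)candidate;                  // match verification omitted in this sketch
      }
    }

    int main() {
      static uint8_t  buf[1 << 16];       // zero-initialized toy input
      static uint32_t head[1 << 16];
      FindMatches(buf, sizeof(buf), head, 16);
      return 0;
    }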

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    1. Thread sync should require about the same time as one RAM access.
    2. 8M subblocks will mean a 32-64 mb "small" index. 4 such indices will occupy 128 mb of RAM and increase the number of RAM accesses.

    I found your program funny. Nevertheless, its results here (2600k @ 1.6->4.6 GHz, RAM 1600-8-9-8-24-t1):
    F:\>1.exe
    buf=00404000
    r=0000000C 12
    q=00000024 36
    q-r=00000018 24

    F:\>1.exe
    buf=00404000
    r=0000000E 14
    q=0000000E 14
    q-r=00000000 0

    F:\>1.exe
    buf=00404000
    r=0000000E 14
    q=0000000E 14
    q-r=00000000 0

    F:\>1.exe
    buf=00404000
    r=0000000E 14
    q=0000000E 14
    q-r=00000000 0

    F:\>1.exe
    buf=00404000
    r=0000000C 12
    q=0000000E 14
    q-r=00000002 2

    F:\>1.exe
    buf=00404000
    r=0000000C 12
    q=0000000E 14
    q-r=00000002 2

    F:\>1.exe
    buf=00404000
    r=0000000E 14
    q=0000000E 14
    q-r=00000000 0

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    With my own program I measured about 90 CPU ticks per random memory access:
    Code:
    int main( void ) {
      const unsigned size = 1<<30;
      char sum=0, *buf=new char[size];
      if (buf)
        for(unsigned i=0, k=123456789; i<1000*1000*1000; i++, k=k*123456791+31415926)
          sum += buf[k%size];
      return sum;
    }
    benchmarked with external timer

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Probably your processor figured out that the values read were not used anywhere. Save the read values in some variables and return them from main().

    I wonder what will happen if you replace "mov ecx, mem_loc" with "xchg ecx, mem_loc".
    Last edited by Piotr Tarsa; 16th August 2011 at 19:03.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Another version - http://codepad.org/ENIvUr2H

  9. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    >Probably your processor figured out that the values read were not used anywhere

    actually you are right - it may reorder operations

    >Save the read values in some variables and return them from main().

    Anyway, I prefer to find the average time over many memory accesses rather than rdtsc-timing a single access.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Ok, I did more tests.
    First, I suspected that there are actually cache hits in the "random" read loop:
    Code:
    #include <stdio.h>
    
    const int size = 1<<30;
    const int iter = 1000*1000*1000;
    
    int test( int cachesize ) {
    
      int* p = new int[size/32];
      if( p==0 ) return 2;
    
      int i,n=0; unsigned k;
    
      for( i=0; i<size/32; i++ ) p[i]=-iter;
    
      for( i=0, k=123456789; i<iter; i++, k=k*123456791+31415926 ) {
        int& j = p[(k%size)/32];
        if( i-j < cachesize ) n++;
        j = i;
      }
    
      printf( "cs=%iM n=%i/%i=%.3lf%%\n", cachesize>>20, n, iter, double(n)/iter*100 );
    
      delete[] p;
    
      return 0;
    }
    
    int main( void ) {
    
      test( 4<<20 );
      test( 8<<20 );
    
    }
    And it looks like there really are some:
    Code:
    cs=4M n=62368231/1000000000=6.237%
    cs=8M n=124476006/1000000000=12.448%
    but not enough to cover the gap.

    So then I tried to block overlapping of reads:
    Code:
    int main( void ) {
      const unsigned size = 1<<30;
      char sum=0, *buf=new char[size];
      if( buf ) {
        volatile int x = 0;
        int y = x;
        int i = 0;
        for(unsigned k=123456789; i<1000*1000*1000; i++, k=k*123456791+31415926) {
          sum += buf[y+k%size];
          y &= sum;
        }
      }
      return sum;
    }
    And got ~120s instead of original ~40s there, and 120s is 422 clocks per iteration.

    So rdtsc results are important too.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Smart code! I agree with the second one.

    About the first one: it seems that usually LZ has a lot of matches that hit in cache, which lets it run more or less fast. AFAIK, most modern CPUs have a 64-byte cache line. And you have one thing to fix - cachesize should be measured in cache lines, too. With these changes it prints:

    cs=4M n=2926590/1000000000=0.293%
    cs=8M n=5847931/1000000000=0.585%

    while the theory tells us that a 4 MB cache should have a 0.4% hit ratio for random access to 1 GB of memory (4 MB / 1 GB ≈ 0.4%):
    Code:
    #include <stdio.h>
    
    const int size = 1<<30;
    const int cache_line = 64;
    const int iter = 1000*1000*1000;
    
    int test( int cachesize ) {
    
      int* p = new int[size/cache_line];
      if( p==0 ) return 2;
    
      int i,n=0; unsigned k;
    
      for( i=0; i<size/cache_line; i++ ) p[i]=-iter;
    
      for( i=0, k=123456789; i<iter; i++, k=k*123456791+31415926 ) {
        int& j = p[(k%size)/cache_line];
        if( i-j < cachesize/cache_line ) n++;
        j = i;
      }
    
      printf( "cs=%iM n=%i/%i=%.3lf%%\n", cachesize>>20, n, iter, double(n)/iter*100 );
    
      delete[] p;
    
      return 0;
    }
    
    int main( void ) {
      test( 4<<20 );
      test( 8<<20 );
    }

  12. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Igor said that he already has a CPU benchmark site: http://7-cpu.com/ including a memory latency measurement program at http://7-cpu.com/utils.html

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Thanks, I didn't see that before.
    I guess my first test also causes a TLB miss then.
    Actually I haven't really thought about that since DOS times, so the actual results are quite "impressive" -
    it's even slower than GPU results (I tested it using an rdtsc equivalent and got around 100 clocks).
    So I guess a linked list (HC etc) is a really bad data structure these days, because a pointer
    lookup can't even overlap with other code.

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I made another test, comparing the speed of a single memory-access thread with 8 threads running the same code. The results are:

    Code:
    C:\!\Haskell>timer access_time.exe
    (performs 1e9 unordered memory accesses in single thread)
    User Time    =    17.082 = 00:00:17.082 =  98%
    Process Time =    17.269 = 00:00:17.269 =  99%
    Global Time  =    17.284 = 00:00:17.284 = 100%
    
    C:\!\Haskell>start8x.cmd access_time.exe
    (performs 8e9 unordered memory accesses in multiple threads)
    User Time    =    77.329 = 00:01:17.329 =  95%
    Process Time =    77.626 = 00:01:17.626 =  96%
    Global Time  =    80.605 = 00:01:20.605 = 100%
    
    
    
    C:\!\Haskell>timer access_time_shelwien.exe
    (performs 1e9 sequential memory accesses in single thread)
    User Time    =    71.682 = 00:01:11.682 =  99%
    Process Time =    71.869 = 00:01:11.869 =  99%
    Global Time  =    71.917 = 00:01:11.917 = 100%
    
    C:\!\Haskell>start8x.cmd access_time_shelwien.exe
    (performs 8e9 sequential memory accesses in multiple threads)
    User Time    =   121.633 = 00:02:01.633 =  98%
    Process Time =   121.805 = 00:02:01.805 =  98%
    Global Time  =   123.241 = 00:02:03.241 = 100%

    My memory is 1866 MHz 9-11-9-27-1N

    The last result is the most exciting. It means that while one memory access costs 72 ns on my system, running them in multiple threads lowers the cost to about 15 ns. With prefetching it may be even lower; I haven't tried yet (a rough sketch of one way to try it is below).
    Attached Files
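    For what it's worth, here is a rough sketch of how that prefetch variant might look (an untested guess on my part, with the distance DIST picked arbitrarily): a second LCG runs a few steps ahead of the read index and issues __builtin_prefetch, so several misses can be in flight per thread.
    Code:
    int main( void ) {
      const unsigned size = 1<<30;
      const int DIST = 16;                      // how far ahead to prefetch
      char sum=0, *buf=new char[size];
      if (buf) {
        unsigned k=123456789, kpf=k;
        for (int d=0; d<DIST; d++)              // advance the prefetch index
          kpf = kpf*123456791+31415926;
        for (unsigned i=0; i<1000*1000*1000; i++) {
          __builtin_prefetch(&buf[kpf%size]);   // request the future cache line
          sum += buf[k%size];                   // the measured access, as before
          k   = k  *123456791+31415926;
          kpf = kpf*123456791+31415926;
        }
      }
      return sum;
    }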

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I've added to the archive the assembler listings generated with the commands:

    g++ -O2 access_time.cpp -oaccess_time.exe -Wa,-adhln >access_time.s
    g++ -O2 access_time_shelwien.cpp -oaccess_time_shelwien.exe -Wa,-adhln >access_time_shelwien.s

    I cannot understand why Shelwien's version is so much slower - there is no sign of prefetching in the assembler code generated for my program.
    Attached Files

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Now measuring write times:
    Code:
    for(unsigned i=0, k=123456789; i<1000*1000*1000; i++, k=k*123456791+31415926)
      buf[k%size] = 1;

    Code:
    C:\!\Haskell>timer write_time.exe
    User Time    =    17.113 = 00:00:17.113 =  98%
    Process Time =    17.316 = 00:00:17.316 =  99%
    Global Time  =    17.332 = 00:00:17.332 = 100%
    
    
    C:\!\Haskell>start8x.cmd write_time.exe
    User Time    =    84.724 = 00:01:24.724 =  97%
    Process Time =    85.067 = 00:01:25.067 =  98%
    Global Time  =    86.519 = 00:01:26.519 = 100%
    Attached Files

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Code:
    .00401330: 89C8                         mov         eax,ecx
    .00401332: 25FFFFFF3F                   and         eax,03FFFFFFF
    .00401337: 021C06                       add         bl,[esi][eax]
    .0040133A: 69C117CD5B07                 imul        eax,ecx,0075BCD17
    .00401340: 4A                           dec         edx
    .00401341: 8D88765EDF01                 lea         ecx,[eax][01DF5E76]
    .00401347: 79E7                         jns        .000401330  --- (6)
    
    .00401340: 89C8                         mov         eax,ecx
    .00401342: 43                           inc         ebx
    .00401343: 25FFFFFF3F                   and         eax,03FFFFFFF
    .00401348: 8D3407                       lea         esi,[edi][eax]
    .0040134B: 0FB60416                     movzx       eax,b,[esi][edx]
    .0040134F: 0045EF                       add         [ebp][-11],al
    .00401352: 69C117CD5B07                 imul        eax,ecx,0075BCD17
    .00401358: 0FBE75EF                     movsx       esi,b,[ebp][-11]
    .0040135C: 8D88765EDF01                 lea         ecx,[eax][01DF5E76]
    .00401362: 21F2                         and         edx,esi
    .00401364: 81FBFFC99A3B                 cmp         ebx,03B9AC9FF
    .0040136A: 7ED4                         jle        .000401340  --- (6)
    The 2nd version is much more complex overall; there's this eax/al/eax dependency chain - the multiplication has
    to wait for the read etc., and there are other memory accesses too.

    But I still think the 2nd version is a better test, because that simple "add bl,[addr]" sequence may
    be optimized by the cpu (nothing else in the loop depends on BL, so the loop can continue after issuing the read),
    while in the 2nd version each loop iteration certainly waits until the read finishes.

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Yes, on second thought it seems that the cpu just accumulates those "add bl,[addr]" instructions, since nothing depends on them. So 17 ns/access is possible when you have a stream of memory operations whose results you don't need to wait for. write_access.cpp has exactly the same speed, probably for the same reason. And I foresee that prefetch-assisted memory accesses will have the same speed, which is great.

    btw,
    __builtin_prefetch is available both in gcc and icl

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. ppmd implements prefetching like this:
    Code:
    inline void PrefetchData( void* Addr ) {
      BYTE PrefetchByte = *(volatile BYTE*)Addr;
    }
    It's likely more universal than the intrinsic.

    2. Pure write speed tests don't really make much sense, because writes are always cached.
    But there are more interesting cases - for example, a read from the same cache line right after a write,
    or a write and a read of the same line on different cores.

  20. #20
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    1. Prefetching using the intrinsic is an explicit signal to the processor. The above trick could sometimes be optimized out. Also, there are a few prefetch levels (prefetcht0, prefetcht1, etc.) selectable in that intrinsic.

    2. There are only a few write buffers in modern desktop processors, at least that's what Agner Fog told me by email. My experiments, available on that forum, showed that some processors (e.g. AMD Brazos based ones) seemingly have no write buffers.

  21. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Shelwien,
    1. The question is how good it is compared to the builtin CPU prefetch instruction. Anyway, we can make a conditional definition (see the sketch below).
    2. The speed of streamed reads/writes is also interesting, and at 17 ns it's not very fast. A streamed read (probably) shows the speed of a prefetched memory read, while a streamed write shows the speed of updating a hash table or the like.
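    One possible shape for such a conditional definition (my own sketch; the macro name is made up): use the compiler intrinsic where one exists, and fall back to the ppmd-style volatile dummy read otherwise.
    Code:
    #if defined(__GNUC__) || defined(__INTEL_COMPILER)
      #define PREFETCH(addr)  __builtin_prefetch(addr)
    #elif defined(_MSC_VER)
      #include <intrin.h>
      #define PREFETCH(addr)  _mm_prefetch((const char*)(addr), _MM_HINT_T0)
    #else
      /* fallback: a volatile read the compiler may not remove, at the cost of
         actually waiting for the data on some CPUs */
      #define PREFETCH(addr)  do { volatile char t = *(volatile const char*)(addr); (void)t; } while (0)
    #endif

    int main() {
      char data[64] = {0};
      PREFETCH(&data[32]);   // usage example
      return 0;
    }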
    Last edited by Bulat Ziganshin; 9th January 2012 at 21:38.

  22. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    1. Prefetching using the intrinsic is an explicit signal to the processor. The above trick could sometimes be optimized out.
    Due to the volatile reference, it's probably compiled to a "mov reg,[addr]" whose result is not used; that should work like the "add" instruction we discussed above.

  23. #23
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    1. ppmd implements prefetching like this:
    Code:
    inline void PrefetchData( void* Addr ) {
      BYTE PrefetchByte = *(volatile BYTE*)Addr;
    }
    It's likely more universal than the intrinsic.
    This kind of optimization is preferable according to one of AMD's documents. They called it "software prefetching", while calling the "prefetch" instruction "hardware prefetching". According to the same document, CPUs can ignore a hardware prefetch, but they cannot ignore a software prefetch, which thus guarantees the prefetching. That document is old enough that I cannot give you a link. So we need to benchmark it to draw any conclusion. But I believe AMD's and Intel's approaches could differ.
    BIT Archiver homepage: www.osmanturan.com
