Thread: Nakamichi

  1. #1
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts

    Nakamichi

    Nakamichi is an LZSS codec with super-fast decompression, targeting mainly highly redundant text data.
    Written by Kaze.
    http://www.codeproject.com/Articles/...34#xx4807834xx
    http://www.sanmayce.com/Nakamichi/

    A very quick'n'rough bench on Phenom 2:
    Code:
    pcbsd-8973% ./fsbench -v all ~/bench/calgary.tar 
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    7z-deflate                              9.20         5
         986324 (x 3.197)     5608 KB/s  121 MB/s      3853e3 4061e3
    7z-deflate64                            9.20         5
         954413 (x 3.303)     4826 KB/s  129 MB/s      3365e3 4176e3
    crush                                   0.0.1        0
        1115260 (x 2.827)     14.6 MB/s  160 MB/s      9675e3   11e6
    LZ4                                     r114         
        1620566 (x 1.946)      252 MB/s 1082 MB/s       122e6  122e6
    LZ4hc                                   r114         
        1195650 (x 2.637)     17.0 MB/s 1287 MB/s        10e6   11e6
    LZF                                     3.6          
        1524761 (x 2.068)      108 MB/s  441 MB/s        55e6   55e6
    lzmat                                   1.1          
        1181402 (x 2.669)     16.6 MB/s  236 MB/s        10e6   11e6
    LZO                                     2.06         1x1
        1595374 (x 1.976)      269 MB/s  425 MB/s       132e6  133e6
    miniz                                   1.11         6
        1032885 (x 3.053)     14.0 MB/s  165 MB/s      9659e3   11e6
    Nakamichi                               round 5      
        2024249 (x 1.558)     36.0 KB/s 1825 MB/s        12e3 2199e3
    Nakamichi-nomemcpy                      round 5      
        2024249 (x 1.558)     25.8 KB/s 1750 MB/s      9440e0 2204e3
    QuickLZ                                 1.5.1b6      1
        1497111 (x 2.106)      303 MB/s  335 MB/s       159e6  160e6
    RLE64                                   R3.00        64
        2877120 (x 1.096)     1732 MB/s 4228 MB/s       151e6  151e6
    Snappy                                  1.1.0        
        1619277 (x 1.947)      221 MB/s  646 MB/s       107e6  107e6
    tornado                                 0.6a         6
         989437 (x 3.187)     17.1 MB/s 75.9 MB/s        11e6   12e6
    Yappy                                   v2           10
        1565492 (x 2.014)     55.9 MB/s 1119 MB/s        28e6   30e6
    zlib                                    1.2.8        6
        1034381 (x 3.048)     14.4 MB/s  234 MB/s      9883e3   12e6
    zling                                   20140324     0
        1036139 (x 3.043)     38.5 MB/s 96.2 MB/s        25e6   28e6
    zopfli/zlib                             1.0.0/1.2.8  15/6
         976086 (x 3.230)      141 KB/s  240 MB/s        97e3 4234e3
    CityHash32                              1.1.0        
        3152900 (x 1.000)     2165 MB/s 2311 MB/s       32e15  32e15
    CityHash64                              1.1.0        
        3152904 (x 1.000)     4071 MB/s 4582 MB/s       32e15  32e15
    CityHash128                             1.1.0        
        3152912 (x 1.000)     4294 MB/s 4715 MB/s       32e15  32e15
    uhash                                   2007-04-17   
        3152904 (x 1.000)     1350 MB/s 1431 MB/s       32e15  32e15
    vhash                                   2007-04-17   
        3152912 (x 1.000)     5238 MB/s 5905 MB/s       32e15  32e15
    umac                                    2007-04-17   
        3152904 (x 1.000)     1389 MB/s 1413 MB/s       32e15  32e15
    vmac                                    2007-04-17   
        3152912 (x 1.000)     5226 MB/s 5635 MB/s       32e15  32e15
    SipHash24                               reference    
        3152904 (x 1.000)     1014 MB/s 1056 MB/s       32e15  32e15
    murmur3_x86_32                          2012-02-29   
        3152900 (x 1.000)     2197 MB/s 2315 MB/s       32e15  32e15
    murmur3_x86_128                         2012-02-29   
        3152912 (x 1.000)     2225 MB/s 2293 MB/s       32e15  32e15
    murmur3_x64_128                         2012-02-29   
        3152912 (x 1.000)     3481 MB/s 3771 MB/s       32e15  32e15
    SpookyHash                              V2 2012-08-05 
        3152912 (x 1.000)     6098 MB/s 6651 MB/s       32e15  32e15
    FNV1a-Jesteress                         2013-06-16   
        3152900 (x 1.000)     5076 MB/s 5605 MB/s       32e15  32e15
    FNV1a-Mantis                            2013-06-16   
        3152900 (x 1.000)     5100 MB/s 5563 MB/s       32e15  32e15
    FNV1a-Meiyan                            2013-06-16   
        3152900 (x 1.000)     5082 MB/s 5557 MB/s       32e15  32e15
    FNV1a-Tesla                             2013-06-16   
        3152904 (x 1.000)     8100 MB/s 9351 MB/s       32e15  32e15
    FNV1a-Tesla3                            2013-06-16   
        3152904 (x 1.000)     8233 MB/s 9977 MB/s       32e15  32e15
    FNV1a-Yorikke                           2013-06-16   
        3152900 (x 1.000)     5923 MB/s 6856 MB/s       32e15  32e15
    FNV1a-YoshimitsuTRIAD                   2013-06-16   
        3152900 (x 1.000)     6224 MB/s 7030 MB/s       32e15  32e15
    FNV1a-Yoshimura                         2013-06-16   
        3152904 (x 1.000)     5761 MB/s 6525 MB/s       32e15  32e15
    Blake224                                SHA3 Final 64bit opt 
        3152924 (x 1.000)      138 MB/s  137 MB/s       32e15  32e15
    Blake256                                SHA3 Final 64bit opt 
        3152928 (x 1.000)      136 MB/s  137 MB/s       32e15  32e15
    Blake384                                SHA3 Final 64bit opt 
        3152944 (x 1.000)      227 MB/s  228 MB/s       32e15  32e15
    Blake512                                SHA3 Final 64bit opt 
        3152960 (x 1.000)      228 MB/s  230 MB/s       32e15  32e15
    Keccak224                               3.2          
        3152924 (x 1.000)      217 MB/s  225 MB/s       31e15  32e15
    Keccak256                               3.2          
        3152928 (x 1.000)      206 MB/s  209 MB/s       32e15  32e15
    Keccak384                               3.2          
        3152944 (x 1.000)      164 MB/s  165 MB/s       31e15  32e15
    Keccak512                               3.2          
        3152960 (x 1.000)      114 MB/s  114 MB/s       32e15  32e15
    JH224                                   SHA3 Final 64bit opt 
        3152924 (x 1.000)     27.4 MB/s 27.7 MB/s       29e15  30e15
    JH256                                   SHA3 Final 64bit opt 
        3152928 (x 1.000)     27.5 MB/s 27.4 MB/s       29e15  29e15
    JH384                                   SHA3 Final 64bit opt 
        3152944 (x 1.000)     29.0 MB/s 29.5 MB/s       31e15  32e15
    Skein224                                SHA3 Final 64bit opt 
        3152924 (x 1.000)      477 MB/s  476 MB/s       32e15  32e15
    Skein256                                SHA3 Final 64bit opt 
        3152928 (x 1.000)      473 MB/s  479 MB/s       32e15  32e15
    Skein384                                SHA3 Final 64bit opt 
        3152944 (x 1.000)      465 MB/s  475 MB/s       32e15  32e15
    Skein512                                SHA3 Final 64bit opt 
        3152960 (x 1.000)      478 MB/s  483 MB/s       32e15  32e15
    Skein1024                               SHA3 Final 64bit opt 
        3153024 (x 1.000)      272 MB/s  272 MB/s       32e15  32e15
    xxhash                                  r29          
        3152900 (x 1.000)     4366 MB/s 4727 MB/s       32e15  32e15
    AES128Bernstein                         little-4     128
        3152896 (x 1.000)      149 MB/s  152 MB/s         0e0    0e0
    AES256Hongjun                           v1           256
        3152896 (x 1.000)      152 MB/s  156 MB/s         0e0    0e0
    ChaCha                                  20080118     128
        3152896 (x 1.000)      554 MB/s  562 MB/s         0e0    0e0
    HC128                                   2007-01b     128
        3152896 (x 1.000)      939 MB/s  974 MB/s         0e0    0e0
    HC256                                   2007-01      256
        3152896 (x 1.000)      659 MB/s  668 MB/s         0e0    0e0
    Lex                                     v2           128
        3152896 (x 1.000)      401 MB/s  415 MB/s         0e0    0e0
    Rabbit                                  opt2         128
        3152896 (x 1.000)      338 MB/s  356 MB/s         0e0    0e0
    RC4                                     2005-08-21   128
        3152896 (x 1.000)      181 MB/s  181 MB/s         0e0    0e0
    Salsa20_8                               merged       256
        3152896 (x 1.000)      660 MB/s  664 MB/s         0e0    0e0
    Salsa20_12                              merged       256
        3152896 (x 1.000)      431 MB/s  437 MB/s         0e0    0e0
    Salsa20                                 merged       256
        3152896 (x 1.000)      291 MB/s  295 MB/s         0e0    0e0
    Snow2                                   fast         256
        3152896 (x 1.000)      631 MB/s  636 MB/s         0e0    0e0
    Sosemanuk                               2005-04-26   256
        3152896 (x 1.000)      403 MB/s  420 MB/s         0e0    0e0
    Trivium                                 2006-02-23   80
        3152896 (x 1.000)      547 MB/s  563 MB/s         0e0    0e0
    bswap16                                 0            
        3152896 (x 1.000)     1485 MB/s         -         0e0    0e0
    CRC mismatch!
    bswap32                                 0            
        3152896 (x 1.000)     1957 MB/s         -         0e0    0e0
    CRC mismatch!
    bswap64                                 0            
        3152896 (x 1.000)     2417 MB/s         -         0e0    0e0
    CRC mismatch!
    memcpy                                  0            
        3152896 (x 1.000)     2165 MB/s         -         0e0    0e0
    memmove                                 0            
        3152896 (x 1.000)     2195 MB/s         -         0e0    0e0
    nop                                     0            
              1 (x3152896.000)     7457 MB/s 2757 GB/s      7456e6 7456e6
    CRC mismatch!
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    It appears to be by far the fastest LZ decoder out there.
    Note: Unlike e.g. LZ4, it's not safe against LZ stream corruption.

    Note: There are also SIMD versions; I have not benchmarked them.
    Last edited by m^2; 4th May 2014 at 17:11.

  2. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Interesting.
    If I understand properly, the speed comes primarily from only handling matches of length >= 24.

  3. #3
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    download link?

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I won't put it in the 1st post because the author might add new versions at any time and I don't want to maintain the post; it's not my business. But the latest versions are:
    For NOmemcpy:
    http://www.sanmayce.com/Downloads/Na...Omemcpy_FIX.7z
    For others:
    http://www.sanmayce.com/Downloads/Na...D_NOmemcpy.zip

  5. #5
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    There's a new version; in my super-rough test it's 10% stronger and decompresses 10% faster. Its compression is 3x slower, though. Also, the author has created a website.
    http://www.sanmayce.com/Nakamichi/

    Same PC, same data, same codecs, different compiler (Clang 3.1 instead of gcc 4.2) and different settings (no -v, which slows things down):
    Code:
    pcbsd-8973% please ./fsbench -s3 nakamichi fast /usr/home/m/bench/calgary.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               round 5     
        2024249 (x 1.558)      262 KB/s 2005 MB/s        93e3  358e6
    LZ4                                     r114         
        1620566 (x 1.946)      301 MB/s 1203 MB/s       146e6  292e6
    LZF                                     3.6          very
        1534640 (x 2.054)      177 MB/s  501 MB/s        90e6  257e6
    LZO                                     2.06         1x1
        1595374 (x 1.976)      334 MB/s  430 MB/s       165e6  212e6
    QuickLZ                                 1.5.1b6      1
        1497111 (x 2.106)      376 MB/s  334 MB/s       197e6  175e6
    RLE64                                   R3.00        64
        2877120 (x 1.096)     2255 MB/s 4009 MB/s       197e6  263e6
    Snappy                                  1.1.0        
        1619277 (x 1.947)      273 MB/s  752 MB/s       132e6  365e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Updated Nakamichi:
    Code:
    pcbsd-8973% please ./fsbench -s3 nakamichi /usr/home/m/bench/calgary.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               Kaidanji     
        1862139 (x 1.693)     82.6 KB/s 2255 MB/s        33e3  307e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    Sorry, I haven't been doing careful benchmarking recently.

  6. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Could somebody briefly (or not so briefly, if that's more worth your time) explain what this is about?
    Nakamichi is a LZSS codec with super-fast decompression, targeting mainly highly redundant text data.
    Written by Kaze.
    http://www.codeproject.com/Articles/...34#xx4807834xx

    http://www.sanmayce.com/Nakamichi/


    The first link is confusing. The 2nd gets you to a page only a graphic designer could love, and probably only the particular graphic designer who designed it. (Seriously, gray and grayish text on black-and-gray patterned background? Seriously?)

    My wild-ass-guess is that this is only especially good if it matches long strings, or you have long repeated strings in your input, and requires a big ass dictionary with the right long strings in it to be useful unless what you're compressing has lots of long repeats internally... Is that correct? Is this more or less a fine-grained BARF, or what? (Not that that wouldn't be useful for some purposes, but WTF is it?)
    Last edited by Paul W.; 4th May 2014 at 20:51. Reason: orphan tag

  7. #7
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    No, this is a regular LZSS that successfully compresses regular data; as you can see, I test on Calgary.
    It's rather weak for its class, yet much stronger than RLE, with decompression speeds that only RLE beats.

    The encoder is very weak right now, so the strength could be improved together with the speed. The match finder is heavily optimised brute force, that is, it uses a fast strstr. The parser was greedy in the previous version; I haven't looked into the current one.
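
    For illustration, a naive version of such a brute-force match finder might look like the sketch below. The names and MIN_MATCH are placeholders, not the actual encoder; the real one accelerates the scan with a fast strstr.
    Code:
    #include <stddef.h>
    
    /* Naive sketch of the brute-force idea; the real code speeds up the
     * scan with a fast strstr-style search. */
    #define MIN_MATCH 8
    
    static size_t find_match(const unsigned char *buf, size_t pos, size_t size,
                             size_t window, size_t *match_pos)
    {
        size_t best_len = 0;
        size_t start = pos > window ? pos - window : 0;
        for (size_t i = start; i < pos; i++) {       /* scan the whole window */
            size_t len = 0;
            while (pos + len < size && buf[i + len] == buf[pos + len])
                len++;                               /* extend this candidate */
            if (len >= MIN_MATCH && len > best_len) {
                best_len = len;
                *match_pos = i;
            }
        }
        return best_len;  /* 0: no usable match, emit a literal instead */
    }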

  8. The Following User Says Thank You to m^2 For This Useful Post:

    Paul W. (4th May 2014)

  9. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I tested it but didn't add it to LTCB because there wasn't a public website for it. The compression is very poor, and the download unnecessarily includes hundreds of MB of files that he used for testing. Well, I may have to test it now.

    Edit: never mind. As far as I can tell, this is unfinished test code. I'm not going to bother.
    Last edited by Matt Mahoney; 5th May 2014 at 18:48.

  10. #9
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I tested it but didn't add to LTCB because there wasn't a public website for it. The compression is very poor, and the download unnecessarily includes hundreds of MB of files that he used for testing. Well, I may have to test it now.

    Edit: never mind. As far as I can tell, this is unfinished test code. I'm not going to bother.
    You mean that there are a lot of development leftovers all over the code?
    That's the author's style; he'll never clean things up. It may very well be the last version.

  11. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Apparently the compressor works, then runs a 256 GB memcpy() test.

  12. #11
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I don't know what you mean.

  13. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    It compresses the file, then instead of exiting it runs some internal benchmark tests.

  14. #13
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    You mean the main function?
    I never executed it. ^^.

  15. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    There are several versions of the program, I guess with different code to copy arrays using x86, SSE2, AVX, etc. that he was testing.

  16. #15
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Some news:
    1. Please disregard my previous benchmark results. I hadn't set the architecture in the source, and decompression just silently skipped some of the memcpy parts.
    2. The author released a stronger variant, Sanshi.
    3. I added bounds checking to the Kaidanji decoder. The result is basically just as fast as the original (actually, here it's even slightly faster due to some random changes that the compiler optimised better; on another system it was 1.2% slower). To do this, I used a trick that I can't believe is novel, yet I haven't seen it anywhere, though it is universally useful:
    I split the main loop into three (see the sketch after this list).
    * The tail loop is fully safe, but runs on very short strings.
    * The head loop is safe against reading matches that claim to be before the buffer, but may read past the input / output. It runs until we're close to the end of either the input or the output buffer (then we switch to the tail loop) or until we've output a full window size (then we switch to the main loop).
    * The main loop is like the head one, but doesn't even have to check matches.
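
    Here's how the split might look in code. This is my illustration under assumed names, margins, and a placeholder token format (a tag byte with the low 3 bits zero starts a run of tag>>3 literals; anything else is a 2-byte little-endian offset with a fixed 8-byte match), not the actual fsbench code.
    Code:
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    
    #define WINDOW 65536
    #define MARGIN 32  /* >= the most bytes one token can blindly read or write */
    
    static inline int do_token(const uint8_t **ipp, uint8_t **opp,
                               const uint8_t *out, int check_offset)
    {
        const uint8_t *ip = *ipp;
        uint8_t *op = *opp;
        uint8_t tag = *ip;
        if ((tag & 0x07) == 0) {                      /* literal run */
            size_t len = tag >> 3;
            memcpy(op, ip + 1, len);
            ip += 1 + len;
            op += len;
        } else {                                      /* 8-byte match */
            size_t off = (size_t)ip[0] | ((size_t)ip[1] << 8);
            if (check_offset && (size_t)(op - out) < off)
                return -1;                            /* points before the buffer */
            for (int i = 0; i < 8; i++)               /* overlap-safe byte copy */
                op[i] = *(op - off + i);
            ip += 2;
            op += 8;
        }
        *ipp = ip;
        *opp = op;
        return 0;
    }
    
    int decode3(const uint8_t *in, size_t in_size, uint8_t *out, size_t out_size)
    {
        const uint8_t *ip = in, *in_end = in + in_size;
        uint8_t *op = out, *out_end = out + out_size;
    
        /* Head loop: less than one window of output exists yet, so a match
         * could claim to start before the buffer - validate offsets. Stop
         * MARGIN bytes before either buffer's end. */
        while ((size_t)(op - out) < WINDOW &&
               in_end - ip > MARGIN && out_end - op > MARGIN)
            if (do_token(&ip, &op, out, 1)) return -1;
    
        /* Main loop: offsets (at most 65535) can no longer underflow the
         * output buffer, so the offset check disappears entirely. */
        while (in_end - ip > MARGIN && out_end - op > MARGIN)
            if (do_token(&ip, &op, out, 0)) return -1;
    
        /* Tail loop: fully checked and byte-exact, but it only runs on the
         * last few dozen bytes. */
        while (ip < in_end) {
            uint8_t tag = *ip;
            if ((tag & 0x07) == 0) {
                size_t len = tag >> 3;
                if ((size_t)(in_end - ip) < 1 + len ||
                    (size_t)(out_end - op) < len) return -1;
                memcpy(op, ip + 1, len);
                ip += 1 + len;
                op += len;
            } else {
                if (in_end - ip < 2 || (size_t)(out_end - op) < 8) return -1;
                size_t off = (size_t)ip[0] | ((size_t)ip[1] << 8);
                if ((size_t)(op - out) < off) return -1;
                for (int i = 0; i < 8; i++) op[i] = *(op - off + i);
                ip += 2;
                op += 8;
            }
        }
        return 0;
    }
    The key point: by the time the head loop hands over to the main loop, at least WINDOW bytes of output exist, so no 16-bit offset can reach before the buffer; and both fast loops stop MARGIN bytes early, so the fully checked tail loop pays its cost only on a tiny fraction of the data.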

    Benchmark results:
    Code:
    ./fsbench  nakamichi/nakamichi-safe nakamichi nakamichi-sanshi ~/bench/calgary.tar 
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi/Nakamichi-safe                Kaidanji FIX/Kaidanji FIX /
        1862145 (x 1.693)     70.5 KB/s 2129 MB/s        28e3 2520e3
    Nakamichi                               Kaidanji FIX 
        1862145 (x 1.693)     70.5 KB/s 2117 MB/s        28e3 2520e3
    Nakamichi-Sanshi                        Sanshi       
        1810650 (x 1.741)     8619  B/s 1158 MB/s      3669e0 1310e3
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (1*X*1) iteration(s)).
    pcbsd-8973% ./fsbench  nakamichi/nakamichi-safe nakamichi nakamichi-sanshi lz4hc ~/bench/scc2.tar
    Password:
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi/Nakamichi-safe                Kaidanji FIX/Kaidanji FIX /
         519733 (x 1.537)     68.1 KB/s 2517 MB/s        23e3  544e3
    Nakamichi                               Kaidanji FIX 
         519733 (x 1.537)     68.1 KB/s 2501 MB/s        23e3  544e3
    Nakamichi-Sanshi                        Sanshi       
         546107 (x 1.463)     18.0 KB/s 1795 MB/s      5844e0  493e3
    LZ4hc                                   r114         
         350006 (x 2.282)     23.1 MB/s 1440 MB/s        12e6   13e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).

  17. #16
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    News 1: There's a new version from Kaze, Daikuni. Decompression speed close to Sanshi, strength lower, encoding speed terrible.
    News 2: I've played with Nakamichi Kaidanji a bit and can tell you more about it.

    The bit stream is a sequence of tokens. Each token represents either literals or a match.
    Literal tokens are single bytes with the lower 3 bits all zeroed and the remaining 5 encoding the literal length (0-31), followed by the actual literals.
    Match tokens are 2-byte little-endian numbers representing the match offset (0-65535). Match length is always 8.
    This has a quirk: a match offset can't have its lower 3 bits all zero (it would look like a literal token), so every 8th match has to be discarded.

    The encoder is still a brute-force match finder with a greedy parser. It employs one trick to improve decoding cache efficiency: it searches for matches in the tail 16 KB of the window first.

    The speed comes primarily from having very few branches: one per 8 bytes on a match and one per 1-31 bytes on a literal.
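
    To make the token layout concrete, here's a minimal sketch of one decode step as I read the description above; the names are mine, not Kaze's code. Note how a single branch covers 8 output bytes on a match or up to 31 on a literal run.
    Code:
    #include <stdint.h>
    #include <string.h>
    
    static void kaidanji_token(const uint8_t *src, size_t *si,
                               uint8_t *dst, size_t *di)
    {
        uint8_t tag = src[*si];
        if ((tag & 0x07) == 0) {            /* literal token: low 3 bits zero */
            size_t len = tag >> 3;          /* upper 5 bits: 0-31 literals */
            memcpy(dst + *di, src + *si + 1, len);
            *si += 1 + len;
            *di += len;
        } else {                            /* match token: 2-byte LE offset */
            size_t off = (size_t)src[*si] | ((size_t)src[*si + 1] << 8);
            for (int i = 0; i < 8; i++)     /* match length is always 8;
                                               byte copy tolerates overlap */
                dst[*di + i] = dst[*di - off + i];
            *si += 2;
            *di += 8;
        }
    }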

    I tested two modifications to the stream:
    * Encoding literal length as 1-32. That adds a single addition to the decoder. It's slightly stronger, and on most files it actually makes things slightly faster too, though not on very redundant stuff (which is Kaze's focus).
    * Putting the 3-bit tag in the most significant bits (see the snippet after this list). This was supposed to let me get rid of a shift in the decoder. Effectively it moves the 1/8 of excluded matches from being evenly spread to clustered.
    The effect was surprising: strength improved noticeably on binary files (text had a minimal loss), while speed dropped a bit (and improved on texts). I attribute the strength increase to binary files often consisting of structures whose sizes are multiples of 8. I suppose that with a smarter parser this variant may lose some appeal, because the first variant never misses a 9+ byte match, while this one can miss even a 32-byte one. And speed? It surely looks to be directly caused by having more (or fewer) matches; the literal code path is just faster.
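
    On the shift: a hypothetical illustration of the two tag placements for a literal token byte, not the actual code.
    Code:
    #include <stdint.h>
    #include <stddef.h>
    
    /* With the 3-bit tag in the low bits, extracting the 5-bit literal
     * length needs a shift. With the tag in the high bits, the tag bits
     * are known to be zero, so the byte already equals the length. */
    static size_t len_low_tag(uint8_t b)  { return b >> 3; }  /* shift needed */
    static size_t len_high_tag(uint8_t b) { return b; }       /* no shift */
    /* literal tests: (b & 0x07) == 0 vs. (b & 0xE0) == 0 */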

    You can see my version of Nakamichi here:
    https://chiselapp.com/user/Justin_be...d3ea5c55aaae0b

  18. The Following User Says Thank You to m^2 For This Useful Post:

    Paul W. (10th May 2014)

  19. #17
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I wrote an optimal encoder for Nakamichi-M. It's extremely dirty, unoptimised and likely buggy, but in general it seems to work.
    Overall I'm disappointed with strength gains.
    I also re-evaluated the placement of the 3-bit tag. The current position stays.
    I also tried different match lengths, 5-8. 6 is the strongest, but the speed drop outweighs the size improvement, so I stayed with 8.

    The encoder features a cache optimisation: it always picks the closest match. This doesn't help much, but still. I could improve it further by favouring matches that fit within a single cache line, assuming the output buffer is aligned to the cache line size (that's what most malloc implementations return anyway) and assuming a particular line size, but I don't think it's worth bothering with.
    Also, there's a tiny but clear improvement possible: the first 32 bytes can't be compressed at all, because match offsets there are unencodable in the stream format, so they could be copied blindly for a 1-byte gain. Not worth the complication.
    Code:
    pcbsd-8973% please ./fsbench nakamichi nakamichi-m lz4hc ~/bench/scc2.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               Kaidanji FIX 
         519733 (x 1.537)     67.9 KB/s 2482 MB/s        23e3  544e3
    Nakamichi-M                             M            
         506934 (x 1.576)      348 KB/s 2607 MB/s       127e3  569e3
    LZ4hc                                   r114         
         350006 (x 2.282)     22.9 MB/s 1431 MB/s        12e6   12e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    pcbsd-8973% please ./fsbench nakamichi nakamichi-m lz4hc ~/bench/book1 
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               Kaidanji FIX 
         528748 (x 1.454)     65.5 KB/s 2085 MB/s        20e3  468e3
    Nakamichi-M                             M            
         523022 (x 1.470)     1353 KB/s 2120 MB/s       432e3  479e3
    LZ4hc                                   r114         
         363217 (x 2.117)     13.9 MB/s 1015 MB/s      7486e3 7920e3
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    pcbsd-8973% please ./fsbench nakamichi nakamichi-m lz4hc ~/bench/calgary.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               Kaidanji FIX 
        1862145 (x 1.693)     70.2 KB/s 2065 MB/s        28e3 2515e3
    Nakamichi-M                             M            
        1832926 (x 1.720)     44.3 KB/s 2213 MB/s        18e3 2578e3
    LZ4hc                                   r114         
        1195650 (x 2.637)     17.5 MB/s 1275 MB/s        10e6   11e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    I committed it to fsbench:
    https://chiselapp.com/user/Justin_be...sitory/fsbench

  20. The Following User Says Thank You to m^2 For This Useful Post:

    avitar (11th May 2014)

  21. #18
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I've made some progress, but too little to show the code.

    The first thing is a trivial change that trades a lot of strength for a lot of speed:
    increasing the match length from a constant 8 to 16 bytes.
    It's 50% faster while still being much stronger than RLE. I call it m1-16. I expect this to stay the fastest decoder for a while, but I'm abandoning it for something better.

    I introduced variable-length matches in a branchless style that's pretty much the same as that of literals in Kaidanji.
    The code consists of a series of 1-byte tags with appendices. A tag has a match flag; the rest of the bits encode a match / literal length. The length is limited (initially to 32), so branchless operation is possible. After a literal tag come the literals. After a match tag there's a 2-byte offset. I call this variant m2-64.

    This is clearly inefficient, with 2 bits wasted in each tag. In matches, these can be used to increase the window size to 256 KB. This is m2-256 (see the sketch below). It has an interesting property: every m2-64 stream is an m2-256 one too and decodes to the same output. But the m2-256 decoder is slower.
    m2-256 could still easily be made denser by storing not the match length (no more than 31 now) but (match_length - min_match_length). That's to be investigated.
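
    A sketch of one m2-256 token decode under an assumed bit layout (match flag in bit 7, 5-bit length in bits 0-4, the two spare bits 5-6 reused as offset bits 16-17); the real layout may differ, and the real decoder copies branchlessly where this sketch keeps the branch for clarity.
    Code:
    #include <stdint.h>
    #include <string.h>
    
    static void m2_token(const uint8_t *src, size_t *si,
                         uint8_t *dst, size_t *di)
    {
        uint8_t tag = src[(*si)++];
        size_t len = tag & 0x1F;                      /* 5-bit length */
        if (tag & 0x80) {                             /* match */
            size_t off = (size_t)src[*si] | ((size_t)src[*si + 1] << 8);
            off |= (size_t)((tag >> 5) & 0x3) << 16;  /* spare bits: 256 KB window */
            *si += 2;
            for (size_t i = 0; i < len; i++)          /* overlap-safe byte copy */
                dst[*di + i] = dst[*di - off + i];
            *di += len;
        } else {                                      /* literal run */
            memcpy(dst + *di, src + *si, len);
            *si += len;
            *di += len;
        }
    }
    With this layout the containment property above falls out naturally: an m2-64 stream simply leaves the two spare bits zero, so an m2-256 decoder reads the same offsets.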

    m2-256, while clearly redundant, fares OK. In my initial tests it's very close to LZ4hc. LZ4hc is not optimal when it comes to strength, but CBloom's experiments suggest that it's nearly so. My encoder is optimal.
    When used to its full potential, though, m2-256 is slow: 20% slower than LZ4. But I never intended to use its full potential anyway. The trick now is to create not the densest compressed stream, but the one that's most efficient in terms of decoding speed vs. size.
    I did the most basic thing here: constrained the minimum match length to be no less than some constant. With a min_match of 10-12 the code is fairly efficient, a clear improvement over m1-8 and, though not as fast, also a clear improvement over m1-16.
    There's a minor problem: the parser got much slower. Right now, with some optimisations already applied, I compress calgary.tar at 2-3 KB/s. That makes tweaking the parameters (min_match, max_match, max_literal_length) painful.
    As for optimising the output stream, I've done three things so far:
    * Of two match choices that produce paths of the same length, I pick the one with the lower offset to improve cache efficiency (like in m1). It would be better to analyse not just the current match but the whole path for cache efficiency, though. TODO.
    * I tried a basic cost model where I pick a match not whenever it yields a smaller output than a literal, but only when it's smaller by at least X bytes. This utterly failed; efficiency only got lower.
    * I made a suboptimal but much faster encoding mode where I take the first match that's max_match long. I was shocked to see that this produced a stream that decoded much faster with only minimal strength loss; much more efficient overall. I have no idea what's going on and need to dig into it. Any suggestions? (ADDED: I guess it's about paths with many short matches vs. fewer longer ones. Explicitly modelling the time needed to process each token would be better.)

    ADDED:
    m1-16 on my PC is 62% faster than m1-8.
    Code:
    pcbsd-8973% ./fsbench -i10 -s1000 rle64 nakamichi-m ~/bench/scc2.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    RLE64                                   R3.00        64
         752456 (x 1.061)     2357 MB/s 4801 MB/s       136e6  278e6
    Nakamichi-M                             M1           
         592513 (x 1.348)      474 KB/s 4234 MB/s       122e3 1093e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (10*X*1) iteration(s)).
    Last edited by m^2; 15th May 2014 at 06:02.

  22. #19
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    M2.0 is here. It's as described above, with a 256 KB dictionary.
    The encoder leaves a lot to be desired, so there's a fair amount of room for improvement without changing the stream format.
    I also added the new version from Kaze to fsbench.
    Code:
    pcbsd-8973% please ./fsbench nakamichi nakamichi-m lz4hc ~/bench/scc1.tar   
    Password:
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    Nakamichi                               Kaidanji FIX 
        7984013 (x 1.578)     73.3 KB/s 2038 MB/s        26e3  746e6
    Nakamichi-M                             M1           
        7846626 (x 1.605)      129 KB/s 2186 MB/s        48e3  824e6
    Nakamichi-M                             M2.0         
        7613754 (x 1.654)     40.2 KB/s 2212 MB/s        15e3  874e6
    LZ4hc                                   r114         
        5614216 (x 2.243)     16.6 MB/s 1324 MB/s      9442e3  733e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    please ./fsbench nakamichi-sanbashi nakamichi nakamichi-m lz4hc ~/bench/calgary.tar
    Nakamichi-Sanbashi                      Sanbashi     
        1560637 (x 2.020)     5854  B/s 1044 MB/s      2956e0  527e6
    Nakamichi                               Kaidanji FIX 
        1862145 (x 1.693)     70.5 KB/s 2095 MB/s        28e3  857e6
    Nakamichi-M                             M1           
        1832926 (x 1.720)     41.4 KB/s 2227 MB/s        17e3  932e6
    Nakamichi-M                             M2.0         
        1892330 (x 1.666)     10.8 KB/s 2436 MB/s      4419e0  973e6
    LZ4hc                                   r114         
        1195650 (x 2.637)     17.8 MB/s 1278 MB/s        11e6  793e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (4*X*1) iteration(s)).
    I skipped the tests on scc2; it is uniquely suited to codecs with a 64 KB dictionary like Nakamichi Kaidanji, M1, and LZ4hc, but not M2.0. :P SCC1 favours 1 MB; for Calgary I don't know.
    Last edited by m^2; 19th May 2014 at 23:50.

  23. #20
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Kaze has created a very interesting variant, Aratama, where the main decoder loop body is fully branchless.
    Simplified code:
    Code:
    while (srcIndex < srcSize) {
        /* one 16-bit load gets the tag byte plus the next byte;
           a literal tag has bits 3-7 of its first byte clear */
        WORDpair = *(unsigned short int*)&src[srcIndex];
        Flag = !(WORDpair & 0xF8);  /* 1 = literal run, 0 = match */
    
        /* branchless source select: the literals right after the tag, or
           the match WORDpair bytes back in the output; blindly copy 8 bytes */
        *(uint64_t*)(ret+retIndex) = *(uint64_t*)( (uintptr_t)(src+srcIndex+1)*Flag + (uintptr_t)(ret+retIndex-WORDpair)*!Flag );
    
        /* advance input by tag byte + literal count, or by the 2-byte match
           token; output grows by the literal count or the fixed match length */
        srcIndex += ((WORDpair & 0xFF)+1)*Flag + 2*!Flag;
        retIndex += (WORDpair & 0xFF)*Flag + Match_Length*!Flag;
    }
    Author's results:
    Code:
    // For enwik8 on Q9550s:
    // | Nakamichi 'Kaidanji'     (64KB) |  063,430,147 / 0283161 / 1014 MB/s |
    // | Nakamichi 'Aratama' ML=8 (64KB) |  066,251,713 / 4170885 /  866 MB/s |
    // | Nakamichi 'Aratama' ML=7 (64KB) |  061,499,908 / 2800002 /  762 MB/s |
    // | Nakamichi 'Aratama' ML=5 (64KB) |  056,528,218 / 0717352 /  607 MB/s |
    // | Nakamichi 'Hanazakari'   (64KB) |  054,693,537 / 0031544 /  756 MB/s |

  24. #21
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I was a bit surprised initially by the results,
    since 800 MB/s decoding speed does not look particularly fast for a Q9550 @ 2.83GHz.
    I was expecting something faster considering the branchless property of the decoder.

    Could it be that the extra maths required to calculate both source pointers, and to finally select the right one by masking with the flag, ends up costing more than a predictable branch?

    If that is the case, maybe a more modern Core i7 CPU could benefit more from this methodology, thanks to its more efficient ALUs.

  25. #22
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    It looks to be disk-to-disk performance. Another table from the author's site:
    Code:
    -------------------------------------------------------------------------------------------------------------------------------------------------
    | compressor \ filedataset      | alice29.txt    | CalgaryCorpus.tar | shaks12.txt      | dickens            | enwik8                           |
    -------------------------------------------------------------------------------------------------------------------------------------------------
    | UNCOMPRESSED                  | 152,089        | 3,153,408         | 5,582,655        | 10,192,446         | 100,000,000                      |
    | Nakamichi 'Kaidanji'   (64KB) | 092,285 / 0328 | 1,862,449 / 11838 | 3,391,657 / 6799 | 06,387,079 / 14977 | 063,430,147 / 283161 / 1014 MB/s |
    | Nakamichi 'Hanazakari' (64KB) | 080,270 / 0218 | 1,620,653 / 05396 | 2,737,003 / 0452 | 05,068,626 / 00630 | 054,693,537 / 031544 /  756 MB/s |
    | Nakamichi 'Sanbashi'    (2MB) | 095,682 / 0267 | 1,560,737 / 06841 | 2,559,110 / 0519 | 04,614,250 / 00558 | 046,881,842 / 026545 /  607 MB/s |
    | Nakamichi 'Kaiko'      (16MB) | 100,756 / 1371 | 1,834,530 / 22877 | 2,704,445 / 4412 | 04,694,952 / 04356 | 047,016,954 / 097458 /  289 MB/s |
    | 7z's gz, Ultra Deflate32      | 051,707        | 0,980,026         | 1,934,787        | 03,681,828         | 035,102,891                      |
    | 7z's zip, Ultra Deflate64     | 050,051        | 0,945,849         | 1,834,240        | 03,508,645         | 033,757,921                      |
    | TANGELO 2.3                   | 039,160        | 0,710,066         | 1,236,021        | 02,279,659         | 020,921,619                      |
    | LZ4 v1.4, -9                  | 063,705        | 1,195,853         | 2,315,036        | 04,442,992         | 042,283,904        / 2186.9 MB/s |
    | Yappy, 8192 10000             | 087,965        | 1,654,203         | 3,337,964        | 06,374,780         | 057,701,807        /  698.7 MB/s |
    | Yappy, 65536 10000            | 081,217        | 1,544,271         | 3,120,688        | 05,912,295         | 054,162,908        /  679.4 MB/s |
    | Yappy, 1048576 10000          | 080,353        | 1,530,823         | 3,091,493        | 05,850,648         | 053,687,370        /  679.4 MB/s |
    -------------------------------------------------------------------------------------------------------------------------------------------------
    LZ4 beats Kaidanji by a lot, and I suspect the reason is a different I/O subsystem.

  26. #23
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    This is too fast for I/O. These are in-memory benchmark results.

    My understanding, for this specific benchmark, is that LZ4 results are multi-threaded, which would explain the difference.
    Last edited by Cyan; 17th June 2014 at 12:42.

  27. The Following User Says Thank You to Cyan For This Useful Post:

    Kennon Conrad (17th June 2014)

  28. #24
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Kaze said you guessed correctly. It's multithreaded LZ4 and single-threaded Nakamichi, all in-memory.

  29. #25
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Kaze said you guessed correctly. It's multithreaded LZ4 and single-threaded Nakamichi, all in-memory.
    ADDED:
    Admin, please delete the duplicate.
