
Thread: In-memory open source benchmark of entropy coders

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 165 Times in 63 Posts

    In-memory open source benchmark of entropy coders

Attached is an in-memory benchmark of entropy coders (source code and Win64 executable included). All supported entropy coders are built into a single executable. At startup the input file is read into memory; then every entropy coder is used to compress and decompress the file, and the decompressed memory is verified. The "-i" option selects the number of iterations (default 1); the fastest time of the N iterations is reported. The big advantage of this approach is that all entropy coders are built with the same compiler and the same optimizations. The disadvantage is that it requires the source code of each entropy coder. For a 64-bit compilation use "make BUILD_SYSTEM=64-bit".
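Judging from the commands quoted later in this thread, "-i" sets the iteration count and "-b" the chunk size in KB, so a typical invocation would be:

    Code:
    entropy_bench.exe -i5 -b64 100m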

    The results below were obtained using 1 core of an Intel Core i5-520M 2.4 GHz, Windows 8 64-bit (MinGW-w64 compilation under gcc 4.8.2) and 5 iterations. The input file (100 MB) is a concatenation of 10 different files, about 10 MB each: bmp, dct_coeffs, english_dic, ENWIK, exe, fp_log, hlp, XML, pdf, ncb. Each result line shows the compression time and speed, the compressed size, and the decompression time and speed.

    Code:
    memcpy [chunk_size] [99 MB]     32 ms (3199 MB/s), 104854004, 32 ms (3199 MB/s)
    FSE 2014-04-07 [64 KB]          582 ms (175 MB/s), 63254759, 261 ms (392 MB/s)
    FSE 2014-04-07 [99 MB]          561 ms (182 MB/s), 84899621, 250 ms (409 MB/s)
    ryg_rANS 2014-02-18 [64 KB]     817 ms (125 MB/s), 66078971, 888 ms (115 MB/s)
    ryg_rANS 2014-02-18 [99 MB]     793 ms (129 MB/s), 84830126, 902 ms (113 MB/s)
    ryg_rANS interleaved [64 KB]    723 ms (141 MB/s), 66084585, 563 ms (181 MB/s)
    ryg_rANS interleaved [99 MB]    698 ms (146 MB/s), 84830137, 547 ms (187 MB/s)
    ryg_rANS64 2014-02-18 [64 KB]   860 ms (119 MB/s), 66082876, 683 ms (149 MB/s)
    ryg_rANS64 2014-02-18 [99 MB]   852 ms (120 MB/s), 84830136, 673 ms (152 MB/s)
    ryg_rANS64 interleaved [64 KB]  646 ms (158 MB/s), 66092596, 424 ms (241 MB/s)
    ryg_rANS64 interleaved [99 MB]  647 ms (158 MB/s), 84830140, 416 ms (246 MB/s)
    tornado ArithCoder [64 KB]      1476 ms (69 MB/s), 67962411, 1997 ms (51 MB/s)
    tornado ArithCoder [99 MB]      1446 ms (70 MB/s), 63192926, 2025 ms (50 MB/s)
    tornado HuffCoder [64 KB]       773 ms (132 MB/s), 65826175, 1216 ms (84 MB/s)
    tornado HuffCoder [99 MB]       771 ms (132 MB/s), 64701654, 1207 ms (84 MB/s)
    tornado HuffCoder o1 [99 MB]    1114 ms (91 MB/s), 55655031, 2420 ms (42 MB/s)
    zlibh [64 KB]                   626 ms (163 MB/s), 64008871, 562 ms (182 MB/s)
    Last edited by inikep; 14th April 2014 at 18:22.

  2. The Following 12 Users Say Thank You to inikep For This Useful Post:

    Black_Fox (14th April 2014),Bulat Ziganshin (14th April 2014),cade (15th April 2014),Cyan (15th April 2014),JamesB (16th April 2014),Matt Mahoney (14th April 2014),Mike (15th April 2014),m^2 (14th April 2014),ne0n (15th April 2014),Paul W. (14th April 2014),RichSelian (14th April 2014)

  3. #2
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 165 Times in 63 Posts
    entropy_bench v0.2 adds ryg_rANS_SIMD:

    Code:
    ryg_rANS_SIMD [64 KB]           920 ms (111 MB/s), 66287954, 250 ms (409 MB/s)
    ryg_rANS_SIMD [99 MB]           968 ms (105 MB/s), 84846424, 243 ms (421 MB/s)
    Attached Files

  4. The Following User Says Thank You to inikep For This Useful Post:

    Nania Francesco (12th November 2014)

  5. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    Haswell i7-4770 @ 3.9 GHz:
    Code:
    C:\Testing\entropy_bench02>entropy_bench.exe -i5 z:\100m
    entropy_benchmark 0.2 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [95 MB]     10 ms (9765 MB/s), 100000000, 10 ms (9765 MB/s)
    FSE 2014-04-07 [64 KB]          288 ms (339 MB/s), 74167964, 184 ms (530 MB/s)
    FSE 2014-04-07 [95 MB]          277 ms (352 MB/s), 82858740, 176 ms (554 MB/s)
    ryg_rANS 2014-02-18 [64 KB]     488 ms (200 MB/s), 76766075, 588 ms (166 MB/s)
    ryg_rANS 2014-02-18 [95 MB]     455 ms (214 MB/s), 82788588, 560 ms (174 MB/s)
    ryg_rANS interleaved [64 KB]    391 ms (249 MB/s), 76771385, 368 ms (265 MB/s)
    ryg_rANS interleaved [95 MB]    354 ms (275 MB/s), 82788591, 338 ms (288 MB/s)
    ryg_rANS64 2014-02-18 [64 KB]   397 ms (245 MB/s), 76770000, 494 ms (197 MB/s)
    ryg_rANS64 2014-02-18 [95 MB]   386 ms (252 MB/s), 82788588, 485 ms (201 MB/s)
    ryg_rANS64 interleaved [64 KB]  318 ms (307 MB/s), 76779116, 284 ms (343 MB/s)
    ryg_rANS64 interleaved [95 MB]  306 ms (319 MB/s), 82788596, 272 ms (359 MB/s)
    ryg_rANS_SIMD [64 KB]           700 ms (139 MB/s), 78046438, 160 ms (610 MB/s)
    ryg_rANS_SIMD [95 MB]           697 ms (140 MB/s), 82803834, 154 ms (634 MB/s)
    tornado ArithCoder [64 KB]      1117 ms (87 MB/s), 76437023, 1359 ms (71 MB/s)
    tornado ArithCoder [95 MB]      1135 ms (86 MB/s), 74184537, 1391 ms (70 MB/s)
    tornado HuffCoder [64 KB]       659 ms (148 MB/s), 76469288, 689 ms (141 MB/s)
    tornado HuffCoder [95 MB]       663 ms (147 MB/s), 76018771, 684 ms (142 MB/s)
    tornado HuffCoder o1 [95 MB]    805 ms (121 MB/s), 68587351, 1594 ms (61 MB/s)
    Unfortunately, it fails on a 1 GB file, and zlibh fails too. Can you please add a ryg tANS coder? And share your test data so we have a common test file (although it's far from an LZ stream anyway)? Also, I think that a good Huffman implementation should outperform FSE in encoding speed and be slightly faster at decoding too. If possible, please look into the 7-zip code; Igor told me that his Huffman implementation should be faster than my tornado code.

  6. #4
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Igor told me that his Huffman implementation should be faster than my tornado code
    What do you think about the decoder speed? 7-zip's Huffman decoder uses multiple tables; the one in tornado uses two tables and one (mostly) predictable branch. I thought that if the Huffman code is limited to 16 bits with a 9-bit fast length, it would mean filling a 64 KB and a 512-byte table once per block, and that would be fast and simple for decoding.
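    For illustration, a minimal sketch of such a two-table decoder (all names are mine, peekbits()/skipbits() are assumed bit-reader primitives, and the entry layout here does not exactly reproduce the 64 KB / 512 B sizes mentioned above):

    Code:
    #include <stdint.h>

    extern uint32_t peekbits(int n);   /* assumed bit-reader primitives */
    extern void     skipbits(int n);

    typedef struct { uint16_t sym; uint8_t len; } Entry;

    static Entry fast[1 << 9];    /* one entry per 9-bit prefix            */
    static Entry slow[1 << 16];   /* fallback, indexed by full 16-bit peek */

    static int decode_symbol(void)
    {
        uint32_t window = peekbits(16);   /* next 16 bits of the stream      */
        Entry e = fast[window >> 7];      /* top 9 bits select a fast entry  */
        if (e.len > 9)                    /* the (mostly) predictable branch */
            e = slow[window];             /* long code: full 16-bit lookup   */
        skipbits(e.len);
        return e.sym;
    }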

  7. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    My huf decoder calls the following function:

    Code:
        uint needbits (int n)    {
            if (bitcount<=32)
                bitbuf |= uint64(get32()) << bitcount, bitcount+=32;
            return mask(bitbuf,n);
        }
    I think this unpredictable jump is what makes tornado slower than FSE. OTOH, zlib is also faster than tornado, so it may not be the only problem. Of course, tornado doesn't transmit Huffman tables, which makes its decoding much slower than zlib's.

  8. #6
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Oh, right... IO filling. Although I guess all bit-IO readers have the same underflow problem.

  9. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    FSE employs a jumpless bitcoder; look at this topic for an example. A jumpless putbits() by itself needs <10 arithmetic operations, so a full order-0 Huffman coder should run at ~1 GB/s:

    Code:
    void huf_encode(BYTE c)
    {
      code = huf_codes[c];
      bits = huf_bits[c];
      putbits(code,bits);
    }
    Last edited by Bulat Ziganshin; 15th April 2014 at 15:36.

  10. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    cade (15th April 2014)

  11. #8
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Can you please add a ryg tANS coder?
    As far as I know, there is no ryg tANS coder. Only rANS.

    However, there are cbloom's tANS & rANS coders.
    They are open source, BUT rely heavily on cblib, which is really huge and Windows-only.

  12. The Following User Says Thank You to Cyan For This Useful Post:

    Bulat Ziganshin (15th April 2014)

  13. #9
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    FSE employs a jumpless bitcoder; look at this topic for an example
    Those are some very nice tips and tricks in there, thanks!

  14. #10
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    Since Tornado ArithCoder & HuffCoder get better results at 95 MB than at 64 KB,
    I have to assume that they are adaptive, or semi-adaptive, and internally break the input into smaller blocks for better entropy adaptation?

  15. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    from tornado sources:

    // Code for various streams and entropy codecs:
    // - semi-adaptive huffman codec
    // - fast semi-adaptive arithmetic codec
    // Here semi-adaptive means that statistics collected on previous symbols are
    // used to encode new ones, but encoding tables are updated with some intervals
    // (after each 10k-20k symbols encoded)


    Also, inikep uses "%" in the tornado driver, which likely ruins its speed:
    Code:
        for (int i = 0; i < insize; i++)
        {
            if ((i%chunk_size) == 0)
                huff.flush();
            huff.encode(inbuf[i]);
        }
    And here Charles Bloom found that HUF has 1.5x faster encoding than FSE. Somewhere in previous posts he said that when tested alone (without an LZ part or something like that), codecs may suffer from a lack of ILP, and it may be necessary to run 2 codecs in parallel to make things faster.
    Last edited by Bulat Ziganshin; 15th April 2014 at 16:01.

  16. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Paul W. (15th April 2014)

  17. #12
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 165 Times in 63 Posts
    @Bulat
    1. zlibh is buggy (it doesn't even work with chunks over 64 KB)
    2. If you're thinking about replacing (i%chunk_size) with (i&(chunk_size-1)), it will not work, because the default chunk size can be changed with the "-b" option (you can always change the sources and check the speed difference yourself); a restructuring that avoids the modulo for any chunk size is sketched below.
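    A minimal sketch (reusing the names from the driver snippet quoted above; std::min needs <algorithm>) of restructuring the loop so the chunk boundary is handled by an outer loop, with no per-symbol "%" and no power-of-two requirement:

    Code:
    for (int base = 0; base < insize; base += chunk_size)
    {
        huff.flush();                                   // flush at each chunk start, as before
        int end = std::min(base + chunk_size, insize);  // last chunk may be short
        for (int i = base; i < end; i++)
            huff.encode(inbuf[i]);
    }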

  18. #13
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    I added my wrapped-up ryg-rANS(32) drivers to see how they compared. This is basically just 4-way interleaving instead of 2-way, with order-1 entropy stats as well as order-0. (I know that's technically mixing the model with the entropy coding, but it's an appropriate simplification for the DNA quality strings I've been looking at.)

    Code:
    jkb@seq3a[compress.../entropy_...] ./entropy_bench -b64 -i3 100m
    entropy_benchmark 0.2 (64-bit Linux) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [95 MB]     27 ms (3616 MB/s), 100000000, 28 ms (3487 MB/s)
    FSE 2014-04-07 [64 KB]          427 ms (228 MB/s), 43042331, 297 ms (328 MB/s)
    FSE 2014-04-07 [95 MB]          419 ms (233 MB/s), 43220627, 283 ms (345 MB/s)
    ryg_rANS 2014-02-18 [64 KB]     655 ms (149 MB/s), 46070442, 907 ms (107 MB/s)
    ryg_rANS 2014-02-18 [95 MB]     645 ms (151 MB/s), 43203696, 889 ms (109 MB/s)
    ryg_rANS interleaved [64 KB]    571 ms (171 MB/s), 46075762, 557 ms (175 MB/s)
    ryg_rANS interleaved [95 MB]    566 ms (172 MB/s), 43203696, 537 ms (181 MB/s)
    ryg_rANS64 2014-02-18 [64 KB]   567 ms (172 MB/s), 46074280, 744 ms (131 MB/s)
    ryg_rANS64 2014-02-18 [95 MB]   561 ms (174 MB/s), 43203688, 725 ms (134 MB/s)
    ryg_rANS64 interleaved [64 KB]  484 ms (201 MB/s), 46083380, 436 ms (223 MB/s)
    ryg_rANS64 interleaved [95 MB]  477 ms (204 MB/s), 43203696, 417 ms (234 MB/s)
    ryg_rANS_SIMD [64 KB]           815 ms (119 MB/s), 46108652, 208 ms (469 MB/s)
    ryg_rANS_SIMD [95 MB]           814 ms (119 MB/s), 43211504, 201 ms (485 MB/s)
    tornado ArithCoder [64 KB]      1392 ms (70 MB/s), 50179429, 1795 ms (54 MB/s)
    tornado ArithCoder [95 MB]      1160 ms (84 MB/s), 43263298, 1807 ms (54 MB/s)
    tornado HuffCoder [64 KB]       793 ms (123 MB/s), 45212263, 845 ms (115 MB/s)
    tornado HuffCoder [95 MB]       789 ms (123 MB/s), 43654432, 849 ms (115 MB/s)
    tornado HuffCoder o1 [95 MB]    887 ms (110 MB/s), 34727880, 1478 ms (66 MB/s)
    ryg_rANS+jkb-O0 [64 KB]         503 ms (194 MB/s), 43051785, 411 ms (237 MB/s)
    ryg_rANS+jkb-O0 [95 MB]         513 ms (190 MB/s), 43216457, 408 ms (239 MB/s)
    ryg_rANS+jkb-O1 [64 KB]         867 ms (112 MB/s), 35804852, 669 ms (145 MB/s)
    ryg_rANS+jkb-O1 [95 MB]         777 ms (125 MB/s), 33092156, 588 ms (166 MB/s)
    sh_arith+jkb-O0 [64 KB]         614 ms (159 MB/s), 43101134, 1001 ms (97 MB/s)
    sh_arith+jkb-O0 [95 MB]         642 ms (152 MB/s), 43216872, 1004 ms (97 MB/s)
    sh_arith+jkb-O1 [64 KB]         944 ms (103 MB/s), 35987822, 1263 ms (77 MB/s)
    sh_arith+jkb-O1 [95 MB]         879 ms (111 MB/s), 33092667, 1198 ms (81 MB/s)
    zlibh [64 KB]                   449 ms (217 MB/s), 43445698, 482 ms (202 MB/s)
    done... (3 iterations)
    It appears that on this system (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz), more interleaving of the 32-bit ryg_rANS code still yields a considerable improvement.

    Edit: Now with the static interleaved arithmetic coder (which is what I started on just as rANS implementations started showing up).

    I also had to make a few tweaks to get it to build on Linux:

    Code:
    *** entropy_bench.cpp~  Mon Apr 14 16:57:22 2014
    --- entropy_bench.cpp   Wed Apr 16 17:19:59 2014
    ***************
    *** 30,39 ****
    --- 30,43 ----
      #include <algorithm> // std::sort
      #include <vector>
      #include <numeric>
    + #include <unistd.h>
    + #include <sys/time.h>
    + #include <sys/resource.h>
      #include "zlibh/zlibh.h"
      #include "fse/fse.h"
      #include "tornado/tor_test.h"
      #include "ryg_rans/ryg_rans.h"
    + #include "rANS_demo-3/rANS_static4.h"
    
    
      #if defined(WIN32) || defined(_WIN32) || defined(__WIN32__) || defined(WIN64) || defined(_WIN64)
    *** tornado/EntropyCoder.cpp~   Thu Apr 10 17:30:32 2014
    --- tornado/EntropyCoder.cpp    Wed Apr 16 16:26:15 2014
    ***************
    *** 1,3 ****
    --- 1,11 ----
    + #if defined(WIN32) || defined(_WIN32) || defined(__WIN32__) || defined(WIN64) || defined(_WIN64)
    + #     define WINDOWS
    + #endif
    +
    + #ifndef WINDOWS
    + #     define __cdecl
    + #endif
    +
      // Code for various streams and entropy codecs:
      //   - in/out byte streams
      //   - in/out bit streams
    Last edited by JamesB; 16th April 2014 at 19:43.

  19. The Following User Says Thank You to JamesB For This Useful Post:

    inikep (17th April 2014)

  20. #14
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    My huf decoder calls the following function:

    Code:
        uint needbits (int n)    {
            if (bitcount<=32)
                bitbuf |= uint64(get32()) << bitcount, bitcount+=32;
            return mask(bitbuf,n);
        }
    I think this unpredictable jump is what makes tornado slower than FSE. OTOH, zlib is also faster than tornado, so it may not be the only problem. Of course, tornado doesn't transmit Huffman tables, which makes its decoding much slower than zlib's.
    I had a branchless idea, if the encoder and decoder have the same endianness and we have 64 KB (or some decent amount) of coded buffer in memory:
    Code:
    int bits_passed = 0;
    byte buffer[0x10000];

    uint32 read_bits(int num)
    {
        // load 4 bytes starting at the byte holding the current bit position
        uint32 res = *(uint32*)&buffer[bits_passed >> 3] >> (bits_passed & 7);
        bits_passed += num;
        return mask(res, num);
    }
    bits_passed >> 3 gives the byte index, and bits_passed & 7 is the offset within the first byte. If the coded output for 64k symbols is larger than 64 KB, the buffer can get some extra padding. Not sure how good this is, though, especially if handling endianness turns out to be slower.
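    As a side note, a 64-bit variant of the same idea (a sketch under the same assumptions, with the same buffer/bits_passed as above; memcpy from <string.h> makes the unaligned load portable) guarantees at least 57 usable bits per call even after the 0-7 bit intra-byte shift:

    Code:
    uint64 read_bits64(int num)    /* num must be <= 57 */
    {
        uint64 res;
        memcpy(&res, &buffer[bits_passed >> 3], 8);   /* unaligned-safe 8-byte load */
        res >>= (bits_passed & 7);
        bits_passed += num;
        return res & ((1ull << num) - 1);
    }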

  21. #15
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    For anyone interested:

    There is a new fast Huffman entropy coder available in open source, which can be grabbed at:
    https://github.com/Cyan4973/FiniteStateEntropy

    It reaches 500 MB/s per core.

    A few technical explanations are provided here:
    http://fastcompression.blogspot.com/...ed-part-1.html
    http://fastcompression.blogspot.com/...2-decoder.html
    http://fastcompression.blogspot.com/...h-limited.html

  22. The Following User Says Thank You to Cyan For This Useful Post:

    inikep (8th October 2015)

  23. #16
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Out of curiosity, something I already asked myself the first time I read this benchmark:
    if entropy coders are mainly used after LZ coders or similar, doesn't this benchmark potentially hide the strength of many of these coders in real use?
    I think the speed could change/shift dramatically, and the compression ratio too.
    And yes, I know this is available for download so that anyone can test whatever they like...

  24. #17
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    854
    Thanks
    45
    Thanked 104 Times in 82 Posts
    Crashed on me. I probably did something wrong; this is the console output, and I got a Windows message that the process had crashed.

    Code:
    R:\entropy_bench02>entropy_bench.exe "S0E01 - Past Transgressions.mp4"
    entropy_benchmark 0.2 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [170 MB]    58 ms (3008 MB/s), 178654994, 58 ms (3008 MB/s)
    FSE 2014-04-07 [64 KB]          543 ms (321 MB/s), 178061434, 24 ms (7269 MB/s)
    count=178654990  24 102 116 121!=0 0 0 0
    ERROR: common=3
    FSE 2014-04-07 [170 MB]         519 ms (336 MB/s), 178631116, 1 ms (174467 MB/s)
    
    
    R:\entropy_bench02>

    Code:
    R:\entropy_bench02>entropy_bench.exe master-of-orion.zip
    entropy_benchmark 0.2 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [5069 KB]   1 ms (5069 MB/s), 5190984, 1 ms (5069 MB/s)
    FSE 2014-04-07 [64 KB]          15 ms (337 MB/s), 5190859, 1 ms (5069 MB/s)
    
    R:\entropy_bench02>
    Last edited by SvenBent; 31st July 2015 at 02:53.

  25. #18
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    Maybe the FSE code should be updated, if it dates back to 2014-04-07 as mentioned.

  26. #19
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 165 Times in 63 Posts
    Sorry, I can't check it as I'm on vacation. This benchmark was not intended for incompressible files (your MP4), so maybe that is the problem. Please try something uncompressed.

  27. #20
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    854
    Thanks
    45
    Thanked 104 Times in 82 Posts
    Quote Originally Posted by inikep View Post
    Sorry, I can't check it as I'm on vacation. This benchmark was not intended for incompressible files (your MP4), so maybe that is the problem. Please try something uncompressed.
    I don't really have anything big enough that isn't compressed in some kind of way.

    Code:
    R:\entropy_bench02>entropy_bench.exe "Cryptography For Dummies.pdf"
    entropy_benchmark 0.2 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [8356 KB]   4 ms (2089 MB/s), 8557108, 4 ms (2089 MB/s)
    FSE 2014-04-07 [64 KB]          36 ms (232 MB/s), 8522425, 3 ms (2785 MB/s)
    Crashed again.


    I've tried with a smaller PDF file, where it succeeded. Maybe it's a size error?
    Code:
    R:\entropy_bench02>entropy_bench.exe "PN for STD.pdf"
    entropy_benchmark 0.2 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [1919 KB]   1 ms (1919 MB/s), 1965800, 1 ms (1919 MB/s)
    FSE 2014-04-07 [64 KB]          5 ms (383 MB/s), 1910438, 2 ms (959 MB/s)
    FSE 2014-04-07 [1919 KB]        5 ms (383 MB/s), 1931063, 3 ms (639 MB/s)
    ryg_rANS 2014-02-18 [64 KB]     6 ms (319 MB/s), 1963463, 10 ms (191 MB/s)
    ryg_rANS 2014-02-18 [1919 KB]   6 ms (319 MB/s), 1929153, 10 ms (191 MB/s)
    ryg_rANS interleaved [64 KB]    5 ms (383 MB/s), 1963569, 5 ms (383 MB/s)
    ryg_rANS interleaved [1919 KB]  4 ms (479 MB/s), 1929157, 5 ms (383 MB/s)
    ryg_rANS64 2014-02-18 [64 KB]   6 ms (319 MB/s), 1963540, 8 ms (239 MB/s)
    ryg_rANS64 2014-02-18 [1919 KB]  6 ms (319 MB/s), 1929156, 8 ms (239 MB/s)
    ryg_rANS64 interleaved [64 KB]  5 ms (383 MB/s), 1963736, 4 ms (479 MB/s)
    ryg_rANS64 interleaved [1919 KB]  4 ms (479 MB/s), 1929160, 4 ms (479 MB/s)
    ryg_rANS_SIMD [64 KB]           10 ms (191 MB/s), 1964226, 2 ms (959 MB/s)
    ryg_rANS_SIMD [1919 KB]         10 ms (191 MB/s), 1929294, 2 ms (959 MB/s)
    tornado ArithCoder [64 KB]      19 ms (101 MB/s), 1907899, 24 ms (79 MB/s)
    tornado ArithCoder [1919 KB]    18 ms (106 MB/s), 1903060, 23 ms (83 MB/s)
    tornado HuffCoder [64 KB]       11 ms (174 MB/s), 1917026, 12 ms (159 MB/s)
    tornado HuffCoder [1919 KB]     10 ms (191 MB/s), 1913247, 12 ms (159 MB/s)
    tornado HuffCoder o1 [1919 KB]  20 ms (95 MB/s), 1842807, 44 ms (43 MB/s)
    zlibh [64 KB]                   5 ms (383 MB/s), 1908615, 6 ms (319 MB/s)
    done... (1 iterations)
    But such small files do not give detailed enough results.
    Last edited by SvenBent; 2nd August 2015 at 18:22.

  28. #21
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts

    tANS - Huffman hybrid

    Indeed, while the tANS encoding step is e.g.
    nbBits = k[s] - (x < bound[s]);
    useBits(x, nbBits);
    x = encodingTable[start[s] + (x >> nbBits)];

    for Huffman we don't need the first branch - nbBits is fully determined by the symbol:
    nbBits = k[s];
    useBits(x, nbBits);
    x = encodingTable[start[s] + (x >> nbBits)];


    For a small number of symbols (definitely not 256), tANS encoding speed could be improved by using a 2D k[s,x] table instead.
    Additionally, if the number of appearances of a symbol is a power of 2, its tANS nbBits is also directly determined by the symbol (nbBits = k[s]).

    This could be used for a hybrid between tANS and Huffman, especially to handle the situation where Huffman is terrible: one dominating symbol, let's say Pr(s=0) >> 0.5.
    In this case we can approximate the probabilities of the remaining symbols by powers of 1/2.
    Thanks to that:
    - the header problem disappears: we write only the Huffman lengths for the low-probability symbols, and Pr(s=0) = 1 - sum_{s>0} 2^{-r_s} is calculated,
    - in contrast to Huffman, no sort is needed - we just approximate the low probabilities with powers of 1/2, and Pr(0) is whatever complement remains (not necessarily a power of 1/2),
    - for the low-probability symbols, the symbol directly determines nbBits = k[s].

    Finally, such an encoding step can be
     if (s==0) nbBits = k0[x]; else nbBits = k[s];    // k0[x] = 1 - (x < bound[0])
    useBits(x, nbBits);
    x = encodingTable[start[s] + (x >> nbBits)];

    which might be a bit faster than the standard tANS encoding step (?).

    ps. It could also be written
     t = {k[s], k0[x]}; nbBits = t[s==0];

    or
     k[0] = k0[x]; nbBits = k[s]; 
    Last edited by Jarek; 3rd August 2015 at 08:49.

  29. The Following User Says Thank You to Jarek For This Useful Post:

    Cyan (3rd August 2015)

  30. #22
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    Note that Huffman has another advantage when it comes to compression: there is no state value to maintain.
    You can directly dump a symbol's code and its length into the bitStream.

    My feeling is that this absence of state maintenance is also key in explaining the compression speed difference.

    One reason is that I could improve the way nbBits is calculated within FSE/tANS.
    It is now simplified to:

    nbBits = (state + constantDelta[s]) >> constantShift;


    This new formulation is faster than the previous nbBits = k[s] - (x < bound[s]);
    Not yet as straightforward as nbBits = k[s]; but close enough.
    Overall, this simple change alone provided a modest but solid +5% compression speed on FSE.

    But that's all there is.
    Currently, FSE compresses at 335 MB/s. It's still short of the 505 MB/s of Huff0.

    So where could the remaining difference come from?
    My suspicion is that it comes from state maintenance, which doesn't exist (or at least is not required) with Huffman.

    Determining the current state value is necessary to encode the next symbol, so it could be a critical dependency path.
    Also, considering the speed of these algorithms, it could be that just maintaining the state is a sufficient workload to noticeably slow down the compression process.


    > tANS - Huffman hybrid

    Your suggestion to combine tANS with Huffman is interesting.
    The main issue in making it work efficiently is instruction flow: you want to avoid disrupting branches in the code.
    Your latest proposition does it successfully: k[0]=k0[x]; nbBits = k[s];

    Unfortunately, once branches are no longer something to worry about, calculation dependencies become the next bottleneck.
    And in the above example, k[0]=k0[x]; must complete before nbBits = k[s];.
    Therefore, some dependency still seems present.

    [Edit]: note also that, in your comparison, Huffman can be written in a simpler way.
    Basically, an encoding step is just:

    addBitsFast (&bitStream, t[s].value, t[s].nbBits);

    There is simply no 'x' state to maintain.
    Last edited by Cyan; 4th August 2015 at 03:58.

  31. The Following User Says Thank You to Cyan For This Useful Post:

    Jarek (4th August 2015)

  32. #23
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    You are right, the tANS encoding branch can be removed:
    nbBits = k[s] - (x < bound[s])

    can be transformed to
    nbBits = K[s] + ((x + dif[s]) >> R)

    where 2L = 2^R
    K[s] = k[s] - 1
    dif[s] = 2L - bound[s]
    Then further to
    nbBits = (x + nb[s]) >> R

    where nb[s] = (K[s] << R) + dif[s]
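    For concreteness, a small sketch (my variable names, following the derivation above; R, k[], bound[] and nb[] are assumed to be already defined) of precomputing nb[s] so that the hot loop needs only an add and a shift:

    Code:
    /* Precompute once per table build (2L = 1 << R): */
    for (int s = 0; s < nsymbols; s++)
        nb[s] = ((k[s] - 1) << R) + ((1 << R) - bound[s]);

    /* The per-symbol encoding step then becomes branch-free: */
    int nbBits = (x + nb[s]) >> R;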
    If it is still essentially slower (you write 430 MB/s on the FSE github?), it is indeed a matter of handling the state.
    ps. Added updated algorithms to http://encode.ru/threads/2078-List-o...mplementations
    Last edited by Jarek; 4th August 2015 at 12:03.

  33. #24
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    > If it is still essentially slower (you write 430 MB/s on the FSE github?), it is indeed a matter of handling the state.

    This is the speed of the FSE encoding loop only.
    It was provided because other ANS versions were communicating their encoding-loop timings only, making comparison difficult (and biased against the full version).

    The complete timing must also count statistics, build tables and generate headers.
    Just counting the bytes is a fairly significant portion of the total time!
    With all these additional tasks, the speed becomes ~335 MB/s.

    Note that the compared Huff0 version also has the same duty to count stats, build tables and generate headers.

  34. #25
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    Such "total speed" is strongly dependent on size of data frame - for better understanding what are we fighting for, designing optimizations, you should separate these speeds: processing speed of the main loop and the preparation time: to count appearances and build tables.
    The cost of counting appearances is probably the same for both and it depends on frame size - the most beneficial would be knowing time for building coding tables only.
    I would expect that Huffman preparation is more expensive due to symbol sort, while you say suggest that it is opposite - I will take a look at it in FSE.

    ps. analogously, to understand the header problem, it would be beneficial to separate header size from encoded data size.

  35. #26
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    For timing dedicated sub-sections of code, I've got 'fullbench', another tool provided within the FSE package.
    It's useful for development and evaluation.

    Such a tool allows me to compare, for example, FSE_buildCTable() with HUF_buildCTable(),
    and it finds that the HUF version is faster:
    FSE_buildCTable : 250K blocks / second
    HUF_buildCTable : 450K blocks / second

    I suspect the advantage of the Huffman CTable is that it's much smaller: one cell per symbol.
    As opposed to the FSE CTable, which requires a full state-conversion table, hence as large as 1<<MAXBITS.

    That being said, both are minor contributors to the overall compression time (~3%).
    So they don't matter that much.
    Last edited by Cyan; 5th August 2015 at 16:08.

  36. #27
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    So you are saying it is 3.9 µs to build the FSE encoding table, and 2.2 µs for Huffman (including sorting symbols)?
    Assuming 10 kB blocks, every processed gigabyte costs 0.39 s for FSE and 0.22 s for Huffman here.
    At 330 MB/s we are talking about a ~10% loss ... still, it doesn't account for the 430? Especially since the loop speed should have increased after removing the branch?

    ps. I'm looking for the generation of encoding tables in FSE, but it seems to be spread over the code.
    Please take a look here: https://dl.dropboxusercontent.com/u/...67/tANSalg.png
    The floor of the logarithm up to 2048 can be put into an easily initialized table (011222233333333...).
    start is first initialized as start[k] = start[k-1] + Ls[k-1], then a final loop is performed: start[k] -= Ls[k].
    I see you write "symbolTT[s].deltaNbBits = tableLog * (1 << 16)"; I am not certain whether the multiplication is optimized into tableLog<<16.
    It should be faster than for Huffman, with its sorting of symbols...
    Last edited by Jarek; 4th August 2015 at 17:56.

  37. #28
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    853
    Thanks
    441
    Thanked 244 Times in 101 Posts
    I'm using a 32 KB block size for benchmarking, so these relative costs are even smaller.

    The second most important contributor to total compression time is counting bytes.
    This step is identical for both FSE and Huff0.
    It works at approximately 2 GB/s.
    (See http://fastcompression.blogspot.fr/2...rick-from.html, with a lot of interesting comments.)
    It may sound fast, but that's approximately 25% of the total time for Huff0, and 16% for FSE. So it's not negligible.
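    For reference, a minimal sketch of the multi-histogram counting trick the linked post describes (my code, not the actual FSE implementation): repeatedly incrementing a single counters[] array makes every increment of a hot slot depend on the previous one, so several partial histograms are kept and summed at the end:

    Code:
    #include <stdint.h>
    #include <stddef.h>

    void count_bytes(const uint8_t *in, size_t n, uint32_t counts[256])
    {
        uint32_t h[4][256] = {{0}};      /* 4 partial histograms            */
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {     /* 4 independent increment chains  */
            h[0][in[i+0]]++;
            h[1][in[i+1]]++;
            h[2][in[i+2]]++;
            h[3][in[i+3]]++;
        }
        for (; i < n; i++) h[0][in[i]]++;   /* tail */
        for (int s = 0; s < 256; s++)       /* merge partials */
            counts[s] = h[0][s] + h[1][s] + h[2][s] + h[3][s];
    }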

    For more detailed benchmark numbers, I suggest you download the latest version of fullbench from the dev branch of FSE:
    https://github.com/Cyan4973/FiniteStateEntropy/tree/dev

    You'll see, for example, a direct comparison of FSE_compress_usingCTable() and HUF_compress_usingCTable(), i.e. the internal encoding loop only, excluding all other elements.
    On my system, HUF_compress_usingCTable() runs at 860 MB/s, twice the speed of FSE_compress_usingCTable().
    Last edited by Cyan; 5th August 2015 at 16:09.

  38. The Following 2 Users Say Thank You to Cyan For This Useful Post:

    Bulat Ziganshin (4th August 2015),Jarek (4th August 2015)

  39. #29
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 165 Times in 63 Posts
    Attached is a new version of the benchmark, which adds:
    a) Huffman coder from 7z-Deflate
    b) shcodec
    c) rANS_static4c and rANS_static64c
    d) FSE HUF - good work Yann!

    The following results were obtained using 1 core of an Intel Core i5-4300U, Windows 10 64-bit (MinGW-w64 compilation under gcc 4.8.3) and 3 iterations. The input file (100 MB) is a concatenation of 10 different files, about 10 MB each: bmp, dct_coeffs, english_dic, ENWIK, exe, fp_log, hlp, XML, pdf, ncb.

    Code:
    entropy_benchmark 0.3 (64-bit Windows) (c) Dell Inc.  Written by P.Skibinski
    memcpy [chunk_size] [99 MB]     12 ms (8533 MB/s), 104854004, 12 ms (8533 MB/s)
    7z-Huff 9.20 [64 KB]            1849 ms (55 MB/s), 63956692, 1014 ms (100 MB/s)
    7z-Huff 9.20 [99 MB]            1586 ms (64 MB/s), 64011419, 1016 ms (100 MB/s)
    shcodec 1.0.1 [64 KB]           368 ms (278 MB/s), 64066884, 762 ms (134 MB/s)
    shcodec 1.0.1 [99 MB]           356 ms (287 MB/s), 85373397, 920 ms (111 MB/s)
    FSE 2015-08-26 [64 KB]          371 ms (276 MB/s), 63245440, 229 ms (447 MB/s)
    FSE 2015-08-26 [99 MB]          347 ms (295 MB/s), 84899617, 219 ms (467 MB/s)
    FSE HUF 2015-08-26 [64 KB]      208 ms (492 MB/s), 64045695, 197 ms (519 MB/s)
    rANS_static4c 2015-08 [64 KB]   505 ms (202 MB/s), 63254536, 437 ms (234 MB/s)
    rANS_static4c 2015-08 [99 MB]   463 ms (221 MB/s), 84899617, 394 ms (259 MB/s)
    rANS_static4c order1 [64 KB]    1209 ms (84 MB/s), 60426048, 912 ms (112 MB/s)
    rANS_static64c 2015-08 [64 KB]  721 ms (142 MB/s), 63270453, 299 ms (342 MB/s)
    rANS_static64c 2015-08 [99 MB]  743 ms (137 MB/s), 84899617, 312 ms (328 MB/s)
    rANS_static64c order1 [64 KB]   1446 ms (70 MB/s), 60441990, 766 ms (133 MB/s)
    ryg_rANS64 2014-02-18 [64 KB]   543 ms (188 MB/s), 66082876, 677 ms (151 MB/s)
    ryg_rANS64 2014-02-18 [99 MB]   534 ms (191 MB/s), 84830136, 669 ms (153 MB/s)
    ryg_rANS64 interleaved [64 KB]  436 ms (234 MB/s), 66092596, 382 ms (268 MB/s)
    ryg_rANS64 interleaved [99 MB]  423 ms (242 MB/s), 84830140, 375 ms (273 MB/s)
    ryg_rANS_SIMD [64 KB]           916 ms (111 MB/s), 66287954, 227 ms (451 MB/s)
    ryg_rANS_SIMD [99 MB]           966 ms (106 MB/s), 84846424, 219 ms (467 MB/s)
    tornado ArithCoder [64 KB]      1484 ms (69 MB/s), 67962411, 1910 ms (53 MB/s)
    tornado ArithCoder [99 MB]      1462 ms (70 MB/s), 63192926, 1928 ms (53 MB/s)
    tornado HuffCoder [64 KB]       854 ms (119 MB/s), 65826175, 956 ms (107 MB/s)
    tornado HuffCoder [99 MB]       854 ms (119 MB/s), 64701654, 951 ms (107 MB/s)
    tornado HuffCoder o1 [99 MB]    929 ms (110 MB/s), 55655031, 1991 ms (51 MB/s)
    zlibh [64 KB]                   455 ms (225 MB/s), 64008871, 486 ms (210 MB/s)
    done... (3 iterations)
    Attached Files

  40. The Following 3 Users Say Thank You to inikep For This Useful Post:

    Bulat Ziganshin (13th October 2015),Cyan (14th October 2015),JamesB (13th October 2015)

  41. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    Quote Originally Posted by Cyan View Post
    Unfortunately, once branches are no longer something to worry about, calculation dependencies become the next bottleneck.
    Maybe encoding 2-3 blocks simultaneously can fix that? If 7 x86 registers are enough for FSE, then 15 x64 registers should be enough for 2 FSE encoders working simultaneously.

    Have you compared FSE and HUF in terms of the overall number of ALU operations? Memory operations can usually be made free on modern Intel architectures, and dependency delays may be relaxed by this multi-buffer technique. A sketch of the idea follows.
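    A minimal sketch of the multi-buffer idea (hypothetical names; encode_step() stands in for whatever one FSE/HUF encoding step is and is not the actual FSE API): two independent encoder states walk two halves of the block in lockstep, so the out-of-order core can overlap their dependency chains:

    Code:
    /* Two independent dependency chains per iteration: */
    for (size_t i = 0; i < half; i++) {
        state1 = encode_step(state1, &bits1, blockA[i]);  /* chain 1 */
        state2 = encode_step(state2, &bits2, blockB[i]);  /* chain 2 */
    }
    /* The two bit streams are flushed separately and concatenated,
       with their lengths recorded in the header. */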
