
Thread: Benchmarking Entropy Coders

  1. #151
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts

    Building TurboBench

    The complete source code for TurboBench can be downloaded from GitHub.
    The LzTurbo library is not included, but it is not required for a
    complete benchmark, including the HTML plots.


    Several entropy coders, such as FSC, FSE/FSE-Huffman, rans_static, and zlibh, are already included.
    TurboBench has been tested on Linux and Windows (MinGW-w64).
    See the README file for instructions on downloading and building TurboBench.
    No additional package installation or configuration is necessary.

    Code:
      git clone --recursive git://github.com/powturbo/TurboBench.git
      cd TurboBench
      make
    To include a codec, just follow the sample implementation "mycodec" (see the file plugins.cc).
    The calls to the codec functions can be inserted directly in "plugins.cc";
    there is no need to write any wrapper functions.
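    For illustration, here is a minimal sketch of that pattern: one dispatch switch that calls each codec's functions directly. The codec id P_MYCODEC, the entry point mycodec_compress, and all signatures below are invented for this example; copy the real "mycodec" sample in plugins.cc for the actual conventions.

    Code:
      /* Hypothetical sketch only; not the real plugins.cc interface. */
      #include <stddef.h>

      size_t mycodec_compress(const unsigned char *in, size_t n,
                              unsigned char *out);    /* your codec's API */

      enum { P_MYCODEC = 1 };                         /* hypothetical id  */

      size_t bench_compress(int id, const unsigned char *in, size_t n,
                            unsigned char *out)
      {
          switch (id) {
          case P_MYCODEC:
              return mycodec_compress(in, n, out);    /* direct call, no wrapper */
          default:
              return 0;
          }
      }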


    The LzTurbo codecs can optionally be compared by downloading the TurboBench binaries, as described in the GitHub README.
    Last edited by dnd; 26th December 2015 at 17:45.

  2. #152
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    Quote Originally Posted by dnd View Post
    Experimental ANS/ABS in vp10 at google.
    There are implementations of uABS, rABS and rANS. Indeed, rANS+uABS seems a perfect combination for video compression: switching between large alphabets (DCT coefficients, prediction modes, etc.) and binary choices (e.g. when an exception happens).
    It seems a good time to discuss how to use it effectively for this application ...
    https://groups.google.com/a/webmproj...Y/ZfklwwhWDgAJ

  3. #153
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    Quote Originally Posted by dnd View Post
    - New blog describing "Asymmetric Numeral Systems" implementation at RAD: rANS in practice.
    Unfortunately, there are no benchmark numbers about compression ratio or speed.
    Charles has placed some benchmarks on his blog: http://www.cbloom.com/rants.html

    oohcLZNA : 2.88:1 , 5.3 enc mbps , 135.0 dec mbps
    lzma : 2.82:1 , 2.9 enc mbps , 43.0 dec mbps
    oohcBitKnit : 2.76:1 , 6.4 enc mbps , 273.3 dec mbps
    lzham : 2.59:1 , 1.8 enc mbps , 162.9 dec mbps
    oohcLZHLW : 2.38:1 , 4.2 enc mbps , 456.3 dec mbps
    zstdhc9 : 2.11:1 , 29.5 enc mbps , 558.0 dec mbps
    oohcLZNIB : 2.04:1 , 11.5 enc mbps , 1316.4 dec mbps


    Looks very interesting, but sadly no independent confirmation ...

  4. #154
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Quote Originally Posted by Jarek View Post
    Charles has placed some benchmarks on his blog...
    Thank you, I've seen that, but I'm talking about the entropy coders.

    Quote Originally Posted by Jarek View Post
    Looks very interesting...
    You can see in the speedup benchmark sheet and plot that at low bandwidth, the load time is dominated by latency and transfer time.
    There is only a negligible difference in speedup between lzma and other high-end compressors.
    You can also see in the thread "Transfer Encoding & Data compression" that even a 20% difference in compressed size is imperceptible for small files.
    Accurate results can be obtained only in a real scenario.

    Quote Originally Posted by Jarek View Post
    ..., but sadly no independent confirmation ...
    Personally, I have no problem trusting the numbers from Charles. And would
    independent benchmarks make better tests?
    The goal is to get a general picture of the new compression methods.
    I think it would also be good to see brotli (option 11 w/ a 16MB window), zstd,20,
    and maybe other file sets.
    The best option is to include the compressors in TurboBench (very simple) and publish the
    HTML results and plots.

    I have some general suggestions for publishing benchmark results:
    - Describe the OS, CPU type, CPU clock (and memory speed?) with each published benchmark
    - State whether your files are compressed separately
    - Do not use private files; provide links to download the test files or corpus
    - Do not use files you can't release because of copyright
    - Use a minimum of 2 file sets: text and binary files
    - Do not use exclusively small files
    - Do not use random or difficult-to-compress files (20-50% w/ zlib,9 is ok)
    Last edited by dnd; 28th December 2015 at 03:12.

  5. #155
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    There wasn't any point in giving benchmark results in the "rANS in practice" article, since it didn't introduce any new techniques that would help with speed!

    Buffering up events and then reverse-encoding has nothing to do with speed (and is not new). Interleaving does help with speed but is likewise old hat by now; there's not much to add that I didn't already say in early 2014!

    I consider the alternating-stream encoder/decoder to be a much more convenient interface for a 2x interleaved coder, but it is essentially equivalent in terms of speed, as far as it is possible to say such things. For example, if I replace the 2x manual-interleaved coder in the ryg_rans64 example with a 2x unrolled invocation of the alternating-stream coder, I get the same performance (to within measurement accuracy of that testbed), and essentially the same code (up to register assignment and some minor scheduling differences). If I don't do the 2x unroll, it's a bit slower (something like 8% IIRC - visiting relatives right now, don't have access to that code), but still substantially faster than not interleaving, and the ryg_rans example code is really pure entropy coding (very little code), so even loop overhead is not negligible. It's not trivial to untangle what part of the cost is loop overhead and what is the extra MOV you get in the non-unrolled alternate-stream coder variant, and I didn't even try.
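    For reference, a minimal sketch of the 2x interleaved decode idea, modelled loosely on the public ryg_rans example (table layout and names here are illustrative, not fgiesen's exact code). The point is that the two state updates have no data dependency on each other, so the CPU can overlap their dependency chains:

    Code:
      #include <stddef.h>
      #include <stdint.h>

      #define RANS_L     (1u << 23)   /* lower bound of the normalised interval */
      #define SCALE_BITS 12           /* frequencies sum to 1 << 12             */

      typedef struct {
          uint8_t  slot2sym[1 << SCALE_BITS];   /* slot -> symbol              */
          uint16_t freq[256], cum[256];         /* quantised freqs, cum. sums  */
      } DecTable;

      static inline int dec_step(uint32_t *x, const uint8_t **pin,
                                 const DecTable *t)
      {
          uint32_t slot = *x & ((1u << SCALE_BITS) - 1);
          int s = t->slot2sym[slot];
          *x = t->freq[s] * (*x >> SCALE_BITS) + slot - t->cum[s];
          while (*x < RANS_L) *x = (*x << 8) | *(*pin)++;   /* renormalise */
          return s;
      }

      /* two states, one shared byte stream, symbols strictly alternating;
         states are assumed already initialised from the stream, and the
         odd-length tail is omitted for brevity */
      void decode_2x(uint32_t x0, uint32_t x1, const uint8_t *in,
                     uint8_t *out, size_t n, const DecTable *t)
      {
          const uint8_t *p = in;
          for (size_t i = 0; i + 2 <= n; i += 2) {
              out[i]     = (uint8_t)dec_step(&x0, &p, t);
              out[i + 1] = (uint8_t)dec_step(&x1, &p, t);
          }
      }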

    The bypass coding for raw bits is faster than using a frequency of 1<<k and going through a full rANS encode/decode step, but I likewise never bothered to quantify exactly how much faster. Either way, it's purely a "nice to have" feature; there's little reason not to use it.
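    The bypass idea in a similarly hedged sketch (same illustrative constants as the sketch above): k uniformly distributed raw bits are pulled straight off the state, which is equivalent to a full rANS step over a flat 2^k-entry distribution but needs no table lookups:

    Code:
      #include <stdint.h>

      #define RANS_L (1u << 23)

      static inline uint32_t dec_bits(uint32_t *x, const uint8_t **pin, int k)
      {
          uint32_t bits = *x & ((1u << k) - 1);   /* k raw bits  */
          *x >>= k;
          while (*x < RANS_L) *x = (*x << 8) | *(*pin)++;   /* renormalise */
          return bits;
      }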

    Finally, the "knot tying" procedure runs once per block flush, so the impact on speed is negligible. Again I mention it mostly because it's convenient - flushing a rANS encoder into another one is nice because it doesn't require separate bit IO machinery, and also because it enables the reduction tree-style flush for wide SIMD/GPU applications.

  6. The Following 3 Users Say Thank You to fgiesen For This Useful Post:

    dnd (28th December 2015),Jarek (30th December 2015),Turtle (27th December 2015)

  7. #156
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Quote Originally Posted by cbloom.com
    total : oohcLZNA : 167,495,105 ->58,242,995 = 2.782 bpb = 2.876 to 1
    total : encode : 31.867 seconds, 1.54 b/kc, rate= 5.26 M/s
    total : decode : 1.240 seconds, 39.68 b/kc, rate= 135.04 M/s
    total : encode+decode : 33.107 seconds, 1.49 b/kc, rate= 5.06 M/s

    total : lzma : 167,495,105 ->59,300,023 = 2.832 bpb = 2.825 to 1
    total : encode : 57.712 seconds, 0.85 b/kc, rate= 2.90 M/s
    total : decode : 3.898 seconds, 12.63 b/kc, rate= 42.97 M/s
    total : encode+decode : 61.610 seconds, 0.80 b/kc, rate= 2.72 M/s

    total : lzham : 167,495,105 ->64,682,721 = 3.089 bpb = 2.589 to 1
    total : encode : 93.182 seconds, 0.53 b/kc, rate= 1.80 M/s
    total : decode : 1.028 seconds, 47.86 b/kc, rate= 162.86 M/s
    total : encode+decode : 94.211 seconds, 0.52 b/kc, rate= 1.78 M/s
    This benchmark falls into the same pit as the previous one. But I'm not going to be as sarcastic and harsh as I was before. Just a couple of remarks.
    Here is a screenshot from a Skype conversation with someone about the Oodle benchmark we are discussing. The conversation is in Russian, so I translated it into English.
    scr_20151230_010400.png
    I completely agree with him.
    For example, how should we treat the following mutually exclusive statements?

    "LZMA and LZHAM are run at max compression with context bits at their best setting."
    "all compressors set at high (not max) compression levels."

    A more important question here is how exactly LZMA was configured. Match finder, parser mode, dictionary size, fast byte value, number of match finder cycles and all the other parameters: what exact values were used?
    There is also no information about memory usage during compression/decompression. And I'm not even talking about the mystery of the data itself.
    It's described as a "game test set". Well, ok. What exactly is the data? Textures, meshes, models, PCM audio streams, level geometry, scripts?

    The page contains the phrase: "As usual, I'm trying to be as fair as possible to the competition."
    Should I say that the attempt has failed? If you're going to be fair, then provide the data set so we can test it. Nobody forces you to provide the compressor itself, but you could compile decode-only binaries and publish them together with the Oodle compressed streams. Without actual proof, your benchmark looks like a market-driven masquerade. Sorry.

  8. #157
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    90
    Thanked 69 Times in 49 Posts
    At the moment we are blessed with multiple open-source benchmarking efforts (lzbench, Squash, TurboBench), and it would be nice if some of the closed-source libraries could be added as plugins for comparison. Personally I like the Squash benchmark because it is run on multiple types of hardware, has a set of test data that is available, and is third-party in the sense that there is no commercial interest in making any specific library look good.

    But I think this is unlikely, since it could be dangerous to submit to testing in an environment you do not control. What if the benchmark does not show what you'd like?

  9. The Following User Says Thank You to jibz For This Useful Post:

    Jarek (30th December 2015)

  10. #158
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Quote Originally Posted by jibz View Post
    ...Personally I like the Squash benchmark because it is run on multiple types of hardware, has a set of test data that is available...
    I wonder whether you or anyone else has tried to build the complete Squash benchmark.

    I tried to build the Squash benchmark a few weeks ago, but abandoned it.
    I don't know the current status.

    I have some remarks:
    1 - The first line of "Building squash" says: "Squash currently does not support Windows".
    For most users, this remark alone is reason enough to stop reading.
    2 - Squash requires a lot of additional dependencies (cmake, ragel, ...); you must first build Ragel!
    3 - Squash failed to compile on Ubuntu (I don't know the current status).
    4 - The Squash library must be installed on your system, even if you only want to run a simple benchmark.
    5 - After building Squash, you must build the Squash benchmark.
    6 - To include your own library, you must spend a lot of time writing plugins, ...
    7 - The Squash benchmark outputs only a "csv" file (see the code).
    I think you need to install the "squash benchmark web" (and a web server?) to see simply formatted output.

    Well, if anyone still thinks it is simple to build, use, and integrate their own functions into Squash, then good luck!

    TurboBench and lzbench are simple to build.
    TurboBench does not require any additional installation (not even cmake)
    to build or to use the plot functions.
    It is also very simple to integrate your own functions.
    Last edited by dnd; 30th December 2015 at 14:09.

  11. #159
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    90
    Thanked 69 Times in 49 Posts
    I am sure nemequ can answer your concerns better, but let me add a few notes.

    I didn't quite get the distinction between the squash library and the benchmark at first, but as I understand it, the library is a general compression library which offers a common interface to a number of libraries that are included as plugins. The benchmark is a separate project, which uses the library to run the plugins on various platforms, and the collected data is available on the benchmark site.

    As such, there is no need to compile the entire squash benchmark, or even just the squash library, locally, in order to write a plugin. If you create a pull request, the CI will check that your plugin works, and if it is included, the results will appear on the official benchmark site the next time the squash benchmark is updated.

    I wrote the BriefLZ plugin on Windows before squash was able to compile there. It didn't take long either; I just copied one of the other plugins for a library with a similar interface. I did check that it compiled in a Linux VM though.

    The next version of the squash library will hopefully have Windows and OSX support, but it is clearly not as easy to build as the other options.

    I don't think CMake should be an obstacle. It is a pain to use, sure, but I think it has emerged as the lesser evil for cross-platform builds that include non-POSIX targets. There are so many projects using it that people should be familiar with it by now.

  12. #160
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts

    Squash Benchmark

    First of all, Happy New Year 2016 to all.

    I think the fact that Windows is not supported is reason enough not to consider the Squash benchmark.
    On other operating systems, you must first install and build RAGEL; otherwise cmake fails.
    Actually, you can't install Squash or the "squash benchmark" without building it.
    So you are limited to writing plugins that will some day be integrated
    into the proprietary "squash benchmark".
    You can't even test first whether your plugins are working.

    You must also spend a lot of time with the Squash package and writing plugins.

    This is also not an option for the compressors of Charles, LzTurbo, or other closed-source projects,
    because you would have to give your libraries away for free.

    Most importantly, users can't run any benchmark with their own data on their own hardware!


    Now to the numbers of the Squash benchmark web:
    - Most of the files in the "squash benchmark web" are small. Small files
    are bad for benchmarking speed (cache scenario).
    Additionally, compressors that shine on small files (like brotli) are favored.

    - Plugins are tested including I/O, so it is not a pure in-memory benchmark
    and consequently not reliable.

    - You have a lot of overhead (especially for small files), because the libraries are not called directly.

    - The plugins are tested with N iterations (I), but in only one run (R = r * I) and only one benchmark.
    The numbers can't be accurate enough, because of CPU throttling and other effects like background processes, ...
    All these issues are handled automatically in TurboBench.


    In conclusion, you see nice plots and very long sheets in the "Squash Benchmark", but it is more than questionable whether the numbers
    are accurate enough.

    You can read more here about the inaccuracy of the squash benchmark.

    Quote Originally Posted by jibz View Post
    But I think this is unlikely, since it could be dangerous to submit to testing in an environment you do not control. What if the benchmark does not show what you'd like?
    This is one more reason to use only a benchmark program whose results you can verify.
    Last edited by dnd; 1st January 2016 at 15:22.

  13. #161
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts

    New optimized TurboANX

    New optimized TurboANX - Asymmetric Numeral Systems

    - Optimized TurboANX with better compression, now rivaling adaptive entropy coders
    - can compress even better than order-1 rans_static
    - can compress more than 20% better than other ANS implementations
    - Fastest entropy coder; decoding more than 3 times faster than FSE!
    - New Skylake benchmarks

    see: Entropy Coder Benchmark

  14. #162
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    I was wondering when you'd rise to the challenge! I'm not sure you can still claim it's the fastest, though.

    Is this using much the same techniques as my 32-way unrolled ANS? Is it rANS or tANS? I'm pretty sure tANS has the legs to be much faster than FSE's current implementation (at the expense of needing a lot more state and being less efficient on small block sizes).

  15. #163
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Most of the speed improvement in TurboANX comes from the new dynamic interleaving.
    Some of the SIMD techniques are already standard; see for example my tweet.

    I'm claiming nothing, simply interpreting the latest benchmark numbers.
    One must always consider the compression ratio when comparing speed, and here TurboANX is playing in another league.

  16. #164
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    214
    Thanks
    89
    Thanked 44 Times in 29 Posts
    How is TurboANX in a different league in compression ratio? Your benchmarks show it having the same ratio as many other compressors, at 63-64%. What are you comparing it to?

  17. #165
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    The level 9 edition does better in ratio, at the cost of being vastly slower at encoding. It also only has an effect on the BWT-transformed enwik9. Am I correct in guessing that this level optimises the block boundaries to maximise compression ratio, which is why it works great on BWT output and has negligible effect on the original? This is still a useful change of course, but so is adjusting fixed block sizes and so on.

    There are other possible solutions too, but we don't know which, if any, are being used, as this is a closed-source implementation. E.g. shrinking the block sizes so you get finer control over local entropy variability (important on BWT), but not having a random-access block model. If you encode the frequency delta of a block relative to the previous block, then the overhead of storing the frequencies of each block is vastly reduced, but you've lost the ability to do any form of random access. What block sizes are used? Can you individually address any block without decoding the previous blocks?
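    For concreteness, the frequency-delta idea in a minimal sketch (purely illustrative, from neither codec): the deltas stay near zero when statistics drift slowly, so they entropy-code far smaller than raw tables, but block N can no longer be decoded without blocks 0..N-1.

    Code:
      #include <stdint.h>

      /* store a block's 256-entry frequency table as signed differences
         from the previous block's table */
      void freq_delta(const uint16_t cur[256], const uint16_t prev[256],
                      int16_t delta[256])
      {
          for (int i = 0; i < 256; i++)
              delta[i] = (int16_t)(cur[i] - prev[i]);
      }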

  18. #166
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Quote Originally Posted by SolidComp View Post
    How is TurboANX in a different league in compression ratio? Your benchmarks show it having the same ratio as many other compressors, at 63-64%. What are you comparing it to?
    There is a significant difference of 23% on the enwik9 BWT between TurboANX and the other ANS implementations.

    Quote Originally Posted by JamesB View Post
    The level 9 edition does better in ratio, at the cost of being vastly slower at encoding...
    I'm using an entropy-based segmentation with a shortest-path algorithm on histograms. Each super-block (which can be the entire file) is segmented into variable-size small blocks of 1-64k.
    Random access is possible at the super-block level.
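    As TurboANX is closed source, the details are unknown, but here is a hedged sketch of what such a shortest-path segmentation could look like; the 1 KiB boundary grid, order-0 entropy cost, and fixed per-block header cost are all assumptions for illustration, not dnd's algorithm.

    Code:
      #include <float.h>
      #include <math.h>
      #include <stdint.h>
      #include <stdlib.h>

      #define GRAIN  1024         /* candidate boundaries every 1 KiB (assumed) */
      #define MAXBLK 64           /* max block = 64 * GRAIN = 64 KiB            */
      #define HDR    (256.0 * 6)  /* assumed per-block header cost, in bits     */

      static double entropy_bits(const uint32_t h[256], uint32_t n)
      {
          double bits = 0;
          for (int i = 0; i < 256; i++)
              if (h[i]) bits += h[i] * log2((double)n / h[i]);
          return bits;
      }

      /* best[j] = minimal cost in bits for the first j*GRAIN bytes;
         cut[j] records the chosen previous boundary (walk cut[] backwards
         from the last unit to recover the block layout) */
      double segment(const uint8_t *buf, uint32_t n, int *cut)
      {
          int units = (int)(n / GRAIN);          /* tail ignored for brevity */
          double *best = malloc((units + 1) * sizeof *best);
          best[0] = 0;
          for (int j = 1; j <= units; j++) {
              uint32_t h[256] = { 0 };
              best[j] = DBL_MAX;
              for (int i = j - 1; i >= 0 && j - i <= MAXBLK; i--) {
                  /* extend the histogram backwards to cover units i..j-1 */
                  for (uint32_t k = (uint32_t)i * GRAIN; k < (uint32_t)(i + 1) * GRAIN; k++)
                      h[buf[k]]++;
                  double c = best[i] + HDR + entropy_bits(h, (uint32_t)(j - i) * GRAIN);
                  if (c < best[j]) { best[j] = c; cut[j] = i; }
              }
          }
          double total = best[units];
          free(best);
          return total;
      }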

  19. The Following 2 Users Say Thank You to dnd For This Useful Post:

    JamesB (19th October 2016),Jarek (19th October 2016)

  20. #167
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    Thanks dnd - it's as I was guessing, then. They're both valid and evidently highly useful optimisations of course, plus it's good to see results with and without them (ANX vs ANX,9), as it shows that whether the extra time is worth it depends on the input file characteristics.

    I'm curious how it compares speed-wise to my r32x16b.c code (see http://encode.ru/threads/2607-AVX-51...0616#post50616 for source & Linux binary) on your platform.

    Ratio won't match of course, at least not with the ",9" switch, as my code doesn't have the variable block size logic.

  21. #168
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Hi,
    I've integrated your r32x16b into TurboBench and updated the entropy coder benchmark for Skylake.
    The default block size is 12k for TurboANX and 32k for the other ANS coders.
    It is possible to set the block size for the other ANS coders with the parameter -Y (e.g. -Y64k or -Y1m).
    If you want, I can also publish the results with your preferred block size.

    The optimized TurboANX can be considered experimental and not useful for all distributions.
    I also have a version where the frequencies are coded using gamma coding with a bitwise range coder, but it is slower.
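    For readers unfamiliar with it, a tiny sketch of Elias-gamma coding a frequency value (illustrative only; the bitwise range coder dnd mentions would take the place of the put_bit callback):

    Code:
      #include <stdint.h>

      /* Elias gamma: floor(log2 v) zero bits, then the bits of v
         (MSB first).  Requires v >= 1. */
      void gamma_put(uint32_t v, void (*put_bit)(int))
      {
          int n = 31 - __builtin_clz(v);             /* floor(log2 v) */
          for (int i = 0; i < n; i++) put_bit(0);    /* length prefix */
          for (int i = n; i >= 0; i--) put_bit((v >> i) & 1);
      }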

  22. The Following User Says Thank You to dnd For This Useful Post:

    Jarek (19th October 2016)

  23. #169
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    Thanks. Nothing wrong with experimental code. It's where it all starts and my own is highly experimental too!

    The frequency encoding is probably the weak part of my codec. It's a very basic byte-shovelling thing at the moment, with no delta to previous blocks either. It is, however, (I think) reasonably quick, if not optimal. Order-1 frequency encoding is probably poorly done! E.g. I expect storing the order-0 frequencies and then deltas of the order-1 freqs against those would produce smaller tables, but it's not something I've experimented with.

    Re: block sizes - the 32-way version is inevitably optimised for large block sizes, as the cost of flushing the state is considerable. My defaults are usually 1MB because that's the default block of qualities used in the CRAM format (although it's tunable). 128K would be reasonable too, but going much lower is likely to harm speed and/or size, although I haven't checked by how much - perhaps it's fine at 32k, and on the BWT data it would likely gain more from a small block size, due to the nature of the data, than it loses through expensive flushing. The order-1 codec in particular doesn't like small block sizes for obvious reasons (the freq table becomes dominant), but again 128K is probably sufficient.

    I've also found Broadwell to be much slower than Haswell for my order-1 code; however, the code is pretty demanding in its memory behaviour, so it's possibly just other differences in the machine itself causing this. I can't see anything obvious in the chip that would make it slow down.

  24. #170
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    Curious discovery, aka how to cheat on benchmarks

    My order-1 encoder on low-entropy data:

    Code:
      @ deskpro107386[compress.../rans_sta...]; gcc6 -O3 -march=native -DTEST_MAIN -DBLK_SIZE=1024*1024 r32x16b_avx2.c
      @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o1 /var/tmp/q8
      197.3 MB/s enc, 909.7 MB/s dec 73124567 bytes -> 15862493 bytes
      197.1 MB/s enc, 915.1 MB/s dec 73124567 bytes -> 15862493 bytes
      197.4 MB/s enc, 914.0 MB/s dec 73124567 bytes -> 15862493 bytes
      197.4 MB/s enc, 913.0 MB/s dec 73124567 bytes -> 15862493 bytes
      197.3 MB/s enc, 915.9 MB/s dec 73124567 bytes -> 15862493 bytes
      197.2 MB/s enc, 916.6 MB/s dec 73124567 bytes -> 15862493 bytes
      197.1 MB/s enc, 911.7 MB/s dec 73124567 bytes -> 15862493 bytes
      197.2 MB/s enc, 916.9 MB/s dec 73124567 bytes -> 15862493 bytes
      197.3 MB/s enc, 917.4 MB/s dec 73124567 bytes -> 15862493 bytes
      197.2 MB/s enc, 917.1 MB/s dec 73124567 bytes -> 15862493 bytes

    And with a marginally different block size:

    Code:
      @ deskpro107386[compress.../rans_sta...]; gcc6 -O3 -march=native -DTEST_MAIN -DBLK_SIZE=1013*1047 r32x16b_avx2.c
      @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o1 /var/tmp/q8
      243.4 MB/s enc, 1069.5 MB/s dec 73124567 bytes -> 15862079 bytes
      243.2 MB/s enc, 1065.6 MB/s dec 73124567 bytes -> 15862079 bytes
      244.0 MB/s enc, 1069.3 MB/s dec 73124567 bytes -> 15862079 bytes
      243.5 MB/s enc, 1070.3 MB/s dec 73124567 bytes -> 15862079 bytes
      242.8 MB/s enc, 1070.1 MB/s dec 73124567 bytes -> 15862079 bytes
      243.6 MB/s enc, 1070.3 MB/s dec 73124567 bytes -> 15862079 bytes
      243.8 MB/s enc, 1066.8 MB/s dec 73124567 bytes -> 15862079 bytes
      243.9 MB/s enc, 1070.3 MB/s dec 73124567 bytes -> 15862079 bytes
      243.6 MB/s enc, 1069.9 MB/s dec 73124567 bytes -> 15862079 bytes
      243.7 MB/s enc, 1068.3 MB/s dec 73124567 bytes -> 15862079 bytes

    The reason, I think, is that 1024*1024 is trivially divisible by 32, making the 32 order-1 pointers share most of their low address bits. 1013*1047 doesn't have this effect, and the result is (I'm guessing here) far fewer cache collisions.

    The same can be seen on other data sets, albeit a bit less noticeably. I'm wondering, however, whether this applies to other codecs that do heavy unrolling. Worth experimenting.
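    To make the aliasing hypothesis concrete, here is a quick check assuming a typical 32 KiB, 8-way L1D with 64-byte lines (64 sets); the cache geometry is an assumption. With the 1 MiB block, the 32 per-lane pointers start 32768 bytes apart, an exact multiple of the set stride, so every lane lands in the same set; the 1013*1047 block spreads them out:

    Code:
      #include <stdio.h>

      int main(void)
      {
          const unsigned line = 64, sets = 64;         /* assumed L1D geometry */
          const unsigned blk_pow2 = 1024 * 1024 / 32;  /* 32768: lane spacing  */
          const unsigned blk_odd  = 1013 * 1047 / 32;  /* 33144: lane spacing  */
          for (unsigned i = 0; i < 32; i++)
              printf("lane %2u: pow2 set %2u, odd set %2u\n", i,
                     (i * blk_pow2 / line) % sets,
                     (i * blk_odd  / line) % sets);
          /* pow2 prints set 0 for every lane; odd scatters across the sets */
          return 0;
      }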
    Last edited by JamesB; 23rd October 2016 at 23:59.

  25. The Following 4 Users Say Thank You to JamesB For This Useful Post:

    Cyan (24th October 2016),Jarek (24th October 2016),Mike (24th October 2016),Shelwien (24th October 2016)

  26. #171
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    422
    Thanks
    133
    Thanked 147 Times in 95 Posts
    Indeed, evidence shows the layout matters greatly for the encoder, but not so much for the decoder. In fact, encoding/decoding from files, rather than with my in-memory test activated by "-t", shows different performance characteristics. Possibly the in/out block memory alignment is changing too. The test-mode decoder runs slower than when decoding from a file(!) when using exactly a 1MB block size.

    Anyway, perf is showing nice numbers to back up this theory in part at least.

    Input is my 8-binned quality file (so low entropy):

    1013*1047 block size

    Code:
    @ deskpro107386[compress.../rans_sta...]; perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out -o1 /tmp/in /tmp/out;perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out -d /tmp/out /dev/null
    Took 3191690 microseconds, 229.1 MB/s
    
     Performance counter stats for './a.out -o1 /tmp/in /tmp/out':
    
           10930222964 L1-dcache-load                                              
              79789097 L1-dcache-loads-misses    #    0.73% of all L1-dcache hits  
            4169473999 L1-dcache-stores                                            
              23278924 L1-dcache-store-misses                                      
    
           3.659998681 seconds time elapsed
    
    Took 709925 microseconds, 1030.0 MB/s
    
     Performance counter stats for './a.out -d /tmp/out /dev/null':
    
            2619188081 L1-dcache-load                                              
              68030637 L1-dcache-loads-misses    #    2.60% of all L1-dcache hits  
             422798953 L1-dcache-stores                                            
              22065522 L1-dcache-store-misses                                      
    
           0.710987393 seconds time elapsed
    1024*1024 block size

    Code:
    @ deskpro107386[compress.../rans_sta...]; perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./b.out -o1 /tmp/in /tmp/out;perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./b.out -d /tmp/out /dev/null
    Took 3399114 microseconds, 215.1 MB/s
    
     Performance counter stats for './b.out -o1 /tmp/in /tmp/out':
    
           10916794653 L1-dcache-load                                              
             804240191 L1-dcache-loads-misses    #    7.37% of all L1-dcache hits  
            4161272365 L1-dcache-stores                                            
              29233954 L1-dcache-store-misses                                      
    
           3.861525937 seconds time elapsed
    
    Took 725868 microseconds, 1007.4 MB/s
    
     Performance counter stats for './b.out -d /tmp/out /dev/null':
    
            2619148917 L1-dcache-load                                              
              58870968 L1-dcache-loads-misses    #    2.25% of all L1-dcache hits  
             422310647 L1-dcache-stores                                            
              26928287 L1-dcache-store-misses                                      
    
           0.727073387 seconds time elapsed

    On something like enwik8 the entropy is much higher, so the number of order-1 contexts in use is much higher, leading to more memory being touched overall. However, it's still 4% vs 10% L1-dcache-load miss rates between the two versions, so significant. The decodes show slightly better cache performance at the 1024*1024 size, the opposite of the expected behaviour. However, as mentioned above, the file-to-file encode/decode is much more even in speed between the two versions anyway.

    PS. For the same reason, my 4 or 8 histogram buffers for computing the frequencies are sized as (e.g.) F0[256+MAGIC], where MAGIC is a #define to 8 or similar. This had a huge impact on our older AMD servers with 2-way associative caches. I see FSE_count_parallel uses 256 for these. I should probably benchmark that with a slightly higher number on those servers to see.
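    A minimal sketch of that padded-histogram trick (MAGIC = 8 as in the post; the 4-way split is one common choice): the +MAGIC slack keeps the counter arrays from sitting exactly a power-of-two stride apart, so a low-associativity cache doesn't keep evicting one array with another.

    Code:
      #include <stddef.h>
      #include <stdint.h>

      #define MAGIC 8

      void hist4(const uint8_t *in, size_t n, uint32_t out[256])
      {
          uint32_t F0[256 + MAGIC] = {0}, F1[256 + MAGIC] = {0},
                   F2[256 + MAGIC] = {0}, F3[256 + MAGIC] = {0};
          size_t i = 0;
          for (; i + 4 <= n; i += 4) {   /* 4 independent counter arrays   */
              F0[in[i + 0]]++;           /* break store-to-load stalls on  */
              F1[in[i + 1]]++;           /* runs of repeated symbols       */
              F2[in[i + 2]]++;
              F3[in[i + 3]]++;
          }
          for (; i < n; i++) F0[in[i]]++;
          for (int s = 0; s < 256; s++)
              out[s] = F0[s] + F1[s] + F2[s] + F3[s];
      }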

    Edit: yes, it helps. Alas the kernel doesn't support L1-dcache-store-misses on our "AMD Opteron(tm) Processor 6378", but I can get other figures out:

    Array size 256, enwik8 input, 100 trials.
    Code:
           5905.845184 task-clock                #    1.000 CPUs utilized          
           17291179661 cycles                    #    2.928 GHz                     [75.00%]
             238385676 stalled-cycles-frontend   #    1.38% frontend cycles idle    [75.01%]
            5345616530 stalled-cycles-backend    #   30.92% backend  cycles idle    [50.00%]
    256+8
    Code:
           5505.574011 task-clock                #    1.000 CPUs utilized          
           16392730803 cycles                    #    2.977 GHz                     [75.01%]
             251208399 stalled-cycles-frontend   #    1.53% frontend cycles idle    [75.01%]
            4349577595 stalled-cycles-backend    #   26.53% backend  cycles idle    [50.02%]
    Last edited by JamesB; 24th October 2016 at 12:18. Reason: Added FSE_count figures

  27. #172
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Entropy Coder Benchmark updated:

    - The entropy coder used in AOMedia AV1 has been benchmarked on Intel Skylake i6700 / 3.4 GHz and on a Raspberry Pi 3 (1.2 GHz, 64-bit ARM).
    The TurboANXN rANS-based entropy coder compresses better and is 7 times(!) faster than the AV1 EC.
    The bitwise (boolean) entropy coder TurboRC is even 3 times faster.


    - The AOMedia AV1 and Daala entropy coders are included in the TurboBench Compression Benchmark

    - TurboHF,0: optimal Huffman coding beats asymmetric numeral systems (FSE)



  28. The Following User Says Thank You to dnd For This Useful Post:

    Jarek (10th June 2018)

  29. #173
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    65
    Thanks
    30
    Thanked 22 Times in 15 Posts
    Regarding FPC on the Raspberry Pi, there is an unexpectedly large difference in decompression speed between fpc,0 and fpc,32.
    In my tests, using clang on a Cortex-A53 phone, the speeds are similar.

  30. #174
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    I've rerun the benchmark (option -J63) and I'm getting the same results.
    FPC gets slower with smaller block sizes.
    Unlike on Skylake, the Raspberry Pi benchmark numbers are not stable between runs.
    The maximum speed is reported.

  31. The Following User Says Thank You to dnd For This Useful Post:

    algorithm (10th June 2018)

  32. #175
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    65
    Thanks
    30
    Thanked 22 Times in 15 Posts
    Quote Originally Posted by dnd View Post
    I've rerun the benchmark (option -J63) and I'm getting the same results.
    FPC gets slower with smaller block sizes.
    Unlike on Skylake, the Raspberry Pi benchmark numbers are not stable between runs.
    The maximum speed is reported.
    This is possibly due to throttling. Maybe because compression is slower, the CPU throttles, and so decompression becomes slower.
    Can you test with fewer iterations for compression?

  33. #176
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    40
    Thanks
    19
    Thanked 9 Times in 6 Posts
    Quote Originally Posted by algorithm View Post
    This is possibly due to throttling. Maybe because compression is slower, the CPU throttles, and so decompression becomes slower.
    Yes, very likely. For heavy workloads I recommend using a heatsink on the Raspberry Pi's chip and ensuring proper air ventilation.

  34. #177
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Quote Originally Posted by algorithm View Post
    This is possibly due to throttling
    I've also used the options "-I1 -J63".
    TurboBench pauses 20 seconds after every 8 runs to avoid throttling.
    One run involves a minimum of 2 seconds of processing.
    These options were also used for the other block sizes (12, 16, 32), and as already stated, FPC gets slower
    with decreasing block size.
    With these options, I'm getting consistent results for all the other block sizes, so this has nothing to do with throttling.
    The Raspberry Pi 3 used already has a heatsink.

  35. #178
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    Some quotes from the AOMedia AV1 developers in https://news.ycombinator.com/item?id=17278907 :

    - Not useless, just not as good as Jarek liked to claim. It's a neat technique but hardly a breakthrough. In the end, for AV1, Daala_EC was better.
    - ... They're not patenting jarek's work, they're patenting improvements they made to it that were necessary to use it in AV1, which they dedicated engineers to do because Jarek declined to do it himself.
    - ... I can fully believe they verbally promised Jarek far more than he got.
    - Because Jarek is frankly out of line, and I see no reason to sugarcoat it.
    Google pissed him off (and I'm sure he has reason to be pissed) but he's not exactly being truthful about what happened, or what is happening.


    Even after the evidence in the Entropy Coder Benchmark, they still insist that the Daala EC in AV1 is better than ANS.

  36. #179
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    600
    Thanks
    186
    Thanked 186 Times in 112 Posts
    The patent scandal was probably a crucial reason behind this decision: even if 16 multiplications per symbol in Daala EC were faster/more energy-efficient/cheaper in hardware than 1 per symbol in rANS, for many years the format will be decoded in software, where there is absolutely no doubt.
    I had some direct contact regarding collaboration on a completely different topic in March 2016, but it suddenly stopped, as if the person had found out that collaboration was impossible due to a conflict of interest: they were trying to patent my suggestions behind my back without disclosing them.
    Timothy Lee from Ars Technica was promised contact with someone technical to explain the novelty of this patent ... but instead later got only a general comment about Google's great patent behavior (... https://www.wired.com/story/eric-swi...awsuit-patent/ ...).

  37. #180
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    421
    Thanks
    42
    Thanked 147 Times in 104 Posts
    The Daala EC (used in AOMedia AV1) is a conventional multisymbol (nibble) range coder.
    In decoding, a division is normally required for the symbol search.
    For nibble decoding it is possible to avoid the division by using up to 16 multiplications (as in the Daala EC).
    In my benchmarks, using a division is faster.
    My benchmarks also show that bitwise (boolean) decoding is faster, so there is no reason to use a multisymbol range coder.
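    To make the division vs. multiplication trade-off concrete, here is a hedged sketch of the division-based symbol search in a multisymbol range decoder (the names, the 12-bit scale, and the omitted renormalisation are illustrative assumptions):

    Code:
      #include <stdint.h>

      #define SCALE_BITS 12   /* cumulative frequencies sum to 1 << 12 */

      /* cum[0] = 0, cum[16] = 1 << SCALE_BITS */
      int rc_dec_nibble(uint32_t *range, uint32_t *code, const uint16_t cum[17])
      {
          uint32_t r = *range >> SCALE_BITS;          /* range / total    */
          uint32_t v = *code / r;                     /* the one division */
          if (v > (1u << SCALE_BITS) - 1) v = (1u << SCALE_BITS) - 1;
          int s = 0;
          while (cum[s + 1] <= v) s++;   /* the Daala EC instead walks this
                                            with up to 16 multiplications
                                            of the form cum[i] * r        */
          *code  -= cum[s] * r;
          *range  = (uint32_t)(cum[s + 1] - cum[s]) * r;
          /* renormalisation omitted for brevity */
          return s;
      }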


    Only rANS with SIMD nibble decoding can be faster than a bitwise range coder.
    The symbol search can be done without any multiplication; only one multiplication is required, to update the ANS state.
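    A hedged sketch of such a table-driven rANS nibble decode step (the slot table, scale, and renormalisation scheme are illustrative assumptions, not dnd's implementation): the symbol search is a plain array lookup, and the single multiply is the state update.

    Code:
      #include <stdint.h>

      #define SCALE_BITS 12
      #define RANS_L     (1u << 23)

      typedef struct {
          uint8_t  slot2sym[1 << SCALE_BITS];   /* slot -> nibble             */
          uint16_t freq[16], cum[16];           /* quantised freqs, cum. sums */
      } NibbleTab;

      static inline int rans_dec_nibble(uint32_t *x, const uint8_t **pin,
                                        const NibbleTab *t)
      {
          uint32_t slot = *x & ((1u << SCALE_BITS) - 1);    /* no divide    */
          int s = t->slot2sym[slot];                        /* table lookup */
          *x = t->freq[s] * (*x >> SCALE_BITS)              /* one multiply */
             + slot - t->cum[s];
          while (*x < RANS_L) *x = (*x << 8) | *(*pin)++;   /* renormalise  */
          return s;
      }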

  38. The Following 2 Users Say Thank You to dnd For This Useful Post:

    JamesB (22nd June 2018),Jarek (22nd June 2018)
