
Thread: TurboRLE: Turbo Run Length Encoding

  1. #1
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

TurboRLE: Turbo Run Length Encoding

    Most efficient and fastest Run Length Encoding library


    • 100% C (C++ compatible headers), without inline assembly
    • Efficient compression
    • No other RLE compresses or decompresses faster with better compression
    • Zero byte overhead
    • No modification of the raw data, preserving compressibility
    • Order preserving
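    A minimal usage sketch; the trlec/trled entry points and their exact signatures below are assumed from the repository description (check trle.h, they may differ):
    Code:
    /* assumed API: trlec() returns the compressed length,
       trled() needs the original (decompressed) length */
    #include <stdio.h>
    #include <string.h>
    #include "trle.h"

    int main(void) {
        unsigned char in[4096], out[4096 + 32], dec[4096];  /* +32 safety margin is an assumption */
        memset(in, 0x55, sizeof(in));                        /* highly RLE-compressible input      */
        unsigned clen = trlec(in, sizeof(in), out);          /* compress                           */
        trled(out, clen, dec, sizeof(in));                   /* decompress to the known length     */
        printf("%u -> %u bytes, ok=%d\n", (unsigned)sizeof(in), clen, !memcmp(in, dec, sizeof(in)));
        return 0;
    }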
    Last edited by dnd; 6th July 2016 at 07:34.

  2. The Following User Says Thank You to dnd For This Useful Post:

    nikkho (22nd February 2015)

  3. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Last edited by m^2; 21st February 2015 at 23:30.

  4. #3
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    From: http://encode.ru/threads/2121-No-mor...ll=1#post42555
    Silesia Corpus
    Code:
Filename     Original   Mespotine     PackBits    Tsukiyama   TurboRLE
    --------------------------------------------------------------------------------------
    dickens   10.192.446   10.192.478   10.422.651   10.410.427   10.188.450+
    mozilla   51.220.480   47.839.740   48.785.288   48.771.984   46.192.009+
    mr         9.970.564    9.967.870    7.304.657    7.244.984    7.226.824+
    nci       33.553.445   32.191.250   38.218.365   38.504.498   30.436.658+
    ooffice    6.152.192    6.033.040    6.060.136    6.077.203    5.916.834+
    osdb      10.085.684   10.085.716   10.335.930   10.337.870   10.031.464+
    reymont    6.627.202    6.627.234    6.730.810    6.708.956    6.616.505+
    samba     21.606.400   20.107.208   20.138.170   20.150.529   19.610.858+
    sao        7.251.944+   7.251.976    7.332.554    7.300.187    7.251.945
    webster   41.458.703   41.458.735   42.094.958   41.962.355   41.451.444+
    x-ray      8.474.240+   8.474.272    8.543.060    8.488.706    8.474.241
    xml        5.345.280    5.324.800    5.414.858    5.401.844    5.296.442+
    Last edited by dnd; 2nd January 2016 at 20:48.

  5. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    You have links mangled on the github site, e.g. the link to girl.bmp.
    And something's wrong with the lrrle results.
    First, the compression speed looks too good. 5-9 times that of rle64? Can't be.
    Also, lrrle64's strength should be very close to that of rle64, and usually stronger.
    Maybe you're not using rle64's fastest variant, the one with 64-bit words, but some other one?

  6. #5
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by m^2 View Post
    You have links mangled on the github site, e.g. the link to girl.bmp.
    And something's wrong with the lrrle results.
    First, the compression speed looks too good. 5-9 times that of rle64? Can't be.
    Also, lrrle64's strength should be very close to that of rle64, and usually stronger.
    Maybe you're not using rle64's fastest variant, the one with 64-bit words, but some other one?
    The link points to the site where you can find "girl.bmp" inside "RLE64Samples.7z".
    I'm not linking directly to the file, because such links may become inaccessible in the future.

    From the results on github you can see that I'm using rle64,8 (as a standard RLE example). The rle64,32, rle64,64
    and also all the lrrle functions are inefficient in terms of compression ratio.

  7. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by dnd View Post
    The link points to the site where you can find "girl.bmp" inside "RLE64Samples.7z".
    I'm not linking directly to the file, because such links may become inaccessible in the future.

    From the results on github you can see that I'm using rle64,8 (as a standard RLE example). The rle64,32, rle64,64
    and also all the lrrle functions are inefficient in terms of compression ratio.
    rle64,64 is weak, I can't argue with that.
    But it's fast. And you call your codec fast. Then compare it with fast ones.
    At first glance you seem to. For a reader not careful enough to spot the digit preceding the codec version, your benchmark includes a fast codec. It misled at least me.

    Your RLE ain't fast. It may very well be fast for a strong RLE. For some data it may be generally efficient, even very efficient. Or for some uses, ones that favour decompression speed. But these qualifications are important; without them the word 'fast' is simply incorrect. And I would add LZ4 for reference too.

    Also, there's a codec that targets a similar niche, very compressible data, and may be seen as a direct competitor to TRLE despite not being an RLE: Blosc. I suggest that you benchmark it too.
    Last edited by m^2; 24th February 2015 at 00:36.

  8. #7
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Well, I'm comparing what is comparable. There is no sense in comparing speed without mentioning the compression ratio.
    This is why I always sort my benchmarks by compressed size.
    TurboRLE is playing in another league compared with rle64 and your RLEs. The difference in compression ratio is not a few percent but
    several times for RLE-compressible files.
    There is also a standard RLE function (_srle) used internally in TurboRLE and not benchmarked separately. I've now extended
    this with my C templates (16/32/64-bit objects), obtaining functions that are not only faster but also more efficient than the corresponding RLE64
    functions (RLE64,32 and RLE64,64). I've already updated the github repository.
    I'll update my benchmarks next weekend.

    And at 2-3 GB/s, TurboRLE (trle) is already very fast for RLE decompression.

    blosc is simply LZ77 extended with a shuffle transform and is not competitive for RLE-compressible files.
    You can extend any compressor with a shuffle transform.
    For the data targeted by blosc (binary integer files) you can (probably) get better compression and 10 times faster execution by using integer
    compression like TurboPFor or FastPFor.
    I think RLE is best used in conjunction with another compression method, or for specific data, and not as general-purpose compression.
    So it is better to compare only RLE functions.
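    For illustration only, here is the basic idea behind the word-wise RLE variants mentioned above (runs are detected on whole 64-bit words instead of single bytes); this is a generic sketch, not the actual _srle code:
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Naive 64-bit word RLE: every run is emitted as a (word, count) pair.
       A practical codec keeps non-run data as plain literals and only escapes
       real runs, which is what makes it compact; out must hold 2*n words. */
    size_t rle64_sketch(const uint64_t *in, size_t n, uint64_t *out) {
        size_t i = 0, o = 0;
        while (i < n) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i]) run++;  /* count equal words */
            out[o++] = in[i];                                   /* the repeated word */
            out[o++] = (uint64_t)run;                           /* its run length    */
            i += run;
        }
        return o;                                               /* output size in words */
    }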

  9. #8
    Member
    Join Date
    Mar 2015
    Location
    Philadelphia, PA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts


    You say a maximum of 1 byte overhead, but right there on lines 19-23 you add 32 bytes of overhead.

    /* builds a 32-byte (256-bit) presence bitmap: bit k is set when t[k] > 0,
       i.e. when byte value k occurs in the input */
    for (i = 0; i < 32; ++i) {
        int j;
        c = 0;
        for (j = 0; j < 8; ++j)
            c += (t[i*8+j] > 0) << j;
        _putc(c, op);
    }

    (the forum's default code formatting was completely wacky, at least in the editor)

  10. #9
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Well, when you look at lines 148-152 (trlec.c) you'll see that the output is limited to inlen+1, simply by storing the input without compression.
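    For illustration, such a bound usually comes from a "store raw if incompressible" escape; a generic sketch with a hypothetical 1-byte flag, not the actual trlec.c logic:
    Code:
    #include <stddef.h>
    #include <string.h>

    /* Worst case is inlen+1: one flag byte plus either the RLE output or a raw copy.
       rle() stands for any RLE pass returning its compressed length; out must be
       large enough for that pass's own worst case plus the flag byte. */
    size_t compress_bounded(const unsigned char *in, size_t inlen, unsigned char *out,
                            size_t (*rle)(const unsigned char *, size_t, unsigned char *)) {
        size_t clen = rle(in, inlen, out + 1);  /* try RLE after the flag byte */
        if (clen < inlen) {
            out[0] = 1;                         /* flag: payload is compressed */
            return clen + 1;
        }
        out[0] = 0;                             /* flag: payload is a raw copy */
        memcpy(out + 1, in, inlen);
        return inlen + 1;
    }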

  11. #10
    Member
    Join Date
    Mar 2015
    Location
    Philadelphia, PA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Okay, so this is not part of your external interface, this is your guts. DERP.

    Yeah, there are other files in here. They do things, including deciding whether to call this at all.
    Last edited by Drachefly; 28th March 2015 at 16:53.

  12. #11
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

    New: TurboRLE - SIMD

    • 2x faster with SIMD decompression
    • ZERO byte overhead



    see TurboRLE

    Compresses better and up to 8 times faster, and decompresses up to 4 times faster, than RLE64
    Last edited by dnd; 29th December 2015 at 16:37.

  13. #12
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New update of TurboRLE:
    - faster
    - better compression
    - benchmark update

    TurboRC (bitwise order-0 entropy coder) after "RLE" preprocessing
    Code:
          C Size  ratio%     Name                    File
       180537740    18.1     TurboRLE (trle)         enwik9bwt
       254312056    25.4     Mespotine RLE (mrle)    enwik9bwt

  14. #13
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New update TurboRLE:
    https://github.com/powturbo/TurboRLE
    - Visual C++ support added, incl. build file "makefile.vs"

  15. #14
    Member
    Join Date
    May 2017
    Location
    UK
    Posts
    10
    Thanks
    1
    Thanked 3 Times in 3 Posts
    Thanks!

    Some benchmarks on Dual X5690:

    1. structured doubles - highly redundant
    Code:
    File='struct_mixed1' Length=541837056
    function  E MB/s   size      ratio   D MB/s
    trle      220.44  434738191  80.2%  1517.15
    srle 0    224.21  434738190  80.2%  1589.08
    srle 8    256.15  435057406  80.3%  1556.02
    srle 16   274.54  541837056 100.0%  2781.62
    srle 32   397.23  541837056 100.0%  3400.10
    srle 64  1805.58  238846493  44.1%  4262.84
    mrle      154.63  401799959  74.2%   801.61
    memcpy   2857.29  541837056 100.0%
    Comparison: 'lzturbo -30' 541837056 --> 52274804 (9.6%) Comp (took 0.566s, 957 MB/s) Decomp (took 0.392s, 1382.5 MB/s)
    (turbobench crashes on my computer, so I had to use lzturbo in a RAM drive)


    2. girl.bmp
    Code:
    File='girl.bmp' Length=403920054
    function  E MB/s   size      ratio   D MB/s
    trle     1202.19    3289669   0.8%  6305.04
    srle 0   1235.78    4732082   1.2%  6447.04
    srle 8   6126.03    4732081   1.2%  5170.24
    srle 16  2386.53    8431855   2.1%  6459.10
    srle 32  4074.86   13727066   3.4%  6383.87
    srle 64  6220.47   19844801   4.9%  6220.09
    mrle      183.75    4482384   1.1%  4994.68
    memcpy   2373.19  403920054 100.0%
    lzturbo -30 403920054 --> 2920823 (0.7%) Comp (took 0.342s, 1181 MB/s) Decomp (took 0.242s, 1669.0 MB/s)


    I'll post some more benchmarks from within my application. I've found that, for my uses, the requirement of a malloc for the output data seems to put an upper bound at around 3000 MB/s.
    Last edited by djubik; 16th June 2017 at 13:53.

  16. #15
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by djubik View Post
    I've found that, for my uses, the requirement of a malloc for the output data seems to put an upper bound at around 3000 MB/s.
    Thanks for testing TurboRLE. I'm not sure, but it seems an additional copy is involved in the interface between Matlab and C++. See here.

  17. #16
    Member
    Join Date
    May 2017
    Location
    UK
    Posts
    10
    Thanks
    1
    Thanked 3 Times in 3 Posts
    Thanks for the link, but actually it's more a function of my use case. In Matlab you can pass a pointer to C++ and do operations without allocating new memory.

    I like to consider the malloc time as part of the total cost of an algorithm, because I use compression to store large data structures in memory until they are required for scientific computation.
    So the baseline comparison is keeping the raw data uncompressed, which doesn't need a malloc (the data is already present and available for immediate use in computation).

    For compression I need to:
    (1) malloc a buffer
    (2) compress to the buffer
    (3) malloc an output vector with the correct compressed length
    (4) memcpy the compressed buffer to the output vector

    For decompression:
    (1) malloc output vector of correct size (stored in a header)
    (2) decompress directly to output vector

    So 2 mallocs + 1 memcpy for compression, and 1 malloc for decompression.

    I wish there was a way to avoid the 2nd malloc, and somehow "shrink" the buffer to the correct size (freeing up the excess memory) but as far as I'm aware that's not possible.
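    A sketch of that compression path (codec_bound and codec_compress are hypothetical placeholders, not TurboRLE's API; error handling omitted):
    Code:
    #include <stdlib.h>
    #include <string.h>

    size_t codec_bound(size_t inlen);                        /* hypothetical worst-case size */
    size_t codec_compress(const unsigned char *in,
                          size_t inlen, unsigned char *out); /* hypothetical codec           */

    unsigned char *pack(const unsigned char *in, size_t inlen, size_t *outlen) {
        unsigned char *buf = malloc(codec_bound(inlen));  /* (1) worst-case buffer        */
        size_t clen = codec_compress(in, inlen, buf);     /* (2) compress into it         */
        unsigned char *out = malloc(clen);                /* (3) exact-size output vector */
        memcpy(out, buf, clen);                           /* (4) copy the compressed data */
        free(buf);
        *outlen = clen;
        return out;
    }
    The realloc approach suggested in the reply below would merge steps (3) and (4) into a single in-place shrink.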
    Last edited by djubik; 21st June 2017 at 22:07.

  18. #17
    Member
    Join Date
    May 2017
    Location
    UK
    Posts
    10
    Thanks
    1
    Thanked 3 Times in 3 Posts
    Here's a benchmark from within my framework. This is with 12 threads compressing, and 4 threads decompressing. (chosen to provide best timings)

    struct_mixed1 (541837056 bytes)
    Code:
    Selkie ( 1) -T 4 --> size    541,837,056 - compress time 0.5159 s (1002 MB/s)- decompress time 0.2052 s (2518 MB/s) (compress sz    116,957,152) (ratio 21.6%) total cost = 0.7211
    LZ4    ( 1) -T 4 --> size    541,837,056 - compress time 0.2496 s (2070 MB/s)- decompress time 0.2078 s (2487 MB/s) (compress sz    114,785,038) (ratio 21.2%) total cost = 0.4574
    Lizard (21) -T 4 --> size    541,837,056 - compress time 0.3461 s (1493 MB/s)- decompress time 0.2178 s (2373 MB/s) (compress sz     86,624,627) (ratio 16.0%) total cost = 0.5639
    srle64 ( 1) -T 4 --> size    541,837,056 - compress time 0.3085 s (1675 MB/s)- decompress time 0.1841 s (2806 MB/s) (compress sz    238,846,956) (ratio 44.1%) total cost = 0.4926
    memcpy ( 1) -T 4 --> size    541,837,056 - compress time 0.6199 s ( 834 MB/s)- decompress time 0.1781 s (2901 MB/s) (compress sz    541,837,382) (ratio 100.0%)  total cost = 0.798
    TurboRLE is very interesting. For decompression it's basically at memcpy speed, but I get slower compression than LZ4; presumably this is because the final malloc + memcpy is twice the size of LZ4's final step, due to the worse compression ratio.

    I was hoping to find something that beats LZ4 in round-trip time, something like a total cost of 0.2 seconds; then it might be feasible to rewrite my matrix class to ALWAYS compress data under the hood (a similar idea to blosc in numpy), but at the moment it slows things down too much.

  19. #18
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by djubik View Post
    So 2 mallocs + 1 memcpy for compression, and 1 malloc for decompression.

    I wish there was a way to avoid the 2nd malloc, and somehow "shrink" the buffer to the correct size (freeing up the excess memory) but as far as I'm aware that's not possible.
    Better to use realloc/mxRealloc to shrink the buffer.
    Code:
    savep = p; if (!(p = realloc(p, newsize))) p = savep;
    You might also try the FCM/DFCM floating point compression (see file "fp.h") in TurboPFor Integer Compression.
    You can test it using the icbench app.

    ...class to ALWAYS compress data under the hood (a similar idea to blosc in numpy)...
    Something similar is also implemented in zfp

  20. #19
    Member
    Join Date
    May 2017
    Location
    UK
    Posts
    10
    Thanks
    1
    Thanked 3 Times in 3 Posts
    Thanks, I will try that.

    Btw, I had to add #include "intrin.h" on line 110 of conf.h so the compiler could see __popcnt64.

    After that TurboPFor compiled successfully on VS2010.
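    For reference, a portable sketch of that kind of fix (MSVC exposes __popcnt64 via <intrin.h>; GCC and Clang provide __builtin_popcountll):
    Code:
    /* pick the 64-bit popcount intrinsic for the current compiler */
    #ifdef _MSC_VER
    #  include <intrin.h>
    #  define popcnt64(x) __popcnt64(x)
    #else
    #  define popcnt64(x) __builtin_popcountll(x)
    #endif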

  21. #20
    Member
    Join Date
    Feb 2018
    Location
    Czechia
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Please, would you be so kind as to give me a link to download a Windows binary? I am not able to build it myself.

  22. #21
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    You can now download icapp, a benchmark program for integer compression that includes Turbo Run Length Encoding.

  23. The Following User Says Thank You to dnd For This Useful Post:

    RamiroCruzo (7th March 2018)

  24. #22
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    @dnd: Could you please publish a file/stdin to file/stdout version of Turbo RLE? I would greatly appreciate it! Thanks in advance.

  25. #23
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    "TurboRLE: SSE/AVX2 Run Length Encoding" the fastest RLE now 150% faster.

    "TurboRLE" demolishes "Mespotine RLE" also in the RLE entropy coder benchmark

  26. #24
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Blockchain block and Bible corpus:

    Code:
    function  E MB/s   size      ratio   D MB/s
    trle      511.29   32657091  64.9% 10196.85      
    srle 0    511.04   32657090  64.9% 10190.66      
    srle 8    587.23   32673166  64.9% 10123.02      
    srle 16  2736.31   32883313  65.3% 11001.45      
    srle 32  5139.55   32978512  65.5% 11054.61      
    srle 64  9294.86   33101321  65.8% 11006.26      
    mrle      533.58   32737286  65.0%  1786.77      
    memcpy  13804.62   50331648 100.0%
    
    function  E MB/s   size      ratio   D MB/s
    trle      356.19  505522810  95.6%  3294.92      
    srle 0    359.62  512033643  96.8%  7386.94      
    srle 8    396.81  512146943  96.9%  8349.66      
    srle 16  1889.11  526716994  99.6%  8614.11      
    srle 32  3956.73  527373849  99.7%  8626.19      
    srle 64  6956.59  528677473 100.0%  8569.15      
    mrle      531.45  508137683  96.1%  1531.63      
    memcpy  12076.43  528742400 100.0%

  27. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    dnd (16th June 2019),Mike (16th June 2019)

