Page 1 of 2
Results 1 to 30 of 33

Thread: libdeflate - a new optimized DEFLATE implementation

  1. #1
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts

    libdeflate - a new optimized DEFLATE implementation

    For anyone here who may be interested:

    I have written a new implementation of DEFLATE, freely available at https://github.com/ebiggers/libdeflate.

    The main goals of the project (other than learning) have been:

    • Implement many significant optimizations to improve performance over zlib
    • Provide optional high compression modes which compress better than zlib level 9
    • Support the zlib and gzip wrapper formats in addition to raw DEFLATE
    • Release all source code into the public domain
    • Maintain portability across operating systems and architectures


    The optimizations are probably the most interesting part of libdeflate. Based on my testing, on x86_64 they result in a compressor and decompressor approximately twice as fast as zlib at the default compression level (6). I can certainly provide more details if requested, and all the code is open. My intent was both to improve DEFLATE performance for applications that need to use it, and to provide a more optimized baseline to fairly compare new compression formats against (zlib is sometimes considered useless for this, as it is no longer considered well optimized).

    However, do note that the following have explicitly not been goals of the project:

    • Maintain compatibility with zlib's API
    • Support a streaming API
    • Support gzipping very large files which are too large to fit in memory
    • Compress as well as zopfli (which is near optimal for DEFLATE)
    • Break compatibility with standard DEFLATE
    • Support compressing with sliding windows smaller than the DEFLATE maximum of 32768 bytes


    libdeflate itself is a library, with a documented public API. I have written two simple programs that use the library:

    • benchmark, a compression benchmarking and testing program. It divides the given file(s), or standard input, into chunks, compresses and decompresses each chunk, verifies that the decompressed data is correct, and prints the average compression and decompression speeds. The compression level, chunk size, and compression engine (libdeflate or zlib) are configurable on the command line. This allows easy comparison between libdeflate and zlib.
    • gzip (or gunzip), a program to compress (or decompress) a gzip file, using a configurable compression level. Note: since this uses libdeflate's whole-buffer API rather than a streaming API, it is only really usable on small to mid-sized files and is not a drop-in replacement for UNIX gzip.


    libdeflate is most easily built on any UNIX-like system with gcc, but can also be built for Windows. I have posted Windows binaries compiled with MinGW at https://github.com/ebiggers/libdeflate/releases. Prefer the 64-bit binaries, since they are faster.

    More information about libdeflate can be found in its README file. See https://github.com/ebiggers/libdefla...ster/README.md.
    Last edited by Zyzzyva; 29th May 2016 at 00:29. Reason: update link to releases page

  2. The Following 13 Users Say Thank You to Zyzzyva For This Useful Post:

    Bulat Ziganshin (11th April 2016),cbloom (13th April 2016),Cyan (15th May 2016),encode (22nd May 2016),fhanau (11th April 2016),Gonzalo (11th April 2016),Jaff (12th April 2016),JamesB (11th April 2016),lorents17 (11th April 2016),Matt Mahoney (13th April 2016),Mike (15th May 2016),tobijdc (28th May 2016),zyzzle (12th April 2016)

  3. #2
    Member
    Join Date
    Oct 2009
    Location
    usa
    Posts
    56
    Thanks
    1
    Thanked 9 Times in 6 Posts
    A new implementation of zlib is always welcome. Thanks very much.

    I do not see any binaries posted, just the downloadable source...

    Will you kindly compile this for 32-bit (x86) and 64-bit (x64)? I don't see where your compiled binaries are located.

    Incidentally, I love your username.

  4. #3
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I do not see any binaries posted, just the downloadable source...
    Sorry, I screwed up the GitHub release with the Windows binaries. It should be fixed now.

  5. The Following User Says Thank You to Zyzzyva For This Useful Post:

    Jaff (12th April 2016)

  6. #4
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    some windows compiles, gcc 5.2, x86 and x64

    not tested.

    /edit: ok, i'm too late.
    Last edited by load; 12th April 2016 at 08:37.

  7. The Following User Says Thank You to load For This Useful Post:

    lorents17 (15th May 2016)

  8. #5
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Excellent idea for a project. I fully agree with the goal of providing a better baseline for what LZ-Huff-32k-window can do. The standard zlib implementation is a very unfair straw man to compare against that doesn't properly reflect how fast a modern LZ-H-32k can be.
    Last edited by cbloom; 3rd August 2016 at 19:12.

  9. #6
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    Good afternoon!
    Could you please tell me how to compile gzip from the libdeflate project?

    I use MSYS2 for compilation.
    Last edited by lorents17; 15th May 2016 at 16:10.

  10. #7
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I have updated libdeflate to v0.2. Notable changes:


    • Implemented a new block splitting algorithm which typically improves the compression ratio slightly at all compression levels.
    • The compressor now outputs each block using the cheapest type (dynamic Huffman, static Huffman, or uncompressed).
    • The gzip program has received an overhaul and now behaves more like the standard version.
    • Build system updates: some build options were changed or removed, and the default 'make' target now builds the gzip program as well as the library.
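
    Choosing the cheapest block type, as described in the list above, comes down to comparing the encoded size of each candidate and keeping the smallest. A minimal sketch in C, assuming the three costs have already been computed in bits. The enum and function names are hypothetical and are not taken from libdeflate's source:

```c
#include <stddef.h>

/* Hypothetical names for illustration; not libdeflate's identifiers. */
enum block_type {
    BLOCK_UNCOMPRESSED,
    BLOCK_STATIC_HUFFMAN,
    BLOCK_DYNAMIC_HUFFMAN
};

/*
 * Return the block type with the smallest cost in bits.
 * Ties are resolved in favor of the earlier (simpler) block type.
 */
enum block_type choose_block_type(size_t uncompressed_bits,
                                  size_t static_bits,
                                  size_t dynamic_bits)
{
    enum block_type best = BLOCK_UNCOMPRESSED;
    size_t best_bits = uncompressed_bits;

    if (static_bits < best_bits) {
        best = BLOCK_STATIC_HUFFMAN;
        best_bits = static_bits;
    }
    if (dynamic_bits < best_bits)
        best = BLOCK_DYNAMIC_HUFFMAN;
    return best;
}
```

    For example, choose_block_type(100, 90, 80) selects the dynamic Huffman block, while incompressible data, where the uncompressed cost is lowest, falls back to a stored block.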


    The latest source code is available by cloning https://github.com/ebiggers/libdeflate, and the latest Windows binaries can be downloaded from https://github.com/ebiggers/libdeflate/releases.

  11. The Following 2 Users Say Thank You to Zyzzyva For This Useful Post:

    Jaff (22nd May 2016),lorents17 (22nd May 2016)

  12. #8
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    Please compile libdeflate with these parameters:

    Code:
     override CFLAGS += -O3 -fomit-frame-pointer -std=gnu89 -I. -Icommon -Wall -Wundef -Wdeclaration-after-statement -Wmissing-prototypes -Wstrict-prototypes

  13. #9
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I usually find gcc's -O3 to be either the same or slightly worse than -O2. So I do not use it.

  14. The Following User Says Thank You to Zyzzyva For This Useful Post:

    lorents17 (22nd May 2016)

  15. #10
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    When I try to compile, MSYS2 gives this error message:

    Code:
    $ make
      CC       lib/aligned_malloc.pic.o
    lib/aligned_malloc.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/deflate_decompress.pic.o
    lib/deflate_decompress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/x86_cpu_features.pic.o
    lib/x86_cpu_features.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/deflate_compress.pic.o
    lib/deflate_compress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/adler32.pic.o
    lib/adler32.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/zlib_decompress.pic.o
    lib/zlib_decompress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/zlib_compress.pic.o
    lib/zlib_compress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/crc32.pic.o
    lib/crc32.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/gzip_decompress.pic.o
    lib/gzip_decompress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CC       lib/gzip_compress.pic.o
    lib/gzip_compress.c:1:0: warning: -fPIC ignored for target (all code is position independent)
     /*
     ^
      CCLD     libdeflate.so
      CC       lib/aligned_malloc.o
      CC       lib/deflate_decompress.o
      CC       lib/x86_cpu_features.o
      CC       lib/deflate_compress.o
      CC       lib/adler32.o
      CC       lib/zlib_decompress.o
      CC       lib/zlib_compress.o
      CC       lib/crc32.o
      CC       lib/gzip_decompress.o
      CC       lib/gzip_compress.o
      AR       libdeflate.a
      GEN      programs/config.h
      CC       programs/gzip.o
      CC       programs/prog_util.o
      CC       programs/tgetopt.o
      CCLD     gzip
    C:/msys32/mingw32/bin/../lib/gcc/i686-w64-mingw32/5.3.0/../../../../i686-w64-mingw32/lib/../lib/libmingw32.a(lib32_libmingw32_a-crt0_c.o): In function `main':
    C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crt0_c.c:18: undefined reference to `WinMain@16'
    collect2.exe: error: ld returned 1 exit status
    Makefile:169: recipe for target 'gzip' failed
    make: *** [gzip] Error 1

  16. #11
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    works for me, using toolchains

    it's from newest git code, compiled using GCC 6.1

    as said, sometimes -O3 is faster, sometimes not. zlib as an example was faster for me using just -O2

  17. The Following User Says Thank You to load For This Useful Post:

    lorents17 (22nd May 2016)

  18. #12
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    @lorents17

    The code is meant to be compiled into Windows native binaries, not linked to the Cygwin or MSYS2 compatibility layer. That means that in Cygwin or MSYS2, you need to compile using "x86_64-w64-mingw32-gcc", not "gcc". The README file contains an example of how to do this by overriding CC.

  19. The Following User Says Thank You to Zyzzyva For This Useful Post:

    lorents17 (22nd May 2016)

  20. #13
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    gzip crashes when compressing the attached file:
    Code:
    gzip -12 tulip_PNG9011
    With this command, everything works:
    Code:
    gzip -11 tulip_PNG9011

  21. #14
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    Thanks lorents17. I have fixed the bug in git master.

  22. #15
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I have tagged v0.3. There are a few bug fixes and minor changes.

    Source code: https://github.com/ebiggers/libdeflate
    Windows binaries: https://github.com/ebiggers/libdeflate/releases

  23. The Following 2 Users Say Thank You to Zyzzyva For This Useful Post:

    lorents17 (29th May 2016),SolidComp (30th May 2016)

  24. #16
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    Good evening!
    I wanted to ask whether anyone has thought of writing a PNG compression program based on libdeflate, similar to advdef.

    For testing:
    Code:
    for %%i in (PNG\*) do (
        krzydefc --gzipname "%%i"
        7za x "%%i.gz"
        gzip -12 "%%~dpni"
        krzydefc --png"%%~i" "%%~dpni.gz"
        del "%%i.gz"
        del "%%~dpni.gz"
        move /y "%%~dpni.gz.png" "%%i"
    )
    PNG RGBA - 8200 files:
    Code:
    ect -3 - 592 532 552 bytes
    gzip -12 - 592 784 260 bytes
    PNG RGBA - 3800 files:
    Code:
    ect -3 - 1 182 123 002 bytes
    gzip -12 - 1 181 681 973 bytes
    Last edited by lorents17; 30th May 2016 at 01:58.

  25. #17
    Member
    Join Date
    Apr 2011
    Location
    Russia
    Posts
    168
    Thanks
    163
    Thanked 9 Times in 8 Posts
    Good afternoon! I would like to return to my request: could someone write an application that recompresses PNG files (without other optimization) using libdeflate, similar to advdef? libdeflate offers both high speed and a high compression ratio.

  26. #18
    Member RamiroCruzo's Avatar
    Join Date
    Jul 2015
    Location
    India
    Posts
    15
    Thanks
    137
    Thanked 10 Times in 7 Posts
    Evening, masters. Could you please provide optimized .obj files compatible with zlib? I would be using them with Delphi.

  27. #19
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I've tagged libdeflate v0.4 and posted new Windows binaries.

    There are a number of miscellaneous changes, but the most technically interesting thing is that I have optimized Adler-32 (the checksum used in the zlib format) using vector instructions: SSE2 and AVX2 for x86_64, and NEON for ARM.
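
    For reference, the checksum that the vector code accelerates is defined in RFC 1950 as two running sums modulo 65521. A minimal byte-at-a-time sketch in C follows. It is a plain reference version, not libdeflate's code, and the function name is made up for this example. Optimized implementations typically defer the modulo and process many bytes per step:

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER32_MOD 65521u  /* largest prime below 2^16 */

/*
 * Byte-at-a-time Adler-32 as defined in RFC 1950.
 * Pass adler = 1 to start a new checksum.
 */
uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xFFFF;          /* running sum of bytes */
    uint32_t s2 = (adler >> 16) & 0xFFFF;  /* running sum of s1 values */

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER32_MOD;
        s2 = (s2 + s1) % ADLER32_MOD;
    }
    return (s2 << 16) | s1;
}
```

    Calling adler32_scalar(1, data, len) on the 9 bytes of "Wikipedia" yields 0x11E60398.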

    I tested the AVX2 version on a Skylake CPU and it runs at about 3.9 bytes/cycle, or 10.5 GB/sec when the CPU runs at 2.7 GHz. This suggests that the inner loop, which processes 32 bytes per iteration using 256-bit vectors, runs in about 8 cycles (32 / 3.9 ≈ 8.2).

    The SSE2 version is a little slower at about 2.7 bytes/cycle.

    The original was just 0.76 bytes/cycle (2.05 GB/s), more or less the same as zlib's. So the AVX2 version is over 5 times faster.

    Note: the AVX2 version is chosen automatically at runtime on supported CPUs.

    On ARM, the NEON version is somewhere around 3 times faster than the original.

    Next I am planning to optimize CRC-32 using carryless multiplication instructions.

    None of this is really new by itself, of course, but I think what is new is having all the fastest algorithms for DEFLATE/zlib/gzip in one place. That place *should* simply be zlib, of course, but for now I am focusing on the algorithms...

  28. The Following 6 Users Say Thank You to Zyzzyva For This Useful Post:

    Bulat Ziganshin (8th September 2016),Cyan (8th September 2016),jibz (8th September 2016),lorents17 (8th September 2016),RamiroCruzo (8th September 2016),stbrumme (8th September 2016)

  29. #20
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    Another update: I've tagged libdeflate v0.5 and posted new Windows binaries.

    libdeflate now includes a PCLMULQDQ (carryless multiplication) optimized CRC-32 implementation for x86_64. This makes gzip compression and decompression faster --- decompression proportionally more so, of course.

    Also, Adler-32 and CRC-32 are now part of libdeflate's public API.

    Here are some performance results I got comparing the different checksum implementations on Skylake, only counting the time actually spent calculating the checksum (no I/O, no decompression, etc.):

    Code:
        Method                              Speed (MB/s)
        ------                              ------------
        CRC-32 (libdeflate, PCLMUL)         10244
        CRC-32 (libdeflate, generic)        1450
        CRC-32 (zlib)                       1084
    
    
        Adler-32 (libdeflate, AVX2)         24271
        Adler-32 (libdeflate, SSE2)         7086
        Adler-32 (libdeflate, generic)      2401
        Adler-32 (zlib)                     2398
    So the fastest CRC-32 and Adler-32 implementations are both about 10x faster than the ones in zlib. And PCLMUL CRC-32 is about 7x faster than "slicing by 8", which is libdeflate's "generic" implementation.
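
    As background for the table-driven methods compared above, here is a minimal sketch of the simplest variant: one 256-entry table and one lookup per input byte, using the reflected gzip/zlib polynomial 0xEDB88320. This is neither libdeflate's nor zlib's code, and the names are chosen for this example only. "Slicing by 8" extends the same idea with eight tables so that eight bytes are consumed per iteration:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_table[256];
static int crc32_table_ready;

/* Build the 256-entry table for the reflected polynomial 0xEDB88320. */
static void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc32_table[n] = c;
    }
    crc32_table_ready = 1;
}

/* One table lookup per input byte. Pass crc = 0 to start a new checksum. */
uint32_t crc32_bytewise(uint32_t crc, const unsigned char *buf, size_t len)
{
    if (!crc32_table_ready)
        crc32_init_table();
    crc = ~crc;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ buf[i]) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```

    The standard check value applies: the CRC-32 of the 9 ASCII bytes "123456789" is 0xCBF43926, and the checksum can be continued across buffers by passing the previous result back in as crc.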

    I was going to include zlib-ng in the CRC-32 comparisons too, but the optimizations there only apply to internal code and not to the public crc32() API, so it wasn't straightforward.

    The AVX2 Adler-32 speed of 24 GB/s is indeed correct, though it should be considered as being measured under "somewhat ideal conditions". The 10.5 GB/s I previously gave was actually limited by my RAM speed of 12.5 GB/s, but the code actually runs faster than that when reading from cache. On Skylake the inner loop processes 32 bytes in as few as 3 cycles, which is 10.67 bytes/cycle. But that's also assuming the AVX2 hardware is "warmed-up". (There is an observable warm-up period where AVX2 instructions are about twice as slow. Supposedly it's because Intel CPUs normally keep only 128 of the 256 bits powered on, and it takes time to power on the full 256 bits when a program tries to use it.)

  30. The Following 8 Users Say Thank You to Zyzzyva For This Useful Post:

    Bulat Ziganshin (16th October 2016),Cyan (16th October 2016),JamesB (16th October 2016),jibz (16th October 2016),lorents17 (16th October 2016),Marsu42 (17th October 2016),Shelwien (16th October 2016),SolidComp (17th October 2016)

  31. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    this is a fantastic result! until now, the best result i was aware of was 1 cycle/byte using slicing-by-16. In particular, Google's crc-util library reached the same speed using the PCLMUL instruction. The crc-util library is pretty old, though, so the difference may simply be because CLMUL throughput improved from 1/8 in Nehalem..Ivy Bridge to 1/2 in Haswell and 1 in Broadwell. But the important point is that the crc-util library is huge, while your code is easy to include in other programs

    in fact, even slicing-by-8 is enough for Intel cpus to reach 1 cpb, but this requires correct instruction scheduling. Most compilers fail to do that (maybe except for ICC). slicing-by-16 just improves their chances, since it provides more independent instruction streams

    look at http://encode.ru/threads/1698-Fast-C...sh-calculation - it may improve table building time. there is also merge_crc that may be used for m/t crc calculation

  32. #22
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    Using carryless multiplication to calculate CRCs is not new, per se; my code more or less follows this 2009 paper from Intel: http://www.intel.com/content/dam/www...lqdq-paper.pdf. And there are already some CRC implementations that use this method. It may not be well known just how much faster it is than the table-driven methods, though, and it may have gotten faster as new Intel/AMD CPUs have been released in the past few years.

    The inner loop processes 64 bytes (512 bits) per iteration, essentially like this:
    Code:
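            /* Folding constants for a distance of 4 vectors (64 bytes). */
            const __v2di multipliers_4 = (__v2di){ 0x8F352D95, 0x1D9513D7 };
            /* Load the first 64 bytes into four 128-bit accumulators. */
            __m128i x0 = p[0], x1 = p[1], x2 = p[2], x3 = p[3];
            p += 4;
            for (; p != end512; p += 4) {
                    /* Load the next 64 bytes. */
                    __m128i y0 = p[0], y1 = p[1], y2 = p[2], y3 = p[3];

                    /* 0x00: low 64 bits of each accumulator times the low constant. */
                    y0 ^= _mm_clmulepi64_si128(x0, multipliers_4, 0x00);
                    y1 ^= _mm_clmulepi64_si128(x1, multipliers_4, 0x00);
                    y2 ^= _mm_clmulepi64_si128(x2, multipliers_4, 0x00);
                    y3 ^= _mm_clmulepi64_si128(x3, multipliers_4, 0x00);
                    /* 0x11: high 64 bits of each accumulator times the high constant. */
                    y0 ^= _mm_clmulepi64_si128(x0, multipliers_4, 0x11);
                    y1 ^= _mm_clmulepi64_si128(x1, multipliers_4, 0x11);
                    y2 ^= _mm_clmulepi64_si128(x2, multipliers_4, 0x11);
                    y3 ^= _mm_clmulepi64_si128(x3, multipliers_4, 0x11);

                    /* The folded values become the accumulators for the next iteration. */
                    x0 = y0, x1 = y1, x2 = y2, x3 = y3;
            }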
    Each carryless multiplication and XOR "folds" 64 bits into 95 bits that are 543 (= 512 + 95 - 64) or 479 (= 512 + 95 - 128) bits later in the message. The precomputed constants in 'multipliers_4' are x^543 mod G(x) and x^479 mod G(x) where G(x) is the degree-32 CRC-32 generator polynomial. Note that this is *not* using the SSE4.2 "crc32l" instruction which assumes a different generator polynomial, the "Castagnoli polynomial".
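
    To make the two ingredients above concrete, here is a portable sketch in C of carryless multiplication and of computing x^n mod G(x) by repeated shifting. It is an illustration only, with made-up function names, and it uses a plain bit order in which bit i stands for x^i. A gzip CRC implementation may store its polynomials bit-reflected, so the values this sketch produces should not be expected to match the hex constants in 'multipliers_4' digit for digit:

```c
#include <stdint.h>

/* CRC-32 generator polynomial G(x), including the x^32 term. */
#define CRC32_POLY 0x104C11DB7ull

/*
 * Carryless (polynomial) multiplication over GF(2):
 * like schoolbook multiplication, but partial products are
 * combined with XOR instead of addition. Bit i stands for x^i.
 */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t product = 0;

    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            product ^= (uint64_t)a << i;
    return product;
}

/* Compute x^n mod G(x), with bit i of the result standing for x^i. */
uint32_t xpow_mod_g(unsigned n)
{
    uint64_t r = 1;  /* x^0 */

    while (n--) {
        r <<= 1;                   /* multiply by x */
        if (r & (1ull << 32))
            r ^= CRC32_POLY;       /* reduce modulo G(x) */
    }
    return (uint32_t)r;
}
```

    As a small check, clmul32(3, 3) is 5, because (x + 1)^2 = x^2 + 1 when coefficients are taken modulo 2, and xpow_mod_g(32) is 0x04C11DB7, the low 32 bits of G(x).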

    Essentially the same method can be used on any other CPU that supports carryless multiplication --- e.g. ARMv8, I think, but I don't have an ARMv8 CPU to test right now.

  33. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    just a few months ago i got confirmation that clmul-based code runs at 1 cpb: http://encode.ru/threads/2274-ECT-an...pression/page4

    but he tested on Sandy Bridge, so it would be interesting to check the speed of your code and his on different cpu generations. can you make a windows executable that checks just crc speed?

  34. #24
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    but he tested on Sandy Bridge, so it would be interesting to check the speed of your code and his on different cpu generations. can you make a windows executable that checks just crc speed?
    There is already such a program included with the Windows binaries. Just download https://github.com/ebiggers/libdefla...x86_64-bin.zip and use the 'checksum.exe' program. Usage is:

    Code:
     > .\checksum.exe -h
    Usage: checksum.exe [-A] [-h] [-s SIZE] [-t] [-Z] [FILE]...
    Calculate Adler-32 or CRC-32 checksums of the specified FILEs.
    
    
    Options:
      -A        use Adler-32 (default is CRC-32)
      -h        print this help
      -s SIZE   chunk size
      -t        show checksum speed, excluding I/O
      -Z        use zlib implementation instead of libdeflate
    For CRC-32 it uses the PCLMUL version if the CPU supports it, otherwise it uses slice-by-8. Or you can specify -Z to use the zlib version, which is slice-by-4.

    Note that the checksum itself is very fast, faster than reading from the input files. So you need to use the -t option to time just the checksum computation. Input files also need to be fairly large to get a somewhat consistent measurement.

    You can also use the -s option to vary the buffer size that is used (default is 131072 bytes). This may affect the performance differently on different CPUs.

  35. The Following User Says Thank You to Zyzzyva For This Useful Post:

    Marsu42 (17th October 2016)

  36. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    the results are inconsistent, but definitely better than 1 cpb. unfortunately, it probably needs more specialized benchmark code to get consistent results. my cpu is i7-4770 (3.9 GHz)


    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 113 ms 8849 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 70 ms 14285 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 60 ms 16666 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 70 ms 14285 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 81 ms 12345 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 449 ms 10091 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 615 ms 7367 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 628 ms 7215 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 705 ms 6427 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t -Z z:\e9
    f39e3afb "z:\e9" 737 ms 1356 MB/s

    C:\Testing\!Compression\libdeflate-0.5-windows-x86_64-bin>checksum.exe -t -Z z:\e9
    f39e3afb "z:\e9" 711 ms 1406 MB/s

  37. #26
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I attached a version of the checksum program that should be more accurate. I believe the timings were inaccurate on Windows because I wasn't using QueryPerformanceCounter(). (I didn't notice because I normally test on Linux.) This is fixed now.

  38. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    thanks, now the results seem much more consistent once the file is cached. other chunk sizes didn't improve the results. so, on Haswell, it's about 3.5 bytes per cycle, which is close to the theoretical maximum of 4 bytes per cycle (8 CLMUL operations, each taking 2 cycles, per 64 input bytes). your results on Skylake may be limited by memory speed, though (or by CLMUL latencies; i can't find another explanation)

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 171 ms 5841 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 75 ms 13161 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 73 ms 13615 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 76 ms 13057 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 73 ms 13644 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 73 ms 13657 MB/s

    D:\Downloads>checksum.exe -t z:\e9
    f39e3afb "z:\e9" 75 ms 13202 MB/s

    D:\Downloads>checksum.exe -t -Z z:\e9
    f39e3afb "z:\e9" 787 ms 1269 MB/s

    D:\Downloads>checksum.exe -t -Z z:\e9
    f39e3afb "z:\e9" 804 ms 1243 MB/s

    D:\Downloads>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 570 ms 7941 MB/s

    D:\Downloads>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 627 ms 7226 MB/s

    D:\Downloads>checksum.exe -t z:\4g
    d974fa68 "z:\4g" 588 ms 7693 MB/s
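
    The bytes-per-cycle figures in this post can be reproduced with a little arithmetic. A small sketch in C, assuming that MB/s in the logs means 10^6 bytes per second and that the CPU runs at the quoted 3.9 GHz throughout. The function names are made up for this example:

```c
/* Convert a measured throughput into bytes per clock cycle. */
double bytes_per_cycle(double mb_per_s, double clock_ghz)
{
    /* MB/s is taken as 10^6 bytes/s and GHz as 10^9 cycles/s. */
    return (mb_per_s * 1e6) / (clock_ghz * 1e9);
}

/*
 * Upper bound in bytes per cycle when every loop iteration issues
 * `ops` instructions that each occupy the execution unit for
 * `cycles_per_op` cycles and the iteration consumes `bytes` bytes.
 */
double max_bytes_per_cycle(double bytes, double ops, double cycles_per_op)
{
    return bytes / (ops * cycles_per_op);
}
```

    With roughly 13600 MB/s at 3.9 GHz, bytes_per_cycle gives about 3.5, against an upper bound of max_bytes_per_cycle(64, 8, 2) = 4.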

  39. #28
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Originally Posted by Zyzzyva:
    Another update: I've tagged libdeflate v0.5 and posted new Windows binaries.

    On Skylake the inner loop processes 32 bytes in as few as 3 cycles, which is 10.67 bytes/cycle. But that's also assuming the AVX2 hardware is "warmed-up". (There is an observable warm-up period where AVX2 instructions are about twice as slow. Supposedly it's because Intel CPUs normally keep only 128 of the 256 bits powered on, and it takes time to power on the full 256 bits when a program tries to use it.)
    The key to keeping Skylake AVX2 units warmed up is to feed them 256-bit / 32-byte instruction payloads without any lulls exceeding 675 µs, according to Agner Fog.

  40. #29
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    FYI, libdeflate 1.0 was released last month: https://github.com/ebiggers/libdeflate/releases

    There have been lots of interesting updates over the past year or so. It would be interesting to see how it compares to zopfli now (and ECT?):

    Version 1.0:
    Added support for multi-member gzip files.
    Moved architecture-specific code into subdirectories. If you aren't using the provided Makefile to build libdeflate, you now need to compile lib/*.c and lib/*/*.c instead of just lib/*.c.
    Added an ARM PMULL implementation of CRC-32, which speeds up gzip compression and decompression on 32-bit and 64-bit ARM processors that have the Cryptography Extensions.
    Improved detection of CPU features, resulting in accelerated functions being used in more cases. This includes:
    • Detect CPU features on 32-bit x86, not just 64-bit as was done previously.
    • Detect CPU features on ARM, both 32 and 64-bit. (Limited to Linux only currently.)

    Version 0.8:
    Build fixes for certain platforms and compilers.
    libdeflate now produces the same output on all CPU architectures.
    Improved documentation for building libdeflate on Windows.

    Version 0.7:
    Fixed a very rare bug that caused data to be compressed incorrectly. The bug affected compression levels 7 and below since libdeflate v0.2. Although there have been no user reports of the bug, and I believe it would have been highly unlikely to encounter on realistic data, it could occur on data specially crafted to reproduce it.
    Fixed a compilation error when building with clang 3.7.

    Version 0.6:
    Various improvements to the gzip program's behavior.
    Faster CRC-32 on AVX-capable processors.
    Other minor changes.

    Version 0.5:
    The CRC-32 checksum algorithm has been optimized with carryless multiplication instructions for x86_64 (PCLMUL). This speeds up gzip compression and decompression.
    Build fixes for certain platforms and compilers.
    Added more test programs and scripts.
    libdeflate is now entirely MIT-licensed.

  41. The Following User Says Thank You to SolidComp For This Useful Post:

    Bulat Ziganshin (23rd May 2018)

  42. #30
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Originally Posted by SolidComp:
    FYI, libdeflate 1.0 was released last month: https://github.com/ebiggers/libdeflate/releases
    Test of libdeflate-1.0-windows-x86_64-bin
    Code:
    1.787.785.811 mkv
    1.786.957.948 mkv.7z.22.zstd - 137 sec
    1.785.465.763 mkv.winrar.best.zip - 27 sec
    1.785.419.731 mkv.7z.deflate.mx.zip - 11.5 min
    1.785.400.203 mkv.1.gz - 37 sec
    1.785.221.871 mkv.7z.deflate64.mx.zip - 13 min
    1.784.581.959 mkv.12.gz - 80 sec
    1.784.439.182 mkv.m8.zcm093 - 9 min
    1.784.355.455 mkv.ect.10007.zip - 105 min
    1.784.044.948 mkv.b2047.bcm130 - 10 min
    No support for 4.5GB files, maybe only 4GB.

    Another mkv:
    Code:
    2.772.852.928  mkv
    2.771.456.073  mkv.7z.zip deflate (18 min)
    2.771.293.029  mkv.7z.zip deflate64 (25 min)
    2.771.171.201  mkv.gz -12 (2 min)
    gz extracted with md5 ok.
    Last edited by a902cd23; 26th May 2018 at 17:32.
