Thread: Intel's Zlib

  1. #1
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts

    Intel's Zlib

    Intel have been working on zlib again and have a patched version. In my hands it's looking pretty good, although I haven't yet tested it on AMD systems or on 32-bit platforms to see how portable it is.

    https://www-ssl.intel.com/content/da...paper-copy.pdf
    https://github.com/jtkukunas/zlib

    On my test set it seems to be around 70% faster, but the output is 1-2% larger. A good tradeoff anyway.

    Also, the -1 compression level is much faster and much weaker at compression. That's a good thing IMO, as you only use -1 when you want lightweight compression. LZ4 still beats it for that tradeoff, but within the zlib framework it's a nice option to have.
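    For anyone who hasn't used it, the level is just an argument to the standard zlib API; a minimal, generic illustration (nothing Intel-specific) using the stock one-shot compress2() call:
    Code:
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        const char src[] = "the quick brown fox jumps over the lazy dog";
        Bytef      dst[128];
        uLongf     dlen = sizeof(dst);

        /* level 1 = fastest/weakest ... 9 = slowest/strongest, 6 = default */
        if (compress2(dst, &dlen, (const Bytef *)src, sizeof(src), 1) != Z_OK)
            return 1;
        printf("%u -> %lu bytes at level 1\n", (unsigned)sizeof(src), (unsigned long)dlen);
        return 0;
    }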

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Not much better on Phenom 2:

    Code:
    pcbsd-8973% please ./fsbench -i3 zlib,1 zlib,2 zlib,3 zlib,4 zlib,5 zlib,6 zlib,7 zlib,8 zlib,9 ~/bench/scc1.tar 
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    zlib                                    2014-06-16 Intel 1
        5583776 (x 2.256)     40.6 MB/s  213 MB/s        22e6  118e6
    zlib                                    2014-06-16 Intel 2
        5452748 (x 2.310)     37.7 MB/s  221 MB/s        21e6  125e6
    zlib                                    2014-06-16 Intel 3
        5320826 (x 2.367)     32.2 MB/s  229 MB/s        18e6  132e6
    zlib                                    2014-06-16 Intel 4
        5248936 (x 2.400)     29.0 MB/s  223 MB/s        16e6  129e6
    zlib                                    2014-06-16 Intel 5
        5142273 (x 2.449)     22.5 MB/s  228 MB/s        13e6  134e6
    zlib                                    2014-06-16 Intel 6
        5069744 (x 2.484)     16.4 MB/s  231 MB/s         9e6  137e6
    zlib                                    2014-06-16 Intel 7
        5056512 (x 2.491)     14.0 MB/s  234 MB/s      8560e3  140e6
    zlib                                    2014-06-16 Intel 8
        5041030 (x 2.499)     9594 KB/s  236 MB/s      5754e3  141e6
    zlib                                    2014-06-16 Intel 9
        5039048 (x 2.500)     8680 KB/s  235 MB/s      5207e3  140e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (3*X*1) iteration(s)).
    pcbsd-8973% please ./fsbench -i3 zlib,1 zlib,2 zlib,3 zlib,4 zlib,5 zlib,6 zlib,7 zlib,8 zlib,9 ~/bench/scc1.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    zlib                                    1.2.8        1
        5583512 (x 2.256)     40.8 MB/s  217 MB/s        22e6  120e6
    zlib                                    1.2.8        2
        5451648 (x 2.310)     37.6 MB/s  226 MB/s        21e6  128e6
    zlib                                    1.2.8        3
        5315631 (x 2.369)     31.3 MB/s  235 MB/s        18e6  135e6
    zlib                                    1.2.8        4
        5244572 (x 2.402)     28.8 MB/s  229 MB/s        16e6  133e6
    zlib                                    1.2.8        5
        5132183 (x 2.454)     21.6 MB/s  233 MB/s        12e6  137e6
    zlib                                    1.2.8        6
        5069744 (x 2.484)     15.9 MB/s  239 MB/s      9733e3  142e6
    zlib                                    1.2.8        7
        5056513 (x 2.491)     13.4 MB/s  238 MB/s      8225e3  142e6
    zlib                                    1.2.8        8
        5041027 (x 2.499)     9104 KB/s  240 MB/s      5460e3  143e6
    zlib                                    1.2.8        9
        5039044 (x 2.500)     8249 KB/s  240 MB/s      4949e3  143e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (3*X*1) iteration(s)).
    Compression is slightly faster, especially in the strong modes; decompression is slower. Maybe I'm doing something wrong?
    ADDED: benchmark code is here:
    https://chiselapp.com/user/Justin_be.../dir?type=tree

    If you want to check it yourself, you need to build two versions of the program: one stock, and one where you edit CMakeLists.txt and change
    Code:
    set(USE_ZLIB              1)
    set(USE_ZLIB_INTEL        0)
    to:
    Code:
    set(USE_ZLIB              0)
    set(USE_ZLIB_INTEL        1)
    ADDED:

    If I enable SSE2, the code gets faster:
    Code:
    pcbsd-8973% please ./fsbench -i3 zlib,1 zlib,2 zlib,3 zlib,4 zlib,5 zlib,6 zlib,7 zlib,8 zlib,9 ~/bench/scc1.tar
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    zlib                                    2014-06-16 Intel 1
        5583776 (x 2.256)     52.4 MB/s  215 MB/s        29e6  119e6
    zlib                                    2014-06-16 Intel 2
        5452748 (x 2.310)     47.7 MB/s  222 MB/s        27e6  126e6
    zlib                                    2014-06-16 Intel 3
        5320826 (x 2.367)     39.1 MB/s  233 MB/s        22e6  134e6
    zlib                                    2014-06-16 Intel 4
        5248936 (x 2.400)     34.4 MB/s  227 MB/s        20e6  132e6
    zlib                                    2014-06-16 Intel 5
        5142273 (x 2.449)     25.6 MB/s  231 MB/s        15e6  136e6
    zlib                                    2014-06-16 Intel 6
        5069744 (x 2.484)     18.1 MB/s  236 MB/s        10e6  140e6
    zlib                                    2014-06-16 Intel 7
        5056512 (x 2.491)     15.1 MB/s  237 MB/s      9225e3  142e6
    zlib                                    2014-06-16 Intel 8
        5041030 (x 2.499)     9.87 MB/s  240 MB/s      6061e3  144e6
    zlib                                    2014-06-16 Intel 9
        5039048 (x 2.500)     9125 KB/s  240 MB/s      5474e3  143e6
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    done... (3*X*1) iteration(s)).
    It has options to use newer x86 extensions, but I don't have a CPU to test them on. And fsbench's support for such things is quite poor and needs a redesign.
    Last edited by m^2; 16th June 2014 at 21:13.

  3. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by m^2 View Post
    Not much better on Phenom 2:

    Code:
    pcbsd-8973% please ./fsbench -i3 zlib,1 zlib,2 zlib,3 zlib,4 zlib,5 zlib,6 zlib,7 zlib,8 zlib,9 ~/bench/scc1.tar 
    Codec                                   version      args
    C.Size      (C.Ratio)        E.Speed   D.Speed      E.Eff. D.Eff.
    zlib                                    2014-06-16 Intel 1
        5583776 (x 2.256)     40.6 MB/s  213 MB/s        22e6  118e6
    ...
    zlib                                    1.2.8        1
        5583512 (x 2.256)     40.8 MB/s  217 MB/s        22e6  120e6
    It has options to use newer x86 extensions, but I don't have a CPU to test. And fsbench support for such things is quite bad and needs a redesign.
    I'm not convinced this is working correctly. For me there was a huge difference at the zlib -1 compression level. I tested it simply using the minigzip demo program that comes with zlib, taking care to set LD_LIBRARY_PATH beforehand to ensure it picked up the correct run-time library.

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I found another set of compile-time options, USE_QUICK and USE_MEDIUM.
    The first doesn't compile because it uses inline asm that isn't supported by my compiler (clang 3.1).
    The second is used in modes 5 and 6. They turn from:
    Code:
    zlib                                    2014-06-16 Intel 5
        5142273 (x 2.449)     25.6 MB/s  231 MB/s        15e6  136e6
    zlib                                    2014-06-16 Intel 6
        5069744 (x 2.484)     18.1 MB/s  236 MB/s        10e6  140e6
    to:
    Code:
    zlib                                    2014-06-16 Intel 5
        5123058 (x 2.459)     31.2 MB/s  236 MB/s        18e6  139e6
    zlib                                    2014-06-16 Intel 6
        5079791 (x 2.479)     24.3 MB/s  239 MB/s        14e6  142e6

  5. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by m^2 View Post
    I found another set of compile time options, USE_QUICK and USE_MEDIUM.
    The first doesn't compile because it uses inline asm that's not supported by my compiler (clang 3.1).
    It's clearly ongoing work. I think someone locally (Martin Pollard) has got it to build under clang, so there may already be a pull request for that out there. Edit: Ah yes, I see it has been incorporated into a git branch "issue6", presumably to be merged in soon.

    It does mean, however, that we can see how much of their speed gains are due to algorithmic differences and how much to rewriting zlib to use assembly with SSE instructions. (Of course, they could also be using SSE intrinsics from within C.)
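    For anyone who hasn't seen the intrinsics style, here's a generic illustration (not Intel's actual code) of the kind of thing SSE intrinsics from C let you do: comparing 16 bytes at once where stock zlib's match loop goes byte by byte.
    Code:
    /* Generic illustration only, not Intel's zlib code: count how many leading
     * bytes two 16-byte blocks have in common using SSE2 intrinsics. */
    #include <emmintrin.h>  /* SSE2 */

    static int match16(const unsigned char *a, const unsigned char *b)
    {
        __m128i va   = _mm_loadu_si128((const __m128i *)a);
        __m128i vb   = _mm_loadu_si128((const __m128i *)b);
        __m128i eq   = _mm_cmpeq_epi8(va, vb);           /* 0xFF where bytes are equal */
        int     diff = ~_mm_movemask_epi8(eq) & 0xFFFF;  /* bit set where they differ */

        if (diff == 0)
            return 16;                  /* all 16 bytes match */
        return __builtin_ctz(diff);     /* index of first mismatch (gcc/clang builtin) */
    }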

    Regardless, I think it is a promising development given the wealth of programs that are built around zlib.
    Last edited by JamesB; 17th June 2014 at 18:42.

  6. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I cloned the repository and looked through the logs. It looks like the last commit from mainline zlib prior to the fork was 50893291621658f355bc5b4d450a8d06a563053d. So this should tell you what was done:
    https://github.com/jtkukunas/zlib/co...3053d...master

    It's almost 100% compiler intrinsics, with almost no inline assembly. They appear to have changed some algorithms, too.
    Last edited by nburns; 17th June 2014 at 22:09.

  7. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Not just Intel, it seems: IBM is also optimising Deflate:

    http://ieeexplore.ieee.org/xpl/artic...mber%3D6824430

    Alas, it's not an open-access publication, so they want people to pay for it. Pfft!

  8. #8
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Microsoft is studying canonical Huffman codes:
    http://research.microsoft.com/pubs/2...al_huffman.pdf
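    For context, a canonical Huffman code is fully determined by the code lengths; this is the scheme Deflate itself uses, per RFC 1951. A minimal sketch of the assignment (not taken from the paper):
    Code:
    /* Canonical Huffman: assign codes purely from code lengths, in
     * (length, symbol) order, so only the lengths need to be stored.
     * Sketch follows the RFC 1951 description. */
    #include <stdio.h>

    #define MAX_BITS 15

    static void canonical_codes(const int *len, unsigned *code, int nsym)
    {
        int      bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1];
        unsigned c = 0;

        for (int n = 0; n < nsym; n++)                 /* count codes per length */
            bl_count[len[n]]++;
        bl_count[0] = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) { /* smallest code per length */
            c = (c + bl_count[bits - 1]) << 1;
            next_code[bits] = c;
        }
        for (int n = 0; n < nsym; n++)                 /* hand codes out in order */
            if (len[n] != 0)
                code[n] = next_code[len[n]]++;
    }

    int main(void)
    {
        /* RFC 1951 example: lengths 3,3,3,3,3,2,4,4 give codes 010,011,100,101,110,00,1110,1111 */
        const int len[8] = {3, 3, 3, 3, 3, 2, 4, 4};
        unsigned  code[8];
        canonical_codes(len, code, 8);
        for (int n = 0; n < 8; n++)
            printf("symbol %d: length %d, code %u\n", n, len[n], code[n]);
        return 0;
    }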

  9. #9
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    I have also noticed that send_bits in trees.c (moved to deflate.h in Intel's version) is apparently suboptimal. The non-DEBUG build defines it this way:
    Code:
    #define send_bits(s, value, length) \
    { int len = length;\
      if (s->bi_valid > (int)Buf_size - len) {\
        int val = value;\
        s->bi_buf |= (ush)val << s->bi_valid;\
        put_short(s, s->bi_buf);\
        s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
        s->bi_valid += len - Buf_size;\
      } else {\
        s->bi_buf |= (ush)(value) << s->bi_valid;\
        s->bi_valid += len;\
      }\
    }

    but I found that something like this:
    Code:
    bi_buf |= value << bi_valid;
    bi_valid += length;
    if (bi_valid > 16U) {
        bi_valid = bi_valid - 16;
        *op++ = (unsigned char)(bi_buf & 0xff);
        bi_buf = bi_buf >> 8;
        *op++ = (unsigned char)(bi_buf & 0xff);
        bi_buf = bi_buf >> 8;

        if (bi_valid > 7U) {
            *op++ = (unsigned char)(bi_buf & 0xff);
            bi_buf = bi_buf >> 8;
            bi_valid = bi_valid - 8;
        }
    }

    is about 10% faster. OK, the first is a macro and the second is straight code with put_short and put_byte inlined (I never found time to rewrite it as a macro so it could be retrofitted and benchmarked in the original zlib code), but apparently computing the new bi_valid value right at the top is a bit faster.
    This code writes 0, 2 or 3 bytes to the output buffer; more precisely, it tries to write 2 bytes and then possibly another one. The original macro did not try to write the single byte. (The code also works without the second if statement, which would be closer to the original macro, but it is apparently a little bit slower that way.)
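    A rough, untested sketch of what the retrofit might look like as a zlib-style macro, assuming s->bi_buf is widened to a 32-bit accumulator (the stock 16-bit ush field is too narrow for this scheme) and reusing zlib's existing put_byte():
    Code:
    /* Untested sketch, not a drop-in replacement: assumes s->bi_buf has been
     * widened to a 32-bit unsigned accumulator. */
    #define send_bits_alt(s, value, length) \
    { \
        s->bi_buf |= (ulg)(value) << s->bi_valid; \
        s->bi_valid += (length); \
        if (s->bi_valid > 16) { \
            put_byte(s, (Byte)(s->bi_buf & 0xff)); \
            s->bi_buf >>= 8; \
            put_byte(s, (Byte)(s->bi_buf & 0xff)); \
            s->bi_buf >>= 8; \
            s->bi_valid -= 16; \
            if (s->bi_valid > 7) { \
                put_byte(s, (Byte)(s->bi_buf & 0xff)); \
                s->bi_buf >>= 8; \
                s->bi_valid -= 8; \
            } \
        } \
    }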
    Last edited by caveman; 21st June 2014 at 03:36.

  10. #10
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    double post, sorry
    Last edited by caveman; 21st June 2014 at 03:42.

