
Thread: Compiling lz4

  1. #1 smjohn1 (Member, USA)

     Compiling lz4

    Just trying to see how to speed up lz4 by using different compilers.

    Test machine: Intel(R) i5-3320M CPU @ 2.60 GHz, cache size: 3072 KB

    One compiler is gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5) with -O3; another is the same gcc with -O3 -mavx;
    the third is the Intel C++ 64 compiler, icpc 17.0.4 (gcc 5.0.0 compatibility), with -O3 -ipo.

    Here are brief test results for the lz4 default level (-1) on a file of about 900 KB:

                 Comp.         Decomp.
    gcc          204.86 MB/s   248.27 MB/s
    gcc -mavx    202.24 MB/s   258.63 MB/s
    icpc         201.09 MB/s   244.36 MB/s


    These are just rough tests, far from scientific. But it seems that neither SSE nor the Intel compiler improves lz4's speed significantly. Are there any other ways to further improve lz4's speed?

  2. The Following User Says Thank You to smjohn1 For This Useful Post:

    Cyan (29th December 2017)

  3. #2 Cyan (Member, France)
    How is speed measured?

  4. #3 Bulat Ziganshin (Programmer, Uzbekistan)
    A 900 KB file is too small for measuring such a fast algorithm.

    Can you try PGO-directed compilation? In my experiments with SREP it was worth only a few percent of improvement, but there are probably programs where the improvement is much higher.

  5. #4 smjohn1 (Member, USA)
    Speed is measured within LZ4IO_compressFilename_extRess():

    clock_t const clockStart = clock();

    srcFile = LZ4IO_openSrcFile(srcFileName);
    /* ... compression loop elided ... */
    if (g_removeSrcFile) {  /* remove source file: --rm */
        if (remove(srcFileName))
            EXM_THROW(40, "Remove error : %s: %s", srcFileName, strerror(errno));
    }

    clockEnd = clock();
    if (clockEnd == clockStart) clockEnd += 1;  /* avoid division by zero (speed) */
    filesize += !filesize;                      /* avoid division by zero (ratio) */

    And similarly within LZ4IO_decompressSrcFile().

    I know they include file I/O time, but the data is small and I just wanted to get a rough idea ...

    ....




    Quote Originally Posted by Cyan View Post
    How is speed measured?

  6. #5 smjohn1 (Member, USA)
    I tested with larger data (around 1 GB) and did see speed improve to around 480 MB/s for all compilers, but still didn't see obvious differences.

    I tend not to use PGO-directed compilation: in a file-system application the incoming data is very generic, and there is no way to know its profile in advance.
    ( An actual processing block would be quite small, around 4 KB to 8 KB, and any data is possible. ) So profiling might hurt on most data.

    Quote Originally Posted by Bulat Ziganshin View Post
    A 900 KB file is too small for measuring such a fast algorithm.

    Can you try PGO-directed compilation? In my experiments with SREP it was worth only a few percent of improvement, but there are probably programs where the improvement is much higher.

  7. #6 Cyan (Member, France)
    LZ4 belongs to the "faster than I/O" category.
    By measuring timing at the FILE* I/O level, speed is likely bottlenecked by other parts of the system, such as FS caching, SSD speed, the libc runtime implementation, etc. It means you are unlikely to spot any difference between compilers, which seems to be your objective, since performance is throttled.

    LZ4 is commonly used to compress data between 2 memory regions.
    Fortunately, the LZ4 CLI includes a benchmark module, which tests direct mem-to-mem compression and decompression speed, avoiding I/O limitations. It can be triggered with the command switch `-b`:
    Code:
    lz4 -b file

  8. The Following User Says Thank You to Cyan For This Useful Post:

    smjohn1 (2nd January 2018)

  9. #7 smjohn1 (Member, USA)
    Brief tests again with the same 900 KB pdf file gave the following results.

    Level 1 (-b1):  Comp.          Decomp.
    gcc             1800.1 MB/s    8931.2 MB/s
    gcc -mavx       2050.4 MB/s    10096.1 MB/s
    icpc            1782.8 MB/s    8762.7 MB/s

    Level 9 (-b9):  Comp.          Decomp.
    gcc             36.3 MB/s      8444.0 MB/s
    gcc -mavx       36.5 MB/s      9575.7 MB/s
    icpc            35.9 MB/s      8293.3 MB/s

    So it seems using SSE instructions does help, while Intel's icpc compiler actually hurts in this case. Interesting!




    Quote Originally Posted by Cyan View Post
    LZ4 belongs to the "faster than IO" category.


    Code:
    lz4 -b file

  10. The Following User Says Thank You to smjohn1 For This Useful Post:

    Cyan (2nd January 2018)

  11. #8 Cyan (Member, France)
    This conclusion might depend on the file being benchmarked.
    The pdf is likely poorly compressible, resulting in extremely fast speeds, where a limited usage of vector instructions can be beneficial.
    A more compressible source will trigger a different pattern.

  12. #9 JamesB (Member, Cambridge, UK)
    I see avx2 slowing it down here.

    CC=gcc7 CFLAGS="-O3 -march=core-avx2":
    1#_.bam : 29027673 -> 17671225 (1.643), 529.5 MB/s ,2302.9 MB/s

    CC=gcc7 CFLAGS="-O3 -march=corei7"
    1#_.bam : 29027673 -> 17671225 (1.643), 551.0 MB/s ,2574.1 MB/s

    CC=icc15 CFLAGS="-O3 -march=core-avx2":
    1#_.bam : 29027673 -> 17671225 (1.643), 324.2 MB/s ,2311.5 MB/s

    CC=icc15 CFLAGS="-O3 -march=corei7":
    1#_.bam : 29027673 -> 17671225 (1.643), 329.0 MB/s ,2354.8 MB/s

    CC=clang50 CFLAGS="-O3 -march=core-avx2":
    1#_.bam : 29027673 -> 17671225 (1.643), 521.9 MB/s ,2494.9 MB/s

    CC=clang50 CFLAGS="-O3 -march=corei7":
    1#_.bam : 29027673 -> 17671225 (1.643), 529.2 MB/s ,2252.0 MB/s


    Big win for gcc over icc as well. Input is an uncompressed "BAM" file: DNA sequence alignments stored in a binary form. Some compressibility, as you can see, but nothing huge. Order-1 encoding gets it to 13536727 bytes and order-0 to 20782156 bytes, so clearly plenty more is available. Gzip defaults give 11790834, bzip2 9846734, and xz 9726168.
    Last edited by JamesB; 2nd January 2018 at 16:52. Reason: Added clang5.0 results

  13. #10 smjohn1 (Member, USA)
    That seems to be somewhat true, and the results are more interesting.

    This time I just used lz4io.c as a test file, and the results are:

    Level 1:

    lz4.gcc -b lz4io.c :  42820 -> 18053 (2.372), 486.6 MB/s ,1946.4 MB/s
    lz4.avx -b lz4io.c :  42820 -> 18053 (2.372), 503.8 MB/s ,1784.2 MB/s
    lz4.icpc -b lz4io.c : 42820 -> 18053 (2.372), 470.5 MB/s ,1784.2 MB/s

    So it seems avx still holds some advantage in compression, but not in decompression.

    Level 9:

    lz4.gcc -b9 lz4io.c :  42820 -> 13487 (3.175), 35.0 MB/s ,2854.7 MB/s
    lz4.avx -b9 lz4io.c :  42820 -> 13487 (3.175), 35.0 MB/s ,2253.7 MB/s
    lz4.icpc -b9 lz4io.c : 42820 -> 13487 (3.175), 33.0 MB/s ,2676.2 MB/s

    So at this level avx has no advantage, and is even worse than icpc.

    I am puzzled why icpc (and avx) hurt, without knowing the internals of the lz4 implementation. Perhaps Cyan could explain a bit.


    Quote Originally Posted by Cyan View Post
    This conclusion might depend on the file being benchmarked.
    The pdf is likely poorly compressible, resulting in extremely fast speeds, where a limited usage of vector instructions can be beneficial.
    A more compressible source will trigger a different pattern.

  14. #11 Cyan (Member, France)
    It's less about the internals of LZ4 than about the internals of compilers with automatic generation of vector instructions.

    There is no vector instruction within the LZ4 source code.
    But the compiler is free to transform some normal code into a set of vector instructions if it believes that's beneficial.
    The problem is that this is a bet, which pays off when the amount of work is sizable, and is detrimental when that amount is low.

    The most likely target of automatic vectorization is the memory copy operation, as it's the most common and easiest one.
    Nonetheless, ensuring all safety conditions are met adds a non-negligible set of test instructions before the main loop is triggered.

    When data is barely compressible, there are probably some large segments of memory to copy, so a fast internal loop using SSE or AVX registers is probably beneficial, and will offset the cost of the test stage.
    When the amount of data to copy is smaller, the test stage becomes dominant, and actually slows down operations.

    A good example can be generated using godbolt:
    https://godbolt.org/g/MXnmtz

    This example just compares a small "copy by 8 bytes" function, with and without auto-vectorization.
    The vectorized version is much larger.
    This will likely cause performance issues with the I-cache, and obviously represents more instructions to run.
    The internal loop itself is probably much faster.
    It's just likely a bad idea if only 8 bytes end up being copied after all.

  15. The Following 3 Users Say Thank You to Cyan For This Useful Post:

    JamesB (4th January 2018),schnaader (3rd January 2018),smjohn1 (3rd January 2018)

  16. #12 smjohn1 (Member, USA)
    Ha, I just finished reading your "old" blog post explaining LZ4 ( http://fastcompression.blogspot.in/2...explained.html ) ( a very nice one! ). At the end of the comments, T. M. discussed an SSE implementation of the decoder. Is that decoder implementation publicly available? Or is it already in your current implementation? ( I should check your source code myself, but it is much easier just to ask. ) If not, why? I can run tests to see the compiler impact on an SSE-aware implementation of the decoder. If it does improve performance a lot, it would be really beneficial in further increasing LZ4's impact.

    Quote Originally Posted by Cyan View Post
    It's less about the internals of LZ4 than about the internals of compilers with automatic generation of vector instructions.

