
Thread: Consistent resolution of ties in LZ4

  1. #1
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts

    Consistent resolution of ties in LZ4

    While working with emarty on another project, which I will write about separately, we noticed that we could substantially improve decompression speed by strategically resolving ties. More specifically, emarty wrote a byte-aligned compressor similar in style to LZ4. Its nearly optimal parser deals with a fairly large number of ties, i.e. alternative encodings that do not affect the size of the compressed output. Our analysis of the data format showed that decompressing literals tends to require less processing, which suggested that preferring literals in the compressor could improve decompression speed. Working with a particular decompressor, we achieved an overall improvement in decompression speed on the order of 5-7%.

    Of course, this involved working with a particular decompressor. If we had analysed every code path of that decompressor and identified the specific cost of each, we could probably have gained even more decompression speed. However, we did not do that. Instead, emarty came up with a useful heuristic: the average length of literal runs and matches correlates inversely with the number of compression tokens used, so he designed an algorithm that resolves ties in a way that simply reduces the number of tokens. To see how useful this can be, emarty wrote his own nearly optimal compressor for LZ4 that uses this method of tie resolution. His work is available here: https://github.com/emmanuel-marty/lz4ultra
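    To make the idea concrete, here is a minimal sketch in C++ (my own illustration under simplified assumptions, not emarty's actual algorithm): a shortest-path style parser carries the token count as a secondary cost and consults it only when two encodings have exactly the same compressed size, so the tie-breaking can never cost a single byte of ratio.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // Cost of encoding everything from a given position to the end of the block.
    struct Cost {
        uint64_t bytes  = std::numeric_limits<uint64_t>::max();  // compressed size
        uint64_t tokens = std::numeric_limits<uint64_t>::max();  // number of tokens

        // Lexicographic order: compressed size first, token count only breaks ties.
        bool betterThan(const Cost& o) const {
            return bytes < o.bytes || (bytes == o.bytes && tokens < o.tokens);
        }
    };

    // Relaxation step of a backward parse: a candidate encoding at `pos` jumps to
    // `next` (already finalized) and adds `encBytes` of output and `encTokens` tokens.
    void relax(std::vector<Cost>& best, std::size_t pos, std::size_t next,
               uint64_t encBytes, uint64_t encTokens)
    {
        if (best[next].bytes == std::numeric_limits<uint64_t>::max()) return;  // unreachable
        const Cost cand{ best[next].bytes + encBytes, best[next].tokens + encTokens };
        if (cand.betterThan(best[pos])) best[pos] = cand;
    }
    In an LZ4-style format the number of tokens is essentially the number of literal-run/match pairs, so minimizing it is the same as minimizing mode switches in the decoder.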

    For fair testing, I compressed a set of small files (all files < 64K, see more about my corpus here: https://encode.ru/threads/3001-State...ll=1#post58588 ) using several available optimal compressors for LZ4 and measured the precise decompression speed of each using my speed-optimized decompressor. This is what I observed:
    Code:
    					compression	decompression
    					ratio, %	time per byte
    
    
    lz4 -19 (ver.1.8.3)			58.47%		32.81
    lz4 -19 --favor-decSpeed (ver.1.8.3)	59.26%		32.59
    lz4x -9 (ver.1.60)			58.47%		32.58
    smallz4 -9 (ver.1.3)			58.47%		32.58
    lz4ultra (ver.1.0.2)			58.50%		32.33
    (the units of time are Z80 t-states; lower is better).

    So, these are my observations about the results of these tests:
    1. LZ4 does not seem to resolve ties in a consistent fashion.
    2. LZ4X and Smallz4 generate data that appears to have partial tie resolution, so it decompresses about 0.7% faster with my decompressor.
    3. emarty's compressor LZ4Ultra generates very slightly larger files than optimal (very close to LZ4 -19). However, the data compressed using LZ4Ultra is about 1.5% faster to decompress in our tests.
    4. Interestingly, if we use the LZ4 option --favor-decSpeed, we get a similar number of tokens to LZ4X or Smallz4, but a significantly lower compression ratio.

    Is this win substantial? I am guessing the answer depends on your application, but, quite importantly, it comes "free", in the sense that it does not require you to compromise on compression ratio. It would be interesting to know what you think about this style of optimizing compressed data: would you consider doing something similar in your projects (and if not, why not)? It would also be very interesting to know if you could do some precise measurements of decompression times for data packed using these optimal compressors, as well as emarty's compressor, to see how much of a difference this makes for decompressors other than mine.
    Last edited by introspec; 12th April 2019 at 21:02. Reason: Included results from a bigger set of measurement than I originally had. Updated observations

  2. The Following 4 Users Say Thank You to introspec For This Useful Post:

    Bulat Ziganshin (10th April 2019),Cyan (10th April 2019),encode (10th April 2019),jibz (10th April 2019)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I think we can generalize this by adding a second metric to the parsing optimization: an estimate of decoding time.
    For a simple format like LZ4 it can be made fairly precise, even accounting for things like cache lines, pages, caching logic etc.
    For example, on my cpu L2 is 16-way, so a repeating pattern of 17+ matches targeting the same lines in different pages would force it to always miss.

    Then we can use a hybrid metric as optimization target, maybe csize/nsp+tdec
    (csize=compressed size, nsp=network speed, tdec=decoding time).
    The idea is that when we need to send a file over the network, compressing it may reduce the transfer time,
    but a smaller compressed size may not compensate for slower decoding.
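    A tiny sketch of that hybrid target, using the names above (the numbers below are made up for illustration):
    Code:
    #include <cstdio>

    // Total delivery time: transfer the compressed file, then decode it.
    // Minimizing this instead of csize alone trades ratio against decode speed.
    double deliveryTime(double csizeBytes, double netBytesPerSec, double tdecSeconds)
    {
        return csizeBytes / netBytesPerSec + tdecSeconds;   // csize/nsp + tdec
    }

    int main()
    {
        // Hypothetical numbers: the slightly larger but faster-to-decode encoding
        // wins once the network is fast enough.
        std::printf("smaller file : %.3f s\n", deliveryTime(41.9e6, 100e6, 0.40));
        std::printf("faster decode: %.3f s\n", deliveryTime(42.3e6, 100e6, 0.33));
    }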

    I think oodle codecs already do something like this.

  4. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    LZ4 v1.8.2 introduced `--favor-decSpeed` (also available at API level), which generates output more favorable to decompression speed, at a fairly low compression ratio cost. There are similarities, as it also tends to favor longer sequences. But there are probably differences too. It could be an interesting reference point to compare to.

  5. The Following User Says Thank You to Cyan For This Useful Post:

    JamesB (11th April 2019)

  6. #4
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @Shelwien, yes, absolutely, in situations where decompression speed must be optimized, all these tricks can make quite a difference.
    I do remember Charles Bloom talking about using some of these techniques in Oodle, although I do not recall him saying anything very specific about it.

    However, our focus was somewhat different. We were trying to see how much speed we could squeeze out of a specific compression format without compromising on the compression ratio. Once further decompressor optimization became impossible, we turned to thinking about ties as a resource for further increases in speed. What surprised us was how much of a difference it made in our application.

    @Cyan, very interesting, I did not know that. In my small set of tests I am seeing that LZ4 v.1.8.2 --favor-decSpeed has a worse compression ratio than LZ4Ultra (up to 0.5% loss in compression ratio), while producing data that is about 0.4% faster to decompress. In my mind, tie resolution on its own gets about two thirds of the way from the original optimal data to the --favor-decSpeed results. Not bad, given that it does not compromise the compression ratio.

  7. #5
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    Just wanted to add a couple of quotations from prior related work by Charles Bloom:
    With byte-wise coders you have something funny in the optimal parser that you don't run into much with huffman or arithmetic coders : *ties*. That is, there are frequently many ways to code that have exactly the same code length. (in fact it's not uncommon for *all* the coding choices at a given position to produce the same total length). You might think ties don't matter but in fact they do. One way you can break a tie is to favor speed; eg. break the tie by picking the encoding that decodes the fastest. But beyond that if your format has some feedback, the tie is important.
    (see http://cbloomrants.blogspot.com/2012...-on-lznib.html ). I am pretty sure I had read this old post before.

    However, I just noticed there is a new post with much more details on the topic:
    Making good choices in the encoder can have an absolutely huge effect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie. One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed. To good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible, you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches.
    (see http://cbloomrants.blogspot.com/2019...algorithm.html ).

  8. The Following 2 Users Say Thank You to introspec For This Useful Post:

    Bulat Ziganshin (11th April 2019),Shelwien (10th April 2019)

  9. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I'm just saying that a metric like the number of tokens is too crude for this.
    Since you already have parsing optimizer anyway, why not take into account all the things that affect speed?
    In fact, there's a simple way to test this too - just time actual decoding from start to current point with TSC (min-of-N or something)
    and combine it with compressed size for token price.
    If you only want to resolve ties, time has to be taken as a "fractional part" of compressed size.
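    If I read the suggestion correctly, the price could be sketched like this (my own interpretation, with an arbitrary normalization constant):
    Code:
    #include <algorithm>

    // Price of a parsing decision: compressed size in whole bytes, plus the measured
    // decode time squeezed into [0, 1) so it can only separate encodings of exactly
    // equal size and never outweigh a single byte of ratio.
    double tokenPrice(unsigned sizeBytes, double decodeTicks, double worstCaseTicks)
    {
        const double frac = std::min(decodeTicks / worstCaseTicks, 0.999);
        return static_cast<double>(sizeBytes) + frac;
    }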

  10. #7
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    45
    Thanks
    14
    Thanked 71 Times in 30 Posts
    Can you go into more detail about how you measured the decompression speed?

    I compressed enwik8 and one of my debugging tools showed the following statistics (max. compression level):

    Code:
    lz4ultra 1.0.0         41928203 bytes, 10826799 tokens
    lz4 1.8.3              41913377 bytes, 11025738 tokens
    lz4 --favor-decSpeed   42298381 bytes, 10547798 tokens
    smallz4 1.3            41913368 bytes, 10753324 tokens
    When decompressing all these files with the same program (Yann's lz4 1.8.3 compiled with GCC 8, x64), neither wall clock time nor perf stat showed any significant speed advantage of lz4ultra.

    PS: I'm pretty sure we had a long thread in this forum about measuring performance, too.

  11. The Following User Says Thank You to stbrumme For This Useful Post:

    introspec (11th April 2019)

  12. #8
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @Shelwien, basically, yes, more complex tuning can be done. However, the more complex the tuning gets, the more decompressor-specific it is likely to become. I was hoping to see whether simpler, more generic heuristics could be used to improve decompression times more or less uniformly. For example, we could see that changing the minimum match length had a very significant effect on decompression speed, so a possible heuristic recommendation for faster-to-decompress output could be to avoid matches that are too short.
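    As an illustration of that last heuristic (a sketch under my own assumptions, not a rule we have validated beyond our format), the parser could simply refuse short matches unless they strictly beat the literal encoding:
    Code:
    // Keep a candidate match only if it is reasonably long, or if encoding it
    // really saves bytes compared with emitting the same span as literals.
    // `fastMinLen` is a hypothetical tuning knob (LZ4's hard minimum match is 4).
    bool keepMatch(int matchLen, int matchCostBytes, int literalCostBytes, int fastMinLen = 6)
    {
        if (matchLen >= fastMinLen) return true;       // long matches are always fine
        return matchCostBytes < literalCostBytes;      // a short match must strictly win
    }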

    @stbrumme, I am targeting 8-bit platforms and my decompressor is written in Z80 assembly. The numbers I gave are actually the numbers of Z80 clocks measured in a specialized emulator.

    For all intents and purposes, they are exact times; there are no hardware effects to overcomplicate things. Effectively, every combination of token parameters corresponds to an exact time; for a specific file these times simply add up, and the emulator measures the total. How precisely this corresponds to performance on real hardware depends on the hardware. On many ZX Spectrum clones the actual run times would be exactly as in the emulator. On some original models the performance depends on which memory is used, so it may not be 100% precise there. Of course, on a PC you are affected by cache misses, branch mispredictions and other complications which make similarly precise timing impossible. This is why I really wanted to know whether simply reducing the number of tokens would be perceptible during decompression on a PC.

    I originally struggled to reproduce stbrumme's results; however, after ensuring that all my executables had the correct versions and command-line arguments, I am seeing the following:
    Code:
    					compressed	no. of tokens
    lz4ultra (ver.1.0.0)			41,928,203	10,826,799	(9.23634 bytes/token)
    lz4ultra (ver.1.0.1)			41,927,804	10,825,609	(9.23736 bytes/token)
    lz4ultra -B7 (ver.1.0.2)		41,913,406	10,824,901	(9.23796 bytes/token)
    
    
    lz4 -19 -BD --no-frame-crc (ver.1.8.2)	41,913,377	11,025,738	(9.06969 bytes/token)
    lz4 -19 -BD --no-frame-crc (ver.1.8.3)	41,913,377	11,025,738	(9.06969 bytes/token)	- basically, identical to 1.8.2
    lz4 -19 -BD --no-frame-crc --favor-decSpeed (ver.1.8.3)
    					42,298,381	10,547,798	(9.48065 bytes/token)
    
    
    smallz4 -9 (ver.1.3)			41,913,367	10,753,325	(9.29945 bytes/token)
    I am unsure why lz4ultra produces quite so many tokens on enwik8 (in my tests on small files < 64K, lz4ultra tends to produce far fewer tokens than smallz4). We will be looking into this together with emarty.

    How does smallz4 end up with a preference for fewer tokens? Do you already have a mechanism for consistent resolution of ties in place?
    Last edited by introspec; 12th April 2019 at 20:37. Reason: updated measurement results to correct values; added a brief discussion of them

  13. #9
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    BTW, I forgot to mention that the somewhat reduced compression ratio of lz4ultra -9 is caused by the forced use of -B4, i.e. block sizes are always kept at 64K. If this restriction were relaxed, the compression ratio would get significantly closer to that of the optimal parsers. It did not matter to us for the small files we were targeting.

  14. #10
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    45
    Thanks
    14
    Thanked 71 Times in 30 Posts
    The full lz4 parameters:
    lz4 -19 -BD --no-frame-crc (and optionally --favor-decSpeed )

    -BD matches across block boundaries
    --no-frame-crc skips the xxHash checksum (doesn't affect the number of tokens)

    Today I noticed that my smallz4 program wasn't v1.3 but an internal testing version which produces a file that's one byte larger than v1.3 (and one token more than v1.3).

    lz4ultra was run without -B4

  15. #11
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @stbrumme, my apologies, I realized that I messed up some of my results (I have several executables, including some preliminary versions with bugs).
    I will re-do all my tests just to be sure and will update the posts accordingly; will write here once this is done.

  16. #12
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @stbrumme, OK, I updated both the main post and the post with my tests on enwik8. You are correct that they do not fully agree. We are currently trying to understand why this is so.

    emarty updated his compressor to deal with bigger compressed block sizes; this helped us verify that his compression ratio is only marginally lower than the ratio produced by lz4 -19 or smallz4 -9.
    At the same time, Smallz4 produces substantially fewer tokens on enwik8, so we would expect it to produce data that is faster to decompress. Of course, the difference is under 1% and I am not sure how hard it would be to see the impact of this difference on a PC.

    Do you have some mechanism for consistent tie resolution in place?

    At the same time, for the small files (<64K) that I was targeting originally I am consistently seeing lz4ultra produce fewer tokens than smallz4 or lz4x. This is why I re-ran the tests on a larger set of small files, just to confirm it. The reasons for this are not clear to us at present.

  17. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Btw, I remembered about this trick: https://encode.ru/threads/1135-Code-...s-LZ77-Decoder
    On x86 it certainly helps (in a certain range of compression ratios); hard to say about Z80, but it may be worth testing?

    And here's another idea: what if we'd compress to executable format right away?

  18. #14
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    45
    Thanks
    14
    Thanked 71 Times in 30 Posts
    smallz4 doesn't perform any kind of token reduction. Its only goal is to find the smallest LZ4 output.
    However, there are often several "equivalent" encodings (= have the same optimal size) to choose from - in these cases smallz4 always prefers the one which starts with the longest match.
    Basically at every step I analyze all matches and check if currentCost <= minCost. The equal-sign is important because currentCost < minCost (without the equal-sign) sometimes produces a non-optimal encoding. I have a lengthy comment about that issue at line 400 of smallz4.cpp, see https://create.stephan-brumme.com/smallz4/ .

    The reduced number of tokens is just a side-effect.
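    For reference, here is a condensed sketch of that comparison (simplified by me, not a verbatim excerpt from smallz4.cpp): in this simplification candidate lengths are tried in increasing order, so with `<=` the longest match of equal cost is the one that survives, which is where the preference for long matches comes from.
    Code:
    #include <cstddef>
    #include <vector>

    struct Choice { unsigned cost; int length; };   // length == 0 means "nothing chosen yet"

    // costFrom[i] = optimal cost of encoding everything from position i to the end.
    Choice pickMatch(const std::vector<unsigned>& costFrom, std::size_t position,
                     int longestMatch, unsigned (*matchCost)(int length))
    {
        const int MIN_MATCH = 4;                  // LZ4's minimum match length
        Choice best{ ~0u, 0 };
        for (int len = MIN_MATCH; len <= longestMatch; ++len) {
            const unsigned cur = costFrom[position + len] + matchCost(len);
            if (cur <= best.cost)                 // '<=': an equal-cost longer match overwrites
                best = { cur, len };
        }
        return best;
    }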

  19. The Following User Says Thank You to stbrumme For This Useful Post:

    introspec (13th April 2019)

  20. #15
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    Quote Originally Posted by stbrumme View Post
    Basically at every step I analyze all matches and check if currentCost <= minCost. The equal-sign is important because currentCost < minCost (without the equal-sign) sometimes produces a non-optimal encoding.
    Strictly speaking, to get optimal encoding you only have to choose the match if the choice is between a literal and a match with the same cost (see for instance the comparison in blz4). Using <= in the comparison includes this condition.
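    Put as code (a sketch of my own, not blz4's actual source), the narrower rule is strict comparison between matches, with a match still winning a tie against the literal path; using `<=` everywhere, as smallz4 does, covers this case too:
    Code:
    #include <vector>

    struct Option { unsigned cost; int matchLen; };   // matchLen == 0 means "emit a literal here"

    Option choose(Option literal, const std::vector<Option>& matches)
    {
        Option best = literal;
        for (const Option& m : matches) {
            const bool beatsLiteral = (best.matchLen == 0 && m.cost <= best.cost); // tie vs literal -> match
            const bool beatsMatch   = (best.matchLen != 0 && m.cost <  best.cost); // strict between matches
            if (beatsLiteral || beatsMatch) best = m;
        }
        return best;
    }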

  21. #16
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    I've integrated lz4ultra and smallz4 into the TurboBench compression benchmark for testing.
    "In Memory" benchmarking is more precise than when file I/O is involved.


    Enwik8 with TurboBench: - Sat Apr 13 14:18:33 2019.


    CPU: Skylake i7-6700 at 3.4 GHz, gcc 8.2


    Code:
         C Size   ratio%     C MB/s     D MB/s   Name    (MB=1.000.0000)
        41913144    41.9      13.79    2862.79   lz4 12          
        41913144    41.9      13.78    2150.49   lz4ultra       (compression with lz4,12 and decompression with lz4ultra)
        41913371    41.9       5.18     563.67   smallz4        (decompression with unlz4 from smallz4cat.c)
        42298150    42.3      14.03    3264.35   lz4 12s       (favorDecompressionSpeed)

    LZ4ULTRA: Compression results in a compare error at offset 65536 after decompression.
    Only lz4ultra decompression is tested. On the PC, lz4ultra is 25% to 50% slower than lz4.


    SMALLZ4: We get a compare error when using the lz4 function "LZ4_decompress_safe" on the smallz4 compressed stream.
    An incompatibility with the in-memory functions?
    In my tests, lz4,12 compresses better than smallz4.


    For lz77 optimal parsing, optimizing for speed is generally done by optimizing the memory access patterns.

    For processors with 64K L1 caches (Intel, AMD, ...), this optimization is unlikely to bring a significant speed boost for lz4,
    because LZ offsets are already limited to 64K.
    Last edited by dnd; 17th April 2019 at 13:20.

  22. #17
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @dnd, very interesting, but you answered the reverse question. It is clear from your test that LZ4's decompressor is the fastest available. However, the interesting test would be: could you please compress the data with each of the three (lz4 -19, lz4ultra, smallz4 -9) and measure the decompression speed of each of the three when decompressed with the LZ4 decompressor? That would test the "number of tokens" hypothesis.

  23. #18
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Actually this test is not possible, because lz4ultra compression results in a compare error and the smallz4 compressed stream cannot be decompressed with lz4 without errors (see my previous post).
    I've added a new lz4,12s entry with the favorDecompressionSpeed option.

    You might have a look at the code corresponding to the lz4ultra compression in the file "plugins.cc".
    For your small files case, it might be better to consider adding dictionary support to lz4ultra.

  24. #19
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    I've managed to benchmark your testset_full corpus.
    Maximum possible compression level used. Decompression is done with lz4 only.
    No errors encountered for lz4ultra with this small files corpus.
    We are still getting a compare error on the last bytes for smallz4.

    Note that 12 is the maximum level possible for lz4.

    As you can see, there is no speed difference in decompression for this corpus.

    Code:
          C Size  ratio%    C MB/s    D MB/s   Name                        (bold = pareto) MB=1.000.0000
          721509    58.5      1.89   3980.63   smallz4 9z       (smallz4 compression, lz4 decompression)
          721511    58.5     19.81   3942.48   lz4 12          
          721867    58.5      5.33   3980.63   lz4ultra 12uz    (lz4ultra compression, lz4 decompression)
          731327    59.3     18.97   4183.03   lz4 12f          (favorDecompressionSpeed)
    Last edited by dnd; 16th April 2019 at 17:33.

  25. The Following User Says Thank You to dnd For This Useful Post:

    introspec (14th April 2019)

  26. #20
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    Well, actually, there is about a 1% difference in decompression times, which is in the ballpark I expected.
    Of course, with such a small set of data the accuracy of your measurement may suffer.

  27. #21
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    @stbrumme, I recompiled smallz4 with a strict inequality instead of the non-strict one in line 400 and saw a 1.5% reduction in the number of tokens (using enwik8 as my test input).
    Thank you for pointing out this heuristic; it clearly reduces the number of tokens while not harming the compression ratio too much.
    However, this does not explain why smallz4 has fewer tokens as is. There must be another heuristic there, which is likely to be distinct from what LZ4 or BLZ4 does.

    @Shelwien, I read the other posts you referred to. My feeling about this is as follows. The idea of using code itself as a machine that generates the decompressed data is not new. On the demoscene this is usually called "crunching", with the process of decompression called "decrunching". It is quite powerful, as it sometimes allows efficient generation of large amounts of fairly complex data. The snag is, these programs, "decrunchers", are always written by hand. It is clear that, say, assembly is Turing-complete, so a vast variety of data can be generated. Of course, if you were to attempt automating this process, your task becomes a bit like analysing some data and then coming up with an algorithm that generates it. If you found a way to do this even approximately well, you would effectively have a way to approximate Kolmogorov complexity. I.e., an interesting problem, but very likely to be intractable.

    The other trick, with two-stage generation of intermediate code that then generates the actual decompressed data... Not very likely to be universally helpful on an 8-bit machine, except in some very specialized applications. We are already decompressing data at 1.5 times the speed of copying. So the process of generating code and then doing more copying is very likely to be too slow to impress anyone. However, I can imagine that if we have a particular piece of data that needs to be stored compactly, but then decompressed extremely quickly, generating an intermediate representation could be a way to achieve that.

  28. The Following User Says Thank You to introspec For This Useful Post:

    jibz (14th April 2019)

  29. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The idea of using code itself as a machine that generates decompressed data is not new.

    I was talking about LZ-specific code generation. x86 has a 2-byte "REP MOVSB" instruction that can copy a match.
    So an example for generation of "abracadabra" string could look like this:
    ; init
    8B F1 mov esi,ecx
    8B FA mov edi,edx
    33 C9 xor ecx,ecx

    ; copy 7 literals
    B1 07 mov cl,7
    F3 A4 rep movsb

    ; copy len=4;dist=7 match
    96 xchg esi,eax
    B1 04 mov cl,4
    8D 77 F9 lea esi,[edi-7]
    F3 A4 rep movsb
    96 xchg esi,eax

    ; copy 1 literal
    A4 movsb

    C3 ret


    Of course, like this it would only make sense with len>11 matches.
    Though most of match-copy code could be replaced with a 2-byte "CALL reg" instruction.

    Support for additional features like RLE or delta-matches, or specific handling for matches of specific lengths,
    could also be implemented easily enough.

    > The other trick, with two-stage generation of intermediate code, that will then generate actual decompressed data...
    > Not very likely to be universally helpful on 8-bitter, unless in some very specialized applications.

    The main idea is to make use of the CPU's instruction parser.
    Normally an LZ decoder has to juggle 3 memory locations (compressed data, match source, output),
    so the single loop with 3 is split into two loops with 2 each (plus the CPU's code fetch, which has dedicated hardware).
    On x86 it certainly helps due to the L1 code cache.
    But I'm also not sure that it won't work on Z80, since it might also benefit from fewer memory locations -
    like reduced memory bank switching or something.

  30. #23
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    45
    Thanks
    14
    Thanked 71 Times in 30 Posts
    @dnd:
    TurboBench doesn't produce proper LZ4 files (frame format) with its in-memory compression - it's only generating simple LZ4 blocks and thus has less overhead.
    Using the command line lz4 -12 -BD --no-frame-crc, the aforementioned enwik8 file shrinks to 41913377 bytes (lz4 1.8.3, a "true" lz4 file), not 41913144 (TurboBench output, just lz4 blocks).
    The difference of 233 bytes is simply caused by LZ4 file format overhead.
    Compared to lz4, smallz4 saves one byte in that special case.

    And when TurboBench tries to decompress smallz4's "file" output, it assumes the data is stored as simple LZ4 blocks, which obviously fails because the LZ4 frame overhead causes an error.

    But we're moving away from the original topic ...

    Wall clock time is almost identical for the files produced by various LZ4 engines. Any deviation seems to be within the margin of error.
    However, the instruction count is significantly different (Core i7 2600) when running lz4 -d on smallz4's, lz4's and lz4ultra's enwik8 version:
    perf stat -d lz4 -d enwik8.lz4 > /dev/null

    smallz4  :   835.285.849 instructions
    lz4      : 1.002.452.799 instructions
    lz4ultra : 1.078.765.611 instructions

    At first I thought that something was confusing perf, but valgrind/callgrind shows a similar pattern:
    valgrind --tool=callgrind lz4 -d enwik8.lz4 > /dev/null

    smallz4  :   841.688.003
    lz4      : 1.004.007.358
    lz4ultra : 1.082.068.246

    (callgrind's numbers are always (!) identical between runs, while perf stat's numbers change a little every time)

    I started digging a bit deeper into perf stat's enwik9 numbers (enwik8 decompression is just too fast for a reasonable analysis).
    Comparing Yann's LZ4 against my smallz4:
    - smallz4 causes about 3% fewer branches but has 5% more branch misses
    - smallz4 has 2% fewer L1 and 10% (!) fewer LLC loads
    - smallz4 produces >5% more stalled cycles

    Quite frankly, I was expecting all numbers to be almost identical for LZ4, smallz4 and/or lz4ultra, but there seems to be a statistical pattern.
    Last edited by stbrumme; 16th April 2019 at 11:03. Reason: a few more numbers

  31. The Following User Says Thank You to stbrumme For This Useful Post:

    introspec (16th April 2019)

  32. #24
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    I've manually removed the frame headers in "smallz4.h" and now it works without errors for this small-files benchmark (decompression with lz4 only).
    I've also included lz4 with the option favorDecompressionSpeed (see the benchmark above).

  33. #25
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    45
    Thanks
    14
    Thanked 71 Times in 30 Posts
    Today I played with another idea to improve LZ4 decompression speed:
    a side-effect of optimal LZ4 compression is that sometimes the length of the longest match has to be reduced.

    Let's say your data looks as follows:
    abcde...abcd...abcde... (where dots are random bytes)

    The third group "abcde" matches the first group "abcde", it's a match of 5 bytes.
    Let's assume optimal parsing finds that - depending on other bytes, not shown here - it's more efficient to use only the first 4 bytes of that 5-byte match.
    In that case the third group "abcde" could match the second group "abcd" as well since the first four bytes are obviously identical.

    And now my idea:
    If the length of your longest match is reduced in the optimal parsing phase, then look for a closer, more recent match.
    Thus we have a chance that the decompressor still has that closer match in a hot CPU cache (whereas the old, far away match is already gone) which avoids expensive memory fetches.
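    A sketch of that re-search (my own simplification, not the actual prototype): once the parse has settled on a possibly shortened length, scan the candidate matches again and keep the nearest one that still covers it.
    Code:
    #include <vector>

    struct Match { int distance; int length; };

    // Return the smallest distance among candidates that still provide `neededLength`
    // bytes (the length the optimal parse settled on), or -1 if none does - which
    // should not happen if the parse only ever shortens a match it actually found.
    int closestDistance(const std::vector<Match>& candidates, int neededLength)
    {
        int best = -1;
        for (const Match& m : candidates)
            if (m.length >= neededLength && (best < 0 || m.distance < best))
                best = m.distance;
        return best;
    }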

    A crude smallz4 prototype detects 3748 opportunities for such optimization in enwik8, the first four are:
    2858,19 => 286,18 @ 23004
    1830,21 => 1636,18 @ 25242
    3090,20 => 874,18 @ 28543
    481,19 => 150,18 @ 32734
    (old distance, non-optimal length => new distance, optimal length @ file position)

    Unfortunately, my initial results are not very encouraging: the wall time of lz4 -d is completely unaffected by my optimization and even the instruction count barely changes.
    Depending on the input file, it actually becomes worse ... enwik8 is such an input file (valgrind/callgrind):
    smallz4 1.3: 843192400 ticks
    prototype: 843192703 ticks => 303 ticks slower


    perf stat's cache counters (especially L1 misses) don't show any conclusive result either.

  34. #26
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    If the goal is to help cache locality, this trick would be expected to produce benefits when transforming a "long" offset (> 32 K) into a "short" one.
    That is, assuming the L1 cache is 32 KB, which is the case for Intel and many modern ARM CPUs. But AMD ones, for example, offer a 64 KB L1 cache, so they would not benefit from shorter distances.

    However, as you found out, while such cases exist, they are rare.
    As a consequence, they do not meaningfully impact general performance.

  35. #27
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    68
    Thanks
    31
    Thanked 22 Times in 15 Posts
    Also I think that when there is a tie literal_length_1 + match_length_1 = literal_length_2 + match_length_2 with match_length_1 > match_length_2, it should favor the second match, because if match_length_1 > 8 then there is a branch misprediction in the decoder.

  36. #28
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    For partially overlapping matches there should often be a range of possible choices of how to distribute the overlap in LZ4. If it weren't for the "wildcopy" which copies past the end anyway, it might have been beneficial to try to select either one or both of the matches to have a length that is a multiple of 4/8 (for instance choosing len 16 + len 16 would be faster than len 17 + len 15).
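    A sketch of how such a split could be chosen (my assumption, not code from any shipped encoder, and, as noted above, probably moot for LZ4 because of the wildcopy): among the valid splits, prefer one where both lengths are multiples of 8, the decoder's copy stride. With a total of 32 bytes and a split range of 15..17 for the first piece, this picks 16 + 16 rather than 17 + 15.
    Code:
    #include <utility>

    // Split `total` covered bytes between two overlapping matches; the first piece
    // may legally take anywhere from minFirst to maxFirst bytes. Prefer splits where
    // both pieces are multiples of 8, then splits where at least one of them is.
    std::pair<int, int> splitOverlap(int total, int minFirst, int maxFirst)
    {
        std::pair<int, int> best{ maxFirst, total - maxFirst };          // fallback
        for (int first = minFirst; first <= maxFirst; ++first) {
            const int second = total - first;
            if (first % 8 == 0 && second % 8 == 0) return { first, second };
            if (first % 8 == 0 || second % 8 == 0) best = { first, second };
        }
        return best;
    }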

    Regarding locality, I don't know if it might be worth checking for the special case of a match that comes "closely after" the last match, that is what would be a repeat offset match in other encodings. It would only be potentially interesting if it happened to be on the same cache line as the last match.

  37. #29
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    The latest release of LZ4 features a new decoder loop
    offering substantial decompression speed improvements (up to +20% on x64).

    The new decoder reaches this performance largely by organizing its branches differently, for predictability.

    For the purpose of this discussion, I suspect there are more gains to be expected from targeting the characteristics of the decoder,
    such as ensuring that its hot branches are as predictable as possible,
    thus only crossing the limits when it's actually worth it.

  38. The Following 2 Users Say Thank You to Cyan For This Useful Post:

    encode (26th April 2019),introspec (18th April 2019)

  39. #30
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    46
    Thanks
    36
    Thanked 18 Times in 12 Posts
    During the last few months emarty did a fair amount of work on ways to reduce the number of tokens without substantially affecting the compression ratio. The latest version of LZ4Ultra, ver.1.2.1, is much more successful at reducing the number of tokens than the earlier versions discussed in this thread. Compared to the vanilla LZ4 compressor, the new compressor sacrifices about 0.002% in ratio to save about 10% of the tokens (according to my tests on enwik9).

    I did some measurements of the number of CPU instructions, as well as CPU times, spent by LZ4 1.9.1 to decompress various variants of compressed enwik9.lz4 into nul. I did this using the uninstrumented (as shipped) 32-bit executable and Intel VTune, hence I do not think that the CPU times are particularly reliable. At the same time, both instruction counts and CPU times were measured reasonably consistently. Overall, I did 5 separate measurements for each file and computed the averages:

    [Attached image encode_ru.png: table of the averaged instruction counts and CPU times]


    I believe these results show quite convincingly that there definitely is a correlation between the number of tokens in an LZ4 file and its decompression time. Maybe this correlation is not as obvious as it was in the case of Z80 decompression, where we could easily account for each CPU cycle. However, the effect is substantial, and for files that are intended to be compressed once and decompressed often, a potential saving of up to 10% in CPU instructions (and 7-8% in CPU decompression time) is definitely not trivial.

    The news is not all good. I tried to do the same kind of test for LZ4 1.8.3 and failed miserably. Nothing added up: the decompression times were faster by a factor of about 2.5, even though LZ4 1.9.1 is supposed to be quite a bit faster. The instruction counts showed the opposite picture: the files with fewer tokens consistently took longer to decompress. Frankly, I do not know what to make of it. The most likely explanation is that LZ4 1.8.3 is somehow VTune-unfriendly, for whatever reason.

    Therefore, I would very much appreciate it if you could do more decompression speed tests. With the 10% difference in the number of tokens, the effect should be fairly pronounced and easier to observe than it was before.

    The latest version of LZ4Ultra is available as usual at https://github.com/emmanuel-marty/lz4ultra

  40. The Following User Says Thank You to introspec For This Useful Post:

    Cyan (21st June 2019)


