Thread: Average / Arithmetic Mean vs. Harmonic Mean of Compression Ratios

  1. #1 Paul W. (Member)

    Compression results for a corpus are often reported as arithmetic means of compression ratios, where the compression ratio is compressed size / uncompressed size (expressed as a percentage or in bits per byte).

    Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)

    Are compression ratios somehow an exception to this rule?

    I seem to recall reading someplace that they're not, and harmonic means of compression ratios are to be preferred, but I can't remember where. (I noticed that on the Canterbury Corpus website, they report averages and "weighted averages," but the weighted averages are not the harmonic means.)

    Or is that only true if you define compression ratio the other way (uncompressed size / compressed size), so that compressing a file to half its size would give you a "compression ratio" of 2? (That convention has always seemed more intuitive to me; the usual way seems more like an incompression ratio.)

    I suspect that's right: the usual convention essentially takes the reciprocal before averaging, and we could take the reciprocal of the result to get a compression ratio in the higher-is-better sense, so in effect the arithmetic mean of the usual ratios is the reciprocal of the harmonic mean of the inverted ones. But I don't trust my understanding well enough to be sure.
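    For concreteness, here is a quick Python check of that identity; the file sizes below are invented purely for illustration:

    # The arithmetic mean of compressed/uncompressed ratios equals the
    # reciprocal of the harmonic mean of the inverted
    # (uncompressed/compressed) ratios. Hypothetical file sizes.
    from statistics import harmonic_mean

    # (uncompressed, compressed) sizes in bytes
    files = [(100_000, 25_000), (50_000, 40_000), (200_000, 10_000)]

    ratios = [c / u for u, c in files]    # usual convention, lower is better
    inverted = [u / c for u, c in files]  # inverted convention, higher is better

    am = sum(ratios) / len(ratios)        # arithmetic mean of c/u
    hm = harmonic_mean(inverted)          # harmonic mean of u/c

    print(am, 1 / hm)                     # both print 0.3666...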


    Any guidance would be appreciated.
    Last edited by Paul W.; 17th January 2016 at 22:55.

  2. #2 Jon Sneyers (Member)
    Defining compression ratio in the usual way (compressed / uncompressed) is needed to make averages somewhat meaningful.

    Consider a corpus with one file which is 100MB of zero bytes, and another file which is 100MB of noise. Taken as a whole, the corpus can typically be compressed to about 100MB.
    The first file can be compressed to nearly 0% of its size, while the other one cannot be compressed at all (compression ratio 100%). The average is 50%, which in this case matches the compression ratio of the entire corpus (because both files have the same uncompressed size).

    If you defined compression ratio the other way around, the first file would get a "compression ratio" approaching infinity, while the noise would get a ratio of 1, so the average would be nearly infinite instead of the 2 you would expect for the corpus as a whole.
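    In Python, with idealized sizes in MB (0.001 standing in for "nearly 0"):

    # Two-file corpus: 100MB of zeros (compresses to ~0) and
    # 100MB of noise (doesn't compress at all).
    files = [(100, 0.001), (100, 100)]    # (uncompressed, compressed)

    usual = [c / u for u, c in files]     # compressed / uncompressed
    print(sum(usual) / len(usual))        # ~0.5, matches the whole corpus

    inverted = [u / c for u, c in files]  # uncompressed / compressed
    print(sum(inverted) / len(inverted))  # ~50000.5, nowhere near 2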


  3. #3 Jyrki Alakuijala (Member)
    Quote Originally Posted by Paul W.
    Compression results for a corpus are often reported as arithmetic means of compression ratios, where the compression ratio is compressed size / uncompressed size (expressed as a percentage or in bits per byte).

    Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)
    I like the geometric mean better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study

    There, you can just look at the graph and see that for about 3% of the data you can expect worse results, even when the average/median/geometric mean etc. are a lot better.
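    For anyone who wants to make that kind of graph, here is a rough matplotlib sketch of the idea (with randomly generated, hypothetical codec sizes, not the data from the study): sort the per-file relative size of codec B versus codec A, and the fraction of files where B is worse can be read straight off the plot.

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical compressed sizes for two codecs on the same 200 files.
    size_a = rng.uniform(10_000, 100_000, 200)
    size_b = size_a * rng.normal(0.85, 0.08, 200)  # B ~15% smaller on average

    rel = np.sort(size_b / size_a)                 # <1 means B wins on that file
    frac = np.arange(1, len(rel) + 1) / len(rel)

    plt.plot(frac, rel)
    plt.axhline(1.0, linestyle="--")               # files above this line got worse
    plt.xlabel("fraction of files")
    plt.ylabel("size of B relative to A")
    plt.show()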


  4. #4 thorfdbg (Member)
    Quote Originally Posted by Paul W.
    Compression results for a corpus are often reported as arithmetic means of compression ratios, where the compression ratio is compressed size / uncompressed size (expressed as a percentage or in bits per byte).

    Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead.
    I'd say it rather depends on what you want to deduce from the result. I'm not sure either of the averages makes much sense. Suppose we take a very disjoint data set of sources A, B and C and want to form an average. A sensible way of doing that is rather to consider a joint source D, the concatenation of all three sources; the purpose of an average would then be to give a theoretical estimate of, or bound on, the performance on the concatenated source. Neither average (harmonic or arithmetic) does this.

    If there is no correlation between the sources, a suitable estimate would rather be to divide the sum of the compressed sizes |c(A)|+|c(B)|+|c(C)| by the size of the concatenated file (|A|+|B|+|C|), where c is the encoder. If all source files are of equal size, this is the arithmetic average; otherwise it is a weighted average. To be precise, the weight of source file #i is the size of that file divided by the average file size in the test corpus. To put it differently, the contribution of a file to the "compression performance" of the total corpus is proportional to its size, which makes quite some sense.
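    A small Python sketch of that weighted average, with invented sizes, showing that it reduces to dividing total compressed size by total uncompressed size:

    # (name, uncompressed size |X|, compressed size |c(X)|), invented numbers
    files = [
        ("A", 1_000_000, 300_000),
        ("B",   200_000, 150_000),
        ("C", 4_000_000, 800_000),
    ]

    total_u = sum(u for _, u, _ in files)
    total_c = sum(c for _, _, c in files)
    print(total_c / total_u)            # estimated ratio for the joint source D

    # The same number as a weighted average of per-file ratios, where each
    # file's weight is proportional to its uncompressed size:
    weighted = sum((u / total_u) * (c / u) for _, u, c in files)
    print(weighted)                     # identical, ~0.2404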


  5. #5 Matt Mahoney (Expert)
    The proper way to report compression ratio is the way others have done on the same benchmark. The most common way is to add up the compressed sizes of all of the files. If you are creating a new benchmark, then you should select files such that this scoring method makes sense.

  6. #6 cbloom (Member)
    Quote Originally Posted by Jyrki Alakuijala
    I like the geometric mean better. However, my absolute favorite way of showing the difference between two algorithms is the one used in Figure 3 in https://developers.google.com/speed/...ss_alpha_study

    There, you can just look at the graph and see that for about 3% of the data you can expect worse results, even when the average/median/geometric mean etc. are a lot better.
    I agree that's a very nice way to show compression ratio differences over a large set of files.

    However, something that I haven't seen a good way to show is how to compare not just size but the space-speed tradeoff over multiple files.

    I think my "speedup" graphs are a pretty nice way to show space-speed on *one* file, but on multiple files it becomes a mess, so it doesn't show the way performance can vary across the files in a set.
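    For what it's worth, here is a minimal matplotlib sketch of one common way to show space-speed on a single file: one (speed, ratio) point per encoder setting, connected into a curve. The levels and numbers are invented, and this is not the actual "speedup" metric:

    import matplotlib.pyplot as plt

    # Hypothetical (compression speed in MB/s, compressed/uncompressed ratio)
    # for a few encoder levels on one file.
    levels = {1: (400, 0.48), 3: (250, 0.44), 6: (90, 0.40), 9: (25, 0.38)}

    speeds = [s for s, _ in levels.values()]
    ratios = [r for _, r in levels.values()]

    plt.plot(speeds, ratios, marker="o")
    for lvl, (s, r) in levels.items():
        plt.annotate(str(lvl), (s, r))    # label each point with its level
    plt.xscale("log")
    plt.xlabel("compression speed (MB/s)")
    plt.ylabel("compression ratio (compressed / uncompressed)")
    plt.show()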
    Last edited by cbloom; 3rd August 2016 at 20:01.

  7. #7 Member (France)
    If we need another dimension, maybe time can help. Graph animation?
