Compression results for a corpus are often reported as arithmetic means of compression ratios, where the compression ratio is compressed size / uncompressed size (expressed as a percentage or in bits per byte).

Is this the right thing to do? My impression is that in performance analysis, it's generally considered The Wrong Thing to average ratios that way, and that you should use the harmonic mean instead. (For example, when talking about transactions per second.)
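To make the transactions-per-second case concrete, here is a minimal sketch with made-up numbers. It assumes two runs that each process the same number of transactions; the names `transactions_per_run` and `rates_tps` are purely illustrative.

```python
from statistics import harmonic_mean, mean

# Hypothetical figures: two runs that each process the same number of transactions.
transactions_per_run = 3000
rates_tps = [100.0, 300.0]

# Time taken by each run, then the true overall throughput across both runs.
total_time = sum(transactions_per_run / rate for rate in rates_tps)
overall_tps = len(rates_tps) * transactions_per_run / total_time

print("arithmetic mean of rates:", mean(rates_tps))
print("harmonic mean of rates:  ", harmonic_mean(rates_tps))
print("overall throughput:      ", overall_tps)
```

Under that equal-work assumption, the harmonic mean matches the overall throughput and the arithmetic mean overstates it. If the runs instead took equal time, the arithmetic mean would be the matching one, so the choice depends on what is held fixed.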

Are compression ratios somehow an exception to this rule?

I seem to recall reading somewhere that they're not, and that harmonic means of compression ratios are to be preferred, but I can't remember where. (I noticed that the Canterbury Corpus website reports averages and "weighted averages," but the weighted averages are not the harmonic means.)

Or is that only true if you define compression ratio the other way (uncompressed size / compressed size), so that compressing a file to half its size would give you a "compression ratio" of 2? (That definition has always seemed more intuitive to me; the usual one seems like an incompression ratio.)

I suspect that's right: with the usual size-based definition, we're essentially taking the reciprocal of the higher-is-better ratio before averaging, and could then take the reciprocal of the result to get back a compression ratio in the higher-is-better sense. So in effect, the arithmetic mean we report is the reciprocal of the harmonic mean of the higher-is-better ratios. But I don't trust my understanding well enough to be sure.
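Here is a small sketch of what I mean, using made-up file sizes rather than real corpus data. The names `sizes`, `size_ratios`, and `factors` are mine, and the check only illustrates the identity I suspect holds; it says nothing about which average is the right one to report.

```python
from statistics import harmonic_mean, mean

# Hypothetical (uncompressed, compressed) sizes in bytes; not real corpus data.
sizes = [(1000, 500), (2000, 500), (4000, 3000)]

# Usual definition: compressed / uncompressed (lower is better).
size_ratios = [c / u for u, c in sizes]
# Inverse definition: uncompressed / compressed (higher is better).
factors = [u / c for u, c in sizes]

mean_size_ratio = mean(size_ratios)
hm_factor = harmonic_mean(factors)

# Arithmetic mean of the size ratios vs. reciprocal of the harmonic mean of the factors.
print("mean of size ratios:        ", mean_size_ratio)
print("1 / harmonic mean of factors:", 1 / hm_factor)
# For contrast: the arithmetic mean of the factors is a different number.
print("1 / arithmetic mean of factors:", 1 / mean(factors))
```

With these numbers the first two values agree (up to floating-point rounding) while the third does not, which is consistent with my guess that averaging the usual ratios arithmetically is the same as taking the harmonic mean of the higher-is-better ratios.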

Any guidance would be appreciated.