# Thread: Division-free arithmetic coder

1. ## Division-free arithmetic coder

There are well-known adaptive binary arithmetic coders that avoid divisions by keeping total_freq == 1<<K. But the same can be done with any larger alphabet: we just need to redistribute the range between all the symbols. The simple way to do it:

freq[c] += N // increase counter for current char
freq[1..N] -= 1 // decrease counters for all remaining chars

or a more advanced approach:
Code:
```
sum = 0
for (i = 1..N)
  sum     += freq[i]/32
  freq[i] -= freq[i]/32
freq[c] += sum
```
Of course, as N grows, this quickly becomes too slow. But for handling 3-10 values it may be OK. I haven't looked into decoding yet. PS: I added a little analysis here

3. The Daala team has been using a large-alphabet division-free approximation of range coding by Stuiver and Moffat:
https://people.xiph.org/~tterribe/tm...c%20Coding.pdf
Ratio penalty of the approximation is 1-3% for a binary alphabet (it doubles the probability of the first symbols).

5. Another paper with a division-less coder: 0.057 bits/symbol lost on the Calgary corpus (your paper mentioned 0.084), but the code looks more complex. OTOH it doesn't have a loop in the first step. The table mentioned in the post is actually a tabulated logb implementation.

6. Thank you very much - the "more advanced approach" is really quite obvious, yet it had eluded me.

Is there a problem with the "simple" approach: Originally Posted by Bulat Ziganshin freq[c] += N // increase counter for current char
freq[1..N] -= 1 // decrease counters for all remaining chars
in that if freq[n] = 1 for some n <= N, that frequency would get reduced to zero and become uncodable?

I tried the "more advanced" approach with GLZA, using K = 12 and ">>6" instead of "/32", in one of the models (one that has no more than 4 values). Decoding was somewhere between 0.5% and 1% faster, which is better than it sounds because most of the decoding time is outside this model. The compression ratio was the same or slightly better (<1%) than what I had on all files in the Calgary, Canterbury and Silesia corpora, except PROGP, which was 2 bytes larger. Seems like a win/win!

7. With the simple approach, freqs may even become negative. What's the decoder code you have used? Is it division-free?

8. I think you have found the secret to compressing random data!

The decode code I have used is not division free:

```
count = (code - low) / (range /= RangeScaleSymType[Context]);              \
```

but now there is one less divide in this model (and one less variable array and its associated operations):

```
count = (code - low) / (range >>= 12);              \
```

I would love to get rid of the other divide, but I think that means all models need to be done this way, and most have a lot more than 4 values.

9. Originally Posted by Kennon Conrad but now there is one less divide in this model (and one less variable array and its associated operations):

```
count = (code - low) / (range >>= 12);              \
```

I would love to get rid of the other divide, but I think that means all models need to be done this way. But most have a lot more than 4 values.
That's how my arith_static code worked - it ensured that the frequencies per block of data were constant and normalised to a fixed power of two.

In an adaptive model you can either use a lookup table if your frequency space is small enough or perhaps delay the recomputation of the freq array to every Nth symbol once you've processed enough data. I don't know how to remove the other division though. I did try via lookup tables, but couldn't get anything fast.

Edit: Looking at this another way, it seems the traditional multi-symbol arithmetic coder approach is to update freq and totFreq and renormalise the freq array once totFreq hits some magic number, so that we don't get overflows. Bulat, I think your approach is basically to do that renormalise step on every symbol, such that totFreq never changes. An alternative is to do it periodically, such that totFreq oscillates between two known values with relatively few possible values in between, meaning we divide by only a small set of denominators - and those divisions can simply be turned into multiplications by reciprocals from a lookup table. It's still not as good as a fixed shift, but it needs fewer renormalisation steps.

I did explore some of these in fqzcomp. E.g. see the base model: https://sourceforge.net/p/fqzcomp/co...k/base_model.h (and the corresponding cdr256.h). The coders here are very heavily based on Shelwein's code. Speed gains always seemed rather marginal though.

10. James, yes - it is an approach for rapidly-changing stats, such as the first-level model in bzip 0.21. This model has 11 symbols, freq_step=12, and all freqs are halved when total_freq>1000:

freq[c] += 12
total += 12
if (total>1000) freq[1..11] /= 2

So the total varies in the 500..1000 range. Therefore each update increases freq[c] by ~1/40..1/80 of the codespace (depending on the current total), and every remaining symbol loses the same share of its strength. Renormalization increases the strength of weaker symbols, up to 2x. min_freq = 1 == total/1000. The initial freq[i] = freq_step == 12, so it adapts much faster at the start.

My scheme (we may call it the "birthday scheme", since every victim symbol brings a gift to the winner) means that every symbol loses 1/32-1/64 of its strength, depending on that strength. Active symbols will have rather large strength and lose almost exactly 1/32 of it, while weaker symbols may lose a bit less. The minimal symbol freq is 32, so it needs total == 32K in order to make min_freq == total/1024. Well, it would be closer to the bzip behavior if we used freq -= freq/64 and total == 64K.

So bzip adapts faster at the start, while mine adapts more fairly later. My scheme requires a much larger codespace: 64K vs 1000 for the bzip-tuned model.

11. Another approach is to keep a symbol history and remap probability from the symbol shifted out to the one shifted in:

freq[*ptr]++
freq[ptr[-40]]--

This way, freq[] will hold exactly the frequencies of the last 40 symbols (plus some base probability, to give rare symbols more of a chance). And this is faster than updating all N freqs. But compared to the bzip model it's less flexible. Let's see: bzip gives 12 points to the freq of each symbol since the last renormalisation, plus 6 points for the preceding 40 symbols, and so on. In order to partially emulate this, we can do the following:

freq[*ptr] += 12
freq[ptr[-40]] -= 6
freq[ptr[-80]] -= 3
freq[ptr[-120]] -= 2
freq[ptr[-160]] -= 1

Well, it should be even better than the bzip code, since we assign a weight of 12 to exactly the last 40 values, a weight of 6 to the preceding 40 values, and so on, while bzip occasionally halves all the preceding weights. This leads us to the conclusion that we can improve the bzip scheme by making the freq_increment grow with the total:

freq_increment = total/64
freq[c] += freq_increment
total += freq_increment
if (total > 64000) renormalize_freqs()

This scheme should have the same compression ratio as my "advanced" one - on every step it borrows 1/64 of the weight from every symbol and assigns it to the current one.

12. Originally Posted by Bulat Ziganshin we can do the following:

freq[*ptr] += 12
freq[ptr[-40]] -= 6
freq[ptr[-80]] -= 3
freq[ptr[-120]] -= 2
freq[ptr[-160]] -= 1
This is what occurred to me as well. I tried it, but found it was a little worse on compression ratio than a PPMd-style scheme. Maybe I should have tried halving the pointers a little earlier, because on average the first "halving" would come after only 20 symbols, then the second at 60, the third at 100, etc. for PPMd style. Originally Posted by Bulat Ziganshin This leads us to the conclusion that we can improve the bzip scheme by making the freq_increment grow with the total:

freq_increment = total/64
freq[c] += freq_increment
total += freq_increment
if (total > 64000) renormalize_freqs()

This scheme should have the same compression ratio as my "advanced" one - on every step it borrows 1/64 of the weight from every symbol and assigns it to the current one.
This one works well. It's less flexible on the increment value than PPMd, but it's fairer and improves GLZA on average for at least one model. Another simple gem I didn't think of.

13. If you found it useful, two more minor notes: first, unlike the fixed-step scheme it doesn't adapt quickly at the start. This can be fixed with

freq_increment = max(total,500)/64 // where 500 is something like the minimum total_freq

second, you can use a higher range of totals, for example max_total=64000, and then reduce all freqs 16x. This means 4x fewer renormalizations compared to halving.

14. It turned out that Fabian "ryg" Giesen described the idea from the first post a year ago, with a faster implementation (especially with SIMD):

https://fgiesen.wordpress.com/2015/0...hmetic-coding/ (mixin_CDFs[] init values are provided in the comments)

https://fgiesen.wordpress.com/2015/02/20/mixing-discrete-probability-distributions/

15. Fabian's idea (and power-of-2-total adaptive rANS in general) is the fundamental basis of LZNA.

The old-school approach (pre-constant-sum-shift) was to do something like Deferred Summation, which can be geometric-increment scaled for recency bias, combined with a sliding window of last N symbols to get updates after the last defsum accumulation.

Sliding window is just like

counts[ cur ] ++
counts[ leaving window ] --

and defsum does a periodic update to gather longer-filetime stats. Then sum the defsum and sliding counts so that the total is a power of 2; the partial sums give you a weighting for the fast-adapting and slow-adapting parts of the model.

16. Originally Posted by Bulat Ziganshin It turned out that Fabian "ryg" Giesen described the idea from the first post a year ago, with a faster implementation (especially with SIMD):
I'm not the first one to describe it either. As was pointed out to me after I published the "models for adaptive arithmetic coding" post, a draft Daala spec described essentially the same scheme in 2012. (https://tools.ietf.org/html/draft-te...s-02#section-2 section 2.2, variant 3).

This particular model is fairly nice, and its SIMD-friendliness is certainly a bonus. But I think the key insight is just that forming convex combinations of integer-valued CDFs (in the sense of forming an exact convex combination of the original values and then truncating/rounding to the nearest integer) has nice algebraic closure properties, in particular:
1. convex combinations between CDFs with the same total T again yield a CDF with that total,
2. convex combinations between CDFs for PDFs where every symbol has a probability >=p_min yield a CDF with the same property,
all under fairly weak assumptions about how exactly the rounding is done (meaning, you can do it in slightly different ways). The proof is elementary but the result was surprising to me.

Having not just one particular update scheme with those properties, but an entire space of them (and one that's computationally very nice to deal with, at that!) is great. LZNA uses that type of model (for small/medium-sized alphabets) directly. BitKnit uses larger alphabets and a deferred-summation-style approach, but still uses interpolation between CDFs to handle its periodic renormalization. Maintaining a truly constant total at all times, without any ad-hoc fix-up steps required, makes everything nicer.

17. Originally Posted by fgiesen ...a draft Daala spec described essentially the same scheme in 2012. (https://tools.ietf.org/html/draft-te...s-02#section-2 section 2.2, variant 3).
I'm now convinced that guaranteeing every symbol has a probability >= p_min in the inputs to your convex combination (as described in that draft) is the wrong way to go. The problem is that either
a) you have to use a relatively large p_min if you want to make small adjustments to the distribution, otherwise after scaling, differences of p_min get truncated to 0, or
b) you have to do your convex combinations with extended precision (e.g., double the bit width of the probabilities), slowing down your SIMD throughput by a factor of two (not even counting pack/unpack instructions).
A large p_min limits the maximum probability you can have to T - (N-1)*p_min (where T is a power of 2, and N is the size of the alphabet), which can have a non-trivial impact on compression. E.g., with T = 32768, N = 16, and p_min = 256, the minimum cost of coding a symbol is almost 0.2 bits (compared with 0.001 bits with p_min = 1).

Jean-Marc came up with a better approach. Just subtract p_min from every symbol in the current distribution, giving it a total of T - N*p_min. Then do whatever adjustments you like (that can send probabilities to 0 or even 1.0), but keep the total fixed. Then add back in p_min. This allows p_min to be as small as 1 with no risk of truncation. If you only ever use a distribution once, you can even avoid the initial subtraction. Just add in p_min right before you need it and output to a copy, keeping the original distribution with no minimum probability guarantees for input to the next adaptation step.

Code is here: https://review.xiph.org/1345/
Simplifications here: https://review.xiph.org/1355/

19. Originally Posted by derf I'm now convinced that guaranteeing every symbol has a probability >= p_min in the inputs to your convex combination (as described in that draft) is the wrong way to go. The problem is that either
a) you have to use a relatively large p_min if you want to make small adjustments to the distribution, otherwise after scaling, differences of p_min get truncated to 0, or
b) you have to do your convex combinations with extended precision (e.g., double the bit width of the probabilities), slowing down your SIMD throughput by a factor of two (not even counting pack/unpack instructions).
Not true. You can perform an (effectively) extended-precision adjustment without widening at all (well, you need 1 extra bit because you're forming a difference), for linear interpolation anyway, and the value of p_min is immaterial. Suppose you're using a .15-bit fixed point factor f to blend between two CDFs:

Code:
```
cdf_out = ((32768 - f) * cdf0 + f * cdf1 + bias) >> 15
       == (32768 * cdf0 + f * (cdf1 - cdf0) + bias) >> 15
       == cdf0 + ((f * (cdf1 - cdf0) + bias) >> 15);
```

This is an exact identity, not approximate (it works because ((a<<k) + b) >> k == (a + (b>>k)) barring overflow), and (in that form) easily computed using something like PMULHRSW (or shifts if you want a bit more throughput and are OK with less freedom of choice for adaptation factors).

With PMULHRSW, you want the bias to be a constant. With the bit-shift form, you get:

cdf[i] += (cdf_mix_i - cdf[i]) >> k;

and it's (experimentally) somewhat better to use an extended range where cdf_mix_i goes from 0 to larger than kMaxProb (you can let it grow up to (kMaxProb + (1<<k) - 1) in this form).
You need to carefully analyze the ranges of the values involved if you do this (overflows would be really bad), but it works just fine, for any p_min. We normally use 1 or 2.

21. Some further explanation on that last bit: the straightforward version is just

Code:
`  cdf[i] += (cdf_mix_i - cdf[i] + bias) >> k;`
and you'd normally pick something like bias=(1<<k)>>1. But note that this is equivalent to just adding "bias" to "cdf_mix_i" directly. The (simple) proof I give in the blog post shows that doing the computation above with any value of bias in [0,(1<<k)-1] works.

bias=0 floors and then ends up allocating the remaining range to the last symbol. bias=(1<<k)-1 is a ceiling function, so the extra range gets allocated to the first symbol in your alphabet. But really, there's no need to hold the bias constant; a (very simple) variation of the proof shows that using per-i bias[i] works as long as the sequence of bias values is non-decreasing with i. So you get to start with bias=0 on the first symbol, end with bias=(1<<k)-1 on the last symbol, and it still works out. At this point the per-i bias values are just an empty formalism; all it really means is that your mixin CDF (if there's a single one) has a de-facto slightly larger range that you get to assign to whatever you want inside your mixin distribution; and the natural place to put it is on our new MPS estimate, that is, the symbol we're trying to boost. With p_min=1, you get

Code:
```  kCDFMax = kMaxProb + (1 << kRate) - 1;
  cdf_mix_i = i + ((i >= kCurrentSym) ? (kCDFMax - kNumSyms) : 0);
  cdf[i] += (cdf_mix_i - cdf[i]) >> kRate;```

You could in principle do the same with a multiplicative update, but the problem is that you want to add the bias in a different place, and you don't really get to do that with typical SIMD instructions (there are rounding high multiplies like PMULHRSW, but they have a fixed bias).

If you don't do anything like this, you get "clumping" somewhere inside the range, because that's where the cumulative small boosts to earlier symbols finally become large enough to actually change the integer-rounded result. I think that might be what you meant by problems with p_min=1, but it's really quite independent of that; it's not an artifact of doing the computation approximately (as noted above, it's not actually approximate!) - that behavior exists in the extended-range version as well and is more fundamental. The tweaked update rule as described still has "round-off error", but it makes sure that the excess probability range gets allocated to symbols that are currently in use, and it avoids the "clumping" problem nicely. (In our tests anyway.)
23. Originally Posted by fgiesen Code:
```  kCDFMax = kMaxProb + (1 << kRate) - 1;
  cdf_mix_i = i + ((i >= kCurrentSym) ? (kCDFMax - kNumSyms) : 0);
  cdf[i] += (cdf_mix_i - cdf[i]) >> kRate;```
So, let's look at the "Simplified Version" from my https://review.xiph.org/1355/ link above.

Mapping between notations, we have kMaxProb = 32768, kRate = rate, kNumSyms = n. Then

Code:
```    tmp = 2 - (1 << rate) + i + (32767 + (1 << rate) - n)*(i >= val);

=>  tmp = i >= val ? (32769 - n + i) : (i + 2 - (1 << rate));```
and

Code:
```    cdf[i] -= (cdf[i] - tmp) >> rate;

=>  cdf[i] += -((cdf[i] - tmp) >> rate);

=>  cdf[i] += (-(cdf[i] - tmp) + (1 << rate) - 1) >> rate;

=>  cdf[i] += (tmp - cdf[i] + (1 << rate) - 1) >> rate;

=>  cdf[i] += ((i >= val ? (32769 - n + i) : (i + 2 - (1 << rate))) - cdf[i] + (1 << rate) - 1) >> rate;

=>  cdf[i] += ((i >= val ? (32768 - n + (i + 1) + (1 << rate) - 1) : (i + 1)) - cdf[i]) >> rate;```
That looks exactly like what you have, except with (i + 1) instead of i. Working the other way:

Code:
```    cdf[i] -= (cdf[i] - tmp) >> rate;

=>  cdf[i] -= (cdf[i] - (i >= val ? (32769 - n + i) : (i + 2 - (1 << rate)))) >> rate;

=>  cdf[i] -= (cdf[i] - (i + 1) - (i >= val ? (32768 - n) : -((1 << rate) - 1))) >> rate;

=>  cdf[i] -= (i + 1);
cdf[i] -= (cdf[i] - (i >= val ? (32768 - n) : -((1 << rate) - 1))) >> rate;
cdf[i] += (i + 1);```
And that's the "subtract a probability floor, make any changes you want that keep the probabilities between 0 and (kMaxProb - kNumSyms), then add the floor back" approach.

So except for an off-by-one, what I'm describing and what you're describing are exactly equivalent (and maybe the off-by-one is just explained by the fact that we exclude the 0-term from our cdf, so that cdf[0] is already the probability of value 0). In the last formulation there, it should be clear that the -((1 << rate) - 1) offset is just there to ensure that the change is non-zero so long as cdf[i] > 0 (after subtracting off the probability floor). That guarantees that cdf[i] can be reduced all the way down to that floor. In the other direction, no offset is required, since a negative change won't be reduced to 0 by the right shift.