
Thread: Division-free arithmetic coder

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    Division-free arithmetic coder

    There are well-known dynamic binary arithmetic coders that avoid divisions by keeping total_freq == 1<<K. But the same can be done with any larger alphabet. We just need to redistribute the range between all the symbols. The simple way to do it:

    freq[c] += N // increase counter for current char
    freq[1..N] -= 1 // decrease counters for all remaining chars

    or more advanced approach:
    Code:
    sum = 0
    for (i=1..N)
      sum += freq[i]/32
      freq[i] -= freq[i]/32
    freq[c] += sum
    Of course, as N grows, it quickly becomes too slow. But for handling 3-10 values it may be OK. I haven't looked into decoding yet.
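
    For concreteness, a minimal C sketch of this update for a small alphabet (the names NSYM, TOTAL, update_freqs and the sizes are illustrative, not from any particular codec). The sum of freq[] stays exactly TOTAL, so the coder can replace the division by total_freq with a shift:
    Code:
    #include <stdint.h>

    #define NSYM   8           /* small alphabet, e.g. 3-10 symbols   */
    #define TOTAL  (1 << 12)   /* total_freq kept constant at 1<<K    */

    static uint16_t freq[NSYM];

    static void init_freqs(void)
    {
        for (int i = 0; i < NSYM; i++)
            freq[i] = TOTAL / NSYM;        /* uniform start; sum == TOTAL */
    }

    /* Move 1/32 of every symbol's weight to the coded symbol c. */
    static void update_freqs(int c)
    {
        uint32_t gift = 0;
        for (int i = 0; i < NSYM; i++) {
            uint32_t dec = freq[i] >> 5;   /* freq[i]/32 */
            freq[i] -= dec;
            gift    += dec;
        }
        freq[c] += gift;
    }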

    PS: I added a little analysis here
    Last edited by Bulat Ziganshin; 13th May 2016 at 11:42.

  2. The Following 5 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    Christoph Diegelmann (12th May 2016),encode (12th May 2016),Kennon Conrad (12th May 2016),lorents17 (15th May 2016),willvarfar (13th May 2016)

  3. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    The Daala team has been using a large-alphabet, division-free approximation of range coding by Stuiver and Moffat:
    https://people.xiph.org/~tterribe/tm...c%20Coding.pdf
    The ratio penalty of the approximation is 1-3% for a binary alphabet (it doubles the probability of the first symbols):
    [Attached image: Moffat.png (ratio penalty plot)]

  4. The Following 2 Users Say Thank You to Jarek For This Useful Post:

    Bulat Ziganshin (12th May 2016),willvarfar (13th May 2016)

  5. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Another paper with a division-less coder: 0.057 bits/symbol lost on the Calgary corpus (your paper mentioned 0.084), but the code looks more complex. OTOH it doesn't have a loop in the first step. The table mentioned in the post is actually a tabulated logb implementation.
    Last edited by Bulat Ziganshin; 13th May 2016 at 13:30.

  6. #4
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thank you very much - the "more advanced approach" is really quite obvious yet had eluded me.

    Is there a problem with the "simple" approach:
    Quote Originally Posted by Bulat Ziganshin View Post
    freq[c] += N // increase counter for current char
    freq[1..N] -= 1 // decrease counters for all remaining chars
    if freq[n] == 1 for some n <= N, in that the frequency would get reduced to zero and the symbol would become uncodable?

    I tried the "more advanced" approach with GLZA using K = 12 and ">>6" instead of "/32" in one of the models (one that has no more than 4 values). Decoding was somewhere between 0.5% and 1% faster, which is better than it sounds because most of the decoding time is outside this model. The compression ratio was the same or slightly better (<1%) than what I had on all files in the Calgary, Canterbury and Silesia corpora, except PROGP, which was 2 bytes larger. Seems like a win/win!

  7. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    With the simple approach, freqs may even become negative.

    What's the decoder code you have used? Is it division-free?

  8. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I think you have found the secret to compressing random data!

    The decode code I have used is not division-free:


    count = (code - low) / (range /= RangeScaleSymType[Context]); \


    but now there is one less divide in this model (and one less variable array and associated operations):


    count = (code - low) / (range >>= 12); \


    I would love to get rid of the other divide, but I think that means all models need to be done this way. But most have a lot more than 4 values.

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by Kennon Conrad View Post
    but now there is one less divide in this model (and one less variable array and associated operations):


    count = (code - low) / (range >>= 12); \


    I would love to get rid of the other divide, but I think that means all models need to be done this way. But most have a lot more than 4 values.
    That's how my arith_static code worked - ensuring that the frequencies per block of data are constant and normalising them to a fixed power of two.

    In an adaptive model you can either use a lookup table if your frequency space is small enough or perhaps delay the recomputation of the freq array to every Nth symbol once you've processed enough data. I don't know how to remove the other division though. I did try via lookup tables, but couldn't get anything fast.

    Edit: Looking at this another way, it seems the traditional multi-symbol arithmetic coder approach is to update freq and totFreq and renormalise the freq array once totFreq hits some magic number, so that we don't get overflows. Bulat, I think your approach is basically to do that renormalise step on every symbol so that totFreq never changes. An alternative is to do it periodically such that totFreq oscillates between two known values with relatively few possible values in between, meaning we only ever divide by a small set of denominators, which can simply be turned into multiplications by reciprocals from a lookup table. It's still not as good as a fixed shift, but it needs fewer renormalisation steps.
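
    To illustrate the reciprocal idea (my own sketch, not the fqzcomp code): if totFreq can only take values from a small known set, both encoder and decoder can replace range/totFreq with a multiply-high by a precomputed fixed-point reciprocal. The quotient may come out one lower than the exact one, which is harmless as long as both sides use the identical routine everywhere:
    Code:
    #include <stdint.h>

    #define MAX_TOTAL 1024   /* totFreq oscillates within a small known range */

    static uint64_t recip[MAX_TOTAL + 1];

    static void init_reciprocals(void)
    {
        for (uint32_t d = 1; d <= MAX_TOTAL; d++)
            recip[d] = ((uint64_t)1 << 32) / d;    /* 32.32 fixed-point 1/d */
    }

    /* Approximates x / d, never exceeding the exact quotient and off by
       at most one.  Encoder and decoder must both use this same routine. */
    static inline uint32_t div_by_total(uint32_t x, uint32_t d)
    {
        return (uint32_t)(((uint64_t)x * recip[d]) >> 32);
    }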

    I did explore some of these in fqzcomp. E.g. see the base model: https://sourceforge.net/p/fqzcomp/co...k/base_model.h (and the corresponding cdr256.h). The coders there are very heavily based on Shelwien's code. Speed gains always seemed rather marginal though.

  10. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    James, yes - it is an approach for rapidly changing stats, such as the first-level model in bzip 0.21. This model has 11 symbols, freq_step=12, and all freqs are halved when total_freq>1000:

    freq[c] += 12
    total += 12
    if (total>1000) freq[1..11] /= 2

    So, the total varies in the 500..1000 range. Therefore, each update increases freq[c] by ~1/40..1/80 of the codespace (depending on the current total) and every remaining symbol loses the same share of its strength. Renormalization increases the strength of weaker symbols, up to 2x. Min_freq=1 == total/1000. Initial freq[i]=freq_step==12, so it adapts much faster at the start.

    My scheme (we may call it the "birthday scheme", since every victim symbol brings a gift to the winner) means that every symbol loses 1/32-1/64 of its strength, depending on that strength. Active symbols will have rather large strength and lose almost exactly 1/32 of it, while weaker symbols may lose a bit less. The minimal symbol freq is 32, so it needs total==32K in order to make min_freq==total/1024. Well, it should be closer to bzip behavior if we use freq -= freq/64 and total==64K.

    So, bzip adapts faster at the start while mine has fairer adaptation later. My scheme requires a much larger codespace, 64K vs. 1000 for the bzip-tuned model.
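
    A rough C rendering of that bzip-style model as described above (my paraphrase, not the original bzip 0.21 source; note that, unlike the shift-based schemes in this thread, the coder still has to divide by this variable total):
    Code:
    #include <stdint.h>

    #define NSYM       11
    #define FREQ_STEP  12
    #define MAX_TOTAL  1000

    static uint16_t freq[NSYM];
    static uint32_t total;

    static void init_model(void)
    {
        for (int i = 0; i < NSYM; i++)
            freq[i] = FREQ_STEP;            /* small total => fast adaptation at the start */
        total = NSYM * FREQ_STEP;
    }

    static void update_model(int c)
    {
        freq[c] += FREQ_STEP;
        total   += FREQ_STEP;
        if (total > MAX_TOTAL) {            /* halve everything, keeping freqs >= 1 */
            total = 0;
            for (int i = 0; i < NSYM; i++) {
                freq[i] = (freq[i] + 1) >> 1;
                total  += freq[i];
            }
        }
    }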

  11. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Another approach is to keep a symbol history and remap probability from the symbol shifted out to the one shifted in:

    freq[*ptr]++
    freq[ptr[-40]]--

    This way, freq[] will hold exactly the probabilities of the last 40 symbols (plus some base probability to give more chances to rare symbols). And this is faster than updating all N freqs. But, compared to the bzip model, it's less flexible. Let's see: bzip gives 12 points to the freq of each symbol since the last renormalisation, plus 6 points for the preceding 40 symbols, and so on. In order to partially emulate this, we can do the following:

    freq[*ptr] += 12
    freq[ptr[-40]] -= 6
    freq[ptr[-80]] -= 3
    freq[ptr[-120]] -= 2
    freq[ptr[-160]] -= 1

    Well, it should be even better than the bzip code since we assign a weight of 12 to exactly the last 40 values, a weight of 6 to the preceding 40 values and so on, while bzip occasionally halves all the preceding weights. This leads us to the conclusion that we can improve the bzip scheme by making the freq_increment grow with the total:

    freq_increment = total/64
    freq[c] += freq_increment
    total += freq_increment
    if (total > 64000) renormalize_freqs()

    This scheme should have the same compression ratio as my "advanced" one - on every step it borrows 1/64 of the weight from every symbol and assigns it to the current one.
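
    A possible C rendering of that growing-increment scheme (again only a sketch; the starting weights are arbitrary):
    Code:
    #include <stdint.h>

    #define NSYM       16
    #define MAX_TOTAL  64000

    static uint32_t freq[NSYM];
    static uint32_t total;

    static void init_model(void)
    {
        for (int i = 0; i < NSYM; i++)
            freq[i] = 64;                  /* arbitrary small starting weight */
        total = NSYM * 64;
    }

    static void update_model(int c)
    {
        uint32_t inc = total >> 6;         /* freq_increment = total/64 */
        freq[c] += inc;
        total   += inc;
        if (total > MAX_TOTAL) {           /* renormalize: halve all freqs */
            total = 0;
            for (int i = 0; i < NSYM; i++) {
                freq[i] = (freq[i] + 1) >> 1;
                total  += freq[i];
            }
        }
    }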

  12. #10
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    we can do the following:

    freq[*ptr] += 12
    freq[ptr[-40]] -= 6
    freq[ptr[-80]] -= 3
    freq[ptr[-120]] -= 2
    freq[ptr[-160]] -= 1
    This is what occurred to me. I tried it, but found it was a little worse on compression ratio than the PPMd-style scheme. Maybe I should have tried halving the pointers a little earlier, because on average the first "halving" would be after only 20 symbols, then the second at 60, the third at 100, etc. for PPMd style.

    Quote Originally Posted by Bulat Ziganshin View Post
    This leads us to the conclusion that we can improve the bzip scheme by making the freq_increment grow with the total:

    freq_increment = total/64
    freq[c] += freq_increment
    total += freq_increment
    if (total > 64000) renormalize_freqs()

    This scheme should have the same compression ratio as my "advanced" one - on every step it borrows 1/64 of the weight from every symbol and assigns it to the current one.
    This one works well. It's less flexible on the increment value than PPMd but fairer, and it improves GLZA on average for at least one model. Another simple gem I didn't think of.

  13. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    If you found it useful, two more minor notes: first, unlike the fixed-step scheme, it doesn't have fast adaptation at the start. It can be fixed with

    freq_increment = max(total,500)/64 // where 500 is something like the minimum total_freq

    Second, you can use a higher range of totals, for example max_total=64000, and then reduce all freqs 16x. This means 4x fewer renormalizations compared to halving.
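
    In code, the two tweaks might look like this (a sketch only; it reuses the freq[]/total layout from the sketch a few posts above, and 500 and the 16x reduction are just the example values from this post):
    Code:
    #include <stdint.h>

    #define NSYM 16
    static uint32_t freq[NSYM], total;               /* same layout as the earlier sketch */

    static void update_model_tweaked(int c)
    {
        uint32_t base = total > 500 ? total : 500;   /* max(total, 500) */
        uint32_t inc  = base >> 6;                   /* /64 */
        freq[c] += inc;
        total   += inc;
        if (total > 64000) {                         /* reduce 16x instead of halving */
            total = 0;
            for (int i = 0; i < NSYM; i++) {
                freq[i] = (freq[i] + 15) >> 4;       /* keeps every freq >= 1 */
                total  += freq[i];
            }
        }
    }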

  14. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    It turned out that Fabian “ryg” Giesen described the idea from the first post a year ago, with a faster implementation (especially with SIMD):

    https://fgiesen.wordpress.com/2015/0...hmetic-coding/ (mixin_CDFs[] init values are provided in the comments)

    https://fgiesen.wordpress.com/2015/02/20/mixing-discrete-probability-distributions/

  15. #13
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Fabian's idea (and power-of-2 total adaptive RANS in general) is the fundamental basis of LZNA.

    The old-school approach (pre-constant-sum-shift) was to do something like Deferred Summation, which can be geometric-increment scaled for recency bias, combined with a sliding window of last N symbols to get updates after the last defsum accumulation.

    Sliding window is just like

    counts[ cur ] ++
    counts[ leaving window ] --

    and defsum does a periodic update to gather longer-term, whole-file stats. Then sum the defsum and sliding counts so that the total is a power of 2; the partial sums give you a weighting for the fast-adapting and slow-adapting parts of the model.
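
    A rough sketch of how the two parts might fit together (entirely my own reading of the description; the window size, FAST_W weight and refresh period are made up, and this makes no claim to match LZNA or any shipped defsum implementation). The coding frequency slow[i] + FAST_W*fast[i] always sums to a power of two, so the coder never divides by the total:
    Code:
    #include <stdint.h>

    #define NSYM    16
    #define WINDOW  32
    #define FAST_W  8                          /* weight of each window count        */
    #define TOTAL   (1 << 14)                  /* power-of-two coding total          */
    #define SLOW_T  (TOTAL - WINDOW * FAST_W)  /* fixed budget for the slow part     */

    static uint32_t raw[NSYM];                 /* counts gathered for deferred summation */
    static uint16_t slow[NSYM];                /* slow part, always sums to SLOW_T       */
    static uint16_t fast[NSYM];                /* window counts, always sum to WINDOW    */
    static uint8_t  hist[WINDOW];
    static uint32_t pos;

    /* Coding frequency of symbol i; the sum over i is always exactly TOTAL. */
    static inline uint32_t code_freq(int i) { return slow[i] + FAST_W * (uint32_t)fast[i]; }

    static void init(void)
    {
        for (int i = 0; i < NSYM; i++) slow[i] = SLOW_T / NSYM;
        slow[0] += SLOW_T % NSYM;              /* make the slow part sum exactly SLOW_T */
        fast[0]  = WINDOW;                     /* hist[] starts as all zeros            */
    }

    /* Deferred summation: renormalize raw counts into slow[] summing to SLOW_T. */
    static void defsum(void)
    {
        uint32_t sum = 0, used = 0;
        for (int i = 0; i < NSYM; i++) sum += raw[i] + 1;  /* +1 keeps rare symbols codable */
        for (int i = 0; i < NSYM; i++) {
            slow[i] = (uint16_t)((uint64_t)(raw[i] + 1) * SLOW_T / sum);
            used   += slow[i];
            raw[i] >>= 1;                      /* geometric decay => recency bias */
        }
        slow[NSYM - 1] += (uint16_t)(SLOW_T - used);       /* dump the rounding remainder */
    }

    static void update(int c)
    {
        fast[hist[pos % WINDOW]]--;            /* count leaving the window  */
        hist[pos % WINDOW] = (uint8_t)c;
        fast[c]++;                             /* count entering the window */
        raw[c]++;
        if (++pos % 4096 == 0) defsum();       /* periodic slow-model refresh */
    }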
    Last edited by cbloom; 3rd August 2016 at 19:11.

  16. #14
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    It turned out that Fabian “ryg” Giesen described the idea from the first post a year ago, with a faster implementation (especially with SIMD):
    I'm not the first one to describe it either. As was pointed out to me after I published the "models for adaptive arithmetic coding" post, a draft Daala spec described essentially the same scheme in 2012. (https://tools.ietf.org/html/draft-te...s-02#section-2 section 2.2, variant 3).

    This particular model is fairly nice, and its SIMD-friendliness is certainly a bonus. But I think the key insight is just that forming convex combinations of integer-valued CDFs (in the sense of forming an exact convex combination of the original values and then truncating/rounding to the nearest integer) has nice algebraic closure properties, in particular:
    1. convex combinations between CDFs with the same total T again yield a CDF with that total,
    2. convex combinations between CDFs for PDFs where every symbol has a probability >=p_min yield a CDF with the same property,
    all under fairly weak assumptions about how exactly the rounding is done (meaning, you can do it in slightly different ways). The proof is elementary but the result was surprising to me.

    Having not just one particular update scheme with those properties, but instead an entire space of them (and one that's computationally very nice to deal with at that!) is great. LZNA uses that type of model (for small/medium-sized alphabets) directly. BitKnit uses larger alphabets and a deferred summation-style approach, but still uses interpolation between CDFs to handle its periodic renormalization. Maintaining a truly constant total at all times, without any ad-hoc fix-up steps required, makes everything nicer.
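
    To make property 1 concrete, a small stand-alone example (mine, not LZNA/BitKnit code) of blending two CDFs with the same total T: because the endpoints 0 and T map to themselves for any blend factor and any fixed bias in [0, 2^15-1], the result is again a CDF with total exactly T:
    Code:
    #include <stdint.h>

    #define NSYM  8
    #define T     32768                         /* shared total of both CDFs */

    /* cdf[k] = sum of freqs of symbols 0..k-1, so cdf[0] == 0 and cdf[NSYM] == T.
       Blends cdf0 toward cdf1 by the Q0.15 factor f, rounding with a fixed bias. */
    static void blend_cdf(uint16_t out[NSYM + 1],
                          const uint16_t cdf0[NSYM + 1],
                          const uint16_t cdf1[NSYM + 1],
                          uint32_t f    /* 0..32768 */,
                          uint32_t bias /* 0..32767 */)
    {
        for (int k = 0; k <= NSYM; k++)
            out[k] = (uint16_t)(((32768 - f) * (uint32_t)cdf0[k]
                                 + f * (uint32_t)cdf1[k] + bias) >> 15);
    }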

  17. #15
    Member
    Join Date
    May 2016
    Location
    a
    Posts
    2
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by fgiesen View Post
    ...a draft Daala spec described essentially the same scheme in 2012. (https://tools.ietf.org/html/draft-te...s-02#section-2 section 2.2, variant 3).
    I'm now convinced that guaranteeing every symbol has a probability >= p_min in the inputs to your convex combination (as described in that draft) is the wrong way to go. The problem is that either
    a) you have to use a relatively large p_min if you want to make small adjustments to the distribution, otherwise after scaling, differences of p_min get truncated to 0, or
    b) you have to do your convex combinations with extended precision (e.g., double the bit width of the probabilities), slowing down your SIMD throughput by a factor of two (not even counting pack/unpack instructions).
    A large p_min limits the maximum probability you can have to T - (N-1)*p_min (where T is a power of 2, and N is the size of the alphabet), which can have a non-trivial impact on compression. E.g., with T = 32768, N = 16, and p_min = 256, the minimum cost of coding a symbol is almost 0.2 bits (compared with 0.001 bits with p_min = 1).

    Jean-Marc came up with a better approach. Just subtract p_min from every symbol in the current distribution, giving it a total of T - N*p_min. Then do whatever adjustments you like (that can send probabilities to 0 or even 1.0), but keep the total fixed. Then add back in p_min. This allows p_min to be as small as 1 with no risk of truncation. If you only ever use a distribution once, you can even avoid the initial subtraction. Just add in p_min right before you need it and output to a copy, keeping the original distribution with no minimum probability guarantees for input to the next adaptation step.

    Code is here: https://review.xiph.org/1345/
    Simplifications here: https://review.xiph.org/1355/
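
    A minimal sketch of that subtract-the-floor idea (my paraphrase of the description, working on the pdf for brevity rather than the cdf form in the linked patches; the adjustment step here is the redistribute-1/2^rate update from earlier in the thread):
    Code:
    #include <stdint.h>

    #define NSYM   16
    #define T      32768          /* power-of-two total     */
    #define P_MIN  1              /* tiny probability floor */

    /* pdf[] always sums to T with every entry >= P_MIN. */
    static void adapt(uint16_t pdf[NSYM], int coded_sym, int rate)
    {
        /* 1. Subtract the floor: entries now sum to T - NSYM*P_MIN and may reach 0. */
        for (int i = 0; i < NSYM; i++) pdf[i] -= P_MIN;

        /* 2. Any adjustment that keeps the total fixed; here, move 1/2^rate of
              every symbol's weight to the coded symbol. */
        uint32_t gift = 0;
        for (int i = 0; i < NSYM; i++) {
            uint32_t dec = pdf[i] >> rate;
            pdf[i] -= dec;
            gift   += dec;
        }
        pdf[coded_sym] += gift;

        /* 3. Add the floor back: the total is T again and every entry >= P_MIN. */
        for (int i = 0; i < NSYM; i++) pdf[i] += P_MIN;
    }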

  18. The Following User Says Thank You to derf For This Useful Post:

    Bulat Ziganshin (18th May 2016)

  19. #16
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by derf View Post
    I'm now convinced that guaranteeing every symbol has a probability >= p_min in the inputs to your convex combination (as described in that draft) is the wrong way to go. The problem is that either
    a) you have to use a relatively large p_min if you want to make small adjustments to the distribution, otherwise after scaling, differences of p_min get truncated to 0
    b) you have to do your convex combinations with extended precision (e.g., double the bit width of the probabilities), slowing down your SIMD throughput by a factor of two (not even counting pack/unpack instructions).
    Not true. You can perform an (effectively) extended-precision adjustment without widening at all (well, you need 1 extra bit because you're forming a difference), for linear interpolation anyway, and the value of p_min is immaterial. Suppose you're using a 15-bit fixed-point factor f to blend between two CDFs:

    cdf_out = ((32768 - f) * cdf0 + f * cdf1 + bias) >> 15
    == (32768 * cdf0 + f * (cdf1 - cdf0) + bias) >> 15
    == cdf0 + ((f * (cdf1 - cdf0) + bias) >> 15);

    This is an exact identity, not approximate (it works because ((a<<k) + b) >> k == (a + (b>>k)) barring overflow), and (in that form) easily computed using something like PMULHRSW (or shifts if you want a bit more throughput and are OK with less freedom of choice for adaptation factors).

    With PMULHRSW, you want the bias to be a constant. With the bit-shift form, you get:

    cdf[i] += (cdf_mix_i - cdf[i]) >> k;

    and it's (experimentally) somewhat better to use an extended range where cdf_mix_i goes from 0 to larger than kMaxProb (you can let it grow up to (kMaxProb + (1<<k) - 1) in this form).
    You need to carefully analyze the ranges of the values involved if you do this (overflows would be really bad), but it works just fine for any p_min. We normally use 1 or 2.

  20. The Following User Says Thank You to fgiesen For This Useful Post:

    Bulat Ziganshin (18th May 2016)

  21. #17
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Some further explanation on that last bit: the straightforward version is just

    Code:
      cdf[i] += (cdf_mix_i - cdf[i] + bias) >> k;
    and you'd normally pick something like bias=(1<<k)>>1. But note that this is equivalent to just adding "bias" to "cdf_mix_i" directly. The (simple) proof I give in the blog post shows that doing the computation above with any value of bias in [0,(1<<k)-1] works.


    bias=0 floors and then ends up allocating the remaining range to the last symbol. bias=(1<<k)-1 is a ceiling function, so the extra range gets allocated to the first symbol in your alphabet. But really, there's no need to hold the bias constant; a (very simple) variation of the proof shows that using per-i bias[i] works as long as the sequence of bias values is non-decreasing with i. So you get to start with bias=0 on the first symbol, end with bias=(1<<k)-1 on the last symbol, and it still works out. At this point the per-i bias values are just an empty formalism; all it really means is that your mixin CDF (if there's a single one) has a de-facto slightly larger range that you get to assign to whatever you want inside your mixin distribution; and the natural place to put it is on our new MPS estimate, that is, the symbol we're trying to boost. With p_min=1, you get

    Code:
      kCDFMax = kMaxProb + (1 << kRate) - 1;
  cdf_mix_i = i + ((i >= kCurrentSym) ? (kCDFMax - kNumSyms) : 0);
      cdf[i] += (cdf_mix_i - cdf[i]) >> kRate;

    You could in principle do the same with a multiplicative update, but the problem is that you want to add the bias in a different place and you don't really get to do that with typical SIMD instructions (there are rounding high multiplies like PMULHRSW, but they have a fixed bias).


    If you don't do anything like this, you get "clumping" somewhere inside the range, because that's where the cumulative small boosts to earlier symbols are finally large enough to actually carry over and change the integer-rounded result. I think that might be what you meant by problems with p_min=1, but it's really quite independent of that; it's not an artifact of doing the computation approximately (as noted above, it's not actually approximate!), that behavior exists in the extended-range version as well and is more fundamental. The tweaked update rule as described still has "round-off error", but it makes sure that the excess probability range gets allocated to symbols that are currently in use, and avoids the "clumping" problem nicely. (In our tests anyway.)

  22. The Following User Says Thank You to fgiesen For This Useful Post:

    Bulat Ziganshin (18th May 2016)

  23. #18
    Member
    Join Date
    May 2016
    Location
    a
    Posts
    2
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by fgiesen View Post
    Code:
      kCDFMax = kMaxProb + (1 << kRate) - 1;
  cdf_mix_i = i + ((i >= kCurrentSym) ? (kCDFMax - kNumSyms) : 0);
      cdf[i] += (cdf_mix_i - cdf[i]) >> kRate;
    So, let's look at the "Simplified Version" from my https://review.xiph.org/1355/ link above.

    Mapping between notations, we have kMaxProb = 32768, kRate = rate, kNumSyms = n. Then

    Code:
        tmp = 2 - (1 << rate) + i + (32767 + (1 << rate) - n)*(i >= val);
    
    =>  tmp = i >= val ? (32769 - n + i) : (i + 2 - (1 << rate));
    and

    Code:
        cdf[i] -= (cdf[i] - tmp) >> rate;
    
    =>  cdf[i] += -((cdf[i] - tmp) >> rate);
    
    =>  cdf[i] += (-(cdf[i] - tmp) + (1 << rate) - 1) >> rate;
    
    =>  cdf[i] += (tmp - cdf[i] + (1 << rate) - 1) >> rate;
    
    =>  cdf[i] += ((i >= val ? (32769 - n + i) : (i + 2 - (1 << rate))) + (1 << rate) - 1) >> rate;
    
    =>  cdf[i] += (i >= val ? (32768 - n + (i + 1) + (1 << rate) - 1) : (i + 1)) >> rate;
    That looks exactly like what you have, except with (i + 1) instead of i. Working the other way:

    Code:
        cdf[i] -= (cdf[i] - tmp) >> rate;
    
    =>  cdf[i] -= (cdf[i] - (i >= val ? (32769 - n + i) : (i + 2 - (1 << rate)))) >> rate;
    
    =>  cdf[i] -= (cdf[i] - (i + 1) - (i >= val ? (32768 - n) : -((1 << rate) - 1))) >> rate;
    
    =>  cdf[i] -= (i + 1);
        cdf[i] -= (cdf[i] - (i >= val ? (32768 - n) : -((1 << rate) - 1))) >> rate;
        cdf[i] += (i + 1);
    And that's the "subtract a probability floor, make any changes you want that keep the probabilities between 0 and (kMaxProb - kNumSyms), then add the floor back" approach.

    So except for an off-by-one, what I'm describing and what you're describing are exactly equivalent (and maybe the off-by-one is just explained by the fact that we exclude the 0-term from our cdf, so that cdf[0] is already the probability of value 0). In the last formulation there, it should be clear that the -((1 << rate) - 1) offset is just to ensure that the change is non-zero so long as cdf[i] > 0 (after subtracting off the probability floor). That guarantees that cdf[i] can be reduced all the way down to that floor. In the other direction, no offset is required, since a negative change won't be reduced to 0 by the right shift.

