
Thread: Arithmetic Coding : Underflow problem

  1. #1 Jean-Marie Barone (Versailles, France)

    Arithmetic Coding : Underflow problem

    Hello to all,

    I am trying my hand at Arithmetic Coding after reading Mark Nelson's tutorial, and I have written from scratch 2 functions, SMAC_compress & SMAC_decompress, in Assembly language (see SMAC.asm attached if you're interested).

    For a start, I'm beginning with a Static order-0 model, and I'm working at byte level. Later, I plan to update the Model so that every time a byte is encoded, I will decrement its frequency in the dedicated table and recalculate the Upper & Lower limits in the model. But one step at a time!

    I've tested my functions with a small file of 9 KB (write.exe), and it's working well, although the compression ratio & entropy are cheesy; but I know I can't expect much from a static model.

    With a larger file, I'm facing the dreadful Underflow situation, and I'm not sure how to handle it. Concretely, I end up with the following values (hex):
    High=E70001DA Low=E6FFFB93 Range=648

    If I apply the "rule", I should get rid of the 2nd byte of High and Low, and shift the following bytes left, which would give:
    High=E701DAFF Low=E6FB9300 Range=64800 UnderflowCount=1

    And things could go on as before, except that after the next Most Significant Byte is output from High & Low, I would also output... errhh, wait, that's my question! Mark Nelson writes that I should output 00 or FF "depending on whether High and Low converged to the higher or lower value". I must admit I don't get it...

    Also, isn't there a simpler method to get rid of underflow? Something like a "Rescale" maybe, I dunno?
    Thanks in advance for your help!

    Attached Files: SMAC.asm

  2. #2 thometal
    There is an example of adjusting the range at http://en.wikipedia.org/wiki/Range_encoding

  3. #3 Jean-Marie Barone (Versailles, France)
    Thanks thometal, it's gonna be helpful!

  4. #4 thorfdbg (Stuttgart)
    Quote Originally Posted by Jean-Marie Barone View Post
    With a larger file, I'm facing the dreadful Underflow situation, and I'm not sure how to handle it. Mark Nelson writes that I should output 00 or FF "depending on whether High and Low converged to the higher or lower value". I must admit I don't get it...
    The trick is basically to remove intermediate bits in the underflow case (i.e. while the coding interval still crosses the 1/2 point and it remains undecided whether it will end up in the upper or the lower half) and to re-insert them once the decision for one of the halves is made. Let's stick to binary notation at this point and write down the lower and upper bounds of the coding interval. Let's say we have:

        lower = 0.0111110101
        upper = 0.1000001011

    At this point, you cannot "renormalize" by writing out the topmost bit of the fraction of upper and lower, since they are still different, i.e. the coding interval crosses 1/2. What you *can* do, however, is to remove all bits in between, i.e. the chain of 1's in lower and the chain of 0's in upper, and count how many you removed. That is, you could replace the above coding interval by:

        lower = 0.011110101
        upper = 0.100001011

    by striking the second bit after the binary point, and increment a counter recording that one bit is missing. This is mathematically nothing but stretching the coding interval by a factor of two around 1/2 (think about this!) and incrementing a counter. Repeat until you cannot continue to remove common bits:

        lower = 0.00101
        upper = 0.11011

    (here counter = 5, five bits removed). Now continue to code bits. At some point, the coding interval will fall entirely below 1/2 or entirely above it. If lower rises above 1/2, a carry has rippled from the lowest to the highest bit position, so all the "remembered" bits you struck before must now be zero. If instead upper falls below 1/2, no carry occurred, and all the stored bits must still be one. Thus, the "renormalization" algorithm looks like this:

        if (upper < 1/2 and lower < 1/2) or (upper >= 1/2 and lower >= 1/2):
            if upper < 1/2:
                write a 0 bit
                write counter 1 bits      (no carry: the struck bits are still ones)
            else:
                write a 1 bit
                write counter 0 bits      (a carry turned the struck bits into zeros)
            set counter to zero
            stretch the interval around the chosen half

    It's only a minor modification, but the trick is that you can also remove intermediate bits in the "underflow" case; you just need to remember how many you removed.
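
    To make this concrete, here is a minimal C++ sketch of the same scheme, the classic "pending bits" renormalization. It assumes 32-bit low/high registers and a hypothetical put_bit output routine; it illustrates the description above and is not code from SMAC or any particular coder:

        #include <cstdint>

        void put_bit(int bit);                 // hypothetical bit sink

        const uint32_t HALF    = 0x80000000u;  // the 1/2 point of the interval
        const uint32_t QUARTER = 0x40000000u;  // the 1/4 point

        uint32_t low = 0, high = 0xFFFFFFFFu;
        uint32_t pending = 0;                  // number of struck intermediate bits

        // Emit one decided bit, then the struck bits (all opposite to it).
        void emit(int bit) {
            put_bit(bit);
            for (; pending > 0; --pending)
                put_bit(!bit);
        }

        // Renormalize after narrowing [low, high] for a symbol.
        void renormalize() {
            for (;;) {
                if (high < HALF) {                        // interval entirely below 1/2
                    emit(0);                              // no carry: pending bits are 1s
                } else if (low >= HALF) {                 // interval entirely above 1/2
                    emit(1);                              // carry: pending bits are 0s
                    low -= HALF; high -= HALF;
                } else if (low >= QUARTER && high < HALF + QUARTER) {
                    ++pending;                            // strike one intermediate bit
                    low -= QUARTER; high -= QUARTER;      // stretch around 1/2
                } else {
                    break;                                // interval is wide enough
                }
                low <<= 1;                                // stretch by a factor of two,
                high = (high << 1) | 1;                   // keeping implied trailing 1s
            }
        }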

  5. #5 Jean-Marie Barone (Versailles, France)
    Thanks thorfdbg: things are much clearer in binary notation indeed.

  6. #6 Matt Mahoney (Expert, Melbourne, Florida, USA)
    There are two other ways to handle this problem:
    In PPMd, throw away part of the range so that it does not cross a boundary.
    In PAQ and ZPAQ, code one bit at a time and you never have the problem in the first place.
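
    For the PPMd-style trick, the normalization loop of Dmitry Subbotin's well-known carry-less range coder shows the idea; the sketch below is along those lines (the constants and the fputc output are illustrative, not PPMd's actual source):

        #include <cstdint>
        #include <cstdio>

        const uint32_t TOP = 1u << 24;       // byte-boundary threshold
        const uint32_t BOT = 1u << 16;       // minimum allowed range

        uint32_t low = 0, range = 0xFFFFFFFFu;

        // Ship settled bytes; on underflow, truncate the range so it no
        // longer crosses a boundary (throwing part of the range away).
        void normalize(std::FILE* out) {
            for (;;) {
                if ((low ^ (low + range)) < TOP) {
                    // top byte of low and low+range agree: it is settled
                } else if (range < BOT) {
                    // underflow: cut the range at the next 64K boundary
                    range = (0u - low) & (BOT - 1);
                } else {
                    break;
                }
                std::fputc(low >> 24, out);  // emit the top byte
                low <<= 8;
                range <<= 8;
            }
        }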

  7. #7 Jean-Marie Barone (Versailles, France)
    Quote Originally Posted by Matt Mahoney View Post
    In PPMd, throw away part of the range so that it does not cross a boundary.
    That sounds rather radical. And I love it!
    I'm gonna test both methods: that quick one, and the "stricter" one keeping an Underflow count and getting rid of the 2nd byte.
    Then I'll make up my mind.
    Quote Originally Posted by Matt Mahoney View Post
    In PAQ and ZPAQ, code one bit at a time and you never have the problem in the first place.
    I'm experimenting at byte level for a start, but that's good to know. Thanks, Matt.

  8. #8 thorfdbg (Stuttgart)
    Quote Originally Posted by Matt Mahoney View Post
    Two other ways to handle this problem. In PPMd, throw away part of the range so that it does not cross a boundary. In PAQ and ZPAQ, code one bit at a time and you never have the problem in the first place.
    I understand the first solution, but what do you mean by the second? Most AC coders I know code one binary decision at a time.

  9. #9 Matt Mahoney (Expert, Melbourne, Florida, USA)
    A bitwise arithmetic coder stores a range as (low, high) with implied 0 bits after low and implied 1 bits after high. The initial range is (0, 0xffffffff). At all times you have low < high. To split the range, you find mid = low + p*(high-low) where p is the next bit probability. Due to rounding, you have low <= mid < high. Then split the range into either (low, mid) or (mid+1, high), then shift out any common leading bytes. Shift in 0 bits into low and 1 bits into high. In the worst case the new range is low == high, in which case you shift out 4 bytes and your new range is (0, 0xffffffff).

    The actual computation of mid in ZPAQ is like mid = low + ((uint64_t)p*(high-low) >> 16), where p is the probability in the range 0..65535 that the next bit is a 1. It encodes a 1 in the lower half and a 0 in the upper half. PAQ uses 12 bits for p (0..4095), which is sufficient except for very highly compressible data. Some older PAQ versions also split the multiplication into two parts to avoid 64-bit arithmetic, like this:

    mid = low + ((high-low) >> 12)*p + (((high-low) & 4095)*p >> 12);

    You could probably omit the second part for slightly better speed with slightly worse compression, but that's against the PAQ philosophy.

    There is also a complication in ZPAQ that low is never allowed to be 0 because it uses 4 zero bytes to mark the end of compressed data so it can be found quickly without decompressing. During normalization, if the low 24 bits of low are 0, then it shifts in 00000001 instead of 00000000.
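
    Putting this description together, one encoder step might look like the following C++ sketch (a paraphrase of the PAQ/ZPAQ approach with a hypothetical put_byte sink; it is not ZPAQ's actual source, and it omits the end-of-data marker described above):

        #include <cstdint>

        void put_byte(uint8_t b);            // hypothetical byte sink

        uint32_t low = 0, high = 0xFFFFFFFFu;

        // Encode one bit, where p in 0..65535 is the probability that bit == 1.
        // A 1 is coded in the lower half, a 0 in the upper half.
        void encode_bit(int bit, uint16_t p) {
            uint32_t mid = low + (uint32_t)(((uint64_t)p * (high - low)) >> 16);
            // rounding guarantees low <= mid < high
            if (bit) high = mid;             // 1 -> (low, mid)
            else     low  = mid + 1;         // 0 -> (mid+1, high)
            while ((low ^ high) < (1u << 24)) {  // top bytes equal: shift them out
                put_byte(low >> 24);
                low  <<= 8;                  // shift in 0 bits
                high  = (high << 8) | 0xFF;  // shift in 1 bits
            }
        }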


  10. #10 biject.bwts (Texas)
    Quote Originally Posted by Jean-Marie Barone View Post
    [the original post, quoted in full]

    I am not sure that I should reply, but Mark Nelson has always been a hero and inspiration to me. I modeled my bijective arithmetic compressor after studying his code. It also does arithmetic compression based on the 256 ASCII codes. However, the bytes are broken down into a tree of 256 leaves with 255 binary cells, so that you code 8 binary cells for each ASCII character. Another major difference is how the end of file is handled, so that the compression is completely bijective: any file can be thought of as either a compressed file or an uncompressed file. The code may be a little harder to follow than Mark's, but you might want to look at it. It's ARB255.ZIP; the source code is there. The executables are too old to run, so you may have to make minor changes to get it to compile and run, mostly getting rid of writes where the format is not compatible with today's compilers.
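
    The byte-to-bit-tree decomposition described here (256 leaves, 255 binary cells) is easy to sketch. Assuming a hypothetical code_bit(bit, node) that codes one binary decision using the adaptive statistics of a given tree node, it might look like:

        #include <cstdint>

        void code_bit(int bit, int node);    // hypothetical binary coder call

        // Code a byte as 8 binary decisions, walking a tree whose 255
        // internal nodes (indices 1..255) each hold their own statistics;
        // the 256 leaves correspond to the byte values.
        void code_byte(uint8_t byte) {
            int node = 1;                    // start at the root
            for (int i = 7; i >= 0; --i) {
                int bit = (byte >> i) & 1;   // most significant bit first
                code_bit(bit, node);
                node = node * 2 + bit;       // descend to the chosen child
            }
            // on exit, node - 256 == byte
        }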

  11. #11 thorfdbg (Stuttgart)
    Quote Originally Posted by Matt Mahoney View Post
    A bitwise arithmetic coder stores a range as (low, high) with implied 0 bits after low and implied 1 bits after high. The initial range is (0, 0xffffffff). At all times you have low < high. To split the range, you find mid = low + p*(high-low) where p is the next bit probability. Due to rounding, you have low <= mid < high.

  12. #12 Jean-Marie Barone (Versailles, France)

    SMAC

    Thank you all, guys.
    I managed to overcome my Underflow issue (and even an Overflow one).
    My little proggy is sluggish and compresses poorly, but at last it seems to work! Still, you should not test it with files larger than 100 MB, unless you're very patient or you have a NASA computer.

    I am a bit disappointed though; I thought that updating the Table of Frequencies and the Model ranges after each encoding would have improved the compression significantly. I only gain a few kilobytes on average-sized files, and of course it slows down the program considerably.

    I guess it's time now to tackle more serious methods, with High-order modeling...
    Attached Files

