
Thread: SMAC : SMall Arithmetic Coding

  1. #1
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC : SMall Arithmetic Coding

    This is a simple PPM compressor using an Order-1 adaptive model.
    While it is no match for other bitwise arithmetic coders, it could be useful for educational purposes.
    Source code is for TASM 5, but I guess it can be easily adapted for MASM or others.
    I'll try to improve the compression ratio by using an Order-2 model.
    Thanks go to Mark Nelson for his tutorial on arithmetic coding, and to all the members of this forum who helped me overcome my Underflow problem.
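    For readers studying this, a generic order-1 adaptive model looks roughly like the C sketch below (an illustration of the concept only, not the TASM source): one frequency table per preceding byte, updated after every symbol so the statistics adapt as the file is coded.
    Code:
    #include <stdint.h>

    // counts[prev][sym]: how often 'sym' followed 'prev'; initialized to 1 so
    // every symbol keeps a nonzero probability
    static uint32_t counts[256][256];

    void initModel(void) {
        for (int p = 0; p < 256; p++)
            for (int s = 0; s < 256; s++)
                counts[p][s] = 1;
    }

    // P(sym | prev) in 16-bit fixed point, as fed to an arithmetic coder
    uint32_t prob(uint8_t prev, uint8_t sym) {
        uint64_t total = 0;
        for (int s = 0; s < 256; s++) total += counts[prev][s];
        return (uint32_t)(((uint64_t)counts[prev][sym] << 16) / total);
    }

    void updateModel(uint8_t prev, uint8_t sym) {
        counts[prev][sym]++;   // adapt after coding each symbol
    }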

    Last edited by Jean-Marie Barone; 26th February 2013 at 12:18.

  2. #2
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    This version now uses an Order-2 model, which slightly improves the compression ratio, except with very small files, where the Order-1 model remains better.
    Now let's try mixing several models...
    Attached Files

  3. #3
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Quote Originally Posted by Jean-Marie Barone View Post
    This version now uses an Order-2 model, which slightly improves the compression ratio, except with very small files, where the Order-1 model remains better.
    Now let's try mixing several models...
    As soon as I can, I'll put your program in my WCC benchmark. But the thing that most interests me is that you're a great connoisseur of ASM: I wonder why you don't create something simpler and quicker, maybe an LZ77.
    Hi!

  4. #4
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Quote Originally Posted by Nania Francesco View Post
    As soon as I can, I'll put your program in my WCC benchmark. But the thing that most interests me is that you're a great connoisseur of ASM: I wonder why you don't create something simpler and quicker, maybe an LZ77.
    Well, there are already plenty of LZ77 programs, so I would feel like I was reinventing the wheel. I want to explore some Terra Incognita!
    Last edited by Jean-Marie Barone; 15th November 2012 at 02:55. Reason: Bug with IE9 ? message did not appear

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Just for historical fun, here's my compressor from December 2000:
    http://nishi.dreamhosters.com/u/dosppm_v0.rar
    It's for DOS/DPMI, so it won't work on x64.

    Also, I had to cripple it a little here to make it easier to understand,
    so there's an extra ~1kb of uncompressed data at the start of compressed files,
    and decompression won't work for files containing 0xFF symbols.
    The reason is that originally it was made to work with 32-bit symbols,
    and that table contains the positions of their first occurrences - it was stored
    in a separate file and then compressed with another specialized coder.

  6. #6
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Quote Originally Posted by Shelwien View Post
    Just for historical fun, here's my compressor from December 2000:
    http://nishi.dreamhosters.com/u/dosppm_v0.rar
    It's for DOS/DPMI, so it won't work on x64.

    Also, I had to cripple it a little here to make it easier to understand,
    so there's an extra ~1kb of uncompressed data at the start of compressed files,
    and decompression won't work for files containing 0xFF symbols.
    The reason is that originally it was made to work with 32-bit symbols,
    and that table contains the positions of their first occurrences - it was stored
    in a separate file and then compressed with another specialized coder.
    Nice! It looks like you were mixing (and weighting) your Tables of Frequencies. Was it to avoid the use of Escape codes?

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

  8. #8
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Thanks Shelwien. I am a bit reluctant to use Escape codes to merge several models, so I might try this method.

  9. #9
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    Model Mixing

    This is an experimental version using Model Mixing, or more properly Model Sharing, since I have tested an unusual method for combining several models during encoding/decoding, which differs from the classical Escape Codes or Blending methods.
    The original idea comes from Mark Nelson. At the very end of his tutorial on Arithmetic Coding, he stated:
    "When compressing using order-3 statistics, early portions of the text input generate a lot of escapes while the statistics tables fill up. It ought to be possible to start encoding using order-0 statistics, while keeping order-3 statistics. As the table fills up, the order used for encoding could be incremented until it reaches the maximum."

    The method I have worked out uses 3 models, from Order 0 to 2. It starts encoding the file with Model 0. When the file index reaches some point, it will jump to Model 1, and eventually to Model 2.



    During the R0 cycle, the statistics of all 3 models are updated. During the R1 cycle, only Models 1 & 2 are updated, since we won't use Model 0 anymore. Likewise, during the R2 cycle, only Model 2 frequencies are updated.
    The difficulty in this method is assessing the 2 thresholds (noted x1 and x2 in my diagram) where we must escalate the Order range. I initially conducted experiments with 2 models (Order-1 & Order-2) on a sample of files of various sizes. For each file, I tested which R1 interval gave the best compression, and translated it into a percentage.



    Obviously, a percentage is pretty useless in binary arithmetic. My first idea was to use a "binary percentage" with a denominator of 128, but I figured out that using 2^32 would let me work out a generic formula.
    We can notice that the bigger the file, the smaller the optimal R1 proportion; it roughly follows a hyperbola, as shown below:



    I have noticed that adding the binary logarithm of the file size to the binary logarithm of the optimal R1 gives approximately 50. Hence, I have deduced that:
    Log2 R1 = 50 - Log2 FileSize
    From there, we calculate 2^(Log2 R1) to obtain the value of R1. Finally, multiplying the file size by R1 and dividing by 2^32 gives us the threshold where the Order must be increased (x2).
    To calculate x1, I have been quite lazy and simply assumed that R0 = R1/64.
    This formula is a first draft, and a better one must certainly exist. Also, I have rounded the binary logarithm to the nearest integer in my code: using all the decimals could improve the compression ratio.
    All in all, this method gives poor results compared to classical ones, but is certainly faster to carry out.
    I share it with you in case it gives someone an idea for a new method of mixing models...
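    As a sketch, the whole threshold computation reads like this in C (my transcription of the formula above, not the actual TASM source; it assumes file sizes below 2^50 and uses a floor log instead of round-to-nearest):
    Code:
    #include <stdint.h>

    static uint32_t ilog2(uint64_t x) {   // floor of the binary logarithm
        uint32_t r = 0;
        while (x >>= 1) r++;
        return r;
    }

    // returns the switch points x1 (Model 0 -> 1) and x2 (Model 1 -> 2)
    void thresholds(uint64_t fileSize, uint64_t *x1, uint64_t *x2) {
        uint64_t r1 = 1ULL << (50 - ilog2(fileSize)); // Log2 R1 = 50 - Log2 FileSize
        uint64_t r0 = r1 / 64;                        // R0 = R1/64, as above
        *x2 = (fileSize * r1) >> 32;                  // FileSize * R1 / 2^32
        *x1 = (fileSize * r0) >> 32;                  // FileSize * R0 / 2^32
    }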
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:20. Reason: Bug with executable flag

  10. #10
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC 1.6 : Order-4, "Nibblewise"

    This version is a simple Order-4 arithmetic coder/decoder. It is no longer bytewise, as hash tables became too bulky in that mode and generated too many collisions.
    Actually, it is "Nibblewise" (a nibble is 4 bits). I don't know whether this has been tried before.
    Order-4 is suited for large files; with small files, a lower order is likely to give better results.
    A tiny E8/E9 filter is also included to reduce the size of Exe & Dll files.
    (Open)source code is for TASM 5.
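    For reference, here is a generic sketch of this kind of E8/E9 transform (the usual technique, not SMAC's actual filter): after an E8 (CALL rel32) or E9 (JMP rel32) opcode, the relative target is replaced by an absolute one, so repeated calls to the same routine become repeated byte patterns; decoding applies the inverse.
    Code:
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    // forward = 1 before compression, 0 to undo after decompression
    void e8e9(uint8_t *buf, size_t size, int forward) {
        for (size_t i = 0; i + 5 <= size; i++) {
            if (buf[i] == 0xE8 || buf[i] == 0xE9) {
                int32_t addr;
                memcpy(&addr, buf + i + 1, 4);    // the rel32 operand
                int32_t next = (int32_t)(i + 5);  // address after the instruction
                addr = forward ? addr + next : addr - next;
                memcpy(buf + i + 1, &addr, 4);
                i += 4;                           // skip the operand bytes
            }
        }
    }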
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:22.

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    "smac /c enwik9" crashes (maybe out of memory?) enwik8 compresses to 30,200,308 bytes in 54 sec, decompresses in 55 sec, verifies OK, uses 1178 MB memory. Tested on a 2.0 GHz T3200, 3 GB, 2 cores, Windows Vista.

  12. #12
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Out of Memory, indeed... SMAC is very memory-consuming and won't handle such a big file. Until I find a solution, this update will display a proper error message instead of crashing.
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:23.

  13. #13
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC 1.8 : Bitwise, Order-4

    I've switched from nibblewise to bitwise with this version. Overall, it seems to give better results, but it is much, much slower than the previous nibblewise release.
    I'll try to speed it up in the next version, which will feature Context Mixing (hopefully).
    SMAC can now handle very large files by the way.
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:24. Reason: Bug with executable files

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Last edited by Matt Mahoney; 23rd January 2013 at 18:21.

  15. #15
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC 1.9 : Bitwise, Context-mixing, Order-6 & Order-4

    Properly speaking, it is Context-choosing rather than Context-mixing, since my experiments with Linear Mixing gave poor results. Anyway, the method used in this version picks the better prediction of the Order-4 and Order-6 models at each encoding/decoding step. To do so, I calculate a Confidence value for each Order, as described by Matt in an older post: "Suppose you have two models that predict that the next bit will be a 1 with probabilities 0.5 and 0.98. The first model doesn't know, but the second is highly confident that the next bit will be 1. Therefore the second model should get greater weight".

    The formula I'm using is Confidence = (|p0-p1| * 2^32) / (p0+p1). The reason I multiply by 2^32 is to get an integer result in a 32-bit register. It will hold a low value if the prediction is uncertain, and a high value in the opposite case. I simply choose the Order that gives the highest value.
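    In code, the selection looks roughly like this (a minimal sketch, not the actual ASM; the operands are written n0/n1 here since they are counters, and saturation is added so a one-sided count doesn't overflow 32 bits):
    Code:
    #include <stdint.h>

    // Confidence = |n0 - n1| * 2^32 / (n0 + n1), computed in 64 bits
    uint32_t confidence(uint32_t n0, uint32_t n1) {
        uint64_t diff = n0 > n1 ? (uint64_t)n0 - n1 : (uint64_t)n1 - n0;
        uint64_t sum  = (uint64_t)n0 + n1;
        if (sum == 0) return 0;                  // no statistics yet
        uint64_t c = (diff << 32) / sum;         // 0 (unsure) .. 2^32 (certain)
        return c > 0xFFFFFFFFull ? 0xFFFFFFFFu : (uint32_t)c;
    }

    // pick whichever model is more confident; >= favors Order-6 on ties
    int pickOrder(uint32_t n0_o4, uint32_t n1_o4, uint32_t n0_o6, uint32_t n1_o6) {
        return confidence(n0_o6, n1_o6) >= confidence(n0_o4, n1_o4) ? 6 : 4;
    }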

    The result with enwik8 is 26888498 bytes.
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:25.

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Nice improvement.
    http://mattmahoney.net/dc/text.html#2420
    http://mattmahoney.net/dc/silesia.html

    I assume that since p0 + p1 = 1, you just choose the model whose prediction is further away from 1/2?

  17. #17
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Yes, but I've been quite inaccurate in the description of my formula. Actually, I should have written n0 & n1 instead of p0 & p1, since those values are the occurrence counts for bits 0 & 1, not the probabilities.

    Here is an example which speaks for itself:
    Attached: confidence.png (a worked example of the Confidence formula)
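    (In case the attachment doesn't load, a made-up example along the same lines: with n0 = 12 and n1 = 4, Confidence = |12-4| * 2^32 / 16 = 2^31, so the model leans fairly strongly toward bit 0; with n0 = n1 = 8, Confidence = 0, i.e. the model knows nothing.)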

  18. #18
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC 1.10 : Bitwise, Context-choosing, Order-6 & Order-4, Nonstationary models

    • Compression ratio is slightly improved by turning the models nonstationary when updating the frequency counters, as described by Matt in the PAQ1 documentation (a short code sketch of this rule follows the Edit note below):
      "If the training bit is y (0 or 1), then increment c_y (c0 or c1).
      If c_(1-y) > 2, then set c_(1-y) = c_(1-y)/2 + 1 (rounding down if odd)."
    • A bug has been corrected in the call to the hashing function: an Order-6 context such as 00 00 xx xx xx xx was giving the same hash as the Order-4 context xx xx xx xx, creating collisions.
    • I got rid of a few erratic conditional jumps to improve the speed.

      Edit: After consulting the Handbook of Data Compression, it turns out this formula produces semi-stationary models, not nonstationary ones. A nonstationary model would be obtained by clearing c_(1-y).
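    In C, the quoted rule is roughly (a sketch of the update itself, not the actual ASM):
    Code:
    #include <stdint.h>

    // c[0], c[1]: frequency counters of one context; y: the training bit
    void updateCounters(uint32_t c[2], int y) {
        c[y]++;                          // increment the count of the seen bit
        if (c[1 - y] > 2)                // halve the opposite count;
            c[1 - y] = c[1 - y] / 2 + 1; // integer division rounds down
    }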
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:27.

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Compression is improved on LTCB and Silesia, although not all of the Silesia files.
    http://mattmahoney.net/dc/text.html#2307
    http://mattmahoney.net/dc/silesia.html

    Using non-stationary (or semi-stationary) models on stationary sources such as dickens, mr, osdb, reymont, sao, and x-ray makes compression worse. One way to solve this problem is to use indirect context models like in PAQ8 and ZPAQ (context -> bit history -> prediction).

  20. #20
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Thanks for the tip, Matt. I'll give the matter some thought.

  21. #21
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts

    SMAC 1.11: Bitwise, Context-choosing, Orders-6-4-3 to 3-2-1, Nonstationary models

    This version chooses, among 3 models, the one whose prediction is furthest from 1/2, using the Confidence formula detailed previously. When 2 models happen to return the same Confidence value, the highest-order model is favored. My tests show this gives better compression with large files like enwik8 (OTOH, favoring the lowest order seems to give better results with small files).

    SMAC now uses:
    - Orders 6, 4 and 3 if the file size is greater than 5 MB
    - Orders 4, 3 and 2 if the file size is between 2 MB and 5 MB
    - Orders 3, 2 and 1 if the file size is below 2 MB

    In previous versions, I had used hash table records 8 bytes long, which was way too much (4 bytes for the bit-0 count, 4 bytes for the bit-1 count). They have been reduced to 4 bytes, which has a good impact on the compression ratio since more records fit in the memory buffer, hence fewer collisions.
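    One plausible packing for such a 4-byte record (a hypothetical layout for illustration; the real one is in the TASM source): two 16-bit counts in one dword, halved together when either would overflow.
    Code:
    #include <stdint.h>

    typedef uint32_t Record;  // hypothetical: low 16 bits = n0, high 16 bits = n1

    static uint32_t getN0(Record r) { return r & 0xFFFF; }
    static uint32_t getN1(Record r) { return r >> 16; }

    static Record bump(Record r, int bit) {
        uint32_t n0 = getN0(r) + (bit == 0);
        uint32_t n1 = getN1(r) + (bit == 1);
        if (n0 > 0xFFFF || n1 > 0xFFFF) { n0 >>= 1; n1 >>= 1; } // rescale
        return (n1 << 16) | n0;
    }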

    Result on enwik8 is 25633348, and 223294431 on enwik9.
    Matt, I've read your documentation regarding Indirect Context Models, and I find it hard to grasp the concept of Bit History. From what I understand, you store in one byte the counts of bits 0 & 1 of a context, plus the last seen bit. Yet I can't see how the prediction is calculated, nor how it could be more efficient than a direct model.
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:27. Reason: Bug with executable files

  22. #22
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Matt (and many other people, including me sometimes) uses a state table, where each state describes an approximate proportion of recent 0 and 1 bits, or their history. As a simplification, I suggest playing with bit histories first.

    So instead of having a pair of counts for each hashed context, you would store a bit history. The bit history can e.g. store up to 7 bits of history + a leading 1 bit, so:
    - empty bit history would be 1,
    - after bit 0 it would be 10,
    - after bit 1 it would be 11,
    - after bits 00101010 it would be 10101010,
    - update function looks like:
    Code:
    #include <stdint.h>

    // shift in the next bit, keep 8 bits, and keep bit 7 set once the leading 1
    // has reached it (so the history saturates at 7 bits)
    uint32_t returnUpdatedBitHistory(uint32_t const bitHistory,
            uint32_t const nextBit) {
        return ((bitHistory << 1) & 255) | nextBit | (bitHistory & 128);
    }

    With that you have 256 possible states, so you make an additional 256-record table for each coding order (i.e. you have 3 models, so you make 3 tables). Each record is a pair of counters.


    So to recap, instead of having:
    Code:
    int table[RecordsNumber * 2]; // pairs of bit counts, initialized to 0
    int getAndUpdatePrediction(int hashedContext, int modelNumber, bool nextBit) {
      int * record = table + hashedContext * 2; // counts for bit 0 and bit 1
      // P(bit 0) in 12-bit fixed point; the +1/+2 avoid a division by zero
      int prediction = (record[0] + 1) * 4096 / (record[0] + record[1] + 2);
      record[nextBit]++; // you need rescaling here to avoid overflows of course
      return prediction;
    }
    You have:
    Code:
    typedef unsigned char byte;
    byte contexts[RecordsNumber]; // initialized to 1, which means empty bit history
    // separate stateMap for each order, holding a pair of counts for each of the
    // 255 possible truncated bit histories
    int stateMap[ModelsNumber][256 * 2];
    int getAndUpdatePrediction(int hashedContext, int modelNumber, bool nextBit) {
      byte bitHistory = contexts[hashedContext];
      contexts[hashedContext] = returnUpdatedBitHistory(bitHistory, nextBit);
      int * record = stateMap[modelNumber] + bitHistory * 2; // there are pairs
      // P(bit 0) in 12-bit fixed point; the +1/+2 avoid a division by zero
      int prediction = (record[0] + 1) * 4096 / (record[0] + record[1] + 2);
      record[nextBit]++; // you need rescaling here to avoid overflows of course
      return prediction;
    }
    You need to separate retrieving and updating predictions of course.


    FSMs (like Matt's) are drop-in replacements for direct bit histories, so after you get direct bit histories to work, you can move to FSMs.


    Next, you can replace the pairs of counts with pairs of (prediction, count), like in class StateMap from lpaq1.zip - but only once you manage to get indirect contexts working.
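    A rough sketch of that (prediction, count) idea (inspired by lpaq1's StateMap, not copied from it): the prediction moves toward each observed bit with a step that shrinks as the count grows, so it adapts fast at first and stabilizes later.
    Code:
    #include <stdint.h>

    typedef struct { int32_t p; uint32_t n; } PredCount; // p: P(1), 16-bit fixed point

    void initPredCount(PredCount *s) { s->p = 1 << 15; s->n = 0; } // start at 1/2

    void updatePredCount(PredCount *s, int y) {  // y: the observed bit
        if (s->n < 1023) s->n++;                 // cap so adaptation never stops
        s->p += ((y << 16) - s->p) / (int32_t)(s->n + 1); // step ~ 1/(n+1)
    }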

  23. #23
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    That was an exhaustive answer: thanks a bunch, Piotr, I got it!

  24. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Nice improvement. http://mattmahoney.net/dc/text.html#2232

    Edit: smac failed on ooffice in the Silesia corpus. Decompressed file is not identical. All other files were OK.
    Last edited by Matt Mahoney; 25th February 2013 at 04:40.

  25. #25
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    That's weird. I have tested again, and I see no differences (checked with FC and WinHex). CRCs are identical too.
    Can you carry out a second test, please?
    I had modified the way executable files are detected in this version, but it seems to run fine.


  26. #26
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    The decompressed ooffice file is not identical for me either.
    Attached: Clipboard.png (screenshot of the comparison)

  27. #27
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    OK, thanks to Jan's screenshot I can see that this bug is clearly related to the E8/E9 filter. I suspect the ReadFile function is quirky with the nNumberOfBytesToRead parameter: from what I have read, the function can fail on some configurations if the number of bytes to read is set to a value greater than the file size. This update should hopefully solve the bug. If not, I'll have no choice but to discard the E8/E9 filter: since I can't reproduce the bug on my system (Windows 7 SP1, 32-bit), it will be hard for me to get to the bottom of it...
    Last edited by Jean-Marie Barone; 26th February 2013 at 12:28. Reason: Bug with executable files

  28. #28
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    Still not working here

    EDIT: I just noticed what is wrong with your program. It modifies the source file during compression... The decompressed file is identical to the original; the comparison fails because the original was corrupted!
    Attached: Clipboard.png (screenshot)
    Last edited by Jan Ondrus; 25th February 2013 at 16:47.

  29. #29
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Quote Originally Posted by Jan Ondrus View Post
    EDIT: I just noticed what is wrong with your program. It modifies source file during compression.... Decompressed file is identical to original. Comparison fails because original was corrupted!
    Indeed, looks like some unintentional copy-paste error:

    Code:
    	call	E8E9Defilter
    	call	SetFilePointer,TargetHandle,0,0,0
    	pop	ecx
    	call	WriteFile,TargetHandle,DestBuffer,ecx,offset Dummy,NULL
    This one is okay, applying the reverse E8E9, modifying DestBuffer and writing.

    Code:
    	call	E8E9Filter
    	call	SetFilePointer,SourceHandle,0,0,0
    	pop	ecx
    	call	WriteFile,SourceHandle,SourceBuffer,ecx,offset Dummy,NULL
    This is not okay: WriteFile is modifying the source file. Usually I would just recommend removing the WriteFile call, but it seems SMAC_Compress wouldn't work without it, at least for files >256 MB, because the E8E9Filter modifications would get lost.

    The best solution seems to be moving the E8E9 filtering inside SMAC_Compress, so it operates on the 256 MB chunks read there.
    Last edited by schnaader; 25th February 2013 at 17:25.
    http://schnaader.info
    Damn kids. They're all alike.

  30. #30
    Member Jean-Marie Barone's Avatar
    Join Date
    Oct 2012
    Location
    Versailles, France
    Posts
    60
    Thanks
    15
    Thanked 4 Times in 3 Posts
    Yep, that is the only solution I can see for the moment.
    I shouldn't have modified the original file with my E8/E9 filter; that was a big mistake...
    I'll think about whether I can find a better solution. Thanks for finding my bug, guys (the advantage of open source).


