# Thread: Simple Variable Order Rangecoder

1. In ZPAQ I changed the range and precision of probabilities, stretched probabilities, and mixer weights because I found cases where they hurt compression in PAQ. The changes I made were as follows:

Probabilities. In PAQ they were 12 bits. In ZPAQ, 15 bits (odd multiples of 2^-16). This improved compression of highly redundant data.

Stretched probabilities. In PAQ they were 12 bits with 8 bits after the decimal point (-8 to 8 in steps of 1/256). In ZPAQ they were 12 bits with 6 bits after the decimal point (-32 to 32 in steps of 1/64). This improved compression when one model had a high degree of certainty. The extra 2 bits of precision did not seem to make any difference. Quantization errors in probability estimates tend to expand the data proportionally to the square of the error. A 1% error only expands the data by 0.01%.

Mixer weights. In PAQ they were 16 bits in the range 0 to 1. In ZPAQ they are 20 bits in the range -8 to 8 in steps of 2^-16. I wanted to keep them at 16 bits so I could implement mixers using SSE2 assembler like in PAQ, but cutting either the range or precision hurt compression. When I implemented JIT in ZPAQ, I was able to improve the speed of predict() using SSE2 but not update(). I wrote update() both ways and the scalar form was faster.

2. Thanks for the explanations. I've upgraded the mixer to use this adaptive weighted model, although the weights are bounded from 0.0 to 1.0 and it can accept more than 2 inputs. All values are stored as integers so the weights and probabilities are 32 bits in length. I'm working on testing the new mixer and replacing the old one, as well as getting logistic functions to work without the CRT. I've also added an SSE model, where the high order bits of the probability are used as the context for a secondary model to output a new count pair. During my experimentation with SSE, it seems that chaining SSE models greatly increases learning time and therefore hurts compression. The model that I'm working on with SSE uses a single order 16 SSE model.

3. I just tested the new mixer, and it's giving strange results on enwik9:

Code:
```
Old Mixer With SSE      50m 15s      260,009,661 bytes
New Mixer With SSE      67m 20s      259,954,897 bytes
```
It gives very little gain in compression ratio and has a noticeable impact on speed, so the cost seems to outweigh the gain with the new mixer. The parameters that I used were a learning rate of 0x1000 and a weight between 0x0000000 and 0xFFFFFFF. If a weight goes out of range, the range is rescaled. I'm thinking of releasing the old mixer for now while I work on the new one.

4. Okay, I have another release of RangeCoderC. Version 1.6 adds the CM model, which mixes the Hashed and Double models together using linear mixing. It also has optional SSE, which uses an order 16 SSE model to refine predictions. I have included in the source code a beta version of a new mixer and a faster indirect model.

Also, since I'm now working with CM components, I probably won't be posting updates as often. I'm still learning about CM theory and it will take time for me to get better. If only I could adjust my learning rate...

5. Divisions hurt speed badly; try to avoid them by using a table or shifts.

6. Thanks for the suggestion. I fixed a few of the divisions and multiplications that could be handled with shifts. The remaining problem is doing fast multiplication of an integer by a fraction. From what I can figure out, if R=A*(B/C) where R is the resulting value, this can be rewritten as R=(A*K*B)/(C*K). If C*K were a power of 2, it could be converted to a bit shift. So I get R=(A*K*B)>>N, where K=(1<<N)/C. Computing K is just dividing a power of 2 by C, which could probably be done via a lookup table. This has a couple of catches though: A*K*B is often too large to fit into 64 bits, and the precision suffers for low values of K. If I make any progress on this I'll integrate it into the code.
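The reciprocal trick above can be sketched like this (the function name `recip_mul` and the shift N=16 are illustrative assumptions, not RangeCoderC code; in practice K would come from a table indexed by C):

```c
#include <assert.h>
#include <stdint.h>

/* Replace R = A*(B/C) with R = (A*B*K) >> N where K = (1<<N)/C.
   The division by C happens once when K is computed (e.g. from a
   lookup table), not on every call. N trades precision for
   headroom: A*B*K must still fit in 64 bits. */
static uint64_t recip_mul(uint64_t a, uint64_t b, uint64_t c) {
    const int N = 16;              /* fixed-point shift (assumption) */
    uint64_t k = (1ull << N) / c;  /* would normally come from a table */
    return (a * b * k) >> N;
}
```

The result is exact when c is a power of 2 and only approximate otherwise, with the error growing as k gets small, which matches the precision catch described above.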

The main cause of slowdowns seems to be memory access. Accessing memory across a large range significantly slows the models down. As soon as the memory usage for a model goes above 64kb, slowdowns begin occurring during memory access. A model that encodes at 3.2mb/s using 64kb of memory can slow down to 480kb/s when updating values across a 1gb memory range. Unlike the division issue, it's probably not possible to speed up memory access programmatically.

7. > not possible to speed up memory access programmatically

It's possible by taking the cache specifics into account.
The CPU always reads whole cache lines from memory, so reads are only slow when the program touches different cache lines.
Thus, speed can frequently be improved by rearranging data structures for compact storage of sequentially accessed fields.

There's another important thing though: the worst case is when there are many accesses to the same line offsets (32/64 bytes)
in different pages (4096 bytes). This is fairly common in CMs too, for example when there are static counter tables
for orders 0-2.
For some reason, caches are designed in such a way that only 2-4 lines with the same address bits 6/7..11 can be
stored in cache at once.

8. Look at the ZPAQ design for the CM, ICM, and ISSE components. They are designed to minimize cache misses. The layout is optimized for 64 byte cache lines, but it also helps for 32 byte lines.

CM maps a context to a bit prediction using a table of 22 bit predictions and 10 bit counts packed into 32 bits. The table is aligned on a 64 byte address. A context or context hash is computed on byte boundaries. Then for each bit following the boundary, the context is XORed with one of the following 9 bit values, where each x stands for a previously seen bit of the current byte:

Code:
```
000000001 <-- cache miss
00000001x
0000001xx
000001xxx
1xxxx0001 <-- cache miss
1xxxx001x
1xxxx01xx
1xxxx1xxx
```

So there are 2 cache misses per byte instead of 8 if you computed a new hash for each bit.
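The expansion table above can be sketched as follows (my reading of the description, not the exact libzpaq code; `expand_ctx`, `bitpos`, and `partial` are illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* h is the context hash computed at the byte boundary; bitpos is
   how many bits of the current byte have been seen (0..7) and
   partial holds those bits (first bit in bit 0). The 9-bit value
   XORed in only perturbs the low bits of h, so all 8 table lookups
   for one byte land near each other, giving 2 cache misses per
   byte instead of 8. */
static uint32_t expand_ctx(uint32_t h, int bitpos, uint32_t partial) {
    uint32_t e;
    if (bitpos < 4)
        e = (1u << bitpos) | partial;              /* rows 000000001..000001xxx */
    else
        e = 0x100u | ((partial & 0xFu) << 4)       /* first nibble kept high */
                   | (1u << (bitpos - 4))
                   | (partial >> 4);               /* rows 1xxxx0001..1xxxx1xxx */
    return h ^ e;
}
```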

ICM and ISSE map a context to a bit history, represented as an 8 bit state. Bit histories are stored in a hash table with 16 byte rows containing a hash confirmation checksum and 15 bit histories for each of the 15 possible values of the next 0 to 3 bits following a 4 bit boundary. A new hash index h is computed every 4 bits. Then it looks for a matching checksum at addresses h, h XOR 1, and h XOR 2. The hash table is also aligned on a 64 byte address, so the 3 searches stay within a cache line. So there are always only 2 cache misses per byte.
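The probe can be sketched as follows (the row layout, checksum source, and simplified replacement policy here are assumptions, not the exact libzpaq code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16-byte rows: 1 checksum byte plus 15 bit-history states.
   Probing h, h^1, h^2 only flips the low 2 bits of the index, so
   all three candidates fall inside one group of 4 rows = 64 bytes
   when the table is 64-byte aligned: at most one cache miss per
   lookup. Note a real implementation avoids a checksum of 0
   false-matching empty rows; that detail is omitted here. */
#define ROWS 4096  /* must be a power of 2 */
typedef struct { uint8_t chk; uint8_t hist[15]; } Row;
static Row table[ROWS];

static Row *find_row(uint32_t hash) {
    uint32_t h = hash & (ROWS - 1);
    uint8_t chk = (uint8_t)(hash >> 24);  /* hash confirmation checksum */
    int i;
    for (i = 0; i < 3; i++) {             /* 3 probes within one cache line */
        Row *r = &table[h ^ (uint32_t)i];
        if (r->chk == chk) return r;
    }
    /* no match: clear and claim the first slot (simplified eviction) */
    memset(&table[h], 0, sizeof(Row));
    table[h].chk = chk;
    return &table[h];
}
```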

Other components have table entries too big for address alignment to help, but array index order is still optimized for sequential access. For example, MIX and MIX2 weights have the form wt[context][m], where the weights are accessed and updated by iterating over m, but there is no attempt to avoid a cache miss with a new context each bit. A MATCH has a hash table to look for matches in a history buffer, but it is only accessed once per byte. SSE uses a table p[context][32] like a CM where it interpolates between 2 predictions in adjacent memory locations.
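The wt[context][m] layout can be sketched like this (NMODELS, the context count, and the 16-bit weight scale are illustrative, not ZPAQ's actual sizes):

```c
#include <assert.h>
#include <stdint.h>

/* All weights for one context are adjacent in memory, so the dot
   product (and the weight update, not shown) walks sequential
   addresses: at most one cache miss per context per bit. */
#define NMODELS 8
static int32_t wt[4096][NMODELS];

static void set_weights(int ctx, int32_t w) {
    int m;
    for (m = 0; m < NMODELS; m++) wt[ctx][m] = w;
}

/* st: stretched predictions from each model */
static int64_t mix(int ctx, const int32_t *st) {
    int64_t dot = 0;
    int m;
    for (m = 0; m < NMODELS; m++)
        dot += (int64_t)wt[ctx][m] * st[m];  /* sequential access over m */
    return dot >> 16;  /* weights treated as 16.16 fixed point here */
}
```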

9. RangeCoderC32.exe v1.5 c1 fails on enwik8. It produces a very large compressed file which decompresses incorrectly. I tried orders 26 and 27. All of the other modes work correctly as shown. Tested on a 2.0 GHz T3200, 32 bit Vista.

Code:
```
   33,761,529 enwik8-c0-26
   33,225,245 enwik8-c0-27
2,575,500,309 enwik8-c1-26
2,584,948,965 enwik8-c1-27
   30,934,989 enwik8-c2-26
   30,832,497 enwik8-c3-26
   30,371,685 enwik8-c3-27
   29,269,185 enwik8-c4-26
   28,461,477 enwik8-c5-26
```

10. Thanks for the suggestions. I'm posting pre-release builds of version 1.7 as a fix for the errors in 1.5 and 1.6. The complete version of 1.7 will be finished after I work on cache optimizations.

11. c1 works in v1.7a. I didn't test the other modes. I presume they are the same as v1.6. http://mattmahoney.net/dc/text.html#2600

12. They are the same, just with minor speed improvements.

13. So far the tests for using bytewise hashes aren't going well. While they are faster due to fewer cache misses, the compression ratios suffer. The causes appear to be index collisions and order reduction. XORing the most recent bits into the bytewise hash creates collisions, which can map conflicting data to the same index, resulting in prediction errors. Also, if the lower bits of the index are reserved for the most recent bits or the bit index within the byte, that reduces the number of bits of raw context in the index. This effectively reduces the order of the model compared to models with similar memory usage. It may be that I am using a bad hash for this type of indexing. To update the context hash, I use: hash = (((hash << 6) + (hash << 7)) ^ inbyte) & mask. This is the same hash that the bitwise hashed model uses.

I'm also stuck on the division problem. From the looks of it, I probably won't be able to optimize division due to the way that RangeCoderC is coded. Optimizing division involving unbounded 64 bit integers while maintaining adequate precision seems to be an increasingly difficult task. I'll need to keep fast division in mind for the next compressor.

Due to the problems that I am having with the optimizations, version 1.7 will probably only have small speed gains. Not much else will change from version 1.6. I'll release it after experimenting with the optimizations for a little while longer.

14. Use the FNV-1 hash (it is better for caching than FNV-1a). It is fast and has generally worked well for me. If you care about collisions and your hash table record has some unused bits, then you can use some of the masked-out bits to produce an additional checksum and place it in those unused bits. I have used something like that, along with simple collision resolution: if a collision was detected using the checksum, then I tried some other hash table record placed nearby (I don't know the exact algorithm for computing the offset), preferably in the same cache line. Such a collision resolution scheme implemented for TarsaLZP gave me substantial gains when the hash table was small (tens of megabytes), but the gain was rather small when the hash table was big (hundreds of megabytes), at least on enwik9.
When computing the FNV-1 hash for an n-byte context, you can get the FNV-1 hashes for the (1, ..., n-1)-byte contexts for free if you start hashing from the most recent byte (i.e. in reversed order).
I don't know if my advice is relevant though.
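The reversed-order trick above can be sketched like this (standard 32-bit FNV-1 constants; `fnv1_orders` is an illustrative name, not TarsaLZP code):

```c
#include <assert.h>
#include <stdint.h>

/* Hashing from the most recent byte backwards, every intermediate
   FNV-1 value is itself the hash of a shorter (lower-order)
   context, so one pass yields hashes for orders 1..n. */
#define FNV_PRIME 16777619u
#define FNV_BASIS 2166136261u

/* history[0] is the most recent byte, history[1] the one before it,
   and so on; out[i] receives the hash of the (i+1)-byte context. */
static void fnv1_orders(const uint8_t *history, int n, uint32_t *out) {
    uint32_t h = FNV_BASIS;
    int i;
    for (i = 0; i < n; i++) {
        h = (h * FNV_PRIME) ^ history[i];  /* FNV-1: multiply, then XOR */
        out[i] = h;
    }
}
```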

15. For ZPAQ I typically use a hash = (hash + byte + 512)*773, then for a CM I XOR with the 9 bit expanded extra bits as described in my previous post. For indirect models (ICM/ISSE), hash collisions hurt compression too much so I use a checksum to detect most collisions. Collision detection is not needed for a CM.

A CM maps a context to a 22 bit prediction and 10 bit count packed into 32 bits. I update the prediction by error/count and increment count. To avoid divisions, I store 1/count (actually 1/(count+1.5)) in a table and multiply. ICM/ISSE use a fixed learning rate of error/1024 = error>>10.
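The update above can be sketched as follows (the 22+10 bit packing matches the description; the 16-bit reciprocal scale and the saturating count are my assumptions, not libzpaq code):

```c
#include <assert.h>
#include <stdint.h>

/* Each table entry packs a 22-bit prediction and a 10-bit count
   into 32 bits. The update moves the prediction by error/count,
   but multiplies by a precomputed reciprocal 1/(count+1.5) instead
   of dividing. */
#define MAXCOUNT 1023
static uint32_t recip[MAXCOUNT + 1];

static void init_recip(void) {
    int n;
    for (n = 0; n <= MAXCOUNT; n++)
        recip[n] = (uint32_t)(65536.0 / (n + 1.5));  /* 16.16 fixed point */
}

/* t: low 10 bits = count, high 22 bits = P(bit=1) as a 22-bit value */
static uint32_t cm_update(uint32_t t, int bit) {
    uint32_t count = t & 1023u;
    int32_t pr = (int32_t)(t >> 10);
    int32_t err = (bit << 22) - pr;  /* signed prediction error */
    pr += (int32_t)(((int64_t)err * (int32_t)recip[count]) >> 16);
    if (count < MAXCOUNT) count++;   /* saturating count */
    return ((uint32_t)pr << 10) | count;
}
```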

16. I just tested the bytewise hash model using the hash and cache structure from ZPAQ. It expands most files, so I'm doing something wrong. Here is the code that I'm using for the test model:

Code:
```
typedef struct ByteBitModel {
    int order;
    UINT mask;      /* (1 << order) - 1 */
    PValPair model;
    UINT context;   /* hash at the last byte boundary */
    UINT realctx;   /* context expanded with the current partial bits */
    UINT pos;       /* bit position within the current byte */
    UINT val;       /* bits of the current byte seen so far */
} BBitModel, *PBBitModel;

PBBitModel InitBBM(int modelOrder) {
    LONG init;
    PBBitModel bm = (PBBitModel) MemAlloc(sizeof(BBitModel));
    bm->order = modelOrder;
    bm->mask = (1 << modelOrder) - 1;
    bm->context = 0;
    bm->realctx = 0;
    bm->pos = 0;
    bm->val = 0;
    bm->model = (PValPair) MemAlloc(sizeof(ValPair) * (bm->mask + 1));
    if(IsInvalidPtr(bm->model)) {
        MemFree(bm);
        return NULL;
    }
    for(init = 0; init <= bm->mask; init++) {
        bm->model[init].v0 = 1;  /* zero count */
        bm->model[init].vt = 2;  /* total count */
    }
    return bm;
}

__forceinline ValPair BBMP1(PBBitModel bm) {
    return bm->model[bm->realctx];
}

__forceinline void BBMUpdate(PBBitModel bm, BOOL bit) {
    bm->model[bm->realctx].vt++;
    if(bit == FALSE) { bm->model[bm->realctx].v0++; }
    if(bm->pos == 7) {
        /* byte boundary: fold the accumulated bits into the context hash */
        bm->context = (bm->context + bm->val + 512) * 733;
        bm->realctx = bm->context ^ 1;
        bm->pos = 0;
        bm->val = bit & 1;
    } else {
        bm->pos++;
        bm->val |= (bit & 1) << bm->pos;
        if(bm->pos < 4) {
            bm->realctx = bm->context ^ ((1 << bm->pos) | bm->val);
        } else {
            bm->realctx = bm->context ^ (0x100 | (1 << (bm->pos - 4)) |
                ((bm->val & 0xF) << 4) | (bm->val >> 4));
        }
    }
}

void DestroyBBM(PBBitModel bm) {
    MemFree(bm->model);
    MemFree(bm);
}
```

17. Could be something wrong. I have a section on context mixing in my book that might explain it better. http://mattmahoney.net/dc/dce.html#Section_43

18. Thanks for the link. I've read most of your book before, and it has been very helpful to my understanding of data compression. It turns out that the problem with the hash from ZPAQ is that it accumulates bits in the low order section of the hash. The number 733 doesn't seem to work well as a multiplier, since its low bit is set. If the low bit of the multiplier is set, then part of the hash is not shifted up, which causes the low order bits of the hash to produce patterns dependent on the data encountered so far. Therefore if two similar patterns occur in the file, they will have different hashes based on their position in the file. By changing the multiplication into a shift left by 9 bits, the compression ratio quickly improved (tested on model order 24):

Code:
```
*733              cc1.exe -> 6.5mb
<<9               cc1.exe -> 3.4mb
```
I'm going to run tests to see which multiplier will work the best for this kind of hash. If all goes well, I'll release version 1.7 today with the new cache optimized model.

19. I used 773 as a hash multiplier, not 733, although it probably doesn't matter much. Almost any odd number will work. I picked it for speed because 773 = (3<<8)+5 doesn't require any multiply instructions. But in my tests, x86 imul was just as fast.

But odd multipliers are not for rolling hashes; the hash has to be recomputed from scratch at each byte boundary. If you need to compute n hashes for orders 1 through n, though, that takes the same time as computing n rolling hashes.

For a rolling hash, I suggest the multiplier have as many trailing zeros as there are bits of entropy per byte in your source according to your model. So for text, you could try multipliers like 12 or 24 if you expect 2-3 bpc. Bigger odd numbers times 4 or 8 should work too.
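The trailing-zeros suggestion above can be sketched like this (the 16-bit window and multiplier 24 are illustrative choices):

```c
#include <assert.h>
#include <stdint.h>

/* Rolling hash with multiplier 24 = 8*3: every step shifts each
   earlier byte's influence up by at least 3 bit positions, so with
   a 16-bit mask a byte stops affecting the hash after about 6
   steps. The 3 trailing zero bits of the multiplier thus roll
   ~3 bits of context out of the window per byte, matching an
   expected ~3 bpc source. */
static uint32_t roll(uint32_t hash, uint8_t byte, uint32_t mask) {
    return ((hash * 24u) + byte) & mask;
}
```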

20. I think I finally understand; I was trying to use the ZPAQ hash as a rolling hash, and that obviously didn't work... I've now corrected that. Thanks once again for all the help; I hope that it hasn't been too much trouble. Based on the tests, the updated model has slightly lower ratios than the simple bitwise model, although it is significantly faster. I hope that it's okay to include the ZPAQ hash and cache structure in RangeCoderC. Alternatively, I have a backup hash function which works almost as well at similar speeds.

21. libzpaq is public domain code. But if you can improve on it, do it.

22. Originally Posted by Matt Mahoney:
> libzpaq is public domain code. But if you can improve on it, do it.
Maybe someday, I still have a lot to learn. ZPAQ is excellent and will be hard to improve.

I am releasing version 1.7 of RangeCoderC. It adds two new models, the Bytewise Hashed Model and the Combined Model. The Bytewise Hashed Model uses the hash and cache structure from ZPAQ to achieve high speeds, even at higher orders. The Combined Model uses the same structure as the Double Model but has a hashed context and outputs its predictions into an SSE model for better compression.

23. It looks like the two new modes make the others obsolete. The Bytewise Hashed model is the fastest and the combined model compresses smallest. http://mattmahoney.net/dc/text.html#2545

24. I'm thinking of removing some of the inefficient models at the cost of breaking compatibility with other versions. The Indirect Bitwise Model has poor compression and is slow, and the CM models have been obsoleted by the Combined Bitwise Model. With some more work I'll probably be able to reintroduce better CM models.

25. I'd remove them. It's still experimental code at this point, and if they need it they can get the old versions.

26. I've completed version 1.8. It removes all obsolete models and adds the Bitwise Adaptive Model. The Bitwise Adaptive Model uses probabilities instead of counts, which are adjusted nonlinearly for better compression on changing data. The learning speed of the model is derived from the model order.

27. Tested the bitwise adaptive model. I didn't retest the other models. http://mattmahoney.net/dc/text.html#2545

28. Thanks once again for testing RangeCoderC. For the next version, I hope to improve both speed and compression ratio. I've modified the rangecoder component so that it eliminates division in the Bitwise Adaptive Model. Also, I've been working on replacing the rangecoder with a proper arithmetic coder. I have a mostly working version finished, but there's still a rare bug that causes the decoder to output ones indefinitely. The new coder noticeably improves compression ratio without hurting speed.

29. I've finished version 1.9 of RangeCoderC. I've replaced the old rangecoder with an arithmetic coder. The new coder has better compression ratios and eliminates the division for some models. With the addition of the new coder, I'm not sure that RangeCoderC qualifies as a rangecoder any more...

30. Here's a changelog for all versions of RangeCoderC:

Code:
```
Version 1.0
-Initial Version
-Uses 8 bit rangecoder with a simple bitwise model
-Programmed in VB.net

Version 1.1
-Redone in C
-Speedups in models

Version 1.2
-Uses a 32 bit rangecoder instead of an 8 bit one
-Fixes some crashes

Version 1.3
-Added 'Indirect' model in separate executable
-Added 'Double' model in separate executable
-Halved memory usage for the 'Standard' model
-Increased speed in 32 bit platforms
-Trimmed executable size

Version 1.4
-Removed the need to store the file size in the header
-Added single byte RLE stream encoding
-Added 'Hashed' model in separate executable

Version 1.5
-Merged all models into single executable
-Added new version of the 'Indirect' model
-Renamed 'Double' model to 'Indexed Model Array'
-Fixed some crashes
-Changed code structure to allow for multiple models

Version 1.6
-Added a new CM based model with a linear mixer
-Added an SSE component to the CM model
-Reverted the 'Indirect' model to the old version
-Cleaned up code

Version 1.7
-Added a bytewise hashed model with high speeds
-Added a combined model for maximal compression of text
-Speed is improved for most models
-Fixed some crashes

Version 1.8
-Removed all obsolete models
-Updated model interface

Version 1.9
-Replaced the rangecoder with an arithmetic coder
-Removed division from the arithmetic coder for better speed
```
