# Thread: Nibble entropy encoding

1. ## Nibble entropy encoding

I'm showing my lack of basic entropy encoding knowledge here! I'd just like to double check some newb things - sorry.

Normally when doing byte-wise range coding in a given context (e.g. the last 8 bits for an order-1 Markov model), we may do something like:

Code:
```
int ctx = 0;
for (i = 0; i < len; i++) {
    encode_sym(&model256[ctx], &rc, in[i]);
    ctx = in[i];
}
```
I've been experimenting with 4-bit nibble based entropy encoding, following ideas from Fabian Giesen, DAALA, DivANS, etc. All great ideas and they do indeed help a lot. What I don't entirely understand is how to encode bytes in 4-bit nibbles. I guess it's because I've never played around much with binary arithmetic coding. Naively I did this:

Code:
```
int ctx = 0;
for (i = 0; i < len; i++) {
    encode_sym(&model16[ctx][16],         &rc, in[i] & 15);
    encode_sym(&model16[ctx][in[i] & 15], &rc, in[i] >> 4);
    ctx = in[i];
}
```
I.e. my model has 17 sub-contexts, one of which I use for the bottom 4 bits while the other 16 I use for the top 4 bits, with the bottom 4 as context (in addition to the current context). This basically means instead of encoding 8 bits using an N-bit context, we encode the lower 4 bits using an N-bit context followed by the higher 4 bits using an N+4-bit context, and update those two models separately. I cannot see any inherent benefit to encoding the top 4 bits first vs the bottom 4 bits first, but it may depend on the data.

It does seem to work fine with mostly comparable results, but I'm unsure if it's the "right thing" to do. I took a look at how the bit-wise ACs work, but they're usually a mess of cunning optimisations which rather obfuscate the code. I thought about encoding the lower 4 bits also with an N+4-bit context, adding in the previous 4 high bits, but given my basic context is just an order-N Markov chain, those previous 4 high bits are already part of the N-bit context.

Anyway, my ultimate aim was to explore nibble-adaptive rANS using CDFs, but so far all I've done is gain a significant speed up to arithmetic coding and little impact on adaptive rANS! Oops.

2. ## The Following User Says Thank You to JamesB For This Useful Post:

encode (18th October 2018)

3. I have implemented nibble-based rANS before, so maybe I can lend a hand. You appear to be using the same nibble model for the high bits and the low bits; the trick is to use one nibble model for the high bits and 16 low-nibble models selected by the high nibble as context.
Code:
```
for (i = 0; i < len; i++) {
    high = in[i] >> 4;
    low  = in[i] & 15;
    encode_sym(&high_nib_model, &rc, high);
    encode_sym(&low_nib_model[high], &rc, low);
}
```
Another trick you can use is a cyclic 8-bit context: store 16 high-nibble models and 16 low-nibble models, and swap the high and low nibble contexts on each encode operation. That is, use the high nibble as context for the low-nibble model, and the low nibble as context for the next byte's high-nibble model. This compresses enwik8 to around 56MB. It does use more memory, but it is technically a near-optimal order-0 model since it's only using 8 bits of context at any given point in the coder.

Code:
```
int cxt = 0;
for (i = 0; i < len; i++) {
    high = in[i] >> 4;
    low  = in[i] & 15;
    encode_sym(&high_nib_model[cxt], &rc, high);
    encode_sym(&low_nib_model[high], &rc, low);
    cxt = low;
}
```
Hope this helps!

4. For your first example, that's what I'm doing already. I just cheated with a 17-wide model: 16 entries are for the second nibble (&low_nib_model[high]) and the 17th is for the first nibble (&high_nib_model). Thanks though, as it's confirmation that I'm not doing something completely stupid. The second code snippet is more interesting: you have two 16-wide models, and it looks like a win for order-0 entropy encoding. However, in my case I already have a context on the previous byte (ctx = in[i]), so setting ctx = low is duplicate effort; I'd be using 8 bits of the previous byte plus another 4 low bits of the same byte. Possibly I could slide my context along by 4 bits so I always have N+4-sized models for both nibbles.

5. Yeah, your construction looks right. That's the "default" thing to do.

For some background, binary coding is indeed the place to start.

For a binary coder coding an 8-bit symbol: before you see the first binary event, the set of values the symbol could be is { 0, 1, 2, ..., 255 }.

Then you decode one bit. Conceptually, you partition that 256-element set into 2 disjoint subsets, and you specify which set the symbol you care about is in.

Clearly you can repeat that process, leading to a binary tree of decisions. The leaves of that tree will correspond to the 1-element subsets { 0 }, { 1 }, { 2 }, ..., { 255 }, i.e. whenever you reach a leaf, there's only one possible symbol it could be. All such binary trees have exactly 255 internal nodes (corresponding to a probability you store).

Any binary tree with 256 leaves works. In principle they're all equivalent, in the sense that any probability distribution over the original 256 symbols can be represented as corresponding binary event probabilities for each internal node of the tree.

The "default" choice is to just use a complete binary tree, usually going from MSB down to LSB. So the root node encodes the MSB (bit 7); it has two child nodes, both of which encode bit 6, one in the case where the MSB was 0 and one where the MSB was 1; these two nodes have a total of 4 children, corresponding to bit patterns "00??????", "01??????", "10??????", "11??????", and so forth.

It's very convenient to pack them all into an array in the same way you do with a heap (the data structure). 1-based indexing is most natural here. You can get rid of the unused slot at index 0, but it's nicer for exposition to keep it.

Encoding ends up being, using an interface similar to yours:

Code:
```
int binary_model[256]; // probabilities for internal nodes 1..255, initialized and kept elsewhere
int ctx = 1; // per-sym
for (int bit_index = 7; bit_index >= 0; bit_index--) {
    int bit = (sym >> bit_index) & 1;
    encode_bit(&binary_model[ctx], &rc, bit);
    ctx = ctx*2 + bit;
}
```
The root node is numbered 1; its two children are 2 and 3; their children are 4,5 and 6,7 respectively; and so forth. The leaf nodes are numbered 256, ..., 511. "Encode_bit" might update the model state adaptively, or it might choose not to.

The corresponding decoding algorithm is:

Code:
```
int binary_model[256]; // probabilities for internal nodes 1..255, initialized and kept elsewhere
int ctx = 1;
for (int i = 0; i < 8; i++)
    ctx = ctx*2 + decode_bit(&binary_model[ctx], &rc);
sym = ctx - 256;
```
You shift in a new bit at the bottom in every step. After 8 steps the original 1 has been shifted 8 times (that's the bias of 1<<8 = 256 subtracted at the end) and we have the 8 payload bits below.

That's the encoding and binarization scheme you see almost everywhere. If you don't know anything about your data in advance, the complete binary tree is as good a choice as any, and it's convenient.

If you do know something about your data, you might want to binarize (e.g. turn the larger-alphabet symbol into a sequence of binary events) in a different way that takes that knowledge into account. For example, if you have a static distribution (or at least a known prior) and are using a binary coder, you likely want to minimize the number of binary symbols you decode, because binary decode steps aren't fast! Say you have known probabilities for all the leaves, and want to minimize the total number of binary decode steps you do. What you're minimizing here is the weighted (by the probabilities of the leaf nodes) external path length of the binary tree... and the solution to that is a Huffman tree. In other words, if you want to minimize the number of binary symbol decodes you do for a known probability distribution (or at least a good estimate), don't use a complete binary tree shape, use the Huffman tree! (I think some of the On2/Google VP* codecs used that? Not sure if they still do.)

The complete binary tree shape corresponds to a prior that is a uniform distribution (i.e. no information).

With that all out of the way, you can generalize this:
- You can of course have extra context. For example by having several models indexed by said context bits.
- For nibble models, you don't have a binary tree, you have a 16-ary (hexadeciary? Is that a word?) tree. The same "set partitioning" mathematical model works as well here, only the trees are much flatter (only 2 levels here).

For binary trees, we could use a single probability per inner node because the nodes only have two children and their probabilities need to sum to 1. For the 16-ary case, we have a 16-element probability distribution over the children (which we could store only 15 values of since all 16 need to sum to 1, so if you know the probabilities of 15 of the child nodes, you can infer the 16th). In the binary case we can think of probabilities being stored in the nodes but for more general n-ary it's a lot more natural to think of the probabilities being attached to the edges.

The complete tree layout for 8-bit bytes from nibble models just has a full set of { 0, ..., 255 } at the root, and 16 child nodes corresponding to subsets { 0, ..., 15 }, { 16, ..., 31 }, ..., { 240, ... ,255 }, with 256 leaves below those.

But the basic characteristics are still the same, in the sense that you could use different tree shapes, and any probability distribution over { 0, ..., 255 } can be mapped to probabilities for each edge, for any tree shape. (Nor is the number 256 magic here; this works for any alphabet, any alphabet size, no matter whether power-of-2 or not.)

Final caveat: while the tree shape doesn't matter in principle, in practice of course you don't store probabilities with infinite precision, and often you have some local adaptation rule.

To maximize overall precision, a Huffman-ish layout is ideal, since it keeps all probabilities as close to even as possible, and your round-off errors are usually worst for the low probabilities.

If you're using adaptation, say with the usual "exponential moving average"-style probability models, the shape of the tree absolutely does make a difference. Say for a 4-symbol alphabet, the two tree shapes ((0 1) (2 3)) (balanced tree) and (0 (1 (2 3))) (right-leaning tree) are definitely not equivalent if you update probabilities on the fly.

6. ## The Following 2 Users Say Thank You to fgiesen For This Useful Post:

JamesB (18th October 2018),jibz (21st October 2018)

7. In experimenting with CDFs and nibble encoding, I've concluded that it helps on more even distributions, but my existing modelling for extreme distributions (which I'm dealing with more frequently) still wins out. I thought I'd explain how that works here (example taken from fqzcomp):

Model:

Code:
```
template <int NSYM>
struct SIMPLE_MODEL {
    enum { STEP = 8 };

    SIMPLE_MODEL();
    inline void encodeSymbol(RangeCoder *rc, uint16_t sym);
    inline int encodeNearSymbol(RangeCoder *rc, uint16_t sym, int dist);
    inline uint16_t decodeSymbol(RangeCoder *rc);

protected:
    void normalize();

    uint32_t TotFreq;  // Total frequency

    // Array of symbols approximately sorted by Freq.
    struct SymFreqs {
        uint8_t  Symbol;
        uint16_t Freq;
    } sentinel, F[NSYM+1];
};
```
The key thing here is that the symbol array F[] is preceded by sentinel. This is initialised with:

Code:
```
template <int NSYM>
SIMPLE_MODEL<NSYM>::SIMPLE_MODEL() {
    for (int i = 0; i < NSYM; i++) {
        F[i].Symbol = i;
        F[i].Freq   = 1;
    }

    TotFreq         = NSYM;
    sentinel.Symbol = 0;
    sentinel.Freq   = MAX_FREQ; // Always first; simplifies sorting.

    F[NSYM].Freq = 0; // terminates normalize() loop. See below.
}
```
I have no escape symbols here, so everything has a minimum frequency of 1. It's faster to handle it this way, but compression is not as good, especially at higher orders. Instead I typically ensure NSYM is a suitable value so as not to leave lots of unused symbols.

The symbol encoding doesn't use a CDF, but instead linearly scans. However it does so on a sorted list with the most likely symbols first, so on extreme distributions this is fast to find:

Code:
```
template <int NSYM>
inline void SIMPLE_MODEL<NSYM>::encodeSymbol(RangeCoder *rc, uint16_t sym) {
    SymFreqs *s = F;
    uint32_t AccFreq = 0;

    while (s->Symbol != sym)
        AccFreq += s++->Freq;

    rc->Encode(AccFreq, s->Freq, TotFreq);
    s->Freq += STEP;
    TotFreq += STEP;

    if (TotFreq > MAX_FREQ)
        normalize();

    /* Keep approx sorted */
    if (s->Freq > s[-1].Freq) {
        SymFreqs t = *s;
        *s = s[-1];
        s[-1] = t;
    }
}
```
The last part of that just swaps the current symbol with the previous symbol if its frequency is larger. This is one single step from a bubble sort, distributed across each symbol encode. The sentinel value we defined earlier forces this to not underrun the array (although it's rather dodgy it being a separate variable rather than an *actual* element -1 in the array). We could also only do this swap periodically (I used to have a "bub_count" variable for this and do it every 16 symbols), but it's an extra comparison and not worth it generally.

While I've been experimenting with doing SIMD encoding and decoding of 16-wide CDFs, I just can't get it to run significantly faster than the above on most of the data sets I'm working on. Maybe your mileage will vary, but it's worth considering with adaptive modelling.