# Thread: Finite State Entropy

1. ## Finite State Entropy

Just completed an entropy coder, based on Jarek Duda's ANS theory.
It's released as open source code with a BSD license, at https://github.com/Cyan4973/FiniteStateEntropy .

It's designed as a drop-in replacement for block-based entropy coders, typically Huffman and range coders.
Its speed is pretty decent, and it can give a serious challenge to the fastest Huffman implementations.
But FSE features the compression performance of an arithmetic coder,
and therefore estimates probabilities precisely, critically those beyond 50%.

Arithmetic compression performance at Huffman speed, that seems like a good deal.

Blog entry :
http://fastcompression.blogspot.fr/2...-breed-of.html

FSE only uses additions, masks, and shifts.
I initially designed it to work on retro platforms, but it's now good for modern CPUs too.

3. How would I use this as a drop-in replacement for an arithmetic coder? From looking at the source code, it looks to me like it implements a fixed order-0 model.

Here are some test results on proba70.bin
Code:
```1,048,576 proba70.bin
171,702 proba70.fpaqa
171,044 proba70.fpaqc
171,146 proba70.fpaqc-arith
165,760 proba70.fpaq0 (ratio 6.326)```
fpaqa uses an adaptive order 0 model and table driven ANS coder which I implemented in 2007. The compressor stores the predictions on a stack so they are coded in blocks in reverse order.

fpaqc uses the same model except the ANS coder is implemented without tables except for a table of reciprocals to optimize division (2007).

fpaqc-arith is the same model with an arithmetic coder. To compile: g++ fpaqc.cpp -DARITH

fpaq0 uses a stationary order 0 model with an arithmetic coder. I tested with a few other compressors and this seems to be the best model for this file. For example, it compresses better than zpaq -method 6 but about the same as zpaq -method s4.0c256 which is also a near-stationary order 0 model.

Also compare lpaq1 and lpaq1a at http://mattmahoney.net/dc/text.html#1440
These use identical models except that lpaq1 uses arithmetic coding and lpaq1a uses the ANS coder from fpaqc.

Here are timing results for all the fpaq variants. My implementation of ANS (fpaqa, fpaqb, fpaqc) is somewhat slower than arithmetic coding. fpaq0p is probably the most similar model (with a different adaptation rate). The really fast versions use tightly integrated arithmetic coders that would require some effort to convert to ANS. http://mattmahoney.net/dc/text.html#5586

4. From looking at the source code it looks to me like it implements a fixed order 0 model.
Yes, that's what I meant by "block-based entropy coder" : probabilities are fixed for the whole block of input data.
Adaptation is done by cutting the input data into smaller blocks, but that's not per-symbol probability adaptation.
This methodology is fine for zlib, zhuff, and so on (FreeArc too, I guess?), but it's not suitable for fpaq/zpaq/paq, which favor adaptive binary models.

fpaqa uses an adaptive order 0 model and table driven ANS coder which I implemented in 2007.
Of course! Sorry, binary versions were completely out of my mind when I wrote the blog article.
They obey a completely different set of properties, and are meant for a different use case (i.e. fpaq/zpaq/paq and equivalents).
I will nonetheless mention it in the article.

5. Thanks, Yann - indeed, while ABS for "bit-wise adaptive" is practically equivalent to arithmetic coding, ANS for fixed probabilities ("block-based") allows encoding a symbol from a large alphabet in a single table check - it should be faster than Huffman, with precision like arithmetic coding.

The number of states should be a few times larger than the alphabet size to work in the deltaH~0.001 bits/symbol regime.
You could make it faster by also putting the bitwise operations into the table, i.e. the whole "(symbol, state) -> (bit sequence, new state)" rules ... but at the cost of a lower number of states per alphabet size, to remain in L1 cache.
You could also increase the base b, e.g. to 4, to halve the number of bit transfers at some cost in precision ... generally there are lots of options to optimize among ...

Another advantage of ANS is that we can slightly perturb the initialization procedure using a pseudo-random number generator initialized with a cryptographic key (and e.g. the block number): for example, choosing between the lowest-weight symbol and the second best one.
This way we simultaneously get a decent encryption practically for free.

The advantage of Huffman is that we can adaptively modify the tree on the fly, but we could also do it in ANS - switch some appearances between symbols and renumber the following ones ... though it would be costly for a large table.

About materials, the recent paper should be more readable, and here is a poster gathering basic information.

6. The advantage of Huffman is that we can adaptively modify the tree on the run, but we could also do it in ANS - switch some appearances between symbols and renumerate the following ones ... but it would be costly for a large table.
Indeed, even for Huffman, this is a costly operation.
I guess the most common algorithm used is Vitter's, and it is way slower than a static implementation. In the end, it's not even better than just cutting data into smaller blocks, with an optimal tree described in the header of each one.

My guess : block-based algorithms are not meant to be per-symbol adaptive. It just puts them too far away from their comfort zone.
Per-symbol statistics adaptation is best achieved using Matt's BAC (Binary Arithmetic Coding) and ABS (Asymmetric Binary System).

Btw, another potential advantage of Huffman is that the header can be smaller :
since it's enough to tell the number of bits per symbol, it's typically possible to encode each value using 4 bits or even less. But for arithmetic coding, and for FSE, it's necessary to provide the exact frequency, which costs more bits.
It may not sound like much, but these added bits translate into a heavier header, and put a lower limit on the size of input blocks for the algorithm to remain efficient. As a consequence, anything that can improve this header size is welcome.

7. Btw, another potential advantage of Huffman is that the header can be smaller :
since it's enough to tell the number of bits, it's typically possible to encode this value using 4 bits or even less. But for arithmetic coding, and for FSE, it's necessary to provide the exact frequency, which costs more bits.
It may not sound like much, but these added bits translate into heavier header, and put a lower limit to the size of input blocks for the algorithm to remain efficient. As a consequence, anything that can improve this header size is welcomed.
You can store logarithms instead of exact frequencies. What matters is code lengths, and logarithms are just that. You can fit the logarithm of a single frequency in one byte - I think that would be enough.

8. What's the difference between logarithm of frequency and bit length?

9. Basically, there's no difference, other than that Huffman implies integral bit lengths, while we don't need to comply with that in the case of arithmetic coding.

In fact, we can do some sort of hybrid mode, i.e. store fixed-point logarithms for the few most frequent values, and then for the rest store integral logarithms using the usual tricks for compacting bit lengths.

10. Piotr: since our goal is optimal usage of codespace, it seems that the optimal solution is to store all freqs as fixed-precision logarithms, f.e. with 8 binary digits: 0.0101 0001, 0.0001 0111, 0.0000 0101, 0.0000 0001..

Cyan, your bitflushing code supposes that nbBitsOut+nbBitsOut2<=25

11. Piotr: since our goal is optimal usage of codespace, it seems that optimal solution is to store all freqs as fixed-precision logarithms. f.e. with 8 binary digits: 0.0101 0001, 0.0001 0111, 0.0000 0101, 0.0000 0001..
That's what I was talking about.

The precision of the frequency of infrequent symbols doesn't contribute much to the compression ratio improvement (over Huffman), so we can provide bit lengths rounded to integers for most symbols, as I said, saving on header space and probably saving space in general.

12. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.
Huffman is optimal, given its assumptions. Optimal is a slippery word, because nothing can be optimal without assumptions, and using the right assumptions, anything can be optimal.

Huffman makes a set of codes that matches a distribution. The assumption is that each code represents a single thing and is always the same. That assumption implies that each code must be a whole number of bits.

13. About probabilities for the first block: this block can be shorter and so weakly compressed - mainly used to estimate further probabilities.
So we can store just some very rough approximations for these initial probabilities.
For example, 1 bit (b_i = 0,1) per symbol i (32 bytes for 256 symbols), like: the probability of symbol i is (b_i+1) / sum_i (b_i+1).
The question of how to do it optimally is an interesting general problem - for a given information limit, how to represent the best approximation of a probability distribution from some family ...

About Huffman, by grouping symbols it allows us to approach the Shannon limit as closely as we want - in this sense it is optimal.
However, to get to dH distance, the number of symbols we need to group is approximately proportional to 1/dH, so we would need huge 2^(1/dH)-size alphabets - for dH~0.002 bits/symbol it becomes completely impractical (picture in the poster).

14. Piotr, I mean that we don't need to store bit lengths at all: all symbols with smaller frequency are assigned the fixed 0.0000 0001 amount. If we encode higher frequencies only with 8-bit precision, we shouldn't store bit lengths higher than 8 bits - instead assign them all the fixed 1/256 frequency. I believe this will optimize the overall code length for tables+data, although I have no proof.

15. Cyan, your bitflushing code supposes that nbBitsOut+nbBitsOut2<=25
Correct. I'll add a #define protection for that. Thanks!

16. Here are some test results on proba70.bin
Code:
```1,048,576 proba70.bin
171,702 proba70.fpaqa
171,044 proba70.fpaqc
171,146 proba70.fpaqc-arith
165,760 proba70.fpaq0 (ratio 6.326)```
Thanks for the comparison metric.
Indeed, proba70 (and all probaxx files) is a very "static" file, which means probability distribution does not vary much across the file.

By default, FSE gets 166,091 bytes with it.
This is by using the default "32 KB" block mode, which means that the header is repeated for each block.

If I'm "cheating a bit", by forcing "1 MB" block mode, and therefore a single header for the whole file,
then FSE gets 165,635 bytes, which is in line with fpaq0.

This brings us back to the issue mentioned earlier in this thread : the header contributes to the overall "cost", and can probably be optimized to cost less than it does today.

17. A single header means just static compression - for large uniform data it is very reasonable, and ANS is perfect for this purpose.
We can also use higher-order or context-dependent methods here - switching between the corresponding coding tables (many of them, probably for a smaller alphabet).

18. 1. All lzh/ari codecs I know use alphabets with >256 symbols, since they combine literal chars and matches into one dictionary. So the first improvement I propose is to support 16-bit symbols. One possible implementation:

```#ifndef SYMBOL
#define SYMBOL unsigned char
#endif```

although ideally one source file should provide both 8-bit and 16-bit routines

2. Next thing is a header-less mode. Do I understand correctly that it's enough to supply both coder and decoder with the same counting[] tables? In that case, providing lower-level functions that rely on provided frequencies would allow using the codec in a semi-dynamic mode, where each block is encoded using the stats of the previous block, as well as providing alternative header encodings without modifying the FSE sources. Moreover, I feel that header encoding doesn't belong to the core of the algorithm implementation and may be considered part of the [example] client code

3. Next thing is a variant with reversed compression direction. It may be useful since overlapping modeling and encoding should make compression faster, while for decompression the fastest way is multithreaded H/ARI/FSE decoding followed by a single-threaded LZ step.

4. I'd like to see higher-order variants of the code, e.g. I'm thinking about an LZ+order-1 coder.

20. A higher-order or context-dependent method, instead of e.g. "tableDecode[state]", will use "tableDecode[history/context][state]" - many smaller coding tables.
Note that, since encoding is in the backward direction, the history/context means the one for the later forward-direction decoding.

The minimal header problem is indeed interesting: how to find the best description/approximation: minimizing "header size + imprecision cost".
For large data, the header cost becomes negligible (unless using some high-order or complex context model) - static compressors can then be better than adaptive ones, as they use the whole data to construct the optimal model - the gain can be larger than the header cost.

ps. As I have written in a different thread, for the "reversed entropy coding" problem we can use generalizations of the Kuznetsov-Tsybakov problem to encode such that the decoder doesn't have to know the probabilities used - no header is needed.
I don't see how to do it for standard entropy coding, but maybe something like that is possible? The information about the probability distribution seems somehow redundant ...

21. Where, while encoding is in backward direction, the history/context means the one for later forward direction decoding.
It should be no problem for fixed-order prediction

The higher order or context dependent method, instead of e.g. "tableDecode[state]", will use "tableDecode[history/context][state]" - many smaller coding tables
Hopefully someone will write the entire code, but if I understood you correctly it may be problematic - e.g. with 256 contexts and 16 KB for every context we will need too much memory (i.e. it will not fit in L1/L2 cache), and with smaller encoding tables the precision will be lower. For arithmetic coding this is not much of a problem: we only need 256 elements of encoding table per context.

22. As I have written, it would rather require a smaller alphabet - for example of size 16, which would need a few hundred bytes instead of 16 KB.
A size-16 alphabet means there are 4 binary steps of arithmetic coding per such single step - it should still be much faster.

23. Originally Posted by Jarek The higher order or context dependent method, instead of e.g. "tableDecode[state]", will use "tableDecode[history/context][state]" - many smaller coding tables.
Where, while encoding is in backward direction, the history/context means the one for later forward direction decoding.

The minimal header problem is indeed interesting: how to find the best description/approximation: minimizing "header size + imprecision cost".
For large data, the header cost becomes negligible (unless using some high order or complex context model) - static compressors can be better then adaptive, as they use the whole data to construct the optimal model - the income can be larger than the header cost.

ps. As I have written in a different thread, for "reversed entropy coding" problem, we can use generalizations of Kuznetsov-Tsybakov problem to encode such that decoder doesn't have to know the probabilities used - no header is needed.
I don't see how to do it for standard entropy coding, but maybe something like that is possible? The information about probability distribution seems somehow excess ...
I used to think the header size problem was interesting, but not any more. The optimal solutions are all bijective - some better suited for a large number of symbols, and some better suited for when only a small subset of the available symbols is used in the file to compress. But in general, if it's not known how many symbols are used, and if the file is IID, then the best Huffman is a bijective adaptive Huffman. And if you have the time, a bijective adaptive arithmetic file compressor will compress the smallest.

Writing an optimal header using symbols and bit lengths is not the way to go, unless you have extra complex rules on how to handle the data that follows, so that the data plus those rules lead back to the same file again under compression. You can use a million hybrid ways to do it, such as how I did the bijective DC: the header there can easily be used to make a bijective Huffman or arithmetic coder, in which case the header is in front of the file and the data follows. The resulting file is such that any subset of the file would decompress and compress back to the same subset, so it is fully bijective.

24. Originally Posted by Jarek About Huffman, by grouping symbols it allows to approach the Shannon limit as close as we want - in this sense it is optimal.
However, to get to dH distance, the number of symbols we need to group is approximately proportional to 1/dH, so we would need huge 2^(1/dH) size alphabets - for dH~0.002 bits/symbol it becomes completely non-practical (picture in poster).
The bigger the alphabet, the bigger the tree. I think that just moves the problem somewhere else. If you dig up the original papers on Huffman, you should find the proof of optimality. It's a valid proof, but for a different set of conditions than the ones here.

25. Originally Posted by Jarek The minimal header problem is indeed interesting: how to find the best description/approximation: minimizing "header size + imprecision cost".
For large data, the header cost becomes negligible (unless using some high order or complex context model) - static compressors can be better then adaptive, as they use the whole data to construct the optimal model - the income can be larger than the header cost.
Theoretically, the header cost for static should equal the startup cost for adaptive, because they contain the same amount of information.

Yann --
Storing only the delta for headers after the first would probably help. I haven't taken a look at the code, so I'm not sure if you're doing that already.

26. Originally Posted by Bulat Ziganshin Piotr, i mean that we don't need to store bit lengths at all: all symbols with smaller frequency assigned the fixed 0.0000 0001 amount. if we encode higher frequencies only with 8-bit precision, we shouldn't store bitlengths higher than 8 bits - instead assign them all the fixed 1/256 frequency. i believe that it will optimize overall codelength for tables+data, although have no proof
I think 1/256 would be too high. It could treat the frequencies as being relative to each other, though. So the denominator would be the sum of all of them. That would help.

The frequencies are really integer counts. If they are highly skewed, Elias gamma would do a decent job, I think.

27. Originally Posted by Bulat Ziganshin 1. all lzh/ari codecs i know use alphabets with >256 symbols since they combine literal chars and matches into the one dictionary. so the first improvement i propose is to support 16-bit symbols. one possible implementation:

```#ifndef SYMBOL
#define SYMBOL unsigned char
#endif```

although ideally one sourcefile should provide both 8-bit and 16-bit routines
Indeed, that's how zip works, and consequently many other lzh algorithms.
I'm personally not fond of it, since it introduces some complexity which can be solved differently.

Nonetheless, if that's the way many programmers want to use it, there should be a mode for it.
It's more complex than it sounds : the decoding state table is directly affected.
A cell size is exactly 4 bytes, and I can't increase the size of the symbol that easily.
I guess that one way to proceed would be to "share some space" with '.nbBits',
resulting in an extended range for symbol from 8 bits to 12 bits (4096 values).
This size should be large enough for most lzh with more than 256 values.
For a larger number of values than that, modifications are going to be bigger (and most likely no longer compatible with the L1 cache requirement).

Originally Posted by Bulat Ziganshin 2. next thing is header-less mode. does i undestand correctly that it's enough to support both coder and decoder with the same counting[] tables? in that case providing lower-level functions that rely on provided frequencies will allow to use codec in semi-dynamic mode where each block is encoded using stats of previous block; as well as provide alternative header encodings without modifying FSE sources. moreover, i feel that header encoding doesn't belong to the core of the algorithm implementation and may be cosidered as part of [example] client code
Yes, I agree,
this is definitely one of the key properties of the interface.
The code is almost ready for it.
As you say, header encoding doesn't belong to the core of the algorithm implementation and may be considered as part of client code.

Originally Posted by Bulat Ziganshin 3. next thing is variant with reversed compression direction. it may be useful since overlapping modeling and encoding should make compression faster, while for decompression, the fastest way is multithreaded H/ARI/FSE decoding with the following single-threaded LZ step
I'm not sure I understand this part.
Keep in mind that, regarding direction, ANS theory is a bit special, since it requires encoding and decoding to be performed in reverse directions.
The way FSE deals with it is that it encodes in the backward direction and outputs in the forward direction,
while decoding is performed in the forward direction while reading the compressed input in the backward direction.
A bit complex, but it works well for block mode.

Originally Posted by Bulat Ziganshin 4. i like to see higher-order variants of code. f.e. i think about LZ+order1 coder
From an algorithm perspective, this point is equivalent to using multiple tables to encode a single bitstream.
It should cause no problem as long as all state tables have the same size.
(For different sizes, though, it's a bit more complex...)

Now, obviously, the major issue with this is the total size of the tables, which is very likely to become larger than L1 cache, directly affecting speed.

28. Originally Posted by nburns Theoretically, the header cost for static should equal the startup cost for adaptive, because they contain the same amount of information.
No, not theoretically. The length of a header followed by data will in general be greater than proper adaptive coding, even if you think of the information as the same. As an example, you can think of storing the header as one number and the data as a second number, while the adaptive output can be thought of as a single number. A single number can be the bijective combination of 2 numbers, which requires less space than representing two separate numbers.

to use the cantor pairing function of http://en.wikipedia.org/wiki/Cantor_...iring_function

let X and Y be two numbers, encoded as a universal number plus a binary coded number (which has zero waste), versus the number Z, the result of the Cantor pairing function for X and Y, written as a single binary coded number (which has no waste).

starting at zero to match the article
for a binary coded integer starting at zero: add two, drop the high-order 1 bit from the binary representation
0 0
1 1
2 00
3 10
6 000
9 110
24 01011

for a universal number, using the Elias gamma code shifted to start with zero

0 1
1 010
2 011
3 00100

for <3,3> 00100 10 7 bits
for 24 01011 5 bits so single number wins here

for <0,3> 1 00100 6 bits
for 9 110 3 bits so single number wins here

for <3,0> 00100 1 6 bits
for 6 000 3 bits so single number wins here

of course you could use a different universal number representation if you like.

30. Originally Posted by biject.bwts No not theoretically. The length of header followed by data will in general be greater than proper adaptive even if you think of the information as the same. As an example you can think of the storing of header as one number and data as a second number while the adaptive can be thought of as a single number. A single number can be the bijective combination of 2 numbers which require less space to represent it compared to the space needed to represent two single numbers.
If you made a header and bijectively merged it with the data, I wouldn't consider that an adaptive algorithm.

31. Originally Posted by nburns If you made a header and bijectively merged it with the data, I wouldn't consider that an adaptive algorithm.
In general, neither would I consider it an adaptive algorithm. But it is easy to make a header that can be concatenated with any file, such that the result of the two can be thought of as the result of an adaptive bijective algorithm. Note I am not saying bijectively combine, but to append the data after the header, if any. What you call the header, or a trimmed version of it, would be the result of a bijective compression. The shortened header would still be bijective and valid; the data is just there after the header if more data is to be compressed.

32. I tried to start playing with it, but it keeps segfaulting in FSE_decompress:

Code:
```Program received signal SIGSEGV, Segmentation fault.
0x0000000000405485 in FSE_decompress (dest=0x6084d0 "", originalSize=44032, compressed=0x7ffff7ed7013 "F\002") at ../fse.c:625
625            bitStream = *(U32*)ip;
(gdb) list
620            U32 state;
621
622            bitCount = (((*(U32*)ip)-1) & 7) + 1 + 24;
623            iend = ip + (((*(U32*)ip)+7) / 8);
624            ip = iend - 4;
625            bitStream = *(U32*)ip;
626            bitCount -= FSE_MAX_TABLELOG;
627            state = (bitStream >> bitCount) & FSE_MAXTABLESIZE_MASK;
628
629            while (op<oend)
(gdb)```
Edit:
That was in Ubuntu, by the way, with gcc 4.7.2.

I get the same result in 32-bit Cygwin, however:

Code:
```~/git/FiniteStateEntropy/test$ ./fse COPYING -o FOO
FSE : Finite State Entropy, capability demo by Yann Collet (Dec 20 2013)
Compressed 18092 bytes into 10758 bytes ==> 59.46%
~/git/FiniteStateEntropy/test$ ./fse -d FOO > BAR
FSE : Finite State Entropy, capability demo by Yann Collet (Dec 20 2013)
Segmentation fault (core dumped)
~/git/FiniteStateEntropy/test$```
Am I using it wrong?

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•