Well, at least you can read the assembler directly, which is better than just speculating as I do
wish I could do the same...
Thanks for confirmation !
Well, at least you can read the assembler directly, which is better than just speculating as I do
wish I could do the same...
Thanks for confirmation !
i'm speculating, but 99% sure. there are no other variants. may be i've seen that effect in asm listings of my own code, not remember exactly
Maybe one way to move forward would be to keep the symbolTT Table into the innermost compression function.
It would still be allocated on stack, and would be recalculated each time the compression function is called,
but since its size only depends on nbSymbols, this is a fraction of the cost of the much larger state table.
Now, one problem is, it requires 'normalizedCounter' to be built, so that would change the interface contract.
you can just copy this table to the stack
Good point, I'll try that
OK, this version is much better
I still have to complain about a ~2% performance loss on GCC / Linux, but that's much more manageable.
We are getting pretty close to a github update.
Btw, I should have a look at this "branching" methodology...
Is it possible to quickly compute the compressed data size without actually performing the compression just based on the symbol distribution?
Last edited by caveman; 15th January 2014 at 01:34.
caveman, yes we can analytically calculate the expected number of bits per symbol assuming for example i.i.d. symbol distribution - tests in the paper were performed this way.
First we need to find stationary probability distribution of used states (dominant eigenvector), then we can calculate the expected number bits/symbol it produces - here is Mathematica procedure I have used: https://dl.dropboxusercontent.com/u/12405967/ans.nb
However, these values are a bit better than experimental ones - should be discussed somewhere here: http://encode.ru/threads/1821-Asymet...l-System/page2
And I was only using precise initialization, while the github implementation uses less accurate table generation.
There are 2 parts :Is it possible to quickly compute the compressed data size without actually performing the compression just based on the symbol distribution?
1) Compute, no, estimate, yes.
If the point is to have a rough idea of final size, it can be done.
If the objective is to know the exact size of compressed bitStream before generating it, then I don't know how to do that with 100% accuracy.
2) Quickly : yes, but...
it requires some pre-computed tables for higher accuracy, which in turn require memory.
Making it a compulsory part of a portable reference library would risk excluding low-mem systems.
So it depends on the intended usage.
A work-around could be to use rough and pessimist approximations instead.
In both cases, the solution is no different from what could be achieved for Arithmetic compression, since we have same input, and almost same compression efficiency.
Large independent tests of FSE: https://groups.google.com/a/webmproj...Y/T1cZQLnJrZ4J
I have just realized that we can use the huge speedup of large alphabet ANS also for the binary bit by bit case - by grouping symbols of the same quantized probability.
Let say we use 64 different quantizations of probability of the least probable symbol (1/128 precision) - so we can use 64 of 8bit buffers, add binary choices bit by bit to corresponding buffers, and use one of 64 ANS coding tables for 256 size alphabet every time a buffer accumulates to 8 binary symbols.
All these 64 tables can work on the same state - decoder knows symbols of which probability run out.
Interesting,
I wonder who's skal, he seems to understand the topic pretty fast.
Anyway, I saw your comment at the end regarding precise initialization, and I can only invite you to test your code, which seems almost functional, by integrating it into the github version for example.
I can only hope you can prove me wrong regarding the high impact on speed for a comparatively small gain reported from a previous version developed before release.
Pascal Massimino, he was known as french demo maker:
http://demozoo.org/sceners/252/
http://fr.linkedin.com/in/skaaaaaaaaal
He joined Google in 2006 and notably works on WebP.
Last edited by caveman; 16th January 2014 at 16:45.
OK, thanks for info, now that's much clearer...
Indeed - I think the video compression like Google VP is what will gain the most from ANS, as you want high quality video decompression, not killing e.g. smartphone battery.
We can directly use it for large static distributions like DCT coefficients there ...
... but this trick from my previous post: divide binary choices into let say 64 of 8 bit bufferers for different quantized binary probability distributions, and use 256 alphabet ANS every time a buffer fills up, allows to also gain like 5x decoding speed for varying binary choices.
It seems there is an issue with encoding the remaining values of the buffers when the block ends - we could do it using direct ABS formulas, but we would still need to additionally store the numbers of remaining symbols in each of them, so that decoder knows how many load at start - storing these numbers would require e.g. 3bits * 64 = 24 bytes per block - not very much ...
... but it can be avoided, again at cost of encoding time - this time let us use 64 large buffers, feed them with succeeding binary symbols, and start using 256 alphabet ANS after processing the whole data block - this way decoder can start with loading 8 binary choices to each of 64 buffers (assume at least one, even if it will be not entirely used - the not used positions are filled with the more probable symbols).
It has a small complication at the end of decoding block - decoder has to decide that there has remained less than 8 symbols for given probability to switch to using ABS formulas instead.
Most of current arithmetic coding applications could get huge speed up this way ...
ps. If it is too costly from memory point of view, using 16 size alphabet would cost about 20 times less and would give only ~3x speedup ...
Last edited by Jarek; 16th January 2014 at 22:43.
I'm a native English speaker, so I'll throw in my 2 cents. I like FSE_count, FSE_sizeof_counts, and FSE_count_normalized. Frequency is afaik just a fancier word for count. Distribution might be technically correct, but it sounds more obfuscated.
Cyan (17th January 2014)
nburns, what are the good names for procedure that takes unnormalized counters, does some calculations and returns normalized counters (aka distribution aka proportion of symbols)?
Pascal Massimino has written than he has made some "obvious optimization" of decoder (which, as he said, was already faster than FSE), what made it nearly twice faster (from being 50% slower than VLC to having similar speed):
https://groups.google.com/a/webmproj...Y/dPqqSeaQhhYJ
Unfortunately, there is no code provided.
He mentioned getting better compression from an FSE-style coder than fpaqc. I think this must be model-related, because, in my experiments, BWTS+MTF+fpaqc gave compression results that FSE couldn't touch.
A basic order(1) FSE style encoder could be a very strong alternative to simple modelling + arithmetic coding.
I have my own hacky O(1) arithmetic coder (based on Shelwien's code) that removes all divisions for encoding, but still leaves 1 for encoding. That makes it close to FSE for encoding speeds but still annoyingly tardy on decoding. I should investigate FSE more to see how easy it is to convert it to higher orders, or at least order-1 as higher than that becomes hard to do in a static block-by-block manner rather than adaptive modelling.
With -march=pentiumpro added to Makefile compiler options i've got ~10% decompression speedup. Tested several times for sure.
Yes, I confirm the same experience,
-march options tend to be a bit too "random", and it's difficult to guarantee that one value will always be better for everyone, not even for the architecture it's supposed to target.
I'm glad -march=pentiumpro gives you better result, and I'm likely to try it too, just to witness the result.
I tested -march=pentiumpro with GCC v4.8.1,
and indeed, the gains are pretty large.
Test System : Core 2 Duo, 3Gz, Linux Mint Petra 64-bits
normal version (-O3)
$ ./fse32 proba70.bin
FSE : Finite State Entropy, 32-bits demo by Yann Collet (Feb 3 2014)
proba70.bin : 1048575 -> 165911 (15.82%), 160.9 MB/s , 204.9 MB/s
march version (-O3 -march=pentiumpro)
$ ./fse32 proba70.bin
FSE : Finite State Entropy, 32-bits demo by Yann Collet (Feb 3 2014)
proba70.bin : 1048575 -> 165911 (15.82%), 172.5 MB/s , 255.1 MB/s
That's a really big improvement.
So I'm likely to adopt it for the 32-bits version.
Thanks ivan2k2 for reporting it !
Obviously, it only works for the 32-bits version, since Pentium Pro is not 64-bits compatible, therefore the compilation fails for this target.
Last edited by Cyan; 3rd February 2014 at 22:34.
The beta branch of FSE
now supports experimental U16 input mode :
https://github.com/Cyan4973/FiniteSt...ropy/tree/beta
It means it can accept alphabet size > 256.
Alphabet size can be defined using #define MAX_NB_SYMBOLS, at the top of fse.c
Current default value is 286, which matches zlib alphabet (literals + EOF + Match).
This is still beta, and although it passes tests, still needs to be tested a bit more before being merged into master
[Edit] : tests completed, now merged into master
Last edited by Cyan; 5th February 2014 at 11:29.
Bulat Ziganshin (4th February 2014),samsat1024 (4th February 2014)