I made and uploaded this thing quite a while ago, but I guess
it's still hard to understand what it's about from this archive.
The idea is kinda obvious - instead of coding all data bits
sequentially with a single rangecoder, we can use a separate
rc instance for each bit position in a byte, and then update
all rcs at once in a single loop after processing a byte
(or maybe some 2^n bytes).
The benefits of that are:
1) inlined rc calls (which usually have 2-3 branches inside)
disappear from the model.
2) vector instructions can be used for the rc update, especially
for the multiplication in it.
3) with a large enough number of rc instances it starts making
sense to run the model and rc updates in different threads.
And that's implemented in vecrc above, where IntelC was able
to automatically vectorize the rc update loop, including the
multiplication. This coder was tested by attaching it to ccm,
where it demonstrated a ~15% speed improvement.
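
To make the scheme a bit more concrete, here is a minimal sketch of the "one coder per bit, update them all in one loop" idea. It is not the vecrc source: it assumes a simple carryless binary arithmetic coder (the lpaq-style x1/x2 kind), and all names (Lanes, encode_byte, adapt etc.) are made up for the illustration. The toy model keeps one adaptive probability per bit position and doesn't look at earlier bits of the same byte, which a real CM wouldn't do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int NRC = 8;  // hypothetical: one coder instance per bit of a byte

struct Lanes {
    uint32_t x1[NRC], x2[NRC];      // current interval [x1, x2] of each instance
    uint32_t p[NRC];                // toy model: 12-bit P(bit==1) per bit position
    std::vector<uint8_t> out[NRC];  // each instance writes its own stream
    Lanes() {
        for (int i = 0; i < NRC; i++) { x1[i] = 0; x2[i] = 0xFFFFFFFFu; p[i] = 2048; }
    }
};

// Toy adaptive counter; p stays within [15, 4081].
static void adapt(uint32_t& p, uint32_t bit) {
    if (bit) p += (4096 - p) >> 4; else p -= p >> 4;
}

// Encode one byte: the model side only records the bits, then all
// instances are updated together in loops over the lane index.
void encode_byte(Lanes& L, uint8_t c) {
    uint32_t bit[NRC];
    for (int i = 0; i < NRC; i++) bit[i] = (c >> (7 - i)) & 1;  // MSB first

    // Branch-free interval update: a candidate for auto-vectorization.
    for (int i = 0; i < NRC; i++) {
        uint32_t mid = L.x1[i] + ((L.x2[i] - L.x1[i]) >> 12) * L.p[i];
        uint32_t m = 0u - bit[i];                   // all ones if bit == 1
        L.x2[i] = (mid & m) | (L.x2[i] & ~m);       // bit 1: x2 = mid
        L.x1[i] = (L.x1[i] & m) | ((mid + 1) & ~m); // bit 0: x1 = mid + 1
    }
    // Renormalization (byte output) and model update, kept out of the loop above.
    for (int i = 0; i < NRC; i++) {
        while (((L.x1[i] ^ L.x2[i]) & 0xFF000000u) == 0) {
            L.out[i].push_back(uint8_t(L.x2[i] >> 24));
            L.x1[i] <<= 8;
            L.x2[i] = (L.x2[i] << 8) | 255;
        }
        adapt(L.p[i], bit[i]);
        assert(L.p[i] < 4096);
    }
}

// Write all 4 bytes of x1 so the decoder never reads past the end of a stream.
void flush(Lanes& L) {
    for (int i = 0; i < NRC; i++)
        for (int k = 0; k < 4; k++) {
            L.out[i].push_back(uint8_t(L.x1[i] >> 24));
            L.x1[i] <<= 8;
        }
}

// Decode n bytes from the eight streams stored in enc.out.
std::vector<uint8_t> decode(const Lanes& enc, size_t n) {
    Lanes L;
    uint32_t x[NRC];
    size_t pos[NRC];
    for (int i = 0; i < NRC; i++) {
        x[i] = 0;
        for (pos[i] = 0; pos[i] < 4; pos[i]++) x[i] = (x[i] << 8) | enc.out[i][pos[i]];
    }
    std::vector<uint8_t> res;
    for (size_t j = 0; j < n; j++) {
        uint32_t c = 0;
        for (int i = 0; i < NRC; i++) {
            uint32_t mid = L.x1[i] + ((L.x2[i] - L.x1[i]) >> 12) * L.p[i];
            uint32_t bit = x[i] <= mid;
            if (bit) L.x2[i] = mid; else L.x1[i] = mid + 1;
            while (((L.x1[i] ^ L.x2[i]) & 0xFF000000u) == 0) {
                L.x1[i] <<= 8;
                L.x2[i] = (L.x2[i] << 8) | 255;
                x[i] = (x[i] << 8) | enc.out[i][pos[i]++];
            }
            adapt(L.p[i], bit);
            c = (c << 1) | bit;
        }
        res.push_back(uint8_t(c));
    }
    return res;
}

// Helper for checking the sketch: encode, flush, decode.
std::vector<uint8_t> roundtrip(const std::vector<uint8_t>& data) {
    Lanes enc;
    for (uint8_t c : data) encode_byte(enc, c);
    flush(enc);
    return decode(enc, data.size());
}
```

Calling roundtrip() on a buffer should return the same bytes. The part that matters is the first lane loop in encode_byte(): it has no branches and no output, only the multiplication and the masked bound updates, so it's the kind of loop a compiler could vectorize. Whether a given compiler actually does it for this exact code is something I haven't checked here.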
Is it clear how it works, or do I have to explain the details?
Do we need a simpler implementation, e.g. one based on the rc_vX series?