1. Matt's weird carryless rc replaced with a port of sh_v1m,
which supports single-step renormalization, and also
other optimizations, like 16bit i/o (IntelC somehow
optimizes it using vectorization).

Speeds, with percentages relative to the original fpaq0p:
48.940Mbps 48.862Mbps | 100% 100% | fpaq0p (original)
56.590Mbps 54.367Mbps | 116% 111% | fpaq0p IntelC build
56.987Mbps 49.813Mbps | 116% 102% | fpaq0pv2 (original)
64.992Mbps 53.078Mbps | 133% 109% | fpaq0pv3 (original)
77.834Mbps 67.819Mbps | 159% 139% | fpaq0pv4 GCC 4.3.0 prof-gen build
89.309Mbps 74.677Mbps | 182% 153% | fpaq0pv4nc (blockwise eof encoding)
79.103Mbps 66.729Mbps | 162% 137% | fpaq0pv5 (original)
80.067Mbps 70.099Mbps | 164% 143% | fpaq0pv5 GCC 4.3.0 prof-gen build
81.834Mbps 74.656Mbps | 167% 153% | fpaq0pv4A (IntelC)
95.188Mbps 87.459Mbps | 194% 179% | fpaq0pv4Anc (IntelC)
106.537Mbps 97.165Mbps | 218% 199% | fpaq0pv4B (IntelC, not compatible with fpaq0p)

2. Classes made more readable using templates.
Well, actually not much code is left from the previous version.
3. Direct win32 i/o implemented (bypassing stdio),
but the only visible effect seems to be the smaller executable size.
4. Preprocessors implemented for translating classes into macros.
Couldn't find any other way for the rc to keep its state in local
variables. This gives most of the speedup. Another possible way would
be to pass the rc vars as arguments to all the functions, but it would
be too ugly and wouldn't guarantee that the compiler optimizes it properly.