Ok, here's a version with interpolated SSE2.
The compression improvement from it is pretty good (and very close to what I expected),
but unfortunately it's also slow, which is no wonder with 11 extra MULs and 1 DIV.
So I guess it won't be of much help for BWT postcoders, even if we'd optimize it,
but still, it's a component with a better effect than logistic mixing, so it has its uses.
book1bwt  enwik8bwt  enctime  dectime
216838    21337208    6.015s   6.797s  // ic111, o01_v0, static linear mix
216816    21334589    6.344s   6.688s  // ic110_no_PGO, mixtest_v2, static linear mix
216098    21204128    6.797s   7.156s  // ic110_no_PGO, mixtest_v2, adaptive linear mix
214695    21065497    9.954s  10.609s  // ic110_no_PGO, mixtest_v3, adaptive logistic mix
212624    21220420   21.469s  21.031s  // ic110, o01_v1 = SSE2i(o0,o1), book1bwt profile
213129    21040485   22.188s  22.718s  // ic110, o01_v1 = SSE2i(o0,o1), enwik8bwt profile
212478    21098548   23.234s  23.578s  // ic110_no_PGO, o01_v1, o1 dual update, book1bwt profile
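
For readers who haven't seen the idea, here's a rough sketch of what a 2D interpolated SSE stage can look like. This is not the o01_v1 code: the struct name, grid size, update rate and the linear quantization are all assumptions made for illustration, and it doesn't try to match the MUL/DIV count mentioned above. It takes two 12-bit probabilities (think o0 and o1), bilinearly interpolates between the four nearest grid vertices, and then updates those vertices in proportion to their interpolation weights.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 2D interpolated SSE; every name and constant here is illustrative.
struct SSE2i {
    static constexpr int kScale = 4096;            // 12-bit probabilities, 0..4095
    static constexpr int kCells = 16;              // grid intervals per axis (assumed)
    static constexpr int kStep  = kScale / kCells; // 256 probability units per cell
    static constexpr int kSide  = kCells + 1;      // vertices per axis
    static constexpr int kRate  = 5;               // update speed as a shift (assumed)

    std::vector<int32_t> t;       // vertex values: probability with 4 extra bits
    int idx = 0, fx = 0, fy = 0;  // context saved by predict() for update()

    SSE2i() : t(kSide * kSide) {
        // Start each vertex at the plain average of its two input probabilities.
        for (int y = 0; y < kSide; ++y)
            for (int x = 0; x < kSide; ++x) {
                int p = (x + y) * kStep / 2;
                if (p < 1) p = 1;
                if (p > kScale - 1) p = kScale - 1;
                t[y * kSide + x] = p * 16;
            }
    }

    // Bilinear interpolation between the four vertices around (p0, p1).
    int predict(int p0, int p1) {
        assert(p0 >= 0 && p0 < kScale && p1 >= 0 && p1 < kScale);
        int ix = p0 / kStep, iy = p1 / kStep;
        fx = p0 % kStep;
        fy = p1 % kStep;
        idx = iy * kSide + ix;
        int64_t top = int64_t(t[idx]) * (kStep - fx) + int64_t(t[idx + 1]) * fx;
        int64_t bot = int64_t(t[idx + kSide]) * (kStep - fx)
                    + int64_t(t[idx + kSide + 1]) * fx;
        int64_t p16 = (top * (kStep - fy) + bot * fy) / (int64_t(kStep) * kStep);
        int p = int(p16 / 16);
        if (p < 1) p = 1;
        if (p > kScale - 1) p = kScale - 1;
        return p;
    }

    // Move each of the four vertices toward the coded bit, scaled by its weight.
    void update(int bit) {
        const int64_t target = bit ? 65535 : 0;
        const int     off[4] = { 0, 1, kSide, kSide + 1 };
        const int64_t w[4]   = { int64_t(kStep - fx) * (kStep - fy),
                                 int64_t(fx) * (kStep - fy),
                                 int64_t(kStep - fx) * fy,
                                 int64_t(fx) * fy };
        for (int i = 0; i < 4; ++i) {
            int32_t& v = t[idx + off[i]];
            // w is at most kStep*kStep = 2^16, so divide by 2^(16 + kRate).
            v += int32_t((target - v) * w[i] / (int64_t(1) << (16 + kRate)));
        }
    }
};
```

Usage is the usual pair: call predict(p0, p1) to get the probability for coding the bit, then update(bit) once the bit is known. The multiplications in the interpolation and in the weighted update are where the extra cost comes from; a real version would likely quantize in the stretch domain and tune the grid and rate per input.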