Thanks Shelwien!![]()
http://shelwien.googlepages.com/fpaq0pv4b.rar
http://shelwien.googlepages.com/fpaq0pv4B.htm
<div class="jscript"><pre> 48.940Mbps 48.862Mbps | 100% 100% | fpaq0p (original)
56.590Mbps 54.367Mbps | 116% 111% | fpaq0p IntelC build
56.987Mbps 49.813Mbps | 116% 102% | fpaq0pv2 (original)
64.992Mbps 53.078Mbps | 133% 109% | fpaq0pv3 (original)
77.834Mbps 67.819Mbps | 159% 139% | fpaq0pv4 GCC 4.3.0 prof-gen build
89.309Mbps 74.677Mbps | 182% 153% | fpaq0pv4nc (blockwise eof encoding)
79.103Mbps 66.729Mbps | 162% 137% | fpaq0pv5 (original)
80.067Mbps 70.099Mbps | 164% 143% | fpaq0pv5 GCC 4.3.0 prof-gen build
81.834Mbps 74.656Mbps | 167% 153% | fpaq0pv4A (IntelC)
95.188Mbps 87.459Mbps | 194% 179% | fpaq0pv4Anc (IntelC)
106.537Mbps 97.165Mbps | 218% 199% | fpaq0pv4B (IntelC, not compatible with fpaq0p)
</pre></div>
1. Weird Matt's carryless rc replaced with sh_v1m port,
which supports single-step renormalization, and also
other optimizations, like 16bit i/o (IntelC somehow
optimizes it using vectorization).
2. Classes made more readable using templates.
Well, actually not much code left from previous version.
3. Direct win32 i/o implemented (bypassing stdio),
but only visible effect seem to be the smaller executable size.
4. Preprocessors implemented for classes translation into macros -
couldn't find any other way for rc to use local variables for its
state. This gives most of speedup. Other possible way could be
adding rc vars as arguments to all the functions, but it would be
too ugly and won't guarantee that compiler will properly optimize it.
Thanks Shelwien!![]()
Btw, does anyone know how to disable the gcc "anonymous struct" error? Its really inconvenient.
stdio translates to the same win32 calls. so it sghould be enough to use read/write operations instead of getc/putc. if you need more speed - use separate i/o threadsOriginally Posted by Shelwien
> stdio translates to the same win32 calls. so it should be enough to use
> read/write operations instead of getc/putc.
Not exactly. It also uses an extra file descriptor
(FILE* != windows handle)
So executable size with win32 i/o is 3-4k less than with stdio.
Also I actually expected to gain something from CreateFile()
flags (WRITE_THROUGH/NO_BUFFERING/SEQUENTIAL_SCAN etc), but
it didn't help anyway.
> if you need more speed - use separate i/o threads
For single-core cpu that surely would be slower.
Memory-mapped i/o might help though.
But anyway I'm only interested in algorithmic optimizations.
Well, including compiler tweaks.
And actually I was just trying to prove (if only to myself)
that my rc is better than Matt's![]()
Your sources is very hard to read - these .INC files...Originally Posted by Shelwien
Matts coder is *VERY* simple, also its efficient. Im not sure that it is possible to do something principally simpler and more efficient. Anyway, still the catch is in modeling. FPAQ0P-like model is very fast. Maybe, with ideas like multi-threaded entropy coder we catching the fleas...
![]()
Can this improved coder be integrated into LPAQ?
well, i mean that if you read/fread in large blocks, its translate to the same win32 callsOriginally Posted by Shelwien
are you ever tried?Originally Posted by Shelwien
i tested mm io on windows - its faster for reading from disk cache (because we avoid memcpy) but for reading from disk its slower! probably because read calls are optimized by windows with read-ahead while mm files are supposed to be accessed in random order
OTOH, using async i/o calls or separate i/o threads allows to overlap compression and DMA transfers to/from disk. if your disk still use PIO mode, just upgrade your 386-based box to i486 or something newer![]()
win32 calls are already async enough as they are.
I mean, WriteFile() surely doesn't wait until block is written,
and ReadFile() actually copies the data from cache... well, with
reasonable block sizes.
And caching system works in separate thread(s) anyway.
But Read/Write operations at least have to copy the data
into user's buffer - that's where having another core for i/o
would help. But I don't see how a separate i/o thread
could help with single cpu core.
And btw, my dataset is only 64M so there's almost no disk i/o anyway.
MM, on the other hand, is supposed to read the data into
the same memory where you use it, avoiding the buffering.
But I guess with it you have to implement some prefetching
and delayed write commits, so that's where threading would
be really useful.
encode wrote:
>> And actually I was just trying to prove (if only to myself)
>> that my rc is better than Matt's
>
> Your sources is very hard to read - these .INC files...
Believe me, it would be even harder to read if I merged it.
Also there's no other way with so much replaceable parts.
> Matt's coder is *VERY* simple,
1. Only Matt's renormalization for encoding is simpler.
And decoding in rcs with carry always was faster than carryless
2. Visual simplicity doesn't really mean anything.
Eg. do you understand why Matt's coder doesn't need any
range cutting like in older carryless rcs?
> also it's efficient.
1. I _proved_ that my counterpart is more efficient,
by testing both rc_mm and rc_sh in the same environment
fpaq0pv4B.htm:
43769221 6.030s 6.579s // fpaq0pv4Anc0
43769372 5.811s 6.564s // best result for rc_mm
43780374 4.857s 5.342s // best result for rc_sh
In v4B, comparing to v4A, rc_mm version became faster too,
but really not as much.
2. Matts coder's drawback is that its renormalization
always has to support all shifts up to 4 bytes, because
even x1>x2 is possible. And because of the same reason
it also can't be switched to 16bit granularity - it can't
afford losing any more precision.
Another thing is having the same renormalization in the
decoder, but I already mentioned that.
> I'm not sure that it is possible
> to do something principally simpler and more efficient.
Dunno about simplicity, but more efficient we do already have.
> Anyway, still the catch is in modeling.
> FPAQ0P-like model is very fast.
I didn't change the model, only rewritten it with templates to
reduce branch count and improve compiler optimizations.
> Maybe, with ideas like multi-threaded entropy coder we catching the fleas...
???
fpaq0p4B can be made faster by multi-threading, if you meant that.
Thats a totally different direction from my optimizations.
Black_Fox wrote:
> Can this improved coder be integrated into LPAQ?
Most of fpaq0pv3->fpaq0pv4B optimizations should be applicable.
hi shelwien,
so tested different file access modes and it turned out the speed of posix vs win native unter windows is about 99% equal... and thats why i would go portable and use posix then... besides these two modes there is the memorymap mode... ist very easy to program but not faster... and the last mode is the async io... this is where you can realy gain advantage! but its much code to write for file operation then... and you have to do some threading by hand to fully gain from it.
why? just process file in blocks and make some function for management. asynchronous io is somewhat similiar to triple- buffered graphics output.Originally Posted by eugene
Shelwien
I meant that MT is too heavy for simple entropy coder. Also, it is possible that RC not needs in such heavy stuff.
According to results you did a nice coder! I will do an additional test with RC with carry on my BALZ - since we deal with asymmetric things here.![]()
At the same time, comparing to pure FPAQ0P is not fair since you also improved other things like better EOF coding - no need in extra bit coding per each byte, plus you changed the I/O routines, and so on. So, 2X speed up is not only gained by the better RC as itself.
![]()
The effect from bytewise eof removal was surprisingly small - only ~0.3s in my tests.
Also i/o replacement wasn't that good either - well, it was for intel compiles, but v3 as a gcc prof_gen build was quite ok even with getc/putc.
So what really helped was branch merging, model structure changes, removal of bit extraction/combining and other simplifications... and rc replacement.