Okay, so I still wanted a way for paq8f (and others) to use the SIMD speedups if available (detected via CPUID), but my lousy paq8o8z port is obviously not interesting to anybody but myself. The only appeal of paq8f (to me) over newer versions is the lower minimum RAM (21 MB vs. 37 MB for paq8o).
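For anybody wondering what "if available (CPUID)" boils down to, here's a rough C++ sketch of the idea. It is not what fastpaq2.s does internally (that's assembly). The names SimdLevel, simd_level_from_edx, and detect_simd are made up for illustration, and I'm assuming a GCC-style compiler that ships <cpuid.h> with __get_cpuid(). The bit positions are the standard CPUID leaf 1 EDX flags (bit 23 = MMX, bit 26 = SSE2).

```cpp
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  // GCC/Clang helper header (assumed available)
#endif

// Hypothetical names for illustration only; nothing here is from fastpaq2.s.
enum SimdLevel { SIMD_NONE = 0, SIMD_MMX = 1, SIMD_SSE2 = 2 };

// Decode the EDX feature flags returned by CPUID leaf 1.
SimdLevel simd_level_from_edx(unsigned edx) {
    const unsigned MMX_BIT  = 1u << 23;  // CPUID.1:EDX bit 23
    const unsigned SSE2_BIT = 1u << 26;  // CPUID.1:EDX bit 26
    if (edx & SSE2_BIT) return SIMD_SSE2;
    if (edx & MMX_BIT)  return SIMD_MMX;
    return SIMD_NONE;
}

// Ask the CPU at run time; report SIMD_NONE on non-x86 targets
// or when CPUID leaf 1 is not supported.
SimdLevel detect_simd() {
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return simd_level_from_edx(edx);
#endif
    return SIMD_NONE;
}
```

A caller would check detect_simd() once at startup and pick the SSE2, MMX, or plain routine accordingly. Note this only tells you what the CPU supports; whether the OS has actually enabled SSE is a separate question.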
Nevertheless, I did convert the assembly to GAS, and this version doesn't need any changes to any .CPP files (well, apart from obvious things like working around tmpfile() in MinGW [_tempnam() ??] or deleting the bogus "Mixer::" from paq8f that newer G++ versions bomb out on). Note that I tested it with paq8f, paq8l, and the latest(?) paq8px, and it seemed to work fine (the output md5sums matched those of the original compiles).
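As for the tmpfile() workaround, something along these lines is what I had in mind. This is only a sketch: open_scratch_file is a made-up helper name (not in any paq8 source), and I'm assuming the MinGW build uses the Microsoft C runtime, where _tempnam() exists and fopen() accepts the "D" (delete on close) mode flag. Everywhere else it just falls back to plain tmpfile().

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not from paq8): open a read/write scratch file.
std::FILE* open_scratch_file() {
#if defined(_WIN32)
    // Assumption: Microsoft C runtime. _tempnam() returns a malloc'ed
    // path name, or NULL on failure; "paq" is just a file name prefix.
    char* name = _tempnam(NULL, "paq");
    if (!name) return NULL;
    // "D" is a Microsoft extension: delete the file when it is closed.
    std::FILE* f = std::fopen(name, "w+bD");
    std::free(name);
    return f;
#else
    // Elsewhere the standard function should be fine.
    return std::tmpfile();
#endif
}
```

The idea would be to swap calls to tmpfile() for open_scratch_file() in the .CPP file. If the compiler is run in a strict ANSI mode, _tempnam() may not be declared, so that part might need adjusting.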
It's easy to convert this to FASM, NASM, LZASM, or WASM/JWASM syntax (as I did previously), but GCC / BinUtils is so dang common that I figured this was the most popular + portable way (at least initially).
It only includes two files: fastpaq2.s and GNUmakefile (for testing).
Tested with Cygwin, unofficial MinGW (G++ 4.4.0), and various DJGPP versions (including DJELF hack).
P.S. It's not the cleanest source ever. Some of the stuff is redundant (e.g. enable_sse2), and even that could be improved (I've already written a test using fxsave to see if SSE is already enabled). But since I don't figure anybody here cares (a "modern" OS usually does it for you), I didn't bother to fix any of that (yet).
EDIT: Nov. 26, fixed small typo/potential bug.
EDIT: Dec. 22, updated GAS version, added NASM version.