I've been talking to some graphics guys, and it seems that you can losslessly compress images by about a factor of two, and decompress them again, at hundreds of megabytes per second on a single core using SIMD instructions (e.g., SSE4). (Those figures are for 8-bit grayscale images from the imagecompression.info gray8 benchmark.)
Unfortunately, the people who know how to do this are not publishing how or releasing the code because it's not a standard compression format and their companies are pursuing standardized compression formats like JPEG-LS instead.
On the other hand, it seems that the techniques are not patented and it doesn't appear that they will be, so this might be a good thing for a free software person who's good at images and SIMD to look into, if they're so inclined.
This posting at Jan Wassenberg's blog is particularly intriguing:
Jan did some very cool work for his dissertation on symmetrical compression of 16-bit images, but that code did not work well for 8-bit color channels; the newer variant he describes does, at a cost in compression ratio of a bit under 20 percent.
On his blog he says:
"The most interesting part is fixed-to-variable coding for entire vector registers of bits. For the image residuals above, it compresses to within 5-20% of the entropy at speeds comparable to or exceeding state of the art integer coders (http://arxiv.org/abs/1209.2137). A single core can compress/decompress between 4600 MB/s (sparse bit arrays with 0.05% ones) and 1000 MB/s (relatively dense, 12% ones). Of course this scales to multiple cores, as there is no shared dictionary to update.
"Although the instruction set improved with SSE4, writing vector code is still an exercise in improvisation. Finding 1-bits is accomplished by a hash function based on de Bruijn sequences (http://supertech.csail.mit.edu/papers/debruijn.pdf) and using the FPU normalization hardware. Vectorized shifts are emulated with the PSHUFB universal shuffle instruction."
I'm not an expert on either images or SIMDfying algorithms, but that sounded very promising.
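To get a feel for what "within 5-20% of the entropy" means for the two bit densities Jan quotes, here's a back-of-the-envelope sketch. It assumes the bits are independent with a fixed probability of being 1, which is my simplification and not something the blog post states, and it reads "within 5-20%" as "the coded size is 5-20% larger than the entropy bound". The function name `binary_entropy` is mine.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits per input bit, of a source that emits
    a 1 with probability p and a 0 otherwise (bits assumed independent)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# The two densities mentioned in the quote: 0.05% ones and 12% ones.
for p in (0.0005, 0.12):
    h = binary_entropy(p)
    best = 1 / (1.05 * h)   # coded size 5% above the entropy bound
    worst = 1 / (1.20 * h)  # coded size 20% above the entropy bound
    print(f"p={p}: {h:.4f} bits/bit, ideal ratio {1 / h:.1f}x, "
          f"5-20% overhead gives {worst:.1f}x to {best:.1f}x")
```

Under that assumption the sparse case has an entropy of roughly 0.006 bits per input bit and the dense case about 0.53, so the sparse arrays can shrink enormously while the dense ones can shrink by less than a factor of two.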
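For anyone who hasn't seen the de Bruijn trick from the paper linked in the quote, here's a minimal scalar sketch of it in Python. This is only the textbook 32-bit version, not Jan's vectorized code, and it leaves out the FPU normalization part entirely. The names `lowest_set_bit_index` and `set_bit_indices` are mine; the constant 0x077CB531 is the commonly used 32-bit de Bruijn sequence.

```python
MASK32 = 0xFFFFFFFF
DEBRUIJN32 = 0x077CB531  # a de Bruijn sequence B(2, 5): every 5-bit window is unique

# Shifting the sequence left by i puts a distinct 5-bit pattern in the
# top five bits, so those bits act as a perfect hash of i.
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN32 << i) & MASK32) >> 27] = i

def lowest_set_bit_index(v):
    """Return the index of the least significant 1-bit of a nonzero 32-bit value."""
    if v <= 0 or v > MASK32:
        raise ValueError("expected a nonzero 32-bit unsigned value")
    isolated = v & -v  # keeps only the lowest set bit, i.e. 1 << index
    # Multiplying by a power of two is a left shift, so this hashes the index.
    return TABLE[((isolated * DEBRUIJN32) & MASK32) >> 27]

def set_bit_indices(v):
    """List the positions of all 1-bits in v, lowest first."""
    indices = []
    while v:
        indices.append(lowest_set_bit_index(v))
        v &= v - 1  # clear the lowest set bit
    return indices

print(set_bit_indices(0b101001))  # [0, 3, 5]
```

The appeal is that there's no loop over bit positions and no branch per bit: one multiply, one shift, and one table lookup per 1-bit found, which is presumably why a variant of it is attractive inside vector code.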