I created 4 experimental ZPAQ-compatible, BWT-based compressors. The goal is fast but reasonably good compression. Some benchmarks:

Code:
  config    calgary.tar     enwik8  Comp(s)  Decomp(s)  Mem(MB)
  --------  -----------  ---------  -------  ---------  -------
  original      3152896  100000000        0          0        0
  bwt1           844305   21965906       95         69      671
  bwt2           800498   21246030      122         93      671
  bwtrle1        861300   22823439       82         56      671
  bwtrle2        828249   21985593       97         66      672

  min            1027299   33460930       41         43        4
  fast            806959   24837469       78         86       38
  mid             699191   20941558      265        301      111
  max             644190   19448650      716        727      246
Times are process times in seconds with zpaq v2.05 on one of two cores to compress and decompress enwik8 on a 2.0 GHz T3200 with 3 GB memory under 32 bit Windows. The current version sorts the whole file as one block, so it is limited to files smaller than 128 MB and uses 640 MB of memory. These are experimental models. A production version (maybe added to pzpaq) would split the input into smaller blocks to remove the file size limitation and process them in separate threads. Compression would also skip the preprocessor check (26 seconds on enwik8) once I am sure there are no bugs in it.

BWT encoding uses libdivsufsort-lite (30 sec. for enwik8); the inverse BWT is coded in both C++ and ZPAQL for comparison. BWT decoding enwik8 takes about 23 sec. in C++, 26 sec. in automatically converted ZPAQL (both compiled with g++ 4.5.0), and 61 sec. when the ZPAQL is interpreted. I modified the BWT to include the EOF byte to simplify the decoding.
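To show the decoding side, here is a minimal C++ sketch of an inverse BWT over a 257-symbol alphabet where an appended EOF symbol (assumed here to be value 256, sorting above every byte) marks the end, so no primary index needs to be stored. It is a generic illustration of the idea, not the code in these configs.

Code:
  #include <cstddef>
  #include <vector>

  // Inverse BWT sketch.  L is the BWT of the input including a single EOF
  // symbol (value 256); the input is assumed non-empty.  Because EOF is the
  // largest symbol and occurs once, the rotation starting with it is the
  // last row, so no primary index has to be transmitted.
  std::vector<int> inverse_bwt(const std::vector<int>& L) {
      const int SIGMA = 257;                         // 256 byte values + EOF
      const std::size_t n = L.size();
      std::vector<std::size_t> start(SIGMA + 1, 0);  // start[c] = first row beginning with c
      for (int c : L) ++start[c + 1];
      for (int c = 0; c < SIGMA; ++c) start[c + 1] += start[c];

      // LF[i] = row whose first column holds the same symbol occurrence as L[i]
      std::vector<std::size_t> LF(n), seen(SIGMA, 0);
      for (std::size_t i = 0; i < n; ++i)
          LF[i] = start[L[i]] + seen[L[i]]++;

      // Walk the LF chain from the EOF row, producing the text backwards.
      std::vector<int> out(n - 1);                   // decoded text without EOF
      std::size_t p = n - 1;
      for (std::size_t j = n - 1; j-- > 0; ) {
          out[j] = L[p];
          p = LF[p];
      }
      return out;
  }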

As an optimization, two of the config files also run length encode after BWT. Runs of 2 to 257 bytes are encoded as the first 2 bytes plus a count byte giving the number of extra repeats; longer runs take more than one code. RLE is very fast, 2 seconds at most, and speeds up modeling at some cost in compression. The models are more complicated because they have to model the RLE parse state (literal or count) as context.
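As a rough illustration, here is a minimal encoder sketch for that scheme, assuming the count byte holds the number of repeats beyond the first two (0-255) and that longer runs are simply split; the config files may differ in the details.

Code:
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // RLE sketch: a run of 2..257 equal bytes becomes the first two bytes plus
  // a count byte (0..255 extra repeats); longer runs are split into several
  // such codes.  Single bytes pass through unchanged.
  std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in) {
      std::vector<uint8_t> out;
      std::size_t i = 0;
      while (i < in.size()) {
          std::size_t run = 1;                       // run length at i, capped at 257
          while (i + run < in.size() && in[i + run] == in[i] && run < 257) ++run;
          out.push_back(in[i]);
          if (run >= 2) {
              out.push_back(in[i]);                  // second copy signals that a count follows
              out.push_back(uint8_t(run - 2));       // number of additional repeats
          }
          i += run;
      }
      return out;
  }

The decoder mirrors this: after two equal bytes it expects a count byte, which is the parse state the models have to track.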

For each model, I used a fast version (bwt*1.cfg) and a slower version (bwt*2.cfg) with 1 or 2 ZPAQ components respectively. The fast version is an indirect order 0 model. The slow version is an order 0-1 ISSE chain.

I also experimented with BWT+MTF (move to front) and BWT+MTF+RLE, but compression was a lot worse and MTF offers no speedup. RLE does speed up compression, but the size penalty is larger than I expected.
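For reference, a generic move-to-front sketch (not tied to these configs): each byte is replaced by its current index in a recency list of the 256 byte values, and that byte then moves to the front, so after a BWT the long runs of identical bytes become runs of zeros.

Code:
  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <numeric>
  #include <vector>

  // Move-to-front transform: replace each byte by its position in a
  // recency-ordered list, then move that byte to the front of the list.
  std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
      std::vector<uint8_t> order(256);
      std::iota(order.begin(), order.end(), 0);      // initial order 0,1,...,255
      std::vector<uint8_t> out;
      out.reserve(in.size());
      for (uint8_t c : in) {
          std::size_t pos = std::find(order.begin(), order.end(), c) - order.begin();
          out.push_back(uint8_t(pos));               // emit the recency rank
          order.erase(order.begin() + pos);          // move c to the front
          order.insert(order.begin(), c);
      }
      return out;
  }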

Code: bwt.1.zip from http://mattmahoney.net/dc/zpaq.html#configurations
Technical details in the readme file.