Did you check what paq8 does? http://encode.ru/threads/1464-Paq8pxd-dict
So I decided to test it and made this sample: http://nishi.dreamhosters.com/u/dce_ja_utf8.txt
by google-translating Matt's DCE to Japanese and saving it as utf8 plaintext.
Then I converted it to utf16 and zero-padded it to 32 bits per symbol
(couldn't easily find a utf-32 converter).
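For anyone who wants to reproduce the padded file, here is a rough sketch of that conversion. The function name, the little-endian byte order and the absence of a BOM are my assumptions; the post doesn't say which were used. Note that this is not real utf-32: a surrogate pair stays as two separate 32-bit symbols.

```python
import struct

def utf8_to_padded_utf16(data):
    """Decode UTF-8, re-encode as UTF-16LE, widen each 16-bit code unit to 32 bits.

    Hypothetical helper: byte order and BOM handling are assumptions.
    """
    utf16 = data.decode("utf-8").encode("utf-16-le")  # "utf-16-le" writes no BOM
    units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)
    # Pack every 16-bit unit as an unsigned 32-bit little-endian integer,
    # which zero-pads the two high bytes.
    return struct.pack("<%dI" % len(units), *units)

if __name__ == "__main__":
    sample = "a\u3042".encode("utf-8")  # 'a' plus HIRAGANA LETTER A
    print(utf8_to_padded_utf16(sample).hex())
```

The output is always exactly twice the size of the utf16 file, since every 2-byte unit becomes 4 bytes.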
Then I compressed both with my old DOS coder (an 8-bit port of which I posted yesterday as "dosppm"),
which incidentally works with a 32-bit alphabet internally.
This coder, in addition to the actual compressed data, produces an offset table of the first
occurrence of each symbol encountered in the data, which is used to sort the alphabet etc.
(there's no sense in allocating 16G for stats if the only symbol used is 0xFFFFFFFF).
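To illustrate the idea of that table, here is a minimal sketch. It is not the coder's actual format, which isn't described here; the function names and the choice to count offsets in symbols rather than bytes are assumptions. It scans 32-bit little-endian symbols, records where each one first appears, and then maps the sparse symbol values onto small dense indices so that stats only need to be allocated for symbols that actually occur.

```python
import struct

def first_occurrence_table(data):
    """Return (symbol, offset) pairs in order of first appearance.

    Hypothetical layout: symbols are 32-bit little-endian, offsets are
    counted in symbols, and trailing bytes that do not fill a symbol are ignored.
    """
    count = len(data) // 4
    symbols = struct.unpack("<%dI" % count, data[:count * 4])
    seen = {}
    for pos, sym in enumerate(symbols):
        if sym not in seen:
            seen[sym] = pos  # remember only the first position
    return list(seen.items())  # dicts keep insertion order

def dense_alphabet(table):
    """Map each sparse 32-bit symbol to a small index 0..n-1."""
    return {sym: idx for idx, (sym, _) in enumerate(table)}

if __name__ == "__main__":
    data = struct.pack("<4I", 5, 0xFFFFFFFF, 5, 7)
    table = first_occurrence_table(data)
    print(table)
    print(dense_alphabet(table))
```

With a mapping like this, a file containing only 0xFFFFFFFF needs stats for a single symbol instead of a full 2^32-entry array.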
So with it, I compressed both dce_utf8 expanded to 32-bit per symbol, and similarly expanded utf16.
And results are as follows:
                  (1)     (2)     (3)   (4)   (5)
dce_ja_utf8.txt   391266  960     349   347   83893
dce_ja_utf32.txt  657972  261248  3853  3530  81673

(1) uncompressed size
(2) uncompressed size of offset table
(3) offset table compressed using old "asc" coder
(4) offset table compressed with paq8px69
(5) file's compressed size + asc-compressed table size

gain = (1-81673/83893)*100 = 2.646%

Some more results from other compressors, for comparison:

          utf8    utf16
.txt      391266  328986
.7z       97166   90147
.ccm      83980   82021  // ccmx 1.30c 5
.rar      79987   78515  // rar -m5 -mct+, aka ppmd vH
.paq8px   70698   71232  // paq8px69 -6
.paq8pxd  70568   70968  // paq8pxd_v4 -6; didn't detect utf8 though
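The gain figure for column (5) can be rechecked with a couple of lines. This is just the same formula wrapped in a helper; the name gain_percent is mine.

```python
def gain_percent(baseline, candidate):
    """Relative size reduction of candidate versus baseline, in percent."""
    return (1 - candidate / baseline) * 100

if __name__ == "__main__":
    # Column (5): utf8 total 83893 vs. 32-bit total 81673
    print("%.3f%%" % gain_percent(83893, 81673))  # prints 2.646%
```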