I wanted to try something different from SCSU or BOCU-1 to compress Unicode files, something that would stay closer to UTF-8.
I've started writing a small transformation; once finished, it will turn any valid UTF-8 file into a file that is usually the same size.
From a byte point of view, UTF-8 characters that take 1, 2, 3 or 4 bytes keep the same size after the transformation. The only exceptions are 29 of the 32 C0 control characters: HT (Tab), LF and CR still take only one byte; the others need one more.
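The post doesn't describe the actual mechanism, but one hypothetical way the "one extra byte" cost for those 29 controls could arise is an escape scheme: the three common controls pass through unchanged, while the rest become a two-byte marker sequence. The marker value and the `+0x20` shift below are my own illustrative choices, not the author's scheme.

```python
KEEP = {0x09, 0x0A, 0x0D}  # HT, LF, CR pass through as single bytes
MARK = 0x01                # hypothetical escape marker (my choice)

def escape_c0(data: bytes) -> bytes:
    """Sketch: escape the 29 'expensive' C0 controls as two bytes each.
    Input bytes 0x00-0x1F (except HT/LF/CR) become MARK + (byte + 0x20),
    so the output stays in the 7-bit range and the scheme is reversible,
    since MARK itself only ever appears as an escape prefix."""
    out = bytearray()
    for b in data:
        if b < 0x20 and b not in KEEP:
            out += bytes([MARK, b + 0x20])
        else:
            out.append(b)
    return bytes(out)
```

Under such a scheme every escaped control costs exactly one extra byte, matching the size behavior described above.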
The new file contains only 7-bit values in the [0..127] range, whereas UTF-8 characters are more or less made of a lead octet carrying roughly 7.5 bits, possibly followed by up to 3 continuation octets (each holding only 6 bits of payload). The output could therefore easily be shrunk by packing 8 7-bit values into 7 8-bit bytes. The funny thing is that the reduced alphabet even helps well-known compression algorithms (Deflate, Bzip2, LZMA...).
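The 8-into-7 packing is just bit accumulation (8 × 7 = 56 bits = 7 bytes); a minimal sketch of one way to do it, using a simple bit accumulator:

```python
def pack_7to8(data: bytes) -> bytes:
    """Pack a stream of 7-bit values (each < 128) densely:
    every 8 input bytes become 7 output bytes."""
    out = bytearray()
    acc = 0       # bit accumulator
    nbits = 0     # number of bits currently held in acc
    for b in data:
        assert b < 128, "input must be 7-bit values"
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        # pad the final partial byte with zero bits
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)
```

Note that a real unpacker would also need the original length (or a sentinel) to distinguish trailing padding bits from genuine zero values.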
My goal is to write a relatively simple/fast UTF-8 compressor; this transformation is only a first step (perhaps in the wrong direction).
I would be very grateful if some of you could test the transformation (I call it UTF-827 for now) against UTF-8 text files, short or long, containing non-Latin scripts (Greek, Cyrillic, Devanagari, Chinese...), and report whether it produces any savings when compressing the resulting file (with whatever compression tool you like, even your own prototypes) and whether the reverse transformation works correctly.
Things like this would be welcome:
Alpha2 released for Linux, Mac OS X and Windows (Intel x86); it has no known limitations, but could be faster.
./u827 multi.txt multi.827
./u728 multi.827 multi-rev.txt
md5sum multi.txt multi-rev.txt
ls -l multi.*
-rw-r--r-- 1 fred fred 908 2013-01-12 21:37 multi.827.xz
-rw-r--r-- 1 fred fred 944 2013-01-12 14:05 multi.txt.xz