Thread: UTF-8 transformation to 7 bits format, helps compression

  1. #1
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts

    Hello,
    I wanted to try something different from SCSU or BOCU-1 to compress Unicode files, something that would stay closer to UTF-8.
    I've started writing a small transformation; once finished, it will turn any valid UTF-8 file into a file that is usually the same size.
    In terms of bytes, UTF-8 characters that take 1, 2, 3 or 4 bytes keep the same size after the transformation. The only exceptions are 29 of the 32 C0 control characters: HT (Tab), LF and CR still take only one byte, while the others need one more.
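A minimal sketch of the size rule just described, useful for sanity-checking a transformed file. It only predicts the output length from the rule as stated in the post; it does not implement the transformation itself, and the name `expected_u827_size` is made up for illustration.

```python
# C0 controls are U+0000..U+001F. Per the post, HT, LF and CR stay one byte;
# the other 29 are assumed to cost exactly one extra byte each.
ONE_BYTE_C0 = {0x09, 0x0A, 0x0D}

def expected_u827_size(data: bytes) -> int:
    """Predict the transformed size of a valid UTF-8 input (hypothetical helper)."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    escaped = sum(
        1 for ch in text
        if ord(ch) < 0x20 and ord(ch) not in ONE_BYTE_C0
    )
    return len(data) + escaped
```

For ordinary text with only tabs and line breaks as control characters, the predicted size equals the input size.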
    The new file contains only 7-bit values in the [0..127] range, whereas UTF-8 characters are more or less made of a 7.5-bit lead octet possibly followed by up to 3 continuation octets (these hold only 6-bit values). The output could therefore easily be shrunk by packing eight 7-bit values into seven 8-bit bytes. The funny thing is that the reduced alphabet even helps well-known compression algorithms (Deflate, Bzip2, LZMA...).
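The packing idea (eight 7-bit values into seven bytes) can be sketched as follows. This is generic bit packing, not code from u827; `pack7` and `unpack7` are illustrative names, and the choice of most-significant-bit-first order with zero padding is an assumption.

```python
def pack7(values: bytes) -> bytes:
    """Pack 7-bit values (0..127) into bytes, most significant bit first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits in acc
    out = bytearray()
    for v in values:
        if v > 0x7F:
            raise ValueError("input must contain only 7-bit values")
        acc = (acc << 7) | v
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)

def unpack7(packed: bytes, count: int) -> bytes:
    """Recover `count` 7-bit values; the count disambiguates the padding."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)
```

The value count has to be stored somewhere, because seven values also pack into seven bytes and the padding bits would otherwise decode as an eighth value.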

    My goal is to write a relatively simple and fast UTF-8 compressor; this transformation is only a first step (perhaps in the wrong direction).

    I would be very grateful if some of you could test the transformation (I call it UTF-827 for now) on UTF-8 text files, short or long, containing non-Latin scripts (Greek, Cyrillic, Devanagari, Chinese...). Please report whether it produces any savings when compressing the resulting file (using whatever compression tool you want, even your own prototypes) and whether the reverse transformation actually works.

    Things like this would be welcome:
    Code:
    $ ./u827 multi.txt multi.827
    $ ./u728 multi.827 multi-rev.txt
    $ md5sum multi.txt multi-rev.txt
    da16bb10d71a5deab49d664e9de67597  multi.txt
    da16bb10d71a5deab49d664e9de67597  multi-rev.txt
    $ xz multi.txt
    $ xz multi.827
    $ ls -l multi.*
    -rw-r--r-- 1 fred fred 908 2013-01-12 21:37 multi.827.xz
    -rw-r--r-- 1 fred fred 944 2013-01-12 14:05 multi.txt.xz
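For testers who prefer a script, the same kind of check can be sketched with Python's standard library. It assumes the original, transformed and reversed files have already been produced by `u827` and `u728`; the helper names are made up, and the sizes will not match the command-line tools exactly (different container headers and settings).

```python
import bz2
import hashlib
import lzma
import zlib

def same_md5(a: bytes, b: bytes) -> bool:
    """Round-trip check: the reversed file must hash like the original."""
    return hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest()

def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under three stdlib codecs."""
    return {
        "deflate": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ)),
    }

def savings(original: bytes, transformed: bytes) -> dict:
    """Bytes saved per codec; a positive number means the transform helped."""
    before = compressed_sizes(original)
    after = compressed_sizes(transformed)
    return {name: before[name] - after[name] for name in before}
```

Typical use would be to read `multi.txt`, `multi.827` and `multi-rev.txt` with `open(path, "rb").read()`, check `same_md5` on the first and last, then print `savings` for the first two.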
    Alpha2 released for Linux, Mac OS X and Windows (Intel x86). It has no known limitations, but it could be faster.
    Last edited by caveman; 14th January 2013 at 14:06.

  2. #2
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts
    Compresses worse with lpaq8?

    $ ../godlike/bin/lpaq8 7 sangoku.txt 8.paq
    1797079 -> 486991 in 7.560 sec. using 390 MB memory
    $ ../godlike/bin/lpaq8 7 sangoku.827.txt 7.paq
    1797079 -> 487205 in 7.840 sec. using 390 MB memory

  3. #3
    Member caveman's Avatar
    Originally posted by RichSelian:
        Compresses worse with lpaq8?
    Thank you for the test.
    That can happen when the compressor has its own UTF-8 filter. Also, because the context is modified (continuation octets look like lead octets), the transformation could hurt compressors that rely on precise context modelling.

  4. #4
    Member caveman's Avatar
    Released u827 alpha2, this time for Linux, Mac OS X and Windows.

    Regarding the sangoku.txt file, the transformation helped Deflate (Gzip), Bzip2 and PAQ8L; on the other hand, the Deflate (Kzip), LZMA2 (xz) and LPAQ8 results are bigger.

    786344 sangoku.utf.gz
    783175 sangoku.827.gz (smaller)

    733584 sangoku.utf.kzip.gz
    740003 sangoku.827.kzip.gz (bigger)

    597008 sangoku.utf.xz
    598576 sangoku.827.xz (bigger)

    566176 sangoku.utf.bz2
    563666 sangoku.827.bz2 (smaller)

    470130 sangoku.utf.paq8l
    468806 sangoku.827.paq8l (smaller)


    On ENWIK8:

    36445248 enwik8.utf.gz
    36296638 enwik8.827.gz (smaller)

    35025675 enwik8.utf.kzip.gz
    34889162 enwik8.827.kzip.gz (smaller)

    29008758 enwik8.utf.bz2
    29035961 enwik8.827.bz2 (bigger)

    26375764 enwik8.utf.xz
    26402316 enwik8.827.xz (bigger)

    PAQ8L still running
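To put the figures above in proportion, here is a small sketch that turns each pair into a relative change. The sizes are copied from this post; the helper names are made up for illustration.

```python
# (compressor, size of plain UTF-8 result, size of UTF-827 result), in bytes
SANGOKU = [
    ("gzip", 786344, 783175),
    ("kzip", 733584, 740003),
    ("xz", 597008, 598576),
    ("bzip2", 566176, 563666),
    ("paq8l", 470130, 468806),
]
ENWIK8 = [
    ("gzip", 36445248, 36296638),
    ("kzip", 35025675, 34889162),
    ("bzip2", 29008758, 29035961),
    ("xz", 26375764, 26402316),
]

def relative_change(before: int, after: int) -> float:
    """Size change in percent; negative means the transformed file is smaller."""
    return 100.0 * (after - before) / before

def report(rows) -> None:
    """Print the byte delta and percentage for each compressor."""
    for name, before, after in rows:
        pct = relative_change(before, after)
        print(f"{name:6s} {after - before:+9d} bytes  {pct:+.2f}%")

if __name__ == "__main__":
    report(SANGOKU)
    report(ENWIK8)
```

In every case listed, the difference is below 1% in either direction.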