
Thread: UTF-8 dedicated compression?

  1. #1
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts

    UTF-8 dedicated compression?

    Hello,
    I've found at least one case of UTF-8 French text where NFD (Normalization Form D) produces smaller files once compressed (bzip2 or gzip) than the more usual NFC (Normalization Form C).
    But I'm looking for a compressor that takes advantage of the UTF-8 specification (for instance, the byte FF never appears, non-ASCII characters have a variable but known-in-advance length, and each continuation byte starts with the bits 10...). Does this even exist? It could help the IETF HTTP working group move beyond Deflate in HTTP 2.0.
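
    For illustration, a minimal Python sketch of this kind of NFC-vs-NFD comparison ("sample.txt" is just a placeholder for any UTF-8 text file):
    Code:
        import bz2, unicodedata, zlib

        text = open("sample.txt", encoding="utf-8").read()

        for form in ("NFC", "NFD"):
            data = unicodedata.normalize(form, text).encode("utf-8")
            print(form,
                  "raw:", len(data),
                  "deflate:", len(zlib.compress(data, 9)),
                  "bzip2:", len(bz2.compress(data, 9)))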

  2. #2
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts

  3. #3
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by roytam1 View Post
    Hmmm, it looks like it does not directly apply to UTF-8; it's just another compact Unicode form.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Did you check what paq8 does? http://encode.ru/threads/1464-Paq8pxd-dict


    Update:

    So I decided to test it and made this sample: http://nishi.dreamhosters.com/u/dce_ja_utf8.txt
    by google-translating Matt's DCE to Japanese and saving it as UTF-8 plain text.
    Then I converted it to UTF-16 and zero-padded it to 32 bits per symbol
    (I couldn't easily find a UTF-32 converter).
    I compressed both with my old DOS coder (the 8-bit port of which I posted yesterday as "dosppm"),
    which incidentally works internally with a 32-bit alphabet.
    In addition to the actual compressed data, this coder produces an offset table with the first
    occurrence of each symbol encountered in the data, which is used to sort the alphabet etc.
    (no sense allocating 16G for stats if the only symbol used is 0xFFFFFFFF).
    So with it, I compressed both dce_utf8 expanded to 32 bits per symbol and the similarly expanded UTF-16.
    And the results are as follows:
    Code:
                       (1)     (2)  (3)  (4)    (5)
    dce_ja_utf8.txt  391266    960  349  347  83893
    dce_ja_utf32.txt 657972 261248 3853 3530  81673
    
    (1) uncompressed size
    (2) uncompressed size of offset table
    (3) offset table compressed using old "asc" coder
    (4) offset table compressed with paq8px69
    (5) file's compressed size + asc-compressed table size
    
    gain = (1-81673/83893)*100 = 2.646%
    Some more results from other compressors, for comparison:
    Code:
                    utf8  utf16
    .txt          391266 328986
    .7z            97166  90147
    .ccm           83980  82021 // ccmx 1.30c 5
    .rar           79987  78515 // rar -m5 -mct+, aka ppmd vH
    .paq8px        70698  71232 // paq8px69 -6
    .paq8pxd       70568  70968 // paq8pxd_v4 -6; didn't detect utf8 though
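
    For anyone who wants to reproduce the expansion step described above, here is a minimal Python sketch (an assumption about the exact procedure: UTF-16 code units zero-extended to 32 bits, which equals true UTF-32 only for BMP-only text such as this Japanese sample):
    Code:
        import struct

        text = open("dce_ja_utf8.txt", encoding="utf-8").read()

        utf16 = text.encode("utf-16-le")                      # 2 bytes per code unit
        units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)
        padded32 = struct.pack("<%dI" % len(units), *units)   # zero-pad each unit to 4 bytes

        with open("dce_ja_utf32.txt", "wb") as f:
            f.write(padded32)

        print(len(text.encode("utf-8")), len(utf16), len(padded32))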

  5. #5
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by Shelwien View Post
    Did you check what paq8 does? http://encode.ru/threads/1464-Paq8pxd-dict

    So I decided to test it and made this sample: http://nishi.dreamhosters.com/u/dce_ja_utf8.txt
    by google-translating Matt's DCE to japanese and saving as utf8 plaintext.
    ...
    I did not check paq8; I was looking for something a bit less heavyweight.

    Why did you only pick Japanese?
    It's a specific case, as Arabic, Russian, Greek, Hebrew, Thai... would have been (most of their characters take two or three bytes in UTF-8).
    Many other European languages (Spanish, German, French, Danish, Italian...) use mostly plain ASCII characters (1 byte) with some incursions into the so-called "Latin extensions" (2 bytes), mainly for diacritics.
    In that case I found that Normalization Form D can sometimes help compression (it does with deflate/gzip and bzip2, even though the uncompressed size is bigger).
    Since deflate and bzip2 do not handle UTF-8 specifically, I was looking for something a bit more specific.
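
    To make the effect concrete, a small illustrative Python snippet (the word used is just an example) showing what NFD does at the byte level: accented letters become a plain ASCII base letter plus a combining mark, so the base letter repeats more often and matches better.
    Code:
        import unicodedata

        s = "élevé"  # example word
        for form in ("NFC", "NFD"):
            b = unicodedata.normalize(form, s).encode("utf-8")
            print(form, b.hex(" "))
        # NFC: c3 a9 6c 65 76 c3 a9
        # NFD: 65 cc 81 6c 65 76 65 cc 81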

    It would be interesting to have a compression method that is faster than deflate and produces better results on relatively short sequences (between 5 and 80 kilobytes; a web page, not an entire book) of mixed HTML + native-language text encoded in UTF-8, something that could compress web traffic in real time.


    For those interested in the NFC vs NFD topic, here is what I've found on a small text excerpt (in French) from "Around the World in 80 Days" by Jules Verne, comparing normalization forms C and D.

    Normalization form C (NFC):
    Original file size: 1648 bytes
    Compressed deflate stream: 7130 bits (gzip file 910 bytes)
    [Attached image: c.png (deflate bit-cost heatmap, NFC)]

    Normalization form D (NFD):
    Original file size: 1692 bytes
    Compressed deflate stream: 7102 bits (gzip file 906 bytes)
    [Attached image: d.png (deflate bit-cost heatmap, NFD)]

    Color scale used:
    [Attached image: thermal-colors.png (color scale)]
    Deep blue means less than one bit, the next color less than 2 bits, and so on; orange is less than 8 bits, and red denotes expansion (more than 8 bits; dark red is worse than bright red).
    Blue spots are usually LZ77 matches.
    Attached Files: utf8-nfcvsnfd.zip (2.0 KB)
    Last edited by caveman; 29th November 2012 at 03:22. Reason: Added color scale

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I did not check paq8, I was looking for something a bit less heavyweight.

    I meant checking how it handles utf8, not using it.
    paq8 is only "heavyweight" because it uses lots of different models at once,
    but you don't have to do that if you know the type of your data.

    > Why did you only pick Japanese?

    Because it's relevant for me.

    > utf8-nfcvsnfd.zip (2.0 KB, 2 views)

    This is kinda interesting, but these weird-'e' letters actually look different when I open them in Notepad
    (they're somehow subscript in the NFD version).

    And anyway, I guess this kind of UTF-8 requires different handling than my use cases do.
    I'd replace the extra letters with similar Latin ones, and then add an extra CM stream
    containing extension flags for these letters.
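
    One possible reading of that idea, sketched in Python (an assumption, not Shelwien's actual scheme): use NFD to find the base Latin letter and route the stripped combining marks into a separate per-letter side stream that a CM coder could model on its own.
    Code:
        import unicodedata

        def split_streams(text):
            base, flags = [], []
            for ch in text:
                decomp = unicodedata.normalize("NFD", ch)
                base.append(decomp[0])                  # base letter (or ch itself)
                marks = decomp[1:]                      # stripped combining marks
                flags.append(marks if marks else "\0")  # "\0" = no accent
            return "".join(base), "".join(flags)

        base, flags = split_streams("Père Noël")
        print(base)                                     # Pere Noel
        print(flags.encode("unicode_escape"))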

    In either case, I'm not interested in gaining 4 bytes with that, though.

  7. #7
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    Found this cross-platform encoding converter, and I also thought I remembered something from the compression crypt, a description of the M$ patent on hybrid coding of Word docs

  8. #8
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by Karhunen View Post
    Found this cross-platform encoding converter, and I also thought I remembered something from the compression crypt, a description of the M$ patent on hybrid coding of Word docs
    convmv converts filenames, not their content; uconv is available in Ubuntu/Debian.

  9. #9
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by Shelwien View Post
    This is kinda interesting, but these weird-'e' letters actually look different when i open them in notepad
    (they're somehow subscript in the NFD version)
    They should look exactly the same; that's the case in UltraEdit, TextWrangler, and all web browsers.

  10. #10
    Member przemoc's Avatar
    Join Date
    Aug 2011
    Location
    Poland
    Posts
    44
    Thanks
    3
    Thanked 23 Times in 13 Posts
    Shelwien, to convert the encoding of text files you can use the following tools, for instance:
    - iconv
    - recode
    - enca
    All of them support UCS-4, IIRC.

    I guess you couldn't find an appropriate tool because you were searching for the term "utf-32" instead of "ucs-4".
