Thread: Mini-research: Pre-processing Unicode Text

  1. #1
    Member
    Join Date
    Feb 2018
    Location
    Singapore
    Posts
    5
    Thanks
    1
    Thanked 1 Time in 1 Post

    Mini-research: Pre-processing Unicode Text

    I just posted a write-up of a mini-research project on pre-processing Unicode text. The pre-processing technique is a matrix transpose (so nothing new). Nevertheless, it's interesting to see its benefit, particularly for ZPAQ and RAR.

  2. #2
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    377
    Thanks
    139
    Thanked 198 Times in 108 Posts
    I once added this kind of filter in paq8pxd: https://github.com/kaitz/paq8pxd/tre...75619c70d4e994
    Can't remember what effect it had. https://encode.ru/threads/1464-Paq8p...ll=1#post28037
    KZo


  3. #3
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Attila Suszter
    I just posted a write-up of a mini-research project on pre-processing Unicode text. The pre-processing technique is a matrix transpose (so nothing new). Nevertheless, it's interesting to see its benefit, particularly for ZPAQ and RAR.
    Brotli has a simple UTF-8-compatible context model. Have you considered doing the optimization through context modeling instead of transposing?

  4. #4
    Member
    Join Date
    Feb 2018
    Location
    Singapore
    Posts
    5
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Jyrki Alakuijala
    Brotli has a simple UTF-8-compatible context model. Have you considered doing the optimization through context modeling instead of transposing?
    I see your point. It's out of scope for my project. I've been writing a tool for recognizing patterns in binary data, so I thought it could be useful in the compression field as well as in other fields such as digital forensics.

    I ran some tests with Brotli on the same samples that are mentioned in the blog post. It seems Brotli benefits from the pre-processing too on these two isolated samples.

    7-Zip 17.01 ZS v1.3.2 R1

    Compression method: Brotli
    Compression level: Level 11 (Ultra)

    Real-world Data Results:

    1,828,539 file.packed.brotli.7z
    1,577,959 file.transformed.packed.brotli.7z

    Random Data Results:

    39,910,585 unicode.packed.brotli.7z
    34,418,513 unicode.transformed.packed.brotli.7z
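
    For a quick sense of scale, the relative saving can be worked out from the four archive sizes above. This is only a minimal sketch: the function and variable names are made up for illustration, and the sizes are assumed to be in bytes.

```python
# Archive sizes copied from the results above (assumed to be bytes).
results = {
    "real-world": (1_828_539, 1_577_959),
    "random": (39_910_585, 34_418_513),
}


def saving_percent(plain: int, transformed: int) -> float:
    """Relative size reduction of the transformed archive, in percent."""
    return 100.0 * (plain - transformed) / plain


for name, (plain, transformed) in results.items():
    print(f"{name}: {saving_percent(plain, transformed):.1f}% smaller")
```

    On both samples the transformed archive comes out a bit under 14% smaller, which is a result for these two files only.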

  5. The Following User Says Thank You to Attila Suszter For This Useful Post:

    Jyrki Alakuijala (11th February 2018)

  6. #5
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Attila Suszter
    I see your point. It's out of scope for my project. I've been writing a tool for recognizing patterns in binary data, so I thought it could be useful in the compression field as well as in other fields such as digital forensics.

    I ran some tests with Brotli on the same samples that are mentioned in the blog post. It seems Brotli benefits from the pre-processing too on these two isolated samples.

    7-Zip 17.01 ZS v1.3.2 R1

    Compression method: Brotli
    Compression level: Level 11 (Ultra)

    Real-world Data Results:

    1,828,539 file.packed.brotli.7z
    1,577,959 file.transformed.packed.brotli.7z

    Random Data Results:

    39,910,585 unicode.packed.brotli.7z
    34,418,513 unicode.transformed.packed.brotli.7z
    That's nice!

    How fast is the 'untranspose' phase?

  7. #6
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    You might try the nibble transpose in the more general TurboTranspose. For your UTF-16 files, you can set the element size (esize) to 2 in the call to tp4enc.
    The byte transpose in your experiment corresponds to the "tpenc" function with esize = 2.
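
    To make the idea of a nibble transpose concrete, here is a rough Python sketch of one possible layout. It is not TurboTranspose's code and it does not claim to reproduce the output of tp4enc: the function name, the stream order (high nibbles before low nibbles within each byte plane) and the even-element restriction are all assumptions made for illustration.

```python
def nibble_transpose(data: bytes, esize: int = 2) -> bytes:
    """Group the 4-bit halves of every esize-byte element into streams.

    Hypothetical layout: for each byte position k inside an element,
    emit all high nibbles of that position, then all low nibbles,
    packing two nibbles into each output byte.
    """
    count, rest = divmod(len(data), esize)
    if rest or count % 2:
        raise ValueError("need an even number of whole elements")
    out = bytearray()
    for k in range(esize):
        plane = data[k::esize]            # byte k of every element
        for shift in (4, 0):              # high nibbles, then low nibbles
            nibbles = [(b >> shift) & 0x0F for b in plane]
            # Pack two consecutive nibbles into one output byte.
            out.extend((nibbles[i] << 4) | nibbles[i + 1]
                       for i in range(0, count, 2))
    return bytes(out)


sample = "abcd".encode("utf-16-le")       # 61 00 62 00 63 00 64 00
print(nibble_transpose(sample).hex())
```

    The point of going below byte granularity is that the high nibbles of neighbouring characters are often identical, so they end up in long uniform runs.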

  8. The Following User Says Thank You to dnd For This Useful Post:

    xinix (11th February 2018)

  9. #7
    Member
    Join Date
    Feb 2018
    Location
    Singapore
    Posts
    5
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Jyrki Alakuijala
    How fast is the 'untranspose' phase?
    With a data size of 82,097,412, the in-memory transpose timings fluctuate around these values:

    Transpose: 00:00:00.5122561

    Untranspose: 00:00:00.4562677

    Note that my research pre-processor runs on .NET Framework 4.
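
    Those timings can be turned into a throughput figure. A minimal sketch, assuming the data size is in bytes and taking 1 MB as 10^6 bytes; the helper names are made up for illustration.

```python
DATA_SIZE = 82_097_412  # assumed to be bytes


def timespan_seconds(text: str) -> float:
    """Convert an 'hh:mm:ss.fffffff' string (as printed above) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def throughput_mb_per_s(size: int, seconds: float) -> float:
    """Throughput in MB/s, with 1 MB taken as 10**6 bytes."""
    return size / seconds / 1e6


timings = {
    "Transpose": "00:00:00.5122561",
    "Untranspose": "00:00:00.4562677",
}
for label, span in timings.items():
    rate = throughput_mb_per_s(DATA_SIZE, timespan_seconds(span))
    print(f"{label}: {rate:.0f} MB/s")
```

    Under those assumptions both directions land somewhere between 160 and 180 MB/s.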

  10. #8
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Attila Suszter
    With a data size of 82,097,412, the in-memory transpose timings fluctuate around these values:

    Transpose: 00:00:00.5122561

    Untranspose: 00:00:00.4562677

    Note that my research pre-processor runs on .NET Framework 4.
    Your work is with UTF-16, i.e., separating two bytes to two different streams?

  11. #9
    Member
    Join Date
    Feb 2018
    Location
    Singapore
    Posts
    5
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Jyrki Alakuijala
    Your work is with UTF-16, i.e., separating two bytes to two different streams?
    Yes, it's with UTF-16 LE. There is one buffer for the high and low bytes.

    The transpose is a nested for loop in C#.

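    Code:
    // input is indexed [i, j] with i < width and j < height;
    // output has the two dimensions swapped.
    for (int i = 0; i < width; i++)
    {
      for (int j = 0; j < height; j++)
      {
        // Element (i, j) of the input becomes element (j, i) of the output.
        output[j, i] = input[i, j];
      }
    }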
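
    For readers who want to try the round trip without the C# tool, here is a flat-buffer variant of the same nested loop in Python. It is a sketch under assumptions, not the pre-processor itself: the function names are made up, and the row length of 2 (one UTF-16 LE code unit per row) is a guess at how the matrix is laid out.

```python
def transpose(data: bytes, width: int) -> bytes:
    """Treat data as rows of `width` bytes and emit it column by column."""
    if len(data) % width:
        raise ValueError("length must be a multiple of width")
    height = len(data) // width
    out = bytearray(len(data))
    for i in range(height):        # row index in the input
        for j in range(width):     # column index in the input
            out[j * height + i] = data[i * width + j]
    return bytes(out)


def untranspose(data: bytes, width: int) -> bytes:
    """Invert transpose(data, width) by transposing with the other dimension."""
    if not data:
        return b""
    return transpose(data, len(data) // width)


raw = "héllo".encode("utf-16-le")
planes = transpose(raw, 2)         # all low bytes first, then all high bytes
print(planes.hex(" "))
print(untranspose(planes, 2) == raw)
```

    Untransposing is the same loop with the two dimensions swapped, which fits the similar timings reported for both directions.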
