Thread: BWT + LZMA

  1. #1 Member (Dallas, TX)

    Although I haven't tried this yet, would pre-processing data with BWT before compressing with LZMA improve compression? To be honest, I'm not sure it would, so I've been studying LZMA a bit to see whether BWT is even relevant to how it works.

  2. #2 Matt Mahoney (Melbourne, Florida, USA)
    The program bwtdisk with option -b 4 does this. But the compression is not very good. http://mattmahoney.net/dc/text.html#1901

    After BWT, you don't need any high order modeling. You can get good compression with an order 0-1 or 0-1-2 ICM-ISSE chain, or see the model used by BBB, http://mattmahoney.net/dc/dce.html#Section_554
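
    A minimal sketch of the idea in Python, for illustration only: the code below uses a naive rotation-sort BWT (nothing like the ICM-ISSE chain or the BBB model Matt mentions) and simply compares order-0 and order-1 entropy before and after the transform, showing why low-order statistics are enough once the data has been through a BWT.

```python
# Toy illustration of why low-order modeling is enough after a BWT.
# The transform is (apart from the sentinel) just a permutation, so order-0
# counts barely change, but symbols that share a context become adjacent,
# so order-1 entropy drops sharply.  This is NOT the ICM-ISSE or BBB model
# from the post, just a statistics check with a naive O(n^2 log n) BWT.
import math
import sys
from collections import Counter

def bwt(data: bytes) -> bytes:
    """Naive rotation-sort BWT with a 0x00 sentinel; toy-sized inputs only."""
    s = data + b"\x00"                       # assumes 0x00 does not occur in data
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return bytes(s[(i - 1) % len(s)] for i in order)

def h0(data: bytes) -> float:
    """Order-0 entropy in bits per byte."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def h1(data: bytes) -> float:
    """Order-1 conditional entropy H(x[i] | x[i-1]) in bits per byte."""
    pairs = Counter(zip(data, data[1:]))
    prev = Counter(data[:-1])
    n = len(data) - 1
    return -sum(c / n * math.log2(c / prev[a]) for (a, _), c in pairs.items())

# First few KB of any text file (e.g. enwik8) if given, else a stand-in string.
path = sys.argv[1] if len(sys.argv) > 1 else None
data = (open(path, "rb").read(4096) if path
        else b"the quick brown fox jumps over the lazy dog. " * 100)
t = bwt(data)

print(f"raw:  H0 = {h0(data):.3f}  H1 = {h1(data):.3f}")
print(f"bwt:  H0 = {h0(t):.3f}  H1 = {h1(t):.3f}")
```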

  3. #3 Skymmer (Russia)
    Depends on data.
    enwik8 lzma ultra = 24 861 091
    enwik8 openbwt2.0 -b 100 + lzma ultra = 23 714 242
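
    For reference, the shape of that pipeline in a few lines of Python, using the stdlib lzma module and a toy in-memory BWT; this is only a small-block sketch and does not reproduce openbwt 2.0 with -b 100 or 7-Zip's ultra settings.

```python
# Sketch of the BWT-then-LZMA pipeline on one small block, using the stdlib
# lzma module.  The toy BWT is O(n^2 log n), so the block is kept small; real
# tools (openbwt, bwtdisk) use suffix-array construction and much larger blocks.
import lzma
import sys

def bwt(data: bytes) -> bytes:
    s = data + b"\x00"                       # sentinel; assumes 0x00 not in data
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return bytes(s[(i - 1) % len(s)] for i in order)

# First 8 KiB of any text file if given, else a stand-in string.
path = sys.argv[1] if len(sys.argv) > 1 else None
data = (open(path, "rb").read(1 << 13) if path
        else b"the quick brown fox jumps over the lazy dog. " * 180)

plain = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)
transformed = lzma.compress(bwt(data), preset=9 | lzma.PRESET_EXTREME)

print("lzma alone    :", len(plain))
print("bwt then lzma :", len(transformed))
```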

  4. #4 Matt Mahoney (Melbourne, Florida, USA)
    In general BWT as a preprocessing step will improve the compression of many other compressors. There was a paper showing how it improved on zlib and lzma. http://people.unipmn.it/manzini/boosting/

    It doesn't work when the other compressor is better than BWT to begin with (like CM or PPM), or for files that compress poorly with BWT like sorted data (english.dic) or a mix of different types like some tar files.

  5. #5 Member (San Diego)
    Shockingly, enwik8 seems to benefit from going through the BWT twice:

    enwik8 bzip2 default = 29 008 758
    enwik8 bwts + bzip2 default = 27 054 861
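
    A rough way to poke at this from Python: bzip2 already applies a BWT internally, so pre-transforming the input sends the data through the transform twice. The sketch below uses an ordinary sentinel-based toy BWT rather than the bijective BWTS, works on a small block only, and just compares output sizes.

```python
# Sketch of the "BWT before bzip2" experiment on a small block.  bzip2 already
# performs a BWT internally, so pre-transforming the input means the data goes
# through the transform twice.  This uses a plain sentinel-based toy BWT, not
# the bijective BWTS, and only compares compressed sizes (no round trip).
import bz2
import sys

def bwt(data: bytes) -> bytes:
    s = data + b"\x00"                       # sentinel; assumes 0x00 not in data
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return bytes(s[(i - 1) % len(s)] for i in order)

path = sys.argv[1] if len(sys.argv) > 1 else None
data = (open(path, "rb").read(1 << 13) if path
        else b"the quick brown fox jumps over the lazy dog. " * 180)

print("bzip2 alone    :", len(bz2.compress(data, 9)))
print("bwt then bzip2 :", len(bz2.compress(bwt(data), 9)))
```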

  6. #6 m^2 (Ślůnsk, PL)
    Block size?

  7. #7 Member (San Diego)
    Quote Originally Posted by m^2:
    Block size?
    I didn't use blocking. The whole thing was one block.

  8. #8 m^2 (Ślůnsk, PL)
    Wait... did you modify bzip2 to do that? Because normally it supports at most 900 KB blocks.

  9. #9 Member (San Diego)
    Oh. I didn't know you were asking about bzip2. I used the default settings.

  10. #10 m^2 (Ślůnsk, PL)
    So I guess the benefit comes from processing enwik as a whole in the double BWT case and in smaller pieces (600K? Don't remember) with just bzip2.
    You can repeat the experiment with some BWT compressor that supports large blocks if you want.

  11. #11 Member (San Diego)
    Quote Originally Posted by m^2:
    So I guess the benefit comes from processing enwik as a whole in the double BWT case and in smaller pieces (600K? Don't remember) with just bzip2.
    You can repeat the experiment with some BWT compressor that supports large blocks if you want.
    I retried it with 900k chunks of enwik8m, and bzip2 -9, and the effect disappeared: the double BWT was larger, as expected. So, blocking could be the explanation.

    It's sort of interesting that doing the BWT twice could ever come out better, even given the effects of blocking. It seems to happen on XML files in particular. I guess there is a significant startup cost for all of the XML markup that gets paid at the start of each block, and a pre-BWT over the whole file defeats that.
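
    A toy illustration of that explanation, as a sketch only: compress a markup-heavy input as one stream, as independent small chunks, and as independent chunks after one global BWT, then compare the totals. The XML-ish record, the chunk size, and the choice of bz2 are all arbitrary stand-ins, not the actual BWTS + bzip2 runs from earlier in the thread.

```python
# Toy illustration of the blocking explanation: independent chunks each pay
# for the markup structure again, while one global BWT groups similar contexts
# from the whole input before it is cut into chunks.  The XML-ish record, the
# chunk size, and the use of bz2 are arbitrary choices for the demo.
import bz2

def bwt(data: bytes) -> bytes:
    s = data + b"\x00"                       # sentinel; assumes 0x00 not in data
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return bytes(s[(i - 1) % len(s)] for i in order)

def chunked(data: bytes, size: int) -> int:
    """Total size when each chunk is compressed as an independent bz2 stream."""
    return sum(len(bz2.compress(data[i:i + size], 9))
               for i in range(0, len(data), size))

record = b"<page><title>t</title><text>some article text goes here</text></page>\n"
data = record * 100                          # ~7 KB of repetitive, markup-heavy input

print("one stream, no BWT      :", len(bz2.compress(data, 9)))
print("1 KB chunks, no BWT     :", chunked(data, 1024))
print("1 KB chunks, global BWT :", chunked(bwt(data), 1024))
```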
