
Thread: Lightweight BWT construction

  #1 Matt Mahoney (Expert, Melbourne, Florida, USA)

    Lightweight BWT construction

    This might be old news, but I just came across this: http://people.unipmn.it/manzini/bwtdisk/

    The paper describes a clever method for constructing a BWT with less memory: compute suffix arrays over small segments of the input, starting from the end and working backward, then merge the output into the BWT on disk over multiple passes. This is fast enough because all of the disk access is sequential. It also uses far less external disk space than BBB, which writes all the suffix arrays to disk as temporary files and then merges them by reading from all of them at once.
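    Not the paper's algorithm (bwtdisk merges whole precomputed segment BWTs with sequential disk passes), but the same right-to-left principle can be shown in a few lines: the BWT of T[i..n] can be extended to the BWT of T[i-1..n] without re-sorting anything. A toy in-memory Python sketch of that idea, plus a naive suffix-array reference for comparison (quadratic time, illustration only):

    ```python
    def bwt_via_suffix_array(text):
        # Reference: BWT from the full suffix array of text + "\0" sentinel.
        s = text + "\0"
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        return "".join(s[i - 1] for i in sa)

    def bwt_incremental(text):
        # Build the BWT right to left, one character at a time, never
        # sorting a whole suffix array.  bwt is the BWT of the suffix
        # processed so far; p is the row whose BWT character is the sentinel
        # (i.e. the row of the current full suffix).
        bwt = ["\0"]               # BWT of the empty suffix
        p = 0
        for c in reversed(text):   # text must not contain "\0"
            bwt[p] = c             # the current suffix is now preceded by c
            # New sentinel row = (#chars < c) + (rank of c up to and incl. p),
            # i.e. the standard LF-mapping computed on the fly.
            smaller = sum(1 for x in bwt if x < c)
            rank = sum(1 for x in bwt[: p + 1] if x == c)
            p = smaller + rank
            bwt.insert(p, "\0")
        return "".join(bwt)
    ```

    The list scans and inserts make this O(n^2); the point of the paper is to do the equivalent extension segment-at-a-time with only sequential I/O, so it scales far beyond RAM.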

    There is also a demo program, which I added to LTCB: http://mattmahoney.net/dc/text.html#1901
    It is source code only, and I was only able to get it to compile on Linux. There is no block size parameter; it compresses the whole file as a single block. Decompression still requires 4x memory. The BWT is compressed using zlib, lzma, or RLE plus range coding. The compression ratio is not that great, but the code is designed to make it easy to add other compressors.
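    On the 4x decompression memory: inverting a BWT in RAM typically stores one LF pointer per input byte, i.e. about 4n bytes with 32-bit integers. A minimal sketch of that standard inversion (my own illustration, assuming a BWT built over text plus a "\0" sentinel, not bwtdisk's actual code):

    ```python
    def inverse_bwt(bwt):
        # Invert a BWT built over text + "\0" sentinel.  The lf table holds
        # one integer per input byte -- the source of the ~4x memory cost.
        n = len(bwt)
        # C[c] = number of characters in bwt strictly smaller than c
        counts = {}
        for c in bwt:
            counts[c] = counts.get(c, 0) + 1
        total, C = 0, {}
        for c in sorted(counts):
            C[c] = total
            total += counts[c]
        # lf[i] = row of the rotation obtained by prepending bwt[i]
        seen, lf = {}, [0] * n
        for i, c in enumerate(bwt):
            lf[i] = C[c] + seen.get(c, 0)
            seen[c] = seen.get(c, 0) + 1
        # Row 0 is always the sentinel-first rotation; following lf from
        # there emits the text back to front.
        out, i = [], 0
        for _ in range(n - 1):
            out.append(bwt[i])
            i = lf[i]
        return "".join(reversed(out))
    ```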


  #2 Black_Fox (Tester, [CZE] Czechia)
    I think one "-m 3500" in each pair should read "-m 500".

  #3 Matt Mahoney (Expert, Melbourne, Florida, USA)
    Probably, but I tested with -m 3500. According to top, enwik8 uses only 500 MB, so I expect -m 500 would give the same result. -m specifies the maximum memory the program is allowed to use.

