
Thread: Comparison of compressors for molecular sequence databases

  1. #1
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Comparison of compressors for molecular sequence databases

    Hi all! I am running a benchmark of compressors on sequence databases. I thought I'd start a thread to discuss it, respond to comments from the "Data Compression Tweets" thread, and gather expert feedback to hopefully improve it.

    Benchmark data: http://kirr.dyndns.org/sequence-compression-benchmark/

    This is a work in progress, and I will continue improving it when I can. Suggestions are welcome!

    Disclaimers/Disclosures/Limitations:

    1) As in any benchmark, the results are specific to the particular hardware, test data and methodology. I used a reasonably standard workstation, and the test data consists of commonly used sequence databases. Thus the results should be reasonably informative, but not necessarily 100% transferable to other machines or data.

    2) I benchmark a very specific task. Namely, lossless compression (and decompression), without reference, of FASTA files with DNA, RNA and protein sequences. This is the data I often work with, so I'm familiar with what is used and how it is used.

    3) My own compressor is included in the benchmark. I need to know how it compares to other compressors. Also, when I make an improvement, I need to measure it. My compressor receives no special treatment in the benchmark.

    4) The benchmark takes a lot of time. For anything missing, it's possible that I just haven't had time to add it yet.

    5) The interface is a mess. I'll need to organize it in a much better way.
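
For readers who want to try the kind of measurement described in point 2 on their own files, here is a minimal sketch. It is not the benchmark's actual harness: it assumes only Python's standard-library codecs (zlib, bz2, lzma) as stand-ins for real command-line compressors, and the names `benchmark` and `COMPRESSORS` are made up for illustration. It times compression and decompression, checks that the round trip is lossless, and reports the compression ratio.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical set of compressors: label -> (compress, decompress).
# These standard-library codecs only stand in for real tools.
COMPRESSORS = {
    "zlib -9": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2 -9": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "xz -6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}


def benchmark(data, compressors):
    """Measure size and timing of each compressor on `data` (bytes)."""
    results = []
    for label, (compress, decompress) in compressors.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        # The task is lossless compression, so verify the round trip.
        if restored != data:
            raise ValueError(f"{label}: round trip is not lossless")
        results.append({
            "label": label,
            "compressed_size": len(packed),
            "ratio": len(data) / len(packed),
            "compress_seconds": t1 - t0,
            "decompress_seconds": t2 - t1,
        })
    return results


if __name__ == "__main__":
    # Toy FASTA content; replace with the bytes of a real file.
    sample = b">seq1 toy example\n" + b"ACGTTGCA" * 5000 + b"\n"
    for row in benchmark(sample, COMPRESSORS):
        print(f"{row['label']:8s} ratio={row['ratio']:.1f} "
              f"c={row['compress_seconds']:.4f}s "
              f"d={row['decompress_seconds']:.4f}s")
```

Timings from a single in-memory run like this are noisy and depend on the machine, so a real comparison would repeat each measurement and use realistic database files.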

  2. #2
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Now, on to replies to comments from another thread.

    Quote Originally Posted by Jarek View Post
    This comparison clearly misses CRAM ( https://en.wikipedia.org/wiki/CRAM_(file_format) ), its winning NAF is from the same author, and seems to be preprocessing+zstd: http://kirill-kryukov.com/study/naf/
    CRAM uses reference sequence. This is very interesting, but outside the scope of my benchmark.

    Quote Originally Posted by Jyrki Alakuijala View Post
    They failed to benchmark in a useful way. Brotli is run with 4 MB window whereas zstd is run with 8 to 128 MB.

    If they use the same window length, brotli will compress 5 % better than zstd on an average corpora.
    Thanks to your helpful comment and a fair bit of googling, I finally managed to find out about the existence of the "--large_window" option of brotli. I have now added the "-q 11 --large_window=30" setting to the benchmark.

    If it had been listed in the "brotli -h" output, I might have added it sooner. Someone might say that the brotli authors failed to expose or document this setting in a useful way.

    If you feel that some other important setting is missing, please kindly let me know.

    Quote Originally Posted by JamesB View Post
    They should compare themselves against modern incarnation of FASTQ compressors, such as Spring
    Not sure if this was about my benchmark. If it was, then I have to mention that I did not use any FASTQ files. It's interesting and I might do it some day. But currently I just benchmark on FASTA files.

    Any further comments or suggestions are greatly appreciated!

  3. #3
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Missed one.

    Quote Originally Posted by JamesB View Post
    NAF looks like it's rather naive, just separating out seq and qual and doing zstd on them.
    Yes, NAF is very simple, by design. zstd works spectacularly well in the way I use it in NAF. I much prefer to keep it simple. There are a number of more complex DNA compressors, but most of them are impractically slow. Actually, it was previously a mystery to me why everyone keeps using gzip instead of specialized compressors. Only after starting the benchmark did I realize how slow most sequence compressors are.

  4. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    It's a mystery to me too why people still use gzip. It's even more of a mystery why every Linux distribution I know ships with the slow zlib instead of any of the myriad optimised ones.

    You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting as there's been much more work on compression of myriad of small fastq fragments - ie raw sequence machine output - than there has fasta.

  5. #5
    Member
    Join Date
    May 2019
    Location
    Japan
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by JamesB View Post
    You may also want to test some of the fastq compressors (quip, fastqz, fqzcomp, spring, fqsqueezer, etc). If the data is lots of short sequences in fasta format then just generate a fixed string of qualities (all the same) to turn it into fastq, if the tool in question doesn't also support fasta. This may be interesting as there's been much more work on compression of myriad of small fastq fragments - ie raw sequence machine output - than there has fasta.
    That's a nice idea! I may try this some time. Indeed, it's easy to add constant quality via a wrapper script, and this will allow using fastq compressors.
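
A minimal sketch of such a wrapper, assuming plain FASTA input: each record is written out as a four-line FASTQ record with a constant quality string. The function names `read_fasta` and `fasta_to_fastq` and the default quality character "I" are arbitrary choices for illustration, not taken from any existing tool.

```python
import sys


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            # Sequence lines of one record are joined; line wrapping is dropped.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(lines, quality_char="I"):
    """Yield FASTQ records with a constant quality string of matching length."""
    for name, seq in read_fasta(lines):
        yield f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"


if __name__ == "__main__":
    # Usage (hypothetical script name): python fasta2fastq.py < in.fa > out.fq
    for record in fasta_to_fastq(sys.stdin):
        sys.stdout.write(record)
```

Note that this conversion discards the original line width, so the round trip back to FASTA is not byte-identical unless the line length is stored separately. Records with empty sequences, and protein sequences, may also not be accepted by every FASTQ compressor.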
