Results 1 to 5 of 5

Thread: why Compression of DNA sequences is a very challenging task?

  1. #1
    Member
    Join Date
    Aug 2016
    Location
    libya
    Posts
    4
    Thanks
    9
    Thanked 0 Times in 0 Posts

    why Compression of DNA sequences is a very challenging task?

    why the compression of DNA sequence is a difficult for compression algorithms?
    although DNA is composed of four bases (A, T, G, C), and can be coded using two bits per base , The human genome contains around 3 billion characters .although these software are designed for text compression.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    The problem is that even when its this kind of DNA data (and its not always this - the previous dna compression competition used a different encoding),
    its not just random independent symbols which can be packed to 2 bits each and that's all - it can be compressed much further than that.
    For one, there're some common strings which appear frequently enough, so plain LZ would compress it better than just 2-bit packed coding.
    But then, there're lots of other types of dependencies which don't normally appear in data compression - like palindromes,
    complementary strings, imprecise matches, and even some restrictions appearing from the molecular structure of DNA sequence.

    So I'd say its hard mainly because you have to know a lot about its properties to compress it well.
    Also some things like imprecise matches are just hard to do on their own.

  3. The Following User Says Thank You to Shelwien For This Useful Post:

    omran farhat (4th September 2016)

  4. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Several small things, but none of them insurmountable.

    Firstly, DNA consists (mostly) of just 4 symbols, meaning that bit-based encoding like Huffman is just never going to beat 2 bits per base at raw entropy encoding and in practice doesn't get that far due to often having an EOF symbol. Trivial workaround is to pack multiple bases per byte - eg triplets (permitting ACGTN). Try gzip on a dna sequence and you'll see it's really weak due to the small alphabet penalties of its huffman implementation.

    Secondly, if you want LZ style matching then DNA is comparable to a corrupted archive. You'll be able to deduplicate a lot, but there are various imperfect repeats too, sometimes not just substitutions but small insertions and deletions. While this could be coded as a whole series of separate exact repeats (eg lots of LZ matches), it's inefficient as the distances are highly correlated (so maybe you need to use a more complex LZ with recent distance codes). Alternatively you can do a full smith-waterman style dynamic programming matrix to identify the repeats and the optimal edits from one to the other, permitting a richer LZ scheme.

    Then just to make life awkward, matches can occur on either strand (so AT<>TA and CG<>GC plus reverse order).

    Imagine trying to describe the duplications in this sequence:

    https://www.researchgate.net/figure/...ns-where-there

    Each dot on that graph is a short match between substrings of a pair of DNA sequences, but it could equally be between a DNA sequence and itself (making it perfectly symmetrical along the leading diagonal). The dots combine to make lines where the matches are long, either matching the leading diagonal for forward strand match or at right angles for a reverse strand match. The small repeat many times over gives rise to an approximate grid of short matches, but they're not quite all identical.


    Frankly though it's probably not worth doing anything clever. Basic LZ (albeit both strands) coupled with 2 bit encoding and an exception matrix for the non ACGT symbols would get you 99% of the way there I think.

    PS. Using dot plots on other data can be a nice way to visualise the repeat structure. Even just turning data into DNA via 2-bit encoding and using dotter (or similar) can shine a light on the internal data structure.

  5. The Following User Says Thank You to JamesB For This Useful Post:

    omran farhat (5th September 2016)

  6. #4
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by omran farhat View Post
    why the compression of DNA sequence is a difficult for compression algorithms?
    Because of mutations, length optimization by evolution, and sex (and other exchange of dna) the dna has increased entropy, and is difficult to compress.

    Some 3–10th order context modeling seems to help a tiny bit. Non-scientific personal theory: I believe a longer context model works better than shorter because the protein coming from the synthesis would have a self-collision or collision with the machinery building proteins with some of the sequences, and could not be completed or would be much slower to be completed. Such protein synthesis would seem harmful for most life, and thus is less common.

  7. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The best you can compress any string is to find the shortest program that outputs it. The program that wrote the human genome is called evolution. It is not a complex algorithm, but it required 10^48 DNA base copy operations on 10^38 bits of memory, consuming 280 terawatts of power for 3.5 billion years.

    I hope that including the human genome as 30% of the 10 GB benchmark will spur research in genomics in the same way that the large text benchmark spurs research in AI.

  8. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    omran farhat (26th October 2016)

Similar Threads

  1. DNA Corpus
    By Kennon Conrad in forum Download Area
    Replies: 20
    Last Post: 10th April 2019, 08:13
  2. DNA storage
    By Shelwien in forum The Off-Topic Lounge
    Replies: 3
    Last Post: 25th January 2013, 23:42
  3. Finding most frequency sequences in data?
    By RichSelian in forum Data Compression
    Replies: 5
    Last Post: 21st September 2012, 03:29
  4. Another? DNA contest
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 8th February 2012, 17:17

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •