Results 1 to 9 of 9

Thread: 3 failed dna preprocessing methods

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    3 failed dna preprocessing methods

    I originally just wanted to test contrepl, but then tried a couple of other random things.
    (contrepl is my tool for turning lossy replacement regexps into lossless)

    File was http://corpus.canterbury.ac.nz/descr...ge/E.coli.html

    1. Apply replacements t->a, g->c, c->a .
    Results is a file of only "a" + 3 flag streams.

    Compression is somewhat hard to evaluate, since I only have my CM for flag compression with external contexts,
    but comparing to same CM on original file, compression is worse.
    Code:
    4,638,690 e_coli
    1,123,529 e_coli.CMo5
      576,425 flg_ari_CA
      290,275 flg_ari_GC
      279,553 flg_ari_TA
    1,146,253 CA+GC+TA
    Surprisingly, there was also very little effect from switching to right-side contexts, like +/-100 bytes.

    Of course, this doesn't mean much since CMo5 was tuned to book1 and flag coder to "contractions config",
    so both results can be obviously improved (its fact its possible to tune 3 different models for flags here).

    But there's no immediate win, so its not that interesting.


    2. Identify some important sequences and insert spaces to split contexts
    (its a popular text preprocessing trick called "space stuffing")

    step 1: compress e.coli with 7-zip to a zip archive with parsing optimization
    step 2: extract optimized match strings from deflate stream using deflmatch kit
    step 3: use a perl script to insert spaces after 256 most frequent match strings
    Code:
    4,638,690 e_coli
    1,235,713 e_coli.zip
    1,185,356 e_coli.lzma
    1,129,315 e_coli.pmm       // ppmonstr -o128
    1,130,772 e_coli_pre.pmm   // spaces after 256 most frequent matches (2.pl)
    1,136,925 e_coli_pre1.pmm  // spaces after 3717 most frequent matches
    1,168,500 e_coli_pre2.pmm  // spaces before 256 most frequent matches
    1,129,493 e_coli_pre3.pmm  // spaces after 8 most frequent matches (3.pl)
    Well, it clearly does something, since compressed size changes very little despite significant
    added redundancy (+500k in pre1 case), but also no improvement.

    Improvements might be still possible with a special tool for context clustering,
    but for now no luck.


    3. An interesting detail in (2) was that all most frequent strings consisted of 7 symbols.
    Code:
    4,638,690 e_coli
    1,187,935 e_coli.lzma
    
    5,301,360 1 // stegdict.exe d e_coli_lst 1 2 (_lst has \n inserted after every 7 chars)
       22,649 1.lzma // lzma -d23 -fb273 -mc9999 -lc0 -lp3 -pb3 -mt1 (sorted list of 7-grams)
    1,374,446 2 // stegdict "diff" to restore the permutation
    
    4,638,690 1 // stegdict.exe d16 e_coli 1 2 (just 16-char records, no padding)
      605,214 2 // diff file
      700,937 3 // lzma -d23 -fb273 -mc9999 -lc1 -lp2 -pb2 -mt1 
    1,306,151 2+3
    
    5,275,326 e_coli_deflmatch // cut -d " " -f 2 00000000.txt | sed -e s/\x22//
    5,275,326 4 // stegdict.exe d e_coli_deflmatch 4 5
    1,345,276 5
    Attached Files Attached Files

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    You could try turning it into codons: replace triplets of ACGT with a value between 0-63 and compress those instead. A codon is the natural unit size for DNA/RNA to Protein (using the "genetic code").

    Better still, identify the potential reading frames and have triplets of ACGT<absent> (125 combinations now) so that you can switch reading frames by emitting a single or double code instead of a triple, possibly even add the ability switch strands (reverse & complement), when appropriate. On something like E-coli this may help as it'll be quite gene rich.

    To explain this, consider

    ACGTATAGCGAGGC:

    ACG TAT AGC GAG GC (reading frame 1)
    A CGT ATA GCG AGG C (reading frame 2)
    AC GTA TAG CGA GGC (reading frame 3)

    You can work out where the coding regions are by looking at the statistical nature of bases 1, 2 and 3 in the codons. The genetic code maps 64 triplets to 20ish amino acids, with common amino acids (lucine, serine, etc) having more triplets "assigned" to them than rare ones (eg tryptophan). Usually these are in blocks with the same first couple bases. Eg Lucine is CTT (actually CUU, but we'll ignore RNA vs DNA), CTT, CTC, CTA, CTG as well as TTA, TTG.

    As a consequence, if there is selective pressure to have a particular ratio of ATs vs GCs (aka the "GC content") then the organism often ends up adapting by tweaking that 3rd base. Perhaps it'll use TTG in preference to TTA for example. By analysing the base frequencies in the 1st, 2nd and 3rd positions and seeing how similar or different the distributions are, we can make a good stab at predicting the reading frame. Couple this with a search for the stop codon (TGA, TAG, TAA) and you'll find regions that are clearly genes. A visual example of this can be seen here:

    Click image for larger version. 

Name:	spin.png 
Views:	36 
Size:	23.4 KB 
ID:	6500

    Edit the above is the forward strand only. The reverse strand will also have coding regions, most likely in the regions where we don't see coding on this strand (although there are some very rare cases of coding in more than one overlapping reading frame and/or strand). You could compute both together by reverse complementing each codon before doing the statistical analysis and stop codon detection..

  3. The Following 2 Users Say Thank You to JamesB For This Useful Post:

    Mike (31st March 2019),Shelwien (8th March 2019)

  4. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    More detailed codon usage analysis bits over here: http://www.pdg.cnb.uam.es/cursos/Bio...tent/main.html

    See the pics about half way down covering codon usage for introns (intra-gene coding) and exons ("excised" portions of the RNA used for gene coding). The distributions are very different. If we can make a stab at each position whether we are intron or exon (with appropriate shuffling to keep in reading frame) then we can also improve compression.

    It's possible a high order model maybe already doing this for free though, but it's unlikely as typically these methods have a reasonable sized sliding window.

    Edit: If you want a really really simple test. Try adding something like, say, >20bp since a TGA, TAG or TAA were seen into the context. That'll use (one strand only) a predictor for whether we are inside an "open reading frame" (region of no stop codons) or not.

  5. The Following User Says Thank You to JamesB For This Useful Post:

    Mike (31st March 2019)

  6. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I mainly wanted to check whether contrepl is applicable for dna.
    contrepl allows to run a replacement, like say s/TGA/TAA/ and create an additional flag stream,
    which tags each TAA instance in output with a flag saying whether to replace it back.
    Then I have a specialized CM coder for these flags, which encodes them in context of their
    location in original data. (Its also possible to compress different flag streams in parallel).
    So when applied to english text, we can eg. replace "the" with "a", thus merging similar contexts,
    and improve overall compression.
    (flag coder can use both left and right contexts and has lots of tunable parameters).
    I think a similar effect might apply to dna data as well, but I don't really know which patterns to try.

    That's actually why I did parsing optimization test, and now there's another question.
    Somehow 3717 most frequent match strings in e.coli all have length 7.
    Also 4638690=2*3^2*5*7*37*199 (e.coli length).
    But why 7?

  7. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Well, you could try checking the bacterial genetic code and do substitutions based on that, but I doubt it'll work well due to not knowing the correct reading frame at any point.

    The length 7 is probably some common repeat structure?

  8. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    I was curious if a hacky open reading frame detection context helped. Yes it does, but marginally.


    Code:
    # Disable ORF code
    $ cc -DORDER=4 -DORF=999999 -g dna.c -I. && ./a.out < /tmp/ec|wc -c
    1127709
    
    # Context uses ORF>=100 bases
    $ cc -DORDER=4 -DORF=100 -g dna.c -I. && ./a.out < /tmp/ec|wc -c
    1125128
    It's not a huge benefit, but it's something. Misc code snippets (not the complete thing).
    "orf" is forward strand open reading frame counter and "fro" is the reverse strand one. I just reset it to zero whenever I see a stop codon, and use of a bit of context when they get beyond a certain size. (Maybe I should combine both instead the same bit.)

    I'm not sure my E-coli is the same as yours as the page you linked to had no downloads. I just used a local copy of E-coli strain K12.

    Code:
    static int map[256];
    int init_map(void) {
        memset(map, 0, 256*sizeof(*map));
        map['A'] = 0;
        map['C'] = 1;
        map['G'] = 2;
        map['T'] = 3;
    }
    
    int codon(char *c) {
        return (map[c[0]]<<4)|(map[c[1]]<<2)|map[c[2]];
    }
    
    
    size_t encode(uint8_t *in, size_t in_len, uint8_t *out, size_t out_len) {
        RangeCoder rc;
        size_t i;
    
    #ifndef ORDER
    #  define ORDER 2
    #endif
    
    #define OSIZE (1<<((int)(ORDER*2)))
    #define CSIZE (OSIZE*4)
    
        SIMPLE_MODEL(4,_) base_model[CSIZE];
        for (i = 0; i < CSIZE; i++)
             SIMPLE_MODEL(4,_init)(&base_model[i], 4);
    
        RC_SetOutput(&rc, (char *)out);
        RC_StartEncode(&rc);
    
    #ifndef ORF
    #  define ORF 100
    #endif
    
    #ifndef FRO
    #  define FRO ORF
    #endif
    
        unsigned int last = 0;
        int TAG = codon("TAG"); int CTA = codon("CTA");
        int TGA = codon("TGA"); int TCA = codon("TCA");
        int TAA = codon("TAA"); int TTA = codon("TTA");
        int codon = 0, orf = 0, fro = 0;
        int orf1 = 0, orf2 = 0, orf3 = 0; // forward strand                                                 
        int fro1 = 0, fro2 = 0, fro3 = 0; // reverse strand                                                 
        for (i = 0; i < in_len; i++) {
            SIMPLE_MODEL(4, _encodeSymbol)(&base_model[(last & (OSIZE-1))*4 + orf*2 + fro], &rc, map[in[i]]);
            last = (last<<2) | map[in[i]];
            codon = last & 63; // last 3 bases                                                              
    
            if (i%3 == 0) {
                orf1 = (codon == TAG || codon == TGA || codon == TAA) ? 0 : orf1+1;
                fro1 = (codon == CTA || codon == TCA || codon == TTA) ? 0 : fro1+1;
            }
            if (i%3 == 1) {
                orf2 = (codon == TAG || codon == TGA || codon == TAA) ? 0 : orf2+1;
                fro2 = (codon == CTA || codon == TCA || codon == TTA) ? 0 : fro2+1;
            }
            if (i%3 == 2) {
                orf3 = (codon == TAG || codon == TGA || codon == TAA) ? 0 : orf3+1;
                fro3 = (codon == CTA || codon == TCA || codon == TTA) ? 0 : fro3+1;
            }
            orf = (orf1>ORF) | (orf2>ORF) | (orf3>ORF);
            fro = (fro1>FRO) | (fro2>FRO) | (fro3>FRO);
        }
    
        RC_FinishEncode(&rc);
    
        return RC_OutSize(&rc);
    }
    Last edited by JamesB; 8th March 2019 at 21:10. Reason: Formatting, and replace byte with base.

  9. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    It can be downloaded here: http://corpus.canterbury.ac.nz/resources/large.zip

    Also these are top 1000 strings from deflate and lzma parsing optimization:
    Attached Files Attached Files

  10. The Following User Says Thank You to Shelwien For This Useful Post:

    JamesB (10th March 2019)

  11. #8
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Some results from the usual suspects (no preprocessing):

    Code:
    1,095,987 e_coli.paq8px178
    1,091,124 e_coli.cmix_17 // cmix v17 from GitHub, commit c4a649..
    http://schnaader.info
    Damn kids. They're all alike.

  12. #9
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    1091124 cmix v17
    1094645 cmv-2 0.3.0a1 -m2,0,>
    1095987 paq8px_v178 (schnaader test)
    1096031 paq8px_v178 -9
    1098242 cmv 0.2.0 -m2,0,>
    1099675 paq8pxd64 (SSE2 build using IC19 from Shelwien) -s9
    4638690 E.coli
    Last edited by Mauro Vezzosi; 7th April 2019 at 23:07. Reason: Added cmv-2

Similar Threads

  1. DNA Corpus
    By Kennon Conrad in forum Download Area
    Replies: 20
    Last Post: 10th April 2019, 08:13
  2. Preprocessing using a low order model
    By byronknoll in forum Data Compression
    Replies: 10
    Last Post: 6th May 2016, 00:22
  3. Optimal Preprocessing/Parsing for LZ?
    By comp1 in forum Data Compression
    Replies: 68
    Last Post: 11th February 2015, 20:27
  4. Another? DNA contest
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 8th February 2012, 17:17
  5. One lossy BMP format that failed by a lot!
    By toi007 in forum The Off-Topic Lounge
    Replies: 0
    Last Post: 29th June 2011, 00:53

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •