
Thread: Genomic filter

  1. #1
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts

    Genomic filter

    I wrote a simple filter for genomic data: it reads three bases and codes them into a byte.
    Since there are four distinct bases, this takes only 6 bits. Packing four bases per byte is not
    convenient, however, because amino acid translation happens on a triplet basis, so a loss of
    correlation might occur. The remaining two bits can be used for a limited RLE scheme; please
    look at the code below for details.

    Code:
    program gen;
    {$R-,S-}
    var aa,oaa,cnt,lim,trip,base:byte;  { aa = current triplet, oaa = previous one, cnt = run count, lim = RLE limit }
    nb:word;
    fin,fout:file;
    fname:string[12];
    begin
      write('Input file?');
      readln(fname);
      write('RLE?');
      readln(lim);
      if lim>3 then
        lim:=3;
      assign(fin,fname);
      reset(fin,1);
      fname:=copy(fname,1,pos('.',fname)-1);
      assign(fout,fname+'.out');
      rewrite(fout,1);
      oaa:=64;  { 64 means: no previous triplet yet (a packed triplet is always < 64) }
      while not eof(fin) do
        begin
          aa:=0;
          for trip:=0 to 2 do
            begin
              if not eof(fin) then
                blockread(fin,base,1,nb)
              else
                begin
                  writeln('Sequence truncation');
                  halt(1);
                end;
              case base of
                ord('c'),ord('C'):base:=0;
                ord('g'),ord('G'):base:=1;
                ord('a'),ord('A'):base:=2;
                ord('t'),ord('T'),ord('u'),ord('U'):base:=3;
              else
                begin
                  writeln('The file does not appear as genome');
                  halt(1);
                end;
              end;
              aa:=aa+base shl (trip shl 1);  { pack the base into two bits of the triplet byte }
            end;
          if (aa=oaa) and (cnt<lim) then
            inc(cnt)  { extend the current run of identical triplets }
          else
            begin
              if oaa<64 then
                oaa:=oaa+cnt shl 6  { previous triplet with its run count in the two spare bits }
              else
                oaa:=aa;
              blockwrite(fout,oaa,1,nb);
              oaa:=aa;
              cnt:=0;
            end;
        end;
      if oaa<64 then
        begin
          oaa:=oaa+cnt shl 6;
          blockwrite(fout,oaa,1,nb);
        end;
      close(fin);
      close(fout);
    end.
    I made some experiments on the E.coli file of the Canterbury large corpus: the file was first packed
    with the above filter and the output was then compressed with gzip and with glue (see the thread
    Grouped (ROLZ) LZW compressor). Here are the results:

    original file size 4638690
    gzip size 1341254
    glue size 1345972
    filtered file size 1508628
    gzip filtered size 1165463
    glue filtered size 1388177

    I chose to make this comparison because I expected the order-1 prediction inside glue to behave well
    on a DNA source; as one can see, the difference in compression for the original file is small. For the
    filtered file, however, the story is sadder (to me): gzip achieves better compression, glue does not.
    My interpretation is that increasing the number of symbols lets glue build a larger dictionary, but at
    the same time it increases the average number of bits spent on the Huffman-coded group selection, while
    prediction on amino acids is worse than on bases. gzip, instead, benefits from the expanded alphabet,
    and its entropy reduction based on the most recent occurrence makes the difference as usual.
    Last edited by Marco_B; 7th July 2015 at 10:15.

  2. The Following 2 Users Say Thank You to Marco_B For This Useful Post:

    Gonzalo (9th July 2015),Stephan Busch (1st July 2015)

  3. #2
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    The order 0 entropy for the E.coli file is 1,159,568 bytes. Just packing 4 symbols into each byte would give 1,159,672 bytes. My fairly lame adaptive order 0 range encoder derived from ppmd (ao0ac) produces a compressed file size of 1,174,611 bytes. So I think the gzip filtered size has more to do with the order 0 entropy than order 1 predictions. I'm a little surprised gzip didn't do better on the original file.
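    For reference, here is a quick sketch of how such a whole-file order-0 figure can be computed (a sketch only, in the style of the other programs in this thread; it is not the actual tool behind the number above): count the byte frequencies, sum -p*log2(p), and scale by the file length.

    Code:
    program entropy0;
    { Whole-file (static) order-0 entropy estimate from byte frequencies. }
    var cnt:array[0..255] of longint;
        buf:array[0..4095] of byte;
        f:file;
        nb:word;
        i:integer;
        n:longint;
        p,h:real;
        fname:string;
    begin
      write('Input file?');
      readln(fname);
      assign(f,fname);
      reset(f,1);
      for i:=0 to 255 do
        cnt[i]:=0;
      n:=0;
      while not eof(f) do
        begin
          blockread(f,buf,sizeof(buf),nb);
          for i:=0 to nb-1 do
            inc(cnt[buf[i]]);
          inc(n,nb);
        end;
      close(f);
      h:=0;
      for i:=0 to 255 do
        if cnt[i]>0 then
          begin
            p:=cnt[i]/n;
            h:=h-p*ln(p)/ln(2);   { -p*log2(p) }
          end;
      writeln(h:0:6,' bits per byte, ',h*n/8:0:0,' bytes');
    end.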
    Last edited by Kennon Conrad; 1st July 2015 at 18:36.

  4. #3
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #2:
    Then a dictionary-plus-entropy-coder approach seems less convenient than a direct order-0 one. It would be interesting to determine whether it is better to rely on the 4 symbols of a base or on the 64 of a triplet before applying a pure order-0 coder; in the latter case the coder could take advantage of the fact that there are only about 20 amino acids, so several triplets map to the same amino acid and some are rarer than others.
    Given the multilevel structure of a protein, an order far deeper than 1 is probably necessary to exploit good prediction; a similar argument could apply to the window size of a dictionary coder, where gzip has only 32 KB.
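    As a side note (standard information theory, not something measured here): an order-0 coder over the 64 triplet symbols can only beat an order-0 coder over single bases to the extent that the three bases inside a triplet are correlated, since

    $$H(B_1 B_2 B_3) \le H(B_1) + H(B_2) + H(B_3) = 3\,H(\text{base}),$$

    with equality when the bases are independent. Codon usage bias of the kind described above (several triplets mapping to the same amino acid, some rarer than others) is exactly such a correlation, so it is what a triplet-level order-0 coder would have to exploit.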

  5. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    The main problem with deflate is that it allocates a separate symbol for EOF, and this has to take a minimum of 1 bit due to huffman.
    This turns an alphabet of 4 symbols, likely each taking 2 bits, into an alphabet of 5 symbols with one of ACGT taking 3 bits to accommodate the EOF. That means you end up with 2.25 bits per base instead of 2 bits (assuming no repeats or base composition biases). Gzip gets 2.3 bits, so around the same mark.

    By packing 3 bases into a byte (I've used 125 symbols covering ACGTN before), the hit you take from adding one extra symbol becomes very minor.

    Obviously using a fractional-bit encoding (arithmetic, ANS, etc.) solves this by a different route, but there are still lots of huffman-based coders out there, so bit packing can be a valuable solution. Bit packing also means you can handle GC bias more accurately. If you have 60% GC content then p(A)=p(T)=.2 and p(C)=p(G)=.3. All would end up as 2 bits with huffman (picking one of A/T for the 3-bit code to make room for EOF). However, with a larger alphabet each triplet will be 5-7 bits, meaning you have more wiggle room to accurately match the actual probabilities.
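    Working the 60% GC example through (just the arithmetic, no new measurements): the per-base entropy is

    $$H = -\sum_i p_i \log_2 p_i = -2\,(0.2 \log_2 0.2) - 2\,(0.3 \log_2 0.3) \approx 1.971\ \text{bits/base},$$

    so per-base huffman is stuck at a whole 2 bits per base (and a little more once the EOF symbol is accounted for, as above), while a code over the 64 packed triplets can get much closer to 3 x 1.971 = 5.91 bits per triplet, because any rounding loss is now spread over three bases.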

  6. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Also, remember you can change strategies in Zlib too.

    Code:
    jkb@seq3a[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY < E.coli |wc -c
    dcurr=4638690
    Strat=0 level=-1
    1341984
    jkb@seq3a[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY 9 < E.coli |wc -c
    dcurr=4638690
    Strat=0 level=9
    1299711
    jkb@seq3a[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY 1 < E.coli |wc -c
    dcurr=4638690
    Strat=0 level=1
    1527229
    jkb@seq3a[data/canterbury] ~/work/compression/zlib_test Z_HUFFMAN_ONLY < E.coli |wc -c
    dcurr=4638690
    Strat=2 level=-1
    1299550
    jkb@seq3a[data/canterbury] ~/work/compression/zlib_test Z_RLE < E.coli |wc -c
    dcurr=4638690
    Strat=3 level=-1
    1282043
    Basically, even with a slow level-9 compression, Zlib cannot do as well as pure huffman alone. The code to find matches harms the compression, so a quick and dirty RLE-only LZ match comes out both smaller and quicker. The same also applies to Illumina quality values.

    Code:
    jkb@seq3a[data/canterbury] awk 'NR%4==0' ../SRR519063_1.fastq | tr -d '\012' | head -10000000c | ~/work/compression/zlib_test Z_DEFAULT_STRATEGY | wc -c
    dcurr=10000000
    Strat=0 level=-1
    3083146
    jkb@seq3a[data/canterbury] awk 'NR%4==0' ../SRR519063_1.fastq | tr -d '\012' | head -10000000c | ~/work/compression/zlib_test Z_DEFAULT_STRATEGY 9 | wc -c
    dcurr=10000000
    Strat=0 level=9
    2935330
    jkb@seq3a[data/canterbury] awk 'NR%4==0' ../SRR519063_1.fastq | tr -d '\012' | head -10000000c | ~/work/compression/zlib_test Z_RLE | wc -c
    dcurr=10000000
    Strat=3 level=-1
    2939565

  7. #6
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #4:
    I didn't know deflate makes use of an EOF symbol; I thought it stored the size of each block in the header. And thank you for reminding us that a Huffman code is built upon a tree.
    I made some additional tests with minigzip (included in zlib), activating the Z_HUFFMAN_ONLY mode via the -h option:

    Huffman size 1299568
    Huffman filtered size 1161668

  8. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    See https://www.ietf.org/rfc/rfc1951.txt for details.
    I implemented the huffman part once, but not the LZ portion, to allow me to do multiple interlaced streams. (This worked well for _simple_ coding of arrays of 4 16-bit ints as a series of 8 interlaced encoders.)

    Value 256 is end of block; 257 and above are length codes (distances use a separate code alphabet).

    I just tested Z_HUFFMAN_ONLY. It assigned 5 codes: a=2 c=2 g=2 t=3 eob=3.

    Z_RLE used 10 codes, with the same acgt=2223 lengths, eob=8 and a few distance codes.

    With Z_DEFAULT_STRATEGY it had 19 codes instead (due to distance codes), yielding a=4, c=4, g=4, t=4 with lots of other bit lengths ranging from 3 to 9 (12 for eob). The LZ part really didn't help it at all frankly as it needed more deduplication than was available to recover the increased cost of literals.

  9. #8
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #7:
    If I understand correctly, Z_RLE is RLE (good point!) plus Huffman (from the zlib manual: 'Z_RLE to limit match distances to one'), so I dropped the RLE from my filter (I updated the source in msg #1) and compressed the output with minigzip -r; it squeezes a little bit further:

    RLE filtered size 1155813
    Last edited by Marco_B; 3rd July 2015 at 15:25.

  10. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    The real benefit of packing multiple bases together, for huffman based entropy encoding or other bit-based systems, comes with genomes having a more biased GC content. Eg Malaria.

    Chr 14:

    Code:
    jkb@seq3b[data/canterbury] ls -l MAL14
    -rw-r--r-- 1 jkb team117 3291871 Jul  3 09:30 MAL14
    jkb@seq3b[data/canterbury] ~/work/compression/entropy16 < MAL14
    len = 3291871 bytes, entropy16 = 693592.708524, 1.685589 bits per byte
    len = 3291871 bytes, entropy8  = 695291.056966, 1.689716 bits per byte
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY -1 < MAL14 |wc -c
    dcurr=3291871
    Strat=0 level=-1
    843341
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY 9 < MAL14 |wc -c
    dcurr=3291871
    Strat=0 level=9
    792927
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_RLE < MAL14 |wc -c
    dcurr=3291871
    Strat=3 level=-1
    772356
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_HUFFMAN_ONLY < MAL14 |wc -c
    dcurr=3291871
    Strat=2 level=-1
    760651
    So again default modes of gzip are both larger and slower than huffman alone. They also cannot get remotely close to the simple entropy.

    Code:
    jkb@seq3b[data/canterbury] ~/work/solexa/seq3.pl < MAL14 > MAL14.3
    # Generates a minor error as it's not a multiple of 3 long, but irrelevant really for testing.
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY -1 < MAL14.3 | wc -c
    dcurr=1097298
    Strat=0 level=-1
    716549
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_DEFAULT_STRATEGY 9 < MAL14.3 | wc -c
    dcurr=1097298
    Strat=0 level=9
    715826
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_RLE < MAL14.3 | wc -c
    dcurr=1097298
    Strat=3 level=-1
    688595
    jkb@seq3b[data/canterbury] ~/work/compression/zlib_test Z_HUFFMAN_ONLY < MAL14.3 | wc -c
    dcurr=1097298
    Strat=2 level=-1
    693184
    So that's not bad - it tricks zlib into getting something which matches the simple entropy for huffman and actually marginally beats it with RLE.

    Of course why would you use huffman these days instead of ANS.
    Packing marginally helps rANS too though. (680887 for an order-1 rANS on MAL14.3)

    It also has a remarkable effect on zstd:

    -rw-r--r-- 1 jkb team117 887256 Jul 3 09:49 MAL14.zstd
    -rw-r--r-- 1 jkb team117 679403 Jul 3 09:49 MAL14.3.zstd
    -rw-r--r-- 1 jkb team117 1414388 Jul 3 09:50 E.coli.zstd
    -rw-r--r-- 1 jkb team117 1150751 Jul 3 09:56 E.coli.3.zstd

    FSE does a good job on this, but zstd does not unless we pack the bases together. Maybe it suffers from over optimistic LZ encoding in the same way that deflate does.
    Last edited by JamesB; 3rd July 2015 at 11:59.

  11. #10
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #9:
    Does what you call simple entropy (entropy8/16) rely on probabilities fixed in advance?
    It really appears that FSE beats Huffman, especially considering that LZ4, if I understand correctly, becomes more and more inefficient as the offset increases.
    A post I read around here makes me wonder whether the permutation of 0..3 values assigned to the bases when the filter packs a triplet could help, perhaps keeping bases with similar or equal probabilities close together.
    Last edited by Marco_B; 3rd July 2015 at 19:17.

  12. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    The entropy is simply calculated from the frequencies of 8 bit and 16 bit quantities for the file as a whole. There is no adaptive model or block by block analysis.

    FSE is a tANS encoder, meaning that it shares a lot of similarities with arithmetic coding in that it combines values into an integer and shifts out bits periodically, enabling symbols to consume fractional bits. It will always be at least as efficient as huffman. FSE is what ZSTD uses (essentially ZSTD can be considered as LZ4 + FSE, but I'm sure there are a few other tweaks Cyan put in too). Cyan's ZHUFF was a similar fast LZ combined with a huffman encoder instead of tANS.

    I don't understand your LZ4 comment though. LZ4 is a lightweight block based encoder, strictly deduplication stages with no backend entropy encoding in order to keep it as fast as possible.

  13. #12
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    Trying the idea in msg #10, I modified my filter (I updated the source in msg #1) with the following assignments

    C 0
    G 1
    A 2
    T 3

    and I tested it again without the internal RLE, compressing with minigzip -r; the size is

    E.coli 1155817

    so, at least for this genome, there is no improvement.

  14. #13
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #11:
    I meant that the LZ4-like part of ZSTD codes long offsets by adding byte after byte, and since repetition in a genome, as demonstrated above, occurs at long distances, the FSE part is even able to remedy this deficiency. Credit where credit is due.

  15. #14
    Member
    Join Date
    Jul 2015
    Location
    Germany
    Posts
    3
    Thanks
    3
    Thanked 0 Times in 0 Posts
    What about disregarding triplet encoding altogether and instead using delta compression followed by ppmd?

    7z a -m0=delta:1 -m1=ppmd e_coli.7z e.coli

    This yields 1,149,861 bytes.

  16. #15
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Just packing 4 symbols into each byte would give 1,159,672 bytes.
    I tested it.
    Code:
    original
    7zip gzip: 4638690 -> 1235587
    lzma     : 4638690 -> 1187955
    
    
    packed
    7zip gzip: 1159674 -> 1148027
    lzma     : 1159674 -> 1139737
    Attached Files
    • File Type: 7z gf.7z (9.3 KB, 58 views)

  17. The Following User Says Thank You to xezz For This Useful Post:

    Kennon Conrad (20th July 2015)

  18. #16
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I just realized my earlier comment should have said 1,159,673 bytes instead of 1,159,672 bytes. The file size divided by 4 is 1,159,672.5. In addition to the file size, you need to also know how many symbols are encoded in the last byte. That could be done with two bits and fit within the 1,159,673 bytes, or for simplicity it could be a byte giving xezz's 1,159,674 bytes.
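    A minimal sketch of one such layout (an illustration only; the packer attached above may differ): 2 bits per base, 4 bases per byte, plus a one-byte trailer giving how many bases of the last packed byte are valid.

    Code:
    program pack4;
    {$R-,S-}
    { Pack 4 bases per byte, low bits first; the final byte of the output   }
    { tells the decoder how many bases (1..4) the last packed byte carries. }
    var fin,fout:file;
        b,acc,fill:byte;
        nb:word;
        fname:string;

    function basecode(ch:byte):byte;
    begin
      case ch of
        ord('c'),ord('C'):basecode:=0;
        ord('g'),ord('G'):basecode:=1;
        ord('a'),ord('A'):basecode:=2;
        ord('t'),ord('T'),ord('u'),ord('U'):basecode:=3;
      else
        begin
          writeln('The file does not appear as genome');
          halt(1);
        end;
      end;
    end;

    begin
      write('Input file?');
      readln(fname);
      assign(fin,fname);
      reset(fin,1);
      assign(fout,fname+'.pk4');
      rewrite(fout,1);
      acc:=0;
      fill:=0;
      while not eof(fin) do
        begin
          blockread(fin,b,1,nb);
          acc:=acc or (basecode(b) shl (fill shl 1));
          inc(fill);
          if fill=4 then
            begin
              blockwrite(fout,acc,1,nb);
              acc:=0;
              fill:=0;
            end;
        end;
      if fill>0 then
        blockwrite(fout,acc,1,nb)   { partial last byte }
      else
        fill:=4;                    { last packed byte was full }
      blockwrite(fout,fill,1,nb);   { trailer: valid bases in the last byte }
      close(fin);
      close(fout);
    end.
    With this layout, E.coli's 4638690 bases pack into 1159673 data bytes plus the one-byte trailer, i.e. the 1159674 bytes discussed above.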

  19. #17
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    changed header.
    Attached Files

  20. #18
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #14:
    I tried to write a specialized program to apply delta coding to genomic data:

    Code:
    program delta;
    {$R-,S-}
    var base,obase,dbase:byte;
    nb:word;
    fin,fout:file;
    fname:string[12];
    begin
      write('Input file?');
      readln(fname);
      assign(fin,fname);
      reset(fin,1);
      fname:=copy(fname,1,pos('.',fname)-1);
      assign(fout,fname+'.out');
      rewrite(fout,1);
      obase:=4;  { 4 means: no previous base yet }
      while not eof(fin) do
        begin
          blockread(fin,base,1,nb);
          case base of
            ord('c'),ord('C'):base:=0;
            ord('g'),ord('G'):base:=1;
            ord('a'),ord('A'):base:=2;
            ord('t'),ord('T'),ord('u'),ord('U'):base:=3;
          else
            begin
              writeln('The file does not appear as genome');
              halt(1);
            end;
          end;
          if obase<4 then
            dbase:=(base+4-obase) and 3  { difference from the previous base, modulo 4 }
          else
            dbase:=base;  { first base: no predecessor }
          obase:=base;
          blockwrite(fout,dbase,1,nb);
        end;
      close(fin);
      close(fout);
    end.
    but the results do not seem very good. With the following compressors I obtained:

    minigzip -r 1288679
    gzip 1350538
    glue 1370589

    PS The previous version of the code contained a bug which led to an unbelievable ratio:
    dbase:=(base+4-obase) shr 2 -> dbase:=(base+4-obase) and 3
    Last edited by Marco_B; 20th July 2015 at 16:16.

  21. #19
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Marco_B View Post
    I wrote a simple filter for genomic data: it reads three bases and codes them into a byte. [...] Here are the results:

    original file size 4638690
    gzip size 1341254
    glue size 1345972
    filtered file size 1508628
    gzip filtered size 1165463
    glue filtered size 1388177

    Since 4638690 / 4 = 1159672.5 bytes, a simple bijective Huffman compressor that replaces the symbols
    with 00 01 10 11 gives exactly 1159673 bytes. That is the file that should then be compressed to a
    smaller size, instead of your filtered file. In this example it already beats your filtered size before
    any of the follow-on compressors you used.

  22. #20
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #19:
    As said, since packing four bases per byte might result in a loss of correlation, and starting from a Huffman output amounts to iterated compression, applying a filter seemed a good choice. Sadly, for now the experimental evidence shows that this approach does not lead to better compression; I do not want to defend it as a basic principle, but the experimental evidence could not be known in advance.

  23. #21
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by biject.bwts View Post
    Since 4638690 / 4 = 1159672.5 bytes, a simple bijective Huffman compressor that replaces the symbols
    with 00 01 10 11 gives exactly 1159673 bytes. That is the file that should then be compressed to a
    smaller size, instead of your filtered file. In this example it already beats your filtered size before
    any of the follow-on compressors you used.
    For uniform distributions, yes.

    Try it on Malaria or similar GC/AT biased genomes and you'll see that packing multiple DNA bases into a byte before applying huffman encoding makes a big difference. This is simply down to the whole-bit vs fractional-bit entropy encoding of huffman (vs arithmetic etc).

  24. #22
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    I made another test with an arithmetic encoder, see the thread Witten-Cleary arithmetic coder.
    I applied it to the original E.coli, whose size becomes 1176777, and to a filtered version with RLE=3, which gives 1154857.
    I think the difference could be even larger on genomes with a GC predominance, despite the fractional-bit handling of this kind of entropy coder.
    Last edited by Marco_B; 21st July 2015 at 16:50.

  25. #23
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    I rewrote my filter to replace the RLE with a mixed-context PPM.
    Code:
    program gen;
    {$R-,S-}
    const lim=3;
    order=1023;
    scale=254;
    var aa,oaa,paa,cnt:byte;  { current, previous and predicted triplets; run counter }
    trans:array[0..order,0..63] of byte;  { per backward distance: the triplet last seen after that context symbol }
    context,weight:array[0..order] of byte;  { ring buffer of past triplets and the weight of each order }
    mixer:array[0..63] of cardinal;  { votes collected for each candidate next triplet }
    beg,off,i,j,k:word;
    fin,fout:file;
    fname:string[12];
    begin
      write('Input file?');
      readln(fname);
      assign(fin,fname);
      reset(fin,1);
      if filesize(fin) mod 3>0 then
        begin
          writeln('Sequence truncation');
          halt(1);
        end;
      fname:=copy(fname,1,pos('.',fname)-1);
      assign(fout,fname+'.out');
      rewrite(fout,1);
      oaa:=64;
      beg:=0;
      off:=0;
      for i:=0 to order do
        begin
          for j:=0 to 63 do
            trans[i,j]:=64;
          weight[i]:=1;
        end;
      while not eof(fin) do
        begin
          aa:=0;
          for i:=0 to 2 do
            begin
              blockread(fin,j,1,k);
              case j of
                ord('c'),ord('C'):j:=0;
                ord('g'),ord('G'):j:=1;
                ord('a'),ord('A'):j:=2;
                ord('t'),ord('T'),ord('u'),ord('U'):j:=3;
              else
                begin
                  writeln('The file does not appear as genome');
                  halt(1);
                end;
              end;
              aa:=aa+j shl (i shl 1);
            end;
          for i:=0 to 63 do
            mixer[i]:=0;
          i:=beg;
          j:=off;
          k:=order+1;
          while j>0 do
            begin
              paa:=trans[j,context[i]];
              if paa<64 then
                inc(mixer[paa],weight[j]);
              if paa<>aa then
                trans[j,context[i]]:=aa
              else
                begin
                  inc(weight[j]);
                  if weight[j]=scale then
                    k:=0;
                end;
              if i<order then
                inc(i)
              else
                i:=0;
              dec(j);
            end;
          context[i]:=aa;
          trans[0,aa]:=aa;
          for i:=k to order do
            weight[i]:=(weight[i]+1) shr 1;
          paa:=oaa;
          for i:=0 to 63 do
            if mixer[i]>mixer[paa] then
              paa:=i;
          if (paa=aa) and (cnt<lim) then
            inc(cnt)
          else
            begin
              if oaa<64 then
                oaa:=oaa+cnt shl 6
              else
                oaa:=aa;
              blockwrite(fout,oaa,1,k);
              oaa:=aa;
              cnt:=0;
            end;
          if off=order then
            begin
              inc(beg);
              if beg>order then
                beg:=0;
            end
          else
            inc(off);
        end;
      if oaa<64 then
        begin
          oaa:=oaa+cnt shl 6;
          blockwrite(fout,oaa,1,k);
        end;
      close(fin);
      close(fout);
    end.
    The predictor is quite unusual because it can only record one symbol per context instead of a set of probabilities;
    this allows a plain counter (<=3) of matched symbols. Furthermore, the contexts rely on a bare model in which
    only the first symbol and the length lead to the prediction and everything in the middle is considered irrelevant;
    the advantage is an exponential space reduction and consequently an extension of the context order (=1023).
    Such a model is sketched around the picture of the monkey trying to type Shakespeare while keeping all the good past results;
    in other words, there is the assumption that the structure of the DNA at every backward length suffices to determine the next base.
    The main goal of the mixer, which is linear with adaptive but not normalized weights, is to mitigate the possible (probable) deviation from this pattern.
    Escherichia coli compresses to 1493881 bytes, where the same with RLE was 1508628;
    in conclusion, one is justified in saying that nature is more subtle than expected.
    Last edited by Marco_B; 15th September 2015 at 10:56. Reason: Change from fixed to adaptive weights.

  26. #24
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    The inadequacy of that model becomes clear when one looks at the values of weight: except for order 1, the others are close to a flat distribution.
    I had better go eat some chocolate.

  27. #25
    Member
    Join Date
    Jul 2015
    Location
    Germany
    Posts
    3
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Maybe one could just as well save the time spent on writing a preprocessor
    and have the archiver (e.g. zpaq) do the work:

    1,114,236 <- zpaq -m 5
    1,121,051 <- nz -cc
    1,136,873 <- 7z -m0=ppmd=5
    1,137,171 <- arc -mppmd:5
    1,138,379 <- rar -mcT
    1,140,020 <- arc -mgrzip:1
    1,185,232 <- 7z -m0=lzma
    1,186,180 <- xz -e
    1,187,145 <- arc -m5
    1,203,882 <- arc
    1,235,833 <- 7z a -mx7 e.zip ...
    1,250,548 <- 7z a -mx7 e.bz2 ...
    1,251,004 <- bzip2 -9
    1,295,501 <- rar a -m5 e.zip ...
    1,299,066 <- gzip -9
    1,299,319 <- zip -9
    4,638,690 <- E.coli (original size)

    or alternatively, if there are many such sequences to store: invest the space and time in saving a reference sequence, coding only the deviations from it (using relative coordinates from one deviation to the next), and then compress that:
    http://bioinformatics.oxfordjournals....full.pdf+html

    This finally allows storing a complete human genome within 2.5 MB (not counting space for the reference).
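    A toy sketch of the idea (my reading of it; the real tools, including the one in the linked paper, use far tighter coding and also handle insertions and deletions): walk the target and the reference in parallel and emit, for each mismatch, the distance from the previous mismatch plus the substituted base.

    Code:
    program refdelta;
    {$R-,S-}
    { Usage: refdelta <reference> <target> <output>                        }
    { Emits (gap,base) pairs: gap = matching positions since the last      }
    { deviation (16-bit, so runs longer than 65535 are not handled here),  }
    { base = the differing target byte. Equal-length sequences assumed.    }
    var fref,ftgt,fout:file;
        r,t:byte;
        gap,nb:word;
    begin
      assign(fref,paramstr(1));
      reset(fref,1);
      assign(ftgt,paramstr(2));
      reset(ftgt,1);
      assign(fout,paramstr(3));
      rewrite(fout,1);
      gap:=0;
      while (not eof(fref)) and (not eof(ftgt)) do
        begin
          blockread(fref,r,1,nb);
          blockread(ftgt,t,1,nb);
          if r=t then
            inc(gap)   { identical position: costs nothing }
          else
            begin
              blockwrite(fout,gap,2,nb);   { relative coordinate of the deviation }
              blockwrite(fout,t,1,nb);     { the base that differs }
              gap:=0;
            end;
        end;
      close(fref);
      close(ftgt);
      close(fout);
    end.
    The output would then be handed to a general compressor, as described above.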

    For an overview of current methods used by the community in this area, one might have a look at

    https://en.wikipedia.org/wiki/Compre...equencing_Data

  28. #26
    Member
    Join Date
    May 2015
    Location
    Italy
    Posts
    46
    Thanks
    0
    Thanked 11 Times in 8 Posts
    To msg #25:
    Playing with models is a way to understand something about the process that generates the sequences, and it is fun too.
    Compression against a reference sequence is a useful tool for DNA, because DNA is highly conserved even among different species,
    but it is not so interesting from the point of view of a pure compression scheme. Moreover, it is tightly bound to DNA data, as any
    preprocessor is to its own target.
