
Thread: Frequent Bits Elimination Method - suggest please

  1. #1
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Post Frequent Bits Elimination Method - suggest please

    Hello Experts

    My fellow researchers and I have developed a few compression algorithms (keeping good speed and a high compression ratio in view), most of them specifically for raw DNA sequences, and they achieved very good speed with a good compression ratio. One of them was AMC (Matrix Compression); we started modifying it and did some paper work before translating it into C/C++ code. We modified it to work with arbitrary binary data (not only raw DNA sequence files). With DNA data we had many options because the alphabet is limited to occurrences of 'a', 'c', 'g', and 't' (or their capitalized versions).

    How does AMC 3.0 work?

    AMC has two filtering schemes (Scheme H and Scheme V - horizontal [rows] and vertical [columns]). It filters a matrix and determines which scheme allows more places in the matrix to be eliminated. It then keeps track of the places to be filled later (when decompression takes place) in an unsigned __int64 integer, where each bit indicates a row or column: if set to 1, it is to be filled with the popular character we removed; if 0, it is filled with the remaining uneliminatable bytes (we call them JUNK). This process is repeated until the eliminatable popular character plus the sizeof(unsigned __int64) offsets amount to less than 16. The input is a 4096-byte block (treated as a 64x64 unsigned __int8 matrix).

    (The popular character (PC) is the character that occurs most frequently in the matrix.)
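    As a rough illustration of the scheme selection, here is a simplified C sketch (not the actual AMC source; it assumes a row or column is eliminated only when it consists entirely of the PC):

    Code:
    #include <stdint.h>
    
    /* Illustrative sketch only. Assumption: a row/column is eliminated when it
       consists entirely of the popular character. Counts the eliminatable lines
       for one scheme and records them in a 64-bit mask, one bit per line. */
    static int count_full_lines(uint8_t m[64][64], uint8_t pc,
                                int vertical, uint64_t *mask)
    {
        int eliminated = 0;
        *mask = 0;
        for (int i = 0; i < 64; i++) {
            int full = 1;
            for (int j = 0; j < 64; j++) {
                uint8_t v = vertical ? m[j][i] : m[i][j];   /* column vs row */
                if (v != pc) { full = 0; break; }
            }
            if (full) { *mask |= (uint64_t)1 << i; eliminated++; }
        }
        return eliminated;
    }
    Whichever scheme eliminates more lines wins, and its 64-bit mask is what gets kept for decompression.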

    Now we have stepped back and realized that eliminating similar bits across N bytes (comparing each byte vertically with the previous one) would be a better idea. BUT the question was: the input might be highly compressed data that would not give us enough matches, so we decided to transform the bits in 6 ways, and we use 3 bits to store the type of filter applied.

    Code:
    bits Filter
    ---------------------------
    000  No Filter
    001  Toggle
    010  Flip (Flip all 8 bits)
    100  Swap (Actually swap each 2 bits in a byte)
    101  Toggle + Flip
    110  Toggle + Swap
    011  Flip + Swap
    111  Unused

    The very first thing we do is check the number of set bits in the first byte; if fewer than 4 bits are set, we Toggle the first byte and add 001 to the TFS block.
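    To make the three basic operations concrete, here is a rough C sketch (simplified, not our actual code) of how we read Toggle, Flip, and Swap from the table and the example below; a combined filter such as Toggle + Flip is simply the two applied one after another:

    Code:
    #include <stdint.h>
    
    /* Rough sketches of the three basic filters (illustrative only). */
    static uint8_t filter_toggle(uint8_t b)   /* complement: 00000011 -> 11111100 */
    {
        return (uint8_t)~b;
    }
    
    static uint8_t filter_flip(uint8_t b)     /* reverse bit order: 00101111 -> 11110100 */
    {
        uint8_t r = 0;
        for (int i = 0; i < 8; i++)
            r = (uint8_t)(r | (((b >> i) & 1u) << (7 - i)));
        return r;
    }
    
    static uint8_t filter_swap(uint8_t b)     /* swap each adjacent bit pair: ab cd ef gh -> ba dc fe hg */
    {
        return (uint8_t)(((b & 0x55u) << 1) | ((b & 0xAAu) >> 1));
    }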

    We call these filters; each filter entry occupies 3 bits in a separate block called the TFS block. We apply each filter one by one to the current byte and compare it with the FIRST byte, which normalizes each byte to match at least 4 bits of the previous byte. Once the (VERTICAL) matches drop below 4, we BREAK and store the data (NOT the TFS block), and count the next incoming byte as a new first byte. Sample data:

    Code:
    1. 1 1 1 1 1 1 1 1
    2. 1 0 1 1 1 1 0 0
       
    3. 0 0 1 0 1 1 1 1
    Flip 11110100 (4 matches with previous) 
    
    4. 0 0 0 0 0 0 1 1
    Toggle - 11111100 (5 matches)
    Finally makes:
    Code:
    1 1 1 1 1 1 1 1
    1 0 1 1 1 1 0 0
    1 1 1 1 0 1 0 0
    1 1 1 1 1 1 0 0
    etc. If you review this as a matrix, there are some columns that can be eliminated, and their offsets can be preserved in just a single byte. We call it the OFFSETS block.

    OFFSETS: 1 0 1 1 0 1 0 0 (where 1 means the column was chopped off, 0 means the data is considered JUNK)
    Code:
                1 1 1 1 1 1 1 1
                1 0 1 1 1 1 0 0
                1 1 1 1 0 1 0 0
                1 1 1 1 1 1 0 0
    After eliminating those columns, the resulting bits are:
    Code:
                - 1 - - 1 - 1 1
                - 0 - - 1 - 0 0
                - 1 - - 0 - 0 0
                - 1 - - 1 - 0 0
    What eventually remains we call JUNK:

    Code:
    1111
    0100
    1000
    1100    =>   1111010010001100 == (2 bytes)
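
    A rough C sketch of this column elimination step (illustrative only; it assumes, as in the example, that a column is eliminated only when it is all 1s, so decoding can refill it without any extra information):

    Code:
    #include <stdint.h>
    
    /* Illustrative sketch only. rows[0..n-1] are the (already filtered) bytes.
       Assumption: a column is eliminated when every row has a 1 there.
       Returns the OFFSETS byte; the surviving bits are packed row by row
       into junk[] and their count is stored in *junk_bits. */
    static uint8_t eliminate_columns(const uint8_t *rows, int n,
                                     uint8_t junk[], int *junk_bits)
    {
        uint8_t offsets = 0xFF;                 /* AND of all rows           */
        for (int r = 0; r < n; r++)
            offsets &= rows[r];                 /* bit = 1 => column all 1s  */
    
        int k = 0;
        for (int r = 0; r < n; r++)
            for (int c = 7; c >= 0; c--)        /* MSB first, as in the post */
                if (!((offsets >> c) & 1)) {    /* column kept => JUNK bit   */
                    if ((rows[r] >> c) & 1)
                        junk[k >> 3] |= (uint8_t)(1u << (7 - (k & 7)));
                    else
                        junk[k >> 3] &= (uint8_t)~(1u << (7 - (k & 7)));
                    k++;
                }
        *junk_bits = k;
        return offsets;
    }
    On the four example bytes above this returns OFFSETS = 10110100 and packs exactly the 16 JUNK bits shown.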


    So finally the structure of an encoded block will be:

    Code:
    Block           bits     bytes       Value              Description
    ---------------------------------------------------------------------------------------
    SIZE            8        1           uc 4               Total bytes encoded
    OFFSETS         8        1           10110100           Offsets of eliminated columns
    JUNK            16       2           1111010010001100   Uneliminatable columns' data
    The TFS block, by contrast, is a global block; it is written at the end of the file once encoding has completed.
    In this 4-byte encoding case, if we attach the TFS to it, it unfortunately adds 6 more bits (storable in 8) = 1 byte, so the final data after encoding 4 bytes is now 5 bytes. Once such a situation is encountered, the 4-byte block is considered incompressible.
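
    A back-of-the-envelope check of that accounting (treating the block's TFS bits as if they were attached to the block, as in the example above):

    Code:
    /* Illustrative accounting only: SIZE and OFFSETS are one byte each,
       plus the JUNK bits and the TFS bits spent on this block. */
    static int fbem_encoded_bytes(int junk_bits, int tfs_bits)
    {
        int bits = 8 /* SIZE */ + 8 /* OFFSETS */ + junk_bits + tfs_bits;
        return (bits + 7) / 8;          /* round up to whole bytes */
    }
    
    /* Example from above: fbem_encoded_bytes(16, 6) == 5 > 4 source bytes,
       so the block is flagged incompressible. */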

    BUT at some positions, when compressing our test files of:
    1. raw DNA sequences
    2. Windows BMP files
    3. some already compressed JPEG files
    4. PAQ compressed files


    we found many matches that can form a favorable sequence (after the filters).

    What my fellow researchers and I are seeking are suggestions.
    What we have already tried:

    1. Keeping the filter entries of the TFS block to 2 bits only. But then it accommodates only 4 types, and the first must remain No-Filter, so practically we have only 3 possible 2-bit combinations left, i.e. 01, 10, and 11. With these we cannot broaden the filter set to include combinations of two, such as Toggle + Flip, Toggle + Swap, and Flip + Swap.


    2. Keeping yet another higher-level block of bits that indicates whether a filter is applied or not, where 0 means NO and 1 means YES. Using this method we could save a lot in the TFS block by eliminating the No-Filter entries.

    3. We have also tried TYPES OF BLOCKS, meaning the maximum number of filters (M) applied in a block of N bytes, where
    if M is 0 then the type is 00
    if M <= 3 then the type is 01
    if M > 3 then the type is 10
    But we still cannot eliminate the No-Filter entries here. We did not try procedures 2 and 3 together, but I thought we should consult the experts here; you can give us some great suggestions which will ultimately result in a new compressor with a great compression ratio.


    We have named it FBEM; it will be released under the GNU GPL once finalized and tested.

    Your reviews and suggestions are most welcome. Our joint effort can result in something really good and fast. Constructive criticism is most welcome; it will make us work better.

    regards
    Ali...
    Last edited by Scientist; 27th September 2009 at 12:43.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Scientist View Post
    BUT at some positions, when compressing our test files of:
    1. PAQ compressed files
    JUNK!

  3. #3
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Smile

    Quote Originally Posted by Bulat Ziganshin View Post
    JUNK!
    Bulat, I did not follow the meaning of your VERY MUCH COMPRESSED reply, but I think you missed the last 3 points. When I wrote about PAQ, I was right about it: we have a DNA file compressed using AMC, then using paq8o.
    zin.seq > zin.amc > zin.paq8o (finally 161 KB)

    file zin.paq8o

    Code:
    Actual   AMC       PAQ 
    ----------------------
    14MB    3.6MB     161KB
    
    Testing only 13 bytes from zin.paq8o (after filters), we found many eliminatable columns; proof:
     1 1 1 1 0 1 1 1
     1 1 0 1 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 0 0 1 1 1 1
     1 1 1 1 1 1 1 1
    5 eliminatable columns = 65 bits, and the JUNK is now 39 bits.
    Total 104 bits > encoded to 39 bits.


    Dear Bulat Ziganshin
    I request you once again: please review the end of my post again, sir. There is a chance the paq8o file may be recompressed (I might be wrong), but tests don't lie.

    Waiting for your 2nd review, just a few minutes.
    Last edited by Scientist; 27th September 2009 at 12:57.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, it's really useless as far as compression is concerned, but it might as well be a good programming exercise :)
    So I'd suggest:
    1. Implement a _reversible_ transformation like that (might require adding more flags etc).
    2. Replace fixed bitfields with arithmetic coding (which would solve the topic question).
    3. Discard the mentioned transformations and make a normal CM using arithmetic coding from [2].

    Also, good entropy coding produces random output, which is impossible to shrink using ad hoc transformations - some fragments might seem compressible, but on average the result would be the same or worse (in a proper lossless implementation).

    But in most cases there are still some ways to compress already compressed data:
    1. Decode it using the original method and encode it again using a better model for prediction.
    2. Only target the cases where the algorithm used is imprecise.
    For example, most compressors can only reduce the redundancy to a certain fraction, still generating redundant output for very redundant input.
    Thus, it would be possible to, say, compress again a paq8 archive of a file with 10M zero bytes. But such compressible patterns would still be algorithm-dependent, and types of data specifically compatible with the suggested transformations are really unlikely.
    And then, it's much more reasonable to preprocess the original data and remove the redundancy for which the actual compression algorithm doesn't have enough precision - instead of inventing a secondary compression algorithm.
    Last edited by Shelwien; 27th September 2009 at 13:10.

  5. #5
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Just to get some "more" ideas, could you post a compression ratio with timing for a known DNA sequence (I personally prefer E.coli from the Canterbury corpus)? I didn't get the point of using that "matrix filtering", because MTF and alphabet remapping could be used for that too. Btw, I have some experiments with a model specialized for DNA sequences (it seems I also have to handle the FASTA format prior to real compression experiments). I'm still working on it with a professor at my university.
    BIT Archiver homepage: www.osmanturan.com

  6. #6
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    1. Implement a _reversible_ transformation like that (might require adding more flags etc).
    2. Replace fixed bitfields with arithmetic coding (which would solve the topic question).
    3. Discard the mentioned transformations and make a normal CM using arithmetic coding from [2].
    Are there any papers or web links you might suggest, sir?


    Quote Originally Posted by osmanturan View Post
    Just to get some "more" ideas, could you post a compression ratio with timing for a known DNA sequence (I personally prefer E.coli from the Canterbury corpus)? I didn't get the point of using that "matrix filtering", because MTF and alphabet remapping could be used for that too. Btw, I have some experiments with a model specialized for DNA sequences (it seems I also have to handle the FASTA format prior to real compression experiments). I'm still working on it with a professor at my university.
    Merhaba kardisim! Well, it only works on raw DNA sequence files (for now), just like the E.coli you suggested. We have not yet worked on more sophisticated formats which include more information than just the sequences, such as IG, EMBL, GenBank, GCG, FASTA, etc. For these formats we plan to compress the non-sequence part using some other compressor and, on decompression, reformat the sequence as per the file's format; we will record the file format.

    AMC results on E.coli
    Code:
    Compressed 4638690 bytes to 1159672 bytes in 0.58s
    AMC is basically a normalizer applied just before the final compression; it is designed primarily for speed.


    regards

  7. #7
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Scientist View Post
    Merhaba kardisim...
    Nice to see some words in Turkish. (Btw, the correct form "Merhaba kardeşim")

    Quote Originally Posted by Scientist View Post
    ...Well, it only works on raw DNA sequence files (for now), just like the E.coli you suggested. We have not yet worked on more sophisticated formats which include more information than just the sequences, such as IG, EMBL, GenBank, GCG, FASTA, etc. For these formats we plan to compress the non-sequence part using some other compressor and, on decompression, reformat the sequence as per the file's format; we will record the file format.
    I'm working on raw DNA sequences as well. But I also need FASTA parsing capabilities, because human genome sequences are mostly provided in FASTA format. Just for now, I've built a limited-alphabet n-gram model which can track up to order-9 (for now). But the compression part is still not implemented, because I need a better model.

    Quote Originally Posted by Scientist View Post
    AMC results on E.coli
    Code:
    Compressed 4638690 bytes to 1159672 bytes in 0.58s
    AMC is basically a normalizer applied just before the final compression; it is designed primarily for speed.
    For comparison, here is my old compressor's performance on E.coli. Note that it's not a DNA-specialized compressor; it's a generic compressor for any file.

    1,133,329 bytes in 5.383 seconds
    BIT Archiver homepage: www.osmanturan.com

  8. #8
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Nice to see some words in Turkish. (Btw, the correct form "Merhaba kardeşim")
    I did not have 'ş' on the keyboard actually, and yes, I am wrong about the 'i': it must be 'e', 'Kardeş'. I have been in Ankara for 2 years.


    For comparison, here is my old compressor's performance on E.coli. Note that it's not a DNA-specialized compressor; it's a generic compressor for any file.

    1,133,329 bytes in 5.383 seconds
    I just visited your site (it's nice) and found BIT Archiver. I tested it on an Intel(R) 3.0 GHz with a little RAM of 512 MB (tested before this post of yours); my results were a little different regarding the time spent:

    Compressed E.coli (4638690 bytes)

    AMC 1159672 0.58s
    BIT 1133329 9.205s
    commands:
    amc E.coli
    bit a E.coli

  9. #9
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Scientist View Post
    I did not have 'ş' on the keyboard actually, and yes, I am wrong about the 'i': it must be 'e', 'Kardeş'. I have been in Ankara for 2 years.
    Really? What did you do there? Did you work at a university? There are many universities in Ankara, but I really don't like the city itself.

    Quote Originally Posted by Scientist View Post
    I just visited your site (it's nice) and found BIT Archiver. I tested it on an Intel(R) 3.0 GHz with a little RAM of 512 MB (tested before this post of yours); my results were a little different regarding the time spent:

    commands:
    amc E.coli
    bit a E.coli
    I have a Core Duo 2.2 GHz with 2 GiB RAM installed. Maybe that could be the reason. Also, I tested with the -p=4 switch. In other words:

    bit a -p=4 E.coli

    But I should warn you: the -p=4 switch needs 330 MiB of memory, so you will definitely encounter swapping (= slow processing).
    BIT Archiver homepage: www.osmanturan.com

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    4638690*2/8 = 1159672.5 bytes without any transformations - just
    compact coding of the 4-symbol alphabet.
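
    A minimal sketch of such compact coding (assuming the input contains only
    'a','c','g','t'; the original length would have to be stored separately to
    decode a partial last byte):

    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    /* Pack each of the 4 symbols into 2 bits: 4638690 symbols -> 1159673 bytes. */
    static int code(int ch)
    {
        switch (ch) {
        case 'a': return 0;
        case 'c': return 1;
        case 'g': return 2;
        default:  return 3;   /* 't' */
        }
    }
    
    int main(void)
    {
        int ch, n = 0;
        uint8_t byte = 0;
        while ((ch = getchar()) != EOF) {
            byte = (uint8_t)((byte << 2) | code(ch));
            if (++n == 4) { putchar(byte); n = 0; byte = 0; }
        }
        if (n)                            /* flush a partial last byte */
            putchar((uint8_t)(byte << (2 * (4 - n))));
        return 0;
    }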

    Also, it seems like a rather weird type of data - not much context
    correlation most of the time, but with occasional long matches.
    So rep + compact direct coding might be a good idea it seems,
    taking speed into account.

    Btw, I expected to get better results from bytewise compressors,
    but somehow it didn't turn out as good as expected:

    1129315 ppmonstr Jr1
    1123614 BWTmix_v1
    1119892 bwmostr 001
    1108006 ash 04a
    1101678 paq8p

    And anyway that "matrix coding" doesn't seem like a good way
    for compact coding of small alphabets either :)

  11. #11
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Is the basic symbol a 2-bit sequence (4 values)?
    Then maybe byte-wise is not appropriate...

  12. #12
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There is a little confusion in the thread here: I am seeking advice on Frequent Bits Elimination rather than the old AMC.
    AMC is the old method of finding frequent BYTES, but in this thread I would request you to comment and make suggestions on BITS ELIMINATION.

    Code:
    Really? What have you done there? Worked on a university? There are many universities in Ankara, but I really don't like the city itself.
    Well, I liked Ankara; I was there back in 1993-95.

    So rep + compact direct coding might be a good idea it seems,
    taking speed into account.
    Will try that.

    1129315 ppmonstr Jr1
    1123614 BWTmix_v1
    1119892 bwmostr 001
    1108006 ash 04a
    1101678 paq8p

    And anyway that "matrix coding" doesn't seem like a good way
    for compact coding of small alphabets either
    True, AMC is an old resource and needs a lot of improvements.

    Is the basic symbol a 2-bit sequence (4 values)?
    Then maybe byte-wise is not appropriate...
    Are you talking about AMC or FBEM (the Frequent Bits Elimination Method)?

    regards

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    As I said before, you don't have to limit yourself to any fixed encodings.
    Instead you can collect statistics and assign "codelengths" to field values etc. depending on their probability.
    See http://en.wikipedia.org/wiki/Range_encoding or something.
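
    For example, a field value with probability p ideally costs about -log2(p) bits,
    and a range coder gets close to that. A small sketch with made-up counts for
    the eight 3-bit TFS values:

    Code:
    #include <math.h>
    #include <stdio.h>
    
    /* Ideal codelengths from observed counts (the counts here are made up).
       A range/arithmetic coder approaches these; fixed 3-bit fields do not. */
    int main(void)
    {
        double count[8] = { 900, 40, 25, 15, 10, 5, 4, 1 };
        double total = 0;
        for (int i = 0; i < 8; i++) total += count[i];
        for (int i = 0; i < 8; i++)
            printf("value %d: p = %.4f, ideal length = %.2f bits\n",
                   i, count[i] / total, -log2(count[i] / total));
        return 0;
    }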

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    fpaq0 E.coli -> 1161005, 0.9 sec. (2 GHz T3200, 32-bit)
    zpaq co0s.cfg E.coli -> 1157569, 2.0 sec.

    Both programs are simple order 0 models, so there is not a lot to be gained with fancy compression algorithms. o0s.cfg is:

    Code:
    (stationary order 0 model)
    comp 0 0 0 0 1
      0 cm 9 255
    hcomp
      halt
    post
      0
    end
    But about the bits elimination algorithm, it looks to me like yet another attempt at random data compression doomed to fail.

  15. #15
    Member
    Join Date
    Sep 2009
    Location
    APITRC
    Posts
    27
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello Dr.Matt

    fpaq has outstanding performance, taking both compression ratio and speed into account. This so-far theory of the Frequent Bits Elimination Method is no competition for the methods you guys have developed after huge research. I still call it a theory because you have been comparing results with the old AMC, which is NOT actually the Frequent Bits Elimination Method. I am still calling it a theory because no fully functional code has yet been written for it. But yes, at some point Shelwien suggested that it can be a good programming exercise, so we are willing to do this exercise even if it does not save even a single byte.

    Quote Originally Posted by Matt Mahoney View Post
    But about the bits elimination algorithm, it looks to me like yet another attempt at random data compression doomed to fail.
    The FBEM tests are not yet posted; those were old AMC results. Besides, thanks for taking the time out to reply; I will let you know when we have a working model of what we are talking about.

    regards
    Last edited by Scientist; 28th September 2009 at 18:33.

