
Thread: BCM v0.08 - The ultimate BWT-based file compressor!

  1. #1: encode, The Founder (Moscow, Russia)

    OK, a new release is here. Hardcore work. Anyway, check out what's new:
    • This is a complete rewrite of the compressor
    • The default block size is now 64 megabytes
    • Invented a new concept in BWT-output encoding. Still no cheating or tricks like M03
    • Many optimizations and code cleanups
    So, it was really hard to beat BCM v0.07, but I did it! The brand new BCM is one of the strongest BWT-based compression engines available. A few weeks of running optimizations, some serious analysis of special contexts and models for BWT data... and an uncounted number of bottles of Vodka and Pepsi...

    Some brief results:

    book1 -> 209,826 bytes
    calgary.tar -> 779,619 bytes
    world95.txt -> 466,250 bytes
    fp.log -> 557,226 bytes
    ENWIK8 -> 20,744,613 bytes
    bible.txt -> 725,091 bytes
    3200.txt -> 3,660,195 bytes

    So anyways, enjoy!


  2. #2: Moderator (Tristan da Cunha)

    Thanks Ilia!

    Mirror: Download

  3. #3: Vacon (Germany)
    Hello everyone,

    Thanks, Ilia!

    Best regards!

  4. #4: nanoflooder (Saint Petersburg, Russia)
    20 744 613 on enwik8 using 1024m

  5. #5: encode, The Founder (Moscow, Russia)
    1024m? No way! It uses 5N!

  6. #6: Moderator (Tristan da Cunha)

    Quick speed test on my old 750 MHz, 512MB RAM Pentium 3 machine...

    D:\>viper bcm e25 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 25m block...
    encoding 25m block...
    encoding 25m block...
    encoding 20m block...
    ratio: 1.776 bpb
    done

    Kernel Time = 2.323 = 00:00:02.323 = 0%
    User Time = 309.785 = 00:05:09.785 = 97%
    Process Time = 312.108 = 00:05:12.108 = 97%
    Global Time = 319.028 = 00:05:19.028 = 100%
    The compressed size of enwik8 is 21.1 MB (22,203,843 bytes) with 25MB block size.


    Same test, but with 48MB block size...

    D:\>viper bcm e48 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 48m block...
    encoding 47m block...
    ratio: 1.717 bpb
    done

    Kernel Time = 2.733 = 00:00:02.733 = 0%
    User Time = 321.191 = 00:05:21.191 = 97%
    Process Time = 323.925 = 00:05:23.925 = 98%
    Global Time = 330.094 = 00:05:30.094 = 100%
    The compressed size of enwik8 is 20.4 MB (21,466,782 bytes) with 48MB block size.


    Same again, but with 95MB block size...

    D:\>viper bcm e96 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 95m block...
    ratio: 1.660 bpb
    done

    Kernel Time = 6.669 = 00:00:06.669 = 1%
    User Time = 331.206 = 00:05:31.206 = 57%
    Process Time = 337.875 = 00:05:37.875 = 58%
    Global Time = 581.045 = 00:09:41.045 = 100%
    The compressed size of enwik8 is now down to 19.7 MB (20,744,613 bytes) with 95MB block size.

    Note: Slower global time is mainly due to disk thrashing caused by excessive paging (only 512MB RAM).

  7. #7: Skymmer (Russia)
    No doubt it's good news! Thank you, Ilia!!!
    Gonna test it today.

  8. #8: nanoflooder (Saint Petersburg, Russia)
    Quote Originally Posted by encode:
    1024m? No way! It uses 5N!
    Lol, sorry for the misunderstanding: I meant using e1024. I have now checked the memory usage and, yes, it used one 95m block, hence around 475 MB of memory.

  9. #9: Mihai Cartoaje (Lévis, Canada)
    I don't think a BWT compressor can be practical:

    a) BWT is not good for executables.
    b) BWT compressors use blocks. With a stream compressor, for example, a web server could compress the first half of an HTML file at the same time the PHP interpreter is writing the second half.
    c) No option to keep broken files.

  10. #10: biject.bwts (Texas)

    Quote Originally Posted by encode:
    OK, a new release is here. Hardcore work. Anyway, check out what's new: • This is a complete rewrite of the compressor • The default block size is now 64 megabytes • Invented a new concept in BWT-output encoding. Still no cheating or tricks like M03

    Just curious: what happens if you use BWTS instead of BWT? Would it compress smaller?

    I was also wondering what the cheating and M03 stuff you're talking about is.

    Also, you did a great job. Way to go.

  11. #11: encode, The Founder (Moscow, Russia)

    Quote Originally Posted by nanoflooder:
    Lol, sorry for the misunderstanding: I meant using e1024. I have now checked the memory usage and, yes, it used one 95m block, hence around 475 MB of memory.
    New BCM allows any number as a memory parameter, but:
    • BCM will never use a block size larger than the file
    • If the number is too large or zero, BCM falls back to the default block size
    • Due to Windows limitations, the max block size is 512 MB, no matter how much RAM you have. The max allocation block is 2 GB, even under 64-bit Windows. BCM allocates two chunks of memory: one for the pointers, which need 4N, and one for the block itself. So, 512 MB * 4 = 2 GB! (See the sketch below.)
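    A minimal sketch of those block-size rules (illustrative only, not BCM's actual source; all names here are made up):

    Code:
    // Clamp the requested block size per the rules above: never larger
    // than the file, fall back to the 64 MB default on bad input, and
    // cap at 512 MB so the 4N pointer array fits one 2 GB allocation.
    #include <algorithm>
    #include <cstdint>

    const int64_t kDefaultBlock = 64LL << 20;   // 64 MB default
    const int64_t kMaxBlock     = 512LL << 20;  // 4N = 2 GB at 512 MB

    int64_t pickBlockSize(int64_t requested, int64_t fileSize) {
        if (requested <= 0 || requested > kMaxBlock)
            requested = kDefaultBlock;          // too large or zero
        return std::min(requested, fileSize);   // never exceed the file
    }
    // Total working memory is then about 5N: 4N for the 32-bit suffix
    // pointers plus N for the block buffer itself.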
    Quote Originally Posted by Mihai Cartoaje:
    I don't think a BWT compressor can be practical:

    a) BWT is not good for executables.
    b) BWT compressors use blocks. With a stream compressor, for example, a web server could compress the first half of an HTML file at the same time the PHP interpreter is writing the second half.
    c) No option to keep broken files.
    a) Relatively, and it depends on the actual executable. In some cases it may beat even LZMA at max settings. Anyway, yep, the overall compression is not as good as with texts and similar files.

    b) Not a problem. Use a smaller block size. Network traffic is organized into packets; streamed audio and media are organized into frames. So, the block size simply equals a frame or a packet. (A sketch follows.)
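    For illustration, a hedged sketch of that block-per-packet idea (zlib's compress2() stands in for any block compressor, BCM included; the 4-byte framing format is made up):

    Code:
    // Frame one block as [4-byte big-endian compressed length][payload],
    // so the receiver can decode each block as soon as it arrives,
    // without waiting for the rest of the stream.
    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    std::vector<uint8_t> frameBlock(const uint8_t* data, size_t n) {
        uLongf len = compressBound(n);
        std::vector<uint8_t> out(4 + len);
        if (compress2(out.data() + 4, &len, data, n,
                      Z_BEST_COMPRESSION) != Z_OK)
            return {};                           // parameter/memory error
        out.resize(4 + len);
        for (int i = 0; i < 4; ++i)              // big-endian length prefix
            out[i] = uint8_t(len >> (24 - 8 * i));
        return out;
    }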

    c) It's a fable about compressed data recovery. High compression is incompatible with any recovery: change one byte and you lose everything. Such a feature is added to archivers mostly to keep users calm... Malcolm Taylor (the author of WinRK) posted a nice message about data recovery some time ago. So, in practice, only data with small damage might be fixed, at the cost of (attention) 10...25% or more extra data. Worth it? It's about the same as sending or storing two copies of the compressed data and, in case of some damage, using the undamaged copy. Anyway, data storage and networks are too reliable these days to get broken files...

    Quote Originally Posted by biject.bwts:
    Just curious: what happens if you use BWTS instead of BWT? Would it compress smaller?

    I was also wondering what the cheating and M03 stuff you're talking about is.

    Also, you did a great job. Way to go.
    With no BWT index coded we might win one byte! Better to think about better modeling than about such eccentric ideas...

    A small preamble. The new BCM achieves about the same compression power on BWT data as PAQ8. BCM is even notably stronger on BWT data than the compression monsters of past times like PAQ6. That means BCM is close to the maximum BWT compression possible!

    Anyway, we have the M03 idea. It's not pure BWT by itself. AFAIK, it is much like CM that uses BWT structures to access the statistics. Most likely BWMonstr uses the same approach. Anyway, M03 and BWMonstr are way too slow: better to use pure CM in such a case, so we get both faster and stronger compression. In my opinion, the strength of BWT is in its speed and efficiency. Making it too slow will kill any of its advantages over CM or PPM.

    Anyway, BCM is already more CM than BWT. In BCM, the BWT is intended to help access higher-order statistics. One of the problems of CM/PPM is efficient access to high-order statistics: the model becomes larger, serious cache misses can't be avoided, we need more complex blending of all the available models, and if we go bitwise... all our performance drops. With BCM I use BWT and a small CM: the entire CM model fits in less than 300 KB! So it is extremely efficient in terms of caching. Yep, the model is really complex and computationally expensive, but I've done as many optimizations as possible. You would never believe how much stuff works inside BCM, how many sub-models... and how fast and memory-efficient it all is! Hardcore work, and such a result! Anyway, I will do some experiments with the M03 idea as well...
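    A toy sketch of the principle (illustrative only, nothing like BCM's real code): after the BWT, symbols that share long contexts become neighbors, so even a tiny, fast-adapting bitwise model, small enough to stay in L1 cache, picks up high-order statistics. A naive rotation-sort BWT is used for brevity:

    Code:
    // Naive O(n^2 log n) forward BWT over cyclic rotations; returns the
    // last column. A real compressor would use a fast suffix sort.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <numeric>
    #include <string>
    #include <vector>

    static std::string bwt(const std::string& s) {
        size_t n = s.size();
        std::vector<size_t> idx(n);
        std::iota(idx.begin(), idx.end(), 0);
        std::sort(idx.begin(), idx.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {
                unsigned char ca = s[(a + k) % n], cb = s[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return a < b;
        });
        std::string last(n, 0);
        for (size_t i = 0; i < n; ++i)
            last[i] = s[(idx[i] + n - 1) % n];
        return last;
    }

    // Estimated coded size (bits) under a fast-adapting order-0 bitwise
    // model: 256 counters walked as a bit tree, i.e. the whole model is
    // about a kilobyte. BWT output is just a permutation of the input,
    // so only the local clustering it creates makes it cheaper to code.
    static double modelBits(const std::string& data) {
        std::vector<int> p(256, 1 << 15);       // P(bit=1), scaled by 2^16
        double bits = 0;
        for (unsigned char c : data) {
            int ctx = 1;
            for (int k = 7; k >= 0; --k) {
                int bit = (c >> k) & 1;
                double pr = p[ctx] / 65536.0;
                bits -= std::log2(bit ? pr : 1.0 - pr);
                if (bit) p[ctx] += (65536 - p[ctx]) >> 5;  // fast adaptation
                else     p[ctx] -= p[ctx] >> 5;
                ctx = ctx * 2 + bit;
            }
        }
        return bits;
    }

    int main() {
        std::string s = "the quick brown fox jumps over the lazy dog. "
                        "the quick brown fox jumps over the lazy dog.";
        std::printf("plain: %.0f bits, bwt: %.0f bits\n",
                    modelBits(s), modelBits(bwt(s)));
    }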

    Thanks for reading!


  12. #12: Shelwien, Administrator (Kharkov, Ukraine)
    > So, 512 MB*4 = 2 GB!

    It might be worthwhile to do some research on chunked storage
    of BWT input - like allocating 1M blocks and remapping
    the indexes into actual offsets via a table.
    Like that it'd be possible to use block sizes up to ~610M
    with 5N BWT (not that blocks larger than filesize/2 are of any
    use, but just for the record).
    But the actual benefit would be fewer restrictions when used
    as part of some GUI framework (these commonly have
    weird memory fragmentation).
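    A hedged sketch of that chunked layout (hypothetical code; just the index remapping, nothing BCM- or rcbwt-specific):

    Code:
    // Keep the BWT input in independently allocated 1 MiB chunks and
    // resolve a global index through a small table: one extra indirection
    // per access, but no single huge allocation, so it also survives a
    // fragmented address space.
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    class ChunkedBuffer {
        static constexpr size_t kChunkBits = 20;            // 1 MiB chunks
        static constexpr size_t kChunkSize = size_t(1) << kChunkBits;
        std::vector<std::unique_ptr<uint8_t[]>> chunks_;
    public:
        explicit ChunkedBuffer(size_t n) {
            size_t count = (n + kChunkSize - 1) >> kChunkBits;
            for (size_t i = 0; i < count; ++i)
                chunks_.emplace_back(new uint8_t[kChunkSize]);
        }
        uint8_t& operator[](size_t i) {                     // remap index
            return chunks_[i >> kChunkBits][i & (kChunkSize - 1)];
        }
    };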

    Then another possibility is a multi-pass BWT (like what
    I did in rcbwt) - filtering out an interval of indexes
    on each pass, and sorting them.
    It's kinda similar to the usual blockwise BWT (in speed too),
    but instead of a 500M data buffer + 2G index buffer for enwik9,
    we'd use a 1G data buffer + 2G index (2861M, should fit with /3GB).

    > Relatively, and it depends on the actual executable

    More specifically, it depends on the preprocessing method.
    With a disasm32-style preprocessor the compression would
    be pretty good.

    > Anyway, data storage and networks are too reliable
    > these days to get broken files

    1. It's wrong. Just try to download some large volume of data
    (like rainbow tables) via HTTP or FTP.
    Anyway, it's quite probable to get an error within 100G of data,
    as the IP packet checksum is only 16-bit.

    2. Rar's recovery saved me a few times when uploading video to
    a server - you only need filesize/sectorsize*crcsize + sectorsize
    extra data to fix any single broken sector.
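    Plugging hypothetical numbers into that formula (a 700 MB file, 512-byte sectors, a 4-byte CRC per sector) shows how small the overhead is:

    Code:
    // Per-sector CRCs locate the broken sector; one extra sector of
    // parity is enough to rebuild it.
    #include <cstdio>

    int main() {
        long long filesize = 700LL * 1000 * 1000;   // example file
        long long sector = 512, crc = 4;
        long long extra = filesize / sector * crc + sector;
        std::printf("%lld bytes extra (%.2f%%)\n",
                    extra, 100.0 * extra / filesize);  // ~5.5 MB, ~0.78%
    }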

    Though it's certainly in no way related to BWT - error correction
    methods are applicable to any kind of data.

  13. #13: osmanturan, Programmer (Mersin, Turkiye)
    Quote Originally Posted by encode:
    With no BWT index coded we might win one byte! Better to think about better modeling than about such eccentric ideas...
    Well... BWTS+[start index] is not BWT. BWTS has a slightly different permutation and thus usually makes files a few KiB smaller, not only 4 bytes (= the start index).
    BIT Archiver homepage: www.osmanturan.com

  14. #14: Moderator (Tristan da Cunha)

    MC SFC test...

    A10.jpg > 824,677
    AcroRd32.exe > 1,562,601
    english.dic > 1,164,850
    FlashMX.pdf > 3,735,285
    FP.LOG > 557,226
    MSO97.DLL > 1,889,122
    ohs.doc > 823,380
    rafale.bmp > 753,524
    vcfiu.hlp > 640,574
    world95.txt > 466,250

    Total = 12,417,489 bytes

  15. #15: encode, The Founder (Moscow, Russia)
    Don't forget that on some files a smaller block is better...

  16. #16: giorgiotani, Programmer (Italy)
    Results are impressive, keep up the good work!

  17. #17: Michael Maniscalco, Programmer (Boston, Massachusetts, USA)
    Quote Originally Posted by encode:
    Anyway, we have the M03 idea. It's not pure BWT by itself. AFAIK, it is much like CM that uses BWT structures to access the statistics. Most likely BWMonstr uses the same approach. Anyway, M03 and BWMonstr are way too slow...
    M03 isn't slow at all. It's mij4x's implementation which is slow. I have it running about as fast as M99 right now, and in 5N space as well. This is for boundless-order context modeling. Limited order is really not much faster, as most of the time is spent at the lower orders anyhow.

    If I could only find enough time to put the encoder from M99 into the modeling of M03, I would be able to release it! But work on the house never ends...

    Anyhow, M03 deals strictly with modeling the context boundaries contained within the BWT. There's still lots of room for improvements in how to encode the statistics generated from that modeling.

    I'll see if I can make time in the next week or two to get the fast M03 posted.

    - Michael Maniscalco

  18. #18: Skymmer (Russia)
    A couple of tests between v0.08 and v0.07

    Code:
                  | encode time |     size    | decode time |
    --------------------------------------------------------
                  |         PAK0.PAK (302 895 669)          |
    --------------------------------------------------------
    0.07 -b65536  |  181.813    | 117 053 060 |  135.813    |
    0.08 e64      |  227.422    | 116 587 092 |  185.234    |
    --------------------------------------------------------
                  |          ENWIK8 (you know :) )          |
    --------------------------------------------------------
    0.07 -b102400 |   61.265    |  20 770 673 |   45.109    |
    0.08 e100     |   77.828    |  20 744 613 |   61.172    |
    --------------------------------------------------------
                  |         NKOEF.DBF (151 585 377)         |
    --------------------------------------------------------
    0.07 -b32768  |   71.735    |   2 942 131 |   59.453    |
    0.08 e32      |  102.531    |   2 912 917 |   87.000    |
    --------------------------------------------------------

  19. #19: nanoflooder (Saint Petersburg, Russia)
    Having seen that ^ I would say it makes sense to have both versions ready to run. The time difference is quite notable here...

  20. #20: encode, The Founder (Moscow, Russia)
    Results on my granny Core2Duo with ENWIK8:

    BCM007 -> 20,770,673 bytes, comp. 51 sec, decomp. 39 sec
    BCM008 -> 20,744,613 bytes, comp. 63 sec, decomp. 50 sec



    The compression gain on ENWIK8 and the other ENWIKs is not that impressive. But in some cases, some very important cases, the new BCM v0.08 really strikes!
    Anyway, I have a nice lightweight version of BCM's CM with really nice efficiency that might be included in some BCM version. But currently, my goal is the MAX compression possible!
    Now I'm playing with M03 ideas and some other ideas that may increase not only the compression power but the processing speed as well...


  21. #21: Skymmer (Russia)
    Quote Originally Posted by encode:
    c) It's a fable about compressed data recovery. High compression is incompatible with any recovery: change one byte and you lose everything. Such a feature is added to archivers mostly to keep users calm... Malcolm Taylor (the author of WinRK) posted a nice message about data recovery some time ago. So, in practice, only data with small damage might be fixed, at the cost of (attention) 10...25% or more extra data. Worth it?
    From my point of view you're a little bit pessimistic in rating data recovery like that. I just made a little test. I took one album in MP3 format (320 kbit, 69:56 length, 167 712 768 bytes) and packed it into a RAR archive (Best, Solid, 1% recovery data). The resulting size of the RAR archive is 167 998 135 bytes, where the recovery data takes only 998 912 bytes. Then I zero-filled a 204 800-byte block somewhere in the middle with WinHEX. A 204 800-byte block emulates 100 consecutive unread sectors of a CD/DVD medium. RAR successfully repaired the archive. I think 204 800 bytes is not such a small amount of data, and 998 912 bytes of recovery data is not too much. Well, 1% only. So recovery data can be very helpful and not too burdensome at the same time. Just my point of view.

  22. #22: Matt Mahoney, Expert (Melbourne, Florida, USA)
    Tested on LTCB. http://cs.fit.edu/~mmahoney/compression/text.html#1694

    I could test block sizes up to 370 MB before getting an "out of memory" error.

    enwik9 e370 -> 171,891,509 (3 unequal sized blocks)
    enwik9 e318 -> 171,932,395 (3 equal sized blocks)

  23. #23: Shelwien, Administrator (Kharkov, Ukraine)
    Code:
    Q6600 3.3Ghz, WinXP SP3, ramdrive
    
    comp.size enc.time dec.time
     20744613  49.516s  38.547s // bcm008 e477 enwik8
     20677205  49.906s  38.968s // reorder + bcm008
    169179098 545.563s 418.390s // bcm008 e477 enwik9
    168694909 548.297s 422.625s // reorder + bcm008

  24. #24: encode, The Founder (Moscow, Russia)
    Big respect to Matt Mahoney and Shelwien! Boyakasha!

  25. #25: encode, The Founder (Moscow, Russia)
    Some results with a blazing-fast CM variant:

    book1

    order-0+simple SSE -> 215,730 bytes
    order-0+interpolated SSE -> 214,738 bytes

    Very interesting, taking into account the processing speed!


  26. #26: Shelwien, Administrator (Kharkov, Ukraine)
    Considering "blazing fast", it might be interesting to estimate
    CPU clocks per data byte (via TSC) in your postcoder versions,
    and compare to e.g. LZSS decoding.
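    A minimal sketch of such a measurement (assumes an x86 compiler with __rdtsc(); the harness itself is hypothetical):

    Code:
    // Read the time-stamp counter around the coding loop and divide by
    // the number of bytes processed to get clocks per byte.
    #include <cstddef>
    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc()

    template <class F>
    double clocksPerByte(F&& process, const uint8_t* data, size_t n) {
        uint64_t t0 = __rdtsc();
        process(data, n);        // e.g. a postcoder or LZSS decode loop
        uint64_t t1 = __rdtsc();
        return double(t1 - t0) / double(n);
    }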

  27. #27: encode, The Founder (Moscow, Russia)
    And after that, compare the compression ratio!

  28. #28: encode, The Founder (Moscow, Russia)
    I'm just thinking about simplifying BCM, or at least keeping v0.07's complexity. I have a few great ideas to check...

  29. #29: Matt Mahoney, Expert (Melbourne, Florida, USA)
    Quote Originally Posted by Shelwien:
    Code:
    Q6600 3.3Ghz, WinXP SP3, ramdrive
    
    comp.size enc.time dec.time
     20744613  49.516s  38.547s // bcm008 e477 enwik8
     20677205  49.906s  38.968s // reorder + bcm008
    169179098 545.563s 418.390s // bcm008 e477 enwik9
    168694909 548.297s 422.625s // reorder + bcm008
    Updated http://cs.fit.edu/~mmahoney/compression/text.html#1687

  30. #30: Shelwien, Administrator (Kharkov, Ukraine)
    Optimized the permutation with full enwik8:
    Code:
    Q6600 3.3Ghz, WinXP SP3, ramdrive
    
    comp.size enc.time dec.time
     20665536  50.078s  38.782s // bcm008 e477 enwik8r
    168598121 552.984s 420.454s // bcm008 e477 enwik9r


