Results 1 to 26 of 26

Thread: BCE v0.4

  1. #1
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts

    BCE v0.4

    Hello,

    I'm proud to announce the release of a new version of my compressor BCE
    While BCE v0.2 and v0.3 were mostly research compressors, I've spend
    some time to make BCE v0.4 a more reasonable compressor.

    The main differences to BCE v0.3 are:
    1. Improved performance (mostly by reducing cache misses)
    2. Improved stability (tested against Calgary, Silesia and enwik8/9)
    3. Improved compression ratio using an adaptive coder rather than the old uniform (still available in the code)
    4. Multiple coders allow for up to 8 threads at the encoding stage synchronizing after each order (OpenMP)
    5. libdivsufsort instead of sais (times are now strongly influenced by bwt/unbwt)

    Benchmark results:
    Code:
    File / Corpus | Compressed size | Comp. Time (s) | Decomp. Time(s) | Mem (KB)  | v0.3 Times
    enwik8        |      20.926.428 |           20.8 |            22.7 |   422.668 |   100 / 188
    enwik8.drt    |      20.722.282 |           12.2 |            13.5 |   274.208 |
    enwik9        |     164.648.620 |          253.6 |           275.6 | 3.394.480 | 1.151 / 2.444
    enwik9.drt    |     164.264.278 |          170.2 |           180.2 | 1.941.336 |
    text8         |      19.603.636 |           20.8 |            22.7 |   430.132 |
    Calgary       |         835.876 |           ~9.8 |            10.5 |      -    |
    Silesia       |      47.576.935 |           74.9 |            92.6 |      -    |
    
    *.drt times don't include time for DRT
    Machine: Core i7-4770K, 8 GB DDR3, Samsung 840Pro 128 GB, Fedora 22 64 bit in VBox (Host: Win 10, 6 GB RAM available), gcc 5.3.1
    Memory for compression/decompression is 5N but I've made available the old
    slow bitwise unbwt (option: -ds) for low memory PCs. The memory this needs is
    given as Mem (KB). It equals the memory the coding stage uses.

    I've made the code available at https://github.com/akamiru/bce.
    Maybe someone can help me build an Windows executable. Feel free
    to contact me if you find any bug/improvements/problems.

    The original thread about BCE can be found here:
    http://encode.ru/threads/2150-A-new-...order-n-models

    Thanks for your attention,
    Christoph

    @Matt Mahoney: I would be very thankful if you could update LTCB and Silesia ! If you have memory problems
    encoding enwik9 I could upload an compressed archive so maybe you could decode it with the -ds option.
    Last edited by Christoph Diegelmann; 18th June 2016 at 15:03. Reason: Wrong size for Calgary and wrong format of Silesia

  2. The Following 9 Users Say Thank You to Christoph Diegelmann For This Useful Post:

    Bulat Ziganshin (31st May 2016),comp1 (31st May 2016),Cyan (31st May 2016),hexagone (31st May 2016),jibz (1st June 2016),load (31st May 2016),lorents17 (31st May 2016),Stephan Busch (5th June 2016),Turtle (31st May 2016)

  3. #2
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    i compiled it for windows (x64), but got a bunch of warnings, and in the end bce.exe crashes when i try to build an archive.

    Problemereignisname: APPCRASH
    Anwendungsname: bce.exe
    Anwendungsversion: 0.0.0.0
    Anwendungszeitstempel: 574d8af4
    Fehlermodulname: bce.exe
    Fehlermodulversion: 0.0.0.0
    Fehlermodulzeitstempel: 574d8af4
    Ausnahmecode: c0000005
    Ausnahmeoffset: 0000000000032e93
    Betriebsystemversion: 6.1.7601.2.1.0.256.1
    someone with more experience is needed i'm afraid.

  4. The Following User Says Thank You to load For This Useful Post:

    Christoph Diegelmann (31st May 2016)

  5. #3
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Yeah I tried compiling using mingw and got a lot of warnings (long unsigned seems to be 32 bit on windows but 64 bit on linux so printf("%lu"); gives warnings).
    But sadly I wasn't able to find out why compression crashes. I always used the fixed bit types (uint64_t/uint32_t) so this shouldn't be the problem.

  6. #4
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I don't know if it helps, but here it crashes in bce.cpp line 82 with index = 0xFFFFFFFF.

    Code:
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 3428.0x5c4]
    0x0000000000419123 in Rank::get<1> (this=0x6afa28, index=4294967295)
        at bce.cpp:82
    82            auto rank = rank_[index / 32] & (-1llu >> (32 - index % 32));
    (gdb) bt
    #0  0x0000000000419123 in Rank::get<1> (this=0x6afa28, index=4294967295)
        at bce.cpp:82
    #1  0x0000000000402f67 in BCE<AdaptiveCoder<255>, unbwt::bytewise>::code ()
        at bce.cpp:1266
    (mingw-w64 64-bit compile)

  7. The Following User Says Thank You to jibz For This Useful Post:

    Christoph Diegelmann (31st May 2016)

  8. #5
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I made a couple of PR to hopefully help with the issues on Windows, attached is a Win64 build.
    Attached Files Attached Files

  9. The Following User Says Thank You to jibz For This Useful Post:

    Christoph Diegelmann (1st June 2016)

  10. #6
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    @jibz

    sorry for off topic, but how did you manage to compile a static divsufsort lib? i only got the divsufsort.dll.a
    i'm not familiar with cmake at all.

  11. #7
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    Quote Originally Posted by load View Post
    sorry for off topic, but how did you manage to compile a static divsufsort lib? i only got the divsufsort.dll.a
    i'm not familiar with cmake at all.
    If you look through the CMakeLists.txt file, there is a section with options. One of them is BUILD_SHARED_LIBS, which defaults to ON. You can either change these in the CMake gui, or pass options on the command line.

    I used:

    Code:
    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF -DBUILD_EXAMPLES=OFF -DUSE_OPENMP=ON

  12. The Following User Says Thank You to jibz For This Useful Post:

    load (1st June 2016)

  13. #8
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    ah, thanks. forgot about the gui.

    although cmake has its advantages, i don't really like it for small projects, it generates too many files and the makefiles really are bloated.
    but i see, usually looking at the CMakeLists.txt is enough and one doesn't have to worry about the other files cmake generates.

    it is working now nicely after editing options section.

  14. #9
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    Quote Originally Posted by jibz View Post
    Win64 build.
    thank you very much for making a win64 - binary, but something goes wrong -- or is there a limit for the inputfile?

    ---
    c:\0_COMPR-2016\PGM>bce -c ..\result\lab-bce ..\dta\backup.dat
    BCE v0.4 Release
    Copyright (C) 2016 Christoph Diegelmann
    This is free software under GNU Lesser General Public License. See <http://www.gnu.org/licenses/lgpl>

    Compressed from 18253824 B -> 5151328 B in 40.6 s
    ---
    c:\0_COMPR-2016\PGM>bce -d backup.dat lab-bce.bce
    BCE v0.4 Release
    Copyright (C) 2016 Christoph Diegelmann
    This is free software under GNU Lesser General Public License. See <http://www.gnu.org/licenses/lgpl>

    Decompressed from 5151328 B -> 18253824 B in 41.0 s
    ---
    it runs without an error-message

    the program correctly compresses 18253824 bytes to 5151328 bytes
    and decompresses correctly 5151328 bytes to 18253824 bytes

    BUT the inputfile backup.dat has real 64.442.763.264 bytes and not 18253824 bytes ...

    it seems the program READ simply ONLY THE FIRST 18253824 bytes from my inputfile ...

    why?

    best regards

  15. #10
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    A short day afk and a lot of messages to replay to thank you very much for your interest !

    @jibz
    I made a couple of PR to hopefully help with the issues on Windows, attached is a Win64 build.
    Thank you very much. I thought it would be only a small change but finding help is always great and saved me a lot of time.
    I accepted your pull requests at github

    @load
    sorry for off topic, but how did you manage to compile a static divsufsort lib? i only got the divsufsort.dll.a
    i'm not familiar with cmake at all.
    Feel free to talk about this topic here ! I'm very interessted myself

    @joerg
    Thanks for your interested again ! BCE currently uses a uint32_t internally (18253824 is simply the overflow) and
    I didn't test files with more than 1 GB because I don't have enough RAM for it.

    I will try refactoring BCE to support blocksizes and therefore bigger files. This fits the github issue of @nemequ to
    provide an API. If you want quick results you might want to try splitting the file in 1 GB each and compress
    them one at a time.

  16. #11
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    Since 64.442.763.264 mod 4 GB = 18253824, I am guessing 32-bit values are used for the file size -- looking at the source, the File class for example uses uint32_t.

    I think we will need Christoph to figure out if it is possible to support files larger than 4 GB.

    Edit: had that one sitting in the browser for too long before I pressed post, sorry!

  17. #12
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    The uint32_t limit is mostly a relict from the basic BCE being a whole file algorithm (for enwik9) and suffix sorting algorithms being limited to int32.
    With an API this relict will vanish and BCE will become a block based compressor

  18. #13
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I tested out the win64 build and I can now report better times for enwik9 due to the fact that now the working memory fits into RAM (the virtual machine was limited to ~6GB RAM)

    Code:
    File / Corpus | Compressed size | Comp. Time (s) | Decomp. Time(s) | Mem (KB)  | v0.3 Times
    enwik9        |     164.648.620 |          197.5 |           215.6 | 3.394.480 | 1.151 / 2.444

  19. #14
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I was able to improve upon compression (already merged at github):

    Code:
    File / Corpus | Compressed size | Comp. Time (s) | Decomp. Time(s) | Mem (KB)  | v0.3 Times
    lpqdict0.dic  |         158.304 |            0.4 |             0.5 |         - |
    enwik8        |      20.835.184 |           20.1 |            21.7 |   422.668 |   100 / 188
    enwik8.drt    |      20.625.244 |           11.6 |            12.5 |   274.208 |
    enwik9        |     164.211.574 |          197.5 |           215.6 | 3.394.480 | 1.151 / 2.444
    enwik9.drt    |     163.819.188 |          123.3 |           133.4 | 1.941.336 |
    text8         |      19.509.398 |           20.5 |            21.7 |   430.132 |
    Calgary       |         826.552 |           ~9.8 |            10.5 |      -    |
    Silesia       |      47.301.187 |           74.9 |            92.6 |      -    |
    
    *.drt times don't include time for DRT ans space for lpqdict0.dic
    Machine: Core i7-4770K, 8 GB DDR3, Samsung 840Pro 128 GB, Win 10
    Last edited by Christoph Diegelmann; 18th June 2016 at 15:00.

  20. #15
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    Updated win64 build.
    Attached Files Attached Files

  21. The Following 3 Users Say Thank You to jibz For This Useful Post:

    Christoph Diegelmann (4th June 2016),comp1 (4th June 2016),Stephan Busch (5th June 2016)

  22. #16
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    Quote Originally Posted by jibz View Post
    Updated win64 build.
    thank you for the update.

    but sadly not working for me (windows 10 64bit):
    ---
    Microsoft Windows [Version 10.0.10586]
    (c) 2015 Microsoft Corporation. Alle Rechte vorbehalten.

    c:\0_COMPR-2016\PGM>test-bce

    c:\0_COMPR-2016\PGM>bce -c ..\result\lab-bce ..\dta\backup.dat
    BCE v0.4 Release
    Copyright (C) 2016 Christoph Diegelmann
    This is free software under GNU Lesser General Public License. See <http://www.gnu.org/licenses/lgpl>

    terminate called after throwing an instance of 'std::bad_alloc'
    what(): std::bad_alloc

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    ---
    best regards

  23. #17
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Thanks again for testing joerg. Could you upload the file that crashes bce (if it's not too big) so I can debug it? Or does it crash with every file?
    Are you sure to have enough memory available (5 * size of the file) ?

    I've tried jibz' build with enwik8 and enwik9 and it worked for me.

  24. #18
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    thank you very much for your quick answer
    and yes it seems - i do not have enough RAM

    May be in the next update you can implement a short test
    and implement an error message "not enough memory" ?

    how many RAM do you have in your computer ?

  25. #19
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I attempted to compile this version with large file support, so I think it is likely to give an allocation error instead of just compressing the file size modulo 2^32 (due to the allocation at the start of the File constructor).

  26. #20
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I'll open an issue for that on github so the next patch will feature a better error message.

    I physically have 8GB DDR3 in my machine but I program in a VM which only has up to 6GB available.

  27. #21
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I managed to get bce to compile with Visual C++ 2015 -- there is a WiP branch here for anyone interested: https://github.com/jibsen/bce/tree/msvc-support

    Edit: Some build instructions (these should work with both MSVC and mingw-w64 and any generator you like). To build libdivsufsort, something along the lines of:

    Code:
    mkdir build
    cd build
    cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=NO -DBUILD_EXAMPLES=NO -DUSE_OPENMP=YES ..
    cmake --build . --config Release
    And then to build bce (put the path to the libdivsufsort build folder where mentioned):

    Code:
    mkdir build
    cd build
    cmake -Ddivsufsort_ROOT_DIR=path_to_divsufsort_build_folder -DCMAKE_BUILD_TYPE=Release ..
    cmake --build . --config Release

  28. The Following User Says Thank You to jibz For This Useful Post:

    Christoph Diegelmann (8th June 2016)

  29. #22
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Thanks again ! I see you really put some afford into bce and I just wanted to ask if the code is clear enough to you or if I should add some comments to explain what it actually does ?

  30. #23
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I must admit I haven't spent much time trying to understand how the code actually works, but rather just attempted building bce and fixing whatever issues the compiler reported.

    It was a good opportunity to learn more about CMake and writing find modules.

    I am not entirely sure about the changes for MSVC compatibility -- or if that is even something worth adding -- but it was fun working on and I learned a few new things along the way. If you are interested in it, let me know and I'll open a PR.

  31. #24
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I've had a look at your changes and they seem to check out just fine please open a PR so I can merge them

  32. #25
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I have uploaded a new version to the clean-up branch.
    See https://github.com/akamiru/bce/tree/clean-up

    It includes an API and supports chunks but the chunk size is hardcoded.
    Currently the chunk size is 2^30 Byte so it's using up to 5 GB memory.

    The API has a lot of usefull options and I'm not quiet sure how to expose all/most of them
    so I'll merge with main as soon as the front end has a higher quality.

  33. #26
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Interessting news: Yesterday I found the article
    "Lossless Data Compression via Substring Enumeration" by D. Dubé (2010)

    In short: I now know that bce is a practical implementation of Compression via Substring Enumeration.
    Moreover bce v0.3 and v0.4 implement what is described in
    "Improving Compression via SubstringEnumeration by Explicit Phase Awareness" by M. Béliveau and Danny Dubé (2014)

    I already had the idea that bce is closely related to compression via antidictionaries and this topic is covered in
    "Relationship between Antidictionary Automata andCompacted Substring Automata" by T. Ota and H. Morita (2013)

    Another quiet interesting article would be
    "Efficient Implementation and Empirical Evaluation of Compression by Substring Enumeration" by S. Kanai, H Yokoo, K. Yamazaki and H. Kaneyasu (2016)

    Because from the abstract it looks like it describes what I described in my first thread on bce in 2015. But sadly I can't find a publicly avaible version of it

    As bce already was preceeded by M03 (which does pretty much the same but not on a binary alphabet) it seems the wheel got invented at least 3 times

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •