
Thread: Alba

  1. #1
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    144
    Thanks
    29
    Thanked 58 Times in 34 Posts

    Alba

    I made a new BPE (byte pair encoding) compressor named Alba.
    Text compression is better.
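
For readers unfamiliar with BPE, here is a minimal sketch of classic byte pair encoding on a single block. It is an illustration only, not Alba's actual algorithm or file format: the function names, the `min_count` threshold, and the in-memory pair table are all assumptions made for this example.

```python
from collections import Counter


def bpe_compress(block: bytes, min_count: int = 3):
    """Classic BPE on one block (illustrative; not Alba's format).

    Returns (pair_table, data), where pair_table lists
    (code, left, right) tuples in the order they were created.
    """
    data = list(block)
    present = set(data)
    # Byte values that never occur in the block can serve as pair codes.
    unused = [b for b in range(256) if b not in present]
    table = []
    while unused:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < min_count:
            break  # replacing rarer pairs would not pay for the table entry
        code = unused.pop()
        out, i = [], 0
        while i < len(data):
            if i + 1 < len(data) and data[i] == left and data[i + 1] == right:
                out.append(code)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
        table.append((code, left, right))
    return table, bytes(data)


def bpe_decompress(table, data: bytes) -> bytes:
    """Undo bpe_compress by expanding codes in reverse creation order."""
    out = list(data)
    for code, left, right in reversed(table):
        expanded = []
        for b in out:
            if b == code:
                expanded.extend((left, right))
            else:
                expanded.append(b)
        out = expanded
    return bytes(out)
```

Because codes are taken from byte values absent in the block, a block that already uses all 256 values cannot be compressed this way, which is one reason BPE compressors work block by block.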

  2. #2
    Member

    update

    Added a new compression mode.
    Last edited by xezz; 5th February 2014 at 09:37.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 774 Times in 484 Posts
    I tested on LTCB and Silesia. The original compression modes c and c32768 work, but with C the decompressed output is not identical to the input on enwik9 and some of the Silesia files. The decompressed files are also the wrong size by a few bytes. I posted benchmark results for the modes that produced correct output.

    http://mattmahoney.net/dc/silesia.html
    http://mattmahoney.net/dc/text.html#5269
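
A round-trip check of the kind described in this post can be sketched as follows. This is not the procedure actually used for the benchmarks. The function names are hypothetical, and SHA-256 is simply one reasonable choice of digest.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_round_trip(original_path, restored_path):
    """Return (ok, message) comparing the original with the decompressed file."""
    size_a = os.path.getsize(original_path)
    size_b = os.path.getsize(restored_path)
    if size_a != size_b:
        return False, f"size mismatch: {size_a} vs {size_b} bytes"
    if file_digest(original_path) != file_digest(restored_path):
        return False, "same size but different content"
    return True, "identical"
```

Checking the size first gives a cheap early failure. The digest comparison then catches the case where the size is right but the bytes differ.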

  5. #4
    Member
    Thanks, Matt! I have updated it now.

  6. #5
    Expert
    Matt Mahoney's Avatar
    It works now. I updated Silesia and LTCB.

  7. #6
    Member
    Thanks, Matt!
    I made a new version.

  8. #7
    Expert
    Matt Mahoney's Avatar
    Tried it out on the Silesia corpus. Looks like C compresses better than e most of the time.

    Code:
      Silesia dicke mozil   mr   nci ooff  osdb reym samba  sao webst x-ray  xml Compressor -options
    --------- ----- ----- ---- ----- ---- ----- ---- ----- ---- ----- ----- ---- -------------------
    103224331  5496 25622 5163  6435 4447  8690 2484  8909 7239 18705  8468 1559 alba 0.2 C
    103232868  5495 25625 5163  6433 4447  8690 2486  8918 7239 18702  8468 1562 alba 0.2 e
    103273433  5497 25638 5163  6436 4450  8700 2487  8917 7239 18712  8468 1562 alba 0.2 c
    I wonder if there is a way to automatically optimize block size.

    Edit: updated LTCB. http://mattmahoney.net/dc/text.html#5265
    e compresses enwik8 a little better than C, but a little worse on enwik9. Also on enwik9, compression crashes (free(): invalid pointer. Core dumped) when done but still produces an output file that decompresses correctly with no errors. This was compiling from source with gcc 4.8.1 -O3 in 64 bit Linux.
    Last edited by Matt Mahoney; 6th February 2014 at 23:41.

  9. #8
    Member
    Quote:
      I wonder if there is a way to automatically optimize block size.
    I have an idea that uses a suffix tree or an enhanced suffix array,
    but the memory requirement is too large.
    A simpler method is flzp-like selection.
    Quote:
      compression crashes (free(): invalid pointer. Core dumped)
    Sorry, I have updated it. The next header size estimation is also optimized.
    Last edited by xezz; 7th February 2014 at 17:31.
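
One brute-force answer to the block size question in this exchange is to try several fixed block sizes and keep the one with the smallest total output. The sketch below does only that. It is not the suffix-tree idea or the flzp-like selection mentioned above, and it is not how Alba chooses blocks. `zlib.compress` stands in for the real block compressor, and the 4-byte per-block header is an assumed cost.

```python
import zlib


def compressed_size(data: bytes, block_size: int,
                    compress=zlib.compress, header_bytes: int = 4) -> int:
    """Total output size when data is split into fixed-size blocks.

    header_bytes is an assumed per-block overhead (for example a length field).
    """
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += header_bytes + len(compress(block))
    return total


def best_block_size(data: bytes, candidates, compress=zlib.compress) -> int:
    """Return the candidate block size giving the smallest total output."""
    return min(candidates, key=lambda size: compressed_size(data, size, compress))
```

Trying every candidate costs one full compression pass per block size, so in practice the search might be run on a sample of the input rather than the whole file.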

  10. #9
    Member

    v0.3

    alba 'E' gives the best compression.
    Last edited by xezz; 11th February 2014 at 06:16.

  11. #10
    Expert
    Matt Mahoney's Avatar
    I tested alba 0.3 on the Silesia corpus compressing with E but it crashed during decompression on most of the files. Tested with the supplied alba.exe in 32 bit Vista.

  12. #11
    Member
    Oh my god... I have already developed the next version.
    The same bug probably lives in v0.4.
    Sorry, please wait for now.

  13. #12
    Member
    E has been renamed to e, because the old e was not good.
    c and C have been modified.

  14. #13
    Member
    This version implements dynamic block encoding.
    I have corrected a bad calculation in it.
    Last edited by xezz; 18th February 2014 at 18:17.

  15. #14
    Expert
    Matt Mahoney's Avatar
    alba cd enwik8 - decompression did not verify although the size was right. I tested alba-0.5.1 in 64 bit Ubuntu, both alba.exe in Wine and compiling from source with gcc 4.8.1 -O3

  16. #15
    Member
    Thanks, Matt! Bug fixed.

  17. #16
    Expert
    Matt Mahoney's Avatar

  18. #17
    Member
    Implemented a second BPE pass. A smaller block size works better for it.
    The maximum block size for dynamic block encoding is 512KB.
    But 'd' still needs improvement.
    Code:
    // normal BPE + optimal BPE
    alba c infile outfile +C
    // optimal BPE(dynamic block, max 512KB) + optimal BPE(blocksize:1024)
    alba Cd512k infile outfile +C1024
    
    alba Cd enwik8 out +C1280
    size: 100000000 to 51238011
    Last edited by xezz; 14th March 2014 at 12:49.
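
To put the enwik8 figure reported in this post in more familiar units, the short calculation below converts the two sizes into a compression ratio and bits per input byte. The helper name is made up for this example. Only the two sizes come from the post.

```python
def compression_stats(original_bytes: int, compressed_bytes: int):
    """Return (ratio, bits_per_input_byte) for a compression result."""
    ratio = compressed_bytes / original_bytes
    bits_per_byte = 8 * ratio
    return ratio, bits_per_byte


# Sizes reported for: alba Cd enwik8 out +C1280
ratio, bpc = compression_stats(100_000_000, 51_238_011)
print(f"{ratio:.2%} of original size, {bpc:.2f} bits per input byte")
```

That works out to roughly 51.2% of the original size, or about 4.1 bits per input byte.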

  19. #18
    Member
    v0.7: bug fixed. I also made v0.8.


    Code:
    // The 2nd BPE encodes the 1st BPE data without the pair table.
    alba c+c in out
    // The 2nd BPE encodes all of the 1st BPE data. It's slower than 'c+c'.
    alba c,c in out

    // e[0-2] is the encode level. It's for 'd' only. No need for 2nd BPE.
    alba cd:e2 in out

    // c[0-512] is the max count of block spread. It's for 'd' only.
    alba cd:c15 in out

    // Example for a large file
    alba Cd512k,C1280 in out
    alba Cd512k:e2 in out
    Last edited by xezz; 15th March 2014 at 12:32.
