
Thread: M99

  1. #1
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts

    M99

    Recently I have been describing M99 (and eventually M03) to a fellow named Arto via email (he's working on his PhD thesis on 'interpolative coding', which it turns out has a lot in common with M99).

    Anyhow, I sketched up a couple of illustrations of the encoding/decoding process, decided that they were clear enough (I think), and posted them on my site. I'm posting a link here so that there is a link to this old BWT encoding algorithm (has it really been 18 years?!). Also, I'll be sketching up how M03 works in the same thread, so if anyone is interested in that, M99 is the place to start. It's not really that complex, but maybe it will help somebody.

    http://www.michael-maniscalco.com/blog.htm

    - Michael

  2. The Following 6 Users Say Thank You to michael maniscalco For This Useful Post:

    encode (3rd August 2017),hexagone (2nd August 2017),Mike (2nd August 2017),RamiroCruzo (2nd August 2017),Samantha (3rd August 2017),Shelwien (2nd August 2017)

  3. #2
    Programmer michael maniscalco
    I was able to find enough time to write up a clean implementation of M99 and tweak it a bit for speed. I don't plan on spending much more time on this algorithm as it's fairly old; however, it is still very good even by modern standards, and much faster than almost any other BWT coder as well. I think it's quite possible to write a specialized version of M99 to deal specifically with DCT blocks, which would be very fast and highly efficient. At some point I might put together a proof of concept.

    Source code for M99 (using MSufsort 4a for BWT) is now posted at:

    https://github.com/michaelmaniscalco/m99


    - Michael

  4. The Following 2 Users Say Thank You to michael maniscalco For This Useful Post:

    Mike (9th September 2017),Shelwien (9th September 2017)

  5. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Is "MSufsort 4a" different from https://github.com/michaelmaniscalco/msufsort ?

  6. #4
    Programmer michael maniscalco
    Quote Originally Posted by Shelwien View Post
    Is "MSufsort 4a" different from https://github.com/michaelmaniscalco/msufsort ?
    Yeah, same one. I've got some more work done on MSufsort 4 as well, but that work isn't complete yet. I've figured out how to do a recursive Improved Two Stage in 5N space (I think), so I've still got to finish that before I can release the next update to MSufsort.

    Mostly, I published the M99 code because I was putting together a working version for somebody working on their PhD who wanted to understand how it worked, so I figured I would make it public. I'm pretty pleased with the performance of this old algorithm after giving it a bit of a cleanup. Some timings (excluding BWT time) from my old 'Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz' (single threaded):

    enwik8    44 MB/sec encode,   41 MB/sec decode    100,000,000 -> 22,275,155
    webster   56.5 MB/sec encode, 49.5 MB/sec decode   41,458,703 -> 6,876,020
    pic       122 MB/sec encode,  122 MB/sec decode       513,216 -> 47,089
    book1     33 MB/sec encode,   25 MB/sec decode        768,771 -> 225,765

    It was sort of fun to re-write this code from scratch. I did all sorts of stuff I would never do in the 'real' world such as:

    Instead of:

    auto x = (symbolA <= symbolB) ? countA : 0;

    I wrote the following to get rid of the stall introduced by the ? operator:

    auto x = ((-(symbolA <= symbolB)) & countA);

  7. #5
    Administrator Shelwien
    As to ?:, it's actually pretty safe to use these days: https://godbolt.org/g/nyPGRH
    Only if() would actually generate branches.

    Btw, do you have any ideas for an efficient implementation of bitwise BWT?
    I used bit-to-byte expansion + divsufsort in https://encode.ru/threads/2742-Compr...ll=1#post52493
    and at the moment my binary BWT is what keeps it barely usable.

  8. #6
    Programmer michael maniscalco
    Quote Originally Posted by Shelwien View Post
    As to ?:, it's actually pretty safe to use these days: https://godbolt.org/g/nyPGRH
    Only if() would actually generate branches.
    True. I'm not certain exactly what is going on under the hood (I didn't bother to look at the assembly), but removing the ? operator resulted in about a 5% performance increase in a hot spot in the code. (Actually, I removed two ? operators in neighboring instructions, so it's more likely a smaller increase doubled. Or perhaps two ? operators in tandem make 2^2 possible outcomes, at which point pipelining becomes more difficult?) Whatever the cause, it was faster, so I used it. (^:

    Quote Originally Posted by Shelwien View Post
    Btw, do you have any ideas for an efficient implementation of bitwise BWT?
    I used bit-to-byte expansion + divsufsort in https://encode.ru/threads/2742-Compr...ll=1#post52493
    and at the moment my binary BWT is what keeps it barely usable.
    Off the top of my head:

    In MSufSort there is a get_value(input_index) function that reads the next 8 bytes of the input at the specified location, converts from host to network byte order, and returns the value (plus a trick to return trailing zeros if the requested index is close to the end of the input). The main loop has a local stack to avoid recursion (in the quicksort) that maintains each suffix group's offset into the suffix array and the length of the group (the quicksort partition information).

    So, if we add bit-shift information to the partition information (so as to avoid adding bit-shift information to every suffix index) and modify get_value to return 4 bytes rather than 8 (there is no performance loss, and in fact I do this in the next release of MS4), then get_value can load 8 bytes, shift the value by the correct number of bits, and return the correct next 32 bits requested.

    The problem is that your SA is now 8 times as large and, likely, a little more than 8 times as slow. Also, the initial 16-bit radix sort to create the initial partitions would not be possible; a smaller radix could be used instead, though. But it could be done this way, and I don't think this modification would be too difficult to make.

    - M

    [edit]

    Actually, you would need to keep the bit-shift value with the input index, or your final SA would have 8 copies of each unique input index and no way to tell which is which. Therefore you would need to reserve 3 bits of each 32-bit index for this purpose, limiting your input to 2^29, or 1/2 GB, in size maximum. MS4 already claims the uppermost bit for a cache performance improvement, so it's already limited to 2^31, and therefore this modification would make the maximum input size for MS4 2^28, or 256 MB.
    Last edited by michael maniscalco; 9th September 2017 at 17:13. Reason: updated thoughts on bitwise BWT

  9. The Following User Says Thank You to michael maniscalco For This Useful Post:

    Shelwien (9th September 2017)

  10. #7
    Programmer michael maniscalco
    I've updated M99 to use the latest MSufSort 4, as well as exploiting some byproducts of M99 decoding to speed up the inverse BWT (getting the count of each unique symbol in the BWT). A comparison against another fast BWT coder, BSC: M99 clearly has the advantage during encoding, which is likely due to MSufSort 4's superiority over libdivsufsort. On decoding, BSC has the advantage. Ilya is likely doing something clever with the inverse BWT, but I haven't looked at the BSC code at all to confirm this.

    - Michael


    == single threaded comparison ==

    m99 (single threaded)
    ./m99 enwik8 enwik8.m99 1
    compressed: 100000000 -> 22275163 bytes. ratio = 22.2752%
    Elapsed time: 12.355 seconds

    ./m99 d enwik8.m99 enwik8 1
    Elapsed time: 7.305 seconds

    bsc (single threaded - default switches)
    ./bsc e enwik8 enwik8.bsc -tT
    compressed 100000000 into 22439870 in 14.139 seconds.

    ./bsc d enwik8.bsc enwik8 -tT
    decompressed 22439870 into 100000000 in 8.208 seconds.

    bsc (single threaded - fullblocksize, no preprocessing)
    ./bsc e enwik8 enwik8.bsc -tTpb100
    compressed 100000000 into 20920018 in 16.513 seconds.

    ./bsc d enwik8.bsc enwik8 -tT
    decompressed 20933900 into 100000000 in 8.356 seconds

    == multithreaded comparison ==

    m99 (multithreaded)

    ./m99 e enwik8 enwik8.m99
    compressed: 100000000 -> 22276698 bytes. ratio = 22.2767%
    Elapsed time: 5.712 seconds

    ./m99 d enwik8.m99 enwik8
    Elapsed time: 3.948 seconds

    bsc (multithreaded - default switches)
    ./bsc e enwik8 enwik8.bsc
    compressed 100000000 into 22439870 in 5.947 seconds.

    ./bsc d enwik8.bsc enwik8
    decompressed 22439870 into 100000000 in 3.139 seconds.


    bsc (multithreaded - full blocksize, no pre-processing)
    ./bsc e enwik8 enwik8.bsc -b100p
    compressed 100000000 into 20933900 in 10.590 seconds.

    ./bsc d enwik8.bsc enwik8
    decompressed 20933900 into 100000000 in 3.186 seconds.

  11. #8
    Programmer michael maniscalco
    TL;DR: I ran some experiments, and I believe that M99+tANS should achieve enwik9 1,000,000,000 -> 171,546,xxx at somewhere near 250 MB/sec.


    Inspired by the work that Lucas is doing on BRC, I decided to invest a little bit of time into making M99 as fast as I could possibly make it. M99 isn't as amenable to SIMD and vectorization, since it doesn't work in as linear a fashion as most typical BWT algorithms. But I was able to get the modelling for M99 to about 80% of the performance that Lucas is reporting for his fantastic BRC transform.

    However, unlike BRC, M99's modelling phase sets the stage for keeping compression ratios high while also allowing a choice of very high-throughput entropy coding.

    An older version of M99 had a 'max' mode that replaced M99's basic bit-packing technique with arithmetic coders and simple models. Since the introduction of ANS, I've been meaning to experiment with tANS as a faster 'max' mode within M99 instead of bit packing. M99's bit packing will be faster for sure, since it does even less work than tANS; however, tANS should be exceptionally close in throughput while still increasing compression, as the original arithmetic-coder 'max' mode did.

    Moreover, tANS is particularly well suited for use with M99 because M99's modelling phase runs in reverse order when decoding, which matches ANS's oddity of decoding in reverse order. The two should be a perfect complement.

    M99 achieves compression by repeatedly encoding values K bounded in [0, N]. The distribution for any N typically looks like a bell curve, and M99 exploits this in its bit packing to get better compression for values of K which are close to N/2 (by being able to drop the most significant bit, which is often a zero).

    So I ran an experiment: I kept M99's bit packing for any value N >= 64, but rather than bit-packing values where N < 64, I simply pushed all values K for each N to files 'outputN.dat'.
    This gave me a bit count for encoding the file for all instances of N >= 64, plus a series of files containing all K for each unique value of N.

    I then compressed each of these 64 files using tANS (Yann's version) and produced an estimate for M99+tANS on enwik9 of 171,546,xxx.
    On an Intel i7 with 4 physical cores @ 3.4 GHz, the M99 modelling phase (everything but the bit packing) runs at about 550 MB/sec with 16 threads. Coupled with the well-established superior throughput of tANS, I expect that M99+tANS should be able to encode enwik9 from 1,000,000,000 -> 171,546,xxx at somewhere near 250 MB/sec overall.

    Obviously, these numbers involve some guesswork, but given that I used an established encoder for the basic test and M99's equally established modelling code, I feel pretty confident that these results are a good estimate of the real-world results once I get around to putting it all together into one complete compressor.

    Interestingly, I ran the same experiment using James' rans_dyn_experiments encoder, and compression was slightly inferior to what FSE produced - close, but not quite as good as feeding the raw M99 output directly to tANS.

    - Michael

  12. The Following 8 Users Say Thank You to michael maniscalco For This Useful Post:

    algorithm (9th November 2018),Bulat Ziganshin (9th November 2018),Cyan (9th November 2018),encode (9th November 2018),hexagone (9th November 2018),JamesB (9th November 2018),Jarek (9th November 2018),Lucas (9th November 2018)

  13. #9
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    68
    Thanks
    31
    Thanked 22 Times in 15 Posts
    You can also test FPC instead of FSE. It's faster, with a higher compression ratio because of smaller blocks. https://github.com/algorithm314/FPC

  14. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    What is the speed of the M99 modelling stage, as opposed to BRC1 (i.e. just vectorized SRC), for a single thread?

  15. #11
    Programmer michael maniscalco
    Quote Originally Posted by Bulat Ziganshin View Post
    What is the speed of the M99 modelling stage, as opposed to BRC1 (i.e. just vectorized SRC), for a single thread?
    Intel i7-3770 CPU @ 3.40GHz (4 physical cores)
    Obviously, at these speeds, and using all cores, the timings won't be 100% consistent. Any rogue process that happens to run during the test will likely cause the timing to vary a tiny bit. I ran each test a couple of times and report the best of each run.

    [Attached images: image.png, image2.png, image3.png, image4.png - benchmark result tables]

  16. The Following 4 Users Say Thank You to michael maniscalco For This Useful Post:

    Bulat Ziganshin (9th November 2018),Cyan (9th November 2018),Lucas (9th November 2018),Mike (9th November 2018)

  17. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Any idea why the percentage improvement for the BRC variations is significantly better for 4 threads than for either 1 or 16 threads? Just curious; that seems odd.

  18. #13
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Guesswork aside, that's a great speed/size tradeoff. Pleased to see the discussions having such a positive outcome already.

  19. The Following User Says Thank You to JamesB For This Useful Post:

    michael maniscalco (9th November 2018)

  20. #14
    Programmer michael maniscalco
    Quote Originally Posted by Kennon Conrad View Post
    Any idea why the percentage improvement for the BRC variations is significantly better for 4 threads than for either 1 or 16 threads? Just curious; that seems odd.
    The hardware I used has 4 physical cores, so I wouldn't expect a linear increase in throughput once hyperthreading kicks in. I'm not familiar with the inner workings of BRC, since I haven't looked at the code other than to get it to compile on Linux. But once the CPU is hyperthreading, several threads can compete for cache, which will have a negative impact on performance. Having said that, I suspect cache eviction isn't really a problem for what I presume BRC is doing.

    I'm not well informed on the scalability of SIMD in multithreaded environments, or whether there are any contention issues at all (probably not). But I'm sure there are others here more qualified to speak to that.

  21. #15
    Programmer Bulat Ziganshin
    Kennon: it may be clock frequencies and/or the sharing of SIMD units between 2 logical threads.

    When all cores are loaded, frequencies are more limited for AVX than for scalar/SSE code.

    SIMD code gets less benefit from HT, since we can execute only 2-3 SIMD instructions per cycle as opposed to 4 scalar ones, and SIMD code usually has higher natural parallelism (in particular, BRC processes 256 bytes independently of each other).

  22. The Following 2 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    Kennon Conrad (9th November 2018),michael maniscalco (9th November 2018)

  23. #16
    Member Kennon Conrad
    Quote Originally Posted by Bulat Ziganshin View Post
    Kennon: it may be clock frequencies and/or the sharing of SIMD units between 2 logical threads.

    When all cores are loaded, frequencies are more limited for AVX than for scalar/SSE code.

    SIMD code gets less benefit from HT, since we can execute only 2-3 SIMD instructions per cycle as opposed to 4 scalar ones, and SIMD code usually has higher natural parallelism (in particular, BRC processes 256 bytes independently of each other).
    Hi Bulat,

    That makes sense, but I guess my "real" question is why you would get a better percentage improvement with 4 cores than with 1. M99 -> BRC4 AVX is a 39% improvement for one thread and a 64% improvement for four threads. My simplistic thought pattern tells me that you should see the maximum percentage improvement with one core, since there shouldn't be limitations caused by running more than one thread. The only possible reason I can think of is that the individual cores use less RAM with four threads instead of one, making the cache more efficient, but that doesn't fit with my (minimal) understanding of the algorithm.

    Best Regards,

    Kennon

  24. #17
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Nice work Michael, it's nice to see friendly competition.

    BRC scales up better simply because it's block based; I set it to only operate on 1 MB blocks. If I recall correctly, M99 uses parallel merging, which has some dependencies and can be a little slower, but it's mostly linear scaling.

    The inside of BRC1-4 is quite simple: perform a vectorized MTF, insert the rank at the bucket position specified by the symbol's frequency, increment the bucket position, repeat.

    Then, after that's done, BRC4 uses byte-aligned RLE0 to remove the pressure caused by run lengths.

    The inverse MTF will usually be done either out of order or using an intrinsic shuffle, because the next symbol is always located at map[0], so it has instant access to the next symbol.

    I did experiment with manually unrolling the decoder by prefetching the next 2 symbols and updating the table once per 2 bytes. I got nearly double the throughput, but only on hard-to-compress data; for compressible data it hurt speed by 40%. So I just left the normal decoder in place.

  25. The Following User Says Thank You to Lucas For This Useful Post:

    Kennon Conrad (9th November 2018)

  26. #18
    Programmer michael maniscalco
    Quote Originally Posted by algorithm View Post
    You can also test FPC instead of FSE. It's faster, with a higher compression ratio because of smaller blocks. https://github.com/algorithm314/FPC
    That's one speedy little coder!

    M99+FSE 1,000,000,000 -> 171,546,xxx
    M99+FPC 1,000,000,000 -> 173,306,xxx

    (all data tarred for speed measurements):
    FSE enc = 365 MB/sec, dec = 467 MB/sec
    FPC enc = 625 MB/sec, dec = 811 MB/sec

    So, yes, a very impressive little coder!

    Interestingly, I tried FSE's Huffman coder on the same file, and the benchmark mode for FSE reports a decode error. I'll be sure to notify Yann.
    Last edited by michael maniscalco; 9th November 2018 at 22:46. Reason: added enc/dec measurements

  27. The Following User Says Thank You to michael maniscalco For This Useful Post:

    algorithm (9th November 2018)

  28. #19
    Member algorithm (Greece)
    Compression ratio depends on the probability distribution of the symbols and on the block size. What block size did you test? The default is 16 KB. The best compression ratio comes from the adaptive block size, -b 0. It looks like you tested 32 KB.

  29. #20
    Programmer michael maniscalco
    I used whatever the default was. Adding -b 0 clearly slows it down as well; however, with -b 0 the size drops to 172,663,016.
    Last edited by michael maniscalco; 9th November 2018 at 23:19. Reason: fixed stupid mistake in reporting compressed size

  30. #21
    Member Lucas (United Kingdom)
    Quote Originally Posted by michael maniscalco View Post
    I used whatever the default was. Adding -b 0 clearly slows it down as well, however, with -b 0 the size drops to 145,172,256.
    FPC's -b 0 activates an optimal parser for the Huffman coder. Are those numbers validated? If so, then you've cracked the secret to optimal BWT compression.

  31. #22
    Member algorithm (Greece)
    145,172,256 is too good to be true. The adaptive mode can be tuned for a higher compression ratio by setting STEP to 1024, but compression will be on the order of 50 MB/s. Decompression speed will be similar.
    Last edited by algorithm; 9th November 2018 at 23:12. Reason: updates

  32. #23
    Programmer michael maniscalco
    Oops, wrong number. That is the compressed size of only the data that M99 passes off to the entropy coder, rather than bit packing.

    The data that the experimental M99 coder still bit packs comes to 27,490,760 bytes; I had forgotten to add that to the total compressed by the secondary entropy coder.

    I'll update the post with the real total.
    Sorry for the confusion.

  33. #24
    Member Lucas (United Kingdom)
    Out of curiosity, how does fpaqc fare with M99 output? I'm interested to see the max ratio in comparison to BRC4 with fpaqc.

  34. #25
    Programmer michael maniscalco
    Quote Originally Posted by Lucas View Post
    Out of curiosity; how does fpaqc fair with M99 output? I'm interested to see the max ratio in comparison to BRC4 with fpaqc.
    I've updated the repo with the code needed to dump the data to output files: https://github.com/michaelmaniscalco/m99

    This will allow you to experiment with other compression techniques. However, I did run fpaqc, and it doesn't compress as well as FSE in this case:
    M99+FSE -> 171,266,345
    M99+fpaqc -> 171,401,911

    You need to define ENABLE_EXPERIMENTAL and recompile to enable this mode. It dumps all values that M99 would normally pack (for all values where N < 256) to individual files named m99.experimental.N.dat. This mode also produces a normal output.m99 file containing all of the encoded data for cases where the value N to pack was >= 256. You need to add this data to the total compressed size to get the experimental compression size.

    Here's the enwik9 output for M99+FSE

    Code:
    m99 e enwik9 ./enwik9.dat 8
    m99 - high performance BWT compressor.  Author: M.A. Maniscalco (1999 - 2017)
    **** EXPERIMENTAL MODE ENABLED **** 
    **** This mode does not produce viable compressed data **** 
    compressed: 1000000000 -> 75673174 bytes.  ratio = 7.56732%
    This produces a series of m99.experimental.N.dat files which get compressed with fse:

    Code:
    for ((i=1; i<256; i++)); do echo $i; ./fse -e ./m99.experimental.$i.dat ./m99.experimental.$i.fse | grep "compressed"; done
    du -hscb ./*.fse | grep "total"
    output:
    Code:
    1
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 0 bytes into 8 bytes ==> inf%                                       
    2
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 130899222 bytes into 25091278 bytes ==> 19.17%                      
    3
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 58150706 bytes into 13913580 bytes ==> 23.93%                       
    4
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 30712600 bytes into 8484908 bytes ==> 27.63%                        
    <SNIPPED>
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 9650 bytes into 9396 bytes ==> 97.37%                               
    255
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Nov  9 2018)
    Compressed 9535 bytes into 9368 bytes ==> 98.25%                               
    95593171    total
    Add the total FSE output to the total M99 output to get the M99+FSE compressed size:
    95593171 + 75673174 = 171,266,345

    result:

    enwik9 1,000,000,000 -> 171,266,345
    Last edited by michael maniscalco; 10th November 2018 at 00:45. Reason: added fpaqc result

  35. The Following User Says Thank You to michael maniscalco For This Useful Post:

    Cyan (10th November 2018)

  36. #26
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I got a few errors when trying to compile m99 with `scons`, mostly of this kind:

    ```
    build/release/src/library/m99/m99.cpp:119:5: error: 'force_inline' does not name a type
    ```

  37. The Following User Says Thank You to Cyan For This Useful Post:

    michael maniscalco (10th November 2018)

  38. #27
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    What about entropy coding with some Markov modelling, e.g. order 1 from https://github.com/jkbonfield/rans_static ?

  39. #28
    Programmer michael maniscalco
    Quote Originally Posted by Cyan View Post
    I got a few errors when trying to compile m99 with `scons`,
    mostly of this kind :

    ```
    build/release/src/library/m99/m99.cpp:119:5: error: 'force_inline' does not name a type
    ```

    I forgot to push the SConstruct file in the last push. It should be good now.

    By the way, how much overhead is there in FSE on a per-block basis? For smaller files I see that M99+FSE produces worse results than M99 alone. I assume it might be the overhead of the per-block headers?

  40. #29
    Member Cyan (France)
    I forgot to push the SConstruct file in the last push. It should be good now.
    Thanks, it compiles fine now!


    how much overhead is there in FSE on a per-block basis?
    It depends. I would expect the budget to be between 80 and 120 bytes, though this is highly dependent on the probability distribution.
    For very small data (<1 KB), it does indeed contribute a lot to the final cost.


    The FSE default header format is not generic at all.
    It's designed for a probability distribution which favors 0, and then gradually decreases as the symbol value increases.
    This is a great match for LZ lengths, which tend to respect, roughly, this order.

    Whenever the real distribution is significantly different, the default header format is simply wasteful.


    FSE API makes it possible for anyone to replace the header format with something else of their choice.
    But it's obviously more work.


    I guess a more generic version of the header function should at least make it possible to select a different distribution order (other than the naive 0 > 1 > 2 > 3 ...).

    An even more powerful scheme could start from a "default" or "expected" distribution and encode the delta.

    This is obviously non-negligible work.

    But as block sizes become smaller, it also becomes more important.


    Combining short headers with an ability to dynamically determine block boundaries, to better match surrounding statistics, would ultimately offer the best performance.

  41. #30
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    Quote Originally Posted by michael maniscalco View Post
    I've updated the repo to have the code needed to dump the data to output files at https://github.com/michaelmaniscalco/m99
    [...]
    enwik9 1,000,000,000 -> 171,266,345
    m99+FSE 95593171 + 75673174 = 171266345 in 0.95 sec
    m99+fpc 96970773 + 75673174 = 172643947 in 0.61 sec
    m99+fpaqc 95728737 + 75673174 = 171401911 in 13.5 sec
    m99+kanzi fpaq 95315146 + 75673174 = 170988320 in 11.3 sec

    Given the difference in compression ratios and the encoding times, FSE looks like the best option.


