
Thread: MCM open source

#1 - Mat Chartier (Member, Canada)

    MCM open source

I have put up the current state of the MCM source code on GitHub. The code isn't very clean yet, but it shouldn't be too hard to understand.

    https://github.com/mathieuchartier/mcm

I don't consider this to be the official v0.4 release, since it is not significantly better than v0.3. I still need to add filters like E8E9, dict, deltas, etc.

The Following 13 Users Say Thank You to Mat Chartier For This Useful Post:

    Bulat Ziganshin (17th July 2013),encode (17th July 2013),inikep (18th July 2013),Jan Ondrus (17th July 2013),kaitz (21st July 2014),load (19th July 2013),Matt Mahoney (18th July 2013),Mike (18th July 2013),Nania Francesco (18th July 2013),Piotr Tarsa (23rd July 2013),RichSelian (30th August 2013),samsat1024 (17th July 2013),Stephan Busch (17th July 2013)

#2 - Matt Mahoney (Expert, Melbourne, Florida, USA)

#3 - Jan Ondrus (Programmer, Rychnov nad Kněžnou, Czech Republic)
I wonder how you estimated/computed the initial probabilities for states in the initial_probs array. Can you elaborate on this?

#4 - Mat Chartier (Member, Canada)
At CM.hpp:961 there is code that dumps the probabilities. I don't remember which file I used, but maybe it was a tar containing various types of data. I wonder if it is worth switching to probabilities that learn more quickly initially.

#5 - Mat Chartier (Member, Canada)
The new version, mcm v0.4, has static Huffman preprocessing, which gives a nice speedup on redundant files. Compression is hurt slightly, probably because the order-1 and match models are not tuned for the variable-length codes. Some cache-miss-reducing optimizations still need to be done. The code is available, and regularly updated, on GitHub.

Question: does anybody have any ideas on how to adapt the match model to variable-length codes? E.g., if the bit is at index 3 but the code only has 3 bits, the probability should be about the same as if it were the 8th bit in an 8-bit code.
Attached Files

The Following User Says Thank You to Mat Chartier For This Useful Post:

    samsat1024 (23rd July 2013)

#6 - toffer (Programmer, Erfurt, Germany)
Quote Originally Posted by Mat Chartier:
The new version, mcm v0.4, has static Huffman preprocessing, which gives a nice speedup on redundant files. Compression is hurt slightly, probably because the order-1 and match models are not tuned for the variable-length codes. Some cache-miss-reducing optimizations still need to be done. The code is available, and regularly updated, on GitHub.

Question: does anybody have any ideas on how to adapt the match model to variable-length codes? E.g., if the bit is at index 3 but the code only has 3 bits, the probability should be about the same as if it were the 8th bit in an 8-bit code.
For an alphabet of cardinality N, a Huffman tree has N-1 internal nodes.
Label each internal node with a unique number from 0 to N-2.
Thus, each number corresponds to a bitstring (the root is the empty bitstring).
Let b be the bit predicted by the match model,
let d be the partial Huffman code of the current symbol,
let i be the number of the node in the Huffman tree which corresponds to d,
let L be the match length (in bytes) and l the minimum match length (in bytes).
The match model now uses this context:
1. If L-l < 0, the context is i.
(The match model degrades to order 0.)
2. If L-l >= 0, the context is N + (L-l)*N*2 + i*2 + b.
(The context is the partial bitstring d + the match length in bytes + the predicted bit.)
EDIT: Now map the context to a bit model and update as usual.
    Cheers
    Last edited by toffer; 23rd July 2013 at 20:39.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

#7 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    For modeling a match of length n bits, the probability that the next bit will not match is about 1/n. This should be the same whether the data is bytes or Huffman codes.

    I suppose an indirect model would be more accurate. Use a table to map n (quantized) or 1/n to a probability, then adjust the prediction to reduce the error.

    In zpaq I use a direct model (p = 1/n) and let the mixer do the adjustment.

#8 - Mat Chartier (Member, Canada)
Thanks a lot, Matt and toffer! I really like the idea of using the non-leaf nodes of the Huffman tree to obtain an order-0 context. This should help improve compression and speed. As for the match model, it does need to be updated. Currently it uses a context map which maps len / 4 + bit_pos to a nonstationary model. Matt, in zpaq, you simply use p = 1/n? No context map? I'll have to give this a try.

#9 - Member (Denmark)
    Any chance of a binary that runs on XP? This seems to require Vista.

#10 - Mat Chartier (Member, Canada)
    What error are you getting when you try running the binary on XP?

#11 - Member (Denmark)
    "Not a valid Win32 application". I checked the executable header, and the executable wants Windows 6.0 (Vista) to run.

#12 - Mat Chartier (Member, Canada)
Here is a copy of the gcc version that I just built. It has a few extra print statements, but it should be similar to the current version on GitHub. This one *should* run on Windows XP. I haven't been working on it much lately, but I am thinking of adding block-based multi-threading and 64-bit support in the near future.
Attached Files

The Following User Says Thank You to Mat Chartier For This Useful Post:

    Nania Francesco (29th August 2013)

#13 - Mangix (Member, USA)
    I'm having issues decompressing a file. The program crashes. Here is a core dump. Not sure if it's useful.

    https://mega.co.nz/#!eVJhHDIQ!Ef3QGA...uuQ5pAOQ3WNI7w

    edit: second issue. With the recent binary that you made, decompression is broken. It's decompressing way too much.

    ----------+ 1 mangix None 134M Aug 4 14:11 rockyou.txt
    -rw-rw-r--+ 1 mangix None 32M Aug 29 10:04 rockyou.txt.mcm
    ----------+ 1 mangix None 1.4G Aug 29 10:21 rockyou2.txt



rockyou2.txt is the result of decompressing the .mcm file, while rockyou.txt is what I compressed.

edit: looks like there's an issue with the binary I made. It produces invalid files and is also unable to decompress them. Using the posted binary for both works fine.
    Last edited by Mangix; 29th August 2013 at 20:45.

