Results 1 to 30 of 407

Thread: cmix

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts

    cmix

    A new compressor, cmix v1 by Byron Knoll, has taken the number 2 spot on LTCB and the number 1 spot on the Silesia benchmark. It is open source (GPL), based on code from paq8hp12any and paq8l, using dictionary preprocessing, more context models, and more mixer layers. I was able to compile it with g++ for both Windows and Linux, but not able to run it because it requires 20 GB of memory. It took 2 days to compress or decompress enwik9 on his Core i7-3770 with 32 GB of memory.

    cmix: http://www.byronknoll.com/cmix.html
    LTCB: http://mattmahoney.net/dc/text.html#1289
    Silesia: http://mattmahoney.net/dc/silesia.html
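    For readers new to the PAQ family: "context models plus mixer layers" means each model outputs a probability for the next bit, and a neural mixer combines them in the stretched (logit) domain with weights updated online. A minimal sketch of logistic mixing (my own illustrative code, not cmix's actual implementation):

    ```python
    import math

    def stretch(p):
        """Map a probability to the logit domain."""
        return math.log(p / (1.0 - p))

    def squash(x):
        """Inverse of stretch: logit back to a probability."""
        return 1.0 / (1.0 + math.exp(-x))

    class Mixer:
        """PAQ-style logistic mixer: weighted sum of stretched predictions."""
        def __init__(self, n_models, lr=0.02):
            self.w = [0.0] * n_models
            self.lr = lr
            self.st = [0.0] * n_models

        def mix(self, probs):
            self.st = [stretch(p) for p in probs]
            return squash(sum(w * s for w, s in zip(self.w, self.st)))

        def update(self, p, bit):
            # Online gradient step: move weights toward the observed bit.
            err = bit - p
            self.w = [w + self.lr * err * s for w, s in zip(self.w, self.st)]

    # Two toy models: one confidently right, one uninformative.
    m = Mixer(2)
    for _ in range(200):
        p = m.mix([0.9, 0.5])
        m.update(p, 1)
    # The mixer learns to weight the informative model heavily.
    ```

    cmix runs hundreds of thousands of such mixer neurons, with weight sets selected by context, which is where much of the memory goes.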

  2. #2
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    376
    Thanks
    137
    Thanked 194 Times in 107 Posts
    I compiled a Windows 32-bit version with reduced memory usage.
    The paq8hp model is not used, paq8l is reduced to level 4, and for the other models max_size is reduced 80x.
    As it uses some C++11 features, I had to update my Dev-C++ to a modern version. The project file is included in the attachments.

    Memory usage was about 1.8 GB.

    Code:
    E:\O\cmix\src>cmix -c enwik6 out.cm
    Number of models: 113
    Number of mixer neurons: 528497
    1000000 bytes -> 189637 bytes in 285.55 s.
    
    
    E:\O\cmix\src>paq8pxd_v7 -8 enwik6
    Creating archive enwik6.paq8pxd7 with 1 file(s)...
    
    
    File list (16 bytes)
    Compressed from 16 to 17 bytes.
    
    
    1/1  Filename: enwik6 (1000000 bytes)
    Block segmentation:
     0           | utf-8     |   1000000 bytes [0 - 999999] (wrt: 848354)
    Compressed from 1000000 to 202709 bytes.
    
    
    Total 1000000 bytes compressed to 202737 bytes.
    Time 127.54 sec, used 1633432453 bytes of memory
    
    
    
    
    E:\O\cmix\src>cmix -c world95.txt out.cm
    Number of models: 113
    Number of mixer neurons: 528497
    2988578 bytes -> 342310 bytes in 1069.59 s.
    
    
    E:\O\cmix\src>paq8pxd_v7 -8 world95.txt
    Creating archive world95.txt.paq8pxd7 with 1 file(s)...
    
    
    File list (21 bytes)
    Compressed from 21 to 24 bytes.
    
    
    1/1  Filename: world95.txt (2988578 bytes)
    Block segmentation:
     0           | text      |   2988578 bytes [0 - 2988577] (wrt: 1888874)
    Compressed from 2988578 to 332092 bytes.
    
    
    Total 2988578 bytes compressed to 332127 bytes.
    Time 286.51 sec, used 1633432448 bytes of memory
    Attached Files
    KZo


  3. The Following User Says Thank You to kaitz For This Useful Post:

    avitar (22nd April 2014)

  4. #3
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    Oops - with 32-bit XP SP3 the compress command line `cmix -c file arch` doesn't finish, and I later get a fatal error message from XP.

    Has anyone tried it on XP? Any ideas?

    John

  5. #4
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    Quote Originally Posted by avitar View Post
    oops - with 32 bit xp sp3 the compress cl cmix -c file arch doesn't finish & I later get a fatal error message from xp.

    Any tried it on xp? ideas?

    John
    AFAIK, software compiled with newer versions of Microsoft C++ does not work on XP (it requires a strange linker switch or something; I don't use M$ software).

  6. The Following User Says Thank You to fcorbelli For This Useful Post:

    avitar (22nd April 2014)

  7. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,491
    Thanks
    730
    Thanked 655 Times in 351 Posts
    x86: /link /SUBSYSTEM:CONSOLE,5.01 /LARGEADDRESSAWARE
    x64: /link /SUBSYSTEM:CONSOLE,5.02
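    For anyone unfamiliar with where those switches go: they are passed to the MSVC linker after `/link`. A hypothetical full command line might look like this (file names are illustrative; cmix actually builds from several source files):

    ```shell
    :: 32-bit build targeting Windows XP (subsystem 5.01), with >2 GB address space
    cl /O2 /EHsc cmix.cpp /link /SUBSYSTEM:CONSOLE,5.01 /LARGEADDRESSAWARE

    :: 64-bit build targeting XP x64 / Server 2003 (subsystem 5.02)
    cl /O2 /EHsc cmix.cpp /link /SUBSYSTEM:CONSOLE,5.02
    ```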

  8. #6
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    So what is the patch for the code that does the OS version check, or for the registry? I think I've seen it somewhere on this forum, but dunno where.

    J

  9. #7
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    Dooooh, M$ again! I can't find a patch, so can you, Kaitz, or anyone else, please make an XP 32-bit version by adding /SUBSYSTEM:CONSOLE,5.01 or similar when compiling (see Bulat's post #5).

    John

  10. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,491
    Thanks
    730
    Thanked 655 Times in 351 Posts

  11. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    avitar (23rd April 2014)

  12. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    Byron Knoll has released cmix v2. It now takes first place on both LTCB (beating Durilca) and Silesia.
    http://mattmahoney.net/dc/text.html
    http://mattmahoney.net/dc/silesia.html

    I did not verify the results because it takes 7 days to compress or decompress enwik9 on a Core-i7 3770K and uses 28 GB memory. He sent me these results:

    Code:
    Version: 2
    Compressed size of enwik8: 15863623 bytes
    Compressed size of enwik9: 126323656 bytes
    The program doesn't contain any configurable options.
    Size of executable as a zip file: 310068 bytes
    enwik9 compression time: 580083.71 seconds
    enwik9 decompression time: 577626.4 seconds
    Approximate memory used: 28152048 KiB
    
    Description of test machine:
    processor: Intel Core i7-3770K
    memory: 31.4 GiB
    OS: Linux Mint 14
    
    Description and source code: http://www.byronknoll.com/cmix.html
    
    Silesia compressed size in bytes:
    
    dicke: 1900496
    mozil: 9631848
    mr: 2031156
    nci: 839206
    ooff: 1361033
    osdb: 2027181
    reym: 768777
    samba: 2605457
    sao: 3766486
    webst: 4732791
    x-ray: 3570437
    xml: 246717
    
    total: 33481585
    The zipped executable is apparently smaller than the zipped source code.

  13. #10
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    Has anyone got a .exe for windoze somewhere that'll run in sensible memory, e.g. 4 or 8 GB? John

  14. #11
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    319
    Thanks
    172
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by avitar View Post
    someone got a .exe, for windoze somewhere that'll run in sensible memory eg 4 or 8G? John
    Wondering the same thing...

  15. #12
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    But what sense does a compressor slower than a snail make? It will have a practical application in 2100. Anyway, congrats to its creator.

  16. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,468
    Thanks
    26
    Thanked 118 Times in 93 Posts
    Quote Originally Posted by Nania Francesco View Post
    But what sense does it make a compressor slower than a snail
    Top the benchmark?

  17. #14
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    107
    Thanks
    7
    Thanked 79 Times in 24 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Top the benchmark?

    Sure, but a quick "back of the envelope" calculation tells me that I can download the entire enwik9 in about a day and a half over an 8 kbit/s dialup line*. That's better compression and 4 times faster too! Plus, if the point of the benchmark is artificial intelligence, I bet the first thing an intelligent piece of software would do is say "Screw the epic decoding session. I'll just download it!" (^:

    * I'm entirely aware that I probably screwed up that calculation, but my point remains valid (^:
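    The envelope can be checked mechanically (enwik9 is 10^9 bytes; the decompression time is the cmix v2 figure from post #9 above). It turns out 8 kbit/s actually loses to the decoder, while ordinary 56k dialup wins comfortably:

    ```python
    # Sanity-checking the back-of-the-envelope numbers, using the enwik9 size
    # and the cmix v2 decompression time quoted earlier in the thread.
    ENWIK9_BYTES = 10**9
    DECOMP_DAYS = 577_626.4 / 86_400  # about 6.7 days

    def transfer_days(bits_per_second):
        """Days needed to download enwik9 at a given line rate."""
        return ENWIK9_BYTES * 8 / bits_per_second / 86_400

    # At 8 kbit/s the download takes about 11.6 days (slower than decompressing);
    # at 56 kbit/s it takes about 1.7 days (comfortably faster).
    ```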

  18. #15
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    746
    Thanks
    62
    Thanked 259 Times in 181 Posts
    Where is the respect for a guy who, as of today, has publicly beaten everyone in the world on compressed size?

    Byron Knoll:
    http://www.byronknoll.com/

    His blog:
    http://byronknoll.blogspot.com/

    His social profile:
    https://plus.google.com/+ByronKnoll

    His resume:
    http://www.byronknoll.com/resume.pdf

    How it started:
    http://www.cs.ubc.ca/projects/knoll/
    https://web.archive.org/web/20121130....com/p/siarco/
    http://www.byronknoll.com/thesis.pdf
    http://byronknoll.blogspot.nl/2013/05/paqclass.html
    https://code.google.com/p/paqclass/

    Where it lead to:
    http://byronknoll.blogspot.nl/2014/01/cmix.html
    His quote Jan 6, 2014: "Realistically, it is unlikely that I will be able to surpass PAQ8 (given the amount of work/expertise that has gone into its development)"
    https://sites.google.com/site/cmixbenchmark/
    http://byronknoll.blogspot.nl/2014/02/cmix-update.html
    http://byronknoll.blogspot.nl/2014/0...1-release.html
    http://byronknoll.blogspot.nl/2014/0...version-2.html
    http://www.byronknoll.com/cmix.html

    I guess he is in his mid-twenties and has just started, so what will his next surprise be?

  19. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,468
    Thanks
    26
    Thanked 118 Times in 93 Posts
    Topping the benchmark is certainly an achievement, and congrats to its author. But AFAIU his main contribution is extending the size and number of the neural networks and adding some models based on existing ones. So in short it's a brute-force approach, which additionally was tried before (with less brute force, perhaps). Due to the slowness, nobody is really interested in joining his effort. On the other hand, when someone invents something really new, people get excited as they see the potential to refine the method and make a revolution. That was the case with ANS recently, for example, but also with the early immature PAQ versions, since PAQ was essentially the first public context mixing compressor. BWT became interesting when people started to design fast algorithms for BWT computation (nowadays divsufsort dominates all other algorithms and interest has diminished). Etc., etc.

  20. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    What we have learned from this little 8-year-long (so far) experiment is that AI (text prediction) requires lots of CPU, lots of memory, and lots of code. In the "neat" vs. "scruffy" debate, the scruffies win. It explains, to some extent, why the problem is so hard and why it hasn't been solved. In the 1960s people were predicting that AI would be solved in a few years. In hindsight we know we vastly underestimated the difficulty.

    Just like evolution, AI is experimental. You make small changes to the code and test them, not knowing if the changes will work, and often without a good understanding of why it works if it does. I've looked at the code for paq8hp12any with its thousands of mysterious parameters. What can you learn from it? It's like looking at your DNA. We can write code that we can't comprehend.

    I expect there will be future advances over the years using teams of people on supercomputers. We are still a long way from solving AI. Meanwhile I think experiments like this just help confirm my hideous cost estimates for AI. https://docs.google.com/document/d/1...pCkpWFn9IglW3o

  21. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    PSHUFB (15th August 2014)

  22. #18
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    AI? Scruffy?

    Matt,

    The "scruffies" beat the "neats" by using lots of RAM & CPU cycles? I had more the opposite impression---this is fairly shallow applied math (neat) plus brute force, without a basic qualitative/semantic theory of the domain regularities in something like human terms. Not really what I think of as "scruffy." Where's the schema theory or bottom-up feature detection/classification hierarchy, or the adaptive/reactive planning component or whatever?

    But then maybe I don't understand what it's really doing---it sounds kinda scruffy in that presumably most of the context models (and how they're being mixed) are based on detecting and reacting to specific domain regularities. (But then I'd think a scruffy would be busily explaining those things; that would be the main point.)

    Either way, for it to fulfill the promise of an "AI" approach (IMHO) that's the kind of explanation and discussion we need to be having, especially if we're scruffy---the demo program is only useful if you know what theories it embodies/supports. Then you can make some guesses how to get most of the benefit in an efficient system for practical uses, or different uses.

    "If brute force doesn't work, you're not using enough" isn't much of a theory.

  23. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    "Scruffy" means using lots of code and complexity and empirical results without a simple theory behind it. But it also takes a lot of CPU and memory too. Watson is a good example of a scruffy approach. There are years of development and hundreds of modules that each handle a small fraction of possible questions.

  24. #20
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    Well, I tried the latest cmix on Win 64 (4 cores, 8 GB), compressing arc.exe (just a suitably sized .exe I had around). It took ages (30 mins?) and increased the size by 100%. It said something like 5000 neurons. Given the number of models it tries, isn't that a bit weird?

    John

  25. #21
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    376
    Thanks
    137
    Thanked 194 Times in 107 Posts
    Byron Knoll has released a new version.
    Looking at the readme:

    Changes from version 2 to version 3:
    - Merged paq8pxd code into paq8l
    - Optimized speed of cmix and paq8 mixers
    - Minor refactoring and tuning
    KZo


  26. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    cmix v3 is about twice as fast as v2 and improves compression on LTCB and Silesia according to his tests. v2 was already top ranked in both tests.
    http://mattmahoney.net/dc/text.html#1262
    http://mattmahoney.net/dc/silesia.html

  27. #23
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    A cmix v3 executable for Windows?

  28. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    I didn't try to test it myself. You need a computer with 32 GB memory.

  29. #25
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    Byron Knoll has released cmix v4, again setting new records on LTCB and Silesia. I did not test it myself. The LTCB test would take a week and require 32 GB of memory.
    http://mattmahoney.net/dc/text.html
    http://mattmahoney.net/dc/silesia.html

  30. #26
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    Can't see any release notes, e.g. differences from v3. Anyone? J

  31. #27
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts

    Changes from version 3 to version 4:
    - Added an additional mixer context (longest match length)
    - Implemented an additional PPM model
    - Integrated some code from paq8pxd_v11 into paq8l
    - Integrated mixer contexts from paqar into paq8l
    - Minor refactoring and tuning

    Changes from version 2 to version 3:
    - Merged paq8pxd code into paq8l
    - Optimized speed of cmix and paq8 mixers
    - Minor refactoring and tuning

    Changes from version 1 to version 2:
    - Memory tuning
    - Increase PPM order to 7
    - Replaced dictionary with the one used by decmprs8
    - Added paq8pxd_v7 models
    - Removed pic models
    - Added "facade" models to copy the paq8 model predictions to the cmix mixer
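    As background for the "PPM order 7" item: a PPM model predicts each symbol from counts gathered in its preceding k-symbol context, escaping to shorter contexts when the full one is unseen. A toy sketch of just the counting layer (names are mine; real PPM adds escape probabilities and blends across orders):

    ```python
    from collections import defaultdict

    class ToyContextModel:
        """Order-k symbol counts: the raw statistics a PPM model is built from."""
        def __init__(self, order):
            self.order = order
            self.counts = defaultdict(lambda: defaultdict(int))

        def update(self, history, symbol):
            ctx = history[-self.order:]
            self.counts[ctx][symbol] += 1

        def predict(self, history):
            # Distribution over symbols seen after this context (empty if unseen).
            ctx = history[-self.order:]
            total = sum(self.counts[ctx].values())
            if total == 0:
                return {}
            return {s: c / total for s, c in self.counts[ctx].items()}

    m = ToyContextModel(order=2)
    data = "abracadabra"
    for i, ch in enumerate(data):
        m.update(data[:i], ch)
    # After the context "ra" the model has only ever seen 'c'.
    ```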

  32. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    cmix v5 was released with better compression on both LTCB and Silesia. On enwik8 it looks like it now beats the Hutter Prize winner, although it would not be eligible because it uses 29 GB of memory and doesn't exceed the 3% improvement threshold. cmix now uses an external dictionary like paq8hp12.
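    On the external dictionary: paq8hp12-style preprocessing replaces frequent words with short codes before modeling, so contexts become shorter and denser for the models that follow. A toy reversible sketch (illustrative only; the real dictionaries have thousands of tuned entries and must carefully escape bytes that collide with the codes):

    ```python
    # Toy dictionary preprocessing: frequent words become one-byte codes.
    DICT = ["the", "of", "and"]  # a real dictionary has thousands of entries
    CODES = {w: chr(0x80 + i) for i, w in enumerate(DICT)}
    WORDS = {v: k for k, v in CODES.items()}

    def encode(text):
        # Replace dictionary words with their short codes; pass others through.
        return " ".join(CODES.get(w, w) for w in text.split(" "))

    def decode(text):
        # Exact inverse of encode.
        return " ".join(WORDS.get(w, w) for w in text.split(" "))

    s = "the art of and the science"
    assert decode(encode(s)) == s       # lossless round trip
    assert len(encode(s)) < len(s)      # shorter input for the models
    ```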

  33. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    PSHUFB (15th August 2014)

  34. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 775 Times in 484 Posts
    cmix has again broken the records for LTCB and Silesia.
    http://mattmahoney.net/dc/text.html#1243
    http://mattmahoney.net/dc/silesia.html

  35. #30
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    888
    Thanks
    507
    Thanked 346 Times in 257 Posts
    Here are the scores of the cmix v6 x64 build tested on Win7 (as Byron wrote, this compile has slightly worse compression than the Linux version):

    SILESIA (tested with the English.dic switch):

    Total Silesia - 33383299
    dickens - 1897371
    mozilla - 9591339
    mr - 2028765
    nci - 838852
    ooffice - 1355776
    osdb - 2024100
    reymont - 770959
    samba - 2593614
    sao - 3767446
    webster - 4701467
    x-ray - 3569000
    xml - 244610

    ENWIK (tested with the English.dic switch):
    ENWIK8: compressed size 15,778,184 bytes, compression time 34,588.42 s
    ENWIK9: compressed size 125,697,691 bytes, compression time 398,293.92 s

    Darek

  36. The Following User Says Thank You to Darek For This Useful Post:

    byronknoll (8th December 2014)

