Page 1 of 2
Results 1 to 30 of 41

Thread: Spec for portable PAQ

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts

    Spec for portable PAQ

    I am putting together a specification for a portable PAQ-like file format. It is designed so that different versions of the compressor can read each other's archives. The idea is to describe everything in the archive itself: the architecture (connections between context models, match models, mixers, and SSE stages), the contexts and memory allocation used by each model, and a wide variety of preprocessing steps like E8E9, dictionary, delta, etc. This means you can create an archive that compresses fast, or one that compresses smaller with more models, and the same program will decompress either. You could spend a lot of time trying to pick the best parameters, which makes the format asymmetric: the decompressor doesn't have to do this search. Any decompressor can read an archive just by reading the parameters that describe the model.

    The file format is designed for archivers (solid or not), single-file compressors, and memory-to-memory compression. It is a sequence of independently compressed blocks, each made up of non-independent segments that can be directed to separate files, use different preprocessing steps, or turn off some of the components for speed. Each segment has an uncompressed header and compressed data.

    To support this, I designed the arithmetic coder with a couple of neat innovations. First, the stream is self-delimiting, so you don't have to store the uncompressed size to know when to stop decoding. Second, I designed it so it never outputs a sequence of four NUL bytes, then used four NULs to mark the end of a segment. That lets you scan the archive quickly to find all the segment headers where the filenames are stored, so you can list the archive contents (like a TAR file).
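    The "no four NULs" property can be imitated outside the coder with plain byte stuffing; this toy sketch is hypothetical and is not ZPAQ's actual mechanism (which builds the guarantee into the arithmetic coder's output logic), but it shows why such a stream plus a four-NUL terminator is easy to delimit:
    Code:
    ```python
    EOS = b"\x00\x00\x00\x00"  # segment terminator: never occurs in stuffed data

    def stuff(data: bytes) -> bytes:
        """Insert a 0x01 byte after every run of three NULs so the
        output can never contain four consecutive NUL bytes."""
        out, zeros = bytearray(), 0
        for b in data:
            out.append(b)
            zeros = zeros + 1 if b == 0 else 0
            if zeros == 3:
                out.append(1)   # stuff byte, removed on decode
                zeros = 0
        return bytes(out)

    def unstuff(data: bytes) -> bytes:
        out, zeros, i = bytearray(), 0, 0
        while i < len(data):
            b = data[i]
            out.append(b)
            zeros = zeros + 1 if b == 0 else 0
            if zeros == 3:
                i += 1          # skip the stuff byte
                zeros = 0
            i += 1
        return bytes(out)
    ```
    Since the stuffed stream never contains four consecutive NULs, a run of four NULs always indicates the terminator region, never pure segment data.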

    Unfortunately I haven't built it yet so I don't know how well it will work. Once I do I will probably make some minor changes to the spec. I designed the spec so there is room for upward compatible formats so that newer decompressors can read the output of older compressors. Still, there is plenty of room for tuning to get PAQ like compression without changing the format.

    http://cs.fit.edu/~mmahoney/compression/zpaq1.pdf

    I may write an LGPL implementation (like ZLIB) so it is widely adopted. The hard part is going to be automatically choosing parameters for best compression based on the file type and contents. For now I may just write a version where the user specifies everything and do experiments with it.

    Comments?

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    At a quick look, I found a tiny detail in the mixer which can be improved:
    Code:
    WT[0...SIZE-1][0...M-1] := floor(65536/M) (range -x80000000...x7FFFFFFF)
    Instead, you should initialize the mixer weights to zero. Because you are in the logistic domain, zero actually means 0.5 here. This is borne out by the BIT implementation: the old version initialized weights like your code, while the new version (BIT 0.7) gains extra performance by choosing zero. Look at SqueezeChart: BIT 0.7 outperforms LPAQ1 with less time and less memory.

    Also, I found that the LearningRate is highly dependent on the data type. For example, around 7 is suitable for text, while around 15 is suitable for hardly compressible data. However, with automatic optimization I found different parameters for the whole SFC. So both the final shift and the learning rate could be stored in the archive. PAQ mostly has some suboptimal settings, and a high-precision StateMap and Mixer have a great effect in fixing them.
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Actually, if mixer weights are all 0 then initially the output is p = squash(0) = 1/2 regardless of inputs.
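    The arithmetic is easy to check with a floating-point toy (real PAQ mixers use fixed-point weights and a final shift; this sketch only illustrates the squash/stretch logic and the gradient update):
    Code:
    ```python
    import math

    def squash(x):
        """Logistic function: stretch domain -> probability."""
        return 1.0 / (1.0 + math.exp(-x))

    def stretch(p):
        """Inverse of squash: probability -> stretch domain."""
        return math.log(p / (1.0 - p))

    class Mixer:
        def __init__(self, n, lr=0.002):
            self.w = [0.0] * n   # all-zero weights: first output is squash(0) = 1/2
            self.lr = lr

        def mix(self, probs):
            self.x = [stretch(p) for p in probs]
            self.p = squash(sum(w * xi for w, xi in zip(self.w, self.x)))
            return self.p

        def update(self, bit):
            err = bit - self.p               # gradient of coding cost
            for i in range(len(self.w)):
                self.w[i] += self.lr * err * self.x[i]

    m = Mixer(3)
    assert m.mix([0.9, 0.2, 0.7]) == 0.5     # inputs are ignored until weights move
    ```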

    But I might add learning rates as tunable parameters after I do some experiments. The document is surely going to change after I start writing code. Maybe I will take out stuff to save work. I have already updated it in the hour since I posted it to fix some mistakes and add a linear mixer component.

  4. #4
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    163
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I am putting together a specification for a portable PAQ like file format. It is designed so that different versions of the compressor can read each other's archives.
    Quote Originally Posted by Matt Mahoney View Post
    All decompressors can read them just by reading the parameters that describe the model.
    That is great, I'll look forward to it!

  5. #5
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    "... asymmetric ..."

    Looks like you also got into optimization. I had a quick look at your specs...

    It'd be great if you could also support the use of a mixing tree and secondary models for mixing. In addition, using other 256-state FSMs for bit-history quantisation would be good, too (store the table). For my optimization experiments I used a parametrized FSM model, which is similar to PAQ's but shows a very different state allocation.

    BTW did you do any other experiments a la PAQ9?

    Greets

  6. #6
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Actually, if mixer weights are all 0 then initially the output is p = squash(0) = 1/2 regardless of inputs.
    Yep. That's what I was trying to point out.
    Quote Originally Posted by Matt Mahoney View Post
    But I might add learning rates as tunable parameters after I do some experiments. The document is surely going to change after I start writing code.
    As a note, with automatic optimization (on the whole SFC) I found a learning rate of 6 and a final shift of 5. Of course, this depends on my parameters. But you can try them.
    Quote Originally Posted by Matt Mahoney View Post
    Maybe I will take out stuff to save work. I have already updated it in the hour since I posted it to fix some mistakes and add a linear mixer component.
    Yes. It should be what you've pointed out.

    BTW, why don't you take some of our (toffer's and my) observations into account? They are mainly related to both modeling and speed. Note that most people avoid PAQ because of its long processing time, so I think you should also include some speed-oriented options. Considering PAQ's main goal (maximal compression at any cost), users should have two choices in the storable parameters: speed vs. strength.
    BIT Archiver homepage: www.osmanturan.com

  7. #7
    Member
    Join Date
    Jun 2008
    Location
    Lévis, Canada
    Posts
    30
    Thanks
    0
    Thanked 0 Times in 0 Posts
    What's this product's marketability? Who's the target consumer?

  8. #8
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    578
    Thanks
    56
    Thanked 44 Times in 28 Posts
    Marketing?

    Does everything discussed here need a marketing potential?
    This is all about improving algorithms, breaking records, trying new ideas.

    I adore Matt's idea and vote for it.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,061
    Thanks
    0
    Thanked 6 Times in 5 Posts
    Random comments.

    1. There's not much reason to do it if it'd be inferior to paq8 in the end.

    2. I don't like basic format features involving redundancy and speed impact,
    like the in-stream EOF, segment delimiters, and embedded filenames.
    Instead, I'd suggest attaching a compressed archive-structure segment.
    Scanning the archive is never quick.
    Also, it's not necessary to flush the rc at all, as it's possible to
    store the whole rc state as a segment index.

    3. Interpreted hashing sequences would be really slow.
    Imho it's much better to support this in the form of a single generalized
    function with fixed parameters. I guess standard
    sequences could be "cached" by including precompiled handlers,
    but that'd just make "non-standard" hash functions completely
    unreasonable to use.
    Also, is there even a non-hashed context map?

    4. It would be useful to be able to calculate the memory usage and
    a speed estimate for a given model structure.

    5. The model scheme is considerably weird. There are more ways of
    merging the submodel predictions, like:
    a) predicting a value by a submodel's probability distribution
    (like a byte value) and generating a sequence of match flags
    for further modelling
    b) switching (using only one of a few submodels, selected by
    some entropy estimation method)
    c) 2d+ SSE (multi-dimensional mapping)
    d) adaptive linear mixing

    Also, SSE's best feature is context clustering, not some
    "estimation correction"; it really can't do much when
    applied to adaptively mixed input.
    And I don't see any mapping component which would be able
    to map multiple quantized inputs (like a column number and
    the values of adjacent elements in a table) to an output.
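    For reference, the kind of SSE/APM stage being discussed maps a (context, quantized probability) pair to an adaptive output, interpolating between neighboring bins. This is a simplified linear-domain sketch (real PAQ APMs work on stretched probabilities with fixed-point tables); it mainly shows the context-clustering structure:
    Code:
    ```python
    class SSE:
        """Simplified APM/SSE: refine a probability under a small context.
        Each context selects a row of NBINS adaptive entries indexed by the
        quantized input probability; adjacent entries are interpolated."""
        NBINS = 33

        def __init__(self, n_ctx, rate=0.02):
            # each row starts as the identity mapping p_out = p_in
            self.t = [[i / (self.NBINS - 1) for i in range(self.NBINS)]
                      for _ in range(n_ctx)]
            self.rate = rate

        def refine(self, ctx, p):
            pos = p * (self.NBINS - 1)
            self.i = min(int(pos), self.NBINS - 2)
            self.frac = pos - self.i
            self.ctx = ctx
            row = self.t[ctx]
            return row[self.i] * (1 - self.frac) + row[self.i + 1] * self.frac

        def update(self, bit):
            # nudge the two interpolated entries toward the observed bit
            row = self.t[self.ctx]
            row[self.i]     += self.rate * (1 - self.frac) * (bit - row[self.i])
            row[self.i + 1] += self.rate * self.frac       * (bit - row[self.i + 1])
    ```
    Each context owns its own table row, so the stage clusters statistics by context rather than merely "correcting" the input estimate.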

    6. A useful feature for asymmetric coding is separation of
    modelling and coding stages in compression. I mean, if
    we store the probability values from submodels for a data block,
    then it would be possible to try multiple ways of combining them.

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    I updated the spec. http://cs.fit.edu/~mmahoney/compression/zpaq1.pdf

    - Added a direct context model (context to prediction) in addition to the indirect (context to bit history to prediction). The direct model should be faster.
    - Added learning rates for SSE and context models.
    - Made the syntax simpler and more extensible. There is a lot more flexibility now in how components can be connected. You can have up to 255 in any combination of context models, mixers, SSE, and linear mixers with fixed weights.
    - I separated the model and hash specifications. The hash specification has to be interpreted once every byte, so the new format should be faster.
    - Took out some of the transforms (to save work). I may put them back if they prove to be useful. I kept dictionary, capitalization, E8E9, and bit reverse.
    - Took out the segment level component switches. It's more work and I have no evidence the feature would be useful. Almost everything in the spec is using tested methods.

    Like I said, it won't be finished until it's in code. I might add a paq9a type cascading model that takes a bit history and prediction as input. I might also add other contexts. But this should be enough for now. I won't put them in unless there's a good chance they are useful.

    Anyway, to answer some comments: compression might not be the very best you can get, but that is always a problem with portable formats. The goal is to be close. There are so many parameters to play with that I don't think it will make a big difference in development time. Even following the spec, it could take years to find the best combinations. Meanwhile, if restrictions are a problem, there is always ZPAQ2, and it will be backwards compatible.

    It is possible to make compression very fast, for example with just a single component: an order-2 model going directly to the arithmetic coder.
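    A direct context model of that kind is essentially a context-indexed table of adaptive probabilities. A toy order-2 bit predictor along those lines (the hash constants and layout here are made up for illustration, not the spec's) could look like:
    Code:
    ```python
    class Order2Model:
        """Direct context model: (last two bytes, partial byte) -> probability."""
        def __init__(self, bits=20, rate=0.02):
            self.size = 1 << bits
            self.t = [0.5] * self.size     # one adaptive probability per slot
            self.rate = rate
            self.c1 = self.c2 = 0          # last two whole bytes
            self.bitctx = 1                # partial byte with a leading 1 bit

        def _slot(self):
            h = self.c1 * 771 + self.c2 * 9059 + self.bitctx * 2654435761
            return h % self.size

        def predict(self):
            return self.t[self._slot()]

        def update(self, bit):
            s = self.t
            i = self._slot()
            s[i] += self.rate * (bit - s[i])   # move toward the observed bit
            self.bitctx = self.bitctx * 2 + bit
            if self.bitctx >= 256:             # a full byte has been coded
                self.c2, self.c1 = self.c1, self.bitctx - 256
                self.bitctx = 1
    ```
    After enough repetitions of the same context, the table entry for each bit position converges toward the observed bit, which is all the arithmetic coder needs.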

    Asymmetric? Yes, compression speed is the same, if you don't count all the time spent analyzing the file to figure out which parameters will work best.

    Marketing? Like zlib. I will probably release an LGPL or public domain decoder. No restrictions = wide acceptance.

    Segment delimiters don't take a lot of space. Embedded file names can be empty if you're writing a file compressor. The implementation will probably be an archiver where you can rename the output files like this, so you still have the convenience of a file compressor:

    Code:
    zpaq1 a archive files... - creates or appends a block to archive
    zpaq1 a1...a5 - compresses fastest...best, default a3
    zpaq1 x archive - extracts using stored filenames, does not clobber
    zpaq1 x archive files... - renames files in the order they were added, clobbers
    zpaq1 l archive - lists contents
    To list, you can find the ends of compressed data segments by looking for four 0 bytes followed by an x01 or 'z'. Should be as fast as tar tvf.
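    That listing pass is a plain byte search. A sketch, assuming only the marker convention quoted above and nothing else about the header layout:
    Code:
    ```python
    def find_segment_ends(archive: bytes):
        """Return offsets just past each end-of-segment marker:
        four NUL bytes followed by 0x01 or 'z'."""
        ends, i = [], 0
        while True:
            i = archive.find(b"\x00\x00\x00\x00", i)
            if i < 0:
                return ends
            if i + 4 < len(archive) and archive[i + 4] in (0x01, ord('z')):
                ends.append(i + 5)
                i += 5
            else:
                i += 1
    ```
    Because the coder never emits four consecutive NULs inside a segment, the search cannot match inside compressed data, so listing never has to decompress anything.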

    But mainly the first version will be for testing. It will probably do lots of file analysis and pick parameters for x86, text, .bmp, .wav, etc. Anything that helps that's not in the spec, I may still add.

    Memory usage depends on sizebits for each component. I should add a section on how to calculate it, but it's not hard. The spec limits arrays to 1 GB each, but you can still have a lot of them.
    Last edited by Matt Mahoney; 20th January 2009 at 01:00.

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Some minor updates to zpaq1 v0.04:

    - Added terminating markers to component list, hash list, segments, and blocks to make parsing easier. The hash instruction list has to be interpreted once per byte, so it needs to be fast.
    - Added a byte swap post processing transform.
    - Added a word level context and minor changes.
    - Added a section on implementation notes, including memory usage.

    I am already working on an implementation. I have the framework for an archiver with the new self-terminating, non-flushing arithmetic coder done, and it works. It creates, extracts, and lists file contents, but there is no compression yet. When it is done, it will probably take a configuration file where you can specify a PAQ-like architecture with an arbitrary combination of components and contexts, so you can experiment with different compression algorithms without recompiling the program or breaking compatibility.

  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,657
    Thanks
    44
    Thanked 40 Times in 23 Posts
    HEAVY DUTY!

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    In version 0.05 of http://cs.fit.edu/~mmahoney/compression/zpaq1.pdf I added a specification for an implementation (sec. 6). It will be a GPL archiver. After some testing I might release a public domain version. Anyway it will work like this:
    Code:
      zpaq1 (a[config] | x | l) archive files...
    "a" compresses and appends to the archive. "x" extracts, with the option to rename files. "l" lists the contents. "a" takes an optional config file where you can specify preprocessor transforms, the component architecture, and contexts.

    Also, a minor change: the block header has a size field for easy parsing.

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Progress report:

    Code:
    BIB 111261 -> 27229
    BOOK1 768771 -> 216793
    BOOK2 610856 -> 145024
    GEO 102400 -> 64471
    NEWS 377109 -> 107969
    OBJ1 21504 -> 11496
    OBJ2 246814 -> 80447
    PAPER1 53161 -> 13543
    PAPER2 82199 -> 19989
    PIC 513216 -> 50428
    PROGC 39611 -> 11863
    PROGL 71646 -> 14463
    PROGP 49379 -> 11450
    TRANS 93695 -> 15441
    Used 8.80 seconds
    (total size 790,606). I haven't implemented all of the models yet. The compressor reads this configuration file:
    Code:
    comp 20 6
      0 cm 20 64
      1 cm 22 64
      2 cm 22 64
      3 match 22
      4 cm 24 64
      5 mixb 16 0 5 32
    hash
      0 z off 1 255 off 2 255 end
      1 h 0 off 3 255 end
      2 h 1 off 4 255 end
      3 h 2 off 5 255 end
      4 word 65 90 223 12 end
      5 z off 1 255 end
    pre
      pass
      pass
      pass
      pass
    end
    zpaq1 interprets the file, constructs a model, and creates an archive. During decompression, the instructions are retrieved from the archive and executed. Even though it is an interpreted program, speed is similar to lpaq1 for the same number of models.

    The configuration file describes 4 direct context models (context -> prediction, orders 2 through 4 and lowercase words), an order 5 match model and a mixer, and no preprocessing. You can connect components any way you want. I just connected the mixer to take all the other predictions. I haven't implemented indirect context models (with bit histories) or SSE yet, or all of the context functions. I also did not implement the LZP or dictionary preprocessor yet. It will do EXE (e8e9) but it's not useful here. So code is not ready to release yet.

    The latest spec adds LZP (v 0.10), removes BSWAP (not useful without a bitstream model) and SWAP (not useful without a delta or LPC preprocessor). I found it useful to specify a learning rate for mixers. Also lots of minor changes and fixes.

  15. #15
    Member kaitz's Avatar
    Join Date
    May 2008
    Location
    Estonia
    Posts
    189
    Thanks
    6
    Thanked 15 Times in 7 Posts
    What about multithreading?
    KZo


  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,657
    Thanks
    44
    Thanked 40 Times in 23 Posts
    Can't wait for ZPAQ release!

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Quote Originally Posted by kaitz View Post
    What about multithreading?
    The spec only defines a data format. The architecture is an array of components that can be connected in fairly arbitrary ways. Some of these components could be executed in independent threads, but the execution is only a few instructions so I think the overhead would be prohibitive. But anyway, that is an optimization problem.

    It will be possible to compress blocks independently, and therefore in parallel, and concatenate them at the end.
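    Independent blocks make the parallelism trivial. A sketch using zlib as a stand-in codec (ZPAQ would run its own model per block) might look like:
    Code:
    ```python
    import zlib
    from concurrent.futures import ThreadPoolExecutor

    def compress_blocks(data: bytes, block_size: int = 1 << 16):
        """Split into fixed-size blocks and compress them independently,
        in parallel; the results can simply be concatenated."""
        blocks = [data[i:i + block_size]
                  for i in range(0, len(data), block_size)]
        with ThreadPoolExecutor() as pool:
            return list(pool.map(zlib.compress, blocks))

    def decompress_blocks(blocks) -> bytes:
        # blocks share no state, so they could also be decoded in any order
        return b"".join(zlib.decompress(b) for b in blocks)
    ```
    The trade-off is the usual one: each block restarts its model from scratch, so smaller blocks mean more parallelism but slightly worse compression.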

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
    Matt, look at this patent:

    a "good programmer" who invented neural network data compression: http://appft1.uspto.gov/netacgi/nph-...DN/20070233477
    Maybe you will find something familiar here.

  19. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
    Matt, it's also possible to run several models simultaneously at the compression stage.

  20. #20
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Matt, look at this patent:

    http://appft1.uspto.gov/netacgi/nph-...DN/20070233477

    may be you will find something familiar here
    Not only does Infima steal code, now they are stealing patents.
    http://www.c10n.info/archives/415

    The text of the patent is plagiarized right from https://www.cs.fit.edu/Projects/tech...cs-2005-16.pdf which I wrote in 2005. The patent application was May 24, 2006.

    They are too stupid to even understand what they wrote. They patented my gradient descent mixer which I described in my paper and used in PAQ6. I did not introduce a neural network until PAQ7.

    Unfortunately their stupidity is surpassed only by that of the USPTO.

  21. #21
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    393
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Heavy. Something has to be done about this.

  22. #22
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    385
    Thanks
    12
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Not only does Infima steal code, now they are stealing patents.
    http://www.c10n.info/archives/415

    The text of the patent is plagiarized right from https://www.cs.fit.edu/Projects/tech...cs-2005-16.pdf which I wrote in 2005. The patent application was May 24, 2006.

    They are too stupid to even understand what they wrote. They patented my gradient descent mixer which I described in my paper and used in PAQ6. I did not introduce a neural network until PAQ7.

    Unfortunately their stupidity is surpassed only by that of the USPTO.

    Reminds me of the company that lost out to MicroShaft; I think it was Stacker or something similar. Or the intermittent windshield wiper. You can't win when big money is against you. I would like to know what you or anyone else could do to stop them, or at least revoke their patent. I hope it's an interesting fight. Or did they already win?

    Take Care
    David Scott

  23. #23
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    385
    Thanks
    9
    Thanked 85 Times in 63 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Not only does Infima steal code, now they are stealing patents.
    http://www.c10n.info/archives/415

    The text of the patent is plagiarized right from https://www.cs.fit.edu/Projects/tech...cs-2005-16.pdf which I wrote in 2005. The patent application was May 24, 2006.

    They are too stupid to even understand what they wrote. They patented my gradient descent mixer which I described in my paper and used in PAQ6. I did not introduce a neural network until PAQ7.

    Unfortunately their stupidity is surpassed only by that of the USPTO.
    I can't find any patent with the inventor name Matt Mahoney, so how could Infima know it's prior art? It's a patent application publication: http://v3.espacenet.com/publicationD...C&locale=en_EP. Maybe in the Manual of Patent Examining Procedure (MPEP) http://www.uspto.gov/web/offices/pac/mpep/, section 900 (Prior Art, Classification, and Search) http://www.uspto.gov/web/offices/pac...ments/0900.htm, the examiners can find something to stop the patent prosecution, like section 904.02(c), Internet Searching http://www.uspto.gov/web/offices/pac...tm#sect904.02c. Otherwise, only patent litigation is left to fight the infringement by showing prior art.
    Last edited by Sportman; 27th January 2009 at 03:14.

  24. #24
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
    >I can't find any patent with inventor name Matt Mahoney so how Infima can know it's prior art?

    Really, where could a compression genius have heard about some Matt Mahoney?

  25. #25
    Member Zonder's Avatar
    Join Date
    May 2008
    Location
    House
    Posts
    38
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by biject.bwts View Post
    You can't win when big money is against you.
    http://www.crunchbase.com/company/infima-technologies

    Raised Funding - $630k

  26. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    >I can't find any patent with inventor name Matt Mahoney so how Infima can know it's prior art?

    really - where compression genie can hear about some Matt Mahoney?
    The law in the U.S. says that if you invent something, you have one year after disclosing it to apply for a patent. But you can't patent something that you didn't invent. It is the responsibility of the inventor to search for prior art. Prior art does not have to be patented. Infima obviously was aware of my TR because they copied text from it into their application. It was their responsibility to cite it and PAQ6 as prior art.

    I didn't patent any compression software because I am opposed to software patents in general. Patents are supposed to foster innovation by rewarding the inventor. But the effect is quite the opposite. Any hint that an algorithm might be covered by a patent, and people will look elsewhere. I posted all versions of PAQ and dated them precisely so that nobody else could patent any of the algorithms.

  27. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
    Matt, will you try to cancel this patent?

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,603
    Thanks
    125
    Thanked 363 Times in 234 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Matt, will you try to cancel this patent?
    If I knew how. I suppose I could write a letter to USPTO but I can't find any information on the procedure. If the examiners are doing their job they should reject it. But it is easier for them to grant a patent than do research and fight with the applicant. That's why we have idiotic patents like the combover.
    http://www.google.com/patents?vid=USPAT4022227

  29. #29
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    ...That's why we have idiotic patents like the combover
    BIT Archiver homepage: www.osmanturan.com

  30. #30
    Member
    Join Date
    Jan 2009
    Location
    Germany
    Posts
    35
    Thanks
    0
    Thanked 0 Times in 0 Posts


