
Thread: Compression state-of-art discussion with C.Bloom

  1. #1
    Administrator Shelwien

    Compression state-of-art discussion with C.Bloom

    From comments to
    http://cbloomrants.blogspot.com/2010...ession-on.html

    take CTW, the idea of working in binary and blending context models
    by weighting rather than selecting


    It's not the CTW idea at all, and it's trivial.

    The CTW idea is to use string probabilities for prediction, i.e.
    logistic estimation (which appears because we can only handle string
    probabilities in the log domain). However, Matt's logistic mixing was
    taken from neural networks, not CTW.

    Basically, both bitwise processing and "blending" are obvious, when
    you try to make a compressor based on NN (which is what Matt
    originally did).

    Also, (linear) mixing is natural for another reason too - if we have
    two applicable models which give predictions p1 and p2, and a
    probability w of p1 being right, then p=w*p1+(1-w)*p2 appears straight
    from the definition.
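    For illustration, here's that linear mix as code (the function name
    and floating-point types are mine, not any particular coder's):
    Code:
    // Linear mixing of two submodel predictions:
    //   p1, p2 - each model's probability that the next bit is 1
    //   w      - estimated probability that model 1 is the right one
    double mix_linear(double p1, double p2, double w) {
        return w * p1 + (1.0 - w) * p2;  // total probability over the two hypotheses
    }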

    As to logistic mixing, it's less obvious, but I found that the paq
    mixing formulas can actually be derived from the same source model
    based on submodel switching, simply by considering string probabilities.

    take Malcolm's ideas from Rkive such as using non-contiguous
    contexts like AxBx


    Nobody remembers that anymore, but most still use sparse contexts
    because it's natural with a hashtable-based coder.

    take SEE from PPMZ and the PPMii ideas of percolating updates to
    neighbors to help with sparse statistics and slow learning


    I agree about SEE, but PPMII's ideas are its initialization of new
    counters, dynamic ranking of symbols in tree nodes, unary coding
    and SSE. And as to "fuzzy" updates - it's also one of those things
    which everybody invents on their own.

    Btw, ppmonstr also has sparse contexts.

    you pretty much have modern CM

    Still, I don't see it. Modern CMs store their counters in different
    data structures (they have to be cache-friendly these days),
    mix submodel predictions with a different mixing function and encode
    the bits with a different arithmetic coder (Matt's rangecoder is
    pretty unique, although I don't like it). So even if you can find some
    similarities, they're likely false.

    yes obviously there are lots of little tweaks, but so far as my
    understanding goes there's nothing dramatic.


    Depends on where you look. Both old and new coders take data on input
    and write some function of it on output, and both are based on the
    same idea of using statistics to predict the data. But the implementations
    look completely different to me. And in fact their behaviour is fairly
    different too - old coders compressed texts considerably better -
    enough for ppmonstr to still hold the best result on 1G
    text compression: http://mattmahoney.net/dc/text.html (durilca is
    ppmonstr with an attached text preprocessor)

    the weighting in PAQ is very basic machine learning stuff so far as
    I can tell (simple linear online training),


    It likely isn't that simple - that update formula was obtained by
    minimizing the prediction entropy; originally Matt used a different
    one, borrowed from NN theory, and it didn't work well enough.
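    In floating point, that kind of entropy-minimizing mixer looks
    roughly like this (the real paq mixers work in fixed point with
    per-context weight sets; the learning rate here is arbitrary):
    Code:
    #include <cmath>
    #include <vector>

    // Logistic mixing: combine stretched (logit-domain) predictions with weights,
    // then update the weights by gradient descent on the coding cost -log p(y).
    struct LogisticMixer {
        std::vector<double> w;   // one weight per submodel
        double lr = 0.02;        // learning rate

        explicit LogisticMixer(size_t n) : w(n, 0.0) {}

        static double stretch(double p) { return std::log(p / (1.0 - p)); }
        static double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

        // st[i] = stretch(p_i); returns the mixed P(bit=1)
        double mix(const std::vector<double>& st) const {
            double dot = 0;
            for (size_t i = 0; i < w.size(); i++) dot += w[i] * st[i];
            return squash(dot);
        }

        // After coding bit y: w_i += lr*(y - p)*st_i, which is exactly the
        // gradient step that minimizes the prediction entropy.
        void update(const std::vector<double>& st, double p, int y) {
            for (size_t i = 0; i < w.size(); i++) w[i] += lr * (y - p) * st[i];
        }
    };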

    modern ML has lots more tools to apply.

    http://prize.hutter1.net/
    http://mailcom.com/challenge/
    Good luck :)


  3. #3
    Expert Matt Mahoney
    I'm glad to see he has noticed my work.

    I agree Wikipedia information on data compression is incomplete. I was reluctant to write about PAQ and ZPAQ because of their policy on original research, but who else should write it? Charles Bloom could obviously write about LZP and PPMZ if these topics are missing.

    There is quite good info on his blog, like the analysis of LZMA last month. Until now there has been no description of the algorithm.

  4. #4
    Member cbloom
    I'd love it if you guys could help me out a bit. I'm trying to catch up on any progress in the last 10 years. It seems that research literature is even less up to date than it was in my day, so that's not helping much.

    Are there any writeups or web pages I can look at?

    So far the best thing I've found is the comments at the start of the PAQ .cpp files!

    I looked at Shelwien's "green" and "tree" which are nice and simple. Have other people tried byte-wise CM? What are the issues with that?

    PAQ9 is pretty nice and simple. I understood almost everything in it, but a few things aren't clear:

    What's the difference between binary cascades of mixing vs. N-way mixing? (it certainly seems like binary will have different adaptation rates)

    In this line :

    pr=m[i-1].pp(pr, stretch(sm[i].p(st)), st+r)*3+pr>>2;

    it looks like the mix is being forcibly damped by 3/4 and 1/4 of the unmixed probability is kept, is that right? Is there a theoretical reason to do this or is this a heuristic tweak?

    I haven't figured out everything in PAQ8 yet, it's rather more complex.

    It looks like the hash tables try a few probes and then kill one of the previous entries if all the probe spots were mismatches. I'm curious if the penalty for this has been measured (vs. hash-to-link-list or something that would not have to replace slots).

    Is there a better test corpus anywhere? I'm not really that interested in enwik8/9 because it's so specific and amenable to special casing, what I'd like to see is a corpus of 100 or 1000 medium sized files with lots of variation of data type so that we can test a compressor's ability to work well in lots of different scenarios.

    Lastly, I'll be writing about things on my blog as I learn them, and I'd appreciate any corrections if I get things wrong!

  5. #5
    Administrator Shelwien
    > I'd love it if you guys could help me out a bit. I'm trying
    > to catch up on any progress in the last 10 years.

    Well, 10 years is a bit too much. There was enough activity
    at least up to 2004 - ppmd, lzma, BWT coders, early paq versions, etc.
    But then only Matt remained, until 2007 or so, and that's how
    we lost the bytewise line - new people only see Matt's work.

    > It seems that research literature is even less up to date
    > than it was in my day, so that's not helping much.

    Imho it's become more a matter of programming than of finding good ideas.
    DCC is still there though.

    > Are there any writeups or web pages I can look at?

    There's Matt's "book" for Matt's methods: http://mattmahoney.net/dc/dce.html
    And the zpaq manual: http://mattmahoney.net/dc/zpaq1.pdf
    Also posts about counters and SSE and mixing here and on http://ctxmodel.net .

    There're also some notable programs:

    http://freearc.org/ http://freearc.org/Research.aspx http://haskell.org/bz/
    http://libbsc.com/ http://quicklz.com/ http://encode.narod.ru/
    http://schnaader.info/precomp.html
    http://translate.google.com/translat...pression.ru/ds
    http://nanozip.net/ http://www.msoftware.co.nz/products/winrk

    Also comparison sites:
    http://compressionratings.com/ http://www.maximumcompression.com/

    > So far the best thing I've found is the comments at the start of the PAQ .cpp files!

    That might be true.

    > I looked at Shelwien's "green" and "tree" which are nice and
    > simple. Have other people tried byte-wise CM?

    There've been a few before the slump:
    http://www.compression.ru/sv/download.shtml
    http://translate.google.com/translat...ression.ru/fa/
    http://translate.google.com/translat...ression.ru/so/
    http://translate.google.com/translat.../hipp5819.html

    Also mine:
    http://compression.ru/sh/ppmy_3c.rar http://compression.ru/sh/ppmy_3c_sse.rar
    http://ctxmodel.net/files/ASH/
    http://ctxmodel.net/files/PPMd/

    > What are the issues with that?

    Bytewise models are better for text compression, but with time users' priorities changed
    and now the volume of text to compress is relatively small (compared to other file types).
    And while bytewise models are relatively bad at handling binary data, bitwise ones appear
    pretty stable for both binary data and text, despite showing slightly worse results on text
    (compared to bytewise models fully tuned for text):
    http://encode.ru/threads/1052-Bytewise-vs-Binary

    Also, trees are too slow to use for bitwise coding, and hard to implement compared
    to hashtables, so there are no trees anymore.

    I still think that trees and bytewise models have their use, but it appears really
    hard to demonstrate.

    > What's the difference between binary cascades of mixing vs.
    > N-way mixing? (it certainly seems like binary will have
    > different adaptation rates)

    Binary mixer trees are easier to control (more parameters to
    tune) and are generally more precise, because with N weights
    it's hard to keep their sum equal to 1 (divisions are slow,
    so dividing by the weight sum is avoided).
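    A toy illustration of the normalization point, assuming a plain
    linear 2-input mix at each stage (not any actual coder's code):
    Code:
    #include <vector>

    // Cascade of 2-input mixes: each stage has one weight w in [0,1], so the
    // pair (w, 1-w) sums to 1 by construction and no division by a weight sum
    // is needed - unlike a flat N-way linear mix, which would need normalizing.
    double mix_cascade(const std::vector<double>& p,  // submodel predictions, e.g. low to high order
                       const std::vector<double>& w)  // one adaptive weight per stage (p.size()-1 of them)
    {
        double r = p[0];
        for (size_t i = 1; i < p.size(); i++)
            r = w[i - 1] * p[i] + (1.0 - w[i - 1]) * r;
        return r;
    }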

    > pr=m[i-1].pp(pr, stretch(sm[i].p(st)), st+r)*3+pr>>2;
    > it looks like the mix is being forcibly damped by 3/4 and
    > 1/4 of the unmixed probability is kept, is that right? Is
    > there a theoretical reason to do this or is this a heuristic
    > tweak?

    The specific 0.75+0.25 split is ad hoc, but overall it just means
    that in 25% of cases "pr" is a better estimate than the mixer output.
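    Written out (note that ">>" binds after "+"), the quoted line amounts to:
    Code:
    // blend the mixer output 3:1 with the previous estimate,
    // both being 12-bit probabilities
    int damp(int pr_mixed, int pr_prev) {
        return (pr_mixed * 3 + pr_prev) >> 2;  // = 0.75*mixed + 0.25*previous
    }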

    > I haven't figured out everything in PAQ8 yet, it's rather
    > more complex.

    Yes, but it's not surprising, seeing how Matt himself hasn't managed
    to make a better codec than paq8, despite starting the lpaq, paq9 and zpaq lines.

    > I'm curious if the penalty for this has been measured (vs.
    > hash-to-link-list or something that would not have to
    > replace slots).

    I'm the wrong person to answer this, but imho it's not so easy
    to measure. Paq models are very "solid" - many components make
    use of side effects introduced by other components.
    For example, hash collisions can improve compression if the
    hash function manages to cluster similar contexts together.

    So I'd say that a precise statistics lookup can in theory
    improve compression compared to hashtables, but for that
    we'd have to work around the "positive" side effects
    and reimplement most of the codec from scratch, because
    most components can't just be swapped out like "black boxes".

    For example, using higher precision than 12-bit probabilities
    in paq could in theory improve compression, but Matt's rangecoder
    can't handle larger scales, there are lookup tables that won't
    easily scale, etc.

    > Is there a better test corpus anywhere? I'm not really that
    > interested in enwik8/9 because it's so specific and amenable
    > to special casing, what I'd like to see is a corpus of 100
    > or 1000 medium sized files with lots of variation of data
    > type so that we can test a compressor's ability to work well
    > in lots of different scenarios.

    I don't quite agree with that approach, because "universal compression"
    doesn't seem practical anymore. In other words, you can't invent a
    simple function which would be better than existing implementations
    for most known file types, although that used to work with escape
    estimation and mixing functions.

    But all benchmark sites have their corpora, like
    http://compressionratings.com/download.html or
    http://ctxmodel.net/sh_samples_1.rar or
    http://www.maximumcompression.com/data/files/index.html

    > Lastly, I'll be writing about things on my blog as I learn
    > them, and I'd appreciate any corrections if I get things
    > wrong!

    There's been too little activity recently, so you're very welcome.

  6. #6
    Expert Matt Mahoney
    > Are there any writeups or web pages I can look at?

    I did write a small book. http://mattmahoney.net/dc/dce.html
    (corrections appreciated, BTW).

    > So far the best thing I've found is the comments at the start of the PAQ .cpp files!

    I did write quite a bit about the algorithm in my book. Also, I recently wrote a Wikipedia page on ZPAQ. ZPAQ is based on PAQ but I prefer it because it is configurable and you can change the algorithm for better compression without breaking older versions of the decompressor.

    > I looked at Shelwien's "green" and "tree" which are nice and simple. Have other people tried byte-wise CM? What are the issues with that?

    Bytewise CM seems to me hard to do and theoretically not any better.

    > PAQ9 is pretty nice and simple. I understood almost everything in it, but a few questions aren't clear :

    > What's the difference between binary cascades of mixing vs. N-way mixing? (it certainly seems like binary will have different adaptation rates)

    The mid (default) model in ZPAQ is based on PAQ9A without the LZP preprocessor. (Instead it uses a match model for long contexts). It uses both a cascade from low to high order and then a final mixer (because you need it to combine the match model, and the cascade mixing isn't perfect). But I find in experiments with ZPAQ that cascades of dependent contexts from low to high order tend to work better than n-way mixing, but n-way tends to work better for independent contexts. I don't know the theoretical reason for this. A cascade (ISSE in ZPAQ) is basically a learned rule for adjusting a low order context prediction given a bit history in the higher order context. The rule is an adjustment p -> Ap + B to the prediction p in the logistic domain where the coefficients A and B (for each bit history) are learned to minimize prediction error. I had the idea to do this after looking at the learned weights in SSE tables.
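    A floating-point sketch of that rule (the table layout, initial values and learning rate below are just illustrative, not ZPAQ's actual fixed-point code):
    Code:
    #include <cmath>
    #include <unordered_map>
    #include <utility>

    // Per bit history h, learn coefficients (A, B) and map the incoming
    // prediction p to squash(A*stretch(p) + B) in the logistic domain.
    struct IsseSketch {
        std::unordered_map<int, std::pair<double, double>> ab;  // h -> (A, B)
        double lr = 0.01;

        static double stretch(double p) { return std::log(p / (1.0 - p)); }
        static double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

        double predict(int h, double p_in) {
            auto& c = ab.try_emplace(h, 1.0, 0.0).first->second;  // start as the identity: A=1, B=0
            return squash(c.first * stretch(p_in) + c.second);
        }

        // Once the bit y is known, adjust (A, B) to reduce coding cost.
        void update(int h, double p_in, double p_out, int y) {
            auto& c = ab[h];
            double err = y - p_out;
            c.first  += lr * err * stretch(p_in);  // gradient w.r.t. A
            c.second += lr * err;                  // gradient w.r.t. B
        }
    };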

    > In this line :

    > pr=m[i-1].pp(pr, stretch(sm[i].p(st)), st+r)*3+pr>>2;

    > it looks like the mix is being forcibly damped by 3/4 and 1/4 of the unmixed probability is kept, is that right? Is there a theoretical reason to do this or is this a heuristic tweak?

    It's a heuristic tweak. I wouldn't need it if the mixing were perfect. Unfortunately I don't know what the perfect mixing algorithm is. There doesn't seem to be any theory behind it. I could probably make a weak argument based on algorithmic information theory why mixing in the logistic domain is better than linear domain. But there are probably data dependent algorithms that would work better still.

    > I haven't figured out everything in PAQ8 yet, it's rather more complex.

    > It looks like the hash tables try a few probes and then kill one of the previous entries if all the probe spots were mismatches. I'm curious if the penalty for this has been measured (vs. hash-to-link-list or something that would not have to replace slots).

    Well on average you start getting replacements when the table approaches 7/8 full, so you have to lose statistics somewhere. You can test the effects of different memory options. (More memory is always better). One thing I did test was what is the optimal number of probes. I settled on 3 because using more increases the chance of a missed collision and makes compression worse. The hash entries use an 8 bit checksum so there is a 1/256 chance that a collision will be missed. If you were mapping contexts directly to predictions then this wouldn't be a problem, but it tends to hurt more when you map contexts to bit histories because a single extra bit in a long sequence of 0's or 1's really messes things up.
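    A toy version of that probing scheme (sizes, payload and the replacement choice are simplified; the real code picks the victim more carefully):
    Code:
    #include <cstdint>
    #include <vector>

    // Each slot holds an 8-bit checksum of the context hash plus a payload byte.
    // A lookup probes 3 neighboring slots; if none matches the checksum, one of
    // them is overwritten (here simply the first probe).
    struct ProbingTable {
        struct Slot { uint8_t chk; uint8_t state; };
        std::vector<Slot> t;
        explicit ProbingTable(size_t n) : t(n) {}       // n must be a power of two; slots start zeroed

        uint8_t& find(uint32_t h) {
            uint8_t chk = h >> 24;                      // 1/256 chance of a missed collision
            size_t base = h & (t.size() - 1);
            for (int i = 0; i < 3; i++) {
                Slot& s = t[(base + i) & (t.size() - 1)];
                if (s.chk == chk) return s.state;       // hit (or an undetected collision)
            }
            Slot& victim = t[base];                     // all probes mismatched: replace
            victim = Slot{chk, 0};
            return victim.state;
        }
    };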

    > Is there a better test corpus anywhere? I'm not really that interested in enwik8/9 because it's so specific and amenable to special casing, what I'd like to see is a corpus of 100 or 1000 medium sized files with lots of variation of data type so that we can test a compressor's ability to work well in lots of different scenarios.

    Lots of benchmarks but no definite answers.
    http://www.maximumcompression.com
    http://www.squeezechart.com/
    http://compressionratings.com/
    http://heartofcomp.altervista.org/MOC/MOC.htm
    are probably the most up to date. Of course the answers you get depend on the data. enwik9 depends almost entirely on available memory, but for the Calgary corpus and other small data sets, memory hardly matters.

    I did try to come up with a generic test. http://mattmahoney.net/dc/uiq/

    The idea was that a universal compression algorithm should do well on a Solomonoff distribution, i.e. data generated by random programs. I thought it was really cool that I could solve the private/public data set problem by constructing a test where the input data was always different, yet the compression results were repeatable with a very small uncertainty. But doing this well turned out to still be a hard problem.

    For complex strings, the choice of programming language (for the random programs) should not matter, because there are simple programs that translate any language to any other language. You want to make the strings just complex enough that guessing the generating programs is computationally infeasible part of the time, but not all of the time, so that better compression algorithms can show an improvement. But at that level, around 10 to 100 bits of randomness, the choice of language does make a big difference.

    So I still ended up with a data set for which the compression algorithm could be tuned. The test set has a million strings that are almost always random in the first few bytes and have a repeating sequence after that. The top ranked programs are tuned specifically to that case.

    There are probably ways to fix that, for example, periodically making random bit flips to the generating program instead of using lots of independent programs. But to get a true Solomonoff distribution with low complexity, I still would need to find an expressive language that could be described in just a few bits. For example, I know there are universal Turing machines using a 3 symbol alphabet and just 2 states, but such machines are not very expressive. It would take long programs to produce most simple strings.

  7. #7
    Member biject.bwts
    Mr Bloom, you may have also missed out on the BWT and BWTS methods in compression.
    Another landmark in my view is Matt Timmermans' work on bijective PPM compression,
    at least from a historical point of view, since many don't believe that bijective file
    compression using arithmetic methods is possible. Including these in your study or
    write-ups would not only give a more complete picture of the state of compression but
    also better insight into what is and is not possible in file compression, since many
    seem to come away with wrong beliefs from studying just the traditional stuff.

  8. #8
    Administrator Shelwien
    Once upon a time, cbloom talked about lzma:

    http://cbloomrants.blogspot.com/2011...with-star.html

    1. Tuned matrix as a generalization of match comparison heuristics

    Surely it's better to have a matrix than a bunch of hand-coded conditions like
    Code:
    if( len==2 ) 
      if( dist<256 )
        encode a 2-byte match
      else
        encode a literal
    With a matrix this is replaced by
    Code:
    // price() quantizes the arguments into a single context
    // and looks up a value in a table ("matrix") produced by parameter optimizer
    cost1 = price(state1,flags1,len1,dist1);
    cost2 = price(state2,flags2,len2,dist2);
    if( cost1<cost2 ) 
      use match1
    else
      use match2
    Note that this is already more general than cbloom's version.

    But I also have a further idea - such heuristics are meant to
    predict the best choice based on some inputs.

    In other words, it's the same statistical modelling task as compression,
    so we can solve it with the same methods.

    To be specific, we have a context (2 sets of match properties) and
    a symbol (none/1/2 - we know the best choice after parsing optimization).
    Thus we can tune a model which compresses the sequence of these symbols
    (generated by processing a test corpus)
    in those contexts, and get an optimized heuristic as a result.
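    A rough sketch of what that could look like (the context quantization,
    counter type and three-symbol alphabet handling are just illustrative):
    Code:
    #include <array>
    #include <cmath>
    #include <utility>
    #include <vector>

    // Log (quantized context, best choice) pairs produced by the parsing
    // optimizer over a test corpus, then measure how many bits a simple
    // adaptive model needs to code the choices; that codelength is the
    // objective for the parameter optimizer.
    struct ChoiceModel {
        // counts[ctx][sym] for sym in {none, match1, match2},
        // initialized to 1 so no probability is ever zero
        std::vector<std::array<double, 3>> counts;
        explicit ChoiceModel(int nctx) : counts(nctx, std::array<double, 3>{1, 1, 1}) {}

        // codelength in bits of coding 'sym' in context 'ctx', with adaptive update
        double code(int ctx, int sym) {
            auto& c = counts[ctx];
            double bits = -std::log2(c[sym] / (c[0] + c[1] + c[2]));
            c[sym] += 1;
            return bits;
        }
    };

    // total codelength over the logged decisions = the value to minimize
    double evaluate(ChoiceModel m, const std::vector<std::pair<int, int>>& decisions) {
        double bits = 0;
        for (const auto& d : decisions) bits += m.code(d.first, d.second);
        return bits;
    }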

    2. "It's crazy that LZ compression can do so well with so much redundancy in the code stream"

    > Start with the shortest compressed bit stream. See what raw bits that decodes to

    I described a "CM with LZ model" before, which is probably a more practical version of that.
    Enumerating the arithmetic codes and decoding them is certainly an interesting way to enumerate
    complex structures, but in this case looping through all matches available at the current position
    and integrating their codelengths by first referenced byte is clearly easier.

    > add even *more* code stream redundancy by using things like the "repeat matches"

    LZMA uses arithmetic coding and 11-bit probabilities, so it can encode constants
    (fields that are the same in all records) with negligible redundancy.
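    To put a number on "negligible", a back-of-the-envelope calculation
    (the 0..2048 scale matches 11-bit probabilities; the sample values
    are just illustrative):
    Code:
    #include <cmath>
    #include <cstdio>

    // cost in bits of coding the expected value of a binary flag whose
    // adaptive probability (out of 2048) has drifted close to certain
    int main() {
        const int p11[] = {1536, 1792, 1984, 2032};
        for (int p : p11)
            std::printf("p = %4d/2048  cost per flag = %.4f bits\n",
                        p, -std::log2(p / 2048.0));
        return 0;
    }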

    Here's the test:
    Code:
    24557177 // enwik8.lzma
    25021760 // enwik8.lzma processed with delrep to remove all rep codes
    25002018 // Encoding of IsRep flag disabled
      788.99 // rep support redundancy per MB of compressed data, bytes
             // (25021760-25002018)*1000000/25021760
      197.42 // rep support redundancy per MB of uncompressed data, bytes
             // (25021760-25002018)*1000000/100000000
    Note that the IsRep flag is encoded for every match.
    With that kind of overhead, we can certainly afford to have lots of options
    in the code, and a good parsing optimizer would reliably improve compression
    by making use of these options.
