
Thread: Data compression explained

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

    Data compression explained

    My free online book on data compression.
    http://mattmahoney.net/dc/dce.html

    I may add some more stuff later. For now it covers most of the basic algorithms like LZ77, BWT, PPM, CM, Huffman, arithmetic coding, JPEG, MPEG, audio, etc, plus some theory. Most of the books on data compression are out of date, so someone needed to write this. Comments welcome.

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Thumbs up

I did a quick look, but it's only in the last few years that people have used 14 files for the Calgary Corpus. In the early years people compressed all 18 files, and everyone was trying to get the smallest total length. Now they only talk about 14 files and don't care about the length it compressed to.
Instead they see how many bits per byte each of the selected files takes, and then take the average of those 14 values. So to me it's very confusing. I know on this forum it's the 14 files. However, when I compare using BWTS with Mark's I am going to stick to the 18.

But overall, good job!
David Scott

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    5.3.2.5. Results

There's something broken with the table.
In particular, there are 6 columns but only 5 headers.

5.7.2

If precomp can't recompress some stream perfectly, but has something close, it generates patches.

  4. #4
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
I've just had a quick look at your text and really appreciate it - I'll just drop a few quick thoughts, since I'm actually at work. Maybe you already have some of this.


    1 mixing via 2d SSE, which is used in some bitwise CM compressors

    1a with quantised probabilities P(y=1|Q(p0),Q(p1)) (CCM)
    http://groups.google.com/group/encod...d17dbd63707f41

    1b with quantised bit histories P(y=1|Q(s0),Q(s1)) (M1)


2 linear mixing in probability space, i.e. w0*p0 + w1*p1 + ...

2a two-input mixers (1-w)*p0 + w*p1; there exist several update rules, e.g. counter backpropagation, numeric iterative optimization, etc., all with(out) some approximations (see the sketch at the end of this list). You may want to ask around on IRC.
There actually exist a lot of implementations.

    2b N-ary inputs. I once used it in conjunction with a simple gradient update.


    3 Non-flat alphabet decomposition (order-0 Huffman is used in some CTW, order-1 in M1)


    4 CTW seems to be missing


5 When talking about parameter tuning you may want to mention the direct application of stochastic search algorithms like GAs, which is superior to "ad-hoc" tuning (BEE, Shelwien's coders, M1).


6 I've got some serious refinements to match models (e.g. suffix contexts), which I can publish after making further experiments/implementations.
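
To make 2a concrete: below is a minimal sketch of a two-input mixer with a simple gradient update. The names, learning rate and clipping are illustrative choices of mine, not taken from CCM, M1 or any other real coder.

Code:
#include <stdio.h>

/* Two-input linear mixer: p = (1-w)*p0 + w*p1, where p0 and p1 are two
   models' estimates of P(bit=1).  After the bit y is coded, w takes a
   gradient step on the squared error (y-p)^2; the coding-cost gradient
   differs only by a factor 1/(p*(1-p)). */
double w = 0.5;                 /* mixing weight */
const double lr = 0.02;         /* learning rate (illustrative) */

double mix2(double p0, double p1, int y) {
    double p = (1 - w) * p0 + w * p1;
    w += lr * (y - p) * (p1 - p0);            /* drift toward the better input */
    if (w < 0) w = 0; else if (w > 1) w = 1;  /* keep the blend convex */
    return p;
}

int main(void) {
    /* model 0 predicts 0.2, model 1 predicts 0.9; the bits are all 1s */
    for (int i = 0; i < 20; i++) mix2(0.2, 0.9, 1);
    printf("w = %.3f\n", w);    /* w has drifted toward model 1 */
    return 0;
}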

  5. #5
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
Preformatted text sections (enclosed in <pre> HTML tags) are not displayed correctly in my Firefox.
You may remove the "<font face=arial,sans-serif>" tag or close all "<p>" elements and it will display correctly.
    Last edited by Jan Ondrus; 26th February 2010 at 11:26.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks for the comments. I fixed the formatting in Firefox. I guess I should not be so lazy and leave out closing </p> tags. It had looked OK in IE and Chrome.
    I also added some updates based on everyone's comments and added an acknowledgement section. I still plan to add sections for more algorithms like CTW, byte pair encoding, numeric codes (unary, rice, mu-law, floating point, etc), JPEG compression, maybe something about noisy channels, archive formats, etc.

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    fantastic!

    It is absolutely fantastic!

    It of course comes at just the right time and at just the right level for my exploration of compression.

It's a lot easier to read if you've tried to pick up the principles first; if I had come across this book first, I think I'd have needed some diagrams of trees for PPM and such.

    And now I want SEE and SSE covered to the same depth as arithmetic encoding was! But that's just to suit my personal learning agenda.

I think that HTTP on-the-fly compression with DEFLATE is very common and often forgotten; presumably gazillions of DEFLATEd bytes go through the internet every day, probably outnumbering the transfer of archived bytes on the internet.

(Tiny correction: an "I frame" is an intra-frame, not an inter-frame, in video.)

    (H.264 intra-frames are typically half the size of the equivalent JPEG. They delta encode blocks within themselves, which is one of the things that helps achieve this feat.)

I'm not convinced byte pair encoding needs much mention.
    Last edited by willvarfar; 27th February 2010 at 00:42.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Here's a version for printing (compact): http://nishi.dreamhosters.com/u/dce2010-02-26.pdf
It can also be used for precomp benchmarks.

  9. #9
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
Just fabulous. Shame that I can only learn from it rather than help improve it.

  10. #10
    Member
    Join Date
    May 2008
    Location
    England
    Posts
    325
    Thanks
    18
    Thanked 6 Times in 5 Posts
I'll add my thanks for this; as someone who is fascinated by compression, I'll have to find the time to read it.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
1. Typos

    "DMC (dynamic Markov coding) was and described in a paper"
    "after disarding"

    2. Bytewise coding

It's described as one huge mistake ("A fourth problem is..."),
which is plain wrong.

a) Bitwise models can't precisely estimate the probabilities
of bytewise data (like ascii text).
- There are hardware (CPU) precision limits which affect the
results if we try to compute bitwise probabilities from
the precise bytewise distribution
(implying practical speed considerations).
- The same applies to updating the bitwise probabilities in a
bytewise fashion (it's slow and imprecise).
- Most bitwise modelling components are not "portable" -
they can't be directly applied to bytewise modelling
because of various technical problems.
- Thus most practical bitwise models produce noticeably
different results with different permutations of the byte
alphabet
(e.g. see the reorder results in http://compressionratings.com/bwt.html#transformation).
Unfortunately that's not really an opportunity for compression
improvement - more like partial compensation for modelling
imprecision.

Again, there's just no way to make a precise bitwise model
for a bytewise source within current CPU architecture limits.

    But, unfortunately, people only see that bitwise models are
    simpler and show better overall results (meaning, mostly for
    data which is not bytewise, like executables).

b) Arithmetic coding of non-binary alphabets is an important
technique, both for speed optimization and algorithm
simplicity.
For example, it allows encoding a random value from a given
range with a single operation (involving divisions, but
that's still faster than multiple rc calls with renorm
loops).
Well, it's reasonable that it's never used in the paq framework,
because it's completely incompatible (for example, paq's
carryless rangecoder can't be used for bytewise coding
because of underflow issues), but why imply that no
other approaches are possible?

c) Non-binary alphabet coding (including bytewise) is
basically a speed optimization method. It can be explained
in terms of using two types of rangecoder operations -
combining intervals without renormalization (computing the
alphabet intervals), and then coding them with
renormalization (actual rangecoding).
Of course, it's not a universal method, but if something
isn't compatible with the paq framework, that doesn't mean
it's not a useful compression technique.
For example, not long ago we saw a bytewise order-0
rangecoding demo by Cyan with ~100MB/s processing speed,
while bitwise coders can demonstrate ~30MB/s at most.

    3. ppmd/ppmonstr

    > binary context - in the highest order context only one value
    > has appeared (one or more times).

    "value"? I'd say byte/symbol.
    Also its called "binary" because it virtually contains two
    symbols - the one encountered in this context and escape.

    > nm-context - two or more values have appeared, and none of
    > them have appeared (are masked) in a higher order context.

Actually it's just a context with maximum order and >1
different symbols seen in it.
That is, a context won't be processed as "nm" just because
there are no masked symbols.

    > -r1 partially rebuilds the model after disarding it when
    > memory is used up, which improves compression.

It's not "partially rebuilt".
    Instead, ppmd keeps rescaling the counts in the tree
    (while removing the symbols for which freq becomes zero,
    and their subtrees)
    until the tree size becomes smaller than some threshold
    (25% afair).

    > ppmonstr uses an even more complex SEE context, and
    > additionally uses interpolation to smooth some of the
    > quantized contexts.

    I think ppmonstr only uses SSE.
    Also I don't remember any such interpolation there -
    although it updates some adjacent entries in SSE table(s),
    which might provide a similar effect.

    > It also adjusts the prediction for the most probable byte
    > using secondary symbol estimation (SSE).

Not only the "most probable", but all symbols afaik.
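
Since SSE keeps coming up here: for readers who haven't seen it, below is a minimal sketch of a one-dimensional SSE stage with interpolation between adjacent quantization bins. The bin count, context size and update rate are illustrative choices of mine, not ppmonstr's actual values.

Code:
#include <stdio.h>

/* SSE sketch: refine a model's probability p (fixed point, 0..4095)
   by a table lookup indexed by (context, quantized p), interpolating
   between the two nearest bins.  Updating both adjacent entries toward
   the coded bit y is what gives the smoothing effect mentioned above.
   In a real coder the update happens after coding with the returned p. */
#define NBINS 33
static int sse[256][NBINS];             /* sse[ctx][bin] = refined P(1) */

void sse_init(void) {                   /* start as the identity mapping */
    for (int c = 0; c < 256; c++)
        for (int b = 0; b < NBINS; b++)
            sse[c][b] = b * 4096 / (NBINS - 1);
}

int sse_p(int ctx, int p, int y) {
    int idx = p * (NBINS - 1) / 4096;   /* lower bin */
    int wt  = p * (NBINS - 1) % 4096;   /* distance to the upper bin */
    int out = (sse[ctx][idx] * (4096 - wt) + sse[ctx][idx + 1] * wt) >> 12;
    sse[ctx][idx]     += ((y << 12) - sse[ctx][idx])     >> 6;
    sse[ctx][idx + 1] += ((y << 12) - sse[ctx][idx + 1]) >> 6;
    return out;
}

int main(void) {
    sse_init();
    int p = sse_p(0, 3000, 1);   /* refine p = 3000/4096 in context 0 */
    printf("refined p = %d/4096\n", p);
    return 0;
}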

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    Matt's updates

    (newly added CTW section is removed from diff)

    Code:
    diff -rw DCE_2010-02-26/dce.html DCE_2010-02-27/dce.html
    17c17
    < </font></p><p><font face="sans-serif">Last update: Feb. 26, 2010.
    ---
    > </font></p><p><font face="sans-serif">Last update: Feb. 27, 2010.
    
    
    512,513c512,520
    < a score that combines size and speed, similar to the Maximum Compression
    < MFC test, but with minimum speed requirements. The benchmark includes a calculator
    ---
    > a score that combines size and speed. The data includes English text, Windows executable
    > code, RGB and grayscale images, CD quality audio, and a mix of data types from
    > two video games. None of the test data is compressed (JPEG, PDF, ZIP, etc).
    > Each of the 10 data sets contains multiple files. Single file
    > compressors are tested on an equivalent
    > <a href="http://en.wikipedia.org/wiki/Tar_%28file_format%29">tar</a> file.
    > Programs must pass a qualifying round
    > with minimum compression ratio and time requirements on a small data set.
    > The benchmark includes a calculator
    
    
    516,517c523,529
    < The top ranked programs for the default settings as of Jan. 2010 are nanozip followed by 
    < freearc, CCM, flashzip, and 7-zip. Runsas is the author of nanozip.
    ---
    > The default scale is "1/10 lower ratio or twice the total time is worth half
    > the rating". This is effectively the same formula used by maximumcompression.com.
    > Compressed sizes include the size of the (not compressed)
    > Windows executable decompression program.
    > As of Feb. 2010, 427 qualifying combinations of program versions and options were tested.
    > The top ranked programs for the default settings were nanozip followed by 
    > freearc, CCM, flashzip, and 7-zip. Runsas is also the author of nanozip.
    535a548,552
    > </font></p><p><font face="sans-serif"><a href="http://metacompressor.com/">Metacompressor</a> is an automated
    > benchmark that allows developers to submit programs and test files
    > and compare size (excluding the decompresser), compression time, and
    > decompression time.
    > 
    
    
    840c857,858
    < </font></p><p><font face="sans-serif">A fourth problem is that bytewise arithmetic coding is inefficient.
    ---
    > </font></p><p><font face="sans-serif">A fourth problem is that straightforward
    > bytewise arithmetic coding is inefficient.
    
    
    845c863
    < up to 256 values. This problem is solved by encoding one bit at a time
    ---
    > up to 256 values. One solution is to encode one bit at a time
    
    
    1043c1061
    < </font></p><h3><font face="sans-serif">4.2. Variable Order Models (DMC, PPM)</font></h3>
    ---
    > </font></p><h3><font face="sans-serif">4.2. Variable Order Models (DMC, PPM, CTW)</font></h3>
    
    
    1057c1075
    < and described in a paper by Gordon Cormack and Nigel Horspool in 1987
    ---
    > described in a paper by Gordon Cormack and Nigel Horspool in 1987
    
    
    1199c1217
    < <li><font face="sans-serif">binary context - in the highest order context only one value has appeared
    ---
    > <li><font face="sans-serif">binary context - in the highest order context only one byte value has appeared
    
    
    1201,1202c1219
    < </font></li><li><font face="sans-serif">nm-context - two or more values have appeared, and none of them have
    < appeared (are masked) in a higher order context.
    ---
    > </font></li><li><font face="sans-serif">nm-context - two or more values have appeared in the highest order context.
    
    
    1248,1249c1265,1275
    < from scratch. Optionally, the tree may be partially rebuilt before
    < modeling resumes.
    ---
    > from scratch. Optionally, the the statistics associated with each context are scaled
    > down and those with zero counts are pruned until the tree size becomes
    > smaller than some threshold (25%). This improves compression but takes longer.
    > 
    > </font></p><p><font face="sans-serif">Although bytewise arithmetic encoding can be inefficient, ppmd is in
    > practice faster than many equivalent bitwise context mixing models.
    > First, a byte is encoded as a sequence of escape symbols followed by
    > a non-escape. Each of these encodings is from a smaller alphabet.
    > Second, the alphabet within each context can be ordered so that the
    > most likely symbols are first. This reduces the number of operations
    > in the majority of cases.
    
    
    1254,1255c1280
    < -r1 partially rebuilds the model after disarding it when memory is
    < used up, which improves compression.
    ---
    > -r1 says to prune the context tree rather than discard it.
    
    
    
    1259c1284,1373
    < </font><h3><font face="sans-serif">4.3. Context Mixing</font></h3>
    ---
    > </font><h4><font face="sans-serif">4.2.3. CTW</font></h4>
    [...]
    > </font></p><h3><font face="sans-serif">4.3. Context Mixing</font></h3>
    
    
    1263,1267c1377,1383
    < algorithms predict one bit at a time (like DMC) except that
    < there are multiple models making independent predictions which
    < are then combined by weighed averaging. Often the result is
    < that the combined prediction is better than any of the individual
    < predictions that contribute to it.
    ---
    > algorithms predict one bit at a time (like CTW) except that
    > weights are associated with
    > models rather than contexts, and the contexts need not be mixed from
    > longest to shortest context order. Contexts can be arbitrary functions of the history,
    > not just suffixes of different lengths. Often the result is
    > that the combined prediction of independent models compresses better than
    > any of the individuals that contributed to it.
    
    
    1269c1385
    < </font></p><p><font face="sans-serif">PPM and DMC are based on the premise that the longest context
    ---
    > </font></p><p><font face="sans-serif">DMC, PPM, and CTW are based on the premise that the longest context
    
    
    
    3280,3282c3396,3397
    < </font></p><p><font face="sans-serif">MPEG-4/AVC (H.264) differs from MPEG-1/2 mainly in that it uses
    < a <a href="http://en.wikipedia.org/wiki/Discrete_wavelet_transform">wavelet</a> transform
    < instead of a discrete cosine transform (DCT), and supports arithmetic
    ---
    > </font></p><p><font face="sans-serif">MPEG-4/AVC (H.264) differs from MPEG-1/2 in that it
    > supports arithmetic
    3287a3403
    > It also uses a deblocking filter.
    Last edited by Shelwien; 28th February 2010 at 20:22.

  13. #13
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    153
    Thanks
    0
    Thanked 0 Times in 0 Posts
I noticed that some images were in BMP format, so I converted them to PNG.
    http://www.sendspace.com/file/xyxs97

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Yeah, I left them as BMP because PNG wasn't much smaller.

Anyway, I was going to post about the updates and the new section on CTW. Thanks for the suggestions.

    Bitwise and bytewise context models are equivalent because by the chain rule, P(x0..x7) = P(x0)P(x1|x0)P(x2|x0x1)...P(x7|x0...x6). You will get differences if your bit probabilities are not accurate. Alphabet reordering might also affect memory usage. I know about the trick in lpaq9 of swapping bytes 32 and 31, but such differences are small. For BWT it is another matter. It does context sorting based on alphabet order so the order matters. You get better text compression if you group letters, digits, whitespace, etc. because they make similar predictions.
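
To illustrate the chain-rule decomposition, here is a minimal order-0 bitwise model over a tree of bit prefixes - a sketch only, with an arbitrary table size and adaptation rate, not any specific coder's code:

Code:
#include <stdio.h>

/* One P(bit=1) per node of a binary tree over bit prefixes, in units
   of 1/4096.  ctx=1 is the root; after coding bit y, ctx becomes
   ctx*2+y, so the 8 predictions along a path multiply out to
   P(x0..x7) exactly as in the chain rule above. */
static int t[512];

void model_byte(int byte) {
    int ctx = 1;
    for (int i = 7; i >= 0; i--) {
        int y = (byte >> i) & 1;
        int p = t[ctx];                  /* P(y=1), fed to the bit coder */
        (void)p;                         /* coder omitted in this sketch */
        if (y) t[ctx] += (4096 - t[ctx]) >> 5;   /* adaptive shift update */
        else   t[ctx] -= t[ctx] >> 5;
        ctx = ctx * 2 + y;
    }
}

int main(void) {
    for (int i = 0; i < 512; i++) t[i] = 2048;   /* start at p = 1/2 */
    model_byte('A');
    printf("root P(1) = %d/4096\n", t[1]);
    return 0;
}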

    I have to look at Shkarin's paper again. It is based on ppmd/ppmonstr var. H. I don't know about later versions. I am not sure how you do SSE on more than a binary alphabet prediction.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Bitwise and bytewise context models are equivalent because
    > by the chain rule,
    > P(x0..x7) = P(x0)P(x1|x0)P(x2|x0x1)...P(x7|x0...x6).
    > You will get differences if your bit probabilities are not accurate.

Unfortunately it's not that simple.
Sure, with plain occurrence counts it doesn't matter,
but with linear counters (p'=p*(1-wr)+y*wr), and even more so
state machines, there's visible redundancy in bitwise coding:

    For example:
    434564 // book1 result of a tuned order0 bytewise model
    434873 // book1 result of a tuned order0 bitwise model

    (Note that both coders were tuned specifically to book1)

This difference may seem insignificant, but it gets wider
with more orders mixed in.
But usually the bitwise coder would still be chosen in this
case, because it shows much better results for binary files.
However, it's a fact that there are cases where a bytewise
model has better precision.
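
To see the difference concretely, here is a toy comparison of a plain occurrence count against such a linear counter (the rate and data are illustrative):

Code:
#include <stdio.h>

/* An occurrence-count estimate converges to the stationary bit
   frequency, while the linear counter p' = p*(1-wr) + y*wr weights
   recent bits exponentially.  With adaptive counters like this the
   8 bit-level estimates no longer multiply out to the distribution
   a bytewise model would maintain. */
int main(void) {
    int bits[10] = {1,1,0,1,0,0,1,1,1,0};
    int n1 = 0, n = 0;
    double p = 0.5;
    const double wr = 1.0 / 32;                  /* illustrative rate */
    for (int i = 0; i < 10; i++) {
        n1 += bits[i]; n++;
        p = p * (1 - wr) + bits[i] * wr;
    }
    printf("counts:  P(1) = %.3f\n", (n1 + 0.5) / (n + 1.0));
    printf("counter: P(1) = %.3f\n", p);
    return 0;
}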

    > Alphabet reordering might also affect memory usage.
    > I know about the trick in lpaq9 of swapping bytes 32 and 31,
    > but such differences are small.

    202928 // reorder forward.xlt book1 | paq8px68 -7
    202120 // reorder backward.xlt book1 | paq8px68 -7

Is that considered small?.. There would likely be at least
a 2k difference if we bothered to actually tune the permutation.

    > I have to look at Shkarin's paper again. It is based on
    > ppmd/ppmonstr var. H. I don't know about later versions.

Btw, winrar uses ppmd vH as well, not vI.
Also, that paper doesn't really explain how ppmonstr works in detail.
But I have a decompiled version of vI.

    > I am not sure how you do SSE on more than a binary
    > alphabet prediction.

The most straightforward way would be to do an n-ary mapping,
probability distribution to probability distribution.
But actual implementations are kinda bitwise in that sense
anyway (e.g. http://ctxmodel.net/files/ASH/ash04.rar) -
they just use unary coding based on adaptive symbol ranking.
Still, there's no problem with using bitwise probability
estimation and bytewise coding.

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by Shelwien View Post
    202928 // reorder forward.xlt book1 | paq8px68 -7
    202120 // reorder backward.xlt book1 | paq8px68 -7
    What do those transforms do?
    paq8px68 has a lot of language modeling that gets messed up by alphabet reordering.

    BTW I added a little about BWT suffix array sorting. I still need to do some more research on this topic though.

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
It's this: http://ctxmodel.net/files/BWT_reorder_v2.rar
Please test it yourself then, as all of your public compressors have
"a lot of language modeling", and demonstrating it with my coders
wouldn't be much of a proof.

  18. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    http://mattmahoney.net/dc/dce.html

Added section 3.4 (numeric codes: unary, Rice, Golomb, extra bit) and section 3.5 (archive formats, checksums, encryption).
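
For illustration, here is a sketch of a Rice coder (the bit sink is a stand-in of mine; see the book for the exact variants it covers): a value n with parameter k is split into a quotient n>>k, written in unary, and the low k bits, written in binary.

Code:
#include <stdio.h>

void put_bit(int b) { putchar(b ? '1' : '0'); }   /* stand-in bit sink */

/* Rice code: quotient n>>k in unary (q ones then a zero), then the
   k-bit remainder in binary.  Golomb codes generalize this to
   non-power-of-2 divisors. */
void rice_encode(unsigned n, int k) {
    for (unsigned q = n >> k; q > 0; q--) put_bit(1);
    put_bit(0);
    for (int i = k - 1; i >= 0; i--) put_bit((n >> i) & 1);
}

int main(void) {
    rice_encode(9, 2);   /* 9 = 2*4 + 1 -> prints "110" then "01" */
    putchar('\n');
    return 0;
}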

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
So I guess Shelwien is right that alphabet order does matter. I did an experiment where I transformed book1 by multiplying all bytes by 3 mod 256 (reversible by multiplying by 171, since 3*171 = 513 = 1 mod 256), then compressed with various algorithms that are supposedly not optimized for English text. This transform hurts everything except LZW, it seems.

    Code:
    x3         original
    -------  -------
    768,771 768,771 BOOK1
    566,269 566,253 flzp (byte aligned LZP)
    443,111 442,501 fpaq0p (adaptive order 0)
    439,562 432,845 bpe 5000 4096 200 3 (byte pair encoding)
    435,363 435,227 fpaq0 (stationary order 0)
    431,509 424,522 fpaq0f2 (fast adapting indirect order 0)
    335,033 335,033 compress (LZW)
    313,761 313,597 zip (LZ77)
    312,410 312,391 ha (PPMC order 5)
    278,293 276,364 sr2 (symbol ranking)
    264,226 264,144 cab -m lzx:21 (LZX)
    263,476 261,064 7zip (LZMA)
    251,791 251,499 dmc c 100000000  (DMC)
    233,666 232,598 bzip2 -9 (BWT+MTF)
    222,115 219,434 p6 (neural network CM)
    216,412 216,021 ppmd (PPM order 4)
    215,162 213,162 bbb (BWT+CM)
    211,321 209,514 ctw (context tree weighting)
    210,671 208,691 zpaq cmid.cfg  (ICM-ISSE chain, MATCH, MIX)
    206,162 203,247 ppmonstr (PPM order 12)
    ppmd and ppmonstr have some language modeling for English because one of the SEE contexts is whether the first 2 bits of the last byte are 00, which distinguishes letters from spaces and punctuation. None of the other algorithms use language modeling. zpaq uses more memory for x3 because context hash tables are indexed on nibble boundaries, but hash tables are only 5% full vs 4% so this should not have much effect.

Aside from ppmd and ppmonstr, the alphabet effect seems to be large on adaptive bitwise models like fpaq0p, fpaq0f2, sr2 (for its order-1 model), BWT (sorting is bitwise lexicographic), ctw (uses 8-bit counters), p6, and zpaq cmid.cfg (indirect models), and smaller on stationary bitwise models like fpaq0 and dmc. But there is still a small effect on bytewise models that is hard to explain.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > But there is still a small effect on bytewise models that is hard to explain.

Likely because of hashtables, or maybe they're not completely bytewise.
Also, MTF is initialized with a specific byte value ranking (bzip2),
and zip/cabarc could use some static huffman tables somewhere.
PPMs (ha) and DMC could be affected because of their initialization
of low orders.

    Some more results:

    Code:
    x3     original
    ------ -------
    434877 434877 // sh_v1m e book1 | http://compression.ru/sh/coders6c2.rar
    214846 214846 // green c book1 | http://ctxmodel.net/files/green/green_v3a.rar
    206808 206808 // ppmy book1 | http://compression.ru/sh/ppmy_3c.rar
    205417 205417 // ash04a /s0 book1 | http://compression.ru/sh/ash04a.rar
    204318 203294 // ash04a book1
Note that the difference shows up if we let Ash use SSE.

    Reorder utility for x3:

    Code:
#include <stdio.h>
#include <stdlib.h>
// Multiplies every byte of the input file by <coef>.  putc writes only
// the low byte, so the result is taken mod 256; the transform is
// reversible whenever <coef> is odd (e.g. 3 is undone by 171, since
// 3*171 = 513 = 1 mod 256).
int main( int argc, char** argv ) {
  // usage: reorder <coef> <input> <output>
  if( argc<4 ) return 1;
  FILE* f = fopen( argv[2], "rb" ); if( f==0 ) return 2;
  FILE* g = fopen( argv[3], "wb" ); if( g==0 ) return 3;
  int h = atoi(argv[1]);
  while( 1 ) {
    int c = getc(f);
    if( c==-1 ) break;
    putc( c*h, g );
  }
  fclose( f ); fclose( g );
  return 0;
}
    Last edited by Shelwien; 3rd March 2010 at 20:04.

  21. #21
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
Bitwise modelling with a flat decomposition effectively models guesses on character ranges; e.g. bit 7 distinguishes whether or not the next character is within the range 0..127 or 128..255, and so on.

A multi-symbol counter, like P(x=a) = S(a)/T, where S(a) is the occurrence count of the symbol a and T the total count of the current context, only modifies S(a) and T per update step, but not S(x) for x != a. In bitwise models, however, all symbol frequencies get changed, since P(x=a) is expressed as a product of bitwise decisions. Hence the probabilities of other symbols can change more drastically than just S(x)/(T+1) (since S(x) effectively gets modified, too).

The ASCII encoding groups characters with similar meaning (e.g. lower case letters, upper case letters, numbers, ...). Still, the ASCII mapping is just a more or less sensible permutation. Modifying the character codes modifies the decomposition, i.e. the mentioned grouping is likely to be distorted.

In addition, bitwise models contain some redundancy, since some branches in that character guessing are deterministic (e.g. sometimes bit 7 is always 0). That relation can be changed by modifying the permutation, too. A Huffman decomposition removes that redundancy (there are no "internal" nodes with just a single leaf).
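
For comparison, a sketch of the multi-symbol counter (the add-1 smoothing is my choice): updating symbol a touches only S(a) and T, so every other P(x) moves merely from S(x)/T toward S(x)/(T+1).

Code:
#include <stdio.h>

int S[256], T;                 /* bytewise order-0 occurrence counts */

double P(int x) { return (S[x] + 1.0) / (T + 256.0); }  /* add-1 smoothing */

void update(int a) { S[a]++; T++; }    /* only S(a) and T change */

int main(void) {
    update('a'); update('a'); update('b');
    printf("P(a)=%.4f  P(b)=%.4f  P(c)=%.4f\n", P('a'), P('b'), P('c'));
    return 0;
}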

  22. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Yep, that's how I created book1x3.

    Here are some more results using

    reorder forward.xlt book1 book1f
    reorder backward.xlt book1 book1b

forward.xlt helps on average, but not every time; backward.xlt not so much. Compression options are the same as above. The biggest gain is for DMC.

    Code:
     768,771 BOOK1
     261,064 book1.7z
     213,162 book1.bbb
     432,845 book1.bpe
     232,598 book1.bz2
     264,144 book1.cab
     209,514 book1.ctw
     251,499 book1.dmc
     566,253 book1.flzp
     435,227 book1.fpaq0
     424,522 book1.fpaq0f2
     442,501 book1.fpaq0p
     312,391 BOOK1.HA
     219,434 book1.p6
     216,021 BOOK1.pmd
     203,247 BOOK1.pmm
     276,364 book1.sr2
     335,033 BOOK1.Z
     312,502 book1.zip
     208,691 book1.zpaq
    6,585,783 bytes
    
     768,771 book1f
     260,835 book1f.7z
     211,629 book1f.bbb
     434,801 book1f.bpe
     231,416 book1f.bz2
     264,121 book1f.cab
     209,847 book1f.ctw
     237,658 book1f.dmc
     571,461 book1f.flzp
     435,306 book1f.fpaq0
     422,012 book1f.fpaq0f2
     443,452 book1f.fpaq0p
     312,397 BOOK1F.HA
     219,199 book1f.p6
     216,026 book1f.pmd
     203,222 book1f.pmm
     280,404 book1f.sr2
     335,033 BOOK1F.Z
     313,664 book1f.zip
     208,747 book1f.zpaq
    6,580,001 bytes
    
     768,771 book1b
     262,884 book1b.7z
     215,428 book1b.bbb
     435,203 book1b.bpe
     233,853 book1b.bz2
     264,159 book1b.cab
     211,483 book1b.ctw
     240,622 book1b.dmc
     570,290 book1b.flzp
     435,321 book1b.fpaq0
     430,147 book1b.fpaq0f2
     442,851 book1b.fpaq0p
     312,405 BOOK1B.HA
     220,616 book1b.p6
     216,143 book1b.pmd
     204,430 book1b.pmm
     280,310 book1b.sr2
     335,033 BOOK1B.Z
     313,624 book1b.zip
     210,294 book1b.zpaq
    6,603,867 bytes
    However it breaks the language model in paq8px_v67 -6.

    Code:
    191,702 book1.paq8px
    201,507 book1b.paq8px
    200,062 book1f.paq8px

  23. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    That "forward.xlt" permutation is tuned for BWTmix enwik6 compression,
    and "backward.xlt" is a inverse permutation.
    So for programs which gain something from forward.xlt there's a
    high probability of gaining a few more kbs by real tuning.
    Unfortunately, realtime optimization of such alphabet permutation
    doesn't seem possible for modern bitwise CMs.

  24. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    http://mattmahoney.net/dc/dce.html

    Added 6.1.6. JPEG recompression (Stuffit and PAQ).

    Minor rewrites of 6.1.5 (added inverse DCT and color transforms) and 1.3 (better explanation of algorithmic probability).

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
Thanks for Jan's jpeg model description - it's really hard to understand the code.
    Also, there's that: http://www.winzip.com/wz_jpg_comp.pdf

  26. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks. I will have to add that one too. It has some nice ideas that could be used in the PAQ model. I'm surprised they published such details.

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Thanks. I will have to add that one too.

    What about this one then?
    http://www.elektronik.htw-aalen.de/p.../packjpg_m.htm

    > I'm surprised they published such details.

    They advertise it (zipx) as an "open format",
    and I guess posting that spec was more convenient than
    posting the sources.
    http://www.winzip.com/comp_info.htm

  28. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
Matt, can you add a table of contents?

  29. #29
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    Quote Originally Posted by Matt Mahoney View Post
    However it breaks the language model in paq8px_v67 -6.

    Code:
    191,702 book1.paq8px
    201,507 book1b.paq8px
    200,062 book1f.paq8px
Here are results for paq8px_v67 with the wordmodel, distancemodel and nestmodel modified to handle the reordered alphabet.

    paq8px_v67 -6: 191,702 book1.paq8px
    paq8px_v67_reorder -6: 191,689 book1.paq8px
Attached Files

  30. #30
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
Yeah, I'll have to add all that too. Meanwhile, here is a description of the WinZip JPEG algorithm (section 6.1.6.3).

    http://mattmahoney.net/dc/dce.html

I had to figure out the coder from the (expired) patent. I can see some places where the JPEG algorithm could be improved. The coder was designed for speed in 1987, when table lookup was faster than multiplication. The coder also tightly integrates the context model, so they lose flexibility: you can't do context mixing or change the adaptation rate. Also, there is some loss due to coarse quantization of the probabilities and update rates. The coder was tuned on binary and grayscale images, which is probably not optimal for a lot of other material. I had done some experiments earlier on PAQ where I replaced the ICM-like context models now used with more stationary CM-like models and got better JPEG compression. The IBM coder seems to adapt too quickly.

There are a few other oddities, like the avg() function not being symmetric with respect to the two neighboring blocks, or the coding of mantissa bits not using the previous bits as context. However, there are a lot of interesting ideas that could be used in PAQ, like using low-frequency coefficients in neighboring blocks to predict the current block.
