
Thread: good bytewise-ish compressor for slightly wider chars? (For preprocessing.)

  1. #1
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    good bytewise-ish compressor for slightly wider chars? (For preprocessing.)

    I'm looking for a good compressor that can take input in an alphabet larger than 256 symbols, maybe 1024, so that I can do preprocessing experiments without the weirdness of escape codes.

    (I'm NOT trying to compress actual wide-character text data, and I know most strong compressors do pretty well on wide characters even if they treat them as strings of bytes.)

    Something that takes 16-bit character input would be fine, if there's not a lot of loss when I only actually use 1024 of them. (E.g., a pretty good LZ variant with Huffman-coded literal bytes and offsets would probably be fine.)

    Does anybody else do this sort of thing, or do everybody's preprocessors detect unused byte values and use those for escapes, or what?

  2. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    FSE : https://github.com/Cyan4973/FiniteStateEntropy

  3. #3
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Cyan,

    That's just the entropy coder, not a normal compressor, right? (Not to demean it. It's awesome.)

    I was looking for something that does the usual string-matching & dictionary stuff already, preferably fairly well. E.g., if it's an LZ, maybe one with a not-entirely-greedy parsing mode, a little LZP-style prediction, etc. It doesn't have to be an LZ, though; for experiments I could probably use a mode of a PPM, or a basic CM mode of ZPAQ or something, and it would be nice to have that option. I'm just looking for something pretty good that handles wide chars. An excellent compressor would be a bonus, but any good one that gives me a big enough alphabet will probably do fine.

    If I were to write my own, I'd probably never make it as good as many things people here have already written; it would just be a wider version of another slow, not-particularly-strong LZ+entropy coder, a throwaway for experiments, not much use to anybody else. My eventual compressor(s) will probably not use a preprocessor at all, or only a weird, limited, but blazingly fast one in certain situations. I mainly want to preprocess for basic modeling experiments, to tell me how to write the actual compressors. (Weird things like the context trick LZMA uses to implicitly detect and exploit some strides.)

    (If I do come up with some particularly useful preprocessing tricks, though, I would probably convert those to forms useful to others with normal 8-bit compressors, either by biting the bullet and hacking my code to use escapes and rotate which value serves as the escape, or by hacking my new features into somebody else's preprocessor that already handles that stuff.)
    Last edited by Paul W.; 7th May 2014 at 18:12.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    You could encode using 2 bytes with the first in range 0-127 and the second in range 128-255. Then just about any compressor should work on the back end.
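
    A minimal sketch of that transform in C, assuming symbols below 16384 (the function names and stream I/O are illustrative, not from any particular codebase):

        #include <stdint.h>
        #include <stdio.h>

        /* One symbol becomes two bytes: the first in 0-127, the second in
           128-255, so even- and odd-offset bytes can never be confused. */
        static void put_sym(uint16_t sym, FILE *out)   /* sym < 16384 */
        {
            fputc((sym >> 7) & 0x7F, out);     /* high 7 bits -> 0-127   */
            fputc(0x80 | (sym & 0x7F), out);   /* low 7 bits  -> 128-255 */
        }

        /* Returns 0 at end of input. */
        static int get_sym(FILE *in, uint16_t *sym)
        {
            int hi = fgetc(in), lo = fgetc(in);
            if (hi == EOF || lo == EOF) return 0;
            *sym = (uint16_t)((hi << 7) | (lo & 0x7F));
            return 1;
        }

    Any byte-oriented compressor can then run on the transformed stream, with symbol order k showing up as byte order 2k on the back end.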

  5. The Following 2 Users Say Thank You to Matt Mahoney For This Useful Post:

    Bulat Ziganshin (10th May 2014), Paul W. (7th May 2014)

  6. #5
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Matt Mahoney View Post
    You could encode using 2 bytes with the first in range 0-127 and the second in range 128-255. Then just about any compressor should work on the back end.
    That's a great idea. The nonoverlapping ranges of values in the odd- and even-numbered bytes would ensure that I don't get spurious matches at odd context lengths, which wouldn't correspond to anything in my eventual compressor.

    I just need a compressor that can handle long contexts, like 6 and/or 8 bytes, which would correspond to 3 and/or 4 symbols in my compressor. Cool.

  7. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Quote Originally Posted by Paul W. View Post
    That's just the entropy coder, not a normal compressor, right? (Not to demean it. It's awesome.)
    I was looking for something that does the usual string-matching & dictionary stuff already,
    Yes, correct; I misunderstood your need.

    Matt's suggestion seems pretty cool.

  8. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    A suffix array is a good start toward LZ. I've come across suffix sorts that aren't exclusive to bytes; I think Yuta wrote one. You should also be able to suffix sort on bytes and just eliminate the odd indices. If you're looking for an MTF that's efficient on large alphabets, I wrote one a while back; it's on this forum.
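
    A rough sketch of the eliminate-the-odd-indices idea in C, assuming the two-bytes-per-symbol stream from Matt's suggestion (the naive qsort comparator is only for clarity; a real implementation would use a proper suffix sorter such as Yuta Mori's divsufsort):

        #include <stdlib.h>
        #include <string.h>

        static const unsigned char *g_buf;   /* two-bytes-per-symbol stream */
        static size_t g_len;

        static int cmp_suffix(const void *a, const void *b)
        {
            size_t i = *(const size_t *)a, j = *(const size_t *)b;
            size_t n = g_len - (i > j ? i : j);   /* longest safe compare */
            int c = memcmp(g_buf + i, g_buf + j, n);
            if (c) return c;
            return i > j ? -1 : 1;                /* shorter suffix sorts first */
        }

        /* Build a suffix array over symbol boundaries (even offsets) only.
           sa must have room for (len + 1) / 2 entries. */
        static size_t build_even_sa(const unsigned char *buf, size_t len,
                                    size_t *sa)
        {
            size_t n = 0;
            g_buf = buf; g_len = len;
            for (size_t i = 0; i < len; i += 2)
                sa[n++] = i;
            qsort(sa, n, sizeof *sa, cmp_suffix);
            return n;
        }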

  9. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Cyan View Post
    Yes correct, I misunderstood your need.

    Matt's suggestion seems pretty cool.
    UTF-8 does something roughly similar. Its codes vary from 1 to 4 bytes, but it would probably work for this application. The fact that the codes vary in length would help squash spurious matches.
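
    For a 1024-symbol alphabet that could look like the sketch below, which follows the standard UTF-8 bit layout for values under 2048 (the function name is illustrative):

        /* Symbols 0-127 take one byte; 128-1023 take two (110xxxxx 10xxxxxx).
           Lead, continuation, and single bytes occupy disjoint ranges, which
           helps reject misaligned matches. */
        static int put_utf8(unsigned sym, unsigned char *out)   /* sym < 2048 */
        {
            if (sym < 0x80) {
                out[0] = (unsigned char)sym;
                return 1;
            }
            out[0] = (unsigned char)(0xC0 | (sym >> 6));    /* 110xxxxx */
            out[1] = (unsigned char)(0x80 | (sym & 0x3F));  /* 10xxxxxx */
            return 2;
        }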

    I tried doing the BWT bit-wise once, but the result, while decent, didn't beat the simpler BWT + wavelet tree. It's a little bit hard to visualize how this can happen, but apparently the BWT suffers when you give it data that causes spurious matches. Somewhat paradoxically, random data contains more potential matches than low-entropy data; but the matches are all worthless. I've wondered if pre-Huffman encoding the data might help with the bit-wise BWT, by smoothing out the entropy at the sub-byte level.

  10. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    long context compressors (6-8 bytes)

    Now it occurs to me to wonder whether it's reasonable to expect any general, strongish compressor to actually handle 6- or 8-byte contexts. Is it?

    Many compressors in practice seem to rely on preprocessors to reduce the effective context size (e.g., xwrt turning order-8+ chars into order-3 words), and many implement the most useful data structures only for specific orders (e.g., dense array tables for 8- or 16-bit indexes, specific hash tables for 3- or 4-byte indexes).

    Which off-the-shelf compressors actually support 6- and/or 8-byte contexts? And is that likely to work, or to flake out because those code paths haven't been stressed and debugged like the usual low-order Markov stuff?

    I assume that some kind of well-debugged low-order ROLZ would work, just slowing way the hell down with too-short contexts and correspondingly long chains; that would be sufficient for experiments, but I'm not sure of that, either.

    Any advice/recommendations?

  11. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    ppmd can use contexts up to 16 bytes. ppmonstr up to 128. On most files it doesn't help much to use high order contexts.
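
    If memory serves, an order-16 ppmd run looks something like the line below (the exact switches are an assumption; check ppmd's own usage text):

        ppmd e -o16 -m256 file    (e = encode, -oN = model order, -mN = memory in MB)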

  12. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Paul W. (17th May 2014)

  13. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Matt, can I assume from that answer that ZPAQ can't handle 6-8 byte contexts? (Or you'd have said it could, I'd naively guess.) I'm having trouble interpreting the ZPAQL spec in those terms. (I was kinda hoping to be able to experiment with ZPAQ models combined w/preprocessors.)

    BTW, is there a "ZPAQ Models for Dummies" guide anywhere?

  14. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Here is a simple order 6 model.
    zpaq a archive file -method s6.0c0.0.255.255.255.255.255.255

    s = streaming mode (or use x if you want files packed into blocks and deduplicated).
    6 = 2^6 MB block size (for larger files, else each file is a separate block).
    .0 = no preprocessing.
    c0 = ICM (or use c1..c256 for CM with fast..slow update)
    .0 = no special models
    .255 = context bit mask (repeat n times for order n).

    You should get better but slower compression with an ICM-ISSE chain that mixes orders 0 through 6 like this:
    -method s6.0ci1.1.1.1.1.1

    c = ICM (with order 0 context, same as c0.0.0...)
    i1 = ISSE with order 1 context.
    .1 = another ISSE increasing context order by 1 (repeat n times).

    And you can also add a mixer on the end like s6.0ci1.1.1.1.1.1m for a little better compression, but an ISSE already mixes the previous prediction with the current context. Note that method 4 uses something like x6.0ci1.1.1.1.2am where the a is a match model and the order 5 context is skipped. In this case the mixer is required to get all the contexts.

    What you'll find for the first model (one component) is that compression improves up to some order like 3 or 4 and then gets worse for higher order. For the ICM-ISSE chain you will see compression get better for higher order, probably a little better than ppmd at the same order, but also get slower. For high order you will see diminishing returns and probably none after about 8 depending on the data.
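
    For example, extrapolating the repeat-n-times syntax above: under the two-bytes-per-symbol encoding, symbol order 4 means byte order 8, so an ICM-ISSE chain covering orders 0 through 8 would be (the file name is hypothetical):

        zpaq a archive wide.enc -method s6.0ci1.1.1.1.1.1.1.1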

    There isn't really a simple guide. The spec is in the zpaq.cpp header, but understanding how the components work requires some knowledge of ZPAQL, which is in the ZPAQ specification. I added these specialized methods to implement the most common component arrangements and contexts without having to write ZPAQL code in config files. The most common components are:

    CM (c1..c256) maps a context to a prediction. It adapts fast at first, inversely proportional to a count limited to 4 x 1..256.
    ICM (c0) maps a context to a bit history (8 bit state) and then to a prediction with a slow adaptation rate.
    MIX (m) weighted average of all prior component predictions in logistic domain. Weights adapt to select the best predictors. Weight vectors selected by order 0 context bits (or you can use higher order like m16 for order 1).
    ISSE (i) mixes previous component with a constant 1 in the logistic domain using a pair of weights selected by a bit history. i1, i2, etc. compute a context hash from the previous hash combined with the next 1, 2... context bytes.
    MATCH (a) searches for the previous context hash and predicts whatever bit was next. It usually is a high order context.

  15. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Paul W. (17th May 2014)

  16. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    BTW, is there an ZPAQ Models for Dummies guide anywhere?
    Yes. ZPAQ for dummies is called WinZip.

  17. #14
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Thanks very much, Matt.
