
Thread: Fractal text compression, imperfect match LZ and distortion-resistant hash function?

  1. #1
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts

    Lightbulb Fractal text compression, imperfect match LZ and distortion-resistant hash function?

    There is a nice old concept for fractal compression of pictures: trying to find self-similarities and utilize them, for example encoding the mappings and differences:
    https://en.wikipedia.org/wiki/Fractal_compression

    It didn't seem to be successful; however, maybe an analogous approach could be reasonable for text or binary file compression?
    Specifically, while standard LZ schemes require a perfect match for a sequence, there should be many more sequences with a nearly perfect match – it might be beneficial to extend the LZ scheme to also allow imperfect matches and then encode the differences: for example, a byte sequence to XOR with, dominated by zeros – such a difference vector could be stored cheaply using an entropy coder.
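    To make the idea concrete, here is a minimal, purely illustrative sketch (not a real compressor; all names are made up): the encoder finds a near match, emits (offset, length) plus the XOR difference, and the decoder copies from the window and XORs the difference back.

    def encode_imperfect(window, block, offset):
        # Imperfect-match LZ token: reuse an old position even if it doesn't
        # match exactly, and store the XOR difference (mostly zero bytes).
        ref = window[offset:offset + len(block)]
        diff = bytes(a ^ b for a, b in zip(block, ref))
        return (offset, len(block), diff)    # diff would go to an entropy coder

    def decode_imperfect(window, token):
        offset, length, diff = token
        ref = window[offset:offset + length]
        return bytes(a ^ b for a, b in zip(ref, diff))

    window = b"the house on the hill"
    block  = b"the House"
    token  = encode_imperfect(window, block, 0)
    assert decode_imperfect(window, token) == block
    print(token[2])    # mostly zeros; only the 'h'/'H' byte differs (0x20)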

    The problem is that finding similar sequences (by the encoder) seems very costly.
    This is a very common task in DNA alignment, and a few days ago I realized that such a search can be done relatively cheaply: http://arxiv.org/pdf/1602.05889
    Specifically, rate distortion theory allows one to construct Distortion-Resistant Hashing (DRH): hash values which do not change under small distortions of the object (e.g. a sequence).
    So searching for similar sequences reduces to searching for objects with the same DRH value, and such collisions are relatively cheap to find.

    Does an LZ scheme which (additionally!) allows partial matches and then codes the differences seem reasonable? Have you come across such approaches?
    What other applications of distortion-resistant hashing can you think of?

  2. The Following 2 Users Say Thank You to Jarek For This Useful Post:

    Cyan (19th February 2016),Turtle (20th February 2016)

  3. #2
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I think that, given many LZ implementations already use a special encoding for reusing the previous match distance (which essentially allows cheaply coding a match with a mismatching sequence inside), the encoding of the differences is unlikely to come out smaller.

  4. The Following User Says Thank You to jibz For This Useful Post:

    Cyan (19th February 2016)

  5. #3
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Imagine a given sequence appears many times but with small differences, like capitalization:
    "House"
    "house"
    The XOR vector of nearly equal sequences is mostly made of zeros, e.g. Pr(0) > 0.9, hence using an entropy coder to store such a difference vector costs many times less than the standard 8 bits/byte.
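    A tiny worked version of that estimate (a sketch: it treats the XOR bits as an i.i.d. source, which a real coder wouldn't have to assume):

    from math import log2

    def binary_entropy(p):
        # Shannon entropy of a biased bit, in bits per bit.
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    a, b = b"House", b"house"
    diff = bytes(x ^ y for x, y in zip(a, b))       # single 0x20 byte, rest zeros
    ones = sum(bin(d).count("1") for d in diff)     # 1 set bit out of 40
    p1 = ones / (8 * len(diff))
    print(diff, ones, 8 * binary_entropy(p1))       # about 1.3 bits/byte instead of 8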

    More complex differences could also be considered, like insertions/deletions, replacing words with synonyms ...
    The question is whether nearly similar sequences occur often in some types of files, which would make such imperfect-match LZ reasonable.

  6. #4
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    I am not saying it will not work; sometimes you have to code up these ideas to see, as intuition is not always enough.

    But it seems to me that coding the number of zeros between each difference would take up as much space as coding distances for continued matches. And how close the XOR values will be to zero is not obvious without domain-specific knowledge.

  7. #5
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    As I understand it, LZMA already does some of this, for literals after matches, XORing the literal byte seen after the last match with the literal byte after the current match, and using that as the key for order 0 modeling and encoding.

    I've never seen this explained well, but it seems to me that the idea is that LZMA is essentially predicting that the current mismatch byte will be similar to the previous mismatched byte. For example, if both bytes are lower-case ASCII letters, the upper nybble will be a 6 or a 7, and if they're both upper-case ASCII letters, the upper nybble will be a 4 or a 5. You'll also predict that digits follow digits---they are all in the range of hex 30 to 39 and differ in at most 4 bits, and fewer on average.

    So if you have a match to a syllable, but a mismatch to the next syllable, you'll predict that the next character is a lower-case letter. And if you have a match to a typical sentence or paragraph break (a period followed by an LF or CR-LF pair, or any common word-ending sequence followed by a period and a space or two), you'll tend to predict a capital letter, like you probably saw last time.

    This works for some other character encodings, like the two most common non-UTF8 encodings of Cyrillic. In one, the upper-case and lower-case letters are in separate contiguous ranges, like ASCII a-z and A-Z. In the other, the upper and lower case letters are in one contiguous range, interleaved, so the lower-case and corresponding capital letters differ in only the low bit. (More importantly, all the capitals are the same in the high few bits and the lowest bit; likewise for all the lower-case letters, and you'll often get another bit or two that match, even if the letter codes are fairly randomly ordered with respect to how they tend to follow each other.) In alphabetic languages that don't use capitals at all, or don't use them much, you'll tend to just predict that letters follow letters, and digits follow digits, and do a little better.

    And digits following digits are usually fine for most languages and encodings, because most modern scripts use Arabic numbers, and their encodings use the ASCII encoding of digits.

    More generally, XORing mismatches predicts that the next byte will be numerically similar to the byte you saw last time, differing in only one or a few bits. (For example, an ascending sequence like 3, 4, 5... will differ by 1 bit half the time, when there's no carry, by two bits a quarter of the time, by three bits an eighth of the time...)

    You're essentially computing the Hamming distance between two successive values. This is reasonably well correlated with several different kinds of numerical similarity---similar ranges, similar evenness, and smallness of numerical difference.

    So:

    1. small numbers predict small numbers, and match in the upper bits. More generally, numerically similar numbers match in the high bits, and often in some low bits.
    2. numbers divisible by powers of two (like the low bits of many pointers) predict numbers divisible by that power of two, and match in that many low bits. (Pointers to structs of the same type, or to heap-allocated objects, usually match in the lower three, four, or five bits, depending on their size and maybe the allocator's block or alignment size.) This will work for normal C-style raw pointers AND for tagged pointers for dynamically typed languages, if the pointers have the same low-bits tag, as they usually do. Pointers are just funny integers, and tend to have some usefully similar numerical properties in both the high bits and the low bits.
    3. numbers that ascend or descend by some power-of-two increment will differ in different bits, but with a similar distribution of how many, because you'll usually get some matches in the lowest bits.
    4. it'll help with numbers that ascend or descend by a small odd increment, too, less obviously and (I think) not quite as much, because they'll differ in more bits half the time, but in fewer bits other times. (The small sketch after this list makes a few of these cases concrete.)
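    To put numbers on the cases above, a quick sketch (illustrative byte pairs only) printing the Hamming distance popcount(x ^ y) for letter substitutions, case changes, consecutive digits, ascending values and aligned pointer-like values:

    def hamming(x, y):
        # number of differing bits between two byte values
        return bin((x ^ y) & 0xFF).count("1")

    pairs = [
        (ord('e'), ord('a')),   # two lower-case letters: 0x65 ^ 0x61 = 0x04, 1 bit
        (ord('A'), ord('a')),   # same letter, different case: 0x20, 1 bit
        (ord('7'), ord('8')),   # consecutive digits across a "carry": 4 bits, worst case
        (0x34, 0x35),           # ascending by 1, no carry: 1 bit
        (0xA0, 0xA8),           # two 8-aligned pointer-like values: low 3 bits stay zero
    ]
    for x, y in pairs:
        print(f"{x:#04x} ^ {y:#04x} -> {hamming(x, y)} bit(s) differ")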

    Or maybe I have this wrong, and don't understand what LZMA is really doing with XORing literal bytes. This is what it seems to me it *should* be doing. (I'd thought of something similar to this before I looked at LZMA, so I just assumed it's what LZMA is doing with XOR.)
    Last edited by Paul W.; 20th February 2016 at 07:08.

  8. The Following 2 Users Say Thank You to Paul W. For This Useful Post:

    Jarek (20th February 2016),jibz (19th February 2016)

  9. #6
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Not much related, but to make it less simple:
    https://medium.com/@rkhokhla/fractal...f-d69578929002

  10. #7
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Jarek -

    fuzzy match in LZ is in fact a very interesting and useful area to explore. All modern LZ's (LZMA-class ; eg. LZHAM, Oodle LZNA, BitKnit, etc.) are actually doing fuzzy match, whether they are aware of it or not.

    I haven't spent much time with your DRH paper yet, but so far I'm not understanding it. If anyone reads it and understands it, please post a tutorial!

    I have used approximate-nearest-neighbors searches in the past. There's a large literature on ANN but it's mostly non-incremental (eg. not useful for an advancing LZ window (ignoring sliding for the moment)).

    This is of course often used in image & video compression, but the problem is a bit different there.

    See LZ Sub & discussion of what literals in modern LZ are really doing :

    http://cbloomrants.blogspot.com/2015...15-lz-sub.html

    http://cbloomrants.blogspot.com/2015...ter-match.html

    http://cbloomrants.blogspot.com/2015...on-images.html

    http://cbloomrants.blogspot.com/2014...lzma-like.html
    Last edited by cbloom; 3rd August 2016 at 20:00.

  11. The Following 3 Users Say Thank You to cbloom For This Useful Post:

    Jarek (20th February 2016),Paul W. (20th February 2016),Turtle (20th February 2016)

  12. #8
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    @ Paul W -

    LZMA's coding is not actually XOR.

    It does this funny thing where it uses the rep0lit as context for coding while the coded bits match, and then switches to using the current character's bits as context.

    See discussion & attempted de-obfuscation here :

    http://cbloomrants.blogspot.com/2014...zma-notes.html

    http://cbloomrants.blogspot.com/2014...lzma-like.html

    The rep0lit is both an *exclude* (current lit != rep0lit) and also a strong predictor (current lit probably "similar" to rep0lit). You need to model both effects.

    The LZMA way works very slightly better than just doing XOR.

    The reason is that when the rep0lit turns out to be quite a bad predictor, the XOR is really just randomizing your bits. With the LZMA way, once you get a bit mismatch, you turn into more of a normal order-0 coder and can still find order-0 redundancy.
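    Very roughly, and much simplified (a sketch of the idea as described above, with made-up names - not LZMA's actual probability model or bit layout): the literal is coded bit by bit from the top; while the bits coded so far agree with the rep0lit, the corresponding rep0lit bit is folded into the context, and at the first mismatch the coder falls back to a plain order-0 bit-tree context.

    def matched_literal_contexts(lit, rep0lit):
        # Sketch of "code with rep0lit context while the bits match".
        # Returns the (context, bit) pairs an adaptive bit coder would see.
        out = []
        prefix = 1                     # bit-tree prefix of the bits coded so far
        still_matching = True
        for i in range(7, -1, -1):
            bit = (lit >> i) & 1
            match_bit = (rep0lit >> i) & 1
            if still_matching:
                ctx = ("matched", match_bit, prefix)   # rep0lit bit in the context
                if bit != match_bit:
                    still_matching = False             # from here on: plain order-0
            else:
                ctx = ("plain", prefix)
            out.append((ctx, bit))
            prefix = (prefix << 1) | bit
        return out

    for ctx, bit in matched_literal_contexts(ord('b'), ord('a')):
        print(ctx, bit)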

    You are correct that ASCII and UTF8 are specifically designed so that "similar" characters often share top bits (though they are not ideal for this) (* see next).

    On binary files, the LZMA way is not ideal. Often "similar" characters will differ by 0x80 or by -1 (0xFF), which changes the top bit and makes the LZMA way fail.

    On things like WAV and BMP the similar character metric should obviously just be subtract - hence LZ-Sub.
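    For the signal-like case, a minimal sketch of the subtract idea (illustrative only, made-up data): predict each literal with the byte one "row" back and code the difference mod 256, which clusters near zero for smooth 8-bit samples.

    def sub_literals(data, offset):
        # Sketch of LZ-Sub-style literals: code (byte - byte_at_rep0_offset) mod 256.
        # For smooth signals (WAV samples, BMP rows) the residuals cluster near 0.
        return bytes((data[i] - data[i - offset]) & 0xFF for i in range(offset, len(data)))

    row      = bytes([10, 12, 13, 15, 16, 18, 19, 21])   # a smooth "scanline"
    prev_row = bytes([ 9, 11, 13, 14, 16, 17, 19, 20])
    data     = prev_row + row
    print(list(sub_literals(data, 8)))   # [1, 1, 0, 1, 0, 1, 0, 1] - tiny residuals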
    Last edited by cbloom; 3rd August 2016 at 20:00.

  13. The Following 2 Users Say Thank You to cbloom For This Useful Post:

    Paul W. (20th February 2016),Turtle (20th February 2016)

  14. #9
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    * = for LZMA on text (or any XOR-like coder) you should be able to get a win by doing a more explicit rearrangement of the alphabet such that likely character replacements have the maximum number of shared high bits. (eg. put vowels next to each other, etc.)

    Recent post on this in rambles :

    01-24-16 | LZMA for Text

    LZMA is extremely binary-oriented, and is excellent on binary. It's surprisingly not bad on text, but so much of the really special stuff in it (literal-after-match coding, pos bits, rep match "markov" models, etc.) is really for binary. On text, you should do things differently. A sketch of ideas for easy modifications to LZMA to make an LZMA-Text :
    1. Replace the literal model. The LZMA literal model with the bit-by-bit encoding and semi-xor exclusion is super nice on binary. It's not right for text.
    You should use something like a simple order-1 N-ary literal model. (possibly mixing the o1 (previous) literal and the "lolit" (symbol at rep0 offset))
    If you're doing order-1 literal coding you may want to disable the length-2 matches, since there's an overlap there. (not sure, all the details of these ideas require testing and tweaking, of course)
    An even more extreme change would be to use order-2 literal coding and disable length-3 matches (so MML=4). This is what I used to do in LZCB back in 199x.
    2. Give LZMA a precondition dictionary of text, like Brotli has.
    Matches in the precondition dictionary could be sent using the normal offset scheme, though it might be even better to use LZSA.
    Precondition could also be done to the statistical models.
    This is particularly important on small text, where Brotli really shines.
    3. Use a simple text preprocessor and/or word-replacing transform. There's a whole host of literature on this. There are lots of little tricks, like changing capital letters to a "caps flag" + lowercase letter, factoring out punctuation to another stream, replacing the standard end of sentence sequence ". caps" with a single token, etc.
    eg. 7zip + WRT46 (Skibinski) : enwik8 -> 23,671,028
    Another thing that pops in my head occasionally is :
    4. If you did want to continue to use a bit-by-bit literal encoding (say for example you were doing context mixed literals, mixing the predictions from o1 and lolit), then you should shuffle the alphabet so that characters are grouped into binary chunks.
    What I mean is the high bits of the character should indicate the group or class of character (eg. vowel vs consonant is an obvious one), and flipping the low bits should exchange a symbol for the most similar possible symbol.
    (ascii is arranged to do this in the top 3 bits, which tell you the gross type of character (whitespace, alphabetic, lower vs caps), but the bottom 5 bits are alphabetic and not useful as groupings)
    That is, for any character x, (x^1) should be the character most likely to substitute for it (like 'e' and 'a' perhaps, or 'n' and 't'), and (x^2) should be the second most likely, etc.
    5. Another possible way to send lolit-excluded symbols is to use a "substitution symbol rank distance".
    That is, for each character x, precompute a table of substitutions, (so a [256][256] table). The table substitutions[x] is sorted by likelihood to be a valid substitution for x in any text.
    (something like : for a preceding context C, you observe a Cx in the file, then increment counts[x][y] by the number of occurrences of Cy, and then make substitutions[x] by sorting counts[x])
    You use this to convert the lo_lit - actual_lit pair to a "distance" and then send that.
    eg. on binary (like a BMP or a WAV) where the characters are signal magnitudes, you can just use Euclidean distance |x-y| to get the order of relation between two values. eg. if lolit was x, the most likely values (which are not x) are x+1,x-1,x+2,x-2,etc. On text, the similarity relationship is not so simple (and I think maybe not symmetric ; counts[x][y] != counts[y][x]), so the goal is to convert it into a linear distance like that.
    ADD: an even more extreme version would be to use an order-1 conditioned substitution distance. So the encoder does :

    I need to send current literal "lit"
    I know that lit != lolit (the literal at last offset)
    because if it did I would have sent the lolit flag
    o1 = current order-1 character (the preceding character)

    look up the mapping :

    send = table[o1][lolit][lit]

    and send that.

    Decoder uses the inverse mapping.

    table[o1][lolit] has the {255} characters that could occur in that spot, ranked by likelihood.

    This is a 24M table, and it's really just a funny symbol-ranker.
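    A minimal sketch of the substitution-ranking idea in point 5 (illustrative only; made-up helper names, characters instead of raw bytes, and without the order-1 conditioning of the ADD): gather counts[x][y] from training text, sort each row by likelihood, and send the rank of the actual literal in substitutions[lolit] instead of the literal itself.

    from collections import defaultdict

    def build_substitutions(text):
        # counts[x][y]: for each context C, every occurrence of Cx is credited
        # with the number of occurrences of Cy (as described above).
        counts = defaultdict(lambda: defaultdict(int))
        follows = defaultdict(list)              # context char -> chars seen after it
        for c, x in zip(text, text[1:]):
            follows[c].append(x)
        for nxt in follows.values():
            for x in nxt:
                for y in nxt:
                    counts[x][y] += 1
        # substitutions[x] = characters sorted by how plausible they are as a stand-in for x
        return {x: sorted(counts[x], key=counts[x].get, reverse=True) for x in counts}

    subs = build_substitutions("the cat sat on the mat and the rat ran")
    lolit, lit = 'c', 'r'                 # excluded rep0 literal vs actual literal
    rank = subs[lolit].index(lit)         # small rank = cheap to code
    print(subs[lolit][:5], "rank of", lit, "is", rank)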
    Last edited by cbloom; 3rd August 2016 at 20:00.

  15. #10
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Sportman, here is another approach for using fractals in image compression:
    http://encode.ru/threads/2045-2D-fra...implemented%29

    Charles, thanks for the materials - I will have to take a closer look at modern LZ-based compressors.

    Regarding explaining DRH: basically everything is in the diagram below, which is analogous to the one in the QR-like picture paper, based on this implementation: https://github.com/JarekDuda/PRcodes
    Here are both diagrams: DRH on the left, and picture-resembling QR codes on the right (which is probably explained better):

    [diagram omitted in this archive: DRH scheme on the left, picture-resembling QR codes on the right]

    Let's look at a simple example - a rate 1/2 DRH like in the tree (bottom left):
    We construct the hash values from 1-bit blocks.
    We need to enlarge each of them to a 2-bit block (for R = 1/2), which can be directly translated into one of 4 nucleotides.
    This enlarging is the key - if we just used a bijection, we would generate all sequences, so the closest one would be the sequence itself: such a hash value would not allow for any distortion.
    The smaller the R, the larger the distortion we allow. For a simple binary channel and Hamming distance D, R = 1 - h(D), where h is the binary Shannon entropy:
    https://en.wikipedia.org/wiki/Rate%E...tortion_theory
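    As a quick numeric check of that formula (a minimal Python sketch, just evaluating R = 1 - h(D) for a few distortion levels under the binary symmetric assumption above):

    from math import log2

    def h(p):
        # binary Shannon entropy, in bits per bit
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    for D in (0.05, 0.11, 0.25):            # allowed Hamming distortion per bit
        print(f"D = {D:.2f}  ->  R = 1 - h(D) = {1 - h(D):.3f}")
    # D = 0.11 gives R very close to 1/2, matching the rate 1/2 example above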

    Let's think about this enlarging.
    We could choose any injection, e.g. from {0,1} -> {0,1,2,3} in our case, for example
    0 -> 01
    1 -> 10
    but that would mean we consider only sequences made of 2 out of the 4 nucleotides.
    So a more complex behavior is required - one depending on some internal state - and there are many ways to do it.

    Finally we get a hash function which is just a lossily compressed version of the input sequence.
    Constructing such a hash for a similar sequence, we will probably get the same lossy compression - the same hash value.
    To reduce the chance of getting a different hash value for a similar sequence, we can use a few hash values for a given object (sequence).
    Let me know if you have any questions.

  16. #11
    Member
    Join Date
    Nov 2015
    Location
    boot ROM
    Posts
    83
    Thanks
    25
    Thanked 15 Times in 13 Posts
    Jarek, practical QR codes are mostly interesting for the following properties:
    1) Distinct SYNC patterns, created in reasonable ways. These both allow humans to recognize a QR code by its specific appearance, and let software deal with scanning a picture under harsh conditions, getting an idea of how to read it correctly thanks to distinct, a priori known sync patterns.
    2) High redundancy. First, individual bits are large: when the camera captures the image, each bit takes up plenty of sensor pixels. This reduces capacity but skyrockets reliability, so random noise, format conversions and other variations do not matter, and something like averaging allows one to get the bits right with good probability. And then, even if some bits are wrong, there is an error correction code whose strength is selectable; strong codes can survive noteworthy corruption. That's what allows people to put e.g. logos in the middle: the error correction code still manages to get it right despite the pre-corruption, though it reduces the read margins.

    QR codes can survive arbitrary rotation, reasonable rescaling up/down, imperfect/dirty surfaces, imperfect focusing and so on. So people can just grab a phone with a mediocre camera and scan one, without taking much care to position the camera for a perfect picture. Yet even "redundant" QR codes can be challenging to read if the code is large and the ECC level is weak, especially on a noisy/flickering medium like a screen: at some point programs either fail to sync or get too many bit errors. And what you've described seems closer to steganography techniques than to QR codes (in the sense that I fail to see how one could reliably read your code with a crappy camera and careless shooting). They may have something in common, but QR-code techniques are heavily biased toward reading under harsh, suboptimal conditions.

  17. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    xcrh, image-like QR codes are a topic for a different discussion:
    http://encode.ru/threads/1833-%28Rev...-probabilities
    but these are codes living directly in the human world - they should also contain information directly available to the human receiver, both for informational and aesthetic purposes: a QR code containing a link to music can resemble the singer, one linking to an app can resemble the Android logo, etc. And the approach is very general - it also covers more subtle steganography/watermarking.
    Sure, error correction capabilities are crucial for them, and I have also developed ways to handle errors for these image-like QR codes ... but preparing a market-ready product would require a team and a few months ...

  18. #13
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    LZ77 considered harmful for text?

    Charles,

    My impression is that for compressing text, literal handling isn't the main limitation on compression ratio for LZ77-based algorithms---especially if you use a preloaded dictionary. It's the fact that match offset distances aren't a particularly good way to encode repeated strings.

    (I assume here that shorter match distances get shorter codes; my argument is that that's not good enough.)

    If you have a good, sizable preloaded dictionary for a kind of text, you'll generally have relatively few literals and a lot of matches to reasonable-sized strings, like "in that case" and "president of"; what follows a dictionary string will usually be a dictionary string, too, and you'll mostly have literals where you need to "spell out" unusual words or numbers. (And XOR prediction already works well for spelling out numbers as digits.)

    The LZ77 sliding window offset model assumes that recent strings are likely to repeat, and that recency is a passable proxy for frequency. (A string like "of the" that is stably frequent is likely to be recent, too.) But for the stably frequent stuff that should go in a preloaded text dictionary, actual frequency is just better. In normal text, the distance back to the previous occurrence of a string like "of the" is fairly randomly Zipf distributed, but the distribution isn't quite as skewed as you'd like.

    (You know all this, but...) LZX- and LZMA-like algorithms partially solve this problem for clearly strided (usually binary) data, where the match distances are often repetitive, using short codes for repeats of recently seen offsets. If you have strong clear strides, the current match distance is likely to be the same as the previous one, or closely harmonically related to it. (E.g., if your stride is 6, you'll tend to get matches at small multiples of 6.)

    As I understand it, this just doesn't work for actual text where there's ubiquitous variation in string lengths, due to orthographic weirdness, word choice, phrasing choice, etc. All you get is a Zipfy distribution of match distances, with little exact repetition. There are other, weaker regularities in there, but LZX-like stride detection won't work.

    Which makes me think that another hack on LZ77 isn't the best way to go for text. LZ77 is just the wrong algorithm. If you're going to use a dictionary compressor for text, you probably want a smarter one, more like LZ78 in its basic structure, and likely a grammar-based compressor like (Kennon Conrad's) GLZA. A grammar-based compressor can decompress very, very nearly as fast as LZ77 and give you considerably better ratios for text, closer to PPM's. (GLZA is the first to actually do so, because previous ones made one or another basic conceptual mistake in grammar construction. Kennon nailed it.)

  19. #14
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by Paul W. View Post
    My impression is that for compressing text, literal handling isn't the main limitation on compression ratio for LZ77-based algorithms---especially if you use a preloaded dictionary. It's the fact that match offset distances aren't a particularly good way to encode repeated strings.
    Eh, you can choose to look at it that way.

    But the fact is that LZ77 on text still sends lots of literals, and they make up a large % of the output bit stream size.

    The problem is that while large text files will have lots of substrings to match against, most of them are very far away, hence need large offsets.

    Rather than match the next 4 characters with a large offset, it's often cheaper to send a literal, and then a len-3 match with a low offset (for example).

    You could choose to look at it as "offsets are too expensive to send, if they were cheaper we could send fewer literals" , or you could look at it as "with the offset model that we have, we send a lot of literals and need to make them cheaper".

    The LZ77 sliding window offset model assumes that recent strings are likely to repeat, and that recency is a passable proxy for frequency. (A string like "of the" that is stably frequent is likely to be recent, too.) But for the stably frequent stuff that should go in a preloaded text dictionary, actual frequency is just better. In normal text, the distance back to the previous occurrence of a string like "of the" is fairly randomly Zipf distributed, but the distribution isn't quite as skewed as you'd like. ...
    Yes, certainly. This is a well-known issue of course (see LZFG in the ancient days, for example). More recently I came up with "LZSA" which I think is a really neat way to index strings in static dictionaries :

    http://cbloomrants.blogspot.com/2015...ndex-post.html

    the problem is that offsets are just really fast to decode, and everything else is really slow.

    Other references :

    http://cbloomrants.blogspot.com/2014...-coded-lz.html

    LZFG =
    "Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.
    Instead of offsets you send indices of suffix tree internal nodes or leaves, and MTF them or something.


    As I understand it, this just doesn't work for actual text where there's ubiquitous variation in string lengths, due to orthographic weirdness, word choice, phrasing choice, etc. All you get is a Zipfy distribution of match distances, with little exact repetition. There are other, weaker regularities in there, but LZX-like stride detection won't work.
    rep-matches still work in text, though, as a way of doing "gap matches". You can find a long repeated sequence, maybe a whole sentence except that one word is changed, or the only change is "then" -> "than" or whatever.

    Which makes me think that another hack on LZ77 isn't the best way to go for text. LZ77 is just the wrong algorithm. If you're going to use a dictionary compressor for text, you probably want a smarter one, more like LZ78 in its basic structure,
    Sure. The problem is the space-speed tradeoff of LZ77 makes it very compelling.

    Better modeling of the offsets as string-references based on their contents just gets you into the speed domain of PPM/BWT/CM and then it's better to just go ahead and use that.
    Last edited by cbloom; 3rd August 2016 at 20:00.

  20. #15
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    A simple one that I've written about before (eg. see "LZ and exclusions") is :

    At the end of a match (assuming you always write matches of maximum length)

    The next character cannot be one that would have extended the match

    This is the "literal after match" exclusion or "rep0lit exclusion" as used in LZMA-like literal coders

    *however* that exclusion should also apply to offsets

    eg. if the rep0lit is 'a' you know the next literal can't be 'a'
    but you also know that the next *string* can't start with 'a'
    so all offsets that point at 'a' should be excluded from coding

    Currently nobody does this; doing it efficiently (mainly with regard to decoder speed) is tricky.

    etc. lots of issues like this in LZ.
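    For concreteness, a tiny sketch (illustrative only, encoder/model side, with none of the decoder-speed issues addressed) of the exclusion described above: after a maximal match, drop every candidate offset whose next byte equals the rep0lit.

    def excluded_offsets(window, pos, candidates, rep0lit):
        # After a maximal match ending at 'pos', the next literal cannot equal
        # rep0lit, so any candidate offset whose next byte is rep0lit can be
        # excluded from the offset code space as well.
        keep = []
        for off in candidates:
            nxt = window[pos - off]        # byte the match at 'off' would start with
            if nxt != rep0lit:
                keep.append(off)
        return keep

    window = b"abracadabra abracadabr"
    pos = len(window)                  # we are about to code the next symbol here
    print(excluded_offsets(window, pos, candidates=[4, 7, 11, 15], rep0lit=ord('a')))
    # -> [4, 11] : offsets 7 and 15 point at an 'a' and are excluded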
    Last edited by cbloom; 3rd August 2016 at 20:00.

  21. #16
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    but yeah, LZ on text & LZ on binary are extremely different in many ways.

    LZ binary literals can be very low entropy sometimes; LZ text literals usually have high entropy
    LZ binary is fine at MML 4 ; LZ text needs MML 3 (*)
    LZ binary offsets are very clumpy ; LZ text offsets have steady falloff
    LZ binary has structured/repeated offsets
    LZ text has repeated substrings (or repeated positions/contents not offsets)
    LZ binary is fine with very poor matchers, but likes strong multi-chain parsers
    LZ text needs very strong string matchers

    * = the easy way to fix LZ offsets not working great for len-3 strings is just to use MML 4 and order-2 context modeling. That's a way of getting the len-3 substring modeling by content frequency instead of the janky offset modeling. Which I believe is kind of what brotli does. (?)
    Last edited by cbloom; 3rd August 2016 at 20:00.

  22. The Following 3 Users Say Thank You to cbloom For This Useful Post:

    Jyrki Alakuijala (22nd February 2016),Paul W. (23rd February 2016),Turtle (23rd February 2016)

  23. #17
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by cbloom View Post
    but yeah, LZ on text & LZ on binary are extremely different in many ways.

    LZ binary literals can be very low entropy sometimes; LZ text literals usually have high entropy
    LZ binary is fine at MML 4 ; LZ text needs MML 3 (*)
    LZ binary offsets are very clumpy ; LZ text offsets have steady falloff
    LZ binary has structured/repeated offsets
    LZ text has repeated substrings (or repeated positions/contents not offsets)

    * = the easy way to fix LZ offsets not working great for len-3 strings is just to use MML 4 and order-2 context modeling. That's a way of getting the len-3 substring modeling by content frequency instead of the janky offset modeling. Which I believe is kind of what brotli does. (?)
    In brotli's iterative model (at quality 11) the context modeling is applied only after the LZ77 calculation, where even 2-byte matches are emitted (for the last 64 bytes). So far we haven't made an attempt to combine the two or to take the context modeling into account in the literal cost estimation. I'm hopeful that we will get a nice density boost (~2 %) once we find a way to take the context modeling into account in the iterative cost evaluation.
    Last edited by Jyrki Alakuijala; 22nd February 2016 at 00:22.

  24. #18
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Paul W. View Post
    ... LZ77 is just the wrong algorithm. ...
    LZ77 is fast to decode.

  25. #19
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by cbloom View Post
    Sure. The problem is the space-speed tradeoff of LZ77 makes it very compelling.

    Better modeling of the offsets as string-references based on their contents just gets you into the speed domain of PPM/BWT/CM and then it's better to just go ahead and use that.
    Not necessarily. GLZA is rated #1 for text decompression efficiency in the World Compression Challenge 2015. It decompresses significantly faster than PPM/BWT/CM, with RAM usage more typical of LZ77 algorithms, while beating LZ77 compression ratios on large text files by at least 10%.

  26. #20
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Not necessarily. GLZA ...
    Good point, GLZA/Tree is very interesting. Looking forward to the paper and/or deobfuscation!
    Last edited by cbloom; 3rd August 2016 at 20:00.

  27. #21
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by cbloom View Post
    Good point, GLZA/Tree is very interesting. Looking forward to the paper and/or deobfuscation!
    Soon, DCC 2016 is barely a month away.

    I think the key things to understand about GLZA are that it creates a grammar with low order 0 entropy and then codes it with length and "offset" pairs. For GLZ, the length indicates the number of upcoming symbols (unlike LZ77 where it's the length at the offset). Lengths > 1 indicate the length of a new dictionary symbol. Length 1 indicates an existing dictionary symbol and is followed by an "offset" code (dictionary index). The rest is just details on how to find a reasonably good grammar and model the lengths and offsets.
    Last edited by Kennon Conrad; 23rd February 2016 at 09:42.

  28. #22
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    LZ77 is fast to decode.
    Grammar-based compression is about as fast to decode as LZ77. (And the difference is lost in the costs of any low-order modeling and entropy coding you do to get a better compression ratio.) The usual case is to use a symbol as an index into a dictionary array, fetch a pointer to the already-decoded string, check that the pointer is not null, and copy the already-decoded string. So it's usually just a few instructions plus a string copy, like LZ77. If the string hasn't been decoded yet, you do a fast depth-first left-to-right traversal of the parse tree for that string, which is linear in the size of the string in the worst case, and usually faster because most of the substrings have already been seen and decoded, so the unparsing stops whenever you hit something that's already been unparsed. For text, that's usually the case.
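    Here is a small sketch of that decode path (generic memoized grammar expansion, not GLZA's actual code; the rule representation is made up): each grammar symbol is either a terminal byte string or a list of child symbols; the first time a symbol is needed it is expanded depth-first and the resulting string is cached, after which it's just a lookup and a copy.

    def expand(sym, rules, cache):
        # rules[sym]: either a bytes terminal or a list of child symbols.
        # cache[sym]: the already-decoded string for sym, if seen before.
        s = cache.get(sym)
        if s is not None:
            return s                           # usual case: one lookup + copy
        rhs = rules[sym]
        if isinstance(rhs, bytes):
            s = rhs
        else:                                  # depth-first, left-to-right unparse
            s = b"".join(expand(child, rules, cache) for child in rhs)
        cache[sym] = s                         # memoize so we only unparse once
        return s

    # toy grammar: S -> A ' ' A '!',  A -> 'ab' 'ra'
    rules = {"S": ["A", "SP", "A", "BANG"], "A": ["AB", "RA"],
             "AB": b"ab", "RA": b"ra", "SP": b" ", "BANG": b"!"}
    print(expand("S", rules, {}))              # b'abra abra!'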

    I think that this is probably actually faster than LZ77, because you have better (and somewhat longer) strings to copy, and fewer literals to deal with, if your compressor has done its job well.

    It looks weirdly like a dead simple kind of combinator graph reduction for a (really, *really* simple) lazy functional language implementation. You have a node for each functional expression (compound string), and each node only needs to be evaluated once, if you "memoize" (cache) the output the first time it's actually used. You can think of the grammar construction phase as generating and compiling a simple purely functional program to compute the uncompressed output, and the depth-first left-to-right unparsing of a parse tree as interpreting the procedure call graph of that program.

  29. #23
    Member
    Join Date
    Jul 2014
    Location
    Orlando, FL
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Jarek View Post
    There is a nice old concept for fractal compression of pictures: trying to find self-similarities and utilize them ...
    What other applications of distortion-resistant hashing can you think of?
    Notionally, this idea is very similar to Blarpy:
    code.google.com/p/blarpy

  30. #24
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Hi Conduit,
    The page discusses some variation of LSH, which is based on random projections - it works great as long as synchronization is guaranteed: the same coordinates correspond to the same data.
    This is not true for DNA sequencing, nor for text, where there are also synchronization errors: inserted or deleted symbols, shifting all positions of what comes next. E.g. "ATATATATAT" and "TATATATAT" differ only by a single deletion, but will have completely different projections.
    You could try all shifts to prevent that, but that also won't work, as the insertion/deletion can happen in the middle.
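    A tiny demonstration of that point (just illustrating the numbers): aligned position-by-position the two strings disagree almost everywhere, even though their edit distance is 1.

    a, b = "ATATATATAT", "TATATATAT"

    # positional comparison (what a coordinate-wise projection effectively sees)
    mismatches = sum(x != y for x, y in zip(a, b))
    print(mismatches, "of", len(b), "aligned positions differ")    # 9 of 9

    def edit_distance(s, t):
        # plain Levenshtein distance by dynamic programming
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j-1] + 1, prev[j-1] + (cs != ct)))
            prev = cur
        return prev[-1]

    print("edit distance:", edit_distance(a, b))                   # 1 (one deletion)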

    DRH with a metric allowing for synchronization errors, like Needleman-Wunsch, gives hashing that is also resistant to synchronization errors - do you know a different approach which allows for that?
