
Thread: Optimal Preprocessing/Parsing for LZ?

  1. #1
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts

    Optimal Preprocessing/Parsing for LZ?

    Good morning,

    I've been doing some tests and research trying to find the optimal preprocessor/parser for LZ+Huffman/LZ+Arithmetic compressors.

    I know that Microsoft's LZX uses some sort of parsing. Also, Ilya's CRUSH uses optimal parsing.

    General preprocessors such as Bulat's Delta do help. Is there anything like Ilya's optimal parser or MS's LZX parsing for general use with other LZ+Huffman/LZ+Arithmetic archivers/compressors?

    Thanks!

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    No no. CRUSH has no Optimal Parser - it has an Advanced Lazy Matching at best.
    My LZSS v0.02 is LZSS with 16 KB window and Optimal Parsing. The latest LZF v1.02 is 8 KB window LZ with nearly Optimal Parsing.
    If you know the "price" (how many bits it will take) for each literal/match length&offset and you already have all matches at each position of the input, building an optimal or sub-optimal path is pretty straightforward!
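    For illustration, here is a minimal Python sketch of that idea (my own toy code, not taken from LZSS or LZF; the match finder and the bit prices are assumed to be supplied by the caller):
    Code:
    # Toy price-based parse: matches[pos] is a list of (length, dist) pairs found at pos,
    # literal_price is a fixed bit cost, match_price(length, dist) returns a bit cost.
    def cheapest_parse(n, matches, literal_price, match_price):
        INF = float('inf')
        price = [0] + [INF] * n          # cheapest known cost to reach each position
        best = [None] * (n + 1)          # (prev_pos, kind, info) achieving that cost
        for pos in range(n):
            if price[pos] == INF:
                continue
            if price[pos] + literal_price < price[pos + 1]:      # a literal covers 1 byte
                price[pos + 1] = price[pos] + literal_price
                best[pos + 1] = (pos, 'lit', None)
            for (length, dist) in matches.get(pos, []):          # every match found here
                cost = price[pos] + match_price(length, dist)
                if pos + length <= n and cost < price[pos + length]:
                    price[pos + length] = cost
                    best[pos + length] = (pos, 'match', (length, dist))
        tokens, pos = [], n                                      # walk back to recover the parse
        while pos > 0:
            prev, kind, info = best[pos]
            tokens.append((kind, info))
            pos = prev
        tokens.reverse()
        return tokens, price[n]
    With constant prices this is essentially the Storer-Szymanski case; with prices taken from a model it is the usual "optimal parsing" discussed below.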

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    these two questions have nothing in common

    1. there are a lot of preprocessors possible for an LZH compressor. freearc includes delta, dict, bcj/dispack and mm preprocessors; there is also text preprocessing popularized by inikep and bcj2 in 7-zip

    2. since lzx, optimal parsing is used by all high-compression engines, including lzma, rar5, tornado 0.6

  4. #4
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by encode View Post
    No no. CRUSH has no Optimal Parser - it has an Advanced Lazy Matching at best. My LZSS v0.02 is LZSS with 16 KB window and Optimal Parsing. The latest LZF v1.02 is 8 KB window LZ with nearly Optimal Parsing. If you know the "price" (how many bits it will take) for each literal/match length&offset and you already have all matches at each position of the input, building an optimal or sub-optimal path is pretty straightforward!
    Yes, I meant LZSS 0.02. I got confused. Are there any optimal parser tools that are separate/stand-alone from any compressor? For fast decompression, LZ+HUF/ARI is still the best, it just needs some preprocessing help to achieve good ratios.

  5. #5
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    these two questions have nothing in common. 1. there are a lot of preprocessors possible for an LZH compressor. freearc includes delta, dict, bcj/dispack and mm preprocessors; there is also text preprocessing popularized by inikep and bcj2 in 7-zip. 2. since lzx, optimal parsing is used by all high-compression engines, including lzma, rar5, tornado 0.6
    Ahh, I see. Thanks, Bulat. Is there a syntax for FreeArc to apply only the preprocessors, with no compression?

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    -m=delta+dispack+mm+rep:512m

    you can apply any algorithms you want, in any order

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Optimal parsing for adaptive code lengths is hard. What Bulat refers to as optimal parsing isn't always optimal in the strict mathematical sense - optimal parsing would mean that you can't do better given a reference decoder and finite resources.

    Storer-Szymanski parsing is optimal in the mathematical sense, but its coding is simple and not adaptive, so it doesn't attain very high compression ratios.

  8. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    comp1 (12th December 2014)

  9. #8
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    -m=delta+dispack+mm+rep:512m

    you can apply any algorithms you want, in any order
    Ok, I will test that, thanks. So, on the other question: are there any general preprocessor tools that would work best for LZX? Or is it a matter of trial and error with the ones you mentioned?

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    those preprocessors are mostly independent, so you need to apply them all

  11. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    comp1 (12th December 2014)

  12. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Optimal parsing for adaptive code lengths is hard.
    Yeah. When the code lengths depend on the codes' global frequencies, it becomes much more difficult to break down the problem into overlapping subproblems, as is required for a dynamic programming solution. At least, it appears to be much more difficult.

    What Bulat refers to as optimal parsing isn't always optimal in the strict mathematical sense - optimal parsing would mean that you can't do better given a reference decoder and finite resources.

    Storer-Szymanski parsing is optimal in the mathematical sense, but its coding is simple and not adaptive, so it doesn't attain very high compression ratios.
    I've pointed this out before, but let me restate: optimal doesn't necessarily mean good. An algorithm can legitimately be called optimal as soon as it optimizes one single variable. The variable that it optimizes can be anything whatsoever. Authors like to have the word optimal in the title of their paper, because it sounds good. However, being optimal tells you basically nothing until you know what is being optimized. In general, it's hard to take an intuitive concept like "global best possible" and make it rigorous enough to be mathematically optimized. Even then, if you settle for some metric that's mathematically rigorous and kind of, sort of resembles globally optimal, the optimization problem usually becomes either NP-hard or uncomputable. Compression seems to be like that.

    So, LZSS certainly isn't global best possible, although, in a limited, technical sense, it is optimal.

  13. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Storer-Szymanski parsing is optimal given the constraints which impose that all matches are coded in the same number of bits and all literals are coded in the same number of bits. There's no room for interpretation. You get globally best result.
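    As a quick sanity check with made-up numbers: if every literal costs 9 bits and every match costs 24 bits, a 3-byte match (24 bits) always beats three literals (27 bits), while a 2-byte match (24 bits) always loses to two literals (18 bits). Because those costs never change, the backward pass can settle every such choice exactly, which is what makes the result provably optimal under the constraints.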

    Wiktionary says for 'optimal':
    The best, most favourable or desirable, especially under some restriction.
    So Storer-Szymanski certainly fits.

    I wonder what should be the restriction to make a usual 'optimal' parsing for adaptive codes really optimal? If it's too convoluted then it doesn't make sense to name an algorithm 'optimal'.

  14. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    you can start with compilers that provide "program optimization".
    Yeah and the binaries are 'optimized' but not 'optimal'.

    overall, it's obvious that optimal parsing was started with LZSS where it was math-optimal, and then similar algos were used for LZ+EC where they were not math-optimal, so now "optimal LZ parsing" means backward price-based evaluation and not math optimization.
    Just because some algo is similar to Storer-Szymanski one doesn't mean it's also optimal.

    but every time we start discussing optimal LZ, the discussion is terminated by those dictionary researches. do we really need to discuss names instead of algos?
    Optimal means that it can't be beaten. So after discovering an optimal algorithm there's no sense in searching further, as it can't be beaten by definition.

    now "optimal LZ parsing" means backward price-based evaluation
    So maybe use something like "blockwise parsing" or "locally optimal parsing", so it better reflects what's going on, doesn't abuse terminology and doesn't mislead.



    I think that SS-inspired parsing should be sufficiently good for typical LZ schemes, but a problem arises when the coding scheme allows for special treatment of repeated matches, match continuations, etc. The SS algo doesn't handle that at all, so a smart custom heuristic is needed.

  15. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Well, I think it's better to first study and understand how Storer-Szymanski parsing ( http://cbloom.com/algs/dictionary.html#optimal ) works and why it's optimal. Only then proceed to heuristics that are labelled as optimal while being non-optimal, but able to work with an adaptive coding scheme.

  16. #14
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    When it comes to compression that is not bijective, it can't be optimal. To be optimal it has to make full use of the compression space and has to be bijective. In reality, even a flat copy could be considered an optimal compressor. My bijective LZW compressors are optimal compressors for some infinite class of files, so they must use optimal parsing. The problem is that there is no such thing as an Optimal parser, since it's intransitive by nature. Someone could prove method A is better than method B, while someone else could prove method B is better than method C, while another person could prove C is better than A. It's really an unending search, since it still ultimately depends on what you want to compress. And assuming mankind does not go back to the stone age, the types of files being compressed in the future are not likely to be the same as today, since they are always changing.

  17. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I think LZ77 was proved to be asymptotically optimal on stationary, ergodic sources. That's not the same as optimal compression, which Kolmogorov proved is not computable.

  18. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Bulat:
    "math-optimal"? Something is either optimal or not. You're inventing neologisms here.

    In the LZ world, the intuitive description of an optimal parser is a parser that gives the best compression (ie cannot be beaten at all) while the file is still fully decodable by the reference decompressor.

    You're trying to tell people that a good parsing algorithm has to look like the SS algorithm, while there's no proof. The SS algo is optimal given the constraints Charles Bloom provided, and he hasn't abused the term Optimal Parsing in his article (citation: "Hopefully you see the beauty and simplicity of this scheme. This finds the optimal coding, by construction, but only if our assumptions are true : matches are coded in M bits, and literals in L bits.").

    Kennon Conrad in his Tree program doesn't do any SS-like parsing (ie going backwards) but he achieves a very good compression ratio. I think that tokenization could be applicable in the LZ77 world and maybe could allow for better compression ratios than the so-called "optimal parsing". And tokenization in Tree is basically a parsing - it's used to find matches, which should lead to a very good compression ratio.

    Repeated matches or match continuations bring conventional LZ77 closer to Tree-like coding. The more entries there are in the recent-matches queue, the more similarity there is between Tree and that LZ77-derived coder. So one could e.g. try local Tree-like tokenizations within small windows to extract repeated matches and then complete the parsing by finding matches over a bigger LZ77 window.

  19. #17
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Just to jump in here real quick:

    Thank you to Piotr, Bulat, nburns, Matt, and biject.bwts for the informative explanations and discussion.

    Although there is disagreement in this discussion, it is still very helpful to me.

  20. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    FIRST: optimal LZ parsing isn't even locally optimal, look at the fast bytes parameter. moreover, a few years ago lzma had TWO optimal LZ parsers, both of which short-circuited some evaluations

    SECOND: according to your idea, i should answer comp1 that no modern LZ optimal parsers exist. unfortunately it will not help him

    THIRD: natural language is a living, changing thing. twit, ring, driver, calculator, energy are just a few examples of words that got additional meanings as progress continued. And word combinations become idioms with their own meaning, e.g. russian roulette (which isn't really russian) or locally optimal (which isn't really optimal).

    Words "optimization" and "optimal" have different meanings in math and in usual life. People continue to give new meanings to old words everyday. 15 years ago you had chances to change name of LZX algorithm when its description first appeared, but now it's just too late. The idiom "optimal LZ parsing" is coined now and means well-established algorithm family rather than optimization in compiler or mathematical sense. By using other name you will just lose connection with everyone learned algorithm by this name

    but if you like such meaningless work, you can start with the LZ name itself - read the LZ77 paper and realize that since lzari (about 1987) we haven't used the algorithm they described

    4TH: but really i prefer to stop discussing linguistics and start discussing algorithms. otherwise encode.ru will finally turn into a zpaq/freearc support forum, which is hardly your goal
    Last edited by Bulat Ziganshin; 23rd December 2014 at 18:48.

  21. The Following 2 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    comp1 (23rd December 2014),Matt Mahoney (24th December 2014)

  22. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I still think that understanding why SS parsing is optimal and not just "optimal" is crucial to understanding the difficulty and goal of parsing. Making 'optimal' synonymous with 'good enough' (or rather 'optimal' with 'blockwise backwards') doesn't help.

    Good parsing doesn't have to be blockwise and backwards; you can do good parsing without that by using the A* algorithm: http://cbloomrants.blogspot.com/2011...with-star.html

    Also (after quick Googling) I've found some lengthy paper: https://tel.archives-ouvertes.fr/tel-00804215/document

  23. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    comp1 didn't ask me about that and it seems that he isn't a programmer at all. i already provided him a link to the thread that contains ONE message about the algorithm and a DOZEN messages about the name. but you found that not enough and started to write this again

    CBloom still calls it optimal parsing, so it seems that one should find something really unlike the LZSS algo to assign it a new name. the second paper is about real optimal parsing, but they will have great problems with the name

  24. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    CBloom still calls it optimal parsing, so it seems that one should find something really unlike the LZSS algo to assign it a new name.
    What's common between Charles' A*-derived algorithm and SS parsing? Cost computation? Well, every parsing technique that isn't greedy parsing takes cost into account.

    If both SS-derived parsing and A*-derived parsing can be labelled optimal and both aren't truly optimal then what's the definition of optimal parsing? How to tell if a parsing is optimal or not? Even if the answer is somehow obvious to someone, then it's still worth formulating.


    Also, I think the A* thing has more potential than SS scheme. SS requires freezing costs for every block, while A* can be (almost?) fully adaptive. So SS scheme should be OK with blockwise Huffman coding, while A* should probably be better when dealing with fully adaptive (arithmetic) coding.
    Last edited by Piotr Tarsa; 23rd December 2014 at 22:17.

  25. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Storer-Szymanski parsing is optimal given the constraints which impose that all matches are coded in the same number of bits and all literals are coded in the same number of bits. There's no room for interpretation. You get globally best result.
    Maybe "global best possible" was poor word choice and too ambiguous to be meaningful. I was trying to describe the concept of a completely unconstrained optimal, which can't be exceeded by any other solution under any set of constraints. In compression, this ultimate perfect solution is probably a myth -- or some would say it's the Kolmogorov entropy. Actually, the Kolmogorov entropy does have constraints, and there seems to be no apparent reason why those constraints should be considered better than all other constraints under all circumstances. But, since no one can possibly compute the Kolmogorov entropy, ever, it's perhaps thankfully permanently beyond reach, and not worth thinking much about.

    LZSS gets the absolute best result within its constraints -- that's the meaning of optimal. However, eventually you must address the question of how much the problem defined by those constraints actually matters. Compression in general isn't subject to the constraints chosen by LZSS, so people tend to judge LZSS by how well it achieves compression. When you measure it that way, the optimality within its own constraints becomes a lot less relevant -- a failure of the constraints themselves. If you change the constraints to make a better problem, you can no longer use LZSS...

    Wiktionary says for 'optimal':

    So Storer-Szymanski certainly fits.

    I wonder what should be the restriction to make a usual 'optimal' parsing for adaptive codes really optimal? If it's too convoluted then it doesn't make sense to name an algorithm 'optimal'.
    I think it's a fairly fundamental problem. When you additionally allow Huffman and related codes within an LZSS-like framework, it becomes a less constrained problem than the one LZSS solved, and so there may be no optimal solution to this bigger problem under meaningful constraints (just because you can change the constraints to make things easier on yourself doesn't mean it isn't still cheating). LZSS minimizes file size, AFAIK, which is what you want... however, in reality, LZSS is helped to get to the optimal file size by the fact that it side-steps and avoids dealing with harder problems that may not have solutions. (Reading between the lines a bit, I am supposing that the reason some problems were not solved by LZSS is likely the inherent hardness of those more interesting problems. I don't think the choice of what problems to attack and solve is made haphazardly, and I don't think Storer and Szymanski stopped where they did just because they felt like doing something else and never considered optimizing LZ+H.)

    I gave some concerted thought to the problem of optimizing file size in LZ under Huffman today. It's easy to see how the problem defeats the usual dynamic programming solution -- the symbol costs are not known until the very end; this prevents you from precomputing optimal solutions to subproblems, because it's generally impossible to know if any solutions are optimal until you are done with the whole problem. OTOH, it must be somewhat hard to prove that this problem is insurmountable, because AFAIK no one has proved that, either.

    It may be possible to attack from another angle. Huffman comes with its own constraints, and other invariants and things can be derived that start to help to cut down on the chaos of never knowing any symbol costs until the very end. In fact, Huffman lengths pretty much lose exactly one bit every time you double the frequency. So that helps to put sharp bounds on the uncertainty. There will also probably be some cases where the length can be predicted exactly without having to do any experimentation, because there are hard upper and lower bounds on the number of times you might possibly use that symbol, and they are both inside a single code-length range. In general, you may find that only a subset of LZ match decisions are truly in play and depend on Huffman, and perhaps the remaining problem will be tractable. But I'm not sure whether this problem is really incredibly important, anyway. Even supposing that you could determine the optimal match choices for LZ with Huffman, the resulting algorithm would still only be LZ. There would still be other compression techniques that it couldn't take advantage of that other, non-LZ compressors can.
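    To make that bound concrete, here's a toy check using ideal code lengths, -log2(p), as a stand-in for Huffman lengths (made-up counts, total held fixed):
    Code:
    from math import log2

    total = 100000                             # assumed total symbol count (toy number)
    for f in (500, 1000, 2000, 4000):
        print(f, round(-log2(f / total), 3))   # ideal length in bits for a symbol seen f times
    # prints 7.644, 6.644, 5.644, 4.644 - each doubling of f costs exactly one bit less;
    # real Huffman lengths roughly track these ideal lengths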
    Last edited by nburns; 23rd December 2014 at 22:47.

  26. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    What's common between Charles' A*-derived algorithm and SS parsing? Cost computation? Well, every parsing technique that isn't greedy parsing takes cost into account.
    lazy parsing in deflate/lzma/tornado doesn't use prices. in deflate it accepts the newer match if len2>len and in lzma/tor it checks "len2>len1+1 or (len2==len1+1 and dist2/dist1<64)"
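    As a sketch, that lzma/tor acceptance test can be written as a tiny predicate (the names are mine; only the comparison comes from the rule above):
    Code:
    # prefer the match found one byte later if it is clearly longer,
    # or only one byte longer but at a much smaller distance
    def prefer_later_match(len1, dist1, len2, dist2):
        return len2 > len1 + 1 or (len2 == len1 + 1 and dist2 < 64 * dist1)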

    Quote Originally Posted by Piotr Tarsa View Post
    If both SS-derived parsing and A*-derived parsing can be labelled optimal and both aren't truly optimal then what's the definition of optimal parsing? How to tell if a parsing is optimal or not? Even if the answer is somehow obvious to someone, then it's still worth formulating.
    A* used the optimal-LZ idea, just computing position prices in opposite way. usually, we have price of position N and compute price of position N+len, and A* goes the opposite way

    it is how words live in natural languages. for example, energy once meant mental energy, then mechanical energy, and now it may mean electron mass (via e=mc^2). where will it stop, and what was the definition by Aristotle, if the word can be used in such a wide way?

    Quote Originally Posted by Piotr Tarsa View Post
    Also, I think the A* thing has more potential than SS scheme. SS requires freezing costs for every block, while A* can be (almost?) fully adaptive. So SS scheme should be OK with blockwise Huffman coding, while A* should probably be better when dealing with fully adaptive (arithmetic) coding.
    as CBloom said, A* provides better compression than the Chain-N scheme, but sometimes became too slow. i wonder why adaptive encoding cannot be used with the following scheme:

    Code:
    for pos in (0..N)
      state[pos] := state[best_prev_pos[pos]], updated for a best_match[pos]
      for (len,dist) in possible_matches[]
        if price[pos]+price(len,dist) < price[pos+len]
          best_prev_pos[pos+len] = pos
          best_match[pos+len] = (len,dist)
    if len is limited to M, we will need to keep only the M recent states plus the initial state for the encoding stage. this scheme may be further improved by avoiding copying states that will never be used again and by allowing the state to be reused by restoring the changes made by the update
    Last edited by Bulat Ziganshin; 24th December 2014 at 00:11.

  27. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    When it comes to compression that is not bijective, it can't be optimal. To be optimal it has to make full use of the compression space and has to be bijective. In reality, even a flat copy could be considered an optimal compressor. My bijective LZW compressors are optimal compressors for some infinite class of files, so they must use optimal parsing.

    You are bringing in additional definitions of optimal here -- specifically, you are using optimal to mean completely without redundancy. Aiming for zero redundancy is kind of slippery, because you will always be able to turn up redundancy somewhere, at least when compressing data taken from the real world. Being bijective means you've eliminated a class of waste (redundancy) that is totally objective, because it's completely pure waste that codes for zero information. Once you get rid of all of that, the remaining waste is subjective, or has a subjective component -- that's the intractable waste, because you can't really even see it or get universal agreement on where or what it is.

    S+S used the usual mathematical definition of optimal when they wrote their original paper. They selected a variable to minimize (compressed size), selected a problem that had many possible solutions that was amenable to optimization (LZ match selection, with constraints), and they published an algorithm that returns the optimal solution to a particular instance of the problem (it returns the sequence of match references and literals that yield the minimum file size).

    The problem is that there is no such thing as an Optimal parser, since it's intransitive by nature. Someone could prove method A is better than method B, while someone else could prove method B is better than method C, while another person could prove C is better than A. It's really an unending search, since it still ultimately depends on what you want to compress. And assuming mankind does not go back to the stone age, the types of files being compressed in the future are not likely to be the same as today, since they are always changing.
    Of course there are optimal parsers, so long as you use the mathematical definition of optimal -- mathematicians defined that term with the intent of doing actual optimization, so, of course, they defined it in such a way that it's not impossible to do. Claims of being optimal always need to come with constraints, though. Being just plain "optimal", period, is much too ambitious to be attempted by a human.

    What you wrote about optimal parsing being intransitive strikes me as somewhat insightful -- you could say that the ultimate measurement of a compression algorithm would be a vector mapping the complete space of inputs (infinite) to the size it compresses each one to. Clearly, the complete set of these vectors would only be partially ordered, since there would be numerous cases where compressor A was better than compressor B on file x, but the relationship was reversed on file y.

    So it would be pointless to look for the one best within that partially ordered set. That's kind of a truth that's central to compression, because whether it's more important to compress file x or file y is a question that has no ultimate answer. It's just a choice.

  28. #25
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    lazy parsing in deflate/lzma/tornado doesn't use prices. in deflate it accepts the newer match if len2>len and in lzma/tor it checks "len2>len1+1 or (len2==len1+1 and dist2/dist1<64)"



    A* used the optimal-LZ idea, just computing position prices in opposite way. usually, we have price of position N and compute price of position N+len, and A* goes the opposite way

    it is how words live in natural languages. for example, energy once meant mental energy, then mechanical energy, and now it may mean electron mass (via e=mc^2). where will it stop, and what was the definition by Aristotle, if the word can be used in such a wide way?
    You seem to be missing something that I think is important. The word optimal actually has a very precise and meaningful definition in math, and CS papers generally adhere to that -- including the one by S+S. In math, you can be optimal as long as you define in precise terms what kind of optimal you are claiming to be. If you provide sufficient proof (like S+S did), then you are optimal, but only in exactly the precise way that you proved. There is no such thing in math as just being optimal, period -- without specifying what you are optimizing.

    If optimal LZ has evolved a different definition that no longer agrees with the math definition, then it is a curious occurrence and the only example I know of. Perhaps there are many people in compression who acquired English late in life (and never studied very much math) and they misused it until it lost its meaning? Or, maybe it has nothing to do with English, but just that many people in compression don't have much math training and made sloppy use of the term.

    as CBloom said, A* provides better compression than the Chain-N scheme, but sometimes became too slow. i wonder why adaptive encoding cannot be used with the following scheme:

    Code:
    for pos in (0..N)
      state[pos] := state[best_prev_pos[pos]], updated for a best_match[pos]
      for (len,dist) in possible_matches[]
        if price[pos]+price(len,dist) < price[pos+len]
          best_prev_pos[pos+len] = pos
          best_match[pos+len] = (len,dist)
    if len is limited to M, we will need to keep only the M recent states plus the initial state for the encoding stage. this scheme may be further improved by avoiding copying states that will never be used again and by allowing the state to be reused by restoring the changes made by the update
    If by adaptive encoding you are talking about the same problem I have been writing about -- the reason that algorithm won't work is that you will not know the prices until after making all the choices. That's because Huffman assigns all the prices at the end, based on the set of choices.

  29. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    read this if you need two more examples
    Last edited by Bulat Ziganshin; 24th December 2014 at 01:36.

  30. #27
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    lazy parsing in deflate/lzma/tornado doesn't use prices. in deflate it accepts the newer match if len2>len and in lzma/tor it checks "len2>len1+1 or (len2==len1+1 and dist2/dist1<64)"

    A* used the optimal-LZ idea, just computing position prices in opposite way. usually, we have price of position N and compute price of position N+len, and A* goes the opposite way
    Doesn't, but could. We can forget about prices and make a translation like this:
    greedy parsing -> no lookahead
    lazy parsing -> one position lookahead
    advanced lazy parsing -> two positions lookahead
    flexible parsing -> multiple positions lookahead
    quasi-optimal parsing -> multiple matches lookahead
    Or make something seemingly weird like 5-matches-with-3-positions-per-match-lookahead (that gives maximum of 3^5 = 243 searches for a given input position).

    We can employ different strategies, eg:
    - maximize the number of matched bytes in specific number of matches (that's how original flexible parsing works),
    - reduce the price of coding a small window of data - eg a window that ends after three greedy matches,

    We can mix the above strategies and add additional skewing like:
    - accepting very long matches unconditionally,
    - giving an additional penalty to matches that exceed a distance threshold or multiple distance thresholds corresponding to CPU cache sizes,
    - preferring longer matches and fewer literals even if slightly more costly,

    That skewing will make the result of parsing further from optimality, but it can improve the efficiency of compression and/or decompression.

    as CBloom said, A* provides better compression than the Chain-N scheme, but sometimes became too slow. i wonder why adaptive encoding cannot be used with the following scheme:

    Code:

    for pos in (0..N)
      state[pos] := state[best_prev_pos[pos]], updated for a best_match[pos]
      for (len,dist) in possible_matches[]
        if price[pos]+price(len,dist) < price[pos+len]
          best_prev_pos[pos+len] = pos
          best_match[pos+len] = (len,dist)

    if len is limited to M, we will need to keep only the M recent states plus the initial state for the encoding stage. this scheme may be further improved by avoiding copying states that will never be used again and by allowing the state to be reused by restoring the changes made by the update
    As nburns said, you don't know the prices beforehand because you delay actual selection of matches. And your heuristic adds additional complexity when it comes to tuning algorithm. A* has much shorter delay (between predicting prices and updating them) than blockwise SS parsing (with nontrivial window size).

    it is how words live in natural languages. for example, energy once meant mental energy, then mechanical energy, and now it may mean electron mass (via e=mc^2). where will it stop, and what was the definition by Aristotle, if the word can be used in such a wide way?
    'Optimal' is an adjective like 'best'. 'Best' always meant best; it never meant second place or whatever. And like 'optimal', 'best' depends on constraints - you can be the best runner at a competition (the constraint here is a particular competition), while someone could be even better than you but not starting in the competition you do. You still would've been the best runner in that competition, though.


    The word 'parsing' was already hijacked and given a new meaning. If you look into a dictionary you'll see:
    parse pärs
    verb
    gerund or present participle parsing
    analyze (a sentence) into its parts and describe their syntactic roles.
    analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.
    examine or analyze minutely.
    But parsing in LZ schemes is not looking for syntactic roles. It's merely a decomposition that doesn't care if the chunks correspond to anything. It doesn't create confusion though, because if 'parsing' is used in place of 'decomposition' then it's immediately explained that it's about finding repetitions and nothing is said about syntax.

  31. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    Gonzalo (25th December 2014)

  32. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    As nburns said, you don't know the prices beforehand because you delay actual selection of matches. And your heuristic adds additional complexity when it comes to tuning algorithm. A* has much shorter delay (between predicting prices and updating them) than blockwise SS parsing (with nontrivial window size).
    seems that you both misunderstood my algorithm and optimal parsing overall. in lzma/tornado, when you start to check matches starting at position N, you have already found the best path to this position from the start and the position price, but only using old prices

    the only thing i've added is an up-to-date state, which means up-to-date prices. contrary to my approach, A* only knows prices optimized for the best encoding of the further path to the end of the window

    PS: just now 80% of this topic is about linguistics and 20% about algorithms
    Last edited by Bulat Ziganshin; 24th December 2014 at 01:55.

  33. #29
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Your algo seems unfinished. You don't show how you compute prices and it seems that for a given pos you're examining prices of future positions. I assume pos goes from 0 to N, as it wouldn't make sense to update state in reverse order. You didn't provide enough details, so maybe I've misunderstood something.

  34. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Code:
    for pos in (0..N)
      state[pos] := state[best_prev_pos[pos]], updated for a best_match[pos]
      for (len,dist) in possible_matches[]
        if price[pos]+price(len,dist) < price[pos+len]
          price[pos+len] = price[pos]+price(len,dist)
          best_prev_pos[pos+len] = pos
          best_match[pos+len] = (len,dist)
    you are right. in the second line, when i update the state (which may be many KB), i update the prices too. so we have already selected the optimal path to this concrete pos, we know the full state and we can compute all prices

    it may look too slow, but in lzma updating prices with one match/literal is usually very cheap, and copying multiple KBs within L3 cache runs at >100 GB/s


    it seems that for a given pos you're examining prices of future positions.
    it is how lzma/tornado optimal parsing works - we start with
    Code:
    price[0]=0
    for (pos in 1..N)
      price[pos] = VERY_HIGH_PRICE
    and then try every possibility to reach every position and choose the minimal price. but it uses price[pos+len] only for comparison with the new price, so no future prices are involved

    you can read this post for details of lzma/tor algo
    Last edited by Bulat Ziganshin; 24th December 2014 at 02:50.


