
Thread: Suffix tree transformation?

  1. #1
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Suffix tree transformation?

    While I am by no means an expert in compression, it seems to me that a new classification of compression methods would be useful to better describe programs such as bpe, krc, diqs, tree, and possibly others.

    As I understand it, these programs are classified as "Dictionary" compressors. I think this is somewhat misleading because, for the most part, none of these programs care about "words" as we think of them normally (unlike xwrt, I think) and the symbols can be composed of other symbols in addition to letters.

    Therefore, it seems to me that a description such as "Suffix tree transformation" would more accurately describe what these programs do. They all create symbols for the "best" suffix(es) and then use the new symbols in place of the suffixes until they run out of good suffixes, and then write the compressed output using the transformed suffix tree.
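    The "create a symbol for the best string, substitute, repeat" loop these programs share can be sketched as follows. This is a minimal bpe-style illustration of my own, not any of these programs' actual code; the "best suffix" criterion is simplified here to "most frequent adjacent pair":

```python
# Minimal sketch of recursive substitution (bpe-style, hypothetical):
# repeatedly replace the most frequent adjacent symbol pair with a new
# symbol until no pair repeats, recording the rules as a dictionary.
from collections import Counter

def recursive_pair_substitution(data: bytes):
    symbols = list(data)          # working sequence of symbol ids
    rules = {}                    # new symbol id -> (left, right)
    next_id = 256                 # ids 0..255 are literal bytes
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:             # no pair repeats: nothing left to gain
            break
        rules[next_id] = best
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        next_id += 1
    return symbols, rules

def expand(sym, rules):
    """Recursively expand a symbol back into bytes."""
    if sym < 256:
        return bytes([sym])
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)

seq, rules = recursive_pair_substitution(b"mediawiki mediawiki mediawiki")
assert b"".join(expand(s, rules) for s in seq) == b"mediawiki mediawiki mediawiki"
```

    Note how the rules form a hierarchy: a late symbol expands through earlier symbols down to literal bytes, which is the "symbols composed of other symbols" behavior described above.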

    Thoughts?

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Isn't your compression scheme a variant of LZ78 with offline parsing optimization? By offline I mean making decisions at some point using data well past that point. Usual LZ78 variants only examine the immediate input and the current contents of the dictionary to make a local decision.

  3. #3
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Isn't your compression scheme a variant of LZ78 with offline parsing optimization? By offline I mean making decisions at some point using data well past that point. Usual LZ78 variants only examine the immediate input and the current contents of the dictionary to make a local decision.
    Tree is similar to LZ78 in terms of the dictionary being built up dynamically as the compressed stream is sent or received, but I think there are many important differences:

    1. LZ78 dictionary entries are limited to a previous match plus one character. Tree dictionary entries can be any combination of previous dictionary entries and new dictionary entries up to order 9000.

    2. LZ78 creates one dictionary entry for every match+symbol entity in the compressed stream. Tree can write output codes without creating new dictionary entries. Tree only creates dictionary symbols if they are used later in the file. Tree also can and does create new dictionary entries within new dictionary entries that are being created. (See .pdf document in v0.1 - Tree creates dictionary entries me, med, ia, media, wi, ki and wiki within the creation of the dictionary entry for mediawiki the first time mediawiki is encountered.)

    3. LZ78 substitutes dictionary entries linearly, from file start to file end. Tree substitutes dictionary entries from best entry to worst entry.

    4. LZ78 creates dictionary entries independent of the Markov chain probability. Tree only creates dictionary entries when the entry's probability of being used exceeds the Markov chain probability.

    5. LZ78 dictionary entries stay in the dictionary until the end of the file. Tree removes dictionary entries that are used less than 20 times after the last usage.

    I think the biggest reason Tree seems to create smaller compressed files for text compared to existing LZ78 algorithms is because the dictionary does not get as big, making dictionary references more efficient. This is just a wild guess for LZ78, but I think for enwik9, the dictionary size would be around 100 million entries (if never reset). Tree ends up with about 9 million dictionary entries total but only about 4.5 million at the peak.
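    For contrast, here is a textbook LZ78 parse (a sketch of plain LZ78, not Tree): apart from the final flush, every output token forces a new dictionary entry, which is why the dictionary balloons whether or not the entry is ever used again.

```python
# Textbook LZ78 parse: each emitted token (index, next_char) also creates
# a new dictionary entry, so the dictionary grows by one on every step.
def lz78_parse(data: bytes):
    dictionary = {b"": 0}
    tokens, phrase = [], b""
    for b in bytes(data):
        c = bytes([b])
        if phrase + c in dictionary:
            phrase += c               # keep extending the current match
        else:
            tokens.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)   # forced new entry
            phrase = b""
    if phrase:                        # final flush: token without new entry
        tokens.append((dictionary[phrase], b""))
    return tokens, dictionary

tokens, d = lz78_parse(b"abababababab")
# Every non-flush token added an entry: a, b, ab, aba, ba, bab
assert len(d) == len(tokens) + 1      # +1 for the empty root
```

    Tree's ability to emit output codes without creating entries, and to drop entries once they stop being useful, is exactly what this scheme lacks.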

    Tree really is a suffix tree transformation. During compression, whenever a new dictionary symbol is created, it is substituted for the string it represents throughout the entire data set and the dictionary suffixes are added to the suffix tree. Final encoding is just a matter of efficiently writing the transformed suffix tree in the order in which it is traversed by the input data.

    If done properly, a suffix tree transformation will never create useless dictionary entries. As far as I know, all LZ algorithms create useless dictionary entries.
    Last edited by Kennon Conrad; 21st June 2014 at 18:47.

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    You're mostly right, but does the 'Tree' encoding scheme really have anything to do with suffix trees? Couldn't you output the same kind of stream using suffix arrays or eg Morphing Match Chains? Also you can use suffix trees to implement string searching in usual LZ77-derived algos, eg LZMA.

    I think a more relevant description would be something like "Recursive Pattern Substitution" rather than "Suffix Tree Transformation". I haven't yet read the Tree paper fully, so maybe that isn't hitting the nail on the head.

    There are derivatives of LZ78, namely LZMW and LZAP, which add either the concatenation of two dictionary entries (LZMW) or all prefixes of that concatenation (LZAP). So that emulates recursive pattern substitution. Usual LZ78 variants have poor management of dictionary entries - there's only a full reset of the dictionary instead of purging just the least useful entries from time to time. Tree resolves most of the problems with dictionary management but otherwise still looks somewhat like LZ78.

  5. #5
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    "Dictionary transformation" or "Recursive symbol transformation" is not a bad description and is easy to understand, because letters are symbols (ancient symbols), but people may think it's the same as the LZ-class compressors.

    A question: are all of these compressors very asymmetric, or is a symmetric version possible?
    Asymmetry is good only for servers today. People have lots of RAM. Of course, don't go above the 32-bit limit (2 GB).
    Last edited by lunaris; 21st June 2014 at 20:05.

  6. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    You're mostly right, but does the 'Tree' encoding scheme really have anything to do with suffix trees? Couldn't you output the same kind of stream using suffix arrays or eg Morphing Match Chains? Also you can use suffix trees to implement string searching in usual LZ77-derived algos, eg LZMA.

    I think a more relevant description would be something like "Recursive Pattern Substitution" rather than "Suffix Tree Transformation". I haven't yet read the Tree paper fully, so maybe that isn't hitting the nail on the head.

    There are derivatives of LZ78, namely LZMW and LZAP, which add either the concatenation of two dictionary entries (LZMW) or all prefixes of that concatenation (LZAP). So that emulates recursive pattern substitution. Usual LZ78 variants have poor management of dictionary entries - there's only a full reset of the dictionary instead of purging just the least useful entries from time to time. Tree resolves most of the problems with dictionary management but otherwise still looks somewhat like LZ78.
    You are right, the scheme does not require trees, and I think bpe, diqqs and krc use arrays. Any suffix structure could work, so "tree" should not be part of the name for this type of compression.

    The Tree paper is a bit dated now, although the basic ideas are the same. While tree, bpe, diqqs and krc are all recursive, I believe a non-recursive method could be developed. It would be difficult code to write, but I think compression time could be much better, hopefully with just a small decrease in compression ratio. So maybe "Hierarchical Suffix Substitution" or "Hierarchical Suffix Transformation" would be more inclusive of possible future methods.

    The order in which strings are compressed is completely different for Tree than for any LZ-based algorithm. I could be wrong, but I don't think LZ schemes handle long, low-frequency matches well. I think LZ algorithms only combine two things for each new dictionary symbol, no matter which variant is used. Tree does much better with long, low-frequency matches since it generally starts with high-order matches and breaks them down rather than starting with low-order matches and building them up.

  7. #7
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by lunaris View Post
    "Dictionary transformation" or "Recursive symbol transformation" is not a bad description and is easy to understand, because letters are symbols (ancient symbols), but people may think it's the same as the LZ-class compressors.

    A question: are all of these compressors very asymmetric, or is a symmetric version possible?
    Asymmetry is good only for servers today. People have lots of RAM. Of course, don't go above the 32-bit limit (2 GB).
    Okay, so maybe Hierarchical Dictionary Substitution would be good. I have never been good with names....

    I am afraid I don't understand your question about asymmetry. If you are asking if a "uq" gets added to the dictionary when "qu" gets added to the dictionary, the answer is no.

  8. #8
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    I mean using the same amount of RAM for compression and decompression.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    A suffix is an ending of a sequence. "Hierarchical Suffix Substitution" would then imply that you're substituting some suffix with a code and repeating that process. That's obviously wrong.


    LZ78 builds up a dictionary based on the structure of matches, but LZ77, which is the prevalent LZ variant today, does not have a dictionary of entries but rather a window of the most recent past data. Therefore LZ77 can deal pretty well with long, low-frequency matches. The only requirement is that the repeated string, plus the data between the original string and the repetition, must fit in the window.

    You can try it for yourself. Eg generate 3 random chunks: A, B and C. Then generate a file by concatenating A + B + C + B. Assuming a window can fit more than two chunks at a time then LZ77 should almost perfectly compress the repetition of chunk B. LZ78 on the other hand should behave rather badly.
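    That experiment is easy to run with zlib, whose DEFLATE coder is LZ77-based with a 32 KiB window. A quick sketch, with chunk sizes chosen so the repetition distance fits in the window:

```python
# Piotr's A + B + C + B experiment with an LZ77-family coder (zlib/DEFLATE,
# 32 KiB window): the repeated chunk B compresses almost for free because
# the distance back to its first occurrence fits in the window.
import os
import zlib

A, B, C = (os.urandom(8 * 1024) for _ in range(3))   # 8 KiB random chunks

base = len(zlib.compress(A + B + C, 9))
with_repeat = len(zlib.compress(A + B + C + B, 9))

# Random data is essentially incompressible, so `base` is about 24 KiB,
# but the second copy of B costs only a handful of match tokens:
assert with_repeat - base < len(B) // 10
print(base, with_repeat)
```

    An LZ78-style coder would have to rebuild B's phrases incrementally and would gain far less from the repetition.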


    Compared to LZ78, LZ77 has the disadvantage of having to transmit the length of the repetition, and of transmitting a window offset instead of a dictionary offset. Assuming dictionary entries are created less often than once per input byte, window offsets would probably be bigger than dictionary offsets. OTOH, LZ77 has the advantage over LZ78 of being able to address any subsequence in the data window.

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    While I am by no means an expert in compression, it seems to me that a new classification of compression methods would be useful to better describe programs such as bpe, krc, diqs, tree, and possibly others.

    As I understand it, these programs are classified as "Dictionary" compressors. I think this is somewhat misleading because, for the most part, none of these programs care about "words" as we think of them normally (unlike xwrt, I think) and the symbols can be composed of other symbols in addition to letters.
    Dictionary has a different meaning when used in the context of compression. A dictionary compressor replaces strings with shorter references to a "dictionary." Pretty much any algorithm that outputs a stream of references is going to fall into this category.

    See: http://en.wikipedia.org/wiki/Dictionary_coder

    Therefore, it seems to me that a description such as "Suffix tree transformation" would more accurately describe what these programs do. They all create symbols for the "best" suffix(es) and then use the new symbols in place of the suffixes until they run out of good suffixes, and then write the compressed output using the transformed suffix tree.

    Thoughts?
    I think it's a good idea to start a thread about this general class of algorithms, and I was considering starting one.

    This type of algorithm is like a pure form of dictionary compression, because, unlike LZ77, the dictionary is made explicit and it's a distinct thing that exists separate from the text. The dictionary can be sent as a separate part, or it can be integrated into the text with escapes to signal the definition of a phrase. Tree uses the second technique (with lots more details). Decompression is very fast, comparable to LZ77. Being able to choose exactly what's in the dictionary makes it possible to use dictionary space efficiently.

    This class of algorithm has been described and named before. It seems equivalent to grammar-based code (http://www.mit.edu/~6.454/www_fall_2...r_and_yang.pdf TL;DR). Larsson and Moffat referred to it as offline dictionary-based compression (http://www.larsson.dogma.net/dcc99.pdf). I prefer offline dictionary-based compression, because it makes more sense.

    Quote Originally Posted by Piotr Tarsa View Post
    A suffix is an ending of a sequence. "Hierarchical Suffix Substitution" would then imply that you're substituting some suffix with a code and repeating that process. That's obviously wrong.
    Right. What Kennon is talking about is factors or substrings, not suffixes.

    LZ78 builds up a dictionary based on the structure of matches, but LZ77, which is the prevalent LZ variant today, does not have a dictionary of entries but rather a window of the most recent past data. Therefore LZ77 can deal pretty well with long, low-frequency matches. The only requirement is that the repeated string, plus the data between the original string and the repetition, must fit in the window.
    I wouldn't say it's LZ78. LZ78 doesn't get to choose the dictionary, it's forced to create a new phrase every time it outputs something.

    It's certainly closer to LZ than any other algorithm I know of. The practical difference from LZ77 is that the phrases are defined explicitly when they first appear, rather than when they're referenced. Consequently, the reference is just one number, instead of (distance, length).
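    As a toy illustration of that difference (my own sketch, not Tree's actual stream format), a decoder for such a define-in-place stream needs only literals, in-place definitions, and single-number references:

```python
# Hypothetical define-in-place stream: a phrase is defined at its first
# occurrence (with an escape tag) and later referenced by a single id,
# instead of LZ77's (distance, length) pairs.
DEFINE, REF, LIT = "DEF", "REF", "LIT"

def decode(stream):
    phrases, out = [], []
    for tag, payload in stream:
        if tag == LIT:
            out.append(payload)
        elif tag == DEFINE:           # first occurrence: emit and remember
            phrases.append(payload)
            out.append(payload)
        else:                         # later occurrence: one number suffices
            out.append(phrases[payload])
    return "".join(out)

stream = [
    (DEFINE, "the quick "),   # phrase 0, defined where it first appears
    (LIT, "fox. "),
    (REF, 0),                 # reference carries no length, just an id
    (LIT, "dog."),
]
assert decode(stream) == "the quick fox. the quick dog."
```

    Decompression is a single sequential pass with one list lookup per reference, which is why it can be as fast as LZ77.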
    Last edited by nburns; 22nd June 2014 at 00:16.

  11. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I think bpe, diqqs and krc use arrays
    krc uses the .NET Dictionary class to store/compare unique keywords and double counts (my private krc version stores up to 45 million unique keywords in it):
    http://msdn.microsoft.com/en-us/libr...px?cs-lang=cpp
    Last edited by Sportman; 22nd June 2014 at 00:46.

  12. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by lunaris View Post
    Use the same memory ram size for compression and decompression.
    Ah, I see. Tree is not symmetric. Compression is done with all of the data in RAM including the dictionary. For decompression, only a window of data and the dictionary are kept in RAM. I am not so sure about the others, but am pretty sure they are also not symmetric.

  13. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Suffix is and ending of a sequence. Hierarchical Suffix Substitution would then imply you're substituting some suffix with a code and repeat that process. That's obviously wrong.
    While Tree uses what I have called a truncated suffix tree, it really looks for common prefixes. My description tends to be tainted by my implementation and my lack of compression experience. The structure is essentially equivalent to an LCP array if you overlook details like RAM usage and leaves.

    You are right about LZ77. That was a stupid statement on my part.

    I suppose you could think of Tree's algorithm as a way to get the best of both LZ77 and LZ78. It has the ability to find long, low-frequency matches and the ability to use dictionary offsets in place of window offsets.
    Last edited by Kennon Conrad; 22nd June 2014 at 04:40.

  14. #14
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Dictionary has a different meaning when used in the context of compression. A dictionary compressor replaces strings with shorter references to a "dictionary." Pretty much any algorithm that outputs a stream of references is going to fall into this category.

    See: http://en.wikipedia.org/wiki/Dictionary_coder
    I was thinking more about this definition from the LTCB site: "Dict (Dictionary). Symbols are words, coded as 1 or 2 bytes, usually as a preprocessing step." LZ algorithms are listed separately there by LZ type.

    These programs all fit nicely under the Dictionary Coder as described by Wikipedia, but then again, so do LZ algorithms. I was trying for something more narrow that would separate the algorithm class from LZ.

    Quote Originally Posted by nburns View Post
    This class of algorithm has been described and named before. It seems equivalent to grammar-based code (http://www.mit.edu/~6.454/www_fall_2002/emin/kieffer_and_yang.pdf TL;DR). Larsson and Moffat referred to it as offline dictionary-based compression (http://www.larsson.dogma.net/dcc99.pdf). I prefer offline dictionary-based compression, because it makes more sense.
    I like "Grammar Transformation". It gives me a better sense of what the compressor is doing than "Dictionary Coder". The term dictionary confuses me because I would expect to be able to look a word up in the dictionary, but in this case, if I tried looking up a "word", I would often find the "word" is made up of other "words" and need to look those up too.
    Last edited by Kennon Conrad; 22nd June 2014 at 04:42.

  15. #15
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    krc use the .NET Dictionary Class to store/compare unique keywords and double counts (my private krc version store till 45 million unique keywords in it):
    http://msdn.microsoft.com/en-us/libr...px?cs-lang=cpp
    So krc uses hash tables. Interesting. Tree uses trees, diqqs uses suffix arrays, and krc uses hash tables. Lots of ways to get there.....
    Last edited by Kennon Conrad; 22nd June 2014 at 04:43.

  16. #16
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    fujimap is a library for a space-efficient associative map.
    https://code.google.com/p/fujimap/

  17. #17
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    it seems to me that a new classification of compression methods would be useful to better describe programs such as bpe, krc, diqs, tree, and possibly others.
    What about using "pl dict" = pointerless dictionary?

    http://en.wiktionary.org/wiki/pointerless (plus first text under "quotations")

    pointerless
    1.(computing, programming) Without the use of pointers.

    1989, Allen Kent, James G Williams, Rosalind Kent, Encyclopedia of Microcomputers: Volume 4 We will see, however, that the pointerless techniques are seldom satisfactory. Implementations of trees tend to be more complex than those of ordered lists.
    Last edited by Sportman; 23rd June 2014 at 22:24.

  18. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Tree is not symmetric. Compression is done with all of the data in RAM including the dictionary. For decompression, only a window of data and the dictionary are kept in RAM. I am not so sure about the others, but am pretty sure they are also not symmetric.
    krc compresses the same way it decompresses (with all data in RAM, including the dictionary), but decompression runs backwards and doesn't need to generate the dictionary, calculate profit, or sort and test profit, so it is much quicker.

    So I would say krc is symmetric in memory use (except for the memory used for the temporary dictionary build-up) and asymmetric in speed.

  19. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I was thinking more about this definition from the LTCB site: Dict (Dictionary). Symbols are words, coded as 1 or 2 bytes, usually as a preprocessing step. LZ algorithms are listed separately by LZ type.

    These programs all fit nicely under the Dictionary Coder as described by Wikipedia, but then again, so do LZ algorithms. I was trying for something more narrow that would separate the algorithm class from LZ.
    Dictionary compression is one of a couple of broad categories of compressors, a category that includes a variety of algorithms, including LZ-derived ones and some that are almost nothing like LZ.

    Dictionary compression actually fits Tree extremely well. To me, it seems like a better description of Tree than LZ77, because LZ77's sliding window doesn't remind me much of a dictionary.

    I like "Grammar Transformation". It gives me a better sense of what the compressor is doing than "Dictionary Coder". The term dictionary confuses me because I would expect to be able to look a word up in the dictionary, but in this case, if I tried looking up a "word", I would often find the "word" is made up of other "words" and need to look those up too.
    This might surprise you, but real dictionaries work that way. In real dictionaries, words are defined using other words, which have their own entries, and there are lots of cycles. You have to already know words to use real dictionaries, so you can't use them to learn to read. In a sense, dictionaries generated by Tree are better than real dictionaries, because Tree dictionaries don't have cycles, so you'll never get stuck, and they don't require a priori knowledge of words, so they have everything you need to know within them.
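    That acyclicity is what makes expansion trivially safe. A sketch (hypothetical entry layout, not Tree's real one): if each entry may reference only literals and earlier entries, lookup always bottoms out.

```python
# Sketch of an acyclic, Tree-style dictionary: each entry is a list of
# string literals and ids of EARLIER entries, so recursive expansion
# cannot cycle and always terminates at literals.
def expand(entry, dictionary):
    out = []
    for item in entry:
        if isinstance(item, str):
            out.append(item)                          # literal: done
        else:
            out.append(expand(dictionary[item], dictionary))  # earlier entry
    return "".join(out)

dictionary = {
    0: ["media"],
    1: ["wiki"],
    2: [0, 1],        # "mediawiki", defined from entries 0 and 1
}
assert expand(dictionary[2], dictionary) == "mediawiki"
```

    A real dictionary, by contrast, is a graph with cycles: following definitions can loop forever.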

    I'm glad that you like grammar-based code, because when you discover or invent something in computer science, it's a good thing to connect it to prior research, even though it might make the invention seem less significant on the surface. The fact that you can point to where some other researcher already described the idea and said that it's a good idea will make it easier to get CS people to pay attention and take your ideas seriously, especially for an amateur computer scientist. An important part of research is doing literature review and showing how your ideas relate to existing work. And it shows that you did your homework and know what you're talking about.
    Last edited by nburns; 22nd June 2014 at 14:45.

  20. #20
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I understand that Tree falls under the category of Dictionary compression, along with probably over half the compressors in the world. If so, the term provides less than 1 bit of information. I would like a term that provides perhaps 3 or 4 bits of information about the type of compressor.

    Here's a list of the names derived from suggestions above:

    1. Recursive Pattern Substitution
    2. Dictionary Transformation
    3. Recursive Symbol Transformation
    4. Hierarchical Common Prefix Transformation
    5. Pointerless Dictionary
    6. Grammar Transformation

    It seems like any of these is more descriptive of the general method than "Dictionary Encoder". I find the ones that include "recursive" or "hierarchical" to be the most descriptive. There are clearly better names available than "Suffix Tree Transformation". Perhaps "Grammar Transformation" is best since there seems to be some precedent for using that term.

  21. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    `Explicit-entries-bounds LZ78`
    (Maybe shortened to `explicit-entries LZ78` or even `explicit LZ78`)

    All of the previous LZ78 coders used implicit bounds for entries, derived from the structure of matches. `Tree` includes explicit marks for dictionary entries in the code stream, AFAIU. And that's the biggest difference.

    BTW:
    I feel there's some analogy to Future-LZ from Bulat. Future-LZ is a derivative of LZ77 where repetition marks are clustered at the place of the first occurrence of a repetition instead of being scattered all over the file. `Tree` also defines repetitions at the place of first occurrence.
    Last edited by Piotr Tarsa; 23rd June 2014 at 19:25.

  22. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I understand that Tree falls under the category of Dictionary compression, along with probably over half the compressors in the world. If so, the term provides less than 1 bit of information. I would like a term that provides perhaps 3 or 4 bits of information about the type of compressor.
    You're right. It's very general, like "C program."

    It seems like any of these is more descriptive of the general method than "Dictionary Encoder". I find the ones that include "recursive" or "hierarchical" to be the most descriptive. There are clearly better names available than "Suffix Tree Transformation". Perhaps "Grammar Transformation" is best since there seems to be some precedent for using that term.
    Precedent is good.

  23. #23
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    I think "Offline/Static Dictionary encoder" is still good, because Tree is a real offline/static dictionary encoder: it does all of its processing only once, over the entire file. LZ encoders repeat lots of cycles over various segments of the file. Remember static compilation vs JIT compilers.

  24. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by lunaris View Post
    I think "Offline/Static Dictionary encoder" is still good, because Tree is a real offline/static dictionary encoder: it does all of its processing only once, over the entire file. LZ encoders repeat lots of cycles over various segments of the file. Remember static compilation vs JIT compilers.
    Static sounds good, except that I think static already has a different meaning. A static dictionary seems to be a dictionary that never changes and is reused on multiple inputs.

  25. #25
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by nburns View Post
    Static sounds good, except that I think static already has a different meaning. A static dictionary seems to be a dictionary that never changes and is reused on multiple inputs.

    Static symbol encoding
    Static grammar encoding.

  26. #26
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    "Offline dictionary-based compressor" is a good name, IMO. Dictionary methods like LZ78 and LZ77 are clearly online. Tree is clearly offline. The only problem is that it doesn't exactly roll off the tongue.

  27. #27
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    A few random thoughts in response to all the helpful replies:

    I think Tree, bpe, krc, and diqqs are sufficiently different from the algorithms developed by Lempel and Ziv that the LZ name shouldn't apply. Lempel and Ziv developed online techniques with implicit dictionaries. These other programs use offline techniques and explicit dictionaries (well, at least Tree for sure). At what point do we stop giving Lempel and Ziv credit for algorithms?

    It's interesting to hear about Future-LZ. It sounds promising. Maybe I missed something, but it sounds like forward references are counted in uncompressed bytes, which doesn't seem as efficient as counting in compressed symbols. Perhaps there are benefits I am not aware of to using offsets based on uncompressed data rather than compressed data.

    While algorithms such as bpe, diqqs, krc and tree are all offline, there doesn't seem to be anything preventing online recursive matching algorithms from being developed, other than a loss of effectiveness. Maybe it's a silly point, but for that reason I am hesitant to include "offline" in the category name for these types of algorithms.

    I'm also hesitant about including "static". Tree's decoder has to deal with a dynamic dictionary with (some) dynamic codes.

    I see that another developer called their technique "Multilevel Pattern Matching Grammar Transformation". I like that because the name tells me what the algorithm does in words that can be helpful to someone who knows little or nothing about compression. Another name previously used is "Multilevel Frequent Pattern Mining".
    Last edited by Kennon Conrad; 24th June 2014 at 06:45.

  28. #28
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    A few random thoughts in response to all the helpful replies:

    I think Tree, bpe, krc, and diqqs are sufficiently different from the algorithms developed by Lempel and Ziv that the LZ name shouldn't apply. Lempel and Ziv developed online techniques with implicit dictionaries. These other programs use offline techniques and explicit dictionaries (well, at least Tree for sure). At what point do we stop giving Lempel and Ziv credit for algorithms?
    Sometimes I wonder the same thing, considering the incredibly wide range of algorithms that are called LZ. I don't think Tree fits even a liberal definition of LZ, though.

    It's interesting to hear about Future LZ. It sounds promising. Maybe I missed something, but it sounds like forward references use uncompressed bytes, which doesn't seem as efficient as using compressed symbols to count. Perhaps there are benefits I am not aware of to using offsets based on uncompressed data rather than compressed data.

    While algorithms such as bpe, diqqs, krc, and tree are all offline, there doesn't seem to be anything preventing online recursive matching algorithms from being developed, other than a loss of effectiveness. Maybe it's a minor point, but for that reason I am hesitant to include "offline" in the category name for these types of algorithms.
    If it was online, then it would have to work strictly left-to-right. It's as if the algorithm was part of a pipeline and it had to read and write in real time, without being allowed to stop and buffer the whole thing, because it's not allowed to hold up the pipeline indefinitely.

    Tree would break down if it had to be online, because it wouldn't know which phrases were going to repeat in the future. It would have to create dictionary phrases without knowing what was coming, and that would pretty much make it a variety of LZ78.
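    For contrast, a bare-bones online LZ78 parser makes every decision strictly left-to-right, from the dictionary built so far, with no lookahead. A minimal sketch (function names are mine, for illustration only):

    ```python
    def lz78_parse(data: bytes):
        """Online LZ78: emit (dictionary index, next byte) pairs, extending
        the dictionary strictly left-to-right with no knowledge of the future."""
        dictionary = {b"": 0}      # phrase -> index; index 0 is the empty phrase
        tokens = []
        phrase = b""
        for i, byte in enumerate(data):
            candidate = phrase + bytes([byte])
            if candidate in dictionary and i != len(data) - 1:
                phrase = candidate            # keep extending the current match
            else:
                tokens.append((dictionary[phrase], byte))
                if candidate not in dictionary:
                    dictionary[candidate] = len(dictionary)
                phrase = b""
        return tokens

    def lz78_unparse(tokens):
        phrases = [b""]
        out = bytearray()
        for index, byte in tokens:
            phrase = phrases[index] + bytes([byte])
            phrases.append(phrase)          # decoder rebuilds the same dictionary
            out += phrase
        return bytes(out)
    ```

    The dictionary grows by exactly one phrase per token regardless of whether that phrase ever repeats later — which is precisely the blindness the post above describes.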

  29. #29
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Future-LZ is a modification of LZ77 that stores matches at the match source rather than the destination position. For a compressor like SREP, which uses only long matches, this decreases the amount of data the decompressor has to keep around, and therefore the amount of memory required for decompression. In my tests, decompression required RAM equal to about 10% of the file size. Moreover, since we know the order of access to this stored data, it can be swapped from RAM to a disk file without losing efficiency, so decompression can be performed using only about 100 MB of RAM.
    Future-LZ compression is performed in two passes: in the first pass, matches are found and saved in memory. The matches are then sorted by their source address and the second pass is performed, storing each match in the compressed file at the position of the match source. So Future-LZ compression requires a seekable input file at the compression stage. Decompression accesses both the input and output files sequentially, and is thus very fast.
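    The scheme described above might be sketched roughly like this in Python. This is a toy model of the idea only — match finding, overlap handling, and the swapping of saved data to disk are all omitted, and the record layout and function names are my own invention, not SREP's actual format:

    ```python
    def to_future_lz(lz77_matches):
        """Turn ordinary LZ77 matches (dest, src, length) into Future-LZ
        records (src, dest, length), sorted by source position."""
        return sorted((src, dest, length) for dest, src, length in lz77_matches)

    def future_lz_decompress(literals, records, out_len):
        """Toy Future-LZ decoder: each record, encountered at its source
        position, saves the source bytes under the destination where they
        recur, so the decoder never has to look back at old output."""
        pending = {}                 # dest position -> saved bytes
        out = bytearray()
        lit = iter(literals)
        ri = 0
        while len(out) < out_len:
            pos = len(out)
            if pos in pending:
                out += pending.pop(pos)       # replay a saved match
            else:
                out.append(next(lit))         # otherwise take a literal
            # once a record's source bytes are complete, save them for later;
            # the source region of the output could then be discarded
            while ri < len(records) and records[ri][0] + records[ri][2] <= len(out):
                src, dest, length = records[ri]
                pending[dest] = bytes(out[src:src + length])
                ri += 1
        return bytes(out)
    ```

    Since records arrive sorted by source, the decoder holds only the match data still awaiting its destination — which is why memory use tracks the pending matches rather than the whole file.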

  30. The Following 2 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    Kennon Conrad (24th June 2014),Paul W. (9th May 2015)

  31. #30
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Future-LZ is a modification of LZ77 that stores matches at the match source rather than the destination position. For a compressor like SREP, which uses only long matches, this decreases the amount of data the decompressor has to keep around, and therefore the amount of memory required for decompression. In my tests, decompression required RAM equal to about 10% of the file size. Moreover, since we know the order of access to this stored data, it can be swapped from RAM to a disk file without losing efficiency, so decompression can be performed using only about 100 MB of RAM.
    Future-LZ compression is performed in two passes: in the first pass, matches are found and saved in memory. The matches are then sorted by their source address and the second pass is performed, storing each match in the compressed file at the position of the match source. So Future-LZ compression requires a seekable input file at the compression stage. Decompression accesses both the input and output files sequentially, and is thus very fast.
    You could probably use any LZ algorithm as the first stage and permute the matches to make it Future-LZ. It's almost like a post-processor for LZ, then.
