
Thread: GLZA

  1. #1
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    GLZA

    Here is a paper written by Paul Wilson and me about GLZA titled Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios With LZ-Class Decompression Speed.

    GLZA code, executables, and further discussion can be found here: http://encode.ru/threads/1909-Tree-alpha-v0-1-download.
    Attached Files
    Last edited by Kennon Conrad; 30th March 2016 at 02:47. Reason: Added paper

  2. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Cyan (14th February 2017),encode (31st March 2016)

  3. #2
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    is it possible to disable Tree/GLZA's ability to find the most frequent symbols across many lines, and instead find them only within a single line? Example:

    aaabbb<CR-LF>
    bbbccc

    has these symbols: aaa + bbb + ccc, but NOT bbb<CR-LF>bbb. Serializing 20 GB into one line is impractical.
    Is there also any other input file size limit?

    Best regards,

    FatBit

  4. #3
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by FatBit View Post
    Dear Mr. Conrad,

    is it possible to disable Tree/GLZA's ability to find the most frequent symbols across many lines, and instead find them only within a single line? Example:

    aaabbb<CR-LF>
    bbbccc

    has these symbols: aaa + bbb + ccc, but NOT bbb<CR-LF>bbb. Serializing 20 GB into one line is impractical.
    Is there also any other input file size limit?

    Best regards,

    FatBit
    Dear FatBit,

    Yes, it is possible if you are willing to modify the source code. To accomplish this, the score_node function would need to be changed. Shortly after the "top_score_loop" label, there is a line of code that looks like this:

        if (node_instances >= 2)  {


    that could be changed to this:

        if ((node_instances >= 2) && (node_ptr->token != 0xA) && (node_ptr->token != 0xD)) {


    Plus you would need to change the score_nodes function from:

    #define score_nodes(max_token) \
    { \
    while (i1 <= max_token) { \
    ...


    to
    #define score_nodes(max_token) \
    { \
    while (i1 <= max_token) { \
    if ((i1 == 0xA) || (i1 == 0xD)) \
    base_node_child_num_ptr += 16; \
    else { \
    ...
    }


    This would block deduplications that include the carriage return or linefeed symbol.

    I don't really understand why you want to do this, so if I am missing something, please let me know. In the example you provided GLZA would only deduplicate "bbb", but perhaps the implication is that there are more "bbb" strings elsewhere in the file.

    Best Regards,

    Kennon
    Last edited by Kennon Conrad; 25th January 2016 at 00:56.

  5. #4
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    thank you for your reply. The reason is simple: it is related to the encoding of chemical structures as SMILES strings. Because each molecule's representation is independent (one line), it is not right to compute common parts across many lines together.

    Best regards,

    FatBit

  6. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by FatBit View Post
    Dear Mr. Conrad,

    thank you for your reply. The reason is simple: it is related to the encoding of chemical structures as SMILES strings. Because each molecule's representation is independent (one line), it is not right to compute common parts across many lines together.

    Best regards,

    FatBit
    Ah, I should have known. In that case, it may be better to allow the carriage returns and linefeeds to be put at either the start or the end of deduplicated strings. It seems like putting them at the start would be better. If so, then the code would not skip strings starting with carriage returns and linefeeds, but would instead need to allow those symbols within a string until it saw one that was not a carriage return or linefeed.

    BTW, I forgot to answer a question in your first post. GLZA should work for files up to 1 GB. Beyond that, I haven't done a lot of testing. It will have problems at some point if the file is big enough to cause more than 10 million dictionary entries.
    Last edited by Kennon Conrad; 25th January 2016 at 10:07.

  7. #6
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    thank you for your reply. For example, Benson's group contribution method has ~300-400 structure fragments, so I am not worried about tens of millions of records. I have both smaller and bigger data files, so I can evaluate step by step.

    substance  formula           SMILES
    ethane     CH3-CH3           CC
    propane    CH3-CH2-CH3       CCC
    butane     CH3-CH2-CH2-CH3   CCCC
    alkane     CH3-(CH2)n-CH3    C(C)nC

    Best regards,

    FatBit

  8. #7
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Dear FatBit,

    I meant 1 GB rather than 1 MB and fixed my response. You should be fine with tens of millions of records. On the other hand, I think that for data like this LZHAM might provide a better compression ratio. It is an open-source program and has fast decoding. The data has a different structure than regular text, and LZHAM will probably do better with the local repetition.

    Best Regards,

    Kennon

  9. #8
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    I also have to clarify my response. By record I mean a fragment/record in the dictionary, not the number of substances. And the goal is not only compression but also "catching" the different fragments.

    Best regards,

    FatBit

  10. #9
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    GLZAcompressFast.c
    Code:
    line 28
    GLZAcompressFast [-p#] [-w#] [-c#] [-m#] <infilename> <outfilename>
    But -m# is not implemented.
    Code:
    line 2135 - 2137
    back_nodes = (struct back_node *)malloc(sizeof(struct back_node) * 100000000);
    match_start_nodes = (struct match_start_node *)malloc(sizeof(struct match_start_node) * 30000000);
    match_list = (unsigned int *)malloc(sizeof(unsigned int) * 30000000);
    I think the memory requirement is too large for small files.
    I tested with 1/10 of the memory, and it works.

  11. #10
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    I think the memory requirement is too large for small files.
    I tested with 1/10 of the memory, and it works.
    You are correct. I will try to allocate memory more appropriately on the next release.

  12. #11
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    could you be so kind as to help me? I am not able to generate symbol frequencies as a text file the way you did (as in the Tree alpha v0.1 download thread, post #424). Please give me a step-by-step batch solution. A second question: do you provide x64 builds only? I am not able to run your programs on Win 7 x32 Professional Czech edition; the message "This is not a Win32 application" occurs, and my access to x64 machines is limited.

    Best regards,

    FatBit

  13. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Dear Fatbit,

    I fixed a couple of things in the code that were problems for 32-bit builds and posted the source and executables in the Tree thread. If you have any problems, please let me know. I didn't test much.

    To generate the list of symbols in the encoded stream, you can run the batch file to compress and decompress a file using "TestGLZA32 <filename>" (where <filename> is your file's name). You could then rerun GLZAencode32 like this "GLZAencode32 -v <filename>.glzc <filename>.glze >symbol_list.txt" and the symbol list will be put into a file named symbol_list.txt. You can adjust the dictionary size using the -m command line argument (production cost in bits) with GLZAcompress32. The larger the value, the smaller the dictionary. Default is 7.0. For example "GLZAcompress32 -m100 <filename>.glzf <filename>.glzc" will typically cause the dictionary to be much smaller.

    Best Regards,

    Kennon
    Last edited by Kennon Conrad; 29th January 2016 at 21:27.

  14. #13
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Conrad,

    I used GLZA 03b and tried TestGLZA.bat data.asc, and on W7 x32 Prof Czech I obtained for all programs: "*.exe is not compatible with the running version of Windows. Check whether you need the x32 or x64 version and contact the software publisher."

    On W7 X64 Prof Czech OK.

    Best regards,

    FatBit

  15. #14
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I had a typo in my prior post. You should use TestGLZA32.bat to use the 32-bit executables.

  16. #15
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    The GLZA paper was added to the first post.

  17. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    encode (31st March 2016)

  18. #16
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Kennon, thanks very much for the paper and for sharing Tree/GLZA in general.

    Unfortunately the paper doesn't help me much in understanding the details of Tree/GLZA.

    How exactly is entropy coding done? How are literals sent, and how are dictionary indexes sent? Is there a limit on the number of dictionary entries? Is it per first-char? How are the localized strings handled? What does the "ergodic" bit do? Etc.
    Last edited by cbloom; 3rd August 2016 at 19:59.

  19. #17
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Charles,

    Maybe I can answer some of your questions. (I'm coauthor on the paper, but Kennon came up with most of it all by himself, and designed and wrote all the code and knows all the details. He's at DCC, and will probably correct some details when he has a chance.)

    There's no localization per se, if I understand your question. GLZA just parses UTF8 characters up front, in the prepass that characterizes the input as UTF8 or not, normal-looking space-separated text or not, and strided binary or not, then massages the UTF8 characters (or bytes) a tiny bit in preparation for IRR grammar construction. (That's all in GLZAformat.c, which is surprisingly simple.) None of that matters a whole lot---if you treat a normal space-separated-words UTF8 text file as a binary bytes file, it still compresses well. The biggest wins are using estimates of encoded string length to guide mostly-top-down grammar construction (dictionary string selection), and GLZ coding (using the natural first-occurrence order to implicitly communicate dictionary offsets).

    One of the cool things about grammar-based compression is that it doesn't have any particular difficulty dealing with multibyte and variable-length characters. Once you've initially parsed them and turned them into discrete symbols, they're pretty much just atomic symbols for most purposes. The decoder doesn't know much about it, or care---it's mainly just concatenating variable-length byte strings anyhow, and doesn't care whether they represent fixed-size ASCII characters, binary bytes, or variable-length Cyrillic or Chinese characters. You don't need much postprocessing, and what little there is can be integrated into the decoder with very little overhead. (No separate postprocessing of everything to get back to UTF8.) This could easily be generalized to handle non-UTF8 encodings too, like Big5 Chinese & Japanese, or the usual Cyrillic code pages.

    The symbol size is 32 bits, so you could have billions of dictionary entries ("productions" in "grammar-based"-speak). Right now there's a more serious implementation limit on total file size, but that is trivial and should be easy to fix, so that you could handle inputs of many gigabytes (like all of Wikipedia) with up to a few billion dictionary entries.

    The ergodicity bit distinguishes symbols that tend to repeat pretty soon from those that don't, so that the decoder knows whether to put them in one of the MTF queues when it sees them. (And not to put others in the MTF queues, which would waste code space and slow decoding down---you don't want to waste time looking in MTF lists for things that aren't likely to be there and get a shorter code than they otherwise would, based on frequency in the normal way, because they're there.) In text, for example, you tend to have some very common strings with relatively stable frequencies (" if the", " of a", etc.) that are best coded by frequency, and other strings in between that are better coded by recency. (That's basically the same idea as "stop words" in various text & query processing contexts; you don't want to key off of stuff that's both frequent and ubiquitous, and want to focus on stuff that's specific to the current topic and/or mode of speech---tense, person, etc.)

    I'm not sure I understand what you're asking with "is it per first-char?". The order 1 Markov model basically predicts the first character of the next symbol based on the last char or few chars of the current symbol. It's pretty crude at this point, but good enough to handle some very important common cases, like digits tending to be followed by digits or commas or decimal points, and Chinese chars by Chinese chars (vs. Latin-1 by Latin-1, when you have a mix like in HTML Chinese in, say, Chinese Wikipedia).

    As to how literals are sent, the first occurrence of a character (for UTF8) is sent with a special code, saying to add it to the dictionary. Subsequent occurrences are sent with a normal length-1 dictionary code, saying it's a known symbol in the dictionary and not a new char/string to be put in the dictionary.
    Last edited by Paul W.; 2nd April 2016 at 01:52. Reason: typo

  20. The Following 2 Users Say Thank You to Paul W. For This Useful Post:

    encode (31st March 2016),inikep (31st March 2016)

  21. #18
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Hi Charles,

    Paul's comments are accurate, but I have a few additional comments. Literals are not handled much differently than other dictionary entries. The main difference is that when the first instance of a literal is sent, it is sent with a special length code indicating a literal. The literal is sent in a fixed-width field whose size corresponds to the largest literal value: for standard ASCII text it is 8 bits (or 7 if only ASCII characters 0x0 - 0x7F are used), and up to 19 bits for UTF-8 compliant text. Once the initial literal is sent, it is treated the same as any other dictionary symbol.

    The dictionary is kept as a set of lists per first (sub)symbol, with a list per dictionary code length. The set of dictionary lists for each leading symbol is mapped into up to 4096 code bins. Dictionary symbols are sent by first sending the leading symbol (with adaptive leading-symbol predictions based on the trailing symbol), then range coding the bin, and then, if the bin is for a code length of more than 12 bits, also range coding the index within that code-length list.

    Best Regards,

    Kennon

  22. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    encode (31st March 2016)
