
Thread: parsing methods

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    can anyone point me to some good resources on different parsing methods for data compression?

    what i know already is:
    - greedy parsing (the obvious one),
    - lazy parsing,
    - flexible parsing,
    - parsing methods from cbloom.com.

    what i want to know is how parsing is done in:
    - 7-zip,
    - cabarc.

    does anybody know?

    i want to make a lz77 compressor (though not in the nearest future) with block-sorting based matching (as i described in some post on this forum) and without statistical modelling, like jørgen ibsen's aplib, but with a difference in literal coding: i want to cluster flags for literals (i.e. code lengths of literal runs instead of flagging every literal). i also want to know how some good weighted parsing can be done.

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by donkey7
    - parsing methods from cbloom.com.


    1. Greedy Parsing
    2. Lazy Matching (aka Lazy Evaluation)
    3. Flexible Parsing
    4. Optimal Parsing (proposed by Storer and Szymanski)
    5. Dynamic Programming

    Quote Originally Posted by donkey7
    what i want to know is how parsing is done in:
    - 7-zip,
    - cabarc.
    7-Zip (LZMA), like CABARC (LZX) (and ROLZ2/ROLZ3), uses Dynamic Programming. This scheme is a kind of Optimal Parsing which can cope with adaptive encodings. Dig through this forum for more info!

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    dynamic programming is a general algorithmic idea. afaik, it means building the solution for a larger input from solutions for smaller ones

    cabarc and 7-zip use optimal parsing with pricing of matches/literals (i.e. they evaluate how many bits are required for each item and optimize the bitcount), while the initial optimal parsing idea invented by SS just optimized the *amount* of literals/matches

    so, parsings are
    1. greedy
    2. lazy
    3. flexible
    4. number of matches-optimal
    5. bit-optimal

    lazy parsing: after you have found that at pos P the best match has length LEN, you try the match at pos P+1. if its length is >LEN, then you abandon the match at the current pos. there are two subapproaches - you either encode the match at P+1 unconditionally, or check the match at P+2, which may be even longer, and so on
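
    the one-step lookahead just described can be sketched in python. this is a toy model, not code from any of the compressors mentioned: `longest_match` is a hypothetical brute-force helper (real LZ77 coders use hash chains or trees), and the token format is invented for illustration.

```python
def longest_match(data, pos, max_dist=1024, min_len=3):
    # brute-force match finder (real coders use hash chains / suffix structures)
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - max_dist), pos):
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist

def lazy_parse(data):
    # emit ('lit', byte) / ('match', length, dist) tokens; if the match at
    # pos+1 is longer than the one at pos, abandon the current match and
    # emit a literal instead (the one-step-lookahead variant)
    out, pos = [], 0
    while pos < len(data):
        length, dist = longest_match(data, pos)
        if length:
            next_len, _ = longest_match(data, pos + 1)
            if next_len > length:
                out.append(('lit', data[pos]))   # abandon the match at pos
                pos += 1
                continue
            out.append(('match', length, dist))
            pos += length
        else:
            out.append(('lit', data[pos]))
            pos += 1
    return out
```

    note that the match comparison deliberately allows cand+length to run past pos, so self-overlapping matches (dist < length) work naturally, as in any LZ77 coder.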

    flexible: when you have found a P,LEN match, you check the matches at positions P+1..P+LEN-1 and cut down the current match according to the best *total* length of the cut-down match plus the next match. we may consider lazy matching a stripped-down version of flexible matching where only the P+1 position is checked

    optimal by amount of matches and literals: you find the best match for *each* position of some buffer (say, 100 kb), then you go from the buffer *end* BUFEND and select the first match P1,LEN1 (i.e. minimal P1) such that it ends at the buffer end: P1+LEN1>=BUFEND, then select the minimal P2 that ends at P1: P2+LEN2>=P1, and so on. literals are considered matches of length 1. this strategy finds the parsing that requires the minimal possible amount of matches+literals.
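
    as a toy sketch of this backward pass (not taken from any real program): ops[i] holds the minimal number of tokens needed to encode data[i:], filling the table from the end, with literals treated as pseudo-matches of length 1 and any prefix of the longest match allowed. `longest_match` is the same hypothetical brute-force helper as above.

```python
def longest_match(data, pos, max_dist=1024, min_len=3):
    # brute-force helper, for illustration only
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - max_dist), pos):
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist

def count_optimal_parse(data, min_len=3):
    # ops[i]: minimal literals+matches to encode data[i:], filled backward
    n = len(data)
    ops = [0] * (n + 1)
    step = [1] * (n + 1)                        # token length chosen at each pos
    for i in range(n - 1, -1, -1):
        ops[i], step[i] = ops[i + 1] + 1, 1     # literal = pseudo-match of length 1
        max_len, _ = longest_match(data, i)
        for length in range(min_len, max_len + 1):
            if ops[i + length] + 1 < ops[i]:
                ops[i], step[i] = ops[i + length] + 1, length
    i, lengths = 0, []                          # recover the chosen token lengths
    while i < n:
        lengths.append(step[i])
        i += step[i]
    return ops[0], lengths
```

    on b"abcabcabc" this yields three literals followed by one length-6 match, i.e. four tokens in total.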

    the next improvement may be saving the nearest match for *each* length at *each* position. this allows using a closer match when the full length isn't required. imagine, for example, that P1-P2=5 but LEN2=10. in this case, by using a probably closer match at P2 with LEN=5, we can reduce the compressed size

    bit-optimal: for *each* position in the buffer we evaluate the price (amount of bits) of each possible match (because closer matches almost always have lower prices, we actually evaluate the price of the *nearest* match for *each* length). then, going from the end, we select the cheapest way to the buffer start

    we don't need to save all intermediate data for optimal parsing. actually, for each pos in the buffer, we just keep the lowest price (or amount of literals+matches) found so far and the concrete match (including pseudo-matches of length 1, i.e. literals) that reaches that price. initially all positions are tagged with INT_MAX; then, finding the nearest match of length LEN at pos P, we try to tag all buffer positions from P+PREVLEN to P+LEN-1 with the match just found and its total price (i.e. the best price for P itself plus the price of the new match). if the new price turns out to be smaller than the one already stored at some pos, we replace it with our new price

    then we go from the buffer end. at this moment it's already tagged with the smallest possible price and the last match in the trace of that price. we jump to the beginning of this match and get the next-to-last match in this trace...
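
    the tagging-and-backtrace scheme can be sketched like this. it is a toy model, not Igor Pavlov's actual code: the prices are invented (9 bits per literal, a small flag+length+offset estimate per match), and only the single longest match per position is used, a simplification of the nearest-match-per-length idea above. `longest_match` is again the hypothetical brute-force helper.

```python
def longest_match(data, pos, max_dist=1024, min_len=3):
    # brute-force helper, for illustration only
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - max_dist), pos):
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist

def bit_optimal_parse(data, lit_bits=9, min_len=3):
    # price[i]: cheapest bit count found so far to reach position i,
    # back[i]: (length, dist) of the token that achieves it (length 1 = literal)
    def match_bits(length, dist):
        return 4 + length.bit_length() + dist.bit_length()  # invented price model
    n = len(data)
    price = [float('inf')] * (n + 1)
    back = [None] * (n + 1)
    price[0] = 0
    for i in range(n):
        if price[i] == float('inf'):
            continue
        if price[i] + lit_bits < price[i + 1]:          # tag i+1 with a literal
            price[i + 1], back[i + 1] = price[i] + lit_bits, (1, 0)
        max_len, dist = longest_match(data, i, min_len=min_len)
        for length in range(min_len, max_len + 1):      # tag i+min_len .. i+max_len
            cost = price[i] + match_bits(length, dist)
            if cost < price[i + length]:
                price[i + length], back[i + length] = cost, (length, dist)
    tokens, i = [], n                                   # walk the trace from the end
    while i > 0:
        tokens.append(back[i])
        i -= back[i][0]
    tokens.reverse()
    return price[n], tokens
```

    with these toy prices, b"abcabcabc" costs 3*9 bits of literals plus a 9-bit match, and the backtrace recovers exactly the tokens that produced the cheapest tag at the buffer end.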

    you can find explanations and pseudo-code by Igor Pavlov at this forum, translated and published by encode

  4. #4
    The Founder encode's Avatar
    Quote Originally Posted by Bulat Ziganshin
    dynamic programming is general algorithmic idea. afaik, it means building solution for larger input from solutions for smaller ones
    Malcolm Taylor calls his near-Optimal Parsing scheme, which can cope with adaptive encodings, Dynamic Programming. So do I. Yes, the algorithm divides the input into small blocks (say 4k) and calculates the optimum for each block using static prices.


  5. #5
    Programmer Bulat Ziganshin's Avatar
    "dynamic programming" is a term coined much earlier. i think he means that he uses the d.p. approach in his parser

  6. #6
    The Founder encode's Avatar
    Yepp!

    Quote Originally Posted by Malcolm Taylor
    There is also for lz77 several different optimal parsing methods. One
    (used by quantum IIRC) parses all the matches in the file, and then
    passes backwards over them finding the best match to reach each byte.
    The result is optimal for a static encoding (much like FP).
    Another one (used by 7zip and ROLZ and others), is a dynamic programming
    approach which can cope with adaptive encodings. This often leads to
    some fantastic results. Although I call this optimal, it is still only a
    near optimal method. It always converges to a local minimum, so may not
    always hit the global minimum.

  7. #7
    Member
    hmm, igor's parsing seems fairly similar to charles' parsing. the only difference is the direction of parsing. they both would be optimal with static probabilities or with semi-adaptive entropy coding (e.g. divide the data into n-kilobyte blocks, then do optimal parsing on the n-th block using probabilities from the (n-1)-th block). and they both can be bit-optimal or optimal by amount of symbols (a symbol is a literal or match).

    cbloom wrote on his page that he used lazy parsing only to decide when to jump out of a literal run. Bulat, have you tried pure lz77 with lazy parsing inside literal runs and flexible parsing inside match runs? i think it should be good enough and fairly fast.

    there's one thing in optimal parsing that i don't know how to handle. lzx uses so-called recent matches, i.e. if the current match offset is equal to one of the three recent match offsets, then lzx can output a special short code (1, 2 or 3 respectively) instead of the raw offset. does anybody have a clue how to cope with this? if only the most recent match offset were used it would be easy, but matters get complicated when the number of recent matches is more than one.

    currently i'm thinking of a way to build a virtual suffix tree (a.k.a. an array representation of a suffix tree; i've described that at http://asembler.republika.pl/misc/bwt-lz.txt ). i'm considering modified and combined bentley-sedgewick and manber-myers algorithms for suffix array construction, but so far i haven't come up with something reasonable. (the algorithm for converting a suffix array to a virtual suffix tree that i presented in the article pointed to above is unacceptably slow in degenerate cases.)

    lz77 is getting more and more interesting for me. maybe optimal parsing will be the subject of my master's thesis.

  8. #8
    Programmer Bulat Ziganshin's Avatar
    btw, look at this: http://forum.compression.ru:8080/viewtopic.php?t=1948

    Ilya said that he found a very fast way of bwt sorting

  9. #9
    The Founder encode's Avatar
    Quote Originally Posted by donkey7
    theres one thing in optimal parsing that i dont know how to handle. lzx uses the so called recent matches, ie. if current match offset is equal to one of recent three matches, then lzx can output special short code (1, 2 or 3 respectively) instead of raw offset. anybody have clue how to cope with this? if there were only the most recent match offset used then it would be easy, but matter gets complicated when amount of recent matches is more than one.
    This trick does not belong to Optimal Parsing by itself. All modern LZ77-based compressors use such a trick (LZMA, LZX, RAR's LZH). You just keep an MRU list of recent match offsets, i.e. when you code a match, you remember the offset by storing it in a special list. The decoder does the same job.
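
    a sketch of that bookkeeping (toy prices and hypothetical function names; the real LZX/LZMA codes and bit counts differ): a small MRU list of recent distances, a pricing helper that charges a short index code for a repeat offset, and the move-to-front update that coder and decoder both perform.

```python
def offset_price(dist, recent):
    # toy prices: a hit in the recent-offset list costs a short index code,
    # a raw offset costs roughly its bit length plus a few header bits
    if dist in recent:
        return 2 + recent.index(dist)
    return 4 + dist.bit_length()

def update_recent(dist, recent, depth=3):
    # move-to-front update of the MRU list; the decoder mirrors this exactly
    if dist in recent:
        recent.remove(dist)
    recent.insert(0, dist)
    del recent[depth:]

recent = []
for dist in (100, 200, 100, 50):
    bits = offset_price(dist, recent)
    update_recent(dist, recent)
# recent is now [50, 100, 200]; the second 100 was priced as a cheap repeat
```

    the interaction with optimal parsing that donkey7 asks about comes from the fact that `offset_price` depends on `recent`, which in turn depends on which trace the parser chose; keeping the best trace per position (as in the tagging scheme above) gives one consistent answer.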


  10. #10
    Programmer Bulat Ziganshin's Avatar
    Ilya, the problem is how to calc the price of a match if we don't know whether it will be repdist/repboth or not. but i think it is no problem, because in the algorithm i described you know the best trace to the current position, so you know what the previous match is

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Quote Originally Posted by donkey7
    hmm, igors parsing seems fairly similar to charles parsing. the only difference is the direction of parsing.
    i don't know what "charles parsing" is, but if it is the flexible one, i can't agree that it's the same as the optimal one. you can't find the optimum without traversing from the end; that's the key point in the optimal parsing scheme

  12. #12
    Member
    sorry, i meant storer-szymanski parsing http://cbloom.com/algs/dictionary.html#optimal . in pavlov's parsing we compute the number of bits to code before the current position, and in storer-szymanski's parsing we compute the number of bits to code from the current position.

  13. #13
    Programmer Bulat Ziganshin's Avatar
    if that's true then SS parsing is not optimal. to be exact, to find the optimal encoding you would need to test too many variants, parsing recursively. the backward pass makes finding the optimal trace trivial and fast

  14. #14
    Programmer Bulat Ziganshin's Avatar
    donkey7
    btw, look at SLUG. its decompression speed is amazing and may convince you to use a huffman encoder with a small LZ window for compressed data in your compressed-disk-image project

  15. #15
    Member
    almost nobody is using assembly for decompression. the bsr instruction greatly improves the decoding speed of gamma-encoded values. combine it with flag clustering and you have the fastest decoding in the world ;]
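
    for illustration, here is the idea in python, where int.bit_length() plays the role of bsr (which returns the index of the highest set bit in one instruction): decoding elias gamma codes by finding the leading-zero count of a 32-bit peek in one step instead of looping bit by bit. the encoder is included only for the round trip; values must fit in the 32-bit peek.

```python
def gamma_encode(v):
    # elias gamma: (len-1) zeros, then the value itself msb-first (v >= 1)
    b = bin(v)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode_all(bits):
    # decode a '0'/'1' string of concatenated gamma codes; the leading-zero
    # count of a 32-bit peek is found in one step, the way bsr would do it
    out, i = [], 0
    while i < len(bits):
        window = int(bits[i:i + 32].ljust(32, '0'), 2)
        zeros = 32 - window.bit_length()          # bsr-style: top set bit position
        value = int(bits[i + zeros : i + 2 * zeros + 1], 2)
        out.append(value)
        i += 2 * zeros + 1
    return out
```

    in real assembly the same trick reads a machine word from the bitstream, does one bsr, and shifts; no per-bit branch is needed, which is where the speedup comes from.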

  16. #16
    Programmer Bulat Ziganshin's Avatar
    of course bit decoding will be a bit faster, maybe 1.5x faster. but even on my 1GHz cpu slug decompresses at more than 50mb/sec

  17. #17
    Member
    Join Date
    Aug 2007
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    hello, everyone.
    I have studied the 7-zip source code for a while, but I still don't understand the price mechanism in 7-zip. How does Igor use it?
    Thanks.

  18. #18
    The Founder encode's Avatar
    Hi again! And welcome to my forums! (It's me - Ilia Muraviev)

    Firstly read about Storer and Szymanski parsing:
    http://cbloom.com/algs/dictionary.html#optimal

    The basic idea behind LZMA's parsing is to find the shortest path - i.e. the optimal literal/match sequence.

    In short:
    1. Find all matches for the current block (at each position)
    2. Parse backwards, finding the shortest path and taking into account the real price of each choice - i.e. the literal/match price.

    As you know, a lower price means fewer bits are needed to code the unit (literal or match). During parsing we have many variants to choose from - a set of matches (from MINMATCH to the longest match found) and a literal. So we can determine which unit to choose by selecting the cheapest. I think it's clear.

    For myself, I also cannot fully determine what's going on in the 7-Zip sources. Usually it's easier to write your own code than to decipher the source of others.


  19. #19
    Member
    thank you, Ilia Muraviev.
    thank you sincerely for helping me!

