Thread: Is there any cruncher with statistical output?

  1. #1
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Is there any cruncher with statistical output?

    Is there some kind of "experimental" compressor that reports statistics such as how often particular byte combinations repeat, how often data cannot be compressed and how long those incompressible blocks are, and perhaps which methods can or cannot be applied (Markov models, wave offsets or other compressible bit patterns)? Google did not turn up anything usable. Does every compressor author build their own test suite from scratch, or do they just code up their ideas and see how it goes? Any ideas? The only tool I found that offers some data analysis is Cryptool 2.
    Last edited by Crush; 16th September 2015 at 12:55. Reason: compression, test, statistic, information

  2. #2
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Have you looked at GLZA/Tree on the download page? It is an experimental compressor that gives statistical information on how many times the byte combinations it uses appear, including literals (use the -v command line option when running GLZAencode). The first appearance of each byte combination is marked with an escape symbol and a length code, then put in a dictionary and assigned a dictionary code. It uses Markov modeling to determine which combinations are profitable and should be put in the dictionary. The latest version produces a compressed enwik9 file that is a little more than 14% smaller than that of the leading LZ-based compression algorithm shown on the Large Text Compression Benchmark.

  3. #3
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thank you for the hint. GLZAencode was hidden in the Tree thread. Counting only the number of literals is a bit too simple. Cryptool gives much better output with 2-byte and 3-byte counts, but without any compression view. It would be more interesting to see the behaviour of a standard RLE compressor: how often more complex strings appear as well, and how often uncompressed blocks have to be flagged.
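
    For readers who only want the 2-byte and 3-byte counts mentioned above, a minimal Python sketch could look like the following. It is not Cryptool's or GLZA's code; the function name ngram_counts and the sample input are made up for illustration.

```python
from collections import Counter

def ngram_counts(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Hypothetical sample input; replace with open(path, "rb").read() for a real file.
sample = b"abracadabra abracadabra"
for n in (2, 3):
    counts = ngram_counts(sample, n)
    print(f"{n}-byte sequences: {len(counts)} distinct")
    for gram, count in counts.most_common(3):
        print(f"  {gram!r}: {count}")
```

    This only shows how often short byte sequences occur. It says nothing about how a compressor would actually parse the data, which is the gap described in the post above.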

  4. #4
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    GLZAencode counts the symbols that are in the input file. It is part of a toolset (with example .bat files included), and the intent is that a compressed file is created by running GLZAformat, GLZAcompress and then GLZAencode. If either of the first two steps is skipped, the output will only contain literals (with one missing literal and possibly other problems if the GLZAformat step is skipped). The statistical output of GLZAencode when compressing enwik9 with the intended sequence starts like this, with 'C' indicating that the next letter should be capitalized:

    0: #400960 L8: " "
    1: #326425 L8: ","
    2: #304387 L8: " C"
    3: #295356 L8: "]]"
    4: #219757 L8: " and"
    5: #197922 L9: "s"
    6: #174778 L9: ")"
    7: #157298 L9: "|C"
    8: #142221 L9: "]]
    *[[C"
    9: #141000 L9: ", C"
    10: #137467 L9: " [[C"
    11: #118636 L9: "]], [[C"
    12: #108478 L9: "]],"
    13: #107463 L9: ", and"
    14: #101665 L9: "]]</text>
    </revision>
    </page>
    <page>
    <title>C"
    15: #99892 L10: "&quot;"
    16: #92969 L10: ". C"
    17: #92332 L10: "]]
    * [[C"
    18: #86298 L10: " [["
    19: #84405 L10: "''"
    20: #74947 L10: " and C"
    21: #71963 L10: "|"
    22: #69614 L10: "]] and [[C"
    23: #67163 L10: " or"
    24: #55365 L10: " of C"
    25: #54381 L10: "'s"
    26: #53897 L10: ". Cthe"
    27: #53476 L10: " is"
    28: #52816 L10: " in"
    29: #51935 L10: " of"
    30: #51130 L10: ", the"

    Symbols are shown from most common to least common. The # shows how often the symbol for the corresponding string (dictionary entry) is used, and the L is the length of the dictionary code in bits. The symbols are a mix of literals (1 byte, except for extended UTF-8 characters) and deduplicated longer strings created by GLZAcompress. The strings represented by the dictionary symbols can be quite long; the longest created for enwik9 represents a string that is 25,650 bytes long uncompressed (25,650 space characters, 0x20):

    7412600: #2 L24: "25,640 spaces (0x20)"

    GLZA is a complex (some would say convoluted) grammar compressor. It doesn't break things into "compressible" and "uncompressible" blocks, but you have given me something to think about. It just tries to find the grammar with the lowest order-0 entropy it can (GLZAcompress) and writes the final output using that grammar (GLZAencode). For text files, it is typical that almost everything is compressible, so I'm not sure how much breaking things into compressible/uncompressible blocks would help. I have found DNA to be trickier. GLZAcompress does pretty well at finding long matches (it actually outperformed DNASequitor in a limited test), but the encoding could still use some improvement, and the idea of identifying non-deduplicable sections could be useful.
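
    As background on the "order-0 entropy" mentioned above: in an order-0 model each symbol's cost depends only on how often that symbol occurs, so a symbol seen count times out of total ideally costs -log2(count / total) bits. The Python sketch below computes that for an arbitrary symbol sequence. It is a generic illustration, not GLZA's code; whether the L values in the listing are derived exactly this way is not stated in the thread, and order0_stats is a made-up name.

```python
import math
from collections import Counter

def order0_stats(symbols):
    """Return (entropy in bits per symbol, {symbol: ideal code length in bits})."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Ideal order-0 code length of a symbol: -log2 of its relative frequency.
    lengths = {s: -math.log2(c / total) for s, c in counts.items()}
    # Entropy is the frequency-weighted average of those code lengths.
    entropy = sum(c / total * lengths[s] for s, c in counts.items())
    return entropy, lengths

# Symbols can be single bytes or longer dictionary strings; the model does not care.
entropy, lengths = order0_stats(["a", "a", "b", "c"])
print(f"{entropy:.2f} bits/symbol", lengths)
```

    Multiplying the entropy by the number of symbols gives an estimate of the encoded size under this model. Roughly speaking, a grammar compressor wins when replacing repeated strings with new symbols lowers that total even after the cost of describing the new symbols is added.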

  5. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Crush (17th September 2015)

  6. #5
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    GLZA is some kind of dictionary compression. This is not the type of statistics I'm looking for, but better than nothing. I'm more aiming at some kind of LZ77 compression stats.
    I made something similar, but much, much simpler than GLZA, for a coder contest here:
    An even more improved version is here: http://www.donationcoder.com/forum/i...?topic=21290.0
    A fixed version from the contest with extended dictionaries is here (the .dic files are the compressed libraries - sourcecode is included): http://netpan.ironbytes.de/stuff/Salatschleuder.zip

  7. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Crush
    GLZA is some kind of dictionary compression. This is not the type of statistics I'm looking for, but better than nothing. I'm more aiming at some kind of LZ77 compression stats.
    I'm a bit confused. As far as I know, LZ77 compression stats wouldn't provide any statistical information on how often some special byte combinations repeat.

  8. #7
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    It's about how often matches appear, how long they are and how much of the data cannot be compressed. The strings themselves are not so important for me.

  9. #8
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Couldn't you just take an open source (LZ) compressor and add your own variables for counting how often literals are written out and how often matches are found? I don't imagine that'd be too difficult.
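
    A rough idea of what such counters could look like, as a toy Python sketch rather than a patch to any real compressor. It does a brute-force greedy LZ77-style parse and only gathers statistics: number of matches per match length, number of literals, and lengths of literal runs (the data that could not be matched). The function name lz77_stats and the window, min_match and max_match defaults are arbitrary assumptions, not taken from any existing tool.

```python
from collections import Counter

def lz77_stats(data: bytes, window: int = 4096, min_match: int = 3, max_match: int = 255):
    """Greedy LZ77-style parse that only gathers statistics (no output stream)."""
    literals = 0
    match_lengths = Counter()   # match length -> number of matches
    literal_runs = Counter()    # length of a run of consecutive literals -> count
    run = 0
    i = 0
    n = len(data)
    while i < n:
        best = 0
        # Brute-force search of the window for the longest earlier match
        # (slow, but easy to follow; real compressors use hash chains or trees).
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_match and i + k < n and data[j + k] == data[i + k]:
                k += 1
            if k > best:
                best = k
        if best >= min_match:
            if run:                      # a literal run just ended
                literal_runs[run] += 1
                run = 0
            match_lengths[best] += 1
            i += best
        else:
            literals += 1
            run += 1
            i += 1
    if run:
        literal_runs[run] += 1
    matched = sum(length * count for length, count in match_lengths.items())
    return {"literals": literals, "matched_bytes": matched,
            "match_lengths": match_lengths, "literal_runs": literal_runs}

print(lz77_stats(b"abcabcabc"))
```

    The search is quadratic in the worst case, so this is only suitable for small samples. In a real open source LZ compressor the same counters would go at the points where it emits a literal or a match.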

  10. #9
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Yes, I think that's the only solution.
