
Thread: krc - keyword recursive compressor

  1. #1
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts

    krc - keyword recursive compressor

    krc - keyword recursive compressor builds, by brute force, a dynamic keyword dictionary during compression with all possible keywords of 2 till 16 bytes, and replaces each keyword with a unique value, starting with the most profitable keyword. This process is repeated after every keyword replacement until no new keywords profitable to replace are found. The remaining encoded data, the keyword dictionary with its values, and/or the skipped values are stored. To avoid running out of memory during keyword dictionary buildup, the least profitable keywords are removed when 5 million keywords are reached. During decompression the remaining encoded data, keyword dictionary and values are read into memory, and the values in the remaining encoded data are replaced backwards with the matching keywords.

    Because building up a new keyword dictionary after every keyword replacement is extremely slow, I added 5 compression levels: level 5 builds a keyword dictionary after every replacement, level 4 after 10 replacements, level 3 after 100 replacements, and so on down to level 1, the quickest. For level 5 only, option 5d makes it possible to add the keyword dictionary to the encoded data after every two replacements, which makes it slower but gives better results.
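
    As a rough illustration only, here is a minimal C# sketch of that loop (the string-based data, the char replacement codes and all names are my assumptions for the sketch, not the actual krc code; this version rebuilds the dictionary after every replacement, like level 5, and uses a 1-char replacement value in the profit formula):

    // using System; using System.Collections.Generic; using System.Linq;
    static void CompressSketch(string data)
    {
        var dictionary = new List<(string Keyword, char Code)>();
        char nextCode = '\u0100';   // stand-in for krc's unique replacement values

        while (true)
        {
            // Brute force: count every 2..16 char substring (overlaps included).
            var counts = new Dictionary<string, int>();
            for (int i = 0; i < data.Length; i++)
                for (int len = 2; len <= 16 && i + len <= data.Length; len++)
                {
                    string k = data.Substring(i, len);
                    counts[k] = counts.TryGetValue(k, out int c) ? c + 1 : 1;
                }

            // Profit: chars saved by replacing, minus cost of storing the keyword.
            var best = counts
                .Select(kv => new { kv.Key, Profit = kv.Value * (kv.Key.Length - 1) - (kv.Key.Length + 0.5) })
                .OrderByDescending(x => x.Profit)
                .FirstOrDefault(x => x.Profit > 0);
            if (best == null) break;                // nothing profitable left

            data = data.Replace(best.Key, nextCode.ToString());
            dictionary.Add((best.Key, nextCode++)); // decoder replays this backwards
        }
        // Store: remaining data plus the dictionary; decompression applies
        // the replacements in reverse order.
    }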

    It runs under Windows, GUI and command line, in 32 and 64-bit mode, and needs .NET Framework 4.5 or higher.

    Compression is around ARJ default. Speed is still very slow, so test only small files; compression level 5d - 0.5 KB/s, 5 - 0.5 KB/s, 4 - 3 KB/s, 3 - 8 KB/s, 2 - 15 KB/s, 1 - 23 KB/s, decompression: 670 KB/s. Memory usage is around two times the input file size.

    Download
    http://www.metacompressor.com/download/krc.zip

    krc e 1 in out (1 can also be 2, 3, 4, 5 or 5d as compression level)
    krc d in out
    Last edited by Sportman; 28th April 2014 at 00:31. Reason: Updated krc.zip

  2. #2
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    nice!

  3. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I released a new krc version. Level 1 is removed because level 2 was often quicker, and all levels moved one down, so old level 2 is new level 1, up to old level 5 = new level 4. All levels are improved in compression speed, decompression speed and file size. Krc can now handle bigger input files, but they can take days or weeks to complete. The GUI no longer removes all lines, but keeps the lines where the replacement value changed size. Some bugs are fixed that prevented correct decompression of some types of files. Krc also runs now under Linux mono (at least for small files).

    Download
    http://www.metacompressor.com/download/krc.zip

    krc e 1 in out (1 can also be 2, 3, 4 or 4d as compression level)
    krc d in out

  4. #4
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
    Isn't it possible to use multithreading to speed things up?

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by thometal View Post
    Isn't it possible to use multithreading to speed things up?
    Yes, I think it's possible to speed up dictionary building and replacement by splitting the input data over the available threads, but this was more an experiment.

    After I released kwc and ksc, I already tried multiple times in the past to make a variable keyword length dictionary, and I know another guy who also tried it before that time, but I had problems with dictionary buildup for long keywords, with avoiding collisions during replacement, and with speed. After Kennon Conrad released his working Tree code, which does almost the same but finds and replaces much longer strings and is fixed to some UTF8 files, I got renewed energy to give it another try, and this time I got it working.

    I updated krc.zip today; the old version displayed a too small dictionary size during decompression, after changes I made for mono.

  6. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Sportman,

    Could you explain what you're actually trying to do in more detail? This seems deeply related to some things I want to do in preprocessors that are heuristic and extremely fast, but different, and I'm having trouble interpreting it. Your krc tool might be good for exploratory data analysis to help me develop some slightly less strong but very fast stuff.

    What is your metric for the "most profitable" replacement on each pass? What assumptions/generalizations are you making about costs in a backend compressor?

    I'm particularly interested in very fast, very vectorizable preprocessors that do things like (approximate) frequent-bigram merging to make Markov regularities more recognizable with a fast (single fixed-order, e.g., 3-byte) hash table approach.

    (The idea there is to recognize the most common bigrams or trigrams for a common kind of data as though they were single reasonably-common individual bytes, so that you flatten out the probability distribution and a fixed-order model effectively acts like a variable-order model, but much cheaper. Is there a general name for that? It seems to be implicit in various preprocessing strategies, but I don't know good search terms for looking up related work.)
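
    To make that concrete, here is a minimal single-merge sketch; this looks essentially like one step of Gage-style byte-pair encoding (the helper name and byte-typed data are just for illustration):

    // using System; using System.Collections.Generic; using System.Linq;
    // One merge step: replace the most frequent byte pair with a byte value
    // that does not occur in the data, so the pair acts like a single byte.
    static byte[] MergeTopBigram(byte[] data)
    {
        var pairCounts = new Dictionary<(byte, byte), int>();
        var used = new bool[256];
        for (int i = 0; i < data.Length; i++)
        {
            used[data[i]] = true;
            if (i + 1 < data.Length)
            {
                var p = (data[i], data[i + 1]);
                pairCounts[p] = pairCounts.TryGetValue(p, out int c) ? c + 1 : 1;
            }
        }
        int free = Array.IndexOf(used, false);    // first unused byte value
        if (free < 0 || pairCounts.Count == 0) return data;
        var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;

        var output = new List<byte>(data.Length);
        for (int i = 0; i < data.Length; i++)
        {
            if (i + 1 < data.Length && data[i] == best.Item1 && data[i + 1] == best.Item2)
            {
                output.Add((byte)free);           // merged pair becomes one byte
                i++;                              // skip the second byte of the pair
            }
            else output.Add(data[i]);
        }
        return output.ToArray();
    }

    Repeating that until no pair is frequent enough is the flattening effect I mean: a fixed-order model over the merged alphabet sees what used to be higher-order structure.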

    Any explanations or relevant pointers would be appreciated, TIA.

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I think this is a derivative of Kennon Conrad's Tree compressor. There's a lot more discussion on that thread in the download section.

  8. The Following User Says Thank You to nburns For This Useful Post:

    Paul W. (16th May 2014)

  9. #8
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Paul W. View Post
    Could you explain what you're actually trying to do in more detail? What is your metric for the "most profitable" replacement on each pass? What assumptions/generalizations are you making about costs in a backend compressor?
    I did not build it as a preprocessor; all the code I wrote is to come as close to optimal as possible with only the recursive keyword replacement method.

    I tried all kinds of fancy dictionary buildups in the past but none was workable. Finally I used the most simple approach: take only the maximum keyword length, build a standard dictionary with all possible keywords between the minimum and maximum keyword length, raising a keyword's counter when the keyword already exists, then shift one byte up and add all possible keywords again, till the end of the input file minus the minimal keyword length.

    When the keyword dictionary reaches 5 million unique keywords, all keywords that exist only one time are removed; when the size is still above 2.5 million unique keywords after that, the removal threshold is raised by one, for example from one time existing to two times existing, and those are removed as well. When the keyword dictionary is fully built up, all keywords at this threshold are removed too; the threshold is reset to one time existing after every finished keyword dictionary buildup.
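
    In sketch form (C#; my illustrative reading of the above, not the actual krc code):

    // using System.Collections.Generic; using System.Linq;
    static Dictionary<string, int> BuildDictionary(string data, int minLen, int maxLen)
    {
        var counts = new Dictionary<string, int>();
        int threshold = 1;                        // current removal threshold
        for (int i = 0; i + minLen <= data.Length; i++)
        {
            for (int len = minLen; len <= maxLen && i + len <= data.Length; len++)
            {
                string keyword = data.Substring(i, len);
                counts[keyword] = counts.TryGetValue(keyword, out int c) ? c + 1 : 1;
            }
            if (counts.Count >= 5_000_000)        // clean when the limit is hit
            {
                foreach (var k in counts.Where(kv => kv.Value <= threshold)
                                        .Select(kv => kv.Key).ToList())
                    counts.Remove(k);
                if (counts.Count > 2_500_000)
                    threshold++;                  // remove rarer keywords next time
            }
        }
        // After full buildup, keywords at the final threshold are removed too,
        // and the threshold resets to 1 for the next buildup.
        foreach (var k in counts.Where(kv => kv.Value <= threshold)
                                .Select(kv => kv.Key).ToList())
            counts.Remove(k);
        return counts;
    }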

    When the keyword dictionary is cleaned that last time, a profit is computed per keyword as (keyword count * (keyword length - replacement value length)) - (keyword length + 0.5), and then the dictionary is sorted from highest to lowest profit.
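
    In code form, with a made-up example (the variable names are mine):

    // profit = data bytes saved by all replacements, minus the approximate
    // cost of storing the keyword itself in the dictionary
    double profit = count * (keywordLength - valueLength) - (keywordLength + 0.5);
    // e.g. an 8 byte keyword found 100 times, replaced by a 2 byte value:
    // 100 * (8 - 2) - (8 + 0.5) = 591.5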

    After that, it starts to replace the most profitable keyword with a replacement value. This replacement value comes from another dictionary that is built up the same way as the keyword dictionary, but it only takes all 1 till 3 byte combinations and fills the gaps starting from the lowest value. The lower half of the possible replacement values is skipped, to avoid storing too many skipped values. With compression levels 1 till 3, each replacement is first tested to see if it still gives more than 50% of the pre-calculated profit, and is otherwise skipped.
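
    My reading of the gap-filling part, as an illustrative sketch (not the actual krc code): byte sequences that never occur in the data are safe replacement values, because substituting them cannot collide with real data.

    // using System; using System.Collections.Generic;
    static IEnumerable<byte[]> FreeOneByteValues(byte[] data)
    {
        var seen = new bool[256];                 // which single bytes occur
        foreach (byte b in data) seen[b] = true;
        for (int b = 128; b < 256; b++)           // lower half (0..127) skipped
            if (!seen[b])
                yield return new[] { (byte)b };   // lowest free values first
        // When the free 1 byte values run out, the same idea continues with
        // the free 2 byte and then 3 byte combinations.
    }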

    When the replacement value changes length, or when the compression level's maximal loop amount is reached, the keyword dictionary is built up again. The minimal keyword length is raised by one when the replacement value changes length.
    Last edited by Sportman; 16th May 2014 at 16:03.

  10. The Following User Says Thank You to Sportman For This Useful Post:

    Paul W. (16th May 2014)

  11. #9
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    If you have a few days of time, krc can be used as a preprocessor:

    Input:
    264,996,083 bytes - IIS log file

    Output:
    20,373,111 bytes arj default
    19,717,605 bytes gzip default
    19,557,254 bytes zip default
    19,477,300 bytes zip max
    12,372,141 bytes rar default
    12,240,595 bytes rar max
    12,105,281 bytes krc 1
    10,036,707 bytes krc.zip max
    9,598,230 bytes krc.rar max

  12. The Following User Says Thank You to Sportman For This Useful Post:

    Paul W. (19th May 2014)

  13. #10
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    Yes, I think it's possible to speed up dictionary building and replacement by splitting the input data over the available threads, but this was more an experiment.
    Very nice compression ratio. Where did you get the test file?

    If and when I multithread Tree, I plan to speed up dictionary building by splitting the types of words created into different threads rather than the input since bigger input blocks usually give better compression. Just food for thought....
    Last edited by Kennon Conrad; 19th May 2014 at 06:59.

  14. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Where did you get the test file?
    I got it from a website owner; I'm not allowed to spread it.
    I tested Tree on it: treecapencode worked fine, but treecompressls and treecompress both crashed after a while.

    But I searched in some FTP search engines and found this file:
    ftp://202.177.25.251/Backup/website_...er/W3SVC21.zip
    I think if you merge the 896 IIS log files in it into one 214,790,064 byte file, you get something similar to test with.

    Update:
    I merged the log files and did some tests.

    Input:
    214,790,064 bytes - IIS log file

    Output:
    12,063,469 bytes gzip default
    11,732,515 bytes krc 1
    11,290,126 bytes arj default
    10,738,058 bytes zip default
    10,694,401 bytes zip max
    9,052,361 bytes krc.zip max
    8,542,515 bytes krc.rar max
    7,723,627 bytes rar default
    7,630,064 bytes rar max

    19,365,895 bytes tree (after treecapencode, treecompressls and treecompress; treebitencode crashed)
    10,475,795 bytes tree.zip max
    9,969,557 bytes tree.rar max

    Quote Originally Posted by Kennon Conrad View Post
    If and when I multithread Tree, I plan to speed up dictionary building by splitting the types of words created into different threads rather than the input since bigger input blocks usually give better compression. Just food for thought....
    Yes, that's similar to what I had in mind: split depending on the first byte of the keyword found, or run everything multithreaded except the compare, add and increase.
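
    As an illustrative C# sketch of that first-byte split (assumed details; krc itself is single threaded at this point):

    // using System.Collections.Generic; using System.Threading.Tasks;
    static Dictionary<string, int>[] CountSharded(string data, int minLen, int maxLen)
    {
        // One shard per first byte: each task owns a disjoint set of keywords,
        // so no locking is needed on the dictionaries. (Each task scans the
        // whole input here; a real version would partition the scan smarter.)
        var shards = new Dictionary<string, int>[256];
        Parallel.For(0, 256, shard =>
        {
            var counts = new Dictionary<string, int>();
            for (int i = 0; i + minLen <= data.Length; i++)
            {
                if ((data[i] & 0xFF) != shard) continue;   // not this shard's keywords
                for (int len = minLen; len <= maxLen && i + len <= data.Length; len++)
                {
                    string k = data.Substring(i, len);
                    counts[k] = counts.TryGetValue(k, out int c) ? c + 1 : 1;
                }
            }
            shards[shard] = counts;
        });
        return shards;
    }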
    Last edited by Sportman; 22nd May 2014 at 12:10. Reason: Added merged log files test results, Added krc results

  15. #12
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    @Sportman

    Could you make a 32-bit version of krc? I'd like to test it.

  16. #13
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by comp1 View Post
    Could you make a 32-bit version of krc? I'd like to test it
    Krc 32-bit .NET Framework 3.5 version:
    http://www.metacompressor.com/download/krc-32bit.zip

  17. #14
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    It crashes on a 400 MB file, after reading it all, with insufficient memory (krc 1).

  18. #15
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by fcorbelli View Post
    It crashes on a 400 MB file, after reading it all, with insufficient memory (krc 1).
    Did you use the 64-bit or 32-bit krc version?

    Because there is now a separate 32-bit krc version, I updated krc.zip (http://www.metacompressor.com/download/krc.zip) with the same version but 64-bit only; this should remove all software memory limitations and is around 15% faster. The maximal number of recursive loops is still set at 1,000,000; after that it will pop up an error.

  19. #16
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    Quote Originally Posted by Sportman View Post
    Did you use the 64-bit or 32-bit krc version?
    32 bit

  20. #17
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by Sportman View Post
    Krc 32-bit .NET Framework 3.5 version:
    http://www.metacompressor.com/download/krc-32bit.zip
    Thanks. I'll begin testing today!

  21. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by fcorbelli View Post
    32 bit
    I made a special krc 32-bit lite version for you, where I limited the dictionary size to 2,500,000/1,250,000 instead of 5,000,000/2,500,000. If this still gives a memory error, I can also halve the recursive loop counter.
    http://www.metacompressor.com/downlo...32bit-lite.zip

  22. #19
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    I tried the first version you posted for me and it gave me an OutOfMemoryException error. So now I am trying the 32-bit lite version...

    No error yet

    EDIT: It took longer this time, but same error...
    Last edited by comp1; 19th May 2014 at 14:56.

  23. #20
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by comp1 View Post
    It took longer this time, but same error...
    Sorry, I found a typo, a zero too much: I typed 12,500,000 instead of 1,250,000, so it was more than the non-lite version. I updated the krc-32bit-lite.zip file.

    If it gives a memory error again, maybe you can pass me the values behind keywords:, offset: and clean: visible in the GUI.

  24. #21
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I released a new krc version. Keyword length 17 is added and set as default; this makes compression around 10% slower but better on big files, and it breaks backward compatibility. The code size setting in the GUI is replaced by a maximal keyword length setting, which can be set between 2 and 17. The maximal keyword length setting is also added to the command line options as the optional parameter -ml=x (x must be a value between 2 and 17). Dictionary buildup is a little faster, and the maximum loop memory allocation is automatically increased when needed, so krc can now handle bigger files with more than 1 million loops.

    Download
    krc 64-bit .NET Framework 4.5.1 version:
    http://www.metacompressor.com/download/krc.zip

    Krc 32-bit .NET Framework 3.5 version:
    http://www.metacompressor.com/download/krc-32bit.zip

    Krc 32-bit lite .NET Framework 3.5 version:
    http://www.metacompressor.com/downlo...32bit-lite.zip

    Example new maximal keyword length option:
    krc e 1 -ml=3 in out
    krc d in out
    Last edited by Sportman; 23rd May 2014 at 17:06.

  25. #22
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I did a quick test with the fastest and slowest krc compression levels and all possible maximum keyword lengths, to see how they influence output size and speed on a small test file:

    Input:
    100,000 bytes, enwik5

    Output:
    66,605 bytes, 0.293 sec. - 0.034 sec., krc 1 -ml=2
    49,273 bytes, 3.064 sec. - 0.230 sec., krc 1 -ml=3
    47,029 bytes, 3.490 sec. - 0.211 sec., krc 1 -ml=4
    46,304 bytes, 3.530 sec. - 0.188 sec., krc 1 -ml=5
    45,850 bytes, 4.270 sec. - 0.190 sec., krc 1 -ml=6
    45,828 bytes, 4.666 sec. - 0.185 sec., krc 1 -ml=7
    45,614 bytes, 5.089 sec. - 0.181 sec., krc 1 -ml=8
    45,550 bytes, 5.848 sec. - 0.179 sec., krc 1 -ml=9
    45,472 bytes, 6.109 sec. - 0.176 sec., krc 1 -ml=10
    45,444 bytes, 7.155 sec. - 0.174 sec., krc 1 -ml=11
    45,391 bytes, 7.915 sec. - 0.173 sec., krc 1 -ml=12
    45,434 bytes, 8.650 sec. - 0.172 sec., krc 1 -ml=13
    45,440 bytes, 9.099 sec. - 0.171 sec., krc 1 -ml=14
    45,456 bytes, 10.131 sec. - 0.172 sec., krc 1 -ml=15
    45,454 bytes, 10.989 sec. - 0.170 sec., krc 1 -ml=16
    45,394 bytes, 11.968 sec. - 0.170 sec., krc 1 -ml=17

    60,779 bytes, 2.348 sec. - 0.035 sec., krc 4d -ml=2
    46,572 bytes, 58.237 sec. - 0.211 sec., krc 4d -ml=3
    44,603 bytes, 129.165 sec. - 0.199 sec., krc 4d -ml=4
    44,018 bytes, 192.552 sec. - 0.181 sec., krc 4d -ml=5
    43,503 bytes, 305.799 sec. - 0.189 sec., krc 4d -ml=6
    43,427 bytes, 415.390 sec. - 0.188 sec., krc 4d -ml=7
    43,297 bytes, 507.457 sec. - 0.182 sec., krc 4d -ml=8
    43,241 bytes, 638.147 sec. - 0.183 sec., krc 4d -ml=9
    43,346 bytes, 778.652 sec. - 0.183 sec., krc 4d -ml=10
    43,212 bytes, 912.065 sec. - 0.182 sec., krc 4d -ml=11
    43,250 bytes, 1,042.713 sec. - 0.179 sec., krc 4d -ml=12
    43,156 bytes, 1,208.220 sec. - 0.180 sec., krc 4d -ml=13
    43,213 bytes, 1,387.091 sec. - 0.181 sec., krc 4d -ml=14
    43,162 bytes, 1,585.182 sec. - 0.179 sec., krc 4d -ml=15
    43,173 bytes, 1,734.119 sec. - 0.177 sec., krc 4d -ml=16
    43,144 bytes, 1,918.310 sec. - 0.180 sec., krc 4d -ml=17
    Last edited by Sportman; 23rd May 2014 at 06:35.

  26. #23
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    More optimization:
    code size 1: maximal keyword length 2 - 17
    code size 2: maximal keyword length 3 - 18
    code size 3: maximal keyword length 4 - 19

  27. #24
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by xezz View Post
    More optimization:
    code size 2: maximal keyword length 3 - 18
    code size 3: maximal keyword length 4 - 19
    Yes, I thought about those two when adding keyword length 17 for code size 1. But when I added keyword length 17, I saw something strange in my dictionary buildup: matches for keywords of lower length (<17) got lower already-existing counts than before (with maximal keyword length 16), which I cannot explain. So I first tried to find out why that happens, but I have not found it yet.

  28. #25
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    Does krc do SSR (Statistical Substring Reduction)?

    I'm still a bit confused about what krc does, and that may be entirely my fault for not reading carefully enough, but here's a question worth making explicit:

    Does krc do or approximate Statistical Substring Reduction, a common technique in n-gram processing?

    The basic idea of SSR is that when you have long n-grams, that shifts the profitability of substituting the shorter n-grams that appear as a prefix (or more generally, I guess, as a substring) of those long n-grams.

    A standard example for word n-grams is the 4-gram "People's Republic of China" vs. the 2-gram "People's Republic". If you match all occurrences of "People's Republic of China" and replace them, you usually find that you're left with few or no other occurrences of "People's Republic", because that phrase is almost always followed by "of China" if it's used at all. (Less reliably, you may find few or no occurrences of the 2-gram "of China" as well.)

    In Markov terms, you don't want to just fall back from a high-order Markov model to a lower-order one, because the presence of the high-order model excludes certain things from the view of the lower-order model. Most compressors don't get this right, or only get it right in certain fairly important common cases.

    If KRC gets this right, that's a feature worth explicitly mentioning. If not, it's probably a suboptimality worth fixing.

  29. #26
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Paul W. View Post
    Does krc do or approximate Statistical Substring Reduction, a common technique in n-gram processing?
    No, but I wrote before that "with compression levels 1 till 3 each replacement is first tested to see if it still gives more than 50% of the pre-calculated profit, and is otherwise skipped", so this skips most or maybe all of the already-replaced substrings. In the case of levels 4 and 4d, the dictionary is built up after every keyword replacement cycle, which automatically fixes the already-replaced-substrings problem.

  30. The Following User Says Thank You to Sportman For This Useful Post:

    Paul W. (25th May 2014)

  31. #27
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    Does krc do or approximate Statistical Substring Reduction, a common technique in n-gram processing?
    I tracked down a reference on SSR, and it looks like it was developed for automatic translation or something along those lines. I didn't read enough of the paper to be able to tell exactly what it is. Could you summarize it?

  32. #28
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I released a new krc version. Dictionary buildup was not building fully to the end of the file for short keywords; I fixed it, but had no time to test it well. Maximum keyword lengths 18 and 19 are added, and 19 is set as default: it starts with 17 and increases by one when the replacement value changes length.

    Download
    krc 64-bit .NET Framework 4.5.1 version:
    http://www.metacompressor.com/download/krc.zip

    Krc 32-bit .NET Framework 3.5 version:
    http://www.metacompressor.com/download/krc-32bit.zip

    Krc 32-bit lite .NET Framework 3.5 version:
    http://www.metacompressor.com/downlo...32bit-lite.zip
    Last edited by Sportman; 18th June 2014 at 00:47.

  33. #29
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by nburns View Post
    I tracked down a reference on SSR, and it looks like it was developed for automatic translation or something along those lines. I didn't read enough of the paper to be able to tell exactly what it is. Could you summarize it?
    I don't know much about it---I just stumbled into it the other day when I was looking at various n-gram analysis packages.

    As I understand it, the idea is that you often want to know what the most common n-grams are, irrespective of n---some long n-grams are as common as other, shorter ones, due to stereotyped phrases, grammatical tics, canned quotations, etc.

    You often want a list of, say, the top 1000 n-grams in a text or corpus, where some of them are 2-grams, and others are 3- or 4- or 5-grams, but they're all relatively common, and none of them subsumes the others. You don't want "four score and seven" or "score and seven years" listed separately from "four score and seven years ago" if they only occur in that context, because they don't tell you anything that the frequency of the longer phrase doesn't tell you. It would just pollute your list of the most common terms and phrases, and it would be misleading to list them as though they were common phrases in their own right.

    Depending on what you're using the n-gram frequencies for, you may or may not want those sub-phrase frequencies in your list, so the high-end n-gram packages usually give you a switch saying whether to do SSR or not before assigning frequencies.

    This is interesting to me for the purpose of coming up with canned preloaded dictionaries for compressing qualitatively different kinds of data. You generally don't want to pollute your dictionary (and code space) with short strings that are not actually common in their own right, just because they're substrings of a longer phrase that is common. You'd rather just recognize the long strings and give them short codes than waste time and code bits matching the short strings and then trying to lump them into long ones.
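
    As a naive sketch of the adjustment I have in mind (C#, exact counting; my own formulation, not taken from the SSR paper):

    // using System.Collections.Generic;
    // An n-gram's "own" count is its raw count minus the occurrences that are
    // already covered by kept longer n-grams containing it.
    static int OwnCount(string text, string ngram, IEnumerable<string> keptLonger)
    {
        int own = CountOccurrences(text, ngram);
        foreach (string longer in keptLonger)
            if (longer.Length > ngram.Length && longer.Contains(ngram))
                own -= CountOccurrences(text, longer);   // subsumed occurrences
        return own;   // list the n-gram only if this stays clearly above zero
    }

    static int CountOccurrences(string text, string pattern)
    {
        int n = 0;    // overlapping occurrences are counted too
        for (int i = text.IndexOf(pattern); i >= 0; i = text.IndexOf(pattern, i + 1))
            n++;
        return n;
    }

    With "four score and seven years ago" kept, "four score and seven" adjusts to zero whenever it never occurs outside the longer phrase, which is exactly the pollution I want to avoid.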

  34. #30
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I released a new krc version. Fixed multiple problems with dictionary buildup.

    Download
    krc 64-bit .NET Framework 4.5.1 version:
    http://www.metacompressor.com/download/krc.zip

    Krc 32-bit .NET Framework 3.5 version:
    http://www.metacompressor.com/download/krc-32bit.zip

    Krc 32-bit lite .NET Framework 3.5 version:
    http://www.metacompressor.com/downlo...32bit-lite.zip

    Some output of tests I did today with this new krc version:

    Input:
    5,345,280 bytes, xml - Silesia corpus Collected XML files (html)

    Output:
    677,186 bytes arj default
    665,114 bytes zip default
    655,593 bytes zip max
    649,780 bytes krc 1
    563,512 bytes krc.zip max
    534,910 bytes krc.rar max
    501,903 bytes rar default
    491,013 bytes rar max

    -------------------------------

    Input:
    20,617,071 bytes, fp.log - log file

    Output:
    1,384,583 bytes arj default
    1,339,573 bytes zip default
    1,323,424 bytes zip max
    1,028,125 bytes krc 1
    904,862 bytes rar default
    884,964 bytes rar max
    838,421 bytes krc.zip max
    788,792 bytes krc.rar max
    Last edited by Sportman; 18th June 2014 at 00:46.


