24th April 2014, 02:28
Is it possible to use grammars to aid in compression?
Some files, such as programming-language source files, are (when free of errors) constrained to the grammar of that language.
A grammar gives a precise yet easy-to-understand syntactic specification of a programming language, and therefore a restriction on it. That restriction should be a useful property for compression: in addition to estimating probabilities for the next character, a compressor could use the grammar to rule out characters that cannot legally occur.
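The idea above can be sketched in a few lines. This is a toy illustration (the three-state "grammar" of digit/operator/parenthesis transitions is invented for the example, not a real language grammar): if the grammar says only k symbols are legal next, an entropy coder needs at most log2(k) bits instead of 8 bits per raw ASCII character.

```python
import math

# Hypothetical follow-sets for a tiny arithmetic-expression "grammar":
# after each kind of token, only some characters are legal.
FOLLOW = {
    "digit":    set("0123456789+)"),   # 12 legal successors
    "operator": set("0123456789("),    # 11 legal successors
    "open":     set("0123456789("),    # 11 legal successors
}

def kind(ch):
    if ch.isdigit():
        return "digit"
    if ch == "+":
        return "operator"
    return "open"

def bits_needed(text):
    """Code length in bits if every legal next symbol is equally likely."""
    total = 0.0
    state = "open"          # assume an expression starts like an open paren
    for ch in text:
        legal = FOLLOW[state]
        assert ch in legal  # a valid file never leaves the grammar
        total += math.log2(len(legal))
        state = kind(ch)
    return total

print(bits_needed("1+2+3"))  # about 3.5 bits/char instead of 8
```

A real compressor would combine this restriction with learned probabilities rather than a uniform distribution, but the saving from excluding impossible symbols is the same in spirit.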
Has any research been conducted on this topic? I am very interested.
24th April 2014, 11:05
In case you hadn't realized it yet: the situation is nearly the same for text compression.
There, too, you have a set of more or less fixed grammar rules, e.g. for the English language, just as you have them for, say, the Java programming language.
The difference is only in the complexity and amount of grammar.
So if you don't find anything on this topic specifically for programming languages, you can also search for text compression.
24th April 2014, 11:17
It has already been done, but for a much simpler language: XML.
24th April 2014, 21:56
Of course there are differences between source code and natural language. For source code the grammars are simple enough to build context models by hand. For natural language you need to use statistical methods to learn the grammar. It is a difficult problem, but might pay off on the Hutter prize or LTCB.
24th April 2014, 23:20
Short answer: yes, there has been research.
Originally Posted by rhulcomputerscience
At a high level, you might divide grammar-based compression into two basic approaches: hard-code the grammar statically into the compressor, or learn the grammar dynamically from the data. As Piotr pointed out, the static approach has been used with XML. In practice, most of the XML compressors that people have written are not completely new algorithms, but pre-processors that feed into standard algorithms and try to achieve better compression than the standard algorithms alone. I ran some experiments with a bunch of XML compressors/preprocessors, and I wasn't all that impressed with the results. The dynamic approach has been applied in research compressors (http://en.wikipedia.org/wiki/Grammar-based_code). You can probably sense by the length of the wikipedia entry that this area of research hasn't been fantastically successful. I think there are problems related to CFGs that are NP-complete, and that may be a fundamental obstacle to this line of work.
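The "learn the grammar dynamically" family mentioned above (Sequitur, Re-Pair, and relatives) can be sketched in miniature. The following is a simplified Re-Pair-style reduction, not any particular published implementation: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal, building a grammar whose rules capture the repetition.

```python
from collections import Counter

def repair(seq):
    """Greedy pair replacement: returns (compressed sequence, grammar rules)."""
    rules = {}
    next_id = 0
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                       # no pair repeats: nothing left to gain
        new = ("R", next_id)            # fresh nonterminal symbol
        next_id += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):             # rewrite seq, replacing the pair
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

comp, rules = repair("abababab")
print(len(comp), len(rules))  # 8 input symbols -> 2 symbols plus 2 rules
```

To turn this into a compressor you would still need to entropy-code the final sequence and the rules, which is where much of the engineering (and the disappointing practical results) lives.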
You *could* build context models by hand for programming languages, but how many examples are there? If not many people have done this, then it could be that there's a general sense that it's not worth doing, either because the use case is too narrow or because the savings would be too small to justify the amount of work.
Originally Posted by Matt Mahoney
It would be interesting to see a theoretical analysis of the amount of entropy in a document that's attributable solely to the grammar. The amount of information in a fixed grammar is constant, obviously, so its share would get proportionally smaller for longer documents. (Doesn't Kolmogorov theory have a theorem to that effect?)
Trying to specify the complete grammar of a human language as a CFG seems like it may not be worth doing, even if you had the resources. The central point of a CFG is to explicitly narrow down the kinds of statements that are possible by enumerating a finite set of unbreakable rules: if a statement doesn't match any rule, reject it. To accept all English as actually spoken, the CFG would have to be so permissive that it wouldn't subtract much from the entropy.
25th April 2014, 00:21
A grammar for English would be more like a statistical model of n-grams over word clusters in context space. For example, from phrases like "the X is", and "the Y is", you learn that X and Y belong to the same category (nouns). The top compressors like durilca and paq8hp12 use dictionaries organized so that grammatically related words are grouped together. Thus, when they are mapped to tokens, you can drop the low bits and use the high bits as context. This grouping can be learned, obviously, although I think there was some manual organization as well.
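The dictionary trick described above can be illustrated with a toy example (the word list and 4-entry group size here are invented for illustration, not the actual durilca or paq8hp12 dictionaries): if grammatically similar words get adjacent codes, dropping the low bits of a token yields a coarse word-class context.

```python
# Hypothetical dictionary: words grouped by rough grammatical role.
# Each group occupies one aligned block of 4 codes, so code >> 2
# identifies the group.
GROUPS = [
    (0, ["cat", "dog", "tree"]),   # noun-like words -> codes 0..2
    (4, ["is", "was", "runs"]),    # verb-like words -> codes 4..6
    (8, ["the", "a", "an"]),       # determiners     -> codes 8..10
]

CODE = {}
for base, words in GROUPS:
    for offset, w in enumerate(words):
        CODE[w] = base + offset

def word_class(word):
    """Drop the 2 low bits: what remains is the word-class context."""
    return CODE[word] >> 2

print(word_class("cat"), word_class("tree"))  # same class: 0 0
print(word_class("is"), word_class("the"))    # different classes: 1 2
```

A context model can then mix statistics keyed on the full token with statistics keyed only on the high bits, so rare words still benefit from counts gathered over their whole class.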