
Thread: Is it possible to utilize grammars to aid in compression

  1. #1
    Member
    Join Date
    Jan 2014
    Location
    london
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Is it possible to utilize grammars to aid in compression

    Some files, such as programming source code, are (assuming they contain no errors) restricted to the grammar of that programming language.

    A grammar gives a precise yet easy-to-understand syntactic specification, and therefore a restriction, of a programming language. This restriction should be a useful property for compression algorithms: in addition to compiling probabilities of the next letters, a compressor could use the grammar to narrow down which characters can legally come next.
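    For instance, the idea might look like this toy sketch, where the `legal` set stands in for a real parser that reports which characters the grammar allows at the current position (everything here is invented for illustration):

    Code:
    # Toy sketch: combine a plain frequency model with a grammar mask.
    # 'legal' stands in for a real parser that reports which characters
    # the language grammar permits at the current position.
    from collections import Counter

    def masked_distribution(freqs, legal):
        """Zero out grammar-illegal symbols and renormalize the rest.

        Fewer legal symbols means a sharper distribution, so an
        arithmetic coder driven by it spends fewer bits per symbol.
        """
        masked = {c: n for c, n in freqs.items() if c in legal}
        total = sum(masked.values())
        if total == 0:  # grammar allows only symbols we have never seen
            return {c: 1.0 / len(legal) for c in legal}
        return {c: n / total for c, n in masked.items()}

    # After "int " in C-like source an identifier must follow, so digits
    # and most punctuation are impossible at this position.
    freqs = Counter({'a': 30, 'b': 10, '1': 25, ';': 35})
    legal = {'a', 'b', '_'}  # what a parser might permit here
    print(masked_distribution(freqs, legal))  # {'a': 0.75, 'b': 0.25}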

    Has any research been conducted on this topic? I am very interested.

    Kind regards.

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Germany / Hessen
    Posts
    18
    Thanks
    0
    Thanked 1 Time in 1 Post
    Just in case you didn't realize it yet: it is nearly the same for text compression.
    There, too, you have a set of fixed grammar rules, e.g. for the English language, just as you have for, say, the Java programming language.
    It is just a difference in the complexity and amount of grammar.

    So if you do not find anything on this topic specifically for programming languages, you can also search for text compression.

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    It has already been done, but for a much simpler language: XML.
    http://xwrt.sourceforge.net/
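    To give the flavor of that kind of preprocessing (an illustration only, not XWRT's actual algorithm): replace frequent tags with short codes before handing the data to a standard compressor.

    Code:
    # Illustrative tag-dictionary pre-transform; the real XWRT is far
    # more sophisticated. Ignores escaping of byte 0x01 in the input,
    # which a real transform would have to handle.
    import re
    import zlib

    def xml_pretransform(text):
        tags = re.findall(r"</?[A-Za-z][^>]*>", text)
        # small dictionary of the most frequent tags, shipped as a header
        by_freq = sorted(set(tags), key=tags.count, reverse=True)[:32]
        out = text
        for i, tag in enumerate(by_freq):
            out = out.replace(tag, chr(1) + chr(65 + i))  # 2-byte code
        header = "\x00".join(by_freq)
        return (header + "\x01\x01" + out).encode("utf-8")

    doc = "<item><name>a</name></item>" * 100
    print(len(zlib.compress(doc.encode("utf-8"))),
          len(zlib.compress(xml_pretransform(doc))))
    # On toy repetitive input zlib does well by itself; the transform's
    # gains show up on large, varied real-world XML.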

  4. #4
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Of course there are differences between source code and natural language. For source code the grammars are simple enough to build context models by hand. For natural language you need to use statistical methods to learn the grammar. It is a difficult problem, but might pay off on the Hutter prize or LTCB.
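    For illustration, a hand-built syntactic context for a C-like language could be as simple as this toy sketch (the states and the context split are an invented example, not any particular compressor's model):

    Code:
    # A tiny hand-written state machine tracks whether we are in normal
    # code, a string literal, or a line comment; that state becomes part
    # of the model's context, since statistics differ sharply between them.
    from collections import defaultdict

    CODE, STRING, COMMENT = 0, 1, 2  # hand-picked syntactic states

    def next_state(state, prev, ch):
        if state == CODE:
            if ch == '"':
                return STRING
            if prev == '/' and ch == '/':
                return COMMENT
        elif state == STRING and ch == '"':
            return CODE
        elif state == COMMENT and ch == '\n':
            return CODE
        return state

    # Next-character counts kept separately per (state, previous char).
    counts = defaultdict(lambda: defaultdict(int))
    state, prev = CODE, ''
    for ch in 'x = "hi"; // done\n':
        counts[(state, prev)][ch] += 1
        state, prev = next_state(state, prev, ch), ch

    print(len(counts), "contexts observed")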

  5. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by rhulcomputerscience
    Some files, such as programming source code, are (assuming they contain no errors) restricted to the grammar of that programming language.

    A grammar gives a precise yet easy-to-understand syntactic specification, and therefore a restriction, of a programming language. This restriction should be a useful property for compression algorithms: in addition to compiling probabilities of the next letters, a compressor could use the grammar to narrow down which characters can legally come next.

    Has any research been conducted on this topic? I am very interested.

    Kind regards.
    Short answer: yes, there has been research.

    At a high level, you might divide grammar-based compression into two basic approaches: hard-code the grammar statically into the compressor, or learn the grammar dynamically from the data. As Piotr pointed out, the static approach has been used with XML. In practice, most of the XML compressors that people have written are not completely new algorithms but pre-processors that feed into standard algorithms and try to achieve better compression than the standard algorithms alone. I ran some experiments with a bunch of XML compressors/preprocessors, and I wasn't all that impressed with the results.

    The dynamic approach has been applied in research compressors (http://en.wikipedia.org/wiki/Grammar-based_code). You can probably sense from the length of the Wikipedia entry that this area of research hasn't been fantastically successful. I think there are related problems on CFGs that are computationally hard (finding a smallest grammar that generates a given string is NP-hard, for example), and that may be a fundamental obstacle to this line of work.
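    To give a concrete feel for the dynamic approach, here is a naive pass in the style of Re-Pair (Larsson & Moffat); real implementations use much more careful data structures:

    Code:
    # Naive Re-Pair-style grammar inference: repeatedly replace the most
    # frequent adjacent symbol pair with a fresh nonterminal. The rules
    # plus the final sequence encode the input; expanding them recovers it.
    from collections import Counter

    def repair(seq):
        rules = {}
        next_id = 0
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            pair, freq = pairs.most_common(1)[0]
            if freq < 2:  # nothing repeats; the grammar is finished
                break
            nt = "R%d" % next_id
            next_id += 1
            rules[nt] = pair
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    seq, rules = repair(list("abababab"))
    print(seq, rules)
    # ['R1', 'R1'] {'R0': ('a', 'b'), 'R1': ('R0', 'R0')}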

    Quote Originally Posted by Matt Mahoney
    Of course there are differences between source code and natural language. For source code the grammars are simple enough to build context models by hand. For natural language you need to use statistical methods to learn the grammar. It is a difficult problem, but might pay off on the Hutter prize or LTCB.
    You *could* build context models by hand for programming languages, but how many examples are there? If not many people have done this, then it could be that there's a general sense that it's not worth doing, either because the use case is too narrow or because the savings would be too small to justify the amount of work.

    It would be interesting to see a theoretical analysis of the amount of entropy in a document that's attributable solely to the grammar. The amount of information in a fixed grammar is constant, obviously, so its share of a document's entropy gets proportionally smaller as documents get longer. (Doesn't Kolmogorov theory have a theorem to similar effect?)
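    If I remember right, the statement is the subadditivity of Kolmogorov complexity; sketched here up to logarithmic terms, so treat the exact form as an assumption:

    Code:
    K(x) \le K(x \mid G) + K(G) + O(\log K(G))
    \frac{K(x)}{n} \le \frac{K(x \mid G)}{n} + \frac{K(G) + O(\log K(G))}{n}
    For a fixed grammar G and documents of n symbols, the second term vanishes as n grows, which matches the intuition above.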

    Trying to specify the complete grammar of a human language as a CFG seems like it may not be worth doing, even if you had the resources. The central point of a CFG is to explicitly narrow down the kinds of statements that are possible by enumerating a finite set of unbreakable rules: if a statement doesn't match any rule, reject it. To accept all English as actually spoken, the CFG would have to be so large and permissive that it wouldn't subtract much from the entropy.

  6. #6
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    A grammar for English would be more like a statistical model of n-grams over word clusters in context space. For example, from phrases like "the X is", and "the Y is", you learn that X and Y belong to the same category (nouns). The top compressors like durilca and paq8hp12 use dictionaries organized so that grammatically related words are grouped together. Thus, when they are mapped to tokens, you can drop the low bits and use the high bits as context. This grouping can be learned, obviously, although I think there was some manual organization as well.
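    A toy version of that token layout (the widths and groupings here are made up for illustration; the real dictionaries are much larger and, as noted, partly hand-organized):

    Code:
    CLUSTER_BITS = 3  # hypothetical: low 3 bits index a word inside its cluster

    # Hand-grouped toy dictionary: nouns share cluster 0, verbs cluster 1.
    dictionary = {
        "cat":  (0 << CLUSTER_BITS) | 0,
        "dog":  (0 << CLUSTER_BITS) | 1,
        "runs": (1 << CLUSTER_BITS) | 0,
        "eats": (1 << CLUSTER_BITS) | 1,
    }

    def cluster_context(token):
        """Drop the low bits: grammatically related words collide."""
        return token >> CLUSTER_BITS

    for word, token in dictionary.items():
        print(word, "token", token, "cluster", cluster_context(token))
    # cat/dog map to cluster 0 and runs/eats to cluster 1, so a model
    # keyed on clusters sees "noun verb" patterns directly.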
