Thread: DCA [antidictionary compression]

  1. #1
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts

    DCA [antidictionary compression]

    I'm looking for an antidictionary-based compressor. Does anyone know where to find one? (source & exe)

  2. The Following User Says Thank You to xezz For This Useful Post:

    Jarek (28th November 2016)

  3. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Interesting - compression based on a list of forbidden words. Let me say something about the theory behind such constraints.

    The simplest example is the Fibonacci coding - bit-sequences with forbidden word '11'.
    https://en.wikipedia.org/wiki/Fibonacci_coding
    This constraint reduces the capacity from 1 bit/node to log_2((1 + sqrt(5))/2) ≈ 0.694242 bits/node.
    For the 2D analogue of Fibonacci coding (forbidden vertical and horizontal '11') it is ≈ 0.5878911 bits/node (and can be practically approached: https://arxiv.org/pdf/0710.3861 ).
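
    To make this concrete, here is a small Python sketch of Fibonacci coding (function names are mine, purely for illustration). Every codeword is a Zeckendorf representation with a '1' appended, so '11' appears exactly once, as the terminator:

    Code:
    def fib_encode(n):
        # Zeckendorf representation: n as a sum of non-consecutive
        # Fibonacci numbers, so '11' never occurs inside the codeword;
        # the appended final '1' makes '11' the unique terminator.
        fibs = [1, 2]
        while fibs[-1] <= n:
            fibs.append(fibs[-1] + fibs[-2])
        fibs.pop()  # drop the first Fibonacci number that exceeds n
        bits = ['0'] * len(fibs)
        for i in range(len(fibs) - 1, -1, -1):
            if fibs[i] <= n:
                bits[i] = '1'
                n -= fibs[i]
        return ''.join(bits) + '1'

    def fib_decode(code):
        assert code.endswith('11')
        fibs = [1, 2]
        while len(fibs) < len(code) - 1:
            fibs.append(fibs[-1] + fibs[-2])
        return sum(f for f, b in zip(fibs, code[:-1]) if b == '1')

    # fib_encode(4) == '1011'; fib_decode('1011') == 4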

    For a general "antidictionary" (a set of forbidden words), the capacity can be found analytically, but it's a bit more complex (Maximal Entropy Random Walk).
    Specifically, if L is the maximal length of words in the antidictionary, we take all length-(L-1) sequences and construct a matrix over them:
    M_{ij} = 1 iff sequence j may follow sequence i (0 otherwise)
    E.g. for Fibonacci coding this matrix is {{1,1},{1,0}}.
    Then log_2(largest eigenvalue of M) is the maximal capacity in bits/symbol.
    It is achieved by a well-defined transition probability:
    Pr(next = j | previous = i) = (M_{ij} / lambda) * (psi_j / psi_i)
    where lambda is the largest eigenvalue of M and psi is the corresponding eigenvector.

    It seems tempting to support data compression with a set of forbidden words; however, it seems difficult to make it really practical (?)
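
    For what it's worth, here is a minimal numpy sketch of the computation above, using the Fibonacci matrix as the example (nothing beyond the formulas in this post is assumed):

    Code:
    import numpy as np

    # Fibonacci constraint (forbidden word '11'); state = previous bit.
    M = np.array([[1.0, 1.0],   # after 0: both 0 and 1 are allowed
                  [1.0, 0.0]])  # after 1: only 0 is allowed

    w, v = np.linalg.eig(M)
    k = np.argmax(w.real)
    lam = w.real[k]              # largest eigenvalue (the golden ratio)
    psi = np.abs(v[:, k].real)   # corresponding eigenvector

    print(np.log2(lam))          # capacity: ~0.694242 bits/node

    # Maximal-entropy transition probabilities:
    # Pr(next = j | previous = i) = (M_ij / lam) * (psi_j / psi_i)
    P = M / lam * psi[np.newaxis, :] / psi[:, np.newaxis]
    print(P)                     # rows sum to 1; P[1,1] == 0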

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. At some point I made a non-prefix bitcode based on escape codes (aka forbidden words).
    It had a decoder similar to a huffman code, but it backtracked when parsing encountered
    an escape code - so decoding was a little slower than huffman, and compression better.
    However, encoding was really dumb - it started with a huffman code, built the set of
    all bit substrings used in it (up to the max codelength), then assigned the shortest
    unused string as the code for some symbol (it's unique, so it can't appear at random),
    and this process repeated until it became impossible to reduce any symbol's codelength.

    2. In the same way, it's possible to attach some meaning to bytewise "forbidden words".
    Like, actually use them for encoding some frequent longer words/phrases
    (some text preprocessors actually do just this, by encoding some n-grams with
    unused byte codes etc - a toy sketch follows below).

    3. Perfect coding can be approximated by counting the number of exclusions
    among the strings of length N (or more; with a context of length N-1) starting with 0 and 1 bits,
    and then arithmetic-decoding the bits (similar to stegdict).
    It's not that complex to make, actually - do you think it would be interesting?
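
    To illustrate point 2, a toy Python sketch (the digram-only dictionary and the function names are mine, not taken from any real preprocessor): byte values absent from the input are "forbidden words" of length 1, and get reassigned to the most frequent digrams:

    Code:
    from collections import Counter

    def pack(data: bytes, max_codes=16):
        # Greedy toy: map unused byte values to frequent digrams.
        unused = [b for b in range(256) if b not in set(data)]
        digrams = [p for p, _ in
                   Counter(zip(data, data[1:])).most_common(max_codes)]
        table = list(zip(unused, digrams))
        out = bytes(data)
        for code, (a, b) in table:
            out = out.replace(bytes([a, b]), bytes([code]))
        return out, table

    def unpack(packed: bytes, table):
        out = bytes(packed)
        for code, (a, b) in reversed(table):  # undo in reverse order
            out = out.replace(bytes([code]), bytes([a, b]))
        return out

    data = b"the theme of the thread"
    packed, table = pack(data)
    assert unpack(packed, table) == data and len(packed) < len(data)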

  5. #4
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    I haven't understood its algorithm well yet, so I'm looking for an implementation.
    If the compression is good or it runs fast, it would be interesting.

  6. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    I was thinking about optimally using a channel with some constraints (as in Fibonacci coding) - for data compression it seems like just a weakened PPM: instead of modeling a probability distribution for various-length contexts, you only model which symbols are allowed for the next step (assuming a uniform probability distribution among them?).

    However, this is more memory-friendly - it might be worth considering a hybrid: a PPM/CM which immediately cuts the branches corresponding to forbidden words.
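
    As a trivial illustration of the branch-cutting idea (the interface here is hypothetical, not from any actual PPM/CM implementation): take the model's predicted distribution, zero out every symbol that would complete a forbidden word, and renormalize:

    Code:
    def constrain(probs, context, antidictionary):
        # probs: {symbol: probability} predicted by the model for the
        # next step; context: already-coded suffix of the data.
        allowed = {s: p for s, p in probs.items()
                   if not any((context + s).endswith(w)
                              for w in antidictionary)}
        total = sum(allowed.values())
        return {s: p / total for s, p in allowed.items()}

    # With the Fibonacci constraint after a '1':
    print(constrain({'0': 0.5, '1': 0.5}, '1', {'11'}))  # {'0': 1.0}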

  7. The Following User Says Thank You to Jarek For This Useful Post:

    xezz (29th November 2016)

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I mainly considered this kind of approach for compressing nearly-random data (like archives).
    For example, jpegs have FF masking - FF FF can't appear there (and most other FF xx strings).
    And many rangecoders have various small skews in their output - for example, paq uses a carryless rc,
    so a long series of FF (like, longer than 4) can't appear there either.

    @xezz:
    The closest practical thing is text preprocessors, like xwrt, because they use symbol codes
    (and short strings) which don't appear in the text to encode frequent words etc.
    Otherwise it's not anything useful for normal kinds of data, nor is it fast.
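
    A quick way to see how much of this structure a given "random-looking" file has (the function is just my illustration): list the byte pairs that never occur - in jpeg entropy-coded data, FF FF and most other FF xx pairs should show up.

    Code:
    def absent_digrams(data: bytes):
        # Which of the 65536 possible byte pairs never occur in data?
        seen = set(zip(data, data[1:]))
        return [(a, b) for a in range(256) for b in range(256)
                if (a, b) not in seen]

    # e.g.: pairs = absent_digrams(open('image.jpg', 'rb').read())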

  9. The Following User Says Thank You to Shelwien For This Useful Post:

    xezz (29th November 2016)

  10. #7
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    My compressor bce (http://github.com/akamiru/bce) uses Compression by Substring Enumeration.
    That means it lists all minimal unique substrings. For the binary alphabet it uses, this is equivalent
    to specifying all minimal absent words. Thus it is equivalent to encoding the maximally useful
    antidictionary (it can always decode the next bit from the current context).
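
    For anyone wondering what a minimal absent word is concretely: a word that is absent from the text while both its longest proper prefix and its longest proper suffix do occur. A brute-force Python sketch over the binary alphabet (bce itself works via suffix arrays, not like this):

    Code:
    from itertools import product

    def minimal_absent_words(s, max_len=8):
        # Collect all substrings up to max_len, plus the empty word
        # (which always occurs), then test every binary word.
        present = {''}
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                present.add(s[i:j])
        return [w
                for L in range(1, max_len + 1)
                for w in (''.join(t) for t in product('01', repeat=L))
                if w not in present
                and w[:-1] in present and w[1:] in present]

    print(minimal_absent_words('0110'))  # ['00', '010', '101', '111']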

  11. The Following 2 Users Say Thank You to Christoph Diegelmann For This Useful Post:

    Shelwien (2nd December 2016),xezz (2nd December 2016)

  12. #8
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Quote Originally Posted by Christoph Diegelmann View Post
    My compressor bce (http://github.com/akamiru/bce) uses Compression by Substring Enumeration.
    That means it lists all minimal unique substrings. For the binary alphabet it uses, this is equivalent
    to specifying all minimal absent words. Thus it is equivalent to encoding the maximally useful
    antidictionary (it can always decode the next bit from the current context).
    Christoph, could you make a Win executable of bce v0.4?
    Second question - where could I find bce v0.2 and v0.3 Win executables?

    Darek

  13. #9
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    bce v0.2 and v0.3 were pretty experimental and are around a factor of 10 slower, while having worse compression (they lack the adaptive stage of bce v0.4).
    The source can still be found on my homepage: http://www.kta-design.de/releases/.
    bce v0.4 is pretty easy to compile on Windows using CMake for Windows + Visual Studio.
    You need to compile libdivsufsort as a static lib, and bce will link against it.
    If you're having trouble with it, I can try to find some time to do it for you (maybe Sunday).

    If you want to try it, you could use my modification of libdivsufsort (http://github.com/akamiru/libdivsufsort), which uses a novel SACA that makes it run around 15% faster.
    Note that my modification has only been tested by me (but quite well), because I haven't officially released it yet.

  14. The Following User Says Thank You to Christoph Diegelmann For This Useful Post:

    Darek (3rd December 2016)

  15. #10
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    114
    Thanks
    91
    Thanked 69 Times in 49 Posts
    Here are 32- and 64-bit builds of 0.4.0 from GCC 6.2 using Christoph's libdivsufsort fork and OpenMP.
    Attached Files

  16. The Following 3 Users Say Thank You to jibz For This Useful Post:

    Christoph Diegelmann (4th December 2016),Darek (3rd December 2016),Shelwien (2nd December 2016)
