
Thread: ACB

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,061
    Thanks
    0
    Thanked 6 Times in 5 Posts

    ACB

    Uploaded my ACB materials: http://ctxmodel.net/files/ACB.rar

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,614
    Thanks
    127
    Thanked 365 Times in 235 Posts
Anybody figure this out? The paper by Buyanovsky describes a "funnel of analogies" that looks to me like a BWT transform. However, it is not BWT. It looks like the context-sorted pointers are only used to find matching contexts, and the input is compressed in its original order? The paper is very hard to read.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,061
    Thanks
    0
    Thanked 6 Times in 5 Posts
It's not BWT. If anything, it's LZ with advanced coding of matches.
First, the match offset is encoded as the offset of the best-matching
string in the sorted list of contexts. Then the neighbours of that
best-match string are used to determine the minimal match length,
and there's tricky coding for the remaining match length too - it's
stored as the number of occurrences of the first mismatching symbol.
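
A rough sketch of the neighbour trick in Python (my own reconstruction
of the idea, not Buyanovsky's code; the names are mine):

Code:
# Both sides see the same sorted list, so whatever prefix the chosen
# entry shares with its sorted neighbours is known to the decoder for
# free and doesn't have to be transmitted.

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def implied_min_match(sorted_entries, chosen: int) -> int:
    """Match length the decoder can already infer once it knows which
    entry was chosen as the best match."""
    here = sorted_entries[chosen]
    above = sorted_entries[chosen - 1] if chosen > 0 else b""
    below = sorted_entries[chosen + 1] if chosen + 1 < len(sorted_entries) else b""
    # If the real match were shorter than this, a neighbour would have
    # matched just as well, so "best" implies at least this much.
    return max(common_prefix_len(here, above),
               common_prefix_len(here, below))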

    book1 result with acb 2.00c (best mode) is 222,690.

    But it seemed really good back in 1997, and it was relatively
    famous in Russia thanks to an article in a popular magazine.
    Last edited by Shelwien; 20th January 2010 at 06:39.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,614
    Thanks
    127
    Thanked 365 Times in 235 Posts
I tried ACB on the LTCB but it failed on enwik9. It has a 64 MB file size limit, so I divided enwik8 into 2 equal-sized parts and enwik9 into 16 equal-sized parts. On enwik8 it compressed to 25,063,656 in 1359 sec. and decompressed in 1347 sec., using 16 MB memory and max compression (command u). On enwik9, after running several hours, I got this:

    Code:
    C:\tmp>timer acb u x9 e9
    
    Timer 3.01  Copyright (c) 2002-2003 Igor Pavlov  2003-07-10
                     Associative Coder of Buyanovsky
    ACB ver.2.00c (Shareware) Copyright(C) 1994-97 George Buyanovsky 04.25.1997
    Compress ( MAX_mode ) :
    e9  Data size= 1000000208
    Passed... 386402304  bytes
    ILLEGAL INTERRUPT #00
    eax = 00000140      esi = 00000024      flags = 3257        ds = 01C7
    ebx = 00000000      edi = 00AF9503      eip = 00005B29      es = 01C7
    ecx = 00000004      ebp = FE507E18      cs = 01BF           fs = 0000
    edx = 00000000      esp = FE507E08      ss = 01B7           gs = 0000
    Kernel Time  =     0.000 = 00:00:00.000 =   0%
    User Time    =     0.000 = 00:00:00.000 =   0%
    Process Time =     0.000 = 00:00:00.000 =   0%
    Global Time  = 43518.070 = 12:05:18.070 = 100%
    e9 is a subdirectory containing enwik9 in 16 parts named 01 through 16, each 62,500,000 bytes. The output when it crashed:
    Code:
     25,063,656 X8.ACB (enwik8 OK)
    657,327,616 X9.00B
    632,878,592 X9.01B
    184,361,987 X9.02B

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,614
    Thanks
    127
    Thanked 365 Times in 235 Posts
    I was able to test ACB on enwik9 by producing 16 separate archives. It ranks 61 of 126. http://mattmahoney.net/dc/text.html

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,061
    Thanks
    0
    Thanked 6 Times in 5 Posts
    found this in comp.compression:

> Dear compression gurus,
>
> Does anyone recognize this compressed file:
>
> http://gdcm.sourceforge.net/thingies/dummy.acb
>
> This is reported as 'ACB archive data' but I cannot open it from my
> debian/linux machine. I tried with p7zip, cabextract, unzip & gzip.

ACB was a compression format from the mid-1990s created by George
    Buyanovsky and was called Associative Coding. It can only be
    decompressed using the ACB utility, which is a DOS command-line
    executable. You'll need to run a DOS emulator if all you have is
    Linux. One place I found the decompressor is ACB200C.ZIP located
    here: http://www.qumran.org/ftp/local/dos/app/arc/files.htm

    This being a group mostly for the research and technical details of
    compression methods, here are some obligatory links:

    http://www.stringology.org/DataCompr.../index_en.html

    http://cbloomrants.blogspot.com/2008/09/09-27-08-2.html

...which references:

    http://siret.ms.mff.cuni.cz/skopal/pub/acb.pdf
    http://www.cbloom.com/news/leoacb.html
    http://www.cbloom.com/news/bygeorge.html

    Some related info: http://compression-links.info/LinkLi...&Search=Search

  7. #7
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
Thanks for the post.

Tsk... Classic.

Just a slow compressor, but hats off to it, really.

Really curious whether it would do well on data
filtered with the filters available today...

Testing begins.

  8. #8
    Member
    Join Date
    Jul 2011
    Location
    phils
    Posts
    32
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quoting Charles Bloom:

    ---If you see a context like "what the ", then it's highly likely the next sequence is "fu@k"---

    Haha.

  9. #9
    Member
    Join Date
    Aug 2011
    Location
    US
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts
Here is my take on this algorithm.
(Although it is pretty slow, it does provide a reasonable level of compression.)
Some terminology that I use to describe it:
Context = the bytes preceding the current position (read right to left from the current position)
Content = the bytes following the current position (read left to right from the current position)
Context | Content

Assume a context window holding only pointers into a text buffer (the history), with new pointers added so as to maintain sorted order.
That is, the pointers are sorted according to the context at the current position. To find the next match, the coder must first
identify the associative list of contexts. To do this, it finds the best-matching context in the sorted context window, then groups the contexts above and below it. For example, if the best context is found at index 400, the associative list would run from 300 to 500 (depending on your parameter).

Now sort this list by content, and within this associative list try to find the best content match. This is a bit tricky, as you don't actually find an exact string.
Assume we found that our string falls between entry 410 (matched length 20) and entry 411 (matched length 24) of the associative list.
Now all we have to output is where the match falls, i.e. 110 (410-300), and the decoder can decode the string based on this.

The decoder does the same thing as the encoder: first it identifies the associative list of contexts, which is 300 to 500 (since the best context it finds is 400).
It then reads 110 from the input, which means the match is at 410 (300+110). To decode the match, it looks at entries 410 and 411 and keeps outputting bytes as long as the two agree.
And that is how it can decode without needing an explicit offset and length.

Later versions of ACB seem to add a few more ideas: if the match runs past what the bracketing contents imply,
the output becomes a (position, mismatched byte, extra length) triple. This could perhaps be improved further.
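
A toy sketch of one coding step in Python (the names and the WINDOW
parameter are mine, and everything is rebuilt from scratch on each call;
just an illustration of the idea, not real ACB):

Code:
import bisect

WINDOW = 100  # half-width of the associative list; a free parameter

def ctx(history: bytes, i: int) -> bytes:
    # Context of position i: the bytes to its left, compared right-to-left
    return history[:i][::-1]

def encode_step(history: bytes, upcoming: bytes):
    """Return the index we'd transmit, plus the bytes the decoder
    could reproduce from that index alone."""
    # Pointers into history, sorted by context
    order = sorted(range(len(history)), key=lambda i: ctx(history, i))
    keys = [ctx(history, i) for i in order]
    # 1. Best-matching context = insertion point of the current context
    best = bisect.bisect_left(keys, ctx(history, len(history)))
    # 2. Associative list = a window of contexts around the best one
    lo, hi = max(0, best - WINDOW), min(len(order), best + WINDOW)
    # 3. Sort the associative list by content, locate the upcoming bytes
    contents = sorted(history[i:] for i in order[lo:hi])
    k = bisect.bisect_left(contents, upcoming)
    # The decoder, knowing only k, outputs the common prefix of the two
    # bracketing contents -- no explicit (offset, length) pair needed.
    left = contents[k - 1] if k > 0 else b""
    right = contents[k] if k < len(contents) else b""
    n = 0
    while n < min(len(left), len(right)) and left[n] == right[n]:
        n += 1
    return k, upcoming[:n]

history = b"what the hell? what the heck! and now: what the "
print(encode_step(history, b"hell again"))  # decodable prefix: b"he"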

    Thanks for reading
    Last edited by Guzzo; 27th August 2011 at 23:54.

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    737
    Thanks
    24
    Thanked 33 Times in 26 Posts
    Quote Originally Posted by Shelwien View Post
    Uploaded my ACB materials: http://ctxmodel.net/files/ACB.rar
    This basic idea can be formalized in terms of suffix arrays. Suppose T[i] is the ith character of the text, SA[i] is the text position of the ith suffix, and ISA[i] is the inverse of SA, which gives the index into SA of the suffix beginning at text position i.

One approach is, for each character, to find the position of the character that sorts just after it (or before; it doesn't matter). Let's say NextSuff[i] = SA[ISA[i] + 1]. Then, for each character after the first, you can store NextSuff as the delta, NextSuff[i] - NextSuff[i-1]. Most of the time the delta is 1, so this function is compressible. You can get back the identity of the character, because NextSuff is like a linked list that links all the suffixes in sort order: if you have the character frequencies, you can fill in the characters by iterating the list. The compression you get from this alone is not great, but there is plenty of room for tweaks. For one thing, you can look at the suffixes before and after the current one, and if you're inside a common prefix, you could avoid encoding that character entirely.
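
A quick Python sketch of this first approach, just to make the
bookkeeping concrete (my own toy code; the decoder side and the
frequency-based character recovery are omitted):

Code:
def suffix_array(t: str):
    # O(n^2 log n) toy construction; fine for illustration
    return sorted(range(len(t)), key=lambda i: t[i:])

def next_suff_deltas(t: str):
    sa = suffix_array(t)
    isa = [0] * len(t)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    # NextSuff[i] = text position of the suffix sorting just after
    # suffix i; -1 marks the last suffix in sort order
    nxt = [sa[isa[i] + 1] if isa[i] + 1 < len(t) else -1
           for i in range(len(t))]
    # Mostly 1 on real text, hence compressible
    return [nxt[i] - nxt[i - 1] for i in range(1, len(t))]

print(next_suff_deltas("banana"))  # [-4, -1, 2, 1, 1]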

A slightly different approach is, rather than outputting the text position of the next sorted suffix, to output its SA index, as a delta. The function that iterates forward in the SA is usually called psi in the compression research papers; the function that iterates backward (e.g., in the BWT) is LF. psi and LF are inverses of each other. PSI[i] = ISA[SA[i]+1]: psi takes the current SA index and returns the SA index of the suffix starting at the following text position. To make it compressible, psi is encoded as a delta.

NextSuff[i] is the suffix that sorts after the current one, so you could output ISA[NextSuff[i]+1]-ISA[i+1] as the sort distance to the next suffix (distance from the previous suffix works the same way). Since the decoder won't have seen some of the suffixes yet, you could revise the calculation to omit these, using SA to get the text positions in the intervening range. Reproducing the text from the compressor output would be analogous to reading forward in the SA using psi. You could use the same trick of omitting the common prefix as in approach #1.

Another possible improvement is to compute the SA distance between the two suffixes above and below the new insertion point (delta psi) and use this bound to store the new delta more succinctly (if there is a nonzero gap). This might save bits compared with gamma or delta coding; arithmetic coding may be a good option over this restricted range.
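
And a matching sketch for the psi side (again my own toy code;
rebuild() peeks at t for the output characters, where a real decoder
would recover them from the character counts per SA bucket):

Code:
def psi_and_sa(t: str):
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    isa = [0] * len(t)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    # psi maps the SA index of text position p to the SA index of
    # p + 1; -1 marks the suffix with no successor (last position)
    psi = [isa[sa[r] + 1] if sa[r] + 1 < len(t) else -1
           for r in range(len(t))]
    return psi, sa, isa

def rebuild(t: str) -> str:
    psi, sa, isa = psi_and_sa(t)
    out, r = [], isa[0]       # start at the suffix beginning at position 0
    for _ in range(len(t)):
        out.append(t[sa[r]])  # a real decoder reads this from buckets
        r = psi[r]
    return "".join(out)

assert rebuild("banana") == "banana"  # reading forward via psi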

    I played around a little bit with both approaches, but I haven't made a working encoder/decoder yet. I didn't know about ACB before now, so it's useful to know that this style of compressor has been implemented successfully and showed promising results. It sounds like Buyanovsky implemented a lot of the tweaks I had in mind. Using a suffix array as a starting point would almost certainly make it faster, and ought to be significantly easier to experiment with.
