
Thread: best compression ratio for compressing small text (length < 2000 bytes)

  1. #1
    Member
    Join Date
    Sep 2013
    Location
    Romania
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Question: best compression ratio for compressing small text (length < 2000 bytes)

    Hi compression gurus,

    Which algorithms offer the best compression ratio for small text (length < 2000 bytes)?

    Any link to Java open-source code is highly appreciated as well.

    Thanks

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    If the files have something in common, then a pre-trained context mixing compressor should obtain the top compression ratio.
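
    This is not context mixing, but here is a minimal Java sketch of the same "share knowledge up front" idea, using the JDK's Deflater/Inflater with a preset dictionary. The class name, sample text and dictionary are made up for illustration; both sides have to agree on the dictionary out of band.
    Code:
      import java.util.zip.Deflater;
      import java.util.zip.Inflater;

      // Sketch: prime zlib with a dictionary of text the small messages have in common.
      public class PresetDictDemo {
          static final byte[] DICT = "the quick brown fox jumps over the lazy dog ".getBytes();

          public static void main(String[] args) throws Exception {
              byte[] input = "the quick brown fox jumps over the lazy dog again".getBytes();

              Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
              def.setDictionary(DICT);            // "pre-train" the LZ77 window
              def.setInput(input);
              def.finish();
              byte[] buf = new byte[4096];
              int clen = def.deflate(buf);
              System.out.println(input.length + " -> " + clen + " bytes");

              Inflater inf = new Inflater();
              inf.setInput(buf, 0, clen);
              byte[] out = new byte[4096];
              int n = inf.inflate(out);
              if (n == 0 && inf.needsDictionary()) {  // decoder must supply the same dictionary
                  inf.setDictionary(DICT);
                  n = inf.inflate(out);
              }
              System.out.println(new String(out, 0, n));
          }
      }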

  3. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Input RAM-disk:
    898 bytes - text file

    Output RAM-disk:
    1,520 bytes, 0.075 sec. - 0.041 sec., WinZpaq - 6
    1,146 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
    779 bytes, 0.005 sec. - 0.005 sec., Qpress - 3
    765 bytes, 0.006 sec. - 0.006 sec., LZ4 - 2
    746 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
    690 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
    625 bytes, 0.010 sec. - 0.005 sec., lzturbo - 49
    622 bytes, 0.014 sec. - 0.012 sec., WinRAR - 5
    621 bytes, 0.020 sec. - 0.018 sec., ZCM - 7
    600 bytes, 0.107 sec. - 0.102 sec., NanoZip - c
    586 bytes, 0.124 sec. - 0.008 sec., Bsc - 8


    Input RAM-disk:
    1,480 bytes - text file

    Output RAM-disk:
    1,766 bytes, 0.077 sec. - 0.042 sec., WinZpaq - 6
    1,424 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
    1,201 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
    1,148 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
    996 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
    983 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
    933 bytes, 0.010 sec. - 0.005 sec., lzturbo - 49
    933 bytes, 0.021 sec. - 0.020 sec., ZCM - 7
    913 bytes, 0.013 sec. - 0.011 sec., WinRAR - 5
    872 bytes, 0.107 sec. - 0.102 sec., NanoZip - c
    854 bytes, 0.125 sec. - 0.007 sec., Bsc - 8


    Input RAM-disk:
    3,199 bytes - text file

    Output RAM-disk:
    2,418 bytes, 0.081 sec. - 0.046 sec., WinZpaq - 6
    2,350 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
    2,170 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
    2,167 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
    1,759 bytes, 0.024 sec. - 0.023 sec., ZCM - 7
    1,753 bytes, 0.011 sec. - 0.005 sec., lzturbo - 49
    1,747 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
    1,691 bytes, 0.013 sec. - 0.011 sec., WinRAR - 5
    1,657 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
    1,560 bytes, 0.122 sec. - 0.008 sec., Bsc - 8
    1,559 bytes, 0.110 sec. - 0.104 sec., NanoZip - c


    Input RAM-disk:
    7,626 bytes - text file

    Output RAM-disk:
    4,782 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
    4,492 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
    3,858 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
    3,848 bytes, 0.089 sec. - 0.057 sec., WinZpaq - 6
    3,585 bytes, 0.031 sec. - 0.030 sec., ZCM - 7
    3,533 bytes, 0.012 sec. - 0.006 sec., lzturbo - 49
    3,477 bytes, 0.014 sec. - 0.012 sec., WinRAR - 5
    3,449 bytes, 0.009 sec. - 0.006 sec., 7-Zip - 9
    3,150 bytes, 0.124 sec. - 0.008 sec., Bsc - 8
    3,104 bytes, 0.056 sec. - 0.051 sec., FreeArc - 9
    3,006 bytes, 0.114 sec. - 0.111 sec., NanoZip - c

    All one thread.
    Last edited by Sportman; 17th September 2013 at 14:24.

  4. The Following 3 Users Say Thank You to Sportman For This Useful Post:

    Hacker (29th January 2017),prese (17th September 2013),samsat1024 (18th September 2013)

  5. #4
    Member
    Join Date
    Sep 2013
    Location
    Romania
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thanks Sportman.

    Is there any chance of finding a Java implementation of NanoZip or BSC?

    Thanks

  6. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I don't think there are Java sources, but you can ask:
    http://libbsc.com
    http://www.nanozip.net

  7. #6
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Because the headers of popular archivers can be big and have a big impact on small test files, I also tested some less popular compressors (zip excluded):

    Input:
    898 bytes, text file

    Output:
    995 bytes, zlite
    919 bytes, kwc
    803 bytes, mtcc
    696 bytes, sharc 1
    671 bytes, crush x
    661 bytes, rar zip max
    640 bytes, packet
    501 bytes, bcm
    470 bytes, ppmx
    466 bytes, paq8pxd7 8


    Input:
    1,480 bytes, text file

    Output:
    1,501 bytes, kwc
    1,299 bytes, mtcc
    1,293 bytes, zlite
    1,136 bytes, sharc 1
    1,028 bytes, crush x
    946 bytes, rar zip max
    941 bytes, packet
    769 bytes, bcm
    727 bytes, ppmx
    702 bytes, paq8pxd7 8


    Input:
    3,199 bytes, text file

    Output:
    3,203 bytes, kwc
    2,744 bytes, mtcc
    2,369 bytes, sharc 1
    2,122 bytes, zlite
    1,955 bytes, crush x
    1,755 bytes, packet
    1,715 bytes, rar zip max
    1,485 bytes, bcm
    1,407 bytes, ppmx
    1,326 bytes, paq8pxd7 8


    Input:
    7,626 bytes, text file

    Output:
    7,349 bytes, kwc
    6,499 bytes, mtcc
    5,340 bytes, sharc 1
    4,021 bytes, zlite
    3,956 bytes, crush x
    3,560 bytes, packet
    3,492 bytes, rar zip max
    3,072 bytes, bcm
    2,916 bytes, ppmx
    2,712 bytes, paq8pxd7 8
    Last edited by Sportman; 17th September 2013 at 16:56.

  8. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    Hacker (29th January 2017),samsat1024 (18th September 2013)

  9. #7
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    With smaller text files, to show the header-size influence better:

    Input RAM-disk:
    346 bytes, test file

    Output RAM-disk:
    1,233 bytes, 0.067 sec. - 0.037 sec., WinZpaq - 6
    871 bytes, 0.542 sec. - x.xxx sec., eXdupe - 3
    499 bytes, 0.053 sec. - 0.047 sec., FreeArc - 9
    425 bytes, 0.004 sec. - 0.003 sec., Qpress - 3
    396 bytes, 0.007 sec. - 0.005 sec., 7-Zip - 9
    344 bytes, 0.018 sec. - 0.015 sec., ZCM - 7
    341 bytes, 0.004 sec. - 0.004 sec., LZ4 - 2
    338 bytes, 0.011 sec. - 0.010 sec., WinRAR - 5
    315 bytes, 0.106 sec. - 0.096 sec., NanoZip - c
    313 bytes, 0.009 sec. - 0.005 sec., lzturbo - 49
    312 bytes, 0.027 sec. - 0.006 sec., Bsc - 6


    Input:
    346 bytes, test file

    Output:
    714 bytes, zlite
    390 bytes, rar zip max
    370 bytes, kwc
    368 bytes, packet
    317 bytes, crush x
    314 bytes, mtcc
    300 bytes, sharc 1
    232 bytes, paq8pxd7 8
    225 bytes, bcm
    221 bytes, ppmx


    Input RAM-disk:
    186 bytes, text file

    Output RAM-disk:
    1,194 bytes, 0.066 sec. - 0.035 sec., WinZpaq - 6
    768 bytes, 0.652 sec. - x.xxx sec., eXdupe - 3
    422 bytes, 0.053 sec. - 0.048 sec., FreeArc - 9
    298 bytes, 0.007 sec. - 0.006 sec., 7-Zip - 9
    259 bytes, 0.004 sec. - 0.004 sec., Qpress - 3
    233 bytes, 0.012 sec. - 0.010 sec., WinRAR - 5
    232 bytes, 0.025 sec. - 0.005 sec., Bsc - 6
    225 bytes, 0.018 sec. - 0.014 sec., ZCM - 7
    216 bytes, 0.104 sec. - 0.097 sec., NanoZip - c
    213 bytes, 0.009 sec. - 0.005 sec., lzturbo - 49
    202 bytes, 0.004 sec. - 0.004 sec., LZ4 - 2


    Input:
    186 bytes, text file

    Output:
    623 bytes, zlite
    297 bytes, rar zip max
    240 bytes, packet
    208 bytes, kwc
    201 bytes, crush x
    186 bytes, sharc 1
    174 bytes, mtcc
    156 bytes, paq8pxd7 8
    144 bytes, ppmx
    137 bytes, bcm
    Last edited by Sportman; 17th September 2013 at 23:05.

  10. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    Hacker (29th January 2017),samsat1024 (18th September 2013)

  11. #8
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    186
    Thanked 16 Times in 11 Posts
    You could try ccm or ccmx too. Quite old (you can find them in this forum) but still often the 'best'.

  12. #9
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    For small text compression it's hard to beat BWTS followed by a simple bijective RLE/ARI type of coder.
    I usually showed how the file "BANANAS" compresses to smaller than 7 bytes, which is hard to do with
    most compressors. Even the "SCOTT" transform followed by bijective LZW does pretty well. All
    these programs are on this site somewhere.

  13. #10
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    186 bytes to 151 bytes, ccmx 7
    346 bytes to 241 bytes, ccmx 7
    898 bytes to 547 bytes, ccmx 7
    1,480 bytes to 872 bytes, ccmx 7
    3,199 bytes to 1,665 bytes, ccmx 7
    7,626 bytes to 3,370 bytes, ccmx 7

    186 bytes to 144 bytes, cmm 43
    346 bytes to 244 bytes, cmm 43
    898 bytes to 534 bytes, cmm 43
    1,480 bytes to 799 bytes, cmm 43
    3,199 bytes to 1,499 bytes, cmm 43
    7,626 bytes to 2,996 bytes, cmm 43

  14. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Some results on enwik3 (first 1000 bytes of enwik9).

    Code:
      302 enwik3.pmm (ppmonstr)
      314 enwik3-7.paq8px
      320 enwik3-o8-m256-r1.pmd
      322 enwik3-d6-m16M-f16M.ctw
      343 enwik3-9.gz
      354 enwik3-7.ccmx
      355 enwik3-7.ccm
      377 enwik3-c1000000000.dmc
      391 enwik3-9.bz2
      466 enwik3-mx.7z
      506 enwik3-ms0.0c3.0.255-fragile.zpaq
      530 enwik3.Z
      570 enwik3-9.zip
      628 enwik3.fpaq0
    1,000 enwik3
    1,133 enwik3-m5.zpaq
    1,266 enwik3-m4.zpaq
    1,306 enwik3-m3.zpaq
    1,321 enwik3-m6.zpaq
    1,488 enwik3-m2.zpaq
    1,503 enwik3-m1.zpaq
    Size differences for small files are mainly due to archive overhead, so single file compressors do better than archivers. The zpaq journaling archive format has a lot of overhead for versioning, deduplication and error recovery. Option -m s0.0c3.0.255 specifies streaming format (s), 1 MB block size (0), no preprocessing (0), a direct context model (c) with fast adaptation (3 = max count), no special contexts (0), and order 1 (255). -fragile removes the header locator tag and checksum to save 23 bytes. There is still the overhead of the block header with the algorithm description and the segment header with the file name and size, but the algorithm is much simpler than LZ77 (-m 1, 2, 3), BWT (4), or multiple CM (5, 6).
    Last edited by Matt Mahoney; 17th September 2013 at 22:11.
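
    To see the container-overhead point in isolation, here is an illustrative Java comparison (a sketch, not a benchmark of any of the archivers above): the same deflate output written raw versus wrapped in a gzip container, whose fixed header and trailer dominate at these sizes.
    Code:
      import java.io.ByteArrayOutputStream;
      import java.util.zip.Deflater;
      import java.util.zip.GZIPOutputStream;

      // Same deflate algorithm, different containers: the wrapper cost is fixed,
      // so it dominates for inputs of a few hundred bytes.
      public class ContainerOverhead {
          public static void main(String[] args) throws Exception {
              byte[] input = "a short text message of a few hundred bytes ...".getBytes();

              // Raw deflate: no zlib/gzip header, no checksum.
              Deflater raw = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
              raw.setInput(input);
              raw.finish();
              byte[] buf = new byte[4096];
              int rawLen = raw.deflate(buf);

              // gzip: same compressed data plus a 10-byte header and an 8-byte CRC32/length trailer.
              ByteArrayOutputStream bos = new ByteArrayOutputStream();
              try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                  gz.write(input);
              }

              System.out.println("raw deflate: " + rawLen + " bytes, gzip: " + bos.size() + " bytes");
          }
      }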

  15. The Following 2 Users Say Thank You to Matt Mahoney For This Useful Post:

    Hacker (29th January 2017),samsat1024 (18th September 2013)

  16. #12
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Input RAM-disk:
    1,000 bytes, enwik3

    Output RAM-disk:
    1,321 bytes, 0.072 sec. - 0.039 sec., WinZpaq - 6
    972 bytes, 0.542 sec. - x.xxx sec., eXdupe - 3
    577 bytes, 0.057 sec. - 0.050 sec., FreeArc - 9
    533 bytes, 0.022 sec. - 0.018 sec., ZCM - 7
    493 bytes, 0.005 sec. - 0.005 sec., Qpress - 3
    466 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
    463 bytes, 0.109 sec. - 0.103 sec., NanoZip - c
    453 bytes, 0.006 sec. - 0.006 sec., LZ4 - 2
    430 bytes, 0.028 sec. - 0.008 sec., Bsc - 6
    409 bytes, 0.014 sec. - 0.011 sec., WinRAR - 5
    405 bytes, 0.011 sec. - 0.006 sec., lzturbo - 49


    Input:
    1,000 bytes, enwik3

    Output:
    879 bytes, mtcc
    837 bytes, zlite
    833 bytes, kwc
    712 bytes, sharc 1
    463 bytes, rar zip max
    457 bytes, cmm 43
    431 bytes, bcm
    399 bytes, packet
    391 bytes, crush x
    355 bytes, ccm 7
    354 bytes, ccmx 7
    313 bytes, ppmx
    286 bytes, paq8pxd7 8

  17. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    Hacker (29th January 2017),samsat1024 (18th September 2013)

  18. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    For small text compression it's hard to beat BWTS followed by a simple bijective RLE/ARI type of coder.
    I usually showed how the file "BANANAS" compresses to smaller than 7 bytes, which is hard to do with
    most compressors. Even the "SCOTT" transform followed by bijective LZW does pretty well. All
    these programs are on this site somewhere.
    BANANAS happens to have one segment that usefully repeats, but for words that small, usually almost all the compression is order 0, basically just like plain Huffman would do. The BWTS would be unlikely to do any harm by itself, but it would depend on the second step transform, etc.
    Last edited by nburns; 18th September 2013 at 18:33.
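
    As a rough order-0 sanity check (an illustrative calculation, not anyone's coder), the memoryless entropy of BANANAS is already under 2 bits per character, so most of the gain is available to any order-0 coder that avoids header overhead.
    Code:
      import java.util.HashMap;
      import java.util.Map;

      // Order-0 (memoryless) entropy of a short string: the lower bound for any
      // order-0 coder such as plain Huffman or order-0 arithmetic, ignoring headers.
      public class Order0Entropy {
          public static void main(String[] args) {
              String s = "BANANAS";
              Map<Character, Integer> freq = new HashMap<>();
              for (char c : s.toCharArray()) freq.merge(c, 1, Integer::sum);

              double bits = 0;
              for (int f : freq.values()) {
                  double p = (double) f / s.length();
                  bits += -f * Math.log(p) / Math.log(2);  // f occurrences at -log2(p) bits each
              }
              // A:3 N:2 B:1 S:1 over 7 symbols -> about 12.9 bits, ~1.84 bits/char
              System.out.printf("%s: %.2f bits (%.2f bits/char)%n", s, bits, bits / s.length());
          }
      }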

  19. #14
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    BANANAS happens to have one segment that usefully repeats, but for words that small, usually almost all the compression is order 0, basically just like plain Huffman would do. The BWTS would be unlikely to do any harm by itself, but it would depend on the second step transform, etc.
    Of course, since the BWTS of BANANAS is BNNSAAA, it works out nicely.

    But if all seven letters were different, there would be no harm.

    Then again, if the letters were already grouped, a BWTS would usually make it worse.

    For really small files: if the file is random, it's best not to compress at all.

    The more you know about the file, the better you can make a compressor for it.

    For example, if it's random byte data, just use an order-zero bijective arithmetic coder, not a
    Huffman coder, if the goal is to make the average size the smallest.

    For IID files where the symbols can be any byte: over a group of random files, the average for the
    bijective zero-order arithmetic coder gets smaller compared to the bijective Huffman compressor as
    the number of random files increases.

    People get confused about this, since for any arbitrary file of the type above sometimes the
    arithmetic coder compresses better and sometimes the Huffman coder does, but in general the arithmetic coder wins.

    A so-called plain Huffman coder, where you have to store a table in front, would not be a good compressor
    for such small files, because it's not possible to make it bijective. There is, though, a hybrid type of
    Huffman with a header of sorts, containing the symbols used and where they first occur, that can
    be made bijective, but the second part of the file is still adaptive.

  20. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    A so-called plain Huffman coder, where you have to store a table in front, would not be a good compressor
    for such small files, because it's not possible to make it bijective. There is, though, a hybrid type of
    Huffman with a header of sorts, containing the symbols used and where they first occur, that can
    be made bijective, but the second part of the file is still adaptive.
    There is fully-adaptive Huffman, which alters the tree after every character, and always yields the optimal tree for that prefix. It apparently never caught on, because it was always too slow.

  21. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Also, fully adaptive Huffman coding does not compress as well (or as fast) as arithmetic coding using the same model.

  22. #17
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    There is fully-adaptive Huffman, which alters the tree after every character, and always yields the optimal tree for that prefix. It apparently never caught on, because it was always too slow.
    I agree, and I have coded fully bijective adaptive Huffman coders. And yes, I prefer them over plain Huffman coders.
    But even though they do better than plain Huffman coders on average, the bijective adaptive arithmetic coder beats
    the adaptive bijective Huffman coder on average. I have implemented an adaptive Huffman coder using the code of
    an arithmetic coder.

    However, there is more than one way to update the tree (see http://www.ics.uci.edu/~dan/pubs/DC-Sec4.html)
    and there are several ways to do the adaptive arithmetic coding.

  23. #18
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Huffman versus arithmetic: a simple example.

    When the best general adaptive Huffman coder gets a file of 256 bytes, all different, it codes
    the first byte as 8 bits. Each of the next 255 bytes then needs a single bit that tells whether it is
    a new character or not. The same 263 bits can also be used in a simple arithmetic coder, where you
    give a 50/50 weight to the case of seeing a new character during compression. Of course, most reasonable
    arithmetic coders would assign less than a bit to a new character if many occur in a row, but
    this is for the comparison of the arithmetic coder versus an adaptive Huffman coder.

    Next, the bits for the worst-case adaptive Huffman when assigning codes to new characters:

    127 at 8 bits, 64 at 7 bits, 32 at 6 bits, 16 at 5 bits, 8 at 4 bits, 4 at 3 bits, 2 at 2 bits, 1 at 1 bit, 1 at zero bits:
    263 + 127*8 + 64*7 + 32*6 + 16*5 + 8*4 + 4*3 + 2*2 + 1*1 + 1*0 = 2048 bits, or 256 bytes; with bijective ending, 256 bytes.


    Now look at the best-case adaptive Huffman:

    128 at 7 bits, 64 at 6 bits, 32 at 5 bits, 16 at 4 bits, 8 at 3 bits, 4 at 2 bits, 2 at 1 bit, 1 at zero bits:
    263 + 128*7 + 64*6 + 32*5 + 16*4 + 8*3 + 4*2 + 2*1 + 1*0 = 1801 bits, or 225.125 bytes; with bijective endings, 226 bytes.

    The simple adaptive arithmetic coder:

    263 + log(255!)/log(2) = 1938.9962872242146194061476793149... bits, equal to 242.374535903... bytes, for a bijective file length of 242 or 243 bytes depending on the order of the data.

    Note that in all 3 cases you are left with each symbol having a weight of 1/256, or 8 bits, once you have processed the 256 bytes of data.


    Note that the average of the arithmetic coder over the set of all files where each symbol is used once is smaller than the average of the same set compressed with the adaptive Huffman coder. The reason is the simple counting argument: the arithmetic coder is by default optimal for the average. Also note that
    the average of the smallest possible Huffman length and the longest Huffman length is smaller than what the arithmetic coder does. However, weighted over all such files, the Huffman coder compresses more files to a longer length than the arithmetic coder.
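
    A quick re-check of the three totals in this example, under the post's own assumptions (263 bits of first-byte plus flag overhead, then the listed code lengths):
    Code:
      // Recomputes the totals from the 256-distinct-bytes example above.
      public class HuffmanVsArithmeticTotals {
          public static void main(String[] args) {
              int overhead = 263;  // 8 bits for the first byte + 255 new-character flag bits

              // Worst-case adaptive Huffman code lengths for the new characters.
              int worst = overhead + 127*8 + 64*7 + 32*6 + 16*5 + 8*4 + 4*3 + 2*2 + 1*1 + 1*0;

              // Best-case adaptive Huffman code lengths.
              int best = overhead + 128*7 + 64*6 + 32*5 + 16*4 + 8*3 + 4*2 + 2*1 + 1*0;

              // Simple adaptive arithmetic: 263 + log2(255!) bits.
              double ari = overhead;
              for (int k = 2; k <= 255; k++) ari += Math.log(k) / Math.log(2);

              System.out.println("worst-case Huffman:  " + worst + " bits = " + worst / 8.0 + " bytes");
              System.out.println("best-case Huffman:   " + best + " bits = " + best / 8.0 + " bytes");
              System.out.printf("adaptive arithmetic: %.4f bits = %.4f bytes%n", ari, ari / 8);
              // Prints 2048 bits (256 bytes), 1801 bits (225.125 bytes), ~1938.996 bits (~242.37 bytes).
          }
      }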

  24. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I wonder how adaptive Huffman stacks up to splay tree compression. This was mentioned in a CS class I took (the CS department was big on splay trees... this may have had something to do with Danny Sleator being on the faculty). I did a search on it recently, and I managed to turn up a really old paper and some crufty C code that wasn't easy to play with.

  25. #20
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The difference between Huffman and arithmetic coding is that Huffman codes cannot have fractional bit lengths. The worst case is when all characters in the input are the same and therefore almost perfectly predictable. Huffman coding codes to 1 bit per byte. Arithmetic coding codes to near 0 bits per byte.
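
    To put rough numbers on that (an illustrative calculation with a fixed model, not any particular coder): an ideal arithmetic coder spends -log2(p) bits on a symbol of probability p, while Huffman can never spend less than 1 bit.
    Code:
      // Ideal cost per symbol: -log2(p) bits for arithmetic coding versus
      // Huffman's hard floor of 1 bit per symbol.
      public class FractionalBits {
          public static void main(String[] args) {
              for (double p : new double[] {0.5, 0.9, 0.99, 0.999}) {
                  double bits = -Math.log(p) / Math.log(2);
                  System.out.printf("p=%.3f  arithmetic ~%.4f bits/symbol, Huffman >= 1 bit%n", p, bits);
              }
              // At p = 0.99 arithmetic needs ~0.0145 bits/symbol, so 1000 nearly
              // certain bytes cost roughly 15 bits instead of roughly 1000 bits.
          }
      }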

  26. #21
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Also, fully adaptive Huffman coding does not compress as well (or as fast) as arithmetic coding using the same model.
    That's a definite deal-breaker.

  27. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    If you are doing arithmetic coding optimally, it should never do worse than Huffman. In the case where the optimal bit lengths turn out to always be whole numbers, arithmetic coding should effectively degrade into Huffman. Otherwise, it should take advantage of any opportunity to do better.
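
    For example (a hypothetical dyadic distribution): when the probabilities are exact powers of 1/2, the ideal code lengths -log2(p) are whole numbers and coincide with the Huffman code lengths.
    Code:
      // With dyadic probabilities {1/2, 1/4, 1/4} the ideal lengths are exactly
      // {1, 2, 2} bits, the same lengths a Huffman code would assign.
      public class DyadicCase {
          public static void main(String[] args) {
              char[] sym = {'A', 'B', 'C'};
              double[] p = {0.5, 0.25, 0.25};
              for (int i = 0; i < p.length; i++) {
                  double bits = -Math.log(p[i]) / Math.log(2);
                  System.out.printf("%c: p = %.2f -> %.1f bits%n", sym[i], p[i], bits);
              }
          }
      }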

  28. #23
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    If you are doing arithmetic coding optimally, it should never do worse than Huffman. In the case where the optimal bit lengths turn out to always be whole numbers, arithmetic coding should effectively degrade into Huffman. Otherwise, it should take advantage of any opportunity to do better.

    No, this is a common misunderstanding. With "optimal adaptive arithmetic coding" you do model the probabilities of each future symbol more accurately than with "optimal adaptive Huffman", based on PAST data, but the next symbol
    may not be what you are hoping for.

    Example: suppose there are only 3 symbol types in a file and you have already processed all 3 types, so there are no more unknown symbol types.

    Say 5 As have occurred,
    4 Bs have occurred,
    and 3 Cs have occurred.

    The optimal Huffman coder will assign the next codes like this:

    A 0
    B 10
    C 11

    so that it writes 1 bit for A and 2 bits for B or C.

    But the optimal arithmetic coder writes for A, B, C as follows:

    -log(5/12)/log(2) = 1.26303440583... bits for A
    -log(4/12)/log(2) = 1.58496250072... bits for B
    -log(3/12)/log(2) = 2 bits for C

    (In this case I am assuming more symbols follow, so no file-ending treatment is needed yet.)

    If symbol A occurs next, the file will be shorter for the optimal Huffman coder by 0.26303440583... bits.
    If symbol B occurs next, the file will be shorter for the optimal arithmetic coder by 0.41503749927... bits.
    If symbol C occurs next, it makes no difference, since both produce the same length.

    If the real probabilities of the file match the arithmetic coder's model more closely, it will create the shorter file.

    Only at rare places in the file will the two ever have the same possible outputs for the next symbol.
    And even at those places, once that symbol is processed, the updated tables for the very next symbol
    will NEVER match each other.
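
    The per-symbol costs from this example, computed directly (same numbers, just as a check):
    Code:
      // Next-symbol cost after seeing 5 As, 4 Bs and 3 Cs (12 symbols total):
      // the adaptive Huffman code uses whole bits, the arithmetic coder -log2(count/12).
      public class NextSymbolCost {
          public static void main(String[] args) {
              char[] sym = {'A', 'B', 'C'};
              int[] count = {5, 4, 3};
              int[] huffmanBits = {1, 2, 2};  // code lengths from the Huffman tree above
              for (int i = 0; i < sym.length; i++) {
                  double ari = -Math.log(count[i] / 12.0) / Math.log(2);
                  System.out.printf("%c: Huffman %d bits, arithmetic %.11f bits, difference %+.11f%n",
                          sym[i], huffmanBits[i], ari, ari - huffmanBits[i]);
              }
              // A: arithmetic is 0.263... bits longer; B: 0.415... bits shorter; C: identical.
          }
      }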

  29. #24
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Well, a bijective arithmetic coder can't always be better than or equal to a bijective Huffman coder, because that would defy the counting theorem.

    Let's suppose we have a compressor that treats the input as bijectively Huffman encoded, decodes it, and outputs it bijectively arithmetic encoded. If it compresses in some case, then it must expand in some other case.

  30. #25
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Well, a bijective arithmetic coder can't always be better than or equal to a bijective Huffman coder, because that would defy the counting theorem.

    Let's suppose we have a compressor that treats the input as bijectively Huffman encoded, decodes it, and outputs it bijectively arithmetic encoded. If it compresses in some case, then it must expand in some other case.
    Good point.

  31. #26
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Well, a bijective arithmetic coder can't always be better than or equal to a bijective Huffman coder, because that would defy the counting theorem.

    Let's suppose we have a compressor that treats the input as bijectively Huffman encoded, decodes it, and outputs it bijectively arithmetic encoded. If it compresses in some case, then it must expand in some other case.
    Anything bijective must be exactly optimal for some distribution. You could discover the distribution by feeding the decoder files starting from the smallest and seeing what comes out.

    Huffman can only make coarse-grained changes, though, so it doesn't do very well on real data.

    There must be a trade-off somewhere that benefits Huffman. Since Huffman can't make any symbol use <1 bit, that also means that it can't over-adapt to a string of identical symbols and incur a large cost to stop. But that's really not a problem of arithmetic coding, it's how the probabilities are adjusted.

  32. #27
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Well, a bijective arithmetic coder can't always be better than or equal to a bijective Huffman coder, because that would defy the counting theorem.

    Let's suppose we have a compressor that treats the input as bijectively Huffman encoded, decodes it, and outputs it bijectively arithmetic encoded. If it compresses in some case, then it must expand in some other case.
    Very good point. But when looking at all cases of compressing a file of a given length, the bijective arithmetic coder will still compress to an average length that is smaller than the bijective Huffman coder's.

    Here is an example, looking at a subset: the subset is any file of the length of concern together with all its permutations. The
    reasoning that follows holds for all subsets like this, for files of all lengths. The bijective arithmetic compressor compresses
    all permutations of a file to either n or n+1 bits, while the Huffman coder has a wider span of bit lengths, and that makes
    its average longer.


    To help see this, suppose you have a code tree for N bits; suppose N is 8, so you have 256 cases, each mapping to 8 bits. The tree
    is a valid tree and the average is 8 bits per symbol. Now make the smallest possible change that gives the tree some 9-bit codes:
    in this case 2 symbols would be 9 bits long, 1 symbol 7 bits long, and 253 symbols 8 bits long, and the average length of the
    compressed stream just increased to 8.00390625 bits per symbol. This is why, when compressing purely random data, nothing can beat a flat
    copy of one file to the other for the best average. The hope of a compression program is that the files people actually use are not random.
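
    Checking the 8.00390625 figure (just the weighted average from the example):
    Code:
      // Average code length over 256 equally likely symbols after the minimal tree
      // change described above: 1 symbol at 7 bits, 253 at 8 bits, 2 at 9 bits.
      public class TreeAverage {
          public static void main(String[] args) {
              double flat = (256 * 8) / 256.0;                    // 8.0 bits per symbol
              double tweaked = (1 * 7 + 253 * 8 + 2 * 9) / 256.0; // 8.00390625 bits per symbol
              System.out.println("flat tree:    " + flat);
              System.out.println("tweaked tree: " + tweaked);
          }
      }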

  33. #28
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    There must be a trade-off somewhere that benefits Huffman. Since Huffman can't make any symbol use <1 bit, that also means that it can't over-adapt to a string of identical symbols and incur a large cost to stop. But that's really not a problem of arithmetic coding, it's how the probabilities are adjusted.
    Actually, I think the places where a Huffman coder would be of more use than an arithmetic coder are extremely rare, since Matt has stated
    that Huffman compression is also slower.

    As an example from geometry: of the two, which is the better shape for a corral, a rectangle or a square? Well, the
    answer has to be the rectangle, since the square is a special case of the rectangle.

    It's the same thing with compression: it has to be the arithmetic coder, since Huffman is a subset of arithmetic. Maybe trying to
    force a non-optimal solution by finding the closest poor Huffman fit takes more time and work; I don't know.

    I am sure many would argue that there are many such places, since Huffman is easier for most to grasp. Yet it is hard for me to
    think of a place where arithmetic is not better or the same, since Huffman is just a subset of arithmetic.

  34. #29
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    It's the same thing with compression: it has to be the arithmetic coder, since Huffman is a subset of arithmetic. Maybe trying to
    force a non-optimal solution by finding the closest poor Huffman fit takes more time and work; I don't know.

    I am sure many would argue that there are many such places, since Huffman is easier for most to grasp. Yet it is hard for me to
    think of a place where arithmetic is not better or the same, since Huffman is just a subset of arithmetic.
    It's a little bit of an apples to oranges comparison, because Huffman, at least in its original form, computes an exact solution based on a given set of symbol frequencies, and arithmetic coding is just a technique for encoding arbitrary symbols within arbitrary ranges into one big number. Huffman sort of views symbols as separate messages that obey some known distribution, and it optimally solves that problem. Arithmetic coding is more like taking a number with a mixed radix and converting it to binary.

    Static Huffman is blazingly fast, but the tree is not so efficient to update incrementally. Since adaptive Huffman isn't really solving the right problem, anyway, there's not much point in using it.

  35. #30
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    It's a little bit of an apples to oranges comparison, because Huffman, at least in its original form, computes an exact solution based on a given set of symbol frequencies, and arithmetic coding is just a technique for encoding arbitrary symbols within arbitrary ranges into one big number. Huffman sort of views symbols as separate messages that obey some known distribution, and it optimally solves that problem. Arithmetic coding is more like taking a number with a mixed radix and converting it to binary.

    Static Huffman is blazingly fast, but the tree is not so efficient to update incrementally. Since adaptive Huffman isn't really solving the right problem, anyway, there's not much point in using it.
    Static Huffman is also a two-pass process. The solution that static Huffman computes is not the best; it is only conditionally optimal, because it forces an unrealistic constraint: that at every point in the file a WHOLE number of bits is used for the compression. If you remove this unnecessary constraint, you can get an optimal one-pass arithmetic coder.


