
Thread: Compressed data model

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    Compressed data model

    http://nishi.dreamhosters.com/u/cdm_test_1.rar

    This is a coder that attempts to exploit the tiny redundancies actually
    left behind by some compression algorithms, especially bitcode-based ones.

    The idea is rather simple: there are some (combinatorial) coding methods
    that always compress any data (not counting the space required to store the stats).

    For example, a 1024-bit block with a 0s count of 412 and a 1s count of 612 has a redundancy of ~33 bits,
    which can be used to encode the counts and still save some space:
    1024-Log[2, Binomial[1024.,1024/2-100]] = 33.6576 bits
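    As a quick numeric check of the figure above (a plain-Python sketch, not part of cdm itself):

```python
import math

# A 1024-bit block with a known ones count: there are only C(1024, 412)
# equally likely arrangements, so indexing one of them needs
# log2(C(1024, 412)) bits instead of 1024.
n, zeros = 1024, 412
redundancy = n - math.log2(math.comb(n, zeros))
print(round(redundancy, 4))  # ~33.66 bits
```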

    So cdm runs a parsing optimization algorithm that tries to split the data
    into a series of small blocks, each of which can be either stored or encoded.

    The method is asymmetric, so decoding is much faster than encoding.
    Encoding can also be made faster with weaker optimizer settings.

    Unfortunately I wasn't able to compress any paq archives with this version,
    but the results for unknown bitcoded formats look interesting.

    Also, it seems that cdm can be used for file analysis/format detection -
    here we can clearly see that jpeg uses bit7-0 bit order, while deflate
    uses bit0-7:
    Code:
    cdm = tuned to A10.jpg
    cdm1 = tuned to A10.jpg.bwts7-0
    cdm2 = tuned to book1__kzip.raw.bwts0-7
    
    [A10.jpg]                cdm     cdm1*   cdm2
    original size        === 842468  842468  842468
    7z -mx=9 compression === 842114  842114  842114
    direct test          === 841932  842526  842047
    bytewise bwts        === 838909  838926  838711
    bit7-0 bitwise bwts  === 820363* 817187* 818645*
    bit0-7 bitwise bwts  === 835902  835128  835247
    
    [book1__kzip.raw]        cdm     cdm1    cdm2*
    original size        === 299437  299437  299437
    7z -mx=9 compression === 299594  299594  299594
    direct test          === 299387  299551  299381
    bytewise bwts        === 299460  299687  299476
    bit7-0 bitwise bwts  === 299450  299657  299460
    bit0-7 bitwise bwts  === 299271* 299331* 299196*
    
    [decmprs8.exe]           cdm     cdm1    cdm2*
    original size        === 16720   16720   16720
    7z -mx=9 compression === 16929   16929   16929
    direct test          === 16616*  16615*  16609*
    bytewise bwts        === 16647   16651   16643
    bit7-0 bitwise bwts  === 16632   16641   16630
    bit0-7 bitwise bwts  === 16622   16630   16621
    
    [zik.ogg]                cdm     cdm1*   cdm2
    original size        === 2652463 2652463 2652463
    7z -mx=9 compression === 2622858 2622858 2622858
    direct test          === 2630762 2624089 2626150
    bytewise bwts        === 2620787 2618797 2618391
    bit7-0 bitwise bwts  === 2606332 2603721 2603797
    bit0-7 bitwise bwts  === 2584570*2580573*2581467*
    
    [cdm_test_0.rar]         cdm*    cdm1    cdm2
    original size        === 920156  920156* 920156
    7z -mx=9 compression === 920346  920346  920346
    direct test          === 919991  920653  920100
    bytewise bwts        === 920062  920777  920190
    bit7-0 bitwise bwts  === 919907* 920593  920032*
    bit0-7 bitwise bwts  === 919996  920704  920132
    
    [chm.7z]                 cdm     cdm1    cdm2*
    original size        === 1414783 1414783 1414783
    7z -mx=9 compression === 1414981 1414981 1414981
    direct test          === 1414065*1414445*1413792*
    bytewise bwts        === 1414718 1415734 1414773
    bit7-0 bitwise bwts  === 1414694 1415719 1414757
    bit0-7 bitwise bwts  === 1414704 1415730 1414775
    Please test it; I'd like to know which filetypes it can compress better than normal methods.


  3. #2
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Where is the cdm2 file?

  4. #3
    Shelwien
    Here, if you need it.
    I'm going to post a new version anyway...
    Attached Files


  6. #4
    Shelwien
    http://nishi.dreamhosters.com/u/cdm_test_2.rar

    - saved 1-2 bytes per file on EOF coding
    - removed the price cache, reducing encoder memory usage from 1G to 10M
    - support for blklen!=maxblk for uncompressed blocks
    - 3 freq0 ranges instead of 2
    - speed optimizations

    LZ4.exe is added to the demo; use LZ4 -12 first if the input file is compressible,
    because CDM doesn't include LZ compression.

    Code:
    cdm  = maxblk=256, tuned to book1__kzip.raw.bwts0-7
    cdm1 = maxblk=1024,tuned to book1__kzip.raw.bwts0-7
    cdm2 = maxblk=1024,tuned to A10.jpg.bwts7-0
    
            c.size c.time  d.time
    v2/cdm  818732  4.150s 0.234s
    v2/cdm1 818518 15.974s 0.281s
    v1/cdm2 818645 20.608s 0.281s
    (for a10.jpg.bwts7-0, input size = 842468 )
    
    [A10.jpg]                  cdm     lz4+cdm   cdm1     cdm2*
    original size        ===   842468   842487   842468   842468
    7z -mx=9 compression ===   842114   842133   842114   842114
    direct test          ===   842246   842265   842050   842561
    bytewise bwts        ===   838877   838900   838706   838965
    bit7-0 bitwise bwts  ===   818732*  818755*  818518*  817114*
    bit0-7 bitwise bwts  ===   835410   835420   835237   835146
    
    [book1__kzip.raw]          cdm     lz4+cdm   cdm1*    cdm2
    original size        ===   299437   299456   299437   299437
    7z -mx=9 compression ===   299594   299613   299594   299594
    direct test          ===   299375   299393   299368   299555
    bytewise bwts        ===   299492   299510   299474   299687
    bit7-0 bitwise bwts  ===   299476   299494   299458   299660
    bit0-7 bitwise bwts  ===   299217*  299239*  299195*  299345*
    
    [decmprs8.exe]             cdm     lz4+cdm*  cdm1     cdm2
    original size        ===   16720     16688   16720    16720
    7z -mx=9 compression ===   16929     16822   16929    16929
    direct test          ===   16612*    16599*  16609*   16614*
    bytewise bwts        ===   16643     16628   16642    16650
    bit7-0 bitwise bwts  ===   16630     16624   16629    16640
    bit0-7 bitwise bwts  ===   16620     16619   16619    16629
    
    [zik.ogg]                  cdm     lz4+cdm   cdm1     cdm2*
    original size        === 2652463   2646642  2652463  2652463
    7z -mx=9 compression === 2622858   2638401  2622858  2622858
    direct test          === 2625717   2626365  2626010  2624074
    bytewise bwts        === 2620392   2627915  2618428  2618939
    bit7-0 bitwise bwts  === 2605960   2618596  2603874  2603858
    bit0-7 bitwise bwts  === 2583340*  2600926* 2581441* 2580665*
    
    [cdm_test_0.rar]           cdm     lz4+cdm   cdm1*    cdm2
    original size        === 920156     920175   920156   920156*
    7z -mx=9 compression === 920346     920365   920346   920346
    direct test          === 920190     920208   920095   920683
    bytewise bwts        === 920270     920289   920184   920809
    bit7-0 bitwise bwts  === 920154*    920174*  920036*  920628
    bit0-7 bitwise bwts  === 920238     920256   920131   920735
    
    [chm.7z]                   cdm     lz4+cdm   cdm1*    cdm2
    original size        === 1414783   1414802  1414783  1414783
    7z -mx=9 compression === 1414981   1415000  1414981  1414981
    direct test          === 1413956*  1413974* 1413771* 1414490*
    bytewise bwts        === 1414854   1414870  1414756  1415746
    bit7-0 bitwise bwts  === 1414846   1414860  1414746  1415741
    bit0-7 bitwise bwts  === 1414863   1414878  1414765  1415751
    Attached Files


  8. #5
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Quote Originally Posted by Shelwien:
    The idea is rather simple: there are some (combinatorial) coding methods
    that always compress any data (not counting the space required to store the stats).
    Sounds like what is usually referred to as "enumerative coding" - it uses:
    binomial(n,k) = binomial(n-1,k) + binomial(n-1,k-1)
    which is quite intuitive: to place k ones in n positions, we can put a 0 or a 1 in the first position and then k or k-1 ones in the remaining positions, giving the above recurrence.
    So to enumerate (n,k) combinations, we can use [0..binomial(n-1,k)-1] for those starting with 0 and the remaining indices for those starting with 1.
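    That recurrence gives a direct encoder/decoder for combination indices; here is a minimal sketch (exact big-integer arithmetic, not a production coder):

```python
from math import comb

def rank(bits):
    """Lexicographic index of `bits` among all strings of the same
    length and ones count: if the current bit is 1, skip the
    C(n-1, k) strings that start with 0, then recurse on the rest."""
    n, k = len(bits), sum(bits)
    index = 0
    for b in bits:
        if b:
            index += comb(n - 1, k)
            k -= 1
        n -= 1
    return index

def unrank(index, n, k):
    """Inverse of rank(): rebuild the bit string from its index."""
    bits = []
    for _ in range(n):
        c = comb(n - 1, k)  # count of remaining strings starting with 0
        if index < c:
            bits.append(0)
        else:
            bits.append(1)
            index -= c
            k -= 1
        n -= 1
    return bits
```

    Every (n,k) string maps to a distinct index in [0, binomial(n,k)-1], which is exactly the payload an enumerative coder would emit.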

    While for standard entropy coding it is sufficient to know an approximate probability p~k/n, paying for the inaccuracy with increased size (the Kullback-Leibler divergence in bits/symbol), enumerative coding requires knowing the exact number of ones (k).
    It is great if we indeed need to encode/enumerate combinations, or to generate a random combination.
    It seems costly, as it requires knowing and operating on binomials, which are large numbers... but I remember seeing a paper claiming that it can be quite fast (compared to AC).

    A huge drawback is that it is usually used only for a binary alphabet with a single static probability, which is insufficient for real data compressors...
    I see it could be generalized to a larger alphabet, e.g. ternary:
    binomial(n|k,l) = binomial(n-1|k-1,l) + binomial(n-1|k,l-1) + binomial(n-1|k,l)
    but it seems even more costly, and a varying probability is difficult... it could be realized e.g. by interleaving a few fixed-probability coders.

    A similar concept to enumerative coding is used in Fisher's PVQ (Pyramid Vector Quantizer: http://ieeexplore.ieee.org/document/...number=1057198 ): encoding points on the l1 sphere sum_i |x_i|=k (again using a similar recurrence), which is the heart of Daala's PVQ (Perceptual Vector Quantizer).

  9. #6
    Shelwien
    > Sounds like what is usually referred as "enumerative coding"

    Yes, but it's implemented with a normal (although precise) rangecoder in this case.

    The difference from the usual adaptive arithmetic coding is that freq0 is known in advance
    (it is encoded before the data block), so instead of starting with freq0=1,freq1=1 and incrementing
    them, I'm starting with freq1=blklen-freq0 and decrementing.

    In this case, the size of the code generated by rc is very close to log2( (freq0+freq1)!/freq0!/freq1! ).

    Btw, an adaptive model would have codelen = log2( (freq0+freq1+1)!/freq0!/freq1! ),
    which is the same as encoding the freq0 value with probability = 1/(blklen+1)
    (it can take values in the range 0..blklen) on top of the decrementing scheme.

    But we can do better and use some model to encode the freq values.
    This is important, because with compressed data we would have a pretty good
    idea about the freq0 value, rather than assuming it to be purely random.

    > It seems costly as it requires knowing and operating on binomials being large numbers...
    > but I remember seeing a paper claiming that it can be quite fast (...comparing to AC).

    I made a fixed-point log2(n) table, then a log2(n!) table using it.
    Then log2( C(c0+c1,c0) ) = LUT[c0+c1]-LUT[c0]-LUT[c1].
    Here I'm using this to estimate the codelengths for the parsing optimizer.
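    A floating-point sketch of that table (the actual coder uses fixed point; the table size here is an arbitrary choice):

```python
import math

MAXN = 4096  # arbitrary table size for the sketch
LOG2FACT = [0.0] * (MAXN + 1)  # LOG2FACT[n] = log2(n!)
for i in range(1, MAXN + 1):
    LOG2FACT[i] = LOG2FACT[i - 1] + math.log2(i)

def est_codelen(c0, c1):
    """log2(C(c0+c1, c0)) via three table lookups - the kind of
    codelength estimate a parsing optimizer can afford per candidate block."""
    return LOG2FACT[c0 + c1] - LOG2FACT[c0] - LOG2FACT[c1]
```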

    But a very long time ago I also made an actual enumerative encoder -
    it precalculated binomials up to a fixed block size using Pascal's triangle.
    Sure, it was multiplication-free and pretty fast, but there was no
    compression gain compared to a rangecoder, all the symbol decomposition
    and blockwise coding was complicated, and the complete coder wasn't
    faster than the rc either.

    As to further cdm improvements, I have the following ideas:
    - Use adaptive linear counters for flags etc (cdm currently uses static probabilities).
    - Alternative model with bit[i]^bit[i-k] transform before encoding
    - Alternative model with byte value bitmask for a block
    - Simple LZ4-like LZ with its own parsing optimization
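    For instance, the bit[i]^bit[i-k] idea could look like this as a standalone reversible transform (a hypothetical sketch; in cdm it would be a submodel, not a separate pass):

```python
def xor_transform(bits, k):
    """Replace bit[i] with bit[i] ^ bit[i-k]; exposes period-k structure."""
    return [b if i < k else b ^ bits[i - k] for i, b in enumerate(bits)]

def xor_untransform(tbits, k):
    """Inverse: bit[i] = tbit[i] ^ bit[i-k], reconstructed left to right."""
    bits = []
    for i, t in enumerate(tbits):
        bits.append(t if i < k else t ^ bits[i - k])
    return bits
```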

    Originally I wanted to make a byte model and compress a frequency table of byte values,
    but that would require supporting much longer blocks, and parsing optimization
    is already pretty slow even with maxblk=256.


  11. #7
    Darek
    Some of my tests of cdm are in the attached Excel file. There is also a JPG file with the same table.

    My insights are as follows:

    1) This algorithm is very powerful for low-redundancy files (as you wrote) - for such files it even has some gains over 7zip with maximal compression - in my testbed these were E.TIF (internally compressed), F.JPG and J.EXE (internally packed).

    2) But some other files are also compressed quite well by the pure cdm1 algorithm, e.g.:

    ==== 0.WAV - this audio file is compressed only 7.9% worse than 7zip, at a compression ratio of about 60%
    ==== G.EXE - this executable is also close to 7zip - only 8.7% worse, at a compression ratio of about 75%

    3) For the 0.WAV file, pure cdm1 compression is about 25% better than lz4!

    4) For text files, pure cdm1 compression is only about 6.6% worse than lz4.

    Impressive.

    Darek
    Attached: CDM_test.jpg (1.33 MB); Excel file


  13. #8
    Shelwien
    http://nishi.dreamhosters.com/u/cdm_test_3.rar

    - 40% faster decoding (~8mb/s now)
    - gray code submodel added to parsing optimizer (better/slower compression)

    Code:
    cdm  = maxblk=256, tuned to book1__kzip.raw.bwts0-7
    cdm1 = maxblk=256, tuned to A10.jpg.bwts7-0
    
            c.size c.time  d.time
    v2/cdm  818732  4.150s 0.234s
    v2/cdm1 818518 15.974s 0.281s
    v3/cdm  818532  6.162s 0.156s
    v3/cdm1 816313  6.100s 0.171s
    (for a10.jpg.bwts7-0, input size = 842468 )
    
    Compression gain:
    A10.jpg.bwts7-0: (1-818532/842468)*100 = 2.84%
    zik.ogg.bwts0-7: (1-2583162/2652463)*100 = 2.61%
    
    [A10.jpg]                 cdm       cdm1*
    original size        === 842468    842468
    7z -mx=9 compression === 842114    842114
    direct test          === 842226    843455
    bytewise bwts        === 838587    839050
    bit7-0 bitwise bwts  === 818532*   816313*
    bit0-7 bitwise bwts  === 835366    835410
                                           
    [book1__kzip.raw]         cdm*      cdm1
    original size        === 299437    299437*
    7z -mx=9 compression === 299594    299594
    direct test          === 299367    299845
    bytewise bwts        === 299484    300002
    bit7-0 bitwise bwts  === 299471    299979
    bit0-7 bitwise bwts  === 299217*   299584
                                             
    [decmprs8.exe]            cdm*      cdm1 
    original size        === 16720      16720
    7z -mx=9 compression === 16929      16929
    direct test          === 16603*     16620*
    bytewise bwts        === 16641      16663
    bit7-0 bitwise bwts  === 16630      16655
    bit0-7 bitwise bwts  === 16619      16645
                                             
    [zik.ogg]                 cdm       cdm1*
    original size        === 2652463  2652463
    7z -mx=9 compression === 2622858  2622858
    direct test          === 2624956  2620633
    bytewise bwts        === 2619752  2620534
    bit7-0 bitwise bwts  === 2605869  2606261
    bit0-7 bitwise bwts  === 2583162* 2581991*
                                         
    [cdm_test_0.rar]          cdm*      cdm1                
    original size        === 920156    920156*
    7z -mx=9 compression === 920346    920346
    direct test          === 920147    921624
    bytewise bwts        === 920254    921816
    bit7-0 bitwise bwts  === 920152*   921643
    bit0-7 bitwise bwts  === 920232    921761
                                             
    [chm.7z]                  cdm*      cdm1                
    original size        === 1414783  1414783*
    7z -mx=9 compression === 1414981  1414981
    direct test          === 1413913* 1415641
    bytewise bwts        === 1414820  1417246
    bit7-0 bitwise bwts  === 1414818  1417255
    bit0-7 bitwise bwts  === 1414834  1417253


  15. #9
    Darek
    Here is the file with the v3 test. I've added a comparison to the lz4 -12 scores.

    About v2:
    Pure cdm v2 is about 1.1% better than lz4 -12 over my whole testbed.
    Running cdm on files preprocessed with lz4 -12 improves the lz4 scores by about 7%.
    Combining the best scores from pure and preprocessed cdm gives a 9.9% better score than lz4.

    About v3:
    Generally, pure cdm v3 gains about 5% over v2.
    Pure cdm v3 is about 6.0% better than lz4 -12 over my entire testbed !!!
    Running cdm on files preprocessed with lz4 -12 now adds only about 2.6% to the lz4 scores; due to the better cdm compression, lz4 preprocessing becomes less important.
    Combining the best scores from pure and preprocessed cdm gives a 10.7% better score than lz4.

    Darek
    Attached Files


  17. #10
    Shelwien
    It's also possible to compress CDM_test.7z from 32,455 to 32,438 bytes.

    But overall it's not quite what I wanted. There are better algorithms for wavs, images, etc.
    What about actual compressed formats? .aac? videos? .rar? .cab/.wim? image raws? .flif? LZO? various exe packers, etc.?

    Well, flif doesn't seem compressible, but otherwise here's an example:
    Code:
    [1.aac]                   cdm       cdm1*
    original size        === 1552687  1552687 (1-1268137/1552687)*100 = 18.32%
    7z -mx=9 compression === 1371836  1371836 (1-1268137/1371836)*100 = 7.55% 
    direct test          === 1545933  1534597
    bytewise bwts        === 1480729  1479445
    bit7-0 bitwise bwts  === 1274797* 1268137*
    bit0-7 bitwise bwts  === 1342676  1334783
    
    [001.avi(h264)]           cdm*      cdm1
    original size        === 4291766  4291766 (1-4282124/4291766)*100 = 0.22%
    7z -mx=9 compression === 4283009  4283009
    direct test          === 4283683  4291059
    bytewise bwts        === 4282819  4290384
    bit7-0 bitwise bwts  === 4282124* 4289619*
    bit0-7 bitwise bwts  === 4282193  4289641
    
    [skiing.avi(MPEG1)]       cdm       cdm1   lz4+cdm1
    original size        === 1331204  1331204  1188245 (1-1113776/1331204)*100 = 16.33%
    7z -mx=9 compression === 1172618  1172618  1176586 (1-1113776/1172618)*100 = 5.01% 
    direct test          === 1185342  1183480  1174674
    bytewise bwts        === 1179461  1177934  1172547
    bit7-0 bitwise bwts  === 1119314* 1113776* 1113910
    bit0-7 bitwise bwts  === 1164394  1162437  1160071


  19. #11
    Shelwien
    http://nishi.dreamhosters.com/u/cdm_test_4.rar

    - rangecoder bugfix for files of size <4
    - encoder speed optimizations (~3x faster than v3, 0.4M/s)
    - decoder speed optimizations (~2x faster than v3, 14M/s)
    - adaptive coding of uncompressed block flag (incl. parsing optimizer)
    Code:
    cdm  = maxblk=256, tuned to zik.ogg.bwts7-0
    
            c.size  c.time d.time
    v3/cdm 2583162 19.219s 0.312s
    v4/cdm 2580964  6.661s 0.187s
    (for zik.ogg.bwts7-0, input size = 2,652,463 )
    !---------------------!--------!--------!--------!--------!--------!--------!------!
    ! Filename            !  (1)   !  (2)   !  (3)   !  (4)   !  (5)   !  (6)   ! (7)  !
    !---------------------!--------!--------!--------!--------!--------!--------!------!
    ! A10.jpg             ! 842468 ! 842114 ! 842230 ! 838483 ! 816763*! 835101 ! 3.01%!
    ! enwik6a.zip         ! 728786 ! 728956 ! 727994*! 728708 ! 728660 ! 728435 ! 0.11%!
    ! zik.ogg             !2652463 !2622858 !2620905 !2618378 !2604407 !2580964*! 1.60%!
    ! cdm_test_0.rar      ! 920156 ! 920346 ! 920025*! 920124 ! 920110 ! 920142 ! 0.01%!
    ! chm.7z              !1414783 !1414981 !1413863*!1414569 !1414593 !1414617 ! 0.07%!
    ! 1.aac               !1552687 !1371836 !1537938 !1479186 !1269026*!1335940 ! 7.49%!
    ! 001.avi             !4291766 !4283009 !4282918 !4282012 !4281321*!4281398 ! 0.04%!
    ! skiing.avi          !1331204 !1172618 !1183171 !1177441 !1114276*!1162067 ! 4.98%!
    ! calgarytest.zpaq.7z !1002082 !1002283 !1001967*!1002057 !1002070 !1002069 ! 0.01%!
    ! enwik6a.cab         ! 618046 ! 618210 ! 617809*! 618041 ! 618031 ! 618028 ! 0.04%!
    ! enwik6a_ma3.rar     ! 643238 ! 643413 ! 642878*! 643249 ! 643153 ! 643213 ! 0.06%!
    ! enwik6a_ma5.rar     ! 644419 ! 644594 ! 644083*! 644435 ! 644326 ! 644397 ! 0.05%!
    ! enwik6a_mct.rar     ! 509469*! 509638 ! 509478 ! 509479 ! 509470 ! 509478 ! 0.00%!
    !---------------------!--------!--------!--------!--------!--------!--------!------!
    ! (1) original size                                                                !
    ! (2) 7z -mx=9 compression                                                         !
    ! (3) direct test                                                                  !
    ! (4) bytewise bwts                                                                !
    ! (5) bit7-0 bitwise bwts                                                          !
    ! (6) bit0-7 bitwise bwts                                                          !
    ! (7) gain: best result vs min of (1) and (2)                                      !
    !----------------------------------------------------------------------------------!
    1. Original size of calgarytest.zpaq is 1,003,792
    2. enwik6a is the first 2,097,152 (=2M) bytes of enwik
    3. enwik6a_mct.rar is compressed with ppmd vH -o16 -m256 (rar -mc16:256t+)


  21. #12
    Darek
    Is there a cdm1 for v4, or only cdm?

    >But overall it's not quite what I wanted. There are better algorithms for wavs and images, etc.
    Yes, I know. Good wav compression is only a side effect, but it's quite interesting because it's quite effective...

    >What about actual compressed formats? .aac? videos? .rar? .cab/.wim? image raws? .flif? LZO? various exe packers etc?
    I'll test some other files. Generally, after CM compressors there are no improvements at all - I've tested files after CMIX12 and some CMV files. I'll check more files too.

  22. #13
    Shelwien
    > Is there a cdm1 for v4, or only cdm?

    Only cdm. cdm1 was necessary because p_maxblk was static, and coders optimized for more redundant data (e.g. jpg, ogg) vs. less redundant data (.zip)
    produced significantly different results.
    But now that I've made p_maxblk adaptive, it doesn't matter that much anymore.

    > Good wav compression is only a side effect, but it's quite interesting because it's quite effective...

    It's actually BWT compression. You can probably get better results with normal bwt postcoders rather than cdm.

    > Generally, after CM compressors there are no improvements -

    With some faster ones, like ccm, there could be. E.g.
    Code:
    ! xml.lzna            ! 469489*! 469639 ! 469504 ! 469507 ! 469506 ! 469505 ! 0.00%!
    ! xml.kraken          ! 484282 ! 483623 ! 483063*! 484148 ! 483974 ! 483990 ! 0.12%!
    ! xml.bitknit         ! 500819 ! 500977 ! 500707*! 500829 ! 500829 ! 500790 ! 0.02%!
    > I've tested files after CMIX12 and some CMV files. I'll check more files too.

    Well, maybe you can check more redundant files? Like paq'ed bmps or something?

    But overall I don't have many ideas for cdm after arithmetic coding, if bytewise BWTS doesn't help.

    I did try a permutation model (each byte value appears only once in a block) - on such a block of length 256
    we can save 45.5 bytes, for example.
    And it does help a little, but cdm's parsing optimizer apparently only uses such blocks as a replacement
    for uncompressed ones, and even then very rarely - maybe 4-5 times per file.
    And of course it also slows down the encoding, so I didn't keep it.
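    The 45.5-byte figure can be verified directly: a 256-byte block in which every byte value occurs exactly once is one of 256! permutations, so it costs log2(256!) bits instead of 256*8:

```python
import math

perm_bits = math.log2(math.factorial(256))  # ~1684 bits for the permutation index
saving = 256 - perm_bits / 8                # bytes saved vs. storing the block raw
print(round(saving, 1))  # ~45.5
```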


  24. #14
    Shelwien
    http://nishi.dreamhosters.com/u/cdm_test_5.rar

    - encoding speed optimization
    - compression re-tuned (better on compressible files, slightly worse on incompressible)
    - blockwise bwts/bwtsh/bwtsl utils added

    Code:
    cdm  = maxblk=256, tuned to zik.ogg.bwts7-0
    
            c.size  c.time d.time
    v3/cdm 2583162 19.219s 0.312s
    v4/cdm 2580964  6.661s 0.187s
    v5/cdm 2579802  1.685s 0.218s
    (for zik.ogg.bwts7-0, input size = 2,652,463 )
    
    !-----------------------!--------!--------!--------!--------!--------!--------!------!
    ! Filename              !  (1)   !  (2)   !  (3)   !  (4)   !  (5)   !  (6)   ! (7)  !
    !-----------------------!--------!--------!--------!--------!--------!--------!------!
    ! A10.jpg               ! 842468 ! 842114 ! 842364 ! 838363 ! 816109*! 834903 ! 3.09%!
    ! enwik6a.zip           ! 728786 ! 728956 ! 728093*! 728908 ! 728865 ! 728589 ! 0.10%!
    ! zik.ogg               !2652463 !2622858 !2617995 !2617549 !2603563 !2579806*! 1.64%!
    ! 1.aac                 !1552687 !1371836 !1534628 !1478552 !1266856*!1332214 ! 7.65%!
    ! 001.avi               !4291766 !4283009 !4283839 !4283248 !4282464*!4282486 ! 0.01%!
    ! skiing.avi            !1331204 !1172618 !1182452 !1176832 !1112306*!1161503 ! 5.14%!
    ! enwik6a_ma3.rar       ! 643238 ! 643413 ! 643015*! 643431 ! 643327 ! 643391 ! 0.03%!
    ! enwik6a_ma5.rar       ! 644419 ! 644594 ! 644192*! 644607 ! 644475 ! 644572 ! 0.04%!
    ! enwik6a_mct.rar       ! 509469*! 509638 ! 509611 ! 509632 ! 509616 ! 509617 ! 0.00%!
    ! enwik6a.cab           ! 618046 ! 618210 ! 617894*! 618217 ! 618202 ! 618206 ! 0.02%!
    !-----------------------!--------!--------!--------!--------!--------!--------!------!
    ! (1) original size                                                                  !
    ! (2) 7z -mx=9 compression                                                           !
    ! (3) direct test                                                                    !
    ! (4) bytewise bwts (5M block)                                                       !
    ! (5) bit7-0 bitwise bwts (5M block)                                                 !
    ! (6) bit0-7 bitwise bwts (5M block)                                                 !
    ! (7) gain: best result vs min of (1) and (2)                                        !
    !------------------------------------------------------------------------------------!


  26. #15
    Darek
    I've tested some audio and image files, and some other formats, with CDM v5.

    There were two approaches:

    1. I took 6 songs and 6 images, converted them into a few different formats, then compressed. Some insights:
    ------ AAC and MP3 got a 1.5-2.2% gain over 7zip compression (-mx9) - on average 1.9% and 1.7% respectively;
    ------ FLIF got the same compression ratio as 7zip (-mx9), which means -1.4% - flif is completely incompressible;
    ------ TIFF (internally LZW-compressed), JPG and PNG got the biggest gains over 7zip among the image formats - 2.5%, 3.2% and 4.3% respectively;
    ------ RAW is an exception because of its high compression ratio - about 54% - for this format CDM scores much worse than 7zip, but there is one strange case - the RAW5 image (lots of stars on a black sky) - there CDM compression is only 1.2% worse than 7zip, despite the almost 50% compression ratio of this file;

    2. I took two files, the 1st song and the 3rd image, converted them to multiple formats, then compressed. Some insights for these two files:
    ------ the raw WAV and AIFF formats are compressed quite well by CDM - only about 5-6% worse than 7zip. This is similar to the good wave compression CDM showed on my testbed - 6.7% worse than 7zip;
    ------ WMA is compressed about 9% better than 7zip - compression ratio 21% vs. 13% for 7zip;
    ------ other audio formats gain a little from CDM - on average 1% better than 7zip;
    ------ raw or raw-like formats (BMP, EMF, ICO, PPM, RAW, TGA) with huge compression ratios are compressed much worse by CDM than by 7zip;
    ------ the majority of formats show a small gain over 7zip - similar to the audio formats, about 0.9% - GIF, JLS, JP2, good-quality JPG, JPM, images in PDF, PGM, TIFF, WEBP...
    ------ low-quality JPG (probably due to its small size) is compressed quite well - 12% better than 7zip - compression ratio 63%;
    and more interesting cases:
    ------ a PBM file is also compressed 11.5% better than 7zip - compression ratio 93%;
    ------ a PCX file has a 5% gain over 7zip at a compression ratio of 57%;
    ------ Reymont.pdf - the text PDF file from the Silesia corpus - scores 12% better than 7zip at quite a high compression ratio of 83%;
    ------ these three examples mean that for some kinds of data or file formats CDM has a very competitive compression ratio.

    I've attached an Excel file with all the scores, and two JPGs with tables of the figures for both approaches.

    I have a question - what is the valid range of block sizes? The default is 5M, but what are the limits?

    Darek
    Attached: CDM_test2_a.jpg (710.3 KB), CDM_test2_b.jpg (573.0 KB); Excel file


  28. #16
    Shelwien
    Thanks, this is more interesting - in particular, I'd completely forgotten about .wma and .pcx. Also .flac.
    Can you check other lossless audio codecs too?
    However:
    1. You can try compressing redundant files with lzma2, then cdm. Note that lzma/plzma won't work - only lzma2 uses "stored" blocks for "incompressible" data.
    2. Alternatively, srep or lz4 can be used - in some cases LZ4 can be better, because lzma2 blocks are not very small (~64k).
    3. Better also check files with precomp - for example, raw image formats frequently contain large embedded jpegs,
    and we already have better (than cdm) solutions for deflate and jpeg recompression.
