Thread: ASH04a and LTCB

  1. #1 Shelwien (Administrator)
    http://compression.ru/sh/ash04.rar
    http://compression.ru/sh/ash04a.rar

    100,000,000 // enwik8
    19,963,105 // ash 04a /m700 /o10 /s5 -- Matt's result in LTCB
    22,122,484 // hook v1.0 1700

    21,020,430 // ash 04a /m2047 /o07 /s255 ( 967.8s 385M)
    20,269,557 // ash 04a /m2047 /o08 /s255 ( 973.2s 609M)
    19,870,243 // ash 04a /m2047 /o09 /s255 ( 980.9s 880M)
    19,626,447 // ash 04a /m2047 /o10 /s255 (1026.4s 1184M)
    19,459,384 // ash 04a /m2047 /o11 /s255 (1066.2s 1506M)
    19,345,240 // ash 04 /m2047 /o12 /s255
    19,342,258 // ash 04a /m2047 /o12 /s128
    19,342,204 // ash 04a /m2047 /o12 /s255 (1098.9s 1827M)

    As for enwik9, I guess I'll check it later - I don't even have it yet.

  2. #2 Shelwien (Administrator)
    Also, this:

    10,000,000 // "enwik7"
    2,127,986 // ash 04a /m2047 /o12 /s255 (86.6s 251M)
    2,113,985 // ash 04a /m2047 /o256 /s255 (103.2s 651M)
    2,113,979 // ash 04a /m2047 /o512 /s255 (103.8s 675M)

    So I guess we could expect at least
    19,342,204 * 2,113,985 / 2,127,986 = 19,214,942
    for enwik8 at order-256, if only I hadn't been lazy while designing that suffix tree.
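
    In Python, for clarity; the assumption (mine) is that the order-256 vs order-12 gain measured on the 10M prefix carries over to the full enwik8:

        enwik8_o12  = 19_342_204  # ash 04a /m2047 /o12  /s255, enwik8
        enwik7_o12  =  2_127_986  # ash 04a /m2047 /o12  /s255, "enwik7"
        enwik7_o256 =  2_113_985  # ash 04a /m2047 /o256 /s255, "enwik7"
        print(enwik8_o12 * enwik7_o256 // enwik7_o12)  # -> 19214942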

  3. #3 Shelwien (Administrator)
    Btw, here's a new compression challenge ;)
    Googlepages has a 10M filesize limit, so the question is how
    to compress this into 10M... while keeping it reliable :)
    http://shelwien.googlepages.com/mmah...0223.part1.rar
    http://shelwien.googlepages.com/mmah...0223.part2.rar

    That's a rip of the http://cs.fit.edu/~mmahoney/compression/ files...
    I just came across a page I hadn't seen before (no wonder, with Matt's weird
    cross-linking), so I decided to finally check what's in there ;)

  4. #4 donkey7 (Member)
    There are lots of compressed PDFs and PNGs, so maybe paq8pre will do the job.

  5. #5 schnaader (Programmer)
    Quote Originally Posted by donkey7
    There are lots of compressed PDFs and PNGs, so maybe paq8pre will do the job.
    Hm... paq8o8pre v2 is getting close, but isn't quite powerful enough.
    Here are my results:

    Original (stored in a UHARC archive): 27,195,278 bytes
    Precomp PCF: 49,506,081 bytes (428 PDF, 16 PNG, 33 GIF, 29 JPG)
    RAR, max. compression: 13,914,090 bytes
    CCM, option 2: 13,603,992 bytes
    PCF+RAR: 13,189,571 bytes
    PCF+CCM: 12,033,486 bytes
    lprepaq v1.3, level 5: 11,614,609 bytes
    paq8o8pre v2, level 4: 10,811,515 bytes

    Analysis of the data shows some HTML files that could perhaps
    be optimized with an HTML/XML filter; however, some quick
    tests with xmlwrt 3 didn't improve the results, and these files
    only account for ~350 KB of the compressed file. It's almost the
    same for some C++ files, although those could save some more space...
    There are also some Postscript files compressed in .Z archives.
    Decompressing them manually with WinRAR and recompressing
    them with lprepaq leads to an improvement of 226 KB.
    But the main problem is a big 4.5 MB MPEG movie... seems like it's
    time for some lossless video compression... any ideas?
    Compressing it by about 20% would be enough to get the
    archive down to 10 MB.

  6. #6 Shelwien (Administrator)
    > paq8o8pre v2, level 4: 10,811,515 bytes
    >
    > Analysis of the data shows some HTML files that could perhaps
    > be optimized with an HTML/XML filter; however, some quick
    > tests with xmlwrt 3 didn't improve the results,

    A real model should be much better for such data...

    > But the main problem is a big 4.5 MB MPEG movie... seems like it's
    > time for some lossless video compression... any ideas?

    There's a 743k audio track in it, so we could expect ~20% compression,
    but that would only amount to ~150k at most. And sadly, I was too lazy
    to implement MPEG-1 Layer 2 (and Layer 1) support in mp3zip.

    Also, there's no index that could be easily compressed, and
    the only thing that comes to mind is that there's some padding
    which could be removed by parsing the container format.

    And MPEG-1 video can be compressed, of course, but I guess nobody
    would go that far to support such an archaic format.

    Though it might be a good idea to exclude the mpg file from the main
    archive when compressing with paq8pre (and maybe try some other
    compressor for it); it might also be worthwhile to sort the files
    using rar (rar -m0 as a container) or freearc, and maybe to try
    seg_file/durilca.

    > Compressing it by about 20% would be enough to get the
    > archive to 10 MB.

    That's surely possible, but it would take too much work.
    But then, if not for that, it wouldn't be much of a challenge.

  7. #7 donkey7 (Member)
    Quote Originally Posted by schnaader
    Precomp PCF: 49,506,081 bytes (428 PDF, 16 PNG, 33 GIF, 29 JPG)
    Have the JPG files been processed by packJPG? If yes, disabling that would enable the paq8 JPEG engine.
    Quote Originally Posted by schnaader
    paq8o8pre v2, level 4: 10,811,515 bytes
    What about higher levels?

  8. #8 schnaader (Programmer)
    Quote Originally Posted by donkey7
    Have the JPG files been processed by packJPG? If yes, disabling that would enable the paq8 JPEG engine.
    This is already done, as packJPG recompression is switched
    off in paq8o8pre v2 by default and it was fed with the UHA file,
    not with the PCF file.

    Quote Originally Posted by donkey7
    What about higher levels?
    I couldn't test that because I only have 192 MB RAM, and I don't
    expect too much improvement from higher levels, although every
    bit helps... I'll run some more tests on a faster machine with
    more memory.

    Segmentation, file sorting (although some of it was done by
    UHARC) and HTML/XML/C++ models also sound interesting.

  9. #9 donkey7 (Member)
    Also, try to group the files by type and sort the groups by decreasing compression ratio (highly compressible files first).
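
    A minimal sketch of that heuristic in Python; the zlib probe on the first 64 KB is my own stand-in for a real compressibility estimate, and the function names are made up for illustration:

        import zlib
        from collections import defaultdict
        from pathlib import Path

        def compressibility(path, sample=1 << 16):
            # Rough estimate: zlib ratio on the first 64 KB
            # (lower ratio = more compressible).
            data = Path(path).read_bytes()[:sample]
            return len(zlib.compress(data, 6)) / len(data) if data else 1.0

        def order_files(paths):
            # Group by extension, then emit the groups in order of
            # increasing zlib ratio, i.e. most compressible group first.
            groups = defaultdict(list)
            for p in paths:
                groups[Path(p).suffix.lower()].append(p)
            ranked = sorted(groups.values(),
                            key=lambda g: sum(map(compressibility, g)) / len(g))
            return [p for g in ranked for p in sorted(g)]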

  10. #10 schnaader (Programmer)
    Further tests show that there is much room for improvement.

    paq8o8pre -7 compresses the UHARC archive to 10,732,036
    bytes, saving 79 KB.

    There are 20 PNG files, but only 16 of them can be recompressed
    by Precomp. Manually converting them to BMP and compressing with
    paq8o8pre v2 -4 leads to 1,721,932 bytes instead of
    2,023,292 bytes for the original files, gaining 301 KB.

    Moreover, there are 124 GIF files, of which only 33 (!)
    can be recompressed. The others are always just 2
    identical bytes short and are skipped by Precomp, so
    implementing partial GIF recompression would help a
    lot here. Again, converting them to BMP and compressing with
    paq8o8pre v2 -4 leads to 537,391 bytes instead of
    886,590 bytes, gaining 349 KB.
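
    A toy illustration of that partial-recompression idea in Python; this is my own sketch, not Precomp's internals, and try_partial_recompress is a made-up name. If re-encoding reproduces all but a tiny tail of the original stream, store the decoded data plus a small patch instead of skipping the file:

        def try_partial_recompress(original: bytes, reencoded: bytes, max_patch=16):
            # Length of the common prefix of the two streams.
            n = 0
            limit = min(len(original), len(reencoded))
            while n < limit and original[n] == reencoded[n]:
                n += 1
            tail = original[n:]  # bytes the re-encoder failed to reproduce
            if len(tail) > max_patch or len(reencoded) - n > max_patch:
                return None      # mismatch too large, fall back to skipping
            # On extraction: re-encode, truncate to n bytes, append the tail.
            return {"prefix_len": n, "patch": tail}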

    Together with .Z support (LZC shouldn't be that hard
    to implement, but are there any patent issues left,
    or is it just the same as for GIF?), which saves 226 KB,
    the total savings should be 955 KB, decreasing the total
    size to around 9,856 KB.
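
    The tally, taking the paq8o8pre v2 -4 result as the baseline and reading KB as 1000 bytes (my assumption; it's what makes the figures line up):

        baseline = 10_811_515                        # paq8o8pre v2 -4, bytes
        saved_kb = 79 + 301 + 349 + 226              # -7 switch, PNG, GIF, .Z
        print(saved_kb)                              # -> 955
        print((baseline - saved_kb * 1000) // 1000)  # -> 9856, i.e. ~9,856 KB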

    With further optimizations like the suggested reordering
    and HTML/XML/C++ filters (or with video recompression),
    I think even lprepaq could compress the files to under
    10 MB, speeding the process up a lot (the new tests were done
    on an Athlon X2 4000+ dual-core machine with 1.8 GB RAM;
    paq8o8pre v2 -7 took almost 2 hours, while
    lprepaq 8 {11,585,279 bytes} finished after 2 minutes).
    Edit: paq9a -8 is also a nice candidate, being 15 seconds
    faster than lprepaq 8 and compressing to 11,598,452 bytes.

  11. #11 Member
    Quote Originally Posted by schnaader
    .Z support (LZC shouldn't be that hard
    to implement, but are there any patent issues left,
    or is it just the same as for GIF?)
    I think .Z files are produced by compress, which uses LZW; that's been patent-free for a few years now.
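
    For reference, the core of LZW fits in two short functions. This is only a minimal Python sketch of the scheme; the real compress/LZC coder adds variable-width codes (9 to 16 bits) and a dictionary reset, which are ignored here:

        def lzw_compress(data: bytes):
            # Emit a dictionary index for each longest known prefix.
            dictionary = {bytes([i]): i for i in range(256)}
            w, out = b"", []
            for b in data:
                wc = w + bytes([b])
                if wc in dictionary:
                    w = wc
                else:
                    out.append(dictionary[w])
                    dictionary[wc] = len(dictionary)  # register new phrase
                    w = bytes([b])
            if w:
                out.append(dictionary[w])
            return out

        def lzw_decompress(codes):
            # Rebuild the phrase dictionary while decoding.
            if not codes:
                return b""
            dictionary = {i: bytes([i]) for i in range(256)}
            w = dictionary[codes[0]]
            out = [w]
            for code in codes[1:]:
                # Only the code being defined right now can be unknown.
                entry = dictionary.get(code, w + w[:1])
                out.append(entry)
                dictionary[len(dictionary)] = w + entry[:1]
                w = entry
            return b"".join(out)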
