
Thread: I found an interesting test set...

  1. #1
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts

    I found an interesting test set...

    I found an interesting test set.

    OpenOffice documents are zips containing several files, mostly XMLs. I wanted to know whether zip is a good tool for the task, and the results turned out to be quite interesting.

    Data: a spreadsheet with archiver test results and some charts. Original size - 3 185 340 B; zipped by OO - 210 481 B. Very redundant...
    The test set is small, so timing is not very accurate, especially with fast compressors.
    I can't measure how fast OO compresses the file; I could use a stopwatch, but it would be very inaccurate. Also, generating the XMLs adds to the saving time, and I have no idea by how much. The closest thing I could do was to create a zip with 7z.
    Code:
    Archiver        Size [B] Time [s]
    7z -tzip -mx=1  226850  0.187
    7z -tzip -mx=3  226850  0.203
    7z -tzip -mx=5  216749  0.640
    7z -tzip -mx=7  208253  1.203
    7z -tzip -mx=9  195038  5.437
    OO seems to use something close to 7z -mx=7, but a bit weaker. It probably takes 0.8-1 s.
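    For reference, a test like this is easy to script. A rough sketch in Python (the doc/ directory standing for an unpacked document is hypothetical, and zlib's 1-9 levels don't map exactly onto 7z's -mx scale):
    Code:
    import time, zipfile, pathlib

    # Re-zip the files of an unpacked document at several deflate levels and
    # time each run. "doc/" is a hypothetical path to the extracted .ods;
    # the compresslevel argument requires Python 3.7+.
    src = [p for p in pathlib.Path("doc").rglob("*") if p.is_file()]
    for level in (1, 6, 9):
        t0 = time.perf_counter()
        with zipfile.ZipFile("out.zip", "w", zipfile.ZIP_DEFLATED,
                             compresslevel=level) as z:
            for p in src:
                z.write(p, p.relative_to("doc"))
        size = pathlib.Path("out.zip").stat().st_size
        print(level, size, round(time.perf_counter() - t0, 3))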
    How good can zip be? I tried kzip, which claims to generate zips 1-3% smaller than PKZIP's.
    Code:
    Archiver        Size [B] Time [s]
    kzip /s0 /b0    189783  42.516
    kzip /s0 /b128  194807  53.640
    kzip /s0 /b256  189461  57.406
    kzip /s0 /b512  187122  50.406
    kzip /s0 /b1024 187816  46.641
    kzip /s1 /b0    190463  33.406
    kzip /s1 /b128  195240  57.562
    kzip /s1 /b256  189862  43.656
    kzip /s1 /b512  187444  42.593
    kzip /s1 /b1024 188183  38.032
    kzip /s2 /b0    319485  0.688
    kzip /s2 /b128  325280  1.937
    kzip /s2 /b256  319849  1.593
    kzip /s2 /b512  317122  1.281
    kzip /s2 /b1024 317677  1.125
    kzip /s3 /b0    1787918 0.578
    kzip /s3 /b128  1763812 1.812
    kzip /s3 /b256  1771326 1.468
    kzip /s3 /b512  1779670 1.156
    kzip /s3 /b1024 1782342 1.015
    Very slow, but the size got down to 187 122 B: 11% smaller than OO's zip and 4% smaller than 7-Zip's best. Very small.
    Now the other compressors... The best results:
    Code:
    Archiver            Size [B] Time [s]
    FastLZ opt -2       370534  0.015
    FastLZ -2           365344  0.016
    quick -0            311703  0.031
    slug                198995  0.046
    NanoZip -cd         186876  0.093
    4x4 1t              123047  0.171
    4x4 2t              122051  0.234
    4x4 4t              121034  0.296
    FreeArc -m4 -ms     117940  0.875
    FreeArc -m5 -ms     116319  1.421
    FreeArc -m5         115312  1.437
    FreeArc -m7         115305  1.453
    CCM 0               108787  1.578
    CCM 1               108579  1.609
    CCM 2               108481  1.735
    CCM 3               108433  1.860
    CCMX 0              106744  2.031
    CCMX 1              106225  2.094
    CCMX 2              105849  2.250
    CCMX 3              105634  2.437
    FreeArc -max -ms    95661   2.515
    FreeArc -max        94654   2.531
    FreeArc -max -ma-   86912   3.734
    NanoZip -cc         84755   16.297
    PAQ8p -1            83124   66.625
    PAQ8p -2            81566   67.578
    PAQ8p -3            81040   68.375
    PAQ8p -4            51999   499.015
    PAQ8p -5            50780   501.672
    PAQ8p -6            50046   513.891
    PAQ8p -7            49870   538.828
    FastLZ needs just 0.015 s; that's over 200 MB/s, so I/O is definitely cached by the OS.
    Slug makes it smaller than OO while being 15 times faster.
    4x4 1t almost halves the OO result and is 5 times faster (!!!).
    Then there's nothing really interesting until PAQ8p -3/-4... I tested several times; there were no memory issues, -4 really takes that long. And it decompresses its output correctly. I tried to investigate: it seems to get consistent gains on almost all files and is always equally slow.
    But there's another thing. Let's calculate efficiency as maximumcompression.com does:
    Code:
    Archiver            Size [B]  Time [s] Efficiency (maximumcompression.com; lower is better)
    FastLZ opt -2       370534    0.015   340654353845826000.0
    FastLZ -2           365344    0.016   176627773533093000.0
    quick -0            311703    0.031   197866418022820.0
    slug                198995    0.046   46172317.6
    NanoZip -cd         186876    0.093   17320813.6
    4x4 1t              123047    0.171   4468.6
    4x4 2t              122051    0.234   5324.4
    4x4 4t              121034    0.296   5847.4
    FreeArc -m4 -ms     117940    0.875   11243.8
    FreeArc -m5 -ms     116319    1.421   14576.4
    FreeArc -m5         115312    1.437   12815.3
    FreeArc -m7         115305    1.453   12945.4
    CCM 0               108787    1.578   5682.1
    CCM 1               108579    1.609   5628.6
    CCM 2               108481    1.735   5987.3
    CCM 3               108433    1.860   6376.0
    CCMX 0              106744    2.031   5505.4
    CCMX 1              106225    2.094   5281.2
    CCMX 2              105849    2.250   5385.7
    CCMX 3              105634    2.437   5661.5
    FreeArc -max -ms    95661     2.515   1460.9
    FreeArc -max        94654     2.531   1278.2
    FreeArc -max -ma-   86912     3.734   642.9
    NanoZip -cc         84755    16.297   2079.1
    PAQ8p -1            83124    66.625   6775.6
    PAQ8p -2            81566    67.578   5534.4
    PAQ8p -3            81040    68.375   5204.9
    PAQ8p -4            51999   499.015   670.9
    PAQ8p -5            50780   501.672   569.3
    PAQ8p -6            50046   513.891   526.6
    PAQ8p -7            49870   538.828   538.8
    Ladies and gentlemen, welcome the new efficiency king: PAQ8p. Who cares that it gets 6 KB/s, what a great size! I wonder why the OO team didn't choose to use it; maybe we should suggest it to them?
    I know that some uses might be more sensitive to file size and less sensitive to speed than office documents are, but that's just ridiculous.
    I've been thinking about a different measure of efficiency for some time, and now it's time to show my take on the topic.
    1. Copying is usually a very viable method of archiving, much more so than PAQ. And IMO this is what archivers should be compared against.
    2. Extreme slowness = 0 usefulness = 0 score.
    3. Using the minimal size is wrong. If I were looking for something under 0.1 s on this test, I couldn't care less about PAQ scores. Yet the fact that I tested it changed the ranking: remove things slower than 0.15 s and slug wins; include them and NanoZip is better. You always have to recalculate everything for your own time boundaries.
    My proposal is: 10^(1/10) / ((size/original_size) * log2(size/original_size + 1) * (10^(1/10))^(time/time_of_copying)) (a bit unreadable, but I won't learn TeX to show it better).
    The higher the score, the better. XCOPY gets 1; I call compressors that score at least as much "practical".
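    In Python, the same formula reads as below (a direct transcription; the copying time is not stated in the thread - back-solving from the table below suggests roughly 0.14 s, so treat that value as an assumption):
    Code:
    from math import log

    # The proposed score, transcribed from the spreadsheet formula above.
    # copy_time is an assumption: ~0.14 s, back-solved from the results table.
    def proposed_efficiency(size, time, original_size=3185340, copy_time=0.14):
        base = 10 ** (1 / 10)             # one "notch" of the score
        ratio = size / original_size      # compressed fraction of the input
        return base / (ratio * log(ratio + 1, 2) * base ** (time / copy_time))

    # Copying keeps the size and takes copy_time, so XCOPY scores exactly 1:
    print(proposed_efficiency(3185340, 0.14))            # -> 1.0
    print(round(proposed_efficiency(123047, 0.171), 1))  # 4x4 1t -> ~448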
    Results:
    Code:
    Archiver            Size [B] Time [s] Efficiency (proposed; higher is better)  Efficiency (maximumcompression.com; lower is better)
    FastLZ opt -2       370534  0.015    66.52                    340654353845826000.0
    FastLZ -2           365344  0.016    68.26                    176627773533093000.0
    quick -0            311703  0.031    90.80                    197866418022820.0
    slug                198995  0.046   213.82                    46172317.6
    NanoZip -cd         186876  0.093   224.14                    17320813.6
    4x4 1t              123047  0.171   450.79                    4468.6
    4x4 2t              122051  0.234   413.32                    5324.4
    4x4 4t              121034  0.296   379.77                    5847.4
    FreeArc -m4 -ms     117940  0.875   155.30                    11243.8
    FreeArc -m5 -ms     116319  1.421    65.44                    14576.4
    FreeArc -m5         115312  1.437    64.86                    12815.3
    FreeArc -m7         115305  1.453    63.20                    12945.4
    CCM 0               108787  1.578    57.83                    5682.1
    CCM 1               108579  1.609    55.18                    5628.6
    CCM 2               108481  1.735    45.00                    5987.3
    CCM 3               108433  1.860    36.72                    6376.0
    CCMX 0              106744  2.031    28.66                    5505.4
    CCMX 1              106225  2.094    26.10                    5281.2
    CCMX 2              105849  2.250    20.38                    5385.7
    CCMX 3              105634  2.437    15.08                    5661.5
    FreeArc -max -ms    95661   2.515    16.16                    1460.9
    FreeArc -max        94654   2.531    16.08                    1278.2
    FreeArc -max -ma-   86912   3.734     2.67                    642.9
    NanoZip -cc         84755   16.297    0.00                    2079.1
    PAQ8p -1            83124   66.625    0.00                    6775.6
    PAQ8p -2            81566   67.578    0.00                    5534.4
    PAQ8p -3            81040   68.375    0.00                    5204.9
    PAQ8p -4            51999   499.015   0.00                    670.9
    PAQ8p -5            50780   501.672   0.00                    569.3
    PAQ8p -6            50046   513.891   0.00                    526.6
    PAQ8p -7            49870   538.828   0.00                    538.8
    There's one more interesting thing.
    Code:
    Archiver    Size [B] Time [s] Efficiency (proposed)
    PAQ9a 1     98727   3.140   5.47
    PAQ9a 2     97583   3.062   6.36
    PAQ9a 3     97310   3.094   6.07
    PAQ9a 4     97137   3.187   5.23
    PAQ9a 5     96795   3.359   6.36
    PAQ9a 6     97112   3.625   6.07
    PAQ9a 7     97527   4.063   5.23
    PAQ9a 8     98465   4.953   3.98
    PAQ9a is the first and only PAQ that's practical. Congratulations; no other (L)PAQ tested even came close.
    That's because of LZP greatly reducing the input size for the CM stage, right?
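    (For anyone unfamiliar with LZP, the core idea fits in a few lines; the sketch below is an illustration of the technique, not PAQ9a's actual coder. The last few bytes of context predict the next byte from the last position where the same context occurred; hits become cheap flag bits for the CM stage, misses fall through as literals.)
    Code:
    # Minimal LZP-style split of a byte string into prediction flags and
    # literals. Illustrative only - not PAQ9a's actual transform.
    def lzp_flags(data, order=3):
        table = {}                  # context -> last position it preceded
        flags, literals = [], []
        for i, b in enumerate(data):
            ctx = data[max(0, i - order):i]
            pos = table.get(ctx)
            if pos is not None and data[pos] == b:
                flags.append(1)     # predicted correctly: one flag bit
            else:
                flags.append(0)
                literals.append(b)  # miss: the literal byte is still needed
            table[ctx] = i
        return flags, literals

    flags, lits = lzp_flags(b"abcabcabcabc")
    print(sum(flags), len(lits))    # 6 hits, 6 literals on this toy input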

    P.S.:
    I write"Efficiency(maximumcompression.com)" because maximumcompression.com is the most popular site that uses this function, I don't know and don't care who's the founder.
    EDIT:
    I forgot to attach the results.
    Attached Files
    Last edited by m^2; 31st March 2011 at 09:30.

  2. #2
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by m^2 View Post
    I've been thinking about a different measure of efficiency for some time, and now it's time to show my take on the topic.
    1. Copying is usually a very viable method of archiving, much more so than PAQ. And IMO this is what archivers should be compared against.
    2. Extreme slowness = 0 usefulness = 0 score.
    3. Using the minimal size is wrong. If I were looking for something under 0.1 s on this test, I couldn't care less about PAQ scores. Yet the fact that I tested it changed the ranking: remove things slower than 0.15 s and slug wins; include them and NanoZip is better. You always have to recalculate everything for your own time boundaries.
    My proposal is: 10^(1/10) / ((size/original_size) * log2(size/original_size + 1) * (10^(1/10))^(time/time_of_copying)) (a bit unreadable, but I won't learn TeX to show it better).
    The higher the score, the better. XCOPY gets 1; I call compressors that score at least as much "practical".
    Great to know I'm not the only one who doesn't like this efficiency measurement. Like you, I was thinking of somehow comparing against copying, but gave up trying to find a solution. I'll look into your formula more deeply when I can get my hands on this stuff. Thanks for your research!

    Quote Originally Posted by m^2 View Post
    P.S.:
    I write"Efficiency(maximumcompression.com)" because maximumcompression.com is the most popular site that uses this function, I don't know and don't care who's the founder.
    Just in case someone else cares --
    Quote Originally Posted by maximumcompression.com
    Scoring system: The program yielding the lowest compressed size is considered the best program. The most efficient (read: useful) program is calculated by multiplying the compression time (in seconds) it took to produce the archive with the power of the archive size divided by the lowest measured archive size. The lower the score the better. The basic idea is that compressor X has the same efficiency as compressor Y if X can compress twice as fast as Y and the resulting archive size of X is 10% larger than the size of Y. (Special thanks to Uwe Herklotz for getting this formula right)
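    Taken literally, that description can be sketched as follows (a reconstruction from the quote: the exponent comes from the stated "twice as fast for 10% larger" trade-off, i.e. 1.1^alpha = 2; the site's exact constant may differ, and it does not reproduce the figures in post #1 exactly):
    Code:
    from math import log

    # maximumcompression.com-style score per the quoted description:
    # lower is better; a 10% larger archive is offset by a 2x speedup.
    def mc_score(size, time, lowest_size):
        alpha = log(2) / log(1.1)   # ~7.27
        return time * (size / lowest_size) ** alpha

    # NanoZip -cc against the smallest result in post #1 (PAQ8p -7, 49870 B):
    print(round(mc_score(84755, 16.297, 49870), 1))  # ~771 vs. 2079.1 in the table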
    Last edited by nanoflooder; 15th October 2008 at 20:11.

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by nanoflooder View Post
    Great to know I'm not the only one who doesn't like this efficiency measurement. Like you, I was thinking of somehow comparing against copying, but gave up trying to find a solution. I'll look into your formula more deeply when I can get my hands on this stuff. Thanks for your research!
    Here you can see how it works with more archivers and different data.
    I'm preparing another data set, much less compressible.

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Updated.
    New tests:
    Tornado, m1 0.02, qpress 0.22, NanoZip 0.05.
    Updated "the best" list:
    Code:
    Archiver            Size [B] Time [s]
    Tornado 2 c2        321598  0.015
    Tornado 3 c1        298121  0.016
    Tornado 3 c4        217796  0.031
    slug                198995  0.046
    NanoZip cd          190065  0.078
    Tornado 6           189871  0.093
    Tornado 7           184962  0.156
    4x4 1t              123047  0.171
    4x4 2t              122051  0.234
    4x4 4t              121034  0.296
    FreeArc -m4 -ms     117940  0.875
    NanoZip co          116468  0.906
    FreeArc -m5 -ms     116319  1.421
    FreeArc -m5         115312  1.437
    FreeArc -m7         115305  1.453
    CCM 0               108787  1.578
    CCM 1               108579  1.609
    CCM 2               108481  1.735
    CCM 3               108433  1.860
    CCMX 0              106744  2.031
    CCMX 1              106225  2.094
    CCMX 2              105849  2.250
    CCMX 3              105634  2.437
    FreeArc -max -ms    95661   2.515
    FreeArc -max        94654   2.531
    FreeArc -max -ma-   86912   3.734
    NanoZip cc          84756   16.250
    PAQ8p -1            83124   66.625
    PAQ8p -2            81566   67.578
    PAQ8p -3            81040   68.375
    PAQ8p -4            51999   499.015
    PAQ8p -5            50780   501.672
    PAQ8p -6            50046   513.891
    PAQ8p -7            49870   538.828
    NanoZip -co and three modes of Tornado are new entrants here; QuickLZ and FastLZ are out.
    QPress is nowhere near Quick.exe.
    Attached Files
    Last edited by m^2; 23rd October 2008 at 21:57.

  5. #5
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Update:
    PPMX 0.2, 0.3
    7z 4.62 (enters "the best" list)
    rings
    Tornado 0.5a (the 1st version)
    PPMd
    PPMVC
    BIT 0.7

    The best:
    Code:
    Archiver            Size [B] Time [s]
    Tornado 2 c2        321598  0.015
    Tornado 3 c1        298121  0.016
    Tornado 3 c4        217796  0.031
    slug                198995  0.046
    NanoZip cd          190065  0.078
    Tornado 6           189871  0.093
    Tornado 7           184962  0.156
    4x4 1t              123047  0.171
    4x4 2t              122051  0.234
    4x4 4t              121034  0.296
    7z -m0=PPMD -mx=9   117746  0.859
    NanoZip co          116468  0.906
    FreeArc -m5 -ms     116319  1.421
    FreeArc -m5         115312  1.437
    FreeArc -m7         115305  1.453
    CCM 0               108787  1.578
    CCM 1               108579  1.609
    CCM 2               108481  1.735
    CCM 3               108433  1.860
    CCMX 0              106744  2.031
    CCMX 1              106225  2.094
    CCMX 2              105849  2.250
    CCMX 3              105634  2.437
    FreeArc -max -ms    95661   2.515
    FreeArc -max        94654   2.531
    FreeArc -max -ma-   86912   3.734
    NanoZip cc          84756   16.250
    PAQ8p -1            83124   66.625
    PAQ8p -2            81566   67.578
    PAQ8p -3            81040   68.375
    PAQ8p -4            51999   499.015
    PAQ8p -5            50780   501.672
    PAQ8p -6            50046   513.891
    PAQ8p -7            49870   538.828
    EDIT: As usual, I forgot to attach the results.
    Attached Files
    Last edited by m^2; 3rd January 2009 at 14:11.

  6. #6
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for the update! Do you plan to build a site for your tests? It would be useful to find them all on a single site with sortable fields (like BlackFox's and MetaCompressor).
    BIT Archiver homepage: www.osmanturan.com

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    it's just meaningless to make all those speed/efficiency (and even size) tests on a file that is 1) so small, 2) so redundant

    about 2 - typical files are less compressible, so what you see on this XML data may not be true for other data types

    i can suggest trying something like the OO sources or installation and then drawing conclusions

    MC rating has one disadvantage - it's based on the best compressor so far. it was done in order to make fast and slow compressors "more equal", but it doesn't work 100% reliably

    once i made another rating, based on absolute values (i.e. a 10% larger archive = 2x/3x/4x faster compression). results are:

    http://www.haskell.org/bz/2.txt
    http://www.haskell.org/bz/3.txt
    http://www.haskell.org/bz/4.txt

    you can see that 2.txt allows selecting the best fast compressors (something for those who prefer freearc's -m2 mode), 4.txt the stronger ones, and 3.txt is somewhat balanced and closest to MC's own rating (although i've added decompression times too)
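    The trade-off above can be realized in code like this (a sketch of the idea, not the actual script behind those .txt files; k is how many times faster a compressor must be to justify a 10% larger archive):
    Code:
    from math import log

    # Absolute rating: lower is better; a 10% larger archive is offset by a
    # k-times speedup (k = 2, 3, 4 matching 2.txt / 3.txt / 4.txt).
    def absolute_rating(size, time, k, original_size=3185340):
        alpha = log(k) / log(1.1)
        return time * (size / original_size) ** alpha

    # 4x4 1t vs PAQ8p -4 from post #1: the fast mode wins at k=2,
    # the strong one at k=3 and k=4.
    for k in (2, 3, 4):
        fast = absolute_rating(123047, 0.171, k)
        strong = absolute_rating(51999, 499.015, k)
        print(k, "4x4 1t" if fast < strong else "PAQ8p -4")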
    Last edited by Bulat Ziganshin; 3rd January 2009 at 14:33.

  8. #8
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by osmanturan View Post
    Thanks for the update! Do you plan to build a site for your tests? It would be useful to find them all on a single site with sortable fields (like BlackFox's and MetaCompressor).
    Nope. Web design is one of the most boring things in the world, IMO.
    That's why I don't have a personal website and probably never will.

    I agree that it would be good to have them all together, and if I had access to a server, I would either put them on FTP or make the simplest of websites.
    But I don't, and I'm sick and tired of free web hosts that delete your account for no reason or inject ads.

    Quote Originally Posted by Bulat Ziganshin View Post
    it's just meaningless to make all those speed/efficiency (and even size) tests on a file that is 1) so small, 2) so redundant
    OpenOffice compresses files of this size every time I save a new version, and the compression takes enough time to be significant. If my guess (0.8-1 s(*)) is correct, that's ~1/4 of the saving time. The test showed that it's possible to reduce that to just above zero without making the files bigger (which matters psychologically more than anything else). Worthwhile, IMO.

    Obviously both compression ratio and speed matter more for big, poorly compressible files, but they are not meaningless here either.

    Timings of the fast compressors surely have to be taken with a large grain of salt; the test set is indeed too small. It can't distinguish between, e.g., LZOP 3 and 4, but it does show things like "FreeArc 3 is just over twice as slow as 1", and that is something.

    Quote Originally Posted by Bulat Ziganshin View Post
    about 2 - typical files are less compressible, so what you see on this XML data may not be true for other data types
    Sure. That's why I try different test sets. I want them to represent some real-world workloads, and I think this one does that well.

    Quote Originally Posted by Bulat Ziganshin View Post
    i can suggest trying something like the OO sources or installation and then drawing conclusions
    That would be totally different data, and the fact that the results would likely be different too doesn't negate the value of the ODS test.
    BTW, I'm preparing a large test with C++ sources: Chromium. And I already have an installation test set: TC UP.
    ADDED: But suggestions are welcome.

    Quote Originally Posted by Bulat Ziganshin View Post
    MC rating has one disadvantage - it's based on the best compressor so far. it was done in order to make fast and slow compressors "more equal", but it doesn't work 100% reliably

    once i made another rating, based on absolute values (i.e. a 10% larger archive = 2x/3x/4x faster compression). results are:

    http://www.haskell.org/bz/2.txt
    http://www.haskell.org/bz/3.txt
    http://www.haskell.org/bz/4.txt

    you can see that 2.txt allows selecting the best fast compressors (something for those who prefer freearc's -m2 mode), 4.txt the stronger ones, and 3.txt is somewhat balanced and closest to MC's own rating (although i've added decompression times too)
    I like it.
    My measure is not really successful. I assumed that "10 times slower than copying" should be the maximum acceptable compression time. But since then I've not only met people who like much slower compressors; XCOPY time also depends hugely on OS caching and is sometimes too fast.
    I'll try your approach on my data too.

    Regarding (*):
    I should test some zlib-based archiver to know better, but it was quicker this way.
    I want to do that eventually, but more interesting things come first.
    Last edited by m^2; 3rd January 2009 at 18:34.

