
Thread: A right way to select images for a web benchmark

  1. #1
    Member m^2
    Join Date: Sep 2008
    Location: Ślůnsk, PL
    Posts: 1,612
    Thanks: 30
    Thanked 65 Times in 47 Posts

    A right way to select images for a web benchmark

    I've been asked to standardise the dataset that I used in my recent benchmark. I refused because I think it's substandard, mainly due to numerous series of similar images.
    Also, there are things I did that I'm not sure were right, and I'd like to ask what others think about them.
    During selection, I assigned equal probability to all images. Was that right? I don't know, but there were vast differences between the numbers of images from different sites. Quite a few didn't host any PNGs, while some others (especially Chinese ones) had tons of them. Wouldn't it be better to apply some weights? Shouldn't front-page images get higher weight than ones below? Or maybe some weighting based on position in the ranking? And how should duplicates be treated? Shouldn't they be used to adjust the weight? For example, the Google logo got selected, but I think its chance of selection was way lower than its importance warrants.
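    The duplicate-based weighting idea could be sketched roughly like this: count how often each image (identified by a content hash) appears across the crawled sites and sample in proportion to that count. This is a minimal illustration, not the benchmark's actual procedure; the function name and the short strings standing in for hashes are made up.

    ```python
    import random
    from collections import Counter

    def weighted_sample(image_hashes, k):
        """Pick k distinct images, weighting each by how many sites it
        appeared on (duplicates raise the selection probability)."""
        weights = Counter(image_hashes)      # hash -> occurrence count
        unique = list(weights)
        picked = set()
        # random.choices samples with replacement, so loop until we
        # have k distinct picks
        while len(picked) < min(k, len(unique)):
            choice = random.choices(unique, [weights[h] for h in unique])[0]
            picked.add(choice)
        return sorted(picked)

    # "logo" was found on three sites, so it is three times as likely
    # to be selected as any image seen only once
    pool = ["logo", "logo", "logo", "photo_a", "photo_b", "chart"]
    print(weighted_sample(pool, 2))
    ```

    Under this scheme the Google logo's popularity would raise its weight instead of leaving it with the same probability as a one-off image.
    
    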

  2. #2
    Expert Matt Mahoney
    Join Date: May 2008
    Location: Melbourne, Florida, USA
    Posts: 3,255
    Thanks: 306
    Thanked 778 Times in 485 Posts
    I think that a benchmark should be no larger than necessary to establish which compressors are better, perhaps 10-20 images. If two files give the same compression ratios for a bunch of different compressors, then you don't need both. Pick a wide range: big and small, high and low resolution, many and few colors, photos and drawings, faces, text, logos. Also, I would not exclude JPEG and GIF (converted to PNG).
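    The "same ratios for a bunch of compressors → you don't need both" rule could be automated with a sketch like the one below. Everything here is illustrative: the tolerance, the function name, and the ratio numbers are assumptions, not measured data.

    ```python
    def prune_redundant(files, ratios, tol=0.005):
        """Keep only files whose per-compressor ratio vectors differ
        from every already-kept file by more than tol somewhere.
        `ratios` maps filename -> tuple of ratios, one per compressor."""
        kept = []
        for f in files:
            if all(max(abs(a - b) for a, b in zip(ratios[f], ratios[g])) > tol
                   for g in kept):
                kept.append(f)
            # else: f behaves like some kept file under every compressor,
            # so it adds no discriminating power to the benchmark
        return kept

    # made-up ratio vectors for three compressors
    ratios = {
        "a.png": (0.500, 0.480, 0.520),
        "b.png": (0.501, 0.479, 0.521),   # behaves just like a.png
        "c.png": (0.300, 0.330, 0.290),
    }
    print(prune_redundant(["a.png", "b.png", "c.png"], ratios))
    # -> ['a.png', 'c.png']
    ```
    
    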

    There are probably a lot of PNGs that used to be JPEGs. Include them. The obvious compression method is to guess the quantization and convert back to JPEG, but so far nobody has done this. Maybe with the right benchmark, someone will.

    I think these things are more important than download frequency. For something like the Google logo, they will spend a lot of time making the image as small as possible by reducing colors, resolution, and size, and by using the best PNG optimizers. OTOH I see a lot of cases of obvious bloat, like huge JPEGs at max quality displayed artificially small using <img width=... height=...> tags.

  3. #3
    Member m^2
    I got the same comment from another person too.

    I know about cherry-picking test files; I use this method too, but I have doubts whether it's the best thing that can be done. After all, such a selection is inherently biased, and the rules that make us sort files into different categories may not be well suited to the choices that compressors make. In other words, we may miss some properties that matter for compression. Or, more importantly, ones that don't matter ATM but will with future codec designs. I believe that by letting statistics do the job there is a very small chance of missing something important: if it's important, it's quite popular, so it's unlikely to be missed. With most types of data this is infeasible, because files are too large and having many of them would reduce testability, and/or it's hard to find a large, representative sample. Well, actually, making a representative sample here is not easy either, but I believe the bias is less important than with, e.g., films.

    As to ex-JPEGs, I am running Alex's ex-JPEG detector. I suspect it produces a ton of false positives on very tiny files, but I'll see how it goes. A nice thing about random selection is that if ex-JPEGs are important, they are likely to be represented.


    BTW, I think it would be nice to run an experiment:
    1. Get a large random selection of images
    2. Cherry pick a subset
    3. Run benchmarks on both sets
    4. Download a set of websites that doesn't contain the test images
    5. Run benchmarks on each site
    6. Calculate correlation between reductions achieved on actual sites and those on test data, for each of the test datasets.

    I have no idea what the correlation would be, but my intuition tells me that the first dataset would have it higher.
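    Step 6 of the experiment is just a Pearson correlation between the reductions a benchmark predicts and the reductions measured on real sites. A minimal sketch, where all the percentage figures are placeholder numbers invented for illustration, not measurements:

    ```python
    import statistics

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length sequences."""
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # size reduction (%) measured on each of five real sites (made up) ...
    site_reduction   = [12.0, 8.5, 15.2, 6.1, 10.4]
    # ... vs. the reduction each test dataset predicted for those runs
    random_set_pred  = [11.5, 9.0, 14.8, 6.5, 10.1]
    cherry_pick_pred = [13.0, 7.0, 12.0, 9.0, 11.0]

    print(pearson(site_reduction, random_set_pred))
    print(pearson(site_reduction, cherry_pick_pred))
    ```

    Whichever dataset yields the higher correlation is the better predictor of real-site behaviour, which is exactly the question the experiment asks.
    
    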

    ADDED: The detector is useless here. It reports half of the larger images as ex-JPEGs, which is clearly very wrong.
    Last edited by m^2; 29th November 2011 at 22:13.

  4. #4
    Member
    Join Date: Jun 2009
    Location: Kraków, Poland
    Posts: 1,471
    Thanks: 26
    Thanked 120 Times in 94 Posts
    What about making a database and letting users input any query, then visualising the results? I think you could easily make a standalone desktop application in Python or something Java-based (e.g. Scala or obviously Java), together with a standalone SQL engine (e.g. H2 Database or something Pythonic).

  5. #5
    Member m^2
    I don't get what you mean.

    ADDED:
    But BTW, you put the two languages I've been thinking about the most in recent months (Python and Scala) into a single sentence.
    Last edited by m^2; 29th November 2011 at 22:21.

  6. #6
    Expert Matt Mahoney
    Well, there are lots of ways you could interpret the results based on how you weight the various test files. It gets even worse when you consider speed, memory, and other criteria. Compressionratings.com ( http://compressionratings.com/ ) allows you to set the criteria however you want, whether you care more about size or speed, with lots of different types of data. But I think it only confuses the issue. It doesn't give any clear answers. The other extreme is a contest like the Calgary challenge ( http://www.mailcom.com/challenge/ ), but maybe this is not what you want either.
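    Letting the user weight size against speed usually comes down to a weighted figure of merit computed against some baseline compressor. The sketch below is an arbitrary illustration of that idea, not compressionratings.com's actual formula; the weights and numbers are invented.

    ```python
    def score(result, baseline, w_size=0.5, w_ctime=0.3, w_dtime=0.2):
        """result and baseline are (compressed_size, comp_time, decomp_time).
        Each component is baseline/result, so values above 1.0 mean better
        than the baseline on that axis; the weights (summing to 1) express
        how much the user cares about size vs. speed."""
        return sum(w * (b / r) for w, r, b in
                   zip((w_size, w_ctime, w_dtime), result, baseline))

    baseline = (100, 10, 5)            # reference compressor
    candidate = (80, 12, 5)            # smaller output, slower compression
    print(score(candidate, baseline))  # > 1.0: wins under these weights
    ```

    Shifting the weights toward `w_ctime` makes the same candidate lose, which is exactly why such configurable rankings rarely give one clear answer.
    
    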

  7. #7
    Member
    m^2:
    I meant generating a list of results into a database and then distributing that database along with a program. The program would allow the user to filter entries from the list and then produce a chart from the filtered entries. That way you wouldn't have to worry about the "right way" of selecting files: users would be able to visualize the results for only the files they are interested in.
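    A rough sketch of that filter-then-summarize idea, using SQLite as the bundled engine (the thread suggests H2 or "something Pythonic"; SQLite just keeps the example self-contained) and a made-up `results` schema with invented ratio numbers:

    ```python
    import sqlite3

    # Hypothetical schema: results(file, category, compressor, ratio)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (file TEXT, category TEXT,"
                 " compressor TEXT, ratio REAL)")
    conn.executemany("INSERT INTO results VALUES (?,?,?,?)", [
        ("logo.png",  "logo",  "zlib", 0.52),
        ("logo.png",  "logo",  "lzma", 0.47),
        ("photo.png", "photo", "zlib", 0.91),
        ("photo.png", "photo", "lzma", 0.88),
    ])

    # The user's filter: only the category they care about,
    # averaged per compressor and ranked best-first
    rows = conn.execute(
        "SELECT compressor, AVG(ratio) FROM results"
        " WHERE category = ? GROUP BY compressor ORDER BY 2",
        ("logo",)).fetchall()
    for compressor, avg_ratio in rows:
        print(f"{compressor}: {avg_ratio:.2f}")
    ```

    The chart step would then plot `rows` with whatever plotting library ships with the program; the filtering itself is just a WHERE clause the user controls.
    
    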

    I mentioned Scala because it's the language I currently like most, and I mentioned Python because it's very popular in OSS. I don't like dynamically or weakly typed languages, though (so I wouldn't want to program in Python).

  8. #8
    Member m^2
    Like http://cdb.paradice-insight.us ?
    I think the best thing would be to put it into that database. And it would be much less work for me.

    Actually, the database's author contacted me about doing it, but I think we should first get a better dataset.
    There are some controversies regarding how it should be made; I guess that's good.

