
Thread: Another benchmark

  1. #1
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts

    Another benchmark

    Dear members of the forum,

    Would you like another benchmark? What features should it have, at a minimum, to be better than the existing ones? Any criticism, remarks, suggestions, etc. are welcome.

    Sincerely yours,

    FatBit

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Graphical charts with linear and non-linear scales. Peeking at a bunch of numbers doesn't give much information immediately.
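    For example, a rough matplotlib sketch (the codec names and numbers below are made up):
    Code:
    # Rough sketch: compression ratio vs. speed scatter with a log-scale speed
    # axis. Codec names and numbers are placeholders, not real measurements.
    import matplotlib.pyplot as plt

    results = {          # codec: (compression speed MB/s, compressed/original)
        "codec A": (250.0, 0.42),
        "codec B": (35.0, 0.33),
        "codec C": (2.5, 0.25),
    }

    fig, ax = plt.subplots()
    for name, (speed, ratio) in results.items():
        ax.scatter(speed, ratio)
        ax.annotate(name, (speed, ratio))
    ax.set_xscale("log")   # speeds span orders of magnitude -> non-linear scale
    ax.set_xlabel("compression speed, MB/s (log scale)")
    ax.set_ylabel("compressed size / original size")
    fig.savefig("ratio_vs_speed.png")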

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    First and foremost: try to do it right. There are many bad benchmarks around; if the methods are wrong, the results are worth little. If you have little to no experience, you'll likely make mistakes. I cringe when I think about my early trials... they were worthless from a scientific POV. But at the same time they were a valuable experience. If you're a benchmarking newbie, your path is likely to be similar; be prepared for it or don't start at all.

  4. #4
    Tester
    Black_Fox's Avatar
    Join Date
    May 2008
    Location
    [CZE] Czechia
    Posts
    471
    Thanks
    26
    Thanked 9 Times in 8 Posts
    I'd say test only a few compressors on a worthless testset, create some output from it, then hear criticism from forum members. Improve the testset, improve the output, repeat a few times - you will have quite a nice start with almost no effort gone to waste.

    E.g. I started with this (I hope nobody from the forum remembers it now, it was a terrible design)
    I am... Black_Fox... my discontinued benchmark
    "No one involved in computers would ever say that a certain amount of memory is enough for all time? I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I already suggested this to other people before (eg. to Sami), but afaik it still doesn't exist.
    What I'd like to see is an approach similar to how benchmarks are handled for cpus, gpus and such -
    detailed reviews of a few chosen programs, instead of wall-of-digits tables with useless stats for all existing compressors.

    To be specific, the approach is like this:
    1) choose a specific task (for example, sending your recent photos to 10 friends;
    or uploading a new app to a server, from which it's downloaded by 10000 users;
    or updating an existing app to a new version; or backing up VM images...
    anyway, there're lots of compression-related tasks).
    2) find programs which can perform required operations (in most cases there're
    unexpectedly few relevant programs)
    3) implement solutions for the task using a few different programs, get reproducible
    measurements of relevant metrics (eg. distribution time).
    Note that imho "process times" are completely worthless for benchmarks, and especially
    can't be used to compute derived metrics.
    4) write a review of working solutions.

    Here's one example of such a "review" - http://compressionratings.com/bwt.html
    Though I'd prefer even more text and fewer numbers.
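    To make step 3 concrete, here's a rough sketch of computing one task-specific metric ("time until a user has the new app") instead of raw process times - the command lines, bandwidth figures and file names are just placeholders:
    Code:
    # Rough sketch for the "upload an app, 10000 users download it" task:
    # measure wall times and compressed size, then derive a single metric.
    # Commands, bandwidths and file names are placeholders.
    import os, subprocess, time

    INPUT = "app.tar"            # hypothetical input
    UPLOAD_BW = 10e6 / 8         # 10 Mbit/s developer uplink, in bytes/s
    DOWNLOAD_BW = 50e6 / 8       # 50 Mbit/s per user, in bytes/s

    def timed(cmd):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        return time.perf_counter() - t0

    def distribution_time(name, compress_cmd, decompress_cmd, packed):
        t_comp = timed(compress_cmd)
        size = os.path.getsize(packed)
        t_dec = timed(decompress_cmd)
        # one upload by the developer, then each user downloads and unpacks;
        # downloads run in parallel, so per-user time is what matters
        total = t_comp + size / UPLOAD_BW + size / DOWNLOAD_BW + t_dec
        print(f"{name}: {size} bytes, ~{total:.1f} s until a user has the app")

    distribution_time("7z/lzma",
                      ["7z", "a", "-t7z", "app.7z", INPUT],
                      ["7z", "x", "-y", "app.7z"],
                      "app.7z")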

    Afaik, the common method of compression benchmarking (collect a set of files, compress
    and decompress them with various programs, measure file sizes and operation times)
    is already obsolete.
    Maybe it made sense in the old days, when the main task was to compress plaintext.
    But currently there're too many rare options, which are only present in a few programs.
    For example, what's the point of comparing single-threaded vs multi-threaded compressors
    on a multicore cpu? Or freearc vs stuffit on jpegs/pdfs?

    Anyway, I just want to say that current benchmarks take a lot of work to maintain, but
    are actually useless as a reference for any practical compression-related tasks.
    So it could be cool if somebody could change this.

  6. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    What I'd like to see is an approach similar to how benchmarks are handled for cpus, gpus and such -
    detailed reviews of a few chosen programs, instead of wall-of-digits tables with useless stats for all existing compressors.
    Here's one example of such a "review" - http://compressionratings.com/bwt.html
    Though I'd prefer even more text and fewer numbers.
    I have a different view.
    IMO walls of numbers are great, and the best benchmarks around are metacompressor and Para-Dice in Sight.
    That's because there's a lot of knowledge one can get from a set of numbers. Benchmark authors present what they find most important, which is often not what matters to me. Having the raw test data is what lets me take what I want from a test.

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, I certainly don't agree.

    Surely, it's all good when you're the one performing the benchmark - then you know which components are relevant for you.
    For example, that in-memory benchmark of LZ codecs is completely ok for me - but it's also an example of what I described.
    But imagine that we'd added bsc/cuda to that, and it appeared both faster and better than most of the coders?
    Would it make sense to treat bsc as the best choice just because there are no MT/GPU implementations of the other algos?

    What I'm trying to say is that things like that happen too frequently in current "universal" benchmarks.
    It's unfair to compare archivers with single-file compressors, or compressors with filters (and/or dictionaries) vs
    ones that don't have them, etc.
    And it's not even a matter of a few programs just being better than others, because there's no program
    that has it all.

    Anyway, my main point is that unlike other areas (cpus etc), the existing compression benchmarks
    have no practical use. I mean, a user can't choose a compressor based on any of them.

    As to numbers, I can't even say that it's better when they exist.
    Anyway, for any actual use it would be necessary to extract them to a usable format and then
    write some filter scripts to process them. Which takes quite some work.
    And then, isn't it always better to also redo the actual tests?
    It could be different if these benchmark sites had anything exclusive - like some unique ways to collect statistics
    (eg. using something like http://encode.ru/threads/435-compression-trace-tool )
    or commercial codecs not available otherwise.
    So there's basically nothing useful, as all the work has to be duplicated anyway - especially
    because of subjective choices like comparing single-file compressors to multi-file ones by tar'ing the corpus files
    (and including the tar times into the stats), or excluding i/o times, or tests performed on a loaded machine.

    So my proposed solution to this is discarding the idea of a universal compression benchmark
    (as it's become nearly as subjective as if we had to choose "the best program")
    and instead setting up multiple limited benchmarks.

    I'm mostly interested in seeing a "full list of common compression-related tasks",
    because currently I have to guess too much, and my priorities may be wrong.

  8. #8
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    But imagine that we'd added bsc/cuda to that, and it appeared both faster and better than most of the coders?
    Would it make sense to treat bsc as the best choice just because there are no MT/GPU implementations of the other algos?
    Surely not. Compression testing is hard nowadays. CPU time is irrelevant for end-users. Wall time is very volatile, repeatability is bad which forces you to do many more tests, but worse: it depends on hardware much more than CPU time.

    Quote Originally Posted by Shelwien View Post
    (1) What I'm trying to say is that things like that happen too frequently in current "universal" benchmarks.
    It's unfair to compare archivers with single-file compressors, or compressors with filters (and/or dictionaries) vs
    ones that don't have them, etc.

    (2) Anyway, my main point is that unlike other areas (cpus etc), the existing compression benchmarks
    have no practical use. I mean, a user can't choose a compressor based on any of them.
    I don't agree with what I marked as (1). If a compressor is unsuitable for some task I see no reason not to point it out. Such unsuitability may be obvious to you, but not to readers. Taking the average (or sum) over a set of files is wrong, though. Still, I can't propose a good test for a general purpose compressor.

    Quote Originally Posted by Shelwien View Post
    As to numbers, I can't even say that it's better when they exist.
    Anyway, for any actual use it would be necessary to extract them to a usable format and then
    write some filter scripts to process them. Which takes quite some work.
    And then, isn't it always better to also redo the actual tests?
    It could be different if these benchmark sites had anything exclusive - like some unique ways to collect statistics
    (eg. using something like http://encode.ru/threads/435-compression-trace-tool )
    or commercial codecs not available otherwise.
    So there's basically nothing useful, as all the work has to be duplicated anyway - especially
    because of subjective choices like comparing single-file compressors to multi-file ones by tar'ing the corpus files
    (and including the tar times into the stats), or excluding i/o times, or tests performed on a loaded machine.
    No, in general it wouldn't be better to redo the tests. Usually it's just not worth the time to discard available knowledge. Sometimes tests are bad or don't cover your needs well enough - quite often, actually. But then I redo just the most interesting subset of tests (i.e. least covered, with strange results etc.) and leave all others as they are. The abstraction of having unlimited time is nice, but not very practical. I intended to do 2 big benchmarks last summer; I didn't find the time for either of them and just did a very rough version of one.

    Quote Originally Posted by Shelwien View Post
    So my proposed solution to this is discarding the idea of a universal compression benchmark
    (as it's become nearly as subjective as if we had to choose "the best program")
    and instead setting up multiple limited benchmarks.
    Totally agree. There is no best compression program or algorithm, only ones that are the best for particular tasks.

    Quote Originally Posted by Shelwien View Post
    I'm mostly interested in seeing a "full list of common compression-related tasks",
    because currently I have to guess too much, and my priorities may be wrong.
    No way. You can focus on some niche and find the most common tasks in it. But going through all important niches gets hard. Evaluating their relative importance and accounting for unanalysed ones... I don't see it.
    For me the solution is to live with a limited view. I don't have a full list. But I do have a good idea what's going on in the areas that are important for me.

  9. #9
    Tester
    Black_Fox's Avatar
    Join Date
    May 2008
    Location
    [CZE] Czechia
    Posts
    471
    Thanks
    26
    Thanked 9 Times in 8 Posts
    There is a need for the user himself to set the weights and priorities, to say what he expects most (his usual compression activities / the file extensions he usually compresses / whether he needs a GUI, whether he needs low-memory operation, whether he can use I/O-demanding software), otherwise you will NEVER cover everything. I touched on that very basically in my attempt - I still think nobody else, or very few, did even that much - but obviously you need much more than sortable columns.
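    Something like this toy sketch is what I have in mind (the result table and the weights are invented):
    Code:
    # Toy sketch: re-rank benchmark rows by user-chosen weights.
    results = [
        # name, ratio (compressed/original), comp MB/s, dec MB/s, memory MB
        ("codec A", 0.42, 250.0, 400.0, 64),
        ("codec B", 0.33, 35.0, 120.0, 256),
        ("codec C", 0.25, 2.5, 15.0, 900),
    ]
    weights = {"ratio": 0.5, "cspeed": 0.2, "dspeed": 0.2, "memory": 0.1}

    best_ratio = min(r[1] for r in results)   # lower is better
    best_cs = max(r[2] for r in results)      # higher is better
    best_ds = max(r[3] for r in results)      # higher is better
    best_mem = min(r[4] for r in results)     # lower is better

    def score(row):
        _, ratio, cspeed, dspeed, mem = row
        return (weights["ratio"] * best_ratio / ratio +
                weights["cspeed"] * cspeed / best_cs +
                weights["dspeed"] * dspeed / best_ds +
                weights["memory"] * best_mem / mem)

    for row in sorted(results, key=score, reverse=True):
        print(f"{row[0]}: weighted score {score(row):.3f}")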

    EDIT: improved wording
    Last edited by Black_Fox; 8th February 2012 at 01:21.
    I am... Black_Fox... my discontinued benchmark
    "No one involved in computers would ever say that a certain amount of memory is enough for all time? I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I think compression benchmarks are more useful to developers than to users. For most people, compression speed and ratio are less important than availability, reliability and licensing. I distribute ZPAQ compressors as zip archives because zip is fast and small enough and I know just about everyone has zip or can get it, whether they are using Windows, Linux, or a Mac. Zip or deflate is popular because both the applications and zlib are well tested, open source, not patented, and the code can be freely used. The format is standardized by a RFC and hasn't changed in many years. People know that when they make a zip archive, they can still read the files 10 or 20 years from now. You can't say that about closed source compressors like nanozip or ppmonstr, even if they compress much better. WinRK and Compressia were very good too, except that if you didn't save an uncompressed copy of your files, they are probably gone.

    That said, I think the most interesting kinds of data for a benchmark are:
    - Natural language text (various languages)
    - Source code (various languages)
    - Executable code (various architectures and compilers)
    - Uncompressed images (faces, text, medical)
    - Uncompressed audio (speech and music)
    - Uncompressed video
    - DNA (human, bacteria)
    - Maps
    - Other sensor data (weather, seismic, etc)

    I would be less interested in already compressed data like JPEG, MPEG, MP3, PDF, DOCX, zip archives, UPX compressed executables etc. even though this is more typical of the data that people actually have on their computers. I'm more interested in the problem of modeling natural complex data in order to develop better algorithms.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Wall time is very volatile, repeatability is bad which forces you to do many more tests,

    Timings on an idle multicore cpu + ramdrive are pretty stable actually, up to 0.1s or so at least.
    And that can be further improved by taking min-of-N timings (not average!).
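    E.g. a minimal min-of-N harness (the command line and output name here are just placeholders):
    Code:
    # Minimal min-of-N wall-clock timing harness; the command is a placeholder.
    # The minimum filters out interference from other processes, which can
    # only ever add time, never remove it.
    import os, subprocess, time

    def min_of_n(cmd, output, n=5):
        best = float("inf")
        for _ in range(n):
            if os.path.exists(output):
                os.remove(output)        # start every run from scratch
            t0 = time.perf_counter()
            subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
            best = min(best, time.perf_counter() - t0)
        return best

    print(min_of_n(["7z", "a", "-t7z", "out.7z", "testset"], "out.7z"))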

    Another interesting approach is running tests under DOS - in most cases tools like wdosx can be
    used to port console compressors to DOS.
    That way the timing error can be reduced to ~1000 clocks or so - I used to do my benchmarks
    like that before WinXP.

    I guess, it should be possible also to make a stripped-down linux version, without any background processes.
    I wonder if there're any COFF-to-ELF converters similar to wdosx

    > it depends on hardware much more than CPU time.

    Yes, but I think it's ok.
    In-memory benchmarks and profiling of specific components may be useful too, but only when
    they're properly prepared - ie not approximated by "user times" and not spammed by results
    from programs which are not directly comparable.

    > If a compressor is unsuitable for some task I see no reason not to point it out.
    > Still, I can't propose a good test for a general purpose compressor.

    Yes, exactly. So I think we should make a list of these "some tasks" first.

    > But then I redo just the most interesting subset of tests and leave all others as they are

    Well, what I meant is that if you want to compare 7z,fa,nz then it's likely more reasonable
    to just compare them on your machine and with your files - instead of trying to extract the
    relevant stats from compressionratings.com or some such.

    But if there were an independent review of modern archivers (instead of benchmarks trying to
    be complete), it would likely be much more useful as a reference.

    > But going through all important niches gets hard.

    I don't think that there's that much actually... without going too deep anyway.
    It's more interesting that there're no public solutions for many commonly asked
    questions, like LZ with a dictionary or good compression for short records.

    > But I do have a good idea what's going on in the areas that are important for me.

    So why don't you post some more specific information?

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I think compression benchmarks are more useful to developers than to users.

    Existing ones, yes, though that's probably why they're weird.
    Their maintainers usually try to make them good both for developers and regular users
    (especially because developers don't visit benchmark sites that frequently)
    and kinda fail overall.

    To be specific, for developers it's imho more important to have "fair" comparisons,
    ie to compare plzma and 7z/lzma we'd have to disable 7z's file sorting and MT.
    Surely it would be more "fair" to somehow enable these for plzma instead, but
    obviously it requires much more work.
    Such a comparison is useful for a developer, because it shows whether it makes
    sense to add these missing parts.

    On the other hand, for most users only complete solutions would be of any use -
    eg. complete archivers or external codec configurations for fa, or some such.

    > For most people, compression speed and ratio are less important than
    > availability, reliability and licensing.

    "Most people" actually don't have any use for compression at all.
    The data they deal with is usually already compressed anyway (docx,jpg,video)
    and they frequently don't even bother to wrap multiple files up into one.

    So afaik, our target audience consists mostly of wintel software developers
    (for their apps/updates; non-wintel ones usually can't choose the installer),
    game developers, and various pirates.

    > You can't say that about closed source compressors like nanozip or
    > ppmonstr, even if they compress much better.

    More like, there's no use for them, aside from game rips and such.
    And for these it doesn't really matter whether they're closed-source or not,
    but decompression speed does matter a lot, and it's hard to compete with lzma
    in that sense.

    > That said, I think the most interesting kinds of data for a benchmark are:
    > - Natural language text (various languages)

    Are you aware that enwik8 is not a natural language text, I wonder?
    Afaik, the only fitting kind of samples for this would be narrative
    (without many dialogs) plaintext (without pictures and structure tags)
    fiction books.

    Also, I'm not aware of any compressors intended specifically for text.
    Word models and filters are only small tricks in compressors which
    otherwise try to be "universal".

    > - Source code (various languages)

    This is an "easier" task (at least the language structure is known and static),
    but for now I'd simply like to find a working C++ to AST to C++ converter.

    > - Executable code (various architectures and compilers)

    This is very similar to previous case imho, but I have a plan for dealing with this one,
    so I guess it's even easier.

    > - Uncompressed images (faces, text, medical)
    > - Uncompressed audio (speech and music)
    > - Uncompressed video

    These are very deep actually, eg. they can include any of the previous types.
    But fortunately it's still very easy to produce good results in these areas.

    > - DNA (human, bacteria)

    This one is really interesting, as I'm not aware of any proper CM compressors for DNA data.
    There're lots of non-trivial dependencies (eg. molecular structure), but unfortunately
    it's pretty hard to find any comprehensible information about them.

    > - Maps
    > - Other sensor data (weather, seismic, etc)

    Lossless compression of these doesn't make much sense imho.
    But I'm interested in using CM models for forecasts, which is kinda related.

    > I would be less interested in already compressed data like JPEG,
    > MPEG, MP3, PDF, DOCX, zip archives, UPX compressed executables etc.
    > even though this is more typical of the data that people actually
    > have on their computers.

    That's good. Please keep looking elsewhere until I manage to patch up
    something usable

    > I'm more interested in the problem of modeling natural complex data
    > in order to develop better algorithms.

    Unfortunately it's more a matter of psycho-models and lossy compression,
    and psycho-models are mostly about collecting valid feedback from lots
    of testers, so it's very expensive (or useless).

  13. #13
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    > Wall time is very volatile, repeatability is bad which forces you to do many more tests,

    Timings on an idle multicore cpu + ramdrive are pretty stable actually, up to 0.1s or so at least.
    And that can be further improved by taking min-of-N timings (not average!).
    I've seen up to 10% on a highly cleaned-up machine with just 2 cores. Yes, I always do multiple tests. Some time ago I used to take the min too, but now I think that a more statistical approach (average, standard deviation) might be better, because a high deviation is a sign of some problem. This might be a problem with the test itself or with the tested app; both are important to note.
    Though really, I'm not sure; the minimum coupled with these stats might be better.
    And when you are unable to run on a really clean machine, the results are awful. Recently I asked my friends to run a benchmark on their machines. Though the results were not bad enough to be useless, I didn't feel comfortable with their accuracy.
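    E.g. a small sketch of the min + mean/stddev idea (the timings are made up and the 5% threshold is arbitrary):
    Code:
    # Report min, mean and standard deviation of repeated timings and flag
    # runs whose relative deviation looks suspicious - a sign of a problem
    # with the test environment or with the tested app.
    import statistics

    def summarize(name, timings):
        mn = min(timings)
        mean = statistics.mean(timings)
        sd = statistics.stdev(timings)
        flag = "  <-- unstable, investigate" if sd / mean > 0.05 else ""
        print(f"{name}: min {mn:.2f}s  mean {mean:.2f}s  stdev {sd:.2f}s{flag}")

    summarize("codec A", [4.12, 4.15, 4.13, 4.14, 4.16])
    summarize("codec B", [7.02, 7.95, 7.10, 8.40, 7.25])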

    Quote Originally Posted by Shelwien View Post
    Another interesting approach is running tests under DOS - in most cases tools like wdosx can be
    used to port console compressors to DOS.
    That way the timing error can be reduced to ~1000 clocks or so - I used to do my benchmarks
    like that before WinXP.

    I guess, it should be possible also to make a stripped-down linux version, without any background processes.
    I wonder if there're any COFF-to-ELF converters similar to wdosx
    Interesting. I know that FreeBSD can take Windows drivers after conversion, so some form of it exists, though it would need some mods...maybe there are other projects like this too.

    > If a compressor is unsuitable for some task I see no reason not to point it out.
    > Still, I can't propose a good test for a general purpose compressor.

    Yes, exactly. So I think we should make a list of these "some tasks" first.
    I'm not sure you got what I meant by the first statement. For me the best way to point out that something doesn't work well in a particular situation is by testing it and showing the numbers.


    > But then I redo just the most interesting subset of tests and leave all others as they are

    Well, what I meant is that if you want to compare 7z,fa,nz then it's likely more reasonable
    to just compare them on your machine and with your files - instead of trying to extract the
    relevant stats from compressionratings.com or some such.
    I don't think so, because I remember how bad I was at choosing good test data not so long ago. And I've recently seen another person do even worse, claiming that LZMA was not significantly stronger than deflate based on a test on a single XX KB text file. I think such errors are common among compression amateurs. So while pros can't really make a good generic test, users' own data is often an even worse basis for them than ours.

    > But going through all important niches gets hard.

    I don't think that there's that much actually... without going too deep anyway.
    It's more interesting that there're no public solutions for many commonly asked
    questions, like LZ with a dictionary or good compression for short records.
    Well... I don't know. I feel that there's a fine line between uses. For example, compression of LAN traffic is much different from that used for communication with distant satellites. But are there any really strong borders on the way between them?
    Or maybe you would shoehorn them into a common niche?
    I don't know what would be a good way to categorise them, but anyway I see compression as a vast field, though the majority of it is very obscure and (at best) barely known even to generic compression pros.

    > But I do have a good idea what's going on in the areas that are important for me.

    So why don't you post some more specific information?
    Because it's not a good place for it.
    You can start a thread about a list of all important uses of compression and I'll gladly put my 3 gr. in there.

  14. #14
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    - Natural language text (various languages)
    - Source code (various languages)
    - Executable code (various architectures and compilers)
    - Uncompressed images (faces, text, medical)
    - Uncompressed audio (speech and music)
    - Uncompressed video
    - DNA (human, bacteria)
    - Maps
    - Other sensor data (weather, seismic, etc)
    I would add some things to the list (and remove some too )
    Databases. An MS study on dedupe showed them to be very common (though on a clearly biased population). And many DB engines use compression themselves. Yet databases are barely represented in general-purpose benchmarks.
    I'd like to see some research on the impact on compressibility of:
    - DBMS design
    - database design
    - database contents
    Even basic stuff like what kinds of databases are floating around out there.
    Well, OK, I could think of some more, but I'm tired and unwilling to think about it further. This one is by far the most important thing (for me) that you didn't mention.

  15. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    You would need to specify what's in the database. A mailing list? Tweets? A star catalog? Synthetic data like http://www.tpc.org/ (which is not really suitable because it is designed to test speed, not compression).

    Another approach would be to take files from lots of different benchmarks plus some synthetic ones and test them on about 20-30 different compressors representing a wide range of algorithms, including unusual ones like LZW, CTW, byte pair encoding, order 0, etc. Then run regression tests and toss the files whose compression ratios can be predicted from the other files until you are down to a minimal set.
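    A toy sketch of that pruning step (the ratios are invented, and it only checks pairwise linear prediction rather than a full regression against all the other files):
    Code:
    # Rows are test files, columns are their compression ratios under
    # different compressors. A file whose ratios are well predicted from
    # another file's ratios adds little information and can be dropped.
    import numpy as np

    names = ["book", "book2", "exe", "dna"]
    ratios = np.array([
        [0.30, 0.25, 0.22, 0.35, 0.28, 0.33],   # book
        [0.31, 0.26, 0.23, 0.36, 0.29, 0.34],   # book2, nearly a copy of book
        [0.55, 0.62, 0.40, 0.48, 0.66, 0.45],   # exe
        [0.27, 0.24, 0.26, 0.25, 0.23, 0.26],   # dna
    ])

    def predictable(ratios, names, tol=0.005):
        flagged = []
        for i, y in enumerate(ratios):
            for j, x in enumerate(ratios):
                if i == j:
                    continue
                a, b = np.polyfit(x, y, 1)            # fit y ~ a*x + b
                rmse = np.sqrt(np.mean((a * x + b - y) ** 2))
                if rmse < tol:
                    flagged.append((names[i], names[j]))
        return flagged

    # "book" and "book2" predict each other almost exactly, so either one
    # could be dropped; repeat after dropping until nothing is predictable.
    print(predictable(ratios, names))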

    For measuring speed, you don't need to use the whole benchmark, because speed depends a lot less on content than compressed size does. For most compressors, the two extremes are likely to be random data and a file of all zeros. I doubt you would need more than that, but of course you could run regression tests to find out. Speed doesn't need to be measured accurately, because others will be using different hardware anyway.
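    For example (the file size and the 7z command line below are placeholders):
    Code:
    # Time one compressor on the two likely speed extremes - a file of all
    # zeros and a file of random bytes of the same size.
    import os, subprocess, time

    N = 100 * 1024 * 1024
    with open("zeros.bin", "wb") as f:
        f.write(bytes(N))
    with open("random.bin", "wb") as f:
        f.write(os.urandom(N))

    for name in ("zeros.bin", "random.bin"):
        t0 = time.perf_counter()
        subprocess.run(["7z", "a", "-t7z", name + ".7z", name],
                       check=True, stdout=subprocess.DEVNULL)
        dt = time.perf_counter() - t0
        print(f"{name}: {N / dt / 1e6:.1f} MB/s")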

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Synthetic data like http://www.tpc.org/ (which is not really suitable because it is designed to test speed, not compression).
    Not that it's stopped anyone. http://www.xtremecompression.com/compare.html

  17. #17
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    You would need to specify what's in the database. A mailing list? Tweets? A star catalog? Synthetic data like http://www.tpc.org/ (which is not really suitable because it is designed to test speed, not compression).
    By saying 'database contents' I meant that the impact of what's inside should be determined. We know nothing about it. It seems obvious that emails will compress differently than tweets, but we don't have any numbers.
    Quote Originally Posted by Matt Mahoney View Post
    Another approach would be to take files from lots of different benchmarks plus some synthetic ones
    I don't think it would be useful. IMO tests should take real uses into account. Abstracting things is sometimes useful, but this looks like ignoring reality and going after some fixed categorisation. I would start with the widest selection of databases I could get and then try to find patterns. Something tells me they would be different from the ones you can find elsewhere.

  18. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    As I said, real users have files that are mostly already compressed because the applications compress the data automatically. Users don't usually think or care about compression. Also, I think that most people have mostly useless data on their computers but have no way to know what is safe to delete. I have thousands of files on my computer and I have no idea what most of them are for. There is probably lots of duplicate data, but that doesn't mean I can delete the extra copies.

    In my experience, people that do care about compression (Ocarina customers) typically have terabytes or petabytes of one or a few particular file types that you'll probably never see anywhere else. Examples: studio quality video, rendered animation, DNA sequencing machine raw or processed data (gigabyte TIFF images), seismic data from huge sensor arrays, or billions (yes billions) of user uploaded JPEG images. We usually write one-time custom algorithms because there is nothing off the shelf for them. They are big companies so they can afford it. Anyone else will just buy more disk space (it's cheap) or figure out what they don't need.

    So like I said, benchmarks are mostly useful to developers. I know the best way to compress already compressed data is to decompress it and use a better algorithm. So that's not really interesting. What's interesting is knowing which algorithms work best on data types that are important to us, and their relative performance (speed, size, and memory tradeoff). The basic data types seem to be text, code, DNA, and n-dimensional sensor data (sound, pictures, video, scientific instruments) because of their complexity.

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > So like I said, benchmarks are mostly useful to developers.

    Well, there're still lots of people who have some personal uses for archivers.
    Though maybe they don't pay as well as big companies

    > I know the best way to compress already compressed data is to
    > decompress it and use a better algorithm.

    Yes, but there're many cases where you can't simply decompress it
    (like all the lossy formats, or known compressed blocks in an unknown container)
    and that's where it becomes interesting.
    For example, did you see http://encode.ru/threads/1405-Stegan...-lossy-formats ?

    > What's interesting is knowing which algorithms work best on data
    > types that are important to us,

    But we know that, right?
    The winner is always a custom CM based on dependencies in specific type of data.
    The only worthy alternative is BWT, but it's just a tricky
    speed optimization method in the end, and thus can't provide
    the best compression in most cases.

    > and their relative performance (speed, size, and memory tradeoff).

    Also parallelism, API type (serial, blockwise...) etc.
    But that's considerably different from actual benchmarks, especially LTCB.
    Because if one coder is tested using xwrt, and another with random options,
    does it really provide any information about their relative performance?

    > The basic data types seem to be text, code, DNA, and n-dimensional
    > sensor data (sound, pictures, video, scientific instruments) because
    > of their complexity.

    Ok. So, is there any progress in text compression?
    Like identifying parts of speech and sentence structure and semantic associations
    and using that for compression?
    Are you going to write that in zpaql, when it's stable?

  20. #20
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
    i was thinking of using a 600mb chatlog from an online game named Final Fantasy XI online. even though it has a lot of repeating patterns it also contains a lot of noise (misspellings, different kinds of acronyms, words only used in the game, etc.), so it's a bit harder for dictionary-based prefiltering... sadly it contains personal conversations too, so i'm not able to share the data

  21. #21
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    > Well, there're still lots of people who have some personal uses for archivers.
    > Though maybe they don't pay as well as big companies

    Yes, but all the best archivers are free. The way you make money is to give away the software to sell yourself.

    >> I know the best way to compress already compressed data is to
    >> decompress it and use a better algorithm.
    >
    > Yes, but there're many cases where you can't simply decompress it

    Right, so you have to guess what version and what options were used to compress, so that recompression isn't lossy. It's a lot of work with a straightforward solution, so not very interesting to me IMHO.

    >> What's interesting is knowing which algorithms work best on data
    >> types that are important to us,

    > But we know that, right?
    > The winner is always a custom CM based on dependencies in specific type of data.
    > The only worthy alternative is BWT, but it's just a tricky
    > speed optimization method in the end, and thus can't provide
    > the best compression in most cases.

    Not necessarily. No algorithm can predict text as well as the human brain yet. So this is an area of research. Not so much for compression, but because a better model could be used for a wide range of AI problems like speech recognition, OCR, language translation, search, and text classification. Image compression is an even harder theoretical problem, because an ideal model would have to include a language model to handle scanned text, in addition to a visual model of the world. And DNA would require totally different techniques. I don't care so much about compression for saving disk space or bandwidth (which is cheap), but about using compression to measure the quality of the model.

    > Ok. So, is there any progress in text compression?
    > Like identifying parts of speech and sentence structure and semantic associations
    > and using that for compression?

    Well, I found it interesting that the top programs already do this in efficient and non-obvious ways. For example, the dictionaries for durilca4linux and the Hutter prize winners are organized so that grammatically and semantically related words are grouped together, like days of the week or "mother" with "father". There are even ways to do this automatically from a corpus, like clustering in context space, reducing the dimensions using neural LSA models, and finding a short path to write the dictionary. Not that Shkarin has said how he did it, but that would be one way.
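    A toy sketch of the clustering-in-context-space idea (the corpus name, vocabulary size, window and cluster count are arbitrary, and this is certainly not how Shkarin did it):
    Code:
    # Build bag-of-context vectors for frequent words and cluster them, so
    # related words end up adjacent when the dictionary is written out
    # cluster by cluster. Real systems are far more sophisticated.
    from collections import Counter, defaultdict
    import numpy as np
    from sklearn.cluster import KMeans

    words = open("corpus.txt", encoding="utf-8", errors="ignore").read().lower().split()
    vocab = [w for w, _ in Counter(words).most_common(5000)]
    index = {w: i for i, w in enumerate(vocab)}

    # co-occurrence counts within a +/-2 word window
    ctx = defaultdict(Counter)
    for i, w in enumerate(words):
        if w not in index:
            continue
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i and words[j] in index:
                ctx[w][words[j]] += 1

    X = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for w, counts in ctx.items():
        for c, n in counts.items():
            X[index[w], index[c]] = n
    X /= np.maximum(X.sum(axis=1, keepdims=True), 1)   # normalise each row

    labels = KMeans(n_clusters=200, n_init=10).fit_predict(X)
    order = sorted(range(len(vocab)), key=lambda i: (labels[i], vocab[i]))
    with open("dict.txt", "w") as f:
        f.writelines(vocab[i] + "\n" for i in order)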

    > Are you going to write that in zpaql, when it's stable?

    Well, it's possible, because level 2 allows any string to be encoded as an arbitrary program that outputs it. But there is a lot of research to do first. For one thing, I think that with more memory and processing power it is possible to get better compression without a dictionary. Constraining a dictionary to 1 dimension has to cost some bits. But if you only have 16 GB of memory you need to do it. Looking at the two graphs on the LTCB page, I think that supercomputers or clusters with TB of memory are eventually going to be taking the top spots. You're trying to model a process that runs in the brain. A human brain sized neural network with 10^15 connections running at 10 Hz would require something the size of the Japanese K computer and 10 MW of electricity.
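    The back-of-envelope arithmetic behind that last sentence (all figures are rough):
    Code:
    # ~1e15 synapses updating at ~10 Hz is ~1e16 events/s, i.e. a machine in
    # the 10 PFLOPS class - roughly the K computer, at roughly 10 MW.
    connections = 1e15
    rate = 10.0                     # Hz
    print(f"{connections * rate:.0e} synaptic updates per second")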

  22. #22
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    A human brain sized neural network with 10^15 connections running at 10 Hz would require something the size of the Japanese K computer and 10 MW of electricity.
    I wonder if all those 10^15 connections are dedicated to our intelligence, or at least to predicting the next byte

  23. #23
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I guess maybe 10% for language modeling. But it is interesting that in most neural models, a synapse stores around 1 bit of memory. (Actually 0.15 bits in the Hopfield associative model). Yet Landauer estimated that human long term memory is around 10^9 bits. http://atom.physics.helsinki.fi/kurs...0_477_1986.pdf
    So it seems that storage is very inefficient or redundant. We don't know why. But maybe that is a characteristic of highly parallel systems, like when you have millions of processors each having an identical copy of the operating system.
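    In rough numbers:
    Code:
    # Rough arithmetic behind the "inefficient or redundant" remark.
    synapses = 1e15
    bits_per_synapse = 0.15          # Hopfield associative model estimate
    raw = synapses * bits_per_synapse        # ~1.5e14 bits of raw storage
    retained = 1e9                   # Landauer's long-term memory estimate
    print(f"raw ~{raw:.1e} bits vs retained ~{retained:.0e} bits "
          f"-> factor ~{raw / retained:.0e}")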

  24. #24
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    > No algorithm can predict text as well as the human brain yet

    how was it checked? i think the program should be trained on 1 gb of text first, then checked on further data. the experiments i've heard about started with zero knowledge and checked compression of the subsequent 1 gb of text

  25. #25
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    > Well, there're still lots of people who have some personal uses for archivers.
    > Though maybe they don't pay as well as big companies

    Yes, but all the best archivers are free. The way you make money is to give away the software to sell yourself.
    This is a viable business model, but not the only one possible.
    For example you could use a free archiver to promote your library and sell it to governments and businesses.

  26. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    >> No algorithm can predict text as well as the human brain yet

    > how was it checked? i think the program should be trained on 1 gb of text first, then checked on further data. the experiments i've heard about started with zero knowledge and checked compression of the subsequent 1 gb of text

    I guess I still need to replicate on enwik9 Shannon's 1950 experiment where he estimated the entropy of written English to be about 1 bit per character based on how well people could predict the next letter. But if you try the experiment yourself, you know that sometimes you can guess the next word using real-world knowledge that no compressor is going to have.
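    A rough sketch of what such a run could look like (only the upper-bound half of Shannon's estimate, with a tiny number of samples; "enwik9" here is just whatever large text file is at hand):
    Code:
    # Shannon-style guessing run: show some context, let a human guess the
    # next character until correct, then estimate an entropy upper bound from
    # the guess-rank frequencies. Demo only; use a smaller file if memory is tight.
    import math, random
    from collections import Counter

    text = " ".join(open("enwik9", encoding="utf-8", errors="ignore").read().split())
    ranks = Counter()
    trials = 20
    for _ in range(trials):
        pos = random.randrange(100, len(text) - 1)
        print("\n..." + text[pos - 60:pos])
        guesses, seen = 0, set()
        while True:
            g = input("next char? ")[:1]
            if g in seen:
                continue
            seen.add(g)
            guesses += 1
            if g == text[pos]:
                break
        ranks[guesses] += 1

    # order-0 entropy of the guess-rank distribution (an upper-bound estimate)
    H = -sum((n / trials) * math.log2(n / trials) for n in ranks.values())
    print(f"\n<= {H:.2f} bits per character, from {trials} samples")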

  27. #27
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    > This is a viable business model, but not the only one possible.
    > For example you could use a free archiver to promote your library and sell it to governments and businesses.

    Oops, too late. I made libzpaq public domain.

    But another approach is to license it under the GPL. Then you can sell a separate license to companies that want to use it in their product but don't want to reveal their source code. But IMHO it doesn't matter, because if a company really wants to use your code, who else are they going to hire to work on it?

  28. #28
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    But IMHO it doesn't matter, because if a company really wants to use your code, who else are they going to hire to work on it?
    Exactly. I didn't mean just selling licenses, but other moneymaking activities too.

  29. #29
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Matt Mahoney View Post
    I guess I still need to replicate on enwik9 Shannon's 1950 experiment where he estimated the entropy of written English to be about 1 bit per character based on how well people could predict the next letter. But if you try the experiment yourself, you know that sometimes you can guess the next word using real-world knowledge that no compressor is going to have.
    One outstanding ability of the brain is that it develops a prediction algorithm on the fly (while you do the test). Software is mostly locked into a single, though possibly generic, algorithm, changing only its parameters. The brain has full freedom and, within reason, is able to automatically initiate a "reasonable" search for Kolmogorov complexity. Software can hardly do anything comparable.

  30. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    But we can program a complete brain simulator. To make the comparison between a human brain and the simulator fair, we would need to either copy the entire human brain's contents into the simulator or erase the human brain's contents altogether. I wonder which would be easier?
