Thread: µnit — a unit testing framework designed with compression libraries in mind

  1. #1
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts

    µnit — a unit testing framework designed with compression libraries in mind

    A while back I wrote a small unit testing framework called "µnit", and it just occurred to me that I haven't posted about it here, which is unfortunate since it was written with compression libraries in mind and works rather well for them.

    It consists of a single header and a single C source file, and it doesn't require any configuration information from the build system, so it's pretty trivial to integrate. It's quite portable, and permissively licensed (MIT). There is some pretty good documentation (IMHO), so I won't go into too much detail here, but there are a few features which are very interesting for compression libraries:

    First, there is a built-in pseudo-random number generator which will produce reproducible random data. This lets you increase coverage a bit without exhaustive (i.e., extremely time-consuming) tests, and you can still reproduce any failures by feeding the seed back into another execution. There are functions for generating ints, doubles, and arbitrary-sized blobs of data.
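
    For instance, a test that exercises a codec with reproducible random input might look roughly like this (a minimal sketch from memory — double-check the exact function names against the µnit docs; compress_block() is just a stand-in for whatever you're testing):

    Code:
    #include <stdint.h>
    #include <stdlib.h>
    #include "munit.h"

    static MunitResult
    test_random_input(const MunitParameter params[], void* user_data) {
      (void) params; (void) user_data;

      /* The PRNG is seeded by the framework, and the seed is reported on
       * failure, so a failing run can be reproduced exactly. */
      int      size = munit_rand_int_range(1, 65536);  /* random int in a range */
      double   frac = munit_rand_double();             /* random double */
      uint8_t* data = malloc(size);
      munit_assert_not_null(data);
      munit_rand_memory(size, data);                   /* arbitrary-sized random blob */

      /* compress_block(data, size, frac) would go here... */
      (void) frac;

      free(data);
      return MUNIT_OK;
    }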

    The output includes basic timing information (CPU and wall clock), so it can be used for benchmarking; you can even request that the test be run multiple times and have the results averaged.

    It also supports parameterized tests, which provide an easy way to pass different arguments to a test (like compression level, how much memory to use, etc.). You can have multiple parameters, and every combination will be tried automatically.
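
    A rough sketch of what that looks like, with compression level as the parameter (struct and field spellings are from memory, so check the docs; the assertion is just a placeholder for a real round-trip check):

    Code:
    #include <stdlib.h>
    #include "munit.h"

    /* Every value listed here is tried automatically; with several
     * parameters, every combination is tried. */
    static char* level_values[] = { "1", "3", "6", "9", NULL };

    static MunitParameterEnum level_params[] = {
      { "level", level_values },
      { NULL, NULL }
    };

    static MunitResult
    test_levels(const MunitParameter params[], void* user_data) {
      (void) user_data;
      int level = atoi(munit_parameters_get(params, "level"));

      /* ...compress something at `level` and verify the round trip here... */
      munit_assert_int(level, >=, 1);
      return MUNIT_OK;
    }

    static MunitTest tests[] = {
      { "/levels", test_levels, NULL, NULL, MUNIT_TEST_OPTION_NONE, level_params },
      { NULL, NULL, NULL, NULL, MUNIT_TEST_OPTION_NONE, NULL }
    };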

    Of course, it also includes everything else you probably expect from a testing framework, including nested test suites, loads of handy assertion macros (including memory comparison, which is very handy for data compression), and a rather nice CLI.

    Anyways, if you're writing compression code, it might be worth taking a look. Unit tests are always nice, and µnit makes writing them relatively painless. If you have any questions/comments/hate mail let me know; it's always nice to get feedback, especially criticism.

  2. The Following 3 Users Say Thank You to nemequ For This Useful Post:

    Bulat Ziganshin (14th June 2016),Cyan (14th June 2016),jibz (14th June 2016)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    As for producing random _data_, I'd suggest adding something like http://encode.ru/threads/305-Searchi...ull=1#post5643
    (reuploaded to http://nishi.dreamhosters.com/u/lzgen_v0.rar),
    because it's not exactly trivial to provide random content for testing compression algorithms - a simple sequence of random characters won't be of much use.
    Alternatively, you can consider Mahoney's data generator from http://mattmahoney.net/dc/uiq/
    Also, it's preferable to have generators with computable entropy (i.e. the amount of random bits required to produce the data sample) - it would allow for a better estimation of compression quality.

  4. #3
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    The link to that archive is broken, so I can't really look at how it works, but I don't think trying to use random data to figure out how well a codec compresses is ever going to produce useful results. Codecs all perform very differently on different data, so I don't see why anyone would care about the compression ratio of some generic, low-entropy, randomly generated data. If that's your goal you should probably be working from real-world data. µnit can help with that: just write a test which accepts a file name, then use the parameterization feature to run the test on each piece of data (see the sketch below). Dump some test data in your tree and you're good to go.
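
    Something like this, roughly (the paths are just examples, and as above the exact µnit spellings should be checked against the docs):

    Code:
    #include <stdio.h>
    #include "munit.h"

    static char* corpus_files[] = {
      "test/data/alice29.txt",
      "test/data/kernel.tar",
      "test/data/access.log",
      NULL
    };

    static MunitParameterEnum corpus_params[] = {
      { "file", corpus_files },
      { NULL, NULL }
    };

    static MunitResult
    test_corpus(const MunitParameter params[], void* user_data) {
      (void) user_data;
      const char* path = munit_parameters_get(params, "file");

      FILE* f = fopen(path, "rb");
      munit_assert_not_null(f);
      /* ...read the file, round-trip it through the codec, compare... */
      fclose(f);
      return MUNIT_OK;
    }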

    That said, compressing (and decompressing) random data is certainly useful. I've found lots of bugs in many libraries by compressing random data, decompressing random data, and decompressing compressed random data. For example, one common issue with random data is programmers getting the maximum compressed size function wrong (think compressBound in zlib; most codecs provide something similar). This tends to show up especially often when the input size is right around a power of two.
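
    In zlib terms the test is roughly this (just a sketch, not what Squash actually does, and the same caveat about exact µnit names applies): pick a size range that straddles a power of two and the compressBound() bugs tend to fall out.

    Code:
    #include <stdint.h>
    #include <stdlib.h>
    #include <zlib.h>
    #include "munit.h"

    static MunitResult
    test_round_trip_random(const MunitParameter params[], void* user_data) {
      (void) params; (void) user_data;

      /* A random size deliberately close to a power of two. */
      uLong src_len = (uLong) munit_rand_int_range(65536 - 64, 65536 + 64);
      uint8_t* src = malloc(src_len);
      munit_assert_not_null(src);
      munit_rand_memory(src_len, src);

      /* If compressBound() is wrong, this either fails cleanly or scribbles
       * out of bounds - which ASan/valgrind will catch. */
      uLongf dst_len = compressBound(src_len);
      uint8_t* dst = malloc(dst_len);
      munit_assert_not_null(dst);
      munit_assert_int(compress2(dst, &dst_len, src, src_len, 6), ==, Z_OK);

      /* Decompress and make sure we got the original back. */
      uLongf out_len = src_len;
      uint8_t* out = malloc(out_len);
      munit_assert_not_null(out);
      munit_assert_int(uncompress(out, &out_len, dst, dst_len), ==, Z_OK);
      munit_assert_size((size_t) out_len, ==, (size_t) src_len);
      munit_assert_memory_equal(src_len, src, out);

      free(src); free(dst); free(out);
      return MUNIT_OK;
    }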

    Another interesting property of random data is that it's a pathological case for most codecs. If you're dealing with user-supplied data, it's important that worst-case performance is acceptable; otherwise you're opening yourself up to a DoS issue.

    Thanks to using random data in Squash's tests I've also caught data corruption, out-of-bounds memory access, memory corruption, etc. Remember, these are security issues; if all you can do is cause a program to crash, it's a DoS attack waiting to happen, but some issues can also lead to remote code execution vulnerabilities. Running the tests with AddressSanitizer and MemorySanitizer is very helpful here, as is Valgrind if you prefer it. Obviously fuzzing is the best way to go about this, but a unit test may catch an issue much sooner, and in general the sooner you catch a bug the easier it is to fix.

    Another place I've found bugs is compressing a randomly sized block of non-random data (I use a few paragraphs of lorem ipsum in my tests, but anything will work). This is especially effective if you don't place artificial bounds on the range of the PRNG… for example, what happens when you try to compress one or two bytes? Some codecs will (or would; I've reported this issue to multiple libraries, and all but one have fixed it) attempt to access out-of-bounds data. If you happen to be up against a page boundary this can easily lead to a segfault, and if the issue is on the write side, memory corruption is obviously a possibility, too.
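
    A sketch of that, again in zlib terms (the lorem ipsum text is obviously just a placeholder):

    Code:
    #include <stdlib.h>
    #include <zlib.h>
    #include "munit.h"

    static const char lorem[] =
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
      "eiusmod tempor incididunt ut labore et dolore magna aliqua.";

    static MunitResult
    test_tiny_input(const MunitParameter params[], void* user_data) {
      (void) params; (void) user_data;

      /* No artificial lower bound: inputs of 1 or 2 bytes are exactly the
       * cases that have tripped up several codecs. */
      uLong src_len = (uLong) munit_rand_int_range(1, (int) sizeof(lorem) - 1);

      uLongf dst_len = compressBound(src_len);
      unsigned char* dst = malloc(dst_len);
      munit_assert_not_null(dst);
      munit_assert_int(compress2(dst, &dst_len,
                                 (const unsigned char*) lorem, src_len, 6), ==, Z_OK);

      free(dst);
      return MUNIT_OK;
    }

    Run something like that under AddressSanitizer and out-of-bounds reads near the end of the input buffer show up immediately.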

    Finally, a PRNG is extremely useful for tweaking configuration parameters: compression level, block size, dictionary size, degree of parallelism, etc. Exhaustively testing every value often isn't feasible, especially not as part of tests run on every commit in a CI environment. Using random values is a good way to maximize coverage without spending too much time writing or running tests. The cost is that sometimes you won't see an issue until a few commits later, but once you have a failing test case it's usually pretty easy to trace it back to the root problem. With Squash, we run the unit tests in enough different configurations (different operating systems, compilers, build configurations, etc.) that stuff tends to show up pretty quickly. Given the number of CI services available for free to open-source software (Travis, AppVeyor, Drone.io, Snap CI, GitLab CI, etc.), I don't see much of a reason not to do the same.
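
    The pattern is trivial (sketch only; block_size and n_threads here are generic stand-ins, not any particular codec's options):

    Code:
    #include <stddef.h>
    #include "munit.h"

    static MunitResult
    test_random_config(const MunitParameter params[], void* user_data) {
      (void) params; (void) user_data;

      /* Pick configuration values at random each run; the seed µnit reports
       * on failure is enough to reproduce the exact combination. */
      int    level      = munit_rand_int_range(1, 9);
      size_t block_size = (size_t) munit_rand_int_range(1, 64) * 65536;
      int    n_threads  = munit_rand_int_range(1, 8);

      /* ...configure the codec under test with these values and run the
       * usual round-trip check... */
      (void) level; (void) block_size; (void) n_threads;
      return MUNIT_OK;
    }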

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The link to that archive is broken, so I can't really look at how it works,

    Sorry, it's http://nishi.dreamhosters.com/u/lzgen_v0.rar

    > but I don't think trying to use random data to figure out how well a codec
    > compresses is ever going to produce useful results...

    That's my point too - you can't use plain random data for that; it has
    to be data generated using a specific model, like the LZ77 one in my lzgen utility.

    > codecs all perform very differently on different data, so I don't see why
    > anyone would care about the compression ratio of some generic, low-entropy,
    > randomly generated data.

    I think it makes a lot of sense specifically for unit tests.
    It can be a little inconvenient to store GBs of sample data just for testing a codec.
    And some useful measurements, e.g. actual window size, ability to handle >4GB of data, or plain redundancy compared to the amount of random bits used by the model, would require that.

    > I've found lots of bugs in many libraries by compressing random data,

    Yes, that's one possible test case, but for other ones you'd have to provide
    data generated with a more complex model than a plain sequence of random bytes.

  6. #5
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    As for producing random _data_, I'd suggest adding something like http://encode.ru/threads/305-Searchi...ull=1#post5643
    (reuploaded to http://nishi.dreamhosters.com/u/lzgen_v0.rar),
    because it's not exactly trivial to provide random content for testing compression algorithms - a simple sequence of random characters won't be of much use.
    Alternatively, you can consider Mahoney's data generator from http://mattmahoney.net/dc/uiq/
    Also, it's preferable to have generators with computable entropy (i.e. the amount of random bits required to produce the data sample) - it would allow for a better estimation of compression quality.
    Shelwien, try American Fuzzy Lop.
    It generates bitstreams that match the actual model in the program, not a generic one. Its combination of (relative) simplicity, generality, and sheer effectiveness is truly remarkable.

  7. The Following User Says Thank You to m^2 For This Useful Post:

    Shelwien (14th June 2016)

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    That's certainly interesting, but my idea is, imho, different - it's more about expected compression ratio than about crashes and bugs.

  9. #7
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Shelwien View Post
    That's my point too - you can't use plain random data for that; it has
    to be data generated using a specific model, like the LZ77 one in my lzgen utility.
    I understand that, but my point is that I still don't think there is much use for it. Who cares what compression ratio a codec achieves on that data? That number is only useful to the extent that it matches what you'd see on real data, and it's much simpler and less error-prone to just use real data than to try to create a bunch of programs which generate realistic data. Any result you achieve on synthetic data is, IMHO, dubious at best.

    Quote Originally Posted by Shelwien View Post
    I think it makes a lot of sense specifically for unit tests.
    It can be a little inconvenient to store GBs of sample data just for testing a codec.
    Several years ago, maybe, but data storage and networking are both fast enough and cheap enough these days that I don't think a few gigabytes is a big deal - certainly not worth all the complications synthetic data would bring. If you're actually working on a compression codec you should have plenty of data sitting around, and if you're testing one you should be using your own (real) data.

    Besides, how could you possibly determine success/failure? I imagine it would be some sort of cutoff at a specific ratio, but I can think of so many problems, exceptions, and corner cases… what about tiny files? How far can your generator stray from the model before the results stop being valid? What happens when you optimize for data which is more general than what you would find in the real world, and compression on real data suffers? What if your model simply differs from real-world data?

    Quote Originally Posted by Shelwien View Post
    Yes, that's one possible test case, but for other ones you'd have to provide
    data generated with a more complex model than a plain sequence of random bytes.
    Compressing random data is just one possibility. I mentioned several cases, most of which have already proven to be quite feasible (not to mention useful), and none of which are particularly complicated or require modeling data.

    Quote Originally Posted by Shelwien View Post
    That's certainly interesting, but my idea is, imho, different - it's more about expected compression ratio than about crashes and bugs.
    I think this pretty much sums up our disagreement. You want to use random data (not pure-random, but lzgen-style-random) to measure compression ratio, which isn't a bad idea in principle, but I think it has too many problems that would need to be solved, and the benefit is too small to justify the effort.

    On the other hand, I think pure-random data can be helpful for finding bugs in codecs which impact correctness or worst-case performance, and for keeping a lid on technical debt. It sounds like you agree with this part (which, TBH, wasn't clear to me until your last post)?

Similar Threads

  1. DEFLATE-ing popular Javascript libraries
    By stbrumme in forum Data Compression
    Replies: 1
    Last Post: 30th January 2016, 21:43
  2. Fuzz testing
    By m^2 in forum Data Compression
    Replies: 23
    Last Post: 10th January 2016, 10:01
  3. Archiver benchmarking framework
    By Bulat Ziganshin in forum Data Compression
    Replies: 11
    Last Post: 8th February 2013, 20:46
  4. Standard for compression libraries API
    By Bulat Ziganshin in forum Data Compression
    Replies: 47
    Last Post: 30th March 2009, 06:10
  5. Help beta testing QuickLZ 1.40 with the new test framework
    By Lasse Reinhold in forum Forum Archive
    Replies: 10
    Last Post: 19th April 2008, 16:16
