
Thread: Reducing buffering and copying between buffers

  #1 nemequ (Member)

    Reducing buffering and copying between buffers

    Bulat mentioned buffering in the coroutine thread, and I wanted to discuss it a bit more without derailing that thread, so here's a new one for it. I know this ended up a bit verbose, sorry about that. There is a proposal at the end, but it may not seem significant without the book in the middle for context.

    While working on Squash, one concern that's been gnawing at me is the impact of buffering. With a streaming (zlib-style) interface, the API consumer probably has a decently sized buffer, let's say 1 MiB, that they fill somehow; maybe it's just a single fread(), maybe not, but it's almost certainly a size *they* chose as a reasonable default, and one that probably doesn't change. Maybe this is something which has already occurred to everyone, but I haven't really seen it discussed, so…

    The problem is that the codec probably has its own buffer… let's say they're using bzip2 level 9, so that would be 900 KiB (or maybe 900 KB, can't remember). I'll assume bzip2 doesn't just blindly copy everything into its internal buffer, but instead will read directly from the input as long as it's large enough, so it consumes 900 KiB. Then it copies the remaining 124 KiB into its internal buffer and returns, indicating it has consumed all the input. The caller then reads in another 1 MiB and calls the process function again.

    This time, the codec wants to consume 900 KiB again, but it's not in a contiguous block; there is 124 KiB in its internal buffer, and the other piece is the first 776 KiB of the new data. It needs a contiguous block, so it memcpy()s 776 KiB over to fill up its buffer, then compresses that to the output. Now it has 248 KiB remaining in the input buffer, which isn't enough for another block, so it copies that to its internal buffer and returns again.

    This process is repeated over and over, but as long as the buffer sizes don't match, after the first block you're always going to have a mismatch. Maybe you can help things by moving to a larger input block size so each call processes multiple blocks, but there is still going to be some memcpy()ing. Maybe the codec can handle non-contiguous blocks, so the only memcpy() is the remainder for each invocation of the process function. But these are all optimizations that mask the real problem: buffer size mismatches. Also, if the user-provided output buffer isn't large enough there could be similar problems on the output side.
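
    To make that pattern concrete, here is a minimal sketch of the copy-and-carry loop described above; the names (Codec, codec_process, compress_block) are hypothetical, not bzip2's actual internals:

    #include <string.h>
    #include <stddef.h>

    #define BLOCK_SIZE (900 * 1024)

    typedef struct {
        unsigned char block[BLOCK_SIZE]; /* codec's internal block buffer */
        size_t        held;              /* bytes carried over from the last call */
    } Codec;

    /* Consume as much of `in` as possible; returns bytes consumed. */
    size_t codec_process (Codec* c, const unsigned char* in, size_t in_len) {
        size_t consumed = 0;

        /* Top up a partially filled internal buffer first: this is the
         * memcpy() the buffer-size mismatch forces on every call. */
        if (c->held > 0) {
            size_t need = BLOCK_SIZE - c->held;
            size_t take = need < in_len ? need : in_len;
            memcpy (c->block + c->held, in, take);
            c->held += take;
            consumed = take;
            if (c->held < BLOCK_SIZE)
                return consumed;       /* input exhausted, still buffering */
            /* compress_block (c->block); */
            c->held = 0;
        }

        /* Whole blocks can be read straight from the caller's buffer... */
        while (in_len - consumed >= BLOCK_SIZE) {
            /* compress_block (in + consumed); */
            consumed += BLOCK_SIZE;
        }

        /* ...but the tail (the 248 KiB above) is copied into the internal
         * buffer and carried over to the next call. */
        c->held = in_len - consumed;
        memcpy (c->block, in + consumed, c->held);
        return in_len;
    }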

    What we need is an API to query the codec about how much data it needs for the next operation, and I'm not aware of any codec which provides one. A splicing API, where the codec wants callbacks which look a lot like fread and fwrite, actually solves this problem fairly well, since the codec is requesting a specific amount of data. When converting to a streaming API we can easily just set a field somewhere to that value, and if we're asked what the next read request size is we just use that data. Output is more of a mystery, but providing too much output isn't really a problem, so just provide a buffer large enough for the worst case. Besides, we don't know how much output we truly need until we actually execute the compression.

    It's worth noting that this problem also happens with stdio-like APIs (gzread/gzwrite), since the caller still controls buffer sizes. With the splicing API, however, the codec effectively *becomes* the caller.
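
    As a sketch of that splicing shape (illustrative names only, not Squash's actual plugin interface):

    #include <stdlib.h>
    #include <stddef.h>

    typedef size_t (* read_cb)  (void* user, void* buf, size_t len);
    typedef size_t (* write_cb) (void* user, const void* buf, size_t len);

    /* The codec is effectively the caller here: it pulls exactly one
     * block's worth of input per iteration, so no size mismatch arises. */
    int codec_splice (read_cb rd, write_cb wr, void* user) {
        const size_t block_size = 900 * 1024;
        unsigned char* block = malloc (block_size);
        size_t got;
        int ret = 0;
        if (block == NULL)
            return -1;
        while ((got = rd (user, block, block_size)) > 0) {
            /* ...compress block[0..got) here; pass-through for brevity... */
            if (wr (user, block, got) != got) { ret = -1; break; }
        }
        free (block);
        return ret;
    }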

    For APIs which are natively streaming, I've been thinking about adding a callback or two to Squash which plugins can implement to specify what they want for the next request. Something like:

    /* Minimum suggested allocation sizes, i.e. upper bounds on the values
       next_block_size will ever request: */
    void (* suggested_buffer_size) (SquashCodec* codec, SquashOptions* options, size_t* input_size, size_t* output_size);
    /* Exact sizes the codec wants for the next call on this stream: */
    void (* next_block_size) (SquashStream* stream, size_t* input_size, size_t* output_size);


    The first variant, suggested_buffer_size, would be used for determining the minimum suggested allocation sizes; i.e., the maximum values that next_block_size would ever store in its input_size and output_size params.

    next_block_size, OTOH, is a bit more interesting. For compression, output_size will probably just be max_compressed_size(block_size). input_size may vary based on the contents of the codec's internal buffer, configuration options, etc.

    For decompression, both may vary pretty heavily. I guess input_size would generally either correlate with sizeof(block_header) or be something between 1 and the input block size.

    The idea is that callers can ask the codec how much data they want, and we can eliminate any intermediary buffers. Depending on how much codec authors care about making their API easy to use, they may be able to completely eliminate internal buffers by *requiring* that callers provide buffers of the requested size. Codecs may still have some relatively large structures for state, but I'm not sure how much can be done about that without a priori knowledge of the size of the complete input, which is not possible in the general case for streams.
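
    For illustration, a caller's loop might look something like the fragment below; squash_stream_next_block_size() is a hypothetical wrapper around the proposed callback, and the stream/input_file variables and zlib-style buffer fields are assumed:

    size_t in_size, out_size;

    /* Ask the codec exactly what it wants for the next call. */
    squash_stream_next_block_size (stream, &in_size, &out_size);

    uint8_t* in  = malloc (in_size);   /* exactly what was requested */
    uint8_t* out = malloc (out_size);  /* worst case for this block  */

    stream->next_in   = in;
    stream->avail_in  = fread (in, 1, in_size, input_file);
    stream->next_out  = out;
    stream->avail_out = out_size;

    /* With exact-size buffers the plugin never needs to memcpy() the
     * input into an internal buffer first. */
    squash_stream_process (stream);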

    So, what does everyone think?


  #2 Shelwien (Administrator)
    1. Can I format your wall of text?

    2. I think I can add support for a filter requesting input into its own buffer easily enough - just an extra branch in the handlers.
    But then, I don't really like the idea of letting filters manage large memory allocations on their own at all.
    I don't have this implemented yet, but I have an idea of a frontend collecting allocation requests from the whole filter graph
    and deciding what to allocate in what order, and maybe adjusting the sizes. Such things can be pretty important for 32-bit systems. (A rough sketch of this follows the list below.)

    Same with buffering. What you described, with unaligned inputs, would be a common case for blockwise coders.
    So wouldn't it be better to just move it to the library side completely?
    I mean, things like maintaining a sliding window?

    3. There's the MT issue. If a coder requests reading data into its own buffer, then the whole system has to wait until the read is complete.
    Reading into an outside buffer, though, can work asynchronously from the coder, and then we'd only wait for the memcpy.
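
    A rough sketch of the frontend idea from point 2; every name here is invented purely for illustration:

    #include <stdlib.h>
    #include <stddef.h>

    typedef struct {
        const char* owner; /* which filter asked        */
        size_t      size;  /* requested bytes           */
        void*       ptr;   /* filled in by the frontend */
    } AllocRequest;

    /* Largest-first: big dictionary buffers claim contiguous address
     * space before the heap fragments, which matters most on 32-bit. */
    static int by_size_desc (const void* a, const void* b) {
        size_t sa = ((const AllocRequest*) a)->size;
        size_t sb = ((const AllocRequest*) b)->size;
        return (sa < sb) - (sa > sb);
    }

    int frontend_allocate (AllocRequest* reqs, size_t n) {
        qsort (reqs, n, sizeof *reqs, by_size_desc);
        for (size_t i = 0; i < n; i++) {
            reqs[i].ptr = malloc (reqs[i].size);
            if (reqs[i].ptr == NULL)
                return -1; /* a real frontend could adjust sizes and retry */
        }
        return 0;
    }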

  #3 Bulat Ziganshin (Programmer)
    nemequ, it seems that you model your decisions on a single codec - bzip2. But other codecs have different requirements:
    * the ppmd API is getbyte/putword, so any buffering is fine as long as you avoid extra copying (or even if you don't - ppmd is pretty slow anyway)
    * lzma and most other LZ compressors have a large contiguous dictionary buffer. In order to avoid extra copying, it should be filled directly. How many bytes are read on each call isn't really important, as long as it's not too small or too large
    * the BCJ codec can modify data in place, or copy an input buffer of any size to an outbuf of the same size, so you can provide a new buffer for each call
    * the delta codec is the same, but it adds a small extra header, slightly inflating the data
    * the rep codec can be considered a sort of LZ

    Overall, I see 3 categories (a rough interface sketch follows the list):
    - block-sorting codecs need an inbuf of a specified size and either a single outbuf of max_inflation(insize) bytes or an arbitrary sequence of smaller outbufs (e.g. BSC can work only with a large outbuf, since it fills the outbuf at multiple points in parallel)
    - LZ codecs can provide parts of their dictionary buffer as outbufs to preceding codecs
    - other codecs can work with arbitrary buffers, although some of them (like delta) prefer large enough buffers (I use 8 MB) to improve compression ratio
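
    One way a framework could expose those categories (hypothetical names, not an existing API):

    typedef enum {
        CODEC_BLOCKWISE,  /* needs an inbuf of a specific size (bzip2, BSC)  */
        CODEC_LZ_WINDOW,  /* can hand out parts of its dictionary as outbufs */
        CODEC_ANY_BUFFER  /* BCJ, delta, ...: any, preferably large, buffers */
    } CodecBufferClass;

    typedef struct {
        CodecBufferClass klass;
        size_t preferred_in;                       /* e.g. the block size, or 8 MB for delta */
        size_t (* max_inflation) (size_t in_size); /* worst-case output for a given input    */
    } CodecBufferInfo;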

    Also, you aren't considering a multibuffer queue, which allows mitigating irregularities in the processing speeds of the producer, compressor and consumer.

    I think we should optimize for the fastest codecs, since they really suffer from extra copying and unwanted synchronization. E.g. fa'next in -m2 mode has a decompression speed of less than 1 GB/s, while every codec in the pipe is capable of 2+ GB/s. I think that's a real problem, whereas bzip2 is too slow anyway.


    Quote Originally Posted by Shelwien View Post
    3. There's the MT issue. If a coder requests reading data into its own buffer, then the whole system has to wait until the read is complete. Reading into an outside buffer, though, can work asynchronously from the coder, and then we'd only wait for the memcpy.
    I don't think so. E.g. for LZ with a 100 MB dictionary we can alloc a 110 MB buffer. In this case, we can process data in 1 MB chunks, use the preceding 100 MB as dictionary, and have the following 9 MB available for asynchronous reading. (Sketched below.)
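
    A sketch of that layout, following the numbers in the post; how the window slides once the buffer fills up is elided, and all names are illustrative:

    #include <stddef.h>

    #define MB(n)      ((size_t)(n) << 20)
    #define DICT_SIZE  MB(100) /* dictionary                    */
    #define CHUNK_SIZE MB(1)   /* unit of work being compressed */
    #define AHEAD_SIZE MB(9)   /* landing area for async reads  */

    /* One 110 MB allocation. While the chunk at `pos` is compressed,
     * with everything before it serving as dictionary, a reader thread
     * can fill [pos + CHUNK_SIZE, pos + CHUNK_SIZE + AHEAD_SIZE). */
    typedef struct {
        unsigned char* buf; /* DICT_SIZE + CHUNK_SIZE + AHEAD_SIZE bytes */
        size_t         pos; /* start of the chunk being compressed       */
    } LzWindow;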

    And for other algorithms, the coder just needs to own multiple inbufs and multiple outbufs. I've described the full scheme here


    Quote Originally Posted by Shelwien View Post
    But then, I don't really like the idea of letting filters manage large memory allocations on their own at all. I don't have this implemented yet, but I have an idea of a frontend collecting allocation requests from the whole filter graph and deciding what to allocate in what order, and maybe adjusting the sizes. Such things can be pretty important for 32-bit systems.
    By the time you finally implement that, the number of XP users trying to use huge dictionaries will drop below zero. I think it isn't worth even designing, much less actually developing.

  #4 jibz (Member)
    It seems to me that if you are going to process a large file in 1 MB blocks, then a streaming interface is perhaps the wrong choice. Streaming is for when you continuously receive (or generate) data, and do not necessarily know how much you will have available each time, or the total size. You should probably be using a memory-to-memory interface instead.

    Now that is not always possible of course, and then, ideally, it would be nice to be able to query the internal block size of the streaming interface, and perhaps even be able to access and fill that buffer directly.
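
    Something like the following hypothetical calls; this is just the shape of the idea, not any existing library's API, and stream/input_file are assumed:

    size_t want;
    void*  dst = stream_acquire_input (stream, &want); /* borrow the internal buffer */
    size_t got = fread (dst, 1, want, input_file);     /* fill it in place           */
    stream_commit_input (stream, got);                 /* hand it back; no memcpy    */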

  #5 Bulat Ziganshin (Programmer)
    How would you compress with a memory-to-memory interface using lzma with a 1 GB dictionary?

  #6 jibz (Member)
    I should add that my remark assumed your use case is so focused on speed that copying 700k of memory matters.

  #7 Bulat Ziganshin (Programmer)
    You are right - I already have a basic compression API in FreeArc, so I'm looking into improving it. The main performance problem is the lack of parallelism, but once it's solved, -m2 decompression will include 4 methods, each performing at 2 GB/s in its own thread. And 4 extra memcpy operations may have a significant impact at such speeds. Plus, direct buffer I/O will reduce memory requirements. So I think it would be great to implement this for all methods capable of 500+ MB/s compression or decompression.

  #8 Member (San Diego)
    Quote Originally Posted by nemequ View Post
    This time, the codec wants to consume 900 KiB again, but it's not in a contiguous block; there is 124 KiB in its internal buffer, and the other piece is the first 776 KiB of the new data. It needs a contiguous block, so it memcpy()s 776 KiB over to fill up its buffer, then compresses that to the output. Now it has 248 KiB remaining in the input buffer, which isn't enough for another block, so it copies that to its internal buffer and returns again.

    It seems to me that, assuming you have some idea what the consumer does, you can adapt to work with it without creating a new API.

    If you expect the consumer to have its own internal buffer, then you should buffer as little as possible (because it's redundant), and you can make your buffer big enough to hold one chunk from the provider. You can make the consumer take a small chunk if it touches the end.

    If the consumer doesn't have its own internal buffer, then you can take buffering hints from the amount it takes in one iteration. If it asks for 900k, then give it 900k from then on (assuming the producer is flexible), as in the sketch below.
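
    A sketch of that hinting loop (names invented; carrying over the unconsumed remainder is elided):

    size_t hint = 1 << 20; /* start with a 1 MiB default */
    for (;;) {
        size_t got = producer_read (buf, hint);    /* fill buf      */
        if (got == 0)
            break;
        size_t used = consumer_process (buf, got); /* one iteration */
        if (used > 0 && used < got)
            hint = used; /* it took 900k? offer 900k from then on */
        /* the (got - used) leftover would be carried over here */
    }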

    If the consumer does inefficient things and it doesn't provide any explicit way to prevent this, then it's not very well designed. Don't you agree?

