
Thread: Coroutine class implementation

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    Coroutine class implementation

    >> Btw, what do you think about my coroutine implementation - https://github.com/Shelwien/stegdict...Lib/coro2b.inc

    > FWIW, I've been planning on adding libcoro to Squash for some stuff, it might be interesting to compare.
    > Also, Microsoft has an interesting fiber API which could make for a nice backend…

    No, the MS fiber API is not portable, and it's rather slow.
    This one is portable (it relies on setjmp.h only) and much faster.
    I also have a memcpy-less implementation, where the only task-switch overhead is the jmp instruction.
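
    For reference, the bare switching primitive that setjmp.h gives you is roughly this (a minimal sketch, not the actual coro2b.inc code - the real class also has to set up the coroutine entry and keep the coroutine's stack contents valid across switches, which coro2b does by saving/restoring them with memcpy):
    Code:
    #include <setjmp.h>

    struct CoroSwitchSketch {
      jmp_buf caller_ctx;  // where the frontend resumes
      jmp_buf coro_ctx;    // where the coroutine resumes

      // called inside the coroutine body: give control back to the frontend
      void yield() {
        if( setjmp(coro_ctx)==0 ) longjmp(caller_ctx, 1);
        // execution continues here on the next call_coro()
      }

      // called by the frontend: re-enter the coroutine
      // (coro_ctx must already have been established by the coroutine entry code,
      //  and the coroutine's stack must still be valid - that part is omitted here)
      void call_coro() {
        if( setjmp(caller_ctx)==0 ) longjmp(coro_ctx, 1);
        // the coroutine yielded, we're back in the frontend
      }
    };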

    Edit: also there were these posts:
    http://stackoverflow.com/a/4352706/395609
    http://stackoverflow.com/questions/3...ne-demo-source

  2. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    nemequ (15th February 2017),RamiroCruzo (15th February 2017)

  3. #2
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Shelwien View Post
    No, MS fiber api is not portable,
    Is yours? It looks Intel-only. Or do you just need to switch in the commented-out code to use the system setjmp? If that's the case, you should probably use some ifdefs instead of requiring people to change the code…

    Pre-defined Compiler Macros is an *excellent* resource for this type of thing (see the "Architectures" section).
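
    Something along these lines is what I mean (the CORO_* names are hypothetical; the macros being tested are the standard predefined ones from that list):
    Code:
    /* Hypothetical CORO_* names; the tested macros are the usual
       compiler-defined architecture macros. */
    #if defined(__x86_64__) || defined(_M_X64)
    #  define CORO_ARCH_X86_64 1
    #elif defined(__i386__) || defined(_M_IX86)
    #  define CORO_ARCH_X86 1
    #elif defined(__aarch64__) || defined(_M_ARM64)
    #  define CORO_ARCH_AARCH64 1
    #elif defined(__arm__) || defined(_M_ARM)
    #  define CORO_ARCH_ARM 1
    #else
    #  define CORO_FALLBACK_SETJMP 1  /* plain setjmp.h path */
    #endif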

    Obviously it's at least portable to other operating systems, which is a *huge* improvement.

    Quote Originally Posted by Shelwien View Post
    and rather slow.
    Ah, I didn't know that. I just remember skimming the API a few years ago and thinking it seemed relatively nice.

    Do you have any idea how your code compares to libcoro? I know libcoro is quite portable, so I've been leaning towards that instead of Squash's existing thread-based approach to some stuff (for codecs which have a stdio-style API).

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Is yours? It looks Intel-only. Or do you just need to switch the commented out code in to use the system setjmp?

    Yes, it's mine, and it's not Intel-only. I tested a coder based on it on ARM (iPhone) and it worked.
    The main difference is that it only relies on documented behavior of setjmp.h - no asm, no modifying SP/IP
    in the setjmp structure (like libcoro does).
    The asm files are mainly used for a speed boost.

    > If that's the case, you should probably use some ifdefs instead of requiring people to change the code…

    True, but I'm still tweaking stuff there.
    For example, newer VS needs some pragmas to disable stack checks for my asm to work.

    ...
    Ok, made a test - http://nishi.dreamhosters.com/u/coro_test_v0.rar
    These are TSC timings (best of 5) for reading the contents of 10Mb array with coroutine switch on each byte.

    Code:
                      x86/IntelC17   x64/IntelC17  x86/gcc61mingw x64/gcc61mingw x86/clang40    x64/clang40    
    coro1:            356066391 clk  388800909 clk  526458321 clk  391349463 clk  366917424 clk  418888021 clk
    coro1/setjmp.h:  2548925862 clk 2140649242 clk 2754041659 clk ----//---- clk 2721752679 clk 2103642626 clk
    coro2b:          1568539035 clk 1753295481 clk 1815209178 clk 1860441816 clk 2083372884 clk 1896192873 clk
    coro2b/setjmp.h: 2957554695 clk 2835614988 clk 3603872111 clk ----//---- clk 3733222815 clk 2930762538 clk
    coro_lc:         2465117040 clk 2199127326 clk 2634450642 clk ----//---- clk 2646668352 clk 2065848174 clk
    coro_lc/FIBER:   1323266490 clk 2331741264 clk 1397835567 clk 2284588336 clk 1499312910 clk 2499409457 clk
    1) setjmp seems to be broken in the x64 version of my gcc/mingw.
    2) clang needed a special fix to make it work with WinAPI fibers.
    3) coro_lc is a coro1 port which uses libcoro instead of setjmp directly.
    4) The main coro1 vs coro2 difference is that coro1 doesn't save the stack - it's necessary to manually allocate each coro1 instance.

  5. The Following User Says Thank You to Shelwien For This Useful Post:

    RamiroCruzo (15th February 2017)

  6. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    20 years ago arjz suffered from the inversion-of-control problem. Since it was DOS-compatible, I was unable to use threads, and C* languages never supported coroutines. So at some desperate moment I started to develop my own coroutine library. Then I realized that I didn't need to use asm, since alloca+longjmp did everything I needed. It was a great moment! When I returned from vacation, I looked into the setjmp help and found that it mentions that setjmp can be used to implement coroutines. It was the Borland C++ 3.1 (1992) help, so the solution was always there - I just didn't know where to look!

    But plain longjmp is a good solution only for pure C. AFAIK, in C++ it may lead to trouble with exceptions/destructors - i.e. the runtime thinks that all stack frames below the current one are out of scope and destroys them.
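
    A tiny illustration of that destructor problem (my own example, not arjz code): longjmp-ing across a frame that holds an object with a nontrivial destructor is undefined behaviour, and in practice the destructor simply never runs:
    Code:
    #include <csetjmp>
    #include <cstdio>
    #include <string>

    static std::jmp_buf ctx;

    static void worker() {
        std::string s(1 << 20, 'x');   // nontrivial destructor, owns heap memory
        std::longjmp(ctx, 1);          // jumps out without unwinding, ~string() is skipped
    }

    int main() {
        if (setjmp(ctx) == 0)
            worker();                  // never returns normally
        std::puts("back in main; worker()'s string was never destroyed");
        return 0;
    }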

    Boost contains 4 libraries, from the lowest level (just portable stack switching) up to the thread-emulation level:
    * Context - (C++11) Context switching library.
    * Coroutine (deprecated) - Coroutine library.
    * Coroutine2 - (C++11) Coroutine library.
    * Fiber - (C++11) Userland threads library.

    so it's hard to justify the use of other libs with C++. In particular, if you use coros as poor man's threads, you may prefer to use fibers communicating via channels.
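
    For reference, a minimal Boost.Coroutine2 usage sketch (my illustration, not code from either codebase):
    Code:
    #include <boost/coroutine2/all.hpp>
    #include <iostream>

    int main() {
        using coro_t = boost::coroutines2::coroutine<int>;

        // the lambda is the coroutine body; sink(i) yields a value to the caller
        coro_t::pull_type source([](coro_t::push_type& sink) {
            for (int i = 0; i < 3; ++i)
                sink(i);
        });

        // each iteration resumes the coroutine until it finishes
        for (int v : source)
            std::cout << v << '\n';
    }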
    Last edited by Bulat Ziganshin; 15th February 2017 at 15:35.

  7. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    That said, I don't see why coro speed is important to you - compression libs can process data in 64-1024K chunks, so the context switch time doesn't need to be that low.

    Moreover, I think that running multiple compression algorithms in a single thread seriously reduces performance. FreeArc already suffers from this problem - while it has a separate thread for each algorithm, the I/O model employed forces them to run sequentially. Let's see - if algo1 "writes" 1000 bytes and algo2 "reads" 999 bytes, then algo1 has to wait until algo2 finishes processing those 999 bytes before the single remaining byte is flushed and algo1 can continue its processing. Since most algos at least slightly inflate or deflate the data, proper synchronization (where algo2 reads exactly the same amount of data that algo1 writes) is almost impossible.

    So the proper solution is to make the data exchange asynchronous. One possibility is to use more threads - e.g. the first thread can fill input buffers, sending them to the computation thread, while the last thread can receive filled output buffers from the computation thread and send the data to algo2 using the existing synchronous API. But this requires 1 or 2 extra threads per algo just for data exchange.



    A more efficient approach is to support asynchronous I/O directly in the API. If the compression algo allocates a fixed number of input buffers and a fixed number of output buffers when compression begins, it can work like this:
    Code:
    for(i=1..N)
      buf[i] = malloc(size)
      outbuf[i] = malloc(size)
      AsyncReadStart(buf[i])
    
    while(!EOF)
      WaitAsyncReadDone(buf[i])
      if (AllWriteBufsAreFilled)  WaitAsyncWriteDone(outbuf[i])
    
      ProcessData(buf[i],outbuf[i])
    
      AsyncReadStart(buf[i])
      AsyncWriteStart(outbuf[i])
      if (i==N)  AllWriteBufsAreFilled=true
      i = i%N+1
    where ProcessData implements what is called a streaming API in Squash - it processes data in chunks, stopping when there isn't enough input data or output space. Ideally, it should allow reading data directly into the algorithm's input buffer and writing data directly from the algorithm's output buffer in order to avoid extra copying.



    This approach can be further improved. There are two possible situations:
    1) InBuf==OutBuf. Some algos like delta and BCJ don't need a separate outbuf and can write output data directly into the input buffer.
    2) In the remaining cases, allocating separate outbuffers isn't efficient - we spend extra memory and run an extra memcpy. Instead, it would be better to write directly into the input buffer of the next algo.

    So we arrive at the idea that each algo can either own its input buffers and write compressed data into the input buffers of the next algo, or the opposite - own its output buffers but read data directly from the outbufs of the previous algo. In the compression world, it seems the former is better for the compression stage (since LZ algos with a large dictionary require that the dictionary occupy a large contiguous memory block, and the dictionary serves as multiple input buffers), while the latter is better for the decompression stage (for exactly the same reason - now the dictionary serves as multiple output buffers).

    So, we don't allocate outbufs, instead requesting them directly from the next algo:
    Code:
    for(i=1..N)
      buf[i] = malloc(size)
      AsyncReadStart(buf[i])
    
    while(!EOF)
      WaitAsyncReadDone(buf[i])
      outbuf = RequestOutBuf()
    
      ProcessData(buf[i],outbuf)
    
      AsyncReadStart(buf[i])
      OutBufFilled(outbuf)
      i = i%N+1
    and at the decompression stage:
    Code:
    for(i=1..N)
      outbuf[i] = malloc(size)
    
    while(!EOF)
      buf = RequestInputBuf()
      if (AllWriteBufsAreFilled)  WaitAsyncWriteDone(outbuf[i])
    
      ProcessData(buf,outbuf[i])
    
      InputBufConsumed(buf)
      AsyncWriteStart(outbuf[i])
      if (i==N)  AllWriteBufsAreFilled=true
      i = i%N+1
    This leaves one problem on the table - should ProcessData be able to work with any reasonable size of inbuf (for decompression) or outbuf (for compression)? Or should auxiliary code provide extra buffers between algo1 and algo2 when an incompatibility is detected?

    Also, I haven't covered multi-output algos, although at first sight there is nothing specific to them aside from even more complicated logic.

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > i looked into setjmp help and found that it mentions that setjmp can be used to implement coroutines.

    I also "always" knew about coroutines from Knuth and the like, and didn't care much.
    But turns out, coroutines are really useful as codec interfaces.
    I/o is done by frontend, processing in coroutine - perfect abstraction.
    And yes, I also started with callbacks, as everybody, then moved to i/o class as template param -
    because of performance issues.
    And then it started to be annoying, like having to recompile the whole source each time,
    because template parameter is basically a part of class name.
    After that, I ended up using thread switch to handle buffer replacement for a codec, tried
    to implement it with less overhead, and got this coroutine class.

    > , in C++ it may lead to troubles with exceptions/destructors -
    > i.e. runtime thinks that all stack frames below the current one are out of scope and destroys them.

    My implementation is actually compatible with exceptions.
    I even just tested this: http://pastebin.com/aLrn4QSN
    And it prints the following:
    Code:
    Calling coroutine.
    Entered coroutine.
    (coro) Entered try block.
    (coro) Caught exception! x=666
    Caught exception! x=42
    Out of try block.
    > Boost contains 4 libraries from lowest level (just portable stack switching) to the thread-emulation level:

    Just looked at that again.
    Well, it's C++11-only, the actual task switch is done with asm, and it depends on something like 100 files,
    so I won't even add it to the benchmark.

    It's obvious that code using coroutines can have problems on some platforms/compilers.
    So it's much better to have a single 3k file that implements it, rather than
    half of the Boost framework with mind-blowing code full of template tricks.

    And anyway, it's hard to switch contexts faster than a single indirect jmp.

    > That said, i don't see why coro speed is important to you -
    > compression libs can process data in 64-1024K chunks so context switch time don't need to be that low.

    That's true in most cases, but reflate has 4-5 coroutines even at one nesting level.

    And there's pjpg, where switching is done per byte - the outer coroutine handles FF masking, tags,
    tag lengths etc, and the inner one just parses the actual structures.

    But yes, I'm more concerned about boundary checking on I/O.
    It was improved a lot by __builtin_expect, but I'd like to find an even better way somehow.

    Well, I already tested this coroutine class with hardware breakpoints for bounds checking.
    And it wasn't any faster, because of the exception handling overhead... also not portable,
    and there are only 4 hardware breakpoints, which can support 2 coroutines max.

    But I still want to try guard pages or maybe redirecting int1 to local handler.

    > Moreover, i think that running multiple compressioon algorithms in the
    > single thread seriously reduces the perfromance.

    Sure, but it's good to have such an option.
    Also, it's much better to test MT as coroutines first, to see that the main logic works correctly,
    before all the race conditions start appearing.

    In addition, I'm actually using coroutines as thread interfaces too - the thread driver
    does the actual syncing and queue handling and calls a processing coroutine which
    works as usual.


    > So, we go to idea that each algo can either own its input buffers and write
    > compressed data into input buffers of the next algo, or opposite

    My interfaces are different, I guess.
    These coroutines don't have their own buffers - both input and output buffers
    have to be added by the caller, and replaced when requested.

    And for MT, there are buffer queues. So a thread_pipe wrapper waits for a full input buffer
    and a free output buffer, passes them to the coroutine, gets a full output buffer, pushes
    it into the queue, where the next thread_pipe gets it, etc.

    > Also i don't covered multi-output algos, although at the first sight there
    > is nothing specific to them aside of even more complicated logic

    I already support these; there's nothing special aside from some C++ tricks
    like access to the wrapper class without a pointer and an initializer/constructor for it,
    or an equivalent of decltype(*this) for static class members.

    That's only for now though, while I'm still using the 7z framework.
    Data interleaving for multi-output coders is still tricky - even 7z has bugs in that.
    It's still always possible to get a stream which is only written at the very end,
    but has to be available for reading right at the start (e.g. because of rangecoder init).

  9. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    > , in C++ it may lead to troubles with exceptions/destructors -
    > i.e. runtime thinks that all stack frames below the current one are out of scope and destroys them.
    My implementation is actually compatible with exceptions.
    I even just tested this: http://pastebin.com/aLrn4QSN
    In order to test that, you need to start two coroutines and raise an exception in the first one.

    But actually I don't know how to make your library fail. And there are multiple compilers, some implement multiple exception mechanisms, and x86 and x64 may have different implementations too. So yeah, I leave that to the professionals.

    Also anyway, its harder to switch contexts faster than a single indirect jmp.
    What about saving registers? And what exactly should the set of registers to save be?

    And for MT, there're buffer queues. So thread_pipe wrapper waits for full input buffer
    and free output buffer, passes them to coroutine, gets full output buffer, pushes
    it into the queue, where next thread_pipe gets it, etc.
    Why a coroutine? You can replace it with a simple routine that waits for an input/output buffer when it needs one. I think you just love coroutines and use them even when they aren't really required.

    Sure, but its good to have such an option.
    Also, its much better to test MT as coroutines first, to see that main logic works correctly,
    before all the race conditions start appearing.
    To have this as an option, you need the same or a similar API for both cases, i.e. use something like Boost.Fiber.
    If you use proper abstractions such as queues, m/t debugging should be not much more complex than s/t.

    My interfaces are different I guess.
    These coroutines don't have their own buffers - both input and output buffers
    have to be added by caller, and replaced when requested.
    It seems inefficient, even compared to the current FreeArc implementation. E.g. my lzma implementation reads directly into the dictionary buffer, while you alloc and memcpy an extra buffer.

  10. #8
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    FWIW, my primary use case is to support codecs like CRUSH and ZPAQ, which basically expect callbacks that look a lot like fread/fwrite (in CRUSH's case it just *uses* fread/fwrite, but that's easy to patch). I need to allow people using zlib-style APIs to push data as needed. It's all pretty standard stuff; just need to provide callbacks which look like they block (as fread/fwrite do) when there is insufficient I/O available, but in reality control returns to the caller until they provide more memory.
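
    To make that concrete, the shape I have in mind is roughly this (a sketch with made-up names like CoroIO/yield_to_caller, not actual Squash code):
    Code:
    #include <cstddef>
    #include <cstring>

    // State shared between the user-facing (zlib-style) side and the codec side.
    struct CoroIO {
        const unsigned char* in = nullptr;  // caller-provided input
        size_t in_avail = 0;
        bool eof = false;
        void yield_to_caller();             // suspend the codec, resume the user's call
    };

    // Handed to codecs like CRUSH/ZPAQ in place of fread: it looks blocking to
    // the codec, but actually yields back to the caller when input runs out.
    static size_t coro_fread(void* dst, size_t len, CoroIO* io) {
        size_t done = 0;
        while (done < len) {
            if (io->in_avail == 0) {
                if (io->eof) break;         // no more data will ever arrive
                io->yield_to_caller();      // "block" until the user pushes more input
                continue;
            }
            size_t n = io->in_avail < len - done ? io->in_avail : len - done;
            std::memcpy((unsigned char*)dst + done, io->in, n);
            io->in += n; io->in_avail -= n; done += n;
        }
        return done;
    }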

    Currently we're using threads in Squash for this (via TinyCThread). My desire to use coroutines is less about the overhead from a context switch and more about cache locality. The application side loads data into a buffer, so there is a good chance that at least some of it is cached, then sends it to the codec. With threads, the codec is running on a separate thread, quite possibly on a different core, so there are probably going to be some extra cache misses.

    This is one of those things that lets codec authors provide a very simple API and have Squash take care of the hard parts for them. Obviously the highest performance option would be to simply have the codec implement a zlib-style interface, but that's a lot harder to do than a stdio-style interface. I'd like to close the performance gap a bit if possible to help codecs with less developer resources (think "researchers and tinkerers", as nburns mentioned in the framework thread).

  11. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > in order to test that, you need to start two coroutines and raise exception in the first one

    Uh, no.
    I'm pretty sure that last active try block would get the exception.
    Of course, I can only say anything in context of my setjmp replacements -
    I know that setjmp can be sometimes broken in that sense.

    And in any case, I don't use C++ exceptions - they add too much bloat.

    >> Also anyway, its harder to switch contexts faster than a single indirect jmp.
    > what about saving registers? and what should be exact set of registers to save?

    Well, with setjmp.h, setjmp takes care of saving/restoring registers.
    But I was talking about my asm version.
    For gcc/intelc it uses the gcc inline asm syntax that declares that all registers
    may be modified by the asm block, so the compiler saves everything on its own.
    Which is more efficient than the setjmp.h way, because the compiler doesn't really
    need to save anything in most cases - it just knows where to read it again.
    As for the MS asm version, all registers are always saved around it.
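
    A rough illustration of what I mean (my sketch, not the actual coro asm - the real switch also swaps the stack pointer): declaring the call-saved registers and memory as clobbered lets the compiler decide what actually needs to be spilled:
    Code:
    // x86-64 example; rbp is left out of the clobber list because gcc
    // refuses to let the frame pointer be clobbered.
    static void* context_hop(void* arg) {
      __asm__ __volatile__(
        ""                 // the real switch would save/load rsp and jmp here
        : "+r"(arg)        // value carried across the "switch"
        :
        : "rbx", "r12", "r13", "r14", "r15", "memory", "cc"
      );
      return arg;
    }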

    > why coroutine? you can replace it by simple routine that waits for
    > input/output buffer when it needs it.

    Yes, a callback.
    My coders are always classes, so the callback would likely be another object,
    which would have to be given a pointer to the processing object, and defined
    strictly after the processing class. But wait, the processing class is
    actually a template instance, which only appears in the middle of some other random class...
    No thanks, I've had enough of that.

    That is, it sure is possible to implement everything with callbacks,
    but it's too much work to maintain with my coding style (templates etc).
    Passing I/O methods as template params is still easier, but incompatible
    with C-style lib headers, and requires compiling everything in one module.

    > i think that you just love coroutines and use them when they aren't really required

    No, as I tried to explain, it is a result of a fairly long evolution of i/o interfaces.
    And I don't like all coroutines, just my specific implementation.

    > if you use proper abstractions such as queues, m/t debugging should be not much more complex than s/t one

    But I don't need mutexes, winapi and whatever else in my coders.
    So they're plain coroutines, while MT stuff is all encapsulated in MT components.
    I mean, I don't need to define macros to enable/disable MT or anything else -
    there're no "potentially MT" components in processing coroutines.

    > it seems inefficient, even compared to the current freearc implementation.
    > f.e. my lzma implementation reads directly into dictionary buffer while you
    > alloc and memcpy extra buffer.

    Uh, no, as I said, there are buffer queues,
    so one thread reads and writes one pair of buffers, while other threads
    work with other buffers. Nothing is copied anywhere and async work is
    perfectly possible.
    I mean, why can't I add blocks of the lzma dictionary buffer to the output buffer
    queue for the reader thread?

    But well, there is some inefficiency unfortunately - for example,
    rep and lzma would have their own window buffers, mostly with the same data.
    I was actually thinking of forcing the next versions to work with arrays
    of small overlapping buffers (e.g. 64k-273 bytes of actual data per buffer)
    instead of a large contiguous window.
    It would solve some other issues too, like shifting the window and working
    with fragmented memory on 32-bit. And it would let me avoid some data
    copying in cases where such a block can be processed in place.

  12. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    nemequ, look at it from the other side - you can have either 3-4 threads performing reading, hashing, compression and writing simultaneously, or just one thread performing all of that in sequence. Which is faster?

    Now let's look - if you have one 1 MB inbuf, one 1 MB outbuf, and the algorithm itself uses no more than 1 MB, you can be sure that they all fit inside an 8 MB CPU cache. But if you have three inbufs and two outbufs, you can be sure too. So with m/t you definitely need more buffers; they MAY overflow the CPU cache, but you are SURE that I/O and computation run in parallel.

    That said, you may even try async I/O, but AFAIR it has severe limitations both on Windows and Linux. And it still needs extra buffers; you just avoid having extra threads stupidly waiting on I/O calls.

    Do you know that most modern CPUs have an LLC (last-level cache) shared among all cores? For sequential reads it runs at >100 GB/s, so it is hard to find an algorithm that will be limited by its speed (data is prefetched when read sequentially, so cache latency isn't relevant here).

    IMHO, a read/write-style API is the most relevant one, once we add support for asynchronous "I/O" and for requesting outbuffers (inbuffers for decompression) from the next algo in the chain. This way all algorithms work simultaneously, data is never memcopied and we use a minimal number of extra buffers.
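
    A hypothetical C++ sketch of what such a stage interface could look like (the names are borrowed from the pseudocode above; everything else is illustrative, not an actual framework):
    Code:
    #include <cstddef>

    // Each algorithm in the chain talks to its neighbours through this interface:
    // output buffers are requested from the NEXT stage, so compressed data lands
    // directly where that stage wants it, with no intermediate memcpy.
    struct Stage {
        virtual void*  RequestOutBuf(size_t size) = 0;             // borrow a buffer from the next algo
        virtual void   OutBufFilled(void* buf, size_t filled) = 0; // hand the filled buffer over
        virtual void   InputBufConsumed(void* buf) = 0;            // return an input buffer to its owner
        virtual ~Stage() {}
    };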



    Just an example of how it would work with a typical FreeArc pipe: rep+exe+delta+4x4:zstd

    Prior to the pipe there are fread and hashing threads. So the fread thread requests a buffer from the hashing thread, and the hashing thread re-requests this buffer from the REP thread. REP splits its 512+4 MB buffer into 1 MB segments and on each request returns the next 1 MB. Of those extra 4 MB, one MB is used to ensure that we always have the full 512 MB of history, and the 3 remaining MBs are returned as buffers for the preceding threads' requests.

    So fread fills each buffer with data and sends it to the hashing thread, which hashes the data. Once hashing is done, the buffer is made ready for REP processing, and as REP processes the data, it builds the LZ77 index of this MB. Once that's done, REP requests a buffer from the next thread, copies the non-LZed data from its 1 MB inbuf into the outbuf and signals that it's done.

    The EXE thread owns several buffers of its own. Once the next buffer is filled, it processes it in place, requests a buffer from the next thread and copies the data into this buffer. The DELTA thread does the same.

    Finally, 4x4 owns multiple buffers; once the next buffer is filled, it requests an outbuf from the Writer thread and spawns a ZSTD memory-to-memory compression job. And the Writer job writes its own buffers to disk once they are filled by 4x4.

    It looks like the main problem is that sometimes the producing algorithm knows better how many buffers should be allocated and how large they should be. Maybe the two algorithms could reach a consensus by allocating max(algo1_reqs, algo2_reqs)? In this case algo1 just has to inform algo2 about its needs, and algo2 will arrange that. This complicates the API, though, so people just writing a "newzip" algo will have to deal with somewhat more complex things. OTOH, it may get the status of an advanced feature, so that simple algos automatically get intermediate buffers allocated by the framework (plus extra memcpy operations), while advanced implementations take the previous algo's requirements into account. Does that look good?
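
    In code the consensus rule itself is trivial - something like this (illustrative only, the struct and names are hypothetical):
    Code:
    #include <algorithm>
    #include <cstddef>

    // Buffer requirements each algo reports to the framework.
    struct BufReqs { size_t count; size_t size; };

    // The framework (or algo2) allocates buffers satisfying both sides.
    static BufReqs negotiate(const BufReqs& algo1, const BufReqs& algo2) {
        return { std::max(algo1.count, algo2.count),
                 std::max(algo1.size,  algo2.size) };
    }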

  13. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I'm pretty sure that last active try block would get the exception.
    Of course, I can only say anything in context of my setjmp replacements -
    I know that setjmp can be sometimes broken in that sense.

    And in any case, I don't use C++ exceptions - they add too much bloat.
    Yeah, that's what I was talking about. The Boost libs don't have such limits.

    For gcc/intelc it uses the gcc inline asm syntax that declares that all registers
    may be modified by the asm block, so the compiler saves everything on its own.
    Which is more efficient than the setjmp.h way, because the compiler doesn't really
    need to save anything in most cases - it just knows where to read it again.
    Great. And it seems that is also used by other coroutine libraries.

    > why coroutine? you can replace it by simple routine that waits for
    > input/output buffer when it needs it.

    Yes, a callback.
    My coders are always classes, so the callback would likely be another object,
    which would have to be given a pointer to the processing object, and defined
    strictly after the processing class. But wait, the processing class is
    actually a template instance, which only appears in the middle of some other random class...
    No thanks, I've had enough of that.

    That is, it sure is possible to implement everything with callbacks,
    but it's too much work to maintain with my coding style (templates etc).
    Passing I/O methods as template params is still easier, but incompatible
    with C-style lib headers, and requires compiling everything in one module.
    Either you overcomplicate things or I don't understand your programming style. If you can just hardcode that, it will be a simple routine called very much like fread, executing the very same "wait for buffer ready" code which is now executed by the coroutine caller.

    If you want more flexibility, you can pass this function pointer to your compress() function, i.e. use a callback. But the callback doesn't need access to your class - like fread, it just gets a pointer to a buffer, a size, and a void* userdata to keep all its internal state.


    > if you use proper abstractions such as queues, m/t debugging should be not much more complex than s/t one

    But I don't need mutexes, winapi and whatever else in my coders.
    So they're plain coroutines, while MT stuff is all encapsulated in MT components.
    I mean, I don't need to define macros to enable/disable MT or anything else -
    there're no "potentially MT" components in processing coroutines.
    Me too - all the m/t stuff is implemented by the framework-provided callbacks. So there is no difference, and I get that without coroutines. And as I described, the FreeArc solution can be improved to work with multiple incoming/outgoing buffers and avoid extra memcpy calls - just like yours. And all of that doesn't require coroutines.

    > it seems inefficient, even compared to the current freearc implementation.
    > f.e. my lzma implementation reads directly into dictionary buffer while you
    > alloc and memcpy extra buffer.

    Uh, no, as I said, there're buffer queues.
    so one thread reads and writes one pair of buffers, while other threads
    work with other buffers. Nothing is copied anywhere and async work is
    perfectly possible.
    Let's see: in FreeArc, the lzma algo is active - it knows by itself where its dictionary is and can request that data be read directly into that place. In your code, the lzma algo is passive - it's a coroutine that receives already existing buffers to process. So in order to read directly into the lzma dictionary, the higher-level routine would have to know about this lzma dictionary, how to keep enough valid data in that buffer and so on - so it becomes essentially a part of the lzma algorithm rather than generic code that can work with any algo.

    But well, there is some inefficiency unfortunately - for example,
    rep and lzma would have their own window buffers, mostly with the same data.
    I was actually thinking to force the next versions to work with arrays
    of small overlapping buffers (eg. 64k-273 bytes of actual data per buffer)
    instead of large continuous window.
    It would solve some other issues too, like shifting the window, working
    with fragmented memory on 32-bit. And it would let me avoid some data
    copying in cases when such a block can be processed inplace.
    That's great, and it seems zstd supports such multi-buffers, but they would have to be much larger, since you will lose matches split between two buffers (or overcomplicate the code). So it may be a solution only for areas incompressible by REP.

  14. #12
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    You're thinking about it at a higher level, where you chain together multiple operations, but that's not what I'm talking about. I'm talking about within a single operation. In other words, for your example of a filter that looks like rep+exe+delta+4x4:zstd, you may want to run each one of those in a thread so you can process data concurrently, but what I'm talking about here is *just* the "zstd" operation.

    Imagine someone wants to compress some data with the ZPAQ codec using the zlib-style API. ZPAQ only provides a splicing API (fread/fwrite-like callbacks). The user first reads some data into a buffer, then calls the process operation (think `deflate(stream, Z_NO_FLUSH)`). ZPAQ starts up and calls its fread callback to read that data, and its fwrite callback to write the result to the user-provided output buffer. Eventually, the input is exhausted (or the output fills up, but the concept is the same, so let's keep it simple); from ZPAQ's point of view, the fread callback should block until more data becomes available. But the user is expecting to push data to the codec, not have the codec pull data from it, so the blocking ZPAQ needs is really switching contexts to get back to that deflate() call and return Z_OK (indicating all input was consumed). The user does whatever they want, eventually getting more data they want to compress, which they then push to the compressor by calling deflate() again with the updated buffers. Squash sees this and switches back to the middle of the fread-like callback now that data is available. From ZPAQ's perspective it requested data and got it, from the user's perspective they sent data to the compressor as it became available, and all expectations are fulfilled. Repeat until all data is compressed.

    This can't be parallelized because from Squash's perspective the request is just "here is a blob of data, compress it". The opportunities for parallelization come at a higher level, composed on top of what Squash is already doing. A pipeline is one of them, and when Squash adds support for pipelines I plan to use multiple threads. If you're reading from a FILE you could prefetch data in a separate thread so I/O and compression run in parallel (I plan to do this using the POSIX AIO API, but there may be a thread-based fallback…).

  15. The Following User Says Thank You to nemequ For This Useful Post:

    Shelwien (16th February 2017)

  16. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > yeah, it is what i say about. Boost libs doesn't have such limits

    I think it would have all the same limits, and more, because of all the code complexity.
    In addition, it would be simply impossible to build it with a slightly older compiler.
    Which would be annoying, because, for example, atm I have to build zstd with IC17,
    because it crashed when compiled with older IC; and the rest of 7z.dll I have to build
    with IC11, because 7z framework crashes when IC17 is used (reflate and plzma also
    use IC17, but they're separate dlls).

    And boost would basically force me to use latest gcc or nothing.
    Which, on windows, has its own problems.

    > either you overcomplicates or i don't understand your programming style.

    Well, I prefer to use templates where others use macros.
    The code is much more readable with templates, and easier to maintain.
    But thanks to that I have code like this:
    struct RecWrap_codec : Coroutine {
      template <int DECODE> struct Rec1Wrap : Rec1::LzmaModel<DECODE, Rangecoder<DECODE,13,1> > {};
      template <int DECODE> struct Rec2Wrap : Rec2::LzmaModel<DECODE, Rangecoder<DECODE,15,0> > {};
      template <int DECODE> struct Rec3Wrap : Rec3::LzmaModel<DECODE, Rangecoder<DECODE,15,0> > {};
      struct Dec1Wrap : RecWrap<1,Rec1Wrap> {};
      struct Enc1Wrap : RecWrap<0,Rec1Wrap> {};
      struct Dec2Wrap : RecWrap<1,Rec2Wrap> {};
      struct Enc2Wrap : RecWrap<0,Rec2Wrap> {};
      struct Dec3Wrap : RecWrap<1,Rec3Wrap> {};
      struct Enc3Wrap : RecWrap<0,Rec3Wrap> {};

    Inserting additional callbacks into this would be hard.

    > if you can just hardcode that, it will be a simple routine called very much
    > like fread, executing just the same "wait for buffer ready" code which is
    > now executed by the coroutine caller

    Uh, yes, so I'd have to copy-paste the bounds-checking code, with a callback for buffer processing,
    into each filter class?
    For example:

    uint get( void ) {
      if( __builtin_expect(inpptr>=inpend,0) ) {
        do {
          if( f_quit ) return uint(-1);
          yield(this,1);
        } while( inpptr>=inpend );
      }
      return *inpptr++;
    }

    Now, this is part of my coroutine class.
    So I guess I'd still have the same base I/O class, just with callbacks
    instead of yield. And then I'd have to somehow explicitly assign these
    callbacks to a filter - by "somehow" I mean that they'd be methods of
    some other class, so I'd have to pass the "this" value for them too, then
    convert it from void*, etc.
    Too much dumb copy-paste imho. And the I/O logic, error handling and whatever
    else gets separated into different functions for no reason.

    While with coroutines there'd be a single call loop, with status code handling.
    And it would be even possible to use persistent local variables during the processing,
    while with callbacks these would have to be class fields.

    > i.e. use a callback. but callback don't need access to
    > your class - like fread, it just gets pointer to buffer, size,

    Well, I presumed that I wouldn't have to paste the equivalent of the above get()
    manually into each class.

    I guess you'd have to optimize each coder separately, to use full-buffer I/O only.
    Then we end up with things like bcj2, where 10 lines of logic (including the rc) are wrapped in 30k
    of manual bounds-checking and state management.

    > And all that stuff doesn't require coroutines.

    It's not really a matter of requiring them or not...
    With coroutines I just get very compact and modular code, automatically.
    Sure, in theory there's no difference and everything can be implemented
    via callbacks just as well. Even though they have this inverted logic,
    where parts of the caller get executed in the context of the processing function,
    so the developer has to manually set everything up in such a way that the
    callbacks have access to what they need (for example, they won't normally have
    access to local vars of the function which called the processing function).

    So I do think that I wouldn't be able to write e.g. reflate without coroutines.
    For example, the outer loop processes the data with an entropy filter and other quick checks
    (while keeping some lookahead buffer), then calls the deflate decoder coroutine where it seems
    possible. If a block is successfully decoded, it's then passed to the raw2hif coroutine
    for actual recompression.
    So, with your method, I'd have to provide callbacks both to the decoder and to raw2hif,
    and then somehow pass the data between them and the main loop, and handle error codes.
    While what I have is a simple loop in a single function, which simply calls two
    other functions and then checks their return values.
    In addition, raw2hif is actually much more complicated and might even call
    the main reflate routine recursively to process nested streams...

    > let's see. in freearc, lzma algo is active - it knows by itself where its dictionary
    > and can request reading data directly to this place.

    Yes, but this is unrelated to coro or callback interfaces. It's just how I prefer to use it.
    In my framework, a coder can just as well request input explicitly, and provide the buffer for it too,
    which the caller routine would fill with data - I just never needed this until now.

    But I do have cases where a filter ignores the output buffer provided for it, and passes
    the input buffer as output on a write request.

    > In your code, lzma algo is passive - it's a coroutine that receives already
    > existing buffers to process.

    Yes. In fact, plzma is two coroutines - lzmaenc and lzmarec. Normally they work
    in their own threads though.

    > So in order to read directly into lzma dictionary, the hiogher-level
    > routine should know about this lzma dictionary,

    Sure, either this, or lzma should know about the buffer queue and
    be able to update it (and it has to shift the window somehow anyway).

    > they should be much larger since you will lose matches split between two
    > buffers (or overcomplicate the code). so it may be solution only for areas
    > incompressible by REP

    For lzma it's enough to have the buffers overlap by 273 bytes.
    And my rep does not use lookahead at all, so it's not a problem.
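
    I.e. the layout would be something like this (just a sketch with assumed numbers: 64k buffers, 273 being lzma's maximum match length):
    Code:
    #include <cstddef>

    // Window split into fixed blocks that overlap by the longest possible match,
    // so each buffer carries BLOCK_SIZE-OVERLAP bytes of new data.
    const size_t BLOCK_SIZE = 64 * 1024;            // bytes per buffer
    const size_t OVERLAP    = 273;                  // lzma max match length
    const size_t BLOCK_STEP = BLOCK_SIZE - OVERLAP; // new data added per buffer

    // which buffer holds a given window position, and its offset inside it
    inline size_t block_index (size_t win_pos) { return win_pos / BLOCK_STEP; }
    inline size_t block_offset(size_t win_pos) { return win_pos % BLOCK_STEP; }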

  17. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    @nemequ:
    Yes, tasks like that also exist, but you should take into account that zlib is basically already coroutine-based - it uses setjmp to implement that interface.
    So it's no wonder that you need coroutines to simulate it.

    While I'm trying to convince Bulat that a coroutine-based interface is just better in general :)

  18. The Following User Says Thank You to Shelwien For This Useful Post:

    nemequ (16th February 2017)

  19. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    nemequ, yes - there are cases when starting a new thread for each compression operation is too expensive. E.g. an application may need to compress lots of 1 KB blocks. My interest is limited to use in archivers/compressors, which impose some natural limits - Windows can't open more than ~10K files per second (even when all data is cached), so I don't expect more than ~50K compression operations per second, and for such a load starting a new thread for every compression operation is affordable.

    This means that your approach of converting an I/O API into a zlib API using coroutines may make sense for some applications. Although, better yet, the application should just check the input size beforehand and use a single-shot API on smaller inputs. Or the framework may cache a compression thread once started, reusing it for subsequent operations.

    But as I said, there is another side too - the zlib-style approach of executing everything in a single thread has the drawback of serializing execution. Imagine a gzip-style program that reads data at 100-500 MB/s, hashes it with SHA2 (~300 MB/s) and compresses it with a fast zstd mode (~300 MB/s). Rereading data from memory (instead of the CPU cache) works at ~10 GB/s, so that part is negligible. But performing the 3 above-mentioned operations sequentially will significantly slow down the program.

    An application may also get the same effect because it generates the data to compress on the fly, or because it sends the compressed data over the network. So running the compression algorithm in the user thread isn't necessarily faster; it may be slower as well, since it limits the parallelism.

    This can't be parallelized because from Squash's perspective the request is just "here is a blob of data, compress it".
    I don't think so. There is no guarantee that the returned compressed data corresponds to the input just provided. So you can compress the first input message while the user thread builds the second one. And when the user requests compression of the second message, copy it into an internal buffer and return the compressed first one.

    Yes, this means some complications, and here I'm trying to find solutions for them. My goal is to develop a standard compression API that allows the use of compression algos and chains of compression algos in archivers/compressors, while ensuring maximum parallelism and performance by dropping unneeded data allocation/copying.

    I agree that some applications will prefer a zlib-style API. But it can be implemented on top of the "efficient API" I described - all we need is to add a buffering thread before the first algorithm in the pipe, which just copies input data into buffers requested from the first algorithm.

    Let's see: we want to implement a zlib-style API. In the call, the user provides an input buffer and an output buffer. The subroutine sends this output buffer to the last algo in the pipe. Then the subroutine requests input buffers from the first algo, copies the input data into these buffers and allows their compression to start. Once that is done, it waits until the output data arrives and returns it to the caller.

    I think it looks pretty simple and shows that the "efficient API" can serve those needs too. It's good for everything except extra-small compression jobs.

  20. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    nemequ (16th February 2017)

  21. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Shelwien View Post
    @nemequ:
    Yes, tasks like that also exist, but you should take into account that zlib is basically already coroutine-based - it uses setjmp to implement that interface.
    https://github.com/madler/zlib/searc...9C%93&q=setjmp

  22. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    nemequ (16th February 2017)

  23. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Yes, my mistake - it's actually implemented as a switch(state), since 0.71 at least.
    Doesn't make it any less of a coroutine though.

  24. The Following User Says Thank You to Shelwien For This Useful Post:

    nemequ (16th February 2017)

  25. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    > yeah, it is what i say about. Boost libs doesn't have such limits

    I think it would have all the same limits, and more, because of all the code complexity.
    In addition, it would be simply impossible to build it with a slightly older compiler.
    Which would be annoying, because, for example, atm I have to build zstd with IC17,
    because it crashed when compiled with older IC; and the rest of 7z.dll I have to build
    with IC11, because 7z framework crashes when IC17 is used (reflate and plzma also
    use IC17, but they're separate dlls).

    And boost would basically force me to use latest gcc or nothing.
    Which, on windows, has its own problems.
    I think Boost.Context is that large not because they are sillier than you, but precisely because it is more portable and solves the problems of your simpler design. The docs of these 4 libraries don't note any incompatibility with exceptions, so I believe they are compatible.

    The zstd/7z problems have nothing in common with Boost. Boost libraries tend to provide maximum portability. In particular, I use (other) Boost libraries with gcc 4.9 and msvc2013. But overall, yes - the more instruments you use, the smaller their common platform. So for you it may be better to use your own library and avoid exceptions. For a general-purpose compression framework, that's not an option.

    So Squash should use Boost or another C++-compatible coroutine library, which is quite a large dependency. Then it should decide how much stack space to allocate for each coroutine. I see two solutions:
    - detect the stack size at the call, and divide it equally between the 2 coroutines
    - allow each compressor to specify its stack requirements, and use, say, 256 KB by default

    Finally, Squash cannot create 2 coroutines just inside the compression call. Instead, the coroutines should be started at a higher level, and all calls to the compression function should be issued inside the user coroutine. I.e. the application code should look like this:

    Code:
    main()
      StartCoroutines(user_code)
    
    user_code()
      zlib_style_compress(...)
      ...
      zlib_style_compress(...)
    Here user_code() is executed as the second coroutine (the first coroutine executes the I/O-style compressors) and all zlib-style calls should be issued inside this coroutine. So many complications... Things that work great inside a single specific application tend to become much more complicated when we start putting them into a more generic environment. And yeah, Boost.Context is C++11-only, so Squash would get that limit too.


    I guess, you have to optimize each coder separately, to use full buffer i/o only.
    Then we end up with things like bcj2, where 10 lines of logic (including rc) are wrapped in 30k
    of manual bounds-checking and state managing.
    I think BCJ2 suffers from the classical problem of inversion of logic. You should remember that coroutines were invented exactly to solve it. But with threads you can run the same code in separate threads, each having natural logic with simple read/write calls. The only question is performance. When you have multiple coroutines switching on every byte processed, threads are a no-go. OTOH, when you have multiple codecs each running at 1 GB/s, squeezing them all into a single thread will ruin performance.

    Going 15 years back, I think Igor developed this zlib-style API for compatibility with DOS and other s/t environments. It is still present in 7-zip, and it's enough for handling the built-in algos. You have simplified programming in this scheme by employing coroutines, so you enjoy the freedom of an I/O-style implementation while preserving the 7-zip external APIs.

    But aside from compatibility with the 7-zip codebase, there is no need to use coroutines. You just need to replace this yield() call with read_callback(buffer,size,udata) and let the framework manage the details - that's how things work in FreeArc. Taking into account my answer to the first question, I don't think that other compression frameworks should follow your coroutine-based approach. In short, coroutines are for tightly-integrated routines such as multiple byte-level processing layers inside a SINGLE compressor, rather than for high-level stuff like buffer exchange between filters.


    Too much dumb copy-paste imho. And i/o logic, error handling and whatever
    else gets separated into separate functions for no reason.
    Your code derives from Coroutine, which provides I/O support in the form of yield. My code derives from COMPRESSION_METHOD, which could provide readbuf/writebuf calls (although I never implemented that).

    So I do think that I'd not be able to write eg. reflate without coroutines.
    Using coroutines inside a compression method isn't a problem. You know the stack size each coroutine needs, and you can avoid using exceptions. Using coroutines in the framework means a serious dependency (on Boost & C++11), requires knowing the stack size for each compressor, requires a tedious programming technique (see user_code() above) and finally may decrease performance. That's why I suggest not using them, or at most providing them as an optional part.

    For lzma its enough to have buffer overlap by 273 bytes.
    you are right, this idea makes multi-buffered compression much easier!

  26. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > your code derives from Coroutine which provides I/O support in the form of
    > yield. My code derives from COMPRESSION_METHOD which could provide
    > readbuf/writebuf calls (although i never implemented that)

    Ok, but what do I do if I need to use your library to (de)compress data?

    struct X {
      FILE* f;
      Coder C;
      byte buf[1<<16];
      static uint cb_read( void* ptr, void* buf, uint len ) {
        X& x = *(X*)ptr; // needs checks for ptr==0
        len = fread( buf, 1,len, x.f ); // needs checks for buf==0
        // read errors; asking to put a disk with the next volume?
        return len;
      }
      void do_process() {
        C.Init( cb_read, this, buf, sizeof(buf) );
        C.Process();
      }
    };


    Compare with my way:

    struct X {
      FILE* f;
      Coder C; // coroutine
      byte buf[1<<16];
      void do_process() {
        C.Init();
        uint r;
        do {
          r = C.Process();
          if( r==1 ) {
            uint l = fread( buf, 1,sizeof(buf), f );
            // f and buf are safe, read errors are easy to handle
            C.addinp( buf, l );
          }
        } while( r!=0 );
      }
    };


    It's certainly not any harder to use, right?
    The code is more or less the same anyway...

    But what if we wanted to add something more complicated?
    Say, instead of a file, X gets its input from some preprocessor?
    In my version, I'd just add another loop with D.Process()
    instead of fread, but what would you do?

    So C would only call cb_read when it wants, and D would
    only call cb_write when it wants (and with whatever size it wants).
    So it's _necessary_ to use threads and have a large cache.
    And the callbacks have to know about that cache, so it basically becomes
    part of the interface.

    And in the case of non-fatal error handling, you'd end up with something like: signal an error
    from a callback function to another thread, and wait for it to resolve the problem.

    Meanwhile, I also have buf_ring/thread_pipe templates to turn coroutines
    into threads, so the coroutine approach doesn't get any more complicated with MT.
    In fact, I have a template which turns a thread_pipe chain into a basic coroutine.
    While the callback approach can't work without threads at all,
    and ends up simulating coroutines with threads in complex cases.

  27. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    struct X {
      FILE* f;
      Coder C;
      byte buf[1<<16];
      static size_t cb_read( void* ptr, void* buf, size_t len ) {
        // ptr and buf are safe, read errors are easy to handle
        return fread( buf, 1,len, ((X*)ptr)->f );
      }
      void do_process_c_style() {
        C.Process( cb_read, this, buf, sizeof(buf) );
      }
      void do_process_cpp11_style() {
        C.Process( buf, sizeof(buf), [&] {return fread(buf,1,sizeof(buf),f);} );
      }
    };

  28. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Shelwien (18th February 2017)

  29. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    http://nishi.dreamhosters.com/u/coro_test_v1.rar
    Code:
                      x86/IntelC17   x64/IntelC17  x86/gcc61mingw x64/gcc61mingw
    lambda:            347157435 clk  302401332 clk  294905436 clk  348303594 clk
    coro1:             624486180 clk  698908656 clk  758465600 clk  598263747 clk
    coro1/setjmp.h:   3436588002 clk 3002616072 clk 3656575464 clk ----//---- clk
    coro2b:           2261828952 clk 2663721714 clk 2559217299 clk 2610582951 clk
    coro2b/setjmp.h:  4284366069 clk 4073995008 clk 4774340381 clk ----//---- clk
    coro_lc:          3388835483 clk 2978441007 clk 3585246903 clk ----//---- clk
    coro_lc/FIBER:    1860559794 clk 3082368484 clk 1902442901 clk 2982194916 clk
      byte lam_buf[24];
    void (Coroutine::*lam_ptr)(void);
    ..
    memcpy( lam_buf, &lambda, sizeof(lambda) );
    void (decltype(lambda)::*ptr)(void) const = &decltype(lambda)::operator();
    lam_ptr = *(decltype(lam_ptr)*)&ptr;
    ...
    auto lambda = [&]()->void { while(1); };
    void (decltype(lambda)::*ptr)(void) const = &decltype(lambda)::operator();
    ( (*(decltype(lambda)*)q.lam_buf) .* ((decltype(ptr)&)q.lam_ptr) ) ();


    So it's actually much more complicated than it seems, but it does work (although I couldn't make it work with std::function and/or clang).
    And what's more interesting, it's actually faster than coro1 and relatively compatible with my code, so I may be able to actually use it sometimes...
    like in thread_pipe. I guess I'd need the usual setjmp too, to handle yield(0), but it still seems worth testing.

  30. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Coroutines are a form of cooperative multitasking with addition of data passing. You build a pipe "reader|compressor1|compressor2|writer". When reader filled its buffer, it sends the buffer to C1 AND switches to the execution of C1. When C1 filled its outbuf, it switches to C2 passing the outbuf. Or, when C1 needs more data, it switches back to R. So execution goes forth and back between adjancent coroutines, governed by the demand to fill inbuf or flush outbuf.

    OTOH, threads and fibers (cooperative threads) are just independent procedures, executed more or less simultaneously. They have no builtin data exchange mechanism, but it may be built on top of provided primitives - mutexes, locks and so on.

    When I started to write FreeArc, I developed my own micro-framework implementing thread chains. In short, every thread executed by this framework gets a pipe parameter, with sendP, receiveP and a few other operations for data exchange. It looks exactly like coroutines, although I was inspired by unix pipes and Hoare's CSP (did you read that book?).

    Later I translated this approach to C++ in order to support multi-compressor methods in unarc. The C++ code, though, is more limited - it can only exchange data between two threads via a single buffer, rather than sending arbitrary data types.

    So, at the end of the day, we have two pretty similar engines. Your engine can run in a single thread, but has wrappers for m/t processing. My engine is m/t only, although the current implementation rarely enjoys parallel execution due to the single-buffer approach. Parallel execution becomes possible only when the codec itself arranges it - currently that's rep, lzma, grzip and 4x4.

    BTW, I'm curious how your API works with m/t codecs such as lzma/lzma2/plzma? My guess is that you are forced to run such a codec in a separate thread, so essentially you have threads with simple codecs joined as coroutines, and threads with m/t codecs running on their own? Or can you glue together multiple codecs, as long as they switch to coroutines only in the main thread?

    Quote Originally Posted by Shelwien View Post
    auto lambda = [&]()->void { while(1); };
    void (decltype(lambda)::*ptr)(void) const = &decltype(lambda)::operator();
    Try:
    Code:
    auto lambda = [&]{ while(1); };
    auto ptr = &decltype(lambda)::operator();
    Your code is complex because you don't use lambdas directly. Usually the lambda type is either implicitly converted into std::function (which is slower) or used as a template parameter:

    Code:
    // definition:
    typedef int ReadFunc(void*,int);
    void compress(std::function<ReadFunc> read) {...}
    // or
    template <typename ReadF>
    void compress(ReadF read) {...}

    // usage:
    compress([&] (void* buf, int size) {return (int)fread(buf,1,size,infile);});


    Quote Originally Posted by Shelwien View Post
    And what's more interesting, it's actually faster than coro1
    Maybe it's faster because lambdas are usually inlined?

    But I see that your new test has very different results from the older one - maybe it's just an incorrect measurement?
    Last edited by Bulat Ziganshin; 18th February 2017 at 16:17.

  31. #23
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    I know I'm a bit late here, but…

    Regarding C++, I think that as long as exceptions are handled on the coroutine side of things everything should be okay. I don't see any reason to force C code to use a C++ library.

    One of the things I really like about libcoro is that it's really an abstraction for different implementations. In the worst case, you could simply use the pthread backend (maybe I'll write a C11 threads backend). The less portable backends could then be used simply as a faster alternative only when they're known to work.

    Shelwien, maybe you'd be interested in adding your implementation as a libcoro backend? Sounds like it's faster than the existing ones, but TBH I'm a bit worried about portability… it would be nice to have an easy way to switch to something more portable (like threads) if necessary. Or maybe I just really like abstraction libraries.

  32. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Reuploaded the benchmark results: http://nishi.dreamhosters.com/u/coro_test_v1.rar
    Added VS2015, std::function, and lambda support for Clang (there're apparently some quirks with decltype).
    Code:
                      IntelC17 x86/x64      VS2015 x86/x64        gcc61mingw x86/x64    clang40 x86/x64     
    lambda:           214537791  210948246  270348222  230743896  205528416  192649839  220490406  192700917
    lambda/std::fun:  214958934  211000845  270391380  230901783  205640802  192716865  220445787  192051945
    coro1:            357143037  387461040 ----//---- ----//----  519351318  382770018  364518549  415306404
    coro1/setjmp.h:  2823461024 2234846685 2822951028 2232028617 2667703154 ----//---- 2677903332 2177516415
    coro2b:          1714640568 1769019291 ----//---- ----//---- 1806651510 1763740344 2060953071 1808276091
    coro2b/setjmp.h: 3264814960 2620448826 3417974493 2712727291 3565386258 ----//---- 3667393362 2604234567
    coro_lc:         2860263324 2250443775 2774967198 2203718535 2612727888 ----//---- 2576286733 2056592649
    coro_lc/FIBER:   1366957545 2326090809 1758015684 2291457282 1390310523 2277445976 1465933935 2431484493
    
    (1) clang needs RTTI to use std::function
    (2) I didn't write setjmp replacements for VS - IntelC supports gnu inline asm
    (3) default setjmp is somehow broken in gcc/mingw x64 builds that I have

  33. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Coroutines are a form of cooperative multitasking with addition of data passing.

    Yes, but it's also a type of interface, where you call a function and it returns
    a status code which you can handle, then call the function again to resume the work.

    > BTW, it's interesting how your API works with m/t codecs such as
    > lzma/lzma2/plzma?

    plzma is actually two coroutines - an LZ parser and an entropy coder (lzmarec).
    So it works as expected, either as a thread chain or as a coroutine chain.

    lzma2... well, I had to fix 7z's mtcoder.c to only call the progress callback
    from the main thread, because the delphi GUI had issues with random threads.
    I had to add a main thread with a WaitFor loop for that.

    But otherwise I didn't port lzma/lzma2 for my own use.
    MT in the parsing optimizer wouldn't matter for me anyway,
    as long as it still has sequential input and output.

    > My guess is that you are forced to run such codec in separate thread,

    Well, I had some early implementations with reader/writer threads
    but no dedicated control thread, though they were inconvenient to use.
    So now I have an MT wrapper with a coroutine interface that can
    run a thread_pipe chain with blockwise MT.

    > so essentially you have threads with simple codecs joined as coroutines and
    > threads with m/t codecs, running on their own?

    Yes - are there any alternatives?
    In fact, at first I tried using plzma chained into a single thread.
    The idea was that this way I'd get better performance at mt6+.
    But it turned out that mt8 hurts compression too much (when compressing
    reasonable amounts of data), and we have many filters that could use
    these threads anyway.

    > Or you can glue together multiple codecs as far as they switch to
    > coroutines only in main thread?

    Not quite sure what you mean here...

    I port all data-processing algorithms to my coroutine interface.
    It's basically the same as your read/write callbacks, but in my case
    it's a set of status values and methods which let me implement i/o.

    Then I have a bunch of templates, like CoroFileProc, CoroPair, CoroChain,
    thread_pipe, etc., where I can plug in any coroutine to get some effect.

    Thus an MT coder with a coroutine interface would still be just another coroutine.

    > may be, it's faster because lambdas are usually inlined?

    Sure, it's a very simple algorithm in this case too,
    but I still might have some use cases for it.

    > i see that your new test has very different results from older one - may be it's just incorrect measurement?

    Yes, I had 90% CPU load when running it, but it showed what I wanted anyway.

  34. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    @nemequ:
    > Shelwien, maybe you'd be interested in adding your implementation as a
    > libcoro backend?

    It would be impossible, because my implementation is more limited.
    For example, it isn't able to use an arbitrary buffer for the stack.
    Instead, it just creates a padding (64k normally) in the default stack,
    so the caller routine works as usual, and the coroutine
    works 64k deeper in the stack.
    Then, with coro2b.inc, memcpy is used to save/restore coroutine stacks,
    while all coroutines use the same part of the main stack.
    And in coro1 there's a template parameter which determines the
    size of the initial padding, so it's possible to run the 1st coroutine at depth 64k
    and the 2nd at depth 128k, which is faster because no memcpy is needed.
    It would be common sense to use alloca() here to automate this,
    but unfortunately I wasn't able to make the alloca version work with some compilers.

    In short, my implementation does what I need, but can't be used to
    implement libcoro's coro_transfer.
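
    For the curious, here's a minimal sketch of the "padding in the default stack" idea (this is not the actual coro1/coro2b code - the 64k figure comes from the description above, everything else is illustrative, and jumping back down the stack like this is exactly the part that sits outside what the C standard strictly promises):
    Code:
    // The caller reserves a chunk of its own stack, the coroutine body runs below
    // it, and setjmp/longjmp switch between the two levels. Works in practice on
    // top-down stacks as long as nothing overwrites the coroutine's deeper frames.
    #include <csetjmp>
    #include <cstdio>

    static jmp_buf caller_ctx, coro_ctx;

    static void coro_body() {
        for( volatile int i = 0; i < 3; i++ ) {
            printf("coro step %d\n", i);
            if( !setjmp(coro_ctx) ) longjmp(caller_ctx, 1);   // yield to the caller
        }
        longjmp(caller_ctx, 2);                               // signal completion
    }

    static void start_coro() {
        volatile char padding[64*1024];   // the coroutine runs 64k deeper than this frame
        padding[0] = 0;                   // touch it so the compiler keeps it
        coro_body();
    }

    int main() {
        int r = setjmp(caller_ctx);
        if( r == 0 )      start_coro();   // first entry: run until the first yield
        else if( r == 1 ) {
            printf("back in the caller, resuming\n");
            longjmp(coro_ctx, 1);         // jump back down into coro_body()
        }
        printf("done\n");                 // r == 2: the coroutine finished
        return 0;
    }
    The real coro1/coro2b differ in how the padding is arranged (a template parameter in coro1, memcpy of the stack region in coro2b), but the control transfer in their setjmp.h variants is the same kind of setjmp/longjmp pair.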

    > Sounds like it's faster than the existing ones, but TBH I'm a bit worried
    > about portability...

    My implementation should work on any platform with a top-down stack and stack frames -
    it doesn't use any undocumented setjmp internals (like libcoro does).
    Of course, some issues are still possible, like the VS stack security checks,
    but I can't predict them in advance.

    > it would be nice to be able to have an easy way to switch to something more
    > portable (like threads) if necessary.

    As I tried to explain already, threads are actually much less portable,
    unless you're only considering porting to gcc6 on unix-like OSes.
    For example, the DOS/Windows platform has lots of old C++ compilers
    for which you won't be able to easily find/port a pthreads implementation,
    but which do have a working setjmp.h.

  35. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    nemequ (19th February 2017),RamiroCruzo (19th February 2017)

  36. #27
    Member
    Join Date
    Nov 2015
    Location
    Śląsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Just stumbled upon
    http://libmill.org/
    I think it may be interesting to you.

  37. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by nemequ View Post
    Regarding C++, I think that as long as exceptions are handled on the coroutine side of things everything should be okay. I don't see any reason to force C code to use a C++ library.
    The C/C++ standards guarantee longjmp functionality only when jumping up the stack and no exception handlers/destructors are skipped: http://en.cppreference.com/w/cpp/uti...rogram/longjmp
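
    A tiny example of the restriction described above (illustrative; std::string just stands in for any object with a non-trivial destructor):
    Code:
    #include <csetjmp>
    #include <cstdio>
    #include <string>

    static jmp_buf env;

    static void risky() {
        std::string s = "needs a destructor";
        longjmp(env, 1);   // UB in C++: the jump skips s's destructor
    }

    int main() {
        if( setjmp(env) == 0 ) risky();
        else printf("jumped back\n");
        return 0;
    }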

    The common experience is that longjmp can switch between multiple stacks but is incompatible with exceptions/destructors. I don't know why - maybe some C++ runtimes run destructors from other coroutines when they handle exceptions. It seems that Eugene knows more, but hides this knowledge.

    Note that mingw has 3 mechanisms for exception handling (sjlj, seh, dwarf), MSVC has at least 2 afaik, and there are other compilers/platforms, so ensuring compatibility with them all is a lot of work. You are better off relying on libraries that explicitly declare which platforms they support, and the size of the Boost.Context sources shows the cost of providing such portability.

    So, yes, it's a rare case when a C++ application can't use a library developed for pure C. Or you may use libcoro and switch to the thread-based backend in C++ apps. BTW, libcoro creates new stacks dynamically, outside of the main stack, which solves all the issues with stack size management I mentioned earlier. This makes libcoro superior to Eugene's code for general-purpose library code like Squash.
    Last edited by Bulat Ziganshin; 20th February 2017 at 16:35.

  38. #29
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    A coroutine is really a kind of state machine. I wonder if a state machine generator, like Ragel, would be useful.

    If you look at the Ragel manual, you can probably skip the regex stuff and go directly to explicit state charts.

  39. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    It's a matter of compatibility with code optimizations.
    There've been state-machine-like coroutine libraries - it's possible to avoid using local vars (use class fields instead),
    implement yield as return, and resume as switch/case or computed goto.
    That's how public versions of zlib are written, for example (I suspect that some pre-0.7 version did use setjmp).
    But this approach quickly gets annoying if there are nested function calls, and overall it's less efficient,
    because the compiler has to store all the member values back to memory on any access via a non-restrict pointer.
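
    A minimal sketch of that "yield as return, resume as switch/case" style (illustrative only - a generic version of the pattern, not zlib's actual code):
    Code:
    // Locals are promoted to fields so they survive across calls;
    // 'state' encodes the resume point inside the switch.
    struct Counter {
        int state = 0;
        int i     = 0;     // would-be local variable

        // Each call resumes where the previous one "yielded";
        // returns the next value, or -1 when finished.
        int next() {
            switch( state ) {
            case 0:
                for( i = 0; i < 3; i++ ) {
                    state = 1;
                    return i;          // yield i
            case 1:;                   // resume lands back inside the loop
                }
                state = 2;             // done; fall through
            case 2:
                return -1;
            }
            return -1;
        }
    };
    The efficiency point above shows up even here: i is a member rather than a true local, so the compiler has to keep it consistent in memory across yields, and every call re-enters through the switch.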

