
Thread: Better compression performance across time?

  1. #1
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Better compression performance across time?

    I had asked this in comp.compression, but I muddied the question with the details of how it is implemented, so I'm going to try again in this forum by asking my question better:

    I have a system where there is an opportunity to compress a stream of data before it is sent to its destination. The time I have to compress the data is variable; that is, I do not know how much time I have to compress before the data *must* be sent, and that time could arrive while in the middle of the compression process. The output stream can be broken up into variably-sized "compressed" and "non-compressed" frames if necessary for transmission. Given those constraints, is there an algorithm, or a particular way of using an algorithm, that would satisfy this requirement? Specifically, is there a family of algorithms where it is a design requirement to interrupt the compression process, output what has compressed so far, then output the remainder?

    I can already think of trivial ways to do this with any LZ77 variant, but was wondering if there was an algorithm or implementation that was already specifically tuned for this behavior.
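    For illustration, the trivial framing could look something like the minimal C sketch below - fast_lz_compress() and deadline_reached() are hypothetical placeholders for whatever codec and timer the system actually provides: each block goes out either as a compressed frame or, once the deadline hits, as a raw frame the receiver just copies.

    Code:
    #include <stdint.h>
    #include <string.h>

    /* hypothetical placeholders -- not real library calls */
    extern size_t fast_lz_compress(const uint8_t *src, size_t n,
                                   uint8_t *dst, size_t dst_cap);
    extern int    deadline_reached(void);

    enum { FRAME_RAW = 0, FRAME_LZ = 1, BLOCK = 4096 };

    /* Emit self-describing frames: [type:1][payload_len:2][payload].
       Compress block by block; once the deadline hits, flush the rest raw. */
    static size_t send_stream(const uint8_t *src, size_t n, uint8_t *out)
    {
        size_t in = 0, outp = 0;
        while (in < n) {
            size_t  chunk = (n - in < BLOCK) ? (n - in) : BLOCK;
            uint8_t type  = FRAME_RAW;
            size_t  plen  = chunk;
            uint8_t tmp[BLOCK];

            if (!deadline_reached()) {
                size_t c = fast_lz_compress(src + in, chunk, tmp, sizeof tmp);
                if (c > 0 && c < chunk) { type = FRAME_LZ; plen = c; }
            }
            out[outp++] = type;
            out[outp++] = (uint8_t)(plen & 0xFF);
            out[outp++] = (uint8_t)(plen >> 8);
            memcpy(out + outp, type == FRAME_LZ ? tmp : src + in, plen);
            outp += plen;
            in   += chunk;
        }
        return outp;
    }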

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    you can obviously do it with any stream compression algorithm - i.e. anything in wide use except bwt/st

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I had asked this in comp.compression, but I muddied the
    > question with the details of how it is implemented,

    Imho comp.compression these days just isn't the place to ask questions.

    > Specifically, is there a family of algorithms
    > where it is a design requirement to interrupt the
    > compression process, output what has compressed so far, then
    > output the remainder?

    There might be that kind of algorithm, but it won't necessarily
    be optimal for your task.
    I mean, it's not an axiom that any multipass algorithm is
    better than any sequential one.

    > I can already think of trivial ways to do this with any LZ77
    > variant, but was wondering if there was an algorithm or
    > implementation that was already specifically tuned for this
    > behavior.

    Leaving some data uncompressed is too trivial,
    as it's possible to implement some LZ with almost memcpy speed.
    Also I think that LZ optimal parsing is the algorithm you'd like,
    as it can work by almost infinitely improving the match sequence,
    with the ability to quit anytime.
    Good example of that is pngout: http://www.advsys.net/ken/util/pngout.htm

    But, as I said, it's not necessarily the best solution.
    Imho it would be more practical to apply some fast LZAri first
    (as dynamic huffman is slower than rangecoding, and blockwise-static
    huffman is not exactly sequential, so it would probably be slower in
    compression anyway),
    and then, if there's more time, you can try some CM for best compression.
    With a blockwise implementation it would even be possible to partially
    convert the data to CM.
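    A rough sketch of that blockwise two-stage idea, just to show the flow - lzari_compress(), cm_compress() and time_left() are hypothetical placeholders: the first pass guarantees a complete cheap encoding, and the second pass spends whatever time remains upgrading blocks to the stronger coder.

    Code:
    #include <stdint.h>
    #include <string.h>

    /* hypothetical codecs and clock -- placeholders, not real APIs */
    extern size_t lzari_compress(const uint8_t *src, size_t n, uint8_t *dst);
    extern size_t cm_compress(const uint8_t *src, size_t n, uint8_t *dst);
    extern int    time_left(void);

    enum { BLK = 8192, METHOD_LZARI = 1, METHOD_CM = 2 };

    typedef struct { uint8_t method; size_t len; uint8_t data[BLK + 64]; } frame_t;

    static size_t encode_anytime(const uint8_t *src, size_t n, frame_t *fr)
    {
        size_t i, blocks = (n + BLK - 1) / BLK;

        /* pass 1: fast coder for everything, so the stream is always sendable */
        for (i = 0; i < blocks; i++) {
            size_t len = (i + 1 < blocks) ? BLK : n - i * BLK;
            fr[i].method = METHOD_LZARI;
            fr[i].len    = lzari_compress(src + i * BLK, len, fr[i].data);
        }
        /* pass 2: while time remains, re-encode blocks with the stronger coder */
        for (i = 0; i < blocks && time_left(); i++) {
            uint8_t tmp[BLK + 64];
            size_t  len = (i + 1 < blocks) ? BLK : n - i * BLK;
            size_t  c   = cm_compress(src + i * BLK, len, tmp);
            if (c < fr[i].len) {                  /* keep whichever is smaller */
                fr[i].method = METHOD_CM;
                fr[i].len    = c;
                memcpy(fr[i].data, tmp, c);
            }
        }
        return blocks;
    }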

  4. #4
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    Leaving some data uncompressed is too trivial,
    as it's possible to implement some LZ with almost memcpy speed.
    I have to be honest: I hate statements like that. I am an x86 assembler programmer so I know what you mean when you say "almost memcpy speed" (about 10x slower than memcpy), but that statement alone sounds like only 2x slower or better.

    Also I think that LZ optimal parsing is the algorithm you'd like,
    as it can work by almost infinitely improving the match sequence,
    with the ability to quit anytime.
    I would love to know how to do that, but even monitoring this board for a year, I have had a very hard time trying to understand optimal parsing. And I wish I could understand it, because I have written an asymmetric compression system with only a greedy and lazy matching compressor, and since the decompressor is a fixed target, I would love to have "perfect" results in the compressor (in my system, only decompression speed matters). Heck, my decompressor requires byte-aligned output, so my codes are of a fixed size -- it should be easy to write an optimal parsing compressor, yes?

    Good example of that is pngout: http://www.advsys.net/ken/util/pngout.htm
    Did Ken ever release the source of that?

    But, as I said, it's not necessarily the best solution.
    Imho it would be more practical to apply some fast LZAri first
    (as dynamic huffman is slower than rangecoding, and blockwise-static
    huffman is not exactly sequential, so it would probably be slower in
    compression anyway),
    and then, if there's more time, you can try some CM for best compression.
    With a blockwise implementation it would even be possible to partially
    convert the data to CM.
    I'm sorry, I must still be an amateur: What is "CM" short for?

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > > Leaving some data uncompressed is too trivial,
    > > as it's possible to implement some LZ with almost memcpy speed.
    >
    > I have to be honest: I hate statements like that.

    You know, I tend to write "imho" when I'm not sure.

    > I am an x86 assembler programmer so I know what you mean when you
    > say "almost memcpy speed" (about 10x slower than memcpy),
    > but that statement alone sounds like only 2x slower or
    > better.

    No, I actually meant "the same or maybe even slightly faster" here.
    As if you'd never heard of improving i/o speed by compression.

    Though of course it all depends on the specific CPU and only applies
    to those systems where memory access is the bottleneck.
    So, if the speed of

    Code:
    ; i don't say this is optimal
    ; (ecx is assumed to start negative, with esi/edi pointing past the buffer end)
    m0: mov al,[byte esi+ecx]
        mov [byte edi+ecx],al
        inc ecx
        jnz m0
    isn't 2x or 10x slower than REP MOVSD, then compression might be
    used as well - of course not those well-known implementations,
    but you know, CRLF->LF conversion is compression too, and can be
    (forcibly) classified as LZ78.
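    To make the CRLF example concrete, a one-pass filter like the C sketch below already acts as a (very weak) compressor while running at close to plain copy speed:

    Code:
    #include <stddef.h>

    /* CRLF -> LF as a degenerate "compressor": one pass, nearly copy speed */
    static size_t crlf_to_lf(const unsigned char *src, size_t n, unsigned char *dst)
    {
        size_t i, o = 0;
        for (i = 0; i < n; i++) {
            if (src[i] == '\r' && i + 1 < n && src[i + 1] == '\n')
                continue;                 /* drop the CR, keep the LF */
            dst[o++] = src[i];
        }
        return o;                         /* output size <= n */
    }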

    > > Also I think that LZ optimal parsing is the algorithm you'd like,
    >
    > I would love to know how to do that, but even monitoring
    > this board for a year, I have had a very hard time trying to
    > understand optimal parsing.

    I think it's really easy to understand, though it can be fairly troublesome
    to implement for a specific compression method.
    Just read something about dynamic programming, like this: http://en.wikipedia.org/wiki/Levenshtein_distance

    > And I wish I could understand
    > it, because I have written an asymmetric compression system
    > with only a greedy and lazy matching compressor, and since
    > the decompresser is a fixed target, I would love to have
    > "perfect" results in the compressor (in my system, only
    > decompression speed matters).

    You just have to try all the possible match/literal sequences.

    > Heck, my decompresser requires
    > byte-aligned output, so my codes are of a fixed size -- it
    > should be easy to write an optimal parsing compressor, yes?

    I don't think so. Afaik, most of "optimal" LZs are not really optimal
    considering the compressed output size, and only work on the match
    sequence level with some random metric.
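    For the byte-aligned, fixed-size-code case specifically, the exact parse really is just a shortest-path / dynamic-programming pass, in the same spirit as the Levenshtein-distance link above. A minimal sketch - the costs are made-up example values and longest_match() is a placeholder for whatever match finder is used; the chosen literal/match sequence is recovered by walking step[] backwards from n:

    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* illustrative fixed code sizes for a byte-aligned format (made up):
       a literal costs 1 byte, any match costs 3 bytes */
    enum { LIT_COST = 1, MATCH_COST = 3, MIN_MATCH = 3 };

    /* placeholder for whatever match finder is actually used */
    extern size_t longest_match(const uint8_t *buf, size_t pos, size_t n, size_t *offset);

    /* shortest-path parse: cost[i] = minimal output bytes for buf[0..i);
       with fixed code sizes this is exact, and step[] records the move taken */
    static void optimal_parse(const uint8_t *buf, size_t n, size_t *cost, size_t *step)
    {
        size_t i, len, off;
        cost[0] = 0;
        for (i = 1; i <= n; i++) cost[i] = (size_t)-1;
        for (i = 0; i < n; i++) {
            /* option 1: emit a literal covering one byte */
            if (cost[i] + LIT_COST < cost[i + 1]) {
                cost[i + 1] = cost[i] + LIT_COST;
                step[i + 1] = 1;
            }
            /* option 2: emit a match of any admissible length */
            for (len = longest_match(buf, i, n, &off); len >= MIN_MATCH; len--) {
                if (cost[i] + MATCH_COST < cost[i + len]) {
                    cost[i + len] = cost[i] + MATCH_COST;
                    step[i + len] = len;
                }
            }
        }
    }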

    > Good example of that is pngout: http://www.advsys.net/ken/util/pngout.htm
    > Did Ken ever release the source of that?

    I wouldn't know even if he did, but a kzip library or something like that was
    mentioned there.

    > > But, as I said, its not necessarily the best solution.
    > > Imho it would be more practical to apply some fast LZAri first
    > > (as dynamic huffman is slower than rangecoding, and blockwise-static
    > > huffman is not exactly sequential so would be probably slower in
    > > compression anyway),
    > > And then, if there's more time, you can try some CM for best compression.
    > > With blockwise implementation it would be possible even to partially
    > > convert the data to CM.
    >
    > I'm sorry, I must still be an amateur: What is "CM" short for?

    context modelling.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Good example of that is pngout: http://www.advsys.net/ken/util/pngout.htm

    Btw, the idea is like this: you fully compress the data, including entropy
    coding, but remember the choices you made for matches.
    Then, after compression, you have the precise code lengths for these matches
    (and the statistics required to calculate such code lengths if a dynamic compression
    method is used).
    So you can use that info as a metric for an optimal parsing pass
    (as pngout does imho).
    Or you can just randomly select some location, make a modification,
    and measure the result (a la genetic optimization).
    That's actually what I was suggesting, as it's an O(1) operation
    if you don't recode the data after each modification.
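    A sketch of that "random local modification" variant, under the assumption that the first full pass stored a per-position decision plus its measured code length (all names here are illustrative, not a real API):

    Code:
    #include <stddef.h>
    #include <stdlib.h>

    /* one parse decision per position: 0 = literal, otherwise match length;
       codelen is the exact cost measured during the first full pass */
    typedef struct { size_t choice; unsigned codelen; } decision_t;

    /* hypothetical helpers: cost a single alternative choice, or apply it */
    extern unsigned cost_of_alternative(const decision_t *d, size_t pos);
    extern void     apply_alternative(decision_t *d, size_t pos);

    /* anytime refinement: each trial looks at one position only, so it is O(1);
       nothing gets re-encoded until the stream is finally emitted */
    static void refine_until_deadline(decision_t *d, size_t n, int (*time_left)(void))
    {
        while (time_left()) {
            size_t   pos = (size_t)rand() % n;
            unsigned alt = cost_of_alternative(d, pos);
            if (alt < d[pos].codelen)
                apply_alternative(d, pos);        /* keep only improvements */
        }
    }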

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by Shelwien View Post
    > I'm sorry, I must still be an amateur: What is "CM" short for?
    context modelling.
    Or Context Mixing!

    (As noted at LTCB)

    Context Modeling is a really wide term: PPM, CM, CTW, ...

    Context Mixing is about how we use the predictions from various contexts - instead of selecting a particular context as with PPM, we just mix the statistics from all or a few of the available contexts to make a new prediction/probability, with a specific weight for each context. Just a brief explanation.
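    For a toy illustration of the mixing step itself (two models only, floating point for clarity - real coders use fixed point and many contexts), the usual logistic scheme looks like this:

    Code:
    #include <math.h>

    static double w1 = 0.0, w2 = 0.0;            /* per-model mixing weights */

    static double stretch(double p) { return log(p / (1.0 - p)); }
    static double squash(double x)  { return 1.0 / (1.0 + exp(-x)); }

    /* combine the bit probabilities p1, p2 from two context models */
    static double mix_predict(double p1, double p2)
    {
        return squash(w1 * stretch(p1) + w2 * stretch(p2));
    }

    /* after coding the actual bit, train the weights toward the model
       that predicted it better (simple gradient step, rate is arbitrary) */
    static void mix_update(double p1, double p2, double p_mix, int bit)
    {
        const double lr  = 0.02;
        double       err = (double)bit - p_mix;
        w1 += lr * err * stretch(p1);
        w2 += lr * err * stretch(p2);
    }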

    One very interesting paper:
    Attached Files

  8. #8
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    > Heck, my decompresser requires
    > byte-aligned output, so my codes are of a fixed size -- it
    > should be easy to write an optimal parsing compressor, yes?

    I don't think so. Afaik, most of "optimal" LZs are not really optimal
    considering the compressed output size, and only work on the match
    sequence level with some random metric.
    But, if the symbol sizes are fixed, it should be possible to write a 100% optimal compressor, right?

    > I'm sorry, I must still be an amateur: What is "CM" short for?

    context modelling.
    Ah... Unfortunately there's no time on my (very slow) target platform for that.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > > I don't think so. Afaik, most of "optimal" LZs are not really optimal
    > > considering the compressed output size, and only work on the match
    > > sequence level with some random metric.
    >
    > But, if the symbol sizes are fixed, it should be possible to write a 100% optimal compressor, right?

    This possibility is based on the high probability of the cases where
    an optimal decomposition of the prefix is still optimal for the whole string.
    So in theory it's all the same for any sequential compression method,
    static or not.
    The point is that you won't like its worst-case performance,
    so you won't like having it perfectly optimal.

    > > > I'm sorry, I must still be an amateur: What is "CM" short for?
    > > context modelling.
    > Ah... Unfortunately there's no time on my (very slow) target platform for that.

    There is time on any platform, depending on the implementation.
    CM is any method based on the assumption that as-yet-unknown data has different
    properties depending on the known context.
    Or, alternatively, CM can be described as a model for a (markovian)
    source with a hidden state, depending on which it changes the statistical
    properties of the generated data.
    Btw LZ is not a model, or even a compression method, actually.
    It's just a (redundant) transformation, the results of which can be easily
    compressed (so similar to BWT in a way), but afaik there's not even
    a single example of a type of data generated like that.

    Well, I just wanted to say that CM can have any speed, depending on the
    specific implementation.
    And even traditional implementations can be faster than non-greedy LZ
    in compression, though their decompression is usually symmetrical.
    But then again, there's no rule against asymmetrical CM with faster-than-LZ
    decoding.
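    As a tiny example of how cheap the per-bit work can be, here is a minimal order-1 bit model in C - one probability table plus a shift update, purely illustrative:

    Code:
    #include <stdint.h>

    /* P(bit=1) scaled to [0,4096), indexed by the previous byte and by the
       bits of the current byte seen so far (1-prefixed, so 1..255) */
    static uint16_t prob[256][256];

    static void model_init(void)
    {
        int c, s;
        for (c = 0; c < 256; c++)
            for (s = 0; s < 256; s++)
                prob[c][s] = 2048;                    /* start at 0.5 */
    }

    static unsigned predict(unsigned ctx, unsigned state)
    {
        return prob[ctx][state];                      /* one table lookup */
    }

    static void update(unsigned ctx, unsigned state, int bit)
    {
        if (bit) prob[ctx][state] += (4096 - prob[ctx][state]) >> 5;
        else     prob[ctx][state] -=  prob[ctx][state]          >> 5;
    }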

    Also, btw, BWT might be "that kind of algorithm" too.
    I mean, you can implement some fast sorting for small blocks,
    and then merge them while you have time.

    > > > I'm sorry, I must still be an amateur: What is "CM" short for?
    > > context modelling.
    > Or Context Mixing!
    > (As noted at LTCB)
    > Context Modeling is really wide term: PPM, CM, CTW, ...

    Well, I do agree with the last line.
    But Context Mixing != Context Modelling, despite what's written on the LTCB page.
    Because Context Mixing is about mixing - combining the predictions of
    several models. While Context Modelling doesn't necessarily have to
    be implemented that way.
    Last edited by Shelwien; 17th May 2008 at 00:35.

  10. #10
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Just some notes about speed on my Athlon64 3000+.

    Speed of QuickLZ 1.40 in the fastest level: 116 - 252 Mbyte/s depending on file type.

    Speed of the memcpy() implementation of VC2005: 977 Mbyte/s.

    Speed of unrolled MMX loop using prefetch at src + 128 bytes: 1300 Mbyte/s

    Speed of code below: 526 Mbyte/s.

    Code:
    #include <stdlib.h>   /* for malloc */

    int main(void)
    {
        char *s = malloc(100000000);
        char *d = malloc(100000000);
        __asm{
            mov edx, 100          ; 100 passes over the 100 MB buffer
        again:
            mov esi, s
            mov edi, d
            mov ecx, 0
        m0:
            mov al,[esi+ecx]      ; byte-by-byte copy
            mov [edi+ecx],al
            inc ecx
            cmp ecx, 100000000
            jne m0

            dec edx
            jnz again
        }
        return 0;
    }

  11. #11
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Oh, and REP MOVSD is exactly 1000 Mbyte/s.

    Also note that VS is not using REP MOVSD but goes through some registers in a loop instead. This is slightly faster than REP MOVSD on most systems (just not here), perhaps by 100 Mbyte/s or so.

    Apparently Intel and AMD stopped optimizing the string instructions some time ago.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > cmp ecx, 100000000

    My loop didn't contain that.
    Try this:
    Code:
    mov esi, s
    mov edi, d
    mov ecx, 100000000
    add esi,ecx
    add edi,ecx
    neg ecx
    m0:
    mov al,[esi+ecx]
    mov [edi+ecx],al
    inc ecx
    jne m0
    And this:
    Code:
    mov esi, s
    mov edi, d
    mov ecx, 100000000
    m0:
    mov al,[esi]
    lea esi,[esi+1]
    sub ecx,1
    mov [edi],al
    lea edi,[edi+1]
    jne m0
    Edit: Then again, nobody said that unrolling and vectorization and
    prefetching couldn't be used for LZ decoding.
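    For example, a C-level version of a widened match copy for LZ decoding might look like the sketch below (the idea only; production decoders add over-copy slack and handle short offsets with cleverer patterns):

    Code:
    #include <stdint.h>
    #include <string.h>

    /* copy a match of 'len' bytes from 'offset' bytes back in the output */
    static void copy_match(uint8_t *dst, size_t offset, size_t len)
    {
        const uint8_t *src = dst - offset;
        if (offset >= 8) {
            while (len >= 8) {                 /* 8 bytes per iteration */
                uint64_t v;
                memcpy(&v, src, 8);
                memcpy(dst, &v, 8);
                src += 8; dst += 8; len -= 8;
            }
        }
        while (len--)                          /* tail, or overlapping match */
            *dst++ = *src++;
    }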
    Last edited by Shelwien; 17th May 2008 at 04:00.

  13. #13
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Those two versions are 530 and 500 Mbyte/s (second version being slower).

  14. #14
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    > Ah... Unfortunately there's no time on my (very slow) target platform for that.

    There is time on any platform, depending on the implementation.
    Since CM usually results in symmetric comp/decomp, and since I am on a very slow platform where decomp is all that matters, that's what I meant by "no time for that on my platform".

    Thanks for your thoughts.

  15. #15
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Then again, nobody said that unrolling and vectorization and prefetching couldn't be used for LZ decoding.
    Actually, I did. I have tested several unrolled versions for copying matches, but I didn't succeed. Although, in the old days I had some really good results on my ancient Pentium-2 400 MHz. New CPUs are really different (like mine: a Core 2 at 2.2 GHz). With my knowledge, I cannot make a huge difference with assembly compared to pure C/C++ code. So, currently I'm working on implementations. The optimization comes at the end.

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > > Then again, nobody said that unrolling and vectorization and prefetching couldn't be used for LZ decoding.
    > Actually, I did. I have tested several unrolled versions for
    > copying matches, but I didn't succeed.

    Well, I wasn't talking about copy routines (that's obvious), but about things like:
    * aligned reading, and calculating the hashes for multiple positions at once
    * keeping some useful substrings in registers
    * branch removal by vector instructions (like min/max and select)
    * rangecoder i/o optimizations - like what IntelC does for fpaq0v4B
    * rangecoder multiplication removal by range splitting
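    As a rough C sketch of the first and third items (little-endian assumed; the hash constant and widths are arbitrary example values):

    Code:
    #include <stdint.h>
    #include <string.h>

    #define HASH(x) (uint32_t)(((x) * 2654435761u) >> 20)   /* 12-bit hash */

    /* one wide load gives the 4-byte windows at p, p+1, p+2, p+3,
       so several hash-table slots are computed per memory access */
    static void hash4(const uint8_t *p, uint32_t h[4])
    {
        uint64_t w;
        int i;
        memcpy(&w, p, 8);                      /* single 8-byte read */
        for (i = 0; i < 4; i++)
            h[i] = HASH((uint32_t)(w >> (8 * i)));
    }

    /* branchless select, the scalar cousin of the vector min/max trick:
       pick between two match candidates without a conditional jump */
    static uint32_t select_longer(uint32_t len_a, uint32_t pos_a,
                                  uint32_t len_b, uint32_t pos_b)
    {
        uint32_t mask = (uint32_t)-(int32_t)(len_a >= len_b);  /* all 1s or 0s */
        return (pos_a & mask) | (pos_b & ~mask);
    }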

    > Although, in the old days I had some really good results on my ancient Pentium-2 400 MHz.
    > New CPUs are really different (like mine: a Core 2 at 2.2 GHz).

    Not that much is different actually... compilers still do weird things instead of optimization.

    > With my knowledge, I cannot make a huge difference with assembly compared to pure C/C++ code.

    I think it's a matter of tools, not knowledge.
    Only compilers are able to use global optimization techniques and
    automated vectorization, and to adjust the inlined code to the environment.
    But only their developers know how to force them to do something specific (if anybody knows at all).
    So even if you see how to improve some code block (and compiler-generated code
    still looks ugly most of the time) - it's mostly a waste of time, as that particular code block
    would probably disappear after any source modification.

    So it's still possible to manually write faster code (if only because compilers don't know
    how to use the stack and flags), but then you basically have to write it all manually -
    the approach with asm inlines only for bottlenecks doesn't work anymore, as compiler
    optimization around these asm inlines deteriorates significantly.

    > So, currently I'm working on implementations. The optimization comes at the end

    Language syntax has a major effect on the choice of algorithms and data structures.
    So it might make sense to write a (simple and slow) assembly implementation first,
    and then port the algorithm to C++ or whatever (for automated optimizations).
    Sounds reversed, but after all, a compiler is a code management tool which lets you
    automate some routine optimizations - just write your own if you can't deal with the existing ones.

  17. #17
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Trixter
    Ah... Unfortunately there's no time on my (very slow) target platform for that.
    Heh. You're a bit hardcore here. Too bad we aren't all that dedicated.

    Apparently Intel and AMD stopped optimizing the string instructions some time ago.
    At least beginning with the 486 ("fully pipelined").

    Code:
    sub ecx,1
    mov [edi],al
    lea edi,[edi+1]
    I'm no optimization guru, but why "sub ecx,1" (better on P4 because "dec" only partially updates the flags) and then something entirely different like "lea edi,[edi+1]"? Unless this is for code alignment, why not do the obvious "add edi,1"?

    New CPUs are really different (like mine: a Core 2 at 2.2 GHz). With my knowledge, I cannot make
    a huge difference with assembly compared to pure C/C++ code.
    Not even all Core 2s are the same, esp. with newer models supposedly being 30% faster (ah, marketing, gotta love 'em). But you can indeed optimize for 'em. See here for an example (w/ FASM src).
