
Thread: bsc's qlfc postcoder

  1. #1
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    bsc's qlfc postcoder

    http://nishi.dreamhosters.com/u/bsc240_qlfc_v0.rar

    Looks like I've been too optimistic about it.
    1. Actually there are 2-3 rc calls per byte; <1 is only possible with something highly redundant.
    2. There's no huge speed difference with (much simpler) direct CMs.


    Code:
    book1bwt:
    l = 768771 bytes = 6150168 bits, rc_count=2152534
    
    enwik8bwt:
    l = 100000004 bytes = 800000032 bits, rc_count=214320941
    
    book1bwt  enwik8bwt  enctime dectime
    
    216816    21334589    6.344s  6.688s // ic110_no_PGO, mixtest_v2, static linear mix
    216098    21204128    6.797s  7.156s // ic110_no_PGO, mixtest_v2, adaptive linear mix
    
    214695    21065497    9.954s 10.609s // ic110_no_PGO, mixtest_v3, adaptive logistic mix
    
    216838    21337208    6.015s  6.797s // ic111, o01_v0, static linear mix
    
    212088    20787423    5.813s  6.468s // ic110, bsc 2.4.0 "slow" qlfc
                          5.703s  6.500s // ic111
                          8.094s  8.000s // ic111_no_PGO
                          6.891s  8.297s // gcc450

  2. #2
    Administrator Shelwien
    I tried restoring ShiftLow() in bsc's rangecoder back to the original version (w/o the 64-bit comparison),
    and got a strange result - my way there are fewer instructions, but overall speed is exactly the same:
    http://nishi.dreamhosters.com/u/bsc_shiftlow.png
    So which one is better in the end?..

    Code:
      long long ari_low;
      uint ari_code;
      uint ari_ffnum;
      uint ari_cache;
      uint ari_range;
    
    // bsc version -> 2.asm
      void ShiftLow() {
        if( (ari_low^0xff000000) > 0xffffff ) {
          OutputByte((byte)(ari_cache + (ari_low >> 32)));
          int c = (int)(0xff + (ari_low >> 32));
          while (ari_ffnum) OutputByte(c), ari_ffnum--;
          ari_cache = (uint)(ari_low) >> 24;
        } else ari_ffnum++;
        ari_low = (uint)ari_low << 8;
      }
    
    // sh_v1m version -> 1.asm
      void ShiftLow() {
        uint Carry = uint(ari_low >> 32);
        uint low = uint(ari_low);
        if( low<0xFF000000U || Carry ) {
          OutputByte( ari_cache + Carry );
          uint c = Carry-1;
          while( ari_ffnum ) OutputByte(c), ari_ffnum--;
          ari_cache = low >> 24;
        } else ari_ffnum++;
        ari_low = low<<8;
      }
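    As a sanity check, the two variants can be driven side by side with the same sequence of low updates and compared byte-for-byte. A minimal harness (my own sketch, not part of bsc: OutputByte() is modeled as appending to a vector, and pseudorandom deltas stand in for the encoder's low updates) might look like:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Cross-check harness for the two ShiftLow() variants quoted above.
struct RC {
    unsigned long long ari_low = 0;
    uint32_t ari_ffnum = 0, ari_cache = 0;
    std::vector<unsigned char> out;
    void OutputByte(unsigned char b) { out.push_back(b); }

    void ShiftLowBsc() {                      // bsc version
        if ((ari_low ^ 0xff000000ULL) > 0xffffffULL) {
            OutputByte((unsigned char)(ari_cache + (ari_low >> 32)));
            int c = (int)(0xff + (ari_low >> 32));
            while (ari_ffnum) OutputByte((unsigned char)c), ari_ffnum--;
            ari_cache = (uint32_t)(ari_low) >> 24;
        } else ari_ffnum++;
        ari_low = (uint32_t)ari_low << 8;
    }
    void ShiftLowSh() {                       // sh_v1m version
        uint32_t Carry = uint32_t(ari_low >> 32);
        uint32_t low = uint32_t(ari_low);
        if (low < 0xFF000000U || Carry) {
            OutputByte((unsigned char)(ari_cache + Carry));
            uint32_t c = Carry - 1;           // 0xFF if no carry, 0x00 if carry
            while (ari_ffnum) OutputByte((unsigned char)c), ari_ffnum--;
            ari_cache = low >> 24;
        } else ari_ffnum++;
        ari_low = (unsigned long long)(low << 8);
    }
};

// Feed both coders identical pseudorandom low increments (< 2^32, so a
// carry into bit 32 is possible) and compare the emitted byte streams.
bool equivalent(int iters) {
    RC a, b;
    uint64_t x = 12345;
    for (int i = 0; i < iters; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        uint32_t delta = (uint32_t)(x >> 32);
        a.ari_low += delta; b.ari_low += delta;
        a.ShiftLowBsc(); b.ShiftLowSh();
    }
    return a.out == b.out && a.ari_cache == b.ari_cache && a.ari_ffnum == b.ari_ffnum;
}
```

    Both variants should emit identical byte streams, which fits the observation that the instruction-count difference doesn't change behavior.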

  3. #3
    Programmer Gribok
    Join Date
    Apr 2007
    Location
    USA
    Posts
    159
    Thanks
    0
    Thanked 1 Time in 1 Post
    I already tried to use some ideas from the sh_ versions of range coders in bsc. Usually they don't affect performance. Modeling takes significant time. It would be better if you could have a look at the modeling. I believe it is possible to achieve better compression with the same performance.
    Enjoy coding, enjoy life!

  4. #4
    Administrator Shelwien
    http://nishi.dreamhosters.com/u/bsc240_qlfc_v1.rar

    > I already tried to use some ideas from sh_ versions of range coders in bsc.

    Well, still I tried a few things too:
    0. Wrapping stuff into templates: compiled code is unaffected, but the source is much easier to tweak.
    1. ShiftLow without the 64-bit xor and cmp: no visible effect, though in theory it should be better on x86.
    2. Restrict tag for i/o pointers: no effect (even in the generated code).
    3. Renorm w/o loops: speed gain with noinline and no PGO, insignificant gain without noinline and with PGO.
    4. Branchless range update (bsc_rc.inc5-): slower for encoding, as it didn't have any branches from the start,
    and for decoding, where there's a branch anyway.
    5. Storing the shifted range (bsc_rc.inc5b-): slower (more complex?) and a little extra redundancy.
    6. Renorm before range update: slower.
    7. rc state in local vars, methods in macros (src2, bsc_rc_D.inc is generated from bsc_rc.inc by def.pl):
    same encoding speed, slower decoding (because of the unrolled renorm?).

    > Usually they don't affect performance.

    That's really unfortunate actually... in my coders most of the things above help...
    probably the main loop is too complicated.
    There's still more though, but I didn't try these yet:

    3a. Manually optimizing the 2-byte renorm case.
    8. Explicit renorm calls in the model - the 0/1 branches can share the same renormalization,
    so less code -> better code caching / branch prediction.
    8a. Skipping renorms (e.g. it can be done once per 2 bits with 12-bit probabilities).
    9. 16-bit rc i/o like in http://www.ctxmodel.net/files/fpaq0pv4b3.rar
    10. Vectorized rc (http://nishi.dreamhosters.com/u/vecrc_v0.rar), although it's
    harder to implement with variable-length symbol decompositions.
    11. Multiplication-free range update (with logarithmic range and LUTs).

    > Modeling takes significant time.
    > It would be better if you could have a look at modeling.

    Sure, but I don't yet have a clear idea about what to do with these "wall of text"
    functions... at the least I have to merge encoding and decoding somehow, to make it
    possible to tweak the model without decodability problems.
    It'd be impressive if these functions were automatically generated from readable
    source, but somehow I doubt that that's the case.

    > I believe it is possible to achieve better compression with the same performance.

    At least I can try tuning the parameters with my optimizer.
    Also the mixer is pretty suspicious.

  5. #5
    Administrator Shelwien
    http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar
    - "slow" decoding/encoding merged into a single function (bsc_qlfc_slow.inc)
    - parameter optimization interface attached
    (but a test run only got me 212088 -> 211956 on book1)

    Finally it became clear that the model consists of 7 similar submodels
    looking like ((q=mix(p0,p1,p2))*3+SSEi<17>(q))/4.

    Counters are 12-bit linear with multiplication, something like I use,
    but with separate parameters for update after 0 and 1 (the negative "thresholds"
    are pretty weird though - I don't see why they can't cause a counter overflow
    and then a crash in stretch()).
    Afaik, 15-bit counters would give better results there (although I don't yet
    know how much better), but the 12-bit version is probably a speed optimization
    to reduce the size of the LUT in stretch() and avoid extra shifts.
    Well, anyway, it's better to turn the counters into state machines.
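    For reference, the general shape of such a linear counter (my own sketch with a made-up shift-based rate; bsc's actual update is multiplicative, with separate parameters per bit) is:

```cpp
#include <cstdint>

// Hypothetical 12-bit linear probability counter: p is P(bit==1)
// scaled to 0..4096, moved a fixed fraction toward the observed bit.
struct Counter12 {
    uint16_t p = 2048;                       // start at P = 0.5
    void update(int bit, int rate = 5) {
        if (bit) p += (4096 - p) >> rate;    // move toward 4096
        else     p -=  p         >> rate;    // move toward 0
    }
};
```

    With a shift rate like this the counter settles within 1<<rate of the extremes; bsc's multiplicative form lets the 0-update and 1-update speeds differ.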

    Considering that, there are some interesting context quantization tables, like
    model_rank_state_table[(contextRun<<11)|(contextRank4<<3)|(rankSizeHistory)] -> [0..255]
    (the interesting point is how they were generated), but mostly the same contexts are
    used for all counters, so I'd expect custom context quantization, done separately
    for all counters/mixers/SSE, to significantly improve the compression.
    Then again, it's a speed/ratio tradeoff, so maybe such a design was chosen
    consciously; but then there's an integrated mixer with SSE and static mixing,
    and normally they'd have to be tuned separately, so maybe not.

    The mixer is a 3-way logistic paq mixer. It could be good to tweak it for
    auto-vectorization compatibility (like using w[4] instead of w0,w1,w2), but
    I'd prefer to replace it with mix2(mix2(p0,p1),p2) - it's more precise by definition
    (there's nothing to maintain w0+w1+w2=1 in the 3-way mixer), and there are many more
    parameters to tweak, which improves results even further.
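    A mix2() of the kind suggested here, sketched with an illustrative learning rate and a clamped single weight (this is a generic paq-style logistic mixer, not any particular coder's actual code):

```cpp
#include <cmath>

// Logistic 2-input mixer: mix in the stretched (logit) domain,
// squash back to a probability, adapt the weight toward the observed bit.
double stretch(double p) { return std::log(p / (1.0 - p)); }
double squash(double t)  { return 1.0 / (1.0 + std::exp(-t)); }

struct Mix2 {
    double w = 0.5;                          // weight of the first input
    double mix(double p0, double p1) const {
        return squash(w * stretch(p0) + (1.0 - w) * stretch(p1));
    }
    void update(double p0, double p1, int bit, double lr = 0.02) {
        double p = mix(p0, p1);
        // gradient step: shift weight toward the better-predicting input
        w += lr * (bit - p) * (stretch(p0) - stretch(p1));
        if (w < 0) w = 0;
        if (w > 1) w = 1;
    }
};
```

    Nesting two of these as mix2(mix2(p0,p1),p2) gives two independently adaptable weights instead of three constrained ones, which is the precision argument made above.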

    And then, it's nice to see an interpolated probability mapping, but as usual
    there's no interpolation in the update, and the fixed 1/4 weight for it is
    no fun at all.

  6. #6
    Programmer Gribok
    Quote Originally Posted by Shelwien View Post
    Counters are 12-bit linear with multiplication, something like I use, but with separate parameters for update after 0 and 1 (the negative "thresholds" are pretty weird though - I don't see why they can't cause a counter overflow and then a crash in stretch()).
    The thresholds are selected in a way such that overflow is not possible.

    Quote Originally Posted by Shelwien View Post
    Afaik, 15-bit counters would give better results there (although I don't yet know how much better), but the 12-bit version is probably a speed optimization to reduce the size of the LUT in stretch() and avoid extra shifts. Well, anyway, it's better to turn the counters into state machines.
    The 12-bit counters were selected as a speed optimization. I already tried mapping the counters to a 12-bit lookup table, but it is slower.

    Quote Originally Posted by Shelwien View Post
    Considering that, there are some interesting context quantization tables, like
    model_rank_state_table[(contextRun<<11)|(contextRank4<<3)|(rankSizeHistory)] -> [0..255] (the interesting point is how they were generated), but mostly the same contexts are used for all counters, so I'd expect custom context quantization, done separately for all counters/mixers/SSE, to significantly improve the compression.
    I use the following idea:
    1. Collect statistics like [Pos, Bit] for all contexts.
    2. Use a greedy approach for the next steps.
    3. In each step, I try to take two contexts and merge them into a single one. During the merge I sort the statistics by Pos and then use adaptive arithmetic coding of the Bits to measure the compression.
    4. Stop merging when the number of contexts is <= 256.
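    A simplified sketch of that merging loop (my own illustration with hypothetical names, and a static order-0 entropy standing in for the described sort-by-Pos + adaptive-arithmetic cost, so each context keeps only its 0/1 counts):

```cpp
#include <cmath>
#include <vector>

// One context's collected bit statistics, reduced to counts for the sketch.
struct Ctx { double n0 = 0, n1 = 0; };

// Order-0 cost in bits to code a context's zeros and ones.
double cost(const Ctx& c) {
    double n = c.n0 + c.n1;
    if (c.n0 == 0 || c.n1 == 0 || n == 0) return 0;
    return -c.n0 * std::log2(c.n0 / n) - c.n1 * std::log2(c.n1 / n);
}

// Greedily merge the pair of contexts whose merge costs the fewest
// extra bits, until at most 'limit' contexts remain.
void merge_down_to(std::vector<Ctx>& ctx, size_t limit) {
    while (ctx.size() > limit && ctx.size() >= 2) {
        size_t bi = 0, bj = 1;
        double best = 1e300;
        for (size_t i = 0; i < ctx.size(); i++)
            for (size_t j = i + 1; j < ctx.size(); j++) {
                Ctx m{ctx[i].n0 + ctx[j].n0, ctx[i].n1 + ctx[j].n1};
                double d = cost(m) - cost(ctx[i]) - cost(ctx[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        ctx[bi].n0 += ctx[bj].n0; ctx[bi].n1 += ctx[bj].n1;
        ctx.erase(ctx.begin() + bj);
    }
}
```

    The real measurement replays the merged [Pos, Bit] statistics through an adaptive coder, which also accounts for how quickly a counter adapts inside the merged context.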

    Quote Originally Posted by Shelwien View Post
    Then again, it's a speed/ratio tradeoff, so maybe such a design was chosen
    consciously; but then there's an integrated mixer with SSE and static mixing,
    and normally they'd have to be tuned separately, so maybe not.
    I have tried many ideas. But if an idea adds little value, I always prefer to have something faster.

    Quote Originally Posted by Shelwien View Post
    The mixer is a 3-way logistic paq mixer. It could be good to tweak it for
    auto-vectorization compatibility (like using w[4] instead of w0,w1,w2), but
    I'd prefer to replace it with mix2(mix2(p0,p1),p2) - it's more precise by definition
    (there's nothing to maintain w0+w1+w2=1 in the 3-way mixer), and there are many more
    parameters to tweak, which improves results even further.
    This is a good idea, I will try it.
    Enjoy coding, enjoy life!

  7. #7
    Administrator Shelwien
    Btw, one thing which I always notice is this:
    Code:
    for( runSize=1; (i<n) && (input[i]==currentChar); ++i ) runSize++;
    I'd replace it with

    Code:
    input[n]=~input[n-1];
    [...]
    for( runSize=1; (input[i]==currentChar); ++i ) runSize++;
    ...or, actually
    Code:
    input[n]=~input[n-1];
    [...]
    byte* restrict p = &input[i];
    while( *p==currentChar ) p++;
    runSize = p-(&input[i])+1; i+=runSize-1;
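    For completeness, a compilable sketch of the sentinel trick (hypothetical helper names; the caller is assumed to own the byte past the end, so input[n] is writable):

```cpp
#include <cstddef>

// Bounds-checked scan: length of the run of 'c' starting at 'start'.
size_t run_checked(const unsigned char* in, size_t n, size_t start, unsigned char c) {
    size_t i = start;
    while (i < n && in[i] == c) ++i;
    return i - start;
}

// Sentinel scan: requires in[n] != c (e.g. the caller wrote
// in[n] = ~in[n-1]), so the i<n test disappears from the loop.
size_t run_sentinel(const unsigned char* in, size_t start, unsigned char c) {
    const unsigned char* p = in + start;
    while (*p == c) ++p;
    return (size_t)(p - (in + start));
}
```

    The sentinel guarantees termination even for a run that reaches the end of the buffer, which is exactly what the i<n test was paying for on every iteration.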
