
Thread: CPU Branch Prediction as an entropy model

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    CPU Branch Prediction as an entropy model

    Quote Originally Posted by wikipedia
    "The AMD Ryzen processor, previewed on December, 13th, 2016, revealed its newest processor architecture using a neural network based branch predictor to minimize prediction errors."
    So here's a "revolutionary" entropy coder based on branch prediction: http://nishi.dreamhosters.com/u/BPCM_v0.rar
    Code:
    sample_0: 256 to 79
    sample_1: 256 to 77
    sample_2: 232 to 227
    sample_0 is 256 x '\x00'
    sample_1 is 256 x '1'
    sample_2 is c.bat

    The method is based on timing (with rdtsc) a function like this:
    Code:
      // scan the window and branch on whether each byte matches cc;
      // the counts are returned, but the point is the branch pattern itself
      uint estimate( byte* p, uint l, uint cc ) {
        uint i,f0,f1;
        for( i=0,f0=0,f1=0; i<l; i++ ) {
          if( p[i]!=cc ) ++f1; // else ++f0;
        }
        return (f0<<16)|f1;
      }
    This function processes the window buffer with cc=0..255 and takes branches based on symbol matches
    (it actually jumps when p[i]==cc - the compiler puts a JZ instruction there).
    The branch prediction logic is supposed to "learn" the patterns, so for more probable symbols the function should run faster
    (faster timings are then turned into higher probabilities).
    It kinda even works when it's all the same symbol, but not so well with real files (it can still compress them though).
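    The timing part is roughly like this (a simplified sketch, not the actual BPCM source; time_symbol and the pass count are made up for illustration):
    Code:
      #include <stdint.h>
      #include <x86intrin.h>                    // __rdtsc() (GCC/Clang; MSVC has <intrin.h>)

      typedef unsigned char byte;
      typedef unsigned int  uint;

      uint estimate( byte* p, uint l, uint cc );  // the branchy scan shown above

      // Time estimate() for one symbol value; keep the minimum over several passes
      // so that cold-cache effects and interrupts mostly cancel out.
      uint64_t time_symbol( byte* win, uint l, uint cc, int passes ) {
        uint64_t best = ~0ull;
        for( int k=0; k<passes; k++ ) {
          uint64_t t0 = __rdtsc();
          estimate( win, l, cc );
          uint64_t t1 = __rdtsc();
          if( t1-t0 < best ) best = t1-t0;
        }
        return best;  // smaller timing -> fewer mispredictions -> higher probability
      }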

    My results are for i7 930, so it might be interesting to see what happens on a newer cpu.

  2. The Following 5 Users Say Thank You to Shelwien For This Useful Post:

    encode (16th December 2016),Mike (17th December 2016),quadro (17th December 2016),RamiroCruzo (17th December 2016),xinix (18th December 2016)

  3. #2
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    AMD has been using Neural Net branch predictors for years, e.g. in the Bobcat/Jaguar APU cores (http://amd-dev.wpengine.netdna-cdn.c.../10/apu101.pdf page 34). I believe they already used them in some older Opterons as well, but I don't have a reference to hand.

    What's changed between 2011 and now is that Neural Nets are very "in" right now, so they get name-dropped, whereas they weren't that fashionable a few years ago.

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Yeah, from what I've read, the branch predictors in Intel CPUs have also basically been a CM since Intel Core or so -
    recording the history of branches taken/not taken, detecting patterns in that history (loops etc.) and estimating the probability.
    So I thought that maybe they finally added context mixing to that CM in Zen (that would fit with the neural network concept,
    because the logistic activation function in a NN is the same thing as the paq mixer).

    Quote Originally Posted by wikipedia
    The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor.
    In any case, this coder should compress better with an improved branch predictor.
    I'd appreciate any ideas on improving the estimation function though - atm it doesn't seem like the branch predictor
    is able to estimate the probability of a branch being taken even roughly.
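    For reference, by "logistic activation = paq mixer" I mean something like this (a minimal floating-point sketch of logistic mixing, not the actual paq code):
    Code:
      #include <math.h>

      // paq-style logistic mixing, minimal sketch:
      // stretch the input probabilities, take a weighted sum, squash back.
      double stretch( double p ) { return log( p/(1-p) ); }
      double squash ( double x ) { return 1/(1+exp(-x)); }

      double mix( const double* p, const double* w, int n ) {
        double x = 0;
        for( int i=0; i<n; i++ ) x += w[i]*stretch(p[i]);
        return squash(x);                         // mixed probability of bit==1
      }

      // after coding bit y (0/1), the weights move along the gradient of coding cost
      void update( const double* p, double* w, int n, double p_mix, int y, double rate ) {
        for( int i=0; i<n; i++ ) w[i] += rate * (y - p_mix) * stretch(p[i]);
      }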

  5. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (18th December 2016)

  6. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    For example, these are results for files of the form 0...010...01, i.e. substr(('0' x $n . '1') x 10000, 0, 10000):
    Code:
     1: 01010101010101010101010101010101010101010...
     2: 00100100100100100100100100100100100100100...
     3: 00010001000100010001000100010001000100010...
     4: 00001000010000100001000010000100001000010...
     5: 00000100000100000100000100000100000100000...
     6: 00000010000001000000100000010000001000000...
     7: 00000001000000010000000100000001000000010...
     8: 00000000100000000100000000100000000100000...
     9: 00000000010000000001000000000100000000010...


    This looks like the actual branch history has length 4-5, so loops of length 6..15 are not handled efficiently,
    and then from 16 on the loop predictor starts working (based on distances between jumps).
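    (For reproduction, the patterns above can be generated with something like this - equivalent to the Perl expression; the file names are just an example:)
    Code:
      #include <stdio.h>

      // 10000-byte files: n zeros followed by a '1', repeated, same as
      // substr(('0' x $n . '1') x 10000, 0, 10000)
      int main() {
        for( int n=1; n<=9; n++ ) {
          char name[32];
          sprintf( name, "loop_%d.txt", n );
          FILE* f = fopen( name, "wb" );
          for( int i=0; i<10000; i++ ) putc( (i%(n+1))==n ? '1' : '0', f );
          fclose( f );
        }
        return 0;
      }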

  7. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    RamiroCruzo (19th December 2016),xinix (18th December 2016)

  8. #5
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    If you want to predict bits with the branch predictor, you should make your branches correspond more directly with the bits you want to predict.

    Most branch predictors use a global history of the last N branches for context, along with the address of the branch instruction. In the "estimate" function you gave, fully half the branches are just for the outer loop over the array (meaning those bits of the history contain essentially zero pertinent information), and the rest are just the indicator function for the question "is x[i] == cc or not?", which a) is likely to be a very rarely taken branch (below the probability resolution of typical predictors), and b) virtually ensures that the (very relevant) information of "if it wasn't cc, what was it?" is not to be found anywhere even in very long prediction histories.

    Separately, trying to divine branch predictor behavior from RDTSCs isn't exactly a convenient testing setup. Just use an actual branch predictor! There are regular competitions and the entries come with (C++) source code.
    http://www.jilp.org/cbp2016/program.html
    http://www.jilp.org/cbp2014/program.html
    http://www.jilp.org/jwac-2/program/JWAC-2-program.htm
    http://hpca23.cse.tamu.edu/taco/camino/cbp2/source.html (papers here http://hpca23.cse.tamu.edu/taco/camino/cbp2/CBP-2.pdf)
    http://www.jilp.org/cbp/Agenda-and-Results.htm

  9. The Following 3 Users Say Thank You to fgiesen For This Useful Post:

    Bulat Ziganshin (17th December 2016),Cyan (17th December 2016),Mike (17th December 2016)

  10. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Most branch predictors use a global history of the last N branches for context,
    > along with the address of the branch instruction.

    I'm running the estimation function multiple times over the window, separately
    for each byte value. In theory, if the BP is able to predict patterns in a symbol's occurrences,
    the TSC timing for that symbol would be smaller, and thus the assigned probability
    would be higher. Currently the model uses a 64-byte window and does 8 passes for each symbol value -
    imho that should be enough to flush the caches.

    > In the "estimate" function you gave, fully half the branches are just for the outer loop over the array

    Yes, but these are the same in all iterations, and min tsc is calculated and subtracted from all timings.

    I guess I could try eliminating the outer loop branch by generating
    unrolled code with the symbol branches aligned in a way that causes branch-cache
    collisions, but that would depend too much on the specific branch-cache design.

    > and the rest are just the indicator function for the question "is x[i] == cc or not?", which is
    > a) likely to be be a very rarely taken branch (below the probability resolution of typical predictors),

    If the CPU branch predictor really estimated the branch-taking probability with any sane precision
    (and I mean more than 2 bits), that could be enough for an order0 estimate of a symbol's probability.

    Again,
    1) estimate() is timed for all byte values 0..255
    2) compression actually _works_ - the coder doesn't blow up compressible files.

    > b) virtually ensures that the (very relevant) information of "if it wasn't cc, what was it?"
    > is not to be found anywhere even in very long prediction histories.

    Again, estimate() is called for every byte value separately, so I'm basically working with 256 binary strings of (c==cc).
    Still, compression works, and although the results are not repeatable to the byte, they are stable enough to
    show the correlation with the file's contents.

    > Separately, trying to divine branch predictor behavior from RDTSCs isn't exactly a convenient testing setup.

    That depends on the goal imho.
    As it is, I think this coder could be useful to compare branch predictors on significantly different x86 cpus.

    And originally it was just a joke - "if there's a hardware context model in the cpu, why not use it for
    compression instead of paq".

    Thanks for the links - I didn't know about these competitions,
    but the idea here wasn't to build a better branch predictor or anything -
    imho there are too many restrictions there for it to be interesting.

  11. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    RamiroCruzo (17th December 2016),xinix (18th December 2016)

  12. #7
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by Shelwien View Post
    I'm running the estimation function multiple times over the window, separately
    for each byte value. In theory, if BP is able to predict patterns in the symbol's occurences,
    Yeah, but that's just the thing, the way you end up encoding the data into branch decisions wipes out most of the information that a branch predictor actually does modelling on.

    The number one thing that most branch predictors you're actually going to see in HW try to pick up on is correlations between the branch history and the to-be-predicted branch. The way your scheme "encodes" bytes into branch decisions destroys most of the useful information in there, because you're essentially forcing the branch predictor to predict the positions of byte value cc, without having any actual context. It *will* be able to pick up patterns when say a given byte value occurs every 4 positions, but not the more interesting case of characters that tend to occur grouped together etc.

    Suppose you have a lexical scanner that contains the check "if (cur_char == '\r' || cur_char == '\n')". When processing Windows-style text files, a typical branch predictor will pick up on the fact that the presence of a LF character is highly correlated with the presence of a CR character in the previous loop iteration. But the line lengths themselves are fairly random, and if you were to first predict the positions of all CRs in a block multiple times, and then predict the positions of all LFs in a block multiple times, it wouldn't work very well. I'm not saying it wouldn't work at all - it obviously does - just that it's not a useful way to gauge the capabilities of the hardware.
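    (To make that concrete, a toy version of such a scanner loop - not code from any actual project:)
    Code:
      // toy scanner: on Windows-style (CRLF) text, whenever the '\r' test is taken
      // in one iteration, the '\n' test is taken in the next, so a history-based
      // predictor learns the correlation even though line lengths are random.
      int count_lines( const char* buf, int len ) {
        int lines = 0;
        for( int i=0; i<len; i++ ) {
          char cur_char = buf[i];
          if( cur_char == '\r' || cur_char == '\n' ) {
            if( cur_char == '\n' ) lines++;        // count LFs as line ends
          }
        }
        return lines;
      }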

    Quote Originally Posted by Shelwien View Post
    > In the "estimate" function you gave, fully half the branches are just for the outer loop over the array

    Yes, but these are the same in all iterations, and min tsc is calculated and subtracted from all timings.
    Sure, they're always the same, but having half of the (finite-size!) branch history filled with things that have nothing to do with the data decreases efficiency further.

    Quote Originally Posted by Shelwien View Post
    If cpu branch predictor really estimated the branch-taking probability with any sane precision
    (and I mean not 2 bits), there could be enough for a order0 symbol's probability estimation.
    Why would it? Not a rhetorical question: what's the use of a high-resolution probability estimate here?

    A branch predictor isn't generating probability estimates for an arithmetic coder. Its end result is a single bit, "taken" or "not taken", which feeds into a bunch of machinery that ultimately decides which address to continue fetching instructions at.

    For an arithmetic coding back-end, the difference between a coin toss with a "heads" probability of P(H)=51%, and one with P(H)=99%, is substantial. For a branch predictor, it's not. It always wants to predict H in both cases (and eats a lot fewer mispredictions in the second case).

    This is important to be clear about. Say you have a "coin toss" (effectively, random order-0) source that's 55% H (heads), 45% T (tails). In that sequence, the goal of the ideal branch predictor is to predict H (the most likely outcome) 100% of the time; if its prediction is a sequence that has 55% H and 45% T, it's doing a really bad job, since that sequence is virtually guaranteed to be completely uncorrelated with the actual sequence of events (expected hit rate 0.55² + 0.45² ≈ 50.5%, versus 55% for always predicting H).

    Branch predictors have relatively large histories (often >100 branches long, albeit subsampled), and are trying to find context models that produce near-certain predictions. If a given context doesn't have a very good track record at predicting, it's not worth keeping around; that space would be better used on something else.

    There are also secondary predictors that purely keep track of the overall miss rate of a given predictor; this is sometimes called "statistical correction", and it is relevant to the coin-toss example I mentioned. Basically, whenever you have a branch where the context-model part of the predictor appears to be generating a lot of misses, it's likely trying to mine structure from noise. So for those branches, you keep track simply of whether they're biased one way or the other, mostly ignore the context, and always predict the most likely (in the past) outcome.
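    To make the "single bit out" part concrete, here's what a toy gshare-style predictor looks like (a textbook sketch, not any particular CPU's design):
    Code:
      #include <stdint.h>

      // Toy gshare: the global history XORed with the branch address indexes a
      // table of 2-bit saturating counters; the prediction is just the top bit.
      struct Gshare {
        uint8_t  ctr[4096] = {};   // 2-bit counters, start at 0 ("strongly not taken")
        uint32_t hist = 0;         // global history of recent outcomes

        int predict( uint32_t pc ) const {
          return ctr[(pc ^ hist) & 4095] >= 2;       // single bit: taken / not taken
        }
        void update( uint32_t pc, int taken ) {
          uint8_t& c = ctr[(pc ^ hist) & 4095];
          if( taken  && c < 3 ) c++;
          if( !taken && c > 0 ) c--;
          hist = (hist<<1 | taken) & 4095;           // shift the outcome into history
        }
      };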

    Quote Originally Posted by Shelwien View Post
    That depends on the goal imho.
    As it is, I think this coder could be useful to compare branch predictors on significantly different x86 cpus.
    I agree that yes, this is a thing you can measure, but very much disagree that it's a useful thing to measure, or that it tells you anything relevant about the branch predictor (or how well it does context modelling).

    A sensible thing to compare branch predictors on is branch traces of actual programs (which is what CBP does).
    Running your program will indeed tell you something about the branch predictor in a CPU, and I'm willing to believe you that it's reasonably repeatable, but I flat-out disagree about it being a useful way to compare branch predictors.

    If you want to know how good a branch predictor is at predicting a bit stream, then I believe that the right way to do that is by feeding in the bit stream as branch history and seeing what it produces. Turning it into a much larger set of indicator functions, thus destroying most of the structure that the built-in context model actually tries to key off of, is still an experiment you can run, but it's not an experiment that tells you much about anything.

  13. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    i7-4790
    Code:
    sample_0: 256 to 72
    sample_1: 256 to 72
    sample_2: 232 to 218

  14. The Following User Says Thank You to encode For This Useful Post:

    Shelwien (17th December 2016)

  15. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The way your scheme "encodes" bytes into branch decisions destroys most of
    > the useful information in there, because you're essentially forcing the
    > branch predictor to predict the positions of byte value cc, without having
    > any actual context.

    Yes, but it's not totally wrong even completely without context - it could work
    as an order0 model at least.
    Also, sure, maybe half of the bits of the global branch history are taken up
    by the loop branch, but the other half still remains?
    Also, I'm pretty sure that there's a local per-branch history as well nowadays.

    > It *will* be able to pick up patterns when say a given byte value occurs
    > every 4 positions, but not the more interesting case of characters that
    > tend to occur grouped together etc.

    Somehow it's actually pretty bad in the "every 4 positions" case, but gets better
    in the "every 16+ positions" case - see the image http://encode.ru/threads/?p=51181&pp=1

    > I'm not saying it wouldn't work at all - it obviously does - just that it's
    > not a useful way to gauge the capabilities of the hardware.

    Yes, it's unlikely to be better than a real order0 coder - it's still much worse
    than that right now though.

    Atm I have 4 ideas about improving it:

    1. Generating the code for estimate(), which would have a single branch per page,
    and NOP padding in-between.

    2. Converting bytewise data to 255 bitstreams first.

    3. Adding CPU load in the c!=cc case:
    Code:
        volatile uint f0,f1;
        for( i=0,f0=0x12345678,f1=0x12345678; i<l; i++ ) {
          if( p[i]!=cc ) f0=f1+f0/f1,f0=f1+f0/f1,f0=f1+f0/f1,f0=f1+f0/f1,++f1; // else ++f1;
        }
    4. Using some better method to convert timings to probabilities,
    likely based on a logistic function; currently it's just
    Code:
        for( i=1; i<=256; i++ ) tim[i] = qword(t_max+1-t_min)*64/(tim[i]+1-t_min)+1; // (tim[i]-t_min)/(winsize/32)+1;
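    For example, something along these lines (just a sketch; the steepness constant is an assumption and would need tuning):
    Code:
      #include <math.h>
      #include <stdint.h>
      typedef unsigned int uint;

      // Sketch of idea 4: map a timing into a frequency through a logistic curve,
      // so that a few fast/slow outliers don't dominate the estimate.
      uint tim2freq( uint64_t t, uint64_t t_min, uint64_t t_max, double steepness ) {
        double d = double(t-t_min) / double(t_max+1-t_min);   // 0..1, 0 = fastest
        double p = 1.0 / (1.0 + exp( (d-0.5)*steepness ));    // fast -> close to 1
        return uint(p*4096) + 1;                              // frequency for the coder
      }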
    > A branch predictor isn't generating probability estimates for an arithmetic
    > coder. Its end result is a single bit, "taken" or "not taken", which feeds
    > into a bunch of machinery that ultimately decides which address to continue
    > fetching instructions at.

    Even for a single bit, it's likely generating at least 2 bits internally - see https://en.wikipedia.org/wiki/Branch...rating_counter

    Also, even if there was no prediction at all, just more clocks for a taken branch,
    that would already be the equivalent of an occurrence counter.

    And again, look at the graph - compression smoothly improves with fewer
    occurrences of '1'. http://encode.ru/threads/?p=51181&pp=1

    > So for those branches, you keep track simply of whether they're biased one
    > way or the other, mostly ignore the context, and always predict the most
    > likely (in the past) outcome.

    Uh, not quite.
    1. Remember, it takes into account the clocks of _all_ branches for a given symbol,
    so the timings would also be smaller depending on the prediction quality for previous
    symbols in the window. As I said, ideally it could perform like order0, or a little better.

    2. It's not a fact that there are only 2 decisions for every branch.
    There can also be "speculative execution", where both cases are processed at once
    and one is then discarded - it could be enabled based on the (un)certainty of the BP prediction.

    > I agree that yes, this is a thing you can measure, but very much disagree
    > that it's a useful thing to measure, or that it tells you anything relevant
    > about the branch predictor (or how well it does context modelling).

    And I agree that this coder can be improved, but disagree that it's useless atm -
    let's just wait for @encode's results of the test with the loop files.

    Btw, files are here: http://nishi.dreamhosters.com/u/bpc_testfiles.rar
    Run
    Code:
    for %a in (*.txt) do BPC.exe c %a %~na.bpc
    Then
    Code:
    for %a in (*.bpc) do echo %~za >>results.txt
    to store results.

  16. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (18th December 2016)

  17. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Got results for an i7-3820 ("Sandy Bridge-E") first, thanks to kampaster. They seem about the same as encode's (i7-4790, "Haswell-DT") though.


    Compare with i7-930 ("Bloomfield")


    Seems like a longer context (12 bits vs 5?) and much better results overall.

    Also, it seems like both have special handling for the "0101010101..." case (x=1 on the graphs).

  18. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (18th December 2016)

  19. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    i7-4790
    Code:
    results.txt:
    1814 
    4206 
    3790 
    3232 
    2960 
    2738 
    2407 
    2330 
    2247 
    2278 
    2141 
    2109 
    2053 
    2090 
    2140 
    1995 
    2092 
    1913 
    2322 
    2239 
    2530 
    2120 
    2305 
    2332 
    2221 
    2045 
    2005 
    1999 
    1998 
    1920 
    1939 
    196608

  20. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Thanks, this seems even better - I guess the history is much longer.

  21. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    encode (17th December 2016),xinix (18th December 2016)
