
Thread: Mixer vs 2D SSE

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Mixer vs 2D SSE

    http://shelwien.googlepages.com/sse2.htm
    http://shelwien.googlepages.com/order_test5.rar

    A good example of how SSE benefits from a stable
    history-to-probability mapping.
    Well, actually SSE2 would be better than the mixer
    anyway; it's just that the parameters were tuned to
    wcc386 only (and the context is totally wrong).
    But at least it proves that SSE is less stable
    than my mixer in a similar environment.

    Edit: you can compare it with
    http://shelwien.googlepages.com/o0c.htm
    and http://shelwien.googlepages.com/fpaq0f2.htm

  2. #2
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    My experiments yesterday confirm that. These (2d) SSE maps are only well suited for a stable input distribution (by models) - but that's nothing new. Varying mappings don't pay off (setting a previous SSE map as input). Your implementation is order 210? Tell me, so I can post my version - and see how well it does. Aren't just 7x7 vertices a bit too few? And about your linear mapping... I use a 33x33 grid with nonlinear scaling (the gradient of the scaling function increases near p=0/1).

    EDIT: it is order 210; why didn't I look at the complete source first?
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > My experiments yesterday confirm that.

    Well, you were talking about that, so I decided to post an example.

    > These (2d) SSE maps are only well suited
    > for a stable input distribution (by models) - but that's nothing new

    That's not exactly right. My SSE works the same way as that mixer, so
    it's only less stable because it has more variables. Normally it's only
    slightly worse or better than a mixer. But it's much slower and can
    be significantly better only in conjunction with counters with a fixed
    update.

    > Varying mappings don't pay off (setting a previous SSE map as input).

    Usually it still allows you to gain something, but not much, that's for sure.
    And context tuning might be necessary, as it's not really a direct mixer
    counterpart, but I'm leaving that for later.

    > Aren't just 7x7 vertices a bit too few?

    Actually I never needed more than SSE<6>; sometimes even SSE<4> is enough.
    And it should be 6x6, not 7x7, in the 2xSSE2 version actually... forgot to
    restore it after testing...

    279077 // 2 mixers
    278037 // partly optimized SSE2<4,4> instead of w1
    278047 // SSE2<5,5>
    278005 // SSE2<6,6>
    278087 // SSE2<7,7>
    278833 // SSE2 instead of w2 too (parameters copied from first SSE2)
    278327 // 2xSSE2, SSE2 parameters optimized
    280811 // SSE2<5,5> Z2
    277251 // optimized SSE2+mixer
    276973 // optimized SSE2+SSE2

    > And about your linear mapping...
    > I use 33x33 grid with nonlinear scaling
    > (gradient of the scaling function increases near p=0/1)

    Well, I'd say that your (or anybody else's, actually) SSE isn't
    very similar to mine; it might even exploit different correlations.

    Also I did try the nonlinear mapping and found that it's possible
    to squeeze a small gain out of it, but then it appeared that the
    same redundancy could be removed in a more efficient way.

    And there's another possible improvement in the form of a "wider" update
    (e.g. include C[i-1] and C[i+2], not only C[i] and C[i+1]).
    But again, for now it isn't worth the trouble.

  4. #4
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    We had a similar discussion about the number of SSE vertices, and I found the answer (again) - you simply use many more SSE contexts:

    SSE2i<6,6,0> Z1[256*256];

    So you move some information into the SSE context. I'm still experimenting with this and I'll let you know if I find something I'm confident with.

    In my experiments, coupling a 2d SSE map with anything other than two models as input always gave worse compression than a mixer.

    For me, interpolation with the rectangle bounded by x(+1), y(+1) worked well. I tried using farther vertices, with little gain in compression but a large increase in time.
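
    Roughly, the kind of 2d SSE map I mean looks like this (a simplified, untested sketch, not my actual code; the names, the fixed-point scale and the update rate are just illustrative, and the coordinate mapping here is linear - the nonlinear scaling mentioned above would replace coord()):

    Code:
    #include <cstdint>

    struct SSE2D {
      enum { GRID=33, P_BITS=12, P_ONE=1<<P_BITS };  // 33x33 vertices, 12-bit probabilities
      int v[GRID][GRID];                             // vertex probabilities
      int x0, y0, fx, fy;                            // saved between Predict() and Update()

      SSE2D() {                                      // init close to (pa+pb)/2
        for (int i=0; i<GRID; ++i)
          for (int j=0; j<GRID; ++j)
            v[i][j] = ((i+j)*(P_ONE-1))/(2*(GRID-1));
      }

      // map p (0..P_ONE-1) to a vertex index plus interpolation fraction;
      // a nonlinear scale (denser near 0/1) would replace this linear one
      static void coord( int p, int& idx, int& frac ) {
        int x = p*(GRID-1);
        idx  = x>>P_BITS;
        frac = x&(P_ONE-1);
      }

      int Predict( int pa, int pb ) {                // bilinear interpolation of 4 vertices
        coord(pa,x0,fx); coord(pb,y0,fy);
        int64_t p = (int64_t)v[x0  ][y0  ]*(P_ONE-fx)*(P_ONE-fy)
                  + (int64_t)v[x0+1][y0  ]*fx        *(P_ONE-fy)
                  + (int64_t)v[x0  ][y0+1]*(P_ONE-fx)*fy
                  + (int64_t)v[x0+1][y0+1]*fx        *fy;
        return int(p>>(2*P_BITS));
      }

      void Update( int bit, int rate=5 ) {           // pull the 4 used vertices toward the bit
        int y = bit ? P_ONE-1 : 0;
        for (int dx=0; dx<2; ++dx)
          for (int dy=0; dy<2; ++dy) {
            int& c = v[x0+dx][y0+dy];
            c += (y-c)>>rate;
          }
      }
    };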

    BTW: I had an idea to increase mixing robustness. I haven't implemented it yet, so I don't know how well it will score.

    p = w*p_a + (1-w)*p_b
      = p_a*( w + (1-w)*p_b/p_a )
      = p_a*( (1-z)*w + z )

    z = p_b/p_a

    Now, under a (larger) context you have an average z(k+1) = alpha*z(k) + (1-alpha)*p_b/p_a

    and dependent on z you select the weight from a precalculated table, which minimizes:

    - ld( p_a*((1-z)*w + z) )

    or in other words, the minimum doesn't change if you minimize

    - ld( (1-z)*w + z )

    The weight under some context can be averaged too.

    Forget about this, it is nonsense...
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > you simply use many more SSE contexts:
    > SSE2i<6,6,0> Z1[256*256];
    > So you move some information into the SSE context.

    Well, I don't.
    As I already mentioned, contexts are totally wrong there, for both
    mixers and SSE. I just didn't want to think about it and made it
    like that, so there'd be _some_ contexts.

    > In my experiments, coupling a 2d SSE map with anything other than two models as
    > input always gave worse compression than a mixer.

    By "models" you mean counters here?

    And I thought that my demo clearly shows that SSE2 is always better than a
    mixer, even with the same context. Well, the 2xSSE2 version has a worse overall
    result, but it's still better than SSE2+mixer on the file for which it was
    tuned.

    > For me, interpolation with the rectangle bounded by x(+1), y(+1) worked well. I
    > tried using farther vertices, with little gain in compression but a large
    > increase in time.

    Well, I had some SSE with the probability interpolated from C[i] and C[i+1] and
    an update like d=p'-p, C[i]+=d, C[i+1]+=d. Then I tried adding C[i-1]+=d/2 and
    C[i+2]+=d/2 (without involving them in the probability estimation) and got some
    improvement. But it's an additional parameter to optimize and an additional slowdown,
    so I still don't actively use this.
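
    Schematically, something like this (a simplified sketch, not the actual order_test5 code; the vertex count, fixed-point scale and update rate here are arbitrary):

    Code:
    struct SSE1D {
      enum { N=7, P_BITS=12, P_ONE=1<<P_BITS };  // e.g. SSE<6>: 7 vertices, 6 intervals
      int C[N];                                  // vertex probabilities
      int i, frac;                               // saved between Predict() and Update()

      SSE1D() { for (int k=0; k<N; ++k) C[k] = k*(P_ONE-1)/(N-1); }

      int Predict( int p ) {                     // interpolate from C[i] and C[i+1]
        int x = p*(N-1);
        i    = x>>P_BITS;
        frac = x&(P_ONE-1);
        return (C[i]*(P_ONE-frac) + C[i+1]*frac)>>P_BITS;
      }

      // d=p'-p is applied to C[i] and C[i+1]; the "wider" update also moves
      // the neighbours C[i-1] and C[i+2] by d/2, without using them for prediction
      void Update( int bit, int rate=5 ) {
        int p = (C[i]*(P_ONE-frac) + C[i+1]*frac)>>P_BITS;
        int d = ((bit ? P_ONE-1 : 0) - p)>>rate;
        C[i]   += d;
        C[i+1] += d;
        if (i>0)   C[i-1] += d/2;
        if (i+2<N) C[i+2] += d/2;
      }
    };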

    > BTW: i had an idea to increase mixing robustness.

    I'd say that the thing I'm currently flaming about in "Counters" looks really promising.
    I mean, that (Bayesian?) approach can be applied not only to counters, but to
    mixers and SSE and whatever else in much the same way.

  6. #6
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Well, I don't.
    As I already mentioned, contexts are totally wrong there, for both
    mixers and SSE. I just didn't want to think about it and made it
    like that, so there'd be _some_ contexts.
    Up to a certain degree more contexts help. You can separate contextual correlation between the output characteristics. After a certain context you might always get better predictions than after some other, which is more random. So you can drop SSE resolution by using more contexts. There must always be a tradeoff.

    Quote Originally Posted by Shelwien
    By "models" you mean counters here?
    Yep.

    Quote Originally Posted by Shelwien
    And I thought that my demo clearly shows that SSE2 is always better than a
    mixer, even with the same context. Well, the 2xSSE2 version has a worse overall
    result, but it's still better than SSE2+mixer on the file for which it was
    tuned.
    I made my own experiments. My model architecture is different from yours. I got the results I already posted. There was always a small difference between a mixer and an SSE map when using two SSE maps as input. SSE(SSE,SSE) was always worse than MIX(SSE,SSE). I guess that SSE is a nonlinear, adaptive mapping, while MIX is, in my case, a simple linear combination which changes slowly; I think that SSE simply can't handle the varying source very well.

    I'll post some more in a few hours.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    >> Well, I don't.
    >> As I already mentioned, contexts are totally wrong there, for both
    >> mixers and SSE. I just didn't want to think about it and made it
    >> like that, so there'd be _some_ contexts.
    >
    > Up to a certain degree more contexts help. You can separate contextual
    > correlation between the output characteristics. After a certain context you
    > might always get better predictions than after some other, which is more random.
    > So you can drop SSE resolution by using more contexts. There must always be a tradeoff.

    Well, in theory I agree.
    But in this specific case I found that SSE<4..8> is always better than bigger
    arrays, even with contexts optimized for the latter.
    Though of course it also allows using more contexts (than with bigger counters).

    Somehow I think that with a smaller discretization step you're just making it
    unstable, and that's why an additional nonlinear mapping helps etc.

    Also, have you ever tried to just directly use context history bits as context?
    There's even this funny patent for it:
    http://www.google.com/patents?q=%22s...y+estimates%22
    If you tried, you may know that there are no real cases where more than 3 bits
    are useful; it's only 1 bit most of the time - or your model is totally redundant.

    I think that explains my quantization scale.
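
    Just to illustrate what I mean (hypothetical, not from my coder): each counter keeps a short bit history, and only its last few bits go into the SSE/mixer context:

    Code:
    struct Counter {
      unsigned short p;        // probability estimate
      unsigned char  hist;     // running bit history of this context

      void Update( int bit ) {
        hist = (hist<<1) | (bit&1);
        // ... normal probability update here ...
      }

      // per the above, 3 bits (often just 1) are enough as extra SSE/mixer context
      unsigned HistCtx() const { return hist & 7; }
    };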

    >> And I thought that my demo clearly shows that SSE2 is always better than a
    >> mixer, even with the same context. Well, the 2xSSE2 version has a worse overall
    >> result, but it's still better than SSE2+mixer on the file for which it was
    >> tuned.
    >
    > I made my own experiments. My model architecture is different than yours.

    Yeah, I think that my approach works right only with precise linear counters.
    Imho that's why SSE has a relatively small effect in paq etc.

    > There was always a small difference between a mixer and an SSE map
    > when using two SSE maps as input.
    > SSE(SSE,SSE) was always worse than MIX(SSE,SSE).

    To be specific, I only tested these 3 cases in my demo:
    * mix(mix(o0p,o1p),o2p)
    * mix(sse(o0p,o1p),o2p)
    * sse(sse(o0p,o1p),o2p)

    But obviously things like
    * mix(sse(o0p,o1p),sse(o1p,o2p))
    * sse(sse(o0p,o1p),sse(o1p,o2p))
    or even
    * sse(sse(sse(o0p,o1p),sse(o1p,o2p)),sse(o0p,o2p))
    are possible and probably would provide better compression.

    > I'll post some more in a few hours.


  8. #8
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Sorry, I didn't have any more time yesterday evening. It's really annoying when other things eat up your time...

    Quote Originally Posted by Shelwien
    Somehow I think that with a smaller discretization step you're just making it
    unstable, and that's why an additional nonlinear mapping helps etc.
    Hm, maybe "spatial" averaging helps, with a larger quantization step. I had a problem like this when doing some time series analysis and made a histogram which dynamically adjusted the bin width to drop resolution in very "active" areas (lots of values falling into a bin) and to increase it in very inactive ones. This helped and identified areas with less information easily.

    Quote Originally Posted by Shelwien
    Also have you ever tried to just directly use context history bits as context?
    I'll have a look when I'm back home.

    Quote Originally Posted by Shelwien
    * sse(sse(o0p,o1p),sse(o1p,o2p))
    I tried such things. The compression dropped. But as I already said, my architecture is different from yours. I use a state machine (simply because I can fit more nibble models in a cache line), so a state value represents several different counter states.

    And something else:

    I would be very grateful if you could do me a favor: could you make a large 2d SSE map (on the output of, say, order 1 and 2), say 256x256 or something, train it on some big homogeneous data set (e.g. 50mb of enwik / a large binary file / ...) and show a 2d plot of the generated 3d function? I would do it myself, but I don't have any commercial math software, and the last time I used gnuplot it took me 2 hours to plot a plane with some vector arrows in it...



    And it would also be nice if you could plot a transform of the data: at each position don't just plot p_estimated( pa, pb ), but also plot (p_estimated-pb)/(pa-pb). You know what it is. Actually your mixer can be interpreted that way: it is a vertex which "slides" through the "true 2d probability mapping function".

    I'll hurry, so that this evening I can post a NN based mixer (as in PAQ). I decided to rewrite it myself, since I had strange problems using the PAQ one "out of the box". I wanted to try a NN earlier, but I got stuck using other optimization approaches with little benefit. And I don't know what I'm doing wrong with PAQ's mixer.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Tried inserting SSE2<256,256> into order_test5 code, but
    the output looks completely random... as expected.
    Guess I won't be able to help here, as my models are not
    designed for something like this, and it would take too much
    time to design something specifically for visualization.

    Plotting is not a problem though, so I can make
    some graphs if you upload any applicable statistics.

  10. #10
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Here is some data: there are two SSE functions, each with a linear 33x33 grid. They are separated by a "*** 2 ***". The first one is
    p=SSE( order0, order1 ), the second is SSE( p, order2 ).

    Values were generated by compressing 10mb of enwik.

    http://freenet-homepage.de/toffer_86/a.txt
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  11. #11
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Here's the promised demo with Order 210 mixing with a NN:

    http://freenet-homepage.de/toffer_86/cmmd.exe

    It's pretty much the same as PAQ.

    The results are a bit better than my 210 2d SSE map. I guess the NN does a fairly good job. I only select one of 4 NNs dependent on coding order (orders 2-0, unseen symbol).

    Code:
     
    class mixer
    {
    protected:
      static const uint N=3;           // number of model inputs
      memory<sint32> weights;          // N weights per mixer context
      sint32* w;                       // weights of the currently selected context
      memory<int> inputs;              // input probabilities, stretched in place by Mix()
      int p1;                          // last mixed P(1), in math::P_BITS scale

    public:
      mixer( const uint m ):
        weights(N*m),
        w(&weights[0]),
        inputs(N)
      { Clear(); }

      void Clear()
      { weights.Zero(); w=&weights[0]; inputs.Zero(); }

      inline int& Input( const int i ) { return inputs[i]; }

      inline int P1() const { return p1<<(bitmodel::P_BITS-math::P_BITS); }

      inline int Mix( const uint ctx )
      {
        w = &weights[ctx*N];
        p1 = 0;
        for (uint i=0; i<N; ++i) {
          // stretch each input probability, then accumulate the weighted sum
          inputs[i] = math::Stretch(inputs[i]>>(bitmodel::P_BITS-math::P_BITS));
          p1 += w[i]*inputs[i];
        }
        p1 = math::Squash(p1>>16);
        return P1();
      }

      inline void Update( const int bit )
      {
        // gradient step: each weight moves by (stretched input)*(coding error), rounded
        const int e = 7*((bit<<math::P_BITS)-p1);
        for (uint i=0; i<N; ++i) {
          w[i] += (inputs[i]*e + (1<<15))>>16;
        }
      }
    };



    And the math:

    math.h
    Code:
     
    namespace math { 
     
    const int P_BITS=12; 
    const int SCALE=1<<P_BITS;  // all values are scaled by this to create integers 
     
    class squash 
    { 
    protected: 
      static short table[SCALE]; 
     
    public: 
      static const int X_MIN=-SCALE/2; 
      static const int X_MAX=SCALE/2-1; 
      static const int Y_MIN=0; 
      static const int Y_MAX=SCALE-1; 
     
      squash(); 
      inline int operator()( const int i ) const 
      { 
        if (i>X_MAX) { return SCALE-1; } 
        if (i<X_MIN) { return 0; } 
        return table[i-X_MIN]; 
      } 
    }; 
     
     
    class stretch 
    { 
    protected: 
      static short table[SCALE]; 
     
    public: 
      static const int X_MIN=0; 
      static const int X_MAX=SCALE-1; 
      static const int Y_MIN=-SCALE/2; 
      static const int Y_MAX=SCALE/2-1; 
     
      stretch(); 
     
      inline int operator()( const int i ) const 
      { 
        assert( i>=0 && i<SCALE );  // should be true, since we input model probs 
        return table[i]; 
      } 
    }; 
     
    extern const squash Squash; 
    extern const stretch Stretch; 
     
    };



    math.cpp
    Code:
     
    namespace math { 
     
    const int STEM_COUNT=33;  // number of vertices (on the left side, x<0) 
    const int DELTA=SCALE/(2*STEM_COUNT-2); // width of a bin between two vertices 
     
    // Uncomment and call CalcStems to generate the stems[]. 
     
    //const double X0=-8; 
    //const double X1= 0; 
    // 
    //double f( const double x ) 
    //{ return 1.0/(1.0 + exp(-x)); } 
    // 
    //void CalcStems() 
    //{ 
    //  const double dx = (X1-X0)/(STEM_COUNT-1); 
    //  double x = X0; 
    // 
    //  for (int i=0; i<STEM_COUNT; ++i) { 
    //    stems[i] = int(round(f(x)*SCALE)); 
    //    printf( "%4d, ", stems[i] ); //getchar(); 
    //    if (i%12==11) printf( "\n" ); 
    //    x+=dx; 
    //  } 
    //} 
     
    static short stems[STEM_COUNT] = {  // left side of the sigmoid function, x<0 
       1,    2,    2,    3,    4,    5,    6,    8,   10,   13,   17,   21, 
      27,   35,   45,   58,   74,   94,  120,  153,  194,  246,  311,  391, 
     488,  606,  747,  912, 1102, 1314, 1546, 1793, 2048 
    }; 
     
    //////////////////////////////////////////////////////////////////////////////// 
     
    short squash::table[] = {squash::Y_MIN-1}; 
     
    squash::squash() 
    { 
      if (table[0]>=squash::Y_MIN) { return; } 
     
      for (int i=0; i<=SCALE/2; ++i) { 
     
        short* t = &stems[i/DELTA]; 
        const int r = i%DELTA; 
     
        table[i] = (t[0]*(DELTA-r) + t[1]*r)/DELTA; 
     
        int j = SCALE-i; 
        if (j>SCALE/2 && j<SCALE) { table[j] = SCALE-table[i]; } 
      } 
     
      table[0]=stretch::X_MIN; 
    } 
     
     
    short stretch::table[]={stretch::Y_MIN-1}; 
     
    stretch::stretch() 
    { 
      if (table[0]>=stretch::Y_MIN) { return; }  // already initialized
      const squash Sq; 
     
      for (int y=squash::X_MIN; y<squash::X_MAX; ++y) { 
        int x=Sq(y); 
        while (x!=Sq(y+1)) { 
          table[x++]=y; 
        } 
      } 
     
      table[SCALE-1]=squash::X_MAX; 
    } 
     
    const squash Squash; 
    const stretch Stretch; 
     
    };

    I might write a cmm variant based on it. Don't use cmmd for benchmarking; it's just released for comparison with Shelwien's mixing. It lacks further SSE.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Here is some data: there are two SSE functions, each with a linear 33x33 grid.
    > They are separated by a "*** 2 ***". The first one is
    > p=SSE( order0, order1 ), the second is SSE( p, order2 ).
    > Values were generated by compressing 10mb of enwik.
    > http://freenet-homepage.de/toffer_86/a.txt

    Well, here goes:
    http://src2.zxq.net/2a/
    http://shelwien.googlepages.com/a1e.png
    http://shelwien.googlepages.com/2a.rar

    a1e.png is a (a[x][y]-x)/(y-x) sample, I didn't bother to plot more
    as it doesn't seem to have any interesting details.

    ...Maybe this would be more interesting:
    http://shelwien.googlepages.com/a1e1.png
    It's the part with x=0.5..1 and y=0..0.5.

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Here's the promised demo with Order 210 mixing with a NN:
    > http://freenet-homepage.de/toffer_86/cmmd.exe

    Here I added it: http://shelwien.googlepages.com/sse2.htm
    Seems that cmmd is significantly better for texts, especially fp.log...
    and worse than even o2_mm for executables.

    > It's pretty much the same as PAQ.
    > The results are a bit better than my 210 2d SSE map.
    > I guess the NN does a fairly good job.

    Guess so, but it's not a completely fair comparison, because
    seemingly both the contexts and the alphabet decomposition are different.
    I think that for an unbiased comparison we'd have to either insert
    your classes into my coder, or vice versa... the processed bit sequences
    have to be the same or the results wouldn't be informative.

    So can you wrap that NN into a standalone class compatible with my mixer class?
    Or at least post some compilable sources?

    > I only selected one of 4 NNs dependent on coding order (orders 2-0, unseen symbol).

    What's "unseen symbol"? Do you have an escape model there?
    That might explain why cmmd is better on english texts.

    Also, do you use simple counter arrays like I do, or are there some hashes?
    And I hope you don't have a match model or something like that.

    ...

    Guess I'd have to add the missing features to my demo coder already,
    instead of making all these excuses.

  14. #14
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Actually there were two bugs in cmmd. The coding order wasn't recognized correctly, and due to a typo it dropped order 0. I don't use any special models. Orders 0 and 1 use lookup tables. Order 2 is hashed. After breakfast I'll fix it and upload.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  15. #15
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    And thanks for making the plots!
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  16. #16
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I guess the left plots are the first SSE function (prior to "*** 2 ***")?

    The weights seem to be distributed quite uniformly. Maybe one can see a difference at a finer scale?! The lack of continuity in the middle is simply a division by zero, where pa=pb=p.

    Altogether the mapping looks like a plane which is dominated by the more accurate model, p(pa,pb)=pb (higher order). There's some randomness introduced by the less accurate model (lower order). This can easily be seen in the *density plots. Larger areas with p=0.5 are created by initialization and are simply left untouched (a1_2_density).

    Unfortunately not as interesting as I expected.

    But thanks again, for the plots!

    Quote Originally Posted by Shelwien
    Guess so, but it's not a completely fair comparison, because
    seemingly both the contexts and the alphabet decomposition are different.
    I think that for an unbiased comparison we'd have to either insert
    your classes into my coder, or vice versa... the processed bit sequences
    have to be the same or the results wouldn't be informative.
    Code:
    for( i=8,d=0; i>0; i-- ) {
      Model_O2( flag, d+(256>>i)-1 ); PPtr++;
      d += d+flag;
    }





    You use a flat, 256-ary decomposition, like I do. The code quoted above is already compatible, except for some constants. The math stuff should work out of the box.
    Code:
    struct bitmodel {
      static const int P_BITS=16;  // the models output 16 bit probabilities
      static const int P_ONE=(1<<P_BITS)-1;
    };

    class mixer
    {
    protected:
      static const int N=3;
      const int M;
      int* weights;
      int* w;
      int inputs[N];
      int p1;

    public:
      mixer( const uint m ):
        M(m),
        weights( new int[N*m] ),
        w(weights)
      { Clear(); }

      ~mixer()
      { delete[] weights; }

      void Clear()
      {
        for (int i=0; i<N*M; ++i) {
          weights[i]=0;
        }
        w=weights;
        for (int i=0; i<N; ++i) {
          inputs[i]=1<<bitmodel::P_BITS-1;
        }
      }

      inline int& Input( const int i ) { return inputs[i]; }

      inline int P1() const { return p1<<bitmodel::P_BITS-math::P_BITS; }

      inline int Mix( const uint ctx )
      {
        w = &weights[ctx*N];
        p1 = 0;
        for (int i=0; i<N; ++i) {
          inputs[i] = math::Stretch(inputs[i]>>bitmodel::P_BITS-math::P_BITS);
          p1 += w[i]*inputs[i];
        }
        p1 = math::Squash(p1>>16);
        return P1();
      }

      inline void Update( const int bit )
      {
        const int e = 7*((bit<<math::P_BITS)-p1);
        for (int i=0; i<N; ++i) {
          w[i]+=inputs[i]*e + (1<<15)>>16;
        }
      }
    };





    Constant tuning (learning rate, rounding, ...) might help, but I was too lazy (these were taken from some paq source). When I've done some other things I will try to tune this stuff. First I wanted to see how well this scores.

    To use the NN with N contexts:

    Code:
    mixer nn(N);

    nn.Input(0) = p1;  // i always use P(1)!
    ...

    p1mix = nn.Mix( <mixer context> );  // the mixer context should be the coding order
                                        // plus some other stuff, like a piece of an order 1 context

    Quote Originally Posted by Shelwien
    What's "unseen symbol"? Do you have an escape model there?

    No. But this simplifies the prediction function:

    If an unseen symbol is encountered I can sense this, since all models' bit histories are in their initial state. Instead of encoding the symbol with p=0.5 I let the NN output probabilities with all inputs fixed to .5.
    There's nothing special about that: Stretch(0.5)=0, so the output will always be .5: Squash( 3*Stretch(0.5) ) = 0.5 (3*, since we've got inputs from 3 models, orders 0-2). One might add the bit position as an input or something else to gain a beneficial effect.

    Here's the fixed demo:

    http://freenet-homepage.de/toffer_86/cmmd2.exe
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  17. #17
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    What if we add a momentum term to the NN? I'll try it. I hope it can give a compression gain. But it seems difficult due to the integer arithmetic.
    BIT Archiver homepage: www.osmanturan.com

  18. #18
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by osmanturan
    momentum term
    What is this?

    If you want to add things other than probabilities to the NN, I would suggest dropping the stretch, since its mapping is designed for probability quantization. But don't change the logic (-2048..-1 for a 0 bit, 1..2047 for a 1 bit).
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  19. #19
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Choosing a high learning rate causes oscillation in the learning phase. To avoid this effect, a momentum term is added to the weight update, so the neural network can adapt rapidly to the inputs. I'm not going into theory. Here is the short formula:
    Code:
    // initialization 
    delta = 0; 
    ... 
    // weight updates 
    delta = delta * momentum + learningRate * error * Output; 
    weight += delta;
    Also, there are some papers around about adaptive learning rates, but I haven't looked at them yet. We may also add a weight decay term. The final formula with weight decay looks like this:
    Code:
    // initialization 
    delta = 0; 
    ... 
    // weight updates 
    delta = delta * momentum + learningRate * error * output - weightDecay * weight; 
    weight += delta;
    Don't forget to set the momentum and weightDecay variables.
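
    For example, plugged into toffer's mixer::Update() above, in fixed point it could look roughly like this (an untested sketch; the d[] array is an extra per-context member I'm assuming - the delta from the formula - selected in Mix() the same way as w, and the momentum ratio is just an example value):

    Code:
    static const int MOM_NUM=3, MOM_DEN=4;   // momentum = 3/4 as an integer ratio

    inline void Update( const int bit )
    {
      const int e = 7*((bit<<math::P_BITS)-p1);
      for (uint i=0; i<N; ++i) {
        // the previous change decays by the momentum factor, then the usual
        // rounded gradient step is added on top (a weight decay term could
        // be subtracted here as well)
        d[i] = d[i]*MOM_NUM/MOM_DEN + ((inputs[i]*e + (1<<15))>>16);
        w[i] += d[i];
      }
    }
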
    BIT Archiver homepage: www.osmanturan.com

  20. #20
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Ok, now I know what you mean. I should have looked up "momentum" in a dictionary.

    This might help a bit.

    Quote Originally Posted by osmanturan
    Also, there are some papers around about adaptive learning rates, but I haven't looked at them yet.
    This is a chicken-and-egg problem. If you select a function to choose the learning rate from, this function also has parameters... How do you want to select these? Such a heuristic can help, but I don't expect too much from it.

    Keep in mind that error backpropagation is simply gradient descent. There are several heuristics to adaptively control the step size (learning rate), but all of them (that I know about) have something in common - they estimate the cost function. This might be done using, e.g., a quadratic approximation, but since you have to do this several times per bit, you'll get unacceptable speed. Since the NN is trained on every bit, the effect of "averaging" should mask out such more or less cunning methods (simply because the training set is extremely large and varying, the convergence speed has little influence).
    Most times I used MLPs I needed about 1000 - 10k learning cycles to achieve good results. That would be equal to just 10k/8, approx. 1.25k bytes of data per mixer. I wouldn't expect very stunning effects.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  21. #21
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by toffer
    This might help a bit.
    I think the momentum term can help with highly skewed inputs. I hope we can put this term into your neural network code successfully, because when you published your code, I noticed my code is very similar to yours.
    BIT Archiver homepage: www.osmanturan.com

  22. #22
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It would be strange if it didn't... how else would you implement a dot product? And function lookups.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  23. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Updated the comparison:
    http://shelwien.googlepages.com/sse2.htm
    That's a really cool fix. Now I just have to rewrite my model.

  24. #24
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Updated the comparison:
    http://shelwien.googlepages.com/sse2.htm
    That's a really cool fix. Now I just have to rewrite my model.
    Let me see what you can make of it. What's your opinion about the plots?
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    >> Now I just have to rewrite my model
    > Let me see what you can make of it

    Nothing, I guess, if you mean your NN.
    Though it might be interesting to see the performance of
    my coder with your mixer... but only if you make it.

    Here I modified my coder so that it could be built with gcc:
    http://shelwien.googlepages.com/order_test5a.rar

    And by rewrite I meant:
    - unary alphabet decomposition
    - tuned counter initialization
    - escape modelling
    - better mixer/SSE contexts
    stage 2:
    - unary SSE for symbols
    - delayed counters
    - bayesian extrapolation
    stage 3:
    - polynomial code length approximation for context clustering
    - source clustering by compressibility (segmentation)
    - all modelling via interval arithmetics
    - SSE replacement with more complex model

    > What's your opinion about the plots?

    I've seen something really similar quite a long time ago.
    Afair, I even made a movie out of it in Mathematica (3D SSE plots
    changing in time - I can redo it btw, if you upload the coder which
    would dump the table on each step or something).

    But, although it might be interesting to watch, I dunno how
    it can be used to improve compression.

  26. #26
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Ok, I guess these will be major changes.

    Quote Originally Posted by Shelwien
    - polynomial code length approximation for context clustering
    ??? What about this?

    Now I don't think that these SSE visualization experiments help to improve compression, simply because there was nothing surprising.

    Again, some old stuff:

    p' = p + (y-p) / (c*n+1) = p + (y-p) * 1/n'

    I made some experiments and was surprised that it took between 100 and 400 iterations to get n' to its infinite limit (where c was around .95 to .99) at 16 bit precision of 1/n'. So for me it might be worth adding a lookup table for 1/n'. I underestimated this. You merged a mixture of a random distribution into p, like:

    p = a*p + (1-a) *.5

    Do you have any "rules of thumb" for how to weight this distribution, or is it simply a matter of empirical tuning?
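
    For illustration, a rough sketch of what I mean (not my actual counter code; I'm reading n' as the recurrence n'(k+1) = c*n'(k)+1, and the table size, c, a and the fixed-point scale are just example values):

    Code:
    #include <cstdint>

    struct Counter {
      enum { P_BITS=16, P_ONE=1<<P_BITS, MAXN=400 };   // ~100-400 steps until 1/n' converges
      static uint32_t inv[MAXN+1];                     // inv[k] ~ (1/n'(k)) * P_ONE
      int p, n;                                        // P(bit==1) and step count

      Counter(): p(P_ONE/2), n(0) {}

      static void InitTable( double c=0.97 ) {         // call once at startup
        double np = 0;                                 // n'(k+1) = c*n'(k) + 1
        for (int k=0; k<=MAXN; ++k) {
          np = c*np + 1;
          inv[k] = uint32_t(P_ONE/np);
        }
      }

      void Update( int bit ) {                         // p' = p + (y-p) * 1/n'
        int y = bit ? P_ONE-1 : 0;
        p += int(((int64_t)(y-p)*inv[n])>>P_BITS);
        if (n<MAXN) ++n;
      }

      // prediction mixed with the uniform distribution: p = a*p + (1-a)*0.5
      int P( int a_num=15, int a_den=16 ) const {
        return (p*a_num + (P_ONE/2)*(a_den-a_num))/a_den;
      }
    };
    uint32_t Counter::inv[Counter::MAXN+1];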

    BTW: why are you so afraid of pasting the code - et voilà?! I'll do it this afternoon.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Ok, i guess this will be major changes.

    I doubt that I'd really go after "stage 1" there -
    only if it'd fail to win vs cmmd without it (which is imho unlikely).
    Or maybe if it'll cause any entertaining discussions.

    >> - polynomial code length approximation for context clustering
    >
    > ??? What about this?

    A trick impossible for models with paq-way counters.

    While my counters are linear, so a generalized string probability
    can be computed in the form of a polynomial.
    Also, all my parameters are from [0..1], which can easily be converted
    to [-0.5..0.5], so powers of a selected parameter would converge
    really fast.
    Thus, we should be able to calculate a good enough string probability
    approximation using only a limited number of parameter powers.
    And the code length is log2 of that, so this approximation basically
    allows reducing the optimization by one dimension (or more, if we'd
    manage to maintain a precise enough approximation in more than one
    parameter).

    Then, context clustering is a really good application for this,
    as a rough implementation would need only a method of concatenating
    two contexts, which can be done relatively fast if we'd have
    functions for code lengths.

    > Now i don't think that these SSE visualization experiments
    > help to improve compression, simply because there was
    > nothing surprising.

    Well, at least you could notice that it's rather smooth -
    imho a good argument to shrink the table size.

    > Again, some old stuff:
    > p' = p + (y-p) / (c*n+1) = p + (y-p) * 1/n'
    > I made some experiments and was surprised that it took
    > between 100 and 400 iterations to get n' to its infinite
    > limit (where c was around .95 to .99) at 16 bit precision of
    > 1/n'. So for me it might be worth adding a lookup table for
    > 1/n'. I underestimated this.

    Better late than never.

    Well, it's not always necessary - inheritance or really good
    submodel merging can compensate for that.

    > You merged a mixture of a random distribution into p, like:
    > p = a*p + (1-a) *.5
    > Do you have any "rules of thumb" for how to weight this
    > distribution, or is it simply a matter of empirical tuning?

    Of course brute-force tuning always helps, especially when you
    feed that counter's prediction into SSE or something.
    Or you can make a pass using a slow dynamic mixer, and then
    hardcode the weight estimated by it.

    But basically that mix corresponds to a composite generator model
    where the bit is generated with probability p if rand()<a,
    or with probability 0.5 otherwise.
    With that you can make some "common sense" weight estimations.

    > BTW: why are you so afraid of pasting the code - et voila?!

    You mean me posting links instead of pasting the code into posts?
    1) it makes a thread unreadable
    2) it's troublesome to copy the code back into files, and even more
    troublesome to collect all the files necessary to make it work
    3) nobody can guarantee that the forum scripts won't damage anything;
    common forum engines have attachments for that.

    > I'll do it this afternoon.

    That'd be nice, if you're talking about replacing the mixer in my coder.
