1. Parameter optimization

Hello everyone!

I'm currently working on bit model parameter optimization for my study thesis. My bit models have two parameters, 1-c and d:

p_k+1 = p_k + (y_k-p_k)*(1-c) + d

I've already written a working optimizer for simple models (direct lookup, order 0-2) and will possibly include a wrapper which dumps coding steps to a data file to allow easy integration into compression programs (so you can tune your bit models on a large data set with less effort). My first results show that the expected improvement will be somewhere around .1% .. 1% - not very stunning, but hey - the improvement doesn't require extra effort.

I might include more advanced bit models, e.g. using different parameter sets for 1-c and d for the startup phase (small k), similar to Matt's approach in PAQ, to allow rapid initial learning - but the learning curve will be optimized. Or using different parameters depending on whether the current bit is a more probable symbol or a less probable symbol. Do you have any other ideas here?

As a short example I ripped off an fpaq version and included a (locally) optimal 1-c and d at 15-bit precision. It was tuned on enwik8 - and shows the best results for such a simple model.

http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

Greets

2. > I'm currently working on bit model parameter optimization
> for my study thesis.
> My bit models have got two parameters, 1-c and d:
>
> p_k+1 = p_k + (y_k-p_k)*(1-c) + d

I normally use counters like this:
p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw =
= p + mw*(pm-p) + (1-mw)*wr*(y-p) =
= p + (y-p)*(mw*(1-wr)+wr) - mw*(y-pm)
but with fixed pm=0.5 (of course it can be optimized too)

It's similar, but your parametrization can overflow,
and there'd be a correlation between the optimal values of your c and d.

> I've already written a working optimizer for simple models
> (direct lookup - order 0-2)

What kind of optimizer, I wonder?

> wrapper which dumps coding steps to a data file to allow an
> easy integration into compression programs (so you can tune
> your bit models on a large data set with less effort).

Which "coding steps"? Contexts and context histories?
I do that manually... as most models are still written
manually anyway - not really constructed from common blocks
or anything.

> first results show, that the expectable improvement will be
> somewhere around .1% .. 1% - not very stunning, but hey -
> the improvement doesn't require extra effort.

Right, counter parameter optimization isn't very interesting
(brute-force optimization at least, as there're some applications
for smarter techniques).
But what's really pretty much impossible without special
optimization tools is context quantization.
And there're lots of useful non-probability statistics,
e.g. match offset/length, which need to be quantized to be
used as a context.

> I might include more advanced bit models, e.g. using
> different parameter sets for 1-c and d for the startup phase
> (small k), similar to Matt's approach in PAQ, to allow rapid
> initial learning - but the learning curve will be optimized.

Normally that's impractical.
I mean, like that you'd get an improvement with simple counters, right.
But with secondary models results would be worse...
Although sometimes I do use a simpler model for initial coding...
I call it a "bootstrap model" - usually it's a small counter table
whose estimation is quantized into a part of the main model's context,
but which is also used directly for coding in the initial phase,
before switching to the main model.

> Or using different parameters whether the current bit is a
> more probable symbol or a less probable symbol.
> Do you have any other ideas here?

More than one context history bit would be useful like that,
but updating the counter with bits and using the same bits in
other contexts causes context correlation, which is no good,
at least for memory usage.
So it's better to have a delayed counter with one static mapping
(in the context of some bit history) for probability estimation and
another mapping for the update.
These mappings are similar to SSE, but they'd better be static,
or you won't be able to apply another (dynamic) SSE stage there.
Also, quantization could be applied to bit histories.

> http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

You could at least separately optimize parameter values for
different contexts - that would be somewhat interesting.

3. Calculations are done like this:

p_k+1 = p_k + ((y_k-p_k)*(1-c) + d)/S

where 1-c ranges over 0..S-1 and d can range over [-S/2, S/2). y_k-p_k is at most an N-bit signed integer and 1-c an M-bit unsigned integer (S=2^M). I've chosen N+M < (native integer mantissa bits), i.e. N+M < 31, so nothing can overflow.

I'm using this type of calculation since it's simple and fast.

There are correlations between c and d, but this doesn't influence the optimization, since the mixed derivatives with respect to c and d (in either order) are both zero.

> What kind of optimizer, I wonder?

I'm numerically minimizing the coding cost of the test data set. Currently I use CG (calculating the Hessian matrix is too slow) and might switch to a second-order optimization algorithm, estimating the Hessian matrix using BFGS.

> Which "coding steps"? Contexts and context histories?
> I do that manually... as most of models are still written
> manually anyway - not really constructed from common blocks
> or anything.

All my work up to now uses one thing: "class bitmodel", which encapsulates the above counter. So calls to this class can dump the information needed for the optimization to a file.

> Normally that's impractical.

In my experience it isn't. Recall some of our previous discussions about using 1-c_k instead of simply 1-c to simulate a counter like p1 = (c*n1_k + y_k) / (c*n_k + 1). I've seen a positive effect from doing this (even when using second-order models) - just test lpaq with the variable bit model adaption speed disabled.

I agree that this type of learning changes the output characteristic of a bit model, which makes the distribution a secondary mapping has to learn more unstable.

The difference isn't that big, but as I said it doesn't require any extra effort.

> I mean, like that you'd get an improvement with simple counters, right.
> But with secondary models results would be worse...
> Although sometimes I do use a simpler model for initial coding...
> I call it a "bootstrap model" - usually its a small counter table
> and its estimation is quantized into a part of main model's context,
> but also is directly used for coding in initial phase,
> before switching to the main model.

I thought about something like this, since it is similar to what PAQ's state machine does.

> Also quantization could be applied to bit histories.

This is very interesting for me. I've made up a simple mathematical description (basically density and transition functions) which can be used to derive a state machine. The needed functions can be derived from my bit model parameters, which are divided into special classes.

> http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

It's just a proof-of-concept, nothing special.

4. I just wanted to point out that your parameters are functions of mine.
So in the case where my representation fits the source model better
than yours, your parameters are harder to optimize because they're
entangled.
However I won't bother to prove that it's really the case.

> Calculations are done like this:
> p_k+1 = p_k + ((y_k-p_k)*(1-c) + d)/S
> where 1-c ranges from 0..S-1 and d can range from [-S/2,S/2).

You skipped that before, so I thought that d was in [0..S-1].
But like that it won't overflow, you're right.

> I'm using this type of calculation since its simple and fast.

It's not really simple, as it won't work exactly the same
if you replace the division with a shift.

> There are correlations between c and d, but this doesn't
> influence the optimization, since derivates by c and d or d
> and c are both zero.

Well, I think it does.
Even if you're able to find a global minimum (which I doubt)
of a code length _function_, it's still not the same as a real range coder's
code length.

> > What kind of optimizer, I wonder?
>
> I'm numerically minimizing coding cost of the test data set,
> currently i use CG (calculating the hessian matrix is too
> slow) and might switch to a second order optimization
> algorithm and estimate the hessian matrix using bfgs.

Somehow that doesn't feel right.
Are you doing gradient descent parameter tuning for the whole enwik8 or what?
At least you could split it at some points, like long sequences of the
same bit value or something - well, points where the probability value is
least dependent on the parameters.
Also did you consider the reduced polynomial approximation which I was

Wonder if you're trying to be precise here... it's no use because
you cannot precisely compute the behaviour of a discrete system like an rc
with mathematics.

> > Which "coding steps"? Contexts and context histories?
> > I do that manually... as most of models are still written
> > manually anyway - not really constructed from common blocks
> > or anything.
>
> All my work upon now uses one thing: "class bitmodel", which
> encapsulates the above counter. So calling this class can
> dump the needed information for the optimization to a file.

Is it a single counter class, or a counter table class?
Because sometimes it's useful to dump the whole context
values along with the data, as that allows tuning the quantization.
However, simple flattened bit histories are tricky too -
you have to add terminators and implement the model flush
(and there's obviously not much sense in tuning to a single context sequence).

> > Normally that's impractical.
>
> In my experience it isn't.
[...]
> I agree that this type of learning changes the output
> characteristic of a bit model, which makes the distribution

Yeah, so here I meant that not using any secondary model is impractical.

> > Also quantization could be applied to bit histories.
>
> This is very interesting for me. I've made up a simple
> mathematical description (basically density and transition
> functions) which can be used to derive a state machine. The
> needed functions can be derived from my bit model parameters
> which are divided into special classes.

Quantization is a simple case of context clustering
(where any two contexts can correspond to the same table index, or not).
In my implementation, there're bitmaps corresponding to contexts
which are optimized in a somewhat "genetic" fashion and then used
to generate a quantization code like ctx=ctx*4+(c>4)+(c>7)+(c>11).

> > http://freenet-homepage.de/toffer_86/fpaq0cm.cpp
>
> It's just a proof-of-concept, nothing special.

Can you post the enwik8 result for this version, and also
for some neighbourhood like 32088+-10,20,30 and -7000+-10,20,30?

5. > I just wanted to point out that your parameters are functions of mine.
> So in the case where my representation fits the source model better
> than yours, your parameters are harder to optimize because they're
> entangled.

Mathematically it is the same, but the real-life numeric situation might be
different. At the moment I don't consider changing the bit model.

> Its not really simple, as it won't work completely the same
> if you replace the division with a shift.

The only difference is that if the compiler generates a SAR instruction it
will carry over the sign bit (-1>>1 = -1). However, writing this as a division
(-1/2) solves the problem (gcc doesn't generate a division instruction, and you
don't have the sign problem).

> Well, I think it does.
> Even if you're able to find a global minimum (which I doubt)
> of a code length _function_, its still not the same as a real rangecoder's
> code length.

This is true. Minimizing the coding cost function represents a theoretical
limit; the range coder is bounded by it. But the "direction" of the parameter
influence is *about* the same. Actually I use a range coder to estimate the
coding cost, since the difference to the true entropy function is small (and,
oddly enough, it is faster!).

After falling into a local optimum I'll start a brute-force search around it,
which can compensate for the difference between the numerically found minimum and
the "true" local minimum. This isn't perfect - I know.

> Somehow that doesn't feel right.

I don't have any clue other than the coding cost function - which can be used
to optimize within limited time.

> Are you doing gradient descent parameter tuning for whole enwik8 or what?

It was just a test case - as I said, a "proof of concept". My plans are to add
tuned parameters to my compressors, which will detect the data type by some
(maybe correlative) measure and use optimal parameters. I'll find "optimal"
parameters by optimizing over large data sets per type.

As I said before, another useful thing is that I can add state machines which
are built from empirically collected statistics instead of using "ad-hoc" ones
like in PAQ.

> At least you could split it at some points, like long sequences of the
> same bit value or something - well, points where probability value is
> least dependent on the parameters.

Since I probe for the minimum over the whole (large) data set, the influence of
such "abnormal" regions should be tiny (averaged out).

> Also did you consider the reduced polynomial approximation which I was

> Wonder if you're trying to be precise here... it's no use because
> you cannot precisely compute the behaviour of a discrete system like an rc
> with mathematics.

True.

> Is it a single counter class, or counter table class?
> Because sometimes its useful to dump the whole context
> values along with the data as it allows to tune the quantization.

It is a single class. Normally I call

bitmodel bm;
...
bm.Update(y);

Which can easily be replaced by:

bm.UpdateAndDump(y, x);

In my case x is an integer which represents a "class of parameters". In some
situations x needs to be passed as a parameter; in others it can be calculated
by the bit model. It depends on the type of classes you want to regard. Of course
I could add more advanced things, but this study thesis is nothing special. I'm
just focusing on completing it, since I'm happy to have my advisor enthused about
this topic (I could combine a hobby and work). Once this is completed I will
probably do more advanced things with it - but I'll always listen to
suggestions and remarks for the (future) development of this stuff.

> Quantization is a simple case of context clustering
> (where any two contexts can correspond to the same table index, or not).
> In my implementation, there're bitmaps corresponding to contexts
> which are optimized in a somewhat "genetic" fashion and then used
> to generate a quantization code like ctx=ctx*4+(c>4)+(c>7)+(c>11).

We discussed something similar before, didn't we? But I can't remember the
thread. Or am I wrong here?

> Can you post the enwik8 result for this version, and also
> for some neighbourhood like 32088+-10,20,30 and -7000+-10,20,30?

You want to see if it's really a minimum? I can do that, but at the moment I'm working
on improving my optimizer. I still need to add simple things like
automatic stopping or a line search algorithm.

6. > > Its not really simple, as it won't work completely the same
> > if you replace the division with a shift.
>
> The only difference is that if the compiler generates a SAR/SAL instruction it

Right:

> ;;; zz/=32768;
> mov eax, DWORD PTR [esp+28]
> mov edx, eax
> sar edx, 31
> shr edx, 17
> add eax, edx
> sar eax, 15

I meant that it's totally slow compared to a single SHR.
Probably slower than a multiplication.

> > Somehow that doesn't feel right.
>
> I can't have any other clue than the coding cost function - which can be used
> to optimize with limited time.

I just hoped that you'd made something smarter than a brute-force search.
Btw there's nothing wrong with that, I still do it like that myself,
but it's too heavy a method, so it can't be applied to really interesting
cases.
> As i said before another useful thing is that i can add state machines which
> are built from empirical collected statistics instead of using "ad-hoc" ones
> like in PAQ.

Well, I advertised this approach since my first post on that forum.

> > At least you could split it at some points, like long sequences of the
> > same bit value or something - well, points where probability value is
> > least dependent on the parameters.
>
> Since i probe for the minimum over the whole (large) data set the influence of
> such "unnormal" regions should be tiny (averaged out).

Guess you misunderstood... I meant that it's possible to split it to increase
optimization speed.

> > Also did you consider the reduced polynomial approximation which I was
>

http://encode.ru/forums/index.php?ac...ic=648#msg8913

> > Quantization is a simple case of context clustering
> > (where any two contexts can correspond to the same table index, or not).
> > In my implementation, there're bitmaps corresponding to contexts
> > which are optimized in a somewhat "genetic" fashion and then used
> > to generate a quantization code like ctx=ctx*4+(c>4)+(c>7)+(c>11).
>
> We discussed something similar before, didn't we? But i can't remember the
> thread. Or am i wrong here?

http://encode.ru/forums/index.php?ac...page=0#msg8093

7. Originally Posted by Shelwien
I meant that its totally slow comparing to a single SHR.
Probably slower than multiplication.
I've tested against an order 0 model. There is no difference in processing time.
A solution could be: (y-p)*(1-c)+d & (1<<LD_SCALE)-1 >> LD_SCALE
Losing the lowest bit shouldn't make any difference (it is almost random).

Originally Posted by Shelwien
> Guess you misunderstood... I meant that its possible to split it to increase
> optimization speed.
You mean to optimize just "typical" pieces of data and discard the rest?

You mean one should approximate log2 using a Taylor polynomial? This would avoid
divisions.

BTW:

Code:
```toffer@debian:~\$ ./opt enwik8 32069 -8069
0: [c d] = [0.97867 -0.24625] ~ 1/32768 [32069 -8069]  grad H = [-0.00546  0.00005]  H = 61185836  (47.11 s)
1: [c d] = [0.97870 -0.24625] ~ 1/32768 [32070 -8069]  grad H = [-0.00514  0.00003]  H = 61185856  (47.10 s)
2: [c d] = [0.97873 -0.24625] ~ 1/32768 [32071 -8069]  grad H = [-0.00479 -0.00005]  H = 61185832  (47.09 s)
3: [c d] = [0.97876 -0.24625] ~ 1/32768 [32072 -8069]  grad H = [-0.00444 -0.00008]  H = 61185821  (47.09 s)
4: [c d] = [0.97879 -0.24625] ~ 1/32768 [32073 -8068]  grad H = [-0.00407 -0.00000]  H = 61185839  (47.11 s)
5: [c d] = [0.97882 -0.24625] ~ 1/32768 [32074 -8068]  grad H = [-0.00372 -0.00000]  H = 61185826  (47.09 s)
6: [c d] = [0.97885 -0.24625] ~ 1/32768 [32075 -8068]  grad H = [-0.00332 -0.00000]  H = 61185780  (47.10 s)
7: [c d] = [0.97888 -0.24625] ~ 1/32768 [32076 -8068]  grad H = [-0.00296 -0.00000]  H = 61185781  (47.09 s)
8: [c d] = [0.97891 -0.24625] ~ 1/32768 [32077 -8068]  grad H = [-0.00260 -0.00000]  H = 61185750  (47.10 s)
9: [c d] = [0.97894 -0.24625] ~ 1/32768 [32078 -8068]  grad H = [-0.00225 -0.00000]  H = 61185729  (47.10 s)
10: [c d] = [0.97897 -0.24625] ~ 1/32768 [32079 -8068]  grad H = [-0.00190 -0.00002]  H = 61185721  (47.10 s)
11: [c d] = [0.97900 -0.24625] ~ 1/32768 [32080 -8068]  grad H = [-0.00152 -0.00002]  H = 61185692  (47.09 s)
12: [c d] = [0.97903 -0.24625] ~ 1/32768 [32081 -8068]  grad H = [-0.00102 -0.00000]  H = 61185799  (47.10 s)
13: [c d] = [0.97906 -0.24625] ~ 1/32768 [32082 -8068]  grad H = [-0.00066 -0.00000]  H = 61185767  (47.10 s)
14: [c d] = [0.97910 -0.24624] ~ 1/32768 [32083 -8068]  grad H = [-0.00033 -0.00001]  H = 61185797  (47.10 s)
15: [c d] = [0.97913 -0.24624] ~ 1/32768 [32084 -8068]  grad H = [ 0.00005 -0.00001]  H = 61185745  (47.08 s)
15: [c d]* = 1/32768 [32080 -8068]   H* = 61185692```

8. > > I meant that its totally slow comparing to a single SHR.
> > Probably slower than multiplication.
>
> I've tested against an order 0 model. There is no difference in processing time.
> A solution could be: (y-p)*(1-c)+d & (1<<LD_SCALE)-1 >> LD_SCALE

Not sure where there's "no difference".
That code for division via SAR is surely slower than a single SHR.

Also your d parameter has almost no meaning.
So I think that my formula would have a better effect even without touching
the third parameter.

Edit: Here's a fpaq0p version with my formula, btw
(and parameters tuned with my system, not to enwik8 though)

> > Guess you misunderstood... I meant that its possible to split it to increase
> > optimization speed.
>
> You mean to optimize just "typical" pieces of data and discard the rest?

I mean that at some points the probability value depends on the update parameter
less than on average. Like after a long sequence of 1s, the probability of 1
would be close to 1.0 with almost any parameters.
So it's possible to cut the file at these points and improve optimization speed.

Edit: Guess it's even possible to _force_ the counter to specific
probability values at some contexts, to make the recurrent
regions shorter.
As to how this improves optimization speed - at least
by factoring out the similar sequences.

> You mean one should approximate log2 using a Taylor polynomial?
> This would avoid divisions.

No, that's not necessary.
I mean that you can formally calculate a string probability, and it
would be a pure polynomial.
And probability is nowhere worse than its log2 for optimization.

> 15: [c d]* = 1/32768 [32080 -8068] H* = 61185692

That's not [32088 -7000]

9. Originally Posted by Shelwien
Not sure where's "no difference".
That code for division via SAR is surely slower than a single SHR.
I still don't measure any speed difference, the code is longer though.

Originally Posted by Shelwien
Also your d parameter has almost no meaning.
So I think that my formula would have better effect even not touching
the third parameter.
Well, the way I've rewritten your counter, I would interpret d as a rounding
constant. Your parameters are more meaningful than mine - but both representations
are equal (mathematically).

Originally Posted by Shelwien
Edit: Here's a fpaq0p version with my formula, btw
(and parameters tuned with my system, not to enwik8 though)
Comparing against my simple bitmodel is unfair here, since this model has
4096*2 "classes" (different pairs of [c d]) compared to just a single class
in my model. Btw I plan to separate the parameter classes exactly that way.
But altogether I've got far better results (in terms of how much you can
squeeze out of this).

61499101 bytes vs 61185692 bytes

Originally Posted by Shelwien
Edit: Guess its even possible to _force_ the counter to specific
probability values at some contexts, to make the recurrent
regions shorter.
Yep, this is what I want to do: introducing several classes depending on the
probability and on more/less probable symbols - but first I'll get the
simpler things to work (to complete my work).

Originally Posted by Shelwien
No, that's not necessary.
I mean that you can formally calculate a string probability and it
would be a pure polynomial.
Approximate the whole training string as a polynomial?

Originally Posted by Shelwien
And probability is nowhere worse than its log2 for optimization.
That's true (a monotone transform), but in my case (iterative minimization) log2
changes the problem's optimization metric.

Originally Posted by Shelwien
That's not [32088 -7000]
Yep; since I frequently change my optimizer's source, it behaves slightly
differently. Besides some internal changes, I used another initial estimate.

10. > > Not sure where's "no difference".
> > That code for division via SAR is surely slower than a single SHR.
>
> I still don't measure any speed difference, the code is longer though.

Guess you have something much slower there.
But in fpaq0pv4B this could easily cause a ~10% slowdown.

> Well, the way I've rewritten your counter I would interpret d as a rounding
> constant. Your parameters are more meaningful than mine - but both representations
> are equal (mathematically).

Your rounding constant is actually a fourth parameter, if I were to add it to my formula.

> > Here's a fpaq0p version with my formula, btw
> > (and parameters tuned with my system, not to enwik8 though)
>
> Comparing against my simple bitmodel is unfair here, since this model has
> 4096*2 "classes" (different pairs of [c d]) compared to just a single class
> with my model.

Only potentially. In fact there're different parameter values, but not that many -
specifically 6.

> Btw I plan to separate the parameter classes exactly that way.
> But altogether I've got far better (in terms of how much you can squeeze out
> of this) results.
>
> 61499101 bytes vs 61185692 bytes

I was talking about my p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
update formula - that's not exactly the same thing.

> > No, that's not necessary.
> > I mean that you can formally calculate a string probability and it
> > would be a pure polynomial.
>
> Approximate the whole training-string as a polynomal?

The probability of encountering the given string, yes.
But the point is that we can shift the parameters into the
[-0.5,0.5] range, and then only a handful of orders
would have any meaning.

For example, leaving a single parameter:
p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
string "00110100", we'd have

p = ( (44+p0) -
16*(9+p0)*w +
16*(1+7*p0)*w^2 -
64*(-6+7*p0)*w^3 +
32*(-22+35*p0)*w^4 -
256*(-3+7*p0)*w^5 +
256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256

And even here, e.g. w^8 is already relatively insignificant.
(The optimization target is a product of such p's, though).

11. Originally Posted by Shelwien
Guess you have something much slower there
But in fpaq0pv4B this could easily cause ~10% slowdown.
Could be true; there is one virtual function call per nibble. In addition,
estimating the gradient makes the overall process about 4 times slower. I tested
without gradient estimation - the speed stays the same. The speed is comparable to
fpaq0p.

Originally Posted by Shelwien
Your rounding constant is actually a fourth parameter, if I were to add it to my formula.
Unfortunately you are right. My first look wasn't careful enough. If I rewrite

p' = (p*(1-wr)+y*wr)*(1-mw) + pm*mw
   = (p + (y-p)*wr)*(1-mw) + pm*mw
   = p + (y-p)*wr*(1-mw) + (pm-p)*mw
   = p + (y-p)*w0 + (pm-p)*w1

p' = p + (y-p)*c + d

This way c=w0 weights new data, and d is the weighted average difference
between a mixed distribution and the estimated probability. However, this is a
bit "around the corner".

> > Here's a fpaq0p version with my formula, btw
> > (and parameters tuned with my system, not to enwik8 though)
>
> Comparing against my simple bitmodel is unfair here, since this model has
> 4096*2 "classes" (different pairs of [c d]) compared to just a single class
> with my model.

Originally Posted by Shelwien
Only potentially. In fact, there're different parameter values, but not that much,
specifically 6.
Code:
`int j = 19 + (i < 2048) + (i < 1024) + (i < 512) + (i < 256) + (i < 128);`
Yep, my fault. I only looked at the probability update. But I think there's
much potential in introducing several classes dependent on p and y.

Originally Posted by Shelwien
For example, leaving a single parameter:
p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
string "00110100", we'd have

p = ( (44+p0) -
16*(9+p0)*w +
16*(1+7*p0)*w^2 -
64*(-6+7*p0)*w^3 +
32*(-22+35*p0)*w^4 -
256*(-3+7*p0)*w^5 +
256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256

And even here eg. w^8 is already relatively insignificant.
(Optimization target is product of such p's, though).
To be a bit more precise you need more powers of w. My experience says that
0.5+w will be close to 0, so w will be close to -.5. Taking 2^-8 still
contributes in the domain of 1e-3. But this approach seems to be interesting too -
thanks for the explanation!
So the optimization target would be to maximize the probability of the training
string, dependent on the rescaled parameters?
How do you compute the coefficients a_i for w^i? I couldn't really identify the
patterns here - especially not at 3 in the morning

I really like the idea of shifting the parameters into this range to improve the
convergence speed.

12. > > Guess you have something much slower there
> > But in fpaq0pv4B this could easily cause ~10% slowdown.
>
> Could be true, there is one virtual function call per nibble.

Well, I checked, and it seems that virtual functions are called
statically (or even inlined) if the compiler is able to resolve them,
and via a single pointer otherwise - in which case it might be equivalent
to static calls due to branch prediction.
So I guess that virtual functions are really problematic only
with multiple inheritance etc.

> The speed is comparable to fpaq0p.

The fpaq0p binary, or the one you compiled?
Somehow I'm not sure if you have a good compiler.

> > Your rounding constant is fourth parameter actually, if I'd add it to my formula.
>
> Unfortunately you are right. My first look wasn't carefully enough.

It's quite good to have more parameters, actually.
However, using a delayed counter I can get as many parameters as I want.

> Yep, my fault. I only looked at the probability update. But i think here's
> much potential introducing several classes dependent on p and y.

In the end, counter optimization turns into a context clustering problem anyway.
I mean that an "ideal counter" can precisely predict the next
bit by context, without any coding.

> > For example, leaving a single parameter:
> > p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
> > string "00110100", we'd have
> >
> > p = ( (44+p0) -
> > 16*(9+p0)*w +
> > 16*(1+7*p0)*w^2 -
> > 64*(-6+7*p0)*w^3 +
> > 32*(-22+35*p0)*w^4 -
> > 256*(-3+7*p0)*w^5 +
> > 256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256
>
> To be a bit more precise you need more powers of w.

It's just an example - I never said I'd stop at that.
Though order 8 might be barely enough for a current
probability function representation...

> My experience says that 0.5+w will be close to 0 so w will be close to -.5.

It's always possible to adjust it to a wr range like [0..0.1] instead of [0..1].

> Taking 2^-8 still contributes in the domain of 1e-3.

Note that it's a bit probability function, not a string probability.
I'd expect order 16 to be enough for a bit probability, but
with a real optimization target it'd be more tricky.

> So the optimization target would be to maximize the probability of the training
> string dependent on the rescaled parameters?

You can take the log2 of the polynomial approximation after you calculate it,
and then a derivative of that.

> How do you compute the coefficients a_i for w^i. I couldn't really identify the
> patterns here - especially not at 3 in the morning

There're no common patterns... it probably correlates somehow with the processed data.
I just wrote a couple of update functions, calculated
b0[b0[b1[b0[b1[b1[b0[b0[p0]]]]]]]] in Mathematica, and then sorted it by powers of w.

But of course, for practical use you'd need a polynomial class with a custom
number class for the coefficients (not necessarily infinite precision, but
with a longer exponent part). Or the coefficients could be polynomials too, but
I doubt that there'd be enough precision even for simultaneous optimization
of two parameters.

But still, even if 1000s of coefficients would appear necessary, it would
be incomparably faster than processing all the data.

> I really like the idea of shifting the parameters into this range
> to improve the convergence speed

Actually I think that [-1,1] is a better target parameter range, as it'd
allow determining what can be cut off directly from the polynomial coefficients.

13. Originally Posted by Shelwien
Well, I checked and it seems that virtual functions are called
statically (or even inlined) if compiler is able to resolve them,
and via a single pointer otherwise - in which case in might be equivalent
to static calls due to branch prediction.
So I guess that virtual functions are really problematic only
with multiple inheritance etc.

fpaq0p binary or the one you compiled?
Somehow I'm not sure if you have a good compiler.
I've compiled everything using gcc 4.1.2 with the switches -O3 -lstdc++. Since
in this stage the program is a single file gcc might optimize the function calls
out. I haven't looked at the disassembly, though.

Originally Posted by Shelwien
Its quite good to have more parameters actually.
However using a delayed counter I can get as much parameters as I want.
Do you update your counter based on all delayed bits?

p_k+1 = t(p_k, y_k, y_k-1, ..., y_k-n+1)

I think converting the bits like x_k = y_k XOR (p_k>.5) could improve
compression, since you distinguish depending on more/less probable bits.
Doing things like this introduces the possibility of distinguishing a change
in statistics (the bit history), like ...00000111111....

EDIT: This is something Matt's newest fpaq0xx variant does: you can measure a _rapid_ increase in compression by increasing the "bit history length" over 2, which makes you able to distinguish a change in statistics like
...y (not y) ... However there are no further (parameter) optimizations

Originally Posted by Shelwien
In the end, counter optimization turns into context clustering problem anyway.
I mean that "ideal counter" can precisely predict the next
bit by context, without any coding.
Unfortunately ideality isn't practical
But the above thing sounds very much like context clustering...


No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

Maximizing the string probability should be the same as minimizing its coding
cost; the metric would change though. Using the coding cost would be better
numerically, I think.

Originally Posted by Shelwien
But still, even if 1000s of coefficients would appear necessary, it would
be incomparably faster than processing all the data.
This is very nice - since constructing the necessary coefficients once from the
training data and iterating over the polynomial is far faster than calculating
the coding cost function every step. The speedup would be huge - but how
precise would it be? I mean the slower way I'm optimizing atm has the advantage
of calculating the coding cost *exactly* using the system I'm targeting my
optimizations with (range coder, maybe a hash mapping of contexts to bit
models, etc...)

14. > I've compiled everything using gcc 4.1.2 with the switches -O3 -lstdc++. Since
> in this stage the program is a single file gcc might optimize the function calls
> out. I haven't looked at the disassembly, though.

Well, generally imho IntelC is better, but for basic fpaq0p clones gcc/mingw is quite ok.
Consider using -fprofile-generate though... it might make things significantly faster.

> > Its quite good to have more parameters actually.
> > However using a delayed counter I can get as much parameters as I want.
>
> Do you update your counter based on all delayed bits?
> p_k+1 = t(p_k, y_k, y_k-1, ..., y_k-n+1)

Well, you can skip some bits but how'd that help?
Another point is that estimation mapping is different
from update mapping... also p0's are selected by context too.
So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1
parameters.

> I think converting the bits like x_k = y_k XOR (p_k>.5) could improve
> compression, since you distinguish depending on more/less probable bits.

Its a static SSE, so p can be used in context of course.
And I agree that correlations in the optimized mapping coefficients
could be used to improve further optimization speed.
But I can't say that your suggestion would be helpful in general.

> Doing things like this introduces the possibility of distinguishing a change
> in statistics (the bit history), like ...00000111111....

You should check first whether such substrings make a significant amount of
output code.

> This is something Matt's newest fpaq0xx variant does:

You mean fpaq0f2?

> You can measure a _rapid_ increase in compression by increasing the "bit history length" over 2,

Its just a basic SSE coder, nothing special, except that Matt made it.
And its not even a good example because normally SSE is used to improve
an already good primary model, which is not the case with order0, and
secondary models for barely redundant data (like usually) and quite redundant
(like in fpaq0f) are not very similar.

> > In the end, counter optimization turns into context clustering problem anyway.
> > I mean that "ideal counter" can precisely predict the next
> > bit by context, without any coding.
>
> Unfortunately ideality isn't practical

This case is kinda special.
Static models have their use for asymmetrical statistical coding.

> But the above thing sounds very much like context clustering...

Yeah, lossy model compression with a perfect metric

> No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

Wonder what you meant by that.
The point of this polynomial representation is to be able to shrink it
by removing the insignificant powers of target parameter, while
a precise polynomial is easily computable but would have n+1 coefficients
for n bits of data, so no sense doing it.

> Maximizing the string probability should be the same as minimizing its coding
> cost; the metric would change though. Using the coding cost would be better
> numerically, I think.

After you calculate the approximate probability function, you can
easily turn it into approximate code length or whatever.
Point is that with linear update its possible to have a _precise_
polynomial representation of the string probability, while its log2
(the actual code length) can only be approximated.

> > But still, even if 1000s of coefficients would appear necessary, it would
> > be incomparably faster than processing all the data.
>
> This is very nice - since constructing the neccesarry coefficients once from the
> training data and iterating over the polynomal is far faster than calculating
> the coding cost function every step.

Its not exactly once... As I said, it would probably only allow calculating
an optimal value of a single parameter without multiple tests, but another
pass would be necessary to switch to another parameter.
So its basically about removing one dimension from optimization process.
And it would be surely impossible to have a precise enough approximation
with 3 or more parameters, while 2D polynomial might be possible but probably
impractical.

> The speedup would be huge - but how precise would it be?

As much as you want. Well, except for rc specifics.

> I mean the slower way I'm optimizing atm has the advantage
> of calculating the coding cost *exactly* using the system I'm targeting my
> optimizations with (range coder, maybe a hash mapping of contexts to bit
> models, etc...)

Right, but that blocks off many advanced options.
Like, its hardly possible to optimize the parameters even for a single
delayed counter with a 3-bit context, and full context clustering is
just impossible.
But if this polynomial approximation proves successful, that would allow
at least affordable optimization of multiple delayed counters.

15. Originally Posted by Shelwien
Well, you can skip some bits but how'd that help?
I meant you could be applying some function(s) of the delayed bits.

Originally Posted by Shelwien
Another point is that estimation mapping is different
from update mapping...
I never took this into account. One could interpret the PAQ implementation like
this: the automata itself represents a static update mapping, while the APMs
try to dynamically model the estimation mapping.

Originally Posted by Shelwien
also p0's are selected by context too.
So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1
parameters.
How do you get so many parameters? With k bits delay there are 2^k possible
discrete states; taking N parameters per state I get N*2^k parameters. Am I
missing something here?

Originally Posted by Shelwien
> Doing things like this introduces the possibility of distinguishing a change
> in statistics (the bit history), like ...00000111111....

You should check first whether such substrings make a significant amount of
output code.

-> fpaq0f2

> You can measure a _rapid_ increase in compression by increasing the "bit history length" over 2,

Its just a basic SSE coder, nothing special, except that Matt made it.
And its not even a good example because normally SSE is used to improve
an already good primary model, which is not the case with order0, and
secondary models for barely redundant data (like usually) and quite redundant
(like in fpaq0f) are not very similar.
What I wanted to point out:

I haven't checked if there are lots of places where such a change of statistics
happens. But I have a clue - make fpaq0f2's bit history length equal to 1, 2 and 3.
You'll see a notable increase in compression when reaching a history length of 2;
increasing further doesn't result in a larger step in compression.
I would interpret it as a clue that such changes of statistics happen frequently
(especially in lower order models, since the contextual separation isn't as
strong as with higher orders) and that the jump in compression gain is caused by the
ability of bit histories longer than two to distinguish such changes (..000111..)

Originally Posted by Shelwien
> > In the end, counter optimization turns into context clustering problem anyway.
> > I mean that "ideal counter" can precisely predict the next
> > bit by context, without any coding.
>
> Unfortunately ideality isn't practical

This case is kinda special.
Static models have their use for asymmetrical statistical coding.

> But the above thing sounds very much like context clustering...

Yeah, lossy model compression with a perfect metric

> No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

Wonder what you meant by that.
The point of this polynomial representation is to be able to shrink it
by removing the insignificant powers of target parameter, while
a precise polynomial is easily computable but would have n+1 coefficients
for n bits of data, so no sense doing it.
I meant, the approximate string probability would be:

prod i=1..n [ (1-y_i) + (2*y_i-1)*p_i ]

where n is the number of input bits and p_i is the estimate (for y_i = 1)
our model gives in the i-th step. If you estimate using a contextual separation,
the p_i are functions of the bit histories under each coding step's context. Let the
bit history under the i-th step's context be bh_i = b_1 ... b_k (k bits in the
current bit history), so p_i = f(bh_i), where f is the probability estimate
function dependent on our parameter(s):
p_i(k+1) = p_i(k)*(.5-w) + b_k*(.5+w)
(this can be rewritten as a series depending only on the b_j, p_0 and w)

Rewriting p_i(k) dependent on the last bit history updates and ordering by w
yields a polynomial in w, the thing you posted. This polynomial only depends on:
(1) the bit history b_1 .. b_k under the current context,
(2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
(3) the parameter w

p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)

So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
by the powers of w.

My question is, how can we compute the coefficients a_i? I tried to find patterns
by substituting p_k into p_k+1 and ordering by w, ... but I was hardly able to find
any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
Otherwise one can't implement it.

Some more stuff this afternoon, I have to catch my train

16. >> Well, you can skip some bits but how'd that help?
>
> I meant you could be applying some function(s) of the delayed bits.

And I already said that it should be done after correlation analysis
of optimized coefficients, not by correlations that you _think_
there'd be.

> > Another point is that estimation mapping is different
> > from update mapping...
>
> I never took this into account. One could interpret the PAQ implementation like
> this: the automata itself represents a static update mapping, while the APMs
> try to dynamically model the estimation mapping.

Yes, but paq uses this just to fit counters into bytes.

> > So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1 parameters.
>
> How do you get so much parameters?

As I said... there're N*2^k estimation parameters, N*2^(k+1) update mapping
parameters (including the just processed bit) and 2^(k+1)-1 initial probability
values for every context of lengths 1..k.
Well, also estimation and update mappings can have different contexts, but
you'd need enough initial values for update mapping to work.

Also its important that all these things are linear, so its easy to
calculate a set of coefficients for a perfect simulation of a simple
linear-update counter.

> I would interpret it as a clue that such changes of statistics happen frequently
> (especially in lower order models, since the contextual separation isn't as
> strong as with higher orders) and that the jump in compression gain is caused by the
> ability of bit histories longer than two to distinguish such changes (..000111..)

Whatever... Its the same as any other primary context model.
Actually fpaq0f2 has very similar performance to a traditional order1 model
and uses the same number of counters: http://shelwien.googlepages.com/fpaq0f2.htm
And there's no reason to think that previous data bytes are less significant
than context history bits... in fact, its kinda obvious that source models
don't use history bits - there're even data types with known generator, like
executables.

Its another case when the primary model grows huge and lacks statistics.
Then we just have to merge some contexts and context history similarity
is a good indicator of what to merge.
And that's what SSE is, basically, though it uses a quantized version of
context history aka probability.

> Rewriting p_i(k) dependent on the last bit history updates and ordering by w
> yields a polynomial in w, the thing you posted. This polynomial only depends on:
> (1) the bit history b_1 .. b_k under the current context,
> (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
> (3) the parameter w
>
> p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)
>
> So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
> by the powers of w.
>
> My question is, how can we compute the coefficients a_i? I tried to find patterns
> substituting p_k into p_k+1 and ordering by w, ... but i was hardly able to find
> any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
> Otherwise one can't implement it.

Guess I was wrong to leave p0 as a parameter in that example...
Look, imagine that we want to optimize just a single w parameter

p' = p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w) = w*(y-p)+(y+p)/2

Now lets assume that p=Sum[ a[i]*w^i, i=0..k ]. Then

p' = Sum[ (a[i]/2-a[i-1])*w^i, i=2..k ] + (a[1]/2+y-a[0])*w + (y+a[0])/2

a'[0] = (y+a[0])/2
a'[1] = (y + a[1]/2 - a[0])
a'[i] = (a[i]/2-a[i-1])

Is it clear now?

Then, its possible to work with more parameters in the same way,
eg. we can use the same polynomial class based on p0 for a[i] coefficients.
But I suspect that it'd take too much memory to get a tolerable precision
even with two variables.

17. Originally Posted by Shelwien
>> Well, you can skip some bits but how'd that help?
>
> I meant you could be applying some function(s) of the delayed bits.

And I already said that it should be done after correlation analysis
of optimized coefficients, not by correlations that you _think_
there'd be.

> > Another point is that estimation mapping is different
> > from update mapping...
>
> I never took this into account. One could interpret the PAQ implementation like
> this: the automata itself represents a static update mapping, while the APMs
> try to dynamically model the estimation mapping.

Yes, but paq uses this just to fit counters into bytes.

> > So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1 parameters.
>
> How do you get so much parameters?

As I said... there're N*2^k estimation parameters, N*2^(k+1) update mapping
parameters (including the just processed bit) and 2^(k+1)-1 initial probability
values for every context of lengths 1..k.
Well, also estimation and update mappings can have different contexts, but
you'd need enough initial values for update mapping to work.

Also its important that all these things are linear, so its easy to
calculate a set of coefficients for a perfect simulation of a simple
linear-update counter.

> I would interpret it as a clue that such changes of statistics happen frequently
> (especially in lower order models, since the contextual separation isn't as
> strong as with higher orders) and that the jump in compression gain is caused by the
> ability of bit histories longer than two to distinguish such changes (..000111..)

Whatever... Its the same as any other primary context model.
Actually fpaq0f2 has very similar performance to a traditional order1 model
and uses the same number of counters: http://shelwien.googlepages.com/fpaq0f2.htm
And there's no reason to think that previous data bytes are less significant
than context history bits... in fact, its kinda obvious that source models
don't use history bits - there're even data types with known generator, like
executables.

Its another case when the primary model grows huge and lacks statistics.
Then we just have to merge some contexts and context history similarity
is a good indicator of what to merge.
And that's what SSE is, basically, though it uses a quantized version of
context history aka probability.

> Rewriting p_i(k) dependent on the last bit history updates and ordering by w
> yields a polynomial in w, the thing you posted. This polynomial only depends on:
> (1) the bit history b_1 .. b_k under the current context,
> (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
> (3) the parameter w
>
> p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)
>
> So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
> by the powers of w.
>
> My question is, how can we compute the coefficients a_i? I tried to find patterns
> substituting p_k into p_k+1 and ordering by w, ... but i was hardly able to find
> any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
> Otherwise one can't implement it.

Guess I was wrong to leave p0 as a parameter in that example...
Look, imagine that we want to optimize just a single w parameter

p' = p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w) = w*(y-p)+(y+p)/2

Now lets assume that p=Sum[ a[i]*w^i, i=0..k ]. Then

p' = Sum[ (a[i]/2-a[i-1])*w^i, i=2..k ] + (a[1]/2+y-a[0])*w + (y+a[0])/2

a'[0] = (y+a[0])/2
a'[1] = (y + a[1]/2 - a[0])
a'[i] = (a[i]/2-a[i-1])

Is it clear now?

Then, its possible to work with more parameters in the same way,
eg. we can use the same polynomial class based on p0 for a[i] coefficients.
But I suspect that it'd take too much memory to get a tolerable precision
even with two variables.
Sorry, I've been busy the last few weeks.

Substituting the polynomial like this is easier (and cleverer ). Things are clear now, thanks again!

Do you have any experience/results in using different update and prediction mappings? This is something new to me.

18. Here's a real optimized 4-bit delayed counter.
Not that these specific values would be of any use, though

Code:
```P0                Xa              Xb                Ya              Yb
110110011000000 | 001111000010000 111010110110000 | 000011000011100 111111011100100
110110101011100 | 001011110011000 111001111110000 | 000011000001001 111111111010110
110101110000111 | 001011100001110 111001111000110 | 000011011010101 111111001111010
110011000000000 | 000100000000000 111001000000010 | 000011101111001 111111111111010
110011001010001 | 001110101000001 111000101100110 | 000001011010111 111100101100110
110010001001000 | 000101001100000 111011011011100 | 000000110010110 111110000000100
110010100100000 | 001001010100000 111001011001011 | 000000111011010 111101111110000
110000000000100 | 000000001100000 110011000100000 | 000000111111111 111011101111100
110110011000000 | 001111000010000 111010110110000 | 000011000011100 111111011100100
110110101011100 | 001011110011000 111001111110000 | 000011000001001 111111111010110
110101110000111 | 001011100001110 111001111000110 | 000011011010101 111111001111010
110011000000000 | 000100000000000 111001000000010 | 000011101111001 111111111111010
110011001010001 | 001110101000001 111000101100110 | 000001011010111 111100101100110
110010001001000 | 000101001100000 111011011011100 | 000000110010110 111110000000100
110010100100000 | 001001010100000 111001011001011 | 000000111011010 111101111110000
110000000000100 | 000000001100000 110011000100000 | 000000111111111 111011101111100

ctx = counter context
b = B[ctx] = bit history
c = C[ctx] = number of bits in context (history length)

if( c>=4 ) {
  p = ( c==4 ) ? P0[0] : P[ctx];   // P[] = stored estimate per context
  p = Xa[b&15]*(1-p) + Xb[b&15]*p;
} else {
  s = (1<<c); s += b & (s-1);
  p = P0[s];
}

[...]

if( c>=4 ) {
  p = ( c==4 ) ? P0[0] : P[ctx];
  P[ctx] = Ya[b&15]*(1-p) + Yb[b&15]*p;
}```

19. The implementation is clear to me; I meant, how does it perform compared to a simple counter a-la fpaq0p? (Or a counter with optimized prediction=update mapping.) And how did you optimize the parameters? Have you printed the parameters in binary (the patterns somehow look like binary search)?

20. > The implementation is clear to me,

Really? This initialization workaround wasn't that clear to me,
I was actually quite disappointed with delayed counters until
accidentally finding the real importance of initial values.

> i meant, how does it perform, compared to a simple counter a-la fpaq0p?

Well, you know, I don't use these counters (mine provide better compression).
And anyway, the results I have were taken under SSE.
But such a delayed counter can perfectly simulate the behavior of usual
counter, so it can never perform worse.
Likewise, using longer context (bit history) never hurts compression.

> (Or a counter with optimized prediction=update mapping).

There's no average comparison for that.
I can only say that its never worse, and can be better adapted to
any environment, so I'd probably be able to construct bit sequences
where it'd have as much advantage as you'd like.

Well, I also admit that improvement won't look impressive to the audience here.
Afaik it was like this:
371699 // normal counters
370757 // 3bit delayed
370606 // 4bit delayed
But for me it was quite significant - I couldn't gain this much
by "heavy" techniques like mixing in another submodel or expanding contexts.

> Have you printed the parameters binary (the patterns somehow look like binary search)?

Of course that's binary. But I keep them that way, so these were copy-pasted from the source.

> And how did you optimize the parameters?

My optimizer is even dumber than your implementation - it doesn't use any gradients, at least.
Its a perl script which manipulates these bit strings in the compressor executable and runs
it until satisfied.
Well, its also a part of my C++ syntax extensions so can't be replaced that easily even
if/when I'd switch to polynomial optimization.

21. Originally Posted by Shelwien
Really? This initialization workaround wasn't that clear to me,
I was actually quite disappointed with delayed counters until
accidentally finding the real importance of initial values.
What do you mean here? I haven't read anything in your source which I wouldn't have done in the same way?! The right parameters are another thing.

Originally Posted by Shelwien
> (Or a counter with optimized prediction=update mapping).

There's no average comparison for that.
I can only say that its never worse, and can be better adapted to
any environment, so I'd probably be able to construct bit sequences
where it'd have as much advantage as you'd like.

Well, I also admit that improvement won't look impressive to the audience here.
Afaik it was like this:
371699 // normal counters
370757 // 3bit delayed
370606 // 4bit delayed
But for me it was quite significant - I couldn't gain this much
by "heavy" techniques like mixing in another submodel or expanding contexts.
Improving the overall performance in the domain of .1 - 1% is significant, especially when it doesn't come from adding models, etc.

Originally Posted by Shelwien
> And how did you optimize the parameters?

My optimizer is even dumber than your implementation - it doesn't use any gradients, at least.
Its a perl script which manipulates these bit strings in the compressor executable and runs
it until satisfied.
Well, its also a part of my C++ syntax extensions so can't be replaced that easily even
if/when I'd switch to polynomial optimization.
And your script does a binary search? Somehow your parameter patterns look like a result of this (simply the "look and feel") and since you output binary strings instead of using the actual values...

The gradient is just another source of information for the optimization... well, I don't see what should be wrong with this optimization approach. My counters aren't as good as yours, but this is related to the optimization target, not the optimization technique.

22. > > Really? This initialization workaround wasn't that clear to me,
> > I was actually quite disappointed with delayed counters until
> > accidentally finding the real importance of initial values.
>
> What do you mean here? I haven't read anything in your source

Well, its just some example code, not "my source".
Originally X and Y had fixed 2 MSBs (00 and 11), and there
was a single multiplication in extrapolation/update etc.

> which i wouldn't have done in the same way?!

I don't say its not trivial.
But then again, there's a lot of space for alternative
implementations. Like, I doubt you'd write it the same
way with P0[0].
Also, about the thing I mentioned, its quite appealing
to use the same mappings for incomplete contexts, by
padding them with zeroes or something, like its always
done with contexts and counters. Without the P0 array.
At least that's how I implemented it first and the compression
difference was quite unexpected. In fact, usual combinatorial
counters (with variable-rate update) worked better.

> Improving the overall performance in the domain of .1 - 1%
> is significant, especially when it doesn't come from adding models, etc.

Well, yes. Specifically here it was like going from 203k to 202k
on book1 without text tricks.

> And your script does a binary search?

It doesn't. What it does is imho closer to genetic optimization.
Also this representation (and optimizer itself) was first used
for context quantization and other discrete values, and only after
that I found that its good enough for usual parameters too.
Well, it isn't that good with update rates optimization, but I have
some tricks to overcome that.

> Somehow your parameter patterns look like a result of this
> (simply the "look and feel") and since you output binary strings
> instead of using the actual values...

Add binary "line numbers" (0000..1111) to that table and
you might see why. That'd be history bits, I mean.

> The gradient is just another source of information for the
> optimization... well, I don't see what should be wrong with
> this optimization approach.

1. Processing the whole sample for each measurement is dumb.
2. As I said, my optimizer is even dumber than yours in that
because it works only with actual parameter values, not gradients.

But I can't do much about that, as gradients are not applicable
for discrete parameters, and its troublesome to automate their
calculation for a practical model and even more troublesome to
manually write another model specifically for optimization.

> My counters aren't as good as
> yours, but this is related to the optimization target, not
> the optimization technique.

What I'm basically saying is that your "optimization technique"
is more advanced, but maybe not as widely applicable.

Btw, my C++ preprocessor has separate modes for release versions,
and optimization builds, where these binary patterns/values are
embedded into executable as is and accessible to optimizer.

23. The first thing which came to my mind was to adaptively learn the mapping for the bit histories of a length < 4. Especially in higher order contexts this is significant, or at least having a "better than p=.5" mapping for short bit histories.

Originally Posted by Shelwien
Add binary "line numbers" (0000..1111) to that table and
you might see why. That'd be history bits, I mean.

p = Xa[b&15]*(1-p) + Xb[b&15]*p;
It's obvious. But I'm quite surprised that the two mappings sometimes differ that much (e.g. X, Y at the 4th line).

Originally Posted by Shelwien
1. Processing the whole sample for each measurement is dumb.
2. As I said, my optimizer is even dumber than yours in that
because it works only with actual parameter values, not gradients.

But I can't do much about that, as gradients are not applicable
for discrete parameters,
Well, it is the way things are done in (mixed) integer programming: find a continuous solution, keep discretizing it, and keep track of how the discretization hurts the optimum (e.g. with some search tree). Since the influence of my rounding parameter is tiny compared to the learning rate, I find a good enough solution using this iterative optimization and then run a brute-force search around the found minimum in parameter space. Basically it's a question of what you numerically call "continuous" and discrete.

Originally Posted by Shelwien
and its troublesome to automate their
calculation for a practical model and even more troublesome to
manually write another model specifically for optimization.
My sources are designed to be extensible. The optimizer class's argument is a target function, which can return its value and gradient at a desired point in parameter space. This can be replaced. There's a similar mechanism within the cost function to replace a certain model.

24. > The first thing which came to my mind was to adaptively
> learn the mapping for the bit histories of a length < 4.
> Especially in higher order contexts this is significant, or
> at least having a "better than p=.5" mapping for short bit
> histories.

1. Adaptation would work like that only for stationary contexts.
In fact, there's a lot of cases where _any_ adaptive mapping
would be bad. But even if it improves compression, I doubt
that you'd be able to detect an optimal configuration among
all mapping's states during processing of nonstationary data.

2. Optimal static mapping is even harder to find if its used
under SSE or mixing or both. But local entropy optimization
won't work for sure.

3. Anyway, if an adaptive mapping's snapshot works well
as a static mapping, imho it means that adaptive mapping
itself would work even better.

> > Add binary "line numbers" (0000..1111) to that table and
> > you might see why. That'd be history bits, I mean.

> p = Xa[b&15]*(1-p) + Xb[b&15]*p;
> It's obvious. But I'm quite surprised that the two
> mappings sometimes differ that much (e.g. X, Y at the 4th line).

Its an extrapolation at the _end_ of 0100 sequence, and
update at its start - just before 1.
Well, here're sorted tables:
Code:
```Xa                   Xb                     Ya                   Yb
000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
000100000000000 011 111000101100110 100 | 000000111011010 110 111100101100110 100
000101001100000 101 111001000000010 011 | 000000111111111 111 111101111110000 110
001001010100000 110 111001011001011 110 | 000001011010111 100 111110000000100 101
001011100001110 010 111001111000110 010 | 000011000001001 001 111111001111010 010
001011110011000 001 111001111110000 001 | 000011000011100 000 111111011100100 000
001110101000001 100 111010110110000 000 | 000011011010101 010 111111111010110 001
001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011```
Like this it's more visible how the mapping depends on bits in
different positions and on the number of 1s in the context.
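The quoted mapping, p = Xa[b&15]*(1-p) + Xb[b&15]*p, can be sketched like this (the table contents below are placeholders, not the optimized values from the post):

```cpp
#include <cstdint>

// Sketch of the history-indexed linear mapping: the last 4 history
// bits of b select a pair (Xa[i], Xb[i]), and the new prediction is
// a per-history linear function of the old one.
double Remap(const double Xa[16], const double Xb[16],
             uint32_t b, double p) {
  return Xa[b & 15] * (1.0 - p) + Xb[b & 15] * p;
}
```

With Xa[i] = 0 and Xb[i] = 1 the mapping is the identity; the optimizer moves each pair away from that to fit the data.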

> Basically its a question of what you numerically call "continuous" and discrete.

Well, some things _are_ more discrete than others.
For example, a bit in a binary number can determine whether
two adjacent columns in the table are merged into a single context, or not.

Of course that can be somewhat changed, e.g. with mixing, but
resources don't allow mixing _everything_, so...

> > and its troublesome to automate their
> > calculation for a practical model and even more troublesome to
> > manually write another model specifically for optimization.
>
> My sources are designed to be extensible. The optimizer
> class's argument is a target function,

Is it a template argument or member function argument?

> which can return its value and gradient at a desired point in parameter space.
> This can be replaced. There's a similar mechanism within the
> cost function to replace a certain model.

I still think that this is possible only for a simple CM where counters
are directly used for coding. And even there it probably causes a significant
slowdown... if you don't use different classes for the optimizer and
the real compressor.

25. With adaptive learning I meant something like a stationary counter (n1, n) -> (n1+y, n+1) for each partial context (here: len<4), which is optimal for a stationary source. After enough learning this counter should provide a near-optimal solution (the same as minimizing coding cost), since (n1+y)/(n+1) ~ n1/n when n1, n >> 1. The learned adaptive mapping (n >> 1) could be used like your p0[s]. I don't know if online learning is worse than offline learning here.
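The stationary counter described here can be sketched as follows (the +1/+2 correction is my addition to avoid 0/0 before any bits are seen; for n1, n >> 1 it matches n1/n):

```cpp
// Sketch of a stationary counter: state (n1, n) becomes
// (n1 + y, n + 1) on each bit y, so n1/n converges to the source
// probability of a 1 bit for a stationary source.
struct StationaryCounter {
  unsigned n1 = 0, n = 0;   // count of 1-bits, total bits seen
  void Update(int y) { n1 += y; ++n; }
  double P1() const { return (n1 + 1.0) / (n + 2.0); }
};
```

After enough updates, P1() can be frozen and used as the static mapping value for the corresponding partial bit history.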

Originally Posted by Shelwien
> > Add binary "line numbers" (0000..1111) to that table and
> > you might see why. That'd be history bits, I mean.

> p = Xa[b&15]*(1-p) + Xb[b&15]*p;
> It's obvious. But I'm quite surprised that the two
> mappings sometimes differ that much (e.g. X, Y in the 4th line).

It's an extrapolation at the _end_ of the 0100 sequence, and
an update at its start - just before the 1.
Well, here are the sorted tables:

Xa Xb Ya Yb
000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
000100000000000 011 111000101100110 100 | 000000111011010 110 111100101100110 100
000101001100000 101 111001000000010 011 | 000000111111111 111 111101111110000 110
001001010100000 110 111001011001011 110 | 000001011010111 100 111110000000100 101
001011100001110 010 111001111000110 010 | 000011000001001 001 111111001111010 010
001011110011000 001 111001111110000 001 | 000011000011100 000 111111011100100 000
001110101000001 100 111010110110000 000 | 000011011010101 010 111111111010110 001
001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011

Like this it's more visible how the mapping depends on bits in
different positions and on the number of 1s in the context.
Could you please explain this again and mark the bits? ATM I don't know what you mean.

Originally Posted by Shelwien
> Basically its a question of what you numerically call "continuous" and discrete.

Well, some things _are_ more discrete than others.
For example, a bit in a binary number can determine whether
two adjacent columns in the table are merged into a single context, or not.
What do you mean exactly? The contextual input separation? The mapping isn't directly involved in the optimization maths - it only selects bit models. The whole mathematical part is related to bit modelling, which is continuous.

Originally Posted by Shelwien
Is it a template argument or member function argument?

I still think that this is possible only for a simple CM where counters
are directly used for coding. And even there it probably causes a significant
slowdown... if you don't use different classes for the optimizer and
the real compressor.
Most of the stuff in the optimizer is bloated up to allow easier extension and maintenance (virtual functions, some ABCs, ...). No template arguments, since I don't require (very) fast code. Most of the time is eaten up by calculations that aren't directly involved with modelling anyway, like gradient calculations, some 64-bit divisions, etc.
Of course I separate the optimizer and the actual implementation of the compressor.

Yesterday I wanted to see how well the PAQ state mapping does and whether it can be improved. In most cases the optimizer tends to diverge here, and a short brute-force search revealed that there's not much chance of improvement. Somehow Matt's y-p>>8 was quite a good choice. I tried orders 0-2 only. The 15-bit parameter quantization seemed to be too coarse here, since changing a parameter by one quantization level often caused the gradient's sign to flip.

26. > which is optimal for a stationary source. After enough learning this counter should
> provide a near optimal (same as minimizing coding cost)
> solution, since (n1+y)/(n+1) ~ n1/n when n1, n >> 1.

Of course it would converge somewhere, but at best that would give us an optimal
probability for static encoding of the sequence.
1. "At best" = only for stationary contexts... which don't actually exist as it seems.
2. We are actually talking about mapping optimization, and that mapping is
recurrent and nonlinear. Well, there may be some sense in optimizing
the extrapolation mapping like that... after finishing the optimization
of update mapping and initial values.

Also imho the whole idea of Markov model convergence is misleading.
I mean, it really makes sense only for a determined state machine,
but it is commonly applied to cases where the probability "converges"
to some value far from 0 or 1, and that is considered "stationary".
But how can it be truly stationary if the data was generated by
a pretty much determined source (like e.g. executables are), and
the probability distribution only looks like some curve because
the context is far from what it should be?
So that's why there's no stationary data, and probability
distributions can only be locally optimal.

> The learned adaptive mapping (n>>1) could be used like your p0[s].

As I said, these P0's have a much greater effect than expected.
At least, statistics should be collected among similar contexts,
but that wouldn't be optimal either.

> I don't know if online learning is worse than offline learning here.

Adaptation is only good when it catches the local dependencies.
But in this case it would be too hard to trace the dependency
of the current estimation back to P0.

> > 000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
> [...]
> > 001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011
> >
> > Like that its more visible how it depends on bits in different positions
> > and number of 1s in the context.
>
> Could you please explain this again and mark the bits. ATM i don't know what you mean.

3-bit columns are contexts, most recent bit on the right.
And also I forgot to mention that my example worked with the probability of a 0 bit.

> > Well, _there're_ some more discrete things than others.
> > For example, a bit in a binary number can determine whether
> > two adjacent columns in the table are merged into single context, or not.
>
> What do you mean exactly? The contextual input seperation?
> The mapping isn't directly involved in the optimization
> maths - it only selects bit models. The whole math thing is
> related to bit modelling, which is continuous.

I was talking about my optimizer, which has to optimize such
discrete parameters as a context quantization map, so it is unable
to use gradient methods.

Btw, have you tried using gradients in the model?
That might be interesting.

> Most of the stuff in the optimizer is bloated up to allow
> easier extension and maintainance (virtual functions, some
> abcs, ...). No template arguments, since i don't require
> (very) fast code. Most of the time is eaten up by
> calculations anyway, which aren't directly involved with
> modelling, like gradient calculations, some 64 bit
> divisions, etc.

I still don't understand how you're going to, e.g., optimize
a counter under SSE with it.

> Of course i seperate the optimizer and the actual
> implementation of the compressor.

I don't see how it's "of course" - imho that totally slows
down the development and also introduces the possibility of
bugs due to a mismatch between the release and optimization models.
On the other hand, my optimizer allows fully retuning
the production model to new sample data, including
a new context quantization, in 3-5 days and mostly automatically.
And it could be much faster too, if not for my modest
computational resources.

> Yesterday i wanted to see, how well the paq state mapping
> does and if it can be improved,

The biggest problem with Matt's models is that they're too monolithic -
it's almost impossible to replace some element with a completely
different implementation and keep the same results.

> changing a parameter by one quantization level often caused the gradient's sign to flip.

That's basically why I don't use gradients at all.

27. Originally Posted by Shelwien
1. "At best" = only for stationary contexts... which don't actually exist as it seems.
If the contextual separation is strong, this might hold as an approximation. A perfect context should give uniform bit histories; in practice we are far from perfect (as in your example - x86 data with order-n contexts).

Originally Posted by Shelwien
2. We are actually talking about mapping optimization, and that mapping is
recurrent and nonlinear. Well, there may be some sense in optimizing
the extrapolation mapping like that... after finishing the optimization
of update mapping and initial values.
In my previous post I didn't think of separate mappings (concerning this). BTW: how do you *separately* formulate/work out an optimized update/prediction mapping? This might be some kind of misunderstanding, but you first generate and optimize a mapping. This will be the update mapping. Based on it you then optimize the "prediction" mapping? I can't imagine another way of doing this.

Originally Posted by Shelwien
Also imho the whole idea of Markov model convergence is misleading.
I mean, it really makes sense only for a determined state machine,
but it is commonly applied to cases where the probability "converges"
to some value far from 0 or 1, and that is considered "stationary".
But how can it be truly stationary if the data was generated by
a pretty much determined source (like e.g. executables are), and
the probability distribution only looks like some curve because
the context is far from what it should be?
So that's why there's no stationary data, and probability
distributions can only be locally optimal.
True, since "perfect" contextual separation is impossible (knowing every property of the source), in general at least. Global optimality is impossible anyway.

Originally Posted by Shelwien
As I said, these P0's have a much greater effect than expected.
At least, statistics should be collected among similar contexts,
but that wouldn't be optimal either.
I don't expect you optimized the P0's for every context (or some larger number of contexts) within a model? One P0 set per update/prediction mapping for every model?

Originally Posted by Shelwien
3-bit columns are contexts, most recent bit on the right.
And also I forgot to mention that my example worked with probability of 0.
I tried to read the numbers in both directions and didn't get it. I'll reread everything tomorrow. Too late to get useful thoughts at this time.

Originally Posted by Shelwien
I was talking about my optimizer, which has to optimize such
discrete parameters as a context quantization map, so it is unable
to use gradient methods.
Ok, interesting. Do you mean context or bit history quantization, or both? Can you tell me something more about this? And show some results on "general data"? I don't know which data you deal with.

Originally Posted by Shelwien
Btw, have you tried using gradients in the model?
That might be interesting.
You mean online optimization of the parameters, or adding the gradient itself as an additional degree of freedom? ATM I haven't, and I don't expect it to be practical (computation effort vs. gain). But I might try it; it would be interesting to see what happens.

Originally Posted by Shelwien
I still don't understand how you're going eg. optimize
a counter under SSE with it.
Having optimized the models which generate the input probability distribution, you basically have the same situation as with context mappings - map some input (context, probability, ...) to some output (probability). The input basically is a number which selects a bit model.
My optimization target is the estimated entropy H(source) ~ sum_{i=1..n} -ln p_i, where p_i is the estimated probability of the bit coded in the i-th coding step. p_i is related to "some output" (see above), and the connected "some input" is the context in the i-th step.
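This target can be sketched as a small cost function (a sketch only; log2 gives the length in bits, while the post's -ln gives nats, differing only by a constant factor):

```cpp
#include <cmath>
#include <vector>

// Estimated code length: sum_i -log2(p_i), where p_i is the
// probability the model assigned to the bit actually coded in
// step i. Minimizing this over the model parameters is the
// optimization target described above.
double CodeLengthBits(const std::vector<double>& p_coded) {
  double bits = 0.0;
  for (double p : p_coded) bits -= std::log2(p);
  return bits;
}
```

A perfectly confident correct prediction (p = 1) costs 0 bits; p = 0.5 costs exactly 1 bit per step.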

Good night

EDIT: I said that I have to write a more or less short text. I've completed a section about arithmetic coding. Comments and suggestions are welcome. I can't go too deep into detail, since the whole work is limited to about 20 pages and this part (together with others) is just "accessories". While rereading several times I found lots of typos. Every review will help. Thanks in advance!

28. > > after finishing the optimization of update mapping and initial values.

> In my previous post i didn't think of seperate mappings (concerning this).

> How do you *seperately* formulate/work out an optimized update/prediction mapping?
> This might be some kind of misunderstanding, but you first
> generate and optimize a mapping. This will be the update
> mapping. Based on this mapping you optimize the "prediction"
> mapping? I can't imagine another way of doing this.

My optimizer just randomly modifies parameters until it finds some metric improvement
(btw, that metric isn't necessarily just code length. In some cases it includes
other components, like a function of the model's memory size).
And parameter subsets may be optimized separately or not, depending on the requirements,
but either way it continues iteratively until there are no improvements for a long
enough time.
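A toy sketch of such a random-search loop (hypothetical, not the actual optimizer): perturb one parameter at a time, keep the change only if the metric improves, and stop after a run of consecutive failures.

```cpp
#include <cstdlib>
#include <functional>
#include <vector>

// Random hill climbing: accept a random unit perturbation of one
// parameter only when the metric improves; give up after `patience`
// consecutive failed attempts.
std::vector<int> RandomSearch(
    std::vector<int> params,
    const std::function<double(const std::vector<int>&)>& metric,
    int patience) {
  double best = metric(params);
  int fails = 0;
  while (fails < patience) {
    std::vector<int> trial = params;
    size_t i = std::rand() % trial.size();
    trial[i] += (std::rand() & 1) ? 1 : -1;   // small random step
    double m = metric(trial);
    if (m < best) { best = m; params = trial; fails = 0; }
    else ++fails;
  }
  return params;
}
```

Since a change is only ever accepted on improvement, the result can never be worse than the starting point; the patience threshold plays the role of "no improvements for a long enough time".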

> > Also imho the whole idea of markov model convergence is misleading.
>
> True, since "perfect" contextual separation is impossible
> (knowing every property of the source), in general at least.
> Global optimality is impossible anyway.

That's exactly what I meant... it's misleading.
Because there are no probabilities in the source.
Well, probabilities are applicable to an analog source,
but even text is not like that.

> > As I said, these P0's have much greater effect than expected.
> > At least, statistics should be collected among similar contexts,
> > but that wouldn't be optimal either

> I don't expect you optimized the P0's for every (or some
> larger amount) of contexts within a model?

Why, of course it's supposed to be optimized for every context separately.
In this model for a codec which I'm currently supporting, there are
48 counter tables, 24 SSE1 tables and 12 SSE2 tables, each with its own
optimized set of parameters (including the context).

Also this is a specialized model for a known data type, so I'm sure
that a proper universal model would have 10x of that (or more).

> One P0 set per update/prediction mapping for every model?

There's no real reason to use the same counter parameters for
different contexts.
However, I don't widely use delayed counters yet; I think that
should wait until a successful implementation of a polynomial (or another)
fast optimizer.
The problem there is that I cannot just replace all the counters
with delayed ones, though that would surely be best for compression.
But each delayed counter requires two additional multiplications and
two table lookups, so it's too slow... and searching for a new balanced
model structure with the current methods would take a lot of time.

> Ok, interesting. Do you mean contexts or bit history quantization, or both?

Both.

Here's a context declaration and code generated for it:

Index y
ych: ch, 1!1
yfb: fb, 1!00100010
yp0: p01, 1!0000000000010000
y_s: As, -7!000000000000000
y_p: p, -7!010000000000001
y_q: q, -7!001000000000111
yr1: r1, -7!000000000000000
y_r: r , -7!100000000000000
yr2: r2, -7!000000000000000
y_col: col, 1!0000000000000000000

y = 0;
y = y*2 + ((ch>0+(1-1)));
y = y*3 + ((fb>2+(1-1))+(fb>6+(1-1)));
y = y*2 + ((p01>11+(1-1)));
y = y*3 + ((p>1+(-7-1))+(p>14+(-7-1)));
y = y*5 + ((q>2+(-7-1))+(q>12+(-7-1))+(q>13+(-7-1))+(q>14+(-7-1)));
y = y*2 + ((r >0+(-7-1)));

p, q, r are some adjacent values (it's a table),
ch is a channel, col is a table column,
and fb, p01 are scaled probabilities from other counters.

> And show some results on "general data", i don't know which data you deal with.

I haven't made any "universal" models since ash, so no "general data" for you.
The thing is, I really want to create a proper model for text, but it gets too complex.

> You mean on online optimization of the parameters or add the

Yeah - if a probability can be a context, why can't a probability gradient?
Somehow that reminds me of my idea of using interval arithmetic for more
precise statistics.

> > I still don't understand how you're going eg. optimize a counter under SSE with it.

> Having optimized the models which generate the input
> probability distribution you basically have the same
> situation as with context mappings - map some input
> (context, probability, ...) to some output (probability).
> The input basically is a number which selects a bit model.
> My optimization target is the estimated entropy H(source) ~
> sum_{i=1..n} -ln p_i, where p_i is the estimated probability
> of the bit coded in the i-th coding step. p_i is related to
> "some output" (see above), and the connected "some input" is
> the context in the i-th step.

Seems like a misunderstanding again...
You see, if you have a prediction p from some model, already
optimized by the entropy of the p sequence, that doesn't mean that
the SSE(p) sequence would be optimal right away.
In fact, my counters, which were optimized under SSE, are
no good without SSE.
And SSE is not a continuous function despite its interpolation,
so you cannot really calculate a gradient for it.

> preview.pdf (63.7 KB, 10 views)
> Comments and suggestions are welcome.

Don't call the result of arithmetic coding a "floating-point number".
It's a rational number, or a real one, but not FP, as that implies
limited precision.

Also, the description is more or less traditional, but I think that
it's impossible to really write a range coder using it.
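For reference, here is a minimal LZMA-style binary range coder (a generic textbook construction, not the code from the discussed pdf) — the renormalization and the carry handling via cache/cacheSize are exactly the details that traditional descriptions tend to omit:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// `p` is the probability of a 0 bit as a 12-bit fixed-point value.
// Renormalize when range drops below 2^24; carries are deferred
// through cache/cacheSize (a pending 0xFF run).
struct RangeEncoder {
  std::vector<uint8_t> out;
  uint64_t low = 0;
  uint32_t range = 0xFFFFFFFFu;
  uint8_t cache = 0;
  uint64_t cacheSize = 1;

  void ShiftLow() {
    if ((uint32_t)low < 0xFF000000u || (low >> 32) != 0) {
      uint8_t carry = (uint8_t)(low >> 32);
      do { out.push_back((uint8_t)(cache + carry)); cache = 0xFF; }
      while (--cacheSize);
      cache = (uint8_t)(low >> 24);
    }
    ++cacheSize;
    low = (uint32_t)low << 8;          // keep low 24 bits, shifted up
  }
  void Encode(int bit, uint32_t p) {
    uint32_t bound = (range >> 12) * p;
    if (bit == 0) range = bound;
    else { low += bound; range -= bound; }
    while (range < (1u << 24)) { range <<= 8; ShiftLow(); }
  }
  void Flush() { for (int i = 0; i < 5; ++i) ShiftLow(); }
};

struct RangeDecoder {
  const uint8_t* in;
  size_t pos = 0;
  uint32_t range = 0xFFFFFFFFu, code = 0;

  explicit RangeDecoder(const uint8_t* buf) : in(buf) {
    for (int i = 0; i < 5; ++i) code = (code << 8) | in[pos++];
  }
  int Decode(uint32_t p) {
    uint32_t bound = (range >> 12) * p;
    int bit;
    if (code < bound) { bit = 0; range = bound; }
    else { bit = 1; code -= bound; range -= bound; }
    while (range < (1u << 24)) { range <<= 8; code = (code << 8) | in[pos++]; }
    return bit;
  }
};
```

The decoder mirrors the encoder's interval arithmetic exactly, which is why renormalization must happen at the same points on both sides.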

> While rereading several times i found lots of mistypes.

Yeah, there are lots, but I'm too lazy to extract the text from the pdf,
and I'd say the same as whatever spellchecker anyway.

29. Originally Posted by Shelwien
The problem there is that I cannot just replace all the counters
with delayed ones, though that would surely be the best for compression.
But each delayed counter requires two additional multiplications and
two table lookups, so its too slow
...
I'm surprised to hear something like this from you.

BTW, exactly this slowness was the reason for my counters having such a weird parameter d. Even a single multiplication is noticeable. I didn't want to add additional overhead.
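One way to read this (my interpretation, not toffer's actual code): if the rate 1-c is restricted to a power of two 2^-S, the update p' = p + (y - p)*(1-c) + d needs only a shift and two adds instead of a multiplication:

```cpp
#include <cstdint>

// Sketch of a multiplication-free counter update. p is a PBITS-bit
// fixed-point probability of a 1 bit; d is the small additive offset
// from the post. PBITS and S are illustrative values.
enum { PBITS = 15, S = 5 };

int32_t Update(int32_t p, int y, int32_t d) {
  // p' = p + (y - p) * 2^-S + d, with y scaled to fixed point.
  return p + (((y << PBITS) - p) >> S) + d;
}
```

The price is that 1-c can only take values 2^-S, which constrains the optimizer's parameter space compared to the general form.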

Originally Posted by Shelwien
Index y
ych: ch, 1!1
yfb: fb, 1!00100010
yp0: p01, 1!0000000000010000
y_s: As, -7!000000000000000
y_p: p, -7!010000000000001
y_q: q, -7!001000000000111
yr1: r1, -7!000000000000000
y_r: r , -7!100000000000000
yr2: r2, -7!000000000000000
y_col: col, 1!0000000000000000000

y = 0;
y = y*2 + ((ch>0+(1-1)));
y = y*3 + ((fb>2+(1-1))+(fb>6+(1-1)));
y = y*2 + ((p01>11+(1-1)));
y = y*3 + ((p>1+(-7-1))+(p>14+(-7-1)));
y = y*5 + ((q>2+(-7-1))+(q>12+(-7-1))+(q>13+(-7-1))+(q>14+(-7-1)));
y = y*2 + ((r >0+(-7-1)));
What do the bit patterns in the declarations mean? I can see how you describe your quantization, and I think I know where the numbers come from (comparisons = bit positions, multipliers make room for the number of quantization levels per quantization input). Maybe I should ask what kind of data you work with?

Originally Posted by Shelwien
Seems like a misunderstanding again...
What I wanted to point out here is that a context model and an SSE map are implemented in the same fashion (a mapping of a context to a bit model). So after optimizing the counters of the model, I can optimize the SSE parameters. An SSE implementation can have more parameters (for interpolation, etc.), but ATM I'm just focusing on optimizing bit models, so these are the only degrees of freedom in my SSE implementation.

I don't expect anybody to be my spellchecker; I'll fix the article. Thanks for reading.

BTW: my reason for being so lazy is that I'm running a Gentoo distribution. Installing new things requires compiling them.

30. > > But each delayed counter requires two additional multiplications and
> > two table lookups, so its too slow...
>
> I'm surprised to hear something like this from you

by using SSE2 instead of mixers. So eventually I'd
probably be forced to do some speed optimizations.
For now, luckily, the psychoacoustic part is much slower.

> BTW, exactly this slowness was the reason for my counters
> having such a weird parameter d. Even a single

Unary coding allows using relatively expensive formulas.
E.g. if you use bitwise encoding with a single multiplication per bit,
and 50% of the bytes in your data have the same value, then you
can use 4 multiplications for separate modelling of that value
and still have the same total number of multiplications (and supposedly speed)
with significantly better compression.
It's also possible to encode a whole string like that, if it
appears frequently enough.
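The byte-coding idea can be sketched like this (a hypothetical skeleton; EncodeBit stands for one arithmetic coding step and is assumed to exist elsewhere):

```cpp
#include <cstdint>
#include <functional>

// Spend one adaptively coded flag bit on "is this byte the frequent
// value?". If half the bytes match, they each cost a single coding
// step, freeing the multiplication budget for richer modelling of
// that one step; the rest fall back to plain bitwise coding.
void EncodeByte(uint8_t byte, uint8_t frequent,
                const std::function<void(int, int)>& EncodeBit) {
  int is_freq = (byte == frequent);
  EncodeBit(0, is_freq);              // flag bit, with its own context
  if (!is_freq)
    for (int i = 7; i >= 0; --i)      // remaining bytes: 8 more steps
      EncodeBit(1 + i, (byte >> i) & 1);
}
```

With a 50% hit rate the average is about 5.5 coding steps per byte instead of 8, which is the budget argument made above.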

> (comparisons = bit positions, multipliers make room for the number of quantization
> levels per quantization input);

Exactly.

> Maybe i should ask, what kind of data you work with?

That doesn't really matter... it's fairly generic.
These p, q, r, col could be replaced with the distance to the last context occurrence,
an escape count, the sum of bits in the previous byte and the last 3 bits of the current
data byte's offset, and after optimization it would work pretty well.

And there are different kinds of data I work with, mostly originating from
different kinds of psychoacoustic functions of audio samples.

> > Seems like a misunderstanding again...
>
> What i wanted to point out here is that a context model and
> a sse map are implemented in the same fashion (a mapping of
> a context to a bit model). So after optimizing the counters
> of the model, i can optimize the sse parameters. A sse
> implementation can have more parameters (like for
> interpolation, etc...) but atm i'm just focusing on
> optimizing bit models, so these are the only degree of
> freedom in my sse implementation.

And still I don't understand: are you measuring the entropy of the counter
or of the counter mapping for optimization?
