# Thread: Direct reconstruction of correlations with context?

1. ## Direct reconstruction of correlations with context?

It turns out that we can cheaply estimate a density as a linear combination: rho = sum_j a_j f_j.
If the f_j are chosen as an orthonormal family of functions, L2 optimization leads to simply a_j = the average of f_j over the sample:

a_j = (1/n) * sum_k f_j (x^k)

where n is the size of the data sample and the x^k are sample points from R^d. The error of the estimated coefficients drops as 1/sqrt(n).
Sketch of derivation: smoothen the sample with an epsilon-width Gaussian kernel, perform mean-square optimization, then take epsilon -> 0, which simplifies the calculations and turns out to give the best estimate: https://arxiv.org/pdf/1702.02144 The attached image contains some examples. This somewhat optimal, yet surprisingly inexpensive method also works well for multidimensional densities - it literally allows us e.g. to hierarchically reconstruct correlations in a data sample: first model the probability density of each variable separately, then add corrections from correlations between pairs of variables, and so on for correlations of a growing number of variables.
This approach is perfect e.g. for the missing-data case, when we don't know all coordinates: to model given correlations we can use all data points having all these coordinates ... then use the obtained density (with the included correlations) e.g. for imputation of the missing coordinates: https://arxiv.org/pdf/1804.06218
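A minimal numeric sketch of the estimator above (my own illustration; I assume the orthonormal basis is shifted Legendre polynomials on [0,1], matching the paper's setting, and use a toy Beta-distributed sample):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal shifted Legendre basis on [0,1]: f_j(x) = sqrt(2j+1) * P_j(2x-1)
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2.0 * x - 1.0, c)

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=20000)      # toy 1D sample on [0,1]

m = 6                                        # maximal polynomial degree
a = np.array([f(j, sample).mean() for j in range(m + 1)])  # a_j = (1/n) sum_k f_j(x^k)

xs = np.linspace(0.0, 1.0, 201)
rho = sum(a[j] * f(j, xs) for j in range(m + 1))           # estimated density
```

Since f_0 = 1, the coefficient a_0 is always exactly 1 (normalization), and each further a_j is estimated with error dropping as 1/sqrt(n).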

I thought that maybe such an approach could be useful for PPM/CM range data compressors?
In addition to the standard p[context] tables, it allows us to directly model correlations between the current symbol and chosen symbols from the context.
Using the evidence (the past), we can evaluate the importance of given correlations, successively growing the number of essential correlations to use, and adapting their coefficients.

What do you think about it?
Have such approaches been considered?

ps. A simple Mathematica 2D interactive demonstration - you can add points and see the change of the estimated density:
http://demonstrations.wolfram.com/ParametricDensityEstimationUsingPolynomialsAndFourierSeries/

Slides: https://www.dropbox.com/s/7u6f2zpreph6j8o/rapid.pdf

3. Also zero answers on reddit - suggesting this is a really unknown approach ... let me try one more time to convince you that it is interesting ... fresh version: https://arxiv.org/pdf/1804.06218v2.pdf

We have some variables/observations, and the method allows us to cheaply and independently model various types of their correlations:
- correlations for a subset C of variables allow us to conclude about any of these variables from the remaining |C|-1 (correlations allow inference in any direction),
- for each such C we additionally have the freedom of choosing a basis: e.g. the first coefficient (j=(1,1)) denotes that growing the first variable grows or shrinks the second variable; the second coefficient (j=(1,2)) analogously concentrates or spreads the second variable, etc.,
- finally, they form a complete basis: for a large sample they could theoretically recreate the exact joint probability distribution the sample comes from,
- while the method was designed for continuous distributions, it will also work well for discrete ones.
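To make the pairwise case concrete, here is a sketch (my own toy example, not from the paper; shifted Legendre basis assumed) of estimating the mixed coefficients a_(j,k) = average of f_j(x) f_k(y), whose signs carry the interpretations listed above:

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal shifted Legendre basis on [0,1]
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2.0 * x - 1.0, c)

rng = np.random.default_rng(1)
x = rng.uniform(size=50000)
y = np.clip(x + rng.normal(0.0, 0.1, size=x.size), 0.0, 1.0)  # positively correlated pair

# pairwise coefficients A[j,k] = average of f_j(x) f_k(y) over the sample;
# A[1,1] > 0 is the j=(1,1) case: "growing the first variable grows the second"
deg = 3
A = np.array([[(f(j, x) * f(k, y)).mean() for k in range(deg + 1)]
              for j in range(deg + 1)])
```

For this positively correlated pair, A[1,1] comes out clearly positive, while A[0,0] = 1 is just normalization.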

For data compression we are interested in modeling correlations between the currently predicted symbol and the context: we use only bases with C always containing the current symbol, implying its probability distribution from the context.
We can independently model various types of correlations, evaluating their importance - e.g. start with low order correlations, and successively grow the space of considered correlations.
Optimal coefficients are just averages of the given function over the sample, which for the data compression application can be computed adaptively:

a_f = (1-lambda) * a_f + lambda * f(x)

for some learning coefficient lambda.
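A minimal sketch of this exponential-moving-average update (the function name and lambda value are my own choices):

```python
def update(a_f, fx, lam=0.05):
    # adaptive version of the batch average: a_f <- (1 - lambda) * a_f + lambda * f(x)
    return (1.0 - lam) * a_f + lam * fx

# for a stationary source the running estimate converges to the plain average
a = 0.0
for _ in range(2000):
    a = update(a, 1.0)
```

For stationary data this converges to the same coefficient as the batch average; for non-stationary data the exponential decay lets old observations fade out.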
Here are some first functions of the basis for two variables. In practice we can choose which correlations to include in our model - between chosen pairs of variables, triples, quadruples and so on (being more selective than just p[context]).

ps. StackExchange thread about using it as artificial neurons: https://stats.stackexchange.com/ques...redicting-from

4. Originally Posted by Jarek For data compression we are interested in modeling correlations between the currently predicted symbol and the context
I think we are interested in the mutual information between the currently predicted symbol and the context(s). i.e. how much given context(s) tells us about the next symbol.

When the mutual information is near 0 (the context seemed to be independent from the next symbol in the past), then the weight of the symbol prediction coming from this context will be 0 in the "average" of predictions (it will be ignored).
During data compression a context is thought to be useful (and will adaptively gain more weight with time) when the mutual information is large enough.

Def: Mutual information is a distance between two probability distributions. Correlation is a measure of linear dependence between two random variables.
Remark: The concept of mutual information is linked to "bits" and "entropy" and so it is linked to "data compression".
My question is: why "correlation"?

5. Mutual information is crucial for error correction - it is the number of bits per symbol we can send through a noisy channel (maximized over the probability distribution of sent symbols): https://en.wikipedia.org/wiki/Channel_capacity

In data compression we need to estimate the probability distribution of symbols (based on those already decoded), minimizing the number of bits produced: sum_t lg(1/Pr(s_t)) over the symbol sequence {s_t}, where Pr(s_t) is the probability estimated at time t. The more focused the distribution, the better the compression.
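This cost is easy to sketch in code (a toy helper of my own, not from any compressor):

```python
import math

def code_length_bits(probs):
    # total ideal code length: sum_t lg(1 / Pr(s_t)) over the symbol sequence,
    # where probs[t] is the probability the model assigned to the symbol that occurred
    return sum(math.log2(1.0 / p) for p in probs)
```

For example, 8 bits each predicted with Pr = 0.5 cost 8 bits in total, while sharpening every prediction to Pr = 0.9 drops the cost to about 1.2 bits - exactly the "more focused distribution, better compression" effect.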

The summer break is close, when I plan among other things to test the above method e.g. for lossless image compression:
- we scan line by line, and encode the residue (error) of a local prediction,
- use the CDF (cumulative distribution function), probably of a Laplace distribution, to rescale residues to the [0,1] range with a nearly uniform distribution,
- however, there are correlations with the context (neighbors) to be modeled with the above correlation decomposition - getting 1D densities for the predicted residue, based on the current context.

Without the context it would be a nearly uniform distribution; the question is how much local correlations will allow us to focus this distribution?
The more, the better the improvement in compression ratio ...
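For the residue-rescaling step, the Laplace CDF has a simple closed form; a sketch (the scale b is an arbitrary placeholder of mine, in practice it would be estimated from the data):

```python
import math

def laplace_cdf(r, b=10.0):
    # CDF of a zero-mean Laplace distribution with scale b; maps a residue r
    # to (0,1), nearly uniformly if the residues really are Laplace(0, b)
    if r < 0:
        return 0.5 * math.exp(r / b)
    return 1.0 - 0.5 * math.exp(-r / b)
```

Applying this monotone map to each coordinate gives the ~uniform marginals on [0,1] that the polynomial fit assumes.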

Comparing to JPEG LS ( https://en.wikipedia.org/wiki/Lossless_JPEG ) - while it separately models behavior for 365 contexts, here we directly fit a (3D) polynomial to this entire behavior, interpolating and smoothing between those (365) contexts ... and can predict distributions more complex than a centered Laplace ...

6. Originally Posted by Jarek Mutual information is crucial for error correction.
Yes, mutual information is important in many fields.
But how does that answer my question? I'm puzzled.

Originally Posted by Jarek In data compression we need to estimate probability distribution of symbols (basing on already decoded), minimizing number of bits to produce: sum_t lg(1/Pr(s_t)) over symbol sequence {s_t}, where Pr(s_t) is estimated probability in time t.
Yes. Also true. But since your first two posts are about "contexts" and "correlation" it should be then:

"In data compression we need to estimate the probability distribution of symbols in their contexts."

But this, too, does not answer my question. I'm puzzled. Maybe I'm a bit tired (and that would certainly answer all my questions ).

You wrote that you are interested in the "correlation" between a "context" and the next symbol.
What do you mean by "context"? You mentioned Prediction by Partial Matching and Context Mixing. So I suppose you define "context" in a text file as the characters preceding the to-be-predicted symbol. Such preceding strings could be "cat" or "dog". A symbol seen in the past after such strings (contexts) could be a " " (space) or an "s".
My question is: how do you define "correlation" in this case?

7. Ok, you are right - from this point of view we are interested in minimizing the conditional entropy:
H(current | context) = H(current) - I(current ; context)
using (maximized) mutual information.
However, we cannot directly influence this (idealized) mutual information - in practice this optimization just means finding the best estimate of rho(current | context).
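The identity H(current | context) = H(current) - I(current ; context) can be checked numerically on a toy joint distribution (my own example, any valid joint table works):

```python
import math
import numpy as np

def H(q):
    # Shannon entropy in bits of a (joint or marginal) distribution
    return -sum(v * math.log2(v) for v in np.ravel(q) if v > 0)

p = np.array([[0.3, 0.1],       # toy joint Pr(current, context)
              [0.1, 0.5]])
p_cur = p.sum(axis=1)           # marginal of the current symbol
p_ctx = p.sum(axis=0)           # marginal of the context

I = H(p_cur) + H(p_ctx) - H(p)  # mutual information I(current ; context)
H_cond = H(p) - H(p_ctx)        # conditional entropy H(current | context)
```

Here I > 0, so conditioning on the context strictly reduces the bits needed per symbol.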

In the approach I am discussing here, we want to model joint distribution:
rho(current,context)
as a fitted polynomial (after transforming all coordinates to [0,1] using an approximated CDF, so that each independently has a ~uniform distribution on [0,1]).

For this fitting we use an orthonormal basis as in the image above, thanks to which fitting is very simple: the coefficient of a function f is just the average of f over the sample.
Such coefficients are improved adaptively with the processed symbols (a_f += (f(x) - a_f) * rate), returning the conditional distribution for the current symbol:
rho(current | context)
as 1D polynomial, used for entropy coding.
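A minimal sketch (my own illustration, not code from the paper) of how the joint-polynomial fit yields this conditional: for two variables with joint rho(x,y) = sum_{j,k} A[j,k] f_j(x) f_k(y), substituting the known context y gives a 1D polynomial in x, normalized through its 0th coefficient (f_0 = 1 and every higher f_j integrates to 0 over [0,1]):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal shifted Legendre basis on [0,1]
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2.0 * x - 1.0, c)

def conditional_density(A, ctx, xs):
    # rho(x | y=ctx) = sum_j b_j f_j(x) / b_0,  where b_j = sum_k A[j,k] f_k(ctx)
    b = A @ np.array([f(k, ctx) for k in range(A.shape[1])])
    rho = sum(b[j] * f(j, xs) for j in range(len(b)))
    return rho / b[0]

# toy coefficients: a_(1,1) = 0.5 encodes "growing context grows current"
A = np.zeros((2, 2)); A[0, 0] = 1.0; A[1, 1] = 0.5
xs = np.linspace(0.0, 1.0, 5)
rho = conditional_density(A, 0.9, xs)   # mass shifted toward high values
```

With a high context value the resulting 1D density tilts toward high current values, exactly the f11 interpretation below.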

The difference from standard context modelling like Markov is that those approaches try to tabulate behavior based on the context - modelling contexts independently: "behavior[context]".
In contrast, here we model entire rho(current,context) as polynomial - smoothing/interpolating between different contexts.
By correlation types I mean the polynomials of the orthonormal basis as in the image above - describing correlations between the variables of C, using polynomials of growing degree, each with a concrete interpretation, e.g.:
f11: with growth of the first variable there is an increase or decrease of the second,
f12: with growth of the first variable there is a focusing or spreading of the second, etc.

8. Thank you for clearing that up!
"influence" -> you intended to write "infer"?

Originally Posted by Jarek The difference from standard context modelling like Markov is that they try to table behavior based on context - modelling contexts independently "behavior[context]".
In contrast, here we model entire rho(current,context) as polynomial - smoothing/interpolating between different contexts.

The joint probability distribution *is* modelled in CTW/DMC/PPM/CM, but not with polynomials.
CTW does context tree weighting - probably it has the most similarities to your idea.
DMC does context (state) cloning when a threshold is hit.
PPM could use escapes or information inheritance and several other tricks.
A CM compressor usually uses a mixer to estimate the joint probability, and will adjust the weights ("coefficients" as you write) of models and/or contexts as compression progresses.

Using polynomials for estimating a joint probability distribution: there have been such approaches. See for example "Estimating Multivariate Discrete Distributions Using Bernstein Copulas". (Just googled it, haven't read it.) For data compression... I don't know. Probably not optimal. But it's just a feeling.

9. I did mean influence: some perfect rho(current | context) is already hidden in the data, determining the theoretical mutual information.
We can only estimate it - the better the estimate, the closer we are to this theoretical bound (usually unknown).

I know CTW/DMC/PPM/CM - they usually use some "behavior[context]" tables - modelling contexts independently.
A polynomial allows us to smooth/interpolate between contexts - directly using the fact that similar contexts should have similar behavior.
It probably won't work for text, but it can for numerical data like in image/video compression - I will probably try it out in a few weeks.

Thanks for the Bernstein Copulas (some other paper) - at first look it seems like a normalization of the multidimensional CDF.
The question is whether it can be inexpensively fitted to data and used for predictions? I will have to take a closer look ...
Using density as a linear combination of orthonormal polynomials with L2 optimization is (relatively) inexpensive for fitting and prediction, and also allows flexibility like changing the basis of considered correlations.

10. Originally Posted by Jarek I know CTW/DMC/PPM/CM - they usually use some "behavior[context]" tables - modelling contexts independently.
Polynomial allows to smoothen/interpolate between contexts - directly using the fact that similar context should have similar behavior.
I feel you are mistaken here. Or I completely misunderstand what you mean.

CTW, for example, stores counts in shorter and longer contexts, and estimates the joint probability by weighting the predictions of these contexts.
"The CTW algorithm is an “ensemble method,” mixing the predictions of many underlying variable order Markov models, where each such model is constructed using zero-order conditional probability estimators." from https://en.wikipedia.org/wiki/Context_tree_weighting
Context Mixing compressors, too, use mixing (as the name implies).

They do not just "select the current context" using a table lookup (this is what I think you think); they select the contexts that are in effect and they mix them in a smart way, thus estimating the joint probability.

See? There is mixing involved. It is an interpolation/smoothing if you will.

11. Sure, e.g. CM can mix predictions from various contexts; however, each context is still basically modeled independently.
In contrast, fitting the joint distribution as a polynomial directly interpolates between these contexts.

For example, if the context is the single last byte and until now only contexts "2" and "4" have been seen, context mixing will not know what to do with the new context "3" (escape symbol ...), while the presented approach will interpolate between the known contexts here.
All observations modify the global fit - the behavior for every context.
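A toy demonstration of this interpolation (entirely my own construction: numeric contexts rescaled to [0,1], a shifted-Legendre basis assumed, and a training set that never contains contexts near 0.5):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal shifted Legendre basis on [0,1]
    c = np.zeros(j + 1); c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2.0 * x - 1.0, c)

rng = np.random.default_rng(2)
ctx = rng.uniform(size=120000)
ctx = ctx[(ctx < 0.45) | (ctx > 0.55)]        # contexts near 0.5 never occur in training
cur = np.clip(ctx + rng.normal(0.0, 0.05, size=ctx.size), 0.0, 1.0)

deg = 4
A = np.array([[(f(j, cur) * f(k, ctx)).mean() for k in range(deg + 1)]
              for j in range(deg + 1)])       # fitted joint rho(cur, ctx)

def cond(y, xs):
    # conditional density rho(cur | ctx=y), normalized via the 0th coefficient
    b = A @ np.array([f(k, y) for k in range(deg + 1)])
    return sum(b[j] * f(j, xs) for j in range(deg + 1)) / b[0]

xs = np.linspace(0.0, 1.0, 101)
rho = cond(0.5, xs)                           # query the never-seen middle context
```

The globally fitted polynomial still assigns a density concentrated near 0.5 for the never-seen middle context - the interpolation that a per-context table cannot provide.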

Such interpolation is forbidden for text, but can be valuable for numerical data like in image/video compression.

12. Aha! I see now. The "2"-"4" example made it click.

So contexts in this case are a bit different from what we usually call a "context". You intend to interpolate between the "contexts" identified by numerical values, and not just between the probabilities emitted by these contexts/models. For that to work you are restricted to contexts that are orderable - as you also pointed out.

Side note:
Paq8px does not have a numerical prediction model for text files, yet. So orderable values in enwik8, like the revision ids ("32899315") the timestamps ("2005-12-27T18:46:47Z") in the xml and years ("1864") in the articles are all predicted by "normal" models (I have to say, quite well). By predicting these strings bit by bit it "feels" the densities of these strings as it advances in the string during compression. At the first bits the context is dense, but at the last bits it becomes sparse. Paq8px does not represent these data as numbers, and so it cannot perfectly interpolate between them, but still it can do the prediction very well.

Representing such orderable strings by numbers (of course in a bijective way) could work very well for density estimation. Here's my success story for encouragement: I represented the revision ids and timestamps by numbers in enwik8 and hierarchically predicted their densities in my Fastest and smallest enwik8 skeleton (de)compression (for fun) project. So it is a proof that (preprocessed) text files may also benefit from your idea and that density estimation helps compression. (Though I did not use polynomials - I'm sure they would help even more.)

Well, see you in a couple of weeks then. And good luck with your plan! 13. ## The Following User Says Thank You to Gotty For This Useful Post:

Jarek (6th June 2018)

14. Fitting a polynomial to predict numerical values is a standard technique - in 1D the predictors for successive polynomial degrees are:
v[x] ~ v[x-1]
v[x] ~ 2 v[x-1] - v[x-2]
v[x] ~ 3 v[x-1] - 3 v[x-2] + v[x-3]
v[x] ~ 4 v[x-1] - 6 v[x-2] + 4 v[x-3] - v[x-4]
In 2D the simplest e.g. in JPEG LS (generally: Lorenzo predictor) is
v[x,y] ~ v[x-1 , y] + v[x, y-1] - v[x-1, y-1]
We use such a predictor, then encode only the residue (difference), usually modeled with a Laplace distribution.
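Both predictor families are one-liners; a quick sketch (toy data, my own helper names) showing that they are exact on data that is polynomial of the matching degree:

```python
import numpy as np

# 1D: v[x] ~ 2 v[x-1] - v[x-2] is exact when v is linear in x
seq = np.arange(10) * 7 + 2
pred = 2 * seq[5] - seq[4]          # predicts seq[6]

def lorenzo(img, y, x):
    # 2D Lorenzo predictor: v[x,y] ~ v[x-1,y] + v[x,y-1] - v[x-1,y-1]
    return int(img[y, x - 1]) + int(img[y - 1, x]) - int(img[y - 1, x - 1])

# exact on a plane (values linear in both coordinates): residue = 0
img = np.add.outer(np.arange(8) * 3, np.arange(8) * 5)
residue = int(img[4, 4]) - lorenzo(img, 4, 4)
```

On real images the residue is of course nonzero, and its distribution is what gets modeled (classically as a Laplace).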

What is new here is that we predict not only the value, but the entire probability distribution for the given context.
For example, when used to encode residues from some predictor, the context could imply e.g. how to shift the center, change the width, or apply corrections to the density using further polynomials.

15. Originally Posted by Jarek For example if context is single last byte and until now there was only seen context "2" or "4", context mixing will not know what to do for new context "3" (escape symbol ...), while presented approach will interpolate between known contexts here.
That would be true for PPM (and even there we can use information inheritance to populate new sparse contexts); in bitwise CM we predict bit by bit, so that is not an issue.

Originally Posted by Gotty Paq8px does not have a numerical prediction model for text files, yet. So orderable values in enwik8, like the revision ids ("32899315") the timestamps ("2005-12-27T18:46:47Z") in the xml and years ("1864") in the articles are all predicted by "normal" models (I have to say, quite well). By predicting these strings bit by bit it "feels" the densities of these strings as it advances in the string during compression. At the first bits the context is dense, but at the last bits it becomes sparse. Paq8px does not represent these data as numbers, and so it cannot perfectly interpolate between them, but still it can do the prediction very well.
The new text model detects monotonically increasing or decreasing numerical sequences, I just haven't used that for any context. The timestamps are detected by the XML model, but again aren't really used. On most actual textual files that is not needed, and I don't like overfitting models just for a single file, but I did prepare the models to allow it if anyone wants to do it.

Originally Posted by Jarek Fitting polynomial to predict numerical values is a standard technique - in 1D the predictors for succeeding polynomial degrees are:
v[x] ~ v[x-1]
v[x] ~ 2 v[x-1] - v[x-2]
v[x] ~ 3 v[x-1] - 3 v[x-2] + v[x-3]
v[x] ~ 4 v[x-1] - 6 v[x-2] + 4 v[x-3] - v[x-4]
In 2D the simplest e.g. in JPEG LS (generally: Lorenzo predictor) is
v[x,y] ~ v[x-1 , y] + v[x, y-1] - v[x-1, y-1]
We use such predictor then encode only residue (difference), usually using Laplace distribution.

What is new here is that we predict not only value, but entire probability distribution for given context.
For example if using to encode residues from some predictor, it could imply e.g. how to shift the center basing on the context, change width, or apply correction to density using further polynomials.
All of these are used in paq8/cmix, along with dozens more. We don't encode residues, we use the predictions themselves as contexts.

17. Originally Posted by Jarek ...for succeeding polynomial degrees are...
Aren't they all linear (i.e. degree=1)?

Originally Posted by Jarek What is new here is that we predict not only value, but entire probability distribution for given context.
In my enwik8 examples I simply used the first some bits of the current string when encoding the next bit. You are right, that is not a continuous *context*. That is just about encoding a *value* from a continuous distribution.
Let me correct myself and find a better example. In enwik8 (again) the timestamp and revision id values could both be thought of as values from a continuous distribution. Important to know: the revision id grows with time and obviously the timestamp grows with time. The revision id is encoded first, then the timestamp (they are in that order in the file). When we have the revision id, that is a really strong (but really strong!) context for the timestamp. We can't use the revision id as a direct (discrete) context as it is ((ok, we could use the first few bits or bytes)), because it is very sparse. Instead we use the revision id as a continuous context for predicting the continuous value, the timestamp. Did I catch it better this time?

18. Originally Posted by mpais The timestamps are detected by the XML model.
Oh, I'll look into it when I have some time. Thanks for mentioning!

19. Originally Posted by mpais That would be true for PPM (and even there we can use information inheritance to populate new sparse contexts), in bitwise CM we predict bit by bit, so that is not an issue.
Ok, so let's look at lossless image compression from both perspectives.
Contexts are a few neighboring pixels (let's say bytes), and for you also the value of a predictor (or a few). Each context independently predicts the current bit (thresholding?), then CM weights these predictions - I would say this is some heuristic approximation of reconstructing pairwise correlations between pixels.

In contrast, fitting a polynomial to the joint distribution, we have certainty of approaching the real probability distribution (not only thresholding) - especially since for image compression the evidence is huge (thousands/millions of points) - the found coefficients are practically accurate.
We can start by recreating pairwise correlations like above, but then also add corrections from correlations between triples, quadruples (of pixels) - which seems very tough for CM, with its weighting between independently considered contexts.

Aren't they all linear (i.e. degree=1)?
No, we use an orthonormal basis of polynomials - which is complete: growing the size of the basis, we can approximate any density as closely as we want.
Sure, usually we start with linear polynomials for each coordinate (the marked violet region in the image above), then add degree 2 polynomials (the entire size-8 basis above), degree 3 and so on ...
We can freely modify this basis, estimating coefficients - searching for the essential correlations in our data sample, and using them for density estimation.

Regarding compression of time series (like timestamps), I would also start with some predictor and encode residues - the probability distribution of which can again be estimated by the above correlation decomposition (with the last time differences) - searching for the essential correlation types and using them to estimate the current density of the residue.

20. Originally Posted by Jarek Each context independently predicts the current bit (thresholding?), then CM weights these predictions - I would say this is some heuristic approximation of reconstructing pairwise correlations between pixels.
No heuristics and no threshold - at least in the paq world. In the paq series it is kind of a gradient descent. All predictions of all contexts are mixed. Better models with better contexts (better predictions) will gain more weights as compression advances. Originally Posted by Jarek We can start with recreating pairwise correlations like above, but then also add corrections from correlations between triples, quadruples (of pixels) - what seems very tough for CM: only weighting between independently considered contexts.
Contexts in Paq8px are not just individual pixels, but triples, quadruples and more. By "more" I don't just mean more pixels, but differences, (weighted) averages, most significant bits, etc. It is very sophisticated. Márcio probably will tell more about it. So CM does what you have in mind. Of course these contexts are fixed, preprogrammed (there is no search), but it looks like by mixing them we find a really good approximation for the joint probability.

Originally Posted by Jarek In contrast, fitting polynomial to joint distribution, we have certainty of approaching the real probability distribution.
I'm a bit lost (again). Are you planning to process the input file byte by byte as a stream, or are you preprocessing the entire file, finding the best coefficients and then encoding? All the above-mentioned types of compressors (CTW/DMC/PPM/CM) do the former; I suspect that you are planning the latter. Is this the case? In the latter case you'll need to store the coefficients somehow, and that could be costly. When you want to achieve more precision (better prediction) you'll need to use more coefficients, and then you'll need to store more coefficients.

Originally Posted by Jarek (with last time differences)
The articles in enwik8 are not in time order, so the timestamps are not in order. The last value is not a good predictor; the difference is jumpy. That's why you need the revision id as a context to predict the timestamp. That's an important piece of information, I should have mentioned it.

21. Originally Posted by Gotty No heuristics and no threshold - at least in the paq world. In the paq series it is kind of a gradient descent. All predictions of all contexts are mixed. Better models with better contexts (better predictions) will gain more weights as compression advances.
(Stochastic) gradient descent is only a learning method using multiple small steps - to get the best coefficients.
In contrast, L2 fitting of a polynomial allows us to directly get the final coefficients in a single step: the coefficient for a function f (from the orthonormal basis) is just the average of f over the sample: (1/n) * sum_i f(x_i).

Context mixing means a separate prediction for each context, then weighted (logistic) mixing of their predictions.
Ok, you can use multiple neighboring pixels as a single context; the hard question is how it is modeled? Tabled behavior for the sequence of their most significant bits? (that's very approximate).
This question seems tough to handle accurately with standard methods, but becomes relatively simple with polynomial fitting: we can get as close as we want to the real joint probability distribution they come from (rho(current, context)).

Regarding static vs. adaptive models, it can be done both ways, but for image compression static seems more appropriate: the rules are usually universal over the entire picture, possibly requiring conditioning on different types of regions - which should be included inside the modeled correlations. Adaptation is not fast enough for distinguishing regions - e.g. a person from the background.
Sure, static requires a header with coefficients, but I don't think a large number of correlation types will be crucial - dozens, maybe up to a thousand coefficients, which can be quantized and entropy coded, or maybe found universal (default) e.g. for photography.
Using polynomial coefficients to describe correlations is a much more compact way (unique optimum) than weights of a neural network (gigantic freedom).

I plan basic testing: sum_t lg(1/Pr(s_t)) bits for various ways of prediction, static model. We will see how it goes (in a few weeks).

22. Originally Posted by Gotty Contexts in Paq8px are not just individual pixels, but triples, quadruples and more. By "more" I don't just mean more pixels, but differences, (weighted) averages, most significant bits, etc. It is very sophisticated. Márcio probably will tell more about it. So CM does what you have in mind. Of course these contexts are fixed, preprogrammed (there is no search), but it looks like by mixing them we find a really good approximation for the joint probability.
As briefly explained here, the image models I made for paq8px are designed to handle a lot of different image types. If you guys want, I'll elaborate more on what they're doing.

Originally Posted by Gotty I'm a bit lost (again). Are you planning processing the input file byte by byte as a stream or are you preprocessing the entire file, finding the best coefficients and then encode? All above mentioned types of compressors (CTW/DMC/PPM/CM) do the former, I suspect that you are planning the latter. Is this the case? In the latter case you'll need to store the coefficients somehow and that could usually be costly. When you want to achieve more precision (better prediction) you'll need to use more coefficients and then you'll need to store more coefficients.
See here (Fast Autocorrelated Context Models for Data Compression, patented). Now compare the listed results for rafale.bmp to the results from paq8/cmix/EMMA.

Originally Posted by Jarek Ok, you can use multiple neighboring pixels as single context, the hard question is how it is modeled? Tabled behavior for sequence of their most significant bits? (it's very approximated).
This question seems tough to do accurately with standard methods, but becomes relatively simple to do well with fitting polynomial: we can get as close as we want to the real joint probability distribution they come from (rho(current, context)).
We mainly use finite-state-machine counters and other adaptive maps with exponential decay. For fsm counters, a context maps to an 8 bit state per input bit, which is then mapped to a prediction. This prediction is then in turn adapted based on coding cost multiplied by an exponential decay rate.
The sets of mixing weights to use are also selected based on contexts designed to select for common correlations, for even better adaptability.

Originally Posted by Jarek Regarding static or adaptive model, it can be done in both ways, but for image compression static seems more appropriate: the rules are usually universal over the entire picture, possibly requiring conditioning to different types of regions: it should be included inside the modeled correlations. Adaptation is not fast enough for distinguishing regions - e.g. a person from the background.
There's a lot of similar research in the literature. See MRP (Minimum Rate Predictors), the previous state-of-the-art grayscale image codec.

24. Thank you, I will have to look closer (in a few weeks), but I got the message that approaching the state of the art would probably need years of work ... so maybe it's better to focus on different applications of this hierarchical correlation reconstruction - there are many of them, e.g. missing data.

In data compression, lossy image compression also seems promising: for modelling the joint distribution of neighboring transform coefficients ...

25. As promised, thanks to the summer break, I have started working on lossless image compression, for example concluding that the LOCO-I predictor is not so great for photos - we can do much better e.g. with just linear predictors; MSE gradient optimization of parameters leads to approximately:
v[x,y] ~ 0.8 v[x-1,y] - 0.3 v[x-1,y-1] + 0.2 v[x,y-1] + 0.3 v[x+1,y-1]
What are the current state-of-the-art predictors?

We subtract the predictor; the difference approximately follows a Laplace distribution. LOCO-I splits contexts into 365 bins and for each independently estimates just the width of the Laplace.
Here, instead of splitting into bins, we use all points to optimize the coefficients of a globally fitted polynomial as the joint distribution of (current value, context). As in the plots below, densities predicted this way are much more complex than just the width of a Laplace.
For a few reasons (like handling negative probabilities), to use it in image compression we would rather need to work bitwise, treating already decoded MSBs as part of the context - a lot of experimentation is needed to do it right, then a well-optimized implementation and huge competition ... it is a bit overwhelming, so I am leaving it for now, but I can help if someone is interested.
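Regarding the negative probabilities mentioned above: one simple workaround (my own assumption, not a method from the papers) is to clamp the estimated density to a small positive floor and renormalize:

```python
import numpy as np

def calibrate(rho, eps=1e-3):
    # clamp negative values of an estimated density on [0,1] to a small
    # positive floor eps (an arbitrary choice), then renormalize so it
    # integrates to ~1 again (mean over a uniform grid on [0,1])
    rho = np.maximum(rho, eps)
    return rho / rho.mean()

xs = np.linspace(0.0, 1.0, 201)
rho = 1.0 - 1.5 * np.cos(2.0 * np.pi * xs)   # toy estimate dipping below zero
rho_fixed = calibrate(rho)
```

This keeps entropy coding well-defined (every symbol gets a positive probability) at the cost of slightly flattening the estimate.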

Instead, I have written a short paper about using it for (e.g. financial) time series: https://arxiv.org/pdf/1807.04119
Here we can download a ~30000-point sequence of Dow Jones daily averages (100+ years); it seems tough to get a better prediction of a given value than just the previous value (?)
However, it turns out that we can tell a lot about its probability distribution from the context.

Here are some predicted densities using a length-5 context and a degree-5 polynomial (top), and a comparison of predicted densities for various orders of prediction (bottom) - without the context it would be rho ~ 1 (after normalization of variables); here it is usually essentially better. I wonder how it would compare with probability densities predicted by other methods like context mixing?
In contrast to neural networks, here each coefficient has an independent, unique value and a specific cumulant-like interpretation.
The number of coefficients (|B|) could be increased by a few orders of magnitude (given a better implementation); most of them are tiny - all but a few percent of the coefficients can be neglected.

Unfortunately it sometimes leads to negative densities: e.g. when fitting a parabola to a density localized at two points, the parabola has to be negative between them.
Negative densities should be interpreted as low positive ones - any ideas how to translate this well?

26. While it seems too costly for data compression (?), the above approach also works nicely for predicting the probability distribution of multidimensional time series (e.g. RGB in image compression) - fresh draft: https://arxiv.org/pdf/1807.11743 (for yield curve parameters)
For example, here are (overfitted) predictions of (x1, x2) based on a context of one (left) or two (right) previous pairs - they come from fitting a degree-9 polynomial to the joint distribution of 4 values (10000 coefficients) or 6 values (a million coefficients). Anyway, data compression offers sophisticated tools for predicting probability distributions - maybe let's try to use them in other fields where time series analysis is crucial?
Has anybody tried to get some prediction for the above Dow Jones series? Context mixing?
Reddit discussion: https://old.reddit.com/r/MachineLear...n_of_value_in/

Usually some primitive ARIMA-like models are used ( https://en.wikipedia.org/wiki/Autore...moving_average ): the predictor is a linear combination of previous values, and the error/residue is Gaussian, usually of constant width ... while the data shows it should rather be a Laplace distribution (as data compression people know well) ... and generally we can get a much better prediction of the probability distribution.

27. One more important addition to https://arxiv.org/pdf/1807.04119 - adaptivity for non-stationary time series.
As a_f is the average of f(x) over the sample, we can instead use an average over past values with exponentially decaying weights, as in data compression:
a_f -> lambda * a_f + (1-lambda) * f(x)
for some learning rate lambda.
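In code, one adaptation step is just this one-liner (a sketch; lam close to 1 means slow forgetting):

```python
# Minimal sketch of the adaptive update above: each coefficient a_f becomes an
# exponential moving average of f(x) over past values, instead of a plain
# average over the whole sample.
def update_coeff(a_f, f_x, lam):
    """One adaptation step: a_f -> lambda * a_f + (1 - lambda) * f(x)."""
    return lam * a_f + (1.0 - lam) * f_x
```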
Here (on the left) is the evolution of the probability distribution - the top two plots use such adaptation for the Dow Jones series. The bottom one is a bit different - there the density of the (t, x_t) sequence is estimated as a polynomial.
It corresponds to the red curve in the time evolution of the first four used coefficients (right plots), which have an interpretation similar to cumulants.
So it is a kind of interpolation in time+space - the polynomial is described here by 10 x 10 = 100 coefficients, and it is smoother and agrees better with the local probability distribution: the orange adaptive curve is very noisy (lambda = 0.999), while the blue one (lambda = 0.9997) is delayed compared to the red one.

It suggests that maybe static-adaptive data compression is worth considering: with the evolution of the probability distribution along the file stored in the header, e.g. as coefficients of a polynomial - like the 100 coefficients for the bottom time+space plot.
For example, for a given context the encoder looks at the occurrences of '0' and '1' along the file, fits a polynomial describing the evolution of their probability (cheap), and stores the polynomial coefficients for each context in the header - the decoder just evaluates this polynomial evolution without statistical modelling ... ?

28. A very nice data compression-like problem: credibility evaluation.
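A hypothetical sketch of this static-adaptive scheme (the function names and the degree are my own choices, not from the thread):

```python
import numpy as np

# Encoder side: fit a low-degree polynomial to the evolution of P('1') for
# one context along the file; only its coefficients go into the header.
def fit_probability_evolution(bits, degree=4):
    t = np.linspace(0.0, 1.0, len(bits))          # normalized position in file
    return np.polynomial.polynomial.polyfit(t, bits, degree)

# Decoder side: re-evaluate the polynomial instead of running adaptive
# statistics; clip to keep a valid probability for entropy coding.
def decode_probability(coeffs, t):
    p = np.polynomial.polynomial.polyval(t, coeffs)
    return np.clip(p, 1e-6, 1.0 - 1e-6)
```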
I recently got a dataset of 37000 households: declared income and a few dozen variables of all types - continuous, discrete, binary.
We need to model the (continuous) probability distribution of income based on the remaining variables (the context).

I think it could be done e.g. with context mixing (?)
The standard philosophy is to model the probability distribution of the oldest bit (left/right half), then of the next bit with the first added to the context, and so on ...
Any suggestions for the details?

I have instead used a basis of orthogonal polynomials (this thread), which allowed modelling each (cumulant-like) coefficient independently.
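One simple way to model each such coefficient independently is a linear regression on the context - a minimal sketch (my own simplified version, not the exact model of the linked paper), assuming the target rescaled to [-1,1]:

```python
import numpy as np
from numpy.polynomial import legendre

# Predict each cumulant-like density coefficient a_j of y as a linear
# function of the context vector x, fitted by least squares.
def fit_coefficient_models(Y, X, degree):
    """Y: (n,) targets in [-1,1]; X: (n, d) context variables.
    Returns a (d+1, degree+1) matrix of regression weights."""
    norms = np.sqrt((2 * np.arange(degree + 1) + 1) / 2)
    F = legendre.legvander(Y, degree) * norms        # f_j(y_i): per-coefficient targets
    X1 = np.hstack([np.ones((len(Y), 1)), X])        # add intercept column
    W, *_ = np.linalg.lstsq(X1, F, rcond=None)
    return W

def predicted_density(W, x, y):
    norms = np.sqrt((2 * np.arange(W.shape[1]) + 1) / 2)
    a = np.concatenate([[1.0], x]) @ W               # predicted coefficients a_j(x)
    return legendre.legval(y, a * norms)
```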
A very simple model - just linear regression for each coefficient - has turned out to work nicely: https://arxiv.org/pdf/1812.08040

29. Finally getting closer to data compression - I found how to properly calibrate (learned from the sample): interpreting the predicted densities (MSE-optimal polynomials) as a real density, including negative values of the polynomial as low positive density. Additionally, I was recently told that there is a classical alternative statistical method: https://en.wikipedia.org/wiki/Quantile_regression
It uses the interesting fact that to estimate the median of y from x we should minimize the mean absolute difference:

argmin_beta \sum_i |y_i - beta.x_i|

Analogously we can estimate other quantiles, finally getting a practical method for modelling the conditional probability distribution of continuous variables, e.g. the 0.05, ..., 0.95 quantiles modeled in http://www.econ.uiuc.edu/~roger/research/rq/rq.pdf . It is more costly than fitting a polynomial (linear programming vs least squares), and it does not directly give a density.
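A toy numeric check of this fact, assuming only a constant model (no regressors):

```python
import numpy as np

# The mean absolute error is minimized by the median; the tilted "pinball"
# loss is minimized by the q-quantile. Here beta is a single constant,
# chosen by brute force over the sample values.
def pinball_loss(y, beta, q):
    r = y - beta
    return np.mean(np.maximum(q * r, (q - 1.0) * r))

def estimate_quantile(y, q):
    candidates = np.sort(y)
    losses = [pinball_loss(y, b, q) for b in candidates]
    return candidates[int(np.argmin(losses))]
```

With regressors x, the same loss with beta a weight vector gives quantile regression; minimizing it becomes a linear program rather than least squares, which is the extra cost mentioned above.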

Anyway, even context mixing has frequency counts at the bottom - the question is whether we could practically use some more sophisticated statistical modelling to exploit dependencies in data compression?
It is done - in a way too complex to follow - in the rising NN autoencoder-based compressors; the above two methods would allow doing it in a controllable way ... the hard question is whether they could compete with the autoencoder-based ones??
