It turns out that we can cheaply estimate a density as a linear combination: rho(x) = sum_j a_j f_j(x).
If the f_j are chosen as an orthonormal family of functions, mean-square (L2) optimization leads simply to a_j being the average of f_j over the sample:
a_j = (1/n) * sum_k f_j(x^k)
where n is the size of the data sample and the x^k are sample points from R^d. The error of such estimated coefficients drops as 1/sqrt(n).
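A minimal numerical sketch of this estimator, using an orthonormalized Legendre basis on [0,1] (my choice for illustration; any orthonormal family works, and the toy Beta sample is just an assumption):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal Legendre basis on [0,1]: f_j(x) = sqrt(2j+1) * P_j(2x-1)
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c)

def fit_coeffs(sample, m):
    # a_j = (1/n) * sum_k f_j(x^k) -- just averages over the sample
    return np.array([f(j, sample).mean() for j in range(m)])

def density(a, x):
    # rho(x) = sum_j a_j f_j(x)
    return sum(aj * f(j, x) for j, aj in enumerate(a))

rng = np.random.default_rng(0)
sample = rng.beta(2, 5, size=10_000)  # toy sample on [0,1]
a = fit_coeffs(sample, 6)
# f_0 = 1, so a_0 = 1 exactly and the estimate integrates to 1
```

Note that the coefficients are independent averages, so they can be updated online as new points arrive.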
Sketch of the derivation: smooth the sample with an epsilon-width Gaussian kernel, perform mean-square optimization, then take epsilon -> 0, which both simplifies the calculation and turns out to give the best estimate: https://arxiv.org/pdf/1702.02144 The attached image contains some examples.
This in-some-sense optimal, yet surprisingly inexpensive method also works well for multidimensional densities - it literally allows us to hierarchically reconstruct the correlations in a data sample: first model the probability density of each variable separately, then add corrections from correlations between pairs of variables, and so on for correlations among a growing number of variables.
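A sketch of this hierarchical picture in 2D, with a product basis f_i(x) f_j(y) (the basis choice and toy correlated data are my assumptions):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal Legendre basis on [0,1]: f_j(x) = sqrt(2j+1) * P_j(2x-1)
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c)

rng = np.random.default_rng(1)
n = 20_000
x = rng.random(n)
y = np.clip(x + 0.1 * rng.standard_normal(n), 0.0, 1.0)  # correlated pair

m = 4
# product-basis coefficients: a_ij = mean over sample of f_i(x) f_j(y)
A = np.array([[(f(i, x) * f(j, y)).mean() for j in range(m)]
              for i in range(m)])
# hierarchy: since f_0 = 1, the coefficients A[i, 0] model the x marginal
# alone, A[0, j] the y marginal, and A[i, j] with i, j >= 1 are the
# pairwise-correlation corrections (here A[1, 1] > 0: positive dependence)
```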
This approach is perfect e.g. for the missing-data case, where not all coordinates are known: to model a given correlation we can use all data points that have the relevant coordinates, then use the obtained density (with correlations included) e.g. to impute the missing coordinates: https://arxiv.org/pdf/1804.06218
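For instance, imputing a missing y given x could look like this (conditional mean on a grid; this is entirely my toy construction, not the paper's exact procedure):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal Legendre basis on [0,1]
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c)

rng = np.random.default_rng(2)
n = 20_000
x = rng.random(n)
y = np.clip(x + 0.1 * rng.standard_normal(n), 0.0, 1.0)

m = 5
A = np.array([[(f(i, x) * f(j, y)).mean() for j in range(m)]
              for i in range(m)])

def impute_y(x0, grid_size=200):
    # conditional density rho(y | x0) ~ sum_ij A_ij f_i(x0) f_j(y);
    # impute the missing coordinate with its conditional mean on a grid
    grid = np.linspace(0.0, 1.0, grid_size)
    fx = np.array([f(i, x0) for i in range(m)])
    fy = np.array([f(j, grid) for j in range(m)])
    p = np.maximum(fx @ A @ fy, 0.0)  # clip negative polynomial artifacts
    return (grid * p).sum() / p.sum()

y_hat = impute_y(0.3)  # should land near 0.3 for this diagonal-ish data
```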
I thought that maybe such an approach could be useful for PPM/CM range data compressors?
So in addition to the standard p[context] tables, it would allow directly modeling correlations between the current symbol and chosen symbols from the context.
Using the evidence (the past), we can evaluate the importance of given correlations, successively growing the number of essential correlations in use and adapting their coefficients.
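To make the adaptation part concrete, a toy online sketch (everything here - the symbol-to-[0,1] mapping, the exponential-moving-average update, the parameters - is my hypothetical illustration, not a worked-out compressor):

```python
import numpy as np
from numpy.polynomial import legendre

def f(j, x):
    # orthonormal Legendre basis on [0,1]
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c)

class ToyCorrModel:
    """Tracks correlations between the current symbol and one context
    symbol, both mapped to [0,1], via an exponential moving average."""
    def __init__(self, m=4, rate=0.02):
        self.m, self.rate = m, rate
        self.A = np.zeros((m, m))
        self.A[0, 0] = 1.0  # f_0 = 1, so this entry keeps normalization

    def update(self, ctx, cur):
        # EMA toward the outer product f_i(ctx) f_j(cur)
        F = np.outer([f(i, ctx) for i in range(self.m)],
                     [f(j, cur) for j in range(self.m)])
        self.A += self.rate * (F - self.A)

    def predict(self, ctx, grid):
        # distribution over next-symbol values on `grid` (for range coding)
        fx = np.array([f(i, ctx) for i in range(self.m)])
        fy = np.array([f(j, grid) for j in range(self.m)])
        p = np.maximum(fx @ self.A @ fy, 1e-9)
        return p / p.sum()

# feed a stream where the current symbol repeats the context symbol
model = ToyCorrModel()
rng = np.random.default_rng(3)
for u in rng.random(3000):
    model.update(u, u)
grid = np.linspace(0.0, 1.0, 64)
p = model.predict(0.2, grid)
# the learned correlations concentrate probability near the context value
```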
What do you think about it?
Have such approaches been considered?
ps. A simple interactive 2D Mathematica demonstration - you can add points and see how the estimated density changes: