Hardy Haar: Non-linear density estimation using a sparse Haar prior
Arthur Breitman
April 3, 2016

Table of contents

Density estimation: Problem statement, Applications
Typical approaches: Parametric density, Kernel density estimation
Hardy Haar: Principles, Recipe, Linear transforms, How to use for data mining

Problem statement
What is density estimation?

▶ Given an i.i.d. sample x_1, …, x_n from an unknown distribution P, estimate P(x) for arbitrary x
▶ For instance, P may belong to some parametric family, but a non-Bayesian treatment is possible
▶ Examples:
  ▶ Model P as a multivariate Gaussian with unknown mean and covariance
  ▶ Kernel density estimation is non-parametric (but morally similar to modeling P as a uniform mixture of n distributions, fit with maximum likelihood)

Applications
Why is density estimation useful?

▶ Allows unsupervised learning by recovering the latent parameters of a distribution describing the data
▶ But it can also be used for supervised learning
▶ Learning P(x, y) is more general than learning y = f(x)
▶ For instance, to minimize quadratic error, use

  f(x) = ∫ y P(x, y) dy / ∫ P(x, y) dy

▶ Knowledge of the full density permits the use of any* loss function

*offer void under fat tails
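The conditional-expectation formula above can be evaluated numerically on a discretized joint density. This is an illustrative sketch only; the grid, values, and function name `regress_from_density` are assumptions, not from the slides.

```python
import numpy as np

def regress_from_density(p_xy, y_values):
    """f(x) = (integral of y * p(x, y) dy) / (integral of p(x, y) dy),
    evaluated on a grid where rows index x and columns index y."""
    numer = p_xy @ y_values      # sum over y of y * p(x, y), per row
    denom = p_xy.sum(axis=1)     # sum over y of p(x, y), per row
    return numer / denom

# Toy joint density on a 2x3 grid (x has 2 values, y has 3 values).
p = np.array([[0.1, 0.2, 0.1],
              [0.0, 0.2, 0.4]])
f = regress_from_density(p, np.array([0.0, 1.0, 2.0]))
print(f)
```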

Applications
Mutual information

Of particular interest is the ability to compute "mutual information". Mutual information is the principled way to measure what is loosely referred to as "correlation":

  I(X; Y) = ∫_Y ∫_X p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy

It measures the amount of information one variable gives us about another.
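As a rough sketch, mutual information can be approximated from any discretized joint density estimate by summing over grid cells (illustrative Python; the function name and toy grid are assumptions):

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information (in nats) of a discretized joint density.
    p_xy: 2-D array of joint probabilities summing to 1."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over rows
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over columns
    mask = p_xy > 0                         # 0 * log 0 is taken as 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Independent variables carry zero mutual information.
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(p_indep))  # ~0.0
```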

Applications
Correlation doesn't always capture this relation

In the case of a bivariate Gaussian,

  I = −(1/2) log(1 − ρ²)

We can get a correlation equivalent by using

  ρ̂ = √(1 − e^(−2I))
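These two formulas invert each other, which a few lines of Python can check (an illustrative sketch, not from the slides):

```python
import math

def mi_bivariate_gaussian(rho):
    # I = -(1/2) log(1 - rho^2) for a bivariate Gaussian.
    return -0.5 * math.log(1.0 - rho * rho)

def correlation_equivalent(i):
    # rho_hat = sqrt(1 - exp(-2 I)); inverts the formula above.
    return math.sqrt(1.0 - math.exp(-2.0 * i))

rho = 0.8
i = mi_bivariate_gaussian(rho)
print(correlation_equivalent(i))  # recovers 0.8
```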

Bivariate normal

Assume that the observed data was drawn from a bivariate normal distribution

▶ Latent parameters: mean and covariance matrix
▶ Unsupervised view: learn the relationship between two random variables (mean, variance, correlation)
▶ Supervised view: equivalent to simple linear regression

Parametric density
Bivariate normal density

[figure: bivariate normal density]

Parametric density
Kernel density estimation

Kernel density estimation is a non-parametric density estimator defined as

  f̂_h(x) = (1 / nh) ∑_{i=1}^{n} K((x − x_i) / h)

▶ K is a non-negative function that integrates to 1 and has mean 0 (typically Gaussian)
▶ h is the scale or bandwidth
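A minimal sketch of this estimator with a Gaussian kernel in one dimension (illustrative Python; the sample is made up):

```python
import math

def gaussian_kernel(u):
    # Standard normal density: integrates to 1, mean 0.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """f_h(x) = (1 / (n h)) * sum_i K((x - x_i) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)

sample = [-1.0, 0.0, 1.0]
print(kde(0.0, sample, h=1.0))  # ≈ 0.2943
```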

Kernel density estimation
Gaussian kernel density estimation

[figure: Gaussian kernel density estimate]

Kernel density estimation
Bandwidth selection

Picking h can be tricky

▶ h too small ⟹ overfits the data
▶ h too large ⟹ underfits the data
▶ There are rules of thumb to pick h from the variance of the data and the number of points
▶ h can also be picked by cross-validation
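One way to sketch the cross-validation option is to pick the h that maximizes the leave-one-out log-likelihood of the estimator (illustrative Python; the candidate grid and sample are assumptions):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def loo_log_likelihood(sample, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h:
    each point is scored against the density built from the other points."""
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sample):
        density = sum(gaussian_kernel((x - xj) / h)
                      for j, xj in enumerate(sample) if j != i) / ((n - 1) * h)
        total += math.log(density)
    return total

sample = [0.1, 0.3, 0.2, 2.1, 1.9, 2.0]
candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h = max(candidates, key=lambda h: loo_log_likelihood(sample, h))
print(best_h)
```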

Kernel density estimation
Gaussian kernel density estimation

[figure: under- and over-fitting with different bandwidths; the correlation of the kernel is estimated from the data]

Kernel density estimation
Issues with kernel density estimation

Naive kernel density estimation has several drawbacks

▶ The kernel covariance is fixed for the entire space
▶ The bandwidth does not adjust to the local density
▶ No distributed representation ⟹ poor generalization
▶ It performs poorly in high dimensions

Kernel density estimation
Adaptive kernel density estimation

One approach is to use a different kernel for every point, varying the scale based on local features.

Balloon estimators make the kernel width inversely proportional to the density at the test point:

  h = k / (n P(x))^(1/D)

Pointwise estimators instead vary the kernel at each sample point.
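A common balloon-style heuristic, shown here only as an illustration, replaces the unknown P(x) with a k-nearest-neighbor distance, so the bandwidth grows where the sample is sparse (the function name and sample are assumptions, not from the slides):

```python
def balloon_bandwidth(x, sample, k):
    """Balloon-estimator heuristic in 1-D: use the distance to the k-th
    nearest sample point as the bandwidth at the test point x, so h is
    small in dense regions and large in sparse ones."""
    dists = sorted(abs(x - xi) for xi in sample)
    return dists[k - 1]

sample = [0.0, 0.1, 0.2, 5.0]
print(balloon_bandwidth(0.1, sample, k=2))  # dense region -> small h
print(balloon_bandwidth(5.0, sample, k=2))  # sparse region -> large h
```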

Kernel density estimation
Local bandwidth choice

▶ If the latent distribution has peaks, there is a tradeoff between accuracy around the peaks and accuracy in regions of low density
▶ This is reminiscent of time-frequency tradeoffs in Fourier analysis (hint: h is called the bandwidth)
▶ This suggests using wavelets, which have good localization in both time and frequency

Principles
Introducing Hardy Haar

Hardy Haar attempts to address some of these shortcomings

▶ Full Bayesian treatment of density estimation
▶ (Somewhat) distributed representation
▶ Fast!

Principles
Examples of Hardy Haar

[figure: example Hardy Haar density estimates]

Principles
Goals

Coming up with a prior?

▶ In principle, any distribution whose support contains the sample is a potential candidate
▶ Some distributions are more likely than others, but why not just take the empirical distribution?
  ▶ It may work fine for integration problems, for instance
  ▶ It doesn't help with regression or with understanding the data
▶ There should be some sort of spatial coherence to the distribution

Principles
Sparse wavelet decomposition prior

To express the spatial coherence constraint, we take an L0 sparsity prior over the coefficients of the decomposition of the PDF in a suitable wavelet basis.

▶ This creates a coefficient "budget" to describe the distribution
▶ Large-scale wavelets describe coarse features of the distribution
▶ Sparse areas can be described with few coefficients
▶ Areas with many sample points are described in more detail
▶ This closely adheres to the minimum description length principle

Principles
Haar basis

The Haar wavelet is the simplest wavelet.
It is not very smooth, but there is no overlap between wavelets at the same scale ⟹ tractability
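One level of the Haar transform makes the point concrete: pairwise averages carry the coarse features, pairwise differences are the detail coefficients, and locally flat regions produce zero coefficients, which is exactly what the sparsity prior rewards (illustrative Python sketch, not from the slides):

```python
def haar_decompose(signal):
    """One level of the (unnormalized) Haar transform: pairwise averages
    (coarse part) and pairwise half-differences (detail coefficients)."""
    averages = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    details = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return averages, details

# A locally flat signal has a zero detail coefficient there -> sparsity.
avg, det = haar_decompose([4.0, 4.0, 6.0, 2.0])
print(avg, det)
```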

Principles
Model

▶ The minimum description length principle suggests penalizing the log-likelihood of producing the sample with the number of non-zero coefficients
▶ We can put a weight on the penalty, which enforces more or less sparsity
▶ We can "cheat" with an improper prior: use an infinite number of coefficients, but favor models with many zeros

Recipe
Sampling

To sample from this distribution over distributions, conditional on the observed sample, we interpret the data as generated by a recursive generative model.

As n is held fixed, the number of data points in each orthant is described by a multinomial distribution. We put a non-informative Dirichlet prior on the probabilities of each orthant. This process is repeated recursively for each orthant.

Recipe
Orthant tree

Place the data points in an orthant tree. The structure is built in time O(n log n).
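A sketch of such a structure (a quadtree when d = 2, generalizing to 2^d children per node): recursively split the bounding box at its midpoint and count the points falling into each orthant. Illustrative Python with made-up bounds and points, not the slides' implementation:

```python
def build_orthant_tree(points, lo, hi, max_depth):
    """Recursively split the box [lo, hi) into 2^d orthants at its midpoint
    and record how many points fall into each (a quadtree when d = 2)."""
    d = len(lo)
    node = {"count": len(points), "children": {}}
    if max_depth == 0 or len(points) <= 1:
        return node
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    buckets = {}
    for p in points:
        # One boolean per axis: above or below the midpoint.
        key = tuple(p[i] >= mid[i] for i in range(d))
        buckets.setdefault(key, []).append(p)
    for key, pts in buckets.items():
        child_lo = [mid[i] if key[i] else lo[i] for i in range(d)]
        child_hi = [hi[i] if key[i] else mid[i] for i in range(d)]
        node["children"][key] = build_orthant_tree(pts, child_lo, child_hi,
                                                   max_depth - 1)
    return node

tree = build_orthant_tree([(0.1, 0.2), (0.9, 0.8), (0.8, 0.9)],
                          [0.0, 0.0], [1.0, 1.0], max_depth=4)
print(tree["count"], {k: c["count"] for k, c in tree["children"].items()})
```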

Recipe
Probability of each orthant

Conditional on the number of data points falling in each orthant, the distribution of the probability mass over the orthants is given by the Dirichlet distribution:

  Γ(∑_{i=1}^{2^d} n_i) / ∏_{i=1}^{2^d} Γ(1 + n_i) · ∏_{i=1}^{2^d} p_i^{n_i}

d is the dimension of the space, so there are 2^d orthants.
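Conditional on the counts, orthant probabilities can be drawn with the standard Gamma-variate construction of the Dirichlet. Illustrative Python; the uniform alpha = 1 prior is an assumption consistent with the non-informative prior mentioned earlier:

```python
import random

def sample_orthant_probabilities(counts):
    """Draw orthant probabilities from the Dirichlet posterior under a
    uniform (alpha = 1) prior: Dirichlet(1 + n_1, ..., 1 + n_{2^d}),
    via normalized independent Gamma variates."""
    gammas = [random.gammavariate(1 + n, 1.0) for n in counts]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(0)
probs = sample_orthant_probabilities([5, 0, 2, 1])  # 2^d = 4 orthants in 2-D
print(probs)
```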

Recipe
What about our prior?

Having a zero coefficient in the Haar decomposition translates into certain symmetries. In two dimensions there are eight cases:

▶ 1 non-zero coeff: each quadrant has 1/4 of the mass
▶ 2 non-zero coeffs:
  ▶ left vs. right, with top and bottom weights independent of side
  ▶ top vs. bottom
  ▶ diagonal vs. other diagonal
▶ 3 non-zero coeffs:
  ▶ mass shared equally between left and right, but each side has its own distribution between top and bottom
  ▶ same for top and bottom
  ▶ same for diagonals
▶ 4 non-zero coeffs: each quadrant is independent

N.B. probabilities must sum to 1

Recipe
Single point example

[figure: single point example]

Recipe
Marginalizing

▶ The distribution of weight over orthants is independent between the "levels" of the tree
▶ We can marginalize to efficiently compute the mean density at each point
▶ The cost is then O(2^d · n log n)
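Under the assumption of a uniform Dirichlet prior at each level and independence across levels, the posterior-mean density at a point multiplies the mean orthant probability along its path down the tree. A minimal sketch; the function name and example counts are assumptions, not from the slides:

```python
def mean_density(counts_along_path, d, volume):
    """Posterior-mean density at a point: at each tree level, multiply by
    the mean Dirichlet orthant probability (1 + n_i) / (2^d + n), rescaled
    by 2^d because each orthant covers 1/2^d of its parent's volume.
    (n_i: points in the orthant containing the point; n: points in parent.)"""
    density = 1.0 / volume  # uniform density on the root box
    for n_i, n in counts_along_path:
        density *= (2 ** d) * (1 + n_i) / (2 ** d + n)
    return density

# Unit square, one level: 3 of 4 points fall in our quadrant.
print(mean_density([(3, 4)], d=2, volume=1.0))  # 4 * 4/8 = 2.0
```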

Linear transforms
Orthant artifacts

▶ The model converges to the true distribution
▶ But the choice of orthants is arbitrary
▶ This introduces unnecessary variance
▶ We'd like to remove that sensitivity

Solution: integrate over all affine transforms of the data

Linear transforms
Integrating over linear transforms

Fortunately, the Haar model gives us the evidence

▶ Assume the data comes from a Gaussian copula
▶ Adjust by the Jacobian of the transform
▶ To sample linear transforms:
  ▶ perform PCA
  ▶ translate by (u_x/√n, u_y/√n)
  ▶ rotate randomly
  ▶ scale variances by 1/√(2n)
▶ … then weight by the evidence from the model

How to use for data mining
Application to data mining

▶ select relevant variables to use as regressors
▶ evaluate the quality of hand-crafted features
▶ explore unknown relationships in the data
▶ in time series, mutual information between time and data detects non-stationarity