Differential Stein operators for multivariate continuous distributions and applications
Gesine Reinert
A French/American Collaborative Colloquium on Concentration Inequalities, High Dimensional Statistics and Stein’s Method
July 4th, 2017
Joint work with Guillaume Mijoule and Yvik Swan (Liège)
Stein’s method
Outline
1 Stein’s method
2 The score function and the Stein kernel
3 Higher dimensions
4 Stein operators $\mathcal{T}_p\mathbf{F} = \mathrm{div}(\mathbf{F}p)/p$
5 Last remarks
Stein’s method in a nutshell
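For $\mu$ a target distribution, with support $\mathcal{I}$:
1 Find a suitable operator $\mathcal{A}$ (called Stein operator) and a wide class of functions $\mathcal{F}(\mathcal{A})$ (called Stein class) such that $X \sim \mu$ if and only if for all functions $f \in \mathcal{F}(\mathcal{A})$,
$$\mathbb{E}\,\mathcal{A}f(X) = 0.$$
2 Let $\mathcal{H}(\mathcal{I})$ be a measure-determining class on $\mathcal{I}$. For each $h \in \mathcal{H}(\mathcal{I})$ find a solution $f = f_h \in \mathcal{F}(\mathcal{A})$ of the Stein equation
$$h(x) - \mathbb{E}h(X) = \mathcal{A}f(x),$$
where $X \sim \mu$. If the solution exists and if it is unique in $\mathcal{F}(\mathcal{A})$ then we can write
$$f(x) = \mathcal{A}^{-1}(h(x) - \mathbb{E}h(X)).$$
We call $\mathcal{A}^{-1}$ the inverse Stein operator (for $\mu$).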
Example: mean zero normal
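Stein (1972, 1986), see also Chen, Goldstein, Shao 2011.
$Z \sim \mathcal{N}(0, \sigma^2)$ if and only if for all smooth functions $f$,
$$\mathbb{E}Zf(Z) = \sigma^2\,\mathbb{E}f'(Z).$$
Given a test function $h$, let $Z \sim \mathcal{N}(0, \sigma^2)$; the Stein equation is
$$\sigma^2 f'(w) - wf(w) = h(w) - \mathbb{E}h(Z),$$
which has as unique bounded solution
$$f(y) = \frac{1}{\sigma^2}\, e^{y^2/2\sigma^2} \int_{-\infty}^{y} \big(h(x) - \mathbb{E}h(Z)\big)\, e^{-x^2/2\sigma^2}\, dx.$$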
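As a numerical sanity check, the sketch below evaluates the integral formula for the solution on a grid and confirms that it satisfies $\sigma^2 f'(w) - wf(w) = h(w) - \mathbb{E}h(Z)$. The test function $h = \cos$, the value $\sigma = 1.5$, the grid and the helper name `stein_solution` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def stein_solution(h, sigma, grid):
    """Evaluate f(y) = sigma^-2 e^{y^2/(2 sigma^2)} * int_{-inf}^y (h - E h(Z)) e^{-x^2/(2 sigma^2)} dx
    on a uniform grid (hypothetical helper, trapezoid rule)."""
    dx = grid[1] - grid[0]
    w = np.exp(-grid**2 / (2 * sigma**2))      # unnormalised N(0, sigma^2) density
    Eh = np.sum(h(grid) * w) / np.sum(w)       # E h(Z) as a ratio of Riemann sums
    integrand = (h(grid) - Eh) * w
    # cumulative trapezoid rule starting at the left end of the grid
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * dx
    cum = np.concatenate(([0.0], np.cumsum(steps)))
    return cum / (sigma**2 * w), Eh

sigma = 1.5                                    # assumed value for illustration
grid = np.linspace(-10 * sigma, 10 * sigma, 30001)
f, Eh = stein_solution(np.cos, sigma, grid)

# residual of the Stein equation, with f' from central differences
fprime = np.gradient(f, grid[1] - grid[0])
residual = sigma**2 * fprime - grid * f - (np.cos(grid) - Eh)

# dividing by the density is unstable far in the tails, so check the bulk only
central = np.abs(grid) <= 2 * sigma
max_residual = np.max(np.abs(residual[central]))
print(Eh, max_residual)
```

The check is restricted to $|w| \le 2\sigma$ because the factor $e^{y^2/2\sigma^2}$ amplifies rounding error in the far tails; that is a limitation of this naive discretisation, not of the formula.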
Example: the sum of independent random variables
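$X_1, \dots, X_n$ independent with mean zero and $\mathrm{Var}(X_i) = \frac{1}{n}$; $W = \sum_{i=1}^n X_i$. Then
$$\begin{aligned}
\mathbb{E}f'(W) - \mathbb{E}Wf(W)
&= \mathbb{E}f'(W) - \sum_{i=1}^n \mathbb{E}X_i f(W) \\
&= \mathbb{E}f'(W) - \sum_{i=1}^n \mathbb{E}X_i f(W - X_i) - \sum_{i=1}^n \mathbb{E}X_i^2 f'(W - X_i) + R \\
&= \frac{1}{n}\sum_{i=1}^n \big(\mathbb{E}f'(W) - \mathbb{E}f'(W - X_i)\big) + R;
\end{aligned}$$
the last step uses independence of $X_i$ and $W - X_i$, so that $\mathbb{E}X_i f(W - X_i) = 0$ and $\mathbb{E}X_i^2 f'(W - X_i) = \frac{1}{n}\mathbb{E}f'(W - X_i)$. Bound this expression by Taylor expansion to give that for any smooth $h$
$$|\mathbb{E}h(W) - Nh| \le \|h'\| \left( \frac{2}{\sqrt{n}} + \sum_{i=1}^n \mathbb{E}|X_i^3| \right),$$
where $Nh = \mathbb{E}h(Z)$ for $Z \sim \mathcal{N}(0,1)$.
Note: nothing goes to infinity.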
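To see the size of the bound $\|h'\|\big(2/\sqrt{n} + \sum_i \mathbb{E}|X_i^3|\big)$ for finite $n$, here is a small sketch. The choice of scaled Rademacher summands $X_i = \varepsilon_i/\sqrt{n}$ and the function names are assumptions made for illustration; for this choice $\mathbb{E}|X_i|^3 = n^{-3/2}$, so the bound equals $3\|h'\|/\sqrt{n}$.

```python
import math

def stein_bound(third_abs_moments, lipschitz_norm=1.0):
    """||h'|| * (2/sqrt(n) + sum_i E|X_i|^3) for n summands with Var(X_i) = 1/n."""
    n = len(third_abs_moments)
    return lipschitz_norm * (2.0 / math.sqrt(n) + sum(third_abs_moments))

def rademacher_bound(n):
    # hypothetical example: X_i = eps_i / sqrt(n), eps_i = +1 or -1 with equal probability,
    # so that E|X_i|^3 = n^(-3/2)
    return stein_bound([n ** -1.5] * n)

bounds = {n: rademacher_bound(n) for n in (10, 100, 1000)}
for n, b in bounds.items():
    print(n, b)
```

The bound is explicit at every fixed $n$, which is the point of the remark that nothing goes to infinity.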
Comparison of distributions
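Let $X$ and $Y$ have distributions $\mu_X$ and $\mu_Y$ with Stein operators $\mathcal{A}_X$ and $\mathcal{A}_Y$, so that $\mathcal{F}(\mathcal{A}_X) \cap \mathcal{F}(\mathcal{A}_Y) \neq \emptyset$, and choose $\mathcal{H}(\mathcal{I})$ such that all solutions $f$ of the Stein equation belong to this intersection. Then
$$\mathbb{E}h(X) - \mathbb{E}h(Y) = \mathbb{E}\mathcal{A}_Y f(X) = \mathbb{E}\mathcal{A}_Y f(X) - \mathbb{E}\mathcal{A}_X f(X)$$
and
$$\sup_{h \in \mathcal{H}(\mathcal{I})} |\mathbb{E}h(X) - \mathbb{E}h(Y)| \le \sup_{f \in \mathcal{F}(\mathcal{A}_X) \cap \mathcal{F}(\mathcal{A}_Y)} |\mathbb{E}\mathcal{A}_X f(X) - \mathbb{E}\mathcal{A}_Y f(X)|.$$
If $\mathcal{H}(\mathcal{I})$ is the set of all Lipschitz-1 functions then the resulting distance is $d_W$, the Wasserstein distance. For examples: Holmes (2004), Eichelsbacher and R. (2008), Döbler (2012), Ley, Swan and R. 2015, 2017.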
The score function and the Stein kernel
A Stein operator for continuous real-valued variables
Let $X$ be continuous having pdf $p$ with support $\mathcal{I} = [a, b] \subset \mathbb{R}$.
The Stein class of $X$ is the class $\mathcal{F}(p)$ of functions $f : \mathbb{R} \to \mathbb{R}$ such that
(i) $x \mapsto f(x)p(x)$ is differentiable on $\mathbb{R}$,
(ii) $(fp)'$ is integrable and $\int (fp)' = 0$.
To $p$ associate the Stein operator $\mathcal{T}_p$:
$$\mathcal{T}_p f = \frac{(fp)'}{p}.$$
(Stein 1986, Stein et al. 2004, Ley and Swan 2013)
By the product rule,
$$\mathbb{E}\big[g'(X)f(X)\big] = -\mathbb{E}\big[g(X)\mathcal{T}_p f(X)\big]$$
for all $f \in \mathcal{F}(p)$ and for all differentiable functions $g$ such that $\int (gfp)'\,dx = 0$ and $\int |g'fp|\,dx < \infty$; we say that $g \in \mathrm{dom}((\cdot)', p, f)$.
Stein characterisations
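Let $Y$ be continuous with density $q$, and same support as $X$.
1 Suppose that $\frac{q}{p}$ is differentiable. Take $g \in \bigcap_{f \in \mathcal{F}(p)} \mathrm{dom}((\cdot)', p, f)$ such that $g$ is $p$-a.s. never 0 and $g\frac{q}{p}$ is differentiable. Then
$$Y \stackrel{D}{=} X \text{ if and only if } \mathbb{E}\big[f(Y)g'(Y)\big] = -\mathbb{E}\big[g(Y)\mathcal{T}_p f(Y)\big]$$
for all $f \in \mathcal{F}(p)$.
2 Let $f \in \mathcal{F}(p)$ be $p$-a.s. never zero and assume that $\mathrm{dom}((\cdot)', p, f)$ is dense in $L^1(p)$. Then
$$Y \stackrel{D}{=} X \text{ if and only if } \mathbb{E}\big[f(Y)g'(Y)\big] = -\mathbb{E}\big[g(Y)\mathcal{T}_p f(Y)\big]$$
for all $g \in \mathrm{dom}((\cdot)', p, f)$.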
The inverse Stein operator
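Let $\mathcal{F}^{(0)}(p)$ be the class of mean zero smooth test functions; the inverse Stein operator $\mathcal{T}_p^{-1} : \mathcal{F}^{(0)}(p) \to \mathcal{F}(p)$ is
$$\mathcal{T}_p^{-1} h(x) = \frac{1}{p(x)} \int_a^x p(y)h(y)\,dy = -\frac{1}{p(x)} \int_x^b p(y)h(y)\,dy,$$
so that $\mathcal{T}_p \mathcal{T}_p^{-1} h = h$ for $\mathcal{T}_p f = (fp)'/p$.
The equation
$$h(x) - \mathbb{E}h(X) = f(x)g'(x) + g(x)\mathcal{T}_p f(x), \quad x \in \mathcal{I},$$
is a Stein equation for the target $p$. Solutions of this equation (for $h$ such that a solution exists) are pairs of functions $(f, g)$ such that
$$fg = \mathcal{T}_p^{-1}(h - \mathbb{E}_p h).$$
Although $fg$ is unique, the individual $f$ and $g$ are not.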
$f(x)g'(x) + g(x)\mathcal{T}_p f(x)$: Special Stein operators
Our general Stein operator is an operator on pairs of functions $(f, g)$;
$$\mathcal{A}(f, g)(x) = \mathcal{T}_p(fg)(x) = f(x)g'(x) + g(x)\mathcal{T}_p f(x).$$
Suppose that $1 \in \mathcal{F}(p)$. Then taking $f(x) = 1$ we get
$$\mathcal{A}_p g(x) = g'(x) + g(x)\rho(x) \quad \text{with} \quad \rho(x) = \mathcal{T}_p 1(x) = \frac{p'(x)}{p(x)}$$
the so-called “score function” of $p$; see for example Stein (2004).
If $X$ has finite mean $\nu$, taking $f(x) = \mathcal{T}_p^{-1}(\nu - x)$ we get
$$\mathcal{A}_X g(x) = \tau(x)g'(x) + (\nu - x)g(x) \quad \text{with} \quad \tau = \mathcal{T}_p^{-1}(\nu - \mathrm{Id})$$
the “Stein kernel of $p$”; see Stein (1986) and Cacoullos et al. (1992).
Example: Normal
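In the example of a $\mathcal{N}(0, \sigma^2)$ random variable, the definition $\mathcal{T}_p f = (fp)'/p$ gives
$$\mathcal{T}_{\mathcal{N}} f(x) = f'(x) - \frac{1}{\sigma^2}xf(x),$$
which contrasts with
$$\sigma^2 f'(x) - xf(x),$$
the standard Stein operator for this case. The score function is $-\frac{x}{\sigma^2}$. The Stein kernel is
$$\tau(x) = \sigma^2,$$
giving the standard Stein operator.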
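The score function and the Stein kernel can also be computed numerically from a log-density. The sketch below uses $\rho = (\log p)'$ and $\tau(x) = p(x)^{-1}\int_x^b (y - \nu)p(y)\,dy$, the form of the kernel that returns $\sigma^2$ for the normal; the helper name `score_and_kernel`, the grid and $\sigma = 2$ are assumptions for illustration only.

```python
import numpy as np

def score_and_kernel(logp, grid):
    """Score rho = (log p)' and Stein kernel tau(x) = p(x)^-1 int_x^b (y - nu) p(y) dy
    on a uniform grid (hypothetical helper; logp may be unnormalised)."""
    dx = grid[1] - grid[0]
    p = np.exp(logp(grid))
    p /= np.sum(p) * dx                        # normalise numerically
    nu = np.sum(grid * p) * dx                 # mean of p
    rho = np.gradient(logp(grid), dx)          # score by central differences
    integrand = (grid - nu) * p
    # cumulative trapezoid rule from the right end: tail[i] = int_{grid[i]}^{b} integrand
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * dx
    tail = np.concatenate((np.cumsum(steps[::-1])[::-1], [0.0]))
    return rho, tail / p

sigma = 2.0                                    # assumed value for illustration
grid = np.linspace(-8 * sigma, 8 * sigma, 16001)
rho, tau = score_and_kernel(lambda x: -x**2 / (2 * sigma**2), grid)

central = np.abs(grid) <= 2 * sigma            # avoid dividing by tiny tail densities
score_error = np.max(np.abs(rho[central] + grid[central] / sigma**2))
kernel_error = np.max(np.abs(tau[central] - sigma**2))
print(score_error, kernel_error)
```

Swapping in another log-density on the same grid gives a quick way to inspect how far a target is from having a constant Stein kernel.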
Higher dimensions
Notation
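Let $e_1, \dots, e_d$ be the canonical basis for Cartesian coordinates in $\mathbb{R}^d$.
The gradient for $\phi : \mathbb{R}^d \to \mathbb{R}$ is
$$\nabla\phi = \left(\frac{\partial\phi}{\partial x_1}, \dots, \frac{\partial\phi}{\partial x_d}\right)^T = \sum_{i=1}^d (\partial_i\phi)\,e_i.$$
The gradient of a vector field $v : \mathbb{R}^d \to \mathbb{R}^r : x \mapsto (v_1(x), v_2(x), \dots, v_r(x))$ (a row vector) is the matrix
$$\nabla v = \begin{pmatrix} \nabla v_1 & \nabla v_2 & \cdots & \nabla v_r \end{pmatrix} = \left(\frac{\partial v_j}{\partial x_i}\right)_{1 \le i \le d,\ 1 \le j \le r}.$$
If $r = d$ then the divergence of $v$ is
$$\mathrm{div}(v) = \nabla \cdot v^T = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} = \mathrm{Tr}(\nabla v),$$
with $\mathrm{Tr}$ the trace operator and $x \cdot y = x^T y = \langle x, y\rangle$ the Euclidean scalar product between $x$ and $y$.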
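More generally, the divergence of a $q \times d$ tensor field
$$\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d : x \mapsto \mathbf{F}(x) = \begin{pmatrix} F_1(x) \\ \vdots \\ F_q(x) \end{pmatrix} = \begin{pmatrix} F_{11}(x) & \dots & F_{1d}(x) \\ \vdots & \ddots & \vdots \\ F_{q1}(x) & \dots & F_{qd}(x) \end{pmatrix}$$
is
$$\mathrm{div}(\mathbf{F}) := \boldsymbol{\nabla} \cdot \mathbf{F} = \begin{pmatrix} \mathrm{div}(F_1) \\ \vdots \\ \mathrm{div}(F_q) \end{pmatrix} = \begin{pmatrix} \nabla \cdot F_1^T \\ \vdots \\ \nabla \cdot F_q^T \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^d \frac{\partial F_{1i}}{\partial x_i} \\ \vdots \\ \sum_{i=1}^d \frac{\partial F_{qi}}{\partial x_i} \end{pmatrix}.$$
The divergence maps matrix-valued functions $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ to vector-valued functions $\mathrm{div}(\mathbf{F}) : \mathbb{R}^d \to \mathbb{R}^q$.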
Product rule for divergence
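Let $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ be a $q \times d$ tensor field and $\phi : \mathbb{R}^d \to \mathbb{R}$. Then, under appropriate regularity conditions,
$$\mathrm{div}(\mathbf{F}\phi) = \mathrm{div}(\mathbf{F})\phi + \mathbf{F}\nabla\phi.$$
Similarly if $\mathbf{F}$ is a $q \times d$ tensor field and $\mathbf{G}$ is a $d \times d$ tensor field then $\mathbf{F}\mathbf{G}$ is a $q \times d$ tensor field and
$$(\mathrm{div}(\mathbf{F}\mathbf{G}))_j = F_j\,\mathrm{div}(\mathbf{G}) + \mathrm{Tr}(\mathrm{grad}(F_j)\,\mathbf{G})$$
for $j = 1, \dots, q$.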
What is known: multivariate normal
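$Y \in \mathbb{R}^d$ is a multivariate normal $\mathrm{MVN}(0, \Sigma)$ if and only if
$$\mathbb{E}\,Y^t\nabla f(Y) = \mathbb{E}\,\nabla^t\Sigma\nabla f(Y), \quad \text{for all smooth } f : \mathbb{R}^d \to \mathbb{R}.$$
Assume that $h : \mathbb{R}^d \to \mathbb{R}$ has 3 bounded derivatives. Then, if $\Sigma \in \mathbb{R}^{d \times d}$ is symmetric and positive definite, and $Z \sim \mathrm{MVN}(0, I_d)$ (so that $\Sigma^{1/2}Z \sim \mathrm{MVN}(0, \Sigma)$), there is a solution $f : \mathbb{R}^d \to \mathbb{R}$ to the Stein equation
$$\nabla^t\Sigma\nabla f(w) - w^t\nabla f(w) = h(w) - \mathbb{E}h(\Sigma^{1/2}Z),$$
which holds for every $w \in \mathbb{R}^d$.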
The Mehler formula
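To solve $\nabla^t\Sigma\nabla f(w) - w^t\nabla f(w) = h(w) - \mathbb{E}h(\Sigma^{1/2}Z)$, $w \in \mathbb{R}^d$, for $t \in [0, 1]$ put
$$Z_{w,t} = \sqrt{t}\,w + \sqrt{1 - t}\,\Sigma^{1/2}Z,$$
then
$$f(w) = \int_0^1 \frac{1}{2t}\big[\mathbb{E}h(Z_{w,t}) - \mathbb{E}h(\Sigma^{1/2}Z)\big]\,dt$$
is a solution to the Stein equation. This solution $f$ satisfies the bounds
$$\left|\frac{\partial^k f(w)}{\prod_{j=1}^k \partial w_{i_j}}\right| \le \frac{1}{k} \sup_{v \in \mathbb{R}^d}\left|\frac{\partial^k h(v)}{\prod_{j=1}^k \partial v_{i_j}}\right|$$
for every $w \in \mathbb{R}^d$.
(Barbour 1990, Götze 1993, Rinott and Rotar 1996, Goldstein and Rinott 1996, R. + Röllin 2007, Meckes 2009, Chen, Goldstein and Shao 2011)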
What is known: strictly log-concave
(Mackey and Gorham 2016) For continuous $p$ on $\mathbb{R}^d$, such that $\log p \in C^4(\mathbb{R}^d)$ is $k$-strictly concave, the operator
$$\mathcal{A}f(w) = \frac{1}{2}\langle\nabla f(w), \nabla\log p(w)\rangle + \frac{1}{2}\Delta f(w)$$
is the generator of an overdamped Langevin diffusion. The Stein equation
$$\mathcal{A}f(w) = h(w) - \mathbb{E}_p h$$
is solved by
$$f(w) = \int_0^\infty \big[\mathbb{E}_p h(Z) - \mathbb{E}h(Z_{w,t})\big]\,dt$$
with $(Z_{w,t})_{t \ge 0}$ the overdamped Langevin diffusion with generator $\mathcal{A}$ and $Z_{w,0} = w$.
The first three derivatives of $f$ can be bounded in terms of the same and lower order derivatives of $h$.
What is known: Score functions
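(Nourdin et al. 2013, 2014) Let $X \in \mathbb{R}^d$ have mean 0 and p.d.f. $p : \mathbb{R}^d \to \mathbb{R}$. The score of $p$ is the random vector $\rho_p(X)$ in $\mathbb{R}^d$ which satisfies
$$\mathbb{E}\,\rho_p(X)\phi(X) = -\mathbb{E}\,\nabla\phi(X)$$
for all $\phi \in C_c^\infty(\mathbb{R}^d)$.
If $p$ has a score, then it is uniquely defined through $\rho_p(x) = \nabla\log p(x)$.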
What is known: Stein kernels
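(Nourdin et al. 2013, 2014) A random $d \times d$ matrix $\tau_p(X)$ such that
$$\mathbb{E}\,\tau_p(X)\nabla\phi(X) = \mathbb{E}\,X\phi(X)$$
for all $\phi \in C_c^\infty(\mathbb{R}^d)$ is called a strong Stein kernel for $p$.
Ledoux et al. 2015: $\tau_p(X)$ is a weak Stein kernel if for all $\phi \in C_c^\infty(\mathbb{R}^d)$
$$\mathbb{E}\,\mathrm{Tr}\big(\tau_p(X)\,\mathrm{Hess}(\phi(X))^T\big) = \mathbb{E}\,\langle X, \nabla\phi(X)\rangle.$$
There is no reason to assume uniqueness for the Stein kernel, or existence. If $\tau_1$ and $\tau_2$ are two Stein kernels for $p$, then for all $\phi \in C_c^\infty(\mathbb{R}^d)$,
$$\mathbb{E}\,(\tau_1(X) - \tau_2(X))\nabla\phi(X) = 0;$$
then
$$\mathrm{div}\big(p(x)(\tau_1(x) - \tau_2(x))\big) = 0,$$
from which we get uniqueness only in the one-dimensional case.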
Stein operators $\mathcal{T}_p\mathbf{F} = \mathrm{div}(\mathbf{F}p)/p$
The general multivariate density case
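Let $X \in \mathbb{R}^d$ have pdf $p : \mathbb{R}^d \to \mathbb{R}$ with respect to the Lebesgue measure on $\mathbb{R}^d$. Let $\Omega$ be the support of $p$.
1 Let $q \in \mathbb{N}_0$. The $q$-Stein class for $X$ is the class $\mathcal{F}_q(X)$ of all $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ such that
(i) $p\mathbf{F}$ is differentiable in the sense that its gradient exists,
(ii) $\mathrm{div}(p\mathbf{F})$ is integrable on $\Omega$,
(iii) $\int_\Omega \mathrm{div}(p\mathbf{F}) = \mathbf{0}$.
2 We propose as Stein operator of $p$ the operator
$$\mathcal{T}_p\mathbf{F} = \frac{\mathrm{div}(\mathbf{F}p)}{p}$$
acting on test functions $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$, $\mathbf{F} \in \mathcal{F}_q(X)$.
If $\mathbf{F} \in \mathcal{F}_q(X)$ then $\mathcal{T}_p\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q$.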
Stein type integration by parts
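To each $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ in $\mathcal{F}_q(p)$ we associate $\mathrm{dom}(\nabla, p, \mathbf{F})$, the vector space of functions $g : \mathbb{R}^d \to \mathbb{R}$ such that $\mathbf{F}g \in \mathcal{F}_q(p)$ and $\mathbf{F}\nabla g \in L^1(p)$.
Proposition:
$$\mathbb{E}_p[\mathbf{F}\nabla g] = -\mathbb{E}_p[(\mathcal{T}_p\mathbf{F})\,g]$$
for all $\mathbf{F} \in \mathcal{F}_q(p)$ and all $g \in \mathrm{dom}(\nabla, p, \mathbf{F})$.
Proof: Apply the product rule for divergence,
$$\mathrm{div}(\mathbf{F}\phi) = \mathrm{div}(\mathbf{F})\phi + \mathbf{F}\nabla\phi,$$
to $(\mathbf{F}\phi)p$ with $\phi = g$, to show that for $\mathcal{T}_p\mathbf{F} = \frac{\mathrm{div}(\mathbf{F}p)}{p}$,
$$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g,$$
and then take expectations, using that $\int_\Omega \mathrm{div}(\mathbf{F}gp) = \mathbf{0}$ and hence the l.h.s. has mean 0.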
Stein operators
As in the 1-dimensional case, our Stein operators depend on two test functions, $\mathbf{F}$ and $g$, and are of the form
$$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g,$$
obtained
either by fixing $\mathbf{F}$ and considering $g$ as the (scalar-valued) test functions,
or by fixing $g$ and considering $\mathbf{F}$ as the (matrix-valued) test functions.
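$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g$: $\mathbf{F} = I_d$ fixed
Suppose that the identity matrix $I_d \in \mathcal{F}_d(p)$ (e.g. if $p$ is log-concave and vanishes at $\partial\Omega$). Then
$$\mathcal{T}_p I_d = \nabla\log p = \rho_p,$$
and the Stein operator is $\mathcal{A}_p g : \mathbb{R}^d \to \mathbb{R}^d$,
$$\mathcal{A}_p g = \mathcal{T}_p(I_d\,g) = \nabla g + \rho_p\,g$$
acting on $g : \mathbb{R}^d \to \mathbb{R}$ belonging to $\mathrm{dom}(\nabla, p, I_d)$.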
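$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g$: $\mathbf{F} = \boldsymbol{\tau}_p$ fixed
Let $X$ have mean $\nu$ and suppose that there exists a $d \times d$ matrix-valued function $\mathbf{F} = \boldsymbol{\tau}_p$ (a Stein kernel) satisfying
$$\mathcal{T}_p(\boldsymbol{\tau}_p)(x) = -(x - \nu)$$
at all $x$. Then $\mathcal{A}_p g : \mathbb{R}^d \to \mathbb{R}^d$,
$$\mathcal{A}_p g(x) = \mathcal{T}_p(\boldsymbol{\tau}_p g)(x) = -(x - \nu)g(x) + \boldsymbol{\tau}_p(x)\nabla g(x)$$
acting on differentiable functions $g : \mathbb{R}^d \to \mathbb{R}$ belonging to $\mathrm{dom}(\nabla, p, \boldsymbol{\tau}_p)$.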
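$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g$: $g = 1$ fixed
For $g : \mathbb{R}^d \to \mathbb{R}$, $g(x) = 1$ we obtain for $\mathbf{F} \in \mathcal{F}_q(p)$,
$$\mathcal{A}_p\mathbf{F}(x) = \mathcal{T}_p\mathbf{F}(x) \in \mathbb{R}^q,$$
vector-valued. The Stein equation for a zero mean function $h : \mathbb{R}^d \to \mathbb{R}^q$ is then
$$\mathcal{A}_p\mathbf{F}(x) = \frac{\mathrm{div}(\mathbf{F}p)}{p}(x) = h(x)$$
which gives
$$\mathrm{div}(\mathbf{F}p)(x) = p(x)h(x).$$
There is not a unique solution. If $q = d$ then we could choose a solution $\mathbf{F}$ such that $F_{i,j} = 0$ for $i \neq j$.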
Special case: q = 1
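Let $v = (v_1, \dots, v_d) : \mathbb{R}^d \to \mathbb{R}^d$ be a vector field in the 1-Stein class for $p : \mathbb{R}^d \to \mathbb{R}$. Then our Stein operator of $p$ is
$$\mathcal{T}_p v = \frac{(\nabla \cdot v)\,p + v\nabla p}{p} = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} + \sum_{i=1}^d v_i\,\frac{\partial_i p}{p}.$$
This is a function from $\mathbb{R}^d$ to $\mathbb{R}$.
Take as vector field $v = \nabla f$ for a smooth function $f : \mathbb{R}^d \to \mathbb{R}$. This choice gives
$$\mathcal{A}_p(f) = \mathcal{T}_p v = \Delta f + \langle\nabla\log p, \nabla f\rangle,$$
interpreted as operator on $f$ rather than $v$. This is the operator considered by Mackey and Gorham 2016, except for a factor $\frac{1}{2}$.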
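$\mathcal{T}_p(\mathbf{F}g) = (\mathcal{T}_p\mathbf{F})\,g + \mathbf{F}\nabla g$: $g = p^{-1}$ fixed
For $g : \mathbb{R}^d \to \mathbb{R}$, $g(x) = 1/p(x)$ we obtain for $\mathbf{F} \in \mathcal{F}_q(p)$,
$$\mathcal{A}_p\mathbf{F} = \frac{\mathrm{div}(\mathbf{F}p)}{p^2} + \mathbf{F}\nabla(1/p) \in \mathbb{R}^q,$$
vector-valued. The Stein equation for a zero mean function $h : \mathbb{R}^d \to \mathbb{R}^q$ is then
$$\frac{\mathrm{div}(\mathbf{F}p)}{p^2}(x) + \mathbf{F}\nabla(1/p)(x) = h(x)$$
which gives
$$\mathrm{div}(\mathbf{F})(x) = p(x)h(x).$$
Again there is not a unique solution. If $q = d$ then we could choose a solution $\mathbf{F}$ such that $F_{i,j} = 0$ for $i \neq j$.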
Example: multivariate normal
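Consider $Z \sim \mathrm{MVN}_d(0, \boldsymbol{\Sigma})$. Then
$$\rho_p(x) = -\boldsymbol{\Sigma}^{-1}x \quad \text{and} \quad \boldsymbol{\tau}_p(x) = \boldsymbol{\Sigma}$$
(linear score and constant Stein kernel). These lead to the Stein operator for $g : \mathbb{R}^d \to \mathbb{R}$
$$\mathcal{A}_p g(x) = \boldsymbol{\Sigma}\nabla g(x) - g(x)\,x.$$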
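The defining property of a Stein operator is that it has mean zero under the target. The sketch below checks $\mathbb{E}[\boldsymbol{\Sigma}\nabla g(Z) - g(Z)Z] = 0$ for $Z \sim \mathrm{MVN}_2(0, \boldsymbol{\Sigma})$ with a tensor-product Gauss-Hermite rule. The covariance matrix, the test function `g` and the number of quadrature nodes are arbitrary choices made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])     # assumed covariance for illustration
L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T

# tensor-product Gauss-Hermite rule for a standard normal vector in R^2
nodes, weights = hermegauss(40)
weights = weights / np.sqrt(2 * np.pi)         # rescaled so the weights sum to 1
U = np.array([(a, b) for a in nodes for b in nodes])
W = np.array([wa * wb for wa in weights for wb in weights])
Z = U @ L.T                                    # rows z = L u are quadrature points for MVN(0, Sigma)

def g(z):
    # hypothetical smooth test function g : R^2 -> R
    return np.sin(z[:, 0] + 0.5 * z[:, 1]) + z[:, 0] * z[:, 1]

def grad_g(z):
    c = np.cos(z[:, 0] + 0.5 * z[:, 1])
    return np.stack([c + z[:, 1], 0.5 * c + z[:, 0]], axis=1)

# A_p g(z) = Sigma grad g(z) - g(z) z, evaluated row by row
Ag = grad_g(Z) @ Sigma.T - g(Z)[:, None] * Z
stein_mean = W @ Ag                            # approximates E[A_p g(Z)], a vector in R^2
print(stein_mean)
```

Quadrature is used instead of Monte Carlo so that the result is deterministic and the remaining error is at the level of rounding.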
Example: elliptical distributions
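A $d$-dimensional random vector has multivariate elliptical distribution $E_d(\mu, \boldsymbol{\Sigma}, \phi)$ if its density is given by
$$p(x) = \kappa|\boldsymbol{\Sigma}|^{-1/2}\,\phi\!\left(\frac{1}{2}(x - \mu)^t\boldsymbol{\Sigma}^{-1}(x - \mu)\right)$$
on $\mathbb{R}^d$, for $\phi$ a smooth function and $\kappa$ the normalising constant. Elliptical distributions have (taking $\mu = 0$) the score function
$$\rho_p(x) = \boldsymbol{\Sigma}^{-1}x\,\frac{\phi'(x^t\boldsymbol{\Sigma}^{-1}x/2)}{\phi(x^t\boldsymbol{\Sigma}^{-1}x/2)},$$
and
$$\boldsymbol{\tau}_p(x) = \left(\frac{1}{\phi(x^t\boldsymbol{\Sigma}^{-1}x/2)}\int_{x^t\boldsymbol{\Sigma}^{-1}x/2}^{+\infty}\phi(u)\,du\right)\boldsymbol{\Sigma}$$
is a strong Stein kernel for $p$ (Landsman, Vanduffel, Yao 2014).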
Bounds on the solution of the Stein equation
So we have Stein equations, but when are the solutions well behaved?
In the multivariate normal case: Mehler formula.
In the case of strictly log-concave distributions: overdamped Langevin diffusion.
The bounds will be distribution-specific.
Bounds using a Poincaré constant
We say that $C_p$ is a Poincaré constant associated to $\mu_X$ if for every smooth function $\varphi \in L^2(\mu_X)$ such that $\mathbb{E}\varphi(X) = 0$,
$$\mathbb{E}\varphi^2(X) \le C_p\,\mathbb{E}|\nabla\varphi(X)|^2.$$
For example, when $X$ has $k$-log-concave density, then the law of $X$ satisfies a Poincaré inequality with $C_p = 1/k$.
Using the Lax-Milgram theorem we can show the following result.
Let $h$ be a smooth, 1-Lipschitz function. Let $X$ be a random vector with density $p$, and assume $C_p < \infty$ is a Poincaré constant for $p(x)\,dx$. Then we prove that there exists a weak solution $u$ to
$$\Delta u + \nabla\log p \cdot \nabla u = h - p(h),$$
where $p(h) = \int hp$, such that
$$\int |\nabla u|^2\,p \le C_p^2.$$
Application: nested densities
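The Wasserstein distance between (the distributions of) $X$ and $Y$ is
$$d_W(X, Y) = \sup_{h \in \mathrm{Lip}(1)} |\mathbb{E}h(X) - \mathbb{E}h(Y)|.$$
Compare the Wasserstein distance between $P_1$ and $P_2$ on $\mathbb{R}^d$, with densities $p_1$, assumed $k$-log concave, and $p_2 = \pi_0 p_1$. Put
$$\mathcal{A}_1 u = \frac{1}{2}\nabla\log p_1 \cdot \nabla u + \frac{1}{2}\Delta u,$$
and
$$\mathcal{A}_2 u = \frac{1}{2}\nabla\log p_2 \cdot \nabla u + \frac{1}{2}\Delta u.$$
Then
$$\mathcal{A}_2 u = \mathcal{A}_1 u + \frac{1}{2}\nabla\log\pi_0 \cdot \nabla u.$$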
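Let $h : \mathbb{R}^d \to \mathbb{R}$ be a 1-Lipschitz function, and $u_h$ a solution to $\mathcal{A}_1 u_h = h - \int hp_1$. Let $X_1$ ($X_2$) have distribution $P_1$ ($P_2$). Then as $\mathcal{A}_2 u = \mathcal{A}_1 u + \frac{1}{2}\nabla\log\pi_0 \cdot \nabla u$,
$$\begin{aligned}
\mathbb{E}[h(X_2)] - \mathbb{E}[h(X_1)] &= \mathbb{E}[\mathcal{A}_1 u_h(X_2)] \\
&= \mathbb{E}\left[\mathcal{A}_2 u_h(X_2) - \frac{1}{2}\nabla\log\pi_0(X_2) \cdot \nabla u_h(X_2)\right] \\
&= -\frac{1}{2}\mathbb{E}\left[\nabla\log\pi_0(X_2) \cdot \nabla u_h(X_2)\right].
\end{aligned}$$
Using the Poincaré bounds we obtain
$$d_W(X_1, X_2) \le \frac{1}{k}\,\mathbb{E}[|\nabla\pi_0(X_1)|].$$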
Example: Copulas
Let $(V_1, V_2)$ be a 2-dimensional random vector, such that the marginals $V_1$ and $V_2$ have uniform $U[0, 1]$ distribution. The copula of $(V_1, V_2)$ is
$$C(x_1, x_2) = \mathbb{P}[V_1 \le x_1, V_2 \le x_2], \quad (x_1, x_2) \in [0, 1]^2,$$
and we assume that $c = \partial^2_{x_1 x_2} C$ exists.
Let $(U_1, U_2)$ be independent $U[0, 1]$. The copula of $(U_1, U_2)$ is $(x_1, x_2) \mapsto x_1 x_2$.
Payne 1960: an optimal Poincaré constant for $U[0, 1]^2$ is $C_p = 2/\pi^2$.
Now we can show:
$$d_W[(V_1, V_2), (U_1, U_2)] \le \frac{2}{\pi^2}\sqrt{\int_{[0,1]^2} |\nabla c(x_1, x_2)|^2\,dx_1\,dx_2}.$$
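As an illustration of the bound $\frac{2}{\pi^2}\big(\int_{[0,1]^2}|\nabla c|^2\big)^{1/2}$, the sketch below evaluates it for the Farlie-Gumbel-Morgenstern (FGM) copula, whose density is $c(x_1, x_2) = 1 + \theta(1 - 2x_1)(1 - 2x_2)$ with $|\theta| \le 1$. The FGM family is not discussed in the talk; it is chosen here only because the integral has the closed form $8\theta^2/3$, which gives a check on the midpoint rule. The function names and $\theta = 0.5$ are assumptions.

```python
import numpy as np

def copula_bound(grad_c, m=400):
    """(2/pi^2) * sqrt( int_{[0,1]^2} |grad c|^2 ) via a midpoint rule on an m x m grid."""
    mid = (np.arange(m) + 0.5) / m
    x1, x2 = np.meshgrid(mid, mid, indexing="ij")
    d1, d2 = grad_c(x1, x2)
    integral = np.mean(d1**2 + d2**2)          # each cell has area 1/m^2
    return 2.0 / np.pi**2 * np.sqrt(integral)

theta = 0.5                                    # hypothetical dependence parameter

def fgm_grad(x1, x2):
    # gradient of the FGM density c(x1, x2) = 1 + theta (1 - 2 x1)(1 - 2 x2)
    return -2 * theta * (1 - 2 * x2), -2 * theta * (1 - 2 * x1)

bound = copula_bound(fgm_grad)
closed_form = 2.0 / np.pi**2 * abs(theta) * np.sqrt(8.0 / 3.0)
print(bound, closed_form)
```

For this family the bound is linear in $|\theta|$ and vanishes at $\theta = 0$, where the FGM copula reduces to the independence copula.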
Example: the effect of the prior on the posterior
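Consider a normal model with mean $\theta \in \mathbb{R}^d$ and positive definite covariance matrix $\Sigma$. The likelihood of $\theta$ given a sample $(x_1, \dots, x_n)$ is
$$(2\pi)^{-nd/2}\det(\Sigma)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \theta)^T\Sigma^{-1}(x_i - \theta)\right).$$
We want to compare the posterior distribution $P_1 = \mathcal{N}(\bar{x}, n^{-1}\Sigma)$ of $\theta$ under a uniform prior, where $\bar{x}$ is the sample mean, with the posterior $P_2$ under a normal prior with parameters $(\mu, \Sigma_2)$; $\Sigma_2$ is assumed positive definite.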
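The operator norm of a matrix $A$ is $|||A||| = \sup_{\|x\| = 1}\|Ax\|$. The normal density $p_1$ is $n/|||\Sigma|||$-log concave. Moreover $P_2 = \mathcal{N}(\tilde{\mu}, \Sigma_n)$ with
$$\tilde{\mu} = \mu + n\Sigma_n\Sigma^{-1}(\bar{x} - \mu), \qquad \Sigma_n = (\Sigma_2^{-1} + n\Sigma^{-1})^{-1}.$$
After some calculation we find
$$d_W(P_1, P_2) \le |||\Sigma|||\;|||(\Sigma + n\Sigma_2)^{-1}|||\;\|\bar{x} - \mu\| + \frac{\sqrt{2}\,\Gamma(d/2 + 1/2)}{\Gamma(d/2)}\,\frac{|||\Sigma|||}{n}\,|||(\Sigma_2 + n\Sigma_2\Sigma^{-1}\Sigma_2)^{-1/2}|||.$$
The closer $\bar{x}$ is to $\mu$, the smaller the bound.
The influence of $\Sigma_2$ vanishes as $n \to \infty$.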
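The bound can be evaluated directly. The sketch below computes $|||\Sigma|||\,|||(\Sigma + n\Sigma_2)^{-1}|||\,\|\bar{x} - \mu\| + \frac{\sqrt{2}\,\Gamma(d/2 + 1/2)}{\Gamma(d/2)}\frac{|||\Sigma|||}{n}|||(\Sigma_2 + n\Sigma_2\Sigma^{-1}\Sigma_2)^{-1/2}|||$, reading the second term with $|||\Sigma|||/n$ as a fraction, for increasing $n$. The matrices, the sample mean `xbar`, the prior mean and the function names are made-up inputs for illustration.

```python
import math
import numpy as np

def op_norm(A):
    return np.linalg.norm(A, 2)                # operator norm = largest singular value

def inv_sqrt_spd(A):
    # inverse square root of a symmetric positive definite matrix via its eigendecomposition
    vals, vecs = np.linalg.eigh(A)
    return (vecs / np.sqrt(vals)) @ vecs.T

def prior_effect_bound(Sigma, Sigma2, xbar, mu, n):
    """Evaluate the Wasserstein bound between the two posteriors (hypothetical helper)."""
    d = len(mu)
    first = op_norm(Sigma) * op_norm(np.linalg.inv(Sigma + n * Sigma2)) * np.linalg.norm(xbar - mu)
    # sqrt(2) Gamma(d/2 + 1/2) / Gamma(d/2) is the mean norm of a standard normal vector in R^d
    mean_norm = math.sqrt(2) * math.gamma(d / 2 + 0.5) / math.gamma(d / 2)
    M = Sigma2 + n * Sigma2 @ np.linalg.inv(Sigma) @ Sigma2
    M = 0.5 * (M + M.T)                        # symmetrise against rounding
    second = mean_norm * op_norm(Sigma) / n * op_norm(inv_sqrt_spd(M))
    return first + second

# made-up inputs for illustration
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma2 = np.array([[0.5, 0.0], [0.0, 0.5]])
xbar = np.array([1.0, -0.5])
mu = np.array([0.0, 0.0])

bounds = [prior_effect_bound(Sigma, Sigma2, xbar, mu, n) for n in (10, 100, 1000)]
print(bounds)
```

In this example the first term decays like $1/n$ and the second like $n^{-3/2}$, so for large samples the bound is driven by the distance between the sample mean and the prior mean.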
Last remarks
Solving and bounding the Stein equation is crucial for applying the method. Our framework gives a large (indeed infinite) choice of Stein equations.
The effect of the prior on the posterior will be studied in more detail.
We are thinking about the multivariate discrete case, too. Note that Barbour et al. 2017 gives an approximation by a discretised multivariate normal, using Markov process arguments.