Differential Stein operators for multivariate continuous distributions and applications

Gesine Reinert

A French/American Collaborative Colloquium on Concentration Inequalities, High Dimensional Statistics and Stein’s Method

July 4th, 2017

Joint work with Guillaume Mijoule and Yvik Swan (Liege)


Stein’s method

Outline

1 Stein’s method

2 The score function and the Stein kernel

3 Higher dimensions

4 Stein operators TpF = div(F p)/p

5 Last remarks


Stein’s method in a nutshell

For µ a target distribution, with support I:

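1 Find a suitable operator $\mathcal{A}$ (called a Stein operator) and a wide class of functions $\mathcal{F}(\mathcal{A})$ (called a Stein class) such that $X \sim \mu$ if and only if, for all functions $f \in \mathcal{F}(\mathcal{A})$,

\[ E \mathcal{A} f(X) = 0. \]

2 Let $\mathcal{H}(I)$ be a measure-determining class on $I$. For each $h \in \mathcal{H}(I)$ find a solution $f = f_h \in \mathcal{F}(\mathcal{A})$ of the Stein equation

\[ h(x) - E h(X) = \mathcal{A} f(x), \]

where $X \sim \mu$. If the solution exists and is unique in $\mathcal{F}(\mathcal{A})$, then we can write

\[ f(x) = \mathcal{A}^{-1}\bigl(h(x) - E h(X)\bigr). \]

We call $\mathcal{A}^{-1}$ the inverse Stein operator (for $\mu$).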


Example: mean zero normal

Stein (1972, 1986); see also Chen, Goldstein and Shao (2011).

$Z \sim \mathcal{N}(0, \sigma^2)$ if and only if, for all smooth functions $f$,

\[ E Z f(Z) = \sigma^2 E f'(Z). \]

Given a test function $h$, let $Z \sim \mathcal{N}(0, \sigma^2)$; the Stein equation is

\[ \sigma^2 f'(w) - w f(w) = h(w) - E h(Z), \]

which has as unique bounded solution

\[ f(y) = \frac{1}{\sigma^2} e^{y^2/(2\sigma^2)} \int_{-\infty}^{y} \bigl(h(x) - E h(Z)\bigr) e^{-x^2/(2\sigma^2)} \, dx. \]
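A minimal numerical sketch of this solution formula, under assumptions of our own: the test function $h = \cos$ (for which $E\cos(Z) = e^{-\sigma^2/2}$), a truncated lower integration limit, and a midpoint rule. The helper names `stein_solution` and `stein_residual` are hypothetical. The residual of the Stein equation should be close to zero.

```python
import math

def stein_solution(h, Eh, sigma, y, n=20000, lower=-12.0):
    """f(y) = sigma^-2 exp(y^2/(2 sigma^2)) int_{-inf}^y (h(x) - Eh) exp(-x^2/(2 sigma^2)) dx."""
    a = lower * sigma              # truncated lower limit; the Gaussian weight is negligible there
    step = (y - a) / n
    total = 0.0
    for i in range(n):             # midpoint rule
        x = a + (i + 0.5) * step
        total += (h(x) - Eh) * math.exp(-x * x / (2 * sigma**2))
    return total * step * math.exp(y * y / (2 * sigma**2)) / sigma**2

sigma = 1.5                        # illustrative value, not from the slides
h = math.cos                       # illustrative test function
Eh = math.exp(-sigma**2 / 2)       # E cos(Z) for Z ~ N(0, sigma^2)

def stein_residual(w, delta=1e-4):
    """sigma^2 f'(w) - w f(w) - (h(w) - Eh), with f' from a central difference."""
    f = lambda y: stein_solution(h, Eh, sigma, y)
    f_prime = (f(w + delta) - f(w - delta)) / (2 * delta)
    return sigma**2 * f_prime - w * f(w) - (h(w) - Eh)
```

Calling `stein_residual(0.7)` evaluates the equation at a single point; the value differs from zero only by quadrature and finite-difference error.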


Example: the sum of independent random variables

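$X_1, \ldots, X_n$ independent, mean zero, $\operatorname{Var}(X_i) = \frac{1}{n}$; $W = \sum_{i=1}^n X_i$. Then

\[
\begin{aligned}
E f'(W) - E W f(W)
&= E f'(W) - \sum_{i=1}^n E X_i f(W) \\
&= E f'(W) - \sum_{i=1}^n E X_i f(W - X_i) - \sum_{i=1}^n E X_i^2 f'(W - X_i) + R \\
&= \frac{1}{n} \sum_{i=1}^n \bigl( E f'(W) - E f'(W - X_i) \bigr) + R;
\end{aligned}
\]

the last step uses that $X_i$ is independent of $W - X_i$, so $E X_i f(W - X_i) = 0$ and $E X_i^2 f'(W - X_i) = \frac{1}{n} E f'(W - X_i)$. Bound this expression by Taylor expansion to give that for any smooth $h$, with $Nh = E h(Z)$ for $Z \sim \mathcal{N}(0,1)$,

\[ |E h(W) - N h| \le \|h'\| \left( \frac{2}{\sqrt{n}} + \sum_{i=1}^n E|X_i^3| \right). \]

Note: nothing goes to infinity.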
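To see the bound at work for finite $n$, here is a small sketch under assumptions of our own: symmetric summands $X_i = \pm 1/\sqrt{n}$ and the test function $h = \cos$, chosen because both $E h(W) = \cos(1/\sqrt{n})^n$ and $Nh = e^{-1/2}$ are available in closed form. The function name `clt_gap_and_bound` is hypothetical.

```python
import math

def clt_gap_and_bound(n):
    """Return (|E h(W) - N h|, right-hand side of the bound) for h = cos."""
    # X_i = +-1/sqrt(n) with probability 1/2 each: mean zero, variance 1/n, E|X_i|^3 = n^(-3/2)
    # E cos(W) = Re prod_j E exp(i X_j) = cos(1/sqrt(n))^n, and E cos(Z) = exp(-1/2)
    gap = abs(math.cos(1 / math.sqrt(n)) ** n - math.exp(-0.5))
    sup_h_prime = 1.0                                   # ||h'|| for h = cos
    bound = sup_h_prime * (2 / math.sqrt(n) + n * n ** (-1.5))
    return gap, bound

for n in (1, 10, 100, 1000):
    gap, bound = clt_gap_and_bound(n)
    print(n, gap, bound)
```

For this particular $h$ the bound is far from tight, which is expected: it holds uniformly over all smooth $h$ with the same $\|h'\|$.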


Comparison of distributions

Let $X$ and $Y$ have distributions $\mu_X$ and $\mu_Y$ with Stein operators $\mathcal{A}_X$ and $\mathcal{A}_Y$, so that $\mathcal{F}(\mathcal{A}_X) \cap \mathcal{F}(\mathcal{A}_Y) \neq \emptyset$, and choose $\mathcal{H}(I)$ such that all solutions $f$ of the Stein equation belong to this intersection. Then

\[ E h(X) - E h(Y) = E \mathcal{A}_Y f(X) = E \mathcal{A}_Y f(X) - E \mathcal{A}_X f(X) \]

and

\[ \sup_{h \in \mathcal{H}(I)} |E h(X) - E h(Y)| \le \sup_{f \in \mathcal{F}(\mathcal{A}_X) \cap \mathcal{F}(\mathcal{A}_Y)} |E \mathcal{A}_X f(X) - E \mathcal{A}_Y f(X)|. \]

If $\mathcal{H}(I)$ is the set of all Lipschitz-1 functions then the resulting distance is $d_W$, the Wasserstein distance. For examples: Holmes (2004), Eichelsbacher and R. (2008), Döbler (2012), Ley, Swan and R. 2015, 2017.


The score function and the Stein kernel



A Stein operator for continuous real-valued variables

Let X be continuous having pdf p with support I = [a, b] ⊂ R.

The Stein class of $X$ is the class $\mathcal{F}(p)$ of functions $f : \mathbb{R} \to \mathbb{R}$ such that
(i) $x \mapsto f(x)p(x)$ is differentiable on $\mathbb{R}$;
(ii) $(fp)'$ is integrable and $\int (fp)' = 0$.

To $p$ associate the Stein operator $\mathcal{T}_p$:

\[ \mathcal{T}_p f = \frac{(fp)'}{p}. \]

(Stein 1986, Stein et al. 2004, Ley and Swan 2013)

By the product rule,

\[ E\bigl[g'(X) f(X)\bigr] = -E\bigl[g(X) \mathcal{T}_p f(X)\bigr] \]

for all $f \in \mathcal{F}(p)$ and for all differentiable functions $g$ such that $\int (gfp)' \, dx = 0$ and $\int |g' f p| \, dx < \infty$; we say that $g \in \operatorname{dom}((\cdot)', p, f)$.
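The integration by parts identity can be checked numerically on a concrete density. The sketch below assumes, purely for illustration, a Beta(2,2) density $p(x) = 6x(1-x)$ on $[0,1]$ with $f = 1$ and $g(x) = x^2$; none of these choices come from the slides. Both sides should equal 1.

```python
def integrate(func, a, b, n=20000):
    """Midpoint rule on [a, b]; never evaluates func at the endpoints."""
    step = (b - a) / n
    return sum(func(a + (i + 0.5) * step) for i in range(n)) * step

p = lambda x: 6 * x * (1 - x)            # Beta(2,2) density (illustrative choice)
f = lambda x: 1.0                        # f p vanishes at 0 and 1, so f lies in F(p)
Tp_f = lambda x: (6 - 12 * x) / p(x)     # T_p f = (f p)'/p = p'/p
g = lambda x: x * x                      # g f p also vanishes at the boundary
g_prime = lambda x: 2 * x

lhs = integrate(lambda x: g_prime(x) * f(x) * p(x), 0.0, 1.0)    # E[g'(X) f(X)]
rhs = -integrate(lambda x: g(x) * Tp_f(x) * p(x), 0.0, 1.0)      # -E[g(X) T_p f(X)]
print(lhs, rhs)
```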



Stein characterisations

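Let $Y$ be continuous with density $q$ and the same support as $X$.

1 Suppose that $q/p$ is differentiable. Take $g \in \bigcap_{f \in \mathcal{F}(p)} \operatorname{dom}((\cdot)', p, f)$ such that $g$ is $p$-a.s. never 0 and $g \, q/p$ is differentiable. Then

\[ Y \overset{D}{=} X \quad \text{if and only if} \quad E\bigl[f(Y) g'(Y)\bigr] = -E\bigl[g(Y) \mathcal{T}_p f(Y)\bigr] \]

for all $f \in \mathcal{F}(p)$.

2 Let $f \in \mathcal{F}(p)$ be $p$-a.s. never zero and assume that $\operatorname{dom}((\cdot)', p, f)$ is dense in $L^1(p)$. Then

\[ Y \overset{D}{=} X \quad \text{if and only if} \quad E\bigl[f(Y) g'(Y)\bigr] = -E\bigl[g(Y) \mathcal{T}_p f(Y)\bigr] \]

for all $g \in \operatorname{dom}((\cdot)', p, f)$.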



The inverse Stein operator

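Let $\mathcal{F}^{(0)}(p)$ be the class of mean zero smooth test functions; the inverse Stein operator $\mathcal{T}_p^{-1} : \mathcal{F}^{(0)}(p) \to \mathcal{F}(p)$ is

\[ \mathcal{T}_p^{-1} h(x) = \frac{1}{p(x)} \int_a^x p(y) h(y) \, dy = -\frac{1}{p(x)} \int_x^b p(y) h(y) \, dy, \]

so that $\mathcal{T}_p(\mathcal{T}_p^{-1} h) = h$; the two expressions agree because $h$ has mean zero under $p$.

The equation

\[ h(x) - E h(X) = f(x) g'(x) + g(x) \mathcal{T}_p f(x), \quad x \in I, \]

is a Stein equation for the target $p$. Solutions of this equation (for $h$ such that a solution exists) are pairs of functions $(f, g)$ such that

\[ fg = \mathcal{T}_p^{-1}(h - E_p h). \]

Although $fg$ is unique, the individual $f$ and $g$ are not.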



f (x)g ′(x) + g(x)Tpf (x): Special Stein operators

Our general Stein operator is an operator on pairs of functions $(f, g)$:

\[ \mathcal{A}(f, g)(x) = \mathcal{T}_p(fg)(x) = f(x) g'(x) + g(x) \mathcal{T}_p f(x). \]

Suppose that $1 \in \mathcal{F}(p)$. Then taking $f(x) = 1$ we get

\[ \mathcal{A}_p g(x) = g'(x) + g(x) \rho(x) \quad \text{with} \quad \rho(x) = \mathcal{T}_p 1(x) = \frac{p'(x)}{p(x)}, \]

the so-called “score function” of $p$; see for example Stein (2004).

If $X$ has finite mean $\nu$, taking $f(x) = \mathcal{T}_p^{-1}(\nu - x)$ we get

\[ \mathcal{A}_X g(x) = \tau(x) g'(x) + (\nu - x) g(x) \quad \text{with} \quad \tau = \mathcal{T}_p^{-1}(\nu - \mathrm{Id}), \]

the “Stein kernel of $p$”; see Stein (1986) and Cacoullos et al. (1992).
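As a sketch of how a Stein kernel can be computed in practice: since $\mathcal{T}_p \tau = (\tau p)'/p = \nu - x$ and $\tau p$ vanishes at the left endpoint $a$, one has $\tau(x) p(x) = \int_a^x (\nu - y) p(y) \, dy$. The code below evaluates this by a midpoint rule for an assumed Beta(2,2) density and compares with $x(1-x)/4$, a closed form obtained by our own hand computation. The name `stein_kernel` is hypothetical.

```python
def stein_kernel(p, nu, a, x, n=20000):
    """tau(x) = (1/p(x)) * int_a^x (nu - y) p(y) dy, by the midpoint rule."""
    step = (x - a) / n
    total = 0.0
    for i in range(n):
        y = a + (i + 0.5) * step
        total += (nu - y) * p(y)
    return total * step / p(x)

p = lambda x: 6 * x * (1 - x)      # Beta(2,2) density on [0, 1] (illustrative), mean nu = 1/2
x = 0.3
tau_numeric = stein_kernel(p, 0.5, 0.0, x)
tau_closed = x * (1 - x) / 4       # hand-computed closed form for this density
print(tau_numeric, tau_closed)
```

The resulting Stein operator for this density would be $\mathcal{A}_X g(x) = \frac{x(1-x)}{4} g'(x) + (\tfrac12 - x) g(x)$.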



Example: Normal

In the example of a $\mathcal{N}(0, \sigma^2)$ random variable,

\[ \mathcal{T}_N f(x) = \frac{(fp)'(x)}{p(x)} = f'(x) - \frac{1}{\sigma^2} x f(x), \]

which contrasts with

\[ \sigma^2 f'(x) - x f(x), \]

the standard Stein operator for this case. The score function is $-x/\sigma^2$. The Stein kernel is

\[ \tau(x) = \sigma^2, \]

giving the standard Stein operator.


Higher dimensions


Notation

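Let $e_1, \ldots, e_d$ be the canonical basis for Cartesian coordinates in $\mathbb{R}^d$.

The gradient of $\phi : \mathbb{R}^d \to \mathbb{R}$ is

\[ \nabla \phi = \left( \frac{\partial \phi}{\partial x_1}, \ldots, \frac{\partial \phi}{\partial x_d} \right)^T = \sum_{i=1}^d (\partial_i \phi) e_i. \]

The gradient of a vector field $v : \mathbb{R}^d \to \mathbb{R}^r : x \mapsto (v_1(x), v_2(x), \ldots, v_r(x))$ (a line vector) is the matrix

\[ \nabla v = \begin{pmatrix} \nabla v_1 & \nabla v_2 & \cdots & \nabla v_r \end{pmatrix} = \left( \frac{\partial v_j}{\partial x_i} \right)_{1 \le i \le d, \, 1 \le j \le r}. \]

If $r = d$ then the divergence of $v$ is

\[ \operatorname{div}(v) = \nabla \cdot v^T = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} = \operatorname{Tr}(\nabla v), \]

with $\operatorname{Tr}$ the trace operator and $x \cdot y = x^T y = \langle x, y \rangle$ the Euclidean scalar product between $x$ and $y$.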


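More generally, the divergence of a $q \times d$ tensor field

\[
\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d : x \mapsto \mathbf{F}(x) =
\begin{pmatrix} F_1(x) \\ \vdots \\ F_q(x) \end{pmatrix}
=
\begin{pmatrix}
F_{11}(x) & \cdots & F_{1d}(x) \\
\vdots & \ddots & \vdots \\
F_{q1}(x) & \cdots & F_{qd}(x)
\end{pmatrix}
\]

is

\[
\operatorname{div}(\mathbf{F}) := \nabla \cdot \mathbf{F} =
\begin{pmatrix} \operatorname{div}(F_1) \\ \vdots \\ \operatorname{div}(F_q) \end{pmatrix}
=
\begin{pmatrix} \nabla \cdot F_1^T \\ \vdots \\ \nabla \cdot F_q^T \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^d \frac{\partial F_{1i}}{\partial x_i} \\ \vdots \\ \sum_{i=1}^d \frac{\partial F_{qi}}{\partial x_i} \end{pmatrix}.
\]

The divergence maps matrix-valued functions $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ onto vector-valued functions $\operatorname{div}(\mathbf{F}) : \mathbb{R}^d \to \mathbb{R}^q$.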
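The row-wise divergence is straightforward to approximate by central differences. The following is a minimal sketch, assuming NumPy and an illustrative $2 \times 2$ field of our own choosing; `div_tensor` is a hypothetical helper name.

```python
import numpy as np

def div_tensor(F, x, eps=1e-5):
    """Row-wise divergence of a q x d matrix field F at the point x (central differences)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    out = np.zeros(np.asarray(F(x)).shape[0])
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        # entry j of the result collects dF_{ji}/dx_i over i
        out += (np.asarray(F(x + e))[:, i] - np.asarray(F(x - e))[:, i]) / (2 * eps)
    return out

# Illustrative field: rows F_1 = (x0^2, x0 x1) and F_2 = (sin x1, x0)
F = lambda x: np.array([[x[0] ** 2, x[0] * x[1]],
                        [np.sin(x[1]), x[0]]])

print(div_tensor(F, [1.2, -0.4]))   # analytically (3 * x0, 0)
```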


Product rule for divergence

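Let $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ be a $q \times d$ tensor field and $\phi : \mathbb{R}^d \to \mathbb{R}$. Then, under appropriate regularity conditions,

\[ \operatorname{div}(\mathbf{F} \phi) = \operatorname{div}(\mathbf{F}) \phi + \mathbf{F} \nabla \phi. \]

Similarly, if $\mathbf{F}$ is a $q \times d$ tensor field and $\mathbf{G}$ is a $d \times d$ tensor field, then $\mathbf{F}\mathbf{G}$ is a $q \times d$ tensor field and

\[ \bigl(\operatorname{div}(\mathbf{F}\mathbf{G})\bigr)_j = F_j \operatorname{div}(\mathbf{G}) + \operatorname{Tr}\bigl(\operatorname{grad}(F_j) \mathbf{G}\bigr) \]

for $j = 1, \ldots, q$.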


What is known: multivariate normal

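$Y \in \mathbb{R}^d$ is multivariate normal $\mathrm{MVN}(0, \Sigma)$ if and only if

\[ E Y^t \nabla f(Y) = E \nabla^t \Sigma \nabla f(Y) \quad \text{for all smooth } f : \mathbb{R}^d \to \mathbb{R}. \]

Assume that $h : \mathbb{R}^d \to \mathbb{R}$ has 3 bounded derivatives. Then, if $\Sigma \in \mathbb{R}^{d \times d}$ is symmetric and positive definite, and $Z \sim \mathrm{MVN}(0, I_d)$ (so that $\Sigma^{1/2} Z \sim \mathrm{MVN}(0, \Sigma)$), there is a solution $f : \mathbb{R}^d \to \mathbb{R}$ to the Stein equation

\[ \nabla^t \Sigma \nabla f(w) - w^t \nabla f(w) = h(w) - E h(\Sigma^{1/2} Z), \]

which holds for every $w \in \mathbb{R}^d$.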


The Mehler formula

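To solve $\nabla^t \Sigma \nabla f(w) - w^t \nabla f(w) = h(w) - E h(\Sigma^{1/2} Z)$, $w \in \mathbb{R}^d$, for $t \in [0, 1]$ put

\[ Z_{w,t} = \sqrt{t} \, w + \sqrt{1 - t} \, \Sigma^{1/2} Z; \]

then

\[ f(w) = -\int_0^1 \frac{1}{2t} \bigl[ E h(Z_{w,t}) - E h(\Sigma^{1/2} Z) \bigr] \, dt \]

is a solution to the Stein equation (the minus sign matches the sign convention of the equation above). This solution $f$ satisfies the bounds

\[ \left| \frac{\partial^k f(w)}{\prod_{j=1}^k \partial w_{i_j}} \right| \le \frac{1}{k} \left| \frac{\partial^k h(w)}{\prod_{j=1}^k \partial w_{i_j}} \right| \]

for every $w \in \mathbb{R}^d$.
(Barbour 1990, Götze 1993, Rinott and Rotar 1996, Goldstein and Rinott 1996, R. + Röllin 2007, Meckes 2009, Chen, Goldstein and Shao 2011)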
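A one-dimensional sanity check of the Mehler representation, under assumptions of our own: $d = 1$, $\Sigma = 1$ and $h = \cos$, for which $E h(Z_{w,t}) = \cos(\sqrt{t} w) e^{-(1-t)/2}$ in closed form. The sketch computes $I(w) = \int_0^1 \frac{1}{2t} [E h(Z_{w,t}) - E h(Z)] \, dt$ and checks numerically that $f = -I$ satisfies $f''(w) - w f'(w) = h(w) - E h(Z)$. Function names are hypothetical.

```python
import math

def mehler_integral(w, n=4000):
    """I(w) = int_0^1 (1/(2t)) [E h(Z_{w,t}) - E h(Z)] dt for d = 1, Sigma = 1, h = cos."""
    Eh = math.exp(-0.5)                       # E cos(Z), Z standard normal
    step = 1.0 / n
    total = 0.0
    for i in range(n):                        # midpoint rule; the integrand is bounded near t = 0
        t = (i + 0.5) * step
        total += (math.cos(math.sqrt(t) * w) * math.exp(-(1 - t) / 2) - Eh) / (2 * t)
    return total * step

def generator(func, w, delta=1e-3):
    """func''(w) - w func'(w) by central differences."""
    second = (func(w + delta) - 2 * func(w) + func(w - delta)) / delta**2
    first = (func(w + delta) - func(w - delta)) / (2 * delta)
    return second - w * first

w = 0.8
f = lambda y: -mehler_integral(y)             # candidate solution f = -I
residual = generator(f, w) - (math.cos(w) - math.exp(-0.5))
print(residual)
```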


What is known: strictly log-concave

(Mackey and Gorham 2016) For continuous $p$ on $\mathbb{R}^d$ such that $\log p \in C^4(\mathbb{R}^d)$ is $k$-strictly concave, the operator

\[ \mathcal{A} f(w) = \frac{1}{2} \langle \nabla f(w), \nabla \log p(w) \rangle + \frac{1}{2} \Delta f(w) \]

is the generator of an overdamped Langevin diffusion. The Stein equation

\[ \mathcal{A} f(w) = h(w) - E_p h \]

is solved by

\[ f(w) = \int_0^\infty \bigl[ E_p h(Z) - E h(Z_{w,t}) \bigr] \, dt \]

with $(Z_{w,t})_{t \ge 0}$ the overdamped Langevin diffusion with generator $\mathcal{A}$ and $Z_{w,0} = w$.

The first three derivatives of $f$ can be bounded in terms of same and lower order derivatives of $h$.
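The diffusion with this generator can be written as $dZ_t = \frac{1}{2} \nabla \log p(Z_t) \, dt + dB_t$. Below is a minimal Euler-Maruyama sketch, assuming NumPy and, for illustration only, a standard normal target in $d = 2$ (so $\nabla \log p(z) = -z$). The step size, path count and the name `langevin_paths` are our own choices, and the discretisation introduces a small bias.

```python
import numpy as np

def langevin_paths(grad_log_p, w, t_end, dt=0.01, n_paths=5000, seed=0):
    """Euler-Maruyama for dZ = 0.5 * grad log p(Z) dt + dB, started at Z_0 = w."""
    rng = np.random.default_rng(seed)
    z = np.tile(np.asarray(w, dtype=float), (n_paths, 1))   # one row per simulated path
    for _ in range(int(round(t_end / dt))):
        z = z + 0.5 * grad_log_p(z) * dt + np.sqrt(dt) * rng.standard_normal(z.shape)
    return z

# Standard normal target (illustrative): grad log p(z) = -z
z_final = langevin_paths(lambda z: -z, w=[3.0, -1.0], t_end=10.0)
print(z_final.mean(axis=0), z_final.var(axis=0))   # close to (0, 0) and (1, 1)
```

Averaging $h$ over such paths at a grid of times gives a Monte Carlo approximation of $E h(Z_{w,t})$ in the solution formula above.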


What is known: Score functions

(Nourdin et al. 2013, 2014) Let $X \in \mathbb{R}^d$ have mean 0 and p.d.f. $p : \mathbb{R}^d \to \mathbb{R}$. The score of $p$ is the random vector $\rho_p(X)$ in $\mathbb{R}^d$ which satisfies

\[ E \rho_p(X) \phi(X) = -E \nabla \phi(X) \]

for all $\phi \in C_c^\infty(\mathbb{R}^d)$.

If $p$ has a score, then it is uniquely defined through $\rho_p(x) = \nabla \log p(x)$.


What is known: Stein kernels

(Nourdin et al. 2013, 2014) A random $d \times d$ matrix $\tau_p(X)$ such that

\[ E \tau_p(X) \nabla \phi(X) = E X \phi(X) \]

for all $\phi \in C_c^\infty(\mathbb{R}^d)$ is called a strong Stein kernel for $p$.

Ledoux et al. 2015: $\tau_p(X)$ is a weak Stein kernel if for all $\phi \in C_c^\infty(\mathbb{R}^d)$

\[ E \operatorname{Tr}\bigl(\tau_p(X) \operatorname{Hess}(\phi(X))^T\bigr) = E \, X \cdot \nabla \phi(X). \]

There is no reason to assume uniqueness for the Stein kernel, or existence. If $\tau_1$ and $\tau_2$ are two Stein kernels for $p$, then for all $\phi \in C_c^\infty(\mathbb{R}^d)$,

\[ E (\tau_1(X) - \tau_2(X)) \nabla \phi(X) = 0; \]

then

\[ \operatorname{div}\bigl(p(x)(\tau_1(x) - \tau_2(x))\bigr) = 0, \]

from which we get uniqueness only in the one-dimensional case.

Stein operators TpF = div(F p)/p



The general multivariate density case

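Let $X \in \mathbb{R}^d$ have pdf $p : \mathbb{R}^d \to \mathbb{R}$ with respect to the Lebesgue measure on $\mathbb{R}^d$. Let $\Omega$ be the support of $p$.

1 Let $q \in \mathbb{N}_0$. The $q$-Stein class for $X$ is the class $\mathcal{F}_q(X)$ of all $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ such that
(i) $p\mathbf{F}$ is differentiable in the sense that its gradient exists;
(ii) $\operatorname{div}(p\mathbf{F})$ is integrable on $\Omega$;
(iii) $\int_\Omega \operatorname{div}(p\mathbf{F}) = \mathbf{0}$.

2 We propose as Stein operator of $p$ the operator

\[ \mathcal{T}_p \mathbf{F} = \frac{\operatorname{div}(\mathbf{F} p)}{p} \]

acting on test functions $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ in $\mathcal{F}_q(X)$.

If $\mathbf{F} \in \mathcal{F}_q(X)$ then $\mathcal{T}_p \mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q$.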



Stein type integration by parts

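To each $\mathbf{F} : \mathbb{R}^d \to \mathbb{R}^q \times \mathbb{R}^d$ in $\mathcal{F}_q(p)$ we associate $\operatorname{dom}(\nabla, p, \mathbf{F})$, the vector space of functions $g : \mathbb{R}^d \to \mathbb{R}$ such that $\mathbf{F} g \in \mathcal{F}_q(p)$ and $\mathbf{F} \nabla g \in L^1(p)$.

Proposition:

\[ E_p[\mathbf{F} \nabla g] = -E_p[(\mathcal{T}_p \mathbf{F}) g] \]

for all $\mathbf{F} \in \mathcal{F}_q(p)$ and all $g \in \operatorname{dom}(\nabla, p, \mathbf{F})$.

Proof: Apply the product rule for divergence,

\[ \operatorname{div}(\mathbf{F} \phi) = \operatorname{div}(\mathbf{F}) \phi + \mathbf{F} \nabla \phi, \]

to $(\mathbf{F} \phi) p$ with $\phi = g$, to show that for $\mathcal{T}_p \mathbf{F} = \operatorname{div}(\mathbf{F} p)/p$,

\[ \mathcal{T}_p(\mathbf{F} g) = (\mathcal{T}_p \mathbf{F}) g + \mathbf{F} \nabla g, \]

and then take expectations, using that $\int_\Omega \operatorname{div}(\mathbf{F} g p) = \mathbf{0}$ and hence the l.h.s. has mean 0.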



Stein operators

As in the 1-dimensional case, our Stein operators depend on two test functions, $\mathbf{F}$ and $g$, and are of the form

\[ \mathcal{T}_p(\mathbf{F} g) = (\mathcal{T}_p \mathbf{F}) g + \mathbf{F} \nabla g, \]

obtained

either by fixing $\mathbf{F}$ and considering $g$ as the (scalar-valued) test function,

or by fixing $g$ and considering $\mathbf{F}$ as the (matrix-valued) test function.



Tp(F g) = (TpF) g + F∇g : F = Id fixed

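Suppose that the identity matrix $I_d \in \mathcal{F}_d(p)$ (e.g. if $p$ is log-concave and vanishes at $\partial \Omega$). Then

\[ \mathcal{T}_p I_d = \nabla \log p = \rho_p, \]

and the Stein operator is $\mathcal{A}_p g : \mathbb{R}^d \to \mathbb{R}^d$,

\[ \mathcal{A}_p g = \mathcal{T}_p(I_d \, g) = \nabla g + \rho_p \, g, \]

acting on $g : \mathbb{R}^d \to \mathbb{R}$ belonging to $\operatorname{dom}(\nabla, p, I_d)$.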



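Tp(F g) = (TpF) g + F∇g : F = τp fixed

Let $X$ have mean $\nu$ and suppose that there exists a $d \times d$ matrix-valued function $\mathbf{F} = \boldsymbol{\tau}_p$ (a Stein kernel) satisfying

\[ \mathcal{T}_p(\boldsymbol{\tau}_p)(x) = -(x - \nu) \]

at all $x$. Then $\mathcal{A}_p g : \mathbb{R}^d \to \mathbb{R}^d$,

\[ \mathcal{A}_p g(x) = \mathcal{T}_p(\boldsymbol{\tau}_p g)(x) = -(x - \nu) g(x) + \boldsymbol{\tau}_p(x) \nabla g(x), \]

acting on differentiable functions $g : \mathbb{R}^d \to \mathbb{R}$ belonging to $\operatorname{dom}(\nabla, p, \boldsymbol{\tau}_p)$.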



Tp(F g) = (TpF) g + F∇g : g = 1 fixed

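For $g : \mathbb{R}^d \to \mathbb{R}$, $g(x) = 1$, we obtain for $\mathbf{F} \in \mathcal{F}_q(p)$,

\[ \mathcal{A}_p \mathbf{F}(x) = \mathcal{T}_p \mathbf{F}(x) \in \mathbb{R}^q, \]

vector-valued. The Stein equation for a zero mean function $h : \mathbb{R}^d \to \mathbb{R}^q$ is then

\[ \mathcal{A}_p \mathbf{F}(x) = \frac{\operatorname{div}(\mathbf{F} p)}{p}(x) = h(x), \]

which gives

\[ \operatorname{div}(\mathbf{F} p)(x) = p(x) h(x). \]

There is not a unique solution. If $q = d$ then we could choose a solution $\mathbf{F}$ such that $F_{i,j} = 0$ for $i \neq j$.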



Special case: q = 1

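Let $v = (v_1, \ldots, v_d) : \mathbb{R}^d \to \mathbb{R}^d$ be a vector field in the 1-Stein class for $p : \mathbb{R}^d \to \mathbb{R}$. Then our Stein operator of $p$ is

\[ \mathcal{T}_p v = \frac{(\nabla \cdot v) p + v \nabla p}{p} = \sum_{i=1}^d \frac{\partial v_i}{\partial x_i} + \sum_{i=1}^d v_i \frac{\partial_i p}{p}. \]

This is a function from $\mathbb{R}^d$ to $\mathbb{R}$.

Take as vector field $v = \nabla f$ for a smooth function $f : \mathbb{R}^d \to \mathbb{R}$. This choice gives

\[ \mathcal{A}_p(f) = \mathcal{T}_p v = \Delta f + \langle \nabla \log p, \nabla f \rangle, \]

interpreted as an operator on $f$ rather than $v$. This is the operator considered by Mackey and Gorham 2016, except for a factor $\frac{1}{2}$.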



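Tp(F g) = (TpF) g + F∇g : g = p^{-1} fixed

For $g : \mathbb{R}^d \to \mathbb{R}$, $g(x) = 1/p(x)$, we obtain for $\mathbf{F} \in \mathcal{F}_q(p)$,

\[ \mathcal{A}_p \mathbf{F} = \frac{\operatorname{div}(\mathbf{F} p)}{p^2} + \mathbf{F} \nabla(1/p) \in \mathbb{R}^q, \]

vector-valued. The Stein equation for a zero mean function $h : \mathbb{R}^d \to \mathbb{R}^q$ is then

\[ \frac{\operatorname{div}(\mathbf{F} p)}{p^2}(x) + \mathbf{F} \nabla(1/p)(x) = h(x), \]

which gives

\[ \operatorname{div}(\mathbf{F})(x) = p(x) h(x). \]

Again there is not a unique solution. If $q = d$ then we could choose a solution $\mathbf{F}$ such that $F_{i,j} = 0$ for $i \neq j$.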



Example: multivariate normal

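Consider $Z \sim \mathrm{MVN}_d(0, \boldsymbol{\Sigma})$. Then

\[ \rho_p(x) = -\boldsymbol{\Sigma}^{-1} x \quad \text{and} \quad \boldsymbol{\tau}_p(x) = \boldsymbol{\Sigma} \]

(linear score and constant Stein kernel). These lead to the Stein operator for $g : \mathbb{R}^d \to \mathbb{R}$

\[ \mathcal{A}_p g(x) = \boldsymbol{\Sigma} \nabla g(x) - g(x) x. \]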
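A Monte Carlo sketch of the resulting Stein identity $E[\boldsymbol{\Sigma} \nabla g(Z) - g(Z) Z] = 0$, assuming NumPy. The covariance matrix, the direction `a` and the test function $g(x) = \sin(a \cdot x)$ are illustrative choices of ours, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                    # illustrative covariance
a = np.array([0.5, -0.3])                         # illustrative direction
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)

g = np.sin(Z @ a)                                 # g(x) = sin(a . x)
grad_g = np.cos(Z @ a)[:, None] * a               # grad g(x) = cos(a . x) a, one row per sample

# Sample mean of Sigma grad g(Z) - g(Z) Z; rows of grad_g @ Sigma are (Sigma grad g)^T
stein_mean = (grad_g @ Sigma - g[:, None] * Z).mean(axis=0)
print(stein_mean)                                 # both components near 0
```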



Example: elliptical distributions

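A $d$-dimensional random vector has multivariate elliptical distribution $E_d(\mu, \boldsymbol{\Sigma}, \phi)$ if its density is given by

\[ p(x) = \kappa |\boldsymbol{\Sigma}|^{-1/2} \phi\left( \frac{1}{2} (x - \mu)^t \boldsymbol{\Sigma}^{-1} (x - \mu) \right) \]

on $\mathbb{R}^d$, for $\phi$ a smooth function and $\kappa$ the normalising constant. For $\mu = 0$, elliptical distributions have the score function

\[ \rho_p(x) = \boldsymbol{\Sigma}^{-1} x \, \frac{\phi'(x^t \boldsymbol{\Sigma}^{-1} x / 2)}{\phi(x^t \boldsymbol{\Sigma}^{-1} x / 2)}, \]

and

\[ \boldsymbol{\tau}_p(x) = \left( \frac{1}{\phi(x^t \boldsymbol{\Sigma}^{-1} x / 2)} \int_{x^t \boldsymbol{\Sigma}^{-1} x / 2}^{+\infty} \phi(u) \, du \right) \boldsymbol{\Sigma} \]

is a strong Stein kernel for $p$ (Landsman, Vanduffel, Yao 2014).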
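The scalar factor in front of $\boldsymbol{\Sigma}$ can be evaluated by one-dimensional quadrature. The sketch below assumes, as an example of our own, the multivariate Student $t$ generator $\phi(u) = (1 + 2u/\nu)^{-(\nu + d)/2}$, for which a hand computation gives the factor $(\nu + 2s)/(\nu + d - 2)$ at $s = x^t \boldsymbol{\Sigma}^{-1} x / 2$. The name `kernel_factor` is hypothetical.

```python
import math

def kernel_factor(phi, s, n=20000):
    """(1/phi(s)) * int_s^inf phi(u) du, via u = s + v/(1-v) and the midpoint rule on (0, 1)."""
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * step
        total += phi(s + v / (1 - v)) / (1 - v) ** 2    # du = dv / (1 - v)^2
    return total * step / phi(s)

nu, d = 5.0, 3                                            # illustrative degrees of freedom and dimension
phi_t = lambda u: (1 + 2 * u / nu) ** (-(nu + d) / 2)     # Student t generator (assumed example)
s = 0.9                                                   # s = x^T Sigma^{-1} x / 2
numeric = kernel_factor(phi_t, s)
closed_form = (nu + 2 * s) / (nu + d - 2)                 # hand-computed for this phi
print(numeric, closed_form)
```

For the Gaussian generator $\phi(u) = e^{-u}$ the same routine returns 1, recovering the constant Stein kernel $\boldsymbol{\Sigma}$.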



Bounds on the solution of the Stein equation

So we have Stein equations, but when are the solutions well behaved?

In the multivariate normal case: Mehler formula.

In the case of strictly log-concave distributions: overdamped Langevin diffusion.

The bounds will be distribution-specific.



Bounds using a Poincaré constant

We say that $C_p$ is a Poincaré constant associated to $\mu_X$ if for every smooth function $\varphi \in L^2(\mu_X)$ such that $E \varphi(X) = 0$,

\[ E \varphi^2(X) \le C_p E |\nabla \varphi(X)|^2. \]

For example, when $X$ has $k$-log-concave density, then the law of $X$ satisfies a Poincaré inequality with $C_p = 1/k$.

Using the Lax-Milgram theorem we can show the following result.

Let $h$ be a smooth, 1-Lipschitz function. Let $X$ be a random vector with density $p$, and assume $C_p < \infty$ is a Poincaré constant for $p(x) \, dx$. Then we prove that there exists a weak solution $u$ to

\[ \Delta u + \nabla \log p \cdot \nabla u = h - p(h), \]

such that

\[ \int |\nabla u|^2 p \le C_p^2. \]



Application: nested densities

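The Wasserstein distance between (the distributions of) $X$ and $Y$ is

\[ d_W(X, Y) = \sup_{h \in \mathrm{Lip}(1)} |E h(X) - E h(Y)|. \]

Compare the Wasserstein distance between $P_1$ and $P_2$ on $\mathbb{R}^d$, with densities $p_1$, assumed $k$-log concave, and $p_2 = \pi_0 p_1$. Put

\[ \mathcal{A}_1 u = \frac{1}{2} \nabla \log p_1 \cdot \nabla u + \frac{1}{2} \Delta u \]

and

\[ \mathcal{A}_2 u = \frac{1}{2} \nabla \log p_2 \cdot \nabla u + \frac{1}{2} \Delta u. \]

Then

\[ \mathcal{A}_2 u = \mathcal{A}_1 u + \frac{1}{2} \nabla \log \pi_0 \cdot \nabla u. \]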



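Let $h : \mathbb{R}^d \to \mathbb{R}$ be a 1-Lipschitz function, and $u_h$ a solution to $\mathcal{A}_1 u_h = h - \int h p_1$. Let $X_1$ ($X_2$) have distribution $P_1$ ($P_2$). Then, as $\mathcal{A}_2 u = \mathcal{A}_1 u + \frac{1}{2} \nabla \log \pi_0 \cdot \nabla u$,

\[
\begin{aligned}
E[h(X_2)] - E[h(X_1)] &= E[\mathcal{A}_1 u_h(X_2)] \\
&= E\left[ \mathcal{A}_2 u_h(X_2) - \frac{1}{2} \nabla \log \pi_0(X_2) \cdot \nabla u_h(X_2) \right] \\
&= -\frac{1}{2} E\left[ \nabla \log \pi_0(X_2) \cdot \nabla u_h(X_2) \right].
\end{aligned}
\]

Using the Poincaré bounds we obtain

\[ d_W(X_1, X_2) \le \frac{1}{k} E\bigl[ |\nabla \pi_0(X_1)| \bigr]. \]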
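A one-dimensional illustration of this bound, under assumptions of our own: $p_1$ the $\mathcal{N}(0,1)$ density (so $k = 1$) and $p_2$ the $\mathcal{N}(m,1)$ density, which gives $\pi_0(x) = e^{mx - m^2/2}$. The Wasserstein distance between these two laws is $|m|$, and the sketch evaluates the right-hand side by quadrature. The name `nested_bound` is hypothetical.

```python
import math

def nested_bound(m, k=1.0, n=40000, half_width=12.0):
    """(1/k) * E|pi0'(X1)| for X1 ~ N(0,1) and pi0(x) = exp(m x - m^2/2), by the midpoint rule."""
    step = 2 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * step
        p1 = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)     # N(0,1) density
        total += abs(m * math.exp(m * x - m * m / 2)) * p1     # |pi0'(x)| p1(x)
    return total * step / k

m = 0.7
print(nested_bound(m), abs(m))    # bound versus the exact distance |m|
```

In this example the bound coincides with the exact distance, so it cannot be improved in general.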



Example: Copulas

Let $(V_1, V_2)$ be a 2-dimensional random vector such that the marginals $V_1$ and $V_2$ have uniform $U[0, 1]$ distribution. The copula of $(V_1, V_2)$ is

\[ C(x_1, x_2) = P[V_1 \le x_1, V_2 \le x_2], \quad (x_1, x_2) \in [0, 1]^2, \]

and we assume that $c = \partial^2_{x_1 x_2} C$ exists.

Let $(U_1, U_2)$ be independent $U[0, 1]$. The copula of $(U_1, U_2)$ is $(x_1, x_2) \mapsto x_1 x_2$.

Payne 1960: an optimal Poincaré constant for $U[0, 1]^2$ is $C_p = 2/\pi^2$.

Now we can show:

\[ d_W[(V_1, V_2), (U_1, U_2)] \le \frac{2}{\pi^2} \sqrt{ \int_{[0,1]^2} |\nabla c(x_1, x_2)|^2 \, dx_1 \, dx_2 }. \]
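To make the bound concrete, the sketch below evaluates its right-hand side for the Farlie-Gumbel-Morgenstern copula density $c(x_1, x_2) = 1 + \theta (1 - 2x_1)(1 - 2x_2)$. This family is an illustrative choice of ours and does not appear in the slides; for it, a hand computation gives $\int |\nabla c|^2 = 8\theta^2/3$. The name `copula_bound` is hypothetical.

```python
import math

def copula_bound(grad_c, n=400):
    """(2/pi^2) * sqrt( int_{[0,1]^2} |grad c|^2 ), midpoint rule on an n x n grid."""
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            gx, gy = grad_c((i + 0.5) * step, (j + 0.5) * step)
            total += gx * gx + gy * gy
    return 2 / math.pi**2 * math.sqrt(total * step * step)

theta = 0.5                                               # illustrative dependence parameter
# Gradient of the FGM density c = 1 + theta (1 - 2 x1)(1 - 2 x2)
grad_fgm = lambda x1, x2: (-2 * theta * (1 - 2 * x2), -2 * theta * (1 - 2 * x1))

numeric = copula_bound(grad_fgm)
closed_form = 2 / math.pi**2 * abs(theta) * math.sqrt(8 / 3)   # hand-computed
print(numeric, closed_form)
```

The bound is linear in $|\theta|$ and vanishes at $\theta = 0$, the independence copula.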



Example: the effect of the prior on the posterior

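Consider a normal model with mean $\theta \in \mathbb{R}^d$ and positive definite covariance matrix $\Sigma$. The likelihood of $\theta$ given a sample $(x_1, \ldots, x_n)$ is

\[ (2\pi)^{-nd/2} \det(\Sigma)^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^n (x_i - \theta)^T \Sigma^{-1} (x_i - \theta) \right). \]

We want to compare the posterior distribution $P_1 = \mathcal{N}(\bar{x}, n^{-1} \Sigma)$ of $\theta$ under a uniform prior, where $\bar{x}$ is the sample mean, with the posterior $P_2$ under a normal prior with parameters $(\mu, \Sigma_2)$; $\Sigma_2$ is assumed positive definite.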



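The operator norm of a matrix $A$ is $|||A||| = \sup_{\|x\| = 1} \|Ax\|$. The normal density $p_1$ is $n/|||\Sigma|||$-log concave. Moreover $P_2 = \mathcal{N}(\tilde{\mu}, \Sigma_n)$ with

\[ \tilde{\mu} = \mu + n \Sigma_n \Sigma^{-1} (\bar{x} - \mu), \qquad \Sigma_n = (\Sigma_2^{-1} + n \Sigma^{-1})^{-1}. \]

After some calculation we find

\[ d_W(P_1, P_2) \le |||\Sigma||| \, |||(\Sigma + n \Sigma_2)^{-1}||| \, \|\bar{x} - \mu\| + \frac{\sqrt{2} \, \Gamma(d/2 + 1/2)}{\Gamma(d/2)} \, \frac{|||\Sigma|||}{n} \, |||(\Sigma_2 + n \Sigma_2 \Sigma^{-1} \Sigma_2)^{-1/2}|||. \]

The closer $\bar{x}$ is to $\mu$, the smaller the bound. The influence of $\Sigma_2$ vanishes as $n \to \infty$.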
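A short sketch of how this bound could be evaluated numerically, assuming NumPy. It takes the second term to contain the factor $|||\Sigma|||/n$, which is our reading of the display, and the helper names (`op_norm`, `inv_sqrt`, `prior_effect_bound`) as well as the input values are hypothetical.

```python
import math
import numpy as np

def op_norm(A):
    return np.linalg.norm(A, 2)              # spectral norm = sup_{||x||=1} ||Ax||

def inv_sqrt(A):
    vals, vecs = np.linalg.eigh(A)           # A is assumed symmetric positive definite
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def prior_effect_bound(Sigma, Sigma2, xbar, mu, n):
    """Right-hand side of the bound on d_W(P1, P2), with the |||Sigma|||/n reading."""
    d = len(mu)
    first = op_norm(Sigma) * op_norm(np.linalg.inv(Sigma + n * Sigma2)) * np.linalg.norm(xbar - mu)
    M = Sigma2 + n * Sigma2 @ np.linalg.inv(Sigma) @ Sigma2
    gamma_ratio = math.sqrt(2) * math.gamma(d / 2 + 0.5) / math.gamma(d / 2)
    second = gamma_ratio * op_norm(Sigma) / n * op_norm(inv_sqrt(M))
    return first + second

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # illustrative inputs
Sigma2 = np.eye(2)
xbar, mu = np.array([1.0, -0.5]), np.zeros(2)
for n in (10, 100, 1000):
    print(n, prior_effect_bound(Sigma, Sigma2, xbar, mu, n))
```

Under this reading the first term decays like $1/n$ and the second like $n^{-3/2}$, in line with the remark that the influence of the prior vanishes as $n \to \infty$.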


Last remarks


Solving and bounding the Stein equation is crucial for applying the method. Our framework gives a large (indeed infinite) choice of Stein equations to choose from.

The effect of the prior on the posterior will be studied in more detail.

We are thinking about the multivariate discrete case, too. Note that Barbour et al. 2017 gives an approximation by a discretised multivariate normal, using Markov process arguments.
