Dependent Dirichlet processes and application to ecological data. Julyan Arbel. Joint work with Kerrie Mengersen & Judith Rousseau. CREST-INSEE, Université Paris-Dauphine. 2 December 2012, ERCIM 2012, 5th International Conference on Computing & Statistics.


Dependent Dirichlet processes
and application to ecological data

Julyan Arbel
Joint work with Kerrie Mengersen & Judith Rousseau

CREST-INSEE, Université Paris-Dauphine

2 December 2012
ERCIM 2012

5th International Conference on
Computing & Statistics


Biology question | Nonparametric model

Outline

1 Biology question
    Introduction
    Data

2 Nonparametric model
    Dirichlet process
    Dependent Dirichlet process

Julyan Arbel DDP and ecological data


Biology introduction

- Series of measurements at different places around Casey Station, a permanent base in Antarctica.
- At each site: pollution level, and abundance of microbes called OTUs (operational taxonomic units).
- Goal: assess the impact of a pollutant on the soil composition / biodiversity.


Data

The data consist of measurements of microbe abundance:

Site    TPH  06251  00576  00429  06360  08793  06259  05164  00772
   1     80      3    724     88      1      0      0      0    467
   2     80      9   2364    252      0      0      2      0    616
   3     80     12    443   1655     11      0      0      0    168
 ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
  13   2600   2262    339    229   1100    537    352      0      0
  20  10000   1883     23     18    879    224    325      9      1
  24  22000   1446      2     27    920   1808   1456      0      0

Sample of the abundance of 8 microbes (columns) at 6 sites (rows).
The main covariate is a pollution level called TPH, denoted x.


Notations

- Microbe species are denoted by $j = 1, 2, \ldots$, ordered by decreasing total abundance.
- At each site $x$, there are $N(x)$ microbes, denoted $Y_i(x)$, $i = 1, \ldots, N(x)$.
- Data are a frequency matrix, where $\#(Y_n(x) = j)$ is the number of microbes of species $j$ observed at site $x$:

Site   TPH      06251                     00576   ...   j               ...
                (j = 1)
1      x = 80   #(Y_n(x = 80) = 1) = 3    ...     ...   ...             ...
...    ...      ...                       ...     ...   ...             ...
k      x        ...                       ...     ...   #(Y_n(x) = j)   ...


Notations

A standard example of diversity is Shannon diversity, taken as the exponential of Shannon entropy:

$$D(x) = \exp\Bigl(-\sum_j p_j(x) \log p_j(x)\Bigr), \qquad \text{with } p_j(x) = \frac{\#(Y_n(x) = j)}{N(x)}.$$
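As a minimal sketch of the definition above, the snippet below computes the Shannon entropy and the Shannon diversity from a vector of species counts at one site. The function names and the example counts are illustrative assumptions, not part of the original analysis.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy -sum_j p_j log p_j, with p_j = count_j / total."""
    total = sum(counts)
    # Species with a zero count contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def shannon_diversity(counts):
    """Shannon diversity D = exp(Shannon entropy)."""
    return math.exp(shannon_entropy(counts))


# Hypothetical counts, chosen only to illustrate the behaviour of D.
even = [10, 10, 10, 10]   # four equally abundant species give D = 4 (up to rounding)
uneven = [37, 1, 1, 1]    # one dominant species gives a smaller D
print(shannon_diversity(even))
print(shannon_diversity(uneven))
```

With equal abundances, D equals the number of species; the more unbalanced the counts, the closer D gets to 1.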

Figure: Left: Shannon entropy in raw data, plotted against TPH. Right: Shannon diversity in raw data, plotted against TPH.


First model

Pavlovian conditioning associated with the word "species" leads to the Dirichlet process and/or related processes.

First, we run an independent model at each site with TPH $x$:

$$Y_i(x) \mid G \sim G, \qquad G(\cdot) = \sum_{j=1}^{\infty} p_j \delta_j(\cdot), \qquad (p_j)_j \sim \mathrm{GEM}(M).$$

The GEM(M) distribution is defined in [Pitman, 2002] (GEM stands for Griffiths, Engen and McCloskey) and represents the distribution of the weights in a Dirichlet process:

$$p_j = V_j \prod_{l<j} (1 - V_l), \qquad V_j \sim \mathrm{Beta}(1, M).$$
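The stick-breaking representation can be sketched as follows, assuming the breaks are drawn independently and that only the first `n_atoms` weights are needed. The function name, the truncation level and the value of M are illustrative choices.

```python
import random


def sample_gem_weights(M, n_atoms, rng):
    """Draw the first n_atoms weights p_j = V_j * prod_{l<j} (1 - V_l)."""
    weights = []
    remaining = 1.0  # length of the stick not yet allocated
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, M)  # V_j ~ Beta(1, M)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    # 'remaining' is the total mass of all weights beyond the truncation.
    return weights, remaining


rng = random.Random(0)
weights, leftover = sample_gem_weights(M=2.0, n_atoms=50, rng=rng)
print(weights[:5])
print(sum(weights) + leftover)
```

A larger M makes each break smaller on average, so the mass is spread over more species.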


Posterior sampling

- We use a blocked Gibbs sampler (truncated version of the infinite sum).
- The prior on $p$ is induced by the Beta prior on $V$, $\pi^{\perp}(V_j) = \mathrm{Be}(1, M)$.
- This prior is conjugate, with a Beta posterior:

$$\pi(V_j \mid Y) = \mathrm{Be}\bigl(V_j \mid 1 + \#(Y_n = j),\; M + \#(Y_n > j)\bigr).$$
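A minimal sketch of this conjugate update at a single site, under the assumption that the truncation sets the last break to 1 so that the weights sum to one (a common convention for truncated stick-breaking, assumed here). The function names are hypothetical, and the counts are the eight values displayed for site 1 in the data table, used only as an illustration.

```python
import random


def beta_posterior_params(counts, M):
    """Return [(a_j, b_j)] with a_j = 1 + #(Y_n = j) and b_j = M + #(Y_n > j)."""
    params = []
    above = sum(counts)
    for c in counts:
        above -= c  # number of observations with a label strictly greater than j
        params.append((1.0 + c, M + above))
    return params


def sample_posterior_weights(counts, M, rng):
    """One posterior draw of the truncated weights (p_1, ..., p_J)."""
    params = beta_posterior_params(counts, M)
    weights, remaining = [], 1.0
    for j, (a, b) in enumerate(params):
        # Assumed truncation: the last break is fixed to 1.
        v = 1.0 if j == len(params) - 1 else rng.betavariate(a, b)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


site_counts = [3, 724, 88, 1, 0, 0, 0, 467]  # displayed counts for site 1
rng = random.Random(0)
print(beta_posterior_params(site_counts, M=1.0)[:2])
print(sample_posterior_weights(site_counts, M=1.0, rng=rng))
```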


Second model

But we want to run a single model across TPH $x$; this calls for a predictor-dependent model.

- Early references to predictor-dependent DP models include Cifarelli and Regazzini [1978] and Muliere and Petrone [1993].
- Increasing interest since MacEachern [1999, 2000, 2001].
- Extensions with varying weights include, among others, order-based DDP [Griffin and Steel, 2006], local DP [Chung and Dunson, 2009], weighted mixtures of DP [Dunson and Park, 2008], and kernel stick-breaking processes [Dunson et al., 2007].


Second model

We are only interested in dependence in the weights. We worked out a dependent process prior with a simple dependence structure on the weights:

$$Y_i(x) \mid G(x) \sim G(x), \qquad G(x)(\cdot) = \sum_{j=1}^{\infty} p_j(x) \delta_j(\cdot), \qquad (p_j(x))_j \sim \mathrm{DGEM}(M),$$

$$p_j(x) = V_j(x) \prod_{l<j} (1 - V_l(x)), \qquad V_j(x) \sim \mathrm{Beta}(1, M),$$

where DGEM(M) stands for Dependent GEM distribution.

For each $j$, we want a process $(V_j(x))_x$ which is marginally Beta(1, M).


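Process on the beta breaks, $V_j(x)$

Construction from Trippa, Müller and Johnson [2011].

$$V(x_1) = \frac{\Gamma(x_1)}{\Gamma(x_1) + \Gamma^M(x_1)}$$

Figure: overlapping regions around the covariate values $x_1$, $x_2$, $x_3$, labelled $\alpha_1$, $\alpha_{12}$, $\alpha_2$, $\alpha_3$, $\alpha_{23}$, $\alpha_{123}$.

$$\Gamma(x_1) = \Gamma_1 + \Gamma_{12} + \Gamma_{123}, \qquad \Gamma^M(x_1) = \Gamma^M_1 + \Gamma^M_{12} + \Gamma^M_{123},$$

$$\Gamma_1 \sim \mathrm{Ga}(\alpha_1), \ldots, \Gamma_{123} \sim \mathrm{Ga}(\alpha_{123}), \qquad \Gamma^M_1 \sim \mathrm{Ga}(\alpha_1 M), \ldots, \Gamma^M_{123} \sim \mathrm{Ga}(\alpha_{123} M).$$

In the end:

$$p_j(x) = V_j(x) \prod_{l<j} (1 - V_l(x)) \sim \mathrm{DGEM}(M).$$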
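The Gamma construction above can be sketched for three covariate values as follows. The region sizes in `ALPHA` are hypothetical, and they are chosen under the assumption that the sizes of the regions containing any given x sum to 1. Under that assumption Γ(x) is Ga(1) and Γ^M(x) is Ga(M), so each V(x) is marginally Beta(1, M), while shared regions induce dependence between sites.

```python
import random

# Hypothetical region sizes, keyed by the set of sites each region covers.
# Assumption: for every site, the sizes of the regions containing it sum to 1.
ALPHA = {
    frozenset({1}): 0.5,
    frozenset({1, 2}): 0.3,
    frozenset({1, 2, 3}): 0.2,
    frozenset({2}): 0.2,
    frozenset({2, 3}): 0.3,
    frozenset({3}): 0.5,
}


def sample_dependent_breaks(alpha, M, rng):
    """Return {x: V(x)} with V(x) = Gamma(x) / (Gamma(x) + Gamma^M(x))."""
    # One pair of independent Gamma variables (unit scale) per region.
    gam = {s: rng.gammavariate(a, 1.0) for s, a in alpha.items()}
    gam_m = {s: rng.gammavariate(a * M, 1.0) for s, a in alpha.items()}
    sites = sorted(set().union(*alpha))
    breaks = {}
    for x in sites:
        # Sum the Gamma variables of all regions that contain x.
        g = sum(v for s, v in gam.items() if x in s)
        g_m = sum(v for s, v in gam_m.items() if x in s)
        breaks[x] = g / (g + g_m)
    return breaks


rng = random.Random(0)
print(sample_dependent_breaks(ALPHA, M=2.0, rng=rng))
```

Sites 1 and 3 share only the region covering all three sites, so in this sketch their breaks are less strongly tied than those of neighbouring sites.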


Interesting features

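- This idea can be extended to higher-dimensional covariate spaces:

Figure: overlapping regions labelled $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_{12}$, $\alpha_{23}$, $\alpha_{123}$ around the points $x_1$, $x_2$, $x_3$.

- Easy to simulate from: only Gamma random variables need to be simulated.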


Posterior sampling

- There is independence across $j$, so it suffices to be able to simulate from each posterior:

$$\pi(V_j \mid Y) \propto \pi(V_j)\, L(Y \mid V_j) \propto \pi(V_j) \prod_x V_j(x)^{\#(Y_n(x) = j)} \bigl(1 - V_j(x)\bigr)^{\#(Y_n(x) > j)}.$$

- This is quite an uncommon situation: we can sample from the prior $\pi(V_j)$, but we cannot evaluate it. It is the reverse of Approximate Bayesian Computation (ABC), where the likelihood is intractable but can be sampled from.
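The likelihood term in the posterior above is cheap to evaluate. A minimal sketch on the log scale, assuming the breaks and the two count tables are stored as dictionaries keyed by site (a hypothetical layout, with made-up counts):

```python
import math


def log_likelihood(v_j, n_equal, n_greater):
    """log L(Y | V_j) = sum_x [#(Y_n(x)=j) log V_j(x) + #(Y_n(x)>j) log(1 - V_j(x))]."""
    total = 0.0
    for x, v in v_j.items():
        # Assumes 0 < v < 1, which holds almost surely for Beta-distributed breaks.
        total += n_equal[x] * math.log(v) + n_greater[x] * math.log1p(-v)
    return total


# Hypothetical breaks and counts at three sites, for illustration only.
v_j = {1: 0.2, 2: 0.4, 3: 0.7}
n_equal = {1: 3, 2: 9, 3: 12}
n_greater = {1: 20, 2: 15, 3: 4}
print(log_likelihood(v_j, n_equal, n_greater))
```

Working on the log scale avoids underflow when the counts are in the hundreds or thousands, as in the data table.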


A first solution is to use a Metropolis-Hastings algorithm:

Metropolis-Hastings Algorithm
1 Given a current value $V_j$, sample a new one $V_j^*$ independently from the prior $\pi(V_j)$.
2 The acceptance probability is

$$\rho = \min\Bigl\{1, \frac{L(Y \mid V_j^*)}{L(Y \mid V_j)}\Bigr\}.$$

But it is not a good idea to propose from the prior: the acceptance rate is low (around 1%).
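The two steps above can be sketched generically, with the prior sampler and the log-likelihood passed in as functions. The toy example is an assumption made for illustration only: a single site with a Beta(1, M) prior on one break, where the exact posterior is known, rather than the dependent prior of the talk.

```python
import math
import random


def independence_mh(sample_prior, log_lik, n_iter, rng):
    """Metropolis-Hastings with proposals drawn independently from the prior."""
    current = sample_prior(rng)
    current_ll = log_lik(current)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = sample_prior(rng)
        proposal_ll = log_lik(proposal)
        # Acceptance probability min(1, L(Y | V*) / L(Y | V)).
        rho = math.exp(min(0.0, proposal_ll - current_ll))
        if rng.random() < rho:
            current, current_ll = proposal, proposal_ll
            accepted += 1
        chain.append(current)
    return chain, accepted / n_iter


# Toy example (hypothetical): one site, 3 observations equal to j and 7 above j.
M = 2.0
chain, rate = independence_mh(
    sample_prior=lambda r: r.betavariate(1.0, M),
    log_lik=lambda v: 3 * math.log(v) + 7 * math.log1p(-v),
    n_iter=20000,
    rng=random.Random(0),
)
print(rate, sum(chain) / len(chain))
```

In this toy case the chain average can be compared with the exact posterior mean 4/13 of a Beta(4, 9). The acceptance rate here is far higher than in the real problem, where the likelihood is much more concentrated than the prior.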


A better solution is to use Importance Sampling:

Importance Sampling
1 Sample iid values $V_j$ from the prior $\pi(V_j)$.
2 Weight the sample by the importance weights defined by the likelihood, $w(V_j) = L(Y \mid V_j)$.

- iid sample instead of a Markov chain
- better precision by a Rao-Blackwellisation argument (weights instead of accept-reject)
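A minimal sketch of this scheme with self-normalised weights, computed on the log scale for numerical stability. As a labelled assumption, the toy example again uses a single site with a Beta(1, M) prior on one break, so that the estimate can be checked against a known posterior; all names are illustrative.

```python
import math
import random


def importance_sample(sample_prior, log_lik, n_draws, rng):
    """Draw iid values from the prior and return (draws, normalised weights)."""
    draws = [sample_prior(rng) for _ in range(n_draws)]
    log_w = [log_lik(d) for d in draws]
    shift = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return draws, [wi / total for wi in w]


def weighted_mean(draws, weights, f=lambda d: d):
    """Self-normalised importance sampling estimate of the posterior mean of f."""
    return sum(wi * f(d) for d, wi in zip(draws, weights))


# Toy example (hypothetical): one site, 3 observations equal to j and 7 above j.
M = 2.0
draws, weights = importance_sample(
    sample_prior=lambda r: r.betavariate(1.0, M),
    log_lik=lambda v: 3 * math.log(v) + 7 * math.log1p(-v),
    n_draws=20000,
    rng=random.Random(0),
)
print(weighted_mean(draws, weights))
print(1.0 / sum(wi * wi for wi in weights))  # effective sample size
```

Posterior summaries of any functional, such as the Shannon diversity, can be obtained by passing it as `f`. The effective sample size gives a quick check on how degenerate the weights are.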


Results

Figure: Left: dependent DP prior: posterior mean of the Shannon diversity by TPH, with 95% centred credible intervals. Right: Shannon diversity in raw data. (Horizontal axis: tph; vertical axes: posterior diversity on the left, diversity in data on the right.)


Conclusion

- Such a model makes it possible to give probabilistic answers to questions about diversity, since we get a posterior sample.
- Using Gaussian processes transformed to Beta processes by the inverse CDF might speed up the posterior computations.
- Possible extension: handling other covariates.
