Species sampling models in Bayesian Nonparametrics

Species sampling models

[email protected] | www.julyanarbel.com

Bocconi University, Milan, Italy & Collegio Carlo Alberto, Turin

Statalks Seminar @ Collegio Carlo Alberto, February 12, 2016


Discovery probabilities





Discovery problem: motivating example

What is the probability of observing a new species?


Good and Turing worked on this problem at Bletchley Park while cracking German ciphers for the Enigma machine during World War II.

They proposed the estimator

\[ \frac{\text{number of species observed once}}{\text{sample size (total number of individuals observed)}} \]



Discovery problem

• Population of individuals $(X_i)_{i \ge 1}$ belonging to an (ideally) infinite number of species $(\theta_i)_{i \ge 1}$, with respective unknown proportions $(p_i)_{i \ge 1}$

• Given $(X_1, \dots, X_n)$, make inference on the probability that the $(n+1)$-th observation coincides with a species whose frequency is $l$, for $l = 0, 1, \dots, n$. This probability is termed the $l$-discovery, that is

\[ D_n(l) = \sum_{i \ge 1} p_i \, \mathbb{1}_{\{l\}}(n_i) \]

where $n_i$ is the frequency of the species of type $\theta_i$ in the sample

• $D_n(0)$ denotes the proportion of yet unobserved species, or the probability of discovering a new species, or the missing mass

• Applications arise in ecology, biology, design of experiments, bioinformatics, genetics, linguistics, economics, network modeling, chemistry, ...
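To make the definition of the $l$-discovery concrete, here is a minimal Python sketch that evaluates $D_n(l)$ when the proportions are known. Everything in it is an illustrative assumption: the population is truncated to 50 hypothetical species with geometric-like proportions (the talk allows infinitely many), and the names `discovery_probabilities`, `proportions` and `sample` are not from the talk.

```python
import random
from collections import Counter


def discovery_probabilities(proportions, sample):
    """Return [D_n(0), ..., D_n(n)] when the species proportions are known.

    proportions: dict mapping species label -> true proportion p_i
    sample: list of observed species labels (X_1, ..., X_n)
    """
    n = len(sample)
    counts = Counter(sample)              # n_i for each observed species
    d = [0.0] * (n + 1)
    for species, p in proportions.items():
        d[counts.get(species, 0)] += p    # add p_i to the cell l = n_i
    return d


# Hypothetical finite population of 50 species (an assumption for illustration).
rng = random.Random(1)
weights = [0.9 ** i for i in range(50)]
total = sum(weights)
proportions = {f"species_{i}": w / total for i, w in enumerate(weights)}

labels = list(proportions)
sample = rng.choices(labels, weights=[proportions[s] for s in labels], k=30)
d = discovery_probabilities(proportions, sample)
print(f"missing mass D_n(0) = {d[0]:.3f}")
```

Since every species falls in exactly one frequency class, the values $D_n(0), \dots, D_n(n)$ sum to one. In practice the proportions are unknown, which is why $D_n(l)$ has to be estimated.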



BNP model

• The BNP approach for estimating $D_n(l)$ is based on the randomization of the unknown species proportions $p_i$; see Lijoi, Mena and Prünster (2007). Let $P = \sum_{i \ge 1} p_i \delta_{\theta_i}$ denote a discrete random probability measure. Let $\mathbf{X}_n = (X_1, \dots, X_n)$ be a sample from a population with composition $P$, namely

\[ X_i \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim Q, \]

with $Q$, the distribution of $P$, playing the role of the nonparametric prior

• Due to the discreteness of $P$, the sample $\mathbf{X}_n$ from $P$ exhibits ties with positive probability. In other words, $\mathbf{X}_n$ features $K_n = k$ distinct observations $X_1^*, \dots, X_{K_n}^*$ with corresponding frequencies $(n_1, \dots, n_k)$

• The information provided by $(n_1, \dots, n_k)$ can be coded by $\mathbf{m}_n = (m_1, \dots, m_n)$, where $m_i$ is the number of species in the sample $\mathbf{X}_n$ having frequency $i$. Under this alternative codification one obtains $\sum_{1 \le i \le n} m_i = k$ and $\sum_{1 \le i \le n} i \, m_i = n$.
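The two codifications of the sample can be sketched in a few lines of Python. The function name `frequency_counts` and the toy sample are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter


def frequency_counts(sample):
    """Return (k, frequencies, m), where m[i - 1] = number of species seen i times."""
    n = len(sample)
    freq = Counter(sample)                 # species -> n_i
    m_counter = Counter(freq.values())     # frequency i -> m_i
    m = [m_counter.get(i, 0) for i in range(1, n + 1)]
    return len(freq), sorted(freq.values(), reverse=True), m


sample = ["a", "b", "a", "c", "a", "d", "b"]   # hypothetical toy sample, n = 7
k, freqs, m = frequency_counts(sample)
print(k, freqs, m)
```

Here the species frequencies are (3, 2, 1, 1), so two species appear once, one appears twice and one appears three times. The two identities $\sum_i m_i = k$ and $\sum_i i \, m_i = n$ can be checked directly on the returned list.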



Good–Turing estimators of discovery

Recall that Good and Turing estimate the probability of observing a new species as

\[ \check{D}_n(0) = \frac{\text{number of species observed once}}{\text{sample size}} = \frac{m_1}{n}. \]

This generalizes to any frequency $l \le n$ (Good, 1953):

\[ \check{D}_n(l) = \frac{(l+1) \, m_{l+1}}{n}. \]

What are the BNP counterparts of these estimators?
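The Good–Turing formula is short enough to sketch directly. The helper name `good_turing` and the toy counts are assumptions made for illustration, and the sample size is recovered from the identity $n = \sum_i i \, m_i$.

```python
def good_turing(m, l):
    """Good-Turing estimate (l + 1) * m_{l+1} / n of the l-discovery.

    m: list with m[i - 1] = number of species observed exactly i times
    """
    n = sum(i * m_i for i, m_i in enumerate(m, start=1))   # sample size
    m_next = m[l] if l < len(m) else 0                     # m_{l+1}, zero if absent
    return (l + 1) * m_next / n


m = [2, 1, 1]   # hypothetical: two singletons, one doubleton, one tripleton (n = 7)
estimates = [good_turing(m, l) for l in range(4)]
print(estimates)
```

In this toy case the estimate for $l = 3$ is zero even though one species was observed three times, because no species was observed four times. The estimator of the $l$-discovery depends on $m_{l+1}$ only, which makes it sensitive to empty frequency classes.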



BNP estimators of discovery

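Gibbs-type random probability measure $P$ with index $\sigma \in (0, 1)$: it is characterized by (it induces) a predictive distribution of the form

\[ \mathbb{P}[X_{n+1} \in A \mid \mathbf{X}_n] = \frac{V_{n+1,k_n+1}}{V_{n,k_n}} \, G_0(A) + \frac{V_{n+1,k_n}}{V_{n,k_n}} \sum_{i=1}^{k_n} (n_i - \sigma) \, \delta_{X_i^*}(A). \]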

BNP estimator $\hat{D}_n(l)$ of $D_n(l)$ derived from the predictive distribution using the sets $A_0 = \mathbb{X} \setminus \{X_1^*, \dots, X_{K_n}^*\}$ and $A_l = \{X_i^* : n_i = l\}$

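BNP estimators:

\[ \hat{D}_n(0) = \mathbb{E}[P(A_0) \mid \mathbf{X}_n] = \frac{V_{n+1,k_n+1}}{V_{n,k_n}}, \qquad \hat{D}_n(l) = \mathbb{E}[P(A_l) \mid \mathbf{X}_n] = (l - \sigma) \, m_l \, \frac{V_{n+1,k_n}}{V_{n,k_n}} \]

Good–Turing estimators:

\[ \check{D}_n(0) = \frac{m_1}{n}, \qquad \check{D}_n(l) = \frac{(l+1) \, m_{l+1}}{n} \]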
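As a sanity check, both families of estimators sum to one over $l = 0, \dots, n$. The sketch below writes $\hat{D}_n$ for the BNP estimator and $\check{D}_n$ for the Good–Turing estimator. It assumes the standard backward recursion for Gibbs-type weights, $V_{n,k} = (n - \sigma k) V_{n+1,k} + V_{n+1,k+1}$, which is not stated on the slide, together with the convention $m_{n+1} = 0$.

```latex
\begin{align*}
\sum_{l=0}^{n} \hat{D}_n(l)
  &= \frac{V_{n+1,k_n+1}}{V_{n,k_n}}
     + \frac{V_{n+1,k_n}}{V_{n,k_n}} \sum_{l=1}^{n} (l-\sigma)\, m_l \\
  &= \frac{V_{n+1,k_n+1} + (n-\sigma k_n)\, V_{n+1,k_n}}{V_{n,k_n}} = 1,
  && \text{since } \sum_{l} l\, m_l = n,\ \sum_{l} m_l = k_n, \\
\sum_{l=0}^{n} \check{D}_n(l)
  &= \frac{1}{n} \sum_{l=0}^{n} (l+1)\, m_{l+1}
   = \frac{1}{n} \sum_{j=1}^{n} j\, m_j = 1 .
\end{align*}
```

The structural difference is visible in the sums: the BNP estimator of the $l$-discovery uses $m_l$ and the sampling information $(n, k_n)$, whereas the Good–Turing estimator uses $m_{l+1}$.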



Credible intervals for discovery

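• Special case of the Pitman–Yor process (Perman, Pitman and Yor, 1992): for $\sigma \in (0, 1)$ and $\theta > -\sigma$,

\[ V_{n,k_n} = \frac{\prod_{i=1}^{k_n-1} (\theta + i\sigma)}{(\theta + 1)_{(n-1)}} \]

where $(a)_{(m)} = a(a+1)\cdots(a+m-1)$ denotes the rising factorial. The posterior distribution then has a closed-form expression as a Beta distribution:

\[ P(A_0) \mid \mathbf{X}_n \overset{d}{=} B_{\theta + \sigma k_n,\; n - \sigma k_n} \]

and

\[ P(A_l) \mid \mathbf{X}_n \overset{d}{=} B_{(l - \sigma) m_l,\; \theta + n - (l - \sigma) m_l} \]

where $B_{a,b}$ denotes a Beta random variable with parameters $(a, b)$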

• Similar results in the general Gibbs class

• Practical tool for deriving credible intervals for the BNP estimator $\hat{D}_n(l)$, for any $l = 0, 1, \dots, n$. This is typically done by numerically evaluating appropriate quantiles of the distribution of $P(A_l) \mid \mathbf{X}_n$
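The Beta posteriors can be checked against the Gibbs-type point estimators. The following short derivation is a sketch, assuming $(a)_{(m)}$ is the rising factorial $a(a+1)\cdots(a+m-1)$ and $B_{a,b}$ is a Beta random variable with mean $a/(a+b)$. It shows that the Pitman–Yor weight ratios simplify and that the posterior means coincide with the BNP estimators $\mathbb{E}[P(A_0) \mid \mathbf{X}_n]$ and $\mathbb{E}[P(A_l) \mid \mathbf{X}_n]$.

```latex
\begin{align*}
\frac{V_{n+1,k_n+1}}{V_{n,k_n}}
  &= \frac{\prod_{i=1}^{k_n}(\theta+i\sigma)}{(\theta+1)_{(n)}}
     \cdot \frac{(\theta+1)_{(n-1)}}{\prod_{i=1}^{k_n-1}(\theta+i\sigma)}
   = \frac{\theta+\sigma k_n}{\theta+n}, \\
\frac{V_{n+1,k_n}}{V_{n,k_n}}
  &= \frac{(\theta+1)_{(n-1)}}{(\theta+1)_{(n)}}
   = \frac{1}{\theta+n}, \\
\mathbb{E}\big[B_{\theta+\sigma k_n,\; n-\sigma k_n}\big]
  &= \frac{\theta+\sigma k_n}{\theta+n}, \qquad
\mathbb{E}\big[B_{(l-\sigma)m_l,\; \theta+n-(l-\sigma)m_l}\big]
   = \frac{(l-\sigma)\,m_l}{\theta+n}.
\end{align*}
```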



Application to EST libraries

Application to genomic datasets called Expressed Sequence Tags (EST) libraries

• Naegleria gruberi aerobic library consists of $n = 959$ ESTs with $k_n = 473$ distinct genes and $m_{l,959} = 346, 57, 19, 12, 9, 5, 4, 2, 4, 5, 4, 1, 1, 1, 1, 1, 1$, for $l \in \{1, 2, \dots, 12\} \cup \{16, 17, 18\} \cup \{27\} \cup \{55\}$

• Naegleria gruberi anaerobic library consists of $n = 969$ ESTs with $k_n = 631$ distinct genes and $m_{l,969} = 491, 72, 30, 9, 13, 5, 3, 1, 2, 0, 1, 0, 1$, for $l \in \{1, 2, \dots, 13\}$

• Prior specification: Pitman–Yor process, with empirical Bayes procedure for estimating $(\sigma, \theta)$

  • $\sigma = 0.669$, $\theta = 46.241$ for the Naegleria gruberi aerobic library
  • $\sigma = 0.656$, $\theta = 155.408$ for the Naegleria gruberi anaerobic library
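As an illustration, the Beta posteriors of the Pitman–Yor case can be evaluated on the aerobic library with the data and the $(\sigma, \theta)$ values reported above. This is a minimal sketch, not the analysis behind the talk. It approximates the quantiles by Monte Carlo with Python's standard `random.betavariate` rather than evaluating them exactly, and the names `beta_params` and `summarize` are hypothetical.

```python
import random

# Aerobic Naegleria gruberi library, as reported in the text.
n, k = 959, 473
freqs = list(range(1, 13)) + [16, 17, 18, 27, 55]
counts = [346, 57, 19, 12, 9, 5, 4, 2, 4, 5, 4, 1, 1, 1, 1, 1, 1]
m = dict(zip(freqs, counts))          # m[l] = number of genes seen l times
sigma, theta = 0.669, 46.241          # empirical Bayes values from the text


def beta_params(l):
    """Parameters (a, b) of the Beta posterior of the l-discovery."""
    a = theta + sigma * k if l == 0 else (l - sigma) * m.get(l, 0)
    if a <= 0:
        raise ValueError("no species with this frequency: posterior is degenerate")
    return a, theta + n - a


def summarize(l, level=0.95, draws=20000, seed=0):
    """Posterior mean and Monte Carlo equal-tailed credible interval."""
    a, b = beta_params(l)
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = sample[int(draws * (1 - level) / 2)]
    hi = sample[int(draws * (1 + level) / 2) - 1]
    return a / (a + b), (lo, hi)


for l in (0, 1, 5):
    mean, (lo, hi) = summarize(l)
    print(f"l = {l}: mean = {mean:.4f}, 95% interval = ({lo:.4f}, {hi:.4f})")
```

The posterior mean for $l = 0$ is $(\theta + \sigma k_n)/(\theta + n) \approx 0.361$, essentially the same as the Good–Turing value $m_1/n = 346/959 \approx 0.361$. An exact quantile function for the Beta distribution could replace the Monte Carlo step if one is available.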


[Figure: posterior distributions (dashed curve for aerobic, solid for anaerobic) of the discovery probabilities $D_n(l)$, for $l \in \{0, 1, 5\}$; one density panel per value of $l$.]



Conclusion

Take-home messages

We have seen that Bayesian nonparametric methods allow for

• smoothed estimation of the discovery probabilities $D_n(l)$, via estimators that are more robust than their frequentist counterparts

• a principled treatment of uncertainty, where credible intervals are obtained naturally from the closed-form expression of the posterior distribution
