Upload
julyan-arbel
View
464
Download
0
Embed Size (px)
Citation preview
Species sampling models
B [email protected] Í www.julyanarbel.com
Bocconi University, Milan, Italy & Collegio Carlo Alberto, Turin
Statalks Seminar @ Collegio Carlo AlbertoFebruary 12, 2016
1/12
2/12
Discovery probabilities
Table of Contents
Discovery probabilities
3/12
Discovery probabilities
Discovery problem: motivating example
What is the probability of observing a new species?
4/12
Discovery probabilities
Discovery problem: motivating example
Good and Turing worked on this problem Bletchley Park to crack Germanciphers for the Enigma machine during World War II
They proposed the estimator
Number of species observed once
Total number of species
5/12
Discovery probabilities
Discovery problem
• Population of individuals (Xi )i≥1 belonging to an ideally an infinitenumber of species (θi )i≥1, respective unknown proportions (pi )i≥1
• Given (X1, . . . ,Xn), make inference on the probability that the (n + 1)-thobservation coincides with a species whose frequency is l , forl = 0, 1, . . . , n. This probability is termed l-discovery, that is
Dn(l) =∑i≥1
pi I{l}(ni )
where ni is the frequency of the species of type θi in the sample
• Dn(0) denotes the proportion of yet unobserved species, or the probabilityof discovering a new species, or the missing mass
• Applications arising from ecology, biology, design of experiments,bioinformatics, genetics, linguistic, economics, network modeling,chemistry, ...
6/12
Discovery probabilities
BNP model
• The BNP approach for estimating Dn(l) is based on the randomization ofthe unknown species proportions pi ’s. See Lijoi, Mena and Prunster (2007)Let P =
∑i≥1 piδθ denote a discrete random probability measure
Let X n = (X1, . . . ,Xn) be a sample from a population with composition P,namely
Xi |Piid∼ P
P ∼ Q
with P playing the role of the nonparametric prior
• Due to the discreteness of P, the sample X n from P exhibits ties withpositive probability. In other terms X n features k distinct observationsX ∗1 , . . . ,X
∗Kn
with corresponding frequencies (n1, . . . , nk)
• The information provided by (n1, . . . , nk) can be coded bymn = (m1, . . . ,mn) where mi = number of species in the sample X n
having frequency iUnder this alternative codification one obtains
∑1≤i≤n mi = k and∑
1≤i≤n imi = n.
7/12
Discovery probabilities
Good Turing estimators of discovery
Remember, Good and Turing estimate the prob. of observing a new species as
Number of species observed once
Total number of species
ieDn(0) =
m1
n
Also generalized to any frequency l ≤ n
Dn(l) =(l + 1)ml+1
n
Good (1953)BNP counterparts of these estimators?
8/12
Discovery probabilities
BNP estimators of discovery
Gibbs-type random probability measure P with index σ ∈ (0, 1): it ischaracterized by (it induces) a predictive distribution of the form
P[Xn+1 ∈ A |X n] =Vn+1,kn+1
Vn,kn
G0(A) +Vn+1,kn
Vn,kn
kn∑i=1
(ni − σ) δX∗i
(A),
BNP estimator Dn(l) of Dn(l) derived from the predictive using setsA0 = X\{X ∗1 , . . . ,X ∗Kn
} and Al = {X ∗i : Ni,n = l}
BNP Good Turing
Dn(0) = E[Ph(A0) |X n] =Vn+1,kn+1
Vn,knDn(l) = m1
n
Dn(l) = E[Ph(Al) |X n] = (l − σ)mlVn+1,knVn,kn
Dn(l) =(l+1)ml+1
n
9/12
Discovery probabilities
Credible intervals for discovery
• Special case of Pitman–Yor process (Perman, Pitman and Yor, 1992).For σ ∈ (0, 1) and θ > −σ and
Vn,kn =
∏kn−1i=1 (θ + iσ)
(θ + 1)(n−1)
Then closed form expression for the posterior distribution as Beta
Pp(A0) |X nd= Bθ+σkn,n−σkn
andPp(Al) |X n
d= B(l−σ)ml ,θ+n−(l−σ)ml
• Similar results in the general Gibbs class
• Practical tool for deriving credible intervals for the BNP estimator Dn(l),for any l = 0, 1, . . . , n. This is typically done by performing a numericalevaluation of appropriate quantiles of the distribution of Pp(Al) |X n
10/12
Discovery probabilities
Application to EST libraries
Application to genomic datasets called Expressed Sequence Tags (EST)libraries
• Naegleria gruberi aerobic library consists of n = 959 ESTs with kn = 473distinct genes and ml,959 = 346, 57, 19, 12, 9, 5, 4, 2, 4, 5, 4, 1, 1, 1, 1, 1, 1,for l∈{1, 2, . . . , 12} ∪ {16, 17, 18} ∪ {27} ∪ {55}
• Naegleria gruberi anaerobic library consists of n = 969 ESTs withkn = 631 distinct genes and ml,969 = 491, 72, 30, 9, 13, 5, 3, 1, 2, 0, 1, 0, 1,for l ∈ {1, 2, . . . , 13}
• Prior specification: Pitman–Yor process, with empirical Bayes procedurefor estimating (σ, θ)
• σ = 0.669, θ = 46.241 for the Naegleria gruberi aerobic library• σ = 0.656, θ = 155.408 for the Naegleria gruberi anaerobic library
11/12
Discovery probabilities
Application to EST libraries
Posterior distributions (dashed curve for aerobic, solid for anaerobic) ofdiscovery probabilities Dn(l), for l ∈ {0, 1, 5}
0.3 0.4 0.5 0.60
10
20
30
0.08 0.12 0.16 0.20
10
20
30
40
0.02 0.03 0.04 0.05 0.06 0.070
10
20
30
40
50
60
70
12/12
Discovery probabilities
Conclusion
Take-home messages
We have seen that Bayesian nonparametric methods allow for
• smoothing estimation of the discovery probabilities Dn(l) via more robustestimators than frequentist counterparts
• a principled treatment of uncertainty where credible intervals can beobtained naturally: closed form expression of the posterior distribution