
Approximate Bayesian Computation methods and their applications for hierarchical statistical models

University College London, 2015

Contents

1. Introduction
2. ABC methods
3. Hierarchical models
4. Application for ovarian cancer detection
5. Conclusion

Introduction

• The likelihood function plays an important role in statistical inference problems.

• For complex models, the computational cost of evaluating the analytical formula is very high.

• Methods that provide statistical inference while bypassing evaluation of the likelihood function have therefore gained popularity.

ABC methods

• ABC methods provide ways of evaluating posterior distributions when the likelihood function is analytically or computationally intractable.

• These methods are based on replacing the calculation of the likelihood with a comparison between observed and simulated data.

Let θ be a parameter vector to be estimated. Given the prior distribution π(θ), the goal is to approximate the posterior distribution π(θ | x) ∝ f(x | θ) π(θ), where f(x | θ) is the likelihood.

Generic form of ABC methods

1. Sample a candidate parameter vector θ* from some proposal distribution π(θ).

2. Simulate a dataset x* from the model described by the conditional probability distribution f(x | θ*).

3. Compare the simulated dataset x* with the experimental data x_0 using a distance function d and tolerance ε ≥ 0; if d(x_0, x*) ≤ ε, accept θ*.

The tolerance ε is the desired level of agreement between x_0 and x*.

Most popular ABC algorithms

• ABC rejection algorithm

• ABC MCMC algorithm (Markov chain Monte Carlo)

• ABC SMC algorithm (sequential Monte Carlo)

ABC rejection method

1. Sample θ* from π(θ).
2. Simulate a dataset x* from f(x | θ*).
3. If d(x_0, x*) ≤ ε, accept θ*, otherwise reject.
4. Return to step 1.

Disadvantage: if the prior distribution is very different from the posterior, the acceptance rate will be low.
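A minimal R sketch of the rejection scheme above, for a toy problem of estimating the mean θ of N(θ, 1) data; the "observed" dataset, the uniform prior, the summary-statistic distance and the tolerance are illustrative assumptions, not part of the slides.

set.seed(1)
x0 <- rnorm(50, mean = 2, sd = 1)                  # "observed" data
N_accept <- 1000                                   # number of accepted particles
eps <- 0.1                                         # tolerance
d <- function(xs, x0) abs(mean(xs) - mean(x0))     # distance on a summary statistic
accepted <- numeric(0)
while (length(accepted) < N_accept) {
  theta_star <- runif(1, -10, 10)                            # 1. sample from the prior
  x_star <- rnorm(length(x0), mean = theta_star, sd = 1)     # 2. simulate a dataset
  if (d(x0, x_star) <= eps) accepted <- c(accepted, theta_star)   # 3. accept/reject
}
mean(accepted)                                     # approximate posterior mean

The broad uniform prior also illustrates the disadvantage noted above: most proposals are rejected, so many simulations are needed per accepted particle.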

Markov chain Monte Carlo (WIKIPEDIA)

In mathematics, more specifically in statistics, Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.

ABC MCMC

1. Metropolis-Hastings Algorithm
2. Random-walk Metropolis-Hastings Algorithm
3. Gibbs Sampling Algorithm
4. Metropolis within Gibbs Algorithm

Metropolis-Hastings Algorithm

Let q(y | x) be an arbitrary, friendly distribution (one we know how to sample from), called the proposal. Choose X_0 arbitrarily. Suppose we have generated X_0, X_1, ..., X_i. To generate X_{i+1}, do the following:

(1) Generate a proposal or candidate value Y ~ q(y | X_i).
(2) Evaluate r ≡ r(X_i, Y), where
    r(x, y) = min{ f(y) q(x | y) / ( f(x) q(y | x) ), 1 }.
(3) Set X_{i+1} = Y with probability r, and X_{i+1} = X_i with probability 1 − r.

Remarks on the Metropolis-Hastings Algorithm

• A simple way to execute step (3) is to generate U ~ Uniform(0, 1). If U < r, set X_{i+1} = Y; otherwise set X_{i+1} = X_i.

• A common choice for q(y | x) is N(x, b²) for some b > 0. In this case the proposal density is symmetric, q(y | x) = q(x | y), and
    r(x, y) = min{ f(y) / f(x), 1 }.

Metropolis-Hastings Algorithm. Example 1

Let's simulate a Markov chain whose stationary distribution is the Cauchy distribution
    f(x) = (1/π) · 1 / (1 + x²).

Take N(x, b²) as the proposal distribution. Then
    r(x, y) = min{ f(y) / f(x), 1 } = min{ (1 + x²) / (1 + y²), 1 }.

Choose b = 1 and chain length N = 10,000.

Example 1. Code in R

N <- 10000
b <- 1
x_values <- rep(0, N)
x_axis <- seq(-7, 7, by = 0.1)
x_old <- 0
for (i in 1:N) {
  y <- rnorm(1, x_old, b)                       # propose from N(x_old, b^2)
  r <- min((1 + x_old^2) / (1 + y^2), 1)        # acceptance probability
  p <- runif(1)
  if (p < r) x_new <- y else x_new <- x_old
  x_values[i] <- x_new
  x_old <- x_new
}
x_cauchy <- dcauchy(x_axis)
plot(x_axis, x_cauchy, type = "p", col = "black")            # target density
points(density(x_values), type = "l", col = "red", lwd = 3)  # chain estimate

Gibbs Sampling

Gibbs Sampling is the easiest MCMC algorithm to use for high-dimensional problems, as it turns a high-dimensional problem into several one-dimensional problems.

Hierarchical models are one example of such high-dimensional problems.

Hierarchical model. Example 1

Posterior distribution on (θ, σ²) associated with the joint model
    X_i ~ N(θ, σ²), i = 1, ..., n,
    θ ~ N(θ_0, τ²), σ² ~ IG(a, b),
with θ_0, τ², a, b specified.

Gibbs Sampling algorithm

Suppose that (X, Y) has density f_{X,Y}(x, y), and that it is possible to simulate from the conditional distributions f_{X|Y}(x | y) and f_{Y|X}(y | x). Let (X_0, Y_0) be starting values, and assume we have drawn (X_0, Y_0), ..., (X_n, Y_n). Then the Gibbs sampling algorithm for obtaining (X_{n+1}, Y_{n+1}) is:

    X_{n+1} ~ f_{X|Y}(x | Y_n)
    Y_{n+1} ~ f_{Y|X}(y | X_{n+1})

and repeat.

Posteriors for Example 1

For the model X_i ~ N(θ, σ²), i = 1, ..., n, with θ ~ N(θ_0, τ²), σ² ~ IG(a, b) and θ_0, τ², a, b specified, the full conditional distributions are

    f(θ | x, σ²) ~ N( (σ² / (σ² + nτ²)) θ_0 + (nτ² / (σ² + nτ²)) x̄ ,  σ²τ² / (σ² + nτ²) )

    f(σ² | x, θ) ~ IG( n/2 + a ,  (1/2) Σ_i (x_i − θ)² + b )

Example 1. Code in R

x <- rnorm(1000, 10, 2)                            # simulated data
n <- length(x)
a <- 3; b <- 3
tau2 <- 10
theta0 <- 5
Nsim <- 5000
xbar <- mean(x)
sh1 <- (n / 2) + a
sigma2 <- theta <- rep(0, Nsim)                    # initialise arrays
sigma2[1] <- 1 / rgamma(1, shape = a, rate = b)    # initialise chains
B <- sigma2[1] / (sigma2[1] + n * tau2)
theta[1] <- rnorm(1, mean = B * theta0 + (1 - B) * xbar, sd = sqrt(tau2 * B))
for (i in 2:Nsim) {
  B <- sigma2[i - 1] / (sigma2[i - 1] + n * tau2)
  theta[i] <- rnorm(1, mean = B * theta0 + (1 - B) * xbar, sd = sqrt(tau2 * B))
  ra1 <- (1 / 2) * sum((x - theta[i])^2) + b
  sigma2[i] <- 1 / rgamma(1, shape = sh1, rate = ra1)
}

mean(theta[3000:5000])
mean(sigma2[3000:5000])

Conjugate priors

In Bayesian probability theory, if the posterior distributions are in the same family as the prior distributions, then the prior and posterior are called conjugate distributions and the prior is called a conjugate prior.

    P(θ | D) = P(θ) P(D | θ) / ∫ P(θ) P(D | θ) dθ

Conjugate priors. Example

Consider the normal distribution x ~ N(µ, σ²). For normally distributed x with fixed variance σ², the conjugate prior for the mean is also normally distributed. For the prior µ ~ N(µ_0, σ_0²), the posterior has the form

    µ | x, σ² ~ N(µ_0′, σ_0′²),

where

    µ_0′ = (σ_0² / (σ² + σ_0²)) x + (σ² / (σ² + σ_0²)) µ_0,

    σ_0′² = σ² σ_0² / (σ² + σ_0²).
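As a quick sanity check of the update above, the closed-form posterior can be compared with a brute-force grid approximation; this is a minimal sketch, and the numeric values (x = 1.5, µ_0 = 0, σ = 1, σ_0 = 2) are illustrative assumptions.

x <- 1.5; mu0 <- 0; sigma <- 1; sigma0 <- 2
mu_post  <- (sigma0^2 * x + sigma^2 * mu0) / (sigma^2 + sigma0^2)   # closed-form posterior mean
var_post <- (sigma^2 * sigma0^2) / (sigma^2 + sigma0^2)             # closed-form posterior variance
mu_grid <- seq(-10, 10, by = 0.001)
post <- dnorm(x, mu_grid, sigma) * dnorm(mu_grid, mu0, sigma0)      # likelihood x prior
post <- post / sum(post * 0.001)                                    # normalise on the grid
c(mu_post,  sum(mu_grid * post * 0.001))                            # means agree
c(var_post, sum((mu_grid - mu_post)^2 * post * 0.001))              # variances agree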

Conjugate priors. Example

Consider the normal distribution x ~ N(µ, σ²). For normally distributed x with fixed mean µ, the conjugate prior for the variance is the inverse-gamma distribution. For the prior σ² ~ IG(α, β):

    P(x | µ, σ²) = (1 / (σ√(2π))) exp(−(x − µ)² / (2σ²)) ∝ (σ²)^{−1/2} exp(−(x − µ)² / (2σ²)),

    P(σ²) = IG(α, β) = (β^α / Γ(α)) (σ²)^{−α−1} exp(−β / σ²),

    P(σ² | x, µ) ∝ (σ²)^{−(α + 1/2) − 1} exp( −(β + (x − µ)²/2) / σ² ),

so the posterior is again inverse-gamma with parameters α′ = α + 1/2 and β′ = β + (x − µ)²/2.
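The inverse-gamma update can be checked numerically in the same way; a minimal sketch, with illustrative values (x = 1.2, µ = 0, α = 3, β = 2) that are assumptions rather than anything from the slides.

x <- 1.2; mu <- 0; alpha <- 3; beta <- 2
alpha_post <- alpha + 1/2
beta_post  <- beta + (x - mu)^2 / 2
s2 <- seq(0.001, 50, by = 0.001)                                   # grid over sigma^2
dens <- dnorm(x, mu, sqrt(s2)) *
        (beta^alpha / gamma(alpha)) * s2^(-alpha - 1) * exp(-beta / s2)   # likelihood x IG prior
dens <- dens / sum(dens * 0.001)                                   # normalise on the grid
c(beta_post / (alpha_post - 1),                                    # closed-form IG(alpha', beta') mean
  sum(s2 * dens * 0.001))                                          # grid posterior mean: should agree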

ABC SMC

A number of sampled parameter values (particles) θ^{(1)}, ..., θ^{(N)}, sampled from the prior distribution π(θ), are propagated through a sequence of intermediate distributions π(θ | d(x_0, x*) ≤ ε_i), i = 1, ..., T − 1, until they represent a sample from the target distribution π(θ | d(x_0, x*) ≤ ε_T). The decreasing tolerances ε_1 > ... > ε_T ≥ 0 enforce a gradual evolution towards the target posterior. For sufficiently large numbers of particles, this approach avoids the problem of getting stuck in areas of low probability (as can happen with ABC MCMC).

ABC SMC Algorithm

S1. Initialize ε_1, ..., ε_T. Set the population indicator t = 0.

S2.0 Set the particle indicator i = 1.

S2.1 If t = 0, sample θ** independently from π(θ). Else, sample θ* from the previous population {θ_{t−1}^{(i)}} with weights w_{t−1}, and perturb the particle to obtain θ** ~ K_t(θ | θ*), where K_t is a perturbation kernel. If π(θ**) = 0, return to S2.1. Simulate a candidate dataset x* ~ f(x | θ**). If d(x*, x_0) ≥ ε_t, return to S2.1.

ABC SMC Algorithm

S2.2 Set θ_t^{(i)} = θ** and calculate the weight for particle θ_t^{(i)}:

    w_t^{(i)} = 1, if t = 0;
    w_t^{(i)} = π(θ_t^{(i)}) / Σ_{j=1}^{N} w_{t−1}^{(j)} K_t(θ_{t−1}^{(j)}, θ_t^{(i)}), if t > 0.

If i < N, set i = i + 1 and go to S2.1.

S3 Normalize the weights. If t < T, set t = t + 1 and go to S2.0.
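A minimal R sketch of the scheme above for a toy problem (estimating the mean θ of N(θ, 1) data); the "observed" data, the uniform prior, the Gaussian perturbation kernel K_t and the tolerance schedule are illustrative assumptions.

set.seed(1)
x0 <- rnorm(50, mean = 2, sd = 1)                       # "observed" data
N <- 500                                                # number of particles
eps <- c(2, 1, 0.5, 0.25)                               # decreasing tolerances eps_1 > ... > eps_T
d <- function(xs, x0) abs(mean(xs) - mean(x0))          # distance on a summary statistic
prior_dens <- function(th) dunif(th, -10, 10)
kern_sd <- 0.5                                          # K_t(theta | theta*) = N(theta*, kern_sd^2)

theta <- numeric(N); w <- rep(1 / N, N)
for (t in seq_along(eps)) {
  theta_new <- numeric(N); w_new <- numeric(N)
  for (i in 1:N) {
    repeat {
      if (t == 1) {
        th2 <- runif(1, -10, 10)                        # S2.1: sample from the prior
      } else {
        th1 <- sample(theta, 1, prob = w)               # sample from the previous population
        th2 <- rnorm(1, th1, kern_sd)                   # perturb with K_t
        if (prior_dens(th2) == 0) next                  # reject particles outside the prior support
      }
      xs <- rnorm(length(x0), mean = th2, sd = 1)       # simulate a candidate dataset
      if (d(xs, x0) <= eps[t]) break                    # accept if within tolerance
    }
    theta_new[i] <- th2                                 # S2.2: store particle and weight
    w_new[i] <- if (t == 1) 1 else
      prior_dens(th2) / sum(w * dnorm(th2, theta, kern_sd))
  }
  theta <- theta_new
  w <- w_new / sum(w_new)                               # S3: normalize the weights
}
sum(w * theta)                                          # weighted posterior mean estimate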

Ovarian Cancer case study. CA125

Risk calculation

Change-point hierarchical model for CA125

Controls:
    Y_ij | t_ij ~ N(θ_i, σ²)

Cases:
    Y_ij | t_ij, I_i = 0 ~ N(θ_i, σ²)
    Y_ij | t_ij, I_i = 1 ~ N(θ_i + γ_i (t_ij − τ_i)_+ , σ²)
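A minimal R sketch simulating one control and one case trajectory from the change-point model above; all parameter values (visit times, θ_i, γ_i, τ_i, σ) are illustrative assumptions, not estimates from the case study.

set.seed(2)
t_ij    <- seq(0, 5, by = 0.5)        # visit times for one subject
sigma   <- 0.3
theta_i <- 3.0                        # baseline CA125 level
gamma_i <- 0.8                        # post-change slope (cases with I_i = 1)
tau_i   <- 2.5                        # change point
y_control <- rnorm(length(t_ij), mean = theta_i, sd = sigma)
y_case    <- rnorm(length(t_ij),
                   mean = theta_i + gamma_i * pmax(t_ij - tau_i, 0),   # (t_ij - tau_i)_+
                   sd = sigma)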

Conditional distributions

Conclusion

1. ABC methods have a great impact on parameter estimation.

2. A lot of applied problems can be reduced to a hierarchical model.

3. The Gibbs Sampling Algorithm is the most useful in dealing with hierarchical models.

Literature

1. Steven J. Skates, Donna K. Pauler, Ian J. Jacobs. Screening Based on the Risk of Cancer Calculation from Bayesian Hierarchical Changepoint and Mixture Models of Longitudinal Markers. Journal of the American Statistical Association, vol. 96 (2001).
2. Wasserman L. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.
3. Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, Michael P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6, 187-202 (2009).
4. Christian P. Robert, George Casella. Introducing Monte Carlo Methods with R. Springer, 2009.

Data processing with the "caret" package in R

• Data preprocessing

• Data splitting

• Data processing

• Model comparison

Data preprocessing

preProcess

• Standardizing

• Transformation

• Imputing

Data preprocessing. Example

library(caret)
data(BloodBrain)                     # contains the predictor data frame bbbDescr
bbbDescr <- bbbDescr[, -3]
preProc <- preProcess(bbbDescr, method = c("center", "scale"))
data <- predict(preProc, bbbDescr)
mean(bbbDescr[, 1]); mean(data[, 1]); var(data[, 1])
mean(bbbDescr[, 2]); mean(data[, 2]); var(data[, 2])
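The preProcess list above also mentions imputing; a minimal sketch of that option, with artificially injected missing values (the NA positions and the medianImpute choice are illustrative assumptions).

library(caret)
data(BloodBrain)
bb <- bbbDescr[, -3]
bb[1:5, 1] <- NA                               # artificially create missing values
pp <- preProcess(bb, method = "medianImpute")  # fill NAs with the column median
bb_imputed <- predict(pp, bb)
sum(is.na(bb_imputed[, 1]))                    # 0: missing values have been imputed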

Data splitting

• createDataPartition # training/test partition

• createResample # bootstrap samples

• createFolds # split the data into k groups

• createTimeSlices # is used for time series data

Data splitting. Example

library(caret)
data(BloodBrain)                     # contains the predictor data frame bbbDescr
bbbDescr <- bbbDescr[, -3]
train_part <- createDataPartition(y = bbbDescr[, 1], p = 0.75, list = FALSE)
training <- bbbDescr[train_part, ]
testing  <- bbbDescr[-train_part, ]
dim(bbbDescr)
dim(training)
dim(testing)
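The example above only uses createDataPartition; a short sketch of two of the other helpers listed on the previous slide (the fold and resample counts are illustrative assumptions).

library(caret)
data(BloodBrain)
folds   <- createFolds(bbbDescr[, 1], k = 10)          # split row indices into 10 groups
resamps <- createResample(bbbDescr[, 1], times = 5)    # 5 bootstrap samples of row indices
sapply(folds, length)
sapply(resamps, length)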

Data processing. Resampling

train (the resampling scheme is specified via trControl = trainControl(method = ...))
• method =
  – boot         # bootstrapping
  – boot632      # bootstrapping with adjustment
  – cv           # cross-validation
  – repeatedcv   # repeated cross-validation
  – LOOCV        # leave-one-out cross-validation

Data processing. Example

library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTrain, ]
testing  <- Sonar[-inTrain, ]

plsFit <- train(Class ~ ., data = training, method = "knn",
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
plsClasses
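A minimal sketch of how the resampling options from the previous slide are attached to a fit: in caret they are passed to train through trainControl. It reuses the Sonar training set from the example above; the 10-fold choice is an illustrative assumption.

library(caret)
ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation
knnFit <- train(Class ~ ., data = training, method = "knn",
                preProc = c("center", "scale"), trControl = ctrl)
knnFit                                               # resampled Accuracy and Kappa per tuning value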

Model comparison. Metric options

confusionMatrix

Continuous outcomes:
• RMSE       # root mean squared error
• Rsquared   # R^2 from regression models

Categorical outcomes:
• Accuracy   # fraction of correct classes
• Kappa      # measure of concordance

Model comparison. Example

names(getModelInfo())

names(getModelInfo())        # list all available model methods

# k-nearest neighbours
plsFit <- train(Class ~ ., data = training, method = "knn",
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)

# partial least squares
plsFit <- train(Class ~ ., data = training, method = "pls",
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)

# conditional inference random forest
plsFit <- train(Class ~ ., data = training, method = "cforest",
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)
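The comparisons above are for a categorical outcome; a minimal sketch of the continuous-outcome metrics (RMSE, Rsquared) listed earlier, using the BloodBrain response logBBB. The linear-model choice and the near-zero-variance filtering step are illustrative assumptions.

library(caret)
set.seed(1)
data(BloodBrain)                                  # provides bbbDescr and the response logBBB
nzv <- nearZeroVar(bbbDescr)                      # near-zero-variance predictors
bb  <- if (length(nzv) > 0) bbbDescr[, -nzv] else bbbDescr
inTrain <- createDataPartition(y = logBBB, p = .75, list = FALSE)
lmFit <- train(x = bb[inTrain, ], y = logBBB[inTrain],
               method = "lm", preProc = c("center", "scale"))
lmPred <- predict(lmFit, newdata = bb[-inTrain, ])
postResample(lmPred, logBBB[-inTrain])            # reports RMSE and Rsquared (and MAE)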

Literature

1. Max Kuhn. A Short Introduction to the caret Package (2014).

2. Model training and tuning: http://topepo.github.io/caret/training.html

Questions