
Comparison of Asymptotic, Bootstrap and Posterior Predictive P-values in Assessing Latent Class Model Fit

Geert van Kollenburg 647091∗

Abstract

Goodness-of-fit testing in latent class analysis can result in unreliable asymptotic p-values when the reference distributions are unknown or when the contingency tables become sparse. For instance, it has been shown that the asymptotic p-value belonging to the likelihood ratio statistic becomes untrustworthy in sparse data. A number of solutions to this problem have arisen in the form of resampling techniques. The parametric bootstrap uses the maximum likelihood estimates as population parameters to sample new datasets, to see whether the observed statistics are likely to occur under the proposed model. The posterior predictive check is the Bayesian alternative for a p-value and is similar to the bootstrap, but controls for uncertainty about the parameter values by drawing samples from the posterior predictive distribution. The purpose of this thesis is to compare the asymptotic, bootstrap and posterior predictive p-values in assessing the model fit of latent class models when the sample size is large and when it is small.

Key words: Latent Class Analysis, Goodness-of-Fit, Bayes' Theorem, Parametric Bootstrap, Posterior Predictive Check.

∗Department of Methodology and Statistics, Tilburg University, the Netherlands.


1 Introduction

To test the fit of a latent class (LC) model to a dataset, there exist overall goodness-of-fit tests, which measure the discrepancy between observed frequencies and those expected under the proposed model for all cells in a corresponding contingency table (e.g. the likelihood ratio L²; Vermunt, 2010). Also bivariate, or higher-order, measures can be estimated, which assess the remaining association between two or more items in a dataset after a LC model has been fitted. For instance, the bivariate residual (BVR) (Vermunt & Magidson, 2005) is an approximation of the score test for the association parameter between two items. Its value gives an indication of the estimated increase in model fit if the association parameter were included in the model.

Significance testing may become troublesome if the distribution of a statistic is unknown. For example, the score test asymptotically follows a chi-squared distribution when the model is true (Bera & Bilias, 2001), but as the BVR is an approximation, its distribution is at best only approximated by the chi-squared distribution. This problem broadens when even more complex measures (e.g. the sum of all BVRs) are used. The quality of the approximation of the reference distribution then depends on the quality of the approximations to the score tests. Here, the BVR is an example where some approximation is possible, but the asymptotic distribution of other statistics may not be approximated well by other distributions, or it may be very hard to derive the asymptotic distribution of a statistic analytically.

Even when a statistic follows a known distribution asymptotically (i.e., when the sample size goes to infinity), its use in significance tests can become inappropriate when sample sizes are not large and contingency tables become sparse. As the number of items increases or the sample size is small to moderate, the contingency tables may quickly become sparse (i.e. many cells will have 0 or 1 entries). For instance, with 10 dichotomous items there are already 2^10 = 1024 cells in the table; in such cases the asymptotic distributions no longer hold and the associated p-values become untrustworthy (Maydeu-Olivares & Joe, 2006; Reiser & Lin, 1999; Vermunt, 2010). In case of unknown, untrustworthy or incorrect distributions it is necessary to calculate empirical reference distributions. According to Formann (2003) this holds for overall goodness-of-fit tests, residuals and other statistics.

In order to determine empirical reference distributions, resampling techniques, like the parametric bootstrap by Collins et al. (1993; in Formann, 2003), have been proposed to solve the problem of untrustworthy asymptotic p-values and unknown distributions. If one assumes that the data contain information about the true values of the parameters of interest, it is possible to create a reference distribution to determine how likely an observation is given the estimated parameters. The parametric bootstrap, for instance, is implemented in the software package LatentGold (Vermunt & Magidson, 2005) and uses Monte Carlo simulations to approximate the empirical distribution of the goodness-of-fit statistics based on the maximum likelihood (ML) estimates obtained from the data.

Instead of relying on the ML estimates, several authors have proposed using Bayesian methods to assess model fit in LC analysis (Berkhof, Van Mechelen & Gelman, 2003; Garrett & Zeger, 2000; Hoijtink, 1998). The Bayesian method for obtaining a p-value is the Posterior Predictive Check (PPC), which can be used in complex models where analytic solutions are tedious to obtain. Instead of relying on ML estimates, this method uses random draws for the unknown parameters from the posterior predictive distribution to determine how likely an observed statistic is (Gelman, Carlin, Stern & Rubin, 2004).

The purpose of this thesis is to investigate the PPC as an alternative to asymptotic and bootstrap p-values in assessing model fit of LC models. A comparison will also be made between all methods, to check whether they produce comparable results in large samples and whether the resampling techniques are more adequate than the asymptotic p-value in small samples. To investigate this, I use a number of commonly used fit statistics, for which the long-run behavior of the resulting p-values from the different methods will be compared in a Monte Carlo simulation study. This leads to a direct comparison of the asymptotic, bootstrap and PPC p-values under different conditions such as sample size. Importantly, it is assessed whether the different p-values are uniformly distributed under the null hypothesis, and whether nominal Type-I error levels are correct for the given statistics. I do not intend to discuss the use of cut-off scores in significance testing, but rather apply the commonly used levels as a reference for the behavior of the statistics under the different methods.

The outline of this thesis is as follows. Section 2 describes the LC model, estimation of a LC model, and the fit statistics used in the study. Section 3 provides an overview of the methods used for obtaining p-values. Section 4 describes the simulation studies and gives the results. In Section 5 an empirical dataset is analyzed to illustrate the techniques that result in p-values. Finally, in Section 6 I discuss the findings and issues in need of further research.

2 Latent Class Analysis

2.1 Defining the LC model

In the multivariate setting, let an N × J matrix Y contain the responses of N units (i.e. individuals) on J discrete variables with R_j, j = 1, \ldots, J categories. Let Y_i = (Y_{i1}, \ldots, Y_{iJ})' be row i, i = 1, \ldots, N of Y, containing the responses to the J variables. In total there are

S = \prod_{j=1}^{J} R_j

possible response patterns for Y_i. Therefore, let Y_s, s = 1, \ldots, S denote a specific pattern, and n_s the observed count of that pattern. Finally, let y (without subscripts) denote an observed dataset.

The LC model assumes that the N = \sum_s n_s units can be partitioned into C latent classes, each with its own probability density for the responses. A unit's unobservable class membership is represented by the latent variable θ, and a particular class is denoted by c, with c = 1, \ldots, C. The idea is then to find a LC model with the lowest number of classes for which the responses conditional on class membership are independent. This assumption is called local independence and lies at the basis of LC analysis.

In a LC model P(Y_s), the probability of observing pattern Y_s, is assumed to be a weighted average of the class-specific probabilities, with weights π_c being the probability that an individual belongs to LC c (Vermunt, 2010). So for each of the S patterns, the probability density is given by

P(Y_s) = \sum_{c=1}^{C} \pi_c P(Y_s \mid \theta = c). \qquad (1)

Assuming local independence,

P(Y_s \mid \theta = c) = \prod_{j=1}^{J} P(Y_{sj} \mid \theta = c). \qquad (2)

Using the notation of Vermunt (2010) to indicate the conditional item response probability of a person in class c giving response r to item j as π_{jrc}, the conditional probability P(Y_{sj} | θ = c) is then a multinomial probability density given by

P(Y_{sj} \mid \theta = c) = \prod_{r=1}^{R_j} \pi_{jrc}^{y^*_{sjr}}, \qquad (3)

where y^*_{sjr} is 1 if Y_{sj} = r and 0 otherwise.


Lastly, the probability that a person belongs to LC c, conditional on having response pattern Y_s, called the posterior membership probability (Vermunt, 2010), is obtained using Bayes' rule:

\pi_{c|s} = \frac{P(Y_s \mid \theta = c)\,\pi_c}{P(Y_s)}. \qquad (4)
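Equations 1-4 can be made concrete in a few lines of Python. This is an illustrative sketch, not code from the thesis; the nested-list layout `pi_jrc[c][j][r]` and the 0-based response coding are conventions of my own.

```python
import math

def pattern_prob(pattern, pi_c, pi_jrc):
    """P(Y_s), Eqs. (1)-(3): a mixture over classes of locally
    independent item-response probabilities."""
    return sum(pi_c[c] * math.prod(pi_jrc[c][j][r] for j, r in enumerate(pattern))
               for c in range(len(pi_c)))

def posterior_membership(pattern, pi_c, pi_jrc):
    """pi_{c|s}, Eq. (4): posterior class membership via Bayes' rule."""
    joint = [pi_c[c] * math.prod(pi_jrc[c][j][r] for j, r in enumerate(pattern))
             for c in range(len(pi_c))]
    total = sum(joint)
    return [w / total for w in joint]
```

With two equally sized classes and six binary items, the pattern probabilities sum to 1 over all 2^6 patterns, and a unit answering 0 on every item is assigned almost entirely to the class with high probability of response 0.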

2.2 Estimating the LC Model

To obtain ML estimates for the LC model, typically the Expectation-Maximization (EM) algorithm (Goodman, 1974) is used. The EM algorithm finds the ML estimates by maximizing the log-likelihood function

\log L = \sum_{s=1}^{S} n_s \log P(Y_s). \qquad (5)

Because only non-zero response frequencies contribute to the likelihood, the convention 0 log(0) = 0 is used throughout this thesis. The details of the EM algorithm and this convention are discussed in Appendix A.

Using the EM algorithm to obtain the ML estimates requires that starting values are provided for the parameters in ψ = (π'_{jrc}, π'_c)', denoted π^{(0)}_{jrc} and π^{(0)}_c. Caution is advised, because when the starting values are too similar, the model can become unidentifiable. To solve this, it should be possible to order the LCs by π^{(0)}_c or, for instance, π^{(0)}_{1rc} (Hoijtink, 1998). For further discussion of the identifiability of LC models, including item/class ratios, see Goodman (1974). The EM algorithm goes as follows:


Step 0: Choose initial values ψ^{(0)} and set t = 1.

Step 1: Expectation step. Given ψ^{(t-1)}, calculate π_{c|s} (see Equation 4). Then multiply this by n_s to obtain n^{(t)}_{sc}, the estimated number of respondents in each class having pattern s.

Step 2: Maximization step. Calculate

\pi^{(t)}_c = m_c / N = \sum_{s=1}^{S} n^{(t)}_{sc} / N

and

\pi^{(t)}_{jrc} = \sum_{s=1}^{S} \left( n^{(t)}_{sc}\, y^*_{sjr} \right) / m_c,

where y^*_{sjr} is 1 if Y_{sj} = r and 0 otherwise.

Step 3: Set t = t + 1 and repeat Steps 1 and 2 until the increase in the log-likelihood between two iterations is smaller than a given convergence criterion (e.g., 10^{-8}).
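The steps above can be sketched as follows for binary items (R_j = 2). This is a minimal illustration, not the thesis's implementation; `ns` maps each observed response pattern (a tuple of 0/1) to its count, and the nested-list layout `pi_jrc[c][j][r]` is my own convention.

```python
import math

def em_lc(ns, C, pi_c, pi_jrc, tol=1e-8, max_iter=1000):
    """EM for a C-class LC model on binary items: E step computes expected
    class counts n_sc, M step re-estimates class sizes and conditional
    response probabilities, until the log-likelihood stabilizes."""
    patterns = list(ns)
    J = len(patterns[0])
    N = sum(ns.values())
    loglik_old = -math.inf
    loglik = loglik_old
    for _ in range(max_iter):
        # E step: expected class counts for each observed pattern
        nsc, loglik = {}, 0.0
        for s in patterns:
            joint = [pi_c[c] * math.prod(pi_jrc[c][j][r] for j, r in enumerate(s))
                     for c in range(C)]
            ps = sum(joint)
            loglik += ns[s] * math.log(ps)
            nsc[s] = [ns[s] * w / ps for w in joint]
        # M step: class sizes and conditional response probabilities
        mc = [sum(nsc[s][c] for s in patterns) for c in range(C)]
        pi_c = [m / N for m in mc]
        pi_jrc = [[[sum(nsc[s][c] for s in patterns if s[j] == r) / mc[c]
                    for r in (0, 1)] for j in range(J)] for c in range(C)]
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return pi_c, pi_jrc, loglik
```

For C = 1 the M step simply returns the marginal response proportions, which is a convenient sanity check.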

Estimation of the model can also be done in a Bayesian context using a Gibbs sampler (e.g., Hoijtink, 1998). The Gibbs sampler is similar to the EM procedure, but relies on sampling distributions at each step (Ligtvoet & Vermunt, 2011) and results in an estimated (posterior) distribution of the parameters rather than stationary estimates of ψ. The Gibbs sampler proceeds as follows:


Step 0: Choose initial values ψ^{(0)} and set d = 1.

Step 1: Data augmentation. Given ψ^{(d-1)}, calculate π_{c|s} (see Equation 4). Then every subject with a particular pattern is assigned to a LC by drawing from a multinomial distribution with probabilities π_{c|s}. This results in both the class sizes m^{(d)}_c and the counts n^{(d)}_{jrc}, the number of respondents from class c with response r to item j.

Step 2: Draw a sample from the posteriors

\pi^{(d)}_c \sim \text{Dir}(m^{(d)}_1 + \alpha_c, \ldots, m^{(d)}_C + \alpha_c)

and

(\pi^{(d)}_{j1c}, \ldots, \pi^{(d)}_{jR_jc}) \sim \text{Dir}(n^{(d)}_{j1c} + \alpha_{jrc}, \ldots, n^{(d)}_{jR_jc} + \alpha_{jrc}),

where α_c = 1/C and α_{jrc} = 1/R_j (see Appendix B).

Step 3: Set d = d + 1 and repeat Steps 1 and 2 until convergence (Section 3.3 describes a method for assessing the convergence of the sampler; for more Bayesian convergence criteria see e.g. Brooks and Gelman, 1998). After convergence, repeat Steps 1 and 2 L times and keep the sampled values to estimate the posterior distribution of the parameters.
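A minimal sketch of this Gibbs sampler, assuming the same number of categories R for every item and using only the standard library (a Dirichlet draw is built from gamma variates). All names are illustrative, not from the thesis.

```python
import math
import random

def gibbs_lc(data, C, R, pi_c, pi_jrc, n_draws=100, burn_in=100, seed=1):
    """Data augmentation (Step 1) plus Dirichlet posterior draws (Step 2)
    with priors alpha_c = 1/C and alpha_jrc = 1/R, keeping the draws
    collected after the burn-in period."""
    rng = random.Random(seed)
    J = len(data[0])

    def dirichlet(alphas):
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        t = sum(g)
        return [x / t for x in g]

    draws = []
    for d in range(burn_in + n_draws):
        # Step 1: assign each unit to a class given the current parameters
        m = [0] * C
        n = [[[0] * R for _ in range(J)] for _ in range(C)]
        for y in data:
            w = [pi_c[c] * math.prod(pi_jrc[c][j][r] for j, r in enumerate(y))
                 for c in range(C)]
            u, acc, c = rng.random() * sum(w), 0.0, 0
            for c, wc in enumerate(w):
                acc += wc
                if u <= acc:
                    break
            m[c] += 1
            for j, r in enumerate(y):
                n[c][j][r] += 1
        # Step 2: draw new parameters from their Dirichlet posteriors
        pi_c = dirichlet([m[c] + 1.0 / C for c in range(C)])
        pi_jrc = [[dirichlet([n[c][j][r] + 1.0 / R for r in range(R)])
                   for j in range(J)] for c in range(C)]
        if d >= burn_in:
            draws.append((pi_c, pi_jrc))
    return draws
```

Each stored draw is one sample ψ^{(d)} from the (approximate) posterior; summarizing the list of draws gives the posterior distribution of the parameters.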

In the simulation study (see Section 4), I use the population parameter values as starting points and a burn-in of 100 iterations before sampling begins. This way the method is likely to start close to the true parameter values and the posterior is properly estimated. When the population values were not useful (e.g., when the analysis only had 1 LC), I used the ML estimates obtained from the EM algorithm as starting values.

2.3 Model-fit test statistics

Three test statistics are used to assess model fit. These fit statistics are indicators of the local dependencies given class membership. Let e_s = P(Y_s) · N denote the expected frequency of pattern s under the fitted LC model, given the (estimated) values of ψ (from which P(Y_s) is calculated). The likelihood ratio statistic L² and the overall Pearson chi-squared test statistic X² are then:

L^2 = 2 \sum_{s=1}^{S} n_s \ln \frac{n_s}{e_s},

X^2 = \sum_{s=1}^{S} \frac{(n_s - e_s)^2}{e_s}.

Thirdly, the bivariate residual (BVR) is used, which measures remaining local dependencies between two items. The BVRs are X² values computed for pairs of variables (Vermunt & Magidson, 2005). So for items j and j',

BVR_{jj'} = \sum_{r=1}^{R_j} \sum_{r'=1}^{R_{j'}} \frac{(n_{rr'} - e_{rr'})^2}{e_{rr'}}.
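All three statistics can be computed directly from observed and expected frequencies. The following sketch uses helper names of my own; the BVR here is obtained by collapsing the pattern frequencies to the two-way table of items j and j'.

```python
import math

def l2_stat(ns, es):
    """Likelihood-ratio statistic L^2; the 0 log 0 = 0 convention means
    empty cells are simply skipped."""
    return 2.0 * sum(n * math.log(n / e) for n, e in zip(ns, es) if n > 0)

def x2_stat(ns, es):
    """Overall Pearson chi-squared statistic X^2."""
    return sum((n - e) ** 2 / e for n, e in zip(ns, es))

def bvr(ns, es, patterns, j, jp, Rj=2, Rjp=2):
    """BVR_{jj'}: Pearson X^2 on the two-way table of items j and j',
    built by collapsing observed and expected pattern frequencies."""
    obs = [[0.0] * Rjp for _ in range(Rj)]
    exp_ = [[0.0] * Rjp for _ in range(Rj)]
    for n, e, s in zip(ns, es, patterns):
        obs[s[j]][s[jp]] += n
        exp_[s[j]][s[jp]] += e
    return sum((obs[r][rp] - exp_[r][rp]) ** 2 / exp_[r][rp]
               for r in range(Rj) for rp in range(Rjp))
```

When the observed and expected frequencies coincide, all three statistics are exactly zero, which matches their less-is-better interpretation below.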

To investigate the BVR statistics based on a number of random samples, I assume that all BVRs behave the same and will therefore only analyze the BVR of items 1 and 2.

The L², X² and BVR are all of the less-is-better form and can be seen as indicators of badness-of-fit. In the next section I describe how these statistics can be used to perform significance tests for goodness-of-fit. The significance tests are based on p-values, which indicate how likely the value of an observed statistic is, given certain assumptions about the population parameters and/or the data. The methods differ from each other in the assumptions about the population parameters and in the estimation process. First I describe how to obtain a p-value using an asymptotic reference distribution, then by means of the parametric bootstrap, and finally by means of two PPCs.

3 Estimating p-values

3.1 Asymptotic reference distribution

In the frequentist framework, the p-value is the theoretical probability of finding a test statistic that is more extreme than the one actually observed, under the null hypothesis H_0 (Hogg & Tanis, 2010). In testing a LC model with C classes, we base the p-value on the assumption that this model is true. The p-value associated with an observed test statistic T_obs is the probability that a value of T is at least as extreme as T_obs, given that the C-class model is true.


In testing model fit I am only interested in the probability of worse fit. This is indicated by larger values of T, so the asymptotic p-value can be defined as

p_a = \Pr(T \geq T_{obs} \mid H_0), \qquad (6)

where conditioning on H_0 means that the posited model is assumed to be true, or that ψ = ψ_0, the values postulated under H_0 (Gelman et al., 2004; Meng, 1994). To obtain this p-value one calculates the area beyond the value of T_obs in a reference distribution with a specified number of degrees of freedom (df). In an unrestricted LC model, the L² and X² statistics under H_0 are assumed to asymptotically follow a chi-squared distribution (χ²_df) with df given by

df = \prod_{j=1}^{J} R_j - C \left[ 1 + \sum_{j=1}^{J} (R_j - 1) \right]. \qquad (7)
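As a sketch, the df of Equation 7 and the asymptotic upper-tail p-value can be computed without external libraries; `chi2_sf` implements the chi-squared survival function for integer df via the regularized upper incomplete gamma function. These helper names are my own, not from the thesis.

```python
import math

def lc_df(R, C):
    """Degrees of freedom, Eq. (7): prod(R_j) - C * (1 + sum(R_j - 1))."""
    return math.prod(R) - C * (1 + sum(r - 1 for r in R))

def chi2_sf(x, df):
    """Upper-tail probability Pr(X >= x) of the chi-squared distribution
    with integer df, i.e. the regularized upper gamma Q(df/2, x/2)."""
    a, x = df / 2.0, x / 2.0
    if df % 2 == 0:
        # Q(k, x) = e^{-x} * sum_{i<k} x^i / i!  for integer k = df/2
        q, term = 0.0, math.exp(-x)
        for i in range(int(a)):
            q += term
            term *= x / (i + 1)
        return q
    # odd df: start from Q(1/2, x) = erfc(sqrt(x)) and use the recurrence
    # Q(a+1, x) = Q(a, x) + x^a e^{-x} / Gamma(a+1)
    q, b = math.erfc(math.sqrt(x)), 0.5
    while b < a - 1e-9:
        q += x ** b * math.exp(-x) / math.gamma(b + 1)
        b += 1.0
    return q
```

For the simulation setting below (J = 6 binary items), `lc_df([2] * 6, 2)` gives 50 and `lc_df([2] * 6, 1)` gives 57, matching the df used for the χ² reference distributions in Section 4.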

As noted before, the BVR does not have a direct reference distribution, since it is an approximation of the score test, which does follow a chi-squared distribution. In the coming simulation only binary variables are used, so the BVR approximates the score test for a 2 × 2 contingency table. Because the score test is known to asymptotically follow a chi-squared distribution with (R_j − 1) × (R_{j'} − 1) = 1 df in this case, I will assume that the BVR can be approximated by the same asymptotic distribution and check the validity of this assumption.

Issues concerning p_a-values

Besides misconceptions and malpractices concerning p-values (see Sterne & Smith, 2001 for a clear evaluation), statistical problems also arise with the use of (asymptotic) reference distributions. One problem with the asymptotic p-value is that if it is unknown what distribution a statistic follows, the use of an incorrect reference distribution can result in inaccurate p-values. Another problem is that, by definition, an asymptotic p-value is not exact because sample sizes are always finite. And although results might be trustworthy in very large samples, even moderate sample sizes can lead to inaccurate results.

When the number of items in the data becomes large, the observed pattern frequencies in the contingency tables quickly become very sparse, and one needs very large sample sizes to control for this. In sparse tables the distributions of statistics like the L² cannot be approximated well. And even though p_a can still be calculated, its values can no longer be trusted (Magidson & Vermunt, 2004; Maydeu-Olivares & Joe, 2006; Reiser & Lin, 1999; Vermunt, 2010). Other methods have to be used in order to get more reliable and accurate p-values in situations where these issues occur.

Because of these and other problems associated with p_a-values, other methods have been proposed to obtain p-values which do not rely on asymptotic theory, but are based on resampling techniques. These techniques generate a large number of random replicate samples from a set of (estimated) population parameter values. For each of these datasets y_rep it is possible to calculate the statistics of interest and determine the probability that a statistic T_rep is larger than the one observed. This is done by estimating the proportion of T_rep that were more extreme than T_obs, given the estimates of the parameters. For the LC model, I will compare resampling techniques from the frequentist framework (the bootstrap) and from the Bayesian framework (the PPC).

3.2 Parametric Bootstrap Method

The parametric bootstrap can be used to estimate the distribution of statistics for which the distribution is unknown, either due to the limited sample size or to inapproximability. If we use the ML estimates from the observed data as population values, it is possible to estimate the probability that T_rep ≥ T_obs, given that the estimates are true (Langeheine, Pannekoek & Van de Pol, 1996). The bootstrap p-value is then given by:

p_b = \Pr[(T_{rep} \geq T_{obs}) \mid \hat{\psi}, H_0]. \qquad (8)

The bootstrap method proceeds as follows:

Step 1. Assume that the model (H_0) is true.

Step 2. Treat the ML estimates from the observed data under H_0 as population parameters.

Step 3. Draw B random replicate samples y_{rep,b}, b = 1, \ldots, B, of size N based on these population parameter estimates.

Step 4. Estimate the LC model for each dataset using the EM algorithm and calculate T^b_{rep} from the ML estimates \hat{ψ}^b.

The proportion

B^{-1} \sum_{b=1}^{B} I(T^b_{rep} \geq T_{obs})

(where the indicator function I equals 1 if T^b_{rep} ≥ T_obs and 0 otherwise) is taken as the estimate of p_b. In words, p_b is (estimated by) the proportion of samples in which the value T^b_{rep} is greater than or equal to T_obs.
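The four steps can be sketched generically. This is an illustrative skeleton, not the thesis's code: `sample_data` (drawing one replicate dataset of size N from the ML estimates) and `fit_and_stat` (refitting the H_0 model and returning the statistic) are placeholders for the LC-specific routines.

```python
import random

def bootstrap_pvalue(t_obs, sample_data, fit_and_stat, B=100, seed=1):
    """Parametric-bootstrap p-value, Eq. (8): the proportion of B
    replicate statistics at least as large as the observed one."""
    rng = random.Random(seed)
    t_rep = [fit_and_stat(sample_data(rng)) for _ in range(B)]
    return sum(t >= t_obs for t in t_rep) / B
```

By construction the estimate lies in [0, 1]; a T_obs below every replicate statistic yields p_b = 1 and one above every replicate yields p_b = 0.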

3.3 Posterior Predictive Check

The PPC is the Bayesian counterpart of the classical statistical tests (Meng, 1994). Given that H_0 is true and that the observed data came from the population of interest, the posterior predictive (PP) p-value is given by:

p_p = \Pr[(T^l_{rep} \geq T_{obs}) \mid y, H_0]. \qquad (9)

In the Bayesian framework one is not particularly interested in the probability that the observed data have come from a population with the parameters posited in the null hypothesis (as in the frequentist framework), but rather in the probability that the parameters have certain values, given that the observed data indeed came from that population (Gelman et al., 2004).

As a result of this philosophy, the major difference with the bootstrap is that the PPC is based on the posterior distribution P(ψ|y) of the unknown parameters (rather than on a point estimate like \hat{ψ}) and on the predictive distribution for the replicated data P(y_rep|ψ). In its general form, the probability in Equation 9 is taken over the joint distribution P(ψ, y_rep | y), so that

p_p = \int\!\!\int I(T^l_{rep} \geq T_{obs})\, P(y_{rep} \mid \psi)\, P(\psi \mid y)\, dy_{rep}\, d\psi, \qquad (10)

where I equals 1 if T^l_{rep} ≥ T_obs and 0 otherwise (Gelman et al., 2004). Appendix B shows how the posterior and PP distributions are obtained.

In practice, the PP distribution P(y_rep|ψ) is usually estimated through simulations, and the p_p-value is then estimated based on these draws. In principle the PPC proceeds as follows:

Step 1. Assume that the model is true.

Step 2. Draw L samples from the PP distribution to obtain ψ^l and y_{rep,l}, l = 1, \ldots, L.

Step 3. Estimate the LC model under H_0 on each dataset y_{rep,l} and calculate the statistic T^l_{rep}.

So T^l_{rep} is obtained by estimating the model under H_0 using the EM algorithm. For each replication the ML estimates \hat{ψ}^l are used to calculate T^l_{rep}, and the proportion

L^{-1} \sum_{l=1}^{L} I(T^l_{rep} \geq T_{obs})

(where the indicator function I equals 1 if T^l_{rep} ≥ T_obs and 0 otherwise) is taken as the estimate of p_p.


In more complex models (like the LC model), however, it may not be possible to sample from the PP distribution in Step 2 analytically. The solution involves splitting up Step 2 and using an iterative sampling procedure:

Step 2a. Draw a sample from the posterior distribution ψ^l ∼ P(ψ|y).

Step 2b. Generate a replicate dataset y_{rep,l} ∼ P(y_rep|ψ^l).

Step 2c. Repeat Steps 2a and 2b to obtain L replicated datasets.

But, as shown in Appendix B, the posterior distribution for the LC model again does not have a convenient form to sample from directly. Fortunately the Gibbs sampler, as discussed in Section 2.2, can be used to obtain the required posterior draws ψ^l (Rubin & Stern, 1994). At convergence, the draws in a Gibbs sampler iteration are actually samples from the posterior P(ψ|y), and as a result the L iterations yield an approximation of the posterior distribution. Performing Step 2b then results in draws from the predictive distribution. The joint draws from the posterior distribution and the predictive distribution can together be seen as a single draw from the PP distribution.
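Steps 2a-2c and the refitting step can be sketched in the same generic style, with the Gibbs draw, the data simulation and the refit supplied as callables. All three callables are placeholders for the LC-specific routines, not code from the thesis.

```python
import random

def ppc_pvalue(t_obs, posterior_draw, simulate, fit_and_stat, L=100, seed=1):
    """PPC p-value: posterior_draw(rng) returns one draw psi^l from
    P(psi|y) (e.g. a Gibbs iteration), simulate(psi, rng) generates a
    replicate dataset y_rep,l from P(y_rep|psi), and fit_and_stat refits
    the H0 model and returns T^l_rep."""
    rng = random.Random(seed)
    count = 0
    for _ in range(L):
        psi = posterior_draw(rng)       # Step 2a
        y_rep = simulate(psi, rng)      # Step 2b
        count += fit_and_stat(y_rep) >= t_obs
    return count / L
```

The only structural difference from the bootstrap skeleton is Step 2a: each replicate is generated from a fresh posterior draw instead of from the fixed ML estimates, which is exactly how the PPC propagates parameter uncertainty.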

Figure 1 in Appendix C is a graphical representation of the PPC. The upper plot is a trace plot and depicts the values of the T_rep = X²_rep statistic during the L = 500 replications for the empirical example described in Section 5, where N = 94 and C = 2. If the plot shows any long-term trends, this is an indication that successive draws are highly correlated and that the method has not converged. The values should move freely around the value space, without getting stuck in a local region (King et al., 2011). The bottom plot shows a smoothed density of the replicated values. The horizontal and vertical dashed lines indicate the observed value X²_obs = 67.993, and the proportion of values beyond that line (.554) is the estimate of p_p.

PPC using discrepancy variables

The formulation of the PP p-value has been extended by Gelman et al. (2004) by using, instead of a statistic T, a discrepancy variable D(ψ), which depends on the data as well as the parameters. For each draw from the posterior, D_obs(ψ^l) is calculated as the discrepancy between the observed data and ψ^l, and D_rep(ψ^l) is calculated as the discrepancy between the replicated data and ψ^l.

The p-value for the discrepancy measure is given by:

p_d = \Pr[D_{rep}(\psi) \geq D_{obs}(\psi) \mid y, H_0].

Goodness-of-fit measures like the L² can be used as discrepancy variables because the predicted pattern frequencies are functions of the parameters in ψ. For instance, the expected frequencies for the L² are calculated as e^l_s = P(Y_s | ψ^l) N. The discrepancy p-value is estimated by taking the L sampled draws, computing the predicted pattern frequencies e^l_s directly from ψ^l, and computing D_obs(ψ^l) and D_rep(ψ^l) based on these predicted frequencies. In this manner one obtains L 'observed' discrepancies D_obs(ψ^l) and L replicated discrepancies D_rep(ψ^l). The p_d is estimated by

L^{-1} \sum_{l=1}^{L} I\left( D_{rep}(\psi^l) \geq D_{obs}(\psi^l) \right).
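A sketch of this check with L² as the discrepancy variable, comparing observed and replicated counts against the frequencies e_s^l predicted from each posterior draw (helper names are my own):

```python
import math

def l2_discrepancy(counts, expected):
    """L^2 used as a discrepancy: counts (observed or replicated)
    versus the frequencies predicted from one posterior draw."""
    return 2.0 * sum(n * math.log(n / e)
                     for n, e in zip(counts, expected) if n > 0)

def pd_estimate(obs_counts, rep_counts_list, expected_list):
    """p_d: proportion of draws l with D_rep(psi^l) >= D_obs(psi^l)."""
    hits = 0
    for rep_counts, e in zip(rep_counts_list, expected_list):
        d_obs = l2_discrepancy(obs_counts, e)
        d_rep = l2_discrepancy(rep_counts, e)
        hits += d_rep >= d_obs
    return hits / len(expected_list)
```

Note that, unlike p_b and p_p, both sides of the comparison change with each draw, which is where the non-uniform behavior reported by Hjort et al. (2006) enters.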

The PPC using discrepancy variables was used in LC analysis by Berkhof, Van Mechelen and Gelman (2003) and Meulders et al. (2002), who indicate that this procedure tends to be conservative. Conservativeness, however, is not the only issue with the p_d-value. Hjort, Dahl and Steinbakk (2006) showed that the distribution of p_d under H_0 is far from uniform and indicated that its values need to be adjusted in order to make results interpretable. Hjort et al. investigated the behavior of p_d in a number of models, but not the LC model. In order to test the appropriateness of the method it is important to investigate the behavior of p_d in the current setting as well, and the method is therefore included in this study.

4 Simulation study

To compare the methods described above, the behavior of the p-values under different situations needs to be assessed. In situations where H_0 is true, the p-values for the fit statistics described in Section 2.3 should be uniformly distributed (Sackrowitz & Samuel-Cahn, 1999). Deviations from uniformity could indicate that the reference distribution or method used is incorrect. The uniformity of the p-values will therefore be used to assess the applicability of the methods in different situations.

To investigate the behavior of the proposed p-values I generated data for J = 6 dichotomous items (R_j = 2 for all j). The population class sizes and conditional response probabilities used throughout the simulations can be found in Table 1.

Table 1: Population values for the simulation studies

            c = 1   c = 2
  π_c        0.5     0.5
  π_{j1c}    0.8     0.2
  π_{j2c}    0.2     0.8

To test the behavior of the p-values under H_0 in large samples I generated 500 datasets with N = 1000. In large samples the p-values ought to behave approximately equivalently. Since one of the reasons for using resampling techniques is their use in small samples and sparse tables, I generated the same number of datasets with N = 100. On all datasets a 2-class LC model was fitted using the EM algorithm. At convergence, the asymptotic p-values were calculated for the L² and X² based on the χ² distribution with 50 df and for the BVR_12 using the χ² distribution with 1 df. To obtain the p_b-value the bootstrap with B = 100 was performed, and similarly the p_p and p_d were calculated based on L = 100 PP samples. In total, the LC model had to be fitted to 200,000 additional datasets.

To test the behavior of the p-values under a misspecified model and to perform a power test, again 500 datasets with N = 1000 and 500 datasets with N = 100 were generated from the 2-class population, but each of these datasets was analyzed using a 1-class LC model. I then calculated the p_a-values (with df = 57 for the L² and X²) and obtained the p_b, p_p and p_d-values based on B = L = 100.

To check whether the p-values are uniformly distributed under H_0, I performed two numerical checks and a graphical check to substantiate the findings. If a p-value is uniformly distributed, its expected value E(p) = .5 and P(p < .05) = .05 (i.e., in 5% of the cases the p-value is less than .05). I use the convenient significance level of .05 (Fisher, 1954) as the upper limit for rejecting the null hypothesis. If there are considerable deviations from these indicators of uniformity, the method used might be inappropriate or incorrectly specified. The graphical checks are shown as the distributions of the p-values, smoothed using splines to approximate the log-densities (see Stone et al., 1997). These graphical checks can be used directly to see deviations from uniformity anywhere in the unit interval. Please note that sharp increases in density at the very boundaries (at approximately < .02 and > .98) are due to the estimation procedure rather than implying practical misbehavior of the p-value.
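The two numerical checks are straightforward to compute; a minimal sketch:

```python
def uniformity_checks(pvalues, alpha=0.05):
    """Returns (E(p), Pr(p < alpha)): for uniform p-values under H0 these
    should be close to .5 and alpha, respectively."""
    n = len(pvalues)
    return sum(pvalues) / n, sum(p < alpha for p in pvalues) / n
```

Applied to an (approximately) uniform grid of p-values, the checks return values very close to .5 and .05, which is the benchmark used in Tables 2 and 3.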

Results

Figure 2 in Appendix C and Table 2 provide the results for the p-values under H_0 for a sample size of N = 1000. Figure 3 in Appendix C and Table 3 provide the results under H_0 for N = 100. The densities of the p_a-values are depicted as solid lines, the p_b-values as dashed lines, the p_p-values as dash-dotted lines and the p_d-values as dotted lines. Also included is a reference line indicating a truly uniform distribution. The tables can be used as a summary of the figures and include the two checks for uniformity: the expected p-values E(p) and P(p < .05) for the different goodness-of-fit statistics. Not only can these proportions be used as an indication of systematic deviations from uniformity, they may also be helpful if only Type-I error (false rejection of the null hypothesis) rates are of concern.

The results show that with a sample size of N = 1000, under H_0, the chi-squared reference distribution used for the p_a-values is not an exact reference for the L² statistic. Using the χ² distribution with 50 df resulted in too liberal results, since the Type-I error rate was .094 (almost twice as high as expected under H_0). Also, the expected value is much lower than .5. From Figure 2 it is clear that the density becomes larger as p_a comes closer to 0, indicating too many small p-values. Although this may be due to sampling fluctuations given the limited number of simulations, it is worth mentioning that within the same analyses the p_a-value for the X² statistic shows this behavior much less. To illustrate, there were 81 analyses where the p_a-value for the L² was less than .10 (where there should be only 50). In those analyses the X² had p-values less than .10 in only 57 cases. Inspection of the p_a-values for the BVR_12 clearly indicates that the BVR does not follow a χ² distribution with 1 df: the density of the p-values increases in a linear fashion as the values of p_a increase.

Conversely, from Table 2 it can be seen that in the large-sample case the pb- and pp-values only seem to be somewhat too liberal, having slightly too many values smaller than .05. Other than that, these p-values show very good approximation to the uniform distribution. In the current


Table 2: Uniformity measures of p-values

                 E(p)                        Pr(p < .05)
         pa      pb      pp      pd      pa     pb     pp     pd
L2      .4388   .4945   .4946   .8449   .094   .062   .064   .002
X2      .4917   .4918   .4923   .8496   .060   .068   .064   .000
BVR12   .6706   .5065   .5072   .7667   .000   .046   .046   .000

N = 1000, MC simulations = 500, bootstrap/PPC replications = 100

setting, with large sample size, pb and pp clearly outperform the asymptotic p-value for both the L2 and the BVR, but this is perhaps more likely due to the specification of the asymptotic reference distribution than to the quality of the methods in the large-sample case, since the methods behave very similarly for the X2 statistic.

As expected, the most 'problematic' results came from the PPC using discrepancy variables, which is clearly not adequate for testing model fit using any of the goodness-of-fit statistics. In line with the findings of Hjort et al. (2006), the pd is distributed far from uniformly in the LC goodness-of-fit setting. Figure 2 shows that for the L2 and X2 the density increases as pd gets larger and peaks at 1. For the BVR statistic it peaks at around .78, with a range of [0.54, 0.93]. In only one dataset (the value .002 in Table 2) was a pd-value found that was less than .05.

From Table 3 it can be seen that in sparser datasets the expected values of pb and pp are somewhat higher than those of pa for the L2 statistic (perhaps still due to the asymptotic reference distribution), about equal for X2, and lower for the BVR (although this comparison is rather trivial, since the reference distribution was clearly


Table 3: Uniformity measures of p-values

                 E(p)                        Pr(p < .05)
         pa      pb      pp      pd      pa     pb     pp     pd
L2      .4019   .4354   .4352   .8854   .016   .040   .034   .000
X2      .5224   .5200   .5114   .8535   .028   .024   .018   .000
BVR12   .6758   .5088   .5136   .7607   .004   .040   .038   .000

N = 100, MC simulations = 500, bootstrap/PPC replications = 100

inadequate). Also in sparser tables the pd takes much higher values than the other measures, except compared with pa for the BVR (again probably due to the incorrect reference). All methods tend to be conservative in that too few p-values were less than .05, even when the expected p-values were lower than .5. From Figure 3 it can be seen that the distribution of the pa-value under H0 with a small sample size is far from uniform for the L2 statistic. Interestingly, this behavior is mimicked by the pb and pp. Although the behavior is similar, the pb and pp are distributed more flatly for all statistics, with the bootstrap method resulting in the least peaked distribution.

Finally, in analyzing the 500 datasets of N = 1000 from a 2-class population with a 1-class model, the probability of correctly rejecting the null hypothesis (i.e., the power) was 1 using any of the statistics. That is, all pa-values were less than 10^-19 for the BVR, less than 10^-161 for the L2, and less than 10^-291 for the X2 statistic. All other p-values were always equal to 0. In the 500 smaller samples, all p-values resulted in a power of 1 for the L2 and X2.

Although the power for the BVR was 1 in the previous simulation, the BVR is not a very good measure for determining model misfit when analyzed on its own, as it is based only on a single two-item relationship. That is to say, if one BVR does not provide a small p-value, it does not follow that the whole model fits well. This aspect is captured by the p-values in the small-sample case. The expected and maximum p-values, as well as the power (indicated as P(p < .05), the probability of a value less than .05) for all methods are provided in Table 4. Here too, the pd provides very inadequate results if the values are not post-processed (see Hjort et al., 2006).

Table 4: Power results for the BVR

               pa     pb     pp     pd
E(p)          .001   .010   .009   .146
P(p < .05)    .964   .944   .952   .284
max(p)        .565   .60    .55    .84

5 Empirical example

To illustrate the usage of the proposed methods I have analyzed data which were obtained by Galen and Gambino (1975, in Rindskopf, 2002) in a study of 94 patients who suffered chest pains and were admitted to an emergency room. Four indicators of myocardial infarction (MI) were scored either 1 (present) or 0 (not present): the patients' heart-rhythm Q-waves (Q), high low-density blood cholesterol levels (L), creatine phosphokinase levels (C), and their clinical history (H). The response patterns and their observed frequencies can be found in Table 5. Rindskopf indicated that the data are consistent


with a 2-class LC model: with df = 6, L2 = 4.29 and pa = .64.

To obtain the four p-values for the different statistics, I used the χ2 reference distribution with 6 degrees of freedom for the L2 and X2, and set B = L = 500 to obtain the resampling p-values. Because the data are quite sparse, given the results from the simulation study with N = 100 I expected to find that the pb and pp would be higher than pa for the L2 statistic, about equal for X2, and lower for the BVR (due to the unknown reference distribution for the BVR). I also expected pd to be much higher than the other p-values, but less so than pa for the BVR.
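For reference, the asymptotic pa can be reproduced directly from the χ2 upper-tail probability; for even degrees of freedom this has a simple closed form (a sketch with illustrative names, not the software actually used in the thesis):

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for X ~ chi-squared with even df, via the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    assert df % 2 == 0, "closed form shown here requires even df"
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# L2 = 4.29 with df = 6 reproduces the reported pa of about .64.
pa = chi2_upper_tail(4.29, 6)
```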

Table 5: Response pattern frequencies

Q L C H  count     Q L C H  count
0 0 0 0    33      1 0 0 0     0
0 0 0 1     7      1 0 0 1     0
0 0 1 0     7      1 0 1 0     2
0 0 1 1     5      1 0 1 1     3
0 1 0 0     1      1 1 0 0     0
0 1 0 1     0      1 1 0 1     0
0 1 1 0     3      1 1 1 0     4
0 1 1 1     5      1 1 1 1    24

Table 6 provides the conditional response probabilities and class sizes resulting from fitting the 2-class LC model to the data (which are identical to those reported by Rindskopf, 2002). The first class (likely to have had MI) had high conditional probabilities for all indicators; the other class had low conditional probabilities.

Table 6: ML parameter estimates of ψ for the MI data using a 2-class model

         MI       no MI
πc      0.4578   0.5422
Q=0     0.2332   1.0000
Q=1     0.7668   0.0000
L=0     0.1721   0.9731
L=1     0.8279   0.0269
C=0     0.0000   0.8045
C=1     1.0000   0.1955
H=0     0.2086   0.8049
H=1     0.7914   0.1951

In Table 7 the estimated p-values from all methods are shown for the 2-class model for the three statistics used. As none of the p-values is small, all p-values indicate that the 2-class model fits the data well. Against expectation, the bootstrap resulted in much smaller p-values than the other methods for the L2 and X2. Although no p-value indicated lack of fit, there are large differences in the actual values of the p-values.

Table 7: Results for the empirical example

                        pa     pb     pp     pd
L2    = 4.292611       .637   .358   .606   .874
X2    = 4.22263        .647   .306   .554   .892
BVR12 = 0.1545949      .694   .230   .182   .652

df = 6, N = 94, B = L = 500

6 Discussion

In this thesis I compared different p-values in goodness-of-fit testing of LC models. The classical asymptotic p-value was compared to the p-values obtained by means of the parametric bootstrap and PPCs in large and small samples. The methods were discussed and their differences illustrated. Two problems that occur in using asymptotic p-values were discussed: first, that they cannot be trusted in small samples, and second, that they are not useful when it is unknown what distribution a statistic follows.

The results suggested that the χ2 distribution with df degrees of freedom may not be a valid reference for the L2 statistic in LC analysis, since it produced too liberal results in large samples under H0. The BVR has also been shown clearly not to follow a χ2 distribution with 1 degree of freedom. The pb and pp showed much better behavior than the asymptotic p-value for both the L2 and the BVR, although this might have been due to the asymptotic reference distribution used, since the methods were comparable for the X2, for which pa also showed good behavior.

Whether the bootstrap or the PPC is the better method for approximating a p-value in the current setting is not clear-cut. The data for N = 100 were not extremely sparse, since the number of patterns with observed frequencies of 0 or 1 was not very large. But especially the L2 statistic showed very surprising behavior and needs to be investigated further.

More research should be done to investigate the distributions of the L2 and BVR statistics, which can be done by looking at the actual values of the statistics rather than at the p-values under the reference distribution.

Additionally, analysis of the empirical example showed that the p-values can differ from each other quite severely within one dataset, even though their expected values did not differ much. To find out more about the differences between the p-values within datasets, a comparison of the p-values within each simulation could provide better insight into the characteristics of the data responsible for these differences. This may result in a clearer understanding of when each of the methods can be used optimally.

Since the current research has focused on (overall) goodness-of-fit statistics, an option for future research is a similar study investigating the applicability of resampling techniques to issues regarding LC model selection and comparison. For instance, the PPC could provide a p-value for the increase in fit when adding LCs or when including local dependencies.

That said, I have only considered rather simple LC models; future research on this topic should include, for example, models with more LCs, local dependencies, or covariates.

Note on computational time

Because for each dataset B = L = 100 bootstraps and PPCs are performed to estimate pb, pp and pd, a total of 400,000 replicated datasets had to be generated and analyzed using the EM algorithm, which can become rather time-consuming. For instance, the analysis for N = 100 with 2 LCs took over 20 hours to complete on a 32-bit, 2.61 GHz, 3.43 GB RAM computer using the software package R (CRAN, 2012).

However, the individual analyses themselves do not take very long (a couple of minutes per run). The assessment of the empirical data using all techniques took only about 3 minutes with 500 bootstrap/PPC replications, indicating the practical usefulness of the methods in obtaining p-values. Of course the empirical dataset was not very large, but researchers should not be inhibited from using these techniques in empirical research. The software and hardware used (and the efficiency of the programming) can greatly diminish the time needed to analyze a problem and, moreover, even waiting a day to get reliable research results should be considered worthwhile.


References

Bera, A. K., & Bilias, Y. (2001). Rao's score, Neyman's C(α) and Silvey's LM tests: An essay on historical developments and some new results. Journal of Statistical Planning and Inference, 97, 9–44.

Berkhof, J., Van Mechelen, I., & Gelman, A. (2003). A Bayesian approach to the selection and testing of mixture models. Statistica Sinica, 13, 423–442.

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455.

Fisher, R. A. (1925). Statistical methods for research workers (chapter 3). Retrieved May 2, 2012, from http://psychclassics.yorku.ca/Fisher/Methods/

Formann, A. K. (2003). Latent class model diagnosis: A review and some proposals. Computational Statistics & Data Analysis, 41, 548–559.

Galindo-Garre, F., & Vermunt, J. K. (2005). Testing log-linear models with inequality constraints: A comparison of asymptotic, bootstrap, and posterior predictive p values. Statistica Neerlandica, 59, 82–94.

Garrett, E. S., & Zeger, S. L. (2000). Latent class model diagnosis. Biometrics, 56, 1055–1067.

Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall.

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

Hjort, N. L., Dahl, F. A., & Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101(475), 1157–1174.

Hogg, R. V., & Tanis, E. A. (2010). Probability and statistical inference (8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Hoijtink, H. (1998). Constrained latent class analysis using the Gibbs sampler and posterior predictive p-values: Applications to educational testing. Statistica Sinica, 8, 691–711.

King, M. D., Calamante, F., Clark, C. A., & Gadian, D. G. (2011). Markov chain Monte Carlo random effects modeling in magnetic resonance image processing using the BRugs interface to WinBUGS. Journal of Statistical Software, 44(2). Available online from http://www.jstatsoft.org/v44/i02

Langeheine, R., Pannekoek, J., & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492–516.

Ligtvoet, R., & Vermunt, J. K. (2012). Latent class models for testing monotonicity and invariant item ordering for polytomous items. British Journal of Mathematical and Statistical Psychology, 65(2), 237–250.

Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 175–198). Thousand Oaks, CA: Sage Publications, Inc.

Maydeu-Olivares, A., & Joe, H. (2006). Limited goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713–732.

Meulders, M., De Boeck, P., Kuppens, P., & Van Mechelen, I. (2002). Constrained latent class analysis of three-way three-mode data. Journal of Classification, 19, 277–302.

Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535–569.

Reiser, M., & Lin, Y. (1999). A goodness-of-fit test for the latent class model when expected frequencies are small. In M. Sobel & M. Becker (Eds.), Sociological methodology (pp. 81–111). Boston: Blackwell Publishers.

Rindskopf, D. (2002). The use of latent class analysis in medical diagnosis. Proceedings of the Joint Meetings of the American Statistical Association, 2912–2916.

Rubin, D. B., & Stern, H. S. (1994). Testing in latent class models using a posterior predictive check distribution. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 420–438). Thousand Oaks, CA: Sage Publications, Inc.

Sackrowitz, H., & Samuel-Cahn, E. (1999). P values as random variables: Expected P values. The American Statistician, 53(4), 326–331.

Sterne, J. A. C., & Smith, G. D. (2001). Sifting the evidence: What's wrong with significance tests? BMJ, 322, 226–231.

Stone, C. J., Hansen, M., Kooperberg, C., & Truong, Y. K. (1997). The use of polynomial splines and their tensor products in extended linear modeling (with discussion). Annals of Statistics, 25, 1371–1470.

Tanner, M. A., & Wong, W. H. (1984). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540.

Vermunt, J. K. (2010). Latent class models. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (pp. 238–244). Oxford: Elsevier.

Vermunt, J. K., & Magidson, J. (2005). Technical guide for Latent GOLD 4.0: Basic and advanced. Belmont, MA: Statistical Innovations Inc.


A EM algorithm

Because the LC membership is unobservable, the (logarithm of the) likelihood is hard to maximize: the summation within the log makes separation of the product terms unviable. It is possible, however, to use a sequential algorithm if we provide starting values for the missing data (i.e., the unobserved class membership).

Combining Equations 1-3 to obtain the likelihood gives:

\[
P(Y_s) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{\,y^*_{sjr}} \qquad (11)
\]

and taking the log gives the log-likelihood:

\[
\log P(Y_s) = \log \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{\,y^*_{sjr}}. \qquad (12)
\]

With class membership unobservable, this expression is unsolvable. However, if we impute values for the missing class memberships (also called data augmentation; e.g., Ligtvoet & Vermunt, 2012), the expression can be written as:

\[
\log P(Y_s) = n_s \sum_{c=1}^{C} \pi_{c|s} \log \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{\,y^*_{sjr}}.
\]

Now, the EM algorithm consists of sequentially updating π_c|s (providing π_c) and π_jrc to maximize

\[
\log L = \sum_{s=1}^{S} \log P(Y_s).
\]

The algorithm continues until the change in the log-likelihood between iteration t and t + 1 is lower than a given convergence criterion. The values for which this log-likelihood is maximized are the ML estimates.

When using the EM algorithm it can, however, occur that convergence is attained at a local maximum. To control for this, multiple sets of starting values are often used, and the values of ψ resulting in the highest log-likelihood are taken as the ML estimates.
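The scheme above can be sketched for dichotomous items as follows (a minimal illustration with made-up function and variable names, assuming a simple random-start strategy; it is not the thesis implementation):

```python
import numpy as np

def em_lc(Y, C=2, n_iter=200, tol=1e-8, seed=0):
    """EM for a latent class model with dichotomous items.
    Y: (N, J) 0/1 data matrix. Returns class sizes, conditional
    response probabilities P(y_j = 1 | class c), and the log-likelihood."""
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    pi_c = np.full(C, 1.0 / C)                  # class sizes
    theta = rng.uniform(0.3, 0.7, size=(C, J))  # P(y_j = 1 | class c)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities (data augmentation)
        log_like = (Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T
                    + np.log(pi_c))                      # (N, C)
        m = log_like.max(axis=1, keepdims=True)
        post = np.exp(log_like - m)
        ll = float((m.squeeze() + np.log(post.sum(axis=1))).sum())
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update parameters from the augmented data
        pi_c = post.mean(axis=0)
        theta = ((post.T @ Y) / post.sum(axis=0)[:, None]).clip(1e-6, 1 - 1e-6)
        # Stop when the log-likelihood change falls below the criterion
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi_c, theta, ll

# Toy data from a well-separated 2-class population
rng = np.random.default_rng(42)
z = rng.integers(0, 2, size=500)
probs = np.where(z[:, None] == 0, 0.9, 0.1)
Y = (rng.uniform(size=(500, 4)) < probs).astype(float)
pi_c, theta, ll = em_lc(Y)
```

In practice (as noted above) one would rerun this from several starting sets and keep the solution with the highest log-likelihood.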

0 log(0) = 0 convention

In order to let only observed patterns contribute to the likelihood, I used the convention that 0 log(0) = 0. This is needed because log(0) is undefined, and multiplying log(0) by 0 will technically not result in 0. The following is a justification for using this convention.

If I define the natural logarithm as log(x) = ∫_1^x (1/t) dt and need to find a reasonable value for 0 log(0), I should take the limit as x approaches 0. Using l'Hôpital's rule one can show that, although log(0) is undefined, the limit of


x log(x) as x approaches zero is:

\[
\lim_{x \to 0} x \log(x) = \lim_{x \to 0} \frac{\log(x)}{x^{-1}} = \lim_{x \to 0} \frac{x^{-1}}{-x^{-2}} = \lim_{x \to 0} (-x) = 0.
\]
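In code the convention amounts to a simple guard (a sketch; the function name is illustrative):

```python
import math

def xlogx(x):
    """Return x * log(x), using the convention 0 * log(0) = 0,
    which matches the limit of x*log(x) as x -> 0."""
    return 0.0 if x == 0 else x * math.log(x)
```

With this guard, unobserved response patterns (count 0) contribute exactly 0 to the log-likelihood instead of raising a domain error.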


B The Gibbs sampler (in LC analysis)

The Gibbs sampler can be used to estimate the LC model, as described in Section 2.2, but also to perform the PPC (see Section 3.3) as a means of testing model fit. The Bayesian model-fit approach compares the goodness-of-fit statistic T_obs to a reference distribution which is obtained by averaging the distribution P(T|ψ) over the posterior P(ψ|y). When the posterior distribution is not (or only tediously) calculable analytically, one can use simulation to estimate it. Here I show in detail how to obtain the posterior (predictive) distribution for ψ and y_rep and how to perform the PPC.

The method goes as follows:

Step 1. Assume that the model is true.

Step 2a. Draw a sample from the posterior distribution ψ^l ∼ P(ψ|y).

Step 2b. Generate a replicate dataset y_rep,l ∼ P(y_rep|ψ^l).

Step 2c. Repeat Steps 2a and 2b to obtain L draws from the posterior predictive distribution.

Step 3. Estimate the LC model under H0 on each replicate dataset and calculate the statistic T^l_rep.
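Steps 2-3 can be illustrated with a deliberately simple conjugate toy model (a Bernoulli model with a Beta posterior) in place of the LC model; the structure of the loop, not the model, is the point, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data and H0 model: y_i ~ Bernoulli(theta) with a Beta(1, 1) prior,
# so the posterior is Beta(1 + k, 1 + n - k) with k successes out of n.
y = rng.integers(0, 2, size=50)
n, k = y.size, int(y.sum())
T_obs = np.abs(np.diff(y)).sum()       # toy statistic: number of 0/1 switches

L = 500
T_rep = np.empty(L)
for l in range(L):
    theta_l = rng.beta(1 + k, 1 + n - k)                 # Step 2a: psi^l ~ P(psi | y)
    y_rep = (rng.uniform(size=n) < theta_l).astype(int)  # Step 2b: y_rep ~ P(y_rep | psi^l)
    T_rep[l] = np.abs(np.diff(y_rep)).sum()              # Step 3: statistic on replicate

pp = (T_rep >= T_obs).mean()           # posterior predictive p-value
```

In the LC setting, Step 3 additionally requires estimating the model on each replicate dataset before computing the statistic.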

Drawing the samples in Step 2 has to be split into parts, because it involves the posterior distribution of the parameters in ψ, from which it is not straightforward to draw samples of the LC model parameters directly. The following text


discusses how to specify the posterior distribution and how to obtain samples from it using the Gibbs sampler.

This discussion concerns obtaining (draws from) the posterior distribution P(ψ|y). Note that this applies to the Gibbs sampler both in the estimation process and in the PPC. The posterior distribution of ψ can be obtained using Bayes' rule:

\[
P(\psi|y) = \frac{P(y|\psi)\,P(\psi)}{P(y)} \qquad (13)
\]
\[
\propto P(y|\psi)\,P(\psi) \qquad (14)
\]

The term P(y) is called the marginal likelihood or normalizing constant. To draw samples from the posterior we can simply use Equation 14, because the shape of the distribution is not influenced by multiplying or dividing by a constant. However, as can be seen, one does need a prior distribution P(ψ) for the parameters in ψ, which can be used to include prior knowledge (or lack thereof) about the parameters of interest.

For each set of multinomial parameters (e.g., π_jrc, r = 1, ..., R_j) I have used a Dirichlet prior distribution. For dichotomous variables (R_j = 2 for all j) I could equivalently have used Beta distributions (Gelman et al., 2004), but for the sake of generality I show the use of the Dirichlet distribution here. For example, the prior distribution of the conditional response probabilities


of a person in LC c = 1, ..., C on item j = 1, ..., J is given by:

\[
P(\pi_{jrc}, r = 1, \dots, R_j) = \frac{\left(\sum_{q=1}^{R_j} \alpha_{jqc}\right)!}{\prod_{q=1}^{R_j} \alpha_{jqc}!} \prod_{r=1}^{R_j} \pi_{jrc}^{\,\alpha_{jrc}-1} \qquad (15)
\]
\[
\propto \prod_{r=1}^{R_j} \pi_{jrc}^{\,\alpha_{jrc}-1}. \qquad (16)
\]

It is commonplace to ignore the constant, indicate only the parts of the distribution which involve the parameters (here, π_jrc), and use the proportionality property. The prior distribution for the class sizes is given by:

\[
P(\pi_c, c = 1, \dots, C) \propto \prod_{c=1}^{C} \pi_c^{\,\alpha_c - 1}. \qquad (17)
\]

The values of the hyperparameters α_jrc indicate, in an absolute sense, the strength of one's prior belief about the probability of giving response r to item j in class c, and the relative sizes of the hyperparameters indicate the relative probabilities of the responses (Rubin & Stern, 1994); α_c is used likewise for the class sizes. To indicate no prior knowledge about the items or the LC sizes, I use only vague (diffuse) priors in the analysis, where Σ_c α_c = Σ_r α_jrc = 1 (see Section 2.2).

The prior distribution of the entire set ψ is the product of the priors on the elements in it:

\[
p(\psi) = \prod_{c=1}^{C} \left[ \pi_c^{\,\alpha_c - 1} \left( \prod_{r=1}^{R_1} \pi_{1rc}^{\,\alpha_{1rc}-1} \times \cdots \times \prod_{r=1}^{R_J} \pi_{Jrc}^{\,\alpha_{Jrc}-1} \right) \right] \qquad (18)
\]

and the posterior is then obtained by combining this prior distribution with the likelihood (Equation 11) of the LC model (Rubin & Stern, 1994):

\[
P(\psi|y) \propto \prod_{s=1}^{S} \left[ \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{\,y^*_{sjr}} \right]^{n_s} P(\psi). \qquad (19)
\]

As indicated earlier, this posterior distribution does not have a convenient form to sample from. But, as it turns out, augmenting the data with estimates of the unobserved LC memberships can make the model estimable. As shown in Section 2.2, the Gibbs sampler can be used to estimate the LC model in an iterative fashion, but it requires that unobserved indicators for the LC memberships are used to augment the data. In this way it is possible to obtain conditional distributions of the parameters given the LC

membership (Tanner & Wong, 1984). To illustrate, let Z_sic = 1 if the ith observation in the sth cell of the contingency table (i = 1, ..., n_s; s = 1, ..., S) belongs to LC c, and 0 otherwise. Then the joint distribution is:

\[
P(\psi, Z, y) \propto \prod_{s=1}^{S} \prod_{i=1}^{n_s} \prod_{c=1}^{C} P(Y_s)^{Z_{sic}} \, P(\psi). \qquad (20)
\]

The distribution of ψ conditional on Z and y is given by the product of independent Dirichlet distributions with hyperparameters α_jrc + n_jrc and α_c + m_c.


The conditional probability P(Z|ψ, y) is given by the Bernoulli distribution. Using Bayes' rule, the probability that Z_sic = 1 is obtained using Equation 4:

\[
P(Z_{sic} = 1|\psi, y) = \frac{P(Y_s|\theta = c)\,\pi_c}{P(Y_s)}. \qquad (21)
\]

These conditional distributions are easy to sample from (see Section 2.2). The Gibbs sampler described in this thesis does this iteratively and, at convergence, the sampled values of Z and ψ are draws from the joint posterior distribution P(Z, ψ|y) (Rubin & Stern, 1994; Tanner & Wong, 1984). To avoid correlations between the samples, one is advised not to use subsequent draws but, for instance, to retain only every 50th draw or so.

To obtain the replicate data y_rep,l in Step 2b as a draw from the predictive distribution P(y_rep|ψ^l), we just need to draw N observations from a multinomial distribution with probabilities P(Y_s) computed from ψ^l.
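The full data-augmentation cycle above can be sketched for dichotomous items as follows (illustrative code with hypothetical names, assuming diffuse hyperparameters of 1 rather than the thesis's settings; not the thesis implementation):

```python
import numpy as np

def gibbs_lc(Y, C=2, n_draws=200, seed=3):
    """Data-augmentation Gibbs sampler for a latent class model with
    dichotomous items, using Dirichlet priors throughout."""
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    pi_c = np.full(C, 1.0 / C)
    theta = rng.uniform(0.3, 0.7, size=(C, J))  # P(y_j = 1 | class c)
    draws = []
    for _ in range(n_draws):
        # P(Z | psi, y): class membership probabilities per observation
        like = (theta.T[None] ** Y[:, :, None]
                * (1 - theta.T[None]) ** (1 - Y[:, :, None])).prod(axis=1) * pi_c
        probs = like / like.sum(axis=1, keepdims=True)
        z = (probs.cumsum(axis=1) > rng.uniform(size=(N, 1))).argmax(axis=1)
        # P(psi | Z, y): independent Dirichlet draws given augmented counts
        m_c = np.bincount(z, minlength=C)                  # class counts
        pi_c = rng.dirichlet(1 + m_c)
        for c in range(C):
            n1 = Y[z == c].sum(axis=0)                     # "1" responses in class c
            for j in range(J):
                theta[c, j] = rng.dirichlet([1 + n1[j], 1 + m_c[c] - n1[j]])[0]
        draws.append((pi_c.copy(), theta.copy()))
    return draws

# Toy data from a 2-class population
rng = np.random.default_rng(0)
z_true = rng.integers(0, 2, size=300)
probs = np.where(z_true[:, None] == 0, 0.85, 0.15)
Y = (rng.uniform(size=(300, 4)) < probs).astype(float)
draws = gibbs_lc(Y)
```

In line with the advice above, one would discard burn-in draws and thin the remainder (e.g., keep every 50th) before using them for the PPC.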


C Figures

[Figure 1 shows two panels: "Trace of replicated X2" (T_rep(X2) against iteration, 0-500) and "Density for replicated X2" (density of the replicated X2 values).]

Figure 1: Example of trace and density plot for the PPC in the empirical data. The dashed lines indicate X2_obs = 4.223, pp = .554.


[Figure 2 shows three panels of p-value densities on [0, 1], one each for the L2, X2 and BVR statistics; line types distinguish the asymptotic, bootstrap, PPC and discrepancy p-values, with a uniform reference line.]

Figure 2: P-value log-densities for the 2-Class model with N = 1000


[Figure 3 shows three panels of p-value densities on [0, 1], one each for the L2, X2 and BVR statistics; line types distinguish the asymptotic, bootstrap, PPC and discrepancy p-values, with a uniform reference line.]

Figure 3: P-value log-densities for the 2-Class model with N = 100
