
Page 1: Applying Finite Mixture Models Presenter: Geoff McLachlan

Applying Finite Mixture Models

Presenter: Geoff McLachlan

Page 2: Applying Finite Mixture Models Presenter: Geoff McLachlan
Page 3: Applying Finite Mixture Models Presenter: Geoff McLachlan

Topics

Introduction
Application of the EM algorithm
Examples of normal mixtures
Robust mixture modeling
Number of components in a mixture model
Mixtures of nonnormal components
Mixture models for failure-time data
Mixture software

Page 4: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.1 Flexible Method of Modeling

Astronomy
Biology
Genetics
Medicine
Psychiatry
Economics
Engineering
Marketing

Page 5: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.2 Initial Approach to Mixture Analysis

Classic paper of Pearson (1894)

Page 6: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 1: Plot of forehead to body length data on 1000 crabs and of the fitted one-component (dashed line) and two-component (solid line) normal mixture models.

Page 7: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.3 Basic Definition

We let Y1, …, Yn denote a random sample of size n, where Yj is a p-dimensional random vector with probability density function

$$f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j), \qquad (1)$$

where the fi(yj) are the component densities and the πi are nonnegative mixing proportions that sum to one.

Page 8: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.4 Interpretation of Mixture Models

An obvious way of generating a random vector Yj with the g-component mixture density f(Yj), given by (1), is as follows. Let Zj be a categorical random variable taking on the values 1, …, g with probabilities π1, …, πg, respectively, and suppose that the conditional density of Yj given Zj = i is fi(yj) (i = 1, …, g). Then the unconditional density of Yj (that is, its marginal density) is given by f(yj).

Page 9: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.5 Shapes of Some Univariate Normal Mixtures

Consider

$$f(y_j) = \pi_1 \phi(y_j; \mu_1, \sigma_1^2) + \pi_2 \phi(y_j; \mu_2, \sigma_2^2), \qquad (5)$$

where

$$\phi(y_j; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y_j - \mu)^2/\sigma^2\} \qquad (6)$$

denotes the univariate normal density with mean μ and variance σ².

Page 10: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 2: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ² = 1 (four panels, corresponding to mean separations of 1, 2, 3, and 4).

Page 11: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 3: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance (four panels, corresponding to mean separations of 1, 2, 3, and 4).

Page 12: Applying Finite Mixture Models Presenter: Geoff McLachlan
Page 13: Applying Finite Mixture Models Presenter: Geoff McLachlan
Page 14: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.6 Parametric Formulation of Mixture Model

In many applications, the component densities fi(yj) are specified to belong to some parametric family. In this case, the component densities are specified as fi(yj; θi), where θi is the vector of unknown parameters in the postulated form for the ith component density in the mixture. The mixture density f(yj) can then be written as

Page 15: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.6 cont.

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i), \qquad (7)$$

where the vector Ψ containing all the parameters in the mixture model can be written as

$$\Psi = (\pi_1, \ldots, \pi_{g-1}, \xi^T)^T, \qquad (8)$$

where ξ is the vector containing all the parameters in θ1, …, θg known a priori to be distinct.

Page 16: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.7 Identifiability of Mixture Distributions

In general, a parametric family of densities {f(yj; Ψ) : Ψ ∈ Ω}, where Ω is the specified parameter space, is identifiable if distinct values of the parameter Ψ determine distinct members of the family; that is,

$$f(y_j; \Psi) = f(y_j; \Psi^*) \qquad (11)$$

Page 17: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.7 cont.

if and only if

$$\Psi = \Psi^*. \qquad (12)$$

Identifiability for mixture distributions is defined slightly differently. To see why this is necessary, suppose that f(yj; Ψ) has two component densities, say, fi(y; θi) and fh(y; θh), that belong to the same parametric family. Then (11) will still hold when the component labels i and h are interchanged in Ψ.

Page 18: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.8 Estimation of Mixture Distributions

In the 1960s, the fitting of finite mixture models by maximum likelihood was studied in a number of papers, including the seminal papers by Day (1969) and Wolfe (1965, 1967, 1970).

However, it was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

Page 19: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.8 Cont.

This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM algorithm's conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete.

Page 20: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.9 Mixture Likelihood Approach to Clustering

Suppose that the purpose of fitting the finite mixture model (7) is to cluster an observed random sample y1,…,yn into g components. This problem can be viewed as wishing to infer the associated component labels z1,…,zn of these feature data vectors. That is, we wish to infer the zj on the basis of the feature data yj.

Page 21: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.9 Cont.

After we fit the g-component mixture model to obtain the estimate Ψ̂ of the vector of unknown parameters in the mixture model, we can give a probabilistic clustering of the n feature observations y1, …, yn in terms of their fitted posterior probabilities of component membership. For each yj, the g probabilities τ1(yj; Ψ̂), …, τg(yj; Ψ̂) give the estimated posterior probabilities that this observation belongs to the first, second, …, and gth component, respectively, of the mixture (j = 1, …, n).

Page 22: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.9 Cont.

We can give an outright or hard clustering of these data by assigning each yj to the component of the mixture to which it has the highest estimated posterior probability of belonging. That is, we estimate the component-label vector zj by ẑj, where ẑij = (ẑj)i is defined by

$$\hat{z}_{ij} = \begin{cases} 1, & \text{if } i = \arg\max_h \tau_h(y_j; \hat{\Psi}), \\ 0, & \text{otherwise,} \end{cases} \qquad (14)$$

for i = 1, …, g; j = 1, …, n.
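As an illustration of (14), a minimal NumPy sketch for turning fitted posterior probabilities into a hard clustering is given below; the helper name hard_clustering and the variable tau are ours for illustration and are not taken from EMMIX or any other package.

```python
import numpy as np

def hard_clustering(tau):
    """Convert an n x g matrix of estimated posterior probabilities
    tau[j, i] = tau_i(y_j; Psi_hat) into outright (hard) component labels,
    as in (14): each y_j is assigned to the component with the highest
    estimated posterior probability."""
    labels = np.argmax(tau, axis=1)           # index of the largest posterior per row
    z_hat = np.zeros_like(tau, dtype=int)     # zero-one indicator matrix z_hat[j, i]
    z_hat[np.arange(tau.shape[0]), labels] = 1
    return labels, z_hat

# Example: three observations, two components (a tie goes to the first component)
tau = np.array([[0.9, 0.1],
                [0.4, 0.6],
                [0.5, 0.5]])
labels, z_hat = hard_clustering(tau)
```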

Page 23: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.10 Testing for the Number of Components

In some applications of mixture models, there is sufficient a priori information for the number of components g in the mixture model to be specified with no uncertainty. For example, this would be the case where the components correspond to externally existing groups in which the feature vector is known to be normally distributed.

Page 24: Applying Finite Mixture Models Presenter: Geoff McLachlan

1.10 Cont.

However, on many occasions, the number of components has to be inferred from the data, along with the parameters in the component densities. If, say, a mixture model is being used to describe the distribution of some data, the number of components in the final version of the model may be of interest beyond matters of a technical or computational nature.

Page 25: Applying Finite Mixture Models Presenter: Geoff McLachlan

2. Application of EM algorithm 2.1 Estimation of Mixing Proportions

Suppose that the density of the random vector Yj has the g-component mixture form

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j), \qquad (15)$$

where Ψ = (π1, …, π_{g-1})^T is the vector containing the unknown parameters, namely the g − 1 mixing proportions π1, …, π_{g-1}, since

$$\pi_g = 1 - \sum_{i=1}^{g-1} \pi_i.$$

Page 26: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

In order to pose this problem as an incomplete-data one, we now introduce as the unobservable or missing data the vector

$$z = (z_1^T, \ldots, z_n^T)^T, \qquad (18)$$

where zj is the g-dimensional vector of zero-one indicator variables as defined above. If these zij were observable, then the MLE of πi would simply be given by

$$\hat{\pi}_i = \sum_{j=1}^{n} z_{ij} / n \qquad (i = 1, \ldots, g). \qquad (19)$$

Page 27: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

The EM algorithm handles the addition of the unobservable data to the problem by working with Q(Ψ; Ψ^(k)), which is the current conditional expectation of the complete-data log likelihood given the observed data. On defining the complete-data vector x as

$$x = (y^T, z^T)^T, \qquad (20)$$

Page 28: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

the complete-data log likelihood for Ψ has the multinomial form

$$\log L_c(\Psi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log \pi_i + C, \qquad (21)$$

where

$$C = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log f_i(y_j)$$

does not depend on Ψ.

Page 29: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

As (21) is linear in the unobservable data zij, the E-step (on the (k+1) th iteration) simply requires the calculation of the current conditional expectation of Zij given the observed data y, where Zij is the random variable corresponding to zij. Now

$$E_{\Psi^{(k)}}(Z_{ij} \mid y) = \mathrm{pr}_{\Psi^{(k)}}\{Z_{ij} = 1 \mid y\} = \tau_{ij}^{(k)}, \qquad (22)$$

Page 30: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

where by Bayes Theorem,

$$\tau_{ij}^{(k)} = \tau_i(y_j; \Psi^{(k)}) = \pi_i^{(k)} f_i(y_j) \big/ f(y_j; \Psi^{(k)}) \qquad (23)$$

for i = 1, …, g; j = 1, …, n. The quantity τi(yj; Ψ^(k)) is the posterior probability that the jth member of the sample with observed value yj belongs to the ith component of the mixture.

Page 31: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.1 Cont.

The M-step on the (k+1)th iteration simply requires replacing each zij by τij^(k) in (19) to give

$$\pi_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} / n \qquad (24)$$

for i = 1, …, g.
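To make the E-step (22)-(23) and the M-step (24) concrete for the case where the component densities fi are completely specified, here is a minimal NumPy sketch; the function em_mixing_proportions and its arguments are illustrative assumptions rather than code from any mixture package.

```python
import numpy as np

def em_mixing_proportions(f_vals, pi0, n_iter=100, tol=1e-8):
    """EM iterations (22)-(24) for the mixing proportions only.

    f_vals : n x g array with f_vals[j, i] = f_i(y_j), the known component
             densities evaluated at each observation.
    pi0    : length-g array of starting proportions pi^(0)."""
    pi = np.asarray(pi0, dtype=float)
    for _ in range(n_iter):
        # E-step (23): posterior probabilities tau_ij^(k)
        weighted = f_vals * pi                              # pi_i^(k) f_i(y_j)
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step (24): pi_i^(k+1) = sum_j tau_ij^(k) / n
        pi_new = tau.mean(axis=0)
        if np.max(np.abs(pi_new - pi)) < tol:
            pi = pi_new
            break
        pi = pi_new
    return pi, tau
```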

Page 32: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.2 Example 2.1:Synthetic Data Set 1

We generated a random sample of n = 50 observations y1, …, yn from a mixture of two univariate normal densities with means μ1 = 0 and μ2 = 2 and common variance σ² = 1, in proportions π1 = 0.8 and π2 = 0.2.

Page 33: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 1: Results of EM Algorithm for Example on Estimation of Mixing Proportions.

Iteration k    π1^(k)     log L(Ψ^(k))
0              0.50000    -91.87811
1              0.68421    -85.55353
2              0.70304    -85.09035
3              0.71792    -84.81398
4              0.72885    -84.68609
5              0.73665    -84.63291
6              0.74218    -84.60978
7              0.74615    -84.58562

Page 34: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.3 Univariate Normal Component Densities

The normal mixture model to be fitted is thus

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i),$$

where

$$f_i(y_j; \theta_i) = \phi(y_j; \mu_i, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y_j - \mu_i)^2/\sigma^2\}. \qquad (28)$$

Page 35: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.3 Cont.

The complete-data log likelihood function for Ψ is given by (21), but where now

$$C = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log f_i(y_j; \theta_i) = -\tfrac{1}{2} n \log(2\pi) - \tfrac{1}{2} \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \{\log \sigma^2 + (y_j - \mu_i)^2/\sigma^2\}.$$

Page 36: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.3 Cont.

The E-step is the same as before, requiring the calculation of (23). The M-step now requires the computation of not only (24), but also the values μ1^(k+1), …, μg^(k+1) and σ²^(k+1) that, along with the πi^(k+1), maximize Q(Ψ; Ψ^(k)). Now

$$\hat{\mu}_i = \sum_{j=1}^{n} z_{ij} y_j \Big/ \sum_{j=1}^{n} z_{ij} \qquad (29)$$

and

$$\hat{\sigma}^2 = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} (y_j - \hat{\mu}_i)^2 / n \qquad (30)$$

are the MLEs of μi and σ², respectively, if the zij were observable.

Page 37: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.3 Cont.

As log Lc(Ψ) is linear in the zij, it follows that the zij in (29) and (30) are replaced by their current conditional expectations τij^(k), which here are the current estimates τi(yj; Ψ^(k)) of the posterior probabilities of component membership of the mixture, given by

$$\tau_i(y_j; \Psi^{(k)}) = \pi_i^{(k)} f_i(y_j; \theta_i^{(k)}) \big/ f(y_j; \Psi^{(k)}).$$

Page 38: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.3 Cont.

This yields

$$\mu_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} y_j \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)} \qquad (i = 1, \ldots, g) \qquad (31)$$

and

$$\sigma^{2\,(k+1)} = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(k)} (y_j - \mu_i^{(k+1)})^2 / n, \qquad (32)$$

and πi^(k+1) is given by (24).
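The updates (24), (31), and (32) can be combined into a single EM iteration. The sketch below assumes NumPy and SciPy and uses a hypothetical helper em_step_univariate; it is only an illustration of the formulas above, not the EMMIX implementation.

```python
import numpy as np
from scipy.stats import norm

def em_step_univariate(y, pi, mu, sigma2):
    """One EM iteration for a g-component univariate normal mixture with
    common variance sigma^2, following (23), (24), (31), and (32).
    y is a length-n data vector; pi and mu are length-g arrays."""
    y = np.asarray(y, dtype=float)
    # E-step (23): tau[j, i] is the posterior probability of component i for y_j
    dens = norm.pdf(y[:, None], loc=mu[None, :], scale=np.sqrt(sigma2))
    weighted = pi[None, :] * dens
    tau = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step
    pi_new = tau.mean(axis=0)                                        # (24)
    mu_new = (tau * y[:, None]).sum(axis=0) / tau.sum(axis=0)        # (31)
    sigma2_new = (tau * (y[:, None] - mu_new[None, :]) ** 2).sum() / len(y)  # (32)
    return pi_new, mu_new, sigma2_new, tau
```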

Page 39: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.4 Multivariate Component Densities

For multivariate normal component densities, the M-step updates for the component means are

$$\mu_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} y_j \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)} \qquad (i = 1, \ldots, g), \qquad (34)$$

and, under the assumption of a common component-covariance matrix Σ,

$$\Sigma^{(k+1)} = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(k)} (y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T / n. \qquad (35)$$

Page 40: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.4 Cont.

In the case of normal components with arbitrary covariance matrices, equation (35) is replaced by

$$\Sigma_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} (y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)} \qquad (i = 1, \ldots, g). \qquad (36)$$
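A minimal sketch of the corresponding M-step for multivariate normal components with unrestricted covariance matrices, implementing (24), (34), and (36), is given below (NumPy assumed; the function name m_step_multivariate is illustrative only).

```python
import numpy as np

def m_step_multivariate(y, tau):
    """M-step updates (24), (34), and (36) for multivariate normal components
    with arbitrary (unrestricted) covariance matrices.
    y : n x p data matrix; tau : n x g posterior probabilities from the E-step."""
    n, p = y.shape
    g = tau.shape[1]
    t_i = tau.sum(axis=0)                       # effective component sizes
    pi_new = t_i / n                            # (24)
    mu_new = (tau.T @ y) / t_i[:, None]         # (34)
    sigma_new = np.empty((g, p, p))
    for i in range(g):
        centred = y - mu_new[i]                 # y_j - mu_i^(k+1)
        sigma_new[i] = (tau[:, i, None] * centred).T @ centred / t_i[i]   # (36)
    return pi_new, mu_new, sigma_new
```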

Page 41: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.5 Starting Values for EM Algorithm

The EM algorithm is started from some initial value of Ψ, Ψ^(0). Hence in practice we have to specify a value for Ψ^(0).

An alternative approach is to perform the first E-step by specifying a value τj^(0) for τ(yj; Ψ) for each j (j = 1, …, n), where

$$\tau(y_j; \Psi) = (\tau_1(y_j; \Psi), \ldots, \tau_g(y_j; \Psi))^T$$

is the vector containing the g posterior probabilities of component membership for yj.

Page 42: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.5 Cont.

The latter is usually undertaken by setting τj^(0) = zj^(0) for j = 1, …, n, where

$$z^{(0)} = (z_1^{(0)T}, \ldots, z_n^{(0)T})^T$$

defines an initial partition of the data into g groups. For example, an ad hoc way of initially partitioning the data in the case of, say, a mixture of g = 2 normal components with the same covariance matrices would be to plot the data for selections of two of the p variables, and then draw a line that divides the bivariate data into two groups that have a scatter that appears normal.

Page 43: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.5 Cont.

For higher dimensional data, an initial value z(0) for z might be obtained through the use of some clustering algorithm, such as k-means or, say, an hierarchical procedure if n is not too large.

 

Another way of specifying an initial partition z(0) of the data is to randomly divide the data into g groups corresponding to the g components of the mixture model.
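As one way of forming such an initial partition in code, the sketch below builds the indicator matrix z^(0) either from k-means (via scikit-learn, an assumption on our part) or from a random allocation into g groups; the helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_partition_kmeans(y, g, random_state=0):
    """Initial partition z^(0) of the n x p data matrix y into g groups via k-means."""
    labels = KMeans(n_clusters=g, n_init=10, random_state=random_state).fit_predict(y)
    z0 = np.zeros((len(y), g), dtype=int)
    z0[np.arange(len(y)), labels] = 1
    return z0

def initial_partition_random(n, g, seed=0):
    """Random allocation of the n observations to the g components."""
    labels = np.random.default_rng(seed).integers(0, g, size=n)
    z0 = np.zeros((n, g), dtype=int)
    z0[np.arange(n), labels] = 1
    return z0
```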

Page 44: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.6 Example 2.2:Synthetic Data Set 2

Page 45: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.7 Example 2.3:Synthetic Data Set 3

Page 46: Applying Finite Mixture Models Presenter: Geoff McLachlan

True Values Initial Values Estimates by EM

10.333 0.333 0.294

20.333 0.333 0.337

30.333 0.333 0.370

1(0 –2)T (-1 0) T (-0.154 –1.961) T

2(0 0) T (0 0) T (0.360 0.115) T

3(0 2) T (1 0) T (-0.004 2.027) T

1

1

1

2.00

02

2.00

02

2.00

02

10

01

10

01

10

01

218.0016.0

016.0961.1

218.0553.0

553.0346.2

206.0042.0

042.0339.2

Page 47: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 7

Page 48: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 8

Page 49: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Provision of Standard Errors

One way of obtaining standard errors of the estimates of the parameters in a mixture model is to approximate the covariance matrix of Ψ̂ by the inverse of the observed information matrix, which is given by the negative of the Hessian matrix of the log likelihood evaluated at the MLE. It is important to emphasize that estimates of the covariance matrix of the MLE based on the expected or observed information matrices are guaranteed to be valid inferentially only asymptotically.

Page 50: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

In particular for mixture models, it is well known that the sample size n has to be very large before the asymptotic theory of maximum likelihood applies.

Hence we shall now consider a resampling approach, the bootstrap, to this problem. Standard error estimation of Ψ̂ may be implemented according to the bootstrap as follows.

Page 51: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

Step 1. A new set of data, y*, called the bootstrap sample, is generated according to F̂, an estimate of the distribution function of Y formed from the original observed data y. That is, in the case where y contains the observed values of a random sample of size n, y* consists of the observed values of the random sample

$$Y_1^*, \ldots, Y_n^* \overset{\text{i.i.d.}}{\sim} \hat{F}, \qquad (40)$$

Page 52: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

where the estimate F̂ (now denoting the distribution function of a single observation Yj) is held fixed at its observed value.

Step 2. The EM algorithm is applied to the bootstrap observed data y* to compute the MLE for this data set, Ψ̂*.

Page 53: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

Step 3. The bootstrap covariance matrix of Ψ̂* is given by

$$\mathrm{cov}^*(\hat{\Psi}^*) = E^*\big[\{\hat{\Psi}^* - E^*(\hat{\Psi}^*)\}\{\hat{\Psi}^* - E^*(\hat{\Psi}^*)\}^T\big], \qquad (41)$$

where E* denotes expectation over the bootstrap distribution specified by F̂.

Page 54: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

The bootstrap covariance matrix can be approximated by Monte Carlo methods. Steps 1 and 2 are repeated independently a number of times (say, B) to give B independent realizations of Ψ̂*, denoted by Ψ̂1*, …, Ψ̂B*.

Page 55: Applying Finite Mixture Models Presenter: Geoff McLachlan

2.8 Cont.

Then (41) can be approximated by the sample covariance matrix of these B bootstrap replications to give

$$\mathrm{cov}^*(\hat{\Psi}^*) \approx \sum_{b=1}^{B} (\hat{\Psi}_b^* - \bar{\Psi}^*)(\hat{\Psi}_b^* - \bar{\Psi}^*)^T / (B - 1), \qquad (42)$$

where

$$\bar{\Psi}^* = \sum_{b=1}^{B} \hat{\Psi}_b^* / B. \qquad (43)$$
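Steps 1-3 and the Monte Carlo approximation (42)-(43) can be sketched as follows. The routines fit_em and simulate are hypothetical stand-ins for an EM fitting routine and a sampler from the fitted mixture; they are not part of any particular package.

```python
import numpy as np

def bootstrap_covariance(fit_em, simulate, psi_hat, B=100, rng=None):
    """Monte Carlo approximation (42)-(43) to the bootstrap covariance matrix
    of the MLE psi_hat (handled here as a flat parameter vector).

    simulate(psi, rng) -> bootstrap sample y* drawn from the fitted mixture (Step 1)
    fit_em(y_star)     -> MLE psi*_b computed from y* by the EM algorithm (Step 2)"""
    rng = np.random.default_rng() if rng is None else rng
    reps = np.array([fit_em(simulate(psi_hat, rng)) for _ in range(B)])
    psi_bar = reps.mean(axis=0)                 # (43)
    centred = reps - psi_bar
    return centred.T @ centred / (B - 1)        # (42): sample covariance matrix

# Standard errors are the square roots of the diagonal entries:
# se = np.sqrt(np.diag(bootstrap_covariance(fit_em, simulate, psi_hat)))
```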

Page 56: Applying Finite Mixture Models Presenter: Geoff McLachlan

3. Examples of Normal Mixtures

3.1 Basic Model in Genetics

Page 57: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.2 Example 3.1:PTC Sensitivity Data

We report in Table 3, the results of Jones and McLachlan (1991) who fitted a mixture of three normal components to data on phenylthiocarbamide (PTC) sensitivity for three groups of people.

Page 58: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 3: Fit of Mixture Model to Three Data Sets (standard errors in parentheses).

Parameter             Data Set 1      Data Set 2      Data Set 3
pA                    0.572 (.027)    0.626 (.025)    0.520 (.026)
μ1                    2.49 (.15)      1.62 (.14)      1.49 (.09)
μ2                    9.09 (.18)      8.09 (.15)      7.47 (.47)
μ3                    10.37 (.28)     8.63 (.50)      9.08 (.08)
σ1²                   1.34 (.29)      1.44 (.28)      0.34 (.09)
σ2²                   2.07 (.39)      1.19 (.22)      6.23 (2.06)
σ3²                   0.57 (.33)      0.10 (.18)      0.48 (.10)

Test statistics:
-2 log λ (σ2² = σ3²)  3.60            6.87            58.36
-2 log λ (HWE)        0.00            3.76            1.06

Page 59: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.3 Example 3.2: Screening for Hemochromatosis

We consider the case study of McLaren et al. (1998) on the screening for hemochromatosis.

Page 60: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.3 Cont.

Studies have suggested that mean transferrin saturation values for heterozygotes are higher than among unaffected subjects, but lower than among homozygotes. Since the distribution of transferrin saturation is known to be well approximated by a single normal distribution in unaffected subjects, the physiologic models used in the study of McLaren et al. (1998) were a single normal component and a mixture of two normal components.

Page 61: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 4: Transferrin Saturation Results Expressed as Mean Percentage ± SD.

          Asymptomatic Individuals                              Individuals Identified by Pedigree Analysis
Sex       Postulated Unaffected   Postulated Heterozygotes      Known Heterozygotes    Known Homozygotes
Male      24.1 ± 6.0              37.3 ± 7.7                    37.1 ± 17.0            82.7 ± 14.4
Female    22.5 ± 6.4              37.6 ± 10.4                   32.5 ± 15.3            75.3 ± 19.3

Page 62: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 9: Plot of the densities of the mixture of two normal heteroscedastic components fitted to some transferrin values on asymptomatic Australians.

Page 63: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.4 Example 3.3:Crab Data

Figure 10: Plot of Crab Data

Page 64: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.4 Cont.

Progress of fit to Crab Data

Page 65: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 11: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.

Page 66: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.5 Choice of Local Maximizer

The choice of root of the likelihood equation in the case of homoscedastic components is straightforward in the sense that the MLE exists as the global maximizer of the likelihood function. The situation is less straightforward in the case of heteroscedastic components as the likelihood function is unbounded.

Page 67: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.5 Cont.

But assuming the univariate result of Hathaway (1985) extends to the case of multivariate normal components, the constrained global maximizer is consistent provided the true value of the parameter vector belongs to the parameter space constrained so that the component generalized variances are not too disparate; for example,

$$|\Sigma_i| / |\Sigma_h| \geq C > 0 \qquad (1 \leq i \neq h \leq g). \qquad (46)$$

Page 68: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.5 Cont

If we wish to proceed in the heteroscedastic case by the prior imposition of a constraint of the form (46), then there is the problem of how small the lower bound C must be to ensure that the constrained parameter space contains the true value of the parameter vector Ψ.

Page 69: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.5 Cont

Therefore, to avoid having to specify a value for C beforehand, we prefer where possible to fit the normal mixture without any constraints on the component covariance matrices Σi. This means we have to be careful to check that the EM algorithm has actually converged and is not on its way to a singularity, which exists since the likelihood is unbounded for unequal component-covariance matrices.

Page 70: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.5 Cont

Even if we can be sure that the EM algorithm has converged to a local maximizer, we have to be sure that it is not a spurious solution that deserves to be discarded. After these checks, we can take the MLE of Ψ to be the root of the likelihood equation corresponding to the largest of the remaining local maxima located.

Page 71: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.6 Choice of Model for Component-Covariance Matrices

A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice. At the same time, though, we are reluctant to impose the homoscedastic condition Σi = Σ (i = 1, …, g), as we have noted in our analyses that the imposition of the constraint of equal component-covariance matrices can have a marked effect on the resulting estimates and the implied clustering. This was illustrated in Example 3.3.

Page 72: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.7 Spurious Local Maximizers

In practice, consideration has to be given to the problem of relatively large local maxima that occur as a consequence of a fitted component having a very small (but nonzero) variance for univariate data or generalized variance (the determinant of the covariance matrix) for multivariate data.

Page 73: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.7 Cont.

Such a component corresponds to a cluster containing a few data points either relatively close together or almost lying in a lower dimensional subspace in the case of multivariate data. There is thus a need to monitor the relative size of the fitted mixing proportions and of the component variances for univariate observations, or of the generalized component variances for multivariate data, in an attempt to identify these spurious local maximizers.

Page 74: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.8 Example 3.4:Synthetic Data Set 4

Table 5: Local Maximizers for Synthetic Data Set 4.

Local Max.     log L      π1      μ1       μ2      σ1²          σ2²     σ1²/σ2²
1             -170.56    0.157   -0.764    1.359   0.752        1.602   0.4696
2             -165.94    0.020   -2.161    1.088   5.22×10⁻⁹    2.626   1.97×10⁻⁹
3 (binned)    -187.63    0.205   -0.598    1.400   0.399        1.612   0.2475

Page 75: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 12: Histogram of Synthetic Data Set 4 for fit 2 of the normal mixture density.

Page 76: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 13: Histogram of Synthetic Data Set 4 for fit 1 of the normal mixture density.

Page 77: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 14: Histogram of Synthetic Data Set 4 for fit 3 of the normal mixture density.

Page 78: Applying Finite Mixture Models Presenter: Geoff McLachlan

3.9 Example 3.5:Galaxy Data Set

Figure 15: Plot of fitted six-component normal mixture density for galaxy data set

Page 79: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 6: A Six-Component Normal Mixture Solution for the Galaxy Data Set.

Component i    πi       μi        σi²
1              0.085     9.7101   0.178515
2              0.024    16.127    0.001849
3              0.037    33.044    0.849564
4              0.425    22.920    1.444820
5              0.024    26.978    0.000300
6              0.404    19.790    0.454717

Page 80: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 Mixtures of t Distributions

The density of the multivariate t distribution with location μ, positive definite inner product matrix Σ, and ν degrees of freedom is

$$f(y; \mu, \Sigma, \nu) = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2}\, \Gamma\!\left(\frac{\nu}{2}\right) \{1 + \delta(y, \mu; \Sigma)/\nu\}^{(\nu + p)/2}}, \qquad (49)$$

where

$$\delta(y, \mu; \Sigma) = (y - \mu)^T \Sigma^{-1} (y - \mu) \qquad (50)$$

denotes the Mahalanobis squared distance between y and μ.

Page 81: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.2 ML Estimation

The updated estimates of μi and Σi (i = 1, …, g) are given by

$$\mu_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)} y_j \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)}, \qquad (53)$$

where

$$u_{ij}^{(k)} = \frac{\nu_i^{(k)} + p}{\nu_i^{(k)} + \delta(y_j, \mu_i^{(k)}; \Sigma_i^{(k)})}.$$

Page 82: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.2 Cont.

and

$$\Sigma_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)} (y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)}. \qquad (54)$$

Page 83: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.2 Cont.

It follows that νi^(k+1) is a solution of the equation

$$-\psi\!\left(\tfrac{1}{2}\nu_i\right) + \log\!\left(\tfrac{1}{2}\nu_i\right) + 1 + \frac{1}{n_i^{(k)}} \sum_{j=1}^{n} \tau_{ij}^{(k)} \left(\log u_{ij}^{(k)} - u_{ij}^{(k)}\right) + \psi\!\left(\tfrac{\nu_i^{(k)} + p}{2}\right) - \log\!\left(\tfrac{\nu_i^{(k)} + p}{2}\right) = 0, \qquad (55)$$

where $n_i^{(k)} = \sum_{j=1}^{n} \tau_{ij}^{(k)}$ and ψ(·) is the digamma function.
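Equation (55) has to be solved numerically for νi at each iteration. A minimal sketch using SciPy's digamma function and a bracketing root finder is given below; the helper update_nu, the bracketing interval, and the cap on νi are our assumptions for illustration, not part of the original algorithm's specification.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_nu(tau_i, u_i, nu_old, p):
    """Solve (55) for the updated degrees of freedom of the i-th t component.
    tau_i, u_i : current posterior probabilities tau_ij^(k) and weights u_ij^(k)
    nu_old     : nu_i^(k); p : dimension of the data."""
    n_i = tau_i.sum()
    const = (1.0
             + (tau_i * (np.log(u_i) - u_i)).sum() / n_i
             + digamma((nu_old + p) / 2.0)
             - np.log((nu_old + p) / 2.0))

    def lhs(nu):
        # left-hand side of (55), decreasing in nu
        return -digamma(nu / 2.0) + np.log(nu / 2.0) + const

    lo, hi = 0.01, 200.0
    if lhs(hi) > 0:          # root beyond the cap: component is essentially normal
        return hi
    return brentq(lhs, lo, hi)
```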

Page 84: Applying Finite Mixture Models Presenter: Geoff McLachlan

Example 4.1:Noisy Data Set

A sample of 100 points was simulated from a mixture of three bivariate normal components.

Page 85: Applying Finite Mixture Models Presenter: Geoff McLachlan

Example 4.1:Noisy Data Set

To this simulated sample 50 noise points were added from a uniform distribution over the range -10 to 10 on each variate.

Page 86: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 True Solution

Page 87: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 Normal Mixture Solution

Page 88: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 Normal + Uniform Solution

Page 89: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 t Mixture Solution

Page 90: Applying Finite Mixture Models Presenter: Geoff McLachlan

4.1 Comparison of Results

True Solution

Normal Mixture Solution

Normal + Uniform Mixture

t Mixture Solution

Page 91: Applying Finite Mixture Models Presenter: Geoff McLachlan

5. Number of Components in a Mixture Model

Testing for the number of components g in a mixture is an important but very difficult problem which has not been completely resolved. We have seen that finite mixture distributions are employed in the modeling of data with two main purposes in mind.

Page 92: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.1 Cont

One is to provide an appealing semiparametric framework in which to model unknown distributional shapes, as an alternative to, say, the kernel density method.

The other is to use the mixture model to provide a model-based clustering. In both situations, there is the question of how many components to include in the mixture.

Page 93: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.1 Cont.

In the former situation of density estimation, the commonly used criteria of AIC and BIC would appear to be adequate for choosing the number of components g for a suitable density estimate.

Page 94: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.2 Order of a Mixture Model

A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.

Page 95: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.2 Cont.

To this end, the true order g0 of the g-component mixture model

$$f(y; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y; \theta_i) \qquad (57)$$

is defined to be the smallest value of g such that all the components fi(y; θi) are different and all the associated mixing proportions πi are nonzero.

Page 96: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.3 Adjusting for Effect of Skewness on LRT

We now consider the effect of skewness on hypothesis tests for the number of components in normal mixture models. The Box-Cox (1964) transformation can be employed initially in an attempt to obtain normal components. Hence to model some univariate data y1,…,yn by a two-component mixture distribution, the density of Yj is taken to be

Page 97: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.3 Cont.

$$f(y_j; \Psi) = \{\pi_1 \phi(y_j^{(\lambda)}; \mu_1, \sigma_1^2) + \pi_2 \phi(y_j^{(\lambda)}; \mu_2, \sigma_2^2)\}\, y_j^{\lambda - 1}, \qquad (58)$$

where

$$y_j^{(\lambda)} = \begin{cases} (y_j^{\lambda} - 1)/\lambda, & \lambda \neq 0, \\ \log y_j, & \lambda = 0, \end{cases} \qquad (59)$$

Page 98: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.3 Cont.

and where the last term on the RHS of (58) corresponds to the Jacobian of the transformation from yj^(λ) to yj.

Gutierrez et al. (1995) adopted this mixture model of transformed normal components in an attempt to identify the number of underlying physical phenomena behind tomato root initiation. The observation yj corresponds to the inverse proportion of the jth lateral root which expresses GUS (j = 1, …, 40).

Page 99: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 20: Kernel density estimate and normal Q-Q plot of the yj . From Gutierrez et al. (1995).

Page 100: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 21: Kernel density estimate and normal Q-Q plot of the yj⁻¹. From Gutierrez et al. (1995).

Page 101: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.4 Example 5.1:1872 Hidalgo Stamp Issue of Mexico

Izenman and Sommer (1988) considered the modeling of the distribution of stamp thickness for the printing of a given stamp issue from different types of paper. Their main concern was the application of the nonparametric approach to identify components by the resulting placement of modes in the density estimate.

Page 102: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.4 Cont.

The specific example of a philatelic mixture, the 1872 Hidalgo issue of Mexico, was used as a particularly graphic demonstration of the combination of a statistical investigation and extensive historical data to reach conclusions regarding the mixture components.

Page 103: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figures 22-25: Fitted mixture densities for the Hidalgo stamp data (Izenman & Sommer; Basford et al.), with g = 3, g = 7, g = 7, and g = 4 (equal variances).

Page 104: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figures 26-28: Fitted mixture densities with g = 8, g = 5, and g = 7 (all with equal variances).

Page 105: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 7: Value of the Log Likelihood for g = 1 to 9 Normal Components.

Number of Components    Unrestricted Variances    Equal Variances
1                       1350                      1350.3
2                       1484                      1442.6
3                       1518                      1475.7
4                       1521                      1487.4
5                       1527                      1489.5
6                       1535                      1512.9
7                       1538                      1525.3
8                       1544                      1535.1
9                       1552                      1536.5

Page 106: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.6 Likelihood Ratio Test Statistic

An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the likelihood ratio test statistic (LRTS), -2 log λ. Suppose we wish to test the null hypothesis

$$H_0: g = g_0 \qquad (60)$$

versus

$$H_1: g = g_1 \qquad (61)$$

for some g1 > g0.

Page 107: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.6 Cont.

Usually, g1 = g0 + 1 in practice, as it is common to keep adding components until the increase in the log likelihood starts to fall away once g exceeds some threshold; the value of this threshold is then taken as the g0 in H0. Of course it can happen that the log likelihood may fall away for some intermediate values of g only to increase sharply at some larger value of g, as in Example 5.1.

Page 108: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.6 Cont.

We let Ψ̂i denote the MLE of Ψ calculated under Hi (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

$$-2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}. \qquad (62)$$

Page 109: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.7 Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

$$H_0: g = g_0 \quad \text{versus} \quad H_1: g = g_1 \qquad (63)$$

for a specified value of g0.
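A sketch of this resampling assessment of the P-value is given below. The functions fit and simulate are hypothetical stand-ins for an EM mixture-fitting routine and a sampler from the fitted null model; the +1 convention in the returned P-value is one common choice.

```python
import numpy as np

def bootstrap_lrt_pvalue(y, fit, simulate, g0, g1, B=99, rng=None):
    """Resampling assessment of the P-value of the LRTS (62) for (63).

    fit(y, g)             -> (psi_hat, loglik) for a g-component mixture fitted by EM
    simulate(psi, n, rng) -> a sample of size n drawn from the fitted g0-component model"""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    psi0, l0 = fit(y, g0)
    _, l1 = fit(y, g1)
    lrts_obs = 2.0 * (l1 - l0)                  # observed -2 log lambda

    lrts_boot = np.empty(B)
    for b in range(B):
        y_star = simulate(psi0, n, rng)         # sample from the null (g0) fit
        _, l0b = fit(y_star, g0)
        _, l1b = fit(y_star, g1)
        lrts_boot[b] = 2.0 * (l1b - l0b)

    # P-value: proportion of bootstrap LRTS values at least as large as the observed one
    return (1 + np.sum(lrts_boot >= lrts_obs)) / (B + 1)
```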

Page 110: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 29a: Acidity data set.

5.8 Application to Three Real Data Sets

Page 111: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 29b: Enzyme data set.

Page 112: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 29c: Galaxy data set.

Page 113: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 9: P-Values Using the Bootstrap LRT.

             P-Value for g (versus g+1)
Data Set     1      2      3      4      5      6
Acidity      0.01   0.08   0.44   -      -      -
Enzyme       0.01   0.02   0.06   0.39   -      -
Galaxy       0.01   0.01   0.01   0.04   0.02   0.22

Page 114: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.9 Akaike’s Information Criterion

Akaike's Information Criterion (AIC) selects the model that minimizes

$$-2 \log L(\hat{\Psi}) + 2d, \qquad (65)$$

where d is equal to the number of parameters in the model.

Page 115: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.10 Bayesian Information Criterion

The Bayesian information criterion (BIC) of Schwarz (1978) is given by

$$2 \log L(\hat{\Psi}) - d \log n \qquad (66)$$

as the penalized log likelihood to be maximized in model selection, including the present situation for the number of components g in a mixture model.

Page 116: Applying Finite Mixture Models Presenter: Geoff McLachlan

5.11 Integrated Classification Likelihood Criterion

The integrated classification likelihood (ICL) criterion selects the model that minimizes

$$-2 \log L(\hat{\Psi}) + 2\,\mathrm{EN}(\hat{\tau}) + d \log n, \qquad (67)$$

where

$$\mathrm{EN}(\hat{\tau}) = -\sum_{i=1}^{g} \sum_{j=1}^{n} \hat{\tau}_{ij} \log \hat{\tau}_{ij}$$

is the entropy of the fitted posterior probabilities of component membership.
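The three criteria (65)-(67) are simple functions of the maximized log likelihood, the number of parameters, and the fitted posterior probabilities, as the following sketch shows (NumPy assumed; BIC is written in the equivalent "smaller is better" form -2 log L + d log n, which is the negative of the penalized log likelihood in (66)).

```python
import numpy as np

def information_criteria(loglik, d, n, tau):
    """Model-selection criteria for a fitted g-component mixture.
    loglik : maximized log likelihood log L(psi_hat)
    d      : number of free parameters; n : sample size
    tau    : n x g matrix of fitted posterior probabilities."""
    aic = -2.0 * loglik + 2.0 * d                      # (65), smaller is better
    bic = -2.0 * loglik + d * np.log(n)                # (66) up to sign, smaller is better
    # entropy term EN(tau) = -sum_ij tau_ij log tau_ij, with 0 log 0 taken as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.nansum(tau * np.log(tau))
    icl = -2.0 * loglik + 2.0 * ent + d * np.log(n)    # (67)
    return {"AIC": aic, "BIC": bic, "ICL": icl}
```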

Page 117: Applying Finite Mixture Models Presenter: Geoff McLachlan

6. Mixtures of Nonnormal Components

We first consider the case of mixed feature variables, where some are continuous and some are categorical. We shall outline the use of the location model for the component densities, as in Jorgensen and Hunt (1996), Lawrence and Krzanowski (1996), and Hunt and Jorgensen (1999).

Page 118: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.1 Cont.

The ML fitting of commonly used components, such as the binomial and Poisson, can be undertaken within the framework of a mixture of generalized linear models (GLMs). This mixture model also has the capacity to handle the regression case, where the random variable Yj for the jth entity is allowed to depend on the value xj of a vector x of covariates. If the first element of x is taken to be one, then we can specialize this model to the nonregression situation by setting all but the first element in the vector of regression coefficients to zero.

Page 119: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.1 Cont One common use of mixture models with discrete data is to handle overdispersion in count data. For example, in medical research, data are often collected in the form of counts, corresponding to the number of times that a particular event of interest occurs. Because of their simplicity, one-parameter distributions for which the variance is determined by the mean are often used, at least in the first instance to model such data. Familiar examples are the Poisson and binomial distributions, which are members of the one-parameter exponential family.

Page 120: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.1 Cont

However, there are many situations where these models are inappropriate, in the sense that the mean-variance relationship implied by the one-parameter distribution being fitted is not valid. In most of these situations, the data are observed to be overdispersed; that is, the observed sample variance is larger than that predicted by inserting the sample mean into the mean-variance relationship. This phenomenon is called overdispersion.

Page 121: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.1 Cont

These phenomena are also observed with the fitting of regression models, where the mean (say, of the Poisson or the binomial distribution), is modeled as a function of some covariates. If this dispersion is not taken into account, then using these models may lead to biased estimates of the parameters and consequently incorrect inferences about the parameters.

Page 122: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.1 Cont

Concerning mixtures for multivariate discrete data, a common application arises in latent class analyses, in which the feature variables (or response variables in a regression context) are taken to be independent in the component distributions. This latter assumption allows mixture models in the context of a latent class analysis to be fitted within the above framework of mixtures of GLMs.

Page 123: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.2 Mixed Continuous and Categorical Variables

We consider now the problem of fitting a mixture model

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \qquad (70)$$

to some data y = (y1^T, …, yn^T)^T, where some of the feature variables are categorical.

Page 124: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.2 Cont.

The simplest way to model the component densities of these mixed feature variables is to proceed on the basis that the categorical variables are independent of each other and of the continuous feature variables, which are taken to have, say, a multivariate normal distribution. Although this seems a crude way in which to proceed, it often does well in practice as a way of clustering mixed feature data. In the case where there are data of known origin available, this procedure is known as the naive Bayes classifier.

Page 125: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.2 Cont.

We can refine this approach by adopting the location model. Suppose that p1 of the p feature variables in Yj are categorical, where the qth categorical variable takes on mq distinct values (q = 1, …, p1). Then there are

$$m = \prod_{q=1}^{p_1} m_q$$

distinct patterns of these p1 categorical variables. With the location model, the p1 categorical variables are replaced by a single multinomial random variable Yj^(1) with m cells; that is, (Yj^(1))s = 1 if the realizations of the p1 categorical variables in Yj correspond to the sth pattern.

Page 126: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.2 Cont.

Any associations between the original categorical variables are converted into relationships among the resulting multinomial cell probabilities. The location model assumes further that, conditional on (yj^(1))s = 1 and membership of the ith component of the mixture model, the distribution of the p − p1 continuous feature variables is normal with mean μis and covariance matrix Σi, which is the same for all cells s.

Page 127: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.2 Cont.

The intent in MULTIMIX is to divide the feature vector into as many subvectors as possible that can be taken to be independently distributed. The extreme form would be to take all p feature variables to be independent and to include correlation structure where necessary.

Page 128: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.3 Example 6.1: Prostate Cancer Data

To illustrate the approach adopted in MULTIMIX, we report in some detail a case study of Hunt and Jorgensen (1999). They considered the clustering of patients on the basis of pretrial covariates alone for some prostate cancer clinical trial data. This data set was obtained from a randomized clinical trial comparing four treatments for n = 506 patients with prostatic cancer, grouped on clinical criteria into Stages 3 and 4 of the disease.

Page 129: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.3 Cont.

As reported by Byar and Green (1980), Stage 3 represents local extension of the disease without evidence of distant metastasis, while Stage 4 represents distant metastasis as evidenced by elevated acid phosphatase, X-ray evidence, or both.

Page 130: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 10: Pretreatment Covariates.

Covariate                                  Abbreviation    Number of Levels (if Categorical)
Age                                        Age
Weight index                               WtI
Performance rating                         PF              4
Cardiovascular disease history             HX              2
Systolic blood pressure                    SBP
Diastolic blood pressure                   DBP
Electrocardiogram code                     EKG             7
Serum hemoglobin                           HG
Size of primary tumor                      SZ
Index of tumor stage & histologic grade    SG
Serum prostatic acid phosphatase           AP
Bone metastases                            BM              2

Page 131: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 11: Models and Fits.

Model    Variable Groups               No. of Parameters d    Log Lik + 11386.265
[ind]    -                             55                     0.000
[2]      {SBP,DBP}                     57                     117.542
[3,2]    {BM,WtI,HG}, {SBP,DBP}        63                     149.419
[5]      {BM,WtI,HG,SBP,DBP}           75                     169.163
[9]      Complement of {PF,HX,EKG}     127                    237.092

Page 132: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 12: Clusters and Outcomes for Treated and Untreated Patients.

Group                   Alive    Prostate Dth.    Cardio Dth.    Other Dth.

Untreated Patients
Cluster 1   Stage 3     39       18               37             33
            Stage 4     3        4                3              3
Cluster 2   Stage 3     1        4                2              3
            Stage 4     14       49               18             6

Treated Patients
Cluster 1   Stage 3     50       3                52             20
            Stage 4     4        0                1              3
Cluster 2   Stage 3     1        6                3              1
            Stage 4     25       37               22             10

Page 133: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Generalized Linear Model

With the generalized linear model (GLM) approach originally proposed by Nelder and Wedderburn (1972), the log density of the (univariate) variable Yj has the form

$$\log f(y_j; \theta_j, \kappa) = m_j \kappa^{-1} \{\theta_j y_j - b(\theta_j)\} + c(y_j; \kappa), \qquad (71)$$

where θj is the natural or canonical parameter, κ is the dispersion parameter, and mj is the prior weight.

Page 134: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

The mean and variance of Yj are given by

$$\mu_j = E(Y_j) = b'(\theta_j)$$

and

$$\mathrm{var}(Y_j) = \kappa\, b''(\theta_j)/m_j,$$

respectively, where the prime denotes differentiation with respect to θj.

Page 135: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

In a GLM, it is assumed that

$$\eta_j = h(\mu_j) = \beta^T x_j,$$

where xj is a vector of covariates or explanatory variables on the jth response yj, β is a vector of unknown parameters, and h(·) is a monotonic function known as the link function. If the dispersion parameter κ is known, then the distribution (71) is a member of the (regular) exponential family with natural or canonical parameter θj.

Page 136: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

The variance of Yj is the product of two terms, the dispersion parameter κ and the variance function b''(θj), which is usually written in the form

$$\mathrm{var}(Y_j) = \kappa\, V(\mu_j)/m_j.$$

So-called natural or canonical links occur when θj = ηj; these are, respectively, the log and logit functions for the Poisson and binomial distributions.

Page 137: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

The likelihood equation for β can be expressed as

$$\sum_{j=1}^{n} m_j\, w(\mu_j)(y_j - \mu_j)\, \eta'(\mu_j)\, x_j = 0, \qquad (73)$$

where η'(μj) = dηj/dμj and w(μj) is the weight function defined by

$$w(\mu_j) = 1\big/\big[\{\eta'(\mu_j)\}^2 V(\mu_j)\big].$$

It can be seen that, for fixed κ, the likelihood equation for β is independent of κ.

Page 138: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

The likelihood equation (73) can be solved iteratively using Fisher's method of scoring, which for a GLM is equivalent to using iteratively reweighted least squares (IRLS); see Nelder and Wedderburn (1972). On the (k+1)th iteration, we form the adjusted response variable ỹj as

$$\tilde{y}_j^{(k)} = \eta_j^{(k)} + (y_j - \mu_j^{(k)})\, \eta'(\mu_j^{(k)}). \qquad (74)$$

Page 139: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.4 Cont.

These n adjusted responses are then regressed on the covariates x1, …, xn using weights m1 w(μ1^(k)), …, mn w(μn^(k)). This produces an updated estimate β^(k+1) for β, and hence updated estimates μj^(k+1) for the μj, for use in the right-hand side of (74) to update the adjusted responses, and so on. This process is repeated until changes in the estimates are sufficiently small.

Page 140: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.5 Mixtures of GLMs

For a mixture of g component distributions of GLMs in proportions π1, …, πg, the density of the jth response variable Yj is given by

$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_{ij}, \kappa_i), \qquad (75)$$

where, for a fixed dispersion parameter κi,

$$\log f_i(y_j; \theta_{ij}, \kappa_i) = m_j \kappa_i^{-1} \{\theta_{ij} y_j - b(\theta_{ij})\} + c(y_j; \kappa_i). \qquad (76)$$

Page 141: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.5 Cont.

For the ith component GLM, we let μij be the mean of Yj, hi(μij) the link function, and

$$\eta_{ij} = h_i(\mu_{ij}) = \beta_i^T x_j$$

the linear predictor (i = 1, …, g).

A common model for expressing the ith mixing proportion πi as a function of x is the logistic. Under this model, corresponding to the jth observation yj with covariate vector xj, we have

Page 142: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.5 Cont.

$$\pi_i(x_j; \alpha) = \frac{\exp(w_i^T x_j)}{1 + \sum_{h=1}^{g-1} \exp(w_h^T x_j)} \qquad (i = 1, \ldots, g), \qquad (78)$$

where α = (w1^T, …, w_{g-1}^T)^T, with wg = 0, contains the logistic regression coefficients. The first element of xj is usually taken to be one, so that the first element of each wi is an intercept.

The EM algorithm of Dempster et al. (1977) can be applied to obtain the MLE of Ψ as in the case of a finite mixture of arbitrary distributions.

Page 143: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.5 Cont.

As the E-step is essentially the same as for arbitrary component densities, we move straight to the M-step. If β1, …, βg have no elements in common a priori, then (82) reduces to solving

$$\sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k)})\, w(\mu_{ij})(y_j - \mu_{ij})\, \eta_i'(\mu_{ij})\, x_j = 0 \qquad (83)$$

separately for each βi to produce βi^(k+1) (i = 1, …, g).

Page 144: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 A General ML Analysis of Overdispersion in a GLM

In an extension to a GLM for overdispersion, a random effect Uj

can be introduced additively into a GLM on the same scale as the linear predictor, as proposed by Aitkin (1996). This extension in a two-level variance component GLM has been considered recently by Aitkin (1999).

Page 145: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 Cont.

For an unobservable random effect uj for the jth response on the same scale as the linear predictor, we have that

$$\eta_j = \beta^T x_j + \sigma u_j,$$

where uj is a realization of a random variable Uj distributed N(0, 1) independently of the jth response Yj (j = 1, …, n).

Page 146: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 Cont.

The (marginal) log likelihood is thus

$$\log L(\Psi) = \sum_{j=1}^{n} \log \int f(y_j; \theta_j, \kappa, u)\, \phi(u)\, du. \qquad (84)$$

The integral (84) does not exist in closed form except for a normally distributed response yj. Following the development in Anderson and Hinde (1988), Aitkin (1996, 1999) suggested that it be approximated by Gaussian quadrature, whereby the integral over the normal distribution of U is replaced by a finite sum of g Gaussian quadrature mass points ui with masses πi; the ui and πi are given in standard references, for example, Abramowitz and Stegun (1964).

Page 147: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 Cont.

The log likelihood so approximated thus has the form of that of a g-component mixture model,

$$\log L(\Psi) \approx \sum_{j=1}^{n} \log \sum_{i=1}^{g} \pi_i f(y_j; \theta_{ij}, \kappa, u_i),$$

where the masses π1, …, πg correspond to the (known) mixing proportions, and the corresponding mass points u1, …, ug to the (known) parameter values. The linear predictor for the jth response in the ith component of the mixture is

$$\eta_{ij} = \beta^T x_j + \sigma u_i \qquad (i = 1, \ldots, g).$$

Page 148: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 Cont.

Hence in this formulation, σui is an intercept term.

Aitkin (1996, 1999) suggested treating the masses π1, …, πg as g unknown mixing proportions and the mass points u1, …, ug as g unknown values of a parameter. This g-component mixture model is then fitted using the EM algorithm, as described above.

Page 149: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.6 Cont.

In this framework, since now ui is also unknown, we can drop the scale parameter σ and define the linear predictor for the jth response in the ith component of the mixture as

$$\eta_{ij} = \beta^T x_j + u_i.$$

Thus ui acts as an intercept parameter for the ith component. One of the ui parameters will be aliased with the intercept term β0; alternatively, the intercept can be removed from the model.

Page 150: Applying Finite Mixture Models Presenter: Geoff McLachlan

6.7 Example 6.2:Fabric Faults Data

We report the analysis by Aitkin (1996), who fitted a Poisson mixture regression model to a data set on some fabric faults.

Page 151: Applying Finite Mixture Models Presenter: Geoff McLachlan

Table 14: Results of Fitting Mixtures of Poisson Regression Models (standard errors in parentheses).

g    β0        β1         u1         u2         u3         Deviance
2    -2.979    0.800      0.609      -0.156                49.364
     (SE)      (0.201)    (0.203)    (0.797)
3    -2.972    0.799      0.611      -0.154     -0.165     49.364
     (SE)      (0.201)    (0.202)    (0.711)    (0.087)

Page 152: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.1 Mixture Models for Failure-Time Data

It is only in relatively recent times that the potential of finite mixture models has started to be exploited in survival and reliability analyses.

Page 153: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.1 Cont.

In the analysis of failure-time data, it is often necessary to consider different types of failure. For simplicity of exposition, we shall consider the case of g = 2 different types of failure or causes, but the results extend to an arbitrary number g. An item is taken to have failed with the occurrence of the first failure from either cause, and we observe the time T to failure and the cause of failure δ (δ = 1, 2). In the case where the study terminates before failure occurs, T is the censoring time and the censoring indicator δ is set equal to zero to indicate that the failure time is right-censored.

Page 154: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.1 Cont.

The traditional approach to the modeling of the distribution of failure time in the case of competing risks is to postulate the existence of so-called latent failure times, T1 and T2, corresponding to the two causes and to proceed with the modeling of T=min (T1,T2) on the basis that the two causes are independent of each other.

Page 155: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.1 Cont.

An alternative approach is to adopt a two-component mixture model, whereby the survival function of T is modeled as

$$S(t; x) = \pi_1(x) S_1(t; x) + \pi_2(x) S_2(t; x), \qquad (85)$$

where the ith component survival function Si(t; x) denotes the conditional survival function given that failure is due to the ith cause, and πi(x) is the probability of failure from the ith cause (i = 1, 2).

Page 156: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.1 Cont.

Here x is a vector of covariates associated with the item. It is common to assume that the mixing proportions πi(x) have the logistic form

$$\pi_1(x; \alpha) = \exp(a + b^T x)/\{1 + \exp(a + b^T x)\}, \qquad \pi_2(x; \alpha) = 1 - \pi_1(x; \alpha), \qquad (86)$$

where α = (a, b^T)^T is the vector of logistic regression coefficients.

Page 157: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.2 ML Estimation for Mixtures of Survival Functions

The log likelihood for Ψ that can be formed from the observed data y is given by

$$\log L(\Psi) = \sum_{j=1}^{n} \Big[ I_{[1]}(\delta_j) \log\{\pi_1(x_j; \alpha) f_1(t_j; x_j)\} + I_{[2]}(\delta_j) \log\{\pi_2(x_j; \alpha) f_2(t_j; x_j)\} + I_{[0]}(\delta_j) \log S(t_j; x_j) \Big], \qquad (88)$$

where I[h](δj) is the indicator function that equals one if δj = h (h = 0, 1, 2).
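A direct transcription of (85), (86), and (88) into code might look as follows; f1, f2, S1, and S2 are user-supplied component density and survival functions, and the single-covariate logistic form is an assumption made purely for this sketch.

```python
import numpy as np

def mixture_survival_loglik(t, delta, x, alpha, f1, S1, f2, S2):
    """Log likelihood (88) for the two-component competing-risks mixture (85)-(86).
    t     : observed failure/censoring times
    delta : 1 or 2 for the observed cause of failure, 0 if right-censored
    x     : one covariate per item (1-D array, a simplifying assumption)
    alpha : (a, b), logistic regression coefficients in (86)
    f_i(t, x), S_i(t, x) : component densities and survival functions."""
    a, b = alpha
    pi1 = 1.0 / (1.0 + np.exp(-(a + b * x)))              # logistic form (86)
    pi2 = 1.0 - pi1
    S = pi1 * S1(t, x) + pi2 * S2(t, x)                   # mixture survival function (85)
    ll = np.where(delta == 1, np.log(pi1 * f1(t, x)),
         np.where(delta == 2, np.log(pi2 * f2(t, x)),
                  np.log(S)))
    return float(np.sum(ll))
```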

Page 158: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.3 Example 7.1:Heart-Valve Data

To illustrate the application of mixture models for competing risks in practice, we consider the problem studied in Ng et al. (1999). They considered the use of the two-component mixture model (85) to estimate the probability that a patient aged x years would undergo a rereplacement operation after having his/her native aortic valve replaced by a xenograft prosthesis.

Page 159: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.3 Cont.

At the time of the initial replacement operation, the surgeon has the choice of using either a mechanical valve or a biologic valve, such as a xenograft (made from porcine valve tissue) or an allograft (human donor valve). Modern day mechanical valves are very reliable, but a patient must take blood-thinning drugs for the rest of his/her life to avoid thromboembolic events.

Page 160: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.3 Cont.

On the other hand, biologic valves have a finite working life, and so have to be replaced if the patient were to live for a sufficiently long enough time after the initial replacement operation. Thus inferences about the probability that a patient of a given age will need to undergo a rereplacement operation can be used to assist a heart surgeon in deciding on the type of valve to be used in view of the patient's age.

Page 161: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 30: Estimated probability of

reoperation at a given age of patient.

Page 162: Applying Finite Mixture Models Presenter: Geoff McLachlan

Figure 31: Conditional probability of reoperation (xenograft valve) for specified age of patient.

Page 163: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.4 Conditional Probability of a Reoperation

As a patient can avoid a reoperation by dying first, it is relevant to consider the conditional probability of a reoperation within a specified time t after the initial operation given that the patient does not die without a reoperation during this period.


Page 166: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.5 Long-term Survivor Model

In some situations where the aim is to estimate the survival distribution for a particular type of failure, a certain fraction of the population, say π1, may never experience this type of failure. This is characterized by the overall survival curve leveling off at a nonzero probability. In some applications, the surviving fraction is said to be "cured."

Page 167: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.5 Cont.

It is assumed that an entity or individual has probability π2 = 1 − π1 of failing from the cause of interest and probability π1 of never experiencing failure from this cause. In the usual framework for this problem, it is assumed further that the entity cannot fail from any other cause during the course of the study (that is, during follow-up). We let T be the random variable denoting the time to failure, where T = ∞ denotes the event that the individual will not fail from the cause of interest. The probability of this latter event is π1.

Page 168: Applying Finite Mixture Models Presenter: Geoff McLachlan

7.5 Cont.

The unconditional survival function of T can then be expressed as

$$S(t) = \pi_1 + \pi_2 S_2(t), \qquad (92)$$

where S2(t) denotes the conditional survival function for failure from the cause of interest. The mixture model (92), with the first component having mass one at T = ∞, can be regarded as a nonstandard mixture model.

Page 169: Applying Finite Mixture Models Presenter: Geoff McLachlan

8. Mixture Software8.1 EMMIX

McLachlan, Peel, Adams, and Basford http://www.maths.uq.edu.au/~gjm/emmix/emmix.html

Page 170: Applying Finite Mixture Models Presenter: Geoff McLachlan

Other Software

AUTOCLASS
BINOMIX
C.A.MAN
MCLUST/EMCLUST
MGT
MIXBIN

Page 171: Applying Finite Mixture Models Presenter: Geoff McLachlan

Other Software(cont.)

Program for Gompertz Mixtures
MPLUS
MULTIMIX
NORMIX
SNOB
Software for Flexible Bayesian Modeling and Markov Chain Sampling