Likelihood-Based Finite Mixture Models for Ordinal Data
Daniel Fernández, Fundació Sant Joan de Déu ⇒ Universitat Politècnica de Catalunya
Seminari del Servei d'Estadística Aplicada & Grup de Recerca Advanced Stochastic Modelling
Universitat Autònoma de Barcelona, Feb 27th, 2020
Acknowledgments
- Servei d'Estadística Aplicada & Grup de Recerca Advanced Stochastic Modelling – Universitat Autònoma de Barcelona
- School of Mathematics and Statistics at Victoria University of Wellington, New Zealand:
  Prof. Richard Arnold, Prof. Emer. Shirley Pledger, AProf. Ivy Liu
Outline
1 Background.
  - Ordinal data. Motivation.
  - Stereotype model. Definition and standard models.
  - Stereotype model including clustering.
2 Model fitting.
3 Example.
  - Level of depression data.
  - Ordinal data visualization: spaced mosaic plots and fuzziness.
4 Bayesian inference approach: RJMCMC.
5 Summary.
1. Ordinal data
The response variable has ordinal categorical scales. Ordinal data are widely used in areas such as marketing, social, medical and ecological science.
- Pain scale.
- Likert scale: "strongly disagree", "disagree", "agree", or "strongly agree" in a survey.
- Braun-Blanquet cover-abundance scale, very common in vegetation analysis.
- The degree of dissimilarity among the different levels of the scale is not necessarily always the same.
1. Ordinal Data and Goal
- Data represented as a matrix Y with dimensions n × m (n could be questions, m could be subjects), where
  yij ∈ {1, …, q},  i = 1, …, n,  j = 1, …, m,  with q categories.
- Note: no covariates available.
- For example, questionnaires to assess levels of depression:
  - n = 13 questions (rows).
  - m = 151 individuals (columns).
  - q = 4 categories: 1 to 4, with higher scores indicating higher levels of depression.
1. Ordinal Data and Goal
- Example: questionnaires to assess the level of depression (n = 13 questions, m = 151 individuals, q = 4 ordered categories).
- Goals:
  - Can we group patients/questions together?
  - Which questions or patients tend to be linked with higher values of the ordinal response?
1. Motivation
- Minimal research on clustering methods focused on ordinal data.
- Most current methods are based on mathematical techniques (e.g. distance-based algorithms) ⇒ neither statistical inference nor model selection.
- Recent work (Fernández et al., 2016): fuzzy biclustering via finite mixture models for ordinal data ⇒ statistical inference and model selection.
1. Motivation
Source: David Sontag, NYU
- Clusters may overlap.
- Some clusters may be "wider" than others.
- Distances can be deceiving!
- Try a probabilistic model:
  - allows overlaps
  - allows clusters of different size
  - allows a soft/fuzzy clustering
1. Motivation
Hard clustering vs. fuzzy clustering (figure)
1. Model-based clustering
- Model-based clustering: clustering via statistical models, typically finite mixture models (FMM).
- Finite mixture models: a way of clustering that reduces dimensionality and identifies patterns related to the heterogeneity of the data (e.g. rows/columns with a similar effect on the response).
1. Model-based clustering
(Figure: mixture of component densities; the red line is what we see.)
1. Model-based clustering
- Our research: model-based clustering for ordinal data, with components within the FMM ⇒ stereotype model.
1. Stereotype Model. Formulation
- Stereotype model (Anderson, J. A., 1984):

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + (φk β′)x,   k = 2, …, q

  q − 1 log-odds for categories k and 1, with the first category as baseline.
- β: the covariate coefficient vector is assumed to be the same for all categories.
- φk: "score" for response category k.
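The implied category probabilities are a softmax over the category-specific linear predictors µk + φk β′x. A minimal numerical sketch (not the author's code; all parameter values below are invented for illustration):

```python
import numpy as np

def stereotype_probs(mu, phi, beta, x):
    """Category probabilities P[y = k | x] under the stereotype model.

    mu, phi: length-q arrays with mu[0] = 0 and phi[0] = 0 (category 1 is
    the baseline); for the *ordered* stereotype model the scores satisfy
    0 = phi[0] <= ... <= phi[q-1] = 1.
    beta, x: coefficient vector and covariate vector.
    """
    eta = mu + phi * np.dot(beta, x)   # linear predictor for each category
    eta -= eta.max()                   # shift for numerical stability
    p = np.exp(eta)
    return p / p.sum()

# Invented example: q = 4 categories, two covariates
mu = np.array([0.0, 0.5, -0.2, 0.1])
phi = np.array([0.0, 0.3, 0.8, 1.0])   # ordered scores
beta = np.array([1.2, -0.7])
x = np.array([0.5, 1.0])

p = stereotype_probs(mu, phi, beta, x)
```

A quick sanity check: log(p[k]/p[0]) recovers µk + φk β′x, which is exactly the q − 1 log-odds displayed above.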
1. Stereotype Model. Formulation
- Stereotype model:

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + (φk β′)x,   k = 2, …, q

- Nothing in the stereotype model itself treats the response as ordinal.
- Including an increasing order constraint (Anderson, J. A., 1984):

  0 = φ1 ≤ φ2 ≤ · · · ≤ φq = 1,

  captures the ordinal nature of the outcomes.
- The model has received more attention since Agresti (2010, Ch. 4) discussed it in his book.
1. Stereotype Model. Scores φk Interpretation
The fitted score parameters φk determine the spacing among categories.
Level of depression data: φ1 = 0, φ2 = 0.347, φ3 = 0.853, φ4 = 1.
1. Stereotype Model. Scores φk Interpretation
Stereotype model for categories a and b:

  log( P[yij = a | x] / P[yij = b | x] )
    = log( (P[yij = a | x] / P[yij = 1 | x]) / (P[yij = b | x] / P[yij = 1 | x]) )
    = (µa − µb) + (φa − φb)β′x.

  0 = φ1 ≤ … ≤ φa ≤ … ≤ φb ≤ … ≤ φq = 1

- The larger the difference (φa − φb) ⇒ the more the odds of a and b are influenced by x.
1. Stereotype Model. Scores φk Interpretation
Stereotype model for categories a and b (as above):

  log( P[yij = a | x] / P[yij = b | x] ) = (µa − µb) + (φa − φb)β′x.

- If φa = φb ⇒ the logit is the constant µa − µb ⇒ the covariates x do not distinguish between a and b ⇒ we could collapse categories a and b in our data.
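This collapsing criterion can be checked numerically: when two scores are tied, the log-odds between those categories no longer depend on x. A small illustration with invented parameter values (categories indexed from 0 here, so categories 2 and 3 of the slide are indices 1 and 2):

```python
import numpy as np

# Invented stereotype-model parameters with phi[1] == phi[2]
# (i.e. the scores of categories 2 and 3 are tied)
mu = np.array([0.0, 0.4, 0.9, 0.2])
phi = np.array([0.0, 0.5, 0.5, 1.0])
beta = np.array([0.8, -1.1])

def log_odds(a, b, x):
    """log( P[y=a|x] / P[y=b|x] ) = (mu_a - mu_b) + (phi_a - phi_b) * beta'x."""
    return (mu[a] - mu[b]) + (phi[a] - phi[b]) * np.dot(beta, x)

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, -2.0])
lo1, lo2 = log_odds(1, 2, x1), log_odds(1, 2, x2)
# Because phi[1] == phi[2], lo1 == lo2 == mu[1] - mu[2]: the covariates do
# not distinguish the two categories, so they could be collapsed.
```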
1. Stereotype Model. Software
- Stereotype model:
  - STATA module SOREG (Lunt, 2001).
  - R package ordinalgmifs (Archer et al., 2014).
  - R package VGAM (Yee, 2008) – not able to impose the monotonic constraint on the scores.
  - R package clustord (Fernández and Ryan, soon on CRAN): https://github.com/vuw-clustering/clustord
1. Stereotype Model. Main effects
- Stereotype model:

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + φk β′x,   k = 2, …, q

- Build up β′x considering row and column effects of yij (Fernández et al., 2016).
1. Stereotype Model. Main effects
- Build up β′x considering row and column effects of yij (Fernández et al., 2016).
- Main effects model:

  log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),
  k = 2, …, q,  i = 1, …, n,  j = 1, …, m

- αi: interpreted as the effect of the rows.
- βj: interpreted as the effect of the columns.
- Identifiability constraints: Σi αi = Σj βj = 0, µ1 = 0, and 0 = φ1 ≤ φ2 ≤ · · · ≤ φq = 1.
1. Model-based clustering
- The main effects model has 2q + n + m − 5 independent parameters:

  log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),
  k = 2, …, q,  i = 1, …, n,  j = 1, …, m

- The structure αi + βj overspecifies the data ⇒ clustering via finite mixture models in order to reduce dimensionality (McLachlan, G. and Peel, D., 2000).
1. Model-based clustering - Column clustering
For example, column clustering. We change from the main effects model

  log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),   j = 1, …, m

to

  log( P[yij = k | j ∈ c] / P[yij = 1 | j ∈ c] ) = µk + φk(αi + βc),   c = 1, …, C < m,

where βc is interpreted as the effect of column cluster c.
1. Model-based clustering. Biclustering
- General formulation of model-based clustering (biclustering):

  log( P[yij = k | i ∈ r, j ∈ c] / P[yij = 1 | i ∈ r, j ∈ c] ) = µk + φk(αr + βc),   k = 2, …, q

- αr: interpreted as the effect of row cluster r.
- βc: interpreted as the effect of column cluster c.
- Constraints: α1 = β1 = 0 (or Σr αr = Σc βc = 0) and 0 = φ1 ≤ φ2 ≤ · · · ≤ φq = 1.
- The formulation is similar to a latent class model.
- Further, αr + βc can be extended to αr + βc + γrc.
- The model provides a simultaneous fuzzy clustering of the rows and columns.
1. Model-based clustering - Column clustering
The main effects stereotype model has likelihood

  L(Ω | {yij}) = ∏_{i=1}^n ∏_{j=1}^m ∏_{k=1}^q ( P[yij = k] )^{I[yij = k]},

and with the column clustering model this turns into

  L(Ω | {yij}) = ∏_{j=1}^m [ Σ_{c=1}^C κc ∏_{i=1}^n ∏_{k=1}^q ( P[yij = k | j ∈ c] )^{I[yij = k]} ],

where κc is the proportion of columns in column group c.
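A numerically stable way to evaluate this column-clustered likelihood is to work on the log scale and apply a logsumexp over clusters for each column. A minimal sketch with invented toy parameters (not the clustord implementation):

```python
import numpy as np

def column_mixture_loglik(Y, kappa, theta):
    """Log-likelihood of the column-clustering mixture.

    Y:     n x m data matrix with entries in {0, ..., q-1}.
    kappa: length-C vector of column-cluster proportions.
    theta: C x n x q array, theta[c, i, k] = P[y_ij = k | j in cluster c].
    """
    C, n, q = theta.shape
    log_theta = np.log(theta)
    loglik = 0.0
    for j in range(Y.shape[1]):
        # log( kappa_c * prod_i theta[c, i, Y[i, j]] ) for each cluster c
        terms = np.log(kappa) + np.array(
            [log_theta[c, np.arange(n), Y[:, j]].sum() for c in range(C)])
        m = terms.max()                      # logsumexp for stability
        loglik += m + np.log(np.exp(terms - m).sum())
    return loglik

# Invented toy example: n = 3 rows, m = 5 columns, q = 2 categories, C = 2
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(3, 5))
kappa = np.array([0.4, 0.6])
theta = np.full((2, 3, 2), 0.5)              # uniform cell probabilities
ll = column_mixture_loglik(Y, kappa, theta)
```

With uniform cell probabilities each column contributes log(0.5³) regardless of the cluster weights, which gives an easy correctness check.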
1. Model-based clustering
- Problem: missing information.
- We do not know the actual cluster membership of the columns (rows), nor the number of column (row) clusters.
- κc is the proportion of columns in column group c.
2. Model fitting
Model fitting
2. Model fitting
- EM algorithm for finding the ML solution for the parameters of models with missing information (the actual unknown cluster membership of each row and column).
- Information criteria (AIC, BIC, …).
- Comprehensive simulation study (4500 scenarios) testing 12 information criteria (Fernández and Arnold, 2016).
2. Model fitting
Table: Information criteria summary table

Criterion  Definition                                          Proposed for       Depending on
AIC        −2ℓ + 2K                                            Regression models  Number of parameters
AICc       AIC + 2K(K+1)/(n−K−1)                               Regression models  Number of parameters and sample size
AICu       AICc + n log(n/(n−K−1))                             Regression models  Number of parameters and sample size
CAIC       −2ℓ + K(1 + log(n))                                 Regression models  Number of parameters and sample size
BIC        −2ℓ + K log(n)                                      Regression models  Number of parameters and sample size
AIC3       −2ℓ + 3K                                            Clustering         Number of parameters
CLC        −2ℓ + 2EN(R)                                        Clustering         Entropy
NEC(R)     EN(R) / (ℓ(R) − ℓ(1))                               Clustering         Entropy
ICL-BIC    BIC + 2EN(R)                                        Clustering         Number of parameters, sample size and entropy
AWE        −2ℓc + 2K(3/2 + log(n))                             Clustering         Number of parameters, sample size and entropy
L          −ℓ + (K/2) Σr log(nπr/12) + (R/2) log(n/12)         Clustering         Number of parameters, sample size
           + R(K+1)/2                                                             and mixing proportions

Notes: n represents the sample size, K the number of parameters, R the number of clusters, πr the mixing cluster proportions, ℓ the log-likelihood and EN(·) the entropy function.
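Several of these criteria are simple functions of the maximized log-likelihood, the parameter count and, for the entropy-based ones, the posterior membership probabilities. A hedged sketch of a few of them (formulas as tabulated above; the numbers fed in are invented purely to exercise the formulas):

```python
import numpy as np

def entropy(Z):
    """EN(R) = -sum_{i,r} z_ir log z_ir for posterior probabilities Z (n x R)."""
    Zc = np.clip(Z, 1e-300, 1.0)   # avoid log(0)
    return -np.sum(Z * np.log(Zc))

def criteria(loglik, K, n, Z):
    """AIC, AICc, BIC and ICL-BIC as defined in the table above."""
    aic = -2 * loglik + 2 * K
    aicc = aic + 2 * K * (K + 1) / (n - K - 1)
    bic = -2 * loglik + K * np.log(n)
    icl_bic = bic + 2 * entropy(Z)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "ICL-BIC": icl_bic}

# Invented posterior memberships and fit summary
Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
out = criteria(loglik=-210.0, K=7, n=151, Z=Z)
```

ICL-BIC penalizes fuzzy (high-entropy) classifications on top of the BIC penalty, which is why it always sits at or above BIC.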
2. Model fitting. Simulation study
- Simulated data with a known true number of row clusters.
- General results: percentage of cases in which each criterion determines the true number of row clusters (Fit).
2. Model fitting. One-dimensional Clustering
Table: Top 5. Overall results. One-dimensional clustering
Criterion  Overall  Scenario 1  Scenario 2  Scenario 3  Scenario 4  Scenario 5
AIC        93.8%    91.4%       97.6%       88.0%       92.9%       99.1%
AICc       89.8%    90.2%       94.8%       74.7%       91.1%       98.2%
AICu       82.4%    79.0%       80.0%       66.7%       88.0%       98.2%
AIC3       67.7%    61.7%       65.6%       56.7%       56.4%       98.2%
BIC        43.7%    41.2%       39.1%       40.0%       39.6%       58.7%
2. Model fitting. Biclustering
Table: Top 5. Overall results. Biclustering
Criterion  Overall  Scenario 1  Scenario 2  Scenario 3  Scenario 4  Scenario 5
AIC        86.1%    89.2%       82.3%       80.5%       85.5%       92.8%
AICc       85.6%    89.2%       81.5%       80.0%       84.5%       92.8%
AICu       84.2%    84.8%       80.7%       79.3%       83.3%       92.8%
AIC3       71.2%    75.8%       65.5%       64.7%       66.5%       83.3%
BIC        36.5%    34.5%       35.2%       33.5%       32.3%       47.2%
2. Model fitting
- EM algorithm for finding the ML solution for the parameters of models with missing information (the actual unknown cluster membership of each row and column).
- Information criteria (AIC, BIC, …).
- Comprehensive simulation study (4500 scenarios) testing 12 information criteria (Fernández and Arnold, 2016) ⇒ AIC is the best criterion.
2. Model fitting
- Two possible Bayesian approaches:
  - "Fixed" dimension: Metropolis-Hastings and Gibbs sampler.
  - Variable dimension: Reversible Jump MCMC (RJMCMC, Green, P. J., 1995).
- RJMCMC ⇒ the number of components (the dimension) is itself a parameter.
- Convergence diagnostic: Castelloe and Zimmerman method.
2. Model fitting packages
- Model-based clustering for ordinal data:
  - R package clustord (Fernández and Ryan, soon on CRAN): https://github.com/vuw-clustering/clustord
- Model-based clustering for mixed-type data:
  - R package clustMD (McParland and Gormley, 2017): https://cran.r-project.org/web/packages/clustMD/clustMD.pdf
- Model-based clustering for Gaussian data:
  - R package mclust (Scrucca, 2019): https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
3. Example
Level of Depression Data Set
3. Example. Level of Depression Dataset
- Patients admitted for deliberate self-harm at the medical departments of 3 major hospitals in Eastern Norway.
- Questionnaire designed to assess the level of depression.
- 13 questions (rows), 151 patients (columns).
- Ordinal data: 4 categories, from 1 (lower level) to 4 (higher level).
- For instance, "Sadness":

  yij = 1  I do not feel sad
        2  I feel sad most of the time
        3  I am sad all the time
        4  I am so sad or unhappy that I can't stand it

- Possible research questions:
  - Can we group patients/questions together?
  - Which questions or patients are similar?
  - Which questions or patients tend to be linked with higher values of the ordinal response?
3. Results Model Fitting - EM algorithm
Table: Level of Depression. Model Fitting (1/3)
Model (columns: R, C, npar, AIC, AICc, BIC, ICL-BIC)

Null effects µk + φk:               1  1   5   441.63  441.81  460.71  460.71
Row effects µk + φk αi:             n  1  16   428.81  430.52  489.89  489.89
Column effects µk + φk βj:          1  m  32   463.85  470.82  586.00  586.00
Main effects µk + φk (αi + βj):     n  m  43   422.54  421.50  547.67  547.67

Row clustering, µk + φk αr:
  2  1   7   415.70  416.04  442.42  442.49
  3  1   9   419.42  419.97  453.77  470.37
  4  1  11   423.36  424.17  465.35  481.86
  5  1  13   427.40  428.53  477.02  496.25
  6  1  15   430.96  432.46  488.22  488.24

Row clustering, µk + φk (αr + βj):
  2  m  34   431.02  438.92  560.80  572.87
  3  n  20   435.91  444.82  573.33  594.32
  4  n  22   439.57  449.55  584.62  593.90
  5  n  24   443.91  455.03  596.60  599.43
  6  n  26   447.69  460.02  608.01  618.21

Row clustering, µk + φk (αr + βj + γrj):
  2  m  61   406.22  423.83  629.06  639.08
  3  n  42   424.71  491.57  668.25  776.26
  4  n  55   426.25  558.47  680.49  681.49
  5  n  68   549.95  585.80  681.88  684.89
  6  n  81   531.77  630.58  707.40  717.40
3. Results Model Fitting - EM algorithm
Table: Level of Depression. Model Fitting (2/3)
Model (columns: R, C, npar, AIC, AICc, BIC, ICL-BIC)

Column clustering, µk + φk βc:
  1  2   7   412.46  412.81  439.18  463.05
  1  3   9   418.12  418.67  452.47  482.00
  1  4  11   421.90  422.71  463.89  515.37
  1  5  13   426.43  427.56  476.06  507.19
  1  6  15   429.96  431.46  487.22  547.28

Column clustering, µk + φk (αi + βc):
  n  2  18   410.13  415.81  520.82  526.18
  n  3  20   397.28  409.28  561.54  565.73
  n  4  22   401.23  413.55  607.22  609.89
  n  5  24   412.15  447.29  671.71  675.77
  n  6  26   460.91  513.21  770.10  772.98

Column clustering, µk + φk (αi + βc + γic):
  n  2  29   534.06  538.66  664.21  669.38
  n  3  42   436.57  439.24  512.92  542.04
  n  4  55   440.43  443.66  524.41  549.82
  n  5  68   444.03  447.89  535.64  554.73
  n  6  81   450.14  454.68  549.38  595.48
3. Results Model Fitting - EM algorithm
Table: Level of Depression. Model Fitting (3/3)
Model (columns: R, C, npar, AIC, AICc, BIC, ICL-BIC)

Biclustering, µk + φk (αr + βc):
  2  2   9   421.76  422.31  456.11  498.31
  2  3  11   419.64  420.20  454.00  490.75
  2  4  13   425.74  426.88  475.37  549.88
  2  5  15   431.31  432.81  488.56  572.19
  3  2  11   423.22  424.03  465.20  517.86
  3  3  13   476.66  477.79  501.77  526.29
  3  4  15   439.87  441.37  497.13  522.80
  3  5  17   435.21  437.13  500.10  567.88
  4  2  13   482.98  484.11  492.13  532.60
  4  3  15   433.70  435.20  490.96  550.30
  4  4  17   435.22  437.14  500.11  571.15
  4  5  19   464.04  466.44  536.56  568.45

Biclustering, µk + φk (αr + βc + γrc):
  2  2  10   427.97  429.10  477.59  527.43
  2  3  13   422.00  422.68  460.17  486.88
  2  4  16   434.39  436.09  495.46  520.85
  2  5  19   438.61  441.01  511.13  538.56
  3  2  13   497.76  498.89  505.27  547.38
  3  3  17   433.91  435.84  498.80  540.76
  3  4  21   441.89  444.83  522.05  559.23
  3  5  25   453.08  457.27  548.50  615.81
  4  2  16   445.85  447.55  506.92  528.75
  4  3  21   448.82  451.76  528.98  538.18
  4  4  26   468.71  473.25  567.95  622.25
  4  5  31   530.60  537.12  619.79  648.93
3. Results. Model Selection. AIC
- Best AIC model: column clustering model with C = 3 groups of patients.
3. Results. Common Visualisation Tools
Figure: Level of Depression: Column Clustering with C=3 patient groups
3. Results. Common Visualisation Tools
Figure: Level of Depression C=3: Distribution in each group
The proportion of individuals in each cluster that had at least one episode of DSH (deliberate self-harm, a predictor of suicide (Hawron et al. 2013)) within 3 months is: 3.4%, 16%, and 28%.
3. Results. More Visualisation Tools
The fitted score parameters φk determine the spacing among categories.
Level of depression data: φ1 = 0, φ2 = 0.347, φ3 = 0.853, φ4 = 1.
3. Spaced Mosaic Plot (Fernandez et al, 2015)
- No row (question) or column (individual) groups.
- Overall distribution ⇒ frequency of each ordinal category.
3. Spaced Mosaic Plot (Fernandez et al, 2015)
- No row (question) or column (individual) groups.
- Overall distribution ⇒ frequency of each ordinal category.
- Level 2 ⇒ most common. Level 4 ⇒ least common.
3. Spaced Mosaic Plot (Fernandez et al, 2015)
Figure: Level of depression data: mosaic plot for the stereotype model including column clustering with C = 3 column (patient) clusters.
3. Spaced Mosaic Plot (Fernandez et al, 2015)
- Column clusters ⇒ 3 horizontal bands.
- Height of each band ⇒ proportional to the number of patients per group (C1 = 8.6 + 21.6 + 7.8 + 4.2 = 42.2%).
- Area of each block ⇒ frequency of the 4 ordinal categories per cluster (e.g. patients of C2 ⇒ strong preference for responses at Level 1).
- Horizontal separation between blocks ⇒ spacing between adjacent ordinal categories (φ1 = 0, φ2 = 0.347, φ3 = 0.852, φ4 = 1).
- Levels 3 and 4 are very similar: φ4 − φ3 = 1 − 0.852 = 0.148.
3. Results. More Visualisation Tools. Fuzziness
Figure: Contour plot depicting the fuzzy clustering structure with C = 3 patient clusters. The left panel is unsorted; in the right panel both axes are sorted by patient cluster.
The plot shows the probability that two patients are classified in the same cluster.
4.Bayesian Inference
Bayesian Inference Approach
4. Developing RJMCMC. DAG
Figure: Directed acyclic graph: hierarchical stereotype mixture model, one-dimensional clustering. "TrGeometric" refers to a truncated geometric distribution.
4. Developing RJMCMC. Sweep
Table: RJMCMC Moves

Block          Move    Proposal constants    Pr(Move)          Move type
1 Hyperpar.    σ²µ     νσµ = 3, δσµ = 40     1                 M-H
               σ²α     νσα = 3, δσα = 40     1                 M-H
               σ²β     νσβ = 3, δσβ = 40     1                 M-H
2 General      µk      σ²µp = 0.3            1                 M-H
  parameters   φk                            1                 M-H
               βj      σ²βp = 0.3            1                 M-H
3 Cluster      αr      σ²αp = 0.3            pα = 0.35         M-H
  parameters   πr      σ²πp = 0.3            pπ = 0.35         M-H
               Split   p = 0.3               pS = p·ρ/(1+ρ)    RJ
               Merge                         pM = p·1/(1+ρ)    RJ
4. Developing RJMCMC. Split Step
- The split and merge steps involve αr and πr.
- The steps have to be reversible and keep the constraints Σ_{r=1}^R αr = 0 and Σ_{r=1}^R πr = 1.
- Split move:
  1. Draw u1, u2 ∼ U(0, 1) and one r ∈ {1, …, R}.
  2. New parameters:

     αr^(t) = u1 αr^(t−1),   α_{r+1}^(t) = (1 − u1) αr^(t−1)
     πr^(t) = u2 πr^(t−1),   π_{r+1}^(t) = (1 − u2) πr^(t−1)

  3. Increase R by 1.
  4. Relabel r + 1, …, R as r + 2, …, R + 1.
4. Developing RJMCMC. Merge Step
- The split and merge moves involve αr and πr.
- The moves have to be reversible and keep the constraints Σ_{r=1}^R αr = 0 and Σ_{r=1}^R πr = 1.
- Merge move:
  1. Draw one random component r ∈ {1, …, R − 1}.
  2. Select the adjacent component r + 1.
  3. New parameters:

     αr^(t) = αr^(t−1) + α_{r+1}^(t−1)
     πr^(t) = πr^(t−1) + π_{r+1}^(t−1)

  4. Reduce R by 1.
  5. Relabel r + 2, …, R as r + 1, …, R − 1.
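The split and merge proposals can be sketched directly; the properties to check are that Σr αr = 0 and Σr πr = 1 are preserved, and that merging a freshly split pair recovers the original state (reversibility). A minimal illustration with invented values (acceptance ratios and relabeling of later components omitted):

```python
import numpy as np

def split(alpha, pi, r, u1, u2):
    """Split component r into (r, r+1) with weights u1, u2 in (0, 1)."""
    a = np.insert(alpha, r + 1, (1 - u1) * alpha[r]); a[r] *= u1
    p = np.insert(pi,    r + 1, (1 - u2) * pi[r]);    p[r] *= u2
    return a, p

def merge(alpha, pi, r):
    """Merge adjacent components r and r+1 by summing their parameters."""
    a = np.delete(alpha, r + 1); a[r] = alpha[r] + alpha[r + 1]
    p = np.delete(pi,    r + 1); p[r] = pi[r] + pi[r + 1]
    return a, p

alpha = np.array([-0.6, 0.2, 0.4])    # row-cluster effects, sum to 0
pi = np.array([0.5, 0.3, 0.2])        # mixing proportions, sum to 1
a2, p2 = split(alpha, pi, r=1, u1=0.3, u2=0.7)
a3, p3 = merge(a2, p2, r=1)           # merging the split pair restores the state
```

Because u1·αr + (1 − u1)·αr = αr (and likewise for πr), both constraints survive each move by construction.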
4. Example. Level of depression data set. RJMCMC
Figure: Level of Depression: Dimension (Column) visits
4. Example. Level of depression data set. RJMCMC
Figure: Level of Depression C=3: Distribution in each group
5. Summary. Conclusions
- Clustering rows (columns) of ordinal data allows us to:
  - Describe the data with fewer parameters than current methods.
  - Identify similar rows (i.e. questions) and/or similar columns (i.e. subjects).
  - Find an a posteriori classification.
- Likelihood-based stereotype models ⇒ inference and model comparison.
- The fitted score parameters φk give the spacing among ordinal categories, dictated by the data.
- Data visualisation tools for clustered ordinal data: spaced mosaic plots, fuzziness.
- Model fitting ⇒ EM algorithm (AIC), RJMCMC (number of cluster components as a parameter).
References
- Anderson, J. A. (1984). Regression and ordered categorical variables. JRSS Series B, 46(1):1-30.
- Castelloe, J. and Zimmerman, D. (2002). Convergence assessment for RJMCMC samplers. Technical Report 313, SAS Institute, Cary, North Carolina.
- Fernández, D., Pledger, S. and Arnold, R. (2014). Introducing spaced mosaic plots. Research Report Series, ISSN 1174-2011, 14-3, MSOR, VUW.
- Fernández, D., Arnold, R. and Pledger, S. (2016). Mixture-based clustering for the ordered stereotype model. CSDA, 93, 46-75.
- Green, P. J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika, 82:711-732.
- McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics.
- Pledger, S. and Arnold, R. (2014). Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. CSDA.
- Stephens, M. (2000). Dealing with label switching in mixture models. JRSS Series B, 62, 795-809.
Thank you
Thank you for listening!
Extra Slides
1. Stereotype Model. Response Probabilities
The stereotype model can also be described in terms of the response probabilities:

  P[yij = k | x] = exp(µk + φk(β′x)) / Σ_{ℓ=1}^q exp(µℓ + φℓ(β′x)),   k = 1, …, q,

where the probability of the baseline category is

  P[yij = 1 | x] = 1 − Σ_{ℓ=2}^q P[yij = ℓ | x].
Reparametrization of the score parameters

The constraint

  0 = φ1 ≤ φ2 ≤ · · · ≤ φq = 1

is transformed to

  −∞ ≤ ν2 ≤ ν3 ≤ · · · ≤ ν_{q−1} ≤ ∞,   where νk = logit(φk).

The previous expression may be redefined as

  −∞ ≤ ν2 ≤ ν2 + e^{z3} ≤ · · · ≤ ν_{q−2} + e^{z_{q−1}} ≤ ∞,

i.e. νk = ν_{k−1} + e^{zk} for −∞ < zk < ∞, k = 3, …, q − 1.

The inverse parametrization is

  φk = 0                                         for k = 1,
  φk = 1 / (1 + e^{−ν2})                         for k = 2,
  φk = expit[ logit(φ2) + Σ_{ℓ=3}^k e^{zℓ} ]     for k = 3, …, q − 1,
  φk = 1                                         for k = q.     (1)
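The transform from the unconstrained parameters (ν2, z3, …, z_{q−1}) to the ordered scores can be sketched in a few lines; the check of interest is that the resulting φ are monotone with φ1 = 0 and φq = 1. Inputs below are invented (q = 5):

```python
import numpy as np

def scores_from_unconstrained(nu2, z):
    """Map (nu2, z_3, ..., z_{q-1}) to ordered scores 0 = phi_1 <= ... <= phi_q = 1.

    Uses nu_k = nu_{k-1} + exp(z_k) and phi_k = expit(nu_k) for k = 2, ..., q-1.
    """
    expit = lambda v: 1.0 / (1.0 + np.exp(-v))
    # nu_2, nu_3, ..., nu_{q-1}: a strictly increasing sequence
    nu = nu2 + np.concatenate(([0.0], np.cumsum(np.exp(z))))
    return np.concatenate(([0.0], expit(nu), [1.0]))

phi = scores_from_unconstrained(nu2=-1.0, z=np.array([0.5, -0.3]))  # q = 5
```

Because exp(z) > 0 the νk sequence is increasing, and expit is monotone, so the order constraint holds for any unconstrained input — which is exactly why this parametrization is convenient for numerical optimization.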
Stereotype model reformulated as adjacent-categories logits:

  log( P[yij = k | x] / P[yij = k+1 | x] ) = (µk − µ_{k+1}) + (φk − φ_{k+1})δ′x = ηk + ϑk δ′x,   k = 1, …, q − 1,

where

  ηk = µk − µ_{k+1},   k = 1, …, q − 1,

and the relation between φk and ϑk is defined by

  ϑk = φk − φ_{k+1},   k = 1, …, q − 1,   with   φk = 1 + Σ_{t=k}^{q−1} ϑt,   k = 1, …, q − 1.

The adjacent-categories logit model is a particular case of the ordered stereotype model in which ϑk is a constant with ϑk < 1 (i.e. the φk are fixed and equally spaced).
Weighted average of the fitted scores
- Fitted response probabilities with the estimated parameters over the R row groups and the q categories:

  P[yij = k | i ∈ r] = exp(µk + φk(αr + βj)) / Σ_{ℓ=1}^q exp(µℓ + φℓ(αr + βj)),
  i = 1, …, n,  j = 1, …, m,  k = 1, …, q,  r = 1, …, R.

- Weighted average over the q categories for each row cluster:

  ȳij^r = Σ_{k=1}^q k · P[yij = k | i ∈ r],   i = 1, …, n,  j = 1, …, m,  r = 1, …, R.

- Weighted average using the fitted conditional probabilities ẑir:

  ȳij = Σ_{r=1}^R ẑir · ȳij^r,   i = 1, …, n,  j = 1, …, m.

- Mean of ȳij over the m columns:

  ȳi. = (1/m) Σ_{j=1}^m ȳij,   i = 1, …, n.
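These weighted averages are straightforward to compute from the fitted probabilities. A toy sketch for one row i, with invented parameter values (R = 2 row clusters, q = 3 categories, m = 3 columns):

```python
import numpy as np

def fitted_probs(mu, phi, alpha_r, beta_j):
    """P[y_ij = k | i in r] under the row-clustered stereotype model."""
    eta = mu + phi * (alpha_r + beta_j)
    p = np.exp(eta - eta.max())
    return p / p.sum()

mu = np.array([0.0, 0.3, -0.1])        # q = 3, baseline mu_1 = 0
phi = np.array([0.0, 0.4, 1.0])        # ordered scores
alpha = np.array([-0.5, 0.5])          # R = 2 row-cluster effects
beta = np.array([0.2, -0.2, 0.0])      # m = 3 column effects
z_i = np.array([0.7, 0.3])             # posterior memberships of row i
k = np.arange(1, 4)                    # category labels 1..q

# Average first over categories within each cluster, then over clusters
# with the posterior memberships z_ir, then over the m columns.
ybar_r = np.array([[k @ fitted_probs(mu, phi, a, b) for b in beta] for a in alpha])
ybar_ij = z_i @ ybar_r                 # length-m vector of cell averages
ybar_i = ybar_ij.mean()                # row average over the m columns
```

Each average is a convex combination of the category labels, so every value necessarily lies between 1 and q.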
1. Finite Mixtures with the Stereotype Model
Example: EM – row clustering

Define the unknown group memberships as latent variables Zir = I[i ∈ r] (i = 1, …, n, r = 1, …, R), satisfying Σ_{r=1}^R Zir = 1 and (Zi1, …, ZiR) ∼ Mult(1; π1, …, πR).

E-step: the indicator latent variables fulfil the convenient identity ∏_{r=1}^R ar^{Zir} = Σ_{r=1}^R ar Zir for any ar ≠ 0. The complete-data log-likelihood is

  ℓc(Ω | {yij}, {Zir}) = Σ_{i=1}^n Σ_{r=1}^R Zir log(πr) + Σ_{i=1}^n Σ_{j=1}^m Σ_{k=1}^q Σ_{r=1}^R Zir I(yij = k) log(θrjk),

where Ω are the parameters, θrjk = P[yij = k | i ∈ r], and ẑir = E[Zir | {yij}].

Applying Bayes' rule at iteration t:

  ẑir^(t) = E[Zir | {yij}] = P[Zir = 1 | {yij}]
          = P[{yij} | Zir = 1] P[Zir = 1] / Σ_{ℓ=1}^R P[{yij} | Ziℓ = 1] P[Ziℓ = 1]
          = πr^(t−1) ∏_{j=1}^m ∏_{k=1}^q (θrjk^(t−1))^{I(yij = k)} / Σ_{l=1}^R πl^(t−1) ∏_{j=1}^m ∏_{k=1}^q (θljk^(t−1))^{I(yij = k)}.
1. Finite Mixtures with the Stereotype Model
M-step: two separate parts, πr and the remaining parameters.

1. MLE for πr:

  πr^(t) = (1/n) Σ_{i=1}^n E[Zir | {yij}, Ω^(t−1)] = (1/n) Σ_{i=1}^n ẑir^(t),   r = 1, …, R.

2. Remaining parameters Ω: numerically maximize the conditional expectation of the complete-data log-likelihood ℓc:

  Ω̂ = argmax_Ω Σ_{i=1}^n Σ_{j=1}^m Σ_{k=1}^q Σ_{r=1}^R ẑir I(yij = k) log(θrjk).

We repeat the two-step iteration of the EM algorithm until convergence.
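The two steps can be sketched end-to-end for a small latent-class version of the row-clustering model. Note one simplification: here θrjk is updated in closed form (latent-class style) rather than by the numerical maximization over the stereotype parametrization described above; all values are invented:

```python
import numpy as np

def em_step(Y, pi, theta, q):
    """One EM iteration for row clustering: E-step responsibilities, M-step updates.

    Y: n x m matrix with entries in {0, ..., q-1}; pi: length-R; theta: R x m x q.
    """
    n, m = Y.shape
    R = len(pi)
    # E-step: log pi_r + log P[y_i | Z_ir = 1], normalized per row on the log scale
    logp = np.log(pi)[None, :] + np.array(
        [[np.log(theta[r, np.arange(m), Y[i]]).sum() for r in range(R)]
         for i in range(n)])
    logp -= logp.max(axis=1, keepdims=True)
    Z = np.exp(logp)
    Z /= Z.sum(axis=1, keepdims=True)          # hat z_ir, rows sum to 1
    # M-step: closed-form updates (latent-class simplification)
    pi_new = Z.mean(axis=0)                    # pi_r = (1/n) sum_i z_ir
    theta_new = np.zeros_like(theta)
    for k in range(q):
        theta_new[:, :, k] = Z.T @ (Y == k) / Z.sum(axis=0)[:, None]
    return Z, pi_new, theta_new

rng = np.random.default_rng(1)
Y = rng.integers(0, 3, size=(10, 4))           # n = 10, m = 4, q = 3
pi = np.array([0.5, 0.5])                      # R = 2 row clusters
theta = rng.dirichlet(np.ones(3), size=(2, 4)) # random cell probabilities
Z, pi, theta = em_step(Y, pi, theta, q=3)
```

In the actual stereotype-model EM, the θrjk update would instead maximize the weighted log-likelihood over (µk, φk, αr, βj) numerically, as stated in step 2 above.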
4. Developing RJMCMC. Priors
Table: RJMCMC Priors and Hyperparameters

Parameter   Prior distribution             Hyperparameters
σ²µ         InverseGamma(νσµ, δσµ)         νσµ = 3, δσµ = 40
µk          N(0, σ²µ)
φk          Dirichlet(λφ)                  λφ = 1
σ²α         InverseGamma(νσα, δσα)         νσα = 3, δσα = 40
αr          DegenNormal(R; 0, σ²α)
σ²β         InverseGamma(νσβ, δσβ)         νσβ = 3, δσβ = 40
βj          DegenNormal(m; 0, σ²β)
γrj         DegenNormal(R, m; 0, σ²γ)      σ²γ = 5
πr          Dirichlet(λπ)                  λπ = 1
1. Model-based clustering. Biclustering
- General formulation of model-based clustering (biclustering):

  log( P[yij = k | i ∈ r, j ∈ c] / P[yij = 1 | i ∈ r, j ∈ c] ) = µk + φk(αr + βc),   k = 2, …, q.

- Probability of the data response yrc being equal to category k:

  P[yrc = k] = exp(µk + φk(αr + βc)) / Σ_{ℓ=1}^q exp(µℓ + φℓ(αr + βc)).