
StatLearn 2018 : tutorial on Model Based Learning
Julien Jacques (Université de Lyon, Lyon 2 & ERIC EA3083)

Discover mixture models through simulations

One-dimensional data

Let’s simulate the height (taille) of 30 women and 70 men:
taille=c(rnorm(n=30,mean=162,sd=5),rnorm(n=70,mean=175,sd=8))
sexe=c(rep(1,30),rep(2,70))

hist1=hist(taille[sexe==1],breaks=seq(140,200,5),plot=F)
hist2=hist(taille[sexe==2],breaks=seq(140,200,5),plot=F)
barplot(hist1$counts,beside=TRUE,col="pink",ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
barplot(hist2$counts,beside=TRUE,col="blue",ylim=c(0,max(hist1$counts,hist2$counts)))

[Figure: overlaid histograms of taille, women in pink and men in blue (counts on the y-axis)]

Let’s try clustering with kmeans:
res=kmeans(taille,centers = 2)
table(res$cluster,sexe)

##    sexe
##      1  2
##   1  0 46
##   2 30 24

We can use the Rand Index to compare two partitions Z1 and Z2:

R = (a + d) / (a + b + c + d) = (a + d) / choose(n, 2) ∈ [0, 1]

where, among the choose(n, 2) pairs of objects:

• a: the number of pairs in the same class in Z1 and in Z2


• b: the number of pairs in the same class in Z1 and separated in Z2

• c: the number of pairs separated in Z1 and in the same class in Z2

• d: the number of pairs separated in both Z1 and Z2
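
For illustration, the (unadjusted) index can be computed in a few lines of R; rand_index below is a small helper written for this sketch, it is not an mclust function:
rand_index = function(z1, z2) {
  same1 = outer(z1, z1, "==")           # pair in the same class in Z1?
  same2 = outer(z2, z2, "==")           # pair in the same class in Z2?
  up = upper.tri(same1)                 # each of the choose(n,2) pairs once
  sum((same1 == same2)[up]) / sum(up)   # (a + d) / (a + b + c + d)
}
# e.g. rand_index(res$cluster, sexe)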

The function adjustedRandIndex() from the mclust package computes an adjusted version (ARI) of this index (the more similar the partitions, the closer the ARI is to 1).
library(mclust)

## Package 'mclust' version 5.3

## Type 'citation("mclust")' for citing this R package in publications.

cat('ARI kmeans = ',adjustedRandIndex(res$cluster,sexe),'\n')

## ARI kmeans = 0.2634361

Before looking at the theory, let’s have a look at clustering with a Gaussian mixture:
res=Mclust(taille,G = 2,verbose = F)
table(res$classification,sexe)

##    sexe
##      1  2
##   1 30 26
##   2  0 44

cat('ARI GMM = ',adjustedRandIndex(res$classification,sexe),'\n')

## ARI GMM = 0.2221024

barplot(hist1$counts,beside=TRUE,col="pink",ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
barplot(hist2$counts,beside=TRUE,col="blue",ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
plot(res,what="density",yaxt="n")


[Figure: histograms of taille (women in pink, men in blue) with the density of the fitted two-component Gaussian mixture overlaid]
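
For intuition, the fitted curve above is simply a weighted sum of two Gaussian densities. The sketch below writes this mixture density by hand, using the true simulation parameters (weights 0.3/0.7, means 162/175, standard deviations 5/8) rather than the estimates stored in the Mclust fit:
# Two-component Gaussian mixture density written explicitly
x = seq(140, 200, by = 0.5)
dens = 0.3 * dnorm(x, mean = 162, sd = 5) + 0.7 * dnorm(x, mean = 175, sd = 8)
plot(x, dens, type = "l", xlab = "taille", ylab = "density")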

Two-dimensional data

Let’s now also simulate the weight of the women and men:
poids=c(rnorm(n=30,mean=62,sd=6),rnorm(n=70,mean=77,sd=9))
data=data.frame(taille,poids)
sexe=c(rep(1,30),rep(2,70))
couleur=c(rep("pink",30),rep("blue",70))
plot(taille,poids,col=couleur,pch=16)

[Figure: scatterplot of poids against taille, women in pink and men in blue]


Let’s run kmeans and GMM:
res=kmeans(data,centers = 2)
table(res$cluster,sexe)

##    sexe
##      1  2
##   1 30 12
##   2  0 58

cat('ARI kmeans = ',adjustedRandIndex(res$cluster,sexe),'\n')

## ARI kmeans = 0.5723122

res=Mclust(data,G = 2,verbose = F)
table(res$classification,sexe)

##    sexe
##      1  2
##   1 29  8
##   2  1 62

cat('ARI GMM = ',adjustedRandIndex(res$classification,sexe),'\n')

## ARI GMM = 0.6661479
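
Since cluster labels are arbitrary, the confusion table can also be turned into a misclassification rate by keeping the best of the two possible label matchings; a minimal sketch for the two-cluster case:
# Best-matching error rate of the 2-cluster GMM partition against sexe
tab = table(res$classification, sexe)
1 - max(sum(diag(tab)), sum(diag(tab[2:1, ]))) / sum(tab)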

Classification with GMM

Download the wine dataset from: https://archive.ics.uci.edu/ml/datasets/wine
data=read.table('data/wine.txt',sep=',')
cls=data$V1
data$V1=NULL
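
As a quick sanity check (the UCI wine data contains 178 wines described by 13 variables, the cultivar being stored in the first column of the file, here assumed to be saved as data/wine.txt):
dim(data)    # expected: 178 rows and 13 variables once V1 is removed
table(cls)   # sizes of the three cultivars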

Let’s start with a representation of the data as a scatterplot matrix:
plot(data,col=cls,pch=19)


[Figure: scatterplot matrix of the wine variables V2 to V14, points coloured by class]

We don’t see anything... Let’s go for a PCA:
pc = princomp(data,cor=TRUE)
plot(pc)

[Figure: screeplot of the principal component variances, Comp.1 to Comp.9]
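
The screeplot can be complemented by the cumulative proportion of variance explained, computed from the component standard deviations returned by princomp():
prop_var = pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per component
round(cumsum(prop_var), 3)              # cumulative proportion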

biplot(pc)


[Figure: PCA biplot in the plane (Comp.1, Comp.2), showing the 178 wines and the loadings of the variables V2 to V14]

pairs(predict(pc)[,1:5],col=cls,pch=19)

[Figure: pairwise scatterplots of the first five principal component scores (Comp.1 to Comp.5), coloured by class]

In order to evaluate classification, we randomly select 1/3 of the data as a test set:
itest <- sample(1:nrow(data), nrow(data)/3)
data.train <- data[-itest,]
cls.train <- cls[-itest]


data.test <- data[itest,]
cls.test <- cls[itest]
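
Note that the split is random, so the exact figures below may differ from one run to another; fixing the seed before sampling (a choice not made in the original code, the value 2018 is arbitrary) makes the split reproducible:
set.seed(2018)
itest <- sample(seq_len(nrow(data)), floor(nrow(data)/3))   # reproducible test indices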

We train a GMM with one Gaussian component per class (the EDDA model: all classes share the same parsimonious covariance structure, selected here by BIC since no modelNames is specified):
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "EDDA",verbose=F)
summary(wineMclustDA)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
## log.likelihood   n  df       BIC
##      -1979.751 119 156 -4705.045
##
## Classes  n Model G
##       1 41   VVE 1
##       2 45   VVE 1
##       3 33   VVE 1
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  41  0  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0

#summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
## log.likelihood   n  df       BIC
##      -1979.751 119 156 -4705.045
##
## Classes  n Model G
##       1 41   VVE 1
##       2 45   VVE 1
##       3 33   VVE 1
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  41  0  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0
##
## Test classification summary:
##
##        Predicted
## Class   1  2  3
##     1  18  0  0
##     2   0 26  0
##     3   0  0 15
##
## Test error = 0

plot(wineMclustDA, dimens = 1:2,what="scatterplot")

[Figure: classification scatterplot of the training data in the plane of the first two variables (V2, V3)]
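
The test predictions can also be obtained explicitly with predict(); for an MclustDA fit the returned list is expected to contain the predicted classes in $classification and the posterior probabilities in $z:
# Explicit prediction on the held-out wines
pred = predict(wineMclustDA, newdata = data.test)
table(pred$classification, cls.test)      # test confusion matrix
mean(pred$classification != cls.test)     # test error rate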

Let’s compare with Mixture Discriminant Analysis (several Gaussian components per class), still with the EEE model:
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "MclustDA", modelNames = "EEE",verbose=F)
summary(wineMclustDA)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log.likelihood   n  df       BIC
##      -1794.149 119 312 -5079.385
##
## Classes  n Model G
##       1 41   XXX 1
##       2 45   XXX 1
##       3 33   XXX 1
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  41  0  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0

#summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log.likelihood   n  df       BIC
##      -1794.149 119 312 -5079.385
##
## Classes  n Model G
##       1 41   XXX 1
##       2 45   XXX 1
##       3 33   XXX 1
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  41  0  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0
##
## Test classification summary:
##
##        Predicted
## Class   1  2  3
##     1  18  0  0
##     2   0 26  0
##     3   0  0 15
##
## Test error = 0

plot(wineMclustDA, dimens = 1:2,what="scatterplot")


[Figure: classification scatterplot of the training data in the plane (V2, V3)]

Let’s finally try all parsimonious models:
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "MclustDA",verbose=F)
summary(wineMclustDA)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log.likelihood   n  df       BIC
##      -1975.559 119 164 -4734.895
##
## Classes  n Model G
##       1 41   EEI 2
##       2 45   VEI 3
##       3 33   EEI 4
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  40  1  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0.008403361

#summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)

## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log.likelihood   n  df       BIC
##      -1975.559 119 164 -4734.895
##
## Classes  n Model G
##       1 41   EEI 2
##       2 45   VEI 3
##       3 33   EEI 4
##
## Training classification summary:
##
##        Predicted
## Class   1  2  3
##     1  40  1  0
##     2   0 45  0
##     3   0  0 33
##
## Training error = 0.008403361
##
## Test classification summary:
##
##        Predicted
## Class   1  2  3
##     1  18  0  0
##     2   0 26  0
##     3   0  0 15
##
## Test error = 0

plot(wineMclustDA, dimens = 1:2,what="scatterplot")

[Figure: classification scatterplot of the training data in the plane (V2, V3)]


Some questions:

• how is the number of components per class chosen?

• is it the best way to do that?

Clustering with GMM

Let’s continue with the wine dataset:
data=read.table('data/wine.txt',sep=',')
cls=data$V1
data$V1=NULL

Let’s select, using BIC, the best model as well as the best number of clusters:
BIC = mclustBIC(data,verbose=F)
plot(BIC)

[Figure: BIC values of the 14 parsimonious covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV) against the number of components, from 1 to 9]
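
As a side note, the search can be restricted through the G and modelNames arguments of mclustBIC(); the call below is an illustrative variant, not part of the original lab:
# Restrict the BIC search to 2-5 clusters and two covariance models only
BIC2 = mclustBIC(data, G = 2:5, modelNames = c("EEE","VVV"), verbose = F)
summary(BIC2)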

summary(BIC)

## Best BIC values:
##              EVE,3        VVE,5      VVE,3
## BIC      -6873.246 -6884.37905 -6896.8868
## BIC diff     0.000   -11.13286   -23.6406
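
BIC is not the only possible criterion: mclust also provides mclustICL(), based on the integrated completed likelihood, which tends to favour well-separated clusters. A possible check (this call is our addition, not part of the original lab; ICL values also appear in the Mclust summary below):
ICL = mclustICL(data)
summary(ICL)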

We can have a look at the selected model:
mod1 = Mclust(data, x = BIC)
summary(mod1)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:
##
## log.likelihood   n  df       BIC       ICL
##      -3032.444 178 156 -6873.246 -6873.537
##
## Clustering table:
##  1  2  3
## 63 51 64

We can compare the best partition with the wine types:
table(mod1$classification,cls)

##    cls
##      1  2  3
##   1 59  4  0
##   2  0  3 48
##   3  0 64  0
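
As in the simulated examples, the agreement between this partition and the wine types can be summarised with the adjusted Rand index:
cat('ARI GMM = ',adjustedRandIndex(mod1$classification,cls),'\n')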
