An Introduction to Model-Based Clustering
Anish R. Shah, CFA Northfield Information Services
London Nov 17, 2011
Clustering
• Observe characteristics of some objects
– {x1, …, xN}: N objects
• Goal: group alike objects
– Say there are M clusters
– {z1, …, zN}: cluster memberships, where zk = object k’s membership, a number from 1..M
– k, j in the same cluster → xk, xj similar; k, j in different clusters → xk, xj dissimilar
Examples of Characteristics
• Clustering dog breeds
– x = (snout length / width of face, dog’s BMI, % of day spent sleeping)
• Positioning sandwich carts (in cluster centers)
– x = (location of office worker)
• Clustering country indices via returns
– x = (past 5 years of monthly returns)
• Clustering stocks via fundamentals & returns
– x = (beta to the market, dividend rate, past 2 years of monthly returns)
Machine Learning
• Rather than being programmed with rules, the system inferentially learns the patterns/rules of reality from data
• Supervised Learning
– Some of the training data is labeled
– e.g. There are 5 company types: AAPL & MSFT are type 1, ..., XOM is type 5. Find the prototype for each type and label the rest of the universe
– e.g. Amazon & Netflix recommendations
• Unsupervised Learning
– None of the data has labels
– Organize the system to maximize some criterion
– e.g. Clustering maximizes similarity within each cluster
– e.g. Principal Components Analysis maximizes explained variance
– Vanilla clustering is the canonical example of unsupervised machine learning
Review of Forms of Hard Clustering
• ‘Hard’ means an object is assigned to only one cluster
– In contrast, model-based clustering can give a probability distribution over the clusters
• Hierarchical Clustering
– Maximize distance between clusters
– Flavors come from different ways of measuring distance between two clusters
– Single Linkage: distance between the two nearest elements
– Complete Linkage: distance between the two farthest elements
– Average Linkage: mean (or median) distance between all elements
• K-Means
– Minimize mean (median in K-medians) distance within clusters
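The three linkage flavors differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, assuming Euclidean distance between elements (the slides do not fix a metric) and a hypothetical helper name `linkage_distance`:

```python
import numpy as np

def linkage_distance(a, b, kind="single"):
    """Distance between clusters a (Na x D) and b (Nb x D).

    Assumption: Euclidean distance between elements.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # All pairwise distances between an element of a and an element of b.
    pairwise = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    if kind == "single":      # two nearest elements
        return pairwise.min()
    if kind == "complete":    # two farthest elements
        return pairwise.max()
    if kind == "average":     # mean over all pairs
        return pairwise.mean()
    raise ValueError(f"unknown linkage: {kind}")
```

For clusters {0, 1} and {3, 5} on a line, the pairwise distances are 3, 5, 2 and 4, so the single, complete and average linkage distances are 2, 5 and 3.5.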
K-Means / K-Medians
• K-Means (heuristically) assigns objects to clusters to minimize the average squared distance (absolute distance in K-Medians) from object to cluster center
• Minimize $\frac{1}{N} \sum_{k=1}^{N} \lVert x_k - \mu_{z_k} \rVert^2$
– over z1..zN = cluster assignments and μ1..μM = centers of the clusters
K-Means Algorithm
1. Randomly assign objects to clusters
2. Calculate the center (mean) of each cluster
3. Check assignments for all the objects: if another center is nearer, reassign the object to that cluster
4. Repeat steps 2-3 until no reassignments occur
• Extremely fast
• The solution is only a local minimum of the objective, so several starting points are used in practice
• (K-Medians) For robustness, step 2 finds centers via median
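The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function names, the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions.

```python
import numpy as np

def _centers(x, z, n_clusters, rng):
    """Step 2: mean of the objects currently assigned to each cluster."""
    mu = np.empty((n_clusters, x.shape[1]))
    for m in range(n_clusters):
        members = x[z == m]
        # Assumption of this sketch: an empty cluster is re-seeded at a random object.
        mu[m] = members.mean(axis=0) if len(members) else x[rng.integers(len(x))]
    return mu

def k_means(x, n_clusters, max_iter=100, seed=0):
    """x: (N, D) array. Returns (assignments z, centers mu)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: random assignment (a shuffled round-robin, so no cluster starts empty).
    z = rng.permutation(len(x)) % n_clusters
    for _ in range(max_iter):
        mu = _centers(x, z, n_clusters, rng)
        # Step 3: squared distance from every object to every center.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = d2.argmin(axis=1)
        # Step 4: stop when no object changes cluster.
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, _centers(x, z, n_clusters, rng)

def objective(x, z, mu):
    """Average squared distance from each object to its cluster center."""
    x = np.asarray(x, dtype=float)
    return np.mean(((x - mu[z]) ** 2).sum(axis=1))
```

Because the result is only a local minimum, one would typically call `k_means` with several `seed` values and keep the run with the lowest `objective`.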
Mixture of Gaussians: A Model-Based Clustering Similar to K-Means
• Observe data for N objects, {x1, …, xN}
• Each cluster generates data distributed normally around its center
– when object k is from cluster m, $p(x_k) \propto \exp\left(-\lVert x_k - \mu_m \rVert^2 / (2\sigma_m^2)\right)$
• Some clusters appear more frequently than others
– given no observation information, p(an object belongs to cluster m) = πm
• Find the setup that makes the observations most likely to occur
– cluster centers {μ1 .. μM}
– variances {σ1² .. σM²}
– cluster frequencies {π1 .. πM}
Model-Based Clustering
• Observe characteristics of some objects
– {x1, …, xN}: N objects
• An object belongs to one of M clusters, but which is unknown
– {z1, …, zN}: cluster memberships, numbers from 1..M
• Some clusters are more likely than others
– P(zk=m) = πm (πm = frequency cluster m occurs)
• Within a cluster, objects’ characteristics are generated by the same distribution, which has free parameters
– P(xk|zk=m) = f(xk, λm) (λm = parameters of cluster m)
– f need not be Gaussian
Model-Based Clustering (2)
• Now you have a model connecting the observations to the cluster memberships and parameters
– $P(x_k) = \sum_{m=1}^{M} P(x_k \mid z_k = m)\, P(z_k = m) = \sum_{m=1}^{M} f(x_k, \lambda_m)\, \pi_m$
– $P(x_1 \dots x_N) = \prod_{k=1}^{N} P(x_k)$ (assuming x’s are independent)
1. Find the values of the parameters by maximizing the likelihood (usually the log of the likelihood) of the observations
– max log P(x1 … xN) over λ1 … λM and π1 … πM
– This turns out to be a nonlinear mess and is greatly aided by the “EM Algorithm” (next slide)
2. With parameters in hand, calculate the probability of membership given the observations
– P(z|x) = P(x|z) P(z) / P(x)
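Step 2 is Bayes' rule applied object by object. A small sketch, assuming the component log densities log f(xk, λm) have already been evaluated into an N x M array; the function name and the choice to work in logs (for numerical stability) are assumptions of the sketch, not something the slides prescribe.

```python
import numpy as np

def membership_probabilities(log_f, pi):
    """log_f[k, m] = log f(x_k, lambda_m); pi[m] = frequency of cluster m.

    Returns P(z_k = m | x_k) as an (N, M) array and log P(x_1 ... x_N).
    """
    log_f = np.asarray(log_f, dtype=float)
    # log of P(x_k | z_k = m) P(z_k = m)
    log_joint = log_f + np.log(np.asarray(pi, dtype=float))
    # log P(x_k) = log of the sum over clusters, computed stably
    log_px = np.logaddexp.reduce(log_joint, axis=1)
    posterior = np.exp(log_joint - log_px[:, None])
    return posterior, log_px.sum()
```

When both clusters explain an observation equally well, the posterior falls back to the cluster frequencies π; otherwise the likelihoods tilt it toward the better-fitting cluster.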
EM (Expectation-Maximization) Algorithm Setup
• Let θ = (λ1 … λM, π1 … πM), the parameters being maximized over
• Observe x. Don’t know z, the cluster memberships
• Want to maximize log p(x|θ), but it is too complicated
• EM can be used when
– It’s possible to make an approximation of p(z|x,θ), the conditional distribution of the hidden variables
– log p(x,z|θ), the log probability if all the variables were known, is easy to manipulate
The EM Algorithm
• Want to maximize over θ: $\log p(x \mid \theta) = \log \int p(x, z \mid \theta)\, dz$
• (E Step)
– Create an approximate distribution of the missing data. Call it u(z). Ideally this is p(z|x,θ)
– Let Q(θ) = the log likelihood under θ averaged by u(z) = $\int \log p(x, z \mid \theta)\, u(z)\, dz$
• (M Step)
– Maximize Q(θ) over θ
– θnew = the maximizer
• Repeat E & M steps until convergence
• EM switches between 1) finding an approximate distribution of missing data given the parameters and 2) finding better parameters given the approximation
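For a mixture of Gaussians with one variance per cluster, both steps have closed forms, so the alternation can be sketched directly. This is an illustrative sketch under stated assumptions, not the code behind the experiments: the function names, the required `mu_init` starting centers, the stopping tolerance and the variance floor are all choices made here.

```python
import numpy as np

def _e_step(x, pi, mu, var):
    """u[k, m] = p(z_k = m | x_k, theta), plus the log likelihood log p(x | theta)."""
    d = x.shape[1]
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (N, M)
    log_joint = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * d2 / var
    log_px = np.logaddexp.reduce(log_joint, axis=1)
    return np.exp(log_joint - log_px[:, None]), log_px.sum()

def em_gaussian_mixture(x, mu_init, max_iter=200, tol=1e-8, var_floor=1e-10):
    """x: (N, D) data; mu_init: (M, D) starting centers (hypothetical parameter)."""
    x = np.asarray(x, dtype=float)
    mu = np.array(mu_init, dtype=float)
    n, d = x.shape
    m = len(mu)
    pi = np.full(m, 1.0 / m)                              # equal frequencies to start
    var = np.full(m, x.var(axis=0).mean() + var_floor)    # pooled variance to start
    prev = -np.inf
    for _ in range(max_iter):
        # E step: distribution of the hidden memberships given current parameters.
        u, log_lik = _e_step(x, pi, mu, var)
        if log_lik - prev < tol:
            break
        prev = log_lik
        # M step: parameters maximizing the u-averaged complete-data log likelihood.
        nk = np.maximum(u.sum(axis=0), 1e-12)             # guard against empty clusters
        pi = nk / nk.sum()
        mu = (u.T @ x) / nk[:, None]
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (u * d2).sum(axis=0) / (d * nk) + var_floor
    # Final E step so the returned memberships match the returned parameters.
    u, log_lik = _e_step(x, pi, mu, var)
    return pi, mu, var, u, log_lik
```

Like K-Means, EM only finds a local optimum, and starting centers that sit close together can make it very slow to separate the clusters, so the choice of `mu_init` matters.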
Determining the # of Clusters: Information Criteria
• Bayesian IC (BIC), Akaike IC (AIC), corrected Akaike IC (AICc), …
• Minimum Description Length ideas
• Log-likelihood of observations penalized by the model’s complexity (# of parameters)
• My experience
– Optimal # of clusters varies greatly with the choice of criterion
– e.g. BIC says 2, AICc says 9
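The criteria named above have standard textbook forms: AIC = 2p − 2 log L, BIC = p log n − 2 log L, and AICc = AIC + 2p(p+1)/(n − p − 1), where p is the number of free parameters and n the number of observations. A short sketch follows; what counts as an "observation" and the parameter count for a one-variance-per-cluster Gaussian mixture are assumptions of this sketch, and both function names are hypothetical.

```python
import math

def n_params_gaussian_mixture(n_clusters, n_dims):
    """Free parameters, assuming one center (D numbers) and one variance per
    cluster, plus M - 1 free cluster frequencies (they sum to 1)."""
    return n_clusters * n_dims + n_clusters + (n_clusters - 1)

def information_criteria(log_lik, n_params, n_obs):
    """Lower is better for all three criteria."""
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * math.log(n_obs) - 2 * log_lik
    if n_obs - n_params - 1 > 0:
        aicc = aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    else:
        aicc = float("inf")   # the correction is undefined for so few observations
    return {"AIC": aic, "BIC": bic, "AICc": aicc}
```

To pick the number of clusters, fit the model for each candidate M, evaluate a criterion on the fitted log likelihood, and keep the M with the lowest value. The penalties grow at different rates, which is one reason the criteria can disagree.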
Experiments
• Country indices: monthly local currency returns of constituent companies weighted by √cap
• An aside: I looked into dividing index returns by VIX to account for time varying volatility, but high volatility periods shrink too much
• A period’s clusters are inferred from past 60 months equally weighted
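• 2 cluster distributions (recall P(xk|zk=m) = f(xk, λm), where λm = parameters of cluster m)
– Gaussian: $f(x_k, \lambda_m) = (2\pi\sigma_m^2)^{-D/2} \prod_{i=1}^{D} \exp\left(-\tfrac{1}{2}\left[(x_k^i - \mu_m^i)/\sigma_m\right]^2\right)$
– Laplace: $f(x_k, \lambda_m) = (2\sigma_m^2)^{-D/2} \prod_{i=1}^{D} \exp\left(-\sqrt{2}\,\lvert x_k^i - \mu_m^i \rvert/\sigma_m\right)$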
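Both densities are easier to work with in log form, where the product over the D coordinates becomes a sum. A minimal sketch of the two log densities as written above; the function names are hypothetical, and D is read off the last axis of `x`.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """log f for the Gaussian cluster: x is (..., D), mu is (D,), sigma is a scalar."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    z = (x - mu) / sigma
    return -0.5 * d * np.log(2 * np.pi * sigma ** 2) - 0.5 * np.sum(z ** 2, axis=-1)

def log_laplace(x, mu, sigma):
    """log f for the Laplace cluster, parameterized so each coordinate has variance sigma^2."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    abs_dev = np.sum(np.abs(x - mu), axis=-1)
    return -0.5 * d * np.log(2 * sigma ** 2) - np.sqrt(2) * abs_dev / sigma
```

The Laplace density penalizes deviations linearly rather than quadratically, so a single extreme month moves the fit less than it does under the Gaussian.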
Jan 1998 – Gaussian, 6 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 33%, 27%, 25%, 10%, 2%, 2%.]
Jan 1998 – Gaussian, 5 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 33%, 27%, 15%, 23%, 2%.]
Jan 1998 – Gaussian, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 31%, 27%, 17%, 25%.]
Jan 1998 – Gaussian, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 37%, 27%, 35%.]
Jan 1998 – Gaussian, 2 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 73%, 27%.]
Jan 1998 – Laplace, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 94%, 2%, 2%, 2%.]
Clusters estimated under a different distribution:
Jan 1998 – Laplace, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 96%, 2%, 2%.]
Jan 2001 – Gaussian, 6 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 33%, 27%, 15%, 15%, 8%, 2%.]
Jan 2001 – Gaussian, 5 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 25%, 36%, 29%, 8%, 2%.]
Jan 2001 – Gaussian, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 21%, 40%, 31%, 8%.]
Jan 2001 – Gaussian, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 21%, 40%, 40%.]
Jan 2001 – Gaussian, 2 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 5%). Cluster sizes as extracted: 56%, 44%.]
Jan 2004 – Gaussian, 6 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 29%, 25%, 21%, 19%, 4%, 2%.]
Jan 2004 – Gaussian, 5 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 27%, 27%, 17%, 2%.]
Jan 2004 – Gaussian, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 33%, 33%, 31%, 2%.]
Jan 2004 – Gaussian, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 20%). Cluster sizes as extracted: 33%, 34%, 33%.]
Jan 2004 – Gaussian, 2 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 56%, 44%.]
Jan 2007 – Gaussian, 6 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 27%, 27%, 13%, 4%, 2%.]
Jan 2007 – Gaussian, 5 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 27%, 25%, 19%, 2%.]
Jan 2007 – Gaussian, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 25%, 27%, 21%.]
Jan 2007 – Gaussian, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 27%, 46%.]
Jan 2007 – Gaussian, 2 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 44%, 56%.]
Jan 2010 – Gaussian, 6 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 23%, 19%, 17%, 12%, 2%.]
Jan 2010 – Gaussian, 5 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 31%, 37%, 15%, 15%, 2%.]
Jan 2010 – Gaussian, 4 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 29%, 38%, 15%, 19%.]
Jan 2010 – Gaussian, 3 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 27%, 42%, 31%.]
Jan 2010 – Gaussian, 2 clusters
[Figure: chart over the country indices (x-axis: countries; y-axis: 0% to 100%), plus a panel "Spread (MSE of monthly returns) and Size (% of probability mass)" with spread in % return/month (axis ticks 0% and 10%). Cluster sizes as extracted: 44%, 56%.]
Closing Remarks
• Idea works with different distributions
– Here, deviations from cluster centers were Gaussian or Laplace
• Can dress up the model to account for interactions
– e.g. Countries A, B, and C are trading partners, thus more likely to belong to the same cluster
• Clustering helps identify what a market or security is, i.e. what forecasting model to use for it
• Purpose of presentation: switching from applying filters to thinking about underlying mathematical models
– gives you your own custom tool set
– makes understanding what something does infinitely easier