Generalized Model Selection for Unsupervised Learning in High Dimensions

Vaithyanathan and Dom, IBM Almaden Research Center
NIPS'99
Abstract

• Bayesian approach to model selection in unsupervised learning
– proposes a unified objective function whose arguments include both the feature space and the number of clusters
• determining the feature set (dividing the feature set into noise features and useful features)
• determining the number of clusters
– marginal likelihood with a Bayesian scheme vs. cross-validation (cross-validated likelihood)
• DC (distributional clustering of terms) for initial feature selection
Model Selection in Clustering

• Bayesian approaches 1), cross-validation techniques 2), MDL approaches 3)
• Need for a unified objective function
– the optimal number of clusters depends on the feature space in which the clustering is performed
– c.f. feature selection in clustering
Model Selection in Clustering (Cont'd)

• Generalized model for clustering
– data D = {d_1, …, d_n}, feature space T with dimension M
– maximization of the likelihood P(D^T | Ω, Θ), where Ω (with parameter vector Θ) is the structure of the model: the number of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters
• Bayesian approach to model selection
– regularization using the marginal likelihood
Bayesian Approach to Model Selection for Clustering

• Data
– data D = {d_1, …, d_n}, feature space T with dimension M
• Clustering D
– finding \hat{\Omega} and \hat{\Theta} such that

(\hat{\Omega}, \hat{\Theta}) = \arg\max_{\Omega, \Theta} P(D^T \mid \Omega, \Theta)   (1)

– where \Omega is the structure of the model and \Theta is the set of all parameter vectors
– the model structure consists of the number of clusters + the partitioning of the feature set and the assignment of patterns to clusters
Assumptions

1. The feature sets represented by U and N are conditionally independent, and so is the data:

P(D^T \mid \Omega, \Theta) = P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta)

(\hat{\Omega}, \hat{\Theta})^{(1)} = \arg\max_{\Omega, \Theta} \{ P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta) \}   (2)

2. The data D = {d_1, …, d_n} is i.i.d.:

P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta) = \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, p(d_i^U \mid \Theta_{k(i)}^U)

(\hat{\Omega}, \hat{\Theta})^{(2)} = \arg\max_{\Omega, \Theta} \left\{ \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, p(d_i^U \mid \Theta_{k(i)}^U) \right\}   (3)

= \arg\max_{\Omega, \Theta} \left\{ \prod_{i=1}^{n} p(d_i^N \mid \Theta^N) \prod_{k=1}^{K} \prod_{j \in D_k} p(d_j^U \mid \Theta_k^U) \right\}   (4)

– where k(i) denotes the cluster to which pattern i is assigned
– maximum likelihood alone suffers from a lack of regularization → use the marginal (or integrated) likelihood
3. All parameter vectors are independent:

\pi(\Theta) = \pi(\Theta^N) \prod_{k=1}^{K} \pi(\Theta_k^U)

– marginal likelihood

P(D^T \mid \Omega) = \int \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, \pi(\Theta^N)\, d\Theta^N \;\; \prod_{k=1}^{K} \int \prod_{j \in D_k} p(d_j^U \mid \Theta_k^U)\, \pi(\Theta_k^U)\, d\Theta_k^U   (5)

– the integration accounts for model complexity, but is computationally very expensive → prune the search space by reducing the number of feature partitions
– Approximations to Marginal Likelihood / Stochastic Complexity
Document Clustering

• Marginal likelihood
– adapting multinomial models, using term counts as the features
– assuming Dirichlet priors, which are conjugate to the multinomial distribution:

\pi(\{\theta_j^N\}) = \mathrm{Dirichlet}(\{\alpha_j\}_{j \in N}), \quad \pi(\{\theta_u^{U_k}\}) = \mathrm{Dirichlet}(\{\alpha_u\}_{u \in U})   (10)

P(D^T \mid \Omega) = \frac{\Gamma(\alpha^N)}{\Gamma(\alpha^N + t^N)} \prod_{j \in N} \frac{\Gamma(\alpha_j + t_j^N)}{\Gamma(\alpha_j)} \;\; \prod_{k=1}^{K} \frac{\Gamma(\alpha^U)}{\Gamma(\alpha^U + t_k^U)} \prod_{u \in U} \frac{\Gamma(\alpha_u + t_{k,u}^U)}{\Gamma(\alpha_u)}   (11)

– where t_{k,u}^U is the total count of useful term u in cluster D_k, t_k^U = \sum_{u \in U} t_{k,u}^U, t_j^N is the total count of noise term j in D, t^N = \sum_{j \in N} t_j^N, and \alpha^U = \sum_{u \in U} \alpha_u, \alpha^N = \sum_{j \in N} \alpha_j are the Dirichlet hyperparameters
• NLML (Negative Log Marginal Likelihood)
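Eq. (11) can be evaluated stably in log space with log-Gamma functions. A minimal sketch, assuming symmetric Dirichlet hyperparameters and a document-term count matrix; the function names and the `alpha` default are illustrative, not from the paper:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, labels, useful, alpha=1.0):
    """Log marginal likelihood of a clustering under Dirichlet-multinomial
    models: one multinomial per cluster over the useful terms, one shared
    multinomial over the noise terms (cf. Eq. 11).

    counts : (n_docs, n_terms) term-count matrix
    labels : (n_docs,) cluster assignment of each document
    useful : boolean mask of length n_terms marking the useful set U
    alpha  : symmetric Dirichlet hyperparameter (an assumption here)
    """
    def dirichlet_multinomial(t):
        # log[ Gamma(A) / Gamma(A + sum t) * prod_u Gamma(alpha + t_u) / Gamma(alpha) ]
        a = alpha * np.ones_like(t, dtype=float)
        return (gammaln(a.sum()) - gammaln(a.sum() + t.sum())
                + np.sum(gammaln(a + t) - gammaln(a)))

    noise = ~useful
    logml = dirichlet_multinomial(counts[:, noise].sum(axis=0))   # noise-term factor
    for k in np.unique(labels):                                   # one factor per cluster
        logml += dirichlet_multinomial(counts[labels == k][:, useful].sum(axis=0))
    return logml
```

The NLML is simply the negative of this value; candidate numbers of clusters and feature partitions are compared by it.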
Document Clustering (Cont'd)

• Cross-validated likelihood (sketched below)
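The cross-validated likelihood scores a candidate number of clusters by the average held-out log-likelihood. A sketch under assumptions: it swaps in scikit-learn's GaussianMixture purely to keep the example short, whereas the slides' document model is multinomial:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cross_validated_loglik(X, n_components, n_folds=5, seed=0):
    """Average held-out log-likelihood over folds; larger is better.
    Compare this score across candidate values of n_components."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        gm = GaussianMixture(n_components=n_components, random_state=0).fit(X[train])
        scores.append(gm.score(X[test]))  # mean per-sample log-likelihood
    return float(np.mean(scores))
```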
Distributional Clustering for Feature Subset Selection

• heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents
• reduces the feature size from M to C
• by clustering words based on their distributions over the documents
• a histogram for each token (built as sketched below):
– first bin: number of documents with zero occurrences of the token
– second bin: number of documents with a single occurrence of the token
– third bin: number of documents with two or more occurrences of the token
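A minimal sketch of this histogram construction, assuming a document-term count matrix `counts`; names are illustrative:

```python
import numpy as np

def token_histograms(counts):
    """Three-bin occurrence histogram per token, as described above:
    bin 0: #docs with zero occurrences, bin 1: #docs with exactly one,
    bin 2: #docs with two or more. counts is (n_docs, n_terms)."""
    h = np.stack([(counts == 0).sum(axis=0),
                  (counts == 1).sum(axis=0),
                  (counts >= 2).sum(axis=0)], axis=1).astype(float)
    return h / h.sum(axis=1, keepdims=True)  # normalize rows to distributions
```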
DC for Feature Subset Selection (Cont'd)

• measure of similarity of the histograms
– relative entropy, i.e. the K-L distance D(·||·)
– e.g., for two terms with probability distributions p_1(·), p_2(·):

D(p_1(t) \,\|\, p_2(t)) = \sum_t p_1(t) \log \frac{p_1(t)}{p_2(t)}

• k-means DC (see the sketch below)
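A sketch of k-means-style distributional clustering over the token histograms, using the K-L distance above; the paper's exact assignment and update schedule is not given in the slides, so this is an assumed plain Lloyd-style loop:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """K-L distance D(p || q) = sum_t p(t) log(p(t)/q(t)), smoothed with eps."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def kmeans_dc(hists, n_clusters, n_iter=50, seed=0):
    """k-means distributional clustering of token histograms (rows of hists)."""
    rng = np.random.default_rng(seed)
    centroids = hists[rng.choice(len(hists), n_clusters, replace=False)]
    for _ in range(n_iter):
        # assign each token to the centroid with the smallest K-L distance
        d = np.stack([kl(hists, c) for c in centroids], axis=1)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean histogram of its cluster
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = hists[labels == k].mean(axis=0)
    return labels, centroids
```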
Experimental Setup
• AP Reuters newswire articles from TREC-6
– 8235 documents from the routing track, 25 classes; documents assigned to multiple classes were disregarded
– 32450 unique terms (after discarding terms that appear in fewer than 3 documents)
• Evaluation measure of clustering: mutual information (MI)
I(G;K) = \sum_i \sum_j p(G_i, K_j) \log \frac{p(G_i, K_j)}{p(G_i)\, p(K_j)} = H(G) - H(G \mid K)

– where G denotes the ground-truth classes and K the cluster labels
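A minimal sketch of this evaluation measure computed from two label arrays; the function name and the natural-log base are assumptions:

```python
import numpy as np

def mutual_information(classes, clusters):
    """I(G;K) between ground-truth class labels and cluster labels,
    estimated from the empirical joint distribution."""
    gs, ks = np.unique(classes), np.unique(clusters)
    joint = np.array([[np.mean((classes == g) & (clusters == k)) for k in ks]
                      for g in gs])                     # p(G_i, K_j)
    pg = joint.sum(axis=1, keepdims=True)               # p(G_i)
    pk = joint.sum(axis=0, keepdims=True)               # p(K_j)
    nz = joint > 0                                      # skip zero cells: 0*log0 = 0
    return np.sum(joint[nz] * np.log(joint[nz] / (pg * pk)[nz]))
```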
Results of Distributional Clustering
• cluster the 32450 tokens into 3, 4, and 5 clusters
• eliminating function-word clusters

[Figure 1: centroid of a typical high-frequency function-words cluster]
Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters

• Now apply the objective function (11) to the feature subsets selected by DC
– EM/CEM (Classification EM: the hard-assignment version of EM) 1), sketched below
• initialization: k-means algorithm
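A compact sketch of CEM for multinomial document clusters, with the C-step hard assignment between the E- and M-steps; the smoothing constants and defaults are illustrative, not from the paper:

```python
import numpy as np

def cem_multinomial(counts, n_clusters, n_iter=30, init_labels=None):
    """Classification EM for multinomial clusters: posteriors are
    hard-assigned (C-step) before the M-step. counts is (n_docs, n_terms);
    init_labels would typically come from k-means, per the slide."""
    n, m = counts.shape
    labels = (init_labels if init_labels is not None
              else np.random.default_rng(0).integers(n_clusters, size=n))
    for _ in range(n_iter):
        # M-step: mixture weights and smoothed per-cluster term probabilities
        pi = np.array([np.mean(labels == k) + 1e-12 for k in range(n_clusters)])
        theta = np.stack([counts[labels == k].sum(axis=0) + 1.0
                          for k in range(n_clusters)])
        theta /= theta.sum(axis=1, keepdims=True)
        # E-step + C-step: hard-assign each document to its most likely cluster
        loglik = np.log(pi) + counts @ np.log(theta).T
        labels = loglik.argmax(axis=1)
    return labels
```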
• Comparison of feature-selection heuristics
– FBTop20: removal of the top 20% most frequent terms
– FBTop40: removal of the top 40% most frequent terms
– FBTop40Bot10: removal of the top 40% most frequent terms, and of all tokens that do not appear in at least 10 documents
– NF: no feature selection
– CSW: common stop words removed