Generalized Model Selection for Unsupervised Learning in High Dimensions

Vaithyanathan and Dom, IBM Almaden Research Center
NIPS'99
Abstract

• Bayesian approach to model selection in unsupervised learning
– proposes a unified objective function whose arguments include both the feature space and the number of clusters
• determining the feature set (dividing the feature set into noise features and useful features)
• determining the number of clusters
– marginal likelihood with a Bayesian scheme vs. cross-validation (cross-validated likelihood)
• DC (distributional clustering of terms) for initial feature selection
Model Selection in Clustering

• Bayesian approaches 1), cross-validation techniques 2), MDL approaches 3)
• Need for a unified objective function
– the optimal number of clusters depends on the feature space in which the clustering is performed
– c.f. feature selection in clustering
Model Selection in Clustering (Cont'd)

• Generalized model for clustering
– data D = {d_1, …, d_n}, feature space T with dimension M
– maximization of the likelihood P(D^T | Ω, Θ), where Ω (with parameter vector Θ) is the structure of the model: the number of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters
• Bayesian approach to model selection
– regularization using the marginal likelihood
Bayesian Approach to Model Selection for Clustering

• Data
– data D = {d_1, …, d_n}, feature space T with dimension M
• Clustering D
– finding \hat{\Omega} and \hat{\Theta} such that

(\hat{\Omega}, \hat{\Theta}) = \arg\max_{\Omega, \Theta} P(D^T \mid \Omega, \Theta)   (1)

– where \Omega is the structure of the model and \Theta is the set of all parameter vectors
– the model structure consists of the number of clusters + the partitioning of the feature set and the assignment of patterns to clusters
Assumptions

1. The feature sets represented by U and N are conditionally independent, and so is the data:

P(D^T \mid \Omega, \Theta) = P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta)

(\hat{\Omega}, \hat{\Theta})^{(1)} = \arg\max_{\Omega, \Theta} \{ P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta) \}   (2)

2. The data D = {d_1, …, d_n} is i.i.d.:

P(D^N \mid \Omega, \Theta)\, P(D^U \mid \Omega, \Theta) = \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, p(d_i^U \mid \Theta_{k(i)}^U)

(\hat{\Omega}, \hat{\Theta})^{(2)} = \arg\max_{\Omega, \Theta} \left\{ \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, p(d_i^U \mid \Theta_{k(i)}^U) \right\}   (3)

= \arg\max_{\Omega, \Theta} \left\{ \prod_{i=1}^{n} p(d_i^N \mid \Theta^N) \prod_{k=1}^{K} \prod_{j \in D_k} p(d_j^U \mid \Theta_k^U) \right\}   (4)

– where k(i) denotes the cluster to which pattern i is assigned
– maximum likelihood alone suffers from a lack of regularization → use the marginal (or integrated) likelihood
3. All parameter vectors are independent:

\pi(\Theta) = \pi(\Theta^N) \prod_{k=1}^{K} \pi(\Theta_k^U)

– marginal likelihood

P(D^T \mid \Omega) = \int \prod_{i=1}^{n} p(d_i^N \mid \Theta^N)\, \pi(\Theta^N)\, d\Theta^N \;\; \prod_{k=1}^{K} \int \prod_{j \in D_k} p(d_j^U \mid \Theta_k^U)\, \pi(\Theta_k^U)\, d\Theta_k^U   (5)

– the integration accounts for model complexity, but is computationally very expensive → prune the search space by reducing the number of feature partitions
– Approximations to Marginal Likelihood / Stochastic Complexity
Document Clustering

• Marginal likelihood
– adapting multinomial models, using term counts as the features
– assuming Dirichlet priors, which are conjugate to the multinomial distribution:

\pi(\{\theta_j^N\}) = \mathrm{Dirichlet}(\{\alpha_j\}_{j \in N}), \quad \pi(\{\theta_u^{U_k}\}) = \mathrm{Dirichlet}(\{\alpha_u\}_{u \in U})   (10)

P(D^T \mid \Omega) = \frac{\Gamma(\alpha^N)}{\Gamma(\alpha^N + t^N)} \prod_{j \in N} \frac{\Gamma(\alpha_j + t_j^N)}{\Gamma(\alpha_j)} \;\; \prod_{k=1}^{K} \frac{\Gamma(\alpha^U)}{\Gamma(\alpha^U + t_k^U)} \prod_{u \in U} \frac{\Gamma(\alpha_u + t_{k,u}^U)}{\Gamma(\alpha_u)}   (11)

– where t_{k,u}^U is the total count of useful term u in cluster D_k, t_k^U = \sum_{u \in U} t_{k,u}^U, t_j^N is the total count of noise term j in D, t^N = \sum_{j \in N} t_j^N, and \alpha^U = \sum_{u \in U} \alpha_u, \alpha^N = \sum_{j \in N} \alpha_j are the Dirichlet hyperparameters
• NLML (Negative Log Marginal Likelihood)
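Eq. (11) can be evaluated stably in log space with log-Gamma functions. A minimal sketch, assuming symmetric Dirichlet hyperparameters and a document-term count matrix; the function names and the `alpha` default are illustrative, not from the paper:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, labels, useful, alpha=1.0):
    """Log marginal likelihood of a clustering under Dirichlet-multinomial
    models: one multinomial per cluster over the useful terms, one shared
    multinomial over the noise terms (cf. Eq. 11).

    counts : (n_docs, n_terms) term-count matrix
    labels : (n_docs,) cluster assignment of each document
    useful : boolean mask of length n_terms marking the useful set U
    alpha  : symmetric Dirichlet hyperparameter (an assumption here)
    """
    def dirichlet_multinomial(t):
        # log[ Gamma(A) / Gamma(A + sum t) * prod_u Gamma(alpha + t_u) / Gamma(alpha) ]
        a = alpha * np.ones_like(t, dtype=float)
        return (gammaln(a.sum()) - gammaln(a.sum() + t.sum())
                + np.sum(gammaln(a + t) - gammaln(a)))

    noise = ~useful
    logml = dirichlet_multinomial(counts[:, noise].sum(axis=0))   # noise-term factor
    for k in np.unique(labels):                                   # one factor per cluster
        logml += dirichlet_multinomial(counts[labels == k][:, useful].sum(axis=0))
    return logml
```

The NLML is simply the negative of this value; candidate numbers of clusters and feature partitions are compared by it.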
Document Clustering (Cont'd)

• Cross-validated likelihood (sketched below)
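The cross-validated likelihood scores a candidate number of clusters by the average held-out log-likelihood. A sketch under assumptions: it swaps in scikit-learn's GaussianMixture purely to keep the example short, whereas the slides' document model is multinomial:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cross_validated_loglik(X, n_components, n_folds=5, seed=0):
    """Average held-out log-likelihood over folds; larger is better.
    Compare this score across candidate values of n_components."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        gm = GaussianMixture(n_components=n_components, random_state=0).fit(X[train])
        scores.append(gm.score(X[test]))  # mean per-sample log-likelihood
    return float(np.mean(scores))
```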
Distributional Clustering for Feature Subset Selection

• heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents
• reduces the feature size from M to C
• by clustering words based on their distributions over the documents
• a histogram for each token (built as sketched below):
– first bin: number of documents with zero occurrences of the token
– second bin: number of documents with a single occurrence of the token
– third bin: number of documents with two or more occurrences of the token
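A minimal sketch of this histogram construction, assuming a document-term count matrix `counts`; names are illustrative:

```python
import numpy as np

def token_histograms(counts):
    """Three-bin occurrence histogram per token, as described above:
    bin 0: #docs with zero occurrences, bin 1: #docs with exactly one,
    bin 2: #docs with two or more. counts is (n_docs, n_terms)."""
    h = np.stack([(counts == 0).sum(axis=0),
                  (counts == 1).sum(axis=0),
                  (counts >= 2).sum(axis=0)], axis=1).astype(float)
    return h / h.sum(axis=1, keepdims=True)  # normalize rows to distributions
```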
DC for Feature Subset Selection (Cont'd)

• measure of similarity of the histograms
– relative entropy, i.e. the K-L distance D(·||·)
– e.g., for two terms with probability distributions p_1(·), p_2(·):

D(p_1(t) \,\|\, p_2(t)) = \sum_t p_1(t) \log \frac{p_1(t)}{p_2(t)}

• k-means DC (see the sketch below)
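A sketch of k-means-style distributional clustering over the token histograms, using the K-L distance above; the paper's exact assignment and update schedule is not given in the slides, so this is an assumed plain Lloyd-style loop:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """K-L distance D(p || q) = sum_t p(t) log(p(t)/q(t)), smoothed with eps."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def kmeans_dc(hists, n_clusters, n_iter=50, seed=0):
    """k-means distributional clustering of token histograms (rows of hists)."""
    rng = np.random.default_rng(seed)
    centroids = hists[rng.choice(len(hists), n_clusters, replace=False)]
    for _ in range(n_iter):
        # assign each token to the centroid with the smallest K-L distance
        d = np.stack([kl(hists, c) for c in centroids], axis=1)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean histogram of its cluster
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = hists[labels == k].mean(axis=0)
    return labels, centroids
```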
Experimental Setup
• AP Reuters newswire articles from TREC-6
– 8235 documents from the routing track, 25 classes; documents assigned to multiple classes were disregarded
– 32450 unique terms (after discarding terms that appear in fewer than 3 documents)
• Evaluation measure of clustering: mutual information (MI)
I(G;K) = \sum_i \sum_j p(G_i, K_j) \log \frac{p(G_i, K_j)}{p(G_i)\, p(K_j)} = H(G) - H(G \mid K)

– where G denotes the ground-truth classes and K the cluster labels
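A minimal sketch of this evaluation measure computed from two label arrays; the function name and the natural-log base are assumptions:

```python
import numpy as np

def mutual_information(classes, clusters):
    """I(G;K) between ground-truth class labels and cluster labels,
    estimated from the empirical joint distribution."""
    gs, ks = np.unique(classes), np.unique(clusters)
    joint = np.array([[np.mean((classes == g) & (clusters == k)) for k in ks]
                      for g in gs])                     # p(G_i, K_j)
    pg = joint.sum(axis=1, keepdims=True)               # p(G_i)
    pk = joint.sum(axis=0, keepdims=True)               # p(K_j)
    nz = joint > 0                                      # skip zero cells: 0*log0 = 0
    return np.sum(joint[nz] * np.log(joint[nz] / (pg * pk)[nz]))
```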
Results of Distributional Clustering
• cluster the 32450 tokens into 3, 4, and 5 clusters
• eliminating function-word clusters

[Figure 1: centroid of a typical high-frequency function-words cluster]
Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters

• Now apply the objective function (11) to the feature subsets selected by DC
– EM/CEM (Classification EM: the hard-assignment version of EM) 1), sketched below
• initialization: k-means algorithm
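A compact sketch of CEM for multinomial document clusters, with the C-step hard assignment between the E- and M-steps; the smoothing constants and defaults are illustrative, not from the paper:

```python
import numpy as np

def cem_multinomial(counts, n_clusters, n_iter=30, init_labels=None):
    """Classification EM for multinomial clusters: posteriors are
    hard-assigned (C-step) before the M-step. counts is (n_docs, n_terms);
    init_labels would typically come from k-means, per the slide."""
    n, m = counts.shape
    labels = (init_labels if init_labels is not None
              else np.random.default_rng(0).integers(n_clusters, size=n))
    for _ in range(n_iter):
        # M-step: mixture weights and smoothed per-cluster term probabilities
        pi = np.array([np.mean(labels == k) + 1e-12 for k in range(n_clusters)])
        theta = np.stack([counts[labels == k].sum(axis=0) + 1.0
                          for k in range(n_clusters)])
        theta /= theta.sum(axis=1, keepdims=True)
        # E-step + C-step: hard-assign each document to its most likely cluster
        loglik = np.log(pi) + counts @ np.log(theta).T
        labels = loglik.argmax(axis=1)
    return labels
```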
• Comparison of feature-selection heuristics
– FBTop20: removal of the top 20% most frequent terms
– FBTop40: removal of the top 40% most frequent terms
– FBTop40Bot10: removal of the top 40% most frequent terms, and of all tokens that do not appear in at least 10 documents
– NF: no feature selection
– CSW: common stop words removed