
Page 1

Information Theoretic Clustering, Co-clustering and Matrix Approximations

Inderjit S. Dhillon University of Texas, Austin

Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha

Data Mining Seminar Series, Mar 26, 2004

Page 2

Clustering: Unsupervised Learning

• Grouping together of "similar" objects
• Hard clustering: each object belongs to a single cluster
• Soft clustering: each object is probabilistically assigned to clusters

Page 3

Contingency Tables

• Let X and Y be discrete random variables
  • X and Y take values in {1, 2, …, m} and {1, 2, …, n}
  • p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data
• Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
• Key obstacles in clustering contingency tables:
  • High dimensionality, sparsity, noise
  • Need for robust and scalable algorithms
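To make the estimation step concrete, here is a minimal NumPy sketch (my own illustration, not from the talk) of turning a co-occurrence count matrix into an estimate of p(X, Y) and its marginals:

```python
import numpy as np

# Toy co-occurrence counts (rows = values of X, columns = values of Y);
# in practice these would be word-document or market-basket counts.
counts = np.array([[5., 5., 0.],
                   [0., 4., 6.],
                   [2., 0., 8.]])

p_xy = counts / counts.sum()      # estimate of the joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)            # marginal p(X)
p_y = p_xy.sum(axis=0)            # marginal p(Y)

print(p_xy)
print(p_x, p_y)
```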

Page 4

Co-Clustering

• Simultaneously:
  • Cluster rows of p(X, Y) into k disjoint groups
  • Cluster columns of p(X, Y) into l disjoint groups
• Key goal is to exploit the "duality" between row and column clustering to overcome sparsity and noise

Page 5

Co-clustering Example for Text Data

[Figure: a word-document co-occurrence matrix, with word clusters along one axis and document clusters along the other]

Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix.

Page 6

Co-clustering and Information Theory

• View the "co-occurrence" matrix as a joint probability distribution p(X, Y) over the row and column random variables X and Y
• We seek a "hard clustering" of both rows and columns such that the "information" in the compressed matrix is maximized

Page 7

Information Theory Concepts

• Entropy of a random variable X with probability distribution p:

    H(p) = -\sum_x p(x) \log p(x)

• The Kullback-Leibler (KL) divergence or "relative entropy" between two probability distributions p and q:

    KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

• Mutual information between random variables X and Y:

    I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
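These three quantities map directly onto a few lines of NumPy; a minimal sketch using base-2 logarithms (the function names are mine, not from the talk):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log( p(x) / q(x) )."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X; Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    # Mutual information is the KL divergence between the joint distribution
    # and the product of its marginals.
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```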

Page 8

"Optimal" Co-Clustering

Seek random variables \hat{X} and \hat{Y}, taking values in {1, 2, …, k} and {1, 2, …, l} respectively, such that the mutual information I(\hat{X}; \hat{Y}) is maximized, where:
• \hat{X} = R(X) is a function of X alone
• \hat{Y} = C(Y) is a function of Y alone

Page 9

Related Work

• Distributional Clustering
  • Pereira, Tishby & Lee (1993); Baker & McCallum (1998)
• Information Bottleneck
  • Tishby, Pereira & Bialek (1999); Slonim, Friedman & Tishby (2001); Berkhin & Becher (2002)
• Probabilistic Latent Semantic Indexing
  • Hofmann (1999); Hofmann & Puzicha (1999)
• Non-Negative Matrix Approximation
  • Lee & Seung (2000)

Page 10

Information-Theoretic Co-clustering

Lemma: The "loss in mutual information" equals

    I(X; Y) - I(\hat{X}; \hat{Y}) = KL\big( p(x, y) \,\|\, q(x, y) \big)
                                  = H(\hat{X}, \hat{Y}) + H(X | \hat{X}) + H(Y | \hat{Y}) - H(X, Y)

where p is the input distribution and q is an approximation to p of the form

    q(x, y) = p(\hat{x}, \hat{y}) \, p(x | \hat{x}) \, p(y | \hat{y}),   for x \in \hat{x}, \; y \in \hat{y}

It can be shown that q(x, y) is the maximum entropy approximation to p subject to the cluster constraints.
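A sketch of how the approximation q can be assembled from p and hard row/column assignments, following the formula above (the helper name and signature are my own, not from the paper):

```python
import numpy as np

def coclustering_approximation(p_xy, row_labels, col_labels, k, l):
    """Build q(x,y) = p(xhat, yhat) * p(x|xhat) * p(y|yhat) for hard co-clusters."""
    m, n = p_xy.shape
    # Indicator matrices: R[x, xhat] = 1 if row x is in row cluster xhat.
    R = np.zeros((m, k)); R[np.arange(m), row_labels] = 1.0
    C = np.zeros((n, l)); C[np.arange(n), col_labels] = 1.0

    p_xhat_yhat = R.T @ p_xy @ C          # co-cluster joint p(xhat, yhat)
    p_x = p_xy.sum(axis=1)                # row marginals p(x)
    p_y = p_xy.sum(axis=0)                # column marginals p(y)
    p_xhat = R.T @ p_x                    # p(xhat)
    p_yhat = C.T @ p_y                    # p(yhat)

    # p(x | xhat) evaluated at x's own cluster; likewise for columns.
    p_x_given_xhat = p_x / p_xhat[row_labels]      # length m
    p_y_given_yhat = p_y / p_yhat[col_labels]      # length n

    # q(x, y) = p(xhat(x), yhat(y)) * p(x | xhat) * p(y | yhat)
    q = (p_xhat_yhat[np.ix_(row_labels, col_labels)]
         * p_x_given_xhat[:, None] * p_y_given_yhat[None, :])
    return q
```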

Page 11

p(x, y) =
    .05  .05  .05   0    0    0
    .05  .05  .05   0    0    0
     0    0    0   .05  .05  .05
     0    0    0   .05  .05  .05
    .04  .04   0   .04  .04  .04
    .04  .04  .04   0   .04  .04

Page 12

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5   0   0
    .5   0   0
     0  .5   0
     0  .5   0
     0   0  .5
     0   0  .5

Page 13

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5   0   0
    .5   0   0
     0  .5   0
     0  .5   0
     0   0  .5
     0   0  .5

p(y | \hat{y}) =
    .36  .36  .28   0    0    0
     0    0    0   .28  .36  .36

Page 14

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5   0   0
    .5   0   0
     0  .5   0
     0  .5   0
     0   0  .5
     0   0  .5

p(y | \hat{y}) =
    .36  .36  .28   0    0    0
     0    0    0   .28  .36  .36

p(\hat{x}, \hat{y}) =
    .3   0
     0  .3
    .2  .2

Page 15

p(x, y) = [the 6×6 matrix from Page 11]

With p(x | \hat{x}), p(y | \hat{y}) and p(\hat{x}, \hat{y}) as on Page 14, the approximation is

q(x, y) =
    .054  .054  .042   0     0     0
    .054  .054  .042   0     0     0
     0     0     0    .042  .054  .054
     0     0     0    .042  .054  .054
    .036  .036  .028  .028  .036  .036
    .036  .036  .028  .028  .036  .036

Number of parameters that determine q(x, y): (m - k) + (kl - 1) + (n - l)
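Continuing the sketches above on this worked example: with row clusters {x1, x2}, {x3, x4}, {x5, x6} and column clusters {y1, y2, y3}, {y4, y5, y6}, the helper reproduces the q(x, y) shown here, and the loss in mutual information equals KL(p || q) as the lemma states (about 0.096 bits with the base-2 logs used above). The variable names are mine:

```python
import numpy as np

# The example joint distribution p(x, y) from Page 11.
p = np.array([[.05, .05, .05, 0,   0,   0  ],
              [.05, .05, .05, 0,   0,   0  ],
              [0,   0,   0,   .05, .05, .05],
              [0,   0,   0,   .05, .05, .05],
              [.04, .04, 0,   .04, .04, .04],
              [.04, .04, .04, 0,   .04, .04]])

rows = np.array([0, 0, 1, 1, 2, 2])    # row clusters {x1,x2}, {x3,x4}, {x5,x6}
cols = np.array([0, 0, 0, 1, 1, 1])    # column clusters {y1,y2,y3}, {y4,y5,y6}

q = coclustering_approximation(p, rows, cols, k=3, l=2)   # reproduces q(x, y) above

# Compressed table p(xhat, yhat), obtained by summing p over each co-cluster.
p_hat = np.zeros((3, 2))
for x in range(6):
    for y in range(6):
        p_hat[rows[x], cols[y]] += p[x, y]

loss = mutual_information(p) - mutual_information(p_hat)   # I(X;Y) - I(Xhat;Yhat)
approx_err = kl_divergence(p.ravel(), q.ravel())            # KL(p || q)
print(round(loss, 4), round(approx_err, 4))                  # both come out to about 0.096
```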

Page 16

Decomposition Lemma

Question: How do we minimize KL( p(x, y) \,\|\, q(x, y) )? The following lemma reveals the answer:

    KL\big( p(x, y) \,\|\, q(x, y) \big) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x) \, KL\big( p(y | x) \,\|\, q(y | \hat{x}) \big)

where q(y | \hat{x}) = p(y | \hat{y}) \, p(\hat{y} | \hat{x}). Note that q(y | \hat{x}) may be thought of as the "prototype" of row cluster \hat{x}.

Similarly,

    KL\big( p(x, y) \,\|\, q(x, y) \big) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y) \, KL\big( p(x | y) \,\|\, q(x | \hat{y}) \big)

where q(x | \hat{y}) = p(x | \hat{x}) \, p(\hat{x} | \hat{y}).
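A sketch of the row-cluster "prototypes" q(y | \hat{x}) and of evaluating the objective through this decomposition (hypothetical helper names, reusing kl_divergence from the earlier sketch):

```python
import numpy as np

def row_prototypes(p_xy, row_labels, col_labels, k, l):
    """q(y | xhat) = p(y | yhat) * p(yhat | xhat): one distribution over y per row cluster."""
    m, n = p_xy.shape
    R = np.zeros((m, k)); R[np.arange(m), row_labels] = 1.0
    C = np.zeros((n, l)); C[np.arange(n), col_labels] = 1.0

    p_xhat_yhat = R.T @ p_xy @ C                                     # p(xhat, yhat)
    p_yhat_given_xhat = p_xhat_yhat / p_xhat_yhat.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0)
    p_y_given_yhat = p_y / (C.T @ p_y)[col_labels]                   # p(y | yhat), length n

    return p_yhat_given_xhat[:, col_labels] * p_y_given_yhat[None, :]   # shape (k, n)

def objective_via_rows(p_xy, row_labels, prototypes):
    """sum_x p(x) * KL( p(y|x) || q(y | xhat(x)) ), which equals KL(p || q) by the lemma."""
    p_x = p_xy.sum(axis=1)
    total = 0.0
    for x in range(p_xy.shape[0]):
        p_y_given_x = p_xy[x] / p_x[x]
        total += p_x[x] * kl_divergence(p_y_given_x, prototypes[row_labels[x]])
    return total
```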

Page 17

Co-Clustering Algorithm

[Step 1] Set i = 1. Start with an initial co-clustering (R_i, C_i). Compute q^{[i,i]}.
[Step 2] For every row x, assign it to the row cluster that minimizes KL( p(y | x) \,\|\, q^{[i,i]}(y | \hat{x}) ).
[Step 3] We now have (R_{i+1}, C_i). Compute q^{[i+1,i]}.
[Step 4] For every column y, assign it to the column cluster that minimizes KL( p(x | y) \,\|\, q^{[i+1,i]}(x | \hat{y}) ).
[Step 5] We now have (R_{i+1}, C_{i+1}). Compute q^{[i+1,i+1]}. Set i = i + 1 and iterate Steps 2-5.
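Putting Steps 1-5 together, a compact Python sketch of the alternating algorithm (my own reading of the steps, not the authors' reference implementation; it reuses kl_divergence from the earlier sketch and does not handle empty clusters):

```python
import numpy as np

def itcc(p_xy, k, l, n_iters=20, seed=0):
    """Alternating row/column updates in the spirit of Steps 1-5 above."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    rows = rng.integers(0, k, size=m)          # Step 1: random initial co-clustering
    cols = rng.integers(0, l, size=n)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def prototypes(labels_rows, labels_cols, kk, ll, joint):
        """q(col | row-cluster): one distribution over columns per row cluster."""
        R = np.zeros((joint.shape[0], kk)); R[np.arange(joint.shape[0]), labels_rows] = 1.0
        C = np.zeros((joint.shape[1], ll)); C[np.arange(joint.shape[1]), labels_cols] = 1.0
        p_hat = R.T @ joint @ C
        p_clusters = p_hat / p_hat.sum(axis=1, keepdims=True)
        p_col = joint.sum(axis=0)
        p_col_given_hat = p_col / (C.T @ p_col)[labels_cols]
        return p_clusters[:, labels_cols] * p_col_given_hat[None, :]

    for _ in range(n_iters):
        # Step 2: reassign every row x to argmin_xhat KL( p(y|x) || q(y|xhat) ).
        # The KL value is +inf when a prototype has a zero where p(y|x) > 0,
        # so such clusters simply lose the argmin.
        q_row = prototypes(rows, cols, k, l, p_xy)                 # shape (k, n)
        p_y_given_x = p_xy / p_x[:, None]
        rows = np.array([np.argmin([kl_divergence(p_y_given_x[x], q_row[c])
                                    for c in range(k)]) for x in range(m)])
        # Steps 3-4: recompute q with the new row clusters, then reassign columns.
        q_col = prototypes(cols, rows, l, k, p_xy.T)               # q(x | yhat), shape (l, m)
        p_x_given_y = (p_xy / p_y[None, :]).T
        cols = np.array([np.argmin([kl_divergence(p_x_given_y[y], q_col[c])
                                    for c in range(l)]) for y in range(n)])
    return rows, cols
```

On the 6×6 example, a few iterations typically recover the co-clustering shown on Page 22, in line with the monotone decrease of the objective stated on the next slide.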

Page 18

Properties of the Co-clustering Algorithm

• Main theorem: co-clustering "monotonically" decreases the loss in mutual information
• Co-clustering converges to a local minimum
• Can be generalized to multi-dimensional contingency tables
• q can be viewed as a "low-complexity" non-negative matrix approximation
• q preserves the marginals of p, and the co-cluster statistics
• Implicit dimensionality reduction at each step helps overcome sparsity and high dimensionality
• Computationally economical
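The marginal-preservation property can be checked directly on the numeric example (p, q, rows and cols as defined in the sketch after Page 15; the indicator matrices here are rebuilt for clarity):

```python
import numpy as np

R = np.zeros((6, 3)); R[np.arange(6), rows] = 1.0     # row-cluster indicators
C = np.zeros((6, 2)); C[np.arange(6), cols] = 1.0     # column-cluster indicators

assert np.allclose(q.sum(axis=1), p.sum(axis=1))      # row marginals p(x) preserved
assert np.allclose(q.sum(axis=0), p.sum(axis=0))      # column marginals p(y) preserved
assert np.allclose(R.T @ q @ C, R.T @ p @ C)          # co-cluster masses p(xhat, yhat) preserved
```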

Page 19

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
     0    0   .28
     1    0    0
     0   .5    0
     0   .5    0
     0    0   .36
     0    0   .36

p(y | \hat{y}) =
    .36  .36   0   .28   0    0
     0    0   .28   0   .36  .36

p(\hat{x}, \hat{y}) =
    .10  .05
    .10  .20
    .30  .25

q(x, y) =
    .029  .029  .019  .022  .024  .024
    .036  .036  .014  .028  .018  .018
    .018  .018  .028  .014  .036  .036
    .018  .018  .028  .014  .036  .036
    .039  .039  .025  .030  .032  .032
    .039  .039  .025  .030  .032  .032

Page 20

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5    0    0
    .5    0    0
     0   .3    0
     0   .3    0
     0    0    1
     0   .4    0

p(y | \hat{y}) =
    .36  .36   0   .28   0    0
     0    0   .28   0   .36  .36

p(\hat{x}, \hat{y}) =
    .20  .10
    .18  .32
    .12  .08

q(x, y) =
    .036  .036  .014  .028  .018  .018
    .036  .036  .014  .028  .018  .018
    .019  .019  .026  .015  .034  .034
    .019  .019  .026  .015  .034  .034
    .043  .043  .022  .033  .028  .028
    .025  .025  .035  .020  .046  .046

Page 21

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5    0    0
    .5    0    0
     0   .3    0
     0   .3    0
     0    0    1
     0   .4    0

p(y | \hat{y}) =
    .36  .36  .28   0    0    0
     0    0    0   .28  .36  .36

p(\hat{x}, \hat{y}) =
    .30   0
    .12  .38
    .08  .12

q(x, y) =
    .054  .054  .042   0     0     0
    .054  .054  .042   0     0     0
    .013  .013  .010  .031  .041  .041
    .013  .013  .010  .031  .041  .041
    .028  .028  .022  .033  .043  .043
    .017  .017  .013  .042  .054  .054

Page 22

p(x, y) = [the 6×6 matrix from Page 11]

p(x | \hat{x}) =
    .5   0   0
    .5   0   0
     0  .5   0
     0  .5   0
     0   0  .5
     0   0  .5

p(y | \hat{y}) =
    .36  .36  .28   0    0    0
     0    0    0   .28  .36  .36

p(\hat{x}, \hat{y}) =
    .3   0
     0  .3
    .2  .2

q(x, y) =
    .054  .054  .042   0     0     0
    .054  .054  .042   0     0     0
     0     0     0    .042  .054  .054
     0     0     0    .042  .054  .054
    .036  .036  .028  .028  .036  .036
    .036  .036  .028  .028  .036  .036

Page 23

Applications -- Text Classification

• Assigning class labels to text documents
• Training and testing phases

[Diagram: in the training phase, a document collection grouped into classes Class-1, …, Class-m forms the training data, from which the classifier learns; in the testing phase, a new document is fed to the classifier, which outputs the document with its assigned class.]

Page 24

Feature Clustering (dimensionality reduction)

Feature Selection: a document's bag-of-words becomes a vector over selected words (1, …, m)
• Select the "best" words
• Throw away the rest
• Frequency-based pruning
• Information-criterion-based pruning

Feature Clustering: a document's bag-of-words becomes a vector over word clusters (Cluster #1, …, Cluster #k)
• Do not throw away words
• Cluster words instead
• Use word clusters as features
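A minimal sketch of the "cluster words instead" idea: given a word-cluster assignment, a length-m bag-of-words vector is collapsed into a k-dimensional vector of cluster counts, which then feeds a Naïve Bayes or SVM classifier. The names and the toy data are illustrative only:

```python
import numpy as np

def cluster_features(doc_term_counts, word_cluster, k):
    """Collapse an (n_docs, n_words) bag-of-words matrix into (n_docs, k) cluster counts."""
    n_docs, n_words = doc_term_counts.shape
    # Indicator matrix W[w, c] = 1 if word w belongs to word cluster c.
    W = np.zeros((n_words, k))
    W[np.arange(n_words), word_cluster] = 1.0
    return doc_term_counts @ W    # each document's counts summed within each word cluster

# Example: 4 documents, 6 words grouped into 2 word clusters.
X = np.array([[2, 1, 0, 0, 1, 0],
              [0, 3, 1, 0, 0, 0],
              [0, 0, 0, 2, 2, 1],
              [1, 0, 0, 1, 3, 0]])
word_cluster = np.array([0, 0, 0, 1, 1, 1])
print(cluster_features(X, word_cluster, k=2))
```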

Page 25

Experiments

• Data sets:
  • 20 Newsgroups data: 20 classes, 20,000 documents
  • Classic3 data set: 3 classes (CISI, MED and CRAN), 3893 documents
  • Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14,538 words
    • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
• Implementation details:
  • Bow toolkit used for indexing, co-clustering, clustering and classification

Page 26

Results (20Ng)

• Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
• Divisive clustering beats feature selection algorithms by a large margin
• The effect is more significant at lower numbers of features

Page 27

Results (Dmoz)

• Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
• Divisive clustering is better at lower numbers of features
• Note the contrasting behavior of Naïve Bayes and SVMs

Page 28

Results (Dmoz)

• Naïve Bayes on Dmoz data with only 2% training data
• Note that divisive clustering achieves a higher maximum than IG, a significant 13% increase
• Divisive clustering performs better than IG when less training data is available

Page 29

Hierarchical Classification

[Class hierarchy: Science branches into Math, Physics and Social Science; Math into Number Theory and Logic; Physics into Mechanics and Quantum Theory; Social Science into Economics and Archeology]

• A flat classifier builds a classifier over the leaf classes in the above hierarchy
• A hierarchical classifier builds a classifier at each internal node of the hierarchy

Page 30

Results (Dmoz)

[Plot: % accuracy vs. number of features (5 to 10000) on Dmoz data for the Hierarchical, Flat (DC) and Flat (IG) classifiers]

• Hierarchical classifier (Naïve Bayes at each node)
• The hierarchical classifier reaches 64.54% accuracy with just 10 features (the flat classifier achieves 64.04% accuracy at 1000 features)
• The hierarchical classifier improves accuracy to 68.42% from the 64.42% maximum achieved by flat classifiers

Page 31

Anecdotal Evidence

Top few words, sorted, in clusters obtained by the divisive and agglomerative approaches on 20 Newsgroups data:

Cluster 10, Divisive Clustering (rec.sport.hockey):
  team, game, play, hockey, season, boston, chicago, pit, van, nhl

Cluster 9, Divisive Clustering (rec.sport.baseball):
  hit, runs, baseball, base, ball, greg, morris, ted, pitcher, hitting

Cluster 12, Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball):
  team, detroit, hockey, pitching, games, hitter, players, rangers, baseball, nyi, league, morris, player, blues, nhl, shots, pit, vancouver, buffalo, ens

Page 32

Co-Clustering Results (CLASSIC3)

1-D Clustering (0.821):
     847   142    44
      41   954   405
     275    86  1099

Co-Clustering (0.9835):
     992     4     8
      40  1452     7
       1     4  1387

Page 33

Results – Binary (subset of 20Ng data)

[Confusion matrices comparing 1-D clustering and co-clustering on the two binary subsets]

• Binary_subject (0.946, 0.648)
• Binary (0.852, 0.67)

Page 34

Precision – 20Ng data

Dataset            Co-clustering   1D-clustering   IB-Double / IDC
Binary                 0.98            0.64             0.70
Binary_Subject         0.96            0.67             0.85
Multi5                 0.87            0.34             0.5
Multi5_Subject         0.89            0.37             0.88
Multi10                0.56            0.17             0.35
Multi10_Subject        0.54            0.19             0.55

Page 35

Results: Sparsity (Binary_subject data)

Page 36

Results: Sparsity (Binary_subject data)

Page 37

Results (Monotonicity)

Page 38

Conclusions

• Information-theoretic approach to clustering, co-clustering and matrix approximation
• Implicit dimensionality reduction at each step to overcome sparsity and high dimensionality
• The theoretical approach has the potential to extend to other problems:
  • Multi-dimensional co-clustering
  • MDL to choose the number of co-clusters
  • Generalized co-clustering with Bregman divergences

Page 39

More Information

• Email: [email protected]
• Papers are available at: http://www.cs.utexas.edu/users/inderjit
  • "Divisive Information-Theoretic Feature Clustering for Text Classification", Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD 2002)
  • "Information-Theoretic Co-clustering", Dhillon, Mallela & Modha, KDD 2003
  • "Clustering with Bregman Divergences", Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004
  • "A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation", Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004