Clustering
Shallow Processing Techniques for NLP
Ling570, November 30, 2011


Roadmap

Clustering:
  Motivation & Applications
  Clustering Approaches
  Evaluation

Clustering

Task: Given a set of objects, create a set of clusters over those objects.

Applications:
  Exploratory data analysis
  Document clustering
  Language modeling: generalization for class-based LMs
  Unsupervised word sense disambiguation
  Automatic thesaurus creation
  Unsupervised part-of-speech tagging
  Speaker clustering, ...

Example: Document Clustering

Input: Set of individual documents
Output: Sets of document clusters

Many different types of clustering:
  Category: news, sports, weather, entertainment
  Genre clustering: similar styles, e.g., blogs, tweets, newswire
  Author clustering
  Language ID: language clusters
  Topic clustering: documents on the same topic, e.g., OWS, debt supercommittee, Seattle Marathon, Black Friday, ...

Example: Word Clustering

Input: Words
  Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats

Output: Word clusters

Example clusters (from NYT):
  ballot, polls, Gov, seats
  profit, finance, payments
  NFL, Reds, Sox, inning, quarterback, scored, score
  researchers, science
  Scott, Mary, Barbara, Edward

Questions

What should a cluster represent?
  Similarity among objects
How can we create clusters?
How can we evaluate clusters?
How can we improve NLP with clustering?

Due to F. Xia

Similarity

Between two instances
Between an instance and a cluster
Between clusters

Similarity Measures

Given x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):

Euclidean distance: d(x, y) = sqrt(sum_i (xi - yi)^2)
Manhattan distance: d(x, y) = sum_i |xi - yi|
Cosine similarity: cos(x, y) = (x . y) / (||x|| ||y||) = sum_i xi*yi / (sqrt(sum_i xi^2) * sqrt(sum_i yi^2))
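For concreteness, a minimal Python sketch of these three measures over plain list vectors (illustrative only, not the interface used in the homework):

  import math

  def euclidean(x, y):
      """Euclidean (L2) distance between two equal-length vectors."""
      return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

  def manhattan(x, y):
      """Manhattan (L1) distance."""
      return sum(abs(xi - yi) for xi, yi in zip(x, y))

  def cosine(x, y):
      """Cosine similarity: dot product over the product of norms."""
      dot = sum(xi * yi for xi, yi in zip(x, y))
      nx = math.sqrt(sum(xi * xi for xi in x))
      ny = math.sqrt(sum(yi * yi for yi in y))
      return dot / (nx * ny) if nx and ny else 0.0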

Clustering Algorithms

Types of Clustering

Flat vs hierarchical clustering:
  Flat: partition data into k clusters
  Hierarchical: nodes form a hierarchy

Hard vs soft clustering:
  Hard: each object is assigned to exactly one cluster
  Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership

Hierarchical Clustering

Hierarchical vs. Flat

Hierarchical clustering:
  More informative
  Good for data exploration
  Many algorithms, none good for all data
  Computationally expensive

Flat clustering:
  Fairly efficient
  Simple baseline algorithm: k-means
  Probabilistic models use the EM algorithm

Clustering Algorithms

Flat clustering:
  K-means clustering
  K-medoids clustering

Hierarchical clustering:
  Greedy, bottom-up clustering

K-Means Clustering

Initialize:
  Randomly select k initial centroids
    Centroid: center (mean) of a cluster

Iterate until clusters stop changing:
  Assign each instance to the nearest cluster
    A cluster is nearest if its centroid is nearest
  Recompute cluster centroids
    Centroid: mean of the instances in the cluster
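A compact sketch of this loop in Python, assuming Euclidean distance and random initialization (illustrative only):

  import math
  import random

  def kmeans(points, k, max_iter=100):
      """Basic k-means sketch: points is a list of equal-length tuples."""
      centroids = random.sample(points, k)
      assignment = [None] * len(points)
      for _ in range(max_iter):
          # Assignment step: each point goes to the nearest centroid.
          new_assignment = [
              min(range(k), key=lambda c: math.dist(p, centroids[c]))
              for p in points
          ]
          if new_assignment == assignment:  # clusters stopped changing
              break
          assignment = new_assignment
          # Update step: recompute each centroid as its cluster's mean.
          for c in range(k):
              members = [p for p, a in zip(points, assignment) if a == c]
              if members:
                  centroids[c] = tuple(sum(dim) / len(members)
                                       for dim in zip(*members))
      return centroids, assignment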

K-Means: 1 Step

[Figure: one assignment-and-update step of k-means]

K-Means

Running time:
  Each iteration is linear in the number of instances: O(kn) distance computations for n instances and k clusters
  Converges in a finite number of steps

Issues:
  Need to pick the number of clusters k
  Can find only a local optimum
  Sensitive to outliers
  Requires Euclidean distance (means of numeric vectors):
    What about enumerable classes (e.g., colors)?

Medoid

Medoid: the element of a cluster with the highest average similarity to the other elements in the cluster.

Finding the medoid:
  For each element p of cluster C, compute its average similarity to the others:
    f(p) = (1 / (|C| - 1)) * sum over q in C, q != p, of sim(p, q)
  Select the element with the highest f(p)
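A direct sketch of medoid selection, assuming some similarity function sim(p, q) such as the cosine defined earlier:

  def medoid(cluster, sim):
      """Return the element with the highest average similarity
      to the other elements in the cluster (illustrative sketch)."""
      if len(cluster) == 1:
          return cluster[0]
      def avg_sim(p):
          others = [q for q in cluster if q is not p]
          return sum(sim(p, q) for q in others) / len(others)
      return max(cluster, key=avg_sim)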

K-Medoids

Initialize:
  Select k instances at random as medoids

Iterate until no changes:
  Assign instances to the cluster with the nearest medoid
  Recompute the medoid for each cluster
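Combining the two steps, a minimal k-medoids loop; the similarity function is a parameter, and the function and variable names are illustrative:

  import random

  def k_medoids(items, k, sim, max_iter=100):
      """items: list of vectors; sim(a, b): similarity (higher = closer)."""
      medoids = random.sample(items, k)
      clusters = None
      for _ in range(max_iter):
          # Assign each instance to the most similar medoid.
          new_clusters = [[] for _ in range(k)]
          for item in items:
              best = max(range(k), key=lambda c: sim(item, medoids[c]))
              new_clusters[best].append(item)
          if new_clusters == clusters:  # no change: done
              break
          clusters = new_clusters
          # Recompute each cluster's medoid: the member with the
          # highest average similarity to the rest of the cluster.
          for c, members in enumerate(clusters):
              if not members:
                  continue  # empty cluster: keep its old medoid
              medoids[c] = max(
                  members,
                  key=lambda p: sum(sim(p, q) for q in members if q is not p))
      return medoids, clusters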

Greedy, Bottom-Up Hierarchical Clustering

Initialize:
  Make an individual cluster for each instance

Iterate until all instances are in the same cluster:
  Merge the two most similar clusters
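A sketch of the greedy bottom-up loop; cluster_sim is an assumed cluster-level similarity (the linkage choice, e.g. average pairwise similarity between members, is left open here):

  def agglomerative(items, cluster_sim):
      """Greedy bottom-up clustering; records each merge (sketch)."""
      clusters = [[item] for item in items]   # one cluster per instance
      merges = []
      while len(clusters) > 1:
          # Find the most similar pair of clusters.
          i, j = max(
              ((i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))),
              key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
          merged = clusters[i] + clusters[j]
          merges.append((clusters[i], clusters[j]))
          clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
          clusters.append(merged)
      return merges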

Evaluation

Evaluation

With respect to a gold standard:
  Accuracy: for each cluster, assign the most common label to all items
  Rand index
  F-measure

Alternatives:
  Extrinsic evaluation
  Human inspection

Configuration

Given:
  Set of objects O = {o1, o2, ..., on}
  Partition X = {x1, ..., xr}
  Partition Y = {y1, ..., ys}

For each pair of objects, count:

                         In same set in X    In diff't sets in X
  In same set in Y              a                    d
  In diff't sets in Y           c                    b

Rand Index

Measure of cluster similarity (Rand, 1971), using the pair counts a, b, c, d from the table above:

  RI = (a + b) / (a + b + c + d)

No agreement? 0. Full agreement? 1.
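A small sketch of the pair counting behind the Rand index, representing each partition as a list of cluster labels, one per object (an assumed representation, not the homework's format):

  from itertools import combinations

  def rand_index(labels_x, labels_y):
      """Rand index over all object pairs: (a + b) / total pairs."""
      agree = 0
      pairs = list(combinations(range(len(labels_x)), 2))
      for i, j in pairs:
          same_x = labels_x[i] == labels_x[j]
          same_y = labels_y[i] == labels_y[j]
          if same_x == same_y:   # both same (a) or both different (b)
              agree += 1
      return agree / len(pairs)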

Precision & Recall

Assume X is the gold-standard partition and Y is the system-generated partition.

For each pair of items in a cluster in Y:
  Correct if they also appear together in a cluster in X

Can compute P, R, and F-measure: in terms of the table above, P = a / (a + d) and R = a / (a + c).
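The same pair counts give pairwise precision, recall, and F-measure; a sketch using the same label-list representation as above:

  from itertools import combinations

  def pairwise_prf(gold, system):
      """Pair-based P/R/F: a pair is 'predicted' if clustered together
      by the system, 'correct' if also together in the gold standard."""
      a = d = c = 0
      for i, j in combinations(range(len(gold)), 2):
          same_gold = gold[i] == gold[j]
          same_sys = system[i] == system[j]
          if same_sys and same_gold:
              a += 1          # together in both (correct pairs)
          elif same_sys:
              d += 1          # together in system only
          elif same_gold:
              c += 1          # together in gold only
      p = a / (a + d) if a + d else 0.0
      r = a / (a + c) if a + c else 0.0
      f = 2 * p * r / (p + r) if p + r else 0.0
      return p, r, f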

HW #10

Due to F. Xia

Unsupervised POS tagging: word clustering by neighboring-word cooccurrence

Create feature vectors: features are counts of adjacent word occurrences, e.g., L=he:10 or R=run:3
Perform clustering: k-medoids algorithm (with cosine similarity)
Evaluate clusters: cluster mapping + accuracy

Q1: create_vectors

create_vectors.* training_file word_file feat_file outfile

  training_file: one sentence per line: w1 w2 w3 ... wn
  word_file: list of words to cluster, word<tab>freq
  feat_file: list of words to use as features, feat<tab>freq
  outfile: one list per word in word_file
    Format: word L=he 10 L=she 5 ... R=gone 2 R=run 3 ...

Features

Features are of the form (L|R)=xx freq, where:
  xx is a word in the feat_file
  L, R: the position (left or right neighbor) where the feature appeared
  freq: # of times word xx appeared in that position in the training file

Suppose 'New York' appears 540 times in the corpus:
  York L=New 540 ... R=New 0 ...
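A sketch of the counting step for Q1, using plain dictionaries; the function name and in-memory structure are illustrative, and the output formatting per the spec above is omitted:

  from collections import defaultdict

  def count_features(training_file, words, feats):
      """For each target word, count left/right neighbor features.
      words, feats: sets loaded from word_file and feat_file."""
      vectors = {w: defaultdict(int) for w in words}
      with open(training_file, encoding="utf8") as f:
          for line in f:
              tokens = line.split()
              for i, tok in enumerate(tokens):
                  if tok not in words:
                      continue
                  if i > 0 and tokens[i - 1] in feats:
                      vectors[tok]["L=" + tokens[i - 1]] += 1
                  if i + 1 < len(tokens) and tokens[i + 1] in feats:
                      vectors[tok]["R=" + tokens[i + 1]] += 1
      return vectors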

Vector File

One line per word in word_file
Lines should be ordered as in word_file
Features should be sorted alphabetically by feature name
  E.g., L=an 3 L=the 10 ... R=aqua 1 R=house 5
Feature sorting aids the cosine computation (see the sketch below)
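Why the sorting helps: with both feature lists in the same order, the dot product needed for cosine can be computed in a single merge-style pass over the two sparse vectors. A sketch, assuming each vector is a sorted list of (feature, count) pairs:

  import math

  def sparse_cosine(u, v):
      """u, v: lists of (feature, count) pairs, sorted by feature name."""
      dot, i, j = 0.0, 0, 0
      while i < len(u) and j < len(v):
          if u[i][0] == v[j][0]:
              dot += u[i][1] * v[j][1]
              i += 1
              j += 1
          elif u[i][0] < v[j][0]:
              i += 1
          else:
              j += 1
      nu = math.sqrt(sum(c * c for _, c in u))
      nv = math.sqrt(sum(c * c for _, c in v))
      return dot / (nu * nv) if nu and nv else 0.0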

Q2: k_medoids

k_medoids.* vector_file num_clusters sys_cluster_file

  vector_file: created by Q1
  num_clusters: number of clusters to create
  sys_cluster_file: output representing the clustering of vectors
    Format: medoid w1 w2 w3 ... wn
    where medoid is the medoid representing the cluster and w1 ... wn are the words in the cluster

Q2: K-Medoids

Similarity measure: cosine similarity

Initial medoids: medoid i is at the instance given by [formula], where N is the # of words to cluster and C is the # of clusters

Mapping Sys to Gold: One-to-One

Find the highest number in the matrix
Remove the corresponding row and column
Repeat until all rows are removed:
  s1 => g2 10
  s2 => g1 7
  s3 => g3 6
  acc = (10 + 7 + 6) / sum

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3

Due to F. Xia

Mapping Sys to Gold: Many-to-One

Find the highest number in the matrix
Remove the corresponding row (but not the column)
Repeat until all rows are removed:
  s1 => g2 10
  s2 => g1 7
  s3 => g2 9
  s4 => g1 5
  acc = (10 + 7 + 9 + 5) / sum

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3

Due to F. Xia
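A sketch covering both mapping strategies, with the confusion counts as a dict of dicts (counts[s][g] = # of items of system cluster s carrying gold label g; names illustrative):

  def map_clusters(counts, one_to_one):
      """Greedily pick the largest remaining cell; in one-to-one mode
      each gold cluster can be used at most once."""
      mapping = {}
      used_gold = set()
      remaining = set(counts)
      while remaining:
          best = max(
              ((s, g) for s in remaining for g in counts[s]
               if not (one_to_one and g in used_gold)),
              key=lambda sg: counts[sg[0]][sg[1]],
              default=None)
          if best is None:      # one-to-one: gold clusters exhausted
              break
          s, g = best
          mapping[s] = g
          remaining.discard(s)
          used_gold.add(g)
      total = sum(sum(row.values()) for row in counts.values())
      correct = sum(counts[s][g] for s, g in mapping.items())
      return mapping, correct / total

On the matrix above this reproduces the slide's results: accuracy (10+7+6)/57 one-to-one, (10+7+9+5)/57 many-to-one.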

Q3: calculate_accuracy

calculate_accuracy.* sys_clust gold_clust flag map_file acc_file

  sys_clust: output of Q2: m w1 w2 ...
  gold_clust: same format, gold standard
  flag: 0 = one-to-one; 1 = many-to-one
  map_file: mapping of system to gold clusters
    Format: sys_clust_num => gold_clust_num count
  acc_file: just the overall accuracy

Experiments

Compare different numbers of words and different feature representations
Compare different mapping strategies for accuracy
Tabulate results
