Clustering
Shallow Processing Techniques for NLP
Ling570, November 30, 2011
Roadmap
- Clustering: motivation & applications
- Clustering approaches
- Evaluation
Clustering
Task: Given a set of objects, create a set of clusters over those objects.
Applications:
- Exploratory data analysis
- Document clustering
- Language modeling: generalization for class-based LMs
- Unsupervised word sense disambiguation
- Automatic thesaurus creation
- Unsupervised part-of-speech tagging
- Speaker clustering, ...
Example: Document Clustering
Input: a set of individual documents
Output: sets of document clusters
Many different types of clustering:
- Category: news, sports, weather, entertainment
- Genre clustering: similar styles, e.g., blogs, tweets, newswire
- Author clustering
- Language ID: language clusters
- Topic clustering: documents on the same topic (OWS, debt supercommittee, Seattle Marathon, Black Friday, ...)
Example: Word Clustering
Input: words
Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats
Output: word clusters
Example clusters (from NYT):
- ballot, polls, Gov, seats
- profit, finance, payments
- NFL, Reds, Sox, inning, quarterback, scored, score
- researchers, science
- Scott, Mary, Barbara, Edward
Questions
- What should a cluster represent? Similarity among objects.
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?
Due to F. Xia
Similarity
- Between two instances
- Between an instance and a cluster
- Between clusters
Similarity Measures
Given $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$:
- Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Cosine similarity: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
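A minimal sketch of these three measures in plain Python (the function names are illustrative, not from the course materials):

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: summed absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0
```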
Clustering Algorithms
Types of Clustering
Flat vs. hierarchical clustering:
- Flat: partition data into k clusters
- Hierarchical: nodes form a hierarchy
Hard vs. soft clustering:
- Hard: each object is assigned to exactly one cluster
- Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership
Hierarchical Clustering
Hierarchical vs. Flat
Hierarchical clustering:
- More informative
- Good for data exploration
- Many algorithms, none good for all data
- Computationally expensive
Flat clustering:
- Fairly efficient
- Simple baseline algorithm: K-means
- Probabilistic models use the EM algorithm
Clustering Algorithms
Flat clustering:
- K-means clustering
- K-medoids clustering
Hierarchical clustering:
- Greedy, bottom-up clustering
K-Means Clustering
Initialize:
- Randomly select k initial centroids (a centroid is the center, i.e., mean, of a cluster)
Iterate until clusters stop changing:
- Assign each instance to the nearest cluster (a cluster is nearest if its centroid is nearest)
- Recompute each cluster centroid as the mean of the instances in the cluster
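A minimal K-means sketch following this pseudocode, assuming instances are numeric tuples and using squared Euclidean distance; the helper names are illustrative:

```python
import random

def kmeans(points, k, max_iters=100):
    # Initialize: randomly select k instances as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each instance goes to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                         else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # clusters stopped changing
            break
        centroids = new_centroids
    return clusters, centroids
```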
K-Means: 1 step
K-Means
Running time:
- O(kn) per iteration, where n is the number of instances and k the number of clusters
- Converges in a finite number of steps
Issues:
- Need to pick the number of clusters k
- Can find only a local optimum
- Sensitive to outliers
- Requires Euclidean distance: what about enumerable classes (e.g., colors)?
Medoid
Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster.
Finding the medoid: for each element $p$ in cluster $c$, compute its average similarity to the rest of the cluster,
$f(p) = \frac{1}{|c| - 1} \sum_{q \in c,\, q \neq p} \mathrm{sim}(p, q)$
Select the element with the highest $f(p)$.
K-Medoids
Initialize:
- Select k instances at random as medoids
Iterate until no changes:
- Assign instances to the cluster with the nearest medoid
- Recompute the medoid for each cluster
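A minimal K-medoids sketch matching this loop, reusing the cosine() helper from the earlier similarity sketch and assuming instances are numeric tuples:

```python
import random

def medoid(cluster):
    """Element with the highest total (equivalently, average) similarity
    to the other elements in the cluster, i.e., the argmax of f(p)."""
    return max(cluster,
               key=lambda p: sum(cosine(p, q) for q in cluster if q is not p))

def k_medoids(points, k, max_iters=100):
    medoids = random.sample(points, k)           # k random initial medoids
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, medoids[i]))
            clusters[best].append(p)
        # Recompute the medoid of each cluster.
        new_medoids = [medoid(c) if c else medoids[i]
                       for i, c in enumerate(clusters)]
        if new_medoids == medoids:               # no changes: stop
            break
        medoids = new_medoids
    return clusters, medoids
```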
Greedy, Bottom-Up Hierarchical Clustering
Initialize:
- Make an individual cluster for each instance
Iterate until all instances are in the same cluster:
- Merge the two most similar clusters
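A minimal greedy bottom-up sketch; cluster similarity here is average pairwise cosine ("average link"), which is one common choice among several, and cosine() is the helper defined earlier:

```python
def agglomerative(points):
    clusters = [[p] for p in points]   # one singleton cluster per instance
    merges = []                        # merge history encodes the hierarchy
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average-link similarity.
        best_i, best_j, best_sim = 0, 1, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = (sum(cosine(p, q)
                           for p in clusters[i] for q in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if sim > best_sim:
                    best_i, best_j, best_sim = i, j, sim
        merges.append((clusters[best_i], clusters[best_j]))
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return merges
```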
Evaluation
Evaluation
With respect to a gold standard:
- Accuracy: for each cluster, assign the most common label to all items
- Rand index
- F-measure
Alternatives:
- Extrinsic evaluation
- Human inspection
Configuration
Given:
- A set of objects O = {o1, o2, ..., on}
- A partition X = {x1, ..., xr}
- A partition Y = {y1, ..., ys}
Count pairs of objects:

                          In same set in X    In different sets in X
  In same set in Y               a                      d
  In different sets in Y         c                      b
Rand Index
Measure of cluster similarity (Rand, 1971), using the pair counts a, b, c, d from the table above:
$RI = \frac{a + b}{a + b + c + d}$
No agreement: 0; full agreement: 1.
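A minimal Rand index sketch over two partitions, each given as a dict mapping object to cluster id (an assumed representation, not from the slides):

```python
from itertools import combinations

def rand_index(x, y):
    """x, y: dicts mapping each object to its cluster id in partitions X, Y."""
    a = b = c = d = 0
    for o1, o2 in combinations(list(x), 2):
        same_x = x[o1] == x[o2]
        same_y = y[o1] == y[o2]
        if same_x and same_y:
            a += 1                    # together in both partitions
        elif not same_x and not same_y:
            b += 1                    # apart in both partitions
        elif same_x:
            c += 1                    # together in X only
        else:
            d += 1                    # together in Y only
    return (a + b) / (a + b + c + d)
```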
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
For each pair of items in a cluster in YCorrect if they appear together in a cluster in X
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
For each pair of items in a cluster in YCorrect if they appear together in a cluster in X
Can compute P, R, and F-measure
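A minimal sketch of pairwise precision, recall, and F-measure under the same dict representation as the Rand index sketch:

```python
from itertools import combinations

def pairwise_prf(gold, system):
    a = c = d = 0    # pair counts as in the contingency table
    for o1, o2 in combinations(list(gold), 2):
        same_gold = gold[o1] == gold[o2]
        same_sys = system[o1] == system[o2]
        if same_gold and same_sys:
            a += 1
        elif same_gold:
            c += 1   # together in gold, split by the system
        elif same_sys:
            d += 1   # together in the system, split in gold
    p = a / (a + d) if a + d else 0.0
    r = a / (a + c) if a + c else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```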
HW #10
Unsupervised POS tagging: word clustering by neighboring-word co-occurrence.
- Create feature vectors. Features: counts of adjacent word occurrences, e.g., L=he:10 or R=run:3
- Perform clustering: K-medoids algorithm (with cosine similarity)
- Evaluate clusters: cluster mapping + accuracy
Due to F. Xia
Q1
create_vectors.* training_file word_file feat_file outfile
- training_file: one sentence per line: w1 w2 w3 ... wn
- word_file: list of words to cluster, format: word<tab>freq
- feat_file: list of words to use as features, format: feat<tab>freq
- outfile: one line per word in word_file, format: word L=he 10 L=she 5 ... R=gone 2 R=run 3 ...
Features
Features are of the form (L|R)=xx freq, where:
- xx is a word in the feat_file
- L, R: the position (left or right neighbor) where the feature appeared
- freq: the number of times word xx appeared in that position in the training file
Example: suppose 'New York' appears 540 times in the corpus:
York L=New 540 ... R=New 0 ...
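A minimal sketch of this feature extraction, assuming words and feats are sets read from word_file and feat_file; whether zero-count features are also written (as in the York example above) is up to the assignment spec:

```python
from collections import defaultdict

def create_vectors(training_file, words, feats):
    """Map each target word to {feature string: count}, counting
    immediate left (L=) and right (R=) neighbors in the training file."""
    vectors = {w: defaultdict(int) for w in words}
    with open(training_file) as f:
        for line in f:
            toks = line.split()
            for i, w in enumerate(toks):
                if w not in vectors:
                    continue
                if i > 0 and toks[i - 1] in feats:              # left neighbor
                    vectors[w]["L=" + toks[i - 1]] += 1
                if i + 1 < len(toks) and toks[i + 1] in feats:  # right neighbor
                    vectors[w]["R=" + toks[i + 1]] += 1
    return vectors

# One output line per word, features sorted alphabetically as the next
# slide requires (word_order is the order from word_file):
#   for w in word_order:
#       print(w, *(f"{k} {v}" for k, v in sorted(vectors[w].items())))
```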
Vector File
- One line per word in word_file
- Lines should be ordered as in word_file
- Features should be sorted alphabetically by feature name, e.g., L=an 3 L=the 10 ... R=aqua 1 R=house 5
- Feature sorting aids the cosine computation
Q2
k_medoids.* vector_file num_clusters sys_cluster_file
- vector_file: created in Q1
- num_clusters: number of clusters to create
- sys_cluster_file: output representing the clustering of the vectors, one line per cluster: medoid w1 w2 w3 ... wn, where medoid is the medoid representing the cluster and w1 ... wn are the words in the cluster
Q2: K-Medoids
Similarity measure: cosine similarity
Initial medoids: medoid i is placed at a fixed instance index determined by N and C, where N is the number of words to cluster and C is the number of clusters
Mapping Sys to Gold: One-to-One
- Find the highest number in the matrix
- Remove the corresponding row and column
- Repeat until all rows are removed
Example: s1 => g2 (10), s2 => g1 (7), s3 => g3 (6); acc = (10 + 7 + 6) / sum
Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3
Mapping Sys to Gold: Many-to-One
- Find the highest number in the matrix
- Remove the corresponding row (but not the column)
- Repeat until all rows are removed
Example: s1 => g2 (10), s2 => g1 (7), s3 => g2 (9), s4 => g1 (5); acc = (10 + 7 + 9 + 5) / sum
Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3
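A minimal sketch of both greedy mappings, assuming the contingency counts are given as a dict from (sys_cluster, gold_cluster) to count:

```python
def greedy_map(matrix, one_to_one=True):
    """matrix: {(sys_cluster, gold_cluster): count}. Returns the greedy
    mapping and the resulting accuracy."""
    cells = dict(matrix)
    mapping = {}
    while cells:
        # Pick the highest remaining cell.
        (s, g), count = max(cells.items(), key=lambda kv: kv[1])
        mapping[s] = (g, count)
        # Drop the chosen row; in one-to-one mode drop the column too.
        cells = {(si, gi): c for (si, gi), c in cells.items()
                 if si != s and (not one_to_one or gi != g)}
    acc = sum(count for _, count in mapping.values()) / sum(matrix.values())
    return mapping, acc
```

On the matrix above, this gives acc = (10 + 7 + 6) / 57 in one-to-one mode and (10 + 7 + 9 + 5) / 57 in many-to-one mode.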
Q3: calculate_accuracy
calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
- sys_clust: output of Q2: m w1 w2 ...
- gold_clust: same format, gold standard
- flag: 0 = one-to-one; 1 = many-to-one
- map_file: mapping of sys to gold clusters: sys_clust_num => gold_clust_num count
- acc_file: just the overall accuracy
Experiments
- Compare different numbers of words and different feature representations
- Compare different mapping strategies for accuracy
- Tabulate the results