Clustering
Shallow Processing Techniques for NLP
Ling570, November 30, 2011
Roadmap
- Clustering: motivation & applications
- Clustering approaches
- Evaluation
Clustering
Task: Given a set of objects, create a set of clusters over those objects.
Applications:
- Exploratory data analysis
- Document clustering
- Language modeling: generalization for class-based LMs
- Unsupervised word sense disambiguation
- Automatic thesaurus creation
- Unsupervised part-of-speech tagging
- Speaker clustering, ...
Example: Document Clustering
Input: a set of individual documents
Output: sets of document clusters
Many different types of clustering:
- Category: news, sports, weather, entertainment
- Genre clustering: similar styles, e.g., blogs, tweets, newswire
- Author clustering
- Language ID: language clusters
- Topic clustering: documents on the same topic (OWS, debt supercommittee, Seattle Marathon, Black Friday, ...)
Example: Word Clustering
Input: words
Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats
Output: word clusters
Example clusters (from NYT):
- ballot, polls, Gov, seats
- profit, finance, payments
- NFL, Reds, Sox, inning, quarterback, scored, score
- researchers, science
- Scott, Mary, Barbara, Edward
Questions
- What should a cluster represent? Similarity among objects.
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?
Due to F. Xia
Similarity
- Between two instances
- Between an instance and a cluster
- Between clusters
Similarity Measures
Given $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$:
- Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Cosine similarity: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
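A minimal sketch of these three measures in plain Python (the function names are illustrative, not from the course materials):

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: summed absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0
```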
Clustering Algorithms
Types of Clustering
Flat vs. hierarchical clustering:
- Flat: partition data into k clusters
- Hierarchical: nodes form a hierarchy
Hard vs. soft clustering:
- Hard: each object is assigned to exactly one cluster
- Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership
Hierarchical Clustering
Hierarchical vs. Flat
Hierarchical clustering:
- More informative
- Good for data exploration
- Many algorithms, none good for all data
- Computationally expensive
Flat clustering:
- Fairly efficient
- Simple baseline algorithm: K-means
- Probabilistic models use the EM algorithm
Clustering Algorithms
Flat clustering:
- K-means clustering
- K-medoids clustering
Hierarchical clustering:
- Greedy, bottom-up clustering
K-Means Clustering
Initialize:
- Randomly select k initial centroids (a centroid is the center, i.e., mean, of a cluster)
Iterate until clusters stop changing:
- Assign each instance to the nearest cluster (a cluster is nearest if its centroid is nearest)
- Recompute each cluster centroid as the mean of the instances in the cluster
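A minimal K-means sketch following this pseudocode, assuming instances are numeric tuples and using squared Euclidean distance; the helper names are illustrative:

```python
import random

def kmeans(points, k, max_iters=100):
    # Initialize: randomly select k instances as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each instance goes to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                         else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # clusters stopped changing
            break
        centroids = new_centroids
    return clusters, centroids
```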
K-Means: 1 step
K-Means
Running time:
- O(kn) per iteration, where n is the number of instances and k the number of clusters
- Converges in a finite number of steps
Issues:
- Need to pick the number of clusters k
- Can find only a local optimum
- Sensitive to outliers
- Requires Euclidean distance: what about enumerable classes (e.g., colors)?
Medoid
Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster.
Finding the medoid: for each element $p$ in cluster $c$, compute its average similarity to the rest of the cluster,
$f(p) = \frac{1}{|c| - 1} \sum_{q \in c,\, q \neq p} \mathrm{sim}(p, q)$
Select the element with the highest $f(p)$.
K-Medoids
Initialize:
- Select k instances at random as medoids
Iterate until no changes:
- Assign instances to the cluster with the nearest medoid
- Recompute the medoid for each cluster
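A minimal K-medoids sketch matching this loop, reusing the cosine() helper from the earlier similarity sketch and assuming instances are numeric tuples:

```python
import random

def medoid(cluster):
    """Element with the highest total (equivalently, average) similarity
    to the other elements in the cluster, i.e., the argmax of f(p)."""
    return max(cluster,
               key=lambda p: sum(cosine(p, q) for q in cluster if q is not p))

def k_medoids(points, k, max_iters=100):
    medoids = random.sample(points, k)           # k random initial medoids
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            best = max(range(k), key=lambda i: cosine(p, medoids[i]))
            clusters[best].append(p)
        # Recompute the medoid of each cluster.
        new_medoids = [medoid(c) if c else medoids[i]
                       for i, c in enumerate(clusters)]
        if new_medoids == medoids:               # no changes: stop
            break
        medoids = new_medoids
    return clusters, medoids
```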
Greedy, Bottom-Up Hierarchical Clustering
Initialize:
- Make an individual cluster for each instance
Iterate until all instances are in the same cluster:
- Merge the two most similar clusters
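A minimal greedy bottom-up sketch; cluster similarity here is average pairwise cosine ("average link"), which is one common choice among several, and cosine() is the helper defined earlier:

```python
def agglomerative(points):
    clusters = [[p] for p in points]   # one singleton cluster per instance
    merges = []                        # merge history encodes the hierarchy
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average-link similarity.
        best_i, best_j, best_sim = 0, 1, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = (sum(cosine(p, q)
                           for p in clusters[i] for q in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if sim > best_sim:
                    best_i, best_j, best_sim = i, j, sim
        merges.append((clusters[best_i], clusters[best_j]))
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return merges
```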
Evaluation
Evaluation
With respect to a gold standard:
- Accuracy: for each cluster, assign the most common label to all items
- Rand index
- F-measure
Alternatives:
- Extrinsic evaluation
- Human inspection
Configuration
Given:
- A set of objects O = {o1, o2, ..., on}
- A partition X = {x1, ..., xr}
- A partition Y = {y1, ..., ys}
Count pairs of objects:

                          In same set in X    In different sets in X
  In same set in Y               a                      d
  In different sets in Y         c                      b
Rand Index
Measure of cluster similarity (Rand, 1971), using the pair counts a, b, c, d from the table above:
$RI = \frac{a + b}{a + b + c + d}$
No agreement: 0; full agreement: 1.
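A minimal Rand index sketch over two partitions, each given as a dict mapping object to cluster id (an assumed representation, not from the slides):

```python
from itertools import combinations

def rand_index(x, y):
    """x, y: dicts mapping each object to its cluster id in partitions X, Y."""
    a = b = c = d = 0
    for o1, o2 in combinations(list(x), 2):
        same_x = x[o1] == x[o2]
        same_y = y[o1] == y[o2]
        if same_x and same_y:
            a += 1                    # together in both partitions
        elif not same_x and not same_y:
            b += 1                    # apart in both partitions
        elif same_x:
            c += 1                    # together in X only
        else:
            d += 1                    # together in Y only
    return (a + b) / (a + b + c + d)
```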
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
For each pair of items in a cluster in YCorrect if they appear together in a cluster in X
Precision & RecallAssume X is the gold standard partition
Assume Y is the system-generated partition
For each pair of items in a cluster in YCorrect if they appear together in a cluster in X
Can compute P, R, and F-measure
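A minimal sketch of pairwise precision, recall, and F-measure under the same dict representation as the Rand index sketch:

```python
from itertools import combinations

def pairwise_prf(gold, system):
    a = c = d = 0    # pair counts as in the contingency table
    for o1, o2 in combinations(list(gold), 2):
        same_gold = gold[o1] == gold[o2]
        same_sys = system[o1] == system[o2]
        if same_gold and same_sys:
            a += 1
        elif same_gold:
            c += 1   # together in gold, split by the system
        elif same_sys:
            d += 1   # together in the system, split in gold
    p = a / (a + d) if a + d else 0.0
    r = a / (a + c) if a + c else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```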
HW #10
Unsupervised POS tagging: word clustering by neighboring-word co-occurrence.
- Create feature vectors. Features: counts of adjacent word occurrences, e.g., L=he:10 or R=run:3
- Perform clustering: K-medoids algorithm (with cosine similarity)
- Evaluate clusters: cluster mapping + accuracy
Due to F. Xia
Q1
create_vectors.* training_file word_file feat_file outfile
- training_file: one sentence per line: w1 w2 w3 ... wn
- word_file: list of words to cluster, format: word<tab>freq
- feat_file: list of words to use as features, format: feat<tab>freq
- outfile: one line per word in word_file, format: word L=he 10 L=she 5 ... R=gone 2 R=run 3 ...
Features
Features are of the form (L|R)=xx freq, where:
- xx is a word in the feat_file
- L, R: the position (left or right neighbor) where the feature appeared
- freq: the number of times word xx appeared in that position in the training file
Example: suppose 'New York' appears 540 times in the corpus:
York L=New 540 ... R=New 0 ...
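A minimal sketch of this feature extraction, assuming words and feats are sets read from word_file and feat_file; whether zero-count features are also written (as in the York example above) is up to the assignment spec:

```python
from collections import defaultdict

def create_vectors(training_file, words, feats):
    """Map each target word to {feature string: count}, counting
    immediate left (L=) and right (R=) neighbors in the training file."""
    vectors = {w: defaultdict(int) for w in words}
    with open(training_file) as f:
        for line in f:
            toks = line.split()
            for i, w in enumerate(toks):
                if w not in vectors:
                    continue
                if i > 0 and toks[i - 1] in feats:              # left neighbor
                    vectors[w]["L=" + toks[i - 1]] += 1
                if i + 1 < len(toks) and toks[i + 1] in feats:  # right neighbor
                    vectors[w]["R=" + toks[i + 1]] += 1
    return vectors

# One output line per word, features sorted alphabetically as the next
# slide requires (word_order is the order from word_file):
#   for w in word_order:
#       print(w, *(f"{k} {v}" for k, v in sorted(vectors[w].items())))
```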
Vector File
- One line per word in word_file
- Lines should be ordered as in word_file
- Features should be sorted alphabetically by feature name, e.g., L=an 3 L=the 10 ... R=aqua 1 R=house 5
- Feature sorting aids the cosine computation
Q2
k_medoids.* vector_file num_clusters sys_cluster_file
- vector_file: created in Q1
- num_clusters: number of clusters to create
- sys_cluster_file: output representing the clustering of the vectors, one line per cluster: medoid w1 w2 w3 ... wn, where medoid is the medoid representing the cluster and w1 ... wn are the words in the cluster
Q2: K-Medoids
Similarity measure: cosine similarity
Initial medoids: medoid i is placed at a fixed instance index determined by N and C, where N is the number of words to cluster and C is the number of clusters
Mapping Sys to Gold: One-to-One
- Find the highest number in the matrix
- Remove the corresponding row and column
- Repeat until all rows are removed
Example: s1 => g2 (10), s2 => g1 (7), s3 => g3 (6); acc = (10 + 7 + 6) / sum
Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3
Mapping Sys to Gold: Many-to-One
- Find the highest number in the matrix
- Remove the corresponding row (but not the column)
- Repeat until all rows are removed
Example: s1 => g2 (10), s2 => g1 (7), s3 => g2 (9), s4 => g1 (5); acc = (10 + 7 + 9 + 5) / sum
Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3
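A minimal sketch of both greedy mappings, assuming the contingency counts are given as a dict from (sys_cluster, gold_cluster) to count:

```python
def greedy_map(matrix, one_to_one=True):
    """matrix: {(sys_cluster, gold_cluster): count}. Returns the greedy
    mapping and the resulting accuracy."""
    cells = dict(matrix)
    mapping = {}
    while cells:
        # Pick the highest remaining cell.
        (s, g), count = max(cells.items(), key=lambda kv: kv[1])
        mapping[s] = (g, count)
        # Drop the chosen row; in one-to-one mode drop the column too.
        cells = {(si, gi): c for (si, gi), c in cells.items()
                 if si != s and (not one_to_one or gi != g)}
    acc = sum(count for _, count in mapping.values()) / sum(matrix.values())
    return mapping, acc
```

On the matrix above, this gives acc = (10 + 7 + 6) / 57 in one-to-one mode and (10 + 7 + 9 + 5) / 57 in many-to-one mode.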
Q3: calculate_accuracy
calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
- sys_clust: output of Q2: m w1 w2 ...
- gold_clust: same format, gold standard
- flag: 0 = one-to-one; 1 = many-to-one
- map_file: mapping of sys to gold clusters: sys_clust_num => gold_clust_num count
- acc_file: just the overall accuracy
Experiments
- Compare different numbers of words and different feature representations
- Compare different mapping strategies for accuracy
- Tabulate the results