Clustering: Shallow Processing Techniques for NLP
Ling570, November 30, 2011


Page 1: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Clustering
Shallow Processing Techniques for NLP
Ling570, November 30, 2011

Page 2: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Roadmap

- Clustering: motivation & applications
- Clustering approaches
- Evaluation

Page 4: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Clustering

Task: Given a set of objects, create a set of clusters over those objects.

Applications:
- Exploratory data analysis
- Document clustering
- Language modeling: generalization for class-based LMs
- Unsupervised Word Sense Disambiguation
- Automatic thesaurus creation
- Unsupervised Part-of-Speech Tagging
- Speaker clustering, …

Page 10: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Example: Document Clustering

Input: Set of individual documents

Output: Sets of document clusters

Many different types of clustering:
- Category: news, sports, weather, entertainment
- Genre clustering: similar styles (blogs, tweets, newswire)
- Author clustering
- Language ID: language clusters
- Topic clustering: documents on the same topic (OWS, debt supercommittee, Seattle Marathon, Black Friday, …)

Page 13: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Example: Word Clustering

Input: Words: Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats

Output: Word clusters

Example clusters (from NYT):
- ballot, polls, Gov, seats
- profit, finance, payments
- NFL, Reds, Sox, inning, quarterback, scored, score
- researchers, science
- Scott, Mary, Barbara, Edward

Page 17: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Questions

- What should a cluster represent? Similarity among objects
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?

Due to F. Xia

Page 20: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Similarity

- Between two instances
- Between an instance and a cluster
- Between clusters

Page 24: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Similarity Measures

Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn):

Euclidean distance: d(x, y) = sqrt( Σ_i (x_i - y_i)^2 )

Manhattan distance: d(x, y) = Σ_i |x_i - y_i|

Cosine similarity: cos(x, y) = ( Σ_i x_i y_i ) / ( sqrt(Σ_i x_i^2) sqrt(Σ_i y_i^2) )
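A minimal sketch of these three measures in Python (the function names are mine, for illustration):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0
```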

Page 25: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Clustering Algorithms

Page 29: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Types of Clustering

Flat vs. hierarchical clustering:
- Flat: partition data into k clusters
- Hierarchical: nodes form a hierarchy

Hard vs. soft clustering:
- Hard: each object is assigned to exactly one cluster
- Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership

Page 30: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Hierarchical Clustering

Page 34: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Hierarchical vs. Flat

Hierarchical clustering:
- More informative
- Good for data exploration
- Many algorithms, none good for all data
- Computationally expensive

Flat clustering:
- Fairly efficient
- Simple baseline algorithm: K-means
- Probabilistic models use the EM algorithm

Page 35: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Clustering Algorithms

Flat clustering:
- K-means clustering
- K-medoids clustering

Hierarchical clustering:
- Greedy, bottom-up clustering

Page 39: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

K-Means Clustering

Initialize:
- Randomly select k initial centroids (the centroid is the center, i.e., mean, of a cluster)

Iterate until clusters stop changing:
- Assign each instance to the nearest cluster (a cluster is nearest if its centroid is nearest)
- Recompute each cluster centroid as the mean of the instances in the cluster
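Putting these steps together, a compact K-means sketch (random initialization, squared Euclidean distance for nearness; the names are mine, not from the slides):

```python
import random

def kmeans(instances, k, max_iters=100):
    # instances: list of equal-length numeric tuples
    centroids = random.sample(instances, k)
    assignment = None
    for _ in range(max_iters):
        # Assign each instance to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
            for x in instances
        ]
        if new_assignment == assignment:  # clusters stopped changing
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned instances.
        for j in range(k):
            members = [x for x, c in zip(instances, assignment) if c == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids
```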

Page 40: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

K-Means: 1 step

Page 43: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

K-Means

Running time:
- O(kn) distance computations per iteration, where n is the number of instances and k the number of clusters
- Converges in a finite number of steps

Issues:
- Need to pick the number of clusters k
- Can find only a local optimum
- Sensitive to outliers
- Requires Euclidean distance: what about enumerable classes (e.g., colors)?

Page 46: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Medoid

Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster.

Finding the medoid: for each element p in cluster C, compute its average similarity to the other members,

  f(p) = (1 / (|C| - 1)) Σ_{q in C, q != p} sim(p, q)

then select the element with the highest f(p).
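As a small sketch (the normalization does not change which element wins the argmax):

```python
def medoid(cluster, sim):
    # Return the member with the highest average similarity to the rest.
    def f(p):
        others = [q for q in cluster if q is not p]
        return sum(sim(p, q) for q in others) / len(others) if others else 0.0
    return max(cluster, key=f)
```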

Page 49: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

K-Medoids

Initialize:
- Select k instances at random as medoids

Iterate until no changes:
- Assign instances to the cluster with the nearest medoid
- Recompute the medoid for each cluster
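A minimal sketch of this loop, assuming a pairwise similarity function sim (names are illustrative):

```python
import random

def k_medoids(instances, k, sim, max_iters=100):
    medoids = random.sample(instances, k)
    clusters = None
    for _ in range(max_iters):
        # Assign each instance to the cluster whose medoid is most similar.
        new_clusters = [[] for _ in range(k)]
        for x in instances:
            j = max(range(k), key=lambda j: sim(x, medoids[j]))
            new_clusters[j].append(x)
        if new_clusters == clusters:  # no changes: stop
            break
        clusters = new_clusters
        # Recompute each medoid: the member with the highest total
        # similarity to the rest of its cluster.
        for j, members in enumerate(clusters):
            if members:
                medoids[j] = max(
                    members,
                    key=lambda p: sum(sim(p, q) for q in members if q is not p),
                )
    return medoids, clusters
```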

Page 51: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Greedy, Bottom-Up Hierarchical Clustering

Initialize:
- Make an individual cluster for each instance

Iterate until all instances are in the same cluster:
- Merge the two most similar clusters
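A sketch of this greedy, bottom-up loop; cluster_sim is any cluster-level similarity (e.g., average pairwise similarity between members), which the slides leave open:

```python
def agglomerate(instances, cluster_sim):
    # Start with a singleton cluster per instance.
    clusters = [[x] for x in instances]
    merges = []  # the merge history encodes the hierarchy
    while len(clusters) > 1:
        # Find the two most similar clusters ...
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]),
        )
        # ... and merge them.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges
```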

Page 52: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Evaluation

Page 55: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Evaluation

With respect to a gold standard:
- Accuracy: for each cluster, assign the most common label to all of its items
- Rand index
- F-measure

Alternatives:
- Extrinsic evaluation
- Human inspection

Page 58: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Configuration

Given:
- A set of objects O = {o1, o2, …, on}
- A partition X = {x1, …, xr}
- A partition Y = {y1, …, ys}

Every pair of objects falls into one of four cells:

                          In same set in X    In different sets in X
  In same set in Y               a                      d
  In different sets in Y         c                      b

Page 61: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Rand Index

A measure of cluster similarity (Rand, 1971). In terms of the pair counts above:

  RI = (a + b) / (a + b + c + d)

a and b count pairs the two partitions agree on (together in both, or apart in both); c and d count disagreements.

No agreement? 0. Full agreement? 1.
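A direct pairwise computation, with each partition encoded as a dict from object to cluster id (an encoding I am choosing for illustration):

```python
from itertools import combinations

def rand_index(x, y):
    # x, y: dicts mapping each object to its cluster id in X and Y.
    agree = total = 0
    for o1, o2 in combinations(list(x), 2):
        same_x = x[o1] == x[o2]
        same_y = y[o1] == y[o2]
        agree += same_x == same_y  # a-type and b-type pairs agree
        total += 1
    return agree / total if total else 1.0
```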

Page 64: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Precision & Recall

Assume X is the gold-standard partition and Y is the system-generated partition.

For each pair of items in a cluster in Y: the pair is correct if the two items also appear together in a cluster in X.

From these pairwise decisions we can compute P, R, and F-measure.
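A sketch with the same dict encoding as the Rand index example:

```python
from itertools import combinations

def pairwise_prf(gold, system):
    # gold, system: dicts mapping each object to its cluster id.
    tp = fp = fn = 0
    for o1, o2 in combinations(list(gold), 2):
        in_gold = gold[o1] == gold[o2]     # together in X
        in_sys = system[o1] == system[o2]  # together in Y
        if in_sys and in_gold:
            tp += 1
        elif in_sys:
            fp += 1
        elif in_gold:
            fn += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```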

Page 65: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

HW #10

Due to F. Xia

Page 66: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

HW #10

Unsupervised POS tagging: word clustering by neighboring-word cooccurrence.

- Create feature vectors. Features: counts of adjacent word occurrences, e.g., L=he:10 or R=run:3
- Perform clustering: K-medoids algorithm (with cosine similarity)
- Evaluate clusters: cluster mapping + accuracy

Page 67: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Q1

create_vectors.* training_file word_file feat_file outfile

- training_file: one sentence per line: w1 w2 w3 … wn
- word_file: list of words to cluster, one per line: word<tab>freq
- feat_file: list of words to use as features: feat<tab>freq
- outfile: one line per word in word_file. Format: word L=he 10 L=she 5 ….. R=gone 2 R=run 3…
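A sketch of the core counting step (the in-memory representation and helper name are mine; the assignment only fixes the file formats):

```python
from collections import Counter, defaultdict

def build_vectors(sentences, cluster_words, feat_words):
    # sentences: token lists from training_file;
    # cluster_words: words to cluster (word_file);
    # feat_words: words usable as features (feat_file).
    cluster_words, feat_words = set(cluster_words), set(feat_words)
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in cluster_words:
                continue
            if i > 0 and sent[i - 1] in feat_words:              # left neighbor
                vectors[w]["L=" + sent[i - 1]] += 1
            if i + 1 < len(sent) and sent[i + 1] in feat_words:  # right neighbor
                vectors[w]["R=" + sent[i + 1]] += 1
    return vectors
```

Each word's line can then be written with its features sorted alphabetically, as the vector-file format below requires.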

Page 68: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Features

Features are of the form (L|R)=xx freq, where:
- xx is a word in feat_file
- L or R is the position (left or right neighbor) where the feature appeared
- freq is the number of times word xx appeared in that position in the training file

Example: suppose "New York" appears 540 times in the corpus. Then for the word York: L=New 540 … R=New 0 …

Page 69: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Vector File

- One line per word in word_file
- Lines should be ordered as in word_file
- Features should be sorted alphabetically by feature name, e.g., L=an 3 L=the 10 … R=aqua 1 R=house 5
- Feature sorting aids the cosine computation, as the sketch below shows
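With both vectors sorted by feature name, the dot product becomes a single merge-style pass:

```python
import math

def sparse_cosine(u, v):
    # u, v: lists of (feature, count) pairs, sorted by feature name.
    dot, i, j = 0.0, 0, 0
    while i < len(u) and j < len(v):
        if u[i][0] == v[j][0]:        # shared feature: contributes to dot product
            dot += u[i][1] * v[j][1]
            i += 1
            j += 1
        elif u[i][0] < v[j][0]:
            i += 1
        else:
            j += 1
    nu = math.sqrt(sum(c * c for _, c in u))
    nv = math.sqrt(sum(c * c for _, c in v))
    return dot / (nu * nv) if nu and nv else 0.0
```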

Page 70: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Q2

k_medoids.* vector_file num_clusters sys_cluster_file

- vector_file: created in Q1
- num_clusters: number of clusters to create
- sys_cluster_file: output representing the clustering of the vectors, one cluster per line: medoid w1 w2 w3 … wn, where medoid is the medoid representing the cluster and w1 … wn are the words in the cluster

Page 71: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Q2: K-Medoids

- Similarity measure: cosine similarity
- Initial medoids: medoid i is at an instance given as a function of N and C, where N is the number of words to cluster and C is the number of clusters
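The exact placement formula is not reproduced above; one common deterministic choice, which this sketch assumes, spaces the C initial medoids evenly through the N instances:

```python
def initial_medoids(instances, C):
    # ASSUMPTION: medoid i sits at instance floor(i * N / C), i = 0 .. C-1.
    # This evenly spaced choice is a guess at the slide's formula.
    N = len(instances)
    return [instances[(i * N) // C] for i in range(C)]
```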

Page 73: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Mapping Sys to Gold: One-to-One

- Find the highest number in the matrix
- Remove the corresponding row and column
- Repeat until all rows are removed

s1 => g2 10, s2 => g1 7, s3 => g3 6; acc = (10 + 7 + 6) / sum

Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3

Page 74: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Mapping Sys to Gold: Many-to-One

- Find the highest number in the matrix
- Remove the corresponding row (but not the column)
- Repeat until all rows are removed

s1 => g2 10, s2 => g1 7, s3 => g2 9, s4 => g1 5; acc = (10 + 7 + 9 + 5) / sum

Due to F. Xia

        g1   g2   g3
  s1     2   10    9
  s2     7    4    2
  s3     0    9    6
  s4     5    0    3
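Both strategies can share one greedy loop; the flag mirrors Q3's 0 = one-to-one, 1 = many-to-one (the nested-dict matrix encoding is my own):

```python
def map_clusters(counts, many_to_one):
    # counts[s][g]: count of items of system cluster s carrying gold label g.
    remaining_rows = set(counts)
    used_cols = set()
    mapping, total = {}, 0
    while remaining_rows:
        # Find the highest count over the remaining cells.
        best = max(
            ((s, g, c) for s in remaining_rows
             for g, c in counts[s].items()
             if many_to_one or g not in used_cols),
            key=lambda t: t[2],
            default=None,
        )
        if best is None:  # one-to-one: no free columns left
            break
        s, g, c = best
        mapping[s] = g
        total += c
        remaining_rows.remove(s)  # remove the row ...
        if not many_to_one:
            used_cols.add(g)      # ... and the column, in one-to-one mode
    return mapping, total

# The matrix from the slides:
# counts = {"s1": {"g1": 2, "g2": 10, "g3": 9},
#           "s2": {"g1": 7, "g2": 4, "g3": 2},
#           "s3": {"g1": 0, "g2": 9, "g3": 6},
#           "s4": {"g1": 5, "g2": 0, "g3": 3}}
# map_clusters(counts, False) -> total 10 + 7 + 6 (s4 left unmapped)
# map_clusters(counts, True)  -> total 10 + 9 + 7 + 5
```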

Page 75: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Q3: calculate_accuracy

calculate_accuracy.* sys_clust gold_clust flag map_file acc_file

- sys_clust: output of Q2: m w1 w2 …
- gold_clust: similar format, gold standard
- flag: 0 = one-to-one; 1 = many-to-one
- map_file: mapping of system clusters to gold clusters: sys_clust_num => gold_clust_num count
- acc_file: just the overall accuracy

Page 76: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011

Experiments

- Compare different numbers of words and different feature representations
- Compare different mapping strategies for accuracy
- Tabulate results