
Page 1

Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Moshe Koppel and Navot Akiva

Page 2

Clustering

• The idea is to define “natural” clusters

• Position in space depends on feature set

• Need to have some notion of what you want in order to choose the right feature types
– e.g., cluster by topic: content words
– e.g., cluster by author: style features

• With many features, clusters aren’t so elegant

Page 3

How Many Clusters?

• Some methods require the number of clusters k to be decided in advance

• Some methods try to optimize k based on the data
– e.g., in the previous example, we would want 3 clusters

• Caution: some optimization criteria are biased towards one giant cluster or many tiny ones

Page 4

Flat Clustering: K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seeds.

Until clustering converges or another stopping criterion is met:
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  For each cluster cj:
    sj = μ(cj)   (update each seed to the centroid of its cluster)
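To make the loop concrete, here is a minimal NumPy sketch of flat K-means (an illustration, not the lecture's code); `X` is assumed to be a float array with one row per document vector.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal flat K-means. X: (n_docs, n_features) float array."""
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial seeds (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each doc goes to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # doc partition unchanged -> converged
        assignment = new_assignment
        # Update step: each seed becomes the centroid (mean) of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids
```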

Page 5

K-Means Example (K=2)

[Figure: K-means animation on a 2-D point set: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

Page 6

Termination conditions

• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.

Page 7

Convergence

• Why should the K-means algorithm ever reach a fixed point?
– A state in which clusters don’t change.

• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
– EM is known to converge.
– Number of iterations could be large.

Page 8

Convergence of K-Means

• Define goodness measure of cluster k as sum of squared distances from cluster centroid:

Gk = Σi ‖di – ck‖²   (sum over all di in cluster k)

G = Σk Gk

• Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
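To spell out that claim (a paraphrase of the standard argument, not text from the slides): during reassignment the centroids are held fixed, and each doc moves to a centroid at least as close as its old one, so no term of G can grow.

```latex
G^{\text{before}}
  = \sum_{i} \lVert d_i - c_{\mathrm{old}(i)} \rVert^2
  \;\ge\; \sum_{i} \lVert d_i - c_{\mathrm{new}(i)} \rVert^2
  = G^{\text{after}},
\qquad \text{since each } d_i \text{ is reassigned to its closest centroid.}
```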

Page 9

Dependence on Seed Choice

• Results depend on random seed selection.
• Some tricks for avoiding bad seeds:
– Choose seeds far from any already chosen
– Try out multiple seed sets

Example (points A-F): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
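A hedged sketch of both tricks, reusing the `kmeans` function sketched above; the farthest-first rule and the use of G to compare restarts are standard choices, not necessarily the exact ones intended here.

```python
import numpy as np

def farthest_first_seeds(X, k, seed=0):
    """Pick each new seed as the doc farthest from all seeds chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]        # first seed: a random doc
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2).min(axis=1)
        chosen.append(int(d.argmax()))          # farthest from every chosen seed
    return X[chosen]

def best_of_restarts(X, k, n_restarts=10):
    """Run K-means from several random seed sets and keep the lowest-G clustering."""
    best, best_G = None, np.inf
    for r in range(n_restarts):
        assignment, centroids = kmeans(X, k, seed=r)   # kmeans as sketched earlier
        G = sum(np.sum((X[assignment == j] - centroids[j]) ** 2) for j in range(k))
        if G < best_G:
            best, best_G = (assignment, centroids), G
    return best
```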

Page 10

A Problem…

• The methods we’ve discussed so far don’t always give very satisfying results.

• Have a look at this example….

Page 11

[Figure from Ng, Jordan & Weiss, “On Spectral Clustering: Analysis and an Algorithm”.]

Page 12

Spectral Methods

• Because of examples like that, spectral methods have been developed.

• Here’s the idea: imagine weighted edges between points, where each weight reflects similarity. Zero out all edges except those connecting each point to its k nearest (highest-weight) neighbors.

• Now make the “cheapest” cuts that result in separate components.

Page 13

[Figure from Ng, Jordan & Weiss, “On Spectral Clustering: Analysis and an Algorithm”.]

Page 14

Spectral Clustering

Properly formulated, the minimum normalized cut (Ncut) problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions:

1. Pre-processing
• Construct a matrix representation of the dataset.

2. Decomposition
• Compute eigenvalues and eigenvectors of the matrix.
• Map each point to a lower-dimensional representation based on one or more eigenvectors.

3. Grouping
• Assign points to two or more clusters, based on the new representation.
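A rough NumPy sketch of those three steps, in the style of Ng, Jordan & Weiss; the kernel, neighborhood size, and use of scikit-learn's KMeans for the final grouping are my choices, not the lecture's.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(X, k, n_neighbors=10, sigma=1.0):
    n = len(X)
    # 1. Pre-processing: similarity matrix, zeroed out except k-nearest-neighbor edges.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-d ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]   # skip self (column 0)
    for i in range(n):
        keep[i, nearest[i]] = True
    W = np.where(keep | keep.T, W, 0.0)                     # symmetric kNN graph

    # 2. Decomposition: normalized Laplacian, bottom-k eigenvectors as an embedding.
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    U = eigvecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # 3. Grouping: ordinary K-means on the embedded points.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```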

Page 15

New Problem: Separating Document Components

• Often documents consist of multiple authorial components.

• Our object is to tease apart the components of a composite document.

Page 16

Basic Idea

• Divide the document into natural chunks (e.g., chapters, paragraphs)

• Vectorize chunks using some feature set

• Cluster the vectors

Page 17

Classic Example: The Bible

• Great historic and cultural interest

• Much prior research on components

• Has been manually tagged in every conceivable way

Page 18

Test Case

Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out.

• Each is presumably the work of a single (primary) author.

• There’s no reason to think it’s the same author.

• They are thematically quite similar, though far from identical.

Page 19

Clustering Jeremiah+Ezekiel

• Chunks = chapters (no sequence info)

• Features = bag of words

• Cluster method = Ncut

• # clusters = 2
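A sketch of that setup with off-the-shelf tools, assuming the chapters arrive as plain-text strings; scikit-learn's SpectralClustering (a normalized-cut-style method) and cosine affinity stand in for the Ncut step here.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_chapters(chapters, n_clusters=2):
    """chapters: list of chapter texts (order is ignored). Returns a cluster label per chapter."""
    # Features = bag of words.
    X = CountVectorizer().fit_transform(chapters)
    # Precompute a similarity (affinity) matrix; cosine is one common choice.
    affinity = cosine_similarity(X)
    # Cluster method = normalized-cut-style spectral clustering, # clusters = 2.
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)
```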

Page 20

Results using all words

            Ezekiel   Jeremiah
Cluster 1     28        29
Cluster 2     20        23

• Not too exciting. We must be picking up thematic or genre-related differences that cross books.
• Let’s try using only function words.

Page 21

Results using function words

            Ezekiel   Jeremiah
Cluster 1     28        34
Cluster 2     20        18

• Not any better.
• Let’s try a new approach.

Page 22

A Better Idea

• Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh).

• It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets.

• Note that words aren’t synonyms; rather, word senses are synonyms. (For example, mateh=staff is a synonym of makel, but mateh=tribe is not.)

Page 23

Automatically Finding Synonyms

• There are various clever methods for identifying synsets, but most are not exact enough for our purpose.

• Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging.

• We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.)

• Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)
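As a toy sketch of the translation trick, under an assumed data format in which each (Hebrew) word sense is paired with the English word it is translated to; the real pipeline depends on careful translations, sense tagging, and the manual clean-up mentioned above.

```python
from collections import defaultdict

def build_synsets(sense_to_english):
    """sense_to_english: dict mapping a word sense, e.g. 'mateh#staff', to the
    English word it is translated as. Senses sharing an English word are grouped
    into a candidate synset (to be manually pruned afterwards)."""
    groups = defaultdict(set)
    for sense, english in sense_to_english.items():
        groups[english].add(sense)
    # Keep only groups offering a real choice between at least two senses.
    return {eng: senses for eng, senses in groups.items() if len(senses) > 1}

# Hypothetical example: makel=staff and mateh=staff end up in the same synset,
# while mateh=tribe does not.
synsets = build_synsets({
    "makel#staff": "staff",
    "mateh#staff": "staff",
    "mateh#tribe": "tribe",
})
```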

Page 24

Synonym Method

• The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right.
– If two docs use the same synonym, they are similar.
– If two docs use opposite synonyms, they are different.
– If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they?

• For measuring similarity, we only consider synsets represented in both docs.
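A hedged sketch of such a measure (my formulation, not necessarily the exact one used): represent each doc by the synonym it uses for each synset, and score agreement only over synsets that appear in both docs.

```python
def synonym_similarity(doc_a, doc_b):
    """doc_a, doc_b: dicts mapping synset id -> the synonym the doc actually uses
    (e.g., {"staff": "makel"}). Only synsets represented in BOTH docs count."""
    shared = set(doc_a) & set(doc_b)
    if not shared:
        return 0.0          # no basis for comparison
    agree = sum(1 for s in shared if doc_a[s] == doc_b[s])
    return agree / len(shared)

# Hypothetical usage: two chapters that both use a staff-word.
sim = synonym_similarity({"staff": "makel", "rod": "shevet"},
                         {"staff": "makel"})
# -> 1.0: they agree on every synset they share.
```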

Page 25

Results using synonyms

            Ezekiel   Jeremiah
Cluster 1      6        46
Cluster 2     41         7

• Now we’re getting somewhere.
• But we’re not done yet…

Page 26

Core of a Cluster

• Some chapters are in a cluster because they really belong there; some just have to be somewhere.

• Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters.
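A sketch of one way to pick such cores (the distance-ratio rule and its threshold are assumptions for illustration; it assumes exactly two clusters).

```python
import numpy as np

def cluster_cores(X, assignment, centroids, ratio=0.8):
    """Keep only chunks clearly closer to their own centroid than to the other one:
    dist(own) < ratio * dist(other). X: chunk vectors; assignment: 0/1 labels."""
    assignment = np.asarray(assignment)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, 2)
    own = d[np.arange(len(X)), assignment]
    other = d[np.arange(len(X)), 1 - assignment]
    return np.where(own < ratio * other)[0]     # indices of core chunks
```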


Page 27

Cluster cores

            Ezekiel   Jeremiah
Cluster 1      2        36
Cluster 2     36         0

Page 28

Cluster cores

            Ezekiel   Jeremiah
Cluster 1      2        36
Cluster 2     36         0

(The two Ezekiel chapters in Cluster 1 are Ezekiel 1 and 10.)

Page 29

Expanding the Core

• Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary.

• In fact, we can use function words as our features.
– Using synonyms will just get us back where we started.
– And besides, function words (FW) are generally very reliable for supervised authorship attribution.
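A sketch of the expansion step with scikit-learn; the function-word list and the choice of a linear SVM are assumptions about the details.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def expand_core(core_texts, core_labels, remaining_texts, function_words):
    """Train an SVM on the core chunks, using counts of function words only,
    then label all remaining chunks with the learned boundary."""
    vec = CountVectorizer(vocabulary=function_words)
    clf = LinearSVC()
    clf.fit(vec.fit_transform(core_texts), core_labels)
    return clf.predict(vec.transform(remaining_texts))
```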

Page 30

SVM expansion of core

            Ezekiel   Jeremiah
Cluster 1      1        52
Cluster 2     47         0

• Two training examples are “misclassed” by the SVM.
• Incredibly, these are Ezekiel 1 and 10, which were part of the Jeremiah core but are on the Ezekiel side of the optimal SVM boundary.
• The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.

Page 31

Decomposing Unsegmented Text

• Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption.

• Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses.

• Can we tease that apart?
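A sketch of how such a composite "Jer-iel" could be generated; the run-length range is an arbitrary assumption.

```python
import random

def munge(verses_a, verses_b, max_run=10, seed=0):
    """Alternately take a random number of verses from each book until both run out.
    Returns the merged verse list and the true author label (0 or 1) for each verse."""
    rng = random.Random(seed)
    merged, labels = [], []
    sources, pos, turn = [verses_a, verses_b], [0, 0], 0
    while pos[0] < len(verses_a) or pos[1] < len(verses_b):
        if pos[turn] < len(sources[turn]):
            run = rng.randint(1, max_run)
            take = sources[turn][pos[turn]:pos[turn] + run]
            merged.extend(take)
            labels.extend([turn] * len(take))
            pos[turn] += run
        turn = 1 - turn    # switch to the other book
    return merged, labels
```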

Page 32

Chunks Might Be Mixed

[Figure: schematic of the munged text divided into chunks, some of which mix the two books.]


Page 34

Decomposing Unsegmented Text

• Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.)

• Now proceed exactly as above, until you have cluster cores.

• With a bit of luck, the chunks in the core will be the “pure” chunks.

Page 35

A Munged Document

[Figure: schematic of the munged document.]

Page 36

Decomposing Unsegmented Text

• As above, use SVM to learn a classifier, using the core chunks as training examples.

• But at the last step, classify individual verses instead of chapters.

• Some smoothing techniques are necessary to:
– Deal with the sparseness of features in individual verses
– Take advantage of serial correlation
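One simple stand-in for such smoothing (not necessarily what the authors did): average each verse's SVM decision value with its neighbors before thresholding, which both pools sparse per-verse evidence and exploits serial correlation.

```python
import numpy as np

def smooth_verse_labels(decision_values, window=5):
    """decision_values: per-verse SVM decision-function scores, in verse order.
    Average each score with its neighbors, then threshold at 0."""
    scores = np.asarray(decision_values, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")   # running mean over neighbors
    return (smoothed > 0).astype(int)                     # 1 = one author, 0 = the other
```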


Page 40

Results on Munged Books

Page 41

Multiple Books

• The same method works if we create a munged book out of k > 2 constituent books.

• BUT, we need to be given the number of components k.

• So we don’t yet know how many components there are, only how best to split into a given number of components.

Page 42

Open Questions

• Can we extend this method so it can also be used for texts that, unlike the Bible, don’t have canonical translations?

• Can we use this method to identify plagiarized sections of a text?