
Page 1

Lecture 7: Clustering and Decomposition of a Document to Authorial Components

Moshe Koppel and Navot Akiva

Page 2

Clustering

• The idea is to define “natural” clusters

• Position in space depends on feature set

• Need to have some notion of what you want in order to choose the right feature types
– e.g., cluster by topic: content words
– e.g., cluster by author: style features

• With many features, clusters aren’t so elegant

Page 3

How Many Clusters?

• Some methods require the number of clusters k to be decided in advance

• Some methods try to optimize k based on the data
– e.g., in the previous example, we would want 3 clusters

• Caution: some optimization criteria are biased towards one giant cluster or many tiny ones

Page 4

Flat Clustering: K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seeds.

Until clustering converges or another stopping criterion is met:
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  For each cluster cj:
    sj = μ(cj)   (update each seed to the centroid of its cluster)
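To make the loop concrete, here is a minimal NumPy sketch of flat K-means (an illustration, not the lecture's code); `X` is assumed to be a float array with one row per document vector.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal flat K-means. X: (n_docs, n_features) float array."""
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial seeds (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each doc goes to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # doc partition unchanged -> converged
        assignment = new_assignment
        # Update step: each seed becomes the centroid (mean) of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids
```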

Page 5

K-Means Example (K=2)

[Figure: K-means animation on a 2-D point set: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

Page 6

Termination conditions

• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.

Page 7

Convergence

• Why should the K-means algorithm ever reach a fixed point?
– A state in which clusters don’t change.

• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
– EM is known to converge.
– Number of iterations could be large.

Page 8

Convergence of K-Means

• Define goodness measure of cluster k as sum of squared distances from cluster centroid:

Gk = Σi ‖di – ck‖²   (sum over all di in cluster k)

G = Σk Gk

• Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
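To spell out that claim (a paraphrase of the standard argument, not text from the slides): during reassignment the centroids are held fixed, and each doc moves to a centroid at least as close as its old one, so no term of G can grow.

```latex
G^{\text{before}}
  = \sum_{i} \lVert d_i - c_{\mathrm{old}(i)} \rVert^2
  \;\ge\; \sum_{i} \lVert d_i - c_{\mathrm{new}(i)} \rVert^2
  = G^{\text{after}},
\qquad \text{since each } d_i \text{ is reassigned to its closest centroid.}
```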

Page 9

Dependence on Seed Choice

• Results depend on random seed selection.
• Some tricks for avoiding bad seeds:
– Choose seeds far from any already chosen
– Try out multiple seed sets

Example (points A-F): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
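A hedged sketch of both tricks, reusing the `kmeans` function sketched above; the farthest-first rule and the use of G to compare restarts are standard choices, not necessarily the exact ones intended here.

```python
import numpy as np

def farthest_first_seeds(X, k, seed=0):
    """Pick each new seed as the doc farthest from all seeds chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]        # first seed: a random doc
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2).min(axis=1)
        chosen.append(int(d.argmax()))          # farthest from every chosen seed
    return X[chosen]

def best_of_restarts(X, k, n_restarts=10):
    """Run K-means from several random seed sets and keep the lowest-G clustering."""
    best, best_G = None, np.inf
    for r in range(n_restarts):
        assignment, centroids = kmeans(X, k, seed=r)   # kmeans as sketched earlier
        G = sum(np.sum((X[assignment == j] - centroids[j]) ** 2) for j in range(k))
        if G < best_G:
            best, best_G = (assignment, centroids), G
    return best
```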

Page 10

A Problem…

• The methods we’ve discussed so far don’t always give very satisfying results.

• Have a look at this example….

Page 11

[Figure from Ng, Jordan & Weiss, “On Spectral Clustering: Analysis and an Algorithm”.]

Page 12

Spectral Methods

• Because of examples like that, spectral methods have been developed.

• Here’s the idea: imagine weighted edges between points, where each weight reflects similarity. Zero out all edges except those connecting each point to its k nearest (highest-weight) neighbors.

• Now make the “cheapest” cuts that result in separate components.

Page 13

[Figure from Ng, Jordan & Weiss, “On Spectral Clustering: Analysis and an Algorithm”.]

Page 14

Spectral Clustering

Properly formulated, the minimum normalized cut (Ncut) problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions:

1. Pre-processing
• Construct a matrix representation of the dataset.

2. Decomposition
• Compute eigenvalues and eigenvectors of the matrix.
• Map each point to a lower-dimensional representation based on one or more eigenvectors.

3. Grouping
• Assign points to two or more clusters, based on the new representation.
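A rough NumPy sketch of those three steps, in the style of Ng, Jordan & Weiss; the kernel, neighborhood size, and use of scikit-learn's KMeans for the final grouping are my choices, not the lecture's.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(X, k, n_neighbors=10, sigma=1.0):
    n = len(X)
    # 1. Pre-processing: similarity matrix, zeroed out except k-nearest-neighbor edges.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-d ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    nearest = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]   # skip self (column 0)
    for i in range(n):
        keep[i, nearest[i]] = True
    W = np.where(keep | keep.T, W, 0.0)                     # symmetric kNN graph

    # 2. Decomposition: normalized Laplacian, bottom-k eigenvectors as an embedding.
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    U = eigvecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # 3. Grouping: ordinary K-means on the embedded points.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```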

Page 15

New Problem: Separating Document Components

• Often documents consist of multiple authorial components.

• Our object is to tease apart the components of a composite document.

Page 16

Basic Idea

• Divide the document into natural chunks (e.g., chapters, paragraphs)

• Vectorize chunks using some feature set

• Cluster the vectors

Page 17

Classic Example: The Bible

• Great historic and cultural interest

• Much prior research on components

• Has been manually tagged in every conceivable way

Page 18

Test Case

Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out.

• Each is presumably the work of a single (primary) author.

• There’s no reason to think it’s the same author.

• They are thematically quite similar, though far from identical.

Page 19

Clustering Jeremiah+Ezekiel

• Chunks = chapters (no sequence info)

• Features = bag of words

• Cluster method = Ncut

• # clusters = 2
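A sketch of that setup with off-the-shelf tools, assuming the chapters arrive as plain-text strings; scikit-learn's SpectralClustering (a normalized-cut-style method) and cosine affinity stand in for the Ncut step here.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_chapters(chapters, n_clusters=2):
    """chapters: list of chapter texts (order is ignored). Returns a cluster label per chapter."""
    # Features = bag of words.
    X = CountVectorizer().fit_transform(chapters)
    # Precompute a similarity (affinity) matrix; cosine is one common choice.
    affinity = cosine_similarity(X)
    # Cluster method = normalized-cut-style spectral clustering, # clusters = 2.
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)
```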

Page 20

Results using all words

            Ezekiel   Jeremiah
Cluster 1     28        29
Cluster 2     20        23

• Not too exciting. We must be picking up thematic or genre-related differences that cross books.
• Let’s try using only function words.

Page 21

Results using function words

            Ezekiel   Jeremiah
Cluster 1     28        34
Cluster 2     20        18

• Not any better.
• Let’s try a new approach.

Page 22

A Better Idea

• Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh).

• It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets.

• Note that words aren’t synonyms; rather, word senses are synonyms. (For example, mateh=staff is a synonym of makel, but mateh=tribe is not.)

Page 23

Automatically Finding Synonyms

• There are various clever methods for identifying synsets, but most are not exact enough for our purpose.

• Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging.

• We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.)

• Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)
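As a toy sketch of the translation trick, under an assumed data format in which each (Hebrew) word sense is paired with the English word it is translated to; the real pipeline depends on careful translations, sense tagging, and the manual clean-up mentioned above.

```python
from collections import defaultdict

def build_synsets(sense_to_english):
    """sense_to_english: dict mapping a word sense, e.g. 'mateh#staff', to the
    English word it is translated as. Senses sharing an English word are grouped
    into a candidate synset (to be manually pruned afterwards)."""
    groups = defaultdict(set)
    for sense, english in sense_to_english.items():
        groups[english].add(sense)
    # Keep only groups offering a real choice between at least two senses.
    return {eng: senses for eng, senses in groups.items() if len(senses) > 1}

# Hypothetical example: makel=staff and mateh=staff end up in the same synset,
# while mateh=tribe does not.
synsets = build_synsets({
    "makel#staff": "staff",
    "mateh#staff": "staff",
    "mateh#tribe": "tribe",
})
```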

Page 24

Synonym Method

• The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right.
– If two docs use the same synonym, they are similar.
– If two docs use opposite synonyms, they are different.
– If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they?

• For measuring similarity, we only consider synsets represented in both docs.
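A hedged sketch of such a measure (my formulation, not necessarily the exact one used): represent each doc by the synonym it uses for each synset, and score agreement only over synsets that appear in both docs.

```python
def synonym_similarity(doc_a, doc_b):
    """doc_a, doc_b: dicts mapping synset id -> the synonym the doc actually uses
    (e.g., {"staff": "makel"}). Only synsets represented in BOTH docs count."""
    shared = set(doc_a) & set(doc_b)
    if not shared:
        return 0.0          # no basis for comparison
    agree = sum(1 for s in shared if doc_a[s] == doc_b[s])
    return agree / len(shared)

# Hypothetical usage: two chapters that both use a staff-word.
sim = synonym_similarity({"staff": "makel", "rod": "shevet"},
                         {"staff": "makel"})
# -> 1.0: they agree on every synset they share.
```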

Page 25

Results using synonyms

            Ezekiel   Jeremiah
Cluster 1      6        46
Cluster 2     41         7

• Now we’re getting somewhere.
• But we’re not done yet…

Page 26

Core of a Cluster

• Some chapters are in a cluster because they really belong there; some just have to be somewhere.

• Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters.
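A sketch of one way to pick such cores (the distance-ratio rule and its threshold are assumptions for illustration; it assumes exactly two clusters).

```python
import numpy as np

def cluster_cores(X, assignment, centroids, ratio=0.8):
    """Keep only chunks clearly closer to their own centroid than to the other one:
    dist(own) < ratio * dist(other). X: chunk vectors; assignment: 0/1 labels."""
    assignment = np.asarray(assignment)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, 2)
    own = d[np.arange(len(X)), assignment]
    other = d[np.arange(len(X)), 1 - assignment]
    return np.where(own < ratio * other)[0]     # indices of core chunks
```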


Page 27

Cluster cores

            Ezekiel   Jeremiah
Cluster 1      2        36
Cluster 2     36         0

Page 28

Cluster cores

            Ezekiel   Jeremiah
Cluster 1      2        36
Cluster 2     36         0

(The two Ezekiel chapters in Cluster 1 are Ezekiel 1 and 10.)

Page 29

Expanding the Core

• Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary.

• In fact, we can use function words as our features.
– Using synonyms will just get us back where we started.
– And besides, function words (FW) are generally very reliable for supervised authorship attribution.
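A sketch of the expansion step with scikit-learn; the function-word list and the choice of a linear SVM are assumptions about the details.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def expand_core(core_texts, core_labels, remaining_texts, function_words):
    """Train an SVM on the core chunks, using counts of function words only,
    then label all remaining chunks with the learned boundary."""
    vec = CountVectorizer(vocabulary=function_words)
    clf = LinearSVC()
    clf.fit(vec.fit_transform(core_texts), core_labels)
    return clf.predict(vec.transform(remaining_texts))
```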

Page 30

SVM expansion of core

            Ezekiel   Jeremiah
Cluster 1      1        52
Cluster 2     47         0

• Two training examples are “misclassed” by the SVM.
• Incredibly, these are Ezekiel 1 and 10, which were part of the Jeremiah core but are on the Ezekiel side of the optimal SVM boundary.
• The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.

Page 31

Decomposing Unsegmented Text

• Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption.

• Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses.

• Can we tease that apart?
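A sketch of how such a composite "Jer-iel" could be generated; the run-length range is an arbitrary assumption.

```python
import random

def munge(verses_a, verses_b, max_run=10, seed=0):
    """Alternately take a random number of verses from each book until both run out.
    Returns the merged verse list and the true author label (0 or 1) for each verse."""
    rng = random.Random(seed)
    merged, labels = [], []
    sources, pos, turn = [verses_a, verses_b], [0, 0], 0
    while pos[0] < len(verses_a) or pos[1] < len(verses_b):
        if pos[turn] < len(sources[turn]):
            run = rng.randint(1, max_run)
            take = sources[turn][pos[turn]:pos[turn] + run]
            merged.extend(take)
            labels.extend([turn] * len(take))
            pos[turn] += run
        turn = 1 - turn    # switch to the other book
    return merged, labels
```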

Page 32

Chunks Might Be Mixed

[Figure: schematic of the munged text divided into chunks, some of which mix the two books.]


Page 34

Decomposing Unsegmented Text

• Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.)

• Now proceed exactly as above, until you have cluster cores.

• With a bit of luck, the chunks in the core will be the “pure” chunks.

Page 35

A Munged Document

[Figure: schematic of the munged document.]

Page 36

Decomposing Unsegmented Text

• As above, use SVM to learn a classifier, using the core chunks as training examples.

• But at the last step, classify individual verses instead of chapters.

• Some smoothing techniques are necessary to:
– Deal with the sparseness of features in individual verses
– Take advantage of serial correlation
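One simple stand-in for such smoothing (not necessarily what the authors did): average each verse's SVM decision value with its neighbors before thresholding, which both pools sparse per-verse evidence and exploits serial correlation.

```python
import numpy as np

def smooth_verse_labels(decision_values, window=5):
    """decision_values: per-verse SVM decision-function scores, in verse order.
    Average each score with its neighbors, then threshold at 0."""
    scores = np.asarray(decision_values, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")   # running mean over neighbors
    return (smoothed > 0).astype(int)                     # 1 = one author, 0 = the other
```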


Page 40

Results on Munged Books

Page 41

Multiple Books

• The same method works if we create a munged book out of k > 2 constituent books.

• BUT, we need to be given the number of components k.

• So we don’t yet know how many components there are, only how best to split into a given number of components.

Page 42

Open Questions

• Can we extend this method so it can also be used for texts that, unlike the Bible, don’t have canonical translations?

• Can we use this method to identify plagiarized sections of a text?