Patrick Juola Duquesne University www.jgaap.com [email protected] Authorship Attribution and Stylometry

Authorship Attribution and Stylometry

Page 1: Authorship Attribution and Stylometry

Patrick Juola, Duquesne University

[email protected]

Authorship Attribution and Stylometry

Page 2: Authorship Attribution and Stylometry

Whodunit?

• Authorship Attribution (aka Stylometry, cf. Authorship Profiling): identifying an author from his/her writings

• Did Shakespeare really write those plays?

• Or was it the Earl of Oxford?

• Or Francis Bacon?

• Or Roger Bacon?

• Or Kevin Bacon?

• &c.

Page 3: Authorship Attribution and Stylometry

More technical definition

• Authorship attribution: inferring the identity of the author of a document by examination.

• Stylometry: inferring properties of the author by examination

• E.g. the author was a male native English speaker aged between 25 and 35 with no college education but with theater training

Page 4: Authorship Attribution and Stylometry

Important problem

• Long history (Book of Judges, “shibboleth”)

• Key to literature

• and to history and journalism

• and teaching (catching cheaters) and law/investigation (Unabomber)

• and psychology (inferring personality from writing) and security and,… and,…

Page 5: Authorship Attribution and Stylometry

Computers are problematic

• Handwriting is easy, anyone can do it.

• Typewriting is still pretty easy if you know what you’re looking for

• But one 12pt Times Roman ‘A’ looks identical to any other.

• What cues to authorship exist?

Page 6: Authorship Attribution and Stylometry

Looking for clues

What is this object?

Page 7: Authorship Attribution and Stylometry

Looking for clues (2)

• How far does light travel in 1/300,000 of a second?

Page 8: Authorship Attribution and Stylometry

Looking for clues (3)

• Where is the dinner fork?

Page 9: Authorship Attribution and Stylometry

Finding clues

The object is a “couch.”

Page 10: Authorship Attribution and Stylometry

Looking for clues (2)

• How far does light travel in 1/300,000 of a second?

• Approximately one kilometer.

• Note that other answers are not wrong, just individual.

• E.g. “kilometre” is a standard spelling

• ‘km’ is the standard abbreviation

• “click” or “k” are commonly understood slang

Page 11: Authorship Attribution and Stylometry

Finding clues (3)

• The dinner fork is to the left of the plate

Page 12: Authorship Attribution and Stylometry

Finding clues (3b)

• The dinner fork is on the immediate left of the plate

Page 13: Authorship Attribution and Stylometry

Another example

• The paradigmatic and systematic utilization of sesquipedalian lexical items can be an informative element of individual and idiosyncratic patterns of linguistic variation

• Or, some people use big words

Page 14: Authorship Attribution and Stylometry

History

• Judges 12:6 Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan, and there fell at that time of the Ephraimites forty and two thousand.

Page 15: Authorship Attribution and Stylometry

The “stylome”

• The underlying theoretical assumption is that language is not completely controllable (i.e. it’s hard to lose your accent)

• Obviously, some parts (e.g. lexicon) are more controllable than others (e.g. accent).

• Van Halteren has coined the term “stylome” to describe these specific individual differences. Others use “fingerprint.”

Page 16: Authorship Attribution and Stylometry

Some early candidates

• Authorial vocabulary may be a stylome.

• You can’t use words you don’t know.

• Can we measure vocabulary size?

• Similarly, average word length may be a stylome (first proposed by De Morgan)

• … But neither of these works especially well.
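To make the two early candidates concrete, here is a minimal Python sketch that measures observed vocabulary size and average word length for a text. The tokenizer and the function names are assumptions made for illustration; they are not taken from any particular stylometry tool.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters, optional internal apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def vocabulary_size(text):
    """Number of distinct word types observed in the text."""
    return len(set(tokenize(text)))

def average_word_length(text):
    """Mean number of characters per word token; 0.0 for an empty text."""
    words = tokenize(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0
```

Note that the observed vocabulary size grows with the length of the sample, so texts of different lengths are not directly comparable on this measure.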

Page 17: Authorship Attribution and Stylometry

Federalist Papers

• Modern stylometry more or less starts with Mosteller and Wallace

• Studied the Federalist Papers using multivariate statistics

• Took frequencies of specific high-frequency function words

• Classified disputed documents as Hamilton or Madison (H/M) based on Bayesian analysis
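The general idea of a Bayesian comparison on function-word frequencies can be sketched as follows. This is a deliberately simplified illustration, not a reconstruction of the Mosteller and Wallace analysis: it assumes each word count follows a Poisson distribution, assumes equal prior odds for the two candidates, and uses a small illustrative word list. All names here are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative marker words only; not the word list used in the original study.
FUNCTION_WORDS = ["upon", "whilst", "while", "on"]

def word_counts(text):
    """Return (count of each function word, total number of tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(tokens)
    return [freq[w] for w in FUNCTION_WORDS], len(tokens)

def estimate_rates(texts, smoothing=0.5):
    """Per-token rate of each function word, smoothed so that no rate is zero.

    Assumes the texts together contain at least one token.
    """
    totals = [0] * len(FUNCTION_WORDS)
    n_tokens = 0
    for text in texts:
        counts, length = word_counts(text)
        totals = [t + c for t, c in zip(totals, counts)]
        n_tokens += length
    return [(t + smoothing) / n_tokens for t in totals]

def log_odds(doc, rates_a, rates_b):
    """Log-likelihood ratio for author A over author B under a Poisson model.

    For a count k with expected value rate * n, the Poisson log-likelihoods
    differ by k * log(rate_a / rate_b) - n * (rate_a - rate_b).
    """
    counts, n = word_counts(doc)
    score = 0.0
    for k, ra, rb in zip(counts, rates_a, rates_b):
        score += k * math.log(ra / rb) - n * (ra - rb)
    return score
```

A positive score favors candidate A and a negative score favors candidate B; with equal priors the score is also the posterior log odds.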

Page 18: Authorship Attribution and Stylometry

Successes and Failures

• M/W results generally confirmed accepted scholarship

• But it’s also a largely artificial problem!

• Federalist Papers have become “standard”

• Other examples have produced noted failures

• E.g. Foster’s attribution of “A Funeral Elegy”

Page 19: Authorship Attribution and Stylometry

Lots of ways to study

• Rudman has suggested that more than 1000 different features have been proposed over the past 100+ years.

• Most “work” in the sense of better than chance.

• But “better than chance” isn’t very good in the real world.

Page 20: Authorship Attribution and Stylometry

The Ur-study

• Find a document, with presumptive author

• Collect uncontroversial corpus of author’s writings

• Collect set of distractor authors, with sample corpora for each author

• Identify something found in author’s writings and test document but not in distractors’

• Publish
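The "identify something" step of this recipe amounts to a set computation. Here is a minimal Python sketch that takes "something" to mean a word, which is an assumption; the function names are hypothetical.

```python
import re

def word_set(text):
    """Set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def candidate_markers(author_docs, test_doc, distractor_docs):
    """Words found in the author's corpus and the test document but in no distractor text."""
    author_words = set().union(*(word_set(d) for d in author_docs))
    distractor_words = set().union(*(word_set(d) for d in distractor_docs))
    return (author_words & word_set(test_doc)) - distractor_words
```

With small corpora some words will pass this filter by coincidence alone, so a non-empty result is weak evidence on its own.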

Page 21: Authorship Attribution and Stylometry

Textual considerations

• First question: How confident are we that we have a valid text to study?

• Issues include corruption, editorial changes, formatting (e.g. running heads), printers’ errors

• Second question: How confident are we of our “uncontroversial” stuff?

• Third question: Do we have the right distractor authors?

Page 22: Authorship Attribution and Stylometry

Technical considerations

• First question: How good is the technique we’re using?

• Second question: Are there representativeness issues involved?

• Third question: Do we have enough data?

• Fourth question: How do we interpret the results?

Page 23: Authorship Attribution and Stylometry

Search for best practices

• First question: How good is the technique we’re using?

• Development of “good” techniques is an open research question.

• … hence JGAAP

Page 24: Authorship Attribution and Stylometry

JGAAP

• Single framework allows comparative testing under controlled conditions

• Modular, object-oriented approach makes extension to new methods easy.

• Simple GUI for ease of use

• Simple 3-phase model under the hood

Page 25: Authorship Attribution and Stylometry

Under the hood

• Canonicization – perform necessary conversions, strip out irrelevant and confusing differences

• Event Set Generation – partition document into “Events”

• Statistical Analysis – k-NN, LDA, SVM, Naïve Bayes, whatever you like….
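The three phases can be pictured as three interchangeable functions composed into one pipeline. The Python sketch below illustrates that structure only; it is not JGAAP's actual API (JGAAP is written in Java), every name is hypothetical, and the Jaccard-overlap analyzer is a placeholder standing in for k-NN, LDA, SVM and the like.

```python
import re

def canonicize(text):
    """Phase 1: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def word_events(text):
    """Phase 2: partition the text into an ordered list of word events."""
    return text.split()

def jaccard_analyzer(unknown_events, known_events):
    """Phase 3 (placeholder): pick the author whose event set overlaps most."""
    unknown = set(unknown_events)

    def similarity(events):
        other = set(events)
        union = unknown | other
        return len(unknown & other) / len(union) if union else 0.0

    return max(known_events, key=lambda author: similarity(known_events[author]))

def run_pipeline(unknown_text, known_texts, canonicizer, event_driver, analyzer):
    """Apply the same phase 1 and 2 to every document, then hand off to phase 3."""
    unknown_events = event_driver(canonicizer(unknown_text))
    known_events = {
        author: event_driver(canonicizer(text))
        for author, text in known_texts.items()
    }
    return analyzer(unknown_events, known_events)
```

Because each phase is passed in as an argument, swapping one canonicizer, event type, or analysis method for another leaves the rest of the pipeline unchanged, which is what makes controlled comparisons possible.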

Page 26: Authorship Attribution and Stylometry

Event Set Generation

• Documents contain “events” (also called “features,” but “events” stresses ordering).

• E.g. words are events

• Make bag-of-words, or bag of word-bigrams

• Properties of words are also events

• POS (part-of-speech) tags, word lengths, frequencies

• Phase II – convert document to (ordered) Event Set.
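A few of the event types named above can be sketched in Python as follows. The function names are hypothetical, and splitting on whitespace is an assumed, deliberately crude tokenization. Part-of-speech events are left out because they would need a tagger.

```python
from collections import Counter

def word_events(text):
    """Words as events, in document order."""
    return text.split()

def word_bigram_events(text):
    """Adjacent word pairs as events, in document order."""
    words = text.split()
    return list(zip(words, words[1:]))

def word_length_events(text):
    """A property of each word (its length) as the event."""
    return [len(w) for w in text.split()]

def bag(events):
    """Discard the ordering and keep only how often each event occurs."""
    return Counter(events)
```

For example, `word_bigram_events("to be or not")` yields the pairs `("to", "be")`, `("be", "or")`, `("or", "not")`, and `bag(...)` of any of these lists gives the corresponding bag-of-events.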

Page 27: Authorship Attribution and Stylometry

Analysis

• Classify Event Sets based on statistical properties.

• Again, many different ways to do this

Page 28: Authorship Attribution and Stylometry

A simple example

• Build histogram of Events based on (normalized) frequencies.

• Convert histogram to vector space by enumerating over elements

• Calculate distances between various histograms using distance formula

• Assign authorship of unknown document to closest document of known authorship
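The four steps above can be sketched in a few lines of Python. The slide does not say which distance formula is meant, so Euclidean distance is an assumption here, and the function names are hypothetical.

```python
import math
from collections import Counter

def normalized_histogram(events):
    """Map each distinct event to its relative frequency in the document."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}

def distance(h1, h2):
    """Euclidean distance between two histograms over the union of their events."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(k, 0.0) - h2.get(k, 0.0)) ** 2 for k in keys))

def nearest_author(unknown_events, known):
    """Return the author of the closest known document.

    `known` is a list of (author, events) pairs, one per document.
    """
    target = normalized_histogram(unknown_events)
    closest = min(known, key=lambda pair: distance(target, normalized_histogram(pair[1])))
    return closest[0]
```

Enumerating over the union of events is what turns the two histograms into vectors in a shared space: an event missing from one document simply gets frequency 0 there.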

Page 29: Authorship Attribution and Stylometry

Getting JGAAP

• JGAAP is available at www.jgaap.com

• Also available at http://www.mathcs.duq.edu/~fa08rutenbar/jgaap.zip

• Requires Java (JDK) 1.5 or better

• Also requires ‘ant’

• Freeware, so we can (and will) be developing during the course

Page 30: Authorship Attribution and Stylometry

Plans for rest of course

• Details of JGAAP

• Details of some of the models JGAAP includes

• Developing new test corpora

• Developing new models based on analysis of test corpora

• Extension to profiling….