Patrick Juola Duquesne University www.jgaap.com [email protected] Authorship Attribution and Stylometry (lecture 5)

Page 1: Authorship Attribution and Stylometry (lecture 5)

Patrick Juola
Duquesne University

[email protected]

Authorship Attribution and Stylometry (lecture 5)

Page 2: Authorship Attribution and Stylometry (lecture 5)

Some Housekeeping

• I’m having trouble with n/w connectivity to Duquesne
• Watch www.mathcs.duq.edu/~juola
• Watch www.jgaap.com
• Will be posting new developments as they occur
• (Will also post NG corpus as requested.)

Page 3: Authorship Attribution and Stylometry (lecture 5)

ESSLLI material

• The Personae corpus is freely available
• BUT the one we’ve developed is not
• If you’re willing to have your essays and information published, contact me: [email protected]
• I will collate and publish via the web

Page 4: Authorship Attribution and Stylometry (lecture 5)

JGAAP material

• JGAAP is freeware; use and enjoy
• New developments to JGAAP are always welcome, subject to licensure (i.e. GPL).
• Wiki at www.jgaap.com is open for
  • Feature requests
  • Bug reports
  • Comments
  • New developers

Page 5: Authorship Attribution and Stylometry (lecture 5)

Interest in a volume?

• Depending upon public interest (i.e. yours), should we pursue the idea of an edited collection of JGAAP-related papers?
• There are a lot of publishers at this summer school
• Contact me if you’re interested

Page 6: Authorship Attribution and Stylometry (lecture 5)

So, now what?

• JGAAP seems to work, but needs more development

• More corpora (and more specialist corpora) are needed

• But if you have an authorship problem to solve NOW…

Page 7: Authorship Attribution and Stylometry (lecture 5)

Top/bottom methods

• Sorry, still having n/w troubles 8-(
• Best canonicizers: unify case, normalize whitespace
• Stripping punctuation hinders
• Best events: word bigrams
• Worst: word lengths
• Best analysis: KL-distance, cosine distance
• Worst: LZW
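The two canonicizers and the event set named above can be sketched in a few lines of Python. This is only an illustration of the idea, not JGAAP's own (Java) implementation; the function names `canonicize` and `word_bigrams` and the sample text are hypothetical, and whitespace tokenization is an assumption.

```python
import re
from collections import Counter

def canonicize(text: str) -> str:
    """Unify case and normalize whitespace; punctuation is deliberately kept."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def word_bigrams(text: str) -> Counter:
    """Count adjacent word pairs after canonicization (whitespace tokens)."""
    words = canonicize(text).split()
    return Counter(zip(words, words[1:]))

# Hypothetical sample text, for illustration only.
sample = "The  cat sat.\nThe cat   slept."
events = word_bigrams(sample)
```

Because punctuation is not stripped, "sat." and "sat" count as different words here, in line with the observation that stripping punctuation hinders.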

Page 8: Authorship Attribution and Stylometry (lecture 5)

But....

• (Show spreadsheet, stupid!)

Page 9: Authorship Attribution and Stylometry (lecture 5)

Testing transference

• 8 AAAC problems are “English”
• 5 are “foreign” (French [x2], Dutch, Latin, Serbian/Slavonic)
• Does English score reflect “foreign” score?

• If so, we have evidence that best practices in English are also best practices in a novel language.

• N.b. evidence is not proof!

Page 10: Authorship Attribution and Stylometry (lecture 5)

2008/9 AAAC data

• 281 different analyses, generally better than AAAC submissions.

• Correlation: r = 0.6680 (cf. 0.594)
• Significance: p < 0.0001 (cf. 0.05)
• Coefficient of determination (r²)

• 45% of variation explained by algorithm performance alone (rather than other factors)
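The step from r = 0.6680 to "45% of variation explained" is just squaring the correlation. A minimal sketch follows; the `pearson_r` helper and the two toy score lists are hypothetical and invented purely to show the calculation (they are not AAAC data), while `r_reported` is the value from the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-method accuracies (NOT real AAAC results).
english_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
foreign_scores = [0.3, 0.3, 0.6, 0.6, 0.8]
r_toy = pearson_r(english_scores, foreign_scores)

# Value reported on the slide; squaring gives the coefficient of determination.
r_reported = 0.6680
r_squared = r_reported ** 2  # about 0.446, i.e. roughly 45%
```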

Page 11: Authorship Attribution and Stylometry (lecture 5)

Transference

• Best practices transfer: a best practice in one environment is likely to be a “good” practice in another
• Turn it around: Do we really expect something terrible in English to magically improve in Polish?

• Caveat : No predictions about “absolute” error rates

• Caveat (2): Assumes language agnosticism

Page 12: Authorship Attribution and Stylometry (lecture 5)

Some other findings

• OCR errors do not materially impact accuracy (Noecker, et al.)

• Asymmetry is a significant factor in distance-based attribution methods (Ryan and Juola)

• Algorithm performance dominates language or data size effects (Juola)
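To see why asymmetry can matter for a distance-based method, consider KL-distance: the value from document A to document B generally differs from the value from B to A. The sketch below assumes add-one smoothing over the shared vocabulary (an assumption made here so that no probability is zero; the slides do not say how this is handled), and the helper names and toy documents are hypothetical.

```python
import math
from collections import Counter

def smoothed_probs(counts, vocab, alpha=1.0):
    """Additively smoothed relative frequencies over a fixed vocabulary."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {e: (counts.get(e, 0) + alpha) / total for e in vocab}

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in nats between two smoothed event histograms."""
    vocab = set(p_counts) | set(q_counts)
    p = smoothed_probs(p_counts, vocab, alpha)
    q = smoothed_probs(q_counts, vocab, alpha)
    return sum(p[e] * math.log(p[e] / q[e]) for e in vocab)

# Hypothetical toy documents, for illustration only.
doc_a = Counter("the the the the the the cat".split())
doc_b = Counter("the cat dog".split())

forward = kl_divergence(doc_a, doc_b)   # A measured against B
backward = kl_divergence(doc_b, doc_a)  # B measured against A; not the same value
```

Which direction is used (unknown document against known author, or the reverse) is therefore a design choice and can change the ranking of candidate authors.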

Page 13: Authorship Attribution and Stylometry (lecture 5)

Other findings (2)

• Cosine distance on large numbers of words outperforms higher-overhead methods on fewer words (Noecker & Juola)

• Characters trump words for Chinese at current word segmentation technology (Zhao & Juola)

• Mosteller-Wallace’s function words are overtuned (in preparation)

Page 14: Authorship Attribution and Stylometry (lecture 5)

Best practices for now

• “Mixture of experts” improves accuracy
• Run multiple analyses, mixing event types (character and word n-grams)
• Cosine distance and KL-distance work well on large event sets
• SVM works well on small event sets
• Current leader: KL-distance (max) on word bigrams
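One way to picture the "mixture of experts" advice is to run a nearest-author analysis once per event type and let the analyses vote. The sketch below is an assumption-laden illustration, not JGAAP's procedure: combining by simple majority vote is assumed here (the slides do not say how experts are combined), cosine distance stands in for any analysis method, and all names and toy texts are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts after case and whitespace normalization."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n=2):
    """Word n-gram counts over whitespace tokens."""
    words = text.lower().split()
    return Counter(zip(*(words[i:] for i in range(n))))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two event histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def nearest_author(unknown, known, extract):
    """Author whose known text is closest to the unknown text."""
    target = extract(unknown)
    return min(known, key=lambda author: cosine_distance(target, extract(known[author])))

def mixture_of_experts(unknown, known, extractors):
    """Majority vote over one nearest-author analysis per event type (assumed rule)."""
    votes = Counter(nearest_author(unknown, known, ex) for ex in extractors)
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
known = {
    "author_a": "the cat sat on the mat and the cat slept on the mat",
    "author_b": "a dog ran in a park while a dog barked in a park",
}
unknown = "the cat sat on the mat"
experts = [char_ngrams, word_ngrams, lambda t: word_ngrams(t, 1)]
guess = mixture_of_experts(unknown, known, experts)
```

Using an odd number of experts avoids ties between two candidate authors; with more candidates a tie-breaking rule would be needed.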

Page 15: Authorship Attribution and Stylometry (lecture 5)

Future extensions

• AAAC corpus too small to distinguish among 20,000 methods (testing continuing, though)

• Add more methods to JGAAP, hopefully solicited from community

• Continue to develop/publish “best practices”

Page 16: Authorship Attribution and Stylometry (lecture 5)

• Merci
• Arigato
• Спасибо
• Danke
• Gracias
• Teşekkür ederim
• Dank U

Tak!