Authorship Attribution and Stylometry (lecture 5)
Patrick Juola
Duquesne University
www.jgaap.com [email protected]
Some Housekeeping
• I’m having trouble with network connectivity to Duquesne
• Watch www.mathcs.duq.edu/~juola
• Watch www.jgaap.com
• Will be posting new developments as they occur
• (Will also post NG corpus as requested.)
ESSLLI material
• The Personae corpus is freely available
• BUT the one we’ve developed is not
• If you’re willing to have your essays and information published, contact me
• [email protected]
• I will collate and publish via the web
JGAAP material
• JGAAP is freeware; use and enjoy
• New developments to JGAAP are always welcome, subject to licensure (i.e. GPL).
• Wiki at www.jgaap.com is open for
  • Feature requests
  • Bug reports
  • Comments
  • New developers
Interest in a volume?
• Depending upon public interest (i.e. yours), should we pursue the idea of an edited collection of JGAAP-related papers?
• There are a lot of publishers at this summer school
• Contact me if you’re interested
So, now what?
• JGAAP seems to work, but needs more development
• More corpora (and more specialist corpora) are needed
• But if you have an authorship problem to solve NOW…
Top/bottom methods
• Sorry, still having network troubles 8-(
• Best canonicizers : unify case, normalize whitespace
• Stripping punctuation hinders
• Best events : word bigrams
• Worst : word lengths
• Best analysis : KL-distance, cosine distance
• Worst : LZW
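The pipeline named on this slide (canonicize, extract word bigrams, compare by cosine distance) can be sketched in a few lines of Python. This is a minimal illustration, not JGAAP's code: the function names are hypothetical, and nearest-neighbour attribution over one sample per author is an assumption made for brevity.

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Unify case and normalize whitespace; punctuation is left in place."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Count adjacent word pairs (the 'events') in the canonicized text."""
    words = canonicize(text).split(" ")
    return Counter(zip(words, words[1:]))


def cosine_distance(a, b):
    """1 - cosine similarity between two event-count histograms."""
    dot = sum(count * b[event] for event, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no events to compare: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def attribute(unknown, known_by_author):
    """Return the author whose known sample is nearest to the unknown text.

    known_by_author is a hypothetical dict mapping author name -> sample text.
    """
    target = word_bigrams(unknown)
    return min(
        known_by_author,
        key=lambda author: cosine_distance(
            target, word_bigrams(known_by_author[author])
        ),
    )
```

Calling `attribute(unknown_text, {"Author A": sample_a, "Author B": sample_b})` returns the name of the nearest candidate. Swapping `word_bigrams` for another event extractor, or `cosine_distance` for another measure, changes one stage without touching the others.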
But....
• (Show spreadsheet, stupid!)
Testing transference
• 8 AAAC problems are “English”
• 5 are “foreign” (French [x2], Dutch, Latin, Serbian/Slavonic)
• Does English score reflect “foreign” score?
• If so, have evidence that best practices in English are also best practices in novel language.
• N.b. evidence is not proof!
2008/9 AAAC data
• 281 different analyses, generally better than AAAC submissions.
• Correlation: r = 0.6680 (cf. 0.594)
• Significance: p < 0.0001 (cf. 0.05)
• Coefficient of determination (r²)
  • 45% of variation explained by algorithm performance alone (rather than other factors)
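The 45% figure is simply the square of the reported correlation. The sketch below shows the Pearson computation and that squaring step. The two score lists are made-up toy values for illustration only, not AAAC results, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-method accuracies (invented, NOT AAAC data): one entry per
# analysis method, scored on the English and on the "foreign" problems.
english = [0.2, 0.4, 0.6, 0.8]
foreign = [0.3, 0.35, 0.7, 0.65]
r_toy = pearson_r(english, foreign)

# Squaring the r reported on the slide gives the coefficient of
# determination: 0.6680 ** 2 is about 0.446, i.e. roughly 45%.
r_squared_reported = 0.6680 ** 2
```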
Transference
• Best practices transfer: a best practice in one environment is likely to be a “good” practice in another
• Turn it around : Do we really expect something terrible in English to magically improve in Polish?
• Caveat : No predictions about “absolute” error rates
• Caveat(2) : Assumes language agnosticism
Some other findings
• OCR errors do not materially impact accuracy (Noecker, et al.)
• Asymmetry is a significant factor in distance-based attribution methods (Ryan and Juola)
• Algorithm performance dominates language or data size effects (Juola)
Other findings (2)
• Cosine distance on large numbers of words outperforms higher-overhead methods on fewer words (Noecker & Juola)
• Characters trump words for Chinese at current word segmentation technology (Zhao & Juola)
• Mosteller-Wallace’s function words are overtuned (in preparation)
Best practices for now
• “Mixture of experts” improves accuracy
• Run multiple analyses, mixing event types (character and word n-grams)
• Cosine distance and KL-distance work well on large event sets
• SVM works well on small event sets
• Current leader : KL-distance (max) on word bigrams
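Two of these recommendations lend themselves to a short sketch: a KL-based distance and a simple "mixture of experts" vote. The slides do not define "(max)"; the code below assumes it means taking the larger of the two directed divergences, which also removes the asymmetry of plain KL. The add-alpha smoothing, the majority-vote combiner, and all function names are likewise illustrative assumptions, not JGAAP's implementation.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """D(P || Q) over the union vocabulary, with add-alpha smoothing.

    Smoothing (an assumption here) keeps the divergence finite when an
    event appears in one histogram but not the other.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    total = 0.0
    for event in vocab:
        p = (p_counts.get(event, 0) + alpha) / p_total
        q = (q_counts.get(event, 0) + alpha) / q_total
        total += p * math.log(p / q)
    return total


def kl_max(a, b):
    """Symmetrized distance: the larger of D(a || b) and D(b || a).

    This reading of 'KL-distance (max)' is an assumption.
    """
    return max(kl_divergence(a, b), kl_divergence(b, a))


def majority_vote(votes):
    """Combine the authors named by several analyses; ties go to the
    author that appears first in the list of votes."""
    return Counter(votes).most_common(1)[0][0]
```

In use, each analysis (for example, word bigrams with `kl_max`, character n-grams with cosine distance) would name its nearest author, and `majority_vote` would pick the final answer from that list.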
Future extensions
• AAAC corpus too small to distinguish among 20,000 methods (testing continuing, though)
• Add more methods to JGAAP, hopefully solicited from community
• Continue to develop/publish “best practices”
• Merci
• Arigato
• Спасибо
• Danke
• Gracias
• Teşekkür ederim
• Dank U
• Tak!