Computational Stylometry
Ben Verhoeven
CLiPS, University of Antwerp
Guest lecture at Université Libre de Bruxelles, 8 May 2015
Text mining
Three layers of information in text
I Objective
  I Facts, concepts, characteristics of concepts, relations between concepts, . . .
  I Who does what, where, how and why?
I Subjective
  I Opinion, sentiment, emotion, . . .
  I Who believes what about what?
I Metadata - Profile
  I Age, gender, region, . . .
  I What do we know about the author?
Example
Objective
Who? Ed Miliband
What? Steve Coogan backs Labour this election
Subjective
What? Really great that SC backs Labour
Who believes this? Ed Miliband
Example
Profile
Who? Ed Miliband
Age? 40+
Gender? Male
Personality? weird?
Education? High
Sentiment Analysis
Commercial applications
Semantria - https://semantria.com/demo
Open-source implementations
Pattern - http://www.clips.ua.ac.be/pages/pattern
NLTK - http://www.nltk.org
Stylometry
The quantitative study of stylistic characteristics of a text
Writing style
A combination of invariant and unconscious decisions in language production on all linguistic levels, uniquely associated with specific authors or groups of authors
→ Human Stylome Hypothesis (Van Halteren et al. 2005)
Federalist papers
I Collection of 85 essays from the 1780s in favour of the new US constitution
I Written by Hamilton, Madison (& Jay)
I Disputed authorship
Federalist papers
Traditional stylometry
I Guesswork
I Odd verbs
I Checklist of conspicuous features
I But: schools, workshops, imitation, . . .
Mosteller & Wallace (1964)
I Quantitative
I Inconspicuous features
I Functors
Experiment
Count the number of letters ‘f’ on the next slide.
Experiment
Finished files are the result
of years of scientific study
combined with the experience
of many years.
Experiment
How many were there?
Experiment
[F]inished [f]iles are the result
o[f] years o[f] scienti[f]ic study
combined with the experience
o[f] many years.
Experiment
Which text is on the following slide?
Experiment
Experiment
Difficult error spotting . . .
Does our brain process little words differently?
And is this important?
Functors
I ↔ content words
I also called function words
I words or morphemes with little lexical meaning; they serve grammatical functions
I the, and, they, she, on, at, . . .
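One way to turn functors into measurable features is to compute their relative frequencies in a text. The following is a minimal sketch, not the method of any study cited here: the short `FUNCTORS` list and the regular-expression tokenizer are illustrative assumptions, and real studies use far larger function word lists.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny functor list (assumption for illustration).
FUNCTORS = ["the", "and", "they", "she", "on", "at", "of", "to", "a", "in"]

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def functor_profile(text, functors=FUNCTORS):
    """Return the relative frequency of each functor in `text`."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in functors}

profile = functor_profile("The cat sat on the mat and the dog sat at the door.")
print(profile["the"])  # share of all tokens that are "the"
```

Because the profile has the same keys for every text, profiles of different authors can be compared directly or fed to a classifier as a feature vector.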
Measuring unconscious language decisions
Can we measure them?
How would we measure them?
What do they mean?
Functors
Use of pronouns reveals personality and mental state of the author
I More ‘we’ than ‘I’ after 9/11 in USA
I More use of pronouns when in depression
I www.analyzewords.com
Computational Stylometry
Authorship attribution & verification
I Attribution - attribute text to one of limited set of authors
I Verification - is unknown text written by given author?
Author profiling
I Age
I Gender
I Location
I Personality
I Education
I Ideology
I Mental health
Authorship attribution
Digital humanities
The study of arts and human sciences by use of digital methods.
Case studies
I Federalist Papers
I Unabomber
I Robert Galbraith = J.K. Rowling
I Dutch medieval writers
Unmasking the Unabomber
I Theodore Kaczynski
I Former assistant professor of mathematics at Berkeley
I Mail bombs against universities and airlines from the late 1970s to the mid-1990s
I Wrote a manifesto of 35,000 words
I Word use recognised by family member
Robert Galbraith
I Crime writer, debut in 2013
I Stylometric investigation by Patrick Juola confirms gossip started by lawyer’s wife
I Pseudonym of J.K. Rowling
Medieval Dutch literature
Case: Spiegel Historiael
I Gigantic chronicle in rhymed verse
I History from the Creation until around 1316
I Three authors
  I Jacob Van Maerlant
  I Filip Utenbroeke
  I Lodewijk van Velthem
Medieval Dutch literature
Problems
I Few texts in bad shape
I Spelling variation
I Copyists changing the text, making errors
Medieval Dutch literature
Solution? Rhyme words
Fraeye historie ende al waer
Mach ic v tellen hoort naer
Het was op enen auontstont
Dat karel slapen begonde
Tengelem op den rijn
Dlant was alle gader sijn.
Hi was keyser ende coninc mede.
Hoort hier wonder ende waerhede
Wat den coninc daer gheuel
Dat weten noch die menige wel
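Extracting rhyme words can be sketched in a few lines. The sketch below assumes one verse per input line and a couplet rhyme scheme (lines 1-2, 3-4, and so on); both are assumptions made for illustration, and the function names are hypothetical.

```python
VERSES = [
    "Fraeye historie ende al waer",
    "Mach ic v tellen hoort naer",
    "Het was op enen auontstont",
    "Dat karel slapen begonde",
    "Tengelem op den rijn",
    "Dlant was alle gader sijn.",
    "Hi was keyser ende coninc mede.",
    "Hoort hier wonder ende waerhede",
    "Wat den coninc daer gheuel",
    "Dat weten noch die menige wel",
]

def rhyme_word(line):
    """Return the last word of a verse, lowercased, without punctuation."""
    words = [w.strip(".,;:!?").lower() for w in line.split()]
    words = [w for w in words if w]
    return words[-1] if words else None

def rhyme_pairs(lines):
    """Pair the rhyme words of consecutive verses (assumed couplets)."""
    rhymes = [rhyme_word(line) for line in lines]
    return list(zip(rhymes[0::2], rhymes[1::2]))

for pair in rhyme_pairs(VERSES):
    print(pair)
```

Restricting the analysis to the final word of each verse discards most of the text but keeps the positions a copyist could not easily alter without breaking the rhyme.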
Medieval Dutch literature
Maerlant and Utenbroeke
I Knew each other well
I Worked closely together
I Hard to distinguish using machine learning?
Medieval literature
Documentary on Hildegard of Bingen
https://vimeo.com/70881172
Text categorization
Given a text and a predefined set of classes, predict the class the text belongs to
Many applications
I Spam
I News article topics
I Stylometry
I . . .
Methods
I Handcrafted approach
I Machine learning (since 1992)
Text categorization
Different aspects
I Class representation
I Document representation (features)
I Supervised machine learning method
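The three aspects (classes, features, learner) can be illustrated with a toy classifier. The sketch below uses a multinomial Naive Bayes learner with add-one smoothing over word counts; this learner, the class name `NaiveBayes`, and the four training texts are assumptions chosen for brevity, not the setup used in the studies discussed in this lecture.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # class representation
        self.word_counts = defaultdict(Counter)    # per-class word counts
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)           # log prior
            for word in doc.lower().split():
                if word not in self.vocab:
                    continue                       # ignore unseen words
                # add-one (Laplace) smoothed log likelihood
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy data: binary spam vs. 'ham' classification.
train_docs = ["win money now", "cheap money offer",
              "meeting at noon", "lunch meeting tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayes().fit(train_docs, train_labels)
print(clf.predict("cheap money"))       # spam
print(clf.predict("meeting tomorrow"))  # ham
```

Swapping the word counts for stylometric features changes the document representation while the class representation and the learner stay the same.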
Class representation
Binary
I Spam vs. ‘ham’
I Man vs. woman writer
I Maerlant vs. Utenbroeke
Multiclass
I Genres: blogs, news, jokes, novels, . . .
I Topics of reviews: books, phones, movies, hotels, . . .
I Age groups: 10s, 20s, 30s, 40s
Brief catalogue of features for stylometry
Numeric
I Complexity, readability
I Vocabulary richness
  I Type-token ratio
  I Hapax legomena
I Averages or distributions of
  I Syllable length
  I Word length
  I Sentence length
Character-level
I Letter frequency
I Punctuation
I Spelling errors
I Character n-grams
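Several of the numeric features above are simple to compute. The following sketch assumes a naive tokenizer (runs of word characters) and a naive sentence splitter (split on `.`, `!`, `?`); the function name and the choice to normalise the hapax count by the number of tokens are illustrative assumptions.

```python
import re
from collections import Counter

def stylometric_features(text):
    """Compute a few numeric style features for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens) or 1  # avoid division by zero on empty input
    return {
        # distinct word forms (types) divided by running words (tokens)
        "type_token_ratio": len(counts) / n_tokens,
        # hapax legomena: word forms that occur exactly once
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n_tokens,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        "avg_sentence_length": len(tokens) / (len(sentences) or 1),
    }

features = stylometric_features("The cat sat. The dog sat too.")
for name, value in features.items():
    print(name, round(value, 3))
```

Type-token ratio depends strongly on text length, so it is only comparable between texts of similar size.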
Brief catalogue of features for stylometry
Word-level
I Word n-grams
I Special dictionaries
I Morphology: prefixes and suffixes
Syntax
I Part-of-speech distributions
I Frequencies of syntactic chunks (e.g. NP = Det + Adj + N)
. . .
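Character and word n-grams from this catalogue can be extracted with a sliding window. This is a minimal sketch; the function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count all overlapping character sequences of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n=2):
    """Count all overlapping sequences of n whitespace-separated words."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(char_ngrams("banana", 3).most_common(1))
print(word_ngrams("the man hit the woman", 2))
```

Character n-grams need no tokenizer at all, which makes them attractive for noisy text such as chat language or texts with heavy spelling variation.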
Metadata reflected in → language
Document variation
I Topic
I Register, genre
I Diachrony
Individual variation
I Age
I Gender
I Region
I Personality
I Education
I Ideology
I Mental Health
Language variation
I Spelling, punctuation
I Word choice
I Sentence structure
I Thematics, tone
I Text structure
Language: problematic for machine learning
Ambiguity at multiple levels
I Lexical
I Structural
Naive assumptions are false
I Word order doesn’t matter ?
I Features are independent ?
Ambiguity
Lexical
I Polysemy: word with different meanings (“book”)
I Context matters (“fall terribly” vs. “terribly interesting”)
Structural
I Metaphor (“Her smile was like sunshine”)
I Implication (“A bus!” could mean “Watch out!”)
I Co-reference (“I am here”)
Naive assumptions in machine learning are false for language
Word order matters
I The woman hit the man.
I The man hit the woman.
Word occurrences are not independent
I Syntax has rules, certain word forms demand others (Determiner + Noun)
I Phrases, proverbs (“Kind regards”, “Raining cats and dogs”)
I Whole idea of distributional semantics: a word can be understood by the company it keeps
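The word order point can be checked directly: a bag-of-words representation cannot tell the two example sentences apart, while word bigrams can. A minimal sketch with hypothetical helper names:

```python
from collections import Counter

def bag_of_words(sentence):
    """Unordered word counts; word order is thrown away."""
    return Counter(sentence.lower().rstrip(".").split())

def bigrams(sentence):
    """Ordered pairs of adjacent words; keeps local word order."""
    tokens = sentence.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

a = "The woman hit the man."
b = "The man hit the woman."

print(bag_of_words(a) == bag_of_words(b))  # True: identical bags
print(bigrams(a) == bigrams(b))            # False: order differs
```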
Experiment: gender recognition
Is the author male or female?
omdat je t zo netjes vraagt.. en omdat je mijn tekst zo goed vond,een spoor achterlaten op je profiel....
dezze is gans moooiii meisje ik mis je verschrikkelijk hard we moeteeeens afspreken iloveyou
niiice (’; srry datk ni antwoorde maar pc ga traag ;)
o; ge moogt wel ni vloeken ea o; & ontkenning (aa) das ni goed eao; x
Experiment: gender recognition
Is the author male or female?
male
omdat je t zo netjes vraagt.. en omdat je mijn tekst zo goed vond,een spoor achterlaten op je profiel....
female
dezze is gans moooiii meisje ik mis je verschrikkelijk hard we moeteeeens afspreken iloveyou
female
niiice (’; srry datk ni antwoorde maar pc ga traag ;)
male
o; ge moogt wel ni vloeken ea o; & ontkenning (aa) das ni goed eao; x
Experiment: age recognition
Is the author younger than 16 or older than 25?
thx vr u notaaa ˆˆ ik ben toch geweldig in de plaaats van superhahaha das een grapje he xd?
kan alles tege uu zeggge ;s gy kom maandag bx mx slape ea ;otaniaa ga ook kome ;o & da ga leuuk worde enall x
erg knappe en sexy vrouw ben je!!! ge prikkelt me!!! val je nog opjongere mannen ??? ;-)
mercikes! hij is echt ongelofelijk mooi en zo lief, echt niet te doen!het is een echt beertje [hug]
Experiment: age recognition
Is the author younger than 16 or older than 25?
−16
thx vr u notaaa ˆˆ ik ben toch geweldig in de plaaats van superhahaha das een grapje he xd?
+25
kan alles tege uu zeggge ;s gy kom maandag bx mx slape ea ;otaniaa ga ook kome ;o & da ga leuuk worde enall x
+25
erg knappe en sexy vrouw ben je!!! ge prikkelt me!!! val je nog opjongere mannen ??? ;-)
+25
mercikes! hij is echt ongelofelijk mooi en zo lief, echt niet te doen!het is een echt beertje [hug]
Age and gender recognition
Dutch
I Chat language on Netlog
  I Age: −16 vs. +25 achieves 82% accuracy
  I Gender: male vs. female achieves 70% accuracy
English
I Argamon & Koppel (2003)
  I Age: 10s vs. 20s vs. 30s achieves 75% accuracy
  I Gender: male vs. female achieves 80% accuracy
Gender recognition: Explanation
Dutch chat language
Male
I dame
I knappe
I vrouw
I maat
I profiel
I den
I tege
I :-)
Female
I kei
I italics
I zonder
I iloveyou
I twee
I zijn
I ll
I ;-)
Gender recognition: Explanation
British National Corpus
I Use of pronouns (more by women) and certain types of noun modification (more by men)
  I ‘Male’ words: a, the, that, these, one, two, more, some
  I ‘Female’ words: I, you, she, her, their, myself, yourself, herself
I More ‘relational’ language by women, more ‘informative/rational’ language by men
I Even in formal language (non-fiction)
Gender recognition: by content
Personality recognition
What is personality?
I “individual differences among people in behavior patterns, cognition and emotion” (Mischel, Shoda & Smith, 2004)
I Personality explains 35% of variance in life satisfaction
  I Compare: income (4%), employment (4%), marital status (1-4%)
I Personality changes
I Reflected in language use?
Personality traits
I Personality can be broken down into components
I Different typologies
  I Big Five (OCEAN)
  I Myers-Briggs Type Indicator (MBTI)
Personality recognition
Big Five (OCEAN)
I Openness to experience
  I Inventive/curious vs. consistent/cautious
I Conscientiousness
  I Efficient/organized vs. easy-going/careless
I Extraversion
  I Outgoing/energetic vs. solitary/reserved
I Agreeableness
  I Friendly/compassionate vs. analytical/detached
I Neuroticism (emotional stability)
  I Sensitive/nervous vs. secure/confident
Do the test at http://www.outofservice.com/bigfive/
Personality recognition
Interesting method - LIWC
I Linguistic Inquiry and Word Count
I James W. Pennebaker
I 80 categories of words related to personality and mental state
  I Syntactic categories: e.g. self-reference, articles, . . .
  I Emotional categories: e.g. positive emotion, anxiety, . . .
  I Thematic categories: e.g. job, leisure, family, . . .
Personality recognition
Interesting corpora
I Essays dataset (Pennebaker, later Mairesse)
  I English stream-of-consciousness texts by students
I myPersonality (Stillwell & Kosinski)
  I Large-scale data collection through Facebook app, many languages
I Personae (Luyckx & Daelemans)
  I Dutch essays, written by students
I CSI Corpus (Verhoeven & Daelemans)
  I Dutch papers, essays and reviews written by students
Results
I 55-65% for most traits
I Better than humans
Experiment: Personality recognition
Which text was written by an extravert/introvert author?
Hey there, if you are watching this movie you probably all ready know what Circle Lens are. For those of you that don’t I will just let you know really quick. Um, Circle Lens is a type of contact lens, um, that make your iris appear larger. So they’re really good for cross playing or giving a dolly effect. They also help with helping make somebody look, like, more awake. And, um, they’re colored lens usually. They come in, like, black, brown, but like, green, blue, all different colors.
This is how it is in my school. Okay, here’s an example. All right, um, when they see two guys are gay, they’re together, they’re like no, ew, no. No, no that – that doesn’t go together - - you know, two guys, no. two sticks, no. It just doesn’t work like . But when they see two girls, they’re like, get it on. And I don’t get these people. I’ve never seen someone say like, oh, you’re so homosexual or you’re so lesbian or you’re such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside
Experiment: Personality recognition
Which text was written by an extravert/introvert author?
Extravert
Hey there, if you are watching this movie you probably all ready know what Circle Lens are. For those of you that don’t I will just let you know really quick. Um, Circle Lens is a type of contact lens, um, that make your iris appear larger. So they’re really good for cross playing or giving a dolly effect. They also help with helping make somebody look, like, more awake. And, um, they’re colored lens usually. They come in, like, black, brown, but like, green, blue, all different colors.
Introvert
This is how it is in my school. Okay, here’s an example. All right, um, when they see two guys are gay, they’re together, they’re like no, ew, no. No, no that – that doesn’t go together - - you know, two guys, no. two sticks, no. It just doesn’t work like . But when they see two girls, they’re like, get it on. And I don’t get these people. I’ve never seen someone say like, oh, you’re so homosexual or you’re so lesbian or you’re such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside
Region recognition
I Especially interesting on chat data where regional language variation is very visible
Mental health recognition
Mental illnesses such as Alzheimer’s disease or schizophrenia might be discovered by diachronically looking at writing style changes
Possible indicators
I Reduced vocabulary size
I Increased repetition
I More vague words
Case study
I Agatha Christie, British mystery writer
I Never diagnosed, but believed to have Alzheimer’s
I Investigated by Hirst, Le & Lancashire
Mental health recognition
Repetition of content words within ten lemmatized words
She got near the door. She stopped suddenly, then walked on. It looked as though something like a bundle of clothes was lying near the door. Something they’d pulled out of Mathilde and not thought to look at, Tuppence wondered. She quickened her pace, almost running. When she got near the door she stopped suddenly. It was not a bundle of old clothes. The clothes were old enough, and so was the body that wore them. Tuppence bent over and then stood up again, steadied herself with a hand on the door.
(Agatha Christie, Postern of Fate)
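A repetition measure of this kind can be approximated as follows. This is a rough sketch under stated assumptions, not the procedure of Hirst, Le & Lancashire: it uses lowercased surface forms instead of lemmas, a small hypothetical stopword list to separate content words from functors, and a window counted over all preceding tokens.

```python
import re

# Hypothetical stopword list (assumption); everything else counts as content.
STOPWORDS = {"the", "a", "an", "she", "it", "they", "her", "them", "of",
             "on", "at", "to", "with", "and", "then", "was", "were", "not"}

def repetition_rate(text, window=10, stopwords=STOPWORDS):
    """Share of content words that already occurred in the previous
    `window` tokens (surface forms, no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    content = [(i, t) for i, t in enumerate(tokens) if t not in stopwords]
    repeats = 0
    for i, token in content:
        if token in tokens[max(0, i - window):i]:
            repeats += 1
    return repeats / (len(content) or 1)

sample = "She got near the door. She stopped near the door."
print(repetition_rate(sample))            # repeats within ten tokens
print(repetition_rate(sample, window=2))  # a narrower window finds none
```

Tracking this rate across an author's books in order of writing gives the diachronic view described above; a lemmatizer would be needed to match the measure on the slide more closely.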
Mental health recognition
The Nun Study
I Life-long diaries of nuns of Notre Dame congregation in Minnesota (Kemper et al., 2001)
I Measure scores for
  I Grammatical complexity
  I Idea density: number of distinct ideas per 10 words
I Results
  I Nuns who developed Alzheimer’s disease (AD) initially score lower than non-AD nuns
  I AD scores decline at a faster rate
I Possible explanation
  I Early-life language ability can predict risk of dementia
Ideology detection
Task
I Predict the cultural or ideological differences between textual sources
Possible use cases
I Find cultural differences between Western and local sources in African election (Pollak, 2008)
I Can we distinguish left-wing from right-wing politicians by their social media writing?
I Can we distinguish left-wing from right-wing newspapers by their writing?
Political opinion mining
Politieke Barometer
I Track mentions of politicians and parties on Twitter
I Analyse sentiment of these tweets
I Try to predict outcome of elections
www.politiekebarometer.be
Applications
Marketing
I TextGain
Text forensics
I Daphne & AMiCA
I Adversarial stylometry
I Deception detection
Stylometry for marketing purposes
Demographic market research
I Who says what about your product?
  I Are young educated women critical of it?
Demographic marketing
I Aim your advertising at specific groups of people
  I Google and Facebook are already doing this, because they just have all your personal data
  I e.g. pregnancy advertising
  I http://mashable.com/2014/04/26/big-data-pregnancy/
Text forensics
Daphne
I Defending Against Paedophiles in Heterogeneous Network Environments
I Predict age and gender of user
I Compare predicted with profile information
I Suspect if they don’t match
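The compare-and-flag step can be sketched as below. The `Profile` type, its field names, and the label values are hypothetical, and the predicted labels are assumed to come from age and gender classifiers that are not shown; this is not the Daphne implementation.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Self-reported profile information (hypothetical fields)."""
    stated_age_group: str  # e.g. "-16" or "+25"
    stated_gender: str     # e.g. "male" or "female"

def is_suspect(profile, predicted_age_group, predicted_gender):
    """Flag a user when a prediction contradicts the stated profile."""
    return (profile.stated_age_group != predicted_age_group
            or profile.stated_gender != predicted_gender)

user = Profile(stated_age_group="-16", stated_gender="female")
print(is_suspect(user, "+25", "female"))  # True: age does not match
print(is_suspect(user, "-16", "female"))  # False: everything matches
```

With classifier accuracies such as the 82% (age) and 70% (gender) reported for Netlog chat, a mismatch is best treated as a reason for closer inspection rather than as evidence in itself.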
AMiCA
I Automatic Monitoring in Cyberspace Applications
I Cyberbullying detection
  I Children of different ages find different things offensive
  I Personality may have an influence on the way people bully
Text forensics
Adversarial stylometry
I Style = beyond conscious control?
I Can you make your style unrecognisable? (Yes.)
  I Machine translation (bad idea, but works)
  I Obfuscation: try to cover up your own writing style
  I Imitation: try to pretend to be someone else (pastiche, fanfiction)
I Context of cyberpaedophilia
  I Are older people recognisable when they pretend to be younger?
Text forensics
Deception detection
I Problem: fake reviews
  I Positive by owner/producer
  I Negative by competitor
I Let’s do the test
Experiment: real or fake review?
I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.
My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn’t ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.
Experiment: real or fake review?
True
I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable. The location is perfect within walking distance to all of the great sights and restaurants. Highly recommend to both business travellers and couples.
Fake
My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!! The area of the hotel is great, since I love to shop I couldn’t ask for more!! We will definatly be back to Chicago and we will for sure be back to the James Chicago.
Deception detection
Cornell University Study
I Positive reviews
  I Truthful reviews from TripAdvisor
  I Deceptive reviews from Mechanical Turk
I Features
  I LIWC, word unigrams and bigrams
I Results
  I Human judges fail to make the distinction
  I Classifier is 90% accurate
  I Deceptive language is imaginative and narrative rather than informative and contains more superlatives
Deception detection
CLiPS Stylometry Investigation Corpus
I Positive and negative reviews
  I Same authors
  I Deceptive reviews about fictional products from same categories
I Features
  I Word unigrams
  I Without domain-specific words (product names)
I Results