Signals from Text: Sentiment, Intent, Emotion, Deception
Stephen Pulman
TheySay Ltd, www.theysay.io, and Dept. of Computer Science, Oxford University. [email protected]
March 9, 2017
Text Analytics
Overview
Sentiment analysis proper: classify text into positive, negative or neutral.
(Future intent, risk, speculation...)
(Demographic profile: age, gender, politics, religion...)
Deception: can we tell if someone is lying?
Emotion: joy, sadness, fear, anger...
Some example application areas.
What tools do we use?
Linguistically based pattern matching (information extraction) e.g. <PERSON> resign/be-sacked/fired/moved-from <COMPANY-ROLE>
Machine learning methods: train a classifier using annotated data: Naive Bayes, Support Vector Machine, Averaged Perceptron, Neural Networks, etc.
Currently fashionable: Deep Learning methods - Convolutional Neural Nets, Long Short-Term Memory models, etc.
Choice usually depends on the availability of annotated data: expensive and time-consuming to acquire.
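The "train a classifier on annotated data" step can be sketched with a tiny hand-rolled Naive Bayes; the toy corpus and its labels below are invented for illustration:

```python
import math
from collections import Counter

# Toy annotated corpus (invented examples standing in for real labelled data)
train = [
    ("one of the best bonds in the history of the franchise", "pos"),
    ("a great film with a superb cast", "pos"),
    ("abounds with bum notes and unfortunate compromises", "neg"),
    ("a dull film with a weak plot", "neg"),
]

def train_nb(data):
    """Estimate class priors and per-class word counts for Naive Bayes."""
    class_counts = Counter(label for _, label in data)
    word_counts = {c: Counter() for c in class_counts}
    for text, label in data:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def classify(text, class_counts, word_counts, vocab):
    """Pick the class with the highest log-probability, add-one smoothed."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c, n in class_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(train)
print(classify("the best cast", *model))  # pos
```

Real systems use far larger corpora and richer features, but the shape is the same: count, smooth, compare log-probabilities.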
What is sentiment analysis?
Sentiment proper
Positive, negative, or neutral attitudes expressed in text:
Suffice to say, Skyfall is one of the best Bonds in the 50-year history of moviedom’s most successful franchise.
Skyfall abounds with bum notes and unfortunate compromises.
There is a breach of MI6. 007 has to catch the rogue agent.
Sometimes, factual statements imply sentiment:
Online giant Amazon’s shares have closed 9.8% higher
Building a sentiment analysis system
Cheap and cheerful
Version 1: collect lists of positive and negative words, classify a text based on proportion of pos/neg.
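Version 1 can be sketched in a few lines; the word lists here are tiny stand-ins for a real sentiment lexicon:

```python
# Tiny stand-in lexicons; real lists contain thousands of entries
POSITIVE = {"best", "great", "successful", "superb"}
NEGATIVE = {"bum", "unfortunate", "slump", "weak"}

def lexicon_sentiment(text):
    """Classify by comparing counts of positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # ties and lexicon misses both land here

print(lexicon_sentiment("one of the best, most successful franchises"))  # positive
```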
Version 2: (what most commercial systems do) get a training corpus of texts human-annotated for sentiment (e.g. pos/neg/neut); represent each text as a vector of counts of words or successive pairs of words; train your favourite classifier on these vectors.
Problems:
what if the number of positive words equals the number of negative words?
bag-of-words means structure is ignored: “Airbus: orders slump but profits rise” wrongly = “Airbus: orders rise but profits slump”
Compositional effects will be missed:
clever, too clever, not too clever
bacteria, kill, kill bacteria, fail to kill bacteria, never fail to kill bacteria
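The bag-of-words problem is easy to demonstrate: unigram count vectors for the two Airbus headlines are identical, while word-pair (bigram) counts recover some of the lost structure:

```python
from collections import Counter

def bag(text, n=1):
    """Counts of single words (n=1) or successive word pairs (n=2)."""
    w = text.lower().replace(":", "").split()
    grams = [" ".join(w[i:i + n]) for i in range(len(w) - n + 1)]
    return Counter(grams)

a = "Airbus: orders slump but profits rise"
b = "Airbus: orders rise but profits slump"
print(bag(a) == bag(b))        # True  - unigram bags cannot tell them apart
print(bag(a, 2) == bag(b, 2))  # False - word pairs differ ("orders slump" vs "orders rise")
```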
Version 3
Use linguistic analysis
do as full a parse as possible on input texts
use the syntax to do ‘compositional’ sentiment analysis:
Sentence
├── Noun: Our product
└── VerbPhrase
    ├── Adverb: never
    └── VerbPhrase
        ├── Verb: fails
        └── VerbPhrase
            ├── Verb: to kill
            └── Noun: bacteria
Sentiment logic rules
kill + negative → positive (kill bacteria)
kill + positive → negative (kill kittens)
kill + neutral → neutral (kill time)
too + anything → negative (too clever, too red, too cheap)
etc. In our system (www.theysay.io) we have 75,000+ rules...
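A handful of these rules can be sketched as polarity-transforming functions applied bottom-up over the parse; the prior polarities and rule set here are a tiny illustration, not the production rule base:

```python
# Prior polarities for a few nouns (illustrative only)
PRIOR = {"bacteria": "negative", "kittens": "positive", "time": "neutral"}

FLIP = {"positive": "negative", "negative": "positive", "neutral": "neutral"}

def kill(polarity):
    """kill + negative -> positive; kill + positive -> negative; neutral unchanged."""
    return FLIP[polarity]

def fail_to(polarity):
    """Failing to achieve something flips its polarity."""
    return FLIP[polarity]

def never(polarity):
    """Negation flips polarity again."""
    return FLIP[polarity]

def too(_polarity):
    """too + anything -> negative (too clever, too red, too cheap)."""
    return "negative"

# Composing up the parse tree of "never fails to kill bacteria":
print(kill(PRIOR["bacteria"]))                  # positive
print(fail_to(kill(PRIOR["bacteria"])))         # negative
print(never(fail_to(kill(PRIOR["bacteria"]))))  # positive
```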
Problems:
still need extra work for context-dependence (‘cold’, ‘wicked’, ‘sick’...)
can’t deal with reader perspective: “Oil prices are down” is good for me, not for investors.
can’t deal with sarcasm or irony: “Oh, great, they want it to run on Windows”
Linguistic characteristics of deceptive vs. truthful text
Several studies have found that we can tell liars from truthtellers:
Liars
Liars tend to use more emotion words, fewer first person (“I”, “we”), more negation words, and more motion verbs (“lead”, “go”). (Why?)
Possible that liars exaggerate certainty and positive/negative aspects, and do not associate themselves closely with the content.
Truthtellers
Truthtellers use more self references, exclusive words (“except”, “but”), tentative words (“maybe”, “perhaps”) and time-related words.
Truthtellers are more cautious, accept alternatives, and do associatethemselves with the content.
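These cue classes can be turned into simple per-word rates to feed a classifier; the word lists below are short illustrative samples, not the full lexicons used in the studies:

```python
# Illustrative cue-word samples; the published studies use much larger lists
FIRST_PERSON = {"i", "we", "me", "us", "my", "our"}
NEGATIONS = {"not", "no", "never"}
TENTATIVE = {"maybe", "perhaps", "possibly"}
EXCLUSIVE = {"except", "but", "without"}

def deception_features(text):
    """Per-word rates of the cue classes discussed above."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    n = len(words) or 1
    rate = lambda cues: sum(w in cues for w in words) / n
    return {
        "first_person": rate(FIRST_PERSON),
        "negation": rate(NEGATIONS),
        "tentative": rate(TENTATIVE),
        "exclusive": rate(EXCLUSIVE),
    }

print(deception_features("Maybe we should go, but I am not sure"))
```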
Financial applications
Lying CEOs
Larcker and Zakolyukina (2012) looked at the language used by CEOs and CFOs talking to analysts in conference calls about earnings announcements. Looking at subsequent events:
discovery of ‘accounting irregularities’
restatements of earnings
changes of accountants
exit of CEO and/or CFO
- you can identify retrospectively who was telling the truth or not. This gives us a corpus of transcripts which can be labelled as ‘true’ or ‘deceptive’: training data.
Detecting deception
Avoid losing money
Training a classifier on features like those just described, Larcker and Zakolyukina were able to get up to 66% prediction accuracy.
Building a portfolio of deceptive companies will lose you 4 to 11% per annum...
Verisimilitude
We (TheySay) have built a general ‘verisimilitude’ classifier which looks out for linguistic indicators of deception...
... and also measures clarity and readability (vs. obfuscation, hedging, etc.)
We’ve tried this on speeches by many politicians, and guess what?!
Emotion detection
Many theories of emotion...
Ekman: anger, disgust, fear, happiness, sadness, surprise.
Seems to be a correspondence with facial expressions, and emoticons:
Universality of emotion classes?
Whereas Japanese and US subjects agree on classification of expressions of happiness, surprise, and sadness...
... agreement is significantly lower for anger, disgust and fear
Is this emotion taxonomy universal? Possibly not - in Ifaluk, spoken on a small group of islands in the Pacific:
“...fago is felt when someone dies, is needy, is ill, or goes on a voyage...”
So fago looks like sadness so far. But “...fago is also felt when in the presence of someone admirable or when given a gift.”
Nevertheless...
...the basic dimensions of emotion are expressed in all languages, we believe.
Emotion classification is difficult
Data collection
Human annotation usually gives an upper limit on what is possible: initial results of cross-annotator agreement not encouraging.
But we can bootstrap data collection using self-labelled blog posts(LiveJournal) or social media:
Best day in ages! #Happy :)
Gets so #angry when tutors don’t email back.. Do you job idiots!
In Oxford (TheySay) we have used distant supervision and human-annotated data to build a multi-dimensional emotion classifier with acceptable accuracy...
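Distant supervision from self-labelled posts can be sketched as follows; the hashtag-to-emotion map is a hypothetical example, and real pipelines add much more filtering:

```python
import re

# Hypothetical hashtag-to-emotion map used for distant supervision
HASHTAG_LABELS = {"#happy": "happiness", "#angry": "anger", "#scared": "fear"}

def self_label(post):
    """Label a post by its emotion hashtag, then strip the hashtag
    so a classifier cannot simply memorise the label word itself."""
    for tag, emotion in HASHTAG_LABELS.items():
        if tag in post.lower():
            cleaned = re.sub(tag, "", post, flags=re.IGNORECASE).strip()
            return cleaned, emotion
    return None  # no self-label found: skip this post

print(self_label("Best day in ages! #Happy :)"))
```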
Text analytics and politics
Predicting election results: can we eliminate opinion polls?
A very mixed picture, to put it mildly:
Tumasjan et al. 2010 claimed that share of volume on Twitter corresponded to the distribution of votes between the 6 main parties in the 2009 German Federal election. (Volume rather than sentiment is also a better predictor of movie success.)
Jungherr et al. 2011 pointed out that not all parties running had been included and that different results are got with different time windows. Tumasjan et al. replied, toning down their original claims.
Tjong et al. found that Tweet volume was NOT a good predictor for the Dutch 2011 Senate elections, and that sentiment analysis was better.
Predicting election results
Skoric et al. 2012 found some correlation between Twitter volume and the votes in the 2011 Singapore general election, but not enough to make accurate predictions.
Bermingham and Smeaton 2011 found that share of Tweet volume and proportion of positive Tweets correlated best with the outcome of the Irish General Election of 2011...
... but that the mean average error was greater than that of traditional opinion polls!
So this doesn’t look very promising, although of course other sources than Twitter might give better results - but they are difficult to get at quickly.
What do the voters think?
We can detect which topics arouse strong emotions among the voters:
Candidate Jones:
            economy  immigration  education  health
fear           8          6           2         1
anger          2          9           1         1
happiness      1          1           5         6

Candidate Smith:
            economy  immigration  education  health
fear           2          7           2         3
anger          1          9           1         1
happiness      7          1           5         3
Scottish Independence Referendum
[Figure: positive sentiment over time]
[Figure: levels of fear over time]
Economy: Daily Mirror debate, 17th May: John McDonnell, Nigel Farage and Peter Mandelson.
Immigration: murder of Remain MP Jo Cox, 16th June.
[Figure: sentiment, all topics]
Brexit today
Summary
Text Analytics today
We can detect several different types of signals in text. Performance is nowhere near 100% accurate but improving all the time, and being able to process large volumes of text increases the signal-to-noise ratio.
We know that sentiment has a predictive value. Emotion classification gives a finer-grained analysis of opinion, and more insight and explanation than traditional sentiment analysis. Current research is also finding signals from text that predict intent, risk, deception, etc.
Text is an underused resource: it is (mostly) free and, though unstructured, text analytics can derive structured information from it very quickly.
Combining rich text signals (sentiment, deception, risk etc.) with other time series data will provide analysts with important insights into underlying trends.