Hao lyu slides_sarcasm

Sarcasm Detection on Twitter

May 2016

Hao Lyu, MSIS Student

Guided by Dr. Byron Wallace

7/7/2016

Content

1. Introduction

2. Data

3. Feature models (machine learning)

4. Experimental settings

5. Results and discussion


Why social media?

Mining and analyzing data from blogs, posts, and tweets can:

• Support marketing and customer service activities

• Help decision making

• Enhance the products and services

• Improve the competitive advantage of companies

Twitter is one of the most important social media resources.

It supports different types of data: text, pictures, and videos.


Sarcasm poses problems for algorithms in the 2016 U.S. election


In the race for the White House in 2016, election campaigns rely on social media analysis to help them tailor advertising and other outreach to particular groups of voters.

Average follower growth, Jan 26 to Feb 26

1. @realDonaldTrump 20,900

2. @BernieSanders 10,400

3. @HillaryClinton 10,300

4. @MarcoRubio 5,320

5. @TedCruz 3,950

6. @RealBenCarson 1,870

7. @JohnKasich 1,440

Stay Classy


A predictive analysis firm examined tweets containing the expression “classy” and found that 72 percent of them used it in a positive way.

But when the word was used near the name of Republican presidential candidate Donald Trump, around three quarters of tweets citing "classy" were negative.

What is Sarcasm on Twitter


A sarcastic tweet. The speaker is clearly not welcoming allergy season back.

Lexical cues alone can provide enough information to detect the sarcasm.

What is Sarcasm on Twitter


Another sarcastic tweet. The speaker actually supports the Democrats.

Detecting whether this one is sarcastic requires contextual information surrounding the post.

Sarcasm Detection on Twitter

The state-of-the-art method combines lexical and contextual information to achieve robust classification performance.

In this project, I re-implement a recent method for automatic sarcasm detection due to Bamman and Smith (2015).

I use multiple approaches to extract a large amount of data and apply machine learning models to distinguish sarcastic from non-sarcastic tweets.


DATA

Bamman dataset: 19534 tweets, around half sarcastic and the other half non-sarcastic. Bamman shares the IDs of those tweets.

Tweets disappear over time because users may quit Twitter, hide their accounts from public view, or delete tweets. After data crawling, I collected 17926 tweets.

DATA

The labels of tweets are inferred from self-declaration of sarcasm: a tweet is marked as sarcastic if it contains the hashtag #sarcasm or #sarcastic, and as non-sarcastic otherwise.
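The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `label_tweet` is hypothetical, and matching the hashtags case-insensitively is an assumption.

```python
import re

# Hashtags treated as a self-declaration of sarcasm (taken from the slide).
# Case-insensitive matching is an assumption of this sketch.
SARCASM_TAGS = re.compile(r"#sarcas(?:m|tic)\b", re.IGNORECASE)


def label_tweet(text: str) -> int:
    """Return 1 (sarcastic) if the tweet self-declares sarcasm, else 0."""
    return 1 if SARCASM_TAGS.search(text) else 0
```

Because the label depends only on the hashtag, the hashtag itself has to be removed from the text before training, otherwise the classifier could simply memorize it.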

DATA

Historical (past) tweets and profile of the user

DATA

Audience (the user who responded to the target tweet, or was mentioned in the target tweet)

Original tweet (the tweet to which the target tweet responded)

DATA EXTRACTION

Static web crawling

Dynamic web crawling

Twitter Stream API

DATA EXTRACTION

Static web crawling: Scrapes static web pages and extracts text from the HTML markup, e.g., the profile page.

DATA EXTRACTION

Dynamic web crawling: Focuses on the data sent from the Twitter server when I interact with a web page, e.g., when I scroll down the page to view more tweets from a user.

DATA EXTRACTION

Twitter Stream API: An efficient way to collect public tweets. Twitter provides an interface for developers through its API.

Limit: 1% of public tweets

DATA PROCESSING

Remove tweets that are:

• Not English

• Shorter than 3 words

• Retweets

Replace URLs and user mentions.

Remove the hashtags #sarcastic and #sarcasm from the sarcastic tweets.

Normalize profile data, e.g.:

• Time zone data are mapped to geographic areas using the Google geocoder package.

• Counts that Twitter displays as strings, such as ‘22K’ or ‘2 Million’, are converted to a numeric type.
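A rough sketch of these cleaning steps follows. The names `clean_tweet` and `parse_count`, the placeholder tokens `<URL>` and `<USER>`, and the "RT @" test for retweets are all assumptions made for illustration; the language filter and the time zone mapping are left out because they depend on external services.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TAG_RE = re.compile(r"#sarcas(?:m|tic)\b", re.IGNORECASE)


def clean_tweet(text: str) -> Optional[str]:
    """Return the normalized tweet, or None if it should be dropped."""
    if text.startswith("RT @"):            # assumed retweet convention
        return None
    text = URL_RE.sub("<URL>", text)       # placeholder tokens are assumptions
    text = MENTION_RE.sub("<USER>", text)
    text = TAG_RE.sub("", text)            # strip the label-bearing hashtags
    text = " ".join(text.split())          # collapse leftover whitespace
    if len(text.split()) < 3:              # shorter than 3 words
        return None
    return text


MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "million": 1_000_000}


def parse_count(raw: str) -> int:
    """Convert a display string such as '22K' or '2 Million' to an integer."""
    match = re.fullmatch(r"\s*([\d.,]+)\s*([A-Za-z]*)\s*", raw)
    if match is None:
        raise ValueError(f"unrecognized count: {raw!r}")
    suffix = match.group(2).lower()
    if suffix and suffix not in MULTIPLIERS:
        raise ValueError(f"unknown suffix in count: {raw!r}")
    number = float(match.group(1).replace(",", ""))
    return int(round(number * MULTIPLIERS.get(suffix, 1)))
```

Returning `None` for dropped tweets keeps filtering and normalization in one pass; a caller would simply skip the `None` results.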

FEATURE ENGINEERING

In machine learning and pattern recognition, a feature is an individual measurable property of a phenomenon being observed.

Similar concept: the explanatory variable used in statistical techniques such as linear regression.

FEATURE ENGINEERING

Tweet Features: Represent the lexical and grammatical information of the target tweet, using only the text of the target tweet.

Author Features: Capture information about the author of the target tweet, using the author's historical tweets and profile information.

Audience Features: Encode information about the addressee of the tweet, using the audience's historical tweets and profile information and the communication between the audience and the author.

Response Features: Consider the interaction between the target tweet and the tweet it is responding to, using the text of the original tweet.

TWEET FEATURES

Bag of Words: In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Stop words are removed.

Tweet                              get  am  work  not  love  new
“Get in am at work (not) #Work”    1    1   1     1    0     0
“Love my new work #Work”           0    0   1     0    1     1
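As a small illustration of the bag-of-words idea, the sketch below builds count vectors with the standard library only. The stop-word list and the tokenizer are assumptions (the slides give neither), and the helper names are hypothetical. Note that this tokenizer folds "#Work" into "work", so it counts "work" twice per tweet, which differs from the table above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"in", "at", "my"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def bag_of_words(texts):
    """Return (vocabulary, count vectors), vocabulary in first-seen order."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = []
    for c in counts:
        for word in c:
            if word not in vocab:
                vocab.append(word)
    # A Counter returns 0 for words a tweet does not contain.
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors


vocab, vectors = bag_of_words([
    "Get in am at work (not) #Work",
    "Love my new work #Work",
])
```

Whether hashtags should be merged with ordinary words or kept as separate tokens is a design choice; keeping them separate would preserve the signal that a word was used as a tag.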

Pronunciation features: Twitter users have specific writing styles, e.g., RT (Retweet), CHK (Check) and IIRC (If I recall correctly).

I count the number of words that have only alphabetic characters but no vowels, and the number of words with more than three syllables.

Example: “Wow! wtf man? RT @latimes: Gov. Brown signs bills to raise smoking age to 21, restrict e-cigarettes”

Counts (words without vowels, words with more than three syllables): 2, 0

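The two pronunciation counts could be computed along the following lines. This is only a sketch: the slides do not say how words were tokenized or how syllables were counted, so the alphabetic-run tokenizer and the vowel-group syllable heuristic (which treats "y" as a vowel and overcounts silent "e") are assumptions.

```python
import re


def count_syllables(word):
    """Rough estimate: one syllable per group of consecutive vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def pronunciation_features(text):
    """Return (words without vowels, words with more than three syllables)."""
    words = re.findall(r"[A-Za-z]+", text)  # alphabetic runs only
    no_vowel = sum(1 for w in words if count_syllables(w) == 0)
    long_words = sum(1 for w in words if count_syllables(w) > 3)
    return no_vowel, long_words
```

A dictionary-based syllable counter would be more accurate than this heuristic, at the cost of an external resource.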

AUTHOR FEATURES

Author historical topics: Historical topic features are inferred under LDA with 100 topics over all historical tweets.

LDA, short for Latent Dirichlet Allocation, is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar (Blei, Ng, and Jordan 2003).

                                        Topic 1   Topic 2   …   Topic 100
Author 1 (tweet01, tweet11… tweetX1)    0.3232    0.932     …   0.1522
Author 2 (tweet02, tweet12… tweetX2)    0.4232    0.3322    …   0.5522

Each topic is defined by multiple words, e.g.,

Topic 1: basketball, StephCurry, Stadium, fans, awesome, champion…
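To make the per-author topic distribution concrete, here is a compact collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under stated assumptions, not the implementation used in the project (the slides do not name one): the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are illustrative, and each "document" is assumed to be the concatenated historical tweets of one author.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists, one per author.
    Returns one topic distribution (floats summing to 1) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(row[t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for row, doc in zip(doc_topic, docs)
    ]
```

Each returned row plays the role of one author's feature vector. With 100 topics and thousands of authors, an optimized library implementation would be the practical choice; the pure-Python loop here is only meant to show the mechanics.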

AUDIENCE FEATURES

Author/Audience interactional topics: This feature measures the similarity of the historical topics of the audience and the author.

I take the element-wise product of the author's and the audience's historical topic distributions. Topics that are prominent for both will have higher values in the product.

                            Topic 1   Topic 2   …   Topic 100
Author historical topic     0.1       0.9       …   0.1
Audience historical topic   0.5       0.9       …   0.1
Element-wise product        0.05      0.81      …   0.01
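The element-wise product is a one-liner; the sketch below uses the three values shown for Topic 1, Topic 2, and Topic 100 as a toy input. The function name `interaction_topics` is hypothetical.

```python
def interaction_topics(author_topics, audience_topics):
    """Element-wise product of two topic distributions of equal length."""
    if len(author_topics) != len(audience_topics):
        raise ValueError("topic vectors must have the same length")
    return [a * b for a, b in zip(author_topics, audience_topics)]


author = [0.1, 0.9, 0.1]
audience = [0.5, 0.9, 0.1]
product = interaction_topics(author, audience)
```

A topic scores high in the product only when it is prominent for both users, which is why the second entry dominates here.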

RESPONSE FEATURES

Bag of Words: Here we use the BoW of the original tweet (the tweet to which the target tweet is responding).

EXPERIMENTAL SETTING

Data → meaningful features

Machine learning model: Logistic Regression

Tune set → LR model → optimized parameters

Train set → LR model → fit

Test set → evaluate
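The tune/train/test flow can be sketched with a small logistic regression written from scratch. Several details are assumptions, since the slides do not state them: that the tuned parameter is the L2 regularization strength, that it is chosen with a simple holdout split inside the tune set, and that plain batch gradient descent is used for fitting. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lr(X, y, l2=0.0, lr=0.1, epochs=200):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def run_experiment(tune, train, test, l2_grid=(0.0, 0.01, 0.1, 1.0)):
    """tune, train, test are (X, y) pairs. Returns (best_l2, test_accuracy)."""
    X_tune, y_tune = tune
    half = len(X_tune) // 2
    # Step 1: choose the regularization strength using the tune set only.
    best_l2 = max(
        l2_grid,
        key=lambda l2: accuracy(
            fit_lr(X_tune[:half], y_tune[:half], l2=l2),
            X_tune[half:], y_tune[half:],
        ),
    )
    # Step 2: fit the model on the train set with the chosen parameter.
    model = fit_lr(*train, l2=best_l2)
    # Step 3: evaluate once on the held-out test set.
    return best_l2, accuracy(model, *test)
```

Keeping the tune set separate from the train and test sets means the reported test accuracy is not influenced by the parameter search.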

Results

[Figure: results chart. Values shown: 69.1%, 73.3%, 75.7%, 75.3%, 77.6%, 78.3%]


Discussion

• Combining the lexical information of the text with contextual information gives the best accuracy in detecting sarcasm.

• Collecting historical tweets is very expensive in both time and computing resources. Not very practical!

• I suggest using less contextual information about the author, favoring data that can be collected quickly and easily. E.g., the profile information of the author and the response features are relatively effective and cost less.


Discussion

• Extract only the historical tweets posted around the time of the target tweet. Intuitively, tweets posted close together in time are more likely to be about the same subject.

• Randomly sampling from the historical tweets can also produce the topic distribution while reducing cost.


Questions?


5127183100
