Automatic generation of event summaries using microblog streams



“Twitsum” : Automatic generation of event summaries using microblog streams


Motivation - The problem with Twitter search

● Twitter ranks tweets based on user interaction with them (number of retweets, favorites)

● Top results for the query ‘Ebola’ (25th November 2014)

● How to distinguish newsworthy tweets drowned in a sea of noise

Goal

● Distinguish newsworthy tweets based on syntactic features without depending on manual annotations

● Group tweets discussing similar content together

Contributions

● A heuristic-based scheme for annotating tweets as objective or subjective


● A classifier capable of detecting objective tweets using only the syntactic information of tweets

● An entity-centric tweet clustering algorithm

Twitter summarization - Earlier approaches

Sub-event detection based methods

● Use of a Hidden Markov Model to detect sub-events during an American football match (D. Chakrabarti and K. Punera, 2011)

● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic (Zubiaga et al., 2012)

Clustering based approaches

● A support platform for event detection using social intelligence (T. Baldwin, P. Cook and B. Han, 2012)

○ Tweets are filtered using manually selected keywords


● Tweet storage - stores the set of tweets downloaded using the Streaming API

● Classifier - selection of objective tweets

● Summarizer - removes duplicates and clusters the tweets based on their similarity

Design - Objectivity detection

● Tweets are periodically downloaded by querying the public timeline using the Streaming API

● Structure of a tweet object:

tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count
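The fields listed above can be held in a small record type. This is a minimal sketch, not the system's actual data model: the attribute names, the `tweet_from_json` helper and the JSON key names it reads are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Tweet:
    """Fields named on the slide; attribute names are illustrative."""
    text: str
    user_name: str
    created_time: str
    geo_location: Optional[Any]
    language_code: Optional[str]
    favorite_count: int
    retweet_count: int
    retweeted_status: Optional[dict]  # the original tweet when this is a retweet

    @property
    def is_retweet(self) -> bool:
        return self.retweeted_status is not None


def tweet_from_json(obj: dict) -> Tweet:
    # The key names below are assumed, not taken from the slides.
    user = obj.get("user") or {}
    return Tweet(
        text=obj.get("text", ""),
        user_name=user.get("name", ""),
        created_time=obj.get("created_at", ""),
        geo_location=obj.get("geo"),
        language_code=obj.get("lang"),
        favorite_count=int(obj.get("favorite_count") or 0),
        retweet_count=int(obj.get("retweet_count") or 0),
        retweeted_status=obj.get("retweeted_status"),
    )
```

Missing counts default to zero, so partially populated objects can still be stored.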

Data collection

● Training data annotated using a heuristic


● Objective - If the tweet is generated by a verified profile

● Subjective - Tweets containing at least a single emoticon or an emoji character
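The two labeling rules can be sketched as a single function. The emoticon pattern and emoji code-point ranges below are small, hypothetical stand-ins for whatever lists were actually used, and the treatment of tweets that match both rules or neither rule (left unlabeled here) is an assumption; the slides do not specify it.

```python
import re
from typing import Optional

# Hypothetical, deliberately small emoticon pattern, e.g. :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")
# Rough emoji code-point ranges (assumption, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def heuristic_label(text: str, user_verified: bool) -> Optional[str]:
    """Return 'objective', 'subjective', or None when no label is assigned."""
    has_emotion = bool(EMOTICON_RE.search(text) or EMOJI_RE.search(text))
    if user_verified and not has_emotion:
        return "objective"
    if has_emotion and not user_verified:
        return "subjective"
    # Conflicting or missing evidence: skip the tweet (assumed behaviour).
    return None
```

Because the labels come from profile status and emoticons rather than from the words themselves, the training set can be built without manual annotation.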

Preprocessing

● All emoticons and emoji characters are removed from the corpus

● User mentions are replaced with the tag ‘MENTION’ (eg: “@john said this” converts to “MENTION said this”)

● Punctuation symbols including the pound (#) character are removed

● URLs are replaced with the tag ‘URL’ (eg: converts to URL)

● Numbers in a tweet are replaced by the tag ‘NUMERIC’

● Remove stop words
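The preprocessing steps above could be chained roughly as follows. This is a sketch under stated assumptions: the regular expressions and the tiny stop-word list are hypothetical, and URLs and mentions are replaced before punctuation is stripped so that the tags survive, an ordering the slides do not specify.

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}  # hypothetical list
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")                      # hypothetical pattern
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")    # rough ranges
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
# Maps ASCII punctuation (including '#') to spaces; Unicode punctuation is untouched.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text: str) -> str:
    text = URL_RE.sub(" URL ", text)          # URLs first, before ':' and '/' are stripped
    text = MENTION_RE.sub(" MENTION ", text)  # '@' must still be present here
    text = EMOTICON_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = NUMBER_RE.sub(" NUMERIC ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `preprocess("@john said this")` returns `"MENTION said this"`, matching the mention example on the slide.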

Feature extraction

● Tweets are tokenized using TweetNLP tokenizer (K. Gimpel, N. Schneider, and B. O’Connor, 2011)

● Words are stemmed using Porter stemmer

● Stemmed unigrams, bigrams converted to binary Tf-Idf values (with Laplace smoothing)

● binary feature - presence of slang words (using an external gazetteer)

● binary feature - presence of bad words

● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values

● Average number of misspelled words

● Average number of all-capital words

● Average number of hashtags
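A rough sketch of the n-gram and lexical features listed above, using only the standard library. Stemming and Tf-Idf weighting are omitted, the word lists are passed in as plain sets, and "average number" is interpreted as a per-token ratio; all of these are assumptions. Since '#' is removed during preprocessing, the hashtag ratio is assumed to be computed on the raw tokens.

```python
from typing import Dict, List, Set


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Contiguous n-grams joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def binary_ngram_features(tokens: List[str], max_n: int) -> Set[str]:
    """Set of n-grams present (binary occurrence) for n = 1..max_n.
    Works for word tokens (max_n=2) and POS tag sequences (max_n=3)."""
    return {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}


def surface_features(tokens: List[str], slang: Set[str],
                     bad_words: Set[str], dictionary: Set[str]) -> Dict[str, float]:
    words = [t for t in tokens if t.isalpha()]
    n = max(len(tokens), 1)  # avoid division by zero on empty tweets
    return {
        "has_slang": float(any(t.lower() in slang for t in tokens)),
        "has_bad_word": float(any(t.lower() in bad_words for t in tokens)),
        "misspelled_ratio": sum(w.lower() not in dictionary for w in words) / n,
        "all_caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n,
        "hashtag_ratio": sum(t.startswith("#") for t in tokens) / n,
    }
```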

Classifier selection

● A dataset of 6,000 tweets on Ebola is used to benchmark three classifiers (3,000 tweets from each class)

○ Support Vector Machines

○ Logistic Regression

○ Naive Bayes

● Classifiers are trained on a random sample of 4,800 tweets; the remaining 1,200 are used as the test set

● Classifier parameters are found using 10-fold cross validation
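The evaluation protocol (a 4,800 / 1,200 split, then 10-fold cross validation on the training part) can be sketched with the standard library. The function names and the fixed seed are illustrative, and whether the original split was stratified by class is not stated, so this sketch does not stratify.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def train_test_split(items: Sequence, train_size: int, seed: int = 0) -> Tuple[list, list]:
    """Shuffle a copy of the items and cut it at train_size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Each candidate parameter setting would be scored on the ten validation folds, and the held-out test set is only touched once the parameters are fixed.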

Classifier performance

● SVM was selected because it had higher recall than Logistic Regression

● A higher recall results in a larger fraction of newsworthy tweets being detected
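For reference, precision, recall and F1 follow directly from the confusion counts of the objective class. The helper below is a generic sketch; the counts in the usage note are made up and are not results from this work.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """tp/fp/fn are counted with 'objective' as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With hypothetical counts `tp=80, fp=20, fn=20`, precision and recall are both 0.8. Recall is the fraction of truly objective tweets that the classifier retrieves, which is why it drives the selection here.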

Contribution from features

● Measured using ablation test

● Features divided into three sets

WRD - unigram and bigrams

LEX - all other lexical features
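An ablation test removes one feature set at a time and records how much the score drops. The sketch below assumes a caller-supplied `evaluate` function that trains and scores a classifier on the given feature sets; that function, and the group name `POS` used in the usage note for the set not named above, are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation(groups: Iterable[str],
             evaluate: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Score drop caused by removing each feature group in turn."""
    all_groups = frozenset(groups)
    full_score = evaluate(all_groups)
    return {g: full_score - evaluate(all_groups - {g}) for g in sorted(all_groups)}
```

For instance, with a toy scorer that simply adds made-up per-group contributions, `ablation(["WRD", "LEX", "POS"], toy_score)` returns each group's contribution; a real `evaluate` would retrain the classifier for every subset.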

Selection of the POS tagger

● NLTK POS tagger

● Stanford tagger with GATE twitter model (L. Derczynski et al., 2013)

● SENNA tagger (Ronan Collobert, 2011) - “deep” recurrent convolutional neural network based discriminant parser

Eg: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… | #news"

NLTK tagger:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('\u2026', 'NNP'), ('|', 'NNP'), ('news',


Selection of the POS tagger...

"Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… | http://t.co/NoFij4iACl #news"

SENNA tagger:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', 'JJ'), ('|', 'NN'), ('news',


Stanford tagger with GATE twitter model:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', '.'), ('|', ':'), ('news', 'NN')]

Results

Data sets

● 1 million tweets containing the term ‘Ebola’

● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective- 465, subjective- 878)

○ Filtered using terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”.

● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective- 215, subjective- 242)

○ Filtered using terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”.

Gold standard data set

● A sample of 500 tweets on the topic ‘ebola’ is annotated manually as objective or subjective (objective- 206, subjective- 294)

● Classifier scores on this data

● Errors: “RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.” It’s hard to judge the objectivity of such sentences based only on syntactic information.

Comparison with prior research

● Event related tweets detection with user type recognition (L. Silva, E. Riloff, 2013)

○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk

● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and L. Huang, 2013)

○ An SVM model trained on syntactic features used for sentiment classification

Classifier                      Precision   Recall   F1-score
User type agnostic classifier   83.15       55.99    66.92
User type specific classifier   80.35       66.07    72.15

Features           Accuracy
Unigram + Bigram   81.6
Unigram + POS      81.9

Cross-domain applicability

● The classifier trained on Ebola tweets is applied to cricket related tweets

● The classifier trained on the SLvIndia match performed well on SLvEngland tweets


● Duplicates and near-duplicate tweets are abundant due to Retweets and tweets generated by ‘Tweet’ buttons on news sites

● Removes duplicates in the objective tweets detected by the classifier

● Tweets discussing the same entities are clustered together

Near-duplicate removal

● Objective tweets are stripped of ‘RT’ markers, @-mentions and punctuation

● Jaccard similarity of tokens is used to detect duplicate tweets

● Two tweets are considered duplicates if their Jaccard similarity is greater than a threshold d
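Near-duplicate removal with Jaccard similarity can be sketched as a greedy pass that keeps a tweet only if it is not too similar to any tweet already kept. The threshold value in the usage note is hypothetical (the value of d is not given), and this quadratic scan is an illustration rather than the system's actual implementation.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(tweets: List[str], threshold: float) -> List[str]:
    """Keep the first tweet of every group whose similarity exceeds threshold."""
    kept_texts: List[str] = []
    kept_tokens: List[Set[str]] = []
    for text in tweets:
        tokens = set(text.lower().split())
        if all(jaccard(tokens, seen) <= threshold for seen in kept_tokens):
            kept_texts.append(text)
            kept_tokens.append(tokens)
    return kept_texts
```

With a threshold of 0.7, "ebola patient cured in new york" and "ebola patient cured in new york today" share 6 of 7 distinct tokens (similarity about 0.86), so only the first is kept.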

Clustering

● The goal is to cluster tweets mentioning the same entities together

Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle...”

“#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply”

● Vectors of NER tags are converted to Tf-Idf scores, and cosine distance is used as the distance measure between two NER tag vectors

● DBSCAN is selected because it does not require the number of clusters in advance and is capable of identifying arbitrarily shaped clusters
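The entity-centric clustering step can be illustrated with a compact, standard-library DBSCAN over Tf-Idf entity vectors and cosine distance. This is a sketch of the general technique, not the original code: the smoothed IDF formula, the `eps` and `min_pts` values in the usage note, and the brute-force neighbour search are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

NOISE = -1


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """docs holds the entity list of each tweet; returns sparse Tf-Idf dicts."""
    n = len(docs)
    df = Counter(e for d in docs for e in set(d))
    idf = {e: math.log((1 + n) / (1 + c)) + 1.0 for e, c in df.items()}  # smoothed
    return [{e: tf * idf[e] for e, tf in Counter(d).items()} for d in docs]


def cosine_distance(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 1.0  # a tweet without entities is far from everything
    return 1.0 - dot / (nu * nv)


def dbscan(vectors: List[Dict[str, float]], eps: float, min_pts: int) -> List[int]:
    """Return a cluster id per vector, or NOISE (-1)."""
    labels: List = [None] * len(vectors)

    def neighbours(i: int) -> List[int]:
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:       # core point: keep expanding the cluster
                queue.extend(nb)
    return labels
```

With hypothetical settings `eps=0.5, min_pts=2`, tweets whose entity lists overlap on 'Craig' and 'Spencer' fall into one cluster, while tweets sharing no entities with any other tweet are labeled as noise.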

Clustering - results

● The SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014)

● 13,380 tweets were detected as objective, of which 8,138 were duplicates. Clustering resulted in 332 clusters, while 2,751 tweets were labeled as noise

● Clusters depend on the quality of the Named Entity Recognizer

Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']

Clustering - discussion

● In contrast, this tweet is labeled as noise

“#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK”

Entities: ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']

Future work

● Improve cross-domain applicability

○ Finding better features with less dependence on the domain

● A better methodology to evaluate summaries

● Improve clustering to also consider verbs

● Generate an abstractive summary

○ Generate novel sentences from the information contained in tweets

● Generate summaries in real time
