Automatic generation of event summaries using microblog streams

“Twitsum”: Automatic generation of event summaries using microblog streams

P.K.K. Madhawa

2012MCS044

Motivation - The problem with Twitter search

● Twitter ranks tweets based on user interaction with them (number of retweets, favorites)

● Top results for the query ‘Ebola’ (25th November 2014)

● How to distinguish newsworthy tweets drowned in a sea of noise

Goal

● Distinguish newsworthy tweets based on syntactic features, without depending on manual annotations

● Group tweets discussing similar content together

Contributions

● A heuristic-based scheme for annotating tweets as subjective/objective

● A classifier capable of detecting objective tweets using only the syntactic information of tweets

● An entity-centric tweet clustering algorithm

Twitter summarization - Earlier approaches

Sub-event detection based methods

● Use of a Hidden Markov Model to detect sub-events during an American football match (D. Chakrabarti and K. Punera, 2011)

● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic (Zubiaga et al., 2012)

Clustering based approaches

● A support platform for event detection using social intelligence (T. Baldwin, P. Cook and B. Han, 2012)

○ Tweets are filtered using manually selected keywords

Design

● Tweet storage - stores the set of tweets downloaded using the Streaming API

● Classifier - selection of objective tweets

● Summarizer - removes duplicates and clusters the tweets based on their similarity

Design - Objectivity detection

● Tweets are periodically downloaded by querying the public timeline using the Streaming API

● Structure of a tweet object:

tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count

Data collection

● Training data annotated using a heuristic measure

● Objective - If the tweet is generated by a verified profile

● Subjective - Tweets containing at least a single emoticon or an emoji character
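The labeling heuristic above can be sketched as follows. This is a minimal illustration, not the author's code: the emoticon pattern, the emoji ranges, and the rule that an emoticon overrides a verified profile are all assumptions, since the slides do not specify them. The `user.verified` field follows the shape of a Twitter tweet object.

```python
import re

# Assumption: a small emoticon pattern; the slides do not list the emoticons used.
# The lookarounds keep "http://" from being read as the emoticon ":/".
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-']?[\)\(DPp\]\[/\\|](?!\S)")

# Assumption: two common emoji blocks stand in for "emoji characters".
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def heuristic_label(tweet):
    """Return 'subjective', 'objective', or None when neither rule applies."""
    text = tweet.get("text", "")
    # Assumption: the emoticon/emoji rule wins if both rules apply.
    if EMOTICON_RE.search(text) or EMOJI_RE.search(text):
        return "subjective"
    if tweet.get("user", {}).get("verified", False):
        return "objective"
    return None  # left unlabeled, so not used as training data
```

Tweets that match neither rule are simply left out of the training set in this sketch.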

Preprocessing

● All emoticons and emoji characters are removed from the corpus

● User mentions are replaced with the tag ‘MENTION’ (eg: “@john said this” converts to “MENTION said this”)

● Punctuation symbols, including the pound (#) character, are removed

● URLs are replaced with the tag ‘URL’ (eg: http://t.co/12d3 converts to URL)

● Numbers in a tweet are replaced by the tag ‘NUMERIC’

● Remove stop words
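A rough sketch of the preprocessing steps listed above. The regular expressions and the tiny stop word list are illustrative assumptions, and whitespace splitting stands in for the real tokenizer. URLs and mentions are replaced before punctuation is stripped, since removing punctuation first would break them apart.

```python
import re

# Assumption: a toy stop word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "be"}

EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-']?[\)\(DPp\]\[/\\|](?!\S)")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text):
    """Apply the listed steps and return the remaining tokens."""
    text = EMOJI_RE.sub(" ", text)              # drop emoji
    text = EMOTICON_RE.sub(" ", text)           # drop emoticons
    text = URL_RE.sub(" URL ", text)            # URLs -> URL tag
    text = MENTION_RE.sub(" MENTION ", text)    # @user -> MENTION tag
    text = NUMBER_RE.sub(" NUMERIC ", text)     # numbers -> NUMERIC tag
    text = PUNCT_RE.sub(" ", text)              # punctuation, including '#'
    return [t for t in text.split() if t.lower() not in STOP_WORDS]


print(preprocess("@john said this"))
```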

Feature extraction

● Tweets are tokenized using the TweetNLP tokenizer (K. Gimpel, N. Schneider, and B. O’Connor, 2011)

● Words are stemmed using the Porter stemmer

● Stemmed unigrams and bigrams are converted to binary Tf-Idf values (with Laplace smoothing)

● Binary feature - presence of slang words (using an external gazetteer)

● Binary feature - presence of bad words

● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values

● Average number of misspelled words

● Average number of all-capital words

● Average number of hashtags
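One way to read "binary Tf-Idf values (with Laplace smoothing)" is a term frequency of 1 when an n-gram is present, multiplied by an inverse document frequency with add-one smoothing. The sketch below follows that reading; the exact IDF formula is an assumption, and the input is expected to be already tokenized and stemmed.

```python
import math


def ngrams(tokens, n):
    """Return the list of n-grams (joined with spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def binary_tfidf(docs, max_n=2):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    term_sets = []
    for tokens in docs:
        terms = set()
        for n in range(1, max_n + 1):
            terms.update(ngrams(tokens, n))
        term_sets.append(terms)

    # Document frequency of every n-gram.
    df = {}
    for terms in term_sets:
        for term in terms:
            df[term] = df.get(term, 0) + 1

    # Assumption: add-one smoothing on both counts, plus 1 so that terms
    # occurring in every document keep a non-zero weight.
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    # Binary term frequency: the weight is the IDF if the term is present.
    return [{t: idf[t] for t in terms} for terms in term_sets]


vectors = binary_tfidf([["ebola", "patient", "cure"], ["ebola", "outbreak"]])
```

The same function applies to POS tag sequences by passing tag lists and `max_n=3`.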

Classifier selection

● A dataset of 6,000 tweets on Ebola is used to benchmark three classifiers (3,000 tweets from each class)

○ Support Vector Machines

○ Logistic Regression

○ Naive Bayes

● Classifiers are trained on a random sample of 4,800 tweets; the remaining 1,200 are used as the test set.

● Classifier parameters are found using 10-fold cross validation
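The evaluation protocol above (a 4,800 / 1,200 random split, then 10-fold cross validation on the training part) can be sketched with the standard library alone. The function names and the fixed seed are illustrative; the slides do not say whether the split was stratified, so a plain random split is assumed.

```python
import random


def train_test_split(items, n_train, seed=0):
    """Shuffle a copy of items and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) index lists for k roughly equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


train, test = train_test_split(range(6000), n_train=4800)
folds = list(k_fold_indices(len(train), k=10))
```

Each candidate parameter setting would be scored on the ten validation folds, and the best setting retrained on all 4,800 tweets before being evaluated once on the held-out 1,200.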

Classifier performance

● SVM was selected because it had higher recall than Logistic Regression

● A higher recall results in a larger fraction of newsworthy tweets being detected
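For reference, recall here is the fraction of truly objective tweets that the classifier labels as objective. A small sketch of the three standard scores, treating "objective" as the positive class (the label strings are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive="objective"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A classifier with low recall would discard newsworthy tweets before the summarizer ever sees them, which is why recall is the deciding score in this pipeline.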

Contribution from features

● Measured using an ablation test

● Features divided into three sets

WRD - unigrams and bigrams

LEX - all other lexical features
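An ablation test removes one feature set at a time and records how much the score drops. A minimal sketch, assuming a hypothetical `evaluate` callback that trains and scores a classifier on the given feature sets. In the toy usage, the scores are made up and the third set name `POS` is an assumption, since the slide names only WRD and LEX.

```python
def ablation(groups, evaluate):
    """Return {group: score drop when that group alone is removed}."""
    full_score = evaluate(set(groups))
    return {g: full_score - evaluate(set(groups) - {g}) for g in groups}


# Toy stand-in for a real train-and-score routine (made-up numbers).
toy_contribution = {"WRD": 0.5, "LEX": 0.2, "POS": 0.1}


def toy_evaluate(active_groups):
    return sum(toy_contribution[g] for g in active_groups)


drops = ablation(list(toy_contribution), toy_evaluate)
```

The larger the drop, the more the classifier relies on that feature set.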

Selection of the POS tagger

● NLTK POS tagger

● Stanford tagger with GATE twitter model (L. Derczynski et al., 2013)

● SENNA tagger (Ronan Collobert, 2011) - “deep” recurrent convolutional neural network based discriminant parser

Eg: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"

NLTK tagger:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'),
('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('\u2026', 'NNP'), ('|', 'NNP'),
('news', 'NN')]

Selection of the POS tagger...

"Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"

SENNA tagger:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'),
('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', 'JJ'), ('|', 'NN'),
('news', 'NN')]

Stanford tagger with GATE twitter model:

[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'),
('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', '.'), ('|', ':'),
('news', 'NN')]

Results

Data sets

● 1 million tweets containing the term ‘Ebola’

● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective- 465, subjective- 878)

○ Filtered using terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”.

● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective- 215, subjective- 242)

○ Filtered using terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”.
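The keyword filter used to build the cricket data sets can be sketched as a simple substring test. Case-insensitive matching is an assumption; the slides list only the terms.

```python
SL_ENG_TERMS = ["SLvENG", "SLvsENG", "ENGvSL", "ENGvsSL"]


def matches_any(text, terms):
    """True if the tweet text contains any of the terms, ignoring case."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def filter_tweets(texts, terms):
    """Keep only the tweets that mention at least one filter term."""
    return [t for t in texts if matches_any(t, terms)]
```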

Gold standard data set

● A sample of 500 tweets on the topic ‘ebola’ is annotated manually as objective or subjective (objective- 206, subjective- 294)

● Classifier scores on this data

● Errors: “RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.” It is hard to judge the objectivity of such sentences based only on syntactic information.

Comparison with prior research

● Event related tweets detection with user type recognition (L. Silva, E. Riloff, 2013)

○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk

● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and L. Huang, 2013)

○ An SVM model trained on syntactic features used for sentiment classification

Classifier                       Precision   Recall   F1-score
User type agnostic classifier    83.15       55.99    66.92
User type specific classifier    80.35       66.07    72.15

Features            Accuracy
Unigram + Bigram    81.6
Unigram + POS       81.9

Cross-domain applicability

● The classifier trained on Ebola tweets is applied to cricket related tweets

● The classifier trained on the SLvIndia match performed well on SLvEngland tweets

Summarizer

● Duplicates and near-duplicate tweets are abundant due to Retweets and tweets generated by ‘Tweet’ buttons on news sites

● Removes duplicates in the objective tweets detected by the classifier

● Tweets discussing the same entities are clustered together

Near-duplicate removal

● Objective tweets are stripped of the following: ‘RT’, @-mentions and punctuation

● Jaccard similarity of tokens is used to detect duplicate tweets

● Two tweets are considered similar if their Jaccard similarity is greater than a threshold d
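The near-duplicate check can be sketched as below. Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The threshold value `d=0.8`, the lowercasing, and the greedy "keep the first tweet seen" strategy are assumptions; the slides give only the similarity measure and the stripping rules.

```python
import re


def tokens_for_dedup(text):
    """Strip 'RT', @-mentions and punctuation, then return the token set."""
    text = re.sub(r"\bRT\b", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return set(text.lower().split())  # assumption: case is ignored


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(tweets, d=0.8):
    """Keep a tweet only if it is not similar to any tweet kept so far."""
    kept, kept_tokens = [], []
    for text in tweets:
        toks = tokens_for_dedup(text)
        if all(jaccard(toks, other) <= d for other in kept_tokens):
            kept.append(text)
            kept_tokens.append(toks)
    return kept
```

This pairwise comparison is quadratic in the number of kept tweets, which is workable for a single day of tweets on one topic but would need an index (for example MinHash) at larger scale.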

Clustering

● The goal is to cluster tweets mentioning the same entities together

Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV”

“#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply”

● Vectors of NER tags are converted to Tf-Idf scores, and cosine distance is used as the distance measure between two NER tag vectors

● DBSCAN is selected because it does not require the number of clusters in advance and it can identify arbitrarily shaped clusters
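The entity-centric clustering step can be sketched in plain Python: each tweet is reduced to its set of named entities, the sets are weighted with Tf-Idf, and a small DBSCAN groups tweets whose cosine distance is at most `eps`. The `eps` and `min_pts` values and the IDF formula are assumptions, and the entity lists are expected to come from a separate Named Entity Recognizer.

```python
import math


def tfidf_vectors(entity_sets):
    """Turn each tweet's entity list into a sparse {entity: weight} vector."""
    n = len(entity_sets)
    df = {}
    for ents in entity_sets:
        for e in set(ents):
            df[e] = df.get(e, 0) + 1
    idf = {e: math.log((1 + n) / (1 + c)) + 1.0 for e, c in df.items()}
    return [{e: idf[e] for e in set(ents)} for ents in entity_sets]


def cosine_distance(u, v):
    """1 - cosine similarity; vectors without entities are maximally distant."""
    dot = sum(w * v[e] for e, w in u.items() if e in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps=0.5, min_pts=2):
    """Return one cluster label per vector; -1 marks noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1              # noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: expand the cluster
    return labels


entity_sets = [
    ["Craig", "Spencer", "Ebola"],
    ["Craig", "Spencer", "Ebola", "US"],
    ["Sri Lanka", "India"],
    ["Sri Lanka", "India", "Colombo"],
    ["Obama"],
]
labels = dbscan(tfidf_vectors(entity_sets))
```

In this toy input the two Craig Spencer tweets and the two cricket tweets form separate clusters, while the tweet with a single unshared entity is left as noise.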

Clustering - results

● SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014)

● 13,380 tweets were detected as objective, and 8,138 of them as duplicates. Clustering resulted in 332 clusters, while 2,751 tweets were labeled as noise

● Clusters depend on the quality of the Named Entity Recognizer

Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']

Clustering - discussion

● In contrast, this tweet was labeled as noise

“‘#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK”

entities - ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']

Future work

● Improve cross-domain applicability

○ Finding better features with less dependence on the domain

● A better methodology to evaluate summaries

● Improve clustering to also consider verbs

● Generate an abstractive summary

○ Generate novel sentences from the information contained in tweets

● Generate summaries in real time
