Joseph Orilogbon Luis Lasierra
Bin Shen
5/12/14 Semantic Technologies in IBM Watson 1
Discovering why Topics are Trending on Twitter
*We set out to explain why topics are trending on Twitter
*The main approach was summarization
*News breaks on Twitter
*Twitter -> prominent way of expressing opinions on the Internet
*Why people are talking about a particular topic in a given location
*Commercial interest
*Summarization of trending topics on Twitter
*Categorization of topics
*Named-entity extraction for trending topics
*http://whytrend.intelworx.com
*Speech Act Guided Summarization
*Phrase Ranking using MLE
*Phrase Extraction using POS filtering
*Salience Score of Extracted Phrases
*Summary generation using templates
*Speech acts include: Statement [sta], Question [que], Comment [com], Suggestion [sug] and Miscellaneous [mis]
*Speech act classification is a multiclass problem
  *A k-nearest neighbors approach was used for classification
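The k-nearest neighbors step can be illustrated with a minimal sketch. It assumes bag-of-words features, cosine similarity and k = 3; the slides do not state the actual features, distance measure or k, and the training tweets below are invented for illustration only.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (an assumed feature set)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(tweet, training, k=3):
    """Return the majority speech-act label among the k most similar training tweets."""
    vec = bow(tweet)
    ranked = sorted(training, key=lambda ex: cosine(vec, bow(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples using the five speech-act tags from the slide.
TRAINING = [
    ("the game starts at nine tonight", "sta"),
    ("the new phone was released today", "sta"),
    ("what time does the game start ?", "que"),
    ("who won the game last night ?", "que"),
    ("when does the game start ?", "que"),
    ("you should watch the game tonight", "sug"),
    ("you should try the new phone", "sug"),
    ("that game was amazing", "com"),
    ("lol", "mis"),
]

label = knn_classify("what time does the game start tonight ?", TRAINING)
```

With more classes than neighbors, ties are possible; a real system would likely need a tie-breaking rule or distance-weighted votes.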
*Extracted phrases were ranked using the following equation:
*$\mathrm{score}(g) = \log \dfrac{L(\text{words in } g \text{ are independent})}{L(\text{words in } g \text{ are dependent})}$
*Dependence and independence were measured against a background Twitter corpus built from 550,000 tweets
*For lengths 1 to L, we extract the top 50 phrases
  *L is a model parameter for the maximum phrase length
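As an illustration of the likelihood-ratio ranking, the sketch below scores bigrams only and assumes binomial likelihoods in the style of Dunning's log-likelihood ratio; the slides do not specify the likelihood model. Counts come from a tiny invented token list rather than the 550,000-tweet background corpus, and all function names are hypothetical.

```python
import math
from collections import Counter

def _log_l(k, n, p):
    """Binomial log-likelihood of k successes in n trials.
    The binomial coefficient is omitted because it cancels in the ratio."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def bigram_score(w1, w2, unigrams, bigrams, n_tokens):
    """log L(independent) - log L(dependent) for the bigram (w1, w2)."""
    c1, c2, c12 = unigrams[w1], unigrams[w2], bigrams[(w1, w2)]
    p = c2 / n_tokens                    # independent: P(w2) ignores w1
    p1 = c12 / c1                        # dependent: P(w2 | w1)
    p2 = (c2 - c12) / (n_tokens - c1)    # dependent: P(w2 | not w1)
    log_indep = _log_l(c12, c1, p) + _log_l(c2 - c12, n_tokens - c1, p)
    log_dep = _log_l(c12, c1, p1) + _log_l(c2 - c12, n_tokens - c1, p2)
    return log_indep - log_dep

def rank_bigrams(tokens, top_n=50):
    """Return up to top_n bigrams, most strongly associated first.
    More negative scores mean stronger evidence of dependence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = {bg: bigram_score(bg[0], bg[1], unigrams, bigrams, len(tokens))
              for bg in bigrams}
    return sorted(scored.items(), key=lambda kv: kv[1])[:top_n]

# Invented toy corpus standing in for the background corpus.
TOKENS = ("i love san francisco . the fog in san francisco . "
          "the bridge in san francisco . the fog . the bridge").split()

ranked = rank_bigrams(TOKENS)
```

In this toy corpus "san francisco" always occurs as a unit, so it receives a more negative score than "the fog", where "the" also precedes other words.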
*Extracted N-grams are only useful if they are:
  *Nouns or noun phrases
  *Verbs or verb-centered phrases
*After extracting N-grams, those not matching the required patterns were filtered out using regular expressions over their POS tag patterns
*Tagging was done before extracting N-grams to give the tagger the proper context
*Different patterns are suitable for different speech acts
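The POS-pattern filter can be sketched as follows. The tweet is assumed to be tagged already, using single-character tags in the style of the Twitter POS tagset (N common noun, ^ proper noun, V verb, A adjective, D determiner, R adverb). The two regular expressions are illustrative guesses, not the patterns used in the project.

```python
import re

# Assumed patterns over concatenated single-character POS tags.
NOUN_PHRASE = re.compile(r"[AN^]*[N^]")      # optional modifiers, ending in a noun
VERB_PHRASE = re.compile(r"V[DAN^]*[N^]|V")  # a verb, optionally followed by an object

def ngrams(tagged, n):
    """All contiguous n-grams of (word, tag) pairs."""
    return [tagged[i:i + n] for i in range(len(tagged) - n + 1)]

def filter_phrases(tagged, max_len=3, patterns=(NOUN_PHRASE, VERB_PHRASE)):
    """Keep n-grams whose full tag sequence matches one of the patterns.
    Tagging happens before n-gram extraction, so tags reflect the full tweet."""
    kept = []
    for n in range(1, max_len + 1):
        for gram in ngrams(tagged, n):
            tags = "".join(tag for _, tag in gram)
            if any(p.fullmatch(tags) for p in patterns):
                kept.append(" ".join(word for word, _ in gram))
    return kept

tagged_tweet = [("the", "D"), ("new", "A"), ("iphone", "^"),
                ("launches", "V"), ("today", "R")]
phrases = filter_phrases(tagged_tweet)
```

Passing a different `patterns` tuple per speech act is one way to apply different patterns to different speech acts.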
*This is another round of phrase ranking, based on how "salient" the phrases are within the given topic
*Salience score is given as $SS(ng_i) = GS(ng_i) \times l_i$
  *$l_i$ is the length of N-gram $ng_i$
  *$GS(ng_i)$ is a graph score obtained by iterating over a graph G=(V, E), where V is the set of N-grams and E is a set of edges weighted by the number of times the N-grams co-occur
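A minimal sketch of the salience computation is shown below. It assumes the graph score is a PageRank-style iteration over the weighted co-occurrence graph with a damping factor of 0.85; the slides only say the score is obtained by iterating over the graph, so the update rule, damping value and iteration count are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(tweets_phrases):
    """Edge weights counting how often two phrases appear in the same tweet.
    tweets_phrases: list of sets of phrases, one set per tweet."""
    weights = defaultdict(float)
    for phrases in tweets_phrases:
        for a, b in combinations(sorted(phrases), 2):
            weights[(a, b)] += 1.0
            weights[(b, a)] += 1.0
    return weights

def graph_scores(weights, damping=0.85, iterations=50):
    """PageRank-style scores on the weighted graph (assumed update rule)."""
    nodes = sorted({a for a, _ in weights})
    out_weight = defaultdict(float)
    for (a, _), w in weights.items():
        out_weight[a] += w
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            incoming = sum(scores[m] * weights[(m, n)] / out_weight[m]
                           for m in nodes if (m, n) in weights)
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = new
    return scores

def salience(weights):
    """Salience = graph score times phrase length in words."""
    gs = graph_scores(weights)
    return {p: gs[p] * len(p.split()) for p in gs}

tweets = [{"iphone", "new iphone", "apple"},
          {"new iphone", "apple"},
          {"iphone", "apple"}]
weights = cooccurrence_graph(tweets)
gs = graph_scores(weights)
sal = salience(weights)
```

Multiplying by length favors longer phrases: here "iphone" and "new iphone" have the same graph score, so the two-word phrase ends up with twice the salience.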
*A greedy strategy was used to select the most salient phrases
*Phrases were used to fill templates
*Speech acts were used to describe how people are talking about the salient phrases
*Redundant phrases were detected using a Jaccard coefficient threshold of 0.275
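The selection and template steps can be sketched as below. The sketch assumes the Jaccard coefficient is computed over word sets and that a phrase counts as redundant when its coefficient with an already selected phrase exceeds 0.275. The template wording and the verb table are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient between the word sets of two phrases."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def select_phrases(salience, max_phrases=3, threshold=0.275):
    """Greedily take phrases in decreasing salience, skipping redundant ones."""
    selected = []
    for phrase in sorted(salience, key=salience.get, reverse=True):
        if all(jaccard(phrase, s) <= threshold for s in selected):
            selected.append(phrase)
        if len(selected) == max_phrases:
            break
    return selected

# Hypothetical template: one verb per speech-act tag.
VERBS = {"sta": "stating", "que": "asking about", "com": "commenting on",
         "sug": "suggesting", "mis": "mentioning"}

def fill_template(topic, speech_act, phrases):
    """Fill a fixed sentence template with the selected phrases."""
    return f"People are {VERBS[speech_act]} {', '.join(phrases)} in relation to {topic}."

scores = {"new iphone": 0.9, "iphone": 0.5, "apple event": 0.4,
          "new iphone launch": 0.3, "tim cook": 0.2}
chosen = select_phrases(scores)
summary = fill_template("#iPhone", "com", chosen[:2])
```

In this example "iphone" and "new iphone launch" are skipped because they overlap too heavily with the already selected "new iphone".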
*The main reference is Zhang et al., 2013
*Speech acts were not used for filtering out tweets
*Two rounds of POS filtering were done, as opposed to one in the original paper
*A greedy strategy was used, as opposed to the round-robin strategy used in the original paper
*Representative tweets were also presented to give the user some sense of context.
*Speech act training data set (Liu et al.) for speech act classification
*Sentiment140 dataset for the background corpus
*TweetMotif dataset (O'Connor et al., 2010) for the background corpus
*Twitter NLP (Gimpel et al.) for POS tagging
*Tweets collected via the Twitter API for testing the summarization model; see examples on the site
*Entity extraction
  *Preprocessing, proper noun extraction
  *Google Knowledge Graph: Freebase
*Categorization
  *uClassify API
  *Extract the highest-ranking category
*Front end
  *Auto-detection or manual selection of location
  *Displays trending topics
  *Sends requests to the server to analyze topics
*Back end
  *Tweet retrieval
  *Analysis using the summarization model
  *Sends results to the Freebase and uClassify APIs
  *Caches results
*Front end: HTML5, JavaScript, Google Maps API, AngularJS, jQuery
*Back end: Java / Play framework and MySQL database
*Hosted on AWS
*Asked users to provide feedback on results
*Questions covered all 3 parts of the project
*Received 19 responses as of the time of making this slide
*[Survey result charts] Avg = 3.89; Avg = 4.00; Avg = 4.21; Avg = 3.84; Avg = 4.16
* Liu, Fei, Yang Liu, and Fuliang Weng. "Why is SXSW trending?: Exploring multiple text sources for Twitter topic summarization." 2011. 66--75.
* O'Connor, Brendan, Michel Krieger, and David Ahn. "TweetMotif: Exploratory Search and Topic Summarization for Twitter." 2010.
* Zhang, Renxian, Wenjie Li, Dehong Gao, and You Ouyang. "Automatic Twitter Topic Summarization With Speech Acts." IEEE Transactions on Audio, Speech, and Language Processing 21 (2013): 649--658.
* Gimpel, Kevin, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. "Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments." In Proceedings of ACL 2011.
* Abeel, Thomas, Yves Van de Peer, and Yvan Saeys. "Java-ML: A Machine Learning Library." Journal of Machine Learning Research 10 (2009): 931--934.
*Tweets under a topic are only loosely grouped together and sometimes have little in common
*Low performance of speech act classification
*Detection of the main entity
*Normalization of tweets sometimes produced odd results
*Twitter API rate limits: 180 search queries per user per application per 15 minutes
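The rate limit works out to one search query every 5 seconds on average (900 s / 180). One way to stay under such a limit is a sliding-window counter; the sketch below is an illustration with a hypothetical class name, not the project's back-end code, and takes the current time as an argument so it can be exercised without real waiting.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any window of window_seconds."""

    def __init__(self, max_calls=180, window_seconds=15 * 60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()  # timestamps of recorded calls, oldest first

    def wait_time(self, now):
        """Seconds to wait before another call is allowed at time `now`."""
        # Drop calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self, now):
        """Register that a call was made at time `now`."""
        self.calls.append(now)

# Small demonstration with a 2-calls-per-10-seconds limit.
limiter = SlidingWindowLimiter(max_calls=2, window_seconds=10)
limiter.record(0)
limiter.record(1)
delay = limiter.wait_time(2)
```

A caller would sleep for `wait_time(now)` seconds before each search request and call `record` once the request is sent.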
*Real-time indexing of tweets before they start trending, using Lucene, Elasticsearch or another full-text engine
*Detection of sentence overlap in the selected phrases
*Detecting redundancies semantically.
*Different templates for various topic categories.