Game of Cricket (IPL) – 2 Teams, 2 Sessions, 11 Players/Team, ~4 Hours
#IPL, #IPL2015 – Official hashtags for the Indian Premier League
Why is it Interesting –
• Emotions on Twitter
• The Buzz of IPL
• IPL on Twitter - 62.7 Million Tweets last Week, Twitter Battle
• Involvement - 101.77 million for first six games
INTRODUCTION
OBJECTIVE
Sentiment Analysis
CONCLUSIONS
1. Classified the sentiment of tweets into 5 categories: Unpleasant, Sad, Neutral, Happy, Pleasant/Ecstatic.
2. Recognized and classified named entities using gazettes, with automated pre-tagging and manual correction of the tweet data.
3. Applied k-means/k-means++ to the tweet data to explore clusters, both seeded from known events in the game of cricket and with unknown cluster initialization.
4. Summarized matches by time-based chunking, identifying the peak chunks, and selecting "summarizing tweets" from those peaks.
5. Visualized the results of all the above methods using Data-Driven Documents (D3) and Python matplotlib.
REFERENCES
1. Clustering
Our Implementation :
- k-means++: The Advantages of Careful Seeding
David Arthur, Sergei Vassilvitskii
http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
Off the shelf Implementation :
- sklearn (k-means, k-means++)
http://scikit-learn.org/stable/modules/clustering.html#k-means
2. Summarization
- Summarizing Sporting Events Using Twitter
Jeffrey Nichols, Jalal Mahmud, Clemens Drews (IBM Research – Almaden)
http://www.jeffreynichols.com/papers/summary-iui2012.pdf
3. Gazette
Events, Venues, Players http://www.iplt20.com/
http://www.cricbuzz.com/
ACKNOWLEDGEMENTS
We would like to thank Prof. Kenji Sagae and Justin Garten for their continuous support and valuable inputs.
To apply the following NLP techniques to tweets about the game of cricket (IPL):
• Sentiment Analysis
• Named Entity Recognition
• Clustering
• Summarization
Kunal Parakh, Preetam Shingavi
Computer Science Graduate Students at University of Southern California, USA
CricTwee – Tweet Analysis for the Game of Cricket
Sentiments, Named Entities, Clusters, Summary
DATASET
Around 1000 manually annotated and corrected Tweets
Gazettes –
NER – Persons, Locations, Venues, Teams
Events – Toss, Wicket, Milestones, Boundaries, Result
Data pipeline (flowchart): Tweepy StreamListener -> DB -> Automated Pre-tagging (POS, NER, EVENTS) with ARK POS Tagger and Gazettes -> Pre-tagged File -> Manual Annotation & Correction -> Tagged Training File
DB -> Untagged Data (input to Clustering and Summarization)
Named Entity Recognition
Training (flowchart): Tagged Training File -> Feature Extraction -> Train Data -> Naïve Bayes Train -> Model File
Classification (flowchart): Development Data + Model File -> Naïve Bayes Classifier -> Result
Feature Extraction – Skipped tokens with the ARK POS tags D (determiner), # (hashtag), P (pre- or postposition), ^ (proper noun), & (coordinating conjunction)
Evaluation
Method                         Accuracy (Approximate)
Megam                          76%
Naïve Bayes Classifier         72%
NLTK Naïve Bayes Classifier    72%
Ngram Naïve Bayes Classifier   58%
Classes – Unpleasant, Sad, Neutral, Happy, Ecstatic
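The Naïve Bayes step above can be illustrated with a small sketch. It is not the poster's implementation: it assumes a bag-of-words model with add-one smoothing, tweets given as (token, POS tag) pairs, and a hand-made toy training set with only two of the five classes. The names extract_features, train and classify are hypothetical.

```python
import math
from collections import Counter, defaultdict

# POS tags skipped during feature extraction (taken from the poster).
SKIPPED_POS = {"D", "#", "P", "^", "&"}

def extract_features(tagged_tokens):
    """Keep lowercased tokens whose POS tag is not in SKIPPED_POS."""
    return [tok.lower() for tok, pos in tagged_tokens if pos not in SKIPPED_POS]

def train(examples):
    """examples: list of (tagged_tokens, label) pairs. Returns a model dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tagged_tokens, label in examples:
        label_counts[label] += 1
        for word in extract_features(tagged_tokens):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}

def classify(model, tagged_tokens):
    """Return the label with the highest add-one-smoothed log posterior."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    features = extract_features(tagged_tokens)
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(model["words"][label].values())
        for word in features:
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            count = model["words"][label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, hand-made examples (hypothetical, not from the poster's dataset).
TRAIN = [
    ([("what", "O"), ("a", "D"), ("brilliant", "A"), ("six", "N")], "Happy"),
    ([("superb", "A"), ("win", "N"), ("tonight", "N")], "Happy"),
    ([("terrible", "A"), ("fielding", "N"), ("again", "R")], "Sad"),
    ([("awful", "A"), ("loss", "N"), ("tonight", "N")], "Sad"),
]
model = train(TRAIN)
prediction = classify(model, [("brilliant", "A"), ("win", "N"), ("#IPL", "#")])
print(prediction)
```

Dropping determiners, hashtags, prepositions, proper nouns and conjunctions leaves mostly the adjectives, verbs and common nouns, which is presumably where the sentiment signal sits.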
Flowchart: Tagged Training File -> Feature Extraction -> Classifier (with Gazettes) -> Result
Named Entities
Named Entities – Persons, Locations, Teams, Venues
Feature Extraction – BIO encoding for tokens with POS "^"
Evaluation – Manually checked the classified Named Entities against the entities in gazettes.
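As an illustration of BIO encoding over proper-noun tokens, the sketch below labels runs of tokens tagged "^" by longest match against a gazette. The poster does not specify its classifier, so the plain gazette lookup, the four sample entries and the bio_tag function are all assumptions made for this example.

```python
# Hypothetical mini-gazette: lowercased entity phrase -> entity type.
GAZETTE = {
    ("ms", "dhoni"): "PERSON",
    ("chennai", "super", "kings"): "TEAM",
    ("eden", "gardens"): "VENUE",
    ("kolkata",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTE)

def bio_tag(tagged_tokens):
    """Assign B-/I-/O labels to (token, POS) pairs using the gazette.

    Only spans made entirely of proper nouns (POS "^") are candidates.
    """
    labels = ["O"] * len(tagged_tokens)
    i = 0
    while i < len(tagged_tokens):
        if tagged_tokens[i][1] != "^":
            i += 1
            continue
        matched = False
        # Try the longest possible gazette match starting at position i.
        for n in range(min(MAX_LEN, len(tagged_tokens) - i), 0, -1):
            span = tagged_tokens[i:i + n]
            if any(pos != "^" for _, pos in span):
                continue
            key = tuple(tok.lower() for tok, _ in span)
            if key in GAZETTE:
                entity_type = GAZETTE[key]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1  # proper noun not in the gazette stays "O"
    return labels

tweet = [("MS", "^"), ("Dhoni", "^"), ("wins", "V"), ("toss", "N"),
         ("at", "P"), ("Eden", "^"), ("Gardens", "^")]
labels = bio_tag(tweet)
print(list(zip([tok for tok, _ in tweet], labels)))
```

The B- prefix marks the first token of an entity and I- marks its continuation, so adjacent entities of the same type can still be told apart.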
Clustering
Flowchart: DB -> Untagged Data -> Feature Extraction -> Clustering -> Result (k clusters)
Clustering methods: Scikit-learn Clustering (k-means++); TF-IDF & Cosine Similarity with K-means clustering
Cluster initialization: Known Events; Unknown Events
Pre-defined Clusters – Tweets belonging to each Event (Toss, Wickets, Boundaries, Milestones, Result)
Evaluation – Exploratory
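The TF-IDF, cosine similarity and k-means++ steps named in this panel can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the poster's code: tweets are already tokenized, the IDF is log(n/df) + 1, distance is 1 minus cosine similarity, and all function names are hypothetical. In practice scikit-learn's KMeans would replace most of this.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: (c / len(doc)) * (math.log(n / df[t]) + 1.0)
             for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def kmeanspp_seeds(vectors, k, rng):
    """k-means++ seeding: draw each new seed with probability proportional
    to its squared distance from the nearest seed chosen so far."""
    seeds = [rng.randrange(len(vectors))]
    while len(seeds) < k:
        d2 = [min(1.0 - cosine(v, vectors[s]) for s in seeds) ** 2 for v in vectors]
        for s in seeds:
            d2[s] = 0.0  # never re-draw an existing seed
        if sum(d2) == 0:
            break  # every point coincides with a seed
        seeds.append(rng.choices(range(len(vectors)), weights=d2)[0])
    return seeds

def assign(vectors, centroids):
    """Label each vector with the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]

def mean_vector(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in members:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}

def kmeans(vectors, k, iterations=10, seed=0):
    """Lloyd-style iterations with k-means++ seeding and cosine assignment."""
    rng = random.Random(seed)
    centroids = [dict(vectors[i]) for i in kmeanspp_seeds(vectors, k, rng)]
    labels = assign(vectors, centroids)
    for _ in range(iterations):
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
        new_labels = assign(vectors, centroids)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels

# Toy tweets (hypothetical): three about sixes, three about wickets.
docs = [t.split() for t in [
    "dhoni hits huge six", "huge six over long on", "what a six by dhoni",
    "wicket falls clean bowled", "bowled him wicket number three",
    "another wicket bowled again"]]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

For the known-events case, one option would be to skip the seeding and start the centroids from keyword vectors for Toss, Wickets, Boundaries, Milestones and Result; whether the poster did this is not stated.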
Summarization
Flowchart: DB -> Untagged Data -> Timeline Chunking -> Chunk Filter (Find Peaks) -> Scikit-learn Clustering (k-means++) -> Keywords -> Summary Filter (with Gazettes) -> Top 5 Tweets of Each Chunk -> Result (Summary)
Chunking – Segregate time-stamped tweets into chunks of k minutes.
Chunk Filter – Find peaks: chunks whose tweet volume exceeds a threshold calculated by averaging over all the tweets.
Summary Filter – Calculate scores for tweets based on keywords from clusters and events from gazettes.
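The three summarization steps can be sketched as follows. This is an illustrative reading rather than the poster's implementation: it assumes timestamps in seconds, takes the threshold to be the mean tweet count per non-empty chunk, and scores a tweet by how many of its words appear in a keyword set. The function names and sample data are hypothetical.

```python
from collections import defaultdict

def chunk_tweets(tweets, k_minutes):
    """Group (timestamp_seconds, text) pairs into consecutive k-minute chunks."""
    chunks = defaultdict(list)
    for timestamp, text in tweets:
        chunks[int(timestamp // (k_minutes * 60))].append(text)
    return dict(chunks)

def find_peaks(chunks):
    """Keep chunks whose tweet count exceeds the mean count per non-empty chunk."""
    threshold = sum(len(texts) for texts in chunks.values()) / len(chunks)
    return {cid: texts for cid, texts in chunks.items() if len(texts) > threshold}

def keyword_score(text, keywords):
    """Number of words in the tweet that appear in the keyword set."""
    return sum(word in keywords for word in text.lower().split())

def summarize(chunks, keywords, top_n=5):
    """For each peak chunk, return up to top_n tweets ranked by keyword score."""
    summary = {}
    for cid, texts in sorted(find_peaks(chunks).items()):
        ranked = sorted(texts, key=lambda t: keyword_score(t, keywords), reverse=True)
        summary[cid] = ranked[:top_n]
    return summary

# Toy timeline (hypothetical): a burst of tweets in the second 5-minute chunk.
tweets = [
    (30, "quiet start to the innings"),
    (310, "huge six from gayle"),
    (320, "six and then a wicket what an over"),
    (400, "crowd goes wild"),
    (450, "wicket falls"),
    (700, "drinks break"),
]
keywords = {"six", "wicket", "toss"}  # stand-in for cluster and gazette keywords
chunks = chunk_tweets(tweets, k_minutes=5)
summary = summarize(chunks, keywords, top_n=2)
print(summary)
```

With these six tweets only the middle chunk exceeds the average volume, so the summary contains the two tweets from that chunk that mention the most keywords.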
CONTACT
Kunal Parakh Preetam Shingavi
Email – [email protected] Email – [email protected]
CSCI 544 – Advanced Natural Language Processing
University of Southern California
Why is it Challenging –
• About Tweets – Unstructured, Annotation Task
• Manual Analysis
• Dynamic Data
• Evaluation of Models