
RESEARCH POSTER PRESENTATION DESIGN © 2012

www.PosterPresentations.com

Game of Cricket (IPL) – 2 Teams, 2 Sessions, 11 Players/Team, ~4 Hours

#IPL, #IPL2015 – Official hashtags for the Indian Premier League

Why is it Interesting –

• Emotions on Twitter

• The Buzz of IPL

• IPL on Twitter - 62.7 Million Tweets last Week, Twitter Battle

• Involvement - 101.77 million for first six games

INTRODUCTION

OBJECTIVE

Sentiment Analysis

MATERIALS AND METHODS

RESULTS

CONCLUSIONS

1. Successfully classified the sentiment of tweets into 5 categories - Unpleasant, Sad, Neutral, Happy, Pleasant/Ecstatic.

2. Recognized and classified named entities using gazettes, with automated pre-tagging followed by manual correction of the tweet data.

3. Successfully applied k-means/k-means++ to the tweet data to explore clusters, both seeded from known events in the game of cricket and with unknown cluster initialization.

4. Successfully performed summarization by time-based chunking, identifying the peaks, and then providing "summarizing tweets" from the peak chunks.

5. Successfully visualized all of the above methods using Data-Driven Documents (D3) and Python's matplotlib.

REFERENCES

1. Clustering

Our Implementation:

- k-means++: The Advantages of Careful Seeding

David Arthur, Sergei Vassilvitskii

http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

Off-the-shelf Implementation:

- sklearn (k-means, k-means++)

http://scikit-learn.org/stable/modules/clustering.html#k-means

2. Summarization

- Summarizing Sporting Events Using Twitter

Jeffrey Nichols, Jalal Mahmud, Clemens Drews

IBM Research – Almaden

http://www.jeffreynichols.com/papers/summary-iui2012.pdf

3. Gazettes

Events, Venues, Players

http://www.iplt20.com/

http://www.cricbuzz.com/

ACKNOWLEDGEMENTS

We would like to thank Prof. Kenji Sagae and Justin Garten for their continuous support and valuable inputs.

To apply the following NLP techniques to tweets about the game of cricket (IPL):

• Sentiment Analysis

• Named Entity Recognition

• Clustering

• Summarization

Kunal Parakh, Preetam Shingavi

Computer Science Graduate Students at University of Southern California, USA

CricTwee – Tweet Analysis for the Game of Cricket

Sentiments, Named Entities, Clusters, Summary

DATASET

Around 1000 manually annotated and corrected tweets

Gazettes –

NER – Persons, Locations, Venues, Teams

Events – Toss, Wicket, Milestones, Boundaries, Result

Data collection and annotation (diagram blocks): Tweepy StreamListener, DB, Automated Pre-tagging (POS, NER, EVENTS), ARK POS Tagger, Gazettes, Pre-tagged File, Manual Annotation & Correction, Tagged Training File, Untagged Data

Named Entity Recognition


Sentiment classification (diagram blocks): Train Data, Feature Extraction, Naïve Bayes Train, Model File, Development Data, Naïve Bayes Classifier, Result

Feature Extraction – Skipped tokens with the ARK POS tags D (determiner), # (hashtag), P (pre- or postposition), ^ (proper noun), & (coordinating conjunction)

Evaluation

Method                          Accuracy (Approximate)
Megam                           76%
Naïve Bayes Classifier          72%
NLTK Naïve Bayes Classifier     72%
Ngram Naïve Bayes Classifier    58%

Classes – Unpleasant, Sad, Neutral, Happy, Ecstatic
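The feature extraction and Naïve Bayes steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy tagged tweets, the function names, and the add-one smoothing are all assumptions. Only the skipped POS tags and the five class names come from the poster.

```python
import math
from collections import Counter, defaultdict

# POS tags whose tokens are skipped (from the poster; ARK tagset symbols).
SKIPPED_TAGS = {"D", "#", "P", "^", "&"}


def extract_features(tagged_tweet):
    """Keep lowercased tokens whose POS tag is not in SKIPPED_TAGS."""
    return [tok.lower() for tok, tag in tagged_tweet if tag not in SKIPPED_TAGS]


def train_naive_bayes(examples):
    """examples: list of (tagged_tweet, label) pairs. Returns a model dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tagged_tweet, label in examples:
        label_counts[label] += 1
        for word in extract_features(tagged_tweet):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def classify(model, tagged_tweet):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    features = extract_features(tagged_tweet)
    best_label, best_score = None, float("-inf")
    for label, count in model["labels"].items():
        score = math.log(count / total)
        denom = sum(model["words"][label].values()) + vocab_size
        for word in features:
            score += math.log((model["words"][label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-tagged tweets: (token, ARK POS tag) pairs.
TRAIN = [
    ([("what", "O"), ("a", "D"), ("brilliant", "A"), ("six", "N")], "Ecstatic"),
    ([("great", "A"), ("win", "N"), ("#IPL", "#")], "Happy"),
    ([("terrible", "A"), ("bowling", "N"), ("today", "N")], "Unpleasant"),
    ([("sad", "A"), ("to", "P"), ("lose", "V")], "Sad"),
    ([("match", "N"), ("starts", "V"), ("at", "P"), ("eight", "$")], "Neutral"),
]

model = train_naive_bayes(TRAIN)
prediction = classify(
    model, [("brilliant", "A"), ("six", "N"), ("by", "P"), ("Dhoni", "^")]
)
```

Skipping the "^" and "#" tokens keeps player names and hashtags from acting as sentiment cues, which is one plausible reading of why those tags were dropped.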


Named entity recognition (diagram blocks): Feature Extraction, Gazettes, Classifier, Result, Named Entities

Named Entities – Persons, Locations, Teams, Venues

Feature Extraction – BIO encoding for tokens with POS “^”

Evaluation – Manually checked the classified named entities against the entities in the gazettes.
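As an illustration of BIO encoding over proper-noun tokens, the sketch below groups consecutive tokens tagged "^" into one span and looks the span up in a gazette. The miniature gazettes, the function names, and the fallback type "MISC" are hypothetical; the poster does not show its gazette contents or how unmatched spans were handled.

```python
# Hypothetical miniature gazettes; the poster's real gazettes are not shown.
GAZETTES = {
    "PERSON": {"ms dhoni", "virat kohli"},
    "TEAM": {"mumbai indians"},
    "VENUE": {"eden gardens"},
    "LOCATION": {"kolkata"},
}


def lookup_type(phrase):
    """Return the gazette type containing the phrase, or 'MISC' (assumed)."""
    key = phrase.lower()
    for entity_type, names in GAZETTES.items():
        if key in names:
            return entity_type
    return "MISC"


def bio_encode(tagged_tweet):
    """Assign B-/I-/O labels, grouping consecutive proper-noun ('^') tokens."""
    labels = []
    i = 0
    while i < len(tagged_tweet):
        if tagged_tweet[i][1] != "^":
            labels.append("O")
            i += 1
            continue
        # Extend the span over the run of consecutive '^' tokens.
        j = i
        while j < len(tagged_tweet) and tagged_tweet[j][1] == "^":
            j += 1
        phrase = " ".join(tok for tok, _ in tagged_tweet[i:j])
        entity_type = lookup_type(phrase)
        labels.append("B-" + entity_type)
        labels.extend("I-" + entity_type for _ in range(j - i - 1))
        i = j
    return labels


tweet = [("MS", "^"), ("Dhoni", "^"), ("wins", "V"), ("at", "P"),
         ("Eden", "^"), ("Gardens", "^")]
labels = bio_encode(tweet)
# labels == ["B-PERSON", "I-PERSON", "O", "O", "B-VENUE", "I-VENUE"]
```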

Clustering

Clustering (diagram blocks): DB, Untagged Data, Feature Extraction, TFIDF & Cosine Similarity, K-means clustering, Scikit-learn Clustering (k-means++), Known Events, Unknown Events, Result (k-clusters)

Pre-defined Clusters – Tweets belonging to each Event (Toss, Wickets, Boundaries, Milestones, Result)

Evaluation – Exploratory
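The TF-IDF and k-means++ steps named above can be sketched in plain Python as follows. This is an assumption-laden illustration, not the poster's code: the smoothed IDF formula, the function names, and the sample tweets are invented for the example, and it uses squared Euclidean distance on unit-length TF-IDF vectors, which ranks neighbours the same way cosine similarity does.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        # Assumed IDF variant: log(n / df) + 1, so shared terms keep weight.
        vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeanspp_seeds(vectors, k, rng):
    """k-means++ seeding: pick each new centre with probability ~ D(x)^2.

    Assumes the data contains at least k distinct vectors.
    """
    centres = [rng.choice(vectors)]
    while len(centres) < k:
        d2 = [min(sq_dist(v, c) for c in centres) for v in vectors]
        centres.append(rng.choices(vectors, weights=d2, k=1)[0])
    return centres


def mean_vector(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in members:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans(vectors, k, seed=0, iterations=20):
    """Lloyd's algorithm with k-means++ seeding; returns a cluster id per vector."""
    rng = random.Random(seed)
    centres = kmeanspp_seeds(vectors, k, rng)
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centres[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no tweet changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centres[c] = mean_vector(members)
    return assignment


# Hypothetical tweets, whitespace-tokenised for simplicity.
tweets = ["dhoni hits a huge six", "six over the boundary",
          "wicket falls clean bowled", "another wicket bowled him"]
cluster_ids = kmeans(tfidf_vectors([t.split() for t in tweets]), k=2)
```

For the pre-defined (known event) clusters, one option would be to replace `kmeanspp_seeds` with centres built from the event keywords, though the poster does not say how that initialization was done.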

Summarization

Summarization (diagram blocks): DB, Untagged Data, Timeline Chunking, Chunk Filter, Find Peaks, Top 5 Tweets of Each Chunk, Gazettes, k-means++, Keywords, Summary Filter, Result (Summary)

Chunking – Segregate time-stamped tweets into chunks of k minutes.

Chunk Filter – Find peaks using a threshold calculated by averaging the tweet counts over all chunks.

Summary Filter – Calculate scores for tweets based on keywords from clusters and events from gazettes.
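The three summarization stages can be sketched as below. It is a rough illustration under stated assumptions: the threshold is read as the mean tweet count per non-empty chunk, the score is a simple count of matching keywords, and the function names, timestamps, and keyword set are hypothetical.

```python
from collections import defaultdict


def chunk_tweets(tweets, k_minutes):
    """Group (timestamp_seconds, text) pairs into k-minute chunks by index."""
    chunks = defaultdict(list)
    for timestamp, text in tweets:
        chunks[int(timestamp // (k_minutes * 60))].append(text)
    return dict(chunks)


def find_peaks(chunks):
    """Return ids of chunks whose tweet count exceeds the mean count per chunk."""
    threshold = sum(len(texts) for texts in chunks.values()) / len(chunks)
    return sorted(cid for cid, texts in chunks.items() if len(texts) > threshold)


def score(text, keywords):
    """Count how many keywords occur among the tweet's lowercased tokens."""
    return len(set(text.lower().split()) & keywords)


def summarize(tweets, k_minutes, keywords, top_n=5):
    """Return the top_n highest-scoring tweets for every peak chunk."""
    chunks = chunk_tweets(tweets, k_minutes)
    summary = {}
    for cid in find_peaks(chunks):
        ranked = sorted(chunks[cid], key=lambda t: score(t, keywords), reverse=True)
        summary[cid] = ranked[:top_n]
    return summary


# Hypothetical stream: (seconds since match start, tweet text).
stream = [
    (0, "toss time"), (30, "waiting for the start"),
    (310, "what a six"), (320, "huge six by gayle"),
    (330, "six again"), (340, "crowd goes wild"),
    (650, "drinks break"),
]
# Keywords would come from cluster terms and gazette events; invented here.
summary = summarize(stream, k_minutes=5, keywords={"six", "wicket", "toss"}, top_n=2)
# summary == {1: ["what a six", "huge six by gayle"]}
```

With 5-minute chunks the counts are 2, 4, and 1, so only the second chunk exceeds the mean and is treated as a peak.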


CONTACT

Kunal Parakh Preetam Shingavi

Email – [email protected] Email – [email protected]

CSCI 544 – Advanced Natural Language Processing

University of Southern California

Why is it Challenging –

• About Tweets – Unstructured, Annotation Task

• Manual Analysis

• Dynamic Data

• Evaluation of Models

