Presented at the Grace Hopper Celebration in the Data Science track. The presentation covers sentiment analysis with HP Vertica Pulse and how to improve its accuracy through the right pre-processing methods and training with a naive Bayes classifier.
2014
Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree
Bo-Hyun Kim, Sr. Software Engineer
HP Big Data Business Unit
Oct 10th, 2014
#GHC14
What to Expect
Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− Different point of view using the most-mentioned word tree
What I Expect
A 5-star rating on GHC app
I just expect you to enjoy and learn!
Sentiment Analysis
In plain English
− Sentiment analysis is the process of automatically detecting whether a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”). It is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]
Gimme Examples!
Also known as:
− Opinion Mining
− Text Mining
Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”
Why are we interested?
Increasing (every minute!) web usage
− Articles
− Blogs
− Comments
Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product
Product Review
Data… Data… Data…
HP Vertica Pulse
How to Analyze?
Lexicon-based approach – HP Labs [Zhang et al., 2011]
Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion
Determine the Semantic Orientation score of opinion lexicons

Word       Semantic Orientation Value
Fabulous   +3
Good       +1
Bad        -1
Nasty      -3
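
A lexicon lookup like the table above can be sketched in a few lines of Python. This is only an illustration, not how HP Vertica Pulse is implemented: the function name `orientation_sum` and the tokenizing rule are assumptions, and the lexicon holds just the four words from the table.

```python
import re

# Semantic orientation values copied from the table above.
# A production lexicon would contain far more words.
LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}


def orientation_sum(text):
    """Sum the semantic orientation values of every lexicon word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


# "fabulous" contributes +3 and "bad" contributes -1.
print(orientation_sum("The screen is fabulous but the battery is bad"))  # 2
```

Words that are not in the lexicon contribute nothing, so a sentence with no opinion words sums to 0.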
Sentiment Scoring
Input: text or sentence
Output: For each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment
Entity-level lexicon-based sentiment scoring
Limitation
Semantic Orientation value(‘missed’) = -1
Opinion words located closer to the entity are given more weight
Accuracy can suffer
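
The limitation can be illustrated with a small distance-weighted scorer. The slides do not give the exact formula Pulse uses, so everything here is an assumption made for illustration: each opinion word's value is divided by its token distance to the entity (in the spirit of the lexicon-based approach cited earlier), the result is clamped to the -1 to 1 range, and the example sentences are hypothetical.

```python
import re

# Toy lexicon: the four words from the earlier table plus 'missed' = -1.
LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3, "missed": -1}


def entity_score(text, entity):
    """Distance-weighted sentiment for one entity, clamped to [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok == entity.lower()]
    if not positions:
        return 0.0  # entity not mentioned: treat as neutral
    total = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok)
        if value is None:
            continue
        distance = min(abs(i - p) for p in positions)
        if distance == 0:
            continue  # the entity itself is not scored as an opinion word
        total += value / distance  # closer opinion words weigh more
    return max(-1.0, min(1.0, total))


sentence = "The phone is good although the bundled charger is bad"
print(entity_score(sentence, "phone"))    # 0.375
print(entity_score(sentence, "charger"))  # -0.25

# 'missed' is scored as negative even though the writer is fond of the phone.
print(entity_score("I missed my old phone", "phone"))
```

The last call returns a negative score (about -0.33) for a sentence most readers would call positive about the phone, which is the kind of accuracy loss this slide points at.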
Improve accuracy
Accuracy is what we strive for!
More robust pre-processing
− Prune data to fit different types of user opinion (e.g. Twitter vs. YouTube comments)
Naïve Bayes Classifier Training
Tune accordingly
Data Set
Test dataset
− Collected by Stanford students
− In 2009
− Over 3 million tweets with tested score
− Analyzed 3500 tweets
Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets over 30 days
Data Pruning
Remove
− Job postings
• #job, #jobs, #tweetmyjob
− Links
• http://this.is/nogood
− Duplicates
− Twitter-specific characters
• RT, @, #
− Emoticons
• “I hate my life :-)”: sarcasm is a widespread disease
After pruning
− ~287,000 tweets, 24% of the 1.2 million tweets
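
The pruning rules listed above could look roughly like this in Python. It is a minimal sketch, not the actual pipeline used for the talk: the regular expressions, the small emoticon pattern, and the `prune` function name are all assumptions.

```python
import re

JOB_TAGS = re.compile(r"#(?:jobs?|tweetmyjob)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"\bRT\b")
EMOTICON = re.compile(r"[:;=]-?[()DPp]")  # covers only a few common emoticons


def prune(tweets):
    """Drop job postings and duplicates; strip links, emoticons, RT, @ and #."""
    seen = set()
    kept = []
    for tweet in tweets:
        if JOB_TAGS.search(tweet):
            continue  # job posting: discard the whole tweet
        text = URL.sub(" ", tweet)
        text = EMOTICON.sub(" ", text)
        text = RETWEET.sub(" ", text)
        text = text.replace("@", " ").replace("#", " ")
        text = " ".join(text.split())  # collapse leftover whitespace
        key = text.lower()
        if text and key not in seen:  # skip empty results and duplicates
            seen.add(key)
            kept.append(text)
    return kept


sample = [
    "RT @bob: I love my new car http://this.is/nogood",
    "Hiring now #jobs #tweetmyjob",
    "I hate my life :-)",
    "RT @bob: I love my new car http://other.link/x",
]
print(prune(sample))  # ['bob: I love my new car', 'I hate my life']
```

Duplicates are detected after cleaning, so two retweets that differ only in their link collapse into one entry. As a check on the figures above, 287,000 / 1,200,000 is about 0.239, which rounds to the 24% quoted.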
Naïve Bayes Classifier
Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires a small amount of data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies the object based on its included features
− Open source found at [nltk.org]
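
To make the independence assumption concrete, here is a small pure-Python naive Bayes classifier. It is a sketch for illustration and is not the NLTK implementation referenced above: the add-one smoothing, the whitespace tokenizer, the function names, and the four training sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Count words per label; labeled_texts is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest posterior log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label in label_counts:
        # Prior: share of training texts carrying this label.
        log_prob = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Each word is treated as independent of the others (the
            # "naive" assumption); add-one smoothing avoids log(0).
            count = word_counts[label][word]
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical training data, far smaller than a real training set.
training = [
    ("i love my new car", "positive"),
    ("this phone is great", "positive"),
    ("i hate the slow service", "negative"),
    ("my car is slow and bad", "negative"),
]
model = train(training)
print(classify("love this great car", *model))   # positive
print(classify("slow and bad service", *model))  # negative
```

Working in log-probabilities keeps the product of many small per-word probabilities from underflowing on longer texts.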
Naïve Bayes Classifier
Results:
− Final accuracy: 0.788
Tuning Pulse
Positive words
Negative words
Neutral words
White lists
Stop words
Synonym mappings
Accuracy Comparison
Sentiment scores generated for each phase
Keyword      Ideal     Original   Pruning   Training   Tuning
Healthcare   -0.1515   -0.0333    -0.0833   -0.1       -0.125
Obama         0.308     0.0944     0.1535    0.1535     0.1842
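
One way to read the table is to compute how far each phase's score is from the ideal score. The snippet below does only that arithmetic on the numbers shown above; using the absolute difference as the error measure is an assumption, since the slides do not name a metric.

```python
IDEAL = {"Healthcare": -0.1515, "Obama": 0.308}

# Scores per phase, copied from the table above.
PHASES = {
    "Original": {"Healthcare": -0.0333, "Obama": 0.0944},
    "Pruning": {"Healthcare": -0.0833, "Obama": 0.1535},
    "Training": {"Healthcare": -0.1, "Obama": 0.1535},
    "Tuning": {"Healthcare": -0.125, "Obama": 0.1842},
}


def absolute_errors(keyword):
    """Absolute distance from the ideal score for each phase, to 4 decimals."""
    return {
        phase: round(abs(scores[keyword] - IDEAL[keyword]), 4)
        for phase, scores in PHASES.items()
    }


for keyword in IDEAL:
    print(keyword, absolute_errors(keyword))
```

For Healthcare the distance from the ideal shrinks at every phase (0.1182, 0.0682, 0.0515, 0.0265). For Obama it goes from 0.2136 to 0.1545 after pruning, stays at 0.1545 after training, and reaches 0.1238 after tuning.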
Trend/Targeted Analysis
Targeted dataset analysis can help improve accuracy
Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis
Find new trends
− Government healthcare (2009) vs. Obamacare (2014)
Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”
Generating Tree
Increase the relevancy of sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest
Generate tree using Text Search
− Merge stemmed words, e.g. query, queries, querying…
− Lucene (Apache open source)
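
Counting the most-mentioned words after merging stemmed forms can be sketched as follows. The talk used Lucene text search for this step; the Python below is only a stand-in, and the `crude_stem` suffix rules, the stop-word list, and the sample texts are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "after", "at", "the", "with", "is", "my", "and"}


def crude_stem(word):
    """Very rough suffix stripping so query/queries/querying merge."""
    for suffix, replacement in (("ies", "y"), ("ying", "y"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def most_mentioned(texts, top_n=3):
    """Return the top_n most frequent stems across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[crude_stem(word)] += 1
    return counts.most_common(top_n)


print([crude_stem(w) for w in ("query", "queries", "querying")])
# ['query', 'query', 'query']

texts = [
    "Healthcare After ObamaCare",
    "obamacare queries",
    "querying obamacare",
    "query healthcare",
]
print(most_mentioned(texts, top_n=2))
```

The words that come out on top for an entity are candidates for the next level of the tree, and sentiment can then be scored on each of them separately. A proper stemmer would handle many more word forms than these four suffix rules.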
Tree View
healthcare
obamacare !(Obamacare)
obama !(Obama) !(health)health
Thank you
Many thanks to*:
Tim Donar, Solution Engineer
Beth Favini, Tech Pubs Sr. Manager
Judith Plummer, Tech Pubs Editor in Chief
* In alphabetical order
Got Feedback?
Rate and Review the session using the GHC Mobile App
To download visit www.gracehopper.org