Sentiment Analysis in Machine Learning
Jennifer D. Davis, Ph.D.
Association for Computing Machinery, Austin Chapter
Sub-group on Knowledge Discovery and Data Mining
June 2, 2015
What is sentiment analysis?
A machine learning technique that classifies comments and phrases based on a "corpus": a collection of annotated texts in which words are assigned numerical weights
Defined as: "Sentiment analysis (opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials." (Wikipedia)
Sentiment Analysis: Not your Mother’s Twitter Feed!
Sentiment Analysis can be used to:
- Understand the intent behind language in an unbiased manner
Business areas that frequently use Sentiment Analysis:
- Retail
- Entertainment
- Healthcare
- Any customer-centered organization
- Respond to customer complaints with better solutions, a sort of virtual call center (e.g. Amelia)
Retail
- Introduce new products more successfully by understanding culture & social media
- Understand and respond to customer needs using internal data sources such as customer reviews or feedback
- Develop new products based on customer wants and needs as expressed in reviews, online and social media
Entertainment
- Create interest or excitement about movies by understanding the market segment
- Target movie advertising or recommender systems based on social commentary and collaborative filtering
- Target advertising by gender, population, or cultural affinity
Healthcare and Medical Treatment
Healthcare: Learn about patient wellness
- Potentially detect depression from journal entries
- Assist with patient adherence to treatment
- Learn about patient satisfaction and what is working
- Gather outcome measures associated with patient satisfaction
This is a hot area of research, and several academic institutions are investing in research related to patient outcomes and sentiment analysis.
What are the overall steps for sentiment analysis?
1. Gather unstructured data from your own sources, web sources, databases (healthcare.gov surprisingly has some), and competitions like Kaggle.
2. Parse out unnecessary punctuation and "stop" words or phrases, and perform other pre-processing as appropriate.
3. Transform the words or phrases into a numerical representation such as a vector.
4. Choose an appropriate classification algorithm. For example, Random Forest has a high accuracy rate but isn't always computationally efficient. We discussed several other methods previously.
5. Train the algorithm on a training set and, if enough data is available, cross-validate. Tune the algorithm's parameters to match your features, but avoid over-fitting.
6. Apply the algorithm to test data (the fun part).
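The steps above can be sketched end-to-end with scikit-learn. The tiny inline dataset and labels are invented purely for illustration:

```python
# Minimal sentiment pipeline: gather -> clean -> vectorize -> train -> predict.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: gather (toy) labeled data; 1 = positive, 0 = negative.
reviews = [
    "great movie loved the acting",
    "wonderful plot and great humor",
    "terrible film boring plot",
    "awful acting boring and dull",
]
labels = [1, 1, 0, 0]

# Steps 2-3: strip English stop words and vectorize into word counts.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Steps 4-5: choose and fit a classifier (naive Bayes here).
clf = MultinomialNB()
clf.fit(X, labels)

# Step 6: apply to unseen text.
test = vectorizer.transform(["boring and awful movie"])
print(clf.predict(test))  # expect [0] (negative)
```

With real data you would also hold out a test split and cross-validate (step 5) rather than training on everything.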
What techniques can we use?
Many are under development by machine-learning-focused corporations and in academic linguistics laboratories
Often an ensemble of algorithms works best and is most accurate
Text data is often unstructured. You will spend a portion of your time cleaning and organizing it. Not fun, but necessary.
Today we will give a brief, high-level overview of three methods: (i) naïve Bayes classification, (ii) Word2vec, and (iii) recursive neural networks
Naïve Bayes classification method
Naïve Bayes classification applies probability formulas based on the assumption that all features act independently of one another
In most cases this is surprisingly accurate, typically yielding 70-80% accuracy
You can read more about this in the textbook for this course, “Building Machine Learning Systems with Python”
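To make the independence assumption concrete, here is a toy naïve Bayes computation in plain Python; the word counts and the test phrase are invented for illustration:

```python
from collections import Counter
from math import log

# Toy training counts: word frequencies per class (invented numbers).
pos = Counter({"great": 3, "fun": 2, "boring": 0})
neg = Counter({"great": 0, "fun": 1, "boring": 3})
vocab = {"great", "fun", "boring"}

def class_log_prob(words, counts, prior=0.5):
    """log P(class) + sum of log P(word|class), with Laplace smoothing.
    The independence assumption is what lets us simply ADD the
    per-word log probabilities instead of modeling word interactions."""
    total = sum(counts.values())
    score = log(prior)
    for w in words:
        score += log((counts[w] + 1) / (total + len(vocab)))
    return score

words = ["great", "fun"]
label = "pos" if class_log_prob(words, pos) > class_log_prob(words, neg) else "neg"
print(label)  # "pos"
```

Real implementations (e.g. scikit-learn's MultinomialNB) do essentially this bookkeeping over thousands of vocabulary words.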
Word2vec “deep” learning method
This method relies upon creating a “Bag of Words” from semi-structured data
Many tools are available in the scikit-learn and NLTK Python libraries (we will show some in our Jupyter (IPython) notebook)
Invented by Google engineers, who describe it as a "tool [that provides] an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words"
In other words (pun intended), words are assigned a vector of numbers representing their importance and meaning
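Training word2vec itself is usually done with a library such as gensim; as a stand-in, this stdlib-only toy shows what "a vector of numbers per word" buys you. The three-dimensional vectors are hand-picked for illustration (real word2vec learns hundreds of dimensions from a large corpus), but the payoff is the same: similar words get similar vectors, measurable by cosine similarity.

```python
from math import sqrt

# Hand-crafted "embeddings" (invented values, for illustration only).
vectors = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "awful": [-0.7, -0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(cosine(vectors["good"], vectors["great"]))  # close to 1 (similar)
print(cosine(vectors["good"], vectors["awful"]))  # negative (opposed)
```

For sentiment, those per-word vectors are then aggregated (or fed to a classifier) instead of raw word counts.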
Recursive neural network method
The best (and most convenient to use) library is Stanford University’s Natural Language Processing library.
The method uses a recursive algorithm that distinguishes between phrases based on the order of words & phrases
For example “this movie has humor that could not be denied” would be graded as positive whereas “this movie did not have any humor whatsoever” would be graded as negative based on order and choice of words & phrases.
The Stanford NLP Group can be found at: nlp.stanford.edu; their live demonstration is available at: nlp.stanford.edu/sentiment
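Stanford's recursive model itself is Java-based, but a quick stdlib check shows why word order matters in the first place. These two example sentences (invented here) contain exactly the same words, so any order-blind bag-of-words model must score them identically, while a recursive network that composes phrases in order can tell them apart:

```python
from collections import Counter

positive = "this movie was funny , not boring at all"
negative = "this movie was boring , not funny at all"

# Identical word counts, opposite sentiment: a bag-of-words
# representation literally cannot distinguish these sentences.
print(Counter(positive.split()) == Counter(negative.split()))  # True
```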
So which do I choose?
It depends upon the complexity of data you are analyzing
It depends upon the accuracy you desire versus scalability (always a balancing act)
It depends on your time frame and how you will integrate the knowledge derived from using sentiment analysis
Out of the box solutions can work, but sometimes you will need to build your own
So now we can give it a try! A Jupyter notebook has been created and can be accessed via my GitHub account at: https://github.com/jddavis-100/Statistics-and-Machine-Learning/
Data is available at Kaggle.com by joining the Kaggle competition. The test set was designed by me, and I can provide it to you or Omar.
Gather your own data from a number of APIs or web-crawlers, such as:
- Rotten Tomatoes API
- Twitter API
- Web-scraping tools such as Scrapy (Python tool available at scrapy.org)
GitHub Repository
Tutorial:
https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Sentiment-Analysis--Class-for-ACM,-SIGKDD,-Austin-Chapter
Repo: https://github.com/jddavis-100/Statistics-and-Machine-Learning