Introduction to Business Data Analytics
Lecture 7
The slides are available under a Creative Commons license. The original owner of these slides is the University of Tartu.
Brand Value Monitoring
Understanding and Mining Text
Till now and Today !
• Looking at other/not directly controlled platforms
  • Twitter, blogs, tech posts, recommendation websites
• Looking at their own customers
  • Subscription-based data
Till now
• Customer transactional data
  • Typical in CRM and ERP systems
• Fairly structured data
  • Relational data: DB, CSV
• Relatively well-documented schema
• Feature encoding is straightforward
Today
• Data comes from external sources and is Big Data
  • 5 Vs: Velocity, Volume, Veracity, Variety, Value
  • Veracity: messiness and trustworthiness?
  • Value: can we get some meaning out of this data?
• Mostly unstructured data
• Requires entirely new approaches to analyse the data
Channels of data capturing
• Source 1: Social media platforms
  • Twitter: VW scandals
  • Facebook (posts): bad product
  • Blogs: product reviews
• Source 2: Customer interactions with the business
  • Customer feedback forms
  • Consumer complaint logs
  • Emails
Why should we care about data?
• The data can be correlated with the product and the business, and it tells us about customers.
• W.r.t. Source 1 (social media platforms): data from customers about the product and the business
  • Helps uncover patterns and themes, so you know what customers are thinking about.
  • Reveals their wants and needs.
  • Shows what they think about your services.
  • Can provide an early warning of trouble.
• W.r.t. Source 2 (customer interactions with the business): by ignoring this data you are losing an opportunity to detect and introduce
  • potential business products
  • added-value services
Data Source 1: Twitter, Example 1
• It appears that tweets by Kylie Jenner caused Snapchat to lose $1.3 billion today. TMZ reports that the social media platform's shares are down 7.2 percent hours after Jenner posted her tweets.
Data Source 1: Twitter, Example 2
Can we correlate Twitter and the financial market?
Disclaimer!
Can Twitter sentiment predict stock market behaviour?
Source: http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/
Main idea behind the work: what if people complaining on Twitter about their malfunctioning iPhones in fact results in Apple experiencing quarterly losses? Of course, the relationship is neither linear nor simple. It depends on the total number of users expressing a certain sentiment as well as on the intensity of their sentiments, and even in perfect circumstances, the correlation between general sentiment and stock oscillations needs to be proved.
Klout score: a value that indicates the degree of social influence of a certain individual in the social media world. It varies between 1 and 100, and a higher value corresponds to greater influence.
It is interesting to study this correlation when extraordinary corporate events happen (for example, an acquisition, an IPO, the launch of a new product, the company distributing dividends, the CEO being fired, etc.).
Data Source 2: Massive Text in E-commerce
• Suppose you are a large organization (an e-commerce business like Market or Amazon, or a large company like Transferwise)
• Millions of customers interact with you every day.
• An automatic process is required to classify and understand customers’
  • Mood (happy or sad)
  • Sentiment (positive or negative)
  • Message type (complaint or appreciation)
What can we do with this data/text? Text Analytics
• Text analytics is the process of drawing meaning out of written communication.
• Text mining, also referred to as text data mining or text analytics, is the process of deriving high-quality information from text
Source: https://www.clarabridge.com/text-analytics/
Source: https://en.wikipedia.org/wiki/Text_mining
Text Analytics: How it works
[Diagram: Text, then Preprocessing, then Transformed data, from which we extract]
• Sentiment or emotions: e.g. positive, negative, neutral
• Patterns: e.g. popular words, topics or terms
But it is not easy to deal with text: Why?
• 1) Intended for human consumption (and not for computers)
  • Consider the text
  • “The acting is poor and it gets out-of-control by the end, with the violence overdone and an incredible ending, but it’s still fun to watch.”
  • What do you expect the overall sentiment to be?
  • What do you think of the word “incredible”? Positive or negative?
• 2) Unstructured data
  • No table of records with fields (like a feature collection)
  • Text fields can have a varying number of words
  • Word order matters (and sometimes does not)
  • Domain/context dependence: words and phrases can mean different things in different contexts and domains
  • Effect of syntax on semantics
• 3) Dirty text
  • No grammar
  • Misspelled words
  • Synonyms
  • Emojis
  • Sarcasm
Techniques for text analytics
• Techniques
  • Lexical-based techniques
  • K-NN (a bit more sophisticated)
Additional reading (if you wish): http://liris.cnrs.fr/Documents/Liris-6508.pdf
Lexical-based Technique 1: Dictionary-based
• Create a dictionary of positive and negative words with scores.
Additional reading (if you wish):
1) http://liris.cnrs.fr/Documents/Liris-6508.pdf
2) https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c
Positive words: Good, Excellent, Superb, …
Synonyms found: splendid, dazzling, marvellous, …
Negative words: Bad, Delay, Horrible, …
Synonyms found: substandard, poor, inferior
Techniques for improvement
• Use of linguistic knowledge
Lexical-based Technique 1: Dictionary-based (continued)
• Create a dictionary of positive and negative words with scores.
• For each sentence, add up the positive and negative scores
  • Example: "Weather is good today"
  • Let the score for "good" be 0.5, and the scores for "weather" and "today" be 0 (on a scale of -1 to +1). The sentence score is then 0.5, so the sentence is positive.
• Caveat for dictionary-based methods
  • When dictionaries are created in one substantive area and then applied to another problem, serious errors can occur. See: https://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/
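The scoring rule above can be sketched in a few lines of Python. This is only an illustration: apart from "good" = 0.5, which is taken from the example, every entry in the lexicon is a made-up assumption, and the function names are hypothetical.

```python
import re

# Hypothetical mini-lexicon on a -1..+1 scale. Only "good" = 0.5 comes from
# the slide example; the remaining entries are invented for illustration.
LEXICON = {"good": 0.5, "excellent": 0.9, "bad": -0.5, "horrible": -0.9}

def sentence_score(sentence, lexicon=LEXICON):
    """Sum the scores of all words; words missing from the lexicon count as 0."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(word, 0.0) for word in words)

def polarity(score):
    """Map a numeric sentence score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

score = sentence_score("Weather is good today")
print(score, polarity(score))  # 0.5 positive
```

Because the lexicon is just a lookup table, swapping in a dictionary built for a different domain changes the result without any change to the code, which is exactly where the caveat above applies.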
Lexical-based Technique 2: Document-based
Comparing text with corpus
Calculating the positive and negative score
We are ignoring the order of the words in the sentence.
We won’t catch everything
• This movie makes Catwoman look like a great movie
• A terrible movie that some people will nevertheless find moving
• Your children will be occupied for 72 minutes.
Supervised Learning approach
• Train a model on documents that are already labeled as positive or negative.
• Evaluate your approach using the test data.
• In this lecture, we will use the K-NN approach to classify texts.
  • K-NN classifies new cases based on distance (or similarity)
Classification: K-Nearest Neighbors (K-NN, not K-Means)
Problem
Red dots: males
Blue dots: females
Predict: what is the green dot?
In other words, you know the classes of the red and blue data points and you would like to classify (predict) the gender of the green data point.
Source: www.imperial.ac.uk/people/n.sadawi
Classification: K-Nearest Neighbors (K-NN, not K-Means)
Q1: How do we classify (or predict) the gender of the green data point?
Algorithm
1) Find the K data points that are nearest to the green dot.
2) Assign the class (color) of the majority among those K neighbors to the green (new) data point.
Q2: How do we identify which K neighbors are the nearest among all the neighbors?
Answer: Use a distance metric.
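The two steps of the algorithm can be sketched as follows. This is a minimal illustration, not a reference implementation: the coordinates are hypothetical stand-ins for the red and blue dots, Euclidean distance is assumed as the metric, and `knn_classify` is a name chosen here.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of (features, label) pairs; query is a feature tuple."""
    # Step 1: find the K training points nearest to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Step 2: assign the majority class among those K neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points standing in for the red and blue dots.
train = [((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
         ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue")]
green = (2.0, 2.0)

print(knn_classify(train, green, k=3))  # red
```

Note that there is no training phase in the usual sense: all the work happens at prediction time, when distances to every stored point are computed.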
Distance Metrics revisited
Categorical
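• Jaccard distance: 1 minus the ratio of intersection to union (the ratio itself is the Jaccard similarity)
• Hamming distance: the number of positions at which two equal-length sequences differ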
Continuous
NOTE: Distance is (1- similarity)
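As a small sketch of the categorical metrics, the functions below compute Jaccard similarity, Jaccard distance (as 1 minus similarity, following the note above) and Hamming distance. The function names and sample inputs are illustrative assumptions.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Distance is 1 minus similarity."""
    return 1.0 - jaccard_similarity(a, b)

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

print(jaccard_similarity({"good", "movie"}, {"good", "film"}))
print(hamming_distance("karolin", "kathrin"))  # 3
```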
How to Pick the value of K ?
• Inspect your data to choose the value of K: visualize your data
• Historically, the optimal value of K has been between 3 and 10
• A large K reduces the effect of noise, but there is no guarantee that it improves the result
• If K is odd: ties are avoided (with two classes)
• If K is even and the vote is tied: flip a coin (assign randomly) or do not assign a category
• Set aside part of your training data to determine a good value of K
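The last point, holding out part of the training data to choose K, could look roughly like this. The data set, the candidate values and the helper names are all hypothetical; the one deliberately mislabeled point is there to show how K=1 can be misled by noise.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    """Share of held-out points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def pick_k(train, validation, candidates=(1, 3, 5, 7, 9)):
    """Return the candidate K with the best held-out accuracy (smallest K on ties)."""
    usable = [k for k in candidates if k <= len(train)]
    return max(usable, key=lambda k: validation_accuracy(train, validation, k))

# Hypothetical data: two clusters plus one noisy "A" point inside the "B" cluster.
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
         ((5.5, 5.5), "A")]
validation = [((0.5, 0.5), "A"), ((5.4, 5.6), "B"), ((6, 6), "B")]

print(pick_k(train, validation))  # 3
```

Only odd candidates are tried here, in line with the advice on avoiding ties in a two-class problem.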
Loan Default Problem
Loan and Age are two input parameters/features
Red and Blue are the data points
K-NN
We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.
D = sqrt((48 - 33)^2 + (142000 - 150000)^2) ≈ 8000.01
If K=1 then the nearest neighbor is the last case in the training set with Default=Y.
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.
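The distance quoted for the nearest training case can be checked with a short calculation. Only the two points that appear in the text are used; the `euclidean` helper is a name chosen for this sketch.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

unknown = (48, 142_000)        # (Age, Loan) of the case to classify
training_case = (33, 150_000)  # the training case with Default=Y

d = euclidean(unknown, training_case)
print(round(d, 2))  # 8000.01
```

The age difference of 15 contributes almost nothing next to the loan difference of 8000, which is why the result is so close to 8000.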
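Standardized Distance
Loan amounts are far larger than ages, so the raw Euclidean distance is driven almost entirely by Loan. Standardizing both features puts them on a comparable scale before distances are computed.
If K = 1, the prediction for the green dot (the unknown case) is now Default=N.
Remember: with the non-standardized distance, the answer was Y.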
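The sketch below illustrates how standardizing can flip a 1-NN prediction from Y to N. It assumes min-max scaling to the range 0 to 1, which is one common way to standardize; the slides do not say which method was used. Only the unknown case and the (33, 150000, Y) training case come from the text, the other three rows are hypothetical.

```python
import math

def min_max_scaler(rows):
    """Build a function that rescales each feature to [0, 1] using the training min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        # Assumes every feature has at least two distinct values (hi != lo).
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))

    return scale

def nearest_label(train, query, transform=lambda row: row):
    """Label of the single nearest training case (K = 1) after applying transform."""
    best = min(train, key=lambda item: math.dist(transform(item[0]), transform(query)))
    return best[1]

# (Age, Loan) -> Default. Only the first row is from the slides; the rest are hypothetical.
train = [((33, 150_000), "Y"), ((47, 120_000), "N"),
         ((20, 20_000), "N"), ((60, 220_000), "Y")]
query = (48, 142_000)

scale = min_max_scaler([features for features, _ in train])
print(nearest_label(train, query))         # raw distance: Y
print(nearest_label(train, query, scale))  # min-max scaled distance: N
```

On the raw scale the 33-year-old is nearest because the loans are close; after scaling, the 47-year-old wins because age now carries comparable weight.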
Some terminology to understand before we use K-NN for sentiment analysis
• Document: a piece of text (anything from one sentence, like a tweet, to a whole book)
• Tokens or terms: a document consists of individual tokens
  • Think of a token as a word for now
• Corpus: a collection of documents
Example: a corpus of three documents, where each word is a token
• Document 1: Tere, kuidas sul läheb? Väga hästi, aga sul ka?
• Document 2: Hello, how are you? I am fine, and you?
• Document 3: Ciao, come va? Tutto bene, e tu?
Document (Term) Frequency Matrix
• Each sentence is considered a separate document.
• A simple bag-of-words approach using term frequency would produce a table of term counts.
Thus all the documents are mapped to feature vectors of equal length, with one entry for each unique word across all the documents.
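A bag-of-words table of this kind can be built with a few lines of Python. The three example sentences are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter

def term_frequency_matrix(documents):
    """Return (vocabulary, rows): one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every unique token across all documents, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical documents, one sentence each.
docs = ["the movie was good", "the movie was bad", "good good acting"]
vocab, matrix = term_frequency_matrix(docs)

print(vocab)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, so documents of different lengths become comparable feature vectors. Word order is discarded, which is the "bag" in bag-of-words.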
Some Considerations for improvements
• Casing (e.g., good vs. Good)
  • Change everything to lowercase.
• Numbers (0, 56, 109, etc.)
• Punctuation (e.g., ?, !, “, etc.)
• Symbols (e.g., <, @, #, etc.)
• Stemming: remove suffixes (e.g., reduce run, running, runner, runs, etc. to run)
Some considerations (continued)
Term Frequency
• Microsoft Corp and Skype Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype.
• Case has been normalized: every term is in lowercase.
• Words have been stemmed: their suffixes have been removed, so that verbs like announces, announced and announcing are all reduced to the term announc.
• Stopwords have been removed (common words such as the, and, of, and on).
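These preprocessing steps can be sketched as a small pipeline. The stopword set contains only the four words named above, and the suffix stripper is a deliberately crude stand-in for a real stemmer (for example the Porter stemmer); both are assumptions made for illustration.

```python
import re

STOPWORDS = {"the", "and", "of", "on"}  # only the four stopwords named in the text
SUFFIXES = ("ing", "ed", "es", "s")     # crude, hypothetical suffix list

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop numbers/punctuation/symbols, remove stopwords, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = text.split()
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]

print(preprocess("Microsoft announced the agreement on Skype!"))
print([crude_stem(w) for w in ("announces", "announced", "announcing")])
```

A rule this simple over- and under-stems many words (for instance, "running" becomes "runn"), so a proper stemming library is the better choice outside of a demonstration.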
Sentiment Analysis (opinion mining): KNN
[Figure: a test document compared with a document from the training data]
Basic similarity score: 2, since the two documents have two words in common.
K-NN Text Classification
• We just computed the similarity score between the test document and one training document.
• Repeat this for every training document and pick the top K most similar documents.
• Based on the majority among them, assign the polarity (classify the document).
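Putting the pieces together, a K-NN text classifier based on the basic similarity score (number of words in common) might look like the sketch below. The training sentences and labels are invented for illustration, and no preprocessing beyond lowercasing is applied.

```python
from collections import Counter

def similarity(doc_a, doc_b):
    """Basic similarity score: number of distinct words the two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def classify(test_doc, training, k=3):
    """Rank training documents by similarity and take a majority vote over the top k."""
    ranked = sorted(training, key=lambda item: similarity(test_doc, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training documents.
training = [
    ("great phone love it", "positive"),
    ("love the great camera", "positive"),
    ("terrible battery hate it", "negative"),
    ("hate the terrible screen", "negative"),
    ("screen is fine", "positive"),
]
test_doc = "love this great screen"

print(similarity(test_doc, training[0][0]))  # 2
print(classify(test_doc, training, k=3))     # positive
```

Here a higher score means more similar, so documents are sorted in descending order rather than by ascending distance.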
Evaluation: Confusion Matrix & Accuracy
• Confusion matrix: counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN)
• Recall: TP / (TP + FN)
• Precision: TP / (TP + FP)
• Accuracy: the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total number of observations
• Accuracy = (TP + TN) / (TP + FP + FN + TN)
Source: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
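The formulas above translate directly into code. The label lists below are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Return (TP, FP, FN, TN) for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def ratio(numerator, denominator):
    """Safe division: 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, fn, tn):
    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }

# Hypothetical true and predicted labels for eight test documents.
actual    = ["positive", "positive", "positive", "negative",
             "negative", "negative", "negative", "positive"]
predicted = ["positive", "positive", "negative", "negative",
             "negative", "positive", "negative", "positive"]

counts = confusion_counts(actual, predicted)
print(counts)            # (3, 1, 1, 3)
print(evaluate(*counts))
```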
Techniques for improvement
• Bi-grams: take two consecutive words at a time as a single term, which recovers some of the word order that bag-of-words ignores.
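As a brief sketch, bi-grams can be generated by sliding a window of two tokens over the text. The `ngrams` helper and the sample sentence are illustrative assumptions.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()
print(ngrams(tokens))
# [('this', 'movie'), ('movie', 'is'), ('is', 'not'), ('not', 'good')]
```

With single words, "not" and "good" are counted separately and "good" looks positive; the bi-gram ('not', 'good') keeps the negation attached to the word it modifies.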
Smart Approach by Facebook !
Demo time!
https://courses.cs.ut.ee/2018/bda/fall/Main/Practice