Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Introduction to Business Data Analytics

Lecture 7

The slides are available under creative common license. The original owner of these slides is the University of Tartu.

Brand Value Monitoring

Understanding and Mining Text

Till now and Today !

• Looking at other/not directly controlled platforms• Twitter, Blogs, Tech posts, Recc.

websites

• Looking at their own customers• Subscription based data

Till now and Today !

Till now

• Customer transactional data• Typical in CRM ERP

• Fairly Structured data • Relational data: DB, CSV

• Relatively well documented schema.

• Feature encoding is straight forward

Today

• From External sources & is Big Data• 5 Vs: Velocity, Volume, Veracity,

Variety, Value• Veracity: Messiness and

Trustworthiness ?• Value: Can we get some meaning out

of this data

• Mostly unstructured data

• Requires all new approaches to analyse the data

Channels of data capturing

• Source 1: Social Media Platforms• Twitter: VW Scandals

• Facebook (posts): Bad Product

• Blogs: Product reviews

• Source 2: Customers Interactions with the business• Customer feedback forms

• Consumer complaint logs

• Emails

Why we should care about data ?

• The Data can correlated with product and business about customers.

• W.r.t Source 1 (social media platforms): Data from the customers about product and business • Helps in uncovering patterns and themes, so you know what customers are thinking

about. • It reveals their wants and needs.• What they think about your services ?• Can provide an early warning of trouble.

• W.r.t Source 2 (Customers Interactions with the business): You are loosing an opportunity to detect and introduce• potential business products • added value services

Data Source 1: TwitterExample 1

• It appears tweets by Kylie cause Snapchat to lose $1.3 billion today. TMZ reports that the social media platform's shares are down 7.2 percent hours after Jenner posted her tweets.

http://www.tmz.com/2018/02/22/kylie-jenner-tweets-snapchat-value-drops/

Data Source 1: TwitterExample 2

Can we correlate Twitter and Financial market ?

Disclaimer !

Can Twitter sentiment predict stock market behaviour?

Source: http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Main idea behind the work: what if having people

compaining on Twitter about their malfunctioning

iPhone will in fact result in Apple experience quarterly

losses? Of course, the relationship is neither linear nor

simple. It actually depends on the total number of

users expressing a certain sentiment as well as on the

intensity of their sentiments, and even in perfect

circumstances, the correlation between general

sentiment and stock oscillations needs to be proved.

Klout score (value that indicates the degree of social influence of a certain individual in the social media world)-- varies between 1 and 100 and to a higher value corresponds a higher influence power.

Interesting to study this correlation when extraordinary corporate events happen (for example, an acquisition, an IPO, the launch of a new product, the company distributing dividends, the CEO been fired, etc.)

http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Disclaimer !

Can Twitter sentiment predict stock market behaviour?

Source: http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Main idea behind the work: what if having people

complaining on Twitter about their malfunctioning

iPhone will in fact result in Apple experience quarterly

losses? Of course, the relationship is neither linear nor

simple. It actually depends on the total number of

users expressing a certain sentiment as well as on the

intensity of their sentiments, and even in perfect

circumstances, the correlation between general

sentiment and stock oscillations needs to be proved.

Klout score (value that indicates the degree of social influence of a certain individual in the social media world)-- varies between 1 and 100 and to a higher value corresponds a higher influence power.

Interesting to study this correlation when extraordinary corporate events happen (for example, an acquisition, an IPO, the launch of a new product, the company distributing dividends, the CEO been fired, etc.)

http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Data Source 2: Massive Text in E-commerce

• Consider you are large organization (e-commerce like Market, Amazon or large companies like Transferwise etc)

• Millions of customers are interacting every day.

• An automatic process is required to classify and understand customers’• Mood (Happy or sad)

• Sentiment (positive or negative)

• Message type: Complain or Appreciation

What can we do with this data/Text?: Text Analytics

• Text analytics is the process of drawing meaning out of written communication.

• Text mining, also referred to as text data mining or text analytics, is the process of deriving high-quality information from text

Source: https://www.clarabridge.com/text-analytics/

Source: https://en.wikipedia.org/wiki/Text_mining

Text Analytics: How it works

Text Transformed dataPreprocessing

Sentiment or emotions: Eg. positive, negative, neutral

Patterns: Eg. Popular words, topics or terms)

But it is not easy to deal with text: Why?

• 1) Intended for human consumption (and not for computers)• Consider the text

• “The acting is poor and it gets out-of-control by the end, with the violence overdone and an incredible ending, but it’s still fun to watch.”

• What you expect the overall sentiment ?

• What do you think of the word “incredible” ? Positive or Negative ?


• 1) Intended for human consumption (and not for computers)

• 2) Unstructured data• No: Table of records with fields (like feature collection)

• Text fields can have varying number of words

• Word order matters (and sometimes not)

• Domain/context dependence• words/phrases can mean different things in different contexts and domains

• Effect of syntax on semantics


• 1) Intended for human consumption (and not for computers)

• 2) Unstructured data

• 3) Dirty text• No grammar

• Misspelled words

• Synonyms

• Emojis

• Sarcasm

Techniques for text analytics

• Techniques• Lexical based techniques

• K-NN (bit more sophisticated)

Additional reading (if you wish): http://liris.cnrs.fr/Documents/Liris-6508.pdf

Lexical based Technique: Technique 1Dictionary based

• Create a dictionary of positive and negative words with scores.

Additional reading (if you wish): 1) http://liris.cnrs.fr/Documents/Liris-6508.pdf2) https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c

Positive words

GoodExcellentSuperb:

Negative words

BadDelayHorrible:

Find Synonyms

splendiddazzlingmarvellous:

Find Synonyms

Substandardpoorinferior

http://liris.cnrs.fr/Documents/Liris-6508.pdf

Techniques for improvement

• Use of linguistic knowledge

Lexical based Technique: Technique 1

• Create a dictionary of positive and negative words with scores.

• For each sentence, add positive and negative score• Example: Weather is good today

let the score for good is 0.5, weather and today is 0 (on a scale of -1 to +1).So the score is 0.5 (and sentence is positive)

• Dictionary Based methods• When dictionaries are created in one substantive area and then applied to another

problems, serious errors can occurSee: https://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/

Additional reading (if you wish): 1) http://liris.cnrs.fr/Documents/Liris-6508.pdf2) https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c

http://liris.cnrs.fr/Documents/Liris-6508.pdf

Lexical based Technique: Technique 2Documents based Technique

Comparing text with corpus

Calculating the positive and negative score

We are ignoring the order of the words in the sentence.

We won’t catch everything

• This movie makes Catwoman look like a great movie

• A terrible movie that some people will nevertheless find moving

• Your children will be occupied for 72 minutes.

Supervised Learning approach

• Train the documents which are already labeled with positive and negative labels.

• Evaluate your approach using the test data.

• In this lecture, we will use K-NN approach to perform classification of texts.• Classifies new cases based on distance (or similarity)

Classification: K-NN Neighbor (not K-Means)

Problem

Red dots : MalesBlue dots: FemalesPredict what is the green dot ?

In other words, you know about red and blue data and you would like to classify (or predict) about the gender of the green data ?

Source: www.imperial.ac.uk/people/n.sadawi

Classification: K-NN Neighbor (not K-Means)

Algorithm1) Find K data points which are nearest to the

green dot.2) Assign (classify) the color (class) of the

majority (among K neighbors) to the green (or the new data point)

Q1 : How to classify (or predict) about the gender of the green data ?

Q2: How to find identify which K neighbors are nearest among all the neighbors ?Answer: Use Distance metric

Distance Metrics revisited

Categorical

• Jaccard Distance: Ratio of Intersection/Union

• Hamming Distance

Continuous

NOTE: Distance is (1- similarity)

How to Pick the value of K ?

• Inspect your data to choose the value of K: Visualize your data

• Historically, the optimal value of K=3 to 10

• Large K, reduces the noise but no guarantee

• If K is odd : Avoid ties

• If K is even: Flip a coin (randomly assign) or do not assign the category

• Set aside a data from your training data to determine a good value of K

Loan Default Problem

Loan and Age are two input parameters/features

Red and Blue are the data points

K-NN

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.

D = Sqrt[(48-33)^2 + (142000-150000)^2] D = 8000.01

If K=1 then the nearest neighbor is the last case in the training set with Default=Y.

K-NN

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.

Standardized Distance

If K = 1, then green dot = N1

Remember: without non standardized, answer is Y

Some Terminologies to understand before we see the use of K-NN in Sentiment Analytics

• Document: Piece of text ( one sentence like tweet or a book)

• Tokens or Terms: A document consists of individual tokens• Consider it is as a word for now

• Corpus: Collection of documents.

Tere, kuidas sullaheb ? Vaga hesti

aga sul ka ?

Document 1

Hello, how are you ?I am fine and you ?

Document 1

Ciao, come va ?Tutto bene e tu ?

Document 1

Corpus

Toke

n

Document (Term) Frequency Matrix

• Each sentence is considered a separate document.

• A simple bag-of-words approach using term frequency would produce a table of term counts

counts

Thus all the documents are mapped to feature vectors of equal length, which consists of all the unique words from all the documents.

Some Considerations for improvements

• Casing (e.g., good Vs. Good) • Change everything to smaller case.

• Numbers (0, 56, 109 etc.)

• Punctuation (e.g., ?, !, “ etc.)

• Symbols (e.g., <, @, #, etc.)

• Stemming: Remove suffixes (e.g., run, running, runner, runs, etc. to run)

Some considerations (continued)

Term Frequency

• Microsoft Corp and Skype Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype.

case has been normalized : every term is in lowercase

words have been stemmed: their suffixes removed, so that verbs like announces, announced and announcing are all reduced to the term announc

stopwords have been removed (common words: the, and, of, and on )

Sentiment Analysis (opinion mining): KNN

Test data

From the training data

Basic Similarity score: 2Since two words are common

Test data

Training data

K-NN Text Classification

• We just computed the similarity score between the test document with another training document.

• Pick top K similar documents

• Based on majority, assign the polarity or classify the document.

Evaluation:Confusion Matrix & Accuracy

• Confusion Matrix

• Recall: TP/TP+FN

• Precision: TP/TP+FP

• Accuracy: Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observation to the total observations

• Accuracy = (TP+TN)/(TP+FP+FN+TN)

Source: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Techniques for improvement

• Bi-gram is taking 2 words at a time.

Smart Approach by Facebook !

Demo time!

https://courses.cs.ut.ee/2018/bda/fall/Main/Practice