52
Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner of these slides is the University of Tartu.

Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Introduction to Business Data Analytics

Lecture 7

The slides are available under creative common license. The original owner of these slides is the University of Tartu.

Page 2: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Brand Value Monitoring

Understanding and Mining Text

Page 3: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Till now and Today !

• Looking at other/not directly controlled platforms• Twitter, Blogs, Tech posts, Recc.

websites

• Looking at their own customers• Subscription based data

Page 4: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Till now and Today !

Till now

• Customer transactional data• Typical in CRM ERP

• Fairly Structured data • Relational data: DB, CSV

• Relatively well documented schema.

• Feature encoding is straight forward

Today

• From External sources & is Big Data• 5 Vs: Velocity, Volume, Veracity,

Variety, Value• Veracity: Messiness and

Trustworthiness ?• Value: Can we get some meaning out

of this data

• Mostly unstructured data

• Requires all new approaches to analyse the data

Page 5: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Channels of data capturing

• Source 1: Social Media Platforms• Twitter: VW Scandals

• Facebook (posts): Bad Product

• Blogs: Product reviews

• Source 2: Customers Interactions with the business• Customer feedback forms

• Consumer complaint logs

• Emails

Page 6: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Why we should care about data ?

• The Data can correlated with product and business about customers.

• W.r.t Source 1 (social media platforms): Data from the customers about product and business • Helps in uncovering patterns and themes, so you know what customers are thinking

about. • It reveals their wants and needs.• What they think about your services ?• Can provide an early warning of trouble.

• W.r.t Source 2 (Customers Interactions with the business): You are loosing an opportunity to detect and introduce• potential business products • added value services

Page 7: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Data Source 1: TwitterExample 1

• It appears tweets by Kylie cause Snapchat to lose $1.3 billion today. TMZ reports that the social media platform's shares are down 7.2 percent hours after Jenner posted her tweets.

Page 8: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Data Source 1: TwitterExample 2

Can we correlate Twitter and Financial market ?

Page 9: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Disclaimer !

Can Twitter sentiment predict stock market behaviour?

Source: http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Main idea behind the work: what if having people

compaining on Twitter about their malfunctioning

iPhone will in fact result in Apple experience quarterly

losses? Of course, the relationship is neither linear nor

simple. It actually depends on the total number of

users expressing a certain sentiment as well as on the

intensity of their sentiments, and even in perfect

circumstances, the correlation between general

sentiment and stock oscillations needs to be proved.

Klout score (value that indicates the degree of social influence of a certain individual in the social media world)-- varies between 1 and 100 and to a higher value corresponds a higher influence power.

Interesting to study this correlation when extraordinary corporate events happen (for example, an acquisition, an IPO, the launch of a new product, the company distributing dividends, the CEO been fired, etc.)

Page 10: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Disclaimer !

Can Twitter sentiment predict stock market behaviour?

Source: http://blogs.lse.ac.uk/usappblog/2017/10/14/can-twitter-sentiment-predict-stock-market-behaviour/

Main idea behind the work: what if having people

complaining on Twitter about their malfunctioning

iPhone will in fact result in Apple experience quarterly

losses? Of course, the relationship is neither linear nor

simple. It actually depends on the total number of

users expressing a certain sentiment as well as on the

intensity of their sentiments, and even in perfect

circumstances, the correlation between general

sentiment and stock oscillations needs to be proved.

Klout score (value that indicates the degree of social influence of a certain individual in the social media world)-- varies between 1 and 100 and to a higher value corresponds a higher influence power.

Interesting to study this correlation when extraordinary corporate events happen (for example, an acquisition, an IPO, the launch of a new product, the company distributing dividends, the CEO been fired, etc.)

Page 11: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Data Source 2: Massive Text in E-commerce

• Consider you are large organization (e-commerce like Market, Amazon or large companies like Transferwise etc)

• Millions of customers are interacting every day.

• An automatic process is required to classify and understand customers’• Mood (Happy or sad)

• Sentiment (positive or negative)

• Message type: Complain or Appreciation

Page 12: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

What can we do with this data/Text?: Text Analytics

• Text analytics is the process of drawing meaning out of written communication.

• Text mining, also referred to as text data mining or text analytics, is the process of deriving high-quality information from text

Source: https://www.clarabridge.com/text-analytics/

Source: https://en.wikipedia.org/wiki/Text_mining

Page 13: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Text Analytics: How it works

Text Transformed dataPreprocessing

Sentiment or emotions: Eg. positive, negative, neutral

Patterns: Eg. Popular words, topics or terms)

Page 14: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

But it is not easy to deal with text: Why?

• 1) Intended for human consumption (and not for computers)• Consider the text

• “The acting is poor and it gets out-of-control by the end, with the violence overdone and an incredible ending, but it’s still fun to watch.”

• What you expect the overall sentiment ?

• What do you think of the word “incredible” ? Positive or Negative ?

Page 15: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

But it is not easy to deal with text: Why?

• 1) Intended for human consumption (and not for computers)

• 2) Unstructured data• No: Table of records with fields (like feature collection)

• Text fields can have varying number of words

• Word order matters (and sometimes not)

• Domain/context dependence• words/phrases can mean different things in different contexts and domains

• Effect of syntax on semantics

Page 16: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

But it is not easy to deal with text: Why?

• 1) Intended for human consumption (and not for computers)

• 2) Unstructured data

• 3) Dirty text• No grammar

• Misspelled words

• Synonyms

• Emojis

• Sarcasm

Page 17: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Techniques for text analytics

• Techniques• Lexical based techniques

• K-NN (bit more sophisticated)

Additional reading (if you wish): http://liris.cnrs.fr/Documents/Liris-6508.pdf

Page 18: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Lexical based Technique: Technique 1Dictionary based

• Create a dictionary of positive and negative words with scores.

Additional reading (if you wish): 1) http://liris.cnrs.fr/Documents/Liris-6508.pdf2) https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c

Positive words

GoodExcellentSuperb:

Negative words

BadDelayHorrible:

Find Synonyms

splendiddazzlingmarvellous:

Find Synonyms

Substandardpoorinferior

Page 19: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Techniques for improvement

• Use of linguistic knowledge

Page 20: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Lexical based Technique: Technique 1

• Create a dictionary of positive and negative words with scores.

• For each sentence, add positive and negative score• Example: Weather is good today

let the score for good is 0.5, weather and today is 0 (on a scale of -1 to +1).So the score is 0.5 (and sentence is positive)

• Dictionary Based methods• When dictionaries are created in one substantive area and then applied to another

problems, serious errors can occurSee: https://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/

Additional reading (if you wish): 1) http://liris.cnrs.fr/Documents/Liris-6508.pdf2) https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c

Page 21: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Lexical based Technique: Technique 2Documents based Technique

Page 22: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Comparing text with corpus

Page 23: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Calculating the positive and negative score

We are ignoring the order of the words in the sentence.

Page 24: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 25: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

We won’t catch everything

• This movie makes Catwoman look like a great movie

• A terrible movie that some people will nevertheless find moving

• Your children will be occupied for 72 minutes.

Page 26: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Supervised Learning approach

• Train the documents which are already labeled with positive and negative labels.

• Evaluate your approach using the test data.

• In this lecture, we will use K-NN approach to perform classification of texts.• Classifies new cases based on distance (or similarity)

Page 27: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Classification: K-NN Neighbor (not K-Means)

Problem

Red dots : MalesBlue dots: FemalesPredict what is the green dot ?

In other words, you know about red and blue data and you would like to classify (or predict) about the gender of the green data ?

Source: www.imperial.ac.uk/people/n.sadawi

Page 28: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Classification: K-NN Neighbor (not K-Means)

Algorithm1) Find K data points which are nearest to the

green dot.2) Assign (classify) the color (class) of the

majority (among K neighbors) to the green (or the new data point)

Q1 : How to classify (or predict) about the gender of the green data ?

Q2: How to find identify which K neighbors are nearest among all the neighbors ?Answer: Use Distance metric

Page 29: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Distance Metrics revisited

Categorical

• Jaccard Distance: Ratio of Intersection/Union

• Hamming Distance

Continuous

NOTE: Distance is (1- similarity)

Page 30: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

How to Pick the value of K ?

• Inspect your data to choose the value of K: Visualize your data

• Historically, the optimal value of K=3 to 10

• Large K, reduces the noise but no guarantee

• If K is odd : Avoid ties

• If K is even: Flip a coin (randomly assign) or do not assign the category

• Set aside a data from your training data to determine a good value of K

Page 31: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Loan Default Problem

Loan and Age are two input parameters/features

Red and Blue are the data points

Page 32: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

K-NN

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.

D = Sqrt[(48-33)^2 + (142000-150000)^2] D = 8000.01

If K=1 then the nearest neighbor is the last case in the training set with Default=Y.

Page 33: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

K-NN

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is again Default=Y.

Page 34: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Standardized Distance

If K = 1, then green dot = N1

Remember: without non standardized, answer is Y

Page 35: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Some Terminologies to understand before we see the use of K-NN in Sentiment Analytics

• Document: Piece of text ( one sentence like tweet or a book)

• Tokens or Terms: A document consists of individual tokens• Consider it is as a word for now

• Corpus: Collection of documents.

Tere, kuidas sullaheb ? Vaga hesti

aga sul ka ?

Document 1

Hello, how are you ?I am fine and you ?

Document 1

Ciao, come va ?Tutto bene e tu ?

Document 1

Corpus

Toke

n

Page 36: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Document (Term) Frequency Matrix

• Each sentence is considered a separate document.

• A simple bag-of-words approach using term frequency would produce a table of term counts

counts

Thus all the documents are mapped to feature vectors of equal length, which consists of all the unique words from all the documents.

Page 37: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Some Considerations for improvements

• Casing (e.g., good Vs. Good) • Change everything to smaller case.

• Numbers (0, 56, 109 etc.)

• Punctuation (e.g., ?, !, “ etc.)

• Symbols (e.g., <, @, #, etc.)

• Stemming: Remove suffixes (e.g., run, running, runner, runs, etc. to run)

Page 38: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Some considerations (continued)

Page 39: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Term Frequency

• Microsoft Corp and Skype Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype.

case has been normalized : every term is in lowercase

words have been stemmed: their suffixes removed, so that verbs like announces, announced and announcing are all reduced to the term announc

stopwords have been removed (common words: the, and, of, and on )

Page 40: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Sentiment Analysis (opinion mining): KNN

Test data

From the training data

Page 41: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 42: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 43: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Basic Similarity score: 2Since two words are common

Test data

Training data

Page 44: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 45: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 46: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 47: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner
Page 48: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

K-NN Text Classification

• We just computed the similarity score between the test document with another training document.

• Pick top K similar documents

• Based on majority, assign the polarity or classify the document.

Page 49: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Evaluation:Confusion Matrix & Accuracy

• Confusion Matrix

• Recall: TP/TP+FN

• Precision: TP/TP+FP

• Accuracy: Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observation to the total observations

• Accuracy = (TP+TN)/(TP+FP+FN+TN)

Source: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Page 50: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Techniques for improvement

• Bi-gram is taking 2 words at a time.

Page 51: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Smart Approach by Facebook !

Page 52: Introduction to Business Data Analytics · 2018-12-06 · Introduction to Business Data Analytics Lecture 7 The slides are available under creative common license. The original owner

Demo time!

https://courses.cs.ut.ee/2018/bda/fall/Main/Practice