
How to derive data-driven insights · Cheap 0.08 Hotel 0.06 Budget 0.03 Saving 0.03 Deal 0.02 Concessions 0.02 … Sum1.00 British museum 0.09 National gallery 0.07 Science Museum


Page 1:

Page 2:

How to derive data-driven insights… from user-generated content

Page 3:

Stephanie Fischer

Dr. Christian Winkler

Page 4:

Founded in 2017

Big Data text analytics experts

Located in Munich

Page 5:

Text analytics - automatically, regularly & based on large amount of data.

1) Automatically gather relevant content

2) Cleaning & Linguistics

3) Statistics & QA

4) Data-driven calculation of main insights

5) Visualization in dashboard & reports

Page 6:

Text Analytics applied to a common situation:

Visiting foreign places.

Page 7:

The problem:

Too little time, too much going on.

Page 8:

Opinion of a single author

Page 9:

Too many opinions

Page 10:

Making sense of …

14 years of London TripAdvisor Forum

Sep 2004 – Sep 2018

Page 11:

Lots of content!

1.3 million posts, 79,446 users

Nobody can read that.

Page 12:

There is one problem with user-generated content:

Too much.

Page 13:

Solution?

Page 14:

1) Automatically gather relevant content

2) Cleaning & Linguistics

3) Statistics & QA

4) Data-driven calculation of main insights

5) Display in dashboard & reports

Text processing pipeline

Spidering

Extraction of text and metadata

Normalizing

Language detection

Synonyms

Outlier detection

Feature extraction

Regression

Clustering

Overall structure

Word combinations

Categories

Timelines

Semantics
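The early pipeline steps listed above (extraction of text, normalizing) might look roughly like this for a single forum post. This is a minimal sketch using only the Python standard library; the function name `normalize_post` and the sample post are hypothetical and not taken from the talk, and a real pipeline would add language detection and synonym handling on top.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML tag matcher (assumes simple markup)
TOKEN_RE = re.compile(r"[a-z0-9']+")   # word-like tokens after lowercasing

def normalize_post(raw: str) -> list[str]:
    """Strip markup, unify Unicode, lowercase, and split into tokens."""
    text = TAG_RE.sub(" ", raw)                  # drop HTML tags first
    text = html.unescape(text)                   # resolve entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)   # e.g. non-breaking space -> space
    return TOKEN_RE.findall(text.lower())

# Hypothetical sample post
tokens = normalize_post("<p>Is the Oyster&nbsp;card worth it?</p>")
print(tokens)
```

Tags are removed before entities are resolved so that an escaped `&lt;` in the post text is not mistaken for markup.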

Page 15:

Perform quality assurance with the whole content

Statistics & QA

Get overview of texts, data quality and recognize possible bias in data

Page 16:

Typical questions answered with statistics

• Do frequent authors write shorter posts?

• How does the number of articles change over time?

• How does the article length change over time?

• What is the length distribution of articles?

• Which are the most frequent words?

• How are keywords used over time?

Statistics & QA

Page 17:

Python (pandas)

Content spidering

Data extraction

PostgreSQL database

Jupyter Notebook

Statistics & QA

Spidering, Database, Pandas, Jupyter

Page 18:

Question Answering

Translation / Dialogue

Summarization / Topic Mining

Classification / Retrieval

strong

weak

"Shallow NLP"

• Simple language models with many simplifications (Bag-of-Words, n-grams)

• Keywords, phrases

• Robust algorithms

"Deep NLP"

• Complex language models necessary for deep understanding

• Statements spanning sentences

• Fragile algorithms

Statistics & QA

Text preparation with NLP

Page 19:

Statistics & QA

Page 20:

Typical questions answered with statistics

How does the number of articles change over time?

Statistics & QA
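Counting posts per year is one simple way to answer this question. The sketch below uses only the Python standard library so that it runs anywhere (the deck itself names pandas and PostgreSQL); the `posts` records are hypothetical sample data, not values from the forum.

```python
from collections import Counter
from datetime import date

# Hypothetical sample: (author, posting date, text) tuples standing in for forum posts.
posts = [
    ("kim", date(2004, 9, 12), "Where to stay near Hyde Park?"),
    ("hank", date(2005, 3, 1), "Oyster card or travelcard?"),
    ("kim", date(2005, 7, 20), "Thanks, booked the hotel."),
    ("olivia", date(2018, 9, 2), "Heathrow to city centre by tube?"),
]

# Number of posts per calendar year
posts_per_year = Counter(posted.year for _, posted, _ in posts)

for year in sorted(posts_per_year):
    print(year, posts_per_year[year])
```

Gaps or sudden jumps in such a series are a first hint at spidering problems or changes in the forum itself.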

Page 21:

Typical questions answered with statistics

How does the article length change over time?

Statistics & QA

Page 22:

Typical questions answered with statistics

What is the length distribution of articles?

Statistics & QA

Page 23:

Typical questions answered with statistics

Do frequent authors write shorter posts?

Statistics & QA
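One way to approach this question is to compare, per author, the number of posts with the mean post length. The following is a minimal standard-library sketch; the sample `posts` are invented for illustration, and measuring length in whitespace-separated words is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (author, text) pairs; real input would be the full forum dump.
posts = [
    ("kim", "Yes."),
    ("kim", "Take the tube."),
    ("kim", "Book ahead."),
    ("hank", "We are planning a five day trip and would love suggestions."),
]

lengths = defaultdict(list)
for author, text in posts:
    lengths[author].append(len(text.split()))   # post length in words

# One entry per author: (number of posts, mean words per post)
summary = {author: (len(ls), mean(ls)) for author, ls in lengths.items()}

# Most active authors first
for author, (n_posts, avg_len) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{author}: {n_posts} posts, {avg_len:.1f} words on average")
```

On real data, the two columns of `summary` could be fed into a scatter plot or a rank correlation.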

Page 24:

Typical questions answered with statistics

Which are the most frequent words?

Statistics & QA
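A word-frequency count can be sketched in a few lines. The stop word list, the function name `top_words`, and the sample sentences below are hypothetical; real analyses typically use a much longer stop word list and the normalized tokens from the cleaning step.

```python
import re
from collections import Counter

# Tiny, hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "to", "is", "in", "and", "of", "for"}

def top_words(texts, n=3):
    """Count lowercase word tokens across all texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

sample = [
    "Is the hotel close to the airport?",
    "The hotel is a short walk to the station.",
    "Best hotel for a family in London?",
]
print(top_words(sample, n=1))
```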

Page 25:

Typical questions answered with statistics

How are keywords used over time?

Statistics & QA
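Because the number of posts per year varies, raw keyword counts can mislead; a relative share per year is easier to compare. The sketch below assumes hypothetical `(year, text)` records and a helper named `keyword_share_by_year`, neither of which comes from the talk.

```python
from collections import Counter

# Hypothetical (year, text) pairs standing in for dated forum posts.
posts = [
    (2004, "paper travelcard for a week"),
    (2004, "which travelcard zone do I need"),
    (2018, "just tap your oyster card"),
    (2018, "oyster or contactless"),
    (2018, "is a travelcard still worth it"),
]

def keyword_share_by_year(posts, keyword):
    """Fraction of posts per year that mention the keyword at least once."""
    total, hits = Counter(), Counter()
    for year, text in posts:
        total[year] += 1
        if keyword in text.lower().split():
            hits[year] += 1
    return {year: hits[year] / total[year] for year in sorted(total)}

print(keyword_share_by_year(posts, "oyster"))
print(keyword_share_by_year(posts, "travelcard"))
```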

Page 26:

For comparison … how would you rate the quality?

How are keywords used over time?

Statistics & QA

Page 27:

Summary quality assurance

Make sure the mass of text that future data-driven decisions will be based on is of good enough quality

Recognize possible bias in data

Get overview of texts & data quality

1st take-away: value of text statistics

Page 28:

Create data-driven personas of London travelers

Topic modeling

Data-driven focus points for digital marketing, category management, product design, personalization

Page 29:

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6

80% 20%

Kim (User #1)

Hank (User #2)

Marty (# ...)

Olivia (#79,446)

Topic modeling

How to create data-driven personas

Page 30:

Topics are word distributions

“Budget traveler”

Cheap 0.08
Hotel 0.06
Budget 0.03
Saving 0.03
Deal 0.02
Concessions 0.02
Sum 1.00

“Museum fan”

British museum 0.09
National gallery 0.07
Science Museum 0.06
V&A 0.06
Tate 0.05
Tate modern 0.05
Sum 1.00

“Park fan”

Hyde Park 0.11
Regent’s Park 0.09
Hampstead Heath 0.06
Green Park 0.05
St. James Park 0.04
Sum 1.00

Topic modeling
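To make "topics are word distributions" concrete: in mixture-type topic models (LDA is one example; the deck does not say which model was used), a user's probability of using a term is the topic-weighted sum of the per-topic term probabilities. The sketch below copies a few word probabilities from the slide; the 80/20 `user_mix` and the helper `term_probability` are hypothetical.

```python
# Word probabilities copied from the slide. Only the top words of each topic
# are listed there, so the values shown here do not add up to 1.
topics = {
    "budget traveler": {"cheap": 0.08, "hotel": 0.06, "budget": 0.03},
    "museum fan": {"british museum": 0.09, "national gallery": 0.07},
    "park fan": {"hyde park": 0.11, "regent's park": 0.09},
}

# Hypothetical user whose posts are 80% about one topic and 20% about another.
user_mix = {"budget traveler": 0.8, "museum fan": 0.2}

def term_probability(term, mix, topics):
    """p(term | user) = sum over topics of p(topic | user) * p(term | topic)."""
    return sum(weight * topics[name].get(term, 0.0) for name, weight in mix.items())

print(round(term_probability("cheap", user_mix, topics), 3))
print(round(term_probability("british museum", user_mix, topics), 3))
```

A persona then corresponds to a group of users whose topic mix is dominated by the same topic.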

Page 31:

Looking for hidden / latent structure

1) Which could be candidates for topics?

2) How are they distributed in document space?

Basic idea of topic modelling

[Figure: matrix with topics (Topic 1, Topic 2, Topic 3, ..., Topic k) as rows and documents (doc 1, doc 2, ..., doc n) as columns]

Topic modeling

Page 32:

Only use word frequencies

• Term frequency (TF)

• Very simple, but robust

• Basis for many algorithms (retrieval, classification)

Disadvantages

• Very simplified model of language

• No syntactical or relational information kept

Improvements

• TF/IDF, n-grams

Need to vectorize data (BoW)

Documents

D1: “Steffi likes London.”

D2: “Steffi does not like London.”

D3: “Steffi likes London, but not Paris.”

D1 1 1 1

D2 1 1 1 1 1

D3 1 1 1 1 1 1

Topic modeling
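The bag-of-words vectorization of the three example documents can be reproduced with a short standard-library sketch. The column order (first appearance of each word) and the simple regex tokenizer are assumptions; the slide does not show its column labels. No stemming is applied, so "like" and "likes" end up in separate columns.

```python
import re

docs = {
    "D1": "Steffi likes London.",
    "D2": "Steffi does not like London.",
    "D3": "Steffi likes London, but not Paris.",
}

def tokenize(text):
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary in order of first appearance across the documents.
vocab = []
for text in docs.values():
    for tok in tokenize(text):
        if tok not in vocab:
            vocab.append(tok)

# One row per document, one column per vocabulary word (term frequency).
matrix = {name: [tokenize(text).count(w) for w in vocab] for name, text in docs.items()}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

The rows contain 3, 5 and 6 non-zero entries, matching the number of ones per document on the slide; word order and negation ("does not like") are lost, which is exactly the disadvantage the slide names.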

Page 33:

Most ML is boring maths

\[
\begin{pmatrix}
x_{11} & \cdots & x_{1n} \\
\vdots & \ddots & \vdots \\
x_{m1} & \cdots & x_{mn}
\end{pmatrix}
\]

(rows: documents, columns: features)

m documents with n features (words)

• Use a matrix representation

• m x n matrix can become very large

• 1.3 million rows, 500,000 columns

• Matrix is sparse: most documents contain only a few words

Matrix can be simplified

• Only keep certain number of features

• Only keep features which occur more than x times

Topic modeling
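Sparsity and feature pruning can be illustrated without any matrix library: store only the non-zero counts per document, then drop rare words. In this sketch the sample documents and the threshold `X` are hypothetical, and "occurs more than x times" is interpreted as document frequency, which is an assumption.

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["hotel", "cheap", "london"],
    ["hotel", "airport", "heathrow"],
    ["london", "hotel", "oyster"],
]

# Sparse rows: store only the words that occur, not the zeros.
rows = [Counter(doc) for doc in docs]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for row in rows for word in row)

X = 1  # assumed threshold; the slide only says "more than x times"
kept = sorted(w for w, n in doc_freq.items() if n > X)
pruned = [{w: c for w, c in row.items() if w in kept} for row in rows]

print(kept)
print(pruned)
```

Libraries such as scikit-learn expose comparable options on their vectorizers, but the deck does not state which tooling was used for this step.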

Page 34:

How Topic Modeling works

Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modelling transforms the matrix

• Re-arrange features (words) and documents

• Find blocks

• Words in blocks constitute topics

• Documents in blocks belong to a topic

Topic modeling

Page 35:

Topic modeling

Page 36:

Topic 1: Organized Traveller

london day stay tour visit trip good hotel plan night

Topic 2: Tourist arriving by air

flight airport heathrow gatwick terminal arrive fly hotel time hour

Topic 3: Newbie

thank hi look help appreciate reply advance suggestion know hello

Topic 4: Travelcard user

card oyster buy travel day use ticket zone london travelcard

Topic 5: Public transport tourist

ticket train station time uk use line bus edit walk

Topic 6: Rental car user

car company drive hire rent rental parking service use return

Topic modeling

Result: 6 data-driven personas

Page 37:

Summary topic modeling

Decisions backed by data - for digital marketing, category management, product design, personalization, …

Detect distinct topics customer segments talk about

Find hidden customer segments based on interest

2nd take-away: value of data-driven personas

Page 38:

Detect what people are interested in when talking about London

Word embeddings

Align your messages to what people actually like about you. Detect changing interests in real-time.

Page 39:

Search result for “tower” in the TripAdvisor forum

Word embeddings

Page 40:

Analysis of words in text

• Order not used

• Relations between words neglected

• Lost semantics

Analyze n-grams

• Order taken into account

• Static relations via tuples

• Abstraction to semantics missing

So far in text analytics…

Each word is a single entity

Context decides about semantics!

Word embeddings

Page 41:

Aim: Find context information of words

CBOW model

• Predict word from context

Skip-gram model

• Determine context from word

• Slower, more precise with infrequent words

Training word vectors

Word embeddings
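The difference between the two models shows up in how training examples are built from a sentence. The sketch below only generates those examples (it does not train a network); the helper `context_windows`, the window size, and the sample sentence are hypothetical.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

# Hypothetical sample sentence
tokens = "the tower is near tower bridge".split()

# Skip-gram: one (input = center word, target = context word) pair per neighbour.
skipgram_pairs = [(center, ctx)
                  for center, context in context_windows(tokens, window=1)
                  for ctx in context]

# CBOW: one (input = all context words, target = center word) example per position.
cbow_examples = [(context, center)
                 for center, context in context_windows(tokens, window=1)]

print(skipgram_pairs[:3])
print(cbow_examples[1])
```

Skip-gram produces several examples per position, which is one reason it trains more slowly than CBOW.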

Page 42:

Word2vec similarities in detail

Word embeddings

Page 43:

Word embeddings

Page 44:

Word embeddings

Page 45:

Word embeddings

Looking for similar sights after training word embeddings

Similar (or near) to the tower:

London Eye, Tower Bridge, British Museum, Buckingham Palace, Westminster Abbey
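A "similar to" lookup of this kind usually ranks words by cosine similarity of their vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not the trained embeddings from the talk, and the names `vectors`, `cosine` and `most_similar` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors for illustration only; trained embeddings
# usually have far more dimensions and are learned from the corpus.
vectors = {
    "tower":        [0.9, 0.1, 0.0],
    "tower_bridge": [0.8, 0.2, 0.1],
    "london_eye":   [0.7, 0.3, 0.0],
    "heathrow":     [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, k=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print([name for name, _ in most_similar("tower")])
```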

Page 46:

Search result for “airport” with the same trained word embeddings

Word embeddings

There are five airports in London:

London City (LCY), Gatwick (LGW), Heathrow (LHR), Luton (LTN), Stansted (STN)

Page 47:

Looking for vector similarities with the same trained word embeddings

Word embeddings

Knightsbridge tube is near Harrods, but where do we need to get off for Buckingham Palace?

Hyde Park, Westminster, Green Park
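Questions of the form "Knightsbridge is to Harrods as ? is to Buckingham Palace" are commonly answered with vector arithmetic: vec(Knightsbridge) - vec(Harrods) + vec(Buckingham Palace), followed by a nearest-neighbour search. The toy vectors below are hand-made so that the arithmetic is easy to follow; they are an assumption and not the embeddings trained on the forum.

```python
import math

# Hand-made toy vectors (assumption): axis 0 = "near Harrods",
# axis 1 = "near Buckingham Palace", axis 2 = "is a tube station".
vectors = {
    "harrods":           [1.0, 0.0, 0.0],
    "knightsbridge":     [1.0, 0.0, 1.0],
    "buckingham_palace": [0.0, 1.0, 0.0],
    "green_park":        [0.1, 0.9, 1.0],
    "westminster":       [0.0, 0.6, 1.0],
    "hyde_park":         [0.6, 0.3, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Rank words by similarity to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)

ranking = analogy("knightsbridge", "harrods", "buckingham_palace")
print(ranking)
```

Subtracting the Harrods vector removes the "location" part of Knightsbridge and keeps the "station" part, which is then moved to the location of Buckingham Palace.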

Page 48:

Summary word embeddings

• Benefit from changing trends

• Create semantically aware search results

Detect changing interests in real-time

Detect relevant context of a topic

3rd take-away: value of semantic context

Page 49:

To wrap it up: methods & value from UGC analysis

Method: Statistics & QA. Value: Ensure data quality

Method: Topic modeling. Value: Uncover hidden structure

Method: Word embeddings. Value: Detect semantic similarities

Page 50:

Looking beyond UGC: derive value from other text sources

Data-driven approach to deriving diverse, unbiased insight from …

• Technical documentation

• Company wikis

• Change requests

• Scientific publications

• …

• Future cost drivers

• Knowledge bottlenecks

• Emerging competing technologies

• Technical debts

Page 51:

Stephanie Fischer

Dr. Christian Winkler

datanizing GmbH