
How to derive data-driven insights… from user-generated content

Stephanie Fischer

Dr. Christian Winkler

Founded in 2017

Big Data text analytics experts

Located in Munich

Text analytics: automatically, regularly, and based on a large amount of data.

1) Automatically gather relevant content
2) Cleaning & linguistics
3) Statistics & QA
4) Data-driven calculation of main insights
5) Visualization in dashboard & reports

Text Analytics applied to a common situation:

Visiting foreign places.

The problem:

Too little time, too much going on.

Opinion of a single author

Too many opinions

14 years of London TripAdvisor Forum

Sep 2004 – Sep 2018

Making sense of …

Lots of content!

1.3 million posts, 79,446 users

Nobody can read that.

There is one problem with user-generated content:

Too much.

Solution?

1) Automatically gather relevant content
2) Cleaning & linguistics
3) Statistics & QA
4) Data-driven calculation of main insights
5) Display in dashboard & reports

Text processing pipeline

Spidering

Extraction of text and metadata

Normalizing

Language detection

Synonyms

Outlier detection

Feature extraction

Regression

Clustering

Overall structure

Word combinations

Categories

Timelines

Semantics

Perform quality assurance with the whole content

Statistics & QA

Get an overview of texts and data quality, and recognize possible bias in data

Typical questions answered with statistics

Do frequent authors write shorter posts?

How does the number of articles change over time?

How does the article length change over time?

What is the length distribution of articles?

Which are the most frequent words?

How are keywords used over time?

Python (pandas)

Content spidering

Data extraction

PostgreSQL database

Jupyter Notebook

Spidering, Database, Pandas, Jupyter

Question Answering

Translation / Dialogue

Summarization / Topic Mining

Classification / Retrieval

strong

weak

"Shallow NLP"
• Simple language models with many simplifications (Bag-of-Words, n-grams)
• Keywords, phrases
• Robust algorithms

"Deep NLP"
• Complex language models necessary for deep understanding
• Statements spanning sentences
• Fragile algorithms

Text preparation with NLP

How does the number of articles change over time?

How does the article length change over time?
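The two timeline questions above can be sketched with pandas, the tool named in this talk. This is a minimal sketch on a hypothetical five-row sample; the column names `created` and `text` are assumptions, not the actual database schema.

```python
import pandas as pd

# Hypothetical sample; in practice the rows would come from the posts database.
posts = pd.DataFrame(
    {
        "created": pd.to_datetime(
            ["2004-09-03", "2004-09-17", "2004-10-01", "2005-01-12", "2005-01-30"]
        ),
        "text": ["Hi all", "Which hotel?", "Thanks!", "Oyster card question", "Any tips?"],
    }
)

# Number of posts per calendar year (a coarse timeline that is easy to plot).
year = posts["created"].dt.year
posts_per_year = posts.groupby(year).size()

# Average post length (in characters) per year.
posts["length"] = posts["text"].str.len()
mean_length_per_year = posts.groupby(year)["length"].mean()

print(posts_per_year)
print(mean_length_per_year)
```

Grouping by month or week instead of year only requires a different grouping key; the rest stays the same.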

What is the length distribution of articles?

Do frequent authors write shorter posts?
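One way to approach this question is to aggregate per author and correlate activity with mean post length. The sketch below uses a hypothetical sample and assumed column names (`author`, `text`); with only three authors the number is purely illustrative.

```python
import pandas as pd

# Hypothetical sample of (author, post text) pairs.
posts = pd.DataFrame(
    {
        "author": ["kim", "kim", "kim", "kim", "hank", "hank", "olivia"],
        "text": [
            "Yes.",
            "Try the tube.",
            "Agreed.",
            "No idea.",
            "We stayed near Hyde Park and walked a lot.",
            "The British Museum needs at least half a day.",
            "We are a family of four visiting London for the first time in May.",
        ],
    }
)
posts["length"] = posts["text"].str.len()

# One row per author: number of posts and mean post length.
per_author = posts.groupby("author")["length"].agg(n_posts="count", mean_length="mean")

# Pearson correlation between activity and mean length; a negative value
# would suggest that frequent authors tend to write shorter posts.
rho = per_author["n_posts"].corr(per_author["mean_length"])

print(per_author)
print(rho)
```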

Which are the most frequent words?
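Counting words needs little more than a tokenizer and a counter. The following is a minimal sketch with hypothetical posts and a deliberately tiny stop word list; both are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical posts; a real run would iterate over all forum posts.
posts = [
    "Is the Oyster card worth it for three days in London?",
    "Buy an Oyster card at the airport, it is the cheapest option.",
    "The hotel near the station was cheap and clean.",
]

# A tiny illustrative stop word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "is", "it", "for", "in", "an", "at", "and", "was", "a"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


counts = Counter(
    token for post in posts for token in tokenize(post) if token not in STOP_WORDS
)
print(counts.most_common(5))
```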

How are keywords used over time?

For comparison … how would you rate the quality?

How are keywords used over time?

Summary quality assurance

Make sure the body of text that future data-driven decisions will be based on is of good-enough quality

Recognize possible bias in data

Get overview of texts & data quality

1st take-away: value of text statistics

Create data-driven personas of London travelers

Topic modeling

Data-driven focus points for digital marketing, category management, product design, personalization

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6

80% 20%

Kim: User #1
Hank: User #2
Marty: # ...
Olivia: #79,446

How to create data-driven personas

Topics are word distributions

"Budget traveler": Cheap 0.08, Hotel 0.06, Budget 0.03, Saving 0.03, Deal 0.02, Concessions 0.02, … (sum 1.00)

"Museum fan": British Museum 0.09, National Gallery 0.07, Science Museum 0.06, V&A 0.06, Tate 0.05, Tate Modern 0.05, … (sum 1.00)

"Park fan": Hyde Park 0.11, Regent's Park 0.09, Hampstead Heath 0.06, Green Park 0.05, St. James Park 0.04, … (sum 1.00)

Topic modeling

Looking for hidden / latent structure

1) Which could be candidates for topics?

2) How are they distributed in document space?

Basic idea of topic modelling

[Figure: topics (Topic 1 ... Topic k) mapped to documents (doc 1 ... doc n)]

Topic modeling

Only use word frequencies

• Term frequency (TF)

• Very simple, but robust

• Basis for many algorithms (retrieval, classification)

Disadvantages

• Very simplified model of language

• No syntactical or relational information kept

Improvements

• TF/IDF, n-grams
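The two improvements listed above can be sketched in a few lines of plain Python. The documents are hypothetical, and the weighting `tf * log(N / df)` is just one common TF/IDF variant, chosen here as an assumption; libraries often use smoothed versions.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    ["cheap", "hotel", "london"],
    ["hotel", "near", "hyde", "park"],
    ["hyde", "park", "london"],
]


def bigrams(tokens):
    """Return adjacent word pairs, which keeps local word order."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tf_idf(documents):
    """Weight each term by tf * log(N / df), one common TF/IDF variant."""
    n_docs = len(documents)
    # Document frequency: in how many documents does a term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


weights = tf_idf(docs)
print(bigrams(docs[1]))
print(weights[0])
```

In the first document, "cheap" receives a higher weight than "hotel" because it occurs in fewer documents.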

Need to vectorize data (BoW)

Documents

D1: "Steffi likes London."
D2: "Steffi does not like London."
D3: "Steffi likes London, but not Paris."

     Steffi  likes  London  does  not  like  but  Paris
D1   1       1      1       0     0    0     0    0
D2   1       0      1       1     1    1     0    0
D3   1       1      1       0     1    0     1    1
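The bag-of-words vectors for the three example documents can be computed as follows. This is a minimal sketch; the tokenizer (lower-casing, no stemming) and the column order (first appearance) are assumptions.

```python
documents = {
    "D1": "Steffi likes London.",
    "D2": "Steffi does not like London.",
    "D3": "Steffi likes London, but not Paris.",
}


def tokenize(text):
    """Lower-case, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


# Vocabulary in order of first appearance (no stemming, so "likes" != "like").
vocabulary = []
for text in documents.values():
    for token in tokenize(text):
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document; word order inside the document is discarded.
bow = {
    name: [tokenize(text).count(term) for term in vocabulary]
    for name, text in documents.items()
}

print(vocabulary)
for name, vector in bow.items():
    print(name, vector)
```

Note that D1 and D2 differ only in a few counts although their meaning is opposite, which illustrates the "very simplified model of language" mentioned earlier.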

Topic modeling

Most ML is boring maths

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$

(rows: documents, columns: features)

m documents with n features (words)
• Use a matrix representation
• m x n matrix can become very large
• 1.3 million rows, 500,000 columns
• Matrix is sparse: most documents contain only a few words

Matrix can be simplified
• Only keep a certain number of features
• Only keep features which occur more than x times
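Both ideas, dropping rare features and storing only non-zero entries, can be sketched without any library. The documents and the threshold `MIN_DOC_FREQ` are hypothetical, and the sketch filters by document frequency, which is one possible reading of "occur more than x times".

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["oyster", "card", "zone"],
    ["oyster", "card", "heathrow"],
    ["heathrow", "terminal", "oyster"],
]

MIN_DOC_FREQ = 2  # assumed threshold: keep words found in at least 2 documents

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))
vocabulary = sorted(w for w, df in doc_freq.items() if df >= MIN_DOC_FREQ)
column = {word: j for j, word in enumerate(vocabulary)}

# Sparse rows: store only the non-zero counts as {column index: count}.
sparse_rows = [
    dict(Counter(column[w] for w in doc if w in column)) for doc in docs
]

print(vocabulary)
print(sparse_rows)
```

In practice a library sparse matrix type would replace the dictionaries, but the principle is the same: zeros are never stored.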

Topic modeling

How Topic Modeling works

Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modeling transforms the matrix
• Re-arrange features (words) and documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to a topic
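The block idea can be illustrated with a plain singular value decomposition in NumPy. This is a toy sketch under strong assumptions: the matrix is hypothetical and perfectly separable, and SVD is only one of several techniques; the talk does not state which algorithm produced its results.

```python
import numpy as np

# Hypothetical hidden structure: every document and every word belongs to one
# of two topics, but rows and columns are interleaved, so no block is visible.
doc_labels = np.array([0, 1, 0, 1, 0])
word_labels = np.array([0, 1, 0, 1, 0, 1])
X = (doc_labels[:, None] == word_labels[None, :]).astype(float)

# Truncated SVD: keep the k strongest components as topics.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Assign each document and each word to the component with the largest loading.
doc_topic = np.abs(U[:, :k]).argmax(axis=1)
word_topic = np.abs(Vt[:k, :]).argmax(axis=0)

# Re-arranging rows and columns by topic makes the blocks visible.
row_order = np.argsort(doc_topic, kind="stable")
col_order = np.argsort(word_topic, kind="stable")
rearranged = X[row_order][:, col_order]

print(X)
print(rearranged)
```

Real document-term matrices are noisy, so the blocks are never this clean, and probabilistic topic models return a mixture of topics per document instead of a single hard assignment.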

Topic modeling

http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modeling

Result: 6 data-driven personas

Topic 1: Organized traveller
london day stay tour visit trip good hotel plan night

Topic 2: Tourist arriving by air
flight airport heathrow gatwick terminal arrive fly hotel time hour

Topic 3: Newbie
thank hi look help appreciate reply advance suggestion know hello

Topic 4: Travelcard user
card oyster buy travel day use ticket zone london travelcard

Topic 5: Public transport tourist
ticket train station time uk use line bus edit walk

Topic 6: Rental car user
car company drive hire rent rental parking service use return

Summary topic modeling

Decisions backed by data - for digital marketing, category management, product design, personalization, …

Detect distinct topics customer segments talk about

Find hidden customer segments based on interest

2nd take-away: value of data-driven personas

Detect what people are interested in when talking about London

Word embeddings

Align your messages to what people actually like about you. Detect changing interests in real-time.

Search result for “tower” in the TripAdvisor forum

Word embeddings

Analysis of words in text

• Order not used

• Relations between words neglected

• Lost semantics

Analyze n-grams

• Order taken into account

• Static relations via tuples

• Abstraction to semantics missing

So far in text analytics…

Each word is a single entity

Context decides about semantics!

Word embeddings

Aim: Find context information of words

CBOW model

• Predict word from context

Skip-gram model

• Determine context from word

• Slower, more precise with infrequent words
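The difference between the two models shows up in the training pairs they are built on. The sketch below only generates those pairs from one hypothetical sentence; it does not train a network, and the function name `training_pairs` and the window size are assumptions for illustration.

```python
def training_pairs(tokens, window=2):
    """Return CBOW pairs (context words, target) and skip-gram pairs (target, context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                 # CBOW: predict word from context
        skipgram.extend((target, c) for c in context)  # skip-gram: predict context from word
    return cbow, skipgram


tokens = ["take", "the", "tube", "to", "heathrow"]
cbow, skipgram = training_pairs(tokens, window=1)

print(cbow[2])
print(skipgram[:3])
```

Skip-gram produces one training example per (word, context word) pair, so it sees more examples per sentence than CBOW, which is consistent with it being slower to train.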

Training word vectors

Word embeddings

Word2vec similarities in detail

Looking for similar sights

Similar (or near) to the tower:

London Eye, Tower Bridge, British Museum, Buckingham Palace, Westminster Abbey

after training word embeddings
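A similarity search of this kind typically ranks words by the cosine similarity of their vectors. The sketch below uses hand-made three-dimensional toy vectors, which are an assumption for illustration and not trained embeddings; a trained model would supply a few hundred dimensions per word.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, NOT trained embeddings (hypothetical values).
vectors = {
    "tower": (0.9, 0.8, 0.1),
    "tower_bridge": (0.85, 0.75, 0.2),
    "london_eye": (0.7, 0.9, 0.2),
    "heathrow": (0.1, 0.2, 0.9),
    "oyster": (0.2, 0.1, 0.8),
}


def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to the given word."""
    others = [w for w in vectors if w != word]
    ranked = sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return ranked[:topn]


print(most_similar("tower"))
```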

Search result for “airport” with the same trained word embeddings

There are five airports in London:

London City (LCY), Gatwick (LGW), Heathrow (LHR), Luton (LTN), Stansted (STN)

Looking for vector similarities with the same trained word embeddings

Knightsbridge tube is near Harrods, but where do we need to get off for Buckingham Palace?

Hyde Park Westminster Green Park
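Questions of the form "A is to B as ? is to C" are commonly answered with vector arithmetic: take the vector A - B + C and look for its nearest neighbour. The sketch below uses hand-made toy vectors in which the third dimension loosely plays the role of "is a tube station"; all values are hypothetical and chosen for illustration, not taken from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, NOT trained embeddings (hypothetical values).
vectors = {
    "harrods": (1.0, 0.0, 0.0),
    "knightsbridge": (1.0, 0.0, 1.0),
    "buckingham_palace": (0.0, 1.0, 0.0),
    "green_park": (0.1, 0.9, 1.0),
    "hyde_park": (0.8, 0.3, 0.9),
    "westminster": (0.3, 0.6, 0.2),
}


def analogy(a, b, c):
    """Answer 'a is to b as ? is to c' with the nearest neighbour of a - b + c."""
    query = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Knightsbridge is to Harrods as ? is to Buckingham Palace
print(analogy("knightsbridge", "harrods", "buckingham_palace"))
```

Excluding the three input words from the candidates is a common convention, since the query vector is often closest to one of its own inputs.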

Summary word embeddings

• Benefit from changing trends
• Create semantically aware search results

Detect changing interests in real-time

Detect relevant context of a topic

3rd take-away value of semantic context

To wrap it up: methods & value from UGC analysis

Method: Statistics & QA. Value: Ensure data quality
Method: Topic modeling. Value: Uncover hidden structure
Method: Word embeddings. Value: Detect semantic similarities

Looking beyond UGC: derive value from other text sources

Data-driven approach to deriving diverse, unbiased insight from …
• Technical documentation
• Company wikis
• Change requests
• Scientific publications
• …

• Future cost drivers
• Knowledge bottlenecks
• Emerging competing technologies
• Technical debt

Stephanie Fischer

Dr. Christian Winkler

datanizing GmbH