How to derive data-driven insights… from user-generated content
Stephanie Fischer
Dr. Christian Winkler
Founded in 2017
Big Data text analytics experts
Located in Munich
Text analytics - automatically, regularly & based on large amounts of data.
1. Automatically gather relevant content
2. Cleaning & Linguistics
3. Statistics & QA
4. Data-driven calculation of main insights
5. Visualization in dashboard & reports
Text Analytics applied to a common situation:
Visiting foreign places.
The problem:
Too little time, too much going on.
Opinion of a single author
Too many opinions
14 years of London TripAdvisor Forum
Sep 2004 – Sep 2018
Making sense of …
Lots of content!
1.3 million posts, 79,446 users
Nobody can read that.
There is one problem with user-generated content:
Too much.
Solution?
1. Automatically gather relevant content
2. Cleaning & Linguistics
3. Statistics & QA
4. Data-driven calculation of main insights
5. Display in dashboard & reports
Text processing pipeline
Spidering
Extraction of text and metadata
Normalizing
Language detection
Synonyms
Outlier detection
Feature extraction
Regression
Clustering
Overall structure
Word combinations
Categories
Timelines
Semantics
Statistics & QA
Perform quality assurance with the whole content
Get overview of texts, data quality and recognize possible bias in data
Typical questions answered with statistics
Do frequent authors write shorter posts?
How does the number of articles change over time?
How does the article length change over time?
What is the length distribution of articles?
Which are the most frequent words?
How are keywords used over time?
Spidering, Database, Pandas, Jupyter
Content spidering
Data extraction
PostgreSQL database
Python (pandas)
Jupyter Notebook
Text preparation with NLP
Question Answering
Translation / Dialogue
Summarization / Topic Mining
Classification / Retrieval
strong
weak
"Shallow NLP"
• Simple language models with many simplifications (Bag-of-Words, n-grams)
• Keywords, phrases
• Robust algorithms
"Deep NLP"
• Complex language models necessary for deep understanding
• Statements spanning sentences
• Fragile algorithms
How does the number of articles change over time?
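One way to approach this question is to bucket posts by month and count them. The following is a minimal sketch using only the Python standard library; the sample posts, the `(author, date, text)` layout and the helper name `posts_per_month` are assumptions made for illustration, not the schema used in the talk.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data: (author, posting date, text).
posts = [
    ("kim", date(2004, 9, 14), "Where to stay near the Tower?"),
    ("hank", date(2004, 9, 30), "Oyster card or travelcard?"),
    ("kim", date(2004, 10, 2), "Thanks, booked a hotel."),
    ("olivia", date(2018, 9, 1), "Heathrow to Paddington by train?"),
]


def posts_per_month(rows):
    """Count posts per (year, month) bucket, sorted chronologically."""
    counts = Counter((d.year, d.month) for _, d, _ in rows)
    return dict(sorted(counts.items()))


monthly = posts_per_month(posts)
for (year, month), n in monthly.items():
    print(f"{year}-{month:02d}: {n}")
```

Sudden jumps or gaps in such a series can hint at spidering problems rather than real changes in forum activity.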
How does the article length change over time?
What is the length distribution of articles?
Do frequent authors write shorter posts?
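A simple way to check this is to split authors by how often they post and compare mean post lengths. The sketch below is illustrative only: the sample posts are made up, and the threshold `min_posts` for calling an author "frequent" is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (author, text) pairs.
posts = [
    ("kim", "Yes."),
    ("kim", "Take the tube."),
    ("kim", "Book early."),
    ("hank", "We spent a full week in London and used an Oyster card every day."),
]


def mean_length_by_activity(rows, min_posts=3):
    """Compare mean post length (in characters) of frequent vs. occasional authors.

    An author counts as frequent with at least min_posts posts;
    the threshold is an assumption for this sketch.
    """
    by_author = defaultdict(list)
    for author, text in rows:
        by_author[author].append(len(text))
    frequent, occasional = [], []
    for lengths in by_author.values():
        target = frequent if len(lengths) >= min_posts else occasional
        target.extend(lengths)
    return {
        "frequent": mean(frequent) if frequent else None,
        "occasional": mean(occasional) if occasional else None,
    }


result = mean_length_by_activity(posts)
print(result)
```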
Which are the most frequent words?
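Counting tokens after removing stop words gives a first answer. This is a rough sketch: the sample texts, the tiny `STOP_WORDS` set and the regular-expression tokenizer are placeholders, and a real pipeline would rely on the Cleaning & Linguistics step instead.

```python
import re
from collections import Counter

# Hypothetical posts and a tiny stop word list; both are placeholders.
texts = [
    "The hotel was cheap and the location was great.",
    "Cheap hotel near the station.",
    "Is the Oyster card cheap?",
]
STOP_WORDS = {"the", "was", "and", "is", "near"}


def top_words(docs, k=2):
    """Return the k most frequent lowercase tokens, ignoring stop words."""
    tokens = (
        tok
        for doc in docs
        for tok in re.findall(r"[a-z']+", doc.lower())
        if tok not in STOP_WORDS
    )
    return Counter(tokens).most_common(k)


print(top_words(texts))
```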
How are keywords used over time?
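Keyword timelines can be approximated by counting, per year, the posts that mention a term. The sketch assumes made-up posts and a plain case-insensitive substring match; the function name `keyword_by_year` is hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (posting date, text) pairs.
posts = [
    (date(2005, 3, 1), "Travelcard or oyster for a week?"),
    (date(2005, 7, 9), "Bought a travelcard at the station."),
    (date(2017, 5, 2), "Just use your oyster, it caps daily."),
    (date(2018, 1, 20), "Oyster works on the bus too."),
]


def keyword_by_year(rows, keyword):
    """Count posts per year that mention keyword (case-insensitive substring match)."""
    needle = keyword.lower()
    hits = Counter(d.year for d, text in rows if needle in text.lower())
    return dict(sorted(hits.items()))


for kw in ("oyster", "travelcard"):
    print(kw, keyword_by_year(posts, kw))
```

Dividing each yearly count by the total number of posts in that year would make years with different activity levels comparable.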
For comparison … how would you rate the quality?
How are keywords used over time?
Summary quality assurance
Make sure the mass of text that future data-driven decisions will be based on is of good-enough quality
Recognize possible bias in data
Get overview of texts & data quality
1st take-away value of text statistics
Create data-driven personas of London travelers
Topic modeling
Data-driven focus points for digital marketing, category management, product design, personalization
How to create data-driven personas
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
80% 20%
Kim (User #1)
Hank (User #2)
Marty (# ...)
Olivia (#79,446)
Topics are word distributions
“Budget traveler”
Cheap 0.08
Hotel 0.06
Budget 0.03
Saving 0.03
Deal 0.02
Concessions 0.02
…
Sum 1.00
“Museum fan”
British museum 0.09
National gallery 0.07
Science Museum 0.06
V&A 0.06
Tate 0.05
Tate modern 0.05
…
Sum 1.00
“Park fan”
Hyde Park 0.11
Regent’s Park 0.09
Hampstead Heath 0.06
Green Park 0.05
St. James Park 0.04
…
Sum 1.00
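Because each topic is a probability distribution over words, a user who mixes topics can be read as a weighted blend of those distributions. The sketch below uses the listed probabilities for the budget and museum topics; the elided remainder of each distribution is lumped into a placeholder entry `(other)`, and the 80/20 split given to the user is a hypothetical assignment chosen for illustration.

```python
# Word probabilities as listed on the slide; "(other)" is a placeholder
# that holds the elided remainder so each topic still sums to 1.
budget = {"cheap": 0.08, "hotel": 0.06, "budget": 0.03, "saving": 0.03,
          "deal": 0.02, "concessions": 0.02}
museum = {"british museum": 0.09, "national gallery": 0.07,
          "science museum": 0.06, "v&a": 0.06, "tate": 0.05,
          "tate modern": 0.05}
for topic in (budget, museum):
    topic["(other)"] = 1.0 - sum(topic.values())


def mix(topics, weights):
    """Blend topic word distributions: p(w) = sum_k weight_k * p(w | topic_k)."""
    blended = {}
    for topic, weight in zip(topics, weights):
        for word, p in topic.items():
            blended[word] = blended.get(word, 0.0) + weight * p
    return blended


# Hypothetical user: 80% "Budget traveler", 20% "Museum fan".
user = mix([budget, museum], [0.8, 0.2])
for word, p in sorted(user.items(), key=lambda kv: -kv[1])[:4]:
    print(f"{word:15s} {p:.3f}")
```

A topic model estimates both parts from the posts themselves: the word distribution of every topic and the topic weights of every document or user.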
Basic idea of topic modeling
Looking for hidden / latent structure
1) Which could be candidates for topics?
2) How are they distributed in document space?
[Figure: topics (Topic 1 … Topic k) plotted against documents (doc 1 … doc n)]
Only use word frequencies
• Term frequency (TF)
• Very simple, but robust
• Basis for many algorithms (retrieval, classification)
Disadvantages
• Very simplified model of language
• No syntactical or relational information kept
Improvements
• TF/IDF, n-grams
Need to vectorize data (BoW)
Documents
D1: “Steffi likes London.”
D2: “Steffi does not like London.”
D3: “Steffi likes London, but not Paris.”
     Steffi  likes  London  does  not  like  but  Paris
D1   1       1      1       0     0    0     0    0
D2   1       0      1       1     1    1     0    0
D3   1       1      1       0     1    0     1    1
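The bag-of-words vectorization of the three example documents can be sketched in a few lines. The tokenizer and the column order (first appearance) are assumptions; libraries typically sort the vocabulary differently.

```python
import re

docs = {
    "D1": "Steffi likes London.",
    "D2": "Steffi does not like London.",
    "D3": "Steffi likes London, but not Paris.",
}


def tokenize(text):
    """Split into word tokens; punctuation is dropped, case is kept."""
    return re.findall(r"\w+", text)


# Vocabulary in order of first appearance (the ordering is arbitrary).
vocab = []
for text in docs.values():
    for tok in tokenize(text):
        if tok not in vocab:
            vocab.append(tok)

# One count vector per document, aligned with vocab.
bow = {name: [tokenize(text).count(term) for term in vocab]
       for name, text in docs.items()}

print(vocab)
for name, vector in bow.items():
    print(name, vector)
```

Note that "likes" and "like" end up in separate columns here, and that D1 and D2 differ only by a few entries although they state opposite opinions; both effects follow from the simplifications listed above.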
Most ML is boring maths
m documents with n features (words)
• Use a matrix representation
• m x n matrix can become very large
• 1.3 million rows, 500,000 columns
• Matrix is sparse: most documents contain only a few words
Matrix can be simplified
• Only keep certain number of features
• Only keep features which occur more than x times
Document-feature matrix (rows: documents, columns: features):
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$
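Both simplifications of the document-feature matrix can be illustrated with a sparse, dictionary-based representation. This is a sketch under assumptions: the documents are made up, and the parameter names `x` and `max_features` are hypothetical stand-ins for the two rules (minimum number of occurrences, cap on the number of features).

```python
from collections import Counter

# Sparse representation: one {word: count} mapping per document, so
# words that do not occur in a document take up no space.
docs = [
    Counter("oyster card zone ticket".split()),
    Counter("oyster card heathrow".split()),
    Counter("hotel oyster tower".split()),
]


def prune(rows, x=1, max_features=None):
    """Keep words that occur more than x times in total, optionally
    limited to the max_features most frequent words."""
    total = Counter()
    for row in rows:
        total.update(row)
    kept = {w for w, n in total.most_common(max_features) if n > x}
    return [{w: c for w, c in row.items() if w in kept} for row in rows]


pruned = prune(docs)
print(pruned)
```

With the defaults, only "oyster" and "card" survive; every other word occurs just once and is dropped.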
How topic modeling works
Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html
Topic modeling transforms the matrix
• Re-arrange features (words) and documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to topic
Result: 6 data-driven personas
Topic 1: Organized Traveller
london day stay tour visit trip good hotel plan night
Topic 2: Tourist arriving by air
flight airport heathrow gatwick terminal arrive fly hotel time hour
Topic 3: Newbie
thank hi look help appreciate reply advance suggestion know hello
Topic 4: Travelcard user
card oyster buy travel day use ticket zone london travelcard
Topic 5: Public transport tourist
ticket train station time uk use line bus edit walk
Topic 6: Rental car user
car company drive hire rent rental parking service use return
Summary topic modeling
Decisions backed by data - for digital marketing, category management, product design, personalization, …
Detect distinct topics customer segments talk about
Find hidden customer segments based on interest
2nd take-away value of data-driven personas
Detect what people are interested in when talking about London
Word embeddings
Align your messages to what people actually like about you. Detect changing interests in real-time.
Search result for “tower” in the TripAdvisor forum
Analysis of words in text
• Order not used
• Relations between words neglected
• Lost semantics
Analyze n-grams
• Order taken into account
• Static relations via tuples
• Abstraction to semantics missing
So far in text analytics…
Each word is a single entity
Context decides about semantics!
Training word vectors
Aim: Find context information of words
CBOW model
• Predict word from context
Skip-gram model
• Determine context from word
• Slower, more precise with infrequent words
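The difference between the two models shows up in how training examples are formed from running text. The sketch below only builds those examples and does not train any network; the example sentence and the window size of 2 are assumptions.

```python
def training_pairs(tokens, window=2):
    """Yield (context, target) per position; the context holds up to
    `window` words on each side of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


sentence = "we walked from the tower to tower bridge".split()

# CBOW: the whole context is used to predict the target word.
cbow = [(tuple(ctx), tgt) for ctx, tgt in training_pairs(sentence)]

# Skip-gram: the target word is used to predict each context word.
skipgram = [(tgt, c) for ctx, tgt in training_pairs(sentence) for c in ctx]

print(cbow[4])
print(skipgram[:3])
```

Skip-gram produces one example per context word rather than one per position, which is one reason it is slower to train.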
Word2vec similarities in detail
Looking for similar sights after training word embeddings
Similar (or near) to the tower:
London Eye, Tower Bridge, British Museum, Buckingham Palace, Westminster Abbey
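Once words are vectors, "similar" usually means a high cosine similarity between them. The following sketch ranks neighbours of a word; the 3-dimensional vectors are invented toy values and not taken from the trained model, and the helper names are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only; real
# embeddings are learned from the corpus and have far more dimensions.
vectors = {
    "tower": [0.9, 0.1, 0.0],
    "tower_bridge": [0.8, 0.2, 0.1],
    "london_eye": [0.7, 0.3, 0.0],
    "heathrow": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, k=2):
    """Return the k words whose vectors are closest to `word` by cosine."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:k]


for name, score in most_similar("tower"):
    print(f"{name:14s} {score:.3f}")
```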
Search result for “airport” with the same trained word embeddings
There are five airports in London:
London City (LCY), Gatwick (LGW), Heathrow (LHR), Luton (LTN), Stansted (STN)
Looking for vector similarities with the same trained word embeddings
Knightsbridge tube is near Harrods, but where do we need to get off for Buckingham Palace?
Hyde Park, Westminster, Green Park
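Questions of this kind are commonly answered with vector arithmetic: take the offset between a sight and its station and apply it to another sight. The sketch below uses hand-built 2-dimensional toy vectors constructed so that the offset works; which station a trained model actually returns depends on the corpus.

```python
import math

# Hand-built toy vectors (first axis: station vs. sight, second axis:
# a made-up location coordinate). Not taken from any trained model.
vectors = {
    "harrods": [0.0, 1.0],
    "knightsbridge": [1.0, 1.0],
    "buckingham_palace": [0.0, 3.0],
    "green_park": [1.0, 3.0],
    "westminster": [1.0, 5.0],
    "hyde_park": [1.0, 1.5],
}


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Answer 'a is to b as ? is to c' via the offset a - b + c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


print(analogy("knightsbridge", "harrods", "buckingham_palace"))
```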
Summary word embeddings
• Benefit from changing trends
• Create semantically aware search results
Detect changing interests in real-time
Detect relevant context of a topic
3rd take-away value of semantic context
To wrap it up: methods & value from UGC analysis
Methods: Statistics & QA | Topic modeling | Word embeddings
Value: Ensure data quality | Uncover hidden structure | Detect semantic similarities
Looking beyond UGC: derive value from other text sources
Data-driven approach to deriving diverse, unbiased insight from …
Technical documentation
Company wikis
Change requests
Scientific publications
…
Future cost drivers
Knowledge bottlenecks
Emerging competing technologies
Technical debt
Stephanie Fischer
Dr. Christian Winkler
datanizing GmbH