The Power of Words: Comparing different approaches to text analysis
Christian Winkler
Who I am
Dr. Christian Winkler
Founder, Machine Learning Expert
Photo credit: Urs Wehrli
The challenge:
Information overload with text
12 million contracts
30,000 customer emails
1 million archived documents
100,000 documented change requests…

How can Machine Learning help?
Pipeline for text analytics

1. Collect content automatically
2. Cleaning & linguistics
3. Statistics & QA
4. Data-driven insights
5. Visualization & reporting
Spidering
Extraction of text and metadata
Normalization
Determine language
Synonyms
Outlier detection
Feature extraction
Regression
Clustering
Overall structure
Word combinations
Categories
Timelines
Semantics
Preparation of text: Cleaning & Linguistics (sketched in code after the list below)
Tokenization
Stopword removal
Normalization
Lemmatization
Named Entity Recognition
Part of speech tagging
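A minimal sketch of these preparation steps with spaCy (the library and the model name "en_core_web_sm" are assumptions; the slides do not name a tool):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pete likes London, but not Paris.")

for token in doc:
    print(token.text,     # tokenization
          token.lemma_,   # lemmatization
          token.pos_,     # part-of-speech tag
          token.is_stop)  # stopword flag

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```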
Only consider word frequencies
• Term frequency (TF, TF-IDF)
• Simple, but robust
• Basis for many algorithms (retrieval, classification, topic modeling)
Disadvantages
• Oversimplified language model
• Syntactical and relational information is lost
Improvements e.g. via n-grams
Bag-of-Words vectorization: Documents

D1: "Pete likes London. Pete likes Paris."
D2: "Pete does not like London."
D3: "Pete likes London, but not Paris."

      Pete  likes  London  Paris  does  not  like  but
D1     2     2      1       1      0     0    0     0
D2     1     0      1       0      1     1    1     0
D3     1     1      1       1      0     1    0     1
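A sketch of this vectorization with scikit-learn's CountVectorizer (an assumption; the slides name no library). The vectorizer lowercases tokens and orders columns alphabetically, so the printed columns differ from the table above; the last line shows the n-gram improvement mentioned earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Pete likes London. Pete likes Paris.",
    "Pete does not like London.",
    "Pete likes London, but not Paris.",
]

cv = CountVectorizer()           # pure term frequencies (TF)
dtm = cv.fit_transform(docs)     # sparse document-term matrix
print(cv.get_feature_names_out())
print(dtm.toarray())

# improvement via n-grams: unigrams plus bigrams preserve some word order
cv2 = CountVectorizer(ngram_range=(1, 2))
```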
Topic Modeling
Look for hidden/latent structure
1) What are candidates for topics?
2) How are they distributed in document space?
Basic idea
[Diagram: matrix with topics 1 … k as rows and documents 1 … n as columns]
How Topic Modeling works
Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modeling transforms the matrix:
• Re-arrange features (words) and documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to topics
Example topic modeling: Reuters News
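A minimal topic-modeling sketch with scikit-learn's LDA (the library choice and `reuters_news`, standing in for the Reuters corpus, are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# reuters_news: assumed list of raw news texts
vectorizer = CountVectorizer(stop_words="english", min_df=5)
dtm = vectorizer.fit_transform(reuters_news)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(dtm)   # topic distribution per document

# the top words per topic correspond to the "blocks" above
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[-8:][::-1]
    print(f"Topic {i}:", ", ".join(words[j] for j in top))
```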
Summary topic modeling
Fast summary for digital marketing, product design, personalization
Detect what people are talking (writing) about
Find hidden (niche) structure
Result: Latent structure of dataset
Word Embeddings: Vectorizing words
"You shall know a word by the company it keeps."
Example: What is "tezgüino"? What is similar to "tezgüino"?
Distributional Hypothesis (Firth, 1957)

A bottle of ____ is on the table.
Everybody likes ____.
Don't have ____ before you drive.
We make ____ out of corn.
Eisenstein, Jacob: Natural Language Processing. Georgia Tech, 2018. Ch. 14
Semantic similarity
My cat eats fish on Saturday
His cat eats turkey on Tuesday
My dog eats meat on Sunday
His dog eats turkey on Monday
Syntagmatic axis: words that co-occur in the same sentence (horizontal)
Paradigmatic axis: words that appear in similar contexts (vertical)

Similar contexts (paradigmatic): cat ≈ dog
Co-occurrence (syntagmatic): cat ≈ eats
Basic idea of word embeddings
Learn semantics via
context
Princess  0.93  0.01  0.93  0.12  …
Woman     0.08  0.03  0.98  0.51  …
Queen     0.99  0.05  0.97  0.72  …
King      0.96  0.99  0.03  0.64  …
”The man who passes the sentence should swing the sword.” – Ned Stark
Predict word (red) via context (green)
Order of context is ignored (bag of words)
Rows of the weight matrix W yield n-dimensional word vectors
word2vec training (Continuous Bag of Words)

[Diagram: the one-hot-encoded context words "the", "sentence", "should", and "sword" feed through the weight matrix W into an N-dimensional hidden layer; the neural network predicts the target word "swing"]
Training with 68 subreddits, each containing 1,000 posts about TV shows
25 epochs, takes a few minutes
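A training sketch under these settings with gensim's Word2Vec (gensim itself is an assumption, as is `posts` holding the subreddit texts):

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# posts: assumed list of raw subreddit posts
sentences = [simple_preprocess(p) for p in posts]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension N of the hidden layer
    window=2,          # context words on each side
    sg=0,              # 0 = CBOW, as in the diagram above
    epochs=25,         # as in the slide
    min_count=5,
)
print(model.wv.most_similar("simpsons"))  # assumes the token is in the vocabulary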
Example
"Simpsons"Without direct relation to tv shows
Similarities of relationships
x = Kirk + Marge - Homer
[Diagram: 2D sketch where the difference vector from Homer to Marge, applied to Kirk, points to x]
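With the gensim model trained above, this analogy is a single most_similar query (a sketch; lowercase tokens follow the preprocessing used during training):

```python
# x = Kirk + Marge - Homer
result = model.wv.most_similar(positive=["kirk", "marge"],
                               negative=["homer"], topn=3)
print(result)  # nearest neighbors of x with their similarities
```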
t-SNE or PCA
Interesting, but only clusters are relevant; difference vectors are incorrectly mapped
Visualization
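A visualization sketch with scikit-learn's t-SNE and matplotlib (both assumptions; PCA works the same way via sklearn.decomposition.PCA):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = model.wv.index_to_key[:200]   # 200 most frequent words
vectors = model.wv[words]

# project the word vectors to 2D; only cluster structure survives
# the projection, distances and difference vectors get distorted
xy = TSNE(n_components=2, random_state=42).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(xy[:, 0], xy[:, 1], s=3)
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y), fontsize=7)
plt.show()
```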
fastText (Open Source software from Facebook)
• Uses character n-grams
• Useful for spell checking
• Pre-trained models for detecting language
GloVe
• Uses global co-occurrence matrix
• Focus on non-local semantic similarities
• Sometimes better similarity compared to word2vec
• Less popular
fastText & GloVe
word2vec: Google Research
fastText: Facebook Research; character n-grams; classification, fault-tolerant
GloVe: Stanford University; different method (co-occurrence matrix)
fastText: Language detection
[Chart: language detection results, logarithmic scale]
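A detection sketch with fastText's pre-trained language identification model (lid.176.bin refers to the 176-language model published on the fastText website):

```python
import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained, 176 languages
labels, probs = model.predict("Das ist ein deutscher Satz.")
print(labels, probs)  # e.g. ('__label__de',) with its probability
```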
Summary word embeddings
• Find new trends
• Build semantic search engines
• Detect changes over time
• Vectors come with a similarity measure (inner product)

Summary: Understand semantic context of words
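A minimal sketch of that similarity measure, as it would sit at the core of a semantic search engine (plain NumPy; all names are illustrative):

```python
import numpy as np

def most_similar(query_vec, doc_vecs, topn=5):
    # cosine similarity = inner product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:topn]  # indices of the best matches
```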
ELMo: Contextualized embeddings
Words have different meanings depending on context
Context helps to improve word representation in vector space
Basic idea
He talked to the pole
He climbed the pole.
The pole wore a blue jacket.
The pole consisted of metal.
The pole was shaking.
She turned to the pole.
[Diagram: ELMo stacks bi-directional LSTM layers over the input "He climbed the pole": the starting word vectors (WV) feed into LSTM layer 1, producing partly contextualized WVs, which feed into the next LSTM layer, producing fully contextualized WVs]
Context-aware training
• Use bi-directional LSTMs
• Layers represent different levels of abstraction
• Character n-grams with CNN as starting point for (uncontextualized) word vectors
• Well-suited also for short texts
Disadvantages compared to word2vec
• Words cannot be mapped to a unique vector
• Much, much, much slower in training and evaluation
Semantic sentence representation
• E.g. via sum of contextualized word vectors
• Improvements via TF/IDF
• Often used for sentiment analysis
ELMo is well suited for short sentences
Training with headlines from Reuters World News
Calculate ELMo sentence embeddings
Find semantically similar news
Example
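A sentence-embedding sketch using ELMo through the older allennlp API (an assumption: ElmoEmbedder with default weights, as shipped in allennlp 0.x; the headlines are made up):

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained weights

def sentence_vector(tokens):
    # shape (3 layers, n_tokens, 1024); average the top,
    # fully contextualized layer over all tokens
    layers = elmo.embed_sentence(tokens)
    return layers[2].mean(axis=0)

a = sentence_vector("Hurricane hits Florida coast".split())
b = sentence_vector("Storm makes landfall in Florida".split())
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```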
BERT: Transfer Learning
Transfer Learning
Classical ML

Data set 1 → Model 1 → Task 1
Data set 2 → Model 2 → Task 2

A model is trained for exactly one task, starting without prior knowledge. A lot of training data is needed to get good results.
Transfer Learning

Base model → Base task
Improved model → Task

A base model (trained with a very large dataset) is retrained to a more specific task with a comparatively small dataset.
Fine-tuning with Transfer Learning

Base data set (Wikipedia, books) → Pretrained base model, fine-tuned per task:
• Classification data → Classification model
• Question-answer data → QA model
• Named entity data → NER model
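A fine-tuning sketch with Hugging Face transformers (the library choice, model name, and `train_dataset` are assumptions): the pretrained base model gets a fresh classification head that is retrained on a small task-specific dataset.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # classification head is newly initialized

# train_dataset: assumed small, already tokenized task dataset
args = TrainingArguments(output_dir="out", num_train_epochs=3)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```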
BERT (Task 1: Masked LM, Task 2: Next Sentence) + question-answer data → QA model → prediction

BERT-Base, Multilingual Cased: 104 languages, 12 layers, 768 hidden units, 110 million parameters; 4 days on 4-16 TPUs; 2 GB size
SQuAD: 150,000 QA pairs; training takes 1 h on a Cloud TPU
Billions of words
We could use the selfposts as a data basis.
However, we don't know what is actually inside.
Therefore, use the Wikipedia article https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series
Text-only version 57 kB
Question Answering
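A question-answering sketch with the transformers pipeline (library and model name are assumptions; any BERT model fine-tuned on SQuAD works, and the file name for the text-only article is made up):

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

# the 57 kB text-only version of the Wikipedia article
context = open("star_trek_tos.txt").read()
answer = qa(question="Who created Star Trek?", context=context)
print(answer["answer"], answer["score"])
```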
State-of-the-Art Text Mining
Classical Text Mining:
Document vectors (bag-of-words / TF-IDF) → Classification, Topic Modeling

(Pre-trained) embeddings and Deep Learning for complex tasks:
Document vectors (fixed-length sequences) → Embedding → Deep Neural Net → Sentiment Analysis, Knowledge Extraction, Question Answering

Embeddings:
Word or document vectors → Semantic Similarity
Commercial use of these methods
Data-driven approach to collect knowledge from different sources:
• Technical documentation
• Enterprise wikis
• Change requests
• Scientific publications
• …

Cost drivers: knowledge silos, technical debt

Detect game-changing technologies early