The Power of Words: Comparing different approaches to text analysis
Christian Winkler
Who I am
Dr. Christian Winkler
Founder, Machine Learning Expert
Photo credit: Urs Wehrli
The challenge:
Information overload with text
12 million contracts
30,000 customer emails
1 million archived documents
100,000 documented change requests…

How can Machine Learning help?
Pipeline for text analytics

1. Collect content automatically
2. Cleaning & linguistics
3. Statistics & QA
4. Data-driven insights
5. Visualization & reporting
Spidering
Extraction of text and metadata
Normalization
Determine language
Synonyms
Outlier detection
Feature extraction
Regression
Clustering
Overall structure
Word combinations
Categories
Timelines
Semantics
Preparation of text: Cleaning & Linguistics (sketched in code after the list below)
Tokenization
Stopword removal
Normalization
Lemmatization
Named Entity Recognition
Part of speech tagging
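A minimal sketch of these preparation steps with spaCy (the library and the model name "en_core_web_sm" are assumptions; the slides do not name a tool):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pete likes London, but not Paris.")

for token in doc:
    print(token.text,     # tokenization
          token.lemma_,   # lemmatization
          token.pos_,     # part-of-speech tag
          token.is_stop)  # stopword flag

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```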
Only consider word frequencies
• Term frequency (TF, TF-IDF)
• Simple, but robust
• Basis for many algorithms (retrieval, classification, topic modeling)
Disadvantages
• Oversimplified language model
• Syntactical and relational information is lost
Improvements e.g. via n-grams
Bag-of-Words vectorization: Documents

D1: "Pete likes London. Pete likes Paris."
D2: "Pete does not like London."
D3: "Pete likes London, but not Paris."

      Pete  likes  London  Paris  does  not  like  but
D1     2     2      1       1      0     0    0     0
D2     1     0      1       0      1     1    1     0
D3     1     1      1       1      0     1    0     1
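A sketch of this vectorization with scikit-learn's CountVectorizer (an assumption; the slides name no library). The vectorizer lowercases tokens and orders columns alphabetically, so the printed columns differ from the table above; the last line shows the n-gram improvement mentioned earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Pete likes London. Pete likes Paris.",
    "Pete does not like London.",
    "Pete likes London, but not Paris.",
]

cv = CountVectorizer()           # pure term frequencies (TF)
dtm = cv.fit_transform(docs)     # sparse document-term matrix
print(cv.get_feature_names_out())
print(dtm.toarray())

# improvement via n-grams: unigrams plus bigrams preserve some word order
cv2 = CountVectorizer(ngram_range=(1, 2))
```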
Topic Modeling
Look for hidden/latent structure
1) What are candidates for topics?
2) How are they distributed in document space?
Basic idea
[Diagram: matrix with topics 1 … k as rows and documents 1 … n as columns]
How Topic Modeling works
Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html

Topic modeling transforms the matrix:
• Re-arrange features (words) and documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to topics
Example topic modeling: Reuters News
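A minimal topic-modeling sketch with scikit-learn's LDA (the library choice and `reuters_news`, standing in for the Reuters corpus, are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# reuters_news: assumed list of raw news texts
vectorizer = CountVectorizer(stop_words="english", min_df=5)
dtm = vectorizer.fit_transform(reuters_news)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(dtm)   # topic distribution per document

# the top words per topic correspond to the "blocks" above
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[-8:][::-1]
    print(f"Topic {i}:", ", ".join(words[j] for j in top))
```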
Summary topic modeling
Fast summary for digital marketing, product design, personalization
Detect what people are talking (writing) about
Find hidden (niche) structure
Result: Latent structure of dataset
Word Embeddings: Vectorizing words
"You shall know a word by the company it keeps."
Example: What is "tezgüino"? What is similar to "tezgüino"?
Distributional Hypothesis (Firth, 1957)

A bottle of ____ is on the table.
Everybody likes ____.
Don't have ____ before you drive.
We make ____ out of corn.
Eisenstein, Jacob: Natural Language Processing. Georgia Tech, 2018. Ch. 14
Semantic similarity
My cat eats fish on Saturday
His cat eats turkey on Tuesday
My dog eats meat on Sunday
His dog eats turkey on Monday
Syntagmatic axis: words that co-occur in the same sentence (horizontal)
Paradigmatic axis: words that appear in similar contexts (vertical)

Similar contexts (paradigmatic): cat ≈ dog
Co-occurrence (syntagmatic): cat ≈ eats
Basic idea of word embeddings
Learn semantics via
context
Princess  0.93  0.01  0.93  0.12  …
Woman     0.08  0.03  0.98  0.51  …
Queen     0.99  0.05  0.97  0.72  …
King      0.96  0.99  0.03  0.64  …
”The man who passes the sentence should swing the sword.” – Ned Stark
Predict word (red) via context (green)
Order of context is ignored (bag of words)
Rows of the weight matrix W yield n-dimensional word vectors
word2vec training (Continuous Bag of Words)

[Diagram: the one-hot-encoded context words "the", "sentence", "should", and "sword" feed through the weight matrix W into an N-dimensional hidden layer; the neural network predicts the target word "swing"]
Training with 68 subreddits, each containing 1,000 posts about TV shows
25 epochs, takes a few minutes
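A training sketch under these settings with gensim's Word2Vec (gensim itself is an assumption, as is `posts` holding the subreddit texts):

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# posts: assumed list of raw subreddit posts
sentences = [simple_preprocess(p) for p in posts]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension N of the hidden layer
    window=2,          # context words on each side
    sg=0,              # 0 = CBOW, as in the diagram above
    epochs=25,         # as in the slide
    min_count=5,
)
print(model.wv.most_similar("simpsons"))  # assumes the token is in the vocabulary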
Example
"Simpsons"Without direct relation to tv shows
Similarities of relationships
x = Kirk + Marge - Homer
[Diagram: 2D sketch where the difference vector from Homer to Marge, applied to Kirk, points to x]
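With the gensim model trained above, this analogy is a single most_similar query (a sketch; lowercase tokens follow the preprocessing used during training):

```python
# x = Kirk + Marge - Homer
result = model.wv.most_similar(positive=["kirk", "marge"],
                               negative=["homer"], topn=3)
print(result)  # nearest neighbors of x with their similarities
```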
t-SNE or PCA
Interesting, but only clusters are relevant; difference vectors are incorrectly mapped
Visualization
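A visualization sketch with scikit-learn's t-SNE and matplotlib (both assumptions; PCA works the same way via sklearn.decomposition.PCA):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = model.wv.index_to_key[:200]   # 200 most frequent words
vectors = model.wv[words]

# project the word vectors to 2D; only cluster structure survives
# the projection, distances and difference vectors get distorted
xy = TSNE(n_components=2, random_state=42).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(xy[:, 0], xy[:, 1], s=3)
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y), fontsize=7)
plt.show()
```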
fastText (Open Source software from Facebook)
• Uses character n-grams
• Useful for spell checking
• Pre-trained models for detecting language
GloVe
• Uses global co-occurrence matrix
• Focus on non-local semantic similarities
• Sometimes better similarity compared to word2vec
• Less popular
fastText & GloVe
word2vec: Google Research
fastText: Facebook Research; character n-grams; classification, fault-tolerant
GloVe: Stanford University; different method (co-occurrence matrix)
fastText: Language detection
[Chart: language detection results, logarithmic scale]
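A detection sketch with fastText's pre-trained language identification model (lid.176.bin refers to the 176-language model published on the fastText website):

```python
import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained, 176 languages
labels, probs = model.predict("Das ist ein deutscher Satz.")
print(labels, probs)  # e.g. ('__label__de',) with its probability
```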
Summary word embeddings
• Find new trends
• Build semantic search engines
• Detect changes over time
• Vectors come with a similarity measure (inner product)

Summary: Understand semantic context of words
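A minimal sketch of that similarity measure, as it would sit at the core of a semantic search engine (plain NumPy; all names are illustrative):

```python
import numpy as np

def most_similar(query_vec, doc_vecs, topn=5):
    # cosine similarity = inner product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:topn]  # indices of the best matches
```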
ELMo: Contextualized embeddings
Words have different meanings depending on context
Context helps to improve word representation in vector space
Basic idea
He talked to the pole
He climbed the pole.
The pole wore a blue jacket.
The pole consisted of metal.
The pole was shaking.
She turned to the pole.
[Diagram: ELMo stacks bi-directional LSTM layers over the input "He climbed the pole": the starting word vectors (WV) feed into LSTM layer 1, producing partly contextualized WVs, which feed into the next LSTM layer, producing fully contextualized WVs]
Context-aware training
• Use bi-directional LSTMs
• Layers represent different levels of abstraction
• Character n-grams with CNN as starting point for (uncontextualized) word vectors
• Well-suited also for short texts
Disadvantages compared to word2vec
• Words cannot be mapped to a unique vector
• Much, much, much slower in training and evaluation
Semantic sentence representation
• E.g. via sum of contextualized word vectors
• Improvements via TF/IDF
• Often used for sentiment analysis
ELMo is well suited for short sentences
Training with headlines from Reuters World News
Calculate ELMo sentence embeddings
Find semantically similar news
Example
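A sentence-embedding sketch using ELMo through the older allennlp API (an assumption: ElmoEmbedder with default weights, as shipped in allennlp 0.x; the headlines are made up):

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained weights

def sentence_vector(tokens):
    # shape (3 layers, n_tokens, 1024); average the top,
    # fully contextualized layer over all tokens
    layers = elmo.embed_sentence(tokens)
    return layers[2].mean(axis=0)

a = sentence_vector("Hurricane hits Florida coast".split())
b = sentence_vector("Storm makes landfall in Florida".split())
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```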
BERT: Transfer Learning
Transfer Learning
Classical ML

Data set 1 → Model 1 → Task 1
Data set 2 → Model 2 → Task 2

A model is trained for exactly one task, starting without prior knowledge. A lot of training data is needed to get good results.
Transfer Learning

Base model → Base task
Improved model → Task

A base model (trained with a very large dataset) is retrained to a more specific task with a comparatively small dataset.
Fine-tuning with Transfer Learning

Base data set (Wikipedia, books) → Pretrained base model, fine-tuned per task:
• Classification data → Classification model
• Question-answer data → QA model
• Named entity data → NER model
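A fine-tuning sketch with Hugging Face transformers (the library choice, model name, and `train_dataset` are assumptions): the pretrained base model gets a fresh classification head that is retrained on a small task-specific dataset.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # classification head is newly initialized

# train_dataset: assumed small, already tokenized task dataset
args = TrainingArguments(output_dir="out", num_train_epochs=3)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```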
BERT (Task 1: Masked LM, Task 2: Next Sentence) + question-answer data → QA model → prediction

BERT-Base, Multilingual Cased: 104 languages, 12 layers, 768 hidden units, 110 million parameters; 4 days on 4-16 TPUs; 2 GB size
SQuAD: 150,000 QA pairs; training takes 1 h on a Cloud TPU
Billions of words
We could use the selfposts as a data basis.
However, we don't know what is actually inside.
Therefore, use the Wikipedia article https://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series
Text-only version 57 kB
Question Answering
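A question-answering sketch with the transformers pipeline (library and model name are assumptions; any BERT model fine-tuned on SQuAD works, and the file name for the text-only article is made up):

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

# the 57 kB text-only version of the Wikipedia article
context = open("star_trek_tos.txt").read()
answer = qa(question="Who created Star Trek?", context=context)
print(answer["answer"], answer["score"])
```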
State-of-the-Art Text Mining
Classical Text Mining:
Document vectors (bag-of-words / TF-IDF) → Classification, Topic Modeling

(Pre-trained) embeddings and Deep Learning for complex tasks:
Document vectors (fixed-length sequences) → Embedding → Deep Neural Net → Sentiment Analysis, Knowledge Extraction, Question Answering

Embeddings:
Word or document vectors → Semantic Similarity
Commercial use of these methods
Data-driven approach to collect knowledge from different sources:
• Technical documentation
• Enterprise wikis
• Change requests
• Scientific publications
• …

Cost drivers: knowledge silos, technical debt

Detect game-changing technologies early