
Deep Learning for Natural Language Processing

(in 2 hours)

Eneko Agirre
http://ixa2.si.ehu.eus/eneko

IXA NLP group
http://ixa.eus

@eagirre

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 2

Contents

● Introduction to Deep Learning
  – Deep Learning ~ Learning Representations

● Text as bag of words

● Text as a sequence

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 4

Quiz

● How many of you have done:
  – A course on linguistics
  – A course on computational linguistics (aka NLP)
  – A course on machine learning
  – A course on deep learning

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 5

Introduction to Deep Learning

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 6

What is deep learning

A subfield of machine learning
● Supervised ML: given a dataset of examples x with labels y
● Learn a function f(x) → y with low training error and low test error

Key manual step: design features that extract the key information from x (the representation)
● e.g. weather forecast (wind, temperature, humidity, pressure, precipitation … at local and nearby locations)
● e.g. sentiment of a tweet (keywords like “good”, “bad”, certain emojis, ...)

Source: staesthetic.wordpress.com

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 7

What is deep learning

Source: Chris Manning cs224n

x f(x)=y

Source: www.vaetas.cz

Deep? Multiple levels

Deep learning jointly learns representation and output

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 11

ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012)

Learning hierarchical representations for face verification with convolutional deep belief networks (Huang et al. 2012)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 17

Brief history of NLP

● 1960s: Complex rules and first-order logic. Humans build complex grammars.

● 1990s: Supervised machine learning. Humans annotate text, design laborious task-specific features, and apply ML techniques.

● 2010s: Deep learning. Learn continuous representations, get rid of task-specific features.

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 18

Why deep learning for NLP?

● Technology behind current speech processing and machine translation
  – Advances in the state of the art on most tasks
● Focus on representation learning
  – Learns representations for words and word sequences fitted to the task, including world and visual knowledge
  – Naturally accounts for graded judgments about language
    ● Word similarity: building / house
    ● Sentence similarity: A pony is close to the house / There is a horse in the front yard
● End-to-end joint learning (vs. pipeline)
● Transfer models across tasks (word embeddings, sentence encoders)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 19

Contents

● Introduction to Deep Learning
  – Deep Learning ~ Learning Representations

● Text as bag of words
  – Text Classification
  – Representation learning and word embeddings
  – Superhuman: cross-lingual word embeddings

● Text as a sequence

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 20

Classification

● Input
  – A document d ∈ D
  – A fixed set of classes C = {c1, c2, …, cj}

● Output: a predicted class y ∈ C

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 23

Classification

Supervised machine learning

Learn a classifier from hand-annotated examples:
● Input: a training set of m hand-labeled documents (x1, y1) … (xm, ym)

● Output: a learned classifier f: D → C

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 24

Text Classification

● Most NLP tasks can be framed as text classification
  – PoS tagging, parsing, QA, MT ...

● Simple example to showcase key methods
  – Sentiment analysis (polarity)
  – Positive or negative movie review:
    ● Full of zany characters and richly applied satire
    ● It was pathetic
    ● Modestly accomplished, lifted by two terrific performances
    ● Not NEARLY as funny as its title

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 25

Documents as feature vectors (bag of words)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 26

Feature vectors

Representation of each example (document)
● Key idea for most machine learning
● Each example as a vector of features x
  – All examples have the same number of features
  – Features: boolean, integer, real

● Pre-processing code converts each example into a feature vector
  – Substantial effort (e.g. tokenization)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 27

Bag of words

Representation of each example (document)

⇒ Bag of words representation

Source: Sam Bowman

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 30

Bag of words

Classifier:
● Input: a vector of |V| binary features
  – Simple sentiment analysis: 1 if the word occurs in the document, 0 otherwise
● Output of the classifier: discrete
  – Simple sentiment analysis: two classes

f( it:1, the:0, and:0, by:0, …, terrific:1, pathetic:0, funny:0, … ) =

Positive!
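Below is a minimal sketch (not from the slides) of how such a binary bag-of-words vector can be built; the toy vocabulary, whitespace tokenization and example review are assumptions for illustration.

```python
# Build a |V|-dimensional binary bag-of-words vector for a review.
import numpy as np

vocab = ["it", "the", "and", "by", "terrific", "pathetic", "funny"]  # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Return a binary vector: 1 if the word occurs in the text, 0 otherwise."""
    x = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in word_to_idx:
            x[word_to_idx[token]] = 1.0
    return x

x = bag_of_words("Lifted by two terrific performances, it works")
print(dict(zip(vocab, x)))   # e.g. {'it': 1.0, ..., 'by': 1.0, 'terrific': 1.0, ...}
```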

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 31

Classification: Softmax (Logistic Regression)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 35

Softmax classification
● Given: a vector of weights w_c for each class c
  (e.g. “terrific” gets a high weight for the positive class)
  – For each class compute w_c·x + b
  – Add a non-linearity: f_c = exp(w_c·x + b)
  – Normalize to estimate probabilities:
      p(y=c|x) = f_c / Σ_{c'∈C} f_{c'}

● That is the softmax function:

  softmax(w_c·x) = exp(w_c·x) / Σ_{c'∈C} exp(w_{c'}·x)
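A small NumPy sketch of this classifier (the weights, bias and input below are toy assumptions): compute the scores w_c·x + b for both classes and normalize with softmax.

```python
# Softmax classifier forward pass over a bag-of-words vector.
import numpy as np

def softmax(scores):
    scores = scores - scores.max()      # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

W = np.random.randn(2, 7) * 0.1         # 2 classes (neg, pos), |V| = 7 features (toy)
b = np.zeros(2)
x = np.array([1, 0, 0, 0, 1, 0, 0.0])   # bag-of-words vector from before

probs = softmax(W @ x + b)              # p(y = c | x) for each class
y_hat = probs.argmax()                  # predicted class
```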

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 36

Softmax classification

● Given: a vector of weights w_c for each class c

● Output value: y = argmax_c softmax(w_c·x)

● Task for the training algorithm:
  – Given labeled examples
  – Find the w_c vectors such that the output is correct for as many training examples as possible

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 38

Softmax classification

Estimating weights (parameters)
● Choose the parameters which minimize the error over the training data

● Loss function J (aka cost function or objective function)

● Cross-entropy error on one example (x_i, y_i)
  – We want to maximize the probability of the correct class,
    i.e. minimize the negative log probability of the correct class

  J_i(W) = −log P(y_i = c | x_i) = −log( exp(W_c·x_i) / Σ_{c'∈C} exp(W_{c'}·x_i) )

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 39

Softmax classification

Estimating parameters
● Search for the parameters which minimize the error over the training data
● Back-propagation + stochastic gradient descent:
  – Start with random parameters
  – Select K examples (a mini-batch) at random
  – Change the parameters a little bit towards the minimum of the loss function for those K examples (derivatives!)
  – Continue until the loss function converges (or starts to increase)

A sketch of such a training loop is shown below.

Source: staesthetic.wordpress.com
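A minimal NumPy sketch of mini-batch SGD for the softmax classifier above, on assumed toy data: forward pass, cross-entropy gradient, and a small step against the averaged gradient.

```python
# Mini-batch SGD for a softmax classifier with cross-entropy loss (toy data).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 7)).astype(float)    # toy binary feature vectors
y = rng.integers(0, 2, size=100)                        # toy labels (0 = neg, 1 = pos)
W, b = np.zeros((2, 7)), np.zeros(2)

for step in range(500):
    batch = rng.choice(len(X), size=8, replace=False)   # K examples at random
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for i in batch:
        p = softmax(W @ X[i] + b)
        p[y[i]] -= 1.0                                   # d(cross-entropy)/d(scores)
        gW += np.outer(p, X[i])
        gb += p
    W -= 0.1 * gW / len(batch)                           # move a little bit downhill
    b -= 0.1 * gb / len(batch)
```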

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 41

Softmax classification

● An input layer – the feature vector

● An output layer – class probabilities

Train, given labeled data (x1,y1) … (xm, ym)

● For each example (x_i, y_i):
  – Forward: input x_i, obtain predicted class c_i
  – If c_i ≠ y_i then backward: adjust the parameters

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 42

Softmax classification

Source: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Intuition:
● Assume 2D feature vectors
● W defines the linear decision boundary

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 43

Going deep: Multilayer perceptron

● An input layer – just a feature vector

● One or more hidden layers, each computed on the layer below – latent features (representations)

● An output layer, based on the top hidden layer – class probabilities

● Also known as Feed Forward

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 44

Going deep: Multilayer perceptron
Softmax classification:

  y = softmax(W x + b)

[diagram: x → W → y]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 45

Going deep: Multilayer perceptron

  y   = softmax(W_2 h_1 + b_2)
  h_1 = f(W_1 h_0 + b_1)
  h_0 = f(W_0 x + b_0)

[diagram: x → W_0 → h_0 → W_1 → h_1 → W_2 → y]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 46

Going deep: Multilayer perceptron
Layers have the same structure:

  h_i = f(W_i h_{i−1} + b_i)

Non-linear functions!

  Sigmoid: σ(x) = 1 / (1 + exp(−x))
  Hyperbolic tangent: tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
  Rectified linear unit: relu(x) = max(0, x)
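A short sketch of the forward pass of such an MLP in NumPy, using relu as the non-linearity f; the layer sizes and random weights are illustrative assumptions.

```python
# Two-hidden-layer MLP forward pass: h0 = f(W0 x + b0), h1 = f(W1 h0 + b1), y = softmax(W2 h1 + b2).
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

V, H0, H1, C = 7, 16, 8, 2                   # vocab size, hidden sizes, classes (toy)
W0, b0 = np.random.randn(H0, V) * 0.1, np.zeros(H0)
W1, b1 = np.random.randn(H1, H0) * 0.1, np.zeros(H1)
W2, b2 = np.random.randn(C, H1) * 0.1, np.zeros(C)

x  = np.array([1, 0, 0, 0, 1, 0, 0.0])       # bag-of-words input
h0 = relu(W0 @ x + b0)
h1 = relu(W1 @ h0 + b1)
y  = softmax(W2 @ h1 + b2)                   # class probabilities
```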

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 47

Going deep: Multilayer perceptron
Motivation:

Without non-linearities there is no extra expressivity: a sequence of linear transformations is just a linear transformation.

How do we train it?

Source: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 48

What about representation learning?

● Document as bag of words:
  – Each word ~ a one-hot vector (0 0 0 0 0 … 1 … 0 0 0)
  – Each document ~ the sum of the words in the document (0 0 0 1 0 … 1 … 1 0 0)

● NLP has been representing words as distinct features for many years!
  – Is there a better alternative?

Source: Sam Bowman

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 49

Representation learning

  y   = softmax(W_2 h_1 + b_2)
  h_1 = f(W_1 h_0 + b_1)
  h_0 = f(W_0 x + b_0)

[diagram: a one-hot input x (1 0 0 0 …) multiplied by W_0 yields h_0]

MLP: what are we learning in W_0 when we backpropagate?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 50

Representation learning

[diagram: a one-hot input x (1 0 0 0 …) multiplied by W_0 yields h_0]

  y   = softmax(W_2 h_1 + b_2)
  h_1 = f(W_1 h_0 + b_1)
  h_0 = f(W_0 x + b_0)

MLP: what are we learning in W_0 when we backpropagate?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 51

Representation learning

Back-propagation allows Neural Networks to learn representations for words while training: word embeddings!

[diagram: a one-hot input x (1 0 0 0 …) multiplied by W_0 yields h_0, the word embedding]

  y   = softmax(W_2 h_1 + b_2)
  h_1 = f(W_1 h_0 + b_1)
  h_0 = f(W_0 x + b_0)
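Why the columns of W_0 act as word embeddings: multiplying W_0 by a one-hot vector (before the non-linearity) simply selects one column. A tiny sketch with assumed toy sizes:

```python
# A one-hot input selects a single column of W_0: that column is the word's embedding.
import numpy as np

V, D = 5, 3                                   # toy vocabulary size, embedding size
W0 = np.random.randn(D, V)

one_hot = np.zeros(V); one_hot[2] = 1.0       # the third word of the vocabulary
h0 = W0 @ one_hot                             # equals W0[:, 2]: that word's embedding

assert np.allclose(h0, W0[:, 2])
```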

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 52

Representation learning

● Back-propagation allows neural networks to learn representations for words while training!
  – Word embeddings! A continuous vector space instead of one-hot vectors

● Are these word embeddings useful for anything?
● Which task would be the best to learn embeddings that can be used in other tasks?
● Can we transfer this representation from one task to another?
● Can we have all languages in one embedding space?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 53

Representation learning and word embeddings

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 54

Word embeddings
● Let’s represent words as vectors: similar words should have vectors which are close to each other

● If an AI has seen these two sequences:
    I live in Cambridge
    I live in Paris
● … then which one should be more plausible?
    I live in Tallinn
    I live in policeman

[figure: 2D embedding space with Cambridge, London, Tallinn, Paris in one cluster and policeman, judge, driver, cop in another]

WHY?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 55

Word embeddings
● Let’s represent words as vectors: similar words should have vectors which are close to each other

● If an AI has seen these two sequences:
    I live in Cambridge
    I live in Paris
● … then which one should be more plausible?
    I live in Tallinn        OK
    I live in policeman

[figure: 2D embedding space with Cambridge, London, Tallinn, Paris in one cluster and policeman, judge, driver, cop in another]

WHY?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 59

Word embeddings

Distributional vector spaces:

Option 1: use a PPMI co-occurrence matrix
● 1A: large sparse matrix
● 1B: factorize it and use a low-rank dense matrix (SVD)

Option 2: learn a low-rank dense matrix directly
● 2A: MLP on a particular classification task
● 2B: find a general task

[figure: 2D embedding space with the word clusters]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 60

Word embeddings

General task with large quantities of data: guess the missing word (language models)

  … people who keep pet dogs or cats exhibit better mental and physical health …

CBOW: given the context, guess the middle word
SKIP-GRAM: given the middle word, guess the context

Proposed by Mikolov et al. (2013): word2vec

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 62

CBOW

Like MLP, one layer, LARGE vocabulary

… people who keep pet dogs or cats exhibit better mental and physical health …

Cross-entropy loss / softmax is expensive: a LARGE number of classes (the whole vocabulary)!

Negative sampling instead:

  J_NEG(w_t) = log σ(h·o_i) − Σ_{k=1}^{K} E_{w_n ∼ P_noise} [ log σ(h·o_n) ]

Source: Mikolov et al. 2013
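A sketch of negative sampling in the standard word2vec formulation (the uniform noise distribution, toy sizes and random initialization below are assumptions): score the true middle word against K sampled noise words with a sigmoid, instead of a softmax over the whole vocabulary.

```python
# CBOW with negative sampling: one positive (the true middle word) vs K noise words.
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

V, D, K = 10000, 100, 5                          # vocab size, embedding size, negatives (toy)
W_in  = np.random.randn(V, D) * 0.01             # input (context) embeddings
W_out = np.random.randn(V, D) * 0.01             # output (target) embeddings

context_ids = [17, 42, 99, 123]                  # toy context word indices
target_id = 7                                    # the true middle word
h = W_in[context_ids].mean(axis=0)               # CBOW context representation

neg_ids = np.random.randint(0, V, size=K)        # noise words (uniform stand-in for P_noise)
loss  = -np.log(sigmoid(W_out[target_id] @ h))           # push the true word up
loss += -np.log(sigmoid(-W_out[neg_ids] @ h)).sum()      # push the noise words down
```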

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 63

Word embeddings

Pre-trained word embeddings can leverage texts with BILLIONS of words!!

Pre-trained word embeddings are useful for:
● Word similarity
● Word analogy
● Other tasks like PoS tagging, NERC, sentiment analysis, etc.
● Initializing the embedding layer in deep learning models

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 64

Word embeddings

Word similarity


Source: Collobert et al. 2011

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 65

Word embeddings

Word analogy:

a is to b as c is to ?

man is to king as woman is to ?

[figure: embedding space with man, woman, king and an unknown point “?”]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 66

Word embeddings

Word analogy:

a is to b as c is to ?

man is to king as woman is to ?

  a − b ≈ c − d   ⇒   d ≈ c − a + b

  king − man + woman ≈ queen

  d = argmax_{d∈V} cos(d, c − a + b)

[figure: embedding space with man, woman, king, queen]
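A minimal sketch of that analogy query over pre-trained embeddings (here `emb` is an assumed dict mapping words to NumPy vectors):

```python
# Word analogy: a is to b as c is to d, with d = argmax cos(d, c - a + b).
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, emb, exclude=True):
    """e.g. analogy('man', 'king', 'woman', emb) should return 'queen'."""
    query = emb[c] - emb[a] + emb[b]
    candidates = (w for w in emb if not exclude or w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], query))
```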

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 68

Word embeddings

● How to use embeddings in a given task (e.g. MLP sentiment analysis):
  – Learn them from scratch (random initialization)
  – Initialize using pre-trained embeddings from some other task (e.g. word2vec)

● Other embeddings:
  – GloVe (Pennington et al. 2014)
  – fastText (Mikolov et al. 2017)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 70

Recap

● Deep learning: learn representations for words
● Are they useful for anything?
● Which task would be the best to learn embeddings that can be used in other tasks?
● Can we transfer this representation from one task to another?
● Can we have all languages in one embedding space?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 71

Superhuman abilities: cross-lingual word embeddings

http://aclweb.org/anthology/P18-1073

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 120

Contents

● Introduction to Deep Learning
  – Deep Learning ~ Learning Representations

● Text as bag of words
  – Text Classification
  – Representation learning and word embeddings
  – Superhuman: cross-lingual word embeddings

● Text as a sequence: RNN
  – Sentence encoders
  – Machine Translation
  – Superhuman: unsupervised MT

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 121

From words to sequences
● Representation for words: one vector for each word (word embeddings)
● Representation for sequences of words: one vector for each sequence (?!)
  – Is it possible to represent a sentence in one vector at all?
  – Let’s go back to the MLP

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 122

From words to sequences

MLP: What is h0 with respect to the words in the input ?

● Add vectors of words in context (1’s in x), plus bias, apply non-linearity

h_0 is a sentence representation:

  h_0 = f( Σ_i w_i + b_0 )

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 123

Sentence encoder

A function:
  input: a sequence of word embeddings  w_i ∈ ℝ^D
  output: a sentence representation     s ∈ ℝ^D'

[diagram: word embeddings w1 w2 w3 → sentence encoder (hidden layers) → sentence representation s]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 124

Sentence encoder

Baseline: Continuous bag of words (pre-trained embeddings)

[diagram: the word embeddings w1, w2, w3 are summed (Σ) into the sentence representation s]

  h_0 = s
  h_1 = f(W_1 h_0 + b_1)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 125

Sentence encoder

Baseline: MLP (with or without pre-trained embeddings)
Encoding sequences as a continuous bag of words:

  s   = f( Σ_i w_i + b_0 )
  h_0 = s
  h_1 = f(W_1 h_0 + b_1)
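A small sketch of this continuous-bag-of-words encoder, assuming toy random embeddings and tanh as the non-linearity f:

```python
# Continuous bag of words: sentence representation s = f(sum_i w_i + b0).
import numpy as np

def encode_cbow(word_vectors, b0):
    """Sum the word embeddings, add the bias, apply the non-linearity (here tanh)."""
    return np.tanh(np.sum(word_vectors, axis=0) + b0)

D = 50
words = [np.random.randn(D) for _ in range(3)]    # w1, w2, w3 (toy embeddings)
s = encode_cbow(words, b0=np.zeros(D))            # h0 = s, fed to the MLP above
```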

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 126

Problems with bag of words

● The representation is limited
  – repr(John loves Mary) = repr(Mary loves John)
  – repr(The end is weak but the film is awesome) = repr(The end is awesome but the film is weak)

● We need a more powerful representation which takes word order into account
  – Sequences of tokens
  – Bigrams, trigrams of tokens

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 127

Sentence encoder

Concatenation

[diagram: the word embeddings w1, w2, w3 are concatenated into the sentence representation s]

  h_0 = s
  h_1 = f(W_1 h_0 + b_1)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 128

Sentence encoder

Concatenation

It is a masterful performance but, for me, not enough to make me a fan of the film.

It is not only lovely to look at (the exquisite northern Italian countryside made me want to hop on a plane that moment), but is made even better by the subtle performance of Timothée Chalamet.

Fantastic movie

  h_0 = s
  h_1 = f(W_1 h_0 + b_1)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 129

Sentence encoder

Recurrent Neural Network (RNN)
Apply a single function f recursively

[diagram: f applied left to right over w1, w2, w3, starting from a zero hidden state]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 130

Sentence encoder: RNN

Recurrent Neural Network (RNN)

Apply a single function f recursively, keeping a history (hidden) state

[diagram: f applied left to right over w1, w2, w3, carrying the hidden state forward]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 131


Sentence encoder: RNN

Apply a single function f recursively.

  h_t = f(h_{t−1}, w_t) = tanh(W [h_{t−1}, w_t] + b)

The last hidden state is the sentence representation.

[diagram: f applied over w1, w2, w3 from a zero initial state]
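A minimal sketch of this encoder in NumPy (weight shapes and random initialization are assumptions): apply the tanh recurrence over the word embeddings and return the last hidden state.

```python
# Vanilla RNN encoder: h_t = tanh(W [h_{t-1}, w_t] + b), starting from a zero state.
import numpy as np

def rnn_encode(word_vectors, W, b):
    h = np.zeros(W.shape[0])                      # initial hidden state (all zeros)
    for w in word_vectors:
        h = np.tanh(W @ np.concatenate([h, w]) + b)
    return h                                      # last state = sentence representation

D, H = 50, 64
W, b = np.random.randn(H, H + D) * 0.1, np.zeros(H)
sentence = [np.random.randn(D) for _ in range(3)]  # toy embeddings for w1, w2, w3
s = rnn_encode(sentence, W, b)
```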

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 134

Sentence encoder: RNN

Varying sequence length: unroll (different graph for each input)

[diagram: the same f unrolled over sequences of different lengths (3 tokens, 4 tokens), each ending in its own sentence representation]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 136

Sentence encoder: RNN

● Last hidden state is input to classifier

  g(h_t) = softmax(U h_t + c)

[diagram: RNN over w1, w2, w3; the last hidden state is fed to the classifier g]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 137

RNNs offer a lot of flexibility

MLP

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 138

RNNs offer a lot of flexibility

Sentiment classification

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 139

RNNs offer a lot of flexibility

Machine translation

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 140

RNNs offer a lot of flexibility

Image Captioning

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 141

RNNs offer a lot of flexibility

Video classification at frame level

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 142

Machine Translation: seq2seq

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 143

Sequence to sequence models

● For any problem that can be interpreted as a transformation from one sequence to another:
  – Model it as a pair of RNNs:
  – An encoder that reads the input sentence and outputs nothing
  – A decoder whose starting hidden state is the last hidden state of the encoder, and that generates a sentence (an RNN language model)
  – Give it lots of data… success guaranteed (Sutskever et al. 2014)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 144

Sequence to sequence models

● They are state-of-the-art in many problems, some of them fairly novel
  – Foreign sentences → translation

– Emails → simple replies (Google)

– Python functions → result

– English sentences → parsing instructions– ...

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 145

Sequence to sequence for MT

Combine two RNNs: encoder and decoder

Train as a regular RNN (NOTE: N classifiers, one per output position)

[diagram: the encoder RNN reads the source sentence (ending in </s>); the decoder RNN starts from <s> and is trained to output the target words txakur, pelikula, </s>]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 146

Sequence to sequence for MT

Combine two RNNs

Train as a regular RNN
● Note that the decoder is conditioned on the last hidden state of the encoder:

  C = h_encoder
  p(w_1, …, w_m | h_encoder) ≈ Π_{t=0}^{m−1} y_{t,correct}
  h_t = tanh(W [h_{t−1}, w_t, C] + b)
  y_t = softmax(W_S h_t + c)
  J_sentence = −(1/m) Σ_{t=1}^{m} log y_{t,correct}

[diagram: the decoder is fed <s>, txakur, pelikula and trained to output txakur, pelikula, </s>]
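A sketch of the greedy decoding loop that the next slides walk through, using the recurrence and softmax above; the parameter shapes, the output embedding table and the token ids are toy assumptions.

```python
# Greedy decoding with a decoder conditioned on C = h_encoder.
import numpy as np

def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

def greedy_decode(h_encoder, params, emb_out, bos_id, eos_id, max_len=20):
    W, b, W_S, c = params                         # decoder parameters (assumed shapes)
    h, C = h_encoder, h_encoder                   # start state; conditioning vector
    w_id, output = bos_id, []
    for _ in range(max_len):
        x = emb_out[w_id]                         # embedding of the previous output word
        h = np.tanh(W @ np.concatenate([h, x, C]) + b)
        y = softmax(W_S @ h + c)
        w_id = int(y.argmax())                    # w_t = argmax_i y_{t,i}
        if w_id == eos_id:                        # stop at </s>
            break
        output.append(w_id)
    return output
```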

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 147

Sequence to sequence for MT

● Test as a decoder
  – Compute the sentence representation

[diagram: the encoder RNN runs over the source sentence, ending in </s>]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 148

Sequence to sequence for MT

● Test as a conditional RNN language model (RLM) decoder
  – Compute the sentence representation
  – Generate the first word

  w_t = argmax_i y_{t,i}

[diagram: the decoder starts from <s> and outputs katu]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 149

Sequence to sequence for MT

● Test as a conditional RLM decoder
  – Compute the sentence representation
  – Generate the second word

  w_t = argmax_i y_{t,i}

[diagram: the decoder is fed <s>, katu and outputs katu, pelikula]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 150

Sequence to sequence for MT

● Test as a conditional RLM decoder
  – Compute the sentence representation
  – Generate the next word, stop at </s>

  w_t = argmax_i y_{t,i}

[diagram: the decoder is fed <s>, katu, pelikula and outputs katu, pelikula, </s>]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 151

Sequence to sequence for MT

● These models do not beat statistical MT
● More complex RNN units: LSTM
● There is a huge loss of information in trying to cram all the meaning of the input sentence into a single vector
  – The decoder is missing key information such as the individual words in the input, word order, input length, etc.
  – Add the capability to attend to each of the tokens in the input: attention!

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 152

Re-thinking seq2seq for NMT

How can we access the necessary information at each decoding step?

[diagram: encoder hidden states h_0, h_1, h_2, h_3 = c = s_0; decoder states s_1, s_2, s_3 generate txakur, pelikula, </s>]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 153

Re-thinking seq2seq for NMT

Learn alignments jointly with the translation (!)
(Bahdanau et al. 2015; Luong et al. 2015)

[diagram: the context vector c_t is a weighted sum of the encoder hidden states, with weights α_t1, α_t2, α_t3 computed from the decoder state s_t]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 160

Re-thinking seq2seq for NMT

  c_t    = Σ_{j=1}^{T_x} α_{tj} h_j
  α_{tj} = exp(e_{tj}) / Σ_{k=1}^{T_x} exp(e_{tk})
  e_{tj} = a(s_t, h_j) = s_t W_a h_j

[diagram: decoding step t = 1, generating the first target word from <s> using the attention context c_t]
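A small NumPy sketch of these attention equations with the bilinear scoring function (toy sizes and random values assumed):

```python
# Attention: bilinear scores e_tj = s_t W_a h_j, softmax weights alpha_tj, context c_t.
import numpy as np

def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

def attention(s_t, H_enc, W_a):
    """H_enc: (T_x, H) encoder hidden states; returns (c_t, alpha_t)."""
    e = np.array([s_t @ W_a @ h_j for h_j in H_enc])   # e_tj = a(s_t, h_j)
    alpha = softmax(e)                                  # alpha_tj over source positions
    c_t = alpha @ H_enc                                 # sum_j alpha_tj h_j
    return c_t, alpha

H = 8
H_enc = np.random.randn(3, H)            # h_1..h_3 from the encoder (toy)
s_t = np.random.randn(H)                 # current decoder state
W_a = np.random.randn(H, H)
c_t, alpha = attention(s_t, H_enc, W_a)
```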

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 161

Re-thinking seq2seq for NMT

[diagram: decoding step t = 1, the attention context c_t is combined with the decoder state s_1 to output katu]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 162

Re-thinking seq2seq for NMT

[diagram: decoding step t = 2, attention is recomputed over the encoder states for the decoder state s_2]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 163

Re-thinking seq2seq for NMT

[diagram: decoding step t = 2, the decoder is fed <s>, katu and outputs katu, pelikula]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 164

Re-thinking seq2seq for NMT

[diagram: full decoding with attention; the decoder is fed <s>, katu, pelikula and outputs katu, pelikula, </s>]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 166

Re-thinking seq2seq for NMT

Back to sentence representations:
  – Instead of the last hidden state of the encoder …
  – Attention: keep the information of all the hidden states

  c = [c_0, c_1, c_2]

[diagram: encoder over “txakur pelikula ederra”; the decoder attends over all encoder states c_0, c_1, c_2]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 167

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)
● Extend attention to self-attention: the source (and target) attends to itself
● Build hidden states using feed-forward networks

State-of-the-art MT
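A sketch of single-head scaled dot-product self-attention, the core operation of the Transformer; the projection matrices and sizes below are assumptions for illustration.

```python
# Single-head self-attention: every position attends to every position of the same sequence.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (T, D) token vectors; returns one contextualized vector per token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # each token scored against every token
    alpha = softmax(scores, axis=-1)             # attention weights per position
    return alpha @ V                             # weighted sums: c_1, c_2, ..., c_T

T, D = 3, 16
X = np.random.randn(T, D)                        # toy token vectors
W_q = W_k = W_v = np.random.randn(D, D) * 0.1
C = self_attention(X, W_q, W_k, W_v)             # all positions in one forward pass
```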

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 168

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

[diagram: c_1 is built by self-attention over the input tokens (weights α_11, α_12, α_13), followed by a feed-forward layer]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 169

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

[diagram: c_2 is built by self-attention (weights α_21, α_22, α_23), followed by a feed-forward layer]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 170

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

[diagram: c_3 is built by self-attention (weights α_31, α_32, α_33), followed by a feed-forward layer]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 171

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

[diagram: all positions c_1, c_2, c_3 are computed in one forward pass, each by self-attention plus a feed-forward layer]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 172

Recap

● Deep learning: learn representations for sentences
  – Encoders like RNNs (one vector, plus attention)
  – Transformers that keep one vector per token

● Which task would be the best to learn representations that can be used in other tasks?

● Can we transfer this representation from one task to the other?

● Can we have all languages in one representation space?

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 173

Superhuman abilities: unsupervised MT

https://openreview.net/pdf?id=Sy2ogebAW

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 175

Unsupervised MT

● Given:
  – Some books in Chinese
  – Some other books in Arabic
  – A person who knows neither Chinese nor Arabic

● Can the person learn to translate from Chinese to Arabic, or vice versa?

● A machine can!
  – Leverage unsupervised cross-lingual embeddings

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 176

Unsupervised MT

[diagram: L1 embeddings → encoder for L1 → attention → softmax → L2 decoder]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 177

Unsupervised MT

[diagram: fixed cross-lingual embeddings → shared encoder (L1/L2) → attention → softmax → L1 decoder and L2 decoder]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 178

Unsupervised MT
Previously: cross-lingual embeddings

Steps:
● Train as an autoencoder:
  – Given a sentence in L1, train the shared encoder with the L1 decoder
  – Given a sentence in L2, train the shared encoder with the L2 decoder

● Test:
  – Given a sentence in L1, produce a sentence in L2
  – Given a sentence in L2, produce a sentence in L1

Fails!! Why?

[diagram: fixed cross-lingual embeddings → shared encoder (L1/L2) → attention → softmax → L1 decoder and L2 decoder]

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 179

Unsupervised MT

● Improvements
  – Denoising autoencoder: introduce noise in the input
  – Backtranslation:
    ● Given a sentence in L1, translate it to L2, then train the shared encoder with the L1 decoder to recover the original sentence

● It works!
  – Artetxe et al. (2017): the first paper (by 1 day!) showing that it is possible to translate from one language to another without bilingual resources
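A high-level sketch of one training round of this scheme; the helper functions (add_noise, translate, train_step) are hypothetical placeholders for illustration, not the authors' code.

```python
# One round of denoising autoencoding plus backtranslation over monolingual corpora.
def train_round(model, mono_l1, mono_l2, add_noise, translate, train_step):
    for s1, s2 in zip(mono_l1, mono_l2):
        # Denoising autoencoding: reconstruct each sentence from a noised copy.
        train_step(model, src=add_noise(s1), tgt=s1, decoder="L1")
        train_step(model, src=add_noise(s2), tgt=s2, decoder="L2")

        # Backtranslation: translate, then learn to recover the original sentence.
        s1_in_l2 = translate(model, s1, decoder="L2")
        train_step(model, src=s1_in_l2, tgt=s1, decoder="L1")
        s2_in_l1 = translate(model, s2, decoder="L1")
        train_step(model, src=s2_in_l1, tgt=s2, decoder="L2")
```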

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 180

Unsupervised MT

● Significant improvements last year
  – NMT with a shared decoder
  – Statistical MT initialized with a bilingual dictionary, plus back-translation iterations

● Results competitive with the state of the art in 2014
  – No bilingual resources!

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 181

Unsupervised MT

● Why does it work at all? An intuition:
  – The cross-lingual embedding space provides information about the k-best possible translations
  – NMT/SMT figures out how to best “combine” them

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 182

Conclusions

● New research area – unsupervised Machine Translation

● Performance up, 33 BLEU En-Fr

● Plenty of margin for improvement

● Code for replicability https://github.com/artetxem/undreamt https://github.com/artetxem/monoses

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 183

References

● Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Unsupervised Neural Machine Translation. In ICLR-2018.

● Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Unsupervised Statistical Machine Translation. In EMNLP-2018.

● (Forthcoming)

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 184

Last words

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 185

DL4NLP

● We have only scratched the surface of representation learning for NLP
  – How do learned representations match linguistic expectations?

● Exciting developments in transfer learning:
  – Language generation with GPT-2
  – Contextual embeddings with BERT
  – Multilingual sentence representations

● Language understanding also requires:
  – Complex structure beyond sequences
  – Function words beyond embeddings (every, not)

DL4NLP 186

Resources

● Books!!

● Tutorials and implementations on the TensorFlow (Keras) and PyTorch websites

● Stanford NLP and DL courses: https://nlp.stanford.edu/teaching/

● NYU NLP and DL courses: https://www.nyu.edu/projects/bowman/teaching.shtml

● Coursera, Udacity, Nvidia Deep Learning courses

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 187

And...

● Short 20-hour, 3-day summer course in July in San Sebastian (Keras)
  http://ixa2.si.ehu.es/deep_learning_seminar/
  (pre-registration closes 15th of May!)
● Longer version in January: 35 hours, 3 weeks (TensorFlow)
  (to be announced)
● Master and PhD opportunities

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 188

THANKS!
@eagirre

[email protected]

Acknowledgements:

● Overall slides: Sam Bowman (NYU), Chris Manning and Richard Socher (Stanford)

● All source URLs listed in the slides