
Philipp Koehn
Chief Scientist, Omniscien Technologies
Professor of Computer Science, Johns Hopkins University

Philipp Koehn is Professor of Computer Science at Johns Hopkins University and Chief Scientist for Omniscien Technologies. He also holds the Chair for Machine Translation in the School of Informatics at the University of Edinburgh. Philipp is a leader in the field of statistical machine translation research with over 100 publications. He is the author of the seminal textbook in the field. Under his leadership, the open source Moses system became the de-facto standard toolkit for machine translation in research and commercial deployment.

Philipp led international research projects such as Euromatrix and CASMACAT. Philipp's research has been funded by the European Union, DARPA, Google, Facebook, Amazon, Bloomberg and several other funding agencies.

Philipp received his PhD in 2003 from the University of Southern California and was a postdoctoral research associate at MIT. He was a finalist in the European Patent Office's European Inventor Award in 2013 and received the Award of Honor from the International Association for Machine Translation in 2015.

At Omniscien Technologies, Philipp has refined machine translation technology for use in real-world deployments and helped to develop methods for data acquisition and refinement. Philipp continues to drive innovation and technological development at Omniscien Technologies.

AI, MT and Language Processing Symposium


The recent trend of using deep learning to solve a wide variety of problems in Artificial Intelligence has also reached machine translation, thus establishing a new state-of-the-art approach for this application. This approach is not yet settled by any means. New neural architectures are proposed, and ideas coming from such diverse fields as computer vision, game playing, and speech recognition can be applied to machine translation as well.

At the practical end, we are just learning about the deployment challenges of this technology, since old methods, for example for integrating terminology databases or performing domain adaptation, no longer apply.

This presentation will give an overview of the latest developments in research and what this means for practical deployment.



Research in Translation – What is Exciting and Shows Promise Ahead?

Philipp Koehn


Overview

• Evolution of Machine Translation

• Deep Learning

• Neural Machine Translation

• Challenges

• Looking Forward


Evolution of Machine Translation


Machine Translation Paradigms

• Various Approaches

• Rule-based (1970s)

• Word-based (1990s)

• Phrase-based (2000s)

• Syntax-based (2010s)

• Neural-based (2016+)

[Diagram: translation paths from Source to Target at increasing levels of abstraction: lexical transfer, syntax transfer, semantic transfer, interlingua]


Hype and Reality


Better Machine Learning

• Probabilistic models (1990s)

• Increased use of machine learning (2000s)

• Neural networks (since mid 2010s)


Deep Learning


Two Objectives

Fluency

• Translation must be fluent in the target language

• Need model that assigns a language score to each sentence

Adequacy

• Translation must have same meaning as source sentence

• Need model that assigns a translation score to each sentence
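As a toy illustration of these two objectives (not any particular system's scoring), a candidate translation can be scored by adding a fluency score from a language model to an adequacy score from a translation model. Both model functions below are hypothetical stand-ins.

# Toy sketch: combine a fluency score (language model) with an
# adequacy score (translation model). Both models are stand-ins.
def score_translation(source, candidate, language_model, translation_model):
    fluency = language_model(candidate)               # log P(candidate)
    adequacy = translation_model(source, candidate)   # log P(candidate | source)
    return fluency + adequacy

lm = lambda t: -0.5 * len(t.split())                      # toy: shorter = more fluent
tm = lambda s, t: -abs(len(s.split()) - len(t.split()))   # toy: similar length = adequate
print(score_translation("das Haus ist gross", "the house is big", lm, tm))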


Learning from Data

• Detect patterns in aligned segment pairs


Machine Learning

• Key to success

• Analyze problem

• Feature engineering

• For instance: machine translation

• What features are relevant for word order?

• What features are relevant for lexical translation?

[Diagram: input → features → output]


Neural Learning

• Promise: no more feature engineering

• Several steps of processing; features are discovered automatically

[Diagram: input → hidden layer → output]


Deep Learning

• More layers

• More complex feature interactions

[Diagram: input → multiple hidden layers → output]


Neural Machine Translation


word2vec

• Task: predict the word in the middle of a window of context words


Neural Network Solution

• Learn mapping with a neural network


Map Word to Embedding

• Vector representation of word

• Mathematically:

• a matrix multiplication

• followed by a non-linear activation function
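A minimal numpy sketch of that mapping, with a made-up vocabulary and dimensions: the one-hot word vector is multiplied by an embedding matrix and passed through a non-linearity.

import numpy as np

vocab = {"the": 0, "house": 1, "is": 2, "big": 3}   # toy vocabulary
V, d = len(vocab), 8                     # vocabulary size, embedding dimension
E = np.random.randn(V, d) * 0.1          # embedding matrix (learned in practice)

def embed(word):
    one_hot = np.zeros(V)
    one_hot[vocab[word]] = 1.0
    return np.tanh(one_hot @ E)          # matrix multiplication + non-linear activation

print(embed("house").shape)              # (8,)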


Visualizing Neural Relationships and Features

Relationships are built much like in the human brain: collections of concepts and vocabulary.


Visualizing Neural Relationships and Features

Distance indicates closeness of relationships. Groupings are formed.


Visualizing Neural Relationships and Features

Groups are directly and indirectly interrelated, e.g. Sports + Broadcasting and Entertainment.


Neural Machine Translation

• Recall: two models

• Language model

… to ensure fluent output

• Translation model

… to ensure adequate translations


Language Models

• Sequential language models: predict the next word

I …
I like …
I like to …
I like to learn …
I like to learn about …
I like to learn about machine …
I like to learn about machine translation .
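A sketch of that next-word loop, with a hypothetical next_word_probs function standing in for a trained model: the sentence is extended one greedy prediction at a time.

# Sketch: extend a sentence word by word with greedy predictions.
def generate(next_word_probs, prefix, max_len=10):
    words = list(prefix)
    while len(words) < max_len:
        probs = next_word_probs(words)       # dict: word -> probability
        best = max(probs, key=probs.get)     # greedy choice of next word
        words.append(best)
        if best == "</s>":
            break
    return " ".join(words)

# Toy stand-in model that continues the example from the slides
toy = lambda ws: {"like": 1.0} if ws == ["I"] else {"</s>": 1.0}
print(generate(toy, ["I"]))                  # I like </s>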


Recurrent Neural Language Model

• Predict the first word of a sentence (same as before, just drawn top-down)

[Diagram: given word <s> → embedding → hidden state → predicted word "the"]


Recurrent Neural Language Model

• Predict the second word of a sentence

• Re-use hidden state from first word prediction

[Diagram: given words <s>, "the" → embeddings → hidden states → predicted words "the", "house"]


Recurrent Neural Language Model

[Diagram: given words "<s> the house is big ." → embeddings → hidden states → predicted words "the house is big . </s>"]
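A minimal sketch of one recurrent step with toy dimensions (not any specific toolkit's implementation): the hidden state carried over from the previous word feeds into the prediction of the next word.

import numpy as np

d, h, V = 8, 16, 100                       # embedding, hidden, vocabulary sizes (toy)
Wx = np.random.randn(h, d) * 0.1           # input weights
Wh = np.random.randn(h, h) * 0.1           # recurrent weights
Wo = np.random.randn(V, h) * 0.1           # output weights

def rnn_step(x_embedding, prev_hidden):
    hidden = np.tanh(Wx @ x_embedding + Wh @ prev_hidden)   # new hidden state
    logits = Wo @ hidden
    probs = np.exp(logits) / np.exp(logits).sum()           # softmax over vocabulary
    return probs, hidden

h0 = np.zeros(h)                           # state before the first word (<s>)
probs, h1 = rnn_step(np.random.randn(d), h0)
print(probs.argmax())                      # index of the predicted word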


Encoder Decoder Model

• We predicted the words of a sentence

• Why not also predict their translations?


Encoder Decoder Model

[Diagram: the source sentence "the house is big . </s>" is read word by word; the target sentence "das Haus ist groß . </s>" is then predicted word by word, with each predicted word fed back in as the next given word]

• Obviously madness

• Proposed by Google (Sutskever et al. 2014)
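A sketch of that idea, reusing the toy embed and rnn_step sketches above (illustrative code, not the original implementation): run the recurrent network over the source sentence, then decode greedily from the final hidden state until </s>.

import numpy as np

# Sketch of sequence-to-sequence translation; embed, rnn_step and
# id_to_word are assumed helpers (see the sketches above).
def translate(source_words, embed, rnn_step, id_to_word, max_len=20):
    hidden = np.zeros(16)
    for word in source_words:                    # encoder: read the source sentence
        _, hidden = rnn_step(embed(word), hidden)
    output = []
    word = "</s>"                                # decoding starts after end-of-source
    while len(output) < max_len:
        probs, hidden = rnn_step(embed(word), hidden)
        word = id_to_word[int(probs.argmax())]   # decoder: predict next target word
        if word == "</s>":
            break
        output.append(word)
    return output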


Attention Mechanism

• What is missing?

• Alignment of source words to target words

• Solution: attention mechanism


Neural Machine Translation, 2016

[Diagram: attention-based architecture: input words → input word embeddings → left-to-right and right-to-left recurrent NNs → alignment → input context → hidden state → output words]

• State of the art

• Used by Google, WIPO, Systran, Omniscien…

The same diagram, walked through step by step:

• Input Sentence

• Encode with Word Embeddings

• Output Sentence

• Each Word Predicted by Embedding

• Embedding Predicted from Input Context

• Input Context Selected By Word Alignment

• Input Context: Weighted Sum of Input Embeddings
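A numpy sketch of the alignment and input-context steps named above, using a simple dot-product score (Bahdanau et al.'s original attention used a small feed-forward network): the attention weights over the input words form a soft alignment, and the input context is their weighted sum.

import numpy as np

def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state            # one score per input word
    weights = np.exp(scores) / np.exp(scores).sum()    # soft alignment (softmax)
    return weights @ encoder_states, weights           # weighted sum + weights

encoder_states = np.random.randn(5, 16)   # 5 input words, hidden size 16 (toy)
context, weights = attention_context(np.random.randn(16), encoder_states)
print(weights.round(2), context.shape)    # alignment distribution, (16,)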


Benefits

• Each output predicted from

• encoding of the full input sentence

• all previously produced output words

• Word embeddings allow generalization

• "cat" and "cats" have similar representations

• "house" and "home" have similar representations


WMT 2016 Evaluation (News, English-German)

[Chart: WMT 2016 evaluation results, comparing neural MT and statistical MT systems]


Challenges


Benefits of Neural Machine Translation

• Evidence of overall better translation quality

• Ability to generalize better from training data

• Better handling of sentence-level context

• Better fluency


Neural Machine Translation is Data-Hungry

[Chart: BLEU score (0 to 30) vs. corpus size (100,000 to 1,000,000,000 words) for phrase-based SMT, phrase-based SMT with a big language model, and neural MT]


Neural Machine Translation Failures


Adequacy or Fluency?

• Language model may take over

• Output unrelated to input


Fluency vs. Adequacy Errors

• Input

Ich will Kuchen essen ("I want to eat cake")

• Fluency error (more common in SMT)

I want cake eat

• Adequacy error (more common in NMT)

I want to cook chicken


Limited Vocabulary

• Words are encoded as high-dimensional vectors

• Only allows for limited vocabulary size

• words are split into subwords (see the sketch below)

• maybe even split into characters?

• fall-back to dictionaries / phrase-based models
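A sketch of subword splitting with a greedy longest-match against an assumed toy subword vocabulary; real systems learn the vocabulary from data, e.g. with byte pair encoding, and use markers such as "@@" rather than the "##" used here.

# Sketch: split an out-of-vocabulary word into known subword pieces.
def split_into_subwords(word, subword_vocab):
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):        # longest match first
            piece = word[start:end]
            if piece in subword_vocab or end - start == 1:   # single char fallback
                pieces.append(piece if start == 0 else "##" + piece)
                start = end
                break
    return pieces

vocab = {"trans", "lation", "un", "break", "able"}     # toy subword vocabulary
print(split_into_subwords("translation", vocab))       # ['trans', '##lation']
print(split_into_subwords("unbreakable", vocab))       # ['un', '##break', '##able']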


NMT More Susceptible to Noisy Training Data

• More harmed by

• Alignment errors

• Bad language

• Wrong language on target side

• Severely harmed by un-translated source text (over-learns to copy)

• Data cleaning more important (simple filters sketched below)
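A sketch of the kind of simple filters this implies; the heuristics and thresholds are illustrative assumptions, not Omniscien's actual pipeline.

# Sketch: drop segment pairs that look like noise.
def keep_pair(source, target, max_ratio=2.0):
    s_len, t_len = len(source.split()), len(target.split())
    if s_len == 0 or t_len == 0:
        return False                       # empty side: likely alignment error
    if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
        return False                       # suspicious length ratio
    if source.strip() == target.strip():
        return False                       # un-translated source (teaches copying)
    return True

pairs = [("das Haus ist gross", "the house is big"),
         ("das Haus ist gross", "das Haus ist gross")]   # copied pair, dropped
print([keep_pair(s, t) for s, t in pairs])               # [True, False]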


NMT is Worse Out-of-Domain

• In nearly all cases, SMT was better than NMT when content was out of domain.

• More data is required for NMT to meet domain-specific needs

• When sufficient data is available, NMT usually will be better than SMT for typical sentences


Deployment Challenges for Neural MT

• Speed

• training takes weeks

• decoding slower than traditional SMT

• Hardware requirements

• GPUs needed ($2,000 each)

• Google even has specialized hardware

• Process is not transparent

• Practically impossible to find out why a given output is wrong

• Mistakes cannot be easily fixed


Neural Machine Translation – A Mystery?

• Decisions of statistical systems often hard to understand

• Neural: even harder

[Diagram: input → MAGIC → output]

• New studies reveal inner workings

• Attention mechanism

• Word sense disambiguation


Attention States

• Attention mechanism plays role of “word alignment”

• “Soft alignment”: distributed over several input words


Word Sense Disambiguation

[Visualization: deep embedding of the word "right" in the encoder]


NMT vs SMT: What We Know By Now

• In ideal conditions, NMT much better

• Different types of error (fluency vs. adequacy)

• NMT more susceptible to noise

• NMT less robust (out-of-domain, low-resource, etc.)

=> Hybrid approach of Omniscien Technologies


Looking Forward


Attention Sequence-to-Sequence Model

• Based on recurrent neural networks

• Attention mechanism (alignment)

• Standard Approach 2015-2017


Deeper Models

• More layers in encoder and decoder

• Models more complex relationships between words

• Significantly higher performance


Google’s “Transformer” Model

• Self-attention

• Encoder: Input words inform each other

• Decoder: Attention on some previous output words
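A minimal numpy sketch of the scaled dot-product self-attention at the heart of the Transformer (Vaswani et al. 2017), with toy sizes: every word's representation is updated using information from every other word.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])             # scaled dot products
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax per word
    return weights @ V                                 # words inform each other

n, d = 5, 16                                           # 5 words, model size 16 (toy)
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (5, 16)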


Facebook’s Convolutional Model

• Hierarchical (“convolutions”) instead of sequential

• Faster (but more limited context)

• In encoder and decoder


Synthesizing Data

• Neural machine translation trained on parallel data

• Improve with monolingual data

• Back-translate target language text into source language (sketch below)

• Add as training data

• Can be iterated (“dual learning”)
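A sketch of the back-translation recipe with assumed helper names: a reverse-direction model turns monolingual target-language text into synthetic source sentences, and the resulting pairs are added to the parallel training data.

# Sketch: synthesize training pairs from monolingual target text.
def back_translate(monolingual_target, reverse_model):
    return [(reverse_model(t), t) for t in monolingual_target]

def augment_training_data(parallel_pairs, monolingual_target, reverse_model):
    synthetic = back_translate(monolingual_target, reverse_model)
    return parallel_pairs + synthetic      # train the forward model on the union

# Toy reverse "model" for illustration only
reverse = lambda t: "[synthetic source for: " + t + "]"
data = augment_training_data([("das Haus", "the house")],
                             ["the garden is green"], reverse)
print(len(data))                           # 2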


Domain Adapted Models

• Various techniques explored for customization

• One simple, effective method (sketched below):

• Train general system on all available data

• Fine-tune on in-domain data
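A sketch of that recipe with an assumed trainer interface (hyperparameters are illustrative): train on everything, then continue training on the in-domain data, typically with a lower learning rate so the general model is adjusted rather than overwritten.

# Sketch: general training followed by in-domain fine-tuning.
def train_domain_adapted(model, general_data, in_domain_data, train_fn):
    train_fn(model, general_data, learning_rate=1e-3, epochs=10)    # general system
    train_fn(model, in_domain_data, learning_rate=1e-4, epochs=2)   # fine-tune
    return model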


Terminology

• Terminology, brand names with fixed translations

Der neue Neurolierer XVQ-72 ist lieferbar.
(German: "The new Neurolizer XVQ-72 is available for delivery.")

Neurolierer XVQ-72 → Neurolizer XVQ-72

• XML markup

Der neue <x translation="Neurolizer XVQ-72">Neurolierer XVQ-72</x> ist lieferbar.

• Use attention states to detect insertion point
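A sketch of that last step under assumed data shapes (one attention row per output word, one column per input word): pick the output position whose attention mass on the tagged source span is largest, and substitute the fixed translation there. Real systems must also handle multi-word spans on the output side; this toy version replaces a single token.

import numpy as np

def insert_terminology(output_words, attention, source_span, fixed_translation):
    # Sum attention over the tagged source positions for every output word
    span_attention = attention[:, source_span].sum(axis=1)
    insertion_point = int(span_attention.argmax())
    return (output_words[:insertion_point]
            + fixed_translation.split()
            + output_words[insertion_point + 1:])

attention = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy: 2 output x 2 input words
print(insert_terminology(["neurolator", "available"], attention,
                         [0], "Neurolizer XVQ-72"))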


Dynamic Software Environment

• Major players released deep learning frameworks

• TensorFlow (Google)

• PyTorch (Facebook)

• MXNet (Amazon)

• Theano framework development discontinued

• Also: dedicated NMT implementations (faster)

• Quick turn-around from research into deployment


Hardware Developments

New GPUs from NVIDIA in 2018

• Faster, more memory

• Enable deeper models

Copyright © 2018 Omniscien Technologies. All Rights Reserved.

Facebook.com/omniscien @omniscientech Omniscien Technologies sales@omniscien.com
