
Philipp Koehn
Chief Scientist, Omniscien Technologies
Professor of Computer Science, Johns Hopkins University

Philipp Koehn is Professor of Computer Science at Johns Hopkins University and Chief Scientist for Omniscien Technologies. He also holds the Chair for Machine Translation in the School of Informatics at the University of Edinburgh. Philipp is a leader in the field of statistical machine translation research with over 100 publications. He is the author of the seminal textbook in the field. Under his leadership, the open source Moses system became the de-facto standard toolkit for machine translation in research and commercial deployment.

Philipp led international research projects such as Euromatrix and CASMACAT. Philipp's research has been funded by the European Union, DARPA, Google, Facebook, Amazon, Bloomberg and several other funding agencies.

Philipp received his PhD in 2003 from the University of Southern California and was a postdoctoral research associate at MIT. He was a finalist in the European Patent Office's European Inventor Award in 2013 and received the Award of Honor from the International Association for Machine Translation in 2015.

At Omniscien Technologies, Philipp has refined machine translation technology for use in real-world deployments and helped to develop methods for data acquisition and refinement. Philipp continues to drive innovation and technological development at Omniscien Technologies.

AI, MT and Language Processing Symposium


The recent trend of using deep learning to solve a wide variety of problems in Artificial Intelligence has also reached machine translation, thus establishing a new state-of-the-art approach for this application. This approach is not yet settled by any means. New neural architectures are proposed, and ideas coming from such diverse fields as computer vision, game playing, and speech recognition can be applied to machine translation as well.

At the practical end, we are just learning about the deployment challenges of this technology, since old methods, for example for integrating terminology databases or performing domain adaptation, no longer apply.

This presentation will give an overview of the latest developments in research and what this means for practical deployment.



Research in Translation – What is Exciting and Shows Promise Ahead?

Philipp Koehn


Overview

• Evolution of Machine Translation

• Deep Learning

• Neural Machine Translation

• Challenges

• Looking Forward


Evolution of Machine Translation


Machine Translation Paradigms

• Various Approaches

• Rule-based (1970s)

• Word-based (1990s)

• Phrase-based (2000s)

• Syntax-based (2010s)

• Neural-based (2016+)

[Diagram: translation paths from Source to Target at increasing levels of abstraction: lexical transfer, syntax transfer, semantic transfer, interlingua]


Hype and Reality


Better Machine Learning

• Probabilistic models (1990s)

• Increased use of machine learning (2000s)

• Neural networks (since mid 2010s)


Deep Learning


Two Objectives

Fluency

• Translation must be fluent in the target language

• Need model that assigns a language score to each sentence

Adequacy

• Translation must have same meaning as source sentence

• Need model that assigns a translation score to each sentence
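As a toy illustration of these two objectives (not any particular system's scoring), a candidate translation can be scored by adding a fluency score from a language model to an adequacy score from a translation model. Both model functions below are hypothetical stand-ins.

# Toy sketch: combine a fluency score (language model) with an
# adequacy score (translation model). Both models are stand-ins.
def score_translation(source, candidate, language_model, translation_model):
    fluency = language_model(candidate)               # log P(candidate)
    adequacy = translation_model(source, candidate)   # log P(candidate | source)
    return fluency + adequacy

lm = lambda t: -0.5 * len(t.split())                      # toy: shorter = more fluent
tm = lambda s, t: -abs(len(s.split()) - len(t.split()))   # toy: similar length = adequate
print(score_translation("das Haus ist gross", "the house is big", lm, tm))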


Learning from Data

• Detect patterns in aligned segment pairs


Machine Learning

• Key to success

• Analyze problem

• Feature engineering

• For instance: machine translation

• What features are relevant for word order?

• What features are relevant for lexical translation?

[Diagram: input → features → output]


Neural Learning

• Promise: no more feature engineering

• Several steps of processing; features are discovered automatically

[Diagram: input → hidden layer → output]


Deep Learning

• More layers

• More complex feature interactions

[Diagram: input → multiple hidden layers → output]


Neural Machine Translation


word2vec

• Task: predict the word in the middle of a window of context words


Neural Network Solution

• Learn mapping with a neural network


Map Word to Embedding

• Vector representation of word

• Mathematically:

• a matrix multiplication

• followed by a non-linear activation function
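A minimal numpy sketch of that mapping, with a made-up vocabulary and dimensions: the one-hot word vector is multiplied by an embedding matrix and passed through a non-linearity.

import numpy as np

vocab = {"the": 0, "house": 1, "is": 2, "big": 3}   # toy vocabulary
V, d = len(vocab), 8                     # vocabulary size, embedding dimension
E = np.random.randn(V, d) * 0.1          # embedding matrix (learned in practice)

def embed(word):
    one_hot = np.zeros(V)
    one_hot[vocab[word]] = 1.0
    return np.tanh(one_hot @ E)          # matrix multiplication + non-linear activation

print(embed("house").shape)              # (8,)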


Visualizing Neural Relationships and Features

Relationships are built much like in the human brain: collections of concepts and vocabulary.


Visualizing Neural Relationships and Features

Distance indicates closeness of relationships. Groupings are formed.


Visualizing Neural Relationships and Features

Groups are directly and indirectly interrelated, e.g. Sports + Broadcasting and Entertainment.


Neural Machine Translation

• Recall: two models

• Language model

… to ensure fluent output

• Translation model

… to ensure adequate translations


Language Models

• Sequential language models: predict the next word

I …
I like …
I like to …
I like to learn …
I like to learn about …
I like to learn about machine …
I like to learn about machine translation .
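A sketch of that next-word loop, with a hypothetical next_word_probs function standing in for a trained model: the sentence is extended one greedy prediction at a time.

# Sketch: extend a sentence word by word with greedy predictions.
def generate(next_word_probs, prefix, max_len=10):
    words = list(prefix)
    while len(words) < max_len:
        probs = next_word_probs(words)       # dict: word -> probability
        best = max(probs, key=probs.get)     # greedy choice of next word
        words.append(best)
        if best == "</s>":
            break
    return " ".join(words)

# Toy stand-in model that continues the example from the slides
toy = lambda ws: {"like": 1.0} if ws == ["I"] else {"</s>": 1.0}
print(generate(toy, ["I"]))                  # I like </s>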


Recurrent Neural Language Model

• Predict the first word of a sentence (same as before, just drawn top-down)

[Diagram: given word <s> → embedding → hidden state → predicted word "the"]


Recurrent Neural Language Model

• Predict the second word of a sentence

• Re-use hidden state from first word prediction

[Diagram: given words <s>, "the" → embeddings → hidden states → predicted words "the", "house"]


Recurrent Neural Language Model

[Diagram: given words "<s> the house is big ." → embeddings → hidden states → predicted words "the house is big . </s>"]
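A minimal sketch of one recurrent step with toy dimensions (not any specific toolkit's implementation): the hidden state carried over from the previous word feeds into the prediction of the next word.

import numpy as np

d, h, V = 8, 16, 100                       # embedding, hidden, vocabulary sizes (toy)
Wx = np.random.randn(h, d) * 0.1           # input weights
Wh = np.random.randn(h, h) * 0.1           # recurrent weights
Wo = np.random.randn(V, h) * 0.1           # output weights

def rnn_step(x_embedding, prev_hidden):
    hidden = np.tanh(Wx @ x_embedding + Wh @ prev_hidden)   # new hidden state
    logits = Wo @ hidden
    probs = np.exp(logits) / np.exp(logits).sum()           # softmax over vocabulary
    return probs, hidden

h0 = np.zeros(h)                           # state before the first word (<s>)
probs, h1 = rnn_step(np.random.randn(d), h0)
print(probs.argmax())                      # index of the predicted word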


Encoder Decoder Model

• We predicted the words of a sentence

• Why not also predict their translations?


Encoder Decoder Model

[Diagram: the source sentence "the house is big . </s>" is read word by word; the target sentence "das Haus ist groß . </s>" is then predicted word by word, with each predicted word fed back in as the next given word]

• Obviously madness

• Proposed by Google (Sutskever et al. 2014)
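A sketch of that idea, reusing the toy embed and rnn_step sketches above (illustrative code, not the original implementation): run the recurrent network over the source sentence, then decode greedily from the final hidden state until </s>.

import numpy as np

# Sketch of sequence-to-sequence translation; embed, rnn_step and
# id_to_word are assumed helpers (see the sketches above).
def translate(source_words, embed, rnn_step, id_to_word, max_len=20):
    hidden = np.zeros(16)
    for word in source_words:                    # encoder: read the source sentence
        _, hidden = rnn_step(embed(word), hidden)
    output = []
    word = "</s>"                                # decoding starts after end-of-source
    while len(output) < max_len:
        probs, hidden = rnn_step(embed(word), hidden)
        word = id_to_word[int(probs.argmax())]   # decoder: predict next target word
        if word == "</s>":
            break
        output.append(word)
    return output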


Attention Mechanism

• What is missing?

• Alignment of source words to target words

• Solution: attention mechanism


Neural Machine Translation, 2016

[Diagram: attention-based architecture: input words → input word embeddings → left-to-right and right-to-left recurrent NNs → alignment → input context → hidden state → output words]

• State of the art

• Used by Google, WIPO, Systran, Omniscien…

The same diagram, walked through step by step:

• Input Sentence

• Encode with Word Embeddings

• Output Sentence

• Each Word Predicted by Embedding

• Embedding Predicted from Input Context

• Input Context Selected By Word Alignment

• Input Context: Weighted Sum of Input Embeddings
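A numpy sketch of the alignment and input-context steps named above, using a simple dot-product score (Bahdanau et al.'s original attention used a small feed-forward network): the attention weights over the input words form a soft alignment, and the input context is their weighted sum.

import numpy as np

def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state            # one score per input word
    weights = np.exp(scores) / np.exp(scores).sum()    # soft alignment (softmax)
    return weights @ encoder_states, weights           # weighted sum + weights

encoder_states = np.random.randn(5, 16)   # 5 input words, hidden size 16 (toy)
context, weights = attention_context(np.random.randn(16), encoder_states)
print(weights.round(2), context.shape)    # alignment distribution, (16,)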


Benefits

• Each output predicted from

• encoding of the full input sentence

• all previously produced output words

• Word embeddings allow generalization

• "cat" and "cats" have similar representations

• "house" and "home" have similar representations


WMT 2016 Evaluation (News, English-German)

[Chart: WMT 2016 evaluation results, comparing neural MT and statistical MT systems]


Challenges


Benefits of Neural Machine Translation

• Evidence of overall better translation quality

• Ability to generalize better from training data

• Better handling of sentence-level context

• Better fluency


Neural Machine Translation is Data-Hungry

[Chart: BLEU score (0 to 30) vs. corpus size (100,000 to 1,000,000,000 words) for phrase-based SMT, phrase-based SMT with a big language model, and neural MT]


Neural Machine Translation Failures


Adequacy or Fluency?

• Language model may take over

• Output unrelated to input


Fluency vs. Adequacy Errors

• Input

Ich will Kuchen essen ("I want to eat cake")

• Fluency error (more common in SMT)

I want cake eat

• Adequacy error (more common in NMT)

I want to cook chicken


Limited Vocabulary

• Words are encoded as high-dimensional vectors

• Only allows for limited vocabulary size

• words are split into subwords (see the sketch below)

• maybe even split into characters?

• fall-back to dictionaries / phrase-based models
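A sketch of subword splitting with a greedy longest-match against an assumed toy subword vocabulary; real systems learn the vocabulary from data, e.g. with byte pair encoding, and use markers such as "@@" rather than the "##" used here.

# Sketch: split an out-of-vocabulary word into known subword pieces.
def split_into_subwords(word, subword_vocab):
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):        # longest match first
            piece = word[start:end]
            if piece in subword_vocab or end - start == 1:   # single char fallback
                pieces.append(piece if start == 0 else "##" + piece)
                start = end
                break
    return pieces

vocab = {"trans", "lation", "un", "break", "able"}     # toy subword vocabulary
print(split_into_subwords("translation", vocab))       # ['trans', '##lation']
print(split_into_subwords("unbreakable", vocab))       # ['un', '##break', '##able']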


NMT More Susceptible to Noisy Training Data

• More harmed by

• Alignment errors

• Bad language

• Wrong language on target side

• Severely harmed by un-translated source text (over-learns to copy)

• Data cleaning more important (simple filters sketched below)
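A sketch of the kind of simple filters this implies; the heuristics and thresholds are illustrative assumptions, not Omniscien's actual pipeline.

# Sketch: drop segment pairs that look like noise.
def keep_pair(source, target, max_ratio=2.0):
    s_len, t_len = len(source.split()), len(target.split())
    if s_len == 0 or t_len == 0:
        return False                       # empty side: likely alignment error
    if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
        return False                       # suspicious length ratio
    if source.strip() == target.strip():
        return False                       # un-translated source (teaches copying)
    return True

pairs = [("das Haus ist gross", "the house is big"),
         ("das Haus ist gross", "das Haus ist gross")]   # copied pair, dropped
print([keep_pair(s, t) for s, t in pairs])               # [True, False]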


NMT is Worse Out-of-Domain

• In nearly all cases, SMT was better than NMT when content was out of domain.

• More data is required for NMT to meet domain-specific needs

• When sufficient data is available, NMT usually will be better than SMT for typical sentences


Deployment Challenges for Neural MT

• Speed

• training takes weeks

• decoding slower than traditional SMT

• Hardware requirements

• GPUs needed ($2,000 each)

• Google even has specialized hardware

• Process is not transparent

• Practically impossible to find out why a given output is wrong

• Mistakes cannot be easily fixed


Neural Machine Translation – A Mystery?

• Decisions of statistical systems often hard to understand

• Neural: even harder

[Diagram: input → MAGIC → output]

• New studies reveal inner workings

• Attention mechanism

• Word sense disambiguation


Attention States

• Attention mechanism plays role of “word alignment”

• “Soft alignment”: distributed over several input words


Word Sense Disambiguation

[Visualization: deep embedding of the word "right" in the encoder]


NMT vs SMT: What We Know By Now

• In ideal conditions, NMT much better

• Different types of error (fluency vs. adequacy)

• NMT more susceptible to noise

• NMT less robust (out-of-domain, low-resource, etc.)

=> Hybrid approach of Omniscien Technologies


Looking Forward


Attention Sequence-to-Sequence Model

• Based on recurrent neural networks

• Attention mechanism (alignment)

• Standard Approach 2015-2017


Deeper Models

• More layers in encoder and decoder

• Models more complex relationships between words

• Significantly higher performance


Google’s “Transformer” Model

• Self-attention

• Encoder: Input words inform each other

• Decoder: Attention on some previous output words
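A minimal numpy sketch of the scaled dot-product self-attention at the heart of the Transformer (Vaswani et al. 2017), with toy sizes: every word's representation is updated using information from every other word.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])             # scaled dot products
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax per word
    return weights @ V                                 # words inform each other

n, d = 5, 16                                           # 5 words, model size 16 (toy)
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (5, 16)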


Facebook’s Convolutional Model

• Hierarchical (“convolutions”) instead of sequential

• Faster (but more limited context)

• In encoder and decoder


Synthesizing Data

• Neural machine translation trained on parallel data

• Improve with monolingual data

• Back-translate target language text into source language (sketch below)

• Add as training data

• Can be iterated (“dual learning”)
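A sketch of the back-translation recipe with assumed helper names: a reverse-direction model turns monolingual target-language text into synthetic source sentences, and the resulting pairs are added to the parallel training data.

# Sketch: synthesize training pairs from monolingual target text.
def back_translate(monolingual_target, reverse_model):
    return [(reverse_model(t), t) for t in monolingual_target]

def augment_training_data(parallel_pairs, monolingual_target, reverse_model):
    synthetic = back_translate(monolingual_target, reverse_model)
    return parallel_pairs + synthetic      # train the forward model on the union

# Toy reverse "model" for illustration only
reverse = lambda t: "[synthetic source for: " + t + "]"
data = augment_training_data([("das Haus", "the house")],
                             ["the garden is green"], reverse)
print(len(data))                           # 2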


Domain Adapted Models

• Various techniques explored for customization

• One simple, effective method (sketched below):

• Train general system on all available data

• Fine-tune on in-domain data
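A sketch of that recipe with an assumed trainer interface (hyperparameters are illustrative): train on everything, then continue training on the in-domain data, typically with a lower learning rate so the general model is adjusted rather than overwritten.

# Sketch: general training followed by in-domain fine-tuning.
def train_domain_adapted(model, general_data, in_domain_data, train_fn):
    train_fn(model, general_data, learning_rate=1e-3, epochs=10)    # general system
    train_fn(model, in_domain_data, learning_rate=1e-4, epochs=2)   # fine-tune
    return model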


Terminology

• Terminology, brand names with fixed translations

Der neue Neurolierer XVQ-72 ist lieferbar.
(German: "The new Neurolizer XVQ-72 is available for delivery.")

Neurolierer XVQ-72 → Neurolizer XVQ-72

• XML markup

Der neue <x translation="Neurolizer XVQ-72">Neurolierer XVQ-72</x> ist lieferbar.

• Use attention states to detect insertion point
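A sketch of that last step under assumed data shapes (one attention row per output word, one column per input word): pick the output position whose attention mass on the tagged source span is largest, and substitute the fixed translation there. Real systems must also handle multi-word spans on the output side; this toy version replaces a single token.

import numpy as np

def insert_terminology(output_words, attention, source_span, fixed_translation):
    # Sum attention over the tagged source positions for every output word
    span_attention = attention[:, source_span].sum(axis=1)
    insertion_point = int(span_attention.argmax())
    return (output_words[:insertion_point]
            + fixed_translation.split()
            + output_words[insertion_point + 1:])

attention = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy: 2 output x 2 input words
print(insert_terminology(["neurolator", "available"], attention,
                         [0], "Neurolizer XVQ-72"))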


Dynamic Software Environment

• Major players released deep learning frameworks

• TensorFlow (Google)

• PyTorch (Facebook)

• MXNet (Amazon)

• Theano framework development discontinued

• Also: dedicated NMT implementations (faster)

• Quick turn-around from research into deployment


Hardware Developments

New GPUs from NVIDIA in 2018

• Faster, more memory

• Enable deeper models

Copyright © 2018 Omniscien Technologies. All Rights Reserved.

Facebook.com/omniscien @omniscientech Omniscien Technologies sales@omniscien.com
