Personalized Deep Learning with Incremental Adaptation
Joern Wuebker, [email protected]
NVIDIA GTC, Washington DC, Oct 24, 2018


Page 1

Personalized Deep Learning with Incremental Adaptation

Joern Wuebker, [email protected]

NVIDIA GTC, Washington DC, Oct 24, 2018

Page 2

What will you learn in this talk?

2

● How to adapt neural machine translation models in real time, to learn domain-specific terminology, translator word choice and writing style

● How to encourage sparsity in personalized models using structured regularization (here: 70% reduction in network size)

● How to make personalized models available in a large-scale distributed environment

● Applicable to all tasks in which users are generating supervised data as they work

Page 3

What is Lilt?

3

● Browser-based Computer-Aided Translation (CAT) tool

● Predictive typing / interactive machine translation (see the decoding sketch below)
    ○ Input: source language sentence, target language prefix
    ○ Predict: target language sentence completion

● Difference from autocomplete on a phone:
    ○ Larger context: source language sentence, target language prefix
    ○ Prediction of the full sentence completion
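As a rough illustration of the prefix-constrained setting, here is a minimal greedy-decoding sketch. It is not Lilt's inference code: `next_token_scores` is a hypothetical stand-in for one NMT decoder step, and the canned `toy_scores` table exists only to make the example runnable.

```python
# Minimal sketch of prefix-constrained greedy decoding (not Lilt's inference code).
# `next_token_scores(source_tokens, target_tokens)` is a hypothetical stand-in for
# one NMT decoder step: it returns a dict of candidate next tokens -> scores.

def complete_prefix(source_tokens, prefix_tokens, next_token_scores,
                    eos="</s>", max_len=50):
    """Condition on the user's prefix, then greedily extend it."""
    target = list(prefix_tokens)  # the prefix is forced, never overridden
    while len(target) < max_len:
        scores = next_token_scores(source_tokens, target)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        target.append(best)
    # Only the continuation beyond the prefix is suggested to the user.
    return target[len(prefix_tokens):]


# Toy usage (illustration only): a canned next-word table plays the role of the model.
def toy_scores(source_tokens, target_tokens):
    table = {"sheathed-element": "glow", "glow": "plug", "plug": "(1)"}
    return {table.get(target_tokens[-1], "</s>"): 1.0}

print(complete_prefix(["Eine", "Glühstiftkerze", "(1)"],
                      ["A", "sheathed-element"], toy_scores))
# -> ['glow', 'plug', '(1)']
```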

Page 4

4

Short history of neural machine translation

System (English > German)                           | BLEU [%] (newstest2014) | GPU training hours
Statistical MT (Sennrich & Haddow, 2015)            | 22.6                    | n/a
Attention-based Neural MT (Bahdanau et al., 2014)   | 19.9                    | 252 (K6000)
+ Monolingual training data (Sennrich et al., 2016) | 22.7                    | 670 (Titan Black)
+ Ensemble of neural models (Sennrich et al., 2016) | 23.8                    | 670 (Titan Black)
+ Deep network (Wu et al., 2016; Google)            | 26.3                    | 18,000 (K80)
Transformer network (Vaswani et al., 2017)          | 28.4                    | 670 (P100)

Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data.
Bojar et al. Findings of the 2016 Conference on Machine Translation.
Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Vaswani et al. Attention is All You Need.

Page 5

Why Predictive Typing?

5

Page 6

Post-editing vs. predictive typing

6

Post-editing:
1. User enters/uploads the source sentence
2. Machine suggests a full-sentence translation
3. User edits the full sentence

Predictive typing:
1. User enters/uploads the source sentence
2. Machine suggests a full-sentence translation
3. User corrects the first error
4. Machine suggests a new sentence completion

Page 7

Post-editing machine translation

7

● Post-edited translations are generated more quickly and ranked as more accurate than unaided translations by professionals (Green et al., 2013).

○ Comparison of professional translators for English to French, Arabic, and German.

● But: Translators hate post-editing!
● Expert translators make more edits in less time (Moorkens & O'Brien, 2015).

○ Professional translators were 3x more productive at post-editing than translation students.

● NMT doesn't speed up post-editing much vs. Statistical MT (Castilho et al., 2017)
    ○ For En-{De,Pt,El,Ru}, post-editing NMT required ~15% fewer keystrokes but was only ~5% faster.
    ○ Participants indicated that they found NMT errors more difficult to identify.

Green, Spence, Jeffrey Heer, and Christopher D. Manning. "The efficacy of human post-editing for language translation."
Moorkens, Joss, and Sharon O’Brien. "Post-editing evaluations: Trade-offs between novice and professional participants."
Castilho, Sheila, et al. "A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators."

Page 8

Predictive typing

8

Wuebker, Joern, et al. "Models and Inference for Prefix-Constrained Machine Translation."
Green, Spence, et al. "Predictive Translation Memory: A mixed-initiative system for human language translation." Proceedings of the 27th annual ACM symposium on User interface software and technology. ACM, 2014.

● NMT helps for full-sentence MT, but even more on prefix-constrained MT (Wuebker et al., 2016)

● Predictive typing leads to more edits and higher quality (Green et al., 2014)
    ○ Comparison of professional translators for English to French and German.
    ○ Predictive typing did take ~20% longer than post-editing.
    ○ When asked, “I would use interactive translation features if they were integrated into a CAT product,” 20 out of 25 translators responded "agree" or "strongly agree."

● End translation quality is higher with predictive typing (Client Evaluation, 2017)
    ○ Error frequency, detected by review, was 1.1% for post-editing & 0.3% for predictive typing.
    ○ Throughput with predictive typing was 700+ words/hour, double a typical unassisted speed.

Page 9

Transformer Network Architecture

9

Page 10

10

[Architecture diagram: a 4-layer Transformer encoder-decoder translating the German source "Eine Glühstiftkerze (1) dient ..." into the English target "A sheathed-element glow plug ...".
Encoder: embedding lookup (10.3M parameters) followed by 4× (self-attention + filter) blocks (~526K parameters each).
Decoder: embedding lookup (10.3M parameters), 4× (self-attention + encoder attention + filter) blocks (~788K parameters each), and an output projection (10.3M parameters).]
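As a quick sanity check, assuming the counts in the diagram map to the components as listed above, the pieces add up to roughly the ~36M full-model size quoted later on the translation-process slide:

```python
# Rough parameter tally for the 4-layer Transformer sketched above,
# assuming the per-component counts map as described in the placeholder.
source_embedding = 10.3e6
target_embedding = 10.3e6
output_projection = 10.3e6
encoder_block = 526e3   # self-attention + filter
decoder_block = 788e3   # self-attention + encoder attention + filter

total = (source_embedding + target_embedding + output_projection
         + 4 * encoder_block + 4 * decoder_block)
print(f"{total / 1e6:.1f}M parameters")  # ~36.2M, in line with the ~36M full model
```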

Page 11

Adaptation

11

Page 12

12

Incremental adaptation: Document context

Example: a patent (https://www.google.com/patents/WO2007000372A1)

Sheathed-element glow plug

A sheathed-element glow plug (1) is to be placed inside a chamber (3) of an internal combustion engine. The sheathed-element glow plug (1) comprises a heating body (2) that has a glow tube (6) connected to a housing (4). The heating body (2) also comprises a ceramic heating element (15), which is placed inside the glow tube (6) and which serves to heat the glow tube (6). The glow tube (6) guarantees a thermal and mechanical protection for the ceramic heating element (15).

sheathed-element glow plug ↔ Glühstiftkerze

https://translate.google.com

Page 13

13

Incremental adaptation: Example

Interaction between the user and the Personalized MT System:
1. Initial MT suggestion
2. User correction
3. System learns from the correction
4. Improved suggestion

Page 14

14

Personalized MT: Translation process
1. Incoming translation request for User X
2. Load User X’s model from cache or persistent storage
3. Apply model parameters to the computation graph in TensorFlow
4. Generate translation
5. Respond to translation request (max. response time: ~300 ms)

Full model: ~36M parameters

Personalized model: to keep steps (2) and (3) within the response-time budget, at most ~10M parameters

Solution:
- Store personalized models as offsets from the baseline model: W = W_b + W_u (sketched below)
- Intelligent selection of a sparse parameter subset W_u ⇒ Group Lasso
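As a minimal sketch of the offset idea, the snippet below materializes W = W_b + W_u, touching only the tensors for which offsets are stored. The tensor names and shapes are invented for illustration; this is not Lilt's serving code.

```python
import numpy as np

# Baseline model: a dict of named weight tensors (names and shapes are invented).
baseline = {
    "encoder/layer_0/self_attention": np.random.randn(512, 512),
    "encoder/layer_0/filter": np.random.randn(512, 2048),
    "decoder/layer_0/encoder_attention": np.random.randn(512, 512),
}

# Personalized model: only the offsets W_u for the few tensors the user's data
# actually changed are stored; everything else is implicitly zero.
user_offsets = {
    "encoder/layer_0/filter": 0.01 * np.random.randn(512, 2048),
}

def personalized_weights(baseline, user_offsets):
    """Materialize W = W_b + W_u, touching only tensors with stored offsets."""
    weights = dict(baseline)  # shallow copy; untouched tensors are shared
    for name, offset in user_offsets.items():
        weights[name] = baseline[name] + offset
    return weights

weights = personalized_weights(baseline, user_offsets)
# These tensors would then be applied to the TensorFlow computation graph
# before serving the user's translation request.
```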

Page 15

Structured Sparsity - “Group Lasso”

15

● Simultaneous regularization and tensor selection
● Treat entire tensors/columns as individual parameters w.r.t. L1 regularization (a minimal sketch follows below)
● Can be easily implemented with any neural model
● Applicable to any interactive machine assistance task
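Below is a minimal sketch of such a group-lasso term over the personalization offsets W_u, treating each tensor as one group. This is the generic formulation (optionally weighting each group by the square root of its size), not necessarily the exact variant used in the EMNLP paper. Groups whose norm is driven to zero can simply be dropped from the stored personalized model, which is what enables the large size reductions reported below.

```python
import numpy as np

def group_lasso_penalty(offsets, lam=0.1):
    """Group lasso over personalization offsets W_u.

    Each tensor is one group; the L2 norm of a group acts like a single
    coordinate under L1 regularization, so unneeded groups are pushed to
    exactly zero and can be omitted from the stored personalized model.
    """
    penalty = 0.0
    for offset in offsets.values():
        # Optional: weight each group by sqrt of its size so large tensors
        # are not penalized less per parameter than small ones.
        penalty += np.sqrt(offset.size) * np.linalg.norm(offset)
    return lam * penalty

# During adaptation, the training objective would be roughly:
#   loss = cross_entropy(model(W_b + W_u), reference) + group_lasso_penalty(W_u)
```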

Page 16

Adaptation results

16
Wuebker et al. "Compact Personalized Models for Neural Machine Translation." To appear in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, October 2018.

[Chart: adaptation results; legend/axis labels include "Group Lasso" and model size.]

Page 17

Learning Curve

17

Word prediction accuracy (WPA): evaluated by predicting each word of each segment, conditioned on all previous words in the segment (see the sketch below the table).

[Learning-curve chart. Y-axis: performance difference between an incrementally adapted model and a static baseline model.]

Approach  | WPA
Unadapted | 34.8%
Adapted   | 40.3%
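A small sketch of how WPA can be computed, following the definition above; `predict_next_word` is a hypothetical hook that returns the model's single best next word given the source and the reference prefix.

```python
def word_prediction_accuracy(segments, predict_next_word):
    """WPA: fraction of reference words the model predicts correctly,
    conditioned on the reference words preceding them in the segment.

    `segments` is a list of (source_tokens, reference_tokens) pairs;
    `predict_next_word(source_tokens, reference_prefix)` is a hypothetical
    hook returning the model's single best next word.
    """
    correct, total = 0, 0
    for source, reference in segments:
        for i, gold_word in enumerate(reference):
            predicted = predict_next_word(source, reference[:i])
            correct += int(predicted == gold_word)
            total += 1
    return correct / total if total else 0.0
```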

Page 18

Lilt (Demo)

18

Page 19

Availability of personalized models

19

● Translation services are deployed as auto-scalable Kubernetes pods
● Personalized models are stored in a three-level cache (see the sketch after this list):
    ○ Local LRU (least-recently-used) cache on each translator node
    ○ Region-specific high-availability in-memory database (Redis)
    ○ Permanent cloud data storage
● Provides a balance between availability, memory footprint, and performance
● Multiple users can work together using the same personalized model
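A minimal sketch of the three-level lookup described above (node-local LRU, then regional Redis, then permanent cloud storage). `redis_client` and `cloud_storage` are hypothetical stand-ins for the regional in-memory database and the object store, not the actual service clients.

```python
from collections import OrderedDict

class ModelCache:
    """Three-level lookup: local LRU -> regional Redis -> cloud storage."""

    def __init__(self, redis_client, cloud_storage, max_local=32):
        self.local = OrderedDict()   # node-local LRU cache
        self.redis = redis_client    # hypothetical regional in-memory store
        self.cloud = cloud_storage   # hypothetical permanent object store
        self.max_local = max_local

    def get(self, user_id):
        # 1) Node-local LRU: cheapest, hit on repeated requests to this pod.
        if user_id in self.local:
            self.local.move_to_end(user_id)
            return self.local[user_id]
        # 2) Region-wide Redis: shared across pods in the region.
        model = self.redis.get(user_id)
        if model is None:
            # 3) Permanent cloud storage: always available, slowest.
            model = self.cloud.download(user_id)
            self.redis.set(user_id, model)
        self._put_local(user_id, model)
        return model

    def _put_local(self, user_id, model):
        self.local[user_id] = model
        self.local.move_to_end(user_id)
        if len(self.local) > self.max_local:
            self.local.popitem(last=False)  # evict least-recently used
```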

Page 20

Conclusions

20

Page 21

Lilt research team

21

Page 22

Summary

22

● If end-to-end translation quality is a primary concern, interactive human translation with predictive typing appears to be the most cost-efficient option.

● Online incremental adaptation is very effective, even using small data sets.

● The impact of adaptation is on par with, or larger than, the difference between Neural MT and Statistical MT.

● Sparse personalized models by structured regularization: Reduction of model size by ~70%

Page 23

Thank you!

Joern Wuebker, [email protected]

NVIDIA GTC, Washington DC, Oct 24, 2018

Page 24

Production Architecture

24

Page 25

General architecture

25

[Architecture diagram: the browser exchanges queries and responses with the front end / app server, which communicates over a services backbone (message queue) with the Translation, Converter, Updater, Lexicon, and Translation Memory services; persistent data lives in a MySQL DB and other storage types (GCS, Redis).]

Page 26

26

Why we care about human translators

● The majority of translations (perhaps 99.7%) are generated by computers

● 1000x price ratio: 15 ¢/word from an LSP vs 0.015 ¢/word from an MT API

● The volume of human translation is still large and growing

○ An estimated $21 billion was spent on text translation in 2017 (Common Sense Advisory)

○ Year-over-year growth in the language services industry is 7%

○ ~130 billion words translated per year at 2500 words/day & 250 days/year = 200k+ people