Machine Translation Overview

Machine Translation Overview

Alon LavieLanguage Technologies Institute

Carnegie Mellon University

LTI Immigration CourseAugust 23, 2007

August 23, 2007 LTI IC 2007 2

Machine Translation: History

• MT started in 1940’s, one of the first conceived application of computers

• Promising “toy” demonstrations in the 1950’s, failed miserably to scale up to “real” systems

• AIPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the early 1960’s

• MT Revival started in earnest in 1980s (US, Japan)• Field dominated by rule-based approaches, requiring

100s of K-years of manual development• Economic incentive for developing MT systems for small

number of language pairs (mostly European languages)

August 23, 2007 LTI IC 2007 3

Machine Translation: Where are we today?

• Age of Internet and Globalization – great demand for MT: – Multiple official languages of UN, EU, Canada, etc.– Documentation dissemination for large manufacturers

(Microsoft, IBM, Caterpillar)• Economic incentive is still primarily within a small

number of language pairs• Some fairly good commercial products in the market for

these language pairs– Primarily a product of rule-based systems after many years

of development• Web-based (mostly free) MT services: Google, Babelfish,

others…• Pervasive MT between most language pairs still non-

existent and not on the immediate horizon

August 23, 2007 LTI IC 2007 4

Example of High Quality MT

PAHO’s Spanam system:• Mediante petición recibida por la Comisión Interamericana de

Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.

• Through petition received by the `Inter-American Commission on Human Rights` (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to // equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the `American Convention on Human Rights` (hereinafter …”), as a consequence of judgments initiated against it.

August 23, 2007 LTI IC 2007 5

Core Challenges of MT

• Ambiguity and Language Divergences:– Human languages are highly ambiguous, and

differently in different languages– Ambiguity at all “levels”: lexical, syntactic, semantic,

language-specific constructions and idioms• Amount of required knowledge:

– Translation equivalencies for vast vocabularies (several 100k words and phrases)

– Syntactic knowledge (how to map syntax of one language to another), plus more complex language divergences (semantic differences, constructions and idioms, etc.)

– How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

August 23, 2007 LTI IC 2007 6

Major Sources of Translation Problems

• Lexical Differences:– Multiple possible translations for SL word, or

difficulties expressing SL word meaning in a single TL word

• Structural Differences:– Syntax of SL is different than syntax of the TL: word

order, sentence and constituent structure

• Differences in Mappings of Syntax to Semantics:– Meaning in TL is conveyed using a different syntactic

structure than in the SL

• Idioms and Constructions

August 23, 2007 LTI IC 2007 7

Lexical Differences

• SL word has several different meanings, that translate differently into TL– Ex: financial bank vs. river bank

• Lexical Gaps: SL word reflects a unique meaning that cannot be expressed by a single word in TL– Ex: English snub doesn’t have a corresponding verb

in French or German• TL has finer distinctions than SL SL word

should be translated differently in different contexts– Ex: English wall can be German wand (internal),

mauer (external)

August 23, 2007 LTI IC 2007 8

Structural Differences

• Syntax of SL is different than syntax of the TL: – Word order within constituents:

• English NPs: art adj n the big boy• Hebrew NPs: art n art adj ha yeled ha gadol

– Constituent structure:• English is SVO: Subj Verb Obj I saw the man• Modern Arabic is VSO: Verb Subj Obj

– Different verb syntax:• Verb complexes in English vs. in German I can eat the apple Ich kann den apfel essen

– Case marking and free constituent order• German and other languages that mark case: den apfel esse Ich the(acc) apple eat I(nom)

August 23, 2007 LTI IC 2007 9

Syntax-to-Semantics Differences

• Meaning in TL is conveyed using a different syntactic structure than in the SL– Changes in verb and its arguments– Passive constructions– Motion verbs and state verbs– Case creation and case absorption

• Main Distinction from Structural Differences:– Structural differences are mostly independent of

lexical choices and their semantic meaning can be addressed by transfer rules that are syntactic in nature

– Syntax-to-semantic mapping differences are meaning-specific: require the presence of specific words (and meanings) in the SL

August 23, 2007 LTI IC 2007 10

Syntax-to-Semantics Differences

• Structure-change example:I like swimming“Ich scwhimme gern”I swim gladly

• Verb-argument example:Jones likes the film.“Le film plait à Jones.”(lit: “the film pleases to Jones”)

• Passive Constructions– Example: French reflexive passives:

Ces livres se lisent facilement*”These books read themselves easily”These books are easily read

August 23, 2007 LTI IC 2007 11

Idioms and Constructions

• Main Distinction: meaning of whole is not directly compositional from meaning of its sub-parts no compositional translation

• Examples:– George is a bull in a china shop– He kicked the bucket– Can you please open the window?

August 23, 2007 LTI IC 2007 12

Formulaic Utterances

• Good night.• tisbaH cala xEr waking up on good

• Romanization of Arabic from CallHome Egypt

August 23, 2007 LTI IC 2007 13

How to Tackle the Core Challenges

• Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules.

Example: Systran’s RBMT systems.• Lots of Parallel Data: data-driven approaches for

finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.

• Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s Statistical XFER approach.

• Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!.

August 23, 2007 LTI IC 2007 14

State-of-the-Art in MT• What users want:

– General purpose (any text)– High quality (human level)– Fully automatic (no user intervention)

• We can meet any 2 of these 3 goals today, but not all three at once:– FA HQ: Knowledge-Based MT (KBMT)– FA GP: Corpus-Based (Example-Based) MT– GP HQ: Human-in-the-loop (efficiency tool)

August 23, 2007 LTI IC 2007 15

Types of MT Applications:

• Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)

• Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)

• Communication: Lower quality may be okay, but system robustness, real-time required.

August 23, 2007 LTI IC 2007 16

Mi chiamo Alon Lavie My name is Alon Lavie

Give-information+personal-data (name=alon_lavie)

[s [vp accusative_pronoun “chiamare” proper_name]]

[s [np [possessive_pronoun “name”]]

[vp “be” proper_name]]

Direct

Transfer

Interlingua

Analysis Generation

Approaches to MT: Vaquois MT Triangle

August 23, 2007 LTI IC 2007 17

Analysis and GenerationMain Steps

• Analysis:– Morphological analysis (word-level) and POS tagging– Syntactic analysis and disambiguation (produce

syntactic parse-tree)– Semantic analysis and disambiguation (produce

symbolic frames or logical form representation)– Map to language-independent Interlingua

• Generation:– Generate semantic representation in TL– Sentence Planning: generate syntactic structure and

lexical selections for concepts– Surface-form realization: generate correct forms of

words

August 23, 2007 LTI IC 2007 18

Direct Approaches

• No intermediate stage in the translation• First MT systems developed in the

1950’s-60’s (assembly code programs)– Morphology, bi-lingual dictionary lookup,

local reordering rules– “Word-for-word, with some local word-order

adjustments”

• Modern Approaches: EBMT and SMT

August 23, 2007 LTI IC 2007 19

Statistical MT (SMT)• Proposed by IBM in early 1990s: a direct, purely statistical,

model for MT• Statistical translation models are trained on a sentence-

aligned parallel bilingual corpus– Train word-level alignment models– Extract phrase-to-phrase correspondences– Apply them at runtime on source input and “decode”

• Attractive: completely automatic, no manual rules, much reduced manual labor

• Main drawbacks: – Effective only with large volumes (several mega-words) of parallel

text– Broad domain, but domain-sensitive– Still viable only for small number of language pairs!

• Impressive progress in last 5 years– Large DARPA funding programs (TIDES, GALE)– Lots of research in this direction– GIZA++, Pharoah, CAIRO

August 23, 2007 LTI IC 2007 20

EBMT ParadigmNew Sentence (Source)

Yesterday, 200 delegates met with President Clinton.

Matches to Source Found

Yesterday, 200 delegates met behind closed doors…

Difficulties with President Clinton…

Gestern trafen sich 200 Abgeordnete hinter verschlossenen…

Schwierigkeiten mit Praesident Clinton…

Alignment (Sub-sentential)

Translated Sentence (Target)

Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.

Yesterday, 200 delegates met behind closed doors…

Difficulties with President Clinton over…

Gestern trafen sich 200 Abgeordnete hinter verschlossenen…

Schwierigkeiten mit Praesident Clinton…

August 23, 2007 LTI IC 2007 21

Transfer Approaches

• Syntactic Transfer:– Analyze SL input sentence to its syntactic structure

(parse tree)– Transfer SL parse-tree to TL parse-tree (various

formalisms for specifying mappings)– Generate TL sentence from the TL parse-tree

• Semantic Transfer:– Analyze SL input to a language-specific semantic

representation (i.e., Case Frames, Logical Form)– Transfer SL semantic representation to TL semantic

representation– Generate syntactic structure and then surface

sentence in the TL

August 23, 2007 LTI IC 2007 22

Transfer Approaches

Main Advantages and Disadvantages:• Syntactic Transfer:

– No need for semantic analysis and generation– Syntactic structures are general, not domain specific

Less domain dependent, can handle open domains– Requires word translation lexicon

• Semantic Transfer:– Requires deeper analysis and generation, symbolic

representation of concepts and predicates difficult to construct for open or unlimited domains

– Can better handle non-compositional meaning structures can be more accurate

– No word translation lexicon – generate in TL from symbolic concepts

August 23, 2007 LTI IC 2007 23

Knowledge-based Interlingual MT

• The classic “deep” Artificial Intelligence approach:– Analyze the source language into a detailed symbolic

representation of its meaning – Generate this meaning in the target language

• “Interlingua”: one single meaning representation for all languages – Nice in theory, but extremely difficult in practice:

• What kind of representation?• What is the appropriate level of detail to represent?• How to ensure that the interlingua is in fact universal?

August 23, 2007 LTI IC 2007 24

Interlingua versus Transfer

• With interlingua, need only N parsers/ generators instead of N2 transfer systems:

L1L2

L3

L4

L5

L6

L1L2

L3

L6

L5

L4

interlingua

August 23, 2007 LTI IC 2007 25

Multi-Engine MT

• Apply several MT engines to each input in parallel

• Create a combined translation from the individual translations

• Goal is to combine strengths, and avoid weaknesses.

• Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.

• Various approaches to the problem

August 23, 2007 LTI IC 2007 26

Speech-to-Speech MT

• Speech just makes MT (much) more difficult:– Spoken language is messier

• False starts, filled pauses, repetitions, out-of-vocabulary words

• Lack of punctuation and explicit sentence boundaries– Current Speech technology is far from perfect

• Need for speech recognition and synthesis in foreign languages

• Robustness: MT quality degradation should be proportional to SR quality

• Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improves end-to-end performance?

August 23, 2007 LTI IC 2007 27

MT at the LTI

• LTI originated as the Center for Machine Translation (CMT) in 1985

• MT continues to be a prominent sub-discipline of research with the LTI– More MT faculty than any of the other areas– More MT faculty than anywhere else

• Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT

• Leader in the area of speech-to-speech MT• Multi-Engine MT (MEMT)• MT Evaluation (METEOR, BLANC)

August 23, 2007 LTI IC 2007 28

KBMT: KANT, KANTOO, CATALYST

• Deep knowledge-based framework, with symbolic interlingua as intermediate representation– Syntactic and semantic analysis into a unambiguous

detailed symbolic representation of meaning using unification grammars and transformation mappers

– Generation into the target language using unification grammars and transformation mappers

• First large-scale multi-lingual interlingua-based MT system deployed commercially: – CATALYST at Caterpillar: high quality translation of

documentation manuals for heavy equipment• Limited domains and controlled English input• Minor amounts of post-editing• Active follow-on projects• Contact Faculty: Eric Nyberg and Teruko Mitamura

August 23, 2007 LTI IC 2007 29

EBMT• Developed originally for the PANGLOSS system

in the early 1990s– Translation between English and Spanish

• Generalized EBMT under development for the past several years

• Used in a variety of projects in recent years– DARPA TIDES and GALE programs– DIPLOMAT and TONGUES

• Active research work on improving alignment and indexing, decoding from a lattice

• Contact Faculty: Ralf Brown and Jaime Carbonell

August 23, 2007 LTI IC 2007 30

Statistical MT• Word-to-word and phrase-to-phrase translation pairs are

acquired automatically from data and assigned probabilities based on a statistical model

• Extracted and trained from very large amounts of sentence-aligned parallel text– Word alignment algorithms– Phrase detection algorithms– Translation model probability estimation

• Main approach pursued in CMU systems in the DARPA/TIDES program and now in GALE– Chinese-to-English and Arabic-to-English

• Most active work is on phrase detection and on advanced decoding techniques

• Contact Faculty: Stephan Vogel and Alex Waibel

August 23, 2007 LTI IC 2007 31

Speech-to-Speech MT• Evolution from JANUS/C-STAR systems to

NESPOLE!, LingWear, BABYLON, TC-STAR– Early 1990s: first prototype system that fully

performed sp-to-sp (very limited domains)– Interlingua-based, but with shallow task-oriented

representations: “we have single and double rooms available” [give-information+availability] (room-type={single, double})– Semantic Grammars for analysis and generation– Multiple languages: English, German, French, Italian,

Japanese, Korean, and others– Stat-MT applied in Speech-to-Speech scenarios– Most active work on portable speech translation on

small devices: Arabic/English and Thai/English– Contact Faculty: Alan Black, Stephan Vogel, Tanja

Schultz and Alex Waibel

August 23, 2007 LTI IC 2007 32

AVENUE/LETRAS: Learning-based Transfer MT

• Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data– Specifically designed to bootstrap MT for languages for

which only limited amounts of electronic resources are available (particularly indigenous minority languages)

– Use machine learning techniques to generalize transfer rules from specific translated examples

– Combine with SMT-inspired decoding techniques for producing the best translation of new input from a lattice of translation segments

• Languages: Hebrew, Hindi, Mapudungun, Quechua• Most active work on designing a typologically

comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction

• Contact Faculty: Alon Lavie, Lori Levin, Jaime Carbonell and Bob Frederking

August 23, 2007 LTI IC 2007 33

Multi-Engine MT• New approach developed over past two years under

DoD and DARPA funding (used in GALE)• Main ideas:

– Treat original engines as “black boxes”– Align the word and phrase correspondences between the

translations– Build a collection of synthetic combinations based on the

aligned words and phrases– Score the synthetic combinations based on Language

Model and confidence measures– Select the top-scoring synthetic combination

• Architecture Issues: integrating “workflows” that produce multiple translations and then combine them with MEMT– IBM’s UIMA architecture

• Contact Faculty: Alon Lavie

August 23, 2007 LTI IC 2007 34

Synthetic Combination MEMT

Two Stage Approach:1. Align: Identify common words and phrases across

the translations provided by the engines2. Decode: search the space of synthetic

combinations of words/phrases and select the highest scoring combined translation

Example:1. announced afghan authorities on saturday reconstituted

four intergovernmental committees 2. The Afghan authorities on Saturday the formation of the

four committees of government

August 23, 2007 LTI IC 2007 35

Synthetic Combination MEMT

Two Stage Approach:1. Align: Identify common words and phrases across

the translations provided by the engines2. Decode: search the space of synthetic

combinations of words/phrases and select the highest scoring combined translation

Example:1. announced afghan authorities on saturday reconstituted

four intergovernmental committees 2. The Afghan authorities on Saturday the formation of the

four committees of government

MEMT: the afghan authorities announced on Saturday the formation of four intergovernmental committees

August 23, 2007 LTI IC 2007 36

Automatic MT Evaluation• METEOR: new metric developed at CMU• Improves upon BLEU metric developed by IBM and used

extensively in recent years• Main ideas:

– Assess the similarity between a machine-produced translation and (several) human reference translations

– Similarity is based on word-to-word matching that matches:

• Identical words• Morphological variants of same word (stemming)• synonyms

– Similarity is based on weighted combination of Precision and Recall

– Address fluency/grammaticality via a direct penalty: how well-ordered is the matching of the MT output with the reference?

• Improved levels of correlation with human judgments of MT Quality

• Contact Faculty: Alon Lavie

August 23, 2007 LTI IC 2007 37

The METEOR Metric

• Example:– Reference: “the Iraqi weapons are to be handed over to the

army within two weeks”– MT output: “in two weeks Iraq’s weapons will give army”

• Matching: Ref: Iraqi weapons army two weeks MT: two weeks Iraq’s weapons army• P = 5/8 =0.625 R = 5/14 = 0.357 • Fmean = 10*P*R/(9P+R) = 0.3731• Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50• Discounting factor: DF = 0.5 * (frag**3) = 0.0625• Final score: Fmean * (1- DF) = 0.3731*0.9375 = 0.3498

August 23, 2007 LTI IC 2007 38

Summary

• Main challenges for current state-of-the-art MT approaches - Coverage and Accuracy:– Acquiring broad-coverage high-accuracy translation

lexicons (for words and phrases)– learning syntactic mappings between languages

from parallel word-aligned data– overcoming syntax-to-semantics differences and

dealing with constructions– Stronger Target Language Modeling

August 23, 2007 LTI IC 2007 39

Questions…

Documents

Machine Translation Overview