Current Research in Statistical Machine Translation and

Current Research inStatistical Machine Translation and

Links with Automatic Speech Recognition

Bill Byrnewith Shankar Kumar and Yonggang Deng

Department of Engineering, Cambridge UniversityTrumpington Street, Cambridge CB2 1PZ, U.K.

Center for Language and Speech Processing. The Johns Hopkins University3400 N. Charles Street, Baltimore, MD 21218, U.S.A.

[email protected]

December 16, 2004

OutlineOne View of SMT

Five Problems in Statistical Machine Translation2004 JHU-CLSP NIST MT Evaluation Systems

Summary and Conclusion

Outline

One View of SMTSMT as a Classification ProblemComponent Translation Models and Related Problems

Five Problems in Statistical Machine TranslationAutomatic Measurement of Translation QualityBitext for SMT Training: Document and Sentence AlignmentTranslation Models: Word-to-Word AlignmentTranslation Models: Phrase-BasedLanguage ModelsTranslation

2004 JHU-CLSP NIST MT Evaluation Systems


Bill Byrne ISM, Tokyo 13 Dec 2004 Current Research in SMT p. 2/59




Introduction

I Automatic Speech Recognition and Machine TranslationI The ultimate goal is to transform information into readable textI Can be cast in a generative source-channel modeling framework

P(A|W)

SOURCE

LANGUAGE

MODEL

P(W)

W A

MODEL

ACOUSTIC/TRANSLATION

I Applications in Information Processing systems- Translating news broadcasts from one language to another- Information Extraction from audio and text archives

I Like ASR, SMT modeling is mainly language independentI Relies on large amounts of text and training algorithms rather than

language-specific knowledge





Is Machine Translation Necessary ?

Maybe...

I The need for translation far exceeds what can be translatedI Governmental proceedings : Canada, UN, EU ...I Military and intelligence needsI Commerce: ‘Localization’ of products

I Translation is expensive and time-consumingI Training translators is slow and expensiveI Requires a great deal of expertise and oversight to ensure quality

Is there a consumer market ?

I How much will non-Arabic speakers pay to watch Al Jazeera ?

I People want to access the internet in their own languages,but will they pay for translation ?

An extreme example of localization :


Google is looking for volunteers ...

Google in Your Language (BETA)

Translate Google Into Your Language

Google believes that fast and accurate searchinghas universal value. That's why we are eager tooffer our service in all the languages scatteredupon the face of the earth. We need your help tomake this a reality.

You can volunteer to translate Google's helpinformation and search interface into yourfavorite language. By helping with ourtranslation process you ensure that Google willbe available in your language of choice morequickly and with a better interface than it wouldhave otherwise. There is no minimumcommitment. You can translate a phrase, a pageor our entire site. Once we have enough of thesite translated, we will make it available in thelanguage you are requesting.

If you are interested in helping us, please readthe translation style guide, frequently askedquestions list, and the legal stuff. Then click onthe link at the right to sign up as a volunteer.

We hope you enjoy working on our Googletranslation project and thank you for helping tomake Google a truly worldwide web service.

Here is a list of the languages for which we arecurrently developing support.

Don't have a Google Account?

Sign up to volunteeras a Google translator

Already have a Google Account? Sign in here to translate

Email:

Password:

Don't ask for my password for 2weeks.

Sign in

Forgot your password?

©2003 Google

Inspiration provided by the Weather Underground (www.wunderground.com)

... and finding them. My Account | Sign In

Google in Your Language

Google translation status report of main site

Alphabetical list:

Language % Complete

Abkhazian 1%

Afar 2%

Afrikaans 100%

Albanian 100%

Amharic 100%

Arabic 100%

Armenian 100%

Assamese 100%

Aymara 52%

Azerbaijani 87%

Bashkir 2%

Basque 100%

Belarusian 99%

Bengali 100%

Bhutani 0%

Bihari 100%

Bislama 43%

Borkborkbork 65%

Bosnian 100%

Breton 91%

Bulgarian 100%

Burmese 9%

Cambodian 99%

Catalan 100%

Chinese (Simplified) 99%

Chinese (Traditional) 99%

Corsican 29%

Croatian 100%

Czech 100%

Danish 100%

Most complete translations:

Language % Complete

Afrikaans 100%

Albanian 100%

Amharic 100%

Arabic 100%

Armenian 100%

Assamese 100%

Basque 100%

Bengali 100%

Bihari 100%

Bosnian 100%

Bulgarian 100%

Catalan 100%

Croatian 100%

Czech 100%

Danish 100%

Elmer Fudd 100%

Finnish 100%

Galician 100%

Greek 100%

Gujarati 100%

Hacker 100%

Hebrew 100%

Hindi 100%

Hungarian 100%

Indonesian 100%

Interlingua 100%

Interlingue 100%

Javanese 100%

Kannada 100%

Kazakh 100%

... and finding them. Dutch 99%

Elmer Fudd 100%

English (UK) 9%

Esperanto 99%

Estonian 99%

Faroese 92%

Fiji 13%

Finnish 100%

French 99%

Frisian 95%

Galician 100%

Georgian 88%

German 99%

Greek 100%

Greenlandic 12%

Guarani 87%

Gujarati 100%

Hacker 100%

Hausa 18%

Hebrew 100%

Hindi 100%

Hungarian 100%

Icelandic 99%

Indonesian 100%

Interlingua 100%

Interlingue 100%

Inuktitut 3%

Inupiak 1%

Irish 99%

Italian 99%

Japanese 99%

Javanese 100%

Kannada 100%

Kashmiri 32%

Kazakh 100%

Kinyarwanda 13%

Kirundi 1%

Klingon 97%

Laothian 100%

Latin 100%

Latvian 100%

Lithuanian 100%

Macedonian 100%

Malay 100%

Malayalam 100%

Maltese 100%

Marathi 100%

Mongolian 100%

Nepali 100%

Norwegian (Bokmål) 100%

Norwegian (Nynorsk) 100%

Oriya 100%

Pashto, Pushto 100%

Persian 100%

Pig Latin 100%

Polish 100%

Portuguese (Brazil) 100%

Portuguese (Portugal) 100%

Punjabi 100%

Quechua 100%

Romanian 100%

Russian 100%

Serbian 100%

Serbo-Croatian 100%

Shona 100%

Sinhalese 100%

Slovak 100%

Slovenian 100%

Somali 100%

Swahili 100%

Swedish 100%

Tagalog 100%

Tamil 100%

Telugu 100%

Thai 100%

Turkish 100%

... and finding them. Korean 99%

Kurdish 94%

Kyrgyz 80%

Laothian 100%

Latin 100%

Latvian 100%

Lingala 92%

Lithuanian 100%

Macedonian 100%

Malagasy 25%

Malay 100%

Malayalam 100%

Maltese 100%

Maori 48%

Marathi 100%

Moldavian 81%

Mongolian 100%

Nauru 0%

Nepali 100%

Norwegian (Bokmål) 100%

Norwegian (Nynorsk) 100%

Occitan 99%

Oriya 100%

Oromo 5%

Pashto, Pushto 100%

Persian 100%

Pig Latin 100%

Polish 100%

Portuguese (Brazil) 100%

Portuguese (Portugal) 100%

Punjabi 100%

Quechua 100%

Rhaeto-Romance 61%

Romanian 100%

Russian 100%

Samoan 17%

Sangho 0%

Sanskrit 95%

Ukrainian 100%

Urdu 100%

Vietnamese 100%

Welsh 100%

Belarusian 99%

Cambodian 99%

Chinese (Simplified) 99%

Chinese (Traditional) 99%

Dutch 99%

Esperanto 99%

Estonian 99%

French 99%

German 99%

Icelandic 99%

Irish 99%

Italian 99%

Japanese 99%

Korean 99%

Occitan 99%

Spanish 99%

Uzbek 99%

Yiddish 99%

Klingon 97%

Frisian 95%

Sanskrit 95%

Kurdish 94%

Scots Gaelic 94%

Uighur 94%

Faroese 92%

Lingala 92%

Breton 91%

Sindhi 91%

Tatar 90%

Tonga 90%

Zulu 89%

Georgian 88%

Sundanese 88%

Azerbaijani 87%

... and finding them. Scots Gaelic 94%

Serbian 100%

Serbo-Croatian 100%

Sesotho 76%

Setswana 33%

Shona 100%

Sindhi 91%

Sinhalese 100%

Siswati 2%

Slovak 100%

Slovenian 100%

Somali 100%

Spanish 99%

Sundanese 88%

Swahili 100%

Swedish 100%

Tagalog 100%

Tajik 1%

Tamil 100%

Tatar 90%

Telugu 100%

Thai 100%

Tibetan 1%

Tigrinya 79%

Tonga 90%

Tsonga 3%

Turkish 100%

Turkmen 82%

Twi 79%

Uighur 94%

Ukrainian 100%

Urdu 100%

Uzbek 99%

Vietnamese 100%

Volapuk 46%

Welsh 100%

Wolof 8%

Xhosa 78%

Guarani 87%

Yoruba 83%

Turkmen 82%

Moldavian 81%

Kyrgyz 80%

Tigrinya 79%

Twi 79%

Xhosa 78%

Sesotho 76%

Borkborkbork 65%

Rhaeto-Romance 61%

Aymara 52%

Maori 48%

Volapuk 46%

Bislama 43%

Setswana 33%

Kashmiri 32%

Corsican 29%

Malagasy 25%

Hausa 18%

Samoan 17%

Fiji 13%

Kinyarwanda 13%

Greenlandic 12%

Burmese 9%

English (UK) 9%

Wolof 8%

Oromo 5%

Inuktitut 3%

Tsonga 3%

Afar 2%

Bashkir 2%

Siswati 2%

Zhuang 2%

Abkhazian 1%

Inupiak 1%

Kirundi 1%

Tajik 1%

Yiddish 99%

Yoruba 83%

Zhuang 2%

Zulu 89%

Tibetan 1%

Bhutani 0%

Nauru 0%

Sangho 0%

Is this evidence of consumerdemand for translation ?




Outline









SMT as a Classification ProblemComponent Translation Models and Related Problems

Outline










Statistical Machine Translation

Chinese (Arabic, Hindi,...) → English (Hindi, Chinese,...)

I techniques are (largely) independent of language and direction

Based on the idea that parallel text can be found and aligned :

I Document Level Alignment

I Sentence Level Alignment

I Word and Phrase Level Alignment

Hierarchical models of translation from coarse to fine scale

I trained over collections of known translations

I the component models form a complete translation model






SMT as Classification

Statistical Machine Translation :Application of statistical techniques for translating texts from one naturallanguage (e.g. French) to another (e.g. English)

children need toys and leisure time

children need toys and leisure time

les enfants ont besoin de jouets et de loisirs

the children who need toys and leisure timethose children need toys in leisure timethe children need toys and leisures

��

��

��

STATISTICALCLASSIFIER

Hypothesis Space (Huge!)






Source-Channel SMT Models

Modeling: Given an English Sentence e = e1 . . . eI and a French sentencef = f1 . . . fJ , construct a joint distribution that provides the ‘probability’that one is the translation of the other.

P(e, f ) = P(f |e)︸︷︷︸Translation

Model

P(e)︸︷︷︸Language

Model

Decoding: Produce a translation of f as

e = argmaxe

P(e|f ) = argmaxe

P(f |e) P(e)

Assumptions: MAP decoding, uni-directional translation models, sourcetext is ‘generated’ independently of translation hypotheses, ...

I Why not model P(e|f ) or P(e, f ) directly ?






Component Models and Related Problems

Source

P(eI1)

Language Model -Estimated from

Monolingual Text

Target

Translation Model -Estimated from Bilingual Text

Modeling

Translation

P( f J1

|eI1)

f J1

e=argmaxe P(e)P( f |e)

Decoding -Efficient Search AlgorithmsBased on Translation and

Language Models

e

I Training data: bitext, monolingual textI Estimation algorithms: for translation and language modelsI Search algorithmsI Some way to measure translation quality





Automatic Measurement of Translation QualityBitext for SMT Training: Document and Sentence AlignmentTranslation Models: Word-to-Word AlignmentTranslation Models: Phrase-BasedLanguage ModelsTranslation

Outline










Automatic Measurement of Translation QualityAutomatic performance metrics have been central to the development oflarge statistical language processing systems

I Word/Character Error Rate (WER): ASR , OCR, ...I Precision/Recall: Information Retrieval, Speaker ID, ...I Crossing Brackets: Parsing, ...

These are all relative to human performance over defined test setsI human performance is measured one time over a fixed test setI system performance can be measured many times relative to

I human performance, directlyI performance of other systems, indirectly

I allows incremental system improvementI metrics can be incorporated into estimation and decoding

Evaluation metrics shouldI be inexpensive to computeI require little or no ongoing human involvement






Three Loss Functions for Machine TranslationI BLEU (Papineni et.al 2001) is an automatic MT metric - Correlates

well with human judgments of translationI Position Independent Word Error Rate (PER) : Minimum String edit

distance between a reference sentence and any permutation of thehypothesis sentence

Reference : mr. speaker , in absolutely no way .Hypothesis : in absolutely no way , mr. chairman .

BLEU computationSub-string-Matches(Truth,Hyp) BLEU

1-word 2-word 3-word 4-word(

78 ×

37 ×

26 ×

15

) 14 = 0.3976

7/8 3/7 2/6 1/5

Evaluation-Metric(Truth,Hyp) (%)BLEU WER PER39.76% 6/8 = 75.0% 1/8 = 12.5%






An Example of Good TranslationOriginal Word Segmented Chinese Pinyin (with tones) and gloss

dao4 2005nian2 , quan2guo2 hu4 lian2wang3 yong3hu4 jiang da2dao4 2yi4(By) (2005 year) (internet) (,) (whole country) (users) (will) (reach) (0.2 billion)

Automatic Translation

By 2005 , the number of internet users will reach 200 million

Human References

By 2005 , the number of internet users in the whole country will reach 200 millionBy 2005 , the number of internet users in China is estimated to be 200 millionBy 2005 , internet customers across the country is to reach 200 millionIn 2005 , the internet users in China will total 0.2 billion

Not so bad ...

I Fairly good agreement with reference translationsI Fluent






An Example of Not-So-Good Translation

Original Word Segmented Chinese Pinyin (with tones) and glosssui1ran2 bei3feng1 hu1xiao4 , dan4 tian1kong1 yi1ran2 shi2fen1 qing1che4(Although) (northern wind) (howl) (,) (but) (sky) (very) (clear)

Automatic TranslationAlthough wind howl . but the skies remain very tender

Human ReferencesAlthough a north wind was howling , the sky remained clear and blue .However , the sky remained clear under the strong north wind .Despite the strong northerly winds , the sky remains very clear .The sky was still crystal clear , though the north wind was howling .

Pretty bad ...

I some resemblance to the referencesI poor fluency






BLEU Assigns High Scores to Good Translations

Examples translations at various levels of translation performance as measured by the

sentence-level BLEU score. Examples selected from the NIST 2002 Ch-En MT evaluation set.117

Translations BLEU (%)60− 70%

Afghan Earthquake Victims begin to rebuild their homes . 66.1Prior to this , the ANC has issued a statement calling for the internationalcommunity to respect the choice of the people and help them survive . 66.0Statistics show that since 1992 , a total of 204 UN personnel have beenkilled , but only 15 criminals have been arrested . 64.4Chavez emphasized that Venezuela needs peace , stability and reasonfor all parties should make joint efforts to end the conflict . 62.2London Financial Times Index Friday at closing newspaper5,292.70 points , up 31.30 points . 61.0

40− 50%

Indonesia Reiterates Opposition to the presence of foreign troops in 49.0Witnesses said that Zambia Ji lying in a pool of blood , aside fromhis briefcase . 48.4Hong Kong Police Narcotics Bureau pointed out that this is the first discovery ,should attach great importance to this issue . 46.4Three pieces of wall sports in October last year were shipped toNew York to mark the 1990 reunification of Germany . 42.5Xiao Yang , president of the work report yesterday by 2026 votesto 528 votes against 259 abstentions . 41.4

20− 30%

Japan to temporarily freeze asked Russia to provide humanitarian assistance , 30.0Opposition Senator held that the president should focus more ondomestic affairs and not eager to go abroad . 26.2Taiwan DPP Legislator Chen Kim de fisheries groups to visit to Beijing . 23.9Recently , the international community for the recent conflict ,the fiercest Jenin camp conflict investigation of spreading . 20.8opinion maintained : Gusmao victory is a strong possibility becausehe is considered the East Timor independence hero . 20.0

0− 10%Japan Telecom company in 2000 to spend 5.5 billion dollars buy back . 0.0However , the voting result shows that Zhu because there is no reasonto be losing power by NPC deputies desolate . 0.077 private manufacturing enterprises also reported a foreign trademanagement right . 0.0Identification Department found that college students ofthe certificate , many of them were fake . 0.0The European Union would be implemented in steel importstemporary protective measures to discuss with the Chinese side , 0.0Georgia from a section of the great mountains Canyon withdrawal , 0.0

Table 5.2: Examples of translations under the TTM at various levels of translationperformance as measured by the sentence-level BLEU score. These examples areselected from the NIST 2002 MT evaluation set.

I readable and accurate






BLEU Assigns Low Scores to Poor Translations

117



40− 50%


20− 30%




I barely readable, probably misleading






BLEU Assigns Middling Scores to Middling Translations

117



40− 50%


20− 30%




I probably misleading with respect to details

I contains readable sections






BLEU Can Be Used (Carefully) For SMT Development

Improvements in BLEU can be trusted when :

I many different systems are developed under BLEU, and resultsthroughout development are published on standard test sets

I periodically re-validated against human judgments

Current test sets used in the NIST MT evaluations contain ∼ 1Ksentences / ∼ 25K words, each with four independently producedreference translations

BLEU is not an absolute measure of translation performance

I unlike Word Error Rate

I however, it is still useful for system development

There is extensive effort in improving/extending/replacing BLEU






Outline










Bitext and Bitext Alignment

Bitext is a collection of text in two languages, usually known or suspectedto contain translations

I Bitext Alignment: finds the regions that are translations

I There are levels of alignment: document, sentence or wordI different models and algorithms are used for each

Bitext Alignment is a translation sub-task

I Useful for creating training sets

I first step in extracting structual information and estimatingstatistical parameters

I provides basic ingredients in building an Machine Translation system






Document Level Alignment

The definition of a document is somewhat relative. Depending on thedomain, a document could be a book, a news story, a numbered chapter,paragraph, or verse, ...

Parallel Documents are sometimes created directly by human translators,for example in creating a foreign language edition of a newspaper.

I correspondence between the translations and the original sourcelanguage documents is maintained

I large collections of parallel documents are available from LDC

I these tend to be expensive to create and to obtain

Its possible to search for Parallel Documents ‘in the wild’ by crawling theweb and looking for hints that documents encountered might betranslations of each other. [STRAND-Resnik]






Sentence Level Alignment

Starts by assuming two parallel documents are known, and that each ofthese documents is an ordered collection of sentences

English Document: em1 French Document: fn

1

Objective:

I Segment the documents intochunks of sentences

I Align the chunks across thedocuments

Sentence Alignment Variable

aK (m, n)

K is the number of chunks

3

W1...W6W7...W13W14...W20W21...W35

f4

e2 e3 e4 e5

f1 f2 f3

a3=(5,5,4,4)

a2=(3,4,2,3)

a1=(1,2,1,1)

5e=e1

f=f41

e1

1

W1...W8W9...W20W21...W30W31...W38W39...W50

(5,4)a






Sentence Alignment via MAP Chunk AlignmentWith simplifying assumptions, chunk alignment becomes

{K , aK1 } = argmaxK ,aK

1β(K |m, n) P(aK

1 |m, n)∏K

k=1 P(w)(fak|eak

)

Chunk Count Alignment Translation Model

Model Model

e1 e2 e3 e4 e5

f1

f2

f3

f4

Pr(2,2)Pr(f2,f3|e3,e4)

Pr(1,1)Pr(f5|e5)

Pr(1,2)Pr(f1|e1,e2)(w)

(w)

(w)






Sentence Alignment: Underlying Models

I Statistics of sentence length (Gale & Church ’91)I P(m|l) : m−l·c√

lσ2∼ N (0, 1)

I Can also incorporate lexical knowledgeI hand labeled or provided (Chen ’93, Wu ’94)I identify cognates bootstrapped from a bilingual dictionary (Haruno

& Yamazaki ’96)

I Geometric approach using word correspondence (Melamed ’97)

I Multi-pass strategy (Simard & Plamondon ’98, Moore ’02)I identify high quality pairs initially as seedsI refined models are trained at each iteration






Sentence Alignment via Divisive ClusteringProceeds from coarse to fine and allows chunk reordering

At each iteration, the single most likely splitting point is chosen.

自从朝鲜半岛被分裂成两个国家以来，韩

国在背靠美国这棵大树以求自安的同时

，还小心翼翼但却坚持不懈地向美国寻求

先进武器，以抗衡朝鲜。据汉城的消息灵

通人士向《华盛顿邮报》透露，今年早些

时候，美国已秘而不宣地同意韩国 “ 可以

扩展它现有导弹的射程 ” ，使之能够直捣

朝鲜首都平壤。这本应是韩国感到欣

喜的事儿，可眼下半岛局势有了重大变

化，朝韩首脑面对面地会了晤，并签署

了联合声明。韩国怎么办？只好把到嘴的

“ 肥肉 ” 先吐出来，搁置自己的 “ 导弹射程

扩展计划 ” 。一名韩国知情人士道出了实

情： “ 因为有了首脑会谈，所以我们已搁置了自己的导弹计划，如果我们再那么干，就会弄糟首脑峰会开创的良好局面。

Since the Korean Peninsula was split into two countries , the Republic of Korea has , while leaning its back on the " big tree " of the United States for security , carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea . An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct . This should have elated South Korea .But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement , what should South Korea do now ? It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner . A knowledgeable South Korean speaks the truth : " Because of the summit meeting , we have shelved our own missile plan . If we go ahead with it , it will spoil the excellent situation opened up by the summit meeting .








自从朝鲜半岛被分裂成两个国家以来，韩

国在背靠美国这棵大树以求自安的同时

，还小心翼翼但却坚持不懈地向美国寻求

先进武器，以抗衡朝鲜。据汉城的消息灵

通人士向《华盛顿邮报》透露，今年早些

时候，美国已秘而不宣地同意韩国 “ 可以

扩展它现有导弹的射程 ” ，使之能够直捣

朝鲜首都平壤。这本应是韩国感到欣喜

的事儿，可眼下半岛局势有了重大变化

，朝韩首脑面对面地会了晤，并签署了

联合声明。韩国怎么办？只好把到嘴的 “

肥肉 ” 先吐出来，搁置自己的 “ 导弹射程扩

展计划 ” 。一名韩国知情人士道出了实情

：

“ 因为有了首脑会谈，所以我们已搁置了自己的导弹计划，如果我们再那么干，就会弄糟首脑峰会开创的良好局面。

Since the Korean Peninsula was split into two countries , the Republic of Korea has , while leaning its back on the " big tree " of the United States for security , carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea . An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct . This should have elated South Korea .But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement , what should South Korea do now ? It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner . A knowledgeable South Korean speaks the truth :

" Because of the summit meeting , we have shelved our own missile plan . If we go ahead with it , it will spoil the excellent situation opened up by the summit meeting .

1








自从朝鲜半岛被分裂成两个国家以来，韩国

在背靠美国这棵大树以求自安的同时，还

小心翼翼但却坚持不懈地向美国寻求先进武器

，以抗衡朝鲜。据汉城的消息灵通人士向《

华盛顿邮报》透露，今年早些时候，美国已秘

而不宣地同意韩国 “ 可以扩展它现有导弹的

射程 ” ，使之能够直捣朝鲜首都平壤。这本

应是韩国感到欣喜的事儿，可眼下半岛局势

有了重大变化，朝韩首脑面对面地会了晤，

并签署了联合声明。韩国怎么办？只好把到

嘴的 “ 肥肉 ” 先吐出来，搁置自己的 “ 导弹射程

扩展计划 ” 。

一名韩国知情人士道出了实情：


Since the Korean Peninsula was split into two countries , the Republic of Korea has , while leaning its back on the " big tree " of the United States for security , carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea . An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct . This should have elated South Korea .But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement , what should South Korea do now ? It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner .

A knowledgeable South Korean speaks the truth :


1

2











，以抗衡朝鲜。

据汉城的消息灵通人士向《华盛顿邮报》透露，今年早些时候，美国已秘而不宣地同意韩国 “ 可以扩展它现有导弹的射程 ” ，使之能够直捣朝鲜首都平壤。这本应是韩国感到欣喜的事儿，可眼下半岛局势有了重大变化，朝韩首脑面对面地会了晤，并签署了联合声明。韩国怎么办？只好把到嘴的 “ 肥肉 ” 先吐出来，搁置自己的 “ 导弹射程扩展计划 ” 。



Since the Korean Peninsula was split into two countries , the Republic of Korea has , while leaning its back on the " big tree " of the United States for security , carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea .

An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct . This should have elated South Korea .But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement , what should South Korea do now ? It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner .



1

2

3











，以抗衡朝鲜。

据汉城的消息灵通人士向《华盛顿邮报》透露，今年早些时候，美国已秘而不宣地同意韩国 “ 可以扩展它现有导弹的射程 ” ，使之能够直捣朝鲜首都平壤。

这本应是韩国感到欣喜的事儿，可眼下半岛局势有了重大变化，朝韩首脑面对面地会了晤，并签署了联合声明。韩国怎么办？

只好把到嘴的 “ 肥肉 ” 先吐出来，搁置自己的 “ 导弹射程扩展计划 ” 。



Since the Korean Peninsula was split into two countries , the Republic of Korea has , while leaning its back on the " big tree " of the United States for security , carefully and consistently sought advanced weapons from the United States in a bid to confront the Democratic People 's Republic of Korea .

An informed source in Seoul revealed to the Washington Post that the United States had secretly agreed to the request of South Korea earlier this year to " extend its existing missile range " to strike Pyongyang direct .

This should have elated South Korea .But since the situation surrounding the peninsula has changed dramatically and the two heads of state of the two Koreas have met with each other and signed a joint statement , what should South Korea do now ?

It has no choice but spit back the " greasy meat " from its mouth and put the " missile expansion plan " on the back burner .



1

2

3

4

5






Sentence Alignment Goal: Better Translation Models

Benefits of good chunking

I better training set alignment improves translation performance

I smaller chunk pairs leads to faster translation model training

Alignment desiderata

I fast

I efficient: as little bitext should be discarded as possible

I flat-start initialization

I language independent

I minimal linguistic knowledge required

I document alignment at the subsentence levelI coarse monotonic alignment followed by

fine divisive clustering works wellI splitting points can depend on the language pairs






Outline










Word Alignment

An English-French Sentence Pair: (e I1, f

J1 )

NULL

Monsieur le Orateur , ma question se adresse à le ministre chargé de les transports

Mr. Speaker , my question is directed to the Minister of Transport

I Alignment Links: b = (i , j) : fj linked to ei

I Alignment is defined by a Link Set B = {b1, b2, ..., bJ}I Some links are NULL links

I Introduce a word alignment variable aj

b = (i , j) → aj = i






IBM Model 1 & 2 Word Alignment ModelsTreat each sentence as a ‘bag of words’

I translation has ‘direction’

I any English word can generate any French word

I simple word-to-word alignment models

P(f, a,m | e) = P(f | a,m, e) P(a |m, e) P(m | e)

=∏m

j=1 Pt(fj | eaj )∏m

j=1 Pa(aj | j , l ,m) Pε(m | l)Word Translation Table Word Alignment Table Length

(flat in Model 1)

I EM is possible

P(f|e) =∑

a P(f, a|e)

= Pε(m|l)∏m

j=1

∑li=0 Pa(i |j ,m, l) Pt(fj |ei )






HMM Word Alignment Models

Improved models of word-alignment

I any English word can generate any French word

I Markov Process over the alignment variableI proceeds left-to-right over the french sentence

I Words are still translated by simple table lookup

P(f, a,m | e) = P(f | a,m, e) P(a |m, e) P(m | e)

=∏m

j=1 Pt(fj | eaj )∏m

j=1 Pa(aj | aj−1, l ,m) Pε(m | l)Word Translation Model Word Alignment Process Length

I EM is still possible using the Forward/Backward algorithm






IBM Model 4 for Word Alignment

Model 4 is a powerful model of word alignment within sentence pairsFeatures:

I lexical model, as in Model 1 and HMM alignment model

I Fertility: probability distribution over the number of French wordseach English word can generate

I an approximation to phrase modeling

I NULL Translation Model: allows French words to be generatedwithout a corresponding source on the English side

I Distortion Model: describes how French words are distributedthroughout the French sentence when generated from a singleEnglish word

Neither estimation nor decoding is easy under Model 4

I For an alignment, P(f, a|e) can be computed,but

∑a P(f, a|e) is not easy to compute






Phrase Translation ModelsOne weakness of the IBM models is that they describe translation as aone-to-many process in which one word generates many wordsAlignment Template Models of phrase-to-phrase translation

I introduced by Och et al. 1999

I derived from good word-level alignments, typically from IBM-4

this bill places salve on a sore wound

ce bill met de le baume sur une blessure

this bill places salve on a sore woundce bill met de le baume sur une blessure

this bill places salve on a sore wound

ce bill met de le baume sur une blessureIBM−4 F

IBM−4 E

u1

v1 v2

u2

IBM−4 F U E

I Templates and Phrase Pairs are extracted to cover patterns of wordalignments found in the training bitext






Monolingual Language Models

Language models are used as in Automatic Speech Recognition

I N-Grams estimated over in-domain monolingual text

P(e) =∏

i P(ei |ei−1, . . . , e1)

≈∏

i P(ei |ei−1, . . . , ei−N+1)






Translation

Translation is ‘trivial’ once the component distributions are defined ...

e = argmaxe

P(e|f)

typically approximated as e = argmaxe argmaxa P(e, a|f)One approach is simply to construct specialized search algorithms

I Depending on the underlying models, search can be by Viterbi, A∗,and other specialized search procedures

Decoder design and implementation is complex

I Small model changes might require large changes to a decoder

I Any approximations made during search lead to inexactimplementation of the model

I Decoder implementation takes effort away from ‘modeling’






Translation via Weighted Finite State Transducers

Translation with Finite State Devices, Knight & Al Onaizan, AMTA’98

I Implements the IBM models as WFSTsI word-to-word translation, word fertility, and permutation

If the component models can be implemented as WFSTs that can becomposed, building a decoder is trivial

I Can be limiting, but avoids special-purpose decoders

I The value of this modeling approach has been shown in ASR by thesystems developed at AT&T

I Translation is performed using libraries of standard FSM operationsI Clear formulation

Other work using FSMs - Bangalore’01






Translation Template Model - TTM

Generative source-channel model of machine translation

I Inspired byI Och&Ney’s Phrase-based translation modelsI Knight&Al Onaizan WFST description of translation via IBM models

I Formulation and Implementation using Weighted Finite StateTransducers (WFSTs)

I Word Alignment and Translation under the model can be performedusing standard WFST operations on the component transducers






Generative Translation Process underlying theTranslation Template Model

1

Source PhraseSegmentation

Source PhraseReordering

exports grain are_projected_to fall by_25_%

Placement of

Target Phrase

Insertion Markers

Target PhraseInsertion

PhraseTransduction

Target PhraseSegmentation

1

are_projected_tograin exports fall by_25_% Source Phrases

grain exports are projected to fall by 25 % Source LanguageSentence

Reordered SourcePhrases

grain are_projected_to by_25_%fall

Target Phrasesflechirdoivent de_25_%les exportations de grains

les exportations de grains doivent flechir de 25 %Target Language

Sentence

exports

I Transformations via stochastic models implemented as WFSTs






TTM Phrase Pair Inventory

I Translation Template Model relies on an inventory of French phrasesand their English translations

I Inventory is extracted from word-aligned bitext- IBM translation models are used for performing word alignment

I An example of PPI

English Phrase French Phrase Phrase Transductionu v Probability P(v |u)

hear hear bravo 0.8bravo bravo 0.15

ordre 0.05terms of reference mandat 0.8

de son mandat 0.2

I Phrase Pair Inventory affects the performance of the TTMI Word Alignment Quality of underlying modelsI Coverage of phrases on the test set






Coverage of the Test Set by the Phrase-Pair Inventory

I French-English Hansards (Train: 48K sent. pairs, Test: 500 sents)

I Train IBM translation models and alignment words

I Collect PPIs over nested subsets of the word alignments- Fixed alignment quality underlying the variable size inventories

I Coverage: % of phrases in the test set that exist in the PPI

# of sentence-pairs (K) Test-Set Coverage (%) BLEU (%)5 20.79 19.612 26.75 20.824 31.35 21.548 36.02 22.3

I A higher coverage of the test set by PPI improves translationperformance






Summary of the Translation Template Model

I Bitext word alignment and translation under the model can beperformed using standard WFST operations

I Modular ImplementationI No necessity for a specialized decoderI Can easily generate translation lattices and N-best lists

I WFST architecture provides support for generating bitext wordalignments and alignment lattices

I Novel use of phrase-based models for word alignmentI Allows development of parameter re-estimation procedures

I Detailed analysis of alignment and translation performance- Identified the contribution due to each component

I Very competitive performance in the NIST 2004 Chinese-to-Englishand Arabic-to-English MT evaluation





Outline









Text Processing

I Chinese Text segmented into words using LDC segmenter(Linguistic Data Consortium)

I Arabic Text processing pipeline- Modified Buckwalter analyzer- split conjunctions, prepositions, Al- and pronouns- Al- and w- deletion (maybe wrong decision!)

I English Text processed using a simple tokenizer

Bitext for Translation Model TrainingChinese-English Arabic-English

# of sent. pairs (K) - 68.0# of chunk pairs (M) 7.6 5.1

# of words (M) 175.7/207.4 123.0/132.5





English Language Model Training Data

By source - in Millions of words

Source Xin AFP PD FBIS UN AR-news TotalC-E

Small 3g 4.3 - 16.2 - - - 20.5Big 3g 155.7 200.8 16.2 10.5 - - 373.3Big 4g 155.7 200.8 16.2 10.5 - - 373.3A-E

Small 3g 63.1 200.8 - - - 2.1 266.0Big 3g 83.0 210.0 - - 131.0 3.6 428.0Big 4g 83.0 210.0 - - 131.0 3.6 428.0

Available from LDC:

I Xinhua, Agence France Press, People’s Daily, FBIS,United Nations collections, AR-news





Translation Performance : MT Bitext Size

Chinese-English - Decode with Small3g LMBitext Contribution by Source: En words (M) BLEU (%)

Partition FBIS HKNews XHTS UN Total eval01 eval02 eval031 10.5 16.3 - 26.7 53.5 25.1 25.5 24.12 10.5 - - 85.0 95.5 25.8 25.9 25.03 10.5 16.3 43.0 25.8 95.6 28.0 25.8 24.9

1+2+3 28.1 26.6 25.5(XHTS = Xinhua, Hansards, Treebank, Sinorama)

Arabic-English - Decode with Small3g LMBitext Contribution by Source: En words (M) BLEU (%)

Partition UN News Total eval02 eval031 65.0 3.5 68.5 35.0 37.42 64.0 3.5 67.5 35.7 37.6

1+2 35.8 37.8





Translation Performance : Language Modeling

System Decoding BLEU%# Method Chinese-English Arabic-English

e01 e02 e03 e04 (c) e02 e03 e04 (c)1 JHU-UMD ’03 - - 16.0 17.0 - - - -2 AE-Primary ’04 - - - - - - 38.0 30.6

3 Small3g 28.1 26.6 25.5 - 35.8 37.8 -4 Big3g 28.7 27.7 27.1 26.5 38.1 40.1 -5 Big4g Nbest from 4 29.6 28.2 27.3 27.5 - - -6 Lattice from 4 29.7 28.5 27.4 27.6 39.4 42.1 36.1

I Eval04 results are case-sensitive BLEU- Truecasing was performed using HMM capitalizer





Outline









Summary - Evaluation Systems

I A statistical MT system based on Bitext Chunking and TTMI Evaluation system benefitted from a careful study of contribution of

each model component to overall performanceI Bitext chunking extracts almost all available bitext for model trainingI Large gains relative to JHU-CLSP 2003 eval system (11% BLEU

absolute on eval03)I Respectable performance in Chinese-English and Arabic-English MTI Similar Architectures for C-E and A-E systems

- C-E system developed in 1 month - by 1 Indian graduate student- A-E system developed in 10 days - by 1 Chinese graduate student

I WFST Architecture supports the generation and rescoring ofLattices and N-best Lists





Summary - Current Topics in SMT

Have discussed five problems in statistical machine translationThese are not necessarily independent research topics

I Divisive Clustering is an attempt to integrate ‘text preprocessing’issues – sentence and document alignment – into overall MT systemtraining

I Minimum Bayes Risk Alignment and Translation attempts tointegrate Translation Performance Metrics into alignment andtranslation search algorithms

I WFST Translation Models blur the distinction between translationmodels and translation algorithms – if the models are constructedappropriately, translation is ‘straightforward’

Research goals are to refine these techniques and at the same time makeit possible to incorporate as much training data – bilingual andmonolingual text – as can be obtained


References Bangalore, S. and Riccardi, G. (2001) A finite-state approach to machine translation.Proceedings of the 2nd meeting of the North American Chapter of the Association forComputational Linguistics. Pittsburgh, PA, USA.

Brown, P. F., Cocke, J., Della Pietra, S. A., and Della Pietra, V. J., Jelinek, F., Lafferty,J. D., Mercer, R. L. and Roossin, P. S. (1990) A Statistical Approach to MachineTranslation. Computational Linguistics 16(2): 79–85.

Brown, P. F., Della Pietra, S. A., and Della Pietra, V. J. and Mercer, R. L. (1993) TheMathematics of Statistical Machine Translation: Parameter Estimation. ComputationalLinguistics 19(2): 263–311.

Brown, P. F., Della Pietra, S. A., and Della Pietra, V. J. and Mercer, R. L. (1993) TheMathematics of Statistical Machine Translation: Parameter Estimation. ComputationalLinguistics 19(2): 263–311.

Byrne, W., Khudanpur, S., Kim, W., Kumar, S., Pecina, P., Virga, P., Xu, P. andYarowsky, D. (2003) The Johns Hopkins University 2003 Chinese-English MachineTranslation System. Proceedings of MT Summit IX, pp. 447–450. New Orleans, LA,USA.

S. F. Chen. Aligning sentences in bilingual corpora using lexical information. In Meetingof the Association for Computational Linguistics, pages 9–16. 1993.

Deng, Y. and Byrne, W. (2004) Bitext Chunk Alignment for Statistical MachineTranslation. Research Note, Center for Language and Speech Processing, Johns HopkinsUniversity.

Doddington, G. (2002) Automatic Evaluation of Machine Translation Quality UsingN-gram Co-Occurrence Statistics. Proceedings of the Conference on Human LanguageTechnology, pp. 138–145, San Diego, CA. USA,

W. A. Gale and K. W. Church. A program for aligning sentences in bilingual corpora. InMeeting of the Association for Computational Linguistics, pages 177–184. 1991.

Germann, U., Jahr, M., Knight, K., Marcu, D. and Yamada, K. (2001) Fast Decodingand Optimal Decoding for Machine Translation. Proceedings of the 39th AnnualMeeting of the Association for Computational Linguistics, pp. 228–235. Toulouse,France.

M. Haruno and T. Yamazaki. High-performance bilingual text alignment using statisticaland dictionary information. In Proceedings of ACL ’96, pages 131–138. 1996.

Knight, K. and Al-Onaizan, Y. (1998) Translation with Finite-State Devices,Proceedings of the AMTA Conference, pp. 421–437. Langhorne, PA, USA.

Koehn, P., Och, F. and Marcu, D. (2003) Statistical Phrase-Based Translation.Proceedings of the Conference on Human Language Technology, pp. 127–133.Edmonton, Canada.

Kumar, S. and Byrne, W. (2002) Minimum Bayes-Risk Alignment of Bilingual Texts.Proceedings of the Conference on Empirical Methods in Natural Language Processing,pp. 140–147. Philadelphia, PA, USA.

Kumar, S. and Byrne, W. (2003) A Weighted Finite State Transducer Implementationof the Alignment Template Model for Statistical Machine Translation. Proceedings ofthe Conference on Human Language Technology, pp. 142–149. Edmonton, Canada.

Kumar, S. and Byrne, W. (2004) A Weighted Finite State Transducer TranslationTemplate Model for Statistical Machine Translation. Research Note No. 48, Center forLanguage and Speech Processing, Johns Hopkins University.

Kumar, S. and Byrne, W. (2004) Minimum Bayes-Risk Decoding for Statistical MachineTranslation Proceedings of the Conference on Human Language Technology,pp. 169–176. Boston, MA, USA.

Kumar, S. Minimum Bayes-Risk Techniques in Automatic Speech Recognition andMachine Translation, Ph.D. Dissertation, Johns Hopkins Univ. Electrical and ComputerEngineering Dept., 2004

Kumar, S and Byrne, W. (2005) A weighted finite state transducer translation templatemodel for statistical machine translation, Journal of Natural Language Engineering, toappear.

LDC Chinese Segmenter. http://www.ldc.upenn.edu/Projects/Chinese. LDC, 2002.

Marcu, D. and Wong, W. (2002) A Phrase-Based, Joint Probability Model forStatistical Machine Translation. Proceedings of the Conference on Empirical Methods inNatural Language Processing, pp. 133-139. Philadelphia, PA, USA.

I. D. Melamed. A portable algorithm for mapping bitext correspondence. In Proceedingsof the 35th Annual Conference of the Association for Computational Linguistics., pages305–312. 1997.

Mohri, M., Pereira, F. and M. Riley (1997), ATT General-purpose finite-state machinesoftware tools. http://www.research.att.com/sw/tools/fsm/

NIST (2004) The NIST Machine Translation Evaluations.http://www.nist.gov/speech/tests/mt/.

R. C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. InProceeding of 5th Conference of the Association for Machine Translation in theAmericas, pages 135–244.

Och, F. (2002) Statistical Machine Translation: From Single Word Models to AlignmentTemplates. PhD. Thesis, RWTH Aachen, Germany,

Och, F. and Ney, H. (2000) Improved statistical alignment models. Proceedings of the38th Annual Meeting of the Association for Computational Linguistics pp. 440–447.Hong Kong, China.

Och, F., Tillmann, C. and Ney, H. (1999) Improved Alignment Models for StatisticalMachine Translation. Proceedings of the Joint Conference of Empirical Methods inNatural Language Processing and Very Large Corpora, pp. 20–28. College Park, MD,USA.

Och, F., Ueffing, N. and Ney, H. (2001) An Efficient A* Search Algorithm for StatisticalMachine Translation. Proceedings of the 39th Annual Meeting of the Association forComputational Linguistics, pp. 55–62. Toulouse, France.

Papineni, K., Roukos, S., Ward, T. and Zhu, W. (2001) Bleu: a Method for AutomaticEvaluation of Machine Translation. Tech Report RC22176 (W0109-022), IBM ResearchDivision.

M. Simard and P. Plamondon. Bilingual sentence alignment: Balancing robustness andaccuracy. In Machine Translation, volume 13, pages 59–80. 1998.

D. Wu. Aligning a parallel english-chinese corpus statistically with lexical criteria. InMeeting of the Association for Computational Linguistics, pages 80–87. 1994.

Stolcke, A. (2002) SRILM - An Extensible Language Modeling Toolkit.http://www.speech.sri.com/projects/srilm/, Proceedings of the International Conferenceon Spoken Language Processing, pp. 901–904. Denver, CO, USA.

Tillmann, C. and Ney, H. (2003) Word reordering and a dynamic programming beamsearch algorithm for statistical machine translation. Computational Linguistics 29(1):97–133.

Tillmann, C. (2003) A Projection Extension Algorithm for Statistical MachineTranslation. Proceedings of the Conference on Empirical Methods in Natural LanguageProcessing, Sapporo, Japan.

Ueffing, N., Och, F. and Ney, H. (2002) Generation of Word Graphs in StatisticalMachine Translation. Proceedings of the Conference on Empirical Methods in NaturalLanguage Processing, pp. 156–163. Philadelphia, PA, USA.

Vogel, S., Ney, H. and Tillmann, C. (1996) HMM Based Word Alignment in StatisticalTranslation. Proceedings of the 16th International Conference on ComputationalLinguistics pp. 836–841. Copenhagen, Denmark.

Wang, Y. and Waibel, A. (1997) Decoding Algorithm in Statistical Machine Translation.Proceedings of the 35th Annual Meeting of the Association for ComputationalLinguistics, pp. 366–372. Madrid, Spain.

JHU (2003), Syntax for Statistical Machine Translation, Final Report, JHU SummerWorkshop. http://www.clsp.jhu.edu/ws2003/groups/translate/.

Zens, R. and Ney, H. (2004) Improvements in Phrase-based Statistical MachineTranslation. Proceedings of the Conference on Human Language Technology,pp. 257–264. Boston, MA, USA.

Zhang, Y. Vogel, S. and Waibel A. (2003), Integrated Phrase Segmentation andAlignment Model for Statistical Machine Translation. Proceedings of the Conference onNatural Language Processing and Knowledge Engineering. Beijing, China.