Language and Translation Model Adaptation using Comparable Corpora. Matthew Snover, Bonnie Dorr, and Richard Schwartz.



  • Language and Translation Model Adaptation using Comparable Corpora

    Matthew Snover, Bonnie Dorr, and Richard Schwartz

    1

  • Monolingual Data in MT

    • Limited primarily to language model estimation
    • Parallel data to train the TM is very expensive
    • Monolingual data is very cheap and easy to acquire
    • How can we better exploit monolingual data?
    • News stories are repeated across languages, giving us greater context for translation
    • Even without repetition of news stories, monolingual data gives greater context for stories

    2

  • Comparable Documents

    • Reference Translation: Cameras are flashing and reporters are following up, for Hollywood star Angelina Jolie is finally talking to the public after a one-month stay in India, but not as a movie star. The Hollywood actress, goodwill ambassador of the United Nations high commissioner for refugees, met with the Indian minister of state for external affairs, Anand Sharma, here today, Sunday, to discuss issues of refugees and children. ... Jolie, accompanied by her five-year-old son, Maddox, visited the refugee camps that are run by the Khalsa Diwan Society for social services and the high commissioner for refugees Saturday afternoon after she arrived in Delhi. Jolie has been in India since October 5th shooting the movie "A Mighty Heart," which is based on the life of Wall Street Journal correspondent Daniel Pearl, who was kidnapped and killed in Pakistan. Jolie plays the role of Pearl's wife, Mariane.

    • Comparable Document: Actress Angelina Jolie hopped onto a crowded Mumbai commuter train Monday to film a scene for a movie about slain journalist Daniel Pearl, who lived and worked in India's financial and entertainment capital. Hollywood actor Dan Futterman portrays Pearl and Jolie plays his wife Mariane in the "A Mighty Heart" co-produced by Plan B, a production company founded by Brad Pitt and his ex-wife, actress Jennifer Aniston. Jolie and Pitt, accompanied by their three children -- Maddox, 5, 18-month-old Zahara and 5-month-old Shiloh Nouvel -- arrived in Mumbai on Saturday from the western Indian city Pune where they were shooting the movie for nearly a month. ...

    3


  • Previous Work: Exploiting Monolingual Data

    • New word-to-word translations from comparable (not parallel) data [Fung and Yee, 1998; Rapp, 1999]
    • Find parallel text by mining monolingual data in multiple languages [Resnik and Smith, 2003; Munteanu and Marcu, 2005]
    • Re-weight portions of language model data using CLIR techniques [Kim and Khudanpur, 2003; Zhao et al., 2004; Kim, 2005]
    • Similar techniques have been used for weighting bi-text

    4

  • Our Approach

    • For each document to be translated:
      • Find comparable documents in target data
      • Adapt language model using comparable documents (following Kim, 2005)
      • Add new phrasal translation rules based on source text and comparable documents
    • Use existing language model training for monolingual data

    5

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    6

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    7

  • Cross-Lingual Information Retrieval

    • Find English documents comparable to foreign documents using CLIR
    • Established method of locating relevant documents across languages
    • For each foreign document, query a database and return a ranked and scored list of relevant English documents
    • Adaptation method is independent of the CLIR method

    8

  • [Diagram: CLIR takes a foreign query, scores it against monolingual English documents (also used for LM training), and returns a ranked list of English documents; the top-ranked document is comparable, while lower-ranked documents are only somewhat comparable.]

    Foreign Source:
    ... 火星 探测 车 任务 的 主管 塞 辛格 说 : " 我们 发现 探测 车 有 很 严重 异常 现象 . " ...
    (... The head of the Mars rover mission, Theisinger, said: "We have found that the rover has a very serious anomaly." ...)

    Comparable:
    ... NASA scientists said on Thursday they had lost contact with the Mars Spirit rover for more than 24 hours, describing the problem as "a very serious anomaly." ...

    Somewhat Comparable:
    ... An instrument aboard one of the two NASA rovers en route to Mars has malfunctioned, prompting worries it could harm the robot's information-gathering ability, a scientist said. ...

    9

  • CLIR

    • Use multiple translations for each foreign word
    • Assign scores to documents and return a ranked list

    Pr(Doc is rel | Q) = Pr(Doc is rel) · Pr(Q | Doc is rel) / Pr(Q)

    Pr(Q | Doc) = ∏_{f ∈ Q} [ a · Pr(f | F) + (1 − a) · ∑_{e ∈ E} Pr(e | Doc) · Pr(f | e) ]

    [Xu et al., 2001]

    10
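The scoring above can be sketched in code. This is a minimal toy rendering of the Xu et al. (2001) query likelihood, not the system used in the talk; the translation table, background foreign model, and the smoothing weight `a = 0.3` are all illustrative assumptions.

```python
import math

def clir_score(query_words, doc_words, trans_prob, general_fprob, a=0.3):
    """Log Pr(Q|Doc): product over foreign query words f of
    a*Pr(f|F) + (1-a) * sum_e Pr(e|Doc) * Pr(f|e).

    trans_prob:    {(f, e): Pr(f|e)} word-translation probabilities (toy)
    general_fprob: {f: Pr(f|F)} background foreign unigram model (toy)
    a:             smoothing weight on the background model (assumed value)
    """
    n = len(doc_words)
    log_p = 0.0
    for f in query_words:
        # sum_e Pr(e|Doc) * Pr(f|e), with Pr(e|Doc) as document term frequency
        trans = sum((doc_words.count(e) / n) * trans_prob.get((f, e), 0.0)
                    for e in set(doc_words))
        # small floor keeps the log finite for unseen query words
        log_p += math.log(a * general_fprob.get(f, 1e-9) + (1 - a) * trans)
    return log_p
```

Ranking the English collection by this score per foreign document yields the ranked, scored list the talk describes.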

  • Long Comparable Documents

    • CLIR favors longer English documents
    • Long documents tend to be about many topics
      • Bad for improving MT
    • Solution: break documents into short (overlapping) passages of ~300 words each
    • Set of top N passages is the bias-text

    11
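The passage splitting can be sketched as follows; the 150-word stride is a hypothetical choice, since the talk only specifies ~300-word overlapping passages.

```python
def split_passages(words, size=300, stride=150):
    """Break a document (list of words) into overlapping passages of at
    most `size` words; `stride` controls how much consecutive passages
    overlap (stride of size/2 gives 50% overlap -- an assumed setting).
    """
    if len(words) <= size:
        return [words]
    passages = []
    # step by `stride` so each passage shares (size - stride) words
    # with the previous one; the final passage may be shorter
    for start in range(0, len(words) - stride, stride):
        passages.append(words[start:start + size])
    return passages
```

Each passage is then indexed and scored by CLIR in place of the full document, and the top N passages form the bias-text.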

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    12

  • Language Model Biasing

    • New LM generated from comparable passages
    • Interpolated with the original generic LM using a very low weight (0.01)
    • New LM is small and specific (~3,000 words)

    Pr(e) = (1 − λ) · Pr_g(e) + λ · Pr_b(e)

    13
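A unigram sketch of the interpolation above; the real system interpolates n-gram LMs, so this only illustrates the formula Pr(e) = (1 − λ) · Pr_g(e) + λ · Pr_b(e) with λ = 0.01.

```python
from collections import Counter

def biased_lm(generic_counts, bias_text, lam=0.01):
    """Interpolate a small bias LM (built from the comparable passages)
    with the generic LM:  Pr(e) = (1 - lam) * Pr_g(e) + lam * Pr_b(e).
    Unigram maximum-likelihood models stand in for the n-gram LMs."""
    g_total = sum(generic_counts.values())
    b_counts = Counter(bias_text)
    b_total = sum(b_counts.values())

    def prob(word):
        p_g = generic_counts.get(word, 0) / g_total
        p_b = b_counts.get(word, 0) / b_total
        return (1 - lam) * p_g + lam * p_b

    return prob
```

Even with the tiny weight, words that are frequent in the bias-text get a meaningful boost, because the bias LM is so small and specific.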

  • Example Improvement

    Reference:
    the pope said, "each life should and needs to be guarded and developed."

    Original Translation:
    "each and every individual should need of safeguarding and developing," the pope said.

    Portion of Comparable Text:
    "Every human life, as it is, deserves and demands to always be defended and promoted," the pope said on the day the Catholic Church celebrates annually as a "Day of Life".

    Biased Translation:
    "every life should and must be defended and promoted," the pope said.

    14

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    15

  • Implementation of Translation Model Adaptation

    • Naïve assumption: same as IBM Model 1
      • Every phrase (1-3 words) in the source can translate to every phrase (1-3 words) in the comparable text
      • Each rule has uniform probability
    • Use only those words and phrases from the bias-text that occur ≥ k times (k = 2 in these experiments)

    16
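The rule generation above can be sketched as follows; function and variable names are illustrative, not from the actual system.

```python
from collections import Counter
from itertools import product

def ngrams(words, max_n=3):
    """All 1- to max_n-word phrases of a word sequence, as tuples."""
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def bias_rules(source_words, bias_passages, k=2):
    """Model-1-style bias rules: pair every source phrase (1-3 words)
    with every bias-text phrase (1-3 words) seen at least k times,
    assigning every rule the same uniform probability."""
    target_counts = Counter()
    for passage in bias_passages:
        target_counts.update(ngrams(passage))
    targets = [t for t, c in target_counts.items() if c >= k]
    sources = set(ngrams(source_words))
    uniform = 1.0 / len(targets) if targets else 0.0
    return {(s, t): uniform for s, t in product(sources, targets)}
```

The count threshold k keeps the rule set from exploding with one-off target phrases; the decoder's tuned bias-rule weight, not these uniform scores, decides how much the rules influence translation.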

  • Implementation of TM Adaptation

    • New translation rules differ only in lexical probability
    • Added to the decoder as phrasal rules, but marked as special 'bias rules'
    • Different weights for 'bias rules' and 'generic rules'
    • Bias rule weight optimized alongside all other weights

    17

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    18

  • Base MT System

    • State-of-the-art hierarchical MT system (BBN's HierDec; Shen et al., 2008)
    • Arabic-to-English
    • LM: decode with a 3-gram and rescore with a 5-gram
    • Optimized with BLEU
    • Performance measured using BLEU and TER
    • 10 most comparable passages used to create the bias-text

    19

  • Data Sets

    • Monolingual comparable data (all in LM training):
      • English Gigaword: 2.8 billion words
      • FBIS corpus: 28.5 million words
      • News archive data from the Web: 828 million words
    • Tuned on portions of MT04, MT05, and GALE07 newswire (48,921 words)
    • Test on MT06 (55,578 words)
      • 4 reference translations

    20

  • Less Commonly Taught Languages Scenario

    • TM adaptation should prove most beneficial when less bi-text is available
    • Simulate LCTL with Arabic by reducing the bi-text training set
      • 5 million words of newswire training
      • Full monolingual data used

    21

  • LCTL Scenario: Measuring the Upper Bound of TM Adaptation

    • Best possible comparable data would be parallel data
      • Use reference translations to simulate
    • Ideally, TM adaptation could determine which source words should align to which target words
      • Align source to references with GIZA++
      • Discard rule probabilities
    • We can also limit bias rules to target sides that occur in the top 100 passages from the comparable data

    22
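This aligned-reference upper bound can be sketched as below, assuming word-level alignment links are already available (e.g., from GIZA++); this toy version keeps only aligned word pairs, a simplification of the phrasal rules used in the experiment.

```python
def rules_from_alignment(src_words, ref_words, links):
    """Build bias rules from word-alignment links, given as (i, j)
    pairs of source index and reference index (as an aligner such as
    GIZA++ would produce). Rule probabilities are discarded: every
    extracted rule gets the same uniform score."""
    pairs = {(src_words[i], ref_words[j]) for i, j in links}
    uniform = 1.0 / len(pairs) if pairs else 0.0
    return {p: uniform for p in pairs}
```

The point of discarding the probabilities is to isolate what alignment knowledge alone contributes, so the comparison with the unaligned and comparable-data conditions is fair.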

  • LCTL Scenario: TM Adaptation from Aligned References

    [Bar chart comparing No Adaptation, Aligned Reference, and Overlap w/ Comparable on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    23

  • LCTL Scenario: Measuring a Weaker Upper Bound of TM Adaptation

    • Fair TM adaptation doesn't align source and bias-text
      • Use only the reference text, without alignments
    • Every phrase in the reference document aligns to every phrase in the source document
      • Uniform probabilities of rules (as before)
    • We also limit bias rules to target sides that occur in the top 100 passages from the comparable data
    • Upper bound for our method if we found 'perfect' comparable documents

    24

  • LCTL Scenario: TM Adaptation from Unaligned References

    [Bar chart comparing No Adaptation, Unaligned Reference, and Overlap w/ Comparable on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    25

  • Fair LCTL Adaptation

    [Bar chart comparing No Adaptation, LM Adapt, TM Adapt, and LM&TM Adapt on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    26

  • Full Training Scenario

    • LCTL: only a small gain (0.68 BLEU) on the MT06 test set
    • Can we benefit if we use all our data?
    • Full training: 230M words of bi-text (18.5M segments)
      • Includes parallel data extracted from comparable data by ISI (LDC2007T08)

    27

  • Full Training Adaptation

    [Bar chart comparing No Adaptation, LM Adapt, TM Adapt, and LM&TM Adapt on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU in the full-training condition (axis 35-55).]

    28

  • LCTL vs Full Training Gains

    • LCTL Training: gain of 0.68 BLEU & 0.07 TER
    • Full Training: gain of 2.07 BLEU & 0.58 TER
    • Counter-intuitive: larger gains with full training
      • Better lexical probability estimates with full training
      • Better generic models to aid adaptation

                      TER Gain   BLEU Gain
      LCTL Training     0.07       0.68
      Full Training     0.58       2.07

    29

  • Discussion

    • Exploit monolingual data to adapt both the LM and the TM
    • Using no new information:
      • Comparable data is already part of the LM training data
      • CLIR uses the generic TM
    • TM adaptation uses a very simple method to generate bias rules
    • Gives substantial gains on the newswire test set
    • What is the effect of the level of comparability?
    • Can this be applied to less structured data, such as web or audio data?

    30

  • Questions

    31