Language and Translation Model Adaptation using Comparable Corpora. Matthew Snover, Bonnie Dorr, and Richard Schwartz.



  • Language and Translation Model Adaptation using Comparable Corpora

    Matthew Snover, Bonnie Dorr, and Richard Schwartz

    1

  • Monolingual Data in MT

    • Limited primarily to language model estimation
    • Parallel data to train the TM is very expensive
    • Monolingual data is very cheap and easy to acquire
    • How can we better exploit monolingual data?
    • News stories are repeated across languages, giving us greater context for translation
    • Even without repetition of news stories, monolingual data gives greater context for stories

    2

  • Comparable Documents

    • Reference Translation: Cameras are flashing and reporters are following up, for Hollywood star Angelina Jolie is finally talking to the public after a one-month stay in India, but not as a movie star. The Hollywood actress, goodwill ambassador of the United Nations high commissioner for refugees, met with the Indian minister of state for external affairs, Anand Sharma, here today, Sunday, to discuss issues of refugees and children. ... Jolie, accompanied by her five-year-old son, Maddox, visited the refugee camps that are run by the Khalsa Diwan Society for social services and the high commissioner for refugees Saturday afternoon after she arrived in Delhi. Jolie has been in India since October 5th shooting the movie "A Mighty Heart," which is based on the life of Wall Street Journal correspondent Daniel Pearl, who was kidnapped and killed in Pakistan. Jolie plays the role of Pearl's wife, Mariane.

    • Comparable Document: Actress Angelina Jolie hopped onto a crowded Mumbai commuter train Monday to film a scene for a movie about slain journalist Daniel Pearl, who lived and worked in India's financial and entertainment capital. Hollywood actor Dan Futterman portrays Pearl and Jolie plays his wife Mariane in the "A Mighty Heart" co-produced by Plan B, a production company founded by Brad Pitt and his ex-wife, actress Jennifer Aniston. Jolie and Pitt, accompanied by their three children -- Maddox, 5, 18-month-old Zahara and 5-month-old Shiloh Nouvel -- arrived in Mumbai on Saturday from the western Indian city Pune where they were shooting the movie for nearly a month. ...

    3


  • Previous Work: Exploiting Monolingual Data

    • New word-to-word translations from comparable (not parallel) data [Fung and Yee, 1998; Rapp, 1999]
    • Find parallel text by mining monolingual data in multiple languages [Resnik and Smith, 2003; Munteanu and Marcu, 2005]
    • Re-weight portions of language model data using CLIR techniques [Kim and Khudanpur, 2003; Zhao et al., 2004; Kim, 2005]
    • Similar techniques have been used for weighting bi-text

    4

  • Our Approach

    • For each document to be translated:
      • Find comparable documents in target data
      • Adapt language model using comparable documents (following Kim, 2005)
      • Add new phrasal translation rules based on source text and comparable documents
    • Use existing language model training for monolingual data

    5

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    6

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    7

  • Cross-Lingual Information Retrieval

    • Find English documents comparable to foreign documents using CLIR
    • Established method of locating relevant documents across languages
    • For each foreign document, query a database and return a ranked and scored list of relevant English documents
    • Adaptation method is independent of the CLIR method

    8

  • [Diagram: CLIR takes a foreign query, scores it against monolingual English documents (also used for LM training), and returns a ranked list of English documents; the top-ranked document is comparable, while lower-ranked documents are only somewhat comparable.]

    Foreign Source:
    ... 火星 探测 车 任务 的 主管 塞 辛格 说 : " 我们 发现 探测 车 有 很 严重 异常 现象 . " ...
    (... The head of the Mars rover mission, Theisinger, said: "We have found that the rover has a very serious anomaly." ...)

    Comparable:
    ... NASA scientists said on Thursday they had lost contact with the Mars Spirit rover for more than 24 hours, describing the problem as "a very serious anomaly." ...

    Somewhat Comparable:
    ... An instrument aboard one of the two NASA rovers en route to Mars has malfunctioned, prompting worries it could harm the robot's information-gathering ability, a scientist said. ...

    9

  • CLIR

    • Use multiple translations for each foreign word
    • Assign scores to documents and return a ranked list

    Pr(Doc is rel | Q) = Pr(Doc is rel) · Pr(Q | Doc is rel) / Pr(Q)

    Pr(Q | Doc) = ∏_{f ∈ Q} [ a · Pr(f | F) + (1 − a) · ∑_{e ∈ E} Pr(e | Doc) · Pr(f | e) ]

    [Xu et al., 2001]

    10
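The scoring above can be sketched in code. This is a minimal toy rendering of the Xu et al. (2001) query likelihood, not the system used in the talk; the translation table, background foreign model, and the smoothing weight `a = 0.3` are all illustrative assumptions.

```python
import math

def clir_score(query_words, doc_words, trans_prob, general_fprob, a=0.3):
    """Log Pr(Q|Doc): product over foreign query words f of
    a*Pr(f|F) + (1-a) * sum_e Pr(e|Doc) * Pr(f|e).

    trans_prob:    {(f, e): Pr(f|e)} word-translation probabilities (toy)
    general_fprob: {f: Pr(f|F)} background foreign unigram model (toy)
    a:             smoothing weight on the background model (assumed value)
    """
    n = len(doc_words)
    log_p = 0.0
    for f in query_words:
        # sum_e Pr(e|Doc) * Pr(f|e), with Pr(e|Doc) as document term frequency
        trans = sum((doc_words.count(e) / n) * trans_prob.get((f, e), 0.0)
                    for e in set(doc_words))
        # small floor keeps the log finite for unseen query words
        log_p += math.log(a * general_fprob.get(f, 1e-9) + (1 - a) * trans)
    return log_p
```

Ranking the English collection by this score per foreign document yields the ranked, scored list the talk describes.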

  • Long Comparable Documents

    • CLIR favors longer English documents
    • Long documents tend to be about many topics
      • Bad for improving MT
    • Solution: break documents into short (overlapping) passages of ~300 words each
    • Set of top N passages is the bias-text

    11
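The passage splitting can be sketched as follows; the 150-word stride is a hypothetical choice, since the talk only specifies ~300-word overlapping passages.

```python
def split_passages(words, size=300, stride=150):
    """Break a document (list of words) into overlapping passages of at
    most `size` words; `stride` controls how much consecutive passages
    overlap (stride of size/2 gives 50% overlap -- an assumed setting).
    """
    if len(words) <= size:
        return [words]
    passages = []
    # step by `stride` so each passage shares (size - stride) words
    # with the previous one; the final passage may be shorter
    for start in range(0, len(words) - stride, stride):
        passages.append(words[start:start + size])
    return passages
```

Each passage is then indexed and scored by CLIR in place of the full document, and the top N passages form the bias-text.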

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    12

  • Language Model Biasing

    • New LM generated from comparable passages
    • Interpolated with the original generic LM using a very low weight (0.01)
    • New LM is small and specific (~3,000 words)

    Pr(e) = (1 − λ) · Pr_g(e) + λ · Pr_b(e)

    13
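A unigram sketch of the interpolation above; the real system interpolates n-gram LMs, so this only illustrates the formula Pr(e) = (1 − λ) · Pr_g(e) + λ · Pr_b(e) with λ = 0.01.

```python
from collections import Counter

def biased_lm(generic_counts, bias_text, lam=0.01):
    """Interpolate a small bias LM (built from the comparable passages)
    with the generic LM:  Pr(e) = (1 - lam) * Pr_g(e) + lam * Pr_b(e).
    Unigram maximum-likelihood models stand in for the n-gram LMs."""
    g_total = sum(generic_counts.values())
    b_counts = Counter(bias_text)
    b_total = sum(b_counts.values())

    def prob(word):
        p_g = generic_counts.get(word, 0) / g_total
        p_b = b_counts.get(word, 0) / b_total
        return (1 - lam) * p_g + lam * p_b

    return prob
```

Even with the tiny weight, words that are frequent in the bias-text get a meaningful boost, because the bias LM is so small and specific.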

  • Example Improvement

    Reference:
    the pope said, "each life should and needs to be guarded and developed."

    Original Translation:
    "each and every individual should need of safeguarding and developing," the pope said.

    Portion of Comparable Text:
    "Every human life, as it is, deserves and demands to always be defended and promoted," the pope said on the day the Catholic Church celebrates annually as a "Day of Life".

    Biased Translation:
    "every life should and must be defended and promoted," the pope said.

    14

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    15

  • Implementation of Translation Model Adaptation

    • Naïve assumption: same as IBM Model 1
      • Every phrase (1-3 words) in the source can translate to every phrase (1-3 words) in the comparable text
      • Each rule has uniform probability
    • Use only those words and phrases from the bias-text that occur ≥ k times (k = 2 in these experiments)

    16
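The rule generation above can be sketched as follows; function and variable names are illustrative, not from the actual system.

```python
from collections import Counter
from itertools import product

def ngrams(words, max_n=3):
    """All 1- to max_n-word phrases of a word sequence, as tuples."""
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def bias_rules(source_words, bias_passages, k=2):
    """Model-1-style bias rules: pair every source phrase (1-3 words)
    with every bias-text phrase (1-3 words) seen at least k times,
    assigning every rule the same uniform probability."""
    target_counts = Counter()
    for passage in bias_passages:
        target_counts.update(ngrams(passage))
    targets = [t for t, c in target_counts.items() if c >= k]
    sources = set(ngrams(source_words))
    uniform = 1.0 / len(targets) if targets else 0.0
    return {(s, t): uniform for s, t in product(sources, targets)}
```

The count threshold k keeps the rule set from exploding with one-off target phrases; the decoder's tuned bias-rule weight, not these uniform scores, decides how much the rules influence translation.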

  • Implementation of TM Adaptation

    • New translation rules differ only in lexical probability
    • Added to the decoder as phrasal rules, but marked as special 'bias rules'
    • Different weights for 'bias rules' and 'generic rules'
    • Bias rule weight optimized alongside all other weights

    17

  • Outline

    • Comparable Data Selection using CLIR
    • Model Adaptation
      • Language Model Adaptation
      • Translation Model Adaptation
    • Experimental Results
    • Discussion

    18

  • Base MT System

    • State-of-the-art hierarchical MT system (BBN's HierDec; Shen et al., 2008)
    • Arabic-to-English
    • LM: decode with a 3-gram and rescore with a 5-gram
    • Optimized with BLEU
    • Performance measured using BLEU and TER
    • 10 most comparable passages used to create the bias-text

    19

  • Data Sets

    • Monolingual comparable data (all in LM training):
      • English Gigaword: 2.8 billion words
      • FBIS corpus: 28.5 million words
      • News archive data from the Web: 828 million words
    • Tuned on portions of MT04, MT05, and GALE07 newswire (48,921 words)
    • Test on MT06 (55,578 words)
      • 4 reference translations

    20

  • Less Commonly Taught Languages Scenario

    • TM adaptation should prove most beneficial when less bi-text is available
    • Simulate LCTL with Arabic by reducing the bi-text training set
      • 5 million words of newswire training
      • Full monolingual data used

    21

  • LCTL Scenario: Measuring the Upper Bound of TM Adaptation

    • Best possible comparable data would be parallel data
      • Use reference translations to simulate
    • Ideally, TM adaptation could determine which source words should align to which target words
      • Align source to references with GIZA++
      • Discard rule probabilities
    • We can also limit bias rules to target sides that occur in the top 100 passages from the comparable data

    22
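This aligned-reference upper bound can be sketched as below, assuming word-level alignment links are already available (e.g., from GIZA++); this toy version keeps only aligned word pairs, a simplification of the phrasal rules used in the experiment.

```python
def rules_from_alignment(src_words, ref_words, links):
    """Build bias rules from word-alignment links, given as (i, j)
    pairs of source index and reference index (as an aligner such as
    GIZA++ would produce). Rule probabilities are discarded: every
    extracted rule gets the same uniform score."""
    pairs = {(src_words[i], ref_words[j]) for i, j in links}
    uniform = 1.0 / len(pairs) if pairs else 0.0
    return {p: uniform for p in pairs}
```

The point of discarding the probabilities is to isolate what alignment knowledge alone contributes, so the comparison with the unaligned and comparable-data conditions is fair.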

  • LCTL Scenario: TM Adaptation from Aligned References

    [Bar chart comparing No Adaptation, Aligned Reference, and Overlap w/ Comparable on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    23

  • LCTL Scenario: Measuring a Weaker Upper Bound of TM Adaptation

    • Fair TM adaptation doesn't align source and bias-text
      • Use only the reference text, without alignments
    • Every phrase in the reference document aligns to every phrase in the source document
      • Uniform probabilities of rules (as before)
    • We also limit bias rules to target sides that occur in the top 100 passages from the comparable data
    • Upper bound for our method if we found 'perfect' comparable documents

    24

  • LCTL Scenario: TM Adaptation from Unaligned References

    [Bar chart comparing No Adaptation, Unaligned Reference, and Overlap w/ Comparable on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    25

  • Fair LCTL Adaptation

    [Bar chart comparing No Adaptation, LM Adapt, TM Adapt, and LM&TM Adapt on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU (axis 30.0-60.0). No Adaptation baseline: 34.68 Tune TER, 40.80 MT06 TER, 55.16 Tune BLEU, 49.84 MT06 BLEU.]

    26

  • Full Training Scenario

    • LCTL: only a small gain (0.68 BLEU) on the MT06 test set
    • Can we benefit if we use all our data?
    • Full training: 230M words of bi-text (18.5M segments)
      • Includes parallel data extracted from comparable data by ISI (LDC2007T08)

    27

  • Full Training Adaptation

    [Bar chart comparing No Adaptation, LM Adapt, TM Adapt, and LM&TM Adapt on Tune TER, MT06 TER, Tune BLEU, and MT06 BLEU in the full-training condition (axis 35-55).]

    28

  • LCTL vs Full Training Gains

    • LCTL Training: gain of 0.68 BLEU & 0.07 TER
    • Full Training: gain of 2.07 BLEU & 0.58 TER
    • Counter-intuitive: larger gains with full training
      • Better lexical probability estimates with full training
      • Better generic models to aid adaptation

                      TER Gain   BLEU Gain
      LCTL Training     0.07       0.68
      Full Training     0.58       2.07

    29

  • Discussion

    • Exploit monolingual data to adapt both the LM and the TM
    • Using no new information:
      • Comparable data is already part of the LM training data
      • CLIR uses the generic TM
    • TM adaptation uses a very simple method to generate bias rules
    • Gives substantial gains on the newswire test set
    • What is the effect of the level of comparability?
    • Can this be applied to less structured data, such as web or audio data?

    30

  • Questions

    31