Identifying the Translations of Idiomatic Expressions using TransSearch
Stéphane HUET and Philippe LANGLAIS
NLPCS 2011 - Copenhagen
August 20, 2011

Idiomatic Expressions

• Oxford Companion to the English Language
  – Idioms are expressions of a given language, whose sense is not predictable from the meanings and arrangement of their elements
  – To fight like cat and dog
  – It rains cats and dogs

The problem of idiomatic expressions

• Numerous in most languages
• Have idiosyncratic meanings that disturb
  – Non-native speakers
  – NLP systems
• In Machine Translation (MT)
  – Group multi-word expressions before the alignment process [Lambert and Banchs 05]
  – Add a new feature encoding the fact that a phrase is a multi-word expression [Carpuat and Diab 10]

Idiomatic expressions and MT

Objectives of the study

• The ability of the bilingual concordancer TSRali to retrieve the translations of idiomatic expressions

• Practical issues in querying such a system

Outline

• Introduction
• TransSearch
• Experimental setup
• Evaluations
• Conclusion

TransSearch

• Available on the Web since 1996
• Developed by the Université de Montréal
• Subscribed to by many professional translators in Canada
• 7.2 M queries over 6 years
• Exploits an English-French translation memory
• Incorporates word alignment technology

User interface

1. Retrieve sentence pairs

2. Spot translations

3. Identify the list of translations

Alignment and translation

• Word-based alignment (IBM)
• Translation spotting
  – Constrained to contiguous word alignment

Example sentence pair:

This is in keeping with that strategy .
La présente mesure est conforme à cette stratégie .
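Translation spotting under a contiguity constraint can be illustrated with a minimal sketch. It assumes word alignment links are already available as (source index, target index) pairs. The links below are written by hand for illustration and are not the output of an IBM model. `spot_translation` is a hypothetical helper, and taking the smallest covering span is only one simple way to enforce contiguity; the slides do not say how TransSearch does it.

```python
def spot_translation(src_tokens, tgt_tokens, links, query):
    """Return the contiguous target span aligned to `query` (a token list), or None."""
    n = len(query)
    # Locate the first occurrence of the query in the source sentence.
    start = next(
        (i for i in range(len(src_tokens) - n + 1) if src_tokens[i:i + n] == query),
        None,
    )
    if start is None:
        return None
    # Collect the target positions linked to any word of the query.
    tgt_positions = [t for s, t in links if start <= s < start + n]
    if not tgt_positions:
        return None
    # Contiguity constraint: keep the smallest span covering all linked positions.
    lo, hi = min(tgt_positions), max(tgt_positions)
    return " ".join(tgt_tokens[lo:hi + 1])


src = "This is in keeping with that strategy .".split()
tgt = "La présente mesure est conforme à cette stratégie .".split()
# Hand-written illustrative links (source index, target index); an assumption.
links = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 4), (3, 4), (4, 5),
         (5, 6), (6, 7), (7, 8)]

spotted = spot_translation(src, tgt, links, ["in", "keeping", "with"])
missing = spot_translation(src, tgt, links, ["out", "of", "line"])
print(spotted)
```

With these hand-written links, the query "in keeping with" is linked to target positions 4 and 5, so the spotted span is "conforme à".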

Post-processing steps

• Objective: to have relevant and informative translations at the top of the list
• Filtering of bad translations
  – Supervised classifier
  – Features: alignment probabilities, POS tags
• Merging of similar translations
  – Inflectional forms of the same canonical words: conforme à / conforme aux
  – Differences in grammatical words or punctuation: à l'encontre de / à l'encontre
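The merging step can be sketched as grouping translations under a normalized key. This is only an illustration: the lemma table, the list of grammatical words and the counts are toy assumptions, and `merge_key` and `merge_translations` are hypothetical names, not part of TransSearch.

```python
import string
from collections import defaultdict

# Toy resources (assumptions for illustration only).
LEMMAS = {"aux": "à", "au": "à"}
FUNCTION_WORDS = {"de", "du", "des", "le", "la", "les"}


def merge_key(translation):
    """Normalize a translation so that near-identical variants share one key."""
    # Strip punctuation attached to tokens and drop tokens that become empty.
    tokens = [t.strip(string.punctuation) for t in translation.lower().split()]
    tokens = [LEMMAS.get(t, t) for t in tokens if t]
    # Drop grammatical words at both edges of the translation.
    while tokens and tokens[-1] in FUNCTION_WORDS:
        tokens.pop()
    while tokens and tokens[0] in FUNCTION_WORDS:
        tokens.pop(0)
    return " ".join(tokens)


def merge_translations(counts):
    """Group translations by key, summing counts and keeping the variants."""
    merged = defaultdict(lambda: {"count": 0, "variants": []})
    for translation, count in counts.items():
        group = merged[merge_key(translation)]
        group["count"] += count
        group["variants"].append(translation)
    return dict(merged)


# Made-up frequencies, used only to exercise the functions.
counts = {"conforme à": 12, "conforme aux": 5,
          "à l'encontre de": 7, "à l'encontre": 3}
merged = merge_translations(counts)
for key, group in merged.items():
    print(key, group["count"], group["variants"])
```

The four variants collapse into two groups, one per pair shown on the slide.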

Type of queries

• Verbatim queries: “normal” queries
  – is still in its infancy
• Ellipses: for discontinuous expressions
  – is .. in its infancy
• Dictionary queries: for morphological expansions
  – be+ still in its+ infancy
• Bilingual queries: to check translations
  – En: is still in its infancy
    Fr: en est encore à ses premiers balbutiements
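To make the `..` and `+` operators concrete, the sketch below compiles a query into a regular expression. It rests on several assumptions not stated in the slides: `..` is taken to skip at most `MAX_GAP` words, `+` is expanded from a toy inflection table, and matching is case-insensitive over whitespace-separated words. `compile_query` is a hypothetical helper, not the TransSearch query engine.

```python
import re

# Toy inflection table (assumption); a real system would use a morphological lexicon.
INFLECTIONS = {
    "be": ["be", "am", "is", "are", "was", "were", "been", "being"],
}
MAX_GAP = 5  # assumed maximum number of words an ellipsis may skip


def compile_query(query):
    """Turn a query with '..' and '+' operators into a compiled regex."""
    parts = []
    gap = False
    for tok in query.split():
        if tok == "..":
            gap = True  # the next separator may absorb intervening words
            continue
        if tok.endswith("+") and len(tok) > 1:
            # Morphological expansion: match any known form of the base word.
            forms = INFLECTIONS.get(tok[:-1], [tok[:-1]])
            word = "(?:" + "|".join(re.escape(f) for f in forms) + ")"
        else:
            word = re.escape(tok)
        if parts:
            sep = r"(?:\s+\S+){0,%d}\s+" % MAX_GAP if gap else r"\s+"
            parts.append(sep)
        parts.append(word)
        gap = False
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


verbatim = compile_query("is still in its infancy")
ellipsis = compile_query("is .. in its infancy")
expanded = compile_query("be+ .. in its infancy")

sentence = "The technology was really still in its infancy"
print(bool(verbatim.search(sentence)), bool(expanded.search(sentence)))
```

The verbatim query misses the sentence because the verb is inflected differently and an adverb intervenes; the query combining `+` and `..` finds it.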

Resources

• Translation memory
  – Canadian Hansards (1986-2007)
  – 8.3 M sentence pairs
• Idiom lexicon
  – French-English phrase book
  – 1,467 expressions
  – Some entries with 2 or 3 translations

Type of idiomatic expressions

• 2% are expressed in informal language
  – She's well-upholstered.
  – Il roule des mécaniques.
• 99% are used in the context of a sentence
  – It's fantastic to bop till you drop.
• 80% are verbal phrases used in their inflected forms
  – I slept like a log.
• 20% are fixed expressions
  – Where there's a will, there's a way.

Manual preprocessing

• Annotation of words judged as extra information
  – They put the new salesman through his paces.
• Types of extra-information words
  – Modal verbs: can, must
  – Semi-modal verbs: am going to, are likely to
  – Catenative verbs: want to, keep
  – Adverbial phrases: in Italy, when he heard the news
  – Noun phrases: this poet, his latest book

Number of queries found in the TM

Query type                                Bilingual    EN    FR
Verbatim queries                                 36   136   248
+ manual removal of extra words                  91   302   410
+ removal of extra pronoun                      106   381   509
+ verb lemmatization                            210   624   650
+ pronoun and determiner lemmatization          238   700   705

Example query at each step

• Verbatim queries
  – EN: I have no axe to grind
  – FR: Je ne prêche pas pour ma paroisse
• + manual removal of extra words
  – EN: I have .. axe to grind
  – FR: Je .. prêche .. pour ma paroisse
• + removal of extra pronoun
  – EN: have .. axe to grind
  – FR: prêche .. pour ma paroisse
• + verb lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa paroisse
• + pronoun and determiner lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa+ paroisse
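The counts can be turned into coverage rates with a short calculation. The sketch below assumes one query per lexicon entry and per language, so that the 1,467 expressions of the phrase book serve as the denominator; since some entries have 2 or 3 translations, the true denominators may differ slightly.

```python
LEXICON_SIZE = 1467  # expressions in the phrase book (Resources slide)

# (Bilingual, EN, FR) counts of queries found in the TM, copied from the slides.
FOUND = {
    "verbatim queries": (36, 136, 248),
    "+ manual removal of extra words": (91, 302, 410),
    "+ removal of extra pronoun": (106, 381, 509),
    "+ verb lemmatization": (210, 624, 650),
    "+ pronoun and determiner lemmatization": (238, 700, 705),
}


def coverage(found, total=LEXICON_SIZE):
    """Percentage of lexicon entries found, rounded to one decimal."""
    return round(100.0 * found / total, 1)


for step, (bilingual, en, fr) in FOUND.items():
    print(f"{step:40s} {coverage(bilingual):5.1f} {coverage(en):5.1f} {coverage(fr):5.1f}")
```

Under this assumption, the final step covers a little under half of the lexicon on each monolingual side (47.7% for English and 48.1% for French) and 16.2% for bilingual queries.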

Evaluation using the phrase book

• 700 English queries found in the TM
  – 36 sentence pairs per query
  – 13 suggested translations
• 705 French queries found in the TM
  – 32 sentence pairs per query
  – 15 suggested translations
• Evaluation restricted to the 238 entries whose English and French sides occur in the same sentence pair

Recall measured using the phrase book

• For many queries, TransSearch displays relevant translations absent from the reference
  – est nébuleux displayed after the reference être dans un état second for to be in a daze
  – 34 correct translations displayed for to be around the corner

Rank               1      3      5      all
English queries    41.6   59.2   65.1   74.8
French queries     41.6   54.6   62.6   76.5
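Recall at a given rank can be sketched as the share of queries for which a reference translation appears among the top k suggestions. The exact-match criterion and the toy rankings below are assumptions made for illustration; the slides do not state how a suggestion was matched against the phrase book, and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(suggestions, references, k=None):
    """Percentage of queries with a reference translation in the top k suggestions.

    suggestions: dict mapping a query to its ranked list of translations.
    references:  dict mapping a query to the set of reference translations.
    k=None means the whole list is considered (the "all" column).
    """
    hits = 0
    for query, refs in references.items():
        ranked = suggestions.get(query, [])
        top = ranked if k is None else ranked[:k]
        if any(translation in refs for translation in top):
            hits += 1
    return 100.0 * hits / len(references)


# Invented rankings, used only to exercise the function.
suggestions = {
    "to be in a daze": ["être dans un état second", "est nébuleux"],
    "to sleep like a log": ["dormir profondément", "dormir comme une souche"],
}
references = {
    "to be in a daze": {"être dans un état second"},
    "to sleep like a log": {"dormir comme une souche"},
}

for k in (1, 3, 5, None):
    print(k if k is not None else "all", recall_at_k(suggestions, references, k))
```

Because only phrase-book translations count as hits, a measure of this kind underestimates quality whenever the system displays correct translations that the reference lacks.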

Manual evaluation

• 100 French queries
• 5 annotators who judged 50 queries each
• 3 labels: “correct”, “wrong”, “partial”
• Low inter-annotator agreement (Fleiss' kappa: 0.25)

Q: manger à tous les râteliers

Translation                                                                     J1        J2        J3
slurps at everyone's trough                                                     correct   correct   correct
double-dipper                                                                   partial   correct   partial
them pot lickers and accusing them of being at the trough and pork barelling    wrong     partial   wrong
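For readers unfamiliar with the agreement measure, here is a minimal sketch of Fleiss' kappa for items rated by the same number of judges. It is applied to the three example rows of this slide only, so its value is not expected to match the 0.25 reported for the whole study; `fleiss_kappa` is written from the standard definition and is not the authors' script.

```python
from collections import Counter

LABELS = ("correct", "partial", "wrong")


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of judges."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: proportion of agreeing judge pairs, averaged over items.
    p_items = [
        (sum(c[label] ** 2 for label in LABELS) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each label.
    p_labels = [sum(c[label] for c in counts) / (n_items * n_raters) for label in LABELS]
    p_e = sum(p ** 2 for p in p_labels)
    return (p_bar - p_e) / (1 - p_e)


# Judgments J1, J2, J3 for the three translations of the example query.
example = [
    ["correct", "correct", "correct"],
    ["partial", "correct", "partial"],
    ["wrong", "partial", "wrong"],
]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

On these three rows the observed agreement is 5/9 and the chance agreement 29/81, which gives a kappa of 16/52, roughly 0.31.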

• Average rank of the 1st translation labeled as correct by 1 annotator: 1.4

• For 97/100 queries, a correct translation is displayed

Distribution of labels (pie chart): correct 42%, partial 22%, wrong 36%

Conclusion

• About 50% of the idioms of the phrase book were found in the TM of TransSearch

• Users should use morphological (+) and proximity (..) operators for idioms

• Only 36% of the displayed translations were clearly wrong

Thank you for your attention