August 20, 2011

Stéphane HUET and Philippe LANGLAIS

NLPCS 2011 - Copenhagen

Identifying the Translations of Idiomatic Expressions using TransSearch

Identifying Translations of Idiomatic Expressions 2 S. HUET

Idiomatic Expressions

• Oxford Companion to the English Language
– Idioms are expressions of a given language, whose sense is not predictable from the meanings and arrangement of their elements
– To fight like cat and dog
– It rains cats and dogs


The problem of idiomatic expressions

• Numerous in most languages
• Have idiosyncratic meanings that disturb
– Non-native speakers
– NLP
• In Machine Translation (MT)
– Group multi-word expressions before the alignment process [Lambert and Banchs 05]
– Add a new feature encoding the fact that a phrase is a multi-word expression [Carpuat and Diab 10]


Idiomatic expressions and MT


Objectives of the study

• The ability of the bilingual concordancer TSRali to retrieve the translations of idiomatic expressions

• Practical issues in querying such a system


Outline

• Introduction
• TransSearch
• Experimental setup
• Evaluations
• Conclusion


TransSearch

• Available on the Web since 1996
• Developed by the Université de Montréal
• Subscribed to by many professional translators in Canada
• 7.2 M queries over 6 years
• Exploits an English-French translation memory
• Incorporates word alignment technology


User interface

1. Retrieve sentence pairs
2. Spot translations
3. Identify the list of translations

Alignment and translation

• Word-based alignment (IBM)
• Translation spotting
– Constrained to contiguous word alignment

Example sentence pair from the alignment figure:
This is in keeping with that strategy .
La présente mesure est conforme à cette stratégie .
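The contiguity constraint can be illustrated with a small sketch. It assumes that word alignment links are already available as (source index, target index) pairs; the links below are written by hand for illustration and are not the output of an IBM model. The function name `spot_translation` is hypothetical, and this simplification is not claimed to be the algorithm used in TransSearch.

```python
def spot_translation(query_span, links, target_tokens):
    """Return the smallest contiguous target span that covers every target
    word linked to a source word inside query_span (inclusive indices)."""
    start, end = query_span
    linked = [t for s, t in links if start <= s <= end]
    if not linked:
        return None  # no aligned target word, so nothing can be spotted
    lo, hi = min(linked), max(linked)
    # Taking everything between lo and hi enforces contiguity.
    return " ".join(target_tokens[lo:hi + 1])


source = "This is in keeping with that strategy .".split()
target = "La présente mesure est conforme à cette stratégie .".split()

# Hand-written illustrative links (source index, target index).
links = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (6, 7), (7, 8)]

# The query "in keeping with" covers source tokens 2 to 4.
print(spot_translation((2, 4), links, target))
```

With these assumed links, the query "in keeping with" is mapped to the contiguous French span "conforme à".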


Post-processing steps

• Objective: to have relevant and informative translations at the top of the list
• Filtering of bad translations
– Supervised classifier
– Features: alignment probabilities, POS tags
• Merging of similar translations
– Inflectional forms of the same canonical words: conforme à / conforme aux
– Difference by grammatical words or punctuation: à l'encontre de / à l'encontre
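The merging step can be sketched as grouping translations under a normalized key. The lemma table and the list of grammatical words below are toy resources assumed for illustration; a real system would rely on a full lexicon, and the names `merge_key` and `merge_translations` are hypothetical.

```python
import string
from collections import defaultdict

# Toy resources, assumed for illustration only.
LEMMAS = {"aux": "à", "au": "à", "conformes": "conforme"}
GRAMMATICAL_WORDS = {"de", "du", "des", "le", "la", "les"}


def merge_key(translation):
    """Normalize a translation: lowercase, strip surrounding punctuation,
    map inflected forms to a canonical form, drop grammatical words."""
    tokens = [t.strip(string.punctuation) for t in translation.lower().split()]
    tokens = [LEMMAS.get(t, t) for t in tokens if t]
    tokens = [t for t in tokens if t not in GRAMMATICAL_WORDS]
    return " ".join(tokens)


def merge_translations(translations):
    """Group translations that share the same normalized key."""
    groups = defaultdict(list)
    for translation in translations:
        groups[merge_key(translation)].append(translation)
    return list(groups.values())


candidates = ["conforme à", "conforme aux", "à l'encontre de", "à l'encontre"]
print(merge_translations(candidates))
```

Under these assumptions the four candidates collapse into two groups, one per pair shown on the slide.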


Type of queries

• Verbatim queries: “normal” queries
– is still in its infancy
• Ellipses: for discontinuous expressions
– is .. in its infancy
• Dictionary queries: for morphological expansions
– be+ still in its+ infancy
• Bilingual queries: to check translations
– En: is still in its infancy
– Fr: en est encore à ses premiers balbutiements
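One way to picture the `..` and `+` operators is to compile a query into a regular expression. The sketch below makes two assumptions that do not come from the slides: `..` is taken to allow up to `max_gap` intervening words (the default of 5 is arbitrary), and `+` expands a word through a small hand-written table, `INFLECTIONS`. The function `query_to_regex` is hypothetical and is not TransSearch's implementation.

```python
import re

# Hypothetical expansion table; a real system would use a full lexicon.
INFLECTIONS = {
    "be": ["be", "is", "are", "was", "were", "been", "being", "am"],
    "its": ["its", "his", "her", "their", "my", "your", "our"],
}


def query_to_regex(query, max_gap=5):
    """Compile a query with '..' (gap) and 'word+' (expansion) operators."""
    parts = []
    gap_pending = False
    for token in query.split():
        if token == "..":
            gap_pending = True
            continue
        if token.endswith("+"):
            base = token[:-1]
            forms = INFLECTIONS.get(base, [base])
            piece = "(?:" + "|".join(re.escape(f) for f in forms) + ")"
        else:
            piece = re.escape(token)
        if parts:
            # A pending '..' allows up to max_gap extra words before this token.
            if gap_pending:
                parts.append(r"(?:\s+\S+){0,%d}\s+" % max_gap)
            else:
                parts.append(r"\s+")
        parts.append(piece)
        gap_pending = False
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


print(bool(query_to_regex("is .. in its infancy").search(
    "This industry is still very much in its infancy.")))
print(bool(query_to_regex("be+ still in its+ infancy").search(
    "They were still in their infancy.")))
```

Both searches succeed with this toy table, whereas the verbatim query "is still in its infancy" would miss both sentences.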


Resources

• Translation memory
– Canadian Hansards (1986-2007)
– 8.3 M sentence pairs
• Idiom lexicon
– French-English phrase book
– 1,467 expressions
– Some entries with 2 or 3 translations


Type of idiomatic expressions

• 2% are expressed in informal language
– She's well-upholstered.
– Il roule des mécaniques.
• 99% are used in the context of a sentence
– It's fantastic to bop till you drop.
• 80% are verbal phrases used in their inflected forms
– I slept like a log.
• 20% are fixed expressions
– When there's a will, there's a way


Manual preprocessing

• Annotation of words judged as extra information
– They put the new salesman through his paces.
• Types of extra information words
– Modal verbs: can, must
– Semi-modal verbs: am going to, are likely to
– Catenative verbs: want to, keep
– Adverbial phrases: in Italy, when he heard the news
– Noun phrases: this poet, his latest book


Number of queries found in the TM

Query type                                Bilingual    EN    FR
Verbatim queries                                 36   136   248
+ manual removal of extra words                  91   302   410
+ removal of extra pronoun                      106   381   509
+ verb lemmatization                            210   624   650
+ pronoun and determiner lemmatization          238   700   705

Example query at each step:

• Verbatim queries
– EN: I have no axe to grind
– FR: Je ne prêche pas pour ma paroisse
• + manual removal of extra words
– EN: I have .. axe to grind
– FR: Je .. prêche .. pour ma paroisse
• + removal of extra pronoun
– EN: have .. axe to grind
– FR: prêche .. pour ma paroisse
• + verb lemmatization
– EN: have+ .. axe to grind
– FR: prêcher+ .. pour sa paroisse
• + pronoun and determiner lemmatization
– EN: have+ .. axe to grind
– FR: prêcher+ .. pour sa+ paroisse
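The effect of each relaxation step is easier to compare when the counts are expressed as a share of the 1,467 phrase-book expressions reported on the Resources slide. The short computation below only adds arithmetic to the figures given above; the helper name `coverage` is hypothetical.

```python
TOTAL_ENTRIES = 1467  # expressions in the phrase book

# (step, bilingual, EN, FR) counts copied from the slides.
STEPS = [
    ("Verbatim queries", 36, 136, 248),
    ("+ manual removal of extra words", 91, 302, 410),
    ("+ removal of extra pronoun", 106, 381, 509),
    ("+ verb lemmatization", 210, 624, 650),
    ("+ pronoun and determiner lemmatization", 238, 700, 705),
]


def coverage(count, total=TOTAL_ENTRIES):
    """Percentage of phrase-book entries found, rounded to one decimal."""
    return round(100.0 * count / total, 1)


for name, bilingual, en, fr in STEPS:
    print(f"{name:40s} {coverage(bilingual):5.1f} {coverage(en):5.1f} {coverage(fr):5.1f}")
```

After the last step, coverage reaches about 48% for each language taken separately and about 16% for entries found on both sides.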


Evaluation using the phrase book

• 700 English queries found in the TM
– 36 sentence pairs per query
– 13 suggested translations
• 705 French queries found in the TM
– 32 sentence pairs per query
– 15 suggested translations
• Evaluation restricted to the 238 entries whose English and French sides occur in the same sentence pair


Recall measured using the phrase book

• For many queries, TransSearch displays relevant translations absent from the reference
– “est nébuleux” displayed after the reference “être dans un état second” for “to be in a daze”
– 34 correct translations displayed for “to be around the corner”

Recall (%) by rank      1      3      5    all
English queries      41.6   59.2   65.1   74.8
French queries       41.6   54.6   62.6   76.5
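A recall-at-rank measure of this kind can be sketched as follows. The sketch assumes that a query counts as a hit when one of its reference translations appears, by exact string match, among the first k suggestions; the matching criterion actually used in the study may differ. The data are invented placeholders, not TransSearch output, and `recall_at_k` is a hypothetical name.

```python
def recall_at_k(ranked_lists, references, k=None):
    """Percentage of queries whose reference translation appears in the
    top k suggestions (k=None means the whole list is considered)."""
    hits = 0
    for query, suggestions in ranked_lists.items():
        top = suggestions if k is None else suggestions[:k]
        if any(ref in top for ref in references[query]):
            hits += 1
    return 100.0 * hits / len(ranked_lists)


# Invented toy data: four queries with ranked suggestions.
ranked_lists = {
    "q1": ["ref1", "x", "y"],            # reference at rank 1
    "q2": ["x", "ref2", "y"],            # reference at rank 2
    "q3": ["x", "y", "z", "ref3"],       # reference at rank 4
    "q4": ["x", "y", "z"],               # reference never displayed
}
references = {"q1": ["ref1"], "q2": ["ref2"], "q3": ["ref3"], "q4": ["ref4"]}

for k in (1, 3, 5, None):
    print(k, recall_at_k(ranked_lists, references, k))
```

As in the table above, recall can only grow as the rank cutoff is relaxed, and the "all" column is bounded by the queries whose reference is never displayed.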


Manual evaluation

• 100 French queries
• 5 annotators who judged 50 queries each
• 3 labels: “correct”, “wrong”, “partial”
• Low inter-annotator agreement (Fleiss' kappa: 0.25)

Q: manger à tous les rateliers                                                  J1       J2       J3
slurps at everyone's trough                                                     correct  correct  correct
double-dipper                                                                   partial  correct  partial
them pot lickers and accusing them of being at the trough and pork barelling   wrong    partial  wrong
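For readers unfamiliar with the agreement measure, the standard Fleiss' kappa formula can be sketched in a few lines. The example applies it only to the three judged translations shown above, so it is not expected to reproduce the 0.25 reported for the full annotation set; the function name `fleiss_kappa` is hypothetical, and the sketch assumes every item is judged by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """ratings: list of items, each a list of labels, with the same number
    of raters per item. Returns Fleiss' kappa."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the pairwise agreement rate.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters)
              for cat in categories]
    p_e = sum(p ** 2 for p in p_cats)

    # Undefined (division by zero) if all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


example = [
    ["correct", "correct", "correct"],
    ["partial", "correct", "partial"],
    ["wrong", "partial", "wrong"],
]
print(round(fleiss_kappa(example), 3))
```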


Manual evaluation

• Average rank of the 1st translation labeled as correct by 1 annotator: 1.4

• For 97/100 queries, a correct translation is displayed

Labels assigned to the displayed translations (pie chart):
correct   42%
partial   22%
wrong     36%


Conclusion

• 50% of the idioms of a phrase book found in the TM of TransSearch

• Users should use morphological (+) and proximity (..) operators for idioms

• Only 36% of the displayed translations were clearly wrong


Thank you for your attention