August 20, 2011
Stéphane HUET and Philippe LANGLAIS
NLPCS 2011 - Copenhagen
Identifying the Translations of Idiomatic Expressions
using TransSearch
Identifying Translations of Idiomatic Expressions 2 S. HUET
Idiomatic Expressions
• Oxford Companion to the English Language
  – Idioms are expressions of a given language, whose sense is not predictable from the meanings and arrangement of their elements
  – To fight like cat and dog
  – It rains cats and dogs
The problem of idiomatic expressions
• Numerous in most languages
• Have idiosyncratic meanings that disturb
  – Non-native speakers
  – NLP
• In Machine Translation (MT)
  – Group multi-word expressions before the alignment process [Lambert and Banchs 05]
  – Add a new feature encoding the fact that a phrase is a multi-word expression [Carpuat and Diab 10]
Idiomatic expressions and MT
Objectives of the study
• The ability of the bilingual concordancer TSRali to retrieve the translations of idiomatic expressions
• Practical issues in querying such a system
Outline
• Introduction
• TransSearch
• Experimental setup
• Evaluations
• Conclusion
TransSearch
• Available on the Web since 1996
• Developed by the Université de Montréal
• Subscribed to by many professional translators in Canada
• 7.2 M queries over 6 years
• Exploits an English-French translation memory
• Incorporates word alignment technology
User interface
1. Retrieve sentence pairs
2. Spot translations
3. Identify the list of translations
Alignment and translation
• Word-based alignment (IBM)
• Translation spotting
  – Constrained to contiguous word alignment
Example sentence pair:
  En: This is in keeping with that strategy .
  Fr: La présente mesure est conforme à cette stratégie .
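One way to picture translation spotting under a contiguity constraint is to take the smallest contiguous target span that covers every word aligned to the query. The sketch below assumes this reading. The alignment links are hand-written for illustration (they are not the output of an IBM model), and `spot_translation` is a hypothetical helper, not the algorithm used in TransSearch.

```python
def spot_translation(links, query_span, target_tokens):
    """Return the smallest contiguous target span covering the query's aligned words.

    links: set of (source_index, target_index) word-alignment pairs.
    query_span: (start, end) source token indices of the query, end exclusive.
    """
    start, end = query_span
    aligned = [t for s, t in links if start <= s < end]
    if not aligned:
        return None  # no query word is aligned to anything
    first, last = min(aligned), max(aligned)
    return " ".join(target_tokens[first:last + 1])


source = "This is in keeping with that strategy .".split()
target = "La présente mesure est conforme à cette stratégie .".split()

# Hand-written alignment links (an assumption made for this example).
links = {
    (0, 0), (0, 1), (0, 2),  # This -> La présente mesure
    (1, 3),                  # is -> est
    (2, 4), (3, 4),          # in keeping -> conforme
    (4, 5),                  # with -> à
    (5, 6), (6, 7), (7, 8),  # that strategy . -> cette stratégie .
}

# The query "in keeping with" covers source tokens 2 to 4.
spotted = spot_translation(links, (2, 5), target)
print(spotted)
```

Taking the covering span is only one simple way to force contiguity; it can absorb unrelated words when alignment links are noisy.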
Post-processing steps
• Objective: to have relevant and informative translations at the top of the list
• Filtering of bad translations
  – Supervised classifier
  – Features: alignment probabilities, POS tags
• Merging of similar translations
  – Inflectional forms of the same canonical words: conforme à / conforme aux
  – Difference by grammatical words or punctuation: à l'encontre de / à l'encontre
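The merging of similar translations can be sketched as grouping candidates under a normalized key. Everything in this example is an assumption made for illustration: the function-word list, the lemma table, the punctuation set and the frequencies are toy values, and `merge_key` and `merge_translations` are hypothetical names rather than TransSearch components.

```python
from collections import defaultdict

# Toy resources, assumed for this example only.
FUNCTION_WORDS = {"de", "du", "des", "le", "la", "les"}
LEMMAS = {"aux": "à"}
PUNCTUATION = {",", ".", ";", ":"}


def merge_key(translation):
    """Normalize a translation: drop punctuation, lemmatize, trim edge function words."""
    tokens = [LEMMAS.get(t, t) for t in translation.lower().split()
              if t not in PUNCTUATION]
    while tokens and tokens[-1] in FUNCTION_WORDS:
        tokens.pop()
    while tokens and tokens[0] in FUNCTION_WORDS:
        tokens.pop(0)
    return " ".join(tokens)


def merge_translations(counts):
    """Merge variants sharing a key; keep the most frequent form and sum frequencies."""
    groups = defaultdict(list)
    for translation, freq in counts.items():
        groups[merge_key(translation)].append((translation, freq))
    merged = {}
    for variants in groups.values():
        representative = max(variants, key=lambda v: v[1])[0]
        merged[representative] = sum(freq for _, freq in variants)
    return merged


# Made-up frequencies for the two pairs of variants shown on the slide.
counts = {
    "conforme à": 12,
    "conforme aux": 5,
    "à l'encontre de": 7,
    "à l'encontre": 3,
}
merged = merge_translations(counts)
print(merged)
```

Showing the most frequent variant as the representative of each group is a design choice of this sketch; the slides do not say which form is displayed.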
Type of queries
• Verbatim queries: “normal” queries
  – is still in its infancy
• Ellipses: for discontinuous expressions
  – is .. in its infancy
• Dictionary queries: for morphological expansions
  – be+ still in its+ infancy
• Bilingual queries: to check translations
  – En: is still in its infancy
  – Fr: en est encore à ses premiers balbutiements
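To make the `..` and `+` operators concrete, the sketch below compiles a query into a regular expression. It is an illustration under stated assumptions, not a description of how TransSearch processes queries: the inflection table `INFLECTIONS` is a toy stand-in for a morphological dictionary, and the gap length `MAX_GAP` (zero to five intervening words) is an arbitrary choice.

```python
import re

# Toy inflection table (assumption); a real system would use a full lexicon.
INFLECTIONS = {
    "be": ["be", "am", "is", "are", "was", "were", "been", "being"],
    "its": ["its", "his", "her", "their"],
}
MAX_GAP = 5  # assumed maximum number of words matched by ".."


def compile_query(query):
    """Turn a query with '..' (gap) and '+' (expansion) operators into a regex."""
    parts = []
    gap_before = False
    for token in query.split():
        if token == "..":
            gap_before = True
            continue
        if token.endswith("+"):
            base = token[:-1]
            forms = INFLECTIONS.get(base, [base])
            piece = "(?:" + "|".join(re.escape(f) for f in forms) + ")"
        else:
            piece = re.escape(token)
        if parts:
            # A gap allows up to MAX_GAP extra words between two query words.
            sep = r"(?:\s+\S+){0,%d}\s+" % MAX_GAP if gap_before else r"\s+"
            parts.append(sep)
        parts.append(piece)
        gap_before = False
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


verbatim = compile_query("is still in its infancy")
ellipsis = compile_query("is .. in its infancy")
dictionary = compile_query("be+ still in its+ infancy")

sentence = "This technology is still very much in its infancy."
print(bool(verbatim.search(sentence)))   # False: "very much" intervenes
print(bool(ellipsis.search(sentence)))   # True: the gap absorbs the extra words
print(bool(dictionary.search("They were still in their infancy.")))  # True
```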
Resources
• Translation memory
  – Canadian Hansards (1986-2007)
  – 8.3 M sentence pairs
• Idiom lexicon
  – French-English phrase book
  – 1,467 expressions
  – Some entries with 2 or 3 translations
Type of idiomatic expressions
• 2% are expressed in informal language
  – She's well-upholstered.
  – Il roule des mécaniques.
• 99% are used in the context of a sentence
  – It's fantastic to bop till you drop.
• 80% are verbal phrases used in their inflected forms
  – I slept like a log.
• 20% are fixed expressions
  – When there's a will, there's a way
Manual preprocessing
• Annotation of words judged as extra information
  – They put the new salesman through his paces.
• Type of extra information words
  – Modal verbs: can, must
  – Semi-modal verbs: am going to, are likely to
  – Catenative verbs: want to, keep
  – Adverbial phrases: in Italy, when he heard the news
  – Noun phrases: this poet, his latest book
Number of queries found in the TM
Queries                                   Bilingual    EN    FR
Verbatim queries                                 36   136   248
+ manual removal of extra words                  91   302   410
+ removal of extra pronoun                      106   381   509
+ verb lemmatization                            210   624   650
+ pronoun and determiner lemmatization          238   700   705
Example query at each step:
• Verbatim queries
  – EN: I have no axe to grind
  – FR: Je ne prêche pas pour ma paroisse
• + manual removal of extra words
  – EN: I have .. axe to grind
  – FR: Je .. prêche .. pour ma paroisse
• + removal of extra pronoun
  – EN: have .. axe to grind
  – FR: prêche .. pour ma paroisse
• + verb lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa paroisse
• + pronoun and determiner lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa+ paroisse
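The counts above can be related to the size of the phrase book with a short calculation. The sketch assumes that each count can be divided by the 1,467 expressions of the idiom lexicon. This is a simplification, since some entries have 2 or 3 translations, so the percentages are indicative only; the name `coverage` is introduced here for illustration.

```python
PHRASE_BOOK_SIZE = 1467  # expressions in the idiom lexicon

# (step, bilingual, English, French) counts taken from the slides.
FOUND = [
    ("Verbatim queries", 36, 136, 248),
    ("+ manual removal of extra words", 91, 302, 410),
    ("+ removal of extra pronoun", 106, 381, 509),
    ("+ verb lemmatization", 210, 624, 650),
    ("+ pronoun and determiner lemmatization", 238, 700, 705),
]


def coverage(count, total=PHRASE_BOOK_SIZE):
    """Share of phrase-book expressions found in the TM, in percent."""
    return round(100.0 * count / total, 1)


for step, bilingual, english, french in FOUND:
    print(f"{step:40s} {coverage(bilingual):5.1f} "
          f"{coverage(english):5.1f} {coverage(french):5.1f}")
```

With every relaxation step applied, the English and French sides both reach roughly 48% under this assumption, compared with about 9% and 17% for verbatim queries.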
Evaluation using the phrase book
• 700 English queries found in the TM
  – 36 sentence pairs per query
  – 13 suggested translations
• 705 French queries found in the TM
  – 32 sentence pairs per query
  – 15 suggested translations
• Evaluation restricted to the 238 entries whose English and French sides occur in the same sentence pair
Recall measured using the phrase book
• For many queries, TransSearch displays relevant translations absent from the reference
  – est nébuleux displayed after the reference être dans un état second for to be in a daze
  – 34 correct translations displayed for to be around the corner
Recall (%) at rank      1      3      5    all
English queries      41.6   59.2   65.1   74.8
French queries       41.6   54.6   62.6   76.5
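Recall at a given rank can be sketched as the share of queries for which a reference translation appears among the top k displayed translations. The function below is a minimal sketch under that assumption, with exact string matching against the reference; the study may have matched translations differently. The second query and its translations are placeholders, not data from the evaluation.

```python
def recall_at_k(results, references, k=None):
    """Percentage of queries with a reference translation in the top k results.

    results: dict mapping a query to its ranked list of displayed translations.
    references: dict mapping a query to its set of reference translations.
    k: rank cutoff; None means the whole list ("all").
    """
    hits = 0
    for query, refs in references.items():
        ranked = results.get(query, [])
        top = ranked if k is None else ranked[:k]
        if any(translation in refs for translation in top):
            hits += 1
    return 100.0 * hits / len(references)


# Toy input: the first entry mirrors the example on the slide,
# the second uses placeholder strings.
results = {
    "to be in a daze": ["être dans un état second", "est nébuleux"],
    "query 2": ["x", "y", "z"],
}
references = {
    "to be in a daze": {"être dans un état second"},
    "query 2": {"z"},
}

print(recall_at_k(results, references, k=1))
print(recall_at_k(results, references, k=3))
print(recall_at_k(results, references))
```

A measure of this kind counts a relevant translation such as est nébuleux as a miss whenever it is absent from the reference, which is why the slide presents these figures alongside a manual evaluation.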
Manual evaluation
• 100 French queries
• 5 annotators who judged 50 queries each
• 3 labels: “correct”, “wrong”, “partial”
• Low inter-annotator agreement (Fleiss' kappa: 0.25)
Q: manger à tous les rateliers                                                  J1        J2        J3
slurps at everyone's trough                                                     correct   correct   correct
double-dipper                                                                   partial   correct   partial
them pot lickers and accusing them of being at the trough and pork barelling   wrong     partial   wrong
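For readers unfamiliar with the agreement measure, the sketch below computes Fleiss' kappa with the standard formula and applies it to the three judgments shown for manger à tous les rateliers. It only illustrates the computation: the function name `fleiss_kappa` and the data layout are choices made here, and three items are far too few to reproduce the 0.25 reported over the whole evaluation.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings: list of items, each a list holding one label per judgment.
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of judgments assigning category j to item i.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Agreement expected by chance, from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_observed - p_expected) / (1 - p_expected)


LABELS = ["correct", "partial", "wrong"]
example = [
    ["correct", "correct", "correct"],  # slurps at everyone's trough
    ["partial", "correct", "partial"],  # double-dipper
    ["wrong", "partial", "wrong"],      # them pot lickers ...
]
kappa = fleiss_kappa(example, LABELS)
print(round(kappa, 3))
```

Values close to 0 indicate agreement near chance level and values close to 1 near-perfect agreement.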
• Average rank of the 1st translation labeled as correct by 1 annotator: 1.4
• For 97/100 queries, a correct translation is displayed
• Labels assigned to the displayed translations: correct 42%, partial 22%, wrong 36%
Conclusion
• 50% of the idioms of a phrase book found in the TM of TransSearch
• Users should use morphological (+) and proximity (..) operators for idioms
• Only 36% of the displayed translations were clearly wrong
Thank you for your attention