Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology from seed

L2F - Spoken Language Systems Laboratory

Cross-Language Alignments: Challenges, Guidelines and Gold Sets

Anabela Barreiro, Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista, João Graça

Outline – Part 1

• Word alignment
  – Basic concepts
  – Applications
  – State of the art
  – Limitations
• Paraphrase alignment
• Multiword, meaning and translation unit alignment: importance
• Our task
• Alignment tool: CLUE-Aligner

Outline – Part 2

• General annotation guidelines
• Cross-linguistic major challenges to word alignment
• Annotation guidelines for multiword units and lexical and non-lexical realization phenomena
  – Pro-dropping
  – Articles and zero articles
  – Examples: continuous multiword units
  – Examples: continuous and discontinuous support verb constructions
• Preposition-dependency (V, N and Adj)
• Active vs passive
• Choice of noun pre-modifiers
• Different PoS with same semantics (V vs process N)
• Noun adjuncts
• Coordination
• Anaphora: choice of co-referents
• Impersonal constructions
• Contractions
• Style: idiomatic vs non-idiomatic
• Antonyms and negation constructions
• Romance languages double negation
• Singular vs plural
• Flexible/loose paraphrasing constructions
• Idiosyncrasies of each language

Outline – Part 3

• Our contribution
• Annotation process
• Preliminary results
• Discussion
• Future work

Word Alignment: Basic Concepts

• Objects representing the mapping of words (or expressions) that are semantically equivalent in a source and a target sentence of a parallel corpus [Brown et al., 1990]
  – A matrix whose entries a_{n,m} are indexed by a position n in the source sentence and a position m in the target sentence. An entry a_{n,m} specifies whether the word at source position n is part of a translation of the word at target position m

• Task of word alignment - identifying translational equivalences (= semantic correspondences) in the aligned sentence pairs of a parallel text [Hearne & Way, 2011]

• Translational equivalences - graphically represented in a grid by the intersection of single segments (individual words) or blocks (semantico-syntactic units, phrases, expressions)
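The matrix definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function name `alignment_matrix`, the toy sentence pair and the hand-picked links are all assumptions made for the example, and positions are 0-based.

```python
from typing import Iterable, List, Tuple


def alignment_matrix(n_source: int, n_target: int,
                     links: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Build an n_source x n_target 0/1 matrix from (source, target) position pairs."""
    matrix = [[0] * n_target for _ in range(n_source)]
    for s, t in links:
        matrix[s][t] = 1  # source word s is part of a translation of target word t
    return matrix


# Toy sentence pair and links, chosen by hand for illustration only.
source = ["the", "system", "works"]
target = ["o", "sistema", "funciona"]
links = [(0, 0), (1, 1), (2, 2)]

matrix = alignment_matrix(len(source), len(target), links)

# Read the translational equivalences back out of the grid.
aligned_pairs = [(source[s], target[t])
                 for s, row in enumerate(matrix)
                 for t, cell in enumerate(row) if cell]
```

A block alignment would show up in such a grid as a rectangle of filled cells rather than a single cell.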

Word Alignment: Basic Concepts

• Sure alignment (S-alignment)
  – Unambiguous and valid in all contexts
    • EN system
    • ES sistema
    • FR système
    • PT sistema
• Possible alignment (P-alignment)
  – Ambiguous and invalid in some contexts
    • EN be
    • ES ser/estar/haber/existir
    • FR être/avoir/exister
    • PT ser/estar/haver/existir

Word Alignment: Applications

• Statistical machine translation
  – [Brown et al., 1990] – statistical machine translation
  – [Och and Ney, 2004] – phrase-based machine translation
  – [Galley et al., 2004] – syntax-based machine translation
• Annotation projection
• Extraction of bilingual lexica
• Evaluation of machine translation systems

Word Alignment: State of the Art

• Workshops and evaluation tasks (multi-language)
  – http://www.cse.unt.edu/~rada/wp/
  – http://www.statmt.org/wpt05
  – http://www.lpl.univ-aix.fr/projects/arcade
• Projects
  – Blinker project – French-English: http://nlp.cs.nyu.edu/blinker/
• Guidelines
  – [Melamed, 1998]
  – [Och and Ney, 2000]
  – [Lambert et al., 2005]
  – [Kruijff-Korbayová et al., 2006]
  – [Graça et al., 2004]

Word Alignment: Limitations

• Language does not operate on a word-for-word basis
• A large number of words cannot be dissociated from one another
  – Multiword units (MWU)
    • [Gross and Senellart, 1998] – over 40% of one year of Le Monde text consists of MWU
    • [Sag et al., 2002] – 50-70% of specialized lexica are MWU
    • [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+ words (not including general-purpose MWU, e.g., generic compounds, lexical bundles, phrasal verbs, fixed expressions, which also occur in domain-specific texts)
  – Translation units
  – Meaning units
  – Paraphrases
• Segment and block alignment (sure and possible)
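One way to picture the four combinations named in the last bullet (segment or block, sure or possible) is a small record type. The sketch below is only an assumption about how such links could be stored; the class `Alignment`, its fields and the labels given to the two example links are hypothetical and are not taken from CLUE-Aligner.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alignment:
    """One alignment link between source and target token positions (0-based)."""
    source_positions: Tuple[int, ...]
    target_positions: Tuple[int, ...]
    sure: bool  # True = S-alignment, False = P-alignment

    @property
    def kind(self) -> str:
        # One source word linked to one target word is a segment;
        # anything spanning several words on either side is a block.
        one_to_one = (len(self.source_positions) == 1
                      and len(self.target_positions) == 1)
        return "segment" if one_to_one else "block"

    @property
    def label(self) -> str:
        return f"{self.kind} {'S' if self.sure else 'P'}-alignment"


# Illustrative labels only: EN "system" / PT "sistema", and
# EN "to make a reference to" / PT "fazer uma referência a".
word_link = Alignment((0,), (0,), sure=True)
mwu_link = Alignment((0, 1, 2, 3, 4), (0, 1, 2, 3), sure=False)
labels = [word_link.label, mwu_link.label]
```

Keeping the positions as tuples lets a single link cover a whole multiword unit instead of forcing it into word-by-word pairs.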

Example: Segment and Block Alignment (Sure and Possible)

Paraphrase Alignment

• Monolingual
  – [Callison-Burch et al., 2006]
    • Annotation guidelines for paraphrase alignment
    • Paraphrases - sentences that convey the same meaning but are worded differently
    • Alignment of words, phrases, expressions, within the same language
• Bilingual = (non-literal) translation
  – Need to account for paraphrases across languages

Multiword, Meaning and Translation Unit Alignment: Importance

• Publicly available manual word alignments are restricted to a few language pairs

• Manual word alignments are a desired resource
  – Evaluation of word alignment algorithms
  – Training of supervised and semi-supervised algorithms
  – Tuning of parameters for different types of model

• But, “name”, “concept” and “techniques” of alignment need to be linguistically sophisticated to be more useful and help provide improved machine translation!

Our Task

• EuroParl corpus [Koehn, 2005]
• 6 gold alignment sets
  – 400 sentence alignments per set (400 x 6 = 2,400)
• Languages: English, French, Portuguese and Spanish
  – Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]

• Guidelines for multi-language manual word annotations (with inter-annotator agreement)

• Linguistically-informed (and linguistically-motivated) cross-language multiword unit and paraphrase alignment (translation unit alignment)

CLUE-Aligner Alignment Tool

CLUE-Aligner = Cross-Language Unit Elicitation Aligner

• Helps reduce ambiguity in the alignment process
• Facilitates the alignment of translation units

Major Challenges (4 different classes)

• semantico-discursive
  – emphatic linguistic constructions
    • tautology
    • pleonasm and repetition
    • focus constructions
• lexical and semantico-syntactic
  – multiword units
  – compound verbs
  – prepositional predicates

Major Challenges (4 different classes)

• morphological
  – contracted forms
  – lexical versus non-lexical realization
    • articles and zero articles
    • pro-dropping
      – subject pronoun drop
      – empty relative pronoun
• morpho-syntactic
  – free noun adjuncts

General Annotation Guidelines

Linguistic phenomenon | No alignment | P-alignment

Incomplete or non-translation X
Incorrect translation and typo X*
Approximate correspondence (numeric) X
Non-obligatory linguistic structure
Pleonasm X
Repetition of words or expressions X
Redundancy or additional/extra information X
Mismatching pronoun, determiner, verbs, etc. X
Abbreviations versus full word X
Punctuation mark
  – Different but correct X
  – Incorrect / mismatch X
  – Missing X

* If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned

Annotation Guidelines

Linguistic phenomenon | No alignment | Block-alignment (S-align, P-align)

Multiword unit
  – continuous X X
  – discontinuous X*
Lexical versus non-lexical realization
  – article + N versus zero-article + N: Ø people = PT as pessoas X
  – pro-drop + V versus pronoun + V: I went = PT Ø fui X
  – empty relative pronoun versus realized relative pronoun: N that I met = N I met = PT que (eu) conheci X
  – relative versus participial adjective: that was written = written = PT escrito X

* Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)

Annotation Guidelines

Continuous multiword units | Block-S-alignment | Block-P-alignment

Support verb construction X X
Compound X X
Phrasal verb X X
Named entity X X
Date and time expression X
Lexical bundle X
Idiomatic expression X
Domain term X
French negation (ne pas) X
English infinitive (to + V) X X

[Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit

Example: Continuous Support Verb Constructions (alignment)

ES aprueba plenamente

FR approuve pleinement

Example: Discontinuous Support Verb Constructions (no alignment)

ES para que acelere la directiva sobre pensiones complementarias

FR pour faire avancer la directive sur les pensions complémentaires

Cross-Linguistic Challenges

• Prepositional predicates

EN I too should like to congratulate [NE] on his excellent report

ES también yo quisiera felicitar a mi colega [NE] por su excelente informe

FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent rapport

PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente relatório

EN […] our Asian partners prefer to deal with questions which unite us

ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que nos unen

FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit

PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas questões comuns

Segment S-alignment

Impossible to annotate discontinuous preposition-dependency

Block P-alignment

Cross-Linguistic Challenges

• Prepositional verbs

agree with, belong to, forgive s/o for, pay for, stand for
aim at/for, choose between, hope for, prepare for, thank s/o for
allow for, comment on, insist on, prevent s/o from, think of/about
apologise for, compare with, interfere with/in, provide s/o with, volunteer to
apply for, complain about, joke about, refer to, wait for
approve of, concentrate on, laugh at, rely on, warn s/o about
argue with/about, congratulate on, lend s/th to s/o, run for, worry about
ask for, consist of, listen to, smile at
attend to, deal with, long for, succeed in
believe in, decide on, object to, suffer from

Cross-Linguistic Challenges

• Prepositional nouns

attack on, attitude towards, in agreement, on strike
cruelty towards, comparison between, on average, in trouble
difficulty in/with, decrease in, on condition, on behalf of
knowledge of, disadvantage of, delay in, connection between
reason for, increase in, in doubt, difference between/of
rise in, preference for, information about, under guarantee
solution to, reduction in, need for, in power
use of, at risk, protection from, reaction to
in a hurry, at stake, report on, result of
in practice, in theory, room for, trouble with

Cross-Linguistic Challenges

• Prepositional adjectives

delighted at/about, frightened of, opposed to, similar to
different from, friendly with, pleased with, sorry for/about
dissatisfied with, good at, popular with, suspicious of
doubtful about, guilty of, proud of, sympathetic to(wards)
enthusiastic about, incapable of, puzzled by/about, tired of
envious of, interested in, safe from, typical of
excited about, jealous of, satisfied with, unaware of
famous for, keen on, sensitive to(wards), used to
fed up with, kind to, serious about
fond of, mad at/about, sick of

Cross-Linguistic Challenges

• Noun Adjuncts

– Compounds
  • European investment bank / banco europeu de investimento
    [Adj N N] / [N Adj [de N]]
– Free noun phrases (not compounds)
  • presidency communication / comunicação da presidência
    [N N] / [N [de N]]

Block S-alignment

Segment S-alignment

Block-P-alignment of [de N]

Cross-Linguistic Challenges

• Contractions

– two or more words with different parts-of-speech overlap, which makes syntactic analysis and generation difficult

– in cross-language analysis, the contrast between languages that have contractions and languages that do not have them, or do not have them in the same contexts, presents additional difficulties

– The alignment of one segment that corresponds to a contracted form in one language with the corresponding segments where elements are not contracted in the other language of the parallel pair is pragmatically motivated

Example: Contractions (block-P-alignment)

Interference with the support verb construction

EN to make a reference to

PT fazer uma referência a

Example: Contractions (block-P-alignment)

Interference with the support verb construction

ES hacer una referencia a

FR faire référence à

Cross-Linguistic Challenges

• Singular versus plural (related to determiner)

EN in every official language of the union

ES en todos los idiomas oficiales de la unión

FR dans toutes les langues officielles de l'union

PT em cada uma das línguas oficiais da união

• Active versus passive

EN before new member states are admitted

ES antes de la incorporación de nuevos miembros

FR avant l'admission de nouveaux membres

PT antes da entrada de novos membros

Block or segment P-alignment

Block-S-alignment if there is some fixedness

(such as in this case)

Block P-alignment

Cross-Linguistic Challenges

• Coordination

EN which we will send to the council and Ø parliament

ES que enviaremos al consejo y al parlamento

FR qui sera envoyée au conseil et au parlement

PT que remeterá ao conselho e ao parlamento

• Style: idiomatic versus non-idiomatic

EN which began four years ago

ES que empezó hace cuatro años

FR qui a vu le jour il y a quatre ans

PT que se iniciou há quatro anos

No alignment

Block P-alignment

Cross-Linguistic Challenges

• Choice of noun pre-modifiers

EN we should use that public funding for those types of project which are most difficult to finance through the private sector

ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos que tienen mayor dificultad para ser financiados por el sector privado

FR nous devrions recourir au financement public pour les projets que le secteur privé boude

PT o financiamento público deveria ser utilizado para os projectos que registam maiores dificuldades em serem financiados pelo sector privado

Block P-alignment

EN despite certain difficulties

PT apesar das dificuldades

Cross-Linguistic Challenges

• Anaphora - choice of co-referents (noun versus pronoun)

EN it is not acceptable that we assisted Korea during the Asean crisis by means of IMF loans and suchlike, only for Korea still to be subsidising its shipyards

ES no resulta procedente que hayamos ayudado a Corea en la crisis de la Asean a través de préstamos del FMI, etc. y que Corea siga subvencionando sus astilleros

FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner ses chantiers navals

PT é inadmissível que, depois de termos ajudado a Coreia, através de créditos do FMI, etc., na crise da Asean, este país continue a subvencionar agora os seus estaleiros navais

Segment or block P-alignment

Cross-Linguistic Challenges

• Antonyms and negation constructions

EN the countries of Asia have not unfortunately been in favour of that proposal

ES los países de Asia desgraciadamente no han sido favorables a dicha propuesta

FR les pays d'Asie ont malheureusement rejeté cette proposition

PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta proposta

Block S-alignment together with adverb (insert in EN and FR)

Cross-Linguistic Challenges

• Flexible/loose paraphrasing constructions

EN and we shall vote against it

ES y merece nuestra condena

FR et dénonçons

PT e merece a nossa condenação

EN 1993 was a significant year

ES el año 1993 es una fecha notable

FR l’année 1993 est à marquer d’une pierre blanche

PT 1993 é uma data charneira

Block P-alignment

Cross-Linguistic Challenges

• Different parts-of-speech with same semantics (verbs versus process nouns)

EN we must use all the financial instruments at our disposal to rapidly develop the market

ES es preciso utilizar todos los instrumentos financieros disponibles para un rápido desarrollo ulterior del mercado

FR il faut utiliser tous les instruments financiers disponibles pour développer rapidement le marché

PT todos os instrumentos financeiros disponíveis deverão ser aplicados para continuar a desenvolver rapidamente o mercado

Block S-alignment (with internal segment P-alignments)

EN and PT: Segment S-alignment; no alignment of [continuar a]

Cross-Linguistic Challenges

• Impersonal constructions (+ “impersonal” relative versus participial adjective)

EN we must fully support the demands that have been made

ES hay que apoyar plenamente las exigencias que se han formulado

FR il faut par conséquent appuyer les requêtes formulées

PT as reivindicações formuladas deverão ser plenamente apoiadas

Block P-alignment

Internal P-alignment

EN we must

ES hay que

FR il faut

Internal segment S-alignment - adverb and verb (EN, ES, FR)

Internal segment P-alignment - verb (PT)

Cross-Linguistic Challenges

• Romance languages double negation (+ coordination)

EN it is not, therefore, surprising that there is, in this context, no real integration or genuine political dialogue

ES no es nada sorprendente, entonces, que en ese contexto, no haya ni verdadera integración ni verdadero diálogo político

FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration véritable, ni dialogue politique véritable

PT assim, não é de espantar que, nesse contexto, não exista verdadeira integração nem verdadeiro diálogo político

Block P-alignment of the relative existential with adverbial (insert)

EN that there is, in this context, no

ES que en ese contexto, no haya

FR qu’il n’y ait dans ce contexte

PT que, nesse contexto, não exista

Segment P-alignment of negation and negation connector

EN no – or

ES ni – ni

FR n’ – ni

PT Ø - nem

Cross-Linguistic Challenges

• Idiosyncrasies of languages
  – Portuguese inflected infinitive (peculiar verb tense)
  – English to+Infinitive
  – French negation
  – English apostrophe
  – …
• Sociolinguistic differences

Our Contribution

• Tool CLUE-Aligner
• Annotated corpora
• Cross-language resources – gold collection
  – Publicly available on the META-NET website: http://metanet4u.l2f.inesc-id.pt/
• Guidelines – http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf

Annotation Process

• Annotation of 400 x 6 (2,400) sentence alignments by a linguist
• Alignment of a subset (25 sentences of the English-Portuguese language pair) by a second linguist
• Inter-annotator agreement

Preliminary Results

language  words  avg. words
en  11158  27.9
es  11664  29.2
fr  12464  31.2
pt  11649  29.1

pair  Sure  Possible  Total
en-pt  6684  418  7102
en-fr  7025  569  7594
en-es  7636  399  8035
es-fr  7477  767  8244
pt-es  7958  557  8515
pt-fr  7029  782  7811

pair  Sure  Possible  Total
en-pt  2588  602  3190
en-fr  3865  414  4279
en-es  3551  351  3902
es-fr  3516  495  4011
pt-es  3162  382  3544
pt-fr  3253  698  3951

Block (MWU) alignment
Segment (word) alignment
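The figures above can be cross-checked with simple arithmetic. The sketch below assumes that each language's word count is spread over 400 sentences (the per-set size given in the task description) and uses only the first pair table; the variable names are illustrative.

```python
SENTENCES_PER_SET = 400  # assumption: 400 sentences per language, as in the task description

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}
reported_avg = {"en": 27.9, "es": 29.2, "fr": 31.2, "pt": 29.1}

# (sure, possible, total) rows of the first pair table
first_table = {
    "en-pt": (6684, 418, 7102),
    "en-fr": (7025, 569, 7594),
    "en-es": (7636, 399, 8035),
    "es-fr": (7477, 767, 8244),
    "pt-es": (7958, 557, 8515),
    "pt-fr": (7029, 782, 7811),
}

# Average words per sentence, rounded to one decimal as on the slide.
avg_words = {lang: round(count / SENTENCES_PER_SET, 1)
             for lang, count in words.items()}

# Sure + Possible should equal Total in every row.
totals_consistent = all(s + p == t for s, p, t in first_table.values())

# Share of possible alignments per pair, in percent.
possible_share = {pair: 100 * p / t for pair, (s, p, t) in first_table.items()}
```

Under that assumption the recomputed averages match the reported ones, and every row total is the sum of its Sure and Possible counts.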

Inter-annotator Agreement

• Statistical significance for kappa is rarely reported. However, a number of magnitude guidelines have appeared in the literature.
  – Landis & Koch (1977) consider
    • kappas between .4 and .6 as moderate agreement
    • kappas between .8 and 1 as almost perfect agreement
  – Fleiss (1981) (equally arbitrary guidelines) characterizes
    • kappas from .40 to .75 as fair to good
    • kappas over .75 as excellent
• This set of guidelines is, however, by no means universally accepted

Cohen's kappa coefficient

Multi-word units (MWU)  0.541
Word alignments (WA)  0.984
Total  0.871
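For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a generic implementation on toy labels; the label names ("S", "P", "N" for sure, possible, no alignment) and the two annotation sequences are invented for illustration and are not the data behind the figures above.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy decisions of two annotators on six candidate links.
annotator_1 = ["S", "S", "P", "N", "S", "P"]
annotator_2 = ["S", "S", "P", "N", "P", "P"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

With these toy labels the annotators agree on five of six items, the chance agreement is 13/36, and kappa works out to 17/23.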

Discussion

• Difficulties in analyzing fluency, stylistics (including word order), paraphrase, etc.

• Alignments do not always work bi-directionally (sometimes the source-target direction for a language pair matters)

• Levels of alignment and ranking systems (n-grams, morphology, semantico-syntactic level, phrase, paraphrase, etc.)

• Terminology imprecision is found in corpora (it leads to poor quality machine translation)

Future Work

• Integration of lexica (multiword units, etc.) obtained via the use of local grammars – use multiword units as ONE (1) segment of alignment, whenever that is possible (contiguous, etc.)

• Pre-processing of contractions and post-processing of elements that need to be contracted is important if applied to machine translation or to create “more polished” lexica

• Evaluation of the current alignments in a statistical machine translation system to see if translation quality improves
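The contraction pre-processing mentioned above could look roughly like the following sketch for Portuguese. It is an assumption about one possible approach, not the authors' pipeline: the table `PT_CONTRACTIONS` is deliberately small, the function name is hypothetical, and ambiguous forms (for example "no", which is also a noun) are simply left out.

```python
from typing import Dict, List, Tuple

# A few Portuguese preposition + article contractions (illustrative, not exhaustive).
PT_CONTRACTIONS: Dict[str, Tuple[str, str]] = {
    "do": ("de", "o"), "da": ("de", "a"),
    "dos": ("de", "os"), "das": ("de", "as"),
    "ao": ("a", "o"), "aos": ("a", "os"),
    "pelo": ("por", "o"), "pela": ("por", "a"),
    "nas": ("em", "as"),
}


def split_contractions(tokens: List[str]) -> List[str]:
    """Replace each known contraction with its two component tokens."""
    expanded: List[str] = []
    for token in tokens:
        # Unknown tokens pass through unchanged; matches are case-insensitive.
        expanded.extend(PT_CONTRACTIONS.get(token.lower(), (token,)))
    return expanded


# Example taken from the coordination slide.
tokens = "que remeterá ao conselho e ao parlamento".split()
expanded = split_contractions(tokens)
```

After splitting, each Portuguese preposition and article can be aligned to its own counterpart (e.g. "to" and "the") instead of forcing a one-to-many link on the contracted form; the reverse step would re-contract the output.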

Future Work

• Machine learning of recognition and alignment of multiword units
  – based on segment alignments, i.e., individual words inside the multiword unit
  – based on multiword units of a parallel sentence in another language or language pair alignment

• Use of local grammars that identify and process discontinuous multiword units and other complex linguistic phenomena to combine with word alignment techniques – how to combine?

Main Conclusion

• Bringing linguistics into SMT at the start is the first inevitable place where hybridization should be possible.

• We believe that it would be productive to convert texts on both sides of a translation pair into a common semantico-syntactic representation before applying statistics to them. For this, each language would need a parser capable of producing homogeneous output.

• If this common representation were available, that would bring vast possibilities for multi-linguistic SMT.

Thank you!
