
FP6-IST-003768

METIS-II Statistical Machine Translation using Monolingual Corpora:

From Concept to Implementation

Specific Targeted Research Project or Innovation Projects Future and Emerging Technologies

D5.1 Validation/Evaluation framework

Due date of deliverable: 31.01.2006 Actual submission date: 16.03.2006

Start date of project: 1.10.2004 Duration: 3 years FUPF

Final version

Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006)

Dissemination Level PU Public X

PP Restricted to other programme participants (including the Commission Services)

RE Restricted to a group specified by the consortium (including the Commission Services)

CO Confidential, only for members of the consortium (including the Commission Services)


Table of Contents

1. Introduction
2. Validation procedures internal to each site
   2.1 FUPF
      2.1.1 Spanish grammatical phenomena
      2.1.2 Spanish-English Contrastive grammatical phenomena (not included in 2.1.1)
   2.2 GFAI
      2.2.1 Lexical Translation Problems
      2.2.2 Syntax
   2.3 ILSP
   2.4 KUL
      2.4.1 Lexical phenomena
      2.4.2 Syntactic phenomena
      2.4.3 Word, phrase and sub-clause order issues
3. Consortium-wide experiments
   3.1 Description of the common validation experiment
   3.2 Description of the cross-approach part of the experiment
   3.3 Analysis of the results
      3.3.1 Description of test data and evaluation results
      3.3.2 Human ratings
      3.3.3 Error Analysis
      3.3.4 Cross-approach comparison
4. Evaluation in comparison to other MT systems
5. User evaluation
6. Calendar
References
Annex 1: Preliminary list of grammatical phenomena common to the four SL
Annex 2: Automatic evaluation metrics
Annex 3: Expander Input Format Description (ILSP's proposal)

List of Tables

Table 1: Validation Experiment 2 compared to Experiment 1
Table 2: Template for the description of the conditions of the experiment
Table 3: Template for the description of results (per sentence)
Table 4: Template for the description of the test data and the global results of the validation experiment
Table 5: Average scores depending on type of sentence
Table 6: Average scores depending on grammatical phenomenon
Table 7: Comparison of evaluation metrics for a given sentence
Table 8: Comparison of average evaluation metrics
Table 9: Comparison of average evaluation metrics on short sentences
Table 10: Comparison of average metrics on sentences with word order changes between SL and English
Table 11: Calendar for validation and evaluation within METIS-II


1. Introduction

Translation quality, whether human or software generated, is difficult to quantify. Different systems, and for that matter, different human translators, can produce intelligible, accurate, but different translations of the same sentence. Therefore, for any input sentence, there is no single, ideal output sentence. Also, some errors are more serious than others, so not all errors should be assigned the same importance.

MT will never attain the overall quality of human translation. The primary advantages of MT over human translation are speed, cost, and consistency. An MT system gets a great deal more translation done than is possible manually, and MT can deliver translations instantly for time-sensitive content. Therefore, judgment on MT output should be based not on whether the system produces ‘real’ translations, and particularly not on whether it produces ‘good’ translations, but on whether the output can be used and whether its use will save time or money.

Considerations of this type will be taken into account in the set-up of the evaluation procedures for METIS-II. Thus, the translation system developed in METIS-II will be evaluated along three lines:

a) Internal validation: the system is regularly validated through a series of tests within each site (see section 2). In addition, two experiments will be carried out with a common set-up for the validation of the whole system within the consortium (see section 3).

b) External evaluation: the system is evaluated in comparison to other MT systems (see section 4).

c) Usability evaluation: the system is evaluated by the user-groups not only in terms of quality of the translation but also in terms of time-saving factors (see section 5).

2. Validation procedures internal to each site

Systematically organised test suites are a common tool in NLP, useful for testing syntactic coverage and monitoring progress. One of the problems with setting up this kind of test suite is the interaction between linguistic phenomena. The standard solution is to reduce the linguistic complexity of all items other than the item of interest to an absolute minimum.

When the system to be tested is a translation system, yet more complications arise (King & Falkedal, 1990). Such a test suite has to include a substantial component of contrastively based test inputs, which are specific to each language pair. This is particularly relevant for our system because most of the problems will in all probability arise from mismatches between SL and TL.


Since many of those problems are lexically determined, we have to assume that it is almost impossible to build a test suite that can guarantee comprehensive coverage of all problematic cases. However, the most general cases of divergence between the two languages need to be foreseen.

The following validation strategies are envisaged by each site (in alphabetical order):

2.1 FUPF

A test suite of growing grammatical complexity has been designed that covers general grammatical phenomena in Spanish as well as problematic areas in the translation between Spanish and English. This test suite is used to assess progress in the development of the system. To that end, the system is periodically checked against the test suite.

We expect results to improve with every iteration of the validation process. Any regression in the results needs to be addressed properly.

Below follows the list of grammatical phenomena that has been used to build the test suite used by FUPF to validate the Spanish-English translation system. The first group of phenomena is monolingually motivated; the second group is related to the contrastive analysis of both the SL (Spanish) and the TL (English) and exemplifies cases of problematic translations between the two languages.

2.1.1 Spanish grammatical phenomena

∗ Main clauses: impersonal verbs, intransitive verbs, transitive verbs, copulative verb, prepositional verbs, ditransitive verbs, movement verbs.

∗ Complementation: full complements, clitics, duplicated clitics, dislocated complements

∗ Time, place and other modifiers

∗ Prodrop

∗ Negation

∗ Canonical passive sentences (past-participle)

∗ Complex Verb Group (perfect, progressive,...)

∗ Modals

∗ NP modifiers (adjectives, mod PPs, possessive)

∗ Relative clauses

∗ Coordination (S and NP)

∗ DETP structure (possessive, quantifiers, ...)

∗ Control and raising structures


2.1.2 Spanish-English Contrastive grammatical phenomena (not included in 2.1.1)

∗ Lack of correspondence in the use of tense and mood: e.g. non-progressive present tense in Spanish is generally translated by progressive in English.

SP. Llueve.

ENG. It is raining.

∗ Different constructions in Spanish are translated into Passive Voice in English. Such constructions include: impersonal 3rd person plural, impersonal se and reflexive passive.

SP. La coca-cola normalmente se bebe fría.

ENG. Coke is usually drunk cold.

∗ One of the most frequent cases of different complementation patterns in the two languages is the translation of a Direct Object in Spanish by a Prepositional Object in English.

SP. Espera el autobús.

ENG. Wait for the bus.

∗ Another systematic divergence is the use of the preposition “a” (to) in Spanish to mark human DOs.

SP. Contesta al profesor.

ENG. Answer the teacher.

∗ A well-known case of ‘reverse’ constructions is the gustar / like pair.

SP. Les gusta el queso francés.

ENG. They like French cheese.

∗ Homonyms in Spanish have different translations in English

SP. La capital acumula todo el capital.

ENG. The capital city accumulates all the money.

∗ The main structural issue related to noun complementation is the translation of [N + de N] in Spanish into [N + N] in English, although there are other structure changes, such as [N + Adj] => [N + N] or [N + de Vinf] => [Vger + N]

SP. jugo de naranja

ENG. orange juice

∗ The use of the article differs between the two languages in many cases. Spanish tends to use the definite article ‘el’ for generic plural NP Subjects or Objects, appositive NPs, temporal NPs or locative NPs, while English does not. On the other hand, English often uses the indefinite article ‘a(n)’ for generic NP Objects while Spanish does not.


SP. Los hombres son mortales.

ENG. Men are mortal.

SP. Luisa siempre lleva sombrero.

ENG. Louise always wears a hat.

∗ The Subject is obligatorily present in an English sentence, but not in Spanish. Consequently we find dummy pronouns in English and frequent instances of Pro-drop in Spanish.

SP. Hace veinte grados afuera.

ENG. It’s twenty degrees outside.

SP. Hoy voy a la playa.

ENG. Today I go to the beach.

∗ The Spanish pronoun ‘se’ has several semantic interpretations and consequently several possible translations into English (self pronoun, possessive, inchoative...)

SP. El gerente se disparó.

ENG. The manager shot himself.

SP. El barco se hundirá.

ENG. The boat will sink.

∗ Spanish is a relatively free Word Order language, while English is not.

SP. Compró mi padre una casa.

ENG. My father bought a house.

∗ Another typical word order divergence is the position of the adjective within the NP.

SP. una mesa redonda

ENG. a round table

∗ Preposition stranding happens in English but not in Spanish.

SP. ¿De dónde vino?

ENG. Where did he come from?

∗ English inserts the do-particle in questions and negated sentences

SP. ¿María salió del cuarto?

ENG. Did Mary leave the room?

∗ Lexicalisation of comparative adjectives happens only in a handful of cases in Spanish but is a productive phenomenon in English.

SP. Juan es más rápido que Pilar.

ENG. John is faster than Pilar.


2.2 GFAI

The following is an incomplete collection of phenomena subsumed under linguistic concepts that are relevant from a contrastive point of view.

Despite the fact that German is extremely similar to Dutch, we would like to follow a scheme that is slightly different from the one for Dutch below.

2.2.1 Lexical Translation Problems

There are a number of lexical translation problems some of which can be found for other languages as well:

∗ Separable prefixes

Abnehmen – lose weight:

Mein Freund nahm durch eine strikte Diät 10 Kilos ab.

(My friend lost 10 kilos through a strict diet)

∗ Compounds: Compounds are productive and thus cannot be kept in the lexicon.

Schifffahrt (shipping)

Dampfschifffahrt (steam shipping)

Donaudampfschifffahrt (Danube steam shipping)

Die Donaudampfschifffahrt machte Verluste im letzten Jahr.

(Danube steam shipping made losses last year)

∗ Fixed prepositions

Meine Bekannten mussten lange auf mich warten.

(My friends had to wait for me a long time)

∗ Collocations / Light verb constructions

Meine Mutter will mir bis morgen Bescheid sagen.

(My mother will let me know by tomorrow)

∗ Numbers and proper names: Numbers and proper names are phenomena for which the project needs a general strategy.


∗ Degree of adjectives and adverbs

Er hat das größte Auto.

(He has the biggest car)

∗ Lexical ambiguities

Gestern saß ich auf der Bank im Park, als Peter um die Ecke kam.

(Yesterday, I was sitting on a bench in the park when Peter came around the corner)

2.2.2 Syntax

∗ Nominalisation

Aus Langeweile machte er den Fernseher an.

(Not knowing what to do he turned on the TV)

∗ Determination

Er ist Lehrer. (He is a teacher)

∗ Word order

Die Ärzte kämpfen seit Tagen um das Leben des Premierministers.

Um das Leben des Premierministers kämpfen die Ärzte seit Tagen.

Seit Tagen kämpfen um das Leben des Premierministers die Ärzte.

(The doctors have been fighting for the Prime Minister's life for days)

∗ Different Complementation

Ich erinnere mich an ihn.

I remember him.

Hans wartet auf ihn.

Hans is waiting for him.

Das Auto gefällt mir.

I like the car.


∗ Diathesis / Tense

Das Haus wurde von Hans gekauft.

(The house was bought by Hans)

Das Haus war durch einen Makler vermittelt worden.

(The house deal was arranged by a broker)

∗ Combination of phenomena

In den nächsten Tagen wird das Haus abgerissen.

(The house will be pulled down in a few days)

An ihn erinnere ich mich nur noch selten.

(I remember him very rarely)

∗ Anaphora

Ich spüle einen Topf. Er ist schmutzig.

(I clean a pot. It is dirty)

Maria spült ihren Topf.

(Maria cleans her / their pot)

∗ Relative clauses

Der Lehrer sucht das Buch, das er seinem Schüler ausgeliehen hat.

(The teacher is looking for the book which he had lent to his student)

∗ Tense / Aspect

Er unterschreibt in diesem Augenblick einen Vertrag.

(He is signing a contract at this very moment)

Als er das Zimmer betrat, las der Junge ein Buch.

(When he entered the room, the boy was reading a book)


Er wohnt seit Jahren in Saarbrücken.

(He has been living in Saarbrücken for years)

∗ Head switch

Er schwimmt gerne.

(He likes to swim)

∗ Category change

Der vom Baum gefallene Apfel ist faul.

(The apple that has fallen from the tree is rotten)

∗ Infinitives

Sobald er in Ohio ankam, suchte er den Dekan der Universität auf.

(Arriving in Ohio he immediately went to see the dean of the University)

∗ Prepositions

(‘auf’ – normally ‘on’)

Die Kinder spielen auf dem Spielplatz.

(The kids are playing in the playground)

(‘an’ – normally ‘at’)

Das Bild hängt an der Wand.

(The painting is on the wall)

∗ Modalities

Die Beschäftigten des Siemens-Konzerns müssen nicht länger arbeiten.

(The Siemens employees need not work longer than before)

Die Beschäftigten des Siemens-Konzerns dürfen nicht länger arbeiten.

(The Siemens employees must not work longer than before)

Die Beschäftigten des Siemens-Konzerns dürfen länger arbeiten.

(The Siemens employees may work longer than before)

Die Beschäftigten des Siemens-Konzerns müssen nicht länger arbeiten.

(The Siemens employees do not have to work longer than before)

Die Beschäftigten des Siemens-Konzerns müssen länger arbeiten.

(The Siemens employees have to work longer than before)


2.3 ILSP

The following table contains the list of phenomena used by ILSP to test their system:

Phenomenon – Greek Sentence – English Translation

Transitive verbs

Ο διευθυντής έδωσε βραβεία στους µαθητές.
The headmaster gave awards to the students.

Intransitive verbs

Στο µεταξύ, ο Φίλιππος µεγάλωσε.
In the meantime, Philippos grew up.

Copulas

Η επιχείρηση είναι µικρή.
The enterprise is small.

Impersonal verbs

Πρέπει να γίνει µελέτη.
A study should be carried out.

Ergatives

Προηγουµένως, ο µαέστρος µου διηγήθηκε ότι ξεκίνησε την επαγγελµατική του καριέρα πριν τελειώσει το γυµνάσιο, µε κοντά παντελόνια.
Earlier on, the maestro told me that he started his professional career before he finished high-school, when he was still wearing short trousers.

Unergatives

Η εκδήλωση ξεκίνησε περίπου στις 7.30 το απόγευµα, και ο ακαδηµαϊκός Κωνσταντίνος Γρόλλιος εκφώνησε τον πανηγυρικό λόγο για την επέτειο της 25ης Μαρτίου.
The event started at about 7.30 in the evening and the academic Constantinos Grollios delivered a festive speech for the anniversary of the 25th March.

Word order

Στο διαιτητή έδωσε τα συγχαρητήρια ο πρόεδρος.
It was the referee who was congratulated by the president.

Τα συγχαρητήρια έδωσε στο διαιτητή ο πρόεδρος.
Congratulations were offered to the referee by the president.

Ο πρόεδρος έδωσε στο διαιτητή συγχαρητήρια.
The president congratulated the referee.

Ο πρόεδρος έδωσε συγχαρητήρια στο διαιτητή.
The president congratulated the referee.

∆ίνει στο διαιτητή συγχαρητήρια ο πρόεδρος.
It was the referee whom the president congratulated.

Subordination

Ένα 56% των Ευρωπαίων πολιτών φοβάται µήπως χάσει κοινωνικά πλεονεκτήµατα λόγω της ευρωπαϊκής ενοποίησης.
A 56% of European citizens are afraid that they may lose their social benefits because of the European Union.

Negation

Η ευηµερία δεν συναρτάται µόνο µε τον αριθµό αυτοκινήτων ή τηλεοράσεων ανά κάτοικο.
Prosperity isn't related only to the number of cars or televisions per inhabitant.

Coordinated clauses

Βελτιώνονται και συµπληρώνονται ισχύοντες θεσµοί.
The institutions in effect are improved and completed.

Coordinated NPs

Γραφειοκρατία, Αδράνεια, Αλαζονεία και Μετριοκρατία είναι οι πάντα παρόντες εχθροί µας.
Bureaucracy, Inertness, Arrogance and Mediocrity have always been our present enemies.

Abbreviations

Οι υποψήφιοι για το ΤΕΙ Μαιευτικής Αθήνας και Θεσσαλονίκης ήταν φέτος 139.
There were 139 candidates for the Obstetrics TEI of Athens and Thessaloniki.

Proper Nouns

Η Θεσσαλονίκη είναι µια πόλη µε ζωντανό παρελθόν.
Thessaloniki is a city with a vivid past.

NP modifiers

Ο σκοπός του φαίνεται ότι ήταν να αφηγηθεί µια αντιηρωική ιστορία.
His aim was to tell an anti-heroic story.

PP complements

Τους τελευταίους µήνες η ελληνική κοινωνία δοκιµάστηκε από τις κινητοποιήσεις των αγροτών, εδώ στη Θεσσαλία.
During the last months the Greek society suffered from the farmer's movement, here in Thessaly.

Adverbs

∆υστυχώς δεν κατάφερα ακόµη να µάθω από πού θα προµηθευτείτε το µαγικό χαρτάκι.
Unfortunately, I have not managed yet to find out where we can get this magic paper from.

Postnominal genitives

Ο προϊστάµενος της τράπεζας ανακοινώνει το τέλος της ύφεσης.
The bank director announces the end of the recession.


2.4 KUL

There are three series of phenomena the KUL team would like to investigate and test.

2.4.1 Lexical phenomena

∗ Dutch separable verbs

In Dutch, there are thousands of separable verbs, which are usually composed of a simple verb and a particle. However, in some cases, the separable part is a noun or an adjective.

NL: Het Somalische parlement komt voor de eerste keer samen in het land zelf.

EN: Somalia's parliament meets inside the country for the first time.

In the Dutch sentence, 'komt' is the head verb (3PS simple present) and 'samen' the separable part (postposition) of the verb 'samenkomen', which translates as 'to meet'.
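Before dictionary lookup, the separated particle has to be reunited with its verb. The following is a minimal sketch of that step; the POS tags and the toy lexicon are illustrative assumptions, not the actual METIS-II resources or formats.

```python
# Minimal sketch: reunite a separated particle with its finite verb so the
# separable verb can be looked up as a single lexicon entry.
# The POS tags and the toy lexicon below are illustrative assumptions.
SEPARABLE_VERBS = {("komt", "samen"): "samenkomen"}

def reattach_particle(tokens):
    """tokens: list of (word, pos) pairs for one clause."""
    for i, (word, pos) in enumerate(tokens):
        if pos != "VERB":
            continue
        # the particle usually sits towards the end of the clause
        for j in range(len(tokens) - 1, i, -1):
            lemma = SEPARABLE_VERBS.get((word, tokens[j][0]))
            if lemma:
                return lemma
    return None

clause = [("komt", "VERB"), ("voor", "ADP"), ("de", "DET"),
          ("eerste", "ADJ"), ("keer", "NOUN"), ("samen", "PART")]
print(reattach_particle(clause))  # -> 'samenkomen'
```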

∗ Compounds

In principle, one can make an infinite set of compounds in Dutch, like in all Germanic languages. Most Germanic languages (except for English) write these compounds as one word. This means a dictionary can never be exhaustive. It is very useful to have a compound detector in order to translate 'unknown' words.

NL: De EU-ministers bespraken ook de economische impact op de pluimvee-industrie en de kwestie van compensaties voor de landbouwers.

EN: EU ministers also discussed the economic impact on the poultry industry and the issue of compensation for farmers.

The Dutch 'pluimvee-industrie' is a word one does not expect to turn up in a dictionary. However, splitting it up into 'pluimvee' (poultry) and 'industrie' (industry) makes it very easy to translate it into 'poultry industry'. We have to be careful, though, not to split it up into 'pluim', 'vee' and 'industrie', since we would end up with a totally wrong translation: 'feather cattle industry'.
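Such a compound detector can be sketched as a recursive dictionary lookup that prefers the analysis with the fewest parts, which is exactly what rules out the 'pluim' + 'vee' reading. The lexicon here is a toy assumption:

```python
# Sketch of dictionary-based compound splitting; preferring the fewest
# parts rules out 'pluim'+'vee'+'industrie'. Toy lexicon for illustration.
LEXICON = {"pluimvee", "pluim", "vee", "industrie"}

def split_compound(word, lexicon=LEXICON):
    """Return the split into known parts with the fewest parts, or None."""
    word = word.replace("-", "")
    if word in lexicon:
        return [word]
    best = None
    for i in range(1, len(word)):
        head = word[:i]
        if head not in lexicon:
            continue
        tail = split_compound(word[i:], lexicon)
        if tail is not None and (best is None or 1 + len(tail) < len(best)):
            best = [head] + tail
    return best

print(split_compound("pluimvee-industrie"))  # -> ['pluimvee', 'industrie']
```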

∗ Verbs with fixed prepositions, prepositional objects, ...

These have to be in the lexicon as a kind of collocation.


NL: Door de staking moesten de mensen een half uur op de bus wachten.

EN: The people had to wait for the bus for half an hour because of the strike.

'wachten op' translates as 'wait for'.

∗ Insertion and deletion of prepositions (category change)

NL: De verkoop in Italië is met 70% gedaald.

EN: Sales in Italy have plunged 70%.

The prepositional object 'met 70%' is translated as a direct object in English.

NL: De grote zwarte hond joeg achter de postbode.

EN: The big black dog chased the postman.

∗ Collocations

NL: Hedendaagse soorten zijn verwant aan elkaar door een gemeenschappelijke afstamming.

EN: Contemporary species are related to each other through common descent.

'verwant zijn aan' (to be related to) is a collocation in Dutch.

∗ Pluralia tantum

Some words are used in the plural in one language and in the singular in the other.

NL: Het was in die tijd dat hij gedichten begon te schrijven.

EN: It was during this period that he started writing poetry.

'Gedichten' is the plural of 'gedicht' (poem) but is here better translated as 'poetry'.

∗ Function words

Function words can easily be inserted or deleted.

NL: Zij waren geschokt door de onthulling dat ze net 335 pond aan een paar schoenen had uitgegeven.


EN: They were shocked by the revelation that she had just spent £335 on a pair of shoes.

'Een paar schoenen' in Dutch becomes 'a pair OF shoes' in English.

∗ Indications of time

Times and dates have to be reordered or rewritten according to national conventions.

NL: Dit genootschap hield zijn eerste meeting op 16 juni met Yeats als voorzitter.

EN: This society held its first meeting on June 16, with Yeats in the chair.

NL: Ze zijn steeds om half zes 's morgens opgestaan om naar het cricket te kijken.

EN: They have been getting up at half past five in the morning to watch the cricket.

∗ Numbers

Rewrite numbers according to national conventions (see the sketch at the end of this section).

NL: Voor de tweede wereldoorlog leefden er 200.000 joden in Wenen.

EN: Before World War II, 200,000 Jews lived in Vienna.

∗ Proper names

Recognition and transfer of non-translatable proper names:

NL: Kronwall speelde in de plaats van de gekwetste Mattias Ohlund.

EN: Kronwall was playing in place of the injured Mattias Ohlund.

Kronwall and Mattias Ohlund have to be kept as they are. We have to take care that some names of cities, mountains, rivers, historic persons and country names have to be translated: Parijs/Paris, de Alpen/the Alps, Schelde/Scheldt, Aristoteles/Aristotle.

∗ Homonyms in the source language / Polysemous words and words with different translations

In Dutch you say "met de fiets rijden" and "met de auto rijden", while in English you ride a bike but you drive a car. "Portier" is a homonym in Dutch, the masculine noun translating as "doorkeeper, porter" and the neuter one translating as "car door".


∗ Insertion/deletion of lexical units

NL: Galler stelt alles in het werk om u zijn producten in goede staat te leveren.

EN: Galler makes every effort to ensure that his products reach you in perfect condition.

The 'to ensure' is not in the Dutch sentence, but is needed to construct a fluent English sentence. This is actually a very difficult one, certainly when dealing with insertion.

∗ Dutch still has three genders, while English only refers to animates with non-neuter pronouns.

NL: Schakel de oven niet in wanneer hij leeg is.

EN: Do not operate the oven when it is empty.

The noun 'oven' is masculine in Dutch, so it is referred to with 'hij', while 'oven' is inanimate in English and gets referred to with 'it'.

NL: Haar voorstanders argumenteren dat het leven op Aarde te complex is om zelfstandig geëvolueerd te zijn.

EN: Its proponents argue life on Earth is too complex to have evolved on its own.

'Haar' is feminine and refers to an antecedent 'theorie' in a previous sentence, while English has 'its' for 'theory'.
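For the Numbers item above, the conversion of continental number formatting to English conventions could look as follows. This is a naive regular-expression sketch; real text would need extra care with dates, ordinals and decimal commas.

```python
import re

# Naive sketch: swap thousands separator '.' and decimal comma ',' in
# digit groups, converting '200.000' to '200,000' and '3,14' to '3.14'.
def continental_to_english(text):
    def swap(match):
        group = match.group(0)
        return (group.replace(".", "\x00")
                     .replace(",", ".")
                     .replace("\x00", ","))
    return re.sub(r"\d[\d.,]*\d", swap, text)

print(continental_to_english("Voor de oorlog leefden er 200.000 joden in Wenen."))
# -> 'Voor de oorlog leefden er 200,000 joden in Wenen.'
```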

2.4.2 Syntactic phenomena

∗ Verb tenses, do insertion,...

Negative, interrogative and emphasised sentences without modal or auxiliary verbs get 'do' insertion in English.

NL: Laat de televisie niet zonder toezicht of 's nachts in stand-by staan.

EN: Do not leave the television in stand-by unattended or overnight.

∗ Translation of verb tenses, e.g. choosing between the simple and the continuous tenses in English...

NL: Kronwall speelde in de plaats van de gekwetste Mattias Ohlund.

EN: Kronwall was playing in place of the injured Mattias Ohlund.


∗ Structure changes

For example, from active to passive:

NL: Men verwacht dat het Japanse moederbedrijf vanavond een formele aankondiging zal doen.

EN: The Japanese parent company is expected to make a formal announcement this evening.

A classic one:

NL: Hij zwemt graag.

EN: He likes to swim.

∗ Sentences with several sub-clauses

We investigate if sentences with several, possibly nested, sub-clauses keep their structure in translation. Possibly, in translation, sentences get merged or split.

∗ Translating verb clusters

NL: De kinderen gaan zwemmen.

EN: The children go and swim.

The verb cluster is substituted with a conjunction of verbs.

NL: Hij begint te rennen.

EN: He starts to run.

∗ The degrees of comparison

Check whether all degrees of comparison translate smoothly.

NL: Laat het mes nooit langer dan één minuut draaien.

EN: Never have the blade turn for more than one minute.

∗ Questions and interrogative words

NL: Waarom zouden ze zoiets negeren?

EN: Why should they ignore a thing like that?


2.4.3 Word, phrase and sub-clause order issues

∗ Word order within NPs

NL: De grote zwarte hond joeg achter de postbode.

EN: The big black dog chased the postman.

∗ Phrase order

NL: Voor de tweede wereldoorlog leefden er 200.000 joden in Wenen.

EN: Before World War II, 200,000 Jews lived in Vienna.

In Dutch main clauses, the main verb 'leefden' always occupies the second position, while in English the subject always is in front of the main verb.

∗ Sub-clause order and position within the main clause; coordination

NL: Daya Nayak, die geschorst is, ontkent de beschuldigingen.

EN: Daya Nayak, who has been suspended, denies the charges.

NL: Veel van zijn werken behoren tot het standaard concertrepertorium en staan wijd en zijd bekend als meesterstukken van de klassieke stijl.

EN: Many of his works are part of the standard concert repertory and are widely recognized as masterpieces of the classical style.

NL: Werknemers moeten hun vaardigheden constant bijschaven, mobiel zijn en hun loopbaan op tijd evalueren.

EN: Employees have to polish up their skills all the time, be mobile, and evaluate their career in time.

In coordination, English also writes a comma before the final 'and', which is not there in Dutch.


3. Consortium-wide experiments

In October 2005, a first validation experiment within the consortium was performed (see deliverable D4.1, Appendix I).

A second consortium-wide validation experiment is expected to take place in June 2006 (month 21 of the project). The goal of this experiment is to test the progress of the translation systems for each of the four language pairs involved, as the project reaches its midpoint, and to ensure that the processing chain is set up for all languages.

Although most of the target language processing is common to all teams involved in the METIS-II project, there are certain parts of the process, such as the Expander, that may follow different strategies (see D3.3). In order to better evaluate the performance of the different approaches, some of the data will also be crosschecked among the teams (see Section 3.2).

In this second experiment, apart from obtaining the usual metrics, we also plan to perform a thorough analysis of the results. This analysis should provide the developers with the necessary feedback to improve both the resources and the modules. The experiment will help fine-tune the system so that it can become the first prototype (to be delivered in September 2006).

The first validation experiment, performed at a very early stage of the project (end of first year), presented a few restrictions on the task and the test material, restrictions that will be avoided in the second experiment. A comparison between the definitions of the two experiments is summarised in Table 1.

Table 1: Validation Experiment 2 compared to Experiment 1

Task

  Experiment 1: English-like to lemmatised English strings
  Experiment 2: Dutch, German, Greek, Spanish to English (lemmatised) translation

  Experiment 1: Lemma to lemma translation
  Experiment 2: Word to lemma translation

Test material

  Experiment 1: Test material originally in English, backtranslated into each language
  Experiment 2: Test material originally in each of the 4 languages

  Experiment 1: Simple sentences (single clause, 7-15 words)
  Experiment 2: Free text

  Experiment 1: Single words (no expressions)
  Experiment 2: Free text

  Experiment 1: 15 sentences per site (60 total)
  Experiment 2: 250 sentences per site (1000 total)

  Experiment 1: Restricted English vocabulary (has to be in the BNC and the bilingual dictionary)
  Experiment 2: Unrestricted English vocabulary


3.1 Description of the common validation experiment

The common validation experiment to be carried out in June 2006 will consist of the following steps:

a) Each site prepares the test material for each language, consisting of 250 sentences. The material is distributed evenly among four different categories:

a. 70 sentences illustrating grammatical phenomena (defined by each site as in Section 2);

b. 60 sentences from newspapers;

c. 60 sentences from encyclopaedia articles, or a similar source of non-specialised scientific text;

d. 60 sentences from technical manuals, or similar source of technical text.

Vocabulary and syntactic constructions used in these sentences should belong to general language (as opposed to being exclusively technical, for example), although it is not a pre-requisite of this experiment that all the words appear in the BNC or in the dictionaries.

b) Each site prepares three English reference translations of the test material (if possible, translated by 3 different professional translators), and provides the translations in formats suitable for evaluation (see Annex 2). Since we are testing word-to-lemma, reference translations need to be lemmatised before evaluation is performed.

c) Each site processes their portion of the test material (250 sentences) and provides translation results in the specified format (see Annex 2).

d) A unified evaluation is carried out on the results provided by each site using standard metrics. At the same time, each group performs an analysis on the results of their own language pair, using a common template (see Section 3.3).
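By way of illustration, the scoring in step d) could look as follows for a single sentence, with NLTK's BLEU implementation used here as a stand-in for the metric definitions of Annex 2; the lemmatised tokens are made-up examples.

```python
# Illustration of step d): scoring one lemmatised hypothesis against three
# lemmatised references. NLTK's BLEU is used as a stand-in for the metric
# definitions of Annex 2; tokens below are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "the headmaster give award to the student".split(),
    "the principal give award to the student".split(),
    "the headmaster hand award to the student".split(),
]
hypothesis = "the headmaster give award to student".split()

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```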

3.2 Description of the cross-approach part of the experiment

A part of the test data will be cross-checked among the teams in order to better assess the performance of the different approaches coexisting within the project.

For this purpose, a set of 50 sentences will be selected by each group out of the 250 sentences used to validate each language pair. This subset of sentences should be chosen avoiding language-specific translation problems, such as the translation of the Spanish ‘se’ pronoun, for example. A list of core grammatical phenomena considered to occur in the four languages is given in Annex 1 for guidance.

The cross-approach part of the experiment will consist of the following steps:

a) Each site selects 50 sentences from their test material (see Section 3.1), according to the following criteria:


a. A minimum of 30 sentences will illustrate some contrastive grammatical phenomena common to all the language pairs, according to the common core list in Annex 1;

b. The rest of the sentences (up to 20) should be unproblematic from the point of view of translation divergence of the kind discussed in section 2.

b) Each site pre-processes the sentences up to the output of the bilingual dictionary and encodes the results using the input format agreed upon by the consortium members (see Annex 3).

c) Each site provides the other groups with the data obtained in the previous step.

d) Each site processes the input data from the other groups through the remaining stages (Expander, Search Engine) up to the lemmatized English translation.

e) All translation results are delivered in the format specified in Annex 2.

f) An evaluation is carried out on the results using standard metrics.

g) A comparison and analysis of the results of the different sites is performed.

3.3 Analysis of the results

The objective of this analysis is to gather as much information as possible on the state of the translation systems and on the experiment itself. For that purpose, it is required that each site describes in detail the conditions and results of the experiment and performs error analysis in a unified way, identifying the potential sources of errors and describing ways to overcome them.

3.3.1 Description of test data and evaluation results

Table 2 summarises the information required about the general conditions of the experiment.


Table 2: Template for the description of the conditions of the experiment

Site
Main Language

Resources
  Number of entries in dictionary
  Average number of translations for each entry in dictionary
  Lemmatiser
  POS-tagger
  Chunker (if any)
  Other resources (if any)

Technical information
  Processor type [e.g. Intel-4]
  CPU speed
  RAM
  System [Linux (version?), Windows (version?)…]

Table 3 summarises the required information about the test data and the evaluation results for each of the sentences.

Table 3: Template for the description of results (per sentence)

Sentence information
  SL [e.g. Spanish]
  Site (performing evaluation) [e.g. FUPF]
  Sentence id
  Type of input: Grammatical / Newspaper / Encyclopaedia / Technical
  Grammatical phenomenon (1) [e.g. Different complementation pattern]
  Number of words

Translation evaluation
  Automatic measures (2)
    BLEU
    NIST
    Levenshtein
  Human ratings (see below)
    Intelligibility / fluency: scale from 3 to 0
    Correctness / adequacy / fidelity: scale from 5 to 1

Error analysis
  Source of error(s): Preprocessing / Dictionary / Expander / Search Engine / Target corpus
  Further categorize error [e.g. Dictionary: missing entry, missing translation, wrong information]

(1) Only if Type of input is Grammatical, i.e. the sentence belongs to an engineered test suite and is not free text.
(2) See Annex 2.
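Of the automatic measures in Table 3, the Levenshtein score is the simplest to state. A minimal sketch, assuming it is computed as edit distance over lemma tokens (Annex 2 gives the definitions actually used):

```python
# Minimal sketch of the Levenshtein measure of Table 3, assuming edit
# distance over lemma tokens (see Annex 2 for the definition actually used).
def levenshtein(ref_tokens, hyp_tokens):
    m, n = len(ref_tokens), len(hyp_tokens)
    # dist[i][j]: edits turning the first i reference tokens into the first j
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]

print(levenshtein("wait for the bus".split(), "wait the bus".split()))  # -> 1
```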


3.3.2 Human ratings

∗ Intelligibility / fluency

Intelligibility is one of the most frequently used metrics of output quality. Numerous definitions (or protocols for measuring it) have been proposed, for instance in Van Slype’s report or in the DARPA 1994 evaluations. We outline here the definition proposed by T.C. Halliday in (Van Slype, 1979, p. 70), which measures intelligibility on a 4-point scale (0 to 3).

Intelligibility or comprehensibility expresses how intelligible the output of a translation device is under different conditions (for instance, when sentence fragments are translated while being entered, or after each sentence). Comprehensibility reflects the degree to which a complete translation can be understood. Intelligibility can be based on the general clarity of the translation, and the output can be considered in its entirety or by segments out of context.

The following scale of intelligibility has been proposed, from 3 to 0, 3 being the most intelligible:

3 – Very intelligible: all the content of the message is comprehensible, even if there are errors of style and/or of spelling, and if certain words are missing or are badly translated, but close to the target language.

2 – Fairly intelligible: the major part of the message passes.

1 – Barely intelligible: only a part of the content is understandable, representing less than 50% of the message.

0 – Unintelligible: nothing or almost nothing of the message is comprehensible.

∗ Correctness / adequacy / fidelity

This evaluation metric reprises the DARPA 1994 adequacy test (Doyon, Taylor, and White, 1996). As with that test, the reference translation or "authority version" is placed next to each of the translations of the source text, to be used as a comparison against each one, human or machine. Before the test is performed, both the "authority version" and each of the translations should be segmented, with each text separated into sentence fragments to appear next to the corresponding fragment in the translation.

Once each translation is lined up with its equivalent, evaluators grade each unit on a scale of one to five, where five represents a paragraph containing all of the meaning expressed in the corresponding text. The Adequacy scale is as follows:

5 – All meaning expressed in the source fragment appears in the translation fragment

4 – Most of the source fragment meaning is expressed in the translation fragment

3 – Much of the source fragment meaning is expressed in the translation fragment

2 – Little of the source fragment meaning is expressed in the translation fragment


1 – None of the meaning expressed in the source fragment is expressed in the translation fragment

3.3.3 Error Analysis

The information gathered using the template in Table 3 for each of the sentences can be used by the groups to gain insight into the main problematic areas of their systems, as well as to perform a cross-comparison of the results.

By categorizing the errors found into different classes, the teams are expected to fine-tune their systems and improve the overall performance.

Here follow some templates with suggestions on how to perform this categorization.

Table 4: Template for the description of the test data and the global results of the validation experiment

Test material
  Number of sentences
  Number of words
  Average number of words per sentence

Global results
  Average BLEU score
  Average NIST score
  Average Levenshtein
  Number of unknown words
  Average number of unknown words per sentence with unknown words
  Average translation speed (per sentence)

Table 5: Average scores depending on type of sentence

Type BLEU NIST Lev Intell Fidel

Grammatical

Encyclopaedia

Technical

Newspapers


Table 6: Average scores depending on grammatical phenomenon

Grammatical phenomenon BLEU NIST Lev Intell Fidel

Different complementation pattern

Word order

Noun complementation

...
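A sketch of how the averages in Tables 5 and 6 could be computed from the per-sentence records of Table 3; the field names and numbers below are placeholders, not project results.

```python
# Sketch: aggregating per-sentence records (Table 3) into the averages of
# Tables 5 and 6. Field names and numbers are placeholders, not results.
from collections import defaultdict

records = [
    {"type": "Newspaper", "bleu": 0.31, "intell": 2},
    {"type": "Newspaper", "bleu": 0.27, "intell": 3},
    {"type": "Technical", "bleu": 0.42, "intell": 3},
]

totals = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for rec in records:
    counts[rec["type"]] += 1
    for metric in ("bleu", "intell"):
        totals[rec["type"]][metric] += rec[metric]

for sentence_type, n in counts.items():
    averages = {m: round(s / n, 3) for m, s in totals[sentence_type].items()}
    print(sentence_type, averages)
```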

3.3.4 Cross-approach comparison

The error analysis of the crosschecked part of the data will include a comparison among the results of the four sites.

Starting from the information about each translated sentence gathered by the teams as proposed in Section 3.3.1, the following comparisons could be performed.

For each of the 200 sentences (50 provided by each of the groups), the three automatic metrics and the two human measures are compared, as shown in the following tables.

For each sentence the following identifying information is recorded: SL, Sent Id, Grammatical Phenomenon, Number of words.

Table 7: Comparison of evaluation metrics for a given sentence

Sent Id=i BLEU NIST Lev Intell Fidel

FUPF

GFAI

ILSP

KUL


Also, the kind of averages proposed in Section 3.3.3 can be calculated, as illustrated in the following tables:

Table 8: Comparison of average evaluation metrics

Total sents (200)

Average BLEU Average NIST Average Lev Average Intell Average Fidel

FUPF

GFAI

ILSP

KUL

Table 9: Comparison of average evaluation metrics on short sentences

Number of words < n BLEU NIST Lev Intell Fidel

FUPF

GFAI

ILSP

KUL

Table 10: Comparison of average metrics on sentences with word order changes between SL and English

Word order BLEU NIST Lev Intell Fidel

FUPF

GFAI

ILSP

KUL

Ideally, this part of the experiment should be run on sentences not presenting particular problems on the SL side (i.e. pre-processing and dictionary).


4. Evaluation in comparison to other MT systems

For the consortium to define the place of METIS-II in both the current market and research on MT, it is important to evaluate it against existing MT systems. Two kinds of evaluation are foreseen:

a) Comparison with SYSTRAN (a rule-based MT system existing for all 4 languages in the project), based on the standard metrics described in Annex 2. While SYSTRAN will surely perform better than METIS-II, the development times of SYSTRAN and METIS-II have to be taken into account. This evaluation will take place at the same time as the user evaluation described in Section 5, that is, after the delivery of the two prototypes foreseen in the project.

b) Participation in an international contest (depending on the language pairs required by the contest), e.g. the TC-Star Evaluation Campaign 2007. This will allow us to compare our system with statistical machine translation systems.

5. User evaluation

Information provided by users is key in the development of any MT system. For this reason, it is foreseen that a set of professional translators and translation professors will evaluate the METIS-II system, once the prototype is set up and running.

This evaluation will not only take translation quality into account, but will also focus on the efficiency factor, i.e. how much time the METIS-II system is able to save the human translator when translating different types of texts.

Each site has already established a user group for the identification of user requirements (see deliverable D2.1). If possible, all or some members of each user group will participate in the evaluation of METIS-II. FUPF will define a questionnaire and procedure for the users to perform this evaluation.

Evaluation by users will take place at two stages of the project, immediately after the delivery of the actual running prototypes: October 2006 for the first prototype (month 25 of the project), and July-August 2007 for the final prototype (month 35 of the project).

6. Calendar

The calendar of the validation and evaluation procedures as defined in this document is summarised in Table 11.


Table 11: Calendar for validation and evaluation within METIS-II

Event Date

Validation experiment I October 2005

Site-internal validation [Regularly]

Validation experiment II June 2006

First METIS-II prototype September 2006

Human and machine evaluation I October 2006

Final METIS-II prototype July 2007

Human and machine evaluation II July-August 2007

References

FEMTI - a Framework for the Evaluation of Machine Translation in ISLE (http://www.isi.edu/natural-language/mteval/)

King, Margaret and Kirsten Falkedal. Using Test Suites in Evaluation of Machine Translation Systems. In Proceedings of the Thirteenth International Conference on Computational Linguistics (COLING-90), Helsinki, 1990.

Netter, Klaus, Armstrong, Susan, Kiss, Tibor, Klein, Judith, Lehmann, Sabine, Milward, David, Regnier-Prost, Sylvie, Schäler, Reinhard and Wegst, Tillman (1998). DiET - Diagnostic and Evaluation Tools for Natural Language Application. In Proceedings of the 1st International Conference on Language Resources and Evaluation, May 28-30. Granada, Spain. (573-579).

Oepen, S. and D. Flickinger. Towards systematic grammar profiling: Test suite technology 10 years after. Computer Speech and Language, 12:411-435, 1998.

Oepen, Stephan, Klaus Netter & Judith Klein. "TSNLP - Test Suites for Natural Language Processing", John Nerbonne (ed.), Linguistic Databases, CSLI Publications, 1998, pp. 13-36.

Papineni, K., S. Roukos, T. Ward and W. Zhu (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pages 311-318.


Annex 1: Preliminary list of grammatical phenomena common to the four SL

1) Main clause: 0 arguments [weather verbs]

∗ Llovía.

∗ έβρεξε

∗ [It rained.]

2) Main clause: 1 argument

a. Intransitive verbs, NP subject

∗ María corre.

∗ Η Μαρία τρέχει

∗ [Mary runs.]

b. Intransitive verbs, clausal subject

∗ Suceden cosas extrañas.

∗ Συµβαίνουν παράξενα πράγµατα

∗ [Strange things happen.]

c. Impersonal verbs

∗ Hay problemas.

3) Main clause: 2 arguments

a. Transitive verb, NP Direct Object

∗ Juan entregó el libro.

∗ Ο Γιάννης παρέδωσε το βιβλίο.

∗ [He handed over the book.]

b. Transitive verb, clausal Direct Object

∗ Exijo que Juan arregle el coche.

∗ Απαιτώ να επισκευάσει το αυτοκίνητο ο Γιάννης.

∗ [I demand that John repairs the car.]

c. Transitive verb, indirect interrogative

∗ Nosotros sabemos qué compró Juan.

∗ Ξέρουµε τι αγόρασε ο Γιάννης.


∗ [We know what John bought.]

d. Copulative verb, nominal attribute

∗ El doctor es ese hombre alto.

∗ Ο γιατρός είναι εκείνος ο ψηλός.

∗ [The doctor is that tall man.]

e. Copulative verb, adjective attribute

∗ El doctor es alto.

∗ Ο γιατρός είναι ψηλός.

∗ [The doctor is tall.]

f. Bound prepositional complement

∗ Felipe habló con el encargado.

∗ Ο Φίλιππος µίλησε στον διευθυντή.

∗ [Philip talked to the manager.]

4) Main clause: 3 arguments

a. Ditransitive verb, NP DO, PP-a Indirect Object

∗ Juan entregó el libro a María.

∗ Ο Γιάννης παρέδωσε το βιβλίο στη Μαρία.

∗ [John handed the book to Mary.]

b. Ditransitive verb, clausal DO, PP-a IO

∗ María exigió a Pedro que cambiara de estrategia.

∗ Η Μαρία απαίτησε από τον Πέτρο να αλλάξει ταχτική.

∗ [Mary demanded Peter to change his tactics.]

c. Transitive verb, NP DO, Object attribute

∗ El consejo nombró a Juan presidente.

∗ Το Συµβούλιο διόρισε πρόεδρο τον Γιάννη.

∗ [The council appointed John as president.]

d. Movement verbs, origin and goal complements

∗ María iba de casa a la estación.

∗ Η Μαρία πήγε από το σπίτι στο σταθµό.

∗ [Mary went from the house to the station.]


5) Time, place and other modifiers

a. Time modifiers

∗ El profesor llegó ayer por la mañana.

∗ Ο καθηγητής έφτασε χθές το πρωί.

∗ [The professor arrived yesterday morning.]

b. Place modifiers

∗ Había muchos libros sobre el escritorio.

∗ Υπήρχαν πολλά βιβλία πάνω στο γραφείο.

∗ [There were a lot of books on the desk.]

c. Manner modifiers

∗ María iba de casa a la estación en tren.

∗ Η Μαρία πήγε από το σπίτι στο σταθµό µε το τραίνο.

∗ [Mary went from the house to the station by train.]

d. Other modifiers

∗ Ella compró el libro por tres dólares.

∗ Αγόρασε το βιβλίο για τρία δολλάρια.

∗ [She bought the book for three dollars.]

6) Negation

∗ No está lloviendo.

∗ ∆εν βρέχει.

∗ [It is not raining.]

7) NP structure

a. NP modifiers (adjectives, mod PPs, possessive)

∗ El libro rojo de Mao está en la biblioteca.

∗ Το Κόκκινο Βιβλίο του Μάο είναι στην βιβλιοθήκη.

∗ [Mao’s red book is in the library.]

b. NP coordination

∗ Tengo un hermano y una hermana en España.

∗ Έχω έναν αδελφό και µία αδελφή στην Ισπανία.

∗ [I have a brother and a sister in Spain.]


8) DETP structure

∗ Luego vendrán todos los demás niños.

∗ Αργότερα θα έρθουν όλα τα άλλα παιδιά.

∗ [Later all the other children will come.]

9) Control structures

a. Object control verbs

∗ Los niños quieren ver la película.

∗ Τα παιδιά πήγαν να δούν την ταινία.

∗ [The children want to see the movie.]

10) Coordination

∗ Entré en la tienda y te compré una agenda.

∗ Μπήκα στο µαγαζί και σου αγόρασα ένα ηµερολόγιο.

∗ [I entered the store and bought you a diary.]

11) Different complementation pattern

a. Direct Object => Prep Object

∗ Espera el autobús.

∗ Περίµενε το λεωφορείο.

∗ [Wait for the bus.]

b. Human Direct Object (PP-a) => Direct Object (NP)

∗ Contesta al profesor.

∗ Απάντησε στον δάσκαλο.

∗ [Answer the teacher.]

c. Non-human DO + Ind O => Human DO (NP) + Non-human Prep O

∗ Yo le pediré dinero a mi amigo.

∗ Θα ζητήσω λεφτά από τον φίλο µου.

∗ [I will ask my friend for money.]

12) Homonym in the SL

∗ La capital acumula todo el capital.

∗ [The capital city accumulates all the money.]

13) Word Order

a. Phrase order


∗ Voor de tweede wereldoorlog leefden er 200.000 joden in Wenen.

∗ Before World War II, 200,000 Jews lived in Vienna.

14) Collocations

∗ Hedendaagse soorten zijn verwant aan elkaar door een gemeenschappelijke afstamming.

∗ Contemporary species are related to each other through common descent.

15) Function words

∗ Zij waren geschokt door de onthulling dat ze net 335 pond aan een paar schoenen had uitgegeven.

∗ They were shocked by the revelation that she had just spent £335 on a pair of shoes.


Annex 2: Automatic evaluation metrics

The results of the METIS system will be automatically evaluated using the BLEU metric as originally defined by IBM (Papineni et al., 2002), one of the best-known automatic procedures for evaluating translation. As has been extensively shown in the literature, this metric, which is based on position-independent n-gram matching, correlates well with human evaluations.

We also use a newer metric, known as NIST, which is a modification of IBM's original BLEU3. This modified score demonstrates comparable or better performance in discriminating between the translation quality of different systems.

To compute both metrics, we have used the MTEval application developed by the National Institute of Standards and Technology (http://www.nist.gov).

The application can be downloaded from the NIST website: http://www.nist.gov/speech/tests/mt/mt2001/resource.

Both BLEU and NIST rely on a reference corpus built from good-quality human translations, together with a numeric score that measures the distance between each machine-translated sentence and the reference translations. They do so by comparing all the n-grams that make up the sentence, irrespective of their position. Because a sentence can have more than one valid translation, this method ideally requires more than one reference (e.g. three) for each translated sentence.

The BLEU metric measures performance on a scale of 0 to 1, with 1 being the best possible score. For NIST scores it likewise holds that higher is better, but a particular feature of the NIST metric is that scores increase with test-set size: as the test set grows, the number of distinct n-grams, and thereby the information gain associated with each n-gram, also increases.
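
To make the computation concrete, the following is a minimal sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty, in the spirit of Papineni et al. (2002). It is an illustration only, not the NIST MTEval implementation, and all function names are our own:

import math
from collections import Counter

def ngram_counts(tokens, n):
    # multiset of all n-grams of the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    # candidate: list of tokens; references: list of token lists
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # clip each candidate n-gram count by its maximum count in any reference
        clipped = sum(min(c, max(ngram_counts(ref, n)[g] for ref in references))
                      for g, c in cand.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # brevity penalty, computed against the reference closest in length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c >= r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)

With several references per sentence, the clipping step rewards candidate n-grams that appear in any of them, which is why the metric benefits from multiple references.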

Because both BLEU and NIST are based on n-gram counting, we cross-check their results with a third metric, the Levenshtein edit distance.
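
A minimal dynamic-programming sketch of the word-level Levenshtein distance follows; again this is an illustration under our own naming, not a project component:

def levenshtein(hyp, ref):
    # minimum number of insertions, deletions and substitutions
    # needed to turn the token list hyp into the token list ref
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

For instance, levenshtein("uncertain weather force NASA".split(), "unsettled weather force NASA".split()) returns 1, i.e. one substitution.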

Format required for the output translations

∗ Translation to be evaluated

<trans SL=dutch id=1 site=KUL> uncertain weather force NASA the landing of the space shuttle to delay </trans>

∗ Reference translations

<ref1 SL=dutch id=1> unsettled weather force NASA to delay the landing of the space shuttle </ref1>

3 The NIST score (developed at NIST) is the simple information-weighted sum of N-gram co-occurrences, normalized by number of N-gram occurrences and summed over 1-, 2-, 3-, 4- and 5-grams.


Annex 3: Expander Input Format Description (ILSP’s proposal)

In what follows we present a proposal for the common input format for the expanders. According to METIS specifications, information used for searching the target language corpus includes at most:

1. pos tags (Source and Target language)

2. Target language lemma

3. Expressions in Target language (2 and 3 are provided by the bilingual lexica)

4. Chunking information (of the source language)

The input format proposed here covers these types of information.

To develop the common input format for the expanders, the XML format of the METIS search engine input proposed by Antoni Oliver has been extended to include chunking information. In what follows, two equivalent formats are proposed. In the first format, translation and chunking information are kept discrete and co-indexed, whereas in the second format translation information is embedded within the chunking information. Below, the format of the SL sentence is given together with pos tag and lemma information (this format is the one proposed by Antoni Oliver).

<metis-translation>
  <SourceSentence>
    <source-unit id="1">
      <token>Ο</token>
      <lemma>o</lemma>
      <pos>AtDfMaSgNm</pos>
    </source-unit>
    <source-unit id="2">
      <token>πρόεδρος</token>
      <lemma>πρόεδρος</lemma>
      <pos>NoCmMaSgNm</pos>
    </source-unit>
    <source-unit id="3">
      <token>έδωσε</token>
      <lemma>δίνω</lemma>
      <pos>VbMnIdPa03SgXxPeAvXx</pos>
    </source-unit>
    <source-unit id="4">
      <token>συγχαρητήρια</token>
      <lemma>συγχαρητήρια</lemma>
      <pos>NoCmNePlAc</pos>
    </source-unit>
    <source-unit id="5">
      <token>στους</token>
      <lemma>στου</lemma>
      <pos>AsPpPaMaPlAc</pos>
    </source-unit>
    <source-unit id="6">
      <token>ποδοσφαιριστές</token>
      <lemma>ποδοσφαιριστής</lemma>
      <pos>NoCmMaPlAc</pos>
    </source-unit>
    <source-unit id="7">
      <token>.</token>
      <lemma>.</lemma>
      <pos>PUN</pos>
    </source-unit>
  </SourceSentence>
  <expander-input>
    .....
  </expander-input>
</metis-translation>

Next, the two formats introducing chunking information are presented.

1. First proposal

1.1 Description of the elements

expander-input, trans-unit, option, token-trans, lemma, pos, extra_information: These elements are the same as in the original input format proposal. The only difference is that within the extra_information tag we enclose the id of the source token from which this trans-unit is derived. All source token information is contained in the SourceSentence element of the input format.

New elements

∗ translation: it denotes the information regarding token translation.

∗ chunking: it denotes chunking information.

∗ chunking-option: it denotes a chunking option for the input sentence. For each different chunking of the sentence, a new chunking-option element is created.

∗ clause: it denotes a clause. Clauses have a unique id. Clauses contain one or more chunks and may also contain trans-units. Clauses differ from chunks in that they do not have a head token, whereas chunks must have one.

∗ chunk: it denotes a chunk. Chunks have a label attribute and a unique id. A chunk can contain translation units and/or chunks. The translation units contained in a chunk are denoted using the tag elements token and head. A chunk may contain one or more chunks and one or more tokens, but only one head. The token and head elements carry the tuid attribute, i.e. the id of the corresponding trans-unit, together with its options. In some cases (for instance, in the case of expressions), one of the translation options of a trans-unit may result in a different chunking (a separate chunking-option is created for each such chunking).

1.2 Formal representation

MT :: <metis-translation> SS EI </metis-translation>

SS :: <SourceSentence> (SU)+ </SourceSentence>

SU :: <source-unit> ... </source-unit>

EI :: <expander-input> TR CH </expander-input>

TR :: <translation> (TU)+ </translation>

TU :: <trans-unit> EX (TO)+ </trans-unit>

EX :: <extra_information> ... </extra_information>

TO :: <option> ... </option>

CH :: <chunking> (CO)+ </chunking>

CO :: <chunking-option> (CL)+ </chunking-option>

CL :: <clause> ( (TU)* (CK)+ )* ( (CK)* (TU)* )* </clause>

CK :: <chunk> ... </chunk>

1.3 First proposal example

<expander-input>
  <translation>
    <trans-unit id="1">
      <extra_information>
        <source-token>1</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>the</lemma>
          <pos>AT</pos>
        </token-trans>
      </option>
    </trans-unit>
    <trans-unit id="2">
      <extra_information>
        <source-token>2</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>chairman</lemma>
          <pos>NN</pos>
        </token-trans>
      </option>
      <option id="2">
        <token-trans id="1">
          <lemma>president</lemma>
          <pos>NN</pos>
        </token-trans>
      </option>
    </trans-unit>
    <trans-unit id="3">
      <extra_information>
        <source-token>3,4</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>congratulate</lemma>
          <pos>VV</pos>
        </token-trans>
      </option>
    </trans-unit>
    <trans-unit id="4">
      <extra_information>
        <source-token>5</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>in</lemma>
          <pos>AV-PRP</pos>
        </token-trans>
      </option>
      <option id="2">
        <token-trans id="1">
          <lemma>on</lemma>
          <pos>AV-PRP</pos>
        </token-trans>
      </option>
      <option id="3">
        <token-trans id="1">
          <lemma>to</lemma>
          <pos>AV-PRP</pos>
        </token-trans>
      </option>
    </trans-unit>
    <trans-unit id="5">
      <extra_information>
        <source-token>6</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>football</lemma>
          <pos>NN</pos>
        </token-trans>
        <token-trans id="2">
          <lemma>player</lemma>
          <pos>NN</pos>
        </token-trans>
      </option>
    </trans-unit>
    <trans-unit id="6">
      <extra_information>
        <source-token>7</source-token>
      </extra_information>
      <option id="1">
        <token-trans id="1">
          <lemma>.</lemma>
          <pos>PUN</pos>
        </token-trans>
      </option>
    </trans-unit>
  </translation>
  <chunking>
    <chunking-option id="1">
      <clause id="1">
        <chunk id="1" label="NP_NM">
          <token tuid="1" options="1"/>
          <head tuid="2" options="1,2"/>
        </chunk>
        <chunk id="2" label="VG">
          <head tuid="3" options="1"/>
        </chunk>
        <chunk id="3" label="NP_AC">
          <token tuid="4" options="1,2,3"/>
          <head tuid="5" options="1"/>
        </chunk>
        <token tuid="6" options="1"/>
      </clause>
    </chunking-option>
  </chunking>
</expander-input>
</metis-translation>
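
As a cross-check of the co-indexing above, a chunk can be resolved against its trans-units with the Python standard library. This is a minimal sketch under our own naming (print_chunk_heads is hypothetical, not part of the METIS-II code):

import xml.etree.ElementTree as ET

def print_chunk_heads(path):
    root = ET.parse(path).getroot()  # <metis-translation>
    ei = root.find("expander-input")
    # index the trans-units by id so that chunks can refer to them via tuid
    units = {tu.get("id"): tu
             for tu in ei.find("translation").iter("trans-unit")}
    for option in ei.find("chunking").iter("chunking-option"):
        for chunk in option.iter("chunk"):
            head = units[chunk.find("head").get("tuid")]
            lemmas = [tt.findtext("lemma")
                      for opt in head.iter("option")
                      for tt in opt.iter("token-trans")]
            print(chunk.get("label"), "-> head lemmas:", lemmas)

Applied to the example above, this would list the NP_NM chunk with head lemmas ['chairman', 'president'], the VG chunk with ['congratulate'], and the NP_AC chunk with ['football', 'player'].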

2. Second proposal

2.1 Description of the elements

expander-input, trans-unit, option, token-trans, lemma, pos, extra_information: These elements are the same as in the original input format proposal. The only difference is that within the extra_information tag we enclose the id of the source token from which this trans-unit is derived. All source token information is contained in the SourceSentence element of the input format.

New elements

∗ chunking: it denotes chunking information.

∗ chunking-option: it denotes a chunking option for the input sentence. For each different chunking of the sentence, a new chunking-option element is created.

∗ clause: it denotes a clause. Clauses have a unique id. Clauses contain one or more chunks and may also contain trans-units. Clauses differ from chunks in that they do not have a head token, whereas chunks must have one.

∗ chunk: it denotes a chunk. Chunks have a label attribute and a unique id. A chunk can contain translation units and/or chunks. Each chunk has a head tag element denoting the head translation unit of the chunk (using the corresponding id). If a trans-unit contains multiple options, and one of them leads to a different way of chunking, then each option is used only in the corresponding chunking-option.

2.2 Formal representation

MT :: <metis-translation> SS EI </metis-translation>

SS :: <SourceSentence> (SU)+ </SourceSentence>

SU :: <source-unit> ... </source-unit>

EI :: <expander-input> CH </expander-input>

CH :: <chunking> (CO)+ </chunking>

CO :: <chunking-option> (CL)+ </chunking-option>

CL :: <clause> ( (TU)* (CK)+ )* ( (CK)* (TU)* )* </clause>

CK :: <chunk> ( (CK)* (TU)+ )* ( (TU)* (CK)* )* </chunk>

TU :: <trans-unit> EX (TO)+ </trans-unit>

EX :: <extra_information> ... </extra_information>

TO :: <option> ... </option>

2.3 Second proposal example

<expander-input>
  <chunking>
    <chunking-option id="1">
      <clause id="1">
        <chunk id="1" label="NP_NM">
          <head tuid="2"/>
          <trans-unit id="1">
            <extra_information>
              <source-token>1</source-token>
            </extra_information>
            <option id="1">
              <token-trans id="1">
                <lemma>the</lemma>
                <pos>AT</pos>
              </token-trans>
            </option>
          </trans-unit>
          <trans-unit id="2">
            <extra_information>
              <source-token>2</source-token>
            </extra_information>
            <option id="1">
              <token-trans id="1">
                <lemma>chairman</lemma>
                <pos>NN</pos>
              </token-trans>
            </option>
            <option id="2">
              <token-trans id="1">
                <lemma>president</lemma>
                <pos>NN</pos>
              </token-trans>
            </option>
          </trans-unit>
        </chunk>
        <chunk id="2" label="VG">
          <head tuid="3"/>
          <trans-unit id="3">
            <extra_information>
              <source-token>3,4</source-token>
            </extra_information>
            <option id="1">
              <token-trans id="1">
                <lemma>congratulate</lemma>
                <pos>VV</pos>
              </token-trans>
            </option>
          </trans-unit>
        </chunk>
        <chunk id="3" label="NP_AC">
          <head tuid="5"/>
          <trans-unit id="4">
            <extra_information>
              <source-token>5</source-token>
            </extra_information>
            <option id="1">
              <token-trans id="1">
                <lemma>in</lemma>
                <pos>AV-PRP</pos>
              </token-trans>
            </option>
            <option id="2">
              <token-trans id="1">
                <lemma>on</lemma>
                <pos>AV-PRP</pos>
              </token-trans>
            </option>
            <option id="3">
              <token-trans id="1">
                <lemma>to</lemma>
                <pos>AV-PRP</pos>
              </token-trans>
            </option>
          </trans-unit>
          <trans-unit id="5">
            <extra_information>
              <source-token>6</source-token>
            </extra_information>
            <option id="1">
              <token-trans id="1">
                <lemma>football</lemma>
                <pos>NN</pos>
              </token-trans>
              <token-trans id="2">
                <lemma>player</lemma>
                <pos>NN</pos>
              </token-trans>
            </option>
          </trans-unit>
        </chunk>
        <trans-unit id="6">
          <extra_information>
            <source-token>7</source-token>
          </extra_information>
          <option id="1">
            <token-trans id="1">
              <lemma>.</lemma>
              <pos>PUN</pos>
            </token-trans>
          </option>
        </trans-unit>
      </clause>
    </chunking-option>
  </chunking>
</expander-input>
</metis-translation>
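
Since the second proposal nests the trans-units inside their chunks, no separate index is needed and the same information can be read off directly. Again a minimal sketch under our own naming (chunk_lemmas and the file name expander_input.xml are hypothetical):

import xml.etree.ElementTree as ET

def chunk_lemmas(chunk):
    # first-option lemmas of every trans-unit that is a direct child of the chunk
    lemmas = []
    for tu in chunk.findall("trans-unit"):
        first_option = tu.find("option")
        lemmas.extend(tt.findtext("lemma")
                      for tt in first_option.iter("token-trans"))
    return lemmas

root = ET.parse("expander_input.xml").getroot()  # hypothetical input file
for chunk in root.iter("chunk"):
    print(chunk.get("label"), chunk_lemmas(chunk))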