METIS (Machine Translation for Low-Resource Languages)

NLP-UPC Seminar, 23/05/2008

Maite Melero (GLiCom – BM)


Roadmap

• METIS II (2004-2007)
• ES-EN approach (GLiCom)
• METIS II evaluation results
• Rapid deployment of METIS CA-EN pair


Current approaches to MT

• In industry: mainly rule-based systems, which require a lot of expensive manual labour
• In academia: mostly data-driven systems (statistical and example-based MT), which require large parallel corpora

What happens with smaller languages?


METIS II (2004-2007): the aims

Construct free-text translations by:
• relying on hybrid techniques
• employing basic resources
• retrieving the basic stock for translations from large monolingual corpora of the target language only


Similar approach: MATADOR

MATADOR (Habash and Dorr, 2002, 2003; Habash, 2003, 2004).

Main difference:
• MATADOR aims at language pairs with resource asymmetry: low resources for the source language and high resources for the target language
• METIS aims at low resources on both sides


Metis II: The main ideas

• Hybrid approach: strong data-driven component plus a limited number of rules
• Simple resources, readily available
• Weights associated with resources and the search algorithm
• TL corpus: processed off-line to construct the TL model
• Language-specific components independent from the core search engine
• Special data format for the core engine input (UDF)
• Several language pairs to test the feasibility of the approach: Dutch, German, Greek and Spanish into English


METIS II architecture

[Figure: METIS II architecture. Processing components: Web Interface, SL pre-processing, Lexicon Lookup, Search Engine, Token Generation, Final Translation. Database Server resources: Lexicon, BNC Clauses, BNC Chunks, Token Generation Rules, Weights.]


What are basic NLP resources?

• Part-of-speech taggers
• Lemmatizers
• Manually corrected POS-tagged corpus (can be used to train a statistical tagger such as TnT (Brants, 2000))
• (Optionally) chunkers


Metis II: Fields of experimentation

• SL analysis: depth & richness of syntactic structure
• Transfer: which pieces / structures of information
• Generation: re-ordering of chunks and words


Metis II: SL Analysis (Morphology)

All language pairs provide:
• Lemmatisation: abstraction from inflection
• POS tagging: verbs, nouns, adjectives, articles, pronouns, etc., with subclasses according to properties of the SL
• Nominal inflection: number, gender, case
• Verbal inflection: number, person, tense, mood, type (ptc, fin, inf, etc.)


Metis II: SL Analysis (Syntax)

• No syntactic SL analysis: Spanish
• Phrase detection (nominal, prepositional, verbal groups) and clause detection (main and subordinate clauses): Dutch, German & Greek
• Recursive embedding of phrases and clauses:
    • one level, no embedding: German
    • two-level embedding: Greek
    • full recursivity: Dutch
• Detection of phrase & clause heads: Dutch & Greek
• Subject detection: German & Greek
• Topological field analysis: German


Metis II: Source Language Analysis

SL features generated      Spanish  Dutch  German  Greek

Morphology
  Lemmatisation            X        X      X       X
  POS tagging              X        X      X       X
  Nominal Inflection       X        X      X       X
  Verbal Inflection        X        X      X       X

Syntax
  Phrase detection         ---      X      X       X
  Clause detection         ---      X      X       X
  Recursivity of phrases   0        >2     1       2
  Recursivity of clauses   0        >2     1       2
  Phrase/Clause head       ---      X      ---     X
  Subject detection        ---      ---    X       X
  Topological analysis     ---      ---    X       ---

Provides generalization:
• Smaller lexicon
• Less data sparsity in TL corpus


Metis II: Transfer (Mapping of SL features to TL)

SL features          Spanish  Dutch  German  Greek
Single-word Lemmas   X        X      X       X
Multi-word Lemmas    X        X      X       ---
Discontinuous MWUs   ---      X      X       ---
POS tag mapping      X        X      ---     X
Phrase structure     ---      X      ---     X


Metis II: TL Generation (Reordering)

• Reordering of the transferred items into TL structure is conceived as a process of hypothesis generation and filtering, according to the most likely TL pattern (from the TL model).
• Mostly pattern-based, using only information from the TL, but can also be partly rule-based and use information from the SL (Dutch and German).


Metis II: TL Generation (Reordering)

Information to be matched in the TL model:
• Shallow syntactic information: all except Spanish
• n-gram patterns of mapped POS & lemmata: Spanish

Matching procedure:
• Top-down: Greek
• Bottom-up: all except Greek


Metis II: Reordering mechanism for TL word order generation

Reordering      Spanish  Dutch  German  Greek
TL-driven       X        X      ---     X
SL-driven       ---      X      X       ---
Bottom-up       X        X      ---     ---
Top-down        ---      ---    ---     X
Rule-based      ---      X      X       ---
Pattern-based   X        X      ---     X


Metis II Spanish-English Translation Paradigm

Spanish sentence
→ Spanish Preprocessing: POS tagger and lemmatizer
→ Translation Model: bilingual flat lexicon (no structure transfer rules)
→ English Generation: search over n-gram models extracted from English corpus
→ English sentence


Main Translation Problems

• Lexical selection, i.e. picking the right translation for a given word:
    escribir una carta → write a letter
    jugar una carta → play a card
• Translation divergences, i.e. whenever word-by-word translation does not work:
    ver a Juan → see (to) Juan
    cruzar nadando → cross swimming (swim across)


Translation Divergences. How MT has addressed them

• Linguistic-based MT systems devise data representations that minimize translation divergences:
    [head] ver    [arg2] Juan
    [head] see    [arg2] Juan
• Remaining divergences need to be solved in the translation module:
    • Hand-written bilingual mapping rules (Transfer MT)
    • Mappings automatically extracted from a parallel corpus (Example-Based MT)


Translation Divergences. Our constraints.

• Very basic resources required, both for source and target languages: only a lemmatizer/POS tagger and a (TL) chunker.
    • No deep linguistic analysis to minimize divergences
    • No parallel corpus, only a target corpus
• Keep the translation model very simple: only a bilingual lexicon.
    • No mapping rules, either hand-written or automatically learned


Translation Divergences. Our approach.

• Handle structure modifications in the TL Generation component.
• Treatment independent of the SL, i.e. much more general and reusable.


SL Preprocessing (Spanish)

Tagger (CastCG)

Statistical disambiguation

SL normalization


Spanish Tagger: CastCG

Example: Me alojo en la casa de huéspedes.

Form        Lemma     Synt. rels  Synt. tags  Morph tags
Me          me        obj>2       @NH         PRON Pers SG P1 ACC
alojo       alojar    main>0      @MAIN       V IND PRES SG P1
en          en        pm>5        @PREMARK    PREP
la          la        det>5       @PREMOD     DET FEM SG
casa        casa      loc>2       @NH         N FEM SG
de          de        pm>7        @POSTMOD    PREP
huéspedes   huésped   mod>5       @NH         N MSC PL


SL Normalization: Tag Mapping

Form        Lemma     POS (PAROLE)  Morph tag
Me          me        PP            sg:1:acc
alojo       alojar    VM            i:p:sg:1
en          en        SP
la          la        TD            f:sg
casa        casa      NCF           f:sg
de          de        SP
huéspedes   huésped   NCC           m:pl


SL Normalization: e.g. Pronoun Insertion in Pro-drop

Form        Lemma     POS (PAROLE)  Morph tag
yo          yo        PP            sg:1:nom
Me          me        PP            sg:1:acc
alojo       alojar    VM            i:p:sg:1
en          en        SP
la          la        TD            f:sg
casa        casa      NCF           f:sg
de          de        SP
huéspedes   huésped   NCC           m:pl
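The pronoun insertion step can be illustrated with a minimal sketch. It is not the METIS II implementation: the token layout, the `SUBJECT_PRONOUNS` table and the rule itself (insert a pronoun when no noun or nominative pronoun precedes the finite verb) are assumptions made for illustration, and the verb tag is assumed to follow the order mood:tense:number:person seen in `i:p:sg:1`. Only first and second person are handled, since a third-person null subject cannot be resolved from the verb alone.

```python
# Hypothetical token layout: (form, lemma, PAROLE POS, morph tag).
# Assumed mapping from (number, person) to a Spanish nominative pronoun.
SUBJECT_PRONOUNS = {
    ("sg", "1"): "yo",
    ("sg", "2"): "tú",
    ("pl", "1"): "nosotros",
    ("pl", "2"): "vosotros",
}

def insert_subject_pronoun(tokens):
    """Prepend a subject pronoun if the first finite verb has no overt subject."""
    for index, (_, _, pos, morph) in enumerate(tokens):
        parts = morph.split(":")
        # Assumed convention: indicative main verbs look like VM + "i:<tense>:<number>:<person>".
        if pos != "VM" or len(parts) != 4 or parts[0] != "i":
            continue
        _, _, number, person = parts
        has_subject = any(
            p.startswith("N") or (p == "PP" and m.endswith(":nom"))
            for _, _, p, m in tokens[:index]
        )
        pronoun = SUBJECT_PRONOUNS.get((number, person))
        if has_subject or pronoun is None:
            return list(tokens)
        subject = (pronoun, pronoun, "PP", f"{number}:{person}:nom")
        return [subject] + list(tokens)
    return list(tokens)

sentence = [
    ("Me", "me", "PP", "sg:1:acc"),
    ("alojo", "alojar", "VM", "i:p:sg:1"),
    ("en", "en", "SP", ""),
    ("la", "la", "TD", "f:sg"),
    ("casa", "casa", "NCF", "f:sg"),
    ("de", "de", "SP", ""),
    ("huéspedes", "huésped", "NCC", "m:pl"),
]
normalized = insert_subject_pronoun(sentence)
```

Applying the function a second time leaves the sentence unchanged, because the inserted nominative pronoun now counts as an overt subject.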


Translation Model: Spanish-English Lexicon Look-up

Resources: Sp-Eng lexmetis, Oxford

Spanish lemma   Sp. POS (PAROLE)   English lemma   Eng. POS (reduced CLAWS5)   Order of translation
alojar          VM                 house           VV                          1
alojar          VM                 stay            VV                          2
alojar          VM                 lodge           VV                          3

Output: list of Pseudo-English candidates (UDF)


Translation Model: Compound Detection

Resources: Sp-Eng lexmetis, Oxford

casa => house
casa => casa de huéspedes
casa de huéspedes => boarding house

<trans-unit id="6">
  <option id="1">
    <token-trans id="1">
      <lemma>boarding</lemma>
      <pos>VVG</pos>
    </token-trans>
    <token-trans id="2">
      <lemma>house</lemma>
      <pos>NN1</pos>
    </token-trans>
  </option>
</trans-unit>
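One way to picture lexicon look-up with compound detection is a greedy longest-match scan over the lemmatized sentence. The sketch below is only an illustration under that assumption; the slides do not say that METIS II uses longest match. The `LEXICON` contents are a toy sample built from the examples on the slides (the POS tag given to the single-word entry for "casa" is assumed), and `MAX_UNIT_LENGTH` is a hypothetical parameter.

```python
# Toy lexicon keyed by tuples of Spanish lemmas. Each entry is a ranked list of
# options, and each option is a sequence of English (lemma, POS) pairs.
LEXICON = {
    ("alojar",): [[("house", "VV")], [("stay", "VV")], [("lodge", "VV")]],
    ("casa",): [[("house", "NN1")]],  # POS tag assumed for illustration
    ("casa", "de", "huésped"): [[("boarding", "VVG"), ("house", "NN1")]],
}
MAX_UNIT_LENGTH = 3  # hypothetical upper bound on the length of a multi-word unit

def lookup(lemmas):
    """Segment the lemma sequence into translation units, preferring longer matches."""
    units = []
    i = 0
    while i < len(lemmas):
        for length in range(min(MAX_UNIT_LENGTH, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + length])
            if key in LEXICON:
                units.append((key, LEXICON[key]))
                i += length
                break
        else:
            # No entry found: keep the lemma as an unfound word with no options.
            units.append(((lemmas[i],), []))
            i += 1
    return units

units = lookup(["alojar", "en", "la", "casa", "de", "huésped"])
```

With this toy lexicon the last three lemmas are grouped into one unit translated as "boarding house", while "casa" on its own would still map to "house".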


Translation Model: Unfound words

• Past participle, e.g. “denominado” > denominar (VM) > designate (VV) > designated (AJ0)
• Adverbs, e.g. “técnicamente” > técnico (AQ) > technical (AJ0) > technically (AV0)
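The adverb case can be sketched as a small fallback routine. Everything here is a simplifying assumption rather than the METIS II code: the `ADJECTIVES` table is a one-entry toy lexicon, the Spanish adjective lemma is recovered by stripping "-mente" and trying a masculine "-o" ending, and the English adverb is formed by naively appending "-ly" (which would be wrong for adjectives such as "happy").

```python
# Toy Spanish-English adjective lexicon (assumed content).
ADJECTIVES = {"técnico": "technical"}

def translate_unfound_adverb(form):
    """Derive an English adverb for a Spanish '-mente' adverb missing from the lexicon."""
    suffix = "mente"
    if not form.endswith(suffix):
        return None
    stem = form[: -len(suffix)]  # e.g. "técnica"
    candidates = [stem]
    if stem.endswith("a"):
        candidates.append(stem[:-1] + "o")  # feminine form -> masculine lemma
    for lemma in candidates:
        if lemma in ADJECTIVES:
            # Naive English adverb formation: adjective + "ly".
            return (ADJECTIVES[lemma] + "ly", "AV0")
    return None

result = translate_unfound_adverb("técnicamente")
```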


TL Generation (English)

Pseudo-English UDF
→ Search Engine (TL models)
→ English lemmatized sentence
→ Token generation
→ English translation


Search Engine (1st version)

• Lexical preselection
• Candidate scoring
• Candidate expansion

Resources: n-gram TL models


Search Engine (2nd version): beam search decoding

Example candidate lattice:
    the worker must {carry | wear | drive} {helmet | bottle | headphones | …}

Search engine steps:
• Lexical pre-selection
• Candidate expansion
• Scoring

Resources: n-gram TL models


Target Language Models

Corpus: BNC (6 M sentences)
TL models: 1-gram, 2-gram, 3-gram, 4-gram, 5-gram

Example TL model entry:
    stay|VV in|PRP the|AT0 house|NN
With one position substituted by its POS tag (for n>2):
    stay|VV in|PRP the|AT0 NN
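A minimal sketch of how such n-gram models could be collected from a lemmatized, POS-tagged corpus is given below. The function name, the in-memory `Counter` storage and the choice to generate every single-position substitution are assumptions for illustration; the slides only state that one position is substituted for n > 2.

```python
from collections import Counter

def build_ngram_models(sentences, max_n=5):
    """Count lemma|POS n-grams (n = 1..max_n) over tagged sentences.

    sentences: list of sentences, each a list of (lemma, pos) pairs.
    For n > 2, variants with exactly one position reduced to its POS tag
    are counted as well.
    """
    models = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = [f"{lemma}|{pos}" for lemma, pos in sentence]
        tags = [pos for _, pos in sentence]
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                window = tokens[start:start + n]
                models[n][tuple(window)] += 1
                if n > 2:
                    for position in range(n):
                        generalized = list(window)
                        generalized[position] = tags[start + position]
                        models[n][tuple(generalized)] += 1
    return models

corpus = [[("stay", "VV"), ("in", "PRP"), ("the", "AT0"), ("house", "NN")]]
models = build_ngram_models(corpus)
```

The generalized entries let a candidate such as "stay in the hotel" match a pattern seen with a different noun, which reduces data sparsity at the cost of a larger model.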


Handling Structure Divergences in TL Generation: Local Structure Modifications

• Insertion of functional words: want|VV to|TO0 go|VV

• Deletion of functional words: at|PRP (the|AT0) home|NN

• Permutation of content words: a|AT0 {day|NN happy|AJ0}

Edit operations checked against the 1-gram to 5-gram models (each entry: n-gram, freq):
    I (insertion): want|VV go|VV → want|VV to|TO0 go|VV
    D (deletion): at|PRP the|AT0 home|NN → at|PRP home|NN
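The three local modifications (insertion, deletion, permutation) can be sketched as a generator of variants that are then ranked by their frequency in the TL model. The sets `FUNCTION_TAGS` and `INSERTABLE`, the restriction to adjacent swaps and the toy frequencies are all assumptions for illustration, not values taken from METIS II.

```python
FUNCTION_TAGS = {"AT0", "PRP", "TO0"}          # assumed set of function-word tags
INSERTABLE = [("to", "TO0"), ("the", "AT0")]   # assumed insertable function words

def local_variants(tokens):
    """Return (operation, variant) pairs obtained by one local edit."""
    tokens = list(tokens)
    variants = []
    # Insertion of a function word at any position.
    for position in range(len(tokens) + 1):
        for word in INSERTABLE:
            variants.append(("insert", tuple(tokens[:position] + [word] + tokens[position:])))
    # Deletion of a function word.
    for position, (_, tag) in enumerate(tokens):
        if tag in FUNCTION_TAGS:
            variants.append(("delete", tuple(tokens[:position] + tokens[position + 1:])))
    # Permutation of two adjacent content words.
    for position in range(len(tokens) - 1):
        left, right = tokens[position], tokens[position + 1]
        if left[1] not in FUNCTION_TAGS and right[1] not in FUNCTION_TAGS:
            swapped = tokens[:position] + [right, left] + tokens[position + 2:]
            variants.append(("permute", tuple(swapped)))
    return variants

def best_variant(tokens, frequencies):
    """Pick the unchanged n-gram or the edited variant with the highest frequency."""
    candidates = [("keep", tuple(tokens))] + local_variants(tokens)
    return max(candidates, key=lambda item: frequencies.get(item[1], 0))

want, go, to = ("want", "VV"), ("go", "VV"), ("to", "TO0")
a, day, happy = ("a", "AT0"), ("day", "NN"), ("happy", "AJ0")
toy_frequencies = {(want, to, go): 120, (want, go): 3, (a, happy, day): 10}  # invented counts

insertion = best_variant([want, go], toy_frequencies)
permutation = best_variant([a, day, happy], toy_frequencies)
```

Ties are resolved in favour of the unchanged sequence because it is listed first among the candidates.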


Search Engine: beam search decoding

Performance problems

• Combinatorial explosion in the expansion step: suppose we are given a source sentence with at least 35 words, each of which translates to at least two English words. Thus:

  $\text{Candidates} \geq 2^{35} \approx 3.4 \times 10^{10}$

  The search space of candidates must be pruned.

• Combinatorial explosion in the scoring computation step.
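The size of the unpruned search space in this example is simply the product of the number of translation options per source word. A short check, assuming exactly two options for each of 35 words and ignoring any extra candidates created by edit operations:

```python
import math

# Assumed toy input: 35 source words, each with exactly two English translations.
options_per_word = [2] * 35

# Without pruning, the expansion step enumerates every combination of choices.
candidates = math.prod(options_per_word)
print(f"{candidates:,}")  # 34,359,738,368
```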



Solution:

Build the search space incrementally (following Philipp Koehn’s "Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models", 2004).

• Stack 1: w1,…,wk are pushed onto the first stack. The stack is ranked and pruned to a given stack depth.
• Stack i: each candidate of the (i-1)-th stack is expanded via the dictionary and edit ops. Again, candidates are ranked and pruned to a given stack depth.
• The score of each partial translation is computed from the already computed, stored scores:

  $\mathrm{score}(pTrans, w) = \mathrm{score}(pTrans) + \sum_{i=1}^{n} \mathrm{cost}_i(\mathrm{end}_i(pTrans), w)$

• At the N-th step (the source sentence contains N tokens) the decoding process stops. We get a ranked stack with the translation candidates.
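The stack-based decoding loop can be sketched as follows. This is a simplified illustration, not the METIS II decoder: edit operations are left out, the scoring function is a hypothetical bigram score over invented counts, and `beam_width` plays the role of the stack depth. What it does show is the incremental scoring, where each extension adds its own score to the stored score of the partial translation.

```python
import math

def beam_search(options_per_token, score_extension, beam_width=3):
    """Decode one stack per source token, pruning each stack to beam_width.

    options_per_token: one list of candidate TL words per source token.
    score_extension(partial, word): incremental score of appending word
    to the partial translation (higher is better).
    """
    stack = [((), 0.0)]
    for options in options_per_token:
        expanded = []
        for partial, score in stack:
            for word in options:
                expanded.append((partial + (word,), score + score_extension(partial, word)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        stack = expanded[:beam_width]  # prune to the given stack depth
    return stack

# Invented bigram counts standing in for the TL n-gram models.
TOY_BIGRAMS = {
    ("the", "worker"): 40, ("worker", "must"): 25,
    ("must", "wear"): 50, ("must", "carry"): 20, ("must", "drive"): 10,
    ("wear", "helmet"): 30, ("wear", "headphones"): 12, ("carry", "bottle"): 8,
}

def bigram_score(partial, word):
    """Log-count of the bigram formed with the last word of the partial translation."""
    if not partial:
        return 0.0
    return math.log(1 + TOY_BIGRAMS.get((partial[-1], word), 0))

options = [["the"], ["worker"], ["must"],
           ["carry", "wear", "drive"], ["helmet", "bottle", "headphones"]]
ranked = beam_search(options, bigram_score, beam_width=2)
best_translation = " ".join(ranked[0][0])
```

With a stack depth of 2, only the two best partial translations survive each step, so the number of scored candidates grows linearly with sentence length instead of exponentially.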


Handling Structure Divergences in TL Generation: Non-local Movements

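Chunk-order hypotheses (chunk tag patterns):
    [PRP AT0 NN] [VV] [AT0 NN]
    [AT0 NN] [PRP AT0 NN] [VV]
    [AT0 NN] [VV] [PRP AT0 NN]

Syntactic model: built from the normalized BNC, chunked into a corpus with chunk boundaries,
e.g. [The man] [sleeps] [at the park]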
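Non-local movement can be pictured as choosing, among the possible chunk orders, the one whose tag pattern is most frequent in the syntactic model. The sketch below assumes exactly that and nothing more: the pattern counts are invented, and enumerating all permutations is only workable for a handful of chunks.

```python
from itertools import permutations

def best_chunk_order(chunks, pattern_counts):
    """Return the chunk order whose tag pattern is most frequent in the model.

    chunks: list of (tag_pattern, words) pairs.
    pattern_counts: dict mapping a tuple of tag patterns to a corpus frequency.
    """
    def frequency(order):
        return pattern_counts.get(tuple(tags for tags, _ in order), 0)
    return max(permutations(chunks), key=frequency)

# Invented frequencies standing in for the syntactic model built from the chunked corpus.
toy_pattern_counts = {
    ("AT0 NN", "VV", "PRP AT0 NN"): 90,
    ("PRP AT0 NN", "VV", "AT0 NN"): 2,
    ("AT0 NN", "PRP AT0 NN", "VV"): 1,
}
chunks = [("PRP AT0 NN", "at the park"), ("VV", "sleeps"), ("AT0 NN", "the man")]
best_order = best_chunk_order(chunks, toy_pattern_counts)
reordered = " ".join(words for _, words in best_order)
```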


Evaluation of the final METIS prototype

Comparison with SYSTRAN:
• Widely used
• Available for all language pairs
• Rule-based, many man-years of development

Goal: get an estimate of what has been achieved


Methodology: Test sets

Two test sets:
• 200 sentences manually chosen from Europarl
• 200 sentences from the balanced test suite used to validate system development, drawn from a variety of domains:
    • 25% grammatical phenomena
    • 25% newspapers
    • 25% technical
    • 25% scientific


Methodology: Metrics

• BLEU & NIST: measure n-gram overlap between the MT output and the references
• TER (Translation Error Rate): measures the amount of editing that a human would have to perform to make the MT output match a reference


Methodology: References

• All metrics use human-created references to compare with the MT output.
• Europarl: 5 references (4 resulting from human translation of each SL into English, plus the original English one)
• Development: 3 references
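To see why the number of references matters, the sketch below computes the clipped ("modified") n-gram precision that BLEU is built on. It is not a full BLEU implementation: the brevity penalty and the geometric mean over n-gram orders are omitted, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision of a hypothesis against one or more references."""
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts:
        return 0.0
    # For each n-gram, the maximum count observed in any single reference.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hypothesis = "I stay in the boarding house".split()
reference_1 = "I am staying in the boarding house".split()
reference_2 = "I stay at the guest house".split()

one_ref = modified_precision(hypothesis, [reference_1], 2)
two_refs = modified_precision(hypothesis, [reference_1, reference_2], 2)
```

Adding the second reference raises the bigram precision from 0.6 to 0.8, because the bigram "I stay" is found only there. Scores obtained with different numbers of references are therefore not directly comparable.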


Results on Europarl test set

BLEU     METIS-II  SYSTRAN  difference  %
NL-EN    0.1925    0.3828   0.1903      50%
DE-EN    0.2816    0.3958   0.1142      71%
EL-EN    0.1861    0.3132   0.1271      59%
ES-EN    0.2784    0.4638   0.1854      60%
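The difference and percentage columns can be reproduced from the two BLEU scores. The short check below assumes, based on the Europarl figures themselves, that the percentage is the METIS-II score relative to the SYSTRAN score; the slides do not state this explicitly.

```python
# BLEU scores on the Europarl test set: (METIS-II, SYSTRAN).
europarl_bleu = {
    "NL-EN": (0.1925, 0.3828),
    "DE-EN": (0.2816, 0.3958),
    "EL-EN": (0.1861, 0.3132),
    "ES-EN": (0.2784, 0.4638),
}

def compare(metis, systran):
    """Return the absolute gap and the METIS-II score as a percentage of SYSTRAN's."""
    return round(systran - metis, 4), round(100 * metis / systran)

summary = {pair: compare(*scores) for pair, scores in europarl_bleu.items()}
for pair, (difference, percent) in summary.items():
    print(f"{pair}  difference={difference:.4f}  {percent}%")
```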


Results on development test suite

BLEU     METIS-II  SYSTRAN  difference  %
NL-EN    0.2369    0.3777   0.1408      70%
DE-EN    0.2231    0.3133   0.0902      71%
EL-EN    0.3661    0.3946   0.0285      92%
ES-EN    0.2941    0.4634   0.1693      63%


METIS-II on both test sets

BLEU     Europarl  Dev     difference
NL-EN    0.1925    0.2369  0.0444
DE-EN    0.2816    0.2231  -0.0585
EL-EN    0.1861    0.3661  0.1800
ES-EN    0.2784    0.2941  0.0157


Results according to text type (ES-EN on Development Testsuite)

BLEU       Grammar  News  Science  Tech
METIS-II   0.22     0.33  0.29     0.26
SYSTRAN    0.48     0.46  0.47     0.45


Impact of the number of reference translations (DE-EN on Europarl)

Ref             METIS-II  SYSTRAN
All 5           0.2816    0.3958
4 (- Greek)     0.2774    0.3871
4 (- Dutch)     0.2750    0.3817
4 (- Spanish)   0.2697    0.3739
4 (- German)    0.1803    0.3064
1 (German)      0.2376    0.2912
1 (Spanish)     0.1199    0.1922
1 (Europarl)    0.1155    0.1975
1 (Dutch)       0.0923    0.1483
1 (Greek)       0.0761    0.1182


Conclusions

• Homogeneity of results confirms the language independence of the strategy
• Young METIS II is still not at the same level as mature SYSTRAN, but it stands up to the comparison
• Ample room for improvement


Lines for (quick) improvement

• Fix bugs
• Extend lexical coverage
• Augment target corpus
• Add mapping rules


Other lines for improvement (need bilingual corpora…)

• Parameter tuning
• Enrichment of the translation model
    • Automatic induction of dictionary entries (lexical and phrasal)
    • Automatic induction of structure transfer rules
• Automatic postediting rules (needs a corpus of corrected translations)


Publications (METIS-II) (1)

Toni Badia, Gemma Boleda, Maite Melero, and Antoni Oliver, An n-gram approach to exploiting monolingual corpus for MT, in: Proceedings of the 2nd Workshop on Example-Based Machine Translation held in conjunction with the 10th Machine Translation Summit, pp. 1-8, Phuket, 2005.

Toni Badia, Gemma Boleda, Maite Melero, and Antoni Oliver, El proyecto METIS-II, in: Procesamiento del Lenguaje Natural, 35 (2005), pp. 443-444.

Antoni Oliver, Toni Badia, Gemma Boleda, and Maite Melero, Traducción automática estadística basada en n-gramas, in: Procesamiento del Lenguaje Natural, 35 (2005), pp. 77-84.

Vincent Vandeghinste, Ineke Schuurman, Michael Carl, Stella Markantonatou, and Toni Badia, METIS-II: Machine Translation for Low-Resource Languages, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1284-1289, Genoa, 2006.


Publications (METIS-II) (2)

Maite Melero, Antoni Oliver, Toni Badia, and Teresa Suñol, Dealing with Bilingual Divergences in MT using Target Language N-gram Models, in: Proceedings of the METIS-II Workshop: New Approaches to Machine Translation, pp. 19-26, Leuven, 2007.

Maite Melero and Toni Badia, Demonstration of the Spanish to English METIS-II MT System, in: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), pp. 132-133, Skövde, 2007.

Toni Badia, Maite Melero and Oriol Valentin, Rapid deployment of a new METIS language pair: Catalan-English. Submitted to LREC, Marrakech, 2008.


Follow-Up Projects

• AMASS++: Summarization of multimedia / multilingual information (KUL): NL <-> EN
• PaCo-MT: Parse and Corpus based MT (KUL): NL <-> EN, NL <-> FR
• I3Media (FUPF): ES -> CA, CA -> EN, EN -> ES


Rapid deployment of new METIS language pair: CA-EN

Catalan-English prototype

Motivation: parallel corpora are even harder to get for smaller languages.

We have simply plugged into the English Generation:
• A Catalan POS tagger (CatCG)
• An open-source CA-EN dictionary (Dacco)

Adaptation and integration: less than one person-month


Experiment and evaluation

The 200 Spanish sentences from the METIS II development test suite were translated into Catalan.

System                 Grammar  News    Science  Tech    All
Cat-Eng METIS          0.2059   0.2533  0.2070   0.2365  0.2342
Sp-Eng METIS           0.2241   0.3273  0.2876   0.2633  0.2941
Cat-Eng Translendium   0.3334   0.4406  0.4226   0.4264  0.4250


Thank you!
