23/05/2008 Seminari NLP-UPC 1
METIS (Machine Translation for Low-Resource Languages)
Maite Melero (GLiCom – BM)
Roadmap
• METIS II (2004-2007)
• ES-EN approach (GLiCom)
• METIS II evaluation results
• Rapid deployment of METIS CA-EN pair
Current approaches to MT
• In industry: mainly rule-based systems, which require a lot of expensive manual labour
• In academia: mostly data-driven systems (statistical and example-based MT), which require large parallel corpora
What happens with smaller languages?
METIS II (2004-2007): the aims
Construct free-text translations by:
• relying on hybrid techniques
• employing basic resources
• retrieving the basic stock for translations from large monolingual corpora of the target language only
Similar approach: MATADOR
MATADOR (Habash and Dorr, 2002, 2003; Habash, 2003, 2004).
Main difference:
• MATADOR aims at language pairs with resource asymmetry: low resources for the source language and high resources for the target language
• METIS aims at low resources on both sides
Metis II: The main ideas
• Hybrid approach: strong data-driven component plus a limited number of rules
• Simple resources, readily available
• Weights associated with resources and the search algorithm
• TL corpus: processed off-line to construct the TL model
• Language-specific components independent from the core search engine
• Special data format for the core engine input (UDF)
• Several language pairs test the feasibility of the approach: Dutch, German, Greek and Spanish into English
METIS II architecture
[Architecture diagram. Components: Web Interface, SL pre-processing, Lexicon Lookup, Search Engine, Token Generation, Final Translation, Database Server. Resources: Lexicon, BNC Clauses, BNC Chunks, Token Generation Rules, Weights.]
What are basic NLP resources?
• Part-of-speech taggers
• Lemmatizers
• Manually corrected POS-tagged corpus (can be used to train a statistical tagger such as TnT (Brants, 2000))
• (optionally) Chunkers
Metis II: Fields of experimentation
• SL analysis: depth & richness of syntactic structure
• Transfer: which pieces / structures of information
• Generation: re-ordering of chunks and words
Metis II: SL Analysis (Morphology)
All language pairs provide:
• Lemmatisation: abstraction from inflection
• POS tagging: verbs, nouns, adjectives, articles, pronouns, etc., with subclasses according to the properties of the SL
• Nominal inflection: number, gender, case
• Verbal inflection: number, person, tense, mood, type (ptc, fin, inf, etc.)
Metis II: SL Analysis (Syntax)
• No syntactic SL analysis: Spanish
• Phrase detection (nominal, prepositional, verbal groups) and clause detection (main and subordinate clauses): Dutch, German & Greek
• Recursive embedding of phrases and clauses:
  one level, no embedding: German
  two-level embedding: Greek
  full recursivity: Dutch
• Detection of phrase & clause heads: Dutch & Greek
• Subject detection: German & Greek
• Topological field analysis: German
Metis II: Source Language Analysis
SL features generated     Spanish  Dutch  German  Greek
Morphology
  Lemmatisation           X        X      X       X
  POS tagging             X        X      X       X
  Nominal Inflection      X        X      X       X
  Verbal Inflection       X        X      X       X
Syntax
  Phrase detection        ---      X      X       X
  Clause detection        ---      X      X       X
  Recursivity of phrases  0        >2     1       2
  Recursivity of clauses  0        >2     1       2
  Phrase/Clause head      ---      X      ---     X
  Subject detection       ---      ---    X       X
  Topological analysis    ---      ---    X       ---
Provides generalization:
• Smaller lexicon
• Less data sparsity in TL corpus
Metis II: Transfer (Mapping of SL features to TL)
SL features          Spanish  Dutch  German  Greek
Single-word Lemmas   X        X      X       X
Multi-word Lemmas    X        X      X       ---
Discontinuous MWUs   ---      X      X       ---
POS tag mapping      X        X      ---     X
Phrase structure     ---      X      ---     X
Metis II: TL Generation (Reordering)
Reordering of the transferred items into TL structure is conceived as a process of hypothesis generation and filtering, according to the most likely TL pattern (from the TL model).
It is mostly pattern-based and uses only information from the TL, but it can also be partly rule-based and use information from the SL (Dutch and German).
Metis II: TL Generation (Reordering)
Information to be matched in the TL model:
• Shallow syntactic information: all except Spanish
• n-gram patterns of mapped POS & lemmata: Spanish
Matching procedure:
• top-down: Greek
• bottom-up: all except Greek
Metis II: Reordering mechanism for TL word order generation
Reordering     Spanish  Dutch  German  Greek
TL-driven      X        X      ---     X
SL-driven      ---      X      X       ---
Bottom-up      X        X      ---     ---
Top-down       ---      ---    ---     X
Rule-based     ---      X      X       ---
Pattern-based  X        X      ---     X
Metis II Spanish-English Translation Paradigm
Spanish sentence
-> Spanish Preprocessing: POS tagger and lemmatizer
-> Translation Model: bilingual flat lexicon (no structure transfer rules)
-> English Generation: search over n-gram models extracted from English corpus
-> English sentence
Main Translation Problems
• Lexical selection, i.e. picking the right translation for a given word:
  escribir una carta -> write a letter
  jugar una carta -> play a card
• Translation divergences, i.e. whenever word-by-word translation does not work:
  ver a Juan -> see (to) Juan
  cruzar nadando -> cross swimming (swim across)
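The lexical selection problem can be illustrated with a minimal sketch that picks among candidate translations using target-language n-gram counts only. The lexicon entries and the counts below are invented for illustration; they are not taken from the METIS-II lexicon or from the BNC, and `select_translation` is a hypothetical helper.

```python
from itertools import product

# Hypothetical bilingual lexicon: Spanish lemma -> English candidates.
LEXICON = {
    "escribir": ["write"],
    "jugar": ["play"],
    "una": ["a"],
    "carta": ["letter", "card"],
}

# Hypothetical target-language trigram counts (toy numbers).
TRIGRAM_COUNTS = {
    ("write", "a", "letter"): 120,
    ("write", "a", "card"): 8,
    ("play", "a", "card"): 45,
}

def select_translation(source_lemmas):
    """Return the word-by-word candidate with the highest trigram count."""
    options = [LEXICON[lemma] for lemma in source_lemmas]
    candidates = product(*options)  # all combinations of candidate words
    return max(candidates, key=lambda c: TRIGRAM_COUNTS.get(c, 0))

print(select_translation(["escribir", "una", "carta"]))
print(select_translation(["jugar", "una", "carta"]))
```

The source side only supplies the candidate set; the choice between "letter" and "card" is made entirely by the target-language counts.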
Translation Divergences. How MT has addressed them
• Linguistically based MT systems devise data representations that minimize translation divergences:
  [head] ver   ->  [head] see
  [arg2] Juan  ->  [arg2] Juan
• Remaining divergences need to be solved in the translation module:
  hand-written bilingual mapping rules (Transfer MT)
  mappings automatically extracted from a parallel corpus (Example-Based MT)
Translation Divergences. Our constraints.
• Very basic resources required, both for source and target languages: only a lemmatizer-POS tagger and a (TL) chunker.
  No deep linguistic analysis to minimize divergences
  No parallel corpus, only a target corpus
• Keep the translation model very simple: only a bilingual lexicon.
  No mapping rules, either hand-written or automatically learned
Translation Divergences. Our approach.
• Handle structure modifications in the TL Generation component.
• The treatment is independent of the SL, i.e. much more general and reusable.
SL Preprocessing (Spanish)
Tagger (CastCG)
Statistical disambiguation
SL normalization
Spanish Tagger: CastCG
Example sentence: Me alojo en la casa de huéspedes. ("I am staying at the boarding house.")
Form       Lemma    Synt. rels  Synt. tags  Morph tags
Me         me       obj>2       @NH         PRON Pers SG P1 ACC
alojo      alojar   main>0      @MAIN       V IND PRES SG P1
en         en       pm>5        @PREMARK    PREP
la         la       det>5       @PREMOD     DET FEM SG
casa       casa     loc>2       @NH         N FEM SG
de         de       pm>7        @POSTMOD    PREP
huéspedes  huésped  mod>5       @NH         N MSC PL
SL Normalization: Tag Mapping
Form       Lemma    POS (PAROLE)  Morph tag
Me         me       PP            sg:1:acc
alojo      alojar   VM            i:p:sg:1
en         en       SP
la         la       TD            f:sg
casa       casa     NCF           f:sg
de         de       SP
huéspedes  huésped  NCC           m:pl
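A minimal sketch of the tag-mapping step, assuming it can be approximated by a lookup from the CastCG morphological tag string to a PAROLE-style POS and morph tag. The table contains only the combinations visible in the example sentence; the function name `normalize_tags` and the table itself are illustrative assumptions, not the project's actual mapping.

```python
# Lookup table covering only the tag combinations shown in the example;
# a full CastCG-to-PAROLE mapping would need many more entries or rules.
CASTCG_TO_PAROLE = {
    "PRON Pers SG P1 ACC": ("PP", "sg:1:acc"),
    "V IND PRES SG P1": ("VM", "i:p:sg:1"),
    "PREP": ("SP", ""),
    "DET FEM SG": ("TD", "f:sg"),
    "N FEM SG": ("NCF", "f:sg"),
    "N MSC PL": ("NCC", "m:pl"),
}

def normalize_tags(tokens):
    """Map (form, lemma, CastCG tags) to (form, lemma, POS, morph tag)."""
    normalized = []
    for form, lemma, castcg_tags in tokens:
        pos, morph = CASTCG_TO_PAROLE[castcg_tags]
        normalized.append((form, lemma, pos, morph))
    return normalized

castcg_output = [
    ("Me", "me", "PRON Pers SG P1 ACC"),
    ("alojo", "alojar", "V IND PRES SG P1"),
    ("en", "en", "PREP"),
    ("la", "la", "DET FEM SG"),
    ("casa", "casa", "N FEM SG"),
    ("de", "de", "PREP"),
    ("huéspedes", "huésped", "N MSC PL"),
]

for row in normalize_tags(castcg_output):
    print(row)
```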
SL Normalization: e.g. Pronoun Insertion in Pro-drop
Form       Lemma    POS (PAROLE)  Morph tag
yo         yo       PP            sg:1:nom
Me         me       PP            sg:1:acc
alojo      alojar   VM            i:p:sg:1
en         en       SP
la         la       TD            f:sg
casa       casa     NCF           f:sg
de         de       SP
huéspedes  huésped  NCC           m:pl
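Pronoun insertion for pro-drop sentences could be sketched as follows. This is a simplified illustration under stated assumptions: the pronoun table, the restriction to first and second person, and the choice to insert the pronoun at the start of the sentence are assumptions made here, not the documented METIS-II rule.

```python
# Hypothetical table of Spanish subject pronouns, keyed by (number, person).
# Third person is left out on purpose: there the subject may be a full noun
# phrase, which cannot be ruled out from POS tags alone.
SUBJECT_PRONOUNS = {
    ("sg", "1"): "yo",
    ("sg", "2"): "tú",
    ("pl", "1"): "nosotros",
    ("pl", "2"): "vosotros",
}

def insert_subject_pronoun(tokens):
    """tokens: list of (form, lemma, pos, morph) in PAROLE-style notation."""
    tokens = list(tokens)
    has_subject_pronoun = any(
        pos == "PP" and morph.endswith(":nom") for _, _, pos, morph in tokens
    )
    if has_subject_pronoun:
        return tokens
    for _, _, pos, morph in tokens:
        fields = morph.split(":")
        if pos == "VM" and len(fields) >= 2:
            # Verb morph tags end in number and person, e.g. i:p:sg:1.
            number, person = fields[-2], fields[-1]
            pronoun = SUBJECT_PRONOUNS.get((number, person))
            if pronoun is not None:
                subject = (pronoun, pronoun, "PP", f"{number}:{person}:nom")
                return [subject] + tokens
            break  # only the first finite verb is considered
    return tokens

sentence = [
    ("Me", "me", "PP", "sg:1:acc"),
    ("alojo", "alojar", "VM", "i:p:sg:1"),
    ("en", "en", "SP", ""),
    ("la", "la", "TD", "f:sg"),
    ("casa", "casa", "NCF", "f:sg"),
    ("de", "de", "SP", ""),
    ("huéspedes", "huésped", "NCC", "m:pl"),
]
print(insert_subject_pronoun(sentence)[0])
```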
Translation Model: Spanish-English Lexicon Look-up
Lexical resources: Sp-Eng lexmetis, HD, Oxford
Spanish lemma  Sp. POS (PAROLE)  English lemma  Eng. POS (reduced CLAWS5)  Order of translation
alojar         VM                house          VV                         1
alojar         VM                stay           VV                         2
alojar         VM                lodge          VV                         3
Output: list of Pseudo-English candidates (UDF)
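A minimal sketch of the look-up step, assuming the lexicon is keyed by (Spanish lemma, PAROLE POS) and returns candidates in their listed order. The dictionary layout and the `lookup` helper are assumptions for illustration; only the three `alojar` entries come from the slide, and the output here is a plain list rather than real UDF.

```python
# Toy lexicon holding only the entries shown on the slide.
# Key: (Spanish lemma, PAROLE POS); value: English (lemma, POS) in order.
LEXICON = {
    ("alojar", "VM"): [("house", "VV"), ("stay", "VV"), ("lodge", "VV")],
}

def lookup(lemma, pos):
    """Return ordered pseudo-English candidates in lemma|POS notation."""
    return [f"{en_lemma}|{en_pos}" for en_lemma, en_pos in LEXICON.get((lemma, pos), [])]

print(lookup("alojar", "VM"))
```

An empty list signals an unfound word, which a later fallback step would have to handle.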
Translation Model: Compound Detection
Lexical resources: Sp-Eng lexmetis, Oxford
<trans-unit id="6">
<option id="1">
<token-trans id="1">
<lemma>boarding</lemma>
<pos>VVG</pos>
</token-trans>
<token-trans id="2">
<lemma>house</lemma>
<pos>NN1</pos>
</token-trans>
</option>
</trans-unit>
casa => house
casa => casa de huéspedes
casa de huéspedes => boarding house
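Compound detection can be sketched as a greedy longest-match over the lemma sequence. This is an assumption about how the step could work, not the documented implementation; the lexicon contents, the matching on lemmas, and the `segment` helper are hypothetical.

```python
# Toy lexicon with one single-word and one multi-word entry (lemma tuples).
MWU_LEXICON = {
    ("casa",): ["house|NN1"],
    ("casa", "de", "huésped"): ["boarding|VVG", "house|NN1"],
}

def segment(lemmas, lexicon, max_len=3):
    """Split lemmas into translation units, preferring the longest match."""
    units = []
    i = 0
    while i < len(lemmas):
        # Try the longest span first, then shorter ones.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in lexicon:
                units.append((key, lexicon[key]))
                i += n
                break
        else:
            # No entry found: keep the lemma as an untranslated unit.
            units.append(((lemmas[i],), []))
            i += 1
    return units

print(segment(["la", "casa", "de", "huésped"], MWU_LEXICON))
```

Because the three-word span is tried before the single word, "casa de huésped" is translated as one unit instead of starting from "casa => house".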
Translation Model: Unfound words
• Past participles, e.g. “denominado” > denominar (VM) > designate (VV) > designated (AJ0)
• Adverbs, e.g. “técnicamente” > técnico (AQ) > technical (AJ0) > technically (AV0)
TL Generation (English)
Pseudo-English UDF
-> Search Engine (TL models)
-> English lemmatized sentence
-> Token generation
-> English translation
Search Engine (1st version)
[Diagram. Components: lexical preselection, candidate scoring, candidate expansion. Resources: n-gram TL models.]
Search Engine (2nd version): beam search decoding
[Diagram. Search engine steps: lexical pre-selection, candidate expansion, scoring. Resources: n-gram TL models. Example candidate: the worker must carry helmet; alternative words shown: wear, drive, bottle, headphones, helmet, …]
Target Language Models
• Corpus: BNC (6 M sentences)
• n-gram models: 1-gram, 2-gram, 3-gram, 4-gram, 5-gram
• Example TL model entry: stay|VV in|PRP the|AT0 house|NN
• Substitution of one position by its POS tag (for n>2): stay|VV in|PRP the|AT0 NN
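A minimal sketch of how such n-gram models could be collected from a tagged corpus, including the variants in which one position is replaced by its bare POS tag for n > 2. The function names and the use of a single `Counter` for all orders are assumptions for illustration; the real off-line processing of the BNC is not described on the slide.

```python
from collections import Counter

def ngram_variants(ngram):
    """Yield the n-gram itself and, for n > 2, each one-position POS variant."""
    yield tuple(ngram)
    if len(ngram) > 2:
        for i, token in enumerate(ngram):
            pos = token.split("|")[1]
            yield tuple(ngram[:i]) + (pos,) + tuple(ngram[i + 1:])

def build_model(sentences, max_n=5):
    """Count all 1- to max_n-grams (and POS variants) in lemma|POS sentences."""
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for start in range(len(sent) - n + 1):
                for variant in ngram_variants(sent[start:start + n]):
                    counts[variant] += 1
    return counts

corpus = [
    ["stay|VV", "in|PRP", "the|AT0", "house|NN"],
    ["stay|VV", "in|PRP", "the|AT0", "hotel|NN"],
]
model = build_model(corpus)
print(model[("stay|VV", "in|PRP", "the|AT0", "house|NN")])
print(model[("stay|VV", "in|PRP", "the|AT0", "NN")])
```

The generalized entry pools evidence from both sentences, which is the point of the substitution: it reduces data sparsity for patterns whose exact lemma sequence is rare.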
Handling Structure Divergences in TL Generation: Local Structure Modifications
• Insertion of functional words: want|VV go|VV -> want|VV to|TO0 go|VV
• Deletion of functional words: at|PRP the|AT0 home|NN -> at|PRP home|NN
• Permutation of content words: a|AT0 {day|NN happy|AJ0}
[Diagram: the modified candidates are looked up in the 1-gram to 5-gram models, which store each n-gram with its frequency.]
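The three local edit operations can be sketched as a candidate generator whose output is ranked by n-gram frequency. The function-word list, the POS classes and the frequencies below are toy assumptions; the actual engine's operation inventory and scoring are not specified on the slide.

```python
# Hypothetical inventories used only for this illustration.
FUNCTION_WORDS = ["to|TO0", "the|AT0"]
FUNCTION_POS = {"TO0", "AT0", "PRP"}
CONTENT_POS = {"NN", "AJ0", "VV"}

def pos_of(token):
    return token.split("|")[1]

def local_edits(tokens):
    """Return the original sequence plus all single-edit variants."""
    tokens = list(tokens)
    candidates = [tuple(tokens)]
    # Insertion of a function word between two tokens.
    for i in range(1, len(tokens)):
        for word in FUNCTION_WORDS:
            candidates.append(tuple(tokens[:i] + [word] + tokens[i:]))
    # Deletion of a function word.
    for i, token in enumerate(tokens):
        if pos_of(token) in FUNCTION_POS:
            candidates.append(tuple(tokens[:i] + tokens[i + 1:]))
    # Permutation of two adjacent content words.
    for i in range(len(tokens) - 1):
        if pos_of(tokens[i]) in CONTENT_POS and pos_of(tokens[i + 1]) in CONTENT_POS:
            swapped = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
            candidates.append(tuple(swapped))
    return candidates

def best_variant(tokens, freq):
    """Pick the variant with the highest n-gram frequency."""
    return max(local_edits(tokens), key=lambda c: freq.get(c, 0))

# Toy frequencies, invented for the example.
FREQ = {
    ("want|VV", "to|TO0", "go|VV"): 50,
    ("want|VV", "go|VV"): 1,
    ("at|PRP", "home|NN"): 40,
    ("at|PRP", "the|AT0", "home|NN"): 2,
    ("a|AT0", "happy|AJ0", "day|NN"): 30,
}
print(best_variant(["want|VV", "go|VV"], FREQ))
print(best_variant(["at|PRP", "the|AT0", "home|NN"], FREQ))
print(best_variant(["a|AT0", "day|NN", "happy|AJ0"], FREQ))
```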
Search Engine: beam search decoding
Performance problems
Combinatorial explosion in the expansion step: suppose the source sentence has at least 35 words, each of which translates to at least two English words. Then:
$\text{Candidates} \geq 2^{35} \approx 3.4 \times 10^{10}$
The search space of candidates must be pruned.
Performance problems: combinatorial explosion in the expansion step and in the scoring computation step.
Search Engine: beam search decoding
Solution: build the search space incrementally, following Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (Philipp Koehn, 2004).
Stack 1: the candidates w1,…,wk are pushed onto the first stack. The stack is ranked and pruned to a given stack depth.
Stack i: each candidate of the (i-1)-th stack is expanded via the dictionary and the edit operations. The candidates are again ranked and pruned to the given stack depth.
The score of each partial translation is computed from the scores already computed and stored:
$\mathrm{score}(pTrans_i, w_i) = \mathrm{score}(pTrans_i) + \mathrm{cost}(\mathrm{end}_{n-1}(pTrans_i), w_i)$
At the N-th step (the source sentence contains N tokens) the decoding process stops. The result is a ranked stack of translation candidates.
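The incremental, stack-based search can be sketched as follows. This is a simplified illustration and not the METIS-II or Pharaoh implementation: it expands through the dictionary only (no edit operations), and the bigram-based `bigram_cost`, the counts and the candidate lists are toy assumptions.

```python
import math

def bigram_cost(prev, word, bigram_counts):
    """Toy incremental score: log of the add-one smoothed bigram count."""
    return math.log(bigram_counts.get((prev, word), 0) + 1)

def beam_decode(options, bigram_counts, stack_depth=3):
    """options[i] lists the translation candidates of the i-th source token."""
    stack = [((), 0.0)]  # (partial translation, stored score)
    for candidates in options:
        expanded = []
        for partial, score in stack:
            prev = partial[-1] if partial else "<s>"
            for word in candidates:
                # Reuse the stored score; only the new word is scored.
                new_score = score + bigram_cost(prev, word, bigram_counts)
                expanded.append((partial + (word,), new_score))
        # Rank and prune to the given stack depth.
        expanded.sort(key=lambda item: item[1], reverse=True)
        stack = expanded[:stack_depth]
    return stack

# Toy candidate lists and counts, invented for the example.
OPTIONS = [
    ["the"], ["worker"], ["must"],
    ["carry", "wear", "drive"],
    ["helmet", "bottle", "headphones"],
]
BIGRAMS = {
    ("<s>", "the"): 10, ("the", "worker"): 5, ("worker", "must"): 3,
    ("must", "carry"): 4, ("must", "wear"): 4, ("must", "drive"): 2,
    ("wear", "helmet"): 6, ("carry", "bottle"): 2,
}
for translation, score in beam_decode(OPTIONS, BIGRAMS):
    print(" ".join(translation), round(score, 3))
```

With a stack depth of d and at most k candidates per token, each step scores at most d × k expansions, so the work grows linearly with sentence length instead of exponentially.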
Handling Structure Divergences in TL Generation: Non-local Movements
Syntactic model built from the normalized, chunked BNC (chunk boundaries marked), e.g. [The man] [sleeps] [at the park]
Chunk-order patterns:
[PRP AT0 NN] [VV] [AT0 NN]
[AT0 NN] [PRP AT0 NN] [VV]
[AT0 NN] [VV] [PRP AT0 NN]
Evaluation of the final METIS prototype
Comparison with SYSTRAN:
• Widely used
• Available for all language pairs
• Rule-based, many man-years of development
Goal: get an estimate of what has been achieved
Methodology: Test sets
Two test sets:
• 200 sentences manually chosen from Europarl
• 200 sentences from a balanced test suite used to validate system development, drawn from a variety of domains:
  25% grammatical phenomena
  25% newspapers
  25% technical
  25% scientific
Methodology: Metrics
• BLEU & NIST: measure n-gram overlap between the MT output and the reference translations
• TER (Translation Error Rate): measures the amount of editing that a human would have to perform
Methodology: References
All metrics use human-created references to compare with the MT output.
• Europarl: 5 references (4 produced by human translation of each SL version into English, plus the original English one)
• Development: 3 references
Results on Europarl test set (BLEU)
Pair   METIS-II  SYSTRAN  Difference  METIS-II as % of SYSTRAN
ES-EN  0.2784    0.4638   0.1854      60%
EL-EN  0.1861    0.3132   0.1271      59%
DE-EN  0.2816    0.3958   0.1142      71%
NL-EN  0.1925    0.3828   0.1903      50%
Results on development test suite (BLEU)
Pair   METIS-II  SYSTRAN  Difference  METIS-II as % of SYSTRAN
ES-EN  0.2941    0.4634   0.1693      63%
EL-EN  0.3661    0.3946   0.0285      92%
DE-EN  0.2231    0.3133   0.0902      71%
NL-EN  0.2369    0.3777   0.1408      70%
METIS-II on both test sets (BLEU)
Pair   Europarl  Dev     Difference
ES-EN  0.2784    0.2941  0.0157
EL-EN  0.1861    0.3661  0.1800
DE-EN  0.2816    0.2231  -0.0585
NL-EN  0.1925    0.2369  0.0444
Results according to text type (ES-EN on Development Testsuite, BLEU)
System    Grammar  News  Science  Tech
METIS-II  0.22     0.33  0.29     0.26
SYSTRAN   0.48     0.46  0.47     0.45
Impact of the number of reference translations (DE-EN on Europarl)
Ref            METIS-II  SYSTRAN
All 5          0.2816    0.3958
4 (- Greek)    0.2774    0.3871
4 (- Dutch)    0.2750    0.3817
4 (- Spanish)  0.2697    0.3739
4 (- German)   0.1803    0.3064
1 (German)     0.2376    0.2912
1 (Europarl)   0.1155    0.1975
1 (Spanish)    0.1199    0.1922
1 (Dutch)      0.0923    0.1483
1 (Greek)      0.0761    0.1182
Conclusions
• Homogeneity of results confirms the language independence of the strategy
• Young Metis II is still not at the same level as mature SYSTRAN, but it stands up to the comparison
• Ample room for improvement
Lines for (quick) improvement
• Fix bugs
• Extend lexical coverage
• Augment target corpus
• Add mapping rules
Other lines for improvement (need bilingual corpora…)
• Parameter tuning
• Enrichment of the translation model:
  automatic induction of dictionary entries (lexical and phrasal)
  automatic induction of structure transfer rules
• Automatic postediting rules (needs a corpus of corrected translations)
Publications (METIS-II) (1)
Toni Badia, Gemma Boleda, Maite Melero, and Antoni Oliver, An n-gram approach to exploiting monolingual corpus for MT, in: Proceedings of the 2nd Workshop on Example-Based Machine Translation held in conjunction with the 10th Machine Translation Summit, pp. 1-8, Phuket, 2005.
Toni Badia, Gemma Boleda, Maite Melero, and Antoni Oliver, El proyecto METIS-II, in: Procesamiento del Lenguaje Natural, 35 (2005), pp. 443-444.
Antonio Oliver, Toni Badia, Gemma Boleda, and Maite Melero, Traducción automática estadística basada en n-gramas, in: Procesamiento del Lenguaje Natural, 35 (2005), pp. 77-84.
Vincent Vandeghinste, Ineke Schuurman, Michael Carl, Stella Markantonatou, and Toni Badia, METIS-II: Machine Translation for Low-Resource Languages, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1284-1289, Genoa, 2006.
Publications (METIS-II) (2)
Maite Melero, Antoni Oliver, Toni Badia, and Teresa Suñol, Dealing with Bilingual Divergences in MT using Target Language N-gram Models, in: Proceedings of the METIS-II Workshop: New Approaches to Machine Translation, pp. 19-26, Leuven, 2007.
Maite Melero and Toni Badia, Demonstration of the Spanish to English METIS-II MT System, in: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), pp. 132-133, Skövde, 2007.
Toni Badia, Maite Melero and Oriol Valentin, Rapid deployment of a new METIS language pair: Catalan-English. Submitted to LREC, Marrakech, 2008.
Follow-Up Projects
• AMASS++: Summarization of multimedia / multilingual information (KUL): NL <-> EN
• PaCo-MT: Parse and Corpus based MT (KUL): NL <-> EN, NL <-> FR
• I3Media (FUPF): ES -> CA, CA -> EN, EN -> ES
Rapid deployment of new METIS language pair: CA-EN
Catalan-English prototype
Motivation: parallel corpora are even harder to get for smaller languages
Rapid deployment of new METIS language pair: CA-EN
We have simply plugged into the English Generation component:
• A Catalan POS tagger (CatCG)
• An open source CA-EN dictionary (Dacco)
Adaptation and integration: less than 1 person-month
Experiment and evaluation
The 200 Spanish sentences from the METIS II development test suite were translated into Catalan.
System                Grammar  News    Science  Tech    All
Cat-Eng METIS         0.2059   0.2533  0.2070   0.2365  0.2342
Sp-Eng METIS          0.2241   0.3273  0.2876   0.2633  0.2941
Cat-Eng Translendium  0.3334   0.4406  0.4226   0.4264  0.4250
Thank you!