Institut für Anthropomatik 1 24.06.13 Simon Hummel – Lehrstuhl Prof. Waibel Grammatical Agreement in SMT Seminar Sprach-zu-Sprach-Übersetzung SS 2013

Grammatical Agreement in SMT


Inflection and Agreement

Inflection
– Modification of a word
– Signals grammatical variants (tense, gender, case, …)
– e.g. walk vs. walked

Agreement
– The inflections of related words in a sentence have to agree
– e.g. das Haus (correct) vs. die Haus (wrong gender on the article)

Some languages are weakly inflected (e.g. English)

Some are highly inflected (e.g. German, Arabic, …)


Agreement Errors

Local Agreement Errors

Ref: the-car+F go+F with-speed

Hypo: the-car+F go+M with-speed

Long-distance Agreement Errors

Ref: celle qui parle , c’est ma femme

one+F who speak , is my wife+F

Hypo: celui qui parle est ma femme

one+M who speak is my spouse+F


Approaches for SMT

Morphological Generation
– Create raw stems and modify them with predicted inflections

Agreement Constraints
– Use an SCFG of the target language and add constraints to it

Class-based Agreement Model
– Use morphological word classes, e.g. “Noun+Def+Sg+Fem”


Morphological Generation: Idea

“Generating Complex Morphology for Machine Translation” (Minkov and Toutanova, 2007)

Convert MT output to stem sequence

Predict an inflection for every stem

Reflect meaning and comply with agreement rules


Morphological Generation: Lexicons

Morphology analysis and generation

Operations:
– Stemming
– Inflection
– Morphological analysis

Create manually

Create automatically from data

Here: assumed as given
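The three lexicon operations can be pictured with a toy in-memory lexicon. This is only a sketch: the class `ToyLexicon`, its method names, the tag strings and the German entries are assumptions made for illustration, not the lexicons used by Minkov and Toutanova.

```python
from collections import defaultdict


class ToyLexicon:
    """Hypothetical lexicon offering stemming, inflection and analysis."""

    def __init__(self, entries):
        # entries: iterable of (surface form, stem, morphological tag)
        self._analyses = defaultdict(set)   # surface -> {(stem, tag)}
        self._forms = defaultdict(set)      # stem -> {(surface, tag)}
        for surface, stem, tag in entries:
            self._analyses[surface].add((stem, tag))
            self._forms[stem].add((surface, tag))

    def stems(self, word):
        """Stemming: all stems the word can be reduced to."""
        return {stem for stem, _ in self._analyses.get(word, set())}

    def inflections(self, stem):
        """Inflection: all surface forms that can be generated from a stem."""
        return {surface for surface, _ in self._forms.get(stem, set())}

    def analyses(self, word):
        """Morphological analysis: all (stem, tag) readings of a word."""
        return set(self._analyses.get(word, set()))


# Made-up entries, for illustration only.
lexicon = ToyLexicon([
    ("Haus", "Haus", "Noun+Sg+Nom"),
    ("Hauses", "Haus", "Noun+Sg+Gen"),
    ("Häuser", "Haus", "Noun+Pl+Nom"),
])
```

Here `lexicon.inflections("Haus")` gives the candidate set from which an inflection would later be predicted for the stem.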


Morphological Generation: Inflection Prediction

Maximum Entropy Markov model (2nd order)

Features:
– Monolingual
– Bilingual
– Lexical
– Morphological
– Syntactic

$$p(y \mid x) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \quad y_t \in I_t$$
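The factorisation above can be decoded with a Viterbi search whose state is the pair of the two most recent labels. The following is a minimal sketch under stated assumptions: `decode`, `toy_prob` and the labels are hypothetical, and `toy_prob` merely stands in for the trained maximum entropy model.

```python
import math


def decode(stems, candidates, local_prob):
    """Find the label sequence maximising prod_t p(y_t | y_{t-1}, y_{t-2}, x_t).

    candidates[t] plays the role of I_t, the inflections allowed for stem t.
    local_prob(y, prev1, prev2, x) is a hypothetical stand-in for the
    trained local model and must return a value greater than zero.
    """
    start = "<s>"
    # Viterbi state: (y_{t-2}, y_{t-1}) -> (log probability, best path so far)
    best = {(start, start): (0.0, [])}
    for x, options in zip(stems, candidates):
        nxt = {}
        for (prev2, prev1), (score, path) in best.items():
            for y in options:
                s = score + math.log(local_prob(y, prev1, prev2, x))
                key = (prev1, y)
                if key not in nxt or s > nxt[key][0]:
                    nxt[key] = (s, path + [y])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


def toy_prob(y, prev1, prev2, x):
    # Made-up scores: prefer a label that agrees with the previous one.
    if prev1 == "<s>":
        return 1.0
    return 0.9 if y == prev1 else 0.1


stems = ["the-car", "go"]
candidates = [["Fem"], ["Fem", "Masc"]]
prediction = decode(stems, candidates, toy_prob)
```

With these toy scores the search picks the feminine verb form after the feminine noun, which is the local agreement case from the error examples.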


Morphological Generation: Evaluation

English-Russian and English-Arabic

Technical (software manual) domain

Input: aligned sentence pairs of reference translations (not the output of an MT system) → reduces noise

Accuracy (%) results


Morphological Generation: Conclusion

Needed resources:
– Large corpus of aligned sentence pairs
– Lexicons (source and target) with the three operations

+ Better accuracy than simple LM (even with small training data)

+ Easy to add to existing MT system

- Expensive creation of lexicons


Constraints: Idea

“Agreement Constraints for Statistical Machine Translation into German” (Williams and Koehn, 2011)

String-to-tree model

Synchronous grammar for target language

Adding learned constraints and probabilities

Evaluation of constraints during decoding


Constraints: Feature Structure

Feature structure

Unification
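Unification of two feature structures can be sketched with nested dictionaries. This is a simplified illustration, not the implementation of Williams and Koehn: it assumes atomic feature values, ignores shared (re-entrant) substructures, and the feature names `INFL`, `GENDER` and `NUMBER` are chosen for the example.

```python
def unify(a, b):
    """Unify two feature structures given as (possibly nested) dicts.

    Returns the merged structure, or None if the two structures clash.
    """
    result = dict(a)
    for feature, value in b.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        elif result[feature] != value:
            return None  # conflicting atomic values: unification fails
    return result


# Hypothetical feature structures for an article and two nouns.
die = {"INFL": {"GENDER": "f", "NUMBER": "sg"}}
katze = {"INFL": {"GENDER": "f"}}
haus = {"INFL": {"GENDER": "n"}}

agrees = unify(die, katze)   # compatible: structures are merged
clashes = unify(die, haus)   # gender conflict: None
```

A failed unification corresponds to an agreement violation such as "die Haus".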


Constraints: Grammar

Synchronous grammar learned from parallel corpus

Extended by constraints at target-side

Sample rule/constraint:

NP-SB → the X_1 cat | die AP_1 Katze


Constraints: Training

Propagation rules to capture NP/PP agreements:

Applied bottom-up


Constraints: Decoding

Model:

$$\hat{t} = \arg\max_{t} p(t \mid s)$$

$$p(t \mid s) = \frac{1}{Z} \sum_{i=1}^{n} \lambda_i h_i(s, t)$$

Every element of a rule/constraint has a feature structure

Constraint evaluation: each hypothesis stores the set of feature structures corresponding to its root rule element

Recombination of hypotheses is possible
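The decoding model can be illustrated with a few lines of Python. This sketch assumes that a hypothesis whose constraints fail is simply discarded; the feature names, weights and candidate translations are invented for the example. Since Z does not depend on t, it can be dropped when taking the argmax.

```python
def score(features, weights):
    """Weighted feature sum: sum_i lambda_i * h_i(s, t), without the constant 1/Z."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Pick the highest-scoring candidate whose agreement constraints hold.

    candidates: list of (translation, feature dict, constraints_ok flag).
    Returns None if no candidate satisfies its constraints.
    """
    admissible = [c for c in candidates if c[2]]
    if not admissible:
        return None
    return max(admissible, key=lambda c: score(c[1], weights))[0]


# Made-up weights and feature values (log-scale scores).
weights = {"lm": 1.0, "tm": 1.0}
candidates = [
    ("die Katze", {"lm": -2.0, "tm": -1.0}, True),
    ("der Katze", {"lm": -1.5, "tm": -1.0}, False),   # fails agreement
    ("die große Katze", {"lm": -4.0, "tm": -1.5}, True),
]
chosen = best_translation(candidates, weights)
```

In this toy setting "der Katze" has the best raw score but is excluded because its constraint check failed.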


Constraints: Evaluation

English-German

Europarl and News Commentary

Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit

Treebank for target

Grammar: ~140 million rules

BLEU scores and p-values for three test sets


Constraints: Conclusion

Needed resources:
– Parallel corpus
– Heuristics for constraint extraction

+ Improvement in translation accuracy

- Improvement is quite small


Class-Based: Idea

“A Class-Based Agreement Model for Generating Accurately Inflected Translations” (Green and DeNero, 2012)

During decoding

Target-side

Three steps:

1. Segmentation

2. Tagging

3. Scoring


Class-Based: Segmentation

Train conditional random field

Features:

Centered 5-character window

During decoding

Not as preprocessing step

Labels:

I: Continuation (Inside)

O: Outside (whitespace)

B: Beginning

F: Non-native chars
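The label set and the character window can be made concrete with a small sketch. It assumes that gold segmentations are available as lists of segments per word, that `str.isalpha` is an acceptable stand-in for a "native character" test, and that window edges are padded with `_`; these choices and the transliterated example are illustrative only.

```python
def label_characters(words, is_native=str.isalpha):
    """Assign one of the labels I, O, B, F to every character.

    words: list of words, each given as a list of its segments.
    is_native: hypothetical test for characters of the target script.
    """
    chars, labels = [], []
    for w, segments in enumerate(words):
        if w > 0:
            chars.append(" ")
            labels.append("O")          # whitespace between words
        for segment in segments:
            for i, ch in enumerate(segment):
                chars.append(ch)
                if not is_native(ch):
                    labels.append("F")  # non-native character
                elif i == 0:
                    labels.append("B")  # first character of a segment
                else:
                    labels.append("I")  # continuation of a segment
    return chars, labels


def window(chars, i, size=5, pad="_"):
    """Centered character window around position i, padded at the edges."""
    half = size // 2
    padded = [pad] * half + list(chars) + [pad] * half
    return padded[i:i + size]


# Transliterated toy input: one word with two segments, then a digit.
chars, labels = label_characters([["al", "kitab"], ["3"]])
first_window = window(chars, 0)
```

The windows would serve as input features for the CRF, and the labels as its training targets.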


Class-Based: Tagging

Train CRF on full sentences with gold classes

Features:
– Current and previous words, affixes, etc.

Labels:
– Morphological classes
→ Gender, number, person, definiteness
– e.g. 89 classes for Arabic

Example:

'the car'

Tagged: “Noun+Def+Sg+Fem”


Class-Based: Scoring

Scoring of word sequences not comparable across hypotheses

→ Scoring class sequences with generative model

Simple bigram LM over gold class sequences (add-1 smoothed)

$$\tau' = \arg\max_{\tau} p(\tau \mid s)$$

$$q(e) = p(\tau') = \prod_{i=1}^{I} p(\tau'_i \mid \tau'_{i-1})$$
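The scoring step can be sketched as an add-1 smoothed bigram model over class sequences. The class `ClassBigramModel`, the start symbol `<s>` and the tiny training set are assumptions for illustration; the real model is trained on gold class sequences from a treebank.

```python
import math
from collections import Counter


class ClassBigramModel:
    """Add-1 smoothed bigram model over morphological class sequences."""

    START = "<s>"

    def __init__(self, sequences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.classes = set()
        for seq in sequences:
            prev = self.START
            for cls in seq:
                self.bigrams[(prev, cls)] += 1
                self.contexts[prev] += 1
                self.classes.add(cls)
                prev = cls

    def prob(self, cls, prev):
        # Add-1 smoothing over the set of classes seen in training.
        return (self.bigrams[(prev, cls)] + 1) / (self.contexts[prev] + len(self.classes))

    def log_score(self, seq):
        """log q(e) = sum_i log p(tau_i | tau_{i-1})."""
        total, prev = 0.0, self.START
        for cls in seq:
            total += math.log(self.prob(cls, prev))
            prev = cls
        return total


# Made-up gold class sequences.
gold = [
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
]
model = ClassBigramModel(gold)
agreeing = model.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Fem"])
disagreeing = model.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Masc"])
```

On this toy data the agreeing class sequence receives the higher score, which is the signal the decoder uses as an additional feature.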


Class-Based: Evaluation

English-Arabic

Training data: variety of sources (e.g. web)

Development and Test: NIST sets (Newswire and mixed genre [broadcast news, newsgroups, weblog])

Phrase-based decoder

BLEU score for newswire sets

BLEU score for mixed genre sets


Class-Based: Conclusion

Needed resources:
– Treebank for target (existing for many languages)
– Large target corpus

+ Improves translation quality

+ Easy to integrate in existing MT system

- Increases decoding time

- Not very good for mixed genres


References

Green, S. and DeNero, J. (2012). “A Class-Based Agreement Model for Generating Accurately Inflected Translations”. In: ACL.

Williams, P. and Koehn, P. (2011). “Agreement Constraints for Statistical Machine Translation into German”. In: Sixth Workshop on Statistical Machine Translation.

Minkov, E. and Toutanova, K. (2007). “Generating Complex Morphology for Machine Translation”. In: ACL.