Human Judgements in Parallel Treebank Alignment

Martin Volk, Torsten Marek, Yvonne Samuelsson
University of Zurich and Stockholm University
volk@cl.uzh.ch

23 August 2008

English Syntax Tree


DE – EN

Alignment


SMULTRON: Stockholm MULtilingual TReebank

1000 sentences in 3 languages (DE-EN-SV)
    500 from Jostein Gaarder's Sophie's World (~7 500 tokens, 14 tokens/sentence)
    500 from Economy texts (~11 000 tokens, 22 tokens/sentence)
        ABB Quarterly report
        Rainforest Alliance: Banana Certification Program
        SEB Annual report

Released: January 2008
www.ling.su.se/dali/research/smultron/index.htm


German Annotation


German sentence: flat annotation


German sentence: deepened


English Annotation


English Syntax Tree


English annotation

Follows the Penn Treebank guidelines
Slower annotation because of
    insertion of traces
    secondary edges
    deeper trees


Tree Alignment


Sentence alignment
Word alignment
    input for Statistical MT
Phrase alignment
    linguistically motivated phrases
    input for Example-based MT


Alignment Example


Tools for Parallel Treebanks

creating and editing trees
    from mono-lingual treebanks
    PoS-taggers, chunkers, editor, 'tree-enricher'
aligning phrases
    use of word alignment tools
    tree alignment editor: Stockholm TreeAligner
searching across languages
    TIGER-Search for parallel treebanks: Stockholm TreeAligner


Guidelines for Alignment

1. Align words and phrases that represent the same meaning and could serve as translation units in an MT system.
2. Align as many words and phrases as possible.
3. Distinguish between exact and approximate alignments.
4. 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments.
5. m:n sentence alignments are allowed.
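
Guidelines 3 and 4 can be read as constraints on a data structure. The sketch below is a hypothetical representation, not the file format or API of the Stockholm TreeAligner: each link connects one node on the German side to one node on the English side and carries an exact/approximate flag. The names `Link` and `find_m_to_n` and the node ids are assumptions made for illustration. A link is flagged as part of an m:n configuration when both of its endpoints take part in more than one link.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str          # node id in the German tree (hypothetical id scheme)
    target: str          # node id in the English tree (hypothetical id scheme)
    exact: bool = True   # False marks an approximate (fuzzy) alignment


def find_m_to_n(links):
    """Return the links whose source and target both occur in several links.

    Assumes the list holds no duplicate links. 1:n alignments pass the check,
    because there only one endpoint of each link is shared.
    """
    src_degree = Counter(link.source for link in links)
    tgt_degree = Counter(link.target for link in links)
    return [
        link for link in links
        if src_degree[link.source] > 1 and tgt_degree[link.target] > 1
    ]


allowed = [
    Link("de_w1", "en_w1"),
    Link("de_w1", "en_w2"),               # 1:n, allowed by guideline 4
    Link("de_p5", "en_p7", exact=False),  # approximate alignment, guideline 3
]
not_allowed = allowed + [Link("de_w2", "en_w2")]  # turns 1:n into m:n

print(find_m_to_n(allowed))
print(find_m_to_n(not_allowed))
```

A check of this kind only covers the formal part of the guidelines; whether two units "represent the same meaning" remains a human judgement.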


Examples

Do not align:
    die Verwunderung über das Leben
    their astonishment at the world

Do align:
    was für eine seltsame Welt
    what an extraordinary world


Specific rules

a pronoun in one language shall never be aligned with a full noun in the other

names are aligned regardless of spelling, unless the name is changed (fiction)

ignore number/case but not voice
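
The pronoun rule is one that a support tool could check automatically. The following is a minimal sketch under stated assumptions: the German side is tagged with STTS and the English side with Penn Treebank tags, the tag sets listed are a partial selection chosen for illustration, and proper names are treated as full nouns. The function name `violates_pronoun_rule` is hypothetical.

```python
# Partial tag sets, chosen for illustration only.
DE_PRONOUNS = {"PPER", "PDS", "PIS", "PPOSS", "PRELS", "PWS"}  # STTS
DE_NOUNS = {"NN", "NE"}                                        # STTS
EN_PRONOUNS = {"PRP", "WP"}                                    # Penn Treebank
EN_NOUNS = {"NN", "NNS", "NNP", "NNPS"}                        # Penn Treebank


def violates_pronoun_rule(de_tag, en_tag):
    """True if a word-level link pairs a pronoun with a noun in either direction."""
    return (
        (de_tag in DE_PRONOUNS and en_tag in EN_NOUNS)
        or (de_tag in DE_NOUNS and en_tag in EN_PRONOUNS)
    )


print(violates_pronoun_rule("PPER", "NN"))   # pronoun vs. noun
print(violates_pronoun_rule("PPER", "PRP"))  # pronoun vs. pronoun
```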


Exact vs approximate alignment

best vs. "second-best" translation
an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other
    PT – Power Technologies
difficult distinctions
    einer der ersten Tage im Mai – early May
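
The acronym rule is mechanical enough to sketch as a heuristic. The function below is an assumption for illustration, not part of any tool named in these slides: it treats a short form as an acronym when it is written in capitals and equals the initials of the spelled-out term, in which case the link would be labelled approximate.

```python
def is_acronym_of(short, long):
    """Heuristic: `short` is all caps and matches the initials of `long`."""
    initials = "".join(word[0] for word in long.split())
    return short.isupper() and short.upper() == initials.upper()


def suggested_label(de_text, en_text):
    """Suggest 'approximate' for acronym/full-form pairs, otherwise no suggestion."""
    if is_acronym_of(de_text, en_text) or is_acronym_of(en_text, de_text):
        return "approximate"
    return None


print(suggested_label("PT", "Power Technologies"))
print(suggested_label("early May", "einer der ersten Tage im Mai"))
```

The second call returns no suggestion, which matches the point of the slide: pairs like einer der ersten Tage im Mai and early May cannot be decided by a surface check.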


Related Research

Blinker project (Melamed)
Prague Czech-English Treebank
Example-based MT in Dublin
Linköping English-Swedish Treebank


Experiment

12 students to align 20 tree pairs DE-EN
    10 tree pairs from Sophie's World
    10 tree pairs from Economy text
advanced CL students
received
    short introduction
    the written guidelines


Gold Standard Alignment (DE-EN)

                   word - word              phrase - phrase
                   exact  approx.  total    exact  approx.  total
10 sent. Sophie    75     3        78       46     12       58
10 sent. Econ      159    19       178      62     9        71


Experiment: Results

The students created a huge variety in number of alignments
    Sophie part: from 47 to 125 (ø = 94.3)
    Econ part: from 62 to 259 (ø = 186.9)
the 3 students with the lowest numbers were non-native speakers of German
1 student had misunderstood the task


Experiment: Results

The remaining 8 students had a high overlap with the gold standard (Recall):
    Sophie part: from 48% to 81% (ø = 68.7%)
    Econ part: from 66% to 89% (ø = 75.5%)
Precision
    Sophie part: from 81% to 97% (ø = 89.1%)
    Econ part: from 78% to 94% (ø = 88.2%)
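
For reference, recall and precision against a gold standard can be computed as set overlap. This is a minimal sketch, assuming each alignment is a (German node, English node) pair and that the exact/approximate label is ignored when matching; the slides do not state how labels were treated, and the data below is invented for illustration.

```python
def precision_recall(student, gold):
    """Precision and recall of a student's alignment links against the gold links."""
    student, gold = set(student), set(gold)
    overlap = len(student & gold)
    precision = overlap / len(student) if student else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: 4 gold links, 3 student links, 2 of them correct.
gold = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
student = {("s1", "t1"), ("s2", "t2"), ("s9", "t9")}

precision, recall = precision_recall(student, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A student who aligns cautiously, as in this toy case, ends up with higher precision than recall, the same pattern as in the averages reported above.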


Discrepancies

students sometimes aligned a word (or some words) with a node
    e.g. the word natürlich to the phrase of course
students sometimes aligned a German verb group with a single verb form in English
    e.g. ist zurückzuführen vs. reflecting


Discrepancies

based on different grammatical forms:
a definite singular NP in German with an indefinite plural NP in English
    der Umsatz vs. revenues
a German genitive NP with a PP in English
    der beiden Divisionen vs. of the two divisions


Missed by all students

alignment of German word to empty token in English
    wenn sie die Hand ausstreckte vs. herself shaking hands


Conclusions

1. Our alignment guidelines are sufficient for a core of clear alignment decisions.
2. Needed:
    1. Better alignment rules with concrete examples.
    2. Better support tools (consistency checking).
3. The distinction between exact alignment and approximate alignment is very tricky.


Thank You for Your Attention!

Questions???


Applications of Parallel Treebanks

For the Translator
    1. corpus for translation studies
       search tools needed

For the Computational Linguist
    1. input for Example-based Machine Translation
    2. evaluation corpus for word, phrase or clause alignment
    3. training corpus for transfer rules


Alignment Example


Parallel Treebanking

DE sentence -> ANNOTATE (PoS tagger (STTS), Chunker (TIGER)) -> flat DE tree -> Deepening -> DE tree
SV sentence -> PoS tagger (SUC), STTS conversion, ANNOTATE (Chunker (SWE-TIGER)) -> flat SV tree -> Deepening + Back conv. -> SV tree
DE tree + SV tree -> phrase alignment