Automatic Translation Error Analysis Project results & conclusions

Source: mtmarathon2011.fbk.eu/sites/mtmarathon2011.fbk.eu/files/...

  • Automatic Translation Error Analysis: Project results & conclusions

  • MT Bondone (side project)

  • Outline
    ● Got to know the tools; cross-evaluation (all)
    ● Hjerson++ (Maja)
    ● Addicter's friendlier and richer interface (Dan)
    ● A non-paranoid alignment for Addicter (Martin)
    ● Both tools on WMT11 En-De (Sabine)
    ● Both tools on dataset X (Arianna, Suhel)

  • Addicter vs Hjerson
    ● Addicter (HMM)
      ● Moses the decoder uses beam search
      ● The Moses program employs suboptimal pruning
    ● Hjerson & Addicter (Greedy)
      ● Moses the decoder uses beam search
      ● The Moses program employs suboptimal pruning
    ● Failed translation attempt or missing+extra pair? Hard task
    ● Turns out, second strategy is better (ranking, error prec./rec.)

  • Hjerson on Czech data
    ● WMT'09, En-Cs
    ● Very rich morphology including inflections and derivations
    ● Very free word order
    ● Flexible human error analysis (related to the given reference, but only loosely)

  • Ranking evaluation
    ● correlations over error categories between 0.4 and 0.7
    ● correlations over translation systems:
      ● strongest for missing words and lexical errors (0.7 - 1)
      ● weaker for reordering and morphological errors (0.2 - 0.8)
        ● reason: above-mentioned characteristics of Czech language
      ● (again) weak for extra words (-0.2 - 0.4)
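
    The slide does not say which correlation coefficient was used. As a minimal sketch, assuming Spearman's rank correlation between manual and automatic error counts per system, the comparison could look like the following. The function names and the counts are hypothetical and only illustrate the computation.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical "missing word" counts for four systems (not from the slides).
manual_counts = [120, 80, 100, 60]
automatic_counts = [130, 95, 90, 70]
rho = spearman(manual_counts, automatic_counts)
```

    Running the same comparison once per error category (over systems) and once per system (over categories) gives the two kinds of correlation listed on the slide.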

  • Precision, Recall, Confusions
    ● General problem:
      ● (again) extra words confused with lexical errors
    ● Problems related to the Czech language:
      ● morphological errors confused with lexical errors
      ● many more reordering errors
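
    Precision, recall and confusions per error category can be read off a table of (manual tag, automatic tag) pairs. The sketch below assumes one tag per hypothesis token and a hypothetical label set (`ok`, `lex`, `infl`, `extra`); it is not the evaluation script used in the project.

```python
from collections import Counter


def evaluate_tags(gold, predicted, correct_label="ok"):
    """Compare manual (gold) and automatic (predicted) per-token error tags.

    Returns per-category (precision, recall) and the raw confusion counts,
    keyed by (gold_tag, predicted_tag).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    confusions = Counter(zip(gold, predicted))
    scores = {}
    for label in (set(gold) | set(predicted)) - {correct_label}:
        true_pos = confusions[(label, label)]
        n_predicted = sum(c for (g, p), c in confusions.items() if p == label)
        n_gold = sum(c for (g, p), c in confusions.items() if g == label)
        precision = true_pos / n_predicted if n_predicted else 0.0
        recall = true_pos / n_gold if n_gold else 0.0
        scores[label] = (precision, recall)
    return scores, confusions


# Hypothetical tags for six tokens (not from the slides).
gold = ["ok", "extra", "lex", "infl", "ok", "lex"]
auto = ["ok", "lex", "lex", "lex", "extra", "lex"]
scores, confusions = evaluate_tags(gold, auto)
# confusions[("extra", "lex")] counts extra words that were tagged as lexical errors.
```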

  • WER alignment on base forms
    Improves some aspects:
    ● better correlations over error classes
    ● better recall of extra words (+ less confusion with lexical errors)
      ● price: deterioration of lexical recall + more lexical errors confused with extra words
      ● however, the gain is significantly larger
    ● better precision of extra words
      ● price: more correct words are tagged as extra
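
    To illustrate what a WER alignment on base forms does, here is a minimal sketch: a Levenshtein alignment computed over lemmas, with each aligned pair then labelled. The labelling rule (same base form but different surface form counts as an inflection error) and the tag names are assumptions made for this example, not a description of Hjerson's code.

```python
def wer_align(ref, hyp):
    """Align two sentences by edit distance over base forms.

    ref, hyp: lists of (surface_form, base_form) tuples.
    Returns a list of (tag, ref_surface, hyp_surface) with tags
    "ok", "infl", "lex", "miss" (reference only) or "extra" (hypothesis only).
    """
    n, m = len(ref), len(hyp)
    # dist[i][j]: edit distance between the first i reference lemmas
    # and the first j hypothesis lemmas.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1][1] == hyp[j - 1][1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if ref[i - 1][1] == hyp[j - 1][1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                if sub:
                    tag = "lex"
                elif ref[i - 1][0] != hyp[j - 1][0]:
                    tag = "infl"
                else:
                    tag = "ok"
                ops.append((tag, ref[i - 1][0], hyp[j - 1][0]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("miss", ref[i - 1][0], None))
            i -= 1
        else:
            ops.append(("extra", None, hyp[j - 1][0]))
            j -= 1
    ops.reverse()
    return ops


# Hypothetical English example; the lemmas would normally come from a tagger.
reference = [("the", "the"), ("houses", "house"), ("are", "be"), ("old", "old")]
hypothesis = [("the", "the"), ("house", "house"), ("are", "be"),
              ("very", "very"), ("old", "old")]
alignment = wer_align(reference, hypothesis)
```

    Aligning on base forms lets "houses" and "house" pair up as one inflection error instead of being split into a lexical error or a missing+extra pair, which is consistent with the reduced confusion between extra words and lexical errors reported above.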

  • Addicter / Visualizer
    ● Easy install (uses internal webserver now)
    ● Improved interface
    ● Reference-hypothesis alignment
    ● Multiple alignments of the same sentence
    ● Color highlighting of automatically found errors
    ● … DEMO

  • Addicter on English
    ● Automatic testing system for all tools and datasets
    ● Greedy alignment for Addicter
      ● fast (linear search)
      ● based on context, lemma and PoS similarity
      ● suffers from lexical error overkill (better than not detecting them)
      ● evaluated on manually annotated WMT09 De-En – similar to Hjerson
      ● Addicter's best built-in aligner
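
    The idea of a greedy alignment driven by context, lemma and PoS similarity can be sketched as follows. The scoring weights, the threshold and the token format are all hypothetical, chosen only to make the example concrete; this is not Addicter's aligner, and the simple scan used here is quadratic in sentence length.

```python
def similarity(hyp, ref, i, j):
    """Hypothetical score for pairing hyp[i] with ref[j]."""
    score = 0.0
    if hyp[i]["lemma"] == ref[j]["lemma"]:
        score += 2.0  # same base form
    if hyp[i]["pos"] == ref[j]["pos"]:
        score += 1.0  # same part of speech
    # Context: reward agreeing lemmas directly to the left and right.
    for offset in (-1, 1):
        hi, rj = i + offset, j + offset
        if 0 <= hi < len(hyp) and 0 <= rj < len(ref):
            if hyp[hi]["lemma"] == ref[rj]["lemma"]:
                score += 0.5
    return score


def greedy_align(hyp, ref, threshold=2.0):
    """Link each hypothesis token to its best still-unused reference token.

    Returns {hyp_index: ref_index}. Tokens left unlinked would be reported
    as extra (hypothesis side) or missing (reference side).
    """
    used = set()
    alignment = {}
    for i in range(len(hyp)):
        best_j, best_score = None, 0.0
        for j in range(len(ref)):
            if j in used:
                continue
            score = similarity(hyp, ref, i, j)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            alignment[i] = best_j
            used.add(best_j)
    return alignment


def tok(lemma, pos):
    return {"lemma": lemma, "pos": pos}


# Hypothetical lemmatised and tagged sentences.
hyp_tokens = [tok("new", "ADJ"), tok("member", "NOUN"),
              tok("overcome", "VERB"), tok("barrier", "NOUN")]
ref_tokens = [tok("new", "ADJ"), tok("councillor", "NOUN"), tok("overcome", "VERB"),
              tok("certain", "ADJ"), tok("barrier", "NOUN")]
links = greedy_align(hyp_tokens, ref_tokens)
```

    In this toy run "member" is linked to "councillor" purely through PoS and context, so the pair would surface as a lexical error rather than as a missing+extra pair, while "certain" stays unlinked and would count as missing.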

  • Test on WMT11 En-De Data
    ● 22 MT systems/outputs
    ● No manually annotated gold standard
    ● Ranking according to manual judgments
    ● Application of both Addicter & Hjerson to all the systems' output

  • Number of Errors
    ● Addicter tags between 81-90k of 150k tokens with errors, Hjerson between 84-95k.
    ● The systems with the fewest errors:
      ● online-B: rank #2 of 22
      ● illc-uva: rank #21 of 22
    ● RBMT systems are tagged with more errors

  • Fun with Correlations

    Error category       Addicter    Hjerson
    Total errors            0.003     -0.109
    Inflection errors       0.113      0.432
    Extra words            -0.283     -0.351
    Missing words           0.268      0.427
    Lexical errors          0.086     -0.275
    Reordering              0.189      0.579
    Infl+ext+reord                     0.654

  • Error Analysis of the Error Analysis
    ● Addicter tags very conservatively wrt reordering/inflection, Hjerson is greedy.
    ● The lack of alignment in Hjerson leads to many errors: the German determiner is often wrongly tagged with inflection or reordering errors.
    ● Addicter abuses extra/miss (can be fixed by creating a better alignment).

  • Example - Hjerson
    ● Aktuálně.cz "tested" the Social Democrat members of the new Council in terms of the well-established slang that originated in the town hall during the few last years, when Prague was ruled by the current coalition partners.
    ● Die Zeitung Aktuálně.cz hat Mitglieder des neuen Rates aus der ČSSD mal ein wenig "abgeklopft", wie sie den notorischen Slang beherrschen, der sich in den letzten Jahren eingebürgert hat, in denen die heutigen Koalitionspartner in Prag am Ruder waren.
    ● Aktuáln.cz "testete" die Sozialdemokratin-Mitglieder vom neuen Rat in Bezug auf die feste Umgangssprache von den gegenwärtigen Koalitionspartnern, die während der paar letzten Jahre im Rathaus entstand, als Prag regiert wurde.

  • Example - Addicter
    ● New Councilors of CSSD will most probably have to overcome certain language barriers to understand their old-new colleagues from ODS in Prague Council and municipal council.
    ● Die neuen Ratsherren der Hauptstadt aus den Reihen der ČSSD werden offensichtlich gewisse Sprachbarrieren überwinden müssen, um ihre alt-neuen Kollegen aus der ODS im Prager Rat und in der Stadtvertretung überhaupt verstehen zu können.
    ● Neue Ratsmitglieder von CSSD werden am wahrscheinlichsten Sprachbarrieren überwinden müssen, um ihre altneuen Kollegen von ODS in Prag-Rat und Magistrat zu verstehen.

  • Test on IWSLT'11 Ar-En Data
    ● In progress
    ● “The system is of good quality and far too many errors are marked”

  • Conclusions
    ● Hjerson updated; evaluates better; usable for error/system ranking and rough error-tagging
    ● Addicter updated; now also usable for error/system ranking and rough error-tagging
    ● Both tools tested on En->De, En->Cs, Ar->En
