Transcript
Page 1: Importance of MT Evaluation Difficulty of MT Evaluation Evalua ... · Evaluation of Machine Translation Quality Marco Turchi FBK Trento, Italy turchi@

Evaluation of Machine Translation Quality
Marco Turchi
FBK Trento, Italy
turchi@fbk.eu

Slides from the presentation by Matteo Negri… and myself

Disclaimer

“More has been written about MT evaluation over the past 50 years than about MT itself”
Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)

“It is impossible to write a comprehensive overview of the MT evaluation literature”
Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.

MT Evaluation, Trento, Doctoral School - April 2016

Outline

•  Importance of MT Evaluation
•  Difficulty of MT Evaluation
•  Human evaluation: fluency/adequacy
•  Automatic evaluation:
    – Reference-based: BLEU, TER, HTER (chosen among MANY others)
    – Reference-free: quality estimation (estimating post-editing effort)

The importance of MT evaluation

•  Answering “How good is an MT system?” is a way to:
    – Choose which system to use for a given task
    – Assess and compare systems’ performance
    – Define the state of the art
    – Drive system development and measure improvements
    – Decide whether to apply MT at all
•  … Necessary (yes, not sufficient) conditions for progress in any research field
•  Difficult task!



Difficulty of MT evaluation

•  No formal definition of “translation” → no definition of “good translation”
•  The notion of quality is inherently subjective
•  Exact quantification is difficult (especially for long sentences)
•  MT errors are very varied in nature
•  Perfect or very poor translations are easy to score, but what happens in between?

Difficulty of MT evaluation

•  Many different acceptable translations for the same sentence
    – I am [experiencing | suffering from | feeling] a throbbing pain.
    – I [feel | can feel | have] a [throbbing pain | painful throbbing].
    – [It is a | It’s in | I’ve got a] throbbing pain.
    – It’s throbbing [and it really hurts | with pain].
    – [It’s painful and | It hurts so much] it’s throbbing.

Difficulty of MT evaluation

•  How would you translate:
    It’s raining cats and dogs
    Ace in the hole
    Beat around the bush
    Chew the fat
    Wild goose chase
    Tie one on
    Sunny smile
•  Literally, by its meaning, or with the corresponding idiom (if any)?

Difficulty of MT evaluation

•  Classification of errors: a quite rich taxonomy

Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)

MT Evaluation, Trento, ISIT School - November 2013

Human vs. automatic evaluation

•  Human MT evaluation:
    – criteria: adequacy (fidelity) and fluency (intelligibility)
    – pros: very accurate, high quality
    – cons: expensive, slow, subjective
•  Automatic MT evaluation:
    – criteria: “similarity” to professional human translation
    – pros: inexpensive, quick, objective
    – cons: quality is “slightly” lower than human check


Human evaluation

•  Given:
    – MT output, source and/or reference translation
•  Task: assess the quality of the MT output
•  Metrics
    – Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? … requires bilingual judges or a reference translation
    – Fluency: is the output good fluent English? This involves both grammatical correctness and idiomatic word choices. … monolingual judges are sufficient, no reference needed


Human evaluation: adequacy and fluency

•  Source sentence: Le chat entre dans la chambre.

(a) Adequate fluent translation: The cat enters the room.
(b) Adequate disfluent translation: The cat enter in the room.
(c) Fluent inadequate translation: The cats enter the bedroom.
(d) Disfluent inadequate translation: Bedroom the dogs enters the

Human evaluation: Likert scales

Score   Adequacy          Fluency
5       all meaning       flawless English
4       most meaning      good English
3       much meaning      non-native English
2       little meaning    disfluent English
1       none              incomprehensible

Human evaluation: subjectivity

[Figure: three panels (JUDGE 1, JUDGE 2, JUDGE 3), each placing translations a, b, c, d on fluency and adequacy axes]

•  Perfect or very poor translations are easy to score… but what happens in between?

Human evaluation: subjectivity

Evaluators disagree!
•  … look at this histogram of adequacy judgments by different human evaluators

Human evaluation: measuring agreement

•  Kappa coefficient

$$K = \frac{p(A) - p(E)}{1 - p(E)}$$

    – p(A): proportion of times that the evaluators agree
    – p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
    – Complete agreement: K = 1
    – No agreement higher than chance: K = 0
•  Example: inter-evaluator agreement in WMT 2007

            p(A)    p(E)    K
Fluency     .400    .2      .250
Adequacy    .380    .2      .226
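The kappa formula can be checked numerically. The sketch below is only an illustration: the function name `kappa` and its argument names are assumptions, and the inputs are the p(A) and p(E) values from the WMT 2007 table.

```python
def kappa(p_agree: float, p_chance: float) -> float:
    """Kappa coefficient: agreement beyond chance, rescaled so that 1 is perfect."""
    if not 0.0 <= p_chance < 1.0:
        raise ValueError("p_chance must be in [0, 1)")
    return (p_agree - p_chance) / (1.0 - p_chance)


# Fluency row of the table: p(A) = .400 on a 5-point scale, so p(E) = 1/5.
fluency_kappa = kappa(0.400, 1 / 5)
print(f"{fluency_kappa:.3f}")  # prints 0.250
```

A value of 0.25 is far from the complete agreement of K = 1, which is the point the slide makes about evaluator subjectivity.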

Human evaluation: alternatives

•  Ranking translations: is translation X better than translation Y?
    – Evaluators are more consistent
•  Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates etc.)
    – Very hard to devise questions

                    p(A)    p(E)    K
Fluency             .400    .2      .250
Adequacy            .380    .2      .226
Sentence ranking    .582    .333    .373


Human evaluation: alternatives

•  Reading time
    – people read a well-formed text more quickly
•  Post-editing effort (time/HTER)
    – Time required to turn MT into a good translation
    – HTER (Human-Targeted Translation Error Rate): number of editing operations required to turn MT output into an acceptable translation


Automatic metrics for MT evaluation

Requirements for automatic metrics

•  Low cost (w.r.t. human evaluation)
•  Objective (unbiased)
•  Meaningful: score should give intuitive interpretation of translation quality
•  Efficient: to be computed quickly and often
•  Consistent: repeated use of metric should give same results
•  Correct: metric must rank better systems higher

Reference-based metrics

•  Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations
    – References are created by human experts (e.g. professional translators)
    – Several references allow us to account for variability of good translations
•  Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data
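The validation criterion (automatic scores must correlate with human ones) is usually checked with a correlation coefficient. The sketch below uses Pearson's r as one possible choice. The slides do not say which coefficient is meant, and the scores for the five systems are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: one automatic score and one mean human score per MT system.
metric_scores = [0.21, 0.25, 0.28, 0.31, 0.35]
human_scores = [2.9, 3.1, 3.0, 3.6, 3.8]
print(round(pearson(metric_scores, human_scores), 3))
```

A metric whose scores rise and fall with the human scores gives a value close to 1, while an uninformative metric gives a value close to 0.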

Reference-based metrics

•  Typically:

$$\frac{1}{k}\sum_{i=1}^{k} \mathrm{sim}(\mathrm{ref}_i, \mathrm{cand}), \qquad 1 \le k \le 4$$

    – Sim is a similarity metric between sentences
    – Sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.

WER: ratio of the smallest edit distance to the output length
BLEU: geometric mean of modified n-gram precisions, scaled by a brevity penalty
TER: normalized number of edits to match the closest reference
METEOR: harmonic mean of unigram precision/recall
NIST, PER, GTM, HTER, TERP, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.
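The averaging scheme above can be sketched in a few lines. Here `sim` is instantiated as a unigram F1 score, which is just one of the many options the slide lists (word precision/recall). The function names are assumptions made for this example and do not come from any particular toolkit.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Word-overlap F1 between two sentences (a deliberately simple sim)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def reference_based_score(references, candidate, sim=unigram_f1):
    """Average sim(ref_i, cand) over the k available references."""
    if not references:
        raise ValueError("at least one reference is required")
    return sum(sim(ref, candidate) for ref in references) / len(references)


references = [
    "the gunman was shot to death by the police .",
    "police killed the gunman .",
]
print(reference_based_score(references, "the gunman was shot dead by police ."))
```

Swapping in another `sim` function (edit distance, n-gram precision, and so on) changes the metric without changing the averaging over references.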

“candidate”, “reference”, “n-grams”

Candidate (or “target” or “hypothesis”): the gunman was shot dead by police .

Reference translation: the gunman was shot to death by the police .

N-grams (those the candidate shares with the reference):
1-grams: the, gunman, was, shot, by, police, .
2-grams: the gunman, gunman was, was shot, police .
3-grams: the gunman was, gunman was shot
4-grams: the gunman was shot
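The n-gram lists on this slide can be reproduced with a short script. This is a minimal sketch: the helper name `ngrams` and the whitespace tokenization (with the final period as its own token) are assumptions of the example.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


candidate = "the gunman was shot dead by police .".split()
reference = "the gunman was shot to death by the police .".split()

shared = {}
for n in range(1, 5):
    reference_ngrams = set(ngrams(reference, n))
    # Keep candidate order; drop n-grams that never occur in the reference.
    shared[n] = [" ".join(g) for g in ngrams(candidate, n) if g in reference_ngrams]
    print(f"{n}-grams: {shared[n]}")
```

The word “dead” is missing from the 1-gram list because it does not occur in the reference, and every longer n-gram that contains it is dropped for the same reason.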

The BLEU metric (BiLingual Evaluation Understudy)

•  Proposed by IBM [Papineni et al., 2001] (name from IBM’s color)
•  A numerical measure of closeness between texts
•  Rationale: the closer MT is to human translation, the better
•  Idea: check matches of words (unigrams) and phrases (n-grams) between:
    – one hypothesis (the translation produced by MT)
    – a set of references (professional human translations)
•  Criterion: the more the matches, the better the hypothesis
•  Needs good quality references to cover linguistic variety

Important: only the target language is taken into account!


The BLEU metric (BiLingual Evaluation Understudy)

[Figure: a reference (REF) compared with three hypotheses (HYP1, HYP2, HYP3); labels: VERY GOOD, BAD, VERY BAD]


The BLEU metric: modified n-gram precision

•  n-gram precision: percentage of n-grams in the hypothesis that occur also in (any of the) references (0 ≤ p ≤ 1)
    – matches of shorter n-grams (n = 1, 2) capture adequacy
    – matches of longer n-grams (n = 3, 4, ...) capture fluency
•  Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.
    – Example:
        Hyp: the the the the the the the
        Ref: the cat is on the mat

$$p_1^{\text{standard}} = \frac{7}{7} \qquad p_1^{\text{modified}} = \frac{2}{7}$$
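The difference between standard and modified precision comes down to clipping each hypothesis n-gram count at the largest count seen in any single reference. The sketch below follows that idea. The function names and the `clip` flag are assumptions introduced only for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, references, n, clip=True):
    """n-gram precision; with clip=True each match is limited by the reference count."""
    hyp_counts = ngram_counts(hypothesis, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # For every n-gram, the highest count found in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = 0
    for gram, count in hyp_counts.items():
        if clip:
            matched += min(count, max_ref_counts[gram])
        elif max_ref_counts[gram] > 0:
            matched += count
    return matched / total


hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(ngram_precision(hyp, [ref], 1, clip=False))  # standard: 7/7
print(ngram_precision(hyp, [ref], 1, clip=True))   # modified: 2/7
```

Because “the” occurs only twice in the reference, at most two of the seven hypothesis tokens can count as matches once clipping is applied.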

The BLEU metric: brevity penalty

•  Brevity penalty (BP): to penalize too short hypotheses
    – Example:
        Hyp: the
        Ref: the cat is on the mat
      … Can’t just type out the single word “the” (precision 1.0!)
    – c = length of MT hypothesis, r = length of the closest reference
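The BP formula itself did not survive in this transcript. As a hedged reconstruction, the definition given by Papineni et al. reproduces the value exp(1 − 9/8) that the computation slides quote for c = 8 and r = 9, and it can be written as:

```latex
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad\text{e.g. } c = 8,\ r = 9:\quad BP = e^{\,1 - 9/8} = e^{-0.125} \approx 0.8825
```

A hypothesis longer than the closest reference is not penalized by BP. Excess length is already punished through lower n-gram precision.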


The BLEU metric: computation

BLEU = Brevity Penalty × geometric mean of p1, p2, …, pn (where pn is the modified n-gram precision for 1 ≤ n ≤ 4)

Hypothesis: The gunman was shot dead by police.
    – Ref 1: The gunman was shot to death by the police.
    – Ref 2: The hitman was killed by the police forces.
    – Ref 3: Police killed the gunman.
    – Ref 4: The gunman was shot dead by the police.

•  Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
•  Brevity Penalty: c = 8, r = 9, BP = exp(1 − 9/8) = 0.8825
•  Final Score:

$$\sqrt[4]{1 \times 0.86 \times 0.67 \times 0.6} \times 0.8825 = 0.68$$

NOTE: this is a product! If one of the factors is 0 (e.g. no 4-gram matches) the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences!
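The worked example can be reproduced end to end with a small sentence-level BLEU. This is a sketch under stated assumptions, not a reference implementation: the tokenizer (lowercasing, period split off as its own token), the function names, and the use of Ref 2 in its “hitman … police forces” form (length 9, consistent with r = 9) are all choices made for this illustration.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split off periods, so 'police.' becomes 'police', '.'."""
    return sentence.lower().replace(".", " .").split()


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, refs, n):
    hyp_counts = ngram_counts(hyp, n)
    max_ref_counts = Counter()
    for ref in refs:
        max_ref_counts |= ngram_counts(ref, n)  # union keeps the maximum count
    clipped = sum((hyp_counts & max_ref_counts).values())  # intersection clips
    return clipped / max(1, sum(hyp_counts.values()))


def brevity_penalty(c, r):
    return 1.0 if c > r else math.exp(1 - r / c)


def closest_ref_length(hyp, refs):
    # Closest length to the hypothesis; ties go to the shorter reference.
    return min((len(ref) for ref in refs),
               key=lambda length: (abs(length - len(hyp)), length))


def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the product collapses as soon as one factor is 0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(hyp), closest_ref_length(hyp, refs)) * geometric_mean


hypothesis = tokenize("The gunman was shot dead by police.")
references = [tokenize(s) for s in (
    "The gunman was shot to death by the police.",
    "The hitman was killed by the police forces.",
    "Police killed the gunman.",
    "The gunman was shot dead by the police.",
)]
precisions = [modified_precision(hypothesis, references, n) for n in range(1, 5)]
score = bleu(hypothesis, references)
print([round(p, 2) for p in precisions], round(score, 2))
```

With unrounded precisions the score is about 0.675, which agrees with the 0.68 on the slide. The early return on a zero precision is the behaviour the NOTE warns about for single sentences.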

The BLEU metric: correlation with training set size

[Figure: BLEU score plotted against the number of sentence pairs used in training; experiments by Philipp Koehn]

The BLEU metric: correlation with human judgments

[Figure from George Doddington, NIST, 2002]

The BLEU metric limitations: examples

•  Reference: a b c d e f g h I j k l m n o p q r s
•  Hyp1: a b c d f e g i h j l k m o n p r q s
•  Hyp2: a b c d e f g x x x x x x x x x x x x

              Hyp1      Hyp2
1-gram        1.0000    0.3684
2-gram        0.1666    0.3333
3-gram        0.1176    0.2941
4-gram        0.0625    0.2500
BLEU score    0.1871    0.3083

Longer n-grams dominate shorter n-grams!
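The BLEU row of this table is the geometric mean of the four precisions above it. Both hypotheses have the same length as the reference, so no brevity penalty applies. The sketch below rebuilds the scores from match counts. The fractions are inferred from the table (for example 7/19 = 0.3684) rather than stated on the slide, and the function name is an assumption.

```python
import math


def geometric_mean(precisions):
    """Geometric mean; 0 as soon as any precision is 0."""
    if any(p <= 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


# matched n-grams / n-grams in the hypothesis, for n = 1..4 (19 tokens each)
hyp1_precisions = [19 / 19, 3 / 18, 2 / 17, 1 / 16]
hyp2_precisions = [7 / 19, 6 / 18, 5 / 17, 4 / 16]

bleu_hyp1 = geometric_mean(hyp1_precisions)
bleu_hyp2 = geometric_mean(hyp2_precisions)
print(round(bleu_hyp1, 4), round(bleu_hyp2, 4))
```

Hyp1 contains every reference word but scores lower than Hyp2, which gets only 7 words right, because Hyp1's local reorderings destroy almost all of its 2-, 3- and 4-gram matches.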


The BLEU metric limitations: examples

•  Reference: George Bush will often take a holiday in Crawford Texas

HYPOTHESES                                                       BLEU
George Bush will often take a holiday in Crawford Texas          1.0000
Bush will often holiday in Texas                                 0.4611
Bush will often holiday in Crawford Texas                        0.6363
George Bush will often holiday in Crawford Texas                 0.7490
George Bush will not often vacation in Texas                     0.4491
George Bush will not often take a holiday in Crawford Texas      0.9129 (!)

Small changes in the text may determine big meaning changes!

The BLEU metric limitations: examples

•  Reference: The President frequently makes his vacation in Crawford Texas

HYPOTHESES                                               BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas      0.2627
holiday often Bush a takes George in Crawford Texas      0.2627

WHY?
… The “invisible region” [Hovy & Ravichandran 2003]

The BLEU metric limitations: improvements

•  Reference: The President frequently makes his vacation in Crawford Texas
   POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #1: matches at POS level [Hovy & Ravichandran 2003]

HYPOTHESES (POS)                                         BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP                          0.5411
NN RB NNP DT VBZ NNP IN NNP NNP                          0.3117

HYPOTHESES (words)                                       BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas      0.2627
holiday often Bush a takes George in Crawford Texas      0.2627

The BLEU metric limitations: improvements

•  Reference: The President frequently makes his vacation in Crawford Texas
   POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #2: (Words + POS)/2 [Hovy & Ravichandran 2003]

HYPOTHESES (POS)                                         BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP                          0.4020
NN RB NNP DT VBZ NNP IN NNP NNP                          0.2966

HYPOTHESES (words)                                       BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas      0.2627
holiday often Bush a takes George in Crawford Texas      0.2627

The BLEU metric: pros and cons

•  BLEU ranges from 0 to 1 (translation quality as “percentage”)
•  The more the references, the higher the score
•  High correlation with human assigned scores, especially on fluency
•  Ranking of “similar” MT systems equivalent to human ranking
•  Collecting references has a high cost
•  Longer n-grams dominate shorter n-grams
•  Small changes in the text (e.g. “not”) may determine big meaning changes
•  Scores are not straightforward to interpret (BLEU = 30 … so what?)
•  Syntax poorly modeled
•  Ignores word relevance and semantic equivalence (string level comparisons)
•  Can fail in ranking systems based on different approaches

The TER metric (Translation Edit Rate)

•  Idea: simulate post-editing [Snover et al. 2006]
    – Given a translation hypothesis (H) AND a reference translation (R)
    – Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
    – Possible edits: insertion/deletion/substitution of single words, shifts of word sequences
•  Criterion: the fewer the edits, the better the hypothesis


The TER metric: example

REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT

•  HYP: fluent, same meaning as the reference (except “American”)
•  but not an exact match:
    – this week is shifted
    – Saudi Arabia in the REF appears as the Saudis in the HYP
    – American appears only in the REF
•  Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):

    TER% = 4/11 × 100 = 36.36%
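To see what the shift operation buys, the sketch below computes a plain word-level edit distance (insertions, deletions, substitutions, no shifts) on the same sentence pair and compares it with the 4 edits counted on the slide. This is a deliberate simplification and not TER itself, since real TER also searches over shifts. The function name is an assumption made for the example.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over words (insert, delete, substitute; no shifts)."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


ref = "Saudi Arabia denied this week information published in the American NYT".split()
hyp = "this week the Saudis denied information published in the NYT".split()

edits_without_shifts = word_edit_distance(hyp, ref)
ter_from_slide = 4 / len(ref)  # the 4 edits counted on the slide, shift included
print(edits_without_shifts, round(100 * ter_from_slide, 2))  # prints 6 36.36
```

Without shifts the moved phrase “this week” has to be paid for word by word, so the distance rises from 4 to 6 edits. Counting a block move as a single edit is what lets TER stay closer to real post-editing.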

The TER metric: discussion

•  Evaluation close to a real task (post-editing)
•  Results are more interpretable than for other metrics
•  Can be computed even for a single sentence
•  Insensitive to semantic closeness (e.g. synonyms, paraphrases)
•  Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)
    – approximate search via dynamic programming (decomposition into sub-problems)

The HTER metric (Human-targeted TER)

•  TER ignores semantic equivalence and heavily depends on the reference translation
•  Idea: references as human post-editions
    – Perform human post-editing to transform the hypothesis into the closest acceptable translation
    – HTER measures TER between the hypothesis and the resulting reference translation
•  Criterion: the fewer the edits, the better the hypothesis (same as TER)

TER/HTER: pros/cons

•  TER
    – intuitive measure of MT quality
    – adequate for fast development
    – reasonably correlates with human judgments (better than BLEU, worse than others, e.g. METEOR)
    – ignores semantic equivalence
•  HTER
    – intuitive measure of MT quality
    – highest correlation with human judgments
    – possible substitute for human evaluations because less subjective
    – expensive: 3 to 7 minutes per sentence for a human to annotate
    – not suitable for use in the development cycle of an MT system

Application-oriented MT evaluation

Quality Estimation (QE)

•  From controlled lab tests and evaluation campaigns…
•  … to MT evaluation in real-life conditions (e.g. the CAT framework)
    – As a support to human translators
    – At run time
    – Without reference translations

(One) scenario: the CAT framework

The CAT tool
1. Segments the input document
2. Provides, for each segment:
    •  Suggestions from a translation memory (TM)
    •  Suggestions from an MT engine

The translator, for each segment
1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality

(One) scenario: the CAT framework

•  Questions:
    – Is this suggestion good enough to be published?
    – Can I trust it?
    – Can a reader get the gist?
    – Is it publishable “as is”?
    – If not, what is better: post-editing or rewriting?
•  Huge market interest
    – Increased translators’ productivity
    – No manual intervention on reliable MT suggestions


Predicting MT output quality

•  Task: automatically estimate MT output quality at run-time and without reference translations
•  Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.

Positive/Negative examples
Possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.
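The two steps (training on labelled examples, then predicting labels for unseen data) can be illustrated with the toy features named on the slide. The sketch below uses a 1-nearest-neighbour rule, chosen only because it is short. The slides do not prescribe a learning algorithm, and the labelled examples are invented for illustration.

```python
FEATURES = ("hasWings", "hasFeathers", "hasPalmateFeet")

# Hypothetical human-labelled training data: (features, is_positive_example).
TRAINING = [
    ({"hasWings": True, "hasFeathers": True, "hasPalmateFeet": True}, True),
    ({"hasWings": True, "hasFeathers": True, "hasPalmateFeet": False}, False),
    ({"hasWings": True, "hasFeathers": False, "hasPalmateFeet": False}, False),
    ({"hasWings": False, "hasFeathers": False, "hasPalmateFeet": False}, False),
]


def distance(a, b):
    """Number of features on which two examples differ."""
    return sum(a[name] != b[name] for name in FEATURES)


def predict(example, training=TRAINING):
    """Prediction step: copy the label of the closest training example."""
    _, label = min(training, key=lambda pair: distance(example, pair[0]))
    return label


unseen = {"hasWings": False, "hasFeathers": True, "hasPalmateFeet": True}
print(predict(unseen))
```

In quality estimation the same scheme applies with features extracted from the source sentence and the MT output, and with labels such as quality scores, HTER or post-editing time.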

Predicting MT output quality

•  What is a good indicator of translation quality?
•  It should take into account:
    – Correctness and usefulness of the translation
    – Cognitive effort needed by human for the correction
•  All these aspects can be summarized in the:
    – Post-editing effort


Predicting MT output quality

•  What is post-editing?
    – A process of modification rather than revision (Loffler-Laurian 1985)
    – The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
    – Repairing texts (Krings, 2001)
    – “… the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)

Predicting MT output quality

•  What is post-editing effort?
    – the effort made by a post-editor to manually improve a machine generated translation
•  Measure of post-editing effort:
    – Quality score (as estimated by humans on a 1-5 Likert scale)
    – Number of edit operations (HTER)
    – Post-editing time (total seconds or seconds per word)
    – Number of keystrokes
    – …

Quality scores

•  Arbitrary choice of the levels of quality
    1 = requires complete retranslation;
    2 = requires some retranslation;
    3 = very little post editing needed;
    4 = fit for purpose
•  Labeling requires human intervention
•  A precise measure
•  Subjective / expensive / time consuming task

Quality scores

•  Workshop on SMT scoring schema:
    1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.
    2. About 50%-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
    3. About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
    4. About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
    5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

Post-editing time

• Seconds needed to post-edit a sentence
• Normalized version in seconds per word
– little time = good translation
– large time = bad translation

• Usually includes:
– reading time
– searching for information on external resources
– typing time
– extra time for secondary activity (e.g. correction)

• High variability across sentences and translators
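The normalization mentioned above can be sketched in a few lines. This is only an illustration: the slide does not say which sentence supplies the word count, so the function below assumes whitespace tokens of whatever sentence the caller passes in, and the name `seconds_per_word` is hypothetical.

```python
def seconds_per_word(total_seconds, sentence):
    """Normalize post-editing time by sentence length.

    Assumption: words are whitespace-separated tokens of `sentence`.
    """
    words = sentence.split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    return total_seconds / len(words)


# 12 seconds spent on a 5-word sentence
print(seconds_per_word(12.0, "The cat enters the room"))
```

Dividing by length makes long and short sentences comparable, but it does not remove the variability across translators noted on the slide.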

HTER (again!)

• Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version
– edits: insertion, deletion, substitution, shift

• Lower variability (wrt time) across sentences/translators

$$\mathrm{HTER} = \frac{\#\,\text{edits}}{\#\,\text{words in the post-edited version}}$$
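A rough sketch of the ratio above, under a loud simplification: it counts only insertions, deletions and substitutions with a word-level Levenshtein distance and ignores shifts, so it can overestimate the edit count when words are moved. Real TER/HTER tools handle shifts; the function names and the post-edited sentence in the example are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,         # delete a hypothesis word
                            curr[j - 1] + 1,     # insert a reference word
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edited):
    """#edits divided by #words in the post-edited version (shifts ignored)."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        raise ValueError("post-edited sentence must not be empty")
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


# One substitution (enter -> enters) and one deletion (in): 2 edits / 5 words
print(approx_hter("The cat enter in the room", "The cat enters the room"))
```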

Post-editing time vs. HTER

• Time: pros/cons
– Accounts for different efforts in translating different words
– Variability among post-editors

• HTER: pros/cons
– Objective, easy-to-compute measure
– Less variance across post-editors (bad = bad for all)
– Ignores different efforts in translating different words

Predicting MT output quality

• Tasks:
– Automatic labeling
  • real values = regression
  • integers = classification
– Automatic ranking

• Granularity
– Word level (e.g. “The cat enter in the room”)
– Sentence level (e.g. “The cat enter in the room”: 2.27)
– Document level

Evaluation Metrics - Regression

• Regression (predictions as real values):
– Mean Absolute Error (MAE)
– Root Mean Squared Error (RMSE)

• Given a set of predicted scores H and a set of human scores V

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} \left| H(s_i) - V(s_i) \right|}{N}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left( H(s_i) - V(s_i) \right)^2}{N}}$$
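As a small sketch of the two regression metrics, the functions below take the predicted scores H and the human scores V as equal-length Python lists. The function names and the example values are illustrative assumptions, not taken from the slides.

```python
import math


def mae(predicted, human):
    """Mean Absolute Error between predicted scores H and human scores V."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - v) for h, v in zip(predicted, human)) / len(predicted)


def rmse(predicted, human):
    """Root Mean Squared Error between predicted scores H and human scores V."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = sum((h - v) ** 2 for h, v in zip(predicted, human))
    return math.sqrt(squared / len(predicted))


H = [1.0, 2.0, 3.0]  # hypothetical predicted scores
V = [2.0, 2.0, 5.0]  # hypothetical human scores
print(mae(H, V), rmse(H, V))
```

RMSE squares the differences before averaging, so it penalizes a few large errors more heavily than MAE does.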

Evaluation Metrics - Classification

• Classification (predictions as integers):
– Precision (Pr)
– Recall (Re)
– F-score (F1)

• Given a set of predicted scores H and a set of human scores V
• An example for binary classification

            V = 1             V = -1
H = 1       True Positive     False Positive
H = -1      False Negative    True Negative

$$\mathrm{Pr} = \frac{tp}{tp + fp}$$

$$\mathrm{Re} = \frac{tp}{tp + fn}$$

$$F_1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}$$
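The binary case above could be computed as follows. This is a minimal sketch: labels are assumed to be 1 and -1 as in the table, the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(predicted, human, positive=1):
    """Precision, recall and F1 for binary labels (H = predicted, V = human)."""
    pairs = list(zip(predicted, human))
    tp = sum(1 for h, v in pairs if h == positive and v == positive)
    fp = sum(1 for h, v in pairs if h == positive and v != positive)
    fn = sum(1 for h, v in pairs if h != positive and v == positive)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1


H = [1, 1, -1, 1, -1]  # hypothetical predictions
V = [1, -1, -1, 1, 1]  # hypothetical human labels
print(precision_recall_f1(H, V))  # tp = 2, fp = 1, fn = 1
```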

Evaluation Metrics - Ranking

• Spearman’s Rank Coefficient

• Delta Average (introduced at WMT 2012)

System:
      Score    Ranking
s1    3.2      3
s2    1        5
s3    5        1
s4    2.7      4
s5    4        2

H:
      Judgment    Ranking
s1    5           1
s2    1           5
s3    4           2
s4    2           4
s5    3           3

The system ranking and the human ranking are compared with a rank similarity metric.
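For the five sentences in the two tables, Spearman's coefficient can be worked out with the textbook formula for rankings without ties, ρ = 1 − 6 Σ d² / (n (n² − 1)). The sketch below assumes no tied scores; the function names are illustrative. By hand, the rank differences are 2, 0, −1, 0, −1, so Σ d² = 6 and ρ = 1 − 36/120 = 0.7.

```python
def ranks_descending(scores):
    """Rank 1 goes to the highest score (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, 1):
        ranks[idx] = rank
    return ranks


def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank coefficient, no-ties formula."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


system_scores = [3.2, 1, 5, 2.7, 4]   # s1..s5, from the System table
human_scores = [5, 1, 4, 2, 3]        # s1..s5, from the Human table
system_ranks = ranks_descending(system_scores)
human_ranks = ranks_descending(human_scores)
print(system_ranks, human_ranks, spearman_rho(system_ranks, human_ranks))
```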

Quality indicators

• Features can be extracted from
– The source sentence (“Complexity” indicators)
– The translated sentence (“Fluency” indicators)
– Source and target sentences (“Adequacy” and other indicators)
– The MT system during the translation process (“Confidence” indicators)

[Diagram: source sentence, MT system, translated sentence]

Quality indicators - Complexity

• Capture the difficulty of translating the source sentence
• Complex sentences are harder to translate
– source sentence length
– n-gram language model probability
– number of punctuation marks
– source sentence type/token ratio (e.g. #nouns/#tokens)
– avg. # of translations per word (as given by probabilistic dictionaries)
– % of content/non-content words
– …
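Three of the simpler complexity indicators can be sketched without any external resource. The assumptions are stated loudly: the source sentence is already tokenized with spaces, punctuation is limited to ASCII marks, and the type/token ratio is read as distinct lowercased tokens over all tokens (a common reading, which differs from the #nouns/#tokens example on the slide). The function name and feature keys are hypothetical.

```python
import string


def complexity_features(source_sentence):
    """A few surface 'complexity' indicators for a pre-tokenized source sentence."""
    tokens = source_sentence.split()
    n_tokens = len(tokens)
    # Assumption: only ASCII punctuation characters are counted
    n_punct = sum(1 for ch in source_sentence if ch in string.punctuation)
    # Assumption: types are distinct lowercased tokens
    n_types = len({t.lower() for t in tokens})
    return {
        "length": n_tokens,
        "punctuation_marks": n_punct,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
    }


print(complexity_features("The cat , the dog and the bird ."))
```

Language model probabilities and dictionary-based features need trained models or lexicons, so they are left out of this sketch.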

Quality indicators - Fluency

• Capture the level of naturalness of the translation in the target language
• The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text
– n-gram language model probability
– POS-tag target language model
– …

Quality indicators - Adequacy

• Capture the level of semantic equivalence between source and translation
• Source and target sentences should convey the same meaning. Meaning drifts/losses from source to target sentence indicate a bad translation
– % of aligned words in source and target
– % of alignments between words with the same part of speech
– % of aligned nouns/verbs/adjectives
– aligned IDF mass (IDF as indicator of term relevance)
– …
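The first adequacy indicator, the percentage of aligned words, could be computed as below once a word alignment is available. The alignment itself is assumed to come from an external word aligner and is passed in as (source index, target index) pairs; the function name and the example alignment are hypothetical.

```python
def aligned_word_percentages(n_source, n_target, alignment):
    """Percentage of source and target words that take part in at least one link.

    `alignment` is a list of (source_index, target_index) pairs, assumed to be
    produced by an external word aligner.
    """
    if n_source <= 0 or n_target <= 0:
        raise ValueError("sentence lengths must be positive")
    aligned_source = {i for i, _ in alignment}
    aligned_target = {j for _, j in alignment}
    return (100.0 * len(aligned_source) / n_source,
            100.0 * len(aligned_target) / n_target)


# Hypothetical example: 5 source words, 6 target words, 4 alignment links
links = [(0, 0), (1, 1), (2, 2), (4, 5)]
print(aligned_word_percentages(5, 6, links))
```

A low percentage on either side hints at dropped or added content, which is the meaning loss the slide describes.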

Quality indicators - Confidence

• Capture the level of confidence of the SMT system
• Sentences for which the translation process is complex are more likely to be bad translations
– length N of the N-best list
– number of pruned hypotheses
– log-likelihood score
– avg. edit distance of the 1-best from the first k-bests
– …

Open Issues

• Lack of an objective quality score able to capture cognitive effort
– A new score that contains the main features of HTER and correlates well with PE time

• Lack of a technique able to threshold the quality score (bad vs. good translations)
– Is HTER = 0.3/0.5/0.7 a bad or a good translation?
– Useful in the CAT tool scenario, where it is necessary to discard bad translations

Open Issues

• More than 1,000 quality indicators have been developed in recent years.
– Do we need all of them in a real application?
– Which are the most reliable in each group?
– Which is the best combination?

• Subjectivity in the post-editor's work and in the task
– A single quality estimator for very different post-editor behaviors and tasks
– Adaptability/personalization

MT Evaluation Dilemma

Summary

• MT evaluation: a hot topic…
– Shared evaluation methods/routines are a key asset in any field

• …but a difficult task
– We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.

Summary

• Human evaluation
– Accurate, high quality, meaningful, expensive, slow, subjective
– Fluency, adequacy

• Automatic evaluation
– Cheap, quick, repeatable, objective, approximate, less accurate
– Reference-based: BLEU, TER, HTER (pros and cons)
– Reference-free: quality estimation (goal, methods, open issues)

Summary

• Key concepts:
– Adequacy
– Reference translation
– Agreement
– Correlation
– Post-editing effort
– CAT tool
– Feature
– Cognitive effort
– HTER
– Mean Absolute Error

Evaluation of Machine Translation Quality

Marco Turchi
FBK Trento, Italy
turchi@fbk.eu