8/3/2019 Andrew Rosenberg- Lecture 22: Evaluation
Lecture 22: Evaluation
April 24, 2010
Last Time
Spectral Clustering
Today
Evaluation Measures
  Accuracy
  Significance Testing
  F-Measure
Error Types
  ROC Curves
  Equal Error Rate
AIC/BIC
How do you know that you have a good classifier?
Is a feature contributing to overall performance?
Is classifier A better than classifier B?
Internal Evaluation: measure the performance of the classifier
External Evaluation: measure the performance on a downstream task
Accuracy
Easily the most common and intuitive measure of classification performance
$\text{Accuracy} = \frac{\#\text{correct}}{N}$
Significance Testing
Say I have two classifiers
  A = 50% accuracy
  B = 75% accuracy
B is better, right?
Significance Testing
Say I have another two classifiers
  A = 50% accuracy
  B = 50.5% accuracy
Is B better?
Basic Evaluation
Training data: used to identify model parameters
Testing data: used for evaluation
Optionally: development/tuning data, used to identify model hyperparameters
Difficult to get significance or confidence values
Cross Validation
Identify n folds of the available data
Train on n-1 folds
Test on the remaining fold
In the extreme (n = N) this is known as leave-one-out cross validation
n-fold cross validation (xval) gives n samples of the performance of the classifier
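The fold procedure can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the helper names (`k_fold_indices`, `cross_validate`) and the toy majority-label classifier are assumptions made for the example.

```python
import random

def k_fold_indices(num_items, n_folds, seed=0):
    """Shuffle item indices and deal them into n_folds roughly equal folds."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(data, labels, n_folds, train_fn, predict_fn, seed=0):
    """Train on n-1 folds, test on the held-out fold; return one accuracy per fold."""
    accuracies = []
    for held_out in k_fold_indices(len(data), n_folds, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held_out_set]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, data[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies

# Hypothetical stand-in classifier: always predict the most frequent training label.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

data = list(range(20))
labels = [0] * 15 + [1] * 5
fold_accuracies = cross_validate(data, labels, 5, train_majority, predict_majority)
print(fold_accuracies)  # five accuracy samples, one per fold
```

Setting `n_folds` equal to `len(data)` gives the leave-one-out case.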
Significance Testing
Is the performance of two classifiers different with statistical significance?
Means testing
  If we have two samples of classifier performance (accuracy), we want to determine if they are drawn from the same distribution (no difference) or two different distributions
T-test
One-sample t-test
Independent t-test: unequal variances and sample sizes
Once you have a t-value, look up the significance level on a table, keyed on the t-value and degrees of freedom
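The formulas on this slide did not survive extraction. As a hedged sketch, the standard textbook forms of the two statistics are shown below: the one-sample statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ and Welch's statistic for two independent samples with unequal variances and sizes. That the slide used exactly these forms is an assumption, and the function names are illustrative.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (s / math.sqrt(n)), n - 1

def welch_t(sample_a, sample_b):
    """t statistic and Welch-Satterthwaite degrees of freedom for two
    independent samples with unequal variances and sample sizes."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up per-fold accuracies, for illustration only.
acc_a = [0.70, 0.72, 0.68, 0.71, 0.69]
acc_b = [0.60, 0.66, 0.58, 0.64, 0.62, 0.63]
print(one_sample_t(acc_a, 0.65))  # compare against a published accuracy of 0.65
print(welch_t(acc_a, acc_b))      # compare two classifiers
```

The returned t-value and degrees of freedom are what you would take to the lookup table.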
Significance Testing
Run cross-validation to get n samples of the classifier mean
Use this distribution to compare against either:
  A known (published) level of performance: one-sample t-test
  Another distribution of performance: two-sample t-test
If at all possible, results should include information about the variance of classifier performance
Significance Testing
Caveat: including more samples of the classifier performance can artificially inflate the significance measure
If $\bar{x}$ and $s$ are constant (the sample represents the population mean and variance), then raising $n$ will increase $t$
If these samples are real, then this is fine
Often cross-validation fold assignment is not truly random
Thus subsequent xval runs only resample the same information
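A small numeric sketch of this caveat, assuming the standard one-sample statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$. The mean, baseline and standard deviation below are made-up values held fixed while only n changes.

```python
import math

def t_value(mean, mu0, s, n):
    """One-sample t statistic for a sample mean, baseline mu0, std dev s, size n."""
    return (mean - mu0) / (s / math.sqrt(n))

sample_sizes = [5, 10, 50, 100]
# Same mean and spread every time; only the number of samples grows.
t_values = [t_value(0.72, 0.70, 0.05, n) for n in sample_sizes]
for n, t in zip(sample_sizes, t_values):
    print(n, round(t, 2))
```

With everything else fixed, t grows in proportion to $\sqrt{n}$, so repeated cross-validation runs over the same data can make a small difference look significant.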
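Confidence Bars
Variance information can be included in plots of classifier performance to ease visualization
Plot standard deviation, standard error or confidence interval?
$\mathrm{SD} = \sigma$
$\mathrm{SE} = \frac{\sigma}{\sqrt{n}}$
$\mathrm{CI}_{95\%} = 1.96\,\frac{\sigma}{\sqrt{n}}$
Example values: $\mu = 10$, $\sigma = 1$, $n = 10$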
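To make the three choices concrete, the sketch below computes each error-bar half-width, assuming the slide's example uses a standard deviation of 1 and n = 10. The function name is illustrative.

```python
import math

def error_bar_half_widths(sigma, n, z=1.96):
    """Half-widths of the three common error bars for a mean of n samples."""
    sd = sigma                 # standard deviation: spread of individual samples
    se = sigma / math.sqrt(n)  # standard error: uncertainty of the mean
    ci95 = z * se              # 95% confidence interval (normal approximation)
    return sd, se, ci95

sd, se, ci95 = error_bar_half_widths(sigma=1.0, n=10)
print(f"SD={sd:.3f} SE={se:.3f} CI95={ci95:.3f}")  # SD=1.000 SE=0.316 CI95=0.620
```

The factor 1.96 is the normal-distribution value; for small n, a critical value from the t distribution would give a somewhat wider interval.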
Confidence Bars
Most important to be clear about what is plotted
95% confidence interval has the clearest interpretation
[Figure: error bars for SD, SE and CI; axis range 8 to 11.5]
Baseline Classifiers
Majority class baseline
  Every data point is classified as the class that is most frequently represented in the training data
Random baseline
  Randomly assign one of the classes to each data point
    with an even distribution
    with the training class distribution
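Both baselines are easy to sketch with the standard library. The function names and the `match_training_distribution` flag are assumptions made for this example.

```python
import random
from collections import Counter

def majority_baseline(train_labels, num_test):
    """Predict the most frequent training class for every test point."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * num_test

def random_baseline(train_labels, num_test, match_training_distribution, seed=0):
    """Predict a random class: uniformly, or weighted by training frequency."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    classes = sorted(counts)
    weights = [counts[c] for c in classes] if match_training_distribution else None
    return rng.choices(classes, weights=weights, k=num_test)

train = ["spam"] * 2 + ["ham"] * 8
print(majority_baseline(train, 3))       # ['ham', 'ham', 'ham']
print(random_baseline(train, 5, False))  # even distribution over classes
print(random_baseline(train, 5, True))   # follows the 20/80 training split
```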
Problems with accuracy
Contingency Table
                        True Values
                        Positive          Negative
Hyp Values   Positive   True Positive     False Positive
             Negative   False Negative    True Negative
$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
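As a rough sketch, the four cells can be counted directly from paired true and hypothesized labels. The function names and the tiny label lists are made up for illustration.

```python
def contingency_counts(true_labels, hyp_labels, positive=1):
    """Count (TP, FP, FN, TN) from paired true and hypothesized labels."""
    tp = fp = fn = tn = 0
    for truth, hyp in zip(true_labels, hyp_labels):
        if hyp == positive and truth == positive:
            tp += 1
        elif hyp == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all data points that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

true_labels = [1, 1, 0, 0, 0]
hyp_labels = [1, 0, 1, 0, 0]
counts = contingency_counts(true_labels, hyp_labels)
print(counts, accuracy(*counts))  # (1, 1, 1, 2) 0.6
```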
Problems with accuracy
Information Retrieval Example
  Find the 10 documents related to a query in a set of 110 documents
                        True Values
                        Positive   Negative
Hyp Values   Positive   0          0
             Negative   10         100
$\text{Accuracy} = \frac{0 + 100}{110} \approx 90.9\%$
Problems with accuracy
Precision: how many hypothesized events were true events
Recall: how many of the true events were identified
F-Measure: harmonic mean of precision and recall
                        True Values
                        Positive   Negative
Hyp Values   Positive   0          0
             Negative   10         100
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
$F = \frac{2PR}{P + R}$
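Applied to the table above, these definitions hit a division by zero: nothing was hypothesized positive, so precision is 0/0. The sketch below uses one possible convention, which is an assumption rather than something the slides specify: an undefined ratio is returned as `None` and F is reported as 0 when precision or recall is undefined or zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean from contingency counts."""
    precision = tp / (tp + fp) if tp + fp else None  # undefined if nothing was hypothesized
    recall = tp / (tp + fn) if tp + fn else None     # undefined if there are no true events
    if not precision or not recall:
        f1 = 0.0  # convention assumed here for the undefined or zero case
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The information retrieval table: TP = 0, FP = 0, FN = 10.
print(precision_recall_f1(0, 0, 10))  # (None, 0.0, 0.0)
```

Accuracy on that table is about 90.9%, yet recall and F are both zero, which is the problem the slide is pointing at.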
F-Measure
F-measure can be weighted to favor Precision or Recall
$\beta > 1$ favors recall
$\beta < 1$ favors precision
F-Measure
                        True Values
                        Positive   Negative
Hyp Values   Positive   1          0
             Negative   9          100
$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$
$P = 1$
$R = \frac{1}{10}$
$F_1 = .18$
F-Measure
                        True Values
                        Positive   Negative
Hyp Values   Positive   10         50
             Negative   0          50
$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$
$P = \frac{10}{60}$
$R = 1$
$F_1 = .29$
F-Measure
                        True Values
                        Positive   Negative
Hyp Values   Positive   9          1
             Negative   1          99
$F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}$
$P = .9$
$R = .9$
$F_1 = .9$
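The three worked tables can be checked with a short function implementing the weighted F-measure. The function name, the table labels, and the choice to return 0 when there are no true positives are assumptions made for this sketch.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Weighted F-measure: beta > 1 favors recall, beta < 1 favors precision."""
    if tp == 0:
        return 0.0  # convention assumed here: no true positives means F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (TP, FP, FN) for the three worked tables.
tables = {
    "one hit": (1, 0, 9),
    "half flagged": (10, 50, 0),
    "balanced": (9, 1, 1),
}
for name, (tp, fp, fn) in tables.items():
    print(name, round(f_beta(tp, fp, fn), 2))  # 0.18, 0.29, 0.9

# Same high-precision, low-recall table under different weightings.
print(round(f_beta(1, 0, 9, beta=2.0), 3), round(f_beta(1, 0, 9, beta=0.5), 3))
```

For the high-precision, low-recall table, weighting recall more heavily ($\beta = 2$) lowers the score and weighting precision more heavily ($\beta = 0.5$) raises it.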
F-Measure
Accuracy is weighted towards majority class performance
F-measure is useful for measuring the performance on minority classes
Types of Errors
False Positives
  The system predicted TRUE but the value was FALSE
  aka False Alarms or Type I error
False Negatives
  The system predicted FALSE but the value was TRUE
  aka Misses or Type II error
ROC Curves
It is common to plot classifier performance at a variety of settings or thresholds
Receiver Operating Characteristic (ROC) curves plot true positives against false positives
The overall performance is calculated by the Area Under the Curve (AUC)
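A minimal sketch of how such a curve and its area might be computed from classifier scores. It assumes binary labels coded as 1 (positive) and 0 (negative), plots rates rather than raw counts, and uses the trapezoid rule for the area. The function names and the scores are made up.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low; return (FPR, TPR) points."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / num_neg, tp / num_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(auc(curve), 3))  # 0.889
```

The area equals the fraction of positive/negative pairs that the scores rank correctly, here 8 of 9.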
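ROC Curves
Equal Error Rate (EER) is commonly reported
EER is the error rate at the operating point where the false positive rate equals the false negative rate; it is a single-number summary, not necessarily the classifier's highest accuracy
Curves provide more detail about performance
Gauvain et al. 1995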
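One simple way to approximate the EER from a finite set of scores is to try each observed score as a threshold and keep the one where the false positive and false negative rates are closest. This is a sketch under that assumption, not a standard library routine; labels are assumed to be 1 (positive) or 0 (negative), and the function name is illustrative.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: threshold where FPR and FNR are closest.

    Returns (eer, threshold); a point is predicted positive if score >= threshold.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    best = None
    for threshold in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        fpr, fnr = fp / num_neg, fn / num_pos
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, (fpr + fnr) / 2, threshold)
    return best[1], best[2]

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
print(round(eer, 3), threshold)  # 0.333 0.7
```

When no threshold makes the two rates exactly equal, this returns their average at the closest point; interpolating between neighboring thresholds is a common refinement.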
Goodness of Fit
Another view of model performance
Measure the model likelihood of the unseen data
However, we've seen that model likelihood is likely to improve by adding parameters
Two information criteria measures include a cost term for the number of parameters in the model
$l(x; \theta)$
Akaike Information Criterion
Akaike Information Criterion (AIC), based on entropy
The best model has the lowest AIC
  Greatest model likelihood
  Fewest free parameters
$\mathrm{AIC} = 2k - 2 \ln(l(x; \theta))$
  $2k$: information in the parameters
  $-2 \ln(l(x; \theta))$: information lost by the modeling
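A short sketch of AIC-based model comparison. The two candidate models and their log-likelihoods are hypothetical numbers chosen only to show the trade-off between fit and parameter count.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood

# Hypothetical models: (log-likelihood on the data, number of free parameters).
candidates = {"small model": (-120.0, 3), "large model": (-118.5, 8)}
aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
print(aic_scores)                           # {'small model': 246.0, 'large model': 253.0}
print(min(aic_scores, key=aic_scores.get))  # small model
```

The larger model fits slightly better, but not by enough to pay for its five extra parameters.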
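Bayesian Information Criterion
Another penalization term based on Bayesian arguments
Select the model that is a posteriori most probable, with a constant penalty term for wrong models
$\mathrm{BIC} = k \ln(n) - 2 \ln(l(x; \theta))$
If errors are normally distributed (per-sample form, up to an additive constant):
$\mathrm{BIC} = \ln(\sigma_e^2) + \frac{k}{n} \ln(n)$
Note: compares estimated models when $x$ is constant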
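The two BIC forms can be sketched side by side. The second function assumes normally distributed errors with the error variance estimated as the mean squared residual; under that assumption it equals the general form divided by n, up to an additive constant, so both rank models fit to the same data identically. Function names and residuals are illustrative.

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k free parameters, n data points."""
    return k * math.log(n) - 2 * log_likelihood

def bic_gaussian_per_sample(residuals, k):
    """Per-sample BIC under normally distributed errors.

    sigma2 is the maximum-likelihood error variance (mean squared residual).
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return math.log(sigma2) + (k / n) * math.log(n)

print(round(bic(-120.0, 3, 100), 3))
print(round(bic_gaussian_per_sample([1.0, -1.0, 2.0, -2.0], k=1), 3))
```

Compared with AIC's penalty of 2 per parameter, the BIC penalty of $\ln(n)$ per parameter is larger whenever n is at least 8.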
Today
Accuracy
Significance Testing
F-Measure
AIC/BIC
Next Time
Regression Evaluation
Cluster Evaluation