CSC412/2506 Winter 2018 Probabilistic Learning and Reasoning
Lecture 2: Simple Classifiers
Slides based on Rich Zemel's.
All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/CSC412
Some of the figures are provided by Kevin Murphy from his book "Machine Learning: A Probabilistic Perspective".
Basic Statistical Problems
• Basic problems: density estimation, clustering, classification, regression.
• Can always do joint density estimation and then condition:
  – Regression: p(y|x) = p(y,x)/p(x) = p(y,x) / ∫ p(y,x) dy
  – Classification: p(c|x) = p(c,x)/p(x) = p(c,x) / Σ_c p(c,x)
  – Clustering: p(c|x) = p(c,x)/p(x), with c unobserved
  – Density estimation: p(y|x) = p(y,x)/p(x), with y unobserved
• In general, if certain things are always observed we may not want to model their density. If certain things are always unobserved they are called hidden or latent variables (more later).
Fundamental Operations
• What can we do with a probabilistic graphical model?
• Generate data. For this you need to know how to sample from local models (directed) or how to do Gibbs or other sampling (undirected).
• Compute probabilities. When all nodes are either observed or marginalized, the result is a single number which is the probability of the configuration.
• Inference. Compute expectations of some things given others which are observed or marginalized.
• Learning (today). Set the parameters of the local functions given some (partially) observed data to maximize the probability of seeing that data.
Learning Graphical Models from Data
• Want to build prediction systems automatically based on data, and as little as possible on expert information.
• In this course, we'll use probability to combine evidence from data and make predictions.
• We'll use graphical models as a visual shorthand language to express and reason about families of model assumptions, structures, dependencies and information flow, without specifying exact distributional forms or parameters.
• In this case learning ≡ setting parameters of distributions given a model structure. ("Structure learning" is also possible but we won't consider it now.)
Multiple Observations, Complete IID Data
• A single observation of the data X is rarely useful on its own.
• Generally we have data including many observations, which creates a set of random variables: D = {x^(1), x^(2), ..., x^(M)}
• We will sometimes assume two things:
  1. Observations are independently and identically distributed according to the joint distribution of the graphical model: i.i.d. samples.
  2. We observe all random variables in the domain on each observation: complete data, or fully observed model.
• We shade the nodes in a graphical model to indicate they are observed. (Later we will work with unshaded nodes corresponding to missing data or latent variables.)
Likelihood Function
• So far we have focused on the (log) probability function p(x|θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ.
• But in learning we turn this on its head: we have some fixed data and we want to find parameters.
• Think of p(x|θ) as a function of θ for fixed x:
  ℓ(θ; x) = log p(x|θ)
  This function is called the (log) "likelihood".
• Choose θ to maximize some cost or loss function L(θ) which includes ℓ(θ):
  maximum likelihood (ML): L(θ) = ℓ(θ; D)
  maximum a posteriori (MAP) / penalized ML: L(θ) = ℓ(θ; D) + r(θ)
  (also cross-validation, Bayesian estimators, BIC, AIC, ...)
Maximum Likelihood
• For IID data, the log likelihood is a sum of identical functions:
  p(D|θ) = Π_m p(x^(m)|θ)
  ℓ(θ; D) = Σ_m log p(x^(m)|θ)
• Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:
  θ*_ML = argmax_θ ℓ(θ; D)
• Very commonly used in statistics.
• Often leads to "intuitive", "appealing", or "natural" estimators.
• For a start, the IID assumption makes the log likelihood into a sum, so its derivative can be easily taken term by term.
Sufficient Statistics
• A statistic is a (possibly vector-valued) deterministic function of a (set of) random variable(s).
• T(X) is a "sufficient statistic" for X if
  T(x^(1)) = T(x^(2)) ⇒ L(θ; x^(1)) = L(θ; x^(2)) ∀θ
• Equivalently (by the Neyman factorization theorem) we can write:
  p(x|θ) = h(x, T(x)) g(T(x), θ)
• Example: exponential family models:
  p(x|η) = h(x) exp{η^T T(x) − A(η)}
Example: Bernoulli Trials
• We observe M iid coin flips: D = H, H, T, H, ...
• Model: p(H) = θ, p(T) = (1 − θ)
• Likelihood:
  ℓ(θ; D) = log p(D|θ)
          = log Π_m θ^{x^(m)} (1 − θ)^{1 − x^(m)}
          = log θ Σ_m x^(m) + log(1 − θ) Σ_m (1 − x^(m))
          = N_H log θ + N_T log(1 − θ)
• Take derivatives and set to zero:
  ∂ℓ/∂θ = N_H/θ − N_T/(1 − θ)
  ⇒ θ*_ML = N_H / (N_H + N_T)
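The closed-form estimate above is easy to check numerically. A minimal sketch in Python/NumPy (the flip sequence and variable names are made up for illustration):

```python
import numpy as np

# Encode H as 1 and T as 0, e.g. a dataset D = H, H, T, H, ...
x = np.array([1, 1, 0, 1, 0, 1, 1])

N_H = x.sum()           # number of heads
N_T = len(x) - N_H      # number of tails

theta_ml = N_H / (N_H + N_T)   # closed-form MLE derived on the slide
print(theta_ml)
```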
Example: Multinomial
• We observe M iid die rolls (K-sided): D = 3, 1, K, 2, ...
• Model: p(k) = θ_k, with Σ_k θ_k = 1
• Likelihood (for binary indicators [x^(m) = k]):
  ℓ(θ; D) = log p(D|θ)
          = log Π_m θ_1^{[x^(m)=1]} ... θ_K^{[x^(m)=K]}
          = Σ_k log θ_k Σ_m [x^(m) = k] = Σ_k N_k log θ_k
• Take derivatives and set to zero (enforcing Σ_k θ_k = 1 with a Lagrange multiplier, which turns out to equal M):
  ∂ℓ/∂θ_k = N_k/θ_k − M
  ⇒ θ*_k = N_k / M
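A minimal numerical sketch of this count-and-normalize estimator, assuming NumPy; the sequence of die rolls is made up for illustration:

```python
import numpy as np

K = 6
rolls = np.array([3, 1, 6, 2, 3, 3, 5])        # observed faces, values in 1..K
N_k = np.bincount(rolls, minlength=K + 1)[1:]  # counts N_k for k = 1..K
theta_ml = N_k / N_k.sum()                     # theta_k = N_k / M
print(theta_ml, theta_ml.sum())                # the estimates sum to 1 by construction
```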
Example: Univariate Normal
• We observe M iid real samples: D = 1.18, −.25, .78, ...
• Model: p(x) = (2πσ²)^{−1/2} exp{−(x − μ)²/2σ²}
• Likelihood (using probability density):
  ℓ(θ; D) = log p(D|θ)
          = −(M/2) log(2πσ²) − (1/2) Σ_m (x_m − μ)²/σ²
• Take derivatives and set to zero:
  ∂ℓ/∂μ = (1/σ²) Σ_m (x_m − μ)
  ∂ℓ/∂σ² = −M/(2σ²) + (1/2σ⁴) Σ_m (x_m − μ)²
  ⇒ μ_ML = (1/M) Σ_m x_m
  ⇒ σ²_ML = (1/M) Σ_m x_m² − μ²_ML
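A minimal sketch of the two closed-form estimates, assuming NumPy; the sample values extend the illustrative data above:

```python
import numpy as np

x = np.array([1.18, -0.25, 0.78, 0.33, -0.90])   # illustrative real-valued samples
M = len(x)

mu_ml = x.mean()                          # (1/M) sum_m x_m
sigma2_ml = (x ** 2).mean() - mu_ml ** 2  # (1/M) sum_m x_m^2 - mu_ML^2 (the biased variance)
print(mu_ml, sigma2_ml)
```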
Example: Linear Regression
• At a linear regression node, some parents (covariates/inputs) and all children (responses/outputs) are continuous-valued variables.
• For each child and setting of parents we use the model:
  p(y|x, θ) = gauss(y|θ^T x, σ²)
• The likelihood is the familiar "squared error" cost:
  ℓ(θ; D) = −(1/2σ²) Σ_m (y^(m) − θ^T x^(m))²
• The ML parameters can be solved for using linear least-squares:
  ∂ℓ/∂θ = (1/σ²) Σ_m (y^(m) − θ^T x^(m)) x^(m)
  ⇒ θ*_ML = (X^T X)^{−1} X^T Y
• Sufficient statistics are the input correlation matrix and the input-output cross-correlation vector.
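A minimal sketch of the normal-equation solution, assuming NumPy; the design matrix, noise level, and true weights are made up for illustration. Note that X^T X and X^T Y are exactly the sufficient statistics mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix, one example per row
true_theta = np.array([0.5, -1.0, 2.0])        # illustrative "true" weights
Y = X @ true_theta + 0.1 * rng.normal(size=100)

# Solve (X^T X) theta = X^T Y instead of forming the inverse explicitly
theta_ml = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_ml)
```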
Sufficient Statistics are Sums
• In the examples above, the sufficient statistics were merely sums (counts) of the data:
  – Bernoulli: # of heads, tails
  – Multinomial: # of each type
  – Gaussian: mean, mean-square
  – Regression: correlations
• As we will see, this is true for all exponential family models: the ML estimate matches the average sufficient statistics to the natural moments of the model.
• Only exponential family models have simple sufficient statistics.
MLE for Directed GMs
• For a directed GM, the likelihood function has a nice form:
  log p(D|θ) = log Π_m Π_i p(x_i^(m) | x_{π_i}^(m), θ_i) = Σ_m Σ_i log p(x_i^(m) | x_{π_i}^(m), θ_i)
• The parameters decouple, so we can maximize the likelihood independently for each node's function by setting θ_i.
• Only need the values of x_i and its parents x_{π_i} in order to estimate θ_i.
• Furthermore, if we have sufficient statistics, we only need those.
• In general, for fully observed data, if we know how to estimate params at a single node we can do it for the whole network.
Example: A Directed Model
• Consider the distribution defined by the DAGM:
  p(x|θ) = p(x_1|θ_1) p(x_2|x_1, θ_2) p(x_3|x_1, θ_3) p(x_4|x_2, x_3, θ_4)
• This is exactly like learning four separate small DAGMs, each of which consists of a node and its parents.
MLE for Categorical Networks
• Assume our DAGM contains only discrete nodes, and we use the (general) categorical form for the conditional probabilities.
• Sufficient statistics involve counts of joint settings of x_i, x_{π_i}, summing over all other variables in the table.
• Likelihood for these special "fully observed categorical networks":
  ℓ(θ; D) = log Π_{m,i} p(x_i^(m) | x_{π_i}^(m), θ_i)
          = log Π_{i, x_i, x_{π_i}} p(x_i | x_{π_i}, θ_i)^{N(x_i, x_{π_i})} = log Π_{i, x_i, x_{π_i}} θ_{x_i|x_{π_i}}^{N(x_i, x_{π_i})}
          = Σ_i Σ_{x_i, x_{π_i}} N(x_i, x_{π_i}) log θ_{x_i|x_{π_i}}
  ⇒ θ*_{x_i|x_{π_i}} = N(x_i, x_{π_i}) / N(x_{π_i})
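A minimal sketch of this counting estimator for a single node with one parent (both binary), assuming NumPy; the observations are made up for illustration:

```python
import numpy as np

# Binary parent x1 and binary child x2; each row is one observation (x1, x2).
data = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [0, 0], [1, 0]])

K = 2
N_joint = np.zeros((K, K))          # N(x2, x1): counts of joint settings
for x1, x2 in data:
    N_joint[x2, x1] += 1

N_parent = N_joint.sum(axis=0)      # N(x1): counts of parent settings
theta = N_joint / N_parent          # theta_{x2 | x1} = N(x2, x1) / N(x1)
print(theta)                        # each column (fixed parent value) sums to 1
```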
MLE for General Exponential Family Models
• Recall the probability function for models in the exponential family:
  p(x|η) = h(x) exp{η^T T(x) − A(η)}
• For i.i.d. data, the sufficient statistic vector is Σ_m T(x^(m)):
  ℓ(η; D) = log p(D|η) = (Σ_m log h(x^(m))) − M A(η) + η^T Σ_m T(x^(m))
• Take derivatives and set to zero:
  ∂ℓ/∂η = Σ_m T(x^(m)) − M ∂A(η)/∂η
  ⇒ ∂A(η)/∂η evaluated at η_ML equals (1/M) Σ_m T(x^(m))
  recalling that the natural moments of an exponential family distribution are the derivatives of the log normalizer.
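A minimal sketch of this moment-matching condition for the Bernoulli written in natural form (T(x) = x, A(η) = log(1 + e^η), so ∂A/∂η is the logistic sigmoid), assuming NumPy; the data are made up for illustration:

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1])        # illustrative coin flips, T(x) = x
mean_T = x.mean()                           # (1/M) sum_m T(x^(m))
eta_ml = np.log(mean_T / (1.0 - mean_T))    # invert the sigmoid (logit) to get eta_ML

# Check: the natural moment dA/deta at eta_ML matches the empirical sufficient statistic.
print(1.0 / (1.0 + np.exp(-eta_ml)), mean_T)
```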
Classification, Revisited
• Given examples of a discrete class label y and some features x.
• Goal: compute the label (y) for new inputs x.
• Two approaches:
  Generative: model p(x, y) = p(y) p(x|y); use Bayes' rule to infer the conditional p(y|x).
  Discriminative: model discriminants f(y|x) directly and take the max.
• The generative approach is related to conditional density estimation, while the discriminative approach is closer to regression.
Probabilistic Classification: Bayes Classifier
• Generative model: p(x, y) = p(y) p(x|y)
  p(y) are called class priors (relative frequencies).
  p(x|y) are called class-conditional feature distributions.
• For the class frequency prior we use a Bernoulli or categorical:
  p(y = k|π) = π_k, with Σ_k π_k = 1
• Fitting by maximum likelihood:
  – Sort data into batches by class label
  – Estimate p(y) by counting the size of batches (plus regularization)
  – Estimate p(x|y) separately within each batch using ML (also with regularization)
• Two classification rules (if forced to choose):
  – ML: argmax_y p(x|y) (can behave badly if skewed frequencies)
  – MAP: argmax_y p(y|x) = argmax_y log p(x|y) + log p(y) (safer)
Three Key Regularization Ideas
To avoid overfitting, we can:
• Put priors on the parameters. Maximum likelihood + priors = maximum a posteriori (MAP). Simple and fast. Not Bayesian.
• Integrate over all possible parameters. Also requires priors, but protects against overfitting for totally different reasons.
• Make factorization or independence assumptions. Fewer inputs to each conditional probability. Ties parameters together so that fewer of them are estimated.
Gaussian Class-Conditional Distribution
• If all features are continuous, a popular choice is a Gaussian class-conditional:
  p(x|y = k, θ) = |2πΣ|^{−1/2} exp{−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)}
• Fitting: use the following amazing and useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!]
• Seems easy. And works amazingly well. But we can do even better with some simple regularization...
Regularized Gaussians
• Idea 1: assume all the covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis.
• Idea 2: make independence assumptions to get diagonal or identity-multiple covariances. (Or sparse inverse covariances.) More on this in a few minutes...
• Idea 3: add a bit of the identity matrix to each sample covariance. This "fattens it up" in all directions and prevents collapse. Related to using a Wishart prior on the covariance matrix.
Gaussian Bayes Classifier
• Maximum likelihood estimates for parameters:
  priors π_k: use observed frequencies of classes (plus smoothing)
  means μ_k: use class means
  covariance Σ: use data from a single class or pooled data (x^(m) − μ_{y^(m)}) to estimate full/diagonal covariances
• Compute the posterior via Bayes' rule:
  p(y = k|x, θ) = p(x|y = k, θ) p(y = k|π) / Σ_j p(x|y = j, θ) p(y = j|π)
               = exp{μ_k^T Σ^{−1} x − μ_k^T Σ^{−1} μ_k / 2 + log π_k} / Σ_j exp{μ_j^T Σ^{−1} x − μ_j^T Σ^{−1} μ_j / 2 + log π_j}
               = e^{β_k^T x} / Σ_j e^{β_j^T x} = exp{β_k^T x} / Z
  where β_k = [Σ^{−1} μ_k ; −μ_k^T Σ^{−1} μ_k / 2 + log π_k] and we have augmented x with a constant component always equal to 1 (bias term).
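A minimal sketch of this fit-and-predict recipe with a pooled (shared) covariance, assuming NumPy; the two-class synthetic data, function names, and absence of regularization are illustrative choices rather than the lecture's reference implementation:

```python
import numpy as np

def fit(X, y, K):
    M, D = X.shape
    pi = np.array([(y == k).mean() for k in range(K)])          # class priors from frequencies
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # class means
    centered = X - mu[y]                                        # x^(m) - mu_{y^(m)}
    Sigma = centered.T @ centered / M                           # pooled covariance
    return pi, mu, Sigma

def posterior(X, pi, mu, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # logit_k = mu_k^T Sigma^{-1} x - mu_k^T Sigma^{-1} mu_k / 2 + log pi_k  (i.e. beta_k^T [x; 1])
    logits = X @ Sinv @ mu.T - 0.5 * np.sum(mu @ Sinv * mu, axis=1) + np.log(pi)
    logits -= logits.max(axis=1, keepdims=True)                 # stabilize before exponentiating
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)
pi, mu, Sigma = fit(X, y, K=2)
print(posterior(X[:3], pi, mu, Sigma))
```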
Softmax/Logit
• The squashing function is known as the softmax or logit:
  σ_k(z) ≡ e^{z_k} / Σ_j e^{z_j}    g(η) = 1 / (1 + e^{−η})
• It is invertible (up to a constant):
  z_k = log σ_k + c    η = log(g / (1 − g))
• The derivative is easy:
  ∂σ_k/∂z_j = σ_k (δ_kj − σ_j)    ∂g/∂η = g(1 − g)
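A small numerical check of the softmax Jacobian formula against finite differences, assuming NumPy; the test point z is arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.2, -1.0, 0.7])
s = softmax(z)
analytic = np.diag(s) - np.outer(s, s)          # sigma_k (delta_kj - sigma_j)

eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # tiny, so the formula checks out
```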
Linear Geometry
• Taking the ratio of any two posteriors (the "odds") shows that the contours of equal pairwise probability are linear surfaces in the feature space:
  p(y = k|x, θ) / p(y = j|x, θ) = exp{(β_k − β_j)^T x}
• The pairwise discrimination contours p(y_k) = p(y_j) are orthogonal to the differences of the means in feature space when Σ = σI. For a general Σ shared between all classes, the same is true in the transformed feature space t = Σ^{−1} x.
• Class priors do not change the geometry, they only shift the operating point on the logit by the log-odds: log(π_k/π_j).
• Thus, for equal class-covariances, we obtain a linear classifier.
• If we use different covariances, the decision surfaces are conic sections and we have a quadratic classifier.
Exponential Family Class-Conditionals
• The Bayes Classifier has the same softmax form whenever the class-conditional densities are any exponential family density:
  p(x|y = k, η_k) = h(x) exp{η_k^T x − a(η_k)}
  p(y = k|x, η) = p(x|y = k, η_k) p(y = k|π) / Σ_j p(x|y = j, η_j) p(y = j|π)
               = exp{η_k^T x − a(η_k)} / Σ_j exp{η_j^T x − a(η_j)}
               = e^{β_k^T x} / Σ_j e^{β_j^T x}
  where β_k = [η_k ; −a(η_k)] and we have augmented x with a constant component always equal to 1 (bias term).
• The resulting classifier is linear in the sufficient statistics.
Discrete Bayesian Classifier
• If the inputs are discrete (categorical), what should we do?
• The simplest class-conditional model is a joint multinomial (table):
  p(x_1 = a, x_2 = b, ...|y = c) = η^c_{ab...}
• This is conceptually correct, but there's a big practical problem.
• Fitting: ML params are observed counts:
  η^c_{ab...} = Σ_n [y^(n) = c][x_1^(n) = a][x_2^(n) = b] ... / Σ_n [y^(n) = c]
• Consider the 16x16 digits at 256 gray levels.
• How many entries in the table? How many will be zero? What happens at test time?
• We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also. But what about independence?
Naïve Bayes Classifier
• Assumption: conditioned on class, attributes are independent:
  p(x|y) = Π_i p(x_i|y)
• Sounds crazy right? Right! But it works.
• Algorithm: sort data cases into bins according to y_n.
• Compute marginal probabilities p(y = c) using frequencies.
• For each class, estimate the distribution of the i-th variable: p(x_i|y = c).
• At test time, compute argmax_c p(c|x) using
  c(x) = argmax_c p(c|x) = argmax_c [log p(x|c) + log p(c)]
       = argmax_c [log p(c) + Σ_i log p(x_i|c)]
Discrete (Categorical) Naïve Bayes
• Discrete features x_i assumed independent given class label y:
  p(x_i = j|y = k) = η_ijk
  p(x|y = k, η) = Π_i Π_j η_ijk^{[x_i = j]}
• Classification rule:
  p(y = k|x, η) = π_k Π_i Π_j η_ijk^{[x_i = j]} / Σ_q π_q Π_i Π_j η_ijq^{[x_i = j]}
               = e^{β_k^T x} / Σ_q e^{β_q^T x}
  where β_k = [log η_11k ... log η_1jk ... log η_ijk ... log π_k]
  and x = [x_1 = 1; x_2 = 2; ... x_i = j; ...; 1] is encoded with binary indicators (plus the constant bias component).
Fitting Discrete Naïve Bayes
• ML parameters are class-conditional frequency counts:
  η*_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / Σ_m [y^(m) = k]
• How do we know? Write down the likelihood:
  ℓ(η; D) = Σ_m log p(y^(m)|π) + Σ_{m,i} log p(x_i^(m)|y^(m), η)
• and optimize it by setting its derivative to zero (careful! enforce normalization with Lagrange multipliers; CSC411 Tut 4):
  ℓ(η; D) = Σ_m Σ_{ijk} [x_i^(m) = j][y^(m) = k] log η_ijk + Σ_{ik} λ_ik (1 − Σ_j η_ijk)
  ∂ℓ/∂η_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / η_ijk − λ_ik
  ∂ℓ/∂η_ijk = 0 ⇒ λ_ik = Σ_m [y^(m) = k] ⇒ η*_ijk as above
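A minimal sketch of fitting and applying discrete naïve Bayes by counting, assuming NumPy; the add-one smoothing (alpha), the toy dataset, and the function names are illustrative choices beyond the slide:

```python
import numpy as np

def fit_naive_bayes(X, y, J, K, alpha=1.0):
    # X: (M, D) integer features in 0..J-1; y: (M,) labels in 0..K-1.
    M, D = X.shape
    pi = np.bincount(y, minlength=K) / M                 # class priors from frequencies
    eta = np.zeros((D, J, K))                            # eta[i, j, k] = p(x_i = j | y = k)
    for k in range(K):
        Xk = X[y == k]
        for i in range(D):
            counts = np.bincount(Xk[:, i], minlength=J) + alpha   # smoothed frequency counts
            eta[i, :, k] = counts / counts.sum()
    return pi, eta

def predict(X, pi, eta):
    labels = []
    for x in X:
        # argmax_k [ log p(k) + sum_i log p(x_i | k) ]
        log_post = np.log(pi) + sum(np.log(eta[i, x[i], :]) for i in range(len(x)))
        labels.append(int(log_post.argmax()))
    return np.array(labels)

# Tiny illustrative usage
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 1, 1])
pi, eta = fit_naive_bayes(X, y, J=2, K=2)
print(predict(X, pi, eta))
```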
Gaussian Naïve Bayes
• This is just a Gaussian Bayes Classifier with a separate diagonal covariance matrix for each class.
• Equivalent to fitting a one-dimensional Gaussian to each input for each possible class.
• Decision surfaces are quadratics, not linear...
Discriminative Models
• Parametrize p(y|x) directly; forget p(x, y) and Bayes' rule.
• As long as p(y|x) or the discriminants f(y|x) are linear functions of x (or monotone transforms), decision surfaces will be piecewise linear.
• Don't need to model the density of the features. Some density models have lots of parameters. Many densities give the same linear classifier. But we cannot generate new labeled data.
• Optimize the same cost function we use at test time.
Logistic/Softmax Regression
• Model: y is a multinomial random variable whose posterior is the softmax of linear functions of any feature vector:
  p(y = k|x, θ) = e^{θ_k^T x} / Σ_j e^{θ_j^T x}
• Fitting: now we optimize the conditional likelihood:
  ℓ(θ; D) = Σ_{m,k} [y^(m) = k] log p(y = k|x^(m), θ) = Σ_{m,k} y_k^(m) log p_k^(m)
  ∂ℓ/∂θ_i = Σ_{m,k} (∂ℓ_k^(m)/∂p_k^(m)) (∂p_k^(m)/∂z_i^(m)) (∂z_i^(m)/∂θ_i)
          = Σ_{m,k} (y_k^(m)/p_k^(m)) p_k^(m) (δ_ik − p_i^(m)) x^(m)
          = Σ_m (y_i^(m) − p_i^(m)) x^(m)
  where z_j = θ_j^T x.
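A minimal sketch of batch gradient ascent on this conditional log likelihood using the gradient Σ_m (y^(m) − p^(m)) x^(m) derived above, assuming NumPy; the synthetic data, learning rate, and iteration count are illustrative:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M, D, K = 200, 3, 3
X = np.hstack([rng.normal(size=(M, D)), np.ones((M, 1))])        # augment x with a bias column
W_true = rng.normal(size=(D, K))                                  # illustrative "true" weights
y = (X[:, :D] @ W_true + 0.5 * rng.normal(size=(M, K))).argmax(axis=1)
Y = np.eye(K)[y]                                                  # one-hot targets y_k^(m)

theta = np.zeros((D + 1, K))
lr = 0.5
for _ in range(500):
    P = softmax(X @ theta)              # p_k^(m) for every example
    grad = X.T @ (Y - P)                # sum_m (y^(m) - p^(m)) x^(m), arranged as (D+1, K)
    theta += lr * grad / M              # ascend the conditional log likelihood
print("train accuracy:", np.mean((X @ theta).argmax(axis=1) == y))
```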
More on Logistic Regression
• Hardest part: picking the feature vector x.
• The likelihood is convex in the parameters θ. No local minima!
• The gradient is easy to compute, so it is easy to optimize using gradient descent or Newton-Raphson.
• Weight decay: add εθ² to the cost function, which subtracts 2εθ from each gradient.
• Why is this method called logistic regression?
• It should really be called "softmax linear regression".
• The log odds (logit) between any two classes is linear in the parameters.
• A classification neural net always has logistic regression as its last layer: with no hidden layers it is exactly logistic regression.
Classification via Regression
• Binary case: p(y = 1|x) is also the conditional expectation.
• So we could forget that y was a discrete (categorical) random variable and just attempt to model p(y|x) using regression.
• One idea: do regression to an indicator matrix.
• For two classes, this is related to LDA. For 3 or more, disaster...
• Weird idea. Noise models (e.g., Gaussian) for regression are inappropriate, and fits are sensitive to outliers. Furthermore, it gives unreasonable predictions < 0 and > 1.
Other Models
• Noisy-OR.
• Classification via Regression.
• Non-parametric (e.g., K-nearest-neighbor).
• Semi-parametric (e.g., kernel classifiers, support vector machines, Gaussian processes).
• Probit regression.
• Complementary log-log.
• Generalized linear models.
• Some return a value for y without a distribution.
Joint vs. Conditional Models
• Both Naïve Bayes and logistic regression have the same conditional form and can have the same parameterization.
• But the criteria used to choose the parameters are different.
• Naive Bayes is a joint model; it optimizes p(x, y) = p(x) p(y|x).
• Logistic regression is conditional: it optimizes p(y|x) directly.
• Pros of discriminative: more flexible, directly optimizes what we care about. Why not choose optimal parameters?
• Pros of generative: easier to think about, check the model, and incorporate other data sources (semi-supervised learning).
Joint vs. Conditional Models: Yin and Yang
• Each generative model implicitly defines a conditional model.
• p(z|x) has a complicated form if p(x|z) is at all complicated. Expensive to compute naively, yet necessary for learning.
• Autoencoders: given an interesting generative model p(x|z), force an approximate q(z|x) to have a nice form.
• So, designing inference methods for generative models involves designing discriminative recognition networks.
• Thursday: Tutorial on optimization.