CSC412/2506 Winter 2018 Probabilistic Learning and Reasoning
Lecture 2: Simple Classifiers
Slides based on Rich Zemel's.
All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/CSC412
Some of the figures are provided by Kevin Murphy from his book "Machine Learning: A Probabilistic Perspective".
Basic Statistical Problems
• Basic problems: density estimation, clustering, classification, regression.
• Can always do joint density estimation and then condition:
  – Regression: p(y|x) = p(y,x)/p(x) = p(y,x) / ∫ p(y,x) dy
  – Classification: p(c|x) = p(c,x)/p(x) = p(c,x) / Σ_c p(c,x)
  – Clustering: p(c|x) = p(c,x)/p(x), with c unobserved
  – Density estimation: p(y|x) = p(y,x)/p(x), with y unobserved
• In general, if certain things are always observed we may not want to model their density. If certain things are always unobserved they are called hidden or latent variables (more later).
Fundamental Operations
• What can we do with a probabilistic graphical model?
• Generate data. For this you need to know how to sample from local models (directed) or how to do Gibbs or other sampling (undirected).
• Compute probabilities. When all nodes are either observed or marginalized, the result is a single number which is the probability of the configuration.
• Inference. Compute expectations of some things given others which are observed or marginalized.
• Learning (today). Set the parameters of the local functions given some (partially) observed data to maximize the probability of seeing that data.
Learning Graphical Models from Data
• Want to build prediction systems automatically based on data, and as little as possible on expert information.
• In this course, we'll use probability to combine evidence from data and make predictions.
• We'll use graphical models as a visual shorthand language to express and reason about families of model assumptions, structures, dependencies and information flow, without specifying exact distributional forms or parameters.
• In this case learning ≡ setting parameters of distributions given a model structure. ("Structure learning" is also possible but we won't consider it now.)
Multiple Observations, Complete IID Data
• A single observation of the data X is rarely useful on its own.
• Generally we have data including many observations, which creates a set of random variables: D = {x^(1), x^(2), ..., x^(M)}
• We will sometimes assume two things:
  1. Observations are independently and identically distributed according to the joint distribution of the graphical model: i.i.d. samples.
  2. We observe all random variables in the domain on each observation: complete data, or fully observed model.
• We shade the nodes in a graphical model to indicate they are observed. (Later we will work with unshaded nodes corresponding to missing data or latent variables.)
Likelihood Function
• So far we have focused on the (log) probability function p(x|θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ.
• But in learning we turn this on its head: we have some fixed data and we want to find parameters.
• Think of p(x|θ) as a function of θ for fixed x:
  ℓ(θ; x) = log p(x|θ)
  This function is called the (log) "likelihood".
• Choose θ to maximize some cost or loss function L(θ) which includes ℓ(θ):
  maximum likelihood (ML): L(θ) = ℓ(θ; D)
  maximum a posteriori (MAP) / penalized ML: L(θ) = ℓ(θ; D) + r(θ)
  (also cross-validation, Bayesian estimators, BIC, AIC, ...)
Maximum Likelihood
• For IID data, the log likelihood is a sum of identical functions:
  p(D|θ) = Π_m p(x^(m)|θ)
  ℓ(θ; D) = Σ_m log p(x^(m)|θ)
• Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:
  θ*_ML = argmax_θ ℓ(θ; D)
• Very commonly used in statistics.
• Often leads to "intuitive", "appealing", or "natural" estimators.
• For a start, the IID assumption makes the log likelihood into a sum, so its derivative can be easily taken term by term.
Sufficient Statistics
• A statistic is a (possibly vector-valued) deterministic function of a (set of) random variable(s).
• T(X) is a "sufficient statistic" for X if
  T(x^(1)) = T(x^(2)) ⇒ L(θ; x^(1)) = L(θ; x^(2)) ∀θ
• Equivalently (by the Neyman factorization theorem) we can write:
  p(x|θ) = h(x, T(x)) g(T(x), θ)
• Example: exponential family models:
  p(x|η) = h(x) exp{η^T T(x) − A(η)}
Example: Bernoulli Trials
• We observe M iid coin flips: D = H, H, T, H, ...
• Model: p(H) = θ, p(T) = (1 − θ)
• Likelihood:
  ℓ(θ; D) = log p(D|θ)
          = log Π_m θ^{x^(m)} (1 − θ)^{1 − x^(m)}
          = log θ Σ_m x^(m) + log(1 − θ) Σ_m (1 − x^(m))
          = N_H log θ + N_T log(1 − θ)
• Take derivatives and set to zero:
  ∂ℓ/∂θ = N_H/θ − N_T/(1 − θ)
  ⇒ θ*_ML = N_H / (N_H + N_T)
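The closed-form estimate above is easy to check numerically. A minimal sketch in Python/NumPy (the flip sequence and variable names are made up for illustration):

```python
import numpy as np

# Encode H as 1 and T as 0, e.g. a dataset D = H, H, T, H, ...
x = np.array([1, 1, 0, 1, 0, 1, 1])

N_H = x.sum()           # number of heads
N_T = len(x) - N_H      # number of tails

theta_ml = N_H / (N_H + N_T)   # closed-form MLE derived on the slide
print(theta_ml)
```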
Example: Multinomial
• We observe M iid die rolls (K-sided): D = 3, 1, K, 2, ...
• Model: p(k) = θ_k, with Σ_k θ_k = 1
• Likelihood (for binary indicators [x^(m) = k]):
  ℓ(θ; D) = log p(D|θ)
          = log Π_m θ_1^{[x^(m)=1]} ... θ_K^{[x^(m)=K]}
          = Σ_k log θ_k Σ_m [x^(m) = k] = Σ_k N_k log θ_k
• Take derivatives and set to zero (enforcing Σ_k θ_k = 1 with a Lagrange multiplier, which turns out to equal M):
  ∂ℓ/∂θ_k = N_k/θ_k − M
  ⇒ θ*_k = N_k / M
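A minimal numerical sketch of this count-and-normalize estimator, assuming NumPy; the sequence of die rolls is made up for illustration:

```python
import numpy as np

K = 6
rolls = np.array([3, 1, 6, 2, 3, 3, 5])        # observed faces, values in 1..K
N_k = np.bincount(rolls, minlength=K + 1)[1:]  # counts N_k for k = 1..K
theta_ml = N_k / N_k.sum()                     # theta_k = N_k / M
print(theta_ml, theta_ml.sum())                # the estimates sum to 1 by construction
```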
Example: Univariate Normal
• We observe M iid real samples: D = 1.18, −.25, .78, ...
• Model: p(x) = (2πσ²)^{−1/2} exp{−(x − μ)²/2σ²}
• Likelihood (using probability density):
  ℓ(θ; D) = log p(D|θ)
          = −(M/2) log(2πσ²) − (1/2) Σ_m (x_m − μ)²/σ²
• Take derivatives and set to zero:
  ∂ℓ/∂μ = (1/σ²) Σ_m (x_m − μ)
  ∂ℓ/∂σ² = −M/(2σ²) + (1/2σ⁴) Σ_m (x_m − μ)²
  ⇒ μ_ML = (1/M) Σ_m x_m
  ⇒ σ²_ML = (1/M) Σ_m x_m² − μ²_ML
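A minimal sketch of the two closed-form estimates, assuming NumPy; the sample values extend the illustrative data above:

```python
import numpy as np

x = np.array([1.18, -0.25, 0.78, 0.33, -0.90])   # illustrative real-valued samples
M = len(x)

mu_ml = x.mean()                          # (1/M) sum_m x_m
sigma2_ml = (x ** 2).mean() - mu_ml ** 2  # (1/M) sum_m x_m^2 - mu_ML^2 (the biased variance)
print(mu_ml, sigma2_ml)
```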
Example: Linear Regression
• At a linear regression node, some parents (covariates/inputs) and all children (responses/outputs) are continuous-valued variables.
• For each child and setting of parents we use the model:
  p(y|x, θ) = gauss(y|θ^T x, σ²)
• The likelihood is the familiar "squared error" cost:
  ℓ(θ; D) = −(1/2σ²) Σ_m (y^(m) − θ^T x^(m))²
• The ML parameters can be solved for using linear least-squares:
  ∂ℓ/∂θ = (1/σ²) Σ_m (y^(m) − θ^T x^(m)) x^(m)
  ⇒ θ*_ML = (X^T X)^{−1} X^T Y
• Sufficient statistics are the input correlation matrix and the input-output cross-correlation vector.
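A minimal sketch of the normal-equation solution, assuming NumPy; the design matrix, noise level, and true weights are made up for illustration. Note that X^T X and X^T Y are exactly the sufficient statistics mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix, one example per row
true_theta = np.array([0.5, -1.0, 2.0])        # illustrative "true" weights
Y = X @ true_theta + 0.1 * rng.normal(size=100)

# Solve (X^T X) theta = X^T Y instead of forming the inverse explicitly
theta_ml = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_ml)
```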
Sufficient Statistics are Sums
• In the examples above, the sufficient statistics were merely sums (counts) of the data:
  – Bernoulli: # of heads, tails
  – Multinomial: # of each type
  – Gaussian: mean, mean-square
  – Regression: correlations
• As we will see, this is true for all exponential family models: the ML estimate matches the average sufficient statistics to the natural moments of the model.
• Only exponential family models have simple sufficient statistics.
MLE for Directed GMs
• For a directed GM, the likelihood function has a nice form:
  log p(D|θ) = log Π_m Π_i p(x_i^(m) | x_{π_i}^(m), θ_i) = Σ_m Σ_i log p(x_i^(m) | x_{π_i}^(m), θ_i)
• The parameters decouple, so we can maximize the likelihood independently for each node's function by setting θ_i.
• Only need the values of x_i and its parents x_{π_i} in order to estimate θ_i.
• Furthermore, if we have sufficient statistics, we only need those.
• In general, for fully observed data, if we know how to estimate params at a single node we can do it for the whole network.
Example: A Directed Model
• Consider the distribution defined by the DAGM:
  p(x|θ) = p(x_1|θ_1) p(x_2|x_1, θ_2) p(x_3|x_1, θ_3) p(x_4|x_2, x_3, θ_4)
• This is exactly like learning four separate small DAGMs, each of which consists of a node and its parents.
MLE for Categorical Networks
• Assume our DAGM contains only discrete nodes, and we use the (general) categorical form for the conditional probabilities.
• Sufficient statistics involve counts of joint settings of x_i, x_{π_i}, summing over all other variables in the table.
• Likelihood for these special "fully observed categorical networks":
  ℓ(θ; D) = log Π_{m,i} p(x_i^(m) | x_{π_i}^(m), θ_i)
          = log Π_{i, x_i, x_{π_i}} p(x_i | x_{π_i}, θ_i)^{N(x_i, x_{π_i})} = log Π_{i, x_i, x_{π_i}} θ_{x_i|x_{π_i}}^{N(x_i, x_{π_i})}
          = Σ_i Σ_{x_i, x_{π_i}} N(x_i, x_{π_i}) log θ_{x_i|x_{π_i}}
  ⇒ θ*_{x_i|x_{π_i}} = N(x_i, x_{π_i}) / N(x_{π_i})
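A minimal sketch of this counting estimator for a single node with one parent (both binary), assuming NumPy; the observations are made up for illustration:

```python
import numpy as np

# Binary parent x1 and binary child x2; each row is one observation (x1, x2).
data = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [0, 0], [1, 0]])

K = 2
N_joint = np.zeros((K, K))          # N(x2, x1): counts of joint settings
for x1, x2 in data:
    N_joint[x2, x1] += 1

N_parent = N_joint.sum(axis=0)      # N(x1): counts of parent settings
theta = N_joint / N_parent          # theta_{x2 | x1} = N(x2, x1) / N(x1)
print(theta)                        # each column (fixed parent value) sums to 1
```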
MLE for General Exponential Family Models
• Recall the probability function for models in the exponential family:
  p(x|η) = h(x) exp{η^T T(x) − A(η)}
• For i.i.d. data, the sufficient statistic vector is Σ_m T(x^(m)):
  ℓ(η; D) = log p(D|η) = (Σ_m log h(x^(m))) − M A(η) + η^T Σ_m T(x^(m))
• Take derivatives and set to zero:
  ∂ℓ/∂η = Σ_m T(x^(m)) − M ∂A(η)/∂η
  ⇒ ∂A(η)/∂η evaluated at η_ML equals (1/M) Σ_m T(x^(m))
  recalling that the natural moments of an exponential family distribution are the derivatives of the log normalizer.
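A minimal sketch of this moment-matching condition for the Bernoulli written in natural form (T(x) = x, A(η) = log(1 + e^η), so ∂A/∂η is the logistic sigmoid), assuming NumPy; the data are made up for illustration:

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1])        # illustrative coin flips, T(x) = x
mean_T = x.mean()                           # (1/M) sum_m T(x^(m))
eta_ml = np.log(mean_T / (1.0 - mean_T))    # invert the sigmoid (logit) to get eta_ML

# Check: the natural moment dA/deta at eta_ML matches the empirical sufficient statistic.
print(1.0 / (1.0 + np.exp(-eta_ml)), mean_T)
```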
Classification, Revisited
• Given examples of a discrete class label y and some features x.
• Goal: compute the label (y) for new inputs x.
• Two approaches:
  Generative: model p(x, y) = p(y) p(x|y); use Bayes' rule to infer the conditional p(y|x).
  Discriminative: model discriminants f(y|x) directly and take the max.
• The generative approach is related to conditional density estimation, while the discriminative approach is closer to regression.
Probabilistic Classification: Bayes Classifier
• Generative model: p(x, y) = p(y) p(x|y)
  p(y) are called class priors (relative frequencies).
  p(x|y) are called class-conditional feature distributions.
• For the class frequency prior we use a Bernoulli or categorical:
  p(y = k|π) = π_k, with Σ_k π_k = 1
• Fitting by maximum likelihood:
  – Sort data into batches by class label
  – Estimate p(y) by counting the size of batches (plus regularization)
  – Estimate p(x|y) separately within each batch using ML (also with regularization)
• Two classification rules (if forced to choose):
  – ML: argmax_y p(x|y) (can behave badly if skewed frequencies)
  – MAP: argmax_y p(y|x) = argmax_y log p(x|y) + log p(y) (safer)
Three Key Regularization Ideas
To avoid overfitting, we can:
• Put priors on the parameters. Maximum likelihood + priors = maximum a posteriori (MAP). Simple and fast. Not Bayesian.
• Integrate over all possible parameters. Also requires priors, but protects against overfitting for totally different reasons.
• Make factorization or independence assumptions. Fewer inputs to each conditional probability. Ties parameters together so that fewer of them are estimated.
Gaussian Class-Conditional Distribution
• If all features are continuous, a popular choice is a Gaussian class-conditional:
  p(x|y = k, θ) = |2πΣ|^{−1/2} exp{−(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k)}
• Fitting: use the following amazing and useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!]
• Seems easy. And works amazingly well. But we can do even better with some simple regularization...
Regularized Gaussians
• Idea 1: assume all the covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis.
• Idea 2: make independence assumptions to get diagonal or identity-multiple covariances. (Or sparse inverse covariances.) More on this in a few minutes...
• Idea 3: add a bit of the identity matrix to each sample covariance. This "fattens it up" in all directions and prevents collapse. Related to using a Wishart prior on the covariance matrix.
Gaussian Bayes Classifier
• Maximum likelihood estimates for parameters:
  priors π_k: use observed frequencies of classes (plus smoothing)
  means μ_k: use class means
  covariance Σ: use data from a single class or pooled data (x^(m) − μ_{y^(m)}) to estimate full/diagonal covariances
• Compute the posterior via Bayes' rule:
  p(y = k|x, θ) = p(x|y = k, θ) p(y = k|π) / Σ_j p(x|y = j, θ) p(y = j|π)
               = exp{μ_k^T Σ^{−1} x − μ_k^T Σ^{−1} μ_k / 2 + log π_k} / Σ_j exp{μ_j^T Σ^{−1} x − μ_j^T Σ^{−1} μ_j / 2 + log π_j}
               = e^{β_k^T x} / Σ_j e^{β_j^T x} = exp{β_k^T x} / Z
  where β_k = [Σ^{−1} μ_k ; −μ_k^T Σ^{−1} μ_k / 2 + log π_k] and we have augmented x with a constant component always equal to 1 (bias term).
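A minimal sketch of this fit-and-predict recipe with a pooled (shared) covariance, assuming NumPy; the two-class synthetic data, function names, and absence of regularization are illustrative choices rather than the lecture's reference implementation:

```python
import numpy as np

def fit(X, y, K):
    M, D = X.shape
    pi = np.array([(y == k).mean() for k in range(K)])          # class priors from frequencies
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # class means
    centered = X - mu[y]                                        # x^(m) - mu_{y^(m)}
    Sigma = centered.T @ centered / M                           # pooled covariance
    return pi, mu, Sigma

def posterior(X, pi, mu, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # logit_k = mu_k^T Sigma^{-1} x - mu_k^T Sigma^{-1} mu_k / 2 + log pi_k  (i.e. beta_k^T [x; 1])
    logits = X @ Sinv @ mu.T - 0.5 * np.sum(mu @ Sinv * mu, axis=1) + np.log(pi)
    logits -= logits.max(axis=1, keepdims=True)                 # stabilize before exponentiating
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)
pi, mu, Sigma = fit(X, y, K=2)
print(posterior(X[:3], pi, mu, Sigma))
```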
Softmax/Logit
• The squashing function is known as the softmax or logit:
  σ_k(z) ≡ e^{z_k} / Σ_j e^{z_j}    g(η) = 1 / (1 + e^{−η})
• It is invertible (up to a constant):
  z_k = log σ_k + c    η = log(g / (1 − g))
• The derivative is easy:
  ∂σ_k/∂z_j = σ_k (δ_kj − σ_j)    ∂g/∂η = g(1 − g)
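A small numerical check of the softmax Jacobian formula against finite differences, assuming NumPy; the test point z is arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.2, -1.0, 0.7])
s = softmax(z)
analytic = np.diag(s) - np.outer(s, s)          # sigma_k (delta_kj - sigma_j)

eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # tiny, so the formula checks out
```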
Linear Geometry
• Taking the ratio of any two posteriors (the "odds") shows that the contours of equal pairwise probability are linear surfaces in the feature space:
  p(y = k|x, θ) / p(y = j|x, θ) = exp{(β_k − β_j)^T x}
• The pairwise discrimination contours p(y_k) = p(y_j) are orthogonal to the differences of the means in feature space when Σ = σI. For a general Σ shared between all classes, the same is true in the transformed feature space t = Σ^{−1} x.
• Class priors do not change the geometry, they only shift the operating point on the logit by the log-odds: log(π_k/π_j).
• Thus, for equal class-covariances, we obtain a linear classifier.
• If we use different covariances, the decision surfaces are conic sections and we have a quadratic classifier.
Exponential Family Class-Conditionals
• The Bayes Classifier has the same softmax form whenever the class-conditional densities are any exponential family density:
  p(x|y = k, η_k) = h(x) exp{η_k^T x − a(η_k)}
  p(y = k|x, η) = p(x|y = k, η_k) p(y = k|π) / Σ_j p(x|y = j, η_j) p(y = j|π)
               = exp{η_k^T x − a(η_k)} / Σ_j exp{η_j^T x − a(η_j)}
               = e^{β_k^T x} / Σ_j e^{β_j^T x}
  where β_k = [η_k ; −a(η_k)] and we have augmented x with a constant component always equal to 1 (bias term).
• The resulting classifier is linear in the sufficient statistics.
Discrete Bayesian Classifier
• If the inputs are discrete (categorical), what should we do?
• The simplest class-conditional model is a joint multinomial (table):
  p(x_1 = a, x_2 = b, ...|y = c) = η^c_{ab...}
• This is conceptually correct, but there's a big practical problem.
• Fitting: ML params are observed counts:
  η^c_{ab...} = Σ_n [y^(n) = c][x_1^(n) = a][x_2^(n) = b] ... / Σ_n [y^(n) = c]
• Consider the 16x16 digits at 256 gray levels.
• How many entries in the table? How many will be zero? What happens at test time?
• We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also. But what about independence?
Naïve Bayes Classifier
• Assumption: conditioned on class, attributes are independent:
  p(x|y) = Π_i p(x_i|y)
• Sounds crazy right? Right! But it works.
• Algorithm: sort data cases into bins according to y_n.
• Compute marginal probabilities p(y = c) using frequencies.
• For each class, estimate the distribution of the i-th variable: p(x_i|y = c).
• At test time, compute argmax_c p(c|x) using
  c(x) = argmax_c p(c|x) = argmax_c [log p(x|c) + log p(c)]
       = argmax_c [log p(c) + Σ_i log p(x_i|c)]
Discrete (Categorical) Naïve Bayes
• Discrete features x_i assumed independent given class label y:
  p(x_i = j|y = k) = η_ijk
  p(x|y = k, η) = Π_i Π_j η_ijk^{[x_i = j]}
• Classification rule:
  p(y = k|x, η) = π_k Π_i Π_j η_ijk^{[x_i = j]} / Σ_q π_q Π_i Π_j η_ijq^{[x_i = j]}
               = e^{β_k^T x} / Σ_q e^{β_q^T x}
  where β_k = [log η_11k ... log η_1jk ... log η_ijk ... log π_k]
  and x = [x_1 = 1; x_2 = 2; ... x_i = j; ...; 1] is encoded with binary indicators (plus the constant bias component).
Fitting Discrete Naïve Bayes
• ML parameters are class-conditional frequency counts:
  η*_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / Σ_m [y^(m) = k]
• How do we know? Write down the likelihood:
  ℓ(η; D) = Σ_m log p(y^(m)|π) + Σ_{m,i} log p(x_i^(m)|y^(m), η)
• and optimize it by setting its derivative to zero (careful! enforce normalization with Lagrange multipliers; CSC411 Tut 4):
  ℓ(η; D) = Σ_m Σ_{ijk} [x_i^(m) = j][y^(m) = k] log η_ijk + Σ_{ik} λ_ik (1 − Σ_j η_ijk)
  ∂ℓ/∂η_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / η_ijk − λ_ik
  ∂ℓ/∂η_ijk = 0 ⇒ λ_ik = Σ_m [y^(m) = k] ⇒ η*_ijk as above
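A minimal sketch of fitting and applying discrete naïve Bayes by counting, assuming NumPy; the add-one smoothing (alpha), the toy dataset, and the function names are illustrative choices beyond the slide:

```python
import numpy as np

def fit_naive_bayes(X, y, J, K, alpha=1.0):
    # X: (M, D) integer features in 0..J-1; y: (M,) labels in 0..K-1.
    M, D = X.shape
    pi = np.bincount(y, minlength=K) / M                 # class priors from frequencies
    eta = np.zeros((D, J, K))                            # eta[i, j, k] = p(x_i = j | y = k)
    for k in range(K):
        Xk = X[y == k]
        for i in range(D):
            counts = np.bincount(Xk[:, i], minlength=J) + alpha   # smoothed frequency counts
            eta[i, :, k] = counts / counts.sum()
    return pi, eta

def predict(X, pi, eta):
    labels = []
    for x in X:
        # argmax_k [ log p(k) + sum_i log p(x_i | k) ]
        log_post = np.log(pi) + sum(np.log(eta[i, x[i], :]) for i in range(len(x)))
        labels.append(int(log_post.argmax()))
    return np.array(labels)

# Tiny illustrative usage
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 1, 1])
pi, eta = fit_naive_bayes(X, y, J=2, K=2)
print(predict(X, pi, eta))
```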
Gaussian Naïve Bayes
• This is just a Gaussian Bayes Classifier with a separate diagonal covariance matrix for each class.
• Equivalent to fitting a one-dimensional Gaussian to each input for each possible class.
• Decision surfaces are quadratics, not linear...
Discriminative Models
• Parametrize p(y|x) directly; forget p(x, y) and Bayes' rule.
• As long as p(y|x) or the discriminants f(y|x) are linear functions of x (or monotone transforms), decision surfaces will be piecewise linear.
• Don't need to model the density of the features. Some density models have lots of parameters. Many densities give the same linear classifier. But we cannot generate new labeled data.
• Optimize the same cost function we use at test time.
Logistic/Softmax Regression
• Model: y is a multinomial random variable whose posterior is the softmax of linear functions of any feature vector:
  p(y = k|x, θ) = e^{θ_k^T x} / Σ_j e^{θ_j^T x}
• Fitting: now we optimize the conditional likelihood:
  ℓ(θ; D) = Σ_{m,k} [y^(m) = k] log p(y = k|x^(m), θ) = Σ_{m,k} y_k^(m) log p_k^(m)
  ∂ℓ/∂θ_i = Σ_{m,k} (∂ℓ_k^(m)/∂p_k^(m)) (∂p_k^(m)/∂z_i^(m)) (∂z_i^(m)/∂θ_i)
          = Σ_{m,k} (y_k^(m)/p_k^(m)) p_k^(m) (δ_ik − p_i^(m)) x^(m)
          = Σ_m (y_i^(m) − p_i^(m)) x^(m)
  where z_j = θ_j^T x.
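A minimal sketch of batch gradient ascent on this conditional log likelihood using the gradient Σ_m (y^(m) − p^(m)) x^(m) derived above, assuming NumPy; the synthetic data, learning rate, and iteration count are illustrative:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M, D, K = 200, 3, 3
X = np.hstack([rng.normal(size=(M, D)), np.ones((M, 1))])        # augment x with a bias column
W_true = rng.normal(size=(D, K))                                  # illustrative "true" weights
y = (X[:, :D] @ W_true + 0.5 * rng.normal(size=(M, K))).argmax(axis=1)
Y = np.eye(K)[y]                                                  # one-hot targets y_k^(m)

theta = np.zeros((D + 1, K))
lr = 0.5
for _ in range(500):
    P = softmax(X @ theta)              # p_k^(m) for every example
    grad = X.T @ (Y - P)                # sum_m (y^(m) - p^(m)) x^(m), arranged as (D+1, K)
    theta += lr * grad / M              # ascend the conditional log likelihood
print("train accuracy:", np.mean((X @ theta).argmax(axis=1) == y))
```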
More on Logistic Regression
• Hardest part: picking the feature vector x.
• The likelihood is convex in the parameters θ. No local minima!
• The gradient is easy to compute, so it is easy to optimize using gradient descent or Newton-Raphson.
• Weight decay: add εθ² to the cost function, which subtracts 2εθ from each gradient.
• Why is this method called logistic regression?
• It should really be called "softmax linear regression".
• The log odds (logit) between any two classes is linear in the parameters.
• A classification neural net always has logistic regression as its last layer: with no hidden layers it is exactly logistic regression.
Classification via Regression
• Binary case: p(y = 1|x) is also the conditional expectation.
• So we could forget that y was a discrete (categorical) random variable and just attempt to model p(y|x) using regression.
• One idea: do regression to an indicator matrix.
• For two classes, this is related to LDA. For 3 or more, disaster...
• Weird idea. Noise models (e.g., Gaussian) for regression are inappropriate, and fits are sensitive to outliers. Furthermore, it gives unreasonable predictions < 0 and > 1.
Other Models
• Noisy-OR.
• Classification via Regression.
• Non-parametric (e.g., K-nearest-neighbor).
• Semi-parametric (e.g., kernel classifiers, support vector machines, Gaussian processes).
• Probit regression.
• Complementary log-log.
• Generalized linear models.
• Some return a value for y without a distribution.
Joint vs. Conditional Models
• Both Naïve Bayes and logistic regression have the same conditional form and can have the same parameterization.
• But the criteria used to choose the parameters are different.
• Naive Bayes is a joint model; it optimizes p(x, y) = p(x) p(y|x).
• Logistic regression is conditional: it optimizes p(y|x) directly.
• Pros of discriminative: more flexible, directly optimizes what we care about. Why not choose optimal parameters?
• Pros of generative: easier to think about, check the model, and incorporate other data sources (semi-supervised learning).
Joint vs. Conditional Models: Yin and Yang
• Each generative model implicitly defines a conditional model.
• p(z|x) has a complicated form if p(x|z) is at all complicated. Expensive to compute naively, yet necessary for learning.
• Autoencoders: given an interesting generative model p(x|z), force an approximate q(z|x) to have a nice form.
• So, designing inference methods for generative models involves designing discriminative recognition networks.
• Thursday: Tutorial on optimization.