CS188: Artificial Intelligence
Kernels and Clustering
Instructors: Brijen Thananjeyan and Aditya Baradwaj, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
A Probabilistic Perceptron
A 1D Example

[Figure: 1D data with regions labeled "definitely blue", "not sure", "definitely red"]

Probability increases exponentially as we move away from the boundary; the "normalizer" (the denominator) keeps the outputs summing to one.
The SoftMax
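A minimal sketch of the softmax computation, with made-up class scores standing in for the activations w_y · f(x):

    import numpy as np

    def softmax_probs(scores):
        # Subtract the max score first: a standard numerical-stability trick
        # that cancels out in the normalizer.
        z = scores - np.max(scores)
        exps = np.exp(z)
        return exps / exps.sum()  # the denominator is the normalizer

    # Example: scores for three classes.
    print(softmax_probs(np.array([2.0, 1.0, -1.0])))  # ~[0.71, 0.26, 0.04]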
How to Learn?

§ Maximum likelihood estimation
§ Maximum conditional likelihood estimation
Local Search
o Simple, general idea:
  o Start wherever
  o Repeat: move to the best neighboring state
  o If no neighbors better than current, quit

o Neighbors = small perturbations of w
Our Status
o Our objective: $\max_w ll(w)$

o Challenge: how to find a good w?

o Equivalently: $\min_w -ll(w)$
1D Optimization
o Could evaluate g(w_0 + h) and g(w_0 - h)
o Then step in best direction

o Or, evaluate derivative:

  $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$

o Which tells which direction to step in

[Figure: g(w) plotted against w, with w_0 marked]
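A small sketch of both strategies, with a made-up 1D objective g (not from the slides):

    def g(w):
        # Made-up objective for illustration.
        return -(w - 3.0) ** 2

    def numeric_derivative(g, w0, h=1e-5):
        # Central difference: the limit definition above with a small finite h.
        return (g(w0 + h) - g(w0 - h)) / (2 * h)

    w0 = 1.0
    print(numeric_derivative(g, w0))  # ~4.0: positive, so step right to increase g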
2-D Optimization
Source: Thomas Jungblut’s Blog
Steepest Descent

o Idea:
  o Start somewhere
  o Repeat: Take a step in the steepest descent direction
Figure source: Mathworks
Steepest Direction
o Steepest Direction = direction of the gradient (points up the hill)
  $\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\ \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$
How to Learn?
Optimization Procedure: Gradient Descent
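A minimal gradient descent loop; the quadratic loss and step size here are illustrative stand-ins, not part of the slides:

    import numpy as np

    def gradient_descent(grad, w0, alpha=0.1, iters=100):
        # Repeatedly step opposite the gradient: the steepest descent direction.
        w = w0
        for _ in range(iters):
            w = w - alpha * grad(w)
        return w

    # Illustrative loss ||w - target||^2, whose gradient is 2 (w - target).
    target = np.array([1.0, -2.0])
    print(gradient_descent(lambda w: 2 * (w - target), np.zeros(2)))  # ~[1, -2]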
Stochastic Gradient Descent
probability of incorrect answer

Compare this to the multiclass perceptron: probabilistic weighting!
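To make "probabilistic weighting" concrete: the conditional-likelihood gradient lowers each class weight in proportion to the probability the model assigns that answer, and raises the true class by 1, instead of the perceptron's all-or-nothing update. A minimal sketch, with illustrative shapes and learning rate:

    import numpy as np

    def sgd_step(W, f_x, y_true, alpha=0.5):
        # W: (num_classes, num_features); f_x: feature vector; y_true: true label.
        probs = np.exp(W @ f_x)
        probs /= probs.sum()              # softmax over class scores
        grad = -np.outer(probs, f_x)      # push every class down by its probability...
        grad[y_true] += f_x               # ...and the true class up by 1
        return W + alpha * grad           # ascend the conditional log-likelihood

    W = sgd_step(np.zeros((3, 2)), np.array([1.0, 2.0]), y_true=0)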
Non-Separable Data
Case-Based Reasoning
§ Classification from similarity
  § Case-based reasoning
  § Predict an instance's label using similar instances

§ Nearest-neighbor classification
  § 1-NN: copy the label of the most similar data point
  § K-NN: vote the k nearest neighbors (need a weighting scheme); a sketch in code follows below
  § Key issue: how to define similarity
  § Trade-offs: small k gives relevant neighbors, large k gives smoother functions
http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
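A minimal k-NN classifier matching the bullets above; Euclidean distance and a plain majority vote are one concrete choice of similarity and weighting scheme:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        # Squared Euclidean distance from x to every training point.
        dists = np.sum((X_train - x) ** 2, axis=1)
        # Majority vote among the k nearest neighbors.
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([4.8, 5.2])))  # -> 1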
Parametric / Non-Parametric

§ Parametric models:
  § Fixed set of parameters
  § More data means better settings

§ Non-parametric models:
  § Complexity of the classifier increases with data
  § Better in the limit, often worse in the non-limit

§ (K)NN is non-parametric

[Figure: "Truth" vs. fits with 2, 10, 100, and 10000 examples]
Nearest-Neighbor Classification

§ Nearest neighbor for digits:
  § Take new image
  § Compare to all training images
  § Assign based on closest example

§ Encoding: image is a vector of intensities

§ What's the similarity function?
  § Dot product of two image vectors?
  § Usually normalize vectors so ||x|| = 1
  § min = 0 (when?), max = 1 (when?)
Similarity Functions
Basic Similarity

§ Many similarities based on feature dot products:

  $sim(x, x') = f(x) \cdot f(x')$

§ If features are just the pixels:

  $sim(x, x') = x \cdot x'$

§ Note: not all similarities are of this form
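A sketch of a normalized dot-product similarity on pixel vectors, with the normalization from the previous slide so identical images score 1:

    import numpy as np

    def similarity(x1, x2):
        # Normalize so ||x|| = 1, then dot: 1.0 for identical images (up to scale),
        # 0.0 for orthogonal ones.
        x1 = x1 / np.linalg.norm(x1)
        x2 = x2 / np.linalg.norm(x2)
        return float(x1 @ x2)

    a = np.array([0.9, 0.1, 0.0, 0.8])  # toy pixel-intensity vector
    print(similarity(a, a))                               # 1.0
    print(similarity(a, np.array([0.0, 1.0, 0.5, 0.0])))  # ~0.07: dissimilar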
Invariant Metrics

§ Better similarity functions use knowledge about vision
§ Example: invariant metrics:
  § Similarities are invariant under certain transformations
  § Rotation, scaling, translation, stroke-thickness, …
  § E.g.: 16x16 = 256 pixels; a point in 256-dim space
  § These points have small similarity in R^256 (why?)
§ How can we incorporate such invariances into our similarities?

This and next few slides adapted from Xiao Hu, UIUC
Rotation Invariant Metrics

§ Each example is now a curve in R^256
§ Rotation invariant similarity:

  $s'(x, x') = \max s(r(x), r(x'))$, maximizing over rotations r

§ E.g. highest similarity between the images' rotation lines
Tangent Families

§ Problems with s':
  § Hard to compute
  § Allows large transformations (e.g. 6 → 9)

§ Tangent distance:
  § 1st-order approximation at original points
  § Easy to compute
  § Models small rotations
A Tale of Two Approaches…

§ Nearest-neighbor-like approaches
  § Can use fancy similarity functions
  § Don't actually get to do explicit learning

§ Perceptron-like approaches
  § Explicit training to reduce empirical error
  § Can't use fancy similarity, only linear
  § Or can they? Let's find out!
Kernelization
Perceptron Weights

§ What is the final value of a weight w_y of a perceptron?
  § Can it be any real vector?
  § No! It's built by adding up inputs (feature vectors).

§ Can reconstruct weight vectors (the primal representation) from update counts (the dual representation):

  $w_y = \sum_i \alpha_{i,y} f(x_i)$
Dual Perceptron

§ How to classify a new example x?

  $score(y, x) = w_y \cdot f(x) = \sum_i \alpha_{i,y} \, f(x_i) \cdot f(x) = \sum_i \alpha_{i,y} K(x_i, x)$

§ If someone tells us the value of K for each pair of examples, never need to build the weight vectors (or the feature vectors)!
Dual Perceptron

§ Start with zero counts (alpha)
§ Pick up training instances one by one
§ Try to classify x_n:

  $y = \arg\max_y \sum_i \alpha_{i,y} K(x_i, x_n)$

§ If correct, no change!
§ If wrong: lower count of wrong class (for this instance), raise count of right class (for this instance):

  $w_y = w_y - f(x_n)$
  $w_{y^*} = w_{y^*} + f(x_n)$
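A minimal dual perceptron implementing the updates above; the kernel argument is any pairwise similarity, with the plain dot product used here as a stand-in:

    import numpy as np

    def train_dual_perceptron(X, y, num_classes, K, epochs=10):
        n = len(X)
        alpha = np.zeros((n, num_classes))  # update counts, zero initially
        gram = np.array([[K(xi, xj) for xj in X] for xi in X])  # K for each pair
        for _ in range(epochs):
            for i in range(n):
                scores = alpha.T @ gram[:, i]  # sum_j alpha[j, c] * K(x_j, x_i)
                guess = int(np.argmax(scores))
                if guess != y[i]:
                    alpha[i, guess] -= 1   # lower count of wrong class
                    alpha[i, y[i]] += 1    # raise count of right class
        return alpha

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    y = np.array([0, 1, 2])
    alpha = train_dual_perceptron(X, y, num_classes=3, K=lambda a, b: float(a @ b))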
Kernelized Perceptron

§ If we had a black box (kernel) K that told us the dot product of two examples x and x':
  § Could work entirely with the dual representation
  § No need to ever take dot products ("kernel trick")

§ Like nearest neighbor: work with black-box similarities
  § Downside: slow if many examples get nonzero alpha
Kernels: Who Cares?

§ So far: a very strange way of doing a very simple calculation

§ "Kernel trick": we can substitute any* similarity function in place of the dot product

§ Lets us learn new kinds of hypotheses

* Fine print: if your kernel doesn't satisfy certain technical requirements, lots of proofs break (e.g. convergence, mistake bounds). In practice, illegal kernels sometimes work (but not always).
Non-Linearity
Non-Linear Separators

§ Data that is linearly separable works out great for linear decision rules:

§ But what are we going to do if the dataset is just too hard?

§ How about… mapping data to a higher-dimensional space:

[Figure: 1D data on the x axis, not linearly separable; mapped to the (x, x^2) plane it becomes separable]

This and next few slides adapted from Ray Mooney, UT
Non-Linear Separators

§ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
Some Kernels

§ Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back (sketches in code below)

§ Linear kernel
§ Quadratic kernel
§ RBF: infinite-dimensional representation
§ Discrete kernels: e.g. string kernels
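Sketches of the kernels named above; the exact conventions (the +1 in the quadratic kernel, the RBF width sigma) are one common choice, assumed rather than taken from the slides:

    import numpy as np

    def linear_kernel(x, z):
        return float(x @ z)

    def quadratic_kernel(x, z):
        # Implicitly a dot product over all pairs of features (plus linear terms).
        return (float(x @ z) + 1.0) ** 2

    def rbf_kernel(x, z, sigma=1.0):
        # Corresponds to a dot product in an infinite-dimensional feature space.
        return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))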
Why Kernels?

§ Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  § Yes, in principle, just compute them
  § No need to modify any algorithms
  § But, number of features can get large (or infinite)

§ Kernels let us compute with these features implicitly
  § Example: implicit dot product in quadratic kernel takes much less space and time per dot product (see the check below)

§ Of course, there's the cost for using the pure dual algorithms: you need to compute the similarity to every training datum
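A small check of that claim for the plain quadratic kernel K(x, z) = (x · z)^2: its explicit feature map (all pairwise products x_i x_j, a standard construction assumed here, not from the slides) gives the same number with O(n^2) features that the kernel gets with O(n) work:

    import numpy as np

    def phi(x):
        # Explicit quadratic features: all pairwise products x_i * x_j.
        return np.outer(x, x).ravel()

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    explicit = float(phi(x) @ phi(z))  # dot product over n^2 features
    implicit = float(x @ z) ** 2       # the kernel trick: n multiplications
    print(explicit, implicit)          # both 20.25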
Clustering (Out of scope for final)
Clustering

§ Clustering systems:
  § Unsupervised learning
  § Detect patterns in unlabeled data
    § E.g. group emails or search results
    § E.g. find categories of customers
    § E.g. detect anomalous program executions
  § Useful when you don't know what you're looking for
  § Requires data, but no labels
  § Often get gibberish
Clustering

§ Basic idea: group together similar instances
§ Example: 2D point patterns

§ What could "similar" mean?
  § One option: small (squared) Euclidean distance
K-Means
§ An iterative clustering algorithm (a sketch in code follows below):
  § Pick K random points as cluster centers (means)
  § Alternate:
    § Assign data instances to closest mean
    § Assign each mean to the average of its assigned points
  § Stop when no points' assignments change
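A minimal K-means following the alternation above; the initialization is the simple "pick K random points" scheme from the slide:

    import numpy as np

    def kmeans(X, k, seed=0):
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=k, replace=False)]  # random points as centers
        assignments = None
        while True:
            # Assign each instance to its closest mean (squared Euclidean).
            dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                return means, new_assignments  # stop: no assignment changed
            assignments = new_assignments
            # Move each mean to the average of its assigned points.
            for j in range(k):
                if np.any(assignments == j):
                    means[j] = X[assignments == j].mean(axis=0)

    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
    means, labels = kmeans(X, k=2)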
K-Means Example
K-Means as Optimization

§ Consider the total distance to the means:

  $\phi(\{x_i\}, \{a_i\}, \{c_k\}) = \sum_i \text{dist}(x_i, c_{a_i})$

  (x_i: points, a_i: assignments, c_k: means)

§ Each iteration reduces phi

§ Two stages each iteration:
  § Update assignments: fix means c, change assignments a
  § Update means: fix assignments a, change means c
Phase I: Update Assignments

§ For each point, re-assign to closest mean:

  $a_i = \arg\min_k \text{dist}(x_i, c_k)$

§ Can only decrease total distance phi!
Phase II: Update Means

§ Move each mean to the average of its assigned points:

  $c_k = \frac{1}{|\{i : a_i = k\}|} \sum_{i : a_i = k} x_i$

§ Also can only decrease total distance… (Why?)

§ Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
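The fun fact is one line of calculus; a sketch of the derivation (not on the slide):

  $\frac{\partial}{\partial y} \sum_i \lVert x_i - y \rVert^2 = \sum_i -2\,(x_i - y) = 0 \quad\Longrightarrow\quad y = \frac{1}{n} \sum_i x_i$

This is why the Phase II update (move to the mean) can only decrease the squared-distance version of phi.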
Initialization

§ K-means is non-deterministic
  § Requires initial means
  § It does matter what you pick!
  § What can go wrong?

§ Various schemes for preventing this kind of thing: variance-based split/merge, initialization heuristics
K-Means Getting Stuck

§ A local optimum:

Why doesn't this work out like the earlier example, with the purple taking over half the blue?
K-Means Questions

§ Will K-means converge?
  § To a global optimum?

§ Will it always find the true patterns in the data?
  § If the patterns are very very clear?

§ Will it find something interesting?

§ Do people ever use it?

§ How many clusters to pick?
Agglomerative Clustering
§ Agglomerative clustering:
  § First merge very similar instances
  § Incrementally build larger clusters out of smaller clusters

§ Algorithm (a sketch in code follows below):
  § Maintain a set of clusters
  § Initially, each instance in its own cluster
  § Repeat:
    § Pick the two closest clusters
    § Merge them into a new cluster
  § Stop when there's only one cluster left

§ Produces not one clustering, but a family of clusterings represented by a dendrogram
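A minimal agglomerative clusterer for the algorithm above, using single-link (closest pair) as the cluster distance; the recorded merge order is exactly the dendrogram:

    import numpy as np

    def single_link(c1, c2, X):
        # Single-link distance: closest pair of points across the two clusters.
        return min(np.sum((X[i] - X[j]) ** 2) for i in c1 for j in c2)

    def agglomerate(X):
        clusters = [[i] for i in range(len(X))]  # each instance in its own cluster
        merges = []                              # merge history = the dendrogram
        while len(clusters) > 1:
            # Pick the two closest clusters...
            pairs = [(a, b) for a in range(len(clusters))
                            for b in range(a + 1, len(clusters))]
            a, b = min(pairs,
                       key=lambda p: single_link(clusters[p[0]], clusters[p[1]], X))
            merged = clusters[a] + clusters[b]
            merges.append((clusters[a], clusters[b]))
            # ...merge them into a new cluster, and repeat.
            clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
            clusters.append(merged)
        return merges

    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
    print(agglomerate(X))  # nearby pairs merge first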
Agglomerative Clustering

§ How should we define "closest" for clusters with multiple elements?

§ Many options:
  § Closest pair (single-link clustering)
  § Farthest pair (complete-link clustering)
  § Average of all pairs
  § Ward's method (min variance, like k-means)

§ Different choices create different clustering behaviors