CS188: Artificial Intelligence
Kernels and Clustering
Instructors: Brijen Thananjeyan and Aditya Baradwaj, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
A Probabilistic Perceptron
A 1D Example

[Figure: 1D data with regions labeled "definitely blue", "not sure", "definitely red"]

Probability increases exponentially as we move away from the boundary; the "normalizer" (the denominator) keeps the outputs summing to one.
The SoftMax
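A minimal sketch of the softmax computation, with made-up class scores standing in for the activations w_y · f(x):

    import numpy as np

    def softmax_probs(scores):
        # Subtract the max score first: a standard numerical-stability trick
        # that cancels out in the normalizer.
        z = scores - np.max(scores)
        exps = np.exp(z)
        return exps / exps.sum()  # the denominator is the normalizer

    # Example: scores for three classes.
    print(softmax_probs(np.array([2.0, 1.0, -1.0])))  # ~[0.71, 0.26, 0.04]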
How to Learn?

§ Maximum likelihood estimation
§ Maximum conditional likelihood estimation
Local Search
o Simple, general idea:
  o Start wherever
  o Repeat: move to the best neighboring state
  o If no neighbors better than current, quit

o Neighbors = small perturbations of w
Our Status
o Our objective: $\max_w ll(w)$

o Challenge: how to find a good w?

o Equivalently: $\min_w -ll(w)$
1D Optimization
o Could evaluate g(w_0 + h) and g(w_0 - h)
o Then step in best direction

o Or, evaluate derivative:

  $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$

o Which tells which direction to step in

[Figure: g(w) plotted against w, with w_0 marked]
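A small sketch of both strategies, with a made-up 1D objective g (not from the slides):

    def g(w):
        # Made-up objective for illustration.
        return -(w - 3.0) ** 2

    def numeric_derivative(g, w0, h=1e-5):
        # Central difference: the limit definition above with a small finite h.
        return (g(w0 + h) - g(w0 - h)) / (2 * h)

    w0 = 1.0
    print(numeric_derivative(g, w0))  # ~4.0: positive, so step right to increase g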
2-D Optimization
Source: Thomas Jungblut’s Blog
Steepest Descent

o Idea:
  o Start somewhere
  o Repeat: Take a step in the steepest descent direction
Figure source: Mathworks
Steepest Direction
o Steepest Direction = direction of the gradient (points up the hill)
  $\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\ \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$
How to Learn?
Optimization Procedure: Gradient Descent
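A minimal gradient descent loop; the quadratic loss and step size here are illustrative stand-ins, not part of the slides:

    import numpy as np

    def gradient_descent(grad, w0, alpha=0.1, iters=100):
        # Repeatedly step opposite the gradient: the steepest descent direction.
        w = w0
        for _ in range(iters):
            w = w - alpha * grad(w)
        return w

    # Illustrative loss ||w - target||^2, whose gradient is 2 (w - target).
    target = np.array([1.0, -2.0])
    print(gradient_descent(lambda w: 2 * (w - target), np.zeros(2)))  # ~[1, -2]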
Stochastic Gradient Descent
probability of incorrect answer

Compare this to the multiclass perceptron: probabilistic weighting!
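To make "probabilistic weighting" concrete: the conditional-likelihood gradient lowers each class weight in proportion to the probability the model assigns that answer, and raises the true class by 1, instead of the perceptron's all-or-nothing update. A minimal sketch, with illustrative shapes and learning rate:

    import numpy as np

    def sgd_step(W, f_x, y_true, alpha=0.5):
        # W: (num_classes, num_features); f_x: feature vector; y_true: true label.
        probs = np.exp(W @ f_x)
        probs /= probs.sum()              # softmax over class scores
        grad = -np.outer(probs, f_x)      # push every class down by its probability...
        grad[y_true] += f_x               # ...and the true class up by 1
        return W + alpha * grad           # ascend the conditional log-likelihood

    W = sgd_step(np.zeros((3, 2)), np.array([1.0, 2.0]), y_true=0)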
Non-Separable Data
Case-Based Reasoning
§ Classification from similarity
  § Case-based reasoning
  § Predict an instance's label using similar instances

§ Nearest-neighbor classification
  § 1-NN: copy the label of the most similar data point
  § K-NN: vote the k nearest neighbors (need a weighting scheme); a sketch in code follows below
  § Key issue: how to define similarity
  § Trade-offs: small k gives relevant neighbors, large k gives smoother functions
http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
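A minimal k-NN classifier matching the bullets above; Euclidean distance and a plain majority vote are one concrete choice of similarity and weighting scheme:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        # Squared Euclidean distance from x to every training point.
        dists = np.sum((X_train - x) ** 2, axis=1)
        # Majority vote among the k nearest neighbors.
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([4.8, 5.2])))  # -> 1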
Parametric / Non-Parametric

§ Parametric models:
  § Fixed set of parameters
  § More data means better settings

§ Non-parametric models:
  § Complexity of the classifier increases with data
  § Better in the limit, often worse in the non-limit

§ (K)NN is non-parametric

[Figure: "Truth" vs. fits with 2, 10, 100, and 10000 examples]
Nearest-Neighbor Classification

§ Nearest neighbor for digits:
  § Take new image
  § Compare to all training images
  § Assign based on closest example

§ Encoding: image is a vector of intensities

§ What's the similarity function?
  § Dot product of two image vectors?
  § Usually normalize vectors so ||x|| = 1
  § min = 0 (when?), max = 1 (when?)
Similarity Functions
Basic Similarity

§ Many similarities based on feature dot products:

  $sim(x, x') = f(x) \cdot f(x')$

§ If features are just the pixels:

  $sim(x, x') = x \cdot x'$

§ Note: not all similarities are of this form
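A sketch of a normalized dot-product similarity on pixel vectors, with the normalization from the previous slide so identical images score 1:

    import numpy as np

    def similarity(x1, x2):
        # Normalize so ||x|| = 1, then dot: 1.0 for identical images (up to scale),
        # 0.0 for orthogonal ones.
        x1 = x1 / np.linalg.norm(x1)
        x2 = x2 / np.linalg.norm(x2)
        return float(x1 @ x2)

    a = np.array([0.9, 0.1, 0.0, 0.8])  # toy pixel-intensity vector
    print(similarity(a, a))                               # 1.0
    print(similarity(a, np.array([0.0, 1.0, 0.5, 0.0])))  # ~0.07: dissimilar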
Invariant Metrics

§ Better similarity functions use knowledge about vision
§ Example: invariant metrics:
  § Similarities are invariant under certain transformations
  § Rotation, scaling, translation, stroke-thickness, …
  § E.g.: 16x16 = 256 pixels; a point in 256-dim space
  § These points have small similarity in R^256 (why?)
§ How can we incorporate such invariances into our similarities?

This and next few slides adapted from Xiao Hu, UIUC
Rotation Invariant Metrics

§ Each example is now a curve in R^256
§ Rotation invariant similarity:

  $s'(x, x') = \max s(r(x), r(x'))$, maximizing over rotations r

§ E.g. highest similarity between the images' rotation lines
Tangent Families

§ Problems with s':
  § Hard to compute
  § Allows large transformations (e.g. 6 → 9)

§ Tangent distance:
  § 1st-order approximation at original points
  § Easy to compute
  § Models small rotations
A Tale of Two Approaches…

§ Nearest-neighbor-like approaches
  § Can use fancy similarity functions
  § Don't actually get to do explicit learning

§ Perceptron-like approaches
  § Explicit training to reduce empirical error
  § Can't use fancy similarity, only linear
  § Or can they? Let's find out!
Kernelization
Perceptron Weights

§ What is the final value of a weight w_y of a perceptron?
  § Can it be any real vector?
  § No! It's built by adding up inputs (feature vectors).

§ Can reconstruct weight vectors (the primal representation) from update counts (the dual representation):

  $w_y = \sum_i \alpha_{i,y} f(x_i)$
Dual Perceptron

§ How to classify a new example x?

  $score(y, x) = w_y \cdot f(x) = \sum_i \alpha_{i,y} \, f(x_i) \cdot f(x) = \sum_i \alpha_{i,y} K(x_i, x)$

§ If someone tells us the value of K for each pair of examples, never need to build the weight vectors (or the feature vectors)!
Dual Perceptron

§ Start with zero counts (alpha)
§ Pick up training instances one by one
§ Try to classify x_n:

  $y = \arg\max_y \sum_i \alpha_{i,y} K(x_i, x_n)$

§ If correct, no change!
§ If wrong: lower count of wrong class (for this instance), raise count of right class (for this instance):

  $w_y = w_y - f(x_n)$
  $w_{y^*} = w_{y^*} + f(x_n)$
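A minimal dual perceptron implementing the updates above; the kernel argument is any pairwise similarity, with the plain dot product used here as a stand-in:

    import numpy as np

    def train_dual_perceptron(X, y, num_classes, K, epochs=10):
        n = len(X)
        alpha = np.zeros((n, num_classes))  # update counts, zero initially
        gram = np.array([[K(xi, xj) for xj in X] for xi in X])  # K for each pair
        for _ in range(epochs):
            for i in range(n):
                scores = alpha.T @ gram[:, i]  # sum_j alpha[j, c] * K(x_j, x_i)
                guess = int(np.argmax(scores))
                if guess != y[i]:
                    alpha[i, guess] -= 1   # lower count of wrong class
                    alpha[i, y[i]] += 1    # raise count of right class
        return alpha

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    y = np.array([0, 1, 2])
    alpha = train_dual_perceptron(X, y, num_classes=3, K=lambda a, b: float(a @ b))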
Kernelized Perceptron

§ If we had a black box (kernel) K that told us the dot product of two examples x and x':
  § Could work entirely with the dual representation
  § No need to ever take dot products ("kernel trick")

§ Like nearest neighbor: work with black-box similarities
  § Downside: slow if many examples get nonzero alpha
Kernels: Who Cares?

§ So far: a very strange way of doing a very simple calculation

§ "Kernel trick": we can substitute any* similarity function in place of the dot product

§ Lets us learn new kinds of hypotheses

* Fine print: if your kernel doesn't satisfy certain technical requirements, lots of proofs break (e.g. convergence, mistake bounds). In practice, illegal kernels sometimes work (but not always).
Non-Linearity
Non-Linear Separators

§ Data that is linearly separable works out great for linear decision rules:

§ But what are we going to do if the dataset is just too hard?

§ How about… mapping data to a higher-dimensional space:

[Figure: 1D data on the x axis, not linearly separable; mapped to the (x, x^2) plane it becomes separable]

This and next few slides adapted from Ray Mooney, UT
Non-Linear Separators

§ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
Some Kernels

§ Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back (sketches in code below)

§ Linear kernel
§ Quadratic kernel
§ RBF: infinite-dimensional representation
§ Discrete kernels: e.g. string kernels
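Sketches of the kernels named above; the exact conventions (the +1 in the quadratic kernel, the RBF width sigma) are one common choice, assumed rather than taken from the slides:

    import numpy as np

    def linear_kernel(x, z):
        return float(x @ z)

    def quadratic_kernel(x, z):
        # Implicitly a dot product over all pairs of features (plus linear terms).
        return (float(x @ z) + 1.0) ** 2

    def rbf_kernel(x, z, sigma=1.0):
        # Corresponds to a dot product in an infinite-dimensional feature space.
        return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))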
Why Kernels?

§ Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  § Yes, in principle, just compute them
  § No need to modify any algorithms
  § But, number of features can get large (or infinite)

§ Kernels let us compute with these features implicitly
  § Example: implicit dot product in quadratic kernel takes much less space and time per dot product (see the check below)

§ Of course, there's the cost for using the pure dual algorithms: you need to compute the similarity to every training datum
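A small check of that claim for the plain quadratic kernel K(x, z) = (x · z)^2: its explicit feature map (all pairwise products x_i x_j, a standard construction assumed here, not from the slides) gives the same number with O(n^2) features that the kernel gets with O(n) work:

    import numpy as np

    def phi(x):
        # Explicit quadratic features: all pairwise products x_i * x_j.
        return np.outer(x, x).ravel()

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    explicit = float(phi(x) @ phi(z))  # dot product over n^2 features
    implicit = float(x @ z) ** 2       # the kernel trick: n multiplications
    print(explicit, implicit)          # both 20.25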
Clustering (Out of scope for final)
Clustering

§ Clustering systems:
  § Unsupervised learning
  § Detect patterns in unlabeled data
    § E.g. group emails or search results
    § E.g. find categories of customers
    § E.g. detect anomalous program executions
  § Useful when you don't know what you're looking for
  § Requires data, but no labels
  § Often get gibberish
Clustering

§ Basic idea: group together similar instances
§ Example: 2D point patterns

§ What could "similar" mean?
  § One option: small (squared) Euclidean distance
K-Means
§ An iterative clustering algorithm (a sketch in code follows below):
  § Pick K random points as cluster centers (means)
  § Alternate:
    § Assign data instances to closest mean
    § Assign each mean to the average of its assigned points
  § Stop when no points' assignments change
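A minimal K-means following the alternation above; the initialization is the simple "pick K random points" scheme from the slide:

    import numpy as np

    def kmeans(X, k, seed=0):
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=k, replace=False)]  # random points as centers
        assignments = None
        while True:
            # Assign each instance to its closest mean (squared Euclidean).
            dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                return means, new_assignments  # stop: no assignment changed
            assignments = new_assignments
            # Move each mean to the average of its assigned points.
            for j in range(k):
                if np.any(assignments == j):
                    means[j] = X[assignments == j].mean(axis=0)

    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
    means, labels = kmeans(X, k=2)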
K-Means Example
K-Means as Optimization

§ Consider the total distance to the means:

  $\phi(\{x_i\}, \{a_i\}, \{c_k\}) = \sum_i \text{dist}(x_i, c_{a_i})$

  (x_i: points, a_i: assignments, c_k: means)

§ Each iteration reduces phi

§ Two stages each iteration:
  § Update assignments: fix means c, change assignments a
  § Update means: fix assignments a, change means c
Phase I: Update Assignments

§ For each point, re-assign to closest mean:

  $a_i = \arg\min_k \text{dist}(x_i, c_k)$

§ Can only decrease total distance phi!
Phase II: Update Means

§ Move each mean to the average of its assigned points:

  $c_k = \frac{1}{|\{i : a_i = k\}|} \sum_{i : a_i = k} x_i$

§ Also can only decrease total distance… (Why?)

§ Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
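The fun fact is one line of calculus; a sketch of the derivation (not on the slide):

  $\frac{\partial}{\partial y} \sum_i \lVert x_i - y \rVert^2 = \sum_i -2\,(x_i - y) = 0 \quad\Longrightarrow\quad y = \frac{1}{n} \sum_i x_i$

This is why the Phase II update (move to the mean) can only decrease the squared-distance version of phi.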
Initialization

§ K-means is non-deterministic
  § Requires initial means
  § It does matter what you pick!
  § What can go wrong?

§ Various schemes for preventing this kind of thing: variance-based split/merge, initialization heuristics
K-Means Getting Stuck

§ A local optimum:

Why doesn't this work out like the earlier example, with the purple taking over half the blue?
K-Means Questions

§ Will K-means converge?
  § To a global optimum?

§ Will it always find the true patterns in the data?
  § If the patterns are very very clear?

§ Will it find something interesting?

§ Do people ever use it?

§ How many clusters to pick?
Agglomerative Clustering
§ Agglomerative clustering:
  § First merge very similar instances
  § Incrementally build larger clusters out of smaller clusters

§ Algorithm (a sketch in code follows below):
  § Maintain a set of clusters
  § Initially, each instance in its own cluster
  § Repeat:
    § Pick the two closest clusters
    § Merge them into a new cluster
  § Stop when there's only one cluster left

§ Produces not one clustering, but a family of clusterings represented by a dendrogram
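A minimal agglomerative clusterer for the algorithm above, using single-link (closest pair) as the cluster distance; the recorded merge order is exactly the dendrogram:

    import numpy as np

    def single_link(c1, c2, X):
        # Single-link distance: closest pair of points across the two clusters.
        return min(np.sum((X[i] - X[j]) ** 2) for i in c1 for j in c2)

    def agglomerate(X):
        clusters = [[i] for i in range(len(X))]  # each instance in its own cluster
        merges = []                              # merge history = the dendrogram
        while len(clusters) > 1:
            # Pick the two closest clusters...
            pairs = [(a, b) for a in range(len(clusters))
                            for b in range(a + 1, len(clusters))]
            a, b = min(pairs,
                       key=lambda p: single_link(clusters[p[0]], clusters[p[1]], X))
            merged = clusters[a] + clusters[b]
            merges.append((clusters[a], clusters[b]))
            # ...merge them into a new cluster, and repeat.
            clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
            clusters.append(merged)
        return merges

    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
    print(agglomerate(X))  # nearby pairs merge first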
Agglomerative Clustering

§ How should we define "closest" for clusters with multiple elements?

§ Many options:
  § Closest pair (single-link clustering)
  § Farthest pair (complete-link clustering)
  § Average of all pairs
  § Ward's method (min variance, like k-means)

§ Different choices create different clustering behaviors