K-means Clustering

• INPUT: n records x1, x2, …, xn as the rows of matrix X
  – Each xi is m-dimensional: xi = (xi1, xi2, …, xim)
  – Matrix X is (n × m)-dimensional
• INPUT: k, an integer in {1, 2, …, n}
• OUTPUT: Partition the records into k clusters S1, S2, …, Sk
  – May use n labels y1, y2, …, yn in {1, 2, …, k}
  – NOTE: The same clusters can be labeled in k! ways; important if checking correctness (don't just compare "predicted" and "true" labels)
• METRIC: Minimize the within-cluster sum of squares (WCSS):

    WCSS = Σ i=1..n ‖ xi − mean(Sj : xi ∈ Sj) ‖²

• Cluster "means" are k vectors that capture as much variance in the data as possible
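As a concrete reading of the metric, here is a minimal DML sketch that computes WCSS for a given labeling; the names X and y, and the reliance on column-vector broadcasting in the division, are assumptions of the sketch:

    P = table (seq (1, nrow (X)), y);    # n x k indicator: P[i,j] = 1 iff record i is in cluster j
    M = (t(P) %*% X) / t(colSums (P));   # k x m cluster means (each row divided by the cluster size)
    wcss = sum ((X - P %*% M) ^ 2);      # within-cluster sum of squares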
K-means Clustering

• K-means is a little similar to linear regression:
  – Linear regression error = Σ i≤n (yi − xi · β)²
  – BUT: Clustering describes the xi's themselves, not the yi's given the xi's
• K-means can work in a "linearization space" (like kernel SVM)
• How to pick k?
  – Try k = 1, 2, …, up to some limit; check for overfitting (see the sketch below)
  – Pick the best k in the context of the whole task
• Caveats for k-means:
  – The clusters do NOT estimate a mixture of Gaussians
    • The EM algorithm does this
  – The k clusters tend to be of similar size
    • Do NOT use for imbalanced clusters!
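A sketch of the try-every-k loop; kmeans_wcss is a hypothetical helper (not defined on these slides) standing in for a full multi-run k-means invocation:

    max_k = 10;                                    # illustrative upper limit
    wcss_by_k = matrix (0, rows = max_k, cols = 1);
    for (k in 1:max_k) {
        wcss_by_k [k, 1] = kmeans_wcss (X, k);     # hypothetical helper: best WCSS over several runs
    }
    # WCSS always decreases with k; look for the "elbow" where it stops dropping
    # sharply, and judge candidate k's in the context of the whole task.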
The K-means Algorithm

• Pick k "centroids" c1, c2, …, ck from the records {x1, x2, …, xn}
  – Try to pick centroids far from each other
• Assign each record to the nearest centroid:
  – For each xi compute di = min{ dist(xi, cj) over all cj }
  – Cluster Sj ← { xi : dist(xi, cj) = di }
• Reset each centroid to its cluster's mean:
  – Centroid cj ← mean(Sj) = Σ i≤n (xi in Sj ?) · xi / |Sj|
• Repeat the "assign" and "reset" steps until convergence
• The loss decreases: WCSSold ≥ C-WCSSnew ≥ WCSSnew
  – Converges to a local optimum (often, not the global one)

    C-WCSS = Σ i=1..n ‖ xi − centroid(Sj : xi ∈ Sj) ‖²
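Why the chain WCSSold ≥ C-WCSSnew ≥ WCSSnew holds (a short argument, implicit in the slide): the "assign" step moves every record to its nearest centroid, so the total squared distance to assigned centroids cannot exceed what it was under the old assignment, giving C-WCSSnew ≤ WCSSold. The "reset" step then uses the fact that mean(S) = argmin over z of Σ x∈S ‖x − z‖², so replacing each centroid by its cluster's mean gives WCSSnew ≤ C-WCSSnew.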
The K-means Algorithm

• Runaway centroid: closest to no record at the "assign" step
  – Occasionally happens, e.g. with k = 3 centroids and 2 data clusters
  – Options: (a) terminate, (b) reduce k by 1
• Centroids vs. means at early termination:
  – After the "assign" step, cluster centroids ≠ their means
    • Centroids: (a) define the clusters, (b) already computed
    • Means: (a) define the WCSS metric, (b) not yet computed
  – We report centroids and the centroid-WCSS (C-WCSS)
• Multiple runs:
  – Required to guard against a bad local optimum
  – Use a "parfor" loop, with random initial centroids (sketched below)
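A minimal sketch of the multiple-runs pattern; init_centroids and run_kmeans are hypothetical helpers for the seeding and the assign/reset loop (the next slide shows what goes inside one run):

    All_C      = matrix (0, rows = num_runs * k, cols = ncol (X));
    final_wcss = matrix (0, rows = num_runs, cols = 1);
    parfor (run in 1:num_runs) {
        C = init_centroids (X, k);                       # hypothetical helper: random or k-means++ seeding
        [C, wcss] = run_kmeans (X, C);                   # hypothetical helper: iterate until convergence
        All_C [(k * (run - 1) + 1) : (k * run), ] = C;   # each run writes its own slice
        final_wcss [run, 1] = wcss;
    }
    best_run = as.scalar (rowIndexMin (t (final_wcss))); # report the run with the lowest WCSS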
K-means: DML Implementation

C = All_C [(k * (run - 1) + 1) : (k * run), ];       # ParFor I/O: read this run's centroids
iter = 0; term_code = 0; wcss = 0;
while (term_code == 0) {
    # "Tensor avoidance maneuver": all n-by-k squared distances (up to a
    # per-row constant) via one matrix product, with no (n x k x m) intermediate
    D = -2 * (X %*% t(C)) + t(rowSums (C ^ 2));
    minD = rowMins (D); wcss_old = wcss;
    wcss = sumXsq + sum (minD);                      # sumXsq = sum(X^2), precomputed outside the loop
    if (wcss_old - wcss < eps * wcss & iter > 0) {
        term_code = 1;  # Convergence is reached
    } else {
        if (iter >= max_iter) {
            term_code = 2;
        } else {
            iter = iter + 1;
            P = ppred (D, minD, "<=");               # Want a smooth assign? Edit here
            P = P / rowSums (P);
            if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                term_code = 3;  # "Runaway" centroid
            } else {
                C = t(P / colSums (P)) %*% X;
    }   }   }
}
All_C [(k * (run - 1) + 1) : (k * run), ] = C;       # ParFor I/O: write results back
final_wcss [run, 1] = wcss; t_code [run, 1] = term_code;
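Why D may omit the per-row term rowSums(X^2) (a short derivation, implicit in the code): ‖xi − cj‖² = ‖xi‖² − 2 xi·cj + ‖cj‖², so D = −2 (X %*% t(C)) + t(rowSums(C^2)) differs from the true squared-distance matrix only by the constant ‖xi‖² in each row i. A per-row constant does not change which entry in that row is smallest, so rowMins(D) still selects the nearest centroids, and the omitted constants are restored in one shot by sumXsq = sum(X^2) in wcss = sumXsq + sum(minD).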
K-means++ Initialization Heuristic

• Picks centroids from X at random, pushing them far apart
• Gets WCSS down to O(log k) × optimal, in expectation
• How to pick centroids:
  – Centroid c1: pick uniformly at random from the rows of X
  – Centroid c2: Prob[c2 ← xi] = (1/Σ) · dist(xi, c1)²
  – Centroid cj: Prob[cj ← xi] = (1/Σ) · min{ dist(xi, c1)², …, dist(xi, cj−1)² }
  – The probability to pick a row is proportional to its squared min-distance from the earlier centroids (see the sketch below)
• If X is huge, we use a sample of X, different across runs
  – Otherwise picking k centroids requires k passes over X

David Arthur, Sergei Vassilvitskii: "k-means++: The Advantages of Careful Seeding", SODA 2007
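A minimal DML sketch of one k-means++ sampling step, assuming the rows of X are the records and C holds the centroids picked so far; the availability of cumsum and rbind, and all variable names, are assumptions of the sketch:

    D    = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2));  # squared distances to current centroids
    minD = rowMins (D);                                # each row's squared distance to its nearest centroid
    p    = minD / sum (minD);                          # Prob[pick row i] proportional to minD[i]
    u    = as.scalar (rand (rows = 1, cols = 1));      # one uniform(0,1) draw
    i    = as.scalar (sum (ppred (cumsum (p), u, "<"))) + 1;  # first index with cumulative prob >= u
    C    = rbind (C, X [i, ]);                         # append the newly sampled centroid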
K-means Predict Script

• Predictor and Evaluator in one:
  – Given X (data) and C (centroids), assigns cluster labels prY
  – Compares 2 clusterings, "predicted" prY and "specified" spY
• Computes WCSS, as well as the Between-Cluster Sum of Squares (BCSS) and the Total Sum of Squares (TSS)
  – Dataset X must be available
  – If centroids C are given, also computes C-WCSS and C-BCSS
• Two ways to compare prY and spY (a pair-counting sketch follows):
  – Same-cluster and different-cluster PAIRS from prY and spY
  – For each prY-cluster find the best-matching spY-cluster, and vice versa
  – All reported as counts as well as in % of the full count
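A minimal sketch of the pair-based comparison, assuming prY and spY are n × 1 label vectors; the contingency-table route via DML's table() is my shorthand, not necessarily how the script itself computes it:

    CT = table (prY, spY);                     # CT[a,b] = #records with prY = a and spY = b
    n  = sum (CT);
    same_in_both = sum (CT * (CT - 1)) / 2;    # pairs placed in the same cluster by BOTH labelings
    same_in_pr   = sum (rowSums (CT) * (rowSums (CT) - 1)) / 2;  # same-cluster pairs under prY
    same_in_sp   = sum (colSums (CT) * (colSums (CT) - 1)) / 2;  # same-cluster pairs under spY
    all_pairs    = n * (n - 1) / 2;            # denominator for reporting percentages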
Weighted Non-Negative Matrix Factorization (WNMF)

• INPUT: X is a non-negative (n × m)-matrix
  – Example: Xij = 1 if person #i clicked ad #j, else Xij = 0
• INPUT (OPTIONAL): W is a penalty (n × m)-matrix
  – Example: Wij = 1 if person #i saw ad #j, else Wij = 0
• OUTPUT: (n × k)-matrix U and (m × k)-matrix V such that:

    min over U, V of  Σ i=1..n Σ j=1..m Wij (Xij − (U Vᵀ)ij)²   s.t. U ≥ 0, V ≥ 0

  – k topics: Uic = affinity(person #i, topic #c), Vjc = affinity(ad #j, topic #c)
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high
Weighted Non-Negative Matrix Factorization (WNMF)

• NOTE: Non-negativity is critical for this "bipartite clustering" interpretation of U and V
  – Matrix U of size n × k = cluster affinity for people
  – Matrix V of size m × k = cluster affinity for ads
• Negatives would violate the "disjunction of conjunctions" sense:
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high (sketched below)
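A minimal sketch of using the factors for prediction; the threshold thresh is an illustrative parameter, not from the slides:

    S = U %*% t(V);                      # S[i,j] = predicted (person #i, ad #j) affinity
    click_hat = ppred (S, thresh, ">");  # predict a "click" wherever the affinity is high enough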
WNMF: Multiplicative Update

§ Easy to parallelize using SystemML
§ Multiple runs help avoid bad local optima
§ Must specify k: run for k = 1, 2, 3, … (as in k-means)

    Uij ← Uij ∗ [ (W ∗ X) V ]ij / ( [ (W ∗ (U Vᵀ)) V ]ij + ε )
    Vij ← Vij ∗ [ (W ∗ X)ᵀ U ]ij / ( [ (W ∗ (U Vᵀ))ᵀ U ]ij + ε )

Daniel D. Lee, H. Sebastian Seung: "Algorithms for Non-negative Matrix Factorization", NIPS 2000
Inside A Run of (W)NMF

• Assume that W is a sparse matrix

Unweighted NMF:

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum ((X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
    V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
    f_new = sum ((X - U %*% t(V)) ^ 2);
    i = i + 1;
}

Weighted NMF (WNMF):

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum (W * (X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
    V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
    f_new = sum (W * (X - U %*% t(V)) ^ 2);
    i = i + 1;
}
Sum-Product Rewrites

• Matrix chain product optimization
  – Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
• Moving operators from big matrices to smaller ones
  – Example: t(X) %*% U = t(t(U) %*% X)
• Opening brackets in expressions (ongoing research)
  – Example: sum ((X - U %*% t(V))^2) = sum (X^2) - 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
  – K-means: D = rowSums(X^2) - 2 * (X %*% t(C)) + t(rowSums(C^2))
• Indexed sum rearrangements (checked numerically below):
  – sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
  – sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
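A quick numerical check of the first indexed-sum identity; the sizes are illustrative:

    U = rand (rows = 1000, cols = 10);
    V = rand (rows = 2000, cols = 10);
    lhs = sum ((U %*% t(V)) ^ 2);              # materializes a 1000 x 2000 intermediate
    rhs = sum ((t(U) %*% U) * (t(V) %*% V));   # only 10 x 10 intermediates
    print ("lhs - rhs = " + (lhs - rhs));      # should be ~0 up to round-off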
Operator Fusion: W.Sq.Loss

• Weighted Squared Loss: sum (W * (X - U %*% t(V))^2)
  – Common pattern for factorization algorithms
  – W and X are usually very sparse (density < 0.001)
  – Problem: the "outer" product U %*% t(V) creates three dense intermediates in the size of X
• ⇒ Fused w.sq.loss operator (see the note below):
  – Key observations: a sparse W ∗ allows selective computation, and the "sum" aggregate significantly reduces memory requirements

[Figure: operator tree for the fused weighted squared loss, sum(W ∗ (X − U %*% t(V))²)]
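For contrast, the pattern as plain DML; the fusion itself happens inside SystemML's optimizer, so this sketch only marks where the dense intermediates would arise:

    # Naively, this line creates three dense n x m intermediates:
    # U %*% t(V), the subtraction from X, and the elementwise square.
    loss = sum (W * (X - U %*% t(V)) ^ 2);
    # The fused w.sq.loss operator evaluates the expression only at the nonzero
    # positions of W and feeds the "sum" aggregate directly, so no n x m
    # intermediate is ever materialized.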