K-means Clustering

• INPUT: n records x1, x2, …, xn as the rows of matrix X
  – Each xi is m-dimensional: xi = (xi1, xi2, …, xim)
  – Matrix X is (n × m)-dimensional
• INPUT: k, an integer in {1, 2, …, n}
• OUTPUT: Partition the records into k clusters S1, S2, …, Sk
  – May use n labels y1, y2, …, yn in {1, 2, …, k}
  – NOTE: The same clusters can be labeled in k! ways; important if checking correctness (don't just compare "predicted" and "true" labels)
• METRIC: Minimize the within-cluster sum of squares (WCSS):

    WCSS = Σ i=1..n ‖ xi − mean(Sj : xi ∈ Sj) ‖²

• Cluster "means" are k vectors that capture as much variance in the data as possible
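As a concrete reading of the metric, here is a minimal DML sketch that computes WCSS for a given labeling; the names X and y, and the reliance on column-vector broadcasting in the division, are assumptions of the sketch:

    P = table (seq (1, nrow (X)), y);    # n x k indicator: P[i,j] = 1 iff record i is in cluster j
    M = (t(P) %*% X) / t(colSums (P));   # k x m cluster means (each row divided by the cluster size)
    wcss = sum ((X - P %*% M) ^ 2);      # within-cluster sum of squares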
K-means Clustering

• K-means is a little similar to linear regression:
  – Linear regression error = Σ i≤n (yi − xi · β)²
  – BUT: Clustering describes the xi's themselves, not the yi's given the xi's
• K-means can work in a "linearization space" (like kernel SVM)
• How to pick k?
  – Try k = 1, 2, …, up to some limit; check for overfitting (see the sketch below)
  – Pick the best k in the context of the whole task
• Caveats for k-means:
  – The clusters do NOT estimate a mixture of Gaussians
    • The EM algorithm does this
  – The k clusters tend to be of similar size
    • Do NOT use for imbalanced clusters!
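A sketch of the try-every-k loop; kmeans_wcss is a hypothetical helper (not defined on these slides) standing in for a full multi-run k-means invocation:

    max_k = 10;                                    # illustrative upper limit
    wcss_by_k = matrix (0, rows = max_k, cols = 1);
    for (k in 1:max_k) {
        wcss_by_k [k, 1] = kmeans_wcss (X, k);     # hypothetical helper: best WCSS over several runs
    }
    # WCSS always decreases with k; look for the "elbow" where it stops dropping
    # sharply, and judge candidate k's in the context of the whole task.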
The K-means Algorithm

• Pick k "centroids" c1, c2, …, ck from the records {x1, x2, …, xn}
  – Try to pick centroids far from each other
• Assign each record to the nearest centroid:
  – For each xi compute di = min{ dist(xi, cj) over all cj }
  – Cluster Sj ← { xi : dist(xi, cj) = di }
• Reset each centroid to its cluster's mean:
  – Centroid cj ← mean(Sj) = Σ i≤n (xi in Sj ?) · xi / |Sj|
• Repeat the "assign" and "reset" steps until convergence
• The loss decreases: WCSSold ≥ C-WCSSnew ≥ WCSSnew
  – Converges to a local optimum (often, not the global one)

    C-WCSS = Σ i=1..n ‖ xi − centroid(Sj : xi ∈ Sj) ‖²
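Why the chain WCSSold ≥ C-WCSSnew ≥ WCSSnew holds (a short argument, implicit in the slide): the "assign" step moves every record to its nearest centroid, so the total squared distance to assigned centroids cannot exceed what it was under the old assignment, giving C-WCSSnew ≤ WCSSold. The "reset" step then uses the fact that mean(S) = argmin over z of Σ x∈S ‖x − z‖², so replacing each centroid by its cluster's mean gives WCSSnew ≤ C-WCSSnew.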
The K-means Algorithm

• Runaway centroid: closest to no record at the "assign" step
  – Occasionally happens, e.g. with k = 3 centroids and 2 data clusters
  – Options: (a) terminate, (b) reduce k by 1
• Centroids vs. means at early termination:
  – After the "assign" step, cluster centroids ≠ their means
    • Centroids: (a) define the clusters, (b) already computed
    • Means: (a) define the WCSS metric, (b) not yet computed
  – We report centroids and the centroid-WCSS (C-WCSS)
• Multiple runs:
  – Required to guard against a bad local optimum
  – Use a "parfor" loop, with random initial centroids (sketched below)
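A minimal sketch of the multiple-runs pattern; init_centroids and run_kmeans are hypothetical helpers for the seeding and the assign/reset loop (the next slide shows what goes inside one run):

    All_C      = matrix (0, rows = num_runs * k, cols = ncol (X));
    final_wcss = matrix (0, rows = num_runs, cols = 1);
    parfor (run in 1:num_runs) {
        C = init_centroids (X, k);                       # hypothetical helper: random or k-means++ seeding
        [C, wcss] = run_kmeans (X, C);                   # hypothetical helper: iterate until convergence
        All_C [(k * (run - 1) + 1) : (k * run), ] = C;   # each run writes its own slice
        final_wcss [run, 1] = wcss;
    }
    best_run = as.scalar (rowIndexMin (t (final_wcss))); # report the run with the lowest WCSS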
K-means: DML Implementation

C = All_C [(k * (run - 1) + 1) : (k * run), ];       # ParFor I/O: read this run's centroids
iter = 0; term_code = 0; wcss = 0;
while (term_code == 0) {
    # "Tensor avoidance maneuver": all n-by-k squared distances (up to a
    # per-row constant) via one matrix product, with no (n x k x m) intermediate
    D = -2 * (X %*% t(C)) + t(rowSums (C ^ 2));
    minD = rowMins (D); wcss_old = wcss;
    wcss = sumXsq + sum (minD);                      # sumXsq = sum(X^2), precomputed outside the loop
    if (wcss_old - wcss < eps * wcss & iter > 0) {
        term_code = 1;  # Convergence is reached
    } else {
        if (iter >= max_iter) {
            term_code = 2;
        } else {
            iter = iter + 1;
            P = ppred (D, minD, "<=");               # Want a smooth assign? Edit here
            P = P / rowSums (P);
            if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                term_code = 3;  # "Runaway" centroid
            } else {
                C = t(P / colSums (P)) %*% X;
    }   }   }
}
All_C [(k * (run - 1) + 1) : (k * run), ] = C;       # ParFor I/O: write results back
final_wcss [run, 1] = wcss; t_code [run, 1] = term_code;
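Why D may omit the per-row term rowSums(X^2) (a short derivation, implicit in the code): ‖xi − cj‖² = ‖xi‖² − 2 xi·cj + ‖cj‖², so D = −2 (X %*% t(C)) + t(rowSums(C^2)) differs from the true squared-distance matrix only by the constant ‖xi‖² in each row i. A per-row constant does not change which entry in that row is smallest, so rowMins(D) still selects the nearest centroids, and the omitted constants are restored in one shot by sumXsq = sum(X^2) in wcss = sumXsq + sum(minD).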
K-means++ Initialization Heuristic

• Picks centroids from X at random, pushing them far apart
• Gets WCSS down to O(log k) × optimal, in expectation
• How to pick centroids:
  – Centroid c1: pick uniformly at random from the rows of X
  – Centroid c2: Prob[c2 ← xi] = (1/Σ) · dist(xi, c1)²
  – Centroid cj: Prob[cj ← xi] = (1/Σ) · min{ dist(xi, c1)², …, dist(xi, cj−1)² }
  – The probability to pick a row is proportional to its squared min-distance from the earlier centroids (see the sketch below)
• If X is huge, we use a sample of X, different across runs
  – Otherwise picking k centroids requires k passes over X

David Arthur, Sergei Vassilvitskii: "k-means++: The Advantages of Careful Seeding", SODA 2007
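A minimal DML sketch of one k-means++ sampling step, assuming the rows of X are the records and C holds the centroids picked so far; the availability of cumsum and rbind, and all variable names, are assumptions of the sketch:

    D    = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2));  # squared distances to current centroids
    minD = rowMins (D);                                # each row's squared distance to its nearest centroid
    p    = minD / sum (minD);                          # Prob[pick row i] proportional to minD[i]
    u    = as.scalar (rand (rows = 1, cols = 1));      # one uniform(0,1) draw
    i    = as.scalar (sum (ppred (cumsum (p), u, "<"))) + 1;  # first index with cumulative prob >= u
    C    = rbind (C, X [i, ]);                         # append the newly sampled centroid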
K-means Predict Script

• Predictor and Evaluator in one:
  – Given X (data) and C (centroids), assigns cluster labels prY
  – Compares 2 clusterings, "predicted" prY and "specified" spY
• Computes WCSS, as well as the Between-Cluster Sum of Squares (BCSS) and the Total Sum of Squares (TSS)
  – Dataset X must be available
  – If centroids C are given, also computes C-WCSS and C-BCSS
• Two ways to compare prY and spY (a pair-counting sketch follows):
  – Same-cluster and different-cluster PAIRS from prY and spY
  – For each prY-cluster find the best-matching spY-cluster, and vice versa
  – All reported as counts as well as in % of the full count
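A minimal sketch of the pair-based comparison, assuming prY and spY are n × 1 label vectors; the contingency-table route via DML's table() is my shorthand, not necessarily how the script itself computes it:

    CT = table (prY, spY);                     # CT[a,b] = #records with prY = a and spY = b
    n  = sum (CT);
    same_in_both = sum (CT * (CT - 1)) / 2;    # pairs placed in the same cluster by BOTH labelings
    same_in_pr   = sum (rowSums (CT) * (rowSums (CT) - 1)) / 2;  # same-cluster pairs under prY
    same_in_sp   = sum (colSums (CT) * (colSums (CT) - 1)) / 2;  # same-cluster pairs under spY
    all_pairs    = n * (n - 1) / 2;            # denominator for reporting percentages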
Weighted Non-Negative Matrix Factorization (WNMF)

• INPUT: X is a non-negative (n × m)-matrix
  – Example: Xij = 1 if person #i clicked ad #j, else Xij = 0
• INPUT (OPTIONAL): W is a penalty (n × m)-matrix
  – Example: Wij = 1 if person #i saw ad #j, else Wij = 0
• OUTPUT: (n × k)-matrix U and (m × k)-matrix V such that:

    min over U, V of  Σ i=1..n Σ j=1..m Wij (Xij − (U Vᵀ)ij)²   s.t. U ≥ 0, V ≥ 0

  – k topics: Uic = affinity(person #i, topic #c), Vjc = affinity(ad #j, topic #c)
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high
Weighted Non-Negative Matrix Factorization (WNMF)

• NOTE: Non-negativity is critical for this "bipartite clustering" interpretation of U and V
  – Matrix U of size n × k = cluster affinity for people
  – Matrix V of size m × k = cluster affinity for ads
• Negatives would violate the "disjunction of conjunctions" sense:
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high (sketched below)
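A minimal sketch of using the factors for prediction; the threshold thresh is an illustrative parameter, not from the slides:

    S = U %*% t(V);                      # S[i,j] = predicted (person #i, ad #j) affinity
    click_hat = ppred (S, thresh, ">");  # predict a "click" wherever the affinity is high enough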
WNMF: Multiplicative Update

§ Easy to parallelize using SystemML
§ Multiple runs help avoid bad local optima
§ Must specify k: run for k = 1, 2, 3, … (as in k-means)

    Uij ← Uij ∗ [ (W ∗ X) V ]ij / ( [ (W ∗ (U Vᵀ)) V ]ij + ε )
    Vij ← Vij ∗ [ (W ∗ X)ᵀ U ]ij / ( [ (W ∗ (U Vᵀ))ᵀ U ]ij + ε )

Daniel D. Lee, H. Sebastian Seung: "Algorithms for Non-negative Matrix Factorization", NIPS 2000
Inside A Run of (W)NMF

• Assume that W is a sparse matrix

Unweighted NMF:

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum ((X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
    V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
    f_new = sum ((X - U %*% t(V)) ^ 2);
    i = i + 1;
}

Weighted NMF (WNMF):

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum (W * (X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
    V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
    f_new = sum (W * (X - U %*% t(V)) ^ 2);
    i = i + 1;
}
Sum-Product Rewrites

• Matrix chain product optimization
  – Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
• Moving operators from big matrices to smaller ones
  – Example: t(X) %*% U = t(t(U) %*% X)
• Opening brackets in expressions (ongoing research)
  – Example: sum ((X - U %*% t(V))^2) = sum (X^2) - 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
  – K-means: D = rowSums(X^2) - 2 * (X %*% t(C)) + t(rowSums(C^2))
• Indexed sum rearrangements (checked numerically below):
  – sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
  – sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
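A quick numerical check of the first indexed-sum identity; the sizes are illustrative:

    U = rand (rows = 1000, cols = 10);
    V = rand (rows = 2000, cols = 10);
    lhs = sum ((U %*% t(V)) ^ 2);              # materializes a 1000 x 2000 intermediate
    rhs = sum ((t(U) %*% U) * (t(V) %*% V));   # only 10 x 10 intermediates
    print ("lhs - rhs = " + (lhs - rhs));      # should be ~0 up to round-off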
Operator Fusion: W.Sq.Loss

• Weighted Squared Loss: sum (W * (X - U %*% t(V))^2)
  – Common pattern for factorization algorithms
  – W and X are usually very sparse (density < 0.001)
  – Problem: the "outer" product U %*% t(V) creates three dense intermediates in the size of X
• ⇒ Fused w.sq.loss operator (see the note below):
  – Key observations: a sparse W ∗ allows selective computation, and the "sum" aggregate significantly reduces memory requirements

[Figure: operator tree for the fused weighted squared loss, sum(W ∗ (X − U %*% t(V))²)]
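For contrast, the pattern as plain DML; the fusion itself happens inside SystemML's optimizer, so this sketch only marks where the dense intermediates would arise:

    # Naively, this line creates three dense n x m intermediates:
    # U %*% t(V), the subtraction from X, and the elementwise square.
    loss = sum (W * (X - U %*% t(V)) ^ 2);
    # The fused w.sq.loss operator evaluates the expression only at the nonzero
    # positions of W and feeds the "sum" aggregate directly, so no n x m
    # intermediate is ever materialized.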