CS5614: (Big) Data Management Systems
B. Aditya Prakash, Lecture #16: Clustering
High Dimensional Data
§ Given a cloud of data points, we want to understand its structure
The Problem of Clustering
§ Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar
§ Usually:
– Points are in a high-dimensional space
– Similarity is defined using a distance measure
• Euclidean, Cosine, Jaccard, edit distance, …
Example: Clusters & Outliers
[Figure: a 2D scatter of points (x) forming several clusters, with one point labeled as an outlier]
Clustering is a hard problem!
Why is it hard?
§ Clustering in two dimensions looks easy
§ Clustering small amounts of data looks easy
§ And in most cases, looks are not deceiving
§ Many applications involve not 2, but 10 or 10,000 dimensions
§ High-dimensional spaces look different: almost all pairs of points are at about the same distance
Clustering Problem: Galaxies
§ A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands)
§ Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
§ Sloan Digital Sky Survey
Clustering Problem: Music CDs
§ Intuitively: Music divides into categories, and customers prefer a few categories
– But what are categories really?
§ Represent a CD by the set of customers who bought it
§ Similar CDs have similar sets of customers, and vice-versa
Clustering Problem: Music CDs
Space of all CDs:
§ Think of a space with one dimension for each customer
– Values in a dimension may be 0 or 1 only
– A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD
§ For Amazon, the number of dimensions is in the tens of millions
§ Task: Find clusters of similar CDs
Clustering Problem: Documents
Finding topics:
§ Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document
– It actually doesn't matter if k is infinite; i.e., we don't limit the set of words
§ Documents with similar sets of words may be about the same topic
Cosine, Jaccard, and Euclidean
§ As with CDs, we have a choice when we think of documents as sets of words or shingles:
– Sets as vectors: Measure similarity by the cosine distance
– Sets as sets: Measure similarity by the Jaccard distance
– Sets as points: Measure similarity by the Euclidean distance
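To make the three options concrete, here is a small sketch (my own illustration, not from the slides) that computes cosine similarity, Jaccard similarity, and Euclidean distance for two hypothetical toy documents represented over their shared vocabulary:

```python
import math

# Two toy documents as sets of words (hypothetical example)
doc_a = {"big", "data", "clustering", "distance"}
doc_b = {"big", "data", "kmeans", "centroid", "distance"}

# A shared vocabulary fixes the ordering of the 0/1 vector components
vocab = sorted(doc_a | doc_b)
vec_a = [1 if w in doc_a else 0 for w in vocab]
vec_b = [1 if w in doc_b else 0 for w in vocab]

# Sets as vectors: cosine similarity (cosine distance = 1 - similarity)
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
cosine_sim = dot / (norm_a * norm_b)

# Sets as sets: Jaccard similarity = |intersection| / |union|
jaccard_sim = len(doc_a & doc_b) / len(doc_a | doc_b)

# Sets as points: Euclidean distance between the 0/1 vectors
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

print(cosine_sim, jaccard_sim, euclidean)
```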
Overview: Methods of Clustering
§ Hierarchical:
– Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two "nearest" clusters into one
– Divisive (top down):
• Start with one cluster and recursively split it
§ Point assignment:
– Maintain a set of clusters
– Points belong to the "nearest" cluster
Hierarchical Clustering
§ Key operation: Repeatedly combine the two nearest clusters
§ Three important questions:
– 1) How do you represent a cluster of more than one point?
– 2) How do you determine the "nearness" of clusters?
– 3) When to stop combining clusters?
Hierarchical Clustering
§ Key operation: Repeatedly combine the two nearest clusters
§ (1) How to represent a cluster of many points?
– Key problem: As you merge clusters, how do you represent the "location" of each cluster, to tell which pair of clusters is closest?
§ Euclidean case: each cluster has a centroid = average of its (data) points
§ (2) How to determine "nearness" of clusters?
– Measure cluster distances by distances of centroids
Example: Hierarchical clustering
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
§ The only "locations" we can talk about are the points themselves
– i.e., there is no "average" of two points
§ Approach 1:
– (1) How to represent a cluster of many points?
clustroid = (data) point "closest" to other points
– (2) How do you determine the "nearness" of clusters?
Treat the clustroid as if it were the centroid when computing inter-cluster distances
"Closest" Point?
§ (1) How to represent a cluster of many points?
clustroid = point "closest" to other points
§ Possible meanings of "closest":
– Smallest maximum distance to other points
– Smallest average distance to other points
– Smallest sum of squares of distances to other points
• For distance metric d, the clustroid c of cluster C minimizes Σx∈C d(x, c)²
Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
[Figure: a cluster on 3 datapoints, marking the centroid (an artificial point) and the clustroid (an actual datapoint)]
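As a small illustration of the sum-of-squares criterion, the following sketch (assuming Euclidean distance and a small in-memory cluster; the function names are mine) computes both the centroid and the clustroid of a cluster:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    # Average of the points in each dimension; usually not an actual datapoint
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def clustroid(cluster):
    # The datapoint minimizing the sum of squared distances to the other points
    return min(cluster, key=lambda c: sum(euclidean(c, x) ** 2 for x in cluster))

cluster = [(1.0, 1.0), (2.0, 1.5), (5.0, 4.0)]   # a cluster on 3 datapoints
print(centroid(cluster))    # an "artificial" point
print(clustroid(cluster))   # an existing datapoint
```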
Defining "Nearness" of Clusters
§ (2) How do you determine the "nearness" of clusters?
– Approach 2: Inter-cluster distance = minimum of the distances between any two points, one from each cluster
– Approach 3: Pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid
• Merge the clusters whose union is most cohesive
Cohesion
§ Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster
§ Approach 3.2: Use the average distance between points in the cluster
§ Approach 3.3: Use a density-based approach
– Take the diameter or avg. distance, e.g., and divide by the number of points in the cluster
Implementation
§ Naïve implementation of hierarchical clustering:
– At each step, compute pairwise distances between all pairs of clusters, then merge
– O(N³)
§ Careful implementation using a priority queue can reduce time to O(N² log N)
– Still too expensive for really big datasets that do not fit in memory
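A minimal sketch of the naïve O(N³) approach, assuming Euclidean points, centroid representation of clusters, and stopping at a target number of clusters (all choices of this example, not prescribed by the slides):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def naive_hierarchical(points, target_k):
    # Start with every point in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Scan all pairs of clusters; merge the two with nearest centroids
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(naive_hierarchical([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], target_k=3))
```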
K-MEANS CLUSTERING
k-means Algorithm(s)
§ Assumes Euclidean space/distance
§ Start by picking k, the number of clusters
§ Initialize clusters by picking one point per cluster
– Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points
Populating Clusters
§ 1) For each point, place it in the cluster whose current centroid is nearest
§ 2) After all points are assigned, update the locations of the centroids of the k clusters
§ 3) Reassign all points to their closest centroid
– Sometimes moves points between clusters
§ Repeat 2 and 3 until convergence
– Convergence: Points don't move between clusters and centroids stabilize
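A compact sketch of the assign/update loop above, in plain Python over small in-memory data; for simplicity it initializes centroids with the first k points rather than the "as far apart as possible" heuristic:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iters=100):
    centroids = [list(p) for p in points[:k]]   # naive initialization (see slide)
    assignment = None
    for _ in range(max_iters):
        # Steps 1/3: assign each point to the cluster of its nearest centroid
        new_assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:        # convergence: no point changed cluster
            break
        assignment = new_assignment
        # Step 2: move each centroid to the average of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignment

points = [(1, 1), (1.5, 2), (8, 8), (9, 8.5), (1, 0.5), (8.5, 9)]
print(kmeans(points, k=2))
```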
Example: Assigning Clusters
[Figure: data points (x) and centroids; clusters after round 1]
Example: Assigning Clusters
[Figure: data points and centroids; clusters after round 2]
Example: Assigning Clusters
[Figure: data points and centroids; clusters at the end]
Getting the k right
How to select k?
§ Try different k, looking at the change in the average distance to centroid as k increases
§ The average falls rapidly until the right k, then changes little
[Figure: average distance to centroid vs. k; the curve flattens at the best value of k]
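One way to make this concrete is to run k-means for several values of k and tabulate the average distance to the assigned centroid, then look for where the curve stops falling rapidly. A sketch, assuming the kmeans() and dist() helpers from the earlier k-means sketch are in scope:

```python
# Assumes kmeans() and dist() from the earlier k-means sketch are already defined.
points = [(1, 1), (1.2, 0.8), (8, 8), (8.3, 8.1), (0.9, 1.1), (15, 2), (15.2, 2.1)]

def avg_distance_to_centroid(points, k):
    centroids, assignment = kmeans(points, k)
    return sum(dist(p, centroids[a]) for p, a in zip(points, assignment)) / len(points)

# Look for the k where the curve stops falling rapidly (the "knee")
for k in range(1, 6):
    print(k, avg_distance_to_centroid(points, k))
```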
Example: Picking k
[Figure: the scatter of points with k too few; many long distances to centroid]
Example: Picking k
[Figure: the scatter of points with k just right; distances rather short]
Example: Picking k
[Figure: the scatter of points with k too many; little improvement in average distance]
THE BFR ALGORITHM
Extension of k-means to large data
BFR Algorithm
§ BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) datasets
§ Assumes that clusters are normally distributed around a centroid in a Euclidean space
– Standard deviations in different dimensions may vary
• Clusters are axis-aligned ellipses
§ Efficient way to summarize clusters (we want the memory required to be O(clusters) and not O(data))
BFR Algorithm
§ Points are read from disk one main-memory-full at a time
§ Most points from previous memory loads are summarized by simple statistics
§ To begin, from the initial load we select the initial k centroids by some sensible approach:
– Take k random points
– Take a small random sample and cluster it optimally
– Take a sample; pick a random point, and then k-1 more points, each as far from the previously selected points as possible
Three Classes of Points
3 sets of points which we keep track of:
§ Discard set (DS):
– Points close enough to a centroid to be summarized
§ Compression set (CS):
– Groups of points that are close together but not close to any existing centroid
– These points are summarized, but not assigned to a cluster
§ Retained set (RS):
– Isolated points waiting to be assigned to a compression set
BFR: "Galaxies" Picture
[Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS]
Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Summarizing Sets of Points
For each cluster, the discard set (DS) is summarized by:
§ The number of points, N
§ The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension
§ The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension
[Figure: a cluster, all of whose points are in the DS, with its centroid]
Summarizing Points: Comments
§ 2d + 1 values represent any size cluster
– d = number of dimensions
§ The average in each dimension (the centroid) can be calculated as SUMi / N
– SUMi = i-th component of SUM
§ The variance of a cluster's discard set in dimension i is: (SUMSQi / N) - (SUMi / N)²
– And the standard deviation is the square root of that
§ Next step: Actual clustering
Note: Dropping the "axis-aligned" clusters assumption would require storing the full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dimensional vector, it would be a d x d matrix, which is too big!
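A small sketch of the (N, SUM, SUMSQ) summary, with the centroid and per-dimension variance derived exactly as above (the DSSummary class name is mine, not from BFR):

```python
class DSSummary:
    """Summary of one discard set: 2d + 1 numbers, independent of cluster size."""
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d      # per-dimension sum of coordinates
        self.sumsq = [0.0] * d    # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var_i = SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

ds = DSSummary(d=2)
for p in [(1.0, 2.0), (2.0, 2.5), (1.5, 1.5)]:
    ds.add(p)
print(ds.centroid(), ds.variance())
```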
The "Memory-Load" of Points
Processing the "Memory-Load" of points (1):
§ 1) Find those points that are "sufficiently close" to a cluster centroid and add those points to that cluster and the DS
– These points are so close to the centroid that they can be summarized and then discarded
§ 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
– Clusters go to the CS; outlying points go to the RS
The "Memory-Load" of Points
Processing the "Memory-Load" of points (2):
§ 3) DS set: Adjust the statistics of the clusters to account for the new points
– Add the Ns, SUMs, and SUMSQs
§ 4) Consider merging compressed sets in the CS
§ 5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
A Few Details…
§ Q1) How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
§ Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one?
How Close is Close Enough?
§ Q1) We need a way to decide whether to put a new point into a cluster (and discard it)
§ BFR suggests two ways:
– The Mahalanobis distance is less than a threshold
– High likelihood of the point belonging to the currently nearest centroid
Mahalanobis Distance
§ Normalized Euclidean distance from the centroid: for point x = (x1, …, xd) and centroid c = (c1, …, cd), with per-dimension standard deviations σ1, …, σd:
– 1) Normalize in each dimension: yi = (xi - ci) / σi
– 2) Take the sum of the squares of the yi
– 3) Take the square root: d(x, c) = √( Σi ((xi - ci) / σi)² )
Mahalanobis Distance
§ If clusters are normally distributed in d dimensions, then after this transformation, one standard deviation = √d
– i.e., 68% of the points of the cluster will have a Mahalanobis distance < √d
§ Accept a point for a cluster if its M.D. is < some threshold, e.g., 2 standard deviations
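A sketch of this acceptance test, assuming the DSSummary class from the earlier summarization sketch is in scope (standard deviations come from its per-dimension variances, and a threshold of 2 standard deviations translates to 2·√d in Mahalanobis units):

```python
import math

# Assumes the DSSummary class from the "Summarizing Points" sketch is defined.
def mahalanobis(point, summary):
    c = summary.centroid()
    # Guard against zero variance in degenerate (e.g. single-point) summaries
    sigma = [math.sqrt(v) if v > 0 else 1e-12 for v in summary.variance()]
    # Normalize each dimension by its standard deviation, then take the Euclidean norm
    return math.sqrt(sum(((x - ci) / si) ** 2 for x, ci, si in zip(point, c, sigma)))

def close_enough(point, summary, num_std=2.0):
    d = len(summary.sum)
    # Accept if within num_std "standard deviations", i.e. num_std * sqrt(d)
    return mahalanobis(point, summary) < num_std * math.sqrt(d)
```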
Picture: Equal M.D. Regions
§ Euclidean vs. Mahalanobis distance
[Figure: contours of points equidistant from the origin, for (a) uniformly distributed points with Euclidean distance, (b) normally distributed points with Euclidean distance, and (c) normally distributed points with Mahalanobis distance]
Should 2 CS clusters be combined?
Q2) Should 2 CS subclusters be combined?
§ Compute the variance of the combined subcluster
– N, SUM, and SUMSQ allow us to make that calculation quickly
§ Combine if the combined variance is below some threshold
§ Many alternatives: Treat dimensions differently, consider density
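Because the summaries are just counts and sums, the statistics of a merged subcluster are obtained by component-wise addition. A sketch of the merge test, assuming (N, SUM, SUMSQ) triples and a simple per-dimension variance threshold (both representational choices are mine):

```python
def merge_stats(a, b):
    # Each summary is a tuple (N, SUM, SUMSQ); the merged summary is the component-wise sum
    n_a, sum_a, sq_a = a
    n_b, sum_b, sq_b = b
    return (n_a + n_b,
            [x + y for x, y in zip(sum_a, sum_b)],
            [x + y for x, y in zip(sq_a, sq_b)])

def combined_variance(summary):
    n, s, sq = summary
    return [q / n - (x / n) ** 2 for x, q in zip(s, sq)]

def should_merge(a, b, threshold):
    # Merge the two CS subclusters if every dimension's combined variance is below the threshold
    return all(v < threshold for v in combined_variance(merge_stats(a, b)))

cs1 = (3, [3.0, 6.0], [3.5, 12.5])      # 3 points near (1, 2)
cs2 = (2, [2.2, 4.1], [2.44, 8.41])     # 2 points near (1.1, 2.05)
print(should_merge(cs1, cs2, threshold=0.5))
```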
THE CURE ALGORITHM
Extension of k-means to clusters of arbitrary shapes
The CURE Algorithm
§ Problem with BFR/k-means:
– Assumes clusters are normally distributed in each dimension
– And axes are fixed; ellipses at an angle are not OK
§ CURE (Clustering Using REpresentatives):
– Assumes a Euclidean distance
– Allows clusters to assume any shape
– Uses a collection of representative points to represent clusters
Example: University Salaries
[Figure: scatter plot of salary vs. age, with two kinds of points labeled e and h]
Starting CURE
2-Pass algorithm. Pass 1:
§ 0) Pick a random sample of points that fit in main memory
§ 1) Initial clusters:
– Cluster these points hierarchically; group nearest points/clusters
§ 2) Pick representative points:
– For each cluster, pick a sample of points, as dispersed as possible
– From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster
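A sketch of step 2, assuming a cluster given as a list of Euclidean points; it uses a greedy farthest-point heuristic as one way to get a dispersed sample, then moves each pick 20% of the way toward the centroid:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def representatives(cluster, num_reps=4, shrink=0.2):
    centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    # Greedy farthest-point heuristic: one way to pick a dispersed sample
    reps = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(reps) < min(num_reps, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    # Move each representative `shrink` (e.g. 20%) of the way toward the centroid
    return [tuple(x + shrink * (c - x) for x, c in zip(p, centroid)) for p in reps]

cluster = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.1), (3.0, 0.3), (4.0, 0.0)]
print(representatives(cluster))
```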
Example: Initial Clusters
[Figure: the salary vs. age scatter, grouped into initial hierarchical clusters]
Example: Pick Dispersed Points
[Figure: the salary vs. age scatter; pick (say) 4 remote points for each cluster]
Example: Pick Dispersed Points
[Figure: the salary vs. age scatter; move the picked points (say) 20% toward the centroid]
Finishing CURE
Pass 2:
§ Now, rescan the whole dataset and visit each point p in the dataset
§ Place it in the "closest cluster"
– Normal definition of "closest": Find the closest representative to p and assign p to that representative's cluster
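A sketch of Pass 2, assuming each cluster is given by its list of representative points (e.g. produced by the earlier representatives() sketch): each point goes to the cluster owning the representative closest to it.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_points(points, reps_per_cluster):
    """reps_per_cluster: list of lists, one list of representative points per cluster."""
    assignment = []
    for p in points:
        # The closest representative over all clusters decides p's cluster
        best_cluster = min(range(len(reps_per_cluster)),
                           key=lambda c: min(dist(p, r) for r in reps_per_cluster[c]))
        assignment.append(best_cluster)
    return assignment

reps = [[(0.5, 0.5), (2.0, 0.2)], [(8.0, 8.0), (9.5, 9.0)]]   # two clusters' representatives
print(assign_points([(1, 1), (9, 8.5), (0, 0)], reps))
```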
Summary
§ Clustering: Given a set of points, with a notion of distance between points, group the points into some number of clusters
§ Algorithms:
– Agglomerative hierarchical clustering:
• Centroid and clustroid
– k-means:
• Initialization, picking k
– BFR
– CURE