CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR...

Preview:

Citation preview

CS5614:(Big)DataManagementSystems

B.AdityaPrakashLecture#16:Clustering

HighDimensionalData

§  Givenacloudofdatapointswewanttounderstanditsstructure

VTCS5614 2Prakash2017

3

TheProblemofClustering§  Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters,sothat– Membersofaclusterareclose/similartoeachother– Membersofdifferentclustersaredissimilar

§  Usually:– Pointsareinahigh-dimensionalspace– Similarityisdefinedusingadistancemeasure

•  Euclidean,Cosine,Jaccard,editdistance,…

VTCS5614Prakash2017

4

Example:Clusters&Outliers

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

VTCS5614

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x Outlier Cluster

Prakash2017

Clusteringisahardproblem!

VTCS5614 5Prakash2017

6

Whyisithard?

§  Clusteringintwodimensionslookseasy§  Clusteringsmallamountsofdatalookseasy§  Andinmostcases,looksarenotdeceiving

§  ManyapplicaDonsinvolvenot2,but10or10,000dimensions

§  High-dimensionalspaceslookdifferent:Almostallpairsofpointsareataboutthesamedistance

VTCS5614Prakash2017

ClusteringProblem:Galaxies§  Acatalogof2billion“skyobjects”representsobjectsby

theirradiaWonin7dimensions(frequencybands)§  Problem:Clusterintosimilarobjects,e.g.,galaxies,nearby

stars,quasars,etc.§  SloanDigitalSkySurvey

VTCS5614 7Prakash2017

ClusteringProblem:MusicCDs§  IntuiWvely:Musicdividesintocategories,andcustomerspreferafewcategories– Butwhatarecategoriesreally?

§  RepresentaCDbyasetofcustomerswhoboughtit:

§  SimilarCDshavesimilarsetsofcustomers,andvice-versa

8VTCS5614Prakash2017

ClusteringProblem:MusicCDsSpaceofallCDs:§  Thinkofaspacewithonedim.foreachcustomer– Valuesinadimensionmaybe0or1only– ACDisapointinthisspace(x1,x2,…,xk),wherexi=1ifftheithcustomerboughttheCD

§  ForAmazon,thedimensionistensofmillions

§  Task:FindclustersofsimilarCDsVTCS5614 9Prakash2017

ClusteringProblem:Documents

Findingtopics:§  Representadocumentbyavector(x1,x2,…,xk),wherexi=1ifftheithword(insomeorder)appearsinthedocument–  Itactuallydoesn’tmaberifkisinfinite;i.e.,wedon’tlimitthesetofwords

§  Documentswithsimilarsetsofwordsmaybeaboutthesametopic

10VTCS5614Prakash2017

Cosine,Jaccard,andEuclidean§  AswithCDswehaveachoicewhenwethinkofdocumentsassetsofwordsorshingles:– Setsasvectors:Measuresimilaritybythecosinedistance

– Setsassets:MeasuresimilaritybytheJaccarddistance

– Setsaspoints:MeasuresimilaritybyEuclideandistance

VTCS5614 11Prakash2017

12

Overview:MethodsofClustering

§  Hierarchical:– AgglomeraWve(bobomup):

•  IniDally,eachpointisacluster•  Repeatedlycombinethetwo“nearest”clustersintoone

– Divisive(topdown):•  Startwithoneclusterandrecursivelysplitit

§  Pointassignment:– Maintainasetofclusters–  Pointsbelongto“nearest”cluster

VTCS5614Prakash2017

HierarchicalClustering

§  KeyoperaWon:Repeatedlycombinetwonearestclusters

§  ThreeimportantquesWons:– 1)Howdoyourepresentaclusterofmorethanonepoint?

– 2)Howdoyoudeterminethe“nearness”ofclusters?

– 3)Whentostopcombiningclusters?

VTCS5614 13Prakash2017

HierarchicalClustering§  KeyoperaWon:Repeatedlycombinetwonearestclusters

§  (1)Howtorepresentaclusterofmanypoints?–  Keyproblem:Asyoumergeclusters,howdoyourepresentthe“locaDon”ofeachcluster,totellwhichpairofclustersisclosest?

§  Euclideancase:eachclusterhasacentroid=averageofits(data)points

§  (2)Howtodetermine“nearness”ofclusters?– Measureclusterdistancesbydistancesofcentroids

VTCS5614 14Prakash2017

Example:Hierarchicalclustering

Prakash2017 VTCS5614 15

AndintheNon-EuclideanCase?WhatabouttheNon-Euclideancase?§  Theonly“locaDons”wecantalkaboutarethepointsthemselves–  i.e.,thereisno“average”oftwopoints

§  Approach1:–  (1)Howtorepresentaclusterofmanypoints?clustroid=(data)point“closest”tootherpoints

–  (2)Howdoyoudeterminethe“nearness”ofclusters?Treatclustroidasifitwerecentroid,whencompuDnginter-clusterdistances

16VTCS5614Prakash2017

“Closest”Point?§  (1)Howtorepresentaclusterofmanypoints?clustroid=point“closest”tootherpoints

§  Possiblemeaningsof“closest”:–  Smallestmaximumdistancetootherpoints–  Smallestaveragedistancetootherpoints–  Smallestsumofsquaresofdistancestootherpoints

•  FordistancemetricdclustroidcofclusterCis:

VTCS5614 17

∑∈Cxc

cxd 2),(min

Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all other points in the cluster.

X

Cluster on 3 datapoints

Centroid

Clustroid

Datapoint

Prakash2017

Defining“Nearness”ofClusters

§  (2)Howdoyoudeterminethe“nearness”ofclusters?– Approach2:Interclusterdistance=minimumofthedistancesbetweenanytwopoints,onefromeachcluster

– Approach3:PickanoDonof“cohesion”ofclusters,e.g.,maximumdistancefromtheclustroid

•  Mergeclusterswhoseunionismostcohesive

18VTCS5614Prakash2017

Cohesion

§  Approach3.1:Usethediameterofthemergedcluster=maximumdistancebetweenpointsinthecluster

§  Approach3.2:Usetheaveragedistancebetweenpointsinthecluster

§  Approach3.3:Useadensity-basedapproach– Takethediameteroravg.distance,e.g.,anddividebythenumberofpointsinthecluster

VTCS5614 19Prakash2017

ImplementaWon

§  NaïveimplementaWonofhierarchicalclustering:– Ateachstep,computepairwisedistancesbetweenallpairsofclusters,thenmerge

– O(N3)

§  CarefulimplementaDonusingpriorityqueuecanreduceDmetoO(N2logN)– SWlltooexpensiveforreallybigdatasetsthatdonotfitinmemory

VTCS5614 20Prakash2017

K-MEANSCLUSTERING

Prakash2017 VTCS5614 21

k–meansAlgorithm(s)

§  AssumesEuclideanspace/distance

§  Startbypickingk,thenumberofclusters

§  IniDalizeclustersbypickingonepointpercluster– Example:Pickonepointatrandom,thenk-1otherpoints,eachasfarawayaspossiblefromthepreviouspoints

22VTCS5614Prakash2017

PopulaWngClusters§  1)Foreachpoint,placeitintheclusterwhosecurrentcentroiditisnearest

§  2)Alerallpointsareassigned,updatethelocaDonsofcentroidsofthekclusters

§  3)Reassignallpointstotheirclosestcentroid–  SomeDmesmovespointsbetweenclusters

§  Repeat2and3unWlconvergence–  Convergence:Pointsdon’tmovebetweenclustersandcentroidsstabilize

VTCS5614 23Prakash2017

Example:AssigningClusters

VTCS5614 24

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Clusters after round 1 Prakash2017

Example:AssigningClusters

VTCS5614 25

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Clusters after round 2 Prakash2017

Example:AssigningClusters

VTCS5614 26

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Clusters at the end Prakash2017

Gegngthekright

Howtoselectk?§  Trydifferentk,lookingatthechangeintheaveragedistancetocentroidaskincreases

§  AveragefallsrapidlyunDlrightk,thenchangeslible

27

k

Average distance to

centroid

Best value of k

VTCS5614Prakash2017

Example:Pickingk

VTCS5614 28

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Too few; many long distances to centroid.

Prakash2017

Example:Pickingk

VTCS5614 29

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Just right; distances rather short.

Prakash2017

Example:Pickingk

VTCS5614 30

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Too many; little improvement in average distance.

Prakash2017

THEBFRALGORITHM

Extensionofk-meanstolargedata

Prakash2017 VTCS5614 31

BFRAlgorithm§  BFR[Bradley-Fayyad-Reina]isavariantofk-meansdesignedtohandleverylarge(disk-resident)datasets

§  AssumesthatclustersarenormallydistributedaroundacentroidinaEuclideanspace–  StandarddeviaDonsindifferentdimensionsmayvary

•  Clustersareaxis-alignedellipses

§  Efficientwaytosummarizeclusters(wantmemoryrequiredO(clusters)andnotO(data))

32VTCS5614Prakash2017

BFRAlgorithm§  Pointsarereadfromdiskonemain-memory-fullataDme

§  MostpointsfrompreviousmemoryloadsaresummarizedbysimplestaWsWcs

§  Tobegin,fromtheiniDalloadweselecttheiniDalkcentroidsbysomesensibleapproach:–  Takekrandompoints–  TakeasmallrandomsampleandclusteropDmally–  Takeasample;pickarandompoint,andthenk–1morepoints,eachasfarfromthepreviouslyselectedpointsaspossible

33VTCS5614Prakash2017

ThreeClassesofPoints3setsofpointswhichwekeeptrackof:§  Discardset(DS):

–  Pointscloseenoughtoacentroidtobesummarized

§  Compressionset(CS):–  GroupsofpointsthatareclosetogetherbutnotclosetoanyexisDngcentroid

–  Thesepointsaresummarized,butnotassignedtoacluster

§  Retainedset(RS):–  IsolatedpointswaiDngtobeassignedtoacompressionset

VTCS5614 34Prakash2017

BFR:“Galaxies”Picture

VTCS5614 35

A cluster. Its points are in the DS.

The centroid

Compressed sets. Their points are in the CS.

Points in the RS

Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

SummarizingSetsofPointsForeachcluster,thediscardset(DS)issummarizedby:§  Thenumberofpoints,N§  ThevectorSUM,whoseithcomponentisthesumofthecoordinatesofthepointsintheithdimension

§  ThevectorSUMSQ:ithcomponent=sumofsquaresofcoordinatesinithdimension

VTCS5614

36A cluster. All its points are in the DS. The centroid

Prakash2017

SummarizingPoints:Comments

§  2d+1valuesrepresentanysizecluster–  d=numberofdimensions

§  Averageineachdimension(thecentroid)canbecalculatedasSUMi/N–  SUMi=ithcomponentofSUM

§  Varianceofacluster’sdiscardsetindimensioniis:(SUMSQi/N)–(SUMi/N)2–  AndstandarddeviaDonisthesquarerootofthat

§  Nextstep:Actualclustering

37VTCS5614

Note: Dropping the “axis-aligned” clusters assumption would require storing full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big! Prakash2017

The“Memory-Load”ofPoints

Processingthe“Memory-Load”ofpoints(1):§  1)Findthosepointsthatare“sufficientlyclose”toaclustercentroidandaddthosepointstothatclusterandtheDS–  Thesepointsaresoclosetothecentroidthattheycanbesummarizedandthendiscarded

§  2)Useanymain-memoryclusteringalgorithmtoclustertheremainingpointsandtheoldRS–  ClustersgototheCS;outlyingpointstotheRS

VTCS5614 38

Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

The“Memory-Load”ofPointsProcessingthe“Memory-Load”ofpoints(2):§  3)DSset:AdjuststaDsDcsoftheclusterstoaccountforthenewpoints–  AddNs,SUMs,SUMSQs

§  4)ConsidermergingcompressedsetsintheCS

§  5)Ifthisisthelastround,mergeallcompressedsetsintheCSandallRSpointsintotheirnearestcluster

VTCS5614 39

Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

BFR:“Galaxies”Picture

VTCS5614 40

A cluster. Its points are in the DS.

The centroid

Compressed sets. Their points are in the CS.

Points in the RS

Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points

Prakash2017

AFewDetails…

§  Q1)Howdowedecideifapointis“closeenough”toaclusterthatwewilladdthepointtothatcluster?

§  Q2)Howdowedecidewhethertwocompressedsets(CS)deservetobecombinedintoone?

41VTCS5614Prakash2017

HowCloseisCloseEnough?

§  Q1)Weneedawaytodecidewhethertoputanewpointintoacluster(anddiscard)

§  BFRsuggeststwoways:–  TheMahalanobisdistanceislessthanathreshold–  Highlikelihoodofthepointbelongingtocurrentlynearestcentroid

VTCS5614 42Prakash2017

MahalanobisDistance

VTCS5614 43Prakash2017

MahalanobisDistance

§  Ifclustersarenormallydistributedinddimensions,thenalertransformaDon,onestandarddeviaDon=√𝒅 –  i.e.,68%ofthepointsoftheclusterwillhaveaMahalanobisdistance<√𝒅 

§  AcceptapointforaclusterifitsM.D.is<somethreshold,e.g.2standarddeviaDons

44VTCS5614Prakash2017

Picture:EqualM.D.Regions§  Euclideanvs.Mahalanobisdistance

VTCS5614 45

Contours of equidistant points from the origin

Uniformly distributed points, Euclidean distance

Normally distributed points, Euclidean distance

Normally distributed points, Mahalanobis distance

Prakash2017

Should2CSclustersbecombined?

Q2)Should2CSsubclustersbecombined?§  Computethevarianceofthecombinedsubcluster

–  N,SUM,andSUMSQallowustomakethatcalculaDonquickly

§  Combineifthecombinedvarianceisbelowsomethreshold

§  ManyalternaWves:Treatdimensionsdifferently,considerdensity

VTCS5614 46Prakash2017

THECUREALGORITHM

Extensionofk-meanstoclustersofarbitraryshapes

Prakash2017 VTCS5614 47

TheCUREAlgorithm§  ProblemwithBFR/k-means:

–  Assumesclustersarenormallydistributedineachdimension

–  Andaxesarefixed–ellipsesatananglearenotOK

§  CURE(ClusteringUsingREpresentaWves):–  AssumesaEuclideandistance–  Allowsclusterstoassumeanyshape–  UsesacollecWonofrepresentaWvepointstorepresentclusters

48

Vs.

VTCS5614Prakash2017

Example:UniversitySalaries

49

e e

e

e

e e

e

e e

e

e

h

h

h

h

h

h

h h

h

h

h

h h

salary

age

VTCS5614Prakash2017

StarWngCURE2Passalgorithm.Pass1:§  0)Pickarandomsampleofpointsthatfitinmainmemory

§  1)IniWalclusters:–  Clusterthesepointshierarchically–groupnearestpoints/clusters

§  2)PickrepresentaWvepoints:–  Foreachcluster,pickasampleofpoints,asdispersedaspossible

–  Fromthesample,pickrepresentaDvesbymovingthem(say)20%towardthecentroidofthecluster

VTCS5614 50Prakash2017

Example:IniWalClusters

51

e e

e

e

e e

e

e e

e

e

h

h

h

h

h

h

h h

h

h

h

h h

salary

age

VTCS5614Prakash2017

Example:PickDispersedPoints

52

e e

e

e

e e

e

e e

e

e

h

h

h

h

h

h

h h

h

h

h

h h

salary

age

Pick (say) 4 remote points for each cluster.

VTCS5614Prakash2017

Example:PickDispersedPoints

53

e e

e

e

e e

e

e e

e

e

h

h

h

h

h

h

h h

h

h

h

h h

salary

age

Move points (say) 20% toward the centroid.

VTCS5614Prakash2017

FinishingCURE

Pass2:§  Now,rescanthewholedatasetandvisiteachpointpinthedataset

§  Placeitinthe“closestcluster”– NormaldefiniDonof“closest”:FindtheclosestrepresentaDvetopandassignittorepresentaDve’scluster

54VTCS5614

p

Prakash2017

Summary§  Clustering:Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters

§  Algorithms:– AgglomeraDvehierarchicalclustering:

•  Centroidandclustroid– k-means:

•  IniDalizaDon,pickingk– BFR– CURE

VTCS5614 55Prakash2017

Recommended