CS5614: (Big) Data Management Systems
B. Aditya Prakash, Lecture #16: Clustering
High Dimensional Data
§ Given a cloud of data points, we want to understand its structure
The Problem of Clustering
§ Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar
§ Usually:
– Points are in a high-dimensional space
– Similarity is defined using a distance measure
• Euclidean, Cosine, Jaccard, edit distance, …
Example: Clusters & Outliers
[Figure: a 2D scatter of points (x) forming several clusters, with one point labeled as an outlier]
Clustering is a hard problem!
Why is it hard?
§ Clustering in two dimensions looks easy
§ Clustering small amounts of data looks easy
§ And in most cases, looks are not deceiving
§ Many applications involve not 2, but 10 or 10,000 dimensions
§ High-dimensional spaces look different: almost all pairs of points are at about the same distance
Clustering Problem: Galaxies
§ A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands)
§ Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
§ Sloan Digital Sky Survey
Clustering Problem: Music CDs
§ Intuitively: Music divides into categories, and customers prefer a few categories
– But what are categories really?
§ Represent a CD by the set of customers who bought it
§ Similar CDs have similar sets of customers, and vice-versa
Clustering Problem: Music CDs
Space of all CDs:
§ Think of a space with one dimension for each customer
– Values in a dimension may be 0 or 1 only
– A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD
§ For Amazon, the number of dimensions is in the tens of millions
§ Task: Find clusters of similar CDs
Clustering Problem: Documents
Finding topics:
§ Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document
– It actually doesn't matter if k is infinite; i.e., we don't limit the set of words
§ Documents with similar sets of words may be about the same topic
Cosine, Jaccard, and Euclidean
§ As with CDs, we have a choice when we think of documents as sets of words or shingles:
– Sets as vectors: Measure similarity by the cosine distance
– Sets as sets: Measure similarity by the Jaccard distance
– Sets as points: Measure similarity by the Euclidean distance
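To make the three options concrete, here is a small sketch (my own illustration, not from the slides) that computes cosine similarity, Jaccard similarity, and Euclidean distance for two hypothetical toy documents represented over their shared vocabulary:

```python
import math

# Two toy documents as sets of words (hypothetical example)
doc_a = {"big", "data", "clustering", "distance"}
doc_b = {"big", "data", "kmeans", "centroid", "distance"}

# A shared vocabulary fixes the ordering of the 0/1 vector components
vocab = sorted(doc_a | doc_b)
vec_a = [1 if w in doc_a else 0 for w in vocab]
vec_b = [1 if w in doc_b else 0 for w in vocab]

# Sets as vectors: cosine similarity (cosine distance = 1 - similarity)
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
cosine_sim = dot / (norm_a * norm_b)

# Sets as sets: Jaccard similarity = |intersection| / |union|
jaccard_sim = len(doc_a & doc_b) / len(doc_a | doc_b)

# Sets as points: Euclidean distance between the 0/1 vectors
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

print(cosine_sim, jaccard_sim, euclidean)
```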
Overview: Methods of Clustering
§ Hierarchical:
– Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two "nearest" clusters into one
– Divisive (top down):
• Start with one cluster and recursively split it
§ Point assignment:
– Maintain a set of clusters
– Points belong to the "nearest" cluster
Hierarchical Clustering
§ Key operation: Repeatedly combine the two nearest clusters
§ Three important questions:
– 1) How do you represent a cluster of more than one point?
– 2) How do you determine the "nearness" of clusters?
– 3) When to stop combining clusters?
Hierarchical Clustering
§ Key operation: Repeatedly combine the two nearest clusters
§ (1) How to represent a cluster of many points?
– Key problem: As you merge clusters, how do you represent the "location" of each cluster, to tell which pair of clusters is closest?
§ Euclidean case: each cluster has a centroid = average of its (data) points
§ (2) How to determine "nearness" of clusters?
– Measure cluster distances by distances of centroids
Example: Hierarchical clustering
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
§ The only "locations" we can talk about are the points themselves
– i.e., there is no "average" of two points
§ Approach 1:
– (1) How to represent a cluster of many points?
clustroid = (data) point "closest" to other points
– (2) How do you determine the "nearness" of clusters?
Treat the clustroid as if it were the centroid when computing inter-cluster distances
"Closest" Point?
§ (1) How to represent a cluster of many points?
clustroid = point "closest" to other points
§ Possible meanings of "closest":
– Smallest maximum distance to other points
– Smallest average distance to other points
– Smallest sum of squares of distances to other points
• For distance metric d, the clustroid c of cluster C minimizes Σx∈C d(x, c)²
Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
[Figure: a cluster on 3 datapoints, marking the centroid (an artificial point) and the clustroid (an actual datapoint)]
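As a small illustration of the sum-of-squares criterion, the following sketch (assuming Euclidean distance and a small in-memory cluster; the function names are mine) computes both the centroid and the clustroid of a cluster:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    # Average of the points in each dimension; usually not an actual datapoint
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def clustroid(cluster):
    # The datapoint minimizing the sum of squared distances to the other points
    return min(cluster, key=lambda c: sum(euclidean(c, x) ** 2 for x in cluster))

cluster = [(1.0, 1.0), (2.0, 1.5), (5.0, 4.0)]   # a cluster on 3 datapoints
print(centroid(cluster))    # an "artificial" point
print(clustroid(cluster))   # an existing datapoint
```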
Defining "Nearness" of Clusters
§ (2) How do you determine the "nearness" of clusters?
– Approach 2: Inter-cluster distance = minimum of the distances between any two points, one from each cluster
– Approach 3: Pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid
• Merge the clusters whose union is most cohesive
Cohesion
§ Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster
§ Approach 3.2: Use the average distance between points in the cluster
§ Approach 3.3: Use a density-based approach
– Take the diameter or avg. distance, e.g., and divide by the number of points in the cluster
Implementation
§ Naïve implementation of hierarchical clustering:
– At each step, compute pairwise distances between all pairs of clusters, then merge
– O(N³)
§ Careful implementation using a priority queue can reduce time to O(N² log N)
– Still too expensive for really big datasets that do not fit in memory
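A minimal sketch of the naïve O(N³) approach, assuming Euclidean points, centroid representation of clusters, and stopping at a target number of clusters (all choices of this example, not prescribed by the slides):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def naive_hierarchical(points, target_k):
    # Start with every point in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Scan all pairs of clusters; merge the two with nearest centroids
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(naive_hierarchical([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], target_k=3))
```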
K-MEANS CLUSTERING
k-means Algorithm(s)
§ Assumes Euclidean space/distance
§ Start by picking k, the number of clusters
§ Initialize clusters by picking one point per cluster
– Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points
Populating Clusters
§ 1) For each point, place it in the cluster whose current centroid is nearest
§ 2) After all points are assigned, update the locations of the centroids of the k clusters
§ 3) Reassign all points to their closest centroid
– Sometimes moves points between clusters
§ Repeat 2 and 3 until convergence
– Convergence: Points don't move between clusters and centroids stabilize
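A compact sketch of the assign/update loop above, in plain Python over small in-memory data; for simplicity it initializes centroids with the first k points rather than the "as far apart as possible" heuristic:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iters=100):
    centroids = [list(p) for p in points[:k]]   # naive initialization (see slide)
    assignment = None
    for _ in range(max_iters):
        # Steps 1/3: assign each point to the cluster of its nearest centroid
        new_assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:        # convergence: no point changed cluster
            break
        assignment = new_assignment
        # Step 2: move each centroid to the average of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignment

points = [(1, 1), (1.5, 2), (8, 8), (9, 8.5), (1, 0.5), (8.5, 9)]
print(kmeans(points, k=2))
```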
Example: Assigning Clusters
[Figure: data points (x) and centroids; clusters after round 1]
Example: Assigning Clusters
[Figure: data points and centroids; clusters after round 2]
Example: Assigning Clusters
[Figure: data points and centroids; clusters at the end]
Getting the k right
How to select k?
§ Try different k, looking at the change in the average distance to centroid as k increases
§ The average falls rapidly until the right k, then changes little
[Figure: average distance to centroid vs. k; the curve flattens at the best value of k]
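One way to make this concrete is to run k-means for several values of k and tabulate the average distance to the assigned centroid, then look for where the curve stops falling rapidly. A sketch, assuming the kmeans() and dist() helpers from the earlier k-means sketch are in scope:

```python
# Assumes kmeans() and dist() from the earlier k-means sketch are already defined.
points = [(1, 1), (1.2, 0.8), (8, 8), (8.3, 8.1), (0.9, 1.1), (15, 2), (15.2, 2.1)]

def avg_distance_to_centroid(points, k):
    centroids, assignment = kmeans(points, k)
    return sum(dist(p, centroids[a]) for p, a in zip(points, assignment)) / len(points)

# Look for the k where the curve stops falling rapidly (the "knee")
for k in range(1, 6):
    print(k, avg_distance_to_centroid(points, k))
```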
Example: Picking k
[Figure: the scatter of points with k too few; many long distances to centroid]
Example: Picking k
[Figure: the scatter of points with k just right; distances rather short]
Example: Picking k
[Figure: the scatter of points with k too many; little improvement in average distance]
THE BFR ALGORITHM
Extension of k-means to large data
BFR Algorithm
§ BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) datasets
§ Assumes that clusters are normally distributed around a centroid in a Euclidean space
– Standard deviations in different dimensions may vary
• Clusters are axis-aligned ellipses
§ Efficient way to summarize clusters (we want the memory required to be O(clusters) and not O(data))
BFR Algorithm
§ Points are read from disk one main-memory-full at a time
§ Most points from previous memory loads are summarized by simple statistics
§ To begin, from the initial load we select the initial k centroids by some sensible approach:
– Take k random points
– Take a small random sample and cluster it optimally
– Take a sample; pick a random point, and then k-1 more points, each as far from the previously selected points as possible
Three Classes of Points
3 sets of points which we keep track of:
§ Discard set (DS):
– Points close enough to a centroid to be summarized
§ Compression set (CS):
– Groups of points that are close together but not close to any existing centroid
– These points are summarized, but not assigned to a cluster
§ Retained set (RS):
– Isolated points waiting to be assigned to a compression set
BFR: "Galaxies" Picture
[Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS]
Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Summarizing Sets of Points
For each cluster, the discard set (DS) is summarized by:
§ The number of points, N
§ The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension
§ The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension
[Figure: a cluster, all of whose points are in the DS, with its centroid]
Summarizing Points: Comments
§ 2d + 1 values represent any size cluster
– d = number of dimensions
§ The average in each dimension (the centroid) can be calculated as SUMi / N
– SUMi = i-th component of SUM
§ The variance of a cluster's discard set in dimension i is: (SUMSQi / N) - (SUMi / N)²
– And the standard deviation is the square root of that
§ Next step: Actual clustering
Note: Dropping the "axis-aligned" clusters assumption would require storing the full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dimensional vector, it would be a d x d matrix, which is too big!
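A small sketch of the (N, SUM, SUMSQ) summary, with the centroid and per-dimension variance derived exactly as above (the DSSummary class name is mine, not from BFR):

```python
class DSSummary:
    """Summary of one discard set: 2d + 1 numbers, independent of cluster size."""
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d      # per-dimension sum of coordinates
        self.sumsq = [0.0] * d    # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var_i = SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

ds = DSSummary(d=2)
for p in [(1.0, 2.0), (2.0, 2.5), (1.5, 1.5)]:
    ds.add(p)
print(ds.centroid(), ds.variance())
```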
The "Memory-Load" of Points
Processing the "Memory-Load" of points (1):
§ 1) Find those points that are "sufficiently close" to a cluster centroid and add those points to that cluster and the DS
– These points are so close to the centroid that they can be summarized and then discarded
§ 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
– Clusters go to the CS; outlying points go to the RS
The "Memory-Load" of Points
Processing the "Memory-Load" of points (2):
§ 3) DS set: Adjust the statistics of the clusters to account for the new points
– Add the Ns, SUMs, and SUMSQs
§ 4) Consider merging compressed sets in the CS
§ 5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
A Few Details…
§ Q1) How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
§ Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one?
How Close is Close Enough?
§ Q1) We need a way to decide whether to put a new point into a cluster (and discard it)
§ BFR suggests two ways:
– The Mahalanobis distance is less than a threshold
– High likelihood of the point belonging to the currently nearest centroid
Mahalanobis Distance
§ Normalized Euclidean distance from the centroid: for point x = (x1, …, xd) and centroid c = (c1, …, cd), with per-dimension standard deviations σ1, …, σd:
– 1) Normalize in each dimension: yi = (xi - ci) / σi
– 2) Take the sum of the squares of the yi
– 3) Take the square root: d(x, c) = √( Σi ((xi - ci) / σi)² )
Mahalanobis Distance
§ If clusters are normally distributed in d dimensions, then after this transformation, one standard deviation = √d
– i.e., 68% of the points of the cluster will have a Mahalanobis distance < √d
§ Accept a point for a cluster if its M.D. is < some threshold, e.g., 2 standard deviations
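A sketch of this acceptance test, assuming the DSSummary class from the earlier summarization sketch is in scope (standard deviations come from its per-dimension variances, and a threshold of 2 standard deviations translates to 2·√d in Mahalanobis units):

```python
import math

# Assumes the DSSummary class from the "Summarizing Points" sketch is defined.
def mahalanobis(point, summary):
    c = summary.centroid()
    # Guard against zero variance in degenerate (e.g. single-point) summaries
    sigma = [math.sqrt(v) if v > 0 else 1e-12 for v in summary.variance()]
    # Normalize each dimension by its standard deviation, then take the Euclidean norm
    return math.sqrt(sum(((x - ci) / si) ** 2 for x, ci, si in zip(point, c, sigma)))

def close_enough(point, summary, num_std=2.0):
    d = len(summary.sum)
    # Accept if within num_std "standard deviations", i.e. num_std * sqrt(d)
    return mahalanobis(point, summary) < num_std * math.sqrt(d)
```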
Picture: Equal M.D. Regions
§ Euclidean vs. Mahalanobis distance
[Figure: contours of points equidistant from the origin, for (a) uniformly distributed points with Euclidean distance, (b) normally distributed points with Euclidean distance, and (c) normally distributed points with Mahalanobis distance]
Should 2 CS clusters be combined?
Q2) Should 2 CS subclusters be combined?
§ Compute the variance of the combined subcluster
– N, SUM, and SUMSQ allow us to make that calculation quickly
§ Combine if the combined variance is below some threshold
§ Many alternatives: Treat dimensions differently, consider density
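Because the summaries are just counts and sums, the statistics of a merged subcluster are obtained by component-wise addition. A sketch of the merge test, assuming (N, SUM, SUMSQ) triples and a simple per-dimension variance threshold (both representational choices are mine):

```python
def merge_stats(a, b):
    # Each summary is a tuple (N, SUM, SUMSQ); the merged summary is the component-wise sum
    n_a, sum_a, sq_a = a
    n_b, sum_b, sq_b = b
    return (n_a + n_b,
            [x + y for x, y in zip(sum_a, sum_b)],
            [x + y for x, y in zip(sq_a, sq_b)])

def combined_variance(summary):
    n, s, sq = summary
    return [q / n - (x / n) ** 2 for x, q in zip(s, sq)]

def should_merge(a, b, threshold):
    # Merge the two CS subclusters if every dimension's combined variance is below the threshold
    return all(v < threshold for v in combined_variance(merge_stats(a, b)))

cs1 = (3, [3.0, 6.0], [3.5, 12.5])      # 3 points near (1, 2)
cs2 = (2, [2.2, 4.1], [2.44, 8.41])     # 2 points near (1.1, 2.05)
print(should_merge(cs1, cs2, threshold=0.5))
```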
THE CURE ALGORITHM
Extension of k-means to clusters of arbitrary shapes
The CURE Algorithm
§ Problem with BFR/k-means:
– Assumes clusters are normally distributed in each dimension
– And axes are fixed; ellipses at an angle are not OK
§ CURE (Clustering Using REpresentatives):
– Assumes a Euclidean distance
– Allows clusters to assume any shape
– Uses a collection of representative points to represent clusters
Example: University Salaries
[Figure: scatter plot of salary vs. age, with two kinds of points labeled e and h]
Starting CURE
2-Pass algorithm. Pass 1:
§ 0) Pick a random sample of points that fit in main memory
§ 1) Initial clusters:
– Cluster these points hierarchically; group nearest points/clusters
§ 2) Pick representative points:
– For each cluster, pick a sample of points, as dispersed as possible
– From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster
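A sketch of step 2, assuming a cluster given as a list of Euclidean points; it uses a greedy farthest-point heuristic as one way to get a dispersed sample, then moves each pick 20% of the way toward the centroid:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def representatives(cluster, num_reps=4, shrink=0.2):
    centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    # Greedy farthest-point heuristic: one way to pick a dispersed sample
    reps = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(reps) < min(num_reps, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    # Move each representative `shrink` (e.g. 20%) of the way toward the centroid
    return [tuple(x + shrink * (c - x) for x, c in zip(p, centroid)) for p in reps]

cluster = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.1), (3.0, 0.3), (4.0, 0.0)]
print(representatives(cluster))
```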
Example: Initial Clusters
[Figure: the salary vs. age scatter, grouped into initial hierarchical clusters]
Example: Pick Dispersed Points
[Figure: the salary vs. age scatter; pick (say) 4 remote points for each cluster]
Example: Pick Dispersed Points
[Figure: the salary vs. age scatter; move the picked points (say) 20% toward the centroid]
Finishing CURE
Pass 2:
§ Now, rescan the whole dataset and visit each point p in the dataset
§ Place it in the "closest cluster"
– Normal definition of "closest": Find the closest representative to p and assign p to that representative's cluster
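A sketch of Pass 2, assuming each cluster is given by its list of representative points (e.g. produced by the earlier representatives() sketch): each point goes to the cluster owning the representative closest to it.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_points(points, reps_per_cluster):
    """reps_per_cluster: list of lists, one list of representative points per cluster."""
    assignment = []
    for p in points:
        # The closest representative over all clusters decides p's cluster
        best_cluster = min(range(len(reps_per_cluster)),
                           key=lambda c: min(dist(p, r) for r in reps_per_cluster[c]))
        assignment.append(best_cluster)
    return assignment

reps = [[(0.5, 0.5), (2.0, 0.2)], [(8.0, 8.0), (9.5, 9.0)]]   # two clusters' representatives
print(assign_points([(1, 1), (9, 8.5), (0, 0)], reps))
```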
Summary
§ Clustering: Given a set of points, with a notion of distance between points, group the points into some number of clusters
§ Algorithms:
– Agglomerative hierarchical clustering:
• Centroid and clustroid
– k-means:
• Initialization, picking k
– BFR
– CURE