7/30/2019 Ceng514 Spr2012 Clustering (1)
1/72
Clustering
CEng514, Spring 2012
Clustering

Overview
Types of Data in Clustering
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Constraint-Based Clustering
Outlier Analysis
April 8, 2012 Clustering 2
What is Clustering?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Clustering / cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Important Issues

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Quality of Clustering

A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric: d(i, j)
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, etc. variables
Weights should be associated with different variables based on applications and data semantics
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
Data Structures

Data matrix (object-by-attribute, n objects x p attributes):

  [ x_11 ... x_1f ... x_1p ]
  [ ...  ... ...  ...  ... ]
  [ x_i1 ... x_if ... x_ip ]
  [ ...  ... ...  ...  ... ]
  [ x_n1 ... x_nf ... x_np ]

Dissimilarity matrix (object-by-object distances, lower triangular):

  [ 0                           ]
  [ d(2,1)  0                   ]
  [ d(3,1)  d(3,2)  0           ]
  [ :       :       :           ]
  [ d(n,1)  d(n,2)  ...  ...  0 ]
Types of Data in Cluster Analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued Variables

Standardize data
  Calculate the mean absolute deviation:
    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  where
    m_f = (1/n) (x_1f + x_2f + ... + x_nf)
  Calculate the standardized measurement (z-score):
    z_if = (x_if - m_f) / s_f
Using the mean absolute deviation is more robust than using the standard deviation
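The standardization above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the function name is mine):

```python
def standardize(values):
    """Standardize one interval-scaled variable f:
    z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                         # mean m_f
    s = sum(abs(x - m) for x in values) / n     # mean absolute deviation s_f
    return [(x - m) / s for x in values]

z = standardize([1.0, 2.0, 3.0])   # m = 2, s = 2/3, so z is roughly [-1.5, 0, 1.5]
```

Because s_f averages |x_if - m_f| rather than squaring it, one extreme value inflates it less than it would inflate the standard deviation, which is the robustness point made above.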
Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
Properties:
  d(i, j) >= 0
  d(i, i) = 0
  d(i, j) = d(j, i)
  d(i, j) <= d(i, k) + d(k, j)
Also, one can use a weighted distance
Similarity and Dissimilarity Between Objects (Cont.)

Example: X1 = (1, 2), X2 = (3, 5)
  Euclidean distance(X1, X2) = sqrt(4 + 9) = 3.61
  Manhattan distance(X1, X2) = 2 + 3 = 5
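The example can be checked with a small Minkowski-distance helper (a sketch, not part of the original slides):

```python
def minkowski(x, y, q):
    """Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (1, 2), (3, 5)
manhattan = minkowski(x1, x2, 1)   # 2 + 3 = 5
euclidean = minkowski(x1, x2, 2)   # sqrt(4 + 9), about 3.61
```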
Binary Variables

A contingency table for binary data (rows: object i, columns: object j):

            j = 1   j = 0   sum
  i = 1       a       b     a + b
  i = 0       c       d     c + d
  sum       a + c   b + d     p

Distance measure for symmetric binary variables:
  d(i, j) = (b + c) / (a + b + c + d)
Distance measure for asymmetric binary variables:
  d(i, j) = (b + c) / (a + b + c)
Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables

Example
  gender is a symmetric attribute
  the remaining attributes are asymmetric binary
  let the values Y and P be set to 1, and the value N be set to 0

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
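The three dissimilarities above can be reproduced with a helper that builds the a, b, c counts from the contingency table (a sketch; the variable names are mine):

```python
def asym_binary_dissim(i, j):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary variables."""
    a = sum(1 for u, v in zip(i, j) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(i, j) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(i, j) if (u, v) == (0, 1))
    return (b + c) / (a + b + c)

# Asymmetric attributes only (Fever, Cough, Test-1..Test-4), with Y/P -> 1, N -> 0:
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
```

This reproduces d(jack, mary) = 1/3, d(jack, jim) = 2/3, and d(jim, mary) = 0.75.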
Nominal (Categorical) Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Simple matching:
  d(i, j) = (p - m) / p
  m: # of matches, p: total # of variables
Nominal (Categorical) Variables

Example:

Object A1 A2
1 A E
2 B F
3 C G
4 A E

Dissimilarity matrix with d(i, j) = (p - m) / p and p = 2:

   1 2 3 4
1  0
2  1 0
3  1 1 0
4  0 1 1 0
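The dissimilarity matrix above follows directly from simple matching (a minimal sketch):

```python
def simple_matching(i, j):
    """d(i, j) = (p - m) / p: m matching values out of p nominal variables."""
    p = len(i)
    m = sum(1 for u, v in zip(i, j) if u == v)
    return (p - m) / p

objects = {1: ("A", "E"), 2: ("B", "F"), 3: ("C", "G"), 4: ("A", "E")}
d_14 = simple_matching(objects[1], objects[4])   # both values match: 0
d_12 = simple_matching(objects[1], objects[2])   # no values match: 1
```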
Ordinal Variables

An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled:
  replace x_if by its rank r_if in {1, ..., M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if - 1) / (M_f - 1)
  compute the dissimilarity using methods for interval-scaled variables
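The rank-and-rescale treatment can be sketched as follows (the function name and the example ordering are illustrative, not from the slides):

```python
def ordinal_to_interval(values, ordering):
    """Replace each value by its rank r_if, then rescale to
    z_if = (r_if - 1) / (M_f - 1), which lies in [0, 1]."""
    rank = {v: r for r, v in enumerate(ordering, start=1)}
    M = len(ordering)
    return [(rank[v] - 1) / (M - 1) for v in values]

# assuming an ordering fair < good < excellent
z = ordinal_to_interval(["fair", "good", "excellent", "good"],
                        ["fair", "good", "excellent"])
```

The resulting z values can then be fed to any interval-scaled distance, e.g. Euclidean.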
Ordinal Variables

Example: Assume that there is an ordering fair < ...
Variables of Mixed Types

A database may contain all six types of variables:
  symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
One may use a weighted formula to combine their effects:
  d(i, j) = ( sum_{f=1..p} delta_ij^(f) d_ij^(f) ) / ( sum_{f=1..p} delta_ij^(f) )
f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal:
  compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
Vector Objects

Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure: given two vectors of attributes, A and B, the cosine similarity is represented using a dot product and magnitude: s(A, B) = (A . B) / (||A|| ||B||)
A variant: the Tanimoto coefficient; it yields the Jaccard coefficient in the case of binary attributes
Vector Objects

Example: given two vectors
  x = (1, 1, 0, 0)
  y = (0, 1, 1, 0)
  s(x, y) = (0 + 1 + 0 + 0) / (sqrt(2) * sqrt(2)) = 0.5
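The cosine computation above, as a short sketch:

```python
def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = sum(u * u for u in a) ** 0.5
    norm_b = sum(v * v for v in b) ** 0.5
    return dot / (norm_a * norm_b)

s = cosine_sim((1, 1, 0, 0), (0, 1, 1, 0))   # 1 / (sqrt(2) * sqrt(2)) = 0.5
```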
Major Clustering Approaches

Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
  Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue
Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dis(tip, tjq))
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dis(tip, tjq))
Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dis(tip, tjq))
Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
  Medoid: one chosen, centrally located object in the cluster
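The first three alternatives can be sketched for two clusters given as point sets (Euclidean distance assumed; not from the slides):

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(Ki, Kj):
    """Smallest pairwise distance between the two clusters."""
    return min(dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    """Largest pairwise distance between the two clusters."""
    return max(dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    """Average pairwise distance between the two clusters."""
    ds = [dist(p, q) for p in Ki for q in Kj]
    return sum(ds) / len(ds)
```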
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster:
  Cm = ( sum_{i=1..N} t_i ) / N
Radius: square root of the average distance from any point of the cluster to its centroid:
  Rm = sqrt( sum_{i=1..N} (t_i - Cm)^2 / N )
Diameter: square root of the average mean squared distance between all pairs of points in the cluster:
  Dm = sqrt( sum_{i=1..N} sum_{j=1..N, j != i} (t_i - t_j)^2 / (N (N - 1)) )
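The three quantities, as a sketch for small numeric point sets (not from the slides):

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def radius(points):
    """sqrt of the average squared distance from each point to the centroid."""
    c = centroid(points)
    n = len(points)
    sq = sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)
    return (sq / n) ** 0.5

def diameter(points):
    """sqrt of the average squared distance over all ordered pairs of distinct points."""
    n = len(points)
    dims = len(points[0])
    sq = sum(sum((points[i][d] - points[j][d]) ** 2 for d in range(dims))
             for i in range(n) for j in range(n) if i != j)
    return (sq / (n * (n - 1))) ** 0.5
```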
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:
  E = sum_{m=1..k} sum_{t_mi in Km} (t_mi - Cm)^2
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  Global optimal: exhaustively enumerate all partitions
  Heuristic methods: k-means and k-medoids algorithms
  k-means (MacQueen '67): each cluster is represented by the center of the cluster
  k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when there are no more new assignments
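The four steps can be sketched in plain Python (a minimal illustration: it seeds with the first k points, which is one arbitrary choice of initial partition):

```python
def kmeans(points, k, max_iter=100):
    centers = [list(p) for p in points[:k]]        # arbitrary initial seed points
    assign = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assign == assign:                   # Step 4: stop when nothing changes
            break
        assign = new_assign
        # Step 2: recompute the centroid (mean point) of each cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

labels, centers = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

On the two obvious groups above this converges in a couple of iterations to centers (0, 0.5) and (10, 10.5).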
The K-Means Clustering Method

Example (K = 2): arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update until the assignments no longer change.
Comments on the K-Means Method

Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
Variations of the K-Means Method

A few variants of the k-means which differ in:
  selection of the initial k means
  dissimilarity calculations
  strategies to calculate cluster means
Handling categorical data: k-modes (Huang '98)
  Replacing means of clusters with modes
  Using new dissimilarity measures to deal with categorical objects
  Using a frequency-based method to update modes of clusters
  A mixture of categorical and numerical data: k-prototype method
Outlier Problem

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
  starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  PAM works effectively for small data sets, but does not scale well for large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): randomized sampling
A Typical K-Medoids Algorithm (PAM)

Example (K = 2): arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20 in the figure). Randomly select a non-medoid object O_random and compute the total cost of swapping it with a medoid O; swap O and O_random if quality is improved (total cost = 26 in the figure). Loop until no change.
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987)
Use real objects to represent the clusters:
  Select k representative objects arbitrarily
  For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
  For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
  Repeat until there is no change
PAM Clustering: total swapping cost TCih = sum_j Cjih

Contribution Cjih of each non-selected object j when medoid i is swapped with non-medoid h (t denotes another current medoid); the four cases correspond to the four panels of the original figure, which show where j ends up after the swap:
  Cjih = 0
  Cjih = d(j, h) - d(j, i)
  Cjih = d(j, t) - d(j, i)
  Cjih = d(j, h) - d(j, t)
What Is the Problem with PAM?

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not scale well for large data sets:
  O(k(n - k)^2) for each iteration, where n is # of data and k is # of clusters
Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw, 1990)
It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
  Efficiency depends on the sample size
  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (Randomized CLARA) (1994, 2002)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
If a local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Hierarchical Clustering

Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

(Figure: objects a, b, c, d, e merged step by step into {a, b}, {d, e}, {c, d, e}, {a, b, c, d, e}; agglomerative clustering (AGNES) runs Step 0 -> Step 4, divisive clustering (DIANA) runs Step 4 -> Step 0.)
AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Eventually all nodes belong to the same cluster
Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)
Inverse order of AGNES
Eventually each node forms a cluster on its own
Well-known Hierarchical Clustering Methods

Major weaknesses of agglomerative clustering methods:
  do not scale well: time complexity of at least O(n^2), where n is the number of total objects
  can never undo what was done previously
Integration of hierarchical with distance-based clustering:
  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  ROCK (1999): clustering categorical data by neighbor and link analysis
  CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
  Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records
Clustering Feature Vector in BIRCH

Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: sum_{i=1..N} X_i (linear sum)
  SS: sum_{i=1..N} X_i^2 (square sum)

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give
  CF = (5, (16, 30), (54, 190))
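The CF example can be verified directly; CFs are also additive, which is what lets BIRCH merge subclusters incrementally (a sketch, with function names of my own):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    dims = range(len(points[0]))
    return (len(points),
            tuple(sum(p[d] for p in points) for d in dims),
            tuple(sum(p[d] ** 2 for p in points) for d in dims))

def merge_cf(cf1, cf2):
    """CF additivity: merging two subclusters adds their CFs component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
```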
CF-Tree in BIRCH

Clustering feature:
  A summary of the statistics for a given subcluster: the 0th, 1st and 2nd moments of the subcluster from the statistical point of view
  Registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
  A nonleaf node in the tree has descendants or "children"
  The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters:
  Branching factor: specifies the maximum number of children
  Threshold: max diameter of sub-clusters stored at the leaf nodes
The CF Tree Structure

(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold CF entries (CF1, CF2, ...) with child pointers, and the leaf nodes hold CF entries chained by prev/next pointers.)
Clustering Categorical Data: The ROCK Algorithm

ROCK: RObust Clustering using linKs
  S. Guha, R. Rastogi & K. Shim, ICDE '99
Major ideas:
  Use links to measure similarity/proximity
  Not distance-based
Algorithm: sampling-based clustering
  Draw a random sample
  Cluster with links
  Label data on disk
Link Measure in ROCK

Links: # of common neighbors
  C1: {a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}
  C2: {a,b,f}, {a,b,g}, {a,f,g}, {b,f,g}
Let T1 = {a,b,c}, T2 = {c,d,e}, T3 = {a,b,f}
  link(T1, T2) = 4, since they have 4 common neighbors: {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}
  link(T1, T3) = 3, since they have 3 common neighbors: {a,b,d}, {a,b,e}, {a,b,g}
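The link counts above can be reproduced with Jaccard-similarity neighborhoods; the threshold 0.5 is my choice, picked so that two 3-item sets are neighbors exactly when they share at least two items, which matches the slide's lists:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbors(t, population, theta=0.5):
    """Neighbors of t: other points whose Jaccard similarity with t is >= theta."""
    return {p for p in population if p != t and jaccard(t, p) >= theta}

def link(t1, t2, population, theta=0.5):
    """link(T1, T2) = number of common neighbors of T1 and T2."""
    return len(neighbors(t1, population, theta) & neighbors(t2, population, theta))

C1 = ["abc", "abd", "abe", "acd", "ace", "ade", "bcd", "bce", "bde", "cde"]
C2 = ["abf", "abg", "afg", "bfg"]
population = [frozenset(s) for s in C1 + C2]
T1, T2, T3 = frozenset("abc"), frozenset("cde"), frozenset("abf")
```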
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar, '99
Measures the similarity based on a dynamic model:
  Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
A two-phase algorithm:
  1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON

Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters
CHAMELEON (Clustering Complex Objects)

(Figure: example data sets with complex cluster shapes.)
Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points
Major features:
  Discover clusters of arbitrary shape
  Handle noise
  One scan
  Need density parameters as a termination condition
Well-known examples:
  DBSCAN: Ester, et al. (KDD '96)
  OPTICS: Ankerst, et al. (SIGMOD '99)
  DENCLUE: Hinneburg & D. Keim (KDD '98)
  CLIQUE: Agrawal, et al. (SIGMOD '98) (more grid-based)
Density-Based Clustering: Basic Concepts

Two parameters:
  Eps: maximum radius of the neighbourhood
  MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  p belongs to N_Eps(q)
  core point condition: |N_Eps(q)| >= MinPts

(Figure: p within the Eps-neighbourhood of q, with MinPts = 5, Eps = 1 cm.)
Density-Reachable and Density-Connected

Density-reachable:
  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p, such that p_{i+1} is directly density-reachable from p_i
Density-connected:
  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm

Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
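A minimal DBSCAN sketch following these steps (brute-force neighborhood queries; labels 0, 1, ... for clusters and -1 for noise; not from the slides):

```python
def dbscan(points, eps, min_pts):
    def region(i):   # all points within Eps of point i (including i itself)
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:       # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                   # p is a core point: a cluster is formed
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # a noise point turns out to be a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

Border points are relabeled from noise but not expanded, matching the rule above that no points are density-reachable from a border point.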
DBSCAN: Sensitive to Parameters

(Figure: clustering results under different Eps and MinPts settings.)
Model-Based Clustering

What is model-based clustering?
  Attempt to optimize the fit between the given data and some mathematical model
  Based on the assumption: data are generated by a mixture of underlying probability distributions
Typical methods:
  Statistical approach: EM (Expectation Maximization), AutoClass
  Machine learning approach: COBWEB, CLASSIT
  Neural network approach: SOM (Self-Organizing Feature Map)
Conceptual Clustering

Conceptual clustering:
  Produces a classification scheme for a set of unlabeled objects
  Finds a characteristic description for each concept (class)
COBWEB (Fisher '87)
  A popular and simple method of incremental conceptual learning
  Creates a hierarchical clustering in the form of a classification tree
  Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method

(Figure: a classification tree.)
Neural Network Approach

Neural network approaches:
  Represent each cluster as an exemplar, acting as a "prototype" of the cluster
  New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
Typical methods:
  SOM (Self-Organizing feature Map)
Self-Organizing Feature Map (SOM)

SOMs, also called topological ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
It maps all the points in a high-dimensional source space into a 2- to 3-d target space, s.t. the distance and proximity relationships (i.e., topology) are preserved as much as possible
Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
Clustering is performed by having several units competing for the current object:
  The unit whose weight vector is closest to the current object wins
  The winner and its neighbors learn by having their weights adjusted
SOMs are believed to resemble processing that can occur in the brain
Useful for visualizing high-dimensional data in 2- or 3-D space
Web Document Clustering Using SOM

The result of SOM clustering of 12088 Web articles
The picture on the right: drilling down on the keyword "mining"
Based on the websom.hut.fi web page
A Classification of Constraints in Cluster Analysis

Clustering in applications: desirable to have user-guided (i.e., constrained) cluster analysis
Different constraints in cluster analysis:
  Constraints on individual objects (do selection first)
    Cluster on houses worth over $300K
  Constraints on distance or similarity functions
    Weighted functions, obstacles (e.g., rivers, lakes)
  Constraints on the selection of clustering parameters
    # of clusters, MinPts, etc.
  User-specified constraints
    Contain at least 500 valued customers and 5000 ordinary ones
  Semi-supervised: giving small training sets as constraints or hints
Clustering with User-Specified Constraints

Example: locating k delivery centers, each serving at least m valued customers and n ordinary ones
Proposed approach:
  Find an initial "solution" by partitioning the data set into k groups, satisfying the user constraints
  Iteratively refine the solution by micro-clustering relocation (e.g., moving micro-clusters from cluster Ci to Cj) and "deadlock" handling (break the micro-clusters when necessary)
What Is Outlier Discovery?

What are outliers?
  A set of objects that are considerably dissimilar from the remainder of the data
  Example: ???
Problem: define and find outliers in large data sets
Applications:
  Credit card fraud detection
  Telecom fraud detection
  Customer segmentation
  Medical analysis
Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)
Use discordancy tests depending on:
  the data distribution
  the distribution parameters (e.g., mean, variance)
  the number of expected outliers
Drawbacks:
  most tests are for a single attribute
  in many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods:
  We need multi-dimensional analysis without knowing the data distribution
Distance-based outlier: a DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from O
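The definition translates almost directly into code (a brute-force O(n^2) sketch, not from the slides):

```python
def db_outliers(points, p, D):
    """Return the DB(p, D)-outliers: objects O such that at least a fraction p
    of the objects in the data set lie at distance greater than D from O."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    return [o for o in points
            if sum(1 for q in points if dist(o, q) > D) / n >= p]

# a tight group plus one far-away object
outliers = db_outliers([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], 0.8, 5.0)
```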
Density-Based Local Outlier Detection

Distance-based outlier detection is based on the global distance distribution
It encounters difficulties in identifying outliers if the data is not uniformly distributed
Example: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier:
  Local outlier factor (LOF)
  Assume "outlier" is not crisp
  Each point has a LOF