Ceng514 Spr2012 Clustering (1)


  • 7/30/2019 Ceng514 Spr2012 Clustering (1)

    1/72

    Clustering

    CEng 514, Spring 2012

    2/72

    Clustering

    Overview
    Types of Data in Clustering
    A Categorization of Major Clustering Methods
    Partitioning Methods
    Hierarchical Methods
    Density-Based Methods
    Grid-Based Methods
    Model-Based Methods
    Constraint-Based Clustering
    Outlier Analysis

    April 8, 2012 Clustering 2

    3/72

    What is Clustering?

    Cluster: a collection of data objects
    Similar to one another within the same cluster
    Dissimilar to the objects in other clusters

    Clustering / Cluster analysis
    Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

    Unsupervised learning: no predefined classes

    4/72

    Important Issues

    Scalability
    Ability to deal with different types of attributes
    Ability to handle dynamic data
    Discovery of clusters with arbitrary shape
    Minimal requirements for domain knowledge to determine input parameters
    Able to deal with noise and outliers
    Insensitive to order of input records
    High dimensionality
    Incorporation of user-specified constraints
    Interpretability and usability

    5/72

    Quality of Clustering

    A good clustering method will produce high quality clusters with
    high intra-class similarity
    low inter-class similarity

    The quality of a clustering result depends on both the similarity measure used by the method and its implementation

    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

    6/72

    Measure the Quality of Clustering

    Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric: d(i, j)

    The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, etc. variables.

    Weights should be associated with different variables based on applications and data semantics.

    It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

    7/72

    Data Structures

    Data matrix (object vs. attribute), n objects by p attributes:

        [ x11 ... x1f ... x1p ]
        [ ... ... ... ... ... ]
        [ xi1 ... xif ... xip ]
        [ ... ... ... ... ... ]
        [ xn1 ... xnf ... xnp ]

    Dissimilarity matrix (object vs. object), pairwise distances:

        [ 0                       ]
        [ d(2,1)  0               ]
        [ d(3,1)  d(3,2)  0       ]
        [ :       :       :       ]
        [ d(n,1)  d(n,2)  ...  0  ]
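The dissimilarity matrix above can be computed directly from the data matrix. A minimal Python sketch (the function name is illustrative, and Euclidean distance is an assumption; the slides allow any distance function):

```python
import math

def dissimilarity_matrix(data):
    """Lower-triangular object-vs-object distance matrix built from an
    n x p data matrix (a list of p-dimensional points)."""
    n = len(data)
    # row i holds d(i, 1), ..., d(i, i-1) followed by the diagonal 0
    return [[math.dist(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(n)]
```

For example, `dissimilarity_matrix([(0, 0), (3, 4)])` yields `[[0.0], [5.0, 0.0]]`.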

    8/72

    Type of data in cluster analysis

    Interval-scaled variables
    Binary variables
    Nominal, ordinal, and ratio variables
    Variables of mixed types

    9/72

    Interval-valued variables

    Standardize data
    Calculate the mean absolute deviation:

        s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

    where

        m_f = (1/n) (x_1f + x_2f + ... + x_nf)

    Calculate the standardized measurement (z-score):

        z_if = (x_if - m_f) / s_f

    Using mean absolute deviation is more robust than using standard deviation
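The standardization above can be sketched in a few lines of Python (the function name is illustrative, not from the slides):

```python
def standardize(values):
    """Z-scores of one variable, using the mean absolute deviation s_f
    (more robust than the standard deviation, per the slide)."""
    n = len(values)
    m = sum(values) / n                       # mean m_f
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation s_f
    return [(x - m) / s for x in values]
```

For `[1, 2, 3]` this gives m = 2, s = 2/3, and z-scores of roughly -1.5, 0, 1.5.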

    10/72

    Similarity and Dissimilarity Between Objects

    Distances are normally used to measure the similarity or dissimilarity between two data objects

    Some popular ones include the Minkowski distance:

        d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

    where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer

    If q = 1, d is the Manhattan distance:

        d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

    11/72

    Similarity and Dissimilarity Between Objects (Cont.)

    If q = 2, d is the Euclidean distance:

        d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)

    Properties
    d(i, j) >= 0
    d(i, i) = 0
    d(i, j) = d(j, i)
    d(i, j) <= d(i, k) + d(k, j)

    Also, one can use weighted distance

    12/72

    Similarity and Dissimilarity Between Objects (Cont.)

    Example: X1 = (1, 2), X2 = (3, 5)

    Euclidean distance(X1, X2) = sqrt(4 + 9) = 3.61

    Manhattan distance(X1, X2) = 2 + 3 = 5
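The Minkowski family, including both distances in the example, can be sketched as one Python function (the name `minkowski` is illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points;
    q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (1, 2), (3, 5)
manhattan = minkowski(x1, x2, 1)  # 2 + 3 = 5
euclidean = minkowski(x1, x2, 2)  # sqrt(4 + 9), about 3.61
```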

    13/72

    Binary Variables

    A contingency table for binary data:

                          Object j
                          1      0      sum
        Object i   1      a      b      a+b
                   0      c      d      c+d
                 sum      a+c    b+d    p

    Distance measure for symmetric binary variables:

        d(i, j) = (b + c) / (a + b + c + d)

    Distance measure for asymmetric binary variables:

        d(i, j) = (b + c) / (a + b + c)

    Jaccard coefficient (similarity measure for asymmetric binary variables):

        sim_Jaccard(i, j) = a / (a + b + c)

    14/72

    Dissimilarity between Binary Variables

    Example

    gender is a symmetric attribute
    the remaining attributes are asymmetric binary
    let the values Y and P be set to 1, and the value N be set to 0

    Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

    Jack M Y N P N N N

    Mary F Y N P N P N

    Jim M Y P N N N N

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
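The three distances above can be reproduced with a small Python sketch of the asymmetric binary measure (function and variable names are illustrative):

```python
def asym_binary_dist(i, j):
    """d(i, j) = (b + c) / (a + b + c), with a, b, c counted from the
    contingency table (1/1, 1/0 and 0/1 matches); the 0/0 count d is
    ignored, as asymmetric binary variables require."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 over Fever, Cough, Test-1..Test-4 (gender excluded)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
```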

    15/72

    Nominal (Categorical) Variables

    A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

    Simple matching (m: # of matches, p: total # of variables):

        d(i, j) = (p - m) / p

    16/72

    Nominal (Categorical) Variables

    Example:

    Object A1 A2

    1 A E
    2 B F
    3 C G
    4 A E

    17/72

    Nominal (Categorical) Variables

    Example:

    Object A1 A2

    1 A E
    2 B F
    3 C G
    4 A E

    Dissimilarity matrix, using d(i, j) = (p - m) / p with p = 2:

        1 2 3 4

    1   0
    2   1 0
    3   1 1 0
    4   0 1 1 0
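The simple-matching dissimilarity behind this matrix is one line of counting (the function name is illustrative):

```python
def nominal_dist(i, j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    attributes on which objects i and j agree."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

objs = {1: ("A", "E"), 2: ("B", "F"), 3: ("C", "G"), 4: ("A", "E")}
```

For example, `nominal_dist(objs[1], objs[4])` is 0 (both attributes match) and `nominal_dist(objs[1], objs[2])` is 1, as in the matrix above.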

    18/72

    Ordinal Variables

    An ordinal variable can be discrete or continuous
    Order is important, e.g., rank
    Can be treated like interval-scaled
    replace x_if by its rank r_if in {1, ..., M_f}
    map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

        z_if = (r_if - 1) / (M_f - 1)

    compute the dissimilarity using methods for interval-scaled variables

    19/72

    Ordinal Variables

    Example: Assume that there is an ordering fair < good < excellent

    20/72

    Ordinal Variables

    Example: Assume that there is an ordering fair < good < excellent

    21/72

    Variables of Mixed Types

    A database may contain all the six types of variables
    symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

    One may use a weighted formula to combine their effects:

        d(i, j) = ( sum_{f=1..p} delta_ij^(f) d_ij^(f) ) / ( sum_{f=1..p} delta_ij^(f) )

    f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise

    f is interval-based: use the normalized distance

    f is ordinal:
    compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled

    22/72

    Vector Objects

    Vector objects: keywords in documents, gene features in micro-arrays, etc.

    Broad applications: information retrieval, biologic taxonomy, etc.

    Cosine measure: Given two vectors of attributes, A and B, the cosine similarity is represented using a dot product and magnitude

    A variant: Tanimoto coefficient. It yields the Jaccard coefficient in the case of binary attributes

    23/72

    Vector Objects

    Example:

    Given two vectors:

    x = (1, 1, 0, 0)
    y = (0, 1, 1, 0)

    s(x, y) = (0 + 1 + 0 + 0) / (sqrt(2) * sqrt(2)) = 0.5
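The cosine measure in the example is dot product over the product of magnitudes; a minimal sketch (the function name is illustrative):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

With x = (1, 1, 0, 0) and y = (0, 1, 1, 0) this evaluates to 0.5, as on the slide.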

    24/72

    Major Clustering Approaches

    Partitioning approach:
    Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    Typical methods: k-means, k-medoids, CLARANS

    Hierarchical approach:
    Create a hierarchical decomposition of the set of data (or objects) using some criterion
    Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON

    Density-based approach:
    Based on connectivity and density functions
    Typical methods: DBSCAN, OPTICS, DenClue

    25/72

    Typical Alternatives to Calculate the Distance between Clusters

    Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)

    Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)

    Average: avg distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)

    Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)

    Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
    Medoid: one chosen, centrally located object in the cluster

    26/72

    Centroid, Radius and Diameter of a Cluster (for numerical data sets)

    Centroid: the "middle" of a cluster:

        Cm = ( sum_{i=1..N} tip ) / N

    Radius: square root of average distance from any point of the cluster to its centroid:

        Rm = sqrt( sum_{i=1..N} (tip - cm)^2 / N )

    Diameter: square root of average mean squared distance between all pairs of points in the cluster:

        Dm = sqrt( sum_{i=1..N} sum_{j=1..N} (tip - tjq)^2 / (N (N - 1)) )
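The three statistics above can be sketched directly from their formulas (function names are illustrative; points are tuples):

```python
import math

def centroid(pts):
    """Cm: per-dimension mean of the cluster's points."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def radius(pts):
    """Rm: sqrt of the average squared distance to the centroid."""
    c, n = centroid(pts), len(pts)
    return math.sqrt(sum(sum((p[d] - c[d]) ** 2 for d in range(len(c)))
                         for p in pts) / n)

def diameter(pts):
    """Dm: sqrt of the average squared distance over all ordered pairs."""
    n = len(pts)
    s = sum(sum((p[d] - q[d]) ** 2 for d in range(len(p)))
            for p in pts for q in pts)
    return math.sqrt(s / (n * (n - 1)))
```

For the two-point cluster {(0,0), (2,0)} this gives centroid (1, 0), radius 1, and diameter 2.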

    27/72

    Partitioning Algorithms: Basic Concept

    Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

        E = sum_{m=1..k} sum_{tmi in Km} (tmi - Cm)^2

    Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

    Global optimal: exhaustively enumerate all partitions
    Heuristic methods: k-means and k-medoids algorithms
    k-means (MacQueen '67): Each cluster is represented by the center of the cluster
    k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw '87): Each cluster is represented by one of the objects in the cluster

    28/72

    The K-Means Clustering Method

    Given k, the k-means algorithm is implemented in four steps:

    Partition objects into k nonempty subsets
    Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
    Assign each object to the cluster with the nearest seed point
    Go back to Step 2, stop when no more new assignments
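The four steps above can be sketched in Python (an illustrative implementation, not the course's reference code; random seeding and squared-Euclidean assignment are assumptions):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means over tuples: seed with k data objects,
    assign each point to the nearest center, recompute means,
    and stop when the centers no longer change."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # arbitrarily choose k objects
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign to nearest seed point
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
               else centers[i] for i, cl in enumerate(clusters)]
        if new == centers:                    # no more new assignments
            break
        centers = new
    return centers, clusters
```

On two well-separated groups, e.g. `[(0,0), (0,1), (10,10), (10,11)]` with k = 2, the centers converge to (0, 0.5) and (10, 10.5).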

    29/72

    The K-Means Clustering Method

    Example

    [Figure: K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again, repeating until assignments no longer change.]

    30/72

    Comments on the K-Means Method

    Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

    31/72

    Variations of the K-Means Method

    A few variants of the k-means which differ in
    Selection of the initial k means
    Dissimilarity calculations
    Strategies to calculate cluster means

    Handling categorical data: k-modes (Huang '98)
    Replacing means of clusters with modes
    Using new dissimilarity measures to deal with categorical objects
    Using a frequency-based method to update modes of clusters
    A mixture of categorical and numerical data: k-prototype method

    32/72

    Outlier Problem

    The k-means algorithm is sensitive to outliers!
    Since an object with an extremely large value may substantially distort the distribution of the data.

    K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster.

    33/72

    The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters

    PAM (Partitioning Around Medoids, 1987)
    starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
    PAM works effectively for small data sets, but does not scale well for large data sets

    CLARA (Kaufmann & Rousseeuw, 1990)
    CLARANS (Ng & Han, 1994): Randomized sampling

    34/72

    A Typical K-Medoids Algorithm (PAM)

    [Figure: K = 2, total cost = 20. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid. Randomly select a non-medoid object O_random and compute the total cost of swapping (here 26). Swap O and O_random if quality is improved. Loop until no change.]

    35/72

    PAM (Partitioning Around Medoids) (1987)

    PAM (Kaufman and Rousseeuw, 1987)
    Use a real object to represent the cluster
    Select k representative objects arbitrarily
    For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
    For each pair of i and h,
    If TCih < 0, i is replaced by h, and each non-selected object is assigned to the most similar representative object
    Repeat until there is no change

    36/72

    PAM Clustering: Total swapping cost TCih = sum_j Cjih

    [Figure: four cases for the cost Cjih of object j when medoid i is swapped with non-medoid h, where t is another current medoid:]

    Cjih = 0
    Cjih = d(j, h) - d(j, i)
    Cjih = d(j, t) - d(j, i)
    Cjih = d(j, h) - d(j, t)

    37/72

    What Is the Problem with PAM?

    PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean

    PAM works efficiently for small data sets but does not scale well for large data sets.
    O(k(n-k)^2) for each iteration, where n is # of data, k is # of clusters

    Sampling-based method: CLARA (Clustering LARge Applications)

    38/72

    CLARA (Clustering Large Applications) (1990)

    CLARA (Kaufmann and Rousseeuw in 1990)
    It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

    Strength: deals with larger data sets than PAM
    Weakness:
    Efficiency depends on the sample size
    A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

    39/72

    CLARANS ("Randomized" CLARA) (1994, 2002)

    CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
    CLARANS draws a sample of neighbors dynamically
    The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
    If the local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum
    It is more efficient and scalable than both PAM and CLARA

    40/72

    Hierarchical Clustering

    Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

    [Figure: objects a, b, c, d, e. Agglomerative (AGNES) proceeds Step 0 -> Step 4: a and b merge into ab, d and e into de, then c joins de to form cde, then ab and cde merge into abcde. Divisive (DIANA) runs the same steps in reverse, Step 4 -> Step 0.]

    41/72

    AGNES (Agglomerative Nesting)

    Introduced in Kaufmann and Rousseeuw (1990)
    Use the Single-Link method and the dissimilarity matrix
    Merge nodes that have the least dissimilarity
    Eventually all nodes belong to the same cluster
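A single-link agglomerative pass like AGNES can be sketched in a few lines (an illustrative implementation; stopping at k clusters rather than merging all the way to one, and Euclidean distance, are assumptions):

```python
import math

def single_link(points, k):
    """Agglomerative clustering with the single-link rule: start with
    singleton clusters and repeatedly merge the pair of clusters whose
    closest members are nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: least dissimilarity between any two members
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

With `[(0,0), (1,0), (10,0), (11,0)]` and k = 2 this recovers the two obvious pairs.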

    42/72

    Dendrogram: Shows How the Clusters are Merged

    Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.

    A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

    43/72

    DIANA (Divisive Analysis)

    Introduced in Kaufmann and Rousseeuw (1990)
    Inverse order of AGNES
    Eventually each node forms a cluster on its own

    44/72

    Well-known Hierarchical Clustering Methods

    Major weakness of agglomerative clustering methods
    do not scale well: time complexity of at least O(n^2), where n is the number of total objects
    can never undo what was done previously

    Integration of hierarchical with distance-based clustering
    BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
    ROCK (1999): clustering categorical data by neighbor and link analysis
    CHAMELEON (1999): hierarchical clustering using dynamic modeling

    45/72

    BIRCH (1996)

    BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)

    Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
    Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
    Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

    Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
    Weakness: handles only numeric data, and sensitive to the order of the data records.

    46/72

    Clustering Feature Vector in BIRCH

    Clustering Feature: CF = (N, LS, SS)

    N: number of data points
    LS: sum_{i=1..N} X_i (linear sum)
    SS: sum_{i=1..N} X_i^2 (square sum)

    Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give

    CF = (5, (16, 30), (54, 190))
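The CF triple on this slide can be checked with a short sketch (the function name is illustrative):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): point count, per-dimension linear sum,
    and per-dimension sum of squares."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss
```

For the slide's five points this returns (5, (16, 30), (54, 190)). CF vectors are additive, which is what lets BIRCH maintain them incrementally as points arrive.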

    47/72

    CF-Tree in BIRCH

    Clustering feature:
    summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view
    registers crucial measurements for computing clusters and utilizes storage efficiently

    A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
    A nonleaf node in a tree has descendants or "children"
    The nonleaf nodes store sums of the CFs of their children

    A CF tree has two parameters
    Branching factor: specifies the maximum number of children
    Threshold: max diameter of sub-clusters stored at the leaf nodes

    48/72

    The CF Tree Structure

    [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1..CF6, each pointing to a child; non-leaf nodes hold entries CF1..CF5 pointing to their children; leaf nodes hold CF entries and are chained with prev/next pointers.]

    49/72

    Clustering Categorical Data: The ROCK Algorithm

    ROCK: RObust Clustering using linKs
    S. Guha, R. Rastogi & K. Shim, ICDE '99

    Major ideas
    Use links to measure similarity/proximity
    Not distance-based

    Algorithm: sampling-based clustering
    Draw random sample
    Cluster with links
    Label data in disk

    50/72

    Link Measure in ROCK

    Links: # of common neighbors
    C1: {a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}
    C2: {a,b,f}, {a,b,g}, {a,f,g}, {b,f,g}

    Let T1 = {a,b,c}, T2 = {c,d,e}, T3 = {a,b,f}

    link(T1, T2) = 4, since they have 4 common neighbors:
    {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}

    link(T1, T3) = 3, since they have 3 common neighbors:
    {a,b,d}, {a,b,e}, {a,b,g}
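The link counts above can be reproduced with a small sketch. Assumptions not stated on the slide: neighborhood is defined by Jaccard similarity with threshold theta = 0.5 (so two 3-element sets are neighbors iff they share at least 2 elements), and the two points being compared are excluded from their own common-neighbor count:

```python
def neighbors(point, data, theta=0.5):
    """Points whose Jaccard similarity with `point` is at least theta."""
    return {q for q in data
            if len(point & q) / len(point | q) >= theta}

def link(t1, t2, data, theta=0.5):
    """Number of common neighbors of t1 and t2, themselves excluded."""
    common = neighbors(t1, data, theta) & neighbors(t2, data, theta)
    return len(common - {t1, t2})

data = [frozenset(s) for s in
        ["abc", "abd", "abe", "acd", "ace", "ade", "bcd", "bce", "bde",
         "cde", "abf", "abg", "afg", "bfg"]]
T1, T2, T3 = frozenset("abc"), frozenset("cde"), frozenset("abf")
```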

    51/72

    CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

    CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar '99
    Measures the similarity based on a dynamic model
    Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

    A two-phase algorithm
    1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
    2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

    52/72

    Overall Framework of CHAMELEON

    Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters

    53/72

    CHAMELEON (Clustering Complex Objects)

    [Figure: example data sets with complex cluster shapes.]

    54/72

    Density-Based Clustering Methods

    Clustering based on density (local cluster criterion), such as density-connected points

    Major features:
    Discover clusters of arbitrary shape
    Handle noise
    One scan
    Need density parameters as termination condition

    Well-known examples:
    DBSCAN: Ester, et al. (KDD '96)
    OPTICS: Ankerst, et al. (SIGMOD '99)
    DENCLUE: Hinneburg & D. Keim (KDD '98)
    CLIQUE: Agrawal, et al. (SIGMOD '98) (more grid-based)

    55/72

    Density-Based Clustering: Basic Concepts

    Two parameters:
    Eps: maximum radius of the neighbourhood
    MinPts: minimum number of points in an Eps-neighbourhood of that point

    N_Eps(p): {q belongs to D | dist(p, q) <= Eps}

    Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to N_Eps(q) and |N_Eps(q)| >= MinPts (core point condition)

    [Figure: p in the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm]

    56/72

    Density-Reachable and Density-Connected

    Density-reachable:
    A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

    Density-connected:
    A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

    57/72

    DBSCAN: Density Based Spatial Clustering of Applications with Noise

    Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points

    Discovers clusters of arbitrary shape in spatial databases with noise

    [Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

    58/72

    DBSCAN: The Algorithm

    Arbitrarily select a point p
    Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
    If p is a core point, a cluster is formed.
    If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
    Continue the process until all of the points have been processed.

    59/72

    DBSCAN: Sensitive to Parameters

    [Figure: DBSCAN results on the same data under different Eps/MinPts settings.]

    60/72

    CHAMELEON (Clustering Complex Objects)

    [Figure: example data sets with complex cluster shapes.]

    61/72

    Model-Based Clustering

    What is model-based clustering?
    Attempt to optimize the fit between the given data and some mathematical model
    Based on the assumption: data are generated by a mixture of underlying probability distributions

    Typical methods
    Statistical approach: EM (Expectation maximization), AutoClass
    Machine learning approach: COBWEB, CLASSIT
    Neural network approach: SOM (Self-Organizing Feature Map)

    62/72

    Conceptual Clustering

    Conceptual clustering
    Produces a classification scheme for a set of unlabeled objects
    Finds characteristic description for each concept (class)

    COBWEB (Fisher '87)
    A popular and simple method of incremental conceptual learning
    Creates a hierarchical clustering in the form of a classification tree
    Each node refers to a concept and contains a probabilistic description of that concept

    63/72

    COBWEB Clustering Method

    A classification tree

    64/72

    Neural Network Approach

    Neural network approaches
    Represent each cluster as an exemplar, acting as a "prototype" of the cluster
    New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure

    Typical methods
    SOM (Self-Organizing feature Map)

    65/72

    Self-Organizing Feature Map (SOM)

    SOMs, also called topological ordered maps, or Kohonen Self-Organizing Feature Maps (KSOMs)
    It maps all the points in a high-dimensional source space into a 2- to 3-d target space, s.t. the distance and proximity relationships (i.e., topology) are preserved as much as possible
    Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
    Clustering is performed by having several units competing for the current object
    The unit whose weight vector is closest to the current object wins
    The winner and its neighbors learn by having their weights adjusted
    SOMs are believed to resemble processing that can occur in the brain
    Useful for visualizing high-dimensional data in 2- or 3-D space

    66/72

    Web Document Clustering Using SOM

    The result of SOM clustering of 12088 Web articles
    The picture on the right: drilling down on the keyword "mining"
    Based on the websom.hut.fi web page

    67/72

    A Classification of Constraints in Cluster Analysis

    Clustering in applications: desirable to have user-guided (i.e., constrained) cluster analysis

    Different constraints in cluster analysis:
    Constraints on individual objects (do selection first)
    Cluster on houses worth over $300K
    Constraints on distance or similarity functions
    Weighted functions, obstacles (e.g., rivers, lakes)
    Constraints on the selection of clustering parameters
    # of clusters, MinPts, etc.
    User-specified constraints
    Contain at least 500 valued customers and 5000 ordinary ones
    Semi-supervised: giving small training sets as "constraints" or hints

    68/72

    Clustering with User-Specified Constraints

    Example: Locating k delivery centers, each serving at least m valued customers and n ordinary ones

    Proposed approach
    Find an initial "solution" by partitioning the data set into k groups, satisfying the user constraints
    Iteratively refine the solution by micro-clustering relocation (e.g., moving micro-clusters from cluster Ci to Cj) and "deadlock" handling (break the microclusters when necessary)

    69/72

    What Is Outlier Discovery?

    What are outliers?
    The set of objects that are considerably dissimilar from the remainder of the data
    Example: ???

    Problem: Define and find outliers in large data sets

    Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis

    70/72

    Outlier Discovery: Statistical Approaches

    Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)

    Use discordancy tests depending on
    data distribution
    distribution parameter (e.g., mean, variance)
    number of expected outliers

    Drawbacks
    most tests are for a single attribute
    in many cases, the data distribution may not be known

    71/72

    Outlier Discovery: Distance-Based Approach

    Introduced to counter the main limitations imposed by statistical methods
    We need multi-dimensional analysis without knowing the data distribution

    Distance-based outlier: A DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lies at a distance greater than D from O
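The DB(p, D)-outlier definition above translates directly into a brute-force check (an illustrative sketch; a real system would prune with an index rather than scan all pairs):

```python
import math

def db_outliers(data, p, d):
    """DB(p, D)-outliers: objects O such that at least a fraction p of
    the objects in the data set lie at distance greater than d from O."""
    out = []
    for o in data:
        far = sum(1 for q in data
                  if q is not o and math.dist(o, q) > d)
        if far >= p * len(data):
            out.append(o)
    return out
```

With four points near the origin and one at (10, 10), `db_outliers(..., p=0.8, d=5)` flags only the distant point.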

    72/72

    Density-Based Local Outlier Detection

    Distance-based outlier detection is based on the global distance distribution
    It encounters difficulties to identify outliers if data is not uniformly distributed

    Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2
    The distance-based method cannot identify o2 as an outlier
    Need the concept of local outlier

    Local outlier factor (LOF)
    Assume outlier is not crisp
    Each point has a LOF