Ceng514 Spr2012 Clustering (1)


  • 7/30/2019 Ceng514 Spr2012 Clustering (1)

    1/72

    Clustering

    CEng 514, Spring 2012

    2/72

    Clustering

    Overview
    Types of Data in Clustering
    A Categorization of Major Clustering Methods
    Partitioning Methods
    Hierarchical Methods
    Density-Based Methods
    Grid-Based Methods
    Model-Based Methods
    Constraint-Based Clustering
    Outlier Analysis

    April 8, 2012 Clustering 2

    3/72

    What is Clustering?

    Cluster: a collection of data objects
    Similar to one another within the same cluster
    Dissimilar to the objects in other clusters

    Clustering / Cluster analysis
    Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

    Unsupervised learning: no predefined classes

    4/72

    Important Issues

    Scalability
    Ability to deal with different types of attributes
    Ability to handle dynamic data
    Discovery of clusters with arbitrary shape
    Minimal requirements for domain knowledge to determine input parameters
    Able to deal with noise and outliers
    Insensitive to order of input records
    High dimensionality
    Incorporation of user-specified constraints
    Interpretability and usability

    5/72

    Quality of Clustering

    A good clustering method will produce high quality clusters with
    high intra-class similarity
    low inter-class similarity

    The quality of a clustering result depends on both the similarity measure used by the method and its implementation

    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

    6/72

    Measure the Quality of Clustering

    Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric: d(i, j)

    The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, etc. variables.

    Weights should be associated with different variables based on applications and data semantics.

    It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

    7/72

    Data Structures

    Data matrix (object vs. attribute), n objects by p attributes:

        [ x11 ... x1f ... x1p ]
        [ ... ... ... ... ... ]
        [ xi1 ... xif ... xip ]
        [ ... ... ... ... ... ]
        [ xn1 ... xnf ... xnp ]

    Dissimilarity matrix (object vs. object), pairwise distances:

        [ 0                       ]
        [ d(2,1)  0               ]
        [ d(3,1)  d(3,2)  0       ]
        [ :       :       :       ]
        [ d(n,1)  d(n,2)  ...  0  ]
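The dissimilarity matrix above can be computed directly from the data matrix. A minimal Python sketch (the function name is illustrative, and Euclidean distance is an assumption; the slides allow any distance function):

```python
import math

def dissimilarity_matrix(data):
    """Lower-triangular object-vs-object distance matrix built from an
    n x p data matrix (a list of p-dimensional points)."""
    n = len(data)
    # row i holds d(i, 1), ..., d(i, i-1) followed by the diagonal 0
    return [[math.dist(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(n)]
```

For example, `dissimilarity_matrix([(0, 0), (3, 4)])` yields `[[0.0], [5.0, 0.0]]`.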

    8/72

    Type of data in cluster analysis

    Interval-scaled variables
    Binary variables
    Nominal, ordinal, and ratio variables
    Variables of mixed types

    9/72

    Interval-valued variables

    Standardize data
    Calculate the mean absolute deviation:

        s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

    where

        m_f = (1/n) (x_1f + x_2f + ... + x_nf)

    Calculate the standardized measurement (z-score):

        z_if = (x_if - m_f) / s_f

    Using mean absolute deviation is more robust than using standard deviation
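The standardization above can be sketched in a few lines of Python (the function name is illustrative, not from the slides):

```python
def standardize(values):
    """Z-scores of one variable, using the mean absolute deviation s_f
    (more robust than the standard deviation, per the slide)."""
    n = len(values)
    m = sum(values) / n                       # mean m_f
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation s_f
    return [(x - m) / s for x in values]
```

For `[1, 2, 3]` this gives m = 2, s = 2/3, and z-scores of roughly -1.5, 0, 1.5.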

    10/72

    Similarity and Dissimilarity Between Objects

    Distances are normally used to measure the similarity or dissimilarity between two data objects

    Some popular ones include the Minkowski distance:

        d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

    where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer

    If q = 1, d is the Manhattan distance:

        d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

    11/72

    Similarity and Dissimilarity Between Objects (Cont.)

    If q = 2, d is the Euclidean distance:

        d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)

    Properties
    d(i, j) >= 0
    d(i, i) = 0
    d(i, j) = d(j, i)
    d(i, j) <= d(i, k) + d(k, j)

    Also, one can use weighted distance

    12/72

    Similarity and Dissimilarity Between Objects (Cont.)

    Example: X1 = (1, 2), X2 = (3, 5)

    Euclidean distance(X1, X2) = sqrt(4 + 9) = 3.61

    Manhattan distance(X1, X2) = 2 + 3 = 5
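The Minkowski family, including both distances in the example, can be sketched as one Python function (the name `minkowski` is illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points;
    q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (1, 2), (3, 5)
manhattan = minkowski(x1, x2, 1)  # 2 + 3 = 5
euclidean = minkowski(x1, x2, 2)  # sqrt(4 + 9), about 3.61
```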

    13/72

    Binary Variables

    A contingency table for binary data:

                          Object j
                          1      0      sum
        Object i   1      a      b      a+b
                   0      c      d      c+d
                 sum      a+c    b+d    p

    Distance measure for symmetric binary variables:

        d(i, j) = (b + c) / (a + b + c + d)

    Distance measure for asymmetric binary variables:

        d(i, j) = (b + c) / (a + b + c)

    Jaccard coefficient (similarity measure for asymmetric binary variables):

        sim_Jaccard(i, j) = a / (a + b + c)

    14/72

    Dissimilarity between Binary Variables

    Example

    gender is a symmetric attribute
    the remaining attributes are asymmetric binary
    let the values Y and P be set to 1, and the value N be set to 0

    Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

    Jack M Y N P N N N

    Mary F Y N P N P N

    Jim M Y P N N N N

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
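The three distances above can be reproduced with a small Python sketch of the asymmetric binary measure (function and variable names are illustrative):

```python
def asym_binary_dist(i, j):
    """d(i, j) = (b + c) / (a + b + c), with a, b, c counted from the
    contingency table (1/1, 1/0 and 0/1 matches); the 0/0 count d is
    ignored, as asymmetric binary variables require."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 over Fever, Cough, Test-1..Test-4 (gender excluded)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
```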

    15/72

    Nominal (Categorical) Variables

    A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

    Simple matching (m: # of matches, p: total # of variables):

        d(i, j) = (p - m) / p

    16/72

    Nominal (Categorical) Variables

    Example:

    Object A1 A2

    1 A E
    2 B F
    3 C G
    4 A E

    17/72

    Nominal (Categorical) Variables

    Example:

    Object A1 A2

    1 A E
    2 B F
    3 C G
    4 A E

    Dissimilarity matrix, using d(i, j) = (p - m) / p with p = 2:

        1 2 3 4

    1   0
    2   1 0
    3   1 1 0
    4   0 1 1 0
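The simple-matching dissimilarity behind this matrix is one line of counting (the function name is illustrative):

```python
def nominal_dist(i, j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    attributes on which objects i and j agree."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

objs = {1: ("A", "E"), 2: ("B", "F"), 3: ("C", "G"), 4: ("A", "E")}
```

For example, `nominal_dist(objs[1], objs[4])` is 0 (both attributes match) and `nominal_dist(objs[1], objs[2])` is 1, as in the matrix above.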

    18/72

    Ordinal Variables

    An ordinal variable can be discrete or continuous
    Order is important, e.g., rank
    Can be treated like interval-scaled
    replace x_if by its rank r_if in {1, ..., M_f}
    map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

        z_if = (r_if - 1) / (M_f - 1)

    compute the dissimilarity using methods for interval-scaled variables

    19/72

    Ordinal Variables

    Example: Assume that there is an ordering fair < good < excellent

    20/72

    Ordinal Variables

    Example: Assume that there is an ordering fair < good < excellent

    21/72

    Variables of Mixed Types

    A database may contain all the six types of variables
    symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

    One may use a weighted formula to combine their effects:

        d(i, j) = ( sum_{f=1..p} delta_ij^(f) d_ij^(f) ) / ( sum_{f=1..p} delta_ij^(f) )

    f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise

    f is interval-based: use the normalized distance

    f is ordinal:
    compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled

    22/72

    Vector Objects

    Vector objects: keywords in documents, gene features in micro-arrays, etc.

    Broad applications: information retrieval, biologic taxonomy, etc.

    Cosine measure: Given two vectors of attributes, A and B, the cosine similarity is represented using a dot product and magnitude

    A variant: Tanimoto coefficient. It yields the Jaccard coefficient in the case of binary attributes

    23/72

    Vector Objects

    Example:

    Given two vectors:

    x = (1, 1, 0, 0)
    y = (0, 1, 1, 0)

    s(x, y) = (0 + 1 + 0 + 0) / (sqrt(2) * sqrt(2)) = 0.5
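The cosine measure in the example is dot product over the product of magnitudes; a minimal sketch (the function name is illustrative):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

With x = (1, 1, 0, 0) and y = (0, 1, 1, 0) this evaluates to 0.5, as on the slide.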

    24/72

    Major Clustering Approaches

    Partitioning approach:
    Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    Typical methods: k-means, k-medoids, CLARANS

    Hierarchical approach:
    Create a hierarchical decomposition of the set of data (or objects) using some criterion
    Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON

    Density-based approach:
    Based on connectivity and density functions
    Typical methods: DBSCAN, OPTICS, DenClue

    25/72

    Typical Alternatives to Calculate the Distance between Clusters

    Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)

    Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)

    Average: avg distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)

    Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)

    Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
    Medoid: one chosen, centrally located object in the cluster

    26/72

    Centroid, Radius and Diameter of a Cluster (for numerical data sets)

    Centroid: the "middle" of a cluster:

        Cm = ( sum_{i=1..N} tip ) / N

    Radius: square root of average distance from any point of the cluster to its centroid:

        Rm = sqrt( sum_{i=1..N} (tip - cm)^2 / N )

    Diameter: square root of average mean squared distance between all pairs of points in the cluster:

        Dm = sqrt( sum_{i=1..N} sum_{j=1..N} (tip - tjq)^2 / (N (N - 1)) )
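The three statistics above can be sketched directly from their formulas (function names are illustrative; points are tuples):

```python
import math

def centroid(pts):
    """Cm: per-dimension mean of the cluster's points."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def radius(pts):
    """Rm: sqrt of the average squared distance to the centroid."""
    c, n = centroid(pts), len(pts)
    return math.sqrt(sum(sum((p[d] - c[d]) ** 2 for d in range(len(c)))
                         for p in pts) / n)

def diameter(pts):
    """Dm: sqrt of the average squared distance over all ordered pairs."""
    n = len(pts)
    s = sum(sum((p[d] - q[d]) ** 2 for d in range(len(p)))
            for p in pts for q in pts)
    return math.sqrt(s / (n * (n - 1)))
```

For the two-point cluster {(0,0), (2,0)} this gives centroid (1, 0), radius 1, and diameter 2.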

    27/72

    Partitioning Algorithms: Basic Concept

    Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

        E = sum_{m=1..k} sum_{tmi in Km} (tmi - Cm)^2

    Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

    Global optimal: exhaustively enumerate all partitions
    Heuristic methods: k-means and k-medoids algorithms
    k-means (MacQueen '67): Each cluster is represented by the center of the cluster
    k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw '87): Each cluster is represented by one of the objects in the cluster

    28/72

    The K-Means Clustering Method

    Given k, the k-means algorithm is implemented in four steps:

    Partition objects into k nonempty subsets
    Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
    Assign each object to the cluster with the nearest seed point
    Go back to Step 2, stop when no more new assignments
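The four steps above can be sketched in Python (an illustrative implementation, not the course's reference code; random seeding and squared-Euclidean assignment are assumptions):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means over tuples: seed with k data objects,
    assign each point to the nearest center, recompute means,
    and stop when the centers no longer change."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # arbitrarily choose k objects
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign to nearest seed point
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
               else centers[i] for i, cl in enumerate(clusters)]
        if new == centers:                    # no more new assignments
            break
        centers = new
    return centers, clusters
```

On two well-separated groups, e.g. `[(0,0), (0,1), (10,10), (10,11)]` with k = 2, the centers converge to (0, 0.5) and (10, 10.5).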

    29/72

    The K-Means Clustering Method

    Example

    [Figure: K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again, repeating until assignments no longer change.]

    30/72

    Comments on the K-Means Method

    Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

    31/72

    Variations of the K-Means Method

    A few variants of the k-means which differ in
    Selection of the initial k means
    Dissimilarity calculations
    Strategies to calculate cluster means

    Handling categorical data: k-modes (Huang '98)
    Replacing means of clusters with modes
    Using new dissimilarity measures to deal with categorical objects
    Using a frequency-based method to update modes of clusters
    A mixture of categorical and numerical data: k-prototype method

    32/72

    Outlier Problem

    The k-means algorithm is sensitive to outliers!
    Since an object with an extremely large value may substantially distort the distribution of the data.

    K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster.

    33/72

    The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters

    PAM (Partitioning Around Medoids, 1987)
    starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
    PAM works effectively for small data sets, but does not scale well for large data sets

    CLARA (Kaufmann & Rousseeuw, 1990)
    CLARANS (Ng & Han, 1994): Randomized sampling

    34/72

    A Typical K-Medoids Algorithm (PAM)

    [Figure: K = 2, total cost = 20. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid. Randomly select a non-medoid object O_random and compute the total cost of swapping (here 26). Swap O and O_random if quality is improved. Loop until no change.]

    35/72

    PAM (Partitioning Around Medoids) (1987)

    PAM (Kaufman and Rousseeuw, 1987)
    Use a real object to represent the cluster
    Select k representative objects arbitrarily
    For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
    For each pair of i and h,
    If TCih < 0, i is replaced by h, and each non-selected object is assigned to the most similar representative object
    Repeat until there is no change

    36/72

    PAM Clustering: Total swapping cost TCih = sum_j Cjih

    [Figure: four cases for the cost Cjih of object j when medoid i is swapped with non-medoid h, where t is another current medoid:]

    Cjih = 0
    Cjih = d(j, h) - d(j, i)
    Cjih = d(j, t) - d(j, i)
    Cjih = d(j, h) - d(j, t)

    37/72

    What Is the Problem with PAM?

    PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean

    PAM works efficiently for small data sets but does not scale well for large data sets.
    O(k(n-k)^2) for each iteration, where n is # of data, k is # of clusters

    Sampling-based method: CLARA (Clustering LARge Applications)

    38/72

    CLARA (Clustering Large Applications) (1990)

    CLARA (Kaufmann and Rousseeuw in 1990)
    It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

    Strength: deals with larger data sets than PAM
    Weakness:
    Efficiency depends on the sample size
    A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

    39/72

    CLARANS ("Randomized" CLARA) (1994, 2002)

    CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
    CLARANS draws a sample of neighbors dynamically
    The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
    If the local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum
    It is more efficient and scalable than both PAM and CLARA

    40/72

    Hierarchical Clustering

    Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

    [Figure: objects a, b, c, d, e. Agglomerative (AGNES) proceeds Step 0 -> Step 4: a and b merge into ab, d and e into de, then c joins de to form cde, then ab and cde merge into abcde. Divisive (DIANA) runs the same steps in reverse, Step 4 -> Step 0.]

    41/72

    AGNES (Agglomerative Nesting)

    Introduced in Kaufmann and Rousseeuw (1990)
    Use the Single-Link method and the dissimilarity matrix
    Merge nodes that have the least dissimilarity
    Eventually all nodes belong to the same cluster
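A single-link agglomerative pass like AGNES can be sketched in a few lines (an illustrative implementation; stopping at k clusters rather than merging all the way to one, and Euclidean distance, are assumptions):

```python
import math

def single_link(points, k):
    """Agglomerative clustering with the single-link rule: start with
    singleton clusters and repeatedly merge the pair of clusters whose
    closest members are nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: least dissimilarity between any two members
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

With `[(0,0), (1,0), (10,0), (11,0)]` and k = 2 this recovers the two obvious pairs.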

    42/72

    Dendrogram: Shows How the Clusters are Merged

    Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.

    A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

    43/72

    DIANA (Divisive Analysis)

    Introduced in Kaufmann and Rousseeuw (1990)
    Inverse order of AGNES
    Eventually each node forms a cluster on its own

    44/72

    Well-known Hierarchical Clustering Methods

    Major weakness of agglomerative clustering methods
    do not scale well: time complexity of at least O(n^2), where n is the number of total objects
    can never undo what was done previously

    Integration of hierarchical with distance-based clustering
    BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
    ROCK (1999): clustering categorical data by neighbor and link analysis
    CHAMELEON (1999): hierarchical clustering using dynamic modeling

    45/72

    BIRCH (1996)

    BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)

    Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
    Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
    Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

    Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
    Weakness: handles only numeric data, and sensitive to the order of the data records.

    46/72

    Clustering Feature Vector in BIRCH

    Clustering Feature: CF = (N, LS, SS)

    N: number of data points
    LS: sum_{i=1..N} X_i (linear sum)
    SS: sum_{i=1..N} X_i^2 (square sum)

    Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give

    CF = (5, (16, 30), (54, 190))
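The CF triple on this slide can be checked with a short sketch (the function name is illustrative):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): point count, per-dimension linear sum,
    and per-dimension sum of squares."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss
```

For the slide's five points this returns (5, (16, 30), (54, 190)). CF vectors are additive, which is what lets BIRCH maintain them incrementally as points arrive.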

    47/72

    CF-Tree in BIRCH

    Clustering feature:
    summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view
    registers crucial measurements for computing clusters and utilizes storage efficiently

    A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
    A nonleaf node in a tree has descendants or "children"
    The nonleaf nodes store sums of the CFs of their children

    A CF tree has two parameters
    Branching factor: specifies the maximum number of children
    Threshold: max diameter of sub-clusters stored at the leaf nodes

    48/72

    The CF Tree Structure

    [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1..CF6, each pointing to a child; non-leaf nodes hold entries CF1..CF5 pointing to their children; leaf nodes hold CF entries and are chained with prev/next pointers.]

    49/72

    Clustering Categorical Data: The ROCK Algorithm

    ROCK: RObust Clustering using linKs
    S. Guha, R. Rastogi & K. Shim, ICDE '99

    Major ideas
    Use links to measure similarity/proximity
    Not distance-based

    Algorithm: sampling-based clustering
    Draw random sample
    Cluster with links
    Label data in disk

    50/72

    Link Measure in ROCK

    Links: # of common neighbors
    C1: {a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}
    C2: {a,b,f}, {a,b,g}, {a,f,g}, {b,f,g}

    Let T1 = {a,b,c}, T2 = {c,d,e}, T3 = {a,b,f}

    link(T1, T2) = 4, since they have 4 common neighbors:
    {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}

    link(T1, T3) = 3, since they have 3 common neighbors:
    {a,b,d}, {a,b,e}, {a,b,g}
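The link counts above can be reproduced with a small sketch. Assumptions not stated on the slide: neighborhood is defined by Jaccard similarity with threshold theta = 0.5 (so two 3-element sets are neighbors iff they share at least 2 elements), and the two points being compared are excluded from their own common-neighbor count:

```python
def neighbors(point, data, theta=0.5):
    """Points whose Jaccard similarity with `point` is at least theta."""
    return {q for q in data
            if len(point & q) / len(point | q) >= theta}

def link(t1, t2, data, theta=0.5):
    """Number of common neighbors of t1 and t2, themselves excluded."""
    common = neighbors(t1, data, theta) & neighbors(t2, data, theta)
    return len(common - {t1, t2})

data = [frozenset(s) for s in
        ["abc", "abd", "abe", "acd", "ace", "ade", "bcd", "bce", "bde",
         "cde", "abf", "abg", "afg", "bfg"]]
T1, T2, T3 = frozenset("abc"), frozenset("cde"), frozenset("abf")
```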

    51/72

    CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

    CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar '99
    Measures the similarity based on a dynamic model
    Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

    A two-phase algorithm
    1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
    2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

    52/72

    Overall Framework of CHAMELEON

    Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters

    53/72

    CHAMELEON (Clustering Complex Objects)

    [Figure: example data sets with complex cluster shapes.]

    54/72

    Density-Based Clustering Methods

    Clustering based on density (local cluster criterion), such as density-connected points

    Major features:
    Discover clusters of arbitrary shape
    Handle noise
    One scan
    Need density parameters as termination condition

    Well-known examples:
    DBSCAN: Ester, et al. (KDD '96)
    OPTICS: Ankerst, et al. (SIGMOD '99)
    DENCLUE: Hinneburg & D. Keim (KDD '98)
    CLIQUE: Agrawal, et al. (SIGMOD '98) (more grid-based)

    55/72

    Density-Based Clustering: Basic Concepts

    Two parameters:
    Eps: maximum radius of the neighbourhood
    MinPts: minimum number of points in an Eps-neighbourhood of that point

    N_Eps(p): {q belongs to D | dist(p, q) <= Eps}

    Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to N_Eps(q) and |N_Eps(q)| >= MinPts (core point condition)

    [Figure: p in the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm]

    56/72

    Density-Reachable and Density-Connected

    Density-reachable:
    A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

    Density-connected:
    A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

    57/72

    DBSCAN: Density Based Spatial Clustering of Applications with Noise

    Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points

    Discovers clusters of arbitrary shape in spatial databases with noise

    [Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

    58/72

    DBSCAN: The Algorithm

    Arbitrarily select a point p
    Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
    If p is a core point, a cluster is formed.
    If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
    Continue the process until all of the points have been processed.

    59/72

    DBSCAN: Sensitive to Parameters

    [Figure: DBSCAN results on the same data under different Eps/MinPts settings.]

    60/72

    CHAMELEON (Clustering Complex Objects)

    [Figure: example data sets with complex cluster shapes.]

    61/72

    Model-Based Clustering

    What is model-based clustering?
    Attempt to optimize the fit between the given data and some mathematical model
    Based on the assumption: data are generated by a mixture of underlying probability distributions

    Typical methods
    Statistical approach: EM (Expectation maximization), AutoClass
    Machine learning approach: COBWEB, CLASSIT
    Neural network approach: SOM (Self-Organizing Feature Map)

    62/72

    Conceptual Clustering

    Conceptual clustering
    Produces a classification scheme for a set of unlabeled objects
    Finds characteristic description for each concept (class)

    COBWEB (Fisher '87)
    A popular and simple method of incremental conceptual learning
    Creates a hierarchical clustering in the form of a classification tree
    Each node refers to a concept and contains a probabilistic description of that concept

    63/72

    COBWEB Clustering Method

    A classification tree

    64/72

    Neural Network Approach

    Neural network approaches
    Represent each cluster as an exemplar, acting as a "prototype" of the cluster
    New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure

    Typical methods
    SOM (Self-Organizing feature Map)

    65/72

    Self-Organizing Feature Map (SOM)

    SOMs, also called topological ordered maps, or Kohonen Self-Organizing Feature Maps (KSOMs)
    It maps all the points in a high-dimensional source space into a 2- to 3-d target space, s.t. the distance and proximity relationships (i.e., topology) are preserved as much as possible
    Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
    Clustering is performed by having several units competing for the current object
    The unit whose weight vector is closest to the current object wins
    The winner and its neighbors learn by having their weights adjusted
    SOMs are believed to resemble processing that can occur in the brain
    Useful for visualizing high-dimensional data in 2- or 3-D space

    66/72

    Web Document Clustering Using SOM

    The result of SOM clustering of 12088 Web articles
    The picture on the right: drilling down on the keyword "mining"
    Based on the websom.hut.fi web page

    67/72

    A Classification of Constraints in Cluster Analysis

    Clustering in applications: desirable to have user-guided (i.e., constrained) cluster analysis

    Different constraints in cluster analysis:
    Constraints on individual objects (do selection first)
    Cluster on houses worth over $300K
    Constraints on distance or similarity functions
    Weighted functions, obstacles (e.g., rivers, lakes)
    Constraints on the selection of clustering parameters
    # of clusters, MinPts, etc.
    User-specified constraints
    Contain at least 500 valued customers and 5000 ordinary ones
    Semi-supervised: giving small training sets as "constraints" or hints

    68/72

    Clustering with User-Specified Constraints

    Example: Locating k delivery centers, each serving at least m valued customers and n ordinary ones

    Proposed approach
    Find an initial "solution" by partitioning the data set into k groups, satisfying the user constraints
    Iteratively refine the solution by micro-clustering relocation (e.g., moving micro-clusters from cluster Ci to Cj) and "deadlock" handling (break the microclusters when necessary)

    69/72

    What Is Outlier Discovery?

    What are outliers?
    The set of objects that are considerably dissimilar from the remainder of the data
    Example: ???

    Problem: Define and find outliers in large data sets

    Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis

    70/72

    Outlier Discovery: Statistical Approaches

    Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)

    Use discordancy tests depending on
    data distribution
    distribution parameter (e.g., mean, variance)
    number of expected outliers

    Drawbacks
    most tests are for a single attribute
    in many cases, the data distribution may not be known

    71/72

    Outlier Discovery: Distance-Based Approach

    Introduced to counter the main limitations imposed by statistical methods
    We need multi-dimensional analysis without knowing the data distribution

    Distance-based outlier: A DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lies at a distance greater than D from O
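The DB(p, D)-outlier definition above translates directly into a brute-force check (an illustrative sketch; a real system would prune with an index rather than scan all pairs):

```python
import math

def db_outliers(data, p, d):
    """DB(p, D)-outliers: objects O such that at least a fraction p of
    the objects in the data set lie at distance greater than d from O."""
    out = []
    for o in data:
        far = sum(1 for q in data
                  if q is not o and math.dist(o, q) > d)
        if far >= p * len(data):
            out.append(o)
    return out
```

With four points near the origin and one at (10, 10), `db_outliers(..., p=0.8, d=5)` flags only the distant point.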

    72/72

    Density-Based Local Outlier Detection

    Distance-based outlier detection is based on the global distance distribution
    It encounters difficulties to identify outliers if data is not uniformly distributed

    Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2
    The distance-based method cannot identify o2 as an outlier
    Need the concept of local outlier

    Local outlier factor (LOF)
    Assume outlier is not crisp
    Each point has a LOF