Clustering and community detection · 2017-10-17 · Clustering and community detection Social and Technological Networks Rik Sarkar ... Dense subgraphs: More links within community,

Clustering and community detection

Social and Technological Networks

Rik Sarkar

University of Edinburgh, 2017.

•  Plan/proposal guidelines are up
•  Office hours
   – Wednesdays 12:00–13:00
   – (May change in future. Always check the web page for times and announcements.)

Community detection

•  Given a network
•  What are the “communities”?
   – Closely connected groups of nodes
   – Relatively few edges to outside the community
•  Similar to clustering in data sets
   – Group together points that are closer or more similar to each other than to other points

Community detection by clustering

•  First, define a metric between nodes
   – Either compute intrinsic metrics like all-pairs shortest paths [Floyd-Warshall algorithm, O(n^3)]
   – Or embed the nodes in a Euclidean space, and use the metric there
      •  We will later study embedding methods
•  Apply a clustering algorithm with the metric
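A minimal sketch of the intrinsic-metric option: all-pairs shortest path lengths via Floyd-Warshall. It assumes an undirected, unweighted graph with nodes numbered 0..n-1; the function name and the edge-list input format are illustrative choices, not something the slides prescribe.

```python
import math

def floyd_warshall(n, edges):
    """Return an n x n table of shortest path lengths (hop counts)."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1  # undirected, unit-length edges
    # Allow node k as an intermediate stop, one k at a time: O(n^3) overall.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Path 0-1-2-3 plus an isolated node 4.
table = floyd_warshall(5, [(0, 1), (1, 2), (2, 3)])
```

The resulting table can be handed to any clustering method that only needs pairwise distances; unreachable pairs stay at infinity.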

Clustering

•  A core problem of machine learning:
   – Which items are in the same group?
•  Identifies items that are similar relative to the rest of the data
•  Simplifies information by grouping similar items
   – Helps in all types of other problems

Clustering

•  Outline approach:
•  Given a set of items
   – Define a distance between them
      •  E.g. Euclidean distance between points in a plane; Euclidean distance between other attributes; non-Euclidean distances; path lengths in a network; tie strengths in a network…
   – Determine a grouping (partitioning) that optimises some function (prefers ‘close’ items in the same group).
•  Reference for clustering:
   – Charu Aggarwal: Data Mining: The Textbook, Springer
      •  Free on Springer site (from university network)
   – Blum et al. Foundations of Data Science (free online)

K-means clustering

•  Find k clusters
   – With centers
   – That minimize the sum of squared distances of nodes to their cluster centers (called the k-means cost)

K-means clustering: Lloyd’s algorithm

•  There are n items
•  Select k ‘centers’
   – May be k random locations in space
   – May be the locations of k of the items, selected randomly
   – May be chosen according to some method
•  Iterate till convergence:
   – Assign each item to the cluster of its closest center
   – Recompute the location of each center as the mean location of all elements in its cluster
   – Repeat
•  Warning: Lloyd’s algorithm is a heuristic. It does not guarantee that the k-means cost is minimised
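The iteration above can be sketched as follows. This is an illustrative version, assuming points are tuples of numbers and that the initial centers are k of the items chosen at random; the helper names, the `seed` parameter and the `max_rounds` cap are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centers, labels):
    """Sum of squared distances of points to their assigned centers."""
    return sum(squared_distance(p, centers[c]) for p, c in zip(points, labels))

def lloyd(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k of the items, selected randomly
    labels = None
    for _ in range(max_rounds):
        # Assignment step: each item joins the cluster of its closest center.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # nothing changed: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(members[0])))
    return centers, labels

points = [(0, 0), (1, 0), (0, 2), (100, 100), (101, 100), (100, 102)]
centers, labels = lloyd(points, k=2)
```

On data this well separated any starting pair leads to the same two groups; on harder data the result depends on the starting centers, which is the warning in the last bullet.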

K-means

•  Visualisations
•  http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
•  http://shabal.in/visuals/kmeans/1.html

K-means

•  Ward’s algorithm (also a heuristic)
   – Start with each node as its own cluster
   – At each round, find the two clusters such that merging them increases the k-means cost the least
   – Merge these two clusters
   – Repeat until there are k clusters
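A small sketch of this bottom-up merging, assuming points are tuples of numbers. It relies on the standard identity that merging clusters A and B raises the k-means cost by |A||B|/(|A|+|B|) times the squared distance between their centroids; the function names are illustrative, and the quadratic pair search is kept deliberately simple.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    return tuple(sum(p[d] for p in cluster) / len(cluster)
                 for d in range(len(cluster[0])))

def merge_cost(a, b):
    """Increase in k-means cost caused by merging clusters a and b."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * squared_distance(centroid(a), centroid(b))

def ward(points, k):
    clusters = [[p] for p in points]  # every item starts as its own cluster
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Pick the pair whose merge is cheapest in terms of k-means cost.
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (1, 0), (0, 2), (100, 100), (101, 100), (100, 102)]
clusters = ward(points, k=2)
```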

K-means: discussion

•  Tries to minimise the sum of squared distances of items to cluster centers
   – Computationally hard problem
   – Algorithm gives a local optimum
•  Depends on initialisation (starting set of centers)
   – Can give poor results
   – Can be slow to converge
•  The right ‘k’ may be unknown
   – Possible strategy: try different possibilities and take the best
•  Can be improved by heuristics like choosing centers carefully
   – E.g. choosing centers to be as far apart as possible: choose one, choose the point farthest from it, choose the point farthest from both (maximise the minimum distance to the existing set), etc.
   – Try multiple times and take the best result
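The "as far apart as possible" initialisation can be sketched like this. It is a minimal illustration assuming points are tuples and the first center is given by index (the `first` parameter is an assumption for reproducibility; a random choice works equally well).

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k, first=0):
    centers = [points[first]]
    # min_dist[i]: squared distance from point i to its nearest chosen center.
    min_dist = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: the point that maximises the minimum distance
        # to the centers chosen so far.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(points[idx])
        min_dist = [min(d, squared_distance(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centers

points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
centers = farthest_first_centers(points, k=3)
```

The chosen centers can then seed Lloyd's iteration in place of a purely random start.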

K-medoids

•  Similar, but now each center must be one of the given items
   – In each cluster, find the item that is the best ‘center’, and repeat
•  Useful when there is no ambient space (extrinsic metric)
   – E.g. a distance can be computed between nodes, but they are not in any particular Euclidean space, so the ‘center’ is not a meaningful point

Other center based methods

•  K-center: minimise the maximum distance to a center
•  K-median: minimise the sum of distances to centers
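The formulas for these objectives did not survive in the slide text. In the usual formulation they can be written as below, with k-means added for comparison; the notation (x for items, c_1, ..., c_k for centers, d for the chosen distance) is an assumption made here, not taken from the slides.

```latex
% x ranges over the items, c_1, ..., c_k are the centers, d is the chosen distance.
\begin{align*}
\text{k-center:} \quad & \min_{c_1,\dots,c_k} \; \max_{x} \; \min_{j} \, d(x, c_j) \\
\text{k-median:} \quad & \min_{c_1,\dots,c_k} \; \sum_{x} \; \min_{j} \, d(x, c_j) \\
\text{k-means:}  \quad & \min_{c_1,\dots,c_k} \; \sum_{x} \; \min_{j} \, d(x, c_j)^2
\end{align*}
```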

Hierarchical clustering

•  Hierarchically group items

Hierarchical clustering

•  Top down (divisive):
   – Start with everything in 1 cluster
   – Make the best division, and repeat in each subcluster
•  Bottom up (agglomerative):
   – Start with n different clusters
   – Merge two at a time by finding pairs that give the best improvement

Hierarchical clustering

•  Gives many options for a flat clustering
•  Problem: what is a good ‘cut’ of the dendrogram?

Density based clustering

•  Group dense regions together
•  Better at non-linear separations
•  Works with unknown number of clusters

DBSCAN

•  Density at a data point:
   – Number of data points within radius Eps
•  A core point:
   – Point with density at least τ
•  Border point
   – Density less than τ, but at least one core point within radius Eps
•  Noise point
   – Neither core nor border. Far from dense regions
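These three definitions can be turned into a short classifier. The sketch assumes Euclidean distance and counts a point as lying within its own radius (conventions differ on this); `classify_points` is a name chosen for the example, and it only labels points, it does not yet group core points into clusters.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_points(points, eps, tau):
    n = len(points)
    # neighbours[i]: indices of points within radius eps of point i (itself included).
    neighbours = [[j for j in range(n)
                   if squared_distance(points[i], points[j]) <= eps * eps]
                  for i in range(n)]
    core = {i for i in range(n) if len(neighbours[i]) >= tau}
    kinds = []
    for i in range(n):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in neighbours[i]):
            kinds.append("border")  # sparse itself, but next to a dense region
        else:
            kinds.append("noise")
    return kinds

points = [(0, 0), (0.5, 0), (1, 0), (1.8, 0), (10, 0)]
kinds = classify_points(points, eps=1.0, tau=3)
```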

DBSCAN: Discussion

•  Requires knowledge of suitable radius and density parameters (Eps and τ)
•  Does not allow for the possibility that different clusters may have different densities

Density based clustering

•  Single linkage (same as Kruskal’s MST algorithm)
   – Start with n clusters
   – Merge the two clusters with the shortest bridging link
   – Repeat until k clusters
•  Other, more robust methods exist
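Single linkage as described is Kruskal's algorithm stopped early. The sketch below assumes nodes numbered 0..n-1 and links given as (length, u, v) triples; it uses a small union-find structure to track clusters, and all names are illustrative.

```python
def single_linkage(n, weighted_edges, k):
    parent = list(range(n))  # union-find forest: each node starts alone

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n
    # Scan links from shortest to longest, exactly as Kruskal's algorithm does.
    for length, u, v in sorted(weighted_edges):
        if clusters <= k:
            break
        ru, rv = find(u), find(v)
        if ru != rv:  # the link bridges two different clusters: merge them
            parent[ru] = rv
            clusters -= 1
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

links = [(1, 0, 1), (1, 1, 2), (5, 2, 3), (1, 3, 4)]
groups = single_linkage(5, links, k=2)
```

Running it down to k = 1 on a connected graph merges along exactly the edges of a minimum spanning tree.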

Communities

•  Groups of friends
•  Colleagues/collaborators
•  Web pages on similar topics
•  Biological reaction groups
•  Similar customers/users…

Other applications

•  A coarser representation of networks
•  One or more meta-nodes for each community
•  Identify bridges/weak links
•  Structural holes

Community detection in networks

•  A simple strategy:
   – Choose a suitable distance measure based on available data
      •  E.g. path lengths; distance based on inverse tie strengths; size of largest enclosing group or common attribute; distance in a spectral (eigenvector) embedding; etc.
   – Apply a standard clustering algorithm

Clustering is not always suitable in networks

•  Small world networks have small diameter
   – And sometimes integer distances
   – A distance based method has few distinct values with which to represent similarities/dissimilarities
•  High degree nodes are common
   – Connect different communities
   – Hard to separate communities
•  Edge densities vary across the network
   – Same threshold does not work well everywhere

Definitions of communities

•  Varies, depending on application
•  General idea: dense subgraphs: more links within the community, few links to outside
•  Some types and considerations:
   – Partitions: each node in exactly one community
   – Overlapping: each node can be in multiple communities

Finding dense subgraphs is hard in general

•  Finding largest clique
   – NP-hard
   – Computationally intractable
   – Polynomial time (efficient) algorithms unlikely to exist
•  Decision version: Does a clique of size k exist?
   – NP-complete
   – Computationally intractable
   – Polynomial time (efficient) algorithms unlikely to exist

Dense subgraphs: a few preliminary definitions

•  For S, T subsets of the node set V
•  e(S, T): set of edges from S to T
   – e(S) = e(S, S): edges within S
•  d_S(v): number of edges from v to S
•  Edge density of S: |e(S)| / |S|
   – Largest for complete graphs or cliques
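A direct transcription of these definitions, assuming a simple undirected graph given as a list of (u, v) pairs; the function names are chosen for the example.

```python
def edges_within(edges, S):
    """e(S): edges with both endpoints in S."""
    S = set(S)
    return [(u, v) for u, v in edges if u in S and v in S]

def degree_into(edges, v, S):
    """d_S(v): number of edges from v to nodes of S."""
    S = set(S)
    return sum(1 for a, b in edges
               if (a == v and b in S) or (b == v and a in S))

def edge_density(edges, S):
    """|e(S)| / |S|."""
    S = set(S)
    return len(edges_within(edges, S)) / len(S)

# A clique on nodes 0..3 with a pendant node 4 attached to node 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
clique_density = edge_density(edges, [0, 1, 2, 3])
whole_density = edge_density(edges, range(5))
```

Here the clique alone is denser than the whole graph, since adding the pendant node brings one more node but only one more edge.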

Dense subgraph

•  The subgraph with largest edge density
•  There also exists a decision version:
   – Is there a subgraph with edge density > α
•  Can be solved using Max Flow algorithms
   – O(n^2 m): inefficient in large datasets
   – Finds the one densest subgraph
•  Variant: Find densest S containing given subset X
•  Other versions: Find subgraphs of size k or less
   – NP-hard

Efficient approximation for finding dense S containing X

•  Gives a 1/2 approximation
•  Edge density of the output set S is at least half that of the optimal set S*
•  (Proof in Kempe 2011).
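The slide text states the guarantee but not the algorithm. One well-known greedy scheme of this kind repeatedly removes the lowest-degree node outside X and returns the densest set seen along the way; the sketch below assumes that this peeling approach is the one meant. It expects a simple graph (no repeated edges) with comparable node labels, and the function name is illustrative.

```python
def greedy_dense_subgraph(nodes, edges, X=()):
    keep = set(X)  # nodes that must stay in the output
    S = set(nodes)
    adj = {v: set() for v in S}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    inside = len(edges)  # number of edges within the current S
    best, best_density = set(S), inside / len(S)
    while len(S) > 1 and S - keep:
        # Peel the removable node with the fewest neighbours inside S
        # (ties broken by node label to keep the run deterministic).
        v = min(S - keep, key=lambda x: (len(adj[x] & S), x))
        inside -= len(adj[v] & S)
        S.remove(v)
        density = inside / len(S)
        if density > best_density:
            best, best_density = set(S), density
    return best, best_density

# A clique on 0..3 with a tail 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
dense, density = greedy_dense_subgraph(range(6), edges)
dense_with_5, _ = greedy_dense_subgraph(range(6), edges, X=[5])
```

Without a constraint the peeling strips the tail and keeps the clique; forcing node 5 into the output makes the whole graph the densest set encountered.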

Modularity

•  We want to find the many communities, not just one
•  Clustering a graph
•  Problem: What is the right clustering?
•  Idea: Maximize a quantity called modularity

Modularity of subset S

•  Given graph G
•  Consider a random graph G’ with the same node degrees (remember the configuration model)
   – Number of edges in S in G: |e(S)|_G
   – Expected number of edges in S in G’: E[|e(S)|_G’]
   – Modularity of S: |e(S)|_G - E[|e(S)|_G’]
   – More coherent communities have more edges inside than would be expected in a random graph with the same degrees
   – Note: modularity can be negative

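Modularity of a clustering

•  Take a partition (clustering) of V into sets S_i
•  Write d(S_i) for the sum of degrees of all nodes in S_i
•  Can be shown that E[|e(S_i)|_G’] ~ d(S_i)^2
•  Definition: the modularity of the clustering is a sum over the sets S_i of the partition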
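The defining formula is missing from the slide text. A commonly used normalised form, assumed here, is q = (1/m) Σ_i ( |e(S_i)| - d(S_i)^2 / (4m) ), where m is the number of edges and d(S_i)^2 / (4m) approximates the expected number of edges inside S_i under the configuration model. A minimal sketch under that assumption, for a simple undirected edge list:

```python
def modularity(edges, partition):
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = 0.0
    for part in partition:
        S = set(part)
        inside = sum(1 for u, v in edges if u in S and v in S)  # |e(S_i)|
        d_S = sum(degree.get(v, 0) for v in S)                  # d(S_i)
        total += inside - d_S ** 2 / (4 * m)                    # observed - expected
    return total / m

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [[0, 1, 2], [3, 4, 5]])
q_single = modularity(edges, [[0, 1, 2, 3, 4, 5]])
```

Putting everything into one cluster scores zero under this form, while the natural two-triangle split scores about 0.36.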

Modularity based clustering

•  Modularity is meant for use more as a measure of quality, not so much as a clustering method
•  Finding the clustering with highest modularity is NP-hard
•  Heuristic:
   – Use the modularity matrix
   – Take its first eigenvector
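One way to read this heuristic is the spectral bisection sketched below: build the modularity matrix B with entries B_ij = A_ij - k_i k_j / (2m), find its leading eigenvector, and split the nodes by the sign of their entries. The slides do not give these details, so the construction is an assumption; the plain-Python power iteration with a diagonal shift is likewise just one simple way to obtain the eigenvector, and it assumes the graph has at least one edge.

```python
def leading_modularity_split(n, edges, iterations=500):
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1.0
        A[v][u] += 1.0
    k = [sum(row) for row in A]  # node degrees
    two_m = sum(k)
    B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]
    # Shifting by a bound on |eigenvalues| makes B + shift*I positive
    # semidefinite, so power iteration converges to B's top eigenvector.
    shift = max(sum(abs(x) for x in row) for row in B)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        new = [sum(B[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return [0 if x >= 0 else 1 for x in vec]  # group label by sign

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = leading_modularity_split(6, edges)
```

A numerical library's symmetric eigensolver would normally replace the hand-written iteration; it is spelled out here only to keep the example dependency-free.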

•  Note: Modularity is a relative measure for comparing community structure.
•  Not entirely clear in which cases it may or may not give good results
•  A modularity of 0.3 or more is sometimes considered to indicate a good clustering
•  Can be used as a stopping criterion (or for finding the right level of partitioning) in other methods
   – E.g. Girvan-Newman

Karate club hierarchical clustering

•  Shape of nodes gives the actual split in the club due to internal conflicts
   – Newman 2003

Overlapping communities

[Figures: non-overlapping versus overlapping communities]

Affiliation graph model

•  Generative model:
•  Each node belongs to some communities
•  If both a and b are in community c
   – Edge (a, b) is created with probability p_c

Affiliation graph model

•  Problem:
•  Given the network, recover:
   – Communities: C
   – Memberships or affiliations: M
   – Probabilities: p_c

Maximum likelihood estimation

•  Given data X
•  Assume the data is generated by some model f with parameters Θ
•  Express the probability P[f(X|Θ)] that f generates X, given specific values of Θ.
•  Compute argmax_Θ (P[f(X|Θ)])

MLE for AGM: The BIGCLAM method

•  Finding the best possible bipartite network is computationally hard (too many possibilities)
•  Instead, take a model where memberships are real numbers: membership strengths
   – F_uA: strength of membership of u in A
   – P_A(u, v) = 1 - exp(-F_uA · F_vA): each community links independently, by product of strengths
   – Total probability of an edge existing:
      •  P(u, v) = 1 - Π_c (1 - P_c(u, v))
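The two probability formulas can be evaluated directly. The sketch assumes the strengths are stored as a mapping F from each node to a list with one nonnegative number per community; that layout and the function names are choices made for the example.

```python
import math

def community_edge_prob(F, u, v, c):
    """P_c(u, v) = 1 - exp(-F_uc * F_vc)."""
    return 1.0 - math.exp(-F[u][c] * F[v][c])

def edge_prob(F, u, v):
    """P(u, v) = 1 - product over communities c of (1 - P_c(u, v))."""
    no_edge = 1.0
    for c in range(len(F[u])):
        no_edge *= 1.0 - community_edge_prob(F, u, v, c)
    return 1.0 - no_edge

# Three nodes, two communities; node 1 belongs (with some strength) to both.
F = {0: [1.0, 0.0], 1: [2.0, 0.5], 2: [0.0, 1.0]}
p01 = edge_prob(F, 0, 1)
p02 = edge_prob(F, 0, 2)
p12 = edge_prob(F, 1, 2)
```

Because each factor (1 - P_c) equals exp(-F_uc · F_vc), the product collapses to exp of minus the dot product of the two strength vectors, so P(u, v) = 1 - exp(-F_u · F_v). Nodes with no shared community, like 0 and 2 here, get probability zero.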

BIGCLAM

•  Find the F that maximizes the likelihood that exactly the right set of edges exist.
•  Details omitted
•  Optionally, see:
•  Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach, by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.

Correlation clustering

•  Some edges are known to be similar/friends/trusted
   – marked “+”
•  Some edges are known to be dissimilar/enemies/distrusted
   – marked “-”
•  Maximize the number of + edges inside clusters and
•  Maximize the number of - edges between clusters
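The objective can be made concrete by counting agreements for a given clustering. In this sketch signed edges are (u, v, sign) triples with sign "+" or "-", and `cluster_of` maps each node to a cluster id; both representations are assumptions for illustration.

```python
def agreements(signed_edges, cluster_of):
    """Count + edges inside clusters plus - edges between clusters."""
    count = 0
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == "+" and same) or (sign == "-" and not same):
            count += 1
    return count

def disagreements(signed_edges, cluster_of):
    return len(signed_edges) - agreements(signed_edges, cluster_of)

signed_edges = [(0, 1, "+"), (1, 2, "+"), (0, 2, "-"), (2, 3, "-")]
cluster_of = {0: 0, 1: 0, 2: 0, 3: 1}
agree = agreements(signed_edges, cluster_of)
disagree = disagreements(signed_edges, cluster_of)
```

In this example no clustering satisfies every edge: 0 and 2 are both friends of 1 but marked as enemies of each other, so at least one disagreement is unavoidable.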

Applications

•  Community detection based on similar people/users
•  Document clustering based on known similarity or dissimilarity between documents

Features

•  Clustering without the need to know the number of clusters
   – k-means, k-median, k-center etc. need to know the number of clusters or other parameters like a threshold
   – Number of clusters depends on network structure
•  Actually, does not need any parameter
•  NP-hard
•  Note that the graph may be complete or not complete
   – In some applications with unlabeled edges, it may be reasonable to change edges to “+” edges and non-edges to “-” edges

Approximation

•  Naive 1/2 approximation (not very useful):
   – If there are more + edges
      •  Put them all in 1 cluster
   – If there are more - edges
      •  Put nodes in n different clusters
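The naive rule is short enough to write out. The sketch uses the same assumed representation of signed edges as (u, v, sign) triples; ties between the two counts are sent to the single-cluster case, which is an arbitrary choice.

```python
def agreements(signed_edges, cluster_of):
    """Count + edges inside clusters plus - edges between clusters."""
    return sum(1 for u, v, sign in signed_edges
               if (sign == "+") == (cluster_of[u] == cluster_of[v]))

def naive_clustering(nodes, signed_edges):
    plus = sum(1 for _, _, sign in signed_edges if sign == "+")
    minus = len(signed_edges) - plus
    if plus >= minus:
        return {v: 0 for v in nodes}               # everything in 1 cluster
    return {v: i for i, v in enumerate(nodes)}     # n singleton clusters

mostly_plus = [(0, 1, "+"), (1, 2, "+"), (0, 2, "-")]
mostly_minus = [(0, 1, "-"), (1, 2, "-"), (0, 2, "+")]
one_cluster = naive_clustering([0, 1, 2], mostly_plus)
singletons = naive_clustering([0, 1, 2], mostly_minus)
```

Either way the output agrees with all edges of the majority sign, which is at least half of all edges and hence at least half of the best achievable number of agreements.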

Better approximations

•  2 ways of looking at it:
   – Maximize agreement or minimize disagreement
   – Similar idea, but we know different approximation algorithms
•  Nikhil Bansal et al. develop a PTAS (polynomial time approximation scheme) for maximizing agreement:
   – (1-ε) approximation, with running time depending on ε

Approximation

•  Min-disagree:
   – 4-approximation
