Clustering and community detection
Social and Technological Networks
Rik Sarkar
University of Edinburgh, 2017.
• Plan/proposal guidelines are up
• Office hours
  – Wednesdays 12:00–13:00
  – (May change in future. Always check the web page for times and announcements.)
Community detection
• Given a network
• What are the “communities”?
  – Closely connected groups of nodes
  – Relatively few edges to outside the community
• Similar to clustering in data sets
  – Group together points that are closer or more similar to each other than to other points
Community detection by clustering
• First, define a metric between nodes
  – Either compute intrinsic metrics like all-pairs shortest paths [Floyd-Warshall algorithm, O(n^3)]
  – Or embed the nodes in a Euclidean space, and use the metric there
    • We will later study embedding methods
• Apply a clustering algorithm with the metric
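As an illustration of the intrinsic-metric option, here is a minimal sketch of Floyd-Warshall on an unweighted, undirected graph. The function name, the node numbering 0..n-1 and the toy path graph are assumptions made for this example only.

```python
import math

def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall hop distances for an undirected graph on nodes 0..n-1."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v in edges:
        dist[u][v] = 1
        dist[v][u] = 1
    # Allow paths through intermediate node k, for each k in turn: O(n^3).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical toy graph: the path 0-1-2-3.
path_dist = all_pairs_shortest_paths(4, [(0, 1), (1, 2), (2, 3)])
```

The resulting matrix can then be handed to any clustering method that only needs pairwise distances.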
Clustering
• A core problem of machine learning:
  – Which items are in the same group?
• Identifies items that are similar relative to the rest of the data
• Simplifies information by grouping similar items
  – Helps in all types of other problems
Clustering
• Outline approach:
• Given a set of items
  – Define a distance between them
    • E.g. Euclidean distance between points in a plane; Euclidean distance between other attributes; non-Euclidean distances; path lengths in a network; tie strengths in a network…
  – Determine a grouping (partitioning) that optimises some function (prefers ‘close’ items in the same group).
• Reference for clustering:
  – Charu Aggarwal: Data Mining: The Textbook, Springer
    • Free on Springer site (from university network)
  – Blum et al. Foundations of Data Science (free online)
K-means clustering
• Find k clusters
  – With centers
  – That minimize the sum of squared distances of nodes to their cluster centers (called the k-means cost)
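Written out, the k-means cost can be expressed as follows. The symbols C_j (clusters) and c_j (their centers) are notation chosen here for illustration, not taken from the slides.

```latex
\mathrm{cost}(C_1,\dots,C_k;\, c_1,\dots,c_k)
  \;=\; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2
```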
K-means clustering: Lloyd’s algorithm
• There are n items
• Select k ‘centers’
  – May be k random locations in space
  – May be the locations of k of the items selected randomly
  – May be chosen according to some method
• Iterate till convergence:
  – Assign each item to the cluster for its closest center
  – Recompute the location of each center as the mean location of all elements in the cluster
  – Repeat
• Warning: Lloyd’s algorithm is a heuristic. It does not guarantee that the k-means cost is minimised
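The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation: points are assumed to be tuples of numbers, and the parameters `init`, `seed` and `max_iter` are hypothetical conveniences added for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centers, labels):
    """Sum of squared distances of points to their assigned centers."""
    return sum(sq_dist(p, centers[j]) for p, j in zip(points, labels))

def lloyd(points, k, init=None, seed=0, max_iter=100):
    """Lloyd's heuristic: alternate assignment and mean recomputation."""
    rng = random.Random(seed)
    # Default initialisation: k of the items chosen at random.
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_labels = lloyd(toy_points, 2, init=[(0, 0), (10, 10)])
```

Running it from several different initialisations and comparing `kmeans_cost` is one simple way to see that the result is only a local optimum.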
K-means
• Visualisations
• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
• http://shabal.in/visuals/kmeans/1.html
K-means
• Ward’s algorithm (also a heuristic)
  – Start with each node as its own cluster
  – At each round, find the two clusters whose merger increases the k-means cost the least
  – Merge these two clusters
  – Repeat until there are k clusters
K-means: discussion
• Tries to minimise the sum of squared distances of items to cluster centers
  – Computationally hard problem
  – Algorithm gives a local optimum
• Depends on initialisation (starting set of centers)
  – Can give poor results
  – Slow speed
• The right ‘k’ may be unknown
  – Possible strategy: try different possibilities and take the best
• Can be improved by heuristics like choosing centers carefully
  – E.g. choosing centers to be as far apart as possible: choose one, choose the point farthest from it, choose the point farthest from both (maximise the minimum distance to the existing set), etc.
  – Try multiple times and take the best result
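The "as far apart as possible" heuristic can be sketched as a farthest-first traversal. The function name and the `first` parameter (index of the starting point) are assumptions for this example.

```python
def farthest_first_centers(points, k, first=0):
    """Pick k centers: start at points[first], then repeatedly add the
    point whose distance to its nearest chosen center is largest."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [points[first]]
    # min_d[i] = squared distance from points[i] to its nearest chosen center
    min_d = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(points[idx])
        min_d = [min(d, sq_dist(p, points[idx])) for d, p in zip(min_d, points)]
    return centers

spread_centers = farthest_first_centers([(0, 0), (1, 0), (10, 0), (5, 0)], 3)
```

Such a starting set tends to be sensitive to outliers, which is one reason to combine it with multiple restarts.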
K-medoids
• Similar, but now each center must be one of the given items
  – In each cluster, find the item that is the best ‘center’ and repeat
• Useful when there is no ambient space (extrinsic metric)
  – E.g. a distance can be computed between nodes, but they are not in any particular Euclidean space, so the ‘center’ is not a meaningful point
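A minimal sketch of this idea, assuming only a pairwise distance function `dist` and distinct items. Here "best center" is taken to mean the item with the smallest sum of distances to the rest of its cluster, which is one common choice rather than the only one; all names are illustrative.

```python
def medoid(items, dist):
    """Item of the cluster with the smallest total distance to the others."""
    return min(items, key=lambda c: sum(dist(c, x) for x in items))

def assign(items, medoids, dist):
    """Group items by nearest medoid; clusters are aligned with `medoids`."""
    clusters = [[] for _ in medoids]
    for x in items:
        j = min(range(len(medoids)), key=lambda i: dist(x, medoids[i]))
        clusters[j].append(x)
    return clusters

def k_medoids(items, dist, medoids, max_iter=100):
    """Alternate between assignment and medoid update until stable."""
    medoids = list(medoids)
    for _ in range(max_iter):
        new_medoids = [medoid(c, dist) for c in assign(items, medoids, dist)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(items, medoids, dist)

# Toy example: items on a line, distance = absolute difference.
final_medoids, final_clusters = k_medoids(
    [0, 1, 2, 10, 11, 12], lambda a, b: abs(a - b), medoids=[0, 2])
```

Because only `dist` is used, the same code works with shortest-path distances between network nodes.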
Other center based methods
• K-center: Minimise the maximum distance to a center
• K-median: Minimise the sum of distances to centers
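The two objectives can be written as follows. The notation is chosen here for illustration: X is the set of items, c_1, …, c_k are the centers and d is the distance.

```latex
\text{k-center:}\quad \min_{c_1,\dots,c_k}\; \max_{x \in X}\; \min_{1 \le j \le k} d(x, c_j)
\qquad
\text{k-median:}\quad \min_{c_1,\dots,c_k}\; \sum_{x \in X}\; \min_{1 \le j \le k} d(x, c_j)
```

Compared with k-means, k-median uses plain rather than squared distances, so far-away items weigh less heavily.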
Hierarchical clustering
• Hierarchically group items
Hierarchical clustering
• Top down (divisive):
  – Start with everything in 1 cluster
  – Make the best division, and repeat in each subcluster
• Bottom up (agglomerative):
  – Start with n different clusters
  – Merge two at a time by finding pairs that give the best improvement
Hierarchical clustering
• Gives many options for a flat clustering
• Problem: what is a good ‘cut’ of the dendrogram?
Density based clustering
• Group dense regions together
• Better at non-linear separations
• Works with unknown number of clusters
DBSCAN
• Density at a data point:
  – Number of data points within radius Eps
• A core point:
  – Point with density at least τ
• Border point
  – Density less than τ, but at least one core point within radius Eps
• Noise point
  – Neither core nor border. Far from dense regions
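The three point types can be computed directly from these definitions. The sketch below covers only the classification step, not the full DBSCAN clustering. It assumes Euclidean distance and that a point counts itself in its own density; the function name is made up for the example.

```python
def classify_points(points, eps, tau):
    """Label each point as 'core', 'border' or 'noise'."""
    def close(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2

    # Density: number of points within radius eps (the point itself included).
    density = [sum(1 for q in points if close(p, q)) for p in points]
    is_core = [d >= tau for d in density]
    kinds = []
    for i, p in enumerate(points):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] and close(p, q) for j, q in enumerate(points)):
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

toy = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
toy_kinds = classify_points(toy, eps=1.0, tau=3)
```

This brute-force version takes O(n^2) distance checks; practical implementations typically use a spatial index for the neighbourhood queries.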
DBSCAN: Discussion
• Requires knowledge of suitable radius and density parameters (Eps and τ)
• Does not allow for the possibility that different clusters may have different densities
Density based clustering
• Single linkage (same as Kruskal’s MST algorithm)
  – Start with n clusters
  – Merge the two clusters with the shortest bridging link
  – Repeat until k clusters
• Other, more robust methods exist
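Single linkage can be sketched as Kruskal's algorithm stopped early: process pairs in order of increasing distance and merge with a union-find structure until k clusters remain. The function name and the toy data are assumptions for illustration.

```python
def single_linkage(points, k, dist):
    """Merge clusters along the shortest bridging links until k remain."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    clusters = n
    for _, i, j in pairs:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:              # link bridges two different clusters
            parent[ri] = rj
            clusters -= 1

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())

line_clusters = single_linkage([0, 1, 2, 10, 11, 20], 3, lambda a, b: abs(a - b))
```

Letting the loop run down to one cluster produces exactly the edges of a minimum spanning tree.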
Communities
• Groups of friends
• Colleagues/collaborators
• Web pages on similar topics
• Biological reaction groups
• Similar customers/users…
Other applications
• A coarser representation of networks
• One or more meta-nodes for each community
• Identify bridges/weak links
• Structural holes
Community detection in networks
• A simple strategy:
  – Choose a suitable distance measure based on available data
    • E.g. path lengths; distance based on inverse tie strengths; size of largest enclosing group or common attribute; distance in a spectral (eigenvector) embedding; etc.
  – Apply a standard clustering algorithm
Clustering is not always suitable in networks
• Small world networks have small diameter
  – And sometimes integer distances
  – A distance based method does not have a lot of options to represent similarities/dissimilarities
• High degree nodes are common
  – Connect different communities
  – Hard to separate communities
• Edge densities vary across the network
  – Same threshold does not work well everywhere
Definitions of communities
• Varies, depending on application
• General idea: dense subgraphs: more links within the community, few links to outside
• Some types and considerations:
  – Partitions: each node in exactly one community
  – Overlapping: each node can be in multiple communities
Finding dense subgraphs is hard in general
• Finding the largest clique
  – NP-hard
  – Computationally intractable
  – Polynomial time (efficient) algorithms unlikely to exist
• Decision version: does a clique of size k exist?
  – NP-complete
  – Computationally intractable
  – Polynomial time (efficient) algorithms unlikely to exist
Dense subgraphs: a few preliminary definitions
• For S, T subsets of V
• e(S,T): set of edges from S to T
  – e(S) = e(S,S): edges within S
• d_S(v): number of edges from v to S
• Edge density of S: |e(S)|/|S|
  – Largest for complete graphs or cliques
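These definitions translate directly into code. The sketch below assumes an undirected simple graph given as a list of (u, v) pairs; the function names are made up for the example.

```python
def edges_within(edges, S):
    """e(S): edges with both endpoints in S."""
    S = set(S)
    return [(u, v) for u, v in edges if u in S and v in S]

def degree_into(edges, v, S):
    """d_S(v): number of edges from v to nodes of S."""
    S = set(S)
    return sum(1 for a, b in edges
               if (a == v and b in S) or (b == v and a in S))

def edge_density(edges, S):
    """|e(S)| / |S|."""
    return len(edges_within(edges, S)) / len(S)

# Toy graph: a 4-clique on {0,1,2,3} plus a pendant node 4 attached to 3.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
clique_density = edge_density(toy_edges, {0, 1, 2, 3})
whole_density = edge_density(toy_edges, {0, 1, 2, 3, 4})
```

In the toy graph the clique alone is denser than the whole graph, since the pendant node adds one node but only one edge.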
Dense subgraph
• The subgraph with the largest edge density
• There also exists a decision version:
  – Is there a subgraph with edge density > α?
• Can be solved using max-flow algorithms
  – O(n^2 m): inefficient on large data sets
  – Finds the one densest subgraph
• Variant: find the densest S containing a given subset X
• Other versions: find subgraphs of size k or less
  – NP-hard
Efficient approximation for finding dense S containing X
• Gives a 1/2 approximation
• Edge density of the output set S is at least half that of the optimal set S*
• (Proof in Kempe 2011.)
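The algorithm itself is not spelled out here. A common greedy strategy with a guarantee of this kind is "peeling": repeatedly remove the node outside X that has the fewest edges into the current set, and keep the densest set seen. The sketch below assumes that strategy. Whether it matches the intended algorithm exactly is an assumption, and all names are illustrative.

```python
def greedy_dense_subgraph(nodes, edges, X=()):
    """Greedy peeling for a dense set containing X (simple undirected graph)."""
    X = set(X)
    S = set(nodes)
    adj = {v: set() for v in S}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = sum(len(adj[v]) for v in S) // 2      # edges inside the current S
    best_set, best_density = set(S), m / len(S)
    while S - X:
        # Remove the removable node with the fewest edges into S.
        v = min(S - X, key=lambda u: len(adj[u] & S))
        m -= len(adj[v] & S)
        S.remove(v)
        if S and m / len(S) > best_density:
            best_set, best_density = set(S), m / len(S)
    return best_set, best_density

# Toy graph: 4-clique {0,1,2,3}, then a tail 3-4-5.
tail_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
dense_set, dense_value = greedy_dense_subgraph(range(6), tail_edges)
forced_set, forced_value = greedy_dense_subgraph(range(6), tail_edges, X={5})
```

On the toy graph the unconstrained run peels off the tail and returns the clique, while forcing node 5 into the set makes the whole graph the best choice.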
Modularity
• We want to find the many communities, not just one
• Clustering a graph
• Problem: what is the right clustering?
• Idea: maximize a quantity called modularity
Modularity of subset S
• Given graph G
• Consider a random graph G’ with the same node degrees (remember the configuration model)
  – Number of edges in S in G: |e(S)|_G
  – Expected number of edges in S in G’: E[|e(S)|_G’]
  – Modularity of S: |e(S)|_G - E[|e(S)|_G’]
  – More coherent communities have more edges inside than would be expected in a random graph with the same degrees
  – Note: modularity can be negative
Modularity of a clustering
• Take a partition (clustering) of V into sets S_i
• Write d(S_i) for the sum of degrees of all nodes in S_i
• Can be shown that E[|e(S_i)|_G’] ~ d(S_i)^2
• Definition: sum over the partition
Modularity based clustering
• Modularity is meant for use more as a measure of quality, not so much as a clustering method
• Finding the clustering with highest modularity is NP-hard
• Heuristic:
  – Use the modularity matrix
  – Take its first eigenvector
• Note: modularity is a relative measure for comparing community structure.
• Not entirely clear in which cases it may or may not give good results
• A modularity value of 0.3 or more is sometimes taken to indicate a good clustering
• Can be used as a stopping criterion (or for finding the right level of partitioning) in other methods
  – E.g. Girvan-Newman
Karate club hierarchical clustering
• Shape of nodes gives the actual split in the club due to internal conflicts
  – Newman 2003
Overlapping communities
[Figures: non-overlapping communities; overlapping communities]
Affiliation graph model
• Generative model:
• Each node belongs to some communities
• If both a and b are in community c
  – Edge (a, b) is created with probability p_c
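The generative process can be sketched as follows. The data layout (a dict from community to member list and a dict of per-community probabilities) and the function name are assumptions for this example.

```python
import random
from itertools import combinations

def sample_agm(memberships, p, seed=0):
    """Sample an edge set: each community c links each pair of its
    members independently with probability p[c]."""
    rng = random.Random(seed)
    edges = set()
    for c, members in memberships.items():
        for a, b in combinations(sorted(members), 2):
            if rng.random() < p[c]:
                edges.add((a, b))   # an edge may be created by several communities
    return edges

# Toy affiliations: node 2 belongs to both communities.
toy_memberships = {"A": [0, 1, 2], "B": [2, 3]}
toy_graph = sample_agm(toy_memberships, {"A": 1.0, "B": 0.0})
```

Pairs that share several communities get several independent chances of being linked, so overlaps tend to be denser.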
Affiliation graph model
• Problem:
• Given the network, recover:
  – Communities: C
  – Memberships or affiliations: M
  – Probabilities: p_c
Maximum likelihood estimation
• Given data X
• Assume the data is generated by some model f with parameters Θ
• Express the probability P[f(X|Θ)]: f generates X, given specific values of Θ.
• Compute argmax_Θ (P[f(X|Θ)])
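A tiny illustration of the recipe, using a model chosen here purely for simplicity: independent coin flips with unknown heads probability θ. The likelihood is evaluated on a grid and the maximiser is returned; the function names and the grid search are assumptions for the example.

```python
def bernoulli_likelihood(data, theta):
    """P[data | theta] for independent 0/1 outcomes with P[1] = theta."""
    prob = 1.0
    for x in data:
        prob *= theta if x == 1 else (1.0 - theta)
    return prob

def grid_mle(data, steps=100):
    """argmax of the likelihood over theta in {0, 1/steps, ..., 1}."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: bernoulli_likelihood(data, t))

theta_hat = grid_mle([1, 1, 0, 1])
```

For this model the maximiser is simply the fraction of ones in the data, which the grid search recovers.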
MLE for AGM: The BIGCLAM method
• Finding the best possible bipartite network is computationally hard (too many possibilities)
• Instead, take a model where memberships are real numbers: membership strengths
  – F_uA: strength of membership of u in A
  – P_A(u,v) = 1 - exp(-F_uA · F_vA): each community links independently, by product of strengths
  – Total probability of an edge existing:
    • P(u,v) = 1 - Π_C (1 - P_C(u,v))
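The edge probability can be computed directly from the two formulas. The sketch assumes each node's strengths are stored as a dict from community to a non-negative number, with missing communities treated as strength 0; the function name is illustrative.

```python
import math

def edge_probability(F_u, F_v):
    """P(u,v) = 1 - prod_C (1 - P_C(u,v)), with P_C = 1 - exp(-F_uC * F_vC)."""
    p_no_edge = 1.0
    for c in F_u.keys() & F_v.keys():   # other communities contribute P_C = 0
        p_c = 1.0 - math.exp(-F_u[c] * F_v[c])
        p_no_edge *= 1.0 - p_c
    return 1.0 - p_no_edge

p_uv = edge_probability({"A": 1.0, "B": 0.5}, {"A": 2.0, "B": 2.0})
```

Since each factor 1 - P_C equals exp(-F_uC · F_vC), the product collapses to P(u,v) = 1 - exp(-Σ_C F_uC · F_vC), i.e. one minus the exponential of the negated dot product of the two strength vectors.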
BIGCLAM
• Find the F that maximizes the likelihood that exactly the right set of edges exist.
• Details omitted
• Optionally, see:
• Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach, by J. Yang and J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.
Correlation clustering
• Some edges are known to be similar/friends/trusted
  – marked “+”
• Some edges are known to be dissimilar/enemies/distrusted
  – marked “-”
• Maximize the number of + edges inside clusters and
• Maximize the number of - edges between clusters
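The objective can be made concrete by counting "agreements": + edges inside clusters plus - edges between clusters. The data layout (a dict from node pair to sign, and a dict from node to cluster id) and the function name are assumptions for this example.

```python
def agreements(signed_edges, cluster_of):
    """Count + edges inside clusters and - edges between clusters."""
    count = 0
    for (u, v), sign in signed_edges.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == "+" and same) or (sign == "-" and not same):
            count += 1
    return count

toy_signs = {(0, 1): "+", (1, 2): "+", (0, 2): "-", (2, 3): "-"}
toy_clustering = {0: 0, 1: 0, 2: 0, 3: 1}
n_agree = agreements(toy_signs, toy_clustering)
n_disagree = len(toy_signs) - n_agree
```

In the toy instance no clustering can satisfy all four edges, because 0 and 2 are linked by + edges through node 1 but marked - directly.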
Applications
• Community detection based on similar people/users
• Document clustering based on known similarity or dissimilarity between documents
Features
• Clustering without need to know the number of clusters
  – k-means, k-medians, k-centers etc. need to know the number of clusters or other parameters like a threshold
  – Number of clusters depends on network structure
• Actually, does not need any parameter
• NP-hard
• Note that the graph may be complete or not complete
  – In some applications with unlabeled edges, it may be reasonable to change edges to “+” edges and non-edges to “-” edges
Approximation
• Naive 1/2 approximation (not very useful):
  – If there are more + edges
    • Put them all in 1 cluster
  – If there are more - edges
    • Put nodes in n different clusters
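The naive rule is short enough to write out. It agrees with at least half of all signed edges, and no clustering can agree with more than all of them, which is where the factor 1/2 comes from. The function name, data layout and tie-breaking (a tie goes to the single cluster) are assumptions for this example.

```python
def naive_half_approx(nodes, signed_edges):
    """One big cluster if + edges are in the majority, else all singletons."""
    plus = sum(1 for s in signed_edges.values() if s == "+")
    minus = len(signed_edges) - plus
    if plus >= minus:
        return {v: 0 for v in nodes}               # every + edge agrees
    return {v: i for i, v in enumerate(nodes)}     # every - edge agrees

mostly_minus = {(0, 1): "+", (1, 2): "-", (0, 2): "-", (2, 3): "-"}
singletons = naive_half_approx([0, 1, 2, 3], mostly_minus)
mostly_plus = {(0, 1): "+", (1, 2): "+", (0, 2): "-"}
one_cluster = naive_half_approx([0, 1, 2], mostly_plus)
```

The output ignores the network structure entirely, which is why the guarantee is of little practical use.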
Better approximations
• 2 ways of looking at it:
  – Maximize agreement or minimize disagreement
  – Similar idea, but we know different approximation algorithms
• Nikhil Bansal et al. developed a PTAS (polynomial time approximation scheme) for maximizing agreement:
  – (1-ε) approximation, running time
Approximation
• Min-disagree:
  – 4-approximation