Local Sparsification for Scalable Module Identification in Networks
Srinivasan Parthasarathy
Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang
Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University


Page 2:

“Every 2 days we create as much information as we did up to 2003”
- Eric Schmidt, former Google CEO

The Data Deluge

2

Page 3:

3

$600 buys a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]

Data Storage Costs are Low

Page 4:

4

Data does not exist in isolation.

Page 5:

5

Data almost always exists in connection with other data.

Page 6:

6

Social networks, protein interactions, the Internet,
VLSI networks, data dependencies, neighborhood graphs

Page 7:

7

All this data is only useful if we can scalably extract useful knowledge from it.

Page 8:

8

Challenges

1. Large Scale

Billion-edge graphs commonplace

Scalable solutions are needed

Page 9:

9

Challenges

2. Noise

Spurious links on the web, noisy protein interactions

Need mechanisms to alleviate noise

Page 10:

10

Challenges

3. Novel structure

Hub nodes, small-world phenomena, clusters of varying densities and sizes, directionality

Novel algorithms or techniques are needed

Page 11:

11

Challenges

4. Domain Specific Needs

E.g., balance, constraints, etc.

Need mechanisms to specify

Page 12:

12

Challenges

5. Network Dynamics

How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?

Page 13:

13

Challenges

6. Cognitive Overload

Need to support guided interaction for the human in the loop

Page 14:

14

Our Vision and Approach

Graph Pre-processing
• Sparsification (SIGMOD ’11, WebSci ’12)
• Near Neighbor Search, for non-graph data (PVLDB ’12)
• Symmetrization, for directed graphs (EDBT ’10)

Core Clustering
• Consensus Clustering (KDD ’06, ISMB ’07)
• Viewpoint Neighborhood Analysis (KDD ’09)
• Graph Clustering via Stochastic Flows (KDD ’09, BCB ’10)

Dynamic Analysis and Visualization
• Event-Based Analysis (KDD ’07, TKDD ’09)
• Network Visualization (KDD ’08)
• Density Plots (SIGMOD ’08, ICDE ’12)

Scalable Implementations and Systems Support on Modern Architectures
• Multicore systems (VLDB ’07, VLDB ’09), GPUs (VLDB ’11), STI Cell (ICS ’08), clusters (ICDM ’06, SC ’09, PPoPP ’07, ICDE ’10)

Application Domains
• Bioinformatics (ISMB ’07, ISMB ’09, ISMB ’12, ACM BCB ’11, BMC ’12)
• Social network and social media analysis (TKDD ’09, WWW ’11, WebSci ’12)

Page 15:

15

Graph Sparsification for Community Discovery

SIGMOD ’11, WebSci’12

Page 16:

16

Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

Page 17:

17

Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.

Graph Clustering and Community Discovery

Page 18:

18

Social Network and Graph Compression

Direct Analytics on compressed representation

Graph Clustering: Applications

Page 19:

19

Optimize VLSI layout

Graph Clustering: Applications

Page 20:

20

Protein function prediction

Graph Clustering: Applications

Page 21:

21

Data distribution to minimize communication and balance load

Graph Clustering: Applications

Page 22:

22

Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

Page 23:

23

Preview

Original vs. Sparsified

[Automatically visualized using Prefuse]

Page 24:

24

The promise

Clustering algorithms can run much faster, and be more accurate, on a sparsified graph.

Ditto for network visualization.

Page 25:

25

Utopian Objective

Retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

Page 26:

26

A way to rank edges on “strength” or similarity.

Page 27:

27

Algorithm: Global Sparsification (G-Spar)

Parameter: sparsification ratio, s

1. For each edge (i, j): compute Sim(i, j)
2. Retain the top s% of edges in order of Sim; discard the others
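As a minimal sketch of the global scheme, assuming Sim(i, j) is the Jaccard similarity of the two endpoints' neighbor sets (the adjacency-set layout and function names here are ours, not from the talk):

```python
# Sketch of G-Spar: rank every edge globally by similarity, keep the top s.

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def g_spar(adj, s):
    """Globally rank all edges by Sim and keep the top s fraction."""
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    edges.sort(key=lambda e: jaccard(adj[e[0]], adj[e[1]]), reverse=True)
    return set(edges[:max(1, int(s * len(edges)))])
```

On two triangles joined by a single bridge edge, the bridge's endpoints share no neighbors, so the bridge scores 0 and is the first edge discarded.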

Page 28:

28

Dense clusters are over-represented, sparse clusters under-represented.

Works great when the goal is just to find the top communities.

Page 29:

29

Algorithm: Local Sparsification (L-Spar)

Parameter: sparsification exponent, e (0 < e < 1)

1. For each node i of degree d_i:
   (a) For each neighbor j: compute Sim(i, j)
   (b) Retain the top (d_i)^e neighbors in order of Sim, for node i

Underscoring the importance of local ranking
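A sketch of the local scheme under the same assumptions (Jaccard standing in for Sim; names are illustrative). Each node ranks only its own neighbors, which is what makes the ranking local:

```python
import math

# Sketch of L-Spar: node i keeps its top ceil(d_i ** e) most similar
# neighbors, so clusters of all densities keep some representation.

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def l_spar(adj, e):
    """Union, over all nodes, of each node's locally top-ranked edges."""
    kept = set()
    for i, nbrs in adj.items():
        ranked = sorted(nbrs, key=lambda j: jaccard(adj[i], adj[j]), reverse=True)
        for j in ranked[:max(1, math.ceil(len(nbrs) ** e))]:
            kept.add((min(i, j), max(i, j)))   # canonical undirected edge
    return kept
```

An edge survives if either endpoint ranks the other highly, so hubs are sparsified aggressively while low-degree nodes keep most of their edges.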

Page 30:

30

Ensures representation of clusters of varying densities

Page 31:

31

But...

Similarity computation is expensive!

Page 32:

32

A randomized, approximate solution based on Minwise Hashing [Broder et al., 1998]

Page 33:

33

Minwise Hashing

Universe: { dog, cat, lion, tiger, mouse }

Permutation π1: [ cat, mouse, lion, dog, tiger ]
Permutation π2: [ lion, cat, mouse, dog, tiger ]

A = { mouse, lion }

mh1(A) = element of A appearing earliest in π1 = mouse
mh2(A) = element of A appearing earliest in π2 = lion
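The example above, in code: each permutation is a fixed ordering of the universe, and mh(A) returns the element of A that appears earliest in that ordering (a small illustrative sketch):

```python
# Minwise hash of a set A under a permutation of the universe:
# the member of A that the permutation ranks first.

def minhash(perm, A):
    """First element of the permutation that belongs to A."""
    for x in perm:
        if x in A:
            return x

perm1 = ["cat", "mouse", "lion", "dog", "tiger"]
perm2 = ["lion", "cat", "mouse", "dog", "tiger"]
A = {"mouse", "lion"}
# mh1(A) -> "mouse" (mouse precedes lion in perm1)
# mh2(A) -> "lion"  (lion comes first in perm2)
```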

Page 34:

34

Key Fact

For two sets A, B, and a min-hash function mh_i():

    Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Sim(A, B)

Unbiased estimator for Sim using k hashes:

    Sim^(A, B) = (1/k) · Σ_{i=1..k} 1[ mh_i(A) = mh_i(B) ]
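The estimator can be checked empirically. A sketch that draws k random permutations of A ∪ B (permuting any superset works, since only the minimum over A ∪ B decides a collision) and counts minhash collisions; the `seed` parameter is an illustrative addition for reproducibility:

```python
import random

# Unbiased Jaccard estimate: fraction of k random permutations on which
# A and B share the same minhash. Converges to |A ∩ B| / |A ∪ B|.

def estimate_jaccard(A, B, k, seed=0):
    rng = random.Random(seed)
    universe = sorted(A | B)   # any superset of A and B works
    hits = 0
    for _ in range(k):
        perm = universe[:]
        rng.shuffle(perm)
        mh = lambda S: next(x for x in perm if x in S)
        hits += (mh(A) == mh(B))
    return hits / k
```

For A = {1, 2, 3, 4} and B = {3, 4, 5, 6} the true Jaccard is 2/6 ≈ 0.33, and the estimate approaches it as k grows.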

Page 35:

35

Time complexity using Minwise Hashing

O(m · k), for m edges and k hashes.

Only 2 sequential passes over the input. Great for disk-resident data.

Note: exact similarity is less important; we really just care about the relative ranking, so a lower k suffices.

Page 36:

Theoretical Analysis of L-Spar: Main Results

• Q: Why choose top de edges for a node of degree d?

• A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control degree of sparsification.

• Proposition: If the input graph has a power-law degree distn. with exponent α, then the sparsified graph also has a power-law degree distn., with exponent (α − 1)/e + 1.

• Corollary: The sparsification ratio corresponding to exponent e is no more than (α − 2)/(α − e − 1).

• For α = 2.1 and e = 0.5, ~17% of edges will be retained.

• Higher α (steeper power laws) and/or lower e leads to more sparsification.
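Assuming the corollary's bound as reconstructed above (the exact form is inferred from the slide's α = 2.1, e = 0.5 ⇒ ~17% example), the arithmetic checks out:

```python
# Upper bound on the fraction of edges retained by L-Spar, for a power-law
# degree exponent alpha and sparsification exponent e (reconstructed formula).

def sparsification_ratio(alpha, e):
    return (alpha - 2) / (alpha - e - 1)

# alpha = 2.1, e = 0.5  ->  0.1 / 0.6, i.e. roughly 17% of edges retained
```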

Page 37:

Experiments

• Datasets
  • 3 PPI networks (BioGRID, DIP, Human)
  • 2 information networks (Wiki, Flickr) and 2 social networks (Orkut, Twitter)
  • Largest network (Orkut): roughly a billion edges
  • Ground truth available for the PPI networks and Wiki
• Clustering algorithms
  • Metis [Karypis & Kumar ’98], MLR-MCL [Satuluri & Parthasarathy ’09], Metis+MQI [Lang & Rao ’04], Graclus [Dhillon et al. ’07], spectral methods [Shi ’00], edge-based agglomerative/divisive methods [Newman ’04]
• Compared sparsifications
  • L-Spar, G-Spar, RandomEdge and ForestFire

Page 38:

38

Results Using Metis

Dataset (n, m)            Spars. Ratio   Random (Speed / Quality)   G-Spar (Speed / Quality)   L-Spar (Speed / Quality)
Yeast_Noisy (6k, 200k)    17%            11x / -10%                 30x / -15%                 25x / +11%
Wiki (1.1M, 53M)          15%            8x / -26%                  104x / -24%                52x / +50%
Orkut (3M, 117M)          17%            13x / +20%                 30x / +60%                 36x / +60%

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM ]

Page 39:

39

Results Using Metis (same table as on the previous page)

Same sparsification ratio for all 3 methods.

Page 40:

40

Results Using Metis (same table as on the previous page)

Good speedups, but typically a loss in quality.

Page 41:

41

Results Using Metis (same table as on the previous page)

Great speedups and quality.

Page 42:

42

L-Spar: Results Using MLR-MCL

Dataset (n, m)            Spars. Ratio   L-Spar (Speed / Quality)
Yeast_Noisy (6k, 200k)    17%            17x / +4%
Wiki (1.1M, 53M)          15%            23x / -4.5%
Orkut (3M, 117M)          17%            22x / 0%

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM ]

Page 43:

L-Spar: Qualitative Examples

Node: Graph (Wiki article)
  Retained neighbors: Graph Theory, Adjacency list, Adjacency matrix, Model theory
  Discarded neighbors: Tessellation, Roman letters used in Mathematics, Morphism

Node: Jack Dorsey (Twitter user, and co-founder)
  Retained neighbors: Biz Stone, Evan Williams, Jason Goldman, Sarah Lacy (Twitter executives, Silicon Valley figures)
  Discarded neighbors: Alyssa Milano, JetBlue Airways, WholeFoods, Parul Sharma

Node: Gladiator (Flickr tag)
  Retained neighbors: colosseum, world-heritage, site, italy
  Discarded neighbors: europe, travel, canon, sky, summer

Page 44:

44

Impact of Sparsification on Noisy Data

As the graphs get noisier, L-Spar is increasingly beneficial.

Page 45:

Impact of Sparsification on Spectrum: Yeast PPI

Page 46:

Impact of Sparsification on Spectrum: Epinions

Global sparsification results in multiple components.
Local sparsification seems to match the trends of the original graph.

Page 47:

Impact of Sparsification on Spectrum: Human PPI

Page 48:

Impact of Sparsification on Spectrum: Flickr

Page 49:

Anatomy of a density plot

49

The plot shows some measure of density against a specific ordering of the vertices in the graph.

Page 50:

Density Overlay Plots

50

Visual comparison of global vs. local sparsification

Page 51:

51

Summary

Sparsification: Simple pre-processing that makes a big difference

• Only tens of seconds to execute on multi-million-node graphs.

• Reduces clustering time from hours down to minutes.

• Improves accuracy by removing noisy edges for several algorithms.

• Helps visualization

• Ongoing and future work

• Spectral results suggest one might be able to provide a theoretical rationale. Can we tease it out?

• Investigate other kinds of graphs, incorporating content, novel applications (e.g. wireless sensor networks, VLSI design)

Page 52:

52

Prior Work

• Random edge sampling [Karger ’94]

• Sampling in proportion to effective resistances: good guarantees, but very slow [Spielman and Srivastava ’08]

• Matrix sparsification [Arora et al. ’06]: fast, but the same as random sampling in the absence of weights.

Page 53:

Topological Measures

53

Page 54:

Modularity (from Wikipedia)

54

• Modularity is the fraction of the edges that fall within the given groups, minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2, 1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
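The definition above translates directly into a computation; a minimal sketch over an undirected edge list (function and variable names are illustrative):

```python
# Modularity Q of a partition: (fraction of edges inside groups) minus
# (expected such fraction if edges were wired at random, given degrees).

def modularity(edges, community):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # observed fraction of intra-group edges
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # expected fraction under the degree-preserving random model:
    # each group g contributes (sum of degrees in g / 2m) squared
    expected = 0.0
    for g in set(community.values()):
        dg = sum(d for n, d in deg.items() if community[n] == g)
        expected += (dg / (2 * m)) ** 2
    return inside - expected
```

For two triangles joined by one bridge edge, split into the two obvious groups: 6 of 7 edges fall inside groups and each group holds half the total degree, so Q = 6/7 − 2·(1/2)² ≈ 0.357 (positive, as the definition predicts for a good split).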

Page 55:

55

Page 56:

56

Page 57:

The MCL algorithm

Input: adjacency matrix A. Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I) D^-1

Repeat until converged:
  Expand: M := M * M
  (enhances flow to well-connected nodes, i.e., nodes within a community)
  Inflate: M := M .^ r (r usually 2), then renormalize columns
  (increases inequality in each column: “rich get richer, poor get poorer”, reducing flow across communities)
  Prune
  (saves memory by removing entries close to zero; enables faster convergence)

Output clusters.

Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster labels.
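The expand/inflate/prune loop can be sketched with NumPy. The pruning threshold, convergence tolerance, and cluster-label extraction via per-column argmax are illustrative choices, not the exact MCL implementation:

```python
import numpy as np

# Sketch of the MCL loop: expand (M*M), inflate (elementwise power r plus
# column renormalization), prune near-zero entries, repeat to convergence.

def mcl(A, r=2.0, tol=1e-6, prune_eps=1e-5, max_iter=100):
    n = A.shape[0]
    M = A + np.eye(n)                 # add self-loops: A + I
    M = M / M.sum(axis=0)             # canonical transition matrix (A+I)D^-1
    for _ in range(max_iter):
        prev = M.copy()
        M = M @ M                     # Expand: flow along 2-step walks
        M = M ** r                    # Inflate: boost strong entries per column
        M[M < prune_eps] = 0.0        # Prune: drop entries close to zero
        M = M / M.sum(axis=0)         # renormalize columns
        if np.abs(M - prev).max() < tol:
            break
    # Nodes flowing to the same sink get the same cluster label
    return [int(M[:, j].argmax()) for j in range(n)]
```

On two triangles joined by a single bridge edge, the two triangles come out as separate clusters.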