102
CMU SCS Large Graph Mining Christos Faloutsos CMU

Large Graph Mining

  • Upload
    overton

  • View
    33

  • Download
    2

Embed Size (px)

DESCRIPTION

Large Graph Mining. Christos Faloutsos CMU. Thank you!. Dr. Yan Liu Dr. Jimeng Sun. Outline. Introduction – Motivation Problem#1: Patterns in graphs Problem#2: Generators Problem#3: Scalability Conclusions. Graphs - why should we care?. Internet Map [lumeta.com]. - PowerPoint PPT Presentation

Citation preview

Page 1: Large Graph  Mining

CMU SCS

Large Graph Mining

Christos Faloutsos

CMU

Page 2: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 2

Thank you!

• Dr. Yan Liu

• Dr. Jimeng Sun

Page 3: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 3

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Generators

• Problem#3: Scalability

• Conclusions

Page 4: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 4

Graphs - why should we care?

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

Page 5: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 5

Graphs - why should we care?

• IR: bi-partite graphs (doc-terms)

• web: hyper-text graph

• ... and more:

D1

DN

T1

TM

... ...

Page 6: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 6

Graphs - why should we care?

• network of companies & board-of-directors members

• ‘viral’ marketing

• web-log (‘blog’) news propagation

• computer network security: email/IP traffic and anomaly detection

• ....

Page 7: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 7

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs– Weighted graphs– Time evolving graphs

• Problem#2: Generators

• Problem#3: Scalability

• Conclusions

Page 8: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 8

Problem #1 - network and graph mining

• How does the Internet look like?• How does the web look like?• What is ‘normal’/‘abnormal’?• which patterns/laws hold?

Page 9: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 9

Graph mining

• Are real graphs random?

Page 10: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 10

Laws and patterns• Are real graphs random?• A: NO!!

– Diameter– in- and out- degree distributions– other (surprising) patterns

• So, let’s look at the data • (‘it is amazing what you hear, when you

listen’)

Page 11: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 11

Solution# S.1

• Power law in the degree distribution [SIGCOMM99]

log(rank)

log(degree)

-0.82

internet domains

att.com

ibm.com

Page 12: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 12

Solution# S.2: Eigen Exponent E

• A2: power law in the eigenvalues of the adjacency matrix

E = -0.48

Exponent = slope

Eigenvalue

Rank of decreasing eigenvalue

May 2001

Page 13: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 13

Solution# S.2: Eigen Exponent E

• [Mihail, Papadimitriou ’02]: slope is ½ of rank exponent

E = -0.48

Exponent = slope

Eigenvalue

Rank of decreasing eigenvalue

May 2001

Page 14: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 14

But:

How about graphs from other domains?

Page 15: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 16

More power laws:

• web hit counts [w/ A. Montgomery]

Web Site Traffic

log(in-degree)

log(count)

Zipf

userssites

``ebay’’

Page 16: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 17

epinions.com• who-trusts-whom

[Richardson + Domingos, KDD 2001]

(out) degree

count

trusts-2000-people user

Page 17: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 18

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs

• degree, diameter, eigen*,

• triangles

• cliques

– Weighted graphs– Time evolving graphs

• Problem#2: Generators

Page 18: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 19

Solution# S.3: Triangle ‘Laws’

• Real social networks have a lot of triangles

Page 19: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 20

Triangle ‘Laws’

• Real social networks have a lot of triangles– Friends of friends are friends

• Any patterns?

Page 20: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 21

Triangle Law: #S.3 [Tsourakakis ICDM 2008]

ASNHEP-TH

Epinions X-axis: # of Triangles a node participates inY-axis: count of such nodes

Page 21: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 22

Triangle Law: #S.4 [Tsourakakis ICDM 2008]

SNReuters

EpinionsX-axis: degreeY-axis: mean # trianglesNotice: slope ~ degree exponent (insets)

Page 22: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 23

Triangle Law: Computations [Tsourakakis ICDM 2008]

But: triangles are expensive to compute(3-way join; several approx. algos)

Q: Can we do that quickly?

details

Page 23: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 24

Triangle Law: Computations [Tsourakakis ICDM 2008]

But: triangles are expensive to compute(3-way join; several approx. algos)

Q: Can we do that quickly?A: Yes!

#triangles = 1/6 Sum ( i3 )

(and, because of skewness, we only need the top few eigenvalues!

details

Page 24: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 25

Triangle Law: Computations [Tsourakakis ICDM 2008]

1000x+ speed-up, high accuracy

details

Page 25: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 26

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs

• degree, diameter, eigen*,

• triangles

• cliques

– Weighted graphs– Time evolving graphs

• Problem#2: Generators

Page 26: Large Graph  Mining

CMU SCS

Large Human Communication NetworksPatterns and a Utility-Driven Generator

Nan Du, Christos Faloutsos, Bai Wang, Leman Akoglu

KDD 2009

Page 27: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 28

Cliques• Clique is a complete subgraph.• If a clique can not be

contained by any largerclique, it is called the maximal clique.

20

1 3

4

Page 28: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 29

Clique• Clique is a complete subgraph.• If a clique can not be

contained by any largerclique, it is called the maximal clique.

20

1 3

4

Page 29: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 30

Clique• Clique is a complete subgraph.• If a clique can not be

contained by any largerclique, it is called the maximal clique.

20

1 3

4

Page 30: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 31

Clique• Clique is a complete subgraph.• If a clique can not be

contained by any largerclique, it is called the maximal clique.

• {0,1,2}, {0,1,3}, {1,2,3}{2,3,4}, {0,1,2,3} are cliques;

• {0,1,2,3} and {2,3,4} are the maximal cliques.

20

1 3

4

Page 31: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 32

S.5: Clique-Degree Power-Law

• Power law:

More friends, even more social circles !

# maximalcliques of node i

degreeof node i

Dataset: who-calls-whomanonymized,Over several time units€

C∝da

Page 32: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 33

S.5 Clique-Degree Power-Law

• Outlier Detection

Page 33: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 34

1.5 Clique-Degree Power-Law

• Outlier Detection

Page 34: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 35

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs

• degree, diameter, eigen*,

• triangles

• cliques

– Weighted graphs– Time evolving graphs

• Problem#2: Generators

Page 35: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 36

Observation W.1

Question : Nodes in a triangle are topologically equivalent. Will they also give equal number of phone calls to each other ?

Max

Wei

ght M

in Weight

Mid Weight

Page 36: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 37

Observation W.1:Triangle Weight Law

Max ∝mediuma

Periods S1 – S3

Max ∝minb

medium∝minc

Page 37: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 38

Other observations on weighted graphs?

• A: yes - even more ‘laws’!

M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

Page 38: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 39

Observation W.2: fortification

Q: How do the weights

of nodes relate to degree?

Page 39: Large Graph  Mining

CMU SCS

Edges (# donors)

In-weights($)

ICDM-LDMTA 2009 C. Faloutsos 40

Observation W.2: fortification:Snapshot Power Law

• weight of a node is super-linear on the in-degree with PL exponent ‘iw’:

– i.e. 1.01 < iw < 1.26, super-linear

Orgs-Candidates

e.g. John Kerry, $10M received,from 1K donors

More donors, even more $

$10

$5

Page 40: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 41

Are there deviations from P.L.?

• A: yes – but they also correspond to very skewed distributions (‘black/gray swans’)

Page 41: Large Graph  Mining

CMU SCS

“Mobile Call Graphs: Beyond Power Law and Lognormal Distributions”

K D D ’ 0 8

Mukund Seshadri , Sridhar Machiraju,

Ashwin Sridharan, Jean Bolot

Christos Faloutsos, Jure Leskovec

Page 42: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 43

Dataset• Who-calls-whom (anonymized)

• Degree distribution, in green

Area S1Time period T11 monthCount

Degree of Mobile-phone call graph

.... Observed Data

--- Power Law Fit+ Observed Data

.... LogNormal Fit

Page 43: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 44

A poor fit?• Power Laws: p(x) ~ x^(-a)

• Lognormal: log(x) is Normal

Count

Degree of Mobile-phone call graph

.... Observed Data

--- Power Law Fit+ Observed Data

.... LogNormal Fit

Area S1Time period T11 month

Page 44: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 45

Solution: DPLN • Double Pareto Log Normal (Reed, 2003)

• 4 parameters: [α,β,ν,τ]• Linear head and tail.

Degree

Area S1, Time Period T1

β

α

“Rich get Richer”BUT

Population lifetimesNOT identical

count

Page 45: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 46

Datasets• Anonymized monthly aggregates of call metrics

per user – Collected at 4 coverage areas S1,S2,S3,S4– Collected for two month-long periods T1 and

T2, separated by 6 months – Total coverage:

• >1 million users

• >10 million calls

• ~7000 square miles.

Page 46: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 47

• Degree (No. of Call Partners)• Calls• Talk Time

Total over a month,per user

Area S1Time-period T1Degree Distribution

Metrics

Page 47: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 48

Persistence –

Across Metric, Time, Space

Partners

S2, T2

Partners

S1, T2

CallsArea S1, Time T1

Page 48: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 49

• Outlier detection– Many un-answered calls => multiples

of 27 sec!

Applications

Per-user distribution of total talk time (@ T1, S1)

Page 49: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 50

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs – Weighted graphs– Time evolving graphs

• Problem#2: Generators

• …

Page 50: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 51

Problem: Time evolution• with Jure Leskovec (CMU ->

Stanford)

• and Jon Kleinberg (Cornell – sabb. @ CMU)

Page 51: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 52

T.1 Evolution of the Diameter

• Prior work on Power Law graphs hints at slowly growing diameter:– diameter ~ O(log N)– diameter ~ O(log log N)

• What is happening in real data?

Page 52: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 53

T.1 Evolution of the Diameter

• Prior work on Power Law graphs hints at slowly growing diameter:– diameter ~ O(log N)– diameter ~ O(log log N)

• What is happening in real data?

• Diameter shrinks over time

Page 53: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 54

T.1 Diameter – “Patents”

• Patent citation network

• 25 years of data

time [years]

diameter

Page 54: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 55

T.2 Temporal Evolution of the Graphs

• N(t) … nodes at time t

• E(t) … edges at time t

• Suppose thatN(t+1) = 2 * N(t)

• Q: what is your guess for E(t+1) =? 2 * E(t)

Page 55: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 56

T.2 Temporal Evolution of the Graphs

• N(t) … nodes at time t• E(t) … edges at time t• Suppose that

N(t+1) = 2 * N(t)

• Q: what is your guess for E(t+1) =? 2 * E(t)

• A: over-doubled!– But obeying the ``Densification Power Law’’

Page 56: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 57

T.2 Densification – Patent Citations

• Citations among patents granted

• 1999– 2.9 million nodes– 16.5 million

edges

• Each year is a datapoint N(t)

E(t)

1.66

Page 57: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 58

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs– Static graphs – Weighted graphs– Time evolving graphs

• Problem#2: Generators

• …

Page 58: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 59

More on Time-evolving graphs

M. McGlohon, L. Akoglu, and C. Faloutsos Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

Page 59: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 60

Observation T.3: NLCC behaviorQ: How do NLCC’s emerge and join with

the GCC?

(``NLCC’’ = non-largest conn. components)

– Do they continue to grow in size?

– or do they shrink?

– or stabilize?

Page 60: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 61

Observation T.4: NLCC behavior

• After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, oscillate).

IMDB

CC size

Time-stamp

Page 61: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 62

T.5 How do new edges appear?

[LBKT’08] Microscopic Evolution of Social Networks Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins. (ACM KDD), 2008.

Page 62: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 63

Edge gap δ(d): inter-arrival time between dth and d+1st edge

T.5 How do edges appear in time?[LBKT’08]

What is the PDF of Poisson?

Page 63: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 64

)()(),);(( dg eddp

Edge gap δ(d): inter-arrival time between dth and d+1st edge

LinkedIn

T.5 How do edges appear in time?[LBKT’08]

Page 64: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 65

Similarly for Blogs

• with Mary McGlohon (CMU)

• Jure Leskovec (CMU->Stanford)

• Natalie Glance (now at Google)

• Mat Hurst (now at MSR)

[SDM’07]

Page 65: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 66

T.6 : popularity over time

Post popularity drops-off – exponentially?

lag: days after post

# in links

1 2 3

@t

@t + lag

Page 66: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 67

T.6 : popularity over time

Post popularity drops-off – exponentially?POWER LAW!Exponent?

# in links(log)

1 2 3 days after post(log)

Page 67: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 68

T.6 : popularity over time

Post popularity drops-off – exponentially?POWER LAW!Exponent? -1.6 • close to -1.5: Barabasi’s stack model• and like the zero-crossings of a random walk

# in links(log)

1 2 3

-1.6

days after post(log)

Page 68: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 69

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Generators

• Problem#3: Scalability

• Conclusions

Page 69: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 70

Problem#2: Generation

Goal: Generate a realistic sequence of graphs that

1. obey all the patterns• including weights on the edges and• timestamps on the edges

2. needs no randomness

3. needs no parameters (or as few as possible)

Page 70: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 71

PaC generator

• Why do we need generators?– What-if scenarios

– How to attract/retain customers/members

– Stress-testing algorithms, when real graphs are unavailable (eg., due to privacy, national/corporate security, etc)

Page 71: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 72

(NUMEROUS earlier generators)

• preferential attachment [Barabasi+]

• copying model [Kleinberg+]

• winner does not take all [Giles+]

• Kronecker graphs

• ...

• (survey: [Chakrabarti+, ’06])

Page 72: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 73

Problem#2: Generation

Goal: Generate a realistic sequence of graphs that

1. obey all the patterns• including weights on the edges and

• timestamps on the edges

2. needs no randomness

3. needs no parameters (or as few as possible)

Proposed ‘PaC’ model – idea:• design ‘agents’, who try to max utility of

contacts

Page 73: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 74

PaC Model• People benefit from calling each other.

• A Pay and Call game = PaC Model

• The payoffs are measured as “emotional dollars”.

agent

1iF 0iF

Prob pl to stay in the game

Friendliness 0<=Fi<=1

capital

Page 74: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 75

Outline of Agent Behavior• Step 1: decide to stay (PL)

• Step 2: if stay, call the most profitable person(s)– Existing friend (‘exploit’)

– Stranger (‘explore’) or ask for recommendation (if available) to maximize benefits

Exponential lifetime

Rich get richer

Closing Triangle

Page 75: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 77

Validation of PaC

• Ran 35 simulations

• 100,000 agents per simulation

• Variation of the parameters does not change the shape of the distribution

Page 76: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 78

Goals of Validation

• G1: Skewed degree/node weight distribution

• G2: Snapshot Power-Law

• G3: Skewed connected components distribution

• G4: Clique-Degree Power-Law

• G5: Clique-Participation Law

• G6: Triangle Weight Law

Page 77: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 79

Validation of PaC• G1: Skewed Degree / Node Weight Distribution

Real Network

Synthetic Network

Page 78: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 80

Validation of PaC• G2: Snapshot Power Law [McGlohon, Akoglu,

Faloutsos 08] “more partners, even more calls”

Real Network Synthetic Network

Page 79: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 81

Validation of PaC• G3: Skewed distribution of the connected

components

Real Network Synthetic Network

Page 80: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 82

Validation of PaC

• G4: Clique Degree Power Law

Real Network Synthetic Network

Page 81: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 83

Validation of PaC

• G5: Clique Participation Law

Real Network Synthetic Network

Page 82: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 84

Validation of PaC

• G6: Triangle Weight Law

Real Network

Synthetic Network

Page 83: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 85

Validation of PaCG1: Skewed degree/node weight

distributionG2: Snapshot Power-LawG3: Skewed connected components

distributionG4: Clique-Degree Power-LawG5: Clique-Participation LawG6: Triangle Weight Law

Page 84: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 86

Outline

• Introduction – Motivation

• Problem#1: Patterns in graphs

• Problem#2: Generators

• Problem#3: Scalability

• Conclusions

Page 85: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 87

Scalability

• How about if graph/tensor does not fit in core?

• How about handling huge graphs?

Page 86: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 88

Scalability

• How about if graph/tensor does not fit in core?

• [‘MET’: Kolda, Sun, ICMD’08, best paper award]

• How about handling huge graphs?

Page 87: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 89

Scalability• Google: > 450,000 processors in clusters of ~2000

processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]

• Yahoo: 5Pb of data [Fayyad, KDD’07]• Problem: machine failures, on a daily basis• How to parallelize data mining tasks, then?• A: map/reduce – hadoop (open-source clone)

http://hadoop.apache.org/

Page 88: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 90

2’ intro to hadoop

• master-slave architecture; n-way replication (default n=3)

• ‘group by’ of SQL (in parallel, fault-tolerant way)• e.g, find histogram of word frequency

– compute local histograms– then merge into global histogram

select course-id, count(*)from ENROLLMENTgroup by course-id

details

Page 89: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 91

2’ intro to hadoop

• master-slave architecture; n-way replication (default n=3)

• ‘group by’ of SQL (in parallel, fault-tolerant way)• e.g, find histogram of word frequency

– compute local histograms– then merge into global histogram

select course-id, count(*)from ENROLLMENTgroup by course-id map

reduce

details

Page 90: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 92

UserProgram

Reducer

Reducer

Master

Mapper

Mapper

Mapper

fork fork fork

assignmap

assignreduce

readlocalwrite

remote read,sort

Output

File 0

Output

File 1

write

Split 0Split 1Split 2

Input Data(on HDFS)

By default: 3-way replication;Late/dead machines: ignored, transparently (!)

details

Page 91: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 93

PEGASUS: A Peta-Scale Graph Mining System, ICDM’09

U Kang, Charalampos Tsourakakis, C. Faloutsos

Page 92: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 94

the PageRank distribution: power-law

Analysis of Real Networks

YahooWeb: 1.4B nodes and 6.6B edges, 120Gb(largest public graph analyzed)

M45 (Y!):• 1.5Pb disk• 3Tb RAM• 4000 cores• among top 500

Page 93: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 95

Spikes?

Analysis of Real Networks

YahooWeb: 1.4 B nodes and 6.6 B edges -Connected components:

size

count

Page 94: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 96

Spikes: due to replications of pages.

Analysis of Real Networks

YahooWeb: 1.4 B nodes and 6.6 B edges -Connected components:

size

count

Page 95: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 97

Evolution of connected components of LinkedIn: stable, *after* the gelling point (@2004)

LinkedIn: 7.5M nodes and 58M edges

details

Page 96: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 98

OVERALL CONCLUSIONS – low level:

• Several new patterns (fortification, triangle-laws, etc)

• New generator: ‘PaC’ (agent based)

• Scalability / cloud computing -> PetaBytes

Page 97: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 99

OVERALL CONCLUSIONS – high level

• good news: unprecedented opportunities (huge datasets, available on-line)

• better news: cross-disciplinary work:

– DB/sys -> scalability

– ML/stat -> tools

– Econ, Soc -> experience; what questions to ask

– phys, math -> phase transitions, tensors, eigenvalues

– …

Page 98: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 100

References• Hanghang Tong, Christos Faloutsos, and Jia-Yu

Pan Fast Random Walk with Restart and Its Applications ICDM 2006, Hong Kong.

• Hanghang Tong, Christos Faloutsos Center-Piece Subgraphs: Problem Definition and Fast Solutions, KDD 2006, Philadelphia, PA

• T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining. In: ICDM 2008, pp. 363-372, December 2008.

Page 99: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 101

References• Jure Leskovec, Jon Kleinberg and Christos

Faloutsos Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations KDD 2005, Chicago, IL. ("Best Research Paper" award).

• Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication (ECML/PKDD 2005), Porto, Portugal, 2005.

Page 100: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 102

References• Jure Leskovec and Christos Faloutsos, Scalable

Modeling of Real Graphs using Kronecker Multiplication, ICML 2007, Corvallis, OR, USA

• Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang and Christos Faloutsos NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks WWW 2007, Banff, Alberta, Canada, May 8-12, 2007.

• Jimeng Sun, Dacheng Tao, Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis, KDD 2006, Philadelphia, PA

Page 101: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 103

References• Jimeng Sun, Yinglian Xie, Hui Zhang, Christos

Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM, Minneapolis, Minnesota, Apr 2007. [pdf]

• Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos, GraphScope: Parameter-free Mining of Large Time-evolving Graphs ACM SIGKDD Conference, San Jose, CA, August 2007

Page 102: Large Graph  Mining

CMU SCS

ICDM-LDMTA 2009 C. Faloutsos 104

Contact info:

www. cs.cmu.edu /~christos

Thanks again to the organizers:

Jimeng Sun, Yan LiuFunding sources:• NSF IIS-0705359, IIS-0534205, DBI-0640543• LLNL, PITA, IBM, INTEL, HP