39
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Page 2: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

2

Nodes: Football Teams Edges: Games played

Can we identify node groups?

(communities, modules, clusters)

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 3: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

3

NCAA conferences

Nodes: Football Teams Edges: Games played

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 4: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

4

Can we identify functional modules?

Nodes: Proteins Edges: Physical interactions

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 5: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

5

Functional modules

Nodes: Proteins Edges: Physical interactions

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 6: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

6

Can we identify social communities?

Nodes: Facebook Users Edges: Friendships

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 7: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

7

High school Summer internship

Stanford (Squash) Stanford (Basketball)

Social communities

Nodes: Facebook Users Edges: Friendships

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 8: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Non-overlapping vs. overlapping communities

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 8

Page 9: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

9

Network Adjacency matrix

Nodes

Nod

es

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 10: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

What is the structure of community overlaps: Edge density in the overlaps is higher!

10

Communities as “tiles” 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 11: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

11

This is what we want! Communities in a network

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 12: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

1) Given a model, we generate the network:

2) Given a network, find the “best” model

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 12

C

A

B

D E

H

F

G

C

A

B

D E

H

F

G

Generative model for networks

Generative model for networks

Page 13: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Goal: Define a model that can generate networks The model will have a set of “parameters” that we

will later want to estimate (and detect communities)

Q: Given a set of nodes, how do communities “generate” edges of the network?

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 13

C

A

B

D E

H

F

G

Generative model for networks

Page 14: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Generative model B(V, C, M, {pc}) for graphs: Nodes V, Communities C, Memberships M Each community c has a single probability pc

Later we fit the model to networks to detect communities

14

Model

Network

Communities, C

Nodes, V

Model

pA pB

Memberships, M

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 15: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

AGM generates the links: For each For each pair of nodes in community 𝑨𝑨,

we connect them with prob. 𝒑𝒑𝑨𝑨 The overall edge probability is:

15

Model

∏∩∈

−−=vu MMc

cpvuP )1(1),(

Network

Communities, C

Nodes, V Community Affiliations

pA pB

Memberships, M

If 𝒖𝒖,𝒗𝒗 share no communities: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝜺𝜺 Think of this as an “OR” function: If at least 1 community says “YES” we create an edge

𝑴𝑴𝒖𝒖 … set of communities node 𝒖𝒖 belongs to

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 16: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

16

Model

Network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 17: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

17 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 18: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Detecting communities with AGM:

18

C

A

B

D E

H

F

G

Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters pc

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 19: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Maximum Likelihood Principle (MLE): Given: Data 𝑿𝑿 Assumption: Data is generated by some model 𝒇𝒇(𝚯𝚯) 𝒇𝒇 … model 𝚯𝚯 … model parameters

Want to estimate 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯): The probability that our model 𝒇𝒇 (with parameters 𝜣𝜣)

generated the data

Now let’s find the most likely model that could have generated the data: arg max

Θ 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯)

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 19

Page 20: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Imagine we are given a set of coin flips Task: Figure the bias of a coin! Data: Sequence of coin flips: 𝑿𝑿 = [𝟏𝟏,𝟎𝟎,𝟎𝟎,𝟎𝟎,𝟏𝟏,𝟎𝟎,𝟎𝟎,𝟏𝟏] Model: 𝒇𝒇 𝚯𝚯 = return 1 with prob. Θ, else return 0 What is 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 ? Assuming coin flips are independent So, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 … ∗ 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 What is 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ? Simple, 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 = 𝚯𝚯

Then, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝚯𝚯𝟑𝟑 𝟏𝟏 − 𝚯𝚯 𝟓𝟓 For example: 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟎𝟎.𝟓𝟓 = 𝟎𝟎.𝟎𝟎𝟎𝟎𝟑𝟑𝟎𝟎𝟎𝟎𝟎𝟎

𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟑𝟑𝟖𝟖

= 𝟎𝟎.𝟎𝟎𝟎𝟎𝟓𝟓𝟎𝟎𝟎𝟎𝟎𝟎

What did we learn? Our data was most likely generated by coin with bias 𝚯𝚯 = 𝟑𝟑/𝟖𝟖

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 20 𝑷𝑷𝒇𝒇𝑿𝑿𝚯𝚯

𝚯𝚯

𝚯𝚯∗ = 𝟑𝟑/𝟖𝟖

Page 21: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic

matrix to obtain the binary adjacency matrix

The likelihood of AGM generating graph G:

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 21

0.25 0.10 0.10 0.04 0.05 0.15 0.02 0.06 0.05 0.02 0.15 0.06 0.01 0.03 0.03 0.09

1 1 0 0 0 1 0 1 1 0 1 1 0 1 0 1

For every pair of nodes 𝒖𝒖,𝒗𝒗 AGM gives the prob. 𝒑𝒑𝒖𝒖𝒗𝒗 of them being linked

Flip biased coins

)),(1(),()|(),(),(

vuPvuPGPGvuGvu

−ΠΠ=Θ∉∈

Page 22: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Given graph G and Θ we calculate likelihood that Θ generated G: P(G|Θ)

0.25 0.10 0.10 0.04

0.05 0.15 0.02 0.06

0.05 0.02 0.15 0.06

0.01 0.03 0.03 0.09 Θ=B(V, C, M, {pc})

1 0 1 1

0 1 0 1

1 0 1 1

1 1 1 1

G

P(G|Θ)

)),(1(),()|(),(),(

vuPvuPGPGvuGvu

−ΠΠ=Θ∉∈

G

2/13/2014 22 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 23: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Our goal: Find 𝚯𝚯 = 𝑩𝑩(𝑽𝑽,𝑪𝑪,𝑴𝑴, 𝒑𝒑𝑪𝑪 ) such that:

How do we find 𝑩𝑩(𝑽𝑽,𝑪𝑪,𝑴𝑴, 𝒑𝒑𝑪𝑪 ) that maximizes the likelihood? No nice way, have to search over

all affiliation graphs But, don’t despair…

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 23

ΘP( | ) AGM

arg max Θ

𝑮𝑮

Page 24: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Relaxation: Memberships have strengths

𝑭𝑭𝒖𝒖𝑨𝑨: The membership strength of node 𝒖𝒖 to community 𝑨𝑨 (𝑭𝑭𝒖𝒖𝑨𝑨 = 𝟎𝟎: no membership) Each community 𝑨𝑨 links nodes independently:

𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨)

24

𝑭𝑭𝒖𝒖𝑨𝑨

u v

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 25: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Community membership strength matrix 𝑭𝑭

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 25

𝑭𝑭 =

j

Communities

Nod

es

𝑭𝑭𝒖𝒖𝑨𝑨 … strength of 𝒖𝒖’s membership to 𝑨𝑨

𝑭𝑭𝒗𝒗 … vector of community membership strengths of 𝒖𝒖

𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨) Probability of connection is

proportional to the product of strengths

Prob. that at least one common community 𝑪𝑪 links the nodes: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − ∏ 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖,𝒗𝒗𝑪𝑪

Page 26: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Community 𝑨𝑨 links nodes 𝒖𝒖,𝒗𝒗 independently: 𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨)

Then prob. at least one common 𝑪𝑪 links them: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − ∏ 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖,𝒗𝒗𝑪𝑪 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−∑ 𝑭𝑭𝒖𝒖𝑪𝑪 ⋅ 𝑭𝑭𝒗𝒗𝑪𝑪𝑪𝑪 )

= 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻) For example:

26

𝑭𝑭𝒖𝒖:

𝑭𝑭𝒗𝒗: Then: 𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻 = 𝟎𝟎.𝟏𝟏𝟎𝟎 And: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑 −𝟎𝟎.𝟏𝟏𝟎𝟎 = 𝟎𝟎.𝟏𝟏𝟏𝟏 But: 𝑷𝑷 𝒖𝒖,𝒘𝒘 = 𝟎𝟎.𝟖𝟖𝟖𝟖 𝑷𝑷 𝒗𝒗,𝒘𝒘 = 𝟎𝟎

𝑭𝑭𝒘𝒘: Node community

membership strengths 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

0 1.2 0 0.2

0.5 0 0 0.8

0 1.8 1 0

Page 27: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Another way to think about this is: Think of a weighted graph (but we don’t see the

weights) 𝒖𝒖,𝒗𝒗 interact with strength 𝑿𝑿𝒖𝒖𝒗𝒗𝑨𝑨 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨) Total interaction strength is the sum: 𝑿𝑿𝒖𝒖𝒗𝒗 = ∑ 𝑿𝑿𝒖𝒖𝒗𝒗𝑪𝑪𝑪𝑪 Then: 𝑿𝑿𝒖𝒖𝒗𝒗~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(∑ 𝑭𝑭𝒖𝒖𝑪𝑪 ⋅ 𝑭𝑭𝒗𝒗𝑪𝑪𝐶𝐶 ) We observe an edge if there is non-zero interaction: 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 > 0 = 1 − 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 = 0 = = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻)

27 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Poisson PMF:

𝑷𝑷(𝑿𝑿 = 𝒆𝒆) =𝝀𝝀𝒌𝒌

𝒌𝒌!𝒆𝒆−𝝀𝝀

Page 28: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and

call it log-likelihood: 𝑙𝑙 𝐹𝐹 = log 𝑃𝑃(𝐺𝐺|𝐹𝐹)

28

Objective function is biconvex Non-negative matrix factorization: Update 𝑭𝑭𝒖𝒖𝑪𝑪 for node 𝒖𝒖 while fixing the

memberships of all other nodes

[wsdm ‘13]

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 29: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and

call it log-likelihood: 𝒍𝒍 𝑭𝑭 = 𝐥𝐥𝐥𝐥𝐥𝐥 𝑷𝑷(𝑮𝑮|𝑭𝑭)

29

Connection: Non-negative matrix factorization We want to “factor” the adjacency matrix G

F FT Matrix of

edge probs.

𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑 (− . ) = 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Graph G

Page 30: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Non-negative matrix factorization: Use a gradient based approach! But updating the whole 𝑭𝑭 takes too long Computing the gradient takes quadratic time! Our network are far too big to allow for that: Network with 1M nodes, can have 1012 different edges

Instead, update 𝑭𝑭 in small “units”: Update 𝑭𝑭𝒖𝒖𝑪𝑪 for node 𝒖𝒖 while fixing the memberships of

all other nodes

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 30

Page 31: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Compute gradient of a single row 𝑭𝑭𝒖𝒖 of 𝑭𝑭:

Coordinate gradient ascent: Iterate over the rows of 𝑭𝑭: Compute gradient 𝜵𝜵𝒍𝒍 𝑭𝑭𝒖𝒖 of row 𝒖𝒖 (while keeping others fixed) Update the row 𝑭𝑭𝒖𝒖: 𝑭𝑭𝒖𝒖 ← 𝑭𝑭𝒖𝒖 + 𝜼𝜼 𝛁𝛁𝒍𝒍(𝑭𝑭𝒖𝒖) Project 𝑭𝑭𝒖𝒖 back to a non-negative vector: If 𝑭𝑭𝒖𝒖𝑪𝑪 < 𝟎𝟎: 𝑭𝑭𝒖𝒖𝑪𝑪 = 𝟎𝟎

This is slow! Computing 𝜵𝜵𝒍𝒍 𝑭𝑭𝒖𝒖 takes linear time!

31 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 32: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

However, we notice:

We can cache ∑ 𝑭𝑭𝒗𝒗𝒗𝒗 So, computing ∑ 𝑭𝑭𝒗𝒗𝒗𝒗∉𝓝𝓝(𝒖𝒖) now takes linear time

in the degree 𝓝𝓝(𝒖𝒖) of 𝒖𝒖 In networks degree of a node is much smaller to the total

number of nodes in the network, so this is a significant speedup!

32 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 33: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

BigCLAM takes 5 minutes for 300k node nets Other methods take 10 days

Can process networks with 100M edges! 33

Page 34: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

34 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 35: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Extension: Make community membership edges directed: Outgoing membership: Nodes “sends” edges Incoming membership: Node “receives” edges

35 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 36: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 36

Page 37: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Everything is almost the same except now we have 2 matrices: 𝑭𝑭 and 𝑯𝑯 𝑭𝑭… out-going community memberships 𝑯𝑯… in-coming community memberships

Edge prob.: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑(−𝑭𝑭𝒖𝒖𝑯𝑯𝒗𝒗𝑻𝑻)

Network log-likelihood:

which we optimize the same way as before 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 37

𝑭𝑭 𝑯𝑯

Page 38: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

38 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 39: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2014/slides/12-graphs2.pdfInternational Conference on Web Search and Data Mining (WSDM), 2014. Community

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.

Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.

Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 39