21
1 Graph and Link Mining

GRAPH AND LINK MINING - Fordhamstorm.cis.fordham.edu/~gweiss/classes/cisc6930/slides/graph-mining.… · entities and their pairwise relationships. ... probability of being at node

Embed Size (px)

Citation preview

1

Graph and Link Mining

Graphs - Basics

A graph is a powerful abstraction for modeling entities and their pairwise relationships.

G = (V,E) Set of nodes 𝑉 = 𝑣1, … , 𝑣5

Set of edges 𝐸 = { 𝑣1, 𝑣2 , … 𝑣4, 𝑣5 }

Examples: Social network

Twitter Followers

Web

Collaboration graphs

2

𝑣1

𝑣2

𝑣3 𝑣4

𝑣5

Undirected Graphs Undirected Graph

The edges are undirected pairs – they can be traversed in any direction.

Degree of node: Number of edges incident on the node

Path: A sequence of edges from one node to another

Connected Component: A set of nodes such that there is a path between any two nodes in the set

3

𝑣1

𝑣2

𝑣3 𝑣4

𝑣5

10011

00111

01011

11001

11110

A

Directed Graphs Directed Graph:

Edges are ordered pairs – they can be traversed in the direction from first to second.

In-degree and Out-degree of a node.

Path: A sequence of directed edges from one node to another

Strongly Connected Component: A set of nodes such that there is a directed path between any two nodes in the set

4

𝑣1

𝑣2

𝑣3 𝑣4

𝑣5

10001

00111

00010

10000

00110

A

Examples of Graphs we Might Mine

Airline Route Maps are useful

Info can tell you about both history and politics

Call Detail Records

Tell us about relationships between people

Who got in trouble about a decade ago for using this info?

Web is based on (hyper)links between docs

Social Networks form Graphs

Link Analysis is the data mining technique that addresses relationships and connections

5

6 Degrees of Separation

Claim: there are at most 6 degrees of separation between any two people This is important in social networks

LinkedIn tell you how you connect to others and it expands with each link.

Stanley Milgram wasn’t first to note small world effect But popularized it with famous experiment: How close are two

random people? Picked people in Omaha Nebraska or Wichita Kansas, and someone

in Boston Asked source person to send it to other person and if did not know

the person send it to someone more likely to know them Average path length was 5.5 or 6

But only 64 of 296 arrived (this is often not highlighted)

6

Examples of Applications

Identifying authoritative sources of information on the WWW by analyzing page links Google and PageRank– we will come back to this

Understanding physician referral patterns Analyzing telephone call patterns

MCI Friends and Family You call Mary Smith, also on MCI, so ask her to join MCI

But your wife does not know Mary Smith! Oops! Far-fetched? Facebook does it all of the time!!!!

Identify fraud: in past one would purchaser several stolen calling cards and use them to call same person. That is a clue.

7

Mining the graph structure

A graph is a combinatorial object, with a certain structure.

Mining the structure of the graph reveals information about the entities in the graph E.g., if in the Facebook graph I find that there are 100

people that are all linked to each other, then these people are likely to be a community The community discovery problem

By measuring the number of friends in Facebook graph I can find the most important nodes The node importance problem

8

Importance problem

What are the most important nodes in the graph?

What are the most authoritative pages on the web?

Who are the important users in Facebook?

What are the most influential Twitter accounts?

9

Link Analysis

First generation search engines view documents as flat text files could not cope with size, spamming, user needs

Second generation search engines Ranking becomes critical shift from relevance to authoritativeness

authoritativeness: the static importance of the page

a success story for the network analysis + a huge commercial success

it all started with two graduate students at Stanford. Everyone knows the company, right?

10

Link Analysis: Intuition

A link from page p to page q denotes endorsement

page p considers page q an authority on a subject

use the graph of recommendations

assign an authority value to every page

The same idea applies to other graphs as well

Twitter graph, where user p follows user q

11

Constructing the graph

Goal: output an authority weight for each node Also known as centrality or importance

12

w w

w

w

w

Rank by Popularity

Rank pages according to the number of incoming edges (in-degree, degree centrality)

13

1. Red Page

2. Yellow Page

3. Blue Page

4. Purple Page

5. Green Page

w=1 w=1

w=2

w=3 w=2

Popularity

It is not important only how many link to you,

but how important they are Good authorities are pointed by good authorities

Recursive definition of importance

14

PageRank

Good authorities are pointed to by good authorities The value of a page is the value of the

people that link to you

How do we implement that? Each node distributes its authority

value equally to its neighbors

The authority value of each node is the sum of the authority fractions it collects from its neighbors.

Solving the system of equations we get authority values for the nodes w = ½ , w = ¼ , w = ¼

15

w w

w

w + w + w = 1

w = w + w

w = ½ w

w = ½ w

A More Complex Example

16

v1 v2

v3

v4 v5

w1 = 1/3 w4 + 1/2 w5

w2 = 1/2 w1 + w3 + 1/3 w4

w3 = 1/2 w1 + 1/3 w4

w4 = 1/2 w5

w5 = w2

Random Walks on Graphs

What we described is equivalent to a random walk on the graph

Random walk: Start from a node uniformly at random Pick one of the outgoing edges uniformly at random Repeat Some nodes will be visited more often than others.

Those are more important. Based not only on number of incoming links, but how

often the predecessor nodes are visited

A value like Google’s Pagerank indicates how often a node would be visited

17

Random walks on graphs

Question: what is the probability of being at a specific node? 𝑝𝑖: probability of being at node i at this step 𝑝𝑖′: probability of being at node i in the next step

After many steps the probabilities converge to the stationary distribution of the random walk.

18

v1

v3

v4 v5

p’1 = 1/3 p4 + 1/2 p5

p’2 = 1/2 p1 + p3 + 1/3 p4

p’3 = 1/2 p1 + 1/3 p4

p’4 = 1/2 p5

p’5 = p2

v2

How Does Pagerank Work?

Arbitrarily initialize all pages to Pagerank of 1

Repeatedly perform calculations for each page

Eventually the values will converge

Pagerank is what caused Google to succeed

Prior to that only content mattered, not link structure

19

Benefits of PageRank

It is not trivial to fool Pagerank

You can create dummy pages to point to your page, but since no one is pointing to those pages, it will have low PageRank and not help much

You can create dummy pages to also point to one another, but without being pointed to by an outside authority, the impact will be limited

But it is clear that Google must have many tweaks to catch cases like this– link spam or link farms

20

Social Network Analysis

Social Network Analysis Overview https://www.youtube.com/watch?v=fgr_g1q2ikA

5 Minutes

What is Social Network Analysis https://www.youtube.com/watch?v=xT3EpF2EsbQ

4 minutes

21