Lecture 8: Graph mining (Book Ch 17)

Image source: https://www.jamiesheffield.com/2013/05/concept-mapping-my-protagonists-world.html

MDM course Aalto 2020

Page 2: Lecture 8: Graph mining (Book Ch 17)

Graphs occur everywhere!


Page 3: Lecture 8: Graph mining (Book Ch 17)

Graphs occur everywhere!

Image sources: Leskovec et al. (2009), Fan & Simeon (2000)

Page 4: Lecture 8: Graph mining (Book Ch 17)

Graph mining

Data may consist of

1. multiple small graphs → today

e.g., chemical compounds, biological pathways, program control flows, consumer behaviour, ...; even (HTML) documents can be represented as graphs!

2. one large graph → later

e.g., internet, social network

Information to mine: interesting substructures, similarities, communities, clusters

Page 5: Lecture 8: Graph mining (Book Ch 17)

Concept map of methods (this lecture)

[Figure: concept map of the methods. Boxes and their contents:]

Graph matching

Mining frequent subgraphs

Distance between graphs
  Graph matching based methods: MCG distances, graph edit distance
  Transformation based methods: type transport, topological descriptors, kernel similarity

Graph clustering
  Distance based methods: K-medoids, methods using similarity graphs
  Frequent subgraph based methods: type transport, representatives from frequent subgraphs

(Boxes are connected by arrows labelled "uses" and "needs".)

Page 6: Lecture 8: Graph mining (Book Ch 17)

Graph notations

$G = (V, E)$: graph

$V = \{v_1, \ldots, v_n\}$: set of vertices or nodes

$|V|$: number of nodes

$l(v_i)$: label of node $v_i$

$E = \{e_1, \ldots, e_m\}$: set of edges, $e_i = (v, u)$, $v, u \in V$

$|E|$: number of edges

For now we assume that edges are undirected and do not have labels.

Page 7: Lecture 8: Graph mining (Book Ch 17)

Graph matching = graph isomorphism

Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are matching or isomorphic iff there is a 1:1 correspondence between their nodes such that

(i) Corresponding nodes $v_i \in V_1$ and $v_j \in V_2$ have the same labels: $l(v_i) = l(v_j)$.

(ii) Let $[v_1, u_1]$ be a node pair in $G_1$ and $[v_2, u_2]$ the corresponding pair in $G_2$. Then $(v_1, u_1) \in E_1 \Leftrightarrow (v_2, u_2) \in E_2$.

Note: no polynomial-time algorithms are known (except for special cases).

Page 8: Lecture 8: Graph mining (Book Ch 17)

There can be many matchings!

Two matchings for molecules 1 and 2. In total there are 4! = 24 matchings!

Image source: Aggarwal Ch 17
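The two matching conditions can be checked by brute force on tiny graphs. The following is a minimal sketch, not the lecture's algorithm: the function name `matchings` and the encoding of a graph as a label dict plus an edge list are assumptions made for illustration, and trying every permutation takes factorial time.

```python
from itertools import permutations

def matchings(labels1, edges1, labels2, edges2):
    """Yield every isomorphism as a dict: node of G1 -> node of G2.

    Assumed encoding: labels* maps node -> label; edges* lists undirected
    (v, u) pairs without self-loops or duplicates.
    """
    nodes1, nodes2 = list(labels1), list(labels2)
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if len(nodes1) != len(nodes2) or len(e1) != len(e2):
        return
    for perm in permutations(nodes2):
        m = dict(zip(nodes1, perm))
        # Condition (i): corresponding nodes carry the same label.
        if any(labels1[v] != labels2[m[v]] for v in nodes1):
            continue
        # Condition (ii): with equal edge counts and a 1:1 node mapping,
        # sending every edge of G1 onto an edge of G2 gives the equivalence.
        if all(frozenset((m[v], m[u])) in e2 for v, u in map(tuple, e1)):
            yield m

# Path C - C - O against a relabelled copy of itself.
g1 = ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
g2 = ({"a": "O", "b": "C", "c": "C"}, [("a", "b"), ("b", "c")])
print(list(matchings(*g1, *g2)))

# Four identically labelled, fully connected nodes (a made-up graph, not the
# molecules of the slide): every one of the 4! permutations is a matching.
k4 = ({i: "C" for i in range(4)},
      [(i, j) for i in range(4) for j in range(i + 1, 4)])
print(len(list(matchings(*k4, *k4))))
```

The second call illustrates why the number of matchings can grow factorially when many nodes share a label.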

Page 9: Lecture 8: Graph mining (Book Ch 17)

Subgraph isomorphism

Does a certain query graph $G_q$ match a part of another graph $G$?

Query graph $G_q = (V_q, E_q)$ is a subgraph isomorphism of $G = (V, E)$ if

(i) Every node $v_q \in V_q$ is matched to a distinct node $v \in V$ such that $l(v_q) = l(v)$; and

(ii) If $[v_1, u_1]$ is a node pair in $V_q$ and $[v_2, u_2]$ the matching pair in $V$, then $(v_1, u_1) \in E_q \Leftrightarrow (v_2, u_2) \in E$.

Sometimes a weaker condition suffices for (ii): $(v_1, u_1) \in E_q \Rightarrow (v_2, u_2) \in E$
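The difference between the strict condition (ii) and its weaker variant can be made concrete with a brute-force check. This is an illustrative sketch only (the algorithm of Aggarwal Ch 17.2.1 is more refined); the name `subgraph_matchings` and the flag `induced` are hypothetical, and graphs are assumed to be given as a label dict plus a list of undirected edges.

```python
from itertools import permutations

def subgraph_matchings(q_labels, q_edges, labels, edges, induced=True):
    """Yield injective maps (query node -> graph node) satisfying (i) and (ii).

    induced=True  uses the equivalence in (ii);
    induced=False uses the weaker implication (query edges must exist in G).
    """
    q_nodes = list(q_labels)
    q_e = {frozenset(e) for e in q_edges}
    g_e = {frozenset(e) for e in edges}
    for image in permutations(list(labels), len(q_nodes)):
        m = dict(zip(q_nodes, image))
        # Condition (i): matched nodes have the same label.
        if any(q_labels[v] != labels[m[v]] for v in q_nodes):
            continue
        ok = True
        for i, v in enumerate(q_nodes):
            for u in q_nodes[i + 1:]:
                in_q = frozenset((v, u)) in q_e
                in_g = frozenset((m[v], m[u])) in g_e
                if (in_q and not in_g) or (induced and in_g and not in_q):
                    ok = False
        if ok:
            yield m

# G is a triangle with labels A, A, B; the query is the path A - B - A.
G = ({1: "A", 2: "A", 3: "B"}, [(1, 2), (2, 3), (1, 3)])
Q = ({"x": "A", "y": "A", "z": "B"}, [("x", "z"), ("y", "z")])
print(len(list(subgraph_matchings(*Q, *G, induced=True))))   # strict (ii)
print(len(list(subgraph_matchings(*Q, *G, induced=False))))  # weaker (ii)
```

In this example the strict condition rejects the query because the two A nodes are adjacent in G but not in the query, while the weaker condition accepts both label-preserving mappings.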

Page 10: Lecture 8: Graph mining (Book Ch 17)

Subgraph isomorphism: example

Algorithm: see Aggarwal Ch 17.2.1

Image source: Aggarwal Ch 17

Page 11: Lecture 8: Graph mining (Book Ch 17)

Maximum common subgraph (MCG)

Problem: Given $G_1$ and $G_2$, find $G_0 = (V_0, E_0)$ such that

(i) $G_0$ is a subgraph isomorphism of both $G_1$ and $G_2$, and

(ii) $|V_0|$ is as large as possible.

+ useful for comparing graphs

– NP-hard (like subgraph isomorphism)

Algorithm: see Aggarwal Ch 17.2.2

Page 12: Lecture 8: Graph mining (Book Ch 17)

Next to distances

[Figure: concept map of methods, repeated from p.5.]

Page 13: Lecture 8: Graph mining (Book Ch 17)

Distances based on maximum common subgraph (MCG)

Let graph size = number of nodes, i.e., for $G = (V, E)$ write $|G| = |V|$.

Let $MCS(G_1, G_2)$ = maximum common subgraph of $G_1$ and $G_2$, and $|MCS(G_1, G_2)|$ = its size.

1. Unnormalized non-matching measure:

$U(G_1, G_2) = |G_1| + |G_2| - 2 \cdot |MCS(G_1, G_2)|$

= number of non-matching nodes

Problem: what if the graphs have very different sizes?

Page 14: Lecture 8: Graph mining (Book Ch 17)

Distances based on MCG

2. Union-normalized distance $Udist \in [0, 1]$:

$Udist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{|G_1| + |G_2| - |MCS(G_1, G_2)|}$

= number of non-matching nodes normalized by union size

3. Max-normalized distance $Mdist \in [0, 1]$:

$Mdist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{\max\{|G_1|, |G_2|\}}$

  • metric

MCG-based distances can be computed efficiently only for small graphs!
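The three MCG-based measures can be tried out on toy graphs. The sketch below is a brute-force illustration under stated assumptions, not the algorithm of Aggarwal Ch 17.2.2: `mcs_size` and `mcs_distances` are hypothetical names, the common subgraph is taken to be induced and is not required to be connected, and graphs are label dicts plus undirected edge lists.

```python
from itertools import combinations, permutations

def mcs_size(labels1, edges1, labels2, edges2):
    """Number of nodes of a maximum common (induced) subgraph, by brute force."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    n1, n2 = list(labels1), list(labels2)
    for k in range(min(len(n1), len(n2)), 0, -1):      # largest size first
        for sub in combinations(n1, k):
            for image in permutations(n2, k):
                same_labels = all(labels1[a] == labels2[b]
                                  for a, b in zip(sub, image))
                same_edges = all(
                    (frozenset((sub[i], sub[j])) in e1)
                    == (frozenset((image[i], image[j])) in e2)
                    for i in range(k) for j in range(i + 1, k))
                if same_labels and same_edges:
                    return k
    return 0

def mcs_distances(size1, size2, mcs):
    """Return (U, Udist, Mdist) from the two graph sizes and |MCS|."""
    u = size1 + size2 - 2 * mcs
    udist = 1 - mcs / (size1 + size2 - mcs)
    mdist = 1 - mcs / max(size1, size2)
    return u, udist, mdist

# G1: path A - B - C;  G2: path A - B - D - C.
g1 = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
g2 = ({"a": "A", "b": "B", "c": "D", "d": "C"},
      [("a", "b"), ("b", "c"), ("c", "d")])
m = mcs_size(*g1, *g2)
print(m, mcs_distances(len(g1[0]), len(g2[0]), m))
```

Here the common part is the edge A - B (the C node cannot join it because B and C are adjacent only in G1), so three nodes remain unmatched.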

Page 15: Lecture 8: Graph mining (Book Ch 17)

Graph edit distance

What is the minimum cost of edit operations to transform $G_1$ to $G_2$?

(i) node insertion

(ii) node deletion (deletes also incident edges)

(iii) edge insertion

(iv) edge deletion

(v) label substitution of nodes

  • application-specific costs

  • may be exponentially many possible edit paths!

  • NP-hard
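For very small graphs the edit distance can be found by trying every node mapping. The sketch below is a simplified illustration, not a practical algorithm: the name `edit_distance` is hypothetical, every operation is assumed to cost 1, and, unlike operation (ii) above, deleting a node is charged separately from deleting its incident edges.

```python
from itertools import permutations

def edit_distance(labels1, edges1, labels2, edges2):
    """Exact unit-cost edit distance by enumerating node mappings.

    Assumed cost model: 1 per node insertion, node deletion, label
    substitution, edge insertion and edge deletion. Node identifiers
    must not be None (None marks a padded, non-existent node).
    """
    n = max(len(labels1), len(labels2))
    nodes1 = list(labels1) + [None] * (n - len(labels1))
    nodes2 = list(labels2) + [None] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = float("inf")
    for perm in permutations(nodes2):
        cost = 0
        for a, b in zip(nodes1, perm):
            if a is None or b is None:
                cost += 1                  # node insertion or deletion
            elif labels1[a] != labels2[b]:
                cost += 1                  # label substitution
        for i in range(n):
            for j in range(i + 1, n):
                in1 = frozenset((nodes1[i], nodes1[j])) in e1
                in2 = frozenset((perm[i], perm[j])) in e2
                cost += (in1 != in2)       # edge insertion or deletion
        best = min(best, cost)
    return best

# A - B versus A - B - C: insert node C and edge B - C.
small = ({1: "A", 2: "B"}, [(1, 2)])
large = ({"a": "A", "b": "B", "c": "C"}, [("a", "b"), ("b", "c")])
print(edit_distance(*small, *large))
```

The loop over all permutations is the "exponentially many edit paths" problem in miniature: already a dozen nodes would make this enumeration infeasible.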

Page 16: Lecture 8: Graph mining (Book Ch 17)

Graph edit distance: example


Page 17: Lecture 8: Graph mining (Book Ch 17)

Transformation-based distances

Idea: Transform graphs into a new space where distances are easier to calculate

a) Type transport using frequent subgraphs

b) Topological descriptors

c) Kernel similarity

Page 18: Lecture 8: Graph mining (Book Ch 17)

Type transport using frequent subgraphs

[Figure: flow diagram. Recoverable steps:]

1. Search frequent subgraphs (subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features $f_1, \ldots, f_d$ for the remaining subgraphs

4. Present graphs in vector space using $f_1, \ldots, f_d$, where $f_i$ = number of times the $i$th subgraph occurs in $G$ (or a binary or tf-idf representation)

5. Use text similarity measures

Note: involves an NP-hard subproblem
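Once each graph has been turned into a feature vector of subgraph occurrence counts, any text similarity measure applies. As a small illustration, the sketch below uses cosine similarity, which is one common choice and an assumption here, since the slide does not name a specific measure; the count vectors are made up.

```python
import math

def cosine_similarity(f, g):
    """Cosine of the angle between two equally long count vectors."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    if norm_f == 0 or norm_g == 0:
        return 0.0          # a graph containing none of the subgraphs
    return dot / (norm_f * norm_g)

# Hypothetical occurrence counts of d = 3 selected frequent subgraphs.
features_g1 = [2, 0, 1]
features_g2 = [1, 1, 0]
print(cosine_similarity(features_g1, features_g2))
```

The expensive part is not this comparison but producing the counts, since counting occurrences of a subgraph requires subgraph isomorphism tests.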

Page 19: Lecture 8: Graph mining (Book Ch 17)

Topological descriptors

Idea: calculate different kinds of indices from graphs ⇒ new numerical features ⇒ use distances for numerical data

  • structural information is lost

  • utility is domain-specific (e.g., good in the chemical domain)

e.g., Wiener index:

$W(G) = \sum_{v, u \in V} d(v, u)$

$d(v, u)$ = length of the shortest path from $v$ to $u$

more in Aggarwal Ch 17.3.2
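The Wiener index is easy to compute with one breadth-first search per node. The following sketch assumes an unweighted, undirected, connected graph and counts each unordered node pair once; the function name `wiener_index` is chosen for illustration.

```python
from collections import deque

def wiener_index(nodes, edges):
    """Sum of shortest-path lengths over all unordered node pairs."""
    adj = {v: set() for v in nodes}
    for v, u in edges:
        adj[v].add(u)
        adj[u].add(v)
    total = 0
    for source in nodes:
        # Breadth-first search gives shortest path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
    return total // 2   # every pair was counted from both of its ends

# Path 1 - 2 - 3 - 4: distances 1 + 2 + 3 + 1 + 2 + 1.
print(wiener_index([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))
```

Note that very different graphs can share the same index value, which is the loss of structural information mentioned above.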

Page 20: Lecture 8: Graph mining (Book Ch 17)

Kernel similarity

Idea:

Assume a transformation $\Phi$ such that the similarity of $G_1$ and $G_2$ can be measured by $\Phi(G_1) \cdot \Phi(G_2)$

Design a kernel function $K$ such that $K(G_1, G_2) = \Phi(G_1) \cdot \Phi(G_2)$ and use it as a similarity measure (without performing the transformation)

e.g., shortest path kernel ($O(n^4)$) and random walk kernel ($O(n^6)$)

practical for small graphs

more in Aggarwal Ch 17.3.3
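To make the relation $K(G_1, G_2) = \Phi(G_1) \cdot \Phi(G_2)$ concrete, the sketch below builds a simplified shortest-path kernel. It is an assumption-laden illustration rather than the kernel from the book: $\Phi(G)$ is taken to be a histogram over (label, label, shortest path length) triples, the feature map is computed explicitly for clarity (which a real kernel avoids), and the function names are hypothetical. Labels are assumed to be mutually comparable, e.g., strings.

```python
from collections import Counter, deque

def shortest_path_features(labels, edges):
    """Phi(G): counts of (end label, end label, path length) per node pair."""
    adj = {v: [] for v in labels}
    for v, u in edges:
        adj[v].append(u)
        adj[u].append(v)
    order = {v: i for i, v in enumerate(labels)}
    phi = Counter()
    for s in labels:
        dist = {s: 0}
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        for t, d in dist.items():
            if order[t] > order[s]:      # count each unordered pair once
                a, b = sorted((labels[s], labels[t]))
                phi[(a, b, d)] += 1
    return phi

def shortest_path_kernel(g1, g2):
    """K(G1, G2) as the dot product of the two feature histograms."""
    phi1 = shortest_path_features(*g1)
    phi2 = shortest_path_features(*g2)
    return sum(count * phi2[key] for key, count in phi1.items())

g1 = ({1: "A", 2: "B", 3: "A"}, [(1, 2), (2, 3)])   # path A - B - A
g2 = ({"x": "A", "y": "B"}, [("x", "y")])           # edge A - B
print(shortest_path_kernel(g1, g2), shortest_path_kernel(g1, g1))
```

A larger kernel value means more shared (label pair, distance) patterns; values are not normalized, so larger graphs tend to score higher.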

Page 21: Lecture 8: Graph mining (Book Ch 17)

Next to frequent subgraph discovery

[Figure: concept map of methods, repeated from p.5.]

Page 22: Lecture 8: Graph mining (Book Ch 17)

Frequent subgraph discovery: Motivation

Predict:

anti-HIV activity

toxicity of compounds

binding ability with Anthrax toxin

Image source: https://slideplayer.com/slide/5894097/

Page 23: Lecture 8: Graph mining (Book Ch 17)

Frequent subgraph discovery

Task: Given a graph database, search for frequent subgraphs given a threshold $min_{fr}$.

Search idea: utilize monotonicity of frequency!

If $G_1$ is a subgraph of $G_2$, then $fr(G_1) \geq fr(G_2)$

The algorithms are similar to those for frequent itemsets, but more complex

two variants: the size of a graph may refer to a) number of nodes or b) number of edges ⇒ determines how new candidates are generated

Page 24: Lecture 8: Graph mining (Book Ch 17)

GraphApriori algorithm

$F_i$ = frequent subgraphs of size $i$, $C_i$ = candidates

$F_1 = \{G : |G| = 1,\ fr(G) \geq min_{fr}\}$; $i = 1$

while $F_i \neq \emptyset$

    generate candidates $C_{i+1}$ from $F_i$

    prune $G \in C_{i+1}$ if $G$ has a subgraph $G'$ such that $|G'| = i$ and $G' \notin F_i$ (= monotonicity criterion)

    count frequencies $fr(G)$, $G \in C_{i+1}$

    set $F_{i+1} = \{G \in C_{i+1} \mid fr(G) \geq min_{fr}\}$

    $i = i + 1$

return $\cup_i F_i$
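The level-wise loop can be run end to end under a strong simplifying assumption: if node labels are unique within every graph, a graph is just a set of labelled edges, and the subgraph test becomes a subset test. The sketch below uses that shortcut with an edge-based size; it does not enforce that patterns are connected and it sidesteps all isomorphism testing, so it illustrates only the loop structure. The name `frequent_edge_sets` and the toy database are hypothetical.

```python
from itertools import combinations

def frequent_edge_sets(database, min_fr):
    """Level-wise search; returns dict: frequent pattern -> relative frequency.

    database: list of graphs, each a set of edges; an edge is a frozenset
    of two node labels (labels assumed unique within a graph).
    """
    n = len(database)

    def fr(pattern):
        return sum(pattern <= g for g in database) / n

    edges = set().union(*database)
    level = {frozenset([e]) for e in edges
             if fr(frozenset([e])) >= min_fr}                # F_1
    frequent = {}
    i = 1
    while level:                                             # F_i non-empty
        frequent.update((p, fr(p)) for p in level)
        # Join: two size-i patterns sharing i - 1 edges give a candidate.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == i + 1}
        # Prune: every size-i sub-pattern must be frequent (monotonicity).
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, i))}
        level = {c for c in candidates if fr(c) >= min_fr}   # F_{i+1}
        i += 1
    return frequent

def edge(a, b):
    return frozenset((a, b))

db = [
    {edge("A", "B"), edge("B", "C"), edge("C", "D")},
    {edge("A", "B"), edge("B", "C")},
    {edge("A", "B"), edge("C", "D")},
]
for pattern, freq in frequent_edge_sets(db, min_fr=2 / 3).items():
    print(sorted(sorted(e) for e in pattern), round(freq, 2))
```

In this toy run the size-3 candidate is pruned without counting, because one of its size-2 sub-patterns (B-C together with C-D) is not frequent.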

Page 25: Lecture 8: Graph mining (Book Ch 17)

GraphApriori: Candidate generation

For all $G_1, G_2 \in F_i$, $|G_1| = |G_2| = i$:

1. Determine if $G_1$ and $G_2$ have a common subgraph $G_0$ of size $i - 1$.

There may be many isomorphic matchings ⇒ many alternative $G_0$s!

2. For each $G_0$, create candidate graphs of size $i + 1$:

node-based: include all common nodes + 2 non-matching nodes (with an extra edge or not)

edge-based: include all $i - 1$ common edges and 2 unique edges (with an extra node or not)

The same subgraphs may be generated multiple times ⇒ redundancy checking

Page 26: Lecture 8: Graph mining (Book Ch 17)

Example of node-based join

Image source: Aggarwal Ch 17

Page 27: Lecture 8: Graph mining (Book Ch 17)

Example of edge-based join

Image source: Aggarwal Ch 17

Page 28: Lecture 8: Graph mining (Book Ch 17)

Why is this heavy?

  • the number of candidate patterns may be huge!

  • subgraph isomorphism to identify pairs of subgraphs for joining

  • graph isomorphism for redundancy checking

  • subgraph isomorphism for monotonicity pruning

  • subgraph isomorphism for frequency counting

Easier if

  • there are many unique node labels

  • only small subgraphs are searched

  • edge-based join is used (usually fewer candidates)

Page 29: Lecture 8: Graph mining (Book Ch 17)

Next to graph clustering

[Figure: concept map of methods, repeated from p.5.]

Page 30: Lecture 8: Graph mining (Book Ch 17)

Distance-based clustering methods

Common approaches:

1. K-medoids (needs just a distance function)

2. Spectral and other graph-based methods

  • construct a nearest neighbour/similarity graph of the graph objects

  • cluster the nodes of the new graph

Remember: graph distance measures are very expensive to compute! → suitable for smaller graphs
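Because K-medoids needs only pairwise distances, it can run on a precomputed matrix of graph distances (for example MCG-based or edit distances). The sketch below is a basic alternating variant with a deliberately naive initialization (the first k objects), which is an assumption made to keep the example deterministic; the distance matrix is made up.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster objects 0..n-1 given a symmetric distance matrix.

    Returns (medoids, clusters) where clusters maps medoid -> member list.
    """
    n = len(dist)

    def assign(medoids):
        clusters = {m: [] for m in medoids}
        for j in range(n):
            nearest = min(medoids, key=lambda m: dist[j][m])
            clusters[nearest].append(j)
        return clusters

    medoids = list(range(k))          # naive initialization (assumption)
    for _ in range(max_iter):
        clusters = assign(medoids)
        # New medoid: the member with the smallest total distance to the
        # other members of its cluster; keep the old one if it is empty.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Hypothetical distances between five graphs: {0, 1, 2} and {3, 4} are close.
D = [
    [0, 1, 2, 8, 9],
    [1, 0, 1, 7, 8],
    [2, 1, 0, 6, 7],
    [8, 7, 6, 0, 1],
    [9, 8, 7, 1, 0],
]
print(k_medoids(D, k=2))
```

Filling the matrix is the costly step: it needs one graph distance computation per pair of graphs, which is why the approach suits only smaller graphs and databases.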

Page 31: Lecture 8: Graph mining (Book Ch 17)

Methods based on frequent subgraphs

Approach 1. Type transport: graphs → multidimensional (vector-space) representation

[Figure: flow diagram. Recoverable steps:]

1. Search frequent subgraphs (subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features $f_1, \ldots, f_d$ for the remaining subgraphs

4. Present graphs in vector space using $f_1, \ldots, f_d$, where $f_i$ = number of times the $i$th subgraph occurs in $G$ (or a binary or tf-idf representation)

5. Use text clustering methods

Note: involves an NP-hard subproblem

Page 32: Lecture 8: Graph mining (Book Ch 17)

Methods based on frequent subgraphs

Approach 2. XProj: cluster representatives = sets of frequent subgraphs

Initialization: create $K$ random clusters $C_1, \ldots, C_K$

for all $C_i$: $F_i$ = set of frequent subgraphs (of a given size) from $C_i$

repeat until convergence:

    assign each $G_j$ to the $C_i$ where $sim(G_j, F_i)$ is largest

    for all $C_i$ determine new $F_i$

$sim(G_j, F_i)$ = fraction of frequent graphs in $F_i$ that occur in $G_j$
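The similarity and the assignment step can be sketched in a few lines if the subgraph test is replaced by a subset test. As an assumption for illustration only, node labels are taken to be unique within each graph, so that graphs and frequent subgraphs are sets of labelled edges; the names `sim` and `assign_to_clusters` and the toy data are hypothetical, and recomputing each $F_i$ is left out.

```python
def sim(graph, frequent_patterns):
    """Fraction of the patterns in F_i that occur in the graph G_j."""
    if not frequent_patterns:
        return 0.0
    hits = sum(pattern <= graph for pattern in frequent_patterns)
    return hits / len(frequent_patterns)

def assign_to_clusters(graphs, representatives):
    """Index of the most similar representative set for every graph."""
    return [
        max(range(len(representatives)),
            key=lambda i: sim(g, representatives[i]))
        for g in graphs
    ]

def edge(a, b):
    return frozenset((a, b))

# Two hypothetical representative sets F_1 and F_2.
F = [
    [{edge("A", "B")}, {edge("A", "B"), edge("B", "C")}],
    [{edge("C", "D")}, {edge("D", "E")}],
]
graphs = [
    {edge("A", "B"), edge("B", "C"), edge("C", "D")},
    {edge("C", "D"), edge("D", "E")},
]
print(assign_to_clusters(graphs, F))
```

With general labelled graphs each `pattern <= graph` test would be a subgraph isomorphism check, which dominates the cost of every iteration.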

Page 33: Lecture 8: Graph mining (Book Ch 17)

Summary

[Figure: concept map of methods, repeated from p.5.]

Page 34: Lecture 8: Graph mining (Book Ch 17)

Image sources

Leskovec et al.: Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 2008.

Fan & Simeon: Integrity Constraints for XML. Principles of Database Systems 2000.

Gudes: Graph and Web Mining – Motivation, Applications and Algorithms. Data mining seminar, University of Helsinki 2010.