Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo

Distributed Graph Pattern Matching


Graphs are everywhere, and quite a few are huge graphs!

Examples: file systems, databases, the World Wide Web, social networks.

Graph search is key to social search engines!

Graph Pattern Matching

• Given two graphs G1 (pattern graph) and G2 (data graph),
  – decide whether G1 “matches” G2 (Boolean queries);
  – identify “subgraphs” of G2 that match G1.
• Applications
  – Web mirror detection / Web site classification
  – Complex object identification
  – Software plagiarism detection
  – Social network / biology analyses
  – …
• Matching Semantics
  – Traditional: Subgraph Isomorphism
  – Emerging applications: Graph Simulation and its extensions, etc.


A variety of emerging real-life applications!

Distributed Graph Pattern Matching

• Real-life graphs are typically way too large:
  – Yahoo! web graph: 14 billion nodes
  – Facebook: over 0.8 billion users
• Real-life graphs are naturally distributed:
  – Google, Yahoo and Facebook have large-scale data centers


It is natural to study “distributed graph pattern matching”!

It is NOT practical to handle large graphs on single machines

Distributed graph processing is inevitable

Distributed Graph Pattern Matching


• Given pattern graph Q(Vq, Eq) and fragmented data graph F = (F1, …, Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the maximum match in G for Q, via graph simulation.

There exists a unique maximum match for graph simulation!

Graph Simulation

• Given pattern graph Q(Vq, Eq) and data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if
  – (1) for each (u, v) ∈ R, u and v have the same label; and
  – (2) for each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E such that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation, if there exists a total match relation M:
  – for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
  – Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match.
  – Simulation was initially proposed for the analyses of programs; and simulation and its extensions were recently introduced for social networks.


Subgraph isomorphism (NP-complete) vs. graph simulation (O(n²))!
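The definition of graph simulation can be illustrated with a minimal Python sketch. It computes the maximum match by a naive fixpoint: start from all label-compatible pairs and repeatedly discard pairs that violate the edge condition. This is an illustration only, not the quadratic-time algorithm referred to on the slide; the function name `max_simulation` and the dictionary/list graph encoding are assumptions made for this example.

```python
from collections import defaultdict


def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return the maximum match as {pattern node: set of data nodes}.

    q_nodes / g_nodes map node ids to labels; q_edges / g_edges are lists
    of directed (source, target) pairs. An empty dict is returned when G
    does not match Q, i.e. some pattern node ends up with no match.
    (Hypothetical encoding, chosen for this sketch.)
    """
    q_succ = defaultdict(set)
    for u, u2 in q_edges:
        q_succ[u].add(u2)
    g_succ = defaultdict(set)
    for v, v2 in g_edges:
        g_succ[v].add(v2)

    # Condition (1): start from every pair with equal labels.
    sim = {u: {v for v, lab in g_nodes.items() if lab == q_nodes[u]}
           for u in q_nodes}

    # Condition (2): drop v from sim[u] whenever some pattern edge (u, u2)
    # has no data edge (v, v2) with v2 still in sim[u2]; repeat to a fixpoint.
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for v in list(sim[u]):
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True

    # The match relation must be total over the pattern nodes.
    if any(not matches for matches in sim.values()):
        return {}
    return sim


# Pattern: an A node with a B child. Data node 3 is labelled A but has no B child.
Q_NODES = {"a": "A", "b": "B"}
Q_EDGES = [("a", "b")]
G_NODES = {1: "A", 2: "B", 3: "A", 4: "C"}
G_EDGES = [(1, 2), (3, 4)]
result = max_simulation(Q_NODES, Q_EDGES, G_NODES, G_EDGES)
```

Because pairs are only ever removed, and a pair is removed only when it cannot belong to any match, the loop converges to the unique maximum match.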

Graph Simulation


Subgraph Isomorphism is too strict for emerging applications!

Set up a team to develop a new software product

Graph simulation returns F3, F4 and F5;
Subgraph isomorphism returns empty!

Properties of Graph Simulation

• Let pattern Q = {Q1, …, Qh} (h CCs). For any data graph G,
  – if Mi is the maximum match in G for Qi,
  – then M1 ∪ … ∪ Mh is the maximum match in G for Q.


Impacts of connected components (CCs)

• Let data graph G = {G1, …, Gh} (h CCs). For any pattern graph Q,
  – if Mi is the maximum match in Gi for Q,
  – then M1 ∪ … ∪ Mh is the maximum match in G for Q.

• Let R ⊆ Vq × V be any binary relation on pattern graph Q(Vq, Eq) and data graph G(V, E) that contains the maximum match M in G for Q, and let R(G) consist of h CCs R(G)1, …, R(G)h.
  – If Mi is the maximum match in R(G)i for Q,
  – then M1 ∪ … ∪ Mh is exactly the maximum match in G for Q.

Even if data graph G is connected, R(G) might be highly disconnected, since useless nodes and edges of G are removed.
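The properties above all rely on splitting a graph into its connected components. The following is a minimal Python sketch of that step, assuming that components are taken over the undirected version of the graph (an assumption; the slides do not spell this out). The function name and the list-based graph encoding are hypothetical.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component, stack = {start}, [start]
        seen.add(start)
        while stack:
            x = stack.pop()
            for y in neighbours[x]:
                if y not in seen:
                    seen.add(y)
                    component.add(y)
                    stack.append(y)
        components.append(component)
    return components


# Two components: {1, 2, 3} and {4, 5}.
ccs = weakly_connected_components([1, 2, 3, 4, 5], [(1, 2), (3, 2), (4, 5)])
```

Each component can then be matched independently, and the per-component results are combined by set union.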

• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;
• Removing any node or edge from Gs makes Q1 NOT match Gs.



What can be computed locally?

Graph simulation has poor data locality


• Whether data node v in G matches pattern node u in Q can be determined locally iff subgraph desc(Q, u) is a DAG.


We turn to the data locality of single nodes

What have we learned from the static analysis?

• Treat each connected component in Q and G separately;
• Use the data locality to check whether a node in G can be determined locally.

desc(Q1, SA) is the subgraph in Q1 with nodes SA, SD and ST
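The locality condition can be sketched in Python: compute desc(Q, u), the subgraph of Q reachable from u, and test whether it is acyclic. This is a minimal sketch; the function names are hypothetical, and the example edges are invented for illustration (the slides only state that desc(Q1, SA) contains SA, SD and ST, not which edges connect them).

```python
def desc(edges, u):
    """Return (nodes, edges) of the subgraph reachable from u, including u."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in succ.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen, [(a, b) for a, b in edges if a in seen and b in seen]


def is_dag(nodes, edges):
    """Kahn's algorithm: acyclic iff every node can be removed in topological order."""
    indegree = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return removed == len(nodes)


def locally_checkable(q_edges, u):
    """True iff desc(Q, u) is a DAG, the condition stated on the slide."""
    nodes, edges = desc(q_edges, u)
    return is_dag(nodes, edges)


# Hypothetical pattern edges: SA's descendants form a DAG, X sits on a cycle.
Q_EDGES = [("SA", "SD"), ("SA", "ST"), ("SD", "ST"),
           ("X", "Y"), ("Y", "X"), ("X", "SA")]
sa_nodes, _ = desc(Q_EDGES, "SA")
```

With these invented edges, `locally_checkable(Q_EDGES, "SA")` holds while `locally_checkable(Q_EDGES, "X")` does not, because X and Y lie on a cycle.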

Complexity Analysis of Distributed Algorithms

Model of Computation:

• A cluster of identical machines (with one acting as coordinator);

• Each machine can directly send an arbitrary number of messages to any other machine;

• All machines cooperate through local computations and message-passing.

Complexity measures:

1. Visit times: the maximum number of times a machine is visited (interactions)
2. Makespan: the evaluation completion time (efficiency)
3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption)

Complexity Analysis of Distributed Algorithms

Specifications for the distributed algorithms:

• For each machine Si (1 ≤ i ≤ k),
  – Local information: (1) pattern graph Q; (2) subgraph Gs,i of G; and (3) a marked binary relation Ri ⊆ Vq × V, where each match (u, v) ∈ Ri is marked as true, false or unknown; and Ri can be updated by either messages or local computations.
  – Message: only local information is allowed to be exchanged.
  – Local computations: update Ri by utilizing the semantics of graph simulation. Local algorithms execute only local computations, without involving message-passing during the computation, and run in time polynomial in |Q| and |Gs,i|.

Complexity bounds:

1. The optimal data shipment is |G| - 1, and it is tight.
2. The optimal visit times are 1, and it is tight.
3. The minimum makespan problem is NP-complete.

Remarks:

1. Data shipment, visit times and makespan conflict with each other.
2. A well-balanced strategy is needed between makespan and the other two measures.

Distributed Evaluation of Graph Simulation

• Stage 1: Coordinator SQ broadcasts Q to all k sites;
• Stage 2: All sites, in parallel, partially evaluate Q on local fragments (partial match);
• Stage 3: Ship those CCs that span different machines to single machines, while minimizing data shipment and makespan;
• Stage 4: Compute, in parallel, the maximum matches in those CCs that originally spanned multiple machines;
• Stage 5: Collect and assemble partial matches in the coordinator.

Performance guarantees:

1. The total computational complexity is the same as that of the best-known centralized algorithm, while it invokes only 4 rounds of message-passing and local evaluation;
2. Total data shipment is bounded by |G| + 4|B| + |Q||G| + (k - 1)|Q|;
3. Each machine except coordinator SQ is visited g + 2 times (g is the maximum number of machines at which a CC resides in Stage 2), and SQ is visited 2(k - 1) times.

Sacrifice data shipment and visit times for makespan!

Scheduling Data Shipment - Stage 3


The Scheduling Problem:

Given h connected components, C1, …, Ch, and an integer k, find an assignment of the connected components to k identical machines, so that both the makespan and the total data shipment are minimized.

Approximation Hardness (data shipment, makespan):

The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.

Performance guarantees of algorithm dSchedule:

Algorithm dSchedule produces an assignment of the scheduling problem such that the makespan is within a factor (2 − 1/k) of the optimal one.

Remarks:

1. A heuristic is used to minimize the data shipment.
2. A greedy approach is adopted to guarantee the performance of the makespan.
3. The algorithm runs in O(kh) time and is very efficient, so it does not become a bottleneck.
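The greedy makespan guarantee can be illustrated with classic list scheduling, which assigns each component to the currently least-loaded machine and is known to stay within a factor (2 − 1/k) of the optimal makespan. The Python sketch below shows only that generic greedy step; it is not the dSchedule algorithm itself, and it does not model the data-shipment heuristic. The function name and the use of component sizes as processing costs are assumptions.

```python
def greedy_schedule(sizes, k):
    """Assign each component (given by its size) to the least-loaded machine.

    Returns (assignment, makespan), where assignment[i] is the machine
    index chosen for component i. Scanning k loads for each of the h
    components gives O(kh) time.
    """
    loads = [0] * k
    assignment = []
    for size in sizes:
        # Pick the machine with the smallest current load (ties: lowest index).
        machine = min(range(k), key=loads.__getitem__)
        loads[machine] += size
        assignment.append(machine)
    return assignment, max(loads)


# Five components on two machines.
assignment, makespan = greedy_schedule([3, 3, 2, 2, 2], k=2)
```

In this example the greedy result is 7 while the best possible split is 6 (3 + 3 versus 2 + 2 + 2), which is well inside the (2 − 1/k) = 1.5 factor.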

Optimization Techniques


Using data locality

Determine whether (u, v) belongs to the maximum match M in G for Q:
Case 1: when there are no boundary nodes in desc(G, v) of fragmented graph Gj;
Case 2: when there are boundary nodes in fragmented graph Gj, but subgraph desc(Q, u) of Q is a DAG.

Minimizing pattern graphs (Q ≡ Qm)

Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such that for any data graph G, G matches Q iff G matches Qm, via graph simulation.

(SA, SA2): Case 1 (BA, BA2): Case 2

Minimization

Experimental Study


Real life datasets:

Google Web graph: 875,713 nodes and 5,105,039 edges
Amazon product co-buy network: 548,552 nodes and 1,788,725 edges

Machines:
The experiments were run on a cluster of 16 machines, each with 2 Intel Xeon E5620 CPUs and 64GB memory

Synthetic graph generator: (10^8 nodes and 3,981,071,706 edges)
Three parameters:
1. The number n of nodes;
2. The number n^α of edges; and
3. The number l of node labels
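A generator with these three parameters could look like the Python sketch below. It is only an illustration under stated assumptions: edges are drawn uniformly at random, self-loops and duplicate edges are excluded, and labels are assigned uniformly. The slides do not describe how the actual generator distributes edges or labels. The names `generate_graph`, `alpha`, `num_labels` and `seed` are hypothetical. As a sanity check on the parameters, n = 10^8 with α = 1.2 gives roughly the edge count quoted above.

```python
import random


def generate_graph(n, alpha, num_labels, seed=0):
    """Generate a random labelled directed graph with n nodes and ~n**alpha edges.

    Returns (labels, edges): labels maps node id -> label id, and edges is a
    set of (source, target) pairs. Uniform sampling is an assumption of this
    sketch, not a property of the generator used in the experiments.
    """
    rng = random.Random(seed)
    labels = {v: rng.randrange(num_labels) for v in range(n)}

    # Cap the target at the number of distinct directed edges without self-loops.
    target = min(round(n ** alpha), n * (n - 1))
    edges = set()
    while len(edges) < target:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            edges.add((a, b))
    return labels, edges


# A small instance: 100 nodes, about 100**1.2 edges, 5 labels.
labels, edges = generate_graph(100, 1.2, 5, seed=1)
```

Rejection sampling is adequate for sparse graphs like these; a much denser target would call for enumerating and shuffling candidate edges instead.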

Algorithms:
Algorithm disHHK and its optimized version disHHK+

Optimal algorithms naiveMatchds (data shipment) and naiveMatchvt (visit times)

Experimental Study


1. All algorithms scale well except naiveMatchds and naiveMatchvt

2. disHHK+ consistently reduces the running time of disHHK by about 1/5 to 1/4

Experimental Study


1. All algorithms ship about 1/10000 of the data graphs
2. disHHK+ and disHHK even ship less data than naiveMatchds when data graphs are large and sparse

Experimental Study


disHHK+ and disHHK incur 30% to 53% more visit times than naiveMatchds, as expected

Conclusion


A first step towards the big picture of distributed graph pattern matching

We have formulated and investigated the distributed graph pattern matching problem, via graph simulation.

We have given a static analysis of graph simulation.
– Utility of connected components
– Study of data locality

We have studied the complexity of a large class of distributed algorithms for graph simulation.
– A message-passing computation model
– Makespan, data shipment, and visit times (conflicting with each other)

We have proposed a distributed algorithm for graph simulation.
– The scheduling problem
– Optimization techniques
– Experimental verification
