Upload
teagan
View
51
Download
0
Embed Size (px)
DESCRIPTION
Distributed Graph Pattern Matching. Shuai Ma , Yang Cao, Jinpeng Huai , Tianyu Wo. Graphs are everywhere , and quite a few are huge graphs!. Social Networks. File systems. Databases. World Wide Web. Graph searching is a key to social searching engines!. Graph Pattern Matching. - PowerPoint PPT Presentation
Citation preview
Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo
Distributed Graph Pattern Matching
2
File systems Databases World Wide Web
Graph searching is a key to social searching engines!
Graphs are everywhere, and quite a few are huge graphs!
Social Networks
Graph Pattern Matching• Given two graphs G1 (pattern graph) and G2 (data
graph), – decide whether G1 “matches” G2 (Boolean queries);– identify “subgraphs” of G2 that match G1
• Applications– Web mirror detection/ Web site classification – Complex object identification– Software plagiarism detection– Social network/biology analyses– …
• Matching Semantics– Traditional: Subgraph Isomorphism– Emerging applications: Graph Simulation and its
extensions, etc..
3
A variety of emerging real-life applications!
Distributed Graph Pattern Matching• Real-life graphs are typically way too large:
– Yahoo! web graph: 14 billion nodes– Facebook: over 0.8 billion users
• Real-life graphs are naturally distributed:– Google, Yahoo and Facebook have large-scale data centers
4
It is nature to study “distributed graph pattern matching”!
It is NOT practical to handle large graphs on single machines
Distributed graph processing is inevitable
Distributed Graph Pattern Matching
5
• Given pattern graph Q(Vq, Eq) and fragmented data graph F = (F1, … , Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the maximum match in G for Q, via graph simulation.There exists a unique maximum match for graph simulation!
Graph Simulation• Given pattern graph Q(Vq, Eq) and data graph G(V,
E), a binary relation R ⊆ Vq × V is said to be a match if– (1) for each (u, v) ∈ R, u and v have the same label; and – (2) for each edge (u, u′) ∈ Eq, there exists an edge (v, v′)
in E such that (u′, v′) ∈ R. • Graph G matches pattern Q via graph simulation, if
there exists a total match relation M– for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.– Intuitively, simulation preserves the labels and the child
relationship of a graph pattern in its match. – Simulation was initially proposed for the analyses of
programs; and simulation and its extensions were recently introduced for social networks.
6
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n2))!
Graph Simulation
7
Subgraph Isomorphism is too strict for emerging applications!
Set up a team to develop a new software product
Graph simulation returns F3, F4 and F5;Subgraph isomorphism returns empty!
Properties of Graph Simulation
• Let pattern Q = {Q1, . . . , Qh} (h CCs). For any data graph G, – if Mi is the maximum match in G for Qi, – then M1 ∪ … ∪ Mh is the maximum match in G for Q.
8
Impacts of connected components (CCs)
• Let data graph G = {G1, . . . , Gh} (h CCs). For any pattern graph G,– if Mi is the maximum match in Gi for Q, – then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Any binary relation R ⊆ Vq × V on pattern graph Q(Vq,Eq) and data graph G(V,E) that contains the maximum match M in G for Q. – If Mi is the maximum match in R(G)i for Q, – then M1 ∪ … ∪ Mh is exactly the maximum match in G for
Q,where R(G) consists of h CCs R(G)1, . . . , R(G)h).
Even if data graph G is connected, R(G) might be highly disconnected, by removing useless nodes and edges from G.
• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;• Removing any node or edge from Gs makes Q1 NOT match Gs.
Properties of Graph Simulation
9
What can be computed locally?
Graph simulation has poor data locality
Properties of Graph Simulation
• Checking whether data node v in G matches pattern node u in Q can be determined locally iff subgraph desc(Q, u) is a DAG.
10
We turn to the data locality of single nodes
What we have learned from the static analysis?
• Treat each connected component in Q and G separately;• Use the data locality to check whether a node in G can be
determined locally.
desc(Q1, SA) is the subgraph in
Q1 with nodes SA, SD and ST
Complexity Analysis of Distributed Algorithms
• A cluster of identical machines (with one acted as coordinator);
• Each machine can directly send arbitrary number of messages to another one;
• All machines co-work with each other by local computations and message-passing.
11
Model of Computation:
Complexity measures:
1. Visit times: the maximum visiting times of a machine (interactions)2. Makespan: the evaluation completion time (efficiency)3. Data shipment: the size of the total messages shipped among distinctmachines (network band consumption)
Complexity Analysis of Distributed Algorithms
• For each machine Si (1 ≤ i ≤ k) ,– Local information: ( 1) pattern graph Q; (2) subgraph Gs,i of G; and (3)
a marked binary relation Ri ⊆ Vq × V, where each match (u; v) 2 Ri is marked as true, false or unknown; and Ri can be updated by either messages or local computations.
– Message: only local information is allowed to be exchanged– Local computations: update Ri by utilizing the semantics of graph
simulation. local algorithms execute only local computations without involving message-
passing during the computation, run in time of a polynomial of |Q| and |Gs,i|.
12
Specifications for the distributed algorithms:
Complexity bounds:1. The optimal data shipment is |G| - 1, and it is tight.2. The optimal visit times are 1, and it is tight.3. The minimum makespan problem is NP-complete.
Remarks:1. Data shipment, visit times and makespan are controversial with each other.2. A well-balanced strategy between makespan and the other two measures.
Distributed Evaluation of Graph Simulation• Stage 1: Coordinator SQ broadcasts Q to all k sites;• Stage 2: All sites, in parallel, partially evaluate Q on local
fragments – partial match;• Stage 3: Ship those CCs across different machines to single
machines, while minimizing data shipment and makespan;• Stage 4: Compute the maximum matches in those CCs
originally across multiple machines in parallel;• Stage 5: Collect and assemble partial matches in the
coordinator.
13
Performance guarantees:
1. The total computational complexity is the same to the best-known centralizedalgorithm, while it invokes 4 rounds of message-passing and local evaluation only;2. Total data shipment is bounded by |G| + 4|B| + |Q||G| + (k - 1) |Q|;3. Each machine except coordinator SQ is visited with g + 2 times (g is themaximum number machines at which a CC resides in Stage2, and SQ is visited2(k -1) times.
Sacrifice data shipment and visit times for makspan!
Scheduling Data Shipment - Stage 3
14
The Scheduling Problem:
Given h connected components, C1, … ,Ch, and an integer k, find an assignment of the connected component to k identical machines, so that both the makespan and the total data shipment are minimized.
Approximation Hardness (data shipment, makespan):
The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.
Performance guarantees of algorithm dSchedule:
Algorithm dSchedule produces an assignment of the scheduling problem such that the makespan is within a factor (2 − 1/k) of the optimal one.
Remarks:
1. A heuristic is used to minimize the data shipment,2. A greedy approach is adopted to guarantee the performance of the makesapn.3. The algorithm runs in O(kh), and is very efficient. Hence, its evaluation could not
cause a bottleneck.
Optimization Techniques
15
Using data locality
Determine whether (u, v) belongs to the maximum match M in G for Q:Case 1: when there are no boundary nodes in desc(G, v) of fragmented graph Gj;Case 2: when there are boundary nodes in fragmented graph Gj, but subgraph desc (Q, u) of Q is a DAG
Minimizing pattern graphs (Q ≡ Qm)
Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such that for any data graph G, G matches Q iff G matches Qm, via graph simulation.
(SA, SA2): Case 1 (BA, BA2): Case 2
Minimization
Experimental Study
16
Real life datasets:
Google Web graph: 875,713 nodes and 5,105,039 edgesAmazon product co-buy network: 548,552 nodes and 1,788,725 edges
Machines:The experiments were run on a cluster of 16 machines, all with 2 Intel Xeon E5620 CPUs and 64GB memory
Synthetic graph generator: (108 nodes and 3,981,071,706 edges)Three parameters:1. The number n of nodes;2. The number nα of edges; and 3. The number l of node labels
Algorithms:Algorithm disHHK and its optimized version disHHK+
Optimal algorithms naiveMatchds(data shipment) and naiveMatchvt (visit times)
Experimental Study
17
1. All algorithms scale well except naiveMatchds and naiveMatchvt
2. disHHK+ consistently reduces about [1/5, 1/4] running time of disHHK
Experimental Study
18
1. All algorithms ship about 1/10000 of the data graphs 2. disHHK+ and disHHK even ship less data than naiveMatchds when data graphs are large and sparse
Experimental Study
19
disHHK+ and disHHK have [30%, 53%] more visit times than naiveMatchds, as expected
Conclusion
20
A first step towards the big picture of distributed graph pattern matching
We have formulated and investigated the distributed graph pattern matching problem, via graph simulation.
We have given a static analysis of graph simulation– Utility of connected components– Study of data locality
We have studied the complexity of a large class of distributed algorithms for graph simulation.
– A message-passing computation model– Makespan, data shipment, and visit times (controversial with each
other)We have proposed a distributed algorithm for graph simulation– The scheduling problem– Optimization techniques– Experimental verification