GraphOne: A Data Store for Real-time Analytics on Evolving Graphs Pradeep Kumar, H. Howie Huang The George Washington University




Graph Applications: Big Data and Scientific Applications

▪ Basic Building Block for Modern Trade
  ▪ Social Network, Navigation, E-Commerce, etc.

▪ Metadata Store for Big Data
  ▪ Key for Data Governance, e.g. GDPR

▪ Scientific Applications
  ▪ Protein-Protein Interaction, Metabolic Network, etc.

▪ And Many More
  ▪ Identification of Vulnerable Code, Cyber Attack
  ▪ Knowledge Graph

Li et al. Science’04

Definitions

▪ Topological Graph: Static (or slowly changing) Relationships
  ▪ Vertices: Switches and Computers
  ▪ Edges: Network Bandwidth between Vertices

▪ Batch Analytics on Topological Graph
  ▪ Network Infrastructure Management
  ▪ PageRank, Breadth First Traversal, etc.

▪ Streaming Graphs: Faster Arrival Rate
  ▪ Vertices: Users, Processes, Computers + Ports, etc.
  ▪ Edges: Network Data Flow, Authentication Logs, Process Events

▪ Stream Analytics on Streaming Graph
  ▪ Identification of Security Risks*

[Figure (a): A Computer Network Showing Articulation Points]

[Figure (b): A Stream Graph after Time t: Useful for finding all the infected machines. Edges are labeled with timestamps expressed as offsets of δ from t.]

* Grades’15 Paper from Pacific Northwest National Laboratory; LANL Net-flow dataset, 2015 and 2017; Computers and Security’15, …

Observation: Prior Works have Specialized/Private Data Stores

Solutions                                      Data Ingestion                     Data Access Type
Graph Databases (For Short Queries)            Fine-grained (e.g., Neo4j)         Whole Data
Dynamic Graph Engines (For Batch Analytics)    Fine-grained (e.g., Stinger)       Whole Data
Dynamic Graph Engines (For Batch Analytics)    Coarse-grained (e.g., GraphChi)    Whole Data
Stream Graph Engines (For Stream Analytics)    Coarse-grained (e.g., Naiad)       Streaming Access of Whole Data
Stream Graph Engines (For Stream Analytics)    Coarse-grained (e.g., GraphTau)    Streaming Access from a Time Window
GraphOne                                       Both                               All

Findings:

▪ Specialized/Private Data Stores

▪ Integration of Many Systems

▪ Excessive Data Duplication

▪ Weakest Link Problem

Research Question: How to perform a diverse set of real-time analytics in the presence of a high arrival rate of graph data?


Idea: Abstracting Data Store away from Graph Analytics Engine

[Diagram: Data Sources feed a Graph Data Store, GraphOne (FAST’19). Within the Graph Analytics Framework, Stream Analytics and Batch Analytics run on top of the store and produce Actionable Knowledge: Anomaly, Community, Ranking.]

Results:

▪ Over Neo4j and SQLite: 2 to 3 orders of magnitude higher ingestion rate

▪ Over Stinger: 5.36x higher ingestion rate, over 3x speedup for analytics

▪ Over Static Graph Engine: No Pre-processing Cost


GraphOne: Background and Architecture

(a) Example graph

[Figure: undirected graph with vertices 0 to 5 and edge weights w0 to w5]

(b) Edge List

0,1,w0  3,4,w1  1,4,w2  2,4,w3  1,2,w4  4,5,w5

▪ Log: Captures Temporal Ordering

▪ Fine-grained Versioning is Free

(c) Adjacency List

Vertex Array    Edge Arrays
0               1,w0
1               0,w0  4,w2  2,w4
2               1,w4  4,w3
3               4,w1
4               3,w1  2,w3  1,w2  5,w5
5               4,w5

▪ Indexed Data Structure

▪ Coarse-grained Versioning

Opportunity: Hybrid Store

Two New Abstractions:

▪ Data Visibility: Analytics choose their Ingestion type

▪ Graph View:
  ▪ Static View: Batch Analytics
  ▪ Stream View: Stream Analytics

[Architecture diagram. Data Management: Graph Data Updates, Logging, Archiving, Data Durability (NVMe). Hybrid Store: Circular Edge Log (New log, Old log; Fine-grained) and Adjacency Store (Coarse-grained). GraphView (Data Visibility): Static View for Batch Analytics, Stream View for Stream Analytics.]
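The relationship between the edge list (b) and the adjacency list (c) can be illustrated with a short sketch. It assumes the example graph is undirected, as the adjacency list on the slide suggests; the function name `build_adjacency` and the tuple layout are illustrative choices, not part of GraphOne.

```python
from collections import defaultdict

# Edge list of the example graph: (source, destination, weight label),
# kept in arrival order so the log captures temporal ordering.
edge_list = [
    (0, 1, "w0"), (3, 4, "w1"), (1, 4, "w2"),
    (2, 4, "w3"), (1, 2, "w4"), (4, 5, "w5"),
]


def build_adjacency(edges):
    """Index an undirected edge list by vertex (hypothetical helper)."""
    adjacency = defaultdict(list)
    for src, dst, weight in edges:
        # Each undirected edge is stored once per endpoint.
        adjacency[src].append((dst, weight))
        adjacency[dst].append((src, weight))
    return dict(adjacency)


adjacency = build_adjacency(edge_list)
for vertex in sorted(adjacency):
    print(vertex, adjacency[vertex])
```

The edge list answers "what arrived, and in which order" cheaply, while the adjacency list answers "who are the neighbors of vertex v" cheaply; the hybrid store keeps both so that each kind of analytics reads the form it needs.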

Hybrid Store Details

[Figure: hybrid store after eight logged updates on a graph with vertices 0 to 6.
Circular Edge Log, t0 to t7: 5,6  1,2  3,4  2,4  0,1  0,3  1,3  -2,4 (Edge Deletion at t7).
Global Snapshot List: S0, 4 and S1, 8.
Adjacency List: Vertex Array, Multi-version Degree Array (entries tagged S0), Edge Arrays.]

Hybrid Store Details

[Figure: hybrid store after ten logged updates on a graph with vertices 0 to 6.
Circular Edge Log, t0 to t9: 5,6  1,2  3,4  2,4  0,1  0,3  1,3  -2,4  2,5  3,6 (Edge Deletion at t7), with a "Non-archived edges" marker on the log.
Global Snapshot List: S0, 4 and S1, 8.
Adjacency List: Vertex Array, Multi-version Degree Array (entries tagged S0 and S1), Edge Arrays.]

GraphView

▪ Static View and Stream View

▪ Whole Data Access

▪ Data Access from a Time-Window

GraphView API:

  handle create-static-view(flags)
  status delete-static-view(handle)
  count  get-nebr-degree(handle, vertex-id)
  count  get-nebrs(handle, vertex-id, ptr)
  status update-view(handle)
  bool   has-vertex-changed(handle, vertex-id)
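As a rough illustration of how a snapshot list and a multi-version degree array could let a static view keep reading a stable graph while new edges arrive, here is a toy sketch that replays the log from the slide. Everything in it is an assumption made for illustration: the class `HybridStore`, its Python method names (loosely modeled on `create-static-view`, `get-nebrs` and `get-nebr-degree`), and the tombstone handling of deletions are simplifications, not GraphOne's implementation.

```python
class HybridStore:
    """Toy hybrid store: an append-only edge log plus a versioned adjacency store."""

    def __init__(self, vertex_count):
        self.log = []        # (src, dst, is_delete) in arrival order
        self.archived = 0    # number of log entries already archived
        self.snapshots = []  # snapshots[i] = archived edge count at snapshot Si
        self.edge_arrays = [[] for _ in range(vertex_count)]
        # entry_versions[v] = [(snapshot_id, entries of edge_arrays[v] visible then)]
        self.entry_versions = [[] for _ in range(vertex_count)]

    def add_edge(self, src, dst):
        self.log.append((src, dst, False))

    def del_edge(self, src, dst):
        self.log.append((src, dst, True))

    def non_archived(self):
        return self.log[self.archived:]

    def archive(self):
        """Move non-archived log entries into the adjacency store as one snapshot."""
        snapshot_id = len(self.snapshots)
        touched = set()
        for src, dst, is_delete in self.non_archived():
            # Undirected graph: record the edge (or its tombstone) at both ends.
            self.edge_arrays[src].append((dst, is_delete))
            self.edge_arrays[dst].append((src, is_delete))
            touched.update((src, dst))
        for v in touched:
            self.entry_versions[v].append((snapshot_id, len(self.edge_arrays[v])))
        self.archived = len(self.log)
        self.snapshots.append(self.archived)
        return snapshot_id

    def create_static_view(self):
        """Return a handle; in this sketch it is just the latest snapshot id."""
        return len(self.snapshots) - 1

    def get_nebrs(self, handle, vertex):
        """Neighbors of `vertex` as of the snapshot identified by `handle`."""
        visible = 0
        for snapshot_id, entries in self.entry_versions[vertex]:
            if snapshot_id <= handle:
                visible = entries
        nebrs = []
        for nebr, is_delete in self.edge_arrays[vertex][:visible]:
            if is_delete:
                if nebr in nebrs:
                    nebrs.remove(nebr)
            else:
                nebrs.append(nebr)
        return nebrs

    def get_nebr_degree(self, handle, vertex):
        return len(self.get_nebrs(handle, vertex))


# Replay the log from the slide: S0 covers 4 updates, S1 covers 8.
store = HybridStore(vertex_count=7)
for src, dst in [(5, 6), (1, 2), (3, 4), (2, 4)]:
    store.add_edge(src, dst)
store.archive()
s0 = store.create_static_view()
for src, dst in [(0, 1), (0, 3), (1, 3)]:
    store.add_edge(src, dst)
store.del_edge(2, 4)  # the "-2,4" entry at t7
store.archive()
s1 = store.create_static_view()
store.add_edge(2, 5)  # t8 and t9 stay in the log, not yet archived
store.add_edge(3, 6)

print(store.snapshots)         # archived edge count per snapshot
print(store.get_nebrs(s0, 2))  # view at S0 still sees edge 2,4
print(store.get_nebrs(s1, 2))  # view at S1 reflects the deletion
```

The point of the versioned counts is that a reader holding the older handle never sees entries appended by a later archiving pass, so no copy of the adjacency data is needed per view.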

Hybrid Store: Optimizing the Archiving Phase

[Figure: Edge Sharding. Edges between the Tail and Head of the Edge Log are sharded into Local Buffers by source vertex range, [v0, v1) and [v1, v2). Each range is then written into the Adjacency list Data with Non-atomic Archiving.]

[Chart: Run Time (sec), 0 to 100, of the Archiving Phase for "Archiving with Edge Sharding" versus "Archiving with Atomic Instruction", broken down into Sharding, Filling Degree, and Filling Edge Array.]

[Chart: Archiving Rate (Millions), 0 to 40, versus Thread Count (1, 2, 4, 8, 16, 32, 56).]

Machine: 2 Intel Xeon CPU sockets
▪ 14 cores each
▪ Hyperthreading enabled

Dataset: Kron-28 Graph
▪ 256 Million Vertices
▪ 4 Billion Edges
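The idea behind edge sharding can be sketched as follows: once edges are grouped by source-vertex range, each worker owns a disjoint set of vertices and can append to their edge arrays without atomic instructions. This is a minimal sketch under assumptions: the function names and the `boundaries` list are illustrative, edges are treated as directed, and the sharding step runs serially here even though the slide shows it feeding per-thread local buffers. CPython threads also do not show the speedup from the chart; the example only demonstrates the partitioning.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def shard_edges(edges, boundaries):
    """Group directed edges into buffers by source-vertex range.

    boundaries = [v0, v1, v2] defines the ranges [v0, v1) and [v1, v2).
    Every source vertex is assumed to fall inside [v0, v2).
    """
    buffers = [[] for _ in range(len(boundaries) - 1)]
    for src, dst in edges:
        shard = bisect_right(boundaries, src) - 1
        buffers[shard].append((src, dst))
    return buffers


def archive_shard(adjacency, shard):
    # Only this worker touches the vertices in its range, so plain
    # (non-atomic) appends are enough.
    for src, dst in shard:
        adjacency[src].append(dst)


def archive_with_sharding(edges, vertex_count, boundaries):
    adjacency = [[] for _ in range(vertex_count)]
    shards = shard_edges(edges, boundaries)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Consume the iterator so worker exceptions surface here.
        list(pool.map(lambda shard: archive_shard(adjacency, shard), shards))
    return adjacency


edges = [(0, 1), (4, 2), (1, 3), (3, 0), (0, 2), (5, 4)]
shards = shard_edges(edges, boundaries=[0, 3, 6])
adjacency = archive_with_sharding(edges, vertex_count=6, boundaries=[0, 3, 6])
print(shards)
print(adjacency)
```

Because sharding preserves log order within each range, every vertex still receives its edges in arrival order.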

Hybrid Store: Optimizing Memory Usage

▪ Edge Log: Circular

▪ Degree Array: Older Versions Reused

▪ Edge Arrays: Many Memory Links
  ▪ Cacheline Size Memory Allocation
  ▪ Hub Vertex Handling

Optimizations                 Chain Count (Average)    Chain Count (Maximum)    Memory Needed (GB)
Baseline System               29.18                    65536                    148.73
Static GraphOne               0.45                     1                        33.81
+Cacheline Size allocation    2.96                     65536                    47.42
+Hub vertex Handling          2.47                     3998                     45.79

[Figure: Vertex array, Multi-version Degree array, and Chained edge arrays for the example graph with vertices 0 to 6.]

[Chart: Speedup, 0 to 10, of Archiving Rate, BFS, PageRank, and 1-Hop for Baseline, +Cacheline Sized, and +Hub Vertex.]
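Why a minimum allocation size shortens the chains of edge arrays can be seen with a small model. The sketch below is only an approximation under stated assumptions: edges for one vertex arrive in archiving batches, each batch first fills free slots in the most recent block, and any remainder goes into one new block. The function `chain_count`, the `min_slots` parameter and the value 16 (for example, a 64-byte cache line holding 4-byte neighbor IDs with no header) are hypothetical and not taken from GraphOne.

```python
def chain_count(batch_sizes, min_slots=1):
    """Count chained blocks for one vertex whose edges arrive in batches.

    min_slots=1 models exact-size allocation (a new block per batch);
    a larger value models rounding each allocation up to a fixed size.
    """
    blocks = 0
    free = 0  # unused slots left in the most recent block
    for size in batch_sizes:
        used_from_free = min(size, free)
        free -= used_from_free
        remaining = size - used_from_free
        if remaining > 0:
            blocks += 1
            free = max(remaining, min_slots) - remaining
    return blocks


# A vertex that gains one edge in each of 20 archiving batches.
batches = [1] * 20
exact = chain_count(batches)                  # one block per batch
padded = chain_count(batches, min_slots=16)   # blocks padded to 16 slots
print(exact, padded)
```

In this model a larger `min_slots` stands in for giving high-degree (hub) vertices bigger blocks; the table above reports the measured effect of the actual optimizations on chain count and memory.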

Results: Hybrid Store Composition and Ingestion Rate

[Chart: Archiving Rate (Million Edges/s, axis 0 to 140) and Algorithm Run Time (Normalized, axis 0 to 1.25) versus Non-archived Edge Count (log axis with base 2, from 2^10 to 2^32). Series: PageRank, BFS, 1 Hop, Archiving Rate.]

▪ Max Non-archived Edge count: 2^17 edges

▪ Archiving Threshold: 2^16 edges

▪ Archive Rate: 43.68 Million edges/s

▪ Faster Ingestion?
  ▪ Logging rate: ~80 Million edges/sec
  ▪ Higher archiving threshold

Rates are in Million Edges/s.

Graph Name        Vertex Count (Millions)    Edge Count (Millions)    Logging    Archiving    Ingestion
Twitter (D)       52.58                      1963.26                  82.62      47.98        66.39
Friendster (D)    68.35                      2586.15                  82.85      49.32        60.40
Subdomain (D)     101.72                     2043.20                  82.86      43.43        68.25
Kron-28 (U)       256.00                     4096.00                  79.23      43.68        52.39
LANL (D)          0.16                       1521.19                  35.98      28.91        26.99

D = Directed, U = Undirected
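How an archiving threshold interacts with a circular edge log can be sketched as a ring buffer whose slots are reused once their edges have been archived. This is a toy model under assumptions: the class `CircularEdgeLog` and its fields are hypothetical, archiving runs synchronously as soon as the threshold is reached, and the demo uses a capacity of 8 and a threshold of 4 instead of the 2^16 edge threshold on the slide.

```python
class CircularEdgeLog:
    """Ring buffer of edges with threshold-triggered archiving (toy model)."""

    def __init__(self, capacity, archive_threshold):
        if archive_threshold > capacity:
            raise ValueError("threshold must not exceed log capacity")
        self.slots = [None] * capacity
        self.capacity = capacity
        self.threshold = archive_threshold
        self.head = 0  # total number of edges logged so far
        self.tail = 0  # total number of edges archived so far
        self.archived_batches = []

    def non_archived(self):
        return self.head - self.tail

    def log(self, edge):
        # Archived slots are free for reuse, so the write position wraps around.
        self.slots[self.head % self.capacity] = edge
        self.head += 1
        if self.non_archived() >= self.threshold:
            self.archive()

    def archive(self):
        batch = [self.slots[i % self.capacity] for i in range(self.tail, self.head)]
        if batch:
            self.archived_batches.append(batch)
        self.tail = self.head


edge_log = CircularEdgeLog(capacity=8, archive_threshold=4)
for i in range(10):
    edge_log.log((i, i + 1))

print(len(edge_log.archived_batches))  # number of archiving passes so far
print(edge_log.non_archived())         # edges still only in the log
print(edge_log.slots[0])               # slot 0 has been reused after wrap-around
```

A higher threshold means fewer, larger archiving passes, which is the trade-off the "Higher archiving threshold" bullet points at: ingestion moves closer to the logging rate while analytics have more non-archived edges to account for.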

Results: Dynamic Graph Systems

[Chart: Speedup, 0 to 16, of GraphOne over Stinger for Updates, BFS, and PageRank.]

[Charts: Update Rate (edges/sec), log scale from 1.0E+0 to 1.0E+8, for Neo4j versus GraphOne and for SQLite versus GraphOne.]

Against Stinger
▪ 5.36x Higher Ingestion Rate
▪ 12.76x and 3.18x Speedup for BFS and PageRank

Against SQLite and Neo4j:
▪ 2 to 3 Orders of Magnitude Higher Ingestion Rate

Dataset: RMAT Graph
▪ 4 Million Vertices, 64 Million Edges
▪ 8 bytes weight
▪ 40 Million updates, 2.5 Million deletions

Experimental Setup
▪ Two Intel Xeon CPU E5-2683 sockets with 14 Cores each
▪ 500GB Memory
▪ Samsung 950 Pro NVMe SSD, 512GB

Results: Cost of Data Management Over Static Graph Engine

▪ Static GraphOne: For Static Graphs
  ▪ Builds graph in entirety, Like Galois

▪ GraphOne: For Dynamic/Streaming Graphs
  ▪ Builds graph one edge at a time

[Chart: Speedup (Compared to Static GraphOne), 0 to 1.25, of BFS, PageRank, and 1-Hop on Twitter, Friendster, Subdomain, Kron-28-16, and Kron-21-16.]

▪ 17% less performance for real-world graphs

▪ Avoids pre-processing cost completely
  ▪ 32.73 second pre-processing cost, 34.12x of BFS

GraphOne: Conclusion

▪ Batch and Stream Analytics on Evolving Graphs

▪ Hybrid Store: Two Abstractions
  ▪ Data Visibility and GraphView

▪ Optimizations:
  ▪ Edge Sharding and Memory Allocation

▪ Results:
  ▪ Over Neo4j and SQLite: 2 to 3 orders of magnitude higher ingestion rate
  ▪ Over Stinger: 5.36x higher ingestion rate, over 3x speedup for analytics
  ▪ Over Static Graph Engine: No Pre-processing Cost

▪ A Real System you can use
  ▪ Can convert text/logs/csv/json to graph

Thank You

[email protected]