
Page 1

Database Laboratory Regular Seminar

2013-11-11, TaeHoon Kim

Chonbuk National Univ, Database Laboratory v.2

TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC

Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, Hwanjo Yu

POSTECH, DGIST. SIGKDD 2013, ACM

ACM SIGKDD is a conference on Knowledge Discovery and Data Mining

Page 2

Contents
• Introduction
• Related Work
• Efficient Graph Storage
• Disk-Based Parallel Graph Computation
• Processing Graph Queries*
• Experiments
• Conclusion

Page 3

Introduction
• Graphs are used to model many real objects
  – Web graphs, chemical compounds, biological structures
• Real graphs are very large
  – Facebook reached one billion users on Oct. 4, 2012
  – The Yahoo Web graph consists of 1.4 billion vertices and 6.6 billion edges
• However, if there are a billion vertices in the graph database, the size of the mapping table is too large to fit into memory
• For fast graph retrieval on a single commodity PC, graphs must be stored in fast external memory, such as FlashSSDs

Page 4

Introduction
• Several systems have been proposed to handle big graphs efficiently
  – GBase is a recent graph engine using MapReduce
    • If the graph is represented as a compressed matrix, many representative graph queries can be solved by matrix-vector multiplication
    • However, distributed systems based on MapReduce are generally slow unless there is a sufficient number of machines in the cluster
  – Distributed systems based on the vertex-centric programming model, such as Pregel, GraphLab, and PowerGraph, have been proposed
    • However, efficient processing of graph operations is very difficult
    • The user needs to be skilled at managing and tuning a distributed system in a cluster, which is a nontrivial job for ordinary users
  – Recently, a disk-based graph processing engine on a single PC called GraphChi has been proposed
    • It exploits the novel concept of parallel sliding windows (PSW)

Page 5

Introduction
• Processing with PSW in GraphChi
  – 1) Load a subgraph, 2) update the vertices and edges, 3) write the updated parts of the subgraph to disk
• We observe that PSW incurs four serious problems
  – 1) In order to start updating vertices/edges in a shard file, their in-edges must be fully loaded into memory
  – 2) All edges in a shard file whose source and target vertices are in the same execution interval are processed in sequential order, which hinders full parallelism
  – 3) At each iteration, a significant number of updated edges can be flushed to disk
    • If the graph is very large and/or there are many iterations, GraphChi incurs a significant amount of disk I/O
  – 4) Even if a query needs to access only a small portion of the data, the whole graph is read at the first iteration, resulting in poor hardware utilization

Page 6

Introduction
• In this paper, TurboGraph provides
  – The first truly parallel graph engine on a single PC
  – Full parallelism, including FlashSSD I/O parallelism and multi-core parallelism
  – Full overlap of CPU processing and I/O processing
• We present TurboGraph to process billion-scale graphs very efficiently by using modern hardware on a single PC
• We present a novel parallel execution model called pin-and-slide
• Pin-and-slide implements the column view of matrix-vector multiplication

Page 7

Related Work
• Distributed synchronous approaches
  – PEGASUS and GBase
    • Based on MapReduce; they support matrix-vector multiplication using compressed matrices
  – All synchronous approaches above can suffer from costly performance penalties
    • Because the runtime of each step is determined by the slowest machine in the cluster, due to hardware variability and network imbalance
• Distributed asynchronous approaches
  – GraphLab is also based on the vertex-centric programming model
    • The vertex kernel is executed asynchronously in parallel on each vertex
    • However, some algorithms based on asynchronous computation require serializability for correctness

Page 8

Related Work
• Distributed asynchronous approaches
  – PowerGraph
    • Basically similar to GraphLab
    • It partitions and stores the graph by exploiting a property of real-world graphs: a highly skewed power-law degree distribution
    • However, efficient graph partitioning in a distributed environment for all types of graph operations is an inherently hard problem
• Single-machine approaches
  – GraphChi
    • A disk-based single-machine system following the asynchronous vertex-centric programming model
    • Uses PSW
    • GraphChi is very efficient, and thus able to solve large problems while using only a single machine, but the four serious problems above remain

Page 9

Efficient Graph Storage
[Figure: slotted pages on the FlashSSD — each record slot has a start offset and LRPL offset, large records record their number of LA pages, and an in-memory RID table points to buffer-pool offsets]
• A record means an adjacency list
• The slotted page is known to be very good for supporting efficient updates
• Disk-based Graph Representation
  – For vertices v0–v5, the adjacency lists are stored as small records in pages p0–p2, while the adjacency list of v6 is stored as a large record that spans the two pages p3 and p4
  – Since the size of this RID table is very small, we can safely make it resident in memory
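As a rough illustration, here is a minimal C++ sketch of locating an adjacency list through the RID table. The names (GraphStorage, adjacency) and the in-memory page array are our assumptions for demonstration; the slides only specify that records are adjacency lists in slotted pages and that the RID table is memory-resident.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct RID {                 // record identifier: page number + slot in page
    uint32_t pageID;
    uint16_t slotNo;
};

struct SlottedPage {         // simplified: slot -> record (an adjacency list)
    std::vector<std::vector<uint64_t>> slots;
};

struct GraphStorage {
    std::vector<RID> ridTable;       // one entry per vertex; memory-resident
    std::vector<SlottedPage> pages;  // on the FlashSSD in the real system

    const std::vector<uint64_t>& adjacency(uint64_t v) const {
        const RID& rid = ridTable[v];                // cheap in-memory lookup
        return pages[rid.pageID].slots[rid.slotNo];  // one page access
    }
};

int main() {
    GraphStorage g;
    g.pages = {{{{1, 2}, {0}}}};      // p0 holds the records of v0 and v1
    g.ridTable = {{0, 0}, {0, 1}};
    for (uint64_t u : g.adjacency(0))
        std::printf("%llu ", (unsigned long long)u);   // neighbors of v0
}
```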

Page 10

Efficient Graph Storage (two core functions)
• In-memory Data Structures and Core Operations
  – The buffer pool, an array of frames, is resident in memory between the threads and the FlashSSD
  – PINPAGE: if the page already exists in the buffer, increment its pinCount
    • Otherwise, obtain an empty frame by LRU replacement and load the page from disk into the frame
    • Return the memory address of the frame where the page was loaded
  – UNPINPAGE: decrement the page's pinCount
  – PINCOMPUTEUNPIN(PageID pid, list<RID> RIDList, UserObject u0)
    • Provides asynchronous I/Os to the FlashSSD; u0 supplies the user-defined function for RID processing
[Figure: an execution thread issues PinComputeUnpin for page p0; once p0 is loaded into the buffer pool, a callback thread runs u0.Compute(v1, Iterator(v1, adj))]
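To make the three operations concrete, here is a minimal synchronous sketch with hypothetical types (Frame, BufferManager) that are not in the slides. The real engine evicts via LRU and overlaps asynchronous FlashSSD I/O with callback threads, which this sketch deliberately omits.

```cpp
#include <cstdio>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

using PageID = uint32_t;
using RID = uint64_t;

struct Frame { std::vector<char> data; int pinCount = 0; };

class BufferManager {
    std::unordered_map<PageID, Frame> frames_;  // stand-in for the frame array
public:
    Frame* PinPage(PageID pid) {
        auto it = frames_.find(pid);
        if (it == frames_.end()) {
            // Miss: the real system evicts an unpinned frame via LRU and
            // reads the page from the FlashSSD; here we just materialize it.
            it = frames_.emplace(pid, Frame{std::vector<char>(4096), 0}).first;
        }
        it->second.pinCount++;        // pinned pages cannot be evicted
        return &it->second;
    }
    void UnpinPage(PageID pid) { frames_.at(pid).pinCount--; }

    // Pin the page, apply the user function to each requested RID, unpin.
    void PinComputeUnpin(PageID pid, const std::vector<RID>& rids,
                         const std::function<void(RID, const Frame&)>& compute) {
        Frame* f = PinPage(pid);
        for (RID r : rids) compute(r, *f);
        UnpinPage(pid);
    }
};

int main() {
    BufferManager bm;
    bm.PinComputeUnpin(0, {7, 8}, [](RID r, const Frame&) {
        std::printf("compute on record %llu\n", (unsigned long long)r);
    });
}
```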

Page 11

Disk-Based Parallel Graph Computation
• G = (V, E)
• Adjacency matrix M(G), where vi is the i-th vertex in G
• Let M(G)i be the i-th column vector of M(G)
• Given a column vector X with |X| = |V|, the matrix-vector multiplication Y = M(G) X can be defined in the column view as Y = Σi M(G)i Xi
[Figure: an example graph and its adjacency matrix; processing column M(G)i scatters vi's value Xi (e.g., 0.214) along vi's edges]
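The column view reads directly off adjacency lists: processing vertex vi contributes M(G)i Xi to Y, i.e., Xi is added to Y at every neighbor of vi, so each pinned page of adjacency lists can be processed independently. A tiny self-contained sketch (the graph and values below are illustrative, not from the slides):

```cpp
#include <cstdio>
#include <vector>

int main() {
    // adj[i] = neighbors of vertex vi (the nonzero rows of column M(G)i)
    std::vector<std::vector<int>> adj = {{1, 2}, {2}, {0}};
    std::vector<double> X = {0.5, 0.25, 0.25};
    std::vector<double> Y(adj.size(), 0.0);

    // Y = sum_i M(G)i * Xi: one pass per column
    for (size_t i = 0; i < adj.size(); ++i)
        for (int j : adj[i])
            Y[j] += X[i];              // unweighted: M(G) entry (j, i) is 1

    for (double y : Y) std::printf("%.2f ", y);   // 0.25 0.50 0.75
}
```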

Page 12

Disk-Based Parallel Graph Computation
• X: input bit vector; Y: output bit vector
  – We use bit vectors for the graph computation
• U0: user object
  – Defines the user function Compute as one of its methods
• Execution thread (issues parallel asynchronous I/O)
  – Calls PinComputeUnpin
• Callback thread (concurrently processes the vertices)
  – Calls U0.Compute(v1, Iterator)
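A minimal sketch of the two thread roles, assuming a simple producer/consumer queue; the slides do not give this structure, and in the real engine the completion signals come from the FlashSSD's asynchronous I/O rather than a simulated sleep.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::queue<int> loaded;            // pages whose "I/O" has completed
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread execution([&] {        // execution thread: issues page reads
        for (int pid = 0; pid < 3; ++pid) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10)); // "I/O"
            { std::lock_guard<std::mutex> lk(m); loaded.push(pid); }
            cv.notify_one();           // completion signal
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    std::thread callback([&] {         // callback thread: overlaps Compute
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return !loaded.empty() || done; });
            if (loaded.empty()) break;
            int pid = loaded.front(); loaded.pop();
            lk.unlock();
            std::printf("Compute on page %d\n", pid);  // U0.Compute(...)
            lk.lock();
        }
    });

    execution.join();
    callback.join();
}
```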

Page 13

Disk-Based Parallel Graph Computation
• If a page is fully loaded, the fully loaded page is pinned and processed
• If a page is only partially loaded, we choose which pages to load using 0/1 knapsack (see the sketch after this list), then call PinComputeUnpin
• Processing is ordered from large adjacency lists to small ones
• Vertices are processed in parallel with Compute
• To be thread safe, we use a latch-free approach
[Figure: steps 1–5 of the execution over the buffer pool]
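The slides only name 0/1 knapsack; one plausible reading is that, given the remaining buffer capacity, the engine picks the subset of candidate pages maximizing some value such as the number of requested vertices per page. A standard DP sketch under that assumed cost model:

```cpp
#include <cstdio>
#include <vector>

int main() {
    int capacity = 8;                           // free buffer frames (made up)
    std::vector<int> size  = {3, 4, 2, 5};      // frames each page would need
    std::vector<int> value = {4, 5, 3, 8};      // requested vertices per page

    // classic 0/1 knapsack DP over remaining capacity, O(n * capacity)
    std::vector<int> best(capacity + 1, 0);
    for (size_t i = 0; i < size.size(); ++i)
        for (int c = capacity; c >= size[i]; --c)
            if (best[c - size[i]] + value[i] > best[c])
                best[c] = best[c - size[i]] + value[i];

    std::printf("max value loadable: %d\n", best[capacity]);
}
```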

Page 14

Disk-Based Parallel Graph Computation
• Thread-safe latch-free approach
[Figure: two threads Th1 and Th2 concurrently update a shared value (5 → 5.1) without taking a latch]
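The slides do not spell out the mechanism; a common latch-free pattern that fits the figure is atomic compare-and-swap, sketched below. This is our assumption for illustration, not the paper's code.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<double> vertexValue{5.0};

void addLatchFree(double delta) {
    double old = vertexValue.load();
    // Retry until our increment is applied atomically; no latch is held,
    // so neither thread can block the other.
    while (!vertexValue.compare_exchange_weak(old, old + delta)) {}
}

int main() {
    std::thread t1(addLatchFree, 0.05);   // Th1
    std::thread t2(addLatchFree, 0.05);   // Th2
    t1.join(); t2.join();
    std::printf("%.2f\n", vertexValue.load());   // 5.10
}
```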

Page 15

15

• Handling General Vectors– Example 1. We explain our pin-and-slide model handling general

vectors by using a PageRank query• Step1

– After first reading pages from disk into the buffer {p0, p1, p2}, We read the first chunk of each attribute vector into memory

– Then we join between block1 and chunk1• Step2

– We read chunk2 of each attribute vector into memory join between block1 and chunk2 and updates of chunk1 of the output vector

• Step3 – Since we complete the processing for block1, we read new pages from

disk {p3,p4}, then we join block2 and chunk1 and write the results to chunk2 of the output vector

• Step4 – We do the final join and update chunk2 of the output vector

Disk-Based Parallel Graph Computation

Page 16

Disk-Based Parallel Graph Computation
[Figure, step 1: the buffer pool holds block1 = {p0, p1, p2}, the adjacency lists of V0–V5; outDegree = (2, 2, 2, 2, 2, 2, 7); prevPR = 0.143 for each of V0–V6; block1 is JOINed with chunk1 of prevPR, writing 0.082 for V0–V5 and 0 for V6 into the output vector]
• Example: V0 has in-edges from V1 and V6
  – 0.85 { (0.143 / 2) + (0 / 7) } + (0.15 / 7) = 0.082
  – V6 contributes 0 so far because its value lies in chunk2, which has not been read yet

Page 17

Disk-Based Parallel Graph Computation
[Figure, step 2: block1 = {p0, p1, p2} stays pinned in the buffer pool; chunk2 of prevPR is read and JOINed with block1, updating the output for V0–V5 from 0.082 to 0.099]
• Example: V0 has in-edges from V1 and V6
  – 0.85 { (0.143 / 2) + (0.143 / 7) } + (0.15 / 7) = 0.099

Page 18

Disk-Based Parallel Graph Computation
[Figure, step 3: block1 SLIDEs out and block2 = {p3, p4}, V6's large adjacency list, is read into the buffer pool; block2 is JOINed with chunk1, writing 0.386 for V6 into chunk2 of the output vector]
• Example: V6 has in-edges from V0–V5 and, recursively, from V6 itself
  – 0.85 { (0.143 / 2) + (0.143 / 2) + … + (0.143 / 2) + (0 / 7) } + (0.15 / 7) = 0.386
  – The six (0.143 / 2) terms come from V0–V5; V6's own contribution is still 0 because this join uses chunk1 only

Page 19

Disk-Based Parallel Graph Computation
[Figure, step 4: block2 = {p3, p4} stays pinned; chunk2 of prevPR is JOINed with block2, updating V6's output from 0.386 to 0.403]
• Example: V6 has in-edges from V0–V5 and, recursively, from V6 itself
  – 0.85 { (0.143 / 2) + (0.143 / 2) + … + (0.143 / 2) + (0.143 / 7) } + (0.15 / 7) = 0.403
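As a self-contained cross-check of Example 1's numbers, here is a hedged sketch of the four-step schedule. The edge set is our reconstruction from the slides (V0–V5 paired off with out-degree 2, every vertex linking to V6, and V6 linking to all seven vertices including itself), and the block/chunk boundaries are likewise assumptions:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 7;
    // assumed in-edge lists: V0<-{V1,V6}, ..., V6<-{V0..V6} (self-loop)
    std::vector<std::vector<int>> in = {
        {1, 6}, {0, 6}, {3, 6}, {2, 6}, {5, 6}, {4, 6},
        {0, 1, 2, 3, 4, 5, 6}};
    std::vector<int> outDeg = {2, 2, 2, 2, 2, 2, 7};
    std::vector<double> prev(n, 1.0 / n);      // prevPR = 0.143 everywhere
    std::vector<double> out(n, 0.15 / n);      // teleport term added up front

    // block1 = adjacency of V0..V5, block2 = V6; chunk1 = prevPR of V0..V5,
    // chunk2 = prevPR of V6 (half-open vertex ranges)
    std::vector<std::pair<int, int>> blocks = {{0, 6}, {6, 7}};
    std::vector<std::pair<int, int>> chunks = {{0, 6}, {6, 7}};

    for (auto [blo, bhi] : blocks)      // pin a block, then slide over chunks
        for (auto [clo, chi] : chunks)  // the four joins = steps 1..4
            for (int v = blo; v < bhi; ++v)
                for (int u : in[v])
                    if (u >= clo && u < chi)          // u's value is loaded
                        out[v] += 0.85 * prev[u] / outDeg[u];

    for (int v = 0; v < n; ++v)
        std::printf("V%d: %.3f\n", v, out[v]);   // V0..V5: 0.099, V6: 0.403
}
```

After the block1 × chunk1 join, V0–V5 hold 0.082, and after block2 × chunk1, V6 holds 0.386, matching the intermediate values on the step-1 and step-3 slides.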

Page 20

Processing Graph Queries
• Targeted queries (BFSf(Vq))
  – BFS operators (a BFS sketch follows below):
    • 1-step (out-)neighbors
    • k-step neighbors
    • Induced subgraph
    • 1-step egonet
    • k-step egonet
    • k-core, cross-edges
• Global queries
  – We have already explained briefly how our model processes the PageRank query in Example 1
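For illustration, here is a minimal k-step-neighbors query as plain BFS from a query vertex Vq; the graph and k are made up, and in TurboGraph the frontier would be a bit vector with adjacency lists fetched page by page.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Return all vertices reachable from vq within k steps (excluding vq).
std::vector<int> kStepNeighbors(const std::vector<std::vector<int>>& adj,
                                int vq, int k) {
    std::vector<int> depth(adj.size(), -1);
    std::queue<int> q;
    depth[vq] = 0; q.push(vq);
    std::vector<int> result;
    while (!q.empty()) {
        int u = q.front(); q.pop();
        if (depth[u] == k) continue;          // stop expanding past k steps
        for (int w : adj[u])
            if (depth[w] < 0) {
                depth[w] = depth[u] + 1;
                result.push_back(w);
                q.push(w);
            }
    }
    return result;
}

int main() {
    std::vector<std::vector<int>> adj = {{1, 2}, {3}, {3}, {4}, {}};
    for (int v : kStepNeighbors(adj, 0, 2)) std::printf("%d ", v);  // 1 2 3
    std::printf("\n");
}
```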

Page 21

Experiments
• We use three real datasets for the experiments: LiveJournal, Twitter, and YahooWeb
  – The Twitter dataset contains 42M vertices and 1.5B edges
  – The YahooWeb dataset contains a web graph from Yahoo! with 1.4B vertices and 6.6B edges
• Experimental environment
  – Intel i7 6-core 3.2 GHz CPU and 12 GB DRAM
  – Two 512 GB SSDs (Samsung 840 Series)
• TurboGraph is compiled on Windows, while GraphChi is compiled on Linux
  – Note that disk I/O performance in Ubuntu is better than in Windows 7

Page 22

Experiments
• Breadth-First Search
  – We additionally perform experiments with Green-Marl, a state-of-the-art in-memory graph BFS engine
  – Varying the buffer size
  – Varying the number of execution threads
  – Observations:
    • Pre-loading the graph is very hard for GraphChi
    • GraphChi processes all edges serially
    • Green-Marl failed due to lack of memory

Page 23

Experiments
• Targeted Queries
• Global Queries

Page 24

Conclusion
• In this paper, we presented a fast parallel graph engine called TurboGraph for efficiently processing billion-scale graphs on a single PC
• We proposed the notion of the pin-and-slide model, which implements the column view of matrix-vector multiplication
  – It utilizes two types of threads, execution threads and callback threads, along with a buffer manager
• We showed that TurboGraph outperforms the state-of-the-art engines by up to four orders of magnitude

Page 25

Discussion
• Relation to prior work: the weaknesses of GraphChi's PSW are resolved by the proposed technique (e.g., pin-and-slide), and the resulting performance is shown to be superior to previous work
• Strengths
  – Because execution threads drive the FlashSSD's asynchronous I/O and callback threads perform Compute, CPU and I/O processing fully overlap, allowing faster processing than previous work
  – Proposes the novel pin-and-slide technique
• Weaknesses
  – Applicable only to graph-based data structures
  – Requires threads, an OS resource

Thank you for listening to my presentation :)

Page 26

• The contents above are what is important for understanding the paper
• Note that this PPT is a summary of my thinking