36
Graph500 and Green Graph500 benchmarks on SGI UV2000 *Yuichiro Yasui & Katsuki Fujisawa Kyushu University and JST CREST SGI User group conference Nov. 17, 2014

Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Embed Size (px)

Citation preview

Page 1: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Graph500 and Green Graph500 benchmarks on SGI UV2000

*Yuichiro Yasui & Katsuki Fujisawa Kyushu University and JST CREST

SGI User group conference Nov. 17, 2014

Page 2: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Outline 1.  Graph processing for large-scale networks

2.  Graph500 & Green Graph500 benchmarks

3.  Our NUMA-optimized BFS algorithm 4.  Numerical results on SGI UV 2000

Bottom%up(Top%down

CPU

RAM

NUMA%aware

Page 3: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Graph processing for Large scale networks •  Large scale graphs in various fields

–  US Road network : 58 million edges –  Twitter follow-ship : 1.47 billion edges –  Neuronal network : 100 trillion edges

89 billion vertices & 100 trillion edges Neuronal network @ Human Brain Project

Cyber-security Twitter

US road network 24 million vertices & 58 million edges 15 billion log entries / day

Social network

•  Fast and scalable graph processing by using HPC

large

61.6 million vertices & 1.47 billion edges

Page 4: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

•  Transportation •  Social network •  Cyber-security •  Bioinformatics

Graph analysis and important kernel BFS •  The cycle of graph analysis for understanding real-networks

graph processing

Understanding

Application field

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

Relationships - SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

graph

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

results Step1

Step2

Step3

constructing

•  concurrent search (breadth-first search) •  optimization (single source shortest path) •  edge-oriented (maximal independent set)

Page 5: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

•  Transportation •  Social network •  Cyber-security •  Bioinformatics

Graph analysis and important kernel BFS •  The cycle of graph analysis for understanding real-networks

•  concurrent search (breadth-first search) •  optimization (single source shortest path) •  edge-oriented (maximal independent set)

graph processing

Understanding

Application field

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

Relationships - SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

graph

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

results Step1

Step2

Step3

•  One of most important and fundamental processing •  Many algorithms and applications based on exists (Max.-flow and centrality) •  low arithmetic intensity & irregular memory accesses.

Breadth-first search (BFS)

Source

BFS Lv. 3

source Lv. 2 Lv. 1

Outputs:Distance (Lv.) and Predecessor for each vertex from source Inputs:Graph,

and source vertex

constructing

Page 6: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Twitter follow-ship network

•  follow-ship network – #Users (#vertices) 41,652,230 –  Follow-ships (#edges) 2,405,026,092

Lv. #users ratio (%) percentile (%) 0 1 0.00 0.00 1 7 0.00 0.00 2 6,188 0.01 0.01 3 510,515 1.23 1.24 4 29,526,508 70.89 72.13 5 11,314,238 27.16 99.29 6 282,456 0.68 99.97 7 11536 0.03 100.00 8 673 0.00 100.00 9 68 0.00 100.00

10 19 0.00 100.00 11 10 0.00 100.00 12 5 0.00 100.00 13 2 0.00 100.00 14 2 0.00 100.00 15 2 0.00 100.00

Total 41,652,230 100.00 -

BFS result from User 21,804,357

This network excludes unconnected users The six-degrees of

separation

Our algorithm computes a BFS in 60 ms only

Twitter2009

Page 7: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Highway

Bridge

Betweenness centrality (BC) •  Definition

CB(v) =!

s!v!t∈V

σst(v)σst

σst : number of shortest (s, t)-paths

σst(v) : number of shortest (s, t)-paths passing through vertex v

2 / 2

CB(v) =!

s!v!t∈V

σst(v)σst

σst : number of shortest (s, t)-paths

σst(v) : number of shortest (s, t)-paths passing through vertex v

2 / 2

: # of (s,t)-shortest paths : # of (s,t)-shortest paths passing throw v

Osaka road network 13,076 vertices and 40,528 edges

High(score(vertex/edge(=(Important(place(

c.g.)(Highway,(Bridge

•  BFS => one-to-all •  <#vertices> times BFS => all-to-all

•  BC requires the all-to-all shortest paths

•  BC measures important vertices and edges without coordinates

=>(13,076(times(BFS(computations

Page 8: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Graph500 Benchmark •  Measures a performance of irregular memory accesses •  TEPS score (# of Traversed edges per second) in a BFS

SCALE & edgefactor (=16)

Median TEPS

1.  Generation

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

3.  BFS x 64 2.  Construction

x 64

TEPS ratio

•  Generates synthetic scale-free network with 2SCALE vertices and 2SCALE×edgefactor edges by using SCALE-times the Rursive Kronecker products

www.graph500.org

G1 G2 G3 G4

Kronecker graph

Input parameters for problem size

Page 9: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Green Graph500 Benchmark •  Measures power-efficient using TEPS/W score •  Our results on various systems such as SGI UV series and Xeon servers, Android devices

http://green.graph500.org

Median TEPS

1. Generation - SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

- SCALE- edgefactor

- SCALE- edgefactor- BFS Time- Traversed edges- TEPS

Input parameters ResultsGraph generation Graph construction

TEPSratio

ValidationBFS

64 Iterations

3.  BFS phase

2. Construction x 64

TEPS ratio

Watt TEPS/W

Power measurement Green Graph500

Graph500

Page 10: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Target networks

20

25

30

35

40

45

15 20 25 30 35 40 45

log 2

(m)

log2(n)

USA-road-d.NY.gr

USA-road-d.LKS.gr

USA-road-d.USA.gr cit-Patents

soc-LiveJournal1

twitter-rv

Human Project

Graph500 (Toy)

Graph500 (Mini)

Graph500 (Small)

Graph500 (Medium)

Graph500 (Large)

Graph500 (Huge)

1blillion 1trillion

1blillion

1trillion

Human Brain 89 B, 100 T

Twitter2009

#(of(vertices((in(logscale)

#(of(edges((in(logscale)

US road network

Page 11: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Target networks on Smartphone

20

25

30

35

40

45

15 20 25 30 35 40 45

log 2

(m)

log2(n)

USA-road-d.NY.gr

USA-road-d.LKS.gr

USA-road-d.USA.gr cit-Patents

soc-LiveJournal1

twitter-rv

Human Project

Graph500 (Toy)

Graph500 (Mini)

Graph500 (Small)

Graph500 (Medium)

Graph500 (Large)

Graph500 (Huge)

1blillion 1trillion

1blillion

1trillion

Human Brain 89 B, 100 T

Twitter2009

Graph500 (SCALE20) ・Smartphone (4 cores)

Smartphone

US road network

#(of(vertices((in(logscale)

#(of(edges((in(logscale)

Page 12: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Target networks on Single-server

20

25

30

35

40

45

15 20 25 30 35 40 45

log 2

(m)

log2(n)

USA-road-d.NY.gr

USA-road-d.LKS.gr

USA-road-d.USA.gr cit-Patents

soc-LiveJournal1

twitter-rv

Human Project

Graph500 (Toy)

Graph500 (Mini)

Graph500 (Small)

Graph500 (Medium)

Graph500 (Large)

Graph500 (Huge)

1blillion 1trillion

1blillion

1trillion

Human Brain 89 B, 100 T

Twitter2009

Graph500 (SCALE29) ・4-way Intel Xeon (64 cores)

Graph500 (SCALE20) ・Smartphone (4 cores)

Single server

Smartphone

US road network

#(of(vertices((in(logscale)

#(of(edges((in(logscale)

Page 13: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Target networks on UV2000

20

25

30

35

40

45

15 20 25 30 35 40 45

log 2

(m)

log2(n)

USA-road-d.NY.gr

USA-road-d.LKS.gr

USA-road-d.USA.gr cit-Patents

soc-LiveJournal1

twitter-rv

Human Project

Graph500 (Toy)

Graph500 (Mini)

Graph500 (Small)

Graph500 (Medium)

Graph500 (Large)

Graph500 (Huge)

1blillion 1trillion

1blillion

1trillion

Human Brain 89 B, 100 T

Twitter2009

Graph500 (SCALE29) ・4-way Intel Xeon (64 cores)

Graph500 (SCALE32) ・UV2000 (1rack, 640 cores)

Graph500 (SCALE20) ・Smartphone (4 cores)

Single server

UV 2000

Smartphone

US road network

#(of(vertices((in(logscale)

#(of(edges((in(logscale)

Page 14: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Target networks on Supercomputer

20

25

30

35

40

45

15 20 25 30 35 40 45

log 2

(m)

log2(n)

USA-road-d.NY.gr

USA-road-d.LKS.gr

USA-road-d.USA.gr cit-Patents

soc-LiveJournal1

twitter-rv

Human Project

Graph500 (Toy)

Graph500 (Mini)

Graph500 (Small)

Graph500 (Medium)

Graph500 (Large)

Graph500 (Huge)

1blillion 1trillion

1blillion

1trillion

Human Brain 89 B, 100 T

US road network

Twitter2009

Graph500 (SCALE29) ・4-way Intel Xeon (64 cores)

Graph500 (SCALE40) ・BlueGene/Q (64K nodes) ・K computer (64K nodes)

Graph500 (SCALE32) ・UV2000 (1rack, 640 cores)

Graph500 (SCALE20) ・Smartphone (4 cores)

Single server

UV 2000

K and Sequoia

Smartphone

#(of(vertices((in(logscale)

#(of(edges((in(logscale)

Page 15: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Problem and Our motivation •  Does UV2000 obtain a high-performance without MPI?

Thread ? MPI

# of cores 1 4 32 640 512K 1280

K computer Thread << MPI

UV2000 Single-server Thread > MPI

Page 16: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Problem and Our motivation •  Does UV2000 obtain a high-performance without MPI?

•  Exploiting Algorithm on NUMA and cc-NUMA system –  Automatic processor topology detection

–  Affinity configurations for running threads and allocating memory

# of cores 1 4 32 640 512K 1280

K computer Thread ≈ MPI Thread << MPI

Node = 2 sockets Cube = 8 nodes Rack = 32 nodes

CPU

RAM

CPU

RAM

× 4 =

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

0th 3th

1st 2nd

0 1 2 3Adjacency Matrix

Partitioning Binding

UV2000 Single-server Thread > MPI

YES

Page 17: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Level-synchronized parallel BFS (Top-down) •  Started from source vertex and executes following two phases for each level

3) BFS iterations (timed).: This step iterates the timedBFS-phase and the untimed verify-phase 64 times. The BFS-phase executes the BFS for each source, and the verify-phaseconfirms the output of the BFS.

This benchmark is based on the TEPS ratio, which iscomputed for a given graph and the BFS output. Submissionagainst this benchmark must report five TEPS ratios: theminimum, first quartile, median, third quartile, and maxi-mum.

III. PARALLEL BFS ALGORITHM

A. Level-synchronized Parallel BFSWe assume that the input of a BFS is a graph G = (V, E)

consisting of a set of vertices V and a set of edges E.The connections of G are contained as pairs (v, w), wherev, w ∈ V . The set of edges E corresponds to a set ofadjacency lists A, where an adjacency list A(v) containsthe outgoing edges (v, w) ∈ E for each vertex v ∈ V . ABFS explores the various edges spanning all other verticesv ∈ V \{s} from the source vertex s ∈ V in a given graphG and outputs the predecessor map π, which is a map fromeach vertex v to its parent. When the predecessor map π(v)points to only one parent for each vertex v, it represents atree with the root vertex s. However, some applications, suchas the betweenness centrality [1], require all of the parentsfor each vertex, which is equivalent to the number of hopsfrom the source. Therefore, the output predecessor map isrepresented as a directed adjacency graph (DAG). In thispaper, we focus on the Graph500 benchmark, and assumethat the BFS output is a predecessor map that is representedby a tree.

Algorithm 1 is a fundamental parallel algorithm for aBFS. This requires the synchronization of each level thatis a certain number of hops away from the source. We callthis the level-synchronized parallel BFS [6]. Each traversalexplores all outgoing edges of the current frontier, which isthe set of vertices discovered at this level, and finds theirneighbors, which is the set of unvisited vertices at the nextlevel. We can describe this algorithm using a frontier queueQF and a neighbor queue QN , because unvisited verticesw are appended to the neighbor queue QN for each frontierqueue vertex v ∈ QF in parallel with the exclusive controlat each level (Algorithm 1, lines 7–12), as follows:

QN ←!w ∈ A(v) | w ̸∈ visited, v ∈ QF

". (1)

B. Hybrid BFS Algorithm of Beamer et al.The main runtime bottleneck of the level-synchronized

parallel BFS (Algorithm 1) is the exploration of all outgoingedges of the current frontier (lines 7–12). Beamer et al. [8],[9] proposed a hybrid BFS algorithm (Algorithm 2) thatreduced the number of edges explored. This algorithmcombines two different traversal kernels: top-down and

Algorithm 1: Level-synchronized Parallel BFS.Input : G = (V, A) : unweighted directed graph.

s : source vertex.Variables: QF : frontier queue.

QN : neighbor queue.visited : vertices already visited.

Output : π(v) : predecessor map of BFS tree.1 π(v)← −1, ∀v ∈ V2 π(s)← s3 visited← {s}4 QF ← {s}5 QN ← ∅6 while QF ̸= ∅ do7 for v ∈ QF in parallel do8 for w ∈ A(v) do9 if w ̸∈ visited atomic then

10 π(w)← v11 visited← visited ∪ {w}12 QN ← QN ∪ {w}

13 QF ← QN

14 QN ← ∅

bottom-up. Like the level-synchronized parallel BFS, top-down kernels traverse neighbors of the frontier. Conversely,bottom-up kernels find the frontier from vertices in candidateneighbors. In other words, a top-down method finds thechildren from the parent, whereas a bottom-up method findsthe parent from the children. For a large frontier, bottom-upapproaches reduce the number of edges explored, becausethis traversal kernel terminates once a single parent is found(Algorithm 2, lines 16–21).

Table III lists the number of edges explored at each levelusing a top-down, bottom-up, and combined hybrid (oracle)approach. For the top-down kernel, the frontier size mF atlow and high levels is much less than that at mid-levels.In the case of the bottom-up method, the frontier size mBis equal to the number of edges m = |E| in a given graphG = (V, E), and decreases as the level increases. Bottom-upkernels estimate all unvisited vertices as candidate neighborsQN , because it is difficult to determine their exact numberprior to traversal, as shown in line 15 of Algorithm 2. Thislazy estimation of candidate neighbors increases the numberof edges traversed for a small frontier. Hence, consideringthe size of the frontier and the number of neighbors, wecombine top-down and bottom-up approaches in a hybrid(oracle). This loop generally only executes once, so thealgorithm is suitable for a small-world graph that has a largefrontier, such as a Kronecker graph or an R-MAT graph.Table III shows that the total number of edges traversed bythe hybrid (oracle) is only 3% of that in the case of thetop-down kernel.

We now explain how to determine a traversal policy(Table II(a)) for the top-down and bottom-up kernels. The

395

Traversal

Swap

Frontier

Neighbor

Level k Level k+1 QF

QN

Swap exchanges the frontier QF and the neighbors QN for next level

Traversal finds neighbors QN from current frontier QF

Unvisited adjacency vertices(

Page 18: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Candidates of neighbors

前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み

v→ w

v

w

Input : Directed graph G = (V, AF ), Queue QF

Data : Queue QN , visited, Tree π(v)

QN ← ∅for v ∈ QF in parallel do

for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}

QF ← QN

• 後方探索でのデータの書込み

w→ v

v w

Input : Directed graph G = (V, AB), Queue QF

Data : Queue QN , visited, Tree π(v)

QN ← ∅for w ∈ V \ visited in parallel do

for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break

QF ← QN

• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)

6 / 12

Direction-optimizing BFS

Frontier

Neighbors

Level7k

Level7k+1

Frontier Level7k

Level7k+1 neighbors

Top-down algorithm •  Efficient for small-frontier •  Uses out-going edges

Bottom-up algorithm •  Efficient for large-frontier •  Uses in-coming edges

前方探索と後方探索でのデータアクセスの観察• 前方探索でのデータの書込み

v→ w

v

w

Input : Directed graph G = (V, AF ), Queue QF

Data : Queue QN , visited, Tree π(v)

QN ← ∅for v ∈ QF in parallel do

for w ∈ AF (v) doif w ! visited atomic thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}

QF ← QN

• 後方探索でのデータの書込み

w→ v

v w

Input : Directed graph G = (V, AB), Queue QF

Data : Queue QN , visited, Tree π(v)

QN ← ∅for w ∈ V \ visited in parallel do

for v ∈ AB(w) doif v ∈ QF thenπ(w)← vvisited← visited ∪ {w}QN ← QN ∪ {w}break

QF ← QN

• どちらも wに関する変数 π(w)と visitedに書込みを行っている (vは点番号の参照)

6 / 12

Current frontier

Unvisited neighbors

Current frontier

Candidates of neighbors

Skips unnecessary edge traversal

Outgoing edges Incoming

edges

Chooses one from Top-down or Bottom-up Beamer2012 @ SC2012

Page 19: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

# of traversal edges of Kronecker graph with SCALE 26

Hybrid-BFS reduces unnecessary edge traversals

Direction-optimizing BFS

Top%down 幅優先探索に対する前方探索 (Top-down)と後方探索 (Bottom-up)

Level Top-down Bottom-up Hybrid0 2 2,103,840,895 21 66,206 1,766,587,029 66,2062 346,918,235 52,677,691 52,677,6913 1,727,195,615 12,820,854 12,820,8544 29,557,400 103,184 103,1845 82,357 21,467 21,4676 221 21,240 227

Total 2,103,820,036 3,936,072,360 65,689,631Ratio 100.00% 187.09% 3.12%

6 / 14

Bottom%up(

Top%down

Distance from source |V| = 226, |E| = 230

= |E|

Chooses one from Top-down or Bottom-up Beamer2012 @ SC2012

Small frontier large frontier

Page 20: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

NUMA-optimized Dir. Opt. BFS

0

10

20

30

40

50

2011 SC10 SC12 BigData13ISC14

G500,ISC14

GT

EPS

Ref

eren

ce

NU

MA-a

war

e

Dir.

Opt

.

NU

MA-O

pt.

NU

MA-O

pt.

+D

eg.a

war

e

NU

MA-O

pt.

+D

eg.a

war

e+

Vtx

.Sor

t

87M 800M

5G

11G

29G

42G

⇥1 ⇥9

⇥58

⇥125

⇥334

⇥489

•  CPU: Intel Xeon •  #sockets: 4 •  #cores: 32 or 40 •  RAM: 256GB or 512GB

•  Manages memory accesses on NUMA system –  Each NUMA node contains CPU socket and local memory

Top%down Top%down

Bottom%up(

Top%down

CPU

RAM

NUMA%aware

System7configuration

Page 21: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

NUMA-optimized Dir. Opt. BFS

0

10

20

30

40

50

2011 SC10 SC12 BigData13ISC14

G500,ISC14

GT

EPS

Ref

eren

ce

NU

MA-a

war

e

Dir.

Opt

.

NU

MA-O

pt.

NU

MA-O

pt.

+D

eg.a

war

e

NU

MA-O

pt.

+D

eg.a

war

e+

Vtx

.Sor

t

87M 800M

5G

11G

29G

42G

⇥1 ⇥9

⇥58

⇥125

⇥334

⇥489

•  CPU: Intel Xeon •  #sockets: 4 •  #cores: 32 or 40 •  RAM: 256GB or 512GB

•  Manages memory accesses on NUMA system –  Each NUMA node contains CPU socket and local memory

Top%down Top%down

Bottom%up(

Top%down

CPU

RAM

NUMA%aware Bottom%up(

Top%down

CPU

RAM

NUMA%aware

System7configuration

Our results

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM0th 3th

1st 2nd

0 1 2 3

Adjacency Matrix

Partitioning

Binding on NUMA

Page 22: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

NUMA architecture

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

CPU socket(16 logical cores) + Local RAM

Memory access for Local RAM(Fast)

Memory access for Remote RAM(Slow)

NUMA node

•  Reduces and avoids memory accesses for Remote RAM

•  4-way Intel Xeon E5-4640 (Sandybridge-EP) –  4 (# of CPU sockets) –  8 (# of physical cores per socket) –  2 (# of threads per core)

4 x 8 x 2 = 64 threads

NUMA node

Max.

NUMA-aware (optimized) computation

Page 23: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Flow of affinities using ULIBC ULIBC : Ubiquity Library for Intelligently Binding Cores –  provides some APIs to utilizing processor topology easily.

(Our(library)

Page 24: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

1.  Detects entire topology

Flow of affinities using ULIBC

Cores CPU 0 P0, P4, P8, P12 CPU 1 P1, P5, P9, P13 CPU 2 P2, P6, P10, P14 CPU 3 P3, P7, P11, P15

Use(Other(

processes

ULIBC : Ubiquity Library for Intelligently Binding Cores –  provides some APIs to utilizing processor topology easily.

(Our(library)

Page 25: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

1.  Detects entire topology Cores

CPU 1 P1, P5, P9, P13 CPU 2 P2, P6, P10, P14

2.  Detects online (available) topology

Flow of affinities using ULIBC

Job manager (PBS) or numactl --cpunodebind=1,2

Cores CPU 0 P0, P4, P8, P12 CPU 1 P1, P5, P9, P13 CPU 2 P2, P6, P10, P14 CPU 3 P3, P7, P11, P15

Use(Other(

processes

ULIBC : Ubiquity Library for Intelligently Binding Cores –  provides some APIs to utilizing processor topology easily.

(Our(library)

Page 26: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

1.  Detects entire topology Cores

CPU 1 P1, P5, P9, P13 CPU 2 P2, P6, P10, P14

2.  Detects online (available) topology

Threads NUMA 0 0(P1), 2(P5), 4(P9), 6(P13) NUMA 1 1(P2), 3(P6), 5(P10)

3. Constructs ULIBC affinity ULIBC_set_affinity_policy( 7, SCATTER_MAPPING, THREAD_TO_CORE)

Flow of affinities using ULIBC

Job manager (PBS) or numactl --cpunodebind=1,2

# of threads Scatter-type mapping

Each thread binds each logical cores

Cores CPU 0 P0, P4, P8, P12 CPU 1 P1, P5, P9, P13 CPU 2 P2, P6, P10, P14 CPU 3 P3, P7, P11, P15

Use(Other(

processes

NUMA 0

NUMA 1

core 0 core 1 core 2

core 3

RAM

RAM

Local RAM

ULIBC : Ubiquity Library for Intelligently Binding Cores –  provides some APIs to utilizing processor topology easily.

(Our(library)

Page 27: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

NUMA-optimized BFS •  The 1-D column-wise partitioning for adjacency matrix

0 1 2 3Adjacency Matrix

Partitioning

0 1 2 3

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAMRAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

Local neighbors

Duplicated frontiers

All-gathering of next frontier Edge traversal on local RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

RAM RAM

processor core & L2 cache

8-core Xeon E5 4640shared L3 cache

RAM RAM

0th 3th

1st 2nd

Binding

Inner-NUMA-node Inter-NUMA-node

Each NUMA node searches unvisited vertices from duplicated frontier

Construct duplicated frontiers from partial neighbors

In In In In Out Out Out Out

•  Local traversal and all-to-all comm. for each level

Page 28: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Degree-aware + NUMA-opt. + Dir. Opt. BFS

0

10

20

30

40

50

2011 SC10 SC12 BigData13ISC14

G500,ISC14

GT

EPS

Ref

eren

ce

NU

MA-a

war

e

Dir.

Opt

.

NU

MA-O

pt.

NU

MA-O

pt.

+D

eg.a

war

e

NU

MA-O

pt.

+D

eg.a

war

e+

Vtx

.Sor

t

87M 800M

5G

11G

29G

42G

⇥1 ⇥9

⇥58

⇥125

⇥334

⇥489

•  CPU: Intel Xeon •  #sockets: 4 •  #cores: 32 or 40 •  RAM: 256GB or 512GB

•  Manages memory accesses on NUMA system –  Each NUMA node contains CPU socket and local memory

Top%down Top%down

Bottom%up(

Top%down

CPU

RAM

NUMA%aware Bottom%up(

Top%down

CPU

RAM

NUMA%aware

System7configuration

2. Sorting adjacency vertices

1. Deleting isolated vertices Our results

Isolated

A(va)

… …

A(va)

… …

Sorted by degree

Page 29: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

TEPS and TEPS/W on single-server •  Strong scaling for SCALE 27 for Graph500 for Green Graph500

1

2

4

8

16

32

64

1 2 4 8 16 32 64

1

2

4

8

16

32

64

relative

GT

EPS

relative

MT

EPS/W

Number of threads

GTEPS

MTEPS/W

29.03 GTEPS

45.43 MTEPS/W

4-way SB-EP based Xeon Relative improvements

Number of threads

x 27.9

x 12.6

Page 30: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

SGI UV 2000 system •  Shared-memory supercomputer

–  handle large memory space using thread parallel. –  C/C++ with OpenMP/Pthreads (w/o MPI comm.) –  cc-NUMA architecture system base on Intel Xeon

•  ISM has two Full-spec. UV 2000 –  4 UV 2000 racks –  Up to 2,560 cores and 64 TB memory

•  ISM, SGI, and us collaborate for Graph500 –  achieves the fastest of single-node in current list

The Institute of Statistical Mathematics •  Japan's national research institute for

statistical science.

#1 system #2 system

UV2000 rack

Page 31: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

SGI UV 2000 configuration •  UV2000 has complex hardware topologies

–  Socket, Node, Cube, Inner-rack, and Inter-rack Node = 2 sockets Cube = 8 nodes Rack = 32 nodes

•  We used NUMA-based flat parallelization –  Each NUMA node contains a “Xeon CPU E5-2470 v2” and a “256 GB RAM”

CPU

RAM

CPU

RAM

× 4 =

CPU

RAM

Node = 2 NUMA nodes Rack = 64 NUMA nodes

× 64 = CPU

RAM

Cube = 16 NUMA nodes

× 2 CPU

RAM × 16

NUMAlink(

6.7GB/s

(20(cores,(512GB) (160(cores,(4TB) (640(cores,(16TB)

Page 32: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

0

50

100

150

200

26

(` = 1)27

(` = 2)28

(` = 4)29

(` = 8)30

(` = 16)31

(` = 32)32

(` = 64)33

(` = 128)34

(` = 256)

GT

EPS

SCALE (` = #sockets)

Weak scaling on UV 2000

June 2014

Weak scaling on UV2000 Fastest of single node 131 GTEPS Most power-efficient commercial supercomputer

Inner%rack(comm. Inter%rack

Graph500 June list

12.481 MTEPS = 131 GTEPS / 10.53 kW

Page 33: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

The Graph500 List in June 2014 •  Measures performance using TEPS (# of Traversed edges

per second) in graph traversal such as BFS

h"p://www.graph500.org�

Fastest.of.......single5node�

Fastest.of........single5server�

Distributed Memory

Distributed Memory

Distributed Memory

Shared Memory

Shared Memory

Fastest.of..mul:5node�

Page 34: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

The Green Graph500 List in June 2014

Small data category (< SCALE 29)

Big Data category (> SCALE 30)

SONY Xperia-Z1-SO-01F

Measures power-efficiency using TEPS/W http://green.graph500.org

George Washington University’sColonial

is ranked

No.1

in the Small Data category of the Green Graph 500Ranking of Supercomputers with

445.92 MTEPS/W on Scale 20

on the third Green Graph 500 list published at theInternational Supercomputing Conference, June 23, 2014.

Congratulations from the Green Graph 500 Chair

Kyushu’s UniversityGraphCREST-SandybridgeEP-2.4GHz

is ranked

No.1

in the Big Data category of the Green Graph 500Ranking of Supercomputers with

59.12 MTEPS/W on Scale 30

on the third Green Graph 500 list published at theInternational Supercomputing Conference, June 23, 2014.

Congratulations from the Green Graph 500 Chair

Ours

UV2000

Ours

4-way Xeon server

TSUBAME-KFC

Page 35: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

0

50

100

150

200

26

(` = 1)27

(` = 2)28

(` = 4)29

(` = 8)30

(` = 16)31

(` = 32)32

(` = 64)33

(` = 128)34

(` = 256)

GT

EPS

SCALE (` = #sockets)

Weak scaling on UV 2000

June 2014

Nov. 2014

Weak scaling on UV2000

Fastest of single node in Graph500 June

Inter%rack(comm. Inner%rack(comm.

131 GTEPS

New result 174 GTEPS

Two racks

Page 36: Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14

Conclusion •  UV 2000 with NUMA-based thread-parallelization

–  Scalable for irregular memory access computation

SGI UV2000 CPU : 64 CPUs per rack RAM : 16 TB per rack

•  Graph500/Green Graph500 on UV 2000 –  131 GTEPS with 640 threads –  The fastest of single node entries –  The most power-efficient of commercial supercomputers

174 GTEPS for SCALE 33 with 1,280 threads

•  ULIBC will be available at https://bitbucket.org/yuichiro_yasui/ulibc �

Graph500 Green Graph500