Presentation slides for Japan Lustre User Group (17 Oct 2014)
Extreme Big Data (EBD) Convergence of Extreme Computing
and Big Data Technologies
Hitoshi Sato
Global Scientific Information and Computing Center, Tokyo Institute of Technology
Big Data Examples: Rates and Volumes are extremely immense
Social NW
• Facebook – 1 billion users – Average 130 friends – 30 billion pieces of content shared per month
• Twitter – 500 million active users – 340 million tweets per day
• Internet – 300 million new websites per year – 48 hours of video to YouTube per minute – 30,000 YouTube videos played per second
Genomics
Social Simulation
Sequencing data (bp/$) becomes x4000 per 5 years
c.f., HPC x33 in 5 years
• Applications – Target Area: Planet
(Open Street Map)
– 7 billion people
• Input Data – Road Network for Planet:
300GB (XML) – Trip data for 7 billion people
10KB (1trip) x 7 billion = 70TB – Real-Time Streaming Data
(e.g., Social sensor, physical data)
• Simulated Output for 1 Iteration – 700TB
Weather
~30-sec Ensemble Forecast Simulations: 2 PFLOP
Ensemble Data Assimilation: 2 PFLOP
Himawari: 500MB/2.5min
Ensemble Forecasts: 200GB
Phased Array Radar: 1GB/30sec/2 radars
Ensemble Analyses: 200GB
A-1. Quality Control, A-2. Data Processing
B-1. Quality Control, B-2. Data Processing
Analysis Data: 2GB
~30-min Forecast Simulation: 1.2 PFLOP
30-min Forecast: 2GB
Repeat every 30 sec.
Future “Extreme Big Data” • NOT mining Tbytes Silo Data • Peta~Zetabytes of Data • Ultra High-BW Data Stream • Highly Unstructured, Irregular • Complex correlations between data from multiple sources • Extreme Capacity, Bandwidth, Compute All Required
Graph500 "Big Data" Benchmark
Kronecker generator parameters: A: 0.57, B: 0.19, C: 0.19, D: 0.05
November 15, 2010: "Graph 500 Takes Aim at a New Kind of HPC", Richard Murphy (Sandia NL => Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
The 8th Graph500 List (June2014): K Computer #1, TSUBAME2 #12 Koji Ueno, Tokyo Institute of Technology/RIKEN AICS
on the Graph500 Ranking of Supercomputers with 17977.1 GE/s on Scale 40
on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014.
Congratulations from the Graph500 Executive Committee
No.1
RIKEN Advanced Institute for Computational Science (AICS)'s K computer
is ranked
on the Graph500 Ranking of Supercomputers with 1280.43 GE/s on Scale 36
on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014.
Congratulations from the Graph500 Executive Committee
No.12
Global Scientific Information and Computing Center, Tokyo Institute of Technology's TSUBAME 2.5
is ranked
#1 K Computer
#12 TSUBAME2
Reality: Top500 Supercomputers Dominate; No Cloud IDCs at all
TSUBAME2.0 → 2.5 (Tokyo Tech)
17.1 PFlops (SFP), 5.76 PFlops (DFP); #4 on Top500 (Nov. 2010)
1408 nodes, 44 racks
CPU: Intel Xeon (Westmere-EP) 2.93GHz, 6 cores × 2; GPU: NVIDIA Tesla K20X × 3
SSD: 60GB × 2 per node
Full-bisection optical QDR InfiniBand network
SSDs: 200TB; HDD: 7PB (Lustre, GPFS); Tape: 4PB
TSUBAME2 System Overview 11PB (7PB HDD, 4PB Tape, 200TB SSD)
"Global Work Space" #1
SFA10k #5
"Global Work Space" #2  "Global Work Space" #3
SFA10k #4, SFA10k #3, SFA10k #2, SFA10k #1
/data0  /work0  /work1  /gscr
"cNFS/Clustered Samba w/ GPFS"
HOME
System application
"NFS/CIFS/iSCSI by BlueARC"
HOME
iSCSI
InfiniBand QDR Networks
SFA10k #6
GPFS#1 GPFS#2 GPFS#3 GPFS#4
Parallel File System Volumes / Home Volumes
QDR IB (×4) × 20, 10GbE × 2, QDR IB (×4) × 8
1.2PB / 3.6 PB
/data1
Thin nodes: 1408 nodes (32 nodes × 44 racks)
HP ProLiant SL390s G7, 1408 nodes; CPU: Intel Westmere-EP 2.93GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA Tesla K20X, 3 GPUs/node; Mem: 54GB (96GB); SSD: 60GB × 2 = 120GB (120GB × 2 = 240GB)
Medium nodes
HP ProLiant DL580 G7, 24 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node; GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070; Mem: 128GB; SSD: 120GB × 4 = 480GB
Fat nodes
HP ProLiant DL580 G7, 10 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node; GPU: NVIDIA Tesla S1070; Mem: 256GB (512GB); SSD: 120GB × 4 = 480GB
Computing Nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100TB MEM, ~200TB SSD
Interconnects: Full-bisection Optical QDR InfiniBand Network
Core Switch: Voltaire Grid Director 4700 × 12, IB QDR: 324 ports (12 switches)
Edge Switch: Voltaire Grid Director 4036 × 179, IB QDR: 36 ports (179 switches)
Edge Switch (w/ 10GbE ports): Voltaire Grid Director 4036E × 6, IB QDR: 34 ports, 10GbE: 2 ports (6 switches)
2.4 PB HDD + 4PB Tape
GPFS + Tape, Lustre, Home
Local SSDs
Read-mostly I/O (data-intensive apps, parallel workflow, parameter survey)
• Home storage for computing nodes • Cloud-based campus storage services
Backup
Fine-grained R/W I/O (checkpoint, temporary files)
TSUBAME2 Storage Usage Since Nov. 2010
TSUBAME2.5 as a Big Data Infrastructure
• GPU-based many-core accelerators – NVIDIA Tesla K20X × 4224 cards
• Huge memory volumes – 54GB per node × 1408 nodes
• Fat-tree dual-rail QDR InfiniBand – 200Tbps of full-bisection bandwidth
• Local SSD devices per node – 200TB in total
• Large-scale storage systems – Lustre, GPFS – 7PB of HDDs, 4PB of Tapes
A Major Northern Japanese Cloud Datacenter (2013)
Juniper EX8208 × 2 (2 zone switches, Virtual Chassis)
Juniper EX4200 switches in each zone (700 nodes per zone)
Juniper MX480 × 2, 10GbE uplinks (LACP) to the Internet
8 zones, Total 5600 nodes; Injection 1GBps/Node, Bisection 160Gbps
Advanced Silicon Photonics 40G
single CMOS Die 1490nm DFB 100km Fiber
Supercomputer Tokyo Tech. Tsubame 2.0
#4 Top500 (2010)
~1500 nodes compute & storage, Full Bisection Multi-Rail Optical Network
Injection 80GBps/Node, Bisection 220Tbps
>> x1000!
Towards Extreme-scale Supercomputers and BigData Machines
• Computation – Increase in Parallelism, Heterogeneity, Density
• Multi-core, Many-core processors • Heterogeneous processors
• Hierarchical Memory/Storage Architecture – NVM (Non-Volatile Memory),
SCM (Storage Class Memory) • FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.
– Next-gen HDDs (SMR), Tapes (LTFS)
Problems: Network, Locality, Productivity, FT, Algorithm, Power, Storage Hierarchy, I/O, Heterogeneity, Scalability
Extreme Big Data (EBD) Next Generation Big Data
Infrastructure Technologies Towards Yottabyte/Year Principal Investigator
Satoshi Matsuoka
Global Scientific Information and Computing Center
Tokyo Institute of Technology
2014/11/05 JST CREST Big Data Symposium
EBD Research Scheme
Supercomputers: Compute & Batch-Oriented
Cloud IDC: Very low BW & Efficiency
Convergent Architecture (Phases 1~4) Large Capacity NVM, High-Bisection NW
PCB
TSV Interposer
High Powered Main CPU
Low Power CPU
DRAM, DRAM, DRAM, NVM/Flash, NVM/Flash, NVM/Flash
Low Power CPU
DRAM, DRAM, DRAM, NVM/Flash, NVM/Flash, NVM/Flash
2Tbps HBM, 4~6 HBM Channels, 1.5TB/s DRAM & NVM BW, 30PB/s I/O BW; Possible 1 Yottabyte / Year
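(A rough consistency check on the Yottabyte/Year target, added here and not on the slide: 30 PB/s sustained over one year is 30 × 10^15 B/s × ~3.15 × 10^7 s ≈ 9.5 × 10^23 bytes, i.e. on the order of 1 Yottabyte per year.)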
EBD System Software incl. EBD Object System
Introduction / Problem Domain
In most living organisms genetic instructions used in their development are stored in the long polymeric molecule called DNA. DNA consists of two long polymers of simple units called nucleotides. The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T).
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011, 5/54
Large Scale Metagenomics
Massive Sensors and Data Assimilation in Weather Prediction
Ultra Large Scale Graphs and Social Infrastructures
Exascale Big Data HPC
Co-Design
Future Non-Silo Extreme Big Data Apps
Graph Store
EBD Bag
Co-Design
[Map of Japan, 1000km scale]
KVS
KVS
KVS
EBD KVS
Cartesian Plane
Co-Design
Extreme Big Data (EBD) Team: Co-Design EHPC and EBD Apps
• Satoshi Matsuoka (PI), Toshio Endo, Hitoshi Sato (Tokyo Tech.) (Tasks 1, 3, 4, 6)
• Osamu Tatebe (Univ. Tsukuba) (Tasks 2, 3)
• Michihiro Koibuchi (NII)
(Tasks 1, 2)
• Yutaka Akiyama, Ken Kurokawa (Tokyo Tech, 5-1)
• Toyotaro Suzumura (IBM Lab, 5-2)
• Takemasa Miyoshi (Riken AICS, 5-3)
Introduction
Problem Domain
To decipher the information contained in DNA we need to determine the order of nucleotides. This task is important for many emerging areas of science and medicine.
Modern sequencing techniques split the DNA molecule into pieces (called reads) which are processed separately to increase the sequencing throughput.
Reads must be aligned to the reference sequence to determine their position in the molecule. This process is called read alignment.
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011, 6/54
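A minimal sketch of read alignment in its simplest exact-match form (added for illustration, not from the slides or the cited work): real aligners use indexed data structures such as suffix arrays or FM-indexes and tolerate mismatches, but the basic task is finding where each read fits in the reference.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Naive exact-match read alignment: report every offset in the reference
// where the read occurs. Real aligners index the reference and tolerate
// sequencing errors; this only illustrates the task.
std::vector<std::size_t> align_read(const std::string& reference,
                                    const std::string& read) {
    std::vector<std::size_t> positions;
    for (std::size_t pos = reference.find(read); pos != std::string::npos;
         pos = reference.find(read, pos + 1)) {
        positions.push_back(pos);
    }
    return positions;
}

int main() {
    const std::string reference = "ACGTACGTTAGCGTAC";
    for (std::size_t p : align_read(reference, "GTAC"))
        std::cout << "read aligns at offset " << p << '\n';
}
```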
KVS
KVS  KVS
Graph Store
EBD Application Co-Design and Validation
Ultra High BW & Low Latency NVM, Ultra High BW & Low Latency NW
Processor-in-memory, 3D stacking
EBD Performance Modeling & Evaluation
Large Scale Genomic Correlation
Data Assimilation in Large Scale Sensors and Exascale Atmospherics
Large Scale Graphs and Social Infrastructure Apps
100,000 Times Fold EBD "Convergent" System Overview
TSUBAME 3.0
EBD Programming System
TSUBAME 2.0/2.5
Task 3
Tasks 5-1~5-3, Task 6
EBD "converged" Real-Time Resource Scheduling
Task 4
EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes
Task 2
EBD Bag
EBD KVS
Cartesian Plane
Ultra Parallel & Low Power I/O EBD "Convergent" Supercomputer: ~10TB/s, ~100TB/s, ~10PB/s
Task 1
KVS
KVS  KVS
Graph Store
Cloud Datacenter
Large Scale Genomic Correlation
Data Assimilation in Large Scale Sensors and Exascale Atmospherics
Large Scale Graphs and Social Infrastructure Apps
100,000 Times Fold EBD "Convergent" System Overview
TSUBAME 3.0, TSUBAME-GoldenBox
EBD Bag
EBD KVS
Cartesian Plane
MapReduce for EBD, Workflow/Scripting Languages for EBD
Interconnect (InfiniBand, 100GbE)
EBD Abstract Data Models (Distributed Array, Key Value, Sparse Data Model, Tree, etc.)
EBD Algorithm Kernels (Search/Sort, Matching, Graph Traversals, etc.)
NVM (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.), HPC Storage
EBD File System, EBD Data Object
SQL for EBD, Xpregel (Graph)
Message Passing (MPI, X10) for EBD
PGAS/Global Array for EBD
Network (SINET5)
Intercloud / Grid (HPCI)
Web Object Storage
EBD Burst I/O Buffer, EBD Network Topology and Routing
Hamar (Highly Accelerated Map Reduce) [Shirahata, Sato et al. Cluster2014]
• A software framework for large-scale supercomputers w/ many-core accelerators and local NVM devices – Abstraction for deepening memory hierarchy
• Device memory on GPUs, DRAM, Flash devices, etc.
• Features – Object-oriented
• C++-based implementation • Easy adaptation to modern commodity many-core accelerator/Flash devices w/ SDKs
– CUDA, OpenNVM, etc.
– Weak-scaling over 1000 GPUs • TSUBAME2
– Out-of-core GPU data management • Optimized data streaming between device/host memory
• GPU-based external sorting – Optimized data formats for many-core accelerators • Similar to JDS format
Hamar Overview
[Diagram: a Distributed Array spans Rank 0 .. Rank n, each holding Local Arrays; Map, Shuffle (data transfer between ranks), and Reduce phases operate on the Local Arrays; Device (GPU) Data and Host (CPU) Data are exchanged via Memcpy (H2D, D2H); a Virtualized Data Object backs Local Arrays on NVM]
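To make the data flow above concrete, here is a hedged, CPU-only C++ sketch of the out-of-core pattern Hamar implements: each rank's local array is processed chunk by chunk so that only one chunk needs to fit in fast (device) memory at a time. All names are illustrative and are not Hamar's actual API; the "device buffer" stands in for GPU memory and a host-to-device copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Illustrative out-of-core map + local reduce over one rank's local array.
// Only chunk_size elements are resident at a time, mimicking streaming
// between host/NVM storage and device memory.
template <class Map, class Reduce>
std::int64_t out_of_core_map_reduce(const std::vector<std::int64_t>& local_array,
                                    std::size_t chunk_size,
                                    Map map, Reduce reduce, std::int64_t acc) {
    std::vector<std::int64_t> device_buffer(chunk_size);
    for (std::size_t off = 0; off < local_array.size(); off += chunk_size) {
        const std::size_t n = std::min(chunk_size, local_array.size() - off);
        // Stage one chunk into "device" memory (stand-in for a host-to-device copy).
        std::copy(local_array.begin() + off, local_array.begin() + off + n,
                  device_buffer.begin());
        for (std::size_t i = 0; i < n; ++i) device_buffer[i] = map(device_buffer[i]);   // map
        for (std::size_t i = 0; i < n; ++i) acc = reduce(acc, device_buffer[i]);        // reduce
    }
    return acc;
}

int main() {
    std::vector<std::int64_t> data(1 << 20);
    std::iota(data.begin(), data.end(), 0);
    const auto sum_of_squares = out_of_core_map_reduce(
        data, 1 << 16,
        [](std::int64_t x) { return x * x; },
        [](std::int64_t a, std::int64_t b) { return a + b; }, 0);
    std::cout << sum_of_squares << '\n';
}
```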
Application Example: GIM-V (Generalized Iterative Matrix-Vector multiplication)*1
• Easy description of various graph algorithms by implementing combine2, combineAll, assign functions
• PageRank, Random Walk with Restart, Connected Component – v' = M ×G v, where
v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_i,j, v_j)}))  (i = 1..n)
– Iterative 2-phase MapReduce operations
[Diagram: v' = M ×G v; combine2 applied to m_i,j and v_j (stage 1); combineAll and assign produce v'_i (stage 2)]
*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining 2009
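To show how the three user functions fit together, here is a small sequential C++ sketch of one GIM-V iteration specialized to PageRank, following the PEGASUS formulation; it is purely illustrative and is not the Hamar implementation.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct Entry { std::size_t i, j; double m; };   // nonzero m(i,j) of sparse matrix M

// One GIM-V iteration: v'_i = assign(v_i, combineAll_i({combine2(m(i,j), v_j)})).
// Specialized to PageRank: combine2 multiplies by the damping factor c,
// combineAll sums and adds (1-c)/n, assign replaces the old value.
std::vector<double> gimv_pagerank_step(const std::vector<Entry>& M,
                                       const std::vector<double>& v,
                                       double c = 0.85) {
    const std::size_t n = v.size();
    std::vector<double> partial(n, 0.0);
    for (const Entry& e : M)                      // combine2 + partial combineAll
        partial[e.i] += c * e.m * v[e.j];
    std::vector<double> v_next(n);
    for (std::size_t i = 0; i < n; ++i)           // finish combineAll, then assign
        v_next[i] = (1.0 - c) / n + partial[i];
    return v_next;
}

int main() {
    // Tiny 3-vertex example with a column-stochastic matrix M.
    std::vector<Entry> M = {{0, 1, 1.0}, {1, 2, 0.5}, {2, 2, 0.5}, {2, 0, 1.0}};
    std::vector<double> v(3, 1.0 / 3.0);
    for (int it = 0; it < 30; ++it) v = gimv_pagerank_step(M, v);
    for (double x : v) std::cout << x << ' ';
    std::cout << '\n';
}
```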
Straightforward implementation using Hamar
MapReduce-based Graph Processing with Out-of-core Support on GPUs
[Chart: Weak scaling performance, Performance (MEdges/sec) vs. Number of Compute Nodes, for 1 CPU (SCALE 23 per node), 1 GPU (SCALE 23 per node), 2 CPUs (SCALE 24 per node), 2 GPUs (SCALE 24 per node), 3 GPUs (SCALE 24 per node)]
2.81 GE/s on 3072 GPUs (SCALE 34)
2.10x speedup (3 GPUs vs. 2 CPUs)
• Hierarchical memory management for large-scale graph processing using multi-GPUs – Support out-of-core processing on GPU – Overlapping computation and CPU-GPU communication
• PageRank application on TSUBAME 2.5
EBD-IO Device: A Prototype of Local Storage Configuration [Shirahata, Sato et al. GTC2014]
High Bandwidth and IOPS, Huge Capacity, Low Cost, Power Efficient
Fig. 2 (a) Left: Flat buffer system (b) Right: Burst buffer system
...logging overhead. In addition, if we apply uncoordinated checkpointing to MPI applications, indirect global synchronization can occur. For example, process (a2) in cluster (A) wants to send a message to process (b1) in cluster (B), which is writing its checkpoint at that time. Process (a2) waits for process (b1) because process (b1) is doing I/O and cannot receive or reply to any messages, which keeps process (a1) waiting to checkpoint with process (a2) in Figure 1. If such a dependency propagates across all processes, it results in indirect global synchronization. Many MPI applications exchange messages between processes in a shorter period of time than is required for checkpoints, so we assume uncoordinated checkpointing time is the same as coordinated checkpointing time in the model in Section 4.
2.4 Target Checkpoint/Restart Strategies
As discussed previously, multilevel and asynchronous approaches are more efficient than single and synchronous checkpoint/restart respectively. However, there is a trade-off between coordinated and uncoordinated checkpointing given an application and the configuration. In this work, we compare the efficiency of multilevel asynchronous coordinated and uncoordinated checkpoint/restart. However, because we have already found that these approaches may be limited in increasing application efficiencies at extreme scale [29], we also consider storage architecture approaches.
3. Storage designs
Our goal is to achieve a more reliable system with more efficient application executions. Thus, we consider not only a software approach via checkpoint/restart techniques, but also consider different storage architectures. In this section, we introduce an mSATA-based SSD burst buffer system (Burst buffer system), and explore the advantages by comparing to a representative current storage system (Flat buffer system).
3.1 Current Flat Buffer System
In a flat buffer system (Figure 2 (a)), each compute node has its dedicated node-local storage, such as an SSD, so this design is scalable with increasing number of compute nodes. Several supercomputers employ this flat buffer system [13], [22], [24]. However this design has drawbacks: unreliable checkpoint storage and inefficient utilization of storage resources. Storing checkpoints in node-local storage is not reliable because an application cannot restart its execution if a checkpoint is lost due to a failed compute node. For example, if compute node 1 in Figure 2 (a) fails, a checkpoint on SSD 1 will be lost because SSD 1 is connected to the failed compute node 1. Storage devices can be underutilized with uncoordinated checkpointing and message logging. While the system can limit the number of processes to restart, i.e., perform a partial restart, in a flat buffer system, local storage is not utilized by processes which are not involved in the partial restart. For example, if compute nodes 1 and 3 are in the same cluster, and restart from a failure, the bandwidth of SSDs 2 and 4 will not be utilized. Compute node 1 can write its checkpoints on the SSD of compute node 2 as well as its own SSD in order to utilize both of the SSDs on restart, but as argued earlier distributing checkpoints across multiple compute nodes is not a reliable solution.
Thus, future storage architectures require not only efficient but reliable storage designs for resilient extreme scale computing.
3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for partial restart of uncoordinated checkpointing because processes involved in restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster, and both restart from a failure, the processes can utilize all SSD bandwidth unlike a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K CPU (3.50GHz x 4 cores)
Memory: Cetus DDR3-1600 (16GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256GB CT256M4SSD3 (Peak read: 500MB/s, Peak write: 260MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5' SATA Device Converter with Metal Frame
RAID Card: Adaptec RAID 7805Q ASR-7805Q Single
To explore the bandwidth we can achieve with only commodity devices, we developed an mSATA-based SSD test system. The detailed specification is shown in Table 1. The theoretical peak of sequential read and write throughput of the mSATA-based SSD is 500 MB/sec and 260 MB/sec, respectively. We aggregate eight SSDs into a RAID card, and connect the two RAID cards via PCI-Express (x8) 3.0. The theoretical peak performance of this configuration is 8 GB/sec for read and 4.16 GB/sec for write in total. Our preliminary results showed that actual read bandwidth is 7.7 GB/sec (96% of peak) and write bandwidth is 3.8 GB/sec (91% of peak) [32]. By adding two more RAID cards, and connecting via high-speed interconnects, we expect to be able to build a burst buffer machine using only commodity devices with 16 GB/sec of read, and 8.32 GB/sec of write throughput.
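(Spelling out the peak numbers quoted above: one RAID card aggregates eight mSATA SSDs, so its theoretical peak is 8 × 500 MB/s = 4 GB/s for reads and 8 × 260 MB/s ≈ 2.08 GB/s for writes; two cards therefore give the stated 8 GB/s read and 4.16 GB/s write totals.)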
16 cards of mSATA SSD devices: Capacity 256GB × 16 → 4TB; Read BW 0.5GB/s × 16 → 8 GB/s
A single mSATA SSD; 8 integrated mSATA SSDs
RAID cards; Prototype/Test machine
[Chart: Bandwidth (MB/s) vs. number of mSATAs, for Raw mSATA 4KB, RAID0 1MB, RAID0 64KB]
Preliminary I/O Performance Evaluation Between GPU and EBD-I/O device Using Matrix-Vector Multiplication
[Charts: I/O performance using FIO, and I/O performance with Matrix-Vector Multiplication on GPU; Throughput (GB/s) vs. Matrix Size (GB), for Raw 8 mSATA, 8 mSATA RAID0 (1MB), 8 mSATA RAID0 (64KB)]
I/O Performance of EBD-I/O device: 7.39 GB/s (RAID0)
I/O Performance to GPU: 3.06 GB/s (up to PCI-E BW)
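A minimal sketch of the access pattern behind this experiment (illustrative only, not the evaluated code, and the file name is hypothetical): the matrix is read from the EBD-I/O device in chunks and each chunk is handed to the GPU, so effective throughput is bounded by the smaller of the SSD read bandwidth and the PCI-Express transfer bandwidth.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

// Stand-in for a host-to-GPU transfer plus the matrix-vector kernel;
// on the real system this would be a CUDA copy and kernel launch.
void process_chunk_on_gpu(const std::vector<double>& chunk) {
    double acc = 0.0;
    for (double x : chunk) acc += x;   // placeholder work
    (void)acc;
}

int main() {
    const std::size_t chunk_elems = 64ULL << 20;            // 512 MB of doubles per chunk
    std::vector<double> chunk(chunk_elems);
    std::ifstream matrix("matrix.bin", std::ios::binary);   // illustrative file name
    // Stream the matrix chunk by chunk; handling of a trailing partial
    // chunk is omitted for brevity.
    while (matrix.read(reinterpret_cast<char*>(chunk.data()),
                       static_cast<std::streamsize>(chunk.size() * sizeof(double)))) {
        process_chunk_on_gpu(chunk);
    }
    std::cout << "done\n";
}
```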
[Chart: weak scaling, Keys/second (billions) vs. # of processes (2 processes per node), for HykSort 1 thread, HykSort 6 threads, HykSort GPU + 6 threads]
[Chart: performance prediction, Keys/second (billions) vs. # of processes (2 processes per node), for HykSort 6 threads, HykSort GPU + 6 threads, PCIe_10/50/100/200, prediction of our implementation, and accelerators ×4 faster than K20X]
Sorting for EBD Plugging in GPUs for large-scale sorting
• Weak scaling performance (Grand Challenge on TSUBAME2.5)
– 1 ~ 1024 nodes (2 ~ 2048 GPUs) – 2 processes per node and each node has 2GB of 64-bit integers
• Yahoo/Hadoop Terasort: 0.02 [TB/s]
– Including I/O
0.25 [TB/s] (chart annotations: ×1.4, ×3.61, ×389)
• Performance prediction
– ×2.2 speedup compared to CPU-based implementation when the PCIe bandwidth increases to 50GB/s
– 8.8% reduction of overall runtime when the accelerators work 4 times faster than K20X
! PCIe_#: # GB/s bandwidth of interconnect between CPU and GPU
• GPU implementation of splitter-based sorting (HykSort)
[Shamoto, Sato et al. BigData 2014]
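The core idea of splitter-based distributed sorting, of which HykSort is a highly tuned variant, can be sketched in a few lines: sample splitters, partition local data by splitter, exchange partitions, then sort locally. The sketch below is sequential C++ and purely illustrative; the implementation from the paper uses MPI collectives and GPU kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative splitter-based sort across "ranks" held in one process.
// Real splitter-based sorts select splitters by sampling, exchange buckets
// with all-to-all communication, and sort each bucket on the accelerator.
std::vector<std::vector<std::int64_t>>
splitter_sort(std::vector<std::vector<std::int64_t>> ranks) {
    const std::size_t p = ranks.size();
    // 1. Choose p-1 splitters from a sorted global sample.
    std::vector<std::int64_t> sample;
    for (const auto& r : ranks) sample.insert(sample.end(), r.begin(), r.end());
    std::sort(sample.begin(), sample.end());
    std::vector<std::int64_t> splitters;
    for (std::size_t k = 1; k < p; ++k)
        splitters.push_back(sample[k * sample.size() / p]);
    // 2. Partition + "exchange": route each key to the rank owning its range.
    std::vector<std::vector<std::int64_t>> buckets(p);
    for (const auto& r : ranks)
        for (std::int64_t key : r) {
            const std::size_t dest =
                std::upper_bound(splitters.begin(), splitters.end(), key) - splitters.begin();
            buckets[dest].push_back(key);
        }
    // 3. Local sort per rank.
    for (auto& b : buckets) std::sort(b.begin(), b.end());
    return buckets;
}

int main() {
    const auto out = splitter_sort({{9, 1, 7}, {4, 8, 2}, {6, 3, 5}});
    for (const auto& b : out) { for (auto k : b) std::cout << k << ' '; std::cout << "| "; }
    std::cout << '\n';
}
```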
! New BigData Benchmark based on Large-scale Graph Search for Ranking Supercomputers
! BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree edgefactor (=16).
! Evaluation criteria: TEPS (Traversed Edges Per Second), and problem size that can be solved on a system, minimum execution time.
Graph500 Benchmark http://www.graph500.org
SCALE and edgefactor (=16)
Median TEPS
Benchmark flow: 1. Generation, 2. Construction, 3. BFS (× 64 iterations)
Input parameters: SCALE, edgefactor
Graph generation → Graph construction → BFS → Validation, repeated for 64 iterations
Results: SCALE, edgefactor, BFS Time, Traversed edges, TEPS; the TEPS ratio is reported over the 64 BFS runs
• Kronecker graph – 2^SCALE vertices and 2^(SCALE+4) edges – synthetic scale-free network
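A compact sketch of how a run turns this flow into a score (added for illustration; the reference implementation is far more involved and counts traversed edges per the specification rather than scanned edges): run BFS from a root, count edges, and report edges divided by BFS time as TEPS.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

// Level-synchronous BFS on an adjacency-list graph; returns the number of
// edges scanned, used here as a simplified stand-in for the edge count that
// Graph500 divides by the BFS time to obtain TEPS.
std::uint64_t bfs_scanned_edges(const std::vector<std::vector<int>>& adj, int root) {
    std::vector<int> parent(adj.size(), -1);
    std::queue<int> q;
    parent[root] = root;
    q.push(root);
    std::uint64_t scanned = 0;
    while (!q.empty()) {
        const int u = q.front(); q.pop();
        for (int v : adj[u]) {
            ++scanned;
            if (parent[v] == -1) { parent[v] = u; q.push(v); }
        }
    }
    return scanned;
}

int main() {
    // Toy graph standing in for a SCALE-N Kronecker graph.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
    const auto t0 = std::chrono::steady_clock::now();
    const std::uint64_t edges = bfs_scanned_edges(adj, 0);
    const double secs =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::cout << "TEPS = " << edges / secs << '\n';  // Graph500 reports statistics over 64 roots
}
```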
Scalable distributed memory BFS for Graph500 Koji Ueno (Tokyo Institute of Technology) et al.
! What’s the best algorithm for distributed memory BFS?
Optimizations and utilization for each version (SC11 / ISC12 / SC12 / ISC14): 2D decomposition ✓✓✓✓; vertex sorting ✓; direction optimization ✓; data compression ✓✓✓; sparse vector with pop counting ✓; adaptive data representation ✓; overlapped communication ✓✓✓✓; shared memory ✓; GPGPU ✓✓
We proposed and explored many optimizations.
[Chart: Graph500 score history of TSUBAME2, Performance (GTEPS): SC'11 100, ISC'12 317, SC'12 462, ISC'14 1,280]
Continuous effort to improve performance
Machine / # of nodes / Performance: K computer, 65536 nodes, 5524 GTEPS; TSUBAME2.5, 1024 nodes, 1280 GTEPS; TSUBAME-KFC, 32 nodes, 104 GTEPS
Optimized for various machines
1. Hybrid-BFS (Beamer '11)  2. Proposal
DRAM
SCALE 31 (Total data size is 1TB): BFS Performance 13.8 GTEPS
Power Consumption 391.8 W
Large Scale Graph Processing Using NVM
3. Experiment
35.2 MTEPS / W
Green Graph 500
http://green.graph500.org/
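(A quick check of the figures above, added here and not on the slide: 13.8 GTEPS is 13,800 MTEPS, and 13,800 MTEPS / 391.8 W ≈ 35.2 MTEPS/W, the Green Graph 500 number quoted.)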
mSATA-SSD × 8, RAID Card (www.adaptec.com, www.crucial.com)
4 times larger graph with 6.9% of degradation
DRAM, NVM
Load highly accessed graph data before BFS
NVM holds the full-size graph; DRAM holds highly accessed data
[Iwabuchi, Sato et al. BigData2014]
Top-down / Bottom-up
# of frontiers: n_frontier, # of all vertices: n_all, parameters: α, β
Switching between the two approaches
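A minimal sketch of the direction-switching logic, under the assumption (labeled as such, since the slide only names the quantities) that the thresholds compare the frontier size against the total vertex count scaled by α and β, in the spirit of Beamer's hybrid BFS:

```cpp
#include <cstdint>
#include <iostream>

enum class Direction { TopDown, BottomUp };

// Hybrid-BFS direction switching, sketched with the quantities named on the
// slide: n_frontier, n_all and parameters alpha, beta. The exact threshold
// forms here are an assumption for illustration.
Direction choose_direction(Direction current,
                           std::uint64_t n_frontier, std::uint64_t n_all,
                           double alpha, double beta) {
    if (current == Direction::TopDown) {
        // Frontier grew large: scanning unvisited vertices (bottom-up) is cheaper.
        if (static_cast<double>(n_frontier) > static_cast<double>(n_all) / alpha)
            return Direction::BottomUp;
    } else {
        // Frontier shrank again: switch back to top-down for the tail of the search.
        if (static_cast<double>(n_frontier) < static_cast<double>(n_all) / beta)
            return Direction::TopDown;
    }
    return current;
}

int main() {
    const Direction d = choose_direction(Direction::TopDown, 300000, 1000000, 4.0, 24.0);
    std::cout << (d == Direction::BottomUp ? "bottom-up" : "top-down") << '\n';
}
```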
Results: BFS Performance
# EBD-RH5885v2 (Huawei, Tecal RH5885 V2 Server w/ Tecal ES3000 PCIe SSD 800GB×2, 1.2TB×2, Crucial M500 mSATA 480GB ×32, MEM 1024GB) - Scale 33 - 3.10722 GTEPS - 908.22 W - 3.42130 MTEPS/W
# GraphCrest Node #1 (Custom, Intel(R) Xeon(TM) E5-2690, Crucial m4 mSATA 256GB ×16, MEM 256GB) - Scale 31 - 13.7963 GTEPS - 391.825 W - 35.2104 MTEPS/W
# MEM-CREST Node #2 (Supermicro 2027GR-TRF w/ FusionIO ioDrive2 1.2TB ×2, MEM 128GB) - Scale 30 - 7.98417 GTEPS - 276.507 W - 28.8751 MTEPS/W
MEM-CREST Node (Supermicro 2027GR-TRF): DRAM 128 GB; NVM ioDrive2 1.2 TB × 2; SCALE 30 (500GB); 7.98 GTEPS; 28.88 MTEPS/W
GraphCrest Node: DRAM 256 GB; NVM EBD-I/O 2TB × 2; SCALE 31 (1TB); 13.80 GTEPS; 35.21 MTEPS/W
EBD-RH5885v2 (Huawei Tecal RH5885 V2): DRAM 1024 GB; NVM Tecal ES3000 800GB×2, 1.2TB×2 and EBD-I/O 4TB × 2; SCALE 33 (4TB); 3.11 GTEPS; 3.42 MTEPS/W
The Graph 500, June 2014: DRAM + NVM model
The 2nd Green Graph500 list (Nov. 2013) • Measures power efficiency using the TEPS/W ratio • Results on various systems such as Huawei's RH5885v2 w/ Tecal ES3000 PCIe SSD 800GB × 2 and 1.2TB × 2 • http://green.graph500.org
Expectations for Next-Gen Storage System (Towards TSUBAME3.0)
• Achieving high IOPS – Many apps with massive small I/O ops
• graph, etc. • Utilizing NVM devices
– Discrete local SSDs on TSUBAME2 – How to aggregate them?
• Stability/Reliability as Archival Storage• I/O resource reduction/consolidation
– Can we allow a large number of OSSs to achieve ~TB/s throughput?
– Many constraints • Space, Power, Budget, etc.
Current Status • New Approaches
• Tsubame 2.0 has pioneered the use of local flash storage as a high-IOPS alternative to an external PFS
• Tiered and hybrid storage environments, combining (node-)local flash with an external PFS
• Industry Status • High-performance, high-capacity flash (and other new
semiconductor devices) are becoming available at reasonable cost
• New approaches/interfaces to use high-performance devices (e.g. NVM Express)