Presentation slides for Japan Lustre User Group (17 Oct 2014)
Extreme Big Data (EBD) Convergence of Extreme Computing
and Big Data Technologies
Hitoshi Sato
Global Scientific Information and Computing Center, Tokyo Institute of Technology
Big Data Examples: Rates and Volumes are extremely immense
Social NW
• Facebook – 1 billion users – Average 130 friends – 30 billion pieces of content shared per month
• Twitter – 500 million active users – 340 million tweets per day
• Internet – 300 million new websites per year – 48 hours of video to YouTube per minute – 30,000 YouTube videos played per second
Genomics
Social Simulation
Sequencing data (bp/$) becomes x4000 per 5 years
c.f., HPC x33 in 5 years
• Applications – Target Area: Planet
(Open Street Map)
– 7 billion people
• Input Data – Road Network for Planet:
300GB (XML) – Trip data for 7 billion people
10KB (1trip) x 7 billion = 70TB – Real-Time Streaming Data
(e.g., Social sensor, physical data)
• Simulated Output for 1 Iteration – 700TB
Weather
~30-sec Ensemble Forecast Simulations: 2 PFLOP
Ensemble Data Assimilation: 2 PFLOP
Himawari: 500MB/2.5min
Ensemble Forecasts: 200GB
Phased Array Radar: 1GB/30sec/2 radars
Ensemble Analyses: 200GB
A-1. Quality Control, A-2. Data Processing
B-1. Quality Control, B-2. Data Processing
Analysis Data: 2GB
~30-min Forecast Simulation: 1.2 PFLOP
30-min Forecast: 2GB
Repeat every 30 sec.
Future “Extreme Big Data” • NOT mining Tbytes Silo Data • Peta~Zetabytes of Data • Ultra High-BW Data Stream • Highly Unstructured, Irregular • Complex correlations between data from multiple sources • Extreme Capacity, Bandwidth, Compute All Required
Graph500 "Big Data" Benchmark
Kronecker generator parameters: A: 0.57, B: 0.19, C: 0.19, D: 0.05
November 15, 2010: "Graph 500 Takes Aim at a New Kind of HPC", Richard Murphy (Sandia NL => Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
The 8th Graph500 List (June2014): K Computer #1, TSUBAME2 #12 Koji Ueno, Tokyo Institute of Technology/RIKEN AICS
on the Graph500 Ranking of Supercomputers with 17977.1 GE/s on Scale 40
on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014.
Congratulations from the Graph500 Executive Committee
No.1
RIKEN Advanced Institute for Computational Science (AICS)'s K computer
is ranked
on the Graph500 Ranking of Supercomputers with 1280.43 GE/s on Scale 36
on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014.
Congratulations from the Graph500 Executive Committee
No.12
Global Scientific Information and Computing Center, Tokyo Institute of Technology's TSUBAME 2.5
is ranked
#1 K Computer
#12 TSUBAME2
Reality: Top500 Supercomputers Dominate; No Cloud IDCs at all
TSUBAME2.0 → 2.5 (Tokyo Tech)
17.1 PFlops (SFP), 5.76 PFlops (DFP); #4 on Top500 (Nov. 2010)
1408 nodes, 44 racks
CPU: Intel Xeon (Westmere-EP) 2.93GHz, 6 cores × 2; GPU: NVIDIA Tesla K20X × 3
SSD: 60GB × 2 per node
Full-bisection optical QDR InfiniBand network
SSDs: 200TB; HDD: 7PB (Lustre, GPFS); Tape: 4PB
TSUBAME2 System Overview 11PB (7PB HDD, 4PB Tape, 200TB SSD)
"Global Work Space" #1
SFA10k #5
"Global Work Space" #2  "Global Work Space" #3
SFA10k #4, SFA10k #3, SFA10k #2, SFA10k #1
/data0  /work0  /work1  /gscr
"cNFS/Clustered Samba w/ GPFS"
HOME
System application
"NFS/CIFS/iSCSI by BlueARC"
HOME
iSCSI
InfiniBand QDR Networks
SFA10k #6
GPFS#1 GPFS#2 GPFS#3 GPFS#4
Parallel File System Volumes / Home Volumes
QDR IB (×4) × 20, 10GbE × 2, QDR IB (×4) × 8
1.2PB / 3.6 PB
/data1
Thin nodes: 1408 nodes (32 nodes × 44 racks)
HP ProLiant SL390s G7, 1408 nodes; CPU: Intel Westmere-EP 2.93GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA Tesla K20X, 3 GPUs/node; Mem: 54GB (96GB); SSD: 60GB × 2 = 120GB (120GB × 2 = 240GB)
Medium nodes
HP ProLiant DL580 G7, 24 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node; GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070; Mem: 128GB; SSD: 120GB × 4 = 480GB
Fat nodes
HP ProLiant DL580 G7, 10 nodes; CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node; GPU: NVIDIA Tesla S1070; Mem: 256GB (512GB); SSD: 120GB × 4 = 480GB
Computing Nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100TB MEM, ~200TB SSD
Interconnects: Full-bisection Optical QDR InfiniBand Network
Core Switch: Voltaire Grid Director 4700 × 12, IB QDR: 324 ports (12 switches)
Edge Switch: Voltaire Grid Director 4036 × 179, IB QDR: 36 ports (179 switches)
Edge Switch (w/ 10GbE ports): Voltaire Grid Director 4036E × 6, IB QDR: 34 ports, 10GbE: 2 ports (6 switches)
2.4 PB HDD + 4PB Tape
GPFS + Tape, Lustre, Home
Local SSDs
Read-mostly I/O (data-intensive apps, parallel workflow, parameter survey)
• Home storage for computing nodes • Cloud-based campus storage services
Backup
Fine-grained R/W I/O (checkpoint, temporary files)
TSUBAME2 Storage Usage Since Nov. 2010
TSUBAME2.5 as a Big Data Infrastructure
• GPU-based many-core accelerators – NVIDIA Tesla K20X × 4224 cards
• Huge memory volumes – 54GB per node × 1408 nodes
• Fat-tree dual-rail QDR InfiniBand – 200Tbps of full-bisection bandwidth
• Local SSD devices per node – 200TB in total
• Large-scale storage systems – Lustre, GPFS – 7PB of HDDs, 4PB of Tapes
A Major Northern Japanese Cloud Datacenter (2013)
Juniper EX8208 × 2 (2 zone switches, Virtual Chassis)
Juniper EX4200 switches in each zone (700 nodes per zone)
Juniper MX480 × 2, 10GbE uplinks (LACP) to the Internet
8 zones, Total 5600 nodes; Injection 1GBps/Node, Bisection 160Gbps
Advanced Silicon Photonics 40G
single CMOS Die 1490nm DFB 100km Fiber
Supercomputer Tokyo Tech. Tsubame 2.0
#4 Top500 (2010)
~1500 nodes compute & storage, Full Bisection Multi-Rail Optical Network
Injection 80GBps/Node, Bisection 220Tbps
>> x1000!
Towards Extreme-scale Supercomputers and BigData Machines
• Computation – Increase in Parallelism, Heterogeneity, Density
• Multi-core, Many-core processors • Heterogeneous processors
• Hierarchical Memory/Storage Architecture – NVM (Non-Volatile Memory),
SCM (Storage Class Memory) • FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.
– Next-gen HDDs (SMR), Tapes (LTFS)
Problems: Network, Locality, Productivity, FT, Algorithm, Power, Storage Hierarchy, I/O, Heterogeneity, Scalability
Extreme Big Data (EBD) Next Generation Big Data
Infrastructure Technologies Towards Yottabyte/Year Principal Investigator
Satoshi Matsuoka
Global Scientific Information and Computing Center
Tokyo Institute of Technology
2014/11/05 JST CREST Big Data Symposium
EBD Research Scheme
Supercomputers: Compute & Batch-Oriented
Cloud IDC: Very low BW & Efficiency
Convergent Architecture (Phases 1~4) Large Capacity NVM, High-Bisection NW
PCB
TSV Interposer
High Powered Main CPU
Low Power CPU
DRAM, DRAM, DRAM, NVM/Flash, NVM/Flash, NVM/Flash
Low Power CPU
DRAM, DRAM, DRAM, NVM/Flash, NVM/Flash, NVM/Flash
2Tbps HBM, 4~6 HBM Channels, 1.5TB/s DRAM & NVM BW, 30PB/s I/O BW; Possible 1 Yottabyte / Year
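(A rough consistency check on the Yottabyte/Year target, added here and not on the slide: 30 PB/s sustained over one year is 30 × 10^15 B/s × ~3.15 × 10^7 s ≈ 9.5 × 10^23 bytes, i.e. on the order of 1 Yottabyte per year.)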
EBD System Software incl. EBD Object System
Introduction / Problem Domain
In most living organisms genetic instructions used in their development are stored in the long polymeric molecule called DNA. DNA consists of two long polymers of simple units called nucleotides. The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T).
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011, 5/54
Large Scale Metagenomics
Massive Sensors and Data Assimilation in Weather Prediction
Ultra Large Scale Graphs and Social Infrastructures
Exascale Big Data HPC
Co-Design
Future Non-Silo Extreme Big Data Apps
Graph Store
EBD Bag
Co-Design
[Map of Japan, 1000km scale]
KVS
KVS
KVS
EBD KVS
Cartesian Plane
Co-Design
Extreme Big Data (EBD) Team: Co-Design EHPC and EBD Apps
• Satoshi Matsuoka (PI), Toshio Endo, Hitoshi Sato (Tokyo Tech.) (Tasks 1, 3, 4, 6)
• Osamu Tatebe (Univ. Tsukuba) (Tasks 2, 3)
• Michihiro Koibuchi (NII)
(Tasks 1, 2)
• Yutaka Akiyama, Ken Kurokawa (Tokyo Tech, 5-1)
• Toyotaro Suzumura (IBM Lab, 5-2)
• Takemasa Miyoshi (Riken AICS, 5-3)
Introduction
Problem Domain
To decipher the information contained in DNA we need to determine the order of nucleotides. This task is important for many emerging areas of science and medicine.
Modern sequencing techniques split the DNA molecule into pieces (called reads) which are processed separately to increase the sequencing throughput.
Reads must be aligned to the reference sequence to determine their position in the molecule. This process is called read alignment.
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi GPU Read Alignment Algorithm with Model-based Performance Optimization", November 1, 2011, 6/54
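A minimal sketch of read alignment in its simplest exact-match form (added for illustration, not from the slides or the cited work): real aligners use indexed data structures such as suffix arrays or FM-indexes and tolerate mismatches, but the basic task is finding where each read fits in the reference.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Naive exact-match read alignment: report every offset in the reference
// where the read occurs. Real aligners index the reference and tolerate
// sequencing errors; this only illustrates the task.
std::vector<std::size_t> align_read(const std::string& reference,
                                    const std::string& read) {
    std::vector<std::size_t> positions;
    for (std::size_t pos = reference.find(read); pos != std::string::npos;
         pos = reference.find(read, pos + 1)) {
        positions.push_back(pos);
    }
    return positions;
}

int main() {
    const std::string reference = "ACGTACGTTAGCGTAC";
    for (std::size_t p : align_read(reference, "GTAC"))
        std::cout << "read aligns at offset " << p << '\n';
}
```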
KVS
KVS  KVS
Graph Store
EBD Application Co-Design and Validation
Ultra High BW & Low Latency NVM, Ultra High BW & Low Latency NW
Processor-in-memory, 3D stacking
EBD Performance Modeling & Evaluation
Large Scale Genomic Correlation
Data Assimilation in Large Scale Sensors and Exascale Atmospherics
Large Scale Graphs and Social Infrastructure Apps
100,000 Times Fold EBD "Convergent" System Overview
TSUBAME 3.0
EBD Programming System
TSUBAME 2.0/2.5
Task 3
Tasks 5-1~5-3, Task 6
EBD "converged" Real-Time Resource Scheduling
Task 4
EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes
Task 2
EBD Bag
EBD KVS
Cartesian Plane
Ultra Parallel & Low Power I/O EBD "Convergent" Supercomputer: ~10TB/s, ~100TB/s, ~10PB/s
Task 1
KVS
KVS  KVS
Graph Store
Cloud Datacenter
Large Scale Genomic Correlation
Data Assimilation in Large Scale Sensors and Exascale Atmospherics
Large Scale Graphs and Social Infrastructure Apps
100,000 Times Fold EBD "Convergent" System Overview
TSUBAME 3.0, TSUBAME-GoldenBox
EBD Bag
EBD KVS
Cartesian Plane
MapReduce for EBD, Workflow/Scripting Languages for EBD
Interconnect (InfiniBand, 100GbE)
EBD Abstract Data Models (Distributed Array, Key Value, Sparse Data Model, Tree, etc.)
EBD Algorithm Kernels (Search/Sort, Matching, Graph Traversals, etc.)
NVM (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.), HPC Storage
EBD File System, EBD Data Object
SQL for EBD, Xpregel (Graph)
Message Passing (MPI, X10) for EBD
PGAS/Global Array for EBD
Network (SINET5)
Intercloud / Grid (HPCI)
Web Object Storage
EBD Burst I/O Buffer, EBD Network Topology and Routing
Hamar (Highly Accelerated Map Reduce) [Shirahata, Sato et al. Cluster2014]
• A software framework for large-scale supercomputers w/ many-core accelerators and local NVM devices – Abstraction for deepening memory hierarchy
• Device memory on GPUs, DRAM, Flash devices, etc.
• Features – Object-oriented
• C++-based implementation • Easy adaptation to modern commodity many-core accelerator/Flash devices w/ SDKs
– CUDA, OpenNVM, etc.
– Weak-scaling over 1000 GPUs • TSUBAME2
– Out-of-core GPU data management • Optimized data streaming between device/host memory
• GPU-based external sorting – Optimized data formats for many-core accelerators • Similar to JDS format
Hamar Overview
[Diagram: a Distributed Array spans Rank 0 .. Rank n, each holding Local Arrays; Map, Shuffle (data transfer between ranks), and Reduce phases operate on the Local Arrays; Device (GPU) Data and Host (CPU) Data are exchanged via Memcpy (H2D, D2H); a Virtualized Data Object backs Local Arrays on NVM]
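To make the data flow above concrete, here is a hedged, CPU-only C++ sketch of the out-of-core pattern Hamar implements: each rank's local array is processed chunk by chunk so that only one chunk needs to fit in fast (device) memory at a time. All names are illustrative and are not Hamar's actual API; the "device buffer" stands in for GPU memory and a host-to-device copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Illustrative out-of-core map + local reduce over one rank's local array.
// Only chunk_size elements are resident at a time, mimicking streaming
// between host/NVM storage and device memory.
template <class Map, class Reduce>
std::int64_t out_of_core_map_reduce(const std::vector<std::int64_t>& local_array,
                                    std::size_t chunk_size,
                                    Map map, Reduce reduce, std::int64_t acc) {
    std::vector<std::int64_t> device_buffer(chunk_size);
    for (std::size_t off = 0; off < local_array.size(); off += chunk_size) {
        const std::size_t n = std::min(chunk_size, local_array.size() - off);
        // Stage one chunk into "device" memory (stand-in for a host-to-device copy).
        std::copy(local_array.begin() + off, local_array.begin() + off + n,
                  device_buffer.begin());
        for (std::size_t i = 0; i < n; ++i) device_buffer[i] = map(device_buffer[i]);   // map
        for (std::size_t i = 0; i < n; ++i) acc = reduce(acc, device_buffer[i]);        // reduce
    }
    return acc;
}

int main() {
    std::vector<std::int64_t> data(1 << 20);
    std::iota(data.begin(), data.end(), 0);
    const auto sum_of_squares = out_of_core_map_reduce(
        data, 1 << 16,
        [](std::int64_t x) { return x * x; },
        [](std::int64_t a, std::int64_t b) { return a + b; }, 0);
    std::cout << sum_of_squares << '\n';
}
```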
Application Example: GIM-V (Generalized Iterative Matrix-Vector multiplication)*1
• Easy description of various graph algorithms by implementing combine2, combineAll, assign functions
• PageRank, Random Walk with Restart, Connected Component – v' = M ×G v, where
v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_i,j, v_j)}))  (i = 1..n)
– Iterative 2-phase MapReduce operations
[Diagram: v' = M ×G v; combine2 applied to m_i,j and v_j (stage 1); combineAll and assign produce v'_i (stage 2)]
*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining 2009
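To show how the three user functions fit together, here is a small sequential C++ sketch of one GIM-V iteration specialized to PageRank, following the PEGASUS formulation; it is purely illustrative and is not the Hamar implementation.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct Entry { std::size_t i, j; double m; };   // nonzero m(i,j) of sparse matrix M

// One GIM-V iteration: v'_i = assign(v_i, combineAll_i({combine2(m(i,j), v_j)})).
// Specialized to PageRank: combine2 multiplies by the damping factor c,
// combineAll sums and adds (1-c)/n, assign replaces the old value.
std::vector<double> gimv_pagerank_step(const std::vector<Entry>& M,
                                       const std::vector<double>& v,
                                       double c = 0.85) {
    const std::size_t n = v.size();
    std::vector<double> partial(n, 0.0);
    for (const Entry& e : M)                      // combine2 + partial combineAll
        partial[e.i] += c * e.m * v[e.j];
    std::vector<double> v_next(n);
    for (std::size_t i = 0; i < n; ++i)           // finish combineAll, then assign
        v_next[i] = (1.0 - c) / n + partial[i];
    return v_next;
}

int main() {
    // Tiny 3-vertex example with a column-stochastic matrix M.
    std::vector<Entry> M = {{0, 1, 1.0}, {1, 2, 0.5}, {2, 2, 0.5}, {2, 0, 1.0}};
    std::vector<double> v(3, 1.0 / 3.0);
    for (int it = 0; it < 30; ++it) v = gimv_pagerank_step(M, v);
    for (double x : v) std::cout << x << ' ';
    std::cout << '\n';
}
```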
Straightforward implementation using Hamar
MapReduce-based Graph Processing with Out-of-core Support on GPUs
[Chart: Weak scaling performance, Performance (MEdges/sec) vs. Number of Compute Nodes, for 1 CPU (SCALE 23 per node), 1 GPU (SCALE 23 per node), 2 CPUs (SCALE 24 per node), 2 GPUs (SCALE 24 per node), 3 GPUs (SCALE 24 per node)]
2.81 GE/s on 3072 GPUs (SCALE 34)
2.10x speedup (3 GPUs vs. 2 CPUs)
• Hierarchical memory management for large-scale graph processing using multi-GPUs – Support out-of-core processing on GPU – Overlapping computation and CPU-GPU communication
• PageRank application on TSUBAME 2.5
EBD-IO Device: A Prototype of Local Storage Configuration [Shirahata, Sato et al. GTC2014]
High Bandwidth and IOPS, Huge Capacity, Low Cost, Power Efficient
Fig. 2 (a) Left: Flat buffer system (b) Right: Burst buffer system
...logging overhead. In addition, if we apply uncoordinated checkpointing to MPI applications, indirect global synchronization can occur. For example, process (a2) in cluster (A) wants to send a message to process (b1) in cluster (B), which is writing its checkpoint at that time. Process (a2) waits for process (b1) because process (b1) is doing I/O and cannot receive or reply to any messages, which keeps process (a1) waiting to checkpoint with process (a2) in Figure 1. If such a dependency propagates across all processes, it results in indirect global synchronization. Many MPI applications exchange messages between processes in a shorter period of time than is required for checkpoints, so we assume uncoordinated checkpointing time is the same as coordinated checkpointing time in the model in Section 4.
2.4 Target Checkpoint/Restart Strategies
As discussed previously, multilevel and asynchronous approaches are more efficient than single and synchronous checkpoint/restart respectively. However, there is a trade-off between coordinated and uncoordinated checkpointing given an application and the configuration. In this work, we compare the efficiency of multilevel asynchronous coordinated and uncoordinated checkpoint/restart. However, because we have already found that these approaches may be limited in increasing application efficiencies at extreme scale [29], we also consider storage architecture approaches.
3. Storage designs
Our goal is to achieve a more reliable system with more efficient application executions. Thus, we consider not only a software approach via checkpoint/restart techniques, but also consider different storage architectures. In this section, we introduce an mSATA-based SSD burst buffer system (Burst buffer system), and explore the advantages by comparing to a representative current storage system (Flat buffer system).
3.1 Current Flat Buffer System
In a flat buffer system (Figure 2 (a)), each compute node has its dedicated node-local storage, such as an SSD, so this design is scalable with increasing number of compute nodes. Several supercomputers employ this flat buffer system [13], [22], [24]. However this design has drawbacks: unreliable checkpoint storage and inefficient utilization of storage resources. Storing checkpoints in node-local storage is not reliable because an application cannot restart its execution if a checkpoint is lost due to a failed compute node. For example, if compute node 1 in Figure 2 (a) fails, a checkpoint on SSD 1 will be lost because SSD 1 is connected to the failed compute node 1. Storage devices can be underutilized with uncoordinated checkpointing and message logging. While the system can limit the number of processes to restart, i.e., perform a partial restart, in a flat buffer system, local storage is not utilized by processes which are not involved in the partial restart. For example, if compute nodes 1 and 3 are in the same cluster, and restart from a failure, the bandwidth of SSDs 2 and 4 will not be utilized. Compute node 1 can write its checkpoints on the SSD of compute node 2 as well as its own SSD in order to utilize both of the SSDs on restart, but as argued earlier distributing checkpoints across multiple compute nodes is not a reliable solution.
Thus, future storage architectures require not only efficient but reliable storage designs for resilient extreme scale computing.
3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for partial restart of uncoordinated checkpointing because processes involved in restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster, and both restart from a failure, the processes can utilize all SSD bandwidth unlike a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K CPU (3.50GHz x 4 cores)
Memory: Cetus DDR3-1600 (16GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256GB CT256M4SSD3 (Peak read: 500MB/s, Peak write: 260MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5' SATA Device Converter with Metal Frame
RAID Card: Adaptec RAID 7805Q ASR-7805Q Single
To explore the bandwidth we can achieve with only commodity devices, we developed an mSATA-based SSD test system. The detailed specification is shown in Table 1. The theoretical peak of sequential read and write throughput of the mSATA-based SSD is 500 MB/sec and 260 MB/sec, respectively. We aggregate eight SSDs into a RAID card, and connect the two RAID cards via PCI-Express (x8) 3.0. The theoretical peak performance of this configuration is 8 GB/sec for read and 4.16 GB/sec for write in total. Our preliminary results showed that actual read bandwidth is 7.7 GB/sec (96% of peak) and write bandwidth is 3.8 GB/sec (91% of peak) [32]. By adding two more RAID cards, and connecting via high-speed interconnects, we expect to be able to build a burst buffer machine using only commodity devices with 16 GB/sec of read, and 8.32 GB/sec of write throughput.
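(Spelling out the peak numbers quoted above: one RAID card aggregates eight mSATA SSDs, so its theoretical peak is 8 × 500 MB/s = 4 GB/s for reads and 8 × 260 MB/s ≈ 2.08 GB/s for writes; two cards therefore give the stated 8 GB/s read and 4.16 GB/s write totals.)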
16 cards of mSATA SSD devices: Capacity 256GB × 16 → 4TB; Read BW 0.5GB/s × 16 → 8 GB/s
A single mSATA SSD; 8 integrated mSATA SSDs
RAID cards; Prototype/Test machine
[Chart: Bandwidth (MB/s) vs. number of mSATAs, for Raw mSATA 4KB, RAID0 1MB, RAID0 64KB]
Preliminary I/O Performance Evaluation Between GPU and EBD-I/O device Using Matrix-Vector Multiplication
[Charts: I/O performance using FIO, and I/O performance with Matrix-Vector Multiplication on GPU; Throughput (GB/s) vs. Matrix Size (GB), for Raw 8 mSATA, 8 mSATA RAID0 (1MB), 8 mSATA RAID0 (64KB)]
I/O Performance of EBD-I/O device: 7.39 GB/s (RAID0)
I/O Performance to GPU: 3.06 GB/s (up to PCI-E BW)
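A minimal sketch of the access pattern behind this experiment (illustrative only, not the evaluated code, and the file name is hypothetical): the matrix is read from the EBD-I/O device in chunks and each chunk is handed to the GPU, so effective throughput is bounded by the smaller of the SSD read bandwidth and the PCI-Express transfer bandwidth.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

// Stand-in for a host-to-GPU transfer plus the matrix-vector kernel;
// on the real system this would be a CUDA copy and kernel launch.
void process_chunk_on_gpu(const std::vector<double>& chunk) {
    double acc = 0.0;
    for (double x : chunk) acc += x;   // placeholder work
    (void)acc;
}

int main() {
    const std::size_t chunk_elems = 64ULL << 20;            // 512 MB of doubles per chunk
    std::vector<double> chunk(chunk_elems);
    std::ifstream matrix("matrix.bin", std::ios::binary);   // illustrative file name
    // Stream the matrix chunk by chunk; handling of a trailing partial
    // chunk is omitted for brevity.
    while (matrix.read(reinterpret_cast<char*>(chunk.data()),
                       static_cast<std::streamsize>(chunk.size() * sizeof(double)))) {
        process_chunk_on_gpu(chunk);
    }
    std::cout << "done\n";
}
```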
[Chart: weak scaling, Keys/second (billions) vs. # of processes (2 processes per node), for HykSort 1 thread, HykSort 6 threads, HykSort GPU + 6 threads]
[Chart: performance prediction, Keys/second (billions) vs. # of processes (2 processes per node), for HykSort 6 threads, HykSort GPU + 6 threads, PCIe_10/50/100/200, prediction of our implementation, and accelerators ×4 faster than K20X]
Sorting for EBD Plugging in GPUs for large-scale sorting
• Weak scaling performance (Grand Challenge on TSUBAME2.5)
– 1 ~ 1024 nodes (2 ~ 2048 GPUs) – 2 processes per node and each node has 2GB of 64-bit integers
• Yahoo/Hadoop Terasort: 0.02 [TB/s]
– Including I/O
0.25 [TB/s] (chart annotations: ×1.4, ×3.61, ×389)
• Performance prediction
– ×2.2 speedup compared to CPU-based implementation when the PCIe bandwidth increases to 50GB/s
– 8.8% reduction of overall runtime when the accelerators work 4 times faster than K20X
! PCIe_#: # GB/s bandwidth of interconnect between CPU and GPU
• GPU implementation of splitter-based sorting (HykSort)
[Shamoto, Sato et al. BigData 2014]
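The core idea of splitter-based distributed sorting, of which HykSort is a highly tuned variant, can be sketched in a few lines: sample splitters, partition local data by splitter, exchange partitions, then sort locally. The sketch below is sequential C++ and purely illustrative; the implementation from the paper uses MPI collectives and GPU kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative splitter-based sort across "ranks" held in one process.
// Real splitter-based sorts select splitters by sampling, exchange buckets
// with all-to-all communication, and sort each bucket on the accelerator.
std::vector<std::vector<std::int64_t>>
splitter_sort(std::vector<std::vector<std::int64_t>> ranks) {
    const std::size_t p = ranks.size();
    // 1. Choose p-1 splitters from a sorted global sample.
    std::vector<std::int64_t> sample;
    for (const auto& r : ranks) sample.insert(sample.end(), r.begin(), r.end());
    std::sort(sample.begin(), sample.end());
    std::vector<std::int64_t> splitters;
    for (std::size_t k = 1; k < p; ++k)
        splitters.push_back(sample[k * sample.size() / p]);
    // 2. Partition + "exchange": route each key to the rank owning its range.
    std::vector<std::vector<std::int64_t>> buckets(p);
    for (const auto& r : ranks)
        for (std::int64_t key : r) {
            const std::size_t dest =
                std::upper_bound(splitters.begin(), splitters.end(), key) - splitters.begin();
            buckets[dest].push_back(key);
        }
    // 3. Local sort per rank.
    for (auto& b : buckets) std::sort(b.begin(), b.end());
    return buckets;
}

int main() {
    const auto out = splitter_sort({{9, 1, 7}, {4, 8, 2}, {6, 3, 5}});
    for (const auto& b : out) { for (auto k : b) std::cout << k << ' '; std::cout << "| "; }
    std::cout << '\n';
}
```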
! New BigData Benchmark based on Large-scale Graph Search for Ranking Supercomputers
! BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree edgefactor (=16).
! Evaluation criteria: TEPS (Traversed Edges Per Second), and problem size that can be solved on a system, minimum execution time.
Graph500 Benchmark http://www.graph500.org
SCALE and edgefactor (=16)
Median TEPS
Benchmark flow: 1. Generation, 2. Construction, 3. BFS (× 64 iterations)
Input parameters: SCALE, edgefactor
Graph generation → Graph construction → BFS → Validation, repeated for 64 iterations
Results: SCALE, edgefactor, BFS Time, Traversed edges, TEPS; the TEPS ratio is reported over the 64 BFS runs
• Kronecker graph – 2^SCALE vertices and 2^(SCALE+4) edges – synthetic scale-free network
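A compact sketch of how a run turns this flow into a score (added for illustration; the reference implementation is far more involved and counts traversed edges per the specification rather than scanned edges): run BFS from a root, count edges, and report edges divided by BFS time as TEPS.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

// Level-synchronous BFS on an adjacency-list graph; returns the number of
// edges scanned, used here as a simplified stand-in for the edge count that
// Graph500 divides by the BFS time to obtain TEPS.
std::uint64_t bfs_scanned_edges(const std::vector<std::vector<int>>& adj, int root) {
    std::vector<int> parent(adj.size(), -1);
    std::queue<int> q;
    parent[root] = root;
    q.push(root);
    std::uint64_t scanned = 0;
    while (!q.empty()) {
        const int u = q.front(); q.pop();
        for (int v : adj[u]) {
            ++scanned;
            if (parent[v] == -1) { parent[v] = u; q.push(v); }
        }
    }
    return scanned;
}

int main() {
    // Toy graph standing in for a SCALE-N Kronecker graph.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
    const auto t0 = std::chrono::steady_clock::now();
    const std::uint64_t edges = bfs_scanned_edges(adj, 0);
    const double secs =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::cout << "TEPS = " << edges / secs << '\n';  // Graph500 reports statistics over 64 roots
}
```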
Scalable distributed memory BFS for Graph500 Koji Ueno (Tokyo Institute of Technology) et al.
! What’s the best algorithm for distributed memory BFS?
Optimizations and utilization for each version (SC11 / ISC12 / SC12 / ISC14): 2D decomposition ✓✓✓✓; vertex sorting ✓; direction optimization ✓; data compression ✓✓✓; sparse vector with pop counting ✓; adaptive data representation ✓; overlapped communication ✓✓✓✓; shared memory ✓; GPGPU ✓✓
We proposed and explored many optimizations.
[Chart: Graph500 score history of TSUBAME2, Performance (GTEPS): SC'11 100, ISC'12 317, SC'12 462, ISC'14 1,280]
Continuous effort to improve performance
Machine / # of nodes / Performance: K computer, 65536 nodes, 5524 GTEPS; TSUBAME2.5, 1024 nodes, 1280 GTEPS; TSUBAME-KFC, 32 nodes, 104 GTEPS
Optimized for various machines
1. Hybrid-BFS (Beamer '11)  2. Proposal
DRAM
SCALE 31 (Total data size is 1TB): BFS Performance 13.8 GTEPS
Power Consumption 391.8 W
Large Scale Graph Processing Using NVM
3. Experiment
35.2 MTEPS / W
Green Graph 500
http://green.graph500.org/
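(A quick check of the figures above, added here and not on the slide: 13.8 GTEPS is 13,800 MTEPS, and 13,800 MTEPS / 391.8 W ≈ 35.2 MTEPS/W, the Green Graph 500 number quoted.)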
mSATA-SSD × 8, RAID Card (www.adaptec.com, www.crucial.com)
4 times larger graph with 6.9% of degradation
DRAM, NVM
Load highly accessed graph data before BFS
NVM holds the full-size graph; DRAM holds highly accessed data
[Iwabuchi, Sato et al. BigData2014]
Top-down / Bottom-up
# of frontiers: n_frontier, # of all vertices: n_all, parameters: α, β
Switching between the two approaches
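A minimal sketch of the direction-switching logic, under the assumption (labeled as such, since the slide only names the quantities) that the thresholds compare the frontier size against the total vertex count scaled by α and β, in the spirit of Beamer's hybrid BFS:

```cpp
#include <cstdint>
#include <iostream>

enum class Direction { TopDown, BottomUp };

// Hybrid-BFS direction switching, sketched with the quantities named on the
// slide: n_frontier, n_all and parameters alpha, beta. The exact threshold
// forms here are an assumption for illustration.
Direction choose_direction(Direction current,
                           std::uint64_t n_frontier, std::uint64_t n_all,
                           double alpha, double beta) {
    if (current == Direction::TopDown) {
        // Frontier grew large: scanning unvisited vertices (bottom-up) is cheaper.
        if (static_cast<double>(n_frontier) > static_cast<double>(n_all) / alpha)
            return Direction::BottomUp;
    } else {
        // Frontier shrank again: switch back to top-down for the tail of the search.
        if (static_cast<double>(n_frontier) < static_cast<double>(n_all) / beta)
            return Direction::TopDown;
    }
    return current;
}

int main() {
    const Direction d = choose_direction(Direction::TopDown, 300000, 1000000, 4.0, 24.0);
    std::cout << (d == Direction::BottomUp ? "bottom-up" : "top-down") << '\n';
}
```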
Results: BFS Performance
# EBD-RH5885v2 (Huawei, Tecal RH5885 V2 Server w/ Tecal ES3000 PCIe SSD 800GB×2, 1.2TB×2, Crucial M500 mSATA 480GB ×32, MEM 1024GB) - Scale 33 - 3.10722 GTEPS - 908.22 W - 3.42130 MTEPS/W
# GraphCrest Node #1 (Custom, Intel(R) Xeon(TM) E5-2690, Crucial m4 mSATA 256GB ×16, MEM 256GB) - Scale 31 - 13.7963 GTEPS - 391.825 W - 35.2104 MTEPS/W
# MEM-CREST Node #2 (Supermicro 2027GR-TRF w/ FusionIO ioDrive2 1.2TB ×2, MEM 128GB) - Scale 30 - 7.98417 GTEPS - 276.507 W - 28.8751 MTEPS/W
MEM-CREST Node (Supermicro 2027GR-TRF): DRAM 128 GB; NVM ioDrive2 1.2 TB × 2; SCALE 30 (500GB); 7.98 GTEPS; 28.88 MTEPS/W
GraphCrest Node: DRAM 256 GB; NVM EBD-I/O 2TB × 2; SCALE 31 (1TB); 13.80 GTEPS; 35.21 MTEPS/W
EBD-RH5885v2 (Huawei Tecal RH5885 V2): DRAM 1024 GB; NVM Tecal ES3000 800GB×2, 1.2TB×2 and EBD-I/O 4TB × 2; SCALE 33 (4TB); 3.11 GTEPS; 3.42 MTEPS/W
The Graph 500, June 2014: DRAM + NVM model
The 2nd Green Graph500 list (Nov. 2013) • Measures power efficiency using the TEPS/W ratio • Results on various systems such as Huawei's RH5885v2 w/ Tecal ES3000 PCIe SSD 800GB × 2 and 1.2TB × 2 • http://green.graph500.org
Expectations for Next-Gen Storage System (Towards TSUBAME3.0)
• Achieving high IOPS – Many apps with massive small I/O ops
• graph, etc. • Utilizing NVM devices
– Discrete local SSDs on TSUBAME2 – How to aggregate them?
• Stability/Reliability as Archival Storage• I/O resource reduction/consolidation
– Can we allow a large number of OSSs to achieve ~TB/s throughput?
– Many constraints • Space, Power, Budget, etc.
Current Status • New Approaches
• Tsubame 2.0 has pioneered the use of local flash storage as a high-IOPS alternative to an external PFS
• Tiered and hybrid storage environments, combining (node-)local flash with an external PFS
• Industry Status • High-performance, high-capacity flash (and other new
semiconductor devices) are becoming available at reasonable cost
• New approaches/interfaces to use high-performance devices (e.g. NVM Express)