H5Spark - Cray User Group...H5Spark: Big Data Analytics, Spark • Apache Spark is an open source cluster compu9ng framework – Developed at UCB AMPLab, 2014 v1.0, 2016 v2.0 • Ac9vely

H5Spark

CUG,May2016

H5Spark:BridgingtheI/OGapbetweenSparkandScien9ficDataFormatsonHPCSystems

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon,

Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat

[email protected] National Energy Research Scientific Computing Center(NERSC)

H5Spark: Outline

•  Introduc9on,Spark•  Mo9va9on

•  H5SparkDesign•  H5SparkEvalua9on

•  H5SparkFuture

-2-CUG,May2016

H5Spark: Big Data Analytics, Spark

•  ApacheSparkisanopensourceclustercompu9ngframework–  DevelopedatUCBAMPLab,2014v1.0,2016v2.0

•  Ac9velydeveloped,1000+contributorsin2015–  Produc9veprogramminginterface

•  6vs28linesofcodecomparetohadoopmapreduce

–  Implicitdataparallelism–  Fault-tolerance

•  SparkforData-intensiveCompu9ng–  Streamingprocessing–  SQL–  Machinelearning,MLlib–  Graphprocessing

-3-CUG,May2016

H5Spark: Porting Spark onto HPC

•  AdvantagesofPor9ngSparkontoHPC–  Amoreproduc9veAPIfordata-intensivecompu9ng–  Relievetheusersfromconcurrencycontrol,communica9onand

memorymanagementwithtradi9onalMPImodel.–  Embarrassinglyparallelcompu9ng,data.map(f) –  Faulttolerance,recompute()

•  ButScien9ficDataFormatsinHPCnotSupported–  HDF5/netCDFareamongthetop5librariesatNERSC,2015

•  750+uniqueusers@NERSC,millionofusersworldwide

–  1987,NCSA&UIUC.NASAsendHDF-EOSto2.4millionsendusers–  Hierarchicaldataorganiza9on–  ParallelI/O

-4-CUG,May2016

H5Spark: Data in Spark

•  RDD:ResilientDistributedDatasets–  Read-only,par99onedcollec9onofrecordsinSpark–  RDDcancontainanytypeofPython/Java/Scalaobjects–  FaultTolerant

•  Transforma9onsonRDD–  Filter,map,join,etc

•  Ac9onsonRDD–  Reduce,collect,etc

•  Sparkopera9onsarelazy•  RDDallowsin-memoryprocessing

–  rdd.cache()orrdd.persist()–  Goodforitera9veorinterac9veprocessing

-5-CUG,May2016

H5Spark: Data in Spark

-6-CUG,May2016

1 2 3 4 5 6 7 8 9

1 2 3 4 5 6 7 8 9

H5Spark: Data in HDF5

•  HierarchicalDataFormatv5

-7-CUG,May2016

Group HDF5

Dataset

1 2 3 4 5 6 7 8 9

Dataset Attributes Group

H5Spark: Support HDF5 in Spark

•  WhatdoesSparkhaveinreadingvariousdataformats?–  Texhile,sc.textFile()–  Parquet,sc.read.parquet()–  Json,sc.read.json()

•  Challenges:Func9onalityandPerformance–  HowtotransformanHDF5datasetintoanRDD?

–  Howtou9lizetheHDF5I/OlibrariesinSpark?–  HowtoenableparallelI/OonHPC?–  WhatistheimpactofLustrestriping?

–  WhatistheeffectofcachingonIOinSpark?

-8-CUG,May2016

HDF5àParquet?

H5Spark: Software Overview

•  Scala/Pythonimplementa9on–  SparkfavorsScalaandPython–  H5SparkusesHDF5javalibrary–  UnderneathisHDF5Cposixlibrary–  NoMPIIOsupport

•  H5Sparkasastandalonepackage–  UserscanloaditintheirSparkapplica9ons–  H5SparkmoduleonCori–  sbtpackage------>h5spark_2.10-1.0.jar

•  Opensource–  Github:hlps://github.com/valiantljk/h5spark

-9-CUG,May2016

H5Sparkscala/python

JHI5java

HDF5c

HDF5MPI

H5Pypython

1.8.14

H5Spark: Design

-10-

Group HDF5

Dataset

Lustre File System

User App

H5Spark Hyperslab Partitioner

RD

D

Parallel I/O

H5Spark Metadata Analyzer

H5Spark RDD Constructor

H5spark RDD Seeder

1 2 3 4 5 6 7 8 9

•  RDDSeeder•  MetadataAnalyzer•  HyperslabPar99oner•  RDDConstructor

CUG,May2016

H5Spark: From HDF5 to RDD

•  Input:

*SparkPar$$ondeterminesthedegreeofparallelism=MPIprocesses +OpenMP p>numofcores

•  Output:RDD:r •  UndertheHood:readingHDF5intoRDD

–  Adjustpar99ons p=p>dim[sid]?dim[sid]:p

–  Determinehyperslab offset[i]=dim[sid]/p*i

–  SeedRDD r_seed=sc.parallelize(offset,p)–  PerformparallelI/O r_seed.flatmap(h5read(f,v))

-11-

HDF5FilePath: f DatasetName: v SparkContext: sc

*SparkPar99on: p

CUG,May2016

H5Spark: How to Use

•  H5SparkAPIs

•  CorrespondtoSparkMLlibinterfaceimportorg.apache.spark.mllib.linalg

DataType:Vector,labeledpoint,matrix,indexedrowmatrix,etc

-12-CUG,May2016

Input:sc,f,v,p

Func9ons Output

h5read ARDDofdoublearray

h5read_point ARDDof(key,value)pair

h5read_vec ARDDofvector

h5read_irow ARDDofindexedrow

H5read_imat ARDDofindexedrowmatrix

H5Spark: How to Use

•  Samplecodes,H5SparkvsMPI

-13-CUG,May2016

1.  valsc=newSparkContext()2.  valrdd=h5read(sc,f,v,p)3.  sc.stop()

1.  MPI_Init(&argc,&argv);2.  MPI_Comm_size(comm,&mpi_size);3.  MPI_Comm_rank(comm,&mpi_rank);4.  hid_tfapl=H5Pcreate(H5P_FILE_ACCESS);5.  H5Pset_fapl_mpio(fapl,comm,info);6.  file=H5Fopen(f,H5F_ACC_RDONLY,fapl);7.  dataset=H5Dopen(file,v,H5P_DEFAULT);8.  hid_tdataspace=H5Dget_space(dataset);9.  hsize_toffset[rank];10.  hsize_tcount[rank];11.  hsize_trest=dims_out[0]%mpi_size;12.  if(mpi_rank!=(mpi_size-1)){13.  count[0]=dims_out[0]/mpi_size;14.  }else{15.  count[0]=dims_out[0]/mpi_size+rest;16.  }17.  offset[0]=dims_out[0]/mpi_size*mpi_rank;18.  for(i=1;i<rank;i++){19.  offset[i]=0;20.  count[i]=dims_out[i];21.  }22.  hid_thyperid=H5Sselect_hyperslab(dataspace,23.  H5S_SELECT_SET,offset,NULL,count,NULL);24.  hsize_trankmemsize=1;25.  for(i=0;i<rank;i++)rankmemsize*=count[i];26.  hid_tmemspace=H5Screate_simple(rank,count,NULL);27.  double*data_t=(double*)malloc(sizeof(double)*rankmemsize);28.  H5Dread(dataset,H5T_NATIVE_DOUBLE,memspace,29.  dataspace,H5P_DEFAULT,data_t);30.  MPI_Finalize()

H5SparkParallelRead

MPIParallelRead

Parallelism

H5Spark: Evaluation

•  AbouttheSystem–  Cori,Phase1,CrayXC40supercomputer,1600computenodes,248

LustreOSTs–  Eachcomputenodehas32coreswith128GBRAMintotal.Thepeak

I/Obandwidthis700GB/s.

•  ExperimentalSetup–  PCAon2.2TBglobaloceantemperaturedata,16TBCAM5

atmospheredata.–  2.2TB,16TB,HDF5format,Doubleprecision–  Numberofnodes:45,90,135,1600–  Stripecounts:1,8,24,72,144,248

-14-CUG,May2016

CAM5,16TB,Findingtheprincipalcausesofvariabilityinlargescale3Dfields.

H5Spark: Evaluation

•  Scaling/ProfilingH5SparkwithLustreStriping–  45nodes,1440cores,3000par99ons,2.2TBdata,1MBstripesize

-15-CUG,May2016

I/OBandwidthwithLustreStriping H5SparkTasksLaunchingDelay

OSTshouldbeanotherfactorinSpark’sschedulingbesidesCPU/Memory

H5Spark: Evaluation

•  ScalingH5SparkwithPar99ons–  45nodes,2.2TB

-16-CUG,May2016

Thenumberofpar99onscanbetuned,basedontheworkloadsandresources

H5Spark: Evaluation

•  ScalingH5SparkwithExecutorsand/orPar99ons–  2.2TB,45,95,135nodes

-17-CUG,May2016

Lesson:IncreasethenumberofExecutorsandPar99onsatthesame9me

H5Spark: Evaluation

•  H5SparkhasbeentestedatfullscaleonCoriphase1

-18-CUG,May2016

Tests Size(TB) I/O(s) B/W(GB/s) OSTs Executors ParSSons

135nodes 2.2 37 59.7 144 135 9000

Fullscale 16 120 136.5 144 1522 52100

H5Spark: Evaluation

•  H5SparkPythonvsScala

-19-CUG,May2016

Version I/O(s) B/W(GB/s) Speedup Mem(GB) RaSo

Python 162 13.65 1 479 1

Scala 90 24.56 1.8 2210 4.61

ScalaisfasterthanPython

H5Spark: Evaluation

•  H5SparkvsMPI-IO

-20-CUG,May2016

MPIscaleswellwithOSTsH5SparkscaleswellwithNodes(whileMPIsaturatestheI/O)Again:StorageonHPCisanimportantschedulingfactor

Par99onsarealsoincreased

H5Spark: Evaluation

•  H5Spark@LBNLvsSciSpark@NASA–  hlps://github.com/SciSpark/SciSpark–  hlps://github.com/valiantljk/h5spark

-21-CUG,May2016

H5Spark: Conclusion & Future Work

•  H5Spark:–  AnefficientHDF5fileloaderforSpark–  UserscannowuseSparkperformbigdataanalysisonHDF5data–  H5SparkgetsclosertoMPIIO

•  H5SparkFuture–  SparkI/Ofinerprofiling/lazyevalua9on–  Parallelwrite/filter–  Storage-awarescheduling

-22-CUG,May2016

H5Spark

ThanksSciSparkTeamThankDouglasJacobsen@NERSC,

JeyKolaalam@Amplab

-23-CUG,May2016

Documents

H5Spark - Cray User Group...H5Spark: Big Data Analytics, Spark • Apache Spark is an open source cluster compu9ng framework – Developed at UCB AMPLab, 2014 v1.0, 2016 v2.0 • Ac9vely