HPC and Big Data convergence
Serge G. Petiton, Université Lille 1 and CNRS, France; Nahid Emad, University of Versailles, France; Laurent Bobelin, Université François Rabelais, France
Outline
• Introduction
• Big Data and/versus HPC
• The PageRank method and eigenvector computations
• The Map-Reduce method and distributed algorithms
• Convergence between HPC and Big Data
• Conclusion
1st FJG workshop, April 5th
Introduction
• HPC
  – simulation: size of data increasing, new domains, stability, …
  – global reduction problems and communication problems
  – and all the well-known problems discussed this week at CSE 2017 …
• Big Data:
  – enough data to avoid statistical models?
  – data ranking
  – map-reduce operations (indexing, …)
• HPC and Big Data: convergence? Cooperation? Mutual exchanges?
  – Joint problems: energy consumption, I/O minimization, resilience, …
  – Nevertheless, the intrinsic behaviors are quite different
“Convergence” of HPC and Big Data: what languages, programming paradigms, systems, … for what methods, sciences and future developments?
Statistics-model-free Big Data: the flu epidemic
Aggregation of Google search data to estimate current flu activity in near real-time.
A slide from Yutong Lu: the software stack of the Tianhe supercomputers.
MPI/OpenMP + Hadoop/Spark on the supercomputer software stack.
Hadoop and Spark on Tianhe-2, with MPI and OpenMP.
Also asking for smarter middleware on such machines.
BUT nowadays HPC and Big Data platforms are different:
• HPC storage is expensive and interconnected; Big Data storage uses local, cheap disks
• HPC and Big Data platforms are not at the same scale: billions of cores vs. thousands of machines
• HPC tries to make full use of CPUs/GPUs; Big Data is focused on I/O

Nowadays HPC and Big Data rely on different theoretical grounds:
• HPC performs, most of the time, a mix of task and data parallelism
• Inter-task communications are really different
• Big Data has a tendency to treat all nodes as equals; MapReduce, for example, has good scaling properties when all nodes perform a Map simultaneously
• Big Data relies on statistics and good probabilistic properties of the data distribution, while HPC code explicitly distributes data

Things are changing fast, and some Big Data environments (Hadoop, for example) now include more support for task parallelism, but there is still an enormous gap between the two.
What HPC/Big Data applications? And what programming paradigms? And what about data ranking and HPC with “Big Data”?
Nowadays HPC and Big Data stacks are different:
• Performance: Big Data exhibits larger latency compared to HPC; Big Data scaling often implies larger tasks
• HPC has been optimized for almost 50 years, versus about 10 years for Big Data
• HPC code is often optimized for its hardware platform, while Big Data is not that deep into the architecture
• The languages and philosophies used in the two may be different

Many projects have failed due to the complexity of integrating both stacks.

Some projects so far have addressed the problem, mainly based on Hadoop + MPI, with no real large adoption:
• Integrating HPC tools into the Hadoop environment (Hamster, …): most of the time not adequate due to application latency; disappointing results
• Deploying Hadoop on demand on HPC resources (HoD, …): most of the time not adequate because of hardware requirements
Flu epidemic: a proposed PageRank-like model (Nahid Emad)
PageRank-like model vs. PageRank model:
• an individual in a social graph ↔ a web page in the web graph
• a virus ↔ a random walker
• propagation of the virus ↔ walk of the walker
• the PageRank of an individual is the probability of being infected by the virus in the course of the epidemic ↔ the PageRank of a page is the probability of the presence of the walker on that page
With a mathematical formalism and a dominant-eigenvector computation, we rank the individuals with respect to the virus propagation.
Mathematical formalism
By the Perron–Frobenius theorem, λ = 1 is the largest eigenvalue of the matrix P. Then there is a stationary distribution for the final state of the epidemic spread: Px = x.
Let xi be the probability that individual i is infected during the epidemic; x = (x1, x2, …, xn), the stationary distribution (infection vector) for the whole population, is independent of the starting distribution and verifies Px = x.
The impact of the infection vector x in the social graph is similar to that of the PageRank vector in the web graph.
Problem → Solution
• Dangling individual → add a loop to itself
• Small world: non-uniqueness of the ranking vector → add a jumping vector to the random virus-propagation process
Computational algorithms

A = αP + (1 − α) v zᵀ

• A is the disease-transition matrix
• v is the teleportation vector
• z is the vector (1, …, 1)ᵀ
• α (< 1) is the damping factor
• 1 − α is the jumping rate: the probability for the virus to jump from any individual to any other individual in the social graph

Then, computation of the dominant eigenvector of A (very large, non-Hermitian and very sparse).
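As a sketch (not from the slides), the stationary infection vector can be computed matrix-free by power iteration: since z = (1, …, 1)ᵀ and x stays a probability vector, zᵀx = 1 and Ax = αPx + (1 − α)v, so A never has to be formed. The 4-individual column-stochastic matrix below is a made-up example:

```python
import numpy as np

# Made-up 4-individual propagation graph; P is column-stochastic:
# P[i, j] = probability that the virus moves from individual j to i.
P = np.array([
    [0.0, 0.5, 0.0, 1/3],
    [0.5, 0.0, 0.5, 1/3],
    [0.5, 0.5, 0.0, 1/3],
    [0.0, 0.0, 0.5, 0.0],
])
n = P.shape[0]
alpha = 0.85                      # damping factor
v = np.full(n, 1.0 / n)           # uniform teleportation vector

x = np.full(n, 1.0 / n)           # any starting distribution works
for _ in range(1000):
    # A @ x = alpha * P @ x + (1 - alpha) * v * (z^T x), with z^T x = 1
    x_new = alpha * (P @ x) + (1 - alpha) * v
    if np.linalg.norm(x_new - x, 1) < 1e-14:
        x = x_new
        break
    x = x_new

print(x)   # stationary infection vector; entries sum to 1
```

Because the damping factor bounds the contraction rate by α, convergence is geometric and independent of the starting vector, exactly as the slide states.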
Computational algorithms: MIRAM

[Figure: several co-methods run in parallel, each from its own initial guess w_k and subspace size m_k: initial guess w_k → projection of P_n onto K_{m_k} → solving P_{m_k} → return to ℂⁿ, with asynchronous information exchange between the co-methods.]
With n = 10^11, c = 10^3, m = 10^3, if iter = 100 and b = 10, MIRAM needs iter × b = 10^3 petaflop (1 exaflop) and b × 10 = 100 petabytes.
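For illustration only, here is a minimal dense-NumPy sketch of one such co-method: an explicitly restarted Arnoldi process (project onto K_m, solve the small m × m problem, restart with the Ritz vector). MIRAM would run several of these with different subspace sizes m_k and exchange the best restart vector asynchronously; the function name and the toy 50 × 50 stochastic matrix are invented for the example:

```python
import numpy as np

def arnoldi_dominant(A, w, m, restarts=20):
    """One IRAM-like co-method, explicitly restarted: build an Arnoldi
    basis of K_m(A, w), solve the small m x m eigenproblem, and restart
    from the dominant Ritz vector."""
    n = A.shape[0]
    x = w / np.linalg.norm(w)
    for _ in range(restarts):
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = x
        for j in range(m):                      # Arnoldi process
            q = A @ V[:, j]
            for i in range(j + 1):              # modified Gram-Schmidt
                H[i, j] = V[:, i] @ q
                q -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(q)
            if H[j + 1, j] < 1e-14:             # invariant subspace found
                break
            V[:, j + 1] = q / H[j + 1, j]
        k = j + 1
        evals, evecs = np.linalg.eig(H[:k, :k]) # small projected problem
        top = np.argmax(np.abs(evals))
        x = np.real(V[:, :k] @ evecs[:, top])   # Ritz vector back in R^n
        x /= np.linalg.norm(x)
    return evals[top].real, x

rng = np.random.default_rng(1)
M = rng.random((50, 50))
P = M / M.sum(axis=0)                           # toy column-stochastic matrix
lam, x = arnoldi_dominant(P, rng.random(50), m=5)
print(lam)   # close to 1, the Perron eigenvalue of a stochastic matrix
```

Production codes use implicit restarting (as in ARPACK) rather than this explicit restart, but the projection/solve/restart cycle matches the figure.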
Experiments: stochastic simulation using the infection vector
Time series of infection in a 7010-node power-law (BA) social graph, with ν = 0.2, δ = 0.24 and x = 5.
Experiments
Convergence behavior for the 281903 × 281903 Stanford matrix, α = 0.85.
Number of matrix–vector products for the 281903 × 281903 Stanford graph.
Big Data: microbiota (Nahid Emad with J.-M. Batto, INRA MGP)
• 2 kg: more bacteria than human cells (60·10^16)
• An unknown organ: the intestinal microbiota
• The amount of sequence generated has increased 10^9-fold in 20 years
bacteria · protists · archaea · viruses · fungi
Big Data: microbiota
• bacteria: ≈ 3000 genes
• parasites: ≈ 6000 genes
• viruses: ≈ 50 genes
• 100–1000 individuals
• up to 10 million genes
DNA preparation → get sequences → compare to reference → counting & analyzing
Big Data: microbiota
MetaProf → energy efficiency multiplied by 4.7 with the GPU implementation.
[Figure: a counting matrix of 10^6 genes by 800 samples, and the genes × genes correlation matrix derived from it]
Principal Coordinates Analysis applied to the matrices of distances between samples, concentrating the major variations of the samples into a small space, implies using many linear-algebra techniques: SVD computation of very large matrices.
Numerical methods/algorithms and HPC techniques have to be defined/adapted to increase data scalability.
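As an illustrative sketch (toy numbers, not the MetaProf data), classical PCoA squares the pairwise distances, double-centers them, and eigendecomposes the result; with Euclidean distances the leading coordinates reproduce the distances exactly:

```python
import numpy as np

# Toy samples-by-genes counting matrix (4 samples, 3 genes; invented numbers)
counts = np.array([
    [10.0, 0.0, 3.0],
    [ 9.0, 1.0, 2.0],
    [ 0.0, 8.0, 7.0],
    [ 1.0, 9.0, 6.0],
])
n = counts.shape[0]

# Euclidean distance matrix between samples
D = np.linalg.norm(counts[:, None, :] - counts[None, :, :], axis=-1)

# Classical PCoA: double-center the squared distances, then eigendecompose
J = np.eye(n) - np.ones((n, n)) / n             # centering matrix
B = -0.5 * J @ (D ** 2) @ J
evals, evecs = np.linalg.eigh(B)
order = np.argsort(evals)[::-1]                 # largest eigenvalues first
coords = evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# coords[:, :2] are the principal coordinates of the 4 samples
print(coords[:, 0])
```

At the 10^6 × 800 scale of the slides, the dense eigendecomposition would be replaced by a (possibly randomized or distributed) truncated SVD, which is exactly the data-scalability point made above.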
PageRank is often part of a larger application.
Indexing: a Map-Reduce algorithm
PageRank ranks all the pages: a computation on a non-Hermitian sparse matrix of size several billion, and growing. The ranking is done before the queries.
There exist applications asynchronously mixing different approaches:
• HPC
• Indexing
• Data ontology
• …
We then have distributed, asynchronous, parallel computation.
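A minimal in-memory sketch of such a Map-Reduce indexing step (toy corpus and function names invented for the example): Map emits (word, doc_id) pairs and Reduce groups them into an inverted index:

```python
from collections import defaultdict

# Toy corpus: document id -> text (invented)
docs = {
    0: "big data meets hpc",
    1: "hpc needs fast networks",
    2: "big graphs need pagerank",
}

def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    for word in text.split():
        yield word, doc_id

def reduce_phase(pairs):
    """Reduce: group the doc ids by word, yielding the inverted index."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(pairs)
print(sorted(index["hpc"]))   # → [0, 1]
```

In a real framework the pairs are shuffled by key across nodes between the two phases; that shuffle is what the map-reduce scaling argument above rests on.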
Applications mixing linear algebra (or HPC-like computation, including PageRank, …) and Map-Reduce (or other classical “Big Data” algorithms):
• in sequence
• in parallel, asynchronously
• iteratively
There exist new approaches for Map-Reduce (YARN, Tez, …).
There exist new languages and programming paradigms for HPC.
Both have to mix distributed and parallel computation, asynchronously or not.
We have to target post-petascale machines and future exascale computing.
We have to propose solutions adapted to supercomputers and data centers.
Geoscience: Kirchhoff migration
Map-Reduce, or other Big Data indexing or ontology analysis.
• Masnida Amami, Ali Setayesh, Nasrin Jaberi: Distributed Computing of Seismic Imaging Algorithms (Iran)
• N. B. Rizvandi, A. J. Boloori, N. Kamyabpour, et al.: MapReduce implementation of prestack Kirchhoff time migration (PKTM) on seismic data (Australia)
• A Spark version developed by researchers from Nanjing University (China)
Map-Reduce is used to optimize I/O and communications for some Kirchhoff migration methods. Nevertheless, the Map-Reduce is done before the Kirchhoff computation itself.
What about computational methods mixing Map-Reduce (or other typical Big Data algorithms) and “classical” HPC computation (linear algebra, FFT, convolution, …) during the computation?
• Mixing languages
• Data structures
• Platforms: clouds, grids, supercomputers, …
• Systems, scheduling, …
• I/O …
Starting from the Big Data side: first experiments, for some versions of Kirchhoff.
YARN containers: containers are clusters of nodes.
[Figure: YARN architecture: the ResourceManager, the Node Managers, and the Node Managers’ containers on the hardware nodes]
Distributed and parallel two-level programming, with DAGs.
Tez: a DAG of MapReduce tasks/jobs.
Some elements on YML
• The YML framework is dedicated to developing and running parallel and distributed applications on clusters, clusters of clusters, and supercomputers (schedulers and middleware would have to be optimized for more integrated computers; cf. “K” and OmniRPC, for example).
• Independent from systems and middleware
  – End users can reuse their code with another middleware
  – Currently the main system is OmniRPC
• Component approach
  – Components are defined in XML
  – Three types: Abstract, Implementation (in Fortran, C or C++, XMP, …), Graph (parallelism)
  – Reuse and optimization
• The parallelism is expressed through a graph description language named Yvette (after the river in Gif-sur-Yvette, where the ASCI lab was). LL(1) grammar, easy to parse.
• Deployed in France, Belgium, Ireland, Japan (T2K, K), China, Tunisia, and the USA (LBNL, TOTAL-Houston).
Graph (n dimensions) of components/tasks (YML)
[Figure legend: begin node, end node, graph node, dependence, generic component node]

par
  compute tache1(..); notify(e1);
//
  compute tache2(..); migrate matrix(..); notify(e2);
//
  wait(e1 and e2);
  par (i := 1; n) do
    par
      compute tache3(..); notify(e3(i));
    //
      if (i < n) then
        wait(e3(i+1)); compute tache4(..); notify(e4);
      endif;
    //
      compute tache5(..); control robot(..); notify(e5); visualize mesh(..);
    end par
  end do par
//
  wait(e3(2:n) and e4 and e5);
  compute tache6(..); compute tache7(..);
end par
Abstract component

<?xml version="1.0"?>
<component type="abstract" name="prodMat" description="Matrix-matrix product">
  <params>
    <param name="matrixBkk" type="Matrix" mode="in"/>
    <param name="matrixAki" type="Matrix" mode="inout"/>
    <param name="blocksize" type="integer" mode="in"/>
  </params>
</component>

Future:
<param name="conv" type="graph_param_float" mode="inout"/>
End users would be able to add some expertise here (libraries: ScaLAPACK, NAG, PETSc).
Implementation component

<?xml version="1.0"?>
<component type="impl" name="prodMat" abstract="prodMat" description="Implementation component of a matrix product">
  <impl lang="CXX">
    <header/>
    <source><![CDATA[
      int i, j, k;
      double **tempMat;
      // Allocation (tempMat allocated and zero-initialized)
      for (k = 0; k < blocksize; k++)
        for (i = 0; i < blocksize; i++)
          for (j = 0; j < blocksize; j++)
            tempMat[i][j] = tempMat[i][j] + matrixBkk.data[i][k] * matrixAki.data[k][j];
      for (i = 0; i < blocksize; i++)
        for (j = 0; j < blocksize; j++)
          matrixAki.data[i][j] = tempMat[i][j];
      // Deallocation
    ]]></source>
    <footer/>
  </impl>
</component>
End users may add some expertise here (we’ll see an example).
An implementation component may be developed using a parallel language (such as XMP):

<impl lang="XMP" nodes="CPU:(5,5)" libs="">
  <distribute>
    <param template="block,block" name="A(100,100)" align="[i][j]:(j,i)"/>
    <param template="block" name="Y(100);X(100)" align="[i]:(i,*)"/>
  </distribute>
Graph component of Block Gauss-Jordan Method
Tez and YML:
• share (more or less) the same DAG / graph-of-tasks vision
• each task, in both, may itself be parallel
• share the same structure, with a two-part scheduler: one DAG-based part and one responsible for resource allocation
• as Tez maintains a container pool, it can provide YML with a stable platform
It may not be perfect, but at least it can provide a prototype to get a better view of the inherent problems of scheduling.
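As a toy sketch of the DAG-based half of such a scheduler (task names and the compute/mapreduce tagging are invented), Kahn's algorithm releases a task once all of its dependencies have completed:

```python
from collections import deque

# Invented mixed DAG: task -> (kind, prerequisite tasks)
dag = {
    "index":    ("mapreduce", []),
    "pagerank": ("compute",   ["index"]),
    "stats":    ("mapreduce", ["index"]),
    "report":   ("compute",   ["pagerank", "stats"]),
}

def schedule(dag):
    """Kahn's algorithm: a task is released once all of its dependencies
    are done (the resource-allocation half of the scheduler is omitted)."""
    indeg = {t: len(deps) for t, (_, deps) in dag.items()}
    children = {t: [] for t in dag}
    for t, (_, deps) in dag.items():
        for d in deps:
            children[d].append(t)
    ready = deque(t for t, k in indeg.items() if k == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

print(schedule(dag))   # a valid topological order of the four tasks
```

The resource-allocation half would then map each released task onto the container pool, which is exactly what Tez could contribute to YML in the prototype described above.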
Graph (n dimensions) of components/tasks (YML)
[Figure legend: begin node, end node, graph node, dependence, generic component node]

par
  compute tache1(..); notify(e1);
//
  compute tache2(..); notify(e2);
//
  reduce tache8(..); notify(e8);
//
  wait(e1 and e2);
  par (i := 1; n) do
    par
      compute tache3(..); notify(e3(i));
    //
      if (i < n) then
        wait(e3(i+1)); compute tache4(..); notify(e4);
      endif;
    //
      wait(e8); map tache9(..);
    end par
  end do par
//
  wait(e3(2:n) and e4);
  compute tache6(..); compute tache7(..);
end par
par
  compute tache1(..); notify(e1);
//
  compute tache2(..); notify(e2);
//
  mapreduce tache8(..); notify(e8);
//
  wait(e1 and e2);
  par (i := 1; n) do
    par
      compute tache3(..); notify(e3(i));
    //
      if (i < n) then
        wait(e3(i+1)); compute tache4(..); notify(e4);
      endif;
    //
      wait(e8); mapreduce tache9(..);
    end par
  end do par
//
  wait(e3(2:n) and e4);
  compute tache6(..); compute tache7(..);
end par
This relies on information in the “implementation components”, such as “reduce”, “map”, or others.
It would be possible to avoid keywords such as “compute”, “mapreduce”, “migrate” or others and to specify that on the abstract components, but it would be less clear for the end user and/or programmer.
Then we may have graphs of tasks/components mixing HPC and Big Data computation.
The difficulties are to schedule these tasks/components efficiently, to minimize communications and energy consumption, and to achieve good resilience.
Conclusion
• Adapted high-level programming paradigms have to be developed, allowing schedulers to manage both the graphs of control tasks and the dataflow graphs, and allowing HPC and Big Data computation to be mixed.
• Multi-level programming is shared by HPC and Big Data (a graph of distributed “parallel” computations).
• The convergence of HPC and Big Data computation is an important challenge: the key point is to propose languages, programming paradigms and schedulers that can be adapted to all the different approaches of this convergence.
• Applications with huge data modified during computation, asking for operations such as map-reduce in parallel with classic HPC, will be more and more present in the important future exascale challenges.
• High-performance artificial intelligence (asking for HPC and huge amounts of data) would use such programming paradigms.