Evolution of Parallel Programming in HEP

F. Rademakers – CERN

International Workshop on Large Scale Computing VECC, Kolkata

Outline

- Why use parallel computing
- Parallel computing concepts
- Typical parallel problems
- Amdahl's law
- Parallelism in HEP
- Parallel data analysis in HEP
- PIAF
- PROOF
- Conclusions

Why Parallelism

Two primary reasons:
- Save time – wall clock time
- Solve larger problems

Other reasons:
- Taking advantage of non-local resources – the Grid
- Cost saving – using multiple "cheap" machines instead of paying for a supercomputer
- Overcoming memory constraints – single computers have finite memory resources; use many machines to create a very large memory

Limits to serial computing:
- Transmission speeds – the speed of a serial computer is directly dependent on how much data can move through the hardware; hard limits are the speed of light (30 cm/ns) and the transmission speed of copper wire (9 cm/ns)
- Limits to miniaturization
- Economic limitations

Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce commodity called time

Parallel Computing Concepts

Parallel hardware:
- A single computer with multiple (possibly multi-core) processors
- An arbitrary number of computers connected by a network (LAN/WAN)
- A combination of both

Parallelizable computational problems:
- Can be broken apart into discrete pieces of work that can be solved simultaneously
- Can execute multiple program instructions at any moment in time
- Can be solved in less time with multiple compute resources than with a single compute resource

Parallel Computing Concepts

There are different ways to classify parallel computers (Flynn's taxonomy):

- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

SISD

- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- Examples: most classical PC's, single-CPU workstations and mainframes

SIMD

- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network and a very large array of very small-capacity CPU's
- Best suited for specialized problems with a high degree of regularity, e.g. image processing
- Synchronous and deterministic execution
- Two varieties: processor arrays and vector pipelines
- Examples (some extinct):
  - Processor arrays: Connection Machine, MasPar MP-1, MP-2
  - Vector pipelines: CDC Cyber 205, IBM 9000, Cray C90, Fujitsu, NEC SX-2

MISD

- Multiple instruction, single data: several processing units operate independently on the same data stream
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - Multiple frequency filters operating on a single signal stream
  - Multiple cryptography algorithms attempting to crack a single coded message

MIMD

- Currently the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers – including multi-CPU and multi-core PC's

Relevant Terminology

Observed speedup:
- wall-clock time of serial execution / wall-clock time of parallel execution

Granularity:
- Coarse: relatively large amounts of computational work are done between communication events
- Fine: relatively small amounts of computational work are done between communication events

Parallel overhead:
- The amount of time required to coordinate parallel tasks, as opposed to doing useful work, typically:
  - Task start-up time
  - Synchronizations
  - Data communications
  - Software overhead imposed by parallel compilers, libraries, tools, OS, etc.
  - Task termination time

Scalability:
- Refers to a parallel system's ability to demonstrate a proportional increase in parallel speedup with the addition of more processors

Embarrassingly parallel:
- A problem that splits into many independent tasks requiring little or no coordination between them

Typical Parallel Problems

Traditionally, parallel computing has been considered to be "the high-end of computing":
- Weather and climate
- Chemical and nuclear reactions
- Biology, the human genome
- Geology, seismic activity
- Electronic circuits

Today commercial applications are the driving force:
- Parallel databases, data mining
- Oil exploration
- Web search engines
- Computer-aided diagnosis in medicine
- Advanced graphics and virtual reality

The future: trends over the past 10 years – ever faster networks, distributed systems, and multi-processor (now multi-core) computer architectures – suggest that parallelism is the future

Amdahl's Law

Amdahl's law states that the potential speedup is defined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory). If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast.

Introducing the number of processors performing the parallel fraction of work, the relationship can be written as:

    speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P).
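
A quick worked example (the numbers are chosen here for illustration, they are not from the talk) shows how the serial fraction dominates as N grows:

    P = 0.9, N = 16:  speedup = 1 / (0.1 + 0.9/16) = 6.4
    N → ∞:            speedup → 1 / (1 - P) = 10

Even with unlimited processors, a 10% serial fraction caps the speedup at a factor 10.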

Parallelism in HEP

Main areas of processing in HEP:

DAQ:
- Typically highly parallel
- Processes in parallel a large number of detector modules or sub-detectors

Simulation:
- No need for fine-grained track-level parallelism; a single event is not the end product
- Some attempts were made to introduce track-level parallelism in G3
- Typically job-level parallelism, resulting in a large number of files

Reconstruction:
- Same as for simulation

Analysis:
- Run over many events in parallel to quickly get the final analysis results
- Embarrassingly parallel, event-level parallelism (a minimal sketch follows below)
- Preferably interactive, for better control of and feedback on the analysis
- Main challenge: efficient data access
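
As referenced above, event-level parallelism is simple enough to sketch in a few lines of C++. This is a minimal illustration of the pattern, not HEP framework code (the events and the two-bin histogram are stand-ins): each worker thread fills a private histogram over its share of the events, and the partial histograms are merged at the end – exactly the split/merge shape that PIAF and PROOF automate.

    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Stand-in for a per-event analysis: classify a value into one of two bins.
    static int binOf(double x) { return x < 0.5 ? 0 : 1; }

    int main() {
        // Fake "events": in a real analysis these would come from a tree/file.
        const std::size_t nEvents = 1000000;
        std::vector<double> events(nEvents);
        for (std::size_t i = 0; i < nEvents; ++i) events[i] = (i % 100) / 100.0;

        const unsigned nWorkers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::vector<long>> partial(nWorkers, std::vector<long>(2, 0));
        std::vector<std::thread> workers;

        // Each worker fills its own private histogram: events are independent,
        // so no locking or communication is needed (embarrassingly parallel).
        for (unsigned w = 0; w < nWorkers; ++w)
            workers.emplace_back([&, w] {
                for (std::size_t i = w; i < nEvents; i += nWorkers)
                    ++partial[w][binOf(events[i])];
            });
        for (auto &t : workers) t.join();

        // Merge step: sum the per-worker histograms into the final result.
        long histo[2] = {0, 0};
        for (const auto &p : partial) { histo[0] += p[0]; histo[1] += p[1]; }
        std::printf("bin0 = %ld, bin1 = %ld\n", histo[0], histo[1]);
        return 0;
    }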

Page 15: Evolution of Parallel Programming in HEP F. Rademakers – CERN International Workshop on Large Scale Computing VECC, Kolkata

IWLSC, 9 Feb 2006 15 Fons Rademakers

Parallel Data Analysis in HEP

Most parallel data analysis systems designed in the past and present are based on job-splitting scripts and batch queues:
- When the queue is full there is no parallelism
- Parallelism is explicit
- Turnaround time is dictated by the batch system scheduler and resource availability

Remarkably few attempts at truly interactive, implicitly parallel systems:
- PIAF
- PROOF

Classical Parallel Data Analysis

(Diagram: storage and a batch farm with queues and a manager; the user looks up files in the catalog, splits the data files, submits one myAna.C job per file, collects the output files, and merges them into the final analysis.)

- "Static" use of resources: jobs frozen, 1 job / CPU
- "Manual" splitting and merging
- Limited monitoring (end of single job)

Interactive Parallel Data Analysis

(Diagram: a catalog, storage and an interactive farm with a scheduler; the client sends a query – a data file list plus myAna.C – to the MASTER, which accesses the files and returns merged feedback and merged final outputs.)

- Farm perceived as an extension of the local PC
- More dynamic use of resources
- Automated splitting and merging
- Real-time feedback

PIAF

- The Parallel Interactive Analysis Facility
- First attempt at an interactive parallel analysis system
- Extension of, and based on, the PAW system
- Joint project between CERN/IT and Hewlett-Packard
- Development started in 1992
- Small production service opened for LEP users in 1993
- Up to 30 concurrent users
- The CERN PIAF cluster consisted of 8 HP PA-RISC machines: FDDI interconnect, 512 MB RAM, a few hundred GB of disk
- First observation of hyper-speedup using column-wise n-tuples

PIAF Architecture

Two-tier push architecture: Client → Master → Workers. The master divides the total number of events by the number of workers and assigns each worker 1/n of the events to process (see the sketch below).

Pros:
- Transparent

Cons:
- The slowest node determined the time of completion
- Not adaptable to varying node loads
- No optimized data access strategies
- Required a homogeneous cluster
- Not scalable
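
A minimal sketch of that static 1/n partitioning (illustrative names, not PIAF code): the master computes every worker's slice up front, which is exactly why the slowest worker sets the completion time.

    #include <cstdio>

    // Contiguous slice of the event range assigned to one worker.
    struct Slice { long first, count; };

    // Split [0, nEvents) into nWorkers slices; the remainder events go to
    // the first few workers so every event is assigned exactly once.
    Slice sliceFor(long nEvents, int nWorkers, int worker) {
        const long base  = nEvents / nWorkers;
        const long extra = nEvents % nWorkers;
        const long first = worker * base + (worker < extra ? worker : extra);
        const long count = base + (worker < extra ? 1 : 0);
        return Slice{first, count};
    }

    int main() {
        const long nEvents  = 1000003;
        const int  nWorkers = 8;
        // A push master would now send each worker its fixed share, once.
        for (int w = 0; w < nWorkers; ++w) {
            const Slice s = sliceFor(nEvents, nWorkers, w);
            std::printf("worker %d: events [%ld, %ld)\n", w, s.first, s.first + s.count);
        }
        return 0;
    }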

PIAF Push Architecture

(Diagram: the client issues Process("ana.C"); the master initializes Slave 1 … Slave N and pushes each a fixed 1/n share of the events via SendEvents(); after processing, each slave returns its partial result via SendObject(histo); the master adds the histograms and displays them.)

PROOF

- The Parallel ROOT Facility
- Second-generation interactive parallel analysis system
- Extension of, and based on, the ROOT system
- Joint project between ROOT, LCG, ALICE and MIT
- Proof of concept in 1997
- Development picked up in 2002
- PROOF in production in Phobos/BNL (with up to 150 CPU's) since 2003
- Second wave of developments started in 2005, following interest by the LHC experiments

PROOF Original Design Goals

- Interactive parallel analysis on a heterogeneous cluster
- Transparency: same selectors, same chain Draw(), etc. on PROOF as in a local session
- Scalability:
  - Good and well understood (1000 nodes most extreme case)
  - Extensive monitoring capabilities
  - MLM (Multi-Level Master) improves scalability on wide-area clusters
- Adaptability:
  - Partly achieved; the system handles varying load on cluster nodes
  - MLM allows much better latencies on wide-area clusters
  - No support yet for the coming and going of worker nodes

PROOF Multi-Tier Architecture

(Diagram: client, master, sub-masters and workers spread over physically separated domains; annotations mark where a good connection is VERY important and where it is less important; the layout adapts to clusters of clusters or wide-area virtual clusters, and optimizes for data locality or efficient data server access.)

PROOF Pull Architecture

(Diagram: the client issues Process("ana.C"); the master runs a packet generator; after initialization each slave, Slave 1 … Slave N, repeatedly calls GetNextPacket() and processes the event range it receives – e.g. (0,100), (100,100), (200,100), (300,40), (340,100), (440,50), (490,100), (590,60) – so faster slaves simply request more packets; finally each slave returns its result via SendObject(histo) and the master adds and displays the histograms.)
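
The pull model behind GetNextPacket() can be sketched as follows – a self-contained illustration with threads standing in for slaves and illustrative names, not PROOF internals: workers keep asking the master's packet generator for the next slice until the event range is exhausted, so load balancing happens automatically.

    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Stand-in for the master's packet generator: hands out the next packet
    // of events on request. A real generator would size packets according to
    // slave performance and data locality; a fixed size keeps the sketch short.
    class PacketGenerator {
    public:
        PacketGenerator(long nEvents, long packetSize)
            : fNext(0), fTotal(nEvents), fSize(packetSize) {}

        // Returns false when no work is left; otherwise yields [first, first+count).
        bool GetNextPacket(long &first, long &count) {
            const long f = fNext.fetch_add(fSize);
            if (f >= fTotal) return false;
            first = f;
            count = std::min(fSize, fTotal - f);
            return true;
        }

    private:
        std::atomic<long> fNext;   // next unassigned event
        const long fTotal, fSize;
    };

    int main() {
        PacketGenerator gen(1000, 100);
        std::vector<std::thread> slaves;
        for (unsigned s = 0; s < 4; ++s)
            slaves.emplace_back([&gen, s] {
                long first, count;
                while (gen.GetNextPacket(first, count))  // pull until exhausted
                    std::printf("slave %u: events [%ld, %ld)\n", s, first, first + count);
            });
        for (auto &t : slaves) t.join();
        return 0;
    }

A fast slave simply loops back for another packet while a slow one is still working, which is how the pull architecture removes the push model's dependence on the slowest node.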

PROOF New Features

- Support for "interactive batch" mode:
  - Allow submission of long-running queries
  - Allow client/master disconnect and reconnect
- Powerful, friendly and complete GUI
- Work in Grid environments:
  - Startup of agents via the Grid job scheduler
  - Agents calling out to the master (firewalls, NAT)
  - Dynamic master-worker setup

Interactive/Batch Queries

(Diagram: three entry points – a GUI (stateful), commands and scripts (stateful or stateless), and batch (stateless) – feeding three classes of work.)

- Interactive analysis using local resources, e.g. end-analysis calculations and visualization
- Medium-term jobs, e.g. analysis design and development, also using non-local resources
- Analysis jobs with well-defined algorithms (e.g. production of personal trees)

Goal: bring these to the same level of perception

Analysis Session Example

Monday at 10h15 – ROOT session on my desktop:
- AQ1: 1 s query produces a local histogram
- AQ2: a 10 min query submitted to PROOF1
- AQ3 → AQ7: short queries
- AQ8: a 10 h query submitted to PROOF2

Monday at 16h25 – ROOT session on my laptop:
- BQ1: browse results of AQ2
- BQ2: browse temporary results of AQ8
- BQ3 → BQ6: submit four 10 min queries to PROOF1

Wednesday at 8h40 – ROOT session on my laptop in Kolkata:
- CQ1: browse results of AQ8, BQ3 → BQ6

New PROOF GUI

(Four slides of screenshots of the new PROOF GUI.)

TGrid – Abstract Grid Interface

class TGrid : public TObject {
public:
   virtual Int_t        AddFile(const char *lfn, const char *pfn) = 0;
   virtual Int_t        DeleteFile(const char *lfn) = 0;
   virtual TGridResult *GetPhysicalFileNames(const char *lfn) = 0;
   virtual Int_t        AddAttribute(const char *lfn, const char *attrname,
                                     const char *attrval) = 0;
   virtual Int_t        DeleteAttribute(const char *lfn, const char *attrname) = 0;
   virtual TGridResult *GetAttributes(const char *lfn) = 0;
   virtual void         Close(Option_t *option = "") = 0;
   virtual TGridResult *Query(const char *query) = 0;

   static TGrid *Connect(const char *grid, const char *uid = 0,
                         const char *pw = 0);

   ClassDef(TGrid,0)   // ABC defining interface to GRID services
};

PROOF on the Grid

(Diagram: a user session connects to a PROOF master server, which drives PROOF slave servers at several sites through PROOF sub-masters; guaranteed site access through PROOF sub-masters calling out to the master (agent technology). Grid service interfaces involved: Grid/ROOT authentication, the Grid access control service, the TGrid UI/queue UI, proofd startup and the Grid file/metadata catalogue. The client retrieves the list of logical files (LFN + MSN); slave servers access data via xrootd from local disk pools.)

Running PROOF

// Connect to a Grid middleware (here AliEn)
TGrid *alien = TGrid::Connect("alien");

// Query the Grid file catalogue for the input files
TGridResult *res = alien->Query("lfn:///alice/simulation/2001-04/V0.6*.root");

// Build a chain of "AOD" trees from the query result
TChain *chain = new TChain("AOD");
chain->Add(res);

// Open the PROOF session and process the chain in parallel
gROOT->Proof("master");
chain->Process("myselector.C");

// plot/save objects produced in myselector.C
. . .

Conclusions

- Amdahl's law shows that making really scalable parallel applications is very hard
- Parallelism in HEP off-line computing is still lagging
- To solve the LHC data analysis problems, parallelism is the only solution
- To make good use of the current and future generations of multi-core CPU's, parallel applications are required