
Compiler, Languages, and Libraries


Page 1: Compiler, Languages, and Libraries

Compiler, Languages, and Libraries

ECE Dept., University of Tehran
Parallel Processing Course Seminar

Hadi Esmaeilzadeh
810181079

[email protected]

Page 2: Compiler, Languages, and Libraries

Introduction

Distributed systems are heterogeneous:
  Power
  Architecture
  Data representation

Data access latencies are significantly long and vary with the underlying network traffic

Network bandwidths are limited and can vary dramatically with the underlying load

Page 3: Compiler, Languages, and Libraries

Programming Support Systems: Principles

Principle: each component of the system should do what it does best

The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction

Page 4: Compiler, Languages, and Libraries

Programming Support Systems: Goals

They should make applications easy to develop
Build applications that are portable across different architectures and computing configurations
Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configuration
Exploit various forms of parallelism to balance load across a heterogeneous configuration
  Minimizing the computation time
  Matching the communication to the underlying bandwidths and latencies
  Ensuring that the performance variability remains within certain bounds

Page 5: Compiler, Languages, and Libraries

Autoparallelization

The user focuses on what is being computed rather than how

The performance penalty should be no worse than a factor of two

Automatic vectorization
  Dependence analysis (illustrated in the sketch below)

Asynchronous (MIMD) parallel processing
  Symmetric multiprocessors (SMP)
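
A minimal C sketch (illustrative, not from the slides) of what dependence analysis decides: the first loop has no loop-carried dependence and can be vectorized or parallelized automatically, while the second carries a dependence from one iteration to the next and resists vectorization.

```c
/* Illustrative C: the kind of decision dependence analysis makes
 * for an autoparallelizing or autovectorizing compiler. */
#include <stddef.h>

/* No loop-carried dependence: every iteration is independent, so the
 * compiler may vectorize this loop or run its iterations in parallel. */
void scale(size_t n, double *restrict a, const double *restrict b, double c) {
    for (size_t i = 0; i < n; i++)
        a[i] = c * b[i];
}

/* Loop-carried dependence: a[i] needs a[i-1] from the previous iteration,
 * which blocks straightforward vectorization of this loop. */
void prefix_sum(size_t n, double *a) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```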

Page 6: Compiler, Languages, and Libraries

Distributed Memory Architecture

Caches
Higher latency of large memories
Determine how to apportion data to the memories of processors in a way that
  Maximizes local memory access
  Minimizes communication
Regions of parallel execution had to be large enough to compensate for the overhead of initiation and synchronization
Interprocedural analysis and optimization
Mechanisms that involve the programmer in the design of the parallelization, as well as the problem solution, will be required

Page 7: Compiler, Languages, and Libraries

Explicit Communication

Message passing to get data from remote memories

A single version of the program runs on all processors

The computation is specialized to specific processors by extracting the processor number and indexing into the processor's own data, as sketched below
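
A hedged SPMD sketch in C with MPI (named on the following slide): a single program runs on all processors, each one extracts its own rank and works only on its own block of the global index range. The problem size and the reduction are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000   /* illustrative global problem size */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* extract the processor number */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each processor indexes only its own slice of the global domain. */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i;

    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}
```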

Page 8: Compiler, Languages, and Libraries

Send-Receive Model

A shared-memory environment
Each processor not only receives its needed data but also sends the data that other processors require

PVM
MPI
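
A minimal sketch of the send-receive model with MPI, assuming the processes form a ring: each processor must both send the value its neighbor requires and receive the value it needs, and MPI_Sendrecv pairs the two sides of the exchange.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* neighbor that needs my data */
    int left  = (rank - 1 + size) % size;   /* neighbor whose data I need  */

    double mine = (double)rank;   /* value this processor owns */
    double from_left;             /* value it must receive */

    /* Both sides of the exchange in one call, avoiding deadlock. */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %.0f from rank %d\n", rank, from_left, left);
    MPI_Finalize();
    return 0;
}
```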

Page 9: Compiler, Languages, and Libraries

Get-Put Model

The processor that needs data from a remote memory is able to explicitly get it without requiring any explicit action by the remote processor
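
One concrete realization of the get-put model is MPI's one-sided communication; a minimal sketch (assuming at least two processes) in which rank 0 gets a value directly from the memory rank 1 has exposed, with no matching send issued by rank 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local, fetched = -1.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = 100.0 + rank;   /* the one value each rank exposes */

    /* Every rank exposes `local` in a window that remote ranks may access. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)                       /* only the processor that needs the data acts */
        MPI_Get(&fetched, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* displacement */, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);               /* completes the get */

    if (rank == 0)
        printf("rank 0 fetched %.1f from rank 1\n", fetched);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```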

Page 10: Compiler, Languages, and Libraries

Discussion

The programmer is responsible for:
  Decomposition of the computation
    The power of the individual processors
  Load balancing
  Layout of the memory
  Management of latency
  Organization and optimization of communication

Explicit communication can be thought of as an assembly language for grids

Page 11: Compiler, Languages, and Libraries

Distributed Shared Memory

DSM as a vehicle for hiding complexities of memory and communication management

The address space appears to the programmer as flat as on a single-processor machine

The hardware/software is responsible for retrieving data from remote memories by generating the needed communication

Page 12: Compiler, Languages, and Libraries

Hardware Approach

Stanford DASH, HP/Convex Exemplar, SGI Origin

Local cache misses initiate data transfer from remote memory if needed

Page 13: Compiler, Languages, and Libraries

Software Scheme

Shared Virtual Memory, TreadMarks
Rely on the paging mechanism in the operating system
Transfer whole pages on demand between operating systems
This makes the granularity and latency significantly large
Used in conjunction with relaxed memory consistency models and support for latency hiding

Page 14: Compiler, Languages, and Libraries

Discussion

The programmer is free from handling thread packaging and parallel loops

Has performance penalties and is therefore useful mainly for coarser-grained parallelism

Works best with some help from the programmer on the layout of memory

Is a promising strategy for simplifying the programming model

Page 15: Compiler, Languages, and Libraries

Data-Parallel Languages

High performance on distributed memory:
  Allocate data to the various processor memories to maximize locality and minimize communication
For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary
Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)
These are the foundations for data-parallel languages
  Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and PC++
  High Performance Fortran (HPF) and High Performance C++ (HPC++)

Page 16: Compiler, Languages, and Libraries

HPF

Provides directives for data layout on top of Fortran 90 and Fortran 95
Directives have no effect on the meaning of the program
They advise the compiler on how to assign elements of the program's arrays and data structures to different processors
These specifications are relatively machine independent
The principal focus is the layout of arrays
Arrays are typically associated with the data domains of the underlying problem
The principal drawback: limited support for problems on irregular meshes
  Distribution via a run-time array
  Generalized block distribution (blocks may be of different sizes)
For heterogeneous machines: block sizes can be adapted to the powers of the target machines (generalized block distribution)
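
HPF's distribution directives themselves are Fortran, but the idea behind adapting block sizes to node powers can be sketched in a few lines of C: each node's block is made proportional to its relative power. The helper name and the power values below are hypothetical.

```c
#include <stdio.h>

/* Split n array elements over p nodes, with block sizes proportional
 * to each node's relative power (a generalized block distribution). */
void generalized_block(int n, int p, const double power[], int block[]) {
    double total = 0.0;
    int assigned = 0;
    for (int i = 0; i < p; i++)
        total += power[i];
    for (int i = 0; i < p; i++) {
        block[i] = (int)(n * power[i] / total);
        assigned += block[i];
    }
    block[p - 1] += n - assigned;   /* give any rounding leftover to the last node */
}

int main(void) {
    double power[3] = {1.0, 2.0, 4.0};   /* node 2 is four times as fast as node 0 */
    int block[3];
    generalized_block(700, 3, power, block);
    for (int i = 0; i < 3; i++)
        printf("node %d gets %d elements\n", i, block[i]);
    return 0;
}
```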

Page 17: Compiler, Languages, and Libraries

HPC++

Unsynchronized for-loops
Parallel template libraries, with parallel or distributed data structures as the basis

Page 18: Compiler, Languages, and Libraries

Task Parallelism

Different components of the same computation are executed in parallel

Different tasks can be allocated to different nodes of the grid

Object parallelism (different tasks may be components of objects of different classes)

Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library, as sketched below
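
A hedged sketch of task parallelism expressed through a communication library (MPI): different ranks execute different components of the same computation. The two task functions are hypothetical placeholders.

```c
#include <mpi.h>
#include <stdio.h>

static double simulate(void)  { return 1.0; }   /* placeholder for one task */
static double visualize(void) { return 2.0; }   /* placeholder for another task */

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Different components of the computation go to different grid nodes. */
    double result = (rank == 0) ? simulate() : visualize();
    printf("rank %d finished its task, result = %.1f\n", rank, result);

    MPI_Finalize();
    return 0;
}
```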

Page 19: Compiler, Languages, and Libraries

HPF 2.0 Extensions for Task Parallelism

Can be implemented on both shared- and distributed-memory systems

Providing a way for a set of cases to be run in parallel with no communication until synchronization at the end

Remaining problems in using HPF on a computational grid:
  Load matching
  Communication optimization

Page 20: Compiler, Languages, and Libraries

Coarse-Grained Software Integration

The complete application is not a simple program
It is a collection of programs that must all be run, passing data to one another
The main technical challenge of the integration is how to prevent performance degradation due to sequential processing of the various programs
Each program could be viewed as a task
Tasks are collected and matched to the power of the various nodes in the grid

Page 21: Compiler, Languages, and Libraries

Latency Tolerance

Dealing with long memory or communication latencies

Latency hiding: data communication is overlapped with computation (software prefetching), as sketched below

Latency reduction: programs are reorganized to reuse more data in local memories (loop blocking for cache)

More complex to implement on heterogeneous distributed computers
  Latencies are large and variable
  More time must be spent on estimating running times
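
A sketch of latency hiding with MPI, assumed to be called from an already-initialized program in which `left` and `right` hold neighbor ranks: the halo exchange is started with nonblocking calls, interior computation proceeds while the messages are in flight, and the code waits only when the remote boundary values are actually needed.

```c
#include <mpi.h>

#define N 1024   /* illustrative local array size */

void relax_step(double a[N], int left, int right) {
    double halo[2];
    MPI_Request reqs[4];

    /* Start the halo exchange ... */
    MPI_Irecv(&halo[0], 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&halo[1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&a[0],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&a[N - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* ... overlap it with computation on the interior points ... */
    for (int i = 1; i < N - 1; i++)
        a[i] *= 0.5;

    /* ... and wait only when the boundary values are needed. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    a[0]     = 0.5 * (halo[0] + a[1]);
    a[N - 1] = 0.5 * (halo[1] + a[N - 2]);
}
```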

Page 22: Compiler, Languages, and Libraries

Load Balancing

Spreading the calculation evenly across processors while minimizing communication

Simulated annealing, neural nets
Recursive bisection: at each stage, the work is divided into two equal parts (see the sketch below)
For the grid: the power of each node must be taken into account
  Performance prediction of the components is essential
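
An illustrative recursive bisection over a one-dimensional array of work weights (names and weights hypothetical): at each stage the items are split into two parts of roughly equal total weight, and each part receives half of the processors.

```c
#include <stdio.h>

/* Assign items in [lo, hi) to processors p0 .. p0+np-1 by recursive bisection. */
void bisect(const double w[], int lo, int hi, int p0, int np, int owner[]) {
    if (np == 1) {                       /* one processor left: it owns the range */
        for (int i = lo; i < hi; i++)
            owner[i] = p0;
        return;
    }
    double total = 0.0, half = 0.0;
    for (int i = lo; i < hi; i++)
        total += w[i];
    int cut = lo;
    while (cut < hi && half + w[cut] <= total / 2.0)   /* find the equal-weight split */
        half += w[cut++];
    bisect(w, lo, cut, p0, np / 2, owner);              /* left part, half the processors */
    bisect(w, cut, hi, p0 + np / 2, np - np / 2, owner);
}

int main(void) {
    double w[8] = {1, 1, 2, 2, 3, 3, 4, 4};   /* hypothetical work weights */
    int owner[8];
    bisect(w, 0, 8, 0, 4, owner);
    for (int i = 0; i < 8; i++)
        printf("item %d -> processor %d\n", i, owner[i]);
    return 0;
}
```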

Page 23: Compiler, Languages, and Libraries

Runtime Compilation

A problem with automatic load balancing (especially on irregular grids):
  Unknown loop upper bounds
  Unknown array sizes

Inspector/executor model
  Inspector: executed a single time at run time; establishes a plan for efficient execution
  Executor: executed on each iteration; carries out the plan defined by the inspector
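
A hedged C sketch of the inspector/executor pattern for an irregular gather y[i] = x[idx[i]], with hypothetical structures: the inspector examines the index array once at run time and records which references fall outside the local block; the executor reuses that plan on every iteration, with the actual communication step elided.

```c
#include <stdlib.h>

typedef struct {
    int *remote;    /* positions whose data lives on another processor */
    int  nremote;   /* how many such positions the inspector found */
} Plan;

/* Inspector: run once, after the index array is known at run time.
 * [lo, hi) is the block of x owned locally. */
Plan inspect(const int idx[], int n, int lo, int hi) {
    Plan p = { malloc(n * sizeof(int)), 0 };
    for (int i = 0; i < n; i++)
        if (idx[i] < lo || idx[i] >= hi)
            p.remote[p.nremote++] = i;
    /* A real inspector would also build the communication schedule here. */
    return p;
}

/* Executor: run on every iteration, reusing the inspector's plan. The
 * entries listed in plan->remote would be fetched first; here that
 * communication is elided and x is assumed to be complete. */
void execute(const Plan *plan, const int idx[], const double x[], double y[], int n) {
    (void)plan;   /* the plan would drive the (elided) communication */
    for (int i = 0; i < n; i++)
        y[i] = x[idx[i]];
}
```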

Page 24: Compiler, Languages, and Libraries

Libraries

Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)

Data structure library: a parallel data structure is maintained within the library and its representation is hidden from the user (DAGH)
  Well suited for OO languages
  Provides maximum flexibility for the library developer to manage runtime challenges
    Heterogeneous networks
    Adaptive gridding
    Variable latencies
Drawback: their components are currently treated by compilers as black boxes
  Some sort of collaboration between compiler and library might be possible, particularly in an interprocedural compilation

Page 25: Compiler, Languages, and Libraries

Programming Tools

Tools like Pablo, Gist, and Upshot can show where performance bottlenecks exist

Performance-tuning tools

Page 26: Compiler, Languages, and Libraries

Future Directions (Assumptions)

The user is responsible for both problem decomposition and assignment

Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power

Some portion of compilation will be invoked after this service

Page 27: Compiler, Languages, and Libraries

Task Compilation

Constructing a task graph, along with an estimate of the running time for each task
  TG construction and decomposition
  Performance estimation

Restructuring the program to better suit the target grid configuration

Assignment of the components of the TG to the available nodes, as sketched below
  Java
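
An illustrative greedy assignment of task-graph components to grid nodes (a plausible sketch, not the deck's method): each task is placed on the node that would finish it earliest, given the node's relative power and the work already assigned to it. All numbers are hypothetical.

```c
#include <stdio.h>

#define NTASKS 5
#define NNODES 3

int main(void) {
    double cost[NTASKS]  = {4, 2, 6, 1, 3};     /* estimated running time of each task */
    double power[NNODES] = {1.0, 2.0, 0.5};     /* relative power of each node */
    double busy[NNODES]  = {0};                 /* time each node is already committed to */

    for (int t = 0; t < NTASKS; t++) {
        int best = 0;
        double best_finish = 1e300;
        for (int n = 0; n < NNODES; n++) {
            double finish = busy[n] + cost[t] / power[n];
            if (finish < best_finish) { best_finish = finish; best = n; }
        }
        busy[best] = best_finish;
        printf("task %d -> node %d (finishes at %.1f)\n", t, best, best_finish);
    }
    return 0;
}
```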

Page 28: Compiler, Languages, and Libraries

Grid Shared Memory (Challenges)

Different nodes have different page sizes and paging mechanisms

Good performance estimation
Managing the system-level interactions that provide DSM

Page 29: Compiler, Languages, and Libraries

Global Grid Compilation

Providing a programming language and compilation strategy targeted to the grid

A mixture of parallelism styles: data parallelism and task parallelism
  Data decomposition
  Function decomposition