
Dec. 19, 2005 HPC Productivity 1

Productivity in High Performance Computing

Overview

•Perspective

•Basic Principles

•Historical and Emerging HPC

•HPC Development Paradigm – Requirements

•HPC Development Paradigm – Concepts

•HPC Development Environment – An Example

•Connection to Other Research

•Research Issues

Dec. 19, 2005 HPC Productivity 2

Perspective - Personal

• 49 years of programming

• 48 years of “HPC” programming

• 25 years of parallel/distributed/grid programming

• Software tools and applications

Dec. 19, 2005 HPC Productivity 3

Perspective – Past Research

Transition from serial to vector to parallel to distributed architectures

1. Transition to vector processors – the promise and the reality

2. Programming systems for parallel architectures: 1980-1995
   Shared memory and distributed memory
   Adaptations/extensions of serial languages

3. Programming systems for distributed architectures: 1995-2005
   Grid programming systems

Dec. 19, 2005 HPC Productivity 4

Productivity

• “Cost of goal attainment”

• Cost = Σ (resources) – people and physical

• Goals (examples):

– Initial use of system

– Completion of problem instance

– N years of use

Dec. 19, 2005 HPC Productivity 5

• Overview

• Perspective

• Basic Principles

• Historical and Emerging HPC

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 6

Productivity Principles

Productivity Principle #1

“Our ability to reason is constrained by the language in which we reason.”

Therefore programming systems should facilitate reasoning about the issues of concern.

HPC has a plethora of different concerns.

Challenge – Bring all these concerns into a unified context.

Dec. 19, 2005 HPC Productivity 7

Productivity Principles - Continuation

Productivity Principle #2

“Automation of program composition”

The components from which programs are composed must support automated composition.

Components must be meaningful in the context of an application.

Challenge – A representation which enables automated composition of programs.

Dec. 19, 2005 HPC Productivity 8

Productivity Principles - Continuation

Productivity Principle #3

“Design, implementation and adaptation should be a unified evolutionary process.”

Design evaluation and system execution should be a unified process.

Challenge – An executable representation spanning multiple levels of abstraction.

Challenge – Unification of design evaluation and system execution.

Dec. 19, 2005 HPC Productivity 9

• Overview

• Perspective

• Basic Principles

• Historical and Emerging HPC

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 10

Historical HPC

• Users – Small cadre of dedicated professional users combining discipline expertise with programming skills.

• Applications – Narrow family of applications, large PDE system solvers or signal analysis, static structure, visualization based analysis.

• Platforms – Specialized vector/parallel “supercomputer” systems – Closed set of resources -Stable over periods of hours or days.

• Algorithms – Static algorithms but multi-domain physical systems

• Goal – Solve largest possible problems within resource constraints.

Dec. 19, 2005 HPC Productivity 11

Conventional Practice in Application Family Development

– Comprehensive package of functional modules

– Common data structures.

– Many paths through system structure

– Users choose parameters to select execution paths

– Program is coded before performance is evaluated

Dec. 19, 2005 HPC Productivity 12

Why Current Practice Needs Improvement

• Optimization and adaptation of parallel programs is effort intensive

– Different execution environments

– Different problem instances

• Direct modification of complete application is effort intensive

• Maintenance and evolution of parallel programs is a complex task

• Code structure is often sub-optimal for a given case and/or execution environment

Dec. 19, 2005 HPC Productivity 13

Status of Conventional HPC

• Islands of excellence – application families in well-characterized domains and users of libraries for communication and interaction management.

• Productivity (by some metrics) little changed for two decades

• Complexity of the algorithms used and application system complexity have grown dramatically.

Dec. 19, 2005 HPC Productivity 14

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 15

Emerging HPC Platforms

1. Broadly available commodity clusters of multi-core processors (lack of standard configurations)

2. Enormous specialized cluster architectures, e.g., Blue Gene

3. Grids – heterogeneous, unreliable and constantly changing platforms

Each has different properties, but really large clusters and grids are beginning to have similar characteristics.

Dec. 19, 2005 HPC Productivity 16

Emerging Application Characteristics

• Multiple domains

• Complex adaptive algorithms

• Complex, possibly dynamic coordination/interaction structures

• Data intensive as well as computation intensive

• Interfaced to online data sources

• Integration of automated content analysis

• Require management of uncertainty

Dec. 19, 2005 HPC Productivity 17

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Barriers and Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 18

Status of Productivity for Mainstream Systems

Application/platform characteristics
  Serial with mostly straight-line interactions
  Standard platforms

Productivity varies dramatically with domain
  Commonly used, well-supported domains (GUIs, RDBs, etc.) – factors of 10 over a decade or so
  Specialized application domains – nearly unchanged since the 1970s

Application systems span multiple domains

Dec. 19, 2005 HPC Productivity 19

Current Mainstream Programming Systems

(Why C/C++/Fortran are not suitable for HPC.)

•Assume serial execution

•Parallelism is deviation from normal behavior

•Representation of parallelism is ad hoc

•Locality is only implicitly addressed

•Don’t support automated composition

•Minimal coordination and interaction semantics

•Extension mechanisms have complex semantics

•Design is not really addressed and performance is not considered

Dec. 19, 2005 HPC Productivity 20

Productivity for Mainstream Systems

Basis for Productivity Improvements

Broadly applicable domain analyses
Libraries implementing the domain analyses
Compositional tools (language specific)
Cheap, resource-rich, uniform platforms – fast turnaround
Abstraction – use of specification-level languages
Automation – code generators from specifications
Design and validation/verification methods and tools

Dec. 19, 2005 HPC Productivity 21

Productivity Research in Mainstream Systems

Component-oriented development

Software architectures

Specification languages and code generators

Aspects/Features

Dec. 19, 2005 HPC Productivity 22

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Barriers and Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 23

Barriers to Productivity for HPC

(Things we can’t do anything about.)

Obvious Barriers

Market size

Few HPC-specialized tools

Heterogeneous, sparsely available platforms

Cultural Barriers

Parochialism and ignorance by all parties

Out-of-date education programs

Us versus them

Code first culture

Dec. 19, 2005 HPC Productivity 24

Barriers to Productivity in HPC/HPPS

(Things we can’t do anything about.)

HPC – CS Disconnect

Scalable parallelism
Micro- and macro-locality
Increasing complexity of applications
Multiple application domains
Adaptive algorithms
Increasing complexity/diversity of execution platforms
Multi-level locality – cache to network scales
Multi-scale parallelism

Dec. 19, 2005 HPC Productivity 25

Barriers to Productivity in HPC

(Things we can do something about.)

Current programming systems are a lousy basis for reasoning about HPC.

Current programming systems don’t support automated composition of systems from components.

Absence of HPC-specific design and development methods, processes and tools.

Available programming systems don’t address HPC requirements and concerns.

Dec. 19, 2005 HPC Productivity 26

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 27

Capabilities for Productivity in HPC

Automation of composition of programs

Self-describing components: components which make visible sufficient semantic information about the services they provide, the services they require, and their properties and behaviors to enable a compiler to select a component on the basis of its services, properties and behaviors.

Dec. 19, 2005 HPC Productivity 28

Capabilities for Productivity in HPC

Design and development methods, processes and tools which address HPC issues, e.g., performance

Design methods which incorporate design-to-performance and evaluation of performance at design time, including the impacts of execution environments and problem instances

Tools for verification and validation, including assessing performance at component and total-system levels

Dec. 19, 2005 HPC Productivity 29

Capabilities for Productivity in HPC

Unification of design-time, compile-time and runtime composition (adaptation)

  Unification of composition among abstract and concrete components – design-time evaluation
  Unification of compile-time and runtime composition
  Support for measuring and monitoring of execution behavior
  Support for intelligent analysis of execution behavior
  Support for component/algorithm replacement

Dec. 19, 2005 HPC Productivity 30

Capabilities for Productivity in HPC

Specification of dynamic, complex coordination and interactions among components

• Make coordination/interaction a first-class concept in the programming system.
• Allow interactions to depend on the state of a component.

Dec. 19, 2005 HPC Productivity 31

Capabilities for Productivity in HPC

Uncertainty management, adaptivity and fault-tolerance

Explicit representation of component state

Language support for measurement and monitoring

Language support for state analysis

Runtime support for runtime component replacement

Dec. 19, 2005 HPC Productivity 32

Programming systems which address HPC issues

Language extensibility

Support for customization, including syntax extensions and execution environment specifications – an annotation language? (Anyone have ideas on this?)

Dec. 19, 2005 HPC Productivity 33

Programming systems which address HPC issues

Explicit representation of hierarchical locality

Configurations of data, processes and threads should be explicitly specifiable to virtual machines.

Mapping of abstract machines to realized machines should be represented.

(I have not thought through this one.)

Dec. 19, 2005 HPC Productivity 34

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 35

Demonstration Implementation of Concepts

Problem Domain

Development of families of applications which are to be run on (possibly multiple) large-scale, dynamic parallel and distributed execution environments. A family of applications is a set of programs for solution of a set of related computational problems. Each instance should be efficient for a specific case on a specific execution environment. It is assumed that the programs may utilize adaptive algorithms.

Dec. 19, 2005 HPC Productivity 36

Assumptions

The functionality from which many instances of an application family can be composed can be implemented as a reasonable set of well-specified components.

A parameterized coordination structure (a dependence graph in terms of components) for the program family is known at design time.

Dec. 19, 2005 HPC Productivity 37

Goals

Order of magnitude productivity enhancement for application families

– Develop parallel programs from sequential components

– Reuse components

– Enable development of program families from multiple versions of components

– Automatic composition of parallel programs from components

– Enable design time evaluation of performance

– Incorporation of adaptation and uncertainty management into the programming system.

Dec. 19, 2005 HPC Productivity 38

Conceptual Elements for Enhancing Productivity

• Self-describing components
• Coordination/interaction/composition interface specification language
• Programming model
• Automated composition of parallel/distributed programs from components
• Framework for unification of different semantic domains
• Unification of compile-time and runtime composition, enabling runtime adaptation at the component level
• Unification of abstract (simulated) and concrete execution (for performance modeling)

Dec. 19, 2005 HPC Productivity 39

Demonstration Implementation – P-COM2

Description of the compositional compiler – LCPC 2003
Case study on adaptation – ICCS 2005
Case study on evolutionary development – WOSP 2005
Case study of the benefits of componentization of the Sweep3D benchmark – Compframe 2005 (submitted to Concurrency and Computation)
Role-based programming model – Proc. Workshop on Roles
http://www.cs.utexas.edu/users/pcom

Dec. 19, 2005 HPC Productivity 40

Self-Describing Components

Functionality + composition/coordination/abstraction interface

[Diagram: a sequential computation (abstract or concrete) wrapped by a provides interface (profile, state machine, protocol) and a requires interface (selector, transaction, protocol).]

Functionality: computation, measurement/monitoring, analysis.

State machines capture enabling conditions and preconditions/postconditions.

“Component” is recursive.

Dec. 19, 2005 HPC Productivity 41

2D FFT Example

• Steps for 2D FFT computation (sketched in code below)

– Partition given matrix row-wise

– Apply 1D FFT to each row of the partition

– Combine the partitions and transpose the matrix

– Partition transposed matrix row-wise

– Apply 1D FFT to each row of the partition

– Combine the partitions and transpose the matrix

– Transposed matrix is the 2D FFT of the original matrix
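
A minimal serial C++ sketch of these steps, not the P-COM2 components themselves: fft_1d, transpose and fft_2d are illustrative names, the row length is assumed to be a power of two, and the row FFTs that Distribute would hand to p FFT_Row instances here run in one process.

#include <cmath>
#include <complex>
#include <vector>

using cd  = std::complex<double>;
using Mat = std::vector<std::vector<cd>>;

// Recursive radix-2 Cooley-Tukey FFT of one row (length must be a power of two).
void fft_1d(std::vector<cd>& a) {
    const std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    fft_1d(even);
    fft_1d(odd);
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        cd t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}

Mat transpose(const Mat& m) {
    Mat t(m[0].size(), std::vector<cd>(m.size()));
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            t[j][i] = m[i][j];
    return t;
}

// 2D FFT = (1D FFT of every row, then transpose), applied twice.
Mat fft_2d(Mat grid) {
    for (auto& row : grid) fft_1d(row);  // 1D FFT of each row of the partition(s)
    grid = transpose(grid);              // combine the partitions and transpose
    for (auto& row : grid) fft_1d(row);  // 1D FFT of each row (the former columns)
    return transpose(grid);              // transpose back: the 2D FFT of the input
}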

Dec. 19, 2005 HPC Productivity 42

2D FFT Example

Dec. 19, 2005 HPC Productivity 43

2D FFT Example (Cont’d)

Requires interface of Initialize:

selector:
  string domain == "matrix";
  string function == "distribute";
  string element_type == "complex";
  bool distribute_by_row == true;
transaction:
  int distribute(out mat2 grid_re, out mat2 grid_im, out int n, out int m, out int p);
protocol: dataflow;

Provides interface of Distribute:

profile:
  string domain = "matrix";
  string function = "distribute";
  string element_type = "complex";
  bool distribute_by_row = true;
transaction:
  int distribute(in mat2 grid_re, in mat2 grid_im, in int n, in int m, in int p);
protocol: dataflow;

Dec. 19, 2005 HPC Productivity 44

2D FFT Example (Cont’d)

Requires interface (partial) of Distribute:

{ selector:
    string domain == "fft";
    string input == "matrix";
    string element_type == "complex";
    string algorithm == "Cooley-Tukey";
    bool apply_per_row == true;
  transaction:
    int fft_row(out mat2 out_grid_re[], out mat2 out_grid_im[], out int n/p, out int m);
  protocol: dataflow;
} index [ p ]

Provides interface of FFT_Row:

profile:
  string domain = "fft";
  string input = "matrix";
  string element_type = "complex";
  string algorithm = "Cooley-Tukey";
  bool apply_per_row = true;
  type = "concrete";
transaction:
  int fft_row(in mat2 grid_re, in mat2 grid_im, in int n, in int m);
protocol: dataflow;

Dec. 19, 2005 HPC Productivity 45

2D FFT Example (Cont’d)

Requires interface of FFT_Row:

selector:
  string domain == "matrix";
  string function == "gather";
  string element_type == "complex";
  bool combine_by_row == true;
  bool transpose == true;
transaction:
  int gather_transpose(out mat2 out_grid_re, out mat2 out_grid_im, out int me);
protocol: dataflow;

Provides interface of Gather_Transpose:

profile:
  string domain = "matrix";
  string function = "gather";
  string element_type = "complex";
  bool combine_by_row = true;
  bool transpose = true;
transaction:
  int get_no_of_p(in int n, in int m, in int p, in int state);
  int gather_transpose(in mat2 grid_re, in mat2 grid_im, in int inst);
protocol: dataflow;

Dec. 19, 2005 HPC Productivity 46

2D FFT Example (Cont’d)

Requires interface (partial) of Gather_Transpose:

selector:
  string domain == "matrix";
  string function == "distribute";
  string element_type == "complex";
  bool distribute_by_row == true;
transaction:
  %{ exec_no == 1 && gathered == p }%
  int distribute(out mat2 out_grid_re, out mat2 out_grid_im, out int m, out int n*p, out int p);
protocol: dataflow;

Dec. 19, 2005 HPC Productivity 47

Capabilities Based on Self-Describing Components

• Compiler implementing recursive associative composition of components
• Compiler generation of parallelism at the component level
• Runtime adaptation combining monitoring, analysis and composition
• Unified concrete and abstract execution (design-time performance evaluation)
• Framework for unification of concerns

Dec. 19, 2005 HPC Productivity 48

Automated Composition Process

• Matching of requires and provides interfaces (sketched below)
• Matching starts from the selector of the start component
• Applied recursively to each matched component
• Output is a generalized dynamic data flow graph as defined in CODE (Newton ’92)
• Data flow graph is compiled to a parallel program for a specific architecture
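
A minimal sketch of the matching step, with profile and selector reduced to attribute/value maps. The names (Component, Edge, matches, compose), the take-the-first-match rule, and the absence of cycle handling and state predicates are simplifications for illustration, not P-COM2's actual algorithm.

#include <map>
#include <string>
#include <vector>

// Simplified self-describing component: one provides profile, several requires selectors.
struct Component {
    std::string name;
    std::map<std::string, std::string> profile;                 // provides interface
    std::vector<std::map<std::string, std::string>> selectors;  // requires interfaces
};

struct Edge { const Component* from; const Component* to; };    // dataflow dependence

// A selector matches a profile when every attribute it constrains has the same value.
static bool matches(const std::map<std::string, std::string>& selector,
                    const std::map<std::string, std::string>& profile) {
    for (const auto& [attr, value] : selector) {
        auto it = profile.find(attr);
        if (it == profile.end() || it->second != value) return false;
    }
    return true;
}

// Starting from the start component, resolve each requires interface against the
// component library and recurse on the matched component; the accumulated edges
// approximate the generalized dataflow graph that the compiler emits.
void compose(const Component& current, const std::vector<Component>& library,
             std::vector<Edge>& graph) {
    for (const auto& selector : current.selectors) {
        for (const auto& candidate : library) {
            if (matches(selector, candidate.profile)) {
                graph.push_back({&current, &candidate});
                compose(candidate, library, graph);  // applied recursively
                break;                               // first match only, for simplicity
            }
        }
    }
}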

Dec. 19, 2005 HPC Productivity 49

Language Framework Concept

“Our ability to reason is constrained by the language in which we reason.”

Separation of concerns

Framework for unification of multiple representations

Dec. 19, 2005 HPC Productivity 50

Language Framework Concept – Multiple Representations

Concern                                                   Representation
Computation                                               C/C++/Fortran
Coordination/interaction, composition and abstraction     P-COM2 coordination/interaction specification language
Locality mapping                                          Specification language
Measurement and monitoring                                API
Analysis and fault-tolerance                              Rule-based programming

Dec. 19, 2005 HPC Productivity 51

Framework Concept – Multiple Tools

Composers, Weavers, Analyzers, Execution Engines

Composer – automates composition to meet specified system properties.
Weaver – source-to-source merging of different layers, if necessary.
Analyzer – static analysis, abstract/interpretive models of code, model checkers.
Execution Engine – debuggers, simulated execution, direct execution, adaptive control.

Dec. 19, 2005 HPC Productivity 52

Unification of Compile Time/Run Time Composition

Provides and requires interfaces can be modified at runtime.

Requires/provides matching is implemented in the runtime system.

Monitoring and adaptation components are included in the composition.

When preconditions/postconditions for a component are not met, a requires interface of a predecessor component is modified to require a different component.

The component is replaced using the OS dynamic loader (see the sketch below).
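
A minimal sketch of run-time replacement through the OS dynamic loader, using the POSIX dlopen/dlsym/dlerror calls. The transaction_fn signature, load_replacement helper, library name and symbol name are placeholders for illustration, not code generated by P-COM2.

#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Signature assumed for a component's transaction entry point; the generated code
// would use the signature declared in the component's transaction clause.
using transaction_fn = int (*)(void* in_args, void* out_args);

// Load a replacement implementation of a component at run time.
transaction_fn load_replacement(const std::string& library, const std::string& symbol) {
    void* handle = dlopen(library.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!handle) throw std::runtime_error(dlerror());
    dlerror();  // clear any stale error state before dlsym
    void* sym = dlsym(handle, symbol.c_str());
    if (const char* err = dlerror()) throw std::runtime_error(err);
    return reinterpret_cast<transaction_fn>(sym);
}

// Example use: when monitoring reports a failed postcondition, rebind the
// predecessor's requires interface to a different implementation, e.g.
//   transaction_fn fft_row = load_replacement("libfft_row_blocked.so", "fft_row");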

Dec. 19, 2005 HPC Productivity 53

Component-Oriented Evolutionary Development

Do domain analysis (ontology) – define components, attributes and the coordination/interaction structure.

Create execution-environment-parameterized performance models for implementations of components, with complete implementation of coordination/interaction behavior.

Compose program instances for target execution environments and execute them via the unified execution engine.

Performance evaluation – if all components are performance models, then you have evaluated a performance model.

Evolution to concrete – replace abstract components by concrete components. Model and concrete components can be included in a single composition (as sketched below).
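
A minimal sketch of that abstract-to-concrete evolution: a performance-model version and a concrete version of a component stand behind the same interface, so either can be selected at composition time. FftRowResult, fft_row_model, fft_row_concrete, the O(n log n) cost model and its calibration constant are hypothetical placeholders, not P-COM2 artifacts.

#include <cmath>
#include <cstddef>

struct FftRowResult { double modeled_seconds; bool is_model; };

// Abstract component: an execution-environment-parameterized performance model.
// Here the model is a placeholder 5*n*log2(n) flop estimate scaled by a per-platform
// constant; a real model would be calibrated for the target machine.
FftRowResult fft_row_model(std::size_t n_rows, std::size_t row_len,
                           double seconds_per_flop) {
    double flops = 5.0 * double(n_rows) * double(row_len) * std::log2(double(row_len));
    return { flops * seconds_per_flop, true };
}

// Concrete component: performs the actual row FFTs (body omitted) and is timed
// by the execution engine rather than predicting its own execution time.
FftRowResult fft_row_concrete(/* matrix arguments omitted */) {
    // ... compute the 1D FFT of each row ...
    return { 0.0, false };
}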

Dec. 19, 2005 HPC Productivity 54

Implementation of Unified Execution Engine

A runtime system which combines parallel/distributed simulation with direct execution.

Based on traversal of the coordination structure (data/control flow graph).

Time management by generalized Lamport clocks at each component (node in the graph).

If a component is abstract, it generates its own execution time for the Lamport clock computation.

If a component is concrete, the execution time is measured.

Communication is also either modeled or concrete. (A sketch of this time-management rule follows.)
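
A minimal sketch of the time-management rule, assuming each graph node carries a clock in seconds and fires once its input tokens, each stamped with the sender's clock, have arrived. The Node/fire names are illustrative, not the P-COM2 runtime's.

#include <algorithm>
#include <chrono>
#include <vector>

struct Node {
    bool   abstract_component;  // true: performance model, false: concrete component
    double model_time;          // execution time predicted by the abstract component
    double clock = 0.0;         // local generalized Lamport clock

    // Fire the node once its inputs are available.  `input_clocks` are the clocks
    // attached to the incoming dataflow tokens; `run` executes the concrete code.
    template <class Fn>
    void fire(const std::vector<double>& input_clocks, Fn&& run) {
        double start = input_clocks.empty()
            ? clock
            : *std::max_element(input_clocks.begin(), input_clocks.end());

        double elapsed;
        if (abstract_component) {
            elapsed = model_time;  // abstract component generates its own time
        } else {
            auto t0 = std::chrono::steady_clock::now();
            run();                 // direct execution of the concrete component
            elapsed = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        }
        clock = start + elapsed;   // advance the local clock
        // Outgoing tokens would be stamped with `clock`; communication time is
        // likewise either modeled (a predicted latency) or measured.
    }
};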

Dec. 19, 2005 HPC Productivity 55

Case Study – Optimization of Sweep3D

What is Sweep3D?

• Three-dimensional particle transport problem.

• ASCI Benchmark for high performance parallel architectures.

• Parallel wavefront computation via domain decomposition (sketched below)

Data Grid: 10x10x10

Processor Grid: 2x2x10
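
A minimal MPI sketch of one step of such a wavefront sweep under a 2D domain decomposition: receive inflow faces from upstream neighbours, compute the local block, send outflow faces downstream. The neighbour ranks, face buffers, message tags and the flux kernel are placeholders, not the benchmark's code; MPI_Recv, MPI_Send and MPI_PROC_NULL are standard MPI.

#include <mpi.h>
#include <vector>

// Per-block angular-flux kernel; stands in for the compute_flux component.
void sweep_block(std::vector<double>& /*flux*/) { /* omitted */ }

void sweep_step(int up_i, int up_j, int down_i, int down_j,
                std::vector<double>& face_i, std::vector<double>& face_j,
                std::vector<double>& flux, MPI_Comm comm) {
    MPI_Status st;
    // rcv_inflows: neighbours set to MPI_PROC_NULL (domain boundary) make these no-ops.
    MPI_Recv(face_i.data(), static_cast<int>(face_i.size()), MPI_DOUBLE, up_i, 0, comm, &st);
    MPI_Recv(face_j.data(), static_cast<int>(face_j.size()), MPI_DOUBLE, up_j, 1, comm, &st);

    sweep_block(flux);  // compute_flux over the local k-plane/angle block

    // snd_outflows: forward the updated faces to the downstream neighbours.
    MPI_Send(face_i.data(), static_cast<int>(face_i.size()), MPI_DOUBLE, down_i, 0, comm);
    MPI_Send(face_j.data(), static_cast<int>(face_j.size()), MPI_DOUBLE, down_j, 1, comm);
}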

Dec. 19, 2005 HPC Productivity 56

Data Flow Graph with Sweep3D Components

[Figure 1: Data flow graph of the Sweep3D code. Components include read_input, allocate, initialize, octant, source, rcv_inflows, kplane_block, angle_block, compute_flux, snd_outflows, flux_err, gather_data, print_results and stop. The scattering operator forms the 'inner iterations' loop and the streaming operator forms the 'sweep routine' loop.]

Dec. 19, 2005 HPC Productivity 57

Productivity and Performance Experiments

• Performance of Component-based code

• Adaptation to Execution Environment

– Memory System Optimizations

– Communication System Optimizations

– Communication/Memory Trade-off

Dec. 19, 2005 HPC Productivity 58

Improved Serial and Parallel Performance

• Componentized code is faster on a single processor and gets better speedup in parallel execution.

[Chart: execution time (sec.) versus number of processors for a 100x100x100 problem, comparing the original Sweep3D code with the componentized Sweep3D code.]

Dec. 19, 2005 HPC Productivity 59

Efficiency and Isoefficiency

• Efficiency of the original code declines as processors are added.

• For the componentized code, we are able to maintain approximately fixed efficiency by increasing the problem size as we increase the number of processors.

[Chart: isoefficiency analysis – efficiency versus processors/problem size (1-8) for the original and componentized Sweep3D codes.]

Dec. 19, 2005 HPC Productivity 60

Communication/Memory Trade-off

Number of processors                            1       2        4       16      20      25
Runtime (sec.) with invariants as comm. msgs.   164.9   88.751   45.65   13.82   12.34   12.38
Runtime (sec.) with invariants as state         164.9   77.15    33.16   11.17   11.1    10.12

• Alternative implementations where invariant data is either kept as local state in each component or communicated among components.

Dec. 19, 2005 HPC Productivity 61

Synchronous Versus Asynchronous Communication

Table 6: Performance comparisons (runtime in sec.) on a fixed problem size (100x100x100)

Number of processors     1       2        4       9       16      20
Synchronous comm.        164.9   88.751   45.65   22.79   13.82   12.34
Asynchronous comm.       164.9   80.11    36.45   16.27   13.24   11.63

Dec. 19, 2005 HPC Productivity 62

Sweep3D Summary

• The Sweep3D benchmark was mapped to components and dozens of instances of the code were realized.

• Productivity enhanced – adaptation and optimizations in minutes or hours, not days or weeks.

• Performance enhanced – component replacement for optimizations for execution environments and problem cases.

• X10 version of Sweep3D.

Dec. 19, 2005 HPC Productivity 63

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 64

Related Research

DARPA High Productivity Program

Software Engineering:

Component-oriented development

Software architectures

Grid Programming Systems – Automate, ICENI, etc.

Autonomic Computing

Agent-based systems

Role-Based Systems

Commercial IDEs – J2EE, JavaBeans, .NET, etc.

NOTE: All of software development is based on a few simple principles. Different research communities use the same ideas but give them different names and target different problem domains.

Dec. 19, 2005 HPC Productivity 65

• Overview

• Perspective

• Basic Principles

• Historical HPC

• Emerging HPC

• Productivity Concepts for Conventional Software

• HPC Development Paradigm – Requirements

• HPC Development Paradigm – Concepts

• HPC Development Environment – An Example

• Connection to Other Research

• Research Issues

Dec. 19, 2005 HPC Productivity 66

Future Research

Unaddressed issues:

• Explicit parallelism within primitive components
• Locality management beyond components
• Multiple versions of components
• Use of software architectures in instance design
• Fault-tolerance other than by replication
• Verification/validation of coordination behaviors by model checking

Dec. 19, 2005 HPC Productivity 67

Conclusion

• Orders-of-magnitude productivity gains for HPC applications are readily possible.

• This requires breaking old thought patterns.

• The concepts are neither difficult nor original.