CS420/CSE 402/ECE 492
Introduction to Parallel Programming for Scientists and Engineers
Spring 2006
© 2006 David A. Padua

Additional Foils 0.i: Course organization

Instructor: David Padua
4227 SC
[email protected]
3-4223
Office Hours: By appointment

T.A.: Predrag Tosic
XXXX Siebel Center
[email protected]
Office Hours:

Textbook

Lectures

Some lecture foils will be required reading. These will be posted at:

http://www-courses.cs.uiuc.edu/~cs420/

Grading:

6-9 Machine Problems (MPs)/Homeworks 50%
Midterm (Friday Mar 3) 25%
Final (Comprehensive) 25%

Graduate students registered for 1 unit (4 credits) must complete additional work (associated with each MP/Homework).

Additional Foils 0.ii: Topics

• Machine models.

• Parallel programming models.

• Language extensions to express parallelism:

OpenMP (Fortran) and MPI (Fortran or C).

If time allows: High-Performance Fortran, Linda, pC++, SplitC, UPC (Unified Parallel C), CAF (Co-array Fortran), HTA (Hierarchically Tiled Arrays).

• Issues in algorithm design

Parallelism

Load balancing

Communication

Locality

• Algorithms.

Linear algebra algorithms such as matrix multiplication and equation solvers.

Symbolic algorithms such as sorting.

N-body

Random number generators.

Asynchronous algorithms.

• Program analysis and transformation

Dependence analysis

Race conditions

Deadlock detection

• Parallel program development and maintenance

Modularity

Performance analysis and tuning

Debugging

Additional Foils Chapter 1: Introduction

Parallelism

• The idea is simple: improve performance by performing two or more operations at the same time.

• Has been an important computer design strategy since the very first computers.

• It takes many (complementary) forms within conventional systems such as uniprocessor PCs and UNIX workstations:

At the circuit level: Adders and multipliers do not handle one digit at a time but operate on several digits at the same time. This design strategy was used even by Charles Babbage in his mechanical computer design of the 19th century.

At the processor-design level: The execution of instructions and floating-point operations is usually pipelined. Several instructions can execute simultaneously.

At the system level: Computation and I/O can proceed simultaneously. This is why multiprogramming increases throughput.

• However, the design strategy of interest to us is to attain parallelism by using several processors or even several complete computers.

• Future PCs will be built with multicore chips. (Reading assignment: http://www.intel.com/business/bss/products/server/resource_center/multi-core.htm?ppc_cid=ggl|multicore_resrc_ctr|k46EE|s )

• Multicore chips are made possible by Moore’s Law, named after a 1965 observation by Gordon E. Moore (co-founder of Intel). It holds that “The number of elements in advanced integrated circuits doubles every year.”

• Another important reason for the development of parallel systems of the multicomputer variety is availability.

“Having a computer shrivel up into an expensive doorstop can be a whole lot less traumatic if it’s not unique, but rather one of a herd. The herd should be able to accommodate spares, which can potentially be used to keep the work going; or if one chooses to configure sparelessly, the work that was done by the dear departed sibling can, potentially, be redistributed among the others.” In search of clusters. G. Pfister. Prentice Hall.

Applications

• Traditionally, highly parallel computers have been used for numerical simulations of complex systems such as the weather, mechanical devices, electronic circuits, manufacturing processes, chemical reactions, etc.

• “In part because of HPCC technologies, simulation has become recognized as the third paradigm of science, the first two being experimentation and theory. In some cases it is the only approach available for further advancing knowledge -- experiments may not be possible due to size (very big or very small), speed (very fast or very slow), distance (very far away), dangers to health and safety (toxic or explosive), or the economics of conducting the experiments. In simulations, mathematical models of physical phenomena are translated into computer software that specifies how calculations are performed using input data that may include both experimental data and estimated values of unknown parameters in the mathematical models. By repeatedly running the software using different data and different parameter values, an understanding of the phenomenon of interest emerges. The realism of these simulations and the speed with which they are produced affect the accuracy of this understanding and its usefulness in predicting change.”

From an old document entitled “High Performance Computing and Communications: Foundation for America's Information Future”.

• Perhaps the most important government program in parallel computing today is the

Advanced Simulation and Computing Program (ASC)

(Reading assignment:

http://www.llnl.gov/asc/overview/overview.html
http://www.llnl.gov/asci/platforms/bluegenel/images/BGLbrocure.pdf

).

Its main objective is to accurately simulate nuclear weapons in order to verify safety, reliability, and performance of the US nuclear stockpile. Several highly-parallel computers (1000s of processors) from Intel, IBM, and SGI are now being used to develop these simulations.

• Commercial applications are also important today. Examples include: transaction processing systems, web servers, data mining, etc. These applications will probably become the main driving force behind parallel computing in the future.

• In this course, we will focus on numerical simulations due to their importance for scientists and engineers.

• As mentioned above, computer simulation is considered today as a third mode of scientific research. It complements experimentation and theoretical analysis.

• Furthermore, simulation is an important engineering tool that provides fast feedback on the quality and feasibility of new designs.

Additional Foils Chapter 2: Machine models

2.1 The Von Neumann computational model

Discussion taken from Almasi and Gottlieb: Highly Parallel Computing. Benjamin Cummings, 1988.

• Designed by John Von Neumann about fifty years ago.

• All widely used “conventional” machines follow this model. It is represented next:

[Figure: MEMORY (holds instructions and data) connected to a PROCESSOR, which contains the ARITHMETIC UNIT (logic, registers) and CONTROL (instruction counter).]

• The machine’s essential features are:

1. A processor that performs instructions such as “add the contents of these two registers and put the result in that register”

2. A memory that stores both the instructions and data of a program in cells having unique addresses.

3. A control scheme that fetches one instruction after another from the memory for execution by the processor, and shuttles data one word at a time between memory and processor.

For an instruction to be executed, there are several steps that must be performed. For example:

1. Instruction Fetch and decode (IF). Bring the instruction from memory into the control unit and identify the type of instruction.

2. Read data (RD). Read data from memory.

3. Execution (EX). Execute operation.

4. Write Back (WB). Write the results back.
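
The four steps listed here can overlap across consecutive instructions when the processor is pipelined. The following is a minimal sketch of that overlap, assuming every step takes exactly one cycle and no instruction ever stalls (both are simplifying assumptions). The function names are hypothetical and are not part of the course material.

```python
# Illustrative stage names taken from the list above.
STAGES = ["IF", "RD", "EX", "WB"]


def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when each instruction finishes before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when a new instruction enters the pipeline every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def pipeline_schedule(n_instructions):
    """Return one row per instruction; row[c] is the stage active in cycle c."""
    total = pipelined_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i starts one cycle after i - 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in pipeline_schedule(3):
        print(" ".join(row))
    print(sequential_cycles(3), "cycles sequential,",
          pipelined_cycles(3), "cycles pipelined")
```

Under these assumptions three instructions need 12 cycles when executed one after another and 6 cycles when pipelined.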

• Notice that machines today usually are programmed in a high level language containing statements such as

A = B + C

However, these statements are translated by a compiler into the machine instructions just mentioned. For example, the previous assignment statement would be translated into a sequence of the form:

LD 1,B (load B from memory into processor register 1)

LD 2,C (load C from memory into register 2)

ADD 3,1,2 (add registers 1 and 2 and put the result into register 3)

ST 3,A (store register 3’s contents into variable A’s address in memory)

• It is said that the compiler creates a “virtual machine” with its own language and computational model.

• Virtual machines represented by conventional languages, such as Fortran 77 and C, also follow the Von Neumann model.
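
To make the translation of A = B + C concrete, here is a small sketch of a register machine that executes the LD/LD/ADD/ST sequence from the slide. The interpreter, its tuple encoding of instructions, and the use of a dictionary as memory are all assumptions made for illustration; they do not describe any real instruction set.

```python
def run(program, memory, n_registers=4):
    """Execute LD/ADD/ST instructions one after another, Von Neumann style."""
    registers = [0] * n_registers
    for op, *args in program:
        if op == "LD":        # LD reg,var: load a memory cell into a register
            reg, var = args
            registers[reg] = memory[var]
        elif op == "ADD":     # ADD dest,src1,src2: dest = src1 + src2
            dest, src1, src2 = args
            registers[dest] = registers[src1] + registers[src2]
        elif op == "ST":      # ST reg,var: store a register into a memory cell
            reg, var = args
            memory[var] = registers[reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# The compiled form of A = B + C, as listed on the slide.
PROGRAM = [
    ("LD", 1, "B"),
    ("LD", 2, "C"),
    ("ADD", 3, 1, 2),
    ("ST", 3, "A"),
]

if __name__ == "__main__":
    memory = {"A": 0, "B": 2, "C": 3}
    print(run(PROGRAM, memory)["A"])  # prints 5
```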

2.2 Multicomputers

• The easiest way to get parallelism given a collection of conventional computers is to connect them:

[Figure: several conventional computers, each with its own MEMORY (holds instructions and data) and PROCESSOR (ARITHMETIC UNIT with logic and registers, CONTROL with instruction counter), linked by an Interconnect.]

• Each machine can proceed independently and communicate with the others via the interconnection network.

• There are two main classes of multicomputers: clusters and distributed-memory multiprocessors. They are quite similar, but the latter is considered a single computer and is sold as such.

Furthermore, a cluster consists of a collection of interconnected whole computers (including I/O) used as a single, unified computing resource.

Not all nodes of a distributed memory multiprocessor (such as IBM’s SP-2) need have complete I/O resources.

• An example of a cluster is a web server

[Figure: a request arrives from the net at a dispatcher/router, which forwards it to one of several servers, each a complete computer with its own MEMORY and PROCESSOR.]
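
The web server cluster can be sketched as a dispatcher that hands each incoming request to one of the servers. The slide does not say how the dispatcher chooses a server; the round-robin policy below is an assumption, and the `Dispatcher` class and server names are hypothetical.

```python
from itertools import cycle


class Dispatcher:
    """Forward each request to the next server in turn (assumed policy)."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self._next_server = cycle(servers)

    def dispatch(self, request):
        """Return the (server, request) pair chosen for this request."""
        server = next(self._next_server)
        return server, request


if __name__ == "__main__":
    dispatcher = Dispatcher(["server0", "server1", "server2"])
    for request in ["GET /a", "GET /b", "GET /c", "GET /d"]:
        print(dispatcher.dispatch(request))
```

Because the servers are whole computers that share nothing, the dispatcher only needs to pick a target; no coordination among servers is required in this simple model.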

• Another example was a workstation cluster at Fermilab, which consisted of about 400 Silicon Graphics and IBM workstations. The system was used to analyze accelerator events. Analyzing any one of those events has nothing to do with analyzing any of the others. Each machine runs a sequential program that analyzes one event at a time. By using several machines it is possible to analyze many events simultaneously.
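
Because each event is analyzed independently, the work can be spread over workers with no communication between them. The sketch below illustrates the pattern with a thread pool standing in for the separate workstations; `analyze_event` is a placeholder (it just sums the readings of an event) and is not the actual analysis program.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_event(event):
    """Placeholder sequential analysis of one event: sum its readings."""
    return sum(event)


def analyze_all(events, n_workers=4):
    """Analyze every event independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(analyze_event, events))


if __name__ == "__main__":
    events = [[1, 2], [3], [4, 5, 6]]
    print(analyze_all(events))  # prints [3, 3, 15]
```

On a real cluster each worker would be a separate machine running the sequential program; only the distribution of events and collection of results would change.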

2.3 Shared-memory multiprocessors

• The simplest form of a shared-memory multiprocessor is the symmetric multiprocessor (SMP). By symmetric we mean that each of the processors has exactly the same abilities. Therefore any processor can do anything: they all have equal access to every location in memory; they all can control every I/O device equally well, etc. In effect, from the point of view of each processor the rest of the machine looks the same, hence the term symmetric.

• Caches are an important component of SMPs. These will be discussed later.

[Figure: several PROCESSORs (each with ARITHMETIC UNIT and CONTROL), a single MEMORY (holds instructions and data), and I/O (LAN, Disks), all attached to one Interconnect.]
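
In the shared-memory model every processor can read and write the same memory locations. The sketch below illustrates this with Python threads standing in for processors: each thread adds its partial sum into one shared location, protected by a lock. This is only an illustration of the programming model. In CPython the threads do not actually execute Python code in parallel, and the function name is hypothetical.

```python
import threading


def parallel_sum(values, n_threads=4):
    """Sum values with several threads that update one shared location."""
    total = [0]              # shared location visible to every thread
    lock = threading.Lock()  # serializes updates to the shared location

    def worker(chunk):
        partial = sum(chunk)  # private work, no sharing needed
        with lock:
            total[0] += partial

    # Give thread i the elements i, i + n_threads, i + 2 * n_threads, ...
    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(101))))  # prints 5050
```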

2.4 Other forms of parallelism

• As discussed above, there are other forms of parallelism that are widely used today. These usually coexist with the coarse grain parallelism of multicomputers and multiprocessors.

• Pipelining of the control unit and/or arithmetic unit.

• Multiple functional units

• Most microprocessors today take advantage of this type of parallelism.

[Figure: MEMORY (holds instructions and data) and a PROCESSOR containing an ARITHMETIC UNIT, registers, and CONTROL (instruction counter).]

• VLIW (Very Long Instruction Word) processors are an important class of multifunctional processors. The idea is that each instruction may involve several operations that are performed simultaneously. This parallelism is usually exploited by the compiler and not accessible to the high-level language programmer. However, the programmer can control this type of parallelism in assembly language.

[Figure: Multifunction Processor (VLIW). An instruction word with fields LD/ST, FADD, FMUL, IALU, and BRANCH drives the corresponding functional units, which share a Register File connected to Memory.]

• Array processors. Multiple arithmetic units

• Illiac IV is the earliest example of this type of machine. Each arithmetic unit (processing unit) of the Illiac IV was connected to four others to form a two-dimensional array (torus).

[Figure: one PROCESSOR with CONTROL (instruction counter) and several ARITHMETIC UNITs (logic, registers), each paired with a MEMORY (holds instructions and data).]
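
The "connected to four others" arrangement can be sketched as neighbor computation on a two-dimensional torus, where the edges wrap around. This is a generic torus; the 8 by 8 size in the usage example is an assumption for illustration, and the actual wiring of the Illiac IV may differ in detail.

```python
def torus_neighbors(row, col, n_rows, n_cols):
    """Return the four neighbors (up, down, left, right) with wraparound."""
    return [
        ((row - 1) % n_rows, col),  # up, wrapping from the top to the bottom
        ((row + 1) % n_rows, col),  # down, wrapping from the bottom to the top
        (row, (col - 1) % n_cols),  # left
        (row, (col + 1) % n_cols),  # right
    ]


if __name__ == "__main__":
    # Corner unit of an assumed 8 by 8 array.
    print(torus_neighbors(0, 0, 8, 8))  # prints [(7, 0), (1, 0), (0, 7), (0, 1)]
```

Thanks to the wraparound, every unit has exactly four neighbors, including those on the edges of the array.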

2.5 Flynn’s taxonomy

• Michael Flynn published a paper in 1972 in which he picked two characteristics of computers and tried all four possible combinations. Two stuck in everybody’s mind, and the others didn’t:

• SISD: Single Instruction, Single Data. Conventional Von Neumann computers.

• MIMD: Multiple Instruction, Multiple Data. Multicomputers and multiprocessors.

• SIMD: Single Instruction, Multiple Data. Array processors.

• MISD: Multiple Instruction, Single Data. Not used and perhaps not meaningful.
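
The four combinations can be written down as a small classification function. This is only a sketch: counting "streams" as integers is an assumption made for illustration, and the function name is hypothetical.

```python
def flynn_class(instruction_streams, data_streams):
    """Classify a machine by its number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))    # prints SISD (conventional Von Neumann computer)
    print(flynn_class(1, 64))   # prints SIMD (array processor)
    print(flynn_class(16, 16))  # prints MIMD (multicomputer or multiprocessor)
```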