Parallel Computing 25.10.2012 14:51 UEF/cs Simo Juvaste 1 (289)
University of Eastern Finland
Computer Science

Parallel Computing
5 cr, 3621528
Fall 2012

Simo Juvaste
http://cs.uef.fi/pages/sjuva/parallel.html

Sitting placements at the first lectures:
1) Sit within reach of someone (several) else.
2) The whole class must be connected.
Course contents (preliminary)

• Chapter 1: An Introduction to Parallel Computing (p. 3)
  • What?, Why?, How?
• Chapter 2: PRAM (p. 55)
  • A simple model of parallelism
• Chapter 3: Parallel algorithms (in PRAM notation) (p. 85)
  • Basic algorithms, e.g., counting, prefix, sorting, etc.
• Chapter 4: Taking the real world into account (p. 163)
  • Network delay models, memory access models
• Chapter 5: Message passing programming (with MPI) (p. 224)
  • Real parallel programming work.
• Chapter 6: Other stuff (p. 228)
  • OpenMP, Fortran 90, HPF, functional, data flow.
  • GPU programming, CUDA/OpenCL
  • Everyday (especially in a few years) parallel (and concurrent) programming: processes, IPC, shared memory, pthreads, Java threads.
Chapter 1
An Introduction to Parallel Computing

What?, Why?, How?
Some key concepts
Pros, Cons
Other similar terms
Examples
An animal experiment
Design issues
What is Parallel Computing?

⇒ Use several computers to solve a single computational task in parallel!
• Two is better than one.
• One thousand is better than two…
• Think human (manual) work.
⇒ The single task has to be divided into several parts.
• Some tasks are easy to divide, some are not.
⇒ The cooperating computers have to be able to communicate.
• One task, one solution.
• There are many ways to communicate.
⇒ The participating "computers" do not need to be complete!
• Processor, memory, communication medium (processing unit).
• Monitors do not process.
• The whole parallel computer still needs to have some I/O, etc.
What is Parallel Computing?

Example 1-1: A human example: manual sorting of papers:
• Input: a bunch of A4 papers, each having a name.
• Input size: 10, 100, 1000, or 10000 papers (1 mm, 1 cm, 10 cm, 1 m).
• Task: sort the bunch (alphabetically).

One (quick) person alone: [1st exercise in Data Structures and Algorithms]
• 10 papers: 30 s [3 s/paper]
  • method insignificant
• 100 papers: 8 min [5 s/paper]
  • divide into 10 (5-27) substacks according to the first letter, sort substacks, combine
• 1000 papers: 2 h [7 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 100-sort recursively.
• 10000 papers: 25 h [9 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 1000-sort recursively.
• You might want some help...
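The first-letter distribution above is exactly a bucket sort. As a minimal sketch (the helper names, the 26-way split, and the bounds below are illustrative assumptions, not from the course material; names are assumed to start with a letter A-Z):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 100
#define BUCKETS 26

static int cmp(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort n names: distribute into 26 substacks by first letter,
   sort each substack, then concatenate the substacks in order. */
void first_letter_sort(const char *names[], int n) {
    const char *bucket[BUCKETS][MAX_NAMES];
    int count[BUCKETS] = {0};

    for (int i = 0; i < n; i++) {                 /* distribute */
        int b = toupper((unsigned char)names[i][0]) - 'A';
        bucket[b][count[b]++] = names[i];
    }
    int k = 0;
    for (int b = 0; b < BUCKETS; b++) {           /* sort + combine */
        qsort((void *)bucket[b], count[b], sizeof bucket[b][0], cmp);
        for (int i = 0; i < count[b]; i++)
            names[k++] = bucket[b][i];
    }
}
```

The point for this course: the substacks are independent, which is precisely why the manual version parallelizes so well — one helper per letter can sort a substack without communicating with the others.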
What is Parallel Computing?

Parallel manual paper sorting:
• 10, 100, 1000, 10000 helpers!
• Work organization is more difficult than in a single-person sort.
• Exercise 1.
⇒ The important question:
• Will 10 helpers speed up the work 10 times?
  • 10 papers task: no (one helper can help a little).
  • 10000 papers task: yes (at least almost 10 times).
• Will 10000 helpers speed up the work 10000 times?
  • 10 papers task: no.
  • 10000 papers task: no (but we can exploit more than 10 helpers).
  • 100,000,000 papers task: yes (almost)
What is Parallel Computing?

⇒ What is the optimal number of helpers for each number of papers?
• What is the goal? What does optimal mean?
  • Minimal wall clock time?
  • Efficiency (minimal person work hours, i.e., euros)?
  • ?
Little practice

Rules
• Physical messages, writing on a piece of paper
• Written message may include instructions, addresses, data
• Connections to neighbours without standing up
• Sending a message (synchronous communication):
  • Ask the neighbour to receive, wait until he/she is ready
  • Hand out the message, say "here you are"
• Receiving a message:
  • Agree to receive
  • Receive, say "thank you"
• You can see and communicate only with your neighbours.
• Local operations are unlimited
Little practice

Tasks
• Max, count, search (single value, pattern), sum, sort, ...

Algorithm?
• For the above rules?
• For different rules?
• Without rules (but no magic)?
Little practice

Physical conditions/restrictions (i.e., challenges):
• Open hall, no restrictions
• Coordination: loudspeakers (for leaders), person-to-person communication, guidance painted on floor, rehearsal, etc.
• "Cluster" of two door-connected halls?
• Sitting here, no person movement allowed.
• Paper delivery only to neighbours vs. anyone?
• Only one paper at a time vs. a bunch at a time
• How to benefit from use of a blackboard or an electronic message board?
• How to benefit from shouting?
• Without sight contact to neighbours.
• Load balancing (fast and slow workers)
• Fault tolerance (temporary, permanent)
Why is parallel computing needed?

⇒ Why are computers needed?
• Because computers can compute (calculate) fast and they can have huge memory.

Why is an i7 at 3.50 GHz (2003 slides: 3 GHz) not enough???
• Computing power will ~double every ~two years. ["Moore"]
• Intel/AMD 4/6/8-core processors at 2-4 GHz are very cheap (from 100 e)!
• 20 years ago governments would have paid millions for a 2012 PC.

What else do we need?
• Humans are greedy and impatient...
• Some tasks are too demanding and urgent to be computed by one processor only.
• Some tasks are more valuable the more computing power we can use on them.
Why is parallel computing needed?

What is so demanding and urgent?
• Word processing?
• WWW-surfing?
• Bank / stock exchange?
• eCommerce?
• Gaming?
• Real world simulation!
  • Matter consists of very tiny particles!
  • Every visible piece consists of very many particles.
  • We cannot simulate every (sub)atomic particle for a large (visible) object!
⇒ But: the smaller the particles we can simulate, the more accurate a simulation we have!
• Smaller particles ⇒ more particles ⇒ more calculations to do!
⇒ Unbounded amount of calculations!
Why is parallel computing needed?

Why do we want to simulate the real world?
• "Test" a piece of equipment without building it.
• Prediction of natural phenomena.
• Prediction of consequences of changes.
• "See" artificial things.
• Optimizing structures or models.
Why is parallel computing needed?

Example: weather forecasts
• History data, constants, measurements.
• Simulation of the future movement of air particles.
• Simulation of physical changes (temperature, pressure, humidity, velocity, etc.) of air in the atmosphere.
• Huge amounts of molecules move and interact quickly for several days.
• Incomprehensible amount of calculations.
Why is parallel computing needed?

• Resolution reduction:
  • 50×50×1 km (×5 min) block of air as 1 entity.
  • Penalty: accuracy and reliability are reduced.
• Forecast as far into the future as possible.
  • Unfortunately: inaccuracies multiply.
⇒ A more powerful computer or more time immediately yields more accurate forecasts (and longer forecasts).
⇒ (Reliable) weather forecasts are very valuable!
• In real forecasts, the models exploit grid-wide differential equations instead of local simulation...
Block size (km),   Gflop/s needed for "real" time     Gflop/s needed for
height 0.5 km      simulation (2 minute steps)        5 days in 2 hours
1                  1 804 492                          108 269 544
3                  21 762                             1 305 732
10                 241.7                              14 503
Why is parallel computing needed?

• A late forecast is worthless.
• Finnish Meteorological Institute: (about)
  • 7.5×7.5×0.3 km (×6 min) Canada .. Ural, 3-10 days
  • 2.5×2.5×0.? km (×? min) Sweden .. Finland
  • (44 km -> 7.5 km in 14 years)
• Cray XT5m, 656 × 6-core Opteron, 35 TFLOPS (theoretical)
Why is parallel computing needed?

⇒ Conclusion
• We want as powerful a computer as possible!
• We are willing to pay for it.
⇒ Unfortunately
• No IA256 @ 300 GHz ever(?) (until 2030+?)
• Even if we pay all the money in the world.

Thus
⇒ We'll use several processors to achieve more computing power.
• Finnish CSC currently (louhi.csc.fi): Cray XT4/5
  • 2716 × (4-core 2.3 GHz Opteron, 4-8 GB, 25 GB/s)
  • Theoretically 102.3 TFlop/s, measured 76.5 TFlop/s (Linpack)
  • http://www.csc.fi/english/research/Computing_services/computing
  • see "Current parallel computers (briefly)" p. 37
• Ordered: Cray Cascade (10 Me, 1 PFLOPS?)
Why is parallel computing needed?

Other applications for processing power (parallelism)
• Huge databases, urgent queries, data mining
• Digital signal/image/video processing
• Complex user interfaces (virtual reality, games)
• DNA modelling
• DNA matching
• Molecular modelling
• Environmental modelling (storms, pollution, earthquakes, sea currents)
• Astronomical modelling
• Optimization (aero/hydrodynamics, etc.)
• Structure strength calculations (car crash simulations, etc.)
• Cryptoanalysis
• Pattern recognition, audio/image surveillance
• Data mining/indexing/classification
• Artificial intelligence
• Measurement data analysis and modelling (sensor values to big picture)
Some key concepts

Example 1-2: Building a small house:
• One skilled man can build a house in one year
• Two skilled men can do it in about half a year
• 12 men, one month: requires very careful planning (at least)
• 365 men, one day: probably impossible
• 1 million men, 10 seconds: definitely impossible
Some key concepts

⇒ How to coordinate the fast (1-5 day) parallel building of a house?
• Skilled workers
• Synchronization of work
• Partly independent components (roof, walls, etc.)
• More than one (levels of) leader(s)
• Good instructions and communication
• Detailed plan available to all (at least many) workers
  • Problem: a single plan will be crowded
  • Solution: local partial copies of the plan
Some key concepts

⇒ Lessons learned:
• Parallelization possibilities depend on the problem (ditch vs. well)
• Communication and coordination are vital
• Access to a SHARED plan with local copies is a fairly good communication method
⇒ There is a limit on the efficient number of workers.
• Key concepts:
  • speedup, extra work, efficiency
Some key concepts

Example 1-3: Example: which one to choose?

Think BIG!
• Great Wall of China (in a day?)
  • 5 mm / ~300 kg of wall for each Chinese
• Great Pyramid of Giza (in ???)
  • ~60 kg for each Egyptian

Labour    Calendar time   Speedup   Work      Labour expenses   Efficiency
1 man     1 year          1.00      1.00 my   48,000 e          1.00
2 men     7 months        1.71      1.17 my   56,000 e          0.86
4 men     4.5 months      2.67      1.50 my   72,000 e          0.66
365 men   5 days          73.00     5.00 my   240,000 e         0.20
Some key concepts

Limits of parallelization
• Can we speed up a computation infinitely by adding more and more processors?
  • Not infinitely; most problems have a lower time bound (usually (poly)logarithmic, with a polynomial number of processors).
• In practice, the limit is money.
  • Hard problems are huge (input size (N) is large).
  • Huge problems have a lot of potential parallel parts.
  • E.g., a high-rise building vs. a single-family house.
  • Small problems are fast enough with one processor.
• In theory, the limit is 3-dimensional space and the speed of light (we cannot reach an exponential number (as a function of time) of processors) (T(N,P) = Ω(P^(1/3−ε))).
Some key concepts

Speedup (nopeutus), work (työ), efficiency (tehokkuus, hyötysuhde)
• An optimal sequential (uniprocessor) algorithm time = Ts(N).
• Parallel algorithm with P processors, time = Tp(N,P)
• Speedup is defined as the ratio Ts/Tp
• Speedup Ts/Tp = O(P)
  • I.e., superlinear speedup is not possible, as it would imply a faster sequential algorithm.
• Work (used resources) = Tp × P.
• If Tp × P = O(Ts), the algorithm is work optimal (työoptimaalinen).
• Tp × P = o(Ts) is impossible!
Some key concepts

Amdahl's law on serial fractions within parallel programs
• If an algorithm has an (inherently) serial part that will not be parallelized, it will limit the whole parallelization.
• Or, if we do not bother to parallelize some difficult part.
• Whole algorithm (serial) time T, sequential fraction α (0..1).

    T(N,P) = αT + (1−α)T/P                                                  (1-1)

    Speedup(P) = T / (αT + (1−α)T/P) = 1 / (α + (1−α)/P) → 1/α, when P → ∞  (1-2)

    Efficiency(N,P) = T / (P(αT + (1−α)T/P)) → 1/(Pα + 1)  (P → ∞)          (1-3)
Some key concepts

Possible goals for speedup and/or efficiency
• As fast as possible.
  • No matter how many processors.
  • For most problems, there exists a (poly)logarithmic-time ((log n)^k) algorithm (very fast!).
• As good efficiency as possible.
  • Unfortunately, the sequential algorithm is always the most efficient.
⇒ As fast as possible while maintaining (asymptotically full, or given) efficiency.
• Something in between, or in real life:
  • In a given time, with as few (and cheap) processors (and other resources) as possible.
  • With a given number of processors (and other resources), as fast as possible.
Some key concepts

Brent's theorem
• If our algorithm works with P processors in time T, we can execute it with P' < P processors in time T × P/P'.
⇒ We can always design algorithms for as many processors as possible/efficient. The algorithm will work nicely with fewer processors.
• Even if we won't have thousands of processors, multithreaded processors work more efficiently with more threads.
⇒ In some cases, though, an algorithm that is designed for fewer processors may be more efficient.
Some key concepts

What is so difficult in parallel programming?
• Sometimes even sequential programming is difficult.
• In parallel programming we have to manage several processors, each of which must work correctly.
• The processors must communicate correctly.
• Some problems are easy to parallelize, some difficult or inefficient.
⇒ Parallel programming is difficult.
⇒ We often need more abstraction levels than in sequential programming.
• Concentrate on data and operations on data.
Some key concepts

Parallelism is natural!
• In fact, sequential order is (sometimes) artificial.
• A "typical" algorithm segment:

    for each elem in array A do
        elem ← elem × 2

• A sequential programmer implements:

    for (i = 1; i <= A; i++)
        A[i] = A[i] * 2;

• Why serialize an originally parallel (simultaneous) operation?
• Sometimes serialization might be a source of errors.
• A parallel version can be flexibly implemented with 1..N processors.
• The real world is concurrent (and very parallel) anyway.
• Parallelism is (almost) as old as Life.
Some similar terms (that are sometimes mixed up)

Distributed System (hajautettu järjestelmä)
⇒ A distributed system is a collection of autonomous computers linked by a computer network that appears to the users of the system as a single computer.
• The machines are autonomous; this means they are computers which, in principle, could work independently;
• Separate computers work concurrently, without a global clock, and may appear, fail and recover independently.
• The user's perception: the distributed system is perceived as a single system solving a certain problem (even though, in reality, we have several computers placed in different locations).
⇒ Each part of the distributed system may be a part of (i.e., participate in) several distributed systems.
• Not part of this course.
Some similar terms (that are sometimes mixed up)

Distributed computing (hajautettu laskenta)
• Term often used when several computers (often geographically distributed) are used to compute a single computational problem in parallel.
• Message passing programming; tolerate long and/or unpredictable delays, low bandwidth.
• E.g., SETI@home, distributed DNA matching, etc.
• The boundary between parallel and distributed computing depends on the speaker.
• Sometimes, "distributed computing" is used of "distributed systems".
• "Grid computing".
• Part of this course.
Some similar terms (that are sometimes mixed up)

Concurrent system (samanaikainen)
• Things occurring apparently simultaneously.
• In reality, only one (process, etc.) is executing at a time, and the process is changed frequently enough.
  • E.g., processes in a multitasking OS execute at ~10 ms time slices.
• Can also occur really simultaneously in multiprocessor systems.
• Concurrency is defined with respect to a slow observer (human).
• Order of concurrent events is nondeterministic.
• Can be (usually is) implemented using time-sharing (sometimes several processors).
• Tasks are not necessarily (tightly) related.
• Parallel and distributed systems are concurrent by nature.
  • Processes in different computers execute simultaneously.
  • The communication in asynchronous distributed systems is concurrent.
  • To achieve most flexibility and performance, the processes (computers, software) that participate in a DS are usually concurrent (multithreaded).
• Concurrency theory (or practical handling) is not part of this course.
Some similar terms (that are sometimes mixed up)

Multithreading (säikeistys)
• The standard mechanism to implement a concurrent process (one process)
• As opposed to distinct processes, the threads of a single process share the same data.
• Not part of this course.

Multithreading according to processor manufacturers
• Processor includes special circuits to execute several processes simultaneously.
• Depending on the implementation, the processes may execute at full speed, or at slightly lower speed.
• Benefit: more efficient utilization of functional units.
• OS (and processes) "see" several processors.
• E.g., Intel HyperThreading(tm), SUN CMT.
• Relates to this course.
• See Processor multithreading (p. 44).
Some similar terms (that are sometimes mixed up)

Distributed operating system
• Single system image (for the user) over several computers.
• User will not know in which physical computer their processes run.
• Automatic job/process distribution, balancing, migration.
• "Grid computing"
• E.g., Mosix
Some similar terms (that are sometimes mixed up)

Parallel computation/computer (rinnakkaislaskenta, -tietokone)
• Use several processors/computers to solve a single computation in parallel.
• The only goal is to make hard computing faster.
• Up to P times faster using P processors.
• Useful (only) if we are in a hurry (simulation/forecast, real-time applications).
• A parallel computer often has dozens..thousands of similar processors with a tight interconnection and often a (virtual) shared memory.
Some similar terms (that are sometimes mixed up)

Parallel, distributed, and concurrent systems and programming have a lot in common.
• Task division.
• Interprocess communication, dividing data.
• Nondeterminism.
• Synchronization challenges.
• Deadlock possibility.
• Load balancing.
• Error possibilities, fault-tolerance techniques.
⇒ Hardware, tools, and goals differ.
• In this course, we concentrate on parallelism, but we might have something (threads, processes) on concurrency.
Current parallel computers (briefly)

SMP (Symmetric MultiProcessor)
• 2-16 (-64) processors on the same memory bus (or switch).
• Several banks of memory.
• Each processor has its own cache (to reduce bus traffic).
• Not a very scalable approach (as a bus; a bit more with a switch).

Figure 1-1: Bus-based SMP computer. [diagram: processors with caches on a central system bus, with memory banks and I/O]
Current parallel computers (briefly)

• E.g., cs: Sun M4000 (2× 4-core SPARC64 VII 2.4 GHz).
• In larger units (P ≥ 8-16), processors are usually clustered.
• Processors do not communicate directly; memory is used for communication.
• Usually used to improve throughput in a concurrent system; can be used for parallel computation as well.

Figure 1-2: Crossbar-based SMP computer. [diagram: processors with caches connected through a crossbar to memory banks and I/O]
Current parallel computers (briefly)

Why parallel (once again) [Gordon Moore, ISSCC 2003, www.intel.com]
[Moore's-law figures omitted]
Current parallel computers (briefly)

Multicore SMP, SMT, CMT
• As the silicon manufacturing process improves, more and more transistors can be fitted on a chip (mainframe/supercomputer: on a board).
• How to use the exponentially growing transistor count efficiently?
  • 1940's to 70's: more and more bit-parallelism and instructions.
    • Eventually diminishing returns.
  • (70's), 80's, 90's: deeper pipelining, wider superscalar.
    • Usefulness of deeper pipelines and wider superscalar is limited by code/compilers; eventually diminishing returns.
  • Since late 80's: more and more cache to balance slow memory.
    • The difference between 2 MB and 4 MB L2 caches is small in speed, but the cache has more transistors than an ALU; eventually diminishing returns.
  • Since mid 2000's: more cores.
    • (And more integration for cheap PCs)
  • Same transistor count: 6000× i386 or a single 2-core Itanium 2!
Current parallel computers (briefly)

• Multicore SMP means several CPUs within a single silicon chip.
• Each CPU has its own ALU(s), L1 (& L2) cache, usually also an FPU.
• CPUs share the L3 (& L2) cache, MMU, and external connections.
• Multicore benefit
  • P times the processing potential for approx. the same price
• Drawback
  • Memory and I/O bandwidth do not increase accordingly; eventually diminishing returns.
Current parallel computers (briefly)

• Sun UltraSPARC IV processor [www.sun.com]
[processor block diagram omitted]
Current parallel computers (briefly)

Processor multithreading
• Each core executes several processes (threads).
• Reduces the impact of memory latency by making each virtual processor slower.
• Sun UltraSPARC T3
  • 16 cores, 8 threads each → OS sees 128 threads ("processors")
• Cray XMT
  • 128 threads per processor.
Current parallel computers (briefly)

⇒ Multicore is mainstream now (2006 slides: "soon").
• XBox 360
  • CPU: triple-core PowerPC, two threads each (total 6 threads)
  • GPU: 48 ALUs
• Playstation 3
  • 8 VLIW processors (APU), each 4+4 pipelines = 256 pipelines.
• Intel
  • Since 2003: Hyperthreading provides 2 virtual processors for the OS
  • 8-core i7/Xeon (multi-chip)
  • Dual-core P4 at 2005, quad-core at 2007, 48?-core at 2010.
• AMD 2*8-core Opteron, dual-core Athlon at 2005, quad at 2007.
• SUN/ORACLE quad-core SPARC61 VII, 16-core T3
  • SUN dual-core UltraSPARC IV at 2004, 8-core T1 at 2006.
• IBM 8-core POWER7, dual-core PPC970 at 2004.
• Nvidia Kepler: 1536 cores, up to 96 threads/core, 500 e.
⇒ Nowadays, we can assume that our software is run mostly on parallel machines!
Current parallel computers (briefly)

Vector (super)computers
• Classical supercomputers since Cray 1 in 1977.
• 1-32 (more clustered) extremely powerful processors.
• Each up to 100 GFLOPS (2008).
• ~8 MUL-and-ADD floating point operations / clock cycle / processor
  • E.g., dot product
• Requires several long (1000-element) arrays (vectors) for peak performance.
• On each clock cycle, up to 16 words (64 B) from/to memory.
  • Average PC: 0.1 .. 1 B/cc
• No caches, but hardware prefetch (very deep pipeline) and very wide memory channels (and SRAM memory).
• Cray, Hitachi, Fujitsu, NEC.
• Very expensive, even per FLOPS.
• Nearly extinct in the original form; current implementations approach MPPs, see below.
• NEC SX-9: 100 GFLOPS/proc, 256 GB/s memory bandwidth/proc
  • http://www.nec.com/de/en/prod/servers/hpc/material/255_e_sx9.pdf
Current parallel computers (briefly)

MPP (Massively Parallel Processing)
• Tens..thousands of processors.
• Each processing node is a 1-4 processor SMP with memory.
• Separate I/O nodes.
• Processing nodes connected by an interconnection network; topologies vary.

Figure 1-3: A 64-node 3D mesh, a 32-node binary hypercube, and an 80-node butterfly (with 16 input/output nodes).
Current parallel computers (briefly)

• Usually hardware supports virtual shared memory.
• Scales enough (can be built to consume any budget).
• Communication network is expensive (up to half of the machine cost).
• Special purpose machines can be tailor-designed to balance the costs of subsystems (processors, memory, bandwidth, I/O) with the given task.
• General purpose computers provide compromises between price and interconnection and memory performance.
• E.g., (ILLIAC IV), Thinking Machines CM-1, -2, -5, Cray T3E, XT4/5, XE6, Digital (HP) Alphaserver SC, IBM eServer, Intel ASCI Red, SGI, etc.
Current parallel computers (briefly)

NOW (Network of Workstations)
⇒ Personal workstations are 99% idle (nights, editor usage).
• Free cycles can be used by: nice compute
• "Free" (unused) computing power:
  • cs department: 400 PCs × 3 GFLOPS = 1.2 TFLOPS.
  • UEF: 5000 PCs × 3 GFLOPS = 15 TFLOPS.
  • Finland: 1.5M PCs × 2 GFLOPS = 3 PFLOPS > Blue Gene.
• Ordinary Unix (WinNT) workstations, TCP/IP connection.
• A switch ... LAN ... WAN ... Internet.
• Sometimes (nowadays) also a dedicated cluster (ryväs).
  • 1(0) Gb Ethernet, Infiniband, ATM, FC, or Myrinet; no displays, etc.
  • Blade racks to save space, reduce loose wires.
⇒ Slow(ish) communication restricts algorithm choice.
⇒ Cheapest FLOPS because of mass production!
• See exercise 4-5.
Current parallel computers (briefly)

Parallel architectures seem to converge towards each other.
• In SMP computers the buses are replaced by clustered networks.
• Vector supercomputers are implemented in CMOS, use caches and DRAM, P increases, nodes are clustered (memory performance degrades, or there is no shared memory anymore).
• Vector techniques and virtual shared memory are used in MPP computers.
• Multithreading and multicore are used in CPUs and GPUs.
• Workstations (or server computing nodes) have parallel vector units.
• MPP computers are built from commodity parts like NOWs.
• Dedicated "NOWs" are used for parallel computation.
• Several (even heterogeneous) computers are connected for joint work (grid computing).
• Blade server racks look like a mainframe...

Current top computers: http://www.top500.org/
Current parallel computers (briefly)

IBM Sequoia - BlueGene/Q
• 98,304 * 16-core PowerPC
• 16 PFLOPS, 7900 kW

Tianhe-1A
• http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=678988&releasejsp=release_157
• 7,168 NVIDIA Tesla M2050 GPUs
  • 448 cores each ⇒ 3.2M cores
  • ~1 GFLOPS / core ⇒ 500 GFLOPS / GPU
  • But only 3 GB memory / GPU
  • ~3.5 PFLOPS theoretical, 2.5 PFLOPS LINPACK
  • tens of threads / core = tens of millions of threads!
• 14,336 Xeon CPUs.
Current parallel computers (briefly)

Additional bonus on parallel computers
• As we can have unlimited performance via parallelization, we do not need the fastest processor. Instead, we'll select the best by performance/price. (www.verkkokauppa.com 2010)
• Not quite as simple as GFLOPS/e.
  • We need more than processors (motherboards, network cards, switches).
  • The algorithm may be less efficient with more processing nodes.
  • See exercises 4-5.

Intel Core 2 Duo E7500 2×2.9 GHz, 3 MB       118.90 e
Intel Core 2 Quad Q8400 4×2.66 GHz, 6 MB     151.90 e
Intel Core 2 Quad Q9650 4×3.0 GHz, 12 MB     330.90 e
Intel i5-760 4×2.8 GHz, 8 MB                 193.90 e
Intel i7-950 4×3.06 GHz, 8 MB                514.90 e
Intel i980X EE 4×3.3 GHz, 12 MB              989.90 e
Intel Xeon X7460 6×2.66 GHz, 16 MB           2578.90 e
Chapter 2
PRAM

A simple model of parallelism
PRAM programming
PRAM physical implementation possibilities

⇒ PRAM is used to avoid dirty details.
PRAM shortly

How was PRAM born?
⇒ A familiar computer abstraction (for programmers, etc.):
• RAM (Random Access Machine)
  • A processor
  • A memory
• Procedural (or OO) programming, especially variables.
• Not quite accurate anymore, but good enough.

Figure 2-1: RAM (Von Neumann). [diagram: a processor connected to a memory]
PRAM shortly

A natural extension:
• PRAM (Parallel Random Access Machine)
• Fortune and Wyllie 1978, many others
⇒ Increase the number of processors.
• All processors can equally access the shared memory.
⇒ Programming is like RAM, except memory (variables) is shared.
• All processors have to be programmed.
• Memory access conflicts have to be avoided.

Figure 2-2: The structure of the PRAM model. [diagram: P processors P1..PP performing read/write operations from/to a word-wise accessible shared memory]
Why PRAM is good:
• Simple and strong model.
  • If a parallel algorithm can be done at all, it can be done for PRAM.
• Resembles real computers (like RAM).
• Flexible: tens of different variations.
• Generally used.
  • Most parallel algorithms are designed for PRAM.
  • An existing set of algorithms and other theory.

Why PRAM is bad:
• A P-port shared memory cannot be built (easily).
• Real-world delays are ignored.
• Does not account for building costs.
• Does not guide towards saving resources.

Still:
• A handy tool (abstraction) for research and teaching.
• Algorithms can be adapted for real computers.
PRAM models

Processors are processors, brand does not matter.
• If needed, we can define each processor (processing node) to have local memory and I/O.
  • Especially the program can be stored as local copies, but as a plain model, it does not matter.
• Usually we assume the same program but own program counters at every processor (MIMD, multiple instruction stream, multiple data).
• SIMD (single instruction stream) is an option for cheaper implementation.
The shared memory in PRAM is interesting.
• To operate efficiently, the processors need to be able to exploit the memory.
  • Up to a read/write at every clock cycle by every processor.
• Is it possible/feasible to define/implement a memory that can handle P simultaneous memory accesses every clock cycle?
  • It is easy to define.
  • It is attractive to use.
  • It might be possible to implement (with some tricks).
  • It is not currently feasible to implement, though.
• For a while we assume that it is possible, and we'll exploit it to achieve the easiest possible parallelism.

Processor - memory speed comparison (Random Access Machine):
• 8 bits/DRAM chip, 50 ns random access latency, 3 GHz 64-bit processor:
  • 3 × 50 × 64 / 8 = 1200 DRAM chips/processor for full random access of one word at every clock cycle!
• Actually, modern (SD)RAM should not be considered as RAM...
PRAM memory model
• A single memory, indexed memory locations (e.g., 1..m).
  • m usually "unlimited" (as in RAM).
• Each memory reference (read/write) is done in unit time (O(1), 1 cc).
  • Also, all other machine instructions take 1 clock cycle.
⇒ What if simultaneous memory references hit the same memory bank or even the same memory location?
• Simultaneous: on the exactly same clock cycle, no timesharing possible within a clock cycle. Also called concurrent.

Same bank, different address:
• For the model, there is no such problem.
• For a real implementation, we need more circuitry and/or tricks (see below).
Several simultaneous memory references to the same memory address:
• The references could possibly be combined.
  • Write requests: something is written.
  • Read requests: the result is copied to all accessing processors.
⇒ In a model, we just define what will happen.
• Several simultaneous reads is a strong operation, but very easy to define.
• Simultaneous read(s) and a write can be defined as, e.g., every write occurring before every read (two stages = O(1)).
• Several simultaneous writes are much more difficult to define.
  • Each memory location will always contain only one value.
⇒ In the PRAM model, these are considered as model variations.
PRAM variations
• The memory models differ on the restrictions/results on what can happen at a single memory location at a single clock cycle.
• If the restrictions are violated, the whole machine halts immediately (in a model), or results are unknown (in real life).
E/C/O × R/W
• EREW (Exclusive Read, Exclusive Write)
  • Both several simultaneous reads and writes are forbidden.
• CREW (Concurrent Read, Exclusive Write)
  • Several processors may read simultaneously, but writing is allowed to one processor at a time.
• CRCW (Concurrent Read, Concurrent Write)
  • An unlimited number of reads and writes are permitted simultaneously.
  • The result of simultaneous writes has to be resolved somehow, see below.
• CROW (Concurrent Read, Owner Write)
  • Each memory location is owned by a processor, others may only read it.
• ERCW (Exclusive Read, Concurrent Write)

CW variation examples
• On concurrent access to a single memory location.
• In ascending (partial) order of strength.
• WEAK
  • Only simultaneous writing of zeroes is allowed.
• COMMON
  • Only simultaneous writing of the same value is allowed.
• TOLERANT
  • Nothing happens if several processors try to write simultaneously.
• COLLISION
  • A special collision symbol is written if several processors try to write simultaneously.
• COLLISION+
  • A special collision symbol is written if several processors try to write different values simultaneously. (see COMMON)
• ARBITRARY
  • Some (random) value survives if several processors try to write simultaneously.
• PRIORITY
  • The processor with the lowest PID will succeed, others fail.
• STRONG
  • A combination of the values is written,
  • e.g., ADD&WRITE, AND&WRITE, PREFIX-SCAN
  • Different variations have been suggested.
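To make the write rules concrete, here is a small simulation of my own (the function, mode names, and the STRONG-ADD combination are my illustrative choices, not course code): given all the (PID, value) pairs that hit one location on the same clock cycle, it returns the value the location holds afterwards.

```python
COLLISION = object()  # stand-in for the special collision symbol

def resolve_cw(writes, mode):
    """writes: list of (pid, value) pairs arriving on the same cycle."""
    if len(writes) == 1:
        return writes[0][1]          # no conflict, every model agrees
    values = [v for _, v in writes]
    if mode == "COMMON":
        if len(set(values)) != 1:    # different values: machine halts
            raise RuntimeError("restriction violated")
        return values[0]
    if mode == "COLLISION":
        return COLLISION             # conflict always leaves the symbol
    if mode == "PRIORITY":
        return min(writes)[1]        # lowest PID wins
    if mode == "STRONG-ADD":
        return sum(values)           # ADD&WRITE-style combination
    raise ValueError(mode)
```

For example, under PRIORITY the value written by the lowest PID survives, while a STRONG ADD&WRITE model stores the sum of all written values.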
Examples of potency differences:
• Spreading a word to every processor (or to P memory locations).
  • CREW: every processor reads the same memory location: O(1)
  • EREW: the value is doubled (as in a binary tree) until all processors have read it: O(log P)
• Maximum of an array.
  • CREW: O(log N)
  • WEAK CRCW: O(1)
• Sorting
  • EREW: O(log N)
  • STRONG CRCW: O(1)
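The EREW doubling trick can be sketched as a round-by-round simulation (my own sketch, not course code): every informed cell copies the value to one distinct uninformed cell, so the informed set doubles each round and ceil(log2 P) rounds suffice without any concurrent read.

```python
def erew_broadcast(value, P):
    """Spread one value to P cells without concurrent reads or writes:
    each round, cell i (i < have) writes its copy to cell i + have,
    so sources and targets are all distinct (EREW-legal)."""
    cells = [None] * P
    cells[0] = value
    have, rounds = 1, 0
    while have < P:
        for i in range(have):        # on a PRAM, all these run in parallel
            j = i + have
            if j < P:
                cells[j] = cells[i]
        have = min(2 * have, P)
        rounds += 1
    return cells, rounds
```

With P = 8 the value reaches every cell in 3 rounds, matching the O(log P) bound.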
PRAM "programming"

⇒ As in sequential programming, we'll use several abstraction levels.
• Describe the algorithm in a natural language and a picture.
• Describe the algorithm in an algorithm notation.
• Transform the algorithm to adapt to real-world (machine and programming environment) restrictions.
• Write the algorithm in a programming language.
• Compile the program into machine language.

(Data)parallel algorithm notation

⇒ As sequential, with an additional statement to express parallelism:

  for i ∈ 1..N pardo  // or, e.g., foreach element in A pardo        1
    statement;        // e.g., if A[i] = 0 then A[i] := ...          2

• statement is executed once for each value of i (1..N) (as in a sequential for-do).
• All N executions are done in parallel, if we have at least N processors.
• Time complexity:
  • Tst + O(1) if we have enough processors (Tst = time of a single statement).
  • Tst × N/P + O(1) if we take P into account.
• Remember Brent's theorem (p. 27).
⇒ Different parallel executions may not disturb each other.

  for i ∈ 1..N pardo                                                 1
    A[A[i]] := A[i];  // result very unclear, not allowed!           2

• If we need local variables (memory), we can use the keywords private and shared to clarify the situation.
⇒ Creative freedom is allowed in algorithm notation as long as exactness and comprehensibility are maintained.
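One way to make the "no disturbance" rule concrete is to give pardo snapshot semantics: every parallel instance reads the state as it was before the step. The sketch below is my own sequential stand-in, not course code.

```python
def pardo(n, body, state):
    """Sequential stand-in for 'for i in 0..n-1 pardo': all instances
    read a snapshot taken *before* the step, so they cannot disturb
    each other even when writing back into the same array."""
    snapshot = list(state)
    for i in range(n):               # conceptually, all i execute at once
        body(i, snapshot, state)

def shift_left(i, old, new):
    # new[i] depends only on the pre-step snapshot -> well defined
    new[i] = old[(i + 1) % len(old)]

A = [1, 2, 3, 4]
pardo(len(A), shift_left, A)         # A becomes [2, 3, 4, 1]
```

Note that a naive sequential in-place loop would instead propagate already-overwritten values, which is exactly the kind of disturbance the notation forbids.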
  procedure Odd-even_mergesort (A : array[1..N]);                    1
    if Processors = 1 then                                           2
      Sequential_mergesort(A);                                       3
    else                                                             4
      par i = 1 to 2 do                                              5
        Odd-even_mergesort(i:th half of A);                          6
      Odd-even_merge(halves of A);                                   7
    synchronize;                                                     8

  procedure Odd-even_merge (A : array[1..N]);                        9
    if Processors = 1 then                                           10
      Sequential_merge(A);                                           11
    else                                                             12
      par i = 0 to 1 do                                              13
        Odd-even_merge(halves of odd/even (2n+i) elements of A);     14
      par i = 2 to N–1 by 2 do                                       15
        pipelined_compare-exchange (A[i], A[i+1]);                   16
    synchronize;                                                     17

Algorithm 2-1: Parallel odd-even mergesort, informal version.
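For reference, the same recursive structure can be run sequentially. The sketch below is mine (assuming N is a power of two) and follows Batcher's classic odd-even merge formulation; each par-loop of Algorithm 2-1 becomes an ordinary loop here.

```python
def odd_even_merge(A, lo, n, r):
    """Merge the subsequence A[lo], A[lo+r], ... (n elements from lo,
    stride r). On a PRAM, all compare-exchanges of one level of this
    recursion run in parallel."""
    step = r * 2
    if step < n:
        odd_even_merge(A, lo, n, step)        # even-indexed subsequence
        odd_even_merge(A, lo + r, n, step)    # odd-indexed subsequence
        for i in range(lo + r, lo + n - r, step):
            if A[i] > A[i + r]:
                A[i], A[i + r] = A[i + r], A[i]
    elif A[lo] > A[lo + r]:
        A[lo], A[lo + r] = A[lo + r], A[lo]

def odd_even_mergesort(A, lo=0, n=None):
    if n is None:
        n = len(A)                            # n must be a power of two
    if n > 1:
        m = n // 2
        odd_even_mergesort(A, lo, m)          # the two halves are disjoint:
        odd_even_mergesort(A, lo + m, m)      # on a PRAM they run in parallel
        odd_even_merge(A, lo, n, 1)
```

Since the network of compare-exchanges is fixed in advance, the same code describes both the sequential and the parallel execution order.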
Parallel programming languages

⇒ The variety is huge, few established standards.
• We'll describe some real languages/standards later on.
⇒ PRAM programming on paper (or with a PRAM emulator) can be done as easily as moving from sequential algorithms to sequential programs.
• Local and shared variables.
• Processor-ID (PID) to distinguish between processors.
• Synchronization.
• I/O is either forgotten, or we'll use parallel I/O.
• Example: (Parallel Modula-2 for F-PRAM)

  procedure oemerge(sharedvar S : array of word;
                    Start, Length, Stride : word);                   1
  var a, b : word;                                                   2
      i, j, k, Length2 : register word;                              3
  begin                                                              4
    Length2 := Length / 2;                                           5
    par i := 0 to 1 do                                               6
      oemerge(S, Start + i * Stride, Length2, Stride * 2);           7
    end;                                                             8
    par i := 1 to Length2 - 1 do                                     9
      j := i * 2;                                                    10
      a := S[Start + (j - 1) * Stride];                              11
      b := S[Start + j * Stride];                                    12
      if a > b then                                                  13
        S[Start + (j - 1) * Stride] := b;                            14
        S[Start + j * Stride] := a;                                  15
      end;                                                           16
    end;                                                             17
    synchronize;                                                     18
  end oemerge;                                                       19

Algorithm 2-2: Odd-even merge in fpm.
PRAM machine language
• As any RAM machine language, possibly also LOAD PID, and separate operations to access local and shared memory.
• Usually one shared program for every processor.
  • The same program is loaded to every processor node; processors will branch according to PID.
• We can use assembler as an intermediate stage.
  • E.g., F-PRAM.

  # macro assembler                # macros opened
  else5: LOAD  =0        1         LOAD  =0    1
         STORE TMP15     2         STORE 24    2
         STORE TMP11     3         STORE 20    3
         LOAD  =1        4         LOAD  =1    4
         STORE TMP10     5         STORE 19    5
         LOAD  PROS      6         LOAD  9     6
         SUB   TMP10     7         SUB   19    7
         ADD   TMP11     8         ADD   20    8
         SUB   =1        9         SUB   =1    9
         JPOS  overpar0  10        JPOS  322   10

Figure 2-3: (F)PRAM machine language
Implementing PRAM

⇒ Using shared memory (a memory reference is a read or write) in one clock cycle is impossible.
• It has not succeeded even on uniprocessors since the 1 MHz times in the 80's.
  • Today, we could achieve 20 MHz on DRAM, 300 MHz on (nonembedded) SRAM.
• In addition to DRAM latency, the physical distances of large computers make access slow.
  • In 0.3 ns (3 GHz), light will travel 10 cm in free space, electricity ~7 cm in a coaxial cable, even less on a circuit board, only a few cm on a semiconductor.
⇒ Moreover, building a P-port memory is expensive/impossible if P is large.

The extra cost factor for P ports is Ω(P²) (as VLSI area).
• E.g., let us consider the technology for 4 Gbit (0.5 GB) memory chips.
  • It will yield a 16 Mbit (2 MB) memory with 16 ports.
  • Moreover, each of the 16 processors will need 24 address lines and 2 data lines, totalling more than 416 pins for the 16 Mbit (2 MB) memory chip.
  • Packaging costs for a modest 1 GB memory (64 MB/pr) would be 100000's e.
• At 64 ports, a 1 Mbit (128 kB) chip would be more complex (>1800 pins) than an Itanium2 Quad.
  • 64 GB would take 0.5 M chips, 1000 m², and cost >10⁹ e.
• And the access latency would still be long...
PRAM can be implemented more easily by simulating the shared memory with distributed memory.
⇒ P processors, M memory banks.

Figure 2-4: Distributed Memory Model. [P processing nodes P0..PP–1, each containing a processor, a memory, and a network interface, connected by an interconnection network.]

• Often it is assumed that M = P, i.e., each processing node contains a memory module.
  • Good: easier construction, fewer nodes, fewer communication connections.
  • Poor: more traffic in each node/connection; in real life, memories are slower than processors.
  • For reasonable performance, M = CP, where C is the speed difference factor between processors and memory.
Overloading (ylikuormitus)

⇒ Let us assume that a memory reference from/to a (virtual) shared memory takes h clock cycles.
• The computer has P physical processors.
• Each physical processor executes the tasks of h PRAM processors (h virtual processors per physical processor).
• The processor executes only one instruction at a time for each PRAM processor it is responsible for.
  • After each clock cycle it changes to the next PRAM processor.
  • After serving all h PRAM processors, it starts over by executing the next instruction of each PRAM processor.
⇒ The memory references made by the PRAM processors have completed in h clock cycles.
• In algorithm notation, see Algorithm 2-3.

  while not all processors halted do                                 1
    for each thread i do                                             2
      PCi := PCi + 1;                                                3
      if op = write then                                             4
        send write-reference                                         5
      else if op = read then                                         6
        send read-reference                                          7
      else                                                           8
        execute operation                                            9
    for each thread do                                               10
      if op = read then                                              11
        receive read-reference                                       12

Algorithm 2-3: PRAM simulation algorithm.
⇒ What do we gain?
• For each PRAM processor ("virtual processor") everything occurs in one clock cycle.
• The clock frequency of each PRAM processor is only 1/h of the real processor.
• There are h×P PRAM processors.
• Processing power is (h×P)×(1/h) = P, i.e., the same as with P direct processors.
⇒ If the program can exploit h×P processors, it will execute work-optimally.
• h is also called parallel slackness.

How large does h need to be?
• Depends on the network and the routing protocol.
• At least twice the diameter of the interconnection network.
  • Even a bit more, as the routing algorithm needs slackness to handle congestion.
  • E.g., in a butterfly network: O(log P loglog P).
• It has been done (Saarbrücken SB-PRAM, Tera MTA / Cray XMT).
  • The same technique is used in GPU units, e.g., Nvidia G8x, etc.
• Bonus: no caches needed.

Requirements for overloading
• Multithreading processor (switch after every clock cycle)
  • Implementation similar to superpipelining (Forsell).
• Huge memory bandwidth.
  • E.g., fully populated grids have too narrow a bisection bandwidth, see Figure 1-3 (p. 49).
Lesson learned
⇒ A parallel algorithm should be designed to use as many processors as (efficiently) possible.
• PRAM is not completely utopistic.
• Especially if we use local memories to decrease the traffic in the shared memory.
Chapter 3
Parallel algorithms (in PRAM-notation)

Goals
Techniques
Some algorithms
Parallel algorithm design goals

Either
• maximal speedup (and parallelism), or
• maximal speedup while still maintaining work-optimality.

More formally, an algorithm classification
• According to time complexity
  • NC: polylogarithmic time complexity, polynomial number of processors (Nick's class).
  • P: polynomial speedup
    • A different P than in sequential algorithms (solvable in polynomial time).
    • Note: NC and P are not disjoint.
• According to work optimality
  • E: efficient
  • A: polylogarithmic inefficiency (almost efficient)
  • S: polynomial inefficiency (semi-efficient)
• Combining these we'll get six classes of algorithms: ENC, ANC, SNC, EP, AP, SP.
  • ENC would be nice.
  • EP is usually good enough.
Parallel algorithm design methods

⇒ Concentrate on (operations for) data, not (operations by) processors!

Parallelizing sequential parts of an existing sequential algorithm

⇒ This is not a real design method, but in real life this is what we'll face (as ad hoc programmers have sequentialized parallel problems).
• Suits well for linear algebra.
• Analysing for-do loops (and other sequential sections).
  • If the sequential parts are independent, we can parallelize them.
  • Sometimes the inner loops are parallel, sometimes the outer loops.
  • Loop rearranging may help.
• E.g., matrix multiplication C = A·B,

    c_ij = Σ_{k=0..N–1} a_ik × b_kj                                  (3-1)

  • An easy sequential algorithm and an easy parallelization.
  • N×N matrix, O(N³) sequential algorithm, O(N) parallel algorithm with O(N²) processors.
  • PRAM variant? Exercise.

  for i := 1 to N do    // ⇒ pardo                                   1
    for j := 1 to N do  // ⇒ pardo                                   2
      for k := 1 to N do                                             3
        C[i, j] := C[i, j] + A[i, k] * B[k, j];                      4

Algorithm 3-1: Matrix multiplication.
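The independence of the two outer loops can be checked in a plain simulation (my own sketch, not course code): every (i, j) pair touches only its own C[i][j], so on a PRAM with N² processors each one computes one inner product in O(N) time.

```python
def matmul_rowwise(A, B):
    """Algorithm 3-1 with the two outer loops treated as pardo:
    the (i, j) iterations are independent, only the innermost
    k-loop is a sequential inner product."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):            # pardo on a PRAM
        for j in range(N):        # pardo on a PRAM
            s = 0
            for k in range(N):    # sequential, O(N)
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```

Parallelizing the k-loop as well is the harder part discussed next, since its iterations all accumulate into the same result cell.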
• Parallelizing the innermost for-loop is not quite as straightforward (unless we use the STRONG CRCW model).
  • However, the innermost product-sum can be evaluated in O(log N) time using O(N) processors, see Parallel tournament (turnaustekniikka) (p. 95).
  • Even with O(N/log N) processors, see Blocking (lohkominen) (p. 97).
• Thus, the whole algorithm in O(log N) time with O(N³/log N) processors (exercise).
• For real computers and real input sizes, it is often enough to parallelize only one of the nested loops.

• In algorithms with several stages, we should parallelize all (demanding) stages to achieve full efficiency (processor utilization).

  for i := 1 to N pardo  // O(1)                                     1
    for j := 1 to N pardo                                            2
      statement1;        // O(1)                                     3
  for i := 1 to N do     // O(N)                                     4
    for j := 1 to N pardo                                            5
      statement2;        // O(1)                                     6

Algorithm 3-2: An uneven parallelization: O(N) time, O(N²) processors (but O(N) with O(N) processors).
Divide-and-conquer

⇒ Divide the input in two parts, solve the halves recursively in parallel, combine the results (in parallel).
• A familiar technique in sequential algorithms.
• Parallel recursion is terminated when either
  • the input is trivial (as in sequential programming), or
  • there is only 1 processor left, when we can switch to a sequential algorithm (see Blocking (lohkominen) (p. 97) and Algorithm 2-1 (p. 71)).
• Subresults are combined into larger subresults on returning from recursion.

• E.g., mergesort
  • Sequential algorithm: Ts(N) = 2·Ts(N/2) + O(N) = O(N log N)
  • The recursive calls at lines 3 and 4 can be executed in parallel (as they work on disjoint parts of the array).
  • Using sequential merge, Tp(N) = Tp(N/2) + O(N) = O(N), O(N) processors, O(N²) work, not good.
⇒ Also the combining of subresults must be parallelized!
• Combining is often more difficult than dividing.
• Sometimes combining is trivial, though.
  • E.g., in search algorithms (only the discoverer acts), especially using CRCW.

  procedure mergesort(var A : array; first, last : index);           1
    if (last–first) > 0 then                                         2
      mergesort(A, first, (last+first)/2);                           3
      mergesort(A, (last+first)/2+1, last);                          4
      merge(A, first, (last+first)/2, (last+first)/2+1, last);       5

Algorithm 3-3: Mergesort.
• In mergesort, the combining is the merging phase, which is more difficult to parallelize.
• If we could merge in O(1) time using O(P) processors, the sorting time would be Tp(N) = Tp(N/2) + O(1) = O(log N) time, O(N) processors, O(N log N) work.
• Unfortunately merging in O(1) time is impossible (using realistic models).
  • O(1) amortized time is possible, but unfeasibly complex.
• Merging in O(log N) or O(loglog N) time is much easier, but does not offer work optimality unless we use fewer processors, see "Odd-even merge" p. 136.
• The division can be made in more than two parts to reduce the number of stages.
  • E.g., division in √N parts, combining in unit time: T(N) = T(√N) + O(1) = O(loglog N).
  • Obviously, combining might not be as easy anymore, see the raw power and waterfall techniques below.
Parallel tournament (turnaustekniikka)
• Also called balanced tree.
• If divide-and-conquer is a top-down approach, we can also apply a similar technique bottom-up.
• We'll skip the (recursive/parallel) dividing into parts; instead we'll start from ready "sequences" of length one element.
• Compare the input elements pairwise, the winner continues to the next round.
  • The definition of winner depends on the application, e.g., a combination can be used.
  • A stage can be done in O(1) time using N/2 processors.
• The same is repeated again and again among the winners (N/4, N/8, ... pairs) until the ultimate winner is left.
• log N stages, each O(1) time ⇒ O(log N) time, O(N) processors.
• As in divide-and-conquer, more than two elements can be handled at each stage, see below.
Raw power (raaka voima)
• As fast as possible.
  • "Overkill".
  • Almost: using as many processors as possible.
⇒ We'll try to evaluate all possibilities at once.
• E.g., we'll compare all pairs simultaneously.
  • O(N²) comparisons in O(1) time using O(N²) processors.
  • N input elements will transform to N² subresults!
• Combining may be hard to do fast, usually requires CRCW.
• The goal is an O(1) or logarithmic time algorithm.
• Rarely work-optimal.
• Often used as the final stage of an algorithm, see below.
Blocking (lohkominen)
• The previous methods often result in unbalanced processor utilization, which implies non-optimal work.
  • E.g., at the beginning of a tournament, N/2 processors are used, but the number of active processors reduces on every round; the last comparison is made by one processor only.
• We'll restrict parallelism appropriately to achieve work-optimality.
• Idea:
  • Fewer processors.
  • More work to do for each processor.
  • At the beginning, each processor (in parallel) evaluates its own block sequentially.
  • Switch to the fast parallel algorithm only when each processor has a single intermediate result.
• Usually used with other techniques, e.g., divide-and-conquer.

• E.g., in a tournament of O(N) sequential work:
  • The actual tournament stage will take O(log P) time.
  • To maintain work-efficiency, we can use at most O(N/log P) processors (if also the block part can be done in O(log P) time, fewer if it takes more time).
  • We'll choose P = N/log N.
  • Each processor will have a log N-element block; the sequential algorithm is used, O(log N) time.
  • The remaining N/log N elements will be processed using a parallel tournament in O(log N) time using N/log N processors.
⇒ The whole algorithm in O(log N) time with N/log N processors.
• If the sequential part with blocks takes more than O(N) time, smaller blocks are enough.
Waterfall technique (vesiputoustekniikka)
• Also called accelerated cascading.
• Combine the best parts of the previous methods.
• Switch to a faster algorithm after the size of the input has shrunk enough to be executed faster using the given P.

Other methods
• Some basic algorithms, e.g., prefix sums (see p. 119), binary search, and tree/path compaction, are useful as parts of larger algorithms. They often help at the combining parts.
• Randomization (breaking patterns), useful for real-world EREW-like variants to avoid memory congestion.
• Parallel Monte Carlo / genetic methods (all processors try (random) solutions).
• Sampling.
  • Take a (smallish, but as large as possible without disturbing the efficiency) sample of the whole data, analyse it using a fast algorithm (raw power).
  • Divide the input according to the distribution of the sample.
  • The input will hopefully be divided more evenly to the processors.
  • Helps on real data with inconvenient patterns.
Maximum finding

⇒ A very simple problem; examples of each technique.
• Input: a shared array A[0..N–1]
• Output: the largest element or/and its index.
• Sequential algorithm: O(N).

Standard tournament
⇒ Compare elements pairwise, the winner continues to the next iteration.
• After log N iterations, only one element is left.
• Intermediate results have to be stored somewhere.
  • For each comparison, we need two values which were compared on the previous iteration by different processors.
  • If we want to leave the original array intact, we'll use an auxiliary array.
  • Here we'll use the original for simplicity.
• Winner placement can be done in many ways, see below.
• Here we'll store all winners at the beginning part of the array. The part reduces to half on every iteration.
• The most difficult part is to make the indices match on every iteration.
• Iterations have to be executed in strict synchrony.
  • We can assume this in PRAM algorithm notation (we can mention it, though). In real machines we need an explicit synchronization.

Figure 3-1: Tournament maximum. [The array contents round by round: pairwise maxima are written to the front half of the array until the overall maximum is at A[0].]

• By some clever organization, the synchronization requirement can be eased, even removed (with auxiliary data structures).
• If/when the input size N is not of the form 2^k, we'll have to refine line 4 to, e.g.,

  A[j] := max(((j*2 < N) ? A[j*2] : A[j]),
              (j*2+1 < N ? A[j*2+1] : A[j]));                        4

• Time: log N (line 2) × O(1) (lines 3-4) + O(1) (lines 1 and 5) = O(log N).
• Number of processors: N/2 = O(N).
• Work: O(N log N), not work-optimal (inefficient by a factor of O(log N)).
• EREW PRAM is sufficient.

  function tournament-max(var A : array[0..N–1]);                    1
    for i := log N – 1 to 0 do                                       2
      for j := 0 to 2^i – 1 pardo                                    3
        A[j] := max(A[j*2], A[j*2+1]);                               4
    return A[0];                                                     5

Algorithm 3-4: Maximum using standard tournament.
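Algorithm 3-4 can be simulated round by round; the sketch below is mine (assuming N is a power of two), with the pardo over j written as an ordinary loop, although all comparisons of one round are independent and would take one PRAM step.

```python
def tournament_max(A):
    """Round-by-round tournament: after each round the n pairwise
    maxima sit at the front of the array; log2(N) rounds in total."""
    A = list(A)                   # keep the caller's array intact
    n = len(A)                    # assumed to be a power of two
    rounds = 0
    while n > 1:
        n //= 2
        for j in range(n):        # pardo on a PRAM
            A[j] = max(A[2 * j], A[2 * j + 1])
        rounds += 1
    return A[0], rounds
```

With 8 elements the maximum is found in 3 rounds, matching the O(log N) bound.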
• The same set of indices can be written in different ways:
• Also, you may use any indices, or a new array to store the intermediate results.

  function tournament-max2(var A : array[0..N–1]);                   1
    i := N;                                                          2
    while i > 0 do                                                   3
      i := i/2;                                                      4
      for j := 0 to i pardo                                          5
        if j*2 < N–1 then                                            6
          A[j] := max(A[j*2], A[j*2+1]);                             7
        elseif j*2 = N–1 then                                        8
          A[j] := A[j*2];                                            9
    endwhile;                                                        10
    return A[0];                                                     11

Algorithm 3-5: Tournament-max, alternative implementation.

• E.g., using a doubling/halving stride works well.
• If counting twice does not hurt, modulo helps on the boundaries.

  function tournament-max3(var A : array[0..N–1]);                   1
    s := 1;  // stride                                               2
    while s < N do                                                   3
      for j := 0 to N–s–1 by s*2 pardo                               4
        A[j] := max(A[j], A[j+s]);                                   5
      s := s * 2;                                                    6
    endwhile;                                                        7
    return A[0];                                                     8

Algorithm 3-6: Tournament-max, yet another alternative implementation.
Figure 3-2: Binary tree of Algorithm 3-6. [Leaves at indices 0..15; strides 1, 2, 4, 8 combine the indices pairwise towards A[0].]

A variation: maximum for every processor
• Often the maximum has to be spread to all processors (or indices of the array).
  • This is useful especially on EREW PRAM.
• We could do the spreading by using another log N "tree".
• But in the previous algorithm, most processors are idle most of the time. They can be exploited in "concurrent spreading".
  • Each processor evaluates its own "local" maximum tree.
• Even if all processors do useful work during the whole execution, this is not work-optimal.

Figure 3-3: An "array of trees" of degree 2. Dashed lines represent wrap-around edges.
Divide-and-conquer
• Works actually like the tournament, with slightly different notation.
• Divide recursively until the input is trivial.
• On returning from recursion, compare, and return the larger one.
• Managing array boundaries and synchrony is easier.
• The representation of parallelism is possibly more difficult / inefficient.
• Time: T(N) = T(N/2) + O(1) = O(log N), O(N) proc, O(N log N) work.

  function divide_conquer-max(var A : array[0..N–1];
                              low, high : index);                    1
    if (low = high) then                                             2
      return A[low];                                                 3
    else                                                             4
      pardo                                                          5
        x := divide_conquer-max(A, low, (high+low)/2);               6
        y := divide_conquer-max(A, (high+low)/2+1, high);            7
      return max(x, y);                                              8

Algorithm 3-7: Maximum finding using the divide-and-conquer technique.

Blocking and tournament
• None of the previous algorithms is work-optimal.
• Without Concurrent Write, we cannot achieve O(1) time with O(N) processors; thus, we'll have to reduce the number of processors for work-optimality.
⇒ We'll first use N/log N processors, with a goal of O(log N) time.
• Idea: reduce the input to N/log N elements, after which we'll use the tournament in O(log N) time using N/log N processors.
• Each processor first finds the maximum of its own block of size log N sequentially (but all processors in parallel).
• After O(log N) time, we'll have an intermediate input of size N/log N.
• Then we'll do the tournament for the smaller input.
• Total time O(log N), N/log N processors ⇒ O(N) work!
• EREW is still enough.

  function blocking_tournament-max(var A : array[0..N–1]);           1
    for i := 0 to N/log N – 1 pardo                                  2
      B[i] := A[i*log N];                                            3
      for j := 1 to log N – 1 do                                     4
        B[i] := max(B[i], A[i*log N + j]);                           5
    return tournament-max(B[0..N/log N – 1]);                        6

Algorithm 3-8: Blocking technique in maximum finding.
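The two phases of Algorithm 3-8 can be simulated directly (my own sketch, not course code; the power-of-two padding is my simplification to keep the tournament phase trivial):

```python
import math

def blocking_tournament_max(A):
    """Blocking phase: N/logN blocks of ~logN elements, each scanned
    sequentially (all blocks in parallel on a PRAM). Tournament phase:
    pairwise maxima of the block results, O(logN) rounds."""
    N = len(A)
    b = max(1, int(math.log2(N)))                 # block size ~ log N
    B = [max(A[i:i + b]) for i in range(0, N, b)] # one processor per block
    while len(B) & (len(B) - 1):                  # pad to a power of two
        B.append(B[0])                            # duplicates don't change a max
    while len(B) > 1:                             # tournament rounds
        B = [max(B[2 * j], B[2 * j + 1]) for j in range(len(B) // 2)]
    return B[0]
```

Both phases take O(log N) steps with N/log N processors, which is what makes the whole algorithm O(N)-work.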
Parallel Computing 25.10.2012 14:51 UEF/cs Simo Juvaste 112 (289)
112
Maximum finding
Raw
pow
er (raaka vo
ima
)
•L
et us assume that
any element could be the m
aximum
.
•W
e’llprove other elem
ents not to be maxim
um, only m
aximum
is left.
•Initialize
anarray
of1's
ofsize
N(a
bitfor
everyelem
entof
theinput).
•C
ompare all pairs sim
ultaneously (about
N2/2 pairs).
•T
hesm
allerof
apair
cannotbe
them
aximum
,thusm
arkit
with
0to
the boolean array.
•D
raws
aredecided
accordingto
theindex
(below,
theone
with
smaller index w
ins).
•O
nly the maxim
um value retained the 1.
•A
ll stages inO
(1) time,
N2/2 processors,
O(N
2) work.
•C
oncurrent read is needed at line 4, concurrent write at lines 7 and 9.
•O
nlyzeros
arew
rittenconcurrently,thus
WE
AK
CR
CW
suffices.
1   function raw-max(var A : array[0..N–1]);
2     for i := 0 to N–1 pardo
3       V[i] := 1;
4     for i := 0 to N–1 pardo
5       for j := i+1 to N–1 pardo
6         if A[i] < A[j] then
7           V[i] := 0;
8         else
9           V[j] := 0;
10    for i := 0 to N–1 pardo
11      if V[i] ≠ 0 then
12        return A[i];

Algorithm 3-9: Maximum with raw power.
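A sequential Python rendering of Algorithm 3-9 (an illustrative sketch; the nested pardo loops become ordinary nested loops, and the slide's index rule breaks draws):

```python
def raw_max(A):
    """Simulate Algorithm 3-9: every pair is compared 'at once' and the
    smaller element's flag is zeroed; only the maximum keeps its 1.
    Draws are decided by index (the smaller index wins), as on the slide."""
    N = len(A)
    V = [1] * N
    for i in range(N):              # all pairs (i, j), i < j
        for j in range(i + 1, N):
            if A[i] < A[j]:
                V[i] = 0
            else:                   # covers draws: smaller index wins
                V[j] = 0
    for i in range(N):
        if V[i]:
            return A[i]
```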
Divide-and-conquer & raw power

• Divide-and-conquer can be used with division into more than 2 parts.
• Combining fast enough is harder.
• Using the raw-power maximum Algorithm 3-9, we can combine (find the maximum of) M results with M² processors in unit time.
• If we have N processors, we can combine √N subresults by raw-maximum.
• Divide the input into √N parts, solve them recursively, find the maximum with raw-max.

function root-max(var A : array[0..N–1]; low, high : index);
  if (low = high) then
    return A[low];
  else
    k := high – low + 1;
    for i := 0 to √k–1 pardo
      B[i] := root-max(A, low + i*√k, low + (i+1)*√k – 1);
    return raw-max(B[0..√k–1]);

Algorithm 3-10: √N-divide-and-conquer maximum.
• If N is not of the form 2^(2^n), we have to refine the algorithm a bit (exercise).
• Time T(N) = T(√N) + O(1) = O(loglogN), O(N) processors, O(N loglogN) work.
Waterfall = blocking & divide-and-conquer & raw-power

• Reduce the N elements to N/loglogN elements sequentially in loglogN time using N/loglogN processors (blocking).
• Solve the remaining N/loglogN elements with N/loglogN processors using Algorithm 3-10 (divide-and-conquer & raw-power).
⇒ A work-optimal O(loglogN) time (weak) CRCW algorithm.
Using stronger CRCW models

• STRONG CW has a ready operation for maximum.
• PRIORITY CW can solve maximum easily in O(1) time using O(N+M) processors (M being the size of the key range):

function crcw_priority_max(shared var A : array[0..N–1]);
  shared var maxvalue, winnerindex;
  for i := 0 to max_val pardo
    counts[i] := –1;
  for i := 0 to N–1 pardo
    counts[A[i]] := i;
  for i := max_val to 0 by –1 pardo   // process with largest i will win
    if counts[i] >= 0 then
      maxvalue := i;
      winnerindex := counts[i];
  return (maxvalue, winnerindex);

Algorithm 3-11: Using PRIORITY CRCW for maximum.
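The effect of the priority writes can be simulated sequentially (an illustrative sketch: the PRIORITY CW "largest i wins" rule becomes last-write-wins over the index loop, and the downward value loop becomes a top-down scan; max_val is the largest possible key, as in the pseudocode):

```python
def crcw_priority_max(A, max_val):
    """Simulate Algorithm 3-11: counts[v] ends up holding the index of
    the highest-priority processor whose key equals v; scanning values
    from max_val down, the first occupied slot is the maximum."""
    counts = [-1] * (max_val + 1)
    for i, v in enumerate(A):       # PRIORITY write: largest i wins the slot
        counts[v] = i
    for v in range(max_val, -1, -1):
        if counts[v] >= 0:
            return v, counts[v]     # (maxvalue, winnerindex)
```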
Other similar problems

• Most of the previous algorithms can be used (with small changes) for many similar tasks.
• Especially all problems where the result is atomic and combining is easy.
• Finding, selecting, counting, sum, and, or, etc.
• Or, the algorithms can be used in the opposite direction to spread data.
Prefix sum

• Input: array A[0..N–1] (or [1..N]).
• Result: the array

  (A[0], A[0]+A[1], ..., Σ_{j=0..i} A[j], ..., Σ_{j=0..N–1} A[j]),   (3-2)

  or the "0-prefix sum"

  (0, A[0], A[0]+A[1], ..., Σ_{j=0..N–2} A[j]).   (3-3)
• E.g., (4 5 2 5 6) ⇒ (4 9 11 16 22).
• E.g., (1 0 1 1 0 0 1) ⇒ (1 1 2 3 3 3 4).
• Applications: counting, array/list compression (removing empty elements), load balancing, radix sort, graph algorithms, etc.
• An algorithm similar to maximum finding works for all of these.
• Use blocking to make it work-optimal (exercise).
• Again, synchrony is crucial; array boundaries are more difficult if N is not a power of 2; use another array if the original is needed.

procedure prefix-sum(var A : array[0..N–1]);
  for i := 1 to logN do
    for j := 2^(i–1) to N–1 pardo
      A[j] := A[j – 2^(i–1)] + A[j];

Algorithm 3-12: Basic parallel prefix sum.
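The doubling rounds of Algorithm 3-12 can be simulated sequentially (an illustrative sketch; the copy of the array models PRAM synchrony, where all reads of a round happen before any write):

```python
def prefix_sum(A):
    """Simulate Algorithm 3-12: logN synchronous rounds; in round i every
    position j >= 2^(i-1) adds the value 2^(i-1) places to its left."""
    A = list(A)
    N = len(A)
    d = 1
    while d < N:                    # rounds i = 1 .. logN, d = 2^(i-1)
        prev = list(A)              # synchrony: read the round's old values
        for j in range(d, N):
            A[j] = prev[j - d] + prev[j]
        d *= 2
    return A
```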
Figure 3-4: Data movement in prefix sum. [Figure: successive array contents after each doubling round; data omitted.]
Merging and sorting algorithms

⇒ Parallel sorting can be approached in several ways (as can sequential sorting).
• We'll present:
  • Raw power.
  • Mergesort (with a couple of possible approaches to merging in parallel).
  • Sampling bucketsort.
  • Radix sort.
• Later, we'll present some sorting algorithms suitable for a message-passing environment.
Parallel "bubblesort" (odd-even transposition)

• Compare-exchange odd pairs and even pairs alternately, N rounds in total.
• N/2 processors, 2N = O(N) time, O(N²) work.
Raw power sort (by ranking)

⇒ Presents PRAM at its best and worst!
• Exploits STRONG ADD CRCW.

Compute the correct location of each element at once:
• Count how many smaller elements there are in the array.
• I.e., the rank of each element.
• Ranks are evaluated as in raw-max: compare all pairs, increase the rank of the larger element by one (cf. zeroing the smaller in raw-max).
• Several increments of the same element at once (STRONG ADD CRCW needed).
• After ranking, we know the number of smaller elements for each element, i.e., the location of each element.
• Draws have to be resolved.
• O(1) time, O(N²) processors, O(N²) work.
⇒ Ranks can also be counted in different (more efficient) ways.
Figure 3-5: Direct sorting by ranking. [Figure: input A, rank array V, and the assignment A[V[i]] := A[i]; data omitted.]

procedure raw-sort(var A : array[0..N–1]);
  for i := 0 to N–1 pardo
    V[i] := 0;
  for i := 0 to N–1 pardo      // rank
    for j := 0 to N–1 pardo
      if A[i] < A[j] then
        V[j] := V[j] + 1;      // STRONG ADD CRCW
  for i := 0 to N–1 pardo      // sort
    A[V[i]] := A[i];

Algorithm 3-13: Sorting by raw power.
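Algorithm 3-13 can be simulated sequentially (an illustrative sketch: the STRONG ADD concurrent writes become ordinary increments, index order resolves draws so the ranks form a permutation, and a separate output array replaces the in-place write, which on the PRAM relies on synchrony):

```python
def raw_sort(A):
    """Simulate Algorithm 3-13: rank every element by counting, over all
    pairs, how many elements are smaller; then write each element
    straight to its final slot."""
    N = len(A)
    V = [0] * N
    for i in range(N):
        for j in range(N):
            if (A[i], i) < (A[j], j):   # draws resolved by index
                V[j] += 1               # 'STRONG ADD CRCW' increment
    out = [None] * N                    # stands in for the in-place write
    for i in range(N):
        out[V[i]] = A[i]
    return out
```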
Mergesort

• The actual sort is trivial, presented earlier.
• Merging in parallel is interesting; we'll present a few examples.
• Merging in O(N) time (sequentially): O(N) time full sort (O(N²) work).
• Merging in O(logN) time: O(log²N) time sort.
• Merging in O(loglogN) time: O(logN loglogN) time sort.
• Merging in O(1) (amortized) time: O(logN) time, O(N logN) work.

procedure mergesort(var A : array; first, last : index);
  if (last – first) > 0 then
    pardo
      mergesort(A, first, (last+first)/2);
      mergesort(A, (last+first)/2+1, last);
    merge(A, first, (last+first)/2, (last+first)/2+1, last);

Algorithm 3-14: Mergesort.
Merging by ranking

• We assume elements to be distinct (use the index to resolve draws).
• Let us define the rank of an element x in an array A[0..N–1] as the number of smaller elements in array A.
⇒ Computing the rank is much easier if A is in increasing order (sorted).

  rank(x, A) := max { i : A[i] ≤ x }   (3-4)

• Using one processor: binary search in time O(logN).
• With P processors, we can divide into P+1 parts (P division points) instead of two.
• Thus a parallel "binary search" in time

  T_P(N, P) = T_P(N/(P+1), P) + O(1) = O(log_{P+1} N) = O(logN / logP).   (3-5)

• One processor finds the correct interval, the others follow. Exercise.
• Using raw power, we can find one rank in O(1) time using O(N) processors.
• If needed, we can refine this with one processor writing (instead of returning) and the rest of the processors reading the result.
• CREW suffices.
• Later we'll show how to do this more efficiently.

function raw-rank(x : element; var A : array[0..N–1]);
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to N–2 pardo
      if A[i] ≤ x and x ≤ A[i+1] then
        return i+1;

Algorithm 3-15: Rank in unit time by raw power.
Merging with ranking

• Input: readily sorted arrays A and B (often halves of the same array).
• The rank of element A[i] in array A is i.
• The rank of element A[i] in array B is rank(A[i], B).
• The rank of element A[i] in the final array is i + rank(A[i], B).
• We can place every element into the final array independently!
⇒ For the whole merge, we'll need the rank of each element of A in B, and the rank of each element of B in A.
• This can easily be converted to restore elements back to A and B and/or to merge the halves of a single array.

function rank-merge(A, B : array[0..N–1]) : array[0..N*2–1];
  for i := 0 to N–1 pardo
    C[i + rank(A[i], B)] := A[i];
    C[i + rank(B[i], A)] := B[i];
  return C;

Algorithm 3-16: Direct merge by rank.
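Algorithm 3-16 translates almost directly into Python (an illustrative sketch; with distinct elements, the standard-library bisect_left gives exactly the rank, i.e. the number of smaller elements, in O(logN) per element as in the slides' binary-search ranking):

```python
import bisect

def rank_merge(A, B):
    """Simulate Algorithm 3-16: the final position of A[i] is
    i + rank(A[i], B), and symmetrically for B; every placement
    is independent of the others."""
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):
        C[i + bisect.bisect_left(B, x)] = x   # elements assumed distinct
    for i, x in enumerate(B):
        C[i + bisect.bisect_left(A, x)] = x
    return C
```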
• We need a CREW PRAM since N simultaneous ranking processes read the same array (using binary search) in parallel (though with only a constant penalty on EREW).
• If parallelization and synchronization are done carefully, the merging can be done in place.
• But we need N processors, all of which use O(1) helper space, thus it actually uses O(N) extra space.
• Later, with fewer processors, we need O(N) extra space anyway and have to move elements to/from a helper array.
• Ω(N logN) work.
• A more accurate analysis of rank-merge-sort with P = N, P = N², and arbitrary P is left as an exercise.
Faster merging algorithms

Merging in O(logN) time, O(N) work

• Input: arrays A and B (of length N).
• Choose regularly N/logN elements of B.
• Rank each of these (with sequential binary search) in A (one element per processor, N/logN processors in total).
• Now we have N/logN pairs of subsequences, each of which can be merged sequentially:

  a_1 … a_{j1}  and  b_1 … b_{logn},  where j_i = rank(b_{i·logn}, A)   (3-6)
  a_{j1+1} … a_{j2}  and  b_{logn+1} … b_{2·logn}
  …
  a_{j_{n/logn–1}+1} … a_n  and  b_{(n/logn–1)·logn+1} … b_n

• From the section boundaries, we know the location of the merged section in the new array – the merging tasks are independent.
• On average, the lengths are O(logN), thus the whole algorithm runs in O(logN) time.
• Unfortunately, the subsequences of A can be longer if the data is uneven.
• Either:
  • Symmetric ranking & partitioning:
    • Choose N/logN elements of both A and B.
    • Rank each of these (with binary search) in the other array.
    • Now we have to merge 2×N/logN pairs of sequences of length at most logN.
• Or:
  • Repartition the (few) too large sequences.
Merging in O(loglogN) time, O(N) proc, O(N loglogN) work

• Exploits the more efficient 2-step ranking Algorithm 3-17.
• Take √N regularly spaced samples of each array A and B.
• Rank the samples of A in the samples of B (not in the whole B!).
• √N ranks on √N elements with N processors in O(1) time (raw-rank).
• Same for the samples of B in A (as in symmetric ranking above).
• Now we have 2√N subsequences, but the boundaries are still inaccurate (we only know in which block of the other array the samples belong).
• Rank each sample of A in the subsequence of B it belongs to.
• 2√N ranks on √N-element subsequences with N processors in O(1) time (raw-rank).
• Same for the samples of B in A.
• Now we have 2√N subsequences with accurate boundaries, in O(1) time.
• Apply the algorithm recursively to each of the 2√N subsequences (of average length √N/2) with √N/2 processors for each subsequence.
• T(N) = T(√N) + O(1) = O(loglogN).
function root-raw-rank(x : element; var A : array[0..N–1]) : index;
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to √N–1 pardo
      B[i] := A[i*√N];
    block := raw-rank(x, B);                            // O(1) time with √N proc
    brank := raw-rank(x, A[block*√N..(block+1)*√N]);    // O(1)
    return block*√N + brank;

Algorithm 3-17: Rank in O(1) time with √N processors. (TODO: check indices.)
Merging in O(loglogN) time, O(N) work

• A work- and time-optimal merge!
• N/loglogN processors.
• Partition A and B into blocks of size loglogN.
• Rank the block boundaries (N/loglogN of them) in the other array with the previous algorithm (O(loglogN) time).
• Rank each of the boundaries sequentially within the corresponding subsection of length O(loglogN) (O(logloglogN) time with binary search).
• Now we have accurate boundaries (ranks) of 2×N/loglogN pairs of sequences of length at most loglogN.
• Merge each pair of sequences independently using a sequential algorithm (O(loglogN) time).
• Yields an O(logN loglogN) time, O(N logN) work sorting algorithm.
Odd-even merge

• Batcher 1968: odd-even merge and bitonic merge.
• Input: array halves A and B.
• In practice, the halves of the same array are named A and B for easier reference.
• Merge (recursively) the odd elements of A and the odd elements of B; and merge (recursively) the even elements of A and the even elements of B.
• Merging is done in place.
• After these merges, consecutive pairs may be out of order; we'll check the order of each pair, and swap if needed.
• Merge time: T(N) = T(N/2) + O(1) = O(logN), O(1) space.

Figure 3-6: Odd-even merge [5].

Figure 3-7: Recursion in odd-even merge [5].
procedure Odd-even_merge(A : array[0..N–1]);
  pardo
    Odd-even_merge(halves of odd elements of A);
    Odd-even_merge(halves of even elements of A);
  par i := 1 to N–2 by 2 do
    compare-exchange(A[i], A[i+1]);

Algorithm 3-18: Parallel odd-even merge, informally.

procedure oemerge(var S : array; First, Length, Stride : index);
  if Length = 2 then                  // base case: the recursion must stop
    if S[First] > S[First + Stride] then
      swap(S[First], S[First + Stride]);
    return;
  par i := 0 to 1 do
    oemerge(S, First + i * Stride, Length/2, Stride * 2);
  par i := 1 to Length/2 – 1 do
    j := i * 2;                       // j := 2 to Length–2 by 2
    if S[First + (j–1) * Stride] > S[First + j * Stride] then
      swap(S[First+(j–1)*Stride], S[First+j*Stride]);

Algorithm 3-19: Parallel in-place odd-even merge procedure (FPM).
OEM-sort performance

• Mergesort with odd-even merge exploits at most N/2 processors, executes in O(log²N) time, and thus uses O(N log²N) work, which is inefficient by a factor of O(logN).
⇒ We can improve the efficiency by reducing the number of processors.
• If there are fewer than N/2 processors, we can switch to a sequential sort/merge as soon as we run out of processors.
• The recursive sort branches according to P.
• Also the merging can run out of processors, thus the merge will also branch according to P.
• The time complexity will be

  T(N, P) = O((N/P) × (log²P + log(N/P))).   (3-7)
• In theory, we cannot exploit very many processors efficiently.
• E.g., to ensure 50% efficiency, we would have to settle for

  log²P – logP ≤ logN.   (3-8)

• The same plotted: see Figure 3-8.
• In practice, though, we can efficiently use slightly more processors, as the slow recursion tails are removed if N is clearly larger than P.
• Measured performance on F-PRAM: see Figure 3-9.
Figure 3-8: Maximum efficiently useful P as a function of N, as predicted by Formula (3-8); odd-even mergesort, logarithmic x-axis. [Plot omitted: x-axis = input size N (log scale, 16 … 6.87×10¹⁰), y-axis = maximum efficiently useful number of processors (5 … 50).]
Figure 3-9: Speedup of odd-even mergesort as a function of the number of processors for different input sizes. Both scales are logarithmic. [Plot omitted: curves for N = 256 … 262144, with linear, 50%, and 10% efficiency reference lines.]
Cole's optimal parallel mergesort (1986)

• The first almost practical time- and work-optimal O(logN) sort.
• The first asymptotically optimal one was Ajtai, Komlós, Szemerédi (AKS), 1983.
⇒ In fact, we do not need O(1) time merging; merging with O(1) amortized cost for each phase is sufficient.
• The merge operations in the different stages of the sort can be pipelined.
• We collect samples (border values, a "cover") in the different stages.
• We collect the ranks of the samples in the halves of the data.
• According to the ranks of the samples, we can do the next stage faster.
• Because of large constants, Cole's sort is faster than odd-even mergesort (or bitonic) only if N > 10²¹ [6].
• See, e.g., JáJá or Akl.
Sampling parallel bucketsort

• Let us assume that N >> P.
• Each processor samples its own part of the array.
• The samples are sorted in some fast (parallel) way.
• According to the samples, the processors decide P–1 division points (values).
• Each processor partitions its part of the input to the other processors according to the division points.
• Each processor receives one subsection of the input from all the others.
• Each processor sorts its own section.
• In the shared memory model, we need some amount of additional space.
• In the message passing model, we need all-to-all communication.
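The steps above can be sketched as a sequential simulation (an illustrative sketch only: the sampling rate and the way splitters are picked from the sorted sample are free design choices, not fixed by the slides):

```python
import bisect

def sample_sort(A, P):
    """Simulate sampling bucketsort: sample, pick P-1 splitters,
    partition into P 'processor' buckets, sort each bucket locally,
    and concatenate in splitter order."""
    N = len(A)
    step = max(1, N // (P * P))                  # a few regular samples
    sample = sorted(A[::step])
    # P-1 division points, regularly spaced in the sorted sample
    splitters = [sample[len(sample) * i // P] for i in range(1, P)]
    buckets = [[] for _ in range(P)]
    for x in A:                                  # partition step
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for b in buckets:                            # each processor sorts its section
        out.extend(sorted(b))
    return out
```

In a message-passing setting the partition step is the all-to-all communication mentioned above; bucket sizes are only balanced as well as the sample represents the data.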
Radix sort in parallel

⇒ Probably the fastest sequential sort if the keys are reasonably short and the input is large.
• Sequential time O((m/r)·(n + 2^r)), where m is the key size (in bits) and r is the radix size (bits).
• Sorting in stages:
  • Divide the key into parts.
  • Sort according to the least significant part.
  • Sort according to the next least significant part.
  • ...
  • Sort according to the most significant part.
• The sorts have to be stable, i.e., the order of elements with the same subkey has to be preserved.

Figure 3-10: Sorting in stages. [Data omitted.]
• As each subkey is short (a reasonable number of different possible subkeys), we could use bucketsort.
• As we have a lot of keys (a lot for each bucket), the use of lists in bucketsort gets slow, thus we'll use a slightly different method.
• First count the number of occurrences of each subkey.
• Compute a 0-prefix sum of the count array.
• The prefix sum tells us the position at which each "bucket" will be stored.
• The contents of each "bucket" will be stored in the original order.
• The bucket's prefix-sum location is increased after each assignment.
• If/when the keys are not integers, we'll use the bit representation of the keys: r bits at a time yields 2^r buckets; r is typically 12–20.
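The histogram-and-prefix method just described is, in Python (an illustrative sketch; key_bits and r are parameters of the example, not values fixed by the slides):

```python
def radix_sort(A, key_bits=12, r=4):
    """Sequential radix sort with a histogram and 0-prefix sum: for each
    r-bit subkey, count occurrences, prefix-sum the counts to get each
    bucket's start position, then place elements stably in input order,
    bumping the bucket's position after each assignment."""
    mask = (1 << r) - 1
    for shift in range(0, key_bits, r):      # least significant part first
        counts = [0] * (1 << r)
        for x in A:
            counts[(x >> shift) & mask] += 1
        pos = [0] * (1 << r)                 # 0-prefix sum of the counts
        for b in range(1, 1 << r):
            pos[b] = pos[b - 1] + counts[b - 1]
        out = [None] * len(A)
        for x in A:                          # stable placement
            b = (x >> shift) & mask
            out[pos[b]] = x
            pos[b] += 1
        A = out
    return A
```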
Figure 3-11: Sequential radix sort using a histogram. [Figure: input T1, occurrence counts, 0-prefix sum R, and the stable placement into T2; data omitted.]
Parallelization

• If several processors count occurrences in parallel, the prefix sum needs to be computed over all P×2^r buckets.
• The result is like in Figure 3-12, but a linear (sequential) scan is too slow.

Figure 3-12: Linear scan for radix sort [Culler & al].
Prefix in three stages

• Prefix-sum each row into the last column (2^r×P / P = 2^r time).
• Broadcast all values of the last column to all processors (2^r time, or skip in CREW).
• Prefix-sum the last column.
• Evaluate the final prefix sums by adding in the previous row sums (2^r time).
• The assignment stage on the local input is as in the sequential version.
• The processes can work independently.
⇒ Comparison of different sorts on the CM-5 [Culler & al]: Figure 3-13.
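The value the three stages compute can be checked against a plain sequential 0-prefix over the count matrix (an illustrative sketch; counts[b][p] is assumed bucket-major, i.e. all processors' counts for bucket 0 come before bucket 1, matching the linear-scan order described above):

```python
def matrix_prefix(counts):
    """Reference result for the three-stage prefix: counts[b][p] holds
    processor p's count for bucket b (a 2^r x P matrix); offsets[b][p]
    is the 0-prefix sum in (bucket, processor) order, i.e. the global
    start position where processor p writes its part of bucket b."""
    R, P = len(counts), len(counts[0])
    offsets = [[0] * P for _ in range(R)]
    total = 0                        # sum of all previous rows
    for b in range(R):
        row = 0                      # prefix within the current row
        for p in range(P):
            offsets[b][p] = total + row
            row += counts[b][p]
        total += row
    return offsets
```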