Parallel Computing 25.10.2012 14:51 UEF/cs Simo Juvaste 1 (289)
University of Eastern Finland
Computer Science

Parallel Computing
5 cr, 3621528
Fall 2012

Simo Juvaste
http://cs.uef.fi/pages/sjuva/parallel.html

Sitting placements at the first lectures:
1) Sit within reach of someone (several) else.
2) The whole class must be connected.
Course contents (preliminary)

• Chapter 1: An Introduction to Parallel Computing (p. 3)
  • What?, Why?, How?
• Chapter 2: PRAM (p. 55)
  • A simple model of parallelism
• Chapter 3: Parallel algorithms (in PRAM notation) (p. 85)
  • Basic algorithms, e.g., counting, prefix, sorting, etc.
• Chapter 4: Taking the real world into account (p. 163)
  • Network delay models, memory access models
• Chapter 5: Message passing programming (with MPI) (p. 224)
  • Real parallel programming work.
• Chapter 6: Other stuff (p. 228)
  • OpenMP, Fortran 90, HPF, functional, data flow.
  • GPU programming, CUDA/OpenCL
  • Everyday (especially in a few years) parallel (and concurrent) programming: processes, IPC, shared memory, pthreads, Java threads.
Chapter 1
An Introduction to Parallel Computing

What?, Why?, How?
Some key concepts
Pros, Cons
Other similar terms
Examples
An animal experiment
Design issues
What is Parallel Computing?

⇒ Use several computers to solve a single computational task in parallel!
• Two is better than one.
• One thousand is better than two…
• Think human (manual) work.
⇒ The single task has to be divided into several parts.
• Some tasks are easy to divide, some are not.
⇒ The cooperating computers have to be able to communicate.
• One task, one solution.
• There are many ways to communicate.
⇒ The participating "computers" do not need to be complete!
• Processor, memory, communication medium (processing unit).
• Monitors do not process.
• The whole parallel computer still needs to have some I/O, etc.
What is Parallel Computing?

Example 1-1: A human example: manual sorting of papers:
• Input: a bunch of A4 papers, each having a name.
• Input size: 10, 100, 1000, or 10000 papers (1 mm, 1 cm, 10 cm, 1 m).
• Task: sort the bunch (alphabetically).

One (quick) person alone: [1st exercise in Data Structures and Algorithms]
• 10 papers: 30 s [3 s/paper]
  • method insignificant
• 100 papers: 8 min [5 s/paper]
  • divide into 10 (5-27) substacks according to the first letter, sort substacks, combine
• 1000 papers: 2 h [7 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 100-sort recursively.
• 10000 papers: 25 h [9 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 1000-sort recursively.
• You might want some help...
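The first-letter distribution above is exactly a bucket sort. As a minimal sketch (the helper names, the 26-way split, and the bounds below are illustrative assumptions, not from the course material; names are assumed to start with a letter A-Z):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 100
#define BUCKETS 26

static int cmp(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort n names: distribute into 26 substacks by first letter,
   sort each substack, then concatenate the substacks in order. */
void first_letter_sort(const char *names[], int n) {
    const char *bucket[BUCKETS][MAX_NAMES];
    int count[BUCKETS] = {0};

    for (int i = 0; i < n; i++) {                 /* distribute */
        int b = toupper((unsigned char)names[i][0]) - 'A';
        bucket[b][count[b]++] = names[i];
    }
    int k = 0;
    for (int b = 0; b < BUCKETS; b++) {           /* sort + combine */
        qsort((void *)bucket[b], count[b], sizeof bucket[b][0], cmp);
        for (int i = 0; i < count[b]; i++)
            names[k++] = bucket[b][i];
    }
}
```

The point for this course: the substacks are independent, which is precisely why the manual version parallelizes so well — one helper per letter can sort a substack without communicating with the others.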
What is Parallel Computing?

Parallel manual paper sorting:
• 10, 100, 1000, 10000 helpers!
• Work organization is more difficult than in a single-person sort.
• Exercise 1.
⇒ The important question:
• Will 10 helpers speed up the work 10 times?
  • 10 papers task: no (one helper can help a little).
  • 10000 papers task: yes (at least almost 10 times).
• Will 10000 helpers speed up the work 10000 times?
  • 10 papers task: no.
  • 10000 papers task: no (but we can exploit more than 10 helpers).
  • 100,000,000 papers task: yes (almost)
What is Parallel Computing?

⇒ What is the optimal number of helpers for each number of papers?
• What is the goal? What does optimal mean?
  • Minimal wall clock time?
  • Efficiency (minimal person work hours, i.e., euros)?
  • ?
Little practice

Rules
• Physical messages, writing on a piece of paper
• Written message may include instructions, addresses, data
• Connections to neighbours without standing up
• Sending a message (synchronous communication):
  • Ask the neighbour to receive, wait until he/she is ready
  • Hand out the message, say "here you are"
• Receiving a message:
  • Agree to receive
  • Receive, say "thank you"
• You can see and communicate only with your neighbours.
• Local operations are unlimited
Little practice

Tasks
• Max, count, search (single value, pattern), sum, sort, ...

Algorithm?
• For the above rules?
• For different rules?
• Without rules (but no magic)?
Little practice

Physical conditions/restrictions (i.e., challenges):
• Open hall, no restrictions
• Coordination: loudspeakers (for leaders), person-to-person communication, guidance painted on floor, rehearsal, etc.
• "Cluster" of two door-connected halls?
• Sitting here, no person movement allowed.
• Paper delivery only to neighbours vs. anyone?
• Only one paper at a time vs. a bunch at a time
• How to benefit from use of a blackboard or an electronic message board?
• How to benefit from shouting?
• Without sight contact to neighbours.
• Load balancing (fast and slow workers)
• Fault tolerance (temporary, permanent)
Why is parallel computing needed?

⇒ Why are computers needed?
• Because computers can compute (calculate) fast and they can have huge memory.

Why is an i7 at 3.50 GHz (2003 slides: 3 GHz) not enough???
• Computing power will ~double every ~two years. ["Moore"]
• Intel/AMD 4/6/8-core processors at 2-4 GHz are very cheap (from 100 e)!
• 20 years ago governments would have paid millions for a 2012 PC.

What else do we need?
• Humans are greedy and impatient...
• Some tasks are too demanding and urgent to be computed by one processor only.
• Some tasks are more valuable the more computing power we can use on them.
Why is parallel computing needed?

What is so demanding and urgent?
• Word processing?
• WWW-surfing?
• Bank / stock exchange?
• eCommerce?
• Gaming?
• Real world simulation!
  • Matter consists of very tiny particles!
  • Every visible piece consists of very many particles.
  • We cannot simulate every (sub)atomic particle for a large (visible) object!
⇒ But: the smaller the particles we can simulate, the more accurate a simulation we have!
• Smaller particles ⇒ more particles ⇒ more calculations to do!
⇒ Unbounded amount of calculations!
Why is parallel computing needed?

Why do we want to simulate the real world?
• "Test" a piece of equipment without building it.
• Prediction of natural phenomena.
• Prediction of consequences of changes.
• "See" artificial things.
• Optimizing structures or models.
Why is parallel computing needed?

Example: weather forecasts
• History data, constants, measurements.
• Simulation of the future movement of air particles.
• Simulation of physical changes (temperature, pressure, humidity, velocity, etc.) of air in the atmosphere.
• Huge amounts of molecules move and interact quickly for several days.
• Incomprehensible amount of calculations.
Why is parallel computing needed?

• Resolution reduction:
  • 50×50×1 km (×5 min) block of air as 1 entity.
  • Penalty: accuracy and reliability are reduced.
• Forecast as far into the future as possible.
  • Unfortunately: inaccuracies multiply.
⇒ A more powerful computer or more time immediately yields more accurate forecasts (and longer forecasts).
⇒ (Reliable) weather forecasts are very valuable!
• In real forecasts, the models exploit grid-wide differential equations instead of local simulation...
Block size (km),   Gflop/s needed for "real" time     Gflop/s needed for
height 0.5 km      simulation (2 minute steps)        5 days in 2 hours
1                  1 804 492                          108 269 544
3                  21 762                             1 305 732
10                 241.7                              14 503
Why is parallel computing needed?

• A late forecast is worthless.
• Finnish Meteorological Institute: (about)
  • 7.5×7.5×0.3 km (×6 min) Canada .. Ural, 3-10 days
  • 2.5×2.5×0.? km (×? min) Sweden .. Finland
  • (44 km -> 7.5 km in 14 years)
• Cray XT5m, 656 × 6-core Opteron, 35 TFLOPS (theoretical)
Why is parallel computing needed?

⇒ Conclusion
• We want as powerful a computer as possible!
• We are willing to pay for it.
⇒ Unfortunately
• No IA256 @ 300 GHz ever(?) (until 2030+?)
• Even if we pay all the money in the world.

Thus
⇒ We'll use several processors to achieve more computing power.
• Finnish CSC currently (louhi.csc.fi): Cray XT4/5
  • 2716 × (4-core 2.3 GHz Opteron, 4-8 GB, 25 GB/s)
  • Theoretically 102.3 TFlop/s, measured 76.5 TFlop/s (Linpack)
  • http://www.csc.fi/english/research/Computing_services/computing
  • see "Current parallel computers (briefly)" p. 37
• Ordered: Cray Cascade (10 Me, 1 PFLOPS?)
Why is parallel computing needed?

Other applications for processing power (parallelism)
• Huge databases, urgent queries, data mining
• Digital signal/image/video processing
• Complex user interfaces (virtual reality, games)
• DNA modelling
• DNA matching
• Molecular modelling
• Environmental modelling (storms, pollution, earthquakes, sea currents)
• Astronomical modelling
• Optimization (aero/hydrodynamics, etc.)
• Structure strength calculations (car crash simulations, etc.)
• Cryptoanalysis
• Pattern recognition, audio/image surveillance
• Data mining/indexing/classification
• Artificial intelligence
• Measurement data analysis and modelling (sensor values to big picture)
Some key concepts

Example 1-2: Building a small house:
• One skilled man can build a house in one year
• Two skilled men can do it in about half a year
• 12 men, one month: requires very careful planning (at least)
• 365 men, one day: probably impossible
• 1 million men, 10 seconds: definitely impossible
Some key concepts

⇒ How to coordinate the fast (1-5 day) parallel building of a house?
• Skilled workers
• Synchronization of work
• Partly independent components (roof, walls, etc.)
• More than one (levels of) leader(s)
• Good instructions and communication
• Detailed plan available to all (at least many) workers
  • Problem: a single plan will be crowded
  • Solution: local partial copies of the plan
Some key concepts

⇒ Lessons learned:
• Parallelization possibilities depend on the problem (ditch vs. well)
• Communication and coordination are vital
• Access to a SHARED plan with local copies is a fairly good communication method
⇒ There is a limit on the efficient number of workers.
• Key concepts:
  • speedup, extra work, efficiency
Some key concepts

Example 1-3: Example: which one to choose?

Think BIG!
• Great Wall of China (in a day?)
  • 5 mm / ~300 kg of wall for each Chinese
• Great Pyramid of Giza (in ???)
  • ~60 kg for each Egyptian

Labour    Calendar time   Speedup   Work      Labour expenses   Efficiency
1 man     1 year          1.00      1.00 my   48,000 e          1.00
2 men     7 months        1.71      1.17 my   56,000 e          0.86
4 men     4.5 months      2.67      1.50 my   72,000 e          0.66
365 men   5 days          73.00     5.00 my   240,000 e         0.20
Some key concepts

Limits of parallelization
• Can we speed up a computation infinitely by adding more and more processors?
  • Not infinitely; most problems have a lower time bound (usually (poly)logarithmic, with a polynomial number of processors).
• In practice, the limit is money.
  • Hard problems are huge (input size (N) is large).
  • Huge problems have a lot of potential parallel parts.
  • E.g., a high-rise building vs. a single-family house.
  • Small problems are fast enough with one processor.
• In theory, the limit is 3-dimensional space and the speed of light (we cannot reach an exponential number (as a function of time) of processors) (T(N,P) = Ω(P^(1/3−ε))).
Some key concepts

Speedup (nopeutus), work (työ), efficiency (tehokkuus, hyötysuhde)
• An optimal sequential (uniprocessor) algorithm time = Ts(N).
• Parallel algorithm with P processors, time = Tp(N,P)
• Speedup is defined as the ratio Ts/Tp
• Speedup Ts/Tp = O(P)
  • I.e., superlinear speedup is not possible, as it would imply a faster sequential algorithm.
• Work (used resources) = Tp × P.
• If Tp × P = O(Ts), the algorithm is work optimal (työoptimaalinen).
• Tp × P = o(Ts) is impossible!
Some key concepts

Amdahl's law on serial fractions within parallel programs
• If an algorithm has an (inherently) serial part that will not be parallelized, it will limit the whole parallelization.
• Or, if we do not bother to parallelize some difficult part.
• Whole algorithm (serial) time T, sequential fraction α (0..1).

    T(N,P) = αT + (1−α)T/P                                                  (1-1)

    Speedup(P) = T / (αT + (1−α)T/P) = 1 / (α + (1−α)/P) → 1/α, when P → ∞  (1-2)

    Efficiency(N,P) = T / (P(αT + (1−α)T/P)) → 1/(Pα + 1)  (P → ∞)          (1-3)
Some key concepts

Possible goals for speedup and/or efficiency
• As fast as possible.
  • No matter how many processors.
  • For most problems, there exists a (poly)logarithmic-time ((log n)^k) algorithm (very fast!).
• As good efficiency as possible.
  • Unfortunately, the sequential algorithm is always the most efficient.
⇒ As fast as possible while maintaining (asymptotically full, or given) efficiency.
• Something in between, or in real life:
  • In a given time, with as few (and cheap) processors (and other resources) as possible.
  • With a given number of processors (and other resources), as fast as possible.
Some key concepts

Brent's theorem
• If our algorithm works with P processors in time T, we can execute it with P' < P processors in time T × P/P'.
⇒ We can always design algorithms for as many processors as possible/efficient. The algorithm will work nicely with fewer processors.
• Even if we won't have thousands of processors, multithreaded processors work more efficiently with more threads.
⇒ In some cases, though, an algorithm that is designed for fewer processors may be more efficient.
Some key concepts

What is so difficult in parallel programming?
• Sometimes even sequential programming is difficult.
• In parallel programming we have to manage several processors, each of which must work correctly.
• The processors must communicate correctly.
• Some problems are easy to parallelize, some difficult or inefficient.
⇒ Parallel programming is difficult.
⇒ We often need more abstraction levels than in sequential programming.
• Concentrate on data and operations on data.
Some key concepts

Parallelism is natural!
• In fact, sequential order is (sometimes) artificial.
• A "typical" algorithm segment:

    for each elem in array A do
        elem ← elem × 2

• A sequential programmer implements:

    for (i = 1; i <= A; i++)
        A[i] = A[i] * 2;

• Why serialize an originally parallel (simultaneous) operation?
• Sometimes serialization might be a source of errors.
• A parallel version can be flexibly implemented with 1..N processors.
• The real world is concurrent (and very parallel) anyway.
• Parallelism is (almost) as old as Life.
Some similar terms (that are sometimes mixed up)

Distributed System (hajautettu järjestelmä)
⇒ A distributed system is a collection of autonomous computers linked by a computer network that appears to the users of the system as a single computer.
• The machines are autonomous; this means they are computers which, in principle, could work independently;
• Separate computers work concurrently, without a global clock, and may appear, fail and recover independently.
• The user's perception: the distributed system is perceived as a single system solving a certain problem (even though, in reality, we have several computers placed in different locations).
⇒ Each part of the distributed system may be a part of (i.e., participate in) several distributed systems.
• Not part of this course.
Some similar terms (that are sometimes mixed up)

Distributed computing (hajautettu laskenta)
• Term often used when several computers (often geographically distributed) are used to compute a single computational problem in parallel.
• Message passing programming; tolerate long and/or unpredictable delays, low bandwidth.
• E.g., SETI@home, distributed DNA matching, etc.
• The boundary between parallel and distributed computing depends on the speaker.
• Sometimes, "distributed computing" is used of "distributed systems".
• "Grid computing".
• Part of this course.
Some similar terms (that are sometimes mixed up)

Concurrent system (samanaikainen)
• Things occurring apparently simultaneously.
• In reality, only one (process, etc.) is executing at a time, and the process is changed frequently enough.
  • E.g., processes in a multitasking OS execute at ~10 ms time slices.
• Can also occur really simultaneously in multiprocessor systems.
• Concurrency is defined with respect to a slow observer (human).
• Order of concurrent events is nondeterministic.
• Can be (usually is) implemented using time-sharing (sometimes several processors).
• Tasks are not necessarily (tightly) related.
• Parallel and distributed systems are concurrent by nature.
  • Processes in different computers execute simultaneously.
  • The communication in asynchronous distributed systems is concurrent.
  • To achieve most flexibility and performance, the processes (computers, software) that participate in a DS are usually concurrent (multithreaded).
• Concurrency theory (or practical handling) is not part of this course.
Some similar terms (that are sometimes mixed up)

Multithreading (säikeistys)
• The standard mechanism to implement a concurrent process (one process)
• As opposed to distinct processes, the threads of a single process share the same data.
• Not part of this course.

Multithreading according to processor manufacturers
• Processor includes special circuits to execute several processes simultaneously.
• Depending on the implementation, the processes may execute at full speed, or at slightly lower speed.
• Benefit: more efficient utilization of functional units.
• OS (and processes) "see" several processors.
• E.g., Intel HyperThreading(tm), SUN CMT.
• Relates to this course.
• See Processor multithreading (p. 44).
Some similar terms (that are sometimes mixed up)

Distributed operating system
• Single system image (for the user) over several computers.
• User will not know in which physical computer their processes run.
• Automatic job/process distribution, balancing, migration.
• "Grid computing"
• E.g., Mosix
Some similar terms (that are sometimes mixed up)

Parallel computation/computer (rinnakkaislaskenta, -tietokone)
• Use several processors/computers to solve a single computation in parallel.
• The only goal is to make hard computing faster.
• Up to P times faster using P processors.
• Useful (only) if we are in a hurry (simulation/forecast, real-time applications).
• A parallel computer often has dozens..thousands of similar processors with a tight interconnection and often a (virtual) shared memory.
Some similar terms (that are sometimes mixed up)

Parallel, distributed, and concurrent systems and programming have a lot in common.
• Task division.
• Interprocess communication, dividing data.
• Nondeterminism.
• Synchronization challenges.
• Deadlock possibility.
• Load balancing.
• Error possibilities, fault-tolerance techniques.
⇒ Hardware, tools, and goals differ.
• In this course, we concentrate on parallelism, but we might have something (threads, processes) on concurrency.
Current parallel computers (briefly)

SMP (Symmetric MultiProcessor)
• 2-16 (-64) processors on the same memory bus (or switch).
• Several banks of memory.
• Each processor has its own cache (to reduce bus traffic).
• Not a very scalable approach (as a bus; a bit more with a switch).

Figure 1-1: Bus-based SMP computer. [diagram: processors with caches on a central system bus, with memory banks and I/O]
Current parallel computers (briefly)

• E.g., cs: Sun M4000 (2× 4-core SPARC64 VII 2.4 GHz).
• In larger units (P ≥ 8-16), processors are usually clustered.
• Processors do not communicate directly; memory is used for communication.
• Usually used to improve throughput in a concurrent system; can be used for parallel computation as well.

Figure 1-2: Crossbar-based SMP computer. [diagram: processors with caches connected through a crossbar to memory banks and I/O]
Current parallel computers (briefly)

Why parallel (once again) [Gordon Moore, ISSCC 2003, www.intel.com]
[Moore's-law figures omitted]
Current parallel computers (briefly)

Multicore SMP, SMT, CMT
• As the silicon manufacturing process improves, more and more transistors can be fitted on a chip (mainframe/supercomputer: on a board).
• How to use the exponentially growing transistor count efficiently?
  • 1940's to 70's: more and more bit-parallelism and instructions.
    • Eventually diminishing returns.
  • (70's), 80's, 90's: deeper pipelining, wider superscalar.
    • Usefulness of deeper pipelines and wider superscalar is limited by code/compilers; eventually diminishing returns.
  • Since late 80's: more and more cache to balance slow memory.
    • The difference between 2 MB and 4 MB L2 caches is small in speed, but the cache has more transistors than an ALU; eventually diminishing returns.
  • Since mid 2000's: more cores.
    • (And more integration for cheap PCs)
  • Same transistor count: 6000× i386 or a single 2-core Itanium 2!
Current parallel computers (briefly)

• Multicore SMP means several CPUs within a single silicon chip.
• Each CPU has its own ALU(s), L1 (& L2) cache, usually also an FPU.
• CPUs share the L3 (& L2) cache, MMU, and external connections.
• Multicore benefit
  • P times the processing potential for approx. the same price
• Drawback
  • Memory and I/O bandwidth do not increase accordingly; eventually diminishing returns.
Current parallel computers (briefly)

• Sun UltraSPARC IV processor [www.sun.com]
[processor block diagram omitted]
Current parallel computers (briefly)

Processor multithreading
• Each core executes several processes (threads).
• Reduces the impact of memory latency by making each virtual processor slower.
• Sun UltraSPARC T3
  • 16 cores, 8 threads each → OS sees 128 threads ("processors")
• Cray XMT
  • 128 threads per processor.
Current parallel computers (briefly)

⇒ Multicore is mainstream now (2006 slides: "soon").
• XBox 360
  • CPU: triple-core PowerPC, two threads each (total 6 threads)
  • GPU: 48 ALUs
• Playstation 3
  • 8 VLIW processors (APU), each 4+4 pipelines = 256 pipelines.
• Intel
  • Since 2003: Hyperthreading provides 2 virtual processors for the OS
  • 8-core i7/Xeon (multi-chip)
  • Dual-core P4 at 2005, quad-core at 2007, 48?-core at 2010.
• AMD 2*8-core Opteron, dual-core Athlon at 2005, quad at 2007.
• SUN/ORACLE quad-core SPARC61 VII, 16-core T3
  • SUN dual-core UltraSPARC IV at 2004, 8-core T1 at 2006.
• IBM 8-core POWER7, dual-core PPC970 at 2004.
• Nvidia Kepler: 1536 cores, up to 96 threads/core, 500 e.
⇒ Nowadays, we can assume that our software is run mostly on parallel machines!
Current parallel computers (briefly)

Vector (super)computers
• Classical supercomputers since Cray 1 in 1977.
• 1-32 (more clustered) extremely powerful processors.
• Each up to 100 GFLOPS (2008).
• ~8 MUL-and-ADD floating point operations / clock cycle / processor
  • E.g., dot product
• Requires several long (1000-element) arrays (vectors) for peak performance.
• On each clock cycle, up to 16 words (64 B) from/to memory.
  • Average PC: 0.1 .. 1 B/cc
• No caches, but hardware prefetch (very deep pipeline) and very wide memory channels (and SRAM memory).
• Cray, Hitachi, Fujitsu, NEC.
• Very expensive, even per FLOPS.
• Nearly extinct in the original form; current implementations approach MPPs, see below.
• NEC SX-9: 100 GFLOPS/proc, 256 GB/s memory bandwidth/proc
  • http://www.nec.com/de/en/prod/servers/hpc/material/255_e_sx9.pdf
Current parallel computers (briefly)

MPP (Massively Parallel Processing)
• Tens..thousands of processors.
• Each processing node is a 1-4 processor SMP with memory.
• Separate I/O nodes.
• Processing nodes connected by an interconnection network; topologies vary.

Figure 1-3: A 64-node 3D mesh, a 32-node binary hypercube, and an 80-node butterfly (with 16 input/output nodes).
Current parallel computers (briefly)

• Usually hardware supports virtual shared memory.
• Scales enough (can be built to consume any budget).
• Communication network is expensive (up to half of the machine cost).
• Special purpose machines can be tailor-designed to balance the costs of subsystems (processors, memory, bandwidth, I/O) with the given task.
• General purpose computers provide compromises between price and interconnection and memory performance.
• E.g., (ILLIAC IV), Thinking Machines CM-1, -2, -5, Cray T3E, XT4/5, XE6, Digital (HP) Alphaserver SC, IBM eServer, Intel ASCI Red, SGI, etc.
Current parallel computers (briefly)

NOW (Network of Workstations)
⇒ Personal workstations are 99% idle (nights, editor usage).
• Free cycles can be used by: nice compute
• "Free" (unused) computing power:
  • cs department: 400 PCs × 3 GFLOPS = 1.2 TFLOPS.
  • UEF: 5000 PCs × 3 GFLOPS = 15 TFLOPS.
  • Finland: 1.5M PCs × 2 GFLOPS = 3 PFLOPS > Blue Gene.
• Ordinary Unix (WinNT) workstations, TCP/IP connection.
• A switch ... LAN ... WAN ... Internet.
• Sometimes (nowadays) also a dedicated cluster (ryväs).
  • 1(0) Gb Ethernet, Infiniband, ATM, FC, or Myrinet; no displays, etc.
  • Blade racks to save space, reduce loose wires.
⇒ Slow(ish) communication restricts algorithm choice.
⇒ Cheapest FLOPS because of mass production!
• See exercise 4-5.
Current parallel computers (briefly)

Parallel architectures seem to converge towards each other.
• In SMP computers the buses are replaced by clustered networks.
• Vector supercomputers are implemented in CMOS, use caches and DRAM, P increases, nodes are clustered (memory performance degrades, or there is no shared memory anymore).
• Vector techniques and virtual shared memory are used in MPP computers.
• Multithreading and multicore are used in CPUs and GPUs.
• Workstations (or server computing nodes) have parallel vector units.
• MPP computers are built from commodity parts like NOWs.
• Dedicated "NOWs" are used for parallel computation.
• Several (even heterogeneous) computers are connected for joint work (grid computing).
• Blade server racks look like a mainframe...

Current top computers: http://www.top500.org/
Current parallel computers (briefly)

IBM Sequoia - BlueGene/Q
• 98,304 * 16-core PowerPC
• 16 PFLOPS, 7900 kW

Tianhe-1A
• http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=678988&releasejsp=release_157
• 7,168 NVIDIA Tesla M2050 GPUs
  • 448 cores each ⇒ 3.2M cores
  • ~1 GFLOPS / core ⇒ 500 GFLOPS / GPU
  • But only 3 GB memory / GPU
  • ~3.5 PFLOPS theoretical, 2.5 PFLOPS LINPACK
  • tens of threads / core = tens of millions of threads!
• 14,336 Xeon CPUs.
Current parallel computers (briefly)

Additional bonus on parallel computers
• As we can have unlimited performance via parallelization, we do not need the fastest processor. Instead, we'll select the best by performance/price. (www.verkkokauppa.com 2010)
• Not quite as simple as GFLOPS/e.
  • We need more than processors (motherboards, network cards, switches).
  • The algorithm may be less efficient with more processing nodes.
  • See exercises 4-5.

Intel Core 2 Duo E7500 2×2.9 GHz, 3 MB       118.90 e
Intel Core 2 Quad Q8400 4×2.66 GHz, 6 MB     151.90 e
Intel Core 2 Quad Q9650 4×3.0 GHz, 12 MB     330.90 e
Intel i5-760 4×2.8 GHz, 8 MB                 193.90 e
Intel i7-950 4×3.06 GHz, 8 MB                514.90 e
Intel i980X EE 4×3.3 GHz, 12 MB              989.90 e
Intel Xeon X7460 6×2.66 GHz, 16 MB           2578.90 e
Chapter 2
PRAM

A simple model of parallelism
PRAM programming
PRAM physical implementation possibilities

⇒ PRAM is used to avoid dirty details.
PRAM shortly

How was PRAM born?
⇒ A familiar computer abstraction (for programmers, etc.):
• RAM (Random Access Machine)
  • A processor
  • A memory
• Procedural (or OO) programming, especially variables.
• Not quite accurate anymore, but good enough.

Figure 2-1: RAM (Von Neumann). [diagram: a processor connected to a memory]
PRAM shortly

A natural extension:
• PRAM (Parallel Random Access Machine)
• Fortune and Wyllie 1978, many others
⇒ Increase the number of processors.
• All processors can equally access the shared memory.
⇒ Programming is like RAM, except memory (variables) is shared.
• All processors have to be programmed.
• Memory access conflicts have to be avoided.

Figure 2-2: The structure of the PRAM model. [diagram: P processors P1..PP performing read/write operations from/to a word-wise accessible shared memory]
Why PRAM is good:
• Simple and strong model.
  • If a parallel algorithm can be done at all, it can be done for PRAM.
• Resembles real computers (like RAM).
• Flexible: tens of different variations.
• Generally used.
  • Most parallel algorithms are designed for PRAM.
  • An existing set of algorithms and other theory.

Why PRAM is bad:
• A P-port shared memory cannot be built (easily).
• Real-world delays are ignored.
• Does not account for building costs.
• Does not guide towards saving resources.

Still:
• A handy tool (abstraction) for research and teaching.
• Algorithms can be adapted for real computers.
PRAM models

Processors are processors, brand does not matter.
• If needed, we can define each processor (processing node) to have local memory and I/O.
  • Especially the program can be stored as local copies, but as a plain model, it does not matter.
• Usually we assume the same program but own program counters at every processor (MIMD, multiple instruction stream, multiple data).
• SIMD (single instruction stream) is an option for cheaper implementation.
The shared memory in PRAM is interesting.
• To operate efficiently, the processors need to be able to exploit the memory.
  • Up to a read/write at every clock cycle by every processor.
• Is it possible/feasible to define/implement a memory that can handle P simultaneous memory accesses every clock cycle?
  • It is easy to define.
  • It is attractive to use.
  • It might be possible to implement (with some tricks).
  • It is not currently feasible to implement, though.
• For a while we assume that it is possible, and we'll exploit it to achieve the easiest possible parallelism.

Processor - memory speed comparison (Random Access Machine):
• 8 bits/DRAM chip, 50 ns random access latency, 3 GHz 64-bit processor:
  • 3 × 50 × 64 / 8 = 1200 DRAM chips/processor for full random access of one word at every clock cycle!
• Actually, modern (SD)RAM should not be considered as RAM...
PRAM memory model
• A single memory, indexed memory locations (e.g., 1..m).
  • m usually "unlimited" (as in RAM).
• Each memory reference (read/write) is done in unit time (O(1), 1 cc).
  • Also, all other machine instructions take 1 clock cycle.
⇒ What if simultaneous memory references hit the same memory bank or even the same memory location?
• Simultaneous: on the exactly same clock cycle, no timesharing possible within a clock cycle. Also called concurrent.

Same bank, different address:
• For the model, there is no such problem.
• For a real implementation, we need more circuitry and/or tricks (see below).
Several simultaneous memory references to the same memory address:
• The references could possibly be combined.
  • Write requests: something is written.
  • Read requests: the result is copied to all accessing processors.
⇒ In a model, we just define what will happen.
• Several simultaneous reads is a strong operation, but very easy to define.
• Simultaneous read(s) and a write can be defined as, e.g., every write occurring before every read (two stages = O(1)).
• Several simultaneous writes are much more difficult to define.
  • Each memory location will always contain only one value.
⇒ In the PRAM model, these are considered as model variations.
PRAM variations
• The memory models differ on the restrictions/results on what can happen at a single memory location at a single clock cycle.
• If the restrictions are violated, the whole machine halts immediately (in a model), or results are unknown (in real life).
E/C/O × R/W
• EREW (Exclusive Read, Exclusive Write)
  • Both several simultaneous reads and writes are forbidden.
• CREW (Concurrent Read, Exclusive Write)
  • Several processors may read simultaneously, but writing is allowed to one processor at a time.
• CRCW (Concurrent Read, Concurrent Write)
  • An unlimited number of reads and writes are permitted simultaneously.
  • The result of simultaneous writes has to be resolved somehow, see below.
• CROW (Concurrent Read, Owner Write)
  • Each memory location is owned by a processor, others may only read it.
• ERCW (Exclusive Read, Concurrent Write)

CW variation examples
• On concurrent access to a single memory location.
• In ascending (partial) order of strength.
• WEAK
  • Only simultaneous writing of zeroes is allowed.
• COMMON
  • Only simultaneous writing of the same value is allowed.
• TOLERANT
  • Nothing happens if several processors try to write simultaneously.
• COLLISION
  • A special collision symbol is written if several processors try to write simultaneously.
• COLLISION+
  • A special collision symbol is written if several processors try to write different values simultaneously. (see COMMON)
• ARBITRARY
  • Some (random) value survives if several processors try to write simultaneously.
• PRIORITY
  • The processor with the lowest PID will succeed, others fail.
• STRONG
  • A combination of the values is written,
  • e.g., ADD&WRITE, AND&WRITE, PREFIX-SCAN
  • Different variations have been suggested.
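To make the write rules concrete, here is a small simulation of my own (the function, mode names, and the STRONG-ADD combination are my illustrative choices, not course code): given all the (PID, value) pairs that hit one location on the same clock cycle, it returns the value the location holds afterwards.

```python
COLLISION = object()  # stand-in for the special collision symbol

def resolve_cw(writes, mode):
    """writes: list of (pid, value) pairs arriving on the same cycle."""
    if len(writes) == 1:
        return writes[0][1]          # no conflict, every model agrees
    values = [v for _, v in writes]
    if mode == "COMMON":
        if len(set(values)) != 1:    # different values: machine halts
            raise RuntimeError("restriction violated")
        return values[0]
    if mode == "COLLISION":
        return COLLISION             # conflict always leaves the symbol
    if mode == "PRIORITY":
        return min(writes)[1]        # lowest PID wins
    if mode == "STRONG-ADD":
        return sum(values)           # ADD&WRITE-style combination
    raise ValueError(mode)
```

For example, under PRIORITY the value written by the lowest PID survives, while a STRONG ADD&WRITE model stores the sum of all written values.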
Examples of potency differences:
• Spreading a word to every processor (or to P memory locations).
  • CREW: every processor reads the same memory location: O(1)
  • EREW: the value is doubled (as in a binary tree) until all processors have read it: O(log P)
• Maximum of an array.
  • CREW: O(log N)
  • WEAK CRCW: O(1)
• Sorting
  • EREW: O(log N)
  • STRONG CRCW: O(1)
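The EREW doubling trick can be sketched as a round-by-round simulation (my own sketch, not course code): every informed cell copies the value to one distinct uninformed cell, so the informed set doubles each round and ceil(log2 P) rounds suffice without any concurrent read.

```python
def erew_broadcast(value, P):
    """Spread one value to P cells without concurrent reads or writes:
    each round, cell i (i < have) writes its copy to cell i + have,
    so sources and targets are all distinct (EREW-legal)."""
    cells = [None] * P
    cells[0] = value
    have, rounds = 1, 0
    while have < P:
        for i in range(have):        # on a PRAM, all these run in parallel
            j = i + have
            if j < P:
                cells[j] = cells[i]
        have = min(2 * have, P)
        rounds += 1
    return cells, rounds
```

With P = 8 the value reaches every cell in 3 rounds, matching the O(log P) bound.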
PRAM "programming"

⇒ As in sequential programming, we'll use several abstraction levels.
• Describe the algorithm in a natural language and a picture.
• Describe the algorithm in an algorithm notation.
• Transform the algorithm to adapt to real-world (machine and programming environment) restrictions.
• Write the algorithm in a programming language.
• Compile the program into machine language.

(Data)parallel algorithm notation

⇒ As sequential, with an additional statement to express parallelism:

  for i ∈ 1..N pardo  // or, e.g., foreach element in A pardo        1
    statement;        // e.g., if A[i] = 0 then A[i] := ...          2

• statement is executed once for each value of i (1..N) (as in a sequential for-do).
• All N executions are done in parallel, if we have at least N processors.
• Time complexity:
  • Tst + O(1) if we have enough processors (Tst = time of a single statement).
  • Tst × N/P + O(1) if we take P into account.
• Remember Brent's theorem (p. 27).
⇒ Different parallel executions may not disturb each other.

  for i ∈ 1..N pardo                                                 1
    A[A[i]] := A[i];  // result very unclear, not allowed!           2

• If we need local variables (memory), we can use the keywords private and shared to clarify the situation.
⇒ Creative freedom is allowed in algorithm notation as long as exactness and comprehensibility are maintained.
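One way to make the "no disturbance" rule concrete is to give pardo snapshot semantics: every parallel instance reads the state as it was before the step. The sketch below is my own sequential stand-in, not course code.

```python
def pardo(n, body, state):
    """Sequential stand-in for 'for i in 0..n-1 pardo': all instances
    read a snapshot taken *before* the step, so they cannot disturb
    each other even when writing back into the same array."""
    snapshot = list(state)
    for i in range(n):               # conceptually, all i execute at once
        body(i, snapshot, state)

def shift_left(i, old, new):
    # new[i] depends only on the pre-step snapshot -> well defined
    new[i] = old[(i + 1) % len(old)]

A = [1, 2, 3, 4]
pardo(len(A), shift_left, A)         # A becomes [2, 3, 4, 1]
```

Note that a naive sequential in-place loop would instead propagate already-overwritten values, which is exactly the kind of disturbance the notation forbids.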
  procedure Odd-even_mergesort (A : array[1..N]);                    1
    if Processors = 1 then                                           2
      Sequential_mergesort(A);                                       3
    else                                                             4
      par i = 1 to 2 do                                              5
        Odd-even_mergesort(i:th half of A);                          6
      Odd-even_merge(halves of A);                                   7
    synchronize;                                                     8

  procedure Odd-even_merge (A : array[1..N]);                        9
    if Processors = 1 then                                           10
      Sequential_merge(A);                                           11
    else                                                             12
      par i = 0 to 1 do                                              13
        Odd-even_merge(halves of odd/even (2n+i) elements of A);     14
      par i = 2 to N–1 by 2 do                                       15
        pipelined_compare-exchange (A[i], A[i+1]);                   16
    synchronize;                                                     17

Algorithm 2-1: Parallel odd-even mergesort, informal version.
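For reference, the same recursive structure can be run sequentially. The sketch below is mine (assuming N is a power of two) and follows Batcher's classic odd-even merge formulation; each par-loop of Algorithm 2-1 becomes an ordinary loop here.

```python
def odd_even_merge(A, lo, n, r):
    """Merge the subsequence A[lo], A[lo+r], ... (n elements from lo,
    stride r). On a PRAM, all compare-exchanges of one level of this
    recursion run in parallel."""
    step = r * 2
    if step < n:
        odd_even_merge(A, lo, n, step)        # even-indexed subsequence
        odd_even_merge(A, lo + r, n, step)    # odd-indexed subsequence
        for i in range(lo + r, lo + n - r, step):
            if A[i] > A[i + r]:
                A[i], A[i + r] = A[i + r], A[i]
    elif A[lo] > A[lo + r]:
        A[lo], A[lo + r] = A[lo + r], A[lo]

def odd_even_mergesort(A, lo=0, n=None):
    if n is None:
        n = len(A)                            # n must be a power of two
    if n > 1:
        m = n // 2
        odd_even_mergesort(A, lo, m)          # the two halves are disjoint:
        odd_even_mergesort(A, lo + m, m)      # on a PRAM they run in parallel
        odd_even_merge(A, lo, n, 1)
```

Since the network of compare-exchanges is fixed in advance, the same code describes both the sequential and the parallel execution order.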
Parallel programming languages

⇒ The variety is huge, few established standards.
• We'll describe some real languages/standards later on.
⇒ PRAM programming on paper (or with a PRAM emulator) can be done as easily as moving from sequential algorithms to sequential programs.
• Local and shared variables.
• Processor-ID (PID) to distinguish between processors.
• Synchronization.
• I/O is either forgotten, or we'll use parallel I/O.
• Example: (Parallel Modula-2 for F-PRAM)

  procedure oemerge(sharedvar S : array of word;
                    Start, Length, Stride : word);                   1
  var a, b : word;                                                   2
      i, j, k, Length2 : register word;                              3
  begin                                                              4
    Length2 := Length / 2;                                           5
    par i := 0 to 1 do                                               6
      oemerge(S, Start + i * Stride, Length2, Stride * 2);           7
    end;                                                             8
    par i := 1 to Length2 - 1 do                                     9
      j := i * 2;                                                    10
      a := S[Start + (j - 1) * Stride];                              11
      b := S[Start + j * Stride];                                    12
      if a > b then                                                  13
        S[Start + (j - 1) * Stride] := b;                            14
        S[Start + j * Stride] := a;                                  15
      end;                                                           16
    end;                                                             17
    synchronize;                                                     18
  end oemerge;                                                       19

Algorithm 2-2: Odd-even merge in fpm.
PRAM machine language
• As any RAM machine language, possibly also LOAD PID, and separate operations to access local and shared memory.
• Usually one shared program for every processor.
  • The same program is loaded to every processor node; processors will branch according to PID.
• We can use assembler as an intermediate stage.
  • E.g., F-PRAM.

  # macro assembler                # macros opened
  else5: LOAD  =0        1         LOAD  =0    1
         STORE TMP15     2         STORE 24    2
         STORE TMP11     3         STORE 20    3
         LOAD  =1        4         LOAD  =1    4
         STORE TMP10     5         STORE 19    5
         LOAD  PROS      6         LOAD  9     6
         SUB   TMP10     7         SUB   19    7
         ADD   TMP11     8         ADD   20    8
         SUB   =1        9         SUB   =1    9
         JPOS  overpar0  10        JPOS  322   10

Figure 2-3: (F)PRAM machine language
Implementing PRAM

⇒ Using shared memory (a memory reference is a read or write) in one clock cycle is impossible.
• It has not succeeded even on uniprocessors since the 1 MHz times in the 80's.
  • Today, we could achieve 20 MHz on DRAM, 300 MHz on (nonembedded) SRAM.
• In addition to DRAM latency, the physical distances of large computers make access slow.
  • In 0.3 ns (3 GHz), light will travel 10 cm in free space, electricity ~7 cm in a coaxial cable, even less on a circuit board, only a few cm on a semiconductor.
⇒ Moreover, building a P-port memory is expensive/impossible if P is large.

The extra cost factor for P ports is Ω(P²) (as VLSI area).
• E.g., let us consider the technology for 4 Gbit (0.5 GB) memory chips.
  • It will yield a 16 Mbit (2 MB) memory with 16 ports.
  • Moreover, each of the 16 processors will need 24 address lines and 2 data lines, totalling more than 416 pins for the 16 Mbit (2 MB) memory chip.
  • Packaging costs for a modest 1 GB memory (64 MB/pr) would be 100000's e.
• At 64 ports, a 1 Mbit (128 kB) chip would be more complex (>1800 pins) than an Itanium2 Quad.
  • 64 GB would take 0.5 M chips, 1000 m², and cost >10⁹ e.
• And the access latency would still be long...
PRAM can be implemented more easily by simulating the shared memory with distributed memory.
⇒ P processors, M memory banks.

Figure 2-4: Distributed Memory Model. [P processing nodes P0..PP–1, each containing a processor, a memory, and a network interface, connected by an interconnection network.]

• Often it is assumed that M = P, i.e., each processing node contains a memory module.
  • Good: easier construction, fewer nodes, fewer communication connections.
  • Poor: more traffic in each node/connection; in real life, memories are slower than processors.
  • For reasonable performance, M = CP, where C is the speed difference factor between processors and memory.
Overloading (ylikuormitus)

⇒ Let us assume that a memory reference from/to a (virtual) shared memory takes h clock cycles.
• The computer has P physical processors.
• Each physical processor executes the tasks of h PRAM processors (h virtual processors per physical processor).
• The processor executes only one instruction at a time for each PRAM processor it is responsible for.
  • After each clock cycle it changes to the next PRAM processor.
  • After serving all h PRAM processors, it starts over by executing the next instruction of each PRAM processor.
⇒ The memory references made by the PRAM processors have completed in h clock cycles.
• In algorithm notation, see Algorithm 2-3.

  while not all processors halted do                                 1
    for each thread i do                                             2
      PCi := PCi + 1;                                                3
      if op = write then                                             4
        send write-reference                                         5
      else if op = read then                                         6
        send read-reference                                          7
      else                                                           8
        execute operation                                            9
    for each thread do                                               10
      if op = read then                                              11
        receive read-reference                                       12

Algorithm 2-3: PRAM simulation algorithm.
⇒ What do we gain?
• For each PRAM processor ("virtual processor") everything occurs in one clock cycle.
• The clock frequency of each PRAM processor is only 1/h of the real processor.
• There are h×P PRAM processors.
• Processing power is (h×P)×(1/h) = P, i.e., the same as with P direct processors.
⇒ If the program can exploit h×P processors, it will execute work-optimally.
• h is also called parallel slackness.

How large does h need to be?
• Depends on the network and the routing protocol.
• At least twice the diameter of the interconnection network.
  • Even a bit more, as the routing algorithm needs slackness to handle congestion.
  • E.g., in a butterfly network: O(log P loglog P).
• It has been done (Saarbrücken SB-PRAM, Tera MTA / Cray XMT).
  • The same technique is used in GPU units, e.g., Nvidia G8x, etc.
• Bonus: no caches needed.

Requirements for overloading
• Multithreading processor (switch after every clock cycle)
  • Implementation similar to superpipelining (Forsell).
• Huge memory bandwidth.
  • E.g., fully populated grids have too narrow a bisection bandwidth, see Figure 1-3 (p. 49).
Lesson learned
⇒ A parallel algorithm should be designed to use as many processors as (efficiently) possible.
• PRAM is not completely utopistic.
• Especially if we use local memories to decrease the traffic in the shared memory.
Chapter 3
Parallel algorithms (in PRAM-notation)

Goals
Techniques
Some algorithms
Parallel algorithm design goals

Either
• maximal speedup (and parallelism), or
• maximal speedup while still maintaining work-optimality.

More formally, an algorithm classification
• According to time complexity
  • NC: polylogarithmic time complexity, polynomial number of processors (Nick's class).
  • P: polynomial speedup
    • A different P than in sequential algorithms (solvable in polynomial time).
    • Note: NC and P are not disjoint.
• According to work optimality
  • E: efficient
  • A: polylogarithmic inefficiency (almost efficient)
  • S: polynomial inefficiency (semi-efficient)
• Combining these we'll get six classes of algorithms: ENC, ANC, SNC, EP, AP, SP.
  • ENC would be nice.
  • EP is usually good enough.
Parallel algorithm design methods

⇒ Concentrate on (operations for) data, not (operations by) processors!

Parallelizing sequential parts of an existing sequential algorithm

⇒ This is not a real design method, but in real life this is what we'll face (as ad hoc programmers have sequentialized parallel problems).
• Suits well for linear algebra.
• Analysing for-do loops (and other sequential sections).
  • If the sequential parts are independent, we can parallelize them.
  • Sometimes the inner loops are parallel, sometimes the outer loops.
  • Loop rearranging may help.
• E.g., matrix multiplication C = A·B,

    c_ij = Σ_{k=0..N–1} a_ik × b_kj                                  (3-1)

  • An easy sequential algorithm and an easy parallelization.
  • N×N matrix, O(N³) sequential algorithm, O(N) parallel algorithm with O(N²) processors.
  • PRAM variant? Exercise.

  for i := 1 to N do    // ⇒ pardo                                   1
    for j := 1 to N do  // ⇒ pardo                                   2
      for k := 1 to N do                                             3
        C[i, j] := C[i, j] + A[i, k] * B[k, j];                      4

Algorithm 3-1: Matrix multiplication.
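The independence of the two outer loops can be checked in a plain simulation (my own sketch, not course code): every (i, j) pair touches only its own C[i][j], so on a PRAM with N² processors each one computes one inner product in O(N) time.

```python
def matmul_rowwise(A, B):
    """Algorithm 3-1 with the two outer loops treated as pardo:
    the (i, j) iterations are independent, only the innermost
    k-loop is a sequential inner product."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):            # pardo on a PRAM
        for j in range(N):        # pardo on a PRAM
            s = 0
            for k in range(N):    # sequential, O(N)
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```

Parallelizing the k-loop as well is the harder part discussed next, since its iterations all accumulate into the same result cell.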
• Parallelizing the innermost for-loop is not quite as straightforward (unless we use the STRONG CRCW model).
  • However, the innermost product-sum can be evaluated in O(log N) time using O(N) processors, see Parallel tournament (turnaustekniikka) (p. 95).
  • Even with O(N/log N) processors, see Blocking (lohkominen) (p. 97).
• Thus, the whole algorithm in O(log N) time with O(N³/log N) processors (exercise).
• For real computers and real input sizes, it is often enough to parallelize only one of the nested loops.

• In algorithms with several stages, we should parallelize all (demanding) stages to achieve full efficiency (processor utilization).

  for i := 1 to N pardo  // O(1)                                     1
    for j := 1 to N pardo                                            2
      statement1;        // O(1)                                     3
  for i := 1 to N do     // O(N)                                     4
    for j := 1 to N pardo                                            5
      statement2;        // O(1)                                     6

Algorithm 3-2: An uneven parallelization: O(N) time, O(N²) processors (but O(N) with O(N) processors).
Divide-and-conquer

⇒ Divide the input in two parts, solve the halves recursively in parallel, combine the results (in parallel).
• A familiar technique in sequential algorithms.
• Parallel recursion is terminated when either
  • the input is trivial (as in sequential programming), or
  • there is only 1 processor left, when we can switch to a sequential algorithm (see Blocking (lohkominen) (p. 97) and Algorithm 2-1 (p. 71)).
• Subresults are combined into larger subresults on returning from recursion.

• E.g., mergesort
  • Sequential algorithm: Ts(N) = 2·Ts(N/2) + O(N) = O(N log N)
  • The recursive calls at lines 3 and 4 can be executed in parallel (as they work on disjoint parts of the array).
  • Using sequential merge, Tp(N) = Tp(N/2) + O(N) = O(N), O(N) processors, O(N²) work, not good.
⇒ Also the combining of subresults must be parallelized!
• Combining is often more difficult than dividing.
• Sometimes combining is trivial, though.
  • E.g., in search algorithms (only the discoverer acts), especially using CRCW.

  procedure mergesort(var A : array; first, last : index);           1
    if (last–first) > 0 then                                         2
      mergesort(A, first, (last+first)/2);                           3
      mergesort(A, (last+first)/2+1, last);                          4
      merge(A, first, (last+first)/2, (last+first)/2+1, last);       5

Algorithm 3-3: Mergesort.
• In mergesort, the combining is the merging phase, which is more difficult to parallelize.
• If we could merge in O(1) time using O(P) processors, the sorting time would be Tp(N) = Tp(N/2) + O(1) = O(log N) time, O(N) processors, O(N log N) work.
• Unfortunately merging in O(1) time is impossible (using realistic models).
  • O(1) amortized time is possible, but unfeasibly complex.
• Merging in O(log N) or O(loglog N) time is much easier, but does not offer work optimality unless we use fewer processors, see "Odd-even merge" p. 136.
• The division can be made in more than two parts to reduce the number of stages.
  • E.g., division in √N parts, combining in unit time: T(N) = T(√N) + O(1) = O(loglog N).
  • Obviously, combining might not be as easy anymore, see the raw power and waterfall techniques below.
Parallel tournament (turnaustekniikka)
• Also called balanced tree.
• If divide-and-conquer is a top-down approach, we can also apply a similar technique bottom-up.
• We'll skip the (recursive/parallel) dividing into parts; instead we'll start from ready "sequences" of length one element.
• Compare the input elements pairwise, the winner continues to the next round.
  • The definition of winner depends on the application, e.g., a combination can be used.
  • A stage can be done in O(1) time using N/2 processors.
• The same is repeated again and again among the winners (N/4, N/8, ... pairs) until the ultimate winner is left.
• log N stages, each O(1) time ⇒ O(log N) time, O(N) processors.
• As in divide-and-conquer, more than two elements can be handled at each stage, see below.
Raw power (raaka voima)
• As fast as possible.
  • "Overkill".
  • Almost: using as many processors as possible.
⇒ We'll try to evaluate all possibilities at once.
• E.g., we'll compare all pairs simultaneously.
  • O(N²) comparisons in O(1) time using O(N²) processors.
  • N input elements will transform to N² subresults!
• Combining may be hard to do fast, usually requires CRCW.
• The goal is an O(1) or logarithmic time algorithm.
• Rarely work-optimal.
• Often used as the final stage of an algorithm, see below.
Blocking (lohkominen)
• The previous methods often result in unbalanced processor utilization, which implies non-optimal work.
  • E.g., at the beginning of a tournament, N/2 processors are used, but the number of active processors reduces on every round; the last comparison is made by one processor only.
• We'll restrict parallelism appropriately to achieve work-optimality.
• Idea:
  • Fewer processors.
  • More work to do for each processor.
  • At the beginning, each processor (in parallel) evaluates its own block sequentially.
  • Switch to the fast parallel algorithm only when each processor has a single intermediate result.
• Usually used with other techniques, e.g., divide-and-conquer.

• E.g., in a tournament of O(N) sequential work:
  • The actual tournament stage will take O(log P) time.
  • To maintain work-efficiency, we can use at most O(N/log P) processors (if also the block part can be done in O(log P) time, fewer if it takes more time).
  • We'll choose P = N/log N.
  • Each processor will have a log N-element block; the sequential algorithm is used, O(log N) time.
  • The remaining N/log N elements will be processed using a parallel tournament in O(log N) time using N/log N processors.
⇒ The whole algorithm in O(log N) time with N/log N processors.
• If the sequential part with blocks takes more than O(N) time, smaller blocks are enough.
Waterfall technique (vesiputoustekniikka)
• Also called accelerated cascading.
• Combine the best parts of the previous methods.
• Switch to a faster algorithm after the size of the input has shrunk enough to be executed faster using the given P.

Other methods
• Some basic algorithms, e.g., prefix sums (see p. 119), binary search, and tree/path compaction, are useful as parts of larger algorithms. They often help at the combining parts.
• Randomization (breaking patterns), useful for real-world EREW-like variants to avoid memory congestion.
• Parallel Monte Carlo / genetic methods (all processors try (random) solutions).
• Sampling.
  • Take a (smallish, but as large as possible without disturbing the efficiency) sample of the whole data, analyse it using a fast algorithm (raw power).
  • Divide the input according to the distribution of the sample.
  • The input will hopefully be divided more evenly to the processors.
  • Helps on real data with inconvenient patterns.
Maximum finding

⇒ A very simple problem; examples of each technique.
• Input: a shared array A[0..N–1]
• Output: the largest element or/and its index.
• Sequential algorithm: O(N).

Standard tournament
⇒ Compare elements pairwise, the winner continues to the next iteration.
• After log N iterations, only one element is left.
• Intermediate results have to be stored somewhere.
  • For each comparison, we need two values which were compared on the previous iteration by different processors.
  • If we want to leave the original array intact, we'll use an auxiliary array.
  • Here we'll use the original for simplicity.
• Winner placement can be done in many ways, see below.
• Here we'll store all winners at the beginning part of the array. The part reduces to half on every iteration.
• The most difficult part is to make the indices match on every iteration.
• Iterations have to be executed in strict synchrony.
  • We can assume this in PRAM algorithm notation (we can mention it, though). In real machines we need an explicit synchronization.

Figure 3-1: Tournament maximum. [The array contents round by round: pairwise maxima are written to the front half of the array until the overall maximum is at A[0].]

• By some clever organization, the synchronization requirement can be eased, even removed (with auxiliary data structures).
• If/when the input size N is not of the form 2^k, we'll have to refine line 4 to, e.g.,

  A[j] := max(((j*2 < N) ? A[j*2] : A[j]),
              (j*2+1 < N ? A[j*2+1] : A[j]));                        4

• Time: log N (line 2) × O(1) (lines 3-4) + O(1) (lines 1 and 5) = O(log N).
• Number of processors: N/2 = O(N).
• Work: O(N log N), not work-optimal (inefficient by a factor of O(log N)).
• EREW PRAM is sufficient.

  function tournament-max(var A : array[0..N–1]);                    1
    for i := log N – 1 to 0 do                                       2
      for j := 0 to 2^i – 1 pardo                                    3
        A[j] := max(A[j*2], A[j*2+1]);                               4
    return A[0];                                                     5

Algorithm 3-4: Maximum using standard tournament.
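Algorithm 3-4 can be simulated round by round; the sketch below is mine (assuming N is a power of two), with the pardo over j written as an ordinary loop, although all comparisons of one round are independent and would take one PRAM step.

```python
def tournament_max(A):
    """Round-by-round tournament: after each round the n pairwise
    maxima sit at the front of the array; log2(N) rounds in total."""
    A = list(A)                   # keep the caller's array intact
    n = len(A)                    # assumed to be a power of two
    rounds = 0
    while n > 1:
        n //= 2
        for j in range(n):        # pardo on a PRAM
            A[j] = max(A[2 * j], A[2 * j + 1])
        rounds += 1
    return A[0], rounds
```

With 8 elements the maximum is found in 3 rounds, matching the O(log N) bound.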
• The same set of indices can be written in different ways:
• Also, you may use any indices, or a new array to store the intermediate results.

  function tournament-max2(var A : array[0..N–1]);                   1
    i := N;                                                          2
    while i > 0 do                                                   3
      i := i/2;                                                      4
      for j := 0 to i pardo                                          5
        if j*2 < N–1 then                                            6
          A[j] := max(A[j*2], A[j*2+1]);                             7
        elseif j*2 = N–1 then                                        8
          A[j] := A[j*2];                                            9
    endwhile;                                                        10
    return A[0];                                                     11

Algorithm 3-5: Tournament-max, alternative implementation.

• E.g., using a doubling/halving stride works well.
• If counting twice does not hurt, modulo helps on the boundaries.

  function tournament-max3(var A : array[0..N–1]);                   1
    s := 1;  // stride                                               2
    while s < N do                                                   3
      for j := 0 to N–s–1 by s*2 pardo                               4
        A[j] := max(A[j], A[j+s]);                                   5
      s := s * 2;                                                    6
    endwhile;                                                        7
    return A[0];                                                     8

Algorithm 3-6: Tournament-max, yet another alternative implementation.
Figure 3-2: Binary tree of Algorithm 3-6. [Leaves at indices 0..15; strides 1, 2, 4, 8 combine the indices pairwise towards A[0].]

A variation: maximum for every processor
• Often the maximum has to be spread to all processors (or indices of the array).
  • This is useful especially on EREW PRAM.
• We could do the spreading by using another log N "tree".
• But in the previous algorithm, most processors are idle most of the time. They can be exploited in "concurrent spreading".
  • Each processor evaluates its own "local" maximum tree.
• Even if all processors do useful work during the whole execution, this is not work-optimal.

Figure 3-3: An "array of trees" of degree 2. Dashed lines represent wrap-around edges.
Divide-and-conquer
• Works actually like the tournament, with slightly different notation.
• Divide recursively until the input is trivial.
• On returning from recursion, compare, and return the larger one.
• Managing array boundaries and synchrony is easier.
• The representation of parallelism is possibly more difficult / inefficient.
• Time: T(N) = T(N/2) + O(1) = O(log N), O(N) proc, O(N log N) work.

  function divide_conquer-max(var A : array[0..N–1];
                              low, high : index);                    1
    if (low = high) then                                             2
      return A[low];                                                 3
    else                                                             4
      pardo                                                          5
        x := divide_conquer-max(A, low, (high+low)/2);               6
        y := divide_conquer-max(A, (high+low)/2+1, high);            7
      return max(x, y);                                              8

Algorithm 3-7: Maximum finding using the divide-and-conquer technique.

Blocking and tournament
• None of the previous algorithms is work-optimal.
• Without Concurrent Write, we cannot achieve O(1) time with O(N) processors; thus, we'll have to reduce the number of processors for work-optimality.
⇒ We'll first use N/log N processors, with a goal of O(log N) time.
• Idea: reduce the input to N/log N elements, after which we'll use the tournament in O(log N) time using N/log N processors.
• Each processor first finds the maximum of its own block of size log N sequentially (but all processors in parallel).
• After O(log N) time, we'll have an intermediate input of size N/log N.
• Then we'll do the tournament for the smaller input.
• Total time O(log N), N/log N processors ⇒ O(N) work!
• EREW is still enough.

  function blocking_tournament-max(var A : array[0..N–1]);           1
    for i := 0 to N/log N – 1 pardo                                  2
      B[i] := A[i*log N];                                            3
      for j := 1 to log N – 1 do                                     4
        B[i] := max(B[i], A[i*log N + j]);                           5
    return tournament-max(B[0..N/log N – 1]);                        6

Algorithm 3-8: Blocking technique in maximum finding.
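The two phases of Algorithm 3-8 can be simulated directly (my own sketch, not course code; the power-of-two padding is my simplification to keep the tournament phase trivial):

```python
import math

def blocking_tournament_max(A):
    """Blocking phase: N/logN blocks of ~logN elements, each scanned
    sequentially (all blocks in parallel on a PRAM). Tournament phase:
    pairwise maxima of the block results, O(logN) rounds."""
    N = len(A)
    b = max(1, int(math.log2(N)))                 # block size ~ log N
    B = [max(A[i:i + b]) for i in range(0, N, b)] # one processor per block
    while len(B) & (len(B) - 1):                  # pad to a power of two
        B.append(B[0])                            # duplicates don't change a max
    while len(B) > 1:                             # tournament rounds
        B = [max(B[2 * j], B[2 * j + 1]) for j in range(len(B) // 2)]
    return B[0]
```

Both phases take O(log N) steps with N/log N processors, which is what makes the whole algorithm O(N)-work.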
Parallel Computing 25.10.2012 14:51 UEF/cs Simo Juvaste 112 (289)
112
Maximum finding
Raw
pow
er (raaka vo
ima
)
•L
et us assume that
any element could be the m
aximum
.
•W
e’llprove other elem
ents not to be maxim
um, only m
aximum
is left.
•Initialize
anarray
of1's
ofsize
N(a
bitfor
everyelem
entof
theinput).
•C
ompare all pairs sim
ultaneously (about
N2/2 pairs).
•T
hesm
allerof
apair
cannotbe
them
aximum
,thusm
arkit
with
0to
the boolean array.
•D
raws
aredecided
accordingto
theindex
(below,
theone
with
smaller index w
ins).
•O
nly the maxim
um value retained the 1.
•A
ll stages inO
(1) time,
N2/2 processors,
O(N
2) work.
•C
oncurrent read is needed at line 4, concurrent write at lines 7 and 9.
•O
nlyzeros
arew
rittenconcurrently,thus
WE
AK
CR
CW
suffices.
1   function raw-max(var A : array[0..N–1]);
2     for i := 0 to N–1 pardo
3       V[i] := 1;
4     for i := 0 to N–1 pardo
5       for j := i+1 to N–1 pardo
6         if A[i] < A[j] then
7           V[i] := 0;
8         else
9           V[j] := 0;
10    for i := 0 to N–1 pardo
11      if V[i] ≠ 0 then
12        return A[i];

Algorithm 3-9: Maximum with raw power.
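A sequential Python rendering of Algorithm 3-9 (an illustrative sketch; the nested pardo loops become ordinary nested loops, and the slide's index rule breaks draws):

```python
def raw_max(A):
    """Simulate Algorithm 3-9: every pair is compared 'at once' and the
    smaller element's flag is zeroed; only the maximum keeps its 1.
    Draws are decided by index (the smaller index wins), as on the slide."""
    N = len(A)
    V = [1] * N
    for i in range(N):              # all pairs (i, j), i < j
        for j in range(i + 1, N):
            if A[i] < A[j]:
                V[i] = 0
            else:                   # covers draws: smaller index wins
                V[j] = 0
    for i in range(N):
        if V[i]:
            return A[i]
```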
Divide-and-conquer & raw power

• Divide-and-conquer can be used with division into more than 2 parts.
• Combining fast enough is harder.
• Using the raw-power maximum Algorithm 3-9, we can combine (find the maximum of) M results with M² processors in unit time.
• If we have N processors, we can combine √N subresults by raw-maximum.
• Divide the input into √N parts, solve them recursively, find the maximum with raw-max.

function root-max(var A : array[0..N–1]; low, high : index);
  if (low = high) then
    return A[low];
  else
    k := high – low + 1;
    for i := 0 to √k–1 pardo
      B[i] := root-max(A, low + i*√k, low + (i+1)*√k – 1);
    return raw-max(B[0..√k–1]);

Algorithm 3-10: √N-divide-and-conquer maximum.
• If N is not of the form 2^(2^n), we have to refine the algorithm a bit (exercise).
• Time T(N) = T(√N) + O(1) = O(loglogN), O(N) processors, O(N loglogN) work.
Waterfall = blocking & divide-and-conquer & raw-power

• Reduce the N elements to N/loglogN elements sequentially in loglogN time using N/loglogN processors (blocking).
• Solve the remaining N/loglogN elements with N/loglogN processors using Algorithm 3-10 (divide-and-conquer & raw-power).
⇒ A work-optimal O(loglogN) time (weak) CRCW algorithm.
Using stronger CRCW models

• STRONG CW has a ready operation for maximum.
• PRIORITY CW can solve maximum easily in O(1) time using O(N+M) processors (M being the size of the key range):

function crcw_priority_max(shared var A : array[0..N–1]);
  shared var maxvalue, winnerindex;
  for i := 0 to max_val pardo
    counts[i] := –1;
  for i := 0 to N–1 pardo
    counts[A[i]] := i;
  for i := max_val to 0 by –1 pardo   // process with largest i will win
    if counts[i] >= 0 then
      maxvalue := i;
      winnerindex := counts[i];
  return (maxvalue, winnerindex);

Algorithm 3-11: Using PRIORITY CRCW for maximum.
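The effect of the priority writes can be simulated sequentially (an illustrative sketch: the PRIORITY CW "largest i wins" rule becomes last-write-wins over the index loop, and the downward value loop becomes a top-down scan; max_val is the largest possible key, as in the pseudocode):

```python
def crcw_priority_max(A, max_val):
    """Simulate Algorithm 3-11: counts[v] ends up holding the index of
    the highest-priority processor whose key equals v; scanning values
    from max_val down, the first occupied slot is the maximum."""
    counts = [-1] * (max_val + 1)
    for i, v in enumerate(A):       # PRIORITY write: largest i wins the slot
        counts[v] = i
    for v in range(max_val, -1, -1):
        if counts[v] >= 0:
            return v, counts[v]     # (maxvalue, winnerindex)
```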
Other similar problems

• Most of the previous algorithms can be used (with small changes) for many similar tasks.
• Especially all problems where the result is atomic and combining is easy.
• Finding, selecting, counting, sum, and, or, etc.
• Or, the algorithms can be used in the opposite direction to spread data.
Prefix sum

• Input: array A[0..N–1] (or [1..N]).
• Result: the array

  (A[0], A[0]+A[1], ..., Σ_{j=0..i} A[j], ..., Σ_{j=0..N–1} A[j]),   (3-2)

  or the "0-prefix sum"

  (0, A[0], A[0]+A[1], ..., Σ_{j=0..N–2} A[j]).   (3-3)
• E.g., (4 5 2 5 6) ⇒ (4 9 11 16 22).
• E.g., (1 0 1 1 0 0 1) ⇒ (1 1 2 3 3 3 4).
• Applications: counting, array/list compression (removing empty elements), load balancing, radix sort, graph algorithms, etc.
• An algorithm similar to maximum finding works for all of these.
• Use blocking to make it work-optimal (exercise).
• Again, synchrony is crucial; array boundaries are more difficult if N is not a power of 2; use another array if the original is needed.

procedure prefix-sum(var A : array[0..N–1]);
  for i := 1 to logN do
    for j := 2^(i–1) to N–1 pardo
      A[j] := A[j – 2^(i–1)] + A[j];

Algorithm 3-12: Basic parallel prefix sum.
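The doubling rounds of Algorithm 3-12 can be simulated sequentially (an illustrative sketch; the copy of the array models PRAM synchrony, where all reads of a round happen before any write):

```python
def prefix_sum(A):
    """Simulate Algorithm 3-12: logN synchronous rounds; in round i every
    position j >= 2^(i-1) adds the value 2^(i-1) places to its left."""
    A = list(A)
    N = len(A)
    d = 1
    while d < N:                    # rounds i = 1 .. logN, d = 2^(i-1)
        prev = list(A)              # synchrony: read the round's old values
        for j in range(d, N):
            A[j] = prev[j - d] + prev[j]
        d *= 2
    return A
```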
Figure 3-4: Data movement in prefix sum. [Figure: successive array contents after each doubling round; data omitted.]
Merging and sorting algorithms

⇒ Parallel sorting can be approached in several ways (as can sequential sorting).
• We'll present:
  • Raw power.
  • Mergesort (with a couple of possible approaches to merging in parallel).
  • Sampling bucketsort.
  • Radix sort.
• Later, we'll present some sorting algorithms suitable for a message-passing environment.
Parallel "bubblesort" (odd-even transposition)

• Compare-exchange odd pairs and even pairs alternately, N rounds in total.
• N/2 processors, 2N = O(N) time, O(N²) work.
Raw power sort (by ranking)

⇒ Presents PRAM at its best and worst!
• Exploits STRONG ADD CRCW.

Compute the correct location of each element at once:
• Count how many smaller elements there are in the array.
• I.e., the rank of each element.
• Ranks are evaluated as in raw-max: compare all pairs, increase the rank of the larger element by one (cf. zeroing the smaller in raw-max).
• Several increments of the same element at once (STRONG ADD CRCW needed).
• After ranking, we know the number of smaller elements for each element, i.e., the location of each element.
• Draws have to be resolved.
• O(1) time, O(N²) processors, O(N²) work.
⇒ Ranks can also be counted in different (more efficient) ways.
Figure 3-5: Direct sorting by ranking. [Figure: input A, rank array V, and the assignment A[V[i]] := A[i]; data omitted.]

procedure raw-sort(var A : array[0..N–1]);
  for i := 0 to N–1 pardo
    V[i] := 0;
  for i := 0 to N–1 pardo      // rank
    for j := 0 to N–1 pardo
      if A[i] < A[j] then
        V[j] := V[j] + 1;      // STRONG ADD CRCW
  for i := 0 to N–1 pardo      // sort
    A[V[i]] := A[i];

Algorithm 3-13: Sorting by raw power.
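Algorithm 3-13 can be simulated sequentially (an illustrative sketch: the STRONG ADD concurrent writes become ordinary increments, index order resolves draws so the ranks form a permutation, and a separate output array replaces the in-place write, which on the PRAM relies on synchrony):

```python
def raw_sort(A):
    """Simulate Algorithm 3-13: rank every element by counting, over all
    pairs, how many elements are smaller; then write each element
    straight to its final slot."""
    N = len(A)
    V = [0] * N
    for i in range(N):
        for j in range(N):
            if (A[i], i) < (A[j], j):   # draws resolved by index
                V[j] += 1               # 'STRONG ADD CRCW' increment
    out = [None] * N                    # stands in for the in-place write
    for i in range(N):
        out[V[i]] = A[i]
    return out
```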
Mergesort

• The actual sort is trivial, presented earlier.
• Merging in parallel is interesting; we'll present a few examples.
• Merging in O(N) time (sequentially): O(N) time full sort (O(N²) work).
• Merging in O(logN) time: O(log²N) time sort.
• Merging in O(loglogN) time: O(logN loglogN) time sort.
• Merging in O(1) (amortized) time: O(logN) time, O(N logN) work.

procedure mergesort(var A : array; first, last : index);
  if (last – first) > 0 then
    pardo
      mergesort(A, first, (last+first)/2);
      mergesort(A, (last+first)/2+1, last);
    merge(A, first, (last+first)/2, (last+first)/2+1, last);

Algorithm 3-14: Mergesort.
Merging by ranking

• We assume elements to be distinct (use the index to resolve draws).
• Let us define the rank of an element x in an array A[0..N–1] as the number of smaller elements in array A.
⇒ Computing the rank is much easier if A is in increasing order (sorted).

  rank(x, A) := max { i : A[i] ≤ x }   (3-4)

• Using one processor: binary search in time O(logN).
• With P processors, we can divide into P+1 parts (P division points) instead of two.
• Thus a parallel "binary search" in time

  T_P(N, P) = T_P(N/(P+1), P) + O(1) = O(log_{P+1} N) = O(logN / logP).   (3-5)

• One processor finds the correct interval, the others follow. Exercise.
• Using raw power, we can find one rank in O(1) time using O(N) processors.
• If needed, we can refine this with one processor writing (instead of returning) and the rest of the processors reading the result.
• CREW suffices.
• Later we'll show how to do this more efficiently.

function raw-rank(x : element; var A : array[0..N–1]);
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to N–2 pardo
      if A[i] ≤ x and x ≤ A[i+1] then
        return i+1;

Algorithm 3-15: Rank in unit time by raw power.
Merging with ranking

• Input: readily sorted arrays A and B (often halves of the same array).
• The rank of element A[i] in array A is i.
• The rank of element A[i] in array B is rank(A[i], B).
• The rank of element A[i] in the final array is i + rank(A[i], B).
• We can place every element into the final array independently!
⇒ For the whole merge, we'll need the rank of each element of A in B, and the rank of each element of B in A.
• This can easily be converted to restore elements back to A and B and/or to merge the halves of a single array.

function rank-merge(A, B : array[0..N–1]) : array[0..N*2–1];
  for i := 0 to N–1 pardo
    C[i + rank(A[i], B)] := A[i];
    C[i + rank(B[i], A)] := B[i];
  return C;

Algorithm 3-16: Direct merge by rank.
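Algorithm 3-16 translates almost directly into Python (an illustrative sketch; with distinct elements, the standard-library bisect_left gives exactly the rank, i.e. the number of smaller elements, in O(logN) per element as in the slides' binary-search ranking):

```python
import bisect

def rank_merge(A, B):
    """Simulate Algorithm 3-16: the final position of A[i] is
    i + rank(A[i], B), and symmetrically for B; every placement
    is independent of the others."""
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):
        C[i + bisect.bisect_left(B, x)] = x   # elements assumed distinct
    for i, x in enumerate(B):
        C[i + bisect.bisect_left(A, x)] = x
    return C
```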
• We need a CREW PRAM since N simultaneous ranking processes read the same array (using binary search) in parallel (though with only a constant penalty on EREW).
• If parallelization and synchronization are done carefully, the merging can be done in place.
• But we need N processors, all of which use O(1) helper space, thus it actually uses O(N) extra space.
• Later, with fewer processors, we need O(N) extra space anyway and have to move elements to/from a helper array.
• Ω(N logN) work.
• A more accurate analysis of rank-merge-sort with P = N, P = N², and arbitrary P is left as an exercise.
Faster merging algorithms

Merging in O(logN) time, O(N) work

• Input: arrays A and B (of length N).
• Choose regularly N/logN elements of B.
• Rank each of these (with sequential binary search) in A (one element per processor, N/logN processors in total).
• Now we have N/logN pairs of subsequences, each of which can be merged sequentially:

  a_1 … a_{j1}  and  b_1 … b_{logn},  where j_i = rank(b_{i·logn}, A)   (3-6)
  a_{j1+1} … a_{j2}  and  b_{logn+1} … b_{2·logn}
  …
  a_{j_{n/logn–1}+1} … a_n  and  b_{(n/logn–1)·logn+1} … b_n

• From the section boundaries, we know the location of the merged section in the new array – the merging tasks are independent.
• On average, the lengths are O(logN), thus the whole algorithm runs in O(logN) time.
• Unfortunately, the subsequences of A can be longer if the data is uneven.
• Either:
  • Symmetric ranking & partitioning:
    • Choose N/logN elements of both A and B.
    • Rank each of these (with binary search) in the other array.
    • Now we have to merge 2×N/logN pairs of sequences of length at most logN.
• Or:
  • Repartition the (few) too large sequences.
Merging in O(loglogN) time, O(N) proc, O(N loglogN) work

• Exploits the more efficient 2-step ranking Algorithm 3-17.
• Take √N regularly spaced samples of each array A and B.
• Rank the samples of A in the samples of B (not in the whole B!).
• √N ranks on √N elements with N processors in O(1) time (raw-rank).
• Same for the samples of B in A (as in symmetric ranking above).
• Now we have 2√N subsequences, but the boundaries are still inaccurate (we only know in which block of the other array the samples belong).
• Rank each sample of A in the subsequence of B it belongs to.
• 2√N ranks on √N-element subsequences with N processors in O(1) time (raw-rank).
• Same for the samples of B in A.
• Now we have 2√N subsequences with accurate boundaries, in O(1) time.
• Apply the algorithm recursively to each of the 2√N subsequences (of average length √N/2) with √N/2 processors for each subsequence.
• T(N) = T(√N) + O(1) = O(loglogN).
function root-raw-rank(x : element; var A : array[0..N–1]) : index;
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to √N–1 pardo
      B[i] := A[i*√N];
    block := raw-rank(x, B);                            // O(1) time with √N proc
    brank := raw-rank(x, A[block*√N..(block+1)*√N]);    // O(1)
    return block*√N + brank;

Algorithm 3-17: Rank in O(1) time with √N processors. (TODO: check indices.)
Merging in O(loglogN) time, O(N) work

• A work- and time-optimal merge!
• N/loglogN processors.
• Partition A and B into blocks of size loglogN.
• Rank the block boundaries (N/loglogN of them) in the other array with the previous algorithm (O(loglogN) time).
• Rank each of the boundaries sequentially within the corresponding subsection of length O(loglogN) (O(logloglogN) time with binary search).
• Now we have accurate boundaries (ranks) of 2×N/loglogN pairs of sequences of length at most loglogN.
• Merge each pair of sequences independently using a sequential algorithm (O(loglogN) time).
• Yields an O(logN loglogN) time, O(N logN) work sorting algorithm.
Odd-even merge

• Batcher 1968: odd-even merge and bitonic merge.
• Input: array halves A and B.
• In practice, the halves of the same array are named A and B for easier reference.
• Merge (recursively) the odd elements of A and the odd elements of B; and merge (recursively) the even elements of A and the even elements of B.
• Merging is done in place.
• After these merges, consecutive pairs may be out of order; we'll check the order of each pair, and swap if needed.
• Merge time: T(N) = T(N/2) + O(1) = O(logN), O(1) space.

Figure 3-6: Odd-even merge [5].

Figure 3-7: Recursion in odd-even merge [5].
procedure Odd-even_merge(A : array[0..N–1]);
  pardo
    Odd-even_merge(halves of odd elements of A);
    Odd-even_merge(halves of even elements of A);
  par i := 1 to N–2 by 2 do
    compare-exchange(A[i], A[i+1]);

Algorithm 3-18: Parallel odd-even merge, informally.

procedure oemerge(var S : array; First, Length, Stride : index);
  if Length = 2 then                  // base case: the recursion must stop
    if S[First] > S[First + Stride] then
      swap(S[First], S[First + Stride]);
    return;
  par i := 0 to 1 do
    oemerge(S, First + i * Stride, Length/2, Stride * 2);
  par i := 1 to Length/2 – 1 do
    j := i * 2;                       // j := 2 to Length–2 by 2
    if S[First + (j–1) * Stride] > S[First + j * Stride] then
      swap(S[First+(j–1)*Stride], S[First+j*Stride]);

Algorithm 3-19: Parallel in-place odd-even merge procedure (FPM).
OEM-sort performance

• Mergesort with odd-even merge exploits at most N/2 processors, executes in O(log²N) time, and thus uses O(N log²N) work, which is inefficient by a factor of O(logN).
⇒ We can improve the efficiency by reducing the number of processors.
• If there are fewer than N/2 processors, we can switch to a sequential sort/merge as soon as we run out of processors.
• The recursive sort branches according to P.
• Also the merging can run out of processors, thus the merge will also branch according to P.
• The time complexity will be

  T(N, P) = O((N/P) × (log²P + log(N/P))).   (3-7)
• In theory, we cannot exploit very many processors efficiently.
• E.g., to ensure 50% efficiency, we would have to settle for

  log²P – logP ≤ logN.   (3-8)

• The same plotted: see Figure 3-8.
• In practice, though, we can efficiently use slightly more processors, as the slow recursion tails are removed if N is clearly larger than P.
• Measured performance on F-PRAM: see Figure 3-9.
Figure 3-8: Maximum efficiently useful P as a function of N, as predicted by Formula (3-8); odd-even mergesort, logarithmic x-axis. [Plot omitted: x-axis = input size N (log scale, 16 … 6.87×10¹⁰), y-axis = maximum efficiently useful number of processors (5 … 50).]
Figure 3-9: Speedup of odd-even mergesort as a function of the number of processors for different input sizes. Both scales are logarithmic. [Plot omitted: curves for N = 256 … 262144, with linear, 50%, and 10% efficiency reference lines.]
Cole's optimal parallel mergesort (1986)

• The first almost practical time- and work-optimal O(logN) sort.
• The first asymptotically optimal one was Ajtai, Komlós, Szemerédi (AKS), 1983.
⇒ In fact, we do not need O(1) time merging; merging with O(1) amortized cost for each phase is sufficient.
• The merge operations in the different stages of the sort can be pipelined.
• We collect samples (border values, a "cover") in the different stages.
• We collect the ranks of the samples in the halves of the data.
• According to the ranks of the samples, we can do the next stage faster.
• Because of large constants, Cole's sort is faster than odd-even mergesort (or bitonic) only if N > 10²¹ [6].
• See, e.g., JáJá or Akl.
Sampling parallel bucketsort

• Let us assume that N >> P.
• Each processor samples its own part of the array.
• The samples are sorted in some fast (parallel) way.
• According to the samples, the processors decide P–1 division points (values).
• Each processor partitions its part of the input to the other processors according to the division points.
• Each processor receives one subsection of the input from all the others.
• Each processor sorts its own section.
• In the shared memory model, we need some amount of additional space.
• In the message passing model, we need all-to-all communication.
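The steps above can be sketched as a sequential simulation (an illustrative sketch only: the sampling rate and the way splitters are picked from the sorted sample are free design choices, not fixed by the slides):

```python
import bisect

def sample_sort(A, P):
    """Simulate sampling bucketsort: sample, pick P-1 splitters,
    partition into P 'processor' buckets, sort each bucket locally,
    and concatenate in splitter order."""
    N = len(A)
    step = max(1, N // (P * P))                  # a few regular samples
    sample = sorted(A[::step])
    # P-1 division points, regularly spaced in the sorted sample
    splitters = [sample[len(sample) * i // P] for i in range(1, P)]
    buckets = [[] for _ in range(P)]
    for x in A:                                  # partition step
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for b in buckets:                            # each processor sorts its section
        out.extend(sorted(b))
    return out
```

In a message-passing setting the partition step is the all-to-all communication mentioned above; bucket sizes are only balanced as well as the sample represents the data.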
Radix sort in parallel

⇒ Probably the fastest sequential sort if the keys are reasonably short and the input is large.
• Sequential time O((m/r)·(n + 2^r)), where m is the key size (in bits) and r is the radix size (bits).
• Sorting in stages:
  • Divide the key into parts.
  • Sort according to the least significant part.
  • Sort according to the next least significant part.
  • ...
  • Sort according to the most significant part.
• The sorts have to be stable, i.e., the order of elements with the same subkey has to be preserved.

Figure 3-10: Sorting in stages. [Data omitted.]
• As each subkey is short (a reasonable number of different possible subkeys), we could use bucketsort.
• As we have a lot of keys (a lot for each bucket), the use of lists in bucketsort gets slow, thus we'll use a slightly different method.
• First count the number of occurrences of each subkey.
• Compute a 0-prefix sum of the count array.
• The prefix sum tells us the position at which each "bucket" will be stored.
• The contents of each "bucket" will be stored in the original order.
• The bucket's prefix-sum location is increased after each assignment.
• If/when the keys are not integers, we'll use the bit representation of the keys: r bits at a time yields 2^r buckets; r is typically 12–20.
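The histogram-and-prefix method just described is, in Python (an illustrative sketch; key_bits and r are parameters of the example, not values fixed by the slides):

```python
def radix_sort(A, key_bits=12, r=4):
    """Sequential radix sort with a histogram and 0-prefix sum: for each
    r-bit subkey, count occurrences, prefix-sum the counts to get each
    bucket's start position, then place elements stably in input order,
    bumping the bucket's position after each assignment."""
    mask = (1 << r) - 1
    for shift in range(0, key_bits, r):      # least significant part first
        counts = [0] * (1 << r)
        for x in A:
            counts[(x >> shift) & mask] += 1
        pos = [0] * (1 << r)                 # 0-prefix sum of the counts
        for b in range(1, 1 << r):
            pos[b] = pos[b - 1] + counts[b - 1]
        out = [None] * len(A)
        for x in A:                          # stable placement
            b = (x >> shift) & mask
            out[pos[b]] = x
            pos[b] += 1
        A = out
    return A
```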
Figure 3-11: Sequential radix sort using a histogram. [Figure: input T1, occurrence counts, 0-prefix sum R, and the stable placement into T2; data omitted.]
Parallelization

• If several processors count occurrences in parallel, the prefix sum needs to be computed over all P×2^r buckets.
• The result is like in Figure 3-12, but a linear (sequential) scan is too slow.

Figure 3-12: Linear scan for radix sort [Culler & al].
Prefix in three stages

• Prefix-sum each row into the last column (2^r×P / P = 2^r time).
• Broadcast all values of the last column to all processors (2^r time, or skip in CREW).
• Prefix-sum the last column.
• Evaluate the final prefix sums by adding in the previous row sums (2^r time).
• The assignment stage on the local input is as in the sequential version.
• The processes can work independently.
⇒ Comparison of different sorts on the CM-5 [Culler & al]: Figure 3-13.
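The value the three stages compute can be checked against a plain sequential 0-prefix over the count matrix (an illustrative sketch; counts[b][p] is assumed bucket-major, i.e. all processors' counts for bucket 0 come before bucket 1, matching the linear-scan order described above):

```python
def matrix_prefix(counts):
    """Reference result for the three-stage prefix: counts[b][p] holds
    processor p's count for bucket b (a 2^r x P matrix); offsets[b][p]
    is the 0-prefix sum in (bucket, processor) order, i.e. the global
    start position where processor p writes its part of bucket b."""
    R, P = len(counts), len(counts[0])
    offsets = [[0] * P for _ in range(R)]
    total = 0                        # sum of all previous rows
    for b in range(R):
        row = 0                      # prefix within the current row
        for p in range(P):
            offsets[b][p] = total + row
            row += counts[b][p]
        total += row
    return offsets
```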