NHPCC(Hefei) • USTC • CHINA  [email protected]
Parallel Computation
Architecture, Algorithm and Programming
Professor Guoliang Chen
Dept. of Comp. Sci. & Tech., Univ. of Sci. & Tech. of China
Textbook Series for 21st Century
Browse the homepages of HPCC in USA.
A crossbar chip design for SP2
Multistage crossbar network in Cray Y-MP
The structure of fat tree (the four-way fat tree implemented in CM-5)
Part I: Hardware basics for parallel computing
Chapter 1 Architecture models of parallel computer
Buffered wormhole routing is used
  No conflict: 8 packet cells (called flits) pass through the 8x8 switch in every 40MHz cycle
  Hot-spot conflict: only one crosspoint will be enabled at a time
Central queue
  Dual-port RAM, 1 read & 1 write in a clock cycle
  Deserializes eight flits from a FIFO into a chunk, then writes the chunk to the central queue in one cycle
A crossbar chip design for SP2
4x4 and 8x8 crossbar switches and 1x8 de-multiplexors in three stages, supporting data streaming between 8 vector processors and 256 memory banks
Multistage crossbar network in Cray Y-MP
Alleviates the bottleneck toward the root of a tree structure (Leiserson 1985)
The number of links increases when traversing from a leaf node toward the root
Each node has 4 child nodes and 2 or 4 parent nodes (shown in the left figure)
The structure of multiway fat tree
Figure 1.34 Four-way fat tree implemented in CM-5
In the two-way fat tree shown in Figure 1.33, each node has 2 parent nodes. If each oval in the figure is regarded as a single node and the multi-edge between a pair of nodes as one edge, then the fat tree can be regarded as a binary tree. The question is: if we regard each square in the figure, rather than each oval, as a node, what multistage interconnection network do we get when looking from the leaves to the root?
In the butterfly network shown in Fig. 1.37, N = (k+1)2^k. Node degree = ? Network diameter = ? Bisection width = ?
Exercises (1)
Figure 1.33 Two-way fat tree
Figure 1.37 Butterfly network (k=3)
Exercises (2)
The Cedar multiprocessor at Illinois is built with a clustered Omega network as shown in the following figure. Four 8x4 crossbar switches are used in stage 1 and four 4x8 crossbar switches are used in stage 2. There are 32 processors and 32 memory modules, divided into four clusters with eight processors per cluster.
(1) Figure out a fixed-priority scheme to avoid conflicts in using the crossbar switches for nonblocking connections. For simplicity, consider only the forward connections from the processors to the memory modules.
(2) Suppose both stages use 8x8 crossbar switches. Design a two-stage Cedar network to provide switched connections between 64 processors and 64 memory modules.
(3) Further expand the Cedar network to three stages, using 8x8 crossbar switches as building blocks, to connect 512 processors and 512 memory modules. Show all connections between adjacent stages from the input end to the output end.
Chapter 2 SMP, MPP and COW
The architecture of fat hypercube in Origin-2000
128-way high-performance switches in SP2
The architecture of fat hypercube in Origin-2000
Each SPIDER router (R) has six ports: two for node connections and four for network connections
Modified from the traditional binary hypercube
  Advantage: the linear bisection bandwidth of the hypercube while avoiding an increasing node degree
  Each node contains two processors, and two nodes are connected to a router
  Beyond 32 nodes, it employs a fat hypercube topology using extra routers (metarouters)
Figure 2.6 Fat hypercube topology in Origin 2000: (a) 2 nodes (b) 4 nodes (c) 8 nodes (d) 16 nodes (e) 32 nodes (f) 64 nodes (g) 128 nodes
128-way high-performance switches in SP2
Each link is 8-bit wide and bidirectional
Each frame consists of 16 processing nodes connected by a 16-way switchboard (each switchboard contains two stages of switch chips)
Eight frames are interconnected by an extra stage of switchboards, so this MIN has four switch stages
Packet switching, buffered wormhole routing, 40MHz clock
The hardware latency is small when contention-free, but much higher in actual application processes
Figure 2.10 A 128-way high-performance switch built with four stages of 16-way switches in IBM SP2
Exercises (1)
Point out the differences between a centralized file system and xFS
Figure 2.19 Comparison of two file systems: (a) centralized file server (b) xFS
The code shown below is the process of short-message communication between two processes using active messages in NOW. Process Q computes array A, and process P computes scalar x. The final result is the summation of x and A[7]. Please explain the process of remote fetching.

  Process P                          Process Q
  compute x                          compute A
      :                                  :
  am_enable(...);                    am_enable(...);
  am_request_1(Q, request_h, 7);         :
  am_poll();                         am_poll();
  sum = x + y;                           :
      :                                  :
  am_disable();                      am_disable();

  request_h(vnn_t P, int k)          reply_h(vnn_t Q, int z)
    { am_reply_1(P, reply_h, A[k]); }  { y = z; }
Exercises (2)
Chapter 3 Performance Evaluation (I)
Compute A_{n×n} × B_{n×n} = C_{n×n} by using the computation models of DSM and DM.
Analyze the scalability of the FFT algorithm on an n-dimensional hypercube.
In the hypercube network, the distance between all pairs of processors communicating with each other is 1, so the total overhead is given by (1).
When p increases, n must increase too in order to keep E constant, so that T_e = kT_o, that is, (2), where k = E/(1−E). Because n log p ≥ p log p, the second term in (2) is the dominant one, so it suffices to consider the case t_c n log n = k t_b n log p. By simple algebraic transformation we obtain n = p^{k t_b/t_c}. Because W = t_c n log n, the isoefficiency function f_E(p) is given by (3).
In (3), if k t_b/t_c < 1, the growth rate of W is less than O(p log p); if k t_b/t_c > 1, the isoefficiency function W deteriorates rapidly as k t_b/t_c increases. When t_b = t_c: if k < 1 (that is, E < 0.5), then W = Θ(p log p); if k > 1 (that is, E > 0.5), for example E = 0.9, k = 9, then W = Θ(p⁹ log p). That is, the isoefficiency function gets worse in this case.
(1) T_o = (t_s + t_h) p log p + t_b n log p
(2) t_c n log n = k [ (t_s + t_h) p log p + t_b n log p ]
(3) W = f_E(p) = k t_b p^{k t_b / t_c} log p
Given V_{1/2} = (1/2)V = (1/2) × 1.7 Mflops = 0.85 Mflops, compute:
(1) the upper triangle matrix of (p, p′)
(2) the curve of the function (p′) = f(p′)
According to Figure 3.7, we can work out Table I, showing the relationship between processor numbers and execution times. When the average speed is constant, we can work out Table II by using Table I and the formula (p, p′) = T/T′.
Chapter 3 Performance Evaluation (II)
Table I Processor numbers and their related execution times

Proc num       1        2        4        8        16       32       64      128
Exec time   0.004029  0.00913  0.01362  0.01744  0.02144  0.02561  0.0296  0.03338

Table II Upper triangle matrix of (p, p′)
Browse the web site http://www.netlib.org/liblist.html
Analyze the scalability of the n-point FFT algorithm on a row-numbered mesh. Suppose the time to establish a communication connection is t_s, the per-hop latency is t_n, the time for sending a packet is t_b, and the unit computing time is t_c.
(1) Compute the communication hops of processor P_i
(2) Compute the total communication latency T_o
(3) Compute the isoefficiency function W = f_E(p) and discuss it. (Hint: comparing FFT with the topology of the mesh, you can find that communication between processors occurs only within the same row or the same column, and the largest number of communication hops is √p/2.)
The time complexity of the multiplication of two N×N factored matrices is T_1 = CN³ s, where C is a constant. The time complexity of the parallel matrix multiplication on an n-node parallel machine is T_n = (CN³/n + bN²/√n) s, where b is another constant; the first term is the computing time and the second term is the communication overhead.
(1) Compute the speedup under a fixed workload and discuss it
(2) Compute the speedup under fixed time and discuss it
(3) Compute the speedup under limited storage and discuss it
Exercises
Chapter 4 Algorithm Basics
PRAM Model
BSP Model
Summation Algorithm in the LogP Model
Part II: Parallel Algorithm Designing
An Example of PRAM Model
Performance attributes:
  The machine size n can be arbitrarily large
  The basic time step is called a cycle
  Within each cycle, each processor executes one instruction
  All processors implicitly synchronize at each cycle, and the synchronization overhead is assumed to be zero
  An instruction can be any random-access machine instruction
Compute the inner product s of two N-dimensional vectors A and B on an n-processor EREW PRAM computer
  Assign each processor 2N/n additions and multiplications to generate a local result in 2N/n cycles
  Add all n local sums into the final sum s by a tree-like reduction method in log n cycles
  The total execution time is 2N/n + log n
  The speedup is n/{1 + n log n/(2N)} → n when N ≫ n
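The two phases above can be sketched sequentially; the chunk assignment and the pairwise reduction order below are our own illustration of the schedule, not code from the text:

```python
# Sketch of the EREW PRAM inner-product schedule: n "processors" each
# form a local partial sum of products, then the n local sums are
# combined by a tree-like pairwise reduction in ceil(log2 n) steps.
def pram_inner_product(A, B, n):
    N = len(A)
    chunk = (N + n - 1) // n
    # Phase 1: each processor i multiplies and adds its 2N/n-cycle chunk
    local = [sum(A[j] * B[j] for j in range(i * chunk, min((i + 1) * chunk, N)))
             for i in range(n)]
    # Phase 2: tree-like reduction, halving the number of live sums per step
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            if i + step < n:
                local[i] += local[i + step]
        step *= 2
    return local[0]

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]
print(pram_inner_product(A, B, 4))  # 120
```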
An Example of BSP Model
Compute the inner product s of two N-dimensional vectors A and B on an 8-processor BSP computer
Superstep 1
  Computation: each processor computes its local sum in w = 2N/8 cycles
  Communication: processors 0, 2, 4, 6 send their local sums to processors 1, 3, 5, 7 (a 1-relation applies here)
  Barrier synchronization
Superstep 2
  Computation: processors 1, 3, 5, 7 each perform one addition (w = 1)
  Communication: processors 1, 5 send their intermediate results to processors 3, 7 (a 1-relation applies here)
  Barrier synchronization
Superstep 3
  Computation: processors 3, 7 each perform one addition (w = 1)
  Communication: processor 3 sends its intermediate result to processor 7 (a 1-relation applies here)
  Barrier synchronization
Superstep 4
  Computation: processor 7 performs one addition (w = 1) to generate the final sum
  No more communication or synchronization is needed
The total execution time is 2N/8 + 3g + 3l + 3 cycles. In general, the execution time is 2N/n + log n (g + l + 1) cycles on an n-processor BSP computer.
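The general cost formula can be checked with a small helper; the parameter values in the demo are our own, not from the text:

```python
import math

# Sketch of the BSP cost formula above: 2N/n cycles of local work plus
# log2(n) combining supersteps, each costing one addition plus g + l.
def bsp_inner_product_cycles(N, n, g, l):
    return 2 * N // n + int(math.log2(n)) * (g + l + 1)

# The 8-processor case of the example: 2N/8 + 3g + 3l + 3 cycles
print(bsp_inner_product_cycles(N=400, n=8, g=4, l=20))  # 175
```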
Summation Algorithm on LogP Model
In the summation algorithm, a processor that has k child processors needs to receive k messages. In order to overlap the gap overhead, it needs at least (k−1)(g−o) local additions.
Figure a: Summation tree (L=4, g=4, o=2, P=8). Each processor holds n (underlined) local data; the number in a circle is the processor's active time.
Figure b: Processors' active states (N=81, t=27, L=4, g=4, o=2, P=8)
Suppose that at the beginning the data d_i are located on the P_i (1 ≤ i ≤ n). The summation here means using Σ_{j=1}^{i} d_j to replace the original d_i on P_i. The summation algorithm on the PRAM model is given in Algorithm 4.3.
Summation Algorithm 4.3 on PRAM-EREW:
Input: d_i is kept in P_i, where 1 ≤ i ≤ n
Output: Σ_{j=1}^{i} d_j replaces d_i in processor P_i
Begin
  for j = 0 to log n − 1 do
    for i = 2^j + 1 to n par-do
      (i) P_i reads d_{i−2^j}
      (ii) d_i = d_i + d_{i−2^j}
    end for
  end for
End
(1) Compute the summation (n = 8) step by step according to the algorithm shown above
(2) Analyze the time complexity of Algorithm 4.3
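A sequential sketch of Algorithm 4.3 (this is a parallel prefix-sum pattern), using 0-based indices instead of the slide's 1-based ones; the snapshot list models the EREW cycle in which all reads happen before any write:

```python
import math

# Sequential simulation of Algorithm 4.3 on PRAM-EREW: in round j,
# every d[i] with index >= 2**j adds in the value d[i - 2**j] that
# was current at the start of the round.
def pram_prefix_sums(d):
    n = len(d)
    d = list(d)
    for j in range(int(math.log2(n))):
        prev = list(d)                 # values at the start of the round
        for i in range(2 ** j, n):     # the slide's i = 2^j + 1 .. n
            d[i] = d[i] + prev[i - 2 ** j]
    return d

print(pram_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```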
Exercises (1)
When designing an algorithm on the APRAM model, we should try to make each processor's local computing time plus reading and writing time comparable to the synchronization time B. When computing the summation of n numbers on APRAM, we can use a summation algorithm on a B-tree. Suppose p processors compute the summation of n numbers, and each processor has n/p numbers. Each processor computes its local summation, then reads its B children's local summations from the shared memory, adds them together, and puts the result into an appointed shared-memory unit SM. At last, the summation result in the root processor is the total summation of the n numbers. Algorithm 4.4 shows the summation algorithm on APRAM.
Algorithm 4.4 The summation algorithm on APRAM
Input: n numbers
Output: the total summation, stored in a shared-memory unit SM
Begin
  (1) each processor computes its local summation of n/p numbers and puts it into SM
  (2) Barrier
  (3) for k = ⌈log_B(p(B−1)+1)⌉ − 2 down to 0 do
        (3.1) for all P_i, 0 ≤ i ≤ p−1, do
                if P_i is on the kth level then
                  P_i sums its B children's local summation results and its own local one,
                  then puts the result into SM
                end if
              end for
        (3.2) Barrier
      end for
End
(1) Using the parameters of the APRAM model, write out the expression for the time complexity of Algorithm 4.4
(2) Explain the role of the Barrier statements in the algorithm
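A sequential sketch of the B-tree summation in Algorithm 4.4; the implicit heap layout (children of node i are i·B+1 … i·B+B) and the helper names are our own assumptions, not from the slide:

```python
# Sequential simulation of Algorithm 4.4: p "processors" each reduce
# n/p numbers, then partial sums are combined level by level up an
# implicit B-ary tree, from the deepest internal level to the root.
def node_level(i, B):
    # level of node i in an implicit B-ary heap (root = level 0)
    level = 0
    while i > 0:
        i = (i - 1) // B
        level += 1
    return level

def apram_btree_sum(data, p, B):
    chunk = len(data) // p
    SM = [sum(data[i * chunk:(i + 1) * chunk]) for i in range(p)]  # step (1)
    top = node_level(p - 1, B) - 1        # deepest internal level, step (3)
    for k in range(top, -1, -1):
        for i in range(p):
            if node_level(i, B) == k and i * B + 1 < p:
                SM[i] += sum(SM[c] for c in range(i * B + 1,
                                                 min(i * B + 1 + B, p)))
        # a Barrier would separate the levels on a real APRAM
    return SM[0]

print(apram_btree_sum(list(range(1, 25)), p=4, B=3))  # 300
```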
Exercises (2)
The summation of n numbers on BSP can be done on a d-tree. Suppose p processors compute the summation of n numbers; then each processor is assigned n/p numbers. First, each processor performs the local summation of its n/p numbers; then the summation is performed on the d-tree from bottom to top. The process is shown in Algorithm 4.5.
Algorithm 4.5 Summation Algorithm on BSP
Input: n numbers
Output: the total summation, stored in the root processor P_0
Begin
  (1) for all P_i do
        (1.1) P_i computes the summation of its local n/p numbers
        (1.2) if P_i is on the (⌈log_d(p(d−1)+1)⌉ − 1)th level then
                P_i sends its local summation result to its parent node
              end if
      end for
  (2) Barrier
  (3) for k = ⌈log_d(p(d−1)+1)⌉ − 2 down to 0 do
        (3.1) for all P_i do
                if P_i is on the kth level then
                  P_i receives its d children's messages, sums them with its own local
                  summation result, then sends the result to its parent node
                end if
              end for
        (3.2) Barrier
      end for
End
(1) Analyze the time complexity of Algorithm 4.5
(2) How should d's value be determined?
Exercises (3)
Chapter 5 Designing Method
Parallel string matching algorithm
The shortest paths among all pairs of points
Suppose T = abaababaababaababaababa and P = abaababa (m = 8). Show the execution of the parallel string matching algorithm.
Answer:
For pattern P: WIT(1)=0, WIT(2)=1, WIT(3)=2, WIT(4)=4
We know that P is an aperiodic string. In order to match it against string T, we should
  compute WIT[1:n−m+1] of P relative to T
  compute duel(p, q)
Partition T and P into 2^k-blocks (k = 1, 2, …), and then compute duel(p, q) among the blocks in parallel.
k=0, WIT=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
k=1, WIT=[0,1][2,0] [2,0][2,0] [0,1][0,1] [2,0][2,0]
k=2, WIT=[0,1,2,4] [2,0,2,2] [4,1,0,1] [2,4,2,0]
At last, we know that matching occurs at positions 1, 6, 11, 16.
Parallel string matching algorithm
T: a b a a b a b a a b a b a a b a b a a b a
k=1 survivors: 1 4 6 8 9 11 14 16
k=2 survivors: 1 6 11 16
The shortest paths among all pairs of points
Compute the shortest paths among all pairs of points in the directed, weighted graph shown in Figure 5.2.

Figure 5.2 A directed weighted graph on vertices V0–V6, together with the distance matrices (a)–(f) produced step by step by the algorithm
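The successive matrices (a)–(f) come from repeatedly combining a distance matrix with itself; the "min-plus" squaring sketch below shows that parallelizable scheme. The 3-vertex graph is our own toy example, not the graph of Figure 5.2:

```python
# Sketch of all-pairs shortest paths by repeated "min-plus" matrix
# squaring: each squaring doubles the admissible path length, so
# ceil(log2(n-1)) rounds suffice, and every entry can be computed
# in parallel within a round.
INF = float('inf')

def min_plus_square(D):
    n = len(D)
    return [[min(D[i][k] + D[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_shortest(D):
    n = len(D)
    for _ in range(max(1, (n - 1).bit_length())):
        D = min_plus_square(D)
    return D

D = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(all_pairs_shortest(D))  # [[0, 4, 5], [3, 0, 1], [2, 6, 0]]
```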
The algorithm below shows how to compute the function duel(p, q).
Input: WIT[1:n−m+1], 1 ≤ p < q ≤ n−m+1, (q−p) ≤ ⌈m/2⌉
Output: return the duel survivor's position, or null (which means one of p and q does not exist)
Procedure DUEL(p, q)
Begin
  if p = null then duel = q
  else if q = null then duel = p
  else
    (1) j = q − p + 1
    (2) w = WIT(j)
    (3) if T(q + w − 1) ≠ P(w) then
          (i) WIT(q) = w
          (ii) duel = p
        else
          (i) WIT(p) = q − p + 1
          (ii) duel = q
        end if
  end if
End
(1) Set T = abaababaabaababababa, P = abaababa; compute WIT(i)
(2) Consider the duel case when p = 6 and q = 9
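A sketch of Procedure DUEL, with the slide's 1-based string positions converted to Python's 0-based indexing; WIT is a dict so that unused positions need not be filled in. An illustrative sketch, not library code:

```python
# Sketch of Procedure DUEL: the witness WIT[j] against the shift
# j = q-p+1 eliminates either candidate position p or candidate q.
def duel(T, P, WIT, p, q):
    if p is None:
        return q
    if q is None:
        return p
    j = q - p + 1
    w = WIT[j]
    if T[q + w - 2] != P[w - 1]:   # the slide's T(q+w-1) <> P(w), 0-based
        WIT[q] = w                 # q loses the duel: w witnesses against q
        return p
    WIT[p] = q - p + 1             # otherwise p loses
    return q

T = "abaababaababaababaababa"
P = "abaababa"
WIT = {1: 0, 2: 1, 3: 2, 4: 4}     # witness values quoted in the slide
print(duel(T, P, WIT, 1, 2))  # 1  (position 1 survives)
```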
Exercises (1)
Consider the unit-weight directed graph shown in Figure 5.3. Use the multiplication of Boolean adjacency matrices to compute its transitive closure.
Exercises (2)
Figure 5.3 Unit-weight directed graph (vertices a, b, c, d)
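The technique the exercise asks for can be sketched as repeated Boolean matrix multiplication; the 4-vertex chain a→b→c→d below is our own example, not the graph of Figure 5.3:

```python
# Sketch of transitive closure via Boolean matrix multiplication:
# with self-loops added, squaring the adjacency matrix log2(n) times
# accumulates reachability over paths of every length.
def bool_mat_mult(A, B):
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(A):
    n = len(A)
    # reflexive start so A^k covers all paths of length <= k
    R = [[bool(A[i][j]) or i == j for j in range(n)] for i in range(n)]
    for _ in range((n - 1).bit_length()):
        R = bool_mat_mult(R, R)
    return R

A = [[0, 1, 0, 0],   # a -> b
     [0, 0, 1, 0],   # b -> c
     [0, 0, 0, 1],   # c -> d
     [0, 0, 0, 0]]
TC = transitive_closure(A)
print(TC[0][3], TC[3][0])  # True False  (a reaches d; d reaches nothing new)
```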
Chapter 6 Designing Techniques
PSRS Algorithm
(m, n)-Selection Algorithm
Suppose the sequence length is n = 27 and there are p = 3 processors. The sorting process of PSRS is shown in Fig. 6.1.
PSRS Algorithm
Figure 6.1 The PSRS process (n=27, p=3):
(a) Partition: 15 46 48 93 39 6 72 91 14 | 36 69 40 89 61 97 12 21 54 | 53 97 84 58 32 27 33 72 20
(b) Local sorting: 6 14 15 39 46 48 72 91 93 | 12 21 36 40 54 61 69 89 97 | 20 27 32 33 53 58 72 84 97
(c) Sampling: 6 39 72 | 12 40 69 | 20 33 72
(d) Sample sorting: 6 12 20 33 39 40 69 72 72
(e) Pivot choosing: 33, 69
(f) Pivot partition, (g) global exchange, and (h) merge sorting then complete the sort
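The steps (a)–(h) above can be sketched sequentially, with p lists standing in for processors; for p=3 and the slide's data this picks the same pivots, 33 and 69:

```python
# Sketch of PSRS (Parallel Sorting by Regular Sampling), simulated
# sequentially: partition + local sort, regular sampling, pivot
# selection, pivot partition / global exchange, final merge per bucket.
def psrs(seq, p):
    n = len(seq)
    size = (n + p - 1) // p
    parts = [sorted(seq[i * size:(i + 1) * size]) for i in range(p)]  # (a)(b)
    sample = sorted(part[j * len(part) // p]                          # (c)(d)
                    for part in parts for j in range(p))
    pivots = [sample[(i + 1) * p] for i in range(p - 1)]              # (e)
    buckets = [[] for _ in range(p)]                                  # (f)(g)
    for part in parts:
        lo = 0
        for i, piv in enumerate(pivots):
            hi = lo
            while hi < len(part) and part[hi] <= piv:
                hi += 1
            buckets[i].extend(part[lo:hi])
            lo = hi
        buckets[-1].extend(part[lo:])
    return [x for b in buckets for x in sorted(b)]                    # (h)

data = [15, 46, 48, 93, 39, 6, 72, 91, 14, 36, 69, 40, 89, 61, 97,
        12, 21, 54, 53, 97, 84, 58, 32, 27, 33, 72, 20]
print(psrs(data, 3) == sorted(data))  # True
```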
Input: an unordered n-sequence A = (a_1, a_2, …, a_n)
Output: the m smallest elements of the sequence
Algorithm body:
(1) Partition: partition A into g = n/m groups, each group containing m elements
(2) Local sorting: use a Batcher sorting network to sort each group in parallel
(3) Compare all the groups two by two and obtain the MIN sequences
(4) Sort & compare: for all the MIN sequences, repeat steps 2 and 3 until the m smallest elements are obtained
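A sequential sketch of the selection scheme above, with `sorted()` standing in for the Batcher sorting network:

```python
# Sketch of (m, n)-selection: sort each m-element group, then repeatedly
# merge groups two by two and keep only the MIN sequence (the m smallest
# elements) of each pair until a single group remains.
def m_smallest(A, m):
    groups = [sorted(A[i:i + m]) for i in range(0, len(A), m)]  # steps 1-2
    while len(groups) > 1:
        nxt = []
        for i in range(0, len(groups) - 1, 2):                  # step 3
            nxt.append(sorted(groups[i] + groups[i + 1])[:m])
        if len(groups) % 2:
            nxt.append(groups[-1])                              # odd group out
        groups = nxt                                            # step 4
    return groups[0]

print(m_smallest([9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 11, 10], 4))  # [0, 1, 2, 3]
```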
(m, n)-Selection Algorithm
Figure 6.3 The (m, n)-selection network: the m-element groups pass through sorters (S), and pairwise MAX/MIN comparison stages keep the MIN sequences
Analyze the execution process of the convolution on a one-dimensional systolic array.
Exercises (1)
Figure 6.10 The convolution on a one-dimensional systolic array: the weights w1–w4 are held in cells c1–c4, the inputs x1–x6 stream through, and the outputs y1–y3 emerge over cycles 1–12
Using the following algorithm, compute the connected components of the graph in Fig. 6.11.
Algorithm 6.12 Hirschberg's algorithm for computing connected components on PRAM-CREW
Input: adjacency matrix An×n
Output: vector D[0:n-1], where D(i) is the component that vertex i belongs to
Begin
(1) for all i: 0 ≤ i ≤ n-1 par-do
      D(i)=i
    end for
    do steps (2) through (6) for log n iterations:
(2) for all i, j: 0 ≤ i, j ≤ n-1 par-do
     (2.1) C(i)=min_j{D(j) | A(i,j)=1 and D(i)≠D(j)}
     (2.2) if none then C(i)=D(i) endif
    end for
(3) for all i, j: 0 ≤ i, j ≤ n-1 par-do
     (3.1) C(i)=min_j{C(j) | D(j)=i and C(j)≠i}
     (3.2) if none then C(i)=D(i) endif
    end for
(4) for all i: 0 ≤ i ≤ n-1 par-do D(i)=C(i) end for
(5) for log n iterations do
      for all i: 0 ≤ i ≤ n-1 par-do C(i)=C(C(i)) end for
    end for
(6) for all i: 0 ≤ i ≤ n-1 par-do D(i)=min{C(i), D(C(i))} end for
End
Part II: Parallel Algorithm Designing
Exercises (2)
Figure 6.11 An undirected graph with vertices 1–8; the accompanying table lists each vertex and the component it belongs to.
Part III: Parallel Numeric Algorithm
Chapter 8 Basic communication operations
The process of All-to-All Broadcast on a Hypercube by using SF
One-to-All and All-to-All Personalized Communication
The process of All-to-All Broadcast on a Hypercube by using SF
Figure 8.12 All-to-All Broadcast on a 3-dimensional hypercube (nodes 0–7):
(a) each node i holds only its own message (i);
(b) after exchanging along dimension 0, the node pairs hold (0,1), (2,3), (4,5), (6,7);
(c) after dimension 1, each node holds (0,1,2,3) or (4,5,6,7);
(d) after dimension 2, every node holds (0,…,7).
All-to-All Broadcast on a Hypercube
Part III: Parallel Numeric Algorithm
In One-to-All Personalized Communication (also called Single-Node Scatter), there are p packets in the source processor, and every packet has a different destination
In All-to-All Personalized Communication (also called Total Exchange), every processor sends a different packet of the same size m to each of the others
Personalized Communication
Part III: Parallel Numeric Algorithm
Figure 8.14 Personalized communication:
(c) One-to-All personalized communication (its inverse is Single-Node Collect): the source holds packets M0, M1, …, Mp-1 and delivers packet Mi to processor i;
(d) All-to-All personalized communication: processor i holds packets Mi,0, …, Mi,p-1 and sends Mi,j to processor j; afterwards processor j holds M0,j, …, Mp-1,j.
One-to-All Personalized Communication is also called Single-Node Scatter. It differs from one-to-all broadcast in that there are p packets in the source processor, and every packet has a different destination. Fig. 8.17 shows the Single-Node Scatter process on an 8-processor hypercube. Please prove: the time to perform One-to-All Personalized Communication on a hypercube using SF or CT is
t_one-to-all-pers = ts·log p + m·tw·(p-1)
Exercises (1)
Part III: Parallel Numeric Algorithm
Figure 8.17 The Single-Node Scatter process on an 8-processor hypercube:
a) Initial distribution: node 0 holds (0,1,2,3,4,5,6,7)
b) Distribution before the 2nd step: node 0 holds (0,1,2,3), node 4 holds (4,5,6,7)
c) Distribution before the 3rd step: nodes 0, 2, 4, 6 hold (0,1), (2,3), (4,5), (6,7)
d) Final distribution: node i holds (i)
All-to-All Personalized Communication is also called Total Exchange. Every processor sends a different packet of the same size m to each of the others. Fig. 8.18 shows the total exchange process on a 6-processor ring. {x,y} denotes {source processor, destination processor}, and ({x1,y1}, {x2,y2},…, {xn,yn}) denotes the packet stream in the transfer process. Each processor accepts only its own packet and forwards the others. Please prove: the time to perform total exchange on a ring is
t_total-exchange = (ts + (1/2)·m·tw·p)(p-1)
Hint: the size of the packet stream on the ith step is m(p-i)
Part III: Parallel Numeric Algorithm
Exercises (2)
Figure 8.18 The total exchange process on a 6-processor ring: in each step every processor forwards its remaining packets to its neighbor, keeping the one addressed to itself; the stream shrinks from p-1 packets per processor in the first step, e.g. ({0,1},…,{0,5}), down to a single packet such as ({4,3}) in the final step.
Fox Matrix Multiplication Algorithm: divide the matrices A and B into p blocks Ai,j and Bi,j (0 ≤ i, j ≤ √p-1), each of size (n/√p)×(n/√p), and assign them to the processors (P0,0, P0,1, …, P√p-1,√p-1). At the beginning, processor Pi,j contains Ai,j and Bi,j, and it wants to compute Ci,j. The steps are as follows:
 One-to-all broadcast the diagonal block Ai,i to the other processors in the same row
 Every processor multiplies the received A block by its own B block and accumulates the product into its C block
 Processor Pi,j sends its own B block to processor Pi-1,j cyclically
 If the block broadcast in this round was Ai,j, then in the next round choose block Ai,(j+1) mod √p, broadcast it to the other processors in the same row, and go back to step 2
In Figure 9.11, we show the process of the Fox algorithm multiplying two 4×4 block matrices A4×4·B4×4 on a 16-processor machine
Part III: Parallel Numeric Algorithm
Chapter 9 Matrix Operation
Part III: Parallel Numeric Algorithm
An Example of the Fox Matrix Multiplication Algorithm
Figure 9.11 Communication process in Fox multiplication on 16 processors:
a) one-to-all broadcast of the diagonal blocks A0,0, A1,1, A2,2, A3,3 along their rows, then move Bi,j upwards cyclically
b) one-to-all broadcast of A0,1, A1,2, A2,3, A3,0, then move Bi,j upwards cyclically
c) one-to-all broadcast of A0,2, A1,3, A2,0, A3,1, then move Bi,j upwards cyclically
d) one-to-all broadcast of A0,3, A1,0, A2,1, A3,2
NHPCC(Hefei) NHPCC(Hefei) • • USTC USTC • • CHINA [email protected] [email protected]
Referring to Fig. 9.14, Algorithm 9.8 shows the matrix multiplication Am×n × Bn×k = Cm×k on a two-dimensional m×k systolic array. Subscript-matching pairs of matrix elements are brought together for multiplication by delaying the matrix elements according to the principle of pipelining.
Algorithm 9.8
Input: Am×n, Bn×k
Output: the product element Ci,j is left in Pi,j
Begin
 for i=1 to m par-do
  for j=1 to k par-do
   (1) ci,j=0
   (2) while Pi,j receives a and b do
    (2.1) ci,j = ci,j + a*b
    (2.2) if i<m then send b to Pi+1,j endif
    (2.3) if j<k then send a to Pi,j+1 endif
   endwhile
  end for
 end for
End
The questions are:
 In order that ai,s and bs,j meet at the right time, how many time units should row i of matrix A lag behind row i-1 (2 ≤ i ≤ m)? And how many time units should column j of matrix B lag behind column j-1 (2 ≤ j ≤ k)?
 When j=k, should a still be sent to Pi,j+1? When i=m, should b still be sent to Pi+1,j?
Exercises (1)
Part III: Parallel Numeric Algorithm
According to the Fox algorithm discussed in Section 9.4.3, write out a formal description of Fox multiplication.
Exercises (2)
Part III: Parallel Numeric Algorithm
Figure 9.14 The loading pattern of matrices A and B: the 4×3 processor array Pi,j receives the rows of A4×5 (a11…a45) from the left and the columns of B5×3 (b11…b53) from the top, each successive row and column delayed according to the pipelining principle.
Odd-Even Reduction scheme for systems of tridiagonal equations
Gauss-Jordan scheme for systems of linear equations
Conjugate Gradient scheme for systems of linear equations
Part III: Parallel Numeric Algorithm
Chapter 10 System of Linear Equations
Initialize the coefficient arrays
Reduce the coefficients to a single equation
Obtain the result Xn
Substitute it back to obtain Xi (i=1,2,…,n-1)
Part III: Parallel Numeric Algorithm
Odd-Even Reduction scheme (1)
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int MOD (int a,int b) { /* mathematical modulo: result in [0,b) */
  int c=a%b;
  return (c<0)?(c+b):c;
}
main (int argc, char** argv)
{
int myid,i,d,group_size,j,n=9;
double pred,starttime,endtime;
double fpie[12],gpie[12],hpie[12],bpie[12],r[12],delta[12],x[12];
double b[12]={0,0,2,7,13,18,6,3,9,1,0,0};
double f[12]={0,0,0,4,2,5,1,2,2,5,0,0};
double h[12]={0,0,1,5,6,4,1,3,1,0,0,0};
double g[12]={0,0,4,11,14,18,4,6,12,8,0,0};
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&group_size);
starttime=MPI_Wtime();
for (i=0;i<12;i++) {
fpie[i]=gpie[i]=hpie[i]=bpie[i]=0;
x[i]=r[i]=delta[i]=0;
}
Odd-Even Reduction scheme (2)
Part III: Parallel Numeric Algorithm
for (i=0;i<=3-1;i++) {
d=1<<i; /* d = 2^i */
for (j=2*myid*d+2*d+1; j<=n-1; j+=2*d*group_size) {
r[j]=f[j]/g[j-d]; delta[j]=h[j]/g[j+d];
fpie[j]=-r[j]*f[j-d];
gpie[j]=-delta[j]*f[j+d]-r[j]*h[j-d]+g[j];
hpie[j]=-delta[j]*h[j+d];
bpie[j]=b[j]-r[j]*b[j-d]-delta[j]*b[j+d];
}
r[n]=f[n]/g[n-d]; f[n]=r[n]*f[n-d];
g[n]=g[n]-r[n]*h[n-d]; b[n]=b[n]-r[n]*b[n-d];
for (j=2*d*myid+2*d+1; j<=n-1; j+=2*d*group_size) {
f[j]=fpie[j]; g[j]=gpie[j];
h[j]=hpie[j]; b[j]=bpie[j];
}
for (j=2*d+1;j<=n-1;j+=2*d) {
int buf=MOD((j-2*d-1)/(2*d),group_size);
MPI_Bcast(&f[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);
MPI_Bcast(&g[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);
MPI_Bcast(&h[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);
MPI_Bcast(&b[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);
}
}
x[n]=b[n]/g[n];
for (i=3-1; i>-1; i--) {
d=1<<i; /* d = 2^i */
x[d+1]=(b[d+1]-h[d+1]*x[2*d+1])/g[d+1];
for (j=3*d+1+myid*2*d; j<n; j+=2*d*group_size) x[j]=(b[j]-f[j]*x[j-d]-h[j]*x[j+d])/g[j];
for (j=3*d+1; j<n; j+=2*d) {
int buf2=MOD((j-3*d-1)/(2*d),group_size);
MPI_Bcast(&x[j],1,MPI_DOUBLE,buf2,MPI_COMM_WORLD);
}
}
endtime=MPI_Wtime();
if (myid==0) {
  FILE *fp=fopen("eqdata.dat","w");
  for (i=2;i<=n;i++) {
    printf("X[%d]=%10.5f\n",i-1,x[i]);
    fprintf(fp,"X[%d]=%10.5f\n",i-1,x[i]);
  }
  printf("It takes %fs to calculate it\n",endtime-starttime);
  fprintf(fp,"It takes %fs to calculate it\n",endtime-starttime);
  fclose(fp);
}
MPI_Finalize();
}
Odd-Even Reduction scheme (3)
Part III: Parallel Numeric Algorithm
By a series of eliminations, the Gauss-Jordan scheme transforms the coefficient matrix into a diagonal matrix, and then computes each xi from the ith equation directly.
In the following example with n=4, we show the eliminations performed by the Gauss-Jordan scheme:
 Find the pivot with the maximum absolute value in the coefficient matrix; suppose it is a11. The first equation is multiplied by -ai1/a11 and added to the ith equation (i=2,3,4) respectively.
 Find the pivot with the maximum absolute value in the coefficient matrix excluding the first row; suppose it is b22. The second equation is multiplied by -bi2/b22 and added to the ith equation (i=1,3,4) respectively.
 Repeating this, we finally obtain a diagonal system of linear equations and can compute each xi directly: x1=a15/a11, x2=b25/b22, x3=c35/c33, x4=d45/d44.
Gauss-Jordan Scheme
Part III: Parallel Numeric Algorithm
After eliminating the first column:
  | a11 a12 a13 a14 | |x1|   |a15|
  |  0  b22 b23 b24 | |x2| = |b25|
  |  0  b32 b33 b34 | |x3|   |b35|
  |  0  b42 b43 b44 | |x4|   |b45|
After eliminating the second column:
  | a11  0  a13 a14 | |x1|   |a15|
  |  0  b22 b23 b24 | |x2| = |b25|
  |  0   0  c33 c34 | |x3|   |c35|
  |  0   0  c43 c44 | |x4|   |c45|
After all eliminations:
  | a11  0   0   0  | |x1|   |a15|
  |  0  b22  0   0  | |x2| = |b25|
  |  0   0  c33  0  | |x3|   |c35|
  |  0   0   0  d44 | |x4|   |d45|
Initialize x(0)=0, P(0)=0, r(0)=-b
Compute the residual r(k)=Ax(k-1)-b
Compute the direction vector
 P(k)=-r(k)+[rT(k)r(k)/rT(k-1)r(k-1)]·P(k-1)
Compute the step length
 a(k)=-PT(k)r(k)/[PT(k)·AP(k)]
Update x(k)=x(k-1)+a(k)P(k)
Part III: Parallel Numeric Algorithm
Conjugate Gradient Scheme (1)
#include <mpi.h>
#define ITERATION 1000
double A[9][9]={{2,1,0,0,0,0,0,0,0},
{1,2,1,0,0,0,0,0,0},
{0,1,1,0,0,0,0,0,0},
{0,0,0,2,1,0,0,0,0},
{0,0,0,1,2,1,0,0,0},
{0,0,0,0,1,1,0,0,0},
{0,0,0,0,0,0,2,1,0},
{0,0,0,0,0,0,1,2,1},
{0,0,0,0,0,0,0,1,1} }; /*Matrix to solve*/
double localA[9]; /* the row of the matrix kept by this node */
double b[9]={1,0,1,1,0,1,1,0,1};
double r[9]={-1,0,-1,-1,0,-1,-1,0,-1};
double d[9]={0,0,0,0,0,0,0,0,0};
double x[9]={0,0,0,0,0,0,0,0,0};
double temparr[9];
double localr,localx,locald,localtemp;
int id,size;
int main(int argc, char** argv)
{
double num1,denom1,num2,denom2,temp;
int i,j;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&id);
MPI_Comm_size(MPI_COMM_WORLD,&size);
if (size!=9) {
if (id==0) printf("need and only need 9 nodes!\n");
goto the_exit;
}
printf("Node %d of %d started...\n",id,size);
for (i=0;i<size;i++) localA[i]=A[id][i];
localr=r[id]; locald=d[id]; localx=x[id];
for (i=0;i<1000;i++){
temp=localr*localr;
MPI_Barrier(MPI_COMM_WORLD);
/*calculate the inner product of r&r*/
MPI_Allreduce(&temp,&denom1,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
localr=0;
for (j=0;j<size;j++) localr+=localA[j]*x[j];
localr-=b[id];
MPI_Barrier(MPI_COMM_WORLD);
/*gather the vector of r: r=Ax-b*/
MPI_Allgather(&localr,1,MPI_DOUBLE,r,1,MPI_DOUBLE,MPI_COMM_WORLD);
temp=localr*localr;
MPI_Barrier(MPI_COMM_WORLD);
Part III: Parallel Numeric Algorithm
Conjugate Gradient Scheme (2)
NHPCC(Hefei) NHPCC(Hefei) • • USTC USTC • • CHINA [email protected] [email protected]
/*calculate the inner product of r&r*/
MPI_Allreduce(&temp,&num1,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
if (num1<.0001) break;
locald=num1*locald/denom1-localr;
MPI_Barrier(MPI_COMM_WORLD);
/*gather the vector of d*/
MPI_Allgather(&locald,1,MPI_DOUBLE,d,1,MPI_DOUBLE,MPI_COMM_WORLD);
temp=locald*localr;
MPI_Barrier(MPI_COMM_WORLD);
/*calculate the inner product of d&r*/
MPI_Allreduce(&temp,&num2,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
localtemp=0;
for (j=0;j<size;j++) localtemp+=localA[j]*d[j];
temp=locald*localtemp;
MPI_Barrier(MPI_COMM_WORLD);
/*Calculate the product of d'[A]d*/
MPI_Allreduce(&temp,&denom2,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
localx=localx-num2/denom2*locald;
MPI_Barrier(MPI_COMM_WORLD);
/*calculate the new vector of x*/
MPI_Allgather(&localx,1,MPI_DOUBLE,x,1,MPI_DOUBLE,MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
}
if (id==0) {
printf("the solution is\n");
for (i=0;i<size;i++) printf("x%d:%f\n",i,x[i]);
}
the_exit:
MPI_Finalize();
}
The solution is
x0: 2.000000
x1:-3.000000
x2: 4.000000
x3: 2.000000
x4:-3.000000
x5: 4.000000
x6: 2.000000
x7:-3.000000
x8: 4.000000
mat: end of program (1.590001 seconds)
Conjugate Gradient Scheme (3)
Part III: Parallel Numeric Algorithm
Given x(0)=[0, 0, 1]T, use the Conjugate Gradient method to solve the equation system below.
Algorithm 10.11 Odd-Even Reduction scheme on SISD for systems of tridiagonal equations
Input: An×n, b=[b1,…,bn]T
Output: x=[x1,…,xn]T
Begin
(1) for i=0 to log n - 1 do
 (1.1) d=2^i
 (1.2) for j=2^i+1 to n-1 step 2d do
   r_j=f_j/g_{j-d}
   δ_j=h_j/g_{j+d}
   f'_j=-r_j·f_{j-d}
   g'_j=g_j-r_j·h_{j-d}-δ_j·f_{j+d}
   h'_j=-δ_j·h_{j+d}
   b'_j=b_j-r_j·b_{j-d}-δ_j·b_{j+d}
  end for
 (1.3) r_n=f_n/g_{n-d}
 (1.4) f_n=-r_n·f_{n-d}
 (1.5) g_n=g_n-r_n·h_{n-d}
 (1.6) b_n=b_n-r_n·b_{n-d}
 (1.7) for j=2^i+1 to n-1 step 2d do
   g_j=g'_j, f_j=f'_j, h_j=h'_j, b_j=b'_j
  end for
end for
Part III: Parallel Numeric Algorithm
Exercises (1)
  | 2 1 0 | |x1|   |1|
  | 1 2 1 | |x2| = |0|
  | 0 1 1 | |x3|   |1|
(2) x_n=b_n/g_n
(3) for i=log n - 1 to 0 step -1 do
 (3.1) d=2^i
 (3.2) x_d=(b_d-h_d·x_{2d})/g_d
 (3.3) for j=3d to n step 2d do
   x_j=(b_j-f_j·x_{j-d}-h_j·x_{j+d})/g_j
  end for
end for
End
Suppose
  A = | 4  1  0  0  0  0  0  0 |      B = |  2 |
      | 4 11  5  0  0  0  0  0 |          |  7 |
      | 0  2 14  6  0  0  0  0 |          | 13 |
      | 0  0  5 18  4  0  0  0 |          | 18 |
      | 0  0  0  1  4  1  0  0 |          |  6 |
      | 0  0  0  0  2  6  3  0 |          |  3 |
      | 0  0  0  0  0  2 12  1 |          |  9 |
      | 0  0  0  0  0  0  5  8 |          |  1 |
Compute AX = B.
Compute the principal nth root of unity when n=8
Part III: Parallel Numeric Algorithm
Chapter 11 FFT
With n=8, ω = e^{2πi/8} = cos(π/4) + i·sin(π/4) = √2/2 + (√2/2)i. The powers of ω are:
 ω^1 = √2/2 + (√2/2)i
 ω^2 = i
 ω^3 = -√2/2 + (√2/2)i
 ω^4 = -1
 ω^5 = -√2/2 - (√2/2)i
 ω^6 = -i
 ω^7 = √2/2 - (√2/2)i
 ω^8 = 1
Figure 11.1 The principal 8th roots of unity on the unit circle: ω^8=(1,0), ω^2=(0,i), ω^4=(-1,0), ω^6=(0,-i), with ω^1, ω^3, ω^5, ω^7 on the diagonals
FFT algorithm on SIMD-MC2
Input: a_k in P_k, where k=0,…,n-1
Output: b_k in P_k
1. for k=0 to n-1 par-do
    C_k=a_k
   end for
2. for h=log n - 1 to 0 do
    for k=0 to n-1 par-do
     (2.1) p=2^h
     (2.2) q=n/p
     (2.3) z=ω^p
     (2.4) if (k mod p = k mod 2p) then par-do
        (i)  C_k = C_k + C_{k+p}·z^{r(k) mod q}
        (ii) C_{k+p} = C_k - C_{k+p}·z^{r(k) mod q}   (using the old value of C_k)
      endif
    end for
   end for
3. for k=0 to n-1 par-do
    b_k=C_{r(k)}
   end for
(r(k) denotes the bit reversal of k.)
Part III: Parallel Numeric Algorithm
Chapter 11 FFT
An Example of the FFT algorithm on SIMD-MC2
Suppose n=4 processors form a 2×2 mesh, numbered in row-major order.
In step 2 of the algorithm, when h=1: p=2, q=2, z=ω^2; processors P0 and P1 satisfy k mod 2 = k mod 4:
 P0: c0=c0+c2(ω^2)^0=a0+a2, c2=c0-c2(ω^2)^0=a0-a2
 P1: c1=c1+c3(ω^2)^0=a1+a3, c3=c1-c3(ω^2)^0=a1-a3
At the same time, processors in the same column need to communicate with each other.
When h=0: p=1, q=4, z=ω; processors P0 and P2 satisfy k mod 1 = k mod 2:
 P0: c0=c0+c1ω^0=(a0+a2)+(a1+a3), c1=c0-c1ω^0=(a0+a2)-(a1+a3)
 P2: c2=c2+c3ω=(a0-a2)+(a1-a3)ω, c3=c2-c3ω=(a0-a2)-(a1-a3)ω
In step 3 the results are b0=c0, b1=c2, b2=c1, b3=c3. When computing b1=c2 and b2=c1, the processors on the diagonal need to communicate with each other.
Part III: Parallel Numeric Algorithm
Chapter 11 FFT
Figure 11.5 The √n × √n mesh (MC2) structure used by the parallel FFT: processors P0–P15 arranged in a 4×4 grid, numbered in row-major order.
Please compute the inverse DFT of the following sequences:
 (16, -0.76+8.66i, -6+6i, -9.25+2.66i, 0, -9.25-2.66i, -6-6i, -0.76-8.66i)
 (4-i, 2+i, 2+i, -i, 4-i, 2+i, 2+i, -i)
Given n=8=2^k, on the butterfly network, according to exp(r,i)=j (0 ≤ i ≤ n-1, 0 ≤ r ≤ k), compute the element ω^j in the coefficient matrix of the FFT for the 8 points distributed in a butterfly network.
Part III: Parallel Numeric Algorithm
Exercises
Three Parallelization Approaches
A sequential code fragment
 for(i=0;i<N;i++) A[i]=b[i]*b[i+1];
 for(i=0;i<N;i++) c[i]=A[i]+A[i+1];
Equivalent parallel code using library routines
 id=my_process_id();
 p=number_of_process();
 for(i=id;i<N;i=i+p) A[i]=b[i]*b[i+1];
 barrier();
 for(i=id;i<N;i=i+p) c[i]=A[i]+A[i+1];
Equivalent code in Fortran 90 using array operations
 (what happens to my_process_id(), number_of_process(), and barrier()?)
 A(0:N-1)=b(0:N-1)*b(1:N)
 c=A(0:N-1)+A(1:N)
Equivalent code using pragmas in SGI Power C
 #pragma parallel
 #pragma shared(A,b,c)
 #pragma local(i)
 {
  #pragma pfor iterate(i=0;N;1)
  for(i=0;i<N;i++) A[i]=b[i]*b[i+1];
  #pragma synchronize
  #pragma pfor iterate(i=0;N;1)
  for(i=0;i<N;i++) c[i]=A[i]+A[i+1];
 }
Part IV: Parallel Programming
A sequential C code to compute π
#define N 1000000
main() {
 double local,pi=0.0,w;
 long i;
 w=1.0/N;
 for(i=0;i<N;i++) {
  local=(i+0.5)*w;
  pi=pi+4.0/(1.0+local*local);
 }
 printf("pi is %f\n",pi*w);
}/*main()*/
Part IV: Parallel Programming
Shared-Variable Parallel Code for Computing π
#define N 1000000
main() {
 double local,sum,pi=0.0,w;
 long i;
A: w=1.0/N;
B: #pragma parallel
 #pragma shared(pi,w)
 #pragma local(i,local,sum)
 {
  sum=0.0;
  #pragma pfor iterate(i=0;N;1)
  for(i=0;i<N;i++) {
P:  local=(i+0.5)*w;
Q:  sum=sum+4.0/(1.0+local*local);
  }
C: #pragma critical
  pi=pi+sum;
 }
D: printf("pi is %f\n",pi*w);
}/*main()*/
Part IV: Parallel Programming
Computing π in OpenMP
 Program Compute_pi
 integer n,i
 real*8 w,x,sum,pi,f,a
C function to integrate
 f(a)=4.d0/(1.d0+a*a)
 print *,'Enter number of intervals:'
 read *,n
C calculate the interval size
 w=1.0d0/n
 sum=0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w), REDUCTION(+:sum)
 do i=1,n
  x=w*(i-0.5d0)
  sum=sum+f(x)
 enddo
!$OMP END PARALLEL DO
 pi=w*sum
 print *,'computed pi=',pi
 stop
 end
Part IV: Parallel Programming
Exercises:a Pthreads code computing π (1)
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <synch.h>
extern unsigned *micro_timer;
unsigned int fin,start;
semaphore_t semaphore;
barrier_spin_t barrier;
double delta,pi;
typedef struct {
 int low;
 int high;
} Arg;

void child(arg)
Arg *arg;
{
 int low=arg->low;
 int high=arg->high;
 int i;
 double x,part=0.0;
Part IV: Parallel Programming
Exercises:a Pthreads code computing π (2)
 for(i=low; i<=high; i++){
  x=(i+0.5)*delta;
  part+=1.0/(1.0+x*x);
 }
 psema(&semaphore);
 pi+=4.0*delta*part;
 vsema(&semaphore);
 barrier_spin(&barrier);
 pthread_exit(0);
}

main(argc,argv)
int argc;
char *argv[];
{
 int no_of_threads,segments,i;
 pthread_t thread;
 Arg *arg;
 if(argc!=3){
  printf("usage: pi <no_of_threads> <no_of_strips>\n");
  exit(1);
 }
Part IV: Parallel Programming
Exercises:a Pthreads code computing π (3)
 no_of_threads=atoi(argv[1]);
 segments=atoi(argv[2]);
 delta=1.0/segments;
 pi=0.0;
 sema_init(&semaphore,1);
 barrier_spin_init(&barrier,no_of_threads+1);
 start=*micro_timer;
 for(i=0; i<no_of_threads; i++){
  arg=(Arg*)malloc(sizeof(Arg));
  arg->low=i*segments/no_of_threads;
  arg->high=(i+1)*segments/no_of_threads-1;
  pthread_create(&thread,pthread_attr_default,child,arg);
 }
 barrier_spin(&barrier);
 fin=*micro_timer;
 printf("%u\n",fin-start);
 printf("\npi\t%15.14f\n",pi);
}
Part IV: Parallel Programming
Message-Passing Code for Computing π
#define N 1000000
main(int argc, char **argv){
 double local=0.0,pi,temp,w;
 long i,taskid,numtask;
A: w=1.0/N;
 MPI_Init(&argc,&argv);
 MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
 MPI_Comm_size(MPI_COMM_WORLD,&numtask);
B: for(i=taskid;i<N;i=i+numtask){
P:  temp=(i+0.5)*w;
Q:  local=local+4.0/(1.0+temp*temp);
 }
C: MPI_Reduce(&local,&pi,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
D: if(taskid==0) printf("pi is %f\n",pi*w);
 MPI_Finalize();
}/*main()*/
An MPI-C program to compute π : 1
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(a)
double a;
{
    return (4.0/(1.0+a*a));
}

int main(argc,argv)
int argc;
char* argv[];
{
    int done=0,n,myid,numprocs,i;
    double PI25DT=3.141592653589793238462643;
    double mypi,pi,h,sum,x;
    double startwtime,endwtime;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
An MPI-C program to compute π : 2
    while(!done)
    {
        if(myid==0)
        {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
            startwtime=MPI_Wtime();
        }
        MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);
        if(n==0)
            done=1;
        else
        {
            h=1.0/(double)n;
            sum=0.0;
            for(i=myid+1;i<=n;i+=numprocs)
            {
                x=h*((double)i-0.5);
                sum+=f(x);
            }
An MPI-C program to compute π : 3
            mypi=h*sum;
            MPI_Reduce(&mypi,&pi,1,MPI_DOUBLE,
                       MPI_SUM,0,MPI_COMM_WORLD);
            if(myid==0)
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi,fabs(pi-PI25DT));
                endwtime=MPI_Wtime();
                printf("wall clock time=%f\n",
                       endwtime-startwtime);
            }
        }/*else*/
    }/*while*/
    MPI_Finalize();
}/*main()*/
A PVM-C program to compute π : 1
#define n 16 /*number of tasks*/
#include "pvm3.h"

main(int argc, char** argv)
{
    int mytid,tids[n],me,i,N,rc,parent;
    double mypi,h,sum=0.0,x;
    me=pvm_joingroup("pi");
    parent=pvm_parent();
    if(me==0)
    {
        pvm_spawn("pi",(char**)0,0,"",n-1,tids);
        printf("Enter the number of regions:");
        scanf("%d",&N);
        pvm_initsend(PvmDataRaw);
        pvm_pkint(&N,1,1);
        pvm_mcast(tids,n-1,5);
    }
    else
A PVM-C program to compute π : 2
    {
        pvm_recv(parent,5);
        pvm_upkint(&N,1,1);
    }
    pvm_barrier("pi",n); /*optional*/
    h=1.0/(double)N;
    for(i=me+1;i<=N;i+=n)
    {
        x=h*((double)i-0.5);
        sum+=4.0/(1.0+x*x);
    }
    mypi=h*sum;
    pvm_reduce(PvmSum,&mypi,1,PVM_DOUBLE,6,"pi",0);
    if(me==0) printf("pi is approximately %.16f\n",mypi);
    pvm_lvgroup("pi");
    pvm_exit();
}
Exercises: an HPF code computing π
Reference ONLY
      parameter (N=20000)
      real pi,h
      integer i
      real x(N),y(N)
!HPF$ PROCESSORS Nodes(10)
!HPF$ ALIGN x(i) WITH y(i)
!HPF$ DISTRIBUTE x(CYCLIC) ONTO Nodes
      h=1.0/N
      FORALL(i=1:N)
          x(i)=(i-0.5)*h
          y(i)=4.0/(1.0+x(i)*x(i))
      END FORALL
      pi=SUM(y)*h
Exercises: an F90 code computing π
   INTEGER,PARAMETER::N=131072
   INTEGER,PARAMETER::LONG=SELECTED_REAL_KIND(13,99)
   REAL(KIND=LONG) PI,WIDTH
   INTEGER,DIMENSION(N)::ID
   REAL(KIND=LONG),DIMENSION(N)::X,Y
   WIDTH=1.0_LONG/N
   ID=(/(I,I=1,N)/)
   X=(ID-0.5)*WIDTH
   Y=4.0/(1.0+X*X)
   PI=SUM(Y)*WIDTH
10 FORMAT('ESTIMATION OF PI WITH ',I6, &
          ' INTERVALS IS ',F14.12)
   PRINT 10,N,PI
   END