A Tutorial on
High Performance Computing Taxonomy
By
Prof. V. Kamakoti
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Chennai – 600 036, India
Organization of the Tutorial
• Session 1 – Instruction Level Parallelism (ILP)
  • Pipelining concepts
  • RTL and Speed-up
  • Superscalar/VLIW concepts
    – Static Instruction Scheduling
    – Dynamic Instruction Scheduling
  • Branch Prediction
Organization of the Tutorial
• Session 2
  – Amdahl's law and its applications
  – Symmetric Multiprocessors (SMP)
    • The Cache Coherency problem – ESI Protocol
  – Distributed Memory Systems
  – Basics of Message Passing Systems
  – Parallel Models of Computing
  – Design of Algorithms for Parallel Processors
    • Brent's Lemma
Why this Title?
• Performance-related issues arise at
  – circuit level (RTL)
  – instruction level (processor level)
  – Shared Memory Multiprocessor level (SMP)
  – Distributed Memory Multiprocessor level (Cluster/Grid – a collection of SMPs)
ILP - Pipelining
Fetch + Inc. PC
Decode Instrn
Fetch Data
Execute Instrn
Store Data
10000 instructions, 5 stages:
  With no pipelining: 10000 × 5 = 50000 units.
  With pipelining, the stages fill up as follows (one row per unit of time, newest instruction first):

  Unit 1: I1
  Unit 2: I2 I1
  Unit 3: I3 I2 I1
  Unit 4: I4 I3 I2 I1
  Unit 5: I5 I4 I3 I2 I1
First Instruction completes at end of 5th unit
Second instruction at end of 6th unit
10000th instruction at end of 10004 units
Performance
• With pipelining we get a speedup of close to 5.
• This will not always work:
  – Hazards
    • Data
    • Control
    • Structural
• Non-ideal: not every step takes the same amount of time
  – Float Mult – 10 cycles
  – Float Div – 40 cycles
• Performance comes out of parallelism. Let us try to understand parallelism.
Types of Parallelism
• Recognizing parallelism
  – Example: to add 100 numbers
  – for j = 1 to 100 do
      Sum = Sum + A[j]; // Inherently sequential
  – A better solution in terms of parallelism, assuming 4 processors are available:
    • Split the 100 numbers into 4 parts of 25 numbers each and allot one part to each of the four processors.
    • Each processor adds the 25 numbers allotted to it and sends its answer to a head processor, which adds the partial results and gives the final sum.
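A minimal sketch of this scheme in Python; a thread pool stands in for the four processors, and all names here are illustrative rather than from the slides:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(a, p=4):
    """Split a into p parts, sum each part on its own worker, then combine."""
    size = len(a) // p
    # each of the p "processors" gets one contiguous part
    chunks = [a[i * size:(i + 1) * size] for i in range(p - 1)]
    chunks.append(a[(p - 1) * size:])  # last part takes any remainder
    with ThreadPoolExecutor(max_workers=p) as ex:
        partials = list(ex.map(sum, chunks))
    return sum(partials)  # the "head processor" adds the partial results

print(parallel_sum(list(range(1, 101))))  # 5050
```

With real processors the partial sums run truly in parallel; here the thread pool merely mirrors the structure of the algorithm.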
Types of Parallelism
• Data Parallelism or SIMD
  – The above example.
  – The same instruction, "add 25 numbers", but on multiple data.
  – The parallelism comes from the data.
Types of Parallelism
• Functional Parallelism or MIMD
  – Multiple functions to be performed on one or more data sets.
  – Multiple Instructions and Multiple Data.
  – An example is the pipeline discussed earlier.
Example of Pipelining
• Imagine that 100 sets of data are to be processed in sequence by a system with two parts, Part 1 followed by Part 2.

Part 1 takes 10 ms and Part 2 takes 15 ms.
To process 100 sets of data it takes 100 × 25 = 2500 ms.
Example of Pipelining
• Consider the following change: a storage element is placed between Part 1 and Part 2. While the first data set is in Part 2, the second data set can be in Part 1.

The first data set finishes at 30 ms and thereafter one data set comes out every 15 ms – the total processing time is 1515 ms. A tremendous speedup.
Functional Parallelism
• Different data sets and different instructions on them.
• Hence, Multiple Instruction and Multiple Data.
• An interesting problem is to convert circuits with large delays into pipelined circuits to get very good throughput, as seen earlier.
• Useful in the context of using the same circuit for different sets of data.
Pipelining Concepts
• A combinational circuit can be easily modeled as a Directed Acyclic Graph (DAG).
• Every node of the DAG is a subcircuit of the given circuit – forms a stage of a pipeline.
• An edge of the DAG connects two nodes of the DAG.
• Perform a topological sorting of the DAG.
[Figure: an example DAG with nodes N1–N8 partitioned into levels 1 through 4 by topological sorting.]
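The level assignment produced by the topological sort can be computed directly: each node's level is one more than the maximum level of its predecessors. A sketch in Python (the node and edge names are illustrative, not taken from the figure):

```python
from collections import defaultdict

def pipeline_levels(edges):
    """Assign each DAG node its pipeline level (longest path from an input)."""
    preds = defaultdict(list)
    nodes = set()
    for u, v in edges:
        preds[v].append(u)
        nodes.update((u, v))
    level = {}
    def lev(v):
        if v not in level:
            # a node with no predecessors sits at level 1
            level[v] = 1 + max((lev(u) for u in preds[v]), default=0)
        return level[v]
    for v in nodes:
        lev(v)
    return level

levels = pipeline_levels([("A", "C"), ("B", "C"), ("A", "D"), ("C", "D")])
# edge (A, D) spans levels 1 -> 3, so it needs 3 - 1 = 2 storage stages
```

The difference in levels across an edge then tells you how many storage stages to insert on it, as described next.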
A Pipelined Circuit
• If an edge connects two nodes at levels j and k, j < k, then introduce k − j storage stages along that edge.
• Each edge can carry one or more bits.
[Figure: the same DAG with storage elements inserted; each edge is labelled with the number of storage stages it carries, equal to the difference of the levels it spans.]
Optimization
• The delay at every stage should be almost equal.
• The stage with maximum delay dictates the throughput.
• The number of bits transferred across nodes should be minimized, which reduces the storage requirements.
Stage Time Balancing
[Figure: Part 1 and Part 2 separated by a storage element.]
If Part 1 takes 10 ms and Part 2 takes 15 ms, the first data set finishes at 30 ms and one data set comes out every 15 ms thereafter – the total processing time for 100 data sets is 1515 ms. A tremendous speedup.
If Part 1 takes 12 ms and Part 2 takes 13 ms, the first data set finishes at 26 ms and one data set comes out every 13 ms thereafter – the total processing time for 100 data sets is 1313 ms. A significant further improvement.
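The arithmetic above follows from two small formulas: without pipelining every item pays the full path delay, while with pipelining the storage elements clock every stage at the slowest stage's delay. A sketch:

```python
def unpipelined_time(stage_delays_ms, n_items):
    # every item passes through all stages before the next one starts
    return n_items * sum(stage_delays_ms)

def pipelined_time(stage_delays_ms, n_items):
    # the storage elements clock every stage at the slowest stage's delay
    cycle = max(stage_delays_ms)
    # fill the pipeline once, then one item completes per cycle
    return (len(stage_delays_ms) + n_items - 1) * cycle

print(unpipelined_time([10, 15], 100))  # 2500
print(pipelined_time([10, 15], 100))    # 1515
print(pipelined_time([12, 13], 100))    # 1313
```

Balancing the stages (12/13 instead of 10/15) lowers the bottleneck delay, which is exactly what the stage-time-balancing argument above exploits.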
RTL View of Circuits and Performance
• Register Transfer Level
[Figure: three register stages RegL1, RegL2, RegL3 separated by combinational blocks of 3 ns and 5 ns delay.]

The clock frequency is (1/5) × 10^9 Hz = 200 MHz, set by the slowest stage.
Improve the frequency by reducing the maximum stage delay.
High Speed Circuits
• Carry Ripple to Carry Lookahead
• Wallace Tree Multipliers
• Increased area and power consumption
• Lower time delays
• Why do laptops have lower frequency ratings than desktops?
• Reference: Cormen, Leiserson and Rivest, Introduction to Algorithms (First Edition), or Computer Architecture by Hamacher et al.
ILP Continues….
• Data Hazards
  – LOAD [R2 + 10], R1  // Loads into R1
  – ADD R3, R1, R2      // R3 = R1 + R2
• This is the "Read After Write (RAW)" data hazard for R1.
  – LD [R2 + 10], R1
  – ADD R3, R1, R12
  – LD [R2 + 14], R1
  – ADD R12, R1, R2
• This shows a WAW hazard for R1 and a WAR hazard for R12.
ILP – Pipelining Advanced
Fetch + Inc. PC
Decode Instrn
Fetch Data
Execute Unit 1
Store Data
Execute Unit 2 Execute Unit K
Superscalar: CPI < 1
Why it succeeds: different instructions take different cycle times – e.g. four FMULs can complete while one FDIV is executing.
This implies Out-of-Order execution.
Difficulties in Superscalar Construction
• Ensuring there are no data hazards among the several instructions executing in the different execution units at the same point in time.
• If this is done by the compiler – Static Instruction Scheduling – VLIW – Itanium.
• If done by the hardware – Dynamic Instruction Scheduling – Tomasulo – MIPS Embedded Processor.
Static Instruction Scheduling
• The compiler makes bundles of "K" instructions that can be issued to the execution units at the same time, such that there are no data dependencies between them.
  – A Very Long Instruction Word (VLIW) accommodates "K" instructions at a time.
  – Lots of NOPs if a bundle cannot be filled with relevant instructions
    – bloating the size of the executable.
• Does not complicate the hardware.
• Source-code portability – if I make the next-generation processor with K+5 units (say) – then what?
  – Solved by having a software/firmware emulator, which has a negative impact on performance.
Thorn in the Flesh for Static Instruction Scheduling
• The famous "Memory Aliasing Problem"
  – LD [R1+20], R2  // Load into R2 from memory address R1+20
  – ST R3, [R4+40]  // Store R3 into memory address R4+40
• These instructions conflict if (R1 + 20 = R4 + 40), and this cannot be detected at compile time.
• Such combinations of memory operations are therefore not put in the same bundle, and memory operations are strictly scheduled in program order.
Dynamic Instruction Scheduling
• The data hazards are handled by the hardware
  – RAW using the Operand Forwarding technique
  – WAR and WAW using the Register Renaming technique
Processor Overview
[Figure: a processor with ALU/control, multiple function units and a register file, connected to memory over a bus.]

LD [R1+20], R2
ADD R3, R2, R4   // RAW on R2

Why should the result of the LD go to R2 in the register file and then be reloaded into the ALU?
Forward it to the ALU directly, on its way to the register file.
Register Renaming
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Dependencies due to register R1:
RAW: (1,2), (1,4), (1,5), (3,4), (3,5)
WAR: (2,3), (2,6), (4,6), (5,6)
WAW: (1,3), (1,6), (3,6)
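These dependence lists can be checked mechanically. A sketch in Python (the instruction encoding is ad hoc): a pair (i, j) with i < j is RAW if i writes a register that j reads, WAR if j writes a register that i reads, and WAW if both write the same register.

```python
def hazards(instrs):
    """instrs: list of (writes, reads) register sets, in program order."""
    raw, war, waw = [], [], []
    for i in range(len(instrs)):
        for j in range(i + 1, len(instrs)):
            wi, ri = instrs[i]
            wj, rj = instrs[j]
            if wi & rj: raw.append((i + 1, j + 1))  # j reads what i wrote
            if ri & wj: war.append((i + 1, j + 1))  # j overwrites what i read
            if wi & wj: waw.append((i + 1, j + 1))  # both write the same register
    return raw, war, waw

program = [
    ({"R1"}, {"R2", "R3"}),   # 1. ADD R1, R2, R3
    (set(), {"R1", "R4"}),    # 2. ST  R1, [R4+50]
    ({"R1"}, {"R5", "R6"}),   # 3. ADD R1, R5, R6
    ({"R7"}, {"R1", "R8"}),   # 4. SUB R7, R1, R8
    (set(), {"R1", "R4"}),    # 5. ST  R1, [R4+54]
    ({"R1"}, {"R9", "R10"}),  # 6. ADD R1, R9, R10
]
raw, war, waw = hazards(program)
print(raw)  # [(1, 2), (1, 4), (1, 5), (3, 4), (3, 5)]
```

Running it on the six-instruction example reproduces exactly the RAW, WAR and WAW pairs listed above.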
Register Renaming: Static Scheduling
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R12, R5, R6
4. SUB R7,R12,R8
5. ST R12, [R4 + 54]
6. ADD R1, R9,R10
Rename R1 to R12 after Instruction 3 till Instruction 6
Dependency only within a window and not the whole program.
The only remaining hazards are WAW (1,6) and WAR (2,6), which are far apart in program order.
Increases Register pressure for the compiler
Dynamic Scheduling - Tomasulo
[Figure: an Instruction Fetch Unit and a Register Status Indicator feeding reservation stations in front of execution units Exec 1 – Exec 4; results return on the Common Data Bus (CDB) to the reservation stations, register file and memory.]
Instructions are fetched one by one and decoded to find the type of operation and the sources of the operands.
The Register Status Indicator records whether the latest value of each register is in the register file or is currently being computed by some execution unit; in the latter case it records that execution unit's number.
If all operands are available, the operation proceeds in the allotted execution unit; otherwise it waits in the reservation station of the allotted execution unit, watching the CDB.
Every execution unit writes its result, along with its unit number, onto the CDB, which forwards it to all reservation stations, the register file and memory.
An Example: Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 0 0 0 0 0 0 0 0 0 0
Register Status Indicator
Empty Empty Empty Empty Empty Empty
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ADD R1, R2, R3
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 1 0 0 0 0 0 0 0 0 0
Register Status Indicator
Ins 1 Empty Empty Empty Empty Empty
1. ---
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ST R1, [R4+50]
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 1 0 0 0 0 0 0 0 0 0
Register Status Indicator
I 1, E I 2, W 1 Empty Empty Empty Empty
1. ---
2. ---
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ADD R1, R5, R6
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 0 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E Empty Empty Empty
1. ---
2. ---
3. ---
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Note: Reservation Station stores the number of the execution unit that shall yield the latest value of a register.
An Example:
SUB R7,R1,R8
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 Empty Empty
1. ---
2. ---
3. ---
4. ---
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ST R1, [R4 + 54]
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 Empty
1. ---
2. ---
3. ---
4. ---
5. ---
6. ADD R1, R9, R10
An Example:
ADD R1, R9, R10
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 6 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 I 6, E
1. ---
2. ---
3. ---
4. ---
5. ---
6. ---
An Example:
ADD R1, R9, R10
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 6 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 I 6, E
1. ADD R1, R2, R3
2. ST U1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, U3, R8
5. ST U3, [R4+54]
6. ADD R1, R9, R10
Effectively three instructions are executing and the others are waiting for the appropriate results. The whole program is converted as shown above (Ui stands for the result of execution unit i). Note that operand forwarding and register renaming happen automatically.
Execution unit 6, on completion, will reset the R1 entry in the Register Status Indicator to 0. Similarly, unit 4 will reset the R7 entry to 0.
Memory Aliasing
• Every memory location is conceptually a register.
• So, in principle, the same method can be used.
• But the size of a memory status indicator would be prohibitively large.
• Instead, an associative memory is used to record each memory address that is to be written and the unit number doing the write.
Other Hazards
• Control Hazards
  – Conditional jumps – which instruction should be fetched next into the pipeline?
  – Branch predictors are used, which predict whether a branch is taken or not.
  – A misprediction leads to undoing some actions, increasing the penalty, but not much more can be done.
Branch Prediction
• Different types of predictors
  – Tournament
  – Correlating
  – K-bit
• Reference: Hennessy and Patterson, Computer Architecture.
Other Hazards
• Structural Hazards
  – Non-availability of a functional unit.
  – Say we would like to schedule the seventh instruction in our example: the new instruction has to wait.
  – Separate Integer, FPU and Load/Store units are therefore made available.
• Load-Store Architecture – what is it?
Architectural Enhancements
Amdahl’s Law
Speedup(overall) = Exec_time without enhancement / Exec_time with enhancement

A = fraction of the computation time in the original architecture that can take advantage of the enhancement.

Exec_time(new) = (1 – A) * Exec_time(old) + Exec_time of enhanced portion(new)   -- (1)

Speedup(enhanced) = Exec_time of enhanced portion(old) / Exec_time of enhanced portion(new)
                  = A * Exec_time(old) / Exec_time of enhanced portion(new)

so  Exec_time of enhanced portion(new) = A * Exec_time(old) / Speedup(enhanced)

Substituting in (1) above we get

Exec_time(new) = Exec_time(old) * ( (1 – A) + A / Speedup(enhanced) )

Final form of Amdahl's Law:

Speedup(overall) = 1 / ( (1 – A) + A / Speedup(enhanced) )
Application of Amdahl’s Law:
Suppose 50% of execution time is FP operations – 20% FP square root and 30% other FP.
Choice 1: use hardware to improve FP square root, for a speedup of 10.
Choice 2: use software to improve all FP operations, for a speedup of 1.6.
Speedup in Choice 1 is 1/(1 – 0.2 + 0.2/10) = 1.22
Speedup in Choice 2 is 1/(1 – 0.5 + 0.5/1.6) = 1.23
Choice 2 is better than Choice 1
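The final form of the law and the two choices above can be checked in a few lines; a sketch:

```python
def speedup_overall(A, speedup_enhanced):
    # final form of Amdahl's Law
    return 1.0 / ((1.0 - A) + A / speedup_enhanced)

print(round(speedup_overall(0.2, 10), 2))   # 1.22  (Choice 1)
print(round(speedup_overall(0.5, 1.6), 2))  # 1.23  (Choice 2)
```

A large speedup on a small fraction (Choice 1) barely beats a modest speedup on a large fraction (Choice 2) – the fraction A dominates.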
Shared Memory Architectures
•Sharing one memory space among several processors.
•Maintaining coherence among several copies of a data item.
[Figure: a shared-memory multiprocessor – four processors, each with registers and caches, connected through a chipset to a shared memory and disk/other IO.]
Snoopy-Cache State Machine I
State machine for CPU requests, for each cache block (applies to write-back data caches).
States: Invalid, Shared (read-only), Exclusive (read/write).

  Invalid   – CPU read: place read miss on bus; go to Shared.
  Invalid   – CPU write (hit/miss): place write miss on bus; go to Exclusive.
  Shared    – CPU read hit: stay in Shared.
  Shared    – CPU read miss: place read miss on bus; stay in Shared.
  Shared    – CPU write: place write miss on bus; go to Exclusive.
  Exclusive – CPU read hit or CPU write hit: stay in Exclusive.
  Exclusive – CPU read miss: write back the block, place read miss on bus; go to Shared.
  Exclusive – CPU write miss: write back the cache block, place write miss on bus; stay in Exclusive.
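The CPU-side transitions can be written down as a table and exercised directly. A sketch, with state and event names following the list above (the table encoding is illustrative, not from the slides):

```python
# (state, event) -> (next_state, bus_action or None)
CPU_SIDE = {
    ("Invalid", "read"):         ("Shared",    "place read miss on bus"),
    ("Invalid", "write"):        ("Exclusive", "place write miss on bus"),
    ("Shared", "read hit"):      ("Shared",    None),
    ("Shared", "read miss"):     ("Shared",    "place read miss on bus"),
    ("Shared", "write"):         ("Exclusive", "place write miss on bus"),
    ("Exclusive", "read hit"):   ("Exclusive", None),
    ("Exclusive", "write hit"):  ("Exclusive", None),
    ("Exclusive", "read miss"):  ("Shared",    "write back block, place read miss on bus"),
    ("Exclusive", "write miss"): ("Exclusive", "write back block, place write miss on bus"),
}

state, action = CPU_SIDE[("Invalid", "write")]
print(state, "|", action)  # Exclusive | place write miss on bus
```

The worked example later in this section is just repeated lookups in this table (plus the bus-side machine for the other cache).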
Snoopy-Cache State Machine II
State machine for bus requests, for each cache block.

  Shared    – write miss for this block seen on the bus: go to Invalid.
  Exclusive – read miss for this block seen on the bus: write back the block (abort the memory access); go to Shared.
  Exclusive – write miss for this block seen on the bus: write back the block (abort the memory access); go to Invalid.
Shared Memory Architectures: a worked example

The following sequence of operations is traced step by step below, recording each cache's state, the bus action, and memory:

P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.

[Figure: the combined state machine – a remote write or write miss invalidates a block; a remote read of an Exclusive block forces a write-back and a transition to Shared.]
Example: step 1 (Shared Memory Architectures)
This traces the caches of P1 and P2.

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1

(Arc used: write miss placed on the bus; P1 goes Invalid → Exclusive.)
Example: step 2 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (CPU read hit, no bus action)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 3 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10   (remote read – P1 writes back and goes to Shared)
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10                (P2 receives the data)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 4 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10
P2: Write 20 to A1
  P1: Inv. | P2: Excl. A1 20 | Bus: WrMs P2 A1 | Mem: A1 10   (remote write – P1 invalidates)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 5 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10
P2: Write 20 to A1
  P1: Inv. | P2: Excl. A1 20 | Bus: WrMs P2 A1 | Mem: A1 10
P2: Write 40 to A2
  Bus: WrMs P2 A2
  P2: Excl. A2 40 | Bus: WrBk P2 A1 20 | Mem: A1 20   (A2 maps to the same block, so A1 is written back first)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Distributed Memory Systems
[Figure: processors P1–P4, each with its own local memory, connected by a network.]

Message passing is used for inter-process communication.
Shared vs Distributed Memory

  Shared memory:            Distributed memory:
  1. Single address space   1. Multiple address spaces
  2. Easy to program        2. Difficult to program
  3. Less scalable          3. More scalable, provided you know how to program it
Basics of Message Passing Systems: Messages

A message must specify:
• Which processor is sending the message.
• Where the data is on the sending processor.
• What kind of data is being sent.
• How much data there is.
• Which processor(s) are receiving the message.
• Where the data should be left on the receiving processor.
• How much data the receiving processor is prepared to accept.
Aspects

• Access to the message passing system
• Addressing
• Reception
• Point-to-point communication: two processors communicate
  • Synchronous and asynchronous
  • Blocking and non-blocking operations
• Collective communication: a group of processors communicate
  • Barrier, Broadcast and Reduction operations
Point-to-Point Synchronous Communication

Synchronous communication does not complete until the message has been received.

Point-to-Point Asynchronous Communication

Asynchronous communication completes as soon as the message is on its way.
Non-Blocking Operations

Non-blocking operations return straight away after initiating the operation, and hence allow useful work to be performed while waiting for the operation to complete. One can test for completion of the operation when necessary.

Blocking Operations

Blocking operations wait for the operation to complete before proceeding further.
Barrier

• Synchronizes processors by blocking until all of the participating processors have called the barrier routine.
• There is no exchange of data.
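A minimal sketch of barrier synchronization using Python's standard threading module, with threads standing in for the participating processors:

```python
import threading

results = []
barrier = threading.Barrier(4)  # 4 participating "processors"

def worker(i):
    # ... local work would happen here ...
    barrier.wait()      # block until all 4 threads have called the barrier
    results.append(i)   # runs only after every thread reached the barrier

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```

As the slide says, nothing is exchanged at the barrier itself; it only guarantees that no participant proceeds until all have arrived.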
Broadcast

One-to-many communication. One processor sends the same message to several destinations in a single operation.
Reduction

Takes data items from several processors and reduces them to a single data item that is usually made available to all of the participating processors, e.g. strike voting, summation.
Parallel Models
EREW PRAM: Exclusive Read Exclusive Write Parallel Random Access Machine.
CREW PRAM: Concurrent Read Exclusive Write Parallel Random Access Machine.
CRCW PRAM: Concurrent Read Concurrent Write Parallel Random Access Machine.
Parallel Algorithm: Recursive Doubling Technique

• Finding the prefix sums of 8 numbers (shown from index 7 down to index 0):

  Input:   -15   6  -8   7   3  -2   1   0
  Step 1:   -9  -2  -1  10   1  -1   1   0   (each x[i] += x[i-1], in parallel)
  Step 2:  -10   8   0   9   2  -1   1   0   (each x[i] += x[i-2], in parallel)
  Step 3:   -8   7   1   9   2  -1   1   0   (each x[i] += x[i-4], in parallel)
• Prefix Sum of n numbers in O(log n) steps
• EREW Implementation
• Applicable for any semigroup operator like min , max , mul etc.
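A sketch of the technique in Python, run on the slide's data (listed here from index 0, the rightmost element above, up to index 7): at step s every position i ≥ 2^(s-1) adds the value 2^(s-1) places below it, so all n prefix sums finish in ⌈log2 n⌉ steps.

```python
def prefix_sum_recursive_doubling(a):
    x = list(a)
    d = 1
    while d < len(x):
        # every position i >= d adds x[i-d]; all reads use the old values,
        # modelling one synchronous parallel step
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(len(x))]
        d *= 2
    return x

# the slide's input, from index 0 (right) to index 7 (left)
print(prefix_sum_recursive_doubling([0, 1, -2, 3, 7, -8, 6, -15]))
# [0, 1, -1, 2, 9, 1, 7, -8]
```

Each step reads x[i-d] and writes x[i] at distinct positions, which is why an EREW implementation suffices; replacing + by min, max or multiplication gives the same algorithm for any semigroup operator.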
CRCW Algorithms
• What can you do in constant time?
• Find the OR of n bits (b1, b2, …, bn) using n processors:
  – Each processor Pi reads bi in parallel.
  – If (bi = 1) then C = 1;
• All processors that write to C write the same value, so the concurrent write is consistent.
Optimality of Parallel Algorithms
• A parallel algorithm is optimal iff
  (time taken by the parallel algorithm) × (number of processors used)
  = time taken by the best known sequential algorithm.
• Prefix sums of n numbers:
  O(n) sequentially; the parallel algorithm has an O(n log n) processor-time product – not optimal.
Make it Optimal

Yes – using Brent's technique.
Use p processors and split the data set into p blocks of n/p elements each.
Allot one processor per block. As an example, take 3 processors and 6 numbers:

  Bp … B2 B1
  6 5 | 4 3 | 2 1

The Algorithm

Step 1: Processor i finds the prefix sums of the elements in Bi sequentially; let Si be its last prefix, the block total – O(n/p) time.

  11 5 | 7 3 | 3 1        (block-local prefix sums)
  (S3, S2, S1) = (11, 7, 3)

Step 2: Find the prefix sums of (Sp, …, S1), giving (S'p, …, S'1) – O(log p) time by recursive doubling.

  (S'3, S'2, S'1) = (21, 10, 3)

Step 3: Processor i, for i > 1, adds S'(i-1) to all prefixes of block Bi – O(n/p) time.

  Result: 21 15 | 10 6 | 3 1

Total: O(n/p + log p) time, which is O(n/p) when n/p ≥ log p – an optimal processor-time product.
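The three steps above can be sketched directly; this sequential Python version mirrors the per-processor work (step 2 is an ordinary loop here, standing in for the recursive-doubling phase):

```python
def brent_prefix_sum(a, p):
    """Prefix sums with p (simulated) processors; assumes p divides len(a)."""
    b = len(a) // p
    blocks = [a[i * b:(i + 1) * b] for i in range(p)]
    # Step 1: each processor scans its own block sequentially
    local = []
    for blk in blocks:
        pre, s = [], 0
        for v in blk:
            s += v
            pre.append(s)
        local.append(pre)
    # Step 2: prefix sums of the block totals
    # (done by recursive doubling in the parallel setting)
    offsets, s = [], 0
    for pre in local:
        offsets.append(s)
        s += pre[-1]
    # Step 3: processor i adds the total of all blocks before it
    out = []
    for off, pre in zip(offsets, local):
        out.extend(v + off for v in pre)
    return out

print(brent_prefix_sum([1, 2, 3, 4, 5, 6], 3))  # [1, 3, 6, 10, 15, 21]
```

With p = n / log n processors the processor-time product becomes O(n), matching the sequential algorithm.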
Sublogarithmic Algorithms and CRCW PRAM

• Minimum of (a(1), a(2), …, a(n)) using n² processors in O(1) time:
  • C[i] = 1; using n processors.
  • for all (i, j): if a(i) > a(j) then C[i] = 0; uses n² processors.
  • if C[i] = 1 then B = a(i); uses n processors.
• B now holds the minimum.
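A sequential simulation of this constant-time algorithm; the doubly nested loop plays the role of the n² processor grid, and the final loop the n concurrent writers:

```python
def crcw_min(a):
    """O(1)-time CRCW PRAM minimum, simulated sequentially.

    Every processor that writes to B writes the same value (the minimum),
    so the concurrent write is consistent.
    """
    n = len(a)
    C = [1] * n
    for i in range(n):          # processors (i, j) compare in parallel
        for j in range(n):
            if a[i] > a[j]:
                C[i] = 0        # a(i) lost a comparison, so it is not the minimum
    B = None
    for i in range(n):          # the surviving candidates all write the same value
        if C[i] == 1:
            B = a[i]
    return B

print(crcw_min([5, 2, 8, 1]))  # 1
```

Note that with duplicates several C[i] stay 1, but they all write the same value to B, which is exactly what the common-write CRCW model allows.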
What if we have only n processors?
• Split into n^0.5 blocks, each of n^0.5 elements, and assign n^0.5 processors to each block.
• Solve recursively in each block to get n^0.5 block minima.
• Now find the minimum of these n^0.5 elements using the n processors – constant time by the previous algorithm, since (n^0.5)² = n.
• T(n) = T(n^0.5) + O(1) = O(log log n)