A Tutorial on
High Performance Computing Taxonomy
By
Prof. V. Kamakoti
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Chennai – 600 036, India
Organization of the Tutorial
• Session 1 – Instruction Level Parallelism (ILP)
  • Pipelining concepts
  • RTL and Speed-up
  • Superscalar/VLIW concepts
    – Static Instruction Scheduling
    – Dynamic Instruction Scheduling
  • Branch Prediction
Organization of the Tutorial
• Session 2
  – Amdahl's law and its applications
  – Symmetric Multiprocessors (SMP)
    • The Cache Coherency problem – ESI Protocol
  – Distributed Memory Systems
  – Basics of Message Passing Systems
  – Parallel Models of Computing
  – Design of Algorithms for Parallel Processors
    • Brent's Lemma
Why this Title?
• Performance-related issues arise at
  – circuit level (RTL)
  – instruction level (processor level)
  – Shared Memory Multiprocessor level (SMP)
  – Distributed Memory Multiprocessor level (Cluster/Grid – a collection of SMPs)
ILP - Pipelining
Fetch + Inc. PC
Decode Instrn
Fetch Data
Execute Instrn
Store Data
10000 instructions, 5 stages:
  With no pipelining: 10000 × 5 = 50000 units.
  With pipelining, the stages fill up as follows (one row per unit of time, newest instruction first):

  Unit 1: I1
  Unit 2: I2 I1
  Unit 3: I3 I2 I1
  Unit 4: I4 I3 I2 I1
  Unit 5: I5 I4 I3 I2 I1
First Instruction completes at end of 5th unit
Second instruction at end of 6th unit
10000th instruction at end of 10004 units
Performance
• With pipelining we get a speedup of close to 5.
• This will not always work:
  – Hazards
    • Data
    • Control
    • Structural
• Non-ideal: not every step takes the same amount of time
  – Float Mult – 10 cycles
  – Float Div – 40 cycles
• Performance comes out of parallelism. Let us try to understand parallelism.
Types of Parallelism
• Recognizing parallelism
  – Example: to add 100 numbers
  – for j = 1 to 100 do
      Sum = Sum + A[j]; // Inherently sequential
  – A better solution in terms of parallelism, assuming 4 processors are available:
    • Split the 100 numbers into 4 parts of 25 numbers each and allot one part to each of the four processors.
    • Each processor adds the 25 numbers allotted to it and sends its answer to a head processor, which adds the partial results and gives the final sum.
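A minimal sketch of this scheme in Python; a thread pool stands in for the four processors, and all names here are illustrative rather than from the slides:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(a, p=4):
    """Split a into p parts, sum each part on its own worker, then combine."""
    size = len(a) // p
    # each of the p "processors" gets one contiguous part
    chunks = [a[i * size:(i + 1) * size] for i in range(p - 1)]
    chunks.append(a[(p - 1) * size:])  # last part takes any remainder
    with ThreadPoolExecutor(max_workers=p) as ex:
        partials = list(ex.map(sum, chunks))
    return sum(partials)  # the "head processor" adds the partial results

print(parallel_sum(list(range(1, 101))))  # 5050
```

With real processors the partial sums run truly in parallel; here the thread pool merely mirrors the structure of the algorithm.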
Types of Parallelism
• Data Parallelism or SIMD
  – The above example.
  – The same instruction, "add 25 numbers", but on multiple data.
  – The parallelism comes from the data.
Types of Parallelism
• Functional Parallelism or MIMD
  – Multiple functions to be performed on one or more data sets.
  – Multiple Instructions and Multiple Data.
  – An example is the pipeline discussed earlier.
Example of Pipelining
• Imagine that 100 sets of data are to be processed in sequence by a system with two parts, Part 1 followed by Part 2.

Part 1 takes 10 ms and Part 2 takes 15 ms.
To process 100 sets of data it takes 100 × 25 = 2500 ms.
Example of Pipelining
• Consider the following change: a storage element is placed between Part 1 and Part 2. While the first data set is in Part 2, the second data set can be in Part 1.

The first data set finishes at 30 ms and thereafter one data set comes out every 15 ms – the total processing time is 1515 ms. A tremendous speedup.
Functional Parallelism
• Different data sets and different instructions on them.
• Hence, Multiple Instruction and Multiple Data.
• An interesting problem is to convert circuits with large delays into pipelined circuits to get very good throughput, as seen earlier.
• Useful in the context of using the same circuit for different sets of data.
Pipelining Concepts
• A combinational circuit can be easily modeled as a Directed Acyclic Graph (DAG).
• Every node of the DAG is a subcircuit of the given circuit – forms a stage of a pipeline.
• An edge of the DAG connects two nodes of the DAG.
• Perform a topological sorting of the DAG.
[Figure: an example DAG with nodes N1–N8 partitioned into levels 1 through 4 by topological sorting.]
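The level assignment produced by the topological sort can be computed directly: each node's level is one more than the maximum level of its predecessors. A sketch in Python (the node and edge names are illustrative, not taken from the figure):

```python
from collections import defaultdict

def pipeline_levels(edges):
    """Assign each DAG node its pipeline level (longest path from an input)."""
    preds = defaultdict(list)
    nodes = set()
    for u, v in edges:
        preds[v].append(u)
        nodes.update((u, v))
    level = {}
    def lev(v):
        if v not in level:
            # a node with no predecessors sits at level 1
            level[v] = 1 + max((lev(u) for u in preds[v]), default=0)
        return level[v]
    for v in nodes:
        lev(v)
    return level

levels = pipeline_levels([("A", "C"), ("B", "C"), ("A", "D"), ("C", "D")])
# edge (A, D) spans levels 1 -> 3, so it needs 3 - 1 = 2 storage stages
```

The difference in levels across an edge then tells you how many storage stages to insert on it, as described next.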
A Pipelined Circuit
• If an edge connects two nodes at levels j and k, j < k, then introduce k − j storage stages along that edge.
• Each edge can carry one or more bits.
[Figure: the same DAG with storage elements inserted; each edge is labelled with the number of storage stages it carries, equal to the difference of the levels it spans.]
Optimization
• The delay at every stage should be almost equal.
• The stage with maximum delay dictates the throughput.
• The number of bits transferred across nodes should be minimized, which reduces the storage requirements.
Stage Time Balancing
[Figure: Part 1 and Part 2 separated by a storage element.]
If Part 1 takes 10 ms and Part 2 takes 15 ms, the first data set finishes at 30 ms and one data set comes out every 15 ms thereafter – the total processing time for 100 data sets is 1515 ms. A tremendous speedup.
If Part 1 takes 12 ms and Part 2 takes 13 ms, the first data set finishes at 26 ms and one data set comes out every 13 ms thereafter – the total processing time for 100 data sets is 1313 ms. A significant further improvement.
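The arithmetic above follows from two small formulas: without pipelining every item pays the full path delay, while with pipelining the storage elements clock every stage at the slowest stage's delay. A sketch:

```python
def unpipelined_time(stage_delays_ms, n_items):
    # every item passes through all stages before the next one starts
    return n_items * sum(stage_delays_ms)

def pipelined_time(stage_delays_ms, n_items):
    # the storage elements clock every stage at the slowest stage's delay
    cycle = max(stage_delays_ms)
    # fill the pipeline once, then one item completes per cycle
    return (len(stage_delays_ms) + n_items - 1) * cycle

print(unpipelined_time([10, 15], 100))  # 2500
print(pipelined_time([10, 15], 100))    # 1515
print(pipelined_time([12, 13], 100))    # 1313
```

Balancing the stages (12/13 instead of 10/15) lowers the bottleneck delay, which is exactly what the stage-time-balancing argument above exploits.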
RTL View of Circuits and Performance
• Register Transfer Level
[Figure: three register stages RegL1, RegL2, RegL3 separated by combinational blocks of 3 ns and 5 ns delay.]

The clock frequency is (1/5) × 10^9 Hz = 200 MHz, set by the slowest stage.
Improve the frequency by reducing the maximum stage delay.
High Speed Circuits
• Carry Ripple to Carry Lookahead
• Wallace Tree Multipliers
• Increased area and power consumption
• Lower time delays
• Why do laptops have lower frequency ratings than desktops?
• Reference: Cormen, Leiserson and Rivest, Introduction to Algorithms (First Edition), or Computer Architecture by Hamacher et al.
ILP Continues….
• Data Hazards
  – LOAD [R2 + 10], R1  // Loads into R1
  – ADD R3, R1, R2      // R3 = R1 + R2
• This is the "Read After Write (RAW)" data hazard for R1.
  – LD [R2 + 10], R1
  – ADD R3, R1, R12
  – LD [R2 + 14], R1
  – ADD R12, R1, R2
• This shows a WAW hazard for R1 and a WAR hazard for R12.
ILP – Pipelining Advanced
Fetch + Inc. PC
Decode Instrn
Fetch Data
Execute Unit 1
Store Data
Execute Unit 2 Execute Unit K
Superscalar: CPI < 1
Why it succeeds: different instructions take different cycle times – e.g. four FMULs can complete while one FDIV is executing.
This implies Out-of-Order execution.
Difficulties in Superscalar Construction
• Ensuring there are no data hazards among the several instructions executing in the different execution units at the same point in time.
• If this is done by the compiler – Static Instruction Scheduling – VLIW – Itanium.
• If done by the hardware – Dynamic Instruction Scheduling – Tomasulo – MIPS Embedded Processor.
Static Instruction Scheduling
• The compiler makes bundles of "K" instructions that can be issued to the execution units at the same time, such that there are no data dependencies between them.
  – A Very Long Instruction Word (VLIW) accommodates "K" instructions at a time.
  – Lots of NOPs if a bundle cannot be filled with relevant instructions
    – bloating the size of the executable.
• Does not complicate the hardware.
• Source-code portability – if I make the next-generation processor with K+5 units (say) – then what?
  – Solved by having a software/firmware emulator, which has a negative impact on performance.
Thorn in the Flesh for Static Instruction Scheduling
• The famous "Memory Aliasing Problem"
  – LD [R1+20], R2  // Load into R2 from memory address R1+20
  – ST R3, [R4+40]  // Store R3 into memory address R4+40
• These instructions conflict if (R1 + 20 = R4 + 40), and this cannot be detected at compile time.
• Such combinations of memory operations are therefore not put in the same bundle, and memory operations are strictly scheduled in program order.
Dynamic Instruction Scheduling
• The data hazards are handled by the hardware
  – RAW using the Operand Forwarding technique
  – WAR and WAW using the Register Renaming technique
Processor Overview
[Figure: a processor with ALU/control, multiple function units and a register file, connected to memory over a bus.]

LD [R1+20], R2
ADD R3, R2, R4   // RAW on R2

Why should the result of the LD go to R2 in the register file and then be reloaded into the ALU?
Forward it to the ALU directly, on its way to the register file.
Register Renaming
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Dependencies due to register R1:
RAW: (1,2), (1,4), (1,5), (3,4), (3,5)
WAR: (2,3), (2,6), (4,6), (5,6)
WAW: (1,3), (1,6), (3,6)
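These dependence lists can be checked mechanically. A sketch in Python (the instruction encoding is ad hoc): a pair (i, j) with i < j is RAW if i writes a register that j reads, WAR if j writes a register that i reads, and WAW if both write the same register.

```python
def hazards(instrs):
    """instrs: list of (writes, reads) register sets, in program order."""
    raw, war, waw = [], [], []
    for i in range(len(instrs)):
        for j in range(i + 1, len(instrs)):
            wi, ri = instrs[i]
            wj, rj = instrs[j]
            if wi & rj: raw.append((i + 1, j + 1))  # j reads what i wrote
            if ri & wj: war.append((i + 1, j + 1))  # j overwrites what i read
            if wi & wj: waw.append((i + 1, j + 1))  # both write the same register
    return raw, war, waw

program = [
    ({"R1"}, {"R2", "R3"}),   # 1. ADD R1, R2, R3
    (set(), {"R1", "R4"}),    # 2. ST  R1, [R4+50]
    ({"R1"}, {"R5", "R6"}),   # 3. ADD R1, R5, R6
    ({"R7"}, {"R1", "R8"}),   # 4. SUB R7, R1, R8
    (set(), {"R1", "R4"}),    # 5. ST  R1, [R4+54]
    ({"R1"}, {"R9", "R10"}),  # 6. ADD R1, R9, R10
]
raw, war, waw = hazards(program)
print(raw)  # [(1, 2), (1, 4), (1, 5), (3, 4), (3, 5)]
```

Running it on the six-instruction example reproduces exactly the RAW, WAR and WAW pairs listed above.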
Register Renaming: Static Scheduling
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R12, R5, R6
4. SUB R7,R12,R8
5. ST R12, [R4 + 54]
6. ADD R1, R9,R10
Rename R1 to R12 after Instruction 3 till Instruction 6
Dependency only within a window and not the whole program.
The only remaining hazards are WAW (1,6) and WAR (2,6), which are far apart in program order.
Increases Register pressure for the compiler
Dynamic Scheduling - Tomasulo
[Figure: an Instruction Fetch Unit and a Register Status Indicator feeding reservation stations in front of execution units Exec 1 – Exec 4; results return on the Common Data Bus (CDB) to the reservation stations, register file and memory.]
Instructions are fetched one by one and decoded to find the type of operation and the sources of the operands.
The Register Status Indicator records whether the latest value of each register is in the register file or is currently being computed by some execution unit; in the latter case it records that execution unit's number.
If all operands are available, the operation proceeds in the allotted execution unit; otherwise it waits in the reservation station of the allotted execution unit, watching the CDB.
Every execution unit writes its result, along with its unit number, onto the CDB, which forwards it to all reservation stations, the register file and memory.
An Example: Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 0 0 0 0 0 0 0 0 0 0
Register Status Indicator
Empty Empty Empty Empty Empty Empty
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ADD R1, R2, R3
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 1 0 0 0 0 0 0 0 0 0
Register Status Indicator
Ins 1 Empty Empty Empty Empty Empty
1. ---
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ST R1, [R4+50]
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 1 0 0 0 0 0 0 0 0 0
Register Status Indicator
I 1, E I 2, W 1 Empty Empty Empty Empty
1. ---
2. ---
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ADD R1, R5, R6
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 0 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E Empty Empty Empty
1. ---
2. ---
3. ---
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Note: Reservation Station stores the number of the execution unit that shall yield the latest value of a register.
An Example:
SUB R7,R1,R8
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 Empty Empty
1. ---
2. ---
3. ---
4. ---
5. ST R1, [R4+54]
6. ADD R1, R9, R10
An Example:
ST R1, [R4 + 54]
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 3 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 Empty
1. ---
2. ---
3. ---
4. ---
5. ---
6. ADD R1, R9, R10
An Example:
ADD R1, R9, R10
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 6 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 I 6, E
1. ---
2. ---
3. ---
4. ---
5. ---
6. ---
An Example:
ADD R1, R9, R10
Instruction Fetch
Reg Number
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status 6 0 0 0 0 0 4 0 0 0
Register Status Indicator
I 1, E I 2, W 1 I 3, E I 4, W 3 I 5, W 3 I 6, E
1. ADD R1, R2, R3
2. ST U1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, U3, R8
5. ST U3, [R4+54]
6. ADD R1, R9, R10
Effectively three instructions are executing and the others are waiting for the appropriate results. The whole program is converted as shown above (Ui stands for the result of execution unit i). Note that operand forwarding and register renaming happen automatically.
Execution unit 6, on completion, will reset the R1 entry in the Register Status Indicator to 0. Similarly, unit 4 will reset the R7 entry to 0.
Memory Aliasing
• Every memory location is conceptually a register.
• So, in principle, the same method can be used.
• But the size of a memory status indicator would be prohibitively large.
• Instead, an associative memory is used to record each memory address that is to be written and the unit number doing the write.
Other Hazards
• Control Hazards
  – Conditional jumps – which instruction should be fetched next into the pipeline?
  – Branch predictors are used, which predict whether a branch is taken or not.
  – A misprediction leads to undoing some actions, increasing the penalty, but not much more can be done.
Branch Prediction
• Different types of predictors
  – Tournament
  – Correlating
  – K-bit
• Reference: Hennessy and Patterson, Computer Architecture.
Other Hazards
• Structural Hazards
  – Non-availability of a functional unit.
  – Say we would like to schedule the seventh instruction in our example: the new instruction has to wait.
  – Separate Integer, FPU and Load/Store units are therefore made available.
• Load-Store Architecture – what is it?
Architectural Enhancements
Amdahl’s Law
Speedup(overall) = Exec_time without enhancement / Exec_time with enhancement

A = fraction of the computation time in the original architecture that can take advantage of the enhancement.

Exec_time(new) = (1 – A) * Exec_time(old) + Exec_time of enhanced portion(new)   -- (1)

Speedup(enhanced) = Exec_time of enhanced portion(old) / Exec_time of enhanced portion(new)
                  = A * Exec_time(old) / Exec_time of enhanced portion(new)

so  Exec_time of enhanced portion(new) = A * Exec_time(old) / Speedup(enhanced)

Substituting in (1) above we get

Exec_time(new) = Exec_time(old) * ( (1 – A) + A / Speedup(enhanced) )

Final form of Amdahl's Law:

Speedup(overall) = 1 / ( (1 – A) + A / Speedup(enhanced) )
Application of Amdahl’s Law:
Suppose 50% of execution time is FP operations – 20% FP square root and 30% other FP.
Choice 1: use hardware to improve FP square root, for a speedup of 10.
Choice 2: use software to improve all FP operations, for a speedup of 1.6.
Speedup in Choice 1 is 1/(1 – 0.2 + 0.2/10) = 1.22
Speedup in Choice 2 is 1/(1 – 0.5 + 0.5/1.6) = 1.23
Choice 2 is better than Choice 1
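The final form of the law and the two choices above can be checked in a few lines; a sketch:

```python
def speedup_overall(A, speedup_enhanced):
    # final form of Amdahl's Law
    return 1.0 / ((1.0 - A) + A / speedup_enhanced)

print(round(speedup_overall(0.2, 10), 2))   # 1.22  (Choice 1)
print(round(speedup_overall(0.5, 1.6), 2))  # 1.23  (Choice 2)
```

A large speedup on a small fraction (Choice 1) barely beats a modest speedup on a large fraction (Choice 2) – the fraction A dominates.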
Shared Memory Architectures
•Sharing one memory space among several processors.
•Maintaining coherence among several copies of a data item.
[Figure: a shared-memory multiprocessor – four processors, each with registers and caches, connected through a chipset to a shared memory and disk/other IO.]
Snoopy-Cache State Machine I
State machine for CPU requests, for each cache block (applies to write-back data caches).
States: Invalid, Shared (read-only), Exclusive (read/write).

  Invalid   – CPU read: place read miss on bus; go to Shared.
  Invalid   – CPU write (hit/miss): place write miss on bus; go to Exclusive.
  Shared    – CPU read hit: stay in Shared.
  Shared    – CPU read miss: place read miss on bus; stay in Shared.
  Shared    – CPU write: place write miss on bus; go to Exclusive.
  Exclusive – CPU read hit or CPU write hit: stay in Exclusive.
  Exclusive – CPU read miss: write back the block, place read miss on bus; go to Shared.
  Exclusive – CPU write miss: write back the cache block, place write miss on bus; stay in Exclusive.
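The CPU-side transitions can be written down as a table and exercised directly. A sketch, with state and event names following the list above (the table encoding is illustrative, not from the slides):

```python
# (state, event) -> (next_state, bus_action or None)
CPU_SIDE = {
    ("Invalid", "read"):         ("Shared",    "place read miss on bus"),
    ("Invalid", "write"):        ("Exclusive", "place write miss on bus"),
    ("Shared", "read hit"):      ("Shared",    None),
    ("Shared", "read miss"):     ("Shared",    "place read miss on bus"),
    ("Shared", "write"):         ("Exclusive", "place write miss on bus"),
    ("Exclusive", "read hit"):   ("Exclusive", None),
    ("Exclusive", "write hit"):  ("Exclusive", None),
    ("Exclusive", "read miss"):  ("Shared",    "write back block, place read miss on bus"),
    ("Exclusive", "write miss"): ("Exclusive", "write back block, place write miss on bus"),
}

state, action = CPU_SIDE[("Invalid", "write")]
print(state, "|", action)  # Exclusive | place write miss on bus
```

The worked example later in this section is just repeated lookups in this table (plus the bus-side machine for the other cache).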
Snoopy-Cache State Machine II
State machine for bus requests, for each cache block.

  Shared    – write miss for this block seen on the bus: go to Invalid.
  Exclusive – read miss for this block seen on the bus: write back the block (abort the memory access); go to Shared.
  Exclusive – write miss for this block seen on the bus: write back the block (abort the memory access); go to Invalid.
Shared Memory Architectures: a worked example

The following sequence of operations is traced step by step below, recording each cache's state, the bus action, and memory:

P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.

[Figure: the combined state machine – a remote write or write miss invalidates a block; a remote read of an Exclusive block forces a write-back and a transition to Shared.]
Example: step 1 (Shared Memory Architectures)
This traces the caches of P1 and P2.

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1

(Arc used: write miss placed on the bus; P1 goes Invalid → Exclusive.)
Example: step 2 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (CPU read hit, no bus action)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 3 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10   (remote read – P1 writes back and goes to Shared)
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10                (P2 receives the data)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 4 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10
P2: Write 20 to A1
  P1: Inv. | P2: Excl. A1 20 | Bus: WrMs P2 A1 | Mem: A1 10   (remote write – P1 invalidates)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Example: step 5 (Shared Memory Architectures)

P1: Write 10 to A1
  P1: Excl. A1 10 | Bus: WrMs P1 A1
P1: Read A1
  P1: Excl. A1 10 (read hit)
P2: Read A1
  P2: Shar. A1 | Bus: RdMs P2 A1
  P1: Shar. A1 10 | Bus: WrBk P1 A1 10 | Mem: A1 10
  P2: Shar. A1 10 | Bus: RdDa P2 A1 10
P2: Write 20 to A1
  P1: Inv. | P2: Excl. A1 20 | Bus: WrMs P2 A1 | Mem: A1 10
P2: Write 40 to A2
  Bus: WrMs P2 A2
  P2: Excl. A2 40 | Bus: WrBk P2 A1 20 | Mem: A1 20   (A2 maps to the same block, so A1 is written back first)

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.
Distributed Memory Systems
[Figure: processors P1–P4, each with its own local memory, connected by a network.]

Message passing is used for inter-process communication.
Shared vs Distributed Memory

  Shared memory:            Distributed memory:
  1. Single address space   1. Multiple address spaces
  2. Easy to program        2. Difficult to program
  3. Less scalable          3. More scalable, provided you know how to program it
Basics of Message Passing Systems: Messages

A message must specify:
• Which processor is sending the message.
• Where the data is on the sending processor.
• What kind of data is being sent.
• How much data there is.
• Which processor(s) are receiving the message.
• Where the data should be left on the receiving processor.
• How much data the receiving processor is prepared to accept.
Aspects

• Access to the message passing system
• Addressing
• Reception
• Point-to-point communication: two processors communicate
  • Synchronous and asynchronous
  • Blocking and non-blocking operations
• Collective communication: a group of processors communicate
  • Barrier, Broadcast and Reduction operations
Point-to-Point Synchronous Communication

Synchronous communication does not complete until the message has been received.

Point-to-Point Asynchronous Communication

Asynchronous communication completes as soon as the message is on its way.
Non-Blocking Operations

Non-blocking operations return straight away after initiating the operation, and hence allow useful work to be performed while waiting for the operation to complete. One can test for completion of the operation when necessary.

Blocking Operations

Blocking operations wait for the operation to complete before proceeding further.
Barrier

• Synchronizes processors by blocking until all of the participating processors have called the barrier routine.
• There is no exchange of data.
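A minimal sketch of barrier synchronization using Python's standard threading module, with threads standing in for the participating processors:

```python
import threading

results = []
barrier = threading.Barrier(4)  # 4 participating "processors"

def worker(i):
    # ... local work would happen here ...
    barrier.wait()      # block until all 4 threads have called the barrier
    results.append(i)   # runs only after every thread reached the barrier

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```

As the slide says, nothing is exchanged at the barrier itself; it only guarantees that no participant proceeds until all have arrived.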
Broadcast

One-to-many communication. One processor sends the same message to several destinations in a single operation.
Reduction

Takes data items from several processors and reduces them to a single data item that is usually made available to all of the participating processors, e.g. strike voting, summation.
Parallel Models
EREW PRAM: Exclusive Read Exclusive Write Parallel Random Access Machine.
CREW PRAM: Concurrent Read Exclusive Write Parallel Random Access Machine.
CRCW PRAM: Concurrent Read Concurrent Write Parallel Random Access Machine.
Parallel Algorithm: Recursive Doubling Technique

• Finding the prefix sums of 8 numbers (shown from index 7 down to index 0):

  Input:   -15   6  -8   7   3  -2   1   0
  Step 1:   -9  -2  -1  10   1  -1   1   0   (each x[i] += x[i-1], in parallel)
  Step 2:  -10   8   0   9   2  -1   1   0   (each x[i] += x[i-2], in parallel)
  Step 3:   -8   7   1   9   2  -1   1   0   (each x[i] += x[i-4], in parallel)
• Prefix Sum of n numbers in O(log n) steps
• EREW Implementation
• Applicable for any semigroup operator like min , max , mul etc.
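A sketch of the technique in Python, run on the slide's data (listed here from index 0, the rightmost element above, up to index 7): at step s every position i ≥ 2^(s-1) adds the value 2^(s-1) places below it, so all n prefix sums finish in ⌈log2 n⌉ steps.

```python
def prefix_sum_recursive_doubling(a):
    x = list(a)
    d = 1
    while d < len(x):
        # every position i >= d adds x[i-d]; all reads use the old values,
        # modelling one synchronous parallel step
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(len(x))]
        d *= 2
    return x

# the slide's input, from index 0 (right) to index 7 (left)
print(prefix_sum_recursive_doubling([0, 1, -2, 3, 7, -8, 6, -15]))
# [0, 1, -1, 2, 9, 1, 7, -8]
```

Each step reads x[i-d] and writes x[i] at distinct positions, which is why an EREW implementation suffices; replacing + by min, max or multiplication gives the same algorithm for any semigroup operator.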
CRCW Algorithms
• What can you do in constant time?
• Find the OR of n bits (b1, b2, …, bn) using n processors:
  – Each processor Pi reads bi in parallel.
  – If (bi = 1) then C = 1;
• All processors that write to C write the same value, so the concurrent write is consistent.
Optimality of Parallel Algorithms
• A parallel algorithm is optimal iff
  (time taken by the parallel algorithm) × (number of processors used)
  = time taken by the best known sequential algorithm.
• Prefix sums of n numbers:
  O(n) sequentially; the parallel algorithm has an O(n log n) processor-time product – not optimal.
Make it Optimal

Yes – using Brent's technique.
Use p processors and split the data set into p blocks of n/p elements each.
Allot one processor per block. As an example, take 3 processors and 6 numbers:

  Bp … B2 B1
  6 5 | 4 3 | 2 1

The Algorithm

Step 1: Processor i finds the prefix sums of the elements in Bi sequentially; let Si be its last prefix, the block total – O(n/p) time.

  11 5 | 7 3 | 3 1        (block-local prefix sums)
  (S3, S2, S1) = (11, 7, 3)

Step 2: Find the prefix sums of (Sp, …, S1), giving (S'p, …, S'1) – O(log p) time by recursive doubling.

  (S'3, S'2, S'1) = (21, 10, 3)

Step 3: Processor i, for i > 1, adds S'(i-1) to all prefixes of block Bi – O(n/p) time.

  Result: 21 15 | 10 6 | 3 1

Total: O(n/p + log p) time, which is O(n/p) when n/p ≥ log p – an optimal processor-time product.
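The three steps above can be sketched directly; this sequential Python version mirrors the per-processor work (step 2 is an ordinary loop here, standing in for the recursive-doubling phase):

```python
def brent_prefix_sum(a, p):
    """Prefix sums with p (simulated) processors; assumes p divides len(a)."""
    b = len(a) // p
    blocks = [a[i * b:(i + 1) * b] for i in range(p)]
    # Step 1: each processor scans its own block sequentially
    local = []
    for blk in blocks:
        pre, s = [], 0
        for v in blk:
            s += v
            pre.append(s)
        local.append(pre)
    # Step 2: prefix sums of the block totals
    # (done by recursive doubling in the parallel setting)
    offsets, s = [], 0
    for pre in local:
        offsets.append(s)
        s += pre[-1]
    # Step 3: processor i adds the total of all blocks before it
    out = []
    for off, pre in zip(offsets, local):
        out.extend(v + off for v in pre)
    return out

print(brent_prefix_sum([1, 2, 3, 4, 5, 6], 3))  # [1, 3, 6, 10, 15, 21]
```

With p = n / log n processors the processor-time product becomes O(n), matching the sequential algorithm.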
Sublogarithmic Algorithms and CRCW PRAM

• Minimum of (a(1), a(2), …, a(n)) using n² processors in O(1) time:
  • C[i] = 1; using n processors.
  • for all (i, j): if a(i) > a(j) then C[i] = 0; uses n² processors.
  • if C[i] = 1 then B = a(i); uses n processors.
• B now holds the minimum.
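A sequential simulation of this constant-time algorithm; the doubly nested loop plays the role of the n² processor grid, and the final loop the n concurrent writers:

```python
def crcw_min(a):
    """O(1)-time CRCW PRAM minimum, simulated sequentially.

    Every processor that writes to B writes the same value (the minimum),
    so the concurrent write is consistent.
    """
    n = len(a)
    C = [1] * n
    for i in range(n):          # processors (i, j) compare in parallel
        for j in range(n):
            if a[i] > a[j]:
                C[i] = 0        # a(i) lost a comparison, so it is not the minimum
    B = None
    for i in range(n):          # the surviving candidates all write the same value
        if C[i] == 1:
            B = a[i]
    return B

print(crcw_min([5, 2, 8, 1]))  # 1
```

Note that with duplicates several C[i] stay 1, but they all write the same value to B, which is exactly what the common-write CRCW model allows.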
What if we have only n processors?
• Split into n^0.5 blocks, each of n^0.5 elements, and assign n^0.5 processors to each block.
• Solve recursively in each block to get n^0.5 block minima.
• Now find the minimum of these n^0.5 elements using the n processors – constant time by the previous algorithm, since (n^0.5)² = n.
• T(n) = T(n^0.5) + O(1) = O(log log n)