
Page 1:

A Tutorial on

High Performance Computing Taxonomy

By

Prof. V. Kamakoti

Department of Computer Science and Engineering

Indian Institute of Technology, Madras

Chennai – 600 036, India

Page 2:

Organization of the Tutorial

• Session 1 – Instruction Level Parallelism (ILP)
  – Pipelining concepts
  – RTL and speed-up
  – Superscalar/VLIW concepts
    • Static Instruction Scheduling
    • Dynamic Instruction Scheduling
  – Branch Prediction

Page 3:

Organization of the Tutorial

• Session 2
  – Amdahl’s law and its applications
  – Symmetric Multiprocessors (SMP)
    • The Cache Coherency problem – ESI Protocol
  – Distributed Memory Systems
  – Basics of Message Passing Systems
  – Parallel Models of Computing
  – Design of Algorithms for Parallel Processors
    • Brent’s Lemma

Page 4:

Why this Title?

• Performance-related issues at
  – circuit level (RTL)
  – instruction level (processor level)
  – shared memory multiprocessor level (SMP)
  – distributed memory multiprocessor level (cluster/grid – a collection of SMPs)

Page 5:

ILP - Pipelining

Stages: Fetch + Inc. PC → Decode Instrn → Fetch Data → Execute Instrn → Store Data

Without pipelining, 10000 instructions take 50000 units (5 stages of one unit each).

With pipelining (diagram: in each unit of time a new instruction enters the pipeline while the earlier ones advance one stage, so up to five instructions I1–I5 are in flight at once):

The first instruction completes at the end of the 5th unit, the second at the end of the 6th unit, and the 10000th at the end of the 10004th unit.
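A quick check of these numbers as a minimal Python sketch (stage count and instruction count as on this slide):

```python
# Pipeline timing sketch: k-stage pipeline, one instruction issued per unit.
def no_pipeline_time(n, stages):
    return n * stages            # each instruction occupies all stages serially

def pipeline_time(n, stages):
    return stages + (n - 1)      # fill the pipe once, then one completion per unit

n, k = 10_000, 5
print(no_pipeline_time(n, k))    # 50000 units
print(pipeline_time(n, k))       # 10004 units
print(no_pipeline_time(n, k) / pipeline_time(n, k))  # speedup close to 5
```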

Page 6:

Performance

• With pipelining we get a speedup of close to 5.
• This will not always work:
  – Hazards: data, control, structural
  – Non-ideal stages: not every step takes the same amount of time (e.g., float multiply 10 cycles, float divide 40 cycles)
• Performance comes from parallelism, so let us try to understand parallelism.

Page 7:

Types of Parallelism

• Recognizing parallelism
  – Example: adding 100 numbers
  – for j = 1 to 100 do Sum = Sum + A[j]; // inherently sequential
  – A better solution in terms of parallelism, assuming 4 processors are available (see the sketch below):
    • Split the 100 numbers into 4 parts of 25 numbers each and allot one part to each of the four processors.
    • Each processor adds its 25 numbers and sends the partial sum to a head processor, which adds the partial sums and gives the total.
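A minimal sketch of this data-parallel scheme, using a Python process pool to stand in for the four processors (the pool and the chunking are illustrative; any worker mechanism would do):

```python
# Data-parallel sum: split 100 numbers into 4 chunks of 25,
# let each worker add its chunk, then add the partial sums.
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    A = list(range(1, 101))                          # the 100 numbers
    chunks = [A[i:i + 25] for i in range(0, 100, 25)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial = list(pool.map(chunk_sum, chunks))  # one partial sum per "processor"
    print(sum(partial))                              # head processor adds: 5050
```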

Page 8:

Types of Parallelism

• Data Parallelism or SIMD
  – The above example
  – The same instruction, “add 25 numbers”, but on multiple data
  – The parallelism comes from the data.

Page 9:

Types of Parallelism

• Functional Parallelism or MIMD
  – Multiple functions to be performed on a data set or data sets
  – Multiple instructions and multiple data
  – An example is the pipeline discussed earlier.

Page 10:

Example of Pipelining

• Imagine that 100 sets of data are to be processed in sequence by the following system:

Part 1 → Part 2

Part 1 takes 10 ms and Part 2 takes 15 ms, so one data set takes 25 ms; to process 100 sets of data takes 2500 ms.

Page 11:

Example of Pipelining

• Consider the following change, with a storage element between the parts. While the first data set is in Part 2, the second data set can be in Part 1:

Part 1 → STORAGE → Part 2

The first data set finishes at 30 ms (two cycles of the 15 ms maximum stage delay), and after every 15 ms one data set comes out, so the total processing time is 30 + 99 × 15 = 1515 ms – a tremendous speedup.

Page 12:

Functional Parallelism

• Different data sets and different instructions on them.
• Hence, multiple instruction and multiple data.
• An interesting problem is to convert circuits with large delays into pipelined circuits to get very good throughput, as seen earlier.
• Useful in the context of using the same circuit for different sets of data.

Page 13:

Pipelining Concepts

• A combinational circuit can be easily modeled as a Directed Acyclic Graph (DAG).

• Every node of the DAG is a subcircuit of the given circuit – forms a stage of a pipeline.

• An edge of the DAG connects two nodes of the DAG.

• Perform a topological sorting of the DAG.

Page 14:

(Figure: an example DAG with nodes N1–N8, topologically sorted into levels 1–4.)

Page 15:

A Pipelined Circuit

• If an edge connects two nodes at levels j and k, j < k, then introduce k − j storage elements (registers) on that edge.
• Each edge can carry one or more bits.
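A minimal sketch of this construction, assuming the circuit DAG is given as an adjacency list (the edge list below is illustrative, not the exact graph of the figure): levels are assigned by a longest-path pass in topological order, and each edge from level j to level k then needs k − j registers.

```python
# Pipelining a combinational DAG: assign each node a level via a
# topological pass, then count the registers needed on each edge.
from graphlib import TopologicalSorter

edges = {"N1": ["N4"], "N2": ["N4", "N5"], "N3": ["N5"],
         "N4": ["N6"], "N5": ["N7"], "N6": ["N8"], "N7": ["N8"], "N8": []}

# graphlib wants predecessor sets, so invert the adjacency list.
preds = {u: set() for u in edges}
for u, vs in edges.items():
    for v in vs:
        preds[v].add(u)

level = {}
for u in TopologicalSorter(preds).static_order():
    level[u] = 1 + max((level[p] for p in preds[u]), default=0)

for u, vs in edges.items():
    for v in vs:
        regs = level[v] - level[u]      # k - j storage elements on this edge
        print(f"{u}(L{level[u]}) -> {v}(L{level[v]}): {regs} register(s)")
```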

Page 16:

(Figure: the pipelined version of the same DAG, with each node annotated by its level and registers inserted on the edges according to the rule above.)

Page 17:

Optimization

• The delay at every stage should be almost equal.
• The stage with the maximum delay dictates the throughput.
• The number of bits transferred between nodes should also be minimized, which reduces the storage requirements.

Page 18:

Stage Time Balancing

Part 1 → STORAGE → Part 2

If Part 1 takes 10 ms and Part 2 takes 15 ms, the first data set finishes at 30 ms and after every 15 ms one data set comes out – the total processing time for 100 data sets is 1515 ms. A tremendous speedup.

If Part 1 takes 12 ms and Part 2 takes 13 ms, the first data set finishes at 26 ms and after every 13 ms one data set comes out – the total processing time for 100 data sets is 1313 ms. A further significant improvement.
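The same arithmetic as a tiny sketch: with a register between the stages, the cycle time is the maximum stage delay, the first item takes one cycle per stage, and every later item follows one cycle behind.

```python
# Two-stage pipeline throughput: cycle time = max stage delay.
def total_time(n_items, stage_delays):
    cycle = max(stage_delays)
    fill = len(stage_delays) * cycle          # first item drains the pipe
    return fill + (n_items - 1) * cycle

print(total_time(100, [10, 15]))   # 1515 ms
print(total_time(100, [12, 13]))   # 1313 ms  (balanced stages win)
```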

Page 19:

RTL View of Circuits and Performance

• Register Transfer Level

RegL1 → Combo (3 ns) → RegL2 → Combo (5 ns) → RegL3

Clock frequency is 1/(5 ns) = (1/5) × 10⁹ Hz = 200 MHz.

Improve the frequency by reducing the maximum stage delay.

Page 20:

High Speed Circuits

• Carry ripple to carry look-ahead adders
• Wallace tree multipliers
• Increased area and power consumption
• Lower time delays
• Why do laptops have lower frequency ratings than desktops?
• Reference: Cormen, Leiserson and Rivest, Introduction to Algorithms (first edition), or Computer Organization by Hamacher et al.

Page 21:

ILP Continues….

• Data Hazards
  – LOAD [R2+10], R1 // loads memory at R2+10 into R1
  – ADD R3, R1, R2 // R3 = R1 + R2
• This is the “Read After Write (RAW)” data hazard on R1.
  – LD [R2+10], R1
  – ADD R3, R1, R12
  – LD [R2+14], R1
  – ADD R12, R1, R2
• This sequence shows a WAW hazard on R1 and a WAR hazard on R12 (classified programmatically in the sketch below).
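A small sketch that classifies pairwise register hazards in a straight-line sequence; each instruction is modeled simply as a (writes, reads) pair of register sets, with stores reading their source register and writing no register:

```python
# Classify RAW/WAR/WAW register hazards between all pairs i < j.
def hazards(instrs):
    # instrs: list of (writes, reads) register-name sets, in program order
    found = []
    for i in range(len(instrs)):
        for j in range(i + 1, len(instrs)):
            wi, ri = instrs[i]
            wj, rj = instrs[j]
            if wi & rj: found.append(("RAW", i + 1, j + 1))
            if ri & wj: found.append(("WAR", i + 1, j + 1))
            if wi & wj: found.append(("WAW", i + 1, j + 1))
    return found

# LD [R2+10], R1 ; ADD R3, R1, R12 ; LD [R2+14], R1 ; ADD R12, R1, R2
prog = [({"R1"}, {"R2"}),
        ({"R3"}, {"R1", "R12"}),
        ({"R1"}, {"R2"}),
        ({"R12"}, {"R1", "R2"})]
print(hazards(prog))
# includes RAW on R1 (1,2), WAW on R1 (1,3), WAR on R12 (2,4)
```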

Page 22:

ILP – Pipelining Advanced

Stages: Fetch + Inc. PC → Decode Instrn → Fetch Data → {Execute Unit 1, Execute Unit 2, …, Execute Unit K} → Store Data

• Superscalar: CPI < 1
• This succeeds because different instructions take different cycle times – e.g., four FMULs can complete while one FDIV is still executing.
• This implies out-of-order execution.

Page 23:

Difficulties in Superscalar Construction

• Ensuring there are no data hazards among the several instructions executing in the different execution units at the same time.
• If this is done by the compiler – Static Instruction Scheduling – VLIW – Itanium.
• If it is done by the hardware – Dynamic Instruction Scheduling – Tomasulo – MIPS embedded processors.

Page 24:

Static Instruction Scheduling

• The compiler makes bundles of K instructions that can be issued to the execution units at the same time, such that there are no data dependencies between them.
  – A Very Long Instruction Word (VLIW) accommodates K instructions at a time.
• Lots of NOPs if a bundle cannot be filled with relevant instructions
  – increases the size of the executable.
• Does not complicate the hardware.
• Code portability – if I make the next-generation processor with K+5 units (say), then what?
  – Solved by having a software/firmware emulator, which has a negative effect on performance.

Page 25:

Thorn in the Flesh for Static Instruction Scheduling

• The famous “Memory Aliasing Problem”:
  – ST R3, [R4+40] // store R3 to memory at R4+40
  – LD [R1+20], R2 // load memory at R1+20 into R2
• This implies a RAW dependence if (R1 + 20 = R4 + 40), and this cannot be detected at compile time.
• Such combinations of memory operations are not put in the same bundle, and memory operations are strictly scheduled in program order.

Page 26:

Dynamic Instruction Scheduling

• The data hazards are handled by the hardware:
  – RAW, using the operand forwarding technique
  – WAR and WAW, using the register renaming technique

Page 27:

Processor Overview

(Figure: a processor with ALU/control, multiple function units and a register file, connected to memory over a bus.)

LD [R1+20], R2
ADD R3, R2, R4 // RAW on R2

Why should the result of the LD go to R2 in the register file and only then be reloaded into the ALU? Forward it to the ALU on its way to the register file.

Page 28:

Register Renaming

1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Dependencies due to register R1:

RAW: (1,2), (1,4), (1,5), (3,4), (3,5)
WAR: (2,3), (2,6), (4,6), (5,6)
WAW: (1,3), (1,6), (3,6)

Page 29:

Register Renaming: Static Scheduling

1. ADD R1, R2, R3

2. ST R1, [R4+50]

3. ADD R12, R5, R6

4. SUB R7,R12,R8

5. ST R12, [R4 + 54]

6. ADD R1, R9,R10

Rename R1 to R12 from instruction 3 until instruction 6.

Dependencies now exist only within a window, not across the whole program: the only remaining WAR and WAW are (2,6) and (1,6), which are far apart in program order.

This increases the register pressure for the compiler.

Page 30:

Dynamic Scheduling - Tomasulo

(Figure: an instruction fetch unit issues to reservation stations in front of execution units Exec 1–Exec 4; a register status indicator and a Common Data Bus (CDB) connect the units to the register file and memory.)

Instructions are fetched one by one and decoded to find the type of operation and the sources of the operands.

The register status indicator shows whether the latest value of a register is in the register file or is currently being computed by some execution unit; in the latter case it holds that execution unit’s number.

If all operands are available, the operation proceeds in the allotted execution unit; otherwise it waits in the reservation station of the allotted execution unit, pinging the CDB.

Every execution unit writes its result, along with its unit number, onto the CDB, which forwards it to all reservation stations, the register file and memory.
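Before the worked example on the next slides, here is a toy Python sketch of just the issue step under the scheme described above; only the register status indicator is modeled (the CDB broadcast, reservation-station wakeup and actual execution are omitted), and the unit numbering follows the example:

```python
# Toy Tomasulo issue: operands are tagged either with a value from the
# register file or with the unit number that will produce them.
status = {f"R{i}": 0 for i in range(1, 11)}   # 0 = latest value is in reg file

def issue(unit, dest, srcs):
    # Read each source: a nonzero status means "wait for that unit on the CDB".
    tags = {s: status[s] for s in srcs}
    waiting = {s: u for s, u in tags.items() if u != 0}
    if dest is not None:
        status[dest] = unit                   # this unit now owns dest
    print(f"unit {unit}: dest={dest}, waits on {waiting or 'nothing'}")

issue(1, "R1", ["R2", "R3"])   # ADD R1, R2, R3
issue(2, None, ["R1", "R4"])   # ST  R1, [R4+50]  -> waits on unit 1
issue(3, "R1", ["R5", "R6"])   # ADD R1, R5, R6   -> renames R1 to unit 3
issue(4, "R7", ["R1", "R8"])   # SUB R7, R1, R8   -> waits on unit 3
```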

Page 31:

An Example: Instruction Fetch

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  0  0  0  0  0  0  0  0  0  0

Reservation stations: Empty | Empty | Empty | Empty | Empty | Empty

1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Page 32:

An Example: issue ADD R1, R2, R3

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  1  0  0  0  0  0  0  0  0  0

Reservation stations: Ins 1 | Empty | Empty | Empty | Empty | Empty

1. --
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Page 33:

An Example: issue ST R1, [R4+50]

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  1  0  0  0  0  0  0  0  0  0

Reservation stations: I 1, E | I 2, W 1 | Empty | Empty | Empty | Empty
(I n = instruction n; E = executing; W k = waiting for the result of unit k)

1. ---
2. ---
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Page 34:

An Example: issue ADD R1, R5, R6

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  0  0  0  0

Reservation stations: I 1, E | I 2, W 1 | I 3, E | Empty | Empty | Empty

1. ---
2. ---
3. ---
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Note: a reservation station stores the number of the execution unit that will yield the latest value of a register.

Page 35:

An Example: issue SUB R7, R1, R8

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  4  0  0  0

Reservation stations: I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | Empty | Empty

1. ---
2. ---
3. ---
4. ---
5. ST R1, [R4+54]
6. ADD R1, R9, R10

Page 36:

An Example: issue ST R1, [R4+54]

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  4  0  0  0

Reservation stations: I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | Empty

1. ---
2. ---
3. ---
4. ---
5. ---
6. ADD R1, R9, R10

Page 37:

An Example: issue ADD R1, R9, R10

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  6  0  0  0  0  0  4  0  0  0

Reservation stations: I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | I 6, E

1. ---
2. ---
3. ---
4. ---
5. ---
6. ---

Page 38:

An Example: the transformed program

Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  6  0  0  0  0  0  4  0  0  0

Reservation stations: I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | I 6, E

1. ADD R1, R2, R3
2. ST U1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, U3, R8
5. ST U3, [R4+54]
6. ADD R1, R9, R10

Effectively three instructions are executing and the others are waiting for the appropriate results. The whole program has been converted as shown above (Un = the output of execution unit n). Note that operand forwarding and register renaming are done automatically.

Execution unit 6, on completion, will set the R1 entry in the register status indicator to 0; similarly, unit 4 will set the R7 entry to 0.

Page 39:

Memory Aliasing

• Every memory location is a register, so conceptually the same method can be used.
• But the size of a memory status indicator would be prohibitively large.
• Instead, an associative memory is used to record each memory address that is about to be written, together with the unit number doing the write.

Page 40:

Other Hazards

• Control Hazards
  – Conditional jumps: which instruction should be fetched next into the pipeline?
  – Branch predictors are used, which predict whether a branch is taken or not.
  – A misprediction means undoing some actions, which increases the penalty, but nothing much more can be done.

Page 41:

Branch Prediction

• Different types of predictors
  – Tournament
  – Correlation
  – K-bit
• Reference: Hennessy and Patterson, Computer Architecture: A Quantitative Approach.

Page 42:

Other Hazards

• Structural Hazards
  – Non-availability of a functional unit.
  – Say we would like to schedule a seventh instruction in our example: the new instruction has to wait.
  – Separate integer, FPU and load/store units are made available.
• Load-Store Architecture – what is it?

Page 45:

Architectural Enhancements

Amdahl’s Law

Speedup(overall) = Exec_time(without enhancement) / Exec_time(with enhancement)

A = fraction of the computation time in the original architecture that can be converted to take advantage of the enhancement.

Exec_time(new) = (1 − A) × Exec_time(old) + Exec_time(enhanced portion, new)   — (1)

Page 46:

Speedup(enhanced) = Exec_time(enhanced portion, old) / Exec_time(enhanced portion, new)
                  = A × Exec_time(old) / Exec_time(enhanced portion, new)

Hence Exec_time(enhanced portion, new) = A × Exec_time(old) / Speedup(enhanced).

Substituting in (1) above, we get:

Page 47:

Exec_time(new) = Exec_time(old) × ((1 − A) + A / Speedup(enhanced))

Final form of Amdahl’s Law:

Speedup(overall) = 1 / ((1 − A) + A / Speedup(enhanced))

Page 48:

Application of Amdahl’s Law:

Suppose 50% of the execution time is always FP operations: 20% FP square root and 30% other FP operations.

Choice 1: use hardware to improve FP square root by a speedup of 10.
Choice 2: use software to improve all FP operations by a speedup of 1.6.

Speedup in Choice 1 is 1 / (1 − 0.2 + 0.2/10) = 1.22.
Speedup in Choice 2 is 1 / (1 − 0.5 + 0.5/1.6) = 1.23.

Choice 2 is better than Choice 1.
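A one-line check of both choices, with the fractions and speedups taken from this slide:

```python
# Amdahl's law: overall speedup when fraction A of the time is sped up by s.
def amdahl(A, s):
    return 1.0 / ((1.0 - A) + A / s)

print(round(amdahl(0.2, 10),  2))   # Choice 1: FP sqrt only -> 1.22
print(round(amdahl(0.5, 1.6), 2))   # Choice 2: all FP ops   -> 1.23
```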

Page 49:

Shared Memory Architectures

• Sharing one memory space among several processors.
• Maintaining coherence among several copies of a data item.

Page 50:

(Figure: a shared memory multiprocessor – four processors, each with its own registers and caches, connected through a chipset to a shared memory and to disk and other I/O. The caches snoop the shared bus.)

Page 51:

Snoopy-Cache State Machine I

State machine for CPU requests, for each cache block. States: Invalid, Shared (read only), Exclusive (read/write). Write-back applies to write-back data caches.

• Invalid → Shared: CPU read; place read miss on bus.
• Invalid → Exclusive: CPU write; place write miss on bus.
• Shared → Shared: CPU read hit; on a CPU read miss, place read miss on bus.
• Shared → Exclusive: CPU write (hit or miss); place write miss on bus.
• Exclusive → Exclusive: CPU read hit or write hit; on a CPU write miss, write back the cache block and place write miss on bus.
• Exclusive → Shared: CPU read miss; write back the block and place read miss on bus.
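The CPU-side transitions above can be written down as a small lookup table; this is a sketch of the protocol as reconstructed on this slide, not of any particular machine:

```python
# CPU-side snoopy transitions: (state, CPU event) -> (next state, bus action).
CPU = {
    ("Invalid",   "read"):       ("Shared",    "place read miss on bus"),
    ("Invalid",   "write"):      ("Exclusive", "place write miss on bus"),
    ("Shared",    "read hit"):   ("Shared",    None),
    ("Shared",    "read miss"):  ("Shared",    "place read miss on bus"),
    ("Shared",    "write"):      ("Exclusive", "place write miss on bus"),
    ("Exclusive", "read hit"):   ("Exclusive", None),
    ("Exclusive", "write hit"):  ("Exclusive", None),
    ("Exclusive", "read miss"):  ("Shared",    "write back block; place read miss on bus"),
    ("Exclusive", "write miss"): ("Exclusive", "write back block; place write miss on bus"),
}

state = "Invalid"
for event in ["read", "write", "read hit"]:
    state, action = CPU[(state, event)]
    print(state, "|", action)
```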

Page 52:

Snoopy-Cache State Machine II

State machine for bus requests, for each cache block.

• Shared → Invalid: write miss on the bus for this block.
• Exclusive → Shared: read miss on the bus for this block; write back the block (abort the memory access).
• Exclusive → Invalid: write miss on the bus for this block; write back the block (abort the memory access).

Page 53:

Example (Shared Memory Architectures)

Operation sequence (the table built up in the following steps tracks the caches of P1 and P2, the bus, and memory; the columns are State/Addr/Value for each cache, Action/Proc./Addr/Value for the bus, and Addr/Value for memory):

P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2

Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block, but A1 ≠ A2.

Page 54:

Example: step 1

Step                 P1            P2    Bus           Memory
P1: Write 10 to A1   Excl. A1 10         WrMs P1 A1

P1’s write miss is placed on the bus and its block becomes Exclusive.

Page 55:

Example: step 2

Step                 P1            P2    Bus           Memory
P1: Write 10 to A1   Excl. A1 10         WrMs P1 A1
P1: Read A1          Excl. A1 10

The read is a CPU read hit; no bus traffic.

Page 56:

Example: step 3

Step                 P1            P2            Bus             Memory
P1: Write 10 to A1   Excl. A1 10                 WrMs P1 A1
P1: Read A1          Excl. A1 10
P2: Read A1                        Shar. A1      RdMs P2 A1
                     Shar. A1 10                 WrBk P1 A1 10   A1 10
                                   Shar. A1 10   RdDa P2 A1 10   A1 10

The remote read forces P1 to write the block back; both caches end up Shared with value 10.

Page 57:

Example: step 4

Step                 P1            P2            Bus             Memory
P1: Write 10 to A1   Excl. A1 10                 WrMs P1 A1
P1: Read A1          Excl. A1 10
P2: Read A1                        Shar. A1      RdMs P2 A1
                     Shar. A1 10                 WrBk P1 A1 10   A1 10
                                   Shar. A1 10   RdDa P2 A1 10   A1 10
P2: Write 20 to A1   Inv.          Excl. A1 20   WrMs P2 A1      A1 10

The remote write invalidates P1’s copy; P2 becomes Exclusive.

Page 58:

Example: step 5

Step                 P1            P2            Bus             Memory
P1: Write 10 to A1   Excl. A1 10                 WrMs P1 A1
P1: Read A1          Excl. A1 10
P2: Read A1                        Shar. A1      RdMs P2 A1
                     Shar. A1 10                 WrBk P1 A1 10   A1 10
                                   Shar. A1 10   RdDa P2 A1 10   A1 10
P2: Write 20 to A1   Inv.          Excl. A1 20   WrMs P2 A1      A1 10
P2: Write 40 to A2                               WrMs P2 A2      A1 10
                                   Excl. A2 40   WrBk P2 A1 20   A1 20

Since A2 maps to the same block, P2’s write to A2 first writes the dirty A1 block (value 20) back to memory.

Page 59:

Distributed Memory Systems

(Figure: processors P1–P4, each with its own local memory, connected by a NETWORK; message passing is used for inter-process communication.)

Page 60:

Shared vs Distributed Memory

Shared memory:                 Distributed memory:
1. Single address space        1. Multiple address spaces
2. Easy to program             2. Difficult to program
3. Less scalable               3. More scalable, provided you know how to program

Page 61:

Basics of Message Passing Systems: Messages

• Which processor is sending the message?
• Where is the data on the sending processor?
• What kind of data is being sent?
• How much data is there?
• Which processor(s) are receiving the message?
• Where should the data be left on the receiving processor?
• How much data is the receiving processor prepared to accept?

Page 62:

Aspects

• Access to the message passing system
• Addressing
• Reception
• Point-to-point communication: two processors communicate
  – Synchronous and asynchronous
  – Blocking and non-blocking operations
• Collective communication: a group of processors communicate
  – Barrier, broadcast and reduction operations

Page 63:

Point-to-Point Synchronous Communication

Synchronous communication does not complete until the message has been received.

Page 64:

Point-to-Point Asynchronous Communication

Asynchronous communication completes as soon as the message is on its way.

Page 65:

Non-Blocking Operations

Non-blocking operations return straight away after initiating the operation, which allows useful work to be performed while waiting for the operation to complete. One can test for the completion of the operation when necessary.

Page 66:

Blocking Operations

Blocking operations wait for the operation to complete before proceeding further.

Page 67:

Barrier

• Synchronizes processors by blocking until all of the participating processors have called the barrier routine.
• There is no exchange of data.

Page 68:

Broadcast

One-to-many communication: one processor sends the same message to several destinations in a single operation.

Page 69:

Reduction

Takes data items from several processors and reduces them to a

Single data item that is usually made available to all of the

participating processors. e.g. Strike Voting , Summation

STRIKE
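These collective operations map directly onto MPI. A minimal sketch using mpi4py (an assumption – the tutorial does not name a library), launched with mpirun:

```python
# Run with e.g.: mpirun -n 4 python collectives.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()                        # everyone waits here; no data exchanged

data = {"msg": "hello"} if rank == 0 else None
data = comm.bcast(data, root=0)       # one-to-many: all ranks now hold the dict

total = comm.reduce(rank, op=MPI.SUM, root=0)   # many-to-one: sum of all ranks
if rank == 0:
    print("sum of ranks:", total)
```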

Page 70:


Parallel Models

EREW PRAM: Exclusive Read Exclusive Write Parallel Random Access Machine.

CREW PRAM: Concurrent Read Exclusive Write Parallel Random Access Machine.

CRCW PRAM: Concurrent Read Concurrent Write Parallel Random Access Machine.

Page 71:

Parallel Algorithm: Recursive Doubling Technique

• Finding the prefix sum of 8 numbers

Index:   7   6   5   4   3   2   1   0
Value:  -15  6  -8   7   3  -2   1   0

Page 72:

Parallel Algorithm: Recursive Doubling Technique (Step 1)

In step k, every position i adds in the value at position i − 2^(k−1) (if it exists), all positions in parallel.

Step 1:
Index:   7   6   5   4   3   2   1   0
Value:  -9  -2  -1  10   1  -1   1   0

Page 73:

Parallel Algorithm: Recursive Doubling Technique (Step 2)

Step 2:
Index:   7    6   5   4   3   2   1   0
Value:  -10   8   0   9   2  -1   1   0

Page 74:

Parallel Algorithm: Recursive Doubling Technique (Step 3)

Step 3:
Index:   7   6   5   4   3   2   1   0
Value:  -8   7   1   9   2  -1   1   0

Page 75:

• Prefix sum of n numbers in O(log n) steps.

• EREW implementation.

• Applicable to any semigroup operator such as min, max, mul, etc.
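A sequential simulation of the steps above; each pass of the loop is one parallel step, in which every position i ≥ d adds in the value d positions back, with d doubling each step:

```python
# Recursive doubling prefix sum: O(log n) parallel steps, simulated here.
def prefix_sums(values):
    x = list(values)
    d = 1
    while d < len(x):
        # One parallel step: every position i >= d adds the value d slots back.
        x = [x[i] + x[i - d] if i >= d else x[i] for i in range(len(x))]
        d *= 2
    return x

a = [0, 1, -2, 3, 7, -8, 6, -15]    # the slide's numbers, index 0..7
print(prefix_sums(a))               # [0, 1, -1, 2, 9, 1, 7, -8]
```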

Page 76:

CRCW Algorithms

• What can you do in constant time?
• Find the OR of n bits (b1, b2, …, bn) using n processors:
  – Each processor Pi reads bi in parallel.
  – If (bi = 1) then C = 1;
• All processors that write to C write the same value, so the concurrent write is legal.

Page 77:

Optimality of Parallel Algorithms

• A parallel algorithm is optimal iff
  (time taken by the parallel algorithm) × (number of processors used)
  = time taken by the best known sequential algorithm.

• Prefix sum of n numbers: O(n) sequential, but the algorithm above has an O(n log n) processor–time product – not optimal.

Page 78:

Make it Optimal

Yes – using Brent’s technique.

Use p processors and split the data set into n/p blocks. Allot one processor per block. As an example, let us take 3 processors and 6 numbers.

Blocks: B(n/p) … B2 B1
Data:   6 5  4 3  2 1

Page 79:

The Algorithm

Step 1: Processor i finds the prefix sum of the elements in Bi sequentially and outputs Si, the prefix at its last element – O(n/p) time.

Per-block prefix sums: 11 5 | 7 3 | 3 1
Block sums (S3, S2, S1): 11 7 3

Step 2: Find the prefix sum of (Sp, …, S1), giving (S'p, …, S'1) – O(log(n/p)) time by recursive doubling.

(S'3, S'2, S'1): 21 10 3

Step 3: Processor i, for i > 1, adds S'(i−1) to all prefixes of block Bi – O(n/p) time.

Result: 21 15 10 6 3 1 (for the data 6 5 4 3 2 1)

Total: O(n/p + log n − log p) time = O(n/p) time.
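The three steps as a small sketch (blocks here run left to right, unlike the slide’s right-to-left drawing, and Python’s accumulate stands in both for the sequential block scans and for the step-2 prefix sum that the parallel algorithm would do by recursive doubling):

```python
# Brent's technique: optimal prefix sum with p processors on n numbers.
from itertools import accumulate

def brent_prefix(values, p):
    n = len(values)
    size = n // p                                  # assume p divides n
    blocks = [values[i:i + size] for i in range(0, n, size)]
    # Step 1: each "processor" scans its own block sequentially: O(n/p).
    local = [list(accumulate(b)) for b in blocks]
    # Step 2: prefix sums of the block totals: O(log(n/p)) in parallel.
    offsets = [0] + list(accumulate(b[-1] for b in local))[:-1]
    # Step 3: each processor adds the total of all blocks before it: O(n/p).
    return [x + off for b, off in zip(local, offsets) for x in b]

print(brent_prefix([1, 2, 3, 4, 5, 6], p=3))   # [1, 3, 6, 10, 15, 21]
```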

Page 80:

Sublogarithmic Algorithms and CRCW PRAM

• Minimum of (a(1), a(2), …, a(n)) using n² processors in O(1) time:
  – C[i] = 1 for all i – uses n processors.
  – for all (i, j): if a(i) > a(j) then C[i] = 0 – uses n² processors.
  – if C[i] = 1 then B = a(i) – uses n processors.
• B now holds the minimum.
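A direct sequential simulation of this n²-processor algorithm (the doubly nested loop plays the role of the n² processors acting in one parallel step; all concurrent writes to C[i] store the same value 0, which keeps the CRCW write legal):

```python
# CRCW constant-time minimum, simulated: "knock out" every element that
# loses at least one comparison; a survivor is the minimum.
def crcw_min(a):
    n = len(a)
    C = [1] * n                       # n processors initialize C
    for i in range(n):                # n*n processors compare all pairs at once
        for j in range(n):
            if a[i] > a[j]:
                C[i] = 0              # concurrent writes, all writing 0
    return next(a[i] for i in range(n) if C[i] == 1)

print(crcw_min([5, 3, 8, 1, 9]))      # 1
```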

Page 81:

What if we have only n processors?

• Split into n^0.5 blocks, each of n^0.5 elements, and assign n^0.5 processors to each block.
• Solve recursively within each block to get n^0.5 block minima.
• Now we must find the minimum of these n^0.5 elements using n processors: do it in constant time using the previous algorithm, since n = (n^0.5)².
• T(n) = T(n^0.5) + O(1) = O(log log n).
