High Performance Computer Architecture
Technology scaling across processor generations (die area / transistor count; Mtr = millions of transistors):

Willamette Core (0.18 µm)            217 mm2 / 42 Mtr      Intel-Pentium-4 (11/2000)
Northwood Core (0.13 µm)             145 mm2 / 55 Mtr      Intel-Pentium-4 (01/2002)
Dothan Core (0.09 µm)                84 mm2 / 140 Mtr      Intel-Pentium-M (05/2004)
Conroe Core (0.065 µm)               143 mm2 / 291 Mtr     Intel-Core2-Duo (07/2006)
Penryn Core (0.045 µm)               107 mm2 / 410 Mtr     Intel-Core2-Duo (01/2008)
Bloomfield Core (0.045 µm)           263 mm2 / 731 Mtr     Intel-Core-i7 (11/2008)
Fermi 512G, NVIDIA GF100 (0.040 µm)  467 mm2 / 3000 Mtr    NVIDIA (09/2009)
Sandy Bridge (0.032 µm)              216 mm2 / 995 Mtr     Intel-Core-i7-2920XM (01/2011)
BlueGene/Q – 18 cores (0.045 µm)     360 mm2 / 1470 Mtr    IBM (08/2011)
Llano 4C/400G (0.032 µm)             228 mm2 / 1450 Mtr    AMD A8-3850 (06/2011)

Roberto Giorgi, Universita' degli Studi di Siena, C216LEZ01-SL di 53
Die photos at the same scale (10 mm / 20 mm scale bars):

First Commercial Dual-Core Chip (0.18 µm)        412 mm2 / 174 Mtr    IBM Power4 (12/2001)
Haswell-EP – 18 Cores – 2SMT (0.022 µm)          661 mm2 / 5560 Mtr   Xeon E5-2600 v3 (06/2015, ≈7000$)
IBM Power8 – 12 Cores – 8SMT (0.022 µm)          362 mm2 / 2100 Mtr   IBM (12/2014)
A8X – 11 Cores (0.020 µm)                        128 mm2 / 3000 Mtr   Apple-iPad-Air2 (11/2014)
Ivy Bridge (0.022 µm, tri-gate)                  160 mm2 / 1400 Mtr   Intel-Core-i7-3770 (04/2012)
Knights Landing – 72 Cores – 4SMT (0.014 µm)     413 mm2 / 7100 Mtr   Xeon Phi (06/2015, estimated)
Fiji XT – 2048 GPU-Cores (0.028 µm)              567 mm2 / 8900 Mtr   AMD Radeon R9 Fury-X (06/2015)
Where are High Performance Computers?
Among users
Where you need to make this happen:
“I have a limited battery and need to… take a picture, share it with my friends, ...”

In the Internet Infrastructure
Where you need to connect anybody with anything

In the Datacenters
Where you need to Store and Retrieve YOUR data

Every electronic device has a computer inside.
Cars may have as many as 50+ computers
(California approved a bill for autonomous vehicles).
AMD Opteron 6200 ARCHITECTURE
AMD Opteron 6200 CORE (“Bulldozer”)
AMD Opteron 6272
HP ProLiant DL585 G7
Computer Architects
• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)
AMD Opteron 6200 CHIP
AMD Opteron 6200 characteristics
Objectives of this course
• This course constitutes a deeper study of current computers and aims to provide:
• Principles of high-performance microprocessors (superscalar, VLIW)
• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system
• Principles of Multi-Core / Multi-Processor Systems
• Tools for programming Parallel Machines
Course Administration
• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office-hours: Monday 16:30/19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2
• Adopted Textbook
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN 978-0-521-88675-8
• Other Reference Textbooks
  • J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1-55860-343-3
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0-86720-204-1
Rules for exams, dates, slides, tools
• Check out the course website:
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Computer Architecture
“The term ARCHITECTURE is used here to describe the attributes of a system as seen by the programmer*, i.e., its conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation.”
-- Gene Amdahl, IBM Journal of R&D, Apr. 1964
*programmer == system programmer (OS engineer) or the compiler
Architecture: an overloaded term
• In the strict sense: Hardware / Software Interface
• Set of instructions
• Memory management and protection
• Interruptions and exceptions (traps)
• Data formats (for example, IEEE 754 floating point)
• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture
(this is a part that Gene Amdahl had excluded)
• Specifies the functional units and connections
• Configuration of the pipeline
• Position and configuration of cache memory
• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, when referring to the HW/SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006
Levels of Computer Architecture

(Diagram, layers from top to bottom: Application Programs; Libraries; Operating System with Drivers, Memory Manager, and Scheduler; Execution Hardware; Memory Translation; System Interconnect (bus) with Controllers; I/O devices and Networking; Main Memory. The ISA is the boundary between Software and Hardware.)

Interfaces:
1: User Interface
2: API
3,7: ABI
4,5,6: internal interface of the Operating System
7,8: ISA
9: Memory architecture
10: I/O architecture
11,12: RTL architecture
13,14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level
What technology provides: Moore's Law
• “The number of TRANSISTORS doubles every 18 months” (later revised to "24 months"); this is due to:
  - higher density (transistors / area)
  - availability of bigger chips
(Chart: transistor count in Mtr vs. year.)
DATA FROM SLIDE 1

Moore's Law is purely PSYCHOLOGICAL!
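The doubling rule can be turned into a simple projection. A minimal sketch (the function name is made up; the 42 Mtr starting point is the Willamette figure from slide 1):

```python
def transistors(start_mtr, months, doubling_months=24):
    """Projected transistor count (in Mtr) after `months`,
    assuming one doubling every `doubling_months` months."""
    return start_mtr * 2 ** (months / doubling_months)

# Starting from ~42 Mtr (Pentium 4 Willamette, 11/2000),
# ten years of 24-month doublings give a factor 2^5 = 32:
print(round(transistors(42, 120)))  # 1344
```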
What the market demands: Applications
• Application Trend
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes → minis → microprocessors → handheld, embedded
• FROM little TO big memory storage (primary and secondary)
• FROM single-thread TO multiple-threads
• FROM standalone TO networked (cloud computing)
• FROM character-oriented TO multimedia (graphics and sound)
• FROM personal data TO “BIG DATA”
Main Applications
• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical
• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression
App. Trends: Multimedia, Networked, Web-servers
• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.
• Services via the Web and high-performance networks require
  • Many independent threads
  • Wide-band communication
MICROPROCESSOR ARCHITECTURE
• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
• 1970s – Serial CPUs, 1 bit for integers
• 1980s – 32-bit RISC with a pipeline
  - The simplicity of the ISA allows the integration of the entire processor on a single chip
• 1990s – bigger CPUs, superscalar
  - Also for CISC
• 2000s – Multiprocessors on a chip...
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)
EVALUATING COMPUTERS
POWER
• TOTAL POWER = DYNAMIC + STATIC (LEAKAGE)

Pdynamic = α·C·V²·f
Pstatic = V·Isub ≈ V·e^(−K·Vt/T)

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATES
• SINCE V MUST SCALE ROUGHLY WITH f, DYNAMIC POWER IS ROUGHLY PROPORTIONAL TO f³
• TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
• TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
• STATIC POWER
• BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS.
• POWER/ENERGY ARE CRITICAL PROBLEMS
• POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
  • OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE AND CORRECTNESS, AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)
• EFFECT ON THE SUPPLY OF POWER TO THE CHIP
• ENERGY (DEPENDS ON POWER AND SPEED)
  • COSTLY; GLOBAL PROBLEM
• BATTERY OPERATED DEVICES
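The 4x-cores versus 4x-clock comparison above can be checked numerically. A minimal sketch with made-up circuit values (α, C, V, f are illustrative only), assuming supply voltage scales linearly with frequency:

```python
def dynamic_power(alpha, C, V, f):
    # P_dynamic = alpha * C * V^2 * f
    return alpha * C * V**2 * f

alpha, C, V, f = 0.5, 1e-9, 1.0, 2e9   # baseline core (made-up values)
P1 = dynamic_power(alpha, C, V, f)

P_4cores = 4 * P1                                 # 4 replicated cores: 4x speedup, 4x power
P_4freq = dynamic_power(alpha, C, 4 * V, 4 * f)   # 4x clock, with V scaling with f

print(P_4cores / P1)  # 4.0
print(P_4freq / P1)   # 64.0
```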
RELIABILITY
• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C × V
  • IF C AND V DECREASE, THEN IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • DEVICE IS STILL OPERATIONAL BUT A VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER
  • DUE TO
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY A SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES
PERFORMANCE METRICS (MEASURE)
• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • “X IS N TIMES FASTER THAN Y” MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT OF X IS N TIMES HIGHER THAN THAT OF Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE → BENCHMARKING
WHICH PROGRAM TO CHOOSE?
• REAL PROGRAMS
  • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS
  • COMPUTATIONALLY INTENSE PIECES OF REAL PROGRAMS
• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL)
• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
LET Ti BE THE EXECUTION TIME OF PROGRAM i (OUT OF N PROGRAMS):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

   $\sum_i T_i / N$   OR   $\sum_i T_i \times W_i$

   THE PROBLEM HERE IS THAT THE PROGRAMS WITH THE LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS
• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (LET TR,i BE THE EXECUTION TIME ON THE REFERENCE MACHINE):

   $S_i = \frac{T_{R,i}}{T_i}$

• ARITHMETIC MEAN OF SPEEDUPS:   $S = \frac{1}{N} \sum_i S_i$

• HARMONIC MEAN:   $S = \frac{N}{\sum_i 1/S_i}$
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
• GEOMETRIC MEAN OF SPEEDUPS
  - MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Execution times and speedups computed from the arithmetic-mean times:

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8              10
Machine 2     1 sec       200 sec     100.5 sec         50.2              5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

In terms of speedup (per-program speedups and their means; Machine 2 / Machine 1 ratios in parentheses):

                  Program A   Program B   Arithmetic     Harmonic       Geometric
Wrt Reference 1
Machine 1         10          100         55             18.2           31.6
Machine 2         100         50          75 (x1.36)     66.7 (x3.66)   70.7 (x2.2)
Wrt Reference 2
Machine 1         10          10          10             10             10
Machine 2         100         5           52.5 (x5.25)   9.5 (x0.95)    22.4 (x2.2)

→ GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!!
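The reference-machine independence of the geometric mean can be verified with the execution times from the tables above; a minimal sketch:

```python
from math import prod

def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

# Execution times (sec) for programs A and B
m1   = {'A': 10,  'B': 100}
m2   = {'A': 1,   'B': 200}
ref1 = {'A': 100, 'B': 10000}
ref2 = {'A': 100, 'B': 1000}

def gm_speedup(machine, ref):
    # Geometric mean of per-program speedups wrt a reference machine
    return geo_mean([ref[p] / machine[p] for p in machine])

# Relative speed of Machine 2 over Machine 1, under each reference:
r1 = gm_speedup(m2, ref1) / gm_speedup(m1, ref1)
r2 = gm_speedup(m2, ref2) / gm_speedup(m1, ref2)
print(round(r1, 2), round(r2, 2))  # 2.24 2.24 -- same for both references
```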
GEOMETRIC MEAN: $S = \left( \prod_i S_i \right)^{1/N}$
FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs (also known as the “IRON LAW”)

Texe = IC × CPI × Tc

• IC: DEPENDS ON PROGRAM, COMPILER, AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCK CYCLES PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME
• WHEN A PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK):

Texe = (IC × Tc) / IPC
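The iron law is easy to apply directly. A minimal sketch with hypothetical numbers (instruction count, CPI, and clock period are made up for illustration):

```python
def exec_time(ic, cpi, tc):
    # Iron law: Texe = IC x CPI x Tc
    return ic * cpi * tc

# Hypothetical program: 1e9 instructions, CPI = 1.5, 1 GHz clock (Tc = 1 ns)
print(exec_time(1e9, 1.5, 1e-9))  # 1.5 (seconds)

# Superscalar formulation with IPC = 2: Texe = IC x Tc / IPC
print(1e9 * 1e-9 / 2)  # 0.5 (seconds)
```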
AMDAHL’S LAW
• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S
without E:  [ 1-F ][   F   ]
               ↓ apply enhancement
with E:     [ 1-F ][ F/S ]
$T_{exe}(\text{with } E) = T_{exe}(\text{without } E) \times \left( (1-F) + \frac{F}{S} \right)$

$Speedup(E) = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1-F) + \frac{F}{S}}$
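Amdahl's law as a one-liner, with illustrative F and S values; as S grows, the speedup saturates at 1/(1−F):

```python
def amdahl_speedup(F, S):
    # A fraction F of the task is accelerated by a factor S
    return 1 / ((1 - F) + F / S)

print(amdahl_speedup(0.5, 4))              # 1.6
# Even with an enormous S, the speedup stays below 1/(1-F) = 2 for F = 0.5:
print(round(amdahl_speedup(0.5, 1e9), 6))  # 2.0
```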
LESSONS FROM AMDAHL’S LAW
1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED
• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
  • THE DIFFERENCE BETWEEN SPEEDUP(k+1) AND SPEEDUP(k) GETS SMALLER AND SMALLER AS S GOES FROM k TO k+1
2) OPTIMIZE THE COMMON CASE
  • EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)
$Speedup(E) < \frac{1}{1-F}$

(Chart, F = 0.5: Amdahl's Law, Amdahl's maximum, Remaining Speedup, and Marginal Speedup curves)
PARALLEL SPEEDUP
• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??
OVERALL NOT VERY HOPEFUL
$S_P = \frac{T_1}{T_P} = \frac{1}{(1-F) + \frac{F}{P}} = \frac{P}{F + P(1-F)} < \frac{1}{1-F}$

(Chart, F = 0.95: Amdahl's Law “mortar shot” curve, Amdahl's maximum, and Ideal speedup)
GUSTAFSON’S LAW
• REDEFINE SPEEDUP
  • THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• START WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

  $T_P = s + p$

  • s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• THE EXECUTION TIME ON ONE PROCESSOR IS

  $T_1 = s + pP$

• LET F = p/(s+p). THEN $S_P = \frac{s + pP}{s + p} = 1 - F + FP = 1 + F(P-1)$
Gustafson observes that, even though a single fixed-size program completes much faster only when its parallel portion dominates (Amdahl), the same algorithm on a scaled workload completes faster and faster as we add processors (P), compared to a purely sequential execution that repeats the parallel portion (p) P times.
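The contrast between the two laws can be made concrete. A minimal sketch comparing fixed-size (Amdahl) and scaled (Gustafson) speedup for an illustrative F = 0.95 and P = 64:

```python
def amdahl(F, P):
    # Fixed-size speedup: 1 / ((1-F) + F/P)
    return 1 / ((1 - F) + F / P)

def gustafson(F, P):
    # Scaled speedup: SP = 1 + F * (P - 1)
    return 1 + F * (P - 1)

F, P = 0.95, 64
print(round(amdahl(F, P), 2))     # 15.42 -- saturates below 1/(1-F) = 20
print(round(gustafson(F, P), 2))  # 60.85 -- keeps growing with P
```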
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)
PIPELINING
Pipelining
• Pipelining principles
• Simple Pipeline
• Structural Hazards
• Data Hazards
• Control Hazards
Pipelining principles
• Let T be the time to execute an instruction
• Without pipelining
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput_pipe = n / T
  • Speedup = Throughput_pipe / Throughput_seq = n
(Diagram: an instruction of duration T split into n equal stages; successive instructions overlap, each shifted by one stage.)

The (ideal) speedup obtainable from an ideal pipeline is equal to n.
• Consider instructions composed of n phases of equal duration
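Under these ideal assumptions the throughput ratio reduces to n. A minimal sketch (the T and n values are illustrative):

```python
def pipeline_speedup(T, n):
    # Ideal n-stage pipeline: latency stays T, but one instruction
    # completes every T/n once the pipeline is full.
    throughput_seq = 1 / T
    throughput_pipe = n / T
    return throughput_pipe / throughput_seq

print(pipeline_speedup(5e-9, 5))  # 5.0
```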
Implementation of a Simple Pipeline
• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

(Diagram: F D X M W stages separated by inter-stage latches, all driven by a common clock.)
5-STAGE PIPELINE
INSTRUCTIONS GO THROUGH EVERY STAGE IN PROGRAM ORDER, EVEN IF THEY DON'T USE THE STAGE

• NOTE: CONTROL IMPLEMENTATION
  • THE INSTRUCTION CARRIES ITS CONTROL SIGNALS
  • THIS IS A GENERAL APPROACH: “THE INSTRUCTION CARRIES ITS BAGGAGE”
Notation
5-stage pipeline:

       1  2  3  4  5  6  7  8  9
i      F  D  X  M  W
i+1       F  D  X  M  W
i+2          F  D  X  M  W
i+3             F  D  X  M  W
i+4                F  D  X  M  W

F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write back
Pipeline Hazards
• Conditions that lead to a malfunction if certain countermeasures are not taken
1) Structural Hazards
  • Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)
2) Data Hazards
  • Two instructions use the same data: accesses must happen in the order defined by the programmer, even if execution overlaps parts of the instructions (see RAW, WAW, WAR)
3) Control Hazards
  • An instruction (branch, jump, call) determines which instructions are executed next, but the pipeline has already fetched the instructions that sequentially follow the branch, even if the branch is taken
1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle
Example:
• A load/store accesses the same memory that is used by the instruction fetch (single unified memory)
i F D X M W <-- load instruction
i+1 F D X M W
i+2 F D X M W
i+3 * F D X M W <-- i-fetch stalls
i+4 F D X M . . .
Resolving structural hazards
• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces the performance
  - Used for some rare events
• Pipeline the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g. RAM)
• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap (or indivisible) resources

(Diagram: De-mux / Mux around the replicated resource)
Guidelines to reduce the structural hazards
• The structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separated Instruction memory and Data memory)
  • In the same pipeline cycle
    - (e.g. I-Fetch in stage F, R/W in stage M)
  • For a single cycle
    - (e.g. HIT in the data or instruction cache)
• Many RISC processor ISAs were designed with this in mind
• Example of problematic situation:
  • MISS in cache: pipeline stalls
2) Data Hazards
• Two instructions use the same data: this must happen in the order indicated by the programmer, even if execution overlaps parts of the instructions. Example:

R1 <- R2 + R3    <- writes R1
R2 <- R1 - R7    <- reads R1
R1 <- R5 OR R6   <- writes R1
Data Hazards -- examples
i add r1, r2, r3 F D X M W
i+1 sub r2, r1, r7 F D X M W
r1 ?? → Read-After-Write (RAW) Hazard

r1 ?? → Write-After-Read (WAR) Hazard
i+1 sub r2, r1, r7 F * * * * * D X M W
i+2 or r1, r5, r6 F D X M W
Writing r1
Reading r1
Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction
i add r1, r2, r3 F * * * D X M W
i+1 sub r2, r1, r7 F D X M W
i+2 or r1, r5, r6 F D X M W
Writing r1
Writing r1
r1 ?? → Write-After-Write (WAW) Hazard
Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction
Dependency and Hazard
Dependency: a situation in the code that can potentially create hazards

• Read-After-Write (RAW, true dependence)
  • There is a real “data exchange” from one instruction to another
• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Read-After-Read (RAR)
  • Does not cause problems

HAZARDS
• Dependencies translate into hazards depending on the hardware
True Dependence and MIPS Data Hazards
• Read After Write (RAW) The instruction J tries to read an operand before the instruction I writes it
• Caused by a “dependence” (in the terminology of compiler theory) called “true dependence”
  • The hazard results from a real need of data communication
• In the MIPS processor a true dependence normally generates a hazard

I: add r1,r2,r3
J: sub r4,r1,r3
Anti-Dependence and MIPS Data Hazards
• Write After Read (WAR) The instruction J tries to write an operand before the instruction I reads it
• Also called “anti-dependence” (in the terminology of compiler theory)
• It results from having reused the name “r1”, while another register could easily have been used (*)
• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and
• the reads from registers occur in stage 2 (D) and
• all writes are always in stage 5 (W)
I: sub r4,r1,r3
J: add r1,r2,r3

(*) This, however, is not always possible for the software; see Lesson 2
Output Dependence and MIPS Data Hazards
• Write After Write (WAW) The instruction J tries to write an operand before the instruction I writes it
• Also called “output dependence” (in the terminology of compiler theory)
• Also in this case, it results from having reused the name “r1”, while another register could easily have been used (*)
• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and
• all writes are always in stage 5 (W)
I: mul r1,r4,r3
J: add r1,r2,r3

(*) This, however, is not always possible for the software; see Lesson 2
Simple resolution of the RAW hazard
• The hardware detects the RAW hazard, and then...
  • Generates a stall to allow the “producer” instruction to finish

R1<-R2+R3   F D X M W
R2<-R1-R7   F * * D X M W
+ Cost-effective and simple
- Reduces the performance
NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)
Implementation: stall control network
• Add latches to remember the RS1/RS2/RD register identifiers at each stage
• The stall is detected by making the following comparison:
  if (RS1(D)==RD(X) || RS1(D)==RD(M)) then STALL (generates a stall in F)
• Similarly for RS2
(Diagram: Register file → operand latches A/B → Execution unit → D-Cache; the RS1/RD register identifiers latched at stages D, X, and M feed the Stall Control, which generates the stall signal.)
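The comparison above can be sketched in software. A toy model (an illustration, not the actual hardware), where the stage latches hold register identifiers and None marks a bubble:

```python
def must_stall(d_rs1, d_rs2, x_rd, m_rd):
    """Stall the instruction in D when one of its source registers
    matches the destination of the instruction in X or in M."""
    pending = {r for r in (x_rd, m_rd) if r is not None}
    return d_rs1 in pending or d_rs2 in pending

# add r1,r2,r3 currently in X; sub r2,r1,r7 in D -> RAW on r1 -> stall
print(must_stall('r1', 'r7', x_rd='r1', m_rd=None))  # True
print(must_stall('r5', 'r6', x_rd='r1', m_rd=None))  # False
```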
Inserting Stalls (detail)
• Relative to the instruction that creates the stall, it is necessary:
  • On the previous stages: to hold all the inter-stage latches
  • On the next stages: to turn off the valid bit associated with the inter-stage latch, so that the “bubble” can proceed through the pipeline without creating problems
(Diagram: the STALL SIGNAL holds the inter-stage latches of the previous stages via the flip-flop HOLD signal and sets Valid Bit = 0 into the next stage, while later stages keep Valid Bit = 1.)
Reduction of the RAW stalls
• Bypass/Forward/Short-Circuit network
• Idea: use the data before it is written in the registers
+ Reduces (and potentially avoids) stalls
- Additional Complexity
(Diagram: IF ID EX ME WB pipeline with bypass paths feeding back to EX.)
Bypass
• Additional Hardware• Multiplexer to select the input value to the ALU:
or from Registers or from Bypass network
• Hazard detection logic (called interlock) that controls these multiplexers
(Diagram: Register file → operand latches → MUXes → Execution Unit (ALU) → Result latch; the bypass control drives the MUXes to choose between register operands and bypassed results.)
Network to detect the possibility of Bypass
• Add latches to remember RS1/RS2/RD register names at each stage (similar to the network for hazard detection)
• E.g. on the input A of the ALU, the network will act like this:
  if RS1(D)==RD(X) then select ALU-OUT(X)
  else if RS1(D)==RD(M) then select D-CACHE-OUT(M)
  else select A (register-file value)

…similarly on the B input…
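The same selection rule, written as a toy function (names like alu_out_x are illustrative placeholders for the latched values in the figure, not the course's hardware signals):

```python
def alu_input_a(d_rs1, x_rd, m_rd, reg_value, alu_out_x, dcache_out_m):
    # Priority follows the rule above: forward from stage X first
    # (youngest producer), then from stage M, else use the register file.
    if d_rs1 == x_rd:
        return alu_out_x
    if d_rs1 == m_rd:
        return dcache_out_m
    return reg_value

# RAW with the instruction in X: the ALU output is forwarded
print(alu_input_a('r1', 'r1', 'r5', reg_value=0, alu_out_x=42, dcache_out_m=7))  # 42
# No match: the register-file value is used
print(alu_input_a('r2', 'r1', 'r5', reg_value=0, alu_out_x=42, dcache_out_m=7))  # 0
```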
(Diagram: Register file and D-Cache feed MUXes on the A and B inputs of the Execution Unit (ALU); the latched RS1/RD identifiers drive the Bypass Control, which selects ALU-OUT(X) or D-CACHE-OUT(M) instead of the register operands.)
Interaction between control networks Stall/Bypass
• The stall logic is aware of the presence of the bypass logic
• The bypass logic is activated independently of the stall conditions
(Diagram: combined datapath with Register file, MUXes, Execution unit, and D-Cache; the latched RS1/RD identifiers feed both the Stall Control and the Bypass Control.)
Pipeline Scheduling
• Scheduling of instructions at compile time (reorder instructions to reduce stalls caused by load instructions):
BEFORE:
a = b + c;   R1 <- mem(b)
             R2 <- mem(c)
             stall
             R3 <- R1 + R2
             mem(a) <- R3
d = e - f;   R4 <- mem(e)
             R5 <- mem(f)
             stall
             R6 <- R4 - R5
             mem(d) <- R6

AFTER:
             R1 <- mem(b)
             R2 <- mem(c)
             R4 <- mem(e)
             R3 <- R1 + R2
             R5 <- mem(f)
             mem(a) <- R3
             R6 <- R4 - R5
             mem(d) <- R6