High Performance Computer Architecture
Technology scaling across processor generations (die area / transistor count; Mtr = millions of transistors):

Willamette Core (0.18 µm)            217 mm2 / 42 Mtr      Intel-Pentium-4 (11/2000)
Northwood Core (0.13 µm)             145 mm2 / 55 Mtr      Intel-Pentium-4 (01/2002)
Dothan Core (0.09 µm)                84 mm2 / 140 Mtr      Intel-Pentium-M (05/2004)
Conroe Core (0.065 µm)               143 mm2 / 291 Mtr     Intel-Core2-Duo (07/2006)
Penryn Core (0.045 µm)               107 mm2 / 410 Mtr     Intel-Core2-Duo (01/2008)
Bloomfield Core (0.045 µm)           263 mm2 / 731 Mtr     Intel-Core-i7 (11/2008)
Fermi 512G, NVIDIA GF100 (0.040 µm)  467 mm2 / 3000 Mtr    NVIDIA (09/2009)
Sandy Bridge (0.032 µm)              216 mm2 / 995 Mtr     Intel-Core-i7-2920XM (01/2011)
BlueGene/Q – 18 cores (0.045 µm)     360 mm2 / 1470 Mtr    IBM (08/2011)
Llano 4C/400G (0.032 µm)             228 mm2 / 1450 Mtr    AMD A8-3850 (06/2011)

Roberto Giorgi, Universita' degli Studi di Siena, C216LEZ01-SL di 53
Die photos at the same scale (10 mm / 20 mm scale bars):

First Commercial Dual-Core Chip (0.18 µm)        412 mm2 / 174 Mtr    IBM Power4 (12/2001)
Haswell-EP – 18 Cores – 2SMT (0.022 µm)          661 mm2 / 5560 Mtr   Xeon E5-2600 v3 (06/2015, ≈7000$)
IBM Power8 – 12 Cores – 8SMT (0.022 µm)          362 mm2 / 2100 Mtr   IBM (12/2014)
A8X – 11 Cores (0.020 µm)                        128 mm2 / 3000 Mtr   Apple-iPad-Air2 (11/2014)
Ivy Bridge (0.022 µm, tri-gate)                  160 mm2 / 1400 Mtr   Intel-Core-i7-3770 (04/2012)
Knights Landing – 72 Cores – 4SMT (0.014 µm)     413 mm2 / 7100 Mtr   Xeon Phi (06/2015, estimated)
Fiji XT – 2048 GPU-Cores (0.028 µm)              567 mm2 / 8900 Mtr   AMD Radeon R9 Fury-X (06/2015)
Where are High Performance Computers?
Among users
Where you need to make this happen:
“I have a limited battery and need to… take a picture, share it with my friends, ...”

In the Internet Infrastructure
Where you need to connect anybody with anything

In the Datacenters
Where you need to Store and Retrieve YOUR data

Every electronic device has a computer inside.
Cars may have as many as 50+ computers
(California approved a bill for autonomous vehicles).
AMD Opteron 6200 ARCHITECTURE
AMD Opteron 6200 CORE (“Bulldozer”)
AMD Opteron 6272
HP ProLiant DL585 G7
Computer Architects
• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)
AMD Opteron 6200 CHIP
AMD Opteron 6200 characteristics
Objectives of this course
• This course constitutes a deeper study of current computers and aims to provide:
• Principles of high-performance microprocessors (superscalar, VLIW)
• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system
• Principles of Multi-Core / Multi-Processor Systems
• Tools for programming Parallel Machines
Course Administration
• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office-hours: Monday 16:30/19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2
• Adopted Textbook
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN 978-0-521-88675-8
• Other Reference Textbooks
  • J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1-55860-343-3
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0-86720-204-1
Rules for exams, dates, slides, tools
• Check out the course website:
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Computer Architecture
“The term ARCHITECTURE is used here to describe the attributes of a system as seen by the programmer*, i.e., its conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation.”
-- Gene Amdahl, IBM Journal of R&D, Apr. 1964
*programmer == system programmer (OS engineer) or the compiler
Architecture: an overloaded term
• In the strict sense: Hardware / Software Interface
• Set of instructions
• Memory management and protection
• Interruptions and exceptions (traps)
• Data formats (for example, IEEE 754 floating point)
• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture
(this is a part that Gene Amdahl had excluded)
• Specifies the functional units and connections
• Configuration of the pipeline
• Position and configuration of cache memory
• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, when referring to the HW/SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006
Levels of Computer Architecture

(Diagram, layers from top to bottom: Application Programs; Libraries; Operating System with Drivers, Memory Manager, and Scheduler; Execution Hardware; Memory Translation; System Interconnect (bus) with Controllers; I/O devices and Networking; Main Memory. The ISA is the boundary between Software and Hardware.)

Interfaces:
1: User Interface
2: API
3,7: ABI
4,5,6: internal interface of the Operating System
7,8: ISA
9: Memory architecture
10: I/O architecture
11,12: RTL architecture
13,14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level
What technology provides: Moore's Law
• “The number of TRANSISTORS doubles every 18 months” (later revised to "24 months"); this is due to:
  - higher density (transistors / area)
  - availability of bigger chips
(Chart: transistor count in Mtr vs. year.)
DATA FROM SLIDE 1

Moore's Law is purely PSYCHOLOGICAL!
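The doubling rule can be turned into a simple projection. A minimal sketch (the function name is made up; the 42 Mtr starting point is the Willamette figure from slide 1):

```python
def transistors(start_mtr, months, doubling_months=24):
    """Projected transistor count (in Mtr) after `months`,
    assuming one doubling every `doubling_months` months."""
    return start_mtr * 2 ** (months / doubling_months)

# Starting from ~42 Mtr (Pentium 4 Willamette, 11/2000),
# ten years of 24-month doublings give a factor 2^5 = 32:
print(round(transistors(42, 120)))  # 1344
```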
What the market demands: Applications
• Application Trend
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes → minis → microprocessors → handheld, embedded
• FROM little TO big memory storage (primary and secondary)
• FROM single-thread TO multiple-threads
• FROM standalone TO networked (cloud computing)
• FROM character-oriented TO multimedia (graphics and sound)
• FROM personal data TO “BIG DATA”
Main Applications
• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical
• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression
App. Trends: Multimedia, Networked, Web-servers
• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.
• Services via the Web and high-performance networks require
  • Many independent threads
  • Wide-band communication
MICROPROCESSOR ARCHITECTURE
• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
• 1970s – Serial CPUs, 1 bit for integers
• 1980s – 32-bit RISC with a pipeline
  - The simplicity of the ISA allows the integration of the entire processor on a single chip
• 1990s – bigger CPUs, superscalar
  - Also for CISC
• 2000s – Multiprocessors on a chip...
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)
EVALUATING COMPUTERS
POWER
• TOTAL POWER = DYNAMIC + STATIC (LEAKAGE)

Pdynamic = α·C·V²·f
Pstatic = V·Isub ≈ V·e^(−K·Vt/T)

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATES
• SINCE V MUST SCALE ROUGHLY WITH f, DYNAMIC POWER IS ROUGHLY PROPORTIONAL TO f³
• TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
• TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
• STATIC POWER
• BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS.
• POWER/ENERGY ARE CRITICAL PROBLEMS
• POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
  • OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE AND CORRECTNESS, AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)
• EFFECT ON THE SUPPLY OF POWER TO THE CHIP
• ENERGY (DEPENDS ON POWER AND SPEED)
  • COSTLY; GLOBAL PROBLEM
• BATTERY OPERATED DEVICES
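The 4x-cores versus 4x-clock comparison above can be checked numerically. A minimal sketch with made-up circuit values (α, C, V, f are illustrative only), assuming supply voltage scales linearly with frequency:

```python
def dynamic_power(alpha, C, V, f):
    # P_dynamic = alpha * C * V^2 * f
    return alpha * C * V**2 * f

alpha, C, V, f = 0.5, 1e-9, 1.0, 2e9   # baseline core (made-up values)
P1 = dynamic_power(alpha, C, V, f)

P_4cores = 4 * P1                                 # 4 replicated cores: 4x speedup, 4x power
P_4freq = dynamic_power(alpha, C, 4 * V, 4 * f)   # 4x clock, with V scaling with f

print(P_4cores / P1)  # 4.0
print(P_4freq / P1)   # 64.0
```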
RELIABILITY
• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C × V
  • IF C AND V DECREASE, THEN IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • DEVICE IS STILL OPERATIONAL BUT A VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER
  • DUE TO
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY A SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES
PERFORMANCE METRICS (MEASURE)
• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • “X IS N TIMES FASTER THAN Y” MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT OF X IS N TIMES HIGHER THAN THAT OF Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE → BENCHMARKING
WHICH PROGRAM TO CHOOSE?
• REAL PROGRAMS
  • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS
  • COMPUTATIONALLY INTENSE PIECES OF REAL PROGRAMS
• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL)
• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
LET Ti BE THE EXECUTION TIME OF PROGRAM i (OUT OF N PROGRAMS):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

   $\sum_i T_i / N$   OR   $\sum_i T_i \times W_i$

   THE PROBLEM HERE IS THAT THE PROGRAMS WITH THE LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS
• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (LET TR,i BE THE EXECUTION TIME ON THE REFERENCE MACHINE):

   $S_i = \frac{T_{R,i}}{T_i}$

• ARITHMETIC MEAN OF SPEEDUPS:   $S = \frac{1}{N} \sum_i S_i$

• HARMONIC MEAN:   $S = \frac{N}{\sum_i 1/S_i}$
REPORTING PERFORMANCE FOR A SET OF PROGRAMS
• GEOMETRIC MEAN OF SPEEDUPS
  - MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Execution times and speedups computed from the arithmetic-mean times:

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8              10
Machine 2     1 sec       200 sec     100.5 sec         50.2              5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

In terms of speedup (per-program speedups and their means; Machine 2 / Machine 1 ratios in parentheses):

                  Program A   Program B   Arithmetic     Harmonic       Geometric
Wrt Reference 1
Machine 1         10          100         55             18.2           31.6
Machine 2         100         50          75 (x1.36)     66.7 (x3.66)   70.7 (x2.2)
Wrt Reference 2
Machine 1         10          10          10             10             10
Machine 2         100         5           52.5 (x5.25)   9.5 (x0.95)    22.4 (x2.2)

→ GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!!
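The reference-machine independence of the geometric mean can be verified with the execution times from the tables above; a minimal sketch:

```python
from math import prod

def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

# Execution times (sec) for programs A and B
m1   = {'A': 10,  'B': 100}
m2   = {'A': 1,   'B': 200}
ref1 = {'A': 100, 'B': 10000}
ref2 = {'A': 100, 'B': 1000}

def gm_speedup(machine, ref):
    # Geometric mean of per-program speedups wrt a reference machine
    return geo_mean([ref[p] / machine[p] for p in machine])

# Relative speed of Machine 2 over Machine 1, under each reference:
r1 = gm_speedup(m2, ref1) / gm_speedup(m1, ref1)
r2 = gm_speedup(m2, ref2) / gm_speedup(m1, ref2)
print(round(r1, 2), round(r2, 2))  # 2.24 2.24 -- same for both references
```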
GEOMETRIC MEAN: $S = \left( \prod_i S_i \right)^{1/N}$
FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs (also known as the “IRON LAW”)

Texe = IC × CPI × Tc

• IC: DEPENDS ON PROGRAM, COMPILER, AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCK CYCLES PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME
• WHEN A PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK):

Texe = (IC × Tc) / IPC
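The iron law is easy to apply directly. A minimal sketch with hypothetical numbers (instruction count, CPI, and clock period are made up for illustration):

```python
def exec_time(ic, cpi, tc):
    # Iron law: Texe = IC x CPI x Tc
    return ic * cpi * tc

# Hypothetical program: 1e9 instructions, CPI = 1.5, 1 GHz clock (Tc = 1 ns)
print(exec_time(1e9, 1.5, 1e-9))  # 1.5 (seconds)

# Superscalar formulation with IPC = 2: Texe = IC x Tc / IPC
print(1e9 * 1e-9 / 2)  # 0.5 (seconds)
```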
AMDAHL’S LAW
• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S
without E:  [ 1-F ][   F   ]
               ↓ apply enhancement
with E:     [ 1-F ][ F/S ]
$T_{exe}(\text{with } E) = T_{exe}(\text{without } E) \times \left( (1-F) + \frac{F}{S} \right)$

$Speedup(E) = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1-F) + \frac{F}{S}}$
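Amdahl's law as a one-liner, with illustrative F and S values; as S grows, the speedup saturates at 1/(1−F):

```python
def amdahl_speedup(F, S):
    # A fraction F of the task is accelerated by a factor S
    return 1 / ((1 - F) + F / S)

print(amdahl_speedup(0.5, 4))              # 1.6
# Even with an enormous S, the speedup stays below 1/(1-F) = 2 for F = 0.5:
print(round(amdahl_speedup(0.5, 1e9), 6))  # 2.0
```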
LESSONS FROM AMDAHL’S LAW
1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED
• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
  • THE DIFFERENCE BETWEEN SPEEDUP(k+1) AND SPEEDUP(k) GETS SMALLER AND SMALLER AS S GOES FROM k TO k+1
2) OPTIMIZE THE COMMON CASE
  • EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)
$Speedup(E) < \frac{1}{1-F}$

(Chart, F = 0.5: Amdahl's Law, Amdahl's maximum, Remaining Speedup, and Marginal Speedup curves)
PARALLEL SPEEDUP
• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??
OVERALL NOT VERY HOPEFUL
$S_P = \frac{T_1}{T_P} = \frac{1}{(1-F) + \frac{F}{P}} = \frac{P}{F + P(1-F)} < \frac{1}{1-F}$

(Chart, F = 0.95: Amdahl's Law “mortar shot” curve, Amdahl's maximum, and Ideal speedup)
GUSTAFSON’S LAW
• REDEFINE SPEEDUP
  • THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• START WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

  $T_P = s + p$

  • s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• THE EXECUTION TIME ON ONE PROCESSOR IS

  $T_1 = s + pP$

• LET F = p/(s+p). THEN $S_P = \frac{s + pP}{s + p} = 1 - F + FP = 1 + F(P-1)$
Gustafson observes that, even though a single fixed-size program completes much faster only when its parallel portion dominates (Amdahl), the same algorithm on a scaled workload completes faster and faster as we add processors (P), compared to a purely sequential execution that repeats the parallel portion (p) P times.
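The contrast between the two laws can be made concrete. A minimal sketch comparing fixed-size (Amdahl) and scaled (Gustafson) speedup for an illustrative F = 0.95 and P = 64:

```python
def amdahl(F, P):
    # Fixed-size speedup: 1 / ((1-F) + F/P)
    return 1 / ((1 - F) + F / P)

def gustafson(F, P):
    # Scaled speedup: SP = 1 + F * (P - 1)
    return 1 + F * (P - 1)

F, P = 0.95, 64
print(round(amdahl(F, P), 2))     # 15.42 -- saturates below 1/(1-F) = 20
print(round(gustafson(F, P), 2))  # 60.85 -- keeps growing with P
```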
Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)
PIPELINING
Pipelining
• Pipelining principles
• Simple Pipeline
• Structural Hazards
• Data Hazards
• Control Hazards
Pipelining principles
• Let T be the time to execute an instruction
• Without pipelining
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput_pipe = n / T
  • Speedup = Throughput_pipe / Throughput_seq = n
(Diagram: an instruction of duration T split into n equal stages; successive instructions overlap, each shifted by one stage.)

The (ideal) speedup obtainable from an ideal pipeline is equal to n.
• Consider instructions composed of n phases of equal duration
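Under these ideal assumptions the throughput ratio reduces to n. A minimal sketch (the T and n values are illustrative):

```python
def pipeline_speedup(T, n):
    # Ideal n-stage pipeline: latency stays T, but one instruction
    # completes every T/n once the pipeline is full.
    throughput_seq = 1 / T
    throughput_pipe = n / T
    return throughput_pipe / throughput_seq

print(pipeline_speedup(5e-9, 5))  # 5.0
```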
Implementation of a Simple Pipeline
• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

(Diagram: F D X M W stages separated by inter-stage latches, all driven by a common clock.)
5-STAGE PIPELINE
INSTRUCTIONS GO THROUGH EVERY STAGE IN PROGRAM ORDER, EVEN IF THEY DON'T USE THE STAGE

• NOTE: CONTROL IMPLEMENTATION
  • THE INSTRUCTION CARRIES ITS CONTROL SIGNALS
  • THIS IS A GENERAL APPROACH: “THE INSTRUCTION CARRIES ITS BAGGAGE”
Notation
5-stage pipeline:

       1  2  3  4  5  6  7  8  9
i      F  D  X  M  W
i+1       F  D  X  M  W
i+2          F  D  X  M  W
i+3             F  D  X  M  W
i+4                F  D  X  M  W

F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write back
Pipeline Hazards
• Conditions that lead to a malfunction if certain countermeasures are not taken
1) Structural Hazards
  • Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)
2) Data Hazards
  • Two instructions use the same data: accesses must happen in the order defined by the programmer, even if execution overlaps parts of the instructions (see RAW, WAW, WAR)
3) Control Hazards
  • An instruction (branch, jump, call) determines which instructions are executed next, but the pipeline has already fetched the instructions that sequentially follow the branch, even if the branch is taken
1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle
Example:
• A load/store accesses the same memory that is used by the instruction fetch (single unified memory)
i F D X M W <-- load instruction
i+1 F D X M W
i+2 F D X M W
i+3 * F D X M W <-- i-fetch stalls
i+4 F D X M . . .
Resolving structural hazards
• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces the performance
  - Used for some rare events
• Pipeline the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g. RAM)
• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap (or indivisible) resources

(Diagram: De-mux / Mux around the replicated resource)
Guidelines to reduce the structural hazards
• The structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separated Instruction memory and Data memory)
  • In the same pipeline cycle
    - (e.g. I-Fetch in stage F, R/W in stage M)
  • For a single cycle
    - (e.g. HIT in the data or instruction cache)
• Many RISC processor ISAs were designed with this in mind
• Example of problematic situation:
  • MISS in cache: pipeline stalls
2) Data Hazards
• Two instructions use the same data: this must happen in the order indicated by the programmer, even if execution overlaps parts of the instructions. Example:

R1 <- R2 + R3    <- writes R1
R2 <- R1 - R7    <- reads R1
R1 <- R5 OR R6   <- writes R1
Data Hazards -- examples
i add r1, r2, r3 F D X M W
i+1 sub r2, r1, r7 F D X M W
r1 ?? → Read-After-Write (RAW) Hazard

r1 ?? → Write-After-Read (WAR) Hazard
i+1 sub r2, r1, r7 F * * * * * D X M W
i+2 or r1, r5, r6 F D X M W
Writing r1
Reading r1
Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction
i add r1, r2, r3 F * * * D X M W
i+1 sub r2, r1, r7 F D X M W
i+2 or r1, r5, r6 F D X M W
Writing r1
Writing r1
r1 ?? → Write-After-Write (WAW) Hazard
Note: PURELY HYPOTHETICAL SITUATION → it cannot happen in this pipeline, by construction
Dependency and Hazard
Dependency: a situation in the code that can potentially create hazards

• Read-After-Write (RAW, true dependence)
  • There is a real “data exchange” from one instruction to another
• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Read-After-Read (RAR)
  • Does not cause problems

HAZARDS
• Dependencies translate into hazards depending on the hardware
True Dependence and MIPS Data Hazards
• Read After Write (RAW) The instruction J tries to read an operand before the instruction I writes it
• Caused by a “dependence” (in the terminology of compiler theory) called “true dependence”
  • The hazard results from a real need of data communication
• In the MIPS processor a true dependence normally generates a hazard

I: add r1,r2,r3
J: sub r4,r1,r3
Anti-Dependence and MIPS Data Hazards
• Write After Read (WAR) The instruction J tries to write an operand before the instruction I reads it
• Also called “anti-dependence” (in the terminology of compiler theory)
• It results from having reused the name “r1”, while another register could easily have been used (*)
• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and
• the reads from registers occur in stage 2 (D) and
• all writes are always in stage 5 (W)
I: sub r4,r1,r3
J: add r1,r2,r3

(*) This, however, is not always possible for the software; see Lesson 2
Output Dependence and MIPS Data Hazards
• Write After Write (WAW) The instruction J tries to write an operand before the instruction I writes it
• Also called “output dependence” (in the terminology of compiler theory)
• Also in this case, it results from having reused the name “r1”, while another register could easily have been used (*)
• Does not conflict in case of a 5-stage pipeline MIPS because:• All instructions take 5 stages, and
• all writes are always in stage 5 (W)
I: mul r1,r4,r3
J: add r1,r2,r3

(*) This, however, is not always possible for the software; see Lesson 2
Simple resolution of the RAW hazard
• The hardware detects the RAW hazard, and then...
  • Generates a stall to allow the “producer” instruction to finish

R1<-R2+R3   F D X M W
R2<-R1-R7   F * * D X M W
+ Cost-effective and simple
- Reduces the performance
NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)
Implementation: stall control network
• Add latches to remember the RS1/RS2/RD register identifiers at each stage
• The stall is detected by making the following comparison:
  if (RS1(D)==RD(X) || RS1(D)==RD(M)) then STALL (generates a stall in F)
• Similarly for RS2
(Diagram: Register file → operand latches A/B → Execution unit → D-Cache; the RS1/RD register identifiers latched at stages D, X, and M feed the Stall Control, which generates the stall signal.)
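The comparison above can be sketched in software. A toy model (an illustration, not the actual hardware), where the stage latches hold register identifiers and None marks a bubble:

```python
def must_stall(d_rs1, d_rs2, x_rd, m_rd):
    """Stall the instruction in D when one of its source registers
    matches the destination of the instruction in X or in M."""
    pending = {r for r in (x_rd, m_rd) if r is not None}
    return d_rs1 in pending or d_rs2 in pending

# add r1,r2,r3 currently in X; sub r2,r1,r7 in D -> RAW on r1 -> stall
print(must_stall('r1', 'r7', x_rd='r1', m_rd=None))  # True
print(must_stall('r5', 'r6', x_rd='r1', m_rd=None))  # False
```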
Inserting Stalls (detail)
• Relative to the instruction that creates the stall, it is necessary:
  • On the previous stages: to hold all the inter-stage latches
  • On the next stages: to turn off the valid bit associated with the inter-stage latch, so that the “bubble” can proceed through the pipeline without creating problems
(Diagram: the STALL SIGNAL holds the inter-stage latches of the previous stages via the flip-flop HOLD signal and sets Valid Bit = 0 into the next stage, while later stages keep Valid Bit = 1.)
Reduction of the RAW stalls
• Bypass/Forward/Short-Circuit network
• Idea: use the data before it is written in the registers
+ Reduces (and potentially avoids) stalls
- Additional Complexity
(Diagram: IF ID EX ME WB pipeline with bypass paths feeding back to EX.)
Bypass
• Additional Hardware• Multiplexer to select the input value to the ALU:
or from Registers or from Bypass network
• Hazard detection logic (called interlock) that controls these multiplexers
(Diagram: Register file → operand latches → MUXes → Execution Unit (ALU) → Result latch; the bypass control drives the MUXes to choose between register operands and bypassed results.)
Network to detect the possibility of Bypass
• Add latches to remember RS1/RS2/RD register names at each stage (similar to the network for hazard detection)
• E.g. on the input A of the ALU, the network will act like this:
  if RS1(D)==RD(X) then select ALU-OUT(X)
  else if RS1(D)==RD(M) then select D-CACHE-OUT(M)
  else select A (register-file value)

…similarly on the B input…
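The same selection rule, written as a toy function (names like alu_out_x are illustrative placeholders for the latched values in the figure, not the course's hardware signals):

```python
def alu_input_a(d_rs1, x_rd, m_rd, reg_value, alu_out_x, dcache_out_m):
    # Priority follows the rule above: forward from stage X first
    # (youngest producer), then from stage M, else use the register file.
    if d_rs1 == x_rd:
        return alu_out_x
    if d_rs1 == m_rd:
        return dcache_out_m
    return reg_value

# RAW with the instruction in X: the ALU output is forwarded
print(alu_input_a('r1', 'r1', 'r5', reg_value=0, alu_out_x=42, dcache_out_m=7))  # 42
# No match: the register-file value is used
print(alu_input_a('r2', 'r1', 'r5', reg_value=0, alu_out_x=42, dcache_out_m=7))  # 0
```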
(Diagram: Register file and D-Cache feed MUXes on the A and B inputs of the Execution Unit (ALU); the latched RS1/RD identifiers drive the Bypass Control, which selects ALU-OUT(X) or D-CACHE-OUT(M) instead of the register operands.)
Interaction between control networks Stall/Bypass
• The stall logic is aware of the presence of the bypass logic
• The bypass logic is activated independently of the stall conditions
(Diagram: combined datapath with Register file, MUXes, Execution unit, and D-Cache; the latched RS1/RD identifiers feed both the Stall Control and the Bypass Control.)
Pipeline Scheduling
• Scheduling of instructions at compile time (reorder instructions to reduce stalls caused by load instructions):
BEFORE:
a = b + c;   R1 <- mem(b)
             R2 <- mem(c)
             stall
             R3 <- R1 + R2
             mem(a) <- R3
d = e - f;   R4 <- mem(e)
             R5 <- mem(f)
             stall
             R6 <- R4 - R5
             mem(d) <- R6

AFTER:
             R1 <- mem(b)
             R2 <- mem(c)
             R4 <- mem(e)
             R3 <- R1 + R2
             R5 <- mem(f)
             mem(a) <- R3
             R6 <- R4 - R5
             mem(d) <- R6