Introduction to Computer Architecture ‐ II
ECE 154B, Dmitri Strukov
Computer systems overview
Outline
• Course information
• Trends
• Computing classes
• Quantitative Principles of Design
• Dependability
Course organization
• Class website: http://www.ece.ucsb.edu/~strukov/ece154bSpring2013/home.htm
• Instructor office hours: Tue, 1:00 pm – 3:00 pm
• Advait Madhavan's (TA) office hours: Mon, 1:00 pm – 3:00 pm (tentative), [email protected]
Textbook
• Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Fifth Edition, Morgan Kaufmann, 2012, ISBN: 978‐0‐12‐383872‐8
Class topics
• Computer fundamentals (historical trends, performance) – 1 week
• Memory hierarchy design – 2 weeks
• Instruction level parallelism (static and dynamic scheduling, speculation) – 2 weeks
• Data level parallelism (vector, SIMD and GPUs) – 2 weeks
• Thread level parallelism (shared‐memory architectures, synchronization and cache coherence) – 2 weeks
• Warehouse‐scale computers – 1 week
Grading
• Projects: 50%
• Midterm: 20%
• Final: 30%
• Project course work will involve program performance analysis and architectural optimizations for superscalar processors using the SimpleScalar simulation tools
• HW will be assigned each week but not graded
Course prerequisites
• ECE 154A or 154
ENIAC: Electronic Numerical Integrator And Computer, 1946
VLSI Developments

1946: ENIAC, electronic numerical integrator and computer
• Floor area: 140 m2
• Performance: multiplication of two 10‐digit numbers in 2 ms

2011: high-performance microprocessor
• Chip area: 100‐400 mm2 (for multi‐core)
• Board area: 200 cm2; improvement of ~10^4
• Performance: 64‐bit multiply in a few ns; improvement of ~10^6
Computer trends: performance of a (single) processor
(figure: single-processor performance over time; growth accelerates in the RISC era, then flattens with the move to multi-processor)
Current Trends in Architecture
• Cannot continue to leverage Instruction‐Level Parallelism (ILP)
– Single processor performance improvement ended in 2003
• New models for performance:
– Data‐level parallelism (DLP)
– Thread‐level parallelism (TLP)
– Request‐level parallelism (RLP)
• These require explicit restructuring of the application
Classes of Computers
• Personal Mobile Device (PMD)
– e.g. smart phones, tablet computers
– Emphasis on energy efficiency and real‐time
• Desktop Computing
– Emphasis on price‐performance
• Servers
– Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
– Used for "Software as a Service (SaaS)"
– Emphasis on availability and price‐performance
– Sub‐class: Supercomputers; emphasis: floating‐point performance and fast internal networks
• Embedded Computers
– Emphasis: price
Defining Computer Architecture
• "Old" view of computer architecture:
– Instruction Set Architecture (ISA) design
– i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
• "Real" computer architecture:
– Specific requirements of the target machine
– Design to maximize performance within constraints: cost, power, and availability
– Includes ISA, microarchitecture, hardware
Trends in Technology
• Integrated circuit technology
– Transistor density: 35%/year
– Die size: 10‐20%/year
– Integration overall: 40‐55%/year
• DRAM capacity: 25‐40%/year (slowing)
• Flash capacity: 50‐60%/year
– 15‐20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
– 15‐25X cheaper/bit than Flash
– 300‐500X cheaper/bit than DRAM
CMOS improvements:
• Transistor density: 4x / 3 yrs
• Die size: 10-25% / yr
PC hard drive capacity (figure)
Evolution of memory granularity (figure)
Bandwidth and Latency
• Bandwidth or throughput
– Total work done in a given time
– 10,000‐25,000X improvement for processors
– 300‐1200X improvement for memory and disks
• Latency or response time
– Time between start and completion of an event
– 30‐80X improvement for processors
– 6‐8X improvement for memory and disks
Bandwidth and Latency
CPU high, memory low ("Memory Wall")
Performance milestones:
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
• Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(figure: log-log plot of bandwidth and latency milestones)
Transistors and Wires
• Feature size
– Minimum size of a transistor or wire in the x or y dimension
– 10 microns in 1971 to 0.032 microns in 2011
– Transistor performance scales linearly
• Wire delay does not improve with feature size!
• Integration density scales quadratically
Scaling with Feature Size
• If s is the scaling factor, then density scales as s^2
• Logic gate capacitance C (traditionally dominating): ~1/s
• Capacitance of wires
– fixed length: ~does not change
– length reduced by s: ~1/s
• Resistance of wires
– fixed length: s^2
– length reduced by s: s
• Saturation current I_ON (which is the reciprocal of the effective R_ON of the gate): 1/s
• Voltage V: ~1/s (does not scale as fast anymore because of subthreshold leakage)
• Gate delay: ~CV/I_ON = 1/s
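These first-order scaling rules can be collected into a short sketch (a simplified classical-scaling model; the factor s and the example value below are illustrative assumptions, not slide data):

```python
# First-order scaling rules from the list above, as functions of the
# scaling factor s (s > 1 means a smaller feature size).
def scaled_quantities(s):
    return {
        "density": s ** 2,                # integration density scales as s^2
        "gate_capacitance": 1 / s,        # C ~ 1/s
        "wire_cap_fixed_length": 1.0,     # ~does not change
        "wire_cap_scaled_length": 1 / s,
        "wire_res_fixed_length": s ** 2,
        "wire_res_scaled_length": s,
        "saturation_current": 1 / s,      # I_ON
        "voltage": 1 / s,                 # ideal; real V scales more slowly
    }

def gate_delay(s):
    # Gate delay ~ C * V / I_ON = (1/s)(1/s)/(1/s) = 1/s
    q = scaled_quantities(s)
    return q["gate_capacitance"] * q["voltage"] / q["saturation_current"]

print(scaled_quantities(2.0)["density"], gate_delay(2.0))  # 4.0 0.5
```

Note that the gate-delay line is derived from the three quantities above it, so the 1/s result falls out automatically.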
Power and Energy
• Problem: get power in, get power out
• Thermal Design Power (TDP)
– Characterizes sustained power consumption
– Used as target for power supply and cooling system
– Lower than peak power, higher than average power consumption
• Clock rate can be reduced dynamically to limit power consumption
• Energy per task is often a better measurement
Dynamic Energy and Power
• Dynamic energy
– Transistor switch from 0 ‐> 1 or 1 ‐> 0
– ½ x Capacitive load x Voltage^2
• Dynamic power
– ½ x Capacitive load x Voltage^2 x Frequency switched
• Reducing clock rate reduces power, not energy
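The two formulas can be checked with a few lines of code (the capacitance, voltage, and frequency values are illustrative; the point is that halving the clock rate halves power but leaves energy per transition, and hence per task, unchanged):

```python
def dynamic_energy(c_load, voltage):
    """Energy per 0->1 or 1->0 transition: 1/2 * C * V^2 (joules)."""
    return 0.5 * c_load * voltage ** 2

def dynamic_power(c_load, voltage, freq):
    """Average switching power: 1/2 * C * V^2 * f (watts)."""
    return dynamic_energy(c_load, voltage) * freq

# Illustrative values: 1 nF effective switched capacitance, 1 V, 2 GHz.
p_full = dynamic_power(1e-9, 1.0, 2e9)
p_half = dynamic_power(1e-9, 1.0, 1e9)  # half the clock rate
print(p_full, p_half)  # power halves (1.0 W -> 0.5 W); energy/task does not
```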
Power
• Intel 80386 consumed ~2 W
• 3.3 GHz Intel Core i7 consumes 130 W
• Heat must be dissipated from a 1.5 x 1.5 cm chip
• This is the limit of what can be cooled by air
Power consumption (figure)
Reducing Power
• Techniques for reducing power:
– Do nothing well
– Dynamic Voltage‐Frequency Scaling
– Low power state for DRAM, disks
– Overclocking, turning off cores
Since I_ON ~ V^2 and gate delay is ~CV/I_ON, to a first approximation clock frequency (which is the reciprocal of gate delay) is proportional to V.
Lowering voltage reduces dynamic power consumption and energy per operation but decreases performance because of the negative effect on frequency.
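Since f is proportional to V to first order, dynamic power (½CV²f) varies roughly as f³ under voltage-frequency scaling, while energy per task (power times runtime, with runtime proportional to 1/f) varies as f². A small sketch of this cube/square rule (the ratios are outputs of the simplified model, not measurements):

```python
def dvfs_power_ratio(freq_ratio):
    """Relative dynamic power when frequency (and voltage, proportionally)
    is scaled together: P ~ V^2 * f ~ f^3."""
    return freq_ratio ** 3

def dvfs_energy_ratio(freq_ratio):
    """Relative energy per task: power ratio * runtime ratio (1/f) = f^2."""
    return freq_ratio ** 2

# Running at 80% frequency: about half the power, 64% of the energy per task.
print(dvfs_power_ratio(0.8), dvfs_energy_ratio(0.8))
```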
Static Power
• Static power consumption
– Current_static x Voltage
– Scales with number of transistors
– To reduce: power gating
Trends in Cost
• Cost driven down by the learning curve
– Yield
• DRAM: price closely tracks cost
• Microprocessors: price depends on volume
– 10% less for each doubling of volume
Drawing a single‐crystal Si ingot from the furnace… then slicing it into wafers and patterning them…
8" MIPS64 R20K wafer (564 dies)
What's the price of an IC?

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Final test yield: fraction of packaged dies which pass the final testing stage
Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer x Die yield)

Final test yield: fraction of packaged dies which pass the final testing stage
Die yield: fraction of good dies on a wafer
What's the price of the final product?
• Component costs
• Direct costs (add 25% to 40%): recurring costs such as labor, purchasing, warranty
• Gross margin (add 82% to 186%): nonrecurring costs such as R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
• Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup

(figure: stacked breakdown of list price vs. average selling price; average discount 34% to 39%, gross margin 15% to 33%, direct cost 6% to 8%, component cost the remainder)
Integrated Circuit Cost
• Integrated circuit:
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = (π x (Wafer diameter / 2)^2) / Die area − (π x Wafer diameter) / sqrt(2 x Die area)
• Bose‐Einstein formula:
Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N
• Defects per unit area = 0.016‐0.057 defects per square cm (2010)
• N = process‐complexity factor = 11.5‐15.5 (40 nm, 2010)
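The cost model above can be combined into a small calculator. The dies-per-wafer approximation and the Bose-Einstein yield term follow the textbook's formulas; the wafer price and dimensions below are made-up illustrative inputs:

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    """Wafer area over die area, minus a correction for edge-lost dies."""
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2))

def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Bose-Einstein model: wafer yield / (1 + defect density * die area)^N."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2, defects_per_cm2, n):
    good_dies = (dies_per_wafer(wafer_diam_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2, n))
    return wafer_cost / good_dies

# Hypothetical inputs: $5000 wafer, 30 cm diameter, 1.0 cm^2 die,
# 0.03 defects/cm^2 and N = 13.5 (within the 2010 ranges quoted above).
print(round(die_cost(5000, 30, 1.0, 0.03, 13.5), 2))
```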
Quantitative Principles of Design
• Take Advantage of Parallelism
• Principle of Locality
• Focus on the Common Case
– Amdahl's Law
– E.g. common case supported by special hardware; uncommon cases in software
• The Performance Equation
Measuring Performance
• Typical performance metrics:
– Response time
– Throughput
• Speedup of X relative to Y
– Execution time_Y / Execution time_X
• Execution time
– Wall clock time: includes all system overheads
– CPU time: only computation time
• Benchmarks
– Kernels (e.g. matrix multiply)
– Toy programs (e.g. sorting)
– Synthetic benchmarks (e.g. Dhrystone)
– Benchmark suites (e.g. SPEC06fp, TPC‐C)
1. Parallelism

How to improve performance?
• (Super)‐pipelining
• Powerful instructions
– MD‐technique: multiple data operands per operation
– MO‐technique: multiple operations per instruction
• Multiple instruction issue
– single instruction‐program stream
– multiple streams (or programs, or tasks)
Flynn's Taxonomy
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)– Vector architectures– Multimedia extensions– Graphics processor units
• Multiple instruction streams, single data stream (MISD)– No commercial implementation
• Multiple instruction streams, multiple data streams (MIMD)– Tightly‐coupled MIMD– Loosely‐coupled MIMD
Pipelined Instruction Execution
(figure: classic pipeline timing diagram; instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg, one stage per clock cycle across cycles 1-7)
Limits to pipelining
• Hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: attempt to use the same hardware to do two different things at once
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
(figure: pipeline timing diagram of instructions in program order with hazard-induced stalls)
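The cost of these hazards is usually folded into the CPI: pipeline CPI = ideal CPI + stall cycles per instruction. A sketch with illustrative numbers (the stall rate and pipeline depth below are assumptions, not slide data):

```python
def pipelined_cpi(ideal_cpi, stall_cycles_per_instr):
    """Hazards show up as stall cycles added to the ideal CPI."""
    return ideal_cpi + stall_cycles_per_instr

def pipeline_speedup(depth, ideal_cpi, stall_cycles_per_instr):
    """Speedup over an unpipelined machine taking `depth` cycles per
    instruction at the same clock rate."""
    return depth / pipelined_cpi(ideal_cpi, stall_cycles_per_instr)

# 5-stage pipeline, ideal CPI 1, 0.25 stall cycles/instruction from hazards:
print(pipeline_speedup(5, 1.0, 0.25))  # 4.0 instead of the ideal 5.0
```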
2. The Principle of Locality
• Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight‐line code, array access)
• For the last 30 years, HW has relied on locality for memory performance.
(diagram: processor -> cache ($) -> memory)
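An idealized block-cache model makes spatial locality concrete: sequential accesses keep hitting in an already-fetched block, while large-stride accesses miss on every reference. The block size and access patterns below are illustrative assumptions:

```python
def block_hits(addresses, block_size=8):
    """Count hits under an idealized cache: the first touch of a block is a
    miss that fetches the whole block; every later touch of that block hits."""
    cached_blocks, hits = set(), 0
    for addr in addresses:
        block = addr // block_size
        if block in cached_blocks:
            hits += 1
        else:
            cached_blocks.add(block)
    return hits

n = 64
sequential = list(range(n))           # neighbors: strong spatial locality
strided = [i * 16 for i in range(n)]  # stride 16 > block size: no reuse
print(block_hits(sequential), block_hits(strided))  # 56 0
```

With 64 sequential accesses and 8-byte blocks, only the 8 first-touch misses remain; the strided pattern touches a new block every time.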
Memory Hierarchy Levels
(each level: capacity, access time, cost; staging transfer unit; who manages the transfer)
• CPU registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); instruction operands of 1-8 bytes, managed by program/compiler
• L1 and L2 cache: 10s-100s of KBytes, ~1 ns to ~10 ns, ~$100s/GByte; blocks of 32-64 bytes (L1) and 64-128 bytes (L2), managed by the cache controller
• Main memory: GBytes, 80 ns - 200 ns, ~$10/GByte; pages of 4K-8K bytes, managed by the OS
• Disk: 10s of TBytes, 10 ms (10,000,000 ns), ~$0.1/GByte; files of GBytes, managed by the user/operator
• Tape: infinite capacity, sec-min, ~$0.1/GByte (still needed?)
Upper levels are smaller and faster; lower levels are larger.
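The payoff of the hierarchy is usually summarized as average memory access time (AMAT) = hit time + miss rate x miss penalty, applied level by level. The latencies and miss rates below are illustrative assumptions, loosely matching the ranges above:

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for one level of the hierarchy."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# L2 backed by 100 ns main memory; L1 backed by L2.
l2_time = amat(hit_time_ns=10, miss_rate=0.05, miss_penalty_ns=100)  # 15.0
l1_time = amat(hit_time_ns=1, miss_rate=0.10, miss_penalty_ns=l2_time)
print(l1_time)  # 2.5 ns on average, despite 100 ns DRAM behind it
```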
3. Focus on the Common Case
• Favor the frequent case over the infrequent case
– E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
– E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
• The frequent case is often simpler and can be done faster than the infrequent case
– E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
– This may slow down overflow, but overall performance is improved by optimizing for the normal case
• What is the frequent case? How much is performance improved by making that case faster? => Amdahl's Law
Amdahl's Law

Speedup_overall = T_exec,old / T_exec,new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

(figure: execution time as a serial part plus a parallel part; only the parallel part is sped up)
Amdahl's Law
• Floating point instructions improved to run 2 times faster, but only 10% of actual instructions are FP
T_exec,new = ?
Speedup_overall = ?
Amdahl's Law
• Floating point instructions improved to run 2X, but only 10% of actual instructions are FP
T_exec,new = T_exec,old x (0.9 + 0.1/2) = 0.95 x T_exec,old
Speedup_overall = 1 / 0.95 = 1.053
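The worked example above is easy to check in code; this is a direct transcription of Amdahl's law with the slide's 10%/2x numbers:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the work is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 10% of instructions are FP and FP runs 2x faster:
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```

Even an infinite FP speedup would only give 1/0.9 ≈ 1.11x overall, which is why the common case matters.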
Amdahl's law (figure)
Principles of Computer Design
• The Processor Performance Equation:
CPU time = Instruction count x Cycles per instruction (CPI) x Clock cycle time
Principles of Computer Design
• Different instruction types having different CPIs:
CPU clock cycles = Σ_i (IC_i x CPI_i)
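Per-class CPIs combine into the performance equation as CPU time = Σ_i(IC_i x CPI_i) / clock rate. A short sketch with a made-up instruction mix (the counts, CPIs, and clock rate are illustrative assumptions):

```python
def cpu_time_seconds(mix, clock_hz):
    """mix: list of (instruction_count, cpi) pairs, one per instruction class."""
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    return total_cycles / clock_hz

def average_cpi(mix):
    """Cycle-weighted average CPI over all instruction classes."""
    total_instructions = sum(ic for ic, _ in mix)
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    return total_cycles / total_instructions

# Hypothetical mix on a 2 GHz clock: ALU ops, loads/stores, branches.
mix = [(50e6, 1.0), (30e6, 2.0), (20e6, 1.5)]
print(average_cpi(mix), cpu_time_seconds(mix, 2e9))  # 1.4 CPI, 0.07 s
```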
Dependability
• Module reliability
– Mean time to failure (MTTF)
– Mean time to repair (MTTR)
– Mean time between failures (MTBF) = MTTF + MTTR
– Availability = MTTF / MTBF
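These definitions translate directly into code (the MTTF and MTTR figures below are illustrative):

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A module with an MTTF of 1,000,000 hours that takes 24 hours to repair:
print(availability(1_000_000, 24))  # ~0.999976: roughly 13 min down per year
```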
Acknowledgements
Some of the slides contain material developed and copyrighted by Henk Corporaal (TU/e) and instructor material for the textbook.