Upload
reynold-summers
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
(2)
Some Useful Reading
• http://en.wikipedia.org/wiki/CPU_power_dissipation
• http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
• http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390t-i3-2100t-pentium-g620t.html
• http://www.cpu-world.com/info/charts.html
(4)
Technology Scaling
• 30% scaling down in dimensions doubles transistor density
• Power per transistor Vdd scaling lower power
• Transistor delay = Cgate Vdd/ISAT Cgate, Vdd scaling lower delay
GATE
SOURCE
BODY
DRAIN
tox
GATE
SOURCE DRAIN
L
leakddstdddd IVIVfCVP 2
(5)
Fundamental Trends
High Volume Manufacturing
2004 2006 2008 2010 2012 2014 2016 2018
Technology Node (nm)
90 65 45 32 22 16 11 8
Integration Capacity (BT)
2 4 8 16 32 64 128 256
Delay = CV/I scaling 0.7 ~0.7 >0.7 Delay scaling will slow down
Energy/Logic Op scaling
>0.35 >0.5 >0.5 Energy scaling will slow down
Bulk Planar CMOS High Probability Low Probability
Alternate, 3G etc Low Probability High Probability
Variability Medium High Very High
ILD (K) ~3 <3 Reduce slowly towards 2-2.5
RC Delay 1 1 1 1 1 1 1 1
Metal Layers 6-7 7-8 8-9 0.5 to 1 layer per generation
Source: Shekhar Borkar, Intel Corp.
(6)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et.al, 2008
(7)
Where Does the Power Go in CMOS?
• Dynamic Power Consumption Charging and discharging capacitance
• Short Circuit Power Short circuit path between supply rails during
switching Nominally 10%-20% of dynamic power and can be
ignored for a first order analysis
• Leakage Leaky transistors
(8)
Dynamic Power
PDYNAMIC = CL x VDD x VDD x Frequency
Time
VDD
Voltage
0 T
VDD VDD
Output Capacitor Charging
Output Capacitor
Discharging
Input to CMOS
inverter
iDD
iDDCLCL
• Dynamic power is used in charging and discharging the capacitances in the CMOS circuit.
(9)
• Technology scaling has caused transistors to become smaller and smaller. As a result, static power has become a substantial portion of the total power.
Static Power
Gate Leakage
Junction Leakage Sub-threshold Leakage
Input = 0 Output = VDD
PSTATIC = VDD x ISTATIC
(10)
EnergyDelay
Ene
rgy
or d
elay
VDDVDD
ED
P
Energy-Delay Interaction
• Delay decreases with supply voltage but energy/power increases
(11)
leak
age
or d
elay
Vth
leakagedelay
Static Energy-Delay Interaction
• Static energy increases exponentially with decrease in threshold voltage
• Delay increases with threshold voltage
tox
SOURCE DRAIN
L
GATE
(12)
Power Vs. Energy
• Energy is a rate of expenditure of energy One joule/sec = one watt
• Both profiles use the same amount of energy at different rates or power
Pow
er(
watt
s)
P0
P1
P2
Same Energy = area under the curve
Pow
er(
watt
s)
Time
P0
Time
(13)
Optimizing Power vs. Energy
Thermal envelopes minimize peak power
Maximize battery life minimize energy
(14)
The Problem
• Historically performance scaling was accompanied by power scaling
• This is no longer true power densities are increasing
(15)
The End of Dennard Scaling
tox
SOURCE DRAIN
L
GATE
• Voltage is no longer scaling at the same rate
• Slower scaling in power per transistor increasing power densities
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
(16)
Chip Power Densities
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et.al, 2008
(17)
Mukhopadhyay and Yalamanchili (2009)
Based on scaling using Pentium-class cores While Moore’s Law continues, scaling phenomena have
changed Power densities are increasing with each generation
17
What is the Problem?
(18)
The Power Wall
• Power per transistor scales with frequency but also scales with Vdd
Lower Vdd can be compensated for with increased pipelining to keep throughput constant
Power per transistor is not same as power per area power density is the problem!
Multiple units can be run at lower frequencies to keep throughput constant, while saving power
leakddstdddd IVIVfCVP 2
(19)
The Advent of Dark Silicon?
64-core asymmetric chip multiprocessor layoutand failure probability distribution
In-order core Out of-order core • Cannot afford to turn on all devices at once
• How do we manage the power and thermals?
(20)
What are my Options?
1. Better technology Manufacturing New Devices non-CMOS?
2. Be more efficient – activity management Clock gating Power gating Power management
3. Improved architecture Simpler pipelines
4. Parallelism
(21)
Activity Management
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Clock distribution power
• Turn off power to a block of logic, e.g., core
• No leakage
Combinational Logic
clk
clk
cond
input
clk
Core 0 Core 1
VddPower gate transistor
Clock Gating Power Gating
(22)
Power Management
• Software controlled power management Optimize power and/or energy Orchestrated by the operating system or application
libraries Industry standard interfaces for power management
o Advanced Configuration and Power Interface (ACPI) https://www.acpica.org/ http://www.acpi.info/
• Hardware power management Optimized power/energy Failsafe operation, e.g., protect against thermal
emergencies
(23)
Processor Power States
• Performance States – P-states Operate at different voltage/frequencies
o Recall delay-voltage relationship Lower voltage lower leakage Lower frequency lower power (not the same as energy!) Lower frequency longer execution time
• Idle States - C-states Sleep states Differ is how much state is saved
• SW or HW managed transitions between states!
(24)
Multiple Voltage Frequency Domains
From E. Rotem et. Al. HotChips 2011
• Cores and ring in one DVFS domain• Graphics unit in another DVFS domain• Cores and portion of cache can be gated
off
Intel Sandy Bridge Processor
(25)
Power States
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
(26)
Power Gating
Intel Sandy Bridge Processor
• Turn off components that are not being used Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown Models to guide decisions
(27)
Simplify Core DesignAMD Bulldozer Core
ARM A7 Core (arm.com)
• Support for out of order execution, schedulers, branch prediction, etc. consumes more energy per instruction
• Can fit many more simpler cores on a dies
(28)
Parallelism and PowerIBM Power5
Source: IBM
AMD Trinity
Source: forwardthinking.pcmag.com
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
(29)
Parallelism
• Concurrency + lower frequency greater energy efficiency
leakddstdddd IVIVfCVP 2
Core
Cache
Core
Cache
Core
Cache
Core
Cache
Core
Cache
• 4X #cores• 0.75x voltage• 0.5x Frequency• 1X power• 2X in performance
Example
(30)
Microarchitectural Level Models
• How can we study power consumption without building circuits? Models
• Models can are available at multiple levels of abstraction.
We are interested in microarchitectural models
(31)
Processor Microarchitecture
Instruction Cache
Instruction Queue
FetchQueue
Instruction Decoder
BranchPrediction
Register Files
Instruction TLB
ALU
MUL
FPU
LD
ST
L1 Data Cache
DataTLB
L2 Data CacheNoC Router
On-ChipNetwork
Fetch Decode Execute/Writeback
Memory
Network
(32)
Energy/Power Calculation
• How do we calculate energy or power dissipation for a given microarchitecture?
• Energy/Power varies between: Different ISA; ARM vs Intel x86
Different microarchitecture; in-order vs out-of-order
Different applications; memory vs compute-bound
Different technologies; 90nm vs 22nm technology
Different operation conditions; frequency, temperature
(33)
Architecture Activity (1)
Instruction Cache
Instruction Queue
FetchQueue
Instruction Decoder
BranchPrediction
Register Files
Instruction TLB
ALU
MUL
FPU
LD
ST
L1 Data Cache
DataTLB
L2 Data CacheNoC Router
On-ChipNetwork
Activity 1: Instruction Fetch
icache.read++; fbuffer.write++;
• Collect activity counts of each architecture component (through simulation or measurement).
• List of components differs between microarchitectures.
• Activity counts at each component differs between applications.
(34)
Architecture Activity (2)
Instruction Cache
Instruction Queue
FetchQueue
Instruction Decoder
BranchPrediction
Register Files
Instruction TLB
ALU
MUL
FPU
LD
ST
L1 Data Cache
DataTLB
L2 Data CacheNoC Router
On-ChipNetwork
Activity 2: Instruction Decode
fbuffer.read++; idecoder.logic++;
• Read/write accesses to caches, buffers, etc.
• Logical accesses to logic blocks such as decoder, ALUs, etc.
• Tradeoff of differentiating more access types (accuracy) vs simulation speed (complexity).
(35)
Power and Architecture Activity
• For example, At nth clock cycle, collected counters are: Data cache:
o read = 20, write = 12;
o per-read energy = 0.5nJ; per-write energy = 0.6nJ;
o Read energy = read*per-read energy = 10nJ
o Write energy = write*per-write energy = 7.2nJ
o Total activity energy = read+write energies = 17.2nJ
o If n = 50th clock cycle and clock frequency = 2GHz,Total activity power = energy*clock_freq/n = 688mW
*Note: n/clock_freq = n clock periods in sec power = time average of energy
(36)
Things to consider (1)
1. How do we calculate per-read/write energies?
• Per-access energies can be estimated from circuit-level designs and analyses.
• There are various open-source tools for this.
Architecture Specification
Technology Parameters
Circuit-levelEstimation
Tool
Estimation Results:Area, Energy, Timing, etc.
(37)
Things to consider (2)
2. Is per-access energy always the same?
• Per-access energy in fact depends on:• how many bits are switching • how they are switching (0→1 or 1→0)
• It is reasonable to assume constant per-access energy in long-term observation (e.g., n = 1M clock cycles); the number of switching bits are averaged (e.g., 50% of bits are switching).
• Most architecture simulators do not capture bit-level details due to simulation complexity.
(38)
Things to consider (3)
3. If a register file didn’t have read/write accesses but held data, what is the energy dissipation?
• Energy (or power) is largely comprised of dynamic and static dissipations.
• Dynamic (or switching) energy refers to energy dissipation due to switching activities.
• Static (or leakage) energy is dissipation to keep the electronic system turned on.
• In this case, the register file has no dynamic energy dissipation but consumes static energy.
(39)
Thermal Issues
• Heat can cause damage to the chip Need failsafe operation
• Thermal fields change the physical characteristics Leakage current and therefore power increases Delay increases Device degradation becomes worse
• Cooling solution determines the permitted power dissipation
(40)
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate Dictates the design of the
cooling system o Max temperature Tjmax
Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
AMD Trinity APU
http://ecs.vancouver.wsu.edu/thermofluids-research
(41)
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
(42)
Exploiting the Physics
• Most of time the part is operating well below its thermal limit Leaving performance on the table
• Can temporarily boost frequency (and therefore power dissipation) for short periods of time, e.g., seconds
• Temperature changes slowly
(43)
Boosting
• Exploit package physics Temperature changes on the
order of milliseconds
• Use the thermal headroom
Max Power
TDP Power
Low power – build up thermal credits
Turbo boost region
10s of seconds
Intel Sandy Bridge
(44)
Conclusions
• Power/energy is the leading driver of modern architecture design
• Power and energy management is key to scalability
• Need integrated power/energy, performance, thermal management in fielded systems
• What about energy/power efficient algorithms?
(45)
Study Guide
• Explain the difference between energy dissipation and power dissipation
• Distinguish between static power dissipation and dynamic power dissipation
• Be able to apply the simplified McPAT power model to a simple datapath and instruction sequence
• Explain dynamic voltage frequency scaling What are power states? Why is this an advantage? What is the impact of DVFS on i) energy, ii) execution
time, and iii) power
(46)
Study Guide (cont.)
• How is thermal design power (TDP) calculated?
• When using boost algorithms, what determines the duration of the high frequency operation?
• How does a power virus work?
• Describe how throttling works
• Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high performance computing segments