Lecture 25: “Performance, Power, & Energy of Computers”
➢ Recommended References:
• “Energy per Instruction Trends in Intel® Microprocessors,” by Ed Grochowski, Murali Annavaram, 2006.
• “Mitigating Amdahl's Law through EPI Throttling,” by M. Annavaram, E. Grochowski, J. Shen. In 32nd ISCA, 2005.
The “Science” of Computer Systems
18-600 Foundations of Computer Systems
John P. Shen, November 29, 2017
11/29/2017 (© J.P. Shen) 18-600 Lecture #25 1
Lecture 25: “Performance, Power, & Energy of Computers”
A. Latency (Single-Thread) Performance (Law #1)
B. Throughput (Multi-Thread) Performance (Law #2)
C. Throughput Performance Scalability (Law #3)
D. Performance Scalability Impediments (Law #4)
E. Performance and Power Scaling (Law #5)
F. Power and Energy Optimizations
18-600 Foundations of Computer Systems
Flynn’s Taxonomy of Computer Systems [Mike Flynn, 1966]
SISD – Single Instruction Stream & Single Data Stream. (Sequential uni-processor.)
SIMD – Single Instruction Stream & Multiple Data Streams. (Vector processing, lockstep.)
MISD – Multiple Instruction Streams & Single Data Stream. (Stream data through multiple processing stages.)
MIMD – Multiple Instruction Streams & Multiple Data Streams. (Multi-threads & multi-processors, most general parallelism.)
SPMD – Single Program, Multiple Data Streams. (GPUs, not lockstep.)
MPMD – Multiple Programs, Multiple Data Streams.
Classification of Parallelism
• SISD – Traditional sequential program on single-core processor
➢ Sequential code with sequential execution semantics (single PC)
➢ Can support concurrent “multi-processing” through time sharing
➢ Control-flow graph & data-flow graph embedded in sequential code
➢ Achieve Instruction Level Parallelism (ILP) via aggressive control flow speculation and dataflow limit processing
• MIMD – Multi-threads & multi-cores, most general type of parallelism
➢ Can support simultaneous parallel “multi-processing” (multiple PCs)
➢ Simultaneous traversing of multiple control-flow graphs
➢ Can support both “multi-processing” and “multi-threading”
➢ Achieve Thread Level Parallelism (TLP) via program & machine parallelisms
A. Latency (Single-Thread) Performance (Law #1)
❖ Time to execute a program: T (latency)
❖ Processor performance: Perf = 1/T
T = (instructions/program) × (cycles/instruction) × (time/cycle)
T = PathLength × CPI × CycleTime

Perf_CPU = 1/T = 1 / (PathLength × CPI × CycleTime) = Frequency / (PathLength × CPI)
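Law #1 (the "iron law" of latency performance) can be sketched numerically. The workload figures below (instruction count, CPI, clock rate) are hypothetical, chosen only for illustration:

```python
# Law #1: latency T = PathLength x CPI x CycleTime, and Perf = 1/T.
# All workload numbers below are hypothetical, for illustration only.

def latency_seconds(path_length, cpi, frequency_hz):
    """T = PathLength x CPI x CycleTime, with CycleTime = 1/Frequency."""
    cycle_time = 1.0 / frequency_hz
    return path_length * cpi * cycle_time

def perf_cpu(path_length, cpi, frequency_hz):
    """Perf_CPU = 1/T = Frequency / (PathLength x CPI)."""
    return frequency_hz / (path_length * cpi)

# Hypothetical program: 10^9 dynamic instructions, CPI = 2.0, 2 GHz clock.
T = latency_seconds(1e9, 2.0, 2e9)
print(T, perf_cpu(1e9, 2.0, 2e9))   # performance is the reciprocal of latency
```

Note that any of the three terms can be traded against the others: halving CPI at the same clock and path length doubles performance.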
Landscape of Microprocessor Families
SPECint Latency Performance of Processors [John DeVale & Bryan Black, 2005]

[Scatter plot of SPECint2000/MHz vs. Frequency (MHz, 0-3500) for Intel-x86 (PIII, P4, NWD, PSC, DTN, Extreme), AMD-x86 (Athlon, Opteron), Power (Power 3, Power4, Power5), and Itanium families, with SPECint2000 iso-performance curves from 100 to 1900. Deeper pipelining moves a design toward higher frequency; a wider pipeline moves it toward higher SPECint2000/MHz. Data source: www.SPEC.org]

Performance_CPU = Frequency / (PathLength × CPI)
Latency vs. Throughput Performance
❖ Reduce Latency of Application
▪ Uni-processor, Single Program
▪ Target Single-Thread (ST) Performance
▪ Examples: SPEC, PCs and Workstations
❖ Increase Throughput of System
▪ Multi-cores/Multi-processors, Many Threads
▪ Target Multi-threaded/Multi-tasking Throughput
▪ Example: Database Transaction Processing
Ideal Throughput (Multi-Thread) Performance
❖ Time to process a thread: T (latency)
❖ Multi-thread performance: Perf = 1/T
T = (instructions/thread) × (cycles/instruction) × (time/cycle)
T = PathLength × CPI × CycleTime

Perf_MP = n × 1 / (PathLength × CPI × CycleTime) = n × Frequency / (PathLength × CPI)
B. Throughput (Multi-Thread) Performance (Law #2)
❖ Multi-Core/Multi-Thread Performance:
❖ Can Improve Perf_MC by:
▪ Increasing: n (no. of CPUs or cores)
▪ Increasing: Frequency (CPU clock frequency)
▪ Decreasing: PL (dynamic instruction count)
▪ Decreasing: CPI (cycles/instruction)

Perf_MC(n) = n × Frequency / (PL(n) × CPI(n))
A Throughput Performance Scalability Model
❖ Multi-Core/Multi-Thread Speedup:
❖ A Rigorous Scalability Model (my proposal):
Speedup_MC(n) = [n × Frequency / (PL(n) × CPI(n))] / [Frequency / (PL(1) × CPI(1))] = n × PL(1) × CPI(1) / (PL(n) × CPI(n))

PL(n) × CPI(n) = n^x × n^y × PL(1) × CPI(1)
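The scalability model above reduces to a single power law, Speedup_MC(n) = n^(1-x-y). A minimal numeric sketch, where the impedance exponents x and y are illustrative placeholders rather than measured values:

```python
# Scalability model: PL(n) = n^x * PL(1) and CPI(n) = n^y * CPI(1), so
# Speedup_MC(n) = n * PL(1)*CPI(1) / (PL(n)*CPI(n)) = n^(1 - x - y).
# The x and y values below are illustrative placeholders.

def speedup_mc(n, x, y):
    """Multi-core speedup under path-length (x) and CPI (y) impedance."""
    return n ** (1.0 - x - y)

print(speedup_mc(16, 0.0, 0.0))   # no impedance: perfect linear speedup
print(speedup_mc(16, 0.5, 0.5))   # x + y = 1.0: no speedup at all
print(speedup_mc(16, 0.1, 0.1))   # modest impedance: 16^0.8, roughly 9.2x
```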
C. Throughput Performance Scalability (Law #3)
❖ Multi-Core/Multi-Thread Speedup:
❖ Scalability Impedance Functions:
Speedup_MC(n) = n × PL(1) × CPI(1) / (PL(n) × CPI(n)) = n / (n^x × n^y) = n^(1-(x+y))

PL(n) = n^x × PL(1)    (n^x = PL Impedance Function)
CPI(n) = n^y × CPI(1)   (n^y = CPI Impedance Function)

x = y = 0.0 → No Impedance: Speedup_MC(n) = n
(x + y) = 1.0 → No Speedup: Speedup_MC(n) = n^0 = 1
Real World Example: Database Performance (OLTP)
❖ MIMD Database Performance: TPS
❖ Can Improve TPS by:
▪ Increasing: n (no. of CPUs)
▪ Increasing: Frequency (CPU clock frequency)
▪ Decreasing: IPX (instructions/transaction) == PL
▪ Decreasing: CPI (cycles/instruction)

TPS = Transactions/Second = n × Frequency / (IPX(n) × CPI(n))   [Law #2]
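As a numeric sketch of the TPS form of Law #2 (the frequency, IPX, and CPI values below are hypothetical placeholders, not measured data from the study that follows):

```python
# Law #2 applied to OLTP: TPS = n * Frequency / (IPX(n) * CPI(n)).
# The IPX and CPI figures here are hypothetical placeholders.

def tps(n, frequency_hz, ipx, cpi):
    """Transactions per second for n CPUs, ignoring scaling impedance."""
    return n * frequency_hz / (ipx * cpi)

# Hypothetical: 4 CPUs at 1.6 GHz, 1.0M instructions/transaction, CPI = 4.0.
print(tps(4, 1.6e9, 1.0e6, 4.0))   # 1600.0 transactions/second
```

With IPX and CPI held constant, TPS scales linearly in n; the following slides measure how IPX(n) and CPI(n) actually grow with n.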
[Rich Hankins & Murali Annavaram, 2004]
4-way Multiprocessor Experimental Setup

Component          | Intel® Xeon System | Intel® Itanium® 2 System
Processors         | 4-way SMP, 1.6GHz  | 4-way SMP, 900MHz
Caches             | 256KB L2, 1MB L3   | 256KB L2, 3MB L3
Operating System   | Red Hat® AS 2.1    | Red Hat® AS 2.1
Disks              | 24 data + 2 log    | 32 data + 1 log
Main Memory        | 4GB                | 16GB
Database           | Oracle® 9iR2       | Oracle® 10g
OS Large Page Size | 4MB                | 256MB
SGA                | 3GB                | 14GB
❖ Based on EMON Events
▪ Separate User and OS components for each event
▪ Use multiple long runs (20-min warm-up, 10-min measurement)
▪ Strive for standard deviation <5% for TPS & CPU utilization >90%
▪ Overall user execution time ~70-90%
TPS = n × Frequency / (IPX(n) × CPI(n))
[Rich Hankins & Murali Annavaram, 2004]
TPS: Throughput Scaling
[Two plots of TPS (0-1000) vs. Warehouses for 1P, 2P, and 4P configurations: Xeon (0-1200 warehouses) and Itanium 2 (0-1500 warehouses), each with balanced, CPU-bound, and I/O-bound regions]

TPS = n × Frequency / (IPX(n) × CPI(n))

❖ TPS scales more linearly on Itanium 2
▪ Larger SGA implies slower I/O rate increase (Xeon I/O rate increases at 7KB/WH, Itanium 2 at 6KB/WH)
▪ Bus utilization on Xeon higher than on Itanium 2 (45% vs. 39%)
❖ Increasing I/O rate → more processes & context switches
▪ Clients increase from 8 to 56 on Xeon & from 4 to 64 on Itanium 2
▪ OS time up to 20% on Xeon & only 10% on Itanium 2
▪ Performance degrades with increased data set size (due to I/O rate)
▪ Performance improves with increased n (on IPF almost linear increase)
[Rich Hankins & Murali Annavaram, 2004]
IPX: Path-Length Scaling

TPS = n × Frequency / (IPX(n) × CPI(n))

[Two plots of IPX (×10^6) vs. Warehouses for 1P, 2P, and 4P on Xeon and Itanium 2]

IPX(n) = n^x × IPX(1), with n^x ~ n^0 = 1 on both systems

❖ Growth in IPX (quite linear) attributed to OS IPX increase
▪ More I/O, more context switching
❖ User-level IPX remains relatively constant in both systems
▪ Code path through Oracle relatively constant
❖ Excluding NOPs, IPX at 25 WH similar for both systems!
❖ IPX growth less pronounced on Itanium 2
▪ IPX increases with increased data set size (Xeon)
▪ But IPX(n) doesn't increase much with increased n
[Rich Hankins & Murali Annavaram, 2004]
Path-Length Contributions

PL(n) = OS (I/O, Page Fault, Context Switch) + APP (Original Code, Speculation, Replay, Synchronization)

TPS = n × Frequency / (IPX(n) × CPI(n))

PL(n) = n^x × PL(1), 0.0 ≤ x ≤ 1.0
CPI: Overhead Scaling

TPS = n × Frequency / (IPX(n) × CPI(n))

[Two plots of CPI vs. Warehouses for 1P, 2P, and 4P: Xeon (CPI up to 10) and Itanium 2 (CPI up to 4)]

CPI(n) = n^y × CPI(1)

❖ CPI increases with a "knee," but less sharply on Itanium 2!
❖ Overall CPI trend strongly determined by user CPI
▪ User-mode execution time more than 90% on IPF, 80% on Xeon
❖ Xeon CPI grows with P, Itanium 2 CPI does not
▪ Growth attributed to higher bus utilization on Xeon
▪ CPI does increase with increased data set size
▪ CPI(n) also increases with increased n (esp. for Xeon)
[Rich Hankins & Murali Annavaram, 2004]
CPI Breakdown: Xeon

TPS(n) = n × Frequency / (n^0 × IPX(1) × n^0.167 × CPI(1))

Speedup_MC(n) = n^(1-x-y) = n^0.833   [Law #3]

[Stacked-bar plot of the Xeon CPI breakdown (Inst, Branch, TLB, TC, L2 Miss, L3 Miss, Other) for 1P/2P/4P at 10, 100, 500, and 800 warehouses]

Xeon: CPI(n) = n^y × CPI(1), with y = 0.167
[Rich Hankins & Murali Annavaram, 2004]
CPI Breakdown: Itanium 2

TPS(n) = n × Frequency / (n^0 × IPX(1) × n^0.073 × CPI(1))

Speedup_MC(n) = n^(1-x-y) = n^0.927   [Law #3]

[Stacked-bar plot of the Itanium 2 CPI breakdown (Work, Flush, RSE, L1d, FE, Exe) for 1P/2P/4P at 25, 100, 500, 800, and 1200 warehouses]

Itanium 2: CPI(n) = n^y × CPI(1), with y = 0.073
[Rich Hankins & Murali Annavaram, 2004]
CPI Contributions

CPI(n) = CPU (Execute, Dep. Stalls) + Resource (Cache, TLB) + Coherence (Cache) + Mem (DRAM) + MP Interconnect (Bus) + Synchronization

TPS = n × Frequency / (IPX(n) × CPI(n))

CPI(n) = n^y × CPI(1), 0.0 ≤ y ≤ 1.0
Scalability Impediments
1p-16p scaling (measured speedup fitted to n^(1-(x+y))):

Workload | 1-(x+y) | R^2
SEMPHY   | 0.993   | 0.999
PLSA     | 0.963   | 0.999
Rsearch  | 0.931   | 0.997
SVM-RFE  | 0.786   | 0.970
SNPs     | 0.685   | 0.967
GeneNet  | 0.642   | 0.983

[Carole Dulong et al., 2005]

Speedup Scalability
[Plot of Speedup vs. Number of Cores (n = 1-32) for n^1.00, n^0.90 ((x+y)=0.10), n^0.80 ((x+y)=0.20), and n^0.70 ((x+y)=0.30)]

Perf_MC(n) = n^(1-x-y) × Frequency / (PL(1) × CPI(1))
Speedup_MC(n) = n^(1-x-y)
Scalability Headroom

[Speedup Scalability Extrapolation: plot of Speedup vs. Number of Cores (n = 1-32) for n^1.00, n^0.92, and n^0.65; how far can the measured curves be pushed toward ideal?]

Perf_MC(n) = n^(1-x-y) × Frequency / (PL(1) × CPI(1))

Headroom comes from reducing PL(n) and reducing CPI(n).
Conspiring Forces Against Performance Scaling: Three Forms of Scalability Impediments
❖ Algorithm
▪ Limitation of Language and Algorithm
▪ Tyranny of Amdahl's Law (sequential bottleneck)
❖ Architecture
▪ Increase of Path Length Undermines Scalability
▪ Increase of CPI also Undermines Scalability
❖ Power/Thermal
▪ Increased Complexity and Inefficiency of Design
▪ Super-linear Power Scaling Relative to Performance
D. Performance Scalability Impediments (Law #4)
Algorithm Scalability (Amdahl's Law):
Speedup_ALG(n) = 1 / (f + (1-f)/n) = n / (1 + (n-1) × f)

Architecture Scalability:
Speedup_ARCH(n) = n / (n^x × n^y) = n^(1-(x+y))
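The two impediments behave very differently as n grows: Amdahl's law saturates at 1/f, while architecture impedance never saturates but bends the curve down everywhere. A small sketch (the f and x+y values are illustrative):

```python
# Law #4: two separate scalability impediments.
# Algorithm (Amdahl): Speedup_ALG(n)  = 1 / (f + (1 - f)/n)
# Architecture:       Speedup_ARCH(n) = n^(1 - (x + y))

def speedup_alg(n, f):
    """Amdahl's-law speedup with sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / n)

def speedup_arch(n, xy):
    """Architecture-limited speedup, with xy = x + y."""
    return n ** (1.0 - xy)

# Amdahl's limit: as n grows, speedup saturates at 1/f (here, 20).
print(speedup_alg(1_000_000, 0.05))
# Architecture impedance never saturates, but bends the curve down.
print(speedup_arch(32, 0.10))   # 32^0.9, roughly 22.6
```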
Relative (Additive) Degrees of Tyranny

[Plot of Speedup vs. Number of Cores (n = 1-32) for z = f = 0.10: Perfect Speedup, the JPS (architecture) curve, and the Amdahl (algorithm) curve]

Architecture tyranny (with z = x + y):
Speedup_MC(n) = n / (n^x × n^y) = n^(1-z)

Algorithm tyranny:
Speedup_MP(n) = 1 / (f + (1-f)/n) = n / (1 + (n-1) × f)
Combined Effect on Actual Speedup (f=0.01, x+y=0.08)

[Plot of Speedup vs. Number of Cores (n = 1-64): Perfect (Linear) Speedup, Architecture Scaling (w/ x+y=0.08), Algorithm Scaling (w/ f=0.01), and Actual Speedup (w/ f=0.01 & x+y=0.08)]

Speedup_ARCH(n) = n^(1-0.08) = n^0.92
Speedup_ALG(n) = 1 / (0.01 + 0.99/n)
Speedup_Actual(n) = 1 / (0.01 + 0.99/n^0.92)

10X speedup requires n > 14; 20X speedup requires n > 35.
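The combined model applies Amdahl's law on top of the architecture-impeded core count. A sketch using the slide's scenario (f = 0.01, x+y = 0.08); the exact integer thresholds it prints depend on rounding, so they may differ slightly from the slide's quoted n > 14 and n > 35:

```python
# Combined model: Speedup_Actual(n) = SU_ALG(n^(1-x-y))
#               = 1 / (f + (1-f)/n^(1-x-y)).

def speedup_actual(n, f, xy):
    """Amdahl's law applied on top of architecture-impeded scaling."""
    n_eff = n ** (1.0 - xy)          # effective cores after impedance
    return 1.0 / (f + (1.0 - f) / n_eff)

def min_cores_for(target, f, xy, n_max=4096):
    """Smallest n achieving the target speedup (None if unreachable)."""
    for n in range(1, n_max + 1):
        if speedup_actual(n, f, xy) >= target:
            return n
    return None

print(min_cores_for(10.0, f=0.01, xy=0.08))   # around the slide's n > 14
print(min_cores_for(20.0, f=0.01, xy=0.08))   # low 30s, near the slide's n > 35
```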
Combined Effect on Actual Speedup (f=0.02, x+y=0.08)

[Plot of Speedup vs. Number of Cores (n = 1-64): Perfect Speedup, Architecture Scaling (w/ x+y=0.08), Algorithm Scaling (w/ f=0.02), and Actual Speedup (w/ f=0.02 & x+y=0.08)]

SU_ARCH(n) = n^(1-0.08) = n^0.92
SU_ALG(n) = 1 / (0.02 + 0.98/n)
SU_Actual(n) = SU_ALG(n^0.92) = 1 / (0.02 + 0.98/n^0.92)

10X speedup requires n > 16; 20X speedup requires n > 52.
Combined Effect on Actual Speedup (f=0.02, x+y=0.10)
(with scalar execution of sequential %)

[Plot of Speedup vs. Number of Cores (n = 1-64): Perfect (Linear) Speedup, Architecture Scaling (w/ x+y=0.10), Algorithm Scaling (w/ f=0.02), and Actual Speedup (w/ f=0.02 & x+y=0.10)]

Speedup_ARCH(n) = n^(1-0.10) = n^0.90
Speedup_ALG(n) = 1 / (0.02 + 0.98/n)

10X speedup requires n > 18; 20X speedup requires n > 62.
Impact of Latency Performance on MP Performance

[Two plots of Speedup vs. Number of Cores (n = 1-64) for f=0.02 and (x+y)=0.10, comparing scalar vs. superscalar execution of the sequential %; each shows Perfect (Linear) Speedup, Architecture Scaling, Algorithm Scaling, and Actual Speedup]

With scalar execution of sequential %:
Speedup_ARCH(n) = n^(1-0.10) = n^0.90
Speedup_ALG(n) = 1 / (0.02 + 0.98/n)
10X speedup requires n > 18; 20X speedup requires n > 62.

With superscalar execution of sequential %:
Speedup_ARCH(n) = n^(1-0.10) = n^0.90
Speedup_ALG(n) = 1 / (0.02/2 + 0.98/n)
10X speedup requires n > 15; 20X speedup requires n > 38.
E. Performance and Power Scaling (Law #5)
Performance = Frequency / (PathLength × CPI) = IPC × Frequency / PathLength
Power = EPI × IPC × Frequency
Power = EPI × Performance × PathLength

Watt = Joule/second = (Joule/instruction) × (instructions/cycle) × (cycles/second)
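Law #5 is just dimensional analysis: J/instruction × instructions/cycle × cycles/second gives J/second, i.e. watts. A minimal sketch with illustrative numbers (the EPI and IPC values are loosely inspired by the tables on the following slides, not measurements):

```python
# Law #5: Power = EPI x IPC x Frequency
# (watts = J/instruction x instructions/cycle x cycles/second).

def power_watts(epi_joules, ipc, frequency_hz):
    """Average power from energy-per-instruction, IPC, and clock rate."""
    return epi_joules * ipc * frequency_hz

# Illustrative core: EPI = 10 nJ, IPC = 1.5, 2 GHz clock.
print(power_watts(10e-9, 1.5, 2e9))   # 30.0 W
```

At fixed IPC and frequency, power is directly proportional to EPI, which is why the following slides treat EPI as the key efficiency metric.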
Power = EPI × IPC × Frequency

Power Scaling Relative to Performance Scaling [John DeVale & Bryan Black, 2005]

[Plot of Spec2K/MHz (left axis, 0-1) and Power (W, right axis, 0-120) vs. Frequency for the Pentium, Pentium EE, and Pentium M families: power grows much faster with frequency than per-MHz performance does]
Power vs. Latency Performance
[Plot of Relative Power (0-30) vs. Relative Performance (0-8) for the i486, Pentium, Pentium Pro, Pentium 4 (Wmt), and Pentium 4 (Psc); the trend line is power = perf^1.74] [Ed Grochowski, 2004]

❖ For comparison
▪ Factor out contributions due to process technology
▪ Keep contributions due to microarchitecture design
▪ Normalize to the i486™ processor
❖ Relative to the i486™, the Pentium® 4 (Wmt) processor is
▪ 6x faster (2X IPC at 3X frequency)
▪ 23x higher power
▪ Spending ~4 units of power for every 1 unit of scalar performance
Power vs. Latency & Throughput Performances
[Plot of Relative Power (0-30) vs. Relative Performance (0-8): the scalar/latency-performance trajectory (i486 → Pentium → Pentium Pro → Pentium 4) follows power = perf^1.74, while a throughput-performance trajectory through low-EPI cores such as the Pentium M may approach power = perf^1.0] [Ed Grochowski, 2005]

CPU       | EPI
i486      | 7 nJ
P5        | 10 nJ
P6        | 17 nJ
P4P-wmt   | 27 nJ
P4P-psc   | 29 nJ
Pentium M | 9 nJ (Low EPI)

❖ Assume a large-scale CMP with potentially many cores
❖ Replication of cores results in (nearly) proportional increases to both throughput performance and power (hopefully)
Power/Performance (EPI) Evolution
Intel Microprocessors | EPI (nJ) at 65nm, 1.33V
i486                  | 10
Pentium               | 14
Pentium Pro           | 24
Pentium 4 (WMT)       | 38
Pentium 4 (CDM)       | 48
Pentium M (Banias)    | 13
Pentium M (Dothan)    | 15
Core Duo (Yonah)      | 11
Core Duo (Merom)      | 10

❖ Power: single-core power (relative to i486 baseline)
❖ Performance: SPECint performance (relative to i486 baseline)
❖ EPI: average energy spent per instruction (in nano-joules)

[Plot of Power vs. Scalar Performance (0-10) for i486 through Merom: the latency-focused designs track Power = Performance^1.75, while Banias, Dothan, Yonah, and Merom fall well below that curve]

Power = EPI × IPC × Frequency [Ed Grochowski, 2004]
Power/Performance Scaling
[Plot of Relative Power (0-140) vs. Relative Performance (0-16) with curves Perf^1.74, Perf^1.5, Perf^1.25, and Perf^1.0; the i486, Pentium, Pentium Pro, Pentium 4 (Wmt), and Pentium 4 (Psc) track the steepest curve, the Pentium M falls below it, and a hypothetical low-EPI "Neo-Core" would fall lower still]

CPU       | EPI   | SU
i486      | 7 nJ  | 1
P5        | 10 nJ | 2
P6        | 17 nJ | 3.5
P4P-wmt   | 27 nJ | 6
P4P-psc   | 29 nJ | 6.5
Pentium M | 9 nJ  | 5.5
Neo-Core  | 5 nJ  | 6.5

NP/DSP/GPU  | EPI
IXP2800     | 1 nJ
TMS320C6713 | 0.7 nJ
GeF7800GTX  | 0.6 nJ

• The issue is not small vs. big cores, nor in-order vs. out-of-order cores. The key metric is EPI.
• The ideal core: ultra-low EPI with the best possible single-thread or single-core performance.
Power and Speedup Scaling
[Plot of Power/Speedup Scaling vs. Machine Scaling (n = 1-32): Power at n^1.2 and n^1.1, Perfect Scaling at n^1.0, Speedup at n^0.9 and n^0.8]

Power scaling challenges:
• Low-EPI cores
• Un-core scaling

Speedup scaling challenges:
• Algorithm: sequential %
• Architecture: PL scaling, CPI scaling
Power vs. MC Speedup Scaling

[Plot of Power (0-200) vs. MC Speedup (0-19.5) with curves Speedup^1.74, Speedup^1.50, Speedup^1.25, and Speedup^1.00; data points for the i486, Pentium (EPI=10nJ), Pentium Pro, Pentium 4 (Wmt), Pentium 4 (Psc, EPI=29nJ), Pentium M (EPI=9nJ), and a hypothetical Neo Core]

CPU       | EPI    | SU
i486      | 7 nJ   | 1
P5        | 10 nJ  | 2
P6        | 17 nJ  | 3.5
P4P-wmt   | 27 nJ  | 6
P4P-psc   | 29 nJ  | 6.5
Pentium M | 9 nJ   | 5.5
Neo core? | < 5 nJ | 6.5

• The scaling goal is not just the number of cores, but maximum throughput within a fixed power envelope.
• The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling toward the square law again.
Future Directions
❖ Scalability Strategies:
▪ Algorithm – Languages & Specialized Parallelism
▪ Architecture – CPI and Path Length Reduction
▪ Power/Thermal – EPI Reduction & Scalable Un-core
❖ Power/Energy is the new scalability wall
❖ Research Challenges:
▪ Sequential % mitigation for compelling workloads
▪ Ultra-low EPI core with great latency performance
▪ Un-core fabric with near-linear power scaling
❖ Un-core scaling is the new power goblin
Useful Chart on Power, Voltage, Resistance, Current
http://www.sengpielaudio.com/calculator-ohm.htm
Formula 1 − Electrical power equation:
Power P = I × V = I² × R = V² ⁄ R
where power P is in watts, voltage V is in volts, and current I is in amperes (DC).
For AC, also consider the power factor PF = cos φ, where φ = power-factor angle (phase angle) between voltage and current.

Formula 2 − Mechanical power equation:
Power P = E ⁄ t = W ⁄ t
where power P is in watts, energy E is in joules, and time t is in seconds. 1 W = 1 J/s.

Electric energy is E = P × t − measured in watt-hours (Wh) or kWh. 1 J = 1 N×m = 1 W×s.
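The three DC forms of the electrical power equation must agree for a resistive load, since I = V/R. A quick sanity check with arbitrary example values:

```python
# Formula 1: P = I x V = I^2 x R = V^2 / R (DC, resistive load).

def power_iv(i_amps, v_volts):
    return i_amps * v_volts

def power_ir(i_amps, r_ohms):
    return i_amps ** 2 * r_ohms

def power_vr(v_volts, r_ohms):
    return v_volts ** 2 / r_ohms

# Example: V = 12 V across R = 6 ohms, so I = V/R = 2 A.
V, R = 12.0, 6.0
I = V / R
print(power_iv(I, V), power_ir(I, R), power_vr(V, R))   # all 24.0 W
```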
Power vs. Energy
▪ Energy: integral of power (area under the curve)
▪ Energy and power driven by different design constraints
▪ Power issues:
▪ Power delivery (supply current @ right voltage)
▪ Thermal (don’t fry the chip)
▪ Reliability effects (chip lifetime)
▪ Energy issues:
▪ Limited energy capacity (battery)
▪ Efficiency (work per unit energy)
▪ Different usage models drive tradeoffs
[Plot of Power vs. Time: energy is the area under the power curve]
F. Power and Energy Optimizations
▪ With a constant time base, the two are "equivalent"
▪ 10% reduction in power => 10% reduction in energy
▪ Once time changes, must treat as separate metrics
▪ E.g. reduce frequency to save power => reduce performance => increase time to completion => consume more energy (perhaps)
▪ Metric: energy-delay product per unit of work
▪ Tries to capture both effects
▪ Others advocate energy-delay² (ED²)
▪ Best to consider all
▪ Plot performance (time), energy, ED, ED²
Performance/Power Efficiency Metrics
▪ Power is a good metric for deciding on the thermal envelope of the processor
▪ Energy is a good metric in battery-constrained environments
▪ A task executed at ½ speed but ¼ power means ½ the energy (2T × ¼P = ½E)
▪ 2X battery life!
▪ The Energy×Delay metric gives higher weight to performance
▪ Same example as above: ED = (2T)² × ¼P stays the same
▪ Energy×Delay² gives even more weight to performance
▪ Same example shows that ½ speed is 2X worse on the ED² metric
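The half-speed, quarter-power example above can be checked directly across the three metrics (baseline power and time are in arbitrary units):

```python
# Slide's example: run a task at half speed (delay 2T) but quarter power (P/4).
# Compare energy (E = P*t), energy-delay (E*t), and energy-delay^2 (E*t^2).

def metrics(power, time):
    energy = power * time
    return energy, energy * time, energy * time ** 2   # E, ED, ED^2

base = metrics(power=1.0, time=1.0)    # baseline, arbitrary units
slow = metrics(power=0.25, time=2.0)   # half speed, quarter power

print(slow[0] / base[0])   # energy: 0.5 -> 2x battery life
print(slow[1] / base[1])   # E*D:    1.0 -> unchanged
print(slow[2] / base[2])   # E*D^2:  2.0 -> 2x worse
```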
MIPS/watt
▪ Common measure of power efficiency
▪ Reciprocal of energy per instruction (instructions per joule)
▪ Independent of time
▪ Ideal metric for throughput performance

MIPS²/watt
▪ Equivalent to 1/(energy × delay)
▪ Common metric for comparing logic families

MIPS³/watt
▪ Equivalent to 1/(energy × delay²)
▪ Assigns increasing weight to time
▪ Appropriate metric for latency performance

MIPS/Watt = (Instructions/Second) / (Joules/Second) = Instructions/Joule
[Three bar charts of normalized metrics for the 486, P5, P6, and Pentium 4: mips/watt, mips²/watt, and mips³/watt]
[Ed Grochowski, Intel, 2005]
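The MIPS/watt identity above is easy to verify numerically; it is just the reciprocal of EPI, rescaled to millions. The EPI value below (10 nJ/instruction) is taken from the lecture's earlier EPI table (P5):

```python
# MIPS/watt reduces to instructions per joule:
# (10^6 inst/s) / (J/s) = 10^6 inst/J, i.e. the reciprocal of EPI.

def mips_per_watt(epi_joules):
    """MIPS/W for a core with the given energy per instruction."""
    instructions_per_joule = 1.0 / epi_joules
    return instructions_per_joule / 1e6   # convert to millions

print(mips_per_watt(10e-9))   # 100.0 MIPS/W at 10 nJ per instruction
```

Note the result involves no frequency or time term, which is why the slide calls MIPS/watt "independent of time."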
Energy Storage Challenge for Mobile Devices

How long between recharges?

• iPad2: 25Wh = 90kJ; 10h use = avg 2.5W
• Kindle3: 6.5Wh = 23kJ; "30-day use" = 30h = avg 220mW
• Typical smartphone: 32g, 13cc, 5.5Wh = 18kJ
  - +5h charging @ max 1W
  - 20mW static power = 10 days standby
  - 150mW notifications = 1.3 days standby
  - "typical usage": 5kJ active + 13kJ standby = 1 battery charge
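The battery arithmetic on this slide (capacity in Wh converted to joules, and average power over a usage period) follows from 1 Wh = 3600 J:

```python
# Battery arithmetic: capacity (Wh) -> kJ, and average power over a period.
# 1 Wh = 3600 J.

def wh_to_kj(watt_hours):
    return watt_hours * 3600.0 / 1000.0

def avg_power_watts(watt_hours, hours_of_use):
    return watt_hours / hours_of_use

print(wh_to_kj(25.0))               # iPad2: 25 Wh = 90 kJ
print(avg_power_watts(25.0, 10.0))  # 10 h of use -> avg 2.5 W
print(wh_to_kj(6.5))                # Kindle3: 23.4 kJ (slide rounds to 23)
```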
Key Challenges… how far can we go? [Per Ljung, 2012]

Active Power Saving:
➢ Context-Based Power Management: Offloading of communication to available connectivity, and of computation to companion devices or the cloud.
➢ Workload-Based Power Management: Manage power consumption based on actual usage and workload scenarios by leveraging heterogeneous cores and smart parallelism to reduce overall power.

Standby Power Saving:
➢ Near-Zero-Power Standby Mode: Low-power always-on transflective bistable (TF/BS) displays; eager hibernation with instant resume.
➢ Ultra-Low-Power Always-On Device: Device with minimal standby functionality and seamless quick switch-over to companion devices.

Energy Harvesting:
➢ Casual Charging: Wireless inductive charging; solar charging for large-surface devices.
➢ Anticipatory Preexecution: Speculative cross-device or cloud-based preprocessing and content prefetching.
Active Power Saving
1.5h @ 900mW = 1.35Wh; on a 5Wh battery: 30% (ideal: 100%); avg app power goal <100mW

➢ Context-Based Power Management
• Offload Communication (local ULP radio): 1m radio instead of 1000m cellular (40mW vs. 1W, ~25x); via dongle, access point, or femtocell box; 80% of time spent in home/office
• Offload Computation (cross-device): cloud/companion pre-compute/pre-render; trade cheap communication for expensive computation (40mW vs. 1W, ~25x)
➢ Workload-Based Power Management
• Power consumption based on usage scenario: activity, location, history
• Approach zero energy waste: de-power unused peripherals & power islands (0W vs. 160mW for email, notifications); heterogeneous cores & SP processor arrays (ULP, 10's of mW vs. 100's of mW cores)
Standby Power Saving
22.5h @ 160mW = 3.6Wh; on a 5Wh battery: 70% (ideal: 0%); goal <10mW, 1kJ

➢ Near-Zero-Power Standby Mode
• Zero-power always-on displays: transflective, bistable (TF/BS, TF/LCD, TF/OLED); 5mW vs. 500mW OLED (~100x)
• Zero-power OS idling: Android, Meego, WP7 use 15mW, iPhone uses 5mW; better firmware 2mW vs. 15mW, with hibernation 1mW vs. 15mW (~15x)
➢ Ultra-Low-Power Always-On Device
• Wearable accessory for notifications & voice (default: email, skype)
• Seamless quick switch-over to companion (4mW vs. 200mW handset, ~50x)
• Week-long battery life on a small battery
Energy Harvesting
Effectively approach unlimited standby

➢ Casual Charging
• Wireless inductive chargers (1W: desk, nightstand, car; 1h charge → 3h use; 5h for a full charge)
• Instant charging for a quick fix
• Solar for large surface areas (best case: 3W charge for a 1W tablet)
• Cross-device energy sharing
➢ Anticipatory Preexecution
• Speculative pre-processing and pre-fetching (40mW vs. 1W, ~25x)
• Cloud-based or cross-device based
• Context and user-behavior-model driven
Refactoring the Mobile Form Factors?
Multiple Devices, One Seamless Experience

• Laptop (avg 10W ➡ avg 2W)
  - Exploits its large battery & connectivity: runs offloaded apps from Phone; gathers data for Wearable
  + Display normally off
  + 50Wh battery → 25 hours of nonstop use
• Smartphone (avg 200mW ➡ avg 40mW)
  + Normally hibernating in standby
  + Offloads apps to Laptop
  + Longer battery life: 5x
  + 5Wh battery → 125 hours of nonstop use
• Wearable (avg 3mW)
  + Always-on display
  + Always-fresh data feeds
  - Responds using Laptop or Smartphone
  + Reduces avg power of Laptop & Smartphone
  + 1.2Wh battery → 400 hours of nonstop use
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
Lecture 26: “Future of Computer Systems”
John P. Shen, December 4, 2017
18-600 Foundations of Computer Systems