8/9/2019 02_Performance_ECE552-2014 - complete.pdf
ECE 552: Performance
Prof. Natalie Enright Jerger
Lecture notes based on slides created by Amir Roth of the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood, with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood.
Before we start... To have a meaningful discussion about modern architectures, we must discuss the metrics used to evaluate them, and how those metrics are impacted by Moore's Law.
Moore's Law: devices per chip double every 18-24 months.
Empirical Evaluation
Metrics:
Performance
Cost
Power
Reliability
Often more important in combination than individually:
Performance/cost (MIPS/$)
Performance/power (MIPS/W)
Basis for:
Design decisions
Purchasing decisions
Performance
Performance metrics:
Latency
Throughput
Reporting performance
Benchmarking and averaging
CPU performance equation and performance trends
Two definitions of performance:
Latency (execution time): time to finish a fixed task
Throughput (bandwidth): number of tasks completed per unit of time
Very different: throughput can exploit parallelism, latency cannot
Often contradictory
Choose the definition that matches your goals (most frequently throughput)
Latency/Throughput Example
Example: move people from A to B, 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency: car = 10 miles / 60 mph = 10 minutes; bus = 10 miles / 20 mph = 30 minutes
Throughput (one-way, ignoring return trips): car = 5 people / 10 minutes = 30 people/hour; bus = 60 people / 30 minutes = 120 people/hour
Performance Improvement
Processor A is X times faster than processor B if:
Latency(P,A) = Latency(P,B) / X
Throughput(P,A) = X * Throughput(P,B)
Processor A is X% faster than processor B if:
Latency(P,A) = Latency(P,B) / (1 + X/100)
Throughput(P,A) = (1 + X/100) * Throughput(P,B)
Car/bus example: Latency? The car is 3 times faster (10 vs. 30 minutes). Throughput? The bus is 4 times faster (120 vs. 30 people/hour, one-way).
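The car/bus numbers can be checked with a short sketch. As an assumption the slides leave open, throughput here is one-way: vehicles do not return for more passengers.

```python
# Latency vs. throughput for the 10-mile car/bus example.
# Assumption: throughput is one-way (no return trips for more passengers).

def latency_hours(distance_miles, speed_mph):
    """Latency of one trip: time to move one load from A to B."""
    return distance_miles / speed_mph

def throughput_pph(capacity, distance_miles, speed_mph):
    """People delivered per hour: capacity divided by trip time."""
    return capacity / latency_hours(distance_miles, speed_mph)

car_lat, bus_lat = latency_hours(10, 60), latency_hours(10, 20)
car_tp, bus_tp = throughput_pph(5, 10, 60), throughput_pph(60, 10, 20)

print(f"Car: {car_lat * 60:.0f} min latency, {car_tp:.0f} people/hour")
print(f"Bus: {bus_lat * 60:.0f} min latency, {bus_tp:.0f} people/hour")
print(f"Car is {bus_lat / car_lat:.0f}x faster in latency")
print(f"Bus is {bus_tp / car_tp:.0f}x faster in throughput")
```

The two metrics pick opposite winners, which is exactly the slide's point.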
What Is P in Latency(P,A)? A program
Latency(A) by itself makes no sense; a processor executes some program
But which one? The actual target workload?
Some representative benchmark program(s)?
Some small kernel benchmarks (micro-benchmarks)
Adding/Averaging Performance Numbers
You can add latencies, but not throughputs:
Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2, A) != Throughput(P1,A) + Throughput(P2,A)
1 km @ 30 kph + 1 km @ 90 kph is not 60 kph on average:
0.033 hours at 30 kph + 0.011 hours at 90 kph = 2 km in 0.044 hours = 45 kph
Throughput(P1+P2, A) =
2 / [(1/Throughput(P1,A)) + (1/Throughput(P2,A))]
The same goes for means (averages):
Arithmetic mean (for latencies): (1/N) * [Latency(P1) + ... + Latency(PN)]
Harmonic mean (for throughputs): N / [1/Throughput(P1) + ... + 1/Throughput(PN)]
Geometric mean (for speedups): [Speedup(P1) * ... * Speedup(PN)]^(1/N)
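A sketch of the three means, using the 30/90 kph trip from above; its correct average speed is the harmonic mean, 45 kph, not the arithmetic 60 kph:

```python
import math

def arithmetic_mean(latencies):
    """Use for latencies: (1/N) * sum of Latency(P)."""
    return sum(latencies) / len(latencies)

def harmonic_mean(throughputs):
    """Use for throughputs: N / sum of 1/Throughput(P)."""
    return len(throughputs) / sum(1 / t for t in throughputs)

def geometric_mean(speedups):
    """Use for speedups (ratios): Nth root of the product of Speedup(P)."""
    return math.prod(speedups) ** (1 / len(speedups))

# Two 1 km legs at 30 kph and 90 kph: the true average speed is 45 kph.
print(harmonic_mean([30, 90]))    # 45.0
print(arithmetic_mean([30, 90]))  # 60.0
```

The harmonic mean weights the slow leg more heavily, which matches the fact that you spend three times as long on it.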
CPU Performance Equation
Multiple aspects to performance: it helps to isolate them
Latency(P,A) = seconds / program =
(instructions / program) * (cycles / instruction) * (seconds / cycle)
Instructions / program: dynamic instruction count
Cycles / instruction: CPI
Seconds / cycle: clock period
For low latency (better performance), minimize all three
Hard: each often pulls against the others
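The equation is easy to sketch; the workload numbers below are made up purely for illustration:

```python
def latency_seconds(dynamic_insns, cpi, clock_hz):
    """Seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return dynamic_insns * cpi * (1 / clock_hz)

# Hypothetical program: 1 billion dynamic instructions, CPI of 2, 1 GHz clock.
print(f"{latency_seconds(1e9, 2.0, 1e9):.1f} seconds")
```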
Cycles per Instruction (CPI)
This course is mostly about improving CPI
CPI = cycles per instruction for the average instruction
IPC = 1/CPI
Different instructions have different cycle costs
E.g., an integer add typically takes 1 cycle; an FP divide takes > 10
Assumes you know something about instruction frequencies
CPI Example
A program executes equal numbers of integer, FP, and memory operations
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (1/3)*1 + (1/3)*2 + (1/3)*3 = 2
Caveat: this sort of calculation ignores dependences completely
Back-of-the-envelope arguments only
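The back-of-the-envelope CPI above is just a frequency-weighted sum:

```python
def weighted_cpi(mix):
    """CPI = sum over instruction types of (frequency * cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())

# Equal thirds of integer (1 cycle), memory (2 cycles), and FP (3 cycles) ops.
mix = {"integer": (1 / 3, 1), "memory": (1 / 3, 2), "fp": (1 / 3, 3)}
print(f"CPI = {weighted_cpi(mix):.1f}")
```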
Measuring CPI
How are CPI and execution time actually measured?
Execution time: time (Unix) reports wall clock, CPU, and system time
CPI = (CPU time * clock frequency) / dynamic instruction count
How is dynamic instruction count measured?
Want CPI breakdowns (CPI_CPU, CPI_MEM, etc.) to see what to fix
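Rearranged for measurement, with made-up numbers for illustration:

```python
def measured_cpi(cpu_time_s, clock_hz, dynamic_insn_count):
    """CPI = total cycles / instructions = (CPU time * clock frequency) / insn count."""
    return (cpu_time_s * clock_hz) / dynamic_insn_count

# Hypothetical measurement: 2 s of CPU time at 1 GHz over 1 billion instructions.
print(f"CPI = {measured_cpi(2.0, 1e9, 1e9):.1f}")
```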
CPI breakdowns: hardware event counters
Calculate CPI using counter frequencies and per-event costs
Cycle-level micro-architecture simulation (e.g., SimpleScalar)
+ Measures breakdowns exactly, provided the model is accurate
+ Models the micro-architecture faithfully
+ Runs (some) realistic workloads
Method of choice for many micro-architects (and for you)
Improving CPI
This course is more about improving CPI than frequency
Historically, the clock accounts for 70%+ of performance improvement
Achieved via deeper pipelines
This has changed:
Deep pipelining is not power efficient
Physical speed limits are approaching
1 GHz: 1999, 2 GHz: 2001, 3 GHz: 2002, 3.8 GHz: 2004, 5 GHz: 2008
Intel Core 2: 1.8-3.2 GHz: 2008
Techniques we will look at:
Caching, speculation, multiple issue, out-of-order issue, multiprocessing, and more
Moore's Law helps because CPI reduction requires transistors
Parallelism, by definition, means more transistors
But the best example is caches
Another CPI Example
Assume a processor with the following instruction frequencies and costs:
Integer ALU: 50%, 1 cycle
Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles
Which change would improve performance more?
A. Branch prediction to reduce branch cost to 1 cycle?
B. A bigger data cache to reduce load cost to 3 cycles?
Compute CPI:
Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2.0
A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8
B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6, so B improves performance more
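The A-vs-B comparison, checked mechanically:

```python
def cpi(mix):
    """Weighted CPI from a {name: (frequency, cycles)} instruction mix."""
    return sum(f * c for f, c in mix.values())

base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
opt_a = dict(base, branch=(0.2, 1))  # A: branch prediction, branches cost 1 cycle
opt_b = dict(base, load=(0.2, 3))    # B: bigger data cache, loads cost 3 cycles

print(f"Base: {cpi(base):.1f}")   # 2.0
print(f"A:    {cpi(opt_a):.1f}")  # 1.8
print(f"B:    {cpi(opt_b):.1f}")  # 1.6 -- the bigger cache wins
```

B wins because loads are both frequent (20%) and expensive (5 cycles), so shaving 2 cycles off them removes more weighted cost than shaving 1 cycle off branches.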
CPI Example 3

Operation   Frequency   Cycles
ALU         45%         1
Load        20%         1
Store       15%         2
Branch      20%         2
You can reduce the store cost to 1 cycle, but doing so slows the clock by 20%
Old CPI = 0.45*1 + 0.20*1 + 0.15*2 + 0.20*2 = 1.35
New CPI = 0.45*1 + 0.20*1 + 0.15*1 + 0.20*2 = 1.20
Speedup = old time / new time = 1.35 / (1.20 * 1.2) = 0.94, i.e., a slowdown
Now, if ALU ops were originally 2 cycles and stores were 1 cycle, and you could reduce ALU ops to 1 cycle while slowing the clock by 20%:
Old CPI = 0.45*2 + 0.20*1 + 0.15*1 + 0.20*2 = 1.65; new CPI = 1.20
Speedup = 1.65 / (1.20 * 1.2) = 1.15, so this optimization helps
Example of Amdahl's law: you don't want to speed up a small fraction to the detriment of the rest
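Both scenarios can be checked with the same speedup formula: old time over new time, with the new cycle time 1.2x longer because the clock is 20% slower.

```python
def cpi(mix):
    """Weighted CPI from (frequency, cycles) pairs."""
    return sum(f * c for f, c in mix)

def speedup(old_cpi, new_cpi, clock_penalty=1.2):
    """Old time / new time; instruction count unchanged, new cycle time 1.2x."""
    return old_cpi / (new_cpi * clock_penalty)

# Scenario 1: store cost 2 -> 1 cycle, clock 20% slower.
old1 = cpi([(0.45, 1), (0.20, 1), (0.15, 2), (0.20, 2)])  # 1.35
new1 = cpi([(0.45, 1), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.20
print(f"{speedup(old1, new1):.4f}")  # below 1: a slowdown

# Scenario 2: ALU cost 2 -> 1 cycle (stores already 1 cycle), clock 20% slower.
old2 = cpi([(0.45, 2), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.65
new2 = cpi([(0.45, 1), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.20
print(f"{speedup(old2, new2):.4f}")  # above 1: a win
```

Speeding up the infrequent stores loses; speeding up the frequent, expensive ALU ops wins despite the same clock penalty.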
Performance Rule of Thumb: Amdahl's Law
f: fraction that can run in parallel (be sped up)
1 - f: fraction that must run serially
Speed-up = Time(1 CPU) / Time(n CPUs) = 1 / [(1 - f) + f/n]
Pretty good ideal scaling for a modest number of cores
Large numbers of cores require a lot of parallelism
Amdahl's Law is not just about parallelism: f is any fraction that you can speed up
Your performance will always be limited by the (1 - f) part
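Amdahl's Law as code; the second print shows the 1/(1 - f) ceiling that no number of CPUs can beat:

```python
def amdahl_speedup(f, n):
    """Overall speedup when fraction f runs in parallel on n CPUs."""
    return 1 / ((1 - f) + f / n)

# Even with 95% of the work parallel, 64 CPUs fall far short of 64x,
# and infinite CPUs can never beat 1 / (1 - f) = 20x.
print(f"{amdahl_speedup(0.95, 64):.1f}x")
print(f"{amdahl_speedup(0.95, 10**9):.1f}x")
```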
Summary
Latency = seconds/program =
(instructions/program) * (cycles/instruction) * (seconds/cycle)
Instructions/program: dynamic instruction count
Function of program, compiler, instruction set architecture (ISA)
[Figures from the Amdahl's Law slides: speedup vs. number of CPUs, plotted for up to 16 and up to 64 CPUs]
Cycles/instruction: CPI
Function of program, compiler, ISA, micro-architecture
Seconds/cycle: clock period
Function of micro-architecture, technology parameters
To improve performance, optimize each component
Focus mostly on CPI in this course
Other Metrics
Will (try to) come back to:
Cost
Power
Reliability
Interested in learning more? Grad school