8/9/2019 02_Performance_ECE552-2014 - complete.pdf
ECE 552: Performance
Prof. Natalie Enright Jerger
Lecture notes based on slides created by Amir Roth of the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood, with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood.
Before we start... To have a meaningful discussion about modern architectures, we must discuss the metrics used to evaluate them, and how those metrics are impacted by Moore's Law.
Moore's Law: devices per chip double every 18-24 months.
Empirical Evaluation
Metrics:
Performance
Cost
Power
Reliability
Often more important in combination than individually:
Performance/cost (MIPS/$)
Performance/power (MIPS/W)
Basis for:
Design decisions
Purchasing decisions
Performance
Performance metrics:
Latency
Throughput
Reporting performance
Benchmarking and averaging
CPU performance equation and performance trends
Two definitions of performance:
Latency (execution time): time to finish a fixed task
Throughput (bandwidth): number of tasks completed per unit of time
Very different: throughput can exploit parallelism, latency cannot
Often contradictory
Choose the definition that matches your goals (most frequently throughput)
Latency/Throughput Example
Example: move people from A to B, 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency: car = 10 miles / 60 mph = 10 minutes; bus = 10 miles / 20 mph = 30 minutes
Throughput (one-way, ignoring return trips): car = 5 people / 10 minutes = 30 people/hour; bus = 60 people / 30 minutes = 120 people/hour
Performance Improvement
Processor A is X times faster than processor B if:
Latency(P,A) = Latency(P,B) / X
Throughput(P,A) = X * Throughput(P,B)
Processor A is X% faster than processor B if:
Latency(P,A) = Latency(P,B) / (1 + X/100)
Throughput(P,A) = (1 + X/100) * Throughput(P,B)
Car/bus example: Latency? The car is 3 times faster (10 vs. 30 minutes). Throughput? The bus is 4 times faster (120 vs. 30 people/hour, one-way).
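The car/bus numbers can be checked with a short sketch. As an assumption the slides leave open, throughput here is one-way: vehicles do not return for more passengers.

```python
# Latency vs. throughput for the 10-mile car/bus example.
# Assumption: throughput is one-way (no return trips for more passengers).

def latency_hours(distance_miles, speed_mph):
    """Latency of one trip: time to move one load from A to B."""
    return distance_miles / speed_mph

def throughput_pph(capacity, distance_miles, speed_mph):
    """People delivered per hour: capacity divided by trip time."""
    return capacity / latency_hours(distance_miles, speed_mph)

car_lat, bus_lat = latency_hours(10, 60), latency_hours(10, 20)
car_tp, bus_tp = throughput_pph(5, 10, 60), throughput_pph(60, 10, 20)

print(f"Car: {car_lat * 60:.0f} min latency, {car_tp:.0f} people/hour")
print(f"Bus: {bus_lat * 60:.0f} min latency, {bus_tp:.0f} people/hour")
print(f"Car is {bus_lat / car_lat:.0f}x faster in latency")
print(f"Bus is {bus_tp / car_tp:.0f}x faster in throughput")
```

The two metrics pick opposite winners, which is exactly the slide's point.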
What Is P in Latency(P,A)? A program
Latency(A) by itself makes no sense; a processor executes some program
But which one? The actual target workload?
Some representative benchmark program(s)?
Some small kernel benchmarks (micro-benchmarks)
Adding/Averaging Performance Numbers
You can add latencies, but not throughputs:
Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2, A) != Throughput(P1,A) + Throughput(P2,A)
1 km @ 30 kph + 1 km @ 90 kph is not 60 kph on average:
0.033 hours at 30 kph + 0.011 hours at 90 kph = 2 km in 0.044 hours = 45 kph
Throughput(P1+P2, A) =
2 / [(1/Throughput(P1,A)) + (1/Throughput(P2,A))]
The same goes for means (averages):
Arithmetic mean (for latencies): (1/N) * [Latency(P1) + ... + Latency(PN)]
Harmonic mean (for throughputs): N / [1/Throughput(P1) + ... + 1/Throughput(PN)]
Geometric mean (for speedups): [Speedup(P1) * ... * Speedup(PN)]^(1/N)
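A sketch of the three means, using the 30/90 kph trip from above; its correct average speed is the harmonic mean, 45 kph, not the arithmetic 60 kph:

```python
import math

def arithmetic_mean(latencies):
    """Use for latencies: (1/N) * sum of Latency(P)."""
    return sum(latencies) / len(latencies)

def harmonic_mean(throughputs):
    """Use for throughputs: N / sum of 1/Throughput(P)."""
    return len(throughputs) / sum(1 / t for t in throughputs)

def geometric_mean(speedups):
    """Use for speedups (ratios): Nth root of the product of Speedup(P)."""
    return math.prod(speedups) ** (1 / len(speedups))

# Two 1 km legs at 30 kph and 90 kph: the true average speed is 45 kph.
print(harmonic_mean([30, 90]))    # 45.0
print(arithmetic_mean([30, 90]))  # 60.0
```

The harmonic mean weights the slow leg more heavily, which matches the fact that you spend three times as long on it.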
CPU Performance Equation
Multiple aspects to performance: it helps to isolate them
Latency(P,A) = seconds / program =
(instructions / program) * (cycles / instruction) * (seconds / cycle)
Instructions / program: dynamic instruction count
Cycles / instruction: CPI
Seconds / cycle: clock period
For low latency (better performance), minimize all three
Hard: each often pulls against the others
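The equation is easy to sketch; the workload numbers below are made up purely for illustration:

```python
def latency_seconds(dynamic_insns, cpi, clock_hz):
    """Seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return dynamic_insns * cpi * (1 / clock_hz)

# Hypothetical program: 1 billion dynamic instructions, CPI of 2, 1 GHz clock.
print(f"{latency_seconds(1e9, 2.0, 1e9):.1f} seconds")
```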
Cycles per Instruction (CPI)
This course is mostly about improving CPI
CPI = cycles per instruction for the average instruction
IPC = 1/CPI
Different instructions have different cycle costs
E.g., an integer add typically takes 1 cycle; an FP divide takes > 10
Assumes you know something about instruction frequencies
CPI Example
A program executes equal numbers of integer, FP, and memory operations
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (1/3)*1 + (1/3)*2 + (1/3)*3 = 2
Caveat: this sort of calculation ignores dependences completely
Back-of-the-envelope arguments only
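The back-of-the-envelope CPI above is just a frequency-weighted sum:

```python
def weighted_cpi(mix):
    """CPI = sum over instruction types of (frequency * cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())

# Equal thirds of integer (1 cycle), memory (2 cycles), and FP (3 cycles) ops.
mix = {"integer": (1 / 3, 1), "memory": (1 / 3, 2), "fp": (1 / 3, 3)}
print(f"CPI = {weighted_cpi(mix):.1f}")
```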
Measuring CPI
How are CPI and execution time actually measured?
Execution time: time (Unix) reports wall clock, CPU, and system time
CPI = (CPU time * clock frequency) / dynamic instruction count
How is dynamic instruction count measured?
Want CPI breakdowns (CPI_CPU, CPI_MEM, etc.) to see what to fix
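Rearranged for measurement, with made-up numbers for illustration:

```python
def measured_cpi(cpu_time_s, clock_hz, dynamic_insn_count):
    """CPI = total cycles / instructions = (CPU time * clock frequency) / insn count."""
    return (cpu_time_s * clock_hz) / dynamic_insn_count

# Hypothetical measurement: 2 s of CPU time at 1 GHz over 1 billion instructions.
print(f"CPI = {measured_cpi(2.0, 1e9, 1e9):.1f}")
```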
CPI breakdowns: hardware event counters
Calculate CPI using counter frequencies and per-event costs
Cycle-level micro-architecture simulation (e.g., SimpleScalar)
+ Measures breakdowns exactly, provided the model is accurate
+ Models the micro-architecture faithfully
+ Runs (some) realistic workloads
Method of choice for many micro-architects (and for you)
Improving CPI
This course is more about improving CPI than frequency
Historically, the clock accounts for 70%+ of performance improvement
Achieved via deeper pipelines
This has changed:
Deep pipelining is not power efficient
Physical speed limits are approaching
1 GHz: 1999, 2 GHz: 2001, 3 GHz: 2002, 3.8 GHz: 2004, 5 GHz: 2008
Intel Core 2: 1.8-3.2 GHz: 2008
Techniques we will look at:
Caching, speculation, multiple issue, out-of-order issue, multiprocessing, and more
Moore's Law helps because CPI reduction requires transistors
Parallelism, by definition, means more transistors
But the best example is caches
Another CPI Example
Assume a processor with the following instruction frequencies and costs:
Integer ALU: 50%, 1 cycle
Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles
Which change would improve performance more?
A. Branch prediction to reduce branch cost to 1 cycle?
B. A bigger data cache to reduce load cost to 3 cycles?
Compute CPI:
Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2.0
A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8
B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6, so B improves performance more
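The A-vs-B comparison, checked mechanically:

```python
def cpi(mix):
    """Weighted CPI from a {name: (frequency, cycles)} instruction mix."""
    return sum(f * c for f, c in mix.values())

base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
opt_a = dict(base, branch=(0.2, 1))  # A: branch prediction, branches cost 1 cycle
opt_b = dict(base, load=(0.2, 3))    # B: bigger data cache, loads cost 3 cycles

print(f"Base: {cpi(base):.1f}")   # 2.0
print(f"A:    {cpi(opt_a):.1f}")  # 1.8
print(f"B:    {cpi(opt_b):.1f}")  # 1.6 -- the bigger cache wins
```

B wins because loads are both frequent (20%) and expensive (5 cycles), so shaving 2 cycles off them removes more weighted cost than shaving 1 cycle off branches.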
CPI Example 3

Operation   Frequency   Cycles
ALU         45%         1
Load        20%         1
Store       15%         2
Branch      20%         2
You can reduce the store cost to 1 cycle, but doing so slows the clock by 20%
Old CPI = 0.45*1 + 0.20*1 + 0.15*2 + 0.20*2 = 1.35
New CPI = 0.45*1 + 0.20*1 + 0.15*1 + 0.20*2 = 1.20
Speedup = old time / new time = 1.35 / (1.20 * 1.2) = 0.94, i.e., a slowdown
Now, if ALU ops were originally 2 cycles and stores were 1 cycle, and you could reduce ALU ops to 1 cycle while slowing the clock by 20%:
Old CPI = 0.45*2 + 0.20*1 + 0.15*1 + 0.20*2 = 1.65; new CPI = 1.20
Speedup = 1.65 / (1.20 * 1.2) = 1.15, so this optimization helps
Example of Amdahl's law: you don't want to speed up a small fraction to the detriment of the rest
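Both scenarios can be checked with the same speedup formula: old time over new time, with the new cycle time 1.2x longer because the clock is 20% slower.

```python
def cpi(mix):
    """Weighted CPI from (frequency, cycles) pairs."""
    return sum(f * c for f, c in mix)

def speedup(old_cpi, new_cpi, clock_penalty=1.2):
    """Old time / new time; instruction count unchanged, new cycle time 1.2x."""
    return old_cpi / (new_cpi * clock_penalty)

# Scenario 1: store cost 2 -> 1 cycle, clock 20% slower.
old1 = cpi([(0.45, 1), (0.20, 1), (0.15, 2), (0.20, 2)])  # 1.35
new1 = cpi([(0.45, 1), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.20
print(f"{speedup(old1, new1):.4f}")  # below 1: a slowdown

# Scenario 2: ALU cost 2 -> 1 cycle (stores already 1 cycle), clock 20% slower.
old2 = cpi([(0.45, 2), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.65
new2 = cpi([(0.45, 1), (0.20, 1), (0.15, 1), (0.20, 2)])  # 1.20
print(f"{speedup(old2, new2):.4f}")  # above 1: a win
```

Speeding up the infrequent stores loses; speeding up the frequent, expensive ALU ops wins despite the same clock penalty.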
Performance Rule of Thumb: Amdahl's Law
f: fraction that can run in parallel (be sped up)
1 - f: fraction that must run serially
Speed-up = Time(1 CPU) / Time(n CPUs) = 1 / [(1 - f) + f/n]
Pretty good ideal scaling for a modest number of cores
Large numbers of cores require a lot of parallelism
Amdahl's Law is not just about parallelism: f is any fraction that you can speed up
Your performance will always be limited by the (1 - f) part
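Amdahl's Law as code; the second print shows the 1/(1 - f) ceiling that no number of CPUs can beat:

```python
def amdahl_speedup(f, n):
    """Overall speedup when fraction f runs in parallel on n CPUs."""
    return 1 / ((1 - f) + f / n)

# Even with 95% of the work parallel, 64 CPUs fall far short of 64x,
# and infinite CPUs can never beat 1 / (1 - f) = 20x.
print(f"{amdahl_speedup(0.95, 64):.1f}x")
print(f"{amdahl_speedup(0.95, 10**9):.1f}x")
```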
Summary
Latency = seconds/program =
(instructions/program) * (cycles/instruction) * (seconds/cycle)
Instructions/program: dynamic instruction count
Function of program, compiler, instruction set architecture (ISA)
[Figures from the Amdahl's Law slides: speedup vs. number of CPUs, plotted for up to 16 and up to 64 CPUs]
Cycles/instruction: CPI
Function of program, compiler, ISA, micro-architecture
Seconds/cycle: clock period
Function of micro-architecture, technology parameters
To improve performance, optimize each component
Focus mostly on CPI in this course
Other Metrics
Will (try to) come back to:
Cost
Power
Reliability
Interested in learning more? Grad school