Performance Tuning

John Black

CS 425

UNR, Fall 2000

Why Go Fast?

• Real Time Systems

• Solve Bigger Problems

• Less waiting time

• Better Simulations, Video Games, etc.

• Moore’s Law: every 18 months, CPU speeds double for constant cost
– Has held true for the last 30 years!

How do we go Faster?

• Find the Bottlenecks!
– Hardware problems
– Bad algorithms/data structures
– I/O bound (disk, network, peripherals)

• If none of these is the problem, we have to hand-tune

Hand Tuning

• Use a “profiler”
– 80% of time is spent in 20% of the code
– Find this 20% and tune it by hand
– Do NOT waste time tuning code which is not a bottleneck

• How can we hand-tune?

Hand Tuning (cont.)

• Exploit architecture-dependent features

(of course this approach limits portability)

• We focus on the Pentium family
– Memory system
– Pipeline

Memory System

• Modern processors have a hierarchical memory system
– Memory units nearer the processor are faster but smaller

– Registers: about 40 32-bit registers
– Level 1 cache: 32K
– Level 2 cache: 256K or 512K on the P3
– Main memory

Common Memory-Related Bottlenecks

• Alignment: the requirement that an accessed object lie at an address which is a multiple of 16 or 32 or 64, etc.

• For the Pentium Pro, P2, and P3, there is no penalty for misaligned data (unless you cross a cache-line boundary)
– Cache lines are 32-byte cells in the caches
– We always read 32 bytes at a time

Locality

• Since we fetch 32 bytes at a time, accessing memory sequentially (or in a small neighborhood) is efficient
– This is called “spatial locality”

• Since cache contents are aged (with an LRU algorithm) and eventually kicked out, accessing items repeatedly is efficient
– This is called “temporal locality”

Digression on Big-Oh

• We learn in algorithms class that we are concerned only with the asymptotic running time on a RAM machine
– With new architectures this may no longer be a good way to measure performance
– An O(n lg n) algorithm may be FASTER than an O(n) algorithm due to architectural considerations

Pipelining

• Modern processors execute many instructions in parallel to increase speed

Decode → Get Args → Execute → Retire

An instruction may be getting decoded at the same time as another is getting its arguments and another is executing; this parallelism greatly speeds up the processor

Is a Pentium RISC or CISC?

• The Pentium instruction set is CISC (Complex Instruction Set Computer), but under the hood instructions are translated into micro-ops that run on a RISC (Reduced Instruction Set Computer) core

• The RISC instructions (micro-ops) are what is actually pipelined

How does this Affect Tuning?

• If the pipeline is not fully utilized, we can improve performance by fixing this
– Reorder instructions to reduce dependencies and “pipeline stalls”
– Avoid mispredicted branches with loop unrolling
– Avoid function calls with inlining

The Downside

• All this tuning makes code harder to write, maintain, debug, port, etc.

• Assembly language may be required, which has all the above drawbacks