
Page 1: 3 Processor Designs

The Evolution of High Performance Computers

David Chisnall

February 8, 2011

Page 2: 3 Processor Designs

Outline

Page 3: 3 Processor Designs

Classic Computer Architecture

Fetch → Decode → Execute

Author's Note
Comment
Early processors were not pipelined at all. The simplest pipeline looks roughly like this. A modern processor typically has a much longer pipeline, but each of its stages roughly fits into one of these, so this is a useful conceptual model when thinking about what the machine is doing.
Page 4: 3 Processor Designs

Execute?

Load:

1. Read value from memory
2. Store in register

Store:

1. Read value from register
2. Store in memory

Calculate:

1. Read value(s) from register(s)
2. Perform calculation
3. Store result in register

Author's Note
Comment
The execute phase is the part that we really care about - that's where the processor does what we tell it. In a typical load-store architecture, the instructions are grouped into one of these sets, with the last set being the one that does real work.
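As a rough illustration (this code is not from the slides; the mapping of each statement to an instruction group is an assumption about how a typical load-store compiler would translate it):

int add_through_memory(int *p, int *q, int *r)
{
    int a = *p;     /* load:  read value from memory, store in register */
    int b = *q;     /* load:  read value from memory, store in register */
    int c = a + b;  /* ALU:   read values from registers, perform calculation,
                               store result in register */
    *r = c;         /* store: read value from register, store in memory */
    return c;
}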
Page 5: 3 Processor Designs

Calculation?

• Add or subtract?

• Shift?

• Multiply?

• Divide?

• Evaluate polynomial?

• Discrete Cosine Transform?

• Exact set of operations depends on the computer

Author's Note
Comment
The operations that a CPU actually has hardware to execute are not always the same as the ones that it accepts instructions for. In the x86 architecture, you have things like string copy instructions, which are implemented in microcode - a single instruction can copy 4GB of data, so it takes a lot more than one cycle to complete. The VAX's evaluate polynomial instruction was a famous example - basically a subroutine that looked like an instruction. On some chips, however, you have very complex operations implemented directly in hardware, for accelerating things like video decoding and cryptography. Using these, instead of the general-purpose hardware, can make things go a lot faster.
Page 6: 3 Processor Designs

A Simple Case

r1

r2

Adder r3

Author's Note
Comment
A typical execute phase - for an instruction that isn't microcoded - involves connecting two registers to the input lines of an arithmetic logic unit (ALU) and the output lines to another register. The time taken for the operation may be several cycles - often the ALU is pipelined internally, so you can issue a new operation to it every cycle (for example) but you won't get the result for 5 cycles.
Page 7: 3 Processor Designs

Making it Faster: Vectors

• Each register stores multiple values

• Operations on components

Author's Note
Comment
The idea behind single instruction multiple data (SIMD), or vector, programming is to increase the amount of 'execute' that you do for the same amount of 'fetch' and 'decode'. You fetch one instruction, decode one instruction, but then run that instruction on multiple inputs. Early supercomputers were the first vector processors, but now you'll find a vector coprocessor in almost all modern processors.
Page 8: 3 Processor Designs

Example Problem

• Pixel values: 32 bits per channel, red, green, blue, alpha

• Lighten operation scales each value by a constant factor

• Same operation on each of the colour channels

Page 9: 3 Processor Designs

Scalar Solution

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i].red   *= 1.1;
    pixels[i].green *= 1.1;
    pixels[i].blue  *= 1.1;
}

• Three loads per pixel

• Three multiplies per pixel

• Three stores per pixel

Author's Note
Comment
This simple example requires 9 operations, but it's really doing the same operation three times on three sets of data - it's therefore a perfect case for a vector processor.
Page 10: 3 Processor Designs

Vector Solution

pixel_t scale = {1.1, 1.1, 1.1, 1};

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i] *= scale;
}

• One (vector) load per pixel

• One (vector) multiply per pixel

• One (vector) store per pixel

• Redundant multiply (alpha × 1) costs nothing, but makes the potential speedup only 3, not 4, for a 4-element vector unit

Author's Note
Comment
This loop does the same thing with vector code. Now we have three instructions, instead of 9, but we're doing the same work. The loads may be serial, depending on the memory bandwidth, but the multiplies all run in parallel. Vectors tend to be fixed size, so we have to do 4 operations - even without the fourth, the hardware would just be sitting idle - so we make it do a redundant multiply here for the alpha channel. If you're writing vector code, or code for the compiler to auto-vectorise, then you have to think about the layout of your data to make it possible. We can only vectorise this code trivially because the red, green, and blue values for a pixel are all adjacent in memory.
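As a sketch of what this might look like in compilable C (the slides don't define pixel_t; here it is assumed to be four packed floats using the GCC/Clang vector extension, which is what makes *= on a whole pixel legal):

/* Assumption: pixel_t is four 32-bit floats in one 128-bit vector. */
typedef float pixel_t __attribute__((vector_size(16)));

void lighten(pixel_t *pixels, int imageSize)
{
    const pixel_t scale = {1.1f, 1.1f, 1.1f, 1.0f};  /* alpha scaled by 1 */
    for (int i = 0; i < imageSize; i++)
    {
        pixels[i] *= scale;  /* one vector load, one multiply, one store */
    }
}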
Page 11: 3 Processor Designs

Why is this faster?

• One fetch, one decode, four executes

• Fewer instructions, less space used for storing the program

Author's Note
Comment
Next lecture we look at cache, to see why the loads can be very fast. The multiply operations are going to all happen in parallel, so they're obviously faster. Fewer instructions also means better instruction cache usage, which can be very important for performance with large programs.
Page 12: 3 Processor Designs

Memory Layout

• For vectors to be fast, you must be able to load vectors from memory

• If you separate the channels, this would not be possible

• Modern compilers can auto-vectorise code with a usable memory layout (see the sketch below)
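A sketch of the two layouts (the names and sizes are illustrative, not from the slides):

/* Interleaved layout: a pixel's four channels are adjacent in memory,
 * so a single contiguous vector load fetches red, green, blue and alpha. */
struct pixel { float red, green, blue, alpha; };
struct pixel image[1024];

/* Separated channels: the channels of one pixel are far apart in memory,
 * so the lighten loop no longer maps onto one vector load per pixel. */
struct separated_image {
    float red[1024];
    float green[1024];
    float blue[1024];
    float alpha[1024];
};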

Page 13: 3 Processor Designs

Even Faster: Parallel Streams

• Each loop iteration is independent

• We can do them in parallel

Author's Note
Comment
In theory, every single one of our loop iterations could run in parallel, because none of the iterations depends on the outcome of the one before it. This is known as an embarrassingly parallel problem.
Page 14: 3 Processor Designs

Bigger Vectors?

pixel_pair_t scale = {1.1, 1.1, 1.1, 1,
                      1.1, 1.1, 1.1, 1};
/* Treat adjacent pixels as pairs */
pixel_pair_t *pairs = (pixel_pair_t *)pixels;

for (int i=0 ; i<imageSize/2 ; i++)
{
    pairs[i] *= scale;
}

• Now we can process two pixels at once

• Some modern systems have 256-bit vectors

• Really hard to map most problems to them

• Most vector units are only 128-bit

Author's Note
Comment
Vector units can be any size, but generally they're only 128 bits wide. Earlier, we saw that this simple algorithm is only getting 3/4 of the performance from a 4-element vector. For a wider vector, you'd find it even harder to fill most of the execution units. You also have problems with memory bandwidth - loading a 1KB vector would take so long that you'd need to keep it in registers for a long time and do a lot of things with it for it to be worthwhile. With smaller vectors, it's easier to interleave the load, execute, and store instructions.
Page 15: 3 Processor Designs

Independent Processors?

• Split loop into two or more loops

• Run each on a different processor

Author's Note
Comment
Most modern CPUs are now multicore. This is a very simple way of making processors faster: just put two of them on the same die. We'll look at the kind of algorithm that does well on this kind of architecture, and some of the problems with it, later on in this module.
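A minimal sketch of splitting the loop across two cores, assuming POSIX threads and the vector pixel_t from earlier (the names lighten_range and lighten_parallel are made up for illustration):

#include <pthread.h>

typedef float pixel_t __attribute__((vector_size(16)));

struct range { pixel_t *pixels; int start, end; };

static void *lighten_range(void *arg)
{
    struct range *r = arg;
    const pixel_t scale = {1.1f, 1.1f, 1.1f, 1.0f};
    for (int i = r->start; i < r->end; i++)
        r->pixels[i] *= scale;
    return NULL;
}

void lighten_parallel(pixel_t *pixels, int imageSize)
{
    /* First half on a second thread, second half on this one. */
    struct range first  = { pixels, 0, imageSize / 2 };
    struct range second = { pixels, imageSize / 2, imageSize };
    pthread_t worker;
    pthread_create(&worker, NULL, lighten_range, &first);
    lighten_range(&second);
    pthread_join(worker, NULL);
}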
Page 16: 3 Processor Designs

Hybrid Solution: Semi-independent Cores

• Each core runs a group of threads

• When threads are executing the same thing, all run

• When they branch, only one runs

• Typical GPU architecture

Author's Note
Comment
Modern GPUs implement something that looks like a load of independent cores to the programmer, but is really a smaller number of SIMD (vector) cores.
Page 17: 3 Processor Designs

What the Programmer Sees

• One program (kernel) per input (e.g. per pixel)

• Conceptually all run in parallel

• In practice, 1–128 run in parallel

Author's Note
Comment
Encouraging the programmer to write parallel code is something compiler writers love. It's trivial to take a parallel algorithm and run it on a serial processor, but it's much harder to take a serial algorithm and run it on a parallel processor. If the compiler sees that you have 10,000 iterations that are independent, and you have 100 processors, it can run 100 iterations on each processor.
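For the lighten example, the per-pixel kernel might look something like this (a hedged sketch in OpenCL C; the argument names and qualifiers are assumptions, not code from the lecture):

__kernel void lighten(__global const float4 *input,
                      __global float4 *output)
{
    int i = get_global_id(0);  /* which pixel this instance handles */
    float4 scale = (float4)(1.1f, 1.1f, 1.1f, 1.0f);
    output[i] = input[i] * scale;  /* every instance follows the same path */
}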
Page 18: 3 Processor Designs

How it Works

• Processor is n × m element vector processor

• Programmer sees n element vector processor

• Processor starts m elements at once

• Linear code Just Works™

• Special case for branches:

• All threads take the same branch, all continue to run

• Some take a different branch, paused, resumed later

Author's Note
Comment
The GPU will run a load of instances of the same short program (kernel) on different input data on different cores. If all of the copies of the program are following the same execution path, then it's basically a SIMD program. If they follow different branches, it isn't, so you can no longer use the SIMD processor to run them at the same time. Some are paused, while the others run, then the paused ones are allowed to resume. There are also barrier instructions that let you rendezvous kernels that temporarily took different branches but are now back in the same place. This model lets you write code that doesn't look like it's vector code, but run it on a vector processor.
Page 19: 3 Processor Designs

Bad Example

__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    // Lighten even pixels, darken odd pixels
    if (i % 2)
    {
        output[i] = input[i] * 1.1f;
    }
    else
    {
        output[i] = input[i] * 0.9f;
    }
}

• Each pair of threads will take different branches

• Only half will actually be running in parallel

Author's Note
Comment
This is an example of what not to do on a GPU. A GPU may run four copies of this kernel on a single core. Half will take the first branch, half will take the second. This means that only two will actually be running at a time, so you've only got half of the GPU's maximum throughput.
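One way to avoid the divergence (an illustration, not the lecture's own fix) is to choose the scale factor without a control-flow branch, so every instance follows the same path; a compiler will typically turn the conditional expression into a select rather than a jump:

__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    /* Same even/odd behaviour as the if/else version, without divergence. */
    float factor = (i % 2) ? 1.1f : 0.9f;
    output[i] = input[i] * factor;
}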
Page 20: 3 Processor Designs

Why it’s Useful

• Good for code with highly independent elements

• Higher ratio of execution units to other components than a general-purpose CPU

Author's Note
Comment
This model is harder to program for than the traditional serial model, but lets you have a lot of execution units on the processor. Each step through the fetch and decode part of the pipeline issues a lot more instructions than in a corresponding serial processor.
Page 21: 3 Processor Designs

When is it not Useful?

• Problems that don’t have independent inputs

• Algorithms that have lots of branches

Author's Note
Comment
You can compile any program for a GPU, because it's just another Turing Machine - but that doesn't mean that it will be fast. A lot of existing code will run much slower on the GPU than the CPU - it's not magic. The GPU makes some trades in favour of certain categories of algorithm, while the CPU makes trades in favour of different categories of algorithm. The trick for writing high-performance code is to target your algorithms at a processor optimised for them.
Page 22: 3 Processor Designs

So, Branches are Fast on CPUs?

• Fetch, decode, execute, is an oversimplification

• Modern pipelines are 7–20 stages

• Can’t fetch instruction after the branch until after executing the branch!

• The Pentium 4 can have around 140 instructions in flight at once

• Can have none executing if it doesn’t know what to execute, though...

Author's Note
Comment
Branches are really expensive on GPUs. Old GPUs just executed both paths and only stored the results of the taken branch, which was incredibly slow, but they're still problematic on newer ones. Even on CPUs, they're not always fast. A branch can cause a pipeline stall, which means that you have to wait for all in-flight instructions (up to about 150 on a typical CPU) to complete before you can continue. If this happens often, performance can drop to a few percent of the theoretical throughput. Branch prediction helps, but it doesn't always get things right.
Page 23: 3 Processor Designs

Not So Much Then?

• Typical CPUs do branch prediction

• Correct prediction is very fast

• Incorrect prediction is very slow

• Accuracy is about 95%

• So 5% of branches cause a pipeline stall (bad!)

Author's Note
Comment
Even 95% accurate branch prediction means that 5% of the time you've got a pipeline stall on a branch. Depending on how far apart your branches are, this can still add up to a significant performance problem. Avoiding branching is often a good idea for performance, but it can have other detrimental effects.
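As a rough worked example (the misprediction penalty and branch frequency here are assumptions for illustration, not figures from the lecture): if a mispredicted branch costs around 20 cycles and one instruction in five is a branch, then a 5% misprediction rate wastes about 0.05 × 20 = 1 cycle per branch, or roughly 0.2 cycles per instruction - a large overhead on a core that could otherwise retire several instructions per cycle.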
Page 24: 3 Processor Designs

Another Work-Around: Predicated Instructions

• Found in ARM and Itanium

• Instruction predicated on condition register

• Executes anyway

• Result is only stored if the condition was set

• No pipeline stall, but some redundant work

• Much faster for short branches

Author's Note
Comment
Predicated instructions are instructions that appear to only execute when a condition flag is set. The conditional jump instructions that we looked at for implementing if statements are very common examples of predicated instructions (they only execute if some condition is true), but ARM and Itanium also include predicated arithmetic instructions. These can be very fast, because a superscalar architecture will always execute them, in parallel with other instructions, and will only store the results back to registers / memory if they were really meant to be executed. Because the decision about whether to store the results is made at the end of the pipeline, not at the start, the result is available without stalling the pipeline.
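A small sketch of the idea (the C function and the ARM instruction sequence are illustrative assumptions about what a compiler might emit, not material from the slides):

int clamp_to_zero(int x)
{
    /* With a branch this would be: if (x < 0) x = 0;
     * On ARM a compiler can instead emit roughly:
     *     CMP   r0, #0
     *     MOVLT r0, #0   ; result is kept only when x < 0
     * so there is no jump for the pipeline to predict. */
    return (x < 0) ? 0 : x;
}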
Page 25: 3 Processor Designs

Avoiding Branches: Loop Unrolling
What are funroll loops?

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i] *= scale;
}

• One branch per pixel

• Should be correctly predicted...

• But still wastes some time

Author's Note
Comment
Each loop iteration includes a jump, which adds some overhead and makes it hard for some other optimisations to take place. For example, some superscalar processors will not execute instructions in parallel if there is a conditional jump between them.
Page 26: 3 Processor Designs

Avoiding Branches: Loop Unrolling
What are funroll loops?

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i]   *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
}

• One branch per four pixels

• Code is bigger, but should be faster

Don’t do this yourself! The compiler will do it for you!

Author's Note
Comment
This is a common compiler optimisation, enabled by -funroll-loops in GCC. Old C textbooks will tell you to do this yourself, but it's generally a bad idea - the compiler is probably better at it than you. This reduces branching, but increases code size. The exact trade depends on the target architecture, which is why it's usually a good idea to leave the decision up to the compiler.
Page 27: 3 Processor Designs

Gotcha: Short-Circuit Evaluation

if (a() || b() || c())
    doStuff();

What this really means:

if (a())
    doStuff();
else if (b())
    doStuff();
else if (c())
    doStuff();

One if statement, three branches!

Author's Note
Comment
In C, the logical operators do short-circuit evaluation, meaning that they will skip later clauses if the result can be known from the earlier ones. This increases the amount of branching. In this example, it doesn't matter, because you have a branch for each function call, so short-circuit evaluation will probably make things faster - or, at least, not noticeably slower. In other cases, the condition may just be simple arithmetic, and then it's faster to avoid the branching.
Page 28: 3 Processor Designs

Avoiding Short-Circuiting

int condition = a();
condition = b() || condition;
condition = c() || condition;
if (condition)
    doStuff();

• All of the subexpressions are evaluated.
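As an aside (not from the slides), the bitwise | operator gives a more compact version of the same idea: it never short-circuits, so all three calls always run and only the final test remains a branch.

if (a() | b() | c())
    doStuff();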

Page 29: 3 Processor Designs

Questions?