
Page 1: 3 Processor Designs

The Evolution of High Performance Computers

David Chisnall

February 8, 2011

Page 2: 3 Processor Designs

Outline

Page 3: 3 Processor Designs

Classic Computer Architecture

Fetch → Decode → Execute

Author's Note
Comment
Early processors were not pipelined at all. The simplest pipeline looks roughly like this. A modern processor typically has a much longer pipeline, but each of its stages roughly fits into one of these, so this is a useful conceptual model when thinking about what the machine is doing.
Page 4: 3 Processor Designs

Execute?

Load:

1. Read value from memory
2. Store in register

Store:

1. Read value from register
2. Store in memory

Calculate:

1. Read value(s) from register(s)
2. Perform calculation
3. Store result in register

Author's Note
Comment
The execute phase is the part that we really care about - that's where the processor does what we tell it. In a typical load-store architecture, the instructions are grouped into one of these sets, with the last set being the one that does real work.
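As a rough illustration (this code is not from the slides; the mapping of each statement to an instruction group is an assumption about how a typical load-store compiler would translate it):

int add_through_memory(int *p, int *q, int *r)
{
    int a = *p;     /* load:  read value from memory, store in register */
    int b = *q;     /* load:  read value from memory, store in register */
    int c = a + b;  /* ALU:   read values from registers, perform calculation,
                               store result in register */
    *r = c;         /* store: read value from register, store in memory */
    return c;
}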
Page 5: 3 Processor Designs

Calculation?

• Add or subtract?

• Shift?

• Multiply?

• Divide?

• Evaluate polynomial?

• Discrete Cosine Transform?

• Exact set of operations depends on the computer

Author's Note
Comment
The operations that a CPU actually has hardware to execute are not always the same as the ones that it accepts instructions for. In the x86 architecture, you have things like string copy instructions, which are implemented in microcode - a single instruction can copy 4GB of data, so it takes a lot more than one cycle to complete. The VAX's evaluate polynomial instruction was a famous example - basically a subroutine that looked like an instruction. On some chips, however, you have very complex operations implemented directly in hardware, for accelerating things like video decoding and cryptography. Using these, instead of the general-purpose hardware, can make things go a lot faster.
Page 6: 3 Processor Designs

A Simple Case

r1

r2

Adder r3

Author's Note
Comment
A typical execute phase - for an instruction that isn't microcoded - involves connecting two registers to the input lines of an arithmetic logic unit (ALU) and the output lines to another register. The time taken for the operation may be several cycles - often the ALU is pipelined internally, so you can issue a new operation to it every cycle (for example) but you won't get the result for 5 cycles.
Page 7: 3 Processor Designs

Making it Faster: Vectors

• Each register stores multiple values

• Operations on components

Author's Note
Comment
The idea behind single instruction multiple data (SIMD), or vector, programming is to increase the amount of 'execute' that you do for the same amount of 'fetch' and 'decode'. You fetch one instruction, decode one instruction, but then run that instruction on multiple inputs. Early supercomputers were the first vector processors, but now you'll find a vector coprocessor in almost all modern processors.
Page 8: 3 Processor Designs

Example Problem

• Pixel values: 32 bits per channel, red, green, blue, alpha

• Lighten operation scales each value by a constant factor

• Same operation on each of the colour channels

Page 9: 3 Processor Designs

Scalar Solution

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i].red   *= 1.1;
    pixels[i].green *= 1.1;
    pixels[i].blue  *= 1.1;
}

• Three loads per pixel

• Three multiplies per pixel

• Three stores per pixel

Author's Note
Comment
This simple example requires 9 operations, but it's really doing the same operation three times on three sets of data - it's therefore a perfect case for a vector processor.
Page 10: 3 Processor Designs

Vector Solution

pixel_t scale = {1.1, 1.1, 1.1, 1};

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i] *= scale;
}

• One (vector) load per pixel

• One (vector) multiply per pixel

• One (vector) store per pixel

• Redundant multiply (alpha × 1) costs nothing, but makes the potential speedup only 3, not 4, for a 4-element vector unit

Author's Note
Comment
This loop does the same thing with vector code. Now we have three instructions, instead of 9, but we're doing the same work. The loads may be serial, depending on the memory bandwidth, but the multiplies all run in parallel. Vectors tend to be fixed size, so we have to do 4 operations - even without the fourth, the hardware would just be sitting idle - so we make it do a redundant multiply here for the alpha channel. If you're writing vector code, or code for the compiler to auto-vectorise, then you have to think about the layout of your data to make it possible. We can only vectorise this code trivially because the red, green, and blue values for a pixel are all adjacent in memory.
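As a sketch of what this might look like in compilable C (the slides don't define pixel_t; here it is assumed to be four packed floats using the GCC/Clang vector extension, which is what makes *= on a whole pixel legal):

/* Assumption: pixel_t is four 32-bit floats in one 128-bit vector. */
typedef float pixel_t __attribute__((vector_size(16)));

void lighten(pixel_t *pixels, int imageSize)
{
    const pixel_t scale = {1.1f, 1.1f, 1.1f, 1.0f};  /* alpha scaled by 1 */
    for (int i = 0; i < imageSize; i++)
    {
        pixels[i] *= scale;  /* one vector load, one multiply, one store */
    }
}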
Page 11: 3 Processor Designs

Why is this faster?

• One fetch, one decode, four executes

• Fewer instructions, less space used for storing the program

Author's Note
Comment
Next lecture we look at cache, to see why the loads can be very fast. The multiply operations are going to all happen in parallel, so they're obviously faster. Fewer instructions also means better instruction cache usage, which can be very important for performance with large programs.
Page 12: 3 Processor Designs

Memory Layout

• For vectors to be fast, you must be able to load vectors from memory

• If you separate the channels, this would not be possible

• Modern compilers can auto-vectorise code with a usable memory layout (see the sketch below)
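A sketch of the two layouts (the names and sizes are illustrative, not from the slides):

/* Interleaved layout: a pixel's four channels are adjacent in memory,
 * so a single contiguous vector load fetches red, green, blue and alpha. */
struct pixel { float red, green, blue, alpha; };
struct pixel image[1024];

/* Separated channels: the channels of one pixel are far apart in memory,
 * so the lighten loop no longer maps onto one vector load per pixel. */
struct separated_image {
    float red[1024];
    float green[1024];
    float blue[1024];
    float alpha[1024];
};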

Page 13: 3 Processor Designs

Even Faster: Parallel Streams

• Each loop iteration is independent

• We can do them in parallel

Author's Note
Comment
In theory, every single one of our loop iterations could run in parallel, because none of the iterations depends on the outcome of the one before it. This is known as an embarrassingly parallel problem.
Page 14: 3 Processor Designs

Bigger Vectors?

pixel_pair_t scale = {1.1, 1.1, 1.1, 1,
                      1.1, 1.1, 1.1, 1};
/* Treat adjacent pixels as pairs */
pixel_pair_t *pairs = (pixel_pair_t *)pixels;

for (int i=0 ; i<imageSize/2 ; i++)
{
    pairs[i] *= scale;
}

• Now we can process two pixels at once

• Some modern systems have 256-bit vectors

• Really hard to map most problems to them

• Most vector units are only 128-bit

Author's Note
Comment
Vector units can be any size, but generally they're only 128 bits wide. Earlier, we saw that this simple algorithm is only getting 3/4 of the performance from a 4-element vector. For a wider vector, you'd find it even harder to fill most of the execution units. You also have problems with memory bandwidth - loading a 1KB vector would take so long that you'd need to keep it in registers for a long time and do a lot of things with it for it to be worthwhile. With smaller vectors, it's easier to interleave the load, execute, and store instructions.
Page 15: 3 Processor Designs

Independent Processors?

• Split loop into two or more loops

• Run each on a different processor

Author's Note
Comment
Most modern CPUs are now multicore. This is a very simple way of making processors faster: just put two of them on the same die. We'll look at the kind of algorithm that does well on this kind of architecture, and some of the problems with it, later on in this module.
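A minimal sketch of splitting the loop across two cores, assuming POSIX threads and the vector pixel_t from earlier (the names lighten_range and lighten_parallel are made up for illustration):

#include <pthread.h>

typedef float pixel_t __attribute__((vector_size(16)));

struct range { pixel_t *pixels; int start, end; };

static void *lighten_range(void *arg)
{
    struct range *r = arg;
    const pixel_t scale = {1.1f, 1.1f, 1.1f, 1.0f};
    for (int i = r->start; i < r->end; i++)
        r->pixels[i] *= scale;
    return NULL;
}

void lighten_parallel(pixel_t *pixels, int imageSize)
{
    /* First half on a second thread, second half on this one. */
    struct range first  = { pixels, 0, imageSize / 2 };
    struct range second = { pixels, imageSize / 2, imageSize };
    pthread_t worker;
    pthread_create(&worker, NULL, lighten_range, &first);
    lighten_range(&second);
    pthread_join(worker, NULL);
}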
Page 16: 3 Processor Designs

Hybrid Solution: Semi-independent Cores

• Each core runs a group of threads

• When threads are executing the same thing, all run

• When they branch, only one runs

• Typical GPU architecture

Author's Note
Comment
Modern GPUs implement something that looks like a load of independent cores to the programmer, but is really a smaller number of SIMD (vector) cores.
Page 17: 3 Processor Designs

What the Programmer Sees

• One program (kernel) per input (e.g. per pixel)

• Conceptually all run in parallel

• In practice, 1–128 run in parallel

Author's Note
Comment
Encouraging the programmer to write parallel code is something compiler writers love. It's trivial to take a parallel algorithm and run it on a serial processor, but it's much harder to take a serial algorithm and run it on a parallel processor. If the compiler sees that you have 10,000 iterations that are independent, and you have 100 processors, it can run 100 iterations on each processor.
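For the lighten example, the per-pixel kernel might look something like this (a hedged sketch in OpenCL C; the argument names and qualifiers are assumptions, not code from the lecture):

__kernel void lighten(__global const float4 *input,
                      __global float4 *output)
{
    int i = get_global_id(0);  /* which pixel this instance handles */
    float4 scale = (float4)(1.1f, 1.1f, 1.1f, 1.0f);
    output[i] = input[i] * scale;  /* every instance follows the same path */
}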
Page 18: 3 Processor Designs

How it Works

• Processor is n × m element vector processor

• Programmer sees n element vector processor

• Processor starts m elements at once

• Linear code Just Works™

• Special case for branches:

• All threads take the same branch, all continue to run

• Some take a different branch, paused, resumed later

Author's Note
Comment
The GPU will run a load of instances of the same short program (kernel) on different input data on different cores. If all of the copies of the program are following the same execution path, then it's basically a SIMD program. If they follow different branches, it isn't, so you can no longer use the SIMD processor to run them at the same time. Some are paused, while the others run, then the paused ones are allowed to resume. There are also barrier instructions that let you rendezvous kernels that temporarily took different branches but are now back in the same place. This model lets you write code that doesn't look like it's vector code, but run it on a vector processor.
Page 19: 3 Processor Designs

Bad Example

__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    // Lighten even pixels, darken odd pixels
    if (i % 2)
    {
        output[i] = input[i] * 1.1f;
    }
    else
    {
        output[i] = input[i] * 0.9f;
    }
}

• Each pair of threads will take different branches

• Only half will actually be running in parallel

Author's Note
Comment
This is an example of what not to do on a GPU. A GPU may run four copies of this kernel on a single core. Half will take the first branch, half will take the second. This means that only two will actually be running at a time, so you've only got half of the GPU's maximum throughput.
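One way to avoid the divergence (an illustration, not the lecture's own fix) is to choose the scale factor without a control-flow branch, so every instance follows the same path; a compiler will typically turn the conditional expression into a select rather than a jump:

__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    /* Same even/odd behaviour as the if/else version, without divergence. */
    float factor = (i % 2) ? 1.1f : 0.9f;
    output[i] = input[i] * factor;
}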
Page 20: 3 Processor Designs

Why it’s Useful

• Good for code with highly independent elements

• Higher ratio of execution units to other components than a general-purpose CPU

Author's Note
Comment
This model is harder to program for than the traditional serial model, but lets you have a lot of execution units on the processor. Each step through the fetch and decode part of the pipeline issues a lot more instructions than in a corresponding serial processor.
Page 21: 3 Processor Designs

When is it not Useful?

• Problems that don’t have independent inputs

• Algorithms that have lots of branches

Author's Note
Comment
You can compile any program for a GPU, because it's just another Turing Machine - but that doesn't mean that it will be fast. A lot of existing code will run much slower on the GPU than the CPU - it's not magic. The GPU makes some trades in favour of certain categories of algorithm, while the CPU makes trades in favour of different categories of algorithm. The trick for writing high-performance code is to target your algorithms at a processor optimised for them.
Page 22: 3 Processor Designs

So, Branches are Fast on CPUs?

• Fetch, decode, execute, is an oversimplification

• Modern pipelines are 7–20 stages

• Can’t fetch instruction after the branch until after executing the branch!

• The Pentium 4 can have around 140 instructions in flight at once

• Can have none executing if it doesn’t know what to execute, though...

Author's Note
Comment
Branches are really expensive on GPUs. Old GPUs just executed both paths and only stored the results of the taken branch, which was incredibly slow, but they're still problematic on newer ones. Even on CPUs, they're not always fast. A branch can cause a pipeline stall, which means that you have to wait for all in-flight instructions (up to about 150 on a typical CPU) to complete before you can continue. If this happens often, performance can drop to a few percent of the theoretical throughput. Branch prediction helps, but it doesn't always get things right.
Page 23: 3 Processor Designs

Not So Much Then?

• Typical CPUs do branch prediction

• Correct prediction is very fast

• Incorrect prediction is very slow

• Accuracy is about 95%

• So 5% of branches cause a pipeline stall (bad!)

Author's Note
Comment
Even 95% accurate branch prediction means that 5% of the time you've got a pipeline stall on a branch. Depending on how far apart your branches are, this can still add up to a significant performance problem. Avoiding branching is often a good idea for performance, but it can have other detrimental effects.
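As a rough worked example (the misprediction penalty and branch frequency here are assumptions for illustration, not figures from the lecture): if a mispredicted branch costs around 20 cycles and one instruction in five is a branch, then a 5% misprediction rate wastes about 0.05 × 20 = 1 cycle per branch, or roughly 0.2 cycles per instruction - a large overhead on a core that could otherwise retire several instructions per cycle.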
Page 24: 3 Processor Designs

Another Work-Around: Predicated Instructions

• Found in ARM and Itanium

• Instruction predicated on condition register

• Executes anyway

• Result is only stored if the condition was set

• No pipeline stall, but some redundant work

• Much faster for short branches

Author's Note
Comment
Predicated instructions are instructions that appear to only execute when a condition flag is set. The conditional jump instructions that we looked at for implementing if statements are very common examples of predicated instructions (they only execute if some condition is true), but ARM and Itanium also include predicated arithmetic instructions. These can be very fast, because a superscalar architecture will always execute them, in parallel with other instructions, and will only store the results back to registers / memory if they were really meant to be executed. Because the decision about whether to store the results is made at the end of the pipeline, not at the start, the result is available without stalling the pipeline.
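A small sketch of the idea (the C function and the ARM instruction sequence are illustrative assumptions about what a compiler might emit, not material from the slides):

int clamp_to_zero(int x)
{
    /* With a branch this would be: if (x < 0) x = 0;
     * On ARM a compiler can instead emit roughly:
     *     CMP   r0, #0
     *     MOVLT r0, #0   ; result is kept only when x < 0
     * so there is no jump for the pipeline to predict. */
    return (x < 0) ? 0 : x;
}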
Page 25: 3 Processor Designs

Avoiding Branches: Loop Unrolling
What are funroll loops?

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i] *= scale;
}

• One branch per pixel

• Should be correctly predicted...

• But still wastes some time

Author's Note
Comment
Each loop iteration includes a jump, which adds some overhead and makes it hard for some other optimisations to take place. For example, some superscalar processors will not execute instructions in parallel if there is a conditional jump between them.
Page 26: 3 Processor Designs

Avoiding Branches: Loop Unrolling
What are funroll loops?

for (int i=0 ; i<imageSize ; i++)
{
    pixels[i]   *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
    pixels[++i] *= scale;
}

• One branch per four pixels

• Code is bigger, but should be faster

Don’t do this yourself! The compiler will do it for you!

Author's Note
Comment
This is a common compiler optimisation, enabled by -funroll-loops in GCC. Old C textbooks will tell you to do this yourself, but it's generally a bad idea - the compiler is probably better at it than you. This reduces branching, but increases code size. The exact trade depends on the target architecture, which is why it's usually a good idea to leave the decision up to the compiler.
Page 27: 3 Processor Designs

Gotcha: Short-Circuit Evaluation

if (a() || b() || c())
    doStuff();

What this really means:

if (a())
    doStuff();
else if (b())
    doStuff();
else if (c())
    doStuff();

One if statement, three branches!

Author's Note
Comment
In C, the logical operators do short-circuit evaluation, meaning that they will skip later clauses if the result can be known from the earlier ones. This increases the amount of branching. In this example, it doesn't matter, because you have a branch for each function call, so short-circuit evaluation will probably make things faster - or, at least, not noticeably slower. In other cases, the condition may just be simple arithmetic, and then it's faster to avoid the branching.
Page 28: 3 Processor Designs

Avoiding Short-Circuiting

int condition = a();
condition = b() || condition;
condition = c() || condition;
if (condition)
    doStuff();

• All of the subexpressions are evaluated.
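As an aside (not from the slides), the bitwise | operator gives a more compact version of the same idea: it never short-circuits, so all three calls always run and only the final test remains a branch.

if (a() | b() | c())
    doStuff();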

Page 29: 3 Processor Designs

Questions?