What is a Pipeline Processor


All regular ChipGeek readers have undoubtedly read about the number of pipeline stages each processor has. Their number and use are big factors in overall performance, and they can really speed up or slow down certain types of code. But what is a pipeline, and why is it useful?

A pipeline is a whole task that has been broken out into smaller sub-tasks. The concept actually has its roots in

    mass production manufacturing plants, such as Ford Motor Company. Henry Ford determined long ago that even though it took

    several hours to physically build a car, he could actually produce a car a minute if he broke out all of the steps required to put a car

together into different physical stations on an assembly line. As such, one station was responsible for putting in the engine, another the tires, another the seats, and so on.

    Using this logic, when the car assembly line was initially turned on it still took several hours to get the first car to come off the end

    and be finished, but since everything was being done in steps or stages, the second car was right behind it and was almost

    completed when the first one rolled off. This followed with the third, fourth, and so on. Thus the assembly line was formed, and mass

    production became a reality.

    In computers, the same basic logic applies, but rather than producing something physical on an assembly line, it is the workload

    itself (required to carry out the task at hand) that gets broken down into smaller stages, called the pipeline.

Consider a simple operation. Suppose we need to take two numbers, multiply them together, and then store the result. As humans, we would just look at the numbers and multiply them (or, if they're too big, punch them into a calculator) and then write down the result. We wouldn't give much thought to the process; we would just do it.

Computers aren't that smart; they have to be told exactly how to do everything. So, a programmer would have to tell the computer where the first number was, where the second number was, what operation to perform (a multiply), and then where to store the result.
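To make that concrete, here is a minimal sketch in Python (purely illustrative; the addresses and values are invented, and a real program would be machine instructions, not Python):

```python
# A toy "memory" so the sketch runs; the addresses and values are made up.
memory = {0x10: 6, 0x14: 7, 0x18: None}

a = memory[0x10]         # 1. where the first number is
b = memory[0x14]         # 2. where the second number is
result = a * b           # 3. what operation to perform (a multiply)
memory[0x18] = result    # 4. where to store the result

print(memory[0x18])      # -> 42
```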

This logic can be broken down into the following (greatly simplified) steps, or stages, of the pipeline:

1. Fetch the first number
2. Fetch the second number
3. Multiply the two numbers
4. Store the result

This pipeline has four stages. Now suppose that each of these logical operations takes one clock cycle to complete (which is fairly typical in modern computers). That would mean the complete task of multiplying two numbers together takes four clock cycles. However, because the stages can do things at the same time (in parallel) rather than one after another, each stage can start working on the next instruction the moment it hands the current one off. So while any single task still physically takes four clock cycles from start to finish, a finished result can emerge from the end of the pipeline every clock cycle. Once the pipeline is full, tasks that each take four clock cycles to complete are retired (completed) at a rate of one per clock cycle.
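As a rough sanity check, here is a small Python sketch of that arithmetic (an idealized model with no stalls, not a simulation of any real CPU):

```python
def cycles(num_instructions, stages=4, pipelined=True):
    """Idealized cycle count: one stage per clock, no stalls."""
    if pipelined:
        # The first instruction needs `stages` cycles to fill the pipe;
        # after that, one instruction retires every clock.
        return stages + (num_instructions - 1)
    # Without pipelining, each instruction occupies the hardware alone.
    return stages * num_instructions

for n in (1, 4, 100):
    print(n, cycles(n, pipelined=False), cycles(n, pipelined=True))
# 1   ->   4 vs   4 cycles (no win for a single instruction)
# 4   ->  16 vs   7 cycles
# 100 -> 400 vs 103 cycles (close to one retired per clock)
```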

This concept can be visualized with colors added to the previous image and the stages broken out for each clock. Imagine each color representing an instruction moving through the pipeline, with each instruction taking four clock cycles to complete. The red, green, and dark blue instructions would've had other stages above our block, and the yellow, purple, and brown instructions would need additional clock cycles after our block to complete. But, as you can see, even with all of this going on simultaneously, after every single clock cycle an instruction (which actually took four clocks to execute) is completed! This is the big advantage of processing data via a pipeline.

    This may seem a little confusing, so try to look at it this way. There are four units, and in every clock cycle each unit is doing

    something. You can visualize each unit doing its own bit of work with the following breakout:
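The original breakout graphic isn't reproduced here, so the following sketch is my own reconstruction, assuming the four stages named above. It prints which instruction each unit is working on at every clock:

```python
STAGES = ["fetch A", "fetch B", "multiply", "store"]

def print_schedule(num_instructions, num_clocks):
    # With no stalls, instruction i (0-based) occupies stage s at clock i + s.
    for clock in range(num_clocks):
        slots = []
        for s, name in enumerate(STAGES):
            i = clock - s
            slot = f"I{i + 1}" if 0 <= i < num_instructions else "--"
            slots.append(f"{name}: {slot}")
        print(f"clock {clock + 1}:  " + " | ".join(slots))

print_schedule(num_instructions=6, num_clocks=7)
# Clocks 4 through 6 show the pipeline full: every unit is busy, and the
# instruction sitting in the "store" stage retires, one per clock.
# Clock 7 shows the pipeline starting to drain as the work runs out.
```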


Every clock cycle, each unit has something to do. And because each sub-task is known to take exactly one clock cycle, by the time the output of one stage is ready to move on, the next stage is guaranteed to be free to accept it; by definition, each unit has to complete its work in one clock cycle. If it doesn't, the processor isn't working the way it's supposed to (this is one reason why you can only overclock CPUs so far and no further, even with great cooling). And because all of those units are working together, four-step instructions (or tasks) can be completed at a rate of one per clock.

The speed-up potential of this should be obvious, especially when you consider how many stages modern processors have (from 8 in Itanium 2 all the way up to 31 in Prescott!). The terms Super Pipelined and Hyper Pipelined have become commonplace to describe the extent to which this breakout has been employed.

Below is the pipeline for the Itanium 2. Each stage represents something that IA64 can do, and once everything gets rolling the Itanium 2 is able to process data really, really quickly. The problem with IA64 is that the compiler or assembly language programmer has to work extremely hard to figure out the best way to keep all of those pipeline stages filled all of the time, because when they're not filled the Itanium's performance goes down significantly:

I was hoping to find an image showing Prescott's 31 stages, but I couldn't. The closest I found was a black-and-white comparison of the P6 core (Pentium III) and the original P7 core (Willamette). If anyone has a link showing Prescott's 31 stages, please let us know.

    Here is an Opteron pipeline shown through actual logic units as they exist on the chip. This will help you visualize how the logical

    stages shown above for Itanium 2 might relate to physical units on the CPU die itself:


As you can see, there are different parts to the pipeline all working together, just like on an assembly line, to get a real quantity of work done. Some of it is front-end preparation, some of it is actual execution; and once everything is completed, parts are dedicated to retiring data, or putting it back wherever it needs to go (main memory/cache, something called an internal register, which is like a super-fast cache inside of the processor itself, an external data port, etc.).

It's worth noting that the hyper-pipelined design of Intel's NetBurst (used in Willamette through Prescott) turned out to be a dead end when pushed to its extreme 31-stage pipeline in Prescott. The reason is the penalty that comes from mis-predicting where the computer program will go next. If the processor guesses wrong, it has to refill the pipeline, and that takes many clock cycles before any real work can start flowing again (just as it takes several hours to make the first car). Another penalty is extreme heat generation at the high clock rates seen in Prescott-based P4s.

As a result, real-world experience has shown that there is a trade-off between how deep your pipeline can be and how deep it should be given the type of processing you're doing. Even though on paper it might seem a better idea to have a 50-stage pipeline with a 50GHz clock rate, a designer cannot simply go and build it, even though it would allow extremely complex tasks to be completed 50 billion times per second (though with GaAs chips on the way, that might now be possible).

Chip designers can't do it because there are real-world constraints that mandate a happy medium between that ideal and the real-world actual. The biggest factor is that computer programs jump around constantly, calling sub-routines or functions, going over if..else..endif branches, looping, etc. The processor is constantly running the risk of guessing a branch wrong, and when it does, it must invalidate everything it guessed on in the pipeline and begin to refill it completely, and that takes time and lowers your performance.
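To put a rough number on that penalty, here is a back-of-the-envelope model in Python. The branch frequency and misprediction rate are made-up figures purely for illustration, and the model assumes a refill costs roughly one cycle per pipeline stage:

```python
def effective_cpi(depth, branch_freq=0.20, mispredict_rate=0.05):
    """Ideal CPI of 1, plus a refill penalty of roughly `depth`
    cycles whenever a branch is guessed wrong."""
    return 1 + branch_freq * mispredict_rate * depth

for depth in (8, 20, 31, 50):
    print(f"{depth}-stage pipeline: {effective_cpi(depth):.2f} cycles/instruction")
# 8 stages  -> 1.08
# 20 stages -> 1.20
# 31 stages -> 1.31
# 50 stages -> 1.50
```

The deeper the pipeline, the more every wrong guess costs, which is exactly the trap Prescott fell into.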

The imposed limitations on pipeline depth are simply a side-effect of how code actually runs on the facilities a processor has available to carry out the workload. A processor just can't do stuff the way a person can. Everything inside a CPU has to be programmed exactly as it needs to be, with absolutely no margin for error or guesswork. Any error, any error whatsoever, no matter how small, means the processor becomes totally and completely useless; it might as well not even exist.

I hope this article has been informative. It should've given you a way to visualize a processor pipeline, an understanding of why it is important to performance, and a sense of how it all works together. You should be able to see why designs like Prescott (which take the pipeline depth to an extreme) often come at a real-world performance cost. You should also appreciate why slower-clocked processors (such as Itanium 2 at 1.8GHz) are able to do more work than much higher clocked processors (like Pentium 4 at 4GHz).


It's exactly because of the number of pipeline stages, coupled with the number of units inside the chip that can do things in parallel.
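A quick illustration of that point, with made-up IPC (instructions per clock) figures chosen purely for the example; the real numbers depend heavily on the workload:

```python
# Throughput is clock rate times instructions completed per clock (IPC),
# so a slower, wider chip can out-run a faster, narrower one.
chips = {
    "Itanium 2 @ 1.8 GHz": (1.8e9, 4.0),  # slower clock, wide parallel issue
    "Pentium 4 @ 4.0 GHz": (4.0e9, 1.5),  # faster clock, narrower issue
}
for name, (hz, ipc) in chips.items():
    print(f"{name}: {hz * ipc / 1e9:.1f} billion instructions/second")
# Itanium 2 @ 1.8 GHz: 7.2 billion instructions/second
# Pentium 4 @ 4.0 GHz: 6.0 billion instructions/second
```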

The pipeline allows things to be done in parallel, and that means a CPU's logic units are kept as busy as possible, as often as possible, to make sure that the instructions keep flying off the end at the highest rate possible.

Keep in mind that there are several other factors that speed up processing: processor concepts such as OoO (Out of Order) execution, speculative execution, the benefits of cache, etc. Stay tuned to ChipGeek for coverage of those, and keep your inner geek close by.

    Post your questions and comments below.

Also, for your reading pleasure, here are some other online articles relating to pipelines and pipeline stages: Ars Technica on pipelining in general (http://arstechnica.com/paedia/c/cpu/part-2/cpu2-3.html) and Opteron's pipeline (http://arstechnica.com/articles/paedia/cpu/amd-hammer-1.ars/2); some info on Prescott's die (http://www.chip-architect.com/news/2003_03_06_Looking_at_Intels_Prescott.html); and a history of Intel chips and their pipeline depths (http://www.intel.com/technology/computing/mi06031.htm).

    This closing graphic will summarize the trend from the original 8086 through today's Pentium 4. Enjoy!
