Page 1:

Understanding The Nehalem Core

April 29, 2008

Note: The examples herein are mostly illustrative. They have shortcomings compared to the real implementation in favour of clarity.

Page 2:

Agenda

•Pipelines

•Branch Prediction

•Superscalar execution

•Out-Of-Order execution

Page 3:

Understanding CPUs

•Optimizing performance is pretty much tied to understanding what a CPU actually does.

•VTune, Thread Profiler and other tools are merely aids to understanding the operation of the CPU.

•The following gives some illustrative ideas of how the most important concepts of modern CPU design work. It does not necessarily correspond to the actual implementation, but rather outlines the basic ideas.

Page 4:

Architecture Block Diagram

Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity.

Diagram – architecture block diagram (blocks only, reconstructed from the slide):

Fetch / Decode: Next IP, Branch Target Buffer, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)

Dispatch: Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports

Execute: ports feeding FP Add, FP Div/Mul, Integer Arithmetic, SIMD Integer Arithmetic, Shift/Rotate, SIMD, Load, Store Address, Store Data; Memory Order Buffer (MOB); 32 KB Data Cache; Bus Unit to L2 Cache/Memory

Retire: Re-Order Buffer (ROB), 128 entry; IA Register Set

This is too complicated for the beginning! The best way to understand it is to construct the CPU yourself.

Page 5:

Our CPU Construction

Diagram: Data → CPU → Data

Let's assume nothing for the moment – the CPU is just a black box with data coming in and data going out.

Page 6:

Modern Processor Technology

In order to make use of modern CPUs, some concepts must be clear:

1. Pipelines

2. Prediction

3. Superscalarity

4. Out-Of-Order Execution

5. Vectorization

We will go through these concepts step by step and relate them to indicators of sub-optimal performance on the Nehalem architecture.

Page 7:

Pipelines - What a CPU really understands …

A CPU can distinguish between different concepts

Instructions – what to do with the arguments

Arguments - Usually registers that contain numbers or addresses

A register can be thought of as a piece of memory directly in the CPU. It is as big as the architecture is wide (64bit). Specialized registers can be wider (SSE registers: 128bit).

The ALU can only operate on registers!
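A minimal C sketch of what that means in practice (the function name is illustrative, and the assembly in the comment is only one possibility – the registers and exact instructions are whatever the compiler happens to pick):

/* ALU operands live in registers: a value from memory is first loaded
 * into a register, combined there, and only written back later. */
long sum_array(const long *x, long n)
{
    long sum = 0;                      /* kept in a register inside the loop  */
    for (long i = 0; i < n; i++)
        sum += x[i];                   /* load x[i] into a register, then add */
    return sum;
}

/* Roughly what one loop iteration looks like to the CPU
 * (AT&T syntax, register choice is illustrative):
 *   mov (%rdi,%rcx,8),%rax     load x[i] from memory into %rax
 *   add %rax,%rdx              the ALU works register-to-register
 *   inc %rcx                   the loop counter is just another register
 */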

Page 8:

Pipelines - What a CPU really understands …

1001101011000111101101110010101000110100000100100111111001111100011010001110010011010111100101000110010001000101001100111111111110001010011001111110110100010100110011111011011000010111110110

4d 63 db
4a 8d 04 9f
4f 8d 1c 9a
0f 28 c8
45 33 ff
45 33 f6
45 33 ed
85 f6

movslq %r11d,%r11
lea    (%rdi,%r11,4),%rax
lea    (%r10,%r11,4),%r11
movaps %xmm0,%xmm1
xor    %r15d,%r15d
xor    %r14d,%r14d
xor    %r13d,%r13d
test   %esi,%esi

This is the output of "objdump -d a.out" for some binary a.out – try it yourself on any program you have compiled.

1. FETCH: get a binary number from memory
2. DECODE: translate the binary number into a meaningful instruction

Page 9:

Pipelines – Execution and Memory Access

So we can fetch the next number and translate it into a meaningful instruction. What now?

If the instruction is arithmetic, just execute it, e.g.

imul %r14d,%r15d (multiply register %r15d by register %r14d, result in %r15d)

If it is a memory reference, load the data, e.g.

mov <memory address>,%r15d

Some memory references involve computation (arithmetic)

lea (%r10,%r11,4),%r11 (computes the address %r10 + 4*%r11 and places it in %r11)

3. EXECUTE: Execute the instruction
4. MEMORY ACCESS: Load and store data from/to memory; the address may already have been computed in step 3

Page 10:

Pipelines – Write back

•This is the final step of our primitive pipeline

Once the result is computed or loaded, write it into the register specified, e.g.

imul %r14d,%r15d

Put the result here (into %r15d)

The need for this step is not immediately clear, but will become important later on …

5. WRITE BACK: Save the result as specified

Page 11:

Our CPU construction – functional parts

1. Fetch a new instruction
2. Decode instruction
3. Execution
4. Memory access
5. Write register

Diagram: Data → CPU (now containing the five functional parts above) → Data

Page 12:

… a CPU with 5 pipeline stages

Diagram: Inst/Data → Instr. Fetch → Instr. Decode → Execute → Memory Access → Write Back → Data

Page 13:

Pipelining 1 – No Pipelining

Diagram: instructions 1–5 pass one after another through the stages Fetch Instruction, Decode Instruction, Execute, Memory Access, Write Back (F D E M W); without pipelining, each instruction occupies the CPU for the full time Tp before the next one starts.

Execution time = Ninstructions * Tp

Page 14:

Processor Technology Pipelining 1 – No Pipelining

•The concept of a non-pipelined CPU is very inefficient!

•While, e.g., some comparison is performed in the ALU, the other functional units are running idle.

•With very little effort we can use all functional units at the same time and increase the performance tremendously.

•Pipelining lets different instructions use different functional units at the same time.

•There are problems connected to pipelined operation. We will address them later.

Page 15:

Pipelining 2 - Pipelining

Diagram: the same stages Fetch Instruction, Decode Instruction, Execute, Memory Access, Write Back (F D E M W), but now instructions 1–5 overlap – while instruction 1 is being decoded, instruction 2 is already being fetched, and so on. Each stage takes the time Ts, a fraction of the full instruction time Tp.

Execution time = Ninstructions * Ts
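A back-of-the-envelope comparison of the two formulas in C (the stage count and timings are made-up numbers; the pipelined formula below additionally counts the short fill phase of the pipeline):

#include <stdio.h>

int main(void)
{
    const double Tp = 5.0;            /* time for one complete instruction (arbitrary units) */
    const int    stages = 5;          /* F, D, E, M, W                                        */
    const double Ts = Tp / stages;    /* time per pipeline stage                              */
    const long   N = 1000;            /* number of instructions                               */

    double t_serial    = N * Tp;                  /* no pipelining                            */
    double t_pipelined = (N + stages - 1) * Ts;   /* ~ N * Ts once the pipeline is filled     */

    printf("no pipelining: %.0f\n", t_serial);
    printf("pipelined:     %.0f (about %.1fx faster)\n",
           t_pipelined, t_serial / t_pipelined);
    return 0;
}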

Page 16:

Our Core2 Construction

Diagram: Inst/Data → Instr. Fetch → Instr. Decode → ALU → Memory Access → Write Registers → Data

Page 17:

Pipelining 3 – Pipeline Stalls

Diagram: instructions in the F D E M W pipeline; an instruction whose input is not yet available has to wait, leaving a bubble (stall) in the pipeline.

Example code:

...
c=a/b
d=c/b
e=c/d
f=e/d
g=f/e
...

Pipeline Stalls can have two reasons – competition for resources and data dependencies
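In C, such a chain looks like the first loop below; giving the CPU a second, independent chain hides part of the latency (an illustrative sketch – the two versions agree up to floating-point rounding):

/* Dependent chain: every division needs the result of the previous one,
 * so the pipeline stalls between them (like c=a/b; d=c/b; ... above). */
double chain(const double *x, int n)
{
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r = r / x[i];                  /* waits for the previous r           */
    return r;
}

/* Two independent chains: while one division is still in flight,
 * the other one can already start. */
double two_chains(const double *x, int n)
{
    double r0 = 1.0, r1 = 1.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        r0 = r0 / x[i];                /* independent of r1                  */
        r1 = r1 / x[i + 1];            /* independent of r0                  */
    }
    double r = r0 * r1;                /* combine the two partial results    */
    if (i < n)
        r = r / x[i];                  /* leftover element for odd n         */
    return r;
}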

Page 18:

Pipelining 4 – Branches

Diagram: a CMP instruction (compare two values and write the result into flag F) is followed by a conditional jump JZ (jump if flag F is 0). Until the compare has made its way through the pipeline, the CPU does not know which instruction to fetch next.

Branches are a considerable obstacle for performance!

Page 19:

Processor Technology Avoiding Pipeline Stalls – Branch Prediction

Pipeline stalls due to branches can be avoided by introducing a functional unit that "guesses" the outcome of the next branch.

Diagram: 2-bit prediction scheme with four states –

00: no jump predicted (strongly)
01: no jump predicted (weakly)
10: jump predicted (weakly)
11: jump predicted (strongly)

Every taken jump moves the state towards 11, every not-taken jump moves it towards 00, so from a "strong" state the prediction only changes after two wrong guesses in a row.
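A minimal C sketch of one such 2-bit counter (the state encoding follows the diagram above; a real predictor keeps a whole table of these counters, e.g. indexed by branch address):

#include <stdbool.h>

/* 0 = strongly "no jump", 1 = weakly "no jump",
 * 2 = weakly "jump",      3 = strongly "jump"   */
static unsigned state = 0;

bool predict_jump(void)
{
    return state >= 2;                 /* states 10 and 11 predict a jump  */
}

void train(bool branch_was_taken)
{
    if (branch_was_taken) {
        if (state < 3) state++;        /* move towards "strongly jump"     */
    } else {
        if (state > 0) state--;        /* move towards "strongly no jump"  */
    }
}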

Page 20:

Pipelining 4 – Branch Prediction

Diagram: the same CMP / JZ sequence as before (compare two values and write the result into flag F; jump if flag F is 0), but now a Branch Target Buffer tells the fetch stage which instruction to fetch next while the compare is still in flight.

Branch prediction units can predict with a probability of well over 90%. On a wrongly predicted branch the pipeline needs to be flushed!
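The flush cost is easy to observe: a data-dependent branch over random data is mispredicted roughly half the time, while the same branch over sorted data is predicted almost perfectly (an illustrative C sketch; the measured difference depends on the CPU and on what the compiler does with the branch):

/* Counts bytes >= 128. On random input the condition is essentially a coin
 * toss and the branch predictor is wrong about half the time; if the data
 * is sorted first (e.g. with qsort), the same loop typically runs much
 * faster because the branch becomes almost perfectly predictable. */
long count_big(const unsigned char *data, long n)
{
    long count = 0;
    for (long i = 0; i < n; i++)
        if (data[i] >= 128)            /* hard to predict on random data */
            count++;
    return count;
}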

Page 21:

Our CPU Construction

Diagram: Inst/Data → Instr. Fetch → Decode → ALU → Memory Access → Write Registers → Data, with a Branch Target Buffer attached to the fetch stage.

Page 22:

Superscalar Processors 1

• The superscalar concept is an extension of the pipeline concept

• Pipelining allows for "hiding" the latency of an instruction

• Pipelining ideally achieves 1 instruction per clock cycle

• Higher rates can only be achieved if more than one instruction can be executed at the same time

1. Superscalar architecture - Xeon

2. Other idea: Very long instruction word (VLIW) – EPIC/Itanium

Page 23:

Superscalar Processors 2

Diagram: Fetch Instruction + Decode → Dispatch → two parallel pipelines (ALU and Memory Access), each ending in Write Register; several instructions now move through the F D E M W stages side by side.

Execution time = Ninstructions * Ts / NPipelines

Problems of the Pipeline remain!
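To actually feed more than one pipeline, the instruction stream has to contain independent operations. A common source-level trick is to use several accumulators (illustrative C; how the work is really scheduled is up to the compiler and the CPU):

/* One accumulator: every addition depends on the previous one,
 * so only one pipeline can be kept busy with the additions. */
double dot1(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Two accumulators: the two additions per iteration are independent
 * and can proceed in parallel execution units. */
double dot2(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    for (; i < n; i++)                 /* leftover element for odd n */
        s0 += a[i] * b[i];
    return s0 + s1;
}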

Page 24:

Processor Technology Superscalar Processors 3

• The superscalar concept dramatically improves performance (2x at best)

• The different pipelines don't need to have the same functionalities

• The superscalar concept can be extended easily

Page 25:

Our CPU Construction

Diagram: Inst/Data → Instr. Fetch + Decode → Dispatch → three parallel pipelines (Read Registers → ALU, Read Registers → ALU, Read Registers → Memory Access), each ending in Write Registers → Data; a Branch Target Buffer is attached to the fetch stage.

Page 26:

Processor Technology Out-Of-Order Execution

•Speculative execution provides an important way of reducing pipeline stalls in the case of branches.

•Still, data dependencies cause pipeline stalls.

•There is no good reason why the CPU shouldn't execute instructions that have no dependencies, or whose dependencies have already been computed.

•Out-Of-Order execution is a way to minimize pipeline stalls by reordering the instructions as they enter the pipeline and restoring the original order when they exit the pipeline.
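A small C illustration of what the hardware may do on its own (the schedule sketched in the comment is one possibility, not a guarantee):

/* Program order:                      Possible execution order:
 *   q = a / b;   long-latency divide   the divide is started first,
 *   r = x + y;   independent           then the add and the shift execute
 *   s = r << 1;  depends only on r     while the divide is still in flight,
 *   return q+s;  needs the divide      and only the final add has to wait.
 * The reorder buffer still retires everything in program order. */
long overlap(long a, long b, long x, long y)
{
    long q = a / b;       /* long latency                         */
    long r = x + y;       /* can execute before q is available    */
    long s = r << 1;      /* likewise                             */
    return q + s;         /* the only instruction that waits on q */
}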

Page 27:

Processor Technology Out-Of-Order Execution

Diagram: the pipeline stages are now F R A M W, with an instruction queue in front of the execution stages and a reorder buffer behind them; instructions may execute out of order but leave the reorder buffer in program order.

Page 28:

Our Construction

Diagram: Inst/Data → Instr. Fetch + Decode → Dispatch → three parallel pipelines (Read Registers → ALU, Read Registers → ALU, Read Registers → Memory Access) → Reorder Buffer → Retire – Write Registers → Data; the Branch Target Buffer is attached to the fetch stage as before.

Page 29:

Processor Technology Register Allocation

•IA32 provides a very limited number of registers for local storage

•This is only due to the definition of the instruction set architecture (ISA)

•Internally there are more registers available

•The CPU can internally re-assign the registers in order to make best use of the given resources.

•This re-allocation is tracked in the Register Allocation Table
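A small C illustration of a false dependency that renaming removes (the register name in the comment is just an assumption about what a compiler might choose):

/* The compiler may well use the same architectural register, say %rax,
 * as the temporary for both sums. The second addition does not need the
 * result of the first – only the register *name* is reused (a write-after-
 * write dependency). The RAT maps the two writes to different physical
 * registers, so both additions can be in flight at the same time. */
long rename_example(long a, long b, long c, long d)
{
    long t = a + b;
    long u = c + d;        /* independent of t, despite possible register reuse */
    return t ^ u;
}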

Page 30:

Our CPU Construction

Diagram: as before – Inst/Data → Instr. Fetch → Dispatch → three parallel pipelines (Read Registers → ALU, Read Registers → ALU, Read Registers → Memory Access) → Reorder Buffer → Retire – Write Registers → Data, with the Branch Target Buffer at the front and a Register Allocation stage added.

Now let's compare to the original ...

Page 31:

Block Diagram

Diagram – the same block diagram, now grouped into the stages we constructed:

Fetch / Decode: Next IP, Branch Target Buffer, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)

Dispatch: Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports

Execute: ports feeding FP Add, FP Div/Mul, Integer Arithmetic, SIMD Integer Arithmetic, Shift/Rotate, SIMD, Load, Store Address, Store Data

Memory: Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit to L2 Cache

Retire (Write back): Re-Order Buffer (ROB), 128 entry; IA Register Set

Not very different, right?

Page 32:

Summary

Nehalem can be thought of in very much the same way as standard textbooks present a CPU:

1. Instruction Fetch and Decode (Frontend)

2. Dispatch

3. Execute

4. Memory Access

5. Retire (Write Back)

If you want to dive deeper into the subject: Hennessy and Patterson, Computer Architecture – A Quantitative Approach.
