Computer architecture Lecture 6: Processor’s structure Piotr Bilski

Page 1: Computer architecture Lecture 6: Processor’s structure Piotr Bilski

Computer architecture

Lecture 6: Processor’s structure

Piotr Bilski

Page 2

Processor's tasks:

• Instruction fetching
• Instruction interpretation
• Data fetching
• Data processing
• Data saving

These tasks justify the existence of registers (temporary storage space)

Page 3

Internal structure of the processor

• Registers
• Control Unit
• ALU, consisting of:
  – Status flags
  – Shifter
  – Complementer
  – Arithmetic and Boolean logic

Page 4

Block diagram of the Pentium 3 processor

Page 5

Block diagram of the P6 core (Pentium Pro), 1995

• Front end of the processor
• Core
• Completion unit

Page 6

Register types

• Accessible to the user (address, data, etc.)

• Inaccessible to the user (control, status)

• This categorization is not formal!

Page 7

Registers accessible to the user

• General Purpose Registers (GPR)

• Data registers

• Address registers (segment pointers, stack pointer, index registers)

• Condition codes (status flags), read-only!

Page 8

Control and status registers

• Basic:
  – Program Counter (PC)
  – Instruction Register (IR)
  – Memory Address Register (MAR)
  – Memory Buffer Register (MBR)
• Program Status Word (PSW)
• Interrupt Vector Register
• Page Table Pointer

Page 9

Program Status Word

Field layout (bits 0 to 15): S Z P R O I N OTHER

S – sign bit

Z – zero bit, set if the operation result is zero

P – carry bit

R – logical comparison result bit

O – overflow bit

I – interrupt enable/disable bit

N – supervisor mode bit
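To make the meaning of the S, Z, P and O bits concrete, here is a minimal sketch of how an adder could derive them from a result. The 16-bit word width and the function name `add_with_flags` are assumptions made for illustration; the slide does not specify how the flags are computed.

```python
def add_with_flags(a: int, b: int, bits: int = 16) -> tuple[int, dict[str, int]]:
    """Add two unsigned words and derive PSW-style flags (illustrative only)."""
    mask = (1 << bits) - 1          # e.g. 0xFFFF for a 16-bit word
    sign = 1 << (bits - 1)          # position of the sign bit
    a &= mask
    b &= mask
    full = a + b                    # result before truncation to the word size
    result = full & mask
    flags = {
        "S": int(bool(result & sign)),   # sign: top bit of the result
        "Z": int(result == 0),           # zero: result is all zeros
        "P": int(full > mask),           # carry out of the top bit
        # overflow: operands share a sign that differs from the result's sign
        "O": int(bool(~(a ^ b) & (a ^ result) & sign)),
    }
    return result, flags


result, flags = add_with_flags(0x7FFF, 0x0001)
print(hex(result), flags)
```

Adding 1 to the largest positive 16-bit value sets S and O but not P, while adding 1 to 0xFFFF sets Z and P but not O.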

Page 10

Registers in the Motorola MC68000 processor

• Data and address registers (32-bit)
• Specialization: 8 data registers (D0-D7) and 9 address registers (two of them serve as A7, one in user mode and one in supervisor mode)
• Address bus 24-bit, data bus 16-bit
• A7 register used as the Stack Pointer (SP)
• Status register (SR) 16-bit (its lower byte is the condition code register, CCR)
• Program counter (PC) 32-bit
• Instructions are stored at even addresses

Page 11

Registers in the Intel 8086 Processor

• 16-bit address and data registers
• Data/General Purpose Registers (AX, BX, CX, DX)
• Pointer and index registers (SP, BP, SI, DI)
• Segment registers (CS, DS, SS, ES)
• Instruction pointer
• Status register (flags)

Page 12

Intel 8086 Registers (cont.)

AX – Accumulator
BX – Base
CX – Count
DX – Data

SP – Stack pointer
BP – Base pointer
SI – Source index
DI – Destination index

Page 13

Register organization of the Intel 386 to Pentium processors

• 32-bit data and address registers

• Eight General Purpose Registers (EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI)

• For backward compatibility, the lower half of each register can be used as a 16-bit register (for example AX within EAX)

• 32-bit status register

• 32-bit instruction pointer
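The backward-compatibility point can be illustrated with plain bit masking. The sketch below models a 32-bit register as a Python integer; the helper names `low16` and `write_low16` are hypothetical and only illustrate that a 16-bit name such as AX refers to the lower half of EAX.

```python
def low16(reg32: int) -> int:
    """Read the lower 16 bits of a 32-bit register (AX as seen inside EAX)."""
    return reg32 & 0xFFFF


def write_low16(reg32: int, value16: int) -> int:
    """Write the lower 16 bits and leave the upper 16 bits untouched."""
    return (reg32 & 0xFFFF0000) | (value16 & 0xFFFF)


eax = 0x12345678
print(hex(low16(eax)))                 # lower half, read as AX
print(hex(write_low16(eax, 0xABCD)))   # EAX after a 16-bit write to AX
```

Old 16-bit code therefore keeps working: it reads and writes only the lower half, and the upper half of the 32-bit register is preserved.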

Page 14

Floating-point registers of the Pentium processor

• Eight 80-bit numerical registers

• 16-bit control register

• 16-bit status register

• 16-bit tag word (describes the type of content held in each floating-point register)

• 48-bit instruction pointer

• 48-bit data pointer

Page 15

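EFLAGS register

Bit layout:

• Bits 0 to 15: CF (0), PF (2), AF (4), ZF (6), SF (7), TF (8), IF (9), DF (10), OF (11), IOPL (12-13), NT (14)
• Bits 16 to 31: RF (16), VM (17), AC (18), VIF (19), VIP (20), ID (21)

Selected flags:

• TF – trap flag
• IF – interrupt enable flag
• DF – direction flag
• IOPL – input/output privilege level
• RF – resume flag
• AC – alignment check
• ID – identification flag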
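As an illustration of how software reads such a register, the sketch below unpacks a 32-bit EFLAGS value into named fields. The bit positions are the documented x86 ones (CF at bit 0, ZF at bit 6, IOPL at bits 12-13, ID at bit 21, and so on); the names `EFLAGS_BITS` and `decode_eflags` are hypothetical.

```python
# Single-bit flags and their positions in EFLAGS.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18,
    "VIF": 19, "VIP": 20, "ID": 21,
}


def decode_eflags(value: int) -> dict[str, int]:
    """Return every flag as 0/1, plus the 2-bit IOPL field as 0..3."""
    flags = {name: (value >> bit) & 1 for name, bit in EFLAGS_BITS.items()}
    flags["IOPL"] = (value >> 12) & 0b11   # two-bit field at bits 12-13
    return flags


flags = decode_eflags(0x246)
print([name for name, bit in flags.items() if bit])
```

For the sample value 0x246 only PF, ZF and IF are set; bit 1 is a reserved bit and is not reported.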

Page 16

Registers in the Athlon 64 processor

• Compatible with the x86-64 architecture (40-bit physical address space, 48-bit virtual address space)
• 64-bit data and address registers
• 8 general purpose registers (RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP), which also work in the 32-bit compatibility mode
• 8 additional general purpose registers (R8-R15) in 64-bit mode, as in the Opteron
• 16 SSE registers (XMM0-XMM15)
• 8 x87 floating-point registers, 80-bit

Page 17

Registers in the PowerPC processor

• 32 general purpose registers (64-bit) + exception register (XER)

• 32 registers for the floating point unit (64-bit) + state and control register (FPSCR)

• Branch processing unit registers: 32-bit condition register, 64-bit count and link registers

Page 18

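Instruction cycle

[State diagram of the instruction cycle. States, in order:]

• Instruction address calculation
• Instruction fetch
• Instruction decoding
• Argument address calculation
• Argument fetching (repeated for indirect addressing and for multiple arguments)
• Data operation (may return to argument address calculation for further data)
• Argument address calculation
• Writing argument (repeated for indirect addressing and for multiple results)
• Interrupts checking
• Interrupt handling (skipped if there are no interrupts)

Instruction executed, fetch the next one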

Page 19

Instruction fetching cycle

[Data flow diagram: the processor (PC, MAR, MBR, IR and CU) is connected to the memory by the address bus, the control bus and the data bus]
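The data flow of the fetch cycle can be sketched as a sequence of register transfers. The model below is a simplification under stated assumptions: memory is a Python list of words, every instruction occupies one word, and the `Processor` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    memory: list[int]
    pc: int = 0    # Program Counter: address of the next instruction
    mar: int = 0   # Memory Address Register: drives the address bus
    mbr: int = 0   # Memory Buffer Register: receives the data bus
    ir: int = 0    # Instruction Register: holds the fetched instruction

    def fetch(self) -> int:
        self.mar = self.pc                 # 1. PC is copied to MAR (address bus)
        self.mbr = self.memory[self.mar]   # 2. CU requests a read; word arrives in MBR
        self.pc += 1                       # 3. PC advances (one word per instruction assumed)
        self.ir = self.mbr                 # 4. MBR is copied to IR for decoding
        return self.ir


cpu = Processor(memory=[0x1A, 0x2B])
print(hex(cpu.fetch()), cpu.pc)
```

After one call the IR holds the first memory word and the PC already points at the next instruction.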

Page 20

Indirect cycle

[Data flow diagram: the processor (MAR, MBR and CU) is connected to the memory by the address bus, the control bus and the data bus]

Page 21

Interrupt cycle

[Data flow diagram: the processor (PC, MAR, MBR and CU) is connected to the memory by the address bus, the control bus and the data bus]

Page 22

Pipeline

• Problem: during the instruction cycle only one instruction is processed

• Solution: divide the cycle into smaller stages

• Condition: there must be moments in the cycle when no main memory access is required!

[Figure: Cycle 1, Cycle 2, Cycle 3]

Page 23

Pipeline example - laundry

Without pipelining (3 hours / cycle, 9 hours for all):

CYCLE 1    CYCLE 2    CYCLE 3
LA DR PA   LA DR PA   LA DR PA

With pipelining (3 hours / cycle, 5 hours for all!):

LA DR PA
   LA DR PA
      LA DR PA

Page 24

Prefetch

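• NOTE: the speedup is less than twofold, because a memory access takes longer than instruction execution

[Figure: two-stage pipeline, simplified view: Instruction → Instruction fetch → Instruction → Execution → Result]

[Figure: two-stage pipeline, expanded view: both stages include Waiting states, Execution passes a New address back to Instruction fetching, and a prefetched instruction may be discarded]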

Page 25

Basic phases of the instruction cycle:

• Instruction fetching (FI)
• Instruction decoding (DI)
• Operand address calculation (CO)
• Operands fetching (FO)
• Instruction execution (EI)
• Writing outcome (WO)

Time  1  2  3  4  5  6  7  8  9  10 11
I1    FI DI CO FO EI WO
I2       FI DI CO FO EI WO
I3          FI DI CO FO EI WO
I4             FI DI CO FO EI WO
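The staggered diagram follows a simple rule: instruction number i enters stage s in time unit i + s. The sketch below rebuilds such a diagram for any number of instructions. It assumes an ideal pipeline (one time unit per stage, no stalls or branches), and the function name `timing_diagram` is hypothetical.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def timing_diagram(n_instructions: int, stages: list[str] = STAGES) -> list[str]:
    """Return one text row per instruction for an ideal, stall-free pipeline."""
    total_time = len(stages) + n_instructions - 1   # k + (n - 1) time units
    rows = []
    for i in range(n_instructions):
        cells = ["  "] * total_time                 # two-character cell per time unit
        for s, name in enumerate(stages):
            cells[i + s] = name                     # stage s runs in time unit i + s
        rows.append(f"I{i + 1} " + " ".join(cells).rstrip())
    return rows


for row in timing_diagram(4):
    print(row)
```

With four instructions and six stages the last instruction finishes in time unit 9, instead of 24 time units without overlapping.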

Page 26

Branches and pipelining

Time  1  2  3  4  5  6  7  8  9  10 11 12 13
I1    FI DI CO FO EI WO
I2       FI DI CO FO EI WO
I3          FI DI CO FO
I4             FI DI CO
I5                FI DI
I6                   FI
I21                     FI DI CO FO EI WO
I22                        FI DI CO FO EI WO

Instructions I3 to I6 enter the pipeline but are abandoned when the branch to I21 is taken.

Page 27

Pipeline implementation algorithm

Page 28

Problems of pipelining

• Successive pipeline stages do not all take the same amount of time

• Transferring data between the buffers may significantly increase the pipeline execution time

• Dependencies between registers and memory can be minimized when optimizing the pipeline, but only at a high cost

Page 29

Efficiency of pipelining

Cycle execution time: $\tau$

Time required to execute all $n$ instructions in a pipeline with $k$ stages:

$$T_k = [k + (n - 1)]\tau$$

Instruction pipeline acceleration ratio:

$$S_k = \frac{T_1}{T_k} = \frac{nk}{k + (n - 1)}$$
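A short numeric check of the formulas T_k = [k + (n - 1)]τ and S_k = nk / [k + (n - 1)] is sketched below, where k is the number of stages, n the number of instructions and τ the cycle time. The function names are hypothetical. The laundry example (3 stages, 3 loads, 1 hour per stage) serves as test data.

```python
def pipeline_time(k: int, n: int, tau: float = 1.0) -> float:
    """Total time for n instructions in a k-stage pipeline with cycle time tau."""
    return (k + (n - 1)) * tau


def speedup(k: int, n: int) -> float:
    """Acceleration ratio relative to executing the n instructions one by one."""
    return (n * k) / (k + (n - 1))


print(pipeline_time(3, 3))   # laundry: 3 stages, 3 loads
print(speedup(3, 3))
print(speedup(6, 1000))      # approaches k as n grows
```

For the laundry example this gives 5 hours instead of 9, a ratio of 1.8. As n grows, the ratio approaches the number of stages k, which is why the speedup of a k-stage pipeline never exceeds k.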

Page 30

Example of the pipeline efficiency

Page 31

Modern Processors Pipelines

• Pentium 3 – 10 stages
• Athlon – 10 stages for ALU, 15 stages for FPU
• Pentium M – 12 stages
• Athlon 64 / 64 X2 – 12 stages for ALU, 17 stages for FPU
• Pentium 4 Northwood – 20 stages (hyperpipeline!!)
• Pentium 4 Prescott – 31 stages
• Core2Duo – 14 stages

Page 32

Hazards

• Hazards are disturbances of the pipeline flow

• There are data, resource and control hazards

Page 33

Branch handling

• Multiple pipelines

• Prefetching the branch target

• Loop buffer

• Branch prediction

• Delayed branch

Page 34

Multiple pipelines

• Both instructions that may follow a branch (the next sequential one and the branch target) are loaded into two pipelines and processed simultaneously

• The main problem is gaining memory access for both instructions

Page 35

Prefetch and loop buffer

• Prefetch: when a branch instruction is decoded, the target instruction is fetched as well. It is stored until the branch is executed

• Loop buffer: a buffer memory is created to store the subsequent instructions

• The loop buffer is useful when conditional branch instructions and loops are involved

Page 36

Conditional Branch Prediction

• Static
  – Branch never taken (Sun SPARC, MIPS)
  – Branch always taken
  – Prediction based on the operation code
• Dynamic
  – Taken / not taken switch
  – Branch history table

Page 37

Static prediction

• The simplest method, used as a fallback, for instance in the Motorola MPC7450 processor

• The Pentium 4 allowed a hint to be inserted into the code, suggesting whether the static prediction should assume that the branch is taken or not (a so-called prediction hint)

Page 38

Dynamic prediction of the conditional branches

• The history of each conditional branch instruction is stored

• The history is represented by bits stored in the cache memory

• Every instruction has its own history bits

• Another solution is a table storing information about the conditional branch results
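One widely used way to keep history bits is a 2-bit saturating counter per branch. The sketch below assumes that scheme; the slides do not state which state machine is meant, and the class name `TwoBitPredictor` is hypothetical.

```python
class TwoBitPredictor:
    """States 0 and 1 predict 'not taken'; states 2 and 3 predict 'taken'."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of outcomes predicted correctly, starting from state 0."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)


# A loop branch: taken nine times, then not taken on loop exit.
print(accuracy([True] * 9 + [False]))
```

Two mispredictions are needed to warm up the counter and one more occurs at the loop exit, so 7 of the 10 outcomes are predicted correctly.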

Page 39

History bits prediction

Page 40

Branch history table

Table columns: Branch instruction address | History bits | Target instruction
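A branch history table with these three columns can be sketched as a dictionary keyed by the branch instruction address. The 2-bit history counter, its initial value and the class name `BranchHistoryTable` are assumptions made for illustration.

```python
class BranchHistoryTable:
    """Maps a branch address to its history bits and target address."""

    def __init__(self) -> None:
        self.entries: dict[int, dict[str, int]] = {}

    def predict_next_pc(self, pc: int, fallthrough: int) -> int:
        """Return the predicted address of the instruction after the branch."""
        entry = self.entries.get(pc)
        if entry is not None and entry["history"] >= 2:
            return entry["target"]      # predicted taken: fetch from the target
        return fallthrough              # unknown branch or predicted not taken

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the real outcome once the branch has executed."""
        entry = self.entries.setdefault(pc, {"history": 1, "target": target})
        entry["target"] = target
        if taken:
            entry["history"] = min(3, entry["history"] + 1)
        else:
            entry["history"] = max(0, entry["history"] - 1)


bht = BranchHistoryTable()
bht.update(0x100, taken=True, target=0x200)
print(hex(bht.predict_next_pc(0x100, fallthrough=0x104)))
```

Storing the target next to the history bits lets the fetch stage start on the predicted path without waiting for the target address to be calculated.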

Page 41

Local Branch Prediction

• Requires a separate history buffer for each instruction, although the history table can be shared by all instructions

• Pentium MMX, Pentium 2 and Pentium 3 processors have local prediction circuits with 4 history bits and 16 table entries for each conditional branch instruction

• Local prediction accuracy is estimated at 97%

Page 42

Global Branch Prediction

• A common history for all branch instructions is stored in memory. This makes it possible to take dependencies between different branch instructions into account

• Rarely a better solution than local prediction

• Hybrid solutions: a global prediction unit combined with a shared history table (AMD processors, Pentium M, Core, Core 2)
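The idea of a common history can be sketched with one shared shift register of recent outcomes that selects a 2-bit counter. This is an illustrative model only: the 4-bit history length, the initial counter values and the class name `GlobalPredictor` are assumptions, not a description of any processor listed above.

```python
class GlobalPredictor:
    """One history register shared by all branches indexes a counter table."""

    def __init__(self, history_bits: int = 4) -> None:
        self.history_bits = history_bits
        self.history = 0                               # recent outcomes, newest in bit 0
        self.counters = [1] * (1 << history_bits)      # 2-bit counters, weakly not taken

    def predict(self) -> bool:
        return self.counters[self.history] >= 2

    def update(self, taken: bool) -> None:
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


predictor = GlobalPredictor()
hits = []
for step in range(40):
    taken = step % 2 == 0          # strictly alternating outcomes
    hits.append(predictor.predict() == taken)
    predictor.update(taken)
print(sum(hits), "of", len(hits), "predictions correct")
```

A single 2-bit counter cannot follow an alternating pattern, whereas the history-indexed table learns it after a few outcomes.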

Page 43

Branch Prediction Unit

• A processor circuit responsible for predicting disturbances in the sequential execution of code

• Often connected with the micro-operation cache memory

• In the Pentium 4 processor the branch prediction buffer has 4096 entries, in the Pentium 3 only 512. Therefore the former has a 33 percent better hit ratio than the latter

Page 44

Location of the Branch Prediction Unit