AIMS Embedded Systems Programming MT 2018 - Micro …aims.robots.ox.ac.uk/wp-content/uploads/2018/11/2-micro.pdf · High-Level View of Microarchitectures..... eaxebx ecx ZF CPU registers

AIMS Embedded Systems Programming MT 2018

Micro Architectures

Daniel Kroening

University of Oxford, Computer Science Department

Version 1.0, 2014

Outline

X86/Y86

ARM

Pipelining

Memory

D. Kroening: AIMS Embedded Systems Programming MT 2018 2

High-Level View of Microarchitectures

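[Figure: the CPU contains registers (eax, ebx, ecx, ..., the flag ZF and the instruction pointer IP), functional units (FUs, e.g. ALU and Float) and caches (L1, L2). Data, address and control lines connect it to the memory modules and to I/O (USB, ...).]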


CPUs

• Process a sequential assembler program
• Data held in registers
• Program controls which data is given to which FU, and where the result is stored
• Program controls transfer of data between registers and memory
• Caches speed up access to frequently used memory cells


Instruction Set Architectures

• These summarise the behaviour of a CPU from the point of view of the programmer
• An ISA describes “what the CPU does”
• Ideally as little as possible about “how the CPU does it”

We will study two ISAs:
1. CISC: specifically the Y86 (academic variant of Intel’s x86)
2. RISC: specifically the ARM 32 architecture

One of the goals of this course is to understand the difference.


Visible Registers

RAM
• Contains data and the program

Data registers

  Index  0    1    2    3    4    5    6    7
  Name   eax  ecx  edx  ebx  esp  ebp  esi  edi

Instruction Pointer (IP)
• Points to the address of the current instruction

Flag registers (ZF, ...)
• Store flags for branches


Y86 Assembler

• Subset of Intel’s x86 assembler
✓ You can run a Y86 program on your x86 machine!
✗ The reverse does not work in general, as too many instructions are missing (you are welcome to mend this)


Y86 Instructions

• add/sub: addition/subtraction of the values in two registers; ZF is set appropriately
• RRmov: copies the value of one register into another
• RMmov: copies the value of a register into RAM
• MRmov: copies a value from RAM into a register
• jnz: jumps to a relative address if ZF = 0


Y86 Loads and Stores

• Loads and stores have a Displacement:

  ea = esi + Displacement

• The displacement is included in the instruction word as an immediate constant
• The register esi is used as offset


Y86 Instruction Formats

Mnemonic  Opcode  Second byte (bits 7-6 | 5-3 | 2-0)  Third byte    Semantics
add       01      11 | RS | RD                                      RD←RD+RS
sub       29      11 | RS | RD                                      RD←RD-RS
jnz       75      Distance                                          if(¬ZF) IP←IP+Distance
RRmov     89      11 | RS | RD                                      RD←RS
RMmov     89      01 | RS | 110                       Displacement  MEM[ea]←RS
MRmov     8b      01 | RS | 110                       Displacement  RS←MEM[ea]
hlt       f4


Example 1

add eax, edx

• Intel convention: the target register is always on the left-hand side
• The target register is a source register, too!
• Semantics:

  eax ← eax + edx


Example 2

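mov edx, [BYTE one+esi]

Machine code: 8B 56 17

  8B   opcode (MRmov)
  56   = 01 010 110: register field 010 = edx, register field 110 = esi
  17   displacement (hexadecimal; the address of the label one)

Semantics:

edx ← MEM[esi + 0x17]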
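The field split of the second byte can be checked mechanically. The following is a minimal sketch, not part of the original slides; the helper name `decode_second_byte` and the list `REGISTERS` are illustrative assumptions, with the register order taken from the "Visible Registers" table.

```python
REGISTERS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_second_byte(byte):
    """Split the second instruction byte into its three bit fields."""
    mode = (byte >> 6) & 0b11      # bits 7-6
    middle = (byte >> 3) & 0b111   # bits 5-3
    low = byte & 0b111             # bits 2-0
    return mode, middle, low

mode, middle, low = decode_second_byte(0x56)
print(mode, REGISTERS[middle], REGISTERS[low])  # 1 edx esi
```

The same helper applied to 0xD0 (from `add eax, edx`, machine code 01 D0) yields mode 3 with the fields edx and eax.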


How do Branches Work?

  if (a == b) {
    T;
  } else {
    F;
  }

→

      mov eax, [BYTE a+esi]
      mov ebx, [BYTE b+esi]
      sub eax, ebx
      jnz f
      ;
      ; Code for 'T'
      ;
      mov eax, [BYTE one+esi]
      add eax, eax
      jnz e
  f   ;
      ; Code for 'F'
      ;
  e   ; ...


Assembler Example

Address  Machine Code  Assembler using Mnemonics
00       29 F6             sub esi, esi
02       29 C0             sub eax, eax
04       29 DB             sub ebx, ebx
06       8B 56 17      l   mov edx, [BYTE one+esi]
09       01 D0             add eax, edx
0B       01 C3             add ebx, eax
0D       89 C1             mov ecx, eax
0F       8B 56 1B          mov edx, [BYTE ten+esi]
12       29 D1             sub ecx, edx
14       75 F0             jnz l
16       F4                hlt
17       01 00 00 00   one dd 1
1B       0A 00 00 00   ten dd 10

The result is in ebx.
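To see what ends up in ebx, the program can be replayed by a small interpreter. This is a rough sketch under simplifying assumptions that are not in the slides: instructions are given as tuples rather than machine code, memory is addressed by label name (esi stays 0 in this program), and the jump target is a list index instead of a relative distance. The names `run` and `PROGRAM` are made up for this illustration.

```python
def run(program, memory):
    """Interpret a Y86-like program given as (mnemonic, operands...) tuples."""
    names = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
    regs = dict.fromkeys(names, 0)
    zf = False
    ip = 0
    while True:
        op, *args = program[ip]
        ip += 1
        if op == "hlt":
            return regs
        if op in ("add", "sub"):
            rd, rs = args
            value = regs[rd] + regs[rs] if op == "add" else regs[rd] - regs[rs]
            regs[rd] = value & 0xFFFFFFFF   # 32-bit wrap-around
            zf = regs[rd] == 0              # only add/sub update ZF
        elif op == "rrmov":
            rd, rs = args
            regs[rd] = regs[rs]
        elif op == "mrmov":
            rd, label = args
            regs[rd] = memory[label]        # assumption: esi is 0
        elif op == "jnz":
            if not zf:
                ip = args[0]

PROGRAM = [
    ("sub", "esi", "esi"),
    ("sub", "eax", "eax"),
    ("sub", "ebx", "ebx"),
    ("mrmov", "edx", "one"),   # index 3 corresponds to label l
    ("add", "eax", "edx"),
    ("add", "ebx", "eax"),
    ("rrmov", "ecx", "eax"),
    ("mrmov", "edx", "ten"),
    ("sub", "ecx", "edx"),
    ("jnz", 3),
    ("hlt",),
]

print(run(PROGRAM, {"one": 1, "ten": 10})["ebx"])  # 55
```

The loop runs until eax reaches 10, so ebx accumulates 1 + 2 + ... + 10.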


The NASM Assembler

• Windows:
    nasm -f win32 my_test.asm
    link /subsystem:console /entry:start my_test.obj

• Linux:
    nasm -f elf my_test.asm
    ld -s -o my_test my_test.o

• MacOS:
    nasm -f macho my_test.asm
    ld -arch i386 -o my_test my_test.o


Inline Assembler with Visual Studio

  #include <stdio.h>

  int one=1, ten=10, result;

  int main() {
    __asm {
      sub esi, esi          ; esi := 0
      sub eax, eax          ; eax := 0
      sub ebx, ebx          ; ebx := 0
    l: mov edx, [one+esi]
      add eax, edx          ; eax += 1
      add ebx, eax          ; ebx += eax
      mov ecx, eax
      mov edx, [ten+esi]
      sub ecx, edx          ; ZF set once eax == 10
      jnz l
      mov [result+esi], ebx
    }
    printf("Result: %d\n", result);
    return 0;
  }


Debugging with GDB (Part 1)

• run
  Start execution
• x/[size] Label
  Dump a region of the memory
• x/[size]i Label
  Disassemble some memory region, e.g. x/5i $pc
• info registers
  Show the values of the registers
• stepi
  Execute one machine instruction (step advances by one source line)


Debugging with GDB (Part 2)

• break label
  Set a breakpoint at label
• info break
  Show the breakpoints
• delete breakpoints number
  Delete a breakpoint
• continue
  Resume the execution after a breakpoint


Debugging with Visual Studio


Debugging with XCode


Extensions: Comparisons

We would love to have Y86 commands for

  if (a<b) { ... }

These obviously depend on the number representation:

  with sign                      without sign
  0 > −7                         0 < 9
  twoc(0000) > twoc(1001)        bin(0000) < bin(1001)


Reminder: Number Interpretation

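Binary representation:

\[ \mathrm{bin} : \{0,1\}^n \to \{0, \dots, 2^n - 1\} \]

\[ \mathrm{bin}(x) = \sum_{i=0}^{n-1} x_i \cdot 2^i \]

Two’s complement:

\[ \mathrm{twoc} : \{0,1\}^n \to \{-2^{n-1}, \dots, 2^{n-1} - 1\} \]

\[ \mathrm{twoc}(x) = -2^{n-1} \cdot x_{n-1} + \mathrm{bin}(x_{n-2}, \dots, x_0) \]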
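The two interpretations can be written down directly from the definitions. A small sketch, assuming bit strings are given most significant bit first; the name `bin_value` is chosen only to avoid clashing with Python's built-in `bin`.

```python
def bin_value(bits):
    """bin(x): unsigned value of a bit string, most significant bit first."""
    n = len(bits)
    return sum(int(b) << (n - 1 - i) for i, b in enumerate(bits))

def twoc(bits):
    """twoc(x): the top bit has weight -2^(n-1), the rest is read as bin()."""
    n = len(bits)
    return -int(bits[0]) * 2 ** (n - 1) + bin_value(bits[1:])

print(bin_value("1001"), twoc("1001"))  # 9 -7
```

The same pattern 1001 is therefore 9 without sign and −7 with sign, which is exactly the example on the "Extensions: Comparisons" slide.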


Comparing Unsigned Integers

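Unsigned integers:

\[ \mathrm{bin}(a) < \mathrm{bin}(b) \iff \mathrm{bin}(a) - \mathrm{bin}(b) < 0 \]

Recall: \( -b = (\neg b) + 1 \)
We get the “+1” for free by setting the carry-in of the adder.

Let’s pretend we compute with one more bit (“zero extension”):

\[
\begin{array}{rcccccl}
   & 0   & a_{n-1}      & \dots & a_1      & a_0      & \\
 + & 1   & \neg b_{n-1} & \dots & \neg b_1 & \neg b_0 & \\
   & c_n & c_{n-1}      & \dots & c_1      & 1        & \text{(carry bits)} \\
 = & s_n & s_{n-1}      & \dots & s_1      & s_0      & \text{(sum)}
\end{array}
\]

Thus: \( \mathrm{bin}(a) - \mathrm{bin}(b) < 0 \iff s_n \iff \neg c_n \)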
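The claim that the comparison reduces to the carry-out can be tested exhaustively for small n. A minimal sketch; the function name `unsigned_less` is an assumption for illustration.

```python
def unsigned_less(a, b, n):
    """Decide bin(a) < bin(b) for n-bit patterns from the carry-out c_n."""
    mask = (1 << n) - 1
    total = a + (~b & mask) + 1   # a + (not b) with carry-in 1
    c_n = (total >> n) & 1        # carry out of the top bit
    return c_n == 0               # a < b  iff  c_n is 0

print(unsigned_less(0, 9, 4), unsigned_less(9, 0, 4))  # True False
```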


Comparing Signed Integers

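Two’s complement:

\[ \mathrm{twoc}(a) < \mathrm{twoc}(b) \iff \mathrm{twoc}(a) - \mathrm{twoc}(b) < 0 \]

Again, let’s pretend we have an extra bit (“sign extension”):

\[
\begin{array}{rcccccl}
   & a_{n-1}      & a_{n-1}      & \dots & a_1      & a_0      & \\
 + & \neg b_{n-1} & \neg b_{n-1} & \dots & \neg b_1 & \neg b_0 & \\
   & c_n          & c_{n-1}      & \dots & c_1      & 1        & \text{(carry bits)} \\
 = & s_n          & s_{n-1}      & \dots & s_1      & s_0      & \text{(sum)}
\end{array}
\]

Thus: \( \mathrm{twoc}(a) - \mathrm{twoc}(b) < 0 \iff s_n \iff a_{n-1} \oplus \neg b_{n-1} \oplus c_n \iff s_{n-1} \oplus c_{n-1} \oplus c_n \)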
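The sign-extension argument can be replayed in code: extend both operands by one bit, subtract, and read off bit n of the sum. A sketch with illustrative names (`signed_less`, `sign_extend`), checked below against the ordinary signed comparison for n = 4.

```python
def signed_less(a, b, n):
    """Decide twoc(a) < twoc(b) for n-bit patterns via the extra sum bit s_n."""
    ext_mask = (1 << (n + 1)) - 1

    def sign_extend(x):
        # Copy bit n-1 into the new bit n.
        return x | (1 << n) if (x >> (n - 1)) & 1 else x

    total = (sign_extend(a) + (~sign_extend(b) & ext_mask) + 1) & ext_mask
    s_n = (total >> n) & 1
    return s_n == 1

print(signed_less(0b0000, 0b1001, 4))  # False, since 0 > -7
print(signed_less(0b1001, 0b0000, 4))  # True, since -7 < 0
```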


New Flags: CF, SF, OF

We¹ introduce three new flags for arithmetic operations:

• CF: the carry flag (\(c_n\) in case of additions, \(\neg c_n\) in case of subtraction)
• SF: the sign flag (\(s_{n-1}\))
• OF: the overflow flag (\(c_n \oplus c_{n-1}\))

¹ meaning Intel did so
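The flag definitions translate into a few lines of bit arithmetic. The following sketch is an illustration, not the lecture's own code; the function name `flags` and the tuple order (ZF, CF, SF, OF) are assumptions. A subtraction is computed as a + ¬b with carry-in 1, as on the previous slides.

```python
def flags(a, b, n, subtract=False):
    """Return (ZF, CF, SF, OF) for an n-bit addition or subtraction."""
    mask = (1 << n) - 1
    low_mask = mask >> 1                       # bits n-2 .. 0
    operand = (~b & mask) if subtract else b
    carry_in = 1 if subtract else 0
    total = a + operand + carry_in
    result = total & mask
    c_n = (total >> n) & 1                     # carry out of bit n-1
    c_n_1 = (((a & low_mask) + (operand & low_mask) + carry_in) >> (n - 1)) & 1
    zf = int(result == 0)
    cf = c_n ^ int(subtract)                   # c_n for add, not c_n for sub
    sf = (result >> (n - 1)) & 1
    of = c_n ^ c_n_1
    return zf, cf, sf, of

print(flags(0xFF, 2, 8))                  # (0, 1, 0, 0): -1 + 2
print(flags(0x7F, 1, 8))                  # (0, 0, 1, 1): signed overflow
print(flags(0x80, 1, 8, subtract=True))   # (0, 0, 0, 1)
```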

Examples (Part 1)

       000...000   = 0
  +    000...001   = 1
      0000...000   (carry bits)
  =    000...001   = 1
  ZF = 0, CF = 0, SF = 0, OF = 0

       000...001   = 1
  −    000...001   = 1
      1111...111   (carry bits)
  =    000...000   = 0
  ZF = 1, CF = 0, SF = 0, OF = 0

       111...111   = −1
  +    000...010   = 2
      1111...110   (carry bits)
  =    000...001   = 1
  ZF = 0, CF = 1, SF = 0, OF = 0


Examples (Part 2)

       011...111   = 2^{n−1} − 1
  +    000...001   = 1
      0111...110   (carry bits)
  =    100...000   = 2^{n−1}
  ZF = 0, CF = 0, SF = 1, OF = 1

       100...000   = −2^{n−1}
  −    000...001   = 1
      1000...001   (carry bits)
  =    011...111   = 2^{n−1} − 1
  ZF = 0, CF = 0, SF = 0, OF = 1


Branching Instructions for Comparisons

Instruction   Flags
jz, je        ZF
jnz, jne      ¬ZF
jnae, jb      CF
jae, jnb      ¬CF
jna, jbe      CF ∨ ZF
ja, jnbe      ¬(CF ∨ ZF)
jnge, jl      SF⊕OF
jge, jnl      ¬(SF⊕OF)
jng, jle      (SF⊕OF) ∨ ZF
jg, jnle      ¬((SF⊕OF) ∨ ZF)
jmp near      unconditional

n = not, z = zero, e = equal, g = greater, l = less, a = above, b = below

e.g. jnbe = “jump if not (below or equal)”


Branching Instructions for Comparisons

      sub ax, bx
      Jxxx target
      ...
  target:

branch if   with sign   without sign
ax = bx     je          je
ax ≠ bx     jne         jne
ax > bx     jg          ja
ax ≥ bx     jge         jae
ax < bx     jl          jb
ax ≤ bx     jle         jbe
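The table can be cross-checked by computing the flags of `sub ax, bx` and evaluating the flag conditions from the previous slide. This is a sketch with made-up helper names (`sub_flags`, `taken`); the overflow flag is computed with the usual sign rule for subtraction, which is equivalent to the carry-based definition given earlier.

```python
def sub_flags(a, b, n=16):
    """(ZF, CF, SF, OF) after 'sub a, b' on n-bit patterns."""
    mask = (1 << n) - 1
    result = (a - b) & mask

    def sign(x):
        return (x >> (n - 1)) & 1

    zf = result == 0
    cf = a < b                                      # borrow
    sf = sign(result) == 1
    of = sign(a) != sign(b) and sign(result) != sign(a)
    return zf, cf, sf, of

def taken(jump, a, b, n=16):
    """Would the given conditional jump be taken after 'sub a, b'?"""
    zf, cf, sf, of = sub_flags(a, b, n)
    conditions = {
        "je": zf, "jne": not zf,
        "jb": cf, "jae": not cf,
        "jbe": cf or zf, "ja": not (cf or zf),
        "jl": sf != of, "jge": sf == of,
        "jle": (sf != of) or zf, "jg": not ((sf != of) or zf),
    }
    return conditions[jump]

# 0 versus the pattern 0xFFF9: -7 with sign, 65529 without sign.
print(taken("jg", 0, 0xFFF9), taken("ja", 0, 0xFFF9))  # True False
```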


Example Branching Instructions

  start  sub esi, esi                  ; array index
         mov edx, [BYTE Intmax+esi]    ; Minimum
         mov ecx, [BYTE Top+esi]       ; top index
         sub ebx, ebx                  ; counter

  L      mov eax, ebx
         sub eax, ecx
         jae end                       ; counter≥Top?

         mov esi, ebx
         mov edi, [BYTE Array+esi]     ; edi:=array[ebx]

         mov eax, edi
         sub eax, edx
         jge skip                      ; array[ebx]≥Minimum?

         mov edx, edi                  ; Minimum:=array[ebx]

  skip   sub esi, esi
         mov eax, [BYTE Four+esi]
         add ebx, eax                  ; counter+=4

         jmp near L

  end    hlt


Example Branching Instructions (Part 2)

  Four    dd 4
  Top     dd 40
  Array   dd 1, 2, 3, 4, 5, 6, -7, 8, 9, 10
  Intmax  dd 0x7fffffff


History ARM

• 1980s: Acorn Computers
• 1982: BBC Micro (8 bit)
• 1986: ARM development kit
• 1990: ARM, “Advanced RISC Machines”, founded; owners: Acorn Computers, Apple and VLSI Technology


ARM Today

• Now primarily licensed as IP, with focus on low-end embedded systems and phones (>95 % market share)
• Built by Apple, Nvidia, Qualcomm, Samsung, TI
• 2013: 37 billion ARM processors produced
• Early 64-bit prototypes for application in low-power servers


Visible Data

• RAM, organised in 32-bit words
• Registers
  ◦ R0 to R15
  ◦ R15 is a special case: this is the PC
  ◦ R13 is the stack pointer (SP)
  ◦ R14 is used for the return address for function calls (LR)
  ◦ CPSR for various flags
  ◦ (There is another register file for floating-point numbers)


Basic Instructions

ADD Rd, Rn, Rm             Rd ← Rn + Rm
SUB Rd, Rn, Rm             Rd ← Rn − Rm
MUL Rd, Rm, Rs             Rd ← (Rm · Rs)[31:0]
SMULL RdL, RdH, Rm, Rs     RdH, RdL ← Rm · Rs   (signed)
UMULL RdL, RdH, Rm, Rs     RdH, RdL ← Rm · Rs   (unsigned)
SDIV Rd, Rm, Rs            Rd ← Rm / Rs         (signed)
UDIV Rd, Rm, Rs            Rd ← Rm / Rs         (unsigned)
AND Rd, Rn, Rm             Rd ← Rn & Rm
B label                    PC ← label
BL label                   LR ← PC+4; PC ← label
BX Rm                      PC ← Rm

Many variants!


Setting Condition Flags

• Most instructions can be given a suffix S.
• In addition to the usual behaviour, the condition flags (in CPSR) are updated.

  CPSR bit   31  30  29  28
  flag       N   Z   C   V

N = negative, Z = zero, C = carry, V = overflow


Using Condition Flags

Most instructions can be given condition suffixes:

EQ     equal             NE     not equal
CS/HS  carry set         CC/LO  carry clear
MI     negative          PL     positive (or zero)
VS     overflow          VC     no overflow
HI     higher            LS     lower or same
GE     greater or equal  LT     less than
GT     greater than      LE     less than or equal

These use 4 bits in the instruction word.
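Each suffix is a Boolean function of the four flags N, Z, C, V. The slides list only the names; the flag expressions in the sketch below are the standard ARM definitions, added here as background rather than taken from the lecture. The function name `condition_holds` is illustrative.

```python
def condition_holds(cond, n, z, c, v):
    """Evaluate an ARM condition suffix against the flags N, Z, C, V."""
    table = {
        "EQ": z,               "NE": not z,
        "CS": c, "HS": c,      "CC": not c, "LO": not c,
        "MI": n,               "PL": not n,
        "VS": v,               "VC": not v,
        "HI": c and not z,     "LS": (not c) or z,
        "GE": n == v,          "LT": n != v,
        "GT": (not z) and n == v,
        "LE": z or n != v,
    }
    return bool(table[cond])

# Flags as they would be after comparing two equal values (Z and C set).
print(condition_holds("EQ", n=False, z=True, c=True, v=False))  # True
print(condition_holds("GT", n=False, z=True, c=True, v=False))  # False
```

Note that ARM sets C after a subtraction when no borrow occurred, which is the opposite of the x86 CF convention shown earlier; this is why HS (unsigned higher or same) is "carry set".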


ARM Instruction Formats

ARM uses a fixed-size instruction word:

Data processing:

  bits    31-28  27-21   20  19-16  15-12  11-0
  field   Cond   Opcode  S   Rn     Rd     Rm

Branch and branch & link:

  bits    31-28  27-25  24  23-0
  field   Cond   1 0 1  L   offset
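The branch format can be assembled with shifts and masks. A sketch under assumptions that go beyond the slide: the condition code 0b1110 ("always") and the convention that the 24-bit offset counts words relative to the address of the branch plus 8 are standard ARM facts, not stated in the lecture. The function name `encode_branch` is made up.

```python
def encode_branch(cond, link, offset_words):
    """Pack Cond | 1 0 1 | L | offset into a 32-bit instruction word."""
    return (
        (cond << 28)
        | (0b101 << 25)
        | (int(link) << 24)
        | (offset_words & 0xFFFFFF)   # 24-bit two's complement offset
    )

ALWAYS = 0b1110  # assumed condition code for unconditional execution

# A branch to itself has offset -2 words under the PC+8 convention.
print(hex(encode_branch(ALWAYS, link=False, offset_words=-2)))  # 0xeafffffe
```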


ARM Instruction Formats

• There is a compressed version called “Thumb-2 Instruction Set”
• The instructions have 16 bits
• Fewer options, conditions are a separate instruction
• Aimed at better I-Cache efficiency


Sequential Processors with Pipeline

• We will start with an implementation that
  ◦ has the form and shape of a pipeline, but
  ◦ processes one instruction at a time
  ◦ processes the instructions in a fixed order of phases
• These aren’t built, but only exist for illustrative purposes.
✓ But: the step to a proper pipeline is minimal (will show!)


The 5 Instruction Phases (Stages)

1. Instruction Fetch (IF)
   The instruction is copied from the RAM into a register (IR)

2. Instruction Decode (ID)
   Loads the values of the operands from the register file into registers A and B; also increments the program counter

3. Execute (EX)
   Performs any ALU operation (say add/sub) and the address arithmetic for load/store

4. Memory (M)
   RAM access for load/store

5. Write-Back (WB)
   Stores any result in the register file


An Implementation: High-level View

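[Figure: datapath with the five stages IF, ID, EX, M, WB. Components shown: IP and nextIP logic, IR, the register file (eax, ..., esi, edi), operand registers A and B, the ALU with result register C and flags (Flg), MAR, MDRr, MDRw, the load path, and the system bus carrying the addresses.]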


Sequential Execution

• We first implement a sequential machine: the stages are processed one after the other in the order IF – ID – EX – M – WB
• We execute exactly one instruction at a time
• In contrast to multi-cycle designs: we stick to this even if an instruction doesn’t actually use a particular stage


Sequential Execution

Let I1, I2, . . . be the sequence of instructions in program order.

time  0   1   2   3   4   5   6   7   8
IF    I1                  I2
ID        I1                  I2
EX            I1                  I2
MEM               I1                  I2
WB                    I1


Example: Processing add

program:
  add edx, ebx
  mov [100+esi], edx

[Animation, cycles 0 to 4: the add instruction moves through the sequential datapath, one stage per cycle (IF, ID, EX, M, WB). The frames show the register indices 2 and 3 (edx, ebx) with the values 29 and 6, the ALU result 35 in register C, and the write-back of 35 into register 2 (edx) in cycle 4.]

Example: Processing RMmov

program:
  mov [100+esi], edx

[Animation, cycles 5 to 9: the RMmov instruction moves through the sequential datapath, one stage per cycle (IF, ID, EX, M, WB). The frames show the value 35 and the address 100 on their way to memory.]

Example: Processing jnz

program:
  jnz l

The distance is 10.

[Animation, cycles 0 to 4: the jnz instruction is fetched at IP = 0 and moves through the sequential datapath, one stage per cycle. The frames show 12 as the new value of IP, i.e. the address of the following instruction (2) plus the distance (10).]

Pipelining

• Increases the performance using the assembly-line idea

\[ \text{performance} = \underbrace{\text{instructions per cycle}}_{\mathrm{IPC}} \cdot \underbrace{\text{clock frequency}}_{1/\tau} \]

• Standard technique in virtually all modern circuitry (not just CPUs, but also GPUs, video, networking, wireless, ...)


Pipelining

time  0   1   2   3   4   5
IF    I1  I2  I3  I4  I5  I6
ID        I1  I2  I3  I4  I5
EX            I1  I2  I3  I4
MEM               I1  I2  I3
WB                    I1  I2

Best case: one instruction per cycle!
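The difference between the sequential and the pipelined schedule is easy to quantify in the ideal case (no stalls). A sketch with illustrative function names; it only reproduces the two timing diagrams, not any hazard handling.

```python
def sequential_cycles(instructions, stages=5):
    """Each instruction occupies the machine for all stages."""
    return instructions * stages

def pipelined_cycles(instructions, stages=5):
    """Fill the pipeline once, then finish one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1

def pipeline_diagram(instructions, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Return (stage, cells) rows; cells[t] names the instruction at time t."""
    total = pipelined_cycles(instructions, len(stages))
    rows = []
    for depth, stage in enumerate(stages):
        cells = []
        for time in range(total):
            number = time - depth + 1
            cells.append(f"I{number}" if 1 <= number <= instructions else "")
        rows.append((stage, cells))
    return rows

for stage, cells in pipeline_diagram(6):
    print(f"{stage:5}" + "".join(f"{cell:4}" for cell in cells))
print(sequential_cycles(6), pipelined_cycles(6))  # 30 10
```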


Pipelining Performance

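Performance: \( \mathrm{IPC} \cdot \frac{1}{\tau} \)

\[ \mathrm{IPC} \approx 1 \qquad \tau \approx D_{FF} + \frac{D}{n} \]

where:
• IPC: instructions per cycle
• τ: cycle time
• n: number of stages
• D: combinational delay without the flip-flops
• D_FF: delay of the flip-flops between the stages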
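Plugging numbers into the formula shows why the speed-up stays below n. The delays used here (D = 10 ns, D_FF = 0.5 ns) are invented for illustration only, and the function names are assumptions.

```python
def cycle_time(d_comb, d_ff, n):
    """tau = D_FF + D / n, assuming the logic splits evenly into n stages."""
    return d_ff + d_comb / n

def performance(d_comb, d_ff, n, ipc=1.0):
    """Instructions per unit of time: IPC / tau."""
    return ipc / cycle_time(d_comb, d_ff, n)

D, D_FF = 10.0, 0.5   # hypothetical delays in ns
print(cycle_time(D, D_FF, 1))   # 10.5
print(cycle_time(D, D_FF, 5))   # 2.5
speedup = performance(D, D_FF, 5) / performance(D, D_FF, 1)
print(round(speedup, 2))        # 4.2
```

With five stages the speed-up is 4.2 rather than 5, because the flip-flop delay is paid in every stage.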


Implementing the Pipeline: Roadmap

1. Resolving resource conflicts

2. Modifying the control

3. Dealing with data and control hazards


Resource Conflicts

Let’s look at our sequential machine again:

[Figure: the sequential datapath from before, with stages IF, ID, EX, M, WB and the registers IP, IR, A, B, C, Flg, MAR, MDRr, MDRw.]

Consider the C register of an ALU instruction followed by another ALU instruction!

And what about IR once the 2nd instruction is fetched?


Register Lifetime

[Figure: the sequential datapath from before.]

In which stage is each register written (W) and read (R)?

          IF   ID   EX   M    WB
  IR      W    R    R    R    R
  A, B         W    R
  IP      R    W
  MAR               W    R
  MDRw              W    R
  C                 W    R    R
  Flags        R    W
  MDRr                   W    R
  eax...       R              W

✗ Problem: IR and C need to be remembered for multiple stages!


Register Lifetime

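[Figure: the pipelined datapath. IR is replicated as IR1, IR2, IR3, IR4 (one copy per stage boundary) and C as C3 and C4; the remaining registers (IP, A, B, MAR, MDRr, MDRw, Flg) and the register file eax, ..., esi, edi are unchanged.]

✓ We resolve by replication!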


Resource Conflicts

Q: Which other resources are shared by stages?
A: The system bus (shared by IF and MEM)!

Q: What do we do?
A: Most CPUs have an L1-cache that permits two (read-)accesses simultaneously.
   (Really two L1 caches: an I- and a D-cache)


Example Pipeline

program (modified):
  add edx, ebx
  mov [100+esi], ecx

[Animation, cycles 0 to 5: both instructions move through the pipelined datapath (registers IR1 to IR4, C3, C4), with the RMmov one stage behind the add. The frames show the add computing 35 from 29 and 6 and writing it back to register 2 (edx), while the RMmov carries the value 20 and the address 100 towards memory.]


Data and Control Dependencies

Example program with data dependency:

  add edx, ebx
  mov [100+esi], edx

Execution in the pipeline: like that?

  time   3
  IF     ...
  ID     ...
  EX     mov [100+esi], edx
  MEM    add edx, ebx
  WB


Data and Control Dependencies

Example program with data dependency:

  add edx, ebx
  mov [100+esi], edx

Execution in the pipeline:

  time   3
  IF     ...
  ID     mov [100+esi], edx
  EX     BUBBLE
  MEM    add edx, ebx
  WB

DATA DEPENDENCY!
✗ We would now read the wrong (old) value of edx!
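Detecting such read-after-write dependencies, and counting the bubbles they cost, can be sketched as follows. The model is an assumption made for this illustration, not the lecture's design: operands are read in ID, results are written in WB, there is no forwarding, and a value written in WB can be read in ID one cycle later at the earliest. All names are made up.

```python
ID, WB = 1, 4   # stage positions within IF, ID, EX, M, WB

def raw_dependencies(program):
    """program: list of (written register or None, list of read registers).

    Returns (producer index, consumer index, register) triples."""
    deps = []
    last_writer = {}
    for index, (dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        if dest is not None:
            last_writer[dest] = index
    return deps

def bubbles(producer, consumer):
    """Bubbles needed so the consumer's ID comes after the producer's WB."""
    write_cycle = producer + WB   # instruction i enters IF in cycle i
    read_cycle = consumer + ID
    return max(0, write_cycle + 1 - read_cycle)

PROGRAM = [
    ("edx", ["edx", "ebx"]),   # add edx, ebx
    (None, ["esi", "edx"]),    # mov [100+esi], edx
]
for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(reg, bubbles(producer, consumer))  # edx 3
```

Under this model, two dependent instructions need three bubbles when adjacent and none once they are four or more instructions apart.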


Memory

• ROM: read-only memory
• RAM: random-access memory (but usually means random-access read and write memory)
• SRAM: static RAM; stores state as long as power is supplied
• DRAM: dynamic RAM; implemented using capacitors; the state is lost without periodic refresh


RAM in PCs

[Figure: photographs of memory modules: 30 pin SIMM, 72 pin SIMM, MicroDIMM, 184 pin RAMBus RIMM, 100 pin DIMM, 72 pin SODIMM, 144 pin SDRAM SODIMM, 200 pin DDR SODIMM, 200 pin DDR-2 SODIMM, 168 pin SDRAM DIMM, 184 pin DDR DIMM, 240 pin DDR-2 DIMM]


Addresses

[Figure: a RAM block with the inputs address and WE and the data pins DATA.]

• RAM/ROM chips store many (billions of) bits
• Distinguish using an address
• The address is given in binary
• Plus WE: read/write
• The data pins are used for reading as well as writing


Structure

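[Figure: a 4-bit address is split into two 2-bit halves; two decoders turn them into the row select and the column select of the memory matrix.]

• RAM/ROM chips are a 2D matrix
• The address is split into a row and column
• The binary encoding is turned into unary using a decoder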
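The row/column split and the binary-to-unary decoder can be modelled in a few lines. A sketch with assumed names; which address bits select the row and which the column is a design choice, and here the upper bits are taken as the row.

```python
def split_address(address, row_bits, col_bits):
    """Split an address into (row, column); upper bits select the row."""
    row = (address >> col_bits) & ((1 << row_bits) - 1)
    col = address & ((1 << col_bits) - 1)
    return row, col

def decode(value, bits):
    """Binary-to-unary decoder: exactly one of 2**bits select lines is 1."""
    return [1 if line == value else 0 for line in range(2 ** bits)]

row, col = split_address(0b1101, row_bits=2, col_bits=2)
print(row, col)                        # 3 1
print(decode(row, 2), decode(col, 2))  # [0, 0, 0, 1] [0, 1, 0, 0]
```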


SRAM Cell with Two Inverters

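[Figure: two cross-coupled inverters, connected through switches controlled by the Address Line to the lines Data and ¬Data.]

• Reading and writing
• Address line selects the cell
• State is held using the inverters (latch)
• Read by comparing Data and ¬Data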


SRAM Cell in CMOS

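[Figure: CMOS transistor-level circuit of the SRAM cell, with the supply rails VDD and GND, the Address Line, and the two data lines.]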


DRAM

• DRAM uses capacitors
• Simpler and easier to build than SRAM
✓ High density, low cost
• But: slower!

→ fast but expensive SRAM for caches (more on that later)
→ slow but inexpensive DRAM for the main memory


Reminder: Capacitors

[Plot: charge (%) of a capacitor over time (0 to 8), with a charging phase followed by a discharging phase.]

Store an electric charge, but only for a limited time


DRAM Cell

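[Figure: a transistor controlled by the Address Line connects the Data line to a capacitor, whose other terminal is tied to GND.]

A bit is stored as the charge of a capacitor and has to be refreshed periodically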


Data Buses

Connecting multiple memory chips:

[Figure: the CPU with a separate set of wires to each of three memory modules.]

✗ No! I/O pins are expensive!


Data Buses

• Goal: effective use of the pricey wires
• Idea: share wires for data and addresses among RAM modules

[Figure: the CPU and three memory modules attached to shared data, address and control lines.]


Interface RAM Chips

• Control signals:
  ◦ CS (Chip Select): activates a particular chip
  ◦ WE (Write Enable)
  ◦ OE (Output Enable)
• Inactive chips have high-impedance outputs (Z)
• Write by setting WE, read by setting OE
• Interface constraint: OE and WE are never both active


Write Cycle

[Timing diagram: signals CS, WE, OE, Address and Data; the Data lines are marked valid during the write.]


Read Cycle

[Timing diagram: signals CS, WE, OE, Address and Data; the Data lines are marked valid during the read.]


Row and Column Address Strobes

• Idea: save even more wires by sending the address in two (or more) steps
• Typical: row and column are sent separately
• RAS: Row Address Strobe, CAS: Column Address Strobe


RAS/CAS Write Cycle

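[Timing diagram: signals RAS, CAS, WE, Address and Data; the Address lines carry first the row (Row), then the column (Col), and the Data lines are marked valid.]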


RAS/CAS Read Cycle

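[Timing diagram: signals RAS, CAS, WE, Address and Data; the Address lines carry first the row (Row), then the column (Col), and the Data lines are marked valid.]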


Bus-Bursts

✗ RAM has long latency
• RAM is often accessed sequentially
• Caches therefore are arranged in lines: a sequence of consecutive addresses (e.g. 256 bytes)
• Bus-bursts: efficient transmission of an entire cache line


Bus-Bursts

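[Timing diagram: signals CLK, RAS, CAS, OE, Address and DATA. The Address lines carry Row, then Col; after the CAS Latency (CL) the data words D0, D1, D2, D3 follow, one per clock cycle.]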


Double Data Rate (DDR) RAM

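[Timing diagram: signals CLK, RAS, CAS, OE, Address and DATA. The Address lines carry Row, then Col; after the CAS Latency (CL) the data words D0, D1, D2, D3 follow.]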


Timings

[Excerpt from a memory vendor's price list; the module descriptions quote the timings.]

Description | Part Number
6GB 1066MHz DDR3 ECC Reg w/Par CL7 DIMM (Kit of 3) DR, x8 w/Therm Sen | KVR1066D3D8R7SK3/6G
6GB 1333MHz DDR3 ECC Reg w/Par CL9 DIMM (Kit of 3) SR, x4 w/Therm Sen | KVR1333D3S4R9SK3/6G
8GB 1066MHz DDR3 Non-ECC CL7 DIMM (Kit of 2) | KVR1066D3N7K2/8G
8GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Quad Rank, x4 w/Therm Sen | KVR1066D3Q4R7S/8G
8GB 1066MHz DDR3 ECC Reg w/Par CL7 DIMM (Kit of 2) QR, x8 w/Therm Sen | KVR1066D3Q8R7SK2/8G
12GB 1066MHz DDR3 Non-ECC CL7 DIMM (Kit of 3) | KVR1066D3N7K3/12G
12GB 1066MHz DDR3 ECC CL7 DIMM (Kit of 3) with Thermal Sensor | KVR1066D3E7SK3/12G

HyperX DDR 333MHz and 400MHz

Description | Part Number
512MB 333MHz DDR Non-ECC CL2 (2-2-2-5-1) DIMM | KHX2700/512
512MB 400MHz DDR Non-ECC CL2 (2-3-2-6-1) DIMM | KHX3200A/512
1GB 333MHz DDR Non-ECC CL2 (2-2-2-5-1) DIMM (Kit of 2) | KHX2700K2/1G
1GB 400MHz DDR Non-ECC CL2.5 (2.5-3-3-7-1) DIMM | KHX3200/1G
1GB 400MHz DDR Non-ECC CL2 (2-3-2-6-1) DIMM | KHX3200A/1G
1GB 400MHz DDR Non-ECC CL2 (2-3-2-6-1) DIMM (Kit of 2) | KHX3200AK2/1G
2GB 400MHz DDR Non-ECC CL2.5 (2.5-3-3-7-1) DIMM (Kit of 2) | KHX3200K2/2G
2GB 400MHz DDR Non-ECC CL2 (2-3-2-6-1) DIMM (Kit of 2) | KHX3200AK2/2G

HyperX DDR2 800MHz, 900MHz, 1000MHz, 1066MHz and 1150MHz

512MB 800MHz DDR2 Non-ECC Low-Latency CL4 (4-4-4-12) DIMM | KHX6400D2LL/512
512MB 1066MHz DDR2 Non-ECC CL5 (5-5-5-15) DIMM | KHX8500D2/512
1GB 800MHz DDR2 Non-ECC CL5 (5-5-5-15) DIMM | KHX6400D2/1G
1GB 800MHz DDR2 Non-ECC Low-Latency CL4 (4-4-4-12) DIMM | KHX6400D2LL/1G
1GB 800MHz DDR2 Non-ECC Low-Latency CL4 (4-4-4-12) DIMM (Kit of 2) | KHX6400D2LLK2/1G
1GB 800MHz DDR2 Non-ECC Low Lat CL4 (4-4-4-12) DIMM (NVIDIA SLI-Ready) | KHX6400D2LLK2/1GN
1GB 1066MHz DDR2 Non-ECC CL5 (5-5-5-15) DIMM (Kit of 2) | KHX8500D2K2/1G
1GB 1066MHz DDR2 CL5 (5-5-5-15) DIMM (Kit of 2) (NVIDIA SLI-Ready) | KHX8500D2K2/1GN
2GB 800MHz DDR2 Non-ECC CL5 (5-5-5-15) DIMM (Kit of 2) Tall HS | KHX6400D2T1K2/2G
2GB 800MHz DDR2 Non-ECC Low-Lat CL4 (4-4-4-12) DIMM (NVIDIA SLI-Ready) | KHX6400D2LLK2/2GN
2GB 1066MHz DDR2 CL5 (5-5-5-15) DIMM (Kit of 2) (NVIDIA SLI-Ready) | KHX8500D2K2/2GN
2GB 800MHz DDR2 ECC Low-Latency CL4 (4-4-4-12) FBDIMM (Kit of 2) | KHX6400F2LLK2/2G
4GB 1066MHz DDR2 Non-ECC CL5 (5-5-5-15) DIMM (Kit of 2) Tall HS | KHX8500D2T1K2/4G

HyperX DDR3 1375MHz, 1600MHz, 1625MHz, 1800MHz, 1866MHz and 2000MHz

1GB 1625MHz DDR3 Non-ECC Low-Latency CL7 (7-7-7-20) DIMM | KHX13000AD3LL/1G
1GB 1800MHz DDR3 Non-ECC CL8 (8-8-8-24) DIMM | KHX14400AD3/1G
2GB 1375MHz DDR3 Non-ECC CL9 (9-9-9) DIMM (Kit of 2) | KHX11000D3K2/2G
2GB 1375MHz DDR3 Non-ECC CL7 (7-7-7-20) DIMM (Kit of 2) | KHX11000D3LLK2/2G
2GB 1375MHz DDR3 Non-ECC CL7 (7-7-7-20) DIMM (Kit of 2) Intel XMP | KHX11000D3LLK2/2GX
2GB 1625MHz DDR3 Low Latency CL8 (8-7-7-20) DIMM (Kit of 2) NVIDIA SLI | KHX13000D3LLK2/2GN


Timings

Example: 2-2-2-5

Current standard:

1. CAS Latency
2. RAS-to-CAS Delay
3. RAS Precharge
4. Act-to-Precharge Delay
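The four numbers are counted in clock cycles, so converting them to time requires the clock frequency. A rough sketch under two assumptions not stated on the slide: the "MHz" figure in a module description is the data rate, and a DDR module transfers data twice per clock cycle, so the clock runs at half that figure. The abbreviations CL, tRCD, tRP and tRAS are the common short names for the four delays; the function names are made up.

```python
def parse_timings(text):
    """Map a timing string such as '2-2-2-5' to the four named delays."""
    names = ["CL", "tRCD", "tRP", "tRAS"]
    values = [float(value) for value in text.split("-")]
    return dict(zip(names, values))   # any further values are ignored

def cycles_to_ns(cycles, data_rate_mhz):
    """Convert clock cycles to nanoseconds for a double-data-rate module."""
    clock_mhz = data_rate_mhz / 2     # assumption: two transfers per cycle
    return cycles / clock_mhz * 1000

timings = parse_timings("2-2-2-5")
print(timings["CL"], timings["tRAS"])        # 2.0 5.0
print(cycles_to_ns(timings["CL"], 400))      # 10.0
```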


Caches

• Recall: DRAM slow/cheap, SRAM fast/pricey
• Idea: use SRAM as fast cache for lots of DRAM
• “Hides” the latency of the slow DRAM
• Usually good hit rates >90 %


Caches: Overview

[Figure: main memory (addresses 0, 1, 2, ..., 9, ...) next to a cache. Each cache line is selected by an index (0 to 3), carries a tag, and holds the words at offsets 0 to 3 copied from main memory.]


Caches: Hashing

Q: How to map the addresses?

Easiest answer: use least-significant bits

  address = | tag | index | offset |

• tag: distinguishes lines with same index
• index: address in cache
• offset: distinguishes words in cache line
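Splitting an address into the three fields is plain bit slicing. A sketch with illustrative parameters: 64-byte lines (6 offset bits) and 128 lines (7 index bits) are example values, and the function name is an assumption.

```python
def split_cache_address(address, index_bits, offset_bits):
    """Return (tag, index, offset), with the offset in the lowest bits."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset

print(split_cache_address(0x12345, index_bits=7, offset_bits=6))  # (9, 13, 5)
```

Two addresses collide in the cache exactly when their index fields agree and their tags differ.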


Collisions

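[Figure: main memory addresses 0 to 24 mapped onto a small cache with index and tag columns; several memory addresses share the same index and differ only in their tag, so they compete for the same cache line.]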


Overview of Design Options for Caches

• size
• line size: number of bytes stored together
• allocation policy: when is a new entry created?
• associativity: length of list in hash table
• replacement policy: which entries to purge
• (sectoring)
• write policy: write through or write back
• split I/D cache or unified I/D cache

We will have more options once hierarchy is added.


Cache Size

• Bigger cache → better hit rate
• Bigger caches are also more expensive and have longer paths
• Partially addressed by hierarchy (more on that later)


Line Size

• Observation: memory accesses are clustered
• I.e., the subsequent accesses are often next to each other
• Cache entries have overhead: address bits plus flag bits
• Also remember the latency of memory!
✓ Reduce overhead by making cache entry bigger
• Typical size: 64 bytes (512 bits)


Associativity

• Also called “ways”
• An n-way cache can store n entries with the same address hash
• Think of the length of the list in a hash table
• This reduces the number of collisions
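The effect of associativity on collisions can be observed with a toy cache model. This is a sketch under assumptions: the replacement policy is least-recently-used, which the slides do not prescribe, and the class name `Cache` and its parameters are made up for illustration.

```python
from collections import OrderedDict

class Cache:
    """Set-associative cache model that only tracks hits and misses."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)    # evict least recently used (assumed policy)
        entries[tag] = True
        return False

# Same capacity (4 lines of 4 bytes), different associativity.
direct = Cache(num_sets=4, ways=1, line_size=4)
two_way = Cache(num_sets=2, ways=2, line_size=4)
for address in [0, 16] * 4:                # two addresses with the same index
    direct.access(address)
    two_way.access(address)
print(direct.hits, direct.misses)          # 0 8
print(two_way.hits, two_way.misses)        # 6 2
```

In the direct-mapped cache the two addresses keep evicting each other, while the 2-way cache holds both after the first two misses.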


Associativity

[Figure: a 2-way cache. Main memory addresses 0 to 24 map onto cache lines by index; each index has two entries with separate tags, so two memory lines with the same index can be cached at the same time.]


Cache Hierarchies

• Recall that fast SRAM is expensive, and bigger caches have long paths
• Thus: build a cache for the cache
• L1: closest to CPU
• L2, L3, L4: cache the next level
• Caches get bigger the closer they get to the memory


Statistics

Model        Year  L1 Cache      L2 Cache      L3 Cache   L4 Cache
80486DX      1989  8 KB joint
Pentium      1993  8 KB+8 KB
Pentium Pro  1995  8 KB+8 KB     0.25 MB
Pentium MMX  1997  16 KB+16 KB
Pentium II   1997  16 KB+16 KB   0.5 MB
Xeon         1998  8 KB+8 KB     0.25–1 MB
Pentium III  1999  16 KB+16 KB   0.5 MB
Pentium 4    2000  16 KB+16 KB   0.25–0.5 MB
Itanium 2    2002  16 KB+16 KB   1.5–9 MB      2 or 4 MB
Pentium M    2003  32 KB+32 KB   0.25 MB
Core 2 Duo   2006  32 KB+32 KB   2 MB
Core i7      2008  32 KB+32 KB   0.25 MB       8 MB
Core i5      2009  32 KB+32 KB   0.25 MB       8 MB
Core i3      2010  32 KB+32 KB   0.25 MB       4 MB
Atom SoC     2012  32 KB+24 KB   0.25 MB
Core M       2014                0.25 MB       3 MB       128 MB

Numbers are per core unless shared.

