OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore <[email protected]> Portland State University/OMSE


Page 1: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

OMSE 510: Computing Foundations3: Caches, Assembly, CPU Overview

Chris Gilmore <[email protected]>

Portland State University/OMSE

Page 2:

Today

Caches

DLX Assembly

CPU Overview

Page 3:

Computer System (Idealized)

CPU

Memory

System Bus

Disk Controller

Disk

Page 4:

The Big Picture: Where are We Now?

Control

Datapath

Memory

Processor

Input

Output

The Five Classic Components of a Computer

Next Topic: Simple caching techniques. Many ways to improve cache performance.

Page 5:

Upper Level

Lower Level

faster

Larger

Recap: Levels of the Memory Hierarchy

Processor

Cache

Memory

Disk

Tape

Instr. Operands

Blocks

Pages

Files

Page 6:

Recap: exploit locality to achieve fast memory

Two Different Types of Locality:
Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.

DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

Page 7:

Memory Hierarchy: Terminology

[Figure: Upper Level Memory (holding Blk X) exchanges data to/from the processor; Blk Y sits in Lower Level Memory.]

Hit: data appears in some block in the upper level (example: Block X).
Hit Rate: the fraction of memory accesses found in the upper level.
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.

Miss: data must be retrieved from a block in the lower level (example: Block Y).
Miss Rate = 1 - (Hit Rate).
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.

Hit Time << Miss Penalty

Page 8:

Processor

$

MEM

Memory

reference stream <op,addr>, <op,addr>,<op,addr>,<op,addr>, . . .

op: i-fetch, read, write

Optimize the memory system organization to minimize the average memory access time for typical workloads.

Workload or benchmark programs

The Art of Memory System Design

Page 9:

Example: Fully Associative

[Figure: fully associative cache. Each entry holds a valid bit, a 27-bit cache tag, and 32 bytes of cache data (bytes 0-31, 32-63, ...); the address supplies a cache tag and a byte select (ex: 0x01), and every entry's tag is compared at once.]

Fully Associative Cache:
No cache index; compare the cache tags of all cache entries in parallel.
Example: with 32 B blocks, we need N 27-bit comparators.

By definition: Conflict Miss = 0 for a fully associative cache
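The parallel tag compare can be sketched in Python (an illustrative model with made-up entry counts and tags, not the slide's hardware; hardware does all the compares at once, software has to loop):

```python
# Sketch of a fully associative lookup: 32 B blocks, so the low 5 address
# bits are the byte select and the remaining 27 bits are the tag.
def split_address(addr):
    """Split a 32-bit address into (tag, byte_select) for 32 B blocks."""
    return addr >> 5, addr & 0x1F

class FullyAssociativeCache:
    def __init__(self, n_entries):
        self.entries = [{"valid": False, "tag": None} for _ in range(n_entries)]

    def lookup(self, addr):
        tag, _byte = split_address(addr)
        # Hardware performs these N 27-bit tag compares in parallel.
        return any(e["valid"] and e["tag"] == tag for e in self.entries)

cache = FullyAssociativeCache(4)
cache.entries[2] = {"valid": True, "tag": 0x123}
print(cache.lookup((0x123 << 5) | 0x01))  # True: same block, byte 0x01
print(cache.lookup(0x456 << 5))           # False: no matching tag
```

Because there is no index, a block can live in any entry, which is why conflict misses are zero by definition.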

Page 10:

Example: 1 KB Direct Mapped Cache with 32 B Blocks

[Figure: 1 KB direct-mapped cache. Address bits 31-10 are the cache tag (example: 0x50, stored as part of the cache "state"), bits 9-5 are the cache index (ex: 0x01), and bits 4-0 are the byte select (ex: 0x00). Each of the 32 entries holds a valid bit, a cache tag, and 32 bytes of cache data (bytes 0-1023 in all).]

For a 2 ** N byte cache with block size 2 ** M:
The uppermost (32 - N) bits are always the Cache Tag.
The lowest M bits are the Byte Select.
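The address split for the slide's 1 KB direct-mapped cache can be checked with a short Python sketch (the example address is constructed to match the slide's tag 0x50, index 0x01, byte select 0x00):

```python
# Address breakdown for a 1 KB direct-mapped cache with 32 B blocks:
# 5 byte-select bits (2**5 = 32), 5 index bits (32 blocks), 22 tag bits.
CACHE_BYTES = 1024
BLOCK_BYTES = 32
N_BLOCKS = CACHE_BYTES // BLOCK_BYTES       # 32 blocks
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1  # 5
INDEX_BITS = N_BLOCKS.bit_length() - 1      # 5

def decompose(addr):
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (N_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select

# Rebuild the slide's example address from tag 0x50, index 0x01, byte 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(decompose(addr))  # (80, 1, 0) i.e. tag 0x50, index 1, byte 0
```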

Page 11:

Set Associative Cache

[Figure: two-way set associative cache. The cache index selects one set; the two cache tags are compared against the address tag (Compare vs. Adr Tag) in parallel, the results are ORed into Hit, and Sel1/Sel0 drive a mux that picks the hitting cache block.]

N-way set associative: N entries for each cache index; N direct-mapped caches operate in parallel.

Example: two-way set associative cache. The cache index selects a "set" from the cache; the two tags in the set are compared to the input in parallel; data is selected based on the tag result.

Page 12:

Disadvantage of Set Associative Cache

[Figure: the same two-way set associative datapath as before - two parallel tag compares ORed into Hit, with a mux selecting the cache block.]

N-way set associative cache versus direct mapped cache:
N comparators vs. 1.
Extra MUX delay for the data.
Data comes AFTER the hit/miss decision and set selection.

In a direct mapped cache, the cache block is available BEFORE hit/miss:
Possible to assume a hit and continue; recover later if it was a miss.

Page 13:

Block Size Tradeoff

[Figure: three curves vs. block size - miss rate (falls as spatial locality is exploited, then rises when too few blocks compromise temporal locality), miss penalty (grows with block size), and average access time (U-shaped, from the increased miss penalty and miss rate).]

Larger block sizes take advantage of spatial locality, BUT:
Larger block size means larger miss penalty: it takes longer to fill up the block.
If the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks).

In general, Average Access Time
= Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
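Plugging assumed (hypothetical) numbers into the formula makes the trade-off concrete; the cycle counts and miss rate below are illustrative, not from the slide:

```python
# Average access time per the slide's formula, with assumed values.
hit_time = 1       # cycles (assumed)
miss_rate = 0.05   # assumed
miss_penalty = 40  # cycles (assumed)

amat = hit_time * (1 - miss_rate) + miss_penalty * miss_rate
print(round(amat, 2))  # 2.95

# The common simplification AMAT = Hit Time + Miss Rate x Miss Penalty
# charges the hit time on every access, so it comes out slightly higher.
amat_simple = hit_time + miss_rate * miss_penalty
print(round(amat_simple, 2))  # 3.0
```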

Page 14:

A Summary on Sources of Cache Misses

Compulsory (cold start, process migration, first reference): the first access to a block.
"Cold" fact of life: not a whole lot you can do about it.
Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.

Conflict (collision): multiple memory locations map to the same cache location.
Solution 1: increase cache size. Solution 2: increase associativity.

Capacity: the cache cannot contain all the blocks accessed by the program.
Solution: increase cache size.

Coherence (invalidation): another process (e.g., I/O) updates memory.

Page 15:

Source of Cache Misses Quiz

                              Direct Mapped   N-way Set Associative   Fully Associative
Cache Size: Small, Medium, Big?
Compulsory Miss:
Conflict Miss:
Capacity Miss:
Coherence Miss:

Choices: Zero, Low, Medium, High, Same
Assume constant cost.

Page 16:

Sources of Cache Misses: Answers

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same

Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.

Page 17:

Recap: Four Questions for Caches and Memory Hierarchy

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

Page 18:

Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache (block-frame addresses 0-31):

Fully associative: block 12 can go anywhere.
Direct mapped: block 12 can go only into block 4 (12 mod 8).
2-way set associative: block 12 can go anywhere in set 0 (12 mod 4); the 8 blocks form sets 0-3.

S.A. mapping = block number modulo number of sets.
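The modulo mapping can be written as a few lines of Python that reproduce the slide's three cases for block 12 in an 8-block cache:

```python
# Which cache-block slots may a memory block occupy, for a given
# associativity? (assoc=1 is direct mapped, assoc=n_blocks is fully
# associative.)
def placement(block_no, n_blocks, assoc):
    """Return the list of cache-block slots block_no may occupy."""
    n_sets = n_blocks // assoc
    s = block_no % n_sets            # S.A. mapping: block number mod # sets
    return list(range(s * assoc, (s + 1) * assoc))

print(placement(12, 8, 1))  # direct mapped: [4]            (12 mod 8)
print(placement(12, 8, 2))  # 2-way: set 0 -> slots [0, 1]  (12 mod 4)
print(placement(12, 8, 8))  # fully associative: all 8 slots
```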

Page 19:

Q2: How is a block found if it is in the upper level?

The block address splits into Tag and Index, followed by the Block offset. The index does set select; the tag compare does data select. Blocks are found by direct indexing (using the index and block offset), tag compares, or a combination.

Increasing associativity shrinks the index and expands the tag.

Page 20:

Q3: Which block should be replaced on a miss?

Easy for Direct Mapped

Set associative or fully associative:
Random
FIFO
LRU (Least Recently Used)
LFU (Least Frequently Used)

Associativity: 2-way 4-way 8-way

Size LRU Random LRU Random LRU Random

16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%

64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%

256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%
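LRU behavior for one set can be sketched in a few lines of Python, using an ordered dict as the recency queue (an illustrative model, not how hardware tracks LRU state):

```python
# Minimal LRU replacement for a single 2-way cache set.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used tag is first

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)     # refresh recency on a hit
            return "hit"
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the LRU tag
        self.tags[tag] = True
        return "miss"

s = LRUSet(2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- C evicts B, so B misses again
```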

Page 21:

Q4: What happens on a write?

Write through—The information is written to both the block in the cache and to the block in the lower-level memory.

Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Requires a dirty bit: is the block clean or dirty?

Pros and cons of each?
WT: read misses cannot result in writes; coherency is easier.
WB: repeated writes to a block cost only one write to memory.

WT is always combined with write buffers so writes don't wait for lower-level memory.

Page 22:

[Figure: Processor and Cache feed a Write Buffer, which drains to DRAM.]

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4.
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

Memory system designer's nightmare:
Store frequency (w.r.t. time) > 1 / DRAM write cycle: write buffer saturation.

Page 23:

Write Buffer Saturation

[Figure: Processor -> Cache -> Write Buffer -> DRAM, and the same path with an L2 cache inserted between the write buffer and DRAM.]

Store frequency (w.r.t. time) > 1 / DRAM write cycle: if this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (the CPU cycle time <= the DRAM write cycle time).

Solutions for write buffer saturation:
Use a write back cache.
Install a second-level (L2) cache. (Does this always work?)

Page 24:

[Figure: the 1 KB direct-mapped cache again - cache tag (example: 0x00), cache index (ex: 0x00), and byte select (ex: 0x00) addressing entries with a valid bit, tag, and cache data bytes 0-1023.]

Write-miss Policy: Write Allocate versus Write Not Allocate

Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block?
Yes: Write Allocate. No: Write Not Allocate.

Page 25:

Impact on Cycle Time

[Figure: pipeline datapath - PC and I-Cache feeding IR, register read (A, B), execute (IRex), memory/D-Cache (IRm, R), and write back (IRwb, T), with miss and invalid paths marked.]

Cache hit time:
directly tied to clock rate
increases with cache size
increases with associativity

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

Time = IC x CT x (ideal CPI + memory stalls)

Page 26:

What happens on a cache miss? For an in-order pipeline, two options:

Freeze the pipeline in the Mem stage (popular early on: SPARC, R4000)

IF ID EX Mem stall stall stall ... stall Mem Wr
   IF ID EX  stall stall stall ... stall stall Ex Wr

Use Full/Empty bits in registers + an MSHR queue. MSHR = "Miss Status/Handler Registers" (Kroft). Each entry in this queue tracks the status of an outstanding memory request to one complete memory line.

Per cache line: keep info about the memory address. For each word: the register (if any) that is waiting for the result. Used to "merge" multiple requests to one memory line.

New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.

Attempt to use register before result returns causes instruction to block in decode stage.

Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.

Out-of-order pipelines already have this functionality built in… (load queues, etc).
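The merging behavior of the MSHR queue can be sketched in Python (names, line size, and return strings are illustrative, not from the slide):

```python
# Sketch of MSHR merging: one entry per outstanding cache line, with a
# list of destination registers waiting on that line.
class MSHRFile:
    def __init__(self):
        self.entries = {}  # line address -> waiting destination registers

    def load(self, addr, dest_reg, line_bytes=32):
        line = addr // line_bytes
        if line in self.entries:
            # Merge: this line is already being fetched; just add a waiter.
            self.entries[line].append(dest_reg)
            return "merged"
        self.entries[line] = [dest_reg]  # new outstanding miss
        return "new miss"

    def line_arrives(self, line):
        # The fill returns every register that was waiting on this line.
        return self.entries.pop(line, [])

m = MSHRFile()
print(m.load(0x1000, "r1"))          # new miss
print(m.load(0x1010, "r2"))          # merged (same 32 B line)
print(m.line_arrives(0x1000 // 32))  # ['r1', 'r2']
```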

Page 27:

Time = IC x CT x (ideal CPI + memory stalls)

Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
= (Hit Rate x Hit Time) + (Miss Rate x Miss Time)

Improving Cache Performance: 3 general options

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.

Page 28:

Improving Cache Performance

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.

Page 29:

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (roughly 0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way associativity, split into conflict, capacity, and compulsory components. Compulsory misses are vanishingly small.]

Page 30:

2:1 Cache Rule

[Figure: the same miss-rate-per-type plot, illustrating:]
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2

Page 31:

3Cs Relative Miss Rate

[Figure: miss rate per type vs. cache size (1 KB to 128 KB), normalized to 100%, for 1- to 8-way associativity with conflict, capacity, and compulsory components.]

Flaws: for fixed block size. Good: insight => invention.

Page 32:

1. Reduce Misses via Larger Block Size

[Figure: miss rate (0-25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K - larger blocks help until the block is too big relative to the cache.]

Page 33:

2. Reduce Misses via Higher Associativity

2:1 Cache Rule: the miss rate of a direct-mapped cache of size N = the miss rate of a 2-way set-associative cache of size N/2.

Beware: execution time is the only final measure! Will clock cycle time increase? Hill [1988] suggested hit times for 2-way vs. 1-way: external cache +10%, internal +2%.

Page 34:

Example: Avg. Memory Access Time vs. Miss Rate

Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped

Cache Size Associativity

(KB) 1-way 2-way 4-way 8-way

1 2.33 2.15 2.07 2.01

2 1.98 1.86 1.76 1.68

4 1.72 1.67 1.61 1.53

8 1.46 1.48 1.47 1.43

16 1.29 1.32 1.32 1.32

32 1.20 1.24 1.25 1.27

64 1.14 1.20 1.21 1.23

128 1.10 1.17 1.18 1.20

(Green means A.M.A.T. not improved by more associativity)

(AMAT = Average Memory Access Time)

Page 35:

[Figure: victim cache - four fully associative entries (one cache line of data, tag, and comparator each) sitting between the cache's TAGS/DATA arrays and the next lower level in the hierarchy.]

3. Reducing Misses via a “Victim Cache”

How to combine fast hit time of direct mapped yet still avoid conflict misses?

Add buffer to place data discarded from cache

Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache

Used in Alpha, HP machines

Page 36:

4. Reducing Misses by Hardware Prefetching

E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, check the stream buffer.

Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%.
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

Page 37:

5. Reducing Misses by Software Prefetching Data

Data prefetch: load data into a register (HP PA-RISC loads).
Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9).
Special prefetching instructions cannot cause faults; a form of speculative execution.

Issuing prefetch instructions takes time:
Is the cost of prefetch issues < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of finding issue bandwidth.

Page 38:

6. Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software.

Instructions:
Reorder procedures in memory so as to reduce conflict misses.
Profiling to look at conflicts (using tools they developed).

Data:
Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine 2 independent loops that have the same looping and some variables overlap.
Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.
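Loop interchange, the second data optimization above, can be shown in a minimal Python sketch (the array size is arbitrary; Python lists of lists are row-major in the sense that a[i] is one contiguous row):

```python
# Loop interchange: traverse a 2-D array in the order it is stored so that
# consecutive accesses fall in the same cache block.
N = 4
a = [[0] * N for _ in range(N)]

# Before: column-major walk of a row-major layout -- each access strides
# across rows, touching a different block every time.
for j in range(N):
    for i in range(N):
        a[i][j] += 1

# After interchange: row-major walk -- consecutive elements, good spatial
# locality. The result is identical; only the access order changed.
for i in range(N):
    for j in range(N):
        a[i][j] += 1

print(a[0])  # [2, 2, 2, 2] -- every element was incremented twice
```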

Page 39:

Improving Cache Performance (Continued)

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.

Page 40:

0. Reducing Penalty: Faster DRAM / Interface

New DRAM technologies:
RAMBUS - same initial latency, but much higher bandwidth.
Synchronous DRAM.

Better bus interfaces.

CRAY technique: only use SRAM.

Page 41:

[Figure: Processor and Cache feed a Write Buffer, which drains to DRAM.]

1. Reducing Penalty: Read Priority over Write on Miss

A write buffer allows reads to bypass writes:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4.
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

Memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, causing write buffer saturation.

Page 42:

1. Reducing Penalty: Read Priority over Write on Miss

Write-buffer issues:
Write through with write buffers creates RAW conflicts with main memory reads on cache misses.
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000).
Check the write buffer contents before a read; if there are no conflicts, let the memory access continue.

Write back?
Read miss replacing a dirty block.
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
The CPU stalls less since it restarts as soon as the read is done.

Page 43:


2. Reduce Penalty: Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
Early restart - as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first - request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only with large blocks.

Spatial locality is a problem: the CPU tends to want the next sequential word anyway, so it is not clear how much early restart benefits.

Page 44:

3. Reduce Penalty: Non-blocking Caches

A non-blocking cache (lockup-free cache) allows the data cache to continue to supply cache hits during a miss:
requires F/E bits on registers or out-of-order execution
requires multi-bank memories

"Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
Requires multiple memory banks (otherwise it cannot be supported).
The Pentium Pro allows 4 outstanding memory misses.

Page 45:

Value of Hit Under Miss for SPEC ("hit under n misses")

[Figure: average memory access time, normalized to Base, for hit-under-0 (Base), hit-under-1 (0->1), hit-under-2 (1->2), and hit-under-64 (2->64) misses, across SPEC integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2) and floating-point benchmarks (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora).]

FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty

Page 46:

4. Reduce Penalty: Second-Level Cache

Proc

L1 Cache

L2 Cache

L2 Equations

AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
Local miss rate - misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
Global miss rate - misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2).

Global Miss Rate is what matters
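Plugging assumed (hypothetical) numbers into the L2 equations shows how the pieces fit together; all the cycle counts and rates below are illustrative, not from the slide:

```python
# Two-level AMAT with assumed values.
hit_l1, hit_l2 = 1, 10   # cycles (assumed)
miss_rate_l1 = 0.04      # L1 miss rate (assumed)
local_miss_l2 = 0.5      # L2 misses / accesses reaching L2 (assumed)
miss_penalty_l2 = 100    # cycles to main memory (assumed)

# Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
miss_penalty_l1 = hit_l2 + local_miss_l2 * miss_penalty_l2  # 60 cycles

# AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
amat = hit_l1 + miss_rate_l1 * miss_penalty_l1

# Global L2 miss rate = Miss Rate_L1 x Miss Rate_L2 -- the fraction of all
# CPU accesses that go to main memory; this is the number that matters.
global_miss_l2 = miss_rate_l1 * local_miss_l2

print(miss_penalty_l1, round(amat, 2), round(global_miss_l2, 3))
```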

Page 47:

Reducing Misses: which apply to L2 Cache?

Reducing Miss Rate

1. Reduce Misses via Larger Block Size

2. Reduce Conflict Misses via Higher Associativity

3. Reducing Conflict Misses via Victim Cache

4. Reducing Misses by HW Prefetching Instr, Data

5. Reducing Misses by SW Prefetching Data

6. Reducing Capacity/Conf. Misses by Compiler Optimizations

Page 48:

L2 cache block size & A.M.A.T.

Relative CPU time vs. L2 block size (32 KB L1, 8-byte path to memory):

Block size (bytes):  16    32    64    128   256   512
Relative CPU time:   1.36  1.28  1.27  1.34  1.54  1.95

Page 49:

Improving Cache Performance (Continued)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache:
Lower associativity (+ victim caching)?
2nd-level cache
Careful virtual memory design

Page 50:

Summary #1/3: The Principle of Locality:
A program is likely to access a relatively small portion of the address space at any instant of time.
Temporal locality: locality in time.
Spatial locality: locality in space.

Three (+1) major categories of cache misses:
Compulsory misses: sad facts of life. Example: cold start misses.
Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
Capacity misses: increase cache size.
Coherence misses: caused by external processors or I/O devices.

Cache design space: total size, block size, associativity; replacement policy; write-hit policy (write-through, write-back); write-miss policy.

Page 51:

Summary #2 / 3: The Cache Design Space

[Figure: the cache design space - axes for cache size, block size, and associativity, with good and bad regions traded off between factors.]

Several interacting dimensions:
cache size, block size, associativity
replacement policy
write-through vs. write-back
write allocation

The optimal choice is a compromise:
depends on access characteristics (workload; use as I-cache, D-cache, or TLB)
depends on technology / cost

Simplicity often wins.

Page 52:


Summary #3 / 3: Cache Miss Optimization

Lots of techniques people use to improve the miss rate of caches:

Technique                         MR   MP   HT   Complexity
Larger Block Size                 +    -         0
Higher Associativity              +         -    1
Victim Caches                     +              2
Pseudo-Associative Caches         +              2
HW Prefetching of Instr/Data      +              2
Compiler Controlled Prefetching   +              3
Compiler Reduce Misses            +              0

Page 53:

Onto Assembler!

What is assembly language?

A machine-specific programming language:
one-to-one correspondence between statements and native machine language instructions
matches the machine's instruction set and architecture

Page 54:

What is an assembler?

A systems-level program, usually working in conjunction with the compiler, that translates assembly language source code to machine language.
Object file: contains machine instructions, initial data, and information used when loading the program.
Listing file: contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table.

Page 55:

Why learn assembly?

Learn how a processor works

Understand basic computer architecture

Explore the internal representation of data and instructions

Gain insight into hardware concepts

Allows creation of small and efficient programs

Allows programmers to bypass high-level language restrictions

Might be necessary to accomplish certain operations

Page 56:

Machine Representation

A language of numbers, called the processor's instruction set: the set of basic operations a processor can perform.

Each instruction is coded as a number

Instructions may be one or more bytes

Every number corresponds to an instruction

Page 57:

Assembly vs Machine

Machine language programming: writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program.

Assembly language programming: using symbolic instructions to represent the raw data that will form the machine language program and initial data constants.

Page 58:

Assembly

Mnemonics represent machine instructions:
Each mnemonic represents a single machine instruction.
The assembler performs the translation.

Some mnemonics require operands:
Operands provide additional information: a register, constant, address, or variable.

Assembler Directives

Page 59:

Instruction Set Architecture: a Critical Interface

instruction set

software

hardware

Portion of the machine that is visible to the programmer or the compiler writer.

Page 60:

Good ISA

Lasts through many implementations (portability, compatibility)

Can be used for many different applications (generality)

Provides convenient functionality to higher levels

Permits an efficient implementation at lower levels

Page 61:

Von Neumann Machines

Von Neumann “invented” stored program computer in 1945

Instead of program code being hardwired, the program code (instructions) is placed in memory along with data.

[Figure: Control and ALU alongside memory holding Program and Data.]

Page 62:

Basic ISA Classes

Memory-to-memory machines:
Every instruction contains a full memory address for each operand.
Maybe the simplest ISA design.
However, memory is slow.
Memory is big (lots of address bits).

Page 63:

Memory-to-memory machine assumptions:
Two operands per operation; the first operand is also the destination.
Memory address = 16 bits (2 bytes). Operand size = 32 bits (4 bytes). Instruction code = 8 bits (1 byte).

Example: A = B + C (hypothetical code)

mov A, B ; A <= B
add A, C ; A <= B + C

5 bytes to fetch each instruction, 4 bytes to fetch each source operand, 4 bytes to store the result: mov needs 13 bytes and add needs 17 bytes, for a total of 30 bytes of memory traffic.

Page 64:

Why CPU Storage?

A small amount of storage in the CPU:
To reduce memory traffic by keeping repeatedly used operands in the CPU.
Avoid re-referencing memory.
Avoid having to specify the full memory address of the operand.

This is a perfect example of "make the common case fast".

Simplest case: a machine with 1 cell of CPU storage, the accumulator.

Page 65: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Accumulator Machine

Assumptions: Two operands per operation; 1st operand in the accumulator; 2nd operand in memory; the accumulator is also the destination (except for store). Memory address = 16 bits (2 bytes). Operand size = 32 bits (4 bytes). Instruction code = 8 bits (1 byte)

Example A = B+C (hypothetical code)

Load B ; acc <= B

Add C ; acc <= B+C

Store A ; A <= acc

3 bytes per instruction; 4 bytes to load or store the memory operand; 7 bytes per instruction in total. Total: 21 bytes of memory traffic
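The same tally for the accumulator machine, again using the slide's assumed sizes:

```python
# Accumulator machine: memory traffic for A = B + C.
# Assumed sizes (from the slide): opcode = 1 byte, memory address = 2 bytes,
# operand = 4 bytes.
OPCODE, ADDR, OPERAND = 1, 2, 4

instr = OPCODE + ADDR            # each instruction names one memory address
per_instr = instr + OPERAND      # 3-byte instruction fetch + one 4-byte operand move
total = 3 * per_instr            # Load B, Add C, Store A

print(instr, per_instr, total)   # 3 7 21
```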

Page 66: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Stack Machines

Instruction sets are based on a stack model of execution.

Aimed for compact instruction encoding

Most instructions manipulate top few data items (mostly top 2) of a pushdown stack.

Top few items of the stack are kept in the CPU

Ideal for evaluating expressions (stack holds intermediate results)

Were thought to be a good match for high level languages

Awkward: becomes very slow if the stack grows beyond CPU local storage; no simple way to get data from the “middle of the stack”

Page 67: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Stack Machines

Binary arithmetic and logic operations Operands: top 2 items on stack Operands are removed from stack Result is placed on top of stack

Unary arithmetic and logic operations Operand: top item on the stack Operand is replaced by result of operation

Data move operations Push: place memory data on top of stack Pop: move top of stack to memory
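The push/pop/binary-op semantics described above can be sketched as a tiny interpreter (a toy model, not any particular stack ISA):

```python
# Minimal stack-machine sketch: push/pop move data between memory and the
# stack, and binary ops consume the top two items and push the result.
def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "push":            # memory -> top of stack
            stack.append(memory[args[0]])
        elif op == "pop":           # top of stack -> memory
            memory[args[0]] = stack.pop()
        elif op == "add":           # binary op: removes top two, pushes result
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return memory

# A = B + C, as in the running example
mem = {"B": 2, "C": 3}
run([("push", "B"), ("push", "C"), ("add",), ("pop", "A")], mem)
print(mem["A"])  # 5
```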

Page 68: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

General Purpose Register Machines

With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions

Frequently used operands are placed in registers (under program control). This reduces instruction size and reduces memory traffic

Page 69: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

General Purpose Registers Dominate

1975-present all machines use general purpose registers

Advantages of registers: registers are faster than memory; registers are easier for a compiler to use, e.g., for (A*B) – (C*D) – (E*F) the multiplies can be done in any order; registers can hold variables

memory traffic is reduced, so program is sped up (since registers are faster than memory)

code density improves (since a register is named with fewer bits than a memory location)

Page 70: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Classifying General Purpose Register Machines

General purpose register machines are sub-classified based on whether or not memory operands can be used by typical ALU instructions

Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand

Load-store machines: the only instructions that can access memory are the “load” and the “store” instructions

Page 71: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Comparing number of instructions

Code sequence for A = B+C for five classes of instruction sets:

Memory to Memory:
mov A, B
add A, C

Stack:
push B
push C
add
pop A

Accumulator:
load B
add C
store A

Register (register-memory):
load R1, B
add R1, C
store A, R1

Register (load-store):
load R1, B
load R2, C
add R1, R1, R2
store A, R1

DLX/MIPS is one of these

Page 72: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Instruction Set Definition

Objects = architecture entities = machine state:
Registers: general purpose; special purpose (e.g. program counter, condition code, stack pointer)
Memory locations: linear address space 0, 1, 2, …, 2^s - 1

Operations = instruction types:
Data operations: arithmetic, logical
Data transfer: move (register to register), load (memory location to register), store (register to memory location)
Instruction sequencing: branch (conditional), jump (unconditional)

Page 73: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Topic: DLX

Instructional Architecture

Much nicer and easier to understand than x86 (barf)

The Plan: Teach DLX, then move to x86/y86

DLX: RISC ISA, very similar to MIPS

Great links to learn more DLX:

http://www.softpanorama.org/Hardware/architecture.shtml#DLX

Page 74: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

DLX Architecture

Based on observations about instruction set architecture Emphasizes:

Simple load-store instruction set Design for pipeline efficiency Design for compiler target

DLX registers: 32 32-bit GPRs named R0, R1, ..., R31; 32 32-bit FPRs named F0, F1, ..., F31

FPRs are accessed independently for 32-bit data, and in even-numbered pairs (F0, F2, ..., F30) for 64-bit (double-precision) data

Register R0 is hard-wired to zero. Other status registers, e.g., the floating-point status register

Byte addressable, big-endian, with 32-bit addresses. Arithmetic instruction operands must be registers

Page 75: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

0 zero constant 0

1 at reserved for assembler

2 v0 expression evaluation &

3 v1 function results

4 a0 arguments

5 a1

6 a2

7 a3

8 t0 temporary: caller saves

. . . (callee can clobber)

15 t7

MIPS: Software conventions for Registers

16 s0 callee saves

. . . (callee must save)

23 s7

24 t8 temporary (cont’d)

25 t9

26 k0 reserved for OS kernel

27 k1

28 gp Pointer to global area

29 sp Stack pointer

30 fp frame pointer

31 ra Return Address (HW)

Page 76: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Addressing Mode Example Instruction

Meaning When Used

Register: Add R4, R3. Meaning: R[R4] <- R[R4] + R[R3]. Used when a value is in a register.

Immediate: Add R4, #3. Meaning: R[R4] <- R[R4] + 3. Used for constants.

Displacement: Add R4, 100(R1). Meaning: R[R4] <- R[R4] + M[100 + R[R1]]. Used for accessing local variables.

Register Deferred: Add R4, (R1). Meaning: R[R4] <- R[R4] + M[R[R1]]. Used with a pointer or a computed address.

Absolute: Add R4, (1001). Meaning: R[R4] <- R[R4] + M[1001]. Used for static data.

Addressing Modes

This table shows the most common modes.

Page 77: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Organization

Viewed as a large, single-dimension array, with an address.

A memory address is an index into the array

"Byte addressing" means that the index points to a byte of memory.

(Figure: memory as an array of byte addresses 0, 1, 2, 3, 4, 5, 6, ..., each cell holding 8 bits of data.)

Page 78: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Addressing

Bytes are nice, but most data items use larger "words"

For DLX, a word is 32 bits or 4 bytes.

2 questions for the design of an ISA: since one could read a 32-bit word either as four byte loads from sequential byte addresses or as one word load from a single byte address:

How do byte addresses map to word addresses? Can a word be placed on any byte boundary?

Page 79: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Addressing Objects: Endianess and Alignment

msb lsb
little endian: bytes numbered 3 2 1 0 (byte 0 at the little end)
big endian: bytes numbered 0 1 2 3 (byte 0 at the big end)

Alignment: require that objects fall on address that is multiple of their size.

(Figure: byte positions 0 1 2 3; a word starting at a multiple of 4 is aligned, a word straddling that boundary is not aligned.)

Big Endian: address of most significant byte = word address (xx00 = Big End of word)

IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA

Little Endian: address of least significant byte = word address (xx00 = Little End of word)

Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
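Python's struct module can show the two byte orders directly; here the 32-bit word 0x01020304 is packed both ways:

```python
# Big- vs little-endian byte order for the 32-bit word 0x01020304.
import struct

word = 0x01020304
big    = struct.pack(">I", word)   # most significant byte at lowest address
little = struct.pack("<I", word)   # least significant byte at lowest address

print(big.hex())     # 01020304  (MIPS, SPARC, 68k style)
print(little.hex())  # 04030201  (x86, VAX, Alpha style)
```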

Page 80: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Assembly Language vs. Machine Language

Assembly provides convenient symbolic representation much easier than writing down numbers e.g., destination first

Machine language is the underlying reality e.g., destination is no longer first

Assembly can provide 'pseudoinstructions': e.g., “move r10, r11” exists only in assembly; it would be implemented as “add r10, r11, r0”

When considering performance you should count real instructions

Page 81: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

DLX arithmetic

ALU instructions can have 3 operands:
add r1, r2, r3
sub r1, r2, r3

Operand order is fixed (destination first). Example:

C code: A = B + C
DLX code: add r1, r2, r3

(registers associated with variables by compiler)

Page 82: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

DLX arithmetic

Design Principle: simplicity favors regularity. Why?

Of course this complicates some things...

C code: A = B + C + D; E = F - A;

DLX code:
add r1, r2, r3
add r1, r1, r4
sub r5, r6, r1

Operands must be registers, only 32 registers provided

Design Principle: smaller is faster. Why?

Page 83: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Executing assembly instructions

The program counter holds the instruction address. The CPU fetches the instruction from memory and puts it into the instruction register. Control logic decodes the instruction and tells the register file, ALU and other registers what to do. For an ALU operation (e.g. add), data flows from the register file, through the ALU, and back to the register file

Page 84: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

ALU Execution Example

Page 85: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

ALU Execution example

Page 86: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Instructions

Load and store instructions:
lw r11, offset(r10)
sw r11, offset(r10)

Example:

C code: A[8] = h + A[8]; assume h is in r2 and the base address of array A is in r3

DLX code:
lw r4, 32(r3)
add r4, r2, r4
sw r4, 32(r3)

Store word has destination last

Remember arithmetic operands are registers, not memory!
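The A[8] = h + A[8] sequence can be traced with a toy register file and a word-holding, byte-addressed memory (the values and base address here are made up for illustration):

```python
# Sketch of A[8] = h + A[8]: model registers and a byte-addressed memory of
# 32-bit words, with lw/sw as small helpers.
mem = {}            # word-aligned byte address -> 32-bit value
reg = [0] * 32      # r0..r31 (r0 simply stays zero here)

def lw(rd, offset, rs):  reg[rd] = mem[reg[rs] + offset]
def sw(rd, offset, rs):  mem[reg[rs] + offset] = reg[rd]

reg[2] = 10           # h in r2 (made-up value)
reg[3] = 0x100        # base address of A in r3 (made-up address)
mem[reg[3] + 32] = 7  # A[8] lives at base + 8*4 = base + 32

lw(4, 32, 3)                 # r4 = A[8]
reg[4] = reg[2] + reg[4]     # r4 = h + A[8]
sw(4, 32, 3)                 # A[8] = r4
print(mem[0x100 + 32])       # 17
```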

Page 87: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Operations - Loads

Load data from memory

lw R6, 0(R5) # R6 <= mem[0x14], assuming R5 = 0x14

Page 88: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Operations - Stores

Storing data to memory works essentially the same way:
sw R6, 0(R5)
Let’s assume R6 = 200 and R5 = 0x18; then mem[0x18] <- 200

Page 89: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

So far we’ve learned:

DLX— loading words but addressing bytes— arithmetic on registers only

Instruction Meaning

add r1, r2, r3 r1 = r2 + r3sub r1, r2, r3 r1 = r2 – r3lw r1, 100(r2) r1 = Memory[r2+100] sw r1, 100(r2) Memory[r2+100] = r1

Page 90: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Use of Registers

Example:

a = (b + c) - (d + e); // C statement
# r1 - r5 : a - e
add r10, r2, r3
add r11, r4, r5
sub r1, r10, r11

a = b + A[4]; // add an array element to a var// r3 has address A

lw r4, 16(r3)
add r1, r2, r4

Page 91: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Use of Registers : load and store

Example

A[8]= a + A[6] ; // A is in r3, a is in r2

lw r1, 24(r3) # r1 gets A[6] contents

add r1, r2, r1 # r1 gets the sum

sw r1, 32(r3) # sum is put in A[8]

Page 92: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

load and store

Ex:

a = b + A[i]; // A is in r3; a, b, i in r1, r2, r4

add r11, r4, r4 # r11 = 2 * i

add r11, r11, r11 # r11 = 4 * i

add r11, r11, r3 # r11 = addr. of A[i] (r3 + 4*i)

lw r10, 0(r11) # r10 = A[i]

add r1, r2, r10 # a = b + A[i]

Page 93: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Example: Swap

Swapping words

r2 has the base address of the array v

swap:
lw r10, 0(r2)
lw r11, 4(r2)
sw r10, 4(r2)
sw r11, 0(r2)

temp = v[0];
v[0] = v[1];
v[1] = temp;

Page 94: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

DLX Instruction Format

Opcode Rs1 rd Immediate

Opcode Rs1 Rs2 rd func

Opcode Offset added to PC

I-type Instructions

6 5 5 16

R-type Instructions

6 5 5 5 11

J-type Instructions

6 26

Instruction Format I-type R-type J-type

Page 95: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Machine Language

Instructions, like registers and words of data, are also 32 bits long Example: add r10, r1, r2 registers have numbers, 10, 1, 2

Instruction Format:

000000 00001 00010 01010 0…000100000

Opcode Rs1 Rs2 rd func

R-type Instructions

6 5 5 5 11
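Packing the fields of add r10, r1, r2 into a 32-bit word reproduces the bit pattern above (func = 32 for add, matching the 0…000100000 shown):

```python
# Encode an R-type instruction into its 6/5/5/5/11-bit fields.
def encode_r(opcode, rs1, rs2, rd, func):
    assert 0 <= opcode < 64 and 0 <= func < 2048
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func

# add r10, r1, r2: opcode 0, rs1 = 1, rs2 = 2, rd = 10, func = 32
word = encode_r(0b000000, 1, 2, 10, 0b00000100000)
print(f"{word:032b}")  # 00000000001000100101000000100000
```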

Page 96: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Machine Language

Opcode Rs1 rd Immediate

I-type Instructions (Loads/stores)

6 5 5 16

100011 00010 01010 0000000000100000

Consider the load-word and store-word instructions. What would the regularity principle have us do? New principle: good design demands a compromise

Introduce a new instruction format: I-type for data transfer instructions (the other format was R-type, for register operations)

Example: lw r10, 32(r2)
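The I-type fields for lw r10, 32(r2) can be packed the same way (assuming the MIPS lw opcode 35, i.e. 100011, which the bit pattern above follows):

```python
# Encode an I-type instruction into its 6/5/5/16-bit fields.
def encode_i(opcode, rs1, rd, imm):
    assert 0 <= imm < (1 << 16)
    return (opcode << 26) | (rs1 << 21) | (rd << 16) | imm

# lw r10, 32(r2): opcode 35 (lw in MIPS), rs1 = 2, rd = 10, immediate = 32
word = encode_i(0b100011, 2, 10, 32)
print(f"{word:032b}")  # 10001100010010100000000000100000
```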

Page 97: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Machine Language

000010 offset to .L1

Jump instructions

Example: j .L1

Opcode Offset added to PC

J-type Instructions (Jump, Jump and Link, Trap, return from exception)

6 26

Page 98: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

DLX Instruction Format

Opcode Rs1 rd Immediate

Opcode Rs1 Rs2 rd func

Opcode Offset added to PC

I-type Instructions

6 5 5 16

R-type Instructions

6 5 5 5 11

J-type Instructions

6 26

Instruction Format I-type R-type J-type

Page 99: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Instructions for Making Decisions

beq reg1, reg2, L1 Go to the statement labeled L1 if the value in reg1 equals the value in reg2

bne reg1, reg2, L1 Go to the statement labeled L1 if the value in reg1 does not equal the value in reg2

j L1 Unconditional jump

jr r10 “jump register”. Jump to the instruction address held in register r10

Page 100: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Making Decisions

Example

if ( a != b) goto L1; // x,y,z,a,b mapped to r1-r5

x = y + z;

L1 : x = x – a;

bne r4, r5, L1 # goto L1 if a != b

add r1, r2, r3 # x = y + z (ignored if a=b)

L1: sub r1, r1, r4 # x = x - a (always executed)

Page 101: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

if-then-else

Example: if ( a==b) x = y + z;

else x = y – z ;

bne r4, r5, Else # goto Else if a!=b

add r1, r2, r3 # x = y + z

j Exit # goto Exit
Else: sub r1, r2, r3 # x = y - z

Exit :

Page 102: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Example: Loop with array index

Loop: g = g + A[i];
i = i + j;
if (i != h) goto Loop;

.... r1, r2, r3, r4 = g, h, i, j, array base = r5

LOOP: add r11, r3, r3 # r11 = 2 * i
add r11, r11, r11 # r11 = 4 * i
add r11, r11, r5 # r11 = addr. of A[i]
lw r10, 0(r11) # load A[i]
add r1, r1, r10 # g = g + A[i]
add r3, r3, r4 # i = i + j
bne r3, r2, LOOP

Page 103: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Other decisions

Set R1 if R2 less than R3: slt R1, R2, R3. Compares two registers, R2 and R3: R1 = 1 if R2 < R3, else R1 = 0 (if R2 >= R3)

Example: slt r11, r1, r2

Branch if less than. Example: if (A < B) goto LESS
slt r11, r1, r2 # r11 = 1 if A < B
bne r11, r0, LESS

Page 104: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Loops

Example:

while (A[i] == k) // i, j, k in r3, r4, r5

i = i + j; // A is in r6

Loop: sll r11, r3, 2 # r11 = 4 * i
add r11, r11, r6 # r11 = addr. of A[i]

lw r10, 0(r11) # r10 = A[i]

bne r10, r5, Exit # goto Exit if A[i]!=k

add r3, r3, r4 # i = i + j

j Loop # goto Loop

Exit:

Page 105: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

I: op rs rt 16-bit address
J: op 26-bit address

Addresses in Branches and Jumps

Instructions:
bne r14, r15, Label # next instruction is at Label if r14 != r15
beq r14, r15, Label # next instruction is at Label if r14 = r15
j Label # next instruction is at Label

Formats:

Addresses are not 32 bits — How do we handle this with large programs?

— First idea: limit branch targets to the first 2^16 addresses

Page 106: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

I: op rs rt 16-bit address

Addresses in Branches

Instructions:
bne r14, r15, Label # next instruction is at Label if r14 != r15
beq r14, r15, Label # next instruction is at Label if r14 = r15

Formats:

Treat the 16-bit number as an offset to the PC register (PC-relative addressing). It is a word offset instead of a byte offset; why? Most branches are local (principle of locality)

Jump instructions just use the high-order bits of the PC (pseudodirect addressing): the 32-bit jump address = the 4 most significant bits of the PC concatenated with the 26-bit word address (i.e. a 28-bit byte address). Address boundaries of 256 MB
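Both target calculations can be sketched directly (a sketch assuming MIPS-style semantics, where the branch offset is in words relative to the instruction after the branch):

```python
# PC-relative branch target and pseudodirect jump target computation.
def branch_target(pc, word_offset):
    # offset is in words, relative to the instruction after the branch
    return (pc + 4) + 4 * word_offset

def jump_target(pc, addr26):
    # keep the top 4 bits of the PC, concatenate the 26-bit word address << 2
    return (pc & 0xF0000000) | (addr26 << 2)

print(hex(branch_target(0x00400000, 2)))        # 0x40000c (skips 2 instructions)
print(hex(jump_target(0x00400000, 0x10000)))    # 0x40000
```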

Page 107: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Conditional Branch Distance

(Figure: histogram of branch displacement, 0 to 15 words, for integer and FP averages.)

• 65% of integer branches are 2 to 4 instructions

Page 108: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Conditional Branch Addressing

Frequency of comparison types in branches (Int Avg. / FP Avg.):

EQ/NE: 86% / 37%
GT/LE: 7% / 23%
LT/GE: 7% / 40%

PC-relative, since most branches are relatively close to the current PC. At least 8 bits suggested (128 instructions). Compare Equal/Not Equal is most important for integer programs (86%)

Page 109: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

PC-relative addressing

Program Counter

Addressable space relative to PC

For larger distances, the jump register instruction jr is required.

Page 110: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Example

Example

Assume the address of LOOP is 80000:

LOOP: mult $9, $19, $10 # R9 = R19 * R10
lw $8, 1000($9) # R8 = mem[R9 + 1000]
bne $8, $21, EXIT
add $19, $19, $20 # i = i + j
j LOOP
EXIT: ...

Machine code (fields: op rs rt rd imm/func):
80000: 0 19 10 9 24 (mult)
80004: 35 9 8 1000 (lw)
80008: 5 8 21 2 (bne; word offset to EXIT)
80012: 0 19 20 19 32 (add)
80016: 2 20000 (j; word address 20000 = byte address 80000)
80020: EXIT: ...

Page 111: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Procedure calls

Procedures or subroutines: Needed for structured programming

Steps followed in executing a procedure call: place parameters where the procedure (callee) can access them; transfer control to the procedure; acquire the storage resources needed for the procedure; perform the desired task; place results where the calling program (caller) can access them; return control to the point of origin

Page 112: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Resources Involved

Registers used for procedure calling:
$a0 - $a3: four argument registers in which to pass parameters
$v0 - $v1: two value registers in which to return values
r31: one return address register to return to the point of origin

Transferring control to the callee: jal ProcedureAddress (jump-and-link). Jumps to the procedure address; the return address (PC + 4) is saved in r31. Example: jal 20000

Returning control to the caller: jr r31. The instruction following the jal is executed next
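The jal/jr handshake reduces to saving and restoring one address (a toy trace with made-up addresses; jal 20000 as in the example above):

```python
# Sketch of jal/jr control transfer: jal saves the return address (PC + 4)
# in r31, and jr r31 jumps back to it.
def simulate():
    trace = []
    pc = 1000                  # caller executes jal 20000 at address 1000
    r31 = pc + 4               # jal: link register <- return address
    pc = 20000                 # jal: jump to the procedure
    trace.append(pc)
    pc = r31                   # jr r31: return to the caller
    trace.append(pc)
    return trace

print(simulate())  # [20000, 1004]
```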

Page 113: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Memory Stacks

Useful for stacked environments/subroutine call & return even if operand stack not part of architecture

Stacks that Grow Up vs. Stacks that Grow Down:

(Figure: memory addresses run from 0 (low) to infinity (high); depending on convention, the stack pointer SP grows up toward high addresses or down toward low addresses as items a, b, c are pushed.)

Page 114: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Calling conventions

int func(int g, int h, int i, int j)
{
  int f;
  f = (g + h) - (i + j);
  return f;
} // g, h, i, j in $a0, $a1, $a2, $a3; f in r8

func: addi $sp, $sp, -12 # make room in stack for 3 words
sw r11, 8($sp) # save the regs we want to use
sw r10, 4($sp)
sw r8, 0($sp)
add r10, $a0, $a1 # r10 = g + h
add r11, $a2, $a3 # r11 = i + j
sub r8, r10, r11 # r8 has the result
add $v0, r8, r0 # return reg $v0 has f

Page 115: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Calling (cont.)

lw r8, 0($sp) # restore r8
lw r10, 4($sp) # restore r10
lw r11, 8($sp) # restore r11
addi $sp, $sp, 12 # restore sp

jr $ra

• we did not have to restore r10 - r19 (caller save)
• we do need to restore r1 - r8 (must be preserved by the callee)

Page 116: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Nested Calls

Stacking of subroutine calls & returns and environments:

A: CALL B

CALL C

C: RET

RET

B:

A

A B

A B C

A B

A

Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM). Sometimes stacks are implemented via software convention

D:

Page 117: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Compiling a String Copy Proc.

void strcpy(char x[], char y[])
{
  int i = 0;
  while ((x[i] = y[i]) != 0)   /* parentheses needed: assign, then test */
    i++;
} // x and y base addresses are in $a0 and $a1

strcpy: addi $sp, $sp, -4 # reserve 1 word of stack space
sw r8, 0($sp) # save r8
add r8, $zero, $zero # i = 0

L1: add r11, $a1, r8 # addr. of y[i] in r11
lb r12, 0(r11) # r12 = y[i]
add r13, $a0, r8 # addr. of x[i] in r13
sb r12, 0(r13) # x[i] = y[i]
beq r12, $zero, L2 # if y[i] == 0 goto L2
addi r8, r8, 1 # i++
j L1 # go to L1

L2: lw r8, 0($sp) # restore r8
addi $sp, $sp, 4 # restore $sp
jr $ra # return

Page 118: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

IA-32

1978: The Intel 8086 is announced (16-bit architecture)
1980: The 8087 floating point coprocessor is added
1982: The 80286 increases address space to 24 bits, adds instructions
1985: The 80386 extends to 32 bits, new addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
1997: 57 new “MMX” instructions are added, Pentium II
1999: The Pentium III adds another 70 instructions (SSE)
2001: Another 144 instructions (SSE2)
2003: AMD extends the architecture to increase address space to 64 bits, widens all registers to 64 bits, and makes other changes (AMD64)
2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions

“This history illustrates the impact of the “golden handcuffs” of compatibility

“adding new features as someone might add clothing to a packed bag”

“an architecture that is difficult to explain and impossible to love”

Page 119: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

IA-32 Overview

Complexity: instructions from 1 to 17 bytes long; one operand must act as both a source and destination; one operand can come from memory; complex addressing modes

e.g., “base or scaled index with 8 or 32 bit displacement”

Saving grace: the most frequently used instructions are not too difficult to build, and compilers avoid the portions of the architecture that are slow

“what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”

Page 120: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

IA32 Registers

Oversimplified Architecture Four 32-bit general purpose registers:

eax, ebx, ecx, edx. The name al refers to “the lower 8 bits of eax”

Stack Pointer esp

Fun fact: once upon a time, the x86 was a 16-bit CPU. So, when they upgraded x86 to 32 bits, they added an “e” in front of every register name and called it “extended”

Page 121: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

GPR0 EAX Accumulator

GPR1 ECX Count register, string, loop

GPR2 EDX Data Register; multiply, divide

GPR3 EBX Base Address Register

GPR4 ESP Stack Pointer

GPR5 EBP Base Pointer – for base of stack seg.

GPR6 ESI Index Register

GPR7 EDI Index Register

CS Code Segment Pointer

SS Stack Segment Pointer

DS Data Segment Pointer

ES Extra Data Segment Pointer

FS Data Seg. 2

GS Data Seg. 3

PC EIP Instruction Counter

Eflags Condition Codes

Intel 80x86 Integer Registers

Page 122: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

x86 Assembly

mov <dest>, <src>

Move the value from <src> into <dest>

Used to set initial values

add <dest>, <src>

Add the value from <src> to <dest>

sub <dest>, <src>

Subtract the value from <src> from <dest>

Page 123: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

x86 Assembly

push <target>

Push the value in <target> onto the stack

Also decrements the stack pointer, ESP (remember stack grows from high to low)

pop <target>

Pops the value from the top of the stack, put it in <target>

Also increments the stack pointer, ESP
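Push and pop can be modeled with a dictionary for memory and an explicit ESP (a sketch; 4-byte slots, stack growing toward lower addresses):

```python
# Sketch of x86 push/pop: the stack grows downward, so push decrements
# ESP by 4 before storing, and pop loads then increments ESP by 4.
mem = {}
esp = 0x1000   # made-up initial stack pointer

def push(value):
    global esp
    esp -= 4
    mem[esp] = value

def pop():
    global esp
    value = mem[esp]
    esp += 4
    return value

push(42)
push(7)
top = pop()        # last value pushed comes off first
next_val = pop()
print(top, next_val, hex(esp))  # 7 42 0x1000
```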

Page 124: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

x86 Assembly

jmp <address>

Jump to an instruction (like goto)

Change the EIP to <address>

Call <address>

A function call.

Pushes the address of the next instruction (the return address) onto the stack, and jumps to <address>

Page 125: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

x86 Assembly

lea <dest>, <src>

Load Effective Address of <src> into register <dest>.

Used for pointer arithmetic (not actual memory reference)

int <value>

interrupt – a software-generated trap into the operating system kernel, with interrupt number <value>

int 0x80 means “Linux system call”

Page 126: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

x86 Assembly

Condition Codes:

CF: Carry Flag – Overflow Detection (Unsigned)

ZF: Zero Flag

SF: Sign Flag

OF: Overflow Flag – Overflow Detection (Signed)

Condition codes are usually accessed through conditional branches (not directly)

Page 127: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Interrupt convention

int 0x80 – system call interrupt

eax – system call number (e.g. 1 = exit, 2 = fork, 3 = read, 4 = write)

ebx – argument #1

ecx – argument #2

edx – argument #3

Page 128: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

CISC vs RISC

RISC = Reduced Instruction Set Computer (DLX)

CISC = Complex Instruction Set Computer (x86)

Both have their advantages.

Page 129: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

RISC

Not very many instructions

All instructions are the same length, in both execution time and bit length

Results in simpler CPUs (easier to optimize)

Usually takes more instructions to express a useful operation

Page 130: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

CISC

Very many instructions (the x86 instruction set manual runs over 700 pages)

Variable-length instructions

More complex CPUs

Usually takes fewer instructions to express a useful operation (lower memory requirements)

Page 131: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Some links

Great stuff on x86 asm:

http://www.cs.dartmouth.edu/~cs37/

Page 132: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Instruction Architecture Ideas

If code size is most important, use variable length instructions

If simplicity is most important, use fixed length instructions

Simpler ISAs are easier to optimize!

Recent embedded machines (ARM, MIPS) added optional mode to execute subset of 16-bit wide instructions (Thumb, MIPS16); per procedure decide performance or density

Page 133: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE

Summary

Instruction complexity is only one variable lower instruction count vs. higher CPI / lower clock rate

Design Principles: simplicity favors regularity smaller is faster good design demands compromise make the common case fast

Instruction set architecture a very important abstraction indeed!