Fall 2015, arz 1 CPE555A: Real-Time Embedded Systems Lecture 3 Ali Zaringhalam Stevens Institute of Technology

CPE555A: Real-Time Embedded Systems


Page 1: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 1

CPE555A: Real-Time Embedded Systems

Lecture 3

Ali Zaringhalam

Stevens Institute of Technology

Page 2: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 2

Administrative

Assignment 1 was posted

Page 3: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 3

Outline

What is forwarding?
Memory hierarchy
Memory models

Page 4: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 4

MIPS 5-Stage Integer Pipeline

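[Figure: MIPS 5-stage integer pipeline datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: PC, adder (+4), NPC, Inst. Mem., IR, Regs, Sign Ext. (16 to 64 bits), A, B, Imm., MUXes, ALU, Zero?, Cond., ALUOutput, Data Mem., LMD]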

Page 5: CPE555A: Real-Time Embedded Systems

CPE555A – Real-Time Embedded Systems, Stevens Institute of Technology

Fall 2015, arz 5

MIPS Pipeline

[Figure: three instructions overlapped across clock cycles CC1-CC7, each passing through IM, Reg, ALU, DM, Reg]

• Instruction memory (IM) and data memory (DM) are shown as separate units
• All operations in a pipeline stage must complete in one clock cycle
• Values passed from one stage to another must be stored in internal registers
• Registers are labeled with the names of the stages they connect (e.g., IF/ID registers)

Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB

Intermediate registers introduce delay in the datapath

Page 6: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 6

Data Hazard

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram of the five instructions above, starting in CC1 through CC5, each passing through IF, Reg, ALU, Mem, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Page 7: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 7

RAW Data Hazard

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

• R1 is not written back to the register file until the WB cycle (CC5) of the ADD instruction
• R1 is needed in the ID cycle of the succeeding instructions
  – CC3 for SUB
  – CC4 for AND
  – CC5 for OR
  – CC6 for XOR
• Unless the hazard is handled, the instructions that read R1 before the write-back has taken effect (SUB, AND and OR) operate on the wrong operand value

Page 8: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 8

Split-Phase Register Read/Write

• XOR operates correctly because its ID cycle is in CC6
• OR can be made to operate correctly by:
  – Writing the register file in the first half of the clock cycle
  – Reading the register file in the second half of the clock cycle

[Figure: ADD R1, R2, R3 and OR R8, R1, R9 across CC1-CC7; the register-file cycle is split into a 1st half (write) and a 2nd half (read); pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Page 9: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 9

Forwarding (aka Bypassing)

ADD R1, R2, R3
SUB R4, R1, R5

[Figure: ADD and SUB across CC1-CC5; ALUout in the EX/MEM pipeline register is fed back to the ALU input; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

• The result is needed by the SUB instruction in CC4, but the ADD instruction has already computed it in the previous cycle, CC3
• Forward the result of ALU operations from the previous cycle
• The ALU result is written to ALUout in the EX/MEM pipeline register
• If the forwarding logic detects that one of the register operands has been "touched" (written) by the previous ALU operation, the control logic selects the ALU input from EX/MEM instead of ID/EX

Page 10: CPE555A: Real-Time Embedded Systems

Hazard Detection & Control

Control detects that the content of ID/EX[A] read in the ID phase is stale

It knows that the update is in EX/MEM[ALUout] because it knows what the previous instruction was

So it selects the input to the ALU from EX/MEM[ALUout]
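
The selection described above can be sketched in C. This is only an illustrative model, not the actual control logic from the lecture: the struct layouts, the field names and the function `select_alu_input_a` are assumptions made for the example. It compares the destination register of the previous instruction with the source register of operand A and picks EX/MEM[ALUout] when they match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the ID/EX pipeline register (operand A only). */
typedef struct {
    uint8_t  src_reg;    /* register number that operand A was read from */
    uint32_t a;          /* ID/EX[A]: value read from the register file in ID */
} id_ex_t;

/* Hypothetical snapshot of the EX/MEM pipeline register. */
typedef struct {
    bool     writes_reg; /* previous instruction produces a register result */
    uint8_t  dest_reg;   /* destination register of the previous instruction */
    uint32_t alu_out;    /* EX/MEM[ALUout] */
} ex_mem_t;

/* Return the value the mux should feed to ALU input A. */
uint32_t select_alu_input_a(const id_ex_t *id_ex, const ex_mem_t *ex_mem)
{
    /* R0 is hardwired to zero in MIPS, so it is never forwarded. */
    bool stale = ex_mem->writes_reg &&
                 ex_mem->dest_reg != 0 &&
                 ex_mem->dest_reg == id_ex->src_reg;

    return stale ? ex_mem->alu_out : id_ex->a;
}
```

For ADD R1, R2, R3 followed by SUB R4, R1, R5, `dest_reg` and `src_reg` are both 1, so the function returns the forwarded ALU result rather than the stale register-file value.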

[Figure: a mux in front of the ALU selects between ID/EX[A] (IN_1) and EX/MEM[ALUout] (IN_2); the Hazard Control block drives the mux select; operand B comes from ID/EX]

Page 11: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 11

LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

[Figure: pipeline diagram of the four instructions above across CC1-CC7 with a stall inserted after the LW; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Stall in the Pipeline
• No instruction begins in CC3
• No instruction completes in CC6

Forwarding is now through MEM/WB

OR no longer requires forwarding

Page 12: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 12

Tabular View of Pipelining

Without a stall:

Instruction     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
LW  R1,0(R2)    IF   ID   EX   MEM  WB
SUB R4,R1,R5         IF   ID   EX   MEM  WB
AND R6,R1,R7              IF   ID   EX   MEM  WB
OR  R8,R1,R9                   IF   ID   EX   MEM  WB

With a stall:

Instruction     CC1  CC2  CC3    CC4  CC5  CC6  CC7  CC8
LW  R1,0(R2)    IF   ID   EX     MEM  WB
SUB R4,R1,R5         IF   Stall  ID   EX   MEM  WB
AND R6,R1,R7                     IF   ID   EX   MEM  WB
OR  R8,R1,R9                          IF   ID   EX   MEM
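
A rough sketch of when such a stall is needed, written in C. The function name `needs_load_use_stall` and its parameters are assumptions for illustration, not names from the lecture. The condition is the usual load-use check: the previous instruction is a load, and its destination register is a source of the instruction right behind it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the instruction immediately after a load must stall.
 * prev_is_load: the previous instruction is a load (e.g., LW)
 * prev_dest:    destination register of the previous instruction
 * src1, src2:   source registers of the current instruction
 */
bool needs_load_use_stall(bool prev_is_load, uint8_t prev_dest,
                          uint8_t src1, uint8_t src2)
{
    /* R0 is hardwired to zero in MIPS, so a "load into R0" never matters. */
    return prev_is_load && prev_dest != 0 &&
           (prev_dest == src1 || prev_dest == src2);
}
```

For LW R1,0(R2) followed by SUB R4,R1,R5 the check is true, which matches the stalled SUB row in the table. An ALU instruction in front of SUB would not trigger it, because forwarding from EX/MEM covers that case.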


Page 13: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 13

Assumptions Made To-Date

All memory operations take the same amount of time to complete

Each memory operation must complete before the next one can begin

Monolithic memory system: no structure


Page 14: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 14

Example: Perfect Memory

[Figure: four instructions flowing through the IF, ID, ALU, MEM and WB stages, one instruction starting per cycle]

If every memory access takes 1 cycle, then, assuming our 5 stage pipeline, this program fragment takes 8 cycles
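
The 8-cycle figure follows from the usual pipeline count: the first instruction needs one cycle per stage, and each further instruction adds one cycle, plus any stall cycles. The sketch below assumes the fragment contains four instructions, as the diagram suggests; the function name `pipeline_cycles` is an assumption for illustration.

```c
/*
 * Cycles needed to run `instructions` instructions on a pipeline with
 * `stages` stages, given `stall_cycles` extra bubble cycles in total.
 */
unsigned pipeline_cycles(unsigned stages, unsigned instructions,
                         unsigned stall_cycles)
{
    if (instructions == 0)
        return 0;
    /* First instruction fills the pipeline; the rest finish one per cycle. */
    return stages + (instructions - 1) + stall_cycles;
}
```

With 5 stages, 4 instructions and no stalls this gives 5 + 3 = 8 cycles. Each stall cycle adds one to the total.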


Page 15: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 15

Storage-Device Hierarchy

[Figure: storage-device hierarchy ordered by increasing access time; access times shown: 0.25-0.5 ns, 0.5-20 ns, 80-250 ns]

4 GHz CPU cycle: T = 0.25 ns

The CPU can access registers in one CPU clock cycle.


Page 16: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 16

Example: Real Memory

[Figure: pipeline diagram in which the instructions following the first are delayed by bubbles in the IF and ID stages]

If memory references take more than one cycle, then there will be a lot of stalls

Every instruction requires an instruction memory reference
Every Load or Store requires a memory reference

Memory hierarchy is designed to alleviate this problem

Page 17: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 17

Memory System Complexity

A typical system contains a mix of memory technologies

The faster the memory, the more expensive it is. In practice a memory hierarchy is used to get the right price/performance.

There is also a need for non-volatile memory that survives a reset.
  Example use case: executable program

Memory address space must be partitioned between I/O devices and various software needs such as stack and heap memory.

Page 18: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 18

Memory Chip Organization

[Figure: memory chip organized as a two-dimensional array of bit cells. A decoder driven by the high half of the address bits selects a word line; the bit lines feed a multiplexer, controlled by the low half of the address bits, which produces the output.]

Page 19: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 19

Single-Transistor DRAM

Write:
  Drive word line high (row select)
  Drive bit line
  Charge (1) or drain (0) capacitor

Read:
  Drive word line high (row select)
  Capacitor connected to bit line
  Capacitor output connected to multiplexer

[Figure: single-transistor DRAM cell with a row select line and a bit line]

Page 20: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 20

Six-Transistor Static RAM Cell

[Figure: six-transistor SRAM cell: an inverter pair connected in a ring, with two bit lines and a word (row select) line]

Write:
1. Drive word line high
2. Drive bit lines

Read:
1. Drive word line high
2. Connect inverter outputs to bit lines
3. Result sent to input of multiplexer

Once a value is stored in the cell, the ring structure of the inverter pair ensures the value circulates indefinitely as long as power is applied to the cell. Hence Static RAM (SRAM).

Page 21: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 21

SRAM-DRAM Differences

Chip density is lower for SRAM
  SRAM requires more transistors per cell compared to DRAM
  Higher cell density for DRAM: ~6-10x SRAM
Access time is faster for SRAM
  SRAM uses active devices (inverters) to drive bit lines whereas DRAM uses a capacitor
  Bit lines are driven faster by the stronger signal (higher current) in SRAM: ~10x faster than DRAM
Cost: SRAM is more expensive than DRAM
Both SRAM and DRAM are volatile
  They lose their content when powered off

Page 22: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 22

Embedded System Memory

Most embedded systems include an SRAM memory (~100 K)

They could also have DRAM memory if more memory is needed and it is not cost-effective to supply it in SRAM

DRAM refresh
  Over time, the charge on the capacitors leaks away
  A capacitor also loses its charge when the cell is read
  So the cell must be refreshed to maintain its stored value
  Refresh: a periodic dummy read and write to every cell
    Typical interval: ~60 milliseconds
    Persistence interval is much longer: ~1 second
A DRAM controller is used to refresh memory regularly
  If memory is accessed during the refresh cycle of a cell, the memory controller stalls the CPU
  The stall introduces variability in program execution

Page 23: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 23

Flash Memory

Flash: semiconductor, non-volatile memory

Compared to a hard disk
  Lower latency
  Lower power
  Lighter weight, smaller size, shock resistance

Rough comparisons for DRAM:Flash:Disk
  Cost per bit: 100:10:1
  Access latency (NAND Flash): 1:1,000:1,000,000

Page 24: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 24

Types of Flash

NOR flash
  Fast read (~100 ns), slow writes (200 µsec), very slow erase (1 sec)
  10K to 100K erase cycles before it wears out
  Used for instruction memory in mobile systems
    A read-only application

NAND flash
  Denser (bits/area, ~40% of NOR), cheaper per GB
  Slow read (50 µsec), slow writes (200 µsec), slow erase (2 msec)
  100K to 1M erase cycles before it wears out
  Used for data storage (phones, USB keys, solid-state drives)

Both types have durability issues
  Damaged after some number of write/erase cycles

Page 25: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 25

NAND Flash Chips

Page: minimum unit of read/write
  0.5 KB to 8 KB of data + spare area for error coding
Block: minimum unit of erasing
  64 to 128 pages
Chip: 1 to 16 GB

Page 26: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 26

Flash Operations

Read the contents of a page
  20-50 µs

Erase sets all bits in a block to 1
  Pages must be erased before they can be written
  Update-in-place is not possible
  0.5-3 ms

Write data to a page
  Only 1 -> 0 transitions are allowed
  Writing within a block must be ordered by page
  100-300 µs
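
The erase and write rules above can be modeled in a few lines of C. This is a toy in-memory model, not a driver for any real flash part: the sizes `PAGE_BYTES` and `PAGES_PER_BLOCK` and the function names are assumptions chosen for illustration. Erase sets every bit in the block to 1, and programming a page can only clear bits, which is modeled with a bitwise AND.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative geometry (assumed): 512-byte pages, 64 pages per block. */
enum { PAGE_BYTES = 512, PAGES_PER_BLOCK = 64 };

typedef struct {
    uint8_t page[PAGES_PER_BLOCK][PAGE_BYTES];
} flash_block_t;

/* Erase works on a whole block and sets every bit to 1. */
void flash_erase_block(flash_block_t *blk)
{
    memset(blk->page, 0xFF, sizeof blk->page);
}

/*
 * Programming can only change bits from 1 to 0, so the stored value
 * becomes (old AND new). Writing to a page that was not erased first
 * therefore does not give the intended data in general.
 */
void flash_program_page(flash_block_t *blk, size_t page_no,
                        const uint8_t *data)
{
    for (size_t i = 0; i < PAGE_BYTES; i++)
        blk->page[page_no][i] &= data[i];
}
```

Programming 0xA5 into an erased page stores 0xA5. Programming 0x5A over it without an erase leaves 0x00, which shows why update-in-place is not possible.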


Page 27: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 27

Reliability

Wear-out
  Flash cells are physically damaged by read, write and erase operations
Write disturb
  Writing to a page can corrupt values stored in other pages in the block
Read disturb
  Reading data can corrupt the data in the block
  It takes many reads to see this effect

That’s why there is a spare area for error correction coding

Page 28: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 28

The Principle of Locality

The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.

Two Different Types of Locality:
  Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse of data)
  Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code execution, array access)
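
As a small illustration of spatial locality, the two C functions below compute the same sum over a two-dimensional array. The array size and function names are assumptions for the example. C stores arrays row by row, so the first loop touches consecutive addresses, while the second jumps `COLS` elements between accesses and so makes less use of each block fetched into the cache.

```c
/* Illustrative array dimensions (assumed). */
enum { ROWS = 64, COLS = 64 };

/* Row-major traversal: consecutive accesses are adjacent in memory. */
long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: consecutive accesses are COLS elements apart. */
long sum_col_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions return the same value. The accumulator `sum` and the loop counters are reused on every iteration, which is an instance of temporal locality.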


Page 29: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 29

Memory Hierarchy

• Present the user with as much memory as is available in the cheapest technology
• Provide access at the speed offered by the fastest technology at the cost of the cheapest technology (on average)

Level                                                    Speed     Size
Registers (in processor, with Control and ALU/Datapath)  ~1s ns    ~100 bytes
On-chip cache / second-level cache (SRAM)                ~10s ns   ~Kbytes
Main memory (DRAM)                                       ~100s ns  ~Mbytes
Secondary storage (disk)                                 ~10s ms   ~Gbytes
Tertiary storage (disk/tape)                             ~ sec     ~TBytes

Page 30: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 30

Accessing Data in Main Memory

Ignore caches for the moment

Data access involves
  Sending address to memory
  Address indexes into memory: mem[address]
  Data from mem[address] is returned to CPU

Memory is referenced just like an array

result <-- mem[index]

[Figure: the CPU sends an Address to memory and receives Data; memory mem[ ] is drawn as an array with memory addresses 0x00 through 0x07 and the contents of memory at each address]

Page 31: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 31

Demand-Based System

[Figure: Processor, Cache (SRAM) and Main Memory (DRAM), with block data transfer between cache and main memory]

• Processor request is first looked up in the top level of the hierarchy
• If data cannot be found in the top level, the next level is searched
• Memory is copied from a lower level to a higher level in blocks of sequential address locations
  • it is faster to read/write blocks than individual words
  • takes advantage of spatial locality of reference in programs

Page 32: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 32

Block Size In Memory Hierarchy

Block (aka line): The minimum unit of information that can be transferred between cache & main memory (more generally between two adjacent layers in the memory hierarchy)

Level         Storage scale   Block size (transfer to the next level down)
Register      ~100 bytes      1 - 8 bytes
Cache         ~ Kbytes        100 bytes
Main memory   ~ Mbytes        Kbytes
Disk          ~ Gbytes        ~ Mbytes
Tape          Unlimited

Page 33: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 33

A Simple Cache

[Figure: cache contents before and after the reference to Xn. Before: X4, X1, Xn-2, Xn-1, X2, X3 and other entries. After: the same cache with Xn added.]

• Processor request: 1 word
• Block size: 1 word
• Processor requests Xn, which is not in the cache
• Request results in a miss
• Cache is full
• Xn is brought from memory into the cache, replacing X6

Page 34: CPE555A: Real-Time Embedded Systems

Leveraging Locality of Reference

Spatial locality
  Fetch data in blocks
  Assumes adjacent data will be used with high probability

Temporal locality
  Keep data in memory until you need to purge it due to a miss
  Assumes data will remain useful for a while with high probability

Fall 2015, arz 34

Page 35: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 35

Cache Design

Block size

Block organization
  Direct-mapped
  Fully-associative
  Set-associative

Block replacement policy
  FIFO
  LRU
  Random

Write policy
  Write-back
  Write-through
  Write-allocate
  Write-no-allocate

Page 36: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 36

Cache Associativity

[Figure: block 45 of main memory placed in an 8-block cache (blocks 0-7) under three organizations; the set-associative cache is divided into Set 0 through Set 3]

Fully associative: block 45 can go anywhere in the cache (similar to an array). All entries must be checked for lookup.

Direct-mapped: block 45 can go in only one location in the cache: 45 mod 8 = 5 (similar to a hash table). Only a single entry must be checked.

Set-associative: block 45 can go anywhere in one set in the cache: 45 mod 4 = 1 (similar to a hash bucket). All entries within a single set must be checked for lookup.
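
The three placements of block 45 are the same calculation with a different number of sets. A minimal C sketch, where the function name `cache_set_for_block` and its parameters are assumptions for illustration, and where the number of ways is assumed to be non-zero and to divide the number of cache blocks evenly:

```c
/*
 * Return the set that a memory block maps to.
 * block_addr:   block number in main memory (e.g., 45)
 * cache_blocks: total number of blocks in the cache (e.g., 8)
 * ways:         blocks per set (1 = direct-mapped,
 *               cache_blocks = fully associative)
 */
unsigned cache_set_for_block(unsigned block_addr, unsigned cache_blocks,
                             unsigned ways)
{
    unsigned sets = cache_blocks / ways;  /* number of sets in the cache */
    return block_addr % sets;
}
```

For the 8-block cache above, `ways = 1` gives set 5, `ways = 2` gives set 1, and `ways = 8` gives set 0, the only set of a fully associative cache.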


Page 37: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 37

How To Find a Block?

CPU address: | Tag | Index | Block offset |
(Tag and Index together form the block address)

Tag: compared against CPU address for hit/miss

Index: selects the set. NULL for a fully-associative cache. In a fully-associative cache there is only one set. All blocks belong to this set.

Block offset: selects byte within a block

Page 38: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 38

Example

[Figure: block 45 of main memory and an 8-block cache (blocks 0-7) under three organizations; the set-associative cache is divided into Set 0 through Set 3]

Fully associative: 1 set. Eight blocks per set. Index = 0. 8-way set associative.

Direct-mapped: 8 sets. One block per set. Index = 3. 1-way set associative.

Set-associative: 4 sets. Two blocks per set. Index = 2. 2-way set associative.

Page 39: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 39

How To Find a Block?

CPU address: | Tag | Index | Block offset |
(Tag and Index together form the block address)

Given the number of bits in the index field (call it index), the number of sets is

$2^{\text{index}} = \dfrac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$

The number of blocks per set is the set associativity.

Page 40: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 40

Example

• Block size = 64 bytes
• Cache size = 64 Kbytes
• 2-way set associative
• How many sets?
• What is the size of the Index field?

[Figure: one set holding two blocks, Way 1 and Way 2]

Use approximation: $2^{10} \approx 1000$

$2^{\text{index}} = \dfrac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \dfrac{2^{16}}{2^{6} \times 2} = 2^{9} \;\Rightarrow\; \text{index} = 9$

Number of sets: $2^{9} = 512$
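
The same calculation as a short C sketch. The function names `num_sets` and `index_bits` are assumptions for illustration, and all sizes are assumed to be powers of two.

```c
/* Number of sets = cache size / (block size x set associativity). */
unsigned num_sets(unsigned cache_bytes, unsigned block_bytes, unsigned ways)
{
    return cache_bytes / (block_bytes * ways);
}

/* Width of the index field: log2 of the number of sets. */
unsigned index_bits(unsigned sets)
{
    unsigned bits = 0;
    while (sets > 1) {
        sets >>= 1;
        bits++;
    }
    return bits;
}
```

For a 64-Kbyte, 2-way cache with 64-byte blocks, `num_sets(64 * 1024, 64, 2)` is 512 and `index_bits(512)` is 9, matching the result above.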


Page 41: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 41

Fully Associative Cache

• Address is partitioned into
  • Block number
  • Block offset, which identifies the data within the block
• Block can go anywhere in the cache
  • Must examine all blocks
• Each cache block has a Tag
  • Tag is compared to block number
  • If one of the blocks has Tag = Block #, we have a hit
  • Need a comparator per cache block
  • Comparisons performed in parallel

5-bit offset supports 32 bytes per block

Fully-associative cache does not need a Set Index field. Index = 0, corresponding to one set.

Page 42: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 42

Valid Bit

• Initially (at power-up/cold-start) the cache is either empty or contains random data
• Need a valid bit to indicate whether a cache block contains valid data
• A hit is only called against valid blocks

Page 43: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 43

Direct-Mapped Cache - 1

• A memory block can be stored in only one of the cache lines
  • if two memory blocks map to the same cache line, the old block must be evicted to make room for the new block
• Address is again partitioned into
  • Block number
  • Block offset
• Block number partitioned into
  • Set index, which identifies the cache line where the memory block may be stored (set with a single entry)
  • Tag number, which determines whether there is a hit or not (only one comparison is necessary)

Set index = (Block Address) MOD n, where n is the number of cache lines

Page 44: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 44

Direct-Mapped Cache -Example

• Block size = 32 bytes
  • need 5 bits for Block Offset
• 512 lines in the cache
  • need 9 bits for Set Index
• Cache size = 32 x 512 = 16 Kbytes

• Set Index is used to index into the array
• Tag is read out and input into the comparator
• The comparator compares the Tag in the address with the Tag stored in the cache’s Tag array
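
A sketch of how an address splits into the three fields for this cache, in C. The 32-bit address width, the type `cache_addr_t` and the function `split_address` are assumptions for illustration. The slide only fixes the 5-bit offset and 9-bit index, which leaves an 18-bit tag under the 32-bit assumption.

```c
#include <stdint.h>

/* Field widths from the example: 32-byte blocks, 512 lines. */
enum { OFFSET_BITS = 5, INDEX_BITS = 9 };

typedef struct {
    uint32_t tag;     /* compared with the tag stored in the cache */
    uint32_t index;   /* selects the cache line */
    uint32_t offset;  /* selects the byte within the block */
} cache_addr_t;

/* Split a 32-bit address (assumed width) into tag, index and offset. */
cache_addr_t split_address(uint32_t addr)
{
    cache_addr_t f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

For example, the address `(3u << 14) | (7u << 5) | 9u` decodes to tag 3, index 7 and offset 9.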


Page 45: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 45

Direct-Mapped Cache - 3

[Figure: main memory divided into partitions, each the size of the cache, mapping onto the cache]

• Conceptually main memory can be viewed as a number of partitions with each partition size equal to the cache size
• Addresses with the same set index map to the same cache block

(Block Address) MOD n

Page 46: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 46

Example

Consider the direct-mapped cache shown schematically on the right. The following trace shows the memory references sent to the cache in sequence. Assume that the cache starts with all entries set to INVALID.

• Mark each reference in the table as a Hit or Miss.
• After the last reference, show the final Valid and Tag fields in the cache diagram above.

Ref. Count   Tag   Index   Byte Offset   Hit or Miss?
 1           100   0       0
 2           100   0       4
 3           100   1       0
 4           100   1       4
 5           100   2       0
 6           101   1       0
 7           101   2       0
 8           100   0       0
 9           100   1       0
10           101   2       4
11           101   3       0
12           100   0       4
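
One way to check an answer to this exercise is to simulate the trace. The C sketch below is an illustration under stated assumptions: the cache diagram is not reproduced here, so a four-line cache is assumed because the trace uses indices 0 through 3, and tags are treated as opaque values. The names `cache_line_t`, `mem_ref_t`, `slide_trace` and `run_trace` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cache size: the trace only uses indices 0..3. */
enum { NUM_LINES = 4 };

typedef struct {
    bool     valid;
    unsigned tag;
} cache_line_t;

typedef struct {
    unsigned tag;
    unsigned index;
} mem_ref_t;

/* The 12 references from the table (byte offset does not affect hit/miss). */
static const mem_ref_t slide_trace[12] = {
    {100, 0}, {100, 0}, {100, 1}, {100, 1}, {100, 2}, {101, 1},
    {101, 2}, {100, 0}, {100, 1}, {101, 2}, {101, 3}, {100, 0}
};

/*
 * Run a trace through a direct-mapped cache.
 * hit[i] is set to true if reference i hits; the return value is the
 * total number of hits. On a miss the resident block is replaced.
 */
size_t run_trace(cache_line_t cache[NUM_LINES], const mem_ref_t *refs,
                 size_t n, bool *hit)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        cache_line_t *line = &cache[refs[i].index % NUM_LINES];
        hit[i] = line->valid && line->tag == refs[i].tag;
        if (hit[i]) {
            hits++;
        } else {
            line->valid = true;       /* fill or replace the line */
            line->tag = refs[i].tag;
        }
    }
    return hits;
}
```

Calling `run_trace` with an all-invalid cache and `slide_trace` fills in the Hit or Miss column through `hit[]` and leaves the final Valid and Tag fields in `cache[]`.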


Page 47: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 47

2-Way Set-Associative Cache -1

• Each set holds two blocks (Way 0 & Way 1)
• Each memory block is
  • mapped to a set
  • can be stored in either cache line of that set

• Block size = 32 bytes
  • 5-bit Block Offset
• 512 blocks divided into 256 sets with two lines per set
  • need 8 bits for Set Index
• Cache size: 256 x 2 x 32 = 16 Kbytes

Set index = (Block Address) MOD (number of sets)

Page 48: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 48

Hit Detection in 2-Way Cache


Page 49: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 49

Cache Associativity

High associativity
  reduces conflicts between blocks that map to the same location
  reduces “eviction rate” and miss rate

Low associativity
  increases miss rate
  reduces cache hardware complexity

Page 50: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 50

Replacement Policy

Replacement policy defines the algorithm to “evict” a block when there is a cache miss and the cache is full

Direct-mapped cache
  trivial choice: evict resident block and replace with new block

Fully-associative & set-associative
  random selection from among candidate blocks for eviction
    simple to support in hardware
    spreads allocation uniformly
  least-recently-used (LRU):
    accesses to blocks are recorded
    evict the block that has gone unused the longest
    improves chance of exploiting temporal locality
  both methods comparable for large cache size
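
A minimal sketch of LRU victim selection within one set, in C. Real caches usually approximate LRU with a few status bits per set, so the per-way timestamp used here is a simplifying assumption, as are the 4-way set size and the names `way_t` and `lru_victim`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative associativity (assumed). */
enum { WAYS = 4 };

typedef struct {
    bool     valid;
    uint32_t tag;
    uint64_t last_used;  /* time of the most recent access to this way */
} way_t;

/*
 * Pick the way to fill on a miss: an invalid way if one exists,
 * otherwise the way that has gone unused the longest.
 */
size_t lru_victim(const way_t set[WAYS])
{
    size_t victim = 0;
    for (size_t w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                 /* free slot: nothing to evict */
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}
```

On every hit the cache would update `last_used` for the accessed way. That bookkeeping is the recording of accesses mentioned above and is the extra hardware cost compared with random selection.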


Page 51: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 51

Cache Write Policies

Write-back: data written by the processor is updated only in the cache, not in the lower level
  data is updated in the lower level only when a block is evicted
  a block that requires update at eviction has its “dirty” bit set
  multiple byte updates within a single block can be written to the lower level in one write operation
  must write entire block; it is not known which bytes must be updated

Write-through: update data in both the cache and the lower level
  a read miss does not require updating the evicted block in the lower level because the lower level is already updated
  only updated bytes within the block must be written to the lower level
  less complex but consumes more memory bus bandwidth
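
The difference between the two policies can be sketched for a single cache line in C. This is a toy model with assumed names (`line_t`, `write_back_store`, `write_through_store`, `evict`) and an assumed 32-byte block; the lower level is modeled as a plain byte array holding the same block.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative block size (assumed). */
enum { BLOCK_BYTES = 32 };

typedef struct {
    bool    valid;
    bool    dirty;                /* used only by the write-back policy */
    uint8_t data[BLOCK_BYTES];
} line_t;

/* Write-back: update the cache only and mark the block dirty. */
void write_back_store(line_t *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;
}

/* Write-through: update the cache and the lower level together. */
void write_through_store(line_t *line, uint8_t *lower, unsigned offset,
                         uint8_t value)
{
    line->data[offset] = value;
    lower[offset] = value;        /* only the updated byte is written */
}

/*
 * Evict a line. A dirty block is written to the lower level as a whole,
 * because it is not known which bytes changed. Returns the number of
 * bytes written to the lower level.
 */
unsigned evict(line_t *line, uint8_t *lower)
{
    unsigned written = 0;
    if (line->valid && line->dirty) {
        memcpy(lower, line->data, BLOCK_BYTES);
        written = BLOCK_BYTES;
    }
    line->valid = false;
    line->dirty = false;
    return written;
}
```

With write-back, several stores to the same block cost one block write at eviction. With write-through, eviction writes nothing, but every store goes to the lower level.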


Page 52: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 52

Write-Through Cache Issues

Processor must wait for write to lower level to complete
  this is referred to as write stall

A write buffer (aka store buffer) is used to reduce write stalls
  processor writes data to the write buffer
  memory controller writes content of write buffer to memory hierarchy

[Figure: Processor, Cache, Write buffer and Main Memory]

Write buffer
  Effective only if (store frequency) << (1/DRAM write cycle)
  If (store frequency) ~ (1/DRAM write cycle), the write buffer eventually overflows
  On a cache miss, the write buffer must be looked up as well

Page 53: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 53

Memory Models


Page 54: CPE555A: Real-Time Embedded Systems

Memory Use

Programs use memory to
  Store executable code
  Store data used by programs during execution

Memory for data must be available
  Allocated statically at compile time
  Allocated dynamically at run time
    On the program stack for procedure calls
    In the heap as needed by program

Fall 2015, arz 54

Page 55: CPE555A: Real-Time Embedded Systems

Fall 2015, arz 55

Memory Usage

Global variable. Allocated at compile time.

Local/automatic variables. Allocated on the stack at run time.

Dynamic variables. Allocated in the heap area at run time.
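
The three kinds of variables can be seen side by side in a few lines of C. The code below is an illustrative sketch, not the code from the slide; the names `call_count` and `make_counter_copy` are assumptions.

```c
#include <stdlib.h>

/* Global variable: storage is reserved before the program runs. */
int call_count = 0;

/*
 * Returns a heap-allocated copy of the updated call count,
 * or NULL if the allocation fails. The caller must free() it.
 */
int *make_counter_copy(void)
{
    /* Local/automatic variable: lives on the stack during this call. */
    int local = ++call_count;

    /* Dynamic variable: lives in the heap until it is freed. */
    int *dynamic = malloc(sizeof *dynamic);
    if (dynamic != NULL)
        *dynamic = local;

    return dynamic;
}
```

After the function returns, `local` no longer exists, `call_count` keeps its value for the life of the program, and the heap object stays allocated until the caller releases it with `free`.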


Page 56: CPE555A: Real-Time Embedded Systems

Procedure Call: Return

After the procedure finishes execution, it must return to the instruction at the return address. The address of this instruction is stored in $ra (register 31). Thus, to return to this address, the procedure executes:

    jr $ra

so that no new instruction is required for the return. Similar to the jump instruction j, the jal instruction is encoded as a J-type instruction with opcode=3, as shown below.

J-Type Format

    opcode=3   immediate
    6 bits     26 bits
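The J-type layout can be made concrete by packing the two fields in C. This is a sketch under stated assumptions: the helper name `encode_jtype` is hypothetical, and the 26-bit immediate is taken to be the target word address (the byte address shifted right by 2), which is the standard MIPS convention rather than something stated on the slide. Opcode 2 for `j` is likewise standard MIPS.

```c
#include <assert.h>
#include <stdint.h>

enum { OPCODE_J = 2, OPCODE_JAL = 3 };   /* MIPS opcodes for j and jal */

/* Packs a J-type instruction word:
 *   bits 31..26  6-bit opcode
 *   bits 25..0   26-bit immediate (target byte address divided by 4)
 * Hypothetical helper for illustration only. */
uint32_t encode_jtype(uint32_t opcode, uint32_t target_byte_addr)
{
    assert(opcode < 64u);                    /* opcode must fit in 6 bits */
    assert((target_byte_addr & 3u) == 0);    /* instructions are word aligned */

    uint32_t imm26 = (target_byte_addr >> 2) & 0x03FFFFFFu;
    return (opcode << 26) | imm26;
}
```

For example, `encode_jtype(OPCODE_JAL, 0x00400000)` yields the word `0x0C100000`.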

Page 57: CPE555A: Real-Time Embedded Systems

Example

Let’s compile the following procedure:

    int leaf_example(int g, int h, int i, int j)
    {
        int f;
        f = (g+h) - (i+j);
        return f;
    }

Now let’s try to compile this program using what we’ve learned so far. To compile, the compiler must make a decision about how to pass the arguments g, h, i and j to the procedure and where the procedure must return its results to the caller.

Page 58: CPE555A: Real-Time Embedded Systems

Argument Passing & Return Value

The MIPS ISA does not specify how arguments should be passed and values returned. This is done using a compiler/assembler convention for MIPS:

• Registers 4-7 are used for argument passing. By convention MIPS refers to these registers as $a0-$a3.

• Registers 2-3 are used to return values from procedures. By convention MIPS refers to these registers as $v0-$v1.

Let’s assume in our example that the arguments g, h, i and j will be passed to the procedure in $a0-$a3 and the result is returned in $v0. The compiler chooses reg16 for the local variable f. In addition, the compiler uses reg8 and reg9 for temporary storage of (g+h) and (i+j).
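The register assignment described above can be traced with a small C model in which an array stands in for the register file. This is only a sketch: `leaf_example_regs` and `call_leaf_example` are hypothetical names, and the array `regs` is an illustration device, not how a compiler represents registers.

```c
#include <assert.h>

/* MIPS register numbers for $v0 and $a0-$a3, as given by the convention above. */
enum { V0 = 2, A0 = 4, A1 = 5, A2 = 6, A3 = 7 };

/* Body of leaf_example expressed on a modeled register file regs[0..31]. */
void leaf_example_regs(int regs[32])
{
    assert(regs != 0);
    regs[8]  = regs[A0] + regs[A1];   /* reg8  = g + h */
    regs[9]  = regs[A2] + regs[A3];   /* reg9  = i + j */
    regs[16] = regs[8] - regs[9];     /* reg16 = f = (g+h) - (i+j) */
    regs[V0] = regs[16];              /* return value goes in $v0 */
}

/* Caller side: place the arguments in $a0-$a3, call, read the result from $v0. */
int call_leaf_example(int regs[32], int g, int h, int i, int j)
{
    regs[A0] = g;
    regs[A1] = h;
    regs[A2] = i;
    regs[A3] = j;
    leaf_example_regs(regs);
    return regs[V0];
}
```

Note that whatever the caller held in reg8, reg9 and reg16 before the call has been overwritten by the time `call_leaf_example` returns.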

Page 59: CPE555A: Real-Time Embedded Systems

Example

Let’s compile the following procedure:

    int leaf_example(int g, int h, int i, int j)
    {
        int f;
        f = (g+h) - (i+j);
        return f;
    }

Register assignment:

    g, h, i, j      $a0, $a1, $a2, $a3 = R4-R7
    f               R16
    return value    $v0 = R2
    (g+h)           R8
    (i+j)           R9

Page 60: CPE555A: Real-Time Embedded Systems

What Could Go Wrong?

What if the caller also happens to be using the same registers reg8, reg9 and reg16? In this case the values used by the caller will be overwritten by the callee, and the caller will not be able to execute correctly after the callee returns. This problem cannot be addressed by simply using a different set of registers in the caller and the callee. For one thing, we don’t know how deep the nested procedure calls are: if n procedure calls are nested, we will need $O(n)$ registers, whereas we only have 32. For another, this scheme clearly will not work if the procedures are compiled separately and later linked (as in a library). The solution is to spill registers into main memory. We associate a frame (or activation record) with a procedure call of the function $f(x_1, x_2, \ldots, x_n)$. The frame will contain the registers that the callee plans to use during execution.

Page 61: CPE555A: Real-Time Embedded Systems

Call Stack

• The natural data structure for spilling registers into memory is a call stack (a last-in, first-out structure)

• Register values are pushed and saved on the stack when the procedure is called, and popped from the stack into the original registers at return

• Historically, call stacks “grow” from high address to low address

• A stack pointer addresses the top of the stack (in the MIPS push and pop sequences of this lecture, the most recently pushed word)
  - MIPS software uses register 29 for the stack pointer and refers to it as $sp
  - other machines (e.g., 80x86) may use a special-purpose stack pointer

[Figure: call stack holding frames for main, Proc1, Proc2, Proc3 and Proc4, drawn between High Address and Low Address, with $sp marking the top of the stack]

Page 62: CPE555A: Real-Time Embedded Systems

Pushing a Register on the Stack

• Suppose the called procedure wants to use reg16

• It must push register reg16 to save it

    subi $sp, $sp, 4      Makes room for a 4-byte word on the stack (equivalently, addi $sp, $sp, -4)
    sw   reg16, 0($sp)    Stores reg16 into stack memory

• Now the called procedure can use reg16

[Figure: stack diagram labeled Stack “Bottom”, Stack “Top”, Stack Pointer: $sp, Increasing Addresses and Stack Grows Down; the push changes $sp by -4]

Page 63: CPE555A: Real-Time Embedded Systems

Popping a Register From the Stack

• Before the procedure returns, it must restore reg16 to its original value

• It pops the stack into register reg16

    lw   reg16, 0($sp)    Loads reg16 from stack memory
    addi $sp, $sp, 4      Pops the stack

• Now the caller can use reg16 as before

[Figure: stack diagram labeled Stack “Bottom”, Stack “Top”, Stack Pointer: $sp, Increasing Addresses and Stack Grows Down; the pop changes $sp by +4]
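The push and pop sequences can be mirrored in a C model of the stack. This is a sketch under stated assumptions: the `Machine` type, its 64-word stack, and the helper names are hypothetical, and `leaf_example_saving` saves all three registers it uses (reg8, reg9, reg16) to show the last-in, first-out order, which is one possible choice rather than the lecture's prescribed frame layout.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64   /* assumed stack size, for illustration only */

/* Hypothetical machine state. */
typedef struct {
    int32_t  regs[32];             /* integer registers */
    int32_t  stack[STACK_WORDS];   /* stack memory, one element per 4-byte word */
    uint32_t sp;                   /* models $sp as a byte offset into stack[] */
} Machine;

void machine_init(Machine *m)
{
    for (int r = 0; r < 32; r++)
        m->regs[r] = 0;
    for (int w = 0; w < STACK_WORDS; w++)
        m->stack[w] = 0;
    m->sp = STACK_WORDS * 4u;      /* empty stack: sp starts at the high end */
}

/* Models: subi $sp, $sp, 4 ; sw reg, 0($sp) */
void push_reg(Machine *m, int r)
{
    assert(r >= 0 && r < 32);
    assert(m->sp >= 4u);                   /* no room left means stack overflow */
    m->sp -= 4u;                           /* the stack grows toward low addresses */
    m->stack[m->sp / 4u] = m->regs[r];
}

/* Models: lw reg, 0($sp) ; addi $sp, $sp, 4 */
void pop_reg(Machine *m, int r)
{
    assert(r >= 0 && r < 32);
    assert(m->sp + 4u <= STACK_WORDS * 4u); /* popping an empty stack is an error */
    m->regs[r] = m->stack[m->sp / 4u];
    m->sp += 4u;
}

/* leaf_example with the callee saving and restoring the registers it uses. */
void leaf_example_saving(Machine *m)
{
    push_reg(m, 16);
    push_reg(m, 9);
    push_reg(m, 8);

    m->regs[8]  = m->regs[4] + m->regs[5];   /* g + h */
    m->regs[9]  = m->regs[6] + m->regs[7];   /* i + j */
    m->regs[16] = m->regs[8] - m->regs[9];   /* f */
    m->regs[2]  = m->regs[16];               /* result in $v0 */

    pop_reg(m, 8);                            /* restore in reverse (LIFO) order */
    pop_reg(m, 9);
    pop_reg(m, 16);
}
```

After `leaf_example_saving` returns, the modeled reg8, reg9, reg16 and stack pointer hold the same values they had before the call, and only $v0 (register 2) carries the result.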

Page 64: CPE555A: Real-Time Embedded Systems

Example Solution

Consider the direct-mapped cache shown schematically on the right. The following trace shows the memory references sent to the cache in sequence. Assume that the cache starts with all entries set to INVALID.

• Mark each reference in the table as a Hit or Miss.

• After the last reference, show the final Valid and Tag fields in the cache diagram above.

    Ref. Count   Tag   Index   Byte Offset   Hit or Miss?
    1            100   0       0             Miss
    2            100   0       4             Hit
    3            100   1       0             Miss
    4            100   1       4             Hit
    5            100   2       0             Miss
    6            101   1       0             Miss
    7            101   2       0             Miss
    8            100   0       0             Hit
    9            100   1       0             Miss
    10           101   2       4             Hit
    11           101   3       0             Miss
    12           100   0       4             Hit

Final cache contents after the last reference:

    Index   Valid   Tag
    0       1       100
    1       1       100
    2       1       101
    3       1       101
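The solution can be checked mechanically with a short C simulation of a direct-mapped cache. This is a sketch, not the lecture's code: the names `CacheLine`, `Ref`, `run_trace` and `lecture_trace` are hypothetical, the cache is assumed to have four lines because the trace uses indices 0 to 3, and tags are treated as plain labels. The byte offset is omitted because it selects a byte within a block and does not affect hit or miss.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LINES 4   /* assumption: the trace only uses indices 0..3 */

typedef struct { bool valid; int tag; } CacheLine;
typedef struct { int tag; int index; } Ref;

/* The twelve references from the table, as (tag, index) pairs. */
const Ref lecture_trace[12] = {
    {100, 0}, {100, 0}, {100, 1}, {100, 1}, {100, 2}, {101, 1},
    {101, 2}, {100, 0}, {100, 1}, {101, 2}, {101, 3}, {100, 0}
};

/* Runs n references through the cache, records hit[k] for each reference,
 * and returns the number of misses. On a miss the indexed line is replaced. */
int run_trace(CacheLine cache[NUM_LINES], const Ref *trace, int n, bool *hit)
{
    int misses = 0;

    for (int k = 0; k < n; k++) {
        assert(trace[k].index >= 0 && trace[k].index < NUM_LINES);
        CacheLine *line = &cache[trace[k].index];

        /* Hit only if the line is valid and its tag matches. */
        hit[k] = line->valid && line->tag == trace[k].tag;

        if (!hit[k]) {
            line->valid = true;          /* fill or replace the line */
            line->tag   = trace[k].tag;
            misses++;
        }
    }
    return misses;
}
```

Starting from a cache with every `valid` flag cleared and calling `run_trace(cache, lecture_trace, 12, hit)` reproduces the Hit or Miss column of the table, with seven misses and five hits, and leaves the lines holding tags 100, 100, 101 and 101.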
