Fall 2015, arz 1
CPE555A: Real-Time Embedded Systems
Lecture 3
Ali Zaringhalam
Stevens Institute of Technology
Administrative
Assignment 1 was posted
Outline
• What is forwarding?
• Memory hierarchy
• Memory models
MIPS 5-Stage Integer Pipeline
[Figure: MIPS 5-stage integer pipeline datapath. Stages: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back. Components include the PC, instruction memory, the register file, a sign-extension unit (16 to 64 bits), the ALU with Zero?/Cond. outputs, data memory, and multiplexers on the datapaths between stages.]
MIPS Pipeline
[Figure: three instructions overlapped in the pipeline (IM, Reg, ALU, DM, Reg) across clock cycles CC1-CC7.]
• Instruction memory (IM) and data memory (DM) are shown as separate units
• All operations in a pipeline stage must complete in one clock cycle
• Values passed from one stage to another must be stored in internal registers
• Registers are labeled with the names of the stages they connect (e.g., the IF/ID registers)
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
Intermediate registers introduce delay in the datapath.
Data Hazard
[Figure: five instructions overlapped in the pipeline across CC1-CC5, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
RAW Data Hazard

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

• R1 is not written back to the register file until the WB cycle (CC5) of the ADD instruction
• R1 is needed in the ID cycle of the succeeding instructions:
– CC3 for SUB
– CC4 for AND
– CC5 for OR
– CC6 for XOR
• Unless the hazard is handled, these instructions operate on the wrong operand value
Split-Phase Register Read/Write
• XOR operates correctly because its ID cycle is in CC6
• OR can be made to operate correctly by:
– Writing the register file in the first half of the clock cycle
– Reading the register file in the second half of the clock cycle

[Figure: ADD R1, R2, R3 writes the register file in the first half of CC5, and OR R8, R1, R9 reads it in the second half of the same cycle.]
Forwarding (aka Bypassing)
[Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5 in the pipeline (CC1-CC5), with ALUout forwarded from the EX/MEM pipeline register back to the ALU input.]

ADD R1, R2, R3
SUB R4, R1, R5

• The result is needed by the SUB instruction in CC4, but the ADD instruction has already computed it in the previous cycle, CC3
• Forward the result of ALU operations from the previous cycle
• The ALU result is written into ALUout in the EX/MEM pipeline registers
• If forwarding logic detects that one of the register operands has been "touched" by the previous ALU operation, control logic selects the ALU input from EX/MEM instead of ID/EX
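The forwarding rule above can be sketched in Python (a behavioral illustration, not the lecture's hardware; the function and tuple layout are invented for this sketch):

```python
# Sketch of EX-stage forwarding: each ALU operand is taken from the EX/MEM
# pipeline register when the previous instruction's destination matches a
# source register, instead of from the (stale) register file.

def run(instrs):
    """instrs: list of (op, dest, src1, src2). R0..R31 start holding their index."""
    regs = list(range(32))          # architectural register file
    ex_mem = None                   # (dest, alu_out) latched by the previous instruction
    for op, rd, rs, rt in instrs:
        # Operand fetch with forwarding: prefer the in-flight EX/MEM value.
        a = ex_mem[1] if ex_mem and ex_mem[0] == rs else regs[rs]
        b = ex_mem[1] if ex_mem and ex_mem[0] == rt else regs[rt]
        alu_out = a + b if op == "ADD" else a - b
        if ex_mem:                  # the previous result now reaches write-back
            regs[ex_mem[0]] = ex_mem[1]
        ex_mem = (rd, alu_out)
    if ex_mem:
        regs[ex_mem[0]] = ex_mem[1]
    return regs

regs = run([("ADD", 1, 2, 3),      # R1 = R2 + R3 = 5
            ("SUB", 4, 1, 5)])     # R4 = R1 - R5 uses the forwarded R1
```

Without the forwarding check, SUB would read the stale R1 from the register file and compute the wrong result.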
Hazard Detection & Control

• Control detects that the content of ID/EX[A], read in the ID phase, is stale
• It knows that the update is in EX/MEM[ALUout], because it knows what the previous instruction was
• So it selects the input to the ALU from EX/MEM[ALUout]

[Figure: a multiplexer in front of each ALU input selects between the ID/EX register value and EX/MEM[ALUout], controlled by the hazard-control logic.]
Stall in the Pipeline

LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

[Figure: the load-use hazard forces a one-cycle bubble between LW and SUB across CC1-CC7.]

• No instruction begins in CC3
• No instruction completes in CC6
• Forwarding is now through MEM/WB
• OR no longer requires forwarding
Tabular View of Pipelining

Without a stall:

LW  R1,0(R2)   IF ID EX MEM WB
SUB R4,R1,R5      IF ID EX MEM WB
AND R6,R1,R7         IF ID EX MEM WB
OR  R8,R1,R9            IF ID EX MEM WB

With a load-use stall:

LW  R1,0(R2)   IF ID EX    MEM WB
SUB R4,R1,R5      IF Stall ID  EX  MEM WB
AND R6,R1,R7         Stall IF  ID  EX  MEM WB
OR  R8,R1,R9               IF  ID  EX  MEM WB
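The cycle counts behind these tables follow a simple formula. A back-of-the-envelope sketch (the function is illustrative, not from the slides):

```python
# For an ideal k-stage pipeline, n instructions finish in k + (n - 1) cycles;
# each stall bubble adds one cycle.
def pipeline_cycles(n_instructions, n_stages=5, n_stalls=0):
    return n_stages + n_instructions - 1 + n_stalls

no_stall = pipeline_cycles(4)                 # LW, SUB, AND, OR, forwarding only
with_stall = pipeline_cycles(4, n_stalls=1)   # one load-use stall
```

The one-cycle bubble pushes every later instruction back by exactly one cycle, which is why only one extra column appears in the stalled table.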
Assumptions Made To-Date

• All memory operations take the same amount of time to complete
• Each memory operation must complete before the next one can begin
• Monolithic memory system: no structure
Example: Perfect Memory
[Figure: four instructions flowing through IF, ID, ALU, MEM, WB with no bubbles.]

If every memory access takes 1 cycle, then, assuming our 5-stage pipeline, this program fragment takes 8 cycles.
Storage-Device Hierarchy

[Figure: storage devices ordered by increasing access time: registers (0.25-0.5 ns), cache (0.5-20 ns), main memory (80-250 ns).]

A 4 GHz CPU has a cycle time T = 0.25 ns. The CPU can access registers in one CPU clock cycle.
Example: Real Memory

[Figure: with multi-cycle memory, the IF and MEM stages insert bubbles into the pipeline.]

If memory references take more than one cycle, there will be many stalls:
• Every instruction requires an instruction-memory reference
• Every load or store requires a data-memory reference
The memory hierarchy is designed to alleviate this problem.
Memory System Complexity

• A typical system contains a mix of memory technologies
• The faster the memory, the more expensive it is. In practice a memory hierarchy is used to get the right price/performance.
• There is also a need for non-volatile memory that survives a reset. Example use case: storing the executable program.
• The memory address space must be partitioned between I/O devices and various software needs such as stack and heap memory.
Memory Chip Organization
[Figure: a square array of bit cells. The high half of the address bits drives a decoder that selects a word line; the low half drives a multiplexer that selects among the bit lines to produce the output.]
Single-Transistor DRAM

Write:
1. Drive word line high (row select)
2. Drive bit line to charge (1) or drain (0) the capacitor

Read:
1. Drive word line high (row select)
2. The capacitor is connected to the bit line
3. The capacitor output is connected to the multiplexer

[Figure: one transistor, gated by the row-select line, connects a storage capacitor to the bit line.]
Six-Transistor Static RAM Cell

[Figure: a cross-coupled inverter pair connected through two access transistors, gated by the word (row-select) line, to the bit and bit-bar lines.]

Write:
1. Drive word line high
2. Drive bit lines

Read:
1. Drive word line high
2. Connect inverter outputs to bit lines
3. Result sent to input of multiplexer

Once a value is stored in the cell, the ring structure of the inverter pair ensures the value circulates indefinitely as long as power is applied to the cell. Hence Static RAM (SRAM).
SRAM-DRAM Differences

• Chip density is lower for SRAM
– SRAM requires more transistors per cell than DRAM
– DRAM cell density is ~6-10x that of SRAM
• Access time is faster for SRAM
– SRAM uses active devices (inverters) to drive the bit lines, whereas DRAM uses a capacitor
– bit lines are driven faster by the stronger signal (higher current) in SRAM: ~10x faster than DRAM
• Cost: SRAM is more expensive than DRAM
• Both SRAM and DRAM are volatile
– They lose their content when powered off
Embedded System Memory

• Most embedded systems include an SRAM memory (~100 K)
• They may also have DRAM if more memory is needed and it is not cost-effective to supply it in SRAM
• DRAM refresh
– Over time the charge on the capacitors leaks
– The capacitor also loses its charge when the cell is read
– So each cell must be refreshed to maintain its stored value
– Refresh: a periodic dummy read and write to every cell
– Typical interval: ~60 milliseconds; the persistence interval is much longer: ~1 second
• A DRAM controller is used to refresh memory regularly
– If memory is accessed during the refresh cycle of a cell, the memory controller stalls the CPU
– The stall introduces variability in program execution
Flash Memory

• Flash: semiconductor, non-volatile memory
• Compared to a hard disk:
– Lower latency
– Lower power
– Lighter weight, smaller size, shock resistance
• Rough comparisons for DRAM : Flash : Disk
– Cost per bit: 100 : 10 : 1
– Access latency (NAND flash): 1 : 1,000 : 1,000,000
Types of Flash

• NOR flash
– Fast reads (~100 ns), slow writes (~200 µs), very slow erase (~1 s)
– 10K to 100K erase cycles before it wears out
– Used for instruction memory in mobile systems: a read-only application
• NAND flash
– Denser (bits/area, ~40% of NOR), cheaper per GB
– Slow reads (~50 µs), slow writes (~200 µs), slow erase (~2 ms)
– 100K to 1M erase cycles before it wears out
– Used for data storage (phones, USB keys, solid-state drives)
• Both types have durability issues
– Damaged after some number of write/erase cycles
NAND Flash Chips

• Page: minimum unit of read/write
– 0.5 KB to 8 KB of data + a spare area for error coding
• Block: minimum unit of erasing
– 64 to 128 pages
• Chip: 1 to 16 GB
Flash Operations

• Read the contents of a page: 20-50 µs
• Erase sets all bits in a block to 1: 0.5-3 ms
– Pages must be erased before they can be written
– Update-in-place is not possible
• Write data to a page: 100-300 µs
– Only 1 -> 0 transitions are allowed
– Writing within a block must be ordered by page
Reliability

• Wear-out
– Flash cells are physically damaged by read, write, and erase operations
• Write disturb
– Writing to a page can corrupt values stored in other pages in the block
• Read disturb
– Reading data can corrupt the data in the block
– It takes many reads to see this effect
• That's why there is a spare area for error-correction coding
The Principle of Locality

• Programs access a relatively small portion of the address space at any instant of time
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse of data)
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code execution, array access)
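Spatial locality can be made concrete with a tiny cache model (an illustration, not from the slides; the cache geometry is invented): a sequential walk reuses each fetched block many times, while a large-stride walk misses on every reference.

```python
# Count misses in a small direct-mapped cache for two access patterns over
# the same number of references (4096 each).

def misses(addresses, n_lines=64, block_size=16):
    cache = [None] * n_lines                 # one block number (tag) per line
    miss_count = 0
    for addr in addresses:
        block = addr // block_size           # which block this byte lives in
        index = block % n_lines              # direct-mapped placement
        if cache[index] != block:            # resident block differs -> miss
            cache[index] = block
            miss_count += 1
    return miss_count

sequential = misses(range(0, 4096))          # touches every byte of each block
strided = misses(range(0, 4096 * 16, 16))    # touches one byte per block
```

Sequential access misses only once per 16-byte block (256 misses in 4096 references); the strided pattern touches a new block every time and misses on all 4096.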
Memory Hierarchy

• Present the user with as much memory as is available in the cheapest technology
• Provide access at the speed offered by the fastest technology at the cost of the cheapest technology (on average)

[Figure: processor registers (~1 ns, ~100 bytes) -> on-chip cache -> second-level cache (SRAM, ~10s of ns, ~KB) -> main memory (DRAM, ~100s of ns, ~MB) -> secondary storage (disk, ~10s of ms, ~GB) -> tertiary storage (disk/tape, ~seconds, ~TB).]
Accessing Data in Main Memory

• Ignore caches for the moment
• Data access involves:
– Sending the address to memory
– The address indexes into memory: mem[address]
– Data from mem[address] is returned to the CPU
• Memory is referenced just like an array:
result <-- mem[index]

[Figure: the CPU sends an address (0x00-0x07) over the address lines and receives the contents of mem[address] over the data lines.]
Demand-Based System

[Figure: processor <-> cache (SRAM) <-> main memory (DRAM), with block data transfers between levels.]

• A processor request is first looked up in the top level of the hierarchy
• If the data cannot be found in the top level, the next level is searched
• Memory is copied from a lower level to a higher level in blocks of sequential address locations
– it is faster to read/write blocks than individual words
– this takes advantage of spatial locality of reference in programs
Block Size in the Memory Hierarchy

Block (aka line): the minimum unit of information that can be transferred between cache and main memory (more generally, between two adjacent layers in the memory hierarchy).

Level          Storage scale   Block size
Register       ~100 bytes      1-8 bytes
Cache          ~ Kbytes        ~100 bytes
Main memory    ~ Mbytes        ~ Kbytes
Disk           ~ Gbytes        ~ Mbytes
Tape           unlimited
A Simple Cache

[Figure: an eight-entry cache holding X1-X6, Xn-2 and Xn-1 before the reference to Xn; after the reference, Xn has replaced X6.]

• Processor request: 1 word; block size: 1 word
• The processor requests Xn, which is not in the cache
• The request results in a miss
• The cache is full
• Xn is brought from memory into the cache, replacing X6
Leveraging Locality of Reference

• Spatial locality
– Fetch data in blocks
– Assumes adjacent data will be used with high probability
• Temporal locality
– Keep data in memory until you need to purge it due to a miss
– Assumes data will remain useful for a while with high probability
Cache Design

• Block size
• Block organization
– Direct-mapped
– Fully-associative
– Set-associative
• Block replacement policy
– FIFO
– LRU
– Random
• Write policy
– Write-back / Write-through
– Write-allocate / Write-no-allocate
Cache Associativity

[Figure: main memory block 45 mapped into an eight-entry cache under three organizations; the set-associative cache has sets 0-3.]

• Fully associative: block 45 can go anywhere in the cache (similar to an array). All entries must be checked for lookup.
• Direct-mapped: block 45 can go in only one location in the cache: 45 mod 8 = 5 (similar to a hash table). Only a single entry must be checked.
• Set-associative: block 45 can go anywhere in one set in the cache: 45 mod 4 = 1 (similar to a hash bucket). All entries within a single set must be checked for lookup.
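The three placement rules can be sketched in a few lines (an illustration, not from the slides; the function name and parameters are invented):

```python
# Candidate cache entries for a given block number in an eight-entry cache
# under fully-associative, direct-mapped, and 2-way set-associative placement.

def candidate_slots(block, n_entries=8, ways=2):
    fully_assoc = list(range(n_entries))             # any entry
    direct = [block % n_entries]                     # exactly one entry
    n_sets = n_entries // ways                       # 4 sets of 2 ways
    s = block % n_sets                               # which set the block maps to
    set_assoc = [s * ways + w for w in range(ways)]  # any way within set s
    return fully_assoc, direct, set_assoc

fa, dm, sa = candidate_slots(45)   # block 45, as in the figure
```

For block 45 this reproduces the slide: any of the 8 entries (fully associative), entry 5 only (direct-mapped, 45 mod 8), or either entry of set 1 (2-way, 45 mod 4).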
How To Find a Block?

The CPU address is partitioned into:

Tag | Index | Block offset
(Tag + Index = block address)

• Block offset: selects the byte within a block
• Index: selects the set. NULL for a fully-associative cache: in a fully-associative cache there is only one set, and all blocks belong to it.
• Tag: compared against the CPU address for hit/miss
Example

[Figure: the same eight-entry cache and block 45 as before.]

• Fully associative: 1 set, eight blocks per set. Index field is 0 bits. 8-way set associative.
• Direct-mapped: 8 sets, one block per set. Index field is 3 bits. 1-way set associative.
• Set-associative: 4 sets, two blocks per set. Index field is 2 bits. 2-way set associative.
How To Find a Block?

Given the number of bits in the index field (call it index), the number of sets is 2^index, and the number of blocks per set is the set associativity:

2^index = (Cache size) / (Block size x Set associativity)
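The address partition and the relation above can be sketched together (an illustrative helper, not from the slides; the function name and the sample address are invented):

```python
# Split a CPU address into (tag, index, offset) given the cache geometry,
# using 2**index_bits = cache_size / (block_size * associativity).

def split_address(addr, cache_size, block_size, ways):
    n_sets = cache_size // (block_size * ways)
    offset_bits = block_size.bit_length() - 1     # log2(block_size)
    index_bits = n_sets.bit_length() - 1          # log2(n_sets)
    offset = addr & (block_size - 1)              # byte within the block
    index = (addr >> offset_bits) & (n_sets - 1)  # which set
    tag = addr >> (offset_bits + index_bits)      # compared for hit/miss
    return tag, index, offset

# 64 KB cache, 64-byte blocks, 2-way: 512 sets -> 9 index bits, 6 offset bits
tag, index, offset = split_address(0x12345678, 64 * 1024, 64, 2)
```

For a fully-associative cache, n_sets is 1, so the index field collapses to 0 bits and the whole block address becomes the tag, matching the slide.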
Example

• Block size = 64 bytes
• Cache size = 64 KB
• 2-way set associative
• How many sets? What is the size of the Index field?

2^index = (Cache size) / (Block size x Set associativity) = 2^16 / (2^6 x 2^1) = 2^9, so index = 9

Number of sets: 2^9 = 512

Useful approximation: 2^10 ≈ 1000.

[Figure: each set holds two blocks, Way 1 and Way 2.]
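The arithmetic in this example can be checked directly (a quick sketch; variable names are illustrative):

```python
# Number of sets = cache_size / (block_size * associativity); the index
# field needs log2(number of sets) bits.
cache_size = 64 * 1024     # 64 KB
block_size = 64            # bytes
ways = 2                   # 2-way set associative

n_sets = cache_size // (block_size * ways)
index_bits = n_sets.bit_length() - 1   # log2(n_sets)
```

This confirms 512 sets and a 9-bit Index field.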
Fully Associative Cache

• The address is partitioned into:
– Block number
– Block offset, which identifies the data within the block
• A block can go anywhere in the cache, so all blocks must be examined
• Each cache block has a Tag
– The Tag is compared to the block number
– If one of the blocks has Tag = block number, we have a hit
– This requires a comparator per cache block; comparisons are performed in parallel
• A 5-bit offset supports 32 bytes per block
• A fully-associative cache does not need a Set Index field: Index = 0, corresponding to one set
Valid Bit

• Initially (at power-up/cold-start) the cache is either empty or contains random data
• A valid bit is needed to indicate whether a cache block contains valid data
• A hit is only called against valid blocks
Direct-Mapped Cache - 1

• A memory block can be stored in only one of the cache lines: (Block address) MOD n
– if two memory blocks map to the same cache line, the old block must be evicted to make room for the new block
• The address is again partitioned into:
– Block number
– Block offset
• The block number is partitioned into:
– Set index, which identifies the cache line where the memory block may be stored (a set with a single entry)
– Tag, which determines whether there is a hit or not (only one comparison is necessary)
Direct-Mapped Cache - Example

• Block size = 32 bytes, so 5 bits are needed for the Block Offset
• 512 lines in the cache, so 9 bits are needed for the Set Index
• Cache size = 32 x 512 = 16 KB
• The Set Index is used to index into the array
• The Tag is read out and input into the comparator
• The comparator compares the Tag in the address with the Tag stored in the cache's Tag array
Direct-Mapped Cache - 3

• Conceptually, main memory can be viewed as a number of partitions, each equal in size to the cache
• Addresses with the same set index map to the same cache block: (Block address) MOD n
Example

Consider the direct-mapped cache shown schematically on the right. The following trace shows the memory references sent to the cache in sequence. Assume that the cache starts with all entries set to INVALID.

• Mark each reference in the table as a Hit or Miss.
• After the last reference, show the final Valid and Tag fields in the cache diagram.

Ref. Count   Tag   Index   Byte Offset   Hit or Miss?
 1           100   0       0
 2           100   0       4
 3           100   1       0
 4           100   1       4
 5           100   2       0
 6           101   1       0
 7           101   2       0
 8           100   0       0
 9           100   1       0
10           101   2       4
11           101   3       0
12           100   0       4
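You can work this exercise by hand, or check your answers with a small simulation of the four-line direct-mapped cache (a sketch; the function name is invented):

```python
# Each cache line holds a (valid, tag) pair; a reference hits when the
# selected line is valid and its stored tag matches the reference's tag.

def run_trace(refs, n_lines=4):
    lines = [(False, None)] * n_lines
    results = []
    for tag, index, _offset in refs:
        valid, stored = lines[index]
        if valid and stored == tag:
            results.append("Hit")
        else:
            results.append("Miss")
            lines[index] = (True, tag)   # fill the line on a miss
    return results, lines

refs = [(100, 0, 0), (100, 0, 4), (100, 1, 0), (100, 1, 4),
        (100, 2, 0), (101, 1, 0), (101, 2, 0), (100, 0, 0),
        (100, 1, 0), (101, 2, 4), (101, 3, 0), (100, 0, 4)]
results, final = run_trace(refs)
```

Note that the byte offset never affects hit/miss here: references 2, 4, 10, and 12 hit because they fall in a block already resident in the indexed line.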
2-Way Set-Associative Cache - 1

• Each set holds two blocks (Way 0 and Way 1)
• Each memory block is:
– mapped to a set: (Block address) MOD (number of sets)
– stored in either cache line of that set
• Block size = 32 bytes, so a 5-bit Block Offset
• 512 blocks divided into 256 sets with two lines per set, so 8 bits are needed for the Set Index
• Cache size: 256 x 2 x 32 = 16 KB
Hit Detection in 2-Way Cache

[Figure: the Set Index selects a set; the tags of both ways are compared in parallel against the address Tag, and a match in either way signals a hit.]
Cache Associativity

• High associativity
– reduces conflicts between blocks that map to the same location
– reduces the eviction rate and the miss rate
• Low associativity
– increases the miss rate
– reduces cache hardware complexity
Replacement Policy

• The replacement policy defines the algorithm used to "evict" a block when there is a cache miss and the cache is full
• Direct-mapped cache
– trivial choice: evict the resident block and replace it with the new block
• Fully-associative and set-associative caches
– Random: select randomly from among the candidate blocks for eviction; simple to support in hardware; spreads allocations uniformly
– Least-recently-used (LRU): accesses to blocks are recorded; evict the block that has gone unused the longest; improves the chance of exploiting temporal locality
– both methods perform comparably for large cache sizes
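LRU bookkeeping for one set can be sketched with an ordered dictionary (an illustration of the policy, not the slides' hardware; the class name is invented):

```python
# One cache set with LRU replacement: an OrderedDict keeps resident blocks
# in recency order, so the front entry is the least recently used.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> data, recency-ordered

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting LRU if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in [7, 9, 7, 3, 9]]
```

Tracing the accesses: 7 and 9 miss and fill the set; 7 hits and becomes most recent; 3 misses and evicts 9 (the LRU block); 9 then misses again and evicts 7.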
Cache Write Policies

• Write-back: data written by the processor is updated only in the cache, not in the lower level
– data is updated in the lower level only when a block is evicted
– a block that requires an update at eviction sets its "dirty" bit
– multiple byte updates within a single block can be written to the lower level in one write operation
– the entire block must be written, since it is not known which bytes were updated
• Write-through: update data in both the cache and the lower level
– a read miss does not require updating the evicted block in the lower level, because the lower level is already up to date
– only the updated bytes within the block must be written to the lower level
– less complex, but consumes more memory-bus bandwidth
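The bandwidth trade-off can be made concrete with a deliberately tiny model (a toy sketch, not from the slides: one cached block, counting lower-level write operations only):

```python
# Count lower-level (memory) write operations for a burst of stores that all
# land in the same cached block, under the two policies.

def memory_writes(stores_to_same_block, policy):
    if policy == "write-through":
        # every store is propagated to the lower level as it happens
        return stores_to_same_block
    # write-back: stores coalesce in the cache; one whole-block write at eviction
    return 1 if stores_to_same_block else 0

wt = memory_writes(8, "write-through")
wb = memory_writes(8, "write-back")
```

Eight stores to one block cost eight bus writes under write-through but a single (whole-block) write at eviction under write-back, which is the bandwidth argument made above. The model ignores that write-through moves only the updated bytes while write-back moves the full block.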
Write-Through Cache Issues

• The processor must wait for the write to the lower level to complete; this is referred to as a write stall
• A write buffer (aka store buffer) is used to reduce write stalls
– the processor writes data to the write buffer
– the memory controller writes the contents of the write buffer to the memory hierarchy

[Figure: processor -> cache, with a write buffer between the cache and main memory.]

• The write buffer is effective only if (store frequency) << (1/DRAM write cycle)
• If (store frequency) ~ (1/DRAM write cycle), the write buffer eventually overflows
• On a cache miss, the write buffer must be looked up as well
Memory Models
Memory Use

• Programs use memory to store:
– executable code
– data used by programs during execution
• Memory for data must be available:
– allocated statically at compile time
– allocated dynamically at run time, on the program stack for procedure calls or in the heap as needed by the program
Memory Usage

• Global variables: allocated at compile time
• Local/automatic variables: allocated on the stack at run time
• Dynamic variables: allocated in the heap area at run time
Procedure Call: Return

After the procedure finishes execution it must return to the instruction at the return address. The address of this instruction is stored in $ra (register 31). Thus, to return to this address the procedure executes:

jr $ra

so that no new instruction for the return is required. Similar to the jump instruction j, the jal instruction is encoded as a J-type instruction with opcode=3, as shown below.

J-Type Format: opcode=3 (6 bits) | immediate (26 bits)
Example

Let's compile the following procedure:

int leaf_example(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}

Now let's try to compile this program using what we've learned so far. To compile, the compiler must make a decision about how to pass the arguments g, h, i and j to the procedure and where the procedure must return its result to the caller.
Argument Passing & Return Value

The MIPS ISA does not specify how arguments should be passed and values returned. This is done using a compiler/assembler convention for MIPS:

• Registers 4-7 are used for argument passing. By convention MIPS refers to these registers as $a0-$a3.
• Registers 2-3 are used to return values from procedures. By convention MIPS refers to these registers as $v0-$v1.

Let's assume in our example that the arguments g-j will be passed to the procedure in $a0-$a3 and the result is returned in $v0. The compiler chooses reg16 for the local variable f. In addition, the compiler uses reg8 and reg9 for temporary storage of (g+h) and (i+j).
Example (register assignments)

int leaf_example(int g, int h, int i, int j)    // $a0-$a3 = R4-R7
{
    int f;                                      // R16
    f = (g + h) - (i + j);                      // temporaries in R8 and R9
    return f;                                   // $v0 = R2
}
What Could Go Wrong?

What if the caller also happens to be using the same registers reg8, reg9 and reg16? In this case the values used by the caller will be overwritten by the callee, and the caller will not be able to execute the program correctly after the callee returns. This problem cannot be addressed by simply using a different set of registers in the caller and the callee. For one thing, we don't know how deep the nested procedure calls are: if n procedure calls are nested, we would need O(n) registers, whereas we only have 32. For another, this scheme clearly will not work if the procedures are compiled separately and later linked (as in a library). The solution is to spill registers into main memory. We associate a frame (or activation record) with each call of a procedure f(x1, x2, ..., xn). The frame will contain the registers that the callee plans to use during execution.
Call Stack

• The natural data structure for spilling registers into memory is a call stack (a last-in, first-out structure)
• Register values are pushed and saved on the stack when the procedure is called, and popped from the stack into the original register at return
• Historically, call stacks "grow" from high addresses to low addresses
• A stack pointer is used to address the first unused memory location
• MIPS software uses register 29 for the stack pointer and refers to it as $sp
– other machines (e.g., 80x86) may use a special-purpose stack pointer

[Figure: stack frames for main and Proc1-Proc4 laid out from high address to low address, with $sp pointing at the top of the stack.]
Pushing a Register on the Stack

Suppose the called procedure wants to use reg16. It must push reg16 to save it:

subi $sp, $sp, 4    # makes room for a 4-byte word on the stack
sw reg16, 0($sp)    # stores reg16 into stack memory

Now the called procedure can use reg16.

[Figure: the stack grows down by 4 bytes; $sp moves toward lower addresses, away from the stack "bottom".]
Popping a Register From the Stack

Before the procedure returns, it must restore reg16 to its original value by popping the stack into reg16:

lw reg16, 0($sp)    # loads reg16 from stack memory
addi $sp, $sp, 4    # pops the stack

Now the caller can use reg16 as before.

[Figure: $sp moves up by 4 bytes toward the stack "bottom" as the stack shrinks.]
Example Solution

Ref. Count   Tag   Index   Byte Offset   Hit or Miss?
 1           100   0       0             Miss
 2           100   0       4             Hit
 3           100   1       0             Miss
 4           100   1       4             Hit
 5           100   2       0             Miss
 6           101   1       0             Miss
 7           101   2       0             Miss
 8           100   0       0             Hit
 9           100   1       0             Miss
10           101   2       4             Hit
11           101   3       0             Miss
12           100   0       4             Hit

Final cache state: all four lines valid, with tags 100, 100, 101, 101 for indexes 0-3.