CSE 237A: Memory
Tajana Simunic Rosing, Department of Computer Science and Engineering, University of California, San Diego
Fall 2005
Hardware platform architecture
Memory hierarchy
Want inexpensive, fast memory
Main memory
  Large, inexpensive, slow, stores all data
Cache
  Small, expensive, fast memory; stores a copy of likely accessed parts
  L1, L2
[Figure: memory hierarchy levels: processor, registers, cache, main memory, disk, backup]
Memory
Efficiency is a concern:
  speed (latency and throughput); predictable timing
  energy efficiency
  size
  cost
  other attributes (volatile vs. persistent, etc.)
Access times
[Figure: speed versus years, annotated "≥ 2x every 2 years"]
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
Caches and CPUs - MPSoC
Higher-end systems: L1 and L2 cache on chip
Cache
Designed with SRAM, usually on the same chip as the processor
Cache operation:
  Request for main memory access (read or write)
  First, check cache for a copy: cache hit or cache miss
Design choices
  Cache mapping
    Direct: each memory location maps onto exactly one cache entry
    Fully associative: a memory location can go into any cache entry; rarely implemented for large caches because every entry needs its own tag comparator
    Set-associative: each memory location maps to one set and can go into any of the n entries of that set
  Write techniques
    Write-through: write to main memory at each update
    Write-back: write to main memory only when a “dirty” block is replaced
  Replacement policies
    Random
    LRU: least recently used
    FIFO: first in, first out
[Figure: cache lookup for the three mapping schemes. The address is split into tag, index, and offset fields; each entry holds a valid bit (V), a tag (T), and data (D); comparators (=) match the stored tag against the address tag.]
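The tag/index/offset split and the hit check for the direct-mapped case can be sketched as follows. This is a minimal illustration, not a model of any particular processor: the geometry (16-byte blocks, 64 entries) and the names `split_address` and `DirectMappedCache` are assumptions chosen for the example.

```python
# Sketch of a direct-mapped cache lookup.
# Geometry is hypothetical: 16-byte blocks, 64 entries (both powers of two).
BLOCK_BYTES = 16
NUM_ENTRIES = 64
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 4 offset bits
INDEX_BITS = NUM_ENTRIES.bit_length() - 1    # 6 index bits


def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_ENTRIES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        self.valid = [False] * NUM_ENTRIES
        self.tags = [0] * NUM_ENTRIES

    def access(self, addr):
        """Return True on a hit; on a miss, fill the entry and return False."""
        tag, index, _ = split_address(addr)
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the single candidate entry is overwritten (no replacement choice).
        self.valid[index] = True
        self.tags[index] = tag
        return False


cache = DirectMappedCache()
# 0x1234 and 0x1238 share a block; 0x5234 maps to the same index with another tag.
results = [cache.access(a) for a in (0x1234, 0x1238, 0x5234, 0x1234)]
```

The last access misses because 0x5234 evicted the block that 0x1234 had loaded. A set-associative cache would keep both blocks in the same set, at the cost of comparing several tags in parallel.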
Cache impact on system performance
Most important parameters in terms of performance:
  Total size of cache (data and control info: tags etc.)
  Degree of associativity
  Data block size
Larger caches -> lower miss rates, higher access cost
Average memory access time (h1 = fraction of accesses that hit in L1, h2 = fraction of accesses that hit in L1 or L2):
  $t_{av} = h_1 t_{L1} + (h_2 - h_1) t_{L2} + (1 - h_2) t_{main}$
e.g., if miss cost = 20 cycles and access = (1 - miss rate) x hit cost + miss rate x miss cost:
  2 Kbyte: miss rate = 20%, hit cost = 2 cycles, access 5.6 cycles
  4 Kbyte: miss rate = 10%, hit cost = 3 cycles, access 4.7 cycles
  8 Kbyte: miss rate = 8%, hit cost = 4 cycles, access 5.28 cycles
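The cache-size example can be checked with a short calculation. The sketch below assumes the average access cost is the weighted sum (1 - miss rate) x hit cost + miss rate x miss cost, which reproduces the 5.6 and 4.7 cycle figures; under the same formula the 8 Kbyte configuration evaluates to 5.28 cycles. The function name `average_access_cycles` is made up for this example.

```python
def average_access_cycles(miss_rate, hit_cost, miss_cost=20):
    """Weighted average of hit and miss cost, in cycles."""
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost


# (miss rate, hit cost in cycles) per cache size, taken from the example above.
configs = {
    "2 Kbyte": (0.20, 2),
    "4 Kbyte": (0.10, 3),
    "8 Kbyte": (0.08, 4),
}

averages = {
    name: round(average_access_cycles(miss_rate, hit_cost), 2)
    for name, (miss_rate, hit_cost) in configs.items()
}
```

In this example the 4 Kbyte cache has the lowest average: going to 8 Kbyte lowers the miss rate only slightly, while every hit becomes one cycle more expensive.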
Influence of the associativity
[P. Marwedel et al., ASPDAC, 2004]
Predictability
Embedded systems are often real-time: they have to guarantee meeting timing constraints.
Pre-run-time scheduling for predictability: time-triggered, statically scheduled operating systems
Predictable cache design?
Scratch pad memories (SPM)
Example: ARM7TDMI cores, well-known for low power consumption
[Figure: address space from 0 to FFF.., with one region assigned to the scratch pad memory; hierarchy: processor, SPM, main memory; the SPM has no tag memory]
Why not just use a cache?
[P. Marwedel et al., ASPDAC, 2004]
Worst case execution time (WCET) may be large
Why not just use a cache?
[Figure: energy per access [nJ] (0 to 9) versus memory size (256 to 16384); curves: scratch pad; cache, 2-way, 4 GB space; cache, 2-way, 16 MB space; cache, 2-way, 1 MB space]
[R. Banakar, S. Steinke, B.-S. Lee, 2001]
Caches spend energy on the parallel access of sets and in comparators and muxes.
Scratchpad vs. main memory currents
Current of a 32-bit load instruction (Thumb), in mA
Example: Atmel ARM evaluation board (processor with on-chip SPM; main memory on board)
Configuration                    Core + on-chip memory (mA)   Off-chip memory (mA)
Prog off-chip / data off-chip    48.2                         116
Prog off-chip / data on-chip     50.9                         77.2
Prog on-chip / data off-chip     44.4                         82.2
Prog on-chip / data on-chip      53.1                         1.16
Current reduction: factor of 3.02
Scratchpad vs. main memory energy
Energy of a 32-bit load instruction (Thumb), in nJ
Example: Atmel ARM evaluation board
Configuration                    Energy (nJ)
Prog off-chip / data off-chip    115.8
Prog off-chip / data on-chip     51.6
Prog on-chip / data off-chip     76.5
Prog on-chip / data on-chip      16.4
"Main" memory access takes longer
Energy reduction: factor of 7.06 (savings of 86%)
100% predictable
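The quoted reduction factor and savings follow from the two extreme bars of the energy chart. The sketch below assumes that 115.8 nJ belongs to the all-off-chip configuration and 16.4 nJ to the all-on-chip configuration, which is the order in which the chart lists them.

```python
# Energy of a 32-bit load (Thumb) in nJ; assignment of values to
# configurations is assumed from the order in which the chart lists them.
energy_nj = {
    "prog off-chip / data off-chip": 115.8,
    "prog off-chip / data on-chip": 51.6,
    "prog on-chip / data off-chip": 76.5,
    "prog on-chip / data on-chip": 16.4,
}

baseline = energy_nj["prog off-chip / data off-chip"]
best = energy_nj["prog on-chip / data on-chip"]

reduction_factor = baseline / best            # about 7.06
savings_percent = (1 - best / baseline) * 100  # about 86 %
```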
Memory management units
Memory management unit (MMU) translates addresses:
[Figure: CPU -> logical address -> memory management unit -> physical address -> main memory]
Memory Management Unit (MMU)
Duties of MMU
  Handles DRAM refresh, bus interface & arbitration
  Takes care of memory sharing among multiple CPUs
  Translates logical memory addresses from processor to physical memory addresses of DRAM
Modern CPUs often come with MMU built-in
Single-purpose processors can be used
Address translation
Mapping logical to physical addresses. Two basic schemes:
  Segmented: memory footprint can change dynamically; usually only a few segments per process, e.g. data and stack
  Paged: size preassigned
  The two can be combined (x86).

SEGMENTATION            PAGING
Involves programmer     Transparent to programmer
Separate compiling      No separate compiling
Separate protection     No separate protection
Shared                  No sharing

[Figure: memory divided into segment 1 and segment 2 (segmentation) and into page 1 and page 2 (paging)]
ARM memory management
Memory region types:
  section: 1 Mbyte block;
  large page: 64 kbytes;
  small page: 4 kbytes.
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.
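ARM address translation
[Figure: the logical address is split into 1st index, 2nd index, and offset. The translation table base register is concatenated with the 1st index to select the 1st-level table descriptor; that descriptor is concatenated with the 2nd index to select the 2nd-level table descriptor; the result is concatenated with the offset to form the physical address.]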
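A two-level lookup of this kind can be sketched in a few lines. This is a simplified model under stated assumptions: a 32-bit address split into a 12-bit 1st index, an 8-bit 2nd index, and a 12-bit offset (the 4 kbyte small-page case), with plain dictionaries standing in for the tables. It does not reproduce real ARM descriptor formats, and the names `split` and `translate` are made up for this example.

```python
# Simplified two-level address translation for 4 kbyte pages.
# Assumed split of a 32-bit logical address: 12-bit 1st index,
# 8-bit 2nd index, 12-bit offset. Tables are dicts, not real descriptors.
OFFSET_BITS = 12
L2_BITS = 8
L1_BITS = 12


def split(vaddr):
    """Return (1st index, 2nd index, offset) of a logical address."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    idx2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    idx1 = (vaddr >> (OFFSET_BITS + L2_BITS)) & ((1 << L1_BITS) - 1)
    return idx1, idx2, offset


def translate(vaddr, l1_table):
    """Walk both table levels; a missing entry raises KeyError (a fault)."""
    idx1, idx2, offset = split(vaddr)
    l2_table = l1_table[idx1]   # 1st-level entry selects a 2nd-level table
    frame = l2_table[idx2]      # 2nd-level entry gives the physical page frame
    return (frame << OFFSET_BITS) | offset  # concatenate frame and offset


# Hypothetical mapping: logical page 0x00123 -> physical frame 0x80045.
l1_table = {0x001: {0x23: 0x80045}}
paddr = translate(0x00123ABC, l1_table)
```

The offset passes through unchanged, so only the upper 20 bits are looked up. Splitting those bits across two levels means 2nd-level tables are only needed for regions that are actually mapped.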
Memory: basic concepts
Stores large number of bits
  m x n: m words of n bits each
  k = log2(m) address input signals, or m = 2^k words
  e.g., 4,096 x 8 memory: 32,768 bits, 12 address input signals, 8 input/output data signals
Memory access
  r/w: selects read or write
  enable: read or write only when asserted
  multiport: multiple accesses to different locations simultaneously
[Figure: m × n memory array (m words, n bits per word) and external view of a 2^k × n read and write memory with enable, r/w, address inputs A0…Ak-1, and data lines Q0…Qn-1]
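The relation between words, word width, and signal counts can be reproduced for the 4,096 x 8 example. The helper name `memory_signals` is hypothetical; it assumes one address per word and rounds up when m is not a power of two.

```python
def memory_signals(m_words, n_bits):
    """Return total bits, address lines, and data lines of an m x n memory."""
    # Smallest k with 2**k >= m_words, computed with integers only.
    address_lines = (m_words - 1).bit_length()
    return {
        "total_bits": m_words * n_bits,
        "address_lines": address_lines,
        "data_lines": n_bits,
    }


example = memory_signals(4096, 8)
```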
Write ability / storage permanence
Traditional ROM/RAM
  ROM: read only, bits stored without power
  RAM: read and write, lose stored bits without power
Distinctions blurred
  Advanced ROMs can be written to, e.g., EEPROM
  Advanced RAMs can hold bits without power, e.g., NVRAM
[Figure: write ability and storage permanence of memories, showing relative degrees along each axis (not to scale).
  Storage permanence axis: near zero; battery life (10 years); tens of years; life of product. The upper range is marked nonvolatile.
  Write ability axis: during fabrication only; external programmer, one time only; external programmer, 1,000s of cycles; external programmer OR in-system, 1,000s of cycles; external programmer OR in-system, block-oriented writes, 1,000s of cycles; in-system, fast writes, unlimited cycles. The upper range is marked in-system programmable.
  Memories plotted: mask-programmed ROM, OTP ROM, EPROM, EEPROM, FLASH, NVRAM, SRAM/DRAM, ideal memory.]
ROM: “Read-Only” Memory
Nonvolatile, read only
“Programmed” before inserting into the embedded system
  Mask-programmed: at fabrication
  One-time programmable (OTP ROM): programmed by user; fuse/antifuse tech., cheaper
  Erasable ROM
    With UV light: EPROM
    With electricity: EEPROM
Uses
  Store software program for general-purpose processor
  Store constant data needed by system
  Implement combinational circuit
[Figure: external view of a 2^k × n ROM with enable, address inputs A0…Ak-1, and data outputs Q0…Qn-1]
EEPROM: Electrically erasable programmable ROM
Programmed and erased electronically
  Uses higher than normal voltage
  Can program and erase individual words
  Can be erased and programmed 1000s of times
Better write ability
  Can be in-system programmable with built-in circuit to provide higher than normal voltage
  Writes very slow due to erasing and programming
Similar storage permanence to EPROM (about 10 years)
Far more convenient than EPROMs, but more expensive
Flash Memory
Extension of EEPROM
  Same floating gate principle
  Same write ability and storage permanence
Fast erase
  Large blocks of memory erased at once, rather than one word at a time
  Blocks typically several thousand bytes large
Writes to single words may be slower
  Entire block must be read, word updated, then entire block written back
Used with embedded systems storing large data items in nonvolatile memory
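The read, update, write-back sequence for a single-word write can be sketched as below. This is an illustrative model only: the 4096-byte block size, the `FlashBlock` class, and the `write_byte` helper are assumptions for the example, not an actual flash driver interface. The model relies on the usual flash behavior that erasing sets all bits to 1 and programming can only clear bits.

```python
BLOCK_SIZE = 4096  # hypothetical block size in bytes


class FlashBlock:
    """Toy model of one flash block: erase sets bits to 1, program clears bits."""

    def __init__(self):
        self.cells = bytearray([0xFF] * BLOCK_SIZE)
        self.erase_count = 0

    def erase(self):
        self.cells = bytearray([0xFF] * BLOCK_SIZE)
        self.erase_count += 1

    def program(self, data):
        # Programming can only change 1 bits to 0, hence the bitwise AND.
        for i, value in enumerate(data):
            self.cells[i] &= value


def write_byte(block, offset, value):
    """Update one byte: read whole block, modify, erase, write whole block back."""
    buffer = bytearray(block.cells)  # 1. read the entire block into RAM
    buffer[offset] = value           # 2. update the single byte
    block.erase()                    # 3. erase the block
    block.program(buffer)            # 4. write the entire block back


block = FlashBlock()
write_byte(block, 10, 0x12)
write_byte(block, 10, 0x34)
```

Each single-byte update costs one full block erase, which is why small scattered writes are slow and consume the limited erase cycles.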
RAM: “Random-access” memory
Volatile, read/write at run time
Internal structure more complex
  a word consists of several memory cells, each storing 1 bit
  each input and output data line connects to each cell in its column
  rd/wr connected to every cell
SRAM: Static RAM
  Memory cell uses flip-flop to store bit
  Requires 6 transistors
  Holds data as long as power supplied
DRAM: Dynamic RAM
  Memory cell uses MOS transistor and capacitor to store bit
  More compact than SRAM
  “Refresh” required due to capacitor leak; a word’s cells are refreshed when read
  Typical refresh interval: 15.625 microsec. per row
  Slower to access than SRAM
[Figure: external view of a 2^k × n read and write memory (enable, r/w, A0…Ak-1, Q0…Qn-1); internal view of a 4×4 RAM with a 2×4 decoder (A0, A1, enable), inputs I0…I3, outputs Q0…Q3, and rd/wr routed to every cell; SRAM cell (Data, Data', W) and DRAM cell (Data, W)]
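The 15.625 microsecond figure is consistent with one common DRAM configuration, in which every row must be refreshed once per 64 ms and the device has 4096 rows. Both numbers are assumptions here, since the slide does not state them; the sketch only shows the arithmetic.

```python
def refresh_interval_us(retention_ms, rows):
    """Average time between row refreshes, in microseconds."""
    return retention_ms * 1000 / rows


# Assumed configuration: 64 ms retention window, 4096 rows.
interval = refresh_interval_us(64, 4096)
```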
RAM variations
PSRAM: Pseudo-static RAM
  DRAM with built-in memory refresh controller
  Popular low-cost high-density alternative to SRAM
NVRAM: Nonvolatile RAM
  Holds data after external power removed
  Battery-backed RAM
    SRAM with own permanently connected battery
    writes as fast as reads
    no limit on number of writes, unlike nonvolatile ROM-based memory
  SRAM with EEPROM or flash
    stores complete RAM contents on EEPROM or flash before power turned off
Extended data out DRAM
Improvement of FPM (fast page mode) DRAM
Extra latch before output buffer
  allows strobing of cas before data read operation completed
Reduces read/write latency by an additional cycle
[Figure: timing diagram of ras, cas, address (row, col, col, col), and data (data, data, data); speedup through overlap]
(S)ynchronous and Enhanced Synchronous (ES) DRAM
SDRAM latches data on active edge of clock
  Eliminates time to detect ras/cas and rd/wr signals
  A counter is initialized to column address, then incremented on active edge of clock to access consecutive memory locations
ESDRAM improves SDRAM
  added buffers enable overlapping of column addressing
  faster clocking and lower read/write latency possible
[Figure: timing diagram of clock, ras, cas, address (row, col), and data (data, data, data)]
Rambus DRAM (RDRAM)
More of a bus interface architecture than DRAM architecture
Data is latched on both rising and falling edge of clock
Broken into 4 banks, each with own row decoder
  can have 4 pages open at a time
Capable of very high throughput
DRAM integration problem
SRAM easily integrated with CPU
DRAM more difficult
  Different chip-making process between DRAM and conventional logic
  Goal of conventional logic (IC) designers: minimize parasitic capacitance to reduce signal propagation delays and power consumption
  Goal of DRAM designers: create capacitor cells to retain stored information
SRAM vs. DRAM
SRAM:
  Faster.
  Easier to integrate with logic.
  Higher active power consumption per bit.
DRAM:
  Denser.
  Must be refreshed.
Design example: SmartBadge
[Figure: SmartBadge block diagram: StrongARM SA-1100 processor; memory: Flash (1 MB), SRAM (1 MB); UCB1200; analog & digital sensors; microphone and speakers; display; RF; DC-DC converter; battery]
Active power: 3.5 W
Idle power: 2.2 W
Standby: 0.2 W
Sleep time: 1 ms
Wake-up time: 150 ms
MPEG2 video example
StrongARM platform with SRAM
[Figure: energy per cycle (mWhr), 0 to 6.E-09, versus cycles, 0 to 800000, broken down by processor, FLASH, SRAM, battery, pins & interconnect, and DC-DC converter]
SmartBadge HW for MPEG2 Video
[Figure: energy consumption (mWhr), 0.00 to 0.30, for the memory architectures / hardware configurations Original, L2 Cache, Burst SRAM, and Burst SDRAM; stacked by processor, instruction memory, data memory, L2 cache, interconnect & pins, and DC-DC converter]
GSRC Workshop August 4, 2008
Rich Freitas
Memory power trends
DRAM: 2x size per chip every 3 years
  Growth rate lower than the requirement for system performance growth
  Numerically, more memory chips will be needed in a system to match the requirement for growth in system performance
Memory chip power is unlikely to decrease
  P = active power + leakage power + refresh power
  DRAM can be put in standby mode, i.e., no active power, but refresh and leakage power are still present, albeit possibly at reduced levels
Definition of Storage Class Memory
A new class of data storage/memory devices
  many technologies compete to be the ‘best’ SCM
SCM features:
  Non-volatile (~10 years)
  Fast access times (~DRAM-like)
  Low cost per bit (more disk-like by 2015)
  Solid state, no moving parts
SCM blurs the distinction between MEMORY (fast, expensive, volatile) and STORAGE (slow, cheap, non-volatile)
Phase-change RAM
[Figure: PCRAM cell: a “programmable resistor” in series with an access device (transistor, diode), connected between bit-line and word-line. Temperature versus time plot for the applied voltage pulses, with the levels Tmelt and Tcryst, showing the “RESET” pulse and the “SET” pulse.]
Potential headache: high power/current affects scaling!
Potential headache: if crystallization is slow, it affects performance!
Memory/Storage Stack Latency Problem
[Figure: latency ladder on a logarithmic time axis (10^1 to 10^10 ns), with a human-scale axis running from a second to a century]
Memory:
  CPU operations (1 ns)
  Get data from L2 cache (10 ns)
  Get data from DRAM or PCM (60 ns)
SCM:
  Access PCM (100 - 1000 ns)
Storage:
  Access FLASH (20 us)
  Access DISK (5 ms)
  Get data from TAPE (40 s)
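The human-scale reading of these latencies can be made concrete by stretching time so that one 1 ns CPU operation lasts one second. The sketch below uses the latencies quoted on this slide; the scaling convention and the helper names `human_seconds` and `describe` are assumptions for illustration.

```python
# Latencies from the slide, in nanoseconds.
LATENCIES_NS = {
    "CPU operation": 1,
    "L2 cache": 10,
    "DRAM or PCM": 60,
    "FLASH": 20_000,
    "DISK": 5_000_000,
    "TAPE": 40_000_000_000,
}


def human_seconds(latency_ns, cpu_op_ns=1):
    """Scale time so that one CPU operation takes one second."""
    return latency_ns / cpu_op_ns


def describe(seconds):
    """Render a duration in the largest unit that fits."""
    units = (("years", 365 * 24 * 3600), ("days", 86400),
             ("hours", 3600), ("minutes", 60))
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


human_scale = {name: describe(human_seconds(ns))
               for name, ns in LATENCIES_NS.items()}
```

On this scale a disk access takes about two months and a tape access more than twelve centuries, which is the gap that storage class memory is meant to narrow.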
[Figure: cost per GB on a logarithmic axis from $0.01 / GB to $1M / GB, with curves for DRAM, NAND, desktop HDD, and SCM. Chart courtesy of Dr. Chung Lam, IBM Research; to be published in IBM Journal R&D.]
If you could have SCM, why would you need anything else?
SCM in a large System
[Figure: system stacks for 1980, 2008, and 2013 across the tiers logic, memory, active storage, and archival, built from CPU, RAM, FLASH SSD, SCM, DISK, and TAPE]
Active vs passive power
The blue area marks active power in the power equations
The red area marks passive power in the power equations
Passive power is unproductive. It just causes heat
  For memories it is leakage and refresh power, which is typically smaller than the maximum active power
  For disks it is keeping the motor spinning and the standby power of the electronics, which is typically larger than the maximum active power
  For PCM it is the leakage and small standby power, which is typically much smaller than the maximum active power
Power equations (passive terms first, active term last):
  $P_M = I_{leak} V_{dd} + I_{refresh} V_{dd} + \alpha C V_{dd}^2 f$
  $P_D = \kappa d^{4.6} r^{2.8} + I_{c\&i} V + \alpha I_{s\&t} V$
  $P_{PCM} = I_{standby} V_{dd} + \alpha I_{active} V_{dd}$
$\alpha$ is the portion of time that the device is active and productive; $\kappa$ is the normalized power of the disk motor
Focus on memory/storage stack
Issues
  Redesign of DRAM and disks to eliminate passive power: possible but not probable
  DRAM has fast turn-off and turn-on times, but it is volatile
    Turning off DRAM when not active causes data loss
  Disks are nonvolatile and turn on/off in ~20-30 seconds
    On/off time too long for practical active storage systems
    Storage systems that manage power in this manner are called MAID systems
    So far, only used for archive systems
Opportunities
  How can PCM be used to virtually eliminate passive power?
    Active power is much greater than passive power
    Turn on/off time ~50 us
  How can data be laid out to minimize active power?
    Memory/storage pools hierarchy
  How can active power be used more efficiently?
    Device design
    System architecture
    Exploit virtualization (management challenge)
    Exploit accelerators
Goal: eliminate passive power and make active power more efficient
Summary
Memory hierarchy
  Needs: speed, low power, predictable
Cache design
  Mapping, replacement & write policies
Memory types
  ROM vs RAM, types of ROM/RAM
Sources and References
Frank Vahid, Tony Givargis, “Embedded System Design,” Wiley, 2002.
Wayne Wolf, “Computers as Components,” Morgan Kaufmann, 2001.
Peter Marwedel, “Embedded Systems Design,” 2004.