Download pdf - CSE 237A Memory - Computer Science and Engineeringcseweb.ucsd.edu/classes/sp10/cse237a/handouts/mem.pdf1 CSE 237A Memory Tajana Simunic Rosing. Department of Computer Science and Engineering

1

CSE 237AMemory

Tajana Simunic RosingDepartment of Computer Science and EngineeringUniversity of California, San Diego.

2Fall 2005

Hardware platform architecture

3Fall 2005

Memory hierarchy Want inexpensive,

fast memory Main memory

Large, inexpensive, slow, stores all data

Cache Small, expensive, fast

memory stores a copy of likely accessed parts

L1, L2

Processor

Cache

Main memory

Disk

Backup

Registers

4Fall 2005

Memory Efficiency is a concern:speed (latency and throughput); predictable timingenergy efficiencysizecostother attributes (volatile vs. persistent, etc)

5Fall 2005

Access-times

2

4

8

2 4 5

Speed

years31

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

≥ 2xevery 2 years

10

6Fall 2005

Caches and CPUs - MPSoC Higher end systems – L1&L2 cache on chip

7Fall 2005

Cache Designed with SRAM, Usually on same chip as processor Cache operation:

Request for main memory access (read or write) First, check cache for copy

cache hit cache miss

Design choices cache mapping

Direct - each memory location maps onto exactly one cache entry Fully associative – anywhere in memory, never implemented Set-associative - each memory location can go into one of n set

write techniques Write-through - write to main memory at each update Write-back – write only when “dirty” block replaced

replacement policies Random LRU: least-recently used FIFO: first-in-first-out

Data

Valid

Tag Index Offset

=

V T D

Tag Index Offset

=

V T DData

Valid

V T D

=

8Fall 2005

Cache impact on system performance

Most important parameters in terms of performance: Total size of cache (data and control info – tags etc) Degree of associativity Data block size

Larger caches -> lower miss rates, higher access cost Average memory access time (h1=L1 hit rate, h2=L2 hit rate)

tav = h1tL1 + (h2-h1)tL2 + (1- h2-h1)tmain

e.g., if miss cost = 20 2 Kbyte: miss rate = 20%, hit cost = 2 cycles, access 5.6 cycles 4 Kbyte: miss rate = 10%, hit cost = 3 cycles, access 4.7 cycles 8 Kbyte: miss rate = 8%, hit cost = 4 cycles, access 4.8 cycles

9Fall 2005

Influence of the associativity

[P. Marwedel et al., ASPDAC, 2004]

10Fall 2005

Predictability

Embedded systems are often real-time:Have to guarantee meeting timing constraints.

Pre run-time scheduling - predictabilityTime-triggered, statically scheduled operating

systems

Predictable cache design?

11Fall 2005

Scratch pad memories (SPM)

Address space

ARM7TDMI cores, well-known for low power consumption

scratch pad memory

0

FFF..

main

SPM

processor

HierarchyExample

no tag memory

12Fall 2005

Why not just use a cache ?

[P. Marwedel et al., ASPDAC, 2004]

Worst case execution time (WCET) may be large

13Fall 2005

Why not just use a cache ?

0

1

2

3

4

5

6

7

8

9

256 512 1024 2048 4096 8192 16384

Ener

gy p

er a

cces

s [n

J]

.

memory size

Scratch padCache, 2way, 4GB spaceCache, 2way, 16 MB spaceCache, 2way, 1 MB space

[R. Banakar, S. Steinke, B.-S. Lee, 2001]

Energy for parallel access of sets, in comparators, muxes.

14Fall 2005

Scratchpad vs. main memory currents

48.2 50.9 44.4 53.1

11677.2 82.2

1.16

020406080

100120140160180

Prog Off-Chip/ Data Off-Chip

Prog Off-Chip/ Data On-Chip

Prog On-Chip/ Data Off-Chip

Prog On-Chip/ Data On-Chip

mA

Current32 Bit-Load Instruction (Thumb)

Core+On-Chip-Memory Current (mA) Off-Chip-Memory Current (mA)

Example: Atmel ARM-Evaluation board

Processor

SPM (On-chip)

Main

Memory

(on board)

current reduction:

Factor of 3.02

15Fall 2005

Scratchpad vs. main memory energy

115.8

51.676.5

16.4

0.020.040.060.080.0

100.0120.0140.0

Prog Off-Chip/ Data Off-Chip

Prog Off-Chip/ Data On-Chip

Prog On-Chip/ Data Off-Chip

Prog On-Chip/ Data On-Chip

nJ

Energy32 Bit-Load Instruction (Thumb)

Energy

Example: Atmel ARM-Evaluation board "Main" memory access takes longer

savings 86%

energy reduction:factor of 7.06

100% predictable

16Fall 2005

Memory management units Memory management unit (MMU)

translates addresses:

CPU mainmemory

memorymanagement

unit

logicaladdress

physicaladdress

17Fall 2005

Memory Management Unit (MMU)

Duties of MMU Handles DRAM refresh, bus interface & arbitration Takes care of memory sharing among multiple CPUs Translates logic memory addresses from processor to

physical memory addresses of DRAM Modern CPUs often come with MMU built-in Single-purpose processors can be used

18Fall 2005

Address translation Mapping logical to physical

addresses. Two basic schemes:

Segmented memory footprint can change

dynamically usually only a few segments per

process; e.g. data and stack Paged

size preassigned can be combined (x86).

SEGMENTATION PAGING

Involves programmer Transparent to programmer

Separate compiling No separate compiling

Separate protection No separate protection

Shared No sharing

memory

segment 1

segment 2

page 1page 2

19Fall 2005

ARM memory management Memory region types:section: 1 Mbyte block; large page: 64 kbytes;small page: 4 kbytes.

An address is marked as section-mapped or page-mapped.

Two-level translation scheme.

20Fall 2005

ARM address translation

offset1st index 2nd index

physical address

Translation tablebase register

1st level tabledescriptor

2nd level tabledescriptor

concatenate

concatenate

21Fall 2005

Memory: basic concepts Stores large number of bits

m x n: m words of n bits each k = Log2(m) address input signals or m = 2^k words e.g., 4,096 x 8 memory:

32,768 bits 12 address input signals 8 input/output data signals

Memory access r/w: selects read or write enable: read or write only when asserted multiport: multiple accesses to different

locations simultaneously

m × n memory

…

…

n bits per word

mw

ords

enable2k × n read and write

memory

A0…

r/w

…

Q0Qn-1

Ak-1

memory external view

22Fall 2005

Write ability/ storage permanence Traditional ROM/RAM

ROM read only, bits stored

without power RAM

read and write, lose stored bits without power

Distinctions blurred Advanced ROMs can be

written to e.g., EEPROM

Advanced RAMs can hold bits without power e.g., NVRAM Write ability and storage permanence of memories,

showing relative degrees along each axis (not to scale).

Externalprogrammer

OR in-system,block-orientedwrites, 1,000s

of cycles

Batterylife (10years)

Writeability

EPROM

Mask-programmed ROM

EEPROM FLASH

NVRAM

SRAM/DRAM

Stor

age

perm

anen

ce

Nonvolatile

In-systemprogrammable

Ideal memory

OTP ROM

Duringfabrication

only

Externalprogrammer,

1,000sof cycles

Externalprogrammer,one time only

Externalprogrammer

OR in-system,1,000s

of cycles

In-system, fastwrites,

unlimitedcycles

Nearzero

Tens ofyears

Life ofproduct

23Fall 2005

ROM: “Read-Only” Memory Nonvolatile, read only “Programmed” before inserting to

embedded system Mask-programmed – at fabrication One-time programmable (OTP ROM)

Programmed by user; fuse/anitfuse tech., cheaper

Erasable ROM With UV light – EPROM With electricity - EEPROM

Uses Store software program for general-purpose

processor Store constant data needed by system Implement combinational circuit

2k × n ROM

…

Q0Qn-1

A0

…

enable

Ak-1

External view

24Fall 2005

EEPROM: Electrically erasable programmable ROM

Programmed and erased electronically higher than normal voltage can program and erase individual words Can be erased and programmed 1000s of times

Better write ability can be in-system programmable with built-in circuit to provide

higher than normal voltage writes very slow due to erasing and programming

Similar storage permanence to EPROM (about 10 years) Far more convenient than EPROMs, but more expensive

25Fall 2005

Flash Memory Extension of EEPROM

Same floating gate principle Same write ability and storage permanence

Fast erase Large blocks of memory erased at once, rather than one word at a

time Blocks typically several thousand bytes large

Writes to single words may be slower Entire block must be read, word updated, then entire block written

back

Used with embedded systems storing large data items in nonvolatile memory

26Fall 2005

RAM: “Random-access” memory Volatile, read/write at run time Internal structure more complex

a word consists of several memory cells, each storing 1 bit each input and output data line connects to each cell in its

column rd/wr connected to every cell

SRAM: Static RAM Memory cell uses flip-flop to store bit Requires 6 transistors Holds data as long as power supplied

DRAM: Dynamic RAM Memory cell uses MOS transistor and capacitor to store bit More compact than SRAM “Refresh” required due to capacitor leak

word’s cells refreshed when read Typical refresh rate 15.625 microsec. Slower to access than SRAM

enable2k × n read and write

memory

A0 …

r/w

…

Q0Qn-1

Ak-1

external view

4×4 RAM

2×4 decoder

Q0Q3

A0

enable

A1

Q2 Q1

Memory cell

I0I3 I2 I1

rd/wr To every cell

internal view

Data

W

Data'

SRAM

DataW

DRAM

27Fall 2005

Ram variations PSRAM: Pseudo-static RAM

DRAM with built-in memory refresh controller Popular low-cost high-density alternative to SRAM

NVRAM: Nonvolatile RAM Holds data after external power removed Battery-backed RAM

SRAM with own permanently connected battery writes as fast as reads no limit on number of writes unlike nonvolatile ROM-based

memory SRAM with EEPROM or flash

stores complete RAM contents on EEPROM or flash before power turned off

28Fall 2005

Extended data out DRAM

Improvement of FPM (full page mode) DRAM Extra latch before output buffer

allows strobing of cas before data read operation completed

Reduces read/write latency by additional cycle

row col col col

data data data

Speedup through overlap

ras

cas

address

data

29Fall 2005

(S)ynchronous and Enhanced Synchronous (ES) DRAM

SDRAM latches data on active edge of clock Eliminates time to detect ras/cas and rd/wr signals A counter is initialized to column address then

incremented on active edge of clock to access consecutive memory locations

ESDRAM improves SDRAM added buffers enable overlapping of column addressing faster clocking and lower read/write latency possible

clock

ras

cas

address

data

row col

data data data

30Fall 2005

Rambus DRAM (RDRAM)

More of a bus interface architecture than DRAM architecture

Data is latched on both rising and falling edge of clock

Broken into 4 banks each with own row decoder can have 4 pages open at a time

Capable of very high throughput

31Fall 2005

DRAM integration problem SRAM easily integrated with CPU DRAM more difficultDifferent chip making process between DRAM

and conventional logicGoal of conventional logic (IC) designers:

minimize parasitic capacitance to reduce signal propagation delays and power consumption

Goal of DRAM designers: create capacitor cells to retain stored information

32Fall 2005

SRAM vs. DRAM SRAM:Faster.Easier to integrate with logic.Higher active power consumption per bit.

DRAM:Denser.Must be refreshed.

33Fall 2005

UCB1200Analog &

DigitalSensors

Microphoneand

Speakers

Memory:Flash (1MB)SRAM (1MB)

Display

DC-DCConverter Ba

ttery

RF

StrongARMSA-1100

Design example: SmartBadge

Active power: 3.5 W Idle power: 2.2 W Standby: 0.2 W Sleep time: 1ms Wake-up time: 150 ms

34Fall 2005

MPEG2 video example StrongARM platform with SRAM

0.E+00

1.E-09

2.E-09

3.E-09

4.E-09

5.E-09

6.E-09

0 100000 200000 300000 400000 500000 600000 700000 800000

Ene

rgy

per C

ycle

(mW

hr)

Cycles

ProcessorFLASHSRAMBatteryPins & InterconnectDC-DC Converter

35Fall 2005

SmartBadge HW for MPEG2 Video

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Original L2 Cache Burst SRAM Burst SDRAM

Ene

rgy

(mW

hr)

DC-DC Converter

Interconnect & Pins

L2 Cache

Data Memory

Instruction Memory

Processor

Energy Consumption

Memory Architectures

Hardware Configurations

GSRC Workshop August 4, 2008

Rich Freitas

Memory power trends DRAM: 2X size / chip / 3 year

Growth rate lower than the requirement for system performance growth

Numerically, more memory chips will be needed in a system to match the requirement for growth in system performance

Memory chip power is unlikely to decrease P = active power + leakage power + refresh power DRAM can be put in standby mode, i.e., no active power, but

refresh and leakage power are still present albeit they may be at reduced levels

37


Rich Freitas

Definition of Storage Class Memory

A new class of data storage/memory devicesmany technologies compete to be the ‘best’ SCM

SCM features: Non-volatile ( ~ 10 years) Fast Access times (~ DRAM like ) Low cost per bit more (DISK like – by 2015) Solid state, no moving parts

SCM blurs the distinction between MEMORY (fast, expensive, volatile ) and STORAGE (slow, cheap, non-volatile)

38


Rich Freitas

Phase-changeRAM

Access device(transistor, diode)

PCRAM“programmableresistor”

Bit-line

Word-line

temperature

time

Tmelt

Tcryst

“RESET” pulse

“SET” pulse

VoltagePotential headache: High power/current affects scaling!

Potential headache: If crystallization is slow affects performance!

39


Rich Freitas

Memory/Storage Stack Latency Problem

CPU operations (1ns)

Get data from L2 cache (10ns)

Access DISK (5ms)

Get data from TAPE (40s)

Storage

MemoryGet data from DRAM or PCM (60ns)

Tim

e in

ns

102

108

103

104

105

106

107

109

1010

101

Access FLASH (20 us)

SCMAccess PCM (100 – 1000 ns)

second

CenturyH

uman

Sca

le

40


Rich Freitas

$1 / GB

Chart courtesy of

Dr. Chung Lam

IBM Research

To be published

IBM Journal R&D

$10 / GB

$100 / GB

$1k / GB

$10k / GB

$100k / GB

$1M / GB

$0.10 / GB

$0.01 / GB

DRAM

NAND

DesktopHDD

SCM

SCM

If you could have SCM, why would you need anything else?

C-33

41


Rich Freitas

SCM in a large System

CPU RAM DISK

CPU SCM

TAPE

RA

M

CPU RAM DISK TAPEFLASH

SSD2008

1980

2013

ArchivalActive StorageMemoryLogic

TAPE...DISK

42 GSRC Workshop August 4, 2008 Rich Freitas

Active vs passive power The blue area marks active power in the

power equations The red area marks passive power in the

power equations Passive power is unproductive. It just

causes heat For memories it is leakage and

refresh power, which is typically smaller than maximum active power

For disks it is keeping the motor spinning and the standby power of the electronics, which is typically larger than the maximum active power

For PCM is is the leakage and small standby power and is typically much much smaller than the maximum active power.

fCVIVIVP ddM dddd

2refreshleak α++=

VIVIrdPD t&sc&i8.26.4 ακ ++=

ddddPCM VIVIP activestandby α+=

passive active

motordisk theofpower normalized theis productive and active is device that the timeofportion theis

κα

43


Rich Freitas

Focus on memory/storage stack

Issues Redesign of DRAM and Disks to

eliminate passive power Possible but not probable

DRAM has fast turn off and turn on times, but it is volatile Turning off DRAM when not

active causes data loss Disks are nonvolatile and turn on/off

in ~ 20-30 seconds On/off time to long for practical

active storage systems Storage systems that manage

power in this manner are called MAID system.

So far, only used for archive systems

Opportunities How can PCM be used to virtually

eliminate passive power? Active power is much greater than

passive power Turn on/off time ~50us

How can data be laid out to minimize active power? Memory/storage pools hierarchy

How can active power be used more efficiently? Device design System architecture exploit virtualization (management

challenge) exploit accelerators

Goal: eliminate passive power and make active power more efficient

44Fall 2005

Summary Memory hierarchy Needs: speed, low power, predictable

Cache designMapping, replacement & write policies

Memory typesROM vs RAM, types of ROM/RAM

45Fall 2005

Sources and References

Frank Vahid, Tony Givargis, “Embedded System Design,” Wiley, 2002.

Wayne Wolf, “Computers as Components,” Morgan Kaufmann, 2001.

Peter Marwedel, “Embedded Systems Design,” 2004.