
ELEC 5200/6200

Computer Architecture and Design

Spring 2017 Lecture 7: Memory Organization – Part II

1/8/2017 ELEC 5200-001/6200-001 Lecture 7 1

Ujjwal Guin, Assistant Professor

Department of Electrical and Computer Engineering

Auburn University, Auburn, AL 36849

http://www.auburn.edu/~uzg0005/

Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)

[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]


Designing a Computer

[Figure: the components of a computer: Input, Output, Memory, and the Central Processing Unit (CPU, or “Processor”), which consists of Control and Datapath.]

Types of Computer Memories

From the cover of: A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Upper Saddle River, New Jersey: Pearson Prentice Hall, 2006.

Random Access Memory (RAM)

[Figure: RAM block diagram with a memory cell array, an address decoder driven by the address bits, and read/write circuits carrying the data bits.]

Six-Transistor SRAM Cell

[Figure: six-transistor SRAM cell with one word line and a pair of complementary bit lines.]

Dynamic RAM (DRAM) Cell

[Figure: “single-transistor DRAM cell” connected to one word line and one bit line.]

Robert Dennard’s 1967 invention

Classical RAM Organization (~Square)

[Figure: a roughly square RAM cell array. The row decoder, driven by the row address, selects one word (row) line. The bit (data) lines feed the column selector and I/O circuits, which use the column address to deliver the data bit or word.]

Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell

One memory row holds a block of data, so the column address selects the requested bit or word from that block

Classical DRAM Organization (~Square Planes)

[Figure: several roughly square RAM cell array planes with a row decoder (row address) and column selector and I/O circuits (column address). Word (row) lines and bit (data) lines cross each plane, and each plane delivers one data bit.]

Each intersection represents a 1-T DRAM cell

The column address selects the requested bit from the row in each plane

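Classical DRAM Operation

DRAM Organization:

– N rows x N columns x M bits

– Read or write M bits at a time

– Each M-bit access requires a RAS / CAS (row address strobe / column address strobe) cycle

[Figure: DRAM array of N rows by N columns in M bit planes, with row address and column address inputs and an M-bit output.]

[Timing diagram: in each cycle time, RAS latches the row address and then CAS latches the column address. The full sequence repeats for the 1st and 2nd M-bit accesses.]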

Page Mode DRAM Operation

Page Mode DRAM

– N x M SRAM to save a row

After a row is read into the SRAM “register”

– Only CAS is needed to access other M-bit words on that row

– RAS remains asserted while CAS is toggled

[Figure: DRAM array of N rows by N columns in M bit planes, with an N x M SRAM row register, row address and column address inputs, and an M-bit output.]

[Timing diagram: within one cycle time, a single row address (RAS) is followed by four column addresses (CAS), giving the 1st through 4th M-bit accesses.]

Synchronous DRAM (SDRAM) Operation

After a row is read into the SRAM register

– Inputs CAS as the starting “burst” address along with a burst length

– Transfers a burst of data from a series of sequential addresses within that row

– A clock controls transfer of successive words in the burst (300 MHz in 2004)

[Figure: DRAM array of N rows by N columns in M bit planes, with an N x M SRAM row register, a +1 column address incrementer, row address and column address inputs, and an M-bit output.]

[Timing diagram: within one cycle time, one row address (RAS) and one column address (CAS) are followed by the 1st through 4th M-bit accesses, then the next row address.]

Other SDRAM Architectures

Double Data Rate SDRAMs – DDR-SDRAMs

– Double data rate because they transfer data on both the rising and falling edges of the clock

Most widely used form of SDRAM

– For DDR memory, 2n prefetch architecture means

The internal bus is twice as wide as the external bus

Hence, the internal column access frequency can be half the external data rate

– For users, 2n prefetch means that data access occurs in pairs

A single READ fetches two data words

For a single WRITE, two data words must be provided
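The relationship between clock, data rate, and 2n prefetch can be checked with a short calculation. This is a minimal sketch: the function name `ddr_rates` and the example figures (200 MHz clock, 64-bit external bus) are assumptions chosen for illustration, not values from the slides.

```python
def ddr_rates(clock_mhz, bus_width_bits, prefetch=2):
    """Return (data rate in MT/s, internal column access rate in MHz, peak MB/s)."""
    # DDR transfers one word on the rising edge and one on the falling edge.
    data_rate_mtps = 2 * clock_mhz
    # With n-word prefetch, each internal column access supplies `prefetch` words,
    # so the column logic only needs to run at data_rate / prefetch.
    internal_column_rate_mhz = data_rate_mtps / prefetch
    peak_mb_per_s = data_rate_mtps * bus_width_bits / 8
    return data_rate_mtps, internal_column_rate_mhz, peak_mb_per_s


# Hypothetical example: 200 MHz clock, 64-bit external data bus.
rate, column_rate, peak = ddr_rates(200, 64)
print(rate, column_rate, peak)
```

With 2n prefetch the internal column rate equals the clock frequency, which is half the external data rate.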

Other SDRAM Architectures (Cont.)

https://www.synopsys.com/Company/Publications/DWTB/Pages/dwtb-ddr4-bank-groups-2013Q2.aspx

Memory Systems that Support Caches

The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

[Figure: three memory organizations. (a) One-word-wide memory organization: CPU, cache, bus, memory. (b) Wide memory organization: CPU, multiplexor, cache, bus, memory. (c) Interleaved memory organization: CPU, cache, bus, memory banks 0 to 3.]

[Figure: on-chip CPU and cache connected to memory by a bus carrying 32-bit data and a 32-bit address per cycle.]

One Word Wide Memory Organization

One word wide bus and one word wide memory

– The bus contains both address and data lines

Wide Memory Organization

Increase the bandwidth of the one-word-wide organization by widening

– Memory

– Buses between processor and memory

To allow parallel access to all the words of a block

Logic between processor and cache consists of

– A MUX on READs

– Control logic to update the appropriate words on WRITEs

[Figure: the three memory organizations (one-word-wide, wide with multiplexor, and interleaved with banks 0 to 3), repeated for comparison.]

Interleaved Memory Organization

Widen the memory but not the interconnection

– Example: 4-way interleaving

One-word-wide bus and four-word-wide memory

Sending an address to several banks permits them all to read simultaneously

– Hence, incurring the full DRAM latency only once

[Figure: on-chip CPU and cache connected by a bus to memory banks 0, 1, 2, and 3.]
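One common way to spread consecutive words across banks is low-order interleaving, where the bank number is the word address modulo the number of banks. The slides do not say which mapping is used, so the sketch below is an assumption for illustration; `bank_of`, the 4-byte word size, and the example address range are hypothetical.

```python
WORD_BYTES = 4   # assumed 32-bit words
NUM_BANKS = 4    # 4-way interleaving, as in the example above


def bank_of(byte_address):
    """Return (bank number, row within bank) under low-order interleaving."""
    word_address = byte_address // WORD_BYTES
    return word_address % NUM_BANKS, word_address // NUM_BANKS


# The four words of one aligned block land in four different banks,
# so a single address broadcast lets all banks read in parallel.
block = [bank_of(a) for a in range(0x100, 0x110, WORD_BYTES)]
print(block)
```

Because the four words of an aligned block share the same row index and differ only in bank number, one address is enough to start all four reads.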

Interleaved Memory Organization

Bank Conflict

– Two consecutive memory operations use the same bank

– This causes the memory to stall until the busy bank has completed the prior operation

[Figure: on-chip CPU and cache connected by a bus to memory banks 0, 1, 2, and 3.]
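The cost of a bank conflict can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a description of any particular memory controller: one request may issue per cycle, a bank stays busy for 25 cycles per access, and the bank is the word address modulo 4. The name `finish_time` is hypothetical.

```python
def finish_time(word_addresses, num_banks=4, busy_cycles=25):
    """Cycle at which the last access completes, issuing at most one request per cycle."""
    bank_free_at = [0] * num_banks
    now = 0
    for word in word_addresses:
        bank = word % num_banks
        # Stall until the target bank has finished its previous operation.
        start = max(now, bank_free_at[bank])
        bank_free_at[bank] = start + busy_cycles
        now = start + 1  # the next request can issue one cycle later
    return max(bank_free_at)


print(finish_time([0, 1, 2, 3]))   # consecutive words: four different banks
print(finish_time([0, 4, 8, 12]))  # stride of 4 words: every access hits bank 0
```

In this model the conflict-free sequence overlaps all four accesses, while the stride-4 sequence serializes them on a single bank.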

Example: how a memory system affects overall performance

Assume that reading DRAM on a cache miss takes

– 1 clock cycle to send the address

– 25 clock cycles for DRAM cycle time

– 8 clock cycles for DRAM access time

– 1 clock cycle to return a word of data

Memory-Bus to Cache bandwidth

– Number of bytes accessed from memory and transferred to cache/CPU per clock cycle


[Figure: on-chip CPU and cache connected to memory by a bus.]

Performance of One Word Wide Memory Organization

If the block size is one word, then for a memory access due to a cache miss, the pipeline must stall for the number of cycles required to return one data word from memory

1 cycle to send address

25 cycles to read DRAM

1 cycle to return data

27 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

4/27 = 0.148 bytes per clock

[Figure: on-chip CPU and cache connected to memory by a bus.]

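What if the block size is four words?

1 cycle to send 1st address

4 x 25 = 100 cycles to read DRAM

4 x 1 = 4 cycles to return the four data words

1 + 100 + 4 = 105 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 x 4)/105 = 0.152 bytes per clock

[Timing diagram: four back-to-back DRAM accesses of 25 cycles each.]

As soon as data is available, address can be changed to access the next word

[Figure: on-chip CPU and cache connected to memory by a bus.]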

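What if the block size is four words and a fast page mode DRAM is used?

1 cycle to send 1st address

25 + 3 x 8 = 49 cycles to read DRAM

4 x 1 = 4 cycles to return the four data words

54 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 x 4)/54 = 0.296 bytes per clock

[Timing diagram: one 25-cycle access followed by three 8-cycle accesses to the same row.]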
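The one-word-wide cases can be recomputed with a small model. This is a sketch under stated assumptions: it charges one address cycle, the DRAM read cycles, and one transfer cycle per word with no overlap between them, and the names (`one_word_wide`, the constants) are hypothetical. Under this model the three terms listed for the four-word block without page mode (1, 100, and 4) sum to 105 cycles.

```python
ADDRESS_CYCLES = 1    # send the address
DRAM_CYCLE = 25       # full DRAM cycle time
PAGE_ACCESS = 8       # later accesses to the same row in fast page mode
TRANSFER_CYCLES = 1   # return one word over the bus
WORD_BYTES = 4        # 32-bit words


def one_word_wide(block_words, page_mode=False):
    """Return (miss penalty in cycles, bandwidth in bytes per clock)."""
    if page_mode:
        # First word pays the full cycle time; the rest pay only the access time.
        read_cycles = DRAM_CYCLE + (block_words - 1) * PAGE_ACCESS
    else:
        read_cycles = block_words * DRAM_CYCLE
    penalty = ADDRESS_CYCLES + read_cycles + block_words * TRANSFER_CYCLES
    return penalty, block_words * WORD_BYTES / penalty


for words, page_mode in [(1, False), (4, False), (4, True)]:
    penalty, bandwidth = one_word_wide(words, page_mode)
    print(words, page_mode, penalty, round(bandwidth, 3))
```

Fast page mode roughly doubles the bandwidth for a four-word block because only the first access pays the full 25-cycle DRAM cycle time.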


Performance of Wide Memory Organization

What if the cache block size is four words and the main memory is two words wide?

1 cycle to send 1st address

2 x 25 = 50 cycles to read DRAM

2 x 1 = 2 cycles to return the data

53 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 x 4)/53 = 0.302 bytes per clock

[Timing diagram: two DRAM accesses of 25 cycles each. Each access has two words in parallel.]

[Figure: the three memory organizations (one-word-wide, wide with multiplexor, and interleaved with banks 0 to 3), repeated for comparison.]

Performance of Wide Memory Organization

What if the cache block size is four words and the main memory is four words wide?

1 cycle to send 1st address

25 cycles to read DRAM

1 cycle to return the data

27 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 x 4)/27 = 0.593 bytes per clock

[Timing diagram: one DRAM access of 25 cycles. This access has four words in parallel.]

[Figure: the three memory organizations (one-word-wide, wide with multiplexor, and interleaved with banks 0 to 3), repeated for comparison.]

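Performance of Interleaved Memory Organization

For a block size of four words

1 cycle to send 1st address

25 cycles to read DRAM

4 x 1 = 4 cycles to return the four data words

30 total clock cycles miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 x 4)/30 = 0.533 bytes per clock

[Timing diagram: four 25-cycle DRAM accesses, one per bank, overlapped in time.]

[Figure: on-chip CPU and cache connected by a bus to memory banks 0, 1, 2, and 3.]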
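The wide and interleaved results can be reproduced with the same cycle counts (1 to send the address, 25 per DRAM access, 1 per bus transfer). This is a minimal sketch: the function names are hypothetical, and the interleaved model assumes the block fits in one pass over the banks.

```python
import math

WORD_BYTES = 4  # 32-bit words


def wide(block_words, memory_words):
    """Memory and bus are `memory_words` wide; each access moves that many words."""
    accesses = math.ceil(block_words / memory_words)
    penalty = 1 + accesses * 25 + accesses * 1
    return penalty, block_words * WORD_BYTES / penalty


def interleaved(block_words, banks=4):
    """One-word-wide bus; all banks read in parallel, words return one per cycle."""
    if block_words > banks:
        raise ValueError("this sketch assumes the block fits in one pass over the banks")
    penalty = 1 + 25 + block_words * 1
    return penalty, block_words * WORD_BYTES / penalty


print(wide(4, 2))
print(wide(4, 4))
print(interleaved(4))
```

Interleaving recovers most of the benefit of the four-word-wide memory (30 versus 27 cycles) while keeping the bus one word wide.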

DRAM Memory System Summary

It’s important to match the cache characteristics

– Caches access one block at a time (usually more than one word)

DRAM characteristics

– Use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

Memory-bus characteristics

– Make sure the memory-bus can support the DRAM access rates and patterns

– The goal is to increase the Memory-Bus to Cache bandwidth

Virtual Memory Hardware Support

Review: The Memory Hierarchy

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology

[Figure: memory hierarchy showing the (relative) size of the memory at each level, with increasing distance from the processor in access time: Processor, L1$, L2$, Main Memory, Secondary Memory. Transfer sizes between adjacent levels: 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, and 1,024+ bytes (disk sector = page).]

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

Memory Hierarchy

[Figure: registers, cache (1 or more levels), main memory (physical), and virtual memory. Words are transferred via load/store, blocks are transferred automatically upon cache miss, and pages are transferred automatically upon page fault.]

Cache Miss and Page Fault

[Figure: Processor, Cache, MMU, Main Memory, and Disk. Figure labels: all data, organized in pages (~4 KB), accessed by physical addresses; cached pages and page table; pages (write-back, same as in cache).]

Page fault: a required page is not found in main memory

Cache miss: a required block is not found in cache

“Page fault” in virtual memory is similar to “miss” in cache.

Virtual vs. Physical Address

Processor assumes a virtual memory addressing scheme:

Disk is a virtual memory (large, slow)

A block of data is called a virtual page

An address is called virtual (or logical) address (VA)

Main memory may have a different addressing scheme:

Physical memory consists of caches

Memory address is called physical address

MMU translates virtual address to physical address

Complete address translation table is large and is kept in main memory

MMU contains TLB (translation-lookaside buffer), which keeps record of recent address translations.

Two Programs Sharing Physical Memory

A program’s address space is divided into pages (all are one fixed size) or segments (variable sizes)

– The starting location of each page (either in main memory or in secondary memory) is contained in the program’s page table

[Figure: the virtual address spaces of Program 1 and Program 2, with their pages located in main memory or on disk.]

Memory Hierarchy Example

32-bit address (byte addressing)

4 GB virtual main memory (disk space)

Page size = 4 KB

Number of virtual pages = 4 x 2^30 / (4 x 2^10) = 2^20 = 1M

Bits for virtual page number = log2(1M) = 20

1 GB physical main memory

Page size = 4 KB

Number of physical pages = 1 x 2^30 / (4 x 2^10) = 2^18 = 256K

Bits for physical page number = log2(256K) = 18

Page table contains 1M records specifying where each virtual page is located.
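The page counts and field widths in this example follow directly from the memory and page sizes, as this short sketch shows. The helper name `page_number_bits` is hypothetical, and the code assumes both sizes are powers of two.

```python
KB = 2**10
GB = 2**30


def page_number_bits(memory_bytes, page_bytes):
    """Return (number of pages, bits needed for the page number)."""
    pages = memory_bytes // page_bytes
    # For a power of two, bit_length() - 1 is log2.
    return pages, pages.bit_length() - 1


virtual_pages, vpn_bits = page_number_bits(4 * GB, 4 * KB)
physical_pages, ppn_bits = page_number_bits(1 * GB, 4 * KB)
offset_bits = (4 * KB).bit_length() - 1

print(virtual_pages, vpn_bits)    # 1M pages, 20 bits
print(physical_pages, ppn_bits)   # 256K pages, 18 bits
print(offset_bits)                # 12-bit offset within a 4 KB page
```

The 20-bit virtual page number plus the 12-bit offset accounts for the full 32-bit virtual address.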

Address Translation

Virtual Address (VA): virtual page number (bits 31 to 12) | page offset (bits 11 to 0)

Translation

Physical Address (PA): physical page number (bits 29 to 12) | page offset (bits 11 to 0)

So each memory request first requires an address translation from the virtual space to the physical space

– A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault

A virtual address is translated to a physical address by a combination of hardware and software

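Page Table

[Figure: the page table register holds the address of the page table in main memory. The page table is indexed by the 1M virtual page numbers (0, 1, 2, 3, ..., K, ...). Each entry holds a valid bit, other flags (e.g., dirty bit, LRU ref. bit), and the page location, which points to a page in physical main memory (Page 0 to Page 3) or in virtual main memory (pages on disk). In the example, the page-location column for the first twelve entries reads: -, 2, -, 1, 3, -, -, -, -, 0, -, -.]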

32-bit Virtual Address (4 KB Page)

20-bit virtual page number | 10-bit word number within page | 2-bit byte offset

[Figure: a virtual page contains 4 KB of data, or 1K words of 32 bits (4 bytes) each.]

Virtual to Physical Address Translation

Virtual address: 20-bit virtual page number | 12-bit byte offset within page

Address translation

Physical address: 18-bit physical page number | 12-bit byte offset within page
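The translation replaces the virtual page number with a physical page number and copies the 12-bit offset unchanged. The sketch below illustrates this with a dictionary standing in for the page table. The names `PageFault` and `translate` and the table contents are hypothetical, and a real MMU performs this lookup in hardware.

```python
PAGE_OFFSET_BITS = 12
OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1


class PageFault(Exception):
    """Raised when the virtual page is not present in main memory."""


def translate(virtual_address, page_table):
    """Map a 32-bit virtual address to a physical address."""
    vpn = virtual_address >> PAGE_OFFSET_BITS   # 20-bit virtual page number
    offset = virtual_address & OFFSET_MASK      # 12-bit byte offset, unchanged
    if vpn not in page_table:
        raise PageFault(f"virtual page {vpn} is not in main memory")
    ppn = page_table[vpn]                       # 18-bit physical page number
    return (ppn << PAGE_OFFSET_BITS) | offset


# Illustrative page table: virtual page 4 resides in physical page 3.
example_table = {4: 3, 1: 2}
print(hex(translate(0x00004ABC, example_table)))
```

Only resident pages appear in the dictionary, so a missing key plays the role of a cleared valid bit.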

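Virtual Memory System

[Figure: the processor sends a virtual or logical address (VA) to the MMU (memory management unit with TLB). The MMU uses the physical address (PTE) from the page table in main memory and issues a physical address. Data moves between the processor, the cache (SRAM), and main memory with page table (DRAM). The disk is connected through DMA (direct memory access).]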

TLB: Translation-Lookaside Buffer

A processor request requires two accesses to main memory:

– Access page table to get physical address

– Access physical address

TLB acts as a cache of page table

– Holds recent virtual to physical page translations

– Eliminates one main memory access if requested virtual page address is found in TLB (hit)

TLB Organization

[Figure: the page table from the Page Table slide (page table register, 1M virtual page numbers, valid bit, other flags, and page locations pointing to physical main memory or to pages on disk), with a TLB holding copies of two of its entries.]

TLB Data

V   D   R   Tag   Phy. Pg. Addr.
1   1   1   4     3
1   0   1   1     2

V: Valid bit

D: Dirty bit

R: Reference bit (LRU)

Tag: Index in page table

Phy. Pg. Addr.: Physical page address
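A TLB can be modeled as a small, fully associative cache of page table entries. The sketch below uses LRU replacement and counts main memory accesses per request: one on a TLB hit, two on a miss. The class name `TLB`, the two-entry capacity, and the reference sequence are assumptions for illustration. The mappings for virtual pages 4 and 1 follow the TLB data table above, and the other two are illustrative.

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (toy model)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()  # tag (virtual page number) -> physical page

    def lookup(self, vpn, page_table):
        """Return (physical page, hit?)."""
        if vpn in self.entries:
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn], True
        ppn = page_table[vpn]                  # extra main memory access on a miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn, False


page_table = {1: 2, 3: 1, 4: 3, 9: 0}  # virtual page -> physical page
tlb = TLB(capacity=2)
hits = [tlb.lookup(vpn, page_table)[1] for vpn in [4, 1, 4, 3, 1]]
memory_accesses = sum(1 if hit else 2 for hit in hits)
print(hits, memory_accesses)
```

In this trace only the third reference hits, so the five requests cost nine main memory accesses instead of the ten needed without a TLB.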

Typical TLB Characteristics

TLB size: 16 – 512 entries

Block size: 1 – 2 page table entries of 4 – 8 bytes each

Hit time: 0.5 – 1 clock cycle

Miss penalty: 10 – 100 clock cycles

Miss rate: 0.01% – 1%
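These ranges can be combined into an average translation cost using the usual cache formula, hit time plus miss rate times miss penalty. This is a sketch that assumes the miss penalty is paid on top of the hit time; the function name is hypothetical, and the two calls simply plug in the low and high ends of the ranges listed above.

```python
def average_translation_cycles(hit_time, miss_rate, miss_penalty):
    """Average cycles per address translation: hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Low end of each range: 0.5-cycle hit, 0.01% miss rate, 10-cycle penalty.
best = average_translation_cycles(0.5, 0.0001, 10)
# High end of each range: 1-cycle hit, 1% miss rate, 100-cycle penalty.
worst = average_translation_cycles(1.0, 0.01, 100)
print(round(best, 3), round(worst, 3))
```

Even at the unfavorable end of every range the average stays at about two cycles, which is why a small TLB is so effective.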

Integrating Virtual Memory, TLBs, and Caches

Intel IA-32 Memory Management

Memory management for the IA-32 architecture is divided into two parts: segmentation and paging.

– Segmentation provides a mechanism for isolating individual code, data, and stack modules

Multiple programs (or tasks) can run on the same processor without interfering with one another.

– Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system

Sections of a program’s execution environment are mapped into physical memory as needed.

The processor uses two stages of address translation

– Logical address to linear address (Segmentation)

– Linear address to physical address (Paging)

*Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1

Segmentation

Segmentation provides a mechanism for dividing

the processor’s addressable memory space (called

the linear address space) into smaller protected

address spaces called segments.

All the segments in a system are contained in the

processor’s linear address space.

To locate a byte in a particular segment, a logical

address (far pointer) has to be provided.

Logical Address Translation
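In IA-32 protected mode, the linear address is formed by adding the offset of the logical address to the base address of the selected segment, after checking the offset against the segment limit. The sketch below is a simplified model: the `Segment` class, the byte-granular limit, and the example values are assumptions, and descriptor table lookup and privilege checks are omitted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base: int    # segment base address in the linear address space
    limit: int   # highest valid offset (byte granularity assumed)


def logical_to_linear(segment, offset):
    """Translate (segment, offset) to a 32-bit linear address."""
    if offset > segment.limit:
        # Real hardware raises a protection fault here.
        raise ValueError("offset exceeds segment limit")
    return (segment.base + offset) & 0xFFFFFFFF


# Hypothetical data segment: 64 KB starting at linear address 0x00400000.
data_segment = Segment(base=0x00400000, limit=0xFFFF)
print(hex(logical_to_linear(data_segment, 0x1234)))
```

The resulting linear address is then the input to the paging stage.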

Paging

Paging (or linear-address translation) is the process

of translating linear addresses so that they can be

used to access memory or I/O devices.

Paging translates each linear address to a physical

address and determines, for each translation, what

accesses to the linear address are allowed (the

address’s access rights)

Linear Address Translation
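With 32-bit paging and 4 KB pages, a linear address is split into a 10-bit page directory index, a 10-bit page table index, and a 12-bit page offset. The slides do not state which paging mode is shown, so this mode is an assumption, and the function name is hypothetical.

```python
def split_linear_address(linear):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    directory = (linear >> 22) & 0x3FF   # bits 31..22 select the page directory entry
    table = (linear >> 12) & 0x3FF       # bits 21..12 select the page table entry
    offset = linear & 0xFFF              # bits 11..0 are the byte offset in the page
    return directory, table, offset


directory, table, offset = split_linear_address(0x00401234)
print(directory, table, hex(offset))
```

The page table entry selected by the two indices supplies the physical page frame, and the offset is appended unchanged.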

Segmentation and Paging

Complete Picture:

IA-32 System-Level Registers and Data Structures

Next Class: Performance

