ELEC 5200/6200
Computer Architecture and Design
Spring 2017 Lecture 7: Memory Organization – Part II
1/8/2017 ELEC 5200-001/6200-001 Lecture 7 1
Ujjwal Guin, Assistant Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.auburn.edu/~uzg0005/
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)
[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]
Designing a Computer
[Figure: block diagram of a computer. Labels: Input, Output, Memory, and the Central Processing Unit (CPU, or “Processor”) containing Control and Datapath.]
Types of Computer Memories
From the cover of:
A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Upper Saddle
River, New Jersey: Pearson Prentice Hall, 2006.
Random Access Memory (RAM)
[Figure: RAM block diagram. Labels: memory cell array, address decoder, read/write circuits, address bits, data bits.]
Six-Transistor SRAM Cell
[Figure: six-transistor SRAM cell. Labels: word line, two bit lines (bit and its complement).]
Dynamic RAM (DRAM) Cell
[Figure: DRAM cell. Labels: word line, bit line.]
“Single-transistor DRAM cell”
Robert Dennard’s 1967 invention
Classical RAM Organization (~Square)
[Figure: RAM cell array with a row decoder (input: row address) and a column selector and I/O circuits (input: column address). Labels: word (row) line, bit (data) lines, data bit or word.]
Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell
One memory row holds a block of data, so the column address selects the requested bit or word from that block
Classical DRAM Organization (~Square Planes)
[Figure: stacked RAM cell array planes with a row decoder (input: row address) and a column selector and I/O circuits (input: column address). Labels: word (row) line, bit (data) lines, data bit.]
Each intersection represents a 1-T DRAM cell
The column address selects the requested bit from the row in each plane
Classical DRAM Operation
DRAM Organization:
– N rows x N columns x M bits
– Read or write M bits at a time
– Each M-bit access requires a RAS / CAS cycle
[Figure: DRAM array of N rows x N cols with M bit planes. Inputs: row address, column address. Output: M-bit output.]
[Timing diagram: signals RAS and CAS. Each M-bit access (1st, 2nd) presents a row address and then a column address. Label: cycle time.]
Page Mode DRAM Operation
Page Mode DRAM
– N x M SRAM to save a row
After a row is read into the SRAM “register”
– Only CAS is needed to access other M-bit words on that row
– RAS remains asserted while CAS is toggled
[Figure: DRAM array of N rows x N cols with M bit planes, plus an N x M SRAM. Inputs: row address, column address. Output: M-bit output.]
[Timing diagram: signals RAS and CAS. One row address is followed by four column addresses for the 1st through 4th M-bit accesses. Label: cycle time.]
Synchronous DRAM (SDRAM) Operation
After a row is read into the SRAM register
– Inputs CAS as the starting “burst” address along with a burst length
– Transfers a burst of data from a series of sequential addresses within that row
– A clock controls transfer of successive words in the burst (300 MHz in 2004)
[Figure: DRAM array of N rows x N cols with M bit planes, an N x M SRAM, and a +1 incrementer on the column address. Inputs: row address, column address. Output: M-bit output.]
[Timing diagram: signals RAS and CAS. One row address and one column address are followed by the 1st through 4th M-bit accesses, then the next row address. Label: cycle time.]
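The burst behavior described above can be sketched in a few lines of Python. This is an illustrative model only: the function name, the parameter names, and the choice to reject a burst that would run past the end of the row are assumptions, and real SDRAM parts define their own burst ordering and wrap rules.

```python
def burst_columns(start_col, burst_length, cols_per_row):
    """Return the column addresses read by one burst from an open row.

    Models the "+1" step: each clock advances the column address by one.
    """
    if start_col < 0 or burst_length < 1:
        raise ValueError("start column must be >= 0 and burst length >= 1")
    if start_col + burst_length > cols_per_row:
        # Assumption for this sketch: a burst may not leave the open row.
        raise ValueError("burst would run past the end of the open row")
    return [start_col + i for i in range(burst_length)]


print(burst_columns(start_col=8, burst_length=4, cols_per_row=1024))  # [8, 9, 10, 11]
```

Only one column address is supplied for the whole burst, which is why a single CAS suffices for the four accesses.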
Other SDRAM Architectures
Double Data Rate SDRAMs – DDR-SDRAMs
– Double data rate because they transfer data on both the rising and
falling edge of the clock
Most widely used form of SDRAMs
– For DDR memory, 2n prefetch architecture means
Internal bus width is twice the external bus width
Hence, the internal column access frequency can be half of the external data rate
– For users, 2n prefetch means that data access occurs in pairs
A single READ fetches two data words
For a single WRITE, two data words must be provided
Other SDRAM Architectures- Cont.
https://www.synopsys.com/Company/Publications/DWTB/Pages/dwtb-ddr4-bank-groups-2013Q2.aspx
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
[Figure: three memory organizations. (a) One-word-wide memory organization: CPU, cache, bus, memory. (b) Wide memory organization: CPU, multiplexor, cache, bus, memory. (c) Interleaved memory organization: CPU, cache, bus, memory banks 0 to 3.]
[Figure: on-chip CPU and cache connected to memory by a bus carrying 32-bit data and a 32-bit address per cycle.]
One Word Wide Memory Organization
One word wide bus and one word wide memory
– The bus contains both address and data lines
Wide Memory Organization
Increase the bandwidth over the one-word-wide memory by widening
– Memory
– Buses between processor and memory
To allow parallel access to all the words of a block
Logic between processor and cache consists of
– A MUX on READs
– Control logic to update the appropriate words on WRITEs
Interleaved Memory Organization
Widen the memory but not the interconnection
– Example: 4-way interleaving, with a one-word-wide bus and four-word-wide memory
Sending an address to several banks permits them all to read simultaneously
– Hence, incurring full DRAM latency only once
[Figure: on-chip CPU and cache connected by a bus to memory banks 0, 1, 2, and 3.]
Interleaved Memory Organization
Bank Conflict
– Two consecutive memory operations use the same bank
– This causes the memory to stall until the busy bank has completed the prior operation
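A small Python sketch can make the bank-conflict condition concrete. The slides do not say how addresses are assigned to banks; the mapping used here (word address modulo the number of banks, often called low-order interleaving) is an assumption, as are the function names.

```python
NUM_BANKS = 4  # 4-way interleaving, as in the example


def bank_of(word_address, num_banks=NUM_BANKS):
    """Assumed mapping: consecutive words land in consecutive banks."""
    return word_address % num_banks


def has_bank_conflict(first_word, second_word, num_banks=NUM_BANKS):
    """Two back-to-back accesses conflict when they map to the same bank."""
    return bank_of(first_word, num_banks) == bank_of(second_word, num_banks)


print([bank_of(w) for w in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(has_bank_conflict(0, 1))         # False: banks 0 and 1 work in parallel
print(has_bank_conflict(0, 4))         # True: both accesses need bank 0
```

Under this mapping, sequential word accesses never conflict, while a stride equal to the number of banks conflicts on every access.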
Example: how a memory system affects overall performance
Assume that, on a cache miss, reading DRAM takes
– 1 clock cycle to send the address
– 25 clock cycles for DRAM cycle time
– 8 clock cycles for DRAM access time
– 1 clock cycle to return a word of data
Memory-Bus to Cache bandwidth
– number of bytes accessed from memory and transferred to cache/CPU per clock cycle
Performance of One Word Wide Memory Organization
If the block size is one word, then on a cache miss the pipeline has to stall for the number of cycles required to return one data word from memory
1 cycle to send address
25 cycles to read DRAM
1 cycle to return data
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
4/27 = 0.148 bytes per clock
Performance of One Word Wide Memory Organization
What if the block size is four words?
1 cycle to send 1st address
4 x 25 = 100 cycles to read DRAM (four sequential 25-cycle accesses)
4 x 1 = 4 cycles to return last data word
105 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/105 = 0.152 bytes per clock
Performance of One Word Wide Memory Organization
What if the block size is four words and a fast page mode DRAM is used?
1 cycle to send 1st address
25 + 3*8 = 49 cycles to read DRAM (25 cycles for the first word, then 8 cycles for each of the next three)
4 x 1 = 4 cycles to return last data word
54 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/54 = 0.296 bytes per clock
As soon as data is available, the address can be changed to access the next word
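The fast page mode arithmetic can be checked with a short Python sketch. The constants are the cycle counts assumed in the example; the 4 bytes per word follows the slides' bandwidth figures, and the function name is illustrative.

```python
SEND_ADDRESS = 1      # cycles to send the address
DRAM_CYCLE_TIME = 25  # cycles for the first word (full row access)
DRAM_ACCESS_TIME = 8  # cycles for each later word in the open row
RETURN_WORD = 1       # cycles to return one word over the bus
BYTES_PER_WORD = 4


def page_mode_miss_penalty(block_words):
    """Miss penalty, in cycles, for a one-word-wide fast page mode DRAM."""
    read = DRAM_CYCLE_TIME + (block_words - 1) * DRAM_ACCESS_TIME
    transfer = block_words * RETURN_WORD
    return SEND_ADDRESS + read + transfer


penalty = page_mode_miss_penalty(4)
bandwidth = 4 * BYTES_PER_WORD / penalty
print(penalty, round(bandwidth, 3))  # 54 0.296
```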
Performance of Wide Memory Organization
What if the cache block size is four words and the main memory width is two words?
1 cycle to send 1st address
2 x 25 = 50 cycles to read DRAM (two 25-cycle accesses; each access has two words in parallel)
2 x 1 = 2 cycles to return last data word
53 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/53 = 0.302 bytes per clock
Performance of Wide Memory Organization
What if the cache block size is four words and the main memory width is four words?
1 cycle to send 1st address
25 cycles to read DRAM (this access has four words in parallel)
1 cycle to return last data word
27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/27 = 0.593 bytes per clock
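The wide-memory figures follow one pattern: the block is fetched in as many accesses as it takes to cover it at the memory width, and each access costs one DRAM cycle time plus one bus transfer. The Python sketch below captures that pattern under the example's cycle counts; the function names are illustrative, and the model assumes the bus is as wide as the memory.

```python
import math

SEND_ADDRESS = 1      # cycles to send the address
DRAM_CYCLE_TIME = 25  # cycles per DRAM access
RETURN_TRANSFER = 1   # cycles per bus transfer
BYTES_PER_WORD = 4


def miss_penalty(block_words, memory_width_words):
    """Miss penalty in cycles when memory and bus are memory_width_words wide."""
    accesses = math.ceil(block_words / memory_width_words)
    return SEND_ADDRESS + accesses * DRAM_CYCLE_TIME + accesses * RETURN_TRANSFER


def bandwidth(block_words, memory_width_words):
    """Bytes transferred per clock cycle for a single miss."""
    return block_words * BYTES_PER_WORD / miss_penalty(block_words, memory_width_words)


for block, width in [(1, 1), (4, 2), (4, 4)]:
    print(block, width, miss_penalty(block, width), round(bandwidth(block, width), 3))
# 1 1 27 0.148
# 4 2 53 0.302
# 4 4 27 0.593
```

Widening from two words to four halves the number of DRAM accesses, which is where nearly all of the penalty comes from.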
Performance of Interleaved Memory Organization
For a block size of four words
1 cycle to send 1st address
25 cycles to read DRAM (the four banks are read in parallel)
4 x 1 = 4 cycles to return last data word
30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
(4 x 4)/30 = 0.533 bytes per clock
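The interleaved case differs from the wide case only in the transfer term: the banks overlap their DRAM accesses, but the one-word-wide bus still returns one word per cycle. A minimal Python sketch, with illustrative names and the simplifying assumption that the block is no larger than the number of banks:

```python
SEND_ADDRESS = 1      # cycles to send the address
DRAM_CYCLE_TIME = 25  # cycles per DRAM access
RETURN_WORD = 1       # cycles to return one word over the one-word-wide bus
BYTES_PER_WORD = 4


def interleaved_miss_penalty(block_words, num_banks):
    """Miss penalty in cycles when each bank supplies one word of the block."""
    if block_words > num_banks:
        # Assumption for this sketch: one word per bank, all read in parallel.
        raise ValueError("sketch covers blocks of at most num_banks words")
    return SEND_ADDRESS + DRAM_CYCLE_TIME + block_words * RETURN_WORD


penalty = interleaved_miss_penalty(block_words=4, num_banks=4)
print(penalty, round(4 * BYTES_PER_WORD / penalty, 3))  # 30 0.533
```

The result is close to the four-word-wide memory (27 cycles) without the cost of a four-word-wide bus.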
DRAM Memory System Summary
It’s important to match the cache characteristics
– Caches access one block at a time (usually more than one word)
with the DRAM characteristics
– Use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
and with the memory-bus characteristics
– Make sure the memory-bus can support the DRAM access rates and patterns
– The goal is to increase the Memory-Bus to Cache bandwidth
Virtual Memory Hardware Support
Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
[Figure: hierarchy of Processor, L1$, L2$, Main Memory, and Secondary Memory, showing increasing distance from the processor in access time and the (relative) size of the memory at each level. Transfer units labeled: 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, 1,024+ bytes (disk sector = page).]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
Memory Hierarchy
[Figure: memory hierarchy. Levels: registers; cache (1 or more levels); main memory (physical and virtual). Labels: words transferred via load/store; blocks transferred automatically upon cache miss; pages transferred automatically upon page fault.]
Cache Miss and Page Fault
[Figure: Processor, Cache, MMU, Main Memory, and Disk. Labels: Disk: all data, organized in pages (~4KB), accessed by physical addresses. Main memory: cached pages and page table. Pages: write-back, same as in cache.]
Page fault: a required page is not found in main memory
Cache miss: a required block is not found in cache
“Page fault” in virtual memory is similar to “miss” in cache.
Virtual vs. Physical Address
Processor assumes a virtual memory addressing scheme:
Disk is a virtual memory (large, slow)
A block of data is called a virtual page
An address is called virtual (or logical) address (VA)
Main memory may have a different addressing scheme:
Physical memory consists of caches
Memory address is called physical address
MMU translates virtual address to physical address
Complete address translation table is large and is kept in main memory
MMU contains TLB (translation-lookaside buffer), which keeps record of recent address translations.
Two Programs Sharing Physical Memory
A program’s address space is divided into pages (all are one fixed size) or segments (variable sizes)
– The starting location of each page (either in main memory or in secondary memory) is contained in the program’s page table
[Figure: virtual address spaces of Program 1 and Program 2, with their pages located in main memory or on disk.]
Memory Hierarchy Example
32-bit address (byte addressing)
4 GB virtual main memory (disk space)
Page size = 4 KB
Number of virtual pages = 4 × 2^30 / (4 × 2^10) = 1M
Bits for virtual page number = log2(1M) = 20
1 GB physical main memory
Page size = 4 KB
Number of physical pages = 1 × 2^30 / (4 × 2^10) = 256K
Bits for physical page number = log2(256K) = 18
Page table contains 1M records specifying where each virtual page is located.
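The page counts and bit widths in this example can be reproduced with a few lines of Python. The helper name is illustrative; the sizes are the ones given in the example.

```python
KB = 2**10
GB = 2**30


def pages_and_bits(memory_bytes, page_bytes):
    """Return (number of pages, bits needed for the page number)."""
    pages = memory_bytes // page_bytes
    # For a power-of-two page count, bit_length() - 1 equals log2(pages).
    return pages, pages.bit_length() - 1


print(pages_and_bits(4 * GB, 4 * KB))  # (1048576, 20): 1M virtual pages
print(pages_and_bits(1 * GB, 4 * KB))  # (262144, 18): 256K physical pages
print((4 * KB).bit_length() - 1)       # 12 bits of page offset
```

The 20-bit virtual page number and the 12-bit offset together account for the full 32-bit address.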
Address Translation
A virtual address is translated to a physical address by a combination of hardware and software
So each memory request first requires an address translation from the virtual space to the physical space
– A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault
Virtual Address (VA): virtual page number (bits 31 ... 12) | page offset (bits 11 ... 0)
Translation
Physical Address (PA): physical page number (bits 29 ... 12) | page offset (bits 11 ... 0)
Page Table
[Figure: the page table register holds the address of the page table in main memory. The table is indexed by the 1M virtual page numbers (0, 1, 2, 3, ..., K, ...). Each entry holds a valid bit, other flags (e.g., dirty bit, LRU ref. bit), and a page location, which is either in physical main memory (Page 0 to Page 3) or in virtual main memory (pages on disk).]
32-bit Virtual Address (4 KB Page)
20-bit virtual page number | 10-bit word number within page | 2-bit byte offset
[Figure: a virtual page contains 4 KB of data, or 1K words of 32 bits (4 bytes) each.]
Virtual to Physical Address Translation
Virtual address: 20-bit virtual page number | 12-bit byte offset within page
Address translation
Physical address: 18-bit physical page number | 12-bit byte offset within page
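The translation step can be sketched in Python as a shift, a table lookup, and a recombination. The dictionary standing in for the page table, its single entry, and the `PageFault` exception are all hypothetical; a real page table entry also carries the valid bit and other flags.

```python
PAGE_OFFSET_BITS = 12                # 4 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


class PageFault(Exception):
    """Raised when the virtual page is not in physical memory."""


def translate(virtual_address, page_table):
    """Translate a 32-bit virtual address using a {VPN: PPN} mapping."""
    vpn = virtual_address >> PAGE_OFFSET_BITS      # upper 20 bits
    offset = virtual_address & (PAGE_SIZE - 1)     # lower 12 bits, unchanged
    ppn = page_table.get(vpn)
    if ppn is None:
        raise PageFault(f"virtual page {vpn:#x} is not in main memory")
    return (ppn << PAGE_OFFSET_BITS) | offset


page_table = {0x00004: 0x00003}      # hypothetical: virtual page 4 -> physical page 3
print(hex(translate(0x00004ABC, page_table)))  # 0x3abc
```

Only the page number changes; the 12-bit offset passes through untouched.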
Virtual Memory System
[Figure: virtual memory system. Labels: Processor; MMU (memory management unit with TLB); virtual or logical address (VA); physical address; PTE; cache (SRAM); main memory with page table (DRAM); disk; DMA (direct memory access); data.]
TLB: Translation-Lookaside Buffer
A processor request requires two accesses to main memory:
– Access page table to get physical address
– Access physical address
TLB acts as a cache of page table
– Holds recent virtual to physical page translations
– Eliminates one main memory access if requested virtual page address is found in TLB (hit)
TLB Organization
[Figure: the page table, located through the page table register (address of page table in main memory), is indexed by the 1M virtual page numbers (0, 1, 2, 3, 4, ..., K, ...). Each entry holds a valid bit, other flags (e.g., dirty bit, LRU ref. bit), and a page location in physical main memory (Page 0 to Page 3) or in virtual main memory (pages on disk).]
TLB Data
V | D | R | Tag | Phy. Pg. Addr.
1 | 1 | 1 | 4 | 3
1 | 0 | 1 | 1 | 2
V: Valid bit
D: Dirty bit
R: Reference bit (LRU)
Tag: Index in page table
Phy. Pg. Addr.: Physical page address
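A TLB that caches page table entries and evicts the least recently used one can be sketched in Python as follows. This is a software model under stated assumptions, not a description of any real TLB: the class, its fully associative LRU policy, and the page table contents are illustrative (the mappings 4 → 3 and 1 → 2 mirror the two TLB rows shown; the mapping 3 → 1 is hypothetical).

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB model with LRU replacement (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # tag (VPN) -> physical page, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        ppn = page_table[vpn]                  # models the extra main memory access
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn


page_table = {1: 2, 3: 1, 4: 3}
tlb = TLB(capacity=2)
for vpn in (4, 1, 4, 3, 4):
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)  # 2 3
```

Each hit avoids the page table access in main memory; each miss pays for it and then installs the translation.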
Typical TLB Characteristics
TLB size: 16 – 512 entries
Block size: 1 – 2 page table entries of 4 – 8 bytes each
Hit time: 0.5 – 1 clock cycle
Miss penalty: 10 – 100 clock cycles
Miss rate: 0.01% – 1%
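These ranges can be combined into a rough average translation cost using the usual cache formula, hit time plus miss rate times miss penalty. Applying that formula to the TLB is an assumption of this sketch, and the function name is illustrative; the input values are taken from the ends of the ranges listed above.

```python
def average_translation_cycles(hit_time, miss_rate, miss_penalty):
    """Average cycles per translation: hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Pessimistic end of the ranges: 1-cycle hit, 1% miss rate, 100-cycle penalty.
worst = average_translation_cycles(hit_time=1, miss_rate=0.01, miss_penalty=100)
# Optimistic end: 0.5-cycle hit, 0.01% miss rate, 10-cycle penalty.
best = average_translation_cycles(hit_time=0.5, miss_rate=0.0001, miss_penalty=10)
print(worst, best)
```

Even at the pessimistic end the average is about 2 cycles, far below the cost of walking the page table on every access.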
Integrating Virtual Memory, TLBs, and Caches
Intel IA-32 Memory Management
Memory management in the IA-32 architecture is divided into two parts: segmentation and paging.
– Segmentation provides a mechanism of isolating individual
code, data, and stack modules
Multiple programs (or tasks) can run on the same processor without
interfering with one another.
– Paging provides a mechanism for implementing a conventional
demand-paged, virtual-memory system
Sections of a program’s execution environment are mapped into
physical memory as needed.
The processor uses two stages of address translation
– Logical address to linear address (Segmentation)
– Linear address to physical address (Paging)
*Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming
Guide, Part 1
Segmentation
Segmentation provides a mechanism for dividing
the processor’s addressable memory space (called
the linear address space) into smaller protected
address spaces called segments.
All the segments in a system are contained in the
processor’s linear address space.
To locate a byte in a particular segment, a logical
address (far pointer) has to be provided.
Logical Address Translation
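The first translation stage can be sketched in Python: the segment named by the logical address supplies a base and a limit, and the linear address is the base plus the offset. The dictionary used as a descriptor table, its single entry, and the exception name are hypothetical; the sketch also assumes a byte-granular limit and ignores access rights and privilege checks.

```python
class GeneralProtectionFault(Exception):
    """Raised when the offset falls outside the segment."""


def logical_to_linear(selector_index, offset, descriptor_table):
    """Translate (segment, offset) to a 32-bit linear address."""
    base, limit = descriptor_table[selector_index]
    if offset > limit:
        raise GeneralProtectionFault(f"offset {offset:#x} exceeds limit {limit:#x}")
    return (base + offset) & 0xFFFFFFFF


# Hypothetical segment 1: base 0x00400000, limit 0xFFFF (a 64 KB segment).
descriptor_table = {1: (0x00400000, 0x0000FFFF)}
print(hex(logical_to_linear(1, 0x1234, descriptor_table)))  # 0x401234
```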
Paging
Paging (or linear-address translation) is the process
of translating linear addresses so that they can be
used to access memory or I/O devices.
Paging translates each linear address to a physical
address and determines, for each translation, what
accesses to the linear address are allowed (the
address’s access rights)
Linear Address Translation
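The second stage can be sketched for one specific configuration: 32-bit paging with 4 KB pages, where a linear address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit offset. That configuration is an assumption here (IA-32 supports other paging modes), and the nested dictionaries standing in for the page directory and page tables are hypothetical.

```python
def split_linear_address(linear):
    """Split a 32-bit linear address into (directory, table, offset) fields."""
    directory = (linear >> 22) & 0x3FF   # bits 31..22
    table = (linear >> 12) & 0x3FF       # bits 21..12
    offset = linear & 0xFFF              # bits 11..0
    return directory, table, offset


def linear_to_physical(linear, page_directory):
    """Walk a two-level {directory: {table: frame base}} mapping."""
    directory, table, offset = split_linear_address(linear)
    frame_base = page_directory[directory][table]   # KeyError models a page fault
    return frame_base | offset


# Hypothetical mapping: directory entry 1, table entry 1 -> frame at 0x0009A000.
page_directory = {1: {1: 0x0009A000}}
print(split_linear_address(0x00401234))                     # (1, 1, 564)
print(hex(linear_to_physical(0x00401234, page_directory)))  # 0x9a234
```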
Segmentation and Paging
Complete Picture:
IA-32 System-Level Registers and Data Structures
Next Class: Performance