Mattausch, CMOS Design, H19/7/6 1
Memory Circuits (Part 2)
• Memory Access Bandwidth
  – Access-Bandwidth Definition
  – Possibilities for Increased Access Bandwidth
  – Necessity for Increased Access Bandwidth
  – Multi-Ported Memories
• Memory with Best-Match Content-Based Access
  – Widely Used Best-Match Distance Measures
  – Conventional Best-Match Content-Addressable-Memory (CAM) Architectures
  – New Architectures: Time-Domain and Mixed Analog-Digital
CMOS Logic Circuit Design: http://www.rcns.hiroshima-u.ac.jp
Link: "CMOS Logic Circuit Design" (CMOS論理回路設計), listed under "Center Faculty Lecture Notes" (センター教官講義ノート)
Access-Bandwidth Definition for Memories
Definition of Access Bandwidth: Maximum number of bits per second which can be written into the memory or which can be read from the memory.
Units of Access Bandwidth: bit/sec
Random Access Bandwidth: Number of independent words (of W bit each) which can be accessed (read or write) per sec. Unit is word/sec.
Factors Determining the Access Bandwidth
$$\text{Access Bandwidth} = (\text{Time-per-Access})^{-1} \cdot (\text{Bit-per-Access or Wordlength}) \cdot (\text{Number-of-Access-Ports}) = T_A^{-1} \cdot W \cdot N_P$$
[Figure: memory block with NP access ports, each port with wordlength W and access time TA.]
Memory-access bandwidth is determined by three factors: access time (TA), wordlength (W) and port number (NP).
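As a quick numerical check of the three-factor formula, the following Python sketch evaluates it for a single-port and a 32-port memory. The function name `access_bandwidth` is hypothetical, and the sketch assumes one access per port per clock cycle, so that TA equals the clock period; the 500 MHz clock and 64-bit wordlength are example inputs only.

```python
def access_bandwidth(access_time_s, wordlength_bits, num_ports):
    """Return the access bandwidth in bit/sec: T_A^-1 * W * N_P."""
    if access_time_s <= 0:
        raise ValueError("access time must be positive")
    return wordlength_bits * num_ports / access_time_s


# Assumption: one access per clock cycle, so T_A = 1 / f_clock.
t_a = 1 / 500e6  # access time for a 500 MHz clock

single_port = access_bandwidth(t_a, 64, 1)
multi_port = access_bandwidth(t_a, 64, 32)

print(f"1 port  : {single_port / 1e9:.1f} Gbit/sec")
print(f"32 ports: {multi_port / 1e9:.1f} Gbit/sec")
```

Because the three factors multiply, doubling the port number has the same effect on bandwidth as doubling the wordlength or halving the access time.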
Memory-Port Number and Achievable Bandwidth
A random access bandwidth beyond 100 Gbit/sec relies strongly on the availability of memories with multiple ports.
[Figure: bandwidth (bit/s, 10^9 to 10^12) versus port number N (10 to 60) for two cases, 500 MHz clock / 64-bit wordlength and 50 MHz clock / 32-bit wordlength, with the practical limit for 1-port memories marked.]
Insights:
a) Exploitation of pipelining, interleaving and larger wordlength cannot improve the bandwidth of 1-port memories above 50-100 Gb/sec.
b) Multiport memories are able to extend the limits of system bandwidth by orders of magnitude.
c) Tb/sec bandwidth can be realized with 32-port memories.
d) High bandwidth is possible even at low clock frequencies.
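Insights c) and d) can be checked by inverting the bandwidth formula to ask how many ports a target bandwidth needs. The sketch below is illustrative only: `ports_needed` is a hypothetical helper, and it assumes that every port completes one W-bit access per clock cycle.

```python
def ports_needed(target_bits_per_sec, clock_hz, wordlength_bits):
    """Smallest port number N with N * clock * wordlength >= target bandwidth."""
    per_port = clock_hz * wordlength_bits  # bit/sec delivered by one port
    return -(-target_bits_per_sec // per_port)  # ceiling division on integers


# 1 Tbit/sec at a 500 MHz clock with 64-bit words
fast_clock_ports = ports_needed(10**12, 500 * 10**6, 64)

# 100 Gbit/sec at a 50 MHz clock with 32-bit words
slow_clock_ports = ports_needed(100 * 10**9, 50 * 10**6, 32)

print("ports for 1 Tbit/sec at 500 MHz / 64 bit:", fast_clock_ports)
print("ports for 100 Gbit/sec at 50 MHz / 32 bit:", slow_clock_ports)
```

Under this assumption a low-frequency design trades clock rate for port count, which is the point of insight d).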
Memory-Access Bandwidth
- Access-Bandwidth Definition
- Possibilities for Increased Access Bandwidth
- Necessity for Increased Access Bandwidth
  • In Computers
  • In Networks
- Multi-Ported Memory Architectures
Typical Single-Processor Architecture
Today an important limit for microprocessor performance is the access bandwidth of the various memory components.
[Figure: CPU consisting of a Storage Unit (Cache, TLB, Virtual Address Translator, Write Through Buffers, etc.), an Execution Unit (Multiple ALUs, Register File, Multiple Issue Logic) and an Instruction Unit (Fetch Unit, Decode Logic, Program Counter). Data exchange takes place between these units and between the CPU and the Main Memory.]
TLB = Translation Lookaside Buffer (CAM for recent Virtual Address Translations)
ALU = Arithmetic Logic Unit (Actually performs the processor operations)
ALU and Register File
Advanced microprocessors need register files (fast SRAMs) with high random-access bandwidth, i.e. large port numbers.
[Figure: three register-file configurations.
 Conventional Microprocessor Architecture: one ALU with a 1-Port Register File.
 RISC Microprocessor Architecture: one ALU with a 3-Port Register File.
 n-Issue Super-Scalar Microprocessor Architecture (Pentium 4 has n=4): ALU 1 to ALU n with a 3n-Port Register File.]
Multiprocessor Architectures
Advanced shared-memory multiprocessors need main memories with high random-access bandwidth.
[Figure: two multiprocessor organizations.
 Shared-Memory Architecture: several processors connected through a switching network to several 1-port memories, which together form a multiport main memory.
 Message-Passing Architecture: each processor is paired with its own 1-port memory.]
Structure of the Worldwide Internet
An internet exchange has to handle many random-data streams simultaneously.
[Figure: worldwide internet structure with several interconnected Network Exchanges.]
Operation Principle of an ATM Internet Exchange
The random-access bandwidth of the storage part limits the performance of an internet exchange.
Structure of ATM (Asynchronous Transfer Mode) Data Packets
Destination Code (40 Bit) | Data to be Transmitted (384 Bit)
[Figure: ATM network-exchange operation. Data-Lines 1 to M (in) deliver packets to the Storage Part, where each packet is written to the next free storage location (ATM-Packet 1 to ATM-Packet N, remaining locations free). The Computing Part calculates the output data line for the next ATM packet from its destination code, and the packet is sent to the data line of its destination, Data-Lines 1 to M (out).]
Memory-Access Bandwidth
- Access-Bandwidth Definition
- Possibilities for Increased Access Bandwidth
- Necessity for Increased Access Bandwidth
- Multi-Ported Memory Architectures
  • Multi-Port Memory Cell
  • Switching-Network Multi-Bank Architecture
  • Hierarchical Multi-Bank Architecture
Port Implementation in the Memory Cell
Transistor and interconnection-line numbers in an N-port SRAM cell increase linearly with N.
[Figure: circuit schematics of a 1-port SRAM cell (word line W1, one bit-line pair B1) and an N-port SRAM cell (word lines W1 to WN, bit-line pairs B1 to BN).]
Design Examples of SRAM and ROM Cells
The size of an N-port memory cell increases faster than linearly with port number N.
[Figure: layouts of 1-, 2-, 4- and 8-port SRAM cells (20 µm scale bar), with minimum and maximum variants for 2, 4 and 8 ports, showing VDD, VSS, word lines W1 to W8 and bit lines B1 to B8.]
[Figure: layouts of 1-, 2-, 4- and 8-port ROM cells (10 µm scale bar), showing VSS, word lines W1 to W8 and bit lines B1 to B8.]
Analysis of Multi-Port-Cell Area Increase
Area-increase of N-port-storage cells is unacceptable for larger storage capacities.
Quadratic area increase as a function of port number!
[Figure: N-port-cell area normalized to 1-port-cell area (0 to 35) versus port number N (2 to 10), with curves for SRAM (min), SRAM (max) and ROM.]
Estimation for 32 ports: ~ factor 100 increase for SRAM, ~ factor 400 increase for ROM.
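The quadratic trend can be illustrated with a simple model in which both edges of the cell grow linearly with the port number, because each extra port adds word lines in one direction and bit lines in the other. The model form, the function names and the parameter `growth_per_port` are assumptions for illustration; only the two 32-port estimates (factor 100 for SRAM, factor 400 for ROM) come from the slide, and the values printed for other port numbers are model outputs, not measured data.

```python
def normalized_cell_area(num_ports, growth_per_port):
    """Area of an N-port cell relative to the 1-port cell.

    Assumed model: each cell edge grows linearly with N, so the area
    (edge * edge) grows quadratically.
    """
    edge = 1 + growth_per_port * (num_ports - 1)
    return edge * edge


def calibrate_growth(num_ports, area_factor):
    """Choose growth_per_port so that the model hits one known area factor."""
    return (area_factor ** 0.5 - 1) / (num_ports - 1)


# Calibrate only on the 32-port estimates quoted above.
k_sram = calibrate_growth(32, 100)
k_rom = calibrate_growth(32, 400)

for n in (1, 2, 4, 8, 16, 32):
    sram = normalized_cell_area(n, k_sram)
    rom = normalized_cell_area(n, k_rom)
    print(f"N={n:2d}  SRAM ~{sram:6.1f}x  ROM ~{rom:6.1f}x")
```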
Switching-Network Multi-Bank Architecture
The switching-network architecture reduces the area problem but complexity and the conflict problem increase.
[Figure: Ports 1 to N connected through a switching network (e.g. bus system, crossbar switch, multistage interconnection network) to 1-port memory banks 1 to M.]
Advantage:
• Sub-linear area increase
Disadvantages:
• Access-conflict probability becomes higher
• Complexity of the switching network increases fast with the bank number M
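The two disadvantages pull in opposite directions, which a small calculation makes visible. The sketch below assumes that each of the N ports addresses one of the M banks uniformly and independently in every cycle (an assumption not stated on the slide) and uses a full crossbar, with N*M crosspoints, as one example of a switching network. Both function names are hypothetical.

```python
def conflict_probability(num_ports, num_banks):
    """Probability that at least two ports address the same bank in one cycle.

    Assumption: every port picks a bank uniformly and independently.
    """
    if num_ports > num_banks:
        return 1.0  # more ports than banks: a conflict is unavoidable
    no_conflict = 1.0
    for i in range(num_ports):
        no_conflict *= (num_banks - i) / num_banks
    return 1.0 - no_conflict


def crossbar_crosspoints(num_ports, num_banks):
    """Number of switch points in a full N-by-M crossbar."""
    return num_ports * num_banks


for banks in (8, 32, 128, 512):
    p = conflict_probability(8, banks)
    x = crossbar_crosspoints(8, banks)
    print(f"M={banks:3d}  conflict probability={p:.3f}  crosspoints={x}")
```

Under these assumptions, adding banks lowers the conflict probability but enlarges the switching network.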
Hierarchical Multi-Bank Architecture
The hierarchical multi-bank architecture has a regular/modular structure and solves the complexity problem.
[Figure: hierarchical multi-bank architecture.
 Hierarchy Level 1: 1-port blocks built from 1-port cells with an X-Decoder (1 Port) and a Y-Decoder (1 Port), each block followed by a 1-to-N-Port Transition.
 Hierarchy Level 2: the 1-port blocks with their port transitions are combined by a Row-Selector (N Ports), a Column-Selector (N Ports) and a Conflict Resolver into an N-port memory.
 Signals shown: Data Level 1, Data Level 2, Addresses Level 1, Addresses Level 2.]
Multi-Port-Cell and Multi-Bank Architecture Area
The hierarchical multi-bank architecture enables enormous area-reductions of the multi-port memory.
[Figure: area-reduction factor for multiport SRAMs (0.01 to 1, relative to the multiport-cell architecture) versus storage capacity K on the first hierarchy level (10^2 to 10^5). Curves: 4-Port Design, 4-Port Estimate, 8-Port Estimate, 16-Port Estimate, 32-Port Estimate; marked points: 8-Port Design, 16-Port Design; annotation: reduction to 1/20.]
Recent Article about Memory Research at RCNS
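From Nikkei Microdevices, published in April 2004.
STARC (Semiconductor Technology Academic Research Center, 半導体理工学研究センター) is an organization for industry-academia cooperation set up by Japanese semiconductor companies (NEC, Toshiba, Hitachi, Sony and 7 others).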
Memory with Best-Match Content-Based Access
- Widely Used Best-Match Distance Measures
- Conventional Best-Match Content-Addressable-Memory (CAM) Architectures
- New Architectures: Time-Domain and Mixed Analog-Digital
Widely Used Best-Match Distance Measures
The Hamming distance is implemented with simple EXOR gates, while the Manhattan Distance needs subtractors.
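$$D_i = \sum_{j=1}^{W} \left| IN_j - REF_{ij} \right|$$
$D_i$ is called the Hamming distance if $IN_j$ and $REF_{ij}$ are 1-bit binaries:
$$D_{H,i} = \sum_{j=1}^{W} IN_j \oplus REF_{ij}$$
$D_i$ is called the Manhattan distance if $IN_j$ and $REF_{ij}$ are n-bit binaries with $n > 1$:
$$D_{M,i} = \sum_{j=1}^{W} \left\{ (IN_j - REF_{ij}) \cdot \mathrm{sign}(IN_j - REF_{ij}) \right\}$$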
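A minimal Python sketch of the two distance measures follows. The function names are hypothetical; patterns are plain lists of integers, with 1-bit elements for the Hamming distance and n-bit elements for the Manhattan distance.

```python
def hamming_distance(input_bits, reference_bits):
    """Sum of bitwise EXORs over the W one-bit elements of both patterns."""
    if len(input_bits) != len(reference_bits):
        raise ValueError("patterns must have the same number of elements W")
    return sum(a ^ b for a, b in zip(input_bits, reference_bits))


def manhattan_distance(input_values, reference_values):
    """Sum of absolute differences over the W n-bit elements of both patterns."""
    if len(input_values) != len(reference_values):
        raise ValueError("patterns must have the same number of elements W")
    return sum(abs(a - b) for a, b in zip(input_values, reference_values))


print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
print(manhattan_distance([3, 7, 0], [5, 2, 0]))
```

For 1-bit elements both functions return the same value, since the absolute difference of two bits equals their EXOR.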
Word-Serial Best-Match CAM
The word-serial CAM determines the best-match pattern (winner) in R clock cycles, which is too slow if R is large.
[Figure: sequential best-match calculation (winner-take-all circuit). The Reference Pattern Memory holds Reference-Patterns 1 to R. A Distance Calculator compares the Input Pattern with the Current Reference Pattern and outputs the Current Distance Dc. A Distance Comparator tests Dc < Dwin; if yes, it enables (En) a Transfer Circuit that updates the Current Winner Pattern and the Current Winner Distance Dwin.]
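The sequential search can be sketched in a few lines of Python. This is a behavioural illustration, not the circuit itself: it assumes the Hamming distance as the distance measure and one reference pattern per clock cycle, and the function name is hypothetical.

```python
def word_serial_best_match(input_pattern, reference_patterns):
    """Scan all reference patterns one by one and keep the current winner.

    Returns (winner index, winner distance, clock cycles used).
    Assumption: one reference pattern is processed per clock cycle.
    """
    winner_index = None
    winner_distance = None  # plays the role of Dwin
    cycles = 0
    for index, reference in enumerate(reference_patterns):
        cycles += 1
        # Distance calculator: Hamming distance Dc to the current pattern.
        current = sum(a != b for a, b in zip(input_pattern, reference))
        # Distance comparator: replace the winner only if Dc < Dwin.
        if winner_distance is None or current < winner_distance:
            winner_index, winner_distance = index, current
    return winner_index, winner_distance, cycles


references = [[0, 0, 0, 0], [1, 0, 1, 1], [1, 1, 1, 1]]
print(word_serial_best_match([1, 0, 1, 0], references))
```

The cycle count always equals the number of stored patterns R, whatever the input is.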
Partially Parallel (p-fold parallel) Best-Match CAM
The partially word-parallel CAM reduces the search time for the best-match pattern (winner) to R/p clock cycles, but soon becomes very complex.
[Figure: the Input Pattern is applied in parallel to p Reference Pattern Memories holding R/p patterns each. Every memory feeds its own Sequential Best-Match Circuit, which outputs a Partial Winner. A Global Sequential Best-Match Circuit selects the Global Winner from the partial winners.]
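The p-fold parallel scheme can be sketched as a partition of the reference patterns into p banks that are searched concurrently. The Python below is a behavioural illustration with hypothetical names; it assumes the Hamming distance and counts only the cycles of the per-bank search, not the additional global winner selection.

```python
import math


def hamming(a, b):
    """Hamming distance between two equally long bit lists."""
    return sum(x != y for x, y in zip(a, b))


def partially_parallel_best_match(input_pattern, reference_patterns, p):
    """Return (global winner index, winner distance, per-bank search cycles).

    Assumption: each bank handles one pattern per clock cycle and all
    banks run at the same time.
    """
    bank_size = math.ceil(len(reference_patterns) / p)
    indexed = list(enumerate(reference_patterns))
    banks = [indexed[i:i + bank_size] for i in range(0, len(indexed), bank_size)]

    # Each bank finds its partial winner as a (distance, index) pair.
    partial_winners = [
        min((hamming(input_pattern, ref), idx) for idx, ref in bank)
        for bank in banks
    ]
    # Global best-match stage: pick the best of the partial winners.
    distance, index = min(partial_winners)
    return index, distance, bank_size


references = [
    [0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1],
    [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0],
]
print(partially_parallel_best_match([1, 0, 1, 0], references, p=4))
```

With R = 8 patterns and p = 4 banks, each bank scans R/p = 2 patterns, at the price of four distance calculators plus the global stage.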
Fully-Parallel Best-Match CAM
Parallel digital distance calculation requires too much hardware, so that other solutions are required.
Input Data: IN = {IN_1, IN_2, …, IN_W}
Reference Data i (i = 1, …, R): REF_i = {REF_i1, REF_i2, …, REF_iW}
Distance Calculation (one unit per reference word): D_i(IN, REF_i)
Best-Match Calculation: MIN(D_1, …, D_R) gives the Winner of Content Matching
Time-Domain Conversion for Best-Match Search
The search signal stops for one clock cycle at each non-matching reference bit. The distance is the latency in clock cycles.
[Figure: Time-Domain Word Comparator. The Search Signal (In) passes through W stages; stage j combines Reference Bit j with the input bit INj (and its complement) in an EXOR and contains an Edge-Triggered D-Flip-Flop driven by the Clock. Distance is the number of clock cycles until Out=1.]
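The time-domain principle can be illustrated with a cycle-by-cycle simulation. The Python below is a behavioural sketch under stated assumptions, not a model of the actual circuit: the search signal ripples through matching stages within a cycle, waits exactly one clock edge at every non-matching stage, and the word whose output rises first is the winner. The function name is hypothetical.

```python
def time_domain_search(input_bits, reference_words):
    """Return (indices of winning words, clock cycles until the first Out=1)."""
    if not reference_words:
        raise ValueError("at least one reference word is required")
    width = len(input_bits)
    if any(len(word) != width for word in reference_words):
        raise ValueError("all reference words must have W bits")

    # Stage reached by the search signal in each word comparator.
    positions = [0] * len(reference_words)
    clock_cycles = 0
    while True:
        # Within a cycle the signal passes all matching stages (EXOR = 0).
        for r, word in enumerate(reference_words):
            while positions[r] < width and input_bits[positions[r]] == word[positions[r]]:
                positions[r] += 1
        winners = [r for r, pos in enumerate(positions) if pos == width]
        if winners:
            return winners, clock_cycles
        # Clock edge: every signal held at a non-matching stage moves on.
        positions = [pos + 1 for pos in positions]
        clock_cycles += 1


references = [[0, 0, 0, 0], [1, 0, 1, 1], [1, 1, 1, 1]]
print(time_domain_search([1, 0, 1, 0], references))
```

In this model the returned cycle count equals the Hamming distance of the winner, and words at equal distance finish in the same cycle.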
Digital-Analog Conversion for Best-Match Search
Word comparators generate analog signals and differences between best-match word and other words are amplified.
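[Figure: mixed analog-digital best-match CAM. A memory field with R rows and W columns of storage cells SC11 to SCRW (Bit Store) and bit comparators BC11 to BCRW (Bit Comp.), with R-Row Decode, W-Column Decode and Read/Write, and the Search Word (W bit) applied to all rows. Word comparators WC1 to WCR (Word Comp.) produce the comparison signals C1 to CR. The Winner Line-Up Amplifier (WLA, O(R)) with feedback F produces LA1 to LAR. The Winner-Take-All Circuit (WTA, O(R)) with enable En produces the match signals M1 to MR.]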
Line-Up-Regulation Principle
The best-match (winner) output is aligned to the maximum amplification region of the amplifier for all search cases.
[Figure: amplifier characteristic, Output Voltage (LAi) versus Input Voltage (Ci). Labels: Winner, Nearest Loser, Other Losers, Increasing Winner-Input Distance, Distance Amplification.]
Best-Match CAM CMOS-Chip Design (RCNS: Research Center for Nanodevices and Systems)
Research result of Yuji Yano (矢野祐二, M2).
(An award from 10 major semiconductor manufacturers and others, covering hardware and software design assets for realizing system LSIs.)
• Minimum Distance Search Time: < 240 ns
• Power Dissipation: < 260 mW at 10 MHz (34.7 mW/mm²)
• Processing Performance: 170 GOPS (Giga Operations Per Second) (20 GOPS/mm²)
• Application: Real-Time Motion-Picture Compression