Early Planning for Memory Array Design

EE-382M VLSI–II

The University of Texas at Austin, EE 382M Class Notes

Steven C. Sullivan, Gian Gerosa

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Memory Hierarchy

Level            Access Time   Capacity
Register File    0.25-1 ns     0.5-1 KB
Level 1 Cache    1-4 ns        8-64 KB
Level 2 Cache    5-20 ns       256 KB-2 MB
Main Memory      35-50 ns      128-256 MB
Hard Drive       5-10 ms       10-50 GB

(Levels are listed from closest to the processor to farthest.)

Memory hierarchy gives the appearance of large capacity and fast access time.

Processor-Memory Performance Gap

• µProc: 60%/yr
• DRAM: 7%/yr
• Performance gap grows 50% per year

The need for memory hierarchy is steadily increasing.

[Figure: processor vs. DRAM performance over time; plot annotations 1.35X/yr and 1.55X/yr]

Memory Hierarchy Evolution

• 386: no on-die cache; L1 cache on motherboard
• Following generation: Level 1 cache on-die; Level 2 on motherboard
• Pentium: separate instruction and data caches
• Pentium II: separate bus to L2 cache in same package
• Pentium III: L2 cache on-die
• Pentium 4 (Foster): L3 cache on-die

[Figure: for each generation, a block diagram of processor, caches, chipset, and DRAM]

Recent development: 3-D packaging allows more integration

Functional Block Diagram

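[Figure: functional block diagram of a memory array. Blocks: row decoder, 2^N x 2^M cell array, column decoder, multiplexors and sense amplifiers, read/write buffer. Signals: N-bit row address, 2^N word lines (“1-hot” select), column address. Other labels: 2^(M-K).]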


Memory Cell Overview

• A memory cell array has the following capabilities:
  – A means of storing bits of information (storage elements)
  – A means of selecting the stored information (wordlines)
  – A means of transferring data to/from storage elements (bitlines)
• 1T/1C memory cell is the simplest implementation
  – Only requires 1 W/L and 1 B/L metalization
• 6T SRAM cell consumes more area and requires true & complement bitlines, but is more stable and develops a sensing voltage faster than a DRAM cell
• Register File cells allow multiple entries to be accessed or written simultaneously
  – However, this requires multiple wordlines and bitlines and becomes metal-limited
  – Used for integer/floating point registers, single & multiple-cycle queues and buffers

Memory Cell Types

• Schematics of the 1-T DRAM cell and the 6T dual-ended SRAM cell

[Figure: 1-transistor DRAM cell with storage capacitor; 6-transistor SRAM cell with bitlines BL and #BL]

1-transistor DRAM
• Industry standard DRAM cell
• Smallest area per bit
• Explicit storage capacitor
• Destructive READ

6-transistor SRAM
• Industry standard SRAM cell
• Used for FAST static arrays
• Cross-coupled inverters
• Non-destructive READ with proper stability analysis

6-transistor SRAM cell

[Figure: 6T SRAM cell layout with bitlines BL and #BL and the passgates labeled; cell dimensions 1.0 μm x 0.68 μm (65nm)]

In 65nm CMOS, a typical 6T bitcell area = 0.68 μm²

Multi-Port Memory Cell Types

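[Figure: multi-ported bitcell schematics (DE = dual-ended, SE = single-ended). Signal labels include RBL, #RBL, RWL, D, #D, #WBL, WBL, RBL0.]

• 1 Read (DE), 1 Write (DE)
• 1 Write (DE), 1 Read (SE)

Register File Multi-Ported Bitcell

[Figure: bitcell layout with tracks labeled VDD, rwl, wl0, GND, wl1, GND]

• 2 Write (DE), 1 Read (SE)
• 1 Write (SE), 1 Read (DE)
• 1 Write (SE), 1 Read (DE), slight modification
• 1 Write (SE), 2 Read (SE)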

Relative Memory Cell Sizes

Dimensions in M1 pitches (assume M1 pitch is the same).

Cell     WL Dir   BL Dir   Area
1T       1        1.5      1.5
4T       3        4        12
6T       4        6        24
4R/2W    9        9        81


Array Design Choices

• Decoders
  – Predecoder & Banked WL Drivers - for large number of rows
  – Hierarchical WL & WL Repeaters - for large number of cols
• Cells
  – Differential - for few ports and large array size
  – Single Ended - for many ports or small array size
• Bitlines
  – Hierarchical - for many rows & available higher metal
  – Serial - for large number of rows & no higher metal
• Column Muxing
  – Differential - group by bit
  – Single Ended - group by entry

Basic Array Characteristics

• Array Size
  – Number of entries
  – Bits per entry
• Number of Ports
  – Number of simultaneous reads
  – Number of simultaneous writes
• Latency
  – Cycles from address to read data
  – Cycles from address to write completed

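Basic Array Layout

[Figure: basic array layout. The address feeds the decoder, which drives one wordline per row of cells; each column of cells has precharge, bitline receivers, and write buffers; read data leaves and write data enters at the column circuits.]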

Large Signal vs Small Signal Arrays

[Figure: a small signal column (wordline, cell on differential bitlines Bit and Bit#, sense amp) beside a large signal column (wordline, cell on a single bitline, inverter receiver)]

Small Signal Arrays
• Differential bitlines
• Dual-ended sense amplifier

Large Signal Arrays
• Single-ended bitline
• Inverter threshold

• Small Signal Arrays:
  – DRAM and SRAM chips
  – Processor D-cache and I-cache
• Large Signal Arrays:
  – Processor register files
  – Multi-ported data structures
• Small Signal Arrays are less common because:
  – Sense amps require special characterization
  – More sensitive to noise
  – Area and timing overhead of differential sense amp
  – May not scale well to low supply voltage


Register File Bitline Segmentation

• Problem: In general, long bitlines cause very slow edge rates
  – May consider converting to an SSA (small signal array) design approach
• However, very short bitlines cause overall area to increase
  – Array efficiency goes down; wastes valuable silicon area
• Solution: Break up bitline depth to determine optimal design point
  – Divide up into smaller sections & recombine with “wire-OR”
• Example #1 shows 16 memory cells on a bitline which drives a dynamic “wire-OR” global bitline
• Example #2 shows a “serial” global bitline structure
  – The lower global bitline is in series with the upper global bitline, with a receiver and NMOS pulldown device in the center (acts like a “repeater”)

Register File Segmentation Example #1

• Global bitline acts as a dynamic “wire-OR”

[Figure: segmented bitline with 16 cells per local BL. Labels: memory cell, local BL, global BL, global BL receiver, dynamic latch.]

Register File Segmentation Example #2

• Serial global bitline

[Figure: serial global bitline. Labels: memory cell, local BL, global BL, global BL receiver, dynamic “wire-OR”, precharge #pc, dynamic latch.]
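The wire-OR recombination of Example #1 can be sketched behaviourally. This is a minimal Python model, not a circuit simulation: the function name `read_segmented`, the 64-entry column, and the polarity (a stored 1 discharges its local bitline) are assumptions for illustration; only the grouping of 16 cells per local bitline comes from the example.

```python
CELLS_PER_LBL = 16  # from Example #1: 16 cells share one local bitline


def read_segmented(column_bits, entry, cells_per_lbl=CELLS_PER_LBL):
    """Behavioural read of one column built from segmented bitlines.

    column_bits: list of 0/1 stored values, one per entry.
    Returns (value_read, number_of_local_bitlines).
    Assumed polarity: a selected cell storing 1 discharges its local bitline.
    """
    num_segments = -(-len(column_bits) // cells_per_lbl)  # ceiling division
    discharged = []
    for seg in range(num_segments):
        lo = seg * cells_per_lbl
        hi = min(lo + cells_per_lbl, len(column_bits))
        # Only the segment that holds the selected entry can discharge.
        selected_here = lo <= entry < hi
        discharged.append(selected_here and column_bits[entry] == 1)
    # Dynamic wire-OR: any discharged local bitline pulls the global bitline.
    return int(any(discharged)), num_segments


column = [0] * 64      # hypothetical 64-entry column
column[37] = 1
value, segments = read_segmented(column, 37)
```

Each read sees only the 16 cells of one local bitline as load, while the global bitline sees one receiver per segment rather than one cell per entry.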


Array Area Estimation

• Cell Area
  – 6T bitcell dimensions strongly dependent on technology
    • Need an actual layout study to determine area
  – Multiported cells are wire limited and can be easily calculated
    • Cell Height is a function of {MH_Pitch*(Wordlines + Shields)}
    • Cell Width is a function of {MV_Pitch*(Bitlines + Datalines + Shields)}
• Local Bitline Receivers and Dataline drivers
  – Height of array is increased by local bitline receivers
    • NumReadPorts*NumEntries/CellPerLBL
  – Height of array is increased by local dataline drivers
    • NumWritePorts*NumEntries/CellPerLBL
• Decoder & Wordline Repeaters
  – Width of array is increased by the decoder
    • Decoder width is a function of number of ports
    • 20% of total array width is a reasonable estimate
  – Width of array is increased by wordline repeaters
    • Typically no more than 32 to 64 bitcells on a single wordline (limits rise/fall time of selected row)

Array Area Estimation: Cell Height & Width Calculation

Recall:

Cell Height = MH_Pitch * (Wordlines + Shields)
            = MH_Pitch * [(#R + #W) + WL_shield*(#R + #W + 1)]

Cell Width = MV_Pitch * (Bitlines + Datalines + Shields)
           = MV_Pitch * [(#R + Rd_shield*#R + 1) + (#W + Wr_shield*#W + 1)]

#R          Number of Read Ports
#W          Number of Write Ports
WL_shield   Read wordline shield factor
Rd_shield   Read bitline shield factor
Wr_shield   Write dataline shield factor
MH_Pitch    Wordline Pitch
MV_Pitch    Bitline Pitch

Consider: 3 read ports & 2 write ports, 16 bits, 64 entries

Cell Height = MH_Pitch * [(#R + #W) + WL_shield*(#R + #W + 1)]
            = 0.2um * [(3 + 2) + (5 + 1) shields] = 2.20um

Cell Width = MV_Pitch * [(#R + Rd_shield*#R + 1) + (#W + Wr_shield*#W + 1)]
           = 0.2um * [(3 + 0.5*3 + 1) + (2 + 0.5*2 + 1)] = 1.90um

• Sub-array dimensions are:

X = 16 * Cell_Width = 16 * 1.90um = 30.4um
Y = 64 * Cell_Height = 64 * 2.20um = 140.8um
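The worked example can be checked with a short Python sketch of the two formulas. The function names are hypothetical, and `wl_shield=1.0` is inferred from the "(5 shields + 1)" term in the example rather than stated outright; the 0.5 bitline and dataline shield factors and the 0.2um pitches are taken from the example.

```python
def cell_height_um(mh_pitch, n_read, n_write, wl_shield):
    """Cell height: one wordline track per port plus shield tracks."""
    wordlines = n_read + n_write
    shields = wl_shield * (n_read + n_write + 1)
    return mh_pitch * (wordlines + shields)


def cell_width_um(mv_pitch, n_read, n_write, rd_shield, wr_shield):
    """Cell width: read bitline tracks plus write dataline tracks."""
    read_tracks = n_read + rd_shield * n_read + 1
    write_tracks = n_write + wr_shield * n_write + 1
    return mv_pitch * (read_tracks + write_tracks)


# 3 read ports, 2 write ports, 16 bits, 64 entries, 0.2um pitches
height = cell_height_um(0.2, 3, 2, wl_shield=1.0)                # about 2.20 um
width = cell_width_um(0.2, 3, 2, rd_shield=0.5, wr_shield=0.5)   # about 1.90 um
sub_x = 16 * width    # about 30.4 um
sub_y = 64 * height   # about 140.8 um
```

Because the cell is wire limited, adding a port changes both dimensions: each extra read port adds a wordline track to the height and a bitline track (plus shielding) to the width.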

SRAM Array Area Estimation

Estimate subarray first:
1. # 6T bitcells * bitcell area + wordline & column decoders + sense-amps + read/write sequentials.
2. The decoders + sense-amps + sequentials are typically 15% of the subarray bitcell area.
3. Use an ‘array efficiency’ factor to calculate the total SRAM array area; this includes clock buffers, address decoders, control logic, repeaters, routing, etc.; typical numbers are in the range of ~60%.

EXAMPLE:

• A 16KB L1 cache with four 4KB subarrays; each subarray comprises 128 bitcells/column and 256 bitcells/wordline; the 6T bitcell area in this 65 nm CMOS technology is 0.68 μm².

Bitcell subarray = 0.68 μm² * 128 * 256 = 22,282 μm²

Subarray = 1.15 * 22,282 = 25,624 μm²

4 subarrays = 4 * 25,624 = ~102,500 μm²

16KB L1 cache = 102,500 / 0.60 = 170,833 μm² or ~0.17 mm²
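The three-step estimate can be written as one small function. This is a sketch under the stated rules of thumb: the function and parameter names are hypothetical, and the 15% periphery overhead and 60% array efficiency are the typical values quoted above, not fixed constants.

```python
def sram_array_area_um2(bitcell_area_um2, rows, cols, num_subarrays,
                        periphery_overhead=0.15, array_efficiency=0.60):
    """Estimate total SRAM array area in square microns."""
    bitcells = bitcell_area_um2 * rows * cols            # step 1: raw bitcell area
    subarray = (1.0 + periphery_overhead) * bitcells     # step 2: decoders, sense-amps, sequentials
    return num_subarrays * subarray / array_efficiency   # step 3: array efficiency


# 16KB L1 cache: four 4KB subarrays of 128 x 256 bitcells, 0.68 um^2 bitcell
area_um2 = sram_array_area_um2(0.68, 128, 256, 4)
area_mm2 = area_um2 / 1e6   # roughly 0.17 mm^2
```

Without intermediate rounding the result is slightly below the 170,833 μm² obtained by hand, since the hand calculation rounds the four-subarray total up to 102,500 μm².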

Floorplan Options

Possible Large-Signal Array Floorplans
• Array Area Calculator provides dimensions for these blocks

[Figure: three candidate floorplans, each built from Sub Array, Pchg, Rd Block, Wrt Driver, and CTL blocks]

Floorplanning Tool: Structured Datapath

Sample floorplan generated from a floorplanning CAD tool

[Figure: sample floorplan with blocks labeled bitslices, rwldrv, wwldrv, decode, and mergelogic]


Access Time Estimation

Break into components:
Taccess = wordline driver + wordline RC delay + column fall time + colmux + setup

Wordline RC delay (example): 128 bitcells in a row

[Figure: wordline driven from clk, modeled as a distributed RC ladder R1, C1, R2, C2, ... R128, C128]

• RT = Σ Ri = 140 mΩ/□ * (348μm / 0.1μm) = 487.2 Ω
• CT = Σ Ci = CM1 + Cgate
     = 348μm * 0.23fF/μm + 128 * (2 * 0.5μm) * 2.0fF/μm
     = 80fF + 256fF = 336fF
• trow = 0.38 * RT * CT = 62ps (50% point of rising wave)
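The wordline example reduces to a few lines of arithmetic. In this Python sketch the function and parameter names are hypothetical, and the reading of "2 * 0.5μm" as two 0.5μm passgates per bitcell is an assumption; the numeric values are those of the example.

```python
def wordline_delay_ps(length_um, width_um, sheet_res_ohm_per_sq,
                      wire_cap_ff_per_um, n_cells,
                      gate_width_per_cell_um, gate_cap_ff_per_um):
    """50% delay of a distributed RC wordline, t = 0.38 * R_total * C_total."""
    squares = length_um / width_um
    r_total_ohm = sheet_res_ohm_per_sq * squares
    c_wire_ff = length_um * wire_cap_ff_per_um
    c_gate_ff = n_cells * gate_width_per_cell_um * gate_cap_ff_per_um
    c_total_ff = c_wire_ff + c_gate_ff
    # ohm * fF = 1e-15 s = 1e-3 ps
    return 0.38 * r_total_ohm * c_total_ff * 1e-3


t_row_ps = wordline_delay_ps(
    length_um=348.0, width_um=0.1, sheet_res_ohm_per_sq=0.140,
    wire_cap_ff_per_um=0.23, n_cells=128,
    gate_width_per_cell_um=2 * 0.5,   # assumed: two passgates per bitcell
    gate_cap_ff_per_um=2.0,
)   # about 62 ps
```

Both R and C grow with wordline length, so the delay grows roughly with the square of the number of bitcells per wordline, which is one reason wordlines are limited to a few tens of cells between repeaters.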

Access Time Estimation: Column Fall Time

• Assume the bitline is discharged linearly; then we can use:
• dV/dt = Iread/CBL = (0.5um * 600uA/um) / 68fF = 4.41 V/ns
• Bitline falls to VDD/2 = 1.0V/2 in 113ps

[Figure: read path with WL = VDD, a 0.5μm passgate, the internal node LOW, and CBL = 68fF (CJ = 1.25fF/μm²); plot of bitline voltage vs. t {ns}, falling at dV/dt = 4.41 V/ns through the VDD/2 (50%) point]
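The column fall time follows directly from the linear-discharge assumption. A minimal Python sketch, with hypothetical function and parameter names and the values of the example:

```python
def bitline_fall_time_ps(vdd_v, swing_fraction, device_width_um,
                         current_ua_per_um, c_bl_ff):
    """Time for a linearly discharging bitline to fall by swing_fraction * VDD."""
    i_read_ua = device_width_um * current_ua_per_um
    # uA / fF = 1e-6 A / 1e-15 F = 1e9 V/s = 1 V/ns
    slew_v_per_ns = i_read_ua / c_bl_ff
    return 1e3 * (vdd_v * swing_fraction) / slew_v_per_ns


slew = (0.5 * 600.0) / 68.0   # about 4.41 V/ns
t_col_ps = bitline_fall_time_ps(vdd_v=1.0, swing_fraction=0.5,
                                device_width_um=0.5,
                                current_ua_per_um=600.0,
                                c_bl_ff=68.0)   # about 113 ps
```

A small signal array would pass a much smaller `swing_fraction` here, since the sense amplifier needs only a fraction of VDD of differential voltage rather than a fall to VDD/2.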

Access Time Estimation

Sum up components of delay; assume inverter delay is 40ps, nand2 delay is about 60ps, and setup into latch is 30ps:

Taccess = Wordline driver + wordline delay + column delay + column mux + setup
        = (60ps + 40ps) + 62ps + 113ps + 60ps + 30ps
        = 365ps

• Should easily meet machine cycle time since the frequency is low; however, the calculated value of 365ps is only the READ-ACCESS time
• Wire routing and data capture budgets have not been factored in yet
• May be able to use a “high Vt” device if it is available from the fab
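Keeping the budget as a table makes it easy to see which component dominates and to re-run the sum when one estimate changes. This Python sketch uses the numbers above; treating the column mux as one nand2 delay is an assumption based on the matching 60ps figure.

```python
INV_PS = 40     # assumed inverter delay
NAND2_PS = 60   # assumed nand2 delay
SETUP_PS = 30   # assumed latch setup time

components_ps = {
    "wordline driver (nand2 + inverter)": NAND2_PS + INV_PS,
    "wordline RC delay": 62,
    "column fall time": 113,
    "column mux (taken as one nand2)": NAND2_PS,
    "latch setup": SETUP_PS,
}

t_access_ps = sum(components_ps.values())
largest = max(components_ps, key=components_ps.get)

for name, delay in components_ps.items():
    print(f"{name:38s} {delay:4d} ps  {100 * delay / t_access_ps:5.1f}%")
print(f"{'Taccess':38s} {t_access_ps:4d} ps")
```

Here the column fall time is the largest single term, which is why bitline segmentation is the first lever to pull when the access time needs to come down.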

Preliminary Power Estimation

• Most power dissipation for an array occurs in bitlines and sense amplifiers
• Calculate total bitline capacitance
  – {Metal2 bitline cap} + ({junction cap per bitcell} x {number of bitcells})
• Calculate sense node capacitive load to include in power dissipation
• For power dissipation, use the approximation:

Pdiss = α * Ctotal * (Vsupply)^2 * frequency

where α is the “Activity Factor”, 0 < α < 1

• Memory cells can contribute significant D.C. power due to leakage from many cells in standby; be sure to take that into account

Pstatic = Ileakage * VDD
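The two power formulas can be sketched in Python as follows. All numeric inputs in the example call are illustrative assumptions, not values from these notes, apart from the 68fF bitline capacitance and 1.0V supply reused from the access time example; the function names are hypothetical.

```python
def dynamic_power_w(activity, c_total_f, v_supply_v, freq_hz):
    """Pdiss = alpha * Ctotal * Vsupply^2 * frequency."""
    if not 0.0 < activity < 1.0:
        raise ValueError("activity factor must satisfy 0 < alpha < 1")
    return activity * c_total_f * v_supply_v ** 2 * freq_hz


def static_power_w(i_leakage_a, vdd_v):
    """Pstatic = Ileakage * VDD."""
    return i_leakage_a * vdd_v


# Illustrative assumptions: 256 bitlines of 68 fF each, alpha = 0.5, 1 GHz
c_total = 256 * 68e-15
p_dyn = dynamic_power_w(activity=0.5, c_total_f=c_total,
                        v_supply_v=1.0, freq_hz=1e9)   # about 8.7 mW

# Illustrative assumption: 32768 cells leaking 10 nA each
p_static = static_power_w(i_leakage_a=32768 * 10e-9, vdd_v=1.0)   # about 0.33 mW
```

The quadratic dependence on supply voltage makes voltage the strongest knob for dynamic power, while static power scales with the number of cells whether or not they are accessed.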


Local Clock Distribution

• At high frequencies, clock uncertainties become a significant portion of the cycle time (10-15% of cycle time or more)
• Important to define the overall clocking scheme and distribution before implementation begins
• Clock inaccuracy is composed of 2 major sources:
  – Clock jitter: due to PLL, DLL, etc.
  – Clock skew: mismatches in clock buffer tree, load, inductance, or variances due to process (Leff is not constant), VDD (it is not constant), and local temperature
• A global clock grid that distributes to local clock buffers (LCBs) requires large overhead but helps minimize clock skew
  – LCBs are evenly distributed within the array block and tap off the global clock grid with minimum route

LCB Placement

[Figure: LCB placement in a two-port array. Blocks: Port0 and Port1 input data latches, decoders, read/write circuits, and output latches around two bitcell arrays, with LCBs placed next to the latches and read/write circuits]

Large number of LCBs minimizes wire load from LCB to sequentials, thus reducing skew variance.

SAMPLE Power/Ground GRID

• Shielding takes up significant routing resources.
• Global M6 routes over the array should have minimal coupling noise to array bitlines.

[Figure: signal track S shielded by VSS and VDD tracks (Full Shielding, MCF = 1.0); dimensions labeled 2λ*]

* Where λ is minimum critical dimension for width/space

Power/Clock Grid

• Clock grid is interleaved between VDD and VSS on metal6

[Figure: two-port array floorplan with the same blocks as the LCB placement figure: input data latches, decoders, read/write circuits, output latches, bitcell arrays, and LCBs]

BACKUP

Memory Array Performance

• Optimization of memory arrays and caches requires careful analysis of:
  – Size and speed of the array, which impacts:
    • Power: static and dynamic
    • Latency: number of clocks to access the memory cell
    • Area and aspect ratios
    • Redundancy
  – Hit rate (caches): requires additional logic and tag arrays
  – Architecture: how many levels of caching?
• In addition, need to account for array BIST. This requires additional logic and impacts performance.

Array Redundant Elements

[Figure: the basic array layout (address, decoder, wordlines, columns, precharge, bitline receivers, write buffers, read data, write data) extended with a redundant address & enable, a redundant wordline & driver, and a redundant column & bitslice]

Account for area overhead if redundancy is used for repair

Trade-offs

Large Signal Arrays
• Simplest sense scheme
  – Single-ended bitlines
• Good noise margin
  – Vdd/2 threshold
• Lower bitcell density (used for small queues & register files; 8 ~ 32 cells on a bitline)
• Static timing analysis works
• Multi-ported
  – Usually single-ended; many READ/WRITE ports

Small Signal Arrays
• Need sense amplifier
  – Dual-ended bitlines
• Noise-sensitive
  – Few hundred millivolts ΔV
• Highest bitcell density (used for large 1st & 2nd level cache arrays; 64, 128, 256 or more cells on a bitline)
• Static timing analysis difficult
• Single ported
  – Usually dual-ended; 1 ~ 3 ports

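Dual-Ended Cell Column Muxing

[Figure: two organizations of a 128-entry x 2-bit array. Without column muxing: Addr[6:0] drives the read and write decoders of a 128-row x 2-column array, producing Data[1:0]. With 4:1 column muxing: Addr[6:2] drives the read and write decoders of a 32-row x 8-column array, and Addr[1:0] selects through two 4:1 muxes to produce Data[0] and Data[1].]

For minimum delay, the cell array should be roughly square.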

Single Ended Cell Column Muxing

Single-ended arrays must group bits of the same entry together, so that a write wordline turns on only the cells of one entry.

[Figure: the 128-row x 2-column array (Addr[6:0], Data[1:0]) beside a 32-row x 8-column array in which Addr[6:2] drives the read decoder, the cells of entries A, B, C, and D sit side by side with write decoders between them, and Addr[1:0] selects Data[0] and Data[1] through 4:1 muxes]

Dual Ended vs Single Ended Column Muxing

Dual-Ended Cells
• Same bit of different entries grouped together.
• Write data driven only on some columns.
• Write wordline “on” for entire row.
• Column order: A0 B0 C0 D0 A1 B1 C1 D1, muxed to Data[0] and Data[1]; one Read WL and one Write WL per row

Single-Ended Cells
• Different bits of same entry grouped together.
• Write data can be driven on every column.
• Write wordline “on” for only 1 entry.
• Column order: A0 A1 B0 B1 C0 C1 D0 D1, muxed to Data[0] and Data[1]; one Read WL and separate Write WLs per row
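The two column orderings can be expressed as address-to-column mappings. This Python sketch assumes the 128-entry x 2-bit array with 4:1 column muxing from the figures (row = Addr[6:2], mux select = Addr[1:0]); the function names are hypothetical.

```python
MUX = 4    # 4:1 column mux, selected by Addr[1:0]
BITS = 2   # bits per entry


def split_address(addr):
    """Split a 7-bit address into (row, mux select) = (Addr[6:2], Addr[1:0])."""
    return addr >> 2, addr & 0b11


def column_dual_ended(sel, bit):
    """Group by bit: the same bit of different entries sits side by side."""
    return bit * MUX + sel


def column_single_ended(sel, bit):
    """Group by entry: the different bits of one entry sit side by side."""
    return sel * BITS + bit


def row_order(column_fn):
    """Label the 8 physical columns of one row, entries A-D, bits 0-1."""
    labels = [None] * (MUX * BITS)
    for sel in range(MUX):
        for bit in range(BITS):
            labels[column_fn(sel, bit)] = f"{'ABCD'[sel]}{bit}"
    return " ".join(labels)


dual = row_order(column_dual_ended)
single = row_order(column_single_ended)
row, sel = split_address(0b1010110)
```

With grouping by entry, the cells of one entry occupy adjacent columns, so a write wordline can be cut into per-entry pieces, which is what the single-ended cell requires.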

Segmentation Guidelines

• Design considerations for segmenting the bitlines are based on variables such as:
  – Number of entries
  – Number of ports
  – Number of bits
• Processor architecture and manufacturing technology also contribute to design decisions
  – For example, a high-leakage process may limit the number of cells on a bitline before losing state
• The following table is a guideline to help determine how to divide up the bitlines for optimum performance
  – The final decision will be based on careful HSPICE simulations of the different options over PVT variations

Table of Guidelines (rows: PORTS; columns: ENTRIES)

PORTS: (range missing)
• ENTRIES <=64: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by a latch; GBL to the input of the latch at the bottom in M4; 1-cycle latency is assumed.
• ENTRIES <=128: Split into 2 sub-arrays with 64 entries each; LBL and GBL should follow the guidelines for similar ports; output of GBL to NAND2 between sub-arrays; single-cycle latency is assumed.
• ENTRIES <=256: LBL and GBL guidelines are the same as < 64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; at least two-cycle latency is required.

PORTS: (range missing)
• ENTRIES <=64: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by a latch; split GBLs are routed in M4 to NAND2; dynamic latch in the middle; latched outputs to destination drivers in M4 (or M3).
• ENTRIES <=128: Split into 2 sub-arrays with up to 64 entries each; LBL and GBL should follow the guidelines for entries with similar ports; output of GBL to dynamic latch followed by latches; two-cycle latency is assumed.
• ENTRIES <=256: LBL and GBL guidelines are the same as < 64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; more than 2-cycle latency is required.

PORTS: 17-21
• ENTRIES <=64: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by dynamic wire-OR; split GBLs are routed in M4 (or M2) to NAND2; latch in the middle; latched outputs to destination drivers in M4 (or M3); a maximum of 48 entries can be supported for this many ports.
• ENTRIES <=128: Split into 2 sub-arrays with up to 48 entries each; LBL and GBL should follow the guidelines for similar ports; output of GBL to dynamic latch followed by latches; at least 2-cycle latency is assumed.
• ENTRIES <=256: LBL and GBL guidelines are the same as < 64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; more than 2-cycle latency is required.
