Wei Zhang † , Li Shang ‡ and Niraj K. Jha †

NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

Wei Zhang†, Li Shang‡ and Niraj K. Jha†

Dept. of Electrical Engineering, Princeton University †

Dept. of Electrical and Computer Engineering, Queen’s University ‡

Outline

Temporal Logic Folding

Background on NRAMs

Overview of hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)

NanoMap: Design Optimization Flow

Experimental Results

Conclusions

Figure: Input Design -> NanoMap -> NATURE

Temporal Logic Folding

Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles

Figure: a circuit with inputs a to h built from three LUTs (LUT1: i = abc’; LUT2: l = (i’ + e’ + f’)h’; LUT3: OUT = d’g’ + l), shown next to a folded version in which a single LUT is reconfigured from memory (MEM) to act as LUT1, LUT2 and LUT3 in successive cycles.
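The folding idea can be sketched in a few lines of Python. This is only an illustrative model, not NATURE itself: the `FoldedLut` class, its method names and the way intermediate values are held in local variables are assumptions made for the example, and the first term of l is read as i’. One LUT object is reloaded with a new truth table before each cycle and produces i, l and OUT in turn.

```python
from itertools import product


class FoldedLut:
    """One physical LUT whose truth table is reloaded before each cycle."""

    def __init__(self):
        self.table = {}

    def configure(self, func, n_inputs):
        # Enumerate every input combination and store the function value.
        self.table = {
            bits: int(bool(func(*bits)))
            for bits in product((0, 1), repeat=n_inputs)
        }

    def evaluate(self, *bits):
        return self.table[bits]


def folded_out(a, b, c, d, e, f, g, h):
    """Compute OUT with a single LUT reconfigured three times."""
    lut = FoldedLut()
    # Cycle 1: behave as LUT1, i = a b c'
    lut.configure(lambda a, b, c: a and b and not c, 3)
    i = lut.evaluate(a, b, c)
    # Cycle 2: behave as LUT2, l = (i' + e' + f') h'
    lut.configure(lambda i, e, f, h: (not i or not e or not f) and not h, 4)
    l = lut.evaluate(i, e, f, h)
    # Cycle 3: behave as LUT3, OUT = d' g' + l
    lut.configure(lambda d, g, l: (not d and not g) or l, 3)
    return lut.evaluate(d, g, l)


def reference_out(a, b, c, d, e, f, g, h):
    """Same function computed directly, as three separate LUTs would."""
    i = a & b & (1 - c)
    l = ((1 - i) | (1 - e) | (1 - f)) & (1 - h)
    return ((1 - d) & (1 - g)) | l
```

The folded version uses one LUT instead of three at the cost of three sequential cycles, which is the area-for-delay exchange that the rest of the talk optimizes.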

NATURE

NRAM-based

CMOS fabrication compatible

Run-time reconfiguration

Temporal logic folding

Design flexibility

Logic density

Overview of NATURE

Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits

Fine-grain reconfiguration (even cycle-by-cycle) and logic folding

Area-delay trade-off flexibility

More than an order of magnitude increase in logic density

More than an order of magnitude reduction in area-time product

Comparisons assume NRAMs and CMOS logic implemented in the same technology

Non-volatility: useful in low power & secure processing

Overview of NATURE (Contd.)

Challenges in nano-circuits/architectures

Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.

Lack of a mature fabrication process

Fabrication defects and run-time failures (between 1% and 10%)

Regular, reconfigurable architectures, such as an FPGA, favored

Facilitates fabrication

Fault tolerance through reconfiguration

NATURE: fabricatable using CMOS-compatible fabrication process

Source: http://www.nantero.com/nram.html

Non-volatile nanotube random-access memory (NRAM)

Mechanically bent or not: determines bistable on/off states

Same/opposite voltage added to change the state

CMOS-compatible fabrication process

10 Gbit NRAMs already fabricated: ready to be commercialized in the near future

NRAM™ by Nantero

NRAMs

Properties of NRAMs

Non-volatile

Similar speed to SRAM

Similar density to DRAM

Chemically and mechanically stable

NATURE not tied to NRAMs

Phase change RAM

Magnetoresistive RAM

Ferroelectric RAM

Figure: island-style array of logic blocks (LBs), each containing an SMB and a switch matrix, linked by direct links, length-1 wires, length-4 wires and long wires through connection blocks and switch blocks.

S1: Switch box between length-1 wires

S2: Switch box between length-4 wires

Switch matrix: Local routing network

Island-style logic blocks (LBs) connected by various levels of interconnects

An LB contains a super macroblock (SMB) and a local switch matrix

Architecture of NATURE

Architecture of a Super Macroblock (SMB)

n1 macroblocks (MBs) comprise an SMB: here n1 = 4

Figure: SMB block diagram. Labels: four MBs; NRAMs and SRAM bits holding the reconfiguration bits; four groups of 20 44x1 MUXes fed from the switch matrix; CLK and global signals; output to interconnect.

Architecture of a Macroblock (MB)

n2 logic elements (LEs) comprise an MB: here n2 = 4

Figure: MB block diagram. Labels: four LEs; four 13-to-5 crossbars; NRAMs and four groups of 65 SRAM bits holding the reconfiguration bits; inputs to MB; 8 outputs of MB.

Logic Element (Basic Configuration)

An LE implements a computation and contains:

An m-input look-up table (LUT)

l flip-flops

Input to flip-flop selected between LUT output and a primary input

Figure: LE schematic. Labels: m-input LUT, two DFFs, SRAM cell, CLK.
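A behavioural sketch of the basic LE described above, written in Python purely for illustration. The class name, method names and the convention that the first input is the most significant index bit are assumptions, not details taken from the slides. It models an m-input LUT plus a small bank of flip-flops, where each flip-flop captures either the LUT output or a primary input.

```python
class LogicElement:
    """Toy model of an LE: one m-input LUT and a few flip-flops."""

    def __init__(self, m, n_ffs):
        self.m = m
        self.truth_table = [0] * (2 ** m)
        self.ffs = [0] * n_ffs

    def configure(self, truth_table):
        # Loading a new truth table stands in for one reconfiguration.
        if len(truth_table) != 2 ** self.m:
            raise ValueError("truth table must have 2**m entries")
        self.truth_table = list(truth_table)

    def lut(self, inputs):
        # Build the table index from the input bits, first bit most significant.
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

    def clock(self, inputs, ff_index, primary_input=None):
        """On a clock edge, store the LUT output (default) or a primary input."""
        out = self.lut(inputs)
        self.ffs[ff_index] = out if primary_input is None else primary_input
        return out


# Example: a 2-input LE configured as an AND gate.
le = LogicElement(m=2, n_ffs=2)
le.configure([0, 0, 0, 1])
```

Keeping more than one flip-flop per LE is what lets results from earlier folding stages survive until a later stage consumes them.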

Folding Levels

Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs

Level-p folding: LE reconfiguration after the execution of p LUT computations

Reconfiguration time: 160 ps

Larger folding level: typically lower delay but larger area

Figure: a LUT network executed with (a) level-1 folding and (b) level-2 folding, with a reconfiguration between consecutive folding stages.
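A rough way to see why a larger folding level tends to reduce delay is sketched below. It assumes a deliberately simple timing model that is not from the slides: each folding cycle costs p LUT delays plus one 160 ps reconfiguration, interconnect and clocking overhead are ignored, and `lut_delay_ps` is a hypothetical parameter.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide


def folding_estimate(logic_depth, p, lut_delay_ps):
    """Return (number of folding stages, estimated total delay in ps)."""
    stages = math.ceil(logic_depth / p)           # reconfigure after p LUT levels
    cycle_ps = p * lut_delay_ps + RECONFIG_PS     # assumed folding-cycle delay
    return stages, stages * cycle_ps


# Sweep folding levels for a depth-8 circuit with a hypothetical 500 ps LUT delay.
for p in (1, 2, 4, 8):
    print(p, folding_estimate(8, p, 500))
```

Under this model the total delay shrinks as p grows because fewer reconfigurations are paid for, while each folding stage must hold more LUTs at once, which is the area side of the trade-off.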

Design Optimization Flow: NanoMap

Optimize and implement design on NATURE

Integrate temporal logic folding

Choose a proper folding level

Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles

Input design specified in register-transfer level (RTL) and/or gate-level VHDL

Motivational Example

Different planes should have the same number of folding stages to guarantee global synchronization

Key issue: how to achieve the optimization objective

Choose an appropriate folding level

Assign the logic to folding stages

Figure: example RTL circuit with registers reg1 to reg3, an adder (+), a multiplier (×), controller LUTs LUT1 to LUT4 with state bits s0 and s1, and 4-bit inputs input 1 and input 2. Annotations: level 1 register, level 2 register, plane, logic in plane, plane cycle, folding stage, folding cycle.

Motivational Example (Contd.)

Example optimization objective

Minimize circuit delay under an area constraint of 32 LEs

Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops

Figure: the example RTL circuit annotated with resource counts: 50 LUTs; 14 flip-flops; 8 LUTs, logic depth: 4; 38 LUTs, logic depth: 7; plane depth: 9.

Iterative Design Flow

Start with initial guess for folding level and iteratively refine it

Large folding level -> better circuit delay, but large area cost

Initial #folding stages: $\lceil 50/32 \rceil = 2$

Initial folding level: $\lceil 9/2 \rceil = 5$

Partition RTL modules into a series of connected LUT clusters

Logic depth of each cluster at most equal to the folding level

Significantly speeds up the mapping procedure
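The initial guess can be reproduced with a short calculation. The sketch below assumes that the initial number of folding stages is the total LUT count divided by the available LUTs, rounded up, and that the initial folding level is the plane depth divided by that stage count, rounded up. This reading is inferred from the example's numbers (50 LUTs, 32 LEs, plane depth 9, level-5 first guess); the function and parameter names are illustrative.

```python
import math


def initial_folding(total_luts, available_luts, plane_depth):
    """Return (initial number of folding stages, initial folding level)."""
    stages = math.ceil(total_luts / available_luts)
    level = math.ceil(plane_depth / stages)
    return stages, level


# Motivational example: 50 LUTs, area constraint of 32 LEs, plane depth 9.
stages, level = initial_folding(50, 32, 9)
```

For the example this gives two folding stages and level-5 folding as the starting point, which the flow then refines.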

Iterative Design Flow (Contd.)

Cluster size should be smaller than the area constraint

Figure: the 4-bit multiplier of the example (inputs a0 to a3 and b0 to b3, product bits P0 to P7), built from full-adder (FA) cells with carry-in, carry-out and sum ports, partitioned into Cluster 1 and Cluster 2 under level-5 folding and under level-4 folding. Annotation: 34 LUTs > 32 LUTs.

Solution for the Example

Three folding stages using level-4 folding

32 LEs required for mapping the RTL circuit; area constraint satisfied

Circuit delay = 3 * folding cycle delay

Figure: LE usage in folding cycles 1 to 3 for the adder (add), the multiplier clusters (mul: c1, mul: c2), controller LUT1-4, registers reg1-3, state bits s0 and s1, and storage 1-4, with labels of 4, 6, 8 and 32 LEs.

Flow chart: Choose folding level -> Module partition -> Constraint satisfied? (No: Decrease folding level) -> FDS to balance resource usage -> Constraint satisfied? (No: Decrease folding level; Yes: Solution)
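The refinement loop in the flow chart can be sketched as follows. This is a simplification under stated assumptions: the two constraint checks (after module partition and after FDS) are merged into a single hypothetical callback `luts_needed(level)`, and the demand figures in the usage example (34 LUTs at level 5, 32 at level 4) are taken from the slides' annotations for the worked example.

```python
def refine_folding_level(initial_level, luts_needed, area_limit):
    """Lower the folding level until the LUT demand fits the area limit.

    luts_needed is a caller-supplied function (an assumption of this sketch)
    returning the peak LUT demand per folding stage at a given level.
    Returns the first level that fits, or None if even level 1 does not.
    """
    level = initial_level
    while level >= 1:
        if luts_needed(level) <= area_limit:
            return level
        level -= 1  # "Decrease folding level" branch of the flow chart
    return None


# Worked example: level 5 needs 34 LUTs (> 32), level 4 fits in 32.
demand = {5: 34, 4: 32}
chosen = refine_folding_level(5, lambda p: demand[p], area_limit=32)
```

Starting from the largest plausible level and stepping down matches the stated goal of minimizing delay first and giving up speed only when the area constraint forces it.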

NanoMap: Flow Diagram

Figure: NanoMap flow chart with numbered steps 1 to 16 and Yes/No decision branches, grouped into four phases.

Inputs: input network, module library, optimization objective, user constraint

Logic mapping: circuit parameter search; folding level computation; RTL module partition; perform logic folding?; schedule each LUT/LUT cluster using FDS

Temporal clustering: map each LUT/LUT cluster to SMBs (step 7); satisfy area constraints?

Temporal placement: fast placement using modified VPR placer (step 9); placement routable?; delay estimation; satisfy delay constraints?; refine placement?; final placement using modified VPR placer

Routing: final routing using VPR router; output reconfiguration bits

Force-Directed Scheduling

Perform FDS on RTL modules partitioned into LUTs/LUT clusters

Iteratively schedule LUT/(LUT cluster) to minimize overall resource usage

Model resource usage as a force: F = Kx

K: distribution graphs (DGs) that describe the probability of resource usage

x: change in that probability caused by a scheduling decision

Aim of FDS: minimize force, indicating minimum increase in resource usage

LE usage depends on LUT computations and register storage operations: two DGs needed
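The force model can be illustrated with a small Python sketch of the self-force calculation from classic force-directed scheduling. It is a simplified illustration, not NanoMap's implementation: it uses a single distribution graph (the slides call for two, one for LUT computations and one for register storage), ignores predecessor and successor forces, and all names are hypothetical. Each node has a time frame (earliest, latest folding cycle) and is assumed equally likely to land in any cycle of that frame.

```python
def distribution_graph(frames, n_steps):
    """Sum, per folding cycle, the probability that each node is scheduled there."""
    dg = [0.0] * n_steps
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for t in range(lo, hi + 1):
            dg[t] += p
    return dg


def self_force(frames, dg, node, step):
    """F = sum over the node's frame of K(t) * x(t), with K = dg and
    x = change in the node's probability if it is fixed at `step`."""
    lo, hi = frames[node]
    p = 1.0 / (hi - lo + 1)
    force = 0.0
    for t in range(lo, hi + 1):
        x = (1.0 if t == step else 0.0) - p
        force += dg[t] * x
    return force


def best_step(frames, dg, node):
    """Pick the folding cycle with the smallest self force."""
    lo, hi = frames[node]
    return min(range(lo, hi + 1), key=lambda t: self_force(frames, dg, node, t))


# Two nodes fixed in cycle 0, one node free to go in cycle 0 or 1.
frames = {"u": (0, 0), "v": (0, 0), "w": (0, 1)}
dg = distribution_graph(frames, 2)
```

In this example the free node is pushed toward the less crowded cycle, which is how FDS evens out LE usage across folding cycles.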

Temporal Clustering

For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs

Unpacked LUT with a maximal number of inputs selected as initial seed

New LUTs with high attractions to the seed selected and assigned to the SMB

Attractions depend on timing criticality and input pin sharing

Considers attractions across all the folding cycles

Figure: LUTs A to F packed into clusters in folding cycle 1 and folding cycle 2.
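A minimal sketch of seed-based constructive packing, under several assumptions that go beyond the slides: it handles one folding stage only (the real algorithm considers attractions across all folding cycles), packs LUTs directly into fixed-capacity clusters without the LE/MB/SMB hierarchy, and uses a hypothetical attraction formula that mixes timing criticality and shared input nets with a weight `alpha`.

```python
def pack(luts, criticality, capacity, alpha=0.5):
    """Greedy packing.

    luts: dict mapping LUT name -> set of input net names.
    criticality: dict mapping LUT name -> timing criticality in [0, 1].
    Returns a list of clusters, each a list of LUT names.
    """
    unpacked = {name: set(nets) for name, nets in luts.items()}
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        cluster = [seed]
        nets = unpacked.pop(seed)
        while unpacked and len(cluster) < capacity:
            # Attraction: weighted sum of criticality and input pin sharing.
            best = max(
                sorted(unpacked),
                key=lambda n: alpha * criticality[n]
                + (1 - alpha) * len(unpacked[n] & nets),
            )
            cluster.append(best)
            nets |= unpacked.pop(best)
        clusters.append(cluster)
    return clusters


luts = {
    "A": {"a", "b", "c", "d"},
    "B": {"a", "b"},
    "C": {"x", "y"},
    "D": {"x", "y", "z"},
}
crit = {name: 0.0 for name in luts}
clusters = pack(luts, crit, capacity=2)
```

With equal criticality, LUTs that share input nets end up together, which reduces the number of distinct signals each cluster must receive from the switch matrix.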

Placement and Routing

VPR (U. Toronto) modified to perform placement and support temporal logic folding

Simulated annealing approach

Cost function computed across the folding stages

Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects

Figure: placement of LUT clusters C and D on SMBs (SMB1, SMB4) in folding cycle 1 and folding cycle 2.
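One way to picture a cost function computed across the folding stages is sketched below. It is an assumption-laden illustration rather than the modified VPR placer: it uses a plain half-perimeter bounding-box wirelength per net, sums it over every folding stage, and assumes a single set of SMB coordinates shared by all stages because the same physical SMBs are reused. All names are hypothetical.

```python
def bbox_cost(net, positions):
    """Half-perimeter of the bounding box enclosing a net's SMBs."""
    xs = [positions[smb][0] for smb in net]
    ys = [positions[smb][1] for smb in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def folded_placement_cost(nets_per_stage, positions):
    """Total wirelength estimate summed over all folding stages."""
    return sum(
        bbox_cost(net, positions)
        for stage_nets in nets_per_stage
        for net in stage_nets
    )


# Four SMBs on a 2x2 grid; the connectivity differs per folding stage.
positions = {"SMB1": (0, 0), "SMB2": (1, 0), "SMB3": (0, 1), "SMB4": (1, 1)}
nets_per_stage = [
    [["SMB1", "SMB4"]],                    # folding cycle 1
    [["SMB1", "SMB2"], ["SMB2", "SMB4"]],  # folding cycle 2
]
total = folded_placement_cost(nets_per_stage, positions)
```

A simulated-annealing placer would evaluate a cost of this kind after each candidate swap, so a move is judged by its effect on every folding stage rather than on one configuration alone.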

Experimental Setup

Instance of architecture:

4 MBs in an SMB

4 LEs in an MB

LEs contain a 4-input LUT and 2 flip-flops

Impact of fixing k (the number of reconfiguration sets stored in each NRAM) at 16 vs. allowing a high enough k to show design trade-offs

Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

Experimental Results (Contd.)

Figure: two bar charts over benchmarks ex1, FIR, ex2, c5315, Biquad, Paulin and ASPP4, each comparing no folding, k enough and k = 16 (normalized to no-folding). Chart 1: Delay (ns) for AT optimization. Chart 2: #LE * Delay adv. for AT opt.

Experimental Results (Contd.)

Improvement under AT optimization for RTL Benchmarks

            Reduction in #LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
k enough    14.8X               16.2X                    11.0X                    31.8%
k = 16      9.2X                9.3X                     7.8X                     19.4%

LE utilization around 100%

50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous

Experimental Results (Contd.)

Flexibility in choosing the best folding level and performing area-delay trade-offs

Mapping results for typical optimizations using Paulin benchmark as an example

Typical optimizations

         Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
Case1    AT          No                   No                  1
Case2    Delay       No                   No                  No
Case3    Area        No                   27                  4
Case4    Delay       210                  No                  3

Figure: Mapping results for typical optimizations: Delay (ns) and Area (#LEs) for cases 1 to 4, on a logarithmic axis from 1 to 10000.

Conclusions

NATURE: A new high-performance run-time reconfigurable architecture

NanoMap: an integrated optimization design flow for NATURE

Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages

Can be very useful for cost-conscious embedded systems and improvement of future FPGAs

Non-volatility: helpful in secure and low power processing
