
Flexible Hardware Design at Low Levels of Abstraction

Emil Axelsson

Hardware Description and Verification

May 2009

Why low-level?

gadget a b = case a of
  2 -> thing (b+10)
  3 -> thing (b+20)
  _ -> fixNumber a

Related question: Why is some software written in C? (But the difference between high- and low-level is much greater in hardware.)

Ideal: Software-like code → magic compiler → chip masks

Why low-level?

Reality: “ASCII schematic” → chain of synthesis tools → chip masks

Reiterate to improve timing/power/area/etc. Very costly / time-consuming

Each fabrication costs ≈ $1,000,000

Failing abstraction

Realistic flow cannot avoid low-level awareness

Paradox: Modern designs require a higher abstraction level... but modern chip technologies make abstraction harder

Main problem: Routing wires dominate signal delay and power consumption

Controlling the wires is key to the performance!

Gate vs. wire delay under scaling

[Figure: relative delay vs. process technology node [nm]]

Physical design level

Certain high-performance components (e.g. arithmetic) need to be designed at even lower level

Physical level:
A set of connected standard cells (implemented gates)
Absolute or relative positions of cells (placement)
Shape of connecting wires (routing)

Physical design level

Design by interfacing to physical CAD tools
Call automatic tools for certain tasks (mainly routing)

Often done through scripting code:
Tedious
Hard to explore design space
Limited design reuse

Aim of this work: Raise the abstraction level of physical design!

Two ways to raise abstraction

Automatic synthesis
+ Powerful abstraction
– May not be optimal for e.g. high-performance arithmetic
– Opaque (hard to control the result)
– Unstable (heuristics-based)

Language-based techniques (higher-order functions, recursion, etc.)
+ Transparent, stable
– Still quite low-level
– Somewhat limited to regular circuits


Our approach: language-based techniques

Lava

Gate-level hardware description in Haskell

Parameterized module generators: Haskell programs that generate circuits

Can be smart, e.g. optimize for speed in a given environment

Basic placement expressed through combinators

Used successfully to generate high-performance FPGA cores
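To give the flavour of such a generator (an illustrative Haskell sketch with hypothetical names, not the actual Lava API): a module generator is an ordinary higher-order function, here a row combinator that tiles a carry-passing cell across a list of inputs.

-- Illustrative sketch only (not the Lava API): tile a cell with a
-- carry chain across a list of inputs, returning the outputs and
-- the final carry.
row :: (c -> a -> (b, c)) -> c -> [a] -> ([b], c)
row _    c []     = ([], c)
row cell c (a:as) = let (b , c' ) = cell c a
                        (bs, c'') = row cell c' as
                    in  (b:bs, c'')

Instantiating cell with a full adder, for example, yields a ripple-carry adder generator for any word length.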

Wired: Extension to Lava

Finer control over geometry

More accurate performance models
Feedback from timing/power analysis enables self-optimizing generators

Wire-awareness (unique to Wired):
Performance analysis based on wire-length estimates
Control routing through “guides” (experimental)

...

Monads in Haskell

Haskell functions are pure

Side-effects can be “simulated” using monads

import Control.Monad.State

-- returns a+b, and records the first argument as a side-effect
add :: Int -> Int -> State [Int] Int
add a b = do
    as <- get
    put (a:as)
    return (a+b)

prog = do
    a <- add 5 6
    b <- add a 7
    add b 8

Do-notation is syntactic sugar; it expands to a pure program with explicit state passing

*Main> runState prog []
(26, [18,11,5])    -- (result, side-effect)

Monads can also be used to model e.g. IO, exceptions, non-determinism, etc.

Monad combinators

Haskell has a general and well-understood combinator library for monadic programs

*Main> runState (mapM (add 2) [11..13]) []
([13,14,15],[2,2,2])

*Main> runState (mapM (add 2 >=> add 4) [11..13]) []
([17,18,19],[4,2,4,2,4,2])
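For reference, the types of the combinators used above, as exported by Control.Monad (mapM is the classic monadic version; newer GHCs generalize it to Traversable):

mapM  :: Monad m => (a -> m b) -> [a] -> m [b]
(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)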

Example: Parallel prefix

Given inputs x1, x2, …, xn, compute

y1 = x1
y2 = x1 ∘ x2
⋮
yn = x1 ∘ x2 ∘ … ∘ xn

for ∘, an associative (but not necessarily commutative) operator
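The specification is directly executable in Haskell; a minimal reference sketch (the list version, not a circuit):

-- All prefixes of the input under op, computed serially.
-- scanl1 returns every intermediate result of a left fold.
prefix :: (a -> a -> a) -> [a] -> [a]
prefix = scanl1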

Parallel prefix

Very central component in microprocessors

Most common use: computing carries in fast adders

Trying different operators:

Addition:   prefix (+) [1,2,3,4]
            = [1, 1+2, 1+2+3, 1+2+3+4] = [1,3,6,10]

Boolean OR: prefix (||) [F,F,F,T,F,T,T,F]
            = [F,F,F,T,T,T,T,T]

Parallel prefix

Implementation choices (relying on associativity):

prefix (∘) [x1,x2,x3,x4] = [y1,y2,y3,y4]

Serial: y4 = ((x1 ∘ x2) ∘ x3) ∘ x4

Parallel: y4 = (x1 ∘ x2) ∘ (x3 ∘ x4)

Sharing: y4 = y3 ∘ x4
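The same choices written out as executable Haskell, instantiated with (+) on [1,2,3,4] (illustrative only; all three evaluate to 10, but the circuit shapes differ):

x1, x2, x3, x4 :: Int
(x1, x2, x3, x4) = (1, 2, 3, 4)

y4Serial, y4Parallel, y4Sharing :: Int
y4Serial   = ((x1 + x2) + x3) + x4               -- depth 3
y4Parallel = (x1 + x2) + (x3 + x4)               -- depth 2
y4Sharing  = y3 + x4 where y3 = (x1 + x2) + x3   -- reuses y3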

There are many of them...

Sklansky

Brent-Kung

Ladner-Fischer

Parallel prefix: Sklansky

sklansky op [a] = return [a]
sklansky op as  = do
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    ls'  <- sklansky op ls
    rs'  <- sklansky op rs
    rs'' <- sequence [op (last ls', r) | r <- rs']
    return (ls' ++ rs'')

Simplest approach (divide-and-conquer)

Purely structural (no geometry)

Could have been (monadic) Lava
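For instance, the structural version can be simulated in the Identity monad with an adding operator (a sketch; in Lava/Wired, op would instead be a circuit-building action):

import Control.Monad.Identity (Identity, runIdentity)

-- The operator takes a pair and returns its result monadically,
-- matching sklansky's use of op (last ls', r).
plus :: (Int, Int) -> Identity Int
plus (a, b) = return (a + b)

test :: [Int]
test = runIdentity (sklansky plus [1,2,3,4])   -- [1,3,6,10]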

Refinement: Add placement

sklansky op [a] = space cellWidth [a]
sklansky op as  = downwards 1 $ do
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    (ls',rs') <- rightwards 0 $ liftM2 (,)
        (sklansky op ls)
        (sklansky op rs)
    rs'' <- rightwards 0 $
              sequence [op (last ls', r) | r <- rs']
    return (ls' ++ rs'')

Sklansky with placement

Simple PostScript output allows interactive development of placement

Refinement: Add routing guides

bus = rightwards 0 . mapM bus1
  where
    bus1 = space 2750 >=> guide 3 500 >=> space 1250

sklanskyIO op = downwards 0
     $  inputList 16 "in"
    >>= bus
    >>= space 1000
    >>= sklansky op
    >>= space 1000
    >>= bus
    >>= output "out"

Reusing standard (monadic) Haskell combinators (nothing Wired-specific)

Sklansky with guides

Refinement: More guides

sklansky op [a] = space cellWidthD [a]
sklansky op as  = downwards 1 $ do
    bus as
    let k       = length as `div` 2
        (ls,rs) = splitAt k as
    (ls',rs') <- rightwards 0 $ liftM2 (,)
        (sklansky op ls)
        (sklansky op rs)
    rs'' <- rightwards 0 $
              sequence [op (last ls', r) | r <- rs']
    bus (ls' ++ rs'')

Sklansky with guides

Experiment: Compaction

Change the base case from

sklansky op [a] = space cellWidthD [a]

to

sklansky op [a] = return [a]

Buses were compacted separately

Export to CAD tool (Cadence SoC Encounter)

Auto-routed in Encounter

Odd rows flipped to share power rails

Simple change in the recursive call:

sklansky (flipY.op) ls

Exchanged using the DEF file format

Fast, low-power prefix networks

Mary Sheeran has developed circuit generators in Lava that search for fast, low-power parallel prefix networks

Initially, crude performance models:
Delay: logical depth
Power: number of operators

Still good results

Now using Wired to improve accuracy:
Static timing/power analysis using models from the cell library

Minimal change to search algorithm

prefix f p = memo pm
  where
    pm ([],w)  = perhaps id' ([],w)
    pm ([i],w) = perhaps id' ([i],w)
    pm (is,w) | 2^(maxd(is,w)) < length is = Fail
    pm (is,w)
      = (bestOn is f . dropFail)
          [ wrpC ds (prefix f p) (prefix p p)
          | ds <- igen ... ]
        where
          wrpC ds p1 p2 =
            wrp ds (perhaps id' c) (p1 c1) (p2 c2)
          ...

Plug in cost functions that analyze the placed network through Wired

85 bits, depth 8

[Figures of the generated prefix networks]

Design exploration

85 inputs, depth 8, varying allowed fanout

At 128 bits, minimum depth is slower than going one deeper (crude delay model fails)

Accurate model consistent with timing report from Encounter

Fanout   Delay [ns]   Power [mW]
7        0.646        15.2
8        0.628        15.7
9        0.624        15.9
10       0.620        16.1

Binary multiplication

44 * 11 = 484

       101100       (44)
    *  001011       (11)
    ---------
       101100   ← “Partial products”
      101100
     000000
    101100
   000000
+ 000000
 ------------
 000111100100       (= 484)

1) Generate the partial products (PPs)

2) Sum the partial products
   a) Sum until two terms left
   b) Add the two remaining terms   (not in this talk)
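Step 1 is easy to state executably; a minimal Haskell sketch with hypothetical names, representing numbers as LSB-first bit lists:

-- Partial products: row i is the multiplicand ANDed with bit i of
-- the multiplier, shifted i positions (prepended False bits).
partialProducts :: [Bool] -> [Bool] -> [[Bool]]
partialProducts xs ys =
  [ replicate i False ++ map (&& y) xs | (i, y) <- zip [0 ..] ys ]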

Column compression multipliers


Use full adders to compress the bits in each column until only two bits remain

Each full adder produces a carry which is forwarded to the next column

Different strategies for the order in which to process the bits yield very different characteristics (e.g. linear vs. logarithmic depth); see the sketch below.
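A naive executable sketch of the compression idea (hypothetical names; this processes bits in plain list order, not HPM's ordering):

-- Columns of bits, least significant column first.
type Columns = [[Bool]]

-- Full adder: three bits in, (sum, carry) out.
fullAdder :: Bool -> Bool -> Bool -> (Bool, Bool)
fullAdder a b c = ((a /= b) /= c, (a && b) || (a && c) || (b && c))

-- Apply one full adder per over-full column and forward each carry
-- into the next column; repeat until every column has <= 2 bits.
compress :: Columns -> Columns
compress cols
  | all ((<= 2) . length) cols = cols
  | otherwise                  = compress (step cols)
  where
    step []                    = []
    step ((a:b:c:more) : rest) =
      let (s, cy) = fullAdder a b c
      in  (s : more) : addCarry cy (step rest)
    step (col : rest)          = col : step rest

    addCarry cy []          = [[cy]]
    addCarry cy (col:rest') = (cy : col) : rest'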

High-performance multiplier (HPM)

Multiplier reduction tree with logarithmic logic depth and regular connectivity. Eriksson, Sheeran, et al., ISCAS '06.

Simple scheme:
Process PP signals first
Process full adder output bits “as late as possible”
Prioritize carry bits

Purely structural version (≈ Lava)

Show code...

Refinement 1

Refinement 2

Refinement 3

Rectangular transform

Using reduction tree in real design

By Kasyab, Ph.D. student in Computer Engineering

Summary

Wire-aware hardware design methods needed

Wired offers flexible hardware design at low levels of abstraction

Sklansky example:
At Intel: 1000 lines of scripting code (Perl)
In Wired: <50 lines (though with fewer details)

Layout-/wire-aware design exploration

Get Wired

Install the Haskell Platform (to get the Cabal tool):
http://hackage.haskell.org/platform/

Install Wired:

> cabal install Wired

Or download manually:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Wired