CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4 Krste Asanovic UC Berkeley Fall 2009



Page 2: CS294-48: Hardware Design Patterns Berkeley Hardware

Overall Problem Statement

Application(s): MP3 bit string -> Audio

Hardware (RTL): MP3 bit string -> Audio

(Berkeley) Hardware Pattern Language

Page 3: CS294-48: Hardware Design Patterns Berkeley Hardware

BHPL Goals

BHPL captures problem-solution pairs for creating hardware designs (machines) to execute applications

BHPL Non-Goals

Doesn’t describe applications themselves, only machines that execute applications and strategies for mapping applications onto machines

Page 4: CS294-48: Hardware Design Patterns Berkeley Hardware

BHPL Overview

Applications (including OPL patterns)

Computational Patterns: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, FSMs, Graph Algorithms, Circuits, N-Body Methods, Dynamic Programming, Graph Traversal, Structured Grids, Unstructured Grids, Graphical Models

Structural Patterns: Pipelines, Model-View-Controller, Event Based, Process Control, Agent&Repository, Map-Reduce, Iteration, Layered Systems, Task Graphs

BHPL Mapping Patterns

Machines

Page 5: CS294-48: Hardware Design Patterns Berkeley Hardware

Machine Vocabulary

Page 6: CS294-48: Hardware Design Patterns Berkeley Hardware

Machine Vocabulary

Machines described using a hierarchical structural decomposition

• Units (processing engines)
• Memories
• Networks (connect multiple entities)
• Channels (point-to-point connections)

(Memories, Networks, and Channels are just specialized Units)

Page 7: CS294-48: Hardware Design Patterns Berkeley Hardware

Hierarchy within Unit

Input Port

Output Port

Input/Output Port

Page 8: CS294-48: Hardware Design Patterns Berkeley Hardware

Hierarchy within Memory

Page 9: CS294-48: Hardware Design Patterns Berkeley Hardware

Hierarchy within Network (2)

Page 10: CS294-48: Hardware Design Patterns Berkeley Hardware

Hierarchy within Network

Page 11: CS294-48: Hardware Design Patterns Berkeley Hardware

Hierarchy within Channel

Page 12: CS294-48: Hardware Design Patterns Berkeley Hardware

Units are FSMs

All units are digital hardware, i.e., describable as a finite-state machine (FSM)

Different ways of factoring out the FSM description of a unit:

• Structural decomposition into hierarchical sub-units
• Decompose functionality into control + datapath
• Can further decompose control into inter-transaction scheduling plus intra-transaction sequencing

All factorings are equivalent, so pick factoring that best explains what unit does
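As one illustration of the control + datapath factoring, here is a minimal cycle-level sketch in Python. The example function (subtractive GCD on positive integers), the state names, and the control-line names `load`, `swap`, and `sub` are all hypothetical choices made for this sketch, not part of the slides.

```python
# Cycle-level sketch of one unit factored into control + datapath.
# Hypothetical example: a subtractive GCD unit (operands assumed positive).

def control(state, a_lt_b, b_is_zero):
    """Controller FSM: returns (next_state, control lines for the datapath)."""
    idle_lines = {"load": False, "swap": False, "sub": False}
    if state == "IDLE":
        return "RUN", {"load": True, "swap": False, "sub": False}
    if state == "RUN":
        if b_is_zero:
            return "DONE", idle_lines
        if a_lt_b:
            return "RUN", {"load": False, "swap": True, "sub": False}
        return "RUN", {"load": False, "swap": False, "sub": True}
    return "DONE", idle_lines

def datapath(regs, ctl, inputs):
    """Datapath: next register values selected by the control lines."""
    a, b = regs
    if ctl["load"]:
        return inputs
    if ctl["swap"]:
        return (b, a)
    if ctl["sub"]:
        return (a - b, b)
    return (a, b)

def run_gcd(x, y, max_cycles=10_000):
    """Step controller and datapath together, one iteration per clock cycle."""
    state, regs, cycles = "IDLE", (0, 0), 0
    while state != "DONE" and cycles < max_cycles:
        a, b = regs
        # Status signals (a < b, b == 0) flow from datapath to controller.
        state, ctl = control(state, a < b, b == 0)
        regs = datapath(regs, ctl, (x, y))
        cycles += 1
    return regs[0], cycles
```

The same behavior could be written as one flat FSM over (state, a, b); splitting it this way only changes how the description reads, which is the point the slide makes about equivalent factorings.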

Page 13: CS294-48: Hardware Design Patterns Berkeley Hardware

Structural Decomposition

Page 14: CS294-48: Hardware Design Patterns Berkeley Hardware

Control + Datapath

Page 15: CS294-48: Hardware Design Patterns Berkeley Hardware

Controller Types

• State Machine Controller: control lines generated by state machine
• Microcoded Controller: single-cycle datapath, control lines in ROM/RAM
• In-Order Pipeline Controller: pipelined control, dynamic interaction between stages
• Out-of-Order Pipeline Controller: operations within a control stream might be reordered internally
• Threaded Pipeline Controller: multiple control streams, one execution pipeline; can be either in-order (PPU) or out-of-order
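To make the microcoded controller entry concrete, the following Python sketch keeps the control lines in a small ROM indexed by a micro-program counter. The ROM layout, the field names (`load`, `add`, `dec`, `bz`, `next`), and the example program (summing n down to 1) are assumptions for illustration only.

```python
# Microcode ROM: each word holds datapath control lines plus sequencing fields.
# All field names and the program itself are hypothetical.
ROM = [
    {"load": 1, "add": 0, "dec": 0, "bz": None, "next": 1},  # 0: acc = 0, cnt = n
    {"load": 0, "add": 0, "dec": 0, "bz": 3,    "next": 2},  # 1: if cnt == 0 goto 3
    {"load": 0, "add": 1, "dec": 1, "bz": None, "next": 1},  # 2: acc += cnt; cnt -= 1
    {"load": 0, "add": 0, "dec": 0, "bz": None, "next": 3},  # 3: halt
]
HALT = 3

def run_microcode(n, max_cycles=10_000):
    """Execute one ROM word per cycle on a single-cycle datapath (acc, cnt)."""
    upc, acc, cnt, cycles = 0, 0, 0, 0
    while upc != HALT and cycles < max_cycles:
        word = ROM[upc]
        zero = (cnt == 0)          # status sampled before registers update
        if word["load"]:
            acc, cnt = 0, n
        if word["add"]:
            acc += cnt             # uses cnt from before the decrement
        if word["dec"]:
            cnt -= 1
        # Sequencing: conditional branch on the zero flag, else fall to "next".
        upc = word["bz"] if (word["bz"] is not None and zero) else word["next"]
        cycles += 1
    return acc, cycles
```

Changing the unit's behavior here means editing the ROM contents rather than the sequencing logic, which is the usual motivation for storing control lines in ROM/RAM.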

Page 16: CS294-48: Hardware Design Patterns Berkeley Hardware

Leaf-Level Hardware

Register, Memory, Combinational Logic, Wires, Tristate driver, Multiplexer/ALU, FIFO

• Conventional schematic notation
• (Need additional notation for asynchronous logic?)

Page 17: CS294-48: Hardware Design Patterns Berkeley Hardware

Hardware Patterns

Page 18: CS294-48: Hardware Design Patterns Berkeley Hardware

Decoupled Units

Problem: Difficult to design a large unit with a single controller, especially when components have variable processing rates. Large controllers have long combinational paths.

Solution: Break large unit into smaller sub-units where each sub-unit has a separate controller and all channels between sub-units have some form of decoupling (i.e., no combinational path between units on each side of channel).

Applicability: Larger units where area and performance overhead of decoupling is small compared to benefits of simpler design and shorter controller critical paths.

Consequences: Decoupled channels generally have greater communication latency and area/power cost. Sub-unit controllers must cope with unknown arrival time of inputs and unknown time of availability of space on outputs. Sub-units must be synchronized explicitly.
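The pattern above can be sketched as two sub-units joined by a registered FIFO channel. In this Python model, both sides decide what to do from the FIFO state at the start of the cycle, so nothing one side does is visible to the other in the same cycle. The function name `simulate` and the `producer_ready` / `consumer_ready` callbacks (which stand in for variable processing rates) are hypothetical.

```python
from collections import deque

def simulate(n_items, depth, producer_ready, consumer_ready, max_cycles=10_000):
    """Move n_items through a FIFO of the given depth; return (items, cycles)."""
    fifo, sent, received, cycle = deque(), 0, [], 0
    while len(received) < n_items and cycle < max_cycles:
        # Both controllers see only the FIFO state from the start of the cycle.
        can_push = len(fifo) < depth      # space available on the output side
        can_pop = len(fifo) > 0           # data available on the input side
        do_pop = can_pop and consumer_ready(cycle)
        do_push = can_push and sent < n_items and producer_ready(cycle)
        if do_pop:
            received.append(fifo.popleft())
        if do_push:
            fifo.append(sent)
            sent += 1
        cycle += 1
    return received, cycle

# Consumer that can only accept data on even cycles.
items, cycles = simulate(4, 2, lambda c: True, lambda c: c % 2 == 0)
```

With both sides always ready the four items take 5 cycles (one extra cycle of channel latency); with the slower consumer the producer stalls whenever the FIFO is full, and the run takes 9 cycles. Neither controller needs to know the other's rate, only whether data or space is available.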

Page 19: CS294-48: Hardware Design Patterns Berkeley Hardware

Decoupled Units

Shared Memory

Network

Unless shared memory is truly multiported, channels to memory must be decoupled

Channels to network are always decoupled in any case

Page 20: CS294-48: Hardware Design Patterns Berkeley Hardware

Pipelined Operator

Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is required.

Solution: Divide combinational function using pipeline registers such that logic in each stage has critical path below desired cycle time. Improve throughput by initiating new operation every clock cycle overlapped with propagation of earlier operations down pipeline.

Applicability: Operators that require high throughput but where latency is not critical.

Consequences: Latency of function increases due to propagation through pipeline registers, which also add energy per operation. Any associated controller might have to track execution of operation across multiple cycles.
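A rough cycle-level model of this pattern, written in Python: the function f(g(in)) is split into two stages by a pipeline register, and a new input enters every cycle. The stage functions `g` and `f` are placeholders chosen for the sketch, and `None` is used to represent an empty pipeline slot.

```python
def g(x):
    return x + 1      # hypothetical first-stage logic

def f(x):
    return x * 2      # hypothetical second-stage logic

def pipelined(inputs):
    """Two-stage pipeline computing f(g(in)); returns (cycle, value) pairs."""
    stage_reg = None                       # pipeline register between g and f
    out_reg = None                         # output register
    outputs = []
    stream = list(inputs) + [None, None]   # two bubbles to drain the pipeline
    for cycle, x in enumerate(stream):
        # Compute next values from the current registers, then clock both.
        next_out = f(stage_reg) if stage_reg is not None else None
        next_stage = g(x) if x is not None else None
        out_reg, stage_reg = next_out, next_stage
        if out_reg is not None:
            outputs.append((cycle, out_reg))
    return outputs

# pipelined([1, 2, 3]) yields [(1, 4), (2, 6), (3, 8)]
```

The first result appears one cycle later than it would from a single-stage implementation, but after that a result is produced every cycle, which matches the latency-for-throughput trade described in the consequences.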

Page 21: CS294-48: Hardware Design Patterns Berkeley Hardware

Pipelined Operator

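[Figure: f(g(in)) as a single stage between clocked registers, and the same function split into stages g and f by an added pipeline register; all registers use the same Clock]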

Page 22: CS294-48: Hardware Design Patterns Berkeley Hardware

Multicycle Operator

Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is not required.

Solution: Hold input registers stable for multiple clock cycles of main system, and capture output after combinational function has settled.

Applicability: Operators where high throughput is not required, or where latency is critical (in which case, replicate the operator to increase throughput).

Consequences: Associated controller has to track execution of operation across multiple cycles. CAD tools might detect false critical path in block.
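For comparison with a pipelined version, here is a minimal Python sketch of the multicycle pattern. The input register is held stable while a counter (standing in for the controller) waits `n_cycles` before the output is captured. The functions `g` and `f` and the parameter `n_cycles` are assumptions made for this sketch.

```python
def g(x):
    return x + 1      # hypothetical first part of the combinational function

def f(x):
    return x * 2      # hypothetical second part of the combinational function

def multicycle(inputs, n_cycles=2):
    """Run f(g(in)) as one unpipelined block that needs n_cycles to settle.

    Returns (cycle, value) pairs, where cycle counts elapsed system clocks.
    """
    outputs, cycle = [], 0
    for x in inputs:
        in_reg = x                    # load the input register, then hold it
        for _ in range(n_cycles):     # controller counts while the logic settles
            cycle += 1
        outputs.append((cycle, f(g(in_reg))))   # capture only after settling
    return outputs

# multicycle([1, 2, 3]) yields [(2, 4), (4, 6), (6, 8)]
```

Only one operation is in flight at a time, so a result appears every `n_cycles` clocks; replicating the block, as the applicability note suggests, is one way to recover throughput.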

Page 23: CS294-48: Hardware Design Patterns Berkeley Hardware

Multicycle Operator

[Figure: f(g(in)) between registers clocked at Clock, compared with f(g(in)) between registers clocked at Clock/2]

Page 24: CS294-48: Hardware Design Patterns Berkeley Hardware

Memory Patterns

• True Multiport Memory
• Banked Memory: interleave lesser-ported banks to provide higher bandwidth
• Cached Memory: memory hierarchy to provide higher bandwidth and lower latency for predictable accesses
• Bypassed Memory: reduce latency of pipelined dependent memory accesses
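As a small illustration of the banked memory entry, the Python sketch below maps addresses to banks and counts how many cycles a batch of accesses needs. It assumes low-order interleaving (bank = address mod number of banks) and single-ported banks that each serve one access per cycle; both are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter

def bank_of(addr, n_banks):
    """Low-order interleaving: consecutive addresses land in consecutive banks."""
    return addr % n_banks

def cycles_to_service(addrs, n_banks):
    """Cycles needed if every bank can serve one access per cycle."""
    if not addrs:
        return 0
    per_bank = Counter(bank_of(a, n_banks) for a in addrs)
    # The most heavily loaded bank determines the total time.
    return max(per_bank.values())

unit_stride = cycles_to_service([0, 1, 2, 3], 4)    # all banks hit once
bank_stride = cycles_to_service([0, 4, 8, 12], 4)   # every access hits bank 0
```

Unit-stride accesses spread across all four banks and complete in one cycle, while a stride equal to the number of banks serializes on a single bank and takes four. That is why banking raises bandwidth only when the access pattern avoids bank conflicts.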

Page 25: CS294-48: Hardware Design Patterns Berkeley Hardware

Network Patterns

Connects multiple units using shared resources

• Bus: low-cost, ordered
• Crossbar: high-performance
• Multi-stage network: trade cost/performance
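The cost/performance contrast between a bus and a crossbar can be sketched with a simple cycle count in Python. The model assumes the bus carries one transfer per cycle, and that the crossbar lets each source send and each destination receive at most one message per cycle, with greedy fixed-priority arbitration. These rules and the function names are assumptions of the sketch, not details from the slides.

```python
def bus_cycles(messages):
    """Shared bus: one (src, dst) transfer per cycle, in order."""
    return len(messages)

def crossbar_cycles(messages):
    """Crossbar with greedy arbitration: per cycle, one message per source
    and one message per destination."""
    pending, cycles = list(messages), 0
    while pending:
        busy_src, busy_dst, remaining = set(), set(), []
        for src, dst in pending:
            if src in busy_src or dst in busy_dst:
                remaining.append((src, dst))   # conflict: retry next cycle
            else:
                busy_src.add(src)
                busy_dst.add(dst)
        pending = remaining
        cycles += 1
    return cycles

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]   # no shared sources or destinations
hotspot = [(0, 1), (2, 1), (3, 1)]        # every message targets unit 1
```

For the `ring` traffic the crossbar finishes in one cycle where the bus needs four; for the `hotspot` traffic both need three cycles, since the destination port is the bottleneck either way.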

Page 26: CS294-48: Hardware Design Patterns Berkeley Hardware

Control+Datapath

Problem:

Solution:

Applicability:

Consequences:

Page 27: CS294-48: Hardware Design Patterns Berkeley Hardware

Machine Types

If SCSD, SCMD, MCMD machines are patterns, what is the problem-solution?

If they’re solutions, what’re the problems?

Page 28: CS294-48: Hardware Design Patterns Berkeley Hardware

SCMD Distributed Memory

Examples: MPP, ICL DAP, CM-1, CM-2, MasPar, Sony Playstation-2 Graphics Engine, Vision processing chips

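[Figure: one C block, one N block, and five D blocks, each paired with its own M block (presumably C = control, N = network, D = datapath, M = memory)]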

Page 29: CS294-48: Hardware Design Patterns Berkeley Hardware

SCMD Shared Memory

Examples: STARAN, BSP, TI ASC, CDC Star-100, Multi-Lane Vector Machines

[Figure: one C block, one shared M block, and five D blocks]

Page 30: CS294-48: Hardware Design Patterns Berkeley Hardware

MCMD Shared Memory

Examples: Burroughs B5x00 series, Network Packet Routers

[Figure: one shared M block and five D blocks, each paired with its own C block]

Page 31: CS294-48: Hardware Design Patterns Berkeley Hardware

Homogeneous MCMD Distributed Memory

Examples: Caltech Cosmic Cube, Transputer, nCube, Clusters

Message Network

[Figure: five nodes, each consisting of a D, an M, and a C block, attached to the Message Network]

Page 32: CS294-48: Hardware Design Patterns Berkeley Hardware

Heterogeneous MCMD Distributed Memory

Examples: Signal Processing Pipelines,

[Figure: five P blocks, each paired with its own M block]

P = C + D

Page 33: CS294-48: Hardware Design Patterns Berkeley Hardware

Systolic

Examples: Warp, Raw, Motion Estimation Engines,

[Figure: seven P blocks, each paired with its own M block]

P = C + D

Page 34: CS294-48: Hardware Design Patterns Berkeley Hardware

Channels

Control->Datapath

• direct
• pipelined? (maybe don’t need pipelined controller?)

Datapath<->Memory

• fixed latency (cannot have shared memory without true multiport)
• decoupled in-order
• out-of-order

Control<->Network<->Control

• fixed latency
• FIFOs
• addressable messaging

Page 35: CS294-48: Hardware Design Patterns Berkeley Hardware

BHPL Version 0.3

Application Patterns (from OPL)

Computational Patterns: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, FSMs, Graph Traversal, Graph Algorithms, Circuits, N-Body Methods, Dynamic Programming, Structured Grids, Unstructured Grids, Graphical Models

Structural Patterns: Pipelines, Model-View-Controller, Event Based, Map-Reduce, Process Control, Iteration, Agent&Repository, Layered Systems, Task Graphs

Machine Organizations: SCMD Distributed Memory, SCMD Shared Memory, MCMD Shared Memory, Homogeneous MCMD Distributed Memory, Heterogeneous MCMD Distributed Memory, Systolic

PMNC Layer

• Processing: FSM, Microcoded Engine, In-Order Pipeline, Out-of-Order Pipeline, Threaded Pipeline
• Memory: Banked Memory, Cached Memory, Bypassed Memory
• Networks: Bus, Crossbar, Multi-Stage Networks
• Channels

Hardware Building Blocks: FIFO, Multiport Memory, CAM, Arbiter, Communication Channel