SCORE - Stream Computations Organized for Reconfigurable Execution

Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy, Andre DeHon, John Wawrzynek

U.C. Berkeley BRASS group

Outline

Lecture 1
– Introduction
– Related Work
– SCORE Computational Model
– Hardware Requirements
– Language Instantiation

Lecture 2
– Execution Example
– SCORE Run-Time Environment
– Example: JPEG
– Results and Conclusion

Introduction

Problem: Lack of a unifying computational model that allows application portability and longevity without sacrificing a substantial fraction of raw capabilities

Solution: Stream-based compute model. Divide computation into fixed-size “pages.” Time-multiplex “pages” onto hardware.

Introduction

SCORE
– Ease the development and deployment of reconfigurable computing (RC) applications, and broaden their range
– Efficient implementation that makes the most of available resources

Introduction

Current Issues
– Existing targets are not portable
    Software for RC hardware is tied to a particular device
– Existing targets expose fixed resource limitations
    Impaired expressiveness
    Algorithms used are restricted by available hardware
    No dynamic resource allocation

Addressing the Issues
– Virtualize resources
    Computation, communication, and memory resources
– Convenient and efficient model

Introduction

SCORE: The programming model is a natural abstraction of communication between spatial hardware blocks.

A data flow graph captures the blocks of computation (operators) and the communication (streams) between them.

The captured graph is then mapped to hardware efficiently.

Related Work

Villasenor et al., circa 1995
– Motion-wavelet video coder
– Hand-partitioned the design into “pages” and manually reconfigured each device
    Ran on 1/3 as many machines
    Experienced only 10% overhead

SCORE builds on:
– Instruction set architecture, data flow, distributed, and streaming computation models
– PRISC, DISC, GARP

SCORE Computational Model

Compute Model
– Abstract model capturing the essential semantics of computation

Programming Model
– Programming constructs providing a convenient way to express computations in the compute model

Execution Model
– Low-level description of the computation and the semantics which the hardware is expected to provide when interpreting this description

Compute Model

Graph of computation operators and memory blocks linked together by streams

Streams
– Provide node-to-node communication
– Single-source, single-sink FIFO queues

Operators
– Finite State Machine (FSM) node
    Interacts via stream links
– Turing-complete (TM) node
    Supports resource allocation and stream operations
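
To make the compute model concrete, here is a minimal Python sketch. It is a toy model, not SCORE's API: the names `Stream`, `Accumulator`, and `run_accumulator` are assumptions introduced for illustration. A stream is modeled as a single-source, single-sink FIFO, and an operator as a small state machine that interacts only through its streams.

```python
from collections import deque


class Stream:
    """Single-source, single-sink FIFO queue (hypothetical model class)."""

    def __init__(self):
        self._tokens = deque()

    def write(self, token):
        self._tokens.append(token)

    def has_data(self):
        return bool(self._tokens)

    def read(self):
        return self._tokens.popleft()


class Accumulator:
    """FSM-style operator whose only state is a running sum."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out
        self.total = 0

    def step(self):
        """Fire once if an input token is present; report whether it fired."""
        if not self.inp.has_data():
            return False
        self.total += self.inp.read()
        self.out.write(self.total)
        return True


def run_accumulator(values):
    """Feed values through one operator and collect its output stream."""
    src, dst = Stream(), Stream()
    op = Accumulator(src, dst)
    for v in values:
        src.write(v)
    while op.step():
        pass
    result = []
    while dst.has_data():
        result.append(dst.read())
    return result
```

The operator never touches anything except its own state and its two streams, which is the property the following slides rely on.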

Compute Model

Operations are fully deterministic
– Determinism of individual operators
– Timing-independent communication
– Operators cannot side-effect each other’s state
    1. Communicate through streams, which guarantee a timing-independent order of execution
    2. Memory segments have a single unique owner (no multiple read-write hazards)
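
The timing-independence claim can be illustrated with a small Python sketch. This is an assumed toy setup, not taken from the SCORE paper: `run_adder` and its helper functions are hypothetical. Two producers and one adder are fired in a random order chosen by the seed, and the adder fires only when both of its input streams hold a token.

```python
import random
from collections import deque


def run_adder(xs, ys, seed):
    """Fire two producers and an adder in random order; return the sums."""
    rng = random.Random(seed)
    pending_x, pending_y = deque(xs), deque(ys)
    x, y, out = deque(), deque(), []

    def produce_x():
        if pending_x:
            x.append(pending_x.popleft())

    def produce_y():
        if pending_y:
            y.append(pending_y.popleft())

    def add():
        # Fires only when both input streams hold a token.
        if x and y:
            out.append(x.popleft() + y.popleft())

    steps = [produce_x, produce_y, add]
    while len(out) < min(len(xs), len(ys)):
        rng.choice(steps)()  # a firing attempt may find no data and do nothing
    return out
```

Different seeds change when each token arrives, but because tokens stay in FIFO order and the adder waits for data presence, the output sequence is the same for every schedule.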

Programming Model

Framework independent of device limits

Guidelines for efficient execution on any hardware implementation

Key abstractions of the programming model
– Operators
– Streams
– Memory Segments


Programming Model

Operators
– Represent an algorithmic transformation of input data to produce output data
– Building blocks for computation (multiplier, FIR, FFT)
– Size of an operator in hardware is implementation dependent and is not limited by the programming model
– Partitioning is an integral part of automating the compilation process

Programming Model

Streams
– Communication uses streaming data flow
– Producer connected to consumer via streams
– Defines where data is logically routed
– Acts as an unbounded-length queue for data tokens
– Data presence signals
    Operators signal when producing data and consuming data

Programming Model

Memory Segments
– Contiguous block of memory
– Serves as the basic unit for memory management
– Used by giving it a specific operating mode, then linking it into a data flow graph
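
As a rough illustration of "give the segment a mode, then link it into the graph", the Python sketch below models a segment that either streams its contents out or stores tokens streamed in. The class `MemorySegment`, the mode names `seq_read` and `seq_write`, and the helper `copy_through_stream` are assumptions made for this sketch; the slides do not list the actual operating modes.

```python
from collections import deque


class MemorySegment:
    """Contiguous block of memory with one operating mode (hypothetical model)."""

    def __init__(self, size, mode):
        if mode not in ("seq_read", "seq_write"):
            raise ValueError(f"unsupported mode: {mode}")
        self.data = [0] * size
        self.mode = mode
        self.pos = 0

    def step(self, stream):
        """Move one token between the segment and its linked stream."""
        if self.pos >= len(self.data):
            return False
        if self.mode == "seq_read":
            stream.append(self.data[self.pos])
        else:
            if not stream:
                return False  # nothing to store yet
            self.data[self.pos] = stream.popleft()
        self.pos += 1
        return True


def copy_through_stream(values):
    """Link a reading segment to a writing segment through one stream."""
    src = MemorySegment(len(values), "seq_read")
    src.data = list(values)
    dst = MemorySegment(len(values), "seq_write")
    link = deque()
    while src.step(link) or dst.step(link):
        pass
    return dst.data
```

Each segment has exactly one user here, in keeping with the single-owner rule stated for the compute model.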


Programming Model

Dynamic Features
– Dynamic rate operators
    Consume / produce tokens at data-dependent rates
    Efficient operators for tasks such as data compression (JPEG), decompression, searching, and filtering
    Scheduling decisions should be made at run time
– Dynamic graph composition and instantiation
    Computational graphs can be created, extended, or modified during execution
– Dynamic handling of uncommon events (exception handling)
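
A run-length encoder is a simple example of a dynamic rate operator: the number of output tokens depends on the input data, so the rate cannot be fixed at compile time. The Python sketch below is illustrative only; `run_length_encode` is a hypothetical name and is not part of the JPEG example in the talk.

```python
def run_length_encode(tokens):
    """Yield (value, count) pairs; the output rate depends on the data."""
    current, count = None, 0
    for t in tokens:
        if count and t == current:
            count += 1          # same run: consume a token, produce nothing
        else:
            if count:
                yield (current, count)
            current, count = t, 1
    if count:
        yield (current, count)  # flush the final run
```

Six input tokens may yield one output pair or six, which is why scheduling decisions for such operators are better made at run time.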

Execution Model

3 Key Components
– Compute Page (CP)
    Fixed-size block of RC logic which is the basic unit of virtualization and scheduling
– Memory Segment
    Contiguous block of memory which is the basic unit for data page management
– Stream Link
    Logical connection between the output of one page and the input of another page

Hardware Virtualization

Compute pages, segments, and streams are the fundamental units for the allocation, virtualization, and management of hardware resources

Example of Stream Buffer Execution

Model Implications

Advice for Programmers
– Describe computations as spatial pipelines with multiple, independent computational paths
– Avoid or minimize feedback cycles
– Expose large data streams to SCORE operators

Hardware Requirements

Sequential processor and RC device

RC device divided into a number of equivalent and independent compute pages

Multiple distributed memory blocks required to store intermediate data

High-bandwidth, low-latency communication among compute pages and memory, allowing memory pages to be used concurrently

Language Instantiation

One could define
– subsets of conventional HDLs
– subsets of conventional programming languages (C++, Java)

Instead, the authors define
– an RTL language to describe SCORE operators
    TDF: intermediate language

Language Requirements

SCORE operators are synchronous, single-clock entities with their own state
– Communicate only through designated I/O streams
– Operation is gated by data presence on the I/O streams
– Each operator is viewed as an FSM with an associated data path

SCORE does not have a global shared memory abstraction among operators
– Recall memory segments (no two operators can share memory at the same time)

TDF

RTL description with special syntax for handling the operator's input and output data streams
– Data path operators similar to C

To allow dynamic operators, the basic form is an FSM
– Each state specifies the inputs which must be present before it can “fire”
– When the inputs arrive, the operator consumes them and the FSM may choose to change states
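
The firing rule can be sketched in Python as follows. This is not TDF syntax: the `SelectOperator` class, its state names, and `run_select` are hypothetical, chosen to show a state machine in which each state names the inputs it needs, fires only when they are present, consumes them, and then picks the next state.

```python
from collections import deque


class SelectOperator:
    """FSM operator: each state lists the inputs it needs before it can fire."""

    REQUIRED = {"read_sel": ("sel",), "take_a": ("a",), "take_b": ("b",)}

    def __init__(self):
        self.inputs = {"sel": deque(), "a": deque(), "b": deque()}
        self.output = []
        self.state = "read_sel"

    def can_fire(self):
        return all(self.inputs[name] for name in self.REQUIRED[self.state])

    def fire(self):
        """Consume the current state's inputs, then choose the next state."""
        if not self.can_fire():
            return False
        if self.state == "read_sel":
            choose_a = self.inputs["sel"].popleft()
            self.state = "take_a" if choose_a else "take_b"
        else:
            source = "a" if self.state == "take_a" else "b"
            self.output.append(self.inputs[source].popleft())
            self.state = "read_sel"
        return True


def run_select(sel, a, b):
    """Load the three input streams and fire until the operator stalls."""
    op = SelectOperator()
    op.inputs["sel"].extend(sel)
    op.inputs["a"].extend(a)
    op.inputs["b"].extend(b)
    while op.fire():
        pass
    return op.output
```

Because the set of required inputs changes with the state, the operator consumes from `a` and `b` at data-dependent rates, which is what makes the FSM form suitable for dynamic operators.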

END PART 1

Tune in next week for exciting examples

Execution Example

Reference: Figure 16
– Shows an example of a C++ program which uses the merge and uniq operators

* SCORE operator instantiation and composition can be performed from C++ code
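
Figure 16 itself is not reproduced in this transcript, so the following Python sketch only approximates what merge and uniq operators do. It is not the C++ program from the paper, and the graph shape (two merges feeding one uniq, over three sorted inputs `i0`, `i1`, `i2`) is an assumption made for illustration.

```python
def merge(xs, ys):
    """Merge two ascending token sequences into one ascending stream."""
    xs, ys = iter(xs), iter(ys)
    done = object()  # sentinel marking an exhausted input
    x, y = next(xs, done), next(ys, done)
    while x is not done and y is not done:
        if x <= y:
            yield x
            x = next(xs, done)
        else:
            yield y
            y = next(ys, done)
    while x is not done:
        yield x
        x = next(xs, done)
    while y is not done:
        yield y
        y = next(ys, done)


def uniq(tokens):
    """Drop consecutive duplicate tokens."""
    first = True
    previous = None
    for t in tokens:
        if first or t != previous:
            yield t
        previous, first = t, False


def merge_uniq(i0, i1, i2):
    """Compose the operators into a graph: merge -> merge -> uniq."""
    t1 = merge(i0, i1)     # first merge operator
    t2 = merge(t1, i2)     # second merge operator
    return list(uniq(t2))  # uniq operator
```

Generators stand in for streams here: each operator pulls tokens from its inputs only as needed and produces output at a data-dependent rate.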

Example - Assumptions

Design consists of 3 behavioral operators
– Full implementation of each operator requires only one compute page

The RC array contains one compute page and three configurable memory blocks (CMBs)
– Each CMB is partitioned into 4 segments (s0 - s3)
    s0 and s1 buffer computation data
    s2 and s3 store state / configuration for a compute page

Example - Assumptions

CMB state maintained by controller
– Details are not shown in this example

Each compute page has 2 input and 2 output FIFO buffers

Scheduling and array reconfiguration are performed at the beginning of each timeslice
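
Under these assumptions, three operators must share one physical compute page, so only one can be resident per timeslice. The Python sketch below uses a plain round-robin policy to show the idea; the policy and the function `schedule_timeslices` are assumptions for illustration and are not the SCORE run-time scheduler.

```python
def schedule_timeslices(virtual_pages, physical_pages, num_slices):
    """Round-robin assignment of virtual pages to physical pages per timeslice."""
    n = len(virtual_pages)
    if n == 0:
        return [[] for _ in range(num_slices)]
    schedule = []
    start = 0
    for _ in range(num_slices):
        # Pages made resident at the beginning of this timeslice.
        resident = [virtual_pages[(start + i) % n]
                    for i in range(min(physical_pages, n))]
        schedule.append(resident)
        start = (start + physical_pages) % n
    return schedule
```

For example, `schedule_timeslices(["A", "B", "C"], 1, 3)` places one operator on the single compute page in each of three timeslices, while intermediate tokens would wait in the CMB data segments between slices.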

Execution Example

Physical view of array at each point in timeline

Single-letter identifiers assigned
– A: merge (inputs i0, i1)
– B: merge (inputs t1, t2)
– C: uniq
– Segments: S0, S1

Timeline for Execution Example

Step-by-Step Execution Example

SCORE Run-Time Environment

Building Applications

Run-Time Environment

Example: JPEG

Conclusion

Figure 18

Figure 19

Figure 20

Table 2

Figure 21

Figure 4
