SCORE - Stream Computations Organized for Reconfigurable Execution
Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy, Andre DeHon, John Wawrzynek
U.C. Berkeley BRASS group
Outline
Lecture 1
– Introduction
– Related Work
– SCORE Computational Model
– Hardware Requirements
– Language Instantiation
Lecture 2
– Execution Example
– SCORE Run-Time Environment
– Example: JPEG
– Results and Conclusion
Introduction
Problem: There is no unifying computational model that gives applications portability and longevity without sacrificing a substantial fraction of the hardware's raw capabilities.
Solution: A stream-based compute model.
– Divide the computation into fixed-size “pages.”
– Time-multiplex the “pages” onto the hardware.
Introduction
SCORE
– Eases the development and deployment, and broadens the range, of RC applications
– Efficient implementation that makes full use of the available resources
Introduction
Current Issues
– Existing targets are not portable
  Software for RC hardware is tied to a particular device
– Existing targets expose fixed resource limitations
  Impaired expressiveness
  Algorithms are restricted by the available hardware
  No dynamic resource allocation
Addressing the Issues
– Virtualize resources: computation, communication, and memory
– Provide a convenient and efficient model
Introduction
SCORE's programming model is a natural abstraction of communication between spatial hardware blocks.
A data flow communication graph captures the blocks of computation (operators) and the communication (streams) between them.
That graph is then captured and mapped efficiently onto hardware.
Related Work
Villasenor et al., circa 1995
– Motion-wavelet video coder
– Design hand-partitioned into “pages,” with each device manually reconfigured
  Ran on 1/3 as many machines
  Experienced only 10% overhead
SCORE builds on:
– Instruction Set Architecture, data flow, distributed, and streaming computation models
– PRISC, DISC, GARP
SCORE Computational Model
Compute Model
– Abstract model capturing the essential semantics of the computation
Programming Model
– Programming constructs that provide a convenient way to express computations in the compute model
Execution Model
– Low-level description of the computation, and the semantics the hardware is expected to provide when interpreting this description
Compute Model
Graph of computation operators and memory blocks linked together by streams
Streams
– Provide node-to-node communication
– Single-source, single-sink FIFO queues
Operators
– Finite State Machine (FSM) node
  Interacts via stream links
– Turing-complete (TM) node
  Supports resource allocation and stream operations
Compute Model
Operations are fully deterministic
– Determinism of individual operators
– Timing-independent communication
– Operators cannot side-effect each other's state
  1. They communicate through streams, which guarantee a timing-independent order of execution
  2. Memory segments have a single unique owner (no multiple read-write hazards)
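The determinism claim above can be illustrated with a minimal Python sketch (not SCORE code). The `Stream` class, the two toy operators, and the random scheduler are all assumptions made for illustration. The point is only that the output does not depend on the order in which operators happen to run.

```python
from collections import deque
import random

class Stream:
    """Single-source, single-sink FIFO queue of tokens (hypothetical helper)."""
    def __init__(self):
        self.q = deque()
    def put(self, tok):
        self.q.append(tok)
    def get(self):
        return self.q.popleft()
    def ready(self):
        return len(self.q) > 0

def run(order_seed):
    """Run a two-operator pipeline, choosing at random which operator steps next."""
    src, mid, out = Stream(), Stream(), Stream()
    for x in [1, 2, 3, 4]:
        src.put(x)

    def double():            # operator 1: reads src, writes mid
        if src.ready():
            mid.put(2 * src.get())

    def add_one():           # operator 2: reads mid, writes out
        if mid.ready():
            out.put(mid.get() + 1)

    rng = random.Random(order_seed)
    while src.ready() or mid.ready():
        rng.choice([double, add_one])()   # an operator with no input does not fire
    return list(out.q)

# Twenty different schedules, one distinct result.
results = {tuple(run(seed)) for seed in range(20)}
print(results)  # {(3, 5, 7, 9)}
```

Each operator touches only its own streams, so no schedule can reorder the tokens an operator sees.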
Programming Model
Framework independent of device limits
Guidelines for efficient execution on any hardware implementation
Key abstractions of the programming model
– Operators
– Streams
– Memory Segments
Programming Model
Operators
– Represent an algorithmic transformation of input data into output data
– Building blocks of a computation (multiplier, FIR, FFT)
– Size of an operator in hardware is implementation dependent and is not limited by the programming model
– Partitioning is an integral part of the automated compilation process
Programming Model
Streams
– Communication uses streaming data flow
– Producer is connected to consumer via streams
– Define where data is logically routed
– Act as unbounded-length queues for data tokens
– Data presence signals
  Operators signal when they produce and consume data
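A minimal Python sketch of the stream abstraction described above, assuming a simple in-memory queue. The class and method names (`Stream`, `bind_producer`, `present`, and so on) are hypothetical and are not part of any SCORE API.

```python
from collections import deque

class Stream:
    """Unbounded FIFO with exactly one producer and one consumer (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self._q = deque()
        self._producer = None
        self._consumer = None

    def bind_producer(self, op):
        if self._producer is not None:
            raise ValueError(f"stream {self.name} already has a producer")
        self._producer = op

    def bind_consumer(self, op):
        if self._consumer is not None:
            raise ValueError(f"stream {self.name} already has a consumer")
        self._consumer = op

    def present(self):
        """Data presence signal: True when a token is waiting."""
        return bool(self._q)

    def write(self, op, token):
        if op != self._producer:
            raise PermissionError(f"{op} is not the producer of {self.name}")
        self._q.append(token)

    def read(self, op):
        if op != self._consumer:
            raise PermissionError(f"{op} is not the consumer of {self.name}")
        if not self._q:
            raise RuntimeError("no data present; the consumer should stall")
        return self._q.popleft()

s = Stream("t1")
s.bind_producer("A")
s.bind_consumer("B")
s.write("A", 7)
first = s.read("B") if s.present() else None
print(first)  # 7
```

Rejecting a second producer or consumer is one way to keep the single-source, single-sink property explicit.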
Programming Model
Memory Segments
– Contiguous block of memory
– Serves as the basic unit of memory management
– Used by giving it a specific operating mode, then linking it into a data flow graph
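As a rough Python sketch of "give the segment a mode, then link it into the graph": the mode names and the `Segment` class below are assumptions for illustration, since the slides do not list the actual operating modes.

```python
from collections import deque
from enum import Enum

class Mode(Enum):
    # Hypothetical mode names; the slides only mention "a specific operating mode".
    SEQUENTIAL_READ = 1
    SEQUENTIAL_WRITE = 2

class Segment:
    """Contiguous block of memory with one owner and one operating mode."""
    def __init__(self, size, mode, owner):
        self.data = [0] * size
        self.mode = mode
        self.owner = owner     # single unique owner
        self.pos = 0

    def read_into(self, stream):
        """Sequential read: emit the contents in address order."""
        if self.mode is not Mode.SEQUENTIAL_READ:
            raise RuntimeError("segment is not in sequential-read mode")
        while self.pos < len(self.data):
            stream.append(self.data[self.pos])
            self.pos += 1

    def write_from(self, stream):
        """Sequential write: store incoming tokens in address order."""
        if self.mode is not Mode.SEQUENTIAL_WRITE:
            raise RuntimeError("segment is not in sequential-write mode")
        while stream and self.pos < len(self.data):
            self.data[self.pos] = stream.popleft()
            self.pos += 1

# Graph: src segment -> stream -> squaring operator -> stream -> dst segment
src = Segment(3, Mode.SEQUENTIAL_READ, owner="square")
src.data = [3, 1, 2]
dst = Segment(3, Mode.SEQUENTIAL_WRITE, owner="square")

to_op, from_op = deque(), deque()
src.read_into(to_op)
while to_op:
    x = to_op.popleft()
    from_op.append(x * x)
dst.write_from(from_op)
print(dst.data)  # [9, 1, 4]
```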
Programming Model
Dynamic Features
– Dynamic-rate operators
  Consume and produce tokens at data-dependent rates
  Efficient operators for tasks such as data compression (JPEG), decompression, searching, and filtering
  Scheduling decisions should be made at run time
– Dynamic graph composition and instantiation
  Computational graphs can be created, extended, or modified during execution
– Dynamic handling of uncommon events (exception handling)
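To make "data-dependent rates" concrete, here is a small Python sketch of a run-length encoder. It is only an illustration chosen for this note, not an operator from the SCORE work. The number of output tokens depends on the input values, not just on the input length.

```python
def run_length_encode(tokens):
    """Dynamic-rate operator: emits one (value, count) pair per run of equal inputs."""
    it = iter(tokens)
    try:
        current = next(it)
    except StopIteration:
        return               # empty input stream: nothing to emit
    count = 1
    for tok in it:
        if tok == current:
            count += 1       # consume a token without producing output
        else:
            yield (current, count)
            current, count = tok, 1
    yield (current, count)   # flush the final run

encoded = list(run_length_encode([0, 0, 0, 5, 5, 0, 7]))
print(encoded)  # [(0, 3), (5, 2), (0, 1), (7, 1)]
```

Seven input tokens yield four output tokens here. A different seven-token input could yield anywhere from one to seven, which is why a static schedule cannot know the rates in advance.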
Execution Model
3 Key Components
– Compute Page (CP)
  Fixed-size block of RC logic that is the basic unit of virtualization and scheduling
– Memory Segment
  Contiguous block of memory that is the basic unit of data page management
– Stream Link
  Logical connection between the output of one page and the input of another page
Hardware Virtualization
Compute pages, segments, and streams are the fundamental units for
– allocation
– virtualization
– management of hardware resources
Example of Stream Buffer Execution
Model Implications
Advice for Programmers
– Describe computations as spatial pipelines with multiple, independent computational paths
– Avoid or minimize feedback cycles
– Expose large data streams to SCORE operators
Hardware Requirements
Sequential processor and RC device
RC device divided into a number of equivalent and independent compute pages
Multiple distributed memory blocks required to store intermediate data
High-bandwidth, low-latency communication among compute pages and memory, allowing memory pages to be used concurrently
Language Instantiation
One could define
– subsets of conventional HDLs
– subsets of conventional programming languages (C++, Java)
Instead they define
– an RTL language to describe SCORE operators
  TDF: intermediate language
Language Requirements
SCORE operators are synchronous, single-clock entities with their own state
– Communicate only through designated I/O streams
– Operation is gated by data presence on the I/O streams
– Each operator is viewed as an FSM with an associated data path
SCORE does not have a global shared-memory abstraction among operators
– Recall memory segments: no two operators can share memory at the same time
TDF
RTL description with special syntax for handling the operator's input and output data streams
– Data path operators similar to C
To allow dynamic operators, the basic form is an FSM
– Each state specifies the inputs that must be present before it can “fire”
– When the inputs arrive, the operator consumes them and the FSM may choose to change state
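The firing rule can be sketched in Python. This is not TDF syntax, and `SelectOp` with its state names is a hypothetical operator invented for illustration. Each state lists the input streams that must hold a token before the state fires, and firing may move the FSM to another state.

```python
from collections import deque

class SelectOp:
    """FSM operator: reads a selector token, then forwards one token from a or b."""
    # state -> names of the input streams that must hold a token before firing
    REQUIRES = {"read_sel": ("sel",), "take_a": ("a",), "take_b": ("b",)}

    def __init__(self, sel, a, b, out):
        self.ins = {"sel": sel, "a": a, "b": b}
        self.out = out
        self.state = "read_sel"

    def can_fire(self):
        # An empty deque is falsy, so this checks data presence on each required input.
        return all(self.ins[name] for name in self.REQUIRES[self.state])

    def fire(self):
        if not self.can_fire():
            return False                       # stall until the inputs arrive
        if self.state == "read_sel":
            choice = self.ins["sel"].popleft()
            self.state = "take_a" if choice == 0 else "take_b"
        else:
            source = "a" if self.state == "take_a" else "b"
            self.out.append(self.ins[source].popleft())
            self.state = "read_sel"
        return True

sel, a, b, out = deque([0, 1, 1]), deque([10]), deque([20]), deque()
op = SelectOp(sel, a, b, out)
while op.fire():
    pass
print(list(out), op.state)  # [10, 20] take_b

b.append(30)                # the missing input arrives later
op.fire()
print(list(out), op.state)  # [10, 20, 30] read_sel
```

The operator stalls in `take_b` until stream `b` has data, then resumes with the same result it would have produced had the token arrived earlier.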
END PART 1
Tune in next week for exciting examples
Execution Example
Reference: Figure 16
– Shows an example C++ program that uses the merge and uniq operators
  * SCORE operator instantiation and composition can be performed from C++ code
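Figure 16 itself is not reproduced in these notes. As a stand-in, the following Python sketch composes merge and uniq operators as generators. It is not the C++ from the figure. The operator semantics (merge combines two sorted streams; uniq drops consecutive duplicates) and the wiring of three inputs are assumptions.

```python
def merge(xs, ys):
    """Merge two sorted token streams into one sorted stream."""
    xs, ys = iter(xs), iter(ys)
    done = object()                      # private end-of-stream marker
    x, y = next(xs, done), next(ys, done)
    while x is not done and y is not done:
        if x <= y:
            yield x
            x = next(xs, done)
        else:
            yield y
            y = next(ys, done)
    while x is not done:                 # drain whichever input remains
        yield x
        x = next(xs, done)
    while y is not done:
        yield y
        y = next(ys, done)

def uniq(xs):
    """Drop consecutive duplicate tokens."""
    first = True
    prev = None
    for x in xs:
        if first or x != prev:
            yield x
        prev, first = x, False

# Compose the graph: two merges feeding a uniq.
in_a, in_b, in_c = [1, 3, 5], [1, 2, 5], [2, 4, 6]
merged_ab = merge(in_a, in_b)
merged_all = merge(merged_ab, in_c)
result = list(uniq(merged_all))
print(result)  # [1, 2, 3, 4, 5, 6]
```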
Example - Assumptions
Design consists of 3 behavioral operators
– Full implementation of each operator requires only one compute page
The RC array contains one compute page and three configurable memory blocks (CMBs)
– Each CMB is partitioned into 4 segments (s0 - s3)
  s0 and s1 buffer computation data
  s2 and s3 store state / configuration for a compute page
Example - Assumptions
CMB state is maintained by the controller
– Details are not shown in this example
Each compute page has 2 input and 2 output FIFO buffers
Scheduling and array reconfiguration are performed at the beginning of each timeslice
Execution Example
Physical view of the array at each point in the timeline
Single-letter identifiers assigned
– A: merge (inputs i0, i1)
– B: merge (inputs t1, t2)
– C: uniq
– Segments: S0, S1
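Since the timeline figures are not reproduced here, the following Python simulation sketches the idea of time-multiplexing A, B, and C onto a single compute page. It rests on several assumptions: `t2` is treated as an external input, B's output is given the hypothetical name `t3`, `None` marks end of stream, buffers are unbounded deques standing in for memory segments, and the scheduler is plain round-robin with a fixed firing budget per timeslice.

```python
from collections import deque

EOS = None  # end-of-stream marker (an assumption of this sketch)

def merge_step(ins, out, state):
    a, b = ins
    if not a or not b:
        return False                      # stall: both input heads must be present
    if a[0] is EOS and b[0] is EOS:
        a.popleft(); b.popleft(); out.append(EOS)
    elif b[0] is EOS or (a[0] is not EOS and a[0] <= b[0]):
        out.append(a.popleft())
    else:
        out.append(b.popleft())
    return True

def uniq_step(ins, out, state):
    (a,) = ins
    if not a:
        return False
    tok = a.popleft()
    if tok is EOS or "last" not in state or tok != state["last"]:
        out.append(tok)
    state["last"] = tok                   # operator state saved between timeslices
    return True

def run(i0, i1, t2, slice_len=2):
    s = {name: deque() for name in ("i0", "i1", "t1", "t2", "t3", "out")}
    for name, data in (("i0", i0), ("i1", i1), ("t2", t2)):
        s[name].extend(data)
        s[name].append(EOS)
    ops = {                               # name: (step, input buffers, output buffer, state)
        "A": (merge_step, ("i0", "i1"), "t1", {}),
        "B": (merge_step, ("t1", "t2"), "t3", {}),
        "C": (uniq_step, ("t3",), "out", {}),
    }
    timeline = []                         # which operator held the page in each timeslice
    while not (s["out"] and s["out"][-1] is EOS):
        progressed = False
        for name, (step, ins, out, state) in ops.items():
            fired = 0
            while fired < slice_len and step([s[n] for n in ins], s[out], state):
                fired += 1
            if fired:                     # the page was loaded with this operator
                timeline.append(name)
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: no operator can fire")
    return [t for t in s["out"] if t is not EOS], timeline

result, timeline = run([1, 3, 5], [1, 2, 5], [2, 4, 6])
print(result)  # [1, 2, 3, 4, 5, 6]
```

Changing `slice_len` changes the timeline but not the result, which is the property that makes this kind of virtualization safe.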
Timeline for Execution Example
Step-by-Step Execution Example
SCORE Run-Time Environment
Building Applications
Run-Time Environment
Example: JPEG
Conclusion
Figure 18
Figure 19
Figure 20
Table 2
Figure 21
Figure 4