Model Based Design for DSP: Presentation to Stevens
Will Plishker, Chung-Ching Shen, Nimish Sane, George Zaki, Soujanya Kedilaya, Shuvra S. Bhattacharyya
Maryland DSPCAD Research Group (http://www.ece.umd.edu/DSPCAD/home/dspcad.htm)
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies
University of Maryland, College Park
Outline
- Model Based Design
- Dataflow Interchange Format
- Multiprocessor Scheduling
- Preliminary Setup and Results with GPUs
- Future Directions
Introduction
In modern, complex systems we would like to:
- Create an application description independent of the target
- Interface with a diverse set of tools and teams
- Achieve high performance
- Arrive at an initial prototype quickly
But algorithms are far removed from their final implementation:
- Low level programming environments
- Diverse and changing platforms
- Non-uniform functional verification
- Entrenched design processes
- Tool selection
Implementation Gap
- Abstract representation of an algorithm
- Low level, high performance implementation
[Figure: example of a low-level implementation - a threshold module with a pattern comparator (4-bit pattern), a decision check (1-bit decision), E and H adders, E/Gamma and fine grain OR outputs (1 bit each), and a channel Et adder summing 4x9-bit channel Et inputs into a 38-bit value tested against a threshold (YES/NO)]
Model-Based Design for Embedded Systems
- High level application subsystems are specified in terms of components that interact through formal models of computation
- C or other "platform-oriented" languages can be used to specify intra-component behavior
- Model-specific languages can be used to specify inter-component behavior
- Object-oriented techniques can be used to maintain libraries of components
- Popular models for embedded systems:
  - Dataflow and KPNs (Kahn process networks)
  - Continuous time, discrete event
  - FSM and related control formalisms
Dataflow-based Design: Related Trends
- Dataflow-based design (in our context) is a specific form of model-based design
- Dataflow-based design is complementary to:
  - Object-oriented design
  - DSP C compiler technology
  - Synthesis tools for hardware description languages (e.g., Verilog and VHDL)
Example: Dataflow-based Design for DSP
[Figure: example from the Agilent ADS tool]
Example: QAM Transmitter in National Instruments LabVIEW
Source: [Evans 2005]
[Figure: LabVIEW block diagram - Rate Control, QAM Encoder, Transmit Filters, Passband Signal]
Crossing the Implementation Gap: Design Flow Using DIF
[Figure: design flow. DSP designs and algorithms for signal processing, image/video, and communication systems are expressed in dataflow models (static: SDF, HSDF, CSDF, MDSDF; dynamic: CFDF, BDF; meta-modeling: PDF, BLDF) and written as DIF specifications in The DIF Language (TDL). A front-end builds the DIF representation (AIF / porting) inside The DIF Package (TDP), which exchanges graphs with dataflow-based DSP design tools (the Autocoding Toolset via the DIF-AT Ex/Im, Ptolemy II via the Ptolemy Ex/Im, and other tools via other Ex/Im modules) and generates code via DIF-to-C using DSP libraries (VSIPL, TI, other) for embedded processing platforms: C on DSPs, Java on the Java VM, Ada on VDM, and other embedded platforms.]
Dataflow with Software Defined Radio: DIF + GNU Radio
[Figure: integration of GNU Radio with The DIF Package (TDP). GRC produces an XML flowgraph (.grc) and a Python flowgraph (.py), which is executed by the GNU Radio engine (Python/C++).]
1) Convert or generate the .dif file (complete)
2) Execute static schedules from DIF via DIF Lite (complete)
3a) Perform online scheduling
3b) Provide an architecture specification (.arch?) describing processors, memories, and interconnect
4) Perform architecture-aware MP scheduling (assignment, ordering, invocation), producing a schedule (.dif, .sched) for platforms such as multiprocessors, GPUs, Cell, and FPGAs, supported by a platform-retargetable library and uniprocessor scheduling
Legend: steps 1 and 2 are existing or completed; steps 3 and 4 are proposed.
Background: Dataflow Graphs
- Vertices (actors) represent computation
- Edges represent FIFO buffers
- Edges may have delays, implemented as initial tokens
- Tokens are produced and consumed on edges
- Different models have different rules for production (SDF = fixed, CSDF = periodic, BDF = dynamic)
[Figure: example graph - actors X, Y, Z; edge e1 from X to Y with production rate p1 and consumption rate c1; edge e2 from Y to Z with rates p2, c2 and 5 delay tokens]
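These firing rules can be sketched in a few lines of Python. This is a minimal illustrative model of SDF semantics (the class and function names are my own, not the TDP API):

```python
# Minimal sketch of SDF firing semantics (illustrative, not the TDP API).
# An edge is a FIFO with fixed production (prd) and consumption (cns) rates;
# delays appear as initial tokens on the edge.

class Edge:
    def __init__(self, src, snk, prd, cns, delay=0):
        self.src, self.snk = src, snk
        self.prd, self.cns = prd, cns
        self.tokens = delay          # initial tokens implement delay

def enabled(actor, edges):
    """An actor may fire when every input edge holds >= cns(e) tokens."""
    return all(e.tokens >= e.cns for e in edges if e.snk == actor)

def fire(actor, edges):
    """Firing consumes cns(e) tokens on each input, produces prd(e) on each output."""
    assert enabled(actor, edges)
    for e in edges:
        if e.snk == actor:
            e.tokens -= e.cns
    for e in edges:
        if e.src == actor:
            e.tokens += e.prd

# Example: X --e1--> Y --e2--> Z, with 5 initial tokens (delays) on e2
edges = [Edge("X", "Y", prd=2, cns=3), Edge("Y", "Z", prd=1, cns=4, delay=5)]
fire("X", edges); fire("X", edges)   # two X firings put 4 tokens on e1
print(enabled("Y", edges))           # True: e1 now holds 4 >= 3
```

Note that Z is enabled before any firing at all, because the 5 delay tokens on e2 already satisfy its consumption rate of 4.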
Evolution of Dataflow Models of Computation for DSP: Examples
- Computation graphs and marked graphs [Karp 1966, Reiter 1968]
- Synchronous dataflow [Lee 1987]: static multirate behavior; SPW (Cadence), National Instruments LabVIEW, and others
- Well behaved stream flow graphs [1992]: schemas for bounded dynamics
- Boolean/integer dataflow [Buck 1994]: Turing complete models
- Multidimensional synchronous dataflow [Lee 1992]: image and video processing
- Scalable synchronous dataflow [Ritz 1993]: block processing; COSSAP (Synopsys)
- CAL [Eker 2003]: actor-based dataflow language
- Cyclo-static dataflow [Bilsen 1996]: phased behavior; Eonic Virtuoso Synchro, Synopsys El Greco and Cocentric, Angeles System Canvas
- Bounded dynamic dataflow [Pankert 1994]: bounded dynamic data transfer
- The processing graph method [Stevens 1997]: reconfigurable dynamic dataflow; U.S. Naval Research Lab, MCCI Autocoding Toolset
- Stream-based functions [Kienhuis 2001]
- Parameterized dataflow [Bhattacharya 2001]: reconfigurable static dataflow; meta-modeling for more general dataflow graph reconfiguration
- Reactive process networks [Geilen 2004]
- Blocked dataflow [Ko 2005]: image and video through parameterized processing
- Windowed synchronous dataflow [Keinert 2006]
- Parameterized stream-based functions [Nikolov 2008]
- Enable-invoke dataflow [Plishker 2008]
- Variable rate dataflow [Wiggers 2008]
Modeling Design Space
[Figure: scatter plot positioning dataflow models by expressive power (vertical axis) versus verification/synthesis power (horizontal axis); plotted models include PSDF, PCSDF, C/BDF/DDF, SDF, CSDF, SSDF, MDSDF, and WBDF]
Dataflow Interchange Format
- Describe dataflow graphs in text
Simple DIF file:

  dif graph1_1 {
    topology {
      nodes = n1, n2, n3, n4;
      edges = e1 (n1, n2), e2 (n2, n1),
              e3 (n1, n3), e4 (n1, n3),
              e5 (n4, n3), e6 (n4, n4);
    }
  }
More Features of DIF
Ports:

  interface {
    inputs = p1, p2:n2;
    outputs = p3:n3, p4:n4;
  }

Hierarchy:

  refinement {
    graph2 = n3;
    p1 : e3;
    p2 : e4;
    p3 : e5;
    p4 : p3;
  }
More Features of DIF
Production and consumption:

  production {
    e1 = 4096;
    e10 = 1024;
    ...
  }
  consumption {
    e1 = 4096;
    e10 = 64;
    ...
  }

Also: computation keyword, user defined attributes
The DIF Language Syntax

  dataflowModel graphID {
    basedon { graphID; }
    topology {
      nodes = nodeID, ...;
      edges = edgeID (srcNodeID, snkNodeID), ...;
    }
    interface {
      inputs = portID [:nodeID], ...;
      outputs = portID [:nodeID], ...;
    }
    parameter {
      paramID [:dataType];
      paramID [:dataType] = value;
      paramID [:dataType] : range;
    }
    refinement {
      subgraphID = supernodeID;
      subPortID : edgeID;
      subParamID = paramID;
    }
    builtInAttr {
      [elementID] = value;
      [elementID] = id;
      [elementID] = id1, id2, ...;
    }
    attribute usrDefAttr {
      [elementID] = value;
      [elementID] = id;
      [elementID] = id1, id2, ...;
    }
    actor nodeID {
      computation = stringValue;
      attrID [:attrType] [:dataType] = value;
      attrID [:attrType] [:dataType] = id;
      attrID [:attrType] [:dataType] = id1, ...;
    }
  }
Uniprocessor Scheduling for Synchronous Dataflow
- An SDF graph G = (V, E) has a valid schedule if it is deadlock-free and sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
- Balance equations: for every e in E, prd(e) x q[src(e)] = cns(e) x q[snk(e)].
- The repetitions vector q is the minimum positive integer solution of the balance equations.
- A valid schedule is then a sequence of actor firings in which each actor v is fired q[v] (its repetition count) times and the firing sequence obeys the precedence constraints imposed by the SDF graph.
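The repetitions vector can be computed by propagating rate ratios across the graph and then clearing denominators. A minimal sketch, assuming a connected graph (the function and graph encoding are my own, not part of TDP), demonstrated on the CD-to-DAT sample rate converter:

```python
from fractions import Fraction
from math import gcd
from collections import deque

def repetition_vector(actors, edges):
    """Minimal positive integer q with prd(e)*q[src] == cns(e)*q[snk] for all e.
    Assumes the graph is connected. Edge format: (src, snk, prd, cns)."""
    neighbors = {a: [] for a in actors}
    for src, snk, prd, cns in edges:
        neighbors[src].append((snk, Fraction(prd, cns)))  # q[snk] = q[src]*prd/cns
        neighbors[snk].append((src, Fraction(cns, prd)))
    q = {actors[0]: Fraction(1)}
    work = deque([actors[0]])
    while work:                       # propagate rate ratios over the graph
        a = work.popleft()
        for b, ratio in neighbors[a]:
            r = q[a] * ratio
            if b not in q:
                q[b] = r
                work.append(b)
            elif q[b] != r:           # conflicting ratios around a cycle
                raise ValueError("graph is not sample rate consistent")
    scale = 1                         # clear denominators: lcm of all of them
    for a in actors:
        d = q[a].denominator
        scale = scale * d // gcd(scale, d)
    return {a: int(q[a] * scale) for a in actors}

# CD-to-DAT sample rate converter (44.1 kHz -> 48 kHz):
actors = ["CD", "FIR1", "FIR2", "FIR3", "FIR4", "DAT"]
edges = [("CD", "FIR1", 1, 1), ("FIR1", "FIR2", 2, 3), ("FIR2", "FIR3", 4, 7),
         ("FIR3", "FIR4", 5, 7), ("FIR4", "DAT", 4, 1)]
print(repetition_vector(actors, edges))
# {'CD': 147, 'FIR1': 147, 'FIR2': 98, 'FIR3': 56, 'FIR4': 40, 'DAT': 160}
```

The ratio q[CD] : q[DAT] = 147 : 160 matches the 44.1 kHz : 48 kHz rate change.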
Example: Sample Rate Conversion
- Flat strategy: topologically sort the graph and iterate each actor v q[v] times
  - Low context switching but large buffer requirement and latency
- CD to DAT flat schedule: (147A)(147B)(98C)(56D)(40E)(160F)
- CD to DAT: 44.1 kHz to 48 kHz sampling rate conversion
[Figure: SDF graph CD (A) -> FIR1 (B) -> FIR2 (C) -> FIR3 (D) -> FIR4 (E) -> DAT (F), with (production, consumption) rates (1,1), (2,3), (4,7), (5,7), (4,1) on edges e1-e5]
Scheduling Algorithms
- Acyclic pairwise grouping of adjacent nodes (APGAN): an adaptable (to different cost functions) and low-complexity heuristic that computes a nested looped schedule of an acyclic graph while preserving precedence constraints (a topological sort) throughout the scheduling process
- Dynamic programming post optimization (DPPO): dynamic programming over a given actor ordering (any topological sort); variants include GDPPO, CDPPO, and SDPPO
- Recursive procedure call (RPC) based MAS: generates multiple appearance schedules for a given R-schedule through recursive graph decomposition; the resulting schedule is bounded polynomially in the graph size

Algorithm      | Looped Schedule                       | Buffer Size
Flat           | (147A)(147B)(98C)(56D)(40E)(160F)     | 1273
APGAN          | (49(3AB)(2C))(8(7D)(5E(4F)))          | 438
DPPO           | (7(7(3AB)(2C))(8D))(40E(4F))          | 347
RPC-based MAS  | ((2(((7((AB)(2(AB)C))D)D)(5E(4F)))(2(((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F))))((((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F)))) | 69
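The buffer sizes in the table can be checked by interpreting the looped schedules directly over the CD-to-DAT graph. A sketch (the schedule encoding is my own; "buffer size" here is the sum over all edges of the maximum simultaneous token count):

```python
# Interpret a nested looped schedule over the CD-to-DAT graph and report the
# total buffer requirement (sum of per-edge peak token occupancy).
# Schedule encoding (my own): an actor name, or (loop_count, [sub-schedules]).

EDGES = [("A", "B", 1, 1), ("B", "C", 2, 3), ("C", "D", 4, 7),
         ("D", "E", 5, 7), ("E", "F", 4, 1)]

def buffer_bound(schedule):
    tokens = {e: 0 for e in EDGES}
    peak = {e: 0 for e in EDGES}

    def run(item):
        if isinstance(item, str):                 # fire one actor
            for e in EDGES:                       # consume on input edges
                src, snk, prd, cns = e
                if snk == item:
                    tokens[e] -= cns
                    assert tokens[e] >= 0, "schedule violates precedence"
            for e in EDGES:                       # produce on output edges
                src, snk, prd, cns = e
                if src == item:
                    tokens[e] += prd
                    peak[e] = max(peak[e], tokens[e])
        else:                                     # (count, body): a loop
            count, body = item
            for _ in range(count):
                for sub in body:
                    run(sub)

    for item in schedule:
        run(item)
    return sum(peak.values())

flat  = [(147, ["A"]), (147, ["B"]), (98, ["C"]),
         (56, ["D"]), (40, ["E"]), (160, ["F"])]
apgan = [(49, [(3, ["A", "B"]), (2, ["C"])]),
         (8,  [(7, ["D"]), (5, ["E", (4, ["F"])])])]
dppo  = [(7,  [(7, [(3, ["A", "B"]), (2, ["C"])]), (8, ["D"])]),
         (40, ["E", (4, ["F"])])]
print(buffer_bound(flat), buffer_bound(apgan), buffer_bound(dppo))
# 1273 438 347
```

The simulation reproduces the Flat, APGAN, and DPPO entries of the table; the deeply nested RPC-based MAS schedule can be checked the same way once encoded.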
Representative Dataflow Analyses and Optimizations
- Bounded memory and deadlock detection: consistency
- Buffer minimization: minimize communication cost
- Multirate loop scheduling: optimize code/data trade-off
- Parallel scheduling and pipeline configuration
- Heterogeneous task mapping and co-synthesis
- Quasi-static scheduling: minimize run-time overhead
- Probabilistic design: adapt system resources and exploit slack
- Data partitioning: exploit parallel data memories
- Vectorization: improve context switching, pipelining
- Synchronization optimization: self-timed implementation
- Clustering of actors into atomic scheduling units
Multiprocessor Scheduling
- The multiprocessor scheduling problem:
  - Actor assignment (mapping)
  - Actor ordering
  - Actor invocation
- Approaches to each of these tend to be platform specific
- Tools can be brought under a common formal umbrella
Multiprocessor Mapping
[Figure: an application model G(V, E, t(v), C(e)) is mapped and scheduled onto processors P1-P4]
Invocation Example: Self-Timed (ST) Scheduling
- Application graph: actors A-H assigned across Proc 1-Proc 5
- Execution times: A, B, F: 3; C, H: 5; D: 6; E: 4; G: 2
- Assignment and ordering performed at compile-time
- Invocation performed at run-time (via synchronization)
[Figure: application graph and Gantt chart for the ST schedule on Proc 1-Proc 5; T_ST = 18]
Multicore Schedules
- Traditional multicore scheduling:
  - Convert the application DAG to homogeneous synchronous dataflow (HSDF)
  - Perform HSDF mapping
  - Problem: exponential graph explosion
- Our solution:
  - Represent the single processor schedule (SPS) as a generalized schedule tree (GST)
  - Generate an equivalent multiprocessor schedule (MPS) represented as a forest of GSTs
Traditional Dataflow Multiprocessor Scheduling (MPS)
[Figure: a Synchronous Dataflow (SDF) representation of an application with actors A, B, C and multirate edges is expanded into its homogeneous SDF representation, replicating each actor by its repetition count and connecting all instances with unit-rate edges]
GST Representation for MPS: Simple Example
[Figure: (a) an SDF graph; (b) its SPS as a GST; (c) the MPS represented as a forest of GSTs, one per processor P1-P3]
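The SPS-to-MPS step can be sketched as a projection of the GST onto each processor: copy the tree, keep only the leaves mapped to that processor, and prune loops that become empty. This is an illustrative sketch of the idea, not the TDP implementation (tree encoding and names are my own):

```python
# Sketch: derive a multiprocessor schedule (MPS) as a forest of GSTs by
# projecting a single processor schedule (SPS) GST onto each processor.
# A GST node is either an actor name (leaf) or (loop_count, [children]).

def project(gst, actors):
    """Keep only leaves in `actors`; prune internal nodes left empty."""
    if isinstance(gst, str):
        return gst if gst in actors else None
    count, children = gst
    kept = [c for c in (project(ch, actors) for ch in children) if c is not None]
    return (count, kept) if kept else None

def flatten(gst, out=None):
    """Expand a GST into its firing sequence."""
    if out is None:
        out = []
    if isinstance(gst, str):
        out.append(gst)
    else:
        count, children = gst
        for _ in range(count):
            for ch in children:
                flatten(ch, out)
    return out

# SPS for a small graph: (2 A (3 B) C) fires A B B B C twice
sps = (2, ["A", (3, ["B"]), "C"])
mapping = {"P1": {"A", "C"}, "P2": {"B"}}
forest = {p: project(sps, actors) for p, actors in mapping.items()}
print(flatten(forest["P1"]))   # ['A', 'C', 'A', 'C']
print(flatten(forest["P2"]))   # ['B', 'B', 'B', 'B', 'B', 'B']
```

The projection preserves each processor's internal firing order without ever expanding to HSDF; cross-processor precedence is then enforced at run-time by synchronization, as in the self-timed scheduling example above.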
Demonstration on GPUs: Start with Parallel Actors
Within an actor (FIR filter):

  y[n] = sum_{i=0}^{N} b_i * x[n-i]
Limitation (IIR filter):

  y[n] = sum_{i=0}^{P} b_i * x[n-i] + sum_{j=1}^{Q} a_j * y[n-j]
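The contrast between the two equations can be sketched in Python: each FIR output depends only on the input stream, so outputs are mutually independent (one GPU thread per output sample is natural), while the IIR feedback terms make each output depend on earlier outputs, serializing the loop. These are reference implementations for illustration, not the CUDA kernels used in the experiments:

```python
def fir(x, b):
    """y[n] = sum_{i=0}^{N} b[i] * x[n-i]. Each y[n] depends only on the
    input x, so all outputs can be computed independently (data parallel)."""
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]

def iir(x, b, a):
    """y[n] = sum_{i=0}^{P} b[i]*x[n-i] + sum_{j=1}^{Q} a[j-1]*y[n-j].
    The feedback sum makes y[n] depend on earlier outputs, so the loop
    carries a dependence and per-output parallelism is lost."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc += sum(a[j - 1] * y[n - j] for j in range(1, len(a) + 1)
                   if n - j >= 0)
        y.append(acc)
    return y

print(fir([1, 0, 0, 0], [1, 2, 3]))   # impulse response: [1, 2, 3, 0]
print(iir([1, 0, 0, 0], [1], [0.5]))  # [1, 0.5, 0.25, 0.125]
```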
Individual Actor Results: CUDA FIR vs. Stock GR FIR
Individual Actor Results: Turbo Block Decode
Future Direction: Tackling the General MP Scheduling Problem with Dataflow Analysis
- Many dataflow analysis techniques are available once the problem is well defined in dataflow terms
- Maximize multicore utilization by replicating and fusing actors/blocks:
  - Stateless vs. stateful actors
  - Computation to communication ratios
  - Firing rates/execution times relative to the number of blocks
- Once the application is mapped to blocks/processors:
  - Single processor scheduling to minimize buffering
- Focus first on MP scheduling for GPUs: blocks, threads, memory
Refine to a Simpler Question: When to Off-load onto a GPU?
- Given:
  - An application graph
  - Actor timing characteristics for communication and computation
  - A target architecture with heterogeneous multiprocessing
- Find the optimal implementation:
  - Latency
  - Throughput
[Figure: an application graph (actors numbered 1-9) partitioned between a CPU and a GPU, with each actor's assignment an open question]
Summary
- Model Based Design
- Dataflow Interchange Format
- Multiprocessor Scheduling
- Preliminary Setup and Results with GPUs
- Future Directions