Modeling and Codesign Methods for Data Adaptable Reconfigurable Embedded Systems
Roman Lysecky
Department of Electrical and Computer Engineering, University of Arizona
Collaborators: Jonathan Sprinkle, Jerzy Rozenblit, Michael Marcellin
Students: Andrew Milakovich, Vijay Shankar Gopinath, Sachidanand Mahadevan, Sean Whitsitt, Nathan Sandoval, Casey Mackin, Kyle Merry
This work was supported in part by the National Science Foundation under Grant CNS-0915010.
Introduction & Motivation: Data Adaptable Approach
• Increasingly Complex Application Demands
  • Complex Algorithms
  • Compute Intensive
  • Highly Configurable
• Example: JPEG2000 Image Compression
  • Provides significant advantages – quality and compression – over the JPEG standard
  • Support for configurability at each processing stage (e.g. color transform, wavelet, block encoding, code stream)
  • Results in high computational demands and a larger design space
Introduction & Motivation: Traditional SW & HW Solutions
[Figure: three architectures. Software Only (single/multicore): multiple µPs. Hardware Accelerated (dedicated HW IP): µP with I$ and D$ plus a JPEG2000 co-processor. Reconfigurable (FPGA supporting dynamic reconfiguration): µP with I$ and D$, an FPGA coprocessor, and bitstream memory.]

Goals                         Software Only   Hardware Accelerated   Reconfigurable
Configurability/Flexibility   Yes             No                     Yes
Performance                   No              Yes                    Yes
Introduction & Motivation: Data-Adaptable Reconfigurable Embedded Systems (DARES)
• Reconfigurable systems for highly configurable, compute-intensive applications
  • Can be reconfigured at runtime for immediate application needs
  • How and when to reconfigure is specific to the application and data input
• Goal: Reconfigure hardware tasks within the FPGA based upon the current data profile
[Figure: a µP and a reconfigurable FPGA turn an input stream (...10110000) into an output stream (10011000...). The FPGA holds Task A (512x512), Task B (5/3), and Task C (Casual). A new input stream (...000010100) arrives with a new data profile: 14 bits/channel, Task A (1024x1024, 4:4:2), Task B (Wavelet 5/3), Task C (Error Resilient). Also shown: Task A (1024x1024) and the set of HW task implementations (Tasks A, B, C, D).]
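The goal above, picking a hardware task configuration from the current data profile, can be sketched in C. This is only an illustration under stated assumptions: the `DataProfile` and `HwTaskConfig` types, their fields, and both function names are hypothetical and do not come from the DARES toolchain. Matching on image dimensions alone is a simplification.

```c
#include <stddef.h>

/* Hypothetical data profile; fields loosely follow the example profile on the slide. */
typedef struct {
    int bits_per_channel;
    int width;
    int height;
} DataProfile;

/* Hypothetical descriptor of one stored HW implementation of a task. */
typedef struct {
    const char *name;
    int width;
    int height;
} HwTaskConfig;

/* Return the index of the stored HW configuration whose dimensions match
 * the profile, or -1 if none matches (the task would then stay in software). */
int select_hw_config(const HwTaskConfig *cfgs, size_t n, const DataProfile *p)
{
    for (size_t i = 0; i < n; ++i) {
        if (cfgs[i].width == p->width && cfgs[i].height == p->height)
            return (int)i;
    }
    return -1;
}

/* Reconfigure only when a match exists and differs from what is loaded. */
int needs_reconfiguration(int loaded, int selected)
{
    return selected >= 0 && selected != loaded;
}
```

With a 512x512 configuration loaded and a 1024x1024 profile arriving, `select_hw_config` would pick the 1024x1024 entry and `needs_reconfiguration` would report that the FPGA region has to be reloaded.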
Introduction & Motivation: Data-Adaptable Reconfigurable Embedded Systems (DARES)
• DARES Components & Methodology
  • Model-driven framework for specifying application tasks, processing requirements, data configurability, and target data profiles for hardware support
  • Runtime middleware and communication framework for runtime communication, system reconfiguration, and process scheduling
  • Automated tool flow supporting the proposed methodology
DARES Approach: Design Methodology and Toolchain – Overview
1. Modeling Framework
2. SW Task Generation/Compilation
3. HW Coprocessor Generation/Synthesis
4. HW/SW Communication Framework
5. Final Software/Hardware Implementation
[Figure: toolchain flow, with stages labeled (1) to (5). The Application and Data Profile Model feeds the HW/SW Codesign (Model Interpreter). Software threads, communication middleware, and init. code go through the software compiler (gcc) to produce the software binary. The software model for HW tasks goes through ImpulseC CoDeveloper to produce hardware tasks, which pass with the HW/SW communication framework through Xilinx ISE to produce hardware task bitstreams.]
DARES Approach: Design Methodology and Toolchain
• DARES Modeling Framework
  • Modeling language to express the application as a composition of Communicating Sequential Dataflow Tasks (CSDT)
  • Capture application- and task-level data profiles
  • Allow designers to specify the configuration of tasks for the target data profiles
  • Perform design space exploration to determine the Pareto optimal system implementation
  • Generate source code for SW and HW task configurations
[Figure: the Application and Data Profile Model feeds the HW/SW Codesign (Model Interpreter). Panels: Application Tasks and Dataflow Model (JPEG); Task Configurations of Row DCT Task. The modeling language comprises Types, Semantics, and Constraints.]
DARES Approach: Design Methodology and Toolchain
• DARES Modeling Language
  • Developed using the Generic Modeling Environment (GME)
  • Types
    • Task – Models a functional unit of the application
    • Config – Models the configurability of an application task
    • TaskInstance – Models an instance of an application task
  • Constraints
    • Simple – Unique identifiers
    • Legal dataflow specifications
    • 1-1 correspondence between IN and OUT ports in a Config and its parent Task
  • Semantics (i.e. Model Interpreter)
    • Driven by the hardware/software codesign methodology
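To make the port-correspondence constraint concrete, here is a minimal C sketch of how a Task and a Config might be represented and checked. The struct layouts, `MAX_PORTS`, and the function names are assumptions made for illustration; the actual DARES metamodel is defined in GME and is not shown in the slides. The check assumes port names are unique within each port set.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PORTS 8  /* illustrative limit, not from the source */

/* Hypothetical in-memory form of a Task: a functional unit with named ports. */
typedef struct {
    const char *id;
    const char *in[MAX_PORTS];
    size_t n_in;
    const char *out[MAX_PORTS];
    size_t n_out;
} Task;

/* Hypothetical Config: one configuration of a parent Task. */
typedef struct {
    const char *id;
    const Task *parent;
    const char *in[MAX_PORTS];
    size_t n_in;
    const char *out[MAX_PORTS];
    size_t n_out;
} Config;

/* Two port sets correspond 1-1 if they have the same size and every name in
 * the first appears in the second (names assumed unique within a set). */
static int same_ports(const char *const *a, size_t na,
                      const char *const *b, size_t nb)
{
    if (na != nb)
        return 0;
    for (size_t i = 0; i < na; ++i) {
        int found = 0;
        for (size_t j = 0; j < nb; ++j) {
            if (strcmp(a[i], b[j]) == 0) { found = 1; break; }
        }
        if (!found)
            return 0;
    }
    return 1;
}

/* Constraint: IN and OUT ports of a Config match those of its parent Task. */
int config_matches_parent(const Config *c)
{
    return same_ports(c->in, c->n_in, c->parent->in, c->parent->n_in) &&
           same_ports(c->out, c->n_out, c->parent->out, c->parent->n_out);
}
```

A modeling tool would normally enforce such a rule at edit time; the function here only shows what the rule amounts to.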
DARES Approach: Design Methodology and Toolchain
[Figure: steps of the HW/SW Codesign (Model Interpreter), driven by the Application and Data Profile Model: Design Space Pruning, Latency Estimation, Optimization, Off-chip Memory Allocation, Source Code Generation.]
• DARES HW/SW Codesign Methodology
  • Design Space Pruning
    • Find all compatible combinations of specific task configurations
    • Subject to the area constraint of the FPGA
  • Latency Estimation
    • For all possible application configurations, estimate the end-to-end latency
    • Estimation considers:
      • Task configuration latency
      • Communication overhead
      • Required input/output data within all tasks
      • Mode of operation of specific task configurations
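The pruning and latency-estimation steps can be sketched as an exhaustive enumeration over one configuration choice per task. This is a simplified illustration, not the DARES interpreter: every type and function name is hypothetical, and the latency model (sum of task latencies plus a fixed overhead per link in a linear task chain) is an assumption that ignores the data-volume and mode-of-operation terms listed above.

```c
#include <stddef.h>

#define MAX_TASKS 8  /* illustrative limit */

/* One candidate configuration of a task (area 0 models a software version). */
typedef struct {
    int area;
    double latency;
} TaskCfg;

/* All candidate configurations of one task; n_cfgs must be at least 1. */
typedef struct {
    const TaskCfg *cfgs;
    size_t n_cfgs;
} TaskOptions;

/* One application configuration: a choice per task plus its totals. */
typedef struct {
    size_t choice[MAX_TASKS];
    int area;
    double latency;
} AppConfig;

/* Enumerate every combination, keep those within area_limit, and record an
 * additive latency estimate. Returns the number of entries written to out. */
size_t enumerate_feasible(const TaskOptions *tasks, size_t n_tasks,
                          int area_limit, double comm_overhead,
                          AppConfig *out, size_t max_out)
{
    size_t idx[MAX_TASKS] = {0};
    size_t count = 0;

    if (n_tasks == 0 || n_tasks > MAX_TASKS)
        return 0;

    for (;;) {
        int area = 0;
        double lat = comm_overhead * (double)(n_tasks - 1);
        for (size_t t = 0; t < n_tasks; ++t) {
            area += tasks[t].cfgs[idx[t]].area;
            lat += tasks[t].cfgs[idx[t]].latency;
        }
        if (area <= area_limit && count < max_out) {
            for (size_t t = 0; t < n_tasks; ++t)
                out[count].choice[t] = idx[t];
            out[count].area = area;
            out[count].latency = lat;
            ++count;
        }
        /* Advance a mixed-radix counter over the per-task choices. */
        size_t t = 0;
        while (t < n_tasks && ++idx[t] == tasks[t].n_cfgs) {
            idx[t] = 0;
            ++t;
        }
        if (t == n_tasks)
            break;
    }
    return count;
}
```

Exhaustive enumeration grows exponentially with the number of tasks, so it only suits small task graphs such as the four-task JPEG pipeline used later in the deck.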
DARES Approach: Design Methodology and Toolchain
• DARES HW/SW Codesign Methodology
  • Optimization - Design Space Pruning
    • Find Pareto optimal combinations of task configurations
    • Defines all possible application configurations that yield the best area/latency tradeoff
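A Pareto optimal configuration is one that no other configuration beats on both area and latency. The sketch below shows a straightforward quadratic-time filter over (area, latency) pairs; the `DesignPoint` type and function names are hypothetical, and the slides do not say which algorithm DARES uses.

```c
#include <stddef.h>

/* Hypothetical (area, latency) summary of one application configuration. */
typedef struct {
    int area;
    double latency;
} DesignPoint;

/* a dominates b if it is no worse in both metrics and better in at least one. */
static int dominates(const DesignPoint *a, const DesignPoint *b)
{
    return a->area <= b->area && a->latency <= b->latency &&
           (a->area < b->area || a->latency < b->latency);
}

/* Set keep[i] to 1 if pts[i] is Pareto optimal, else 0.
 * Returns the number of Pareto optimal points. */
size_t pareto_front(const DesignPoint *pts, size_t n, int *keep)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        keep[i] = 1;
        for (size_t j = 0; j < n; ++j) {
            if (j != i && dominates(&pts[j], &pts[i])) {
                keep[i] = 0;
                break;
            }
        }
        kept += (size_t)keep[i];
    }
    return kept;
}
```

Points with identical area and latency do not dominate each other under this definition, so duplicates are all kept.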
DARES Approach: Design Methodology and Toolchain
• DARES HW/SW Codesign Methodology
  • Off-chip Memory Allocation
    • Designer can specify a set of application profiles that must be supported
    • Designer can additionally choose from the Pareto optimal configurations
    • If off-chip configuration memory is still available, additional task configurations are selected for support to increase runtime adaptability
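One way to picture the allocation step is a two-pass greedy fill of the bitstream memory: required configurations first, then optional ones while space remains. The slides do not describe the actual selection policy, so the greedy order, the `BitstreamCandidate` type, and the function name below are all assumptions made for illustration.

```c
#include <stddef.h>

#define ALLOC_FAIL ((size_t)-1)

/* Hypothetical candidate bitstream for one HW task configuration. */
typedef struct {
    size_t bitstream_bytes;
    int required;  /* 1 if needed by a designer-specified profile */
    int selected;  /* output: 1 if stored in off-chip memory */
} BitstreamCandidate;

/* Pass 1 stores all required bitstreams; pass 2 adds optional ones in the
 * given (priority) order while they fit. Returns bytes used, or ALLOC_FAIL
 * if the required set alone exceeds the capacity. */
size_t allocate_bitstreams(BitstreamCandidate *c, size_t n, size_t capacity)
{
    size_t used = 0;

    for (size_t i = 0; i < n; ++i)
        c[i].selected = 0;

    for (size_t i = 0; i < n; ++i) {
        if (c[i].required) {
            if (c[i].bitstream_bytes > capacity - used)
                return ALLOC_FAIL;
            used += c[i].bitstream_bytes;
            c[i].selected = 1;
        }
    }
    for (size_t i = 0; i < n; ++i) {
        if (!c[i].required && c[i].bitstream_bytes <= capacity - used) {
            used += c[i].bitstream_bytes;
            c[i].selected = 1;
        }
    }
    return used;
}
```

A greedy fill is not guaranteed to maximize the number of stored configurations; a knapsack-style formulation would be a natural alternative if that mattered.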
DARES Approach: Design Methodology and Toolchain
• DARES HW/SW Codesign Methodology
  • Code Synthesis
    • Source files are generated for all SW task configurations and the selected HW task configurations
    • Transforms the input C code to a Pthread implementation that uses Communication Middleware APIs, which provide access to input and output buffers identified by unique IDs
DARES Approach: Design Methodology and Toolchain
[Figure: software path of the toolchain. The Codesign Interpreter emits software threads, communication middleware, and init. code, which are compiled by the software compiler (gcc).]
• Software Task Generation and Compilation
  • HW/SW Codesign Interpreter transforms the C code for application task configurations
  • Generates a Pthread implementation for all SW task configurations
  • Communication Middleware
    • APIs provide access to input and output buffers identified by the specific tasks

// Original Task Configuration code
    void FuncName()
    {
    #pragma DARES_DECL_PART
        int data[64];
        ...
    #pragma DARES_COMP_BEGIN
    #pragma DARES_READ_INTO(data)
        // Computation
    #pragma DARES_WRITE_FROM(data,64)
    #pragma DARES_COMP_END
    }

// Pthread implementation.
    void* FuncName()
    {
        ...
        INTx DARES_SAMPLE_INPUT;
        int DARES_loop_iter;
        do {
            ...
            do {
                // Read DEPTH input samples from the input FIFO
                for ( DARES_loop_iter = 0; DARES_loop_iter < DEPTH; ++DARES_loop_iter ) {
                    if ( Fifo_Read_Single( ID1, &DARES_SAMPLE_INPUT ) == 0 )
                        DARES_INPUT[DARES_loop_iter] = DARES_SAMPLE_INPUT;
                }
                ...
                // Write TOKENS output samples to the output FIFO
                for ( DARES_loop_iter = 0; DARES_loop_iter < TOKENS; ++DARES_loop_iter ) {
                    Fifo_Write_Single( ID2, &DARES_OUTPUT[DARES_loop_iter] );
                }
            } while ( !Fifo_Eos( ID1 ) );
            ...
        } while (1);
    }
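The generated thread above relies on FIFO calls for single-element reads, single-element writes, and an end-of-stream test. The following ring buffer is a minimal stand-in for that kind of interface. It is not the DARES middleware: the `demo_fifo_*` names, the `DemoFifo` type, the return codes, and the explicit close operation are all hypothetical, and the sketch is single-threaded, so a real version shared between Pthreads would need a mutex and condition variables.

```c
#include <stddef.h>

#define DEMO_FIFO_CAP 16  /* illustrative capacity */

/* Hypothetical bounded FIFO of int samples. */
typedef struct {
    int data[DEMO_FIFO_CAP];
    size_t head;   /* index of the next element to read */
    size_t tail;   /* index of the next free slot */
    size_t count;  /* number of stored elements */
    int closed;    /* set once the producer signals end of stream */
} DemoFifo;

void demo_fifo_init(DemoFifo *f)
{
    f->head = f->tail = f->count = 0;
    f->closed = 0;
}

/* Returns 0 on success, -1 if the FIFO is full or already closed. */
int demo_fifo_write(DemoFifo *f, const int *v)
{
    if (f->closed || f->count == DEMO_FIFO_CAP)
        return -1;
    f->data[f->tail] = *v;
    f->tail = (f->tail + 1) % DEMO_FIFO_CAP;
    f->count++;
    return 0;
}

/* Returns 0 on success, -1 if the FIFO is empty. */
int demo_fifo_read(DemoFifo *f, int *v)
{
    if (f->count == 0)
        return -1;
    *v = f->data[f->head];
    f->head = (f->head + 1) % DEMO_FIFO_CAP;
    f->count--;
    return 0;
}

/* Producer marks the end of the stream. */
void demo_fifo_close(DemoFifo *f)
{
    f->closed = 1;
}

/* End of stream: producer has closed and all data has been consumed. */
int demo_fifo_eos(const DemoFifo *f)
{
    return f->closed && f->count == 0;
}
```

Returning 0 on a successful read mirrors the `== 0` check in the generated loop, which is the only hint the slides give about the real return convention.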
DARES Approach: Design Methodology and Toolchain
[Figure: hardware path of the toolchain. The Codesign Interpreter emits the software model for HW tasks, which ImpulseC CoDeveloper turns into hardware tasks.]
• Hardware Coprocessor Generation and Synthesis
  • HW/SW Codesign Interpreter generates an ImpulseC function for all HW task configurations
    • Utilizes the co_stream interface for FIFO input/output
  • Utilizes ImpulseC CoDeveloper to synthesize VHDL implementations
    • Provides rich support for optimizing loops and analyzing the pipelined loops

// Original Task Configuration code
    void FuncName()
    {
    #pragma DARES_DECL_PART
        int data[64];
        ...
    #pragma DARES_COMP_BEGIN
    #pragma DARES_READ_INTO(data)
        // Computation
    #pragma DARES_WRITE_FROM(data,64)
    #pragma DARES_COMP_END
    }

// ImpulseC implementation.
    void FuncName( co_stream fifo1, co_stream fifo2 )
    {
        ...
        INT8 DARES_SAMPLE_INPUT;
        int DARES_loop_iter;
        do {
            ...
            do {
                // Read DEPTH input samples from the input stream
                for ( DARES_loop_iter = 0; DARES_loop_iter < DEPTH; ++DARES_loop_iter ) {
                    if ( co_stream_read(_INFIFO_, &DARES_SAMPLE_INPUT, sizeof(WIDTH1)) == co_err_none )
                        DARES_INPUT[DARES_loop_iter] = DARES_SAMPLE_INPUT;
                }
                ...
                // Write TOKENS output samples to the output stream
                for ( DARES_loop_iter = 0; DARES_loop_iter < TOKENS; ++DARES_loop_iter ) {
                    co_stream_write(_OUTFIFO_, &DARES_OUTPUT[DARES_loop_iter], sizeof(WIDTH2));
                }
            } while ( !co_stream_eos(_INFIFO_) );
            ...
        } while (1);
    }
DARES Approach: Design Methodology and Toolchain
• Hardware/Software Communication Framework
  • Hardware coprocessors are integrated with the hardware/software communication framework
  • Supports seamless communication between software and hardware tasks
    • In conjunction with the communication middleware
  • Efficient communication mechanisms are supported for communication between adjacent and non-adjacent hardware tasks
[Figure: hardware tasks and the HW/SW communication framework pass through Xilinx ISE. Each user IP (HW task) attaches to the system bus (PLB) through a memory-mapped bus interface, with FIFO Out and FIFO2Bus blocks and the signals Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, and Fpin_full.]
DARES Approach: Hardware/Software Communication Methods
• Software to Software (SW Buffer)
• Software to Software (HW FIFO)
[Figure: for each method, the communication path among the µP, memory, and a chain of tasks, shown next to the user IP (HW task) interface diagram: system bus (PLB), memory-mapped bus interface, FIFO Out, FIFO2Bus, and the Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, and Fpin_full signals.]
DARES Approach: Hardware/Software Communication Methods
• Software to Hardware
• Hardware to Hardware (Adjacent)
[Figure: for each method, the communication path among the µP, memory, and a chain of tasks, shown next to the user IP (HW task) interface diagram.]
DARES Approach: Hardware/Software Communication Methods
• Hardware to Hardware (Non-Adjacent)
• Hardware to Software
[Figure: for each method, the communication path among the µP, memory, and a chain of tasks, shown next to the user IP (HW task) interface diagram.]
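The six communication methods listed across these slides can be read as a classification of each producer/consumer link by where its two tasks are placed. The sketch below encodes that reading. The enum names, the `adjacent` flag, and the `prefer_hw_fifo` switch for software-to-software links are assumptions; the slides do not state how DARES chooses between the SW buffer and HW FIFO variants.

```c
/* Where a task executes. */
typedef enum { PLACE_SW, PLACE_HW } Placement;

/* The communication methods named on the slides. */
typedef enum {
    LINK_SW_TO_SW_BUFFER,
    LINK_SW_TO_SW_HW_FIFO,
    LINK_SW_TO_HW,
    LINK_HW_TO_HW_ADJACENT,
    LINK_HW_TO_HW_NON_ADJACENT,
    LINK_HW_TO_SW
} LinkMethod;

/* Classify one link from producer (src) to consumer (dst).
 * adjacent: nonzero if two HW tasks are neighbors (only used for HW-to-HW).
 * prefer_hw_fifo: nonzero to route SW-to-SW data through a HW FIFO
 *                 (a hypothetical policy knob). */
LinkMethod classify_link(Placement src, Placement dst,
                         int adjacent, int prefer_hw_fifo)
{
    if (src == PLACE_SW && dst == PLACE_SW)
        return prefer_hw_fifo ? LINK_SW_TO_SW_HW_FIFO : LINK_SW_TO_SW_BUFFER;
    if (src == PLACE_SW && dst == PLACE_HW)
        return LINK_SW_TO_HW;
    if (src == PLACE_HW && dst == PLACE_SW)
        return LINK_HW_TO_SW;
    return adjacent ? LINK_HW_TO_HW_ADJACENT : LINK_HW_TO_HW_NON_ADJACENT;
}
```

Applying the function to every edge of the dataflow graph gives the set of link types a given application configuration would need.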
DARES Approach: Design Methodology and Toolchain
• Final Hardware/Software Implementation
  • Software threads and hardware coprocessors are combined for the final system implementation
  • Requires manual – although automatable – creation of system initialization code
  • System configuration for the current data profile is not supported at runtime
[Figure: the software binary and the hardware task bitstreams target a µP, a reconfigurable FPGA, and coprocessor bitstream memory.]
DARES Approach: Case Study 1 – JPEG (not 2000)
• Experimental Setup
  • Consider a JPEG image compression application
  • Generated software and hardware implementations for JPEG encoding tasks using the DARES toolchain
    • Discrete cosine transform (dct), quantization (qnt), zig-zag ordering (zz), and run-length encoding (rle)
    • Independently verified the software and hardware accelerated implementations
  • Evaluated system performance for all combinations of hardware coprocessors available within the system
    • Manually configured communication between SW and HW tasks to measure system performance of the HW accelerated implementation
    • Virtex-5 FX FPGA (ML507 board)
    • 400 MHz PowerPC processor with 100 MHz PLB bus
DARES Approach: Case Study 1 – JPEG (not 2000)
• Experimental Results
  • Achieves a performance improvement of 1.8X for a single hardware task and up to 5X with all tasks executing in hardware
    • Compared to software-only implementations
  • The communication method (e.g. DMA, bus hierarchy, NoCs) and latency need to be considered in determining the Pareto optimal system configuration
DARES Approach: Case Study 2 – JPEG2000 (the real deal…sort of)
• Experimental Setup
  • Consider JPEG2000 compression using the JasPer software implementation
  • All stages can be configured differently for lossy or lossless compression
  • Bulk of the execution time is spent in the Tier 1 Encoder (typically 50% or more)
    • Block Encoder
    • Bit Plane Encoder
    • MQEncoder
[Figure: JPEG2000 encoder pipeline: Forward Multi-Component Transform, Forward Wavelet Transform, Quantization, Tier 1 Encoder, Tier 2 Encoder, with Rate Control.]
DARES Approach: Case Study 2 – JPEG2000 (the real deal…sort of)
• Experimental Setup
  • Utilized the DARES approach to create a data-profile-specific implementation of the MQEncoder
    • Data profile supporting 32x32 block size in the HW task
    • All other block sizes supported in software
  • Separated the JPEG2000 software into multiple threads
    • i.e. extracted the MQEncode process as a separate thread
  • Adapted the software to fit the DARES dataflow model
  • Modeling environment and toolchain used to generate SW and HW source
• Results:

Image Format   Image Size (KB)   # of Blocks   MQEncoder % Exec. Time   Estimated Speedup   Actual Speedup
BMP            37                48            61.4                     2.6X                2.5X
BMP            257               256           71.5                     3.5X                3.3X
PGM            147               167           72.8                     3.7X                3.5X
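The slides do not say how the estimated speedups were computed. They are, however, numerically consistent with the Amdahl's law upper bound 1/(1 - f), where f is the fraction of execution time spent in the MQEncoder: f = 0.614 gives about 2.59, f = 0.715 about 3.51, and f = 0.728 about 3.68. The small C sketch below makes that arithmetic explicit. Treating the estimate as this bound is an inference from the table, not a statement from the source, and the function names are hypothetical.

```c
/* Amdahl's law: overall speedup when a fraction of the run time
 * (0 <= fraction < 1) is accelerated by kernel_speedup (> 0). */
double amdahl_speedup(double fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}

/* Limit of amdahl_speedup as kernel_speedup grows without bound,
 * i.e. the accelerated part takes no time at all. */
double amdahl_bound(double fraction)
{
    return 1.0 / (1.0 - fraction);
}
```

Under this reading, the gap between the estimated and actual columns reflects the time the hardware MQEncoder and its communication still take.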
Future Work
• Future Directions/Open Research:
  • Data Informed/Static Scheduling:
    • Development of static scheduling methods that are aware of the impact of data on the execution time of software/hardware tasks
    • Potential to produce a set of static schedules based upon data input/profiles
  • Distributed Synchronization Methods
    • Need efficient methods for synchronization – can be based upon the data profile and data stream of the application
    • Typical OS synchronization methods are capable but not efficient
    • Need for distributed synchronization methods (hThreads)
  • Synthesis-in-the-Loop:
    • Utilize synthesis tools during the HW/SW codesign process to better estimate actual performance and area utilization (i.e. design space exploration)
  • Optimal Synthesis of Communication Framework
    • Communication framework can be optimized for a specific application or set of specific task configurations
    • Consider alternative bus hierarchies, NoC communication networks, transaction scheduling, DMA controls, etc.
  • Efficient Runtime Partial Reconfiguration
    • Proof-of-concept demonstration of the approach with runtime partial reconfiguration
    • Many challenges ahead (future is bright, but path is treacherous)