Modeling and Codesign Methods for Data Adaptable Reconfigurable Embedded Systems

Roman Lysecky
Department of Electrical and Computer Engineering
University of Arizona

Collaborators: Jonathan Sprinkle, Jerzy Rozenblit, Michael Marcellin

Students: Andrew Milakovich, Vijay Shankar Gopinath, Sachidanand Mahadevan, Sean Whitsitt, Nathan Sandoval, Casey Mackin, Kyle Merry

This work was supported in part by the National Science Foundation under Grant CNS-0915010.



Introduction & Motivation: Data Adaptable Approach

• Increasingly Complex Application Demands
  • Complex Algorithms
  • Compute Intensive
  • Highly Configurable

• Example: JPEG2000 Image Compression

• Provides significant advantages – quality and compression – over JPEG standard

• Support for configurability at each processing stage (e.g. color transform, wavelet, block encoding, code stream)

• Results in high computational demands and a larger design space

Introduction & Motivation: Traditional SW & HW Solutions

[Figure: three system architectures. Software Only (single/multicore): one or more µPs. Hardware Accelerated (dedicated HW IP): µP with I$ and D$ plus a JPEG2000 co-processor. Reconfigurable (FPGA supporting dynamic reconfiguration): µP with I$ and D$ plus an FPGA coprocessor and bitstream memory.]

Goals                        Software Only   Hardware Accelerated   Reconfigurable
Configurability/Flexibility  Yes             No                     Yes
Performance                  No              Yes                    Yes

Introduction & Motivation: Data-Adaptable Reconfigurable Embedded Systems (DARES)

• Reconfigurable systems for highly configurable, compute-intensive applications
  • Can be reconfigured at runtime for immediate application needs
  • How and when to reconfigure is specific to the application and data input

• Goal: Reconfigure hardware tasks within FPGA based upon the current data profile

[Figure: a µP and a reconfigurable FPGA process an input stream (...10110000) into an output stream (10011000...). The FPGA holds Task A (512x512), Task B (5/3), and Task C (Casual). A new input stream (...000010100) arrives with a new data profile: 14 bits/channel, Task A (1024x1024, 4:4:2), Task B (Wavelet 5/3), Task C (Error Resilient). A Task A (1024x1024) implementation is shown alongside the set of HW task implementations (Task A, Task B, Task C, Task D).]

Introduction & Motivation: Data-Adaptable Reconfigurable Embedded Systems (DARES)

• DARES Components & Methodology
  • Model-driven framework for specifying application tasks, processing requirements, data configurability, and target data profiles for hardware support
  • Runtime middleware and communication framework for runtime communication, system reconfiguration, and process scheduling
  • Automated tool flow supporting the proposed methodology


DARES Approach: Design Methodology and Toolchain – Overview

1. Modeling Framework
2. SW Task Generation/Compilation
3. HW Coprocessor Generation/Synthesis
4. HW/SW Communication Framework
5. Final Software/Hardware Implementation

[Figure: DARES toolchain with steps labeled (1) to (5). Elements shown: Application and Data Profile Model; HW/SW Codesign (Model Interpreter); Software Threads, Communication Middleware, and Init. Code feeding the Software Compiler (gcc), which yields the Software Binary; Software Model for HW Tasks feeding ImpulseC CoDeveloper, which yields the Hardware Tasks; Hardware Tasks and the HW/SW Comm. Framework feeding Xilinx ISE, which yields the Hardware Task Bitstreams.]

DARES Approach: Design Methodology and Toolchain

• DARES Modeling Framework
  • Modeling language to express an application as a composition of Communicating Sequential Dataflow Tasks (CSDT)
  • Capture application and task level data profiles
  • Allow designers to specify configuration of tasks for the target data profiles
  • Perform design space exploration to determine the Pareto optimal system implementation
  • Generate source code for SW and HW task configurations

[Figure: Application and Data Profile Model within the HW/SW Codesign (Model Interpreter), showing the Application Tasks and Dataflow Model (JPEG) and the Task Configurations of the Row DCT Task. The Modeling Language consists of Types, Semantics, and Constraints.]

DARES Approach: Design Methodology and Toolchain

• DARES Modeling Language
  • Developed using Generic Modeling Environment (GME)
  • Types
    • Task – Models functional unit of application
    • Config – Models configurability of application task
    • TaskInstance – Models the instance of an application task
  • Constraints
    • Simple – Unique Identifiers
    • Legal dataflow specifications
    • 1-1 correspondence between IN and OUT ports in a Config and parent Task
  • Semantics (i.e. Model Interpreter)
    • Driven by Hardware Software Codesign methodology
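The port constraint above (1-1 correspondence between the IN and OUT ports of a Config and its parent Task) could be checked along the following lines. This is only a sketch: the `Port`, `Task`, and `Config` structs, their fields, and the function `config_ports_match_parent` are hypothetical names chosen for illustration. The actual DARES language is defined in GME, and its constraint mechanism is not shown in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PORTS 4  /* illustrative limit, not from the slides */

typedef enum { PORT_IN, PORT_OUT } PortDir;

typedef struct {
    const char *name;  /* unique identifier of the port */
    PortDir dir;
} Port;

typedef struct {
    const char *id;
    size_t num_ports;
    Port ports[MAX_PORTS];
} Task;

typedef struct {
    const char *id;
    const Task *parent;  /* the Task this Config implements */
    size_t num_ports;
    Port ports[MAX_PORTS];
} Config;

/* Two ports correspond when both name and direction agree. */
static bool same_port(const Port *a, const Port *b)
{
    return a->dir == b->dir && strcmp(a->name, b->name) == 0;
}

/* Counts how many entries of ports[0..n) correspond to p. */
static size_t count_port(const Port *ports, size_t n, const Port *p)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (same_port(&ports[i], p))
            ++count;
    return count;
}

/* Returns true when the Config ports and the parent Task ports are in
 * one-to-one correspondence: equal port counts, no duplicate port in the
 * Config, and every Config port matching exactly one Task port. */
bool config_ports_match_parent(const Config *cfg)
{
    assert(cfg != NULL && cfg->parent != NULL);
    const Task *task = cfg->parent;

    if (cfg->num_ports != task->num_ports)
        return false;
    for (size_t i = 0; i < cfg->num_ports; ++i) {
        const Port *p = &cfg->ports[i];
        if (count_port(cfg->ports, cfg->num_ports, p) != 1)
            return false;
        if (count_port(task->ports, task->num_ports, p) != 1)
            return false;
    }
    return true;
}
```

A Config that drops a port, or that declares a port with the wrong direction, is rejected by this check.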



DARES Approach: Design Methodology and Toolchain

[Figure: HW/SW Codesign (Model Interpreter) stages applied to the Application and Data Profile Model: Design Space Pruning, Latency Estimation, Optimization, Off-chip Memory Allocation, Source Code Generation.]

• DARES HW/SW Codesign Methodology
  • Design Space Pruning
    • Find all compatible combinations of specific task configurations
    • Subject to area constraint of FPGA
  • Latency Estimation
    • For all possible application configurations, estimates the end-to-end latency
    • Estimation considers:
      • Task configuration latency
      • Communication overhead
      • Required input/output data within all tasks
      • Mode of operation of specific task configurations
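The pruning and latency-estimation steps above can be sketched in C as follows. This is a minimal illustration, not the DARES implementation: the `TaskConfig`, `Task`, and `AppConfig` types, the function `prune_design_space`, and the purely additive latency model (task latency plus a per-task communication overhead) are assumptions made for this example. The slides also list required input/output data and mode of operation as inputs to the estimate; those are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8    /* illustrative limits, not from the slides */
#define MAX_CONFIGS 4

/* One implementation option for a task (hypothetical fields). */
typedef struct {
    unsigned area;     /* FPGA area units; 0 for a software configuration */
    unsigned latency;  /* task configuration latency, in cycles */
    unsigned comm;     /* communication overhead, in cycles */
} TaskConfig;

typedef struct {
    size_t num_configs;
    TaskConfig configs[MAX_CONFIGS];
} Task;

/* One application configuration: a chosen configuration index per task.
 * Only choice[0..num_tasks) is filled in. */
typedef struct {
    size_t choice[MAX_TASKS];
    unsigned area;     /* summed area of the chosen configurations */
    unsigned latency;  /* estimated end-to-end latency */
} AppConfig;

/* Enumerates every combination of task configurations with a mixed-radix
 * counter, keeps the combinations whose total area fits area_budget, and
 * stores at most max_out of them in out. Returns the number stored, or 0
 * if the task description is outside the illustrative limits. */
size_t prune_design_space(const Task *tasks, size_t num_tasks,
                          unsigned area_budget,
                          AppConfig *out, size_t max_out)
{
    size_t idx[MAX_TASKS] = {0};
    size_t count = 0;

    assert(tasks != NULL && out != NULL);
    if (num_tasks == 0 || num_tasks > MAX_TASKS)
        return 0;
    for (size_t t = 0; t < num_tasks; ++t)
        if (tasks[t].num_configs == 0 || tasks[t].num_configs > MAX_CONFIGS)
            return 0;

    for (;;) {
        unsigned area = 0, latency = 0;
        for (size_t t = 0; t < num_tasks; ++t) {
            const TaskConfig *c = &tasks[t].configs[idx[t]];
            area += c->area;
            latency += c->latency + c->comm;  /* additive model (assumed) */
        }
        if (area <= area_budget && count < max_out) {
            AppConfig *a = &out[count++];
            for (size_t t = 0; t < num_tasks; ++t)
                a->choice[t] = idx[t];
            a->area = area;
            a->latency = latency;
        }
        /* Advance the counter; stop once every digit has wrapped around. */
        size_t t = 0;
        while (t < num_tasks && ++idx[t] == tasks[t].num_configs) {
            idx[t] = 0;
            ++t;
        }
        if (t == num_tasks)
            break;
    }
    return count;
}
```

Exhaustive enumeration grows exponentially with the number of tasks, so it is only reasonable for small task graphs such as the four-task JPEG pipeline used in the first case study.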

DARES Approach: Design Methodology and Toolchain

[Figure: HW/SW Codesign (Model Interpreter) stage pipeline.]

• DARES HW/SW Codesign Methodology
  • Optimization - Design Space Pruning
    • Find Pareto optimal combinations of task configurations
    • Defines all possible application configurations that will yield best area/latency tradeoff
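A Pareto filter over (area, latency) pairs illustrates the optimization step above. This is a sketch under assumptions: the `DesignPoint` type and the function names are hypothetical, and a simple quadratic pairwise comparison stands in for whatever search the DARES interpreter actually performs. A point is kept when no other point is at least as good in both metrics and strictly better in one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One candidate application configuration (hypothetical type). */
typedef struct {
    unsigned area;     /* FPGA area used */
    unsigned latency;  /* estimated end-to-end latency */
} DesignPoint;

/* a dominates b when a is no worse in both metrics and strictly better
 * in at least one of them. */
static bool dominates(const DesignPoint *a, const DesignPoint *b)
{
    return a->area <= b->area && a->latency <= b->latency &&
           (a->area < b->area || a->latency < b->latency);
}

/* Copies the non-dominated points of in[0..n) into out, preserving their
 * order, and returns how many were copied. out must have room for n. */
size_t pareto_front(const DesignPoint *in, size_t n, DesignPoint *out)
{
    size_t kept = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        bool dominated = false;
        for (size_t j = 0; j < n && !dominated; ++j)
            if (j != i && dominates(&in[j], &in[i]))
                dominated = true;
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```

Identical points do not dominate each other, so duplicates are all kept; a caller that wants a unique set would need to deduplicate first.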

DARES Approach: Design Methodology and Toolchain

[Figure: HW/SW Codesign (Model Interpreter) stage pipeline.]

• DARES HW/SW Codesign Methodology
  • Off-chip Memory Allocation
    • Designer can specify a set of application profiles that must be supported
    • Designer can additionally choose from the Pareto optimal configurations
    • If off-chip configuration memory is still available, additional task configurations are selected for support to increase runtime adaptability
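One way to picture the allocation step above is a two-pass greedy fill of the off-chip configuration memory. The slides do not say how additional configurations are ranked or chosen, so the first-fit order used here is an assumption, as are the `HwTaskConfig` type, its fields, and the function `allocate_offchip`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A hardware task configuration competing for off-chip bitstream storage
 * (hypothetical type). */
typedef struct {
    unsigned bitstream_kb;  /* size of the configuration bitstream */
    bool required;          /* needed by a designer-specified profile */
    bool selected;          /* output: stored in off-chip memory */
} HwTaskConfig;

/* Pass 1 selects every required configuration. Pass 2 walks the remaining
 * configurations in the order given (assumed to be ranked by the caller)
 * and selects each one that still fits. Returns the memory left over in
 * KB, or -1 if the required set alone exceeds capacity_kb. */
long allocate_offchip(HwTaskConfig *cfgs, size_t n, unsigned capacity_kb)
{
    unsigned long used = 0;

    assert(cfgs != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        cfgs[i].selected = cfgs[i].required;
        if (cfgs[i].required)
            used += cfgs[i].bitstream_kb;
    }
    if (used > capacity_kb)
        return -1;

    for (size_t i = 0; i < n; ++i) {
        if (!cfgs[i].selected && used + cfgs[i].bitstream_kb <= capacity_kb) {
            cfgs[i].selected = true;
            used += cfgs[i].bitstream_kb;
        }
    }
    return (long)(capacity_kb - used);
}
```

First-fit is simple but not optimal; a ranking by expected use of each data profile would be a natural refinement.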

DARES Approach: Design Methodology and Toolchain

[Figure: HW/SW Codesign (Model Interpreter) stage pipeline, with Code Synthesis as the final stage.]

• DARES HW/SW Codesign Methodology
  • Code Synthesis
    • Source files generated for all SW task configurations and selected HW task configurations
    • Transforms input C code to a Pthread implementation
    • Communication Middleware APIs provide access methods for the input and output buffers identified by unique IDs

DARES Approach: Design Methodology and Toolchain

[Figure: Software Threads, Communication Middleware, and Init. Code feeding the Software Compiler (gcc).]

• Software Task Generation and Compilation
  • HW/SW Codesign Interpreter transforms the C code for application task configurations
    • Generates a Pthread implementation for all SW task configurations
  • Communication Middleware
    • APIs provide access methods for the input and output buffers identified by the specific tasks

// Original Task Configuration code

    void FuncName(){
    #pragma DARES_DECL_PART
        int data[64];
        ...
    #pragma DARES_COMP_BEGIN
    #pragma DARES_READ_INTO(data)
        // Computation
    #pragma DARES_WRITE_FROM(data,64)
    #pragma DARES_COMP_END
    }

// Pthread implementation (generated by the Codesign Interpreter from the code above).

    void* FuncName(){
        ...
        INTx DARES_SAMPLE_INPUT;
        int DARES_loop_iter;
        do {
            ...
            do {
                // Read DEPTH input tokens from the input FIFO (ID1)
                for( DARES_loop_iter = 0; DARES_loop_iter<DEPTH; ++DARES_loop_iter ) {
                    if ( Fifo_Read_Single( ID1, &DARES_SAMPLE_INPUT ) == 0 )
                        DARES_INPUT[DARES_loop_iter] = DARES_SAMPLE_INPUT;
                }
                ...
                // Write TOKENS output tokens to the output FIFO (ID2)
                for( DARES_loop_iter = 0; DARES_loop_iter<TOKENS; ++DARES_loop_iter ) {
                    Fifo_Write_Single( ID2, &DARES_OUTPUT[DARES_loop_iter] );
                }
            } while(!Fifo_Eos(ID1));
            ...
        } while(1);
    }
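To make the middleware calls in the generated thread code more concrete, here is a sketch of a software FIFO between one producer thread and one consumer thread. It is not the DARES Communication Middleware: the `SketchFifo` type and the `sketch_fifo_*` functions are hypothetical, the real API addresses FIFOs by ID rather than by pointer, and the lock-free single-producer/single-consumer ring buffer built on C11 atomics is a technique chosen for this example. The only detail taken from the slides is that a successful read returns 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Capacity in tokens. A power of two keeps the index arithmetic correct
 * even if the free-running counters wrap around. */
#define FIFO_DEPTH 64

typedef struct {
    int data[FIFO_DEPTH];
    atomic_size_t head;  /* tokens read so far; advanced by the consumer */
    atomic_size_t tail;  /* tokens written so far; advanced by the producer */
    atomic_bool eos;     /* set once the producer has finished the stream */
} SketchFifo;

void sketch_fifo_init(SketchFifo *f)
{
    assert(f != NULL);
    atomic_init(&f->head, 0);
    atomic_init(&f->tail, 0);
    atomic_init(&f->eos, false);
}

/* Producer side. Returns 0 on success, -1 if the FIFO is full. */
int sketch_fifo_write_single(SketchFifo *f, const int *token)
{
    size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (tail - head == FIFO_DEPTH)
        return -1;
    f->data[tail % FIFO_DEPTH] = *token;
    /* Release: the token is visible before the new tail value is. */
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side. Returns 0 on success, -1 if the FIFO is empty. */
int sketch_fifo_read_single(SketchFifo *f, int *token)
{
    size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head == tail)
        return -1;
    *token = f->data[head % FIFO_DEPTH];
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return 0;
}

/* Producer signals that no more tokens will be written. */
void sketch_fifo_close(SketchFifo *f)
{
    atomic_store(&f->eos, true);
}

/* True once the stream is closed and every token has been consumed. */
bool sketch_fifo_eos(SketchFifo *f)
{
    bool closed = atomic_load(&f->eos);
    return closed && atomic_load(&f->head) == atomic_load(&f->tail);
}
```

The calls are non-blocking, so a thread that finds the FIFO empty or full has to retry or yield; a blocking variant would typically use a mutex and condition variables instead.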

DARES Approach: Design Methodology and Toolchain

[Figure: Software Model for HW Tasks feeding ImpulseC CoDeveloper, which produces the Hardware Tasks.]

• Hardware Coprocessor Generation and Synthesis
  • HW/SW Codesign Interpreter generates an ImpulseC function for all HW task configurations
    • Utilizes co_stream interface for FIFO input/output
  • Utilize ImpulseC CoDeveloper to synthesize VHDL implementations
    • Provides rich support for optimizing loops and analyzing the pipelined loops

// Original Task Configuration code

    void FuncName(){
    #pragma DARES_DECL_PART
        int data[64];
        ...
    #pragma DARES_COMP_BEGIN
    #pragma DARES_READ_INTO(data)
        // Computation
    #pragma DARES_WRITE_FROM(data,64)
    #pragma DARES_COMP_END
    }

// ImpulseC implementation (generated by the Codesign Interpreter from the code above).

    void FuncName( co_stream fifo1, co_stream fifo2 ){
        ...
        INT8 DARES_SAMPLE_INPUT;
        int DARES_loop_iter;
        do {
            ...
            do {
                // Read DEPTH input tokens from the input stream
                for( DARES_loop_iter = 0; DARES_loop_iter<DEPTH; ++DARES_loop_iter ) {
                    if ( co_stream_read(_INFIFO_, &DARES_SAMPLE_INPUT, sizeof(WIDTH1)) == co_err_none )
                        DARES_INPUT[DARES_loop_iter] = DARES_SAMPLE_INPUT;
                }
                ...
                // Write TOKENS output tokens to the output stream
                for( DARES_loop_iter = 0; DARES_loop_iter<TOKENS; ++DARES_loop_iter ) {
                    co_stream_write(_OUTFIFO_, &DARES_OUTPUT[DARES_loop_iter], sizeof(WIDTH2));
                }
            } while(!co_stream_eos(_INFIFO_));
            ...
        } while(1);
    }

DARES Approach: Design Methodology and Toolchain

• Hardware/Software Communication Framework
  • Hardware coprocessors integrated with hardware/software communication framework
  • Supports seamless communication between software and hardware tasks
    • In conjunction with communication middleware
  • Efficient communication mechanisms supported for communication between adjacent and non-adjacent hardware tasks

[Figure: Hardware Tasks and the HW/SW Comm. Framework feeding Xilinx ISE. Each User IP (HW Task) connects to the System Bus (PLB) through a memory-mapped Bus Interface with FIFO Out and FIFO2Bus blocks. Signals: Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, Fpin_full.]

DARES Approach: Hardware/Software Communication Methods

• Software to Software (SW Buffer)
• Software to Software (HW FIFO)

[Figure: for each method, the µP and Mem with a chain of tasks, next to a User IP (HW Task) attached to the System Bus (PLB) through a memory-mapped Bus Interface with FIFO Out and FIFO2Bus blocks (signals Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, Fpin_full).]

DARES Approach: Hardware/Software Communication Methods

• Software to Hardware
• Hardware to Hardware (Adjacent)

[Figure: for each method, the µP and Mem with a chain of tasks, next to a User IP (HW Task) attached to the System Bus (PLB) through a memory-mapped Bus Interface with FIFO Out and FIFO2Bus blocks (signals Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, Fpin_full).]

DARES Approach: Hardware/Software Communication Methods

• Hardware to Hardware (Non-Adjacent)
• Hardware to Software

[Figure: for each method, the µP and Mem with a chain of tasks, next to a User IP (HW Task) attached to the System Bus (PLB) through a memory-mapped Bus Interface with FIFO Out and FIFO2Bus blocks (signals Fpout_wren, Fpout_wdata, Fpout_full, Fpin_wren, Fpin_wdata, Fpin_full).]
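The six communication methods named on these slides can be summarized as a mapping from a dataflow edge to a method. The sketch below is illustrative only: the `Domain`, `CommMethod`, and `DataflowEdge` types and the function `select_comm_method` are hypothetical, and the slides do not say how the choice between the two software-to-software variants is made, so it is passed in as a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DOMAIN_SW, DOMAIN_HW } Domain;

/* The communication methods listed on the slides. */
typedef enum {
    COMM_SW_TO_SW_BUFFER,
    COMM_SW_TO_SW_HW_FIFO,
    COMM_SW_TO_HW,
    COMM_HW_TO_HW_ADJACENT,
    COMM_HW_TO_HW_NON_ADJACENT,
    COMM_HW_TO_SW
} CommMethod;

/* One producer/consumer edge of the task dataflow graph (hypothetical). */
typedef struct {
    Domain producer;
    Domain consumer;
    bool adjacent;     /* meaningful only for HW-to-HW edges */
    bool via_hw_fifo;  /* meaningful only for SW-to-SW edges */
} DataflowEdge;

/* Maps an edge to the method that would carry its tokens. */
CommMethod select_comm_method(const DataflowEdge *e)
{
    assert(e != NULL);
    if (e->producer == DOMAIN_SW && e->consumer == DOMAIN_SW)
        return e->via_hw_fifo ? COMM_SW_TO_SW_HW_FIFO : COMM_SW_TO_SW_BUFFER;
    if (e->producer == DOMAIN_SW)
        return COMM_SW_TO_HW;
    if (e->consumer == DOMAIN_SW)
        return COMM_HW_TO_SW;
    return e->adjacent ? COMM_HW_TO_HW_ADJACENT : COMM_HW_TO_HW_NON_ADJACENT;
}
```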

DARES Approach: Design Methodology and Toolchain

• Final Hardware/Software Implementation
  • Software threads and hardware coprocessors combined for final system implementation
  • Requires manual (although automatable) creation of system initialization code
  • System configuration for current data profile not supported at runtime

[Figure: the Software Binary runs on the µP, and the Hardware Task Bitstreams are stored in the Coprocessor Bitstream Memory of the Reconfigurable FPGA.]

DARES Approach: Case Study 1 – JPEG (not 2000)

• Experimental Setup
  • Consider a JPEG image compression application
  • Generated software and hardware implementations for JPEG encoding tasks using DARES toolchain
    • Discrete cosine transform (dct), quantization (qnt), zig-zag ordering (zz), and run-length encoding (rle)
    • Independently verified software and hardware accelerated implementations
  • Evaluated system performance for all combinations of hardware coprocessors available within system
    • Manually configured communication between SW and HW tasks to measure system performance of HW accelerated implementation
    • Virtex-5 FX FPGA (ML507 board)
    • 400 MHz PowerPC processor with 100 MHz PLB bus

DARES Approach: Case Study 1 – JPEG (not 2000)

• Experimental Results
  • Achieves performance improvement of 1.8X for a single hardware task and up to 5X with all tasks executing in hardware
    • Compared to software-only implementations
  • The communication method (e.g. DMA, bus hierarchy, NoCs) and latency need to be considered in determining the Pareto optimal system configuration

DARES Approach: Case Study 2 – JPEG2000 (the real deal…sort of)

• Experimental Setup
  • Consider JPEG2000 compression using the JasPer software implementation
  • All stages can be configured differently for lossy or lossless compression
  • Bulk of the execution time is spent in Tier 1 Encoder (typically 50% or more)
    • Block Encoder
    • Bit Plane Encoder
    • MQEncoder

[Figure: JPEG2000 encoder stages: Forward Multi-Component Transform, Forward Wavelet Transform, Quantization, Tier 1 Encoder, Tier 2 Encoder, with Rate Control.]

DARES Approach: Case Study 2 – JPEG2000 (the real deal…sort of)

• Experimental Setup
  • Utilized DARES approach to create data-profile specific implementation of MQEncoder
    • Data profile supporting 32x32 block size in HW task
    • All other block sizes supported in software
  • Separated JPEG2000 software into multiple threads
    • i.e. extracted MQEncode process as separate thread
  • Adapted software to fit DARES dataflow model
  • Modeling environment and toolchain used to generate SW and HW source

• Results:

Image Format  Image Size (KB)  # of Blocks  MQEncoder % Exec. Time  Estimated Speedup  Actual Speedup
BMP           37               48           61.4                    2.6X               2.5X
BMP           257              256          71.5                    3.5X               3.3X
PGM           147              167          72.8                    3.7X               3.5X
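The slides do not state how the estimated speedup was derived. As an observation rather than the authors' method, the estimates in the table agree with the Amdahl's law upper bound 1 / (1 - f), where f is the fraction of execution time spent in the MQEncoder and the accelerated part is assumed to take negligible time. The sketch below computes that bound; the function names are chosen for this example.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: overall speedup when a fraction of the execution time
 * (0 <= fraction < 1) is accelerated by a factor accel (> 0). */
double amdahl_speedup(double fraction, double accel)
{
    assert(fraction >= 0.0 && fraction < 1.0 && accel > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / accel);
}

/* Upper bound on the speedup: the accelerated part takes no time at all. */
double amdahl_limit(double fraction)
{
    return amdahl_speedup(fraction, INFINITY);
}
```

For the three table rows, `amdahl_limit(0.614)`, `amdahl_limit(0.715)`, and `amdahl_limit(0.728)` evaluate to about 2.59, 3.51, and 3.68, which round to the 2.6X, 3.5X, and 3.7X listed as estimates. The measured speedups are slightly lower, as expected when the hardware task and its communication take nonzero time.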

Future Work

• Future Directions/Open Research:
  • Data Informed/Static Scheduling:
    • Development of static scheduling methods that are aware of the impact of data on execution time of software/hardware tasks
    • Potential to produce a set of static schedules based upon data input/profiles
  • Distributed Synchronization Methods
    • Need efficient methods for synchronization, which can be based upon data profile and data stream of application
    • Typical OS synchronization methods are capable but not efficient
    • Need for distributed synchronization methods (hThreads)
  • Synthesis-in-the-Loop:
    • Utilize synthesis tools during HW/SW codesign process to better estimate actual performance and area utilization (i.e. design space exploration)
  • Optimal Synthesis of Communication Framework
    • Communication framework can be optimized for a specific application or set of specific task configurations
    • Consider alternative bus hierarchies, NoC communication networks, transaction scheduling, DMA controls, etc.
  • Efficient Runtime Partial Reconfiguration
    • Proof of concept demonstration of approach with runtime partial reconfiguration
    • Many challenges ahead (future is bright, but path is treacherous)

Question?

Thank You!