
Automated Generation of Hardware Accelerators From Standard C

David Lau Altera Santa Cruz [email protected]

Orion Pritchard Altera Santa Cruz

[email protected]

Abstract

Methodologies for synthesis of stand-alone hardware modules from C/C++-based languages have been gaining adoption for embedded system design as an essential means to stay ahead of increasing performance, complexity, and time-to-market demands. However, using C to generate stand-alone blocks does not allow for truly seamless unification of embedded software and hardware development flows.

This paper describes a methodology for generating hardware accelerator modules that are tightly coupled with a soft RISC CPU, its tool chain, and its memory system. This coupling allows for several significant advancements: (1) a unified development environment with true push-button switching between original software and hardware-accelerated implementations, (2) direct access to memory from the accelerator module, (3) full support for pointers and arrays, and (4) latency-aware pipelining of memory transactions.

We also present results of our implementation, the C2H Compiler. Eight user test cases on common embedded applications show speedup factors of 13-73X achieved in less than a few days.

1. Introduction

As embedded systems increase in size and complexity, time to market is becoming a critical constraint. Demanding performance requirements necessitate the use of custom logic, but traditional methods of designing with register-transfer-level hardware description languages do not scale well to today’s large systems. Furthermore, integration of large numbers of components in a complex system is becoming a significant portion of the development cost. This problem is commonly referred to as the design gap—the difference between device density and developer productivity. As Moore’s Law outpaces advancements in design tools and methodologies, this gap widens.

FPGAs are becoming increasingly appropriate solutions to the challenges faced by the embedded systems designer. Modern devices are sufficiently large, fast, and cost-competitive to contain many or all of the system components, with the benefits of rapid prototyping and field programmability. Furthermore, many design tools exist that facilitate rapid development of complete embedded systems on a programmable chip.

The constant increase in modern device density and performance motivates and justifies movement to a higher level of design abstraction. This movement is achieved in many different ways. At the most basic level, developers have increased productivity through significant usage of standard cells and intellectual property logic blocks. In some cases, these blocks are complete soft processors to which application-specific functionality is added through custom instructions and peripherals. Furthermore, several different tools can be used to automatically generate interconnect logic for IP cores and custom logic blocks that have standardized interfaces.

The most promising source of a significant increase in design abstraction is the recent trend towards high-level synthesis. Many tools have been introduced that can synthesize a custom hardware block from the user’s behavior-level description. The input to these tools is written in a high-level language that resembles C with additional non-standard syntax for expressing things like parallelism, resource binding, hierarchy, ports, and I/O.

A myriad of C-based languages and methodologies for hardware synthesis have emerged over the last 15 years. Page and Luk’s occam solution (a concurrent programming language) evolved into Handel-C, a subset of the C language with occam-like extensions for expressing parallelism [5, 15]. Catapult C uses a synthesizable subset of C++ [11]. Impulse C extends the language with types, functions, and libraries [14]. SystemC augments C++ with a class library that provides hardware-oriented constructs [12]. Transmogrifier C restricts the language and adds one extension for bit-level type sizing [8]. Many other variations also exist.

Traditionally, C-to-gates methodologies have sought to address the problem of generating stand-alone hardware modules. In this paper, we take a very different approach by investigating the generation of coprocessors that offload and enhance performance of a microprocessor running software written in C. This methodology addresses several important issues: (1) tight integration with a software design flow, including true push-button acceleration of critical computations prototyped or already running in C on a conventional microprocessor or digital signal processing (DSP) device; (2) direct connection of generated hardware accelerators into the processor's memory map; (3) seamless support for pointers and arrays; and (4) efficient latency-aware scheduling and pipelining of memory transactions.

CP-01027-1.0 April 2007

In this paper, we describe our implementation of the above methodology, the C-to-Hardware (C2H) Acceleration Compiler. The C2H Compiler generates, from pure ANSI/ISO standard C functions, hardware accelerators that have direct access to memory and other peripherals in the processor’s system. Recursion and floating-point types are the only major exclusions from standard C. Pointers, arrays, structures, and enums, as well as all loop types (including break, continue, and return statements) are fully supported. The C2H Compiler uses an existing commercial system integration tool to connect the accelerator to the processor and any other peripherals in the system. This gives the accelerator direct access to a memory map identical to that of the CPU, allowing seamless support for pointers and arrays when migrating from software to hardware.

The cockpit for the C2H Compiler is the CPU’s integrated software development environment. By supporting pointers and un-extended ANSI/ISO C, the C2H Compiler allows developers to quickly prototype a function in software running on the processor, then switch to a hardware-accelerated implementation with the push of a button.

Analysts have identified incompatibility between hardware and software design flows [9] as one of the critical problems in the EDA industry. Our methodology addresses this problem by enabling development of embedded software and embedded hardware accelerators, as well as management of hardware/software tradeoffs, from one environment in one language.

In the next section, we provide background information about the underlying tools that support our implementation of this methodology. Sections 3 through 5 discuss architectural issues: design flow, interface to the CPU, and direct memory access, while Section 6 describes the compilation process, including latency-aware scheduling and pipelining. Lastly, we present the results of user testing in Section 7 and conclude in Section 8.

2. Supporting Tools

The methodology described in this paper relies heavily on the tool chain that integrates the hardware accelerator with an associated processor, as well as all other peripherals in a complete embedded system. This section provides background information on these tools. Later sections will discuss, in detail, their roles in supporting the C2H Compiler.

2.1 System Integration

The C2H Compiler makes use of Altera® SOPC Builder, a system integration tool that allows designers to specify and connect components such as processors, memories, and peripherals in a GUI. Based on the component list and connectivity matrix, SOPC Builder outputs HDL for all components in the system, as well as a top-level HDL design file that contains automatically generated interconnect and arbitration logic [1]. It also generates configured ModelSim® projects, scripts, and testbenches.

Components can be intellectual property blocks as well as custom logic. SOPC Builder contains an importer called Component Editor that allows the user to easily create and publish new components from their own Verilog or VHDL design files, which are then portable across systems.

C2H accelerators are custom-generated SOPC Builder components automatically inserted into the system. When HDL is generated for the system, the accelerator is connected, just like any other component, to the remainder of the system using a programmatically generated interconnect fabric.

2.2 System Interconnect

The Avalon® switch fabric is a collection of signals and logic that connects components into a memory-mapped register system. It resembles a typical shared system bus in functionality, but uses dynamically generated wires, registers, and slave-side arbiters that increase performance. Components interface to the rest of the system through the Avalon master and slave ports that, when combined with the matrix of master-slave connections maintained by SOPC Builder, encapsulate the entire system interconnect.

In a traditional shared-system bus architecture, arbiters on the master side grant control of the entire shared bus, so only one master can control the bus at any given time. As shown in Figure 1, this creates a bottleneck, since all other masters must wait for control even if they are attempting to access different slaves. In contrast, the Avalon switch fabric mitigates this bottleneck by creating slave-side arbiters that allow multiple masters to issue transactions simultaneously. Arbitration occurs between masters only if they contend for the same slave.

Figure 1. Traditional System Bus

Figure 2 shows, from a high level, the Avalon switch fabric connecting masters and slaves in a simple system. Arbitration modules that manage requests from different masters are inserted in front of each slave port when necessary. These modules abstract the system's interconnect details from master and slave ports alike [2]. If a particular slave has only one master connection, then no arbiter is necessary. Such is the case for the instruction memory, data memory, and DMA control slaves in this example.

A detailed discussion of the features of the Avalon switch fabric is outside the scope of this paper, but we will give a brief overview of two highlighted features: dynamic bus sizing and burst management. Other features (not discussed) include address decoding, data-path multiplexing, wait-state insertion, pipelining, arbitration for simultaneous multi-mastering, clock domain crossing, and off-chip interfaces.

2.2.1 Dynamic Bus Sizing

The Avalon switch fabric is capable, through its dynamic bus sizing functionality, of generating logic that hides the details of interfacing master and slave ports of different widths. If a slave port that has been designated to use dynamic bus sizing is connected to a wider master, then logic is generated that automatically issues multiple read/write transactions to the slave for each request made by the master, as shown in Table 1. If it is connected to a narrower master, then logic is generated that uses byte-enable lines (for write transactions) and/or shifts the read/write data based on the unused least significant address bits.

This is typically useful for slave ports on memory devices, whereas peripheral registers would more likely use native address alignment, which zero-pads the read data in the case of the wider master and discards the least significant bits of the address in the case of the narrower master.

2.2.2 Burst Management

Burst transfers separate the address and data phases of a memory transaction, allowing the master to read/write multiple words, uninterrupted, without requesting additional transfers or updating the address. The Avalon switch fabric supports burst transfers and also manages burst length mismatches between masters and slaves. If a master requests a burst longer than the slave supports, the switch fabric locks the arbiter and issues multiple bursts (or individual transfers if necessary). Figure 3 shows an Avalon bursting read transfer.

2.3 Soft RISC Processor

The C2H Compiler integrates with the Nios® II family of soft embedded microprocessors. These are FPGA-optimized general-purpose RISC architectures with a 32-bit instruction set and data path, 32 general-purpose registers, three instruction formats, 82 three-operand instructions, up to 256 custom instructions, and optional hardware multiply and divide [3].

Table 1. 32-bit Master View of Data in a 16-bit Slave

Figure 3. Avalon Bursting Read Transfer

Figure 2. Avalon Switch Fabric


Users traditionally have been able to boost performance of critical computations in two ways: (1) by extending the processor’s instruction set with custom hardware instructions written in HDL and imported into the CPU’s logic, and (2) by creating custom hardware accelerator modules that are imported into SOPC Builder and connected directly into the Avalon switch fabric. Both are advantageous in different scenarios. The custom instruction is appropriate for short, frequent computations and has very low overhead. The accelerator module benefits from direct memory access, and can be designed with multiple Avalon master and slave ports that transfer data simultaneously, allowing for extremely high bandwidth in complex tasks and computations. It can also operate autonomously – without intervention from the CPU.

The typical design flow is for the user to establish a baseline hardware platform, develop software for the microprocessor, profile and identify critical computations, implement hardware accelerators in HDL, and integrate the accelerators into the hardware platform, iterating over the last three steps until performance requirements are met. However, the C2H Compiler automates creation of hardware accelerators directly from C code.

C2H accelerators are coprocessors that are fully integrated into the microprocessor and its entire software tool chain. An important distinction must be made here—C2H accelerators are not custom instructions (although custom instructions are a fully supported feature of the processor). Custom instructions extend the arithmetic logic unit of a processor by accepting two scalar operands and returning a scalar value. C2H accelerators are complete coprocessors that accept an unlimited number of arguments (which can be scalars, arrays, structures, etc., passed by value or reference) and are capable of manipulating memory and other peripherals in the system. Although it is perfectly viable to build a tool that automatically generates custom instructions from descriptions in C, the accelerator methodology was chosen for its broader scope: its potential for increasing overall performance is much larger.

Although our implementation of this methodology currently only supports Nios II, it is not necessarily processor-specific. It could be applied in other environments and integrated into the software development flows for other microprocessors. However, our CPU’s Eclipse-based integrated development environment (IDE) and integration with SOPC Builder greatly facilitate generation of hardware coprocessors that extend it.

3. Design Flow

Our design flow begins with a C/C++ project in the software IDE, containing an ANSI C function that has been written for acceleration or identified as a bottleneck in existing code and selected for acceleration. The flow proceeds as follows:

1. Prototype and debug in software IDE.

2. Select target function(s) for acceleration in IDE.

3. Run C2H analysis on the accelerated functions to generate a detailed report file containing resource usage, performance information, warnings about parts of the code that the compiler was unable to optimize, and critical data paths (such as a loop-carried dependency between two pointers passed in as input arguments). This process takes a few seconds to a few minutes, depending on the complexity of the function, on the order of a conventional software compilation.

4. Optimize and iterate.

5. Enable the IDE option to generate an updated device-programming file. The system generation tool then generates logic for the entire system and runs synthesis and place and route on the design.

6. Program hardware image into the device.

7. Verify performance by running the software project, which now makes use of the hardware accelerators, from the IDE.

This design flow provides the user with the capability to drive the entire software/hardware partitioning process from the software IDE. The implementation details of connecting the hardware block to the system and enabling it to communicate with the CPU, memory, and other peripherals are abstracted away.

Additionally, because the same ANSI/ISO C language is used for software and hardware development, the user can switch between running the function in software, quickly analyzing it for acceleration, or running the complete process to generate an updated hardware image. This gives the user the tools to rapidly iterate over debugging and optimizing the function, without experiencing the long delays of synthesis and placement.

Figure 4 shows how the C2H Compiler integrates into the software build process in the IDE. The left half of the flowchart shows the standard C compilation of main.c and accelerator.c, as it occurs without acceleration. The right half of the flowchart shows the hardware compilation process invoked when a function in accelerator.c is accelerated. It also shows the generation and selective linking of the accelerator driver (discussed in section 4) into the executable file.


When prototyping and optimizing accelerators, the developer does not need to run this complete process, as options in the IDE provide for switching between (1) the software-only build process, (2) software + C2H analysis/reporting, or (3) software + complete hardware generation (shown in the flowchart). This allows for fast debug and optimization iterations during the early stages of development, as well as automated integration of the entire hardware flow during later stages.

4. Accelerator Interface to the CPU

As shown in Figure 6, each accelerator has a slave port that acts as a control interface to its main processor. The slave port contains a START bit, as well as a STATUS bit. It also has a bank of registers that accept values of the accelerated function’s input arguments. If the function returns a value, the slave port contains a read-only register that holds it. Because the hardware module is an SOPC Builder component, the CPU can access these registers simply by issuing read and write operations to their locations. Figure 5 shows the CPU mastering an accelerator control slave in a simple system.

For each accelerator, the C2H Compiler emits a C driver that has the same signature as the original function. This driver is compiled into the software project with the original source code for the function. The linker, depending on whether or not the user has selected to run the function in hardware, chooses the appropriate definition of the function. This process is completely automated—the user need only build the project, and the tool chain proceeds according to the user's settings.

There are two modes of accelerator operation: polled and interrupt-driven. If the user has specified polled mode, the driver consists of four operations:

1. Load the input arguments into the accelerator's register bank.

2. Set the START bit.

3. Poll the STATUS bit until the hardware signals completion.

4. Read the return value and exit.

The driver function uses memory-mapped read/write operations to communicate with the accelerator. After the driver function exits, the processor then resumes execution of the program, just as if the function had run in software.

Figure 4. C2H Integration of Hardware Compilation Into the Software Build Process

Currently there is no functionality implemented for handling cache coherency. If the CPU has been configured with a data cache, the driver flushes it before starting the accelerator. The user can specify to skip the cache flush if no cacheable data is shared (the Nios II processor allows data to be allocated as non-cacheable). The coherency issue will be addressed at a future date.

If the user has specified interrupt-driven mode, initializing and starting the accelerator remains the same but the polling loop and return value fetch are removed. The compiler adds a hardware interrupt port to the accelerator that is asserted upon completion and emits a header file that contains macros for clearing the interrupt and reading the return value for the function. Using these macros, interrupt service routines can be written to allow concurrent operation of the processor and multiple hardware accelerators. Furthermore, the hardware can be integrated for use with a real-time operating system.

5. Direct Memory Access

Handling pointers is one of the most formidable challenges to a complete and elegant C-to-gates solution. In software, pointers and indirection are very well defined—every variable declared (statically or dynamically) has an address, and dereferencing a pointer is merely a load or store operation at its address. In a stand-alone hardware block, the definition becomes blurred. There is no natural address space, and a single variable can map to many different registers and wires due to pipelining and speculative execution. As De Micheli noted in 1999, the difficulty of translating constructs such as pointers into hardware has motivated many researchers to avoid them by restricting their languages [7].

By limiting support for pointers, the utility of the language is significantly reduced, as most nontrivial algorithms use arrays and pointers to operate on data structures in memory. This forces the user to spend time overhauling the original code into an acceptable form that does not use any unsupported constructs.

For pointer operations to be meaningful and to migrate seamlessly into hardware, the accelerated function must have access to the same memory map that it did when running in software. This is also necessary for easy transfer of data between the accelerator and the CPU, as well as the other peripherals in the system.

Callahan, Hauser, and Wawrzynek identified the need for hardware coprocessors to have direct paths to the processor's memory system during development of the Garp Architecture and C Compiler [4]. Their solution combined a single-issue MIPS processor core with reconfigurable hardware to be used as an accelerator. The reconfigurable hardware had four memory buses, one for random accesses and three for sequential accesses, which all shared an address bus. These buses had access to the main CPU's memory through the caches, but were limited to only one random request per cycle.

The solution presented here imposes no restrictions on the bandwidth into and out of the accelerator, other than the bandwidth limitations of the connected memories. When the C2H Compiler creates hardware for a function, it generates Avalon master ports for pointer and array operations, as well as operations that access static and global variables allocated on the stack or heap. These master ports allow access to memory and other peripherals in the system, and are capable of operating independently, in parallel. While one or more masters fetch data from input buffers, others write data to output buffers, all on the same clock cycle. Figure 6 shows an accelerator with multiple read and write masters. Since an accelerator can have an arbitrary number of master ports, the memory bandwidth is limited only by the number of slave ports to which they can connect.

Dedicated master ports are generated for groups of pointer operations that are determined to access the same slave port (creating a bottleneck at the slave). Such operations are scheduled with a dependency on a shared resource to prevent unnecessary pipeline stalls and instead allow the compiler to schedule non-dependent operations during the transactions.

Optimal accelerator performance is achieved when large data structures used in critical loops are stored in system modules with separate slave ports. This way, the C2H Compiler is able to use dedicated master ports that connect to each of the slaves and transfer data concurrently. This is easily achieved by using register banks and on-chip memory blocks embedded in the FPGA fabric, which can either be instantiated manually into the system or inferred by declaring an array in the accelerated function.

Figure 5. Accelerator Interfaces to the CPU and System Memory

6. Accelerator Compilation Process

The C2H Compiler operates in four main stages: (1) control/data flow graph analysis, (2) pointer analysis, (3) scheduling and pipelining, and (4) HDL generation. A discussion of each stage follows.

6.1 Control/Data Flow Graph Analysis

The C2H Compiler extracts instruction-level parallelism from the user's source code through construction and analysis of control and data flow graphs (DFGs). DFG edges are analyzed using methods similar to those described by Coutinho, Jiang, and Luk in 2005 [6]. A number of compiler optimizations are performed, including strength reduction, loop inversion, constant folding, constant propagation, and common subexpression elimination.

6.2 Pointer Analysis

Pointer analysis is of paramount importance to the performance of any compiler targeting a parallel architecture. Failure to determine that two pointers do not alias can cost significant performance, because the corresponding operations must then be forced to execute sequentially rather than in parallel. Although there are many cases in which static analysis cannot extract precise points-to information, most pointers are resolvable at compile time, as Semeria and De Micheli found in 1998 [15].

The C2H Compiler does not attempt to resolve the exact location or variable that a pointer references; it simply attempts to determine whether two pointers overlap. This analysis is performed by extracting and comparing a polynomial decomposition of the address expression of each pointer operation. If two expressions differ by a nonzero constant offset, the accesses do not overlap. The analysis is performed iteratively for pointer operations in loops, because the C2H Compiler pipelines successive iterations and must therefore consider loop-carried edges.

In many cases, this analysis technique is sufficient for extracting parallelism from conventional C code. However, two simple methods are supported by which the developer can provide additional information about pointers in their accelerated functions.

6.2.1 The Restrict Pointer Type Qualifier

The restrict type qualifier was introduced in the ISO C99 specification and provides a means by which the user can instruct the compiler to ignore any aliasing implications of dereferencing the qualified pointer. It is not a C2H language extension. restrict is perhaps best known for distinguishing memcpy from memmove: the pointer arguments to memcpy are qualified with restrict, whereas the pointer arguments to memmove are not.

6.2.2 Connection Pragma

The C2H Compiler introduces a single optional pragma used to specify connections between accelerator master ports and other slave ports in the system. However, this pragma is a placeholder for a more elegant solution, which is defined in the ISO TR18037 Technical Report on extensions for the programming language C to support embedded processors [10]. The technical report proposes support for system-specific space qualifiers to associate memory spaces with declarations. These methods will likely be adopted when TR18037 gains widespread recognition.

Figure 6. Accelerator Master Ports (and Control Slave)

6.3 Latency-Aware Scheduling and Pipelining

Execution of operations in an accelerator is controlled by a hierarchy of finite state machines. One state machine is generated for every loop and subroutine contained in the target function. Thus, scheduling and pipelining are matters of arranging operations inside this hierarchy and assigning state numbers based on the optimized data flow graph.

Similar to other C-to-gates methodologies, most assignments are translated into combinational logic representing the RHS expression, and then registered for pipelining. There are two exceptions to this rule. First, complex arithmetic operations such as multiplication, division, barrel shifting, and pointer operations are extracted into separately pipelined subexpressions. Second, operations that are trivial to perform in hardware simply by manipulation of wires (such as constant-distance shifts or AND/OR with a constant) are not registered.

6.3.1 Speculative Execution

When possible, operations in a control branch are executed speculatively and multiplexed using condition expressions at the merge point of the control paths. Operations with side effects, such as volatile accesses and writes through pointers, cannot execute speculatively and are instead scheduled with a dependency on these condition expressions.

The same applies inside loops: speculatable operations may be scheduled before the loop condition (or a break, continue, or return statement) has been evaluated. When the loop exits, the effect of such an operation is discarded by multiplexing in the previous value.

6.3.2 Memory Latency-Aware Pipelining

When scheduling pointer operations that require memory transactions, the C2H Compiler queries the system description to determine the read latency of the connected slave ports. Based on this latency, the compiler increases the computation time of the pointer operation, allowing other non-dependent operations to be evaluated while the accelerator waits for readdata to return.

For example, consider the following function:

int foo(int *ptr_in, int x, int y, int z)
{
    int xy        = x * y;
    int xy_plus_z = xy + z;
    int ptr_data  = *ptr_in;
    int prod      = ptr_data * xy_plus_z;
    return prod;
}

Assume that ptr_in is connected to a slave with two cycles read latency. When scheduling this function, the C2H Compiler places the calculation of both xy and xy_plus_z in parallel with the read operation, because it expects two cycles of latency before receiving data from memory. Figure 7 illustrates the dependencies and state assignments for this example.

Latency-aware scheduling and pipelining is a powerful feature. If the compiler were unaware of specific memory latencies, then accelerators would either have to stall the pipeline in any case where data was not immediately ready, or insert an arbitrary number of wait-states, which would be unnecessary in all other cases. Our methodology nearly eliminates pipeline stalling due to memory latency by increasing the compiler’s visibility into the run-time behavior of connected peripherals.

The two types of memory-based pipeline stalls that can occur are those that are unpredictable at compile time. They occur when either (1) a slave port has a variable amount of latency, or (2) another master port has been granted access to the slave at the time the accelerator requests it. In both cases, the interconnect fabric signals the master to wait, and the accelerator stalls its pipeline. The compiler is often able to reduce these stalls by (1) pipelining variable-latency slaves based on their maximum number of outstanding read operations, and (2) consolidating master ports and scheduling memory operations to avoid resource conflicts, so that two accelerator master ports do not contend with each other. Modulo scheduling of memory accesses that share a master port is planned. Even with these optimizations, however, stalls are only reduced, not eliminated. Critical data structures should therefore be placed in fixed-latency (preferably low-latency) modules with their own slave ports whenever possible. This allows optimal latency-aware scheduling and eliminates the bottleneck caused by multiple masters contending for the same slave.

Figure 7. Latency-Aware Scheduling of a Read Operation With Two Cycles Latency (the read ptr_data = *ptr_in issues in State 0 alongside xy = x * y; xy_plus_z = xy + z follows in State 1; prod = ptr_data * xy_plus_z completes in State 2)


6.3.3 Loop Iteration Pipelining

The C2H Compiler pipelines execution of successive loop iterations by allowing multiple states in the loop’s state machine to be active simultaneously. The frequency of new iterations entering the pipeline is determined by analysis of loop-carried dependencies (and resource conflicts). For example, if operation A, which occurs in the first state of the loop, is dependent on the result of operation B from the previous iteration, then new iterations cannot enter the pipeline until the state in which B is valid. This analysis is performed after the operations in a loop have been scheduled (assigned state numbers). It is possible to separate these two processes because state number assignment is concerned with dependency between two operations in the same iteration, whereas the loop iteration frequency calculation is concerned with dependency between two operations of different iterations.

Loop iteration pipelining enables the accelerator to issue multiple outstanding read operations to latent slaves, effectively hiding the memory latency in many cases. This is illustrated in the following example:

int sum_elements(int *list, int len)
{
    int i;
    int sum = 0;
    for (i = 0; i < len; i++)
        sum += *list++;
    return sum;
}

Assume again that the pointer is connected to a slave with two cycles of read latency. The read operation is scheduled in State 0 of the loop and the accumulation of sum in State 2. However, new iterations of the loop can enter the pipeline once every cycle. Figure 8 shows execution of the two operations during multiple iterations of this loop. After the initial latency of three clock cycles is overcome, the accelerator completes one iteration on every cycle. In case the accelerator state machine is stalled during execution of the loop, the compiler generates a FIFO buffer behind the read master that collects incoming readdata values until they are used by downstream operations.

6.4 HDL Generation

As the final step in the compilation process, the C2H Compiler generates synthesizable Verilog or VHDL for the data/control path, state machines, and Avalon interface ports. It also updates the SOPC Builder system description with Avalon interface information and optionally runs system generation to connect the hardware accelerator to other modules.

7. Results

To test the efficiency of the C2H Compiler, several embedded software/hardware engineers were given common algorithms found in embedded computing. Their goal was to achieve the highest possible performance after 8 to 24 hours of debugging and optimization.

Test users found that they could achieve speedup factors of 3-7X simply by performing optimizations in their C code, such as applying the restrict qualifier to pointers, reducing loop-carried dependencies, and consolidating control paths where possible. These three techniques significantly increased the amount of parallelism that the C2H Compiler could extract from the target function.

Time   Iteration 0                      Iteration 1                      ...   Iteration N
0      (State 0) readdata = *list++
1      (State 1) (empty latency state)  (State 0) readdata = *list++
2      (State 2) sum += readdata        (State 1) (empty latency state)
3                                       (State 2) sum += readdata
...
N                                                                              (State 0) readdata = *list++
N+1                                                                            (State 1) (empty latency state)
N+2                                                                            (State 2) sum += readdata

One cycle per loop iteration: memory latency hidden by loop pipelining.

Figure 8. Loop Iteration Pipelining Hiding Memory Latency of a Read Operation

However, despite these significant optimizations, performance was limited not by computational speed, but by availability of data. The problem was no longer compute bound, but I/O bound. Because the accelerator was attempting to operate on multiple data buffers that were all stored in the CPU’s data memory, the bandwidth of the single memory slave became the performance bottleneck. Users were able to overcome this limitation simply by adding on-chip memory buffers to their systems and using them to store critical data structures. Similarly, by using dedicated memory modules for input and output buffers, the accelerators could use many master ports simultaneously to quickly stream data in and out of the buffers.

When combined, the code-optimization techniques of reducing dependencies and moving critical arrays into dedicated memory buffers proved extremely successful in increasing accelerator performance. These two techniques address the two types of potential bottleneck, computation and I/O, which the methodology targets with (1) efficient scheduling and pipelining, along with detailed critical-path reporting, for computational performance, and (2) generation of dedicated master ports and buffers for increased memory bandwidth.

However, not all algorithms and C functions are suitable for hardware acceleration. Parallel/speculative execution and loop pipelining are the two key means by which performance is increased, so if the algorithm (or its implementation) prevents the compiler from applying them, only minimal speedup will result. For example, code that contains sequentially dependent operations in disjoint control paths (such as complex peripheral-servicing routines) will not significantly benefit from hardware acceleration. Such tasks are better suited to running on a processor. In contrast, functions with simple control paths and critical computations consolidated into a small number of inner loops that execute uninterrupted for many iterations are ideal candidates for conversion to hardware coprocessors. It is therefore very important for developers to manage software/hardware tradeoffs appropriately when introducing accelerators into a design, so that sequential tasks stay on the CPU and iterative, computation-intensive tasks move to hardware.

Table 2 presents performance and area results for eight algorithms common in embedded computing. Speedup is calculated as the total algorithm compute time in software running on Nios II divided by total compute time running in the accelerator. System Resource Increase takes into account the logic element equivalent cost of on-chip hardware blocks such as multipliers and memories, and shows the incremental cost of adding the accelerator and buffers.

This investigation shows that after one to three days of work, considerable performance gains (13-73X) can be achieved with C2H acceleration, at the cost of roughly a one- to two-fold increase in system resources. Furthermore, this experiment was performed in the early stages of C2H Compiler development, before implementation of the analysis and reporting functionality that now significantly reduces design time. Many additional optimizations have since been implemented that make the compiler considerably more efficient in its resource utilization.

The most noteworthy result in this experiment is the bitwise Matrix Rotation algorithm, which achieved a 73.6X speedup versus the Nios II embedded microprocessor and runs at 95 MHz. In absolute compute time, this implementation is slightly faster than an MPC7447A PowerPC running at 1.4 GHz.

8. Conclusion

In this paper, we have described a methodology for converting ANSI/ISO standard C functions into custom hardware accelerators that integrate tightly with a general-purpose soft processor. When implemented as a plug-in to the processor’s integrated development environment, the software and hardware design flows are unified and allow for rapid debug and optimization iterations.

By leveraging an existing infrastructure for system generation, we demonstrated a method for seamless integration of accelerators into the CPU's memory map and for abstraction of the details of system connectivity. Building on this integration, we showed techniques that provide full support for pointers, as well as latency-aware pipelining of memory operations.

The implementation of this methodology, the C2H Compiler, was also presented. User tests revealed that it is possible, within hours, to achieve double-digit increases in performance after C2H acceleration.

Table 2. User Test Results

Algorithm               Speedup (vs. Nios II CPU)   System FMAX (MHz)   System Resource Increase
Autocorrelation         41.0X                       115                 142%
Bit Allocation          42.3X                       110                 152%
Convolutional Encoder   13.3X                       95                  133%
FFT                     15.0X                       85                  208%
High Pass Filter        42.9X                       110                 181%
Matrix Rotation         73.6X                       95                  106%
RGB to CMYK             41.5X                       120                 84%
RGB to YIQ              39.9X                       110                 158%

9. References

[1] Altera Corp. Quartus II Version 5.1 Handbook, Volume 4: SOPC Builder. Altera Corp., San Jose, CA, 2005.

[2] Altera Corp. Avalon Interface Specification. Altera Corp., San Jose, CA, 2005.

[3] Ball, James. The Nios II Family of Soft, Configurable FPGA Microprocessor Cores. In Proceedings of HOT CHIPS 17 (Stanford, CA, August 14-16, 2005).

[4] Callahan, Timothy J., Hauser, John R., and Wawrzynek, John. The Garp Architecture and C Compiler. IEEE Computer, April 2000.

[5] Chappell, Stephen and Sullivan, Chris. Handel-C for Co-Processing and Co-Design of Field Programmable System on Chip. JCRA, September 2002. http://www.celoxica.com/techlib/files/CEL-W0307171L3E-62.pdf

[6] Coutinho, Jose Gabriel F., Jiang, Jun, and Luk, Wayne. Interleaving Behavioral and Cycle-Accurate Descriptions for Reconfigurable Hardware Compilation. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '05) (Napa, CA, April 18-20, 2005).

[7] De Micheli, Giovanni. Hardware Synthesis from C/C++ Models. In Proceedings of Design, Automation and Test in Europe (DATE '99) (Munich, Germany, March 9-12, 1999).

[8] Galloway, David. The Transmogrifier C Hardware Description Language and Compiler for FPGAs. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '95) (Napa, CA, April 1995).

[9] Goering, Richard. Dataquest Predicts Automation of RTL. EE Times, June 10, 2002. http://www.eetimes.com/story/OEG20020610S0055

[10] JTC1/SC22/WG14. Programming Languages - C - Extensions to Support Embedded Processors. Technical report, ISO/IEC, 2004. http://www.open-std.org/jtc1/sc22/wg14

[11] McCloud, Shawn. Catapult C Synthesis-Based Design Flow: Speeding Implementation and Increasing Flexibility. October 2003. http://www.mentor.com/

[12] Open SystemC Initiative (OSCI). Draft Standard SystemC Language Reference Manual. April 25, 2005. http://www.systemc.org/

[13] Page, Ian and Luk, Wayne. Compiling Occam into Field-Programmable Gate Arrays. In W. Moore and W. Luk (eds.), FPGAs: Oxford Workshop on Field Programmable Logic and Applications, Abingdon EE&CS Books, Abingdon, UK, pp. 271-283, 1991.

[14] Pellerin, David and Thibault, Scott. Practical FPGA Programming in C. Prentice Hall, Upper Saddle River, NJ, 2005.

[15] Semeria, Luc and De Micheli, Giovanni. SpC: Synthesis of Pointers in C: Application of Pointer Analysis to the Behavioral Synthesis from C. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD '98) (San Jose, CA, November 8-12, 1998).


Copyright © 2007 Altera Corporation. All rights reserved. Altera, The Programmable Solutions Company, the stylized Altera logo, specific device designations, and all other words and logos that are identified as trademarks and/or service marks are, unless noted otherwise, the trademarks and service marks of Altera Corporation in the U.S. and other countries. All other product or service names are the property of their respective holders. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.

101 Innovation Drive
San Jose, CA 95134
(408) 544-7000
http://www.altera.com