Codesign Extended Applications. Brian Grattan, Greg Stitt, Frank Vahid*, Dept of Computer Science & Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems at UC Irvine.
Codesign Extended Applications
Brian Grattan, Greg Stitt, Frank Vahid*
Dept of Computer Science & Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and by NEC C&C Research Labs
CODES’02 – Codesign Extended Applications. Brian Grattan, Greg Stitt, Frank Vahid, Univ. of California, Riverside
Outline
Introduction: Hardware/Software Partitioning
  And the common assumption of a single specification
Different Algorithms in Hardware/Software
Codesign Extended Applications
Experiments
Future Work and Conclusions
Introduction – Hw/Sw Partitioning
Hw/sw partitioning can speed up software
  Shown by numerous researchers
    E.g., Balboni, Fornaciari, Sciuto CODES’96; Eles, Peng, Kuchcinski, Doboli DAES’97; Gajski, Vahid, Narayan, Gong Prentice-Hall 1997; Grode, Knudsen, Madsen DATE’98; many others
  Speedups of 1.5 to 10x are common
  Some examples like image processing get 100-800x speedup
    E.g., Cameron project, FCCM’02
Can reduce energy too
  E.g., Henkel, Li CODES’98; Wan, Ichikawa, Lidsky, Rabaey CICC’98; Stitt, Grattan, Villarreal, Vahid FCCM’02
  60-80% energy savings measured on real single-chip uP/FPGA devices
Hw/Sw Partitioning on Single-Chip Platforms
Numerous single-chip commercial devices with uP and FPGA
  Triscend E5 (shown)
  Triscend A7
  Atmel FPSLIC
  Xilinx Virtex II Pro
  Altera Excalibur
  More sure to come…
These devices make hw/sw partitioning even more attractive
[Figure: Triscend E5, with regions labeled uP and peripherals, cache/memory, and configurable logic]
Hw/Sw Partitioning – Commercial Tools Evolving
Commercial products evolving
  Synopsys’ Nimble compiler (2000) attempt
  Proceler
    Microprocessor Report’s 2001 Technology of the Year Award
  Others coming…
Hw/Sw Partitioning – Single-Spec Assumption
Assumption – Start from a single specification
  Typically sw source
Partitioning
  Find critical sw kernels, map some to hw
This assumption is made in most research efforts as well as commercial tools
[Figure: Specification → hw/sw partitioner → Sw and Hw parts → compilation and synthesis → binaries and netlists]
Digital Camera Example
Developed with intent of exploring hw/sw tradeoffs
  Captures images, compresses, uploads to PC
Soon found that a single specification wasn’t reasonable
  Two key functions had different hw/sw algorithms
    CRC
    DCT
[Figure: digital camera block diagrams with Controller, Communications, DCT, CCD Pre-Processor, Huffman encoder, and CRC calculation blocks]
Digital Camera Example
A single specification results in weak hw design
  We would have written CRC and DCT differently had we known they’d be mapped to hw
  Yet, we’d keep the original algorithms if they ended up in software
[Figure: Spec (DCT, Huffman, CRC, CCD, Ctrl) → hw/sw partitioner → Sw (Huff., CCD, Ctrl) → compilation → binaries; Hw (CRC, DCT) → synthesis → netlists, with the hw result marked "Weak"]
Different Algorithms in Hw vs. Sw
The single-specification assumption doesn’t always hold
Key observation
  Designers often use very different algorithms if a behavior is mapped to hardware versus if that behavior is mapped to software
  Widely known by designers
  In textbooks
  Also known in parallel processing – sequential and parallel algorithms
Different Algorithms – Sorting Example
Suppose desired behavior fills a buffer, sorts the buffer, and transmits the sorted list
  Fill()
  Sort()
  Transmit()
Sort() in software – Quicksort
  Simple and fast in sw
  Poor in hw, can’t be parallelized well
Sort() in hardware – Parallel Mergesort
  Very fast in hardware
  Slow in sw (if sequential) due to overhead
Derive one from the other?
[Figure: a single Quicksort block versus a tree of mergesort (MS) blocks operating in parallel]
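To make the contrast concrete, here is a minimal C sketch of the two styles. It is not the authors' code: `sort_sw` simply calls the C library's `qsort` (which the standard does not require to be Quicksort), and `sort_hw_model` is a sequential model of a bottom-up mergesort. The function names are assumptions made for illustration. Within one pass, every `merge` call touches a disjoint range, which is what lets hardware run those merges side by side.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for the C library's qsort(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Software-style sort: one call into the library's sequential sort. */
void sort_sw(int *buf, size_t n)
{
    qsort(buf, n, sizeof buf[0], cmp_int);
}

/* Merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Hardware-style structure, modeled sequentially: bottom-up mergesort.
 * Each pass of the outer loop is one level of the merge tree; the merges
 * inside a pass are independent of each other.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sort_hw_model(int *buf, size_t n)
{
    int *tmp = malloc(n * sizeof *tmp);
    if (n != 0 && tmp == NULL)
        return -1;
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge(buf, tmp, lo, mid, hi);   /* independent of other merges */
        }
        memcpy(buf, tmp, n * sizeof *buf);
    }
    free(tmp);
    return 0;
}
```

Run sequentially, the merge version copies the whole buffer on every pass, which hints at the overhead the slide mentions for software.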
Different Algorithms – CRC Example
CRC – Cyclic Redundancy Check
  Used for error checking during communication, stronger than parity
  Mathematically, divides a constant into the data and saves the remainder
Main function calls crc() with parameters:
  init_crc – initial value
  *data – pointer to data
  len – length of data
  jinit – initializing options
crc() returns: value of CRC for given data
[Figure: main function calling crc(); transmitted frame laid out as crc/data/data/data]
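The "divide a constant into the data and save the remainder" description can be sketched as a bit-serial loop. This is an illustrative sketch, not code from the slides. It assumes the 16-bit generator polynomial x^16 + x^12 + x^5 + 1 (0x1021), which the slides do not name, and uses 0-based indexing. The function name `crc16_bitserial` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: CCITT generator polynomial x^16 + x^12 + x^5 + 1.
 * The slides do not state which polynomial they use. */
#define CRC_POLY 0x1021u

/* Bit-serial CRC: long division over GF(2), one data bit per step.
 * Whenever the top bit of the running remainder is set, the polynomial
 * (the "constant") is subtracted, which in GF(2) is an XOR. */
uint16_t crc16_bitserial(const uint8_t *data, size_t len, uint16_t init_crc)
{
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);        /* bring in the next byte */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ CRC_POLY);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;                                  /* the remainder */
}
```

With an initial value of 0, the ASCII string "123456789" gives 0x31C3, the standard check value for this polynomial.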
Different Algorithms – CRC in Hardware
Hardware Version

unsigned short crc_hw(…)   /* 16-bit CRC, so the return type must hold 16 bits */
{
    unsigned short j, crc_value = init_crc;
    unsigned short new_crc_value;

    if (jinit >= 0)
        crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
    for (j = 1; j <= len; j++) {   /* data is indexed from 1 */
        /* bit 0 */
        new_crc_value = bit(4,data[j]) ^ bit(0,data[j]) ^ bit(8,crc_value) ^ bit(12,crc_value);
        /* bit 1 */
        new_crc_value = new_crc_value | ((bit(5,data[j]) ^ bit(1,data[j]) ^ bit(9,crc_value) ^ bit(13,crc_value)) << 1);
        /* bit 2 */
        new_crc_value = new_crc_value | ((bit(6,data[j]) ^ bit(2,data[j]) ^ bit(10,crc_value) ^ bit(14,crc_value)) << 2);
        … continue for bits 3 through 7 …
    }
    return (new_crc_value);
}

Knowing the generator polynomial, one can calculate the XORs for each individual bit
Each CRC value is the result of bit-wise XORs with the data and the previous CRC value
Synthesizes to hw very nicely, but getting bits and shifting are inefficient in sw
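As a self-contained illustration of the per-bit XOR style, the sketch below writes out one byte step of a 16-bit CRC as sixteen XOR equations. It assumes the polynomial x^16 + x^12 + x^5 + 1, which agrees with the three equations shown on the slide but is not stated there. The `BIT` macro stands in for the slide's undefined `bit()` helper, and the function names are hypothetical. In hardware each equation becomes a small XOR tree and all sixteen evaluate in the same cycle. In software each `BIT` costs a shift and a mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract bit n of v as 0 or 1 (stand-in for the slide's bit() helper). */
#define BIT(n, v) (((unsigned)(v) >> (n)) & 1u)

/* One byte step: new CRC bits as XORs of data bits d and old CRC bits c.
 * ASSUMPTION: polynomial x^16 + x^12 + x^5 + 1, 8 data bits per step. */
static uint16_t crc16_step_xor(uint16_t c, uint8_t d)
{
    unsigned n = 0;
    n |= (BIT(4,d) ^ BIT(0,d) ^ BIT(8,c) ^ BIT(12,c)) << 0;
    n |= (BIT(5,d) ^ BIT(1,d) ^ BIT(9,c) ^ BIT(13,c)) << 1;
    n |= (BIT(6,d) ^ BIT(2,d) ^ BIT(10,c) ^ BIT(14,c)) << 2;
    n |= (BIT(7,d) ^ BIT(3,d) ^ BIT(11,c) ^ BIT(15,c)) << 3;
    n |= (BIT(4,d) ^ BIT(12,c)) << 4;
    n |= (BIT(5,d) ^ BIT(4,d) ^ BIT(0,d) ^ BIT(8,c) ^ BIT(12,c) ^ BIT(13,c)) << 5;
    n |= (BIT(6,d) ^ BIT(5,d) ^ BIT(1,d) ^ BIT(9,c) ^ BIT(13,c) ^ BIT(14,c)) << 6;
    n |= (BIT(7,d) ^ BIT(6,d) ^ BIT(2,d) ^ BIT(10,c) ^ BIT(14,c) ^ BIT(15,c)) << 7;
    n |= (BIT(7,d) ^ BIT(3,d) ^ BIT(0,c) ^ BIT(11,c) ^ BIT(15,c)) << 8;
    n |= (BIT(4,d) ^ BIT(1,c) ^ BIT(12,c)) << 9;
    n |= (BIT(5,d) ^ BIT(2,c) ^ BIT(13,c)) << 10;
    n |= (BIT(6,d) ^ BIT(3,c) ^ BIT(14,c)) << 11;
    n |= (BIT(7,d) ^ BIT(4,d) ^ BIT(0,d) ^ BIT(4,c) ^ BIT(8,c) ^ BIT(12,c) ^ BIT(15,c)) << 12;
    n |= (BIT(5,d) ^ BIT(1,d) ^ BIT(5,c) ^ BIT(9,c) ^ BIT(13,c)) << 13;
    n |= (BIT(6,d) ^ BIT(2,d) ^ BIT(6,c) ^ BIT(10,c) ^ BIT(14,c)) << 14;
    n |= (BIT(7,d) ^ BIT(3,d) ^ BIT(7,c) ^ BIT(11,c) ^ BIT(15,c)) << 15;
    return (uint16_t)n;
}

/* Apply the byte step to a whole buffer (0-based indexing). */
uint16_t crc16_xor(const uint8_t *data, size_t len, uint16_t init_crc)
{
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++)
        crc = crc16_step_xor(crc, data[i]);
    return crc;
}
```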
Different Algorithms – CRC in Software
Software Version
  Before doing any calculations, create an initialization table that calculates the CRC for each individual character
  Use data as index into initialization table and execute two XORs
  Requires lookups, but faster for a sequential calculation

unsigned short crc_sw(…)   // Source: Numerical Recipes in C
{
    unsigned short initialize_table(unsigned short crc, unsigned char one_char);
    static unsigned short icrctb[256], init = 0;
    unsigned short tmp1, j, crc_value = init_crc;

    if (!init) {   /* build the lookup table once, on the first call */
        init = 1;
        for (j = 0; j <= 255; j++) {
            icrctb[j] = initialize_table(j << 8, (uchar) 0);
        }
    }
    if (jinit >= 0)
        crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
    for (j = 1; j <= len; j++) {   /* data is indexed from 1 */
        tmp1 = data[j] ^ HIBYTE(crc_value);
        crc_value = icrctb[tmp1] ^ (LOBYTE(crc_value) << 8);
    }
    return (crc_value);
}
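For comparison, a table-driven CRC can be written in a self-contained form as follows. This is a sketch under assumptions rather than the Numerical Recipes routine: it assumes the polynomial x^16 + x^12 + x^5 + 1, uses 0-based indexing, and builds the table with a bit-serial loop instead of the slide's `initialize_table`. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t crc_table[256];   /* CRC of each possible byte value */
static int crc_table_ready = 0;

/* Fill the table once: entry b is the CRC of the single byte b.
 * ASSUMPTION: polynomial x^16 + x^12 + x^5 + 1 (0x1021). */
static void crc16_build_table(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint16_t r = (uint16_t)(b << 8);
        for (int k = 0; k < 8; k++)
            r = (r & 0x8000u) ? (uint16_t)((r << 1) ^ 0x1021u)
                              : (uint16_t)(r << 1);
        crc_table[b] = r;
    }
    crc_table_ready = 1;
}

/* Per byte: one XOR to form the index, one lookup, one XOR to combine. */
uint16_t crc16_table(const uint8_t *data, size_t len, uint16_t init_crc)
{
    uint16_t crc = init_crc;
    if (!crc_table_ready)
        crc16_build_table();
    for (size_t i = 0; i < len; i++) {
        uint8_t idx = (uint8_t)(data[i] ^ (crc >> 8));   /* data ^ high byte */
        crc = (uint16_t)(crc_table[idx] ^ (crc << 8));   /* table ^ low byte */
    }
    return crc;
}
```

The inner loop matches the slide's description: the data byte indexes the table and two XORs finish the step, which suits a sequential processor but costs a 256-entry memory in hardware.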
Different Algorithms – DCT
DCT – Discrete Cosine Transform
  Computationally intensive, numerous matrix multiplies
  Accounts for perhaps 70% of JPEG encoding time
  Dozens of possible algorithms
Best algorithm depends largely on computational resources
  Certainly different for sw and hw
  Doing multiplications in floating-point vs. fixed-point
    Multiplication by a constant can be efficiently mapped to hardware, but accuracy will be lost by not using floating-point
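The floating-point versus fixed-point tradeoff can be illustrated with a single constant multiply. The sketch below is not taken from any particular DCT implementation. It assumes the constant cos(π/4) ≈ 0.70710678, a Q15 fixed-point format, and non-negative inputs, all chosen for illustration. The fixed-point version expands the constant 23170 (binary 0101 1010 1000 0010) into shifts and adds, which is how a constant multiplier typically maps to hardware.

```c
#include <stdint.h>

#define COS_PI_4_F 0.70710678f   /* cos(pi/4) as a float */
/* Q15 form of the same constant: round(0.70710678 * 2^15) = 23170
 * = 2^14 + 2^12 + 2^11 + 2^9 + 2^7 + 2^1 */

/* Software-style: floating-point multiply. */
float scale_float(uint16_t x)
{
    return (float)x * COS_PI_4_F;
}

/* Hardware-style: constant multiply as shifts and adds, then drop the
 * 15 fractional bits. The dropped bits are the lost accuracy. */
uint32_t scale_fixed(uint16_t x)
{
    uint32_t v = x;
    uint32_t prod = (v << 14) + (v << 12) + (v << 11)
                  + (v << 9) + (v << 7) + (v << 1);   /* v * 23170 */
    return prod >> 15;
}
```

For an input of 100, `scale_float` returns about 70.71 while `scale_fixed` returns 70.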
Codesign Extended Applications (CEAs)
Basic idea: write two versions of certain functions
  Only the critical functions, and
  Only those with different sw and hw algorithms
  Typically only a handful of these
    Most time is spent in just a few critical functions
Include both function versions in the specification
  But use compiler flags to include either sw or hw version

main()
{
    …
    crc();
    …
}

unsigned short crc(…)
{
#ifdef cea_crc_hw
    return crc_hw(…);   /* hardware algorithm selected at compile time */
#else
    return crc_sw(…);   /* default: software algorithm */
#endif
}

% gcc -Dcea_crc_hw main.c
CEAs when using C/C++ and VHDL
C code

crc_hw(…inputs…)
{
    /* Hardware crc... */
    for (j = 1; j <= len; j++) {
        TSHORT(to_hw) = data[j];   /* write the next data value to the hw input */
        TBYTE(enable) = 1;         /* pulse enable so the hw processes it */
        TBYTE(enable) = 0;
    }
    crc_value = TSHORT(result);    /* read back the final CRC from hw */
    return (crc_value);
}

VHDL code

if (rst = '1') then
    crc <= "0000000000000000";
    done <= '0';
elsif (clk'event and clk = '1') then
    if (enable = '1') then
        if done = '0' then
            crc <= nextCRC16_D8(input, crc);
            done <= '1';
        end if;
    else
        done <= '0';
        output <= crc;
    end if;
end if;
CEAs Enable Hw/Sw Partitioning Tool
Traditional hw/sw partitioner
  Compiler, estimators, search heuristics, technology files, etc.
  Drawback: heavy impact on tool flow
CEAs plus platforms result in simple partitioner
  Script uses existing compiler, synthesis, and evaluation (simulation or physical measurement)
  Drawbacks: must write two versions of critical functions, script may use simpler search function
  Different partitioners for different domains
[Figure: traditional flow. Specification → hw/sw partitioner → Sw and Hw parts → compilation and synthesis → binaries and netlists. The partitioner is essentially a compiler, search heuristic, and estimator. Heavy-duty tool.]
[Figure: CEA flow. CEA → script → Sw and Hw parts → compilation and synthesis → binaries and netlists → evaluator. The script provides search heuristic and tool control. Lightweight tool.]
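A rough sketch of the "tool control" half of such a script: for each candidate partition, build the compiler command line that selects hw or sw versions through `-D` flags. This is an illustration only, written in C to match the rest of the deck. The flag `cea_crc_hw` appears in the slides, while `cea_dct_hw`, the bitmask encoding, and the function name are assumptions. A real script would also launch synthesis and the evaluator, which are omitted here.

```c
#include <stddef.h>
#include <stdio.h>

/* One flag per CEA function. cea_dct_hw is a hypothetical name. */
static const char *const cea_flags[] = { "cea_crc_hw", "cea_dct_hw" };
#define NUM_CEA (sizeof cea_flags / sizeof cea_flags[0])

/* Build the gcc command for one partition. Bit i of mask set means
 * function i uses its hw version. Returns 0 on success, -1 if the
 * output buffer is too small. */
int build_command(unsigned mask, char *out, size_t out_size)
{
    size_t used;
    int n = snprintf(out, out_size, "gcc");
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    used = (size_t)n;
    for (size_t i = 0; i < NUM_CEA; i++) {
        if (mask & (1u << i)) {
            n = snprintf(out + used, out_size - used, " -D%s", cea_flags[i]);
            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
        }
    }
    n = snprintf(out + used, out_size - used, " main.c");
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    return 0;
}
```

Looping `mask` from 0 to `(1u << NUM_CEA) - 1` enumerates every partition, which is the simplest possible search function. The number of builds doubles with each added CEA function.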
Experiments
Compared hw and sw CRC algorithms
  Synthesized to FPGA
  Compiled to MIPS uP
Demonstrates need for different algorithms

Sw and hw CRC algorithms in FPGA.
                          Size (blocks)   Delay (clock cycles/character)
  Hardware CRC algorithm  19              1
  Software CRC algorithm  44              3

Sw and hw CRC algorithms on a microprocessor.
                          Size (assembly lines)   Clock cycles
  Software CRC algorithm  1061                    180,000
  Hardware CRC algorithm  1298                    814,000
Experiments
Wrote small signal processing example as CEA
  Wrote sw and hw versions of core functions
  In this case, algorithms were similar
Set up power measurement for two real platforms
  XS40 (board with microcontroller chip and Xilinx FPGA chip)
  E5 (single chip with microcontroller and FPGA)
Partitioning script automatically partitioned and measured power and cycles (overnight, due to place & route time)
Demonstrates how CEAs enable simple yet practical hw/sw partitioning
  Easily migrates to different platforms, different chips

Partitioning energy (Joules) on E5 device
  Multiply   Sum   Bit-Share   Energy (J)
  SW         SW    SW          12.4
  SW         SW    HW          8.6
  SW         HW    SW          8.8
  HW         SW    SW          8.0
  SW         HW    HW          4.8
  HW         SW    HW          Does not route
  HW         HW    SW          Does not route
  HW         HW    HW          Does not route
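The selection step of the script can be sketched as a search over the measured results. The energy values below are copied from the E5 table. The struct layout, the use of a negative value to mark "does not route", and the function name are assumptions for illustration.

```c
#include <stddef.h>

/* One measured partition: where each function ran, and the energy.
 * ASSUMPTION: a negative energy marks a design that does not route. */
struct partition_result {
    const char *multiply, *sum, *bit_share;
    double energy_joules;
};

/* Measurements on the E5 device, from the table in this deck. */
static const struct partition_result e5_results[] = {
    { "SW", "SW", "SW", 12.4 },
    { "SW", "SW", "HW",  8.6 },
    { "SW", "HW", "SW",  8.8 },
    { "HW", "SW", "SW",  8.0 },
    { "SW", "HW", "HW",  4.8 },
    { "HW", "SW", "HW", -1.0 },   /* does not route */
    { "HW", "HW", "SW", -1.0 },   /* does not route */
    { "HW", "HW", "HW", -1.0 },   /* does not route */
};

/* Return the index of the routable partition with the lowest energy,
 * or -1 if no partition routes. */
int best_partition(const struct partition_result *r, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (r[i].energy_joules < 0.0)
            continue;                         /* skip unroutable designs */
        if (best < 0 || r[i].energy_joules < r[best].energy_joules)
            best = (int)i;
    }
    return best;
}
```

On this data the search picks the SW/HW/HW partition at 4.8 J, against 12.4 J for the all-software baseline.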
Issues and Future Work
Issues
  What if hw versions not used after partitioning? Wasted effort?
  Verification of all possible combinations?
  Must use wisely or problem grows unwieldy
Future work
  More examples, more platforms
  Several versions of the same function
    One hardware area-conscious
    One hardware speed-conscious
    One software code-size-conscious
    One software speed-conscious
    …more…
  Experimenting with communication between hardware and software
    DMA transfer, wide-access memories, …
Conclusions
Basic hw/sw partitioning assumption of a single specification doesn’t always hold
Codesign Extended Applications help support different algorithms
CEAs enable hw/sw partitioning in existing tool flows
  Utilizes existing compilation, synthesis, mapping, evaluation tools, and platforms
  Simple yet effective approach to hw/sw partitioning