Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students: Greg Stitt (Ph.D. expected June 2006), Ann Gordon-Ross (Ph.D. expected June 2006), David Sheldon (Ph.D. expected 2009), Ryan Mannion (Ph.D. expected 2009), Scott Sirowy (Ph.D. expected 2010)
Industrial Liaisons: Brian W. Einloth, Motorola; Serge Rutman and Dave Clark, Intel; Jeff Welser, IBM
Task Description
Warp processing background
  Two seed SRC CSR grants (2002-2005) showed feasibility.
  Idea: transparently move critical binary regions from microprocessor to FPGA, for 10x perf./energy gains or more.
Task: mature warp technology
  Years 1/2 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
  Years 2/3: reduce the memory bottleneck by using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).
Warp Processing Background: Basic Idea
A warp processor couples a microprocessor (µP) with instruction memory (I Mem) and data cache (D$), an on-chip profiler, an FPGA, and on-chip CAD (the Dynamic Partitioning Module, DPM). The running example is this software binary:

Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

1. Initially, the software binary is loaded into instruction memory.
2. The microprocessor executes the instructions in the software binary.
3. The profiler monitors the executed instructions and detects critical regions in the binary (here, the loop's repeating Add/Beq sequence: critical loop detected).
4. The on-chip CAD reads in the critical region.
5. The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG):

reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4

6. The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, e.g., a tree of adders.
7. The on-chip CAD maps the circuit onto the FPGA's configurable logic blocks (CLBs) and switch matrices (SMs).
8. The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more relative to software-only execution:

Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
FPGAs with hard-core processors: Xilinx Virtex-II Pro (source: Xilinx); Altera Excalibur (source: Altera).
FPGAs with soft-core processors: Xilinx Spartan (source: Xilinx).
Computer boards with FPGAs: Cray XD1 (source: FPGA Journal, Apr. 2005).
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
Programming is a key challenge.
Soln 1: compile a high-level language to custom binaries.
Soln 2: use standard binaries and dynamically re-map (warp) them.
  Cons: less high-level information, so less optimization.
  Pros: available to all software developers, not just specialists; data-dependent optimization; most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications.
Standard binaries are the most significant concept presently absent in FPGAs and other new programmable platforms.
Warp Processing Background: Basic Technology
Warp processing rests on three pieces of technology: an on-chip profiler; a warp-tuned FPGA; and on-chip CAD, including Just-in-Time (JIT) FPGA compilation.
[Figure: uP with I$ and D$, profiler, FPGA, and on-chip CAD on one chip]
JIT FPGA compilation flow: Binary -> Decompilation -> Partitioning (splitting the binary into a software part and a HW part) -> Behav./RT Synthesis -> Logic Synthesis -> Technology Mapping -> Placement & Routing -> Binary Updater -> Updated Binary.
Warp Processing Background: Initial Results
Commercial desktop CAD (Xilinx ISE): 60 MB of tools, 9.1 s runtime, with partitioning performed manually.
Warp on-chip CAD (ROCCAD, covering decompilation, partitioning, RT synthesis, logic synthesis, technology mapping, and placement & routing): 3.6 MB, 0.2 s; on a 75 MHz ARM7, only 1.4 s.
Net result: a 46x improvement, at only a 30% performance penalty.
Warp Processing Background: Publications 2002-2005
On-chip profiler:
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; extended version in the special issue "Best of CASES/MICRO" of IEEE Trans. on Computers, Oct. 2005.
Warp-tuned FPGA:
- A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design, Automation and Test in Europe (DATE), Feb. 2004.
On-chip CAD, including Just-in-Time FPGA compilation:
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design, Automation and Test in Europe (DATE), March 2005.
- Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. Design Automation Conf. (DAC), June 2004.
- A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. CODES/ISSS, Oct. 2003.
- Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. Design Automation Conf. (DAC), 2003.
- On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
- The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
- Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2002.
Related:
- A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid, and R. Lysecky. ACM Trans. on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
Automatic High-Level Construct Recovery from Binaries
Challenge: the binary lacks high-level constructs (loops, arrays, ...). Decompilation can help recover them; there is extensive previous work (e.g., [Cifuentes 93, 94, 99]).

Original C code:
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) { accum += a[i]; }
  return accum;
}

Corresponding assembly:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

Control/data flow graph creation:
reg3 := 0
reg4 := 0
loop:
reg1 := reg3 << 1
reg5 := reg2 + reg1
reg6 := mem[reg5 + 0]
reg4 := reg4 + reg6
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4

Data flow analysis:
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4

Function recovery:
long f( long reg2 ) {
  int reg3 = 0;
  int reg4 = 0;
  loop:
  reg4 = reg4 + mem[reg2 + (reg3 << 1)];
  reg3 = reg3 + 1;
  if (reg3 < 10) goto loop;
  return reg4;
}

Control structure recovery:
long f( long reg2 ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += mem[reg2 + (reg3 << 1)];
  }
  return reg4;
}

Array recovery:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}

The final representation is almost identical to the original C code.
New Method: Loop Rerolling
Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems: huge input (slow synthesis), inability to unroll to the desired amount, and inability to use advanced loop methods (loop pipelining, fusion, splitting, ...).
Solution: a new decompilation method, loop rerolling: identify the unrolled iterations and compact them into one iteration.

Example: the compiler unrolls
for (int i=0; i < 3; i++) accum += a[i];
into
Ld reg2, 100(0)
Add reg1, reg1, reg2
Ld reg2, 100(1)
Add reg1, reg1, reg2
Ld reg2, 100(2)
Add reg1, reg1, reg2
and loop rerolling recovers
for (int i=0; i < 3; i++) reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations
Original C code:
x = x + 1;
for (i=0; i < 2; i++) a[i] = b[i] + 1;
y = x;

Binary, with the loop unrolled:
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3

Map each instruction to a symbol (Add r3, r3, 1 => B; Ld r0, b(0) => A; Add r1, r0, 1 => B; St a(0), r1 => C; and so on; Mov r4, r3 => D), giving the string representation BABCABCD.
Find consecutively repeating instruction sequences, i.e., adjacent occurrences of the same substring, using a suffix tree, a technique derived from bioinformatics. Here the substring ABC (Ld, Add, St) repeats twice in a row: two unrolled iterations.
Loop Rerolling: Compacting Iterations
Starting from the identified unrolled loop:
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3

1) Determine the relationship of the constants across iterations (b(0), a(0) versus b(1), a(1)).
2) Replace the constants with an induction-variable expression:
Add r3, r3, 1
i=0
loop:
Ld r0, b(i)
Add r1, r0, 1
St a(i), r1
Bne i, 2, loop
Mov r4, r3
3) Produce the rerolled, decompiled code:
reg3 = reg3 + 1;
for (i=0; i < 2; i++) array1[i] = array2[i] + 1;
reg4 = reg3;
which matches the original C code:
x = x + 1;
for (i=0; i < 2; i++) a[i] = b[i] + 1;
y = x;
Method: Strength Promotion
Problem: the compiler's strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance.
Example FIR filter: A[i] = B[i]*10 + B[i+1]*18 + B[i+2]*34 + B[i+3]*66, summed by an adder tree.
Strength-reduced FIR filter: each multiplication becomes a pair of shifts feeding an add:
B[i]*10   = (B[i] << 3)   + (B[i] << 1)
B[i+1]*18 = (B[i+1] << 4) + (B[i+1] << 1)
B[i+2]*34 = (B[i+2] << 5) + (B[i+2] << 1)
B[i+3]*66 = (B[i+3] << 6) + (B[i+3] << 1)
Strength Promotion
Solution: promote strength-reduced code back to multiplications.
1. Identify strength-reduced subgraphs: pairs of shifts of the same operand feeding an add.
2. Replace each subgraph with the equivalent constant multiplication, one subgraph at a time: (B[i] << 3) + (B[i] << 1) becomes B[i]*10, then (B[i+1] << 4) + (B[i+1] << 1) becomes B[i+1]*18, and so on, until the original multiplications B[i]*10, B[i+1]*18, B[i+2]*34, and B[i+3]*66 feed the adder tree for A[i].
Strength promotion lets synthesis decide on strength reduction based on available resources; synthesis can of course apply strength reduction itself.
New Decompilation Methods' Benefits
Rerolling: speedups from better use of smart buffers; other potential benefits: faster synthesis, less area.
Strength promotion: speedups from fewer cycles and from a faster clock.
New methods remain to be developed, e.g., recovering pointer-based data structures as arrays.
[Chart: speedups from loop rerolling; y-axis = speedup, 0.0 to 3.0]
[Chart: speedups from strength promotion; y-axis = speedup, 0.0 to 2.5; x-axis label x_y_z, where x = adder constraint, y = multiplier constraint, z = adders needed for reduction]
[Chart: clock frequency, 0 to 250, with and without strength promotion; x-axis = adders needed for reduction]
Decompilation is Effective Even with High Compiler-Optimization Levels
Average speedup of 10 examples (chart: y-axis = speedup, 0 to 30; bars for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3):
Speedups were similar on MIPS for -O1 and -O3 optimizations.
Speedups were similar on ARM for -O1 and -O3 optimizations.
Speedups were similar between ARM and MIPS; the complex instructions of ARM didn't hurt synthesis.
MicroBlaze speedups were much larger: MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware.
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages
Performed an in-depth, several-month case study with Freescale of an H.264 video decoder: highly optimized proprietary code, not reference code (a huge difference, and a benefit of SRC collaboration).
Research question: is synthesis from binaries competitive on highly optimized code?
(Compared to MPEG-2, H.264 gives better quality, or smaller files, using more computation.)
Optimized H.264
Larger than most benchmarks: H.264 is 16,000 lines, while previous work used 100 to several thousand lines.
Highly optimized: many man-hours of manual optimization; 10x faster than the reference code used in previous works.
Different profiling results: previous examples spent ~90% of time in several loops; H.264 spends ~90% of time in ~45 functions, which is harder to speed up.

Function Name                  Instrs  %Time (cum.)  Speedup (cum.)
MotionComp_00                      33          6.8%             1.1
InvTransform4x4                    63         12.5%             1.1
FindHorizontalBS                   47         16.7%             1.2
GetBits                            51         20.8%             1.3
FindVerticalBS                     44         24.7%             1.3
MotionCompChromaFullXFullY         24         28.6%             1.4
FilterHorizontalLuma              557         32.5%             1.5
FilterVerticalLuma                481         35.8%             1.6
FilterHorizontalChroma            133         39.0%             1.6
CombineCoefsZerosInvQuantScan      69         42.0%             1.7
memset                             20         44.9%             1.8
MotionCompensate                  167         47.7%             1.9
FilterVerticalChroma              121         50.3%             2.0
MotionCompChromaFracXFracY         48         53.0%             2.1
ReadLeadingZerosAndOne             56         55.6%             2.3
DecodeCoeffTokenNormal             93         57.5%             2.4
DeblockingFilterLumaRow           272         59.4%             2.5
DecodeZeros                        79         61.3%             2.6
MotionComp_23                     279         63.0%             2.7
DecodeBlockCoefLevels              56         64.6%             2.8
MotionComp_21                     281         66.2%             3.0
FindBoundaryStrengthPMB            44         67.7%             3.1
C vs. Binary Synthesis on Optimized H.264
Binary partitioning is competitive with source-level partitioning: speedups compared to ARM9 software were 2.48 (binary) and 2.53 (C). Decompilation recovered nearly all the high-level information needed for partitioning and synthesis.
This uncovered another research problem: why aren't the speedups (from binary or from C) closer to the "ideal" speedup obtained if each function took zero time in hardware?
[Chart: speedup (0 to 10) vs. number of functions in hardware (1 to 51), with curves for the ideal speedup (zero-time hw execution), the speedup from C partitioning, and the speedup from binary partitioning]
Coding Guidelines
Are there C-coding guidelines that improve partitioning speedups? This question is orthogonal to the C vs. binary question; guidelines may help both.
Examined the H.264 code further: several phone conferences with Freescale liaisons, plus several email exchanges and reports.
[Charts: ideal vs. achieved C speedup, 0 to 10. Competitive, but both could be better; coding guidelines get closer to ideal.]
Synthesis-Oriented Coding Guidelines
Pass by value-return: declare a local array and copy in all data needed by a function (makes the lack of aliases explicit).
Function specialization: create a function version having frequent parameter values as constants.

Original:
void f(int width, int height) {
  . . . .
  for (i=0; i < width; i++)
    for (j=0; j < height; j++)
      . . . . . .
}

Rewritten:
void f_4_4() {
  . . . .
  for (i=0; i < 4; i++)
    for (j=0; j < 4; j++)
      . . . . . .
}

The bounds are now explicit, so the loops are unrollable.
Synthesis-Oriented Coding Guidelines
Algorithmic specialization: use parallelizable hardware algorithms when possible.
Hoisting and sinking of error checking: keep error checking out of loops to enable unrolling.
Lookup table avoidance: use expressions rather than lookup tables.

Original:
int clip[512] = { . . . };
void f() {
  . . .
  for (i=0; i < 10; i++)
    val[i] = clip[val[i]];
  . . .
}

Rewritten:
void f() {
  . . .
  for (i=0; i < 10; i++)
    if (val[i] > 255) val[i] = 255;
    else if (val[i] < 0) val[i] = 0;
  . . .
}

The comparisons can now be parallelized: each val[i] gets its own > and < comparators feeding a 3-to-1 selection among 255, 0, and val[i].
Synthesis-Oriented Coding Guidelines
Use explicit control flow: replace function pointers with if statements and static function calls.

Original:
void (*funcArray[]) (char *data) = { func1, func2, . . . };
void f(char *data) {
  . . .
  funcPointer = funcArray[i];
  (*funcPointer) (data);
  . . .
}

Rewritten:
void f(char *data) {
  . . .
  if (i == 0) func1(data);
  else if (i == 1) func2(data);
  . . .
}
Coding Guideline Results on H.264
Simple coding guidelines made a large improvement, and the rewritten software is only ~3% slower than the original.
Binary partitioning remained competitive with C partitioning: speedups of 6.55 (binary) vs. 6.56 (C). The small difference was caused by switch statements that used indirect jumps.
[Charts: speedup (0 to 10) vs. number of functions in hardware (1 to 51), before and after the rewrite, for C partitioning, binary partitioning, and the ideal speedup (zero-time hw execution)]
Studied More Benchmarks, Developed More Guidelines
Studied the guidelines further on standard benchmarks (g3fax, mpeg2, jpeg, brev, fir, crc): further synthesis speedups, again independent of the C vs. binary issue.
[Charts: speedup (0 to 10) for sw, hw/sw with original code, and hw/sw with guidelines on each benchmark; performance overhead and size overhead of the rewritten code per benchmark]
Publications:
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale).
- Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.
More guidelines remain to be developed.
Warp-Tailored FPGA Prototype
Developed an FPGA fabric tailored to fast, small-memory on-chip CAD.
Building a chip prototype with Intel:
Created synthesizable VHDL models, now running through Intel's shuttle tool flow.
Plan to incorporate the fabric with an ARM processor and other IP on a shuttle seat.
Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing; tapeout scheduled for 2006 Q3.
[Figure: the fabric combines a DADG/LCH block, a 32-bit MAC, and a configurable logic fabric of CLBs (each with two LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs) linked by switch matrices with routing channels 0-3 and 0L-3L]
Industrial Interactions
Freescale: numerous phone conferences, emails, and reports on technical subjects; co-authored paper (CODES/ISSS'05), with another pending; summer internship for Scott Sirowy (new UCR graduate student), summer 2005, Austin.
Intel: three visits by the PI, and one by graduate student Roman Lysecky, to Intel Research in Santa Clara; PI presented at the Intel System Design Symposium, Nov. 2005; PI served on an Intel Research Silicon Prototyping Workshop panel, May 2005; participating in Intel's Research Shuttle (chip prototype), with bi-weekly phone conferences since summer 2005 involving the PI, Intel engineers, and Roman Lysecky (now a professor at UA).
IBM: embarking on studies of warp processing results on server applications; the UCR group is to receive a Cell-based prototyping platform (with Prof. Walid Najjar).
Several interactions with Xilinx as well.
Task Description – Coming Up
Warp processing background
  Two seed SRC CSR grants (2002-2005) showed feasibility.
  Idea: transparently move critical binary regions from microprocessor to FPGA, for 10x perf./energy gains or more.
Task: mature warp technology
  Years 1/2 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
  Years 2/3 (all three sub-tasks just now underway): reduce the memory bottleneck by using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).
Recent Publications
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale.)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design, Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design, Automation and Test in Europe (DATE), March 2005.