21
A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

  • Upload
    snowy

  • View
    35

  • Download
    0

Embed Size (px)

DESCRIPTION

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine - PowerPoint PPT Presentation

Citation preview

Page 1: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering

University of California, Riverside*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN

fellowship

Page 2: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 2

IntroductionDynamic Software Optimization

Dynamic optimizations are increasingly common Dynamo - Dynamic software optimizations Transmeta Crusoe, Efficeon - Dynamic code morphing Just In Time (JIT) Compilation - Interpreted languages

Advantages Transparent optimizations

No designer effort No tool restrictions

Adapts to actual usage Drawbacks

Currently limited to software optimizations Limited speedup (1.1x to 1.3x common)

Page 3: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 3

IntroductionHardware/Software Partitioning

SW__________________

SW__________________

SW__________________

HW__________________

SW__________________

SW__________________

ProcessorProcessor ProcessorASIC/FPGA

Critical Regions

ProfilerProfiler Benefits Speedups of 2X to 10X

typical Speedups of 800X

possible Far more potential than

dynamic SW optimizations (1.2x)

Energy reductions of 25% to 95% typical

Time Energy

SW OnlyHW/ SW

Time Energy

SW Only

ProcessorProcessor

Page 4: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 4

IntroductionTraditional Hardware/Software Partitioning

BinarySW

ProfilingCAD Tools

NetlistNetlistSW Binary

ProcessorProcessor ProcessorASIC/FPGA

Requires specialized CAD tools

Non-standard partitioning compilers

Page 5: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 5

IntroductionBinary Hardware/Software Partitioning

Binary Partitioning [Stitt/Vahid ICCAD’02] [Banerjee DATE’03]

Partition application starting from SW binary

Can be desktop based Advantages

Use any standard compiler Supports any language Supports multiple sources from

multiple languages Supports assembly/object code Supports legacy code

Disadvantage Loses some high-level information,

so may be some loss of quality

BinarySW

ProfilingStandard Compiler

BinaryBinary

ProfilingCAD Tools

NetlistNetlistModified Binary

ProcessorProcessor ProcessorASIC/FPGA

Page 6: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 6

IntroductionDynamic Hardware/Software Partitioning

Dynamic HW/SW Partitioning Embed HW/SW partitioning CAD tools

on-chip Feasible in era of billion-transistor chips

Advantages Does not require any special compilers Completely transparent Bring benefits of HW/SW partitioning to

all SW designers Complements other approaches

Desktop CAD best from purely technical perspective

Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD

BinarySW

ProfilingStandard Compiler

BinaryBinary

CAD

FPGAProc.

Page 7: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 7

MIPS/ARM

I$

D$

Configurable Logic

Profiler

Dynamic Part.

Module (DPM)

Profile application to determine critical regions

Partition critical regions to hardware

Program configurable logic & update software binary

Partitioned application executes faster with lower energy consumption

Initially execute application in software only

11

22

33

44

55

IntroductionWarp Processors

Page 8: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 8

ARM I$D$

Config. Logic

Profiler

DPM

Warp ProcessorsRequirements & Tools

Warp Processor Architecture and Tools

Basic configurable logic architecture Efficient profiling architecture On-chip CAD tools for HW/SW partitioning

Decompilation Synthesis Technology Mapping Placement and Routing

BinaryBinary

Decompilation

BinaryHW

RT & Logic Synthesis

Technology Mapping

Placement & Routing

Develop new Warp Configurable Logic Architecture (WCLA) WCLA

Page 9: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 9

Warp Configurable Logic ArchitectureRequirements

Robustness Capable of supporting large set of applications

Simplicity Existing FPGAs are too complex for warp

processors Design goals of FPGAs much different

Design configurable fabric by analyzing architectural features as to their impacts on on-chip CAD tools

Fast execution Very low data memory Produce reasonable hardware circuits

Efficient interface to memory

Page 10: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 10

Warp Configurable Logic Architecture

Data address generators (DADG) and Loop control hardware (LCH)

Found in most digital signal processors Provide fast loop execution

Supports memory accesses with regular access pattern Synthesis of FSM not required for many critical loops Configurable logic fabric input provide alternative

control of loop executionDADG &

LCH

Configurable Logic Fabric

Reg0

32-bit MAC

Reg1 Reg2

ARM I$D$

WCLA

Profiler

DPM

Page 11: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 11

Warp Configurable Logic Architecture

Integrated 32-bit multiplier-accumulator (MAC) Multiplications are frequently found within critical loops

Frequently in the form of a multiply-accumulate operation Fast, single-cycle multipliers are large and require many

interconnections

DADG &

LCH

Configurable Logic Fabric

Reg0

32-bit MAC

Reg1 Reg2

ARM I$D$

WCLA

Profiler

DPM

Page 12: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 12

SM

CLB

SM

SM

SM

SM

SM

CLB

Warp Configurable Logic Architecture Configurable Logic Fabric

DADGLCH

Configurable Logic Fabric

32-bit MAC

SM

CLB

SM

SM

SM

SM

SM

CLB

Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)

Each CLB is directly connected to a SM Switch matrix connections

Four short wires connect adjacent SMs Four long wires connect every other SM together

Page 13: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 13

Warp Configurable Logic Architecture Combinational Logic Block Design

LUTLUT

a b c d e f

o1 o2 o3o4

Adj.CLB

Adj.CLB

Several studies have analyzed the impact of LUT and CLB size of overall design area and delay

LUTs with 5 to 6 inputs result in best performance LUTs with less than 3 inputs have much worse

performance [Chow, et al. 1999, Singh, et al. 1992] CLB cluster size of 3 to 20 LUTs are feasible [Marquardt,

Betz, Rose 2000]

Page 14: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 14

Warp Configurable Logic Architecture Combinational Logic Block Design

Incorporate two 3-input 2-output LUTs Corresponds to four 3-input LUTs Allows for good quality circuit while reducing on-chip

CAD tools complexity Provide routing resources between adjacent CLBs

to support carry chains

FPGAs WCLAFlexibility: Large CLBs, various internal routing resources

Simplicity:Limited internal routing, reduce technology mapping complexity

LUTLUT

a b c d e f

o1 o2 o3o4

Adj.CLB

Adj.CLB

Page 15: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 15

Warp Configurable Logic ArchitectureSwitch Matrix

0

0L

1

1L2L

2

3L

3

0123

0L1L2L

3L

0123

0L1L2L3L

0 1 2 3 0L1L2L3L

Switch Matrix SM connected using eight channels per

side Four short channels Four long channels

Routes connect wires from different side using the same channel

Each short channel is associated with single long channel

Wires are routed using a single pair of channels through configurable logic fabric

FPGAs WCLAFlexibility: Large routing resources, requires complex routing algorithms

Simplicity: Allow for design of less complex routing algorithm

Page 16: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 16

ResultsBenchmarks

Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone

Average of 53% of total software execution time was spent executing single critical loop (more speedup possible if more loops considered)

On average, critical loops comprised only 1% of total program size

brev 992 104 70.0% 10.5% 3.3g3fax1 1094 6 31.4% 0.5% 1.5g3fax2 1094 6 31.2% 0.5% 1.5

url 13526 17 79.9% 0.1% 5logmin 8968 38 63.8% 0.4% 2.8pktflow 15582 5 35.5% 0.0% 1.6canrdr 15640 5 35.0% 0.0% 1.5bitmnp 17400 165 53.7% 0.9% 2.2tblook 15668 11 76.0% 0.1% 4.2ttsprk 16558 11 59.3% 0.1% 2.5

matrix01 16694 22 40.6% 0.1% 1.7idctrn01 17106 13 62.2% 0.1% 2.6

Average: 53.2% 1.1% 3.0

Speedup BoundLoop InstructionsBenchmark Total

InstructionsLoop Execution

Time Loop Size%

Page 17: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 17

ResultsExperimental Setup

Warp Processor 75 MHz ARM7 processor Configurable logic fabric with fixed

frequency of 60 MHz Used dynamic partitioning CAD tools to

map critical region to hardware Executed on an ARM7 processor Active for roughly 10 seconds to perform

partitioning Traditional HW/SW Partitioning

75 MHz ARM7 processor Xilinx Virtex-E FPGA (executing at

maximum possible speed) Manually partitioned software using

VHDL VHDL synthesized using Xilinx ISE 4.1 on

desktop

ARM7 I$D$

WCLA

Profiler

DPM

ARM7 I$D$

Xilinx Virtex-E FPGA

Page 18: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 18

0

1

2

3

4

5

Spee

dup

WCLAXilinx Virtex-E

ResultsPerformance Speedup

Average speedup of 2.1vs. 2.2 for Virtex-E

4.1

Page 19: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 19

0%

20%

40%

60%

80%

100%

Ener

gy R

educ

tion

WCLAXilinx Virtex-E

ResultsEnergy Reduction

Average energy reduction of 33%v.s 36% for Xilinx Virtex-E

74%

Page 20: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 20

Context: UCR’s Research on Configurable SoCs

Self Tuning, Self Configuring Mass Produced ICs

MIPS/ ARM

I$

D$

WCLA

Profiler

Dynamic Part.

Module (DPM)

Cache Tuner

Efficient on-chip profiling [Gordon-Ross, Vahid]

Configurable cache [Zhang, Vahid, Najjar]

Self-tuning cache [Zhang, Vahid, Lysecky]

Binary decompilation, loop unrolling, alias analysis[Stitt, Vahid]

Lean on-chip CAD tools[Lysecky, Vahid, Tan]

Page 21: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

Lysecky, R., Vahid, F. 21

Conclusions & Future Work Warp Configurable Logic Fabric

Supports wide range of embedded systems applications Design specifically to allow development of lean on-chip CAD

tools Provide excellent results

Average speedups of 2.1 Average energy reduction of 33% Much better than dynamic software optimizations One loop only – more speedup possible More recent examples since DATE publication – 10x speedups Working towards examples with 100x speedups

Future Work Partitioning multiple software loops to hardware Synthesizing Finite State Machines (FSMs) Improved synthesis, technology mapping, and place and route