Click here to load reader

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

  • View
    24

  • Download
    0

Embed Size (px)

DESCRIPTION

A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine - PowerPoint PPT Presentation

Text of A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning

  • A Configurable Logic Architecture for Dynamic Hardware/Software PartitioningRoman Lysecky, Frank Vahid* Department of Computer Science and EngineeringUniversity of California, Riverside*Also with the Center for Embedded Computer Systems at UC Irvine

    This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship

    Lysecky, R., Vahid, F.

  • IntroductionDynamic Software OptimizationDynamic optimizations are increasingly commonDynamo - Dynamic software optimizationsTransmeta Crusoe, Efficeon - Dynamic code morphingJust In Time (JIT) Compilation - Interpreted languagesAdvantagesTransparent optimizationsNo designer effortNo tool restrictionsAdapts to actual usageDrawbacksCurrently limited to software optimizationsLimited speedup (1.1x to 1.3x common)

    Lysecky, R., Vahid, F.

  • IntroductionHardware/Software PartitioningBenefitsSpeedups of 2X to 10X typicalSpeedups of 800X possibleFar more potential than dynamic SW optimizations (1.2x)Energy reductions of 25% to 95% typicalSW__________________SW__________________

    Lysecky, R., Vahid, F.

  • IntroductionTraditional Hardware/Software PartitioningRequires specialized CAD toolsNon-standard partitioning compilers

    Lysecky, R., Vahid, F.

  • IntroductionBinary Hardware/Software PartitioningBinary Partitioning [Stitt/Vahid ICCAD02] [Banerjee DATE03]Partition application starting from SW binaryCan be desktop basedAdvantagesUse any standard compilerSupports any languageSupports multiple sources from multiple languagesSupports assembly/object codeSupports legacy codeDisadvantageLoses some high-level information, so may be some loss of quality

    Lysecky, R., Vahid, F.

  • IntroductionDynamic Hardware/Software PartitioningDynamic HW/SW PartitioningEmbed HW/SW partitioning CAD tools on-chipFeasible in era of billion-transistor chipsAdvantagesDoes not require any special compilersCompletely transparentBring benefits of HW/SW partitioning to all SW designersComplements other approachesDesktop CAD best from purely technical perspectiveDynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD

    Lysecky, R., Vahid, F.

  • IntroductionWarp ProcessorsMIPS/ARMI$D$Configurable LogicProfilerDynamic Part. Module (DPM)Profile application to determine critical regionsPartition critical regions to hardwareProgram configurable logic & update software binaryPartitioned application executes faster with lower energy consumptionInitially execute application in software only 12345

    Lysecky, R., Vahid, F.

  • Warp ProcessorsRequirements & ToolsWarp Processor Architecture and Tools Basic configurable logic architectureEfficient profiling architectureOn-chip CAD tools for HW/SW partitioningDecompilationSynthesisTechnology MappingPlacement and RoutingARMI$D$Config. LogicProfilerDPM

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic ArchitectureRequirementsRobustnessCapable of supporting large set of applicationsSimplicityExisting FPGAs are too complex for warp processorsDesign goals of FPGAs much differentDesign configurable fabric by analyzing architectural features as to their impacts on on-chip CAD toolsFast executionVery low data memoryProduce reasonable hardware circuitsEfficient interface to memory

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic ArchitectureData address generators (DADG) and Loop control hardware (LCH)Found in most digital signal processorsProvide fast loop executionSupports memory accesses with regular access patternSynthesis of FSM not required for many critical loopsConfigurable logic fabric input provide alternative control of loop executionDADG &LCHConfigurable Logic Fabric 32-bit MAC

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic ArchitectureIntegrated 32-bit multiplier-accumulator (MAC)Multiplications are frequently found within critical loopsFrequently in the form of a multiply-accumulate operationFast, single-cycle multipliers are large and require many interconnectionsDADG &LCHConfigurable Logic Fabric 32-bit MAC

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic Architecture Configurable Logic FabricArray of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)Each CLB is directly connected to a SMSwitch matrix connectionsFour short wires connect adjacent SMsFour long wires connect every other SM togetherSMCLBSMSMSMSMSMCLB

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic Architecture Combinational Logic Block DesignSeveral studies have analyzed the impact of LUT and CLB size of overall design area and delayLUTs with 5 to 6 inputs result in best performance LUTs with less than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]CLB cluster size of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic Architecture Combinational Logic Block DesignIncorporate two 3-input 2-output LUTsCorresponds to four 3-input LUTsAllows for good quality circuit while reducing on-chip CAD tools complexityProvide routing resources between adjacent CLBs to support carry chains

    FPGAsWCLAFlexibility: Large CLBs, various internal routing resourcesSimplicity:Limited internal routing, reduce technology mapping complexity

    Lysecky, R., Vahid, F.

  • Warp Configurable Logic ArchitectureSwitch MatrixSwitch MatrixSM connected using eight channels per sideFour short channelsFour long channelsRoutes connect wires from different side using the same channelEach short channel is associated with single long channelWires are routed using a single pair of channels through configurable logic fabric

    FPGAsWCLAFlexibility: Large routing resources, requires complex routing algorithmsSimplicity: Allow for design of less complex routing algorithm

    Lysecky, R., Vahid, F.

  • ResultsBenchmarksConsidered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and PowerstoneAverage of 53% of total software execution time was spent executing single critical loop (more speedup possible if more loops considered)On average, critical loops comprised only 1% of total program size

    Lysecky, R., Vahid, F.

    Sheet1

    BenchmarkTotal InstructionsLoop InstructionsLoop Execution TimeLoop Size%Speedup Bound

    brev99210470.0%10.5%3.3

    g3fax11094631.4%0.5%1.5

    g3fax21094631.2%0.5%1.5

    url135261779.9%0.1%5

    logmin89683863.8%0.4%2.8

    pktflow15582535.5%0.0%1.6

    canrdr15640535.0%0.0%1.5

    bitmnp1740016553.7%0.9%2.2

    tblook156681176.0%0.1%4.2

    ttsprk165581159.3%0.1%2.5

    matrix01166942240.6%0.1%1.7

    idctrn01171061362.2%0.1%2.6

    Average:53.2%1.1%3.0

    Bench-markSW Time (s)SW Energy (mJ)HW/SW TimeSpeedupHW/SW Energy (mJ)Energy Savings

    brev0.041.930.022.30.8953.80%

    g3fax118.8916.513.741.4746.6618.50%

    g3fax218.8916.514.381.3753.2317.80%

    url303.9214816.174.224.13862.673.90%

    logmin13.06636.484.932.6259.759.20%

    pktflow0.7938.720.591.431.917.60%

    canrdr0.7938.440.571.429.8622.30%

    bitmnp4.65226.822.192.1115.4549.10%

    tblook0.3617.630.12310.3941.10%

    ttsprk0.9445.720.462.132.5128.90%

    matrix010.6632.340.451.530.555.60%

    idctrn011.9393.941.031.990.234.00%

    g72111.18544.856.241.8332.9638.90%

    Avg.:2.133.10%

    Sheet2

    Sheet3

  • ResultsExperimental SetupWarp Processor75 MHz ARM7 processorConfigurable logic fabric with fixed frequency of 60 MHzUsed dynamic partitioning CAD tools to map critical region to hardwareExecuted on an ARM7 processorActive for roughly 10 seconds to perform partitioningTraditional HW/SW Partitioning75 MHz ARM7 processorXilinx Virtex-E FPGA (executing at maximum possible speed)Manually partitioned software using VHDLVHDL synthesized using Xilinx ISE 4.1 on desktop

    Lysecky, R., Vahid, F.

  • ResultsPerformance Speedup

    Lysecky, R., Vahid, F.

  • ResultsEnergy Reduction

    Lysecky, R., Vahid, F.

  • Context: UCRs Research on Configurable SoCsSelf Tuning, Self Configuring Mass Produced ICs

    Lysecky, R., Vahid, F.

  • Conclusions & Future WorkWarp Configurable Logic FabricSupports wide range of embedded systems applicationsDesign specifically to allow development of lean on-chip CAD toolsProvide excellent resultsAverage speedups of 2.1 Average energy reduction of 33%Much better than dynamic software optimizationsOne loop only more speedup possibleMore recent examples since DATE publication 10x speedupsWorking towards examples with 100x speedupsFuture WorkPartitioning multiple software loops to hardware Synthesizing Finite State Machines (FSMs)Improved synthesis, technology mapping, and place and route

    Lysecky, R., Vahid, F.

Search related