
  • 8/21/2012

    Era of Customization and Specialization

    Jason Cong
    Chancellor's Professor, UCLA Computer Science Department
    [email protected]
    Director, Center for Domain-Specific Computing
    www.cdsc.ucla.edu

    Focus of Our Research: Energy-Efficient Computing

    ♦ Parallelization
    ♦ Customization: adapt the architecture to the application domain

  • Potential of Customization/Specialization

    AES (128-bit key, 128-bit data):

    Implementation        Throughput     Power    Figure of Merit (Gb/s/W)
    0.18 um CMOS ASIC     3.84 Gbit/s    350 mW   11        (1/1)
    FPGA [1]              1.32 Gbit/s    490 mW   2.7       (1/4)
    ASM StrongARM [2]     31 Mbit/s      240 mW   0.13      (1/85)
    ASM Pentium III [3]   648 Mbit/s     41.4 W   0.015     (1/800)
    C, Emb. Sparc [4]     133 Kbit/s     120 mW   0.0011    (1/10,000)
    Java [5], Emb. Sparc  450 bit/s      120 mW   0.0000037 (1/3,000,000)

    [1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
    [2] Dag Arne Osvik: 544-cycle AES-ECB on StrongARM SA-1110
    [3] Helger Lipmaa, hand-coded PIII assembly + Intel Pentium III (1.13 GHz) datasheet
    [4] gcc, 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 um CMOS
    [5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 um CMOS

    Source: P. Schaumont and I. Verbauwhede, "Domain-specific codesign for embedded security," IEEE Computer 36(4), 2003
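The figure-of-merit column is simply throughput divided by power; a quick check of the table's numbers (a sketch, with the table values hard-coded):

```python
# Figure of merit (Gb/s/W) = throughput / power, using the rows of the
# table above. Throughput in bit/s, power in watts.
designs = {
    "ASIC 0.18um":     (3.84e9, 0.350),
    "FPGA":            (1.32e9, 0.490),
    "ASM StrongARM":   (31e6,   0.240),
    "ASM Pentium III": (648e6,  41.4),
    "C Emb. Sparc":    (133e3,  0.120),
    "Java Emb. Sparc": (450.0,  0.120),
}
for name, (bps, watts) in designs.items():
    fom_gbps_per_watt = bps / watts / 1e9
    print(f"{name:16s} {fom_gbps_per_watt:.7f} Gb/s/W")
```

The six-orders-of-magnitude spread between the ASIC and the interpreted-Java rows is the whole argument for customization in one table.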

  • Another Example of Specialization: the Advance of Civilization

    ♦ For the human brain, "Moore's Law" scaling stopped long ago: the number of neurons and their firing speed have not changed significantly
    ♦ Yet civilization advanced remarkably, via specialization
    ♦ More advanced societies have a higher degree of specialization
    ♦ Achieved on a common platform!

  • More Justifications: the Utilization Wall [G. Venkatesh et al., ASPLOS'10]

    ♦ Assuming an 80 W power budget:
     At 45 nm (TSMC process), less than 7% of a 300 mm^2 die can be switched
    ♦ ITRS roadmap and CMOS scaling theory:
     Less than 3.5% at 32 nm
     The utilizable fraction roughly halves with each process generation
     It shrinks even further with 3-D integration
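The halving rule above compounds quickly; a small sketch (the 7% figure at 45 nm is from the slide, while the nodes past 32 nm are an illustrative extrapolation of the same halving rule):

```python
# Utilization-wall trend: the switchable fraction of a fixed-size die
# roughly halves per process generation (7% at 45 nm, 80 W budget).
nodes = [45, 32, 22, 16, 11, 8]   # nm
fraction = {}
f = 0.07                          # 7% switchable at 45 nm
for n in nodes:
    fraction[n] = f
    f /= 2                        # halves each generation
for n in nodes:
    print(f"{n:2d} nm: {100 * fraction[n]:.2f}% of the die switchable")
```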

  • Dark Silicon and the End of Multicore Scaling [H. Esmaeilzadeh et al., ISCA'11]

    ♦ Power wall:
     At 22 nm, 31% of a fixed-size chip must be powered off
     At 8 nm, more than 50%
    ♦ A growing gap between achievable and possible performance
     Due to power and parallelism limitations
     Speedup gap of at least 22x at 8 nm technology

    [Figure: percent dark silicon (geomean) across technology nodes]

  • Moore's Law Supports Customization and Specialization

    ♦ Previous architectures
     Transistor-limited -> maximize device reuse
    ♦ Future architectures
     Power/energy-limited -> maximize device efficiency
    ♦ A story of specialization

  • Example of Customizable Platforms: FPGAs

     Configurable logic blocks
     Island-style configurable mesh routing
     Dedicated components; specialization allows optimization
      • Memory/multiplier
      • I/O, processor
      • Anything the FPGA architect wants to put in!

    Source: I. Kuon, R. Tessier, J. Rose. FPGA Architecture: Survey and Challenges. 2008.

  • More Opportunities for Customization to be Explored
    Our Proposal: Customizable Heterogeneous Platform (CHP)

    [Diagram: a CHP with fixed cores, custom cores, caches ($), and programmable fabric hosting accelerators, connected to DRAM and I/O over a reconfigurable interconnect]

    Core parameters: frequency & voltage, datapath bit width, instruction window size, issue width, cache size & configuration, register file organization, # of thread contexts, ...
    Cache parameters: cache size & configuration, cache vs. SPM, ...
    NoC parameters: interconnect topology, # of virtual channels, routing policy, link bandwidth, router pipeline depth, number of RF-I enabled routers, RF-I channel and bandwidth allocation, ...
    Custom instructions & accelerators: shared vs. private accelerators, choice of accelerators, custom instruction selection, amount of programmable fabric, ...
    Interconnect: reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, optical interface

    Key questions:
     Optimal trade-off between efficiency & customizability
     Which options to fix at CHP creation? Which to be set by the CHP mapper?

  • Research Scope in CDSC (Center for Domain-Specific Computing)

    [Diagram: CHPs (fixed cores, custom cores, programmable fabric with accelerators) connected to DRAM over a reconfigurable RF-I / optical interconnect]

    Domain-specific modeling (healthcare applications)

    CHP mapping:
     Source-to-source CHP mapper
     Reconfiguring & optimizing backend
     Adaptive runtime

    CHP creation:
     Customizable computing engines
     Customizable interconnects (reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, optical interface)

    Architecture modeling and customization setting: design once, invoke many times

  • Current Focus: Accelerator-Rich Architectures (ARC)

    ♦ Accelerators provide high power-efficiency over general-purpose processors
     IBM wire-speed processor
     Intel Larrabee
    ♦ ITRS 2007 system-drivers prediction: the number of accelerators will be close to 1500 by 2022
    ♦ Two kinds of accelerators
     Tightly coupled: part of the datapath
     Loosely coupled: shared via the NoC
    ♦ Challenges
     Accelerator extraction and synthesis
     Efficient accelerator management
      • Scheduling
      • Sharing
      • Virtualization ...
     Friendly programming models

    Architecture Support for AcceleratorArchitecture Support for Accelerator--Rich CMPs (ARC) Rich CMPs (ARC) [DAC’2012][DAC’2012]

    CPUCPU Accelerator Accelerator ManagerManager AcceleratorAccelerator

    MotivationMotivation

    Operation Latency (# Cycles)

    1 core 2 cores 4 cores 8 cores 16 cores

    Invoke 214413 256401 266133 308434 316161

    RD/WR 703 725 781 837 885

    AppApp

    OSOS

    1212

    ♦♦ Managing accelerators through the OS is expensiveManaging accelerators through the OS is expensive♦♦ In an accelerator rich CMP, management should be cheaper both in In an accelerator rich CMP, management should be cheaper both in

    terms of time and energyterms of time and energy Invoke “Open”s the driver and returns the handler to driver. Called once. RD/WR is called multiple times.
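Using the 16-core column of the table, a quick sketch of how the one-time Invoke cost amortizes over RD/WR calls (the call counts are illustrative):

```python
# OS-driver overhead model from the table above (16-core column):
# one Invoke to open the driver, then n RD/WR calls.
INVOKE_CYCLES = 316161
RDWR_CYCLES = 885

def os_overhead(n_rdwr: int) -> int:
    """Total cycles spent in OS management for one accelerator session."""
    return INVOKE_CYCLES + n_rdwr * RDWR_CYCLES

# Even after amortizing Invoke over many transfers, per-call overhead
# stays in the thousands of cycles:
for n in (10, 100, 1000):
    print(n, os_overhead(n) // n, "cycles per RD/WR (amortized)")
```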

  • Overall Architecture of ARC

    ♦ Multiple cores and accelerators
    ♦ Global Accelerator Manager (GAM)
    ♦ Shared L2 cache banks, and NoC routers shared among multiple accelerators

    [Diagram legend: GAM; accelerator + DMA + SPM; shared router; core; shared L2 $; memory controller]

  • Overall Communication Scheme in ARC

    New ISA:
     lcacc-req t
     lcacc-rsrv t, e
     lcacc-cmd id, f, addr
     lcacc-free id

    1. The core requests a given type of accelerator (lcacc-req).
    2. The GAM responds with a "list + waiting time," or a NACK.
    3. The core reserves an accelerator (lcacc-rsrv) and waits.
    4. The GAM ACKs the reservation and sends the core ID to the accelerator.
    5. The core shares a task description with the accelerator through memory and starts it (lcacc-cmd).
       • The task description consists of:
         o Function ID and input parameters
         o Input/output addresses and strides
    6. The accelerator reads the task description and begins working.
       • Reads/writes from/to memory are overlapped with compute
       • The core is interrupted on a TLB miss
    7. When the accelerator finishes its current task, it notifies the core.
    8. The core then sends a message to the GAM freeing the accelerator (lcacc-free).
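The eight steps above can be sketched as a toy driver loop. The message names mirror the lcacc instructions, but every class, queue, and field name here is illustrative, not the actual ARC hardware interface:

```python
# Toy model of the core <-> GAM <-> accelerator handshake described above.
# All names besides the lcacc-* messages are illustrative.
from collections import deque

class GAM:
    def __init__(self, accelerators):
        # Free accelerators of each type, e.g. {"fft": [0, 1]}.
        self.free = {t: deque(ids) for t, ids in accelerators.items()}

    def lcacc_req(self, acc_type):
        # Step 2: respond with a candidate list, or NACK (None) if none exist.
        ids = self.free.get(acc_type)
        return list(ids) if ids else None

    def lcacc_rsrv(self, acc_type):
        # Steps 3-4: reserve one accelerator and ACK with its id.
        ids = self.free.get(acc_type)
        return ids.popleft() if ids else None

    def lcacc_free(self, acc_type, acc_id):
        # Step 8: return the accelerator to the free pool.
        self.free[acc_type].append(acc_id)

def run_task(gam, acc_type, task_description):
    if gam.lcacc_req(acc_type) is None:       # steps 1-2
        return None                           # NACK: fall back to software
    acc_id = gam.lcacc_rsrv(acc_type)         # steps 3-4
    # Steps 5-7: lcacc-cmd shares the task description through memory;
    # a trivial sum stands in for the accelerator's work here.
    result = sum(task_description["inputs"])
    gam.lcacc_free(acc_type, acc_id)          # step 8
    return result

gam = GAM({"fft": [0, 1]})
print(run_task(gam, "fft", {"inputs": [1, 2, 3]}))   # -> 6
```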

  • Accelerator Chaining and Composition

    ♦ Chaining
     Efficient accelerator-to-accelerator communication
    ♦ Composition
     Constructing virtual accelerators

    [Diagram: Accelerator1 and Accelerator2, each with its own scratchpad and DMA controller; a virtual 3D FFT composed from M-point 1D FFTs and an N-point 2D FFT]

  • Accelerator Virtualization

    ♦ The application programmer or compilation framework selects high-level functionality
    ♦ Implementation via:
     A monolithic accelerator
     Distributed accelerators composed into a virtual accelerator
     Software decomposition libraries
    ♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
     Step 1: 1-D FFT on rows 1 and 2
     Step 2: 1-D FFT on rows 3 and 4
     Step 3: 1-D FFT on columns 1 and 2
     Step 4: 1-D FFT on columns 3 and 4
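The four steps amount to the standard row-column decomposition of a 2-D DFT: transform every row, then every column of the result. A small sketch with a naive 4-point DFT standing in for the 1-D FFT accelerator (no accelerator semantics implied):

```python
# Row-column 2-D FFT, matching the 4-step schedule above
# (rows in pairs, then columns in pairs).
import cmath

def dft_1d(x):
    """Naive N-point 1-D DFT (stands in for a 4-point 1-D FFT accelerator)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_2d(grid):
    rows = [dft_1d(r) for r in grid]          # steps 1-2: all rows
    cols = [dft_1d(c) for c in zip(*rows)]    # steps 3-4: all columns
    return [list(r) for r in zip(*cols)]      # transpose back

grid = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
out = fft_2d(grid)
# A unit impulse at (0,0) transforms to an all-ones spectrum.
print(out[0][0])
```

The same decomposition is what lets two 4-point 1-D FFT LCAs, time-shared over four tasks, implement one virtual 4x4 2-D FFT.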

  • Light-Weight Interrupt Support

    [Diagram: CPU, GAM, and LCA exchanging interrupt messages]

    ♦ Sent by the GAM: request/reserve confirmation and NACK
    ♦ Sent by the LCA: TLB miss, task done
    ♦ The core sends logical addresses to the LCA
     The LCA keeps a small TLB for the addresses it is working on
     Why logical addresses?
      1. Accelerators can work on irregular addresses (e.g., indirect addressing)
      2. Using a large page size could be a solution, but would affect other applications
    ♦ It is expensive to handle these interrupts via the OS:

    Operation   Latency to switch to ISR and back (# cycles)
                1 core   2 cores   4 cores   8 cores   16 cores
    Interrupt   16 K     20 K      24 K      27 K      29 K

    ♦ Solution: extend the core with light-weight interrupt support
     Two main components added:
      • A table to store ISR info
      • An interrupt controller to queue and prioritize incoming interrupt packets
     Each thread registers the address of its ISR, the ISR's arguments, and the lw-int source
     Limitations:
      • Can only be used while the thread that the LW interrupt belongs to is running
      • Otherwise the interrupt is OS-handled
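The two added components, an ISR table and a prioritizing interrupt controller, can be sketched as below. The class, field, and packet names are all illustrative assumptions, not the paper's interface; only the message types (TLB miss, task done) and the fall-back-to-OS rule come from the slides:

```python
# Sketch of the light-weight interrupt (LWI) machinery described above:
# a per-(thread, source) ISR table plus a controller that queues and
# prioritizes incoming interrupt packets. All names are illustrative.
import heapq

class LWIController:
    def __init__(self):
        self.isr_table = {}   # (thread_id, source) -> ISR callable
        self.queue = []       # min-heap of (priority, seq, packet)
        self.seq = 0          # tie-breaker so packets never compare

    def register(self, thread_id, source, isr):
        """Each thread registers its ISR (and args) for a lw-int source."""
        self.isr_table[(thread_id, source)] = isr

    def deliver(self, priority, packet):
        heapq.heappush(self.queue, (priority, self.seq, packet))
        self.seq += 1

    def dispatch(self, running_thread):
        """LWI only fires for the currently running thread; anything else
        falls back to the OS-handled path (returned as 'os' here)."""
        handled = []
        while self.queue:
            _, _, pkt = heapq.heappop(self.queue)
            isr = self.isr_table.get((running_thread, pkt["source"]))
            handled.append(isr(pkt) if isr else "os")
        return handled

ctl = LWIController()
ctl.register(thread_id=7, source="TLB_MISS", isr=lambda p: f"fix {p['addr']}")
ctl.deliver(0, {"source": "TLB_MISS", "addr": 0x1000})
ctl.deliver(1, {"source": "TASK_DONE"})
# The TLB miss hits the registered ISR; TASK_DONE falls back to the OS.
print(ctl.dispatch(running_thread=7))
```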

  • Evaluation Methodology

    ♦ Benchmarks
     Medical imaging
     Vision & navigation

  • Application Domain: Medical Image Processing

    [Pipeline figure: reconstruction -> denoising -> registration -> segmentation -> analysis, each stage annotated with its governing equations]

    ♦ Reconstruction: compressive sensing
     Medical images exhibit sparsity, and can be sampled at a rate below that of classical Shannon-Nyquist theory
    ♦ Denoising: total variational algorithm
    ♦ Registration: fluid registration
    ♦ Segmentation and analysis: level set methods, Navier-Stokes equations

  • Area Overhead

    Component        Core   NoC   L2     Deblur   Denoise   Segmentation   Registration   SPM banks
    Instances/size   1      1     8 MB   1        1         1              1              39 x 2 KB
    Area (mm^2)      10.8   0.3   39.8   0.03     0.01      0.06           0.01           0.02
    Percentage (%)   18.0   0.5   66.2   3.4      0.8       6.5            1.2            2.4

    Total ARC: 14.3%

    ♦ AutoESL (from Xilinx) for C-to-RTL synthesis
    ♦ Synopsys for ASIC synthesis
     32 nm Synopsys Educational library
    ♦ CACTI for L2
    ♦ Orion for NoC
    ♦ One UltraSPARC IIIi core (area scaled to 32 nm)
     178.5 mm^2 in 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)

  • Experimental Results: Performance (N cores, N threads, N accelerators)

    [Plots: speedup (x) over SW-only and over OS-based, per configuration, for Registration, Deblur, Denoise, Segmentation]

    ♦ Performance improvement over SW-only approaches: 168x on average, up to 380x
    ♦ Performance improvement over OS-based approaches: 51x on average, up to 292x

  • Experimental Results: Energy (N cores, N threads, N accelerators)

    [Plots: energy gain (x) over SW-only and over OS-based, per configuration, for Registration, Deblur, Denoise, Segmentation]

    ♦ Energy improvement over SW-only approaches: 241x on average, up to 641x
    ♦ Energy improvement over OS-based approaches: 17x on average, up to 63x

  • What are the Problems with ARC?

    ♦ Dedicated accelerators are inflexible
     An LCA may be useless for new algorithms or new domains
     Often under-utilized
     LCAs contain many replicated structures
      • Things like fp-ALUs, DMA engines, SPM
      • Unused when the accelerator is unused
    ♦ We want flexibility and better resource utilization
     Solution: CHARM
    ♦ Private SPM is wasteful
     Solution: BiN

  • A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED'12]

    ♦ Motivation
     Tasks performed by accelerators tend to have a great deal of data parallelism
     A variety of LCAs is needed, with possible overlap in functionality, yet utilization of any particular LCA is somewhat sporadic
     It is expensive to have both:
      • Sufficient diversity of LCAs to handle the various applications
      • Sufficient quantity of a particular LCA to handle the parallelism
     Because of the overlap in functionality, LCAs can be built from a limited number of smaller, more general blocks: accelerator building blocks (ABBs)
    ♦ Idea: flexible ABBs that can be composed into accelerators
    ♦ Leverage economy of scale

  • Micro-Architecture of CHARM

    ♦ ABB: accelerator building block
     Primitive components that can be composed into accelerators
    ♦ ABB islands
     Multiple ABBs with a shared DMA controller, SPM, and NoC interface
    ♦ ABC: Accelerator Block Composer
     Orchestrates the data flow between ABBs to create a virtual accelerator
     Arbitrates requests from cores
    ♦ Other components
     Cores, L2 banks, memory controllers

  • An Example ABB Library (for Medical Imaging)

    [Figure: the ABB library, including the internals of the Poly block]

  • Example ABB Flow-Graph (Denoise)

    [Dataflow graph: pairs of subtract/multiply units compute squared differences, an adder tree accumulates them, followed by sqrt and 1/x]

    The graph is covered by four ABBs:
     ABB1: Poly
     ABB2: Poly
     ABB3: Sqrt
     ABB4: Inv

  • 8/21/2012

    25

    LCA Composition ProcessLCA Composition Process

    ABB ABB ISLAND1ISLAND1

    ABB ABB ISLAND2ISLAND2

    xx

    yy

    xx

    ww

    4949

    ABBABBISLAND3ISLAND3

    ABB ABB ISLAND4ISLAND4

    zz

    ww

    yy

    zz

    LCA Composition ProcessLCA Composition Process1.1. Core initiationCore initiation Core sends the task description: task flow-

    graph of the desired LCA to ABC together with l h d l f i t d t t

    ABB ABB ISLAND1ISLAND1

    ABB ABB ISLAND2ISLAND2

    polyhedral space for input and output

    xx

    yy

    xx

    wwx

    Task descriptionTask description

    5050

    ABBABBISLAND3ISLAND3

    ABB ABB ISLAND4ISLAND4

    zz

    ww

    yy

    zz

    y z

    10x10 input and output10x10 input and output

    LCA Composition Process
    2. Task-flow parsing and task-list creation: the ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data sizes, and fills the task list internally. Needed ABBs: "x", "y", "z". With a task size of 5x5 blocks, the ABC generates 4 tasks.
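The 10x10-into-5x5 split in step 2 can be sketched as follows (a hypothetical helper for illustration, not the ABC's actual task-list logic):

```python
def split_task(width, height, block):
    """Break a width x height request into block x block sub-tasks,
    in the spirit of the ABC filling its task list."""
    tasks = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            # Each task records its origin and its (possibly clipped) extent.
            tasks.append((r, c, min(block, height - r), min(block, width - c)))
    return tasks

# A 10x10 request with a 5x5 task size yields 4 tasks.
print(len(split_task(10, 10, 5)))  # 4
```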

    LCA Composition Process
    3. Dynamic ABB mapping: the ABC uses a pattern-matching algorithm to assign ABBs to islands, and fills the composed-LCA table and the resource allocation table.

        Island ID   ABB Type   ABB ID   Status
        1           x          1        Free
        1           y          1        Free
        2           x          1        Free
        2           w          1        Free
        3           z          1        Free
        3           w          1        Free
        4           y          1        Free
        4           z          1        Free

    LCA Composition Process
    3. Dynamic ABB mapping (continued): after mapping, the ABBs assigned to the composed LCA are marked Busy in the resource allocation table.

        Island ID   ABB Type   ABB ID   Status
        1           x          1        Busy
        1           y          1        Busy
        2           x          1        Free
        2           w          1        Free
        3           z          1        Busy
        3           w          1        Free
        4           y          1        Free
        4           z          1        Free

    LCA Composition Process
    4. LCA cloning: the ABC repeats the mapping to generate more LCAs as long as ABBs are available. Here a second LCA is composed from x on Island 2, y on Island 4 and z on Island 4.

        Island ID   ABB Type   ABB ID   Status
        1           x          1        Busy
        1           y          1        Busy
        2           x          1        Busy
        2           w          1        Free
        3           z          1        Busy
        3           w          1        Free
        4           y          1        Busy
        4           z          1        Busy

    LCA Composition Process
    5. ABBs finishing their tasks: when an ABB finishes, it signals DONE to the ABC. If the ABC has another task queued, it sends it to the ABB; otherwise it frees the ABB. As tasks complete, entries in the resource allocation table return from Busy to Free; here x on Island 2 and y and z on Island 4 have already been freed.

        Island ID   ABB Type   ABB ID   Status
        1           x          1        Busy
        1           y          1        Busy
        2           x          1        Free
        2           w          1        Free
        3           z          1        Busy
        3           w          1        Free
        4           y          1        Free
        4           z          1        Free

    LCA Composition Process
    6. Core notified of end of task: when the LCA finishes the whole request, the ABC signals DONE to the core, and all ABBs in the resource allocation table return to Free.
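The resource-table bookkeeping in steps 3 through 5 can be sketched in Python. This is a toy model of the ABC's tables, not the hardware; a greedy first-free assignment stands in for the actual pattern-matching algorithm:

```python
class ResourceTable:
    """Toy model of the ABC resource allocation table:
    one row per ABB, as (island_id, abb_type, abb_id, status)."""

    def __init__(self, entries):
        # entries: list of (island, abb_type); every ABB starts Free
        self.rows = [[isl, typ, 1, "Free"] for isl, typ in entries]

    def allocate(self, needed_types):
        """Grab one Free ABB per needed type (greedy stand-in for
        the pattern matcher); roll back if the LCA cannot be composed."""
        picked = []
        for t in needed_types:
            row = next((r for r in self.rows
                        if r[1] == t and r[3] == "Free"), None)
            if row is None:
                for r in picked:            # not enough ABBs: undo
                    r[3] = "Free"
                return None
            row[3] = "Busy"
            picked.append(row)
        return [(r[0], r[1]) for r in picked]

    def free(self, island, abb_type):
        """Called when an ABB signals DONE and no task is pending."""
        for r in self.rows:
            if r[0] == island and r[1] == abb_type:
                r[3] = "Free"

# The island layout from the slides: islands 1..4 with their ABBs.
rt = ResourceTable([(1, "x"), (1, "y"), (2, "x"), (2, "w"),
                    (3, "z"), (3, "w"), (4, "y"), (4, "z")])
lca1 = rt.allocate(["x", "y", "z"])   # first composed LCA
lca2 = rt.allocate(["x", "y", "z"])   # LCA cloning: a second copy still fits
```

With this layout, a third allocation of {x, y, z} fails until an ABB is freed, mirroring the Busy/Free transitions in the tables above.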

    ABC Internal Design
    ♦ ABC sub-components
      - Resource Table (RT): keeps track of available/used ABBs
      - Composed LCA Table (CLT): eliminates the need to re-compose LCAs
      - Task List (TL): queues the LCA requests that have been broken into smaller data sizes
      - TLB: services and shares the translation requests issued by the ABBs
      - Task Flow-Graph Interpreter (TFGI): breaks the LCA DFG into ABBs
      - LCA Composer (LC): composes the LCA using available ABBs
    ♦ Implementation
      - RT, CLT, TL and TLB are implemented using RAM
      - The TFGI has a table of ABB types and an FSM that reads the task flow-graph and compares it against those types
      - The LC has an FSM that scans the CLT and RT and marks the available ABBs

    [Figure: ABC block diagram. Cores send the composed DFG to the Accelerator Block Composer; the ABC allocates ABBs, receives their Done signals, and services their TLB requests.]

    Evaluation Methodology
    ♦ Simics+GEMS based simulation
    ♦ AutoPilot/Xilinx + Synopsys for ABB/ABC/DMA-C synthesis
    ♦ Cacti for memory (SPM) synthesis
    ♦ Automatic flow to generate the CHARM software and simulation modules
    ♦ Case studies
      - Physical LCA sharing with a Global Accelerator Manager (LCA+GAM)
      - Physical LCA sharing with the ABC (LCA+ABC)
      - ABB composition and sharing with the ABC (ABB+ABC)
    ♦ Medical imaging benchmarks: Denoise, Deblur, Segmentation and Registration

    Area Overhead Analysis
    ♦ Area-equivalent: the total area consumed by the ABBs equals the total area of all the LCAs required to run a single instance of each benchmark
    ♦ The total CHARM area is 14% of the 1cm x 1cm chip, a bit less than the LCA-based design

    Results: Improvement Over LCA-based Design
    ♦ 'Nx' has N times the area-equivalent accelerators
    ♦ Performance: 2.5X vs. LCA+GAM (max 5X); 1.4X vs. LCA+ABC (max 2.6X)
    ♦ Energy: 1.9X vs. LCA+GAM (max 3.4X); 1.3X vs. LCA+ABC (max 2.2X)
    ♦ ABB+ABC has better energy and performance: the ABC starts composing ABBs to create new LCAs, which creates more parallelism

    [Figure: normalized performance and normalized energy for Deblur, Denoise, Registration and Segmentation at 1x/2x/4x/8x area, comparing LCA+GAM, LCA+ABC and ABB+ABC.]

    Results: Platform Flexibility
    ♦ Two applications from two domains unrelated to medical imaging
      - Computer vision: Log-Polar Coordinate Image Patches (LPCIP)
      - Navigation: Extended Kalman Filter-based Simultaneous Localization and Mapping (EKF-SLAM)
    ♦ Only one ABB is added: Indexed Vector Load
      - Max benefit over LCA+GAM: 3.64X; average: 2.46X
      - Max benefit over LCA+ABC: 3.04X; average: 2.05X

    Memory Management for Accelerator-Rich Architectures [ISLPED'2012]
    ♦ Providing a private buffer for each accelerator is very inefficient
      - Large private buffers occupy a considerable amount of chip area
      - Small private buffers are less effective at reducing off-chip bandwidth
    ♦ Not all accelerators are powered on at the same time
      - Shared buffer [Lyons et al., TACO'12]
      - Allocate the buffers in the cache on demand [Fajardo et al., DAC'11][Cong et al., ISLPED'11]
    ♦ Our solution: BiN, a Buffer-in-NUCA scheme for accelerator-rich CMPs

    Buffer Size vs. Off-chip Memory Access Bandwidth
    ♦ Buffer size ↑, off-chip memory bandwidth ↓: a larger buffer covers a longer reuse distance [Cong et al., ICCAD'11]
    ♦ The buffer size vs. bandwidth curve: the BB-Curve
    ♦ Buffer utilization efficiency
      - Differs across accelerators
      - Differs across inputs to the same accelerator
    ♦ Prior work has no consideration of global allocation at runtime
      - Accepts fixed-size buffer allocation requests
      - Relies on the compiler to select a single, 'best' point on the BB-Curve

    [Figure: Denoise BB-Curves, plotting off-chip memory accesses against buffer size (9, 27, 119, 693 KB) for input images cube(28), cube(52) and cube(76); the steep initial region has high buffer utilization efficiency, the flat tail low efficiency.]

    Resource Fragmentation
    ♦ Prior work allocates a contiguous space to each buffer to simplify buffer access
    ♦ Requested buffers have unpredictable space demands and arrive dynamically: resource fragmentation
      - Example: a 15KB shared buffer space; Buffer 1: 5KB, duration 1K cycles; Buffer 2: 5KB, duration 2K cycles; Buffer 3: 10KB, duration 2K cycles
    ♦ NUCA complicates buffer allocation in the cache: the distance from the cache bank to the accelerator also matters
    ♦ To support fragmented resources: paged allocation, analogous to a typical OS-managed virtual memory
    ♦ Challenges
      - Large private page tables have high energy and area overhead
      - Indirect access to a shared page table has high latency overhead

    BiN: Buffer-in-NUCA
    ♦ Goals of Buffer-in-NUCA (BiN)
      - Move toward optimal on-chip storage utilization
      - Dynamically allocate buffer space in the NUCA among a large number of competing accelerators
    ♦ Contributions of BiN
      - Dynamic interval-based global (DIG) buffer allocation: addresses buffer resource contention
      - Flexible paged buffer allocation: addresses buffer resource fragmentation

    Accelerator-Rich CMP with BiN
    ♦ Overall architecture of ARC [Cong et al., DAC 2011] with BiN
      - Cores (with private L1 caches)
      - Accelerators: accelerator logic, a DMA controller, and a small storage for the control structure
      - The accelerator and BiN manager (ABM): arbitrates over accelerator resources and allocates buffers in the shared cache (BiN management)
      - NUCA (shared L2 cache) banks
    ♦ Operation flow
      (1) The core sends the accelerator and buffer allocation request with the BB-Curve to the ABM.
      (2) The ABM performs accelerator allocation and buffer allocation in the NUCA, and acknowledges the core.
      (3) The core sends the control structure to the accelerator.
      (4) The accelerator starts working with its allocated buffer.
      (5) The accelerator signals the core when it finishes.
      (6) The core sends the free-resource message to the ABM.
      (7) The ABM frees the accelerator and its buffer in the NUCA.

    Dynamic Interval-based Global (DIG) Allocation
    ♦ Perform global allocation for the buffer allocation requests received in an interval
      - Keep the interval short (10K cycles) to minimize the waiting-in-interval time
      - If 8 or more buffer requests arrive, the DIG allocation starts immediately
    ♦ An example: 2 buffer allocation requests
      - Each BB-Curve point (b, s): s is a buffer size, b the corresponding bandwidth requirement at s
      - Buffer utilization efficiency at point (b_ij, s_ij): (b_i(j-1) - b_ij) / (s_ij - s_i(j-1))
      - The points of each curve are in non-decreasing order of buffer size

    [Figure: two BB-Curves, with points (b00, s00) through (b04, s04) and (b10, s10) through (b12, s12); the per-step efficiencies of the two requests are compared against each other to decide which request's buffer to grow next.]
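The interval-based allocation can be sketched as a greedy walk over the BB-Curves by marginal efficiency. This is an illustrative reconstruction from the figure, not the paper's exact DIG algorithm; tie-breaking and budget handling are assumptions:

```python
def dig_allocate(curves, capacity):
    """Pick one point per BB-Curve within a shared capacity budget.

    curves: one list per request of (bandwidth, size) points with
    strictly increasing size. Efficiency of advancing request i from
    point j-1 to j is (b[j-1]-b[j]) / (s[j]-s[j-1]).
    Assumes the smallest point of every request always fits.
    """
    chosen = [0] * len(curves)              # start everyone at the smallest point
    used = sum(c[0][1] for c in curves)
    while True:
        best, best_eff = None, 0.0
        for i, c in enumerate(curves):
            j = chosen[i]
            if j + 1 < len(c):
                db = c[j][0] - c[j + 1][0]  # bandwidth saved by the next step
                ds = c[j + 1][1] - c[j][1]  # extra buffer space needed
                eff = db / ds
                if eff > best_eff and used + ds <= capacity:
                    best, best_eff = i, eff
        if best is None:                    # no affordable, beneficial step left
            break
        used += curves[best][chosen[best] + 1][1] - curves[best][chosen[best]][1]
        chosen[best] += 1
    return chosen

curve_a = [(100, 4), (40, 8), (30, 16)]     # made-up (bandwidth, size) points
curve_b = [(80, 4), (50, 8)]
print(dig_allocate([curve_a, curve_b], 20))  # [1, 1]: each request advances once
```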

    Flexible Paged Allocation
    ♦ Set the page size according to the buffer size: a fixed total number of pages for each buffer
    ♦ The BiN manager locally keeps the information on the current contiguous buffer space in each L2 bank, since all buffer allocation and free operations are performed by the BiN manager
    ♦ Allocation proceeds from the L2 bank nearest to the accelerator to the farthest
    ♦ The last page of a buffer (the source of page fragments) is allowed to be smaller than the buffer's other pages
      - No impact on the page-table lookup
      - The largest page fragment is smaller than the minimum page
      - Page fragments do not waste capacity, since they can still be used by the cache
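The fixed-page-count policy can be sketched as follows. The rounding rule here is an assumption for illustration; only the "fixed number of pages, last page may be smaller" behavior is taken from the slide:

```python
import math

def pages_for_buffer(buf_kb, max_pages=8, min_page_kb=4):
    """Pick a page size so the buffer fits in at most max_pages pages;
    the trailing page may be smaller than the rest."""
    page = max(min_page_kb, math.ceil(buf_kb / max_pages))
    full, rem = divmod(buf_kb, page)
    sizes = [page] * full
    if rem:
        sizes.append(rem)   # the last page holds whatever is left over
    return sizes

print(pages_for_buffer(30))  # [4, 4, 4, 4, 4, 4, 4, 2]
```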

    Buffer Allocation in NUCA
    ♦ Total buffer size
      - Buffers are allocated on demand
      - An upper bound on the total buffer size reduces the impact on the cache
      - State-of-the-art cache partitioning can be used to dynamically tune the upper bound, e.g. [Qureshi & Patt, MICRO'06]
    ♦ Buffer allocations among cache banks
      - Distribute the imposed upper bound across the cache banks to avoid creating high contention in any particular bank
      - State-of-the-art NUCA management schemes can further mitigate the contention introduced by buffer allocation, e.g. the page re-coloring scheme [Cho & Jin, MICRO'06]

    [Figure: per-benchmark (dealII, gcc, gobmk, hmmer, milc, namd, omnetpp, perl, povray, sphinx, xalancbmk) split of L2 capacity between cache and the BiN upper bound, for inputs 2p-28, 2p-52, 2p-76 and 2p-100.]

    Hardware Overhead of BiN Management
    ♦ Storage
      - 32 SRAMs: contiguous-space information for the cache banks
        - 7 entries each: at most 7 contiguous spaces in a 64KB cache bank with a minimum page of 4KB
        - 14 bits wide (10 bits: the starting block ID; 4 bits: the space length in units of the minimum page)
      - 8 SRAMs: the BB-Curves of the buffer requests
        - 8 entries each: at most 8 BB-Curve points
        - 5B wide: 2B for the buffer size and 3B for the buffer usage efficiency
      - Total storage overhead: 768B; area: 3,282um^2 (HP Cacti @ 32nm)
    ♦ Logic
      - 9,725um^2 @ 2GHz (Synopsys DC, SAED library @ 32nm)
      - An average latency of 0.6us (1.2K cycles @ 2GHz) to perform the buffer allocations
    ♦ The total area of the buffer allocation module is less than 0.01% of a medium-size 1cm^2 chip

    Simulation Infrastructure & Benchmarks
    ♦ Extend the full-system cycle-accurate Simics+GEMS simulation platform to support ARC+BiN

        CPU                        4 UltraSPARC III-i cores @ 2GHz
        L1 data/instruction cache  32KB per core, 4-way set-associative, 64B cache block, 3-cycle access latency, pseudo-LRU, MESI directory coherence by the L2 cache
        L2 cache (NUCA)            2MB, 32 banks of 64KB each, 8-way set-associative, 64B cache block, 6-cycle access latency, pseudo-LRU
        Network on chip            4x8 mesh, XY routing, wormhole switching, 3-cycle router latency, 1-cycle link latency
        Main memory                4GB, 1000-cycle access latency

    ♦ Benchmarks: 4 medical imaging applications in a medical imaging pipeline
      - Accelerators extracted with the method of [Cong et al., DAC'12] and synthesized by AutoESL from Xilinx
    ♦ Experimental benchmark naming convention
      - mP-n: m copies of the pipeline, the input to each a unique n^3-pixel image (no fragmentation; used to show the gain of DIG allocation alone)
      - mP-mix: m copies of the pipeline, the inputs randomly selected (fragmentation occurs; used to show the gain of both DIG and paged allocation)

    Reference Design Schemes
    ♦ Accelerator Store (AS) [Lyons et al., TACO'12]
      - Separate cache and shared buffer module
      - The buffer size is set 32% larger than the maximum buffer size in BiN, reflecting the overhead of buffer-in-cache
      - The shared buffer is partitioned into 32 banks distributed over the 32 NoC nodes
    ♦ BiC [Fajardo et al., DAC'11]
      - Dynamically allocates contiguous cache space to a buffer
      - Upper bound: buffer allocation is limited to at most half of each cache bank
      - Buffers can span multiple cache banks
    ♦ BiN-Paged: only the proposed paged allocation scheme
    ♦ BiN-Dyn: BiN-Paged plus dynamic allocation, but without consideration of near-future buffer requests; it responds to each request immediately, greedily satisfying it with the currently available resources
    ♦ BiN-Full: the entire proposed BiN scheme

    Impact of Dynamic Interval-based Global Allocation
    ♦ BiN-Full consistently outperforms the other schemes
      - The only exception is 4P-mix3, where the 1.32X larger capacity of the AS can accommodate all buffer requests
    ♦ Overall, compared to the Accelerator Store and BiC, BiN-Full reduces runtime by 32% and 35%, respectively

    [Figures: comparison results of normalized runtime and of normalized off-chip memory access counts across the 1P/2P/4P-{28, 52, 76, 100} and 4P-mix1 through 4P-mix6 benchmarks, for BiC, BiN-Paged, BiN-Dyn and BiN-Full.]

    Impact on Energy
    ♦ The AS consumes the least per-cache/buffer-access energy and the least unit leakage, because in the Accelerator Store the buffer and the cache are two separate units
    ♦ BiN-Dyn
      - Saves energy in cases where it can reduce the off-chip memory accesses and runtime
      - Incurs a large energy overhead in cases where it significantly increases the runtime
    ♦ Compared with the AS, BiN-Full reduces energy by 12% on average
      - Exceptions: 4P-mix-{2,3}, where the 1.32X capacity of the AS can better satisfy buffer requests
    ♦ Compared with BiC, BiN-Full reduces energy by 29% on average

    [Figure: normalized memory-subsystem energy across the benchmarks for BiC, BiN-Paged, BiN-Dyn and BiN-Full.]

    Customizable Heterogeneous Platform (CHP)
    Research scope in CDSC (Center for Domain-Specific Computing)

    ♦ CHP creation: customizable computing engines and customizable interconnects
      - Fixed cores, custom cores, and programmable fabric with accelerators, plus caches, DRAM and I/O
      - Reconfigurable RF-I bus; reconfigurable optical bus with transceiver/receiver and optical interface
    ♦ CHP mapping: source-to-source CHP mapper, reconfiguring and optimizing backend, adaptive runtime
    ♦ Domain-specific modeling (healthcare applications)
    ♦ Architecture modeling and customization setting: design once, invoke many times

    CHARM Software Infrastructure
    ♦ ABB type extraction
      - Input: compute-intensive kernels from different applications
      - Output: ABB super-patterns
      - Currently semi-automatic
    ♦ ABB template mapping
      - Input: kernels + ABB types
      - Output: covered kernels as an ABB flow-graph
    ♦ CHARM uProgram generation
      - Input: ABB flow-graph
      - Output:

    Programming Support for Accelerator-Rich Architectures
    ♦ Two-level support
      - Top-down: provide an accelerator library
        - Physical and virtualized (FFT, SQRT, etc.)
        - User-specified compilation
      - Bottom-up: automatic template-based compilation
        - Step 1: accelerator template definition, a DFG (CDFG) representing the accelerator functionality; user-given vs. automatically explored
        - Step 2: accelerator candidate identification, i.e. identify accelerator-executable code pieces
        - Step 3: accelerator template mapping, i.e. map accelerator-executable candidates to real accelerators

    Template-Based Compilation Flow
    ♦ Accelerator candidate identification
      - Given an input data flow graph G and an accelerator template T, identify all the accelerator candidates in G that can run on the accelerator units
      - Subgraph isomorphism with pre-filtering (feature vectors [Cong et al., FPGA'08])

    [Figure: (a) kernel DFG of rician-denoise with numbered add/multiply/divide nodes; (b) a user-defined accelerator template whose nodes are arithmetic/logical/move operations.]
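The pre-filtering idea can be illustrated with a simple necessary-condition check on operation counts. The actual feature vectors of [Cong et al., FPGA'08] are richer than this; the sketch only conveys the flavor of pruning candidates before the expensive subgraph-isomorphism test:

```python
from collections import Counter

def may_match(candidate_ops, template_ops):
    """Cheap necessary condition: a candidate subgraph can only map onto
    the template if, for every operation type, it needs no more instances
    than the template provides. Passing this does not guarantee a match;
    failing it rules one out without running subgraph isomorphism."""
    cand, tmpl = Counter(candidate_ops), Counter(template_ops)
    return all(cand[op] <= tmpl[op] for op in cand)

template = ["+", "+", "*", "/"]           # ops offered by a toy template
print(may_match(["+", "*"], template))    # cheap check passes
print(may_match(["+", "-"], template))    # '-' not offered: pruned early
```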

    Template-Based Compilation Flow
    ♦ Accelerator template mapping
      - Given an input data flow graph G and a set of identified accelerator candidates, select a subset of candidates that covers the entire G optimally, and map each selected candidate to an accelerator unit

    [Figure: (a) one mapping solution covering the rician-denoise DFG with two candidates; (b) the corresponding accelerator configurations acc1 and acc2.]
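The covering step above can be illustrated with a greedy heuristic. The real flow selects an optimal cover; this largest-first, non-overlapping sketch is only an illustration of the problem shape:

```python
def greedy_cover(all_nodes, candidates):
    """Choose candidate subgraphs so DFG nodes are covered without overlap.
    candidates: iterable of frozensets of node IDs."""
    uncovered = set(all_nodes)
    chosen = []
    # Prefer larger candidates; take one only if it fits entirely in the
    # still-uncovered region (selected candidates must not overlap).
    for cand in sorted(candidates, key=len, reverse=True):
        if set(cand) <= uncovered:
            chosen.append(cand)
            uncovered -= cand
    return chosen, uncovered

# Toy DFG with 8 nodes and three candidate subgraphs; the third overlaps
# the first two and is therefore skipped.
nodes = list(range(1, 9))
cands = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8}), frozenset({3, 4, 5})]
chosen, left = greedy_cover(nodes, cands)
```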

    Accelerator Template Definition
    ♦ Text-based interface
      - Node declaration: ni:op1;op2;...;opn specifies the set of operations supported by template node ni
      - Edge declaration: ni->nj:k specifies the data flow between template nodes ni and nj; the result of node ni is sent to node nj as its k-th operand
    ♦ Example
      - Node declarations: n1: *    n2: +    n3: *;+    n4: /
      - Edge declarations: n1->n2:1    n2->n4:1    n3->n4:2

    [Figure: the resulting template, a multiply feeding an add, which together with a {+,*} node feeds a divide.]
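A minimal parser for this text interface might look like the following. The exact grammar details (whitespace handling, comments) are assumptions; only the two declaration forms come from the slide:

```python
def parse_template(lines):
    """Parse node declarations "ni:op1;op2;..." and edge declarations
    "ni->nj:k" into a (nodes, edges) pair."""
    nodes, edges = {}, []
    for line in lines:
        line = line.strip()
        if "->" in line:                        # edge: ni->nj:k
            src, rest = line.split("->")
            dst, k = rest.split(":")
            edges.append((src, dst, int(k)))    # ni's result is nj's k-th operand
        else:                                   # node: ni:op1;op2;...
            name, ops = line.split(":")
            nodes[name] = set(ops.split(";"))
    return nodes, edges

# The example template from the slide.
nodes, edges = parse_template(
    ["n1:*", "n2:+", "n3:*;+", "n4:/", "n1->n2:1", "n2->n4:1", "n3->n4:2"])
```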

    xPilot: Behavioral-to-RTL Synthesis Flow [SOCC'2006]
    ♦ Behavioral spec in C/C++/SystemC, processed by a frontend compiler into the SSDM representation, guided by a platform description
    ♦ Advanced transformations/optimizations
      - Loop unrolling/shifting/pipelining
      - Strength reduction / tree height reduction
      - Bitwidth analysis
      - Memory analysis
    ♦ Core behavior synthesis optimizations
      - Scheduling
      - Resource binding, e.g. functional unit binding and register/port binding
    ♦ Architecture generation and RTL/constraints generation
      - Output: RTL + constraints in Verilog/VHDL/SystemC
      - Targets: FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, ...)

  • 8/21/2012

    42

    AutoPilot Compilation Tool (based on the UCLA xPilot system)
    ♦ Platform-based C to FPGA synthesis
    ♦ Synthesizes pure ANSI-C and C++; GCC-compatible compilation flow
    ♦ Full support of IEEE-754 floating-point data types & operations
    ♦ Efficiently handles bit-accurate fixed-point arithmetic
    ♦ More than 10X design-productivity gain
    ♦ High quality of results

    [AutoPilot(TM) ESL synthesis flow: the design specification (C/C++/SystemC plus user constraints) passes through compilation & elaboration, presynthesis optimizations, and behavioral & communication synthesis and optimizations, guided by a platform-characterization library and timing/power/layout constraints, producing RTL HDLs & RTL SystemC for an FPGA co-processor, with simulation, verification, and prototyping on the testbench]

    Developed by AutoESL, acquired by Xilinx in Jan. 2011
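    "Bit-accurate fixed-point arithmetic" means the tool must model values at an exact width and fraction position, including wrap-on-overflow. A minimal sketch of that behavior (our own illustration of the general Qm.n technique; the function names and the truncate/wrap policy are our assumptions, not AutoPilot's actual API):

```python
# Bit-accurate fixed point: a value is stored as a signed W-bit integer
# with F fraction bits; conversion truncates and wraps, as hardware would.

def to_fixed(x, w, f):
    """Quantize x to a signed W-bit value with F fraction bits (truncate)."""
    raw = int(x * (1 << f))        # scale into the integer domain
    raw &= (1 << w) - 1            # wrap to W bits on overflow
    if raw >= 1 << (w - 1):        # reinterpret the top bit as the sign
        raw -= 1 << w
    return raw

def fixed_to_float(raw, f):
    """Recover the real value represented by a fixed-point raw integer."""
    return raw / (1 << f)

# 0.75 in an 8-bit format with 4 fraction bits (Q4.4): 0.75 * 16 = 12
r = to_fixed(0.75, 8, 4)
```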

    AutoPilot Results: Sphere Decoder (from Xilinx)
    • Wireless MIMO sphere decoder
     – ~4000 lines of C code
     – Xilinx Virtex-5 at 225MHz
    • Compared to optimized IP: 11-31% better resource usage

    [Top-level block diagram: 4x4, 3x3, and 2x2 matrix-inverse blocks (each with matrix multiply, QRD, back-substitution, and norm search/reorder), an 8x8 RVD QRD, and a tree-search sphere detector (stages 1 through 8 plus min search)]

      Metric      RTL Expert   AutoPilot Expert   Diff (%)
      LUTs        32,708       29,060             -11%
      Registers   44,885       31,000             -31%
      DSP48s      225          201                -11%
      BRAMs       128          99                 -26%

    TCAD April 2011 (keynote paper): "High-Level Synthesis for FPGAs: From Prototyping to Deployment"
    8/21/2012 UCLA VLSICAD


    AutoPilot Results: Optical Flow (from BDTI)
    ♦ Application
     Optical flow, 1280x720 progressive scan
     Design too complex for an RTL team
    ♦ Compared to a high-end DSP: 30X higher throughput, 40X better cost/fps

      Chip                           Unit Cost   Highest Frame Rate   Cost/performance
                                                 @ 720p (fps)         ($/frame/second)
      Xilinx Spartan-3A DSP
      XC3SD3400A chip                $27         183                  $0.14
      Texas Instruments
      TMS320DM6437 DSP processor     $21         5.1                  $4.20

    BDTI evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pdf

    AutoPilot Results: DQPSK Receiver (from BDTI)
    ♦ Application
     DQPSK receiver
     18.75 Msamples @ 75MHz clock speed
    ♦ Area better than hand-coded RTL
     Xilinx XC3SD3400A chip utilization ratio (lower is better):
      Hand-coded RTL: 5.9%    AutoPilot: 5.6%

    BDTI evaluation of AutoPilot: http://www.bdti.com/articles/AutoPilot.pdf


    CHP Mapping Overview
    Goal: efficient mapping of domain-specific applications to customizable hardware; adapt the CHP to a given application so as to optimize performance/power efficiency.

    [Flow diagram: the programmer writes domain-specific applications against the domain-specific programming model (domain-specific coordination graph and language extensions); abstract execution supplies application characteristics to a source-to-source CHP mapper (ROSE). C/C++ code goes through the C/C++ front-end into the ROSE SAGE IR; a ROSE-LLVM translator feeds the reconfiguring and optimizing back-end (LLVM), which emits binary code for fixed & customized cores. Accelerator kernels in C/C++ go to an RTL synthesizer (AutoPilot/xPilot) producing RTL for the programmable fabric, plus accelerator code via compiler/library. A unified adaptive runtime system maps tasks across CPUs, GPUs, accelerators, and FPGA processors on the CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP), with performance feedback and CHP architecture models closing the loop.]

    Programming Model and Runtime Support [LCTES12]
    ♦ Concurrent Collections (CnC) programming model
     Clear separation between application description and implementation
     Fits domain-expert needs
    ♦ CnC-HC: software flow from CnC to Habanero-C (HC)
    ♦ Cross-device work-stealing in Habanero-C
     Task affinity with heterogeneous components
    ♦ Data-driven runtime in CnC-HC


    CnC Building Blocks
    ♦ Steps
     Computational units
     Functional with respect to their inputs
    ♦ Data Items
     Means of communication between steps
     Dynamic single assignment
    ♦ Control Items
     Used to create (prescribe) instances of a computation step
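    The three building blocks above can be made concrete with a toy runtime. This is entirely our illustration (not the Intel CnC or CnC-HC API): data items obey dynamic single assignment, steps are functional in their inputs, and control tags prescribe step instances.

```python
# Toy CnC-style runtime: an item collection that enforces dynamic single
# assignment, a purely functional step, and control tags that prescribe
# which step instances run.

class ItemCollection:
    def __init__(self):
        self._items = {}

    def put(self, tag, value):
        if tag in self._items:              # dynamic single assignment
            raise RuntimeError(f"item {tag} already written")
        self._items[tag] = value

    def get(self, tag):
        return self._items[tag]

def square_step(tag, items_in, items_out):
    """A step: functional with respect to its inputs."""
    items_out.put(tag, items_in.get(tag) ** 2)

xs, ys = ItemCollection(), ItemCollection()
for t in range(4):
    xs.put(t, t + 1)                        # data items: 1, 2, 3, 4
control_tags = [0, 1, 2, 3]                 # control items prescribe steps
for t in control_tags:
    square_step(t, xs, ys)
```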

    HC-1ex Architecture
    ♦ "Commodity" Intel server
     Standard Intel x86-64 server running x86-64 Linux
     Xeon Quad-Core LV5408, 40W TDP
     Intel Xeon processor, Intel Memory Controller Hub (MCH), Intel I/O subsystem, memory
    ♦ Convey FPGA-based coprocessor
     Application Engine Hub (AEH) and Application Engines (AEs), with a direct data port to memory
     Shared cache-coherent memory
     XC6vlx760 FPGAs, 80GB/s off-chip bandwidth, 94W design power
    ♦ Tesla C1060 GPU
     100GB/s off-chip bandwidth, 200W TDP


    Runtime Support: Experimental Results
    ♦ Performance for medical imaging kernels

                      Denoise          Registration     Segmentation
      Num iterations  3                100              50
      CPU (1 core)    3.3s             457.8s           36.76s
      GPU             0.085s (38.3x)   20.26s (22.6x)   1.263s (29.1x)
      FPGA            0.190s (17.2x)   17.52s (26.1x)   4.173s (8.8x)
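    The parenthesized factors in the table above are speedups relative to the single-core CPU time; recomputing them from the listed times confirms the columns (the denoise ratios differ slightly from the printed factors because the short denoise times are rounded):

```python
# Recompute GPU and FPGA speedups (CPU time / device time) from the
# measured times in the table above.

cpu  = {"denoise": 3.3,   "registration": 457.8, "segmentation": 36.76}
gpu  = {"denoise": 0.085, "registration": 20.26, "segmentation": 1.263}
fpga = {"denoise": 0.190, "registration": 17.52, "segmentation": 4.173}

gpu_speedup  = {k: round(cpu[k] / gpu[k], 1) for k in cpu}
fpga_speedup = {k: round(cpu[k] / fpga[k], 1) for k in cpu}
```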

    Experimental Results (Cont'd)
    • Execution times and active energy with dynamic work stealing


    Static vs. Dynamic Binding
    ♦ Static binding
    ♦ Dynamic binding
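    The slide names the two policies without detail; a toy model (entirely our illustration) shows why dynamic binding can win on a heterogeneous platform. Tasks have per-device costs; static binding fixes every task to one device at compile time, while dynamic binding lets each task go to whichever device is least loaded at runtime.

```python
# Contrast static vs. dynamic task binding on two devices (e.g. CPU, GPU).

def makespan_static(costs, device):
    """All tasks bound to a single device chosen ahead of time."""
    return sum(c[device] for c in costs)

def makespan_dynamic(costs, n_devices):
    """Greedy runtime binding: each task joins the least-loaded device."""
    load = [0.0] * n_devices
    for c in costs:
        d = min(range(n_devices), key=lambda i: load[i] + c[i])
        load[d] += c[d]
    return max(load)

# (cpu_time, gpu_time) per task: half the tasks favor each device.
costs = [(4.0, 1.0), (4.0, 1.0), (1.0, 4.0), (1.0, 4.0)]
```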

    Concluding Remarks
    ♦ Despite the end of scaling, there is plenty of opportunity in customization and specialization for energy-efficient computing
    ♦ Many opportunities and challenges for architecture support
     Cores
     Accelerators
     Memory
     Network-on-chips
    ♦ Software support is also critical


    Acknowledgements: CDSC Faculty
    Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Cong (Director, UCLA), Cheng (UCSB), Chang (UCLA), Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA), Potkonjak (UCLA)

    More Acknowledgements
    Mohammad Ali Ghodrat, Yi Zou, Chunyue Liu, Hui Huang, Michael Gill, Beayna Grigorian

    ♦ This research is partially supported by the Center for Domain-Specific Computing (CDSC) funded by the NSF Expeditions in Computing Award CCF-0926127, and by GSRC under contract 2009-TJ-1984.


    Examples of Energy-Efficient Customization
    ♦ Customization of processor cores
    ♦ Customization of on-chip memory
    ♦ Customization of on-chip interconnects

    Terahertz VCO in 65nm CMOS
    ♦ Demonstrated an ultra-high-frequency, low-power oscillator structure in CMOS by adding a negative-resistance parallel tank, with the fundamental frequency at 217GHz and 16.8 mW DC power consumption.
    ♦ The measured 4th and 6th harmonics are about 870GHz and 1.3THz, respectively.
     Measured signal spectrum with uncalibrated power; the higher harmonics (4th and 6th) may be substantially underestimated due to excessive water and oxygen absorption and setup losses at these frequencies.

    "Generating Terahertz Signals in 65nm CMOS with Negative-Resistance Resonator Boosting and Selective Harmonic Suppression," Symposium on VLSI Technology and Circuits, June 2010


    Use of Multiband RF-Interconnect for Customization
    • In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
    • N different data streams (N=6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
    • In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter

    Mesh Overlaid with RF-I [HPCA'08]
    ♦ 10x10 mesh of pipelined routers
     NoC runs at 2GHz
     XY routing
    ♦ 64 4GHz 3-wide processor cores
     Labeled aqua
     8KB L1 data cache, 8KB L1 instruction cache
    ♦ 32 L2 cache banks
     Labeled pink
     256KB each
     Organized as a shared NUCA cache
    ♦ 4 main memory interfaces
     Labeled green
    ♦ RF-I transmission-line bundle
     Black thick line spanning the mesh


    RF-I Logical Organization
    • Logically, RF-I behaves as a set of N express channels
     - Each channel assigned to a source/destination router pair (s, d)
    • Reconfigured by remapping shortcuts to match the needs of different applications
    [Figure: two logical configurations, Logical A and Logical B]

    Latest Progress: Die Photo of STL-DBI Transceiver
     Controller side and memory side; active area 0.12mm2 (15% smaller than Ref. [4])
     [4] G.-S. Byun, et al., ISSCC 2011


    Comparison with State-of-the-Art

                         This Work       JSSC 2009[1]    ISSCC 2009[2]   JSSC 2010[3]   ISSCC 2011[4]
      Technology         65nm            180nm           130nm           40nm           65nm
      Signaling          Single-ended    Single-ended    Single-ended    Differential   Differential
      Link type          SBD             Bidirectional   Bidirectional   Bidirectional  SBD
      Aggregate data
      rate/pin           8Gb/s/pin       5Gb/s/pin       6Gb/s/pin       2.15Gb/s/pin   4.2Gb/s/pin
      Total power        14.4mW (BB)     87mW            95mW            14.4mW         12mW (BB)
                         17.6mW (RF)                                                    11mW (RF)
      Energy/bit/pin     4pJ/bit/pin     17.4pJ/bit/pin  15.8pJ/bit/pin  6.6pJ/bit/pin  5pJ/bit/pin
      Chip area          0.12mm2         0.52mm2         0.30mm2         0.9mm2         0.14mm2
      FoM                2.08            0.11            0.21            0.17           1.30

      FoM = (data rate per pin) / (chip area x total power)   [Gb/pin / (mm2 * mJ)]

    [1] K.-I. Oh, et al., JSSC 2009 (Samsung)
    [2] K.-S. Ha, et al., ISSCC 2009 (Samsung)
    [3] B. Leibowitz, et al., JSSC 2010 (Rambus)
    [4] G.-S. Byun, et al., ISSCC 2011 (UCLA)
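    The FoM row follows directly from the other rows of the table; recomputing it reproduces every entry:

```python
# FoM = (data rate per pin, Gb/s) / (chip area, mm^2  x  total power, mW),
# computed from the table's own numbers (BB + RF power summed where split).

designs = {                   # (Gb/s per pin, area mm^2, total power mW)
    "This Work":      (8.0,  0.12, 14.4 + 17.6),
    "JSSC 2009 [1]":  (5.0,  0.52, 87.0),
    "ISSCC 2009 [2]": (6.0,  0.30, 95.0),
    "JSSC 2010 [3]":  (2.15, 0.9,  14.4),
    "ISSCC 2011 [4]": (4.2,  0.14, 12.0 + 11.0),
}

fom = {name: round(rate / (area * power), 2)
       for name, (rate, area, power) in designs.items()}
```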



    Results: Improvement Over LCA-based Design (Old Results)
    ♦ "Np" has N cores, N threads, and N-times area-equivalent accelerators
    ♦ Energy
     2.4X vs. LCA+GAM (max 4.7X)
     1.6X vs. LCA+ABC (max 3.1X)
    ♦ Performance
     2.2X vs. LCA+GAM (max 3.8X)
     1.6X vs. LCA+ABC (max 2.7X)
    ♦ ABB+ABC gives better energy and performance
     ABC starts composing ABBs to create new LCAs
     Creates more parallelism

    [Charts: normalized performance and normalized energy for LCA+GAM, LCA+TD, and ABB+TD across the 1p/2p/4p/8p configurations of the Seg, Deb, Reg, and Den benchmarks]

    Power Barrier and Current Solution
    • 10's to 100's of cores in a processor
    • 1000's to 10,000's of servers in a data center
    Parallelization


    Examples of Energy-Efficient Customization
    ♦ Customization of processor cores
    ♦ Customization of on-chip memory
    ♦ Customization of on-chip interconnects