Customizable Domain-Specific Computing (slides: cadlab.cs.ucla.edu/~cong/slides/fpl09_keynote.pdf)

1

Customizable Domain-Specific Computing

Jason Cong
Center for Domain-Specific Computing
UCLA Computer Science Department

[email protected]
http://cadlab.cs.ucla.edu/~cong

2

The Power Barrier …

Source : Shekhar Borkar, Intel

3

Focus: New Transformative Approach to Power/Energy Efficient Computing

Current Solution: Parallelization

Source: Shekhar Borkar, Intel

4

Cost and Energy are Still a Big Issue …

Cost of computing
• HW acquisition
• Energy bill
• Heat removal
• Space
• …

5

Next Significant Opportunity -- Customization

Parallelization

Source: Shekhar Borkar, Intel

Customization: adapt the architecture to the application domain

6

Motivation

A few facts
• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only selected tasks in its domain
• Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power-performance efficiency, but are too expensive to design and manufacture

Our proposal
A general, customizable platform for the given domain(s)
• Can be customized to a wide range of applications in the domain
• Can be mass-produced with cost efficiency
• Can be programmed efficiently with novel compilation and runtime systems

Goal: A “supercomputer-in-a-box” with 100X performance/power improvement via customization for the intended domain(s)

Analogy: Advance of civilization via specialization/customization

7

Example Application Domain: Healthcare

Medical imaging has transformed healthcare
• An in vivo method for understanding disease development and patient condition
• Estimated to be $100 billion/year
• More powerful & efficient computation can help
  - Fewer exposures using compressive sensing
  - Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms

Hemodynamic simulation
• Very useful for surgical procedures involving blood flow and vasculature

Both may take hours to days to construct
• Clinical requirement: 1-2 min
• Cloud computing won’t work: communication, real-time requirement, privacy
• A megawatt datacenter for each hospital?

[Figures: intracranial aneurysm reconstruction with hemodynamics; magnetic resonance (MR) angiograph of an aneurysm]

8

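Medical Image Processing Pipeline

Pipeline stages: reconstruction, denoising, registration, segmentation, analysis

Reconstruction: compressive sensing
Medical images exhibit sparsity, and can be sampled at a rate << classical Shannon-Nyquist theory:
$$\min_u \sum_{\text{sampled points}} (ARu - S)^2 + \lambda \sum_{\forall \text{voxels}} |\mathrm{grad}\, u|$$

Denoising: total variational algorithm
[Equation: for every voxel i, u(i) is a weighted sum of f(j) over the voxels j of the volume, with weights w_{i,j} that include a normalization term Z(i), an exponential, and a filter parameter h]

Registration: fluid registration
$$v = \frac{\partial u}{\partial t} + v \cdot \nabla u$$
$$\mu \Delta v + (\mu + \eta) \nabla (\nabla \cdot v) = -\left[ T(x - u) - R(x) \right] \nabla T(x - u)$$

Segmentation: level set methods
$$\text{surface}(t) = \{ \text{voxels } x : \varphi(x, t) = 0 \}$$
$$\frac{\partial \varphi}{\partial t} = |\nabla \varphi| \left[ F(\text{data}, \phi) + \lambda \, \mathrm{div}\!\left( \frac{\nabla \varphi}{|\nabla \varphi|} \right) \right]$$

Analysis: Navier-Stokes equations
$$\frac{\partial v}{\partial t} + v \cdot \nabla v = -\nabla p + \upsilon \Delta v + f(x, t)$$
$$\frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \upsilon \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x, t)$$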

9

Application Domains: Medical Image Processing Pipeline

Pipeline stages: reconstruction (compressive sensing), denoising, registration (fluid registration), segmentation (level set methods), analysis

Computation and communication patterns of the pipeline algorithms:
• non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
• parallel, global communication; dense linear algebra, optimization methods
• local communication; sparse linear algebra, n-body methods, graphical models
• local communication; dense linear algebra, spectral methods, MapReduce
• iterative, local or global communication; dense and sparse linear algebra, optimization methods

• These algorithms have diverse computation & communication patterns
• A single homogeneous system cannot perform very well on all these algorithms

10

Need of Customization for Medical Image Processing Pipeline

• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms)

Platform             Speedup   Power
CPU (Xeon 2.0 GHz)   1x        ~100 W
GPU (Tesla C1060)    93x       ~150 W
FPGA (xc4vlx100)     11x       ~5 W

3D median filter: for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels

Platform             Algorithm                    Speedup   Power
CPU (Xeon 2.0 GHz)   Quick select                 1x        ~100 W
GPU (Tesla C1060)    Median of medians            70x       ~140 W
FPGA (xc4vlx100)     Bit-by-bit majority voting   1200x     ~3 W
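The slides name the FPGA median algorithm only as "bit-by-bit majority voting". The sketch below shows one common formulation of that idea, written in Python for readability; it is an assumption, not the implementation behind the reported 1200x figure. The function names, the 8-bit default width, and the nested-list volume layout are all hypothetical.

```python
def majority_vote_median(values, bits=8):
    """Median of an odd-length list of unsigned ints, decided one bit at a time."""
    n = len(values)
    assert n % 2 == 1, "sketch assumes an odd number of inputs (e.g. 27 voxels)"
    # state[i]: 0 = still matches the median's bit prefix,
    #          +1 = already known to be larger, -1 = already known to be smaller
    state = [0] * n
    median = 0
    for b in range(bits - 1, -1, -1):          # most significant bit first
        votes = []
        for v, s in zip(values, state):
            if s > 0:
                votes.append(1)                # larger values always vote 1
            elif s < 0:
                votes.append(0)                # smaller values always vote 0
            else:
                votes.append((v >> b) & 1)     # undecided values vote their own bit
        bit = 1 if sum(votes) > n // 2 else 0  # majority gives the median's bit
        median = (median << 1) | bit
        for i in range(n):
            if state[i] == 0 and votes[i] != bit:
                state[i] = 1 if votes[i] == 1 else -1
    return median


def median_filter_voxel(volume, x, y, z, bits=8):
    """Median of the 3 x 3 x 3 neighborhood around an interior voxel (x, y, z)."""
    window = [volume[x + dx][y + dy][z + dz]
              for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    return majority_vote_median(window, bits)
```

Each bit position needs only a population count and a comparison, with no data-dependent branching or sorting, which is one plausible reason the approach maps well to FPGA logic.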

11

12

Center for Domain-Specific Computing (CDSC) Organization

• A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math
• 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara

Faculty: Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director) (UCLA), Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio-State), Sarkar (Associate Dir) (Rice), Vese (UCLA)

13

Customizable Heterogeneous Platform (CHP)

[Figure: CHP block diagram. Each CHP contains caches ($), fixed cores, custom cores, and programmable fabric, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface); several CHPs connect to DRAM and I/O.]

Overview of the Proposed Research

Domain characterization
• Application modeling
• Domain-specific modeling (healthcare applications)

CHP creation
• Customizable computing engines
• Customizable interconnects
• Architecture modeling

CHP mapping
• Source-to-source CHP mapper
• Reconfiguring & optimizing backend
• Adaptive runtime

Design once, invoke many times

14

CHP Creation – Design Space Exploration

Key questions:
• Optimal trade-off between efficiency & customizability
• Which options to fix at CHP creation? Which to be set by CHP mapper?

Custom instructions & accelerators
• Amount of programmable fabric
• Shared vs. private accelerators
• Custom instruction selection
• Choice of accelerators
• …

Core parameters
• Frequency & voltage
• Datapath bit width
• Instruction window size
• Issue width
• Cache size & configuration
• Register file organization
• # of thread contexts
• …

NoC parameters
• Interconnect topology
• # of virtual channels
• Routing policy
• Link bandwidth
• Router pipeline depth
• Number of RF-I enabled routers
• RF-I channel and bandwidth allocation
• …


15

Customization for Cores

Example of core customization space
• Instruction queue size
• ROB size
• Register file size
• LSQ size
• Number and type of FUs
• Branch predictor
• BTB size
• BTB complexity
• Memory hierarchy and configuration
  - Cache sizes
  - Cache associativity
  - Memory latency

16

Existing Studies on Cores Customization (not domain-specific)

Reference | Feature | Impact
[Cong et al., Trans. on PDS 2007] | Core spilling – spill from 1 core up to 8 cores | Less than 50% worse than an ideal 8x more powerful core; up to 40% improvement for changing workloads
[Ipek et al., ISCA 2007] | Core fusion – 2-issue cores fused to simulate 4- and 6-issue cores | Less than 30% and 20% worse for sequential and parallel benchmarks, respectively
[Folegnani & Gonzalez, ISCA 2001] | Issue logic and issue queue (43/58) | 16% total processor energy saving
[Ponomarev et al., MICRO 2001] | Instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) | Power saving of 59% for the three components
[Hughes et al., MICRO 2001] | Issue width (8,4,2), issue queue (128,64,32), function units (4,2), dynamic voltage scaling | Up to 78% total energy saving with combined DVS and architectural adaptation
[Yeh et al., MICRO 2007] | Reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 sharing cores), eliminating trivial FP operations, lookup table | Up to 50% power reduction and 55% performance improvement
[Mai et al., ISCA 2007] | Memory system: streaming register files or cache hierarchy; communication: broadcast and routed; processor: SIMD or RISC superscalar | Only 2x worse than a domain-optimized system
[Lee and Brooks, ASPLOS 2008] | Issue queue, issue width, branch, LSQ, ROB, registers, I-L1, D-L1, and L2 cache size and latency, memory latency, temporal sensitivity | 1.6X performance gain and 0.8X power reduction; 5.1x efficiency improvement

17

Energy-Effective Issue Logic [Folegnani & Gonzalez, ISCA’01]

Inefficiency of conventional instruction issue logic & issue queue (IQ)

A) Energy is wasted on empty entries and ready operands
B) The effectively used IQ size varies across applications
C) The effectively used IQ size varies across different periods of one application

[Figure: panels A, B, and C, one per point above]

18

Adaptation of Multiple Datapath Resources (cont’d) [Ponomarev et al., MICRO 2001]

Dynamic adaptation through multi-partitioned resources
• Instruction queue (IQ): avg 17; max 32
• Reorder buffer (ROB): avg 57; max 128
• Load/store queue (LSQ): avg 18; max 32

The three resources are independently adjusted at run time
• Downsize a resource based on sampled statistics of its effective usage history
• Upsize a resource based on its resource miss record

Total power saving for the three resized components: 59%
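The downsize/upsize policy above can be illustrated with a small software model. This is a minimal sketch under stated assumptions: the class name, the partition size, the miss threshold, and the "average occupancy fits in one fewer partition" rule are hypothetical illustrations, not the mechanism or parameters of the cited paper.

```python
class ResizableResource:
    """Toy model of one multi-partitioned resource (IQ, ROB, or LSQ)."""

    def __init__(self, max_size, partition, miss_threshold=4):
        self.max_size = max_size              # e.g. 32 entries for the IQ
        self.partition = partition            # entries switched on/off together (assumed)
        self.size = max_size                  # start with every partition enabled
        self.miss_threshold = miss_threshold  # hypothetical tuning knob
        self.samples = []
        self.misses = 0

    def sample(self, occupancy):
        """Record how many entries were actually in use at a sampling point."""
        self.samples.append(occupancy)

    def record_miss(self):
        """Record one event where the resource was full and an allocation stalled."""
        self.misses += 1

    def end_of_period(self):
        """Apply the resize decision for this period and return the new size."""
        if self.misses > self.miss_threshold:
            # Upsize: the miss record says the resource is too small.
            self.size = min(self.max_size, self.size + self.partition)
        elif self.samples:
            avg = sum(self.samples) / len(self.samples)
            # Downsize: sampled usage would fit with one partition fewer.
            if self.size > self.partition and avg <= self.size - self.partition:
                self.size -= self.partition
        self.samples, self.misses = [], 0
        return self.size
```

Each of the three resources would own a separate instance, which matches the statement that they are adjusted independently.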

19

Architectural and Frequency Adaptations for Multimedia Applications [Hughes et al., MICRO 2001]

Dynamically adapted architecture
• Issue width & issue queue
• # of function units

Dynamic Voltage Scaling (DVS)
• Continuous DVS (CDVS)
• Discrete DVS (DDVS)

Adaptation method
• Initial profiling: a multimedia application has similar performance and power statistics for frames of the same type
• Dynamic adaptation: choose the optimal configuration by table lookup, based on history statistics for the same frame type

Energy saving
• DDVS alone: 73%
• Arch alone: 22%
• CDVS alone: 75%
• Arch + DDVS: 77%
• Arch + CDVS: 78%

20

Architectural and Frequency Adaptations for Multimedia Applications (cont’d)

Important conclusions
• DVS gives most of the energy reduction
• Architectural adaptation further reduces energy when added on top of DVS
• Without DVS, less aggressive architectures are more energy-efficient
• With DVS, more aggressive architectures are often more energy-efficient
  - The higher IPC of the more aggressive architectures means they can be run at a lower frequency to save energy
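A first-order model makes the last two conclusions concrete. The sketch below assumes dynamic energy per cycle proportional to C·V², supply voltage scaling linearly with frequency under DVS, and no leakage or idle power. The two architecture descriptions (switched capacitance and IPC) and the frame workload are made-up illustrative numbers, not data from the cited study.

```python
def energy_per_frame(cap_per_cycle, ipc, instructions, deadline_s,
                     f_max_hz=1e9, v_max=1.0, dvs=False):
    """Dynamic energy to finish one frame by its deadline (first-order model)."""
    cycles = instructions / ipc
    f_needed = cycles / deadline_s           # slowest clock that still meets the deadline
    if f_needed > f_max_hz:
        raise ValueError("deadline cannot be met even at f_max")
    # Assumption: with DVS, voltage drops in proportion to the required frequency;
    # without DVS, the core runs at full voltage and then idles for free.
    v = v_max * f_needed / f_max_hz if dvs else v_max
    return cap_per_cycle * v ** 2 * cycles


FRAME = dict(instructions=1e6, deadline_s=1e-2)   # hypothetical frame workload
simple = dict(cap_per_cycle=1.0, ipc=1.0)         # less aggressive core (assumed)
aggressive = dict(cap_per_cycle=2.0, ipc=1.5)     # 2x capacitance, 1.5x IPC (assumed)

for dvs in (False, True):
    e_s = energy_per_frame(dvs=dvs, **simple, **FRAME)
    e_a = energy_per_frame(dvs=dvs, **aggressive, **FRAME)
    print(f"DVS={dvs}: aggressive/simple energy ratio = {e_a / e_s:.2f}")
```

In this model the energy scales as C/IPC without DVS and as C/IPC³ with DVS, so the aggressive core costs about 1.33x the energy of the simple core without DVS but only about 0.59x with DVS.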

21

Microarchitectural Adaptivity [Lee & Brooks ASPLOS’08]

Examines two main questions:
• Spatial adaptivity – which parameters to tune?
• Temporal adaptivity – how often to tune?

Studies the effects of tuning 15 parameters at different adaptation time intervals

22


Architectural parameters studied
• Instruction queue size
• ROB size
• Register file size
• LSQ size
• Number and type of FUs
• Branch predictor
• BTB size
• BTB complexity
• Cache sizes
• Cache associativity
• Memory latency

23


Key findings
• Up to 5.3x improvement in efficiency through adaptation
• Relatively frequent adaptation (80K instruction intervals) needed to achieve maximum efficiency

24


Key findings
• On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain
  - However, which 3 parameters depends on the application and phase
• DVFS provides relatively less benefit (in terms of efficiency) when combined with architecture adaptations



27

Customization of Programmable Fabrics

FPGA-based acceleration has shown a lot of promise
• Many applications in bio-informatics, financial engineering, image processing, scientific computing, …
• Many publications in FCCM, FPGA, FPL, FPT, …

Two significant barriers
• Communication between CPU and FPGA accelerator
  - Overhead of using peripheral bus is too high
• Automatic compilation
  - Real programmers do not use VHDL/Verilog

But … a lot of encouraging progress made recently

28

Customization of Programmable Fabrics

Recent enablers
• Communication between CPU and FPGA accelerator
  - High-speed connections – HyperTransport bus, FSB, QPI, …
  - On-chip integration
• Automatic compilation
  - Maturing of C/C++ to RTL synthesis tools

29

Acceleration of Lithographic Simulation [FPGA’08]

Lithography simulation
• Simulate the optical imaging process
• Computationally intensive; very slow for full-chip simulation

$$I(x,y) = \sum_{k} \lambda_k \left| \sum_{\tau} \left[ \psi_k(x-x_1, y-y_1) - \psi_k(x-x_2, y-y_1) + \psi_k(x-x_2, y-y_2) - \psi_k(x-x_1, y-y_2) \right] \right|^2$$

Flow: algorithm in C, compiled by the AutoPilot™ synthesis tool, running on an XtremeData X1000 development system (AMD Opteron + Altera Stratix II EP2S180)

• 15X+ performance improvement vs. AMD Opteron 2.2 GHz processor
• Close to 100X improvement in energy efficiency: 15 W in FPGA compared with 86 W in Opteron
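The energy-efficiency claim follows from the speedup and the two power figures. The short calculation below assumes energy efficiency means work per joule, so the gain is speedup times the power ratio; the function name is illustrative.

```python
def energy_efficiency_gain(speedup, baseline_power_w, accelerator_power_w):
    """Gain in work-per-joule: the task is faster AND draws less power."""
    return speedup * baseline_power_w / accelerator_power_w


# Figures from the slide: 15X+ speedup, 86 W (Opteron) vs. 15 W (FPGA).
gain = energy_efficiency_gain(speedup=15, baseline_power_w=86, accelerator_power_w=15)
print(f"energy-efficiency gain at exactly 15X speedup: {gain:.0f}X")
```

At exactly 15X speedup this gives 86X; since the reported speedup is "15X+", the result is consistent with "close to 100X".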

30

xPilot: Behavioral-to-RTL Synthesis Flow

Inputs: behavioral spec. in C/C++/SystemC; platform description

Stages: frontend compiler, SSDM, μArch generation & RTL/constraints generation

Advanced transformations/optimizations
• Loop unrolling/shifting/pipelining
• Strength reduction / tree height reduction
• Bitwidth analysis
• Memory analysis
• …

Core behavior synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding

Outputs: RTL + constraints
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …

31

Some Recent Studies -- Efficient Identification of Approximate Patterns [Cong & Wei, FPGA’08]

• Programs may contain many patterns
• Prior work can only identify exact patterns
• We can efficiently identify “approximate” patterns in large programs
  - Based on the concept of edit distance
  - Use data-mining techniques
  - Efficient subgraph enumeration and pruning
• Highly scalable – can handle programs with 100,000+ lines of code
• Applications:
  - Behavioral synthesis: 20+% area reduction due to sharing of approximate patterns
  - ASIP synthesis: identify & extract customized instructions

[Figure: example dataflow patterns showing structure variation, bitwidth variation (16-bit vs. 32-bit operands), and ports variation]

32

Some Recent Studies -- Automatic Memory Partitioning

To appear in ICCAD 2009

Memory system is critical for high-performance and low-power design
• Memory bottleneck limits maximum parallelism
• Memory system accounts for a significant portion of total power consumption

Goal
Given platform information (memory port, power, etc.), behavioral specification, and throughput constraints
• Partition memories automatically
• Meet throughput constraints
• Minimize power consumption

(a) C code
for (int i = 0; i < n; i++)
    … = A[i] + A[i+1]

(b) Scheduling: A[i] and A[i+1], with registers R1 and R2

(c) Memory architecture after partitioning: a decoder in front of two banks, A[0, 2, 4, …] and A[1, 3, 5, …]
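The even/odd split in panel (c) corresponds to a cyclic partition, where element i goes to bank i mod 2. The sketch below checks how many cycles per iteration the loop's two reads need under such a partition. The cyclic mapping is inferred from the figure labels, and the function names, the single-port-per-bank default, and the sampled iteration range are assumptions for illustration; a real scheduler considers many more constraints.

```python
def cyclic_bank(index, num_banks):
    """Map an array index to (bank, offset within bank) under cyclic partitioning."""
    return index % num_banks, index // num_banks


def min_initiation_interval(offsets, num_banks, ports_per_bank=1, iterations=16):
    """Cycles per loop iteration needed just to serve the memory reads.

    offsets: constant subscript offsets read in each iteration,
             e.g. [0, 1] for A[i] and A[i+1].
    """
    worst = 1
    for i in range(iterations):
        load = {}
        for off in offsets:
            bank, _ = cyclic_bank(i + off, num_banks)
            load[bank] = load.get(bank, 0) + 1
        # Each bank serves ports_per_bank reads per cycle (ceiling division).
        needed = max(-(-count // ports_per_bank) for count in load.values())
        worst = max(worst, needed)
    return worst


print(min_initiation_interval([0, 1], num_banks=1))  # both reads hit the single bank
print(min_initiation_interval([0, 1], num_banks=2))  # reads always hit different banks
```

With one single-port bank the two reads conflict in every iteration; with two cyclic banks, A[i] and A[i+1] always fall in different banks, so both can be read in the same cycle.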

33

Automatic Memory Partitioning (AMP)

Techniques
• Capture array access conflicts in a conflict graph for throughput optimization
• Model the loop kernel with parametric polytopes to obtain array access frequency

Contributions
• Automatic approach for design space exploration
• Cycle-accurate
• Handles irregular array accesses
• Light-weight profiling for power optimization

Flow
Inputs: loop nest; memory platform information
1. Array subscripts analysis
2. Partition candidate generation
3. Try partition candidate C_i; minimize accesses on each bank
4. Meet port limitation? (Y/N)
5. Loop pipelining and scheduling, producing pipeline results
6. Power optimization

Steps 1 to 5 make up throughput optimization.

34

Automatic Memory Partitioning (AMP)

• About 6x throughput improvement on average with 45% area overhead
• In addition, power optimization can reduce power by a further 30% after throughput optimization

Benchmark    Original II   Partitioned II   Original slices   Partitioned slices   Area comparison   Power reduction
fir          3             1                241               510                  2.12              26.82%
idct         4             1                354               359                  1.01              44.23%
litho        16            1                1220              2066                 1.69              31.58%
matmul       4             1                211               406                  1.92              77.64%
motionEst    5             1                832               961                  1.16              10.53%
palindrome   2             1                84                65                   0.77              0.00%
avg          5.67x                                                                 1.45              31.80%
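The averages in the last row can be reproduced from the per-benchmark rows. The check below assumes the "5.67x" entry is the mean ratio of original to partitioned initiation interval (II) and that "area comparison" is partitioned slices divided by original slices; the variable names are illustrative.

```python
rows = [
    # (benchmark, original II, partitioned II, original slices, partitioned slices, power reduction %)
    ("fir",         3, 1,  241,  510, 26.82),
    ("idct",        4, 1,  354,  359, 44.23),
    ("litho",      16, 1, 1220, 2066, 31.58),
    ("matmul",      4, 1,  211,  406, 77.64),
    ("motionEst",   5, 1,  832,  961, 10.53),
    ("palindrome",  2, 1,   84,   65,  0.00),
]

n = len(rows)
ii_gain = sum(orig_ii / part_ii for _, orig_ii, part_ii, _, _, _ in rows) / n
area_ratio = sum(part_sl / orig_sl for _, _, _, orig_sl, part_sl, _ in rows) / n
power_reduction = sum(row[5] for row in rows) / n

print(f"average II improvement: {ii_gain:.2f}x")
print(f"average area ratio:     {area_ratio:.2f}")
print(f"average power saving:   {power_reduction:.2f}%")
```

The three means match the slide's summary row, and the 1.45 area ratio is the source of the "45% area overhead" statement.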

35

AutoPilot Compilation Tool (based on UCLA xPilot system)

• Platform-based C to FPGA synthesis
• Synthesizes pure ANSI C and C++, with a GCC-compatible compilation flow
• Full support of IEEE-754 floating point data types & operations
• Efficiently handles bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality-of-results

Flow (ESL synthesis with AutoPilot™)
• Design specification: C/C++/SystemC; user constraints
• Inputs: timing/power/layout constraints; platform characterization library
• Stages: compilation & elaboration; presynthesis optimizations; behavioral & communication synthesis and optimizations
• Outputs: RTL HDLs & RTL SystemC, targeting an FPGA co-processor
• Simulation, verification, and prototyping with a common testbench

36

Some Other Usage of AutoPilot (Microsoft)

On John Cooley’s DeepChip 6/30/09
http://www.deepchip.com/items/0482-06.html

“We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements…

1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines…

2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…



38

Current On-Chip Interconnect Technology

Optimized RC lines with repeaters
• Wire sizing, buffer insertion, buffer sizing, …
• E.g., UCLA TRIO and IPEM packages

Reconfigurable interconnects
• For FPGAs: RC buses with pass transistors or bi-directional buffers
• For CMPs (chip multi-processors): mesh-like network-on-chip (NoC)

Pay a large penalty on performance

39

Used vs. Available Bandwidth in Modern CMOS

@ 45nm CMOS technology
• Data rate: 4 Gbit/s
• f_T of 45nm CMOS can be as high as 240 GHz
• Baseband signal bandwidth only about 4 GHz
• 98.4% of available bandwidth is wasted

Question: How to take advantage of the full bandwidth of modern CMOS?

40

UCLA 90nm CMOS VCO at 324GHz [ISSCC 2008]

[Figure: measured output spectrum of the 323.5 GHz VCO, Pout (dBm, -100 to -70) vs. frequency (323.038 to 324.0 GHz); on-wafer VCO test setup at JPL]

CMOS voltage-controlled oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is (fVCO - 4*fLO) = fIF, or fVCO - 4*(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz!

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in 90nm process

*Huang, D., LaRocca, T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), 476-477 (Feb 2008), San Francisco, CA

41

Multiband RF-Interconnect

• In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
• N different data streams (N=6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter

[Figure: signal spectrum and signal power plots for the multiband RF-interconnect]

42

Tri-band On-Chip RF-I Test Results

Process                           IBM 90nm CMOS digital process
Total 3 channels                  30 GHz, 50 GHz, baseband
Data rate in each channel         RF band: 4 Gbps; baseband: 2 Gbps
Total data rate                   10 Gbps
Bit error rate across all bands   <10E-9
Latency                           6 ps/mm
Energy per bit (RF)               0.09* pJ/bit/mm
Energy per bit (BB)               0.125 pJ/bit/mm

[Figures: data output waveforms for the 30 GHz, 50 GHz, and baseband channels; output spectrum of the RF bands, 30 GHz and 50 GHz]

*VCO power (5 mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not burden an individual link significantly.

43

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:
1. 32nm node; 30x repeater, FO4 = 8 ps, Rwire = 306 Ω/mm, Cwire = 315 fF/mm, wire pitch = 0.2 um, bus length = 2 cm, f_bus = 1 GHz, bus width 96 bytes
2. Repeater area = 0.022 mm2
3. Bus physical width = 160 um
4. In that width we can fit 13 transmission lines, each with 7 carriers carrying 8 Gbps each

Interconnect length = 2 cm

                                 RF-I    Repeated bus
# of wires                       13      448
Data rate per carrier (Gbit/s)   8       NA
# of carriers                    7       NA
Data rate per wire (Gbit/s)      56      1
Aggregate data rate (Gbit/s)     728     768
Bus physical width (um)          160     160
Transceiver area (mm2)           0.27    0.022
Power (mW)                       455     6144
Energy per bit (pJ/bit)          0.63    8
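The aggregate data rates and energy-per-bit figures in the table follow directly from the listed assumptions. The short check below recomputes them; the function names are illustrative, and the bus rate is derived from the stated 96-byte width at 1 GHz.

```python
def aggregate_rate_gbps(lines, carriers_per_line, gbps_per_carrier):
    """Total data rate of parallel transmission lines with multiple carriers each."""
    return lines * carriers_per_line * gbps_per_carrier


def energy_per_bit_pj(power_mw, rate_gbps):
    """mW / (Gbit/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / rate_gbps


rfi_rate = aggregate_rate_gbps(lines=13, carriers_per_line=7, gbps_per_carrier=8)
bus_rate = 96 * 8 * 1            # 96-byte-wide bus clocked at 1 GHz, in Gbit/s

print(rfi_rate, energy_per_bit_pj(455, rfi_rate))    # RF-I
print(bus_rate, energy_per_bit_pj(6144, bus_rate))   # repeated bus
```

RF-I delivers a comparable aggregate rate (728 vs. 768 Gbit/s) at 0.625 pJ/bit, which the table rounds to 0.63, against 8 pJ/bit for the repeated bus.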

44

Architectural Impact Using RF-I

High-bandwidth communication
• Data distribution across many-core topologies
• Vital in keeping many-core designs active

Low-latency communication
• Enables users to apply parallel computing to a broader range of applications through faster synchronization and communication
• Faster cache coherence protocols

Reconfigurability
• Adapt NoC topology/bandwidth to the needs of the individual application

Power-efficient communication

45

Simple RF-I Topology

• Four NoC components
• Tunable Tx/Rx’s
• Arbitrary topologies
• Arbitrary bandwidths

[Figure: four NoC components (C), each with a Tx/Rx, attached to an RF-I transmission line bundle; example virtual topologies: pipeline/ring, bus, multicast, fully connected, crossbar]

One physical topology can be configured to many virtual topologies

46

Mesh Overlaid with RF-I [HPCA’08]

10x10 mesh of pipelined routers
• NoC runs at 2 GHz
• XY routing

64 4 GHz 3-wide processor cores
• Labeled aqua
• 8 KB L1 data cache
• 8 KB L1 instruction cache

32 L2 cache banks
• Labeled pink
• 256 KB each
• Organized as shared NUCA cache

4 main memory interfaces
• Labeled green

RF-I transmission line bundle
• Black thick line spanning mesh

47

RF-I Logical Organization

• Logically:
- RF-I behaves as a set of N express channels
- Each channel is assigned to a src, dest router pair (s,d)

• Reconfigured by:
- remapping shortcuts to match the needs of different applications

[Figure: two example logical configurations, LOGICAL A and LOGICAL B]
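The effect of express channels on a mesh can be illustrated with a hop-count model. In the sketch below, routers are (x, y) coordinates, a plain mesh path costs the Manhattan distance (as under XY routing), and each RF-I shortcut is a one-hop, one-directional link from s to d. The greedy "traffic times hops saved" selection is a hypothetical policy for illustration; the slides only state that shortcuts are remapped per application.

```python
def mesh_hops(a, b):
    """Hop count between routers a and b on a mesh with XY routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hops_with_shortcuts(src, dst, shortcuts):
    """Best hop count using the mesh alone or at most one RF-I shortcut (s, d)."""
    best = mesh_hops(src, dst)
    for s, d in shortcuts:
        best = min(best, mesh_hops(src, s) + 1 + mesh_hops(d, dst))
    return best


def pick_shortcuts(traffic, n_channels):
    """Assumed policy: give the N channels to the pairs that save the most hops.

    traffic: dict mapping (src, dst) router pairs to message counts.
    """
    ranked = sorted(traffic,
                    key=lambda pair: traffic[pair] * (mesh_hops(*pair) - 1),
                    reverse=True)
    return ranked[:n_channels]


# Hypothetical traffic profile on a 10x10 mesh.
traffic = {((0, 0), (9, 9)): 10, ((0, 0), (0, 1)): 100, ((5, 5), (0, 0)): 3}
shortcuts = pick_shortcuts(traffic, n_channels=1)
print(shortcuts, hops_with_shortcuts((0, 1), (9, 9), shortcuts))
```

Reconfiguring for a different application amounts to calling the selection again with that application's traffic profile, which changes the logical topology without changing the physical transmission line.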

48

Power Savings [MICRO’08]

We can thin the baseline mesh links
• From 16B…
• …to 8B
• …to 4B

RF-I makes up the difference in performance while saving overall power!
• RF-I provides bandwidth where most necessary
• Baseline RC wires supply the rest

[Figure: mesh drawn with 16-byte, 8-byte, and 4-byte links; router A requires high bandwidth to communicate with B]

49

RF-I Enabled Multicast

[Figure: request scenario (Get S, Fill) with numbered message steps, shown on a conventional NoC and on an RF-I enabled NoC with Tx/Rx at the routers]

50

Impact of Using RF-Interconnects [MICRO’08]

• Adaptive RF-I enabled NoC
- Cost-effective in terms of both power and performance


52

CHP Mapping – Compilation and Runtime Software Systems for Customization

Goals: Efficient mapping of domain-specific specification to customizable hardware
– Adapt the CHP to a given application for drastic performance/power efficiency improvement

Flow
• Domain-specific applications, written by the programmer, with abstract execution
• Domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)
• Source-to-source CHP mapper, using application characteristics and CHP architecture models
• C/C++ code, C/C++ front-end, analysis annotations, reconfiguring and optimizing back-end
• C/SystemC behavioral spec, RTL synthesizer (xPilot)
• Customized target code: binary code for fixed & customized cores; RTL for programmable fabric
• Adaptive runtime: lightweight threads and adaptive configuration
• CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP), with performance feedback

53

FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009)

Use CUDA in tandem with High-Level Synthesis (HLS) to:
• enable high-level abstraction for FPGA programming
• exploit massively parallel compute capabilities of FPGA
• facilitate single interface for GPU and FPGA kernel acceleration

CUDA: C-based parallel programming model for GPUs
• concise expression of coarse-grained parallelism
• very popular (wide range of existing applications)
• explicit partitioning and transfer of data between off-chip and on-chip memory

AutoPilot: Advanced HLS tool (from AutoESL)
• Platform-specific (i.e., FPGA/ASIC) C-to-RTL mapping
• Fine-grained and loop iteration parallelism extraction
• Annotated coarse-grained parallelism extraction
  - Requires explicit expression and annotation from programmer

54

CUDA-to-AutoPilot C Translation

• Identify off-chip data transfers
  - aggregate multi-thread off-chip accesses into DMA bursts
• Split kernel into computation and data communication tasks
• Use thread-block granularity for splitting kernel threads into parallel FPGA cores
• Allocate data storage based on the following memory space mapping:

GPU → FPGA
• Global → Off-chip DRAM
• Shared → On-chip BRAMs
• Constant/Texture Registers
• Registers / Local Memory

[Figure: thread-block kernel tasks]

55

Results

Kernel                     Configuration              Description
Matrix multiply (matmul)   1024x1024                  Common kernel in many imaging, simulation, and scientific applications
Coulombic potential (cp)   4000 atoms, 512x512 grid   Computation of electric potential in a volume containing charged atoms
RSA Encryption (rc5-72)    4 billion keys             Brute-force encryption key generation and matching

Benchmark      Core #   DRAM bandwidth   Limiting resource
matmul 32bit   128      3.5 GB/s         DSP
matmul 16bit   176      1.6 GB/s         BRAM
matmul 8bit    176      0.8 GB/s         BRAM
cp 32bit       25       0.128 GB/s       DSP
cp 16bit       96       0.19 GB/s        DSP
cp 8bit        96       0.1 GB/s         DSP
rc5-72 32bit   80       ≈ 0 GB/s         LUT

[Figure: speedup of GPU vs. FPGA (axis 0 to 2.5) for matmul 32/16/8bit, cp 32/16/8bit, and rc5-72 32bit]

Benchmark      GPU GeForce 8800   FPGA Virtex5 xc5vfx200t   FPGA over GPU benefit
matmul 32bit   ≈ 100 Watt         10.622 Watt               9.41X
matmul 16bit   ≈ 100 Watt         10.559 Watt               9.47X
matmul 8bit    ≈ 100 Watt         9.954 Watt                10.05X

• Speedup comparable to GPU in several configurations
• Much more power efficient than GPU!
• Assumes the FPGA has a high-bandwidth bus to off-chip DDR

56

Concluding Remarks

We believe that domain-specific customization is the next transformative approach to energy efficient computing
• Beyond parallelization?

Many research opportunities and challenges
• Domain-specific modeling/specification
• Novel architecture & microarchitecture for customization
• Compilation and runtime software to support intelligent customization
• New research in testing, verification, reliability in customizable computing

CDSC is taking a highly integrated effort
• Coordinated cross-layer customization in modeling, HW, SW, & application development

57

Acknowledgements

• A highly collaborative effort
• Thanks to all my co-PIs in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara
• Thanks for the support from the National Science Foundation

Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director) (UCLA), Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio-State), Sarkar (Associate Dir) (Rice), Vese (UCLA)
