1
Customizable Domain-Specific Computing
Jason CongCenter for Domain-Specific ComputingUCLA Computer Science Department
[email protected]  http://cadlab.cs.ucla.edu/~cong
2
The Power Barrier …
Source : Shekhar Borkar, Intel
3
Focus: A New Transformative Approach to Power/Energy-Efficient Computing
Parallelization
Source: Shekhar Borkar, Intel
Current Solution: Parallelization
4
Cost and Energy Are Still a Big Issue …
Cost of computing:
• HW acquisition
• Energy bill
• Heat removal
• Space
• …
5
Next Significant Opportunity – Customization
Parallelization
Source: Shekhar Borkar, Intel
Customization
Adapt the architecture to the application domain
6
Motivation
A few facts:
• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only selected tasks in its domain
• Application-specific integrated circuits (ASICs) can deliver 10,000X+ better power/performance efficiency, but are too expensive to design and manufacture
Our proposal: a general, customizable platform for the given domain(s)
• Can be customized to a wide range of applications in the domain
• Can be mass-produced with cost efficiency
• Can be programmed efficiently with novel compilation and runtime systems
Goal: a "supercomputer-in-a-box" with 100X performance/power improvement via customization for the intended domain(s)
Analogy: the advance of civilization via specialization/customization
7
Example Application Domain: Healthcare
Medical imaging has transformed healthcare
• An in vivo method for understanding disease development and patient condition
• Estimated at $100 billion/year
• More powerful and efficient computation can help: fewer exposures using compressive sensing; better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms
Hemodynamic simulation
• Very useful for surgical procedures involving blood flow and vasculature
Both may take hours to days to construct; the clinical requirement is 1–2 minutes
Cloud computing won't work: communication overhead, real-time requirements, privacy
A megawatt datacenter for each hospital?
(Figures: intracranial aneurysm reconstruction with hemodynamics; magnetic resonance (MR) angiograph of an aneurysm)
8
Medical Image Processing Pipeline
reconstruction → denoising → registration → segmentation → analysis
• Reconstruction: compressive sensing (medical images exhibit sparsity and can be sampled at a rate below the classical Shannon-Nyquist limit)
• Denoising: total variational algorithm
• Registration: fluid registration
• Segmentation: level set methods
• Analysis: hemodynamic simulation via the Navier-Stokes equations
9
Application Domains: Medical Image Processing Pipeline
reconstruction → denoising → registration → segmentation → analysis (compressive sensing, total variational algorithm, fluid registration, level set methods, Navier-Stokes equations)
Computation & communication patterns across the stages:
• non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
• parallel, global communication; dense linear algebra, optimization methods
• local communication; sparse linear algebra, n-body methods, graphical models
• local communication; dense linear algebra, spectral methods, MapReduce
• iterative, local or global communication; dense and sparse linear algebra, optimization methods
• These algorithms have diverse computation & communication patterns
• A single homogeneous system cannot perform very well on all these algorithms
10
Need of Customization for the Medical Image Processing Pipeline
reconstruction → denoising → registration → segmentation → analysis (compressive sensing, total variational algorithm, fluid registration, level set methods, Navier-Stokes equations)
• These algorithms have diverse computation & communication patterns: non-iterative, highly parallel, local & global communication (sparse linear algebra, structured grid, optimization methods); parallel, global communication (dense linear algebra, optimization methods); local communication (sparse linear algebra, n-body methods, graphical models); local communication (dense linear algebra, spectral methods, MapReduce); iterative, local or global communication (dense and sparse linear algebra, optimization methods)
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels ("motifs")
• Applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms):
  CPU (Xeon 2.0 GHz)    1x      ~100 W
  GPU (Tesla C1060)     93x     ~150 W
  FPGA (xc4vlx100)      11x     ~5 W

3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels):
  CPU (Xeon 2.0 GHz)    quick select                  1x      ~100 W
  GPU (Tesla C1060)     median of medians             70x     ~140 W
  FPGA (xc4vlx100)      bit-by-bit majority voting    1200x   ~3 W
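The CPU variant above selects the median of the 27 neighbors directly. A minimal C sketch of that step, using an insertion sort in place of quick-select (equivalent for 27 elements, and only an illustration of the algorithmic kernel, not any platform's actual implementation):

```c
#include <string.h>

/* Median of the 27 voxels in a 3x3x3 neighborhood.
   Insertion sort of 27 values, then the middle element. */
static unsigned char median27(const unsigned char v[27])
{
    unsigned char s[27];
    memcpy(s, v, 27);
    for (int i = 1; i < 27; i++) {
        unsigned char key = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > key) { s[j + 1] = s[j]; j--; }
        s[j + 1] = key;
    }
    return s[13];                       /* middle of 27 */
}

/* Filter one interior voxel (x, y, z) of an nx*ny*nz volume
   stored in row-major order. */
unsigned char median_filter_voxel(const unsigned char *vol,
                                  int nx, int ny, int x, int y, int z)
{
    unsigned char nb[27];
    int k = 0;
    for (int dz = -1; dz <= 1; dz++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                nb[k++] = vol[(z + dz) * nx * ny + (y + dy) * nx + (x + dx)];
    return median27(nb);
}
```

On an FPGA the same median can instead be computed bit-serially by majority voting from the most significant bit down, which maps far better to hardware than sorting.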
11
12
Center for Domain-Specific Computing (CDSC) Organization
Reinman(UCLA)
Palsberg(UCLA)
Sadayappan(Ohio-State)
Sarkar(Associate Dir)
(Rice)
Vese(UCLA)
Potkonjak (UCLA)
• A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math
• 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara
Aberle(UCLA)
Baraniuk(Rice)
Bui (UCLA)
Cong (Director) (UCLA)
Cheng (UCSB)
Chang (UCLA)
13
Customizable Heterogeneous Platform (CHP)
(Figure: an array of fixed cores, custom cores, and programmable fabric blocks with caches ($), connected by a reconfigurable RF-I bus and a reconfigurable optical bus with transceivers/receivers and optical interfaces; multiple CHPs connect to DRAM and I/O.)
Overview of the Proposed Research
• Domain characterization: application modeling; domain-specific modeling (healthcare applications)
• CHP creation: customizable computing engines; customizable interconnects; architecture modeling
• CHP mapping: source-to-source CHP mapper; reconfiguring & optimizing backend; adaptive runtime
Design once, invoke many times
14
CHP Creation – Design Space Exploration
Key questions: the optimal trade-off between efficiency & customizability. Which options to fix at CHP creation? Which to be set by the CHP mapper?
• Custom instructions & accelerators: amount of programmable fabric; shared vs. private accelerators; custom instruction selection; choice of accelerators; …
• Core parameters: frequency & voltage; datapath bit width; instruction window size; issue width; cache size & configuration; register file organization; # of thread contexts; …
• NoC parameters: interconnect topology; # of virtual channels; routing policy; link bandwidth; router pipeline depth; number of RF-I enabled routers; RF-I channel and bandwidth allocation; …
(Figure: the Customizable Heterogeneous Platform block diagram, as on slide 13.)
15
Customization for Cores
Example of the core customization space:
• instruction queue size
• ROB size
• memory hierarchy and configuration: cache sizes, cache associativity, memory latency
• number and type of FUs
• register file size
• LSQ size
• branch predictor: BTB size, BTB complexity
16
Existing Studies on Core Customization (not domain-specific)

Reference | Feature | Impact
[Cong et al., IEEE Trans. on Parallel and Distributed Systems 2007] | Core spilling: spill from 1 core to up to 8 cores | Less than 50% worse than an ideal 8x-powerful core; up to 40% improvement for changing workloads
[Ipek et al., ISCA 2007] | Core fusion: 2-issue cores fused to simulate 4- and 6-issue cores | Less than 30% and 20% worse for sequential and parallel benchmarks, respectively
[Folegnani & Gonzalez, ISCA 2001] | Issue logic and issue queue (43/58) | 16% total processor energy saving
[Ponomarev et al., MICRO 2001] | Instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) | 59% power saving for the three components
[Hughes et al., MICRO 2001] | Issue width (8, 4, 2), issue queue (128, 64, 32), function units (4, 2), dynamic voltage scaling | Up to 78% total energy saving with combined DVS and architectural adaptation
[Yeh et al., MICRO 2007] | Reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 sharing cores), eliminating trivial FP operations, lookup table | Up to 50% power reduction and 55% performance improvement
[Mai et al., ISCA 2007] | Memory system: streaming register files or cache hierarchy; communication: broadcast and routed; processor: SIMD or RISC superscalar | Only 2x worse than a domain-optimized system
[Lee and Brooks, ASPLOS 2008] | Issue queue, issue width, branch predictor, LSQ, ROB, registers; I-L1, D-L1, L2 cache size and latency; memory latency; temporal sensitivity | 1.6X performance gain, 0.8X power reduction, 5.1x efficiency improvement
17
Energy-Effective Issue Logic [Folegnani & Gonzalez, ISCA'01]
Inefficiency of conventional instruction issue logic & issue queue (IQ):
A) Energy is wasted on empty entries and on already-ready operands
B) The effectively used IQ size varies across applications
C) The effectively used IQ size varies across different periods within one application
18
Adaptation of Multiple Datapath Resources (cont'd)
Dynamic adaptation through multi-partitioned resources:
• Instruction queue (IQ): avg 17; max 32
• Reorder buffer (ROB): avg 57; max 128
• Load/store queue (LSQ): avg 18; max 32
The three resources are independently adjusted at run time:
• Downsize a resource based on sampled statistics of its effective usage history
• Upsize a resource based on its resource miss record
Total power saving for the three resized components: 59%
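The downsize/upsize policy can be sketched as a small control loop. The partition size, thresholds, and names below are illustrative assumptions, not the exact mechanism of Ponomarev et al.:

```c
/* Occupancy-driven resizing of a partitioned queue: grow by one
   partition when allocation stalls (the "resource miss record")
   occur; shrink when sampled occupancy shows slack of at least
   one partition. All constants are illustrative. */
#define PARTITION 8   /* entries per partition */

typedef struct {
    int size;         /* currently active entries */
    int max_size;     /* physical capacity */
} resizable_queue;

/* Called at the end of each sampling window. */
void adapt_size(resizable_queue *q, double avg_occupancy, int alloc_stalls)
{
    if (alloc_stalls > 0 && q->size + PARTITION <= q->max_size) {
        q->size += PARTITION;              /* upsize on pressure */
    } else if (alloc_stalls == 0 &&
               avg_occupancy < (double)(q->size - PARTITION) &&
               q->size > PARTITION) {
        q->size -= PARTITION;              /* downsize when underused */
    }
}
```

Downsizing saves power in the CAM/RAM structures; upsizing restores capacity as soon as the smaller configuration starts stalling allocation.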
19
Architectural and Frequency Adaptations for Multimedia Applications [Hughes et al., MICRO 2001]
Dynamic adaptation
• Architecture: issue width & issue queue; # of function units
• Dynamic voltage scaling (DVS): continuous DVS (CDVS); discrete DVS (DDVS)
Adaptation method
• Initial profiling: a multimedia application has similar performance and power statistics for the same frame type
• Dynamic adaptation: choose the optimal configuration via table lookup, based on history statistics for the same frame type
Energy saving: DDVS alone: 73%; Arch alone: 22%; CDVS alone: 75%; Arch + DDVS: 77%; Arch + CDVS: 78%
20
Architectural and Frequency Adaptations for Multimedia Applications (cont'd)
Important conclusions
• DVS gives most of the energy reduction
• Architectural adaptation further reduces energy when layered on top of DVS
• Without DVS, less aggressive architectures are more energy-efficient
• With DVS, more aggressive architectures are often more energy-efficient: their higher IPC means they can run at a lower frequency to save energy
21
Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]
Examines two main questions:
• Spatial adaptivity: which parameters to tune?
• Temporal adaptivity: how often to tune?
Studies the effects of tuning 15 parameters at different adaptation time intervals
22
Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]
Architectural parameters studied:
• instruction queue size
• ROB size
• cache sizes, cache associativity, memory latency
• number and type of FUs
• register file size
• LSQ size
• branch predictor: BTB size, BTB complexity
23
Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]
Key findings
• Up to 5.3x improvement in efficiency through adaptation
• Relatively frequent adaptation (80K-instruction intervals) is needed to achieve maximum efficiency
24
Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]
Key findings
• On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain; however, which 3 parameters depends on the application and phase
• DVFS provides relatively less benefit (in terms of efficiency) once architectural adaptations are in place
25
Existing Studies on Core Customization (not domain-specific)
26
CHP Creation – Design Space Exploration
27
Customization of Programmable Fabrics
FPGA-based acceleration has shown a lot of promise:
• Many applications in bio-informatics, financial engineering, image processing, scientific computing, …
• Many publications in FCCM, FPGA, FPL, FPT, …
Two significant barriers:
• Communication between the CPU and the FPGA accelerator: the overhead of a peripheral bus is too high
• Automatic compilation: real programmers do not use VHDL/Verilog
But a lot of encouraging progress has been made recently
28
Customization of Programmable Fabrics
Recent enablers:
• Communication between the CPU and the FPGA accelerator: high-speed connections (HyperTransport bus, FSB, QPI, …); on-chip integration
• Automatic compilation: maturing of C/C++-to-RTL synthesis tools
29
Acceleration of Lithographic Simulation [FPGA'08]
Lithography simulation
• Simulates the optical imaging process
• Computationally intensive; very slow for full-chip simulation
Platform: XtremeData X1000 development system (AMD Opteron + Altera Stratix II EP2S180), with the algorithm in C compiled through the AutoPilot synthesis tool

I(x,y) = Σ_k λ_k · | Σ τ [ψ_k(x−x1, y−y1) − ψ_k(x−x2, y−y1) + ψ_k(x−x2, y−y2) − ψ_k(x−x1, y−y2)] |²

15X+ performance improvement vs. an AMD Opteron 2.2 GHz processor; close to 100X improvement in energy efficiency (15 W in the FPGA compared with 86 W in the Opteron)
30
xPilot: Behavioral-to-RTL Synthesis Flow
• Input: behavioral spec in C/C++/SystemC, processed by a frontend compiler into the SSDM intermediate representation
• Advanced transformations/optimizations: loop unrolling/shifting/pipelining; strength reduction / tree height reduction; bitwidth analysis; memory analysis; …
• Core behavior synthesis optimizations (guided by a platform description): scheduling; resource binding, e.g., functional unit binding and register/port binding
• μArch generation & RTL/constraints generation
• Output: RTL + constraints in Verilog/VHDL/SystemC, targeting FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, …)
31
Some Recent Studies: Efficient Identification of Approximate Patterns [Cong & Wei, FPGA'08]
• Programs may contain many patterns; prior work can only identify exact patterns
• We can efficiently identify "approximate" patterns in large programs: based on the concept of editing distance, using data-mining techniques and efficient subgraph enumeration and pruning
• Highly scalable: can handle programs with 100,000+ lines of code
• Applications:
  – Behavioral synthesis: 20+% area reduction due to sharing of approximate patterns
  – ASIP synthesis: identify & extract customized instructions
(Figure: adder/multiplier pattern examples illustrating structure variation, bitwidth variation (16/32-bit operands), and ports variation.)
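The editing-distance idea underlying the approximate matching can be illustrated on sequences; the actual work operates on program graphs, which is considerably more involved. A standard Levenshtein distance in C shows the metric itself:

```c
#include <string.h>

/* Levenshtein editing distance between two strings: the minimum
   number of insertions, deletions, and substitutions turning one
   into the other. The graph edit distance used for dataflow
   patterns generalizes the same idea to nodes and edges. */
int edit_distance(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[64][64];                  /* assumes strings shorter than 64 */
    for (int i = 0; i <= la; i++) d[i][0] = i;
    for (int j = 0; j <= lb; j++) d[0][j] = j;
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int sub = d[i-1][j-1] + (a[i-1] != b[j-1]);
            int del = d[i-1][j] + 1;
            int ins = d[i][j-1] + 1;
            int m = sub < del ? sub : del;
            d[i][j] = m < ins ? m : ins;
        }
    return d[la][lb];
}
```

Two patterns are "approximately" equal when their distance is below a threshold, which is what allows, e.g., an add-add-sub pattern to share hardware with an add-add-mul pattern after small structural edits.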
32
Some Recent Studies: Automatic Memory Partitioning (to appear in ICCAD 2009)
The memory system is critical for high-performance, low-power design:
• The memory bottleneck limits the maximum parallelism
• The memory system accounts for a significant portion of total power consumption
Goal: given platform information (memory ports, power, etc.), a behavioral specification, and throughput constraints:
• Partition memories automatically
• Meet throughput constraints
• Minimize power consumption

(a) C code:
    for (int i = 0; i < n; i++)
        … = A[i] + A[i+1];
(b) Scheduling: A[i] and A[i+1] are read into registers R1 and R2 in the same cycle
(c) Memory architecture after partitioning: a decoder steers accesses to two banks, A[0, 2, 4, …] and A[1, 3, 5, …]
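The partitioning in (c) is a cyclic (modulo-2) split: A[i] and A[i+1] always land in different banks, so two single-ported memories can serve both reads in one cycle. A C sketch of the resulting bank/offset decoding (the names are hypothetical, chosen only for illustration):

```c
/* Cyclic bank partitioning for the loop
       for (i = 0; i < n; i++) ... = A[i] + A[i+1];
   Element A[i] lives at offset i/N_BANKS inside bank i%N_BANKS. */
#define N_BANKS 2

static inline int bank_of(int i)   { return i % N_BANKS; }
static inline int offset_of(int i) { return i / N_BANKS; }

/* One pipeline iteration: the two operands come from distinct banks,
   so a single memory port per bank suffices. */
int read_pair(const int *bank0, const int *bank1, int i)
{
    const int *banks[N_BANKS] = { bank0, bank1 };
    int a = banks[bank_of(i)][offset_of(i)];         /* A[i]   */
    int b = banks[bank_of(i + 1)][offset_of(i + 1)]; /* A[i+1] */
    return a + b;
}
```

The general problem the AMP work solves is choosing such a partitioning automatically, for arbitrary (including irregular) subscript patterns and a given port budget.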
33
Automatic Memory Partitioning (AMP)
Techniques:
• Capture array access conflicts in a conflict graph for throughput optimization
• Model the loop kernel with parametric polytopes to obtain array access frequencies
Contributions:
• Automatic approach for design space exploration
• Cycle-accurate
• Handles irregular array accesses
• Lightweight profiling for power optimization
Flow (throughput optimization): loop nest → array subscript analysis → partition candidate generation (using memory platform information) → try partition candidate C_i, minimizing accesses on each bank → loop pipelining and scheduling → if the port limitation is met, emit the pipeline results and run power optimization; otherwise try the next candidate
34
Automatic Memory Partitioning (AMP): Results
About 6x average throughput improvement with 45% area overhead; in addition, power optimization can further reduce power by 30% after throughput optimization.

Benchmark | Original II | Partitioned II | Original SLICEs | Partitioned SLICEs | Area ratio | Power reduction
fir | 3 | 1 | 241 | 510 | 2.12 | 26.82%
idct | 4 | 1 | 354 | 359 | 1.01 | 44.23%
litho | 16 | 1 | 1220 | 2066 | 1.69 | 31.58%
matmul | 4 | 1 | 211 | 406 | 1.92 | 77.64%
motionEst | 5 | 1 | 832 | 961 | 1.16 | 10.53%
palindrome | 2 | 1 | 84 | 65 | 0.77 | 0.00%
avg | 5.67x II improvement | | | | 1.45 | 31.80%
35
AutoPilot Compilation Tool (based on the UCLA xPilot system)
Flow: C/C++/SystemC design specification + user constraints (timing/power/layout) → compilation & elaboration → presynthesis optimizations → behavioral & communication synthesis and optimizations → RTL HDLs & RTL SystemC, driven by a platform characterization library, with a common testbench for simulation, verification, and prototyping on an FPGA co-processor
• Platform-based C-to-FPGA synthesis
• Synthesizes pure ANSI C and C++; GCC-compatible compilation flow
• Full support of IEEE-754 floating-point data types & operations
• Efficiently handles bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality of results
36
Some Other Usage of AutoPilot (Microsoft)
From John Cooley's DeepChip, 6/30/09 (http://www.deepchip.com/items/0482-06.html):
"We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements…
1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines…
2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…"
37
CHP Creation – Design Space Exploration
38
Current On-Chip Interconnect Technology
Optimized RC lines with repeaters:
• wire sizing, buffer insertion, buffer sizing, …
• e.g., the UCLA TRIO and IPEM packages
Reconfigurable interconnects:
• For FPGAs: RC buses with pass-transistors or bi-directional buffers
• For CMPs (chip multiprocessors): mesh-like network-on-chip (NoC)
Both pay a large performance penalty
3939
Used vs. Available Bandwidth in Modern CMOS
At the 45nm CMOS technology node:
• Data rate: 4 Gbit/s
• The fT of 45nm CMOS can be as high as 240 GHz
• Baseband signal bandwidth is only about 4 GHz, so 98.4% of the available bandwidth is wasted
Question: how can we take advantage of the full bandwidth of modern CMOS?
4040
UCLA 90nm CMOS VCO at 324 GHz [ISSCC 2008]
(Figure: measured output spectrum, Pout (dBm) from −100 to −70 over 323.0–324.0 GHz.)
The CMOS voltage-controlled oscillator was measured with a subharmonic mixer driven by an 80 GHz synthesizer local oscillator. The mixing relation is fVCO − 4·fLO = fIF, i.e., fVCO − 4×(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz.
On-wafer VCO test setup at JPL; the VCO was designed by Frank Chang's group at UCLA and fabricated in a 90nm process.
*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), 476-477, Feb. 2008, San Francisco, CA.
4141
Multiband RF-Interconnect
• In the TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
• N different data streams (N = 6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In the RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
(Figure: signal power and signal spectrum at each stage of the link.)
4242
Tri-band On-Chip RF-I Test Results
Process: IBM 90nm CMOS digital process
Channels: 3 in total (30 GHz, 50 GHz, baseband)
Data rate per channel: RF bands 4 Gbps each; baseband 2 Gbps
Total data rate: 10 Gbps
Bit error rate across all bands: < 10^-9
Latency: 6 ps/mm
Energy per bit (RF): 0.09* pJ/bit/mm
Energy per bit (baseband): 0.125 pJ/bit/mm
(Figures: data output waveform; output spectrum of the RF bands at 30 GHz and 50 GHz.)
*The VCO power (5 mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not significantly burden an individual link.
4343
Comparison Between a Repeated Bus and Multi-band RF-I @ 32nm
Assumptions:
1. 32nm node; 30x repeaters, FO4 = 8 ps, Rwire = 306 Ω/mm, Cwire = 315 fF/mm, wire pitch = 0.2 µm, bus length = 2 cm, f_bus = 1 GHz, bus width 96 bytes
2. Repeater area = 0.022 mm²
3. Bus physical width = 160 µm
4. In that width we can fit 13 transmission lines, each with 7 carriers carrying 8 Gbps each
Interconnect length = 2 cm

                                 RF-I     Repeated bus
# of wires                       13       448
Data rate per carrier (Gbit/s)   8        N/A
# of carriers                    7        N/A
Data rate per wire (Gbit/s)      56       1
Aggregate data rate (Gbit/s)     728      768
Bus physical width (µm)          160      160
Transceiver area (mm²)           0.27     0.022
Power (mW)                       455      6144
Energy per bit (pJ/bit)          0.63     8
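The energy-per-bit rows follow directly from power divided by aggregate data rate, since mW per Gbit/s is numerically pJ/bit. A quick check against the table's numbers:

```c
/* Energy per bit = power / aggregate data rate.
   Unit check: mW / (Gbit/s) = 1e-3 (J/s) / 1e9 (bit/s)
             = 1e-12 J/bit = pJ/bit.
   Table values: RF-I 455 mW over 728 Gbit/s;
   repeated bus 6144 mW over 768 Gbit/s. */
double energy_per_bit_pj(double power_mw, double gbit_per_s)
{
    return power_mw / gbit_per_s;
}
```

This yields about 0.63 pJ/bit for RF-I and 8 pJ/bit for the repeated bus, matching the table's roughly 12x energy advantage.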
4444
Architectural Impact of Using RF-I
• High-bandwidth communication: data distribution across many-core topologies; vital in keeping many-core designs active
• Low-latency communication: lets users apply parallel computing to a broader range of applications through faster synchronization and communication; faster cache-coherence protocols
• Reconfigurability: adapt NoC topology/bandwidth to the needs of the individual application
• Power-efficient communication
4545
Simple RF-I Topology
Four NoC components are attached to an RF-I transmission-line bundle via tunable Tx/Rx's, supporting arbitrary topologies and arbitrary bandwidths.
One physical topology can be configured into many virtual topologies: pipeline/ring, bus, multicast, fully connected crossbar.
4646
Mesh Overlaid with RF-I [HPCA'08]
• 10x10 mesh of pipelined routers; the NoC runs at 2 GHz with XY routing
• 64 4 GHz 3-wide processor cores (aqua), each with an 8KB L1 data cache and an 8KB L1 instruction cache
• 32 L2 cache banks (pink), 256KB each, organized as a shared NUCA cache
• 4 main memory interfaces (green)
• RF-I transmission line bundle (thick black line spanning the mesh)
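XY routing, as used in this mesh, can be sketched in a few lines; the helpers below are a simplified model of dimension-ordered routing, not the HPCA'08 router implementation:

```c
#include <stdlib.h>

/* Dimension-ordered (XY) routing on a mesh: a packet first travels
   along X to the destination column, then along Y. Deterministic
   and deadlock-free; the hop count is the Manhattan distance. */
int xy_hops(int sx, int sy, int dx, int dy)
{
    return abs(dx - sx) + abs(dy - sy);
}

/* Next router on the XY route from (x, y) toward (dx, dy). */
void xy_next(int x, int y, int dx, int dy, int *nx, int *ny)
{
    *nx = x; *ny = y;
    if (x != dx)      *nx += (dx > x) ? 1 : -1;  /* X first */
    else if (y != dy) *ny += (dy > y) ? 1 : -1;  /* then Y  */
}
```

On a 10x10 mesh a corner-to-corner packet takes 18 hops through pipelined routers; this worst-case latency is exactly what the RF-I shortcuts overlaid on the mesh are meant to bypass.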
4747
RF-I Logical Organization
• Logically, RF-I behaves as a set of N express channels, each assigned to a (source, destination) router pair (s, d)
• Reconfigured by remapping the shortcuts to match the needs of different applications (logical configurations A and B in the figure)
4848
Power Savings [MICRO'08]
• We can thin the baseline mesh links from 16B to 8B to 4B
• RF-I makes up the difference in performance while saving overall power
• RF-I provides bandwidth where it is most necessary; the baseline RC wires supply the rest
(Figure: a node A requiring high bandwidth to communicate with node B uses an RF-I shortcut.)
4949
RF-I Enabled Multicast
(Figure: a "Get S" request and fill. In a conventional NoC the fill is forwarded hop by hop over several cycles; in an RF-I enabled NoC a single Tx broadcasts the fill to multiple tuned Rx's in one step.)
5050
Impact of Using RF-Interconnects [MICRO'08]
• An adaptive RF-I enabled NoC is cost-effective in terms of both power and performance
51
Overview of the Proposed Research
• Domain characterization: application modeling; domain-specific modeling (healthcare applications)
• CHP creation: customizable computing engines; customizable interconnects; architecture modeling
• CHP mapping: source-to-source CHP mapper; reconfiguring & optimizing backend; adaptive runtime
Design once, invoke many times
52
CHP Mapping – Compilation and Runtime Software Systems for Customization
Goals: efficient mapping of a domain-specific specification to customizable hardware; adapt the CHP to a given application for drastic performance/power efficiency improvement.
Flow:
• The programmer writes domain-specific applications in a domain-specific programming model (a domain-specific coordination graph plus domain-specific language extensions), supported by abstract execution
• A source-to-source CHP mapper, driven by application characteristics and CHP architecture models, produces C/C++ code with analysis annotations
• A C/C++ front end and a reconfiguring and optimizing back end produce binary code for the fixed & customized cores, plus a C/SystemC behavioral spec that the RTL synthesizer (xPilot) turns into RTL for the programmable fabric
• An adaptive runtime (lightweight threads and adaptive configuration) runs on the CHP architectural prototypes (CHP hardware testbeds, a CHP simulation testbed, and the full CHP) and provides performance feedback
53
FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009)
Use CUDA in tandem with high-level synthesis (HLS) to:
• enable high-level abstraction for FPGA programming
• exploit the massively parallel compute capabilities of FPGAs
• provide a single interface for GPU and FPGA kernel acceleration
CUDA: a C-based parallel programming model for GPUs
• concise expression of coarse-grained parallelism
• very popular (wide range of existing applications)
• explicit partitioning and transfer of data between off-chip and on-chip memory
AutoPilot: an advanced HLS tool (from AutoESL)
• platform-specific (i.e., FPGA/ASIC) C-to-RTL mapping
• fine-grained and loop-iteration parallelism extraction
• annotated coarse-grained parallelism extraction (requires explicit expression and annotation by the programmer)
54
CUDA-to-AutoPilot C Translation
• Identify off-chip data transfers; aggregate multi-thread off-chip accesses into DMA bursts
• Split the kernel into computation and data-communication tasks
• Use thread-block granularity when splitting kernel threads into parallel FPGA cores
• Allocate data storage based on the following memory-space mapping (GPU → FPGA):
  – Global → off-chip DRAM
  – Shared → on-chip BRAMs
  – Constant/Texture → registers
  – Registers → local memory
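The translation can be illustrated on a hypothetical kernel: the implicit CUDA thread index becomes an explicit loop over the block, and global-memory accesses become bursts into an on-chip buffer. The kernel, the names, and the fixed block size below are illustrative assumptions, not actual FCUDA output:

```c
/* A hypothetical CUDA kernel
       __global__ void scale(float *g, float s) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           g[i] = s * g[i];
       }
   becomes, for one thread-block mapped onto one FPGA core, an
   explicit loop over threadIdx plus DMA-style bursts between
   off-chip data and an on-chip buffer (BRAM). */
#define BLOCK_DIM 256

void scale_block(float *global_mem, float s, int block_idx)
{
    float buf[BLOCK_DIM];                 /* on-chip BRAM buffer  */
    int base = block_idx * BLOCK_DIM;

    for (int t = 0; t < BLOCK_DIM; t++)   /* burst read (DMA)     */
        buf[t] = global_mem[base + t];

    for (int t = 0; t < BLOCK_DIM; t++)   /* former thread bodies */
        buf[t] = s * buf[t];

    for (int t = 0; t < BLOCK_DIM; t++)   /* burst write back     */
        global_mem[base + t] = buf[t];
}
```

Separating the communication loops from the computation loop is what lets HLS pipeline the compute loop while overlapping or bursting the off-chip transfers.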
55
Results

Kernel | Configuration | Description
Matrix multiply (matmul) | 1024x1024 | Common kernel in many imaging, simulation, and scientific applications
Coulombic potential (cp) | 4000 atoms, 512x512 grid | Computation of the electric potential in a volume containing charged atoms
RSA encryption (rc5-72) | 4 billion keys | Brute-force encryption key generation and matching

Benchmark | Core # | DRAM bandwidth | Limiting resource
matmul 32bit | 128 | 3.5 GB/s | DSP
matmul 16bit | 176 | 1.6 GB/s | BRAM
matmul 8bit | 176 | 0.8 GB/s | BRAM
cp 32bit | 25 | 0.128 GB/s | DSP
cp 16bit | 96 | 0.19 GB/s | DSP
cp 8bit | 96 | 0.1 GB/s | DSP
rc5-72 32bit | 80 | ≈ 0 GB/s | LUT

(Figure: speedup of the FPGA, normalized to the GPU, for the matmul, cp, and rc5-72 configurations.)

Benchmark | GPU (GeForce 8800) | FPGA (Virtex5 xc5vfx200t) | FPGA-over-GPU benefit
matmul 32bit | ≈ 100 W | 10.622 W | 9.41X
matmul 16bit | ≈ 100 W | 10.559 W | 9.47X
matmul 8bit | ≈ 100 W | 9.954 W | 10.05X

Speedup is comparable to the GPU in several configurations, and the FPGA is much more power efficient. (Assumes the FPGA has a high-bandwidth bus to off-chip DDR.)
56
Concluding Remarks
We believe that domain-specific customization is the next transformative approach to energy-efficient computing
• Beyond parallelization?
Many research opportunities and challenges:
• Domain-specific modeling/specification
• Novel architecture & microarchitecture for customization
• Compilation and runtime software to support intelligent customization
• New research in testing, verification, and reliability for customizable computing
CDSC is taking a highly integrated approach: coordinated cross-layer customization in modeling, hardware, software, and application development
57
Acknowledgements
Reinman(UCLA)
Palsberg(UCLA)
Sadayappan(Ohio-State)
Sarkar(Associate Dir)
(Rice)
Vese(UCLA)
Potkonjak (UCLA)
• A highly collaborative effort
• Thanks to all my co-PIs at four universities: UCLA, Rice, Ohio State, and UC Santa Barbara
• Thanks for the support from the National Science Foundation
Aberle(UCLA)
Baraniuk(Rice)
Bui (UCLA)
Cong (Director) (UCLA)
Cheng (UCSB)
Chang (UCLA)