FPGAs for the Masses: Hardware Acceleration without Hardware Design
David B. Thomas
Contents
• Motivation for hardware acceleration
– Increase performance, reduce power
– Types of hardware accelerator
• Research achievements
– Accelerated Finance research group
– Research direction and publications
• Highlighted contribution: Contessa
– Domain specific language for Monte-Carlo
– Push-button compilation to hardware
• Conclusion
Motivation
• Increasing demand for High Performance Computing
– Everyone wants more compute-power
– Finer time-steps; larger data-sets; better models
• Decreasing single-threaded performance
– Emphasis on multi-core CPUs and parallelism
– Do computational biologists need to learn PThreads?
• Increasing focus on power and space
– Boxes are cheap: 16-node clusters are very affordable
– Where do you put them? Who is paying for power?
• How can we use hardware acceleration to help?
Types of Hardware Accelerator
• GPU: Graphics Processing Unit
– Many-core: 30 SIMD processors per device
– High bandwidth, low complexity memory; no caches
• MPPA: Massively Parallel Processor Array
– Grid of simple processors: 300 tiny RISC CPUs
– Point-to-point connections on a 2-D grid
• FPGA: Field Programmable Gate Array
– Fine-grained grid of logic and small RAMs
– Build whatever you want
Hardware Advantages: Performance
A Comparison of CPUs, GPUs, FPGAs, and MPPAs for Random Number Generation, D. Thomas, L. Howes, and W. Luk, Proc. FPGA (to appear), 2009
[Chart: speed-up over CPU for random number generation — CPU 1x, GPU 10x, MPPA 1x, FPGA 31x]
• More parallelism - more performance
• GPU: 30 cores, 16-way SIMD
• MPPA: 300 tiny RISC cores
• FPGA: hundreds of parallel functional units
Hardware Advantages: Power
[Chart: speed-up and power efficiency relative to CPU — speed-up: CPU 1x, GPU 10x, MPPA 1x, FPGA 31x; efficiency: CPU 1x, GPU 9x, MPPA 18x, FPGA 175x]
• GPU: 1.2GHz - same power as CPU
• MPPA: 300MHz - same performance as CPU, but 18x less power
• FPGA: 300MHz - faster and less power
FPGA Accelerated Applications
• Finance
– 2006: Option pricing: 30x CPU
– 2007: Multivariate Value-at-Risk: 33x Quad CPU
– 2008: Credit-risk analysis: 60x Quad CPU
• Bioinformatics
– 2007: Protein Graph Labelling: 100x Quad CPU
• Neural Networks
– 2008: Spiking Neural Networks: 4x Quad CPU, 1.1x GPU
All with less than a fifth of the power
Problem: Design Effort
• Researchers love scripting languages: Matlab, Python, Perl
– Simple to use and understand, lots of libraries
– Easy to experiment and develop a promising prototype
• Eventually the prototype is ready: need to scale to large problems
– Need to rewrite the prototype to improve performance, e.g. Matlab to C
– Simplicity of the prototype is hidden by layers of optimisation
[Chart: relative performance vs. design-time for the CPU, rising from Scripted through Compiled and Multi-threaded to Vectorised]
Problems: Design Effort
• GPUs provide a somewhat gentle learning curve
– CUDA and OpenCL almost allow compilation of ordinary C code
• User must understand GPU architecture to maximise speed-up
– Code must be radically altered to maximise use of functional units
– Memory structures and accesses must map onto physical RAM banks
• We are asking the user to learn about things they don’t care about
[Chart: relative performance vs. design-time — CPU curve as before; GPU curve rising from Compiled through C-to-GPU and Reorganisation to Memory Opt.]
Problems: Design Effort
• FPGAs provide large speed-up and power savings – at a price!
– Days or weeks to get an initial version working
– Multiple optimisation and verification cycles to get high performance
• Too risky and too specialised for most users
– Months of retraining for an uncertain speed-up
• Currently only used in large projects, with a dedicated FPGA engineer
[Chart: relative performance vs. design-time — CPU and GPU curves as before; FPGA curve rising from Initial Design through Parallelisation to Clock Rate]
Goal: FPGAs for the Masses
• Accelerate niche applications with limited user-base
– Don’t have to wait for traditional “heroic” optimisation
• Single-source description
– The prototype code is the final code
• Encourage experimentation
– Give users freedom to tweak and modify
• Target platforms at multiple scales
– Individual user; research group; enterprise
• Use domain specific knowledge about applications
– Identify bottlenecks: optimise them
– Identify design patterns: automate them
– Don’t try to do general purpose “C to hardware”
Accelerated Finance Research Project
• Independent sub-group in the Computer Systems section
– EPSRC project: 3 years, £670K
– “Optimising hardware acceleration for financial computation”
– Team of four: me, Wayne Luk, two PhD students
• Active engagement with financial institutions
– Six-month feasibility study for Morgan Stanley
– PhD student funded by J. P. Morgan
• Established a lead in financial computing using FPGAs
– 7 journal papers, 17 refereed conference papers
– Book chapter in “GPU Gems 3”
Finance: Increasing Automation
[Diagram: case studies plotted against increasing automation, with associated publications]

Automation stages:
– Application-Specific Custom Design
– Manual Design Pattern
– Semi-Automated Design Pattern
– Automated Design Tool (Contessa)

Case studies:
– Simple Monte-Carlo option pricing
– Correlated Value-at-Risk
– Discrete-event credit-risk models
– Path-dependent option pricing
– Irregular Monte-Carlo
– Variance reduction; quasi Monte-Carlo; numerical solutions; target GPUs, MPPAs, ...; dynamic memory support

Publications:
– Hardware architectures for Monte-Carlo based financial simulations, D. Thomas, J. Bower, W. Luk, Proc. FPT, 2006
– A Reconfigurable Simulation Framework for Financial Computation, J. Bower, D. Thomas, et al., Proc. Reconfig, 2006
– Automatic Generation and Optimisation of Reconfigurable Financial Monte-Carlo Simulations, D. Thomas, J. Bower, W. Luk, Proc. ASAP, 2007
– A Domain Specific Language for Reconfigurable Path-based Monte Carlo Simulations, D. Thomas, W. Luk, Proc. FPT, 2007
– Credit Risk Modelling using Hardware Accelerated Monte-Carlo Simulation, D. Thomas, W. Luk, Proc. FCCM, 2008
Finance: Increasing Performance
[Diagram: case studies plotted against increasing performance, with the optimised components and associated publications]

Optimised components:
– Uniform RNGs
– Arbitrary Distribution RNGs
– Multivariate Gaussian RNGs
– Exponential RNGs
– Statistical Accumulators
– Binomial and Trinomial Trees
– Numerical Integration
– Finite Difference Methods

Case studies:
– Simple Monte-Carlo option pricing
– Correlated Value-at-Risk
– Discrete-event credit-risk models
– Path-dependent option pricing
– Irregular Monte-Carlo
– Variance reduction; quasi Monte-Carlo; numerical solutions

Publications:
– High quality uniform random number generation through LUT optimised linear recurrences, D. Thomas, W. Luk, Proc. FPT, 2005
– Efficient Hardware Generation of Random Variates with Arbitrary Distributions, D. Thomas, W. Luk, Proc. FCCM, 2006
– Sampling from the Multivariate Gaussian Distribution using Reconfigurable Hardware, D. Thomas, W. Luk, Proc. FCCM, 2007
– Sampling from the Exponential Distribution using Independent Bernoulli Variates, D. Thomas, W. Luk, Proc. FPL, 2008
– Estimation of Sample Mean and Variance for Monte-Carlo Simulations, D. Thomas, W. Luk, Proc. FPT, 2008
– Exploring reconfigurable architectures for financial computation, Q. Jin, D. Thomas, W. Luk, B. Cope, Proc. ARC, 2007
Contessa: Overall Goals
• Language for Monte-Carlo applications
• One description for all platforms
– FPGA family independent
– Hardware accelerator card independent
• “Good” performance across all platforms
• No hardware knowledge needed
• Quick to compile
• It Just Works: no verification against software
FPGA : Field Programmable Gate Array
[Diagram: 4x4 grid of unconfigured logic cells]

• Grid of logic gates
– No specific function
– Connect as needed
FPGA : Field Programmable Gate Array
• Grid of logic gates
– No specific function
– Connect as needed
• Allocate logic
– Adders, multipliers, RAMs
• Area = performance
– Make the most of it
– Fixed-size grid
[Diagram: the same grid with adders (+) and a multiplier (×) allocated among unconfigured cells]
FPGA : Field Programmable Gate Array
• Grid of logic gates
– No specific function
– Connect as needed
• Allocate logic
– Adders, multipliers, RAMs
• Area = performance
– Make the most of it
• Pipelining is key
– Lots of registers in logic
– Pipeline for high clock rate
[Diagram: the allocated grid again, now with pipeline registers between the functional units]
FPGA : Field Programmable Gate Array
• Grid of logic gates
– No specific function
– Connect as needed
• Allocate logic
– Adders, multipliers, RAMs
• Area = performance
– Make the most of it
• Pipelining is key
– Lots of registers in logic
– Pipeline for high clock rate
• Multi-cycle feedback paths
– Floating-point: 10+ cycles
[Diagram: the allocated grid with a multi-cycle feedback path around the floating-point units]
Contessa: Basic Ideas
• Contessa: pure functional high-level language
– Variables can only be assigned once
– No shared mutable state
• Continuation-based control-flow
– Iteration through recursion
– Functions do not return: no stack
• Syntax driven compilation to FPGA
– No side-effects: maximise thread-level parallelism
– Thread-level parallelism allows deep pipelines
– Deep pipelines allow high clock rate, hence high performance
• Hardware independent
– No explicit timing or parallelism information
– No explicit binding to hardware resources
parameter(float, VOLATILE_ENTER);
parameter(int, MAX_D);  // Remaining parameters elided.

accumulator(float, price);  // Price at end of simulations.

// This block is the arity-0 entry point for all threads.
void init()
{ stable(1, S_INIT); }  // Start threads in stable block

// Stable market: step price forward for each day in simulation.
void stable(int d, float s)
{
  if(d==MAX_D){
    price += s;  // Accumulate final price of simulation.
    return;      // Exit thread with nullary return.
  }else if(unifrnd()>VOLATILE_ENTER){
    volatile(d+1, 0, VOL_INIT, s);  // Simulate volatile day.
  }else{
    float ns=s*lognrnd(STABLE_MU, STABLE_SIGMA);
    stable(d+1, ns);  // Simulate stable day in one step
  }
}

// Volatile market: step price in small increments through day.
void volatile(int dinc, float t, float v, float s)
{
  if(t>MAX_T){        // End of day, so ...
    stable(dinc, s);  // ... return to stable phase.
  }else{
    float nv=sqrt(v+unifrnd());                // New volatility...
    float ns=s*lognrnd(VOL_MU, VOL_SIGMA*nv);  // & price
    volatile(dinc, t+exprnd(), nv, ns);        // Loop
  }
}
• Familiar semantics
– Looks like C
– Behaves like C
– No surprises for user
• Built-in primitives
– Random numbers
– Statistical accumulators
– Map to FPGA-optimised functional units
• Restricts choices
– User can’t write poorly performing code
– e.g. just-in-time random number generation
• Straight to hardware
– No hardware hints
– Push a button
• Each function is a pipeline
– Parameters are inputs
– Function calls are outputs
– Can be very deep pipelines
– Floating-point: 100+ cycles
• Function call = continuation
– Tuple of target + arguments
– Completely captures thread
– Can be queued and routed
• Massively multi-threaded
– Threads are cheap to create
– Route threads like packets
– Queue threads in FIFOs
Convert Functions to Pipelines
void step(int t, float s)
{
  float ds = s + rand();
  step(t+1, ds);
}
[Diagram: step() as a pipeline — t passes through an inc unit to t', while s and a rand unit feed an add unit to produce s', with tokens advancing one register per cycle (cycles -1 to 2)]
Nested Loops
void outer(...){
  if(...) outer(...);
  else if(...) inner(...);
  else acc();
}

void inner(...){
  if(...) inner(...);
  else outer(...);
}
[Diagram: function network — init feeds outer; outer and inner feed each other through input queues (+); outer feeds acc]
Replicating Bottleneck Functions
[Diagram: the network with the bottleneck step function replicated — init feeds a queue (+) serving two step units, which feed acc]
void init(){
  step(...);
}

void step(...){
  if(...) step(...);
  else acc(...);
}

void acc(...){
  ...
}
Contessa: User Experience
• True push-button compilation
– Hardware usage is transparent to user
– High-level source code through to hardware
• Progressive optimisation: speed vs. startup delay
– Interpreted: immediate
– Software: 1-10 seconds
– Hardware: 10-15 minutes
– No alterations to source code
• Speedup of 15x to 60x over software
– Greater speedup in more computationally intensive apps
• Power reduction of 75x to 300x over software
– 300MHz FPGA vs. 3GHz CPU
Contessa: Future Work
• Scaling across multiple FPGAs
– Easy to move thread-states over high-speed serial links
– Automatic load-balancing
• Move threads between FPGA and CPU/GPU
– Some functions are infrequently used: place them in the CPU
– Threads move seamlessly back and forth
• Automatic optimisation of the function network
– Replication of bottleneck functions
– Place lightly loaded functions in slower clock domains
• Allow more general computation
– Fork/join semantics
– Dynamic data structures
Conclusion
• Goal: hardware acceleration of applications
– Increase performance, reduce power
– Make hardware acceleration more widely available
• Achievements: accelerated finance on FPGAs
– Three-year EPSRC project: 25 papers (so far)
– Speedups of 100x over quad CPU, using less power
– Domain specific language for financial Monte-Carlo
• Future: ease-of-use and generality
– Target more platforms, hybrids: CPU+FPGA+GPU
– DSLs for other domains: bioinformatics, neural nets