Field-Programmable Gate Array Research Speeds HPC “up to 100X”
Presented by Olaf O. Storaasli, Future Technologies Group, Computer Science and Mathematics Division


Page 1: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”

Presented by

Field-Programmable Gate Array Research Speeds HPC “up to 100X”

Olaf O. Storaasli
Future Technologies Group

Computer Science and Mathematics Division

Page 2: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”

2 Storaasli_FPGA_SC07

Contents
Background: Why FPGAs?
ORNL success: FPGA systems, tools and up to 100X speedup
Partners: Research Lab, SRC

THE SUPERCOMPUTER COMPANY

Explore FPGAs for future ORNL HPC

Virtex4 FPGA blades “accelerate mission-critical applications > 100X.”

Steve Scott, CTO, HPCWire, 24/3/2006

“After exhaustive analysis, Cray concluded that, although multi-core commodity processors will deliver some improvement, exploiting parallelism through a variety of processor technologies using scalar, vector, multithreading and hardware accelerators (e.g., FPGAs or ClearSpeed co-processors) creates the greatest opportunity for application acceleration.”

ORNL benefit: Exceed petaflops and reduce power

Why HPC vendors offer FPGAs


Page 3: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


What’s an FPGA? Your “custom chip”

[Figure: FPGA logic slice]

Xilinx Virtex4 FPGA: 25K slices (miniCPUs)
Logic array: user-tailored to application
On-chip RAM, multipliers and PowerPCs
Gigabit transceivers/DSP blocks => fast I/O and precision
100–1000 operations/clock cycle
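
The "100–1000 operations/clock cycle" figure can be turned into a rough peak-throughput estimate by multiplying by the clock rate. The sketch below assumes a 150 MHz clock, borrowed from the achievable frequency reported for the LU designs later in this deck; that value may not apply to other designs, and the function name `peak_gops` is illustrative only.

```python
def peak_gops(ops_per_cycle: int, clock_mhz: float) -> float:
    """Rough peak throughput in giga-operations per second (GOPS)."""
    # ops/cycle * (MHz * 1e6 cycles/s) / 1e9 = ops/cycle * MHz / 1000
    return ops_per_cycle * clock_mhz / 1000.0


ASSUMED_CLOCK_MHZ = 150.0  # assumption: taken from the LU design table, not from this slide

for ops in (100, 1000):
    gops = peak_gops(ops, ASSUMED_CLOCK_MHZ)
    print(f"{ops} ops/cycle at {ASSUMED_CLOCK_MHZ:.0f} MHz -> {gops:.0f} GOPS")
```

Under that assumed clock the range works out to roughly 15 to 150 GOPS, which shows why operations per cycle, rather than clock speed, dominates FPGA throughput.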

Page 4: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


[Figure: bar chart comparing a Pentium with a Virtex-4 FPGA on computation (GOPS), memory bandwidth (GB/sec), and I/O bandwidth (Gbps); vertical axis 0 to 300.]

Why FPGAs?
Performance: optimal silicon use (maximize parallel ops/cycle)
Rapid growth: cells, speed, I/O
Power: 1/10th that of CPUs
Flexible: tailor to application

[Figure: growth of FPGA logic cells (thousands) and clock speed (MHz), 2002 to 2008.]

Page 5: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


ORNL FPGA hardware/tools
SRC-6 (Carte), Digilent (Viva, VHDL), Nallatech (Viva)
Cray XD1 (MitrionC, VHDL): 6 FPGAs + 144 Opterons
SGI RASC-Altix/Virtex4s (MitrionC)
CHiMPS (Bee2 => Cray XD1 => DRC => XT4)

[Images: Cray XD1; SGI RASC]

Page 6: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


Ported HPC code spectral transform shallow water model (STSWM) to FPGAs

Goal: Model faster; more GF/$ and GF/Watt
Profile: Find parallelism: 80% FFTs

[Figure: STSWM call graph showing the routines STEP, COMP1, SHTRNS FFT, UV FFT, FTRNDE, FTRNPE, FTTdd, FTRNEX and FTRNVX, annotated "8 calls in parallel", "3 functions in parallel" and "2 calls in parallel".]

Workflow: HLL developer profiles; HLL compiler (CHiMPS, Mitrion; FPGA tools inside); FPGA speedup
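
The profile figure matters because it caps what FPGA acceleration of the FFTs alone can deliver. A minimal sketch using Amdahl's law, assuming "80% FFTs" means the FFTs account for 80% of runtime (the slide does not say so explicitly); the function name and the sample kernel speedups are illustrative, not measured values.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / kernel_speedup)


FFT_FRACTION = 0.80  # assumption: share of runtime spent in FFTs

for s in (2, 10, 100):  # hypothetical FFT kernel speedups
    print(f"FFT kernel {s:>3}X faster -> whole model {overall_speedup(FFT_FRACTION, s):.2f}X faster")
```

With 20% of the runtime left on the CPU, the whole-model speedup stays below 5X however fast the FFTs become, so the remaining routines would also need to be parallelized to go further.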

Page 7: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


Exploring programming options

Viva: Graphical icons, 3-dimensional
MitrionC: Text/flow, 1-dimensional
Gauss matrix solver
Compiler, simulator, and debugger
+ Carte/SRC, CHiMPS-VHDL/Xilinx

Page 8: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


*FPGA vs 2.2 GHz Opteron

First mixed-precision LU and solver for FPGAs

Benefits:
High performance of low-precision (LP) arithmetic
High-precision accuracy
Speedup increases with matrix size (as LU dominates calculations)

Design                 Double FP      Single FP      S10e5
PE amount              8              16             32
Max size               128            256            256
Achievable frequency   120 MHz        150 MHz        150 MHz
Slices                 27,005 (57%)   14,792 (59%)   14,730 (62%)
BRAMs                  68 (29%)       129 (55%)      65 (28%)
MULT18X18              128 (55%)      64 (27%)       32 (13%)

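[Figure: execution time (us) versus matrix size for the S10e5, single and double designs; vertical axis 0 to 1000. Values as read from the bar labels:]

Matrix size   S10e5   Single   Double
64            57      87       149
96            133     218      404
128           258     443      865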

[Figure: speedup by design data type for LU and solver; vertical axis 0 to 40. Values as read from the bar labels:]

Data type   LU     Solver
double      10.9   7.7
single      21.3   9.7
S10e5       36.6   10.3

37X* LU decomposition speedup; 10X for matrix equation solver
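
The slide does not describe how the mixed-precision solver is organized. One common way to pair low-precision speed with high-precision accuracy is iterative refinement: factor the matrix in single precision, then correct the solution with residuals computed in double precision. The sketch below illustrates that general technique under stated assumptions and is not the ORNL FPGA design; `f32`, `lu_factor_f32`, `lu_solve_f32` and `solve_mixed` are hypothetical names, and single precision is emulated in software with `struct`.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """In-place style LU with partial pivoting; every stored result is rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))  # perm[k] = original index of the row now in position k
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])  # multiplier stored in the L part
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Solve L U x = P b using single-precision arithmetic."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # forward substitution (L has a unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def solve_mixed(a, b, iterations=5):
    """Single-precision LU plus double-precision iterative refinement."""
    lu, perm = lu_factor_f32(a)
    x = lu_solve_f32(lu, perm, b)  # low-precision first guess
    for _ in range(iterations):
        # Residual r = b - A x, evaluated in double precision.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]
        d = lu_solve_f32(lu, perm, r)  # cheap correction, reusing the factors
        x = [xi + di for xi, di in zip(x, d)]
    return x


# Illustrative 3x3 system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
B = [12.0, 20.0, 26.0]
print("single only:", lu_solve_f32(*lu_factor_f32(A), B))
print("refined    :", solve_mixed(A, B))
```

The expensive O(n^3) factorization runs once in low precision, while each refinement step costs only O(n^2), which is consistent with the slide's note that LU dominates the calculation as the matrix grows.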

Page 9: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


100X* DNA sequence speedup
Bacillus anthracis human DNA comparison

[Figure: FPGA speedup (vertical axis 0 to 120) versus genome sequence (ticks 26 to 40) for four cases: 8K w/align, 16K w/align, 8K w/o align, 16K w/o align. Annotation: 8 hrs => 5 min.]

*Virtex-4 FPGA vs 2.2 GHz Opteron on Cray XD1
#24 = Sequence AE17024
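
The chart annotation "8 hrs => 5 min" can be checked against the headline "100X" figure with a simple ratio. This sketch assumes both times refer to the same workload; the function name `speedup` is illustrative.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Speedup factor = baseline runtime / accelerated runtime."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


cpu_seconds = 8 * 3600   # "8 hrs" from the chart annotation
fpga_seconds = 5 * 60    # "5 min" from the chart annotation

print(f"{speedup(cpu_seconds, fpga_seconds):.0f}X")  # prints "96X"
```

A factor of 96 is in line with the slide's rounded "100X" claim.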

Page 10: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


FPGA speedup grows with query size

Page 11: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


Acknowledgement:

This research is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Summary

ORNL FPGA research: Increasing HPC relevance
FPGA systems: Cray, SRC, Nallatech, Digilent, SGI
Compilers: Mitrion-C, Carte, Viva, DSPlogic, CHiMPS
Speedup: 10X eqn soln, 100X DNA sequencing
Partners: Xilinx, UT, Mitrion, Cray, SGI

Next: Explore DRC, more FPGAs and CHiMPS

Page 12: Field-Programmable Gate Array Research Speeds HPC  “ up to 100X ”


Contact


Olaf Storaasli
Future Technologies Group
Computer Science and Mathematics Division
[email protected] Olaf ORNL