Vivek Sarkar

Department of Computer Science, Rice University

[email protected]

August 27, 2007

COMP 635: Seminar on Heterogeneous Processors

www.cs.rice.edu/~vsarkar/comp635

COMP 635, Fall 2007 (V.Sarkar)

Course Goals

• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum

• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)

• Discuss research challenges in advancing the state of the art of software for heterogeneous processors

• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas


Course Organization

• Class dates (12 lectures)
— 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
— No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
— No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week

• Time & Place
— Default: Mondays, 3:30pm - 4:30pm, DH 2014
— Exception: time & place for 9/20 (Thurs) lecture TBD
— 30 minutes reserved after lecture for discussion (optional)

• Office Hours (DH 3131)
— 11am - 12noon, Fridays from 8/31/07 to 12/7/07

• OWL-Space repository: COMP 635 F07

• Grading
— Satisfactory/unsatisfactory grade for students taking seminar for credit
– Others should register officially as auditors, if possible
— For a satisfactory grade, you need to
1. Attend at least 50% of lectures
2. Submit a 4-page project/study report by 12/7/07 (report can be prepared in a group - just plan on 4 pages/person in that case)
— Optional in-class presentation of project/study report on 12/3/07


Course Content

• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)

• Cell Processor and Cell SDK (2 lectures)

• Nvidia GPU and CUDA programming environment (2 lectures)

• DRC FPGA Coprocessor Module and Celoxica Programming Environment (1 lecture)

• Clearspeed Accelerator and SDK (1 lecture)

• Imagine Stream Processor (1 lecture)

• Microsoft Accelerator Library (1 lecture)

• Vector and SIMD processors -- a historical perspective (1 lecture)

• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)

• Student presentations (1 lecture)


COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models


Acknowledgments

• Georgia Tech ECE 6100, Module 14
— Vince Mooney, Krishna Palem, Sudhakar Yalamanchili
— http://www.ece.gatech.edu/academic/courses/fall2006/ece6100/Class/index.html

• MIT 6.189 IAP 2007, Lecture 2
— “Introduction to the Cell Processor”, Michael Perrone
— http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• UIUC ECE 497, Lecture 16
— courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

• UIUC ECE 498 AL1, Programming Massively Parallel Processors
— David Kirk, Wen-mei Hwu
— http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html


Heterogeneous Processors

[Figure: block diagram of a heterogeneous processor, showing a general-purpose processor (GPP), a memory transfer module (MTM), MAIN MEMORY, and accelerators (ACC) with LOCAL MEMORY]

Memory transfer module schedules system-wide bulk data movement

General-purpose processor orchestrates activity

Accelerators can use scheduled, streaming communication… or can operate on locally-buffered data pushed to them in advance

Accelerated activities and associated private data are localized for bandwidth, power, efficiency

Motivation:

1) Different parts of programs have different requirements
  Control-intensive portions need good branch predictors, speculation, big caches to achieve good performance
  Data-processing portions need lots of ALUs, have simpler control flows

2) Power consumption
  Features like branch prediction, out-of-order execution, tend to have very high power/performance ratios.
  Applications often have time-varying performance requirements
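As a rough illustration of the organization described on this slide, the following Python sketch models a general-purpose processor that splits a large array into chunks small enough for an accelerator's local memory, copies each chunk in bulk, and lets the accelerator work only on its local copy. The names `LOCAL_MEMORY_WORDS`, `mtm_copy`, `accelerator_kernel`, and `gpp_orchestrate` are hypothetical and do not correspond to any real API. The code is a sequential simulation under those assumptions, not real hardware offload.

```python
# Sequential simulation of the GPP / MTM / accelerator roles (all names are illustrative).

LOCAL_MEMORY_WORDS = 4  # assumed capacity of one accelerator's local memory


def mtm_copy(main_memory, start, count):
    """Model a bulk transfer from main memory into a local buffer."""
    return list(main_memory[start:start + count])


def accelerator_kernel(local_buffer):
    """Model the accelerator: it touches only locally buffered data."""
    return [x * x for x in local_buffer]


def gpp_orchestrate(main_memory):
    """Model the GPP: schedule transfers and kernel launches chunk by chunk."""
    result = []
    for start in range(0, len(main_memory), LOCAL_MEMORY_WORDS):
        local = mtm_copy(main_memory, start, LOCAL_MEMORY_WORDS)  # push data in advance
        assert len(local) <= LOCAL_MEMORY_WORDS                   # chunk fits in local memory
        result.extend(accelerator_kernel(local))                  # compute on local data only
    return result


squares = gpp_orchestrate(list(range(10)))
print(squares)
```

A real system would overlap the transfer of the next chunk with computation on the current one; the sketch keeps the two steps strictly sequential for clarity.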


Sample Application Domains for Heterogeneous Processors

• Cell Processor
— Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, …

• GPU (e.g., Nvidia)
— Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, …

• FPGA (e.g., Xilinx DRC)
— HPC, Petroleum, Financial, …

• HPC accelerators (e.g., Clearspeed)
— HPC, Network processing, Graphics, …

• Stream Processors (e.g., Imagine)
— Image processing, Signal processing, Video, Graphics, …

• Others
— TCP/IP offload, Crypto, …


Programming Models for Heterogeneous Processors

• Data Parallelism

• Single Program Multiple Data (SPMD)

• Pipelining

• Work Queue

• Fork Join

• Message Passing

• Storage Models: Shared vs. Local vs. Partitioned Memories

• Hybrid combinations of above

Only a limited subset of these models are in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers
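To make three of the listed models concrete, the sketch below computes the same saxpy-style result (a*x + y) using data parallelism, fork-join, and pipelining. It is only an illustration on an ordinary CPU: the function names are hypothetical, threads stand in for parallel hardware, and no accelerator is involved.

```python
from concurrent.futures import ThreadPoolExecutor

A = 2.0
X = [1.0, 2.0, 3.0, 4.0]
Y = [10.0, 20.0, 30.0, 40.0]


def data_parallel(a, xs, ys):
    """Data parallelism: the same operation applied independently to every element."""
    return [a * x + y for x, y in zip(xs, ys)]


def fork_join(a, xs, ys, workers=2):
    """Fork-join: fork one task per chunk, then join the partial results in order."""
    size = max(1, (len(xs) + workers - 1) // workers)
    chunks = [(xs[i:i + size], ys[i:i + size]) for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: data_parallel(a, c[0], c[1]), chunks)
        return [v for part in parts for v in part]


def pipeline(a, xs, ys):
    """Pipelining: each stage consumes the stream produced by the previous stage."""
    scaled = (a * x for x in xs)                   # stage 1: scale
    summed = (s + y for s, y in zip(scaled, ys))   # stage 2: add
    return list(summed)


results = [data_parallel(A, X, Y), fork_join(A, X, Y), pipeline(A, X, Y)]
print(results)
```

All three functions return the same list; what differs is how the work is decomposed, which is the property that matters when mapping a program onto an accelerator.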


Heterogeneous Processor Spectrum

Heterogeneous Multicore

Dimension 1: Distance of accelerator from main processor

Dimension 2: Hardware customization in accelerator


Focus of this course



Spectrum of Programmers for Heterogeneous Processors

• Application-level Users
— Plug & play experience by using ISV frameworks such as MATLAB and Mathematica, etc

• Library-level Programmers
— Portable library interface that works across homogeneous and heterogeneous processors

• Language-level Programmers
— Portable programming language that works across homogeneous and heterogeneous processors
— Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!

• SDK-level Programmers
— C-based compilers and tools that are specific to a given heterogeneous processor


Focus of this course


Cell Broadband Engine (BE)


Cell Performance


Cell Temperature Distribution

Power and heat are key constraints


Code Partitioning for Cell

[Figure: a program graph transformed by outlining and then cloning, with one clone compiled for PPE and one compiled for SPE. Key: flow graph node, call graph node, flow graph edge, call graph edge]

• Outlining: extract parallel loop into a separate procedure
• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from loop
• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
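The three steps above can be mimicked in a few lines of Python. This is only a sketch of the idea, not the compiler transformation from the cited paper: a `queue.Queue` stands in for a mailbox queue, a `threading.Event` stands in for a signal register, a thread stands in for an SPE, and all function names are hypothetical. Splitting the iteration space between the PPE clone and the SPE clone is an assumption made to show both clones in use.

```python
import queue
import threading


def loop_body(i, data):
    """Procedure called from the parallel loop; it would be cloned along with the loop."""
    return data[i] * 2


def outlined_loop(lo, hi, data, out):
    """Outlining: the parallel loop extracted into its own procedure."""
    for i in range(lo, hi):
        out[i] = loop_body(i, data)


# Cloning: one copy per target. A real compiler would compile each clone for a
# different instruction set; here the clones are plain aliases.
outlined_loop_ppe = outlined_loop
outlined_loop_spe = outlined_loop


def spe_worker(mailbox, done, data, out):
    """SPE side: wait for a work descriptor, run the SPE clone, then signal completion."""
    lo, hi = mailbox.get()
    outlined_loop_spe(lo, hi, data, out)
    done.set()


def run_partitioned(data):
    out = [None] * len(data)
    mailbox = queue.Queue()      # stands in for a mailbox queue
    done = threading.Event()     # stands in for a signal register
    spe = threading.Thread(target=spe_worker, args=(mailbox, done, data, out))
    spe.start()
    mid = len(data) // 2
    mailbox.put((mid, len(data)))           # coordination: PPE sends work to the SPE
    outlined_loop_ppe(0, mid, data, out)    # PPE clone handles the first half
    done.wait()                             # coordination: PPE waits for the SPE
    spe.join()
    return out


print(run_partitioned([1, 2, 3, 4, 5]))
```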


Why GPUs?

• A quiet revolution and potential build-up
— Calculation: 367 GFLOPS vs. 32 GFLOPS (GPU vs. CPU)
— Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s (GPU vs. CPU)
— Until last year, programmed through graphics API
— GPU in every PC and workstation – massive volume and potential impact


Sample GPU Applications

Application | Description | Source | Kernel | % time
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%


Performance of Sample Kernels and Applications

• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt
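The fraction of run time spent in the kernel limits how much a kernel speedup helps the whole application. The sketch below applies Amdahl's law, which is not stated on the slides and is added here only for illustration, to kernel time fractions reported for the sample GPU applications (35% for H.264, 96% for TRACF, 16% for FDTD). The 10× kernel speedup is the "typical" figure quoted above, used as an assumption rather than a measured value for these applications, and the function name is hypothetical.

```python
def application_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only kernel_fraction of run time is accelerated."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Kernel time fractions from the sample applications; the 10x kernel speedup is assumed.
for name, fraction in [("H.264", 0.35), ("TRACF", 0.96), ("FDTD", 0.16)]:
    print(name, round(application_speedup(fraction, 10.0), 2))
```

Under these assumptions an application that spends only a small share of its time in the kernel gains little overall, however fast the kernel becomes.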


FPGAs: Basics of FPGA Offload

Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, [email protected], gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf


FPGA Acceleration Examples


ClearSpeed Multi-Threaded Array Processor (MTAP)

• Hardware multi-threading for latency tolerance

• Asynchronous, overlapped I/O

• Poly execution unit contains 96 Processor Elements (PE’s) or cores.

• Array of PE’s operates in a synchronous manner, i.e. each PE executes the same instruction on its data.

Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, [email protected], www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20Daresbury%20MEW%202006.pdf
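The synchronous behaviour of the PE array can be pictured with a small simulation in which every PE holds one private value and all PEs apply the same instruction before any of them moves on. This is a minimal sketch under that assumption; `run_lockstep` and the two-instruction program are hypothetical and say nothing about the actual MTAP instruction set.

```python
NUM_PES = 96  # number of processor elements stated above


def run_lockstep(program, pe_data):
    """Apply each instruction to every PE's private value before issuing the next one."""
    state = list(pe_data)
    for instruction in program:
        state = [instruction(value) for value in state]  # same instruction, per-PE data
    return state


program = [lambda v: v + 1, lambda v: v * 2]
result = run_lockstep(program, range(NUM_PES))
print(result[:4])
```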


Clearspeed Linpack results

• Standard System
— Two 3.0 GHz Intel Xeon 5160 (Woodcrest) dual core processors, 16GB memory per node
– Single server: 34 GFLOPS
– Four node cluster: 136 GFLOPS
– Power consumption: 1,940 Watts
– Benchmark runtime: 48.4 minutes

• ClearSpeed Accelerated System
— Add two Advance accelerator boards per node (25W per board!)
– Single server: 90.1 GFLOPS
– Four node cluster: 364.2 GFLOPS
– Power consumption: 2,140 Watts
– Benchmark runtime: 18.4 minutes
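The four-node figures above can be turned into a few derived metrics. The sketch below computes the throughput gain, the runtime gain, GFLOPS per watt, and the energy per benchmark run. It assumes the quoted power is drawn constantly for the whole run, which the slide does not state, and the dictionary keys and function names are illustrative.

```python
# Figures copied from the four-node cluster results above.
standard = {"gflops": 136.0, "watts": 1940.0, "minutes": 48.4}
accelerated = {"gflops": 364.2, "watts": 2140.0, "minutes": 18.4}


def gflops_per_watt(system):
    return system["gflops"] / system["watts"]


def energy_kwh(system):
    """Energy for one benchmark run, assuming constant power draw for the whole run."""
    return system["watts"] * (system["minutes"] / 60.0) / 1000.0


throughput_gain = accelerated["gflops"] / standard["gflops"]
runtime_gain = standard["minutes"] / accelerated["minutes"]
efficiency_gain = gflops_per_watt(accelerated) / gflops_per_watt(standard)
energy_ratio = energy_kwh(accelerated) / energy_kwh(standard)
extra_watts = accelerated["watts"] - standard["watts"]  # 4 nodes x 2 boards x 25 W

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"runtime gain:    {runtime_gain:.2f}x")
print(f"GFLOPS/W gain:   {efficiency_gain:.2f}x")
print(f"energy ratio:    {energy_ratio:.2f}")
print(f"extra power:     {extra_watts:.0f} W")
```

The extra 200 W matches eight boards at 25 W each, so the quoted power figures are internally consistent.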


ClearSpeed’s CSXL acceleration library

The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.
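The idea of intercepting library calls can be sketched as a wrapper that keeps the original function signature and decides per call where to run. How CSXL actually intercepts calls is not described here, so everything below is an assumption for illustration: the size threshold `min_dim`, the function names, and the stand-in "accelerator" routine, which simply reuses the host code.

```python
def host_dgemm(a, b):
    """Reference matrix multiply standing in for the host BLAS routine."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def accelerator_dgemm(a, b):
    """Stand-in for an accelerated routine; it reuses the host code in this sketch."""
    return host_dgemm(a, b)


def intercept(host_fn, accel_fn, min_dim):
    """Return a drop-in replacement that routes large problems to accel_fn."""
    calls = {"host": 0, "accelerator": 0}

    def wrapper(a, b):
        large = min(len(a), len(b), len(b[0]) if b else 0) >= min_dim
        calls["accelerator" if large else "host"] += 1
        return (accel_fn if large else host_fn)(a, b)

    wrapper.calls = calls  # expose the routing counts for inspection
    return wrapper


dgemm = intercept(host_dgemm, accelerator_dgemm, min_dim=2)
small = dgemm([[2.0]], [[3.0]])
large = dgemm([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
print(small, large, dgemm.calls)
```

Because the wrapper has the same call signature as the routine it replaces, application code does not need to change, which is the property that makes interception attractive for library-level programmers.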


Imagine Stream Processor


Transforming Memory Accesses to Communication for Scalability

Software challenge: deliver productivity of shared memory model, combined with scalability of communication model


Example of how Compilers can Help

Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

Opportunity for new languages to reduce compiler effort and broaden applicability


Code Partitioning for Heterogeneous Processors

• Factors to consider when extracting a region of code for execution on an accelerator
— Matching operations in code region with primitives in accelerator (includes instruction selection and FPGA synthesis)
— Establishing coherence between main and local memories
— Obeying local memory size constraints
— Volume of data to be communicated
— Granularity of region relative to overhead of thread creation
— Structural constraints of task/thread being extracted
— Cloning of code that needs to be executed on multiple elements
— Coordination with rest of the program (coroutine vs. macro-dataflow models)
— . . .
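Three of the factors above (local memory size, data volume, and granularity relative to overhead) can be combined into a simple offload test. The cost model below is a deliberately naive sketch: the `Region` and `Accelerator` fields, the additive time estimate, and all numbers are hypothetical, and a real partitioner would need measured or compiler-derived estimates.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical summary of a candidate code region."""
    host_seconds: float      # estimated run time on the main processor
    accel_seconds: float     # estimated run time on the accelerator
    bytes_transferred: int   # data moved to and from the accelerator
    working_set_bytes: int   # data that must be resident in local memory


@dataclass
class Accelerator:
    """Hypothetical accelerator parameters."""
    local_memory_bytes: int
    bandwidth_bytes_per_s: float
    launch_overhead_s: float


def should_offload(region, accel):
    """Offload only if the region fits in local memory and still wins after overheads."""
    if region.working_set_bytes > accel.local_memory_bytes:
        return False  # violates the local memory size constraint
    transfer = region.bytes_transferred / accel.bandwidth_bytes_per_s
    offload_time = accel.launch_overhead_s + transfer + region.accel_seconds
    return offload_time < region.host_seconds


accel = Accelerator(local_memory_bytes=256 * 1024,
                    bandwidth_bytes_per_s=1e9,
                    launch_overhead_s=1e-4)
coarse = Region(0.010, 0.001, 200_000, 128 * 1024)    # large region, fits locally
fine = Region(0.00005, 0.000005, 1_000, 1024)         # too small to amortize the launch
too_big = Region(1.0, 0.01, 1_000_000, 512 * 1024)    # exceeds local memory
print(should_offload(coarse, accel), should_offload(fine, accel), should_offload(too_big, accel))
```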


Reading List for Next Lecture (Sep 10th)

1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf

2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508


Announcement: Kickoff Meeting for Habanero Multicore Software Research Project

Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!


BACKUP SLIDES START HERE


Freescale MPC8572 PowerQUICC III Processor

• Dual Embedded e500 core, 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
— L1 cache: 32 KB data and 32 KB
— Shared L2 cache: 1 MB with ECC
— L2 configurable as SRAM, cache and I/O transactions can be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers


Freescale MPC8572 PowerQUICC III Processor

Source: Freescale


AMD’s use of HyperTransport (Torrenza)

• “Torrenza” technology
— Allows licensing of coherent HyperTransport™ to 3rd party manufacturers to make socket-compatible accelerators/co-processors
— Allows 3rd party PPUs (Physics Processing Unit), GPUs, and co-processors to access main system memory directly and coherently
— Could make accelerator programming model easier to use than say, the Cell processor, where each SPE cannot directly access main memory.