
Page 1

Team Members:
Tyler Drake
Robert Wrisley
Kyle Von Koepping
Justin Walsh

Faculty Advisors:
Computer Science – Prof. Sanjay Rajopadhye
Electrical & Computer Engineering – Prof. Olivera Notaros

Page 2

• Project Goals: To develop parallel versions of applications that will run on a graphics card and measure the performance.
– Started with a simple Matrix Multiply program.
– We intend to develop at least one or two additional applications and also to pursue an analysis of hardware optimizations.
– Develop a process for tuning applications & hardware that other developers can use more easily.

Page 3

• Tyler Drake – Computer Science major
• Robert Wrisley – Computer Science/Computer Engineering dual major
• Kyle Von Koepping – Electrical Engineering major
• Justin Walsh – Computer Science/Computer Engineering dual major

• Shared coding responsibilities
– Enables comparison and greater understanding for all team members
– Possibly divide responsibilities for the second half of the project

Page 4
Page 5

• Transistor densities on single-core processors were doubling approximately every 18 months.
• This trend has remained valid since first observed in 1965 and is expected to hold for several more years.
• This natural trend had become the standard goal for hardware companies.

Page 6
Page 7

• There is an ultimate limit to Moore’s law.
• Transistors will soon reach atomic sizes.
• Moore’s law does not apply to Random Access Memory (RAM) speeds and hard drive seek times (a.k.a. the Memory Wall).
• Redesign of processor architecture isn’t driven directly by Moore’s Law, but by the fact that these and other factors have not kept up with this growth rate.

Page 8

• A CPU, or multiple CPUs, are not the only processors found on a personal computer.
• The graphics card has a graphics processing unit (GPU).
• The GPU is specifically designed to render 3D models onto a 2D display.
• Designed for floating-point computation with a highly parallel architecture.

Page 9

• Engineers have begun to exploit the highly parallel architecture of the GPU for general applications.
• Graphics companies encourage general-purpose computing on the GPU (GPGPU).
• Nvidia has developed CUDA (Compute Unified Device Architecture).
• Because CUDA is based on the C language, programmers can easily shift to developing on the GPU.

Page 10

What We Have Done So Far

Page 11

• Learning about CUDA
– NVIDIA CUDA guides
– Lecture slides from the University of Illinois at Urbana-Champaign
– Papers from various academic groups:
• University of Illinois at Urbana-Champaign
• Tokyo Institute of Technology
• University of California at Berkeley

• Learning to write parallel programs in CS475 using MPI & OpenMP

• Writing simple programs using CUDA and observing performance
– Matrix Multiply

Page 12

• Results
– Achieved 131 Gigaflops/sec on a GTX 280 with N = 1024. The GTX 280’s peak is 933 Gigaflops/sec.

• Optimizations
– Tiling the result matrix into smaller sub-matrices, and having each thread block compute one sub-matrix, reduces the amount of data that each thread block must load.
– This helps to reduce memory latency.
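The tiling idea can be shown as a CPU reference sketch in plain C (our own illustration, not the project's actual kernel): the result matrix is split into TILE × TILE sub-matrices, and each sub-matrix is accumulated from small tiles of A and B, just as each CUDA thread block would work from tiles staged in shared memory.

```c
#define N 4      /* matrix dimension (hypothetical small size) */
#define TILE 2   /* tile edge, analogous to a thread block's tile */

/* Tiled matrix multiply: C += A * B, computed one TILE x TILE
   sub-matrix at a time.  The caller must zero C first.  On the GPU,
   each (bi, bj) tile would be one thread block's work. */
static void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]) {
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE)   /* walk tiles of A and B */
                for (int i = bi; i < bi + TILE; i++)
                    for (int j = bj; j < bj + TILE; j++) {
                        float sum = 0.0f;
                        for (int k = bk; k < bk + TILE; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] += sum;            /* accumulate partial product */
                    }
}
```

Because each thread block reuses the same small tiles many times, far fewer loads from slow global memory are needed than in the naive row-times-column version.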

Page 13
Page 14
Page 15
Page 16

• Memory
– Must allocate memory on the graphics card from the main program being run on the CPU
– Memory for the graphics card is explicitly managed by the programmer

• An “extension” to C, not a separate language
– Similar to MPI, OpenMP, etc.
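A minimal host-side sketch of these points (the `scale` kernel and the sizes are hypothetical choices of ours, not code from the project): the CPU program allocates device memory, copies data over, launches the kernel with C-like syntax, and frees the memory itself.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) v[i] *= a;
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h[1024];                                  /* host (CPU) buffer */
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;                                       /* device (card) buffer */
    cudaMalloc((void **)&d, bytes);                 /* allocated from the CPU */
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);/* moved explicitly */

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);    /* launch, like a C call */

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);                                    /* programmer-managed */
    return 0;
}
```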

Page 17
Page 18

Increasing problem complexity
– Some are no longer “Pleasantly Parallel”

Higher degree of kernel analysis

Moving to more dynamic programs

Page 19

• Additional programs being written for the GPU include:
– Scan: a prefix-sum computation where the ith index is the sum of the previous i−1 indices
– Knapsack: profit maximization given a capacity and a list of items with their weight & profit
– Matrix Multiply for still larger matrices
– Triangular Matrix Multiplication
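CPU reference sketches of the first two (function names and signatures are ours, for illustration; the GPU versions are organized quite differently):

```c
/* Exclusive scan (prefix sum): out[i] = sum of in[0..i-1],
   so out[0] is 0 and out[i] is the sum of the previous i elements. */
static void scan_exclusive(const int *in, int *out, int n) {
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

/* 0/1 knapsack by dynamic programming: best profit achievable with
   capacity cap, given item weights w[] and profits p[].
   best[] must have cap+1 entries and be zero-initialized. */
static int knapsack(const int *w, const int *p, int n, int cap, int *best) {
    for (int i = 0; i < n; i++)
        for (int c = cap; c >= w[i]; c--)   /* reverse: each item used once */
            if (best[c - w[i]] + p[i] > best[c])
                best[c] = best[c - w[i]] + p[i];
    return best[cap];
}
```

Scan is interesting on a GPU precisely because each output depends on all earlier inputs, so it is not "pleasantly parallel"; knapsack has a similar row-by-row dependence across its DP table.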

Page 20

Mandelbrot Set

Pleasantly parallel, familiar

Easily scalable
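The per-pixel computation is simple enough to sketch here (our own illustration): each pixel's iteration count depends only on its own coordinate c, which is what makes the image pleasantly parallel, one GPU thread per pixel.

```c
/* Escape-time iteration count for one point c = cr + ci*i.
   Returns how many iterations of z = z*z + c it takes for |z| to
   exceed 2, or max_iters if the orbit stays bounded (point is
   taken to be inside the set). */
static int mandel_iters(double cr, double ci, int max_iters) {
    double zr = 0.0, zi = 0.0;
    for (int n = 0; n < max_iters; n++) {
        if (zr * zr + zi * zi > 4.0)       /* |z| > 2: orbit escapes */
            return n;
        double t = zr * zr - zi * zi + cr; /* complex square, real part */
        zi = 2.0 * zr * zi + ci;           /* complex square, imag part */
        zr = t;
    }
    return max_iters;                      /* never escaped */
}
```

Scaling is just a matter of mapping more pixels (or more iterations) onto more threads, with no communication between them.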

Page 21

Ray Tracing

Very computationally intensive

Feasible for non-realtime computations

Very dynamic, due to recursion

High degree of realism

Page 22

Examples of images generated by Ray Tracing

Page 23

Hidden Markov Models

Clear parallelism

Wide range of applications
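The parallelism shows up in the forward algorithm, sketched below for a toy two-state model (the model sizes and names are our hypothetical choices): at each time step, every state's new probability can be computed independently, one thread per state.

```c
#define S 2   /* hidden states (toy size for illustration) */

/* Forward algorithm: likelihood P(obs) of an observation sequence
   under an HMM with initial probabilities pi, transition matrix A,
   and emission matrix B (two possible observation symbols).
   The inner per-state sums at each step are independent. */
static double hmm_forward(const double *pi, double A[S][S],
                          double B[S][2], const int *obs, int T) {
    double alpha[S], next[S];
    for (int s = 0; s < S; s++)
        alpha[s] = pi[s] * B[s][obs[0]];      /* initialize with first symbol */
    for (int t = 1; t < T; t++) {
        for (int s = 0; s < S; s++) {         /* each s is independent work */
            double sum = 0.0;
            for (int r = 0; r < S; r++)
                sum += alpha[r] * A[r][s];    /* all ways of entering s */
            next[s] = sum * B[s][obs[t]];     /* then emit obs[t] */
        }
        for (int s = 0; s < S; s++) alpha[s] = next[s];
    }
    double p = 0.0;
    for (int s = 0; s < S; s++) p += alpha[s];
    return p;
}
```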

Page 24

Uses of Hidden Markov Models

Page 25

• To develop a more complex application for the GPU and optimize its performance
• To analyze hardware optimizations and evaluate the performance gains
• To develop a process for future programmers that will give them the best performance increases with the minimum development effort

• Please note: these goals are tentative and subject to change.

Page 26
Page 27

• Moore’s Law is now being applied to cores per processor instead of transistors per processor.
• Multi-core machines offer the next generation of performance enhancements… and they are already here!
• GPUs provide massively parallel architectures that programmers can take advantage of to see phenomenal performance gains.

Page 28

• Learning to use the CUDA library and some of its nuances.
• Have gotten good performance on Matrix Multiply attempts.
• Also completing CUDA versions of the Scan and Knapsack problems.
• Move on to a more complex application.
• Researching hardware optimizations that can further enhance performance on GPUs.
• Develop a combined approach for future applications programmers to follow.

Page 29

• $50 spent on a graphics card that is CUDA compatible.
• We’d like to thank Prof. Dan Connors for the use of his machines with Nvidia GTX 280 graphics cards.
– This gave us free access to a consistent build on which all of us could run our code and sample code.
• We don’t project any major costs next semester, except perhaps some materials for our E-Days presentation.

Page 30