Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science



Why GPUs?

Every mobile and desktop system will have one

Affordable and high performance

Over-provisioned

Programmable

Sony PlayStation Phone (image)

Higher FLOPs per Watt and higher FLOPs per dollar:
i7: 0.36 GFLOP/$ vs. GTX 285: 3.54 GFLOP/$
i7: 0.78 GFLOP/W vs. GTX 285: 5.2 GFLOP/W

GPU Architecture

[Figure: GPU architecture. A CPU launches kernels (Kernel 1, Kernel 2) over time onto streaming multiprocessors SM 0 through SM 29 connected by an interconnection network; each SM has registers and shared memory, and all SMs share the global (device) memory.]

GPU Programming Model

Threads → Blocks → Grid

All the threads run one kernel

Registers private to each thread

Registers spill to local memory

Shared memory shared between threads of a block

Global memory shared between all blocks

GPU Execution Model

[Figure: the thread blocks of a grid (Grid 1) are distributed across the streaming multiprocessors (SM 0, SM 1, ..., SM 30), each with its own registers and shared memory.]

[Figure: within one SM, several blocks (Block 0 to Block 3) share the registers and shared memory; each block's threads are grouped into warps (Warp 0, Warp 1, ...) indexed by thread ID.]
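To make the memory spaces and the thread/block/grid hierarchy above concrete, here is a minimal CUDA sketch (not from the paper; the kernel name, the 256-thread block size, and the data layout are assumptions for illustration):

// Hypothetical kernel touching all three memory spaces described above.
// Launched e.g. as: memory_spaces<<<numBlocks, 256>>>(d_in, d_out);
__global__ void memory_spaces(const float *in, float *out) {
    // Registers: private to each thread (they spill to local memory if over-used).
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // Shared memory: visible to every thread of this block.
    __shared__ float tile[256];              // assumes blockDim.x <= 256
    tile[threadIdx.x] = in[tid];             // load from global memory
    __syncthreads();                         // make the tile visible block-wide

    // Combine this thread's element with a neighbor's element from shared memory.
    acc = tile[threadIdx.x] + tile[(threadIdx.x + 1) % blockDim.x];

    // Global memory: visible to all blocks of the grid (and to the host).
    out[tid] = acc;
}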

GPU Programming Challenges

[Figure: performance of code optimized for a GeForce GTX 285 vs. code optimized for a GeForce 8400 GS.]

Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers

Partitioning work between CPU and GPU

Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory

These will vary even more: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources

Nonlinear Optimization Space

[Figure: SAD optimization space, 908 configurations (Ryoo, CGO '08).]

We need a higher level of abstraction!

Goals

Write-once parallel software

Free the programmer from low-level details

[Figure: one parallel specification mapped to many targets: (C + Pthreads) shared-memory processors, (C + intrinsics) SIMD engines, (Verilog/VHDL) FPGAs, (CUDA/OpenCL) GPUs.]

Streaming

Higher level of abstraction

Decoupling computation and memory accesses

Coarse-grained exposed parallelism, exposed communication

Programmers can focus on the algorithms instead of low-level details

Streaming actors use buffers to communicate

A lot of recent work on extending the portability of streaming applications

[Figure: an example stream graph: Actors 1 through 6 connected through a splitter and a joiner.]

Stream programming encourages a style of programming that expresses the parallelism inherent in a program by decoupling computation and memory accesses [11][12]. The explicit parallelism and locality of data in a stream program make it easier to compile efficiently using traditional compiler optimizations.

Sponge

Generating optimized CUDA for a wide variety of GPU targets

Perform an array of optimizations on stream graphs

Optimizing and porting to different GPU generations

Utilize memory hierarchy (registers, shared memory, coalescing)

Efficiently utilize streaming cores

[Figure: Sponge's optimizations: Reorganization and Classification, Memory Layout, Graph Restructuring, Register Optimization, Shared/Global Memory, Helper Threads, Bank Conflict Resolution, Loop Unrolling, Software Prefetching.]

GPU Performance Model

[Figure: in memory-bound kernels the memory instructions (M) dominate total time; in computation-bound kernels the computation instructions (C) dominate.]

Actor Classification

High Traffic actors (HiT):
Large number of memory accesses per actor
Fewer threads when using shared memory
Using shared memory underutilizes the processors

Low Traffic actors (LoT):
Smaller number of memory accesses per actor
More threads
Using shared memory increases the performance

Global Memory Accesses

[Figure: four threads, each running an A[4,4] actor, read from global memory; because every thread pops its own four consecutive elements, the words touched at any one step are four apart (0, 4, 8, 12, then 1, 5, 9, 13, and so on).]

Large access latency

Threads do not access the words in sequence

No coalescing

Notation: A[i,j] means actor A has i pops and j pushes

[Figure: the four threads first cooperatively copy the block's input from global memory into shared memory; during this copy consecutive threads read consecutive words (0, 1, 2, 3, then 4, 5, 6, 7, ...), so the global loads are coalesced.]

First bring the data into shared memory with coalescing
Each filter brings data for other filters
Satisfies coalescing constraints

After the data is in shared memory, each filter accesses its own portion of it.

Improve bandwidth and performance

[Figure: on the output side, results are copied from shared memory back to global memory (shared-to-global transfers).]

Using Shared Memory

Shared memory is 100x faster than global memory

Coalesce all global memory accesses

Number of threads is limited by size of the shared memory.

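As a minimal sketch of the staging scheme described above (illustrative code, not Sponge's actual output): assume a hypothetical A[4,4] actor, 64 actor instances (threads) per block, and float tokens. The threads cooperatively copy the block's input into shared memory with coalesced loads, each thread fetching words that other threads will consume, and only then does each thread read its own four pops:

#define POPS    4
#define PUSHES  4
#define THREADS 64                               // actor instances per block

__global__ void actor_shared(const float *in, float *out) {
    __shared__ float buf[THREADS * POPS];        // also reused for the pushes

    int ibase = blockIdx.x * THREADS * POPS;     // this block's input region
    int obase = blockIdx.x * THREADS * PUSHES;   // this block's output region

    // Coalesced staging: consecutive threads read consecutive global words.
    for (int i = threadIdx.x; i < THREADS * POPS; i += blockDim.x)
        buf[i] = in[ibase + i];
    __syncthreads();

    // Each thread now reads its own pops from fast shared memory.
    float r[PUSHES];
    for (int j = 0; j < PUSHES; ++j)
        r[j] = buf[threadIdx.x * POPS + j] * 2.0f;   // placeholder "work"

    // Write the pushes into shared memory, then store them coalesced.
    for (int j = 0; j < PUSHES; ++j)
        buf[threadIdx.x * PUSHES + j] = r[j];
    __syncthreads();
    for (int i = threadIdx.x; i < THREADS * PUSHES; i += blockDim.x)
        out[obase + i] = buf[i];
}

The number of threads per block is bounded by the shared-memory footprint (THREADS * POPS floats here), which is the limitation the helper-thread optimization below addresses.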

Helper Threads

Shared memory limits the number of threads.

Underutilized processors can fetch data.

All the helper threads are in one warp (no control-flow divergence).

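A minimal sketch of the helper-thread idea (illustrative only; the counts and names below are assumptions, not Sponge's generated code): the block is launched with one extra warp that only moves data between global and shared memory, so the branch separating helpers from workers never diverges within a warp:

#define POPS     4
#define WORKERS  64                    // threads that execute the actor's work
#define HELPERS  32                    // one extra warp that only moves data
#define NTHREADS (WORKERS + HELPERS)   // block size at launch

__global__ void actor_with_helpers(const float *in, float *out) {
    __shared__ float buf[WORKERS * POPS];
    int base = blockIdx.x * WORKERS * POPS;

    if (threadIdx.x >= WORKERS) {
        // Helper threads: they fill one whole warp, so this branch does not
        // diverge within a warp; they stage the block's input cooperatively.
        for (int i = threadIdx.x - WORKERS; i < WORKERS * POPS; i += HELPERS)
            buf[i] = in[base + i];
    }
    __syncthreads();

    if (threadIdx.x < WORKERS) {
        // Worker threads: run the actor's work on data already in shared memory.
        float acc = 0.0f;
        for (int j = 0; j < POPS; ++j)
            acc += buf[threadIdx.x * POPS + j];       // placeholder "work"
        out[blockIdx.x * WORKERS + threadIdx.x] = acc;
    }
}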

Data Prefetch

Better register utilization

Data for iteration i+1 is moved to registers

Data for iteration i is moved from register to shared memory

Allows the GPU to overlap instructions
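A minimal sketch of this prefetching pattern (illustrative; the names, the block size, and the one-word-per-thread-per-iteration layout are assumptions): the value for iteration i+1 is loaded into a register before the work for iteration i runs, so the load can overlap with computation:

__global__ void actor_prefetch(const float *in, float *out, int iters) {
    __shared__ float buf[256];                  // assumes blockDim.x <= 256
    int stride = gridDim.x * blockDim.x;        // elements consumed per iteration
    int lane   = blockIdx.x * blockDim.x + threadIdx.x;

    float next = in[lane];                      // prefetch data for iteration 0

    for (int i = 0; i < iters; ++i) {
        buf[threadIdx.x] = next;                // register -> shared memory
        __syncthreads();

        if (i + 1 < iters)                      // start the load for iteration i+1
            next = in[(i + 1) * stride + lane]; // before doing this iteration's work

        // Placeholder "work" for iteration i, reading from shared memory.
        float v = buf[threadIdx.x] + buf[(threadIdx.x + 1) % blockDim.x];
        out[i * stride + lane] = v;

        __syncthreads();                        // buf is overwritten next iteration
    }
}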

Loop Unrolling

Similar to traditional unrolling

Allows the GPU to overlap instructions

Better register utilization

Less loop control overhead

Can also be applied to memory transfer loops
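A minimal sketch of how this applies to the generated transfer loops (illustrative; POPS and the helper name are assumptions): because an actor's pop and push rates are compile-time constants, the loops have constant trip counts and can be fully unrolled:

#define POPS 4

// Copies one actor's pops; with a constant trip count, #pragma unroll lets the
// compiler replace the loop with POPS independent load/store pairs, removing
// loop-control overhead and exposing loads the hardware can overlap.
__device__ void copy_pops(float *dst, const float *src) {
    #pragma unroll
    for (int j = 0; j < POPS; ++j)
        dst[j] = src[j];
}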

Methodology

Set of benchmarks from the StreamIt suite
3 GHz Intel Core 2 Duo CPU with 6 GB RAM
NVIDIA GeForce GTX 285

GTX 285 configuration:
Stream processors: 240
Processor clock: 1476 MHz
Memory configuration: 2 GB DDR3
Memory bandwidth: 159.0 GB/s

Results (baseline: CPU)

[Figure: speedup of Sponge-generated code over the CPU baseline; chart axis reaches 1024.]

Results (baseline: GPU)

[Figure: breakdown of the improvement over the baseline GPU code, with segments labeled 64%, 3%, 16%, and 16%.]

Conclusion

Future systems will be heterogeneous

GPUs are important part of such systems

Programming complexity is a significant challenge

Sponge automatically creates optimized CUDA code for a wide variety of GPU targets

Provide portability by performing an array of optimizations on stream graphs

[Figure: in a heterogeneous system, sequential work runs on traditional processors while parallelizable work runs on specialized computing engines.]

Questions?

Spatial Intermediate Representation

StreamIt's main constructs:
Filter: encapsulates computation
Pipeline: expresses pipeline parallelism
Splitjoin: expresses task-level parallelism
Other constructs are not relevant here

Exposes different types of parallelism
Composable, hierarchical
Stateful and stateless filters

[Figure: a stream graph built from filters nested inside pipeline and splitjoin constructs.]

Nonlinear Optimization Space

[Figure: SAD optimization space, 908 configurations (Ryoo, CGO '08).]

Bank Conflict

[Figure: three threads, each running an A[8,8] actor, access shared memory with a stride of 8; their simultaneous accesses hit banks 0, 8, 0, then 1, 9, 1, then 2, 10, 2, so two threads collide on every access.]

data = buffer[BaseAddress + s * ThreadId]

Removing Bank Conflict

[Figure: with the stride padded to 9, the same three threads hit banks 0, 9, 2, then 1, 10, 3, then 2, 11, 4; no two threads share a bank, so there are no conflicts.]

data = buffer[BaseAddress + s * ThreadId]

If GCD(number of banks, s) is 1 there will be no bank conflict; since the number of banks is a power of two, s must be odd.
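A minimal sketch of that rule in CUDA (illustrative; the 8-pop actor, the 64-thread block, and the names are assumptions): padding the per-thread stride from 8 to 9 makes it coprime with the number of banks, so a warp's simultaneous accesses land in distinct banks:

#define POPS 8
#define S    (POPS + 1)                 // padded, odd stride: GCD(#banks, 9) == 1

__global__ void no_bank_conflict(const float *in, float *out) {
    __shared__ float buffer[64 * S];    // 64 threads per block assumed
    int BaseAddress = S * threadIdx.x;  // each thread's padded region

    // data = buffer[BaseAddress + offset]; at step j the warp touches banks
    // (9 * ThreadId + j) mod #banks, all distinct for 16 or 32 banks.
    for (int j = 0; j < POPS; ++j)
        buffer[BaseAddress + j] = in[blockIdx.x * blockDim.x * POPS
                                     + threadIdx.x * POPS + j];

    float acc = 0.0f;
    for (int j = 0; j < POPS; ++j)
        acc += buffer[BaseAddress + j]; // placeholder "work", also conflict-free
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}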

Kernel Templates

Baseline template using shared memory (pops are staged from global to shared memory, the work function runs, and pushes are copied back):

Begin Kernel:
  For number of iterations
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel

Template without shared memory (each thread's work function reads and writes global memory directly):

Begin Kernel:
  For number of iterations
    Work
End Kernel

Template with helper threads (helper threads perform the global/shared transfers while worker threads run the work function):

Begin Kernel:
  For number of iterations
    If helper threads
      For number of pops
        Shared <- Global
    syncthreads
    If worker threads
      Work
    syncthreads
    If helper threads
      For number of pushes
        Global <- Shared
End Kernel

Template with software prefetching (the pops for the next iteration are loaded into registers while the current iteration executes):

Begin Kernel:
  For number of pops
    Regs <- Global
  For number of iterations
    For number of pops
      Shared <- Regs
    syncthreads
    If not the last iteration
      For number of pops
        Regs <- Global
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel

Template with loop unrolling (the loop body is repeated so each trip processes two iterations):

Begin Kernel:
  For number of iterations/2
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel
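To tie the baseline template above back to real code, here is a minimal CUDA rendering of it (a sketch under assumed names and sizes, not Sponge's actual output): a hypothetical A[4,4] actor, 64 threads per block, and an outer loop over iterations, using the same staging pattern shown earlier:

#define POPS    4
#define PUSHES  4
#define THREADS 64

__global__ void actor_kernel(const float *in, float *out, int iters) {
    __shared__ float buf[THREADS * POPS];

    for (int it = 0; it < iters; ++it) {               // "For number of iterations"
        int ibase = (it * gridDim.x + blockIdx.x) * THREADS * POPS;
        int obase = (it * gridDim.x + blockIdx.x) * THREADS * PUSHES;

        for (int i = threadIdx.x; i < THREADS * POPS; i += blockDim.x)
            buf[i] = in[ibase + i];                    // "Shared <- Global" (pops)
        __syncthreads();

        float r[PUSHES];                               // "Work"
        for (int j = 0; j < PUSHES; ++j)
            r[j] = buf[threadIdx.x * POPS + j] * 2.0f; // placeholder computation

        for (int j = 0; j < PUSHES; ++j)               // pushes into shared memory
            buf[threadIdx.x * PUSHES + j] = r[j];
        __syncthreads();
        for (int i = threadIdx.x; i < THREADS * PUSHES; i += blockDim.x)
            out[obase + i] = buf[i];                   // "Global <- Shared" (pushes)
        __syncthreads();                               // buf is reused next iteration
    }
}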