Paraprox: Pattern-Based Approximation for Data Parallel Applications


Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke

University of Michigan
March 2014

Compilers Creating Custom Processors
Electrical Engineering and Computer Science

Approximate Computing
100% accuracy is not always necessary

Less work
Better performance
Lower power consumption

There are many domains where approximate output is acceptable

Data Parallelism is everywhere

Good opportunity for automatic approximation

Financial Modeling, Medical Imaging, Audio Processing, Machine Learning, Physics Simulation, Games, Image Processing, Statistics, Video Processing

Mostly regular applications
Works on large data sets
Exact output is not required for operation

Approximating KMeans
Approximating alone is not enough; we need a way to control the output quality

Approximate Computing
Ask the programmer to do it
Not easy / practical
Hard to debug

Automatic Approximation
One solution does not fit all

Paraprox: Pattern-based Approximation
Pattern-specific approximation methods
Provide knobs to control the output quality

Common Patterns
Map: Image Processing, Finance, ...
Reduction: Machine Learning, Physics, ...
Scatter/Gather: Statistics, ...
Partitioning: Signal Processing, Physics, ...
Stencil: Image Processing, Physics, ...
Scan: Machine Learning, Search, ...
M. McCool et al. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 2012.

Paraprox
[Diagram labels: Parallel Program (OpenCL/CUDA), Pattern Detection, Approximation Methods, Approximate Kernels, Tuning Parameters, Runtime System]

Approximate Memoization

BlackScholes

Approximate Memoization
Identify candidate functions
Fill the table
Find the table size
Determine q_i for each input
Execution
Check the quality
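The memoization steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Paraprox's implementation: the function `candidate`, the input ranges in `RANGES`, and all helper names are hypothetical, and the bit widths (5, 6, 4) are borrowed from the "How Many Bits per Input?" slide so that the table has a 15-bit address.

```python
import itertools
import math

def quantize(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer level in [0, 2**bits - 1]."""
    levels = 1 << bits
    x = min(max(x, lo), hi)  # clamp out-of-range inputs
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)

def cell_center(level, lo, hi, bits):
    """Representative input value for one quantization level."""
    return lo + (level + 0.5) * (hi - lo) / (1 << bits)

def address(levels, bits):
    """Concatenate the per-input levels into a single table address."""
    addr = 0
    for level, b in zip(levels, bits):
        addr = (addr << b) | level
    return addr

def build_table(f, ranges, bits):
    """Fill the table: evaluate f once per combination of levels."""
    table = [0.0] * (1 << sum(bits))
    for levels in itertools.product(*(range(1 << b) for b in bits)):
        args = [cell_center(l, lo, hi, b)
                for l, (lo, hi), b in zip(levels, ranges, bits)]
        table[address(levels, bits)] = f(*args)
    return table

def lookup(table, ranges, bits, *args):
    """Execution: replace the call to f with a table read."""
    levels = [quantize(x, lo, hi, b)
              for x, (lo, hi), b in zip(args, ranges, bits)]
    return table[address(levels, bits)]

# Hypothetical pure candidate function (not BlackScholes itself).
def candidate(a, b, c):
    return a * math.exp(-b) + math.sqrt(c)

RANGES = [(0.0, 1.0), (0.0, 2.0), (1.0, 4.0)]  # assumed input ranges
BITS = (5, 6, 4)                               # 15 address bits in total
TABLE = build_table(candidate, RANGES, BITS)
```

Comparing `lookup(TABLE, RANGES, BITS, 0.3, 1.1, 2.5)` with `candidate(0.3, 1.1, 2.5)` gives a simple quality check; giving an input more bits shrinks its quantization cell and therefore its contribution to the error, at the cost of a larger table.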

Candidate Functions
Pure functions do not:
read or write any global or static mutable state.
call an impure function.
perform I/O.

In CUDA/OpenCL:
No global/shared memory access
No thread-ID-dependent computation

Table Size
[Figure: speedup and quality for table sizes 16K, 32K, and 64K]

How Many Bits per Input?
Table size = 32KB, 15-bit address
Inputs that do not need high precision get fewer bits.
[Figure: candidate bit allocations for inputs A, B, and C, each labeled with its output quality: 95.2%, 91.2%, 96.5%, 95.4%, 91.3%, 95.8%, 95.4%, 95.1%]

Input   Bits   Quantization Levels
A       5      32
B       6      64
C       4      16

Tile Approximation
Difference with neighbors
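As a rough illustration of tile approximation, the sketch below applies the replacements shown on the Stencil/Partitioning slides to a 3x3 mean filter. The choice of a mean filter and all function names are assumptions made for this example; the slides only show the nine loads and which of them are replaced by the W, C, and E values.

```python
def exact_mean3x3(img, i, j):
    """Exact stencil: nine loads (NW, N, NE, W, C, E, SW, S, SE)."""
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            total += img[i + di][j + dj]
    return total / 9.0

def row_approx_mean3x3(img, i, j):
    """Three loads: the rows above and below reuse W, C, and E."""
    w, c, e = img[i][j - 1], img[i][j], img[i][j + 1]
    return (3 * w + 3 * c + 3 * e) / 9.0

def center_approx_mean3x3(img, i, j):
    """One load: every neighbor is replaced by the center value C."""
    return float(img[i][j])
```

On the tile `[[1, 2, 3], [4, 5, 6], [7, 8, 10]]` the exact result at the center is 46/9, while both approximations return 5.0. Fewer loads per tile means less memory traffic and a larger possible error, which is the quality knob the slides describe.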

Stencil/Partitioning
C = Input[i][j]
W = Input[i][j-1]
E = Input[i][j+1]
NW = Input[i-1][j-1]
N = Input[i-1][j]
NE = Input[i-1][j+1]
SW = Input[i+1][j-1]
S = Input[i+1][j]
SE = Input[i+1][j+1]
Paraprox looks for global/texture/shared load accesses to arrays with affine addresses.
Control the output quality by changing the number of accesses per tile.
[Figure: 3x3 tile with cells NW, N, NE / W, C, E / SW, S, SE. Successive builds replace SW, S, SE with W, C, E; then also NW, N, NE with W, C, E; then every neighbor with C.]

Scan / Prefix Sum

Data Parallel Scan
[Figure: a 16-element input of ones is split into four subarrays.
Phase I: each subarray 1 1 1 1 is scanned to 1 2 3 4.
Phase II: the subarray sums 4 4 4 4 are scanned to 4 8 12 16.
Phase III: three Add operations combine the scanned sums with the subarrays, giving 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16.]

Scan Approximation
[Figure: output elements, indexed 0 to N]

Evaluation

Experimental Setup
Clang 3.3

GPU: NVIDIA GTX 560

CPU: Intel Core i7

Benchmarks: NVIDIA SDK, Rodinia
[Diagram labels: Driver, AST Visitor, Pattern Detection, Rewrite, Action Generator, CUDA, Approximate Kernels]

Runtime System
[Figure labels: Quality, Target Quality, Speedup, Quality Checking; SAGE [MICRO 2013], Green [PLDI 2010]]

Speedups for Both CPU and GPU
[Figure: speedup on CPU and GPU with geometric mean; target = 90%; value label 7.9]

One Solution Does Not Fit All!
[Figure: Paraprox compared with loop perforation]

We Have Control on Output Quality

Distribution of Errors

Conclusion
Manual approximation is not easy/practical.

We need tools for approximation

One approximation method does not fit all applications.

By using pattern-based optimization, we achieved a 2.6x speedup while maintaining 90% of the output quality.
