Paraprox: Pattern-Based Approximation for Data Parallel Applications. Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. University of Michigan, Electrical Engineering and Computer Science. March 2014.
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke
University of Michigan
March 2014
Compilers Creating Custom Processors
University of Michigan, Electrical Engineering and Computer Science

#Approximate Computing
100% accuracy is not always necessary
Less work
Better performance
Lower power consumption
There are many domains where approximate output is acceptable
#Data Parallelism is everywhere
Good opportunity for automatic approximation
Financial Modeling, Medical Imaging, Audio Processing, Machine Learning, Physics Simulation, Games, Image Processing, Statistics, Video Processing
Mostly regular applications
Works on large data sets
Exact output is not required for operation

#Approximating KMeans
Approximating alone is not enough; we need a way to control the output quality

#Approximate Computing
Ask the programmer to do it
  Not easy / practical
  Hard to debug
Automatic Approximation
  One solution does not fit all
Paraprox: Pattern-based Approximation
  Pattern-specific approximation methods
  Provide knobs to control the output quality
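A quality knob of this kind can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the metric (one minus the mean relative error), the names `output_quality` and `pick_knob`, and the toy approximation that keeps a limited number of fractional bits are all hypothetical and are not taken from Paraprox.

```python
def output_quality(exact, approx):
    """Hypothetical metric: 1 - mean relative error, floored at 0.

    Assumes every exact value is nonzero.
    """
    errors = [abs(a - e) / abs(e) for e, a in zip(exact, approx)]
    return max(0.0, 1.0 - sum(errors) / len(errors))


def square_exact(data):
    """Exact reference computation (a stand-in for a real kernel)."""
    return [x * x for x in data]


def square_approx(data, frac_bits):
    """Toy approximation: round each input to frac_bits fractional bits first."""
    scale = 1 << frac_bits
    return [(round(x * scale) / scale) ** 2 for x in data]


def pick_knob(exact_fn, approx_fn, knobs, data, target=0.90):
    """Walk knobs from least to most aggressive; keep the last one that meets target."""
    exact = exact_fn(data)
    chosen = None
    for knob in knobs:
        if output_quality(exact, approx_fn(data, knob)) < target:
            break
        chosen = knob
    return chosen


sample = [1.3, 2.7, 4.1, 5.9]
# Knobs are fractional-bit counts, ordered from most precise to least precise.
best = pick_knob(square_exact, square_approx, [8, 4, 2, 1, 0], sample)
print(best)
```

With this sample and a 90% target, the loop stops at 2 fractional bits, because keeping only 1 bit drops the hypothetical quality metric below the target.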
#Common Patterns
Map: Image Processing, Finance, ...
Reduction: Machine Learning, Physics, ...
Scatter/Gather: Statistics, ...
Partitioning: Signal Processing, Physics, ...
Stencil: Image Processing, Physics, ...
Scan: Machine Learning, Search, ...
M. McCool et al. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 2012.

#Paraprox
[Figure: Paraprox overview. Labels: Parallel Program (OpenCL/CUDA), Pattern Detection, Approximation Methods, Approximate Kernels, Tuning Parameters, Runtime System]

#Approximate Memoization
[Figure: BlackScholes example]

#Approximate Memoization
Identify candidate functions
Fill the table
Find the table size
Determine q_i for each input
Execution
Check the quality

#Candidate Functions
Pure functions do not:
  read or write any global or static mutable state.
  call an impure function.
  perform I/O.
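A pure function in the sense above depends only on its arguments, so its results can be precomputed and looked up. The following is a minimal sketch of that idea, not Paraprox's generated code: the function `candidate`, the input ranges, the bit widths, and all helper names are hypothetical choices made for illustration.

```python
import math


def candidate(a, b, c):
    """Hypothetical pure function: no global state, no I/O."""
    return a * math.exp(-b) + math.sin(c)


RANGES = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]  # assumed input ranges
BITS = (5, 6, 4)                               # assumed bits per input, 15 in total


def quantize(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer level in [0, 2**bits - 1]."""
    levels = 1 << bits
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)


def build_table(fn, ranges, bits):
    """Evaluate fn at the center of every quantization cell.

    The table address is the concatenation of the per-input levels,
    with the first input in the most significant bits.
    """
    total = sum(bits)
    table = [0.0] * (1 << total)
    for address in range(len(table)):
        args, shift = [], total
        for (lo, hi), b in zip(ranges, bits):
            shift -= b
            level = (address >> shift) & ((1 << b) - 1)
            args.append(lo + (level + 0.5) * (hi - lo) / (1 << b))
        table[address] = fn(*args)
    return table


def lookup(table, ranges, bits, *xs):
    """Replace a call to the function with a single table read."""
    address = 0
    for x, (lo, hi), b in zip(xs, ranges, bits):
        address = (address << b) | quantize(x, lo, hi, b)
    return table[address]


table = build_table(candidate, RANGES, BITS)
approx = lookup(table, RANGES, BITS, 0.3, 0.7, 0.2)
exact = candidate(0.3, 0.7, 0.2)
print(len(table), abs(approx - exact))
```

Giving an input fewer bits shrinks the table but widens its quantization cells, which is the trade-off between table size and output quality that the lookup exposes.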
In CUDA/OpenCL:
  No global/shared memory access
  No thread ID dependent computation

#Table Size
[Figure: speedup and quality for table sizes of 16K, 32K, and 64K]

#How Many Bits per Input?
Table size = 32KB, which gives a 15-bit address
Inputs that do not need high precision get fewer bits.
[Figure: candidate splits of the 15 address bits among inputs A, B, and C, with output quality values of 95.2%, 91.2%, 96.5%, 95.4%, 91.3%, 95.8%, 95.4%, and 95.1%]

Input   Bits   Quantization Levels
A       5      32
B       6      64
C       4      16

#Tile Approximation
Difference with neighbors
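Tile approximation relies on neighboring elements having similar values, so a value that has already been loaded can stand in for its neighbors. The sketch below illustrates this on a 3x3 mean filter. It assumes clamp-to-edge borders, and the scheme names `exact`, `row`, and `center` are hypothetical labels rather than Paraprox terminology.

```python
def load(img, i, j):
    """Clamp-to-edge load, standing in for a global/texture memory access."""
    i = min(max(i, 0), len(img) - 1)
    j = min(max(j, 0), len(img[0]) - 1)
    return img[i][j]


def tile_values(img, i, j, scheme):
    """Return the nine stencil inputs, using fewer real loads for cheaper schemes."""
    c = load(img, i, j)
    if scheme == "center":      # 1 access: every neighbor reuses C
        return [c] * 9
    w, e = load(img, i, j - 1), load(img, i, j + 1)
    if scheme == "row":         # 3 accesses: rows above and below reuse W, C, E
        return [w, c, e] * 3
    # "exact": all 9 accesses
    return [load(img, i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def mean3x3(img, scheme="exact"):
    """3x3 mean filter whose accuracy depends on the number of accesses per tile."""
    return [[sum(tile_values(img, i, j, scheme)) / 9
             for j in range(len(img[0]))]
            for i in range(len(img))]


# A smooth ramp: neighbors differ little, so the approximation is harmless here.
ramp = [[float(i + j) for j in range(5)] for i in range(5)]
# A single spike: neighbors differ a lot, so the approximation is poor.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 9.0

for scheme in ("exact", "row", "center"):
    print(scheme, mean3x3(ramp, scheme)[2][2], mean3x3(spike, scheme)[2][2])
```

On the ramp all three schemes agree at the interior point, while on the spike the cheaper schemes overestimate the mean, which shows why the difference between neighbors determines how many accesses can safely be dropped.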
#Stencil/Partitioning
C = Input[i][j]
W = Input[i][j-1]
E = Input[i][j+1]
NW = Input[i-1][j-1]
N = Input[i-1][j]
NE = Input[i-1][j+1]
SW = Input[i+1][j-1]
S = Input[i+1][j]
SE = Input[i+1][j+1]
Paraprox looks for global/texture/shared load accesses to the arrays with affine addresses
Control the output quality by changing the number of accesses per tile
[Figure: 3x3 tile with rows NW, N, NE / W, C, E / SW, S, SE. Successive builds replace SW, S, SE with W, C, E, then NW, N, NE with W, C, E, and finally every neighbor with C]

#Scan/Prefix Sum

#Data Parallel Scan
Phase I: scan each subarray (four subarrays of 1 1 1 1 each become 1 2 3 4)
Phase II: scan the subarray sums (4 4 4 4 becomes 4 8 12 16)
Phase III: add the scanned sums to the following subarrays (result: 1 2 3 ... 16)

#Scan Approximation
[Figure: plot over output elements 0 to N]
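The three phases of the data parallel scan can be traced with a short sequential sketch. This covers only the exact scan, not the approximation, and it assumes an inclusive prefix sum over equally sized subarrays; the function name `three_phase_scan` is hypothetical.

```python
from itertools import accumulate


def three_phase_scan(data, num_subarrays):
    """Inclusive prefix sum computed the way a data parallel scan splits the work.

    Assumes len(data) is a nonzero multiple of num_subarrays.
    """
    size = len(data) // num_subarrays
    chunks = [data[k * size:(k + 1) * size] for k in range(num_subarrays)]

    # Phase I: scan every subarray independently (parallel across subarrays).
    partial = [list(accumulate(chunk)) for chunk in chunks]

    # Phase II: scan the per-subarray totals.
    sums = list(accumulate(p[-1] for p in partial))

    # Phase III: add the running total of all earlier subarrays to each element.
    result = list(partial[0])
    for k in range(1, num_subarrays):
        offset = sums[k - 1]
        result.extend(x + offset for x in partial[k])
    return result


ones = [1] * 16
print(three_phase_scan(ones, 4))
```

For sixteen ones split into four subarrays, Phase I yields 1 2 3 4 in each subarray, Phase II turns the sums 4 4 4 4 into 4 8 12 16, and Phase III produces 1 through 16.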
#Evaluation

#Experimental Setup
Clang 3.3
GPU: NVIDIA GTX 560
CPU: Intel Core i7
Benchmarks: NVIDIA SDK, Rodinia, ...
[Figure: compilation flow. Labels: Driver, AST Visitor, Pattern Detection, Action Generator, Rewrite, CUDA, Approximate Kernels]

#Runtime System
[Figure: runtime tuning plot. Labels: Quality, Target Quality, Speedup, Quality Checking]
SAGE [MICRO 2013]
Green [PLDI 2010]

#Speedups for Both CPU and GPU
Target = 90%
[Figure: bar chart of speedup per benchmark for CPU and GPU, with the geometric mean; one bar is labeled 7.9]

#One Solution Does Not Fit All!
[Figure: Paraprox compared with Loop Perforation]

#We Have Control on Output Quality

#Distribution of Errors

#Conclusion
Manual approximation is not easy/practical.
We need tools for approximation
One approximation method does not fit all applications.
Using pattern-based optimization, we achieved a 2.6x speedup while maintaining 90% of the output quality.
[Figure: BlackScholes data-flow labels: Div, S, X, T, R, V, CallResult, PutResult, float2]