Paraprox: Pattern-Based Approximation for Data Parallel Applications
Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke
University of Michigan, March 2014
Compilers Creating Custom Processors
University of Michigan Electrical Engineering and Computer Science


Approximate Computing

  • 100% accuracy is not always necessary
  • Less work
      • Better performance
      • Lower power consumption
  • There are many domains where approximate output is acceptable

Data Parallelism Is Everywhere

  • Good opportunity for automatic approximation
  • Domains: Financial Modeling, Medical Imaging, Audio Processing, Machine Learning, Physics Simulation, Games, Image Processing, Statistics, Video Processing
  • Mostly regular applications
  • Works on large data sets
  • Exact output is not required for operation

Approximating KMeans

  • Approximating alone is not enough; we need a way to control the output quality

Approximate Computing

  • Ask the programmer to do it
      • Not easy / practical
      • Hard to debug
  • Automatic approximation
      • One solution does not fit all
  • Paraprox: pattern-based approximation
      • Pattern-specific approximation methods
      • Provide knobs to control the output quality
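To make the idea of a quality knob concrete, here is a minimal sketch of how a knob setting might be chosen. It is not Paraprox's runtime: the `Variant` record, the `pick_variant` helper, and all numbers are hypothetical, and it assumes the speedup and quality of each knob setting have already been measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    knob: int        # hypothetical tuning-parameter value (higher = more aggressive)
    speedup: float   # measured speedup over the exact kernel
    quality: float   # measured output quality in percent (100 = exact)


def pick_variant(variants, target_quality):
    """Return the fastest variant that meets the target, else the exact kernel."""
    acceptable = [v for v in variants if v.quality >= target_quality]
    if not acceptable:
        return Variant(knob=0, speedup=1.0, quality=100.0)  # fall back to exact
    return max(acceptable, key=lambda v: v.speedup)


# Made-up measurements for three knob settings of one approximate kernel.
variants = [
    Variant(knob=1, speedup=1.4, quality=98.0),
    Variant(knob=2, speedup=2.1, quality=93.5),
    Variant(knob=3, speedup=3.0, quality=86.0),
]
best = pick_variant(variants, target_quality=90.0)
print(best)
```

With a 90% target the sketch selects the second setting: the third is faster but falls below the target.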

Common Patterns

  Pattern          Example domains
  Map              Image Processing, Finance, ...
  Reduction        Machine Learning, Physics, ...
  Scatter/Gather   Statistics, ...
  Partitioning     Signal Processing, Physics, ...
  Stencil          Image Processing, Physics, ...
  Scan             Machine Learning, Search, ...

  M. McCool et al. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 2012.

Paraprox

[Diagram: Parallel Program (OpenCL/CUDA), Pattern Detection, Approximation Methods, Approximate Kernels, Tuning Parameters, Runtime System]

Approximate Memoization

[Figure: BlackScholes example]

  • Identify candidate functions
  • Fill the table
  • Find the table size
  • Determine qi for each input
  • Execution
  • Check the quality
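The steps listed above can be sketched as follows. This is a minimal illustration, not Paraprox's generated code. It assumes a Black-Scholes call price as the candidate function, with the rate `R` and volatility `V` held constant; the input ranges and the per-input bit widths in `RANGES` are illustrative choices, and every helper name is hypothetical.

```python
import math
from itertools import product

R, V = 0.02, 0.30  # assumed constants: risk-free rate and volatility


def call_price(s, x, t):
    """Exact Black-Scholes call price: a pure function of its inputs."""
    def cnd(d):  # cumulative normal distribution
        return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    d1 = (math.log(s / x) + (R + 0.5 * V * V) * t) / (V * math.sqrt(t))
    d2 = d1 - V * math.sqrt(t)
    return s * cnd(d1) - x * math.exp(-R * t) * cnd(d2)


# (min, max, bits) per input. 5 + 6 + 4 = 15 address bits, so 2**15 entries.
RANGES = [(5.0, 30.0, 5), (1.0, 100.0, 6), (0.25, 10.0, 4)]


def quantize(value, lo, hi, bits):
    """Map a value to one of 2**bits quantization levels, clamping at the ends."""
    levels = 1 << bits
    idx = int((value - lo) / (hi - lo) * levels)
    return min(max(idx, 0), levels - 1)


def level_center(idx, lo, hi, bits):
    """Representative input value for a quantization level."""
    return lo + (idx + 0.5) * (hi - lo) / (1 << bits)


def address(indices):
    """Concatenate the per-input level indices into one table address."""
    addr = 0
    for idx, (_, _, bits) in zip(indices, RANGES):
        addr = (addr << bits) | idx
    return addr


def build_table(fn):
    """Fill the table by evaluating fn once per combination of levels."""
    table = [0.0] * (1 << sum(bits for _, _, bits in RANGES))
    for indices in product(*(range(1 << bits) for _, _, bits in RANGES)):
        args = [level_center(i, lo, hi, b) for i, (lo, hi, b) in zip(indices, RANGES)]
        table[address(indices)] = fn(*args)
    return table


def lookup(table, *args):
    """Approximate fn(*args) with a single table read."""
    indices = [quantize(a, lo, hi, b) for a, (lo, hi, b) in zip(args, RANGES)]
    return table[address(indices)]


TABLE = build_table(call_price)
print(len(TABLE), call_price(20.0, 18.0, 2.0), lookup(TABLE, 20.0, 18.0, 2.0))
```

Giving an input more bits shrinks its quantization step and so its contribution to the error, at the cost of a larger table. Checking the quality would amount to comparing `lookup` against `call_price` on sample inputs.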

Candidate Functions

Pure functions do not:
  • read or write any global or static mutable state
  • call an impure function
  • perform I/O

In CUDA/OpenCL:
  • No global/shared memory access
  • No thread ID dependent computation

Table Size

[Figure: speedup and quality for table sizes of 16K, 32K, and 64K]

How Many Bits per Input?

  • Table size = 32KB, 15-bit address
  • Inputs that do not need high precision get fewer bits

  Input   Bits   Quantization Levels
  A       5      32
  B       6      64
  C       4      16

[Figure: candidate bit allocations for the three inputs, each summing to 15 bits (5/5/5, 6/4/5, 4/4/7, 3/7/5), labeled with output quality; the values shown are 95.2%, 91.2%, 96.5%, 95.4%, 91.3%, 95.8%, 95.4%, 95.1%]

Tile Approximation

  • Difference with neighbors
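Tile approximation presumably builds on the observation named in the slide title: elements of a tile usually differ little from their neighbors, so one loaded value can stand in for several. The following is a minimal sketch under that reading, not Paraprox's generated code. It assumes a 3x3 mean filter as the stencil and a smooth synthetic image; `mean3x3`, `run_stencil`, and the `mode` names are hypothetical.

```python
def mean3x3(img, i, j, mode="exact"):
    """3x3 mean stencil at (i, j); `mode` selects how many elements are loaded."""
    if mode == "exact":          # 9 loads
        vals = [img[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    elif mode == "center_row":   # 3 loads: W, C, E stand in for the rows above and below
        vals = [img[i][j + dj] for dj in (-1, 0, 1)] * 3
    elif mode == "center":       # 1 load: C stands in for all eight neighbors
        vals = [img[i][j]] * 9
    else:
        raise ValueError(mode)
    return sum(vals) / 9.0


def run_stencil(img, mode):
    """Apply the stencil to every interior element."""
    h, w = len(img), len(img[0])
    return [[mean3x3(img, i, j, mode) for j in range(1, w - 1)] for i in range(1, h - 1)]


def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Smooth synthetic image (illustrative): values change gradually between neighbors.
img = [[float(i * i + 2 * j * j) for j in range(16)] for i in range(16)]
exact = run_stencil(img, "exact")
err_row = mean_abs_error(exact, run_stencil(img, "center_row"))
err_center = mean_abs_error(exact, run_stencil(img, "center"))
print(err_row, err_center)
```

Fewer loads per tile give a larger error, so the number of accesses acts as the quality knob.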

Stencil/Partitioning

  C  = Input[i][j]
  W  = Input[i][j-1]
  E  = Input[i][j+1]
  NW = Input[i-1][j-1]
  N  = Input[i-1][j]
  NE = Input[i-1][j+1]
  SW = Input[i+1][j-1]
  S  = Input[i+1][j]
  SE = Input[i+1][j+1]

  • Paraprox looks for global/texture/shared load accesses to arrays with affine addresses
  • Control the output quality by changing the number of accesses per tile

  Tile layout:

    NW  N   NE
    W   C   E
    SW  S   SE

  Approximation steps shown on the slides, each reusing an already loaded element:

    1. SW -> W, S -> C, SE -> E
    2. NW -> W, N -> C, NE -> E (in addition to step 1)
    3. All eight neighbors -> C

Scan/Prefix Sum

Data Parallel Scan

  Input:      1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1
  Phase I:    scan each subarray:      1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4
  Phase II:   scan the subarray sums:  4 4 4 4 -> 4 8 12 16
  Phase III:  add the scanned sums to the following subarrays (three adds)
  Output:     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Scan Approximation

[Figure: scan approximation over the output elements, 0 to N]
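The three-phase scan above can be sketched sequentially as follows. `three_phase_scan` follows the phases shown on the slides and assumes a non-empty input. The slide text does not spell out how the scan is approximated, so `approx_scan` is only one plausible reading, labeled here as an assumption: it scans the first `keep` subarrays exactly and predicts the rest by repeating the pattern of the last scanned subarray. Both function names are hypothetical.

```python
from itertools import accumulate


def three_phase_scan(data, block):
    """Inclusive prefix sum computed in three phases over fixed-size subarrays."""
    blocks = [data[k:k + block] for k in range(0, len(data), block)]
    # Phase I: scan each subarray independently.
    local = [list(accumulate(b)) for b in blocks]
    # Phase II: scan the per-subarray sums.
    totals = list(accumulate(l[-1] for l in local))
    # Phase III: add the sum of all preceding subarrays to each local result.
    out = list(local[0])
    for k in range(1, len(local)):
        out.extend(v + totals[k - 1] for v in local[k])
    return out


def approx_scan(data, block, keep):
    """ASSUMPTION, not the documented method: scan only the first `keep`
    subarrays, then extend the output by repeating the last local scan."""
    out = three_phase_scan(data[:keep * block], block)
    base = out[-block - 1] if keep > 1 else 0
    template = [v - base for v in out[-block:]]  # local scan of the last kept subarray
    while len(out) < len(data):
        running = out[-1]
        need = min(block, len(data) - len(out))
        out.extend(running + v for v in template[:need])
    return out


ones = [1] * 16
print(three_phase_scan(ones, 4))
print(approx_scan(ones, 4, keep=2))
```

On the all-ones input from the slides the approximate result matches the exact one, because every subarray looks the same. On less uniform data the skipped part of the output only approximates the true prefix sums.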
Evaluation

Experimental Setup

  • Clang 3.3
  • GPU: NVIDIA GTX 560
  • CPU: Intel Core i7
  • Benchmarks: NVIDIA SDK, Rodinia

[Diagram: Driver, AST Visitor, Pattern Detection, Rewrite, Action Generator, CUDA, Approximate Kernels]

Runtime System

  • Quality checking
  • SAGE [MICRO 2013], Green [PLDI 2010]

[Figure: quality and speedup, with the target quality marked]

Speedups for Both CPU and GPU

[Figure: speedup on CPU and GPU with geometric mean; target = 90%; labeled value 7.9]

One Solution Does Not Fit All!

[Figure: Paraprox compared with loop perforation]

We Have Control on Output Quality

Distribution of Errors

Conclusion

Manual approximation is not easy/practical.

We need tools for approximation

One approximation method does not fit all applications.

By using pattern-based optimization, we achieved a 2.6x speedup while maintaining 90% of the output quality.
