SAGE: Self-Tuning Approximation for Graphics Engines
Mehrzad Samadi1, Janghaeng Lee1, D. Anoushe Jamshidi1, Amir Hormati2, and Scott Mahlke1
University of Michigan1, Google Inc.2
December 2013
Compilers Creating Custom Processors, University of Michigan, Electrical Engineering and Computer Science
Approximate Computing
• Different domains:
  – Machine Learning
  – Image Processing
  – Video Processing
  – Physical Simulation
  – …
• Doing less work brings higher performance and lower power consumption, at the cost of output quality (100% → 95% → 90% → 85%)
Ubiquitous Graphics Processing Units
• Wide range of devices: supercomputers, servers, desktops, cell phones
• Mostly regular applications
• Works on large data sets
Good opportunity for automatic approximation
SAGE Framework
• Self-Tuning Approximation on Graphics Engines
  – Write the program once
  – Automatic approximation
  – Self-tuning dynamically
• Simplifies or skips processing that is computationally expensive but has the lowest impact on output quality
• Around 10% output error for a 2–3x speedup
Overview
SAGE Framework
Input Program → Static Compiler (applies approximation methods) → Approximate Kernels + Tuning Parameters → Runtime System
Static Compilation
• Inputs: CUDA code, an evaluation metric, and the target output quality (TOQ)
• The static compiler applies three approximation methods: atomic operation optimization, data packing, and thread fusion
• Outputs: approximate kernels plus their tuning parameters
Runtime System
[Figure: execution timeline split across CPU and GPU — preprocessing, then tuning, then execution with periodic calibration at times T, T + C, T + 2C; quality is tracked against the TOQ while speedup accumulates.]
Approximation Methods
Atomic Operations
• Atomic operations update a memory location such that the update appears to happen atomically

// Compute histogram of colors in an image
__global__ void histogram(int n, int* colors, int* bucket) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nThreads) {
        int c = colors[i];
        atomicAdd(&bucket[c], 1);
    }
}
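As a reference for what this kernel computes, here is a minimal sequential C++ equivalent (a host-side sketch; `histogram_host` is our name, not part of SAGE):

```cpp
#include <cassert>
#include <vector>

// Sequential equivalent of the histogram kernel: each color value
// increments its bucket exactly once, which is what the per-thread
// atomicAdd calls accomplish in parallel on the GPU.
std::vector<int> histogram_host(const std::vector<int>& colors, int nBuckets) {
    std::vector<int> bucket(nBuckets, 0);
    for (int c : colors)
        ++bucket[c];   // plays the role of atomicAdd(&bucket[c], 1)
    return bucket;
}
```

On the GPU, the many threads executing atomicAdd on the same bucket are serialized, which is the overhead the following slides quantify.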
Atomic Operations
[Figure: when multiple threads update the same location, their atomic operations are serialized one after another.]
Atomic Add
[Figure: slowdown of atomicAdd vs. conflicts per warp (0–31) — slowdown grows with the number of conflicts, up to roughly 14x.]
Atomic Operation Tuning
[Figure: iterations of the histogram loop laid out across threads T0, T32, T64, T96, T128, T160.]
Atomic Operation Tuning
• SAGE skips one iteration per thread
• To improve performance, it drops the iteration with the maximum number of conflicts
• With two iterations per thread this drops 50% of the iterations; raising the iteration count per thread lowers the drop rate (e.g., to 25%)
Dropping One Iteration
Iteration No.:   0    1    2    3
Conflicts:       2    8   17    12
Conflict detection (CD 0–CD 3) counts the conflicts of each iteration; iteration 2 has the maximum (17 conflicts) and is dropped.
After optimization, only the remaining iterations execute and the max-conflict iteration is skipped.
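The selection step on this slide can be sketched as a small host-side helper; it assumes per-iteration conflict counts have already been gathered (the function name is illustrative, not SAGE's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the number of atomic-operation conflicts detected in each
// iteration, return the index of the iteration to drop: the one with
// the maximum conflicts, since it serializes the most threads.
std::size_t iteration_to_drop(const std::vector<int>& conflicts) {
    std::size_t worst = 0;
    for (std::size_t i = 1; i < conflicts.size(); ++i)
        if (conflicts[i] > conflicts[worst])
            worst = i;
    return worst;
}
```

With the slide's counts {2, 8, 17, 12}, iteration 2 is the one dropped.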
Approximation Methods: Data Packing
Data Packing
[Figure: threads accessing memory — packing several low-precision values into one word reduces the number of memory accesses.]
Quantization
• Preprocessing finds the min and max of the input sets and packs the data
• During execution, each thread unpacks the data and converts each quantization level back to a data value by applying a linear transformation
[Figure: values between Min and Max map onto quantization levels 0–7; the 3-bit code 101 denotes level 5.]
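The pack/unpack pair can be illustrated with a linear 3-bit quantizer matching the eight levels on the slide (a host-side C++ sketch; the function names and the 8-level choice are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Preprocessing side: map a value in [mn, mx] to one of 8 levels (3 bits),
// which can then be packed several-to-a-word before the kernel runs.
std::uint8_t quantize(float v, float mn, float mx) {
    float step = (mx - mn) / 7.0f;              // 8 levels span 7 intervals
    int level = static_cast<int>((v - mn) / step + 0.5f);
    return static_cast<std::uint8_t>(std::min(std::max(level, 0), 7));
}

// Execution side: each thread unpacks a level and applies the inverse
// linear transformation to recover an approximate value.
float dequantize(std::uint8_t level, float mn, float mx) {
    return mn + level * (mx - mn) / 7.0f;
}
```

The round trip loses at most half a quantization step per value, which is the source of this method's output error.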
Approximation Methods: Thread Fusion
Thread Fusion
[Figure: adjacent threads read from memory and produce approximately equal (≈) outputs.]
• Each thread performs computation followed by output writing; because neighboring outputs are similar, threads T0 and T1 can be fused so that one computation produces both outputs
Thread Fusion
• With 256 threads per block (T0–T255 in Block 0 and Block 1), fusing pairs of threads leaves only 128 threads per block (T0–T127)
• Reducing the number of threads per block results in poor resource utilization
Block Fusion
• Fusing blocks as well restores full blocks: Block 0 and Block 1 are fused into a single block of threads T0–T255
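The fusion idea above can be sketched on the host: compute every other output and duplicate it to its neighbor, assuming adjacent outputs are similar (a simplified illustration, not SAGE's generated code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output-duplicating fusion: compute only even-indexed outputs and copy
// each result to its odd-indexed neighbor, halving the computation at
// the cost of some output error wherever neighbors differ.
std::vector<float> fused_map(const std::vector<float>& in, float (*f)(float)) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); i += 2) {
        out[i] = f(in[i]);           // computation (once per fused pair)
        if (i + 1 < in.size())
            out[i + 1] = out[i];     // duplicated output writing
    }
    return out;
}
```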
Runtime
How to Compute Output Quality?
• Run both the accurate version and the approximate version and compare their outputs with the evaluation metric
• This has high overhead, so tuning should find a good enough approximate version as fast as possible
• SAGE therefore uses a greedy algorithm
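The evaluation metric is application-specific; one common choice is quality as 100% minus the mean relative error between the two outputs (an illustrative example, not SAGE's fixed metric):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quality = 100% minus the mean relative error of the approximate
// output with respect to the accurate output.
double output_quality(const std::vector<double>& accurate,
                      const std::vector<double>& approx) {
    double err = 0.0;
    for (std::size_t i = 0; i < accurate.size(); ++i)
        err += std::fabs(approx[i] - accurate[i]) /
               std::max(std::fabs(accurate[i]), 1e-12);
    return 100.0 * (1.0 - err / accurate.size());
}
```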
Tuning
• K(x,y) denotes a kernel whose first optimization uses tuning parameter x and whose second uses y
• Tuning starts from the exact kernel K(0,0): quality = 100%, speedup = 1x; the TOQ is 90%
Tuning
• Expanding K(0,0): K(1,0) reaches 94% quality at 1.15x and K(0,1) reaches 96% at 1.5x; both meet the TOQ of 90%
• Expanding further: K(1,1) reaches 94% at 2.5x and K(0,2) reaches 95% at 1.6x
Tuning
• Expanding K(1,1): K(2,1) reaches 88% at 3x and K(1,2) reaches 87% at 2.8x, both below the TOQ of 90%
• Tuning stops; the final choice on this path is K(1,1) with 94% quality and a 2.5x speedup
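The walk above follows a simple greedy rule: from the current configuration, evaluate the neighbors that bump one tuning parameter and move to the fastest one that still meets the TOQ. A sketch using the slides' numbers (the `measured` map stands in for actually running and evaluating each kernel):

```cpp
#include <cassert>
#include <map>
#include <utility>

struct Result { double quality, speedup; };
using Config = std::pair<int, int>;  // (first opt's parameter, second opt's)

// Greedy tuning: repeatedly move to the neighboring configuration with
// the highest speedup whose quality still meets the TOQ; stop when no
// neighbor qualifies.
Config greedy_tune(const std::map<Config, Result>& measured, double toq) {
    Config cur{0, 0};
    for (;;) {
        Config best = cur;
        double bestSpeed = measured.at(cur).speedup;
        for (Config next : {Config{cur.first + 1, cur.second},
                            Config{cur.first, cur.second + 1}}) {
            auto it = measured.find(next);
            if (it != measured.end() && it->second.quality >= toq &&
                it->second.speedup > bestSpeed) {
                best = next;
                bestSpeed = it->second.speedup;
            }
        }
        if (best == cur)
            return cur;   // no neighbor satisfies the TOQ (or is faster)
        cur = best;
    }
}
```

Fed the qualities and speedups from the slides with TOQ = 90%, this walk ends at K(1,1).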
Evaluation
Experimental Setup
• Implemented as a backend of the Cetus compiler
• GPU: NVIDIA GTX 560 with 2GB GDDR5
• CPU: Intel Core i7
• Benchmarks from image processing and machine learning
K-Means
[Figure: per-invocation output quality (80–100%) and cumulative speedup (0–2x) for K-Means across ~116 invocations, with the TOQ marked.]
Confidence
[Figure: confidence (40–100%) vs. number of calibration points (0–400) for CI = 90%, 95%, and 99%.]
• After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold
• posterior ∝ prior × likelihood, with a Beta posterior arising from a Uniform prior and a Binomial likelihood
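The 93% figure follows from the formula on the slide: with a uniform Beta(1, 1) prior and a Binomial likelihood, n passing samples and no failures give a Beta(n + 1, 1) posterior, whose CDF at x is simply x^(n+1). A one-line check (the helper name is ours):

```cpp
#include <cassert>
#include <cmath>

// P(true pass rate >= threshold | n consecutive passing samples), under
// a uniform prior: the posterior is Beta(n + 1, 1), with CDF(x) = x^(n+1).
double confidence(int n, double threshold) {
    return 1.0 - std::pow(threshold, n + 1);
}
```

confidence(50, 0.95) evaluates to about 0.93, matching the slide.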
Calibration Overhead
[Figure: calibration overhead (0–25%) vs. calibration interval (0–200) for Gaussian and K-Means — overhead shrinks as the interval grows.]
Performance
[Figure: speedup of SAGE vs. loop perforation at TOQ = 95% and TOQ = 90% for K-Means, NB, Histogram, SVM, Fuzzy, MeanShift, Binarization, Dynamic, Mean Filter, and Gaussian, plus the geometric mean; SAGE's speedups grow as the TOQ is relaxed, peaking at 6.4x.]
Conclusion
• Automatic approximation is possible
• SAGE automatically generates approximate kernels with different parameters
• The runtime system uses the tuning parameters to control output quality during execution
• Result: 2.5x speedup with less than 10% quality loss compared to accurate execution
Backup Slides
What Does Calibration Miss?
[Figure: percent difference in error (0–100%) vs. calibration interval (10–100) for the video inputs flower(250), grandma(800), deadline(900), foreman(300), harbour(300), ice(240), akiyo(300), carphone(380), crew(300), and galleon(350).]
SAGE Can Control the Output Quality
[Figure: speedup (1–7x) vs. output quality (90–100%) for Naïve Bayes, Fuzzy Kmeans, and Mean Filter — lower target quality yields higher speedup.]
Distribution of Errors
[Figure: percentage of output elements (0–100%) vs. error (0–100%) for Histogram, Kmeans, Naïve Bayes, Fuzzy Kmeans, SVM, Dynamic, Mean Filter, Gaussian, Binarization, and Meanshift.]