Upload
dangtruc
View
219
Download
2
Embed Size (px)
Citation preview
Approximating to the Last Bit
Thierry Moreau, Adrian Sampson, Luis Ceze {moreau, luisceze}@cs.washington.edu, [email protected]
WAX 2016 co-located with ASPLOS 2016
April 3rd 2016
What this Talk is About
2
How many bits in a program are really that important?
1 - AXE: Quality Tuning Framework
2 - PERFECT Benchmark Study
Precision Tuning
3
More precision means larger memory footprint, more data movement, more energy used in computation
Precision Tuning
4
More precision means larger memory footprint, more data movement, more energy used in computation
doublefloat
Precision Tuning
5
More precision means larger memory footprint, more data movement, more energy used in computation
n
doublefloat
1
AXE Precision Tuning Framework
6
Goal: Maximize Bit-Savings given a Quality Target
AXE Precision Tuning Framework
kernel.c
quality target
AXE framework instruction-level
precision requirements
quality &bit-savings
7
Built on top of ACCEPT, the approximate C/C++ compiler http://accept.rocks
quality
bit-savings
bad OK
AXE Precision Tuning Framework
8
instruction 0instruction 1instruction 2…instruction n-1instruction n
Default (no bit-savings)
AXE Precision Tuning Framework
9
instruction 0instruction 1instruction 2…instruction n-1instruction n
Coarse-Grained Precision Reduction
quality
bit-savings
bad OK
AXE Precision Tuning Framework
10
instruction 0instruction 1instruction 2…instruction n-1instruction n
Fine-Grained Precision Reduction
quality
bit-savings
bad OK
PERFECT Benchmark SuiteApplication Domain Kernels Metric
PERFECT Application 1Discrete Wavelet
Transform
Signal to Noise Ratio(SNR)
[120dB to 10dB] (0.0001% to 31.6% MSE)
2D ConvolutionHistogram Equalization
Space Time Adaptive Processing
Outer ProductSystem SolveInner Product
Synthetic Aperture RadarInterpolation 1Interpolation 2Back Projection
Wide Area Motion Imaging
DebayerImage RegistrationChange Detection
Required Kernels FFT 1DFFT 2D
11
1 - PERFECT Dynamic Instruction Mix
12
load/store 27%
int arith 4%
fp arith 31%
math 1%
int arith 25%
control 11%
Safe to approximatePrecise
1 - PERFECT Dynamic Instruction Mix
Long latency ops are all safe to approximate
13
fp arith 31%
math 1%
Safe to approximatePrecise
1 - PERFECT Dynamic Instruction Mix
14
load/store 27%
Memory ops are mostly safe to approximate
(mostly data vs. pointers)
Safe to approximatePrecise
1 - PERFECT Dynamic Instruction Mix
15
int arith 25%
control 11%
Control and address computation must
remain precise
Safe to approximatePrecise
2 - Bit-Savings over Approximate Instructions
16
Bit-S
avin
gs
0%
20%
40%
60%
80%
100%
Average SNR (dB)10 20 40 60 80 100 120
26%32%
40%48%
57%
74%83%
High QualityApproximate
2 - Bit-Savings over Approximate Instructions
17
Bit-S
avin
gs
0%
20%
40%
60%
80%
100%
Average SNR (dB)10 20 40 60 80 100 120
26%32%
40%48%
57%
74%83%
PERFECT Manual 0.001% MSE
2 - Bit-Savings over Approximate Instructions
18
Bit-S
avin
gs
0%
20%
40%
60%
80%
100%
Average SNR (dB)10 20 40 60 80 100 120
26%32%
40%48%
57%
74%83%
PERFECT Manual 0.001% MSE
Approximate Computing 10% MSE
Future Architectural Challenges
Mechanisms to translate bit-savings into energy savings?
New data types/representations?
ISA extensions?
19
Thank You!
20
Thierry Moreau, Luis Ceze, Adrian Sampson {moreau, luisceze}@cs.washington.edu, [email protected]
WAX 2016 co-located with ASPLOS 2016
April 3rd 2016
Approximating to the Last Bit
Backup Slides
21
Bit Savings
Explore the opportunity for precision reduction in a hardware-agnostic way
22
BitSavings =X
insnstatic
(precisionref
� precision
approx
)
precision
ref
⇥ execs
execs
total
Framework Overview
Built on top of ACCEPT, the approximate C/C++ compiler http://accept.rocks
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
23
Program Annotation
void conv2d (pix *in, pix *out, flt *filter){ for (row) { for (col) { flt sum = 0 int dstPos = … for (row_offset) { for (col_offset) { int srcPos = … int fltPos = … sum += in[srcPos] * filter[fltPos] } } out[dstPos] = sum / normFactor } }}
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
24
Program Annotation
void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter){ for (row) { for (col) { APPROX flt sum = 0 int dstPos = … for (row_offset) { for (col_offset) { int srcPos = … int fltPos = … sum += in[srcPos] * filter[fltPos] } } out[dstPos] = sum / normFactor } }}
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Key: use the APPROXtype qualifier
25
Program Annotation
Takeways: Annotating data is intuitive (~10 mins to annotate a kernel) Variables used to index arrays cannot be safely approximated
typedef float flttypedef int pix
typedef APPROX float flttypedef APPROX int pix
tips on annotating programs faster
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
26
Static Analysis
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Instruction-Level Precision Configuration (ILPC)
conv2d:13:7:load:Int32 conv2d:13:10:load:Float conv2d:13:11:fmul:Float conv2d:13:12:fadd:Float conv2d:15:1:fdiv:Float conv2d:15:7:store:Int32
ACCEPT identified safe-to-approximate instructions from data annotations using flow analysis
void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter){ for (row) { for (col) { APPROX flt sum = 0 int dstPos = … for (row_offset) { for (col_offset) { int srcPos = … int fltPos = … sum += in[srcPos] * filter[fltPos] } } out[dstPos] = sum / normFactor } }}
ACCEPT
27
Approximate Binary
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Error Injection
Instruction-Level Precision Configuration (ILPC)
conv2d:13:7:load:Int32 conv2d:13:10:load:Float conv2d:13:11:fmul:Float conv2d:13:12:fadd:Float conv2d:15:1:fdiv:Float conv2d:15:7:store:Int32
Each instruction in the ILCP acts as a quality knob that the autotuner can use to maximize bit-savings
ACCEPT
28
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Quality Assessment
The programmer provides a quality assessment script to evaluate quality on the program output
Reference Binary
Approximate Binary
eval.py
10dB SNR
29
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Autotuner
config k: error = 0.10%
config [k+1, i-1]: error = 5.91%
config [k+1, i]: error = 0.30%
config [k+1, i+1]: error = 0.12%
config [k+2, i-1]: error = 5.91%
config [k+2, i]: error = 0.33%
config [k+2, i+1]: error = 1.6%
…
…
…
…
…
…
Greedy iterative algorithm: reduces precision requirement of the instruction that impacts quality the least
Finds solution in O(m2n) worst case where m is the number of static safe-to-approximate instructions and n are the levels of precision for all instructions
30
AnnotatedProgram
Program Inputs &Quality Metrics
ACCEPTstatic analysis ILPC*
ACCEPTerror injection & instrumentation
ApproximateBinary
Execution & Quality
Assessment
Quality AutotunerOutput Configuration
Quality Results& Bit Savings
* Instruction-level Precision Configuration
Autotuner
precise60dB40dB
20dB
10dB The autotuner greedily maximizes bit-savings as the quality target is lowered
31
Precision “Guarantees”
Currently empirically derived and input dependent
Future work would extend on the current infrastructure to assimilate data dependence
information in order to derive formal error guarantees
1 - PERFECT Dynamic Instruction Mix
0%
25%
50%
75%
100%
2d-convdwthist-eqoutersysteminnerinterp1interp2bp debayerlucas-kanadechange-detfft-1dfft-2dAVERAGE
prec_controlprec_int_arithprec_memappr_mathappr_fp_arithappr_int_arithappr_mem
33
PERFECT Benchmark SuiteApplication Domain Kernels Metric
PERFECT Application 1Discrete Wavelet
Transform
SNR [120dB to 10dB]
2D ConvolutionHistogram Equalization
Space Time Adaptive Processing
Outer ProductSystem SolveInner Product
Synthetic Aperture RadarInterpolation 1Interpolation 2Back Projection
Wide Area Motion Imaging
DebayerImage RegistrationChange Detection
Required Kernels FFT 1DFFT 2D
10 log10
PNk=1 |rk|2PN
k=1 |rk � ak|2
!
N : number of output elements
rk: reference value of element ktk: approximate value of element k
34
2 - Bit-Savings over Approximate Instructions
35
Aggr
egat
e Bi
t Sav
ings
0%
25%
50%
75%
100%
2dco
nv dwt
histeq ou
ter
system
solve inn
er
interp
1int
erp2 bp
deba
yer
lucas
kana
de
chan
gede
tfft1
dfft2
d
AVERAGE
10 20 40 60 80 100 120
2 - Bit-Savings over Approximate Instructions
You don’t need a lot of bits to obtain an acceptable output!36
Bit-S
avin
gs
0%
20%
40%
60%
80%
100%
Average SNR (dB)10 20 40 60 80 100 120
int arith fp arith mem ops math AGGREGATE
26%32%
40%
48%
57%
74%
83%83%
74%
57%
48%
40%32%
26%
Architectural Target
37
Core Energy Breakdownoverheads
compute
General Purpose CPU
compute
Vector Processor*
Accelerators
specialization
the smaller the overheads, the larger the potential gains
* [Quora, Venkataramani et al., MICRO2013]
Precision ScalingMechanisms for precision scalability:
• Fine-grained ALU power gating*
• Bit-sliced ALU units
• Lossy Compression
38
++
Energy Savings
Bit-Savings
?
* [Quora, Venkataramani et al., MICRO2013]