PolyMage: Automatic Optimization for Image Processing Pipelines
Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula
CSA, Indian Institute of Science
June 27, 2016
Table of Contents
1 Image Processing Pipelines
2 Language
3 Compiler
4 Related Work
5 Performance Evaluation
Image Processing Pipelines - Data
Cameras and the Internet
• Instagram: 60 million photos per day. http://instagram.com/press/
• YouTube: 100 hours of video uploaded every minute. https://www.youtube.com/yt/press/statistics.html
Astronomy
• Large Synoptic Survey Telescope (LSST): generates 30 TB of image data every night. http://lsst.org/lsst/google
Medical Imaging
• Human Connectome Project: fMRI data for 68 subjects, 1.873 TB. http://www.humanconnectome.org/
Image Processing Pipelines - Computation
Synthesis, Enhancement and Analysis of Images
Applications
Computational Photography
Computer Vision
Medical Imaging
Image Processing Pipelines - Challenges
Need for Speed
• Real-time processing
• High resolution
• Complex algorithms
Modern Architectures
• Deep memory hierarchies
• Parallelism
• Heterogeneity
Libraries
• OpenCV, CImg, MATLAB
• Limited optimization
• Architecture support
Hand Optimization
• Requires expertise
• Tedious and error prone
• Not portable
Domain Specific Languages
Productivity, Performance and Portability
• Decouple algorithms from schedules
• Support common patterns in the domain
• High-performance compilation
Image Processing Pipelines - Computation Patterns
Point-wise
f(x, y) = g(x, y)

Stencil
f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(x + σx, y + σy)
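The point-wise and stencil patterns can be sketched in plain Python/NumPy (not PolyMage syntax; a minimal illustration on interior points only):

```python
import numpy as np

def pointwise(g):
    # f(x, y) = g(x, y): each output pixel depends only on the same input pixel
    return g.copy()

def box_stencil(g):
    # f(x, y) = sum over sigma_x, sigma_y in [-1, +1] of g(x + sigma_x, y + sigma_y)
    # i.e. a 3x3 neighbourhood sum, computed for interior points only.
    f = np.zeros_like(g)
    for sx in (-1, 0, 1):
        for sy in (-1, 0, 1):
            f[1:-1, 1:-1] += g[1 + sx: g.shape[0] - 1 + sx,
                               1 + sy: g.shape[1] - 1 + sy]
    return f

img = np.ones((4, 4))
print(box_stencil(img)[1, 1])  # 9.0: sum of a 3x3 neighbourhood of ones
```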
Image Processing Pipelines - Computation Patterns
Downsample
f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(2x + σx, 2y + σy)

Upsample
f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2)
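The distinctive feature of these two patterns is the scaling of the access index: a coarse sample reads a window around 2x, while a fine sample reads a window of halved indices. A 1-D NumPy sketch (our own simplification, borders skipped):

```python
import numpy as np

def downsample(g):
    # f(x) = sum_{s=-1}^{+1} g(2x + s): output has half the samples
    n = g.shape[0] // 2
    f = np.zeros(n)
    for x in range(1, n - 1):                  # skip borders for simplicity
        f[x] = g[2*x - 1] + g[2*x] + g[2*x + 1]
    return f

def upsample(g):
    # f(x) = sum_{s=-1}^{+1} g((x + s) // 2): output has twice the samples
    n = g.shape[0] * 2
    f = np.zeros(n)
    for x in range(1, n - 2):
        f[x] = g[(x - 1) // 2] + g[x // 2] + g[(x + 1) // 2]
    return f

d = downsample(np.arange(8.0))   # interior entry x sums 3 samples around 2x
u = upsample(np.arange(4.0))
```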
Image Processing Pipelines - Computation Patterns
Histogram
f(g(x)) += 1

Time-iterated
f(t, x, y) = g(f(t − 1, x, y))
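The histogram pattern is a reduction: the image value selects the bin to increment, so the write index is data-dependent. A minimal sketch:

```python
import numpy as np

def histogram(I, bins=256):
    # hist(I(x, y)) += 1: reduce over the image domain into the bin domain
    hist = np.zeros(bins, dtype=int)
    for v in I.ravel():
        hist[v] += 1
    return hist

I = np.array([[0, 1], [1, 255]], dtype=np.uint8)
h = histogram(I)
print(h[0], h[1], h[255])  # 1 2 1
```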
PolyMage Framework
1. DSL spec: build stage graph, static bounds check, inlining
2. Polyhedral representation, default schedule
3. Alignment, scaling, grouping
4. Schedule transformation, storage optimization
5. Code generation
Language Constructs
Parameter
Variable
Image
Interval
Function
Accumulator
Stencil
Condition
Select
Case
Accumulate
N = Parameter(Int)
x = Variable()
I = Image(Float, [N])

c1 = Condition(x, '>=', 1) & Condition(x, '
Language Constructs
R, C = Parameter(Int), Parameter(Int)
I = Image(UChar, [R, C])
x, y = Variable(), Variable()

row, col = Interval(0, R, 1), Interval(0, C, 1)
bins = Interval(0, 255, 1)
hist = Accumulator(redDom = ([x, y], [row, col]),
                   varDom = ([x], bins), Int)
hist.defn = Accumulate(hist(I(x, y)), 1, Sum)

hist : [0..255] → Z
hist(p) = |{(x, y) : I(x, y) = p}|
Unsharp Mask
R, C = Parameter(Int), Parameter(Int)
thresh, w = Parameter(Float), Parameter(Float)
x, y, c = Variable(), Variable(), Variable()
I = Image(Float, [3, R+4, C+4])

cr = Interval(0, 2, 1)
xr, xc = Interval(2, R+1, 1), Interval(0, C+3, 1)
yr, yc = Interval(2, R+1, 1), Interval(2, C+1, 1)

blurx = Function(varDom = ([c, x, y], [cr, xr, xc]), Float)
blurx.defn = [ Stencil(I(c, x, y), 1.0/16,
                       [[1, 4, 6, 4, 1]]) ]

blury = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
blury.defn = [ Stencil(blurx(c, x, y), 1.0/16,
                       [[1], [4], [6], [4], [1]]) ]

sharpen = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
sharpen.defn = [ I(c, x, y) * (1 + w) - blury(c, x, y) * w ]

masked = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
diff = Abs(I(c, x, y) - blury(c, x, y))
cond = Condition(diff, '
Harris Corner Detection
R, C = Parameter(Int), Parameter(Int)
I = Image(Float, [R+2, C+2])
x, y = Variable(), Variable()
row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)

c = Condition(x, '>=', 1) & Condition(x, '
Pyramid Blending
(Figure: pyramid blending pipeline — repeated downsampling (↓x, ↓y) builds Gaussian pyramids of the two inputs and the mask M, Laplacian levels (L) are combined (×) with the mask and summed, and the result is reconstructed by upsampling (↑x, ↑y) and accumulation (↑+).)
Compiler - Polyhedral Representation
x = Variable()
fin = Image(Float, [18])
f1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
f1.defn = [ fin(x) + 1 ]
f2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
f2.defn = [ f1(x-1) + f1(x+1) ]
fout = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
fout.defn = [ f2(x-1) * f2(x+1) ]

Domains
(Figure: iteration domains of f1, f2 and fout along x.)
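A plain NumPy transcription of this three-stage pipeline (a sketch; fout's operator is taken to be multiplication, consistent with the × used in later examples). Note how each consumer's domain shrinks because it reads its producer at x−1 and x+1:

```python
import numpy as np

fin = np.arange(18.0)                # Image(Float, [18])
f1 = fin + 1                         # f1(x) = fin(x) + 1,          x in [0, 17]
f2 = np.zeros(18)
f2[1:17] = f1[0:16] + f1[2:18]       # f2(x) = f1(x-1) + f1(x+1),   x in [1, 16]
fout = np.zeros(18)
fout[2:16] = f2[1:15] * f2[3:17]     # fout(x) = f2(x-1) * f2(x+1), x in [2, 15]
print(f2[1], fout[2])                # 4.0 32.0
```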
Dependence vectors: f2(x) depends on f1(x − 1) and f1(x + 1); fout(x) depends on f2(x − 1) and f2(x + 1).

Live-outs: only fout is live-out; f1 and f2 are intermediate.
Default schedule
f1(x) → (0, x)
f2(x) → (1, x)
fout(x) → (2, x)

Skewed schedule
f1(x) → (0, x)
f2(x) → (1, x + 1)
fout(x) → (2, x + 2)
Compiler - Scheduling Criteria
Default schedule
(Figure: stage-by-stage execution of f1, f2 and fout under the default schedule.)
The schedule is evaluated on three criteria: parallelism, locality, and storage.
Tiling alternatives, evaluated on the same criteria:
• Parallelogram tiling
• Split tiling
• Overlap tiling (trades redundant computation for parallelism and locality)
Compiler - Alignment and Scaling
Alignment
• f(x, y) = g(0, x, y) + g(1, x, y) + g(2, x, y)
• Default schedules: f(x, y) → (1, x, y, 0); g(0, x, y) → (0, 0, x, y). Dependence vector non-constant: (1, x, y − x, −y)
• Aligned schedules: f(x, y) → (1, 0, x, y); g(0, x, y) → (0, 0, x, y). Dependence vector: (1, 0, 0, 0)

Scaling
• f(x) = g(2x) + g(2x + 1)
• Default schedules: f(x) → (1, x); g(x) → (0, x). Dependence vectors non-constant: (1, −x), (1, −x − 1)
• Scaled schedules: f(x) → (1, 2x); g(x) → (0, x). Dependence vectors: (1, 0), (1, −1)
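The effect of scaling can be checked numerically: the dependence vector is the difference between the schedule point of a consumer instance f(x) and that of the producer instance it reads. A small sketch (our own helper, not compiler code):

```python
def dep_vectors(schedule_f, schedule_g, accesses, xs=range(8)):
    # dependence vector = schedule(consumer f(x)) - schedule(producer g(a(x)))
    vecs = set()
    for x in xs:
        for a in accesses:
            tf = schedule_f(x)
            tg = schedule_g(a(x))
            vecs.add((tf[0] - tg[0], tf[1] - tg[1]))
    return vecs

# f(x) = g(2x) + g(2x+1); default schedules f(x) -> (1, x), g(x) -> (0, x)
default = dep_vectors(lambda x: (1, x), lambda x: (0, x),
                      [lambda x: 2*x, lambda x: 2*x + 1])
# scaled schedule f(x) -> (1, 2x) makes the vectors constant: (1, 0) and (1, -1)
scaled = dep_vectors(lambda x: (1, 2*x), lambda x: (0, x),
                     [lambda x: 2*x, lambda x: 2*x + 1])
print(len(default), sorted(scaled))  # 9 [(1, -1), (1, 0)]
```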
Compiler - Overlapped Tiling
(Figure: an overlapped tile of width τ and height h over x, with left/right bounding faces φl, φr and overlap o, shown on the five-stage pipeline below.)

f(x) = fin(x)
f↓1(x) = f(2x − 1) + f(2x + 1)
f↓2(x) = f↓1(2x − 1) × f↓1(2x + 1)
f↑1(x) = f↓2(x/2) + f↓2(x/2 + 1)
fout(x) = f↑1(x/2)

f(x) → (0, x)
f↓1(x) → (1, 2x)
f↓2(x) → (2, 4x)
f↑1(x) → (3, 2x)
fout(x) → (4, x)

Tile shape
• Conservative vs precise bounding faces
• Significant reduction in redundant computation

Tile constraints
Default schedule: fk(i⃗) → (s⃗k), O = h ∗ (|l| + |r|)
τ ∗ T ≤ φl(s⃗k) ≤ τ ∗ (T + 1) + O − 1  ∧  τ ∗ T ≤ φr(s⃗k) ≤ τ ∗ (T + 1) + O − 1
Tiled schedule: fk(i⃗) → (T, s⃗k)

Scratch pads
• Storage for intermediate values
• Reduction in intermediate storage
• Better locality and reuse
• Privatized for each thread
• Only last level can be live-out
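As a toy illustration of the overlap term O = h ∗ (|l| + |r|) (variable names are ours; real overlaps also depend on the scaling of each stage): the amount of recomputation at a tile face grows with the stage's height h in the group and the stencil extents l, r.

```python
def overlap(height, l, r):
    # O = h * (|l| + |r|): extra columns recomputed at each tile face
    return height * (abs(l) + abs(r))

# Five-stage example with unit stencil extents l = -1, r = 1 at each level:
for h, name in enumerate(["f", "f_down1", "f_down2", "f_up1", "f_out"]):
    print(name, overlap(h, -1, 1))   # overlap grows linearly with height
```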
Compiler - Grouping
(Figure: Harris corner DAG — Iin → Ix, Iy → Ixx, Ixy, Iyy → Sxx, Sxy, Syy → det, trace → harris.)

Fusion criteria
• Constant dependences: alignment, scaling
• Redundant computation vs reuse: overlap, tile sizes, parameter estimates
• Live-out constraints

Fusion heuristic
• Exponential number of valid groupings
• Greedy iterative approach
Compiler - Grouping
(Figure: Harris corner DAG as above, with the current grouping highlighted.)

Algorithm
Input: DAG of stages (S, E); parameter estimates P; tile sizes T; overlap threshold othresh
/* Initially, each stage is in a separate group */
G ← ∅
for s ∈ S do
    G ← G ∪ {s}
repeat
    converge ← true
    cand_set ← getSingleChildGroups(G, E)
    ord_list ← sortGroupsBySize(cand_set, P)
    for each g in ord_list do
        child ← getChildGroup(g, E)
        if hasConstantDependenceVectors(g, child) then
            o_r ← estimateRelativeOverlap(g, child, T)
            if o_r < othresh then
                merge ← g ∪ child
                G ← G − g − child
                G ← G ∪ merge
                converge ← false
                break
until converge = true
return G
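The greedy heuristic can be sketched in Python (simplified: the constant-dependence-vector check is omitted and the overlap estimate is passed in as a function; all names are ours, not the compiler's):

```python
def group_stages(stages, edges, rel_overlap, o_thresh):
    """Greedy grouping: repeatedly merge a group into its unique child
    group while the estimated relative overlap stays below o_thresh."""
    groups = [frozenset([s]) for s in stages]      # each stage starts alone
    def children(g):
        # groups (other than g) containing a consumer of some stage in g
        return {h for h in groups if h is not g
                and any((p, c) in edges for p in g for c in h)}
    while True:
        merged = False
        for g in sorted(groups, key=len):          # smallest groups first
            kids = children(g)
            if len(kids) == 1:                     # single-child groups only
                child = next(iter(kids))
                if rel_overlap(g, child) < o_thresh:
                    groups.remove(g); groups.remove(child)
                    groups.append(g | child)
                    merged = True
                    break
        if not merged:                             # fixed point reached
            return groups

# Toy 3-stage chain a -> b -> c with negligible overlap: all stages fuse
gs = group_stages(["a", "b", "c"], {("a", "b"), ("b", "c")},
                  rel_overlap=lambda g, c: 0.0, o_thresh=0.5)
print(sorted(len(g) for g in gs))  # [3]
```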
Compiler - Grouping
(Figure: grouping applied to the pyramid blending pipeline — the pyramid stages are fused into overlapped-tile groups.)
Compiler - Code Generation
void pipe_harris(int C, int R, float* I, float*& harris)
{
    /* Live out allocation */
    harris = (float*) malloc(sizeof(float) * (2+R) * (2+C));
    #pragma omp parallel for
    for (int Ti = -1; Ti
Auto Tuning
(Figure: execution time on 16 cores vs. execution time on 1 core for the autotuning candidates of the Camera Pipeline and Pyramid Blending benchmarks.)

Tuning
• Tile sizes and overlap threshold determine grouping
• Seven tile sizes for each dimension
• Three threshold values
• Small search space (7² ∗ 3 for 2-D tiling)
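The search space is small enough to scan exhaustively: seven candidate tile sizes per dimension and three overlap thresholds give 7² ∗ 3 = 147 configurations for 2-D tiling. A sketch with hypothetical candidate values:

```python
from itertools import product

# Hypothetical candidate values; the deck specifies seven tile sizes per
# dimension and three overlap thresholds, not these particular numbers.
tile_sizes = [8, 16, 32, 64, 128, 256, 512]
thresholds = [0.2, 0.4, 0.8]

configs = list(product(tile_sizes, tile_sizes, thresholds))  # 2-D tiling
print(len(configs))  # 147 = 7 * 7 * 3
```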
Related work
Polyhedral compilation
• Decoupled view of computation and schedules
• Scheduling for affine loop nests; does not target specific domains
• Overlapped tiling: works for simple time-iterated stencils; different approach to constructing overlapped tiles

Halide
• Domain-specific language and compiler system
• Effective for exploring schedules; requires an explicit schedule specification
Halide
ImageParam input(UInt (16), 2);
Func blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y"), xi("xi"), yi("yi");
// The algorithm
blur_x(x, y) = (input(x, y) + input(x+1, y) + input(x+2, y))/3;
blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2))/3;
// How to schedule it
blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y , y).compute_at(blur_y , yi).vectorize(x, 8);
Halide Blur
Schedule
Experimental Setup
Intel Xeon E5-2680
• Clock: 2.7 GHz
• Cores / socket: 8; total cores: 16
• L1 cache / core: 32 KB; L2 cache / core: 512 KB; L3 cache / socket: 20 MB
• Compiler: Intel C compiler (icc) 14.0.1, flags -O3 -xhost
• Linux kernel: 3.8.0-38 (64-bit)
Evaluation Method
Benchmarks
• Seven representative benchmarks
• Varying structure and complexity

Comparison
• Halide: tuned schedule, matched schedule
• OpenCV: optimized library calls
Multiscale Interpolation
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base and Halide tuned/matched, each with and without vectorization.)
Harris Corner Detection
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base and Halide tuned/matched, each with and without vectorization.)
Camera Pipeline
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base with and without vectorization, Halide tuned with and without vectorization, and FCam.)
Results Summary
Benchmark             Stages  Image size    Lines  PolyMage (1 / 4 / 16 cores)  OpenCV (1 core)  Speedup over H-tuned (16 cores)
Harris Corner         11      6400×6400     43     233.79 / 68.03 / 18.69       810.24           2.59×*
Pyramid Blending      44      2048×2048×3   71     196.99 / 57.84 / 21.91       197.28           4.61×*
Unsharp Mask          4       2048×2048×3   16     165.40 / 44.92 / 14.85       349.57           1.6×*
Local Laplacian       99      2560×1536×3   107    274.50 / 76.60 / 32.35       -                1.54×
Camera Pipeline       32      2528×1920     86     67.87 / 19.95 / 5.86         -                1.04×
Bilateral Grid        7       2560×1536     43     89.76 / 27.30 / 8.47         -                0.89×
Multiscale Interpol.  49      2560×1536×3   41     101.70 / 34.73 / 18.18       -                1.81×

Mean speedup of 1.27× over tuned Halide schedules.
Comparable performance to a highly tuned camera pipeline implementation.
Conclusion
DSL for high-performance image processing
Optimization techniques
• Tiling
• Storage optimization
• Grouping and fusing

Effectiveness
• Up to 1.81× better than tuned schedules
• Matching hand-tuned performance
Acknowledgements
Halide, OpenCV, isl, islpy and cgen
Intel for their hardware
Thank You!
Pyramid Blending
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base and Halide tuned/matched, each with and without vectorization.)
Bilateral Grid
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base and Halide tuned, each with and without vectorization.)
Local Laplacian Filter
(Figure: speedup over PolyMage base on 1 core vs. number of cores (1, 2, 4, 8, 16), for PolyMage opt/base and Halide tuned, each with and without vectorization.)