PolyMage: Automatic Optimization for Image Processing Pipelines. Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula. CSA, Indian Institute of Science. June 27, 2016.


  • PolyMage: Automatic Optimization for Image Processing Pipelines

    Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula

    CSA, Indian Institute of Science

    June 27, 2016

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation


  • Image Processing Pipelines - Data

    Cameras and Internet
    Instagram: 60 million photos per day. http://instagram.com/press/
    YouTube: 100 hours of video uploaded every minute. https://www.youtube.com/yt/press/statistics.html

    Astronomy
    Large Synoptic Survey Telescope (LSST): generates 30 TB of image data every night. http://lsst.org/lsst/google

    Medical Imaging
    Human Connectome Project: fMRI data for 68 subjects, 1.873 TB. http://www.humanconnectome.org/

  • Image Processing Pipelines - Computation

    Synthesis, Enhancement and Analysis of Images

    Applications

    Computational Photography

    Computer Vision

    Medical Imaging

  • Image Processing Pipelines - Challenges

    Need for Speed
    • Real-time processing
    • High resolution
    • Complex algorithms

    Modern Architectures
    • Deep memory hierarchies
    • Parallelism
    • Heterogeneity

    Libraries
    • OpenCV, CImg, MATLAB
    • Limited optimization
    • Architecture support

    Hand Optimization
    • Requires expertise
    • Tedious and error prone
    • Not portable


  • Domain Specific Languages

    Productivity, Performance and Portability
    • Decouple algorithms from schedules
    • Support common patterns in the domain
    • High performance compilation

  • Image Processing Pipelines - Computation Patterns

    Point-wise:
    f(x, y) = g(x, y)

    Stencil:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(x + σx, y + σy)

  • Image Processing Pipelines - Computation Patterns

    Downsample:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(2x + σx, 2y + σy)

    Upsample:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2)

  • Image Processing Pipelines - Computation Patterns

    Histogram:
    f(g(x)) += 1

    Time-iterated:
    f(t, x, y) = g(f(t − 1, x, y))
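
    These patterns map directly onto array code. A minimal NumPy sketch of the four spatial patterns (our illustration, not PolyMage code; borders are wrapped or clamped for brevity):

    import numpy as np

    g = np.random.rand(16, 16).astype(np.float32)

    # Point-wise: f(x, y) = g(x, y)
    f_point = g.copy()

    # 3x3 stencil: f(x, y) = sum over sx, sy in [-1, 1] of g(x+sx, y+sy)
    # (np.roll wraps around at the borders, a simplification)
    f_stencil = sum(np.roll(np.roll(g, -sx, axis=0), -sy, axis=1)
                    for sx in (-1, 0, 1) for sy in (-1, 0, 1))

    # Downsample: f(x, y) = sum over sx, sy of g(2x+sx, 2y+sy)
    f_down = sum(np.roll(np.roll(g, -sx, axis=0), -sy, axis=1)[::2, ::2]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))

    # Upsample: f(x, y) = sum over sx, sy of g((x+sx)/2, (y+sy)/2)
    x = np.arange(32)[:, None]
    y = np.arange(32)[None, :]
    f_up = sum(g[np.clip((x + sx) // 2, 0, 15), np.clip((y + sy) // 2, 0, 15)]
               for sx in (-1, 0, 1) for sy in (-1, 0, 1))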

  • PolyMage Framework

    DSL Spec → Build stage graph → Static bounds check → Inlining
    → Polyhedral representation → Default schedule
    → Alignment → Scaling → Grouping
    → Schedule transformation → Storage optimization
    → Code generation

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Language Constructs

    Parameter, Variable, Image, Interval, Function, Accumulator,
    Stencil, Condition, Select, Case, Accumulate

    N = Parameter(Int)
    x = Variable()
    I = Image(Float, [N])

    c1 = Condition(x, '>=', 1) & Condition(x, '<', ...)  # remainder truncated in the source

  • Language Constructs

    R, C = Parameter(Int), Parameter(Int)
    I = Image(UChar, [R, C])
    x, y = Variable(), Variable()
    row, col = Interval(0, R, 1), Interval(0, C, 1)
    bins = Interval(0, 255, 1)
    hist = Accumulator(redDom = ([x, y], [row, col]),
                       varDom = ([x], bins), Int)
    hist.defn = Accumulate(hist(I(x, y)), 1, Sum)

    hist : [0..255] → ℤ
    hist(p) = |{(x, y) : I(x, y) = p}|
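
    For reference, the reduction this Accumulator expresses, written in NumPy (our sketch):

    import numpy as np

    I = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)

    # hist(p) = |{(x, y) : I(x, y) = p}|: accumulate 1 into bin I(x, y) with Sum
    hist = np.zeros(256, dtype=np.int64)
    np.add.at(hist, I.ravel(), 1)

    assert hist.sum() == I.size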

  • Unsharp Mask

    R, C = Parameter(Int), Parameter(Int)
    thresh, w = Parameter(Float), Parameter(Float)
    x, y, c = Variable(), Variable(), Variable()
    I = Image(Float, [3, R+4, C+4])

    cr = Interval(0, 2, 1)
    xr, xc = Interval(2, R+1, 1), Interval(0, C+3, 1)
    yr, yc = Interval(2, R+1, 1), Interval(2, C+1, 1)

    blurx = Function(varDom = ([c, x, y], [cr, xr, xc]), Float)
    blurx.defn = [ Stencil(I(c, x, y), 1.0/16, [[1, 4, 6, 4, 1]]) ]

    blury = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    blury.defn = [ Stencil(blurx(c, x, y), 1.0/16, [[1], [4], [6], [4], [1]]) ]

    sharpen = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    sharpen.defn = [ I(c, x, y) * (1 + w) - blury(c, x, y) * w ]

    masked = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    diff = Abs(I(c, x, y) - blury(c, x, y))
    cond = Condition(diff, '<', ...)  # the comparison and the Select defining masked are truncated in the source
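
    A NumPy reference helps pin down the semantics. This is our sketch, assuming the truncated definition selects the input pixel where the blur difference is below thresh, and computing only the interior where all stencil taps are defined:

    import numpy as np

    def unsharp_mask(I, thresh, w):
        # Separable 5-tap binomial blur (1, 4, 6, 4, 1) / 16 along both spatial axes
        k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
        blurx = sum(k[s + 2] * I[:, 2 + s : I.shape[1] - 2 + s, :] for s in range(-2, 3))
        blury = sum(k[s + 2] * blurx[:, :, 2 + s : blurx.shape[2] - 2 + s] for s in range(-2, 3))
        Ic = I[:, 2:-2, 2:-2]                      # interior of the input
        sharpen = Ic * (1 + w) - blury * w
        diff = np.abs(Ic - blury)
        return np.where(diff < thresh, Ic, sharpen)

    out = unsharp_mask(np.random.rand(3, 68, 68).astype(np.float32), 0.01, 3.0)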

  • Harris Corner Detection

    R, C = Parameter(Int), Parameter(Int)
    I = Image(Float, [R+2, C+2])
    x, y = Variable(), Variable()
    row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)

    c  = Condition(x, '>=', 1) & Condition(x, '<', ...) & \
         Condition(y, '>=', 1) & Condition(y, '<', ...)
    c2 = Condition(x, '>=', 2) & Condition(x, '<', ...) & \
         Condition(y, '>=', 2) & Condition(y, '<', ...)
    # the upper bounds and the stage definitions are truncated in the source

  • Pyramid Blending

    [Figure: dataflow graph of the pyramid blending pipeline, built from downsample stages (↓x, ↓y), upsample stages (↑x, ↑y), Laplacian levels (L), a mask (M) and blending nodes (X, +).]

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Compiler - Polyhedral Representation

    x = Variable()
    fin = Image(Float, [18])
    f1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
    f1.defn = [ fin(x) + 1 ]
    f2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
    f2.defn = [ f1(x-1) + f1(x+1) ]
    fout = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
    fout.defn = [ f2(x-1) * f2(x+1) ]

    From this specification the compiler extracts:
    Domains: f1 over [0, 17], f2 over [1, 16], fout over [2, 15]
    Dependence vectors: f2(x) reads f1(x − 1) and f1(x + 1); fout(x) reads f2(x − 1) and f2(x + 1)
    Live-outs: fout

    Default schedule:
    f1(x) → (0, x)
    f2(x) → (1, x)
    fout(x) → (2, x)

    Skewed schedule:
    f1(x) → (0, x)
    f2(x) → (1, x + 1)
    fout(x) → (2, x + 2)
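
    These domains and schedules are ordinary integer sets and maps. A sketch of the same objects in islpy, the isl binding the deck acknowledges using (the code and names are ours):

    import islpy as isl

    # Stage domains
    dom = isl.UnionSet("{ f1[x] : 0 <= x <= 17; "
                       "  f2[x] : 1 <= x <= 16; "
                       "  fout[x] : 2 <= x <= 15 }")

    # Default schedule: stage k runs at logical time (k, x)
    default = isl.UnionMap("{ f1[x] -> [0, x]; f2[x] -> [1, x]; fout[x] -> [2, x] }")

    # Skewed schedule: consumers shifted so the dependences
    # f2(x) <- f1(x-1), f1(x+1) and fout(x) <- f2(x-1), f2(x+1)
    # have non-negative distance along the second dimension
    skewed = isl.UnionMap("{ f1[x] -> [0, x]; f2[x] -> [1, x + 1]; fout[x] -> [2, x + 2] }")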

  • Compiler - Scheduling Criteria

    [Figure: iteration spaces of f1, f2 and fout under the default schedule]

    Criteria: Parallelism, Locality, Storage


  • Compiler - Scheduling Criteria

    [Figure: parallelogram tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage

  • Compiler - Scheduling Criteria

    [Figure: split tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage

  • Compiler - Scheduling Criteria

    [Figure: overlapped tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage
    Cost: Redundant computation

  • Compiler - Alignment and Scaling

    Alignment
    • f(x, y) = g(0, x, y) + g(1, x, y) + g(2, x, y)
    • Default schedules:
      f(x, y) → (1, x, y, 0)
      g(0, x, y) → (0, 0, x, y)
      Dependence vector non-constant: (1, x, y − x, −y)
    • Aligned schedules:
      f(x, y) → (1, 0, x, y)
      g(0, x, y) → (0, 0, x, y)
      Dependence vector: (1, 0, 0, 0)

  • Compiler - Alignment and Scaling

    Scaling
    • f(x) = g(2x) + g(2x + 1)
    • Default schedules:
      f(x) → (1, x)
      g(x) → (0, x)
      Dependence vectors non-constant: (1, −x), (1, −x − 1)
    • Scaled schedules:
      f(x) → (1, 2x)
      g(x) → (0, x)
      Dependence vectors: (1, 0), (1, −1)
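
    A quick check of the scaled schedules (our arithmetic, not on the slide): the access g(2x) gives the dependence vector (1, 2x) − (0, 2x) = (1, 0), and g(2x + 1) gives (1, 2x) − (0, 2x + 1) = (1, −1). Both are constant, which is what later allows the two stages to be grouped and tiled together.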

  • Compiler - Overlapped Tiling

    [Figure: overlapped tiles across the stages f, f↓1, f↓2, f↑1 and fout;
    h is the tile height, τ the tile width, o the overlap, and φl, φr the
    left and right bounding faces.]

    f(x) = fin(x)
    f↓1(x) = f(2x − 1) + f(2x + 1)
    f↓2(x) = f↓1(2x − 1) × f↓1(2x + 1)
    f↑1(x) = f↓2(x/2) + f↓2(x/2 + 1)
    fout(x) = f↑1(x/2)

    Schedules (after alignment and scaling):
    f(x) → (0, x)
    f↓1(x) → (1, 2x)
    f↓2(x) → (2, 4x)
    f↑1(x) → (3, 2x)
    fout(x) → (4, x)

    Tile shape
    • Conservative vs precise bounding faces
    • Significant reduction in redundant computation

    Tile constraints
    Default schedule: fk(i) → (sk), with iteration vector i; overlap O = h ∗ (|l| + |r|)
    τ ∗ T ≤ φl(sk) ≤ τ ∗ (T + 1) + O − 1  ∧  τ ∗ T ≤ φr(sk) ≤ τ ∗ (T + 1) + O − 1
    Tiled schedule: fk(i) → (T, sk)

    Scratch pads
    • Storage for intermediate values
    • Reduction in intermediate storage
    • Better locality and reuse
    • Privatized for each thread
    • Only last level can be live-out

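
    To make the idea concrete, a runnable sketch of overlapped tiling on a simpler 1-D, two-stage pipeline (our example; tile width τ = 64, total overlap h ∗ (|l| + |r|) = 2 per side):

    import numpy as np

    fin = np.random.rand(1024)

    def f1(a):              # f1(x) = a(x-1) + a(x+1) on the interior
        return a[:-2] + a[2:]

    def pipeline(a):        # fout(x) = f1(x-1) * f1(x+1): four halo cells in total
        b = f1(a)
        return b[:-2] * b[2:]

    ref = pipeline(fin)     # untiled reference

    tau, overlap = 64, 2    # tile width and per-side overlap
    out = np.empty_like(ref)
    for T in range(0, ref.size, tau):
        lo, hi = T, min(T + tau, ref.size)
        # each tile copies its inputs plus the overlap into a private scratch pad
        # and redundantly recomputes f1 there, so tiles are independent (parallel)
        scratch = fin[lo : hi + 2 * overlap]
        out[lo:hi] = pipeline(scratch)

    assert np.allclose(out, ref)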

  • Compiler - Grouping

    [Figure: DAG of the Harris corner pipeline: Iin feeds Ix and Iy; these feed Ixx, Ixy, Iyy; then Sxx, Sxy, Syy; then det and trace; finally harris.]

    Fusion criteria
    • Constant dependences: alignment, scaling
    • Redundant computation vs reuse: overlap, tile sizes, parameter estimates
    • Live-out constraints

    Fusion heuristic
    • Exponential number of valid groupings
    • Greedy iterative approach


  • Compiler - Grouping

    Input: DAG of stages (S, E); parameter estimates P; tile sizes T; overlap threshold o_thresh

    /* Initially, each stage is in a separate group */
    G ← ∅
    for s ∈ S do
        G ← G ∪ {s}
    repeat
        converge ← true
        cand_set ← getSingleChildGroups(G, E)
        ord_list ← sortGroupsBySize(cand_set, P)
        for each g in ord_list do
            child ← getChildGroup(g, E)
            if hasConstantDependenceVectors(g, child) then
                o_r ← estimateRelativeOverlap(g, child, T)
                if o_r < o_thresh then
                    merge ← g ∪ child
                    G ← G − g − child
                    G ← G ∪ merge
                    converge ← false
                    break
    until converge = true
    return G

    Greedy grouping algorithm (a Python transcription follows)
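
    The heuristic transcribes almost line-for-line into Python. In this runnable sketch (ours), the size and overlap estimates are supplied as plain callables; in the compiler they are computed from the polyhedral representation:

    def group_stages(stages, edges, size_of, overlap_of, othresh):
        """edges: set of (producer, consumer) stage pairs.
        overlap_of(g, child) returns the relative overlap of fusing g into
        child, or None if the dependence vectors are not constant."""
        groups = {frozenset([s]) for s in stages}   # each stage starts alone

        def children(g):
            return {h for h in groups if h != g
                    and any((p, c) in edges for p in g for c in h)}

        while True:
            # candidate groups: those feeding exactly one other group
            cands = [g for g in groups if len(children(g)) == 1]
            for g in sorted(cands, key=size_of):
                child = next(iter(children(g)))
                ov = overlap_of(g, child)
                if ov is not None and ov < othresh:
                    groups -= {g, child}            # merge and restart the scan
                    groups.add(g | child)
                    break
            else:
                return groups                       # no merge applied: converged

    # Toy run on a three-stage chain with a dummy overlap estimate:
    print(group_stages({'f1', 'f2', 'fout'},
                       {('f1', 'f2'), ('f2', 'fout')},
                       size_of=len, overlap_of=lambda g, c: 0.1, othresh=0.25))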


  • Compiler - Grouping

    [Figure: the grouping heuristic applied to the pyramid blending dataflow graph; its downsample (↓x, ↓y), upsample (↑x, ↑y), Laplacian (L), mask (M) and blend (X, +) stages are partitioned into groups.]

  • Compiler - Code Generation

    void pipe_harris(int C, int R, float * I, float *& harris)
    {
      /* Live out allocation */
      harris = (float *) malloc(sizeof(float) * (2+R) * (2+C));
      #pragma omp parallel for
      for (int Ti = -1; Ti ... /* the loop bound and body are truncated in the source */

  • Auto Tuning

    [Scatter plots: execution time on 16 cores (ms) vs execution time on 1 core (ms), one point per configuration, for the Camera Pipeline and for Pyramid Blending]

    Tuning
    • Tile sizes and overlap threshold determine grouping
    • Seven tile sizes for each dimension
    • Three threshold values
    • Small search space (7² ∗ 3 for 2d-tiling)
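
    The search is small enough to enumerate exhaustively. A sketch (ours; the tile sizes and thresholds are placeholder values, and build_and_time stands in for compiling and timing one variant):

    from itertools import product

    tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven candidates per dimension
    thresholds = [0.2, 0.4, 0.8]                  # three overlap thresholds

    def tune(build_and_time):
        # 7 * 7 * 3 = 147 configurations for 2-d tiling
        return min(product(tile_sizes, tile_sizes, thresholds),
                   key=lambda cfg: build_and_time(*cfg))

    # Dummy cost function just to show the call shape:
    best = tune(lambda tx, ty, th: abs(tx - 64) + abs(ty - 32) + th)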

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Related work

    Polyhedral compilation
    • Decoupled view of computation and schedules
    • Scheduling for affine loop nests; does not target specific domains
    • Overlapped tiling: works for simple time-iterated stencils; different approach to constructing overlapped tiles

    Halide
    • Domain specific language and compiler system
    • Effective for exploring schedules; requires an explicit schedule specification

  • Halide

    ImageParam input(UInt(16), 2);
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // The algorithm
    blur_x(x, y) = (input(x, y) + input(x+1, y) + input(x+2, y))/3;
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2))/3;

    // How to schedule it
    blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
    blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);

    Halide blur: the algorithm and its schedule

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Experimental Setup

    Processor           Intel Xeon E5-2680
    Clock               2.7 GHz
    Cores / socket      8
    Total cores         16
    L1 cache / core     32 KB
    L2 cache / core     512 KB
    L3 cache / socket   20 MB
    Compiler            Intel C compiler (icc) 14.0.1
    Compiler flags      -O3 -xhost
    Linux kernel        3.8.0-38 (64-bit)

  • Evaluation Method

    Benchmarks
    • Seven representative benchmarks
    • Varying structure and complexity

    Comparison
    • Halide: tuned schedule, matched schedule
    • OpenCV: optimized library calls

  • Multiscale Interpolation

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      2.24   4.03   6.57   9.82   12.54
    PolyMage(opt)          1.28   2.38   3.93   6.18   9.43
    PolyMage(base+vec)     1.46   2.57   4.07   5.7    5.88
    PolyMage(base)         1      1.8    2.94   4.42   5.82
    Halide(tuned+vec)      2.14   3.44   5.94   7.25   6.93
    Halide(tuned)          1.77   2.99   5.29   7.13   6.92
    Halide(matched+vec)    1.28   2.43   4.1    7.1    12.11
    Halide(matched)        0.88   1.68   3.19   5.47   8.5

  • Harris Corner Detection

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      3.74   7.35   12.85  24.02  46.78
    PolyMage(opt)          1.12   2.24   4.03   7.64   15.18
    PolyMage(base+vec)     2.47   4.31   7.83   12.22  16.22
    PolyMage(base)         1      1.94   3.47   6.18   10.3
    Halide(tuned+vec)      1.64   3.17   6.08   10.17  18.07
    Halide(tuned)          0.93   1.84   3.51   6.05   10.3
    Halide(matched+vec)    1.87   3.73   7.43   13.65  25.35
    Halide(matched)        0.73   1.45   2.91   5.31   9.88

  • Camera Pipeline

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      2.79   5.49   9.5    18.16  32.37
    PolyMage(opt)          0.79   1.57   2.74   5.26   10.28
    PolyMage(base+vec)     2.95   5.62   9.58   13.22  24.2
    PolyMage(base)         1      1.98   3.61   6.5    12.16
    Halide(tuned+vec)      4.82   7.3    12.32  21.26  31.28
    Halide(tuned)          1.4    2.59   4.71   7.56   14.15
    FCam                   2.42   4.83   9.55   17.49  33.75

  • Results Summary

    Benchmark             Stages  Image size   Lines  PolyMage time (ms)         OpenCV (ms)  Speedup over H-tuned
                                                      1 core  4 cores  16 cores  1 core       (16 cores)
    Harris Corner         11      6400×6400    43     233.79  68.03    18.69     810.24       2.59×*
    Pyramid Blending      44      2048×2048×3  71     196.99  57.84    21.91     197.28       4.61×*
    Unsharp Mask          4       2048×2048×3  16     165.40  44.92    14.85     349.57       1.6×*
    Local Laplacian       99      2560×1536×3  107    274.50  76.60    32.35     -            1.54×
    Camera Pipeline       32      2528×1920    86     67.87   19.95    5.86      -            1.04×
    Bilateral Grid        7       2560×1536    43     89.76   27.30    8.47      -            0.89×
    Multiscale Interpol.  49      2560×1536×3  41     101.70  34.73    18.18     -            1.81×

    Mean speedup of 1.27× over tuned Halide schedules.
    Comparable performance to a highly tuned camera pipeline implementation.

  • Conclusion

    DSL for high-performance image processing

    Optimization techniques
    • Tiling
    • Storage optimization
    • Grouping and fusing

    Effectiveness
    • Up to 1.81× better than tuned schedules
    • Matching hand tuned performance

  • Acknowledgements

    Halide, OpenCV, isl, islpy and cgen

    Intel for their hardware

  • Thank You!

  • Pyramid Blending

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.66   3.2    5.66   9.96   14.95
    PolyMage(opt)          1.26   2.42   4.29   7.49   13.37
    PolyMage(base+vec)     1.13   2.02   3.25   4.71   5.31
    PolyMage(base)         1      1.82   2.99   4.55   5.35
    Halide(tuned+vec)      0.56   1      1.83   2.71   3.24
    Halide(tuned)          0.66   1.16   2.08   2.98   3.43
    Halide(matched+vec)    1.24   2.12   3.7    5.72   7
    Halide(matched)        0.76   1.45   2.64   4.31   5.98

  • Bilateral Grid

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.15   2.17   3.77   6.55   12.16
    PolyMage(opt)          0.82   1.61   2.73   4.74   8.99
    PolyMage(base+vec)     1.65   3.17   3.42   3.56   3.72
    PolyMage(base)         1      1.97   2.15   2.28   2.42
    Halide(tuned+vec)      1.6    2.92   5.4    8.55   13.68
    Halide(tuned)          1.13   2.11   4.03   6.72   10.37

  • Local Laplacian Filter

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.62   3.41   5.8    9.41   13.73
    PolyMage(opt)          1.02   1.99   3.48   6.1    10.81
    PolyMage(base+vec)     1.58   2.93   4.71   6.41   8.74
    PolyMage(base)         1      1.92   3.3    5.23   7.39
    Halide(tuned+vec)      1.04   1.99   3.68   6.18   8.93
    Halide(tuned)          0.55   1.07   2.08   3.61   5.71
