PolyMage: Automatic Optimization for Image Processing Pipelines. Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula. CSA, Indian Institute of Science. June 27, 2016.


  • PolyMage: Automatic Optimization for Image Processing Pipelines

    Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula

    CSA, Indian Institute of Science

    June 27, 2016

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation


  • Image Processing Pipelines - Data

    Cameras and Internet
    Instagram: 60 million photos per day. http://instagram.com/press/
    YouTube: 100 hours of video uploaded every minute. https://www.youtube.com/yt/press/statistics.html

    Astronomy
    Large Synoptic Survey Telescope (LSST): generates 30 TB of image data every night. http://lsst.org/lsst/google

    Medical Imaging
    Human Connectome Project: fMRI data for 68 subjects, 1.873 TB. http://www.humanconnectome.org/

  • Image Processing Pipelines - Computation

    Synthesis, Enhancement and Analysis of Images

    Applications

    Computational Photography

    Computer Vision

    Medical Imaging

  • Image Processing Pipelines - Challenges

    Need for Speed
    • Real-time processing
    • High resolution
    • Complex algorithms

    Modern Architectures
    • Deep memory hierarchies
    • Parallelism
    • Heterogeneity

    Libraries
    • OpenCV, CImg, MATLAB
    • Limited optimization
    • Architecture support

    Hand Optimization
    • Requires expertise
    • Tedious and error prone
    • Not portable


  • Domain Specific Languages

    Productivity, Performance and Portability
    • Decouple algorithms from schedules
    • Support common patterns in the domain
    • High performance compilation

  • Image Processing Pipelines - Computation Patterns

    Point-wise:
    f(x, y) = g(x, y)

    Stencil:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(x + σx, y + σy)

  • Image Processing Pipelines - Computation Patterns

    Downsample:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(2x + σx, 2y + σy)

    Upsample:
    f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2)

  • Image Processing Pipelines - Computation Patterns

    Histogram:
    f(g(x)) += 1

    Time-iterated:
    f(t, x, y) = g(f(t − 1, x, y))
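
    These patterns map directly onto array code. A minimal NumPy sketch of the four spatial patterns (our illustration, not PolyMage code; borders are wrapped or clamped for brevity):

    import numpy as np

    g = np.random.rand(16, 16).astype(np.float32)

    # Point-wise: f(x, y) = g(x, y)
    f_point = g.copy()

    # 3x3 stencil: f(x, y) = sum over sx, sy in [-1, 1] of g(x+sx, y+sy)
    # (np.roll wraps around at the borders, a simplification)
    f_stencil = sum(np.roll(np.roll(g, -sx, axis=0), -sy, axis=1)
                    for sx in (-1, 0, 1) for sy in (-1, 0, 1))

    # Downsample: f(x, y) = sum over sx, sy of g(2x+sx, 2y+sy)
    f_down = sum(np.roll(np.roll(g, -sx, axis=0), -sy, axis=1)[::2, ::2]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))

    # Upsample: f(x, y) = sum over sx, sy of g((x+sx)/2, (y+sy)/2)
    x = np.arange(32)[:, None]
    y = np.arange(32)[None, :]
    f_up = sum(g[np.clip((x + sx) // 2, 0, 15), np.clip((y + sy) // 2, 0, 15)]
               for sx in (-1, 0, 1) for sy in (-1, 0, 1))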

  • PolyMage Framework

    DSL Spec → Build stage graph → Static bounds check → Inlining
    → Polyhedral representation → Default schedule
    → Alignment → Scaling → Grouping
    → Schedule transformation → Storage optimization
    → Code generation

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Language Constructs

    Parameter, Variable, Image, Interval, Function, Accumulator,
    Stencil, Condition, Select, Case, Accumulate

    N = Parameter(Int)
    x = Variable()
    I = Image(Float, [N])

    c1 = Condition(x, '>=', 1) & Condition(x, '<', ...)  # remainder truncated in the source

  • Language Constructs

    R, C = Parameter(Int), Parameter(Int)
    I = Image(UChar, [R, C])
    x, y = Variable(), Variable()
    row, col = Interval(0, R, 1), Interval(0, C, 1)
    bins = Interval(0, 255, 1)
    hist = Accumulator(redDom = ([x, y], [row, col]),
                       varDom = ([x], bins), Int)
    hist.defn = Accumulate(hist(I(x, y)), 1, Sum)

    hist : [0..255] → ℤ
    hist(p) = |{(x, y) : I(x, y) = p}|
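
    For reference, the reduction this Accumulator expresses, written in NumPy (our sketch):

    import numpy as np

    I = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)

    # hist(p) = |{(x, y) : I(x, y) = p}|: accumulate 1 into bin I(x, y) with Sum
    hist = np.zeros(256, dtype=np.int64)
    np.add.at(hist, I.ravel(), 1)

    assert hist.sum() == I.size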

  • Unsharp Mask

    R, C = Parameter(Int), Parameter(Int)
    thresh, w = Parameter(Float), Parameter(Float)
    x, y, c = Variable(), Variable(), Variable()
    I = Image(Float, [3, R+4, C+4])

    cr = Interval(0, 2, 1)
    xr, xc = Interval(2, R+1, 1), Interval(0, C+3, 1)
    yr, yc = Interval(2, R+1, 1), Interval(2, C+1, 1)

    blurx = Function(varDom = ([c, x, y], [cr, xr, xc]), Float)
    blurx.defn = [ Stencil(I(c, x, y), 1.0/16, [[1, 4, 6, 4, 1]]) ]

    blury = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    blury.defn = [ Stencil(blurx(c, x, y), 1.0/16, [[1], [4], [6], [4], [1]]) ]

    sharpen = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    sharpen.defn = [ I(c, x, y) * (1 + w) - blury(c, x, y) * w ]

    masked = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
    diff = Abs(I(c, x, y) - blury(c, x, y))
    cond = Condition(diff, '<', ...)  # the comparison and the Select defining masked are truncated in the source
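
    A NumPy reference helps pin down the semantics. This is our sketch, assuming the truncated definition selects the input pixel where the blur difference is below thresh, and computing only the interior where all stencil taps are defined:

    import numpy as np

    def unsharp_mask(I, thresh, w):
        # Separable 5-tap binomial blur (1, 4, 6, 4, 1) / 16 along both spatial axes
        k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
        blurx = sum(k[s + 2] * I[:, 2 + s : I.shape[1] - 2 + s, :] for s in range(-2, 3))
        blury = sum(k[s + 2] * blurx[:, :, 2 + s : blurx.shape[2] - 2 + s] for s in range(-2, 3))
        Ic = I[:, 2:-2, 2:-2]                      # interior of the input
        sharpen = Ic * (1 + w) - blury * w
        diff = np.abs(Ic - blury)
        return np.where(diff < thresh, Ic, sharpen)

    out = unsharp_mask(np.random.rand(3, 68, 68).astype(np.float32), 0.01, 3.0)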

  • Harris Corner Detection

    R, C = Parameter(Int), Parameter(Int)
    I = Image(Float, [R+2, C+2])
    x, y = Variable(), Variable()
    row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)

    c  = Condition(x, '>=', 1) & Condition(x, '<', ...) & \
         Condition(y, '>=', 1) & Condition(y, '<', ...)
    c2 = Condition(x, '>=', 2) & Condition(x, '<', ...) & \
         Condition(y, '>=', 2) & Condition(y, '<', ...)
    # the upper bounds and the stage definitions are truncated in the source

  • Pyramid Blending

    [Figure: dataflow graph of the pyramid blending pipeline, built from downsample stages (↓x, ↓y), upsample stages (↑x, ↑y), Laplacian levels (L), a mask (M) and blending nodes (X, +).]

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Compiler - Polyhedral Representation

    x = Variable()
    fin = Image(Float, [18])
    f1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
    f1.defn = [ fin(x) + 1 ]
    f2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
    f2.defn = [ f1(x-1) + f1(x+1) ]
    fout = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
    fout.defn = [ f2(x-1) * f2(x+1) ]

    From this specification the compiler extracts:
    Domains: f1 over [0, 17], f2 over [1, 16], fout over [2, 15]
    Dependence vectors: f2(x) reads f1(x − 1) and f1(x + 1); fout(x) reads f2(x − 1) and f2(x + 1)
    Live-outs: fout

    Default schedule:
    f1(x) → (0, x)
    f2(x) → (1, x)
    fout(x) → (2, x)

    Skewed schedule:
    f1(x) → (0, x)
    f2(x) → (1, x + 1)
    fout(x) → (2, x + 2)
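
    These domains and schedules are ordinary integer sets and maps. A sketch of the same objects in islpy, the isl binding the deck acknowledges using (the code and names are ours):

    import islpy as isl

    # Stage domains
    dom = isl.UnionSet("{ f1[x] : 0 <= x <= 17; "
                       "  f2[x] : 1 <= x <= 16; "
                       "  fout[x] : 2 <= x <= 15 }")

    # Default schedule: stage k runs at logical time (k, x)
    default = isl.UnionMap("{ f1[x] -> [0, x]; f2[x] -> [1, x]; fout[x] -> [2, x] }")

    # Skewed schedule: consumers shifted so the dependences
    # f2(x) <- f1(x-1), f1(x+1) and fout(x) <- f2(x-1), f2(x+1)
    # have non-negative distance along the second dimension
    skewed = isl.UnionMap("{ f1[x] -> [0, x]; f2[x] -> [1, x + 1]; fout[x] -> [2, x + 2] }")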

  • Compiler - Scheduling Criteria

    [Figure: iteration spaces of f1, f2 and fout under the default schedule]

    Criteria: Parallelism, Locality, Storage


  • Compiler - Scheduling Criteria

    [Figure: parallelogram tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage

  • Compiler - Scheduling Criteria

    [Figure: split tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage

  • Compiler - Scheduling Criteria

    [Figure: overlapped tiling of f1, f2 and fout]

    Criteria: Parallelism, Locality, Storage
    Cost: Redundant computation

  • Compiler - Alignment and Scaling

    Alignment
    • f(x, y) = g(0, x, y) + g(1, x, y) + g(2, x, y)
    • Default schedules:
      f(x, y) → (1, x, y, 0)
      g(0, x, y) → (0, 0, x, y)
      Dependence vector non-constant: (1, x, y − x, −y)
    • Aligned schedules:
      f(x, y) → (1, 0, x, y)
      g(0, x, y) → (0, 0, x, y)
      Dependence vector: (1, 0, 0, 0)

  • Compiler - Alignment and Scaling

    Scaling
    • f(x) = g(2x) + g(2x + 1)
    • Default schedules:
      f(x) → (1, x)
      g(x) → (0, x)
      Dependence vectors non-constant: (1, −x), (1, −x − 1)
    • Scaled schedules:
      f(x) → (1, 2x)
      g(x) → (0, x)
      Dependence vectors: (1, 0), (1, −1)
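
    A quick check of the scaled schedules (our arithmetic, not on the slide): the access g(2x) gives the dependence vector (1, 2x) − (0, 2x) = (1, 0), and g(2x + 1) gives (1, 2x) − (0, 2x + 1) = (1, −1). Both are constant, which is what later allows the two stages to be grouped and tiled together.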

  • Compiler - Overlapped Tiling

    [Figure: overlapped tiles across the stages f, f↓1, f↓2, f↑1 and fout;
    h is the tile height, τ the tile width, o the overlap, and φl, φr the
    left and right bounding faces.]

    f(x) = fin(x)
    f↓1(x) = f(2x − 1) + f(2x + 1)
    f↓2(x) = f↓1(2x − 1) × f↓1(2x + 1)
    f↑1(x) = f↓2(x/2) + f↓2(x/2 + 1)
    fout(x) = f↑1(x/2)

    Schedules (after alignment and scaling):
    f(x) → (0, x)
    f↓1(x) → (1, 2x)
    f↓2(x) → (2, 4x)
    f↑1(x) → (3, 2x)
    fout(x) → (4, x)

    Tile shape
    • Conservative vs precise bounding faces
    • Significant reduction in redundant computation

    Tile constraints
    Default schedule: fk(i) → (sk), with iteration vector i; overlap O = h ∗ (|l| + |r|)
    τ ∗ T ≤ φl(sk) ≤ τ ∗ (T + 1) + O − 1  ∧  τ ∗ T ≤ φr(sk) ≤ τ ∗ (T + 1) + O − 1
    Tiled schedule: fk(i) → (T, sk)

    Scratch pads
    • Storage for intermediate values
    • Reduction in intermediate storage
    • Better locality and reuse
    • Privatized for each thread
    • Only last level can be live-out

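
    To make the idea concrete, a runnable sketch of overlapped tiling on a simpler 1-D, two-stage pipeline (our example; tile width τ = 64, total overlap h ∗ (|l| + |r|) = 2 per side):

    import numpy as np

    fin = np.random.rand(1024)

    def f1(a):              # f1(x) = a(x-1) + a(x+1) on the interior
        return a[:-2] + a[2:]

    def pipeline(a):        # fout(x) = f1(x-1) * f1(x+1): four halo cells in total
        b = f1(a)
        return b[:-2] * b[2:]

    ref = pipeline(fin)     # untiled reference

    tau, overlap = 64, 2    # tile width and per-side overlap
    out = np.empty_like(ref)
    for T in range(0, ref.size, tau):
        lo, hi = T, min(T + tau, ref.size)
        # each tile copies its inputs plus the overlap into a private scratch pad
        # and redundantly recomputes f1 there, so tiles are independent (parallel)
        scratch = fin[lo : hi + 2 * overlap]
        out[lo:hi] = pipeline(scratch)

    assert np.allclose(out, ref)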

  • Compiler - Grouping

    [Figure: DAG of the Harris corner pipeline: Iin feeds Ix and Iy; these feed Ixx, Ixy, Iyy; then Sxx, Sxy, Syy; then det and trace; finally harris.]

    Fusion criteria
    • Constant dependences: alignment, scaling
    • Redundant computation vs reuse: overlap, tile sizes, parameter estimates
    • Live-out constraints

    Fusion heuristic
    • Exponential number of valid groupings
    • Greedy iterative approach


  • Compiler - Grouping

    Input: DAG of stages (S, E); parameter estimates P; tile sizes T; overlap threshold o_thresh

    /* Initially, each stage is in a separate group */
    G ← ∅
    for s ∈ S do
        G ← G ∪ {s}
    repeat
        converge ← true
        cand_set ← getSingleChildGroups(G, E)
        ord_list ← sortGroupsBySize(cand_set, P)
        for each g in ord_list do
            child ← getChildGroup(g, E)
            if hasConstantDependenceVectors(g, child) then
                o_r ← estimateRelativeOverlap(g, child, T)
                if o_r < o_thresh then
                    merge ← g ∪ child
                    G ← G − g − child
                    G ← G ∪ merge
                    converge ← false
                    break
    until converge = true
    return G

    Greedy grouping algorithm (a Python transcription follows)
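
    The heuristic transcribes almost line-for-line into Python. In this runnable sketch (ours), the size and overlap estimates are supplied as plain callables; in the compiler they are computed from the polyhedral representation:

    def group_stages(stages, edges, size_of, overlap_of, othresh):
        """edges: set of (producer, consumer) stage pairs.
        overlap_of(g, child) returns the relative overlap of fusing g into
        child, or None if the dependence vectors are not constant."""
        groups = {frozenset([s]) for s in stages}   # each stage starts alone

        def children(g):
            return {h for h in groups if h != g
                    and any((p, c) in edges for p in g for c in h)}

        while True:
            # candidate groups: those feeding exactly one other group
            cands = [g for g in groups if len(children(g)) == 1]
            for g in sorted(cands, key=size_of):
                child = next(iter(children(g)))
                ov = overlap_of(g, child)
                if ov is not None and ov < othresh:
                    groups -= {g, child}            # merge and restart the scan
                    groups.add(g | child)
                    break
            else:
                return groups                       # no merge applied: converged

    # Toy run on a three-stage chain with a dummy overlap estimate:
    print(group_stages({'f1', 'f2', 'fout'},
                       {('f1', 'f2'), ('f2', 'fout')},
                       size_of=len, overlap_of=lambda g, c: 0.1, othresh=0.25))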


  • Compiler - Grouping

    [Figure: the grouping heuristic applied to the pyramid blending dataflow graph; its downsample (↓x, ↓y), upsample (↑x, ↑y), Laplacian (L), mask (M) and blend (X, +) stages are partitioned into groups.]

  • Compiler - Code Generation

    void pipe_harris(int C, int R, float * I, float *& harris)
    {
      /* Live out allocation */
      harris = (float *) malloc(sizeof(float) * (2+R) * (2+C));
      #pragma omp parallel for
      for (int Ti = -1; Ti ... /* the loop bound and body are truncated in the source */

  • Auto Tuning

    [Scatter plots: execution time on 16 cores (ms) vs execution time on 1 core (ms), one point per configuration, for the Camera Pipeline and for Pyramid Blending]

    Tuning
    • Tile sizes and overlap threshold determine grouping
    • Seven tile sizes for each dimension
    • Three threshold values
    • Small search space (7² ∗ 3 for 2d-tiling)
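
    The search is small enough to enumerate exhaustively. A sketch (ours; the tile sizes and thresholds are placeholder values, and build_and_time stands in for compiling and timing one variant):

    from itertools import product

    tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven candidates per dimension
    thresholds = [0.2, 0.4, 0.8]                  # three overlap thresholds

    def tune(build_and_time):
        # 7 * 7 * 3 = 147 configurations for 2-d tiling
        return min(product(tile_sizes, tile_sizes, thresholds),
                   key=lambda cfg: build_and_time(*cfg))

    # Dummy cost function just to show the call shape:
    best = tune(lambda tx, ty, th: abs(tx - 64) + abs(ty - 32) + th)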

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Related work

    Polyhedral compilation
    • Decoupled view of computation and schedules
    • Scheduling for affine loop nests; does not target specific domains
    • Overlapped tiling: works for simple time-iterated stencils; different approach to constructing overlapped tiles

    Halide
    • Domain specific language and compiler system
    • Effective for exploring schedules; requires an explicit schedule specification

  • Halide

    ImageParam input(UInt(16), 2);
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // The algorithm
    blur_x(x, y) = (input(x, y) + input(x+1, y) + input(x+2, y))/3;
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2))/3;

    // How to schedule it
    blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
    blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);

    Halide blur: the algorithm and its schedule

  • Table of Contents

    1 Image Processing Pipelines

    2 Language

    3 Compiler

    4 Related Work

    5 Performance Evaluation

  • Experimental Setup

    Processor           Intel Xeon E5-2680
    Clock               2.7 GHz
    Cores / socket      8
    Total cores         16
    L1 cache / core     32 KB
    L2 cache / core     512 KB
    L3 cache / socket   20 MB
    Compiler            Intel C compiler (icc) 14.0.1
    Compiler flags      -O3 -xhost
    Linux kernel        3.8.0-38 (64-bit)

  • Evaluation Method

    Benchmarks
    • Seven representative benchmarks
    • Varying structure and complexity

    Comparison
    • Halide: tuned schedule, matched schedule
    • OpenCV: optimized library calls

  • Multiscale Interpolation

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      2.24   4.03   6.57   9.82   12.54
    PolyMage(opt)          1.28   2.38   3.93   6.18   9.43
    PolyMage(base+vec)     1.46   2.57   4.07   5.7    5.88
    PolyMage(base)         1      1.8    2.94   4.42   5.82
    Halide(tuned+vec)      2.14   3.44   5.94   7.25   6.93
    Halide(tuned)          1.77   2.99   5.29   7.13   6.92
    Halide(matched+vec)    1.28   2.43   4.1    7.1    12.11
    Halide(matched)        0.88   1.68   3.19   5.47   8.5

  • Harris Corner Detection

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      3.74   7.35   12.85  24.02  46.78
    PolyMage(opt)          1.12   2.24   4.03   7.64   15.18
    PolyMage(base+vec)     2.47   4.31   7.83   12.22  16.22
    PolyMage(base)         1      1.94   3.47   6.18   10.3
    Halide(tuned+vec)      1.64   3.17   6.08   10.17  18.07
    Halide(tuned)          0.93   1.84   3.51   6.05   10.3
    Halide(matched+vec)    1.87   3.73   7.43   13.65  25.35
    Halide(matched)        0.73   1.45   2.91   5.31   9.88

  • Camera Pipeline

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      2.79   5.49   9.5    18.16  32.37
    PolyMage(opt)          0.79   1.57   2.74   5.26   10.28
    PolyMage(base+vec)     2.95   5.62   9.58   13.22  24.2
    PolyMage(base)         1      1.98   3.61   6.5    12.16
    Halide(tuned+vec)      4.82   7.3    12.32  21.26  31.28
    Halide(tuned)          1.4    2.59   4.71   7.56   14.15
    FCam                   2.42   4.83   9.55   17.49  33.75

  • Results Summary

    Benchmark             Stages  Image size   Lines  PolyMage time (ms)         OpenCV (ms)  Speedup over H-tuned
                                                      1 core  4 cores  16 cores  1 core       (16 cores)
    Harris Corner         11      6400×6400    43     233.79  68.03    18.69     810.24       2.59×*
    Pyramid Blending      44      2048×2048×3  71     196.99  57.84    21.91     197.28       4.61×*
    Unsharp Mask          4       2048×2048×3  16     165.40  44.92    14.85     349.57       1.6×*
    Local Laplacian       99      2560×1536×3  107    274.50  76.60    32.35     -            1.54×
    Camera Pipeline       32      2528×1920    86     67.87   19.95    5.86      -            1.04×
    Bilateral Grid        7       2560×1536    43     89.76   27.30    8.47      -            0.89×
    Multiscale Interpol.  49      2560×1536×3  41     101.70  34.73    18.18     -            1.81×

    Mean speedup of 1.27× over tuned Halide schedules.
    Comparable performance to a highly tuned camera pipeline implementation.

  • Conclusion

    DSL for high-performance image processing

    Optimization techniques
    • Tiling
    • Storage optimization
    • Grouping and fusing

    Effectiveness
    • Up to 1.81× better than tuned schedules
    • Matching hand tuned performance

  • Acknowledgements

    Halide, OpenCV, isl, islpy and cgen

    Intel for their hardware

  • Thank You!

  • Pyramid Blending

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.66   3.2    5.66   9.96   14.95
    PolyMage(opt)          1.26   2.42   4.29   7.49   13.37
    PolyMage(base+vec)     1.13   2.02   3.25   4.71   5.31
    PolyMage(base)         1      1.82   2.99   4.55   5.35
    Halide(tuned+vec)      0.56   1      1.83   2.71   3.24
    Halide(tuned)          0.66   1.16   2.08   2.98   3.43
    Halide(matched+vec)    1.24   2.12   3.7    5.72   7
    Halide(matched)        0.76   1.45   2.64   4.31   5.98

  • Bilateral Grid

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.15   2.17   3.77   6.55   12.16
    PolyMage(opt)          0.82   1.61   2.73   4.74   8.99
    PolyMage(base+vec)     1.65   3.17   3.42   3.56   3.72
    PolyMage(base)         1      1.97   2.15   2.28   2.42
    Halide(tuned+vec)      1.6    2.92   5.4    8.55   13.68
    Halide(tuned)          1.13   2.11   4.03   6.72   10.37

  • Local Laplacian Filter

    Speedup over PolyMage base (1 core):

    Cores                  1      2      4      8      16
    PolyMage(opt+vec)      1.62   3.41   5.8    9.41   13.73
    PolyMage(opt)          1.02   1.99   3.48   6.1    10.81
    PolyMage(base+vec)     1.58   2.93   4.71   6.41   8.74
    PolyMage(base)         1      1.92   3.3    5.23   7.39
    Halide(tuned+vec)      1.04   1.99   3.68   6.18   8.93
    Halide(tuned)          0.55   1.07   2.08   3.61   5.71
