Loop Fusion in Haskell

Roman Leshchinskiy

Programming Languages and Systems, University of New South Wales

What is this about?

What I do

Data Parallel Haskell

compiles nested data-parallel programs to flat data-parallel ones

lots of arrays and collective operations involved

What other people do

array programs with lots of collective operations

What is this about?

What I do

zipWith (-)
  (zipWith (*)
    (zipWith (-) (replicate_s segd as1) xs)
    (zipWith (-) (replicate_s segd bs1) ys))
  (zipWith (*)
    (zipWith (-) (replicate_s segd bs2) ys)
    (zipWith (-) (replicate_s segd as2) xs))

What is this about?

What I do

return . foldl' hash 5381 . map toLower . filter isAlpha =<< readFile f

What is this about?

What everybody wants

no temporary arrays

fused loops

C-like speed

Loop fusion is easy!

foo xs = map (*5) (map (+3) xs)
bar ys = filter even (filter (<42) ys)
baz zs = map (+1) (filter even zs)

"map/map"          map f (map g xs) = map (f . g) xs
"filter/filter"    filter f (filter g xs) = filter (λ x → f x && g x) xs
"map/filter"       map f (filter g xs) = mapFilter f g xs
"map/mapFilter"    map f (mapFilter g h xs) = mapFilter (f . g) h xs
"mapFilter/filter" mapFilter f g (filter h xs) = mapFilter f (λ x → g x && h x) xs
...

BAD: every new combination of combinators needs yet another rule
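In GHC, a rule such as "map/map" is written as a RULES pragma. The following is a minimal sketch, not code from the talk: `mapL` is a hypothetical list function introduced here so that the rule does not interact with the fusion rules GHC's own `map` already carries. The rule is only applied when compiling with optimisation, and the result is the same whether or not it fires.

```haskell
-- A hand-written map on lists; mapL is a hypothetical name used only here.
mapL :: (a -> b) -> [a] -> [b]
mapL _ []     = []
mapL f (x:xs) = f x : mapL f xs
{-# NOINLINE mapL #-}   -- keep calls to mapL visible so the rule can match

-- Two traversals become one, so no intermediate list is built.
{-# RULES
"mapL/mapL" forall f g xs. mapL f (mapL g xs) = mapL (f . g) xs
  #-}

-- The example from the slide, written against mapL.
foo :: [Int] -> [Int]
foo xs = mapL (*5) (mapL (+3) xs)
```

Evaluating `foo [1, 2, 3]` in GHCi gives the same list with or without the rule; the rule only changes how many passes are made.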

The challenge

use a constant number of rewrite rules

don’t require new rules for new combinators

make adding new combinators easy

fuse everything!

don’t require specialised compiler support

handle both sequential and parallel loops

Sequential loops

Streams

data Step s a = Yield a s
              | Done

data Stream a = ∃s. Stream (s → Step s a) s

stepper produces next element and state from current state

similar to an iterator

actually encodes an anamorphism (unfold)
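The link to unfolds can be made concrete. The sketch below assumes GHC's ExistentialQuantification extension for the hidden state type; `unfoldS`, `toListS` and `countS` are hypothetical helper names, and a list is used as the output container only to keep the example self-contained.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s
              | Done

-- The state type s is hidden; a consumer can only run the stepper.
data Stream a = forall s. Stream (s -> Step s a) s

-- Build a stream from a seed, in the same way Data.List.unfoldr builds a list.
unfoldS :: (s -> Maybe (a, s)) -> s -> Stream a
unfoldS f s0 = Stream step s0
  where
    step s = case f s of
      Just (x, s') -> Yield x s'
      Nothing      -> Done

-- Run the stepper until Done, collecting the elements.
toListS :: Stream a -> [a]
toListS (Stream step s0) = go s0
  where
    go s = case step s of
      Yield x s' -> x : go s'
      Done       -> []

-- Iterator-style example: count from lo up to hi.
countS :: Int -> Int -> Stream Int
countS lo hi = unfoldS next lo
  where
    next i | i <= hi   = Just (i, i + 1)
           | otherwise = Nothing
```

Here the stepper plays the role of an iterator's "next" method, and the seed is the iterator's current position.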

Streams

sumS :: Num a ⇒ Stream a → a
sumS (Stream step s) = go 0 s
  where
    go z s = case step s of
               Yield x s' → go (z+x) s'
               Done       → z

Streams

stream :: Array a → Stream a
stream arr = Stream step 0
  where
    step i | i < length arr = Yield (arr ! i) (i+1)
           | otherwise      = Done

Streams

mapS :: (a → b) → Stream a → Stream b
mapS f (Stream step s) = Stream step' s
  where
    step' s = case step s of
                Yield x s' → Yield (f x) s'
                Done       → Done

Streams

unstream :: Stream a → Array a
unstream (Stream step s) = <allocate, fill and freeze>

Stream fusion in three easy steps

Step 1: implement array operations in terms of streams

sum :: Num a ⇒ Array a → a
sum = sumS . stream

map :: (a → b) → Array a → Array b
map f = unstream . mapS f . stream

Step 2: inline them

sumsq :: Num a ⇒ Array a → a
sumsq = sum . map (λx → x*x)
      = sumS . stream . unstream . mapS f . stream      (writing f for λx → x*x)

Step 3: eliminate stream/unstream pairs

"stream/unstream" stream (unstream s) = s

sumsq = sumS . mapS f . stream

Let GHC optimise the rest.
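The three steps can be tried out in a single module. This is a sketch under stated assumptions rather than the DPH implementation: a plain list stands in for the slides' `Array` type, the phase annotations on `stream` and `unstream` are one plausible way to give the rule a chance to fire before inlining, and the rule is only applied when compiling with optimisation.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Prelude hiding (map, sum)

-- Assumption: a plain list stands in for the slides' Array type.
type Array a = [a]

data Step s a = Yield a s | Done
data Stream a = forall s. Stream (s -> Step s a) s

stream :: Array a -> Stream a
stream xs0 = Stream step xs0
  where
    step []     = Done
    step (x:xs) = Yield x xs
{-# INLINE [1] stream #-}

unstream :: Stream a -> Array a
unstream (Stream step s0) = go s0
  where
    go s = case step s of
      Yield x s' -> x : go s'
      Done       -> []
{-# INLINE [1] unstream #-}

-- Step 3: the single rewrite rule that removes temporaries.
{-# RULES
"stream/unstream" forall s. stream (unstream s) = s
  #-}

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
      Yield x s' -> Yield (f x) s'
      Done       -> Done
{-# INLINE mapS #-}

sumS :: Num a => Stream a -> a
sumS (Stream step s0) = go 0 s0
  where
    go z s = case step s of
      Yield x s' -> let z' = z + x in z' `seq` go z' s'
      Done       -> z
{-# INLINE sumS #-}

-- Step 1: array operations are thin wrappers around stream operations.
sum :: Num a => Array a -> a
sum = sumS . stream
{-# INLINE sum #-}

map :: (a -> b) -> Array a -> Array b
map f = unstream . mapS f . stream
{-# INLINE map #-}

-- Step 2 (inlining) is left to the compiler.
sumsq :: Num a => Array a -> a
sumsq = sum . map (\x -> x * x)
```

The answer does not depend on whether the rule fires; the rule only decides whether the intermediate array between `map` and `sum` is materialised.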

Optimising stream operations

sumsq xs = sumS (mapS square (stream xs))

is compiled to the optimal loop

no Stream or Step values ever created

only general-purpose optimisations

will be optimised further (unboxing etc.)

requires a great compiler (thanks GHC team!)

inline stream

sumsq xs = sumS (mapS square (Stream step1 0))
  where
    step1 i = case i < length xs of
                True  → Yield (xs ! i) (i+1)
                False → Done

inline mapS

sumsq xs = sumS (Stream step2 0)
  where
    step1 i = case i < length xs of
                True  → Yield (xs ! i) (i+1)
                False → Done
    step2 i = case step1 i of
                Yield x i' → Yield (square x) i'
                Done       → Done

inline sumS

sumsq xs = go 0 0
  where
    step1 i = case i < length xs of
                True  → Yield (xs ! i) (i+1)
                False → Done
    step2 i = case step1 i of
                Yield x i' → Yield (square x) i'
                Done       → Done
    go z i = case step2 i of
               Yield x i' → go (z+x) i'
               Done       → z

inline step2

    go z i = case (case step1 i of
                     Yield x i' → Yield (square x) i'
                     Done       → Done) of
               Yield x i' → go (z+x) i'
               Done       → z

case of case

    go z i = case step1 i of
               Yield x i' → go (z + square x) i'
               Done       → z

inline step1

sumsq xs = go 0 0
  where
    go z i = case (case i < length xs of
                     True  → Yield (xs ! i) (i+1)
                     False → Done) of
               Yield x i' → go (z + square x) i'
               Done       → z

case of case

sumsq xs = go 0 0
  where
    go z i = case i < length xs of
               True  → go (z + square (xs ! i)) (i+1)
               False → z

optimal loop

Why does it work?

stream and mapS are non-recursive, so they can be inlined

only the consumer sumS is recursive (its worker go is the one remaining loop)

A slight problem

filterS :: (a → Bool) → Stream a → Stream a
filterS f (Stream step s) = Stream step' s
  where
    step' s = case step s of
                Yield x s' | f x       → Yield x s'
                           | otherwise → step' s'
                Done → Done

step' is recursive

Extending streams

Idea: allow a loop iteration not to produce an element

data Step s a = Yield a s
              | Skip s
              | Done

filterS :: (a → Bool) → Stream a → Stream a
filterS f (Stream step s) = Stream step' s
  where
    step' s = case step s of
                Yield x s' | f x       → Yield x s'
                           | otherwise → Skip s'
                Skip s' → Skip s'
                Done    → Done

step' is now non-recursive
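A consumer has to handle the extra constructor as well. The sketch below is self-contained and assumes lists in place of arrays; `sumEvens` is a hypothetical example function. It shows the Skip-based `filterS` together with a `sumS` that simply loops again when it sees a Skip.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s
              | Skip s
              | Done

data Stream a = forall s. Stream (s -> Step s a) s

-- Assumption: lists stand in for arrays.
stream :: [a] -> Stream a
stream xs0 = Stream step xs0
  where
    step []     = Done
    step (x:xs) = Yield x xs

-- Non-recursive stepper: a rejected element becomes a Skip.
filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream step s0) = Stream step' s0
  where
    step' s = case step s of
      Yield x s' | p x       -> Yield x s'
                 | otherwise -> Skip s'
      Skip s'                -> Skip s'
      Done                   -> Done

-- The consumer owns the only recursion and continues on Skip.
sumS :: Num a => Stream a -> a
sumS (Stream step s0) = go 0 s0
  where
    go z s = case step s of
      Yield x s' -> let z' = z + x in z' `seq` go z' s'
      Skip s'    -> go z s'
      Done       -> z

-- Hypothetical example pipeline: sum of the even elements.
sumEvens :: [Int] -> Int
sumEvens = sumS . filterS even . stream
```

The cost of this design is that every consumer needs one more case alternative; the benefit is that every producer and transformer stays non-recursive and therefore inlinable.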

Stream fusion - summary

encode loops by streams

implement array operations in terms of streams

eliminate stream/unstream pairs (temporaries)

stream producers are non-recursive

standard optimisations remove overhead (loop fusion)

Standard optimisations: inlining, case-of-case, worker/wrapper transformation, SpecConstr, LiberateCase, specialisation ...

Parallel loops

DPH on multicores

Evaluation strategy after vectorisation

operations are data parallel and flat

executed by a gang of worker threads

essentially fork-join parallelism

mapP :: (a → b) → Array a → Array b
mapP f xs = <split xs across workers>
            <map f over each chunk>
            <collect local results>

f is sequential

sumP :: Num a ⇒ Array a → a
sumP xs = <split xs across workers>
          <sum each chunk>
          <reduce local sums>

sumsqP = sumP . mapP square

sumsqP xs = <split xs across workers>
            <map square over each chunk>
            <collect local results>
            <split results across workers>
            <sum each chunk>
            <reduce local sums>

Distributed types

Idea: let’s make the evaluation strategy explicit! (Keller 1999)

data Dist a        a is distributed across threads
Dist (Array a)     each thread has a local array (chunk)
Dist Double        each thread has a local Double

splitD    distribute an array across threads
joinD     collect thread-local chunks
mapD      execute a sequential operation in each thread
sumD      compute sum of local values

splitD :: Array a → Dist (Array a)
joinD  :: Dist (Array a) → Array a
mapD   :: (a → b) → Dist a → Dist b
sumD   :: Num a ⇒ Dist a → a

Programming with distributed types

mapP f xs = <split xs across workers>
            <map f over each chunk>
            <collect local results>

mapP f = joinD           -- collect
       . mapD (map f)    -- map f over chunks
       . splitD          -- split

sumP xs = <split xs across workers>
          <sum each chunk>
          <reduce local sums>

sumP = sumD        -- reduce
     . mapD sum    -- sum each chunk
     . splitD      -- split

Fusing distributed types

sumsqP = sumP . mapP square

sumsqP = sumD                    -- reduce
       . mapD sum                -- sum each chunk
       . splitD                  -- split
       . joinD                   -- collect
       . mapD (map square)       -- map square over chunks
       . splitD                  -- split

splitD (joinD xs) = xs

sumsqP = sumD                    -- reduce
       . mapD sum                -- sum each chunk
       . mapD (map square)       -- map square over chunks
       . splitD                  -- split

mapD f (mapD g xs) = mapD (f . g) xs

sumsqP = sumD                        -- reduce
       . mapD (sum . map square)     -- work
       . splitD                      -- split

the sequential work (sum . map square) is then fused by stream fusion
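The effect of the two rules can be checked with a small sequential model. Everything below is an assumption made for illustration: `Dist` is simulated as a list holding one value per worker, lists stand in for arrays, `workers` is a hypothetical fixed gang size, and `sumsqUnfused` and `sumsqFused` are names introduced here. Real DPH would run `mapD` on a gang of threads.

```haskell
-- Assumption: Dist is simulated sequentially as one value per worker.
newtype Dist a = Dist [a]

workers :: Int
workers = 4   -- hypothetical gang size

-- Distribute a list across the workers in contiguous chunks.
splitD :: [a] -> Dist [a]
splitD xs = Dist (chunks xs)
  where
    size = max 1 ((length xs + workers - 1) `div` workers)
    chunks [] = []
    chunks ys = let (c, rest) = splitAt size ys in c : chunks rest

-- Collect the thread-local chunks.
joinD :: Dist [a] -> [a]
joinD (Dist cs) = concat cs

-- Run a sequential operation on every worker's local value.
mapD :: (a -> b) -> Dist a -> Dist b
mapD f (Dist xs) = Dist (map f xs)

-- Reduce the local values to a single result.
sumD :: Num a => Dist a -> a
sumD (Dist xs) = sum xs

mapP :: (a -> b) -> [a] -> [b]
mapP f = joinD . mapD (map f) . splitD

sumP :: Num a => [a] -> a
sumP = sumD . mapD sum . splitD

square :: Int -> Int
square x = x * x

-- Before fusion: split, work, join, split again, work, reduce.
sumsqUnfused :: [Int] -> Int
sumsqUnfused = sumP . mapP square

-- After splitD/joinD and mapD/mapD: one split, one pass per chunk, one reduce.
sumsqFused :: [Int] -> Int
sumsqFused = sumD . mapD (sum . map square) . splitD
```

Both pipelines compute the same number. In this model, applying splitD/joinD can change how elements are grouped into chunks, but not the final result, which is what makes the rule safe to use for a reduction like this one.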

Distributed types on multicores

data Dist a    a is distributed across threads

splitD    distribute xs across threads
joinD     collect thread-local chunks
mapD      execute a sequential operation in each thread

splitD/joinD    eliminate communication
mapD/mapD       eliminate synchronisation

Distributed types on clusters

data Dist a    a is distributed across nodes

splitD    scatter
joinD     gather
mapD      execute operation on each node

splitD/joinD    eliminate communication
mapD/mapD       eliminate synchronisation

Distributed types on GPUs

data Dist a    a is in GPU memory

splitD    CPU → GPU transfer
joinD     GPU → CPU transfer
mapD      execute kernel on the GPU

splitD/joinD    eliminate memory transfers (communication)
mapD/mapD       fuse kernels (synchronisation)

Distributed types – summary

encode parallel loops as split/work/join

eliminate unnecessary split/join pairs

fuse sequential work (stream fusion)

very general mechanism for fusing parallel computations

applicable to a wide range of architectures

again, no specialised compiler support

Obligatory benchmark

[Figure: two runtime plots comparing the Haskell and C versions of sumsq, dotp and smvm. The first has x-axis ticks 1, 2, 4, 8; the second, labelled "Runtime @ greyarea", has x-axis ticks 1, 2, 4, 8, 16, 32, 64.]

Parting thoughts

it’s nice, it’s easy to use, it works

high-level functional programs compiled to highly efficient code

even parallel ones!

rewrite rules + great optimiser = win

DPH doesn’t require any special-purpose optimisations

try this in an imperative language...

Stream fusion: dph, bytestring, vector, uvector

Distributed types: dph

Parting thoughts

don't try this in an imperative language...
