Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

▣ Naums Mogers

Christophe Dubach

ARM Research Summit, September 2019

Functional Interface forPerformance Portability on Parallel Accelerators

Hardware accelerators

Architectures

Scopes

Applications

2Requirements

CPU GPU

CGRA

FPGA

ASIC

ML

Security

Server Desktop Mobile Embed

Vision

DBGraph

Energy Mem

Hardware accelerators

Architectures

Scopes

Applications

3Requirements

CPU GPU

CGRA

FPGA

ASIC

ML

Security

Server Desktop Mobile Embed

Vision

DBGraph

Energy Mem

Applications

Current landscape

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

Applications

Image Processing

Neural Networks

Graph Analytics

Data-ParallelIntermediate Language

Performance PortableCode Generator

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

Domain-Specific Languages (DSLs)

Language for Parallelism

Compiler Technology

What we need

6

What is the right interface for HW accelerators?

7

What is the right interface for HW accelerators?

Functional approach

Abstract

▣ Expresses algorithm (WHAT), not implementation (HOW)

8

Safe

▣ Easier to use and parallelise

High-level

▣ Captures plenty of algorithmic meta-info for analysis

Expressive

▣ Control flow

▣ Memory management

Pure

▣ Easy to transform

Composable

▣ Easier to maintain, code-reuse

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

Applications

Image Processing

Neural Networks

Graph Analytics

Lift: Data-ParallelIntermediate Language

Lift: Rewrite rule-based compilerto HW-specific languages

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

Domain-Specific Languages (DSLs)

Language for Parallelism

Compiler Technology

Lift

10

Lift www.lift-project.org

▣ Functional data-parallel language and compiler

Lift IR www.lift-project.org

toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh

Address space operators

Casters Data operators

Int, Float

Vector

Array

Data types

Map, Reduce

Zip, Split

Scatter, Gather

Slide

Algorithmic patterns

12

Lift IR: Views www.lift-project.org

Reorder(stride(s)) >> Map(f)

▣ Virtual composable data layout transformations□ Reorder, Transpose, Slide, Slice, etc

▣ Expressed with Views▣ Help avoid extra memory writes

NO WRITES TO MEMORY

TRANSFORMED READS


toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh



Int, Float

Vector

Array

Data types

Map, Reduce

Zip, Split

Scatter, Gather

Slide


14




IR level

toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh

Generic

DSL

conv, lstm

blur, sharpen

select

toDRAM

toSRAM

toRegistor

MapGlobalMapLocalReduceSeq

toInt8toFloat16

VVMulMVMulVTanh

Platform-specific

Map, Reduce

Zip, Split

Scatter, Gather

Slide


Int, Float

Vector

Array

Data types

Vector

Matrix

Tensor

Int8Float16Float32

15

How do we achieve performance portability?

Lift: Rewrite Rules

▣ Express algorithmic implementation choices▣ Preserve semantic correctness▣ Leverage algorithmic info

▣ Decouples optimisation from code generation16

Split-join rule Map fusion rule GEMV rule

Rewrite rules

17

Lift: Rewrite Rules

IR level

▣ Split-join rule

▣ Map fusion rule

▣ Reduce rules

▣ ...

Generic

DSL

▣ Algorithm choices for high-level primitives

▣ Precision level

▣ ...

Platform-specific

▣ Using built-ins

▣ Lowering to the platform programming model

▣ ...

18

Lift: rewriting

HOW TO OPTIMISE?

19

Lift: rewriting

HARD STARTING POINT

20

Lift: rewriting Search

21

Lift: rewriting

Exploitation

Search

Built-in primitive

22

Lift: rewriting

Exploitation

Search

Built-in primitive

Parallelisation choice

23

Lift: rewriting

Exploitation

Code generation

Search

Built-in primitive

Parallelisation choice

Lift: Rewrite Rules

▣ Domain-specific and generic▣ Reusable▣ Provably correct▣ Self-contained, extensible

24

Lift: Constraint Inference

▣ Required for valid search space generation when using tuning parameters

▣ Leverages algorithmic meta-info▣ Can express heuristics and HW restrictions

25

Lift: Search Space Exploration

▣ Uniform random sampling▣ Predictor models▣ Genetic algorithms▣ …

26

Lift: Research Directions

▣ Linear algebra▣ Sparse data parallelism▣ Optimising reductions▣ Stencil computations▣ 3D wave modelling▣ High-level synthesis for FPGAs▣ Machine Learning

27

Lift for Machine Learning

▣ Machine Learning□ Convolution inference optimisation□ Platforms: Mali GPUs, BrainWave

28



29



30


Naums Mogers, PhD student, Edinburgh

How to best exploit HW accelerators?

31

Christof Schlaak, PhD student, Edinburgh

How to generate accelerator architectures?

Lu Li, Postdoctoral Researcher, Edinburgh

How to optimise the host code?How to drive the rewriting process?

Christophe Dubach, Reader, Edinburgh

All of the above

32

Lift source code is published

https://github.com/lift-project/lift

References(icons) Noun Project, https://thenounproject.com(icons) Font Awesome, https://fontawesome.com

http://www.lift-project.org

https://github.com/lift-project/lift

https://thenounproject.com/

https://fontawesome.com/

http://www.lift-project.org/

Documents

Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel