
Page 1

Beyond Auto-Parallelization: Compilers for Many-Core Systems

Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc

Moore for Less Keynote - September 2008

Page 2

Compilers for Parallel Computers (Today)

Auto-parallelizing compilers
– “Holy grail”: convert sequential programs into parallel programs with little or no user intervention
– Only partial success, despite decades of work
– No performance debugging tools

For explicitly parallel languages/annotations (e.g., OpenMP, Java Threads; a minimal sketch follows below)
– Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
– Secondary goal: perform simple optimizations specific to parallel execution
– Simple correctness and performance debugging tools
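
As a concrete illustration of the explicitly parallel case, here is a minimal OpenMP loop in C; the array names and loop body are made up for illustration. The annotation asserts that the iterations are independent, so the compiler's job is to map them onto hardware/OS threads rather than to prove independence (with GCC this would be compiled with -fopenmp).

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    /* The pragma asserts the iterations are independent; the compiler maps
     * them onto threads instead of having to prove independence itself.  */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;
    printf("a[0] = %f\n", a[0]);
    return 0;
}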

Page 3

Compilers for Parallel Computers (Future)

Data flow/dependence analysis tools – unsafe/speculative
– Probabilistic approaches
– Profile-based approaches

Multithreading-specific optimization toolbox
– Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))

Auto-parallelizing compilers – with speculation
– Thread-level speculation (TLS)
– Helper threads

Holistic parallelizing tool chain

Page 4

Why Be Speculative?

Performance of programs is ultimately limited by control and data flows

Most compiler optimizations exploit knowledge of control and data flows

Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
– True for both sequential and parallel optimizations

Future compiler optimizations must rely on incomplete knowledge: speculative execution

Page 5

Compilers for Parallel Computers (Future)

[Toolchain diagram connecting a dependence/flow analysis tool (marked unsafe), a parallelizing compiler, and an auto-TLS compiler; code is labelled Seq., P-way parallel, and <P-way parallel, with TLS/TM as the execution substrate.]

Page 6

Outline

Context and Motivation
History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs
Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT’04, TACO’07)

Page 7

Data Dependence Analysis for Arrays

Based on mathematical evaluation of array index expressions within loop nests

Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions (a minimal GCD-test sketch follows below)

Coupled with mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
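
As a rough illustration of how such tests work, below is a minimal sketch of the GCD test in C for two affine references A[a*i + b] and A[c*j + d]; the function names are made up, and a real dependence analyzer also uses loop bounds, direction vectors, and stronger tests such as Banerjee's.

#include <stdio.h>

/* Greatest common divisor (returned non-negative). */
static int gcd(int x, int y)
{
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x < 0 ? -x : x;
}

/* GCD test: a dependence between A[a*i + b] and A[c*j + d] requires an
 * integer solution of a*i - c*j = d - b, which exists only if gcd(a, c)
 * divides (d - b).  Returns 1 if a dependence MAY exist; the test ignores
 * loop bounds, so it can only disprove a dependence, never prove one.   */
static int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)                      /* both coefficients zero */
        return b == d;
    return (d - b) % g == 0;
}

int main(void)
{
    /* A[2*i] vs. A[2*i + 1]: gcd(2,2) = 2 does not divide 1 -> independent */
    printf("%d\n", gcd_test_may_depend(2, 0, 2, 1));
    /* A[2*i] vs. A[4*j + 2]: gcd(2,4) = 2 divides 2 -> may depend          */
    printf("%d\n", gcd_test_may_depend(2, 0, 4, 2));
    return 0;
}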

Page 8

Data Dependence Analysis for Arrays

What’s wrong with traditional data dependence analysis?
– Not all index expressions are affine or even statically defined (e.g., subscripted subscripts; illustrated in the sketch below)
– Not all loops are well structured (e.g., conditional exits, complex control flow)
– Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access)
– Not all applications make intense use of arrays and loop nests (e.g., they use trees, hash tables, linked lists)
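
A minimal C example of the first two problems (the function and array names are hypothetical): the subscript A[idx[i]] is not an affine function of i, and the conditional break makes the trip count unknown, so affine dependence tests cannot decide whether iterations are independent.

#include <stdio.h>

/* Subscripted subscripts plus a conditional exit: the index expression
 * A[idx[i]] is only known at run time, and the break makes the trip count
 * unknown, so affine tests such as GCD or Banerjee cannot decide whether
 * iterations are independent.                                            */
static void update(double *A, const int *idx, const double *v, int n, double limit)
{
    for (int i = 0; i < n; i++) {
        if (v[i] > limit)            /* conditional exit */
            break;
        A[idx[i]] += v[i];           /* non-affine, statically unknown index */
    }
}

int main(void)
{
    double A[4]   = {0.0, 0.0, 0.0, 0.0};
    int    idx[3] = {2, 0, 2};       /* repeated index -> a real dependence */
    double v[3]   = {1.0, 2.0, 3.0};
    update(A, idx, v, 3, 10.0);
    printf("A[2] = %f\n", A[2]);     /* 4.0 */
    return 0;
}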

Page 9

Data Dependence Analysis for Irregular Programs

Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)

There is no comprehensive data dependence analysis framework for irregular applications (see the pointer-chasing sketch below)
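
For instance, in the hedged sketch below (types and names made up) the loop walks a linked list: there are no array subscripts to analyze, and independence of iterations depends on whether the list nodes are distinct, which is a question for pointer/shape analysis rather than index arithmetic.

#include <stdio.h>

/* Irregular code: the "loop index" is a pointer chase, so dependence depends
 * on the shape of the data structure, not on index arithmetic.             */
struct node { double val; struct node *next; };

static void scale_list(struct node *head, double f)
{
    for (struct node *p = head; p != NULL; p = p->next)
        p->val *= f;        /* iterations independent only if nodes are distinct */
}

int main(void)
{
    struct node c = {3.0, NULL}, b = {2.0, &c}, a = {1.0, &b};
    scale_list(&a, 2.0);
    printf("%f %f %f\n", a.val, b.val, c.val);   /* 2 4 6 */
    return 0;
}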

Page 10

Outline

Context and Motivation
History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs
Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT’04, TACO’07)

Page 11

Thread Level Speculation (TLS)

Assume no dependences and execute threads in parallel
While speculating, buffer speculative data separately
Track data accesses and monitor cross-thread violations
Squash offending threads and restart them
All this can be done in hardware, software, or a combination

Example loop:

for (i = 0; i < 100; i++) {
  … = A[L[i]] + …
  A[K[i]] = …
}

Figure: iteration J reads A[4] and writes A[5]; iteration J+1 reads A[2] and writes A[2]; iteration J+2 reads A[5] and writes A[6]. The write to A[5] in iteration J and the read of A[5] in iteration J+2 form a cross-thread RAW dependence: if iteration J+2 reads A[5] before iteration J writes it, iteration J+2 is squashed and restarted. (A minimal single-threaded simulation of this model follows below.)
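
To make the execution model concrete, here is a minimal, single-threaded C simulation of it; the batch size, the contents of L and K, the lack of value forwarding, and the squash-all-later-threads policy are simplifying assumptions, not the actual hardware or compiler scheme.

#include <stdio.h>
#include <string.h>

#define N    4      /* speculative iterations in one batch (assumed) */
#define SIZE 8      /* size of the shared array A (assumed)          */

static int A[SIZE];
static const int L[N] = {4, 2, 5, 1};   /* read  indices, as in the figure */
static const int K[N] = {5, 2, 6, 1};   /* write indices, as in the figure */

typedef struct {
    int buf[SIZE];        /* speculative write buffer */
    int wset[SIZE];       /* write set                */
    int rset[SIZE];       /* read set                 */
} SpecThread;

static void run_iteration(SpecThread *t, int i)
{
    memset(t, 0, sizeof(*t));
    t->rset[L[i]] = 1;
    int tmp = A[L[i]] + 1;         /* ... = A[L[i]] + ... */
    t->wset[K[i]] = 1;
    t->buf[K[i]] = tmp;            /* A[K[i]] = ...       */
}

int main(void)
{
    SpecThread th[N];
    int done[N] = {0}, finished = 0;

    while (!finished) {
        for (int i = 0; i < N; i++)            /* speculative "parallel" phase */
            if (!done[i]) run_iteration(&th[i], i);

        int squashed = 0;
        for (int i = 0; i < N && !squashed; i++) {   /* in-order commit phase */
            if (done[i]) continue;
            for (int j = i + 1; j < N; j++)
                for (int a = 0; a < SIZE; a++)
                    if (th[i].wset[a] && th[j].rset[a])
                        squashed = 1;          /* cross-thread RAW violation  */
            for (int a = 0; a < SIZE; a++)     /* commit i's write buffer     */
                if (th[i].wset[a]) A[a] = th[i].buf[a];
            done[i] = 1;
            if (squashed)
                printf("RAW violation: squash iterations after %d and restart\n", i);
        }

        finished = 1;
        for (int i = 0; i < N; i++) finished &= done[i];
    }
    printf("A[5] = %d, A[6] = %d\n", A[5], A[6]);   /* 1 and 2 */
    return 0;
}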

Page 12

TLS Overheads

Squash & restart: re-executing the squashed threads
Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
Load imbalance: a processor waits for its thread to become non-speculative in order to commit

Page 13

Coping with overheads: Cost Model!

Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS
Speculative parallelization can deliver significant speedup or slowdown
– Several speculation overheads
– Overheads are hard to estimate (e.g., will a thread be squashed?)
A prediction of the speedup value can be useful, e.g., in a multi-tasking environment:
– program A wants to run speculatively in parallel on 4 cores (predicted speedup 1.8)
– other programs are waiting to be scheduled
– the OS decides it does not pay off (a toy version of this decision is sketched below)
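
A toy version of the scheduling decision above; the 50% efficiency threshold and the function name are assumptions made purely for illustration, not the policy described in the talk.

#include <stdio.h>

static int worth_running_speculatively(double predicted_speedup, int cores,
                                       int programs_waiting)
{
    double efficiency = predicted_speedup / cores;   /* useful work per core */
    /* If other programs are waiting, demand at least 50% efficiency (assumed). */
    return programs_waiting > 0 ? efficiency >= 0.5 : predicted_speedup > 1.0;
}

int main(void)
{
    /* Program A: predicted speedup 1.8 on 4 cores while others wait
     * -> efficiency 0.45, so the OS declines to give it all 4 cores. */
    printf("%s\n", worth_running_speculatively(1.8, 4, 2)
                       ? "run speculatively on 4 cores"
                       : "do not parallelize");
    return 0;
}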

Page 14

TLS Overheads

Squash & restart: re-executing the squashed threads
– Hard to estimate because violations are highly unpredictable
Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
– Hard to estimate because write sets are somewhat unpredictable
Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
– Hard to estimate because write sets are somewhat unpredictable
Load imbalance: a processor waits for its thread to become non-speculative in order to commit
– Hard to estimate because workloads are very unpredictable and order matters due to the in-order commit requirement

Page 15

Our Compiler Cost Model: Highlights

First fully static compiler cost model for TLS
Can handle all TLS overheads in a single framework
– Including load imbalance, which is not handled by any other cost model
Produces not only a qualitative (“good” or “bad”) assessment of the TLS benefits but a quantitative value (i.e., expected speedup/slowdown); a minimal sketch of such an estimate follows below
Can be easily integrated into most compilers at the intermediate representation level
Simple and fast to compute
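
A minimal sketch of the kind of quantitative estimate such a model produces; the formula and the parameter names below are illustrative assumptions, not the actual PACT’04/TACO’07 model.

#include <stdio.h>

/* Illustrative-only parameters of a static TLS speedup estimate. */
typedef struct {
    double seq_time;       /* estimated sequential execution time of the loop */
    int    num_threads;    /* number of speculative threads (cores)           */
    double spawn_commit;   /* per-thread dispatch + commit overhead           */
    double squash_prob;    /* estimated probability that a thread is squashed */
    double imbalance;      /* estimated wait due to in-order commit           */
} TlsEstimate;

static double predicted_speedup(const TlsEstimate *e)
{
    double per_thread = e->seq_time / e->num_threads;
    /* Assume a squashed thread re-executes its work roughly once. */
    double parallel_time = per_thread * (1.0 + e->squash_prob)
                         + e->spawn_commit + e->imbalance;
    return e->seq_time / parallel_time;
}

int main(void)
{
    TlsEstimate e = { 100.0, 4, 5.0, 0.2, 10.0 };   /* made-up numbers */
    printf("predicted speedup: %.2f\n", predicted_speedup(&e));  /* ~2.22 */
    return 0;
}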

Page 16

Speedup Distribution

Very varied speedup/slowdown behavior

[Bar chart: fraction of loops (%) in each speedup bucket (0.5<S≤1, 1<S≤2, 2<S≤3, 3<S≤4) for mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, and the average; y-axis 0–80%.]

Page 17

Model Accuracy (I): Outcomes

[Bar chart: fraction of loops (%) per prediction outcome (predict speedup/actual speedup, predict slowdown/actual slowdown, predict speedup/actual slowdown, predict slowdown/actual speedup) for mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, and the average; y-axis 0–120%.]

Only 17% false positives (performance degradation)
Negligible false negatives (missed opportunities)
Most speedups/slowdowns are correctly predicted by the model

Page 18

Current Developments

Done:
– Completed implementation of a TLS code generator in GCC
Doing:
– Implementing the cost model in this TLS GCC
– Profiling TLS program behavior (with IBM and U. of Manchester)
To do:
– Develop hybrid cost models based on static and profile information
– Develop “intelligent” cost models based on machine learning (with U. of Manchester)

Page 19

Summary

Paraphrasing M. Snir† (UIUC): “parallel programming will have to become synonymous with programming”
However, we will need:
– Better (and unsafe) data dependence analysis tools
– Explicit (and speculative) parallel programming models
– Auto-parallelizing (speculative) compilers
Much work still needs to be done. At U. of Edinburgh:
– Auto-parallelizing TLS compilers
– TLS hardware
– STM (software TM)

† Director of Intel+Microsoft’s UPCRC

Page 20

Acknowledgments

Research Team and Collaborators
– Jialin Dou
– Salman Khan
– Polychronis Xekalakis
– Nikolas Ioannou
– Fabricio Goes
– Constantino Ribeiro
– Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
– Prof. Diego Llanos (U. of Valladolid)

Funding
– UK EPSRC: GR/R65169/01, EP/G000697/1

Page 21

Beyond Auto-Parallelization: Compilers for Many-Core Systems

Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc