The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program

N. Vasilache, R. Lethin

Reservoir Labs

Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved. Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation. This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

• Government Purpose Rights
• Purchase Order Number: N/A
• Agreement No.: HR001-10-3-0007
• Contractor Name: Intel Corporation
• Contractor Address: 2111 NE 25th Ave M/S JF2-60, Hillsboro, OR 97124
• Expiration Date: None
• The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1), (3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government: University of Delaware – www.udel.edu; ET International – www.etinternational.com; Intel Corporation – www.intel.com; Reservoir Labs – www.reservoir.com; University of California, San Diego – www.ucsd.edu; University of Illinois at Urbana-Champaign – www.illinois.edu.


Outline

• R-Stream Overview

• UHPC Goals

• Some Performance Results


Power Efficiency Driving Architectures

[Figure: four repeated blocks, each combining a GPP, SIMD units, memory, DMA, and an FPGA]

• Heterogeneous processing
• Distributed local memories
• Explicitly managed architecture
• NUMA
• Bandwidth starved
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
• Multiple spatial dimensions


Computation Choreography

• Expressing it in the program:
  • Annotations and pragma dialects for C
  • Chapel subset (UHPC in progress with UIUC)
  • CnC subset (UHPC in progress with Intel)
• Generating it:
  • Explicitly (e.g., new languages like CUDA, target specific)
  • Implicitly (UHPC in progress: libraries, runtime abstractions CnC)
• But before expressing it, how can programmers find it?
  • Manual constructive procedures, art, sweat, time
    – Artisans get complete control over every detail
  • Fully-automatic
    – Operations research problems and (advanced) autotuning
    – Faster, sometimes better, than a human

[Slide callout: Not our focus]


Program Transformations Specification

[Figure: iteration space of a statement S(i, j), with axes i and j, mapped by a function $\mathbb{Z}^2 \to \mathbb{Z}^2$ to a time space with axes t1 and t2]

• Schedule maps iterations to multi-dimensional time:
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space:
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space:
  • UHPC in progress
• Hierarchical by design, tiling serves separation of concerns
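
The feasibility condition can be illustrated with a minimal sketch, assuming uniform dependences (constant distance vectors) and a schedule given as a 2x2 integer matrix. The names `schedule2d`, `dep2d` and `schedule_is_feasible` are hypothetical and are not part of R-Stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration: (t1, t2) = T * (i, j) for a 2-D loop nest. */
typedef struct { int t[2][2]; } schedule2d;

/* Uniform dependence: sink iteration minus source iteration. */
typedef struct { int di, dj; } dep2d;

/* A time vector is lexicographically positive when its first nonzero
   component is positive. */
static bool lex_positive(int t1, int t2) {
    if (t1 != 0) return t1 > 0;
    return t2 > 0;
}

/* The schedule preserves a dependence d when T*d is lexicographically
   positive, i.e. the sink is scheduled strictly after the source. */
bool schedule_is_feasible(const schedule2d *s, const dep2d *deps, size_t ndeps) {
    for (size_t k = 0; k < ndeps; k++) {
        int t1 = s->t[0][0] * deps[k].di + s->t[0][1] * deps[k].dj;
        int t2 = s->t[1][0] * deps[k].di + s->t[1][1] * deps[k].dj;
        if (!lex_positive(t1, t2)) return false;
    }
    return true;
}
```

With dependences (1, 0) and (0, 1), both the identity schedule and the skewed schedule (t1, t2) = (i + j, j) pass this check, while reversing the i loop does not.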


Polyhedral Slogans

• Parametric imperfect loop nests

• Subsumes classical transformations

• Compacts the transformation search space

• Parallelization, locality optimization (communication avoiding)

• Preserves semantics

• Analytic joint formulations of optimizations

• Not just for affine static control programs


R-Stream Blueprint

[Figure: R-Stream compiler blueprint. Components: EDG C front end; CnC / Chapel front end (UHPC in progress); scalar representation; raising; extended representation; polyhedral mapper; machine model; lowering; pretty printer (CUDA, C + annotations, pthreads, …). Abstraction levels range from CnC (high-level) to C (low-level).]


Mapping Process for Explicitly Managed Memories

1- Scheduling: parallelism, locality, tilability
2- Task formation:
- Coarse-grain atomic tasks
- Master/slave side operations
3- Placement: assign tasks to blocks/threads

- Local / global data layout optimization
- Multi-buffering (explicitly managed)
- Synchronization (barriers)
- Bulk communications
- Thread generation -> master/slave
- Target-specific optimizations

[Figure label: Dependencies]
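
As an illustration of explicitly managed multi-buffering with bulk communications, here is a minimal sequential sketch. It assumes a tile size `TILE` and a helper `dma_get` that stands in for an asynchronous bulk transfer. Both are hypothetical, and this is not code generated by R-Stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4  /* hypothetical scratchpad tile size, in elements */

/* Stand-in for a bulk transfer from global to local memory. On real
   hardware this would be asynchronous; here it is a plain copy. */
static void dma_get(double *local, const double *global, size_t n) {
    memcpy(local, global, n * sizeof *local);
}

/* Scales n elements of `global` by alpha, computing only on local buffers. */
void scale_double_buffered(double *global, size_t n, double alpha) {
    double buf[2][TILE];
    size_t ntiles = (n + TILE - 1) / TILE;
    if (ntiles == 0) return;

    /* Prologue: fetch the first tile. */
    dma_get(buf[0], global, n < TILE ? n : TILE);

    for (size_t t = 0; t < ntiles; t++) {
        size_t cur = t % 2, nxt = (t + 1) % 2;
        size_t off = t * TILE;
        size_t len = (n - off < TILE) ? n - off : TILE;

        /* Issue the transfer of tile t+1 before computing on tile t. */
        if (t + 1 < ntiles) {
            size_t noff = off + TILE;
            size_t nlen = (n - noff < TILE) ? n - noff : TILE;
            dma_get(buf[nxt], global + noff, nlen);
        }

        for (size_t i = 0; i < len; i++) buf[cur][i] *= alpha;

        /* Bulk write-back of the finished tile. */
        memcpy(global + off, buf[cur], len * sizeof(double));
    }
}
```

On a machine with real DMA engines, a wait or barrier would sit between issuing the transfer and first using the buffer. The sequential copy makes that synchronization implicit here.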


Model for Scheduling Trades 3 Objectives Jointly

[Figure: trade-off between loop fusion and loop fission. Labels: more parallelism; more locality; sufficient occupancy; memory contiguity (+ successive thread contiguity); fewer global memory accesses; better effective bandwidth]

Patent pending


Inside the R-Stream Mapper

[Figure: mapper architecture. Components: Jolylib, …; extended GDG representation; tactics module]

Optimization modules:
• Parallelization
• Locality optimization
• Tiling
• Placement
• Comm. generation
• Memory promotion
• Sync generation
• Layout optimization
• Polyhedral scanning
• …

Optimization modules engineered to expose advanced “knobs” used by auto-tuner


Optimization Across BLAS Calls

Optimization with BLAS:

    for loop {
        …
        BLAS call 1
        …
        BLAS call 2
        …
        BLAS call n
        …
    }

Annotations: outer loop(s); retrieve data Z from disk; store data Z back to disk; retrieve data Z from disk !!!; numerous cache misses

VS.

Global optimization:

    doall loop {
        …
        for loop {
            …
            [read from Z]
            …
            [write to Z]
            …
            [read from Z]
        }
        …
    }

Annotations: loop fusion can improve locality; can parallelize outer loop(s)

→ Global optimization exposes better parallelism and locality (significant speedups)
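
The contrast can be sketched in C under simplifying assumptions: `vec_scale` and `vec_add` are hand-written stand-ins for library calls (they are not the real BLAS interface), and the computation is element-wise, so fusion and loop interchange are legal. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-ins for two library calls; not the real BLAS API. */
static void vec_scale(double *z, size_t n, double a) {
    for (size_t i = 0; i < n; i++) z[i] *= a;
}
static void vec_add(double *z, const double *x, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] += x[i];
}

/* Call-by-call version: every step sweeps all of Z twice, so a large Z
   is evicted and re-fetched between calls. */
void update_by_calls(double *z, const double *x, size_t n, int steps, double a) {
    for (int t = 0; t < steps; t++) {
        vec_scale(z, n, a);
        vec_add(z, x, n);
    }
}

/* Globally optimized version: the calls are fused and the element loop is
   moved outward. Its iterations are independent (a doall loop), and each
   element of Z is read once and written once. */
void update_fused(double *z, const double *x, size_t n, int steps, double a) {
    for (size_t i = 0; i < n; i++) {          /* doall */
        double zi = z[i];
        for (int t = 0; t < steps; t++)
            zi = a * zi + x[i];
        z[i] = zi;
    }
}
```

A compiler that treats each library call as opaque cannot derive the second form, which is why mapping across calls requires visibility into (or a model of) what each call reads and writes.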


Outline

• UHPC Goals


Codelets From an HLC Perspective

• Codelets have:
  • Fine granularity
  • Explicit communication
  • Point to point, other kinds of synchronization
• Can utilize scheduling and dependence information hints
• Should also use placement of data and computation hints
• Work from local scratch pad memories
• Good match for UHPC hardware, allows good control for energy, resilience, etc.
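
To make these properties concrete, here is a minimal single-threaded sketch of a codelet that fires once all of its inputs have arrived. The `codelet` record and `codelet_put` are hypothetical illustrations and do not describe an actual UHPC runtime interface. A real runtime would make the counter update atomic.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4

/* Hypothetical codelet record: a fine-grained task that fires once all of
   its input dependences have been signalled. */
typedef struct codelet {
    int deps_remaining;              /* point-to-point sync counter */
    void (*run)(struct codelet *);   /* body, works on local data only */
    double local[SLOTS];             /* stand-in for scratchpad storage */
    double result;
    int fired;
} codelet;

/* Explicit communication: deposit a value in the consumer's local slot,
   then signal it. The codelet runs when its last dependence arrives. */
void codelet_put(codelet *c, size_t slot, double value) {
    assert(slot < SLOTS && c->deps_remaining > 0);
    c->local[slot] = value;
    if (--c->deps_remaining == 0) {
        c->run(c);
        c->fired = 1;
    }
}

/* Example body: adds its two inputs. */
void sum_body(codelet *c) {
    c->result = c->local[0] + c->local[1];
}
```

A codelet initialized with `deps_remaining = 2` and `run = sum_body` stays idle after the first `codelet_put` and fires on the second. The dependence counts and slot assignments are the kind of information a high-level compiler could emit as hints.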


UHPC From an HLC Perspective

• Energy: must minimize data motion/communication
• Near Threshold Voltage: must find even more parallelism
• Resilience: synergy needed with new checkpointing/recovery models
• Self awareness: dynamic distributed feedback and regulation


Another Observation

• But programming directly in codelets is impractical:
  • Exposing machine details is a good thing, but we don’t want programmers to manage them.
    – Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism x locality x resilience…)
• Writing directly in codelets will also over-specify the program, bake it to one machine, and defeat portability
• Role of HLC is to take high level abstractions from the programmer
  – sequential code,
  – Chapel, CnC,
  – data-parallel idioms,
  – math language
• Perform optimization to various levels of the target hardware hierarchy


Based on R-Stream Technology

• Energy
  – Existing: locality opt; explicit comm gen; map to accelerators; hierarchical barriers
  – New for codelets: deep hierarchical scheduling; point to point sync; data placement opts
• More parallelism
  – Existing: exact dependence; imperfect loops
  – New for codelets: dynamic schedules and placements; emit scheduling and placement hints
• Resilience
  – Emit interaction sets; ABFT support; memory reuse opt; checkpointing opt
• High level programming
  – Existing: sequential C
  – New for codelets: Chapel, CnC, math, data parallel idioms
• Self-awareness
  – Dynamic mappings


Goal: Generating CnC

• Assume a mapping from CnC -> Codelets
• Advantages of CnC
  • More succinct expression of parallelism (the skewing problem)
  • Adaptable parallelism and load balancing
  • High-level representation of data parallel idioms
    – CnC helps solve the irregular, idiomatic part of the problem
    – R-Stream can target optimizations across irregular idioms
• Easy to test for correctness of generated code and execute efficiently on x86 / clusters.


Goal: Synergy with CnC

• Represent CnC action-attribute graphs explicitly in R-Stream
• Benefit from optimization across multiple CnC steps
• Explore tradeoff between fusing steps and running them in parallel:
  – Fused steps reduce the runtime overhead
  – And also the memory footprint
• Generate many semantically equivalent versions and explore the design space tradeoffs
  – R-Stream auto-tuning mode will help a lot here


Goal: Synergy with Chapel and UIUC

• Extensions to blackboxing:
  • User interface, can represent any program
  • Supports even linking with precompiled code
• Integrate user-specific data distributions within R-Stream
  • HTAs
  • Locales
• Find the right abstraction
  • The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user choices
• Iterative, feedback-directed design
  • Language / transformation tool
  • Transformation tool / Runtime
  • Language / Runtime


Goal: Pragmatic Approach

• Support multiple kinds of placement:
  • Explicit / implicit; virtual / physical; linear / cyclic / block cyclic / general
• Build on R-Stream’s current over-provisioning for performance:
  • Originally built for CUDA performance
  • Concepts extend to any architecture with dynamic scheduling decisions
  • Has implications on locality/communication granularity
  • Examine implications on power
  • Use advanced auto-tuning features for design space exploration
• Explore which modes perform best with CnC:
  • Dependent on how over-provisioning is implemented
• Over-provisioning (may) have implications on memory persistence:
  • Opportunities / loss of high-level reuse and communication optimizations
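
The linear, cyclic and block-cyclic placement kinds can be written as simple owner functions. This is a minimal sketch with hypothetical names, reading "linear" as a contiguous block distribution, which is an assumption. It maps a one-dimensional index `i` to one of `p` processing elements.

```c
#include <assert.h>
#include <stddef.h>

/* Linear (contiguous block) placement: n iterations split into p chunks. */
size_t owner_block(size_t i, size_t n, size_t p) {
    assert(p > 0 && i < n);
    size_t chunk = (n + p - 1) / p;   /* ceil(n / p) */
    return i / chunk;
}

/* Cyclic placement: iterations dealt out round-robin. */
size_t owner_cyclic(size_t i, size_t p) {
    assert(p > 0);
    return i % p;
}

/* Block-cyclic placement: blocks of b iterations dealt out round-robin. */
size_t owner_block_cyclic(size_t i, size_t b, size_t p) {
    assert(b > 0 && p > 0);
    return (i / b) % p;
}
```

For 16 iterations on 4 processing elements, iteration 5 lands on element 1 under block placement, element 1 under cyclic placement, and element 2 under block-cyclic placement with blocks of 2. Over-provisioning corresponds to choosing more virtual elements than physical ones and letting a dynamic scheduler assign them.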


Goal: HLC Support for Challenge Applications

• Go beyond loop nest optimizations
  • Chapel / data-parallel support
  • CnC attribute action graph optimization
• SAR
  • New locality transformations demonstrated speedups on linear flight path (reported to DARPA)
• MD
  • Exploring HLC optimization to neutral territory methods
• Graph
  • High level approaches to optimizing graph algorithms and increasing locality, new lock-free data-parallel algorithm for BFS
• Chess, Hydrodynamics
  • TBD.


Outline

• UHPC Goals


CSLC-LMS (Mapping Across Function/Library Calls)

• Main comparisons:
  • R-Stream High-Level C Compiler 3.1.2
  • Intel MKL 10.2.1
• Dual quad-core E5405 Xeon processors (8 cores total), 9GB memory, 8 thr

[Figure: three build configurations]
• Configuration 1: MKL. Radar code with MKL calls.
• Configuration 2: Low-level compilers. Radar code compiled with GCC or ICC.
• Configuration 3: R-Stream. Radar code passed through R-Stream, and the optimized radar code compiled with GCC or ICC.



RTM (Exploiting Over-Provisioning for Performance)

25-point 8th order (in space) stencil

    /* X, Y and the coefficients C0..C4 are assumed to be defined elsewhere. */
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
        double temp;
        int i, j, k;

        /* Skip a halo of 4 points on every face (stencil radius is 4). */
        for (k = 4; k < pZ - 4; k++) {
            for (j = 4; j < pY - 4; j++) {
                for (i = 4; i < pX - 4; i++) {
                    temp = C0 * U2[k][j][i] +
                           C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                                 U2[k][j-1][i] + U2[k][j+1][i] +
                                 U2[k][j][i-1] + U2[k][j][i+1]) +
                           C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                                 U2[k][j-2][i] + U2[k][j+2][i] +
                                 U2[k][j][i-2] + U2[k][j][i+2]) +
                           C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                                 U2[k][j-3][i] + U2[k][j+3][i] +
                                 U2[k][j][i-3] + U2[k][j][i+3]) +
                           C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                                 U2[k][j-4][i] + U2[k][j+4][i] +
                                 U2[k][j][i-4] + U2[k][j][i+4]);

                    /* Time update of the wave field, in place in U1. */
                    U1[k][j][i] = 2.0 * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
                }
            }
        }
    }


RTM (Exploiting Over-Provisioning for Performance)

• 3D discretized wave equation kernel with single time iteration
• Run on NVIDIA GTX 480
• Double Precision 256^3 Problem
• High performance from over-provisioning space exploration and explicit optimization of register rotation and shared memory reuse
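
Register rotation can be illustrated on a one-dimensional slice of the stencil. This is a minimal CPU sketch with hypothetical function names. It is not the CUDA code produced by R-Stream, and it assumes a symmetric stencil of radius 4 along a single axis.

```c
#include <assert.h>
#include <stddef.h>

#define R 4  /* stencil radius, matching the 8th-order kernel */

/* Reference version: every output re-reads 2R+1 inputs from memory. */
void stencil_naive(const double *in, double *out, size_t n, const double c[R + 1]) {
    for (size_t k = R; k + R < n; k++) {
        double acc = c[0] * in[k];
        for (size_t r = 1; r <= R; r++)
            acc += c[r] * (in[k - r] + in[k + r]);
        out[k] = acc;
    }
}

/* Register rotation: keep the 2R+1 window in a small local array (which a
   compiler can keep in registers once unrolled), shift it by one element
   per step, and load only one new input per output. */
void stencil_rotated(const double *in, double *out, size_t n, const double c[R + 1]) {
    if (n < 2 * R + 1) return;
    double w[2 * R + 1];
    /* Preload all but the newest element of the first window. */
    for (size_t r = 0; r < 2 * R; r++) w[r + 1] = in[r];
    for (size_t k = R; k + R < n; k++) {
        for (size_t r = 0; r < 2 * R; r++) w[r] = w[r + 1];  /* rotate */
        w[2 * R] = in[k + R];                                /* one new load */
        double acc = c[0] * w[R];
        for (size_t r = 1; r <= R; r++)
            acc += c[r] * (w[R - r] + w[R + r]);
        out[k] = acc;
    }
}
```

In a 3D GPU kernel the same idea is typically applied along the axis each thread marches over, with the two remaining axes served from shared memory.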


R-Stream to CnC Proof of Concept

• Examined feasibility and benefits of automatic coordination language (CnC) generation from R-Stream:
  • on 4-D stencil, in-place, kernel application
  • coarse grained parallelism is pipelined (i.e. wavefronts of parallel tasks) and representative of other streaming kernels
• R-Stream generates a non-trivial OpenMP version
• Manually transform this OpenMP version to CnC code
• Process completely automatable
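
The wavefront structure can be shown on a much smaller problem than the 4-D stencil of the study. This is a minimal sketch, assuming a 2-D in-place sweep in which each cell depends on its upper and left neighbours. The function names and the grid size `N` are hypothetical.

```c
#include <assert.h>

#define N 6

/* Sequential in-place sweep: each cell depends on its upper and left
   neighbours, which were updated earlier in the same sweep. */
void sweep_sequential(long a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] += a[i - 1][j] + a[i][j - 1];
}

/* Wavefront order: all cells with i + j == t depend only on cells of
   wavefront t - 1 (or on the untouched boundary), so the inner loop is a
   set of independent tasks. */
void sweep_wavefront(long a[N][N]) {
    for (int t = 2; t <= 2 * (N - 1); t++) {
        int lo = (t - (N - 1) > 1) ? t - (N - 1) : 1;
        int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
        /* Independent iterations: candidates for parallel tasks. */
        for (int i = lo; i <= hi; i++) {
            int j = t - i;
            a[i][j] += a[i - 1][j] + a[i][j - 1];
        }
    }
}
```

Both orders produce the same result. In a coordination-language version, each inner-loop iteration (or a tile of them) would become a step whose inputs are the items produced by the previous wavefront.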




Conclusion

• R-Stream simplifies software development and maintenance
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, communications reuse, etc.
• Many exciting developments within UHPC