Reservoir Labs Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved. Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation. This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. N. Vasilache , R. Lethin 1 The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program Government Purpose Rights Purchase Order Number: N/A Agreement No.: HR001‐10‐3‐0007 Contractor Name: Intel Corporation Contractor Address: 2111 NE 25th Ave M/S JF2‐60, Hillsboro, OR 97124 Expiration Date: None The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1),(3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration data shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government.: University of Delaware – www.udel.edu ; ETIInternational – www.etinternational.com ; Intel Corporation – www.intel.com ; Reservoir Labs – www.reservoir.com ; University of California – San Diego – www.ucsd.edu ; University of Illinois at Urbana-Champaign- www.illinois.edu.

N. Vasilache , R. Lethin

Page 1: N. Vasilache ,  R. Lethin

N. Vasilache , R. Lethin

The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program

• Government Purpose Rights
• Purchase Order Number: N/A
• Agreement No.: HR001‐10‐3‐0007
• Contractor Name: Intel Corporation
• Contractor Address: 2111 NE 25th Ave M/S JF2‐60, Hillsboro, OR 97124
• Expiration Date: None
• The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1),(3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government: University of Delaware – www.udel.edu; ETIInternational – www.etinternational.com; Intel Corporation – www.intel.com; Reservoir Labs – www.reservoir.com; University of California – San Diego – www.ucsd.edu; University of Illinois at Urbana-Champaign – www.illinois.edu.

Page 2: N. Vasilache ,  R. Lethin

Outline

• R-Stream Overview
• UHPC Goals
• Some Performance Results

Page 3: N. Vasilache ,  R. Lethin

Power Efficiency Driving Architectures

[Figure: block diagram repeating the component labels GPP, SIMD, Memory, DMA, and FPGA, annotated with the properties below.]

• Heterogeneous Processing
• Distributed Local Memories
• Explicitly Managed Architecture
• NUMA
• Bandwidth Starved
• Hierarchical (including board, chassis, cabinet)
• Multiple Execution Models
• Mixed Parallelism Types
• Multiple Spatial Dimensions

Page 4: N. Vasilache ,  R. Lethin

Computation Choreography

• Expressing it in the program:
  • Annotations and pragma dialects for C
  • Chapel subset (UHPC in progress with UIUC)
  • CnC subset (UHPC in progress with Intel)
• Generating it:
  • Explicitly (e.g., new languages like CUDA, target specific)
  • Implicitly (UHPC in progress: libraries, runtime abstractions, CnC)
• But before expressing it, how can programmers find it?
  • Manual constructive procedures, art, sweat, time
    – Artisans get complete control over every detail
  • Fully-automatic
    – Operations research problems and (advanced) autotuning
    – Faster, sometimes better, than a human

[Callout: Not our focus]

Page 5: N. Vasilache ,  R. Lethin

Program Transformations Specification

[Figure: iteration space of a statement S(i,j), with axes i and j, mapped by $\mathbb{Z}^2 \to \mathbb{Z}^2$ onto time coordinates (t1, t2).]

• Schedule maps iterations to multi-dimensional time:
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space:
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space:
  • UHPC in progress
• Hierarchical by design, tiling serves separation of concerns
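The idea of a schedule as a map from iterations to time can be sketched in C. The example below is an illustration only, not R-Stream output: the statement `S`, the grid size `N`, and the skewing map (i, j) -> (t1, t2) = (i + j, j) are all assumptions chosen for the sketch. Both dependences of `S` point to an earlier `t1`, so the skewed schedule is feasible, and iterations that share a `t1` value are independent.

```c
#include <assert.h>

#define N 6  /* grid size, chosen arbitrarily for the sketch */

/* Statement S(i,j): each point depends on its north and west neighbours. */
static void S(long a[N][N], int i, int j) {
    a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Set the boundary row and column to 1 and the interior to 0. */
void init_grid(long a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == 0 || j == 0) ? 1 : 0;
}

/* Identity schedule: (i, j) -> (t1, t2) = (i, j). */
void run_original(long a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            S(a, i, j);
}

/* Skewed schedule: (i, j) -> (t1, t2) = (i + j, j).
 * Both dependences point to an earlier t1, so the schedule is feasible,
 * and all iterations sharing one t1 are independent of each other. */
void run_skewed(long a[N][N]) {
    for (int t1 = 2; t1 <= 2 * (N - 1); t1++) {
        int lo = (t1 - (N - 1) > 1) ? t1 - (N - 1) : 1;
        int hi = (t1 - 1 < N - 1) ? t1 - 1 : N - 1;
        for (int t2 = lo; t2 <= hi; t2++)   /* could run in parallel */
            S(a, t1 - t2, t2);
    }
}

/* Return 1 when both schedules produce identical grids. */
int schedules_agree(void) {
    long a[N][N], b[N][N];
    init_grid(a);
    init_grid(b);
    run_original(a);
    run_skewed(b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != b[i][j]) return 0;
    return 1;
}
```

Calling `schedules_agree()` checks that the reordered execution computes the same grid as the original loop order, which is what "preserves dependences" buys.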

Page 6: N. Vasilache ,  R. Lethin

Polyhedral Slogans

• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
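For readers unfamiliar with the first slogan, here is a small C example of a parametric imperfect loop nest. It is a generic textbook kernel, not taken from the presentation; the function name `matvec` and the statement labels S1 and S2 are assumptions made for illustration. The bounds `n` and `m` are unknown at compile time (parametric), and the two statements sit at different loop depths (imperfect).

```c
#include <assert.h>

/* y = A * x for an n-by-m row-major matrix A.
 * The bounds n and m are symbolic parameters, and the nest is imperfect:
 * S1 sits at depth 1 while S2 sits at depth 2. */
void matvec(int n, int m, const double *a, const double *x, double *y) {
    assert(n >= 0 && m >= 0);
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;                       /* S1: 0 <= i < n */
        for (int j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];  /* S2: 0 <= i < n, 0 <= j < m */
    }
}
```

Each statement's iteration domain is a set of integer points bounded by affine inequalities in the loop indices and parameters, which is the form polyhedral tools reason about.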

Page 7: N. Vasilache ,  R. Lethin

R-Stream Blueprint

[Figure: R-Stream compiler block diagram. Labels:]

• EDG C Front End
• CnC / Chapel Front End (UHPC in progress)
• Scalar Representation
• Raising
• Extended Representation
• Polyhedral Mapper
• Machine Model
• Lowering
• Pretty Printer (CUDA, C+annotations, pthreads …)
• CnC High-Level
• C Low-Level

Page 8: N. Vasilache ,  R. Lethin

Mapping Process for Explicitly Managed Memories

1- Scheduling: parallelism, locality, tilability
2- Task formation:
  - Coarse-grain atomic tasks
  - Master/slave side operations
3- Placement: assign tasks to blocks/threads

- Local / global data layout optimization
- Multi-buffering (explicitly managed)
- Synchronization (barriers)
- Bulk communications
- Thread generation -> master/slave
- Target-specific optimizations

[Figure label: Dependencies]
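The combination of tiling, bulk communication, and multi-buffering listed above can be sketched in plain C. This is a hand-written illustration under stated assumptions, not code generated by R-Stream: `TILE`, `dma_get`, `dma_put`, and `scale_tiled` are hypothetical names, and `memcpy` stands in for an asynchronous bulk transfer, so nothing actually overlaps in this sequential version.

```c
#include <assert.h>
#include <string.h>

#define TILE 4  /* hypothetical scratchpad tile size, in elements */

/* Stand-ins for bulk transfers between global and local memory.
 * On a real target these would be asynchronous DMA operations. */
static void dma_get(float *local, const float *global, int n) {
    memcpy(local, global, (size_t)n * sizeof(float));
}
static void dma_put(float *global, const float *local, int n) {
    memcpy(global, local, (size_t)n * sizeof(float));
}

/* Scale every element of x[0..n) by alpha, one tile at a time.
 * Two local buffers alternate so that the fetch of tile t+1 is issued
 * before the computation on tile t (double buffering). */
void scale_tiled(float *x, int n, float alpha) {
    assert(n >= 0);
    float local[2][TILE];
    int ntiles = (n + TILE - 1) / TILE;
    if (ntiles == 0) return;

    int len0 = n < TILE ? n : TILE;
    dma_get(local[0], x, len0);                 /* prefetch first tile */

    for (int t = 0; t < ntiles; t++) {
        int cur = t & 1;
        int base = t * TILE;
        int len = (n - base < TILE) ? n - base : TILE;

        if (t + 1 < ntiles) {                   /* prefetch next tile */
            int nbase = base + TILE;
            int nlen = (n - nbase < TILE) ? n - nbase : TILE;
            dma_get(local[1 - cur], x + nbase, nlen);
        }

        for (int i = 0; i < len; i++)           /* compute from local memory */
            local[cur][i] *= alpha;

        dma_put(x + base, local[cur], len);     /* bulk write-back */
    }
}
```

The structure is the point: computation touches only the local buffers, and all traffic to the large array happens in tile-sized bulk transfers.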

Page 9: N. Vasilache ,  R. Lethin

Model for Scheduling Trades 3 Objectives Jointly

[Figure: trade-off diagram. Labels:]

• Loop Fusion
• Loop Fission
• More Parallelism
• More Locality
• Sufficient Occupancy
• Memory Contiguity
• + successive thread contiguity
• Fewer Global Memory Accesses
• Better Effective Bandwidth

Patent pending
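Of the three objectives, memory contiguity is the easiest to show in isolation. The C sketch below is an assumption-laden illustration (the array shape and function names are invented here): it applies a loop interchange so that the inner loop walks adjacent addresses of a row-major array. Both versions compute the same result; only the access order changes.

```c
#include <assert.h>

#define ROWS 4
#define COLS 8

/* Inner loop walks down a column: consecutive iterations touch
 * addresses COLS elements apart (strided access in row-major C). */
void add_one_strided(float m[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            m[i][j] += 1.0f;
}

/* Interchanged loops: the inner loop now walks along a row, so
 * consecutive iterations touch adjacent addresses. */
void add_one_contiguous(float m[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] += 1.0f;
}
```

On a GPU the analogous concern is that successive threads access successive addresses; a scheduler that weighs contiguity alongside parallelism and locality may pick a different loop order than one that optimizes either alone.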

Page 10: N. Vasilache ,  R. Lethin

Inside the R-Stream Mapper

[Figure: mapper architecture. Labels:]

• Jolylib, …
• Extended GDG representation
• Tactics Module
• Parallelization
• Locality Optimization
• Tiling
• Placement
• Comm. Generation
• Memory Promotion
• Sync Generation
• Layout Optimization
• Polyhedral Scanning

Optimization modules engineered to expose advanced “knobs” used by auto-tuner

Page 11: N. Vasilache ,  R. Lethin

Optimization Across BLAS Calls

/* Optimization with BLAS */
for loop {
    …
    BLAS call 1
    …
    BLAS call 2
    …
    BLAS call n
    …
}

• Outer loop(s)
• Retrieve data Z from disk
• Store data Z back to disk
• Retrieve data Z from disk !!!
• Numerous cache misses

VS.

/* Global Optimization */
doall loop {
    …
    for loop {
        …
        [read from Z]
        …
        [write to Z]
        …
        [read from Z]
    }
    …
}

• Loop fusion can improve locality
• Can parallelize outer loop(s)

→ Global optimization exposes better parallelism and locality (significant speedups)
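The contrast between call-by-call and global optimization can be made concrete with a small C sketch. The kernels `vec_scale`, `vec_add`, and `vec_sum` are hypothetical stand-ins for library routines, not real BLAS entry points, and the fused version is written by hand here to show the kind of code a global optimizer could aim for.

```c
#include <assert.h>

/* Hypothetical library-style kernels, standing in for BLAS calls. */
static void vec_scale(float *z, float alpha, int n) {
    for (int i = 0; i < n; i++) z[i] *= alpha;
}
static void vec_add(float *z, const float *x, int n) {
    for (int i = 0; i < n; i++) z[i] += x[i];
}
static float vec_sum(const float *z, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += z[i];
    return s;
}

/* Call-by-call version: each row of z is streamed three times. */
void rows_by_calls(float *z, const float *x, float *sums,
                   int rows, int n, float alpha) {
    for (int r = 0; r < rows; r++) {
        vec_scale(z + r * n, alpha, n);
        vec_add(z + r * n, x, n);
        sums[r] = vec_sum(z + r * n, n);
    }
}

/* Globally optimized version: the three kernels are fused, so each
 * element of z is read and written once. Rows are independent, so the
 * outer loop is a candidate doall loop. */
void rows_fused(float *z, const float *x, float *sums,
                int rows, int n, float alpha) {
    for (int r = 0; r < rows; r++) {          /* doall candidate */
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            float v = z[r * n + i] * alpha + x[i];
            z[r * n + i] = v;
            s += v;
        }
        sums[r] = s;
    }
}
```

A library boundary hides the fact that all three kernels touch the same row of `z`; once the loops are visible together, fusion and outer-loop parallelization become available.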

Page 12: N. Vasilache ,  R. Lethin

Outline

• R-Stream Overview
• UHPC Goals
• Some Performance Results

Page 13: N. Vasilache ,  R. Lethin

Codelets From a HLC perspective

• Codelets have:
  • Fine granularity
  • Explicit communication
  • Point to point, other kinds of synchronization
• Can utilize scheduling and dependence information hints
• Should also use placement of data and computation hints
• Work from local scratch pad memories
• Good match for UHPC hardware, allows good control for energy, resilience, etc.

Page 14: N. Vasilache ,  R. Lethin

UHPC from HLC perspective

• Energy
  • must minimize data motion/communication
• Near Threshold Voltage
  • must find even more parallelism
• Resilience
  • synergy needed with new checkpointing/recovery models
• Self awareness
  • dynamic distributed feedback and regulation

Page 15: N. Vasilache ,  R. Lethin

Another Observation

• But programming directly in codelets is impractical:
  • Exposing machine details is a good thing, but we don’t want programmers to manage them.
    – Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism x locality x resilience…)
• Writing directly in codelets will also over-specify the program, bake it to one machine, and defeat portability
• Role of HLC is to take high level abstractions from the programmer
  – sequential code,
  – Chapel, CnC,
  – data-parallel idioms,
  – math language
• Perform optimization to various levels of the target hardware hierarchy

Page 16: N. Vasilache ,  R. Lethin

Based on R-Stream Technology

• Energy
  • Existing: Locality Opt; Explicit Comm Gen; Map to accelerators; Hierarchical barriers
  • New For Codelets: Deep hierarchical scheduling; Point to point sync; Data placement opts
• More parallelism
  • Existing: Exact dependence; Imperfect loops
  • New For Codelets: Dynamic schedules and placements; Emit scheduling and placement hints
• Resilience: Emit interaction sets; ABFT support; Memory reuse opt; Checkpointing opt
• High level programming
  • Existing: Sequential C
  • New For Codelets: Chapel, CnC, Math, Data Parallel Idioms
• Self-awareness: Dynamic mappings

Page 17: N. Vasilache ,  R. Lethin

Goal: Generating CnC

• Assume a mapping from CnC -> Codelets
• Advantages of CnC
  • More succinct expression of parallelism (the skewing problem)
  • Adaptable parallelism and load balancing
  • High-level representation of data parallel idioms
    – CnC helps solve the irregular, idiomatic part of the problem
    – R-Stream can target optimizations across irregular idioms
• Easy to test for correctness of generated code and execute efficiently on x86 / clusters.

Page 18: N. Vasilache ,  R. Lethin

Goal: Synergy with CnC

• Represent CnC action-attribute graphs explicitly in R-Stream:
• Benefit from optimization across multiple CnC steps
• Explore tradeoff between fusing steps and running them in parallel:
  – Fused steps reduce the runtime overhead
  – And also the memory footprint
• Generate many semantically equivalent versions and explore the design space tradeoffs
  – R-Stream auto-tuning mode will help a lot here

Page 19: N. Vasilache ,  R. Lethin

Goal: Synergy with Chapel and UIUC

• Extensions to blackboxing:
  • User interface, can represent any program
  • Supports even linking with precompiled code
• Integrate user-specific data distributions within R-Stream
  • HTAs
  • Locales
• Find the right abstraction
  • The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user choices
• Iterative, feedback-directed design
  • Language / transformation tool
  • Transformation tool / Runtime
  • Language / Runtime

Page 20: N. Vasilache ,  R. Lethin

Goal: Pragmatic Approach

• Support multiple kinds of placement:
  • Explicit / implicit; virtual / physical; linear / cyclic / block cyclic / general
• Build on R-Stream’s current over-provisioning for performance:
  • Originally built for CUDA performance
  • Concepts extend to any architecture with dynamic scheduling decisions
  • Has implications on locality/communication granularity
  • Examine implications on power
  • Use advanced auto-tuning features for design space exploration
• Explore which modes perform best with CnC:
  • Dependent on how over-provisioning is implemented
• Over-provisioning (may) have implications on memory persistence:
  • Opportunities / loss of high-level reuse and communication optimizations
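The placement kinds named above can be written as small owner functions. The sketch below shows the standard block, cyclic, and block-cyclic distributions of loop iterations over `p` processors; the function names are invented for illustration, and reading the slide's "linear" placement as a contiguous block placement is an assumption.

```c
#include <assert.h>

/* Owner of iteration i under a block placement of n iterations on p
 * processors: contiguous chunks of ceil(n/p) iterations. */
int owner_block(int i, int n, int p) {
    assert(p > 0 && i >= 0 && i < n);
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Cyclic placement: iterations are dealt out round-robin. */
int owner_cyclic(int i, int p) {
    assert(p > 0 && i >= 0);
    return i % p;
}

/* Block-cyclic placement: blocks of b iterations are dealt round-robin. */
int owner_block_cyclic(int i, int b, int p) {
    assert(p > 0 && b > 0 && i >= 0);
    return (i / b) % p;
}
```

Block placement favors locality between neighboring iterations, cyclic placement favors load balance, and block-cyclic sits between the two, with the block size `b` as a natural tuning parameter.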

Page 21: N. Vasilache ,  R. Lethin

Goal: HLC support for Challenge Applications

• Go beyond loop nest optimizations
  • Chapel / data-parallel support
  • CnC attribute action graph optimization
• SAR
  • New locality transformations demonstrated speedups on linear flight path (reported to DARPA)
• MD
  • Exploring HLC optimization to neutral territory methods
• Graph
  • High level approaches to optimizing graph algorithms and increasing locality, new lock-free data-parallel algorithm for BFS
• Chess, Hydrodynamics
  • TBD.

Page 22: N. Vasilache ,  R. Lethin

Outline

• R-Stream Overview
• UHPC Goals
• Some Performance Results

Page 23: N. Vasilache ,  R. Lethin

CSLC-LMS (Mapping Across Function/Library Calls)

• Main comparisons:
  • R-Stream High-Level C Compiler 3.1.2
  • Intel MKL 10.2.1
• Dual quad-core E5405 Xeon processors (8 cores total), 9GB memory, 8 thr

[Figure: three build configurations]

• Configuration 1: MKL. Radar code with MKL calls
• Configuration 2: Low-level compilers. Radar code compiled with GCC / ICC
• Configuration 3: R-Stream. Radar code passed through R-Stream; optimized radar code compiled with GCC / ICC

Page 24: N. Vasilache ,  R. Lethin

CSLC-LMS (Mapping Across Function/Library Calls)

Page 25: N. Vasilache ,  R. Lethin

RTM (Exploiting Over-Provisioning for Performance)

25-point 8th order (in space) stencil

    /* X and Y (array extents) and C0..C4 (stencil coefficients) are
       constants defined elsewhere in the original source. */
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
        double temp;
        int i, j, k;

        /* Skip a halo of 4 points on every side (stencil radius 4). */
        for (k = 4; k < pZ - 4; k++) {
            for (j = 4; j < pY - 4; j++) {
                for (i = 4; i < pX - 4; i++) {
                    temp = C0 * U2[k][j][i] +
                           C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                                 U2[k][j-1][i] + U2[k][j+1][i] +
                                 U2[k][j][i-1] + U2[k][j][i+1]) +
                           C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                                 U2[k][j-2][i] + U2[k][j+2][i] +
                                 U2[k][j][i-2] + U2[k][j][i+2]) +
                           C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                                 U2[k][j-3][i] + U2[k][j+3][i] +
                                 U2[k][j][i-3] + U2[k][j][i+3]) +
                           C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                                 U2[k][j-4][i] + U2[k][j+4][i] +
                                 U2[k][j][i-4] + U2[k][j][i+4]);

                    U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
                }
            }
        }
    }

Page 26: N. Vasilache ,  R. Lethin

RTM (Exploiting Over-Provisioning for Performance)

• 3D discretized wave equation kernel with single time iteration
• Run on NVIDIA GTX 480
• Double Precision 256^3 Problem
• High performance from over-provisioning space exploration and explicit optimization of register rotation and shared memory reuse
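Register rotation can be illustrated on a one-dimensional cut of the stencil. The C sketch below is a simplified stand-in, not the CUDA that R-Stream emits: the names `stencil_direct` and `stencil_rotating`, the coefficient array `c`, and the restriction to a single axis are all assumptions. The rotating version keeps a window of 2R+1 values in local variables and loads only one new input per output.

```c
#include <assert.h>

#define R 4  /* stencil radius, matching the 8th-order stencil above */

/* Direct form: every output re-reads 2*R+1 inputs from memory. */
void stencil_direct(const double *in, double *out, int n,
                    const double c[R + 1]) {
    for (int k = R; k < n - R; k++) {
        double acc = c[0] * in[k];
        for (int d = 1; d <= R; d++)
            acc += c[d] * (in[k - d] + in[k + d]);
        out[k] = acc;
    }
}

/* Rotating form: a window of 2*R+1 values is kept in local variables
 * (registers, on a GPU). Each step shifts the window by one and loads
 * a single new input. */
void stencil_rotating(const double *in, double *out, int n,
                      const double c[R + 1]) {
    if (n < 2 * R + 1) return;
    double w[2 * R + 1];
    for (int d = 0; d < 2 * R; d++)        /* preload all but the newest slot */
        w[d + 1] = in[d];
    for (int k = R; k < n - R; k++) {
        for (int d = 0; d < 2 * R; d++)    /* rotate: drop the oldest value */
            w[d] = w[d + 1];
        w[2 * R] = in[k + R];              /* one new load per output */
        double acc = c[0] * w[R];
        for (int d = 1; d <= R; d++)
            acc += c[d] * (w[R - d] + w[R + d]);
        out[k] = acc;
    }
}
```

In the 3D kernel the same idea would apply along the axis each thread marches over, while the other two axes would be served from shared memory; that split is a common GPU stencil design and is offered here as context rather than as a description of the generated code.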

Page 27: N. Vasilache ,  R. Lethin

R-Stream to CnC Proof of Concept

• Examined feasibility and benefits of automatic coordination language (CnC) generation from R-Stream:
  • on 4-D stencil, in-place, kernel application
  • coarse grained parallelism is pipelined (i.e. wavefronts of parallel tasks) and representative of other streaming kernels
• R-Stream generates a non-trivial OpenMP version
• Manually transform this OpenMP version to CnC code
• Process completely automatable
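The difference between barrier-separated wavefronts and a coordination-language style of execution can be sketched without any CnC API. The plain C simulation below is an assumption-based illustration: the task grid size `T`, the dependence pattern, and the function `run_dataflow` are invented for the sketch. Each task runs as soon as its own predecessors have finished, tracked by a per-task dependence counter.

```c
#include <assert.h>

#define T 4  /* tasks per dimension, arbitrary for the sketch */

/* Execute a T x T grid of tasks in which task (i, j) depends on
 * (i-1, j) and (i, j-1). A task becomes ready when its dependence
 * counter reaches zero, which is point-to-point synchronization
 * rather than one barrier per wavefront.
 * order[i][j] receives the position at which task (i, j) ran.
 * Returns the number of tasks executed. */
int run_dataflow(int order[T][T]) {
    int pending[T][T];          /* unsatisfied dependences per task */
    int queue[T * T][2];        /* FIFO of ready tasks */
    int head = 0, tail = 0, executed = 0;

    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++)
            pending[i][j] = (i > 0) + (j > 0);

    queue[tail][0] = 0; queue[tail][1] = 0; tail++;   /* (0,0) has no deps */

    while (head < tail) {
        int i = queue[head][0], j = queue[head][1];
        head++;
        order[i][j] = executed++;                     /* "run" the task */

        /* Notify the two successors; enqueue any that became ready. */
        if (i + 1 < T && --pending[i + 1][j] == 0) {
            queue[tail][0] = i + 1; queue[tail][1] = j; tail++;
        }
        if (j + 1 < T && --pending[i][j + 1] == 0) {
            queue[tail][0] = i; queue[tail][1] = j + 1; tail++;
        }
    }
    return executed;
}
```

A real runtime would pull ready tasks from the queue with several workers at once; the sequential loop here only demonstrates that dependence counts alone are enough to produce a valid order.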

Page 28: N. Vasilache ,  R. Lethin

R-Stream to CnC Proof of Concept

Page 29: N. Vasilache ,  R. Lethin

Conclusion

• R-Stream simplifies software development and maintenance
• Does this by automatically parallelizing loop code
  • While optimizing for data locality, coalescing, communications reuse, etc.
• Many exciting developments within UHPC