1
Goal: a portable and high-performance UPC implementation Fully compliant with UPC 1.2 spec. Many extensions for performance and programmability Non-blocking memcpy functions Semaphores for one-side comm. Fine granularity timers Entirely open source Windows/Mac/UNIX CD available at UPC booth 0% 10% 20% 30% 40% 50% 60% 70% 80% 4 4 4 4 4 barnes gups mcop psearch sobel HP(AlphaQuadrics) Itanium/Myrinet Opteron/Infiniband Average of Opt Speedup benchmark processors machine Berkeley UPC Compiler Overview http://upc.lbl.gov Portable Design Translator UPC Code Translator Generated C Code Berkeley UPC Runtime System GASNet Communication System Network Hardware Platform- independent Network- independent Compiler- independent Language- independent Multi-layer design, platform independent code generation Supports large-scale multiprocessors, SMPs and clusters x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, T3E, X1, SX-6, XT3(new), Blue Gene(new) UPC-to-C Translator Runtime + GASNet Well-documented interface Supports multiple UPC compilers (Berkeley UPC and Intrepid GCC/UPC) Etnus TotalView debugger support Communication tracing support Provides app interoperability: UPC calls to/from C, C++, Fortran, MPI Berkeley GASNet used for communication: Portability from layered design Performance from inline functions, macros, and network-specific implementations Support SMP, Myrinet, Quadrics Elan 3/4, Infiniband, IBM LAPI, Dolphin SCI, MPI, Ethernet, X1/Altix shmem Based on Open64 Translate UPC into C with calls to runtime High-level representation to get good serial performance Platform for experimenting with UPC optimizations: upc_forall Loop Optimization PRE & Split-Phase Access VM-based Communication Overlap shared int a[N]; upc_forall(i = L; i <U; i++; i) accum += a[i]; int ofst = (MYTHREAD–L) % THREADS; int *la = (int *) &a[L+ofst]; for(i=L+ofst, j=0; i <U; i+=THREADS, j++) accum += la[j]; Removes runtime branch from upc_forall loops Privatize array accesses that must be local by analyzing affinity exp. Works for affine cyclic/indefinite arrays VM support for demand-driven comm. synchronization Message decomposition and scheduling for bulk comm. Performance improvements for fine-grained benchmarks FT – blocking comm. FT-nb – hand-modified nonblocking comm. FT-md – DDS and decomposition without global scheduling FT-H –- DDS and decomposition with global scheduling FT class B/AlphaQuadrics 0.6 0.8 1 1.2 1.4 1.6 1.8 4 8 16 32 64 Processors Speedup FT FT-nb FT-md FT-H PRE on shared expressions (ptr add, load, and store) Split-phase comm. – moves read initiations up, write completions down Coalesces fine-grained accesses to same struct/array Performance of Demand-Driven Synchronization (DDS)

Berkeley UPC Compiler · Goal: a portable and high-performance UPC implementation Fully compliant with UPC 1.2 spec. Many extensions for performance and programmability Non-blocking

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Berkeley UPC Compiler · Goal: a portable and high-performance UPC implementation Fully compliant with UPC 1.2 spec. Many extensions for performance and programmability Non-blocking

Goal: a portable and high-performanceUPC implementationFully compliant with UPC 1.2 spec.Many extensions for performance and programmability

Non-blocking memcpy functionsSemaphores for one-side comm.Fine granularity timers

Entirely open sourceWindows/Mac/UNIX CDavailable at UPC booth

0%10%20%30%40%50%60%70%80%

4 4 4 4 4

barnes gups mcop psearch sobel

HP(AlphaQuadrics)Itanium/MyrinetOpteron/Infiniband

Drop Page Fields Here

Average of Opt Speedup

benchmark processors

machine

Berkeley UPC Compiler

Overview

http://upc.lbl.gov

Portable DesignTranslatorUPC Code

Translator Generated C Code

Berkeley UPC Runtime System

GASNet Communication System

Network Hardware

Platform-independent

Network-independent

Compiler-independent

Language-independent

Multi-layer design, platform independent code generationSupports large-scale multiprocessors, SMPs and clusters

x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, T3E, X1, SX-6, XT3(new), Blue Gene(new)

UPC-to-C Translator

Runtime + GASNetWell-documented interface

Supports multiple UPC compilers (Berkeley UPC and Intrepid GCC/UPC)

Etnus TotalView debugger supportCommunication tracing supportProvides app interoperability:

UPC calls to/from C, C++, Fortran, MPIBerkeley GASNet used for communication:

Portability from layered designPerformance from inline functions, macros, and network-specific implementationsSupport SMP, Myrinet, Quadrics Elan 3/4, Infiniband, IBM LAPI, Dolphin SCI, MPI, Ethernet, X1/Altix shmem

Based on Open64Translate UPC into C with calls to runtimeHigh-level representation to get good serial performancePlatform for experimenting with UPC optimizations:

upc_forall Loop Optimization

PRE & Split-Phase Access

VM-based Communication

Overlap

shared int a[N];upc_forall(i = L; i <U; i++; i)

accum += a[i];

int ofst = (MYTHREAD–L) % THREADS;int *la = (int *) &a[L+ofst];for(i=L+ofst, j=0; i <U; i+=THREADS, j++)accum += la[j];

Removes runtime branch from upc_forall loopsPrivatize array accesses that must be local by analyzing affinity exp.Works for affine cyclic/indefinite arrays

VM support for demand-driven comm. synchronizationMessage decomposition and scheduling for bulk comm.

Performance improvements for fine-grained benchmarks

FT – blocking comm.

FT-nb – hand-modified nonblocking comm.

FT-md – DDS and decomposition without global scheduling

FT-H –- DDS and decomposition with global scheduling

FT class B/AlphaQuadrics

0.6

0.8

1

1.2

1.4

1.6

1.8

4 8 16 32 64Processors

Spe

edup

FT FT-nb FT-md FT-H

PRE on shared expressions (ptradd, load, and store)Split-phase comm. – moves read initiations up, write completions downCoalesces fine-grained accesses to same struct/array

Performance of Demand-Driven Synchronization (DDS)