
Page 1: A Performance Model for Fine-Grain Accesses in UPC

Zhang Zhang, Steve Seidel
Department of Computer Science, Michigan Technological University
{zhazhang,steve}@mtu.edu
http://www.upc.mtu.edu

IPDPS 2006

Page 2: Outline

1. Motivation and approach

2. The UPC programming model

3. Performance model design

4. Microbenchmarks

5. Application analysis

6. Measurements

7. Summary and continuing work

Page 3: 1. Motivation

Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming.

UPC compilers are available for platforms ranging from Macs to the Cray X1.

The Partitioned Global Address Space (PGAS) community is asking for performance models.

An accurate model determines whether an application code takes best advantage of the underlying parallel system and identifies the code and system optimizations that are needed.

Page 4: Approach

Construct an application-level analytical performance model.

Model fine-grain access performance based on platform benchmarks and code analysis.

Platform benchmarks determine compiler and runtime system optimization abilities.

Code analysis determines where these optimizations will be applied in the code.

Page 5: 2. The UPC programming model

UPC is an extension of ISO C99.

UPC processes are called threads: predefined identifiers THREADS and MYTHREAD are provided.

UPC is based on a partitioned shared memory model: A single address space that is logically partitioned among processors. Partitioned memory is part of the programming paradigm. Physical memory may or may not be partitioned.
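As a minimal sketch of this model (our example, not from the slides; it assumes a UPC compiler providing upc_relaxed.h), each thread can use MYTHREAD and THREADS to touch only the shared elements it owns:

#include <upc_relaxed.h>
#include <stdio.h>

#define PER 4
shared int data[PER * THREADS];   /* default block size 1: element i has affinity to thread i%THREADS */

int main(void) {
    int i;
    for (i = MYTHREAD; i < PER * THREADS; i += THREADS)
        data[i] = MYTHREAD;                 /* each thread writes only elements it owns */
    upc_barrier;                            /* wait for all threads */
    if (MYTHREAD == 0)
        for (i = 0; i < PER * THREADS; i++)
            printf("data[%d] has affinity to thread %d\n", i, data[i]);
    return 0;
}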

Page 6: UPC's partitioned shared address space

Each thread has a private (local) address space.

All threads share a global address space that is partitioned among the threads.

A shared object in thread i’s region of the partition is said to have affinity to thread i.

If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.

A performance model must capture this property.

Page 7: UPC programming model

[Figure: threads th0, th1, th2, each with a private region and one partition of the shared region. A is distributed in blocks of 5 elements (indices 0-4 on th0, 5-9 on th1, 10-14 on th2, 15-19 on th0, and so on); each thread holds its own private i. A[0]=7 writes into th0's partition; i=3 sets each thread's private i; A[i]=A[0]+2 stores 9 into A[3].]

shared [5] int A[10*THREADS];
int i;

A[0] = 7;
i = 3;
A[i] = A[0] + 2;

Page 8: 3. Performance model design

Platform abstraction
- identify potential optimizations
- microbenchmarks measure platform properties with respect to those optimizations

Application analysis
- code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls
- characterize patterns of shared memory accesses

Page 9: Platform abstraction

UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses.

- exploit spatial locality
- overlap remote accesses with other work

Each platform applies a different set of optimization techniques.

The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.

Page 10: Potential optimizations

access aggregation: multiple accesses to shared memory that have the same affinity and can be postponed and combined

access vectorization: a special case of aggregation where the stride is regular

runtime caching: exploits temporal and spatial reuse

access pipelining: overlap independent concurrent remote accesses to multiple threads if the network allows
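To make these patterns concrete, the fragment below (ours; A and N are hypothetical, in the style of the examples later in the talk) contrasts a loop a vectorizing runtime can turn into one bulk transfer with a loop whose paired accesses can only be aggregated:

shared [] double *A;   /* A points to a block of remote shared memory */
double s = 0.0;
int i;

/* unit-stride reads from one remote block: a candidate for
   vectorization (or for runtime caching after the first miss) */
for (i = 0; i < N; i++)
    s += A[i];

/* two nearby reads per iteration at an irregular overall stride: not
   vectorizable, but each pair can be aggregated into one message */
for (i = 0; i < N - 1; i += 4)
    s += A[i] + A[i+1];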

Page 11: Potential optimizations (cont'd)

communication/computation overlapping: the usual technique applied by experienced programmers

multistreaming: provided by hardware with a memory system that can handle multiple streams of data

Notes:
- The effects of these optimizations are not disjoint, e.g., caching and coalescing can have similar effects.
- It can be difficult to determine with certainty which optimizations are actually at work.
- Microbenchmarks associated with the performance model try to reveal the available optimizations.

Page 12: 4. Microbenchmarks identify available optimizations

Baseline: cost of random remote shared accesses when no optimizations can be applied

Vector: cost of accesses to consecutive remote locations; captures vectorization and runtime caching

Coalesce: cost of random, small-strided remote accesses; captures pipelining and aggregation

Local vs. private: cost of accesses to local (shared) memory; captures the overhead of shared memory addressing

Costs are expressed as access rates, in double words per second.
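The slides do not show the benchmark code; a minimal baseline-style sketch (the sizes, names, and clock()-based timing are our assumptions, not the authors' harness) is:

#include <upc_relaxed.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PER_THREAD 65536
#define ACCESSES   100000
shared double table[PER_THREAD * THREADS];   /* cyclic layout: most elements are remote */

int main(void) {
    double sum = 0.0;
    int i;
    srand(MYTHREAD + 1);                     /* independent random streams */
    upc_barrier;
    clock_t t0 = clock();
    for (i = 0; i < ACCESSES; i++)
        sum += table[rand() % (PER_THREAD * THREADS)];   /* random shared reads, no reuse */
    clock_t t1 = clock();
    upc_barrier;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("thread %d: %.0f accesses/sec (checksum %g)\n", MYTHREAD, ACCESSES / secs, sum);
    return 0;
}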

Page 13: 5. Application analysis overview

Application code is partitioned into intervals based on sequence points such as barriers, fences, strict memory accesses, etc.

A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval.

References are partitioned into groups corresponding to the four benchmark types; the references in each group are amenable to the associated optimizations.

Costs are accumulated to obtain a performance prediction.

Page 14: A reference partition

A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where
- C is the set of accesses,
- pattern is in {baseline, vector, coalesce, local}, and
- name is the accessed object, e.g., shared array A.

User-defined functions are inlined to obtain flat code.

Alias analysis must be done.

Recursion is not modeled.

Page 15: Dependence analysis

The dependence graph G=(V,E) of an interval has one vertex for each reference to a shared object (its name); edges connect dependent vertices.

The dependences considered are true dependence, antidependence, output dependence, and input dependence.

The goal is to determine sets of accesses that can be performed concurrently.

Page 16: Reference partitioning graph

The reference partitioning graph G’=(V’,E’) is constructed from G.

Let B be the subset of E consisting of edges denoting true and antidependences.

Construct V' by grouping vertices in V that
- have the same name,
- reference memory locations with the same affinity, and
- are not connected by an edge in B.

Each vertex in V’ is assigned a reference pattern.

Page 17: Example 1

shared [] float *A;   // A points to a remote block of shared memory
for (i=1; i<N; i++) {
    ... = A[i];
    ... = A[i-1];
    ...
}

A[i] and A[i-1] are in the same partition.

If the platform supports vectorization, this partition is assigned the vector pattern.

If not, but coalescing is available, the pair of accesses can be coalesced on each iteration, so the coalesce pattern is assigned.

Otherwise the baseline pattern is assigned to this partition.
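Where none of these optimizations is available, a programmer can often recover the vector rate by hand with a bulk transfer; a sketch using the standard upc_memget (buf and s are ours):

float buf[N];
float s = 0.0f;
upc_memget(buf, A, N * sizeof(float));   /* one bulk fetch of the remote block */
for (i = 1; i < N; i++)
    s += buf[i] + buf[i-1];              /* all remaining accesses are private */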

Page 18: Example 2

shared [1] float *B;   // B is distributed one element per thread (round robin)
for (i=1; i<N; i++) {
    ... = B[i];
    ... = B[i-1];
    ...
}

B[i] and B[i-1] are in different partitions.

Vectorization and coalescing cannot be applied.

The pattern is mixed baseline-local, e.g., if THREADS=4 then the mix is 75% baseline and 25% local.

For large numbers of threads the pattern is just baseline.

Page 19: Communication cost

The communication cost of interval i is

    T_i^comm = Σ_j N_j / r(N_j, pattern_j),  summed over the partitions j in interval i,

where N_j is the number of shared memory accesses in partition j and r(N_j, pattern_j) is the access rate for that number of references and that pattern.

The functions r(N_j, pattern) are determined by microbenchmarking.
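The formula transcribes directly into code; a worked sketch (the names partition_cost and t_comm are ours, not from the paper):

typedef struct { long n; double rate; } partition_cost;  /* N_j and r(N_j, pattern_j) */

double t_comm(const partition_cost *p, int nparts) {
    double t = 0.0;
    int j;
    for (j = 0; j < nparts; j++)
        t += (double)p[j].n / p[j].rate;   /* N_j accesses at r accesses/sec */
    return t;
}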

Page 20: Computation cost

Computation cost is measured by simulating the computation using only private memory accesses.

The total run time of each interval i is T_i = T_i^comp + T_i^comm.

The cost of barriers is separately measured.

The predicted cost for each thread is the sum of all of these costs.

The highest predicted cost among all threads is taken to be the total cost.


Page 21: Speedup prediction

Speedup can be estimated by the ratio of the number of accesses in the sequential code to the weighted sum of the number of remote accesses of each type.

Details are given in the paper.

Page 22: 6. Measurements

Microbenchmarks

Applications
- Histogramming
- Matrix multiply
- Sobel edge detection

Platforms measured
- 16-node 2 GHz x86 Linux Myrinet cluster: MuPC V1.1.2 beta, Berkeley UPC 2.2
- 48-node 300 MHz Cray T3E: GCC UPC

Page 23: Prediction precision

Execution time prediction precision is expressed as

    δ = (T_predicted − T_actual) / T_actual × 100%

A negative value indicates that the cost is underestimated. For example, a predicted time of 95 s against an actual time of 100 s gives δ = −5%.

Page 24: Microbenchmark measurements

A few observations:
- Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case (*) for MuPC.
- Caching improves MuPC performance for vector and coalesce and reduces performance for baseline write (*).
- Berkeley successfully coalesces reads.
- GCC is unusually slow at local writes.

Page 25: Microbenchmark measurements

Rates in double words/sec; each microbenchmark is shown at 2 and at 12 threads.

Microbenchmark   MuPC w/o cache    MuPC w/ cache     Berkeley UPC      GCC UPC
(threads)        read     write    read     write    read     write    read     write

baseline (2)     14.0K    35.6K    23.3K    11.4K    21.0K    46.5K    0.45M    1.3M
baseline (12)    12.0K    30.8K    10.8K    7.3K     15.1K    43.3K    0.40M    1.1M

vector (2)       16.4K    45.4K    1.0M     1.0M     21.1K    47.2K    0.5M     1.7M
vector (12)      10.0K    34.6K    0.77M    0.71M    14.6K    44.8K    0.45M    1.7M

coalesce (2)     16.7K    43.9K    82.0K    69.9K    172K     46.9K    0.5M     1.6M
coalesce (12)    12.9K    39.5K    61.1K    46.3K    122K     44.4K    0.4M     1.5M

local (2)        8.3M     8.3M     8.3M     8.3M     8.3M     6.7M     1.2M     0.7M
local (12)       8.3M     8.3M     8.3M     8.3M     6.7M     5.0M     1.0M     0.62M


Page 26: Histogramming

shared [1] int array[N];
for (i=0; i<N*percentage; i++) {
    loc = random_index(i);
    array[loc]++;
}
upc_barrier;

Random elements of an array are updated in parallel.

Races are ignored.

percentage determines how much of the table is accessed.

Collisions grow as percentage gets smaller.

This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.
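To make the mixed costing concrete, here is a sketch (ours; the paper's exact weighting may differ) that combines the baseline and local rates in proportion to the remote fraction (THREADS-1)/THREADS:

/* Expected rate for a mixed baseline-local pattern with T threads.
   r_base and r_local come from the microbenchmarks. */
double mixed_rate(double r_base, double r_local, int T) {
    double f_remote = (double)(T - 1) / T;   /* e.g., 75% when T = 4 */
    double t_per_access = f_remote / r_base + (1.0 - f_remote) / r_local;
    return 1.0 / t_per_access;
}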

Page 27: Histogramming performance estimates

For 12 threads the predicted cost is usually within 5%.

Percentage load    Precision (δ)
on table           MuPC     Berkeley   GCC
10%                -2.2     -0.38      -3.6
20%                -4.8     -0.25      -3.5
30%                -0.4      0.15      -3.5
40%                -4.0      0.32      -3.5
50%                 1.6      0.45      -3.5
60%                -9.8      0.36      -3.6
70%                -4.6      0.39      -3.5
80%                -3.3      0.35      -3.5
90%                -9.7      0.54      -3.5

Page 28: Matrix multiply

Thread t executes pass i if A[i][0] has affinity to t.

Remote accesses are minimized by distributing the rows of A and C across threads and the columns of B across threads.

Both cyclic striped and block striped distributions are measured.

Accesses to A and C are local; accesses to B are mixed vector-local.

upc_forall (i=0; i<N; i++; &A[i][0]) {
    for (j=0; j<N; j++) {
        C[i][j] = 0.0;
        for (k=0; k<N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

Page 29: Matrix multiply performance estimates

N x N = 240 x 240

Berkeley and GCC costs are underestimated.

MuPC w/ cache costs are overestimated because temporal locality is not modeled.

Precision (δ); cyclic = cyclic striped, block = block striped:

          MuPC w/o cache    MuPC w/ cache     Berkeley UPC      GCC UPC
threads   cyclic   block    cyclic   block    cyclic   block    cyclic   block
2         -12.2     0.7      7.9     -2.1     -1.2     -1.9     -4.0    -14.0
4          14.5     0.3     15.3     19.5     -7.4    -14.7     -8.8    -10.5
6           3.6    -0.7     15.8     11.8     -4.5     -8.9      7.6     -2.0
8           2.2    -4.4      9.8     13.0     -7.0    -12.0     10.6      9.6
10         -0.4    -2.2      3.9      2.2     -7.4    -15.2      6.2     -3.2
12          2.9     0.9      9.8      8.8     -5.4      4.4      3.1     -5.7

Page 30: Sobel edge detection

A 2000 x 2000-pixel image is distributed so that each thread gets approx. 2000/THREADS contiguous rows.

All accesses to the computed image are local.

Read-only accesses to the source array are mixed-local: the north and south border rows are in neighboring threads; all other rows are local.

Source array access patterns are
- local-vector on MuPC with cache,
- local-coalesce on Berkeley because it coalesces, and
- local-baseline on GCC because it does not optimize.
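A sketch of this distribution and stencil (ours, not the authors' code; it assumes the static-THREADS translation environment, so THREADS may appear in the block size, and that THREADS divides 2000):

#include <upc_relaxed.h>
#include <math.h>

#define N 2000
/* N*N/THREADS elements per block = N/THREADS contiguous rows per thread */
shared [N*N/THREADS] float src[N][N], dst[N][N];

void sobel(void) {
    int i, j;
    upc_forall (i = 1; i < N-1; i++; &src[i][0]) {   /* the owner of row i computes it */
        for (j = 1; j < N-1; j++) {
            /* rows i-1 and i+1 are remote only at the band boundaries */
            float gx = (src[i-1][j+1] + 2*src[i][j+1] + src[i+1][j+1])
                     - (src[i-1][j-1] + 2*src[i][j-1] + src[i+1][j-1]);
            float gy = (src[i+1][j-1] + 2*src[i+1][j] + src[i+1][j+1])
                     - (src[i-1][j-1] + 2*src[i-1][j] + src[i-1][j+1]);
            dst[i][j] = sqrtf(gx*gx + gy*gy);
        }
    }
}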

Page 31: Sobel performance estimates

Precision is worst for MuPC because of unaccounted-for cache overhead at 2 threads and because the vector pattern only approximates cache behavior at larger numbers of threads.

Precision (δ):

threads   MuPC w/o cache   MuPC w/ cache   Berkeley UPC   GCC UPC
2           4.8              -21.6            7.3          -10.3
4           1.4               15.8            8.3          -10.8
6          11.8               16.3            1.1           -6.3
8           9.9               16.8           -1.3            7.5
10          7.0               17.5           -4.3           -1.0
12         -0.5               14.3            5.0           -3.5

Page 32: 7. Summary and continuing work

This is a first attempt at a model for a PGAS language.

The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects.

Code analysis identifies the access patterns in use and matches them with available platform optimizations.

Performance predictions for simple codes are usually within 15% of actual run times and most of the time they are better than that.

Page 33: Improvements

Model coarse-grain shared memory operations such as upc_memcpy().

Model the overlap between memory accesses and computation.

Model contention for access to shared memory.

Model computational load imbalance.

Apply the model to larger applications and make measurements for larger numbers of threads.

Explore how the model can be automated.

Page 34: Questions?