A Performance Model for Fine-Grain Accesses in UPC

Zhang Zhang, Steve Seidel
Department of Computer Science
Michigan Technological University
{zhazhang,steve}@mtu.edu
http://www.upc.mtu.edu
Outline
1. Motivation and approach
2. The UPC programming model
3. Performance model design
4. Microbenchmarks
5. Application analysis
6. Measurements
7. Summary and continuing work
1. Motivation
Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming.
UPC compilers are available for platforms ranging from Macs to the Cray X1.
The Partitioned Global Address Space (PGAS) community is asking for performance models.
An accurate model determines whether an application takes best advantage of the underlying parallel system and identifies the code and system optimizations that are needed.
Approach
Construct an application-level analytical performance model.
Model fine-grain access performance based on platform benchmarks and code analysis.
Platform benchmarks determine compiler and runtime system optimization abilities.
Code analysis determines where these optimizations will be applied in the code.
2. The UPC programming model
UPC is an extension of ISO C99.
UPC processes are called threads: predefined identifiers THREADS and MYTHREAD are provided.
UPC is based on a partitioned shared memory model: a single address space is logically partitioned among processors. Partitioned memory is part of the programming paradigm; physical memory may or may not be partitioned.
UPC’s partitioned shared address space
Each thread has a private (local) address space.
All threads share a global address space that is partitioned among the threads.
A shared object in thread i’s region of the partition is said to have affinity to thread i.
If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.
A performance model must capture this property.
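As a minimal sketch (ours, not from the slides): with the default cyclic layout, element i of a shared array has affinity to thread i % THREADS, and the standard upc_threadof() call reports the owning thread.

#include <upc.h>

shared int counters[THREADS];   /* counters[i] has affinity to thread i */

int main(void) {
    counters[MYTHREAD] = 0;     /* local shared access: fast */
    upc_barrier;
    if (THREADS > 1 && MYTHREAD == 0) {
        /* counters[1] has affinity to thread 1, so this read by
           thread 0 is a remote (slower) access. */
        int remote = counters[1];
        (void)remote;
        (void)upc_threadof(&counters[1]);   /* returns 1 */
    }
    return 0;
}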
UPC programming model

[Slide figure: the private and shared address spaces of threads th0, th1, th2 for the code below. Each thread holds a private copy of i; A is distributed in blocks of 5 elements (A[0..4] to th0, A[5..9] to th1, A[10..14] to th2, A[15..19] back to th0, and so on).]

shared [5] int A[10*THREADS];
int i;

A[0]=7;        /* 7 is stored in A[0], which has affinity to th0 */
i=3;           /* each thread sets its private i to 3 */
A[i]=A[0]+2;   /* 9 is stored in A[3] */
3. Performance model design

Platform abstraction:
- identify potential optimizations
- microbenchmarks measure platform properties with respect to those optimizations

Application analysis:
- code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls
- characterize patterns of shared memory accesses
Platform abstraction

UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses:
- exploit spatial locality
- overlap remote accesses with other work
Each platform applies a different set of optimization techniques.
The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.
Potential optimizations
access aggregation: multiple accesses to shared memory that have the same affinity and can be postponed and combined
access vectorization: a special case of aggregation where the stride is regular
runtime caching: exploits temporal and spatial reuse
access pipelining: overlap independent concurrent remote accesses to multiple threads if the network allows
Potential optimizations (cont’d)
communication/computation overlapping: the usual technique applied by experienced programmers
multistreaming: provided by hardware with a memory system that can handle multiple streams of data
Notes:
- The effects of these optimizations are not disjoint; e.g., caching and coalescing can have similar effects.
- It can be difficult to determine with certainty which optimizations are actually at work.
- Microbenchmarks associated with the performance model try to reveal available optimizations.
4. Microbenchmarks identify available optimizations
Baseline: cost of random remote shared accesses when no optimizations can be applied
Vector: cost of accesses to consecutive remote locations; captures vectorization and runtime caching
Coalesce: cost of random, small-strided remote accesses; captures pipelining and aggregation
Local vs. private: cost of accesses to local (shared) memory; captures the overhead of shared memory addressing
Costs are expressed as access rates, in double words per second.
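For concreteness, a baseline-style benchmark could be structured as sketched below (ours, not the authors' actual harness; random_remote_index() is a hypothetical helper returning indices with affinity to other threads).

#include <upc.h>
#include <sys/time.h>

shared [1] double table[1024*THREADS];  /* cyclic: table[i] lives on thread i % THREADS */

static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

/* Time nacc random remote reads; return the rate in double words/sec. */
double baseline_read_rate(long nacc) {
    volatile double sink = 0.0;
    upc_barrier;
    double t0 = wall_seconds();
    for (long k = 0; k < nacc; k++)
        sink += table[random_remote_index(k)];  /* hypothetical index generator */
    double t1 = wall_seconds();
    upc_barrier;
    return nacc / (t1 - t0);
}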
5. Application analysis overview
Application code is partitioned into intervals based on sequence points such as barriers, fences, strict memory accesses, etc.
A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval.
References are partitioned into groups corresponding to the four benchmark types; the references in each group are amenable to the associated optimizations.
Costs are accumulated to obtain a performance prediction.
A reference partition
A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where:
- C is the set of accesses,
- pattern is one of {baseline, vector, coalesce, local},
- name is the accessed object, e.g., shared array A.
User defined functions are inlined to obtain flat code.
Alias analysis must be done.
Recursion is not modeled.
Dependence analysis

The dependence graph G=(V,E) of an interval has one vertex for each reference to a shared object (its name); edges connect dependent vertices.
The dependences considered are true dependence, antidependence, output dependence and input dependence.
The goal is to determine sets of accesses that can be done concurrently.
Reference partitioning graph

The reference partitioning graph G’=(V’,E’) is constructed from G.
Let B be the subset of E consisting of edges denoting true and antidependences.
Construct V’ by grouping vertices in V that:
- have the same name,
- reference memory locations with the same affinity,
- are not connected by an edge in B.
Each vertex in V’ is assigned a reference pattern.
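In code terms, the grouping rule reduces to a predicate like the sketch below (our illustration; Vertex, EdgeSet and connected() are hypothetical types and helpers, not from the paper).

/* Vertices u and v may be merged into one vertex of V' iff they name
   the same shared object, access locations with the same affinity,
   and are not linked by a true or antidependence edge in B. */
int same_group(const Vertex *u, const Vertex *v, const EdgeSet *B) {
    return u->name == v->name
        && u->affinity == v->affinity
        && !connected(B, u, v);
}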
Example 1
shared [] float *A;
/* A points to a remote block of shared memory */
for (i=1; i<N; i++) {
    ... = A[i];
    ... = A[i-1];
    ...
}
A[i] and A[i-1] are in the same partition.
If the platform supports vectorization then this pattern is assigned the vector type.
If not, each pair of accesses can be coalesced on each iteration.
If not, the baseline pattern is assigned to this partition.
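For intuition, the bulk transfer a vectorizing platform could substitute for the element-wise remote reads is sketched below (ours, not from the slides), using the standard upc_memget() library call:

float buf[N];                            /* private buffer */
upc_memget(buf, A, N * sizeof(float));   /* one bulk remote transfer */
for (i=1; i<N; i++) {
    ... = buf[i];                        /* purely local accesses from here on */
    ... = buf[i-1];
    ...
}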
Example 2
shared [1] float *B;
/* B is distributed one element per thread (round robin) */
for (i=1; i<N; i++) {
    ... = B[i];
    ... = B[i-1];
    ...
}
B[i] and B[i-1] are in different partitions.
Vectorization and coalescing cannot be applied.
The pattern is mixed baseline-local: with the cyclic distribution, on average 1/THREADS of the accesses are local, e.g., if THREADS=4 the mix is 75% baseline and 25% local.
For large numbers of threads the pattern is just baseline.
Communication cost

The communication cost of interval i is

T_i^{\mathrm{comm}} = \sum_{j \in \mathrm{partitions}} \frac{N_j}{r(N_j,\ \mathrm{pattern}_j)}

where N_j is the number of shared memory accesses in partition j and r(N_j, pattern_j) is the access rate for that number of references and that pattern.
The functions r(N_j, pattern) are determined by microbenchmarking.
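A worked instance (our arithmetic, using the MuPC w/o cache baseline read rate from the microbenchmark table later in the talk): a partition of N_j = 10,000 baseline remote reads at r ≈ 14.0K double words/sec contributes

T_j^{\mathrm{comm}} \approx \frac{10000}{14000\,/\mathrm{s}} \approx 0.71\ \mathrm{s}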
Computation cost

Computation cost is measured by simulating the computation using only private memory accesses.
The total run time of each interval i is

T_i = T_i^{\mathrm{comp}} + T_i^{\mathrm{comm}}

The cost of barriers is measured separately.
The predicted cost for each thread is the sum of all of these costs.
The highest predicted cost among all threads is taken to be the total cost.
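Written out (our notation, restating the bullets above):

T_{\mathrm{thread}} = \sum_{i \in \mathrm{intervals}} \left( T_i^{\mathrm{comp}} + T_i^{\mathrm{comm}} \right) + T^{\mathrm{barriers}}, \qquad T_{\mathrm{total}} = \max_{\mathrm{threads}} T_{\mathrm{thread}}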
Speedup prediction
Speedup can be estimated by the ratio of the number of accesses in the sequential code to the weighted sum of the number of remote accesses of each type.
Details are given in the paper.
6. Measurements

- Microbenchmarks
- Applications: Histogramming, Matrix multiply, Sobel edge detection
- Platforms measured:
  - 16-node 2 GHz x86 Linux Myrinet cluster: MuPC V1.1.2 beta, Berkeley UPC V2.2
  - 48-node 300 MHz Cray T3E: GCC UPC
Prediction precision

Execution time prediction precision is expressed as

\delta = \frac{T_{\mathrm{predicted}} - T_{\mathrm{actual}}}{T_{\mathrm{actual}}} \times 100\%
A negative value indicates that the cost is underestimated (the predicted time is below the actual time); a positive value indicates it is overestimated.
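For example (our arithmetic): predicting 1.10 s for a run that actually takes 1.00 s gives δ = (1.10 - 1.00)/1.00 × 100% = +10% (cost overestimated); predicting 0.95 s gives δ = -5% (cost underestimated).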
Microbenchmark measurements
A few observations:
- Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case (*) for MuPC.
- Caching improves MuPC performance for vector and coalesce and reduces performance for the (*) baseline write case.
- Berkeley successfully coalesces reads.
- GCC is unusually slow at local writes.
Microbenchmark measurements

Access rates in double words/sec, for 2 and 12 threads:

Microbenchmark   MuPC w/o cache    MuPC w/ cache     Berkeley UPC      GCC UPC
(threads)        read     write    read     write    read     write    read     write
baseline (2)     14.0K    35.6K    23.3K    11.4K    21.0K    46.5K    0.45M    1.3M
baseline (12)    12.0K    30.8K    10.8K    7.3K     15.1K    43.3K    0.40M    1.1M
vector (2)       16.4K    45.4K    1.0M     1.0M     21.1K    47.2K    0.5M     1.7M
vector (12)      10.0K    34.6K    0.77M    0.71M    14.6K    44.8K    0.45M    1.7M
coalesce (2)     16.7K    43.9K    82.0K    69.9K    172K     46.9K    0.5M     1.6M
coalesce (12)    12.9K    39.5K    61.1K    46.3K    122K     44.4K    0.4M     1.5M
local (2)        8.3M     8.3M     8.3M     8.3M     8.3M     6.7M     1.2M     0.7M
local (12)       8.3M     8.3M     8.3M     8.3M     6.7M     5.0M     1.0M     0.62M
Histogramming

shared [1] int array[N];
for (i=0; i<N*percentage; i++) {
    loc = random_index(i);
    array[loc]++;
}
upc_barrier;
Random elements of an array are updated in parallel.
Races are ignored.
percentage determines how much of the table is accessed.
Collisions grow as percentage gets smaller.
This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.
Histogramming performance estimates

For 12 threads the predicted cost is usually within 5%.

Percentage load   Precision (δ)
on table          MuPC     Berkeley   GCC
10%               -2.2     -0.38      -3.6
20%               -4.8     -0.25      -3.5
30%               -0.4      0.15      -3.5
40%               -4.0      0.32      -3.5
50%                1.6      0.45      -3.5
60%               -9.8      0.36      -3.6
70%               -4.6      0.39      -3.5
80%               -3.3      0.35      -3.5
90%               -9.7      0.54      -3.5
Matrix multiply
Thread t executes pass i if A[i][0] has affinity to t.
Remote accesses are minimized by distributing the rows of A and C across threads and the columns of B across threads.
Both cyclic striped and block striped distributions are measured.
Accesses to A and C are local; accesses to B are mixed vector-local.
upc_forall (i=0; i<N; i++; &A[i][0]) {
    for (j=0; j<N; j++) {
        C[i][j] = 0.0;
        for (k=0; k<N; k++)
            C[i][j] += A[i][k]*B[k][j];
    }
}
Matrix multiply performance estimates

N x N = 240 x 240

Berkeley and GCC costs are underestimated.
MuPC w/ cache costs are overestimated because temporal locality is not modeled.

Precision (δ); "cyclic" = cyclic striped, "block" = block striped:

          MuPC w/o cache    MuPC w/ cache     Berkeley UPC      GCC UPC
threads   cyclic   block    cyclic   block    cyclic   block    cyclic   block
2         -12.2    0.7      7.9      -2.1     -1.2     -1.9     -4.0     -14.0
4         14.5     0.3      15.3     19.5     -7.4     -14.7    -8.8     -10.5
6         3.6      -0.7     15.8     11.8     -4.5     -8.9     7.6      -2.0
8         2.2      -4.4     9.8      13.0     -7.0     -12.0    10.6     9.6
10        -0.4     -2.2     3.9      2.2      -7.4     -15.2    6.2      -3.2
12        2.9      0.9      9.8      8.8      -5.4     4.4      3.1      -5.7
Sobel edge detection
A 2000 x 2000-pixel image is distributed so that each thread gets approx. 2000/THREADS contiguous rows.
All accesses to the computed image are local.
Read-only accesses to the source array are mixed-local. The north and south border rows are in neighboring threads, all other rows are local.
Source array access patterns are:
- local-vector on MuPC with cache,
- local-coalesce on Berkeley because it coalesces,
- local-baseline on GCC because it does not optimize.
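A declaration sketch (ours; assumes a static-THREADS compilation environment in which THREADS divides 2000) that yields this row-block distribution:

#define N 2000
/* Block size (N/THREADS)*N gives each thread N/THREADS contiguous rows. */
shared [(N/THREADS)*N] float src[N][N], edge[N][N];

/* Thread t owns rows t*(N/THREADS) through (t+1)*(N/THREADS)-1; only the
   first and last of these rows read border rows owned by neighbors. */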
Sobel performance estimates

Precision is worst for MuPC because of unaccounted-for cache overhead for 2 threads, and because the vector pattern only approximates cache behavior for larger numbers of threads.

          Precision (δ)
threads   MuPC w/o cache   MuPC w/ cache   Berkeley UPC   GCC UPC
2         4.8              -21.6           7.3            -10.3
4         1.4              15.8            8.3            -10.8
6         11.8             16.3            1.1            -6.3
8         9.9              16.8            -1.3           7.5
10        7.0              17.5            -4.3           -1.0
12        -0.5             14.3            5.0            -3.5
7. Summary and continuing work
This is a first attempt at a model for a PGAS language.
The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects.
Code analysis identifies the access patterns in use and matches them with available platform optimizations.
Performance predictions for simple codes are usually within 15% of actual run times and most of the time they are better than that.
Improvements
Model coarse-grain shared memory operations such as upc_memcpy().
Model the overlap between memory accesses and computation.
Model contention for access to shared memory.
Model computational load imbalance.
Apply the model to larger applications and make measurements for larger numbers of threads.
Explore how the model can be automated.
Questions?