Computational Algebra on Computational Grids using Grid Parallel GpH
A. Al Zain
Heriot-Watt University
November 21, 2006
Outline
- The Design, Implementation and Evaluation of a Grid Parallel Language: GpH
- Computational Algebra on Computational Grids
Motivation
Dedicated High Performance Computers (HPCs) are rare and expensive.
Clusters are common and cheap. GRID technology offers the potential to network remote clusters into a low-cost, readily-available, scalable, large-scale HPC: a Computational Grid.
Motivation II
Computational Grids are a challenging platform: complex architecture, shared dynamic components, and high latency.
Computational Grids are commonly used for high-throughput computing: many small programs. We address parallel computing: one big program whose components must communicate and synchronise.
We propose a high-level parallel programming model with automatic dynamic resource management: Glasgow parallel Haskell (GpH).
Overview
- Background: GRID vs classical HPC; GpH; GUM
- GRIDGUM1: an initial port of GpH to the GRID; performance measurements and analysis
- Design of GRIDGUM2 with new adaptive mechanisms
- GRIDGUM2 performance measurements: hetero/homogeneous and low/high latency GRIDs; scalability
Comparing Computational GRIDs and HPCs
Classical HPC                        | Computational GRIDs
Flat: uniform latency to every PE    | Hierarchical communications structure
Homogeneous, i.e. PEs identical      | Heterogeneous, e.g. clusters of different sizes and CPU speeds
Dedicated interconnect               | Shared interconnect, hence variable latency
Low latency                          | Very high latency
GpH and GUM
GpH: a small extension of the Haskell functional language. High-level parallelism: annotate expressions to introduce and control parallelism.
GUM: a sophisticated Runtime Environment (RTE) for GpH. It automatically manages, using dynamic adaptive strategies: work distribution, communication, and distributed garbage collection.
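The annotation style can be sketched in plain Haskell. This is a minimal sketch, not the GUM sources: `par` and `pseq` below are sequential stand-ins with the same denotational meaning as GpH's combinators (the real versions are provided by the RTE), and `parFib` is modelled on the deck's parFib benchmark.

```haskell
-- Sequential stand-ins for GpH's combinators; under GUM, `par` sparks
-- its first argument for possible evaluation on another PE.
par :: a -> b -> b
par _ b = b

pseq :: a -> b -> b
pseq a b = a `seq` b

-- nfib-style divide-and-conquer, annotated in the GpH style:
-- spark one branch, evaluate the other locally, then combine.
parFib :: Int -> Int
parFib n
  | n < 2     = 1
  | otherwise = nf1 `par` (nf2 `pseq` (nf1 + nf2 + 1))
  where
    nf1 = parFib (n - 1)
    nf2 = parFib (n - 2)
```

With the real combinators, the same source runs unchanged either sequentially or in parallel under GUM; only the RTE decides whether sparks are picked up.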
GpH HPC Performance
Good performance for kernel programs on both shared and distributed memory machines [Loidl,Trinder et al 1999]
Compares favourably with other mature parallel functional language implementations [Loidl,Rubio Diez et al 2003]
Comparable with conventional technology, e.g. matrix multiplication is 60% slower than C/PVM on 16 PEs (the GpH program is 6 times shorter!) [Loidl,Rubio Diez et al 2003].
GRIDGUM1
GRIDGUM1 is an initial port of GpH to the GRID: it simply replaced the (PVM) communications layer with a grid-enabled communications layer (MPICH-G2).
Measurement Setup
Hardware Apparatus
Beowulf   CPU Speed (MHz)   Cache (kB)   Total Memory (kB)   PEs
Edin1     534               128          254856              32
Edin2     1395              256          191164              6
Edin3     1816              512          247816              10
Muni      1529              256          515500              7
SBC       933               256          110292              4
Software Apparatus
Program     Appl. Area   Paradigm   Regularity     Comm Deg (Pkt/s)   Size (SLOC)
queens      AI           Div-Conq   Regular        0.2 (low)          21
parFib      Numeric      Div-Conq   Regular        65.5 (high)        22
linSolv     Comp. Alg    Data Par   Limit irreg.   5.5 (low)          121
sumEuler    Num. Anal    Data Par   Irregular      2.09 (low)         31
matMult     Numeric      Div-Conq   Irregular      67.3 (high)        43
raytracer   Vision       Data Par   High irreg.    46.7 (high)        80
Summary of GRIDGUM1 Performance
Low latency computational grids:
a) Homogeneous: good and predictable speedups
b) Heterogeneous: poor speedups
High latency computational grids: GRIDGUM1 only delivers acceptable speedups for low communication degree programs.

Al Zain A., Trinder P.W., Loidl H.-W., Michaelson G.J. Managing Heterogeneity in a Grid Parallel Haskell. Journal of Scalable Computing: Practice and Experience 7(3) (September 2006), pp. 9-26.
Poor Load Management Limits GRIDGUM1 Performance
GRIDGUM2 Design
Incorporates new load management mechanisms for computational GRIDs.
Uses static and partial dynamic information to inform load management:
- PEs collect static info, like the configuration of every other PE
- PEs cheaply maintain dynamic info, including the latency to, and load of, other PEs
- PEs seek work from nearby (low latency) busy PEs
- dynamic information is lazily distributed in all messages
The first fully implemented virtual shared memory RTE on computational GRIDs.
Variable Latency Communication Management
Adapt to different and varying latencies by recording them:
- Record the communication latency to every other PE in a timestamped table.
- All messages are tagged with a generation time, so the recipient PE can calculate the latency.
- Send additional work to minimise high-latency (e.g. inter-cluster) communication.
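Such a table can be sketched as follows. This is an illustrative sketch, not the GRIDGUM2 sources: the name `recordLatency` and the millisecond timestamps are assumptions. Each entry keeps the most recent observation, derived from the generation time carried in a message and its arrival time.

```haskell
import qualified Data.Map as Map

type PE     = String
type Millis = Int

-- Per-PE entry: (time of last observation, observed latency to that PE)
type LatencyTable = Map.Map PE (Millis, Millis)

-- Record the latency to a PE, computed from the message's generation
-- time and arrival time; keep only the most recent observation.
recordLatency :: PE -> Millis -> Millis -> LatencyTable -> LatencyTable
recordLatency pe sentAt arrivedAt =
  Map.insertWith newer pe (arrivedAt, arrivedAt - sentAt)
  where
    newer new@(tNew, _) old@(tOld, _)
      | tNew >= tOld = new
      | otherwise    = old
```

Keeping the timestamp alongside the measurement is what lets stale entries be superseded as conditions on the shared interconnect change.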
Targeted Load Management
- During startup each PE broadcasts its hardware configuration (e.g. CPU speed) to the other PEs, and records those of the other PEs.
- Each PE records the (timestamped) load of every other PE in a table, and attaches it to all messages.
- Start the computation in the strongest cluster (no. of PEs x CPU speed).
- PEs only seek work from PEs with a high load relative to their CPU speed.
- PEs prefer to obtain work from low-latency PEs.
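The "strongest cluster" rule can be sketched directly. `strongestCluster` is a hypothetical helper, not GRIDGUM2 code; the figures in the usage below come from the hardware table (number of PEs, CPU speed in MHz).

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- (cluster name, number of PEs, CPU speed in MHz)
type Cluster = (String, Int, Int)

-- Pick the cluster maximising no. of PEs x CPU speed:
-- where the main computation should be started.
strongestCluster :: [Cluster] -> String
strongestCluster = name . maximumBy (comparing strength)
  where
    strength (_, pes, mhz) = pes * mhz
    name (n, _, _)         = n
```

For example, Edin3 (10 x 1816 = 18160) beats Edin1 (32 x 534 = 17088) despite having far fewer PEs, which is exactly why the metric multiplies count by speed rather than counting PEs alone.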
GRIDGUM1 RayTracer
GRIDGUM2 RayTracer
GRIDGUM2 Performance
Computational Grid configurations:
- Low-latency: homogeneous, heterogeneous
- High-latency: homogeneous, heterogeneous
Low Latency Homogeneous
Program   GRIDGUM1                 GRIDGUM2                 Var Redn
          Rtime(s)  Var   Var %    Rtime(s)  Var   Var %
queens    648       149   23%      649       2.6   0.4%     98%
parFib    84        22    26%      88        3.6   4%       84%
linSolv   176       63    36%      149       7.2   4.8%     86%
sumEul.   117       55    47%      116       20    17%      63%
raytr.    476       168   35%      448       27    6.2%     82%

GRIDGUM1 and GRIDGUM2 performance variation on 10 PEs
RTE Overheads
GRIDGUM1 and GRIDGUM2 overheads on 16 PEs

Program   RTE   No. of    Max Heap         Alloc Rate   Comm Degree   Avg Pkt
                Threads   Residency (kB)   (MB/s)       (Msgs/s)      Size (Byte)
linSolv   GG1   242       437.2            40.3         5.50          290.7
          GG2   242       437.2            26.5         2.54          276.4
matMult   GG1   144       4.3              39.0         67.30         208.9
          GG2   144       4.3              40.0         31.29         209.4
…
Low Latency Homogeneous Grid Performance Summary
GRIDGUM2 maintains good GRIDGUM1 performance.
Under GRIDGUM2 programs exhibit far less (at least 63% less) performance variability than under GRIDGUM1.
GRIDGUM2 retains light overhead and does not significantly change programs' dynamic properties.
Low Latency Heterogeneous
Performance on 4 Edin1 and 4 Edin2 PEs
Program     Run-time (s)      Improvement
            GG1      GG2
raytracer   1340     572      57%
queens      668      310      53%
sumEuler    570      279      51%
linSolv     217      180      17%
matMult     94       86       9%
parFib      136      134      1%
Low Latency Heterogeneous Grid Performance Summary
GRIDGUM2 improves the performance of 5 programs and maintains the good performance of the 6th.
Only certain programs are sensitive to heterogeneity: some already give good performance, e.g. parFib; others are at some performance bound, e.g. matMult.
High Latency Homogeneous (raytracer)
Case   Config.   Mean Latency (ms)   GG1 Rtime(s)   GG2 Rtime(s)   Impr %
1      1E4M      14.4                995            617            38%
2      2E3M      21.5                911            703            23%
3      3E2M      21.5                772            754            2%
4      4E1M      14.4                668            642            4%
High Latency Homogeneous Grid Performance Summary
GRIDGUM2 outperforms GRIDGUM1 on all homogeneous high-latency configurations for all 3 sensitive programs.
GRIDGUM2 improves programs with a range of parallel behaviours; e.g. sumEuler, with a low comm-degree and irregular parallelism, also improves by 30%.
Distinguishing Static and Dynamic Improvements
A special RTE, GRIDGUM1.1, uses only static information:
- launches the program on the strongest cluster
- prevents slow PEs from extracting work from fast PEs
- no collection or use of dynamic information
High Latency Heterogeneous (raytracer)
Config.   Mean Ltncy   Mean CPU     GG1     GG1.1   Static   GG2     Dynamic   Total
          (ms)         Spd (MHz)    Rtime   Rtime   Impr     Rtime   Impr      Impr
1E4M      14.4         1330         1490    689     53%      583     7%        60%
2E3M      21.5         1131         1223    745     39%      716     2%        41%
3E2M      21.5         932          1254    983     21%      961     2%        23%
4E1M      14.4         733          1296    1597    -23%     1236    27%       4%
High Latency Heterogeneous Grid Performance Summary
Compared with GRIDGUM1, GRIDGUM2 improves the performance of all three programs on all configurations measured.
GRIDGUM2's static information gives substantial improvements when there are more fast PEs than slow PEs.
GRIDGUM2's dynamic load and latency information improves performance on all of the heterogeneous high-latency GRID configurations measured.
Substantial maximum improvement (60%) for the high communication degree program (raytracer); lower improvements for lower communication degree programs (sumEuler: 31% and queens: 35%).
Scalability raytracer
Case   Config.      GG1 Rtime   Spdup   GG2 Rtime   Spdup
1      6E1M         2530        7       2470        7
2      12E2M        2185        8       1752        10
3      18E3M        1824        10      1527        12
4      24E4M        1776        10      1359        13
5      30E5M        1666        11      1278        14
6      5E2 30E6M    1652        11      1133        16
Scalability Summary
GRID technology offers the opportunity to improve performance by integrating remote heterogeneous clusters.
For realistic programs the parallel performance of GRIDGUM2 scales to medium scale heterogeneous high-latency computational GRIDs: 41 PEs in three clusters.
The overheads of GRIDGUM2 load management are relatively low, even on medium scale computational GRIDs.
SymGrid-Par: Computational Algebra on Computational Grids
Outline:
- Parallel Computational Algebra
- Multiple Parallel Computational Algebra
- CAG (Computational Algebra GpH/Grid) Interface
- GCA (GpH/Grid Computational Algebra) Prototype Design
Parallel Computational Algebra
Multiple Parallel Computational Algebra
(diagram: SymGrid-Par coordinating multiple Computational Algebra systems)
CAG Interface (to do)
We believe CA users should not need to learn a middleware language/system (GpH/Globus).
Build skeletons (list, map, fold, ...) which can be translated to the GpH middleware.
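A sketch of what such a skeleton layer could look like. Names like `parMapSkel` and `parFoldSkel` are illustrative, not the SymGrid-Par API, and the strategy machinery is a sequential stand-in for the GpH version.

```haskell
-- Sequential stand-ins for GpH's strategy machinery.
type Strategy a = a -> ()

using :: a -> Strategy a -> a
using x strat = strat x `seq` x

evalList :: Strategy [a]
evalList _ = ()  -- the GpH version would force/spark list elements

-- map skeleton: the GpH middleware would evaluate the list in parallel,
-- so the CA user writes only the per-element function.
parMapSkel :: (a -> b) -> [a] -> [b]
parMapSkel f xs = map f xs `using` evalList

-- fold skeleton over an associative operator: the middleware may split
-- the list, fold the parts in parallel, and combine the partial results.
parFoldSkel :: (a -> a -> a) -> a -> [a] -> a
parFoldSkel op z xs = foldr op z (parMapSkel id xs)
```

The point of the interface is that only the bodies of `using`/`evalList` change when the skeletons are retargeted at the GpH middleware; the CA-side calls stay the same.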
GCA Prototype Design
Run the CA system as-is in a separate (Unix) process.
All interfacing to the CA system is done through the standard I/O of this process.
A small interpreter is written on the CA side.
The interface can ask the interpreter to invoke any CA function, and to translate data objects from external format to CA internal format and the other way around.
(figure: process topology of a hybrid GpH-CA program)
more ... GCA Design
The interface consists of:
- a Haskell fragment
- a C fragment
The C part invokes Posix services, needed to:
- initiate the CA process
- establish the pipe connection
- send commands and receive results
It also provides static memory to store Unix objects that must be preserved between calls.
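The traffic over the pipe can be sketched as a simple textual protocol. The wire format below (`encodeCall`, `decodeResult`) is purely hypothetical, chosen to match GAP's own `Fn(arg1,arg2);` call syntax; it is not taken from the GCA sources.

```haskell
import Data.List (intercalate)

-- Hypothetical wire format for a command sent down the pipe to the
-- interpreter on the CA side: "FunctionName(arg1,arg2);" in GAP syntax.
encodeCall :: String -> [String] -> String
encodeCall fn args = fn ++ "(" ++ intercalate "," args ++ ");"

-- Strip the trailing newline from a one-line result read back
-- from the pipe.
decodeResult :: String -> String
decodeResult = takeWhile (/= '\n')
```

Because the commands are ordinary CA-language text, the interpreter on the CA side needs no special parser: it can hand each line straight to the system's own evaluator.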
GCA ... Prototype Implementation and Results
Fibonacci: the GpH-GAP implementation of Fibonacci shows an average parallelism of 4 on 4 machines when computing Fibonacci 62.
(figure: per-PE activity profile for Fibonacci 62)
GCA ... Prototype Implementation and Results
Small Group: the GpH-GAP implementation computing Small-Group (1-250) on 5 machines shows an average parallelism of 2.9.
(figure: Small-Group (1-250) on 5 machines)
smallGroupsSearch := function(N, IntAvgOrder1)
  local hits, n, i, g;
  hits := [];
  for n in [1..N] do
    for i in [1 .. NrSmallGroups(n)] do
      g := SmallGroup(n, i);
      if IntAvgOrder1(g) then
        Add(hits, [n, i]);
      fi;
    od;
  od;
  return hits;
end;
IntAvgOrder1 := function(g)
  local cc, sum, c;
  sum := 0;
  cc := ConjugacyClasses(g);
  for c in cc do
    sum := sum + Size(c) * Order(Representative(c));
  od;
  return (sum mod Size(g)) = 0;
end;
Small Group Code
module Main where
import GAPAPI
import Parallel
import System
import Strategies

main :: IO ()
main = do
#if !defined(__PARALLEL_HASKELL__)
  _ccall_ gapInit
#endif
  x <- getArgs
  let lo        = read (x!!0)
  let hi        = read (x!!1)
  let chunkSize = read (x!!2)
  print (smallGroupSearch lo hi chunkSize predSmallGroup)
#if !defined(__PARALLEL_HASKELL__)
  _ccall_ gapTerm
#endif
Small Group Code, GpH
smallGroupSearch :: Int -> Int -> Int -> ((Int,Int) -> (Int,Int,Bool)) -> [(Int,Int)]
smallGroupSearch lo hi chunkSize pred =
  concat (map (ifmatch pred) [lo..hi]
            `using` parListChunk chunkSize Strategies.rnf)

predSmallGroup :: (Int,Int) -> (Int,Int,Bool)
predSmallGroup (i,n) =
  (i, n, gapObject2String (gapEval "IntAvgOrder"
           [int2GAPObject n, int2GAPObject i]) == "true")

ifmatch :: ((Int,Int) -> (Int,Int,Bool)) -> Int -> [(Int,Int)]
ifmatch predSmallGroup n =
  [ (i,n) | (i,n,b) <- map predSmallGroup [(i,n) | i <- [1..nrSmallGroups n]]
                         `using` parListBigChunk 200000 Strategies.rnf
          , b ]

nrSmallGroups :: Int -> Int
nrSmallGroups n = gapObject2Int (gapEval "NrSmallGroups" [int2GAPObject n])
Small Group Code, GpH (continued)
Thank You
Questions?