
Computational Algebra on Computational Grids using Grid Parallel GpH
A. Al Zain, Heriot-Watt University


Page 1

Computational Algebra on Computational Grids using Grid Parallel GpH

A. Al Zain

Heriot-Watt University

Page 2

November 21, 2006

Outline

The Design, Implementation and Evaluation of a Grid Parallel Language: GpH

Computational Algebra on Computational Grids

Page 3

Motivation

Dedicated High Performance Computers (HPCs) are rare and expensive.

Clusters are common and cheap. GRID technology offers the potential to network remote clusters into

a low-cost, readily-available, scalable, large-scale HPC: a Computational Grid.

Page 4

Motivation II

Computational Grids are a challenging platform: complex architecture, shared dynamic components & high latency

Computational Grids are commonly used for high-throughput computing: many small programs.

We address parallel computing: one big program whose components must communicate & synchronise.

We propose a high-level parallel programming model with automatic dynamic resource management: Glasgow parallel Haskell (GpH)

Page 5

Overview

Background: GRID vs Classical HPC, GpH, GUM

GRIDGUM1: an initial port of GpH to the GRID
  performance measurements
  analysis

Design of GRIDGUM2 with new adaptive mechanisms

GRIDGUM2 performance measurements:
  hetero/homogeneous and low/high latency GRIDs
  scalability

Page 6

Comparing Computational GRIDs and HPCs

Classical HPC                     | Computational GRIDs
Flat: uniform latency to every PE | Hierarchical communications structure
Homogeneous, i.e. PEs identical   | Heterogeneous: e.g. clusters of different sizes and CPU speeds
Dedicated interconnect            | Shared interconnect, hence variable latency
Low latency                       | Very high latency

Page 7

GPH and GUM

GpH: a small extension of the Haskell functional language.
  High-level parallelism: annotate expressions to introduce & control parallelism.

GUM: a sophisticated Runtime Environment (RTE) for GpH.
  Automatically manages, using dynamic adaptive strategies:
    work distribution
    communication
    distributed garbage collection
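To make the annotation style concrete, here is a minimal GpH-style sketch (not from the talk): par sparks a subexpression for possible parallel evaluation, pseq orders evaluation, and GUM decides where sparks actually run. It uses the Control.Parallel module of the current parallel package; the 2006 GpH sources import a Parallel module instead.

-- Minimal illustrative sketch, not from the talk: a parallel naive Fibonacci
-- written with GpH-style annotations.
import Control.Parallel (par, pseq)

nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = x `par` (y `pseq` x + y + 1)  -- spark x, evaluate y, then combine
  where
    x = nfib (n - 1)
    y = nfib (n - 2)

main :: IO ()
main = print (nfib 30)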

Page 8

GpH HPC Performance

Good performance for kernel programs on both shared and distributed memory machines [Loidl,Trinder et al 1999]

Compares favourably with other mature parallel functional language implementations [Loidl,Rubio Diez et al 2003]

Comparable with conventional technology, e.g. matrix multiplication is 60% slower than C/PVM on 16 PEs (the GpH program is 6 times shorter!) [Loidl,Rubio Diez et al 2003].

Page 9

GRIDGUM1

GRIDGUM1 is an initial port of GpH to the GRID: it simply replaced the PVM communications layer with a grid-enabled communications layer (MPICH-G2).

Page 10

Measurement Setup

Hardware Apparatus

Beowulf | CPU Speed (MHz) | Cache (kB) | Memory Total (kB) | PEs
Edin1   | 534             | 128        | 254856            | 32
Edin2   | 1395            | 256        | 191164            | 6
Edin3   | 1816            | 512        | 247816            | 10
Muni    | 1529            | 256        | 515500            | 7
SBC     | 933             | 256        | 110292            | 4

Page 11

Software Apparatus

Program   | Appl. Area | Paradigm | Regularity   | Comm Deg (Pkt/s) | Size (SLOC)
queens    | AI         | Div-Conq | Regular      | 0.2 (low)        | 21
parFib    | Numeric    | Div-Conq | Regular      | 65.5 (high)      | 22
linSolv   | Comp. Alg  | Data Par | Limit irreg. | 5.5 (low)        | 121
sumEuler  | Num. Anal  | Data Par | Irregular    | 2.09 (low)       | 31
matMult   | Numeric    | Div-Conq | Irregular    | 67.3 (high)      | 43
raytracer | Vision     | Data Par | High irreg.  | 46.7 (high)      | 80

Page 12

Summary of GRIDGUM1 Performance

Low latency computational grids:
  a) Homogeneous: good and predictable speedups
  b) Heterogeneous: poor speedups

High latency computational grids: GRIDGUM1 only delivers acceptable speedups for low communication degree programs.

Al Zain A., Trinder P.W., Loidl H.-W., Michaelson G.J. Managing Heterogeneity in a Grid Parallel Haskell. Scalable Computing: Practice and Experience 7(3), September 2006, pp. 9-26.

Page 13

Poor Load Management Limits GRIDGUM1 Performance

Page 14

GRIDGUM2 Design

Incorporates new load management mechanisms for computational GRIDs

Use static & partial dynamic information to inform load management:
  PEs collect static info, e.g. the configuration of every other PE
  PEs cheaply maintain dynamic info, including the latency to, and load of, other PEs
  PEs seek work from nearby (low latency) busy PEs
  PEs lazily distribute dynamic information in all messages

The first fully implemented virtual shared memory RTE on computational GRIDs.

Page 15

Variable Latency Communication Management

Adapt to different and varying latencies by recording latencies:
  record the communication latency to every other PE in a timestamped table
  all messages are tagged with a generation-time, so the recipient PE can calculate the latency
  send additional work to minimise high-latency (e.g. inter-cluster) communication
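As a purely illustrative sketch (not GUM's actual data structure), the latency bookkeeping described above might look roughly like this: the arrival time minus the message's generation-time tag gives the latency, which is stored with a timestamp per sending PE.

-- Illustrative sketch only, assuming a plain association list as the table.
import Data.Time.Clock.POSIX (getPOSIXTime)

type PE = Int

-- per-PE entry: (latency in seconds, time at which it was measured)
type LatencyTable = [(PE, (Double, Double))]

recordLatency :: PE -> Double -> LatencyTable -> IO LatencyTable
recordLatency sender generationTime table = do
  now <- realToFrac <$> getPOSIXTime
  let latency = now - generationTime        -- arrival time minus generation-time tag
  return ((sender, (latency, now)) : filter ((/= sender) . fst) table)

main :: IO ()
main = do
  t0    <- realToFrac <$> getPOSIXTime      -- pretend a message was generated now
  table <- recordLatency 3 t0 []
  print table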

Page 16

Targeted Load Management

During startup each PE broadcasts its hardware configuration (e.g. CPU speed) to other PEs, and records those of other PEs.

Each PE records the (timestamped) load of every other PE in a table, and attaches it to all messages.

Start the computation in the strongest cluster (no. PEs x CPU speed)

PEs only seek work from PEs with a high load relative to their CPU speed

PEs prefer to obtain work from low-latency PEs
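The sketch below illustrates this policy only; the record fields and the threshold are assumptions, not GUM internals. A PE filters for peers whose load is high relative to their CPU speed and, among those, picks the one with the lowest recorded latency.

-- Illustrative sketch of the victim-selection policy described above.
import Data.List (minimumBy)
import Data.Ord  (comparing)

data PEInfo = PEInfo
  { peId      :: Int
  , cpuMHz    :: Double   -- static info broadcast at startup
  , load      :: Double   -- latest load heard from this PE (0..1)
  , latencyMs :: Double   -- recorded message latency to this PE
  } deriving Show

-- load per GHz: how busy a PE is relative to its speed
relLoad :: PEInfo -> Double
relLoad p = load p / (cpuMHz p / 1000)

-- choose the lowest-latency PE among the relatively busy ones
chooseVictim :: [PEInfo] -> Maybe PEInfo
chooseVictim pes =
  case filter ((> busyThreshold) . relLoad) pes of
    []   -> Nothing
    busy -> Just (minimumBy (comparing latencyMs) busy)
  where busyThreshold = 0.5   -- arbitrary illustrative cut-off

main :: IO ()
main = print (chooseVictim
  [ PEInfo 1  534 0.9  0.4     -- slow, busy, local
  , PEInfo 2 1529 0.9 21.5     -- fast, busy, remote
  , PEInfo 3 1816 0.2 14.4 ])  -- fast, idle, remote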

Page 17

GRIDGUM1 RayTracer

Page 18

GRIDGUM2 RayTracer

Page 19

GRIDGUM2 Performance

Computational Grid configurations:
  Low-latency: homogeneous, heterogeneous
  High-latency: homogeneous, heterogeneous

Page 20

Low Latency Homogeneous

Program | GG1 Rtime (s) | GG1 Var | GG1 Var % | GG2 Rtime (s) | GG2 Var | GG2 Var % | Var Redn
queens  | 648 | 149 | 23% | 649 | 2.6 | 0.4% | 98%
parFib  |  84 |  22 | 26% |  88 | 3.6 | 4%   | 84%
linSolv | 176 |  63 | 36% | 149 | 7.2 | 4.8% | 86%
sumEul. | 117 |  55 | 47% | 116 |  20 | 17%  | 63%
raytr.  | 476 | 168 | 35% | 448 |  27 | 6.2% | 82%

GRID-GUM1 and GRID-GUM2 Perf. Variation on 10 PEs


Page 21

RTE Overheads

GRIDGUM1 and GRIDGUM2 overheads on 16 PEs

Program | RTE | No of Threads | Max Heap Residency (KB) | Alloc Rate (MB/s) | Comm Degree (Msgs/s) | Average Pkt Size (Byte)
linSolv | GG1 | 242 | 437.2 | 40.3 |  5.50 | 290.7
linSolv | GG2 | 242 | 437.2 | 26.5 |  2.54 | 276.4
matMult | GG1 | 144 |   4.3 | 39.0 | 67.30 | 208.9
matMult | GG2 | 144 |   4.3 | 40.0 | 31.29 | 209.4

Page 22

Low Latency Homogeneous Grid Performance Summary

GRIDGUM2 maintains good GRIDGUM1 performance.

Under GRIDGUM2, programs exhibit far less (at least 63% less) performance variability than under GRIDGUM1.

GRIDGUM2 retains light overhead and does not significantly change a program's dynamic properties.

Page 23

Low Latency Heterogeneous

Performance on 4 Edin1 and 4 Edin2 PEs

Program   | GG1 Run-time (s) | GG2 Run-time (s) | Improvement
raytracer | 1340 | 572 | 57%
queens    |  668 | 310 | 53%
sumEuler  |  570 | 279 | 51%
linSolv   |  217 | 180 | 17%
matMult   |   94 |  86 |  9%
parFib    |  136 | 134 |  1%

Page 24

Low Latency Heterogeneous Grid Performance Summary

GRIDGUM2 improves the performance of 5 programs & maintains the good performance of the 6th.

Only certain programs are sensitive to heterogeneity: some already give good performance, e.g. parFib; others are at some performance bound, e.g. matMult.

Page 25

High Latency Homogeneous (raytracer)

Case | Config. | Mean Latency (ms) | GG1 Run-time (s) | GG2 Run-time (s) | Impr %
1    | 1E4M    | 14.4              | 995              | 617              | 38%
2    | 2E3M    | 21.5              | 911              | 703              | 23%
3    | 3E2M    | 21.5              | 772              | 754              |  2%
4    | 4E1M    | 14.4              | 668              | 642              |  4%

Page 26

High Latency Homogeneous Grid Performance Summary

GRIDGUM2 outperforms GRIDGUM1 on all homogeneous high latency configurations for all 3 sensitive programs.

GRIDGUM2 improves programs with a range of parallel behaviours: even sumEuler, with its low comm-degree and irregular parallelism, improves by 30%.

Page 27

Distinguishing Static and Dynamic Improvements

A special RTE, GRIDGUM1.1, uses only static information:
  launches the program on the strongest cluster
  prevents slow PEs from extracting work from fast PEs
  no collection or use of dynamic information

Page 28

High Latency Heterogeneous (raytracer)

Config. | Mean Latency (ms) | Mean CPU Speed (MHz) | GG1 Rtime (s) | GG1.1 Rtime (s) | Static Impr | GG2 Rtime (s) | Dynamic Impr | Total Impr
1E4M    | 14.4 | 1330 | 1490 |  689 |  53% |  583 |  7% | 60%
2E3M    | 21.5 | 1131 | 1223 |  745 |  39% |  716 |  2% | 41%
3E2M    | 21.5 |  932 | 1254 |  983 |  21% |  961 |  2% | 23%
4E1M    | 14.4 |  733 | 1296 | 1597 | -23% | 1236 | 27% |  4%

(Static, dynamic, and total improvements are all expressed relative to the GG1 run-time, so the static and dynamic improvements sum to the total.)

Page 29

High Latency Heterogeneous Grid Performance Summary

Compared with GRIDGUM1, GRIDGUM2 improves the performance of all three programs on all configurations measured.

GRIDGUM2's static information gives substantial improvements when there are more fast PEs than slow PEs.

GRIDGUM2's dynamic load and latency information improves performance on all of the heterogeneous high-latency GRID configurations measured.

Substantial maximum improvement (60%) for the high communication degree program (raytracer); lower improvements for lower communication degree programs (sumEuler: 31% and queens: 35%).

Page 30

Scalability raytracer

Case | Config.    | GG1 Rtime (s) | GG1 Spdup | GG2 Rtime (s) | GG2 Spdup
1    | 6E1M       | 2530 |  7 | 2470 |  7
2    | 12E2M      | 2185 |  8 | 1752 | 10
3    | 18E3M      | 1824 | 10 | 1527 | 12
4    | 24E4M      | 1776 | 10 | 1359 | 13
5    | 30E5M      | 1666 | 11 | 1278 | 14
6    | 5E2 30E 6M | 1652 | 11 | 1133 | 16

Page 31

Scalability Summary

GRID technology offers the opportunity to improve performance by integrating remote heterogeneous clusters.

For realistic programs the parallel performance of GRIDGUM2 scales to medium scale heterogeneous high-latency computational GRIDs: 41 PEs in three clusters.

The overheads of GRIDGUM2 load management are relatively low, even on medium scale computational GRIDs.

Page 32

SymGrid-Par: Computational Algebra on Computational Grids

Outline:
  Parallel Computational Algebra
  Multiple Parallel Computational Algebra
  CAG (Computational Algebra GpH/Grid) Interface
  GCA (GpH/Grid Computational Algebra) Prototype Design

Page 33

Parallel Computational Algebra

Page 34

Multiple Parallel Computational Algebra

[Diagram: multiple Computational Algebra systems coordinated through SymGrid-Par]

Page 35

CAG Interface (to do)

We believe CA users do not need to learn a middle-ware language/system (GpH/Globus)

Build skeletons (list, map, fold, ...) which can be translated to GpH middle-ware
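As a minimal sketch of the skeleton idea (not the CAG interface itself): the CA user supplies an ordinary function and a list, and the GpH middleware evaluates the results in parallel; in the CAG setting the function would typically be a call into the CA system through the GCA interface. The sketch uses the current Control.Parallel.Strategies API, whereas the 2006 code imports a Strategies module with the same idea.

-- Illustrative map skeleton; parMapSkel is a hypothetical name.
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Control.DeepSeq (NFData)

parMapSkel :: NFData b => (a -> b) -> [a] -> [b]
parMapSkel f xs = map f xs `using` parList rdeepseq

main :: IO ()
main = print (parMapSkel (\n -> sum [1 .. n :: Integer]) [100000, 200000, 300000])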

Page 36

GCA Prototype Design

Run the CA system as-is in a separate (Unix) process.

All interfacing to the CA system is done through the standard I/O of this process.

A small interpreter is written on the CA side.

The interface can ask the interpreter to invoke any CA function and to translate data objects from the external format to the CA internal format and back.

Process topology of a hybrid GpH-CA program
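A minimal sketch of the pipe idea, assuming a GAP-like executable named "gap" on the PATH: the thesis prototype does this through a C/Posix fragment (next slide) rather than System.Process, and the command/response exchange shown here is purely illustrative.

-- Illustrative only: talk to the CA process over its standard input/output.
import System.IO
import System.Process

main :: IO ()
main = do
  (Just toCA, Just fromCA, _, _) <-
    createProcess (proc "gap" ["-q"]) { std_in  = CreatePipe
                                      , std_out = CreatePipe }
  hSetBuffering toCA LineBuffering
  hPutStrLn toCA "NrSmallGroups(64);"   -- ask the CA-side interpreter to run a function
  result <- hGetLine fromCA             -- read the textual result back for translation
  putStrLn ("CA answered: " ++ result)
  hPutStrLn toCA "quit;"                -- shut the CA process down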

Page 37

more ... GCA Design

The interface consists of:
  a Haskell fragment
  a C fragment

The C part invokes Posix services, needed to:
  initiate the CA process
  establish the pipe connection
  send commands or receive results

It also provides static memory to store Unix objects that must be preserved between calls.

Page 38

GCA ... Prototype Implementation and Results

Fibonacci: the GpH-GAP implementation of Fibonacci shows an average parallelism of 4 on 4 machines when computing Fibonacci 62.

per-PE activity profile for Fibonacci 62
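The slides do not show the Fibonacci code itself. Purely as an illustration of what a GpH-GAP Fibonacci in the style of the Small Group code on the later slides might look like, the hypothetical sketch below sparks the two recursive calls and drops to a direct GAP call below a threshold. It assumes the prototype's GAPAPI module, GAP's built-in Fibonacci function, and the Control.Parallel combinators; none of it is taken from the measured program.

-- Hypothetical sketch only, not the measured program.
import GAPAPI                       -- gapEval, int2GAPObject, gapObject2Int (prototype module)
import Control.Parallel (par, pseq)

-- call GAP's Fibonacci directly through the GCA interface
fibGAP :: Int -> Int
fibGAP n = gapObject2Int (gapEval "Fibonacci" [int2GAPObject n])

-- spark the recursive calls in GpH; small arguments go straight to GAP
parFibGAP :: Int -> Int
parFibGAP n
  | n < 40    = fibGAP n
  | otherwise = x `par` (y `pseq` x + y)
  where
    x = parFibGAP (n - 1)
    y = parFibGAP (n - 2)

main :: IO ()
main = print (parFibGAP 62)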

Page 39

GCA ... Prototype Implementation and Results

Small Group: the GpH-GAP implementation computing Small-Group (1-250) on 5 machines shows an average parallelism of 2.9.

Small-Group (1-250) in 5 machines

Page 40

smallGroupsSearch := function(N, IntAvgOrder1)
  local hits, n, i, g;
  hits := [];
  for n in [1..N] do
    for i in [1..NrSmallGroups(n)] do
      g := SmallGroup(n, i);
      if IntAvgOrder1(g) then
        Add(hits, [n, i]);
      fi;
    od;
  od;
  return hits;
end;

IntAvgOrder1 := function(g)
  local cc, sum, c;
  sum := 0;
  cc := ConjugacyClasses(g);
  for c in cc do
    sum := sum + Size(c) * Order(Representative(c));
  od;
  return (sum mod Size(g)) = 0;
end;

Small Group Code

Page 41

module Main where

import GAPAPI
import Parallel
import System
import Strategies

main :: IO ()
main = do
#if !defined(__PARALLEL_HASKELL__)
  _ccall_ gapInit
#endif
  x <- getArgs
  let lo        = read (x!!0)
  let hi        = read (x!!1)
  let chunkSize = read (x!!2)
  print (smallGroupSearch lo hi chunkSize predSmallGroup)
#if !defined(__PARALLEL_HASKELL__)
  _ccall_ gapTerm
#endif

Small Group Code, GpH

Page 42

smallGroupSearch :: Int -> Int -> Int -> ((Int,Int) -> (Int,Int,Bool)) -> [(Int,Int)]
smallGroupSearch lo hi chunkSize pred =
  concat (map (ifmatch pred) [lo..hi]
            `using` parListChunk chunkSize Strategies.rnf)

predSmallGroup :: (Int,Int) -> (Int,Int,Bool)
predSmallGroup (i,n) =
  (i, n, gapObject2String (gapEval "IntAvgOrder" [int2GAPObject n, int2GAPObject i]) == "true")

ifmatch :: ((Int,Int) -> (Int,Int,Bool)) -> Int -> [(Int,Int)]
ifmatch predSmallGroup n =
  [ (i,n) | (i,n,b) <- map predSmallGroup [(i,n) | i <- [1..nrSmallGroups n]]
                         `using` parListBigChunk 200000 Strategies.rnf
          , b ]

nrSmallGroups :: Int -> Int
nrSmallGroups n = gapObject2Int (gapEval "NrSmallGroups" [int2GAPObject n])

Small Group Code, GpH (continued)

Page 43

Thank You

Questions?