38
Grid Computing and the TeraGrid GI Science Gateway Kate Cowles 22S:295 High Performance Computing Seminar, Nov. 29, 2007 () 1 / 38

Grid Computing and the TeraGrid GI Science Gateway

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Grid Computing and the TeraGrid GI Science Gateway

Grid Computing andthe TeraGrid GI Science Gateway

Kate Cowles

22S:295 High Performance Computing Seminar, Nov. 29, 2007

() 1 / 38

Page 2: Grid Computing and the TeraGrid GI Science Gateway

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 2 / 38

Page 3: Grid Computing and the TeraGrid GI Science Gateway

Example data

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 3 / 38

Page 4: Grid Computing and the TeraGrid GI Science Gateway

Example data

Sulfur dioxide concentration data from the EPA AirQuality System database

() 4 / 38

Page 5: Grid Computing and the TeraGrid GI Science Gateway

Example data

n = 4711 observations from 870 sites in the years 1998–2005

() 5 / 38

Page 6: Grid Computing and the TeraGrid GI Science Gateway

Grid computing

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 6 / 38

Page 7: Grid Computing and the TeraGrid GI Science Gateway

Grid computing

Grid computing

several different definitions, all involving distributed computingnetworking together computing clusters at different geographiclocations to harness computing and storage resources

() 7 / 38

Page 8: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 8 / 38

Page 9: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

The Teragrid

the world’s largest, most comprehensive distributedcyberinfrastructure for open scientific researchbegan in 2001 when NSF awarded $45 million to establish aDistributed Terascale Facility (DTF)

to NCSA, SDSC, Argonne National Laboratory, and the Center forAdvanced Computing Research (CACR) at California Institute ofTechnology

coordinated through the Grid Infrastructure Group (GIG) at theUniversity of Chicago

() 9 / 38

Page 10: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

TeraGrid, continued

more than 250 teraflops of computing capability (May 2007)teraflop: trillion (1012) floating point operations per second

more than 30 petabytes of online and archival data storage (May2007)

petabyte: quadrillion (1015) bytes

rapid access and retrieval over high-performance networks.

() 10 / 38

Page 11: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

TeraGrid Resource Provider sites

Indiana UniversityOak Ridge National LaboratoryNational Center for Supercomputing ApplicationsPittsburgh Supercomputing CenterPurdue UniversitySan Diego Supercomputer CenterTexas Advanced Computing CenterUniversity of Chicago/Argonne National Laboratorythe Joint Institute for Computational Sciencesthe Louisiana Optical Network Initiativethe National Center for Atmospheric Research

() 11 / 38

Page 12: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

Communication among TeraGrid sites

Each resource provider maintains 10+ Gbps to one of threeTeraGrid hubs (Chicago, Denver or Los Angeles).The hubs are interconnected via 10 Gbps lambdas (fiber-opticcommunications lines).

() 12 / 38

Page 13: Grid Computing and the TeraGrid GI Science Gateway

The TeraGrid

() 13 / 38

Page 14: Grid Computing and the TeraGrid GI Science Gateway

TeraGrid Science Gateways

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 14 / 38

Page 15: Grid Computing and the TeraGrid GI Science Gateway

TeraGrid Science Gateways

TeraGrid Science Gateways

enable users with a common scientific goal to use nationalresources through a common interfaceaccount management, accounting, certificates management, anduser support is delegated to the gateway developersthree common forms:

A gateway that is packaged as a web portal with users in front andTeraGrid services in back.Grid-bridging Gateway: Science gateway is a mechanism to extendthe reach of the community’s existing Grid so it may use theresources of the TeraGrid.A gateway that involves application programs running on users’machines (i.e. workstations and desktops) and accesses servicesin TeraGrid (and elsewhere).

() 15 / 38

Page 16: Grid Computing and the TeraGrid GI Science Gateway

TeraGrid Science Gateways

Current TeraGrid Science Gateways

http://www.teragrid.org/programs/sci_gateways/

() 16 / 38

Page 17: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Outline

1 Example data

2 Grid computing

3 The TeraGrid

4 TeraGrid Science Gateways

5 GISolve

() 17 / 38

Page 18: Grid Computing and the TeraGrid GI Science Gateway

GISolve

GISolve

Geographic Information Science gatewayweb portal“The purpose of this project is to develop a TeraGrid ScienceGateway toolkit for GIScience. Our gateway toolkit providesuser-friendly capabilities for performing geographic informationanalysis using computational Grids, and help non-technical usersdirectly benefit from accessing cyberinfrastructure capabilities."current modules

1 random spatial point generator2 distance-weighted interpolation of surfaces3 cluster detection algorithm (G∗

i )4 Bayesian geostatistical spatial model fitting using MCMC

() 18 / 38

Page 19: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Scheduling on TeraGrid Science Gateway web portals

local job scheduler manages individual TeraGrid resourceCondorPortable Batch System (PBS)

Globus Resource Access and Management (GRAM)interacts with local job schedulers to allocate computationalresources for applicationsmonitors and controls computing processes

user interactions with Science Gateways through TeraGridsoftware supporting Web Services Globus Toolkit

() 19 / 38

Page 20: Grid Computing and the TeraGrid GI Science Gateway

GISolve

() 20 / 38

Page 21: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Geostatistical models

natural and interpretable way to model spatial correlation for datameasured at irregularly-spaced point sitescorrelation is a function of the distance, and possibly orientation,between sites

() 21 / 38

Page 22: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Simple geostatistical model with spatial correlationand additive measurement error

Y ∼ N(

XT β, σ2s Σ(φ) + σ2

e I)

X is a matrix of location-specific covariatesβ is a vector of coefficients to be estimatedΣ(φ) is spatial correlation matrix

entries are calculated from correlation function

σ2s is spatial variance

σ2e is random variance (measurement error variance)

I is identity matrixBayesian model completed by specification of prior distributionson φ, σ2

s , σ2e, and β

() 22 / 38

Page 23: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Our alternative reparameterization

facilitates prior specification and computing algorithmreparameterized covariance matrix

σ2sΣ(φ) + σ2

eI = σ2tot [(1− S) Σ(φ) + S I]

where

σ2tot = σ2

s + σ2e

S =σ2

e

σ2s + σ2

e

() 23 / 38

Page 24: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Prior densities

continuous uniform prior on φ

endpoints chosen to reflect belief as to largest and smallestpossible distances at which spatial correlation could decay to 0

joint prior on S and σ2tot obtained by change-of-variable from

inverse gamma priors on σ2e and σ2

s

multivariate normal or flat prior on β

() 24 / 38

Page 25: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Spatiotemporal model with separable correlationstructure

Y ∼ N(

XT β, σ2tot

{(1− S) K [Σ(φ)⊗ Σ(ρ)] KT + S I

})

where Σ(ρ) is an AR(1) matrix representing temporal correlationK is a matrix of 1’s and 0’s that matches each observation Yi withthe correct row and column of Σ(φ)⊗ Σ(ρ)

K is not needed if data are “rectangular"

prior on ρ uniform on (-1,1) or (0,1), slightly bounded away fromendpoints

() 25 / 38

Page 26: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Computing algorithm in GISolve MCMC module

very efficient MCMC computing algorithm that produces lowautocorrelation in MCMC outputcomputational bottleneck is linear algebra operations on bigmatrices, especially cholesky decompositionsingle-chain and multi-chain parallelization

linear algebra operations for each chain are parallelized usingPlaPACKall CPUs for an individual chain must be on same TeraGrid resourcemultiple chains may be run simultaneously

“embarrassingly parallel"different chains may be run on different TeraGrid resourcesSPRNG used to make sure random number streams for differentchains are independent

() 26 / 38

Page 27: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Challenges in TeraGrid implementation

different TeraGrid sites have different software, libraries, and batchscheduling programs installedhad to get PLAPACK installed and working on all sites whereGISolve could be used

() 27 / 38

Page 28: Grid Computing and the TeraGrid GI Science Gateway

GISolve

For more information

details of model and algorithm currently implemented in GISolveare in Yan, Cowles, Wang, and Armstrong, Statistics andComputing, 2007extension of the sequential version of algorithm to handleprediction, areal data, fusion of areal and point source data,complicated spatial and nonspatial covariance structure

implemented in ramps package for Rexplained in Cowles, Yan, and Smith 2007 and Smith, Yan, andCowles 2007 (available as tech reports on stats dept web page)not yet incorporated into GISolve

() 28 / 38

Page 29: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Using MCMC module in GISolve

in advance of your sessionget account on GISolverequest to reserve TeraGrid resources

go to www.gisolveorg to log inupload file of spatiotemporal data you want to analyzeupload configuration fileselect which TeraGrid partner sites you want to use

how many CPUs at eachhow many parallel MCMC chains to runnumber of iterations

() 29 / 38

Page 30: Grid Computing and the TeraGrid GI Science Gateway

GISolve

specify maximum wall clock timemust be long enough for the number of requested iterations to finishmust not run past the end of the reserved time on resource

submit jobclick “Visualize output" to view plots of accumulating samplesdownload zip files of plots and numeric output

() 30 / 38

Page 31: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Data file

Data files must be plain text filesFirst line is two integers:

number of rows of datanumber of regression coefficients in model, including intercept

data itself in rectangular format with the following columns (inorder)

response variablevalues of predictor variables, including a column of ones if interceptis requiredx coordinate of spatial location (longitude)y coordinate of spatial location (latitude)integer representing measurement timte

() 31 / 38

Page 32: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Data file example: sim2000.dat

2000 4-0.0665 1 0.8056 0.9732 1 0.8056 0.9732 1-3.0575 1 0.6244 0.4855 1 0.6244 0.4855 1-2.1376 1 0.7375 0.3452 1 0.7375 0.3452 1...-9.9081 1 0.2165 0.4516 20 0.2165 0.4516 20-9.9882 1 0.7621 0.0421 20 0.7621 0.0421 20-8.8069 1 0.4893 0.4964 20 0.4893 0.4964 20-8.6049 1 0.4142 0.9086 20 0.4142 0.9086 20

() 32 / 38

Page 33: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Configuration file

specifies model to be fit1 choice among three spatial correlation functions2 specification of parameters of prior distributions on σ2

e , σ2z , φ, S, ρ

provides initial values for each MCMC chain

() 33 / 38

Page 34: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Configuration file, continued

content of individual lines1 Correlation type (1 = spherical; 2 = exponential; 3 = Gaussian)2 Distance metric (1 = great circle distance; 2 = euclidean distance)3 The unit of distance: for example, distunit = 10 means that the

distances are in 10s.4 αe, βe, αz , βz (parameters of IG priors for sigma2e and sigma2z)5 Left right endpoints of uniform prior distribution for phi6 Left right endpoints of support of distribution for S7 Left right endpoints of uniform prior distribution for rho8 block_size block_size_alg (PlaPACK configuration; leave as in

example)9 Chain index (from 0 to (number of chains - 1)) and initial values for

phi, S, and rhoas many rows of this kind as there are chains

() 34 / 38

Page 35: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Configuration file example

121.00.5 0.5 0.5 0.50.05 2.00.01 0.990.01 0.99200 4000 0.5 0.75 0.751 1.0 0.5 0.52 1.5 0.25 0.25

() 35 / 38

Page 36: Grid Computing and the TeraGrid GI Science Gateway

GISolve

Specifying number of CPUs and number of chains ateach site

number of CPUs is total number to be divided among all thechains at the sitemake number of CPUs per chain a perfect square to usePLAPACK efficiently

how big a perfect square determined by size of dataset (see graphof speedups in next slide)

running parallel chainshelps in assessing convergencegenerates more samples per unit time if CPUs are availablesamples from different chains are independent

efficient MCMC algorithm results in short burn-in, so a relativelysmall number of iterations are “wasted"’

() 36 / 38

Page 37: Grid Computing and the TeraGrid GI Science Gateway

GISolve

5 10 15 20 25

24

68

ncpu

spee

d up

1 k2 k4 k6 k8 k10 k

() 37 / 38

Page 38: Grid Computing and the TeraGrid GI Science Gateway

GISolve

choosing numbers of CPUs and chainsfor dataset of 10000 observations, if you have 32 CPUs available ateach of 3 sites, perhaps run 1 chain using 25 CPUs at each sitefor dataset of 2000 observations, perhaps 3 chains at each site,each chain using 9 CPUs

ability to extend chains from where they left off is being added

() 38 / 38