NDGF CO2 Community Grid

Olli Tourunen
NORDUnet 2008, Espoo, Finland, April 10th 2008


Topics

- Project overview
- First use case
- Requirements and architecture
- Implementation
- Experiences
- Statistics
- Future


CO2-CG overview

- The NDGF Community Grid (CO2-CG) project aims to build an application environment for scientists studying CO2 sequestration
- CO2-CG was selected in the NDGF call for community projects, along with BioGrid
- NDGF provides the project coordinator and half an FTE for application grid integration, plus funding for a full FTE for community software development
- One-year project, started in fall 2007
- Project coordinator: Michael Gronager (NDGF)
- Project leader: Klaus Johannsen (BCCS, Bergen)
- Science specialist: Philip Binning (DTU, Copenhagen)
- Software developer: Csaba Anderlik (BCCS, Bergen)
- Grid specialist: Olli Tourunen (NDGF)


First use case for CO2-CG

- Parameter study of different attributes of potential CO2 sequestration reservoirs
- Software: MUFTE-UG, a general purpose simulator for multi-phase, multi-component flow in porous media
- Pilot user: Andreas Kopp (University of Stuttgart)


First use case (contd.)

- On the order of hundreds of parallel simulations using 32 to 64 processors each, computationally bound (not data intensive)
- One simulation covers a time frame of approximately 50 years, starting from CO2 injection into the reservoir
- Why parallel? Isn't this a parameter study after all?
  - A single 32-process run typically takes 3-4 days to complete
  - With 16 processes, a run might take over a week
- Resources for these simulations are provided mainly by NOTUR, the Norwegian national infrastructure for computational science


Requirements

- Main target: provide scientists with transparent access to computational resources in the grid
- Input: the user's working directory, containing the MUFTE-UG source code and the input files for the simulation
- Output: simulation results returned to the user in a user-specified directory
- Support for the NorduGrid ARC middleware
- Standard grid credential handling, to avoid the need for custom security policies with participating sites


Architecture overview

[Figure: architecture diagram. Components shown: command line UI, application server, DB with job descriptions 1 to 3, Grid Job Manager, ARC, and Cluster A, Cluster B and Supercomputer C, each with the MUFTE Runtime Environment. Legend: S = Software, R = Results.]


Architecture

- Command line UI (application server)
  - Introduces one keyword, 'grid', which can be invoked with different options à la openssl
  - Example:
    - The user prepares the source code and input files for a simulation in a directory of her choice
    - The user issues a command like 'grid submit -np 32'
    - The submit module packages the simulation directory into the spool directory and inserts the parameters into the database
    - The user tracks the progress by running 'grid status'
    - The results are made available to the user when the job finishes
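The single 'grid' keyword with openssl-style subcommands could be parsed roughly as follows. This is a minimal sketch using Python's standard argparse module, not the project's actual code: the subcommand names 'submit' and 'status' and the '-np' option come from the slide, while the '--dir' option and the returned dictionary are hypothetical and exist only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for one 'grid' keyword with openssl-style subcommands."""
    parser = argparse.ArgumentParser(prog="grid")
    sub = parser.add_subparsers(dest="command", required=True)

    submit = sub.add_parser("submit", help="queue a simulation directory")
    # '-np 32' as on the slide: the number of processes for the parallel run.
    submit.add_argument("-np", dest="processes", type=int, required=True)
    # Hypothetical option (not on the slide): directory to package.
    submit.add_argument("--dir", dest="directory", default=".")

    sub.add_parser("status", help="show the progress of submitted jobs")
    return parser


def main(argv=None) -> dict:
    """Parse the command line and return the requested action as a dict."""
    args = build_parser().parse_args(argv)
    if args.command == "submit":
        # A real submit module would package the directory into the spool
        # and insert the parameters into the database at this point.
        return {
            "action": "submit",
            "processes": args.processes,
            "directory": args.directory,
        }
    return {"action": "status"}
```

Calling main(["submit", "-np", "32"]) yields a submit action with 32 processes; dispatching on one subcommand name keeps the user-facing vocabulary down to a single keyword.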


Architecture (contd.)

- Grid Job Manager (application server)
  - Scans the database for new jobs
  - Prepares the new jobs for the grid based on job parameters
  - Submits the jobs into the grid
  - Keeps track of the grid jobs
  - Downloads the results when a job is ready
  - Downloads the evidence for autopsy when a job fails
- MUFTE Runtime Environment (grid resource)
  - Standard ARC Runtime Environment
  - Compiles the software based on local configuration and environment
  - Runs the simulation
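The Grid Job Manager steps listed above can be sketched as one pass over a job table. This is an illustration under stated assumptions, not the project's implementation: an in-memory SQLite database stands in for the real database, the table layout and state names are hypothetical, and FakeGrid is a made-up stand-in for the grid middleware so that the sketch runs without any grid software installed.

```python
import sqlite3

# Hypothetical job states; the slides do not name the real ones.
NEW, SUBMITTED, DONE, FAILED = "new", "submitted", "done", "failed"


class FakeGrid:
    """Made-up stand-in for the grid middleware (assumption, not ARC)."""

    def __init__(self, outcomes):
        self.outcomes = outcomes  # grid job id -> "finished" or "failed"
        self.downloads = []       # records (grid job id, what was fetched)

    def submit(self, job_id, processes):
        return f"grid-{job_id}"

    def status(self, grid_id):
        return self.outcomes.get(grid_id, "running")

    def download(self, grid_id, kind):
        self.downloads.append((grid_id, kind))


def make_db():
    """Create the hypothetical single-table job database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, processes INTEGER,"
        " state TEXT, grid_id TEXT)"
    )
    return db


def sweep(db, grid):
    """One pass: submit new jobs, then poll jobs submitted in earlier passes."""
    new_jobs = db.execute(
        "SELECT id, processes FROM jobs WHERE state = ? ORDER BY id", (NEW,)
    ).fetchall()
    active = db.execute(
        "SELECT id, grid_id FROM jobs WHERE state = ? ORDER BY id", (SUBMITTED,)
    ).fetchall()

    for job_id, processes in new_jobs:
        grid_id = grid.submit(job_id, processes)
        db.execute(
            "UPDATE jobs SET state = ?, grid_id = ? WHERE id = ?",
            (SUBMITTED, grid_id, job_id),
        )

    for job_id, grid_id in active:
        status = grid.status(grid_id)
        if status == "finished":
            grid.download(grid_id, "results")
            db.execute("UPDATE jobs SET state = ? WHERE id = ?", (DONE, job_id))
        elif status == "failed":
            # Keep the evidence so the failure can be examined afterwards.
            grid.download(grid_id, "evidence")
            db.execute("UPDATE jobs SET state = ? WHERE id = ?", (FAILED, job_id))
    db.commit()
```

Because each pass reads its work list from the database and writes the new states back, the function keeps no state between passes and can simply be run again later.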


Implementation

- Grid Job Manager (GJM)
  - There is one GJM instance per user
  - A "one sweep at a time" job, intended to be launched from cron
  - Runs under user credentials
  - Spools active jobs in /var/spool/co2-cg/<user>
  - Written in Python
  - Uses an object-relational mapper called SQLAlchemy
  - Interacts with the ARC grid middleware through standard user commands
  - A Python API for ARC is also available and might be used in the future
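Before a job can be handed to the ARC user commands, its parameters have to be turned into a job description. The sketch below renders one as a string in the style of ARC's xRSL language. It is an illustration only: the attribute names and value formats follow my reading of xRSL and should be checked against the ARC documentation, and the function name, the script name run_simulation.sh and the output file names are hypothetical.

```python
def build_xrsl(job_name, executable, processes, runtime_env, walltime, memory_mb):
    """Render a minimal xRSL-style job description as a single string.

    All attribute names are assumptions based on the xRSL language and
    are not taken from the slides.
    """
    attributes = [
        ("jobName", job_name),
        ("executable", executable),
        ("count", processes),                  # number of parallel processes
        ("runTimeEnvironment", runtime_env),   # e.g. MUFTE-MPI-64-1.0
        ("wallTime", walltime),                # requested wall-clock limit
        ("memory", memory_mb),                 # requested memory, in MB
        ("stdout", "stdout.txt"),              # hypothetical file names
        ("stderr", "stderr.txt"),
    ]
    return "&" + "".join(f'({name}="{value}")' for name, value in attributes)


description = build_xrsl(
    job_name="co2-run-1",
    executable="run_simulation.sh",  # hypothetical wrapper script
    processes=32,
    runtime_env="MUFTE-MPI-64-1.0",
    walltime="4 days",
    memory_mb=2000,
)
```

Generating the description from the stored job attributes keeps the walltime and memory requests in one place, which makes them easier to adjust per job.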


Implementation (contd.)

- Database
  - Standard PostgreSQL relational database
  - 3 main tables plus some auxiliary ones
- Runtime Environment
  - Compilation is done on the ARC server host before the job is submitted using the user's credentials
  - Compilation and execution parameters are based on the job attributes in the DB
  - Supported levels of parallelism are encoded in the RE name (e.g. MUFTE-MPI-64-1.0)
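To illustrate how a level of parallelism encoded in an RE name could steer job placement, here is a small sketch that parses names shaped like MUFTE-MPI-64-1.0 and picks a fitting one. It rests on assumptions: the slide gives only the one example name, so reading the four fields as application, flavour, maximum process count and version is my interpretation, and the function names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class RuntimeEnv(NamedTuple):
    application: str
    flavour: str
    max_processes: int  # assumed meaning of the numeric field
    version: str


# Assumed pattern: <APPLICATION>-<FLAVOUR>-<PROCESSES>-<VERSION>
_RE_NAME = re.compile(
    r"^(?P<app>[A-Z]+)-(?P<flavour>[A-Z]+)-(?P<procs>\d+)-(?P<version>\d+(?:\.\d+)*)$"
)


def parse_re_name(name: str) -> RuntimeEnv:
    """Split an RE name such as 'MUFTE-MPI-64-1.0' into its fields."""
    match = _RE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised runtime environment name: {name!r}")
    return RuntimeEnv(
        match["app"], match["flavour"], int(match["procs"]), match["version"]
    )


def pick_runtime_env(requested: int, advertised: List[str]) -> Optional[str]:
    """Return the smallest advertised RE that fits the requested process count."""
    fitting = [
        (parse_re_name(name).max_processes, name)
        for name in advertised
        if parse_re_name(name).max_processes >= requested
    ]
    return min(fitting)[1] if fitting else None
```

For a 32-process job and the names MUFTE-MPI-16-1.0 and MUFTE-MPI-64-1.0, the sketch selects MUFTE-MPI-64-1.0; requesting that RE in the job description then restricts the job to resources that advertise it.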


Challenges

- Transparent grid credential handling
  - Balance the security policies and ease of use
- Parallel run parameterization
  - User needs vs. types of resources vs. available resources
  - No explicit brokering support for this in ARC
  - This can be done with clever RE naming
- Database access right management (not really an issue until this goes to a bigger scale)
  - Lots of different possibilities to solve this if needed (DB-level access rights, per-user tablespaces, row change staging, n-layer architecture outside the DB…)
  - So far, KISS has been applied


Experiences: User side

- The user can access a significant number of distributed resources in a transparent manner
  - Peak so far: 512 cores simultaneously in use
- Problems
  - Memory specifications for the jobs
  - Walltime specifications for the jobs
  - Getting all the information needed to debug jobs that have crashed
  - Non-converging jobs


Experiences: Operator side

- It takes around a day to set up the MUFTE RE in a new cluster
  - If the site has experience in running MPI jobs through ARC, the process is quite straightforward
  - In one case we have also had to set up a cross-compiling facility
- AA is easy to configure since the users are managed in NDGF VOMS
- Since not that many parallel jobs are run in the grid, the ARC LRMS interface needed some tweaks in some clusters
- Thanks to all the sysadmins who have helped us along the way!


Statistics

- Since February 12th 2008, over 400 simulations using 16 to 64 processors have been run
- Total compute time: around 230000 hours
  - Disclaimer: measurements were made on the application server side, not taken from resource accounting


Statistics (contd.)

[Figure: bar chart "Walltime usage 2008". X axis: week (1 to 15). Y axis: walltime hours (0 to 80000).]


Statistics (contd.)

[Figure: bar chart "Walltime distribution". Y axis: walltime hours (0 to 250000). Bars, in the order shown: Titan (Oslo), Fyrkat (Aalborg), Fimm (Bergen), Stallo (Tromsö). Data labels, in the order extracted: 213000, 24000, 3500, 600.]


Future developments

- Switching focus to operation
  - Software and application server hardening
  - Automated tests for the runtime environments + blacklisting
  - Cleanup procedures
  - Integrate CO2-CG into the NDGF accounting system
  - Track the simulations that are not converging
- Easier certificate handling
- Possibly a web portal for job tracking and collaboration
- Include the new Cray XT4 in Bergen


Conclusions

- With moderate effort, simple tools and an application-specific user interface, grid resource usage can be made easy for the end users
- On-demand compilation works for selected applications
- Parallel jobs can be run on a large scale in the grid with little effort


Thank you!

Questions, comments?