1. Simulation 2. Archive and distribute 3. Analysis 4. Understanding

Page 1

1. Simulation

2. Archive and distribute

3. Analysis

4. Understanding

Page 2

Heterogeneous HPC environments

Large community

SSH is king

No global view

Very complex workflow

etc, etc, etc

Problem Space

Page 3

[Diagram: TGCC compute and storage layout]
- Compute: curie hybrid nodes (-q hybrid), curie thin nodes (-q standard), curie large nodes (-q xlarge); airain nodes
- Login: curie front-end and airain front-end
- File systems (all under quotas): $HOME (saved space: sources, small results), $CCCWORKDIR (saved space, visible from www: IGCM_OUT MONITORING/ATLAS), $SCRATCHDIR (temporary, non-saved space: IGCM_OUT files to be packed, outputs of post-processing jobs, temporary REBUILD), $CCCSTOREDIR (space on tapes, HPSS robotic tapes: IGCM_OUT packed results, i.e. Output, Analyse, SE and TS; small precious files)
- Data movement: cp, dods_cp (towards ESGF), ccc_hsm get (stage from tape)

TGCC in a nutshell
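Packed results live on the tape-backed $CCCSTOREDIR and must be staged to disk before use. A minimal sketch of that staging step in Python, assuming the ccc_hsm and cp commands shown in the diagram above; the file name is hypothetical:

    import os
    import subprocess

    # Hypothetical packed file on the tape-backed store (the real tree is
    # $CCCSTOREDIR/IGCM_OUT/...).
    packed = os.path.expandvars("$CCCSTOREDIR/IGCM_OUT/EXP00/Output/output.tar")

    # Stage the file from HPSS tape to disk, then copy it to the work space.
    subprocess.run(["ccc_hsm", "get", packed], check=True)
    subprocess.run(["cp", packed, os.path.expandvars("$CCCWORKDIR/")], check=True)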

Page 4

[Diagram: libIGCM simulation and post-processing workflow at TGCC]
- Compute (curie): chained Job_EXP00 runs, one per PeriodLength
- Run outputs land on $SCRATCHDIR/IGCM_OUT/XXX/Output, with Restart and Debug trees and $SCRATCHDIR/IGCM_OUT/.../REBUILD
- Post-processing jobs (curie):
  - rebuild (RebuildFrequency) reassembles per-process output
  - pack_output (PackFrequency, ncrcat) concatenates outputs towards $CCCSTOREDIR/IGCM_OUT/XXX/Output; ESGF=TRUE/FALSE flags data for publication
  - pack_restart and pack_debug (tar) archive towards $CCCSTOREDIR/IGCM_OUT/.../RESTART and DEBUG
  - create_ts (TimeSerieFrequency) and create_se (SeasonalFrequency) build TS and SE under $CCCSTOREDIR/IGCM_OUT/… (dods/store)
  - monitoring and atlas/metrics write MONITORING and ATLAS under $CCCWORKDIR (dods/work)
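The diagram pairs pack_output with ncrcat (NCO), which concatenates per-period files along the time dimension. A minimal sketch of the same operation, using a hypothetical file pattern rather than the actual libIGCM layout:

    import glob
    import subprocess

    # Gather the monthly output files of one experiment (hypothetical pattern).
    monthly = sorted(glob.glob("EXP00_*_histmth.nc"))

    # ncrcat concatenates the records along the unlimited (time) dimension.
    subprocess.run(["ncrcat", *monthly, "EXP00_TS_histmth.nc"], check=True)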

Page 5

[Diagram: Prodiguer messaging architecture]
- An MQ Relay at each centre (TGCC, IDRIS, CINES, IPSL, XXX) forwards msg streams to the central MQ Cluster
- MQ Apps consume the messages and feed the DBs through an API
- The IPSL user consumes the results (json) via browser, command line or desktop
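A minimal sketch of what a relay-to-cluster message could look like, assuming a RabbitMQ broker and the pika client; the slide only says "MQ", so the host, exchange and routing key below are assumptions, not the actual Prodiguer configuration:

    import json
    import pika

    # Connect to the local MQ relay (host is an assumption).
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="prodiguer", exchange_type="topic")

    # Messages travel as json, as in the diagram above.
    message = {"centre": "TGCC", "simulation": "EXP00",
               "event": "pack_output_complete"}
    channel.basic_publish(exchange="prodiguer",
                          routing_key="simulation.monitoring",
                          body=json.dumps(message))
    connection.close()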

Page 6

Simulation monitoring & control

ESG-F integration: data publishing

ES-DOC integration: documentation publishing

PCMDI simulation metrics publishing

HPC diagnostics aggregation

Controlled vocabulary management

Push notifications: WebSocket, SMS, SMTP, MQ

Solution Space

Page 7


Metrics Garden User Web Interface

Test Gleckler-like metrics on the CMIP5 versions of the IPSL models

Metrics Garden

Page 8

1. Simulation

2. Archive and distribute

3. Analysis

4. Understanding

Page 9

How do we usually present ourselves
• Prodiguer, the national level

– Coordination between French partners

– IPSL, CNRM-CERFACS, TGCC, IDRIS, CINES

– Accompanying the community

• IS-ENES, the European level

– Coordination between European partners

– Heavy workload on the operational implementation of ESGF (the biggest source of climate model results)

– Strengthening the infrastructure

• ESGF and ES-DOC, the international level

– WGCM Infrastructure Panel
– ESGF Governance (Executive Committee)
– ES-DOC Governance (Principal Investigator)

Page 10

Many, many processes, many, many communities!

Interconnected communities, all needing access to (some of) the data!

Page 11

Resolution

Complexity

Duration and ensemble size

Enhanced computing resources produce MORE DATA

Earth Observations

Page 12

The Earth System Grid Federation (ESGF) is a multi-agency, international collaboration of people and institutions working together to build an open-source software infrastructure for the management and analysis of Earth science data on a global scale.

• Software development and project management: ANL, ANU, BADC, CMCC, DKRZ, ESRL, GFDL, GSFC, JPL, IPSL, ORNL, LLNL (lead), PMEL, …

• Operations: tens of data centers across Asia, Australia, Europe and North America

Worldwide distributed system

Page 13

Storage evolution over 6 years (from CMIP3 to CMIP5): a factor of 30

Worldwide distributed system

● Operational since 2011
● Hundreds of users per month
● Hundreds of TB per month
● About 10,000 registered users

CMIP3: centralized
CMIP5: distributed system, 60 climate models, 2 PB of data

Page 14

ESGF France

- Working framework for the ESGF node administrators in France

- IPSL tests and validates the ESGF releases, then publishes detailed deployment procedures adapted to the centres

- Knowledge sharing
- Synchronized deployments
- Production support
- Annual meetings at IPSL

- The community takes part in the Big Data theme of the ANR Convergence project and in the data working group of the European IS-ENES2 project.

http://forge.ipsl.jussieu.fr/prodiguer/wiki/ESGF-FR [email protected]

Page 15

1. ESGF IWT Missions and Challenges

Missions
- Release management
- Build, test and validate
- Provide installation tools
- Secure deployments
- Administrator training and support

Challenges
- Automated builds and tests
- Easier installation
- Node set-up in less than one hour

Page 16

2. ESGF IWT RM Process

Release Management Process

The development of the ESGF software stack follows a release management process which ensures the quality of deliverables. Three distinct roles are identified:

• Developers push new features into the system
• The IWT Release Manager is responsible for code freezes, cutting releases and compilation
• IWT Administrators are requested to test and validate release candidates

Page 17

3. ESGF IWT Continuous Build

Continuous Build

The ESGF software stack source code is hosted in GitHub repositories. ‘devel’ branches are continuously updated with new features by the development teams. GitHub webhooks trigger compilation of the projects on a dedicated machine running Jenkins. Distribution binaries are then made available to the community for testing via a web server. Continuous build gives real-time visibility into source code quality and the consistency of inter-project dependencies throughout the development phases.
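A minimal sketch of the trigger mechanism, assuming Jenkins' standard remote-trigger endpoint; in practice the Jenkins GitHub plugin receives the webhook directly, and the job name and token below are assumptions:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    import requests

    # Hypothetical Jenkins job, using Jenkins' "trigger builds remotely" endpoint.
    JENKINS_BUILD_URL = "http://esgf-build.ipsl.upmc.fr/jenkins/job/esgf-devel/build"

    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Drain the GitHub push payload, then trigger a new build.
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            requests.post(JENKINS_BUILD_URL, params={"token": "BUILD_TOKEN"})
            self.send_response(200)
            self.end_headers()

    HTTPServer(("", 8080), WebhookHandler).serve_forever()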

[Diagram: continuous build flow]
- Developers push code to GitHub devel branches
- GitHub triggers builds automatically on the Jenkins continuous build server
- Jenkins publishes binaries (wars, jars) to a web server if the build completes, and sends a warning email if the build breaks

http://esgf-build.ipsl.upmc.fr/jenkins

http://esgf-build.ipsl.upmc.fr/builds

Page 18

4. ESGF IWT Integration Testing

Integration Testing

The ESGF Test Federation is based on VMware virtual machines. It is completely independent from the production federation and is used to run the ESGF test suite, which performs tests from the user's perspective in order to validate release candidates as well as new installations or upgrades.

ESGF Test Infrastructure: http://vesgint-data.ipsl.jussieu.fr

ESGF Test Suite: https://github.com/ESGF/esgf-test-suite

Python Nose - test framework
Python Requests - HTTP support
Python Subprocess - system execution
Python Selenium - browser simulation
Python Multiprocessing - parallelisation
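A minimal sketch of a user-perspective check in the style of those frameworks; the endpoint and node list are illustrative, not taken from the actual esgf-test-suite:

    from multiprocessing import Pool
    import requests

    # Test-federation node from this slide; a real run would cover every node.
    NODES = ["http://vesgint-data.ipsl.jussieu.fr"]

    def check_thredds(base_url):
        # ESGF data nodes serve their THREDDS catalog over HTTP.
        response = requests.get(base_url + "/thredds/catalog.html", timeout=30)
        assert response.status_code == 200

    def test_thredds_reachable():
        # Nose collects test_* functions; fan the checks out across processes.
        with Pool(len(NODES)) as pool:
            pool.map(check_thredds, NODES)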

Page 19

5. ESGF IWT Installer and Distribution Mirrors

Installer and Distribution Mirrors

Freshly cut and validated releases are followed by deployment into production. The installer helps each node administrator across the federation pull the new binaries. Three synchronized distribution mirrors (one master at IPSL, one slave at PCMDI, one slave at BADC) improve binary availability and transfer times, as the installer identifies the fastest mirror.

[Diagram: distribution flow]
- Node admins execute the ESGF Installer
- The installer calls get_fastest_mirror() and pulls binaries from the U.S., U.K. or FR mirror
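A minimal sketch of the idea behind get_fastest_mirror(): probe each mirror and keep the quickest. The actual installer is shell-based, and only the FR mirror URL appears on the previous slide:

    import time
    import requests

    # FR master mirror from the previous slide; U.S. and U.K. URLs omitted.
    MIRRORS = ["http://esgf-build.ipsl.upmc.fr/builds"]

    def get_fastest_mirror(mirrors):
        timings = {}
        for url in mirrors:
            start = time.monotonic()
            try:
                requests.head(url, timeout=10)
                timings[url] = time.monotonic() - start
            except requests.RequestException:
                continue  # skip unreachable mirrors
        return min(timings, key=timings.get)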

Page 20

Original timing: O(2) PB of requested output from 20+ modelling centres, finished early 2010! Actual timing? Years late.

IPSL

CMIP3: 35 TB

CNRM-CERFACS

Our data perspective

Page 21

Depending on security constraints (e.g., computing centres), two architectures are possible:

ESG-F datanode + data on the same network. Example: CICLAD (IPSL - Jussieu)

Network + ESG-F datanode + data

ESG-F index node (remote from the datanode or not)

Page 22

Depending on security constraints (e.g., computing centres), two architectures are possible:

ESG-F datanode in a DMZ. Example: TGCC-CCRT

DMZ + ESG-F datanode

Local network + data

ESG-F index node

No interactive access
One-way network flows
Read-only NFS export

Page 23


(1) Data Reference Syntax
(2) {datanode}: filesystem visible from the datanode
(3) {project}: project (e.g., CMIP5)
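A minimal sketch of how those placeholders compose a CMIP5-style DRS path; every facet value below is illustrative:

    # Facet order follows the CMIP5 Data Reference Syntax.
    DRS = ("{datanode}/{project}/{product}/{institute}/{model}/{experiment}/"
           "{frequency}/{realm}/{table}/{ensemble}/{version}/{variable}")

    path = DRS.format(
        datanode="/esgf/data",             # filesystem visible from the datanode
        project="CMIP5", product="output1", institute="IPSL",
        model="IPSL-CM5A-LR", experiment="historical", frequency="mon",
        realm="atmos", table="Amon", ensemble="r1i1p1",
        version="v20110406", variable="tas")
    print(path)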

Overview