LLNL-PRES-477393 CMIP5 / ESG-CET Publication Tutorial Bob Drach, PCMDI / LLNL March 14, 2011

Page 1: CMIP5 / ESG-CET Publication Tutorial

LLNL-PRES-477393

CMIP5 / ESG-CET Publication Tutorial

Bob Drach, PCMDI / LLNL

March 14, 2011

Page 2: CMIP5 / ESG-CET Publication Tutorial

ESG-CET Architecture

Gateways support centralized services:
• Portal
• Authn / Authz
• Search
• Metadata harvesting
• Web services

Nodes are close to the data:
• Publishing
• Data servers
    THREDDS
    DAP: Hyrax, PyDAP
• Visualization and computation
    LAS
    CDAT, NCL, Ferret, …

Gateways and nodes can be co-resident

[Figure: ESG-CET architecture diagram. Data Node: Publisher, gridFTP, LAS, THREDDS, Data Archive, Proxy Certificate, Node Database, THREDDS Catalogs. Gateway: Web Services, Harvester, MyProxy, Portal, Gateway Database.]

Page 3: CMIP5 / ESG-CET Publication Tutorial

Terminology

Publication makes datasets visible on the gateway (portal).
• Only metadata is transferred

A dataset is a collection of files. Datasets have versions.
• The ‘unit of publication’ is a version of a dataset.
• Versions: monotonically increasing integers, may be YYYYMMDD

Datasets have string dataset identifiers that are unique system-wide.

A category is a field that is searched on the gateway.
• Ex: time_frequency, realm, experiment, …

Projects are activities that generate datasets.
• Ex: CMIP5, CMIP3
• Associated with a set of categories

An experiment describes the input conditions (initial conditions, forcing, time period, …) of a climate model experiment.

Data is generated by a climate model or from observations, reanalyses.

Other project-specific metadata may be associated with datasets.

Page 4: CMIP5 / ESG-CET Publication Tutorial

Node Architecture

Publisher
• Scans data archive
• Generates metadata catalogs: one catalog per dataset version
• Notifies the gateway when new catalogs are available
• Written in Python

Node database
• Persistent store of publication information
• Dataset contents, version history
• File metadata
• Catalog locations
• Publication status
• Current implementation in Postgres
• May co-exist with gateway DB

[Figure: data node diagram. Components shown: Publisher, gridFTP, LAS, THREDDS, Data Archive, Proxy Certificate, Node Database, THREDDS Catalogs, Gateway Web Services, Harvester.]

Page 5: CMIP5 / ESG-CET Publication Tutorial

Publication Process

Scan directories
• Create a list of files to be published
• Associate each file with a dataset
• Generate a mapfile (optional)

Scan data
• Read metadata from files
• Populate node database

Generate metadata THREDDS catalogs

Publish datasets
• Requires valid proxy certificate, obtained with myproxy-logon
• Notifies gateway to harvest metadata


Page 6: CMIP5 / ESG-CET Publication Tutorial

Publisher components

Scan directories / files to produce a mapfile

% esgscan_directory [--read-files] [options] directory [directory ...]

Extract metadata from files, populate node database

% esgpublish --map mapfile --project cmip5

Generate THREDDS catalogs

% esgpublish --map mapfile --noscan --thredds

Notify gateway

% esgpublish --map mapfile --noscan --publish

Publisher GUI includes all publisher functionality

All scripts have --help options
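
The four command-line steps on this slide can be strung together in a small driver. The following is a minimal sketch that only builds the argument lists and does not run anything; the function name and the sample mapfile and directory names are hypothetical, and the flags are limited to those shown in this tutorial (the `-o mapfile` option is the one given on the Mapfiles slide).

```python
import shlex


def publication_commands(mapfile, project, directories):
    """Return the argv lists for the four publication steps, in order."""
    # 1. Scan directories / files and write a mapfile.
    scan = ["esgscan_directory", "--read-files", "-o", mapfile, *directories]
    # 2. Extract metadata from the files and populate the node database.
    extract = ["esgpublish", "--map", mapfile, "--project", project]
    # 3. Generate THREDDS catalogs without rescanning the files.
    catalogs = ["esgpublish", "--map", mapfile, "--noscan", "--thredds"]
    # 4. Notify the gateway (needs a valid proxy certificate).
    notify = ["esgpublish", "--map", mapfile, "--noscan", "--publish"]
    return [scan, extract, catalogs, notify]


if __name__ == "__main__":
    # Hypothetical mapfile name and data directory.
    for argv in publication_commands("cmip5.map", "cmip5", ["/data/cmip5"]):
        print("% " + shlex.join(argv))
```

Each list could be handed to `subprocess.run(argv, check=True)` on a node where the publisher is installed, stopping at the first step that fails.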

[Figure: publisher command diagram. Commands shown: esginitialize, esgsetup, esgscan_directory, esgpublish, esgpublish --thredds, esgpublish --publish, esgunpublish, esglist_datasets, esglist_files, esgquery_gateway, myproxy-logon. Components shown: mapfile, Data Archive, Node Database, THREDDS Catalogs, THREDDS, Proxy Certificate, Web Services.]

Page 7: CMIP5 / ESG-CET Publication Tutorial

Deleting datasets

Order of operations is reverse of publication:
• Delete from gateway
• Remove TDS catalog
• (optional) delete from node DB

Delete a dataset from the gateway

% esgunpublish --skip-thredds cmip5.foo.bar

Delete a TDS catalog

% esgunpublish --skip-gateway cmip5.foo.bar

Delete a dataset entirely, including the node database

% esgunpublish --database-delete cmip5.foo.bar


Page 8: CMIP5 / ESG-CET Publication Tutorial

Querying

List all CMIP5 datasets in the node database

% esglist_datasets cmip5

List all files in a dataset

% esglist_files cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.fx.

List all datasets in a directory on a gateway

% esgquery_gateway [--service-url gateway_service] --list pcmdi.CCCMA

List all files in a gateway dataset

% esgquery_gateway --files cmip5.output2.CCCma.CanESM2.rcp85.mon.land.Lmon.r5i1p1


Page 9: CMIP5 / ESG-CET Publication Tutorial

THREDDS catalogs

Layout

Reinitialization loads all catalogs into the TDS

thredds_reinit_url = https://localhost:443/thredds/admin/debug?catalogs/reinit

[Figure: THREDDS catalog layout. Labels: THREDDS Master Catalog; ESG Root Catalog (thredds_root = $ESGF_HOME/content/thredds/esgcet); subdirectories /1, /2, /3, …]

Page 10: CMIP5 / ESG-CET Publication Tutorial

CMIP5 Metadata

CMIP5 DRS (Data Reference Syntax) defines the naming system for CMIP5 dataset identifiers, files, directories, URLs, metadata, …

CMIP5 controlled vocabulary is derived from the DRS document
• Permitted values for experiments, models, institutions, …
• Should be consistent with publisher configuration:
    esg.ini
    esgcet_models_table.txt
    cf-standard-name-table.xml

CMOR (Climate Model Output Rewriter)
• Generates CMIP5-compliant data in netCDF format
• Fortran-90, C, Python interfaces

Page 11: CMIP5 / ESG-CET Publication Tutorial

Publisher configuration, setup

Publisher locates the configuration file (esg.ini) in the order:
• Environment variable ESGINI
• $HOME/.esgcet/esg.ini
• <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/esg.ini
• esg.ini in working directory
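
The lookup order above can be written down as a short search routine. This is a minimal sketch of that order only, not the publisher's actual code; the function names are hypothetical, and because the Python version in the site-packages path is elided on the slide, the directory holding the packaged esg.ini is passed in as a parameter.

```python
import os


def candidate_ini_paths(environ, home, package_etc_dir, cwd):
    """List esg.ini candidates in the lookup order given on the slide."""
    paths = []
    # 1. Environment variable ESGINI, if set.
    if environ.get("ESGINI"):
        paths.append(environ["ESGINI"])
    # 2. Per-user configuration.
    paths.append(os.path.join(home, ".esgcet", "esg.ini"))
    # 3. Copy shipped inside the esgcet package (directory supplied by caller).
    paths.append(os.path.join(package_etc_dir, "esg.ini"))
    # 4. Working directory.
    paths.append(os.path.join(cwd, "esg.ini"))
    return paths


def locate_ini(paths, exists=os.path.isfile):
    """Return the first candidate that exists, or None."""
    for path in paths:
        if exists(path):
            return path
    return None
```

Passing `exists` as a parameter keeps the search testable without touching the file system.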

esgsetup creates esg.ini
• Run by esg-node installation script
• If no existing configuration, starts with <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/template.ini
    Created in $HOME/.esgcet/esg.ini
• Otherwise updates existing esg.ini
• Options:
    --config: create initial configuration
    --db: initialize database
    --thredds: initialize THREDDS server
    --publish: configure gateway-related options (service URLs, myproxy, etc.)
    --handler: create customized handlers

Page 12: CMIP5 / ESG-CET Publication Tutorial

Configuration layout

Section headers
• [DEFAULT]: Options apply to all sections
• [project:foo]: Specific to project foo
    Project name(s) are listed in project_options in [DEFAULT]
• [initialize]: Locations of model and standard name tables
• [extract]: File scan phase (metadata extraction)
    Enable detailed logging
• [srmls]: Listing SRM files
• [hsi]: Listing HSI (HPSS mass store)
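
To illustrate how [DEFAULT] options reach every section, here is a minimal sketch using Python 3's standard `configparser` as a stand-in for whatever parser the publisher itself uses (an assumption). The section names come from the slide; the option names `data_root`, `archive_dir`, and `log_level` and their values are hypothetical.

```python
import configparser

SAMPLE_INI = """\
[DEFAULT]
# Hypothetical option: defined once, visible from every section.
data_root = /data

[project:foo]
# Hypothetical option that interpolates the DEFAULT value.
archive_dir = %(data_root)s/foo

[extract]
# Hypothetical logging option for the file scan phase.
log_level = DEBUG
"""


def load_config(text):
    """Parse ini text; %(option)s references are expanded on lookup."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(SAMPLE_INI)
print(config.sections())                     # [DEFAULT] is not listed as a section
print(config["project:foo"]["archive_dir"])  # interpolated from [DEFAULT]
print(config["extract"]["data_root"])        # inherited from [DEFAULT]
```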

Page 13: CMIP5 / ESG-CET Publication Tutorial

Dataset roots, services

Dataset roots affect TDS access control, data hiding
• thredds_dataset_roots =
    root_path | location
    root_path | location
    …
• Every published file must be under a root location, is protected by ESG (by default)
• Unpublished files under root location(s) are potentially accessible, but are not visible in TDS or the gateway
• Do not store sensitive unpublished data under a dataset root!
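
Because every published file must sit under a root location, it can help to check paths before scanning. The sketch below parses a `thredds_dataset_roots` value of the form shown above and reports which root, if any, contains a file; the function names, root names, and paths are hypothetical, and the publisher's own check may differ.

```python
from pathlib import PurePosixPath


def parse_dataset_roots(value):
    """Parse 'root_path | location' lines into a {root_path: location} dict."""
    roots = {}
    for line in value.strip().splitlines():
        if not line.strip():
            continue
        name, location = (part.strip() for part in line.split("|", 1))
        roots[name] = PurePosixPath(location)
    return roots


def root_for(path, roots):
    """Return the name of the first root whose location contains path, else None."""
    path = PurePosixPath(path)
    for name, location in roots.items():
        # Compare whole path components, so /esg/database is not under /esg/data.
        if location == path or location in path.parents:
            return name
    return None


# Hypothetical root names and locations.
roots = parse_dataset_roots("""
    esg_dataroot | /esg/data
    esg_scratch  | /scratch/esg
""")
print(root_for("/esg/data/cmip5/tas.nc", roots))
print(root_for("/home/user/tas.nc", roots))
```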

Services configure access to files or aggregations
• Simple or compound

Page 14: CMIP5 / ESG-CET Publication Tutorial

Project configuration

experiment_options defines experiments for the project

Categories
• Metadata fields that will be associated with each dataset
• Each project has a different set of categories
• May be mandatory: error if not found during the scan
• XX_options if enumerated
• TDS catalog <property> element may be created
• Basis of gateway search

Page 15: CMIP5 / ESG-CET Publication Tutorial

Project configuration

project handler encapsulates logic associated with reading / setting metadata values
• ipcc5_builtin for CMIP5
• May be customized

Format strings
• %(option)s
• Option may be defined:
    Config file
    By handler (dynamically)
• Example: %(model_description)s

dataset_id: template for TDS dataset identifiers
• Format strings should be mandatory
• Version added by the publisher
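
The `%(option)s` syntax is the same as Python's dictionary-based string formatting, so a dataset_id template can be tried out directly. This is a minimal sketch: the template is the CMIP5 dataset_id given on the Directory Scan Modes slide, the field values are taken from the example identifier on the Querying slide, and the `dataset_id` helper is hypothetical.

```python
DATASET_ID_TEMPLATE = (
    "cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s."
    "%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s"
)


def dataset_id(fields, template=DATASET_ID_TEMPLATE):
    """Fill the template; a missing field is reported as an error."""
    try:
        return template % fields
    except KeyError as err:
        raise ValueError("missing mandatory field: %s" % err.args[0]) from None


fields = {
    "product": "output2",
    "institute": "CCCma",
    "model": "CanESM2",
    "experiment": "rcp85",
    "time_frequency": "mon",
    "realm": "land",
    "cmor_table": "Lmon",
    "ensemble": "r5i1p1",
}
print(dataset_id(fields))
```

A field that is absent raises an error here, which mirrors why fields used in the template need to be mandatory.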

Page 16: CMIP5 / ESG-CET Publication Tutorial

Project configuration

Maps
• Mapping (association) from a set of independent fields to a dependent field
• The dependent field can be used in a format string
• Form:

map_name = map(variable_1[, variable_2[, ...]] : variable_n)
    value_1 [ | value_2 [...]] | value_n
    value_1 [ | value_2 [...]] | value_n
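
To make the map form concrete, the following sketch parses a map value into a lookup table keyed by the independent fields. It assumes only the syntax shown above; the `parse_map` helper and the sample values are hypothetical, and the publisher's own parser may accept more variations.

```python
import re

# Header of the form: map(variable_1[, variable_2 ...] : variable_n)
MAP_HEADER = re.compile(
    r"map\s*\(\s*(?P<keys>[^:]+?)\s*:\s*(?P<target>\w+)\s*\)"
)


def parse_map(text):
    """Parse a map option value into (key_names, target_name, lookup)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = MAP_HEADER.fullmatch(lines[0])
    if header is None:
        raise ValueError("not a map definition: %r" % lines[0])
    key_names = tuple(k.strip() for k in header.group("keys").split(","))
    lookup = {}
    for line in lines[1:]:
        values = tuple(v.strip() for v in line.split("|"))
        # Each row holds one value per independent field plus the dependent value.
        if len(values) != len(key_names) + 1:
            raise ValueError("wrong number of fields: %r" % line)
        lookup[values[:-1]] = values[-1]
    return key_names, header.group("target"), lookup


# Hypothetical map from model to model_description.
keys, target, lookup = parse_map("""
map(model : model_description)
    CanESM2 | Example description text
""")
print(keys, target, lookup[("CanESM2",)])
```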

Data file structure
• One variable per file (CMIP5 standard)
    variable_per_file = true
• Multiple variables per file
    variable_per_file = false

Version
• vYYYYMMDD
    version_by_date = true
• vN

Page 17: CMIP5 / ESG-CET Publication Tutorial

Offline datasets

Offline datasets: can be listed but not opened for metadata extraction
• Published with minimal description: location and size
• No associated aggregations

Example: tertiary storage

Lister: program that generates metadata for offline datasets
• hsils.py: HPSS
• srmls.py: SRM
• msls.py: MSS
• Listers can be customized

Configuration:
• thredds_offline_services: generate TDS catalog <service> element
• offline_lister: associate service name with [lister] section
• [lister] section

Ex:
[hsi]
offline_lister_executable = %(pythonbin)s/hsils.py
hsi = /usr/local/bin/hsi

Use --offline, --service options with esgpublish, esgscan_directory

Page 18: CMIP5 / ESG-CET Publication Tutorial

Mapfiles

Describes file contents of one or more datasets

Generate with:

% esgscan_directory [--read-files] [options] -o mapfile directory [directory ...]

File-specific fields
• Size
• Modification time: epochal time
• Checksum (if checksum configuration option set)
• Checksum type: MD5 (recommended for CMIP5) or SHA1

Format: one line per file:

dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]]

where properties are:
    mod_time=<epochal_time>
    checksum=<checksum_value>
    checksum_type=<checksum_type>, either MD5 or SHA1
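
A mapfile line in the format above splits cleanly on the `|` separator. The sketch below is one way to read such a line; the `parse_mapfile_line` helper and the sample line (dataset name, path, and checksum value) are hypothetical.

```python
def parse_mapfile_line(line):
    """Split one mapfile line into dataset, path, size, and properties."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields: %r" % line)
    dataset_name, path, byte_length = fields[:3]
    properties = {}
    for item in fields[3:]:
        # Optional fields have the form property=value.
        key, _, value = item.partition("=")
        properties[key.strip()] = value.strip()
    return {
        "dataset": dataset_name,
        "path": path,
        "size": int(byte_length),
        "properties": properties,
    }


# Hypothetical example line.
entry = parse_mapfile_line(
    "cmip5.foo.bar | /data/foo/bar.nc | 1024 | mod_time=1300000000.0"
    " | checksum=0123abcd | checksum_type=MD5"
)
print(entry["dataset"], entry["size"], entry["properties"]["checksum_type"])
```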

Page 19: CMIP5 / ESG-CET Publication Tutorial

Directory Scan Modes

esgscan_directory [--read-files | --read-directories] … :
• Associate dataset identifier(s) with files
• Create listing of files with sizes, modification times, checksums, etc.

To generate dataset identifiers, must obtain metadata from either:
• Directory names (--read-directories), or
• File metadata (--read-files)

Example:
dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s

File metadata: recommended for CMIP5
• For each file, read metadata from file and generate dataset_id

Directory names: recommended if file metadata is incomplete
• For each directory:
    Match directory_format to directory to generate metadata
    If directory does not match, no output for that directory
• Somewhat faster, but harder to debug
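
When --read-directories produces no output, it can be useful to test a directory_format against a path by hand. The sketch below shows one plausible way such matching could work, by turning each `%(name)s` placeholder into a regular-expression group that matches a single path component. The helper names, the format string, and the paths are hypothetical, and the publisher's real matching rules may differ.

```python
import re

PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def format_to_regex(directory_format):
    """Turn %(name)s placeholders into named groups matching one path component."""
    pattern = ""
    pos = 0
    for match in PLACEHOLDER.finditer(directory_format):
        # Literal text between placeholders is matched exactly.
        pattern += re.escape(directory_format[pos:match.start()])
        pattern += "(?P<%s>[^/]+)" % match.group(1)
        pos = match.end()
    pattern += re.escape(directory_format[pos:])
    return re.compile(pattern)


def match_directory(directory_format, directory):
    """Return the metadata fields read from the directory name, or None."""
    match = format_to_regex(directory_format).match(directory)
    return match.groupdict() if match else None


# Hypothetical format and paths.
fmt = "/data/%(project)s/%(model)s/%(experiment)s"
print(match_directory(fmt, "/data/cmip5/CanESM2/rcp85/mon"))
print(match_directory(fmt, "/archive/cmip5/CanESM2"))
```

The second call returns None, which corresponds to the "no output for that directory" case on the slide.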

Page 20: CMIP5 / ESG-CET Publication Tutorial

Publishing checksums – two approaches

First approach: Enable checksum generation by default. In esg.ini [DEFAULT] section:

checksum = md5sum | MD5

Problem with first approach: publication may slow significantly.

Second approach (V2.9.0+): disable checksum option, then:
• Publish without checksums, initially
• Generate checksums independently, add to a ‘mapfile’ foo.txt of the form:
    dataset_name | absolute_path | byte_length | checksum=value | checksum_type=MD5
    …
• Add the checksums to the node database:
    % esgupdate_metadata foo.txt
• Republish:
    % esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish
• Assumes that the dataset has not changed since initial publication
• Query to list checksums:
    % esgquery_gateway --urls dataset_name
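
For the second approach, the checksum mapfile can be produced by any tool that writes lines in the form shown above. Here is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the dataset name and file path are supplied by the caller.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_mapfile_line(dataset_name, path):
    """Build one mapfile line: name | path | size | checksum | checksum_type."""
    return " | ".join([
        dataset_name,
        os.path.abspath(path),
        str(os.path.getsize(path)),
        "checksum=" + md5_of(path),
        "checksum_type=MD5",
    ])
```

Writing one such line per file to foo.txt gives the input expected by the esgupdate_metadata step above.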

Page 21: CMIP5 / ESG-CET Publication Tutorial

Publishing replica datasets

Differs from non-replica datasets:
• Maintains the replica version. (By default the publisher generates the dataset version)
• Sets catalog properties to flag replicated status
    Currently sets master_gateway property

Form of publication command for replication:

% esgpublish --replica origin_host_id [--version-list versions.txt] other_options …

--version-list (V2.9.0+):
• Text file of the form:
    dataset_name | version
    dataset_name | version
    …

Proposed: add properties to the catalog for origin_host and publishing_host

Page 22: CMIP5 / ESG-CET Publication Tutorial

Publisher GUI

% esgpublish_gui &

• Uses Tcl/Tk

Function menu
• All functionality of publisher scripts

Dataset window
• Listing of datasets being processed
• Select dataset to display / edit metadata

Output window
• Standard output, error messages

Status bar

Page 23: CMIP5 / ESG-CET Publication Tutorial

Publisher GUI

Metadata editor
• Display / edit dataset-level (global) metadata

Fields are defined in esg.ini:
• categories option
    name | category_type | is_mandatory | is_thredds_property | display_order
    Ex: experiment | enum | true | true | 1
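
Each line of the categories option carries five fields, which map naturally onto a small record type. The sketch below parses such lines under the assumption that the fields appear exactly in the order shown; the `Category` type and `parse_categories` helper are hypothetical, and the `model` entry is an invented example alongside the `experiment` entry from the slide.

```python
from collections import namedtuple

Category = namedtuple(
    "Category",
    "name category_type is_mandatory is_thredds_property display_order",
)


def parse_categories(value):
    """Parse the categories option into Category records, sorted for display."""
    categories = []
    for line in value.strip().splitlines():
        if not line.strip():
            continue
        name, ctype, mandatory, thredds, order = (
            field.strip() for field in line.split("|")
        )
        categories.append(Category(
            name,
            ctype,
            mandatory.lower() == "true",
            thredds.lower() == "true",
            int(order),
        ))
    return sorted(categories, key=lambda category: category.display_order)


categories = parse_categories("""
    experiment| enum| true| true| 1
    model | enum | true | true | 0
""")
for category in categories:
    print(category.display_order, category.name, category.is_mandatory)
```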

Querying
• Select datasets based on categories (model, experiment, …)
• Categories are project-dependent

Page 24: CMIP5 / ESG-CET Publication Tutorial

Frequent Questions

How do I add a new model identifier?
• Default models table in:
    <PYTHON_HOME>/lib/python2.X/site-packages/esgcet-2.Y.Z-py2.Z.egg/esgcet/config/etc/esgcet_models_table.txt
• Copy default table to $HOME/.esgcet/esgcet_models_table.txt, add entry
• In esg.ini [initialize] section:
    initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
• % esginitialize -c

Similar process for standard names: cf-standard-name-table.xml

esgscan_directory generates no output
• Try --read-files option for CMIP5
• Check directory_format option in esg.ini

Cannot reinitialize TDS
• Check thredds_reinit_url, thredds_username, thredds_password in esg.ini
• Verify directly in browser

Page 25: CMIP5 / ESG-CET Publication Tutorial

Frequent Questions

Publication
• Access denied
    Need publisher privilege for group owning the parent dataset
    Granted by gateway administrator

Logging
• Publication
    Logging to standard output by default
    Define log_filename for file output
• TDS
    <TDS_CONTENT>/content/thredds/logs/
    Typically <TDS_CONTENT> = /esg or /usr/local/tomcat
• Tomcat
    $CATALINA_HOME/logs

Page 26: CMIP5 / ESG-CET Publication Tutorial

Resources

Data node documentation: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/
• Publisher configuration reference: http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publisher-configuration/
• CMIP5 controlled vocabulary: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5_controlled_vocab.txt/view
• CMIP5 publication best practices: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5-best-practices/

CMIP5 documentation: http://cmip-pcmdi.llnl.gov/cmip5/
• Data Reference Syntax (DRS): http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf

ESGF: Earth System Grid Federation: http://esgf.org/
• Node wiki has troubleshooting help: http://esgf.org/wiki/Cmip5DataNode

Page 27: CMIP5 / ESG-CET Publication Tutorial

Handlers, Customization

Handler: Python class that encapsulates project-specific logic for:
• Controlling what metadata is associated with a project, how it is read (project handler)
    basic_builtin, ipcc4_builtin, ipcc5_builtin
    project_handler_name = ipcc5_builtin
• I/O for specific formats (format handler)
    netcdf_builtin reads netCDF files
    format_handler_name = netcdf_builtin
• Metadata standards
    metadata_handler_name = cf_builtin
• THREDDS catalog hook: user-supplied function modifies TDS catalog

Each type of handler may be customized
• esgsetup --handler creates skeleton package
• Fill in required classes, functions
• Creates a python package, independent of esgcet
• Requires knowledge of Python
• http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/customizing-the-esg-publisher-with-handlers/

Page 28: CMIP5 / ESG-CET Publication Tutorial

CMIP5-Specific Publication

Follow the DRS specification
• dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s
• Directory layout
    Publisher allows any layout
    Good idea to follow DRS-recommended layout if possible
    Drslib has tools to manage DRS-style layout: http://esgf.org/esgf-drslib-site/

Use date-style versions
• version_by_date = true

Generate data with CMOR

Make sure esg.ini is up-to-date with CMIP5 controlled vocabulary

Page 29: CMIP5 / ESG-CET Publication Tutorial

Publication to PCMDI Gateway

Install the latest esgcet package

Check that the CMIP5 project configuration is up-to-date

Each publishing institution for the PCMDI gateway has an associated group:
• BCC, CCCMA, CMCC, CNRM, DIAS, GFDL, NCCS
• Different from data-producing institution: DIAS publishes MIROC and MRI data
• Each institution has publication (write) access, optional administrative access
• Publishing institution chooses group name

A top-level dataset exists for each group:
• pcmdi.BCC, pcmdi.CCCMA, …

Initial read access is restricted, for test publications.

When datasets are ready for distribution, read access will be granted to the CMIP Research group.