LLNL-PRES-477393
CMIP5 / ESG-CET Publication Tutorial
Bob Drach, PCMDI / LLNL
March 14, 2011
ESG-CET Architecture
Gateways support centralized services:
• Portal
• Authn / Authz
• Search
• Metadata harvesting
• Web services
Nodes are close to the data:
• Publishing
• Data servers
    • THREDDS
    • DAP: Hyrax, PyDAP
• Visualization and computation
    • LAS
    • CDAT, NCL, Ferret, …
Gateways and nodes can be co-resident
[Diagram: Data Node and Gateway components: Publisher, gridFTP, LAS, THREDDS, Data Archive, Proxy Certificate, Node Database, THREDDS Catalogs, Web Services, Harvester, MyProxy, Portal, Gateway Database.]
Terminology
Publication makes datasets visible on the gateway (portal).
• Only metadata is transferred
A dataset is a collection of files. Datasets have versions.
• The ‘unit of publication’ is a version of a dataset.
• Versions: monotonically increasing integers, may be YYYYMMDD
Datasets have string dataset identifiers that are unique system-wide.
A category is a field that is searched on the gateway.
• Ex: time_frequency, realm, experiment, …
Projects are activities that generate datasets.
• Ex: CMIP5, CMIP3
• Associated with a set of categories
An experiment describes the input conditions (initial conditions, forcing, time period, …) of a climate model experiment.
Data is generated by a climate model or from observations, reanalyses.
Other project-specific metadata may be associated with datasets.
Node Architecture
Publisher
• Scans data archive
• Generates metadata catalogs: one catalog per dataset version
• Notifies the gateway when new catalogs are available
• Written in Python
Node database
• Persistent store of publication information
    • Dataset contents, version history
    • File metadata
    • Catalog locations
    • Publication status
• Current implementation in Postgres
• May co-exist with gateway DB
Publication Process
Scan directories
• Create a list of files to be published
• Associate each file with a dataset
• Generate a mapfile (optional)
Scan data
• Read metadata from files
• Populate node database
Generate metadata THREDDS catalogs
Publish datasets
• Requires valid proxy certificate, obtained with myproxy-logon
• Notifies gateway to harvest metadata
Publisher components
Scan directories / files to produce a mapfile
% esgscan_directory [--read-files] [options] directory [directory ...]
Extract metadata from files, populate node database
% esgpublish --map mapfile --project cmip5
Generate THREDDS catalogs
% esgpublish --map mapfile --noscan --thredds
Notify gateway
% esgpublish --map mapfile --noscan --publish
Publisher GUI includes all publisher functionality
All scripts have --help options
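The four steps on this slide can be written down as data, which makes the ordering explicit. The following is a minimal sketch rather than part of the publisher: the helper name `publication_commands` and its arguments are hypothetical, the `-o mapfile` option is the one shown on the Mapfiles slide, and the function only builds argument lists without running anything.

```python
import shlex


def publication_commands(mapfile, project, directories):
    """Return the publisher invocations for one publication run, in order.

    Hypothetical helper: it only builds argument lists and executes nothing.
    """
    return [
        # 1. Scan directories / files to produce a mapfile.
        ["esgscan_directory", "--read-files", "-o", mapfile] + list(directories),
        # 2. Extract metadata from files, populate the node database.
        ["esgpublish", "--map", mapfile, "--project", project],
        # 3. Generate THREDDS catalogs without rescanning the files.
        ["esgpublish", "--map", mapfile, "--noscan", "--thredds"],
        # 4. Notify the gateway (requires a valid proxy certificate).
        ["esgpublish", "--map", mapfile, "--noscan", "--publish"],
    ]


def as_shell(argv):
    """Render one argument list the way it would be typed at a prompt."""
    return shlex.join(argv)
```

For example, `as_shell(publication_commands("cmip5.map", "cmip5", ["/data/cmip5"])[3])` renders the gateway notification step.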
[Diagram: publisher scripts (esginitialize, esgsetup, esgscan_directory, esgpublish, esgunpublish, esglist_datasets, esglist_files, esgquery_gateway, myproxy-logon) and the components they act on (mapfile, Data Archive, Node Database, THREDDS Catalogs, THREDDS, Proxy Certificate, Web Services).]
Deleting datasets
Order of operations is reverse of publication:
• Delete from gateway
• Remove TDS catalog
• (optional) delete from node DB
Delete a dataset from the gateway
% esgunpublish --skip-thredds cmip5.foo.bar
Delete a TDS catalog
% esgunpublish --skip-gateway cmip5.foo.bar
Delete a dataset entirely, including the node database
% esgunpublish --database-delete cmip5.foo.bar
Querying
List all CMIP5 datasets in the node database
% esglist_datasets cmip5
List all files in a dataset
% esglist_files cmip5.output.PCMDI.pcmdi-test.historical.fx.atmos.fx.
List all datasets in a directory on a gateway
% esgquery_gateway [--service-url gateway_service] --list pcmdi.CCCMA
List all files in a gateway dataset
% esgquery_gateway --files cmip5.output2.CCCma.CanESM2.rcp85.mon.land.Lmon.r5i1p1
THREDDS catalogs
Layout
Reinitialization loads all catalogs into the TDS
thredds_reinit_url = https://localhost:443/thredds/admin/debug?catalogs/reinit
[Diagram: the THREDDS Master Catalog and the ESG Root Catalog, with catalogs in numbered subdirectories (1/, 2/, 3/, …) under thredds_root = $ESGF_HOME/content/thredds/esgcet]
CMIP5 Metadata
CMIP5 DRS (Data Reference Syntax) defines the naming system for CMIP5 dataset identifiers, files, directories, URLs, metadata, …
CMIP5 controlled vocabulary is derived from the DRS document
• Permitted values for experiments, models, institutions, …
• Should be consistent with publisher configuration:
    • esg.ini
    • esgcet_models_table.txt
    • cf-standard-name-table.xml
CMOR (Climate Model Output Rewriter)
• Generates CMIP5-compliant data in netCDF format
• Fortran-90, C, Python interfaces
Publisher configuration, setup
Publisher locates esg.ini in the order:
• Environment variable ESGINI
• $HOME/.esgcet/esg.ini
• <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/esg.ini
• esg.ini in working directory
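The lookup order above amounts to "first existing candidate wins". The sketch below illustrates that idea only and is not the publisher's own code: the function name `find_esg_ini`, the `package_etc` argument (standing in for the site-packages `esgcet/config/etc` directory), and the injectable `exists` predicate are all assumptions made for the example.

```python
import os


def find_esg_ini(environ, cwd, package_etc=None, exists=os.path.isfile):
    """Return the first esg.ini found in the documented order, or None.

    environ     -- mapping of environment variables (e.g. os.environ)
    cwd         -- working directory to check last
    package_etc -- hypothetical path of the installed esgcet/config/etc directory
    exists      -- predicate used to test each candidate (injectable for tests)
    """
    candidates = []
    if environ.get("ESGINI"):
        # 1. Explicit override through the ESGINI environment variable.
        candidates.append(environ["ESGINI"])
    if environ.get("HOME"):
        # 2. Per-user configuration.
        candidates.append(os.path.join(environ["HOME"], ".esgcet", "esg.ini"))
    if package_etc:
        # 3. Copy installed with the esgcet package.
        candidates.append(os.path.join(package_etc, "esg.ini"))
    # 4. esg.ini in the working directory.
    candidates.append(os.path.join(cwd, "esg.ini"))
    for path in candidates:
        if exists(path):
            return path
    return None
```

A typical call would be `find_esg_ini(os.environ, os.getcwd())`.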
esgsetup creates esg.ini
• Run by esg-node installation script
• If no existing configuration, starts with <PYTHON>/lib/python2.X/site-packages/esgcet/config/etc/template.ini
    • Created in $HOME/.esgcet/esg.ini
• Otherwise updates existing esg.ini
• Options:
    • --config: create initial configuration
    • --db: initialize database
    • --thredds: initialize THREDDS server
    • --publish: configure gateway-related options (service URLs, myproxy, etc.)
    • --handler: create customized handlers
Configuration layout
Section headers
• [DEFAULT]: Options apply to all sections
• [project:foo]: Specific to project foo
    • Project name(s) are listed in project_options in [DEFAULT]
• [initialize]: Locations of model and standard name tables
• [extract]: File scan phase (metadata extraction)
    • Enable detailed logging
• [srmls]: Listing SRM files
• [hsi]: Listing HSI (HPSS mass store)
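The way [DEFAULT] options apply to every section can be seen with Python's standard `configparser`. This is a small sketch under stated assumptions: it uses the Python 3 module rather than whatever parser the publisher itself uses, and the sample values (`home`, `log_level`) are illustrative, not taken from a real esg.ini.

```python
import configparser

# Illustrative configuration; option values are assumptions for the example.
SAMPLE_INI = """
[DEFAULT]
home = /home/esg
log_level = WARNING

[initialize]
initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt

[extract]
log_level = DEBUG

[project:cmip5]
version_by_date = true
"""


def load_config(text):
    """Parse ini text; [DEFAULT] options become visible in every section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser
```

With `cfg = load_config(SAMPLE_INI)`, `cfg.get("project:cmip5", "log_level")` falls back to the [DEFAULT] value, `cfg.get("extract", "log_level")` returns the section's own override, and the `%(home)s` reference in [initialize] is expanded from [DEFAULT].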
Dataset roots, services
Dataset roots affect TDS access control, data hiding
• thredds_dataset_roots =
    root_path | location
    root_path | location
    …
• Every published file must be under a root location, and is protected by ESG (by default)
• Unpublished files under root location(s) are potentially accessible, but are not visible in TDS or the gateway
• Do not store sensitive unpublished data under a dataset root!
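A quick way to reason about these rules is to check which dataset root, if any, a given file falls under. The sketch below assumes POSIX-style paths and the `root_path | location` form shown above; the function names and the sample root names are hypothetical and not part of the publisher.

```python
import posixpath


def parse_dataset_roots(value):
    """Parse 'root_path | location' lines into a {root_path: location} dict."""
    roots = {}
    for line in value.strip().splitlines():
        if not line.strip():
            continue
        root_path, location = [field.strip() for field in line.split("|", 1)]
        roots[root_path] = posixpath.normpath(location)
    return roots


def root_for(path, roots):
    """Return the root_path whose location contains path, or None."""
    path = posixpath.normpath(path)
    for root_path, location in roots.items():
        prefix = location.rstrip("/") + "/"
        # Compare whole path components so /esg/database is not under /esg/data.
        if path == location or path.startswith(prefix):
            return root_path
    return None
```

A file for which `root_for` returns None cannot be published; any other file under a root location is potentially accessible even when unpublished, which is why sensitive data should stay outside the roots.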
Services configure access to files or aggregations
• Simple or compound
Project configuration
experiment_options defines experiments for the project
Categories
• Metadata fields that will be associated with each dataset
• Each project has a different set of categories
• May be mandatory: error if not found during the scan
• XX_options if enumerated
• TDS catalog <property> element may be created
• Basis of gateway search
Project configuration
project handler encapsulates logic associated with reading / setting metadata values
• ipcc5_builtin for CMIP5
• May be customized
Format strings
• %(option)s
• Option may be defined:
    • In the config file
    • By the handler (dynamically)
• Example: %(model_description)s
dataset_id: template for TDS dataset identifiers
• Fields used in the format string should be mandatory
• Version added by the publisher
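Format strings use the same `%(name)s` syntax as Python's `%` operator with a dictionary, so the expansion of a dataset_id template can be sketched directly. The template below is the CMIP5 dataset_id given elsewhere in this tutorial; the helper name `expand` and the error handling are assumptions for illustration, not the publisher's implementation.

```python
# CMIP5 dataset_id template as given in this tutorial.
DATASET_ID = (
    "cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s."
    "%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s"
)


def expand(template, fields):
    """Expand a %(name)s format string; fail clearly if a field is missing."""
    try:
        return template % fields
    except KeyError as exc:
        # A field used in the template has no value: this is why such
        # fields should be mandatory categories.
        raise ValueError("missing field for format string: %s" % exc.args[0])
```

Expanding the template with product=output2, institute=CCCma, model=CanESM2, experiment=rcp85, time_frequency=mon, realm=land, cmor_table=Lmon and ensemble=r5i1p1 gives the identifier used in the Querying example.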
Project configuration
Maps
• Mapping (association) from a set of independent fields to a dependent field
• The dependent field can be used in a format string
• Form:
    map_name = map(variable_1[, variable_2[, ...]] : variable_n)
        value_1 [ | value_2 [...]] | value_n
        value_1 [ | value_2 [...]] | value_n
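To make the map form concrete, the sketch below parses a map value into a lookup table keyed by the independent fields. It follows the form shown above, but the function name `parse_map` and the sample map in the usage note are hypothetical; the publisher's own parser may differ in details such as whitespace handling.

```python
import re

# Header of the form: map(variable_1, variable_2 : variable_n)
MAP_HEADER = re.compile(r"^\s*map\s*\(([^:]*):([^)]*)\)\s*$")


def parse_map(value):
    """Parse a map option value.

    Returns (independent_fields, dependent_field, table), where table maps a
    tuple of independent values to the dependent value.
    """
    lines = [line.strip() for line in value.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty map definition")
    match = MAP_HEADER.match(lines[0])
    if match is None:
        raise ValueError("expected 'map(independent, ... : dependent)'")
    independent = [name.strip() for name in match.group(1).split(",")]
    dependent = match.group(2).strip()
    table = {}
    for row in lines[1:]:
        values = [item.strip() for item in row.split("|")]
        if len(values) != len(independent) + 1:
            raise ValueError("wrong number of fields in row: %r" % row)
        table[tuple(values[:-1])] = values[-1]
    return independent, dependent, table
```

For a hypothetical definition `map(model : institute)` followed by the row `CanESM2 | CCCma`, the table maps `("CanESM2",)` to `"CCCma"`, so `%(institute)s` could then be resolved from the model name.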
Data file structure
• One variable per file (CMIP5 standard)
    variable_per_file = true
• Multiple variables per file
    variable_per_file = false
Version
• vYYYYMMDD
    version_by_date = true
• vN
Offline datasets
Offline datasets: can be listed but not opened for metadata extraction
• Published with minimal description: location and size
• No associated aggregations
Example: tertiary storage
Lister: program that generates metadata for offline datasets
• hsils.py: HPSS
• srmls.py: SRM
• msls.py: MSS
• Listers can be customized
Configuration:
• thredds_offline_services: generate TDS catalog <service> element
• offline_lister: associate service name with [lister] section
• [lister] section
    Ex:
    [hsi]
    offline_lister_executable = %(pythonbin)s/hsils.py
    hsi = /usr/local/bin/hsi
Use --offline, --service options with esgpublish, esgscan_directory
Mapfiles
Describes file contents of one or more datasets
Generate with:
% esgscan_directory [--read-files] [options] -o mapfile directory [directory ...]
File-specific fields
• Size
• Modification time: epochal time
• Checksum (if checksum configuration option set)
• Checksum type: MD5 (recommended for CMIP5) or SHA1
Format: one line per file:
    dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]]
where properties are:
    mod_time=<epochal_time>
    checksum=<checksum_value>
    checksum_type=<checksum_type>, either MD5 or SHA1
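The mapfile format is simple enough to read with a few lines of Python, which is handy for checking a mapfile before publishing. This is a sketch based only on the format described above; the function name and the returned dictionary layout are assumptions, and it does not handle paths that themselves contain a `|` character.

```python
def parse_mapfile_line(line):
    """Split one mapfile line into dataset name, path, size and properties."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least: dataset_name | absolute_path | byte_length")
    dataset_name, absolute_path, byte_length = fields[:3]
    properties = {}
    for item in fields[3:]:
        # Optional trailing fields have the form property=value.
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected property=value, got %r" % item)
        properties[name.strip()] = value.strip()
    return {
        "dataset_name": dataset_name,
        "absolute_path": absolute_path,
        "byte_length": int(byte_length),
        "properties": properties,
    }
```

Grouping the parsed lines by `dataset_name` gives the list of files the publisher would associate with each dataset.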
Directory Scan Modes
esgscan_directory [--read-files | --read-directories] … :
• Associate dataset identifier(s) with files
• Create listing of files with sizes, modification times, checksums, etc.
To generate dataset identifiers, must obtain metadata from either:
• Directory names (--read-directories), or
• File metadata (--read-files)
Example:
    dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s
File metadata: recommended for CMIP5
• For each file, read metadata from file and generate dataset_id
Directory names: recommended if file metadata is incomplete
• For each directory:
    • Match directory_format to directory to generate metadata
    • If directory does not match, no output for that directory
• Somewhat faster, but harder to debug
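When --read-directories produces no output, it can help to test a directory against the template by hand. The sketch below assumes that directory_format uses the same `%(name)s` fields as dataset_id and that each field matches exactly one path component; both are assumptions for illustration, and the publisher's actual matching rules may differ. The function names and the sample template are hypothetical.

```python
import re

FIELD = re.compile(r"%\((\w+)\)s")


def template_to_regex(template):
    """Turn a %(name)s path template into a regex with one named group per field."""
    pattern, position = "", 0
    for match in FIELD.finditer(template):
        pattern += re.escape(template[position:match.start()])
        # Assumption: a field never spans a directory separator.
        pattern += "(?P<%s>[^/]+)" % match.group(1)
        position = match.end()
    pattern += re.escape(template[position:])
    # Allow deeper subdirectories below the templated part.
    return re.compile(pattern + r"(?:/.*)?$")


def match_directory(template, directory):
    """Return the fields extracted from directory, or None if it does not match."""
    match = template_to_regex(template).match(directory)
    return match.groupdict() if match else None
```

A result of None for a directory corresponds to the "no output for that directory" case, which is usually a sign that directory_format does not describe the actual layout.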
Publishing checksums – two approaches
First approach: Enable checksum generation by default. In esg.ini [DEFAULT] section:
    checksum = md5sum | MD5
Problem with first approach: publication may slow significantly.
Second approach (V2.9.0+): disable checksum option, then:
• Publish without checksums, initially
• Generate checksums independently, add to a ‘mapfile’ foo.txt of the form:
    dataset_name | absolute_path | byte_length | checksum=value | checksum_type=MD5
    …
• Add the checksums to the node database:
    % esgupdate_metadata foo.txt
• Republish:
    % esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish
• Assumes that the dataset has not changed since initial publication
• Query to list checksums:
    % esgquery_gateway --urls dataset_name
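For the second approach, the checksum lines can be generated by any tool that writes the mapfile form shown above. The following is one possible sketch using Python's standard `hashlib`; the function names are hypothetical, and it simply assumes the file has not changed since it was first published.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(dataset_name, path):
    """Format one mapfile line carrying the size and MD5 checksum of path."""
    path = os.path.abspath(path)
    return " | ".join([
        dataset_name,
        path,
        str(os.path.getsize(path)),
        "checksum=" + md5_of(path),
        "checksum_type=MD5",
    ])
```

Writing one such line per file to foo.txt produces the input for the esgupdate_metadata step.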
Publishing replica datasets
Differs from non-replica datasets:
• Maintains the replica version. (By default the publisher generates the dataset version)
• Sets catalog properties to flag replicated status
    • Currently sets master_gateway property
Form of publication command for replication:
    % esgpublish --replica origin_host_id [--version-list versions.txt] other_options …
--version-list (V2.9.0+):
• Text file of the form:
    dataset_name | version
    dataset_name | version
    …
Proposed: add properties to the catalog for origin_host and publishing_host
Publisher GUI
% esgpublish_gui &
• Uses Tcl/Tk
Function menu
• All functionality of publisher scripts
Dataset window
• Listing of datasets being processed
• Select dataset to display / edit metadata
Output window
• Standard output, error messages
Status bar
Publisher GUI
Metadata editor
• Display / edit dataset-level (global) metadata
Fields are defined in esg.ini:
• categories option
    name | category_type | is_mandatory | is_thredds_property | display_order
    Ex: experiment | enum | true | true | 1
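Each line of the categories option carries five fields, which can be read into a small record for inspection. This is a sketch that follows the field order shown above; the `Category` type and the function names are hypothetical and not part of the publisher.

```python
from collections import namedtuple

Category = namedtuple(
    "Category",
    "name category_type is_mandatory is_thredds_property display_order",
)


def parse_bool(text):
    """Interpret 'true' / 'false' (case-insensitive) as a boolean."""
    value = text.strip().lower()
    if value not in ("true", "false"):
        raise ValueError("expected true or false, got %r" % text)
    return value == "true"


def parse_category(line):
    """Parse 'name | category_type | is_mandatory | is_thredds_property | display_order'."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    name, category_type, mandatory, thredds_property, order = fields
    return Category(name, category_type, parse_bool(mandatory),
                    parse_bool(thredds_property), int(order))
```

Sorting the parsed records by `display_order` gives the order in which the metadata editor would be expected to list the fields.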
Querying
• Select datasets based on categories (model, experiment, …)
• Categories are project-dependent
Frequent Questions
How do I add a new model identifier?
• Default models table in:
    <PYTHON_HOME>/lib/python2.X/site-packages/esgcet-2.Y.Z-py2.Z.egg/esgcet/config/etc/esgcet_models_table.txt
• Copy default table to $HOME/.esgcet/esgcet_models_table.txt, add entry
• In esg.ini [initialize] section:
    initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
• % esginitialize -c
Similar process for standard names: cf-standard-name-table.xml
esgscan_directory generates no output
• Try --read-files option for CMIP5
• Check directory_format option in esg.ini
Cannot reinitialize TDS
• Check thredds_reinit_url, thredds_username, thredds_password in esg.ini
• Verify directly in browser
Frequent Questions
Publication
• Access denied
    • Need publisher privilege for group owning the parent dataset
    • Granted by gateway administrator
Logging
• Publication
    • Logging to standard output by default
    • Define log_filename for file output
• TDS
    • <TDS_CONTENT>/content/thredds/logs/
    • Typically <TDS_CONTENT> = /esg or /usr/local/tomcat
• Tomcat
    • $CATALINA_HOME/logs
Resources
Data node documentation: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/
• Publisher configuration reference: http://www2-pcmdi.llnl.gov/Members/bdrach/.personal/esg-publisher-configuration/
• CMIP5 controlled vocabulary: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5_controlled_vocab.txt/view
• CMIP5 publication best practices: http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/cmip5-best-practices/
CMIP5 documentation: http://cmip-pcmdi.llnl.gov/cmip5/
• Data Reference Syntax (DRS): http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
ESGF: Earth System Grid Federation: http://esgf.org/
• Node wiki has troubleshooting help: http://esgf.org/wiki/Cmip5DataNode
Handlers, Customization
Handler: Python class that encapsulates project-specific logic for:
• Controlling what metadata is associated with a project, how it is read (project handler)
    • basic_builtin, ipcc4_builtin, ipcc5_builtin
    • project_handler_name = ipcc5_builtin
• I/O for specific formats (format handler)
    • netcdf_builtin reads netCDF files
    • format_handler_name = netcdf_builtin
• Metadata standards
    • metadata_handler_name = cf_builtin
• THREDDS catalog hook: user-supplied function modifies TDS catalog
Each type of handler may be customized
• esgsetup --handler creates skeleton package
• Fill in required classes, functions
• Creates a python package, independent of esgcet
• Requires knowledge of Python
• http://esg-pcmdi.llnl.gov/internal/esg-data-node-documentation/customizing-the-esg-publisher-with-handlers/
CMIP5-Specific Publication
Follow the DRS specification
• dataset_id = cmip5.%(product)s.%(institute)s.%(model)s.%(experiment)s.%(time_frequency)s.%(realm)s.%(cmor_table)s.%(ensemble)s
• Directory layout
    • Publisher allows any layout
    • Good idea to follow DRS-recommended layout if possible
    • Drslib has tools to manage DRS-style layout: http://esgf.org/esgf-drslib-site/
Use date-style versions
• version_by_date = true
Generate data with CMOR
Make sure esg.ini is up-to-date with CMIP5 controlled vocabulary
Publication to PCMDI Gateway
Install the latest esgcet package
Check that the CMIP5 project configuration is up-to-date
Each publishing institution for the PCMDI gateway has an associated group:
• BCC, CCCMA, CMCC, CNRM, DIAS, GFDL, NCCS
• Different from data-producing institution: DIAS publishes MIROC and MRI data
• Each institution has publication (write) access, optional administrative access
• Publishing institution chooses group name
A top-level dataset exists for each group:
• pcmdi.BCC, pcmdi.CCCMA, …
Initial read access is restricted, for test publications.
When datasets are ready for distribution, read access will be granted to the CMIP Research group.