File Formats, Conventions, and Data Level Interoperability ESDSWG New Orleans, Oct 20, 2010 Joe Glassy, Chris Lynnes ESDSWG Tech Infusion


File Formats, Conventions, and Data Level Interoperability

ESDSWG New Orleans, Oct 20, 2010 – Joe Glassy, Chris Lynnes, ESDSWG Tech Infusion

Introduction & overview

• Outline of objectives:
– Discuss the role of standard, self-describing file formats in data level interoperability
– Summarize common file formats in use, their properties, and benefits ("data life-cycle economics")
– Discuss criteria for choosing a file format, matching it to the needs of consumers and producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA MEaSUREs F/T, SMAP, AIRS, Aura

Role(s) of File Formats in Interoperability

• File formats represent versatile “packages” for multi-dimensional science data and metadata.

• Offer self-describing “well-known structures” to codify desired, common conventions and practices.

• Offer well-documented reference cases to encapsulate specific data models.

• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability.

• Enhance mission-to-mission continuity.

…investment life-cycle economics…

Why (and how) are file formats important?

• Standard formats:
– Come with thorough documentation
– Provide good reference implementations
• Common formats:
– More datasets in a format → more tools that read that format
• Canonical structures and names → general-purpose handlers for coordinates, etc. → smarter tools
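The last point can be sketched in a few lines: the canonical netCDF rule that a coordinate variable is a 1-D variable sharing its dimension's name lets a generic tool find coordinates without per-dataset knowledge. The toy file description below is hypothetical:

```python
# Toy, hypothetical description of a file: dimension sizes and the
# dimension tuple of each variable (as a netCDF reader might report them).
dimensions = {"time": 12, "lat": 180, "lon": 360}
variables = {
    "time": ("time",),              # 1-D and named after its dimension
    "lat": ("lat",),
    "lon": ("lon",),
    "tos": ("time", "lat", "lon"),  # a data variable (sea surface temperature)
}

def coordinate_variables(dims, var_dims):
    """Canonical-name rule: a 1-D variable whose name matches its
    dimension is a coordinate variable; everything else is data."""
    return {name for name, shape in var_dims.items()
            if len(shape) == 1 and shape[0] == name and name in dims}

print(sorted(coordinate_variables(dimensions, variables)))
# → ['lat', 'lon', 'time']
```

Because the rule is a convention rather than dataset-specific knowledge, the same handler works on any file that follows it.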

A generic workflow…

• Consider user community needs and culture, fit within architecture, institutional policies & preferences
• Choose a standard file format (or sub-variant)
• Design a convention-enabled, specific internal layout with metadata interfaces
• Prototype: implement in prototype, evaluate
• Implement in production context
• Integrate within discovery and catalog environments (catalog interoperability…)

Examples of standard file formats

• HDF5 – a file format on its own, as well as a broad foundation for others

• netCDF v4 (stable at v4.1.1; newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports groups, user-defined and variable-length types, and more)
• netCDF v3 Classic (legacy+, tools+, but limited)
• HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura…
• HDF4 – legacy, extensive use by MODIS Terra, Aqua
• Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)

Some selection criteria…

• Do the file format's capabilities support the required functionality?
• What is the breadth of acceptance and adoption within the larger community? (And/or does institutional policy dictate a specific format?)
• Presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?
• Contribution to investment and data life-cycle economics?
• What is the level of standardization?
• Adaptability of the format to widely used conventions like CF 1.x, or other accepted convention(s)?

Internal Layout / Design (once a format is chosen & adopted…)

• Define & refine high-level organization/structure:
– /DATA
– /METADATA
• Distinguish 'data' from 'metadata', core structure vs. 'attributes':
– Dimensions, coordinate variables, projection attributes
– missing_value, _FillValue vs. internal fill value
– Units, gain, offset, min, max, range, etc.
• Prototype it!
– Leverage scripting environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback

Using “Groups”

• HDF5 (and netCDF v4 Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc.
• Groups are useful for partitioning out functionally related sets of data or attributes; the hierarchical view mimics a file system
• Facilitates appropriate information hiding: highlight the needed info, shield the rest (principle of least privilege…)
• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs
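As a minimal sketch of the /DATA vs. /METADATA partitioning idea, plain Python paths can stand in for HDF5 group objects; the object names below are hypothetical except for FT_SSMI, which appears in the Freeze/Thaw example later:

```python
import posixpath

# Flat list of HDF5-style object paths (hypothetical layout).
paths = [
    "/DATA/FT_SSMI",
    "/DATA/FT_SSMI_flags",
    "/METADATA/title",
    "/METADATA/institution",
]

def partition_by_group(object_paths):
    """Group object paths by their top-level group, mimicking the
    file-system-like hierarchical view that HDF5 groups provide."""
    groups = {}
    for p in object_paths:
        head = "/" + p.split("/")[1]          # top-level group, e.g. "/DATA"
        groups.setdefault(head, []).append(posixpath.basename(p))
    return groups

layout = partition_by_group(paths)
```

A tool walking such a layout can show or hide whole subtrees at once, which is exactly the information-hiding benefit noted above.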

Example(s) of File Formats In Action

• HDF5 – NASA MEaSUREs Freeze/Thaw (soon available at NSIDC)
– http://measures.ntsg.umt.edu/sample_2007_day180.zip
• Aqua AIRS Level 2 (from earlier talk):
– http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf
• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)

Example: NASA MEaSUREs Freeze/Thaw, Daily, in HDF5 – Metadata Block: Attributes

Example: NASA MEaSUREs Daily Freeze/Thaw in HDF5 – Data Variable (FT_SSMI) and Attributes

Example: NASA Level 2 AIRS (Swath) in HDF4

Example: netCDF (tos) sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout

Example: TES (HDFEOS5) illustrating CF v1.0 layout

CF Conventions & file formats: how they contribute to interoperability

• CF v1.4.x -- the term “CF” is now broader than just climate-forecasting!

• Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology

• CF v1.4.x provides tool-makers with helpful “lingua-franca” guidance.

• Within a file-format, adopting conventions like CF promotes common layout, names, semantics, for dataset-to-dataset compatibility -- a key to wider data level interoperability.

Attributes vs. Metadata? (one man's ceiling is another man's floor…)

• Collection level vs. dataset vs. granule level
• Structural vs. science content
• Swath vs. grid vs. point
• Commonly used attributes:
– Conventions attribute: communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– units, _FillValue, missing_value, valid_range
– short_name, long_name, other provenance
– (gain, offset / scale_factor, add_offset), etc.
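As a sketch of how a few of these attributes work together when a consumer reads a variable, assuming the CF-style packed-data convention (unpacked = packed × scale_factor + add_offset, with the fill value marking missing cells); all the numeric values here are made up:

```python
# Attribute values a reader might find on a packed integer variable
# (hypothetical: temperatures packed as hundredths of a kelvin above 0 °C).
scale_factor = 0.01
add_offset = 273.15
fill_value = -9999

def unpack(packed_values):
    """Apply the CF packing convention: unpacked = packed * scale_factor
    + add_offset, propagating fill-value cells as None."""
    return [None if v == fill_value else v * scale_factor + add_offset
            for v in packed_values]

unpacked = unpack([100, -9999, 250])
```

Because the recipe lives in the attributes rather than in documentation, any convention-aware tool can unpack the data without dataset-specific code.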

Challenges? (just a few remain…)

• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF5 v1.8.x vs. v1.6.x families?
– netCDF v4 Enhanced vs. netCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?
• Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…
• Multi-library dependency stacks beg for a fully bundled, version-matched, end-to-end install package!
• The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that's good, too…)

Resources: URLs

• Climate and Forecast (CF) Conventions (now at 1.4.x):
– http://cf-pcmdi.llnl.gov/
– http://cf-pcmdi.llnl.gov/documents/cf-conventions
• HDF:
– http://www.hdfgroup.org/HDF5/doc/index.html
• HDFEOS:
– http://www.hdfgroup.org/hdfeos.html
– http://hdfeos.org/software/aug_hdfeos5.php
• NetCDF:
– http://www.unidata.ucar.edu/software/netcdf/
– http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html
• General:
– http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
– http://en.wikipedia.org/wiki/List_of_file_formats

Resources: File format related Tools

• Panoply: http://www.giss.nasa.gov/tools/panoply/

• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/

• OpenDAP: http://opendap.org

• IDV: http://www.unidata.ucar.edu/software/idv/

• McIDAS: http://www.unidata.ucar.edu/software/mcidas/

• Python:
– h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
– PyTables: http://www.pytables.org/moin

• Perl: PDL::IO::HDF5 (and BioHDF?)

• Many others: HEG, MTD, the HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GrADS, MATLAB, binary APIs

A provisional DOI, UUID Strategy

• What we used for NASA MEaSUREs Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe: seedString = www.our.url/GranuleName/Datetime8601Stamp

    import uuid
    granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seedString)
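The recipe above can be sketched as a runnable example. Note that Python's uuid.uuid5 takes a namespace UUID as its first argument (NAMESPACE_URL suits URL-like seeds); the seed string here is a placeholder, not the project's real URL:

```python
import uuid

# Deterministic granule UUID (version 5, SHA-1 based) derived from a
# seed combining a base URL, granule name, and ISO 8601 timestamp.
# The seed components are hypothetical placeholders.
seed_string = "www.example.org/GranuleName/2007-06-29T00:00:00Z"

# The same seed always yields the same UUID, so the identifier can be
# regenerated from the granule's own naming information.
granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seed_string)

print(granule_uuid)
```

Determinism is the point of choosing version 5 over version 4: any party holding the naming convention can reproduce the granule's UUID without a registry.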