Transcript

File Formats, Conventions, and Data Level Interoperability

ESDSWG New Orleans, Oct 20, 2010
Joe Glassy, Chris Lynnes – ESDSWG Tech Infusion

Introduction & overview

• Outline of objectives:
– Discuss the role of standard, self-describing “file formats” in data level interoperability
– Summarize common file formats in use, their properties, and benefits (“data life cycle economics”)
– Discuss criteria for choosing a file format, matching it to the needs of consumers/producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA Measures F/T, SMAP, AIRS, Aura

Role(s) Of File Formats in Interoperability

• File formats represent versatile “packages” for multi-dimensional science data and metadata.

• Offer self-describing “well-known structures” to codify desired, common conventions and practices.

• Offer well-documented reference cases to encapsulate specific data models.

• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability

• Enhance Mission-to-Mission continuity

…investment life-cycle economics…

Why (and how) are file formats important?

• Standard formats
– Come with thorough documentation
– Provide good reference implementations
• Common formats
– More datasets in a format means more tools that read that format
• Canonical structures and names enable general-purpose handlers for coordinates, etc., and thus smarter tools (see the sketch below)
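
A minimal sketch of that last point in Python, assuming the netCDF4-python package and a hypothetical file "example.nc": because a coordinate variable conventionally shares its dimension's name, a generic tool can locate coordinates without any dataset-specific code.

    # Find coordinate variables by the canonical "name matches a dimension" convention
    from netCDF4 import Dataset

    with Dataset("example.nc") as nc:
        coord_vars = [name for name in nc.variables if name in nc.dimensions]
        print("coordinate variables:", coord_vars)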

A generic work flow…

• Consider user community needs and culture, fit within architecture, institutional policies & preferences

• Choose a standard file format (or sub-variant)
• Design a convention-enabled, specific internal layout with metadata interfaces
• Prototype: implement and evaluate
• Implement in production context
• Integrate within discovery and catalog environments (catalog interoperability…)

Examples of standard file formats

• HDF5 – a file format on its own, as well as a broad foundation for others

• netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports Groups, user-defined types, variable-length types, and more)
• netCDF v3 Classic (legacy+, tools+, but limited)
• HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura…
• HDF4 – legacy, extensive use by MODIS Terra, Aqua
• Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)

Some selection criteria…
• Do the file format's capabilities support the required functionality?
• What is the breadth of acceptance and adoption within the larger community? (And/or, does institutional policy dictate a specific format?)
• Presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?
• Contribution to investment, data life-cycle economics?
• What is the level of standardization?
• Adaptability of the format to widely used conventions like CF 1.x, or other accepted convention(s)?

Internal Layout / Design (once format is chosen & adopted…)

• Define & refine high-level organization/structure
– /DATA
– /METADATA

• Distinguish 'data' from 'metadata', core structure vs. 'attributes'
– Dimensions, coordinate variables, projection attributes
– missing_data, _FillValue vs. internal fill value
– units, gain, offset, min, max, range, etc.

• Prototype it! (see the sketch below)
– Leverage scripting environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback
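
A minimal prototyping sketch along those lines, assuming h5py and numpy are installed; the group, variable, and attribute names below are illustrative, not a required layout:

    import numpy as np
    import h5py

    with h5py.File("prototype_layout.h5", "w") as f:
        # High-level organization: separate groups for data and metadata
        data_grp = f.create_group("DATA")
        meta_grp = f.create_group("METADATA")

        # Example science variable carrying the attribute "recipe" items above
        ft = data_grp.create_dataset("FT_example", data=np.zeros((180, 360), dtype="i1"))
        ft.attrs["long_name"] = "example freeze/thaw state"
        ft.attrs["units"] = "1"
        # CF-style _FillValue attribute (distinct from the HDF5 internal fill value)
        ft.attrs["_FillValue"] = np.int8(-99)

        # Collection-level (global) attributes
        f.attrs["Conventions"] = "CF-1.4"
        meta_grp.attrs["title"] = "prototype internal layout"

Opening the result in Panoply or HDFView gives quick visual feedback on the layout.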

Using “Groups”

• HDF5 (and NetCDF v4-Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc.

• Groups are useful for partitioning functionally related sets of data or attributes; the hierarchical view mimics a file system

• Facilitates appropriate information hiding: highlights the needed info, shields the rest (principle of least privilege…)

• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs (see the traversal sketch below).
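
As a quick illustration of how group-aware tools see this hierarchy, here is a small traversal sketch, assuming h5py and the hypothetical file written in the earlier sketch:

    import h5py

    def show(name, obj):
        # Print each group/dataset path, mimicking a file-system listing
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        print("/%s  (%s)" % (name, kind))

    with h5py.File("prototype_layout.h5", "r") as f:
        f.visititems(show)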

Example(s) of File Formats In Action

• HDF5 – NASA Measures
– NASA Measures Freeze/Thaw (soon available at NSIDC)
– http://measures.ntsg.umt.edu/sample_2007_day180.zip

• AQUA AIRS Level 2 (from earlier talk; see the access sketch below):
– http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf

• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)
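
A minimal access sketch for the AIRS granule above, assuming a netCDF4-python build with OPeNDAP (DAP) support and that the server listed above still serves this granule:

    from netCDF4 import Dataset

    url = ("http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/"
           "2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf")

    nc = Dataset(url)          # opens over OPeNDAP, no local download needed
    print(list(nc.variables))  # list the variables the server exposes
    nc.close()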

Example: NASA Measures Freeze/Thaw, Daily in HDF5 – Metadata Block: Attributes

Example: NASA Measures Daily Freeze/Thaw in HDF5 – Data Variable (FT_SSMI) and Attributes

Example: NASA Level 2 AIRS (Swath) in HDF4

Example: NetCDF (tos) sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout

Example: TES (HDFEOS5) illustrating CF v1.0 layout

CF Conventions & file formats: how they contribute to interoperability

• CF v1.4.x -- the term “CF” is now broader than just climate-forecasting!

• Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology

• CF v1.4.x provides tool-makers with helpful “lingua-franca” guidance.

• Within a file format, adopting conventions like CF promotes common layout, names, and semantics for dataset-to-dataset compatibility – a key to wider data level interoperability (a brief sketch follows).
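
A minimal sketch of what CF adoption looks like in a file, using the netCDF4-python package; the file name, dimension sizes, and values below are illustrative:

    import numpy as np
    from netCDF4 import Dataset

    with Dataset("cf_example.nc", "w", format="NETCDF4_CLASSIC") as nc:
        nc.Conventions = "CF-1.4"
        nc.title = "CF layout example"

        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)

        lat = nc.createVariable("lat", "f4", ("lat",))
        lat.standard_name = "latitude"
        lat.units = "degrees_north"
        lat[:] = np.linspace(-89.5, 89.5, 180)

        lon = nc.createVariable("lon", "f4", ("lon",))
        lon.standard_name = "longitude"
        lon.units = "degrees_east"
        lon[:] = np.linspace(-179.5, 179.5, 360)

        tos = nc.createVariable("tos", "f4", ("lat", "lon"), fill_value=-9999.0)
        tos.standard_name = "sea_surface_temperature"
        tos.units = "K"
        tos[:] = 273.15

Because the names and units come from the CF Standard Name Table, a CF-aware tool can plot or regrid this file with no dataset-specific configuration.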

Attributes vs. Metadata? One man's ceiling is another man's floor…

• Collection level vs. data set vs. granule level
• Structural vs. science content
• Swath vs. grid vs. point
• Commonly used attributes (see the inspection sketch below):
– Conventions attribute, which communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– units, _FillValue, missing_data, valid_range
– short_name, long_name, other provenance
– (gain, offset / scale_factor, addOffset), etc.
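
A short inspection sketch, assuming the netCDF4-python package and the hypothetical "cf_example.nc" file from the earlier sketch, showing how a tool distinguishes global (collection-level) attributes from per-variable attributes:

    from netCDF4 import Dataset

    with Dataset("cf_example.nc") as nc:
        # Global attributes: Conventions, title, history, ...
        for name in nc.ncattrs():
            print("global: %s = %r" % (name, getattr(nc, name)))

        # Per-variable attributes: units, _FillValue, long_name, ...
        for vname, var in nc.variables.items():
            for attr in var.ncattrs():
                print("%s: %s = %r" % (vname, attr, getattr(var, attr)))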

Challenges? (just a few remain…)

• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF5 v1.8.x vs. v1.6.x families?
– NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?

• Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…

• Multi-library dependency stacks beg for a fully bundled, version-matched end-to-end install package!

• The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that's good too…)

Resources: URLs
• Climate Forecast (CF) Conventions (now at 1.4.x):
– http://cf-pcmdi.llnl.gov/
– http://cf-pcmdi.llnl.gov/documents/cf-conventions

• HDF:
– http://www.hdfgroup.org/HDF5/doc/index.html

• HDFEOS:
– http://www.hdfgroup.org/hdfeos.html
– http://hdfeos.org/software/aug_hdfeos5.php

• NetCDF:
– http://www.unidata.ucar.edu/software/netcdf/
– http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html

• General:
– http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
– http://en.wikipedia.org/wiki/List_of_file_formats

Resources: File format related Tools

• Panoply: http://www.giss.nasa.gov/tools/panoply/

• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/

• OpenDAP: http://opendap.org

• IDV: http://www.unidata.ucar.edu/software/idv/

• McIDAS: http://www.unidata.ucar.edu/software/mcidas/

• Python:
– h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
– PyTables: http://www.pytables.org/moin

• Perl: PDL-IO-HDF5 and BioHDF?

• Many others: HEG, MTD, HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GRADS, Matlab, binary APIs

A provisional DOI, UUID Strategy

• What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe (Python; uuid5 requires a namespace UUID, e.g. uuid.NAMESPACE_URL, and the seed-string placeholders are kept as on the slide):

    import uuid
    seedString = "www.our.url/GranuleName/Datetime8601Stamp"
    uid = uuid.uuid5(uuid.NAMESPACE_URL, seedString)


Recommended