Fundamental Practices for Preparing Data Sets Robert Cook ORNL Distributed Active Archive Center...
If you can't read please download the document
Fundamental Practices for Preparing Data Sets Robert Cook ORNL Distributed Active Archive Center Environmental Sciences Division Oak Ridge National Laboratory
Fundamental Practices for Preparing Data Sets Robert Cook ORNL
Distributed Active Archive Center Environmental Sciences Division
Oak Ridge National Laboratory Oak Ridge, TN [email protected]
CC&E Joint Science Workshop College Park, MD April 19,
2015
Slide 2
CC&E Workshop, Best Data Management Practices, April 19,
2015 Data centers use the 20-year rule The data set and
accompanying documentation should be prepared for a user 20 years
into the future--what does that investigator need to know to use
the data? Prepare the data and documentation for a user who is
unfamiliar with your project, methods, and observations NRC (1991)
2
Slide 3
CC&E Workshop, Best Data Management Practices, April 19,
2015 Fundamental Data Practices 1.Define the contents of your data
files 2.Define the variables 3.Use consistent data organization
4.Use stable file formats 5.Assign descriptive file names
6.Preserve processing information 7.Perform basic quality assurance
8.Provide documentation 9.Protect your data 10.Preserve your data
3
Slide 4
CC&E Workshop, Best Data Management Practices, April 19,
2015 1. Define the contents of your data files Content flows from
science plan (hypotheses) and is informed from requirements of
final archive. Keep a set of similar measurements together in one
file same investigator, methods, time basis, and instrument No hard
and fast rules about contents of each file. 4
Slide 5
CC&E Workshop, Best Data Management Practices, April 19,
2015 2. Define the variables 1.Choose the units and format for each
variable, 2.Explain the format in the metadata, and 3.Use that
format consistently throughout the file Date / Time Example e.g.,
use yyyymmdd; January 2, 1999 is 19990102 Report in both local time
and Coordinated Universal Time (UTC) and 24-hour notation (13:30
hrs instead of 1:30 p.m.) Use a code (e.g., -9999) for missing
values 5 Representation of dates and times
Slide 6
CC&E Workshop, Best Data Management Practices, April 19,
2015 ISO Formatted Dates Sort Chronologically 6
Slide 7
CC&E Workshop, Best Data Management Practices, April 19,
2015 2. Define the variables (cont) Use commonly accepted variable
names and units ORNL DAAC Best Practices (Hook et al., 2010)
Additional examples of variable names, units, and their formats
Next Generation Ecosystem Experiment Arctic Guidance for variable
names and units FLUXNET Guidance for flux tower variable names and
units 7 UDUNITS Unit database and conversion between units CF
Standard Name Climate Forecast (CF) standards promote sharing
International System of Units
Slide 8
CC&E Workshop, Best Data Management Practices, April 19,
2015 2. Define the variables (cont) 8 Scholes (2005) Be consistent
Explicitly state units Use ISO formats Variable Table
Slide 9
CC&E Workshop, Best Data Management Practices, April 19,
2015 2. Define the variables Site Table 9 Site NameSite Code
Latitude (deg ) Longitude (deg) Elevation (m) Date Kataba
(Mongu)k-15.4389223.2529811952000.02.21
Pandamatengap-18.6565125.4995511382000.03.07 Skukuza Flux Tower
skukuza-31.4968825.01973365 2000.06.15 Scholes, R. J. 2005. SAFARI
2000 Woody Vegetation Characteristics of Kalahari and Skukuza
Sites. Data set. Available on-line [http://daac.ornl.gov/] from Oak
Ridge National Laboratory Distributed Active Archive Center, Oak
Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/777
Slide 10
CC&E Workshop, Best Data Management Practices, April 19,
2015 3. Use consistent data organization (one good approach)
StationDateTempPrecip Units YYYYMMDDCmm HOGI19961001120
HOGI19961002143 HOGI1996100319-9999 Note: -9999 is a missing value
code for the data set 10 Each row in a file represents a complete
record, and the columns represent all the variables that make up
the record.
Slide 11
CC&E Workshop, Best Data Management Practices, April 19,
2015 3. Use consistent data organization (a 2 nd good approach)
StationDateVariableValueUnit HOGI19961001Temp12C
HOGI19961002Temp14C HOGI19961001Precip0mm HOGI19961002Precip3mm 11
Variable name, value, and units are placed in individual rows. This
approach is used in relational databases.
Slide 12
CC&E Workshop, Best Data Management Practices, April 19,
2015 3. Use consistent data organization (cont) Be consistent in
file organization and formatting dont change or re-arrange columns
Include header rows (first row should contain file name, data set
title, author, date, and companion file names) column headings
should describe content of each column, including one row for
variable names and one for variable units 12
Slide 13
CC&E Workshop, Best Data Management Practices, April 19,
2015 13 Example of Poor Data Practices for Collaboration and Data
Sharing Courtesy of Stefanie Hampton, NCEAS Problems with
spreadsheets Multiple tables Embedded figures No headings / units
Poor file names Problems with spreadsheets Multiple tables Embedded
figures No headings / units Poor file names
Slide 14
CC&E Workshop, Best Data Management Practices, April 19,
2015 Stable Isotope Data at ORNL: tabular csv format Aranabar and
Macko. 2005. doi:10.3334/ORNLDAAC/783 14
Slide 15
CC&E Workshop, Best Data Management Practices, April 19,
2015 4. Use stable file formats Los[e] years of critical knowledge
because modern PCs could not always open old file formats. Lesson:
Avoid proprietary formats They may not be readable in the future 15
http://news.bbc.co.uk/2/hi/6265976.stm
Slide 16
CC&E Workshop, Best Data Management Practices, April 19,
2015 4. Use stable file formats (cont) 16 Aranibar, J. N. and S. A.
Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern
Africa, 1995-2000. Data set. Available on-line
[http://daac.ornl.gov/] from Oak Ridge National Laboratory
Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.
doi:10.3334/ORNLDAAC/783 Use text (ASCII) file formats for tabular
data (e.g.,.txt or.csv (comma-separated values)
Slide 17
CC&E Workshop, Best Data Management Practices, April 19,
2015 4. Use stable file formats (cont) Suggested Geospatial File
Formats Raster formats Geotiff netCDF o with CF convention
preferred HDF ASCII o plain text file gridded format with external
projection information Vector Shapefile ASCII 17 GTOPO30 Elevation
Minimum Temperature
Slide 18
CC&E Workshop, Best Data Management Practices, April 19,
2015 Use descriptive file names Unique Reflect contents ASCII
characters only Avoid spaces Bad: Mydata.xls 2001_data.csv best
version.txt Better:bigfoot_agro_2000_gpp.tiff Site name Year What
was measured Project Name File Format 5. Assign descriptive file
names 18
Slide 19
CC&E Workshop, Best Data Management Practices, April 19,
2015 19 Courtesy of PhD Comics
Slide 20
CC&E Workshop, Best Data Management Practices, April 19,
2015 Biodiversity Lake Experiments Field work Grassland
Biodiv_H20_heatExp_2005_2008.csv
Biodiv_H20_predatorExp_2001_2003.csv
Biodiv_H20_planktonCount_start2001_active.csv
Biodiv_H20_chla_profiles_2003.csv 5. Assign descriptive file names
(cont) Organize files logically Make sure your file system is
logical and efficient 20 From S. Hampton
Slide 21
CC&E Workshop, Best Data Management Practices, April 19,
2015 6. Preserve processing information Keep raw data raw: Do not
Include transformations, interpolations, etc in raw file Make your
raw data read only to ensure no changes
Giles_zoopCount_Diel_2001_2003.csv TAXCOUNTTEMPC C3.9788735812.3
F0.9726135412.7 M0.5305164812.1 F011.9 C10.882389312.8
F43.529557113.1 M21.764778514.2 N61.666872512.9 Raw Data File ###
Giles_zoop_temp_regress_4jun08.r ### Load data Giles