Managing the Impacts of Change on Archiving Research Data
A Presentation for “International Workshop on Strategies for Preservation of and Open Access to Scientific Data”
June 23, 2004
Beijing, China
Raymond McCord
Oak Ridge National Laboratory*
Oak Ridge, Tennessee, USA
*Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725
Presentation Strategy
  Change is part of Science
  Accommodating change
  Integration with good practices
Research Implies Change …
  Cycle: Research, Discovery, New questions, New information requirements, repeat…
  Not always true for other information systems
Minimize Changes / Maximize Documentation
  Unpredicted variation in data during research is:
    No excuse for loose management of changes!!
    Often used as an excuse to avoid standards.
    Unavoidable in all cases, but try…
      Missing values will occur; plan ahead
      Avoid this complexity: “Temp, temp, t, T, temperature…”
    A source of ambiguity; be clear.
      Consider the view of future users
  Minimal observational intensity is:
    No excuse (!!) for skipping documentation!!
    Quick study = no documentation?? {NO}
  The unexpected are rare and most valuable??
Management Issues to Consider
  What will change?
  Which changes can be controlled?
  How are changes approved?
  How are users notified about changes?
  How and when can changes be “smoothed” in the cumulative view?
Things that will Change
  Access expectations
    Removal or addition of access restrictions
  The scope and logical hierarchy of the information.
    New parameters
    New disciplines
    New study sites
    New data sources or methods
  Revisions and additions to metadata codes for parameters, sites, and measurements.
  Updates of hardware and software
Design Considerations (1)
  Create “extensible standards” for metadata
    Have a process for proposing and implementing new standard metadata codes.
    Record the effective dates of changes.
  Build databases and applications software “for change”
    Put labels in “lookup” tables (outside the software code)
    DO NOT let the flexibility needed to store the information become constrained by software that is too complex to be changed!!
  Ask developers, before software and databases are built: “How hard will this design be to change in the future?”
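One way to picture the “lookup table” advice together with effective dates is the minimal sketch below. It is an illustration only: the table name `parameter_codes`, its columns, the code `TA`, and the helper `label_for` are all hypothetical and do not come from the presentation.

```python
import sqlite3

# Hypothetical schema: parameter labels live in a lookup table, not in code,
# and every label carries the date range during which it was in effect.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parameter_codes (
        code           TEXT NOT NULL,
        label          TEXT NOT NULL,
        effective_from TEXT NOT NULL,
        effective_to   TEXT
    )
    """
)
# Dates are ISO 8601 strings, so plain string comparison orders them correctly.
# A NULL effective_to means the label is still in effect.
conn.executemany(
    "INSERT INTO parameter_codes VALUES (?, ?, ?, ?)",
    [
        ("TA", "Air temperature (deg F)", "1995-01-01", "1999-12-31"),
        ("TA", "Air temperature (deg C)", "2000-01-01", None),
    ],
)


def label_for(code, on_date):
    """Return the label in effect for `code` on `on_date` (YYYY-MM-DD), or None."""
    row = conn.execute(
        """
        SELECT label FROM parameter_codes
        WHERE code = ?
          AND effective_from <= ?
          AND (effective_to IS NULL OR effective_to >= ?)
        """,
        (code, on_date, on_date),
    ).fetchone()
    return row[0] if row else None
```

Adding a new code or revising a label then becomes a new row with its own effective date, not a software change.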
Design Considerations (2)
  Include notification procedures to data users about changes
    Process is simple: distribute information to previous data users.
    Records about previous data access are required.
    The description of the change may be difficult to acquire and manage.
  Allocate resources for reprocessing
    Some changes over time may be very difficult (and irritating) to the data users.
    Reprocessing can “smooth over” some changes.
    Reprocessing may be limited by available documentation.
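The dependence of notification on access records can be sketched as follows. This is a hedged illustration: the log layout, the e-mail addresses, the dataset IDs, and the function `users_to_notify` are invented for the example.

```python
from datetime import date

# Hypothetical access log: (user e-mail, dataset ID, date of download).
access_log = [
    ("a@example.org", "soil_moisture", date(2003, 5, 1)),
    ("b@example.org", "air_temp", date(2003, 8, 12)),
    ("a@example.org", "air_temp", date(2004, 1, 20)),
    ("c@example.org", "air_temp", date(2004, 3, 2)),
]


def users_to_notify(log, dataset_id, since=None):
    """Return the sorted, de-duplicated users who accessed `dataset_id`.

    If `since` is given, only accesses on or after that date count.
    """
    return sorted({
        user
        for user, ds, when in log
        if ds == dataset_id and (since is None or when >= since)
    })
```

Without the access log there is no list to distribute to, which is why the slide treats those records as a requirement.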
Change and Dataset Design
  The following series of slides presents:
    Basic “principles” for good dataset design
    AND
    How the “principles” need to be adapted to accommodate changes and future data archiving.
Rules for Creating Datasets for Archiving (1)
  Unique Occurrences
    Each type of measurement is represented in a consistent way.
    Each measurement event is represented by only one value.
  If multiple versions of datasets accumulate:
    Provide version information
    Explain version differences
    Document effective date range for each version
      When was “it done this way” (observation date range)
      When was “it distributed this way” (distribution date range)
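The two date ranges per version can be kept side by side in one record. The sketch below assumes a hypothetical `DatasetVersion` structure and made-up version data; it only shows that “observed this way” and “distributed this way” are stored and queried separately.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    version: str
    change_note: str                  # explains how this version differs
    observed_from: date               # observation date range ("done this way")
    observed_to: Optional[date]       # None means open-ended
    distributed_from: date            # distribution date range
    distributed_to: Optional[date]    # None means still being distributed


def _covers(start, end, day):
    """True if `day` falls inside the inclusive range [start, end]."""
    return start <= day and (end is None or day <= end)


def versions_distributed_on(versions, day):
    """Return the version labels that were being distributed on `day`."""
    return [
        v.version
        for v in versions
        if _covers(v.distributed_from, v.distributed_to, day)
    ]


# Illustrative data only.
VERSIONS = [
    DatasetVersion("v1", "Initial release",
                   date(1998, 1, 1), date(2001, 12, 31),
                   date(2000, 3, 1), date(2002, 6, 30)),
    DatasetVersion("v2", "Sensor recalibrated; values reprocessed",
                   date(1998, 1, 1), None,
                   date(2002, 7, 1), None),
]
```

A user who downloaded data on a known date can then be told which version they received.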
Rules for Creating Datasets for Archiving (2)
  Identifiers
    Each value is associated with a parameter name.
    Each measurement value has a quality indicator and link to a method description.
  When possible, remove multiple aliases for the same identifier (sample ID, site ID or name, measurement name, etc.).
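Alias removal can be as simple as a mapping from every known spelling to one canonical name. The sketch below uses the “Temp, temp, t, T, temperature” example from an earlier slide; the `ALIASES` table and `canonical_name` function are hypothetical, and folding case is an assumption that would be wrong wherever `t` and `T` mean different things (for example time versus temperature).

```python
# Hypothetical alias table: every known spelling maps to one canonical name.
ALIASES = {
    "temp": "temperature",
    "t": "temperature",
    "temperature": "temperature",
}


def canonical_name(raw):
    """Map a raw parameter name to its canonical form.

    Unknown names raise KeyError so that new aliases are reviewed and added
    to the table deliberately instead of slipping into the archive.
    """
    key = raw.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"unknown parameter name: {raw!r}")
    return ALIASES[key]
```

Keeping the table outside the processing code means new aliases can be added without changing software.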
Rules for Creating Datasets for Archiving (3)
  Place and Time
    Each value is associated with a unique place name with a quantitatively defined location (geographic coordinates).
    Each value is associated with a date and time.
  Do not confuse date and time for measurements with:
    Date and time for storage or revisions.
    Date and time ranges for measurement or encoding methods.
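One way to keep these dates apart is to give each its own named field. The record layout below is a sketch under assumptions: the `Measurement` class, its field names, and `measured_between` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    parameter: str
    value: float
    site: str                              # unique place name
    latitude: float                        # quantitatively defined location
    longitude: float
    measured_at: datetime                  # when the observation was made
    stored_at: datetime                    # when the record entered the archive
    revised_at: Optional[datetime] = None  # when the record was last revised


def measured_between(records, start, end):
    """Select records by observation time (start <= measured_at < end).

    Storage and revision times are deliberately ignored here.
    """
    return [r for r in records if start <= r.measured_at < end]
```

A query for “June observations” then cannot accidentally pick up records that were merely loaded or revised in June.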
Rules for Creating Datasets for Archiving (4)
  Data Storage and Transport
    Data are stored or managed with a database management system or self-documenting data format.
      NetCDF is an example of a non-proprietary data format that is self-documenting.
        Developed by the atmospheric sciences research community.
        Main documentation and software libraries are openly available.
        http://my.unidata.ucar.edu/content/software/netcdf/index.html
        Some commercial data analysis software includes interfaces to this open format.
    Include data analysis software in data management suite
      Useful for comparing versions of data that accumulate over time
    Include data format conversion software in data management suite
      Useful for migrating data from one storage technology to another
Best Practices for Preparing Ecological and Ground-Based Data Sets to Share and Archive
  Best Practices Include:
    Assign descriptive file names
    Use consistent and stable file formats
    Define the parameters
    Use consistent data organization
    Perform basic quality assurance
    Assign descriptive data set titles
    Provide documentation
  Published: Cook et al. 2001. Bulletin of the Ecological Society of America
  http://www.daac.ornl.gov/DAAC/PI/bestprac.html
A Future Scientist’s View
  Three years ago:
    I told my college-age daughter about the Japanese announcement of 1 TB of optical memory in 1 cubic centimeter.
  Her reply was:
    “…We need to know how to think critically and select what kinds of projects and data we need to keep because the limiting factor will be our minds, not the technology.”
Comments and Questions…