Active Data Curation in Libraries: Issues and Challenges ASEE ELD Presentation June 27, 2011 William H. Mischo & Mary C. Schlembach



Active Data Curation in Libraries: Issues and Challenges

ASEE ELD Presentation, June 27, 2011

William H. Mischo & Mary C. Schlembach

Active Data Curation
• Curation is the active use of data. It is a lifecycle process.
• Curation requires discipline-specific knowledge and experience.
• Domain-dependent curation rules and preservation actions must be merged into the scientific workflow processes.
• Need to automate data ingest, descriptive metadata creation, preservation and digital object relationships.
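As an illustration of automated descriptive metadata creation at ingest, here is a minimal Python sketch that builds a Dublin Core record and a SHA-256 fixity value for one object. The function name `describe_object`, the `record` wrapper element, and the choice to store the checksum in `dc:identifier` are assumptions made for this example, not the Grainger Library implementation.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import PurePath

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def describe_object(name: str, data: bytes, creator: str, subject: str) -> ET.Element:
    """Build a minimal Dublin Core record for one ingested object (hypothetical helper)."""
    record = ET.Element("record")  # wrapper element name is an assumption
    fields = {
        "title": name,
        "creator": creator,
        "subject": subject,
        "format": PurePath(name).suffix.lstrip(".") or "unknown",
        # Fixity value kept as an identifier for brevity; a production
        # repository would more likely record it in PREMIS.
        "identifier": "sha256:" + hashlib.sha256(data).hexdigest(),
    }
    for element, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value
    return record


rec = describe_object("run42.csv", b"1,2,3\n", "A. Researcher", "Atmospheric Sciences")
print(ET.tostring(rec, encoding="unicode"))
```

Generating the record in the same step as the checksum means that description and fixity information are captured when the data enters the workflow rather than reconstructed later.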

The Grainger Library Active Data Curation Lifecycle Elements
[Diagram; labels recovered from the slide:]
• Scientific Workflow
• Fedora/Hydra Trusted Digital Repository (OAIS compliant)
• Preservation Actions
• Metadata Management: METS, PREMIS, MODS, DC, XSLT
• Curation Rule Engine: operates on metadata, content objects, AIPs, OAI-ORE
  – Domain dependent
  – Can be invoked explicitly
  – But also automated based on system trigger events
• CI-3, CI-5 Responses
• Access Mechanisms and E-Scholarship Services, GrIPs
• SIP packages
• DIP packages
• Appraisal and Selection
• Migration and Emulation Tools
• Use, Reuse, Repurposing Tools
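The two invocation modes named for the Curation Rule Engine (explicit invocation and automation based on system trigger events) can be sketched in Python as follows. Every name here (`Rule`, `RuleEngine`, the `format_obsolete` event, the dict-based object model) is hypothetical and chosen only for illustration; the slides do not describe the engine's actual design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A curation object is modeled as a plain dict of metadata (assumption).
CurationObject = Dict[str, str]
Action = Callable[[CurationObject], None]


@dataclass
class Rule:
    name: str
    domain: str   # rules are domain dependent
    trigger: str  # hypothetical system event name
    action: Action


@dataclass
class RuleEngine:
    rules: List[Rule] = field(default_factory=list)

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def invoke(self, name: str, obj: CurationObject) -> None:
        """Explicit invocation: run one rule by name, regardless of event."""
        for rule in self.rules:
            if rule.name == name:
                rule.action(obj)
                return
        raise KeyError(name)

    def on_event(self, event: str, obj: CurationObject) -> List[str]:
        """Automated invocation: run every rule matching the event and the object's domain."""
        fired = []
        for rule in self.rules:
            if rule.trigger == event and rule.domain == obj.get("domain"):
                rule.action(obj)
                fired.append(rule.name)
        return fired


def flag_for_migration(obj: CurationObject) -> None:
    # Preservation action recorded on the object's metadata.
    obj["preservation_action"] = "migrate"


engine = RuleEngine()
engine.register(Rule("migrate-legacy", "atmospheric", "format_obsolete", flag_for_migration))

dataset = {"domain": "atmospheric"}
print(engine.on_event("format_obsolete", dataset))
```

Keeping the domain on each rule lets one engine serve several disciplines while firing only the rules that apply to a given object.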

Say What?
• What is the role of the library? The engineering librarian? The campus? The subject discipline?
• Libraries are creating content asset preservation systems: Trusted Digital Repositories. Fedora/Hydra/Archivematica at UIUC Library.
• Role for the science/engineering library: connecting data to literature.
• Knowledge creation process and libraries.
• GrIPs (Group Information Profiles).
• NSF Data Management Plans.

What Data should be Curated?
• Defining data curation: DataNet projects: Data Conservancy (Hopkins), DataONE (New Mexico).
• Purdue profiles.
• Raw data and processed data.
• We surveyed several groups in specific disciplines.
  – Atmospheric Sciences (experimental data).
  – Biophysics (simulation data).

Atmospheric Science: Experimental Data
• Five levels and two data streams:
  – Level 1: raw voltages from an instrument
  – Level 2: calibrated data derived from raw voltages
  – Level 3: image products displaying the data
  – Level 4: derived parameters, statistics, etc. from calibrated data
  – Level 5: analysis of Level 4 data that winds up in papers, publications, etc.
• Two other necessary data streams: ancillary instrument information and metadata.
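To make the relationship between levels concrete, the following Python sketch derives Level 2 and Level 4 products from Level 1 voltages. The linear gain/offset calibration, the chosen statistics, and all numeric values are assumptions for illustration only; real instruments may use quite different calibration procedures.

```python
from statistics import mean, pstdev
from typing import Dict, List


def calibrate(raw_volts: List[float], gain: float, offset: float) -> List[float]:
    """Level 1 -> Level 2: apply a linear calibration (assumed form)."""
    return [gain * v + offset for v in raw_volts]


def derive_statistics(calibrated: List[float]) -> Dict[str, float]:
    """Level 2 -> Level 4: summary parameters computed from calibrated data."""
    return {
        "mean": mean(calibrated),
        "std": pstdev(calibrated),  # population standard deviation
        "max": max(calibrated),
    }


level1 = [0.5, 1.0, 1.5]                  # made-up raw voltages
ancillary = {"gain": 2.0, "offset": 1.0}  # made-up ancillary instrument information
level2 = calibrate(level1, **ancillary)
level4 = derive_statistics(level2)
print(level2, level4)
```

The calibration step cannot be rerun from Level 1 without the ancillary values, which is one reason the ancillary instrument information has to be curated alongside the data levels.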

Biophysics: Simulation Data
• Modeling of interactions of atomic-level molecular data.
• Three levels:
  – Level 1: raw data from a simulation run: positions and velocities of particles; software widely used.
  – Level 2: various raw data extracts covering subsets of particles from the run data.
  – Level 3: visualization files (movies, images); analysis products generated from the visualization data for publication.
• Also necessary are input parameters (starting coordinates, etc.) and other metadata.

Data Management Plan
• The Data Management Plan (DMP) is a new NSF mandatory supplementary document for all research proposals.
  – http://www.nsf.gov/bfa/dias/policy/dmp.jsp
• Each directorate, including the Engineering Directorate (ENG), is providing specific directions and required elements.
• The ENG document: http://nsf.gov/eng/general/ENG_DMP_Policy.pdf

Data Management Plan
• The digital data to be archived includes analyzed data – typically data that will go into articles and papers, and the metadata that defines the data that was generated.
• For Engineering Directorate grants, raw data from sensors or other instruments is not required to be archived.

Data Management Plan
• The DMP is a maximum of two pages and will not count against the 15-page limit for proposals.
• UIUC Grainger Library has prepared an overview document and template for DMPs. Working on a Wizard.
• As part of the NSF Ethics CORE Digital Library, working on an RCR Requirement database and Wizard.