Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs and Increasing Value

Changing the Curation Equation: A Data Lifecycle Approach to

Lowering Costs and Increasing Value

Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4

[email protected]

1 School on Information, University of Michigan, Ann Arbor, MI, United States. 2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States. 3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. 4 Data To Insight Center, Indiana University, Bloomington, IN, United States. 5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.

mailto:[email protected]

Outline

• Quick Project Intro• What is SEAD? (Stop by the SEAD booth!)• Why is SEAD? • How does SEAD work?• Future active and social curation work

SEAD: Sustainable Environment -Actionable Data

• An NSF DataNet project started in October, 2011

• An international resource for sustainability science

• A provider of light-weight Data Services based on novel technical and business approaches:– Supporting the long-tail of research– Enabling active and social curation– Providing integrated lifecycle support for data

http://sead-data.net/

Margaret Hedstrom, PIPraveen Kumar, co-PIJim Myers, co-PIBeth Plale, co-PI

Sustainability Research• Central to solving many of society’s most critical

challenges• An exemplar of modern research

– Local processes aggregating to produce global consequences– Multiple time scales– Coupling of natural and human systems– Interacting systems-of-systems requiring multidisciplinary

understanding • Environmental – Economic - Social

Science

Technology

Economics

Poverty & Justice

Policy

Cooperation

SEAD is:

• Data discovery• Project workspaces• A data-aware community network• Curation and preservation services that link to multiple archives and discovery services

SEAD is:• Secure project spaces where teams can:– Gather reference data– Upload and share new results– Annotate– Relate– Organize– Publish

Project Dashboard

SEAD is:• An active repository that creates data pages with– Previews– Extracted Metadata– Overlays– Tags– Comments– Provenance– Use information– Download/Embed

• A tool for community exploration:– Personal and Project Profiles– Publications and Data Citations– Co-author, co-investigator graphs– Temporal analysis

SEAD is:

SEAD is:

• A way to preprint and publish data:– Branded interface– Discovery metadata– Drill-down• Sub-collections• Data Pages

– Submit for curation and preservation The National Center for Earth Surface Dynamics

~1.6 TB, 450K files (2.2 M objects) representing 10 years of research by multiple teams

SEAD is:• A community platform for reference data:– Research Object management– Inference– Curation– Preservation– ID assignment– Catalog Registration– Discovery– Citation Generation

SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.

Semantic Content Middleware over Scalable File System and Triple Store

Flickr-style web management of dataGeospatial, social network mash-ups, workflows and services

Curation Services to harvest and package specific data sets

Federation of OAI repositories for long-term preservation

Sensor data

– Apps read what they need and write what they know– Curation snapshots meaningful Research Objects– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level

Why is SEAD needed for curation?

• The nature of modern research• The nature of the data documentation problem• Artificial limitations derived from historical

practice

Unless these issues are addressed (in addition to sheer scale), data curation will remain too cumbersome and expensive for ubiquitous use…

Data Challenges in Sustainability Research• Many dimensions, many coordinate systems, many scales,

many formats, a long-tail of providers and users, …• Managing this data is a drag on productivity…

The Long Tail in Research• Individuals/small groups where:– Scale of research prohibits traditional CI

development, dedicated IT support, full-time curator…

– shared data but multiple disciplinary views – Projects involve reference data from external

sources– Project Team does not control formats and

vocabularies These are not just “challenges for the “future

Analyzing the curation/preservation problem…

• Data and Metadata are known well during the project

• Producers actually memorize or record metadata already, and then spend precious time transferring that between people and systems

• Data users manually assemble missing data/metadata but don’t often have a way to share that with others

• Repositories struggle to attain the domain understanding needed to go beyond basic bibliographic info– Repositories only use metadata to help

with data discovery and internal curation decisions

Bill Michener – DataONE

Who knows what?When do they know it?Why will they tell you?

Producers Users

Jim Myers - SEAD

Our collective legacy• Data can only be in one place…• Data transfer is costly…• Mistakes are costly…• Only the future needs well-organized data

(questionable assumptions)• Curation only happens at data/project/center end-of-life• Submission events must be formal and complete• Only cross-trained professionals are capable of getting it right• Researchers should see curation only as a public service

What’s different for users?• When you add a file:

– You can get it back, from anywhere– You can see your video, zoom in on images, overlay spatial data

on maps and retrieve them from an OGC service endpoint– You see the metadata hidden in the file– You can add titles, descriptions, locations, tags later, not as

required parts of a long submit form, and• When you do, they are search terms and ways to create custom maps

– You can add good data and bad, and figure out which data to keep later (using provenance to guide you)

– Users of your data can add metadata, comments, and derived datasets that improve quality, adapt the data for new purposes, etc.

What’s different for curators• Curation starts with data and metadata in hand, not as a

search through dusty disks• Curators can embed with project teams• Data comes with

– Formal metadata (dc:creator= http://vivo-vis-test.slis.indiana.edu/vivo/individual/n7732 )

– Informal metadata (http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag tag:cet.ncsa.uiuc.edu,2008:/tag#bpnm)

– Context! (“bpnm” in the WSC_Reach project always means “Birds Point/New Madrid”)

– Producers and users – conversations are possible

• Packaging, repository selection, submission, registration with catalogs are all automated/semi-automated…

http://vivo-vis-test.slis.indiana.edu/vivo/individual/n7732



http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag

http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag

SEAD Concept• Leverage incremental, informal active use to

capture data and metadata from first sources• Provide data-related (metadata-driven) services

to active producers and users of data• Simplify and automate curation and preservation

processes using captured information and context• Leverage existing institutional repository

technologies and organizations to provide long-term storage

Increase Value, Lower Costs, Increase Immediacy

SEAD is:• Write once, re-use• Extensible (data, metadata) – within sustainability

research and beyond• Incremental• Living datasets published Research Objects• Scalable

A tool for data producers and users… that also provides a long-term data plan… that can be sustainable at community scale

How?• Web 2.0, Web 3.0… • Strong collaboration with researchers and curators

• Leveraging standards – vocabularies, service endpoints, transfer protocols, submission packages, …

• Leveraging existing software – Medici/Tupelo, VIVO, DataConservancy + Jena, GWT, Geoserver, MySql, Fuseki, …

Current Status• 10 hosted project spaces for pilot groups on VM farm +

community VIVO, VA servers– ~< 2 TB, ~1800 profiles, proof-of-concept submissions to UI

and IU institutional repositories• 1.0 OSS release in November, operating as a DataOne

production Member Node (next week)– Google sign-in, cybersecurity and usability enhancements,

data-maturity-based access control, dashboard, public discovery, and geobrowse interfaces, …

• Project info: http://sead-data.net• Demo Space: http://sead-demo.ncsa.illinois.edu


http://sead-demo.ncsa.illinois.edu/

Going forward– Version 1.0 released– Open early adopter period– Improving scalability– Exploring social feedback mechanisms to further

improve curation – add value, remove costs, engage producers, users, and curators

– Active outreach: Use SEAD! (software or services), Extend SEAD!, collaborate with SEAD!

Acknowledgements

• SEAD Team @ UM, UI, IU• NSF• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability

researchers

• and Thank You!

… stop by the SEAD booth and share your thoughts!


SPARQL Q

ueries

HTTP People/Publica

tion links

SEAD: Components/ Communications

Active Content Repository (multiple webapps): Branded Public AccessActive Project SpacesIndividual Data Pages

SEAD VIVO: Browse Through People , Projects, Publications, Data Citations , and Organizations, Visualize Networks and Community Dynamics

SEAD Virtual Archive:Policy Driven CurationInstitutional/Cloud/Grid StorageFaceted Search for Reference Data

Main Website: Overview, Project Info, Services, Documentation, News

HTTP Links/Embedded Content

SPARQL QueriesHTTP Data/DOI links

BAGIT Data/ Metadata Transfer

SPARQL Q

ueries

HTTP Data li

nks

SPARQL Queries

HTTP People/Org links

VIVO Service call

HTTP Data/DOI links

Andr

oid

Upl

oad

Des

ktop

Dro

p Bo

x

Colle

ction

Pag

es

Dat

a Pa

ges

Proj

ect S

umm

ary

Geo

-web

app

Bran

ded

Repo

sito

ry

Map

Pag

e

Adm

in P

age

Tag

Page

Sear

ch P

age

Web Service APIsWeb Application

User Management Role-based Access Control

Data/Metadata Mgmt

Extractors and Indexing

Tupelo 2 Lucene Geoserver

MySQL Local File System

SEAD Active Content Repository Architecture

SEAD ACRAdditions and 3RD Party

Components

Modified/ Configured Medici/Tupelo 2

Components

RDF + Files

Input Form/Display Generation

Internal APIs

SEAD VIVO Architecture

Peop

le

Proj

ects

Publ

icati

ons

Org

aniza

tions

Joseki/Fuseki/Web Services

Jena/RDF

MySQL Local File System

Dat

a Ci

tatio

ns

Net

wor

k Vi

sual

izatio

n

Tem

pora

l Vi

sual

izatio

n

User Management Entity Management Analytics

Cura

tor’s

W

orkb

ench

Inge

st P

roce

ssin

g

Mat

chm

akin

g

Face

t Sea

rch

Geo

-spa

tial S

earc

h

APIs Web Services

Metadata Extraction/ Persistent Identifier/

Indexing/Archival (Adapted DC Workflow)

Matchmaker/ Repository

Management

Solr Query (XML)

DataONE Member

Node Service

Geospatial Query Service

BagIt Conversion

Solr IndexerLocal File System PostGIS

SEAD Virtual Archive Architecture

SWO

RD

IUScholarworksUIUC IdealsArchival Storage