Upload
sead
View
376
Download
2
Embed Size (px)
DESCRIPTION
A presentation given at AGU 2013
Citation preview
Changing the Curation Equation: A Data Lifecycle Approach to
Lowering Costs and Increasing Value
Jim Myers1, Margaret Hedstrom1, Beth A Plale2, Praveen Kumar3, Robert McDonald4, Rob Kooper5, Luigi Marini5, Inna Kouper4, Kavitha Chandrasekar4
1 School on Information, University of Michigan, Ann Arbor, MI, United States. 2 School of Informatics and Computing, Indiana University, Bloomington, IN, United States. 3 Civil and Environmental Engineering, University of Illinois, Urbana-Champaign, IL, United States. 4 Data To Insight Center, Indiana University, Bloomington, IN, United States. 5 National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, IL, United States.
Outline
• Quick Project Intro• What is SEAD? (Stop by the SEAD booth!)• Why is SEAD? • How does SEAD work?• Future active and social curation work
SEAD: Sustainable Environment -Actionable Data
• An NSF DataNet project started in October, 2011
• An international resource for sustainability science
• A provider of light-weight Data Services based on novel technical and business approaches:– Supporting the long-tail of research– Enabling active and social curation– Providing integrated lifecycle support for data
http://sead-data.net/
Margaret Hedstrom, PIPraveen Kumar, co-PIJim Myers, co-PIBeth Plale, co-PI
Sustainability Research• Central to solving many of society’s most critical
challenges• An exemplar of modern research
– Local processes aggregating to produce global consequences– Multiple time scales– Coupling of natural and human systems– Interacting systems-of-systems requiring multidisciplinary
understanding • Environmental – Economic - Social
Science
Technology
Economics
Poverty & Justice
Policy
Cooperation
SEAD is:
• Data discovery• Project workspaces• A data-aware community network• Curation and preservation services that link to multiple archives and discovery services
SEAD is:• Secure project spaces where teams can:– Gather reference data– Upload and share new results– Annotate– Relate– Organize– Publish
Project Dashboard
SEAD is:• An active repository that creates data pages with– Previews– Extracted Metadata– Overlays– Tags– Comments– Provenance– Use information– Download/Embed
• A tool for community exploration:– Personal and Project Profiles– Publications and Data Citations– Co-author, co-investigator graphs– Temporal analysis
SEAD is:
SEAD is:
• A way to preprint and publish data:– Branded interface– Discovery metadata– Drill-down• Sub-collections• Data Pages
– Submit for curation and preservation The National Center for Earth Surface Dynamics
~1.6 TB, 450K files (2.2 M objects) representing 10 years of research by multiple teams
SEAD is:• A community platform for reference data:– Research Object management– Inference– Curation– Preservation– ID assignment– Catalog Registration– Discovery– Citation Generation
SEAD’s Virtual Archive allows curators to access, assess, enhance, package, and submit data from SEAD project repositories for long-term storage in SEAD-managed storage or external institutional repositories and cloud data services.
Semantic Content Middleware over Scalable File System and Triple Store
Flickr-style web management of dataGeospatial, social network mash-ups, workflows and services
Curation Services to harvest and package specific data sets
Federation of OAI repositories for long-term preservation
Sensor data
– Apps read what they need and write what they know– Curation snapshots meaningful Research Objects– Multiple ROs can be defined/managed re-using the same underlying ‘living’ content– The larger graph can be ~reassembled w/o the ongoing cost of managing at the item level
Why is SEAD needed for curation?
• The nature of modern research• The nature of the data documentation problem• Artificial limitations derived from historical
practice
Unless these issues are addressed (in addition to sheer scale), data curation will remain too cumbersome and expensive for ubiquitous use…
Data Challenges in Sustainability Research• Many dimensions, many coordinate systems, many scales,
many formats, a long-tail of providers and users, …• Managing this data is a drag on productivity…
The Long Tail in Research• Individuals/small groups where:– Scale of research prohibits traditional CI
development, dedicated IT support, full-time curator…
– shared data but multiple disciplinary views – Projects involve reference data from external
sources– Project Team does not control formats and
vocabularies These are not just “challenges for the “future
Analyzing the curation/preservation problem…
• Data and Metadata are known well during the project
• Producers actually memorize or record metadata already, and then spend precious time transferring that between people and systems
• Data users manually assemble missing data/metadata but don’t often have a way to share that with others
• Repositories struggle to attain the domain understanding needed to go beyond basic bibliographic info– Repositories only use metadata to help
with data discovery and internal curation decisions
Bill Michener – DataONE
Who knows what?When do they know it?Why will they tell you?
Producers Users
Jim Myers - SEAD
Our collective legacy• Data can only be in one place…• Data transfer is costly…• Mistakes are costly…• Only the future needs well-organized data
(questionable assumptions)• Curation only happens at data/project/center end-of-life• Submission events must be formal and complete• Only cross-trained professionals are capable of getting it right• Researchers should see curation only as a public service
What’s different for users?• When you add a file:
– You can get it back, from anywhere– You can see your video, zoom in on images, overlay spatial data
on maps and retrieve them from an OGC service endpoint– You see the metadata hidden in the file– You can add titles, descriptions, locations, tags later, not as
required parts of a long submit form, and• When you do, they are search terms and ways to create custom maps
– You can add good data and bad, and figure out which data to keep later (using provenance to guide you)
– Users of your data can add metadata, comments, and derived datasets that improve quality, adapt the data for new purposes, etc.
What’s different for curators• Curation starts with data and metadata in hand, not as a
search through dusty disks• Curators can embed with project teams• Data comes with
– Formal metadata (dc:creator= http://vivo-vis-test.slis.indiana.edu/vivo/individual/n7732 )
– Informal metadata (http://www.holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag tag:cet.ncsa.uiuc.edu,2008:/tag#bpnm)
– Context! (“bpnm” in the WSC_Reach project always means “Birds Point/New Madrid”)
– Producers and users – conversations are possible
• Packaging, repository selection, submission, registration with catalogs are all automated/semi-automated…
SEAD Concept• Leverage incremental, informal active use to
capture data and metadata from first sources• Provide data-related (metadata-driven) services
to active producers and users of data• Simplify and automate curation and preservation
processes using captured information and context• Leverage existing institutional repository
technologies and organizations to provide long-term storage
Increase Value, Lower Costs, Increase Immediacy
SEAD is:• Write once, re-use• Extensible (data, metadata) – within sustainability
research and beyond• Incremental• Living datasets published Research Objects• Scalable
A tool for data producers and users… that also provides a long-term data plan… that can be sustainable at community scale
How?• Web 2.0, Web 3.0… • Strong collaboration with researchers and curators
• Leveraging standards – vocabularies, service endpoints, transfer protocols, submission packages, …
• Leveraging existing software – Medici/Tupelo, VIVO, DataConservancy + Jena, GWT, Geoserver, MySql, Fuseki, …
Current Status• 10 hosted project spaces for pilot groups on VM farm +
community VIVO, VA servers– ~< 2 TB, ~1800 profiles, proof-of-concept submissions to UI
and IU institutional repositories• 1.0 OSS release in November, operating as a DataOne
production Member Node (next week)– Google sign-in, cybersecurity and usability enhancements,
data-maturity-based access control, dashboard, public discovery, and geobrowse interfaces, …
• Project info: http://sead-data.net• Demo Space: http://sead-demo.ncsa.illinois.edu
Going forward– Version 1.0 released– Open early adopter period– Improving scalability– Exploring social feedback mechanisms to further
improve curation – add value, remove costs, engage producers, users, and curators
– Active outreach: Use SEAD! (software or services), Extend SEAD!, collaborate with SEAD!
Acknowledgements
• SEAD Team @ UM, UI, IU• NSF• NCED, IRBO, WSC-Reach, IMLCZO, ICPSR, other sustainability
researchers
• and Thank You!
… stop by the SEAD booth and share your thoughts!
http://sead-data.net/
SPARQL Q
ueries
HTTP People/Publica
tion links
SEAD: Components/ Communications
Active Content Repository (multiple webapps): Branded Public AccessActive Project SpacesIndividual Data Pages
SEAD VIVO: Browse Through People , Projects, Publications, Data Citations , and Organizations, Visualize Networks and Community Dynamics
SEAD Virtual Archive:Policy Driven CurationInstitutional/Cloud/Grid StorageFaceted Search for Reference Data
Main Website: Overview, Project Info, Services, Documentation, News
HTTP Links/Embedded Content
SPARQL QueriesHTTP Data/DOI links
BAGIT Data/ Metadata Transfer
SPARQL Q
ueries
HTTP Data li
nks
SPARQL Queries
HTTP People/Org links
VIVO Service call
HTTP Data/DOI links
Andr
oid
Upl
oad
Des
ktop
Dro
p Bo
x
Colle
ction
Pag
es
Dat
a Pa
ges
Proj
ect S
umm
ary
Geo
-web
app
Bran
ded
Repo
sito
ry
Map
Pag
e
Adm
in P
age
Tag
Page
Sear
ch P
age
Web Service APIsWeb Application
User Management Role-based Access Control
Data/Metadata Mgmt
Extractors and Indexing
Tupelo 2 Lucene Geoserver
MySQL Local File System
SEAD Active Content Repository Architecture
SEAD ACRAdditions and 3RD Party
Components
Modified/ Configured Medici/Tupelo 2
Components
RDF + Files
Input Form/Display Generation
Internal APIs
SEAD VIVO Architecture
Peop
le
Proj
ects
Publ
icati
ons
Org
aniza
tions
Joseki/Fuseki/Web Services
Jena/RDF
MySQL Local File System
Dat
a Ci
tatio
ns
Net
wor
k Vi
sual
izatio
n
Tem
pora
l Vi
sual
izatio
n
User Management Entity Management Analytics
Cura
tor’s
W
orkb
ench
Inge
st P
roce
ssin
g
Mat
chm
akin
g
Face
t Sea
rch
Geo
-spa
tial S
earc
h
APIs Web Services
Metadata Extraction/ Persistent Identifier/
Indexing/Archival (Adapted DC Workflow)
Matchmaker/ Repository
Management
Solr Query (XML)
DataONE Member
Node Service
Geospatial Query Service
BagIt Conversion
Solr IndexerLocal File System PostGIS
SEAD Virtual Archive Architecture
SWO
RD
IUScholarworksUIUC IdealsArchival Storage