Upload
vanhanh
View
214
Download
2
Embed Size (px)
Citation preview
Fran Berman
Got Data? Building a Sustainable Environment for Data-Driven Innovation
Dr. Francine Berman
Chair, Research Data Alliance / US
Edward P. Hamilton Distinguished Professor of Computer Science, RPI
Fran Berman
Data Drives 21st Century Innovation
Safety and Security
Transformative strategies for disease treatment and well-
being
Research Insights
Efficient Physical Infrastructure Broad spectrum of goods
and services
Fran Berman
Technical infrastructure necessary for data-driven results
Data discoverability tools
Data access via portals, science gateways, etc.
Database and data collection systems
Data services to support use and re-use
Data analysis algorithms, data-driven models and simulations
Data visualization tools
Semantic frameworks
Data management systems
Data storage
Fran Berman
Social, organizational, and human infrastructure equally important
Systems Interoperability
Policy
Sustainable Economics
Common Standards
Traffic Image: Mike Gonzalez
Community Practice Workforce and
training
Fran Berman
What can we do to accelerate the development of effective data infrastructure needed for 21st
century innovation?
Data-driven Innovation
Data Infrastructure
Technical Infrastructure SW and systems, Tools and algorithms, Hardware and
facilities
Social Infrastructure Policy, Practice, Standards,
Rights, Community culture
Fran Berman
Today’s presentation: Emerging trends in the development of effective data infrastructure
for data-driven innovation
“Local” / Institutional
Data Infrastructure What’s needed to
create a 21st century university in the digital age?
Global Data Infrastructure
How do we accelerate open
access data sharing and exchange?
National Data Infrastructure How do we support stewardship and
preservation of valuable research data?
Fran Berman
Increasing national focus on data as a driver for innovation in science and society
Fran Berman, Data and Society
Fran Berman
Increasing federal R&D agency requirements for managed and accessible research data
Fran Berman
February 2013 OSTP Memo: New policies for public access to research data and publications
Agency Plan Requirements:
• Strategy for capitalizing on what exists and fostering public-private partnerships with scientific journals
• Strategy for increasing / enhancing discoverability, access, dissemination, stewardship, preservation
• Approach for measuring and enforcing compliance
• No new money: “Identification of resources within the existing agency budget to implement the plan”
Fran Berman
Fran Berman
Infrastructure realities: management, access, and use of data now and in the future presupposes data
stewardship and preservation today
• Stewardship and Preservation critical: “Homeless” data ceases to exist
• Economically sustainable data infrastructure necessary to support
– Federally mandated data management plans
– Public access to research data
– Use and re-use
– Reproducibility
• The “bigger”, longer-term, more complex, or more valuable the data is, the greater the importance of sustainable data infrastructure
Fran Berman
Data economics: Responsible data stewardship requires a viable business model for sustaining its
underlying infrastructure Data infrastructure costs increase with usage, stewardship and access requirements, perceived value Greater costs at the extremes (including “big” data) …
“Locally Manageable”
Data
More mgt., stewardship
required
Long-lived data
Big data
Coupled data
services
Access control, broad access
More curation required
Fran Berman
10.0
100.0
1000.0
10000.0
100000.0
June-97 June-98 June-99 June-00 June-01 June-02 June-03 June-04 June-05 June-06 June-07 June-08 June-09
Date
Arc
hiv
al S
tora
ge
(TB
)
Model A (8-yr,15.2-mo 2X) TB Stored Planned Capacity
It’s not just about the cost of storage
• Most valuable data replicated • As research collections increase,
storage capacity must stay ahead of demand
Information courtesy of Richard Moore, SDSC
Resources and Resource Refresh
Data Infrastructure costs include • Maintenance and upkeep
• Software tools and packages
• Utilities (power, cooling)
• Space
• Networking
• Security and failover systems
• People (expertise, help, infrastructure management, development)
• Training, documentation
• Monitoring, auditing
• Reporting costs
• Costs of compliance with regulation, policy, etc. …
SDSC Data Storage Growth ‘97-’09
Fran Berman
Economics of public access: Who pays the data bill?
Article: Science Magazine, August 9, 2013. Free public access link at http:/www.cs.rpi.edu/~bermaf/
Fran Berman
Op-ed recommendations: Cultivate / coordinate preservation and stewardship options in every sector
Charleston Ballet blog: http://allianceblog.org/tag/charleston-ballet/ ; iTunes gift card
• Evolve research culture to take
advantage of what works in the private
sector
• Create sustainable university library and repository stewardship solutions
• Clarify public sector stewardship
commitments: articulate what data
will / won’t be supported
• Facilitate private sector stewardship of public access research data as a public good
Private Sector
Public Sector
Individuals Academia
Fran Berman
Things researchers can do to support their data
Small steps for Research Groups and Projects:
1. If you don’t have one, create a data management plan for your current project for a reasonable fixed term of time. Budget realistically for the costs of data stewardship and preservation.
2. Make your data available to the community (as appropriate) by curating it and ingesting it into a publicly accessible repository
3. Cite and publish your data when you write about your results.
4. Work with your professional societies and conferences to include “data sessions”*
* Suggestion from Sibel Adali, RPI
Fran Berman
Local / Institutional infrastructure for data-driven innovation
“Local” / Institutional
Data Infrastructure What’s needed to
create a 21st century university in the digital age?
Global Data Infrastructure
How do we accelerate open
access data sharing and exchange?
National Data Infrastructure How do we support stewardship and
preservation of valuable research data?
Fran Berman
Fran Berman
What’s Needed to Create a 21st Century Campus in the Digital Age ?
• Digitally-enabled campus provides information technologies needed to enable 21st century innovation and discovery
• Key questions for institutions:
– What should we support on campus and what should we outsource?
– How can we plan ahead for growth?
– How do we build in flexibility for emerging trends (e.g. MOOCS and distance education, Cloud resources, etc.)?
Critical for success
• Sustainable business model
• Sufficient user support • Plans for
• Recovery from Failure / Disaster
• Strategic Evolution
Fran Berman
Work to be done
Finding: “The existing, aggregate, national infrastructure is not adequate to meet current
or future needs of the US open science and engineering research community.”
Finding: “Data volumes produced by most new research instrumentation, including that installed
at the campus lab level, cannot be supported by most current campus, regional, and national networking facilities. There is critical need to
restructure and upgrade local campus networks to meet these demands.”
NSF CISE Advisory Committee Task Force on Campus Bridging (March 2011)
Fran Berman
The digitally enabled campus
• Institutions have the opportunity to create 21st century campus cyberinfrastructure that sustainably supports data-driven innovation:
– Stewardship and preservation of valuable data from campus researchers
– Support for data management plans and public access policy compliance
– Coupling of data stewardship with data-driven computing, data services
– Integration of facilities for data generation, data stewardship, data use
– Potential for revenue from broader use outside of campus
Campus Research
Community
Campus Data Center
Storage, Preservation
University Library
Fran Berman
Conversations between campus researchers (data users, creators) and data stewards – the devil is in the details
Researcher: Can you host the digital data from my project?
Steward: • How much data?
• For how long?
• What is the form of the data?
• Who will be in charge of cleaning, annotating, organizing, refreshing, modifying etc. the data?
• Who should have access to the data?
• Can the data be regenerated?
• Who will pay for infrastructure support? Etc. …
Key Issues for Research Data Stewardship Organizations
• Expectations and agreements • What is the “contract” between
users and repositories, between federated partner repositories? Who pays?
• Incentives and penalties • What incentives promote good
stewardship? What happens when the system fails?
• What regulations and policies govern the stewardship relationship?
• Security, privacy, confidentiality, trust • What are appropriate
expectations?
Fran Berman
Making it happen -- A stakeholder “dream team”: CIO + VPR + University Librarian
University Librarian: Library community has critical experience with stewardship • Development of reliable
management, preservation and use environments
• Proper curation and annotation
• Understanding of policy, regulation, intellectual property issues
• Culture of collaboration and partnership
• Sustainability
VPR: Research community characterized by a culture of innovation • Periodic new starts
• Experimentation
• Collaboration and competition
• Increasing need for compliance with policy and regulation wrt conduct and outcomes of federally funded research
• CIO: Experience with reliable, at-scale infrastructure – Integrated and reliable support for campus networking, data, computing
environment
– Deep experience with campus-scale IT infrastructure in the context of institutional power, cooling, space constraints
Campus Research
Community
Campus Data Center
Storage, Preservation
University Library
Fran Berman
The “dream team” can help drive institutional culture change, leadership, competitiveness
Institutional infrastructure needed to support 21st century universities
• Research landscape is changing – Traditional academic modes and forms of research recognition
evolving: new approaches to collaboration / competition, publication, citation
• Educational landscape is changing – University curricula evolving; increasing integration of on-line / on-
site options
• Workforce is changing – More data literacy required from everyone – More data science embedded in everything – More complex policy and regulation to deal with
Fran Berman
Global data infrastructure
“Local” / Institutional
Data Infrastructure What’s needed to
create a 21st century university in the digital age?
Global Data Infrastructure
How do we accelerate open
access data sharing and exchange?
National Data Infrastructure How do we support stewardship and
preservation of valuable research data?
Fran Berman
21st century innovation: Data sharing drives new discovery
Fran Berman
Data sharing is a global issue
Science, Humanities, Arts Communities
Cyberinfrastructure professionals, data analysts, data center staff, …
Data Scientists
Libraries, Archives, Repositories, Museums
Fran Berman
Global community-driven organization launched in March 2013 to accelerate data-driven innovation
RDA focus is on building the social, organizational and technical infrastructure to
reduce barriers to data sharing and exchange
accelerate the development of coordinated global data infrastructure
The Research Data Alliance (RDA)
Fran Berman
CREATE ADOPT USE
RDA Members come together as
• Working Groups – 12-18 month efforts to build, adopt, and use specific pieces of infrastructure
• Interest Groups – longer-lived discussion forums that spawn Working Groups as specific pieces of needed infrastructure are identified.
Working Group efforts focus on the development and use of data sharing infrastructure
• Code, policy, infrastructure, standards, or best practices that are adopted and used by communities to enable data sharing
• “Harvestable” efforts for which 12-18 months of work can eliminate a roadblock
• Efforts that have substantive applicability to groups within the data community, but may not apply to everyone
• Efforts for which working scientists and researchers can start today
Fran Berman
Technical and Social Data Infrastructure
Broad spectrum of infrastructure needed to support data sharing
Common Metadata Standards
Digital Object Identifiers
Data Discoverability Tools
Data Access and Distribution Policy
Data Sharing Practice
Sustainable Economic Model
Data Services
Data Management Systems
Data Analytics
Data Storage
Data Citation Standards
Preservation Policy
Images: NASA, SDSC / UCSD
Fran Berman
Precipitous growth
RDA Launch / First Plenary
March 2013
RDA Second Plenary
September 2013
RDA Third Plenary
March 2014
First RDA organizational telecon: August 2012
Global Data Planning Meeting: October 2012
First Working Groups and Interest Groups
240 participants
First “neutral space” community meeting (Data Citation Summit)
First Organizational Partner Meet-up
First BOFs
380 participants from 22 countries
RDA Fourth Plenary
September 2014
First Organizational Assembly
6 co-located events
14 BOF, 12 Working Groups, 22 Interest Groups
497 participants
RDA Plenary 4 Amsterdam
First Working Group exchange meeting
RDA Plenary 2 Washington, DC
RDA Plenary 1 / Launch Gothenburg, Sweden
RDA Plenary 3 Dublin, Ireland
Fran Berman
The RDA Community Today: Over 2300 members from 96 countries (as of 9/14)
Acad. 63%
Govt. 18%
Private 13%
Other 6%
Map courtesy traveltip.org
North Amer
. 36%
Eur. 50%
Austral-pacific 5%
Africa 3%
South America 1%
Asia 5%
Fran Berman
RDA Interest (IG) and Working Groups (WG) by Focus 1 (as of 9/14)
Domain Science - focused • Toxicogenomics
Interoperability IG
• Structural Biology IG
• Biodiversity Data Integration IG
• Agricultural Data Interoperability IG
• Wheat Data Interoperability WG
• Digital Practices in History and Ethnography IG
• Geospatial IG
• Marine Data Harmonization IG
• Metabolomics IG
• RDA/CODATA Materials Data Infrastructure and Interoperability IG
• Research Data Needs of the Photon and Neutron Science Community IG
• Defining Urban Data Exchange for Science IG*
• The BioSharing Registry: Connecting data policies, standards and databases in the life sciences WG*
• Urban Quality of Life Indicators WG*
Community Needs - focused • Community Capability Model IG
• Engagement IG
• RDA / CODATA Summer Schools in Data Science and Cloud Computing in the Developing World WG*
• Development of Cloud Computing Capacity and Education in Developing World Research IG
• Data for Development IG
• Education and Training on handling of research data IG
* under review
Fran Berman
RDA Interest (IG) and Working Groups (WG) by Focus 2 (as of 9/14)
Data Stewardship and Services - focused • Research Data Provenance IG
• Preservation e-infrastructure IG
• RDA / WDS Publishing Data Services WG
• RDA / WDS Publishing Data Workflows WG
• Long-tail of Research Data IG
• RDA/WDS Publishing Data IG
• RDA/WDS Repository Audit and Certification WG
• Domain Repositories Interest Group
• Brokering Interest Group
• ELIXIR Bridging Force IG*
• Libraries for Research Data IG*
• RDA / WDS Certification of Digital Repositories IG
• RDA / WDS Publishing Data Cost Recovery for Data Centres IG
Reference and Sharing - focused • Data Citation WG
• Standardization of Data Categories and Codes WG
• RDA/CODATA Legal Interoperability IG
• Reproducibility IG*
• Data Description Registry Interoperability Working Group
• RDA / WDS Publishing Data Bibliometrics WG
Base Infrastructure - focused • Data Foundation and Terminology WG • Metadata Standards Directory WG • Practical Policy WG • PID Information Types WG • Data Type Registries WG • Data in Context IG
• Big Data Analytics IG • Data Brokering WG* • Federated Identity Management IG • Metadata IG • PID Interest Group • Service Management IG • Data Fabric IG
* under review
Fran Berman
Organizational Members: Alliance for Permanent Access American University Library Australian National Data Service Barcelona Supercomputing Center -
Centro Nacional de Supercomputación Columbia University Library CNRI CSC Digital Curation Center EIROForum IT Working Group eResearch Services and Scholarly
Application Development Division of Information Services, Griffith University European Data Infrastructure (EUDAT) National Institute of Advanced Industrial
Science and Technology (AIST), Japan International Association of STM
Publishers Internet2
Microsoft Research NZ eScience Infrastructure Purdue University Libraries Research Data Canada Scholarly Publishing and Academic
Resources Coalition (SPARC) Washington University in St. Louis
Libraries Science and Technology Facilities Council
Affiliates CODATA ICSU World Data System ORCID DataCite Global Alliance for Genomics and Health CASRAI
Organizations Committed to Joining RDA
Fran Berman
Create a pipeline of data sharing infrastructure efforts
that are adopted and used by communities during their development
that increase their impact through greater adoption over time
Build and expand the research data community for effective impact
globally, regionally, and within constituent groups
Evolve as a useful, relevant, and agile organization
that helps the community capitalize on opportunity and respond to challenges within the data community
RDA Medium-term (3-5 years) Strategic Goals
Fran Berman
RDA/US Goals:
Contribute to RDA “international” efforts and leadership
Bring US efforts to broader RDA community
Build the RDA community within the US
Leverage and implement RDA deliverables in the US to amplify impact
Collaborate closely with other RDA “regions” on key programs and initiatives
RDA/US: Collaborate Globally, Contribute Locally
NSF-supported RDA/US initiatives: • Outreach (RDA RDA/US) • RDA Deliverables Amplification • Student / Early Career Engagement
RDA/US Steering Committee • Fran Berman, RPI • Kathy Fontaine, RPI • Larry Lannom, CNRI • Beth Plale, IU
RDA US membership (yellow states)
Fran Berman
• Next Plenary: March 9-11, 2015, hosted by RDA/US in San Diego
– Working Meeting of the RDA with co-located community meetings
– SDSC hosting RDA Adoption Day on March 8
More information at rd-alliance.org
Fran Berman
What can we do to accelerate the development of effective data infrastructure needed for 21st
century innovation?
Thank You!