Building an Open Data Infrastructure for Research:
Turning Policy into Practice
Juan Bicarregui, Head of Data Services Division
STFC Department of Scientific Computing
IDCC 2013, International Digital Curation Conference, 14‐17 January 2013, Amsterdam
Overview
1. The Policy Context
– OECD
– EC/NSF/…
– G8+5
– RCUK
– Royal Society
2. Big Science Data – What do we need from our data infrastructure?
– STFC Facilities and Scientific Computing Department
– An example Collaborative Infrastructure: PaNdata
3. Fostering Collaboration on a Global Scale
– The Research Data Alliance
1. The Policy Context
• OECD, 2004‐2006
– Principles and Guidelines for Access to Research Data from Public Funding
• EC, 2007‐2012
– Recommendation on access to and preservation of scientific information
• G8+5, 2011‐2012
– Global Research Infrastructure Sub Group on Data
• Research Councils UK, 2011
– Joint Principles on Data
• Royal Society, 2011‐2012
– Science as an Open Enterprise
OECD 2004 ‐ 2006
ON ACCESS TO RESEARCH DATA FROM PUBLIC FUNDING
2003 ‐ Science and Technology Ministers called on the OECD to develop a set of guidelines based on commonly agreed principles to facilitate cost‐effective access to digital research data from public funding.
Declaration adopted on 30 January 2004
2006 ‐ Recommendation of the Council concerning Access to Research Data from Public Funding. Principles and Guidelines endorsed by the OECD Council on 14 December 2006. [C(2006)184]
The OECD member countries are: Australia, Austria, Belgium, Canada, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Japan, Korea, Luxembourg, Mexico, the Netherlands, New Zealand, Norway, Poland, Portugal, the Slovak Republic, Spain, Sweden, Switzerland, Turkey, the United Kingdom and the United States.
The Commission of the European Communities takes part in the work of the OECD.
OECD Principles and Guidelines for Access to Research Data from Public Funding
13 principles
A – Openness
• Openness means access on equal terms for the international research community at the lowest possible cost, ....
B – Flexibility, C – Transparency, D – Legal conformity, E – Protection of intellectual property, F – Formal responsibility, G – Professionalism
H – Interoperability
• Technological and semantic interoperability is a key consideration in enabling and promoting international and interdisciplinary access to and use of research data. ...
I – Quality, J – Security, K – Efficiency, L – Accountability
M – Sustainability
• ... taking administrative responsibility for the measures to guarantee permanent access to data that have been determined to require long‐term retention.
[http://www.oecd.org/dataoecd/9/61/38500813.pdf]
The Innovation Lifecycle
The Body of Knowledge
The Government Process
The Research Process
Aggregation of Knowledge lies at the heart of the innovation lifecycle
Enabling Knowledge Creation
Enabling Wealth Creation
Quality Assessment
Strategic Direction
Improved Quality of Life
Improved Understanding
Data and the Research Process
EUROPEAN COMMISSION 2007‐2012
• 14 February 2007, Commission adopted a Communication on scientific information in the digital age: access, dissemination and preservation
• November 2007, Council Conclusions on scientific information in the digital age: access, dissemination and preservation.
• October 2010, Riding the wave: How Europe can gain from the rising tide of scientific data
– Final report of the High Level Expert Group on Scientific Data
• 17 July 2012 ‐ COMMISSION RECOMMENDATION to member states on access to and preservation of scientific information
Covers Publications, Data, and Infrastructure
17 July 2012 ‐ COMMISSION RECOMMENDATION on access to and preservation of scientific information
“publicly funded research should be widely disseminated through open access publication of scientific data and papers.”
Regarding Open access to research data, member states should:
• Define clear policies for the dissemination of and open access to research data resulting from publicly funded research.
• Ensure that, as a result of these policies:
– research data that result from publicly funded research become publicly accessible, usable and re‐usable through digital e‐infrastructures. ...
– datasets are made easily identifiable and can be linked to other datasets and publications ...
PaN‐Data Infrastructure for Photon and Neutron Sources
Data Sharing Vision
[Diagram: Experiment 1, Observation 2 and Simulation 3 each show the chain Raw Data, Data Analysis, Analysed Data, Publication Data, Publications. Labels: "Different Infrastructures, Different User Experiences" and "Single Infrastructure, Single User Experience"; Capacity, Storage; Publications Repositories, Data Repositories, Software Repositories; Raw Data Catalogue, Analysed Data Catalogue, Publication Data Catalogue, Publications Catalogue]
G8+5 Global Research Infrastructure, Sub Group on Data report 2011
28 October 2011
In 2020/2030…
• Researchers and practitioners from any discipline are able to find, access and process the data they need in a timely manner.
• They are confident in their ability to use and understand data, and they can evaluate the degree to which that data can be trusted. ...
• Data are managed, shared, and preserved in a way that optimizes scientific discovery, innovation, and societal benefit. Where appropriate, producers of data benefit from opening it to broad access and routinely deposit their data in reliable repositories. A framework of repositories work to international standards, to ensure they are trustworthy...
G8+5 2011
Creating Data
• Networks of repositories, ...interoperable at global level
Preserving Data
• ...sound and sustainable data management plans.
Accessing Data
• Data should also be readily discoverable by all ...
• Global governance frameworks should eliminate unnecessary barriers ...
G8+5 2011
Underlying computing infrastructures
• In the future, ...increasingly complex data will not be understandable without ...yet‐to‐be‐developed analysis tools, ...
Cultural Aspects
• ...ensure that an adequate cohort of data specialists is available.
International Coordination and Governance
• ...Global coordination will also be required to harmonise the changes to research culture which will lead to increased motivations for collaboration and data sharing.
Data centric view of research
[Diagram labels: Data; Creation; Archival; Access; Curation; Storage, Compute, Network; Services; Virtual Research Environment; Information Infrastructure]
The researcher acts through ingest and access
The researcher shouldn’t have to worry about the information infrastructure
G8+5 2012
Framework for Global Research Infrastructures.
• Global scientific data infrastructure providers and users should establish an international forum for data interoperability. It should facilitate the exchange and interoperability of data across disciplines and national boundaries by producing high quality, relevant technical documents and procedures that influence the way researchers store, use, and manage data.
RCUK Principles on Data Policy
Seven (fairly) orthogonal principles:
• Public good
• Preservation
• Discoverability
• Protection
• First use
• Recognition
• Costs
1. Data are a Public Good
Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property.
Public good: non‐rival and non‐excludable [wikipedia]; consumption by one does not reduce availability for others, and no one can be effectively excluded from using it.
Research Data: recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings.
As few restrictions as possible: Later (distinguish registration from restriction)
Timely: Later (discipline specific)
Responsible: Later (maximising access does not necessarily maximise research benefit)
Intellectual Property: Later (balance contribution from sharing and from primary research)
RCUK Principles on Data Policy
2) Data should be managed
3) Data should be discoverable
4) There may be constraints
5) Originators may have first use
6) Reusers have responsibilities
7) Data sharing is not free
3 Dimensions of policy
[Diagram labels: Public Good; Management; Discoverability; Constraints; First Use; Recognition; The Data itself; Access]
Repeat, Repeal, Repurpose
Why might we want access to data?
Three distinct reasons for sharing data:
• Repeat ‐ Validation of previous analysis
– How does this fit with peer review?
• Repeal/Reconsider/Reform/Reverse ‐ Alternative hypotheses in the same field
– c.f. Reuse
– How does this fit with “right” to first use?
• Repurpose ‐ New research in another field
– c.f. Recycle
– How does this fit with recognition of Intellectual contribution? (What’s in it for me?)
Different concerns and requirements for each type of sharing
Royal Society, June 2012
Science as an open enterprise
• Open inquiry is at the heart of the scientific enterprise.
• Publication of scientific theories ‐ and of the experimental and observational data on which they are based ‐ permits others to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge.
• Science’s powerful capacity for self‐correction comes from this openness to scrutiny and challenge.
10 Recommendations
The Research Data Lifecycle – a personal view
[Diagram labels: Research Environment; Data; Creation; Archival; Access; Storage, Compute, Network; Services; Information Infrastructure; Provenanced Research]
The researcher acts through ingest and access
The researcher shouldn’t have to worry about the information infrastructure
2. Big Science Data
• What science does STFC do?
• The research lifecycle
• An example e‐infrastructure project.
What is STFC?
Programme includes:
• Neutron and Muon Source
• Synchrotron Radiation Source
• Lasers
• Space Science
• Particle Physics
• Computing and Data Management
• Microstructures
• Nuclear Physics
• Radio Communications
[Images: ESRF & ILL, Grenoble; Daresbury Laboratory; Square Kilometre Array; Large Hadron Collider]
What is the science?
The 7 C’s
• Creation
• Collection
• Capacity
• Computation
• Curation
• Collaboration
• Communication
[Diagram labels: Data; Creation; Archival; Access; Storage, Compute, Network; Services; Curation]
Creation
Linked systems for:
• Proposal submission
• User management
• Data acquisition
Metadata carried from each system to the next
Detectors moving from Hz to kHz, towards MHz, ...
[Image: Examining the detectors on the MAPS instrument at ISIS]
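As a rough illustration of metadata being carried from each linked system to the next, the sketch below passes one record through three stages. Every name in it (the `Record` class, the field names, the identifiers and the file name) is a hypothetical placeholder, not the facility's actual software.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Metadata accumulated as work moves between linked systems."""
    metadata: dict = field(default_factory=dict)

    def extend(self, **fields):
        # Return a new record: earlier metadata is kept, new fields are added.
        return Record({**self.metadata, **fields})


def submit_proposal(title):
    # Hypothetical proposal system: creates the first metadata entries.
    return Record({"proposal_id": "P-0001", "title": title})


def register_user(record, user):
    # Hypothetical user management system: adds who will run the experiment.
    return record.extend(user=user)


def acquire_data(record, instrument, filename):
    # Hypothetical data acquisition system: tags the raw file with prior context.
    return record.extend(instrument=instrument, raw_file=filename)


proposal = submit_proposal("Example study")
visit = register_user(proposal, "a.scientist")
raw = acquire_data(visit, "MAPS", "run_000123.raw")
print(sorted(raw.metadata))
```

Because each stage returns a new record that includes everything before it, the raw data file ends up described by the proposal and user information without anyone re-entering it.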
Capacity
[Chart: Total Data Stored (TeraBytes), logarithmic axis from 1 to 10,000, years 1997 to 2010, with reference lines labelled "10 x Moore’s Law (2 years)" and "2 x Moore’s Law (1.5 years)"]
Moore’s law for us: x1000 in 13 years, doubling every 1.3 years (about 15 months)
2012: currently store about 20 PetaBytes of data
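The doubling time quoted on this slide can be checked from the growth figure alone. The short calculation below is only a sanity check of the arithmetic; the function name is an illustrative choice, and the inputs (a factor of 1000 over 13 years) are the ones given on the slide.

```python
import math


def doubling_time(growth_factor, years):
    """Years needed to double, assuming steady exponential growth."""
    # growth_factor = 2 ** (years / T)  =>  T = years * ln(2) / ln(growth_factor)
    return years * math.log(2) / math.log(growth_factor)


years_per_doubling = doubling_time(1000, 13)
months_per_doubling = years_per_doubling * 12
print(f"{years_per_doubling:.2f} years, {months_per_doubling:.1f} months")
```

A factor of 1000 is just under ten doublings, so thirteen years gives a doubling time of about 1.3 years, consistent with the "about 15 months" stated above.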
Computation
(BlueGene/Q, 1.2 Pflop, #13; Emerald GPGPU cluster at RAL)
• Computational derivation of properties from theory
• Real‐time diagnostics of instrument performance and data flow pipeline
• Fitting of experimental data to model
• Compute intensive components on HPC
Curation
Facility Archives
• All ISIS data (~25 years) > 3,000,000 files
• All Diamond Data (~5 years) > 100,000,000 files
LHC Tier 1
• UK hub for LHC data (11PB)
Other UK Research Funders
• NERC JASMIN+CEMS super data cluster (£4.5M)
• BBSRC Institutes data archive
• MRC Data Support Service
Universities
• Imperial College ‐ National Service for Computational Chemistry Software
• Oxford, UCL, Southampton, Bristol ‐ Emerald GPGPU cluster
Others:
• Publications: The STFC Publications Archive
• Software: The CCPs (Collaborative Computational Projects)
The StorageTek tape robots: 100PB capacity
JASMIN & CEMS: 4PB parallel disk
Collection
• Proposal: Scientist submits application for beamtime
• Approval: Facility committee approves application
• Scheduling: Facility registers, trains, and schedules scientist’s visit
• Experiment: Scientist visits, facility runs experiment
• Data cleansing: Raw data filtered and cleansed
• Data analysis: Tools for processing made available
• Record Publication: Subsequent publication registered with facility
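One way to picture the collection workflow is as an ordered sequence of stages that an experiment passes through one at a time. The sketch below is an assumption-laden illustration: the stage names come from the slide, but their exact order (in particular where data analysis sits) and the `advance` helper are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    # Order assumed for illustration; the slide does not number the stages.
    PROPOSAL = 1
    APPROVAL = 2
    SCHEDULING = 3
    EXPERIMENT = 4
    DATA_CLEANSING = 5
    DATA_ANALYSIS = 6
    RECORD_PUBLICATION = 7


def advance(current):
    """Return the stage that follows `current`; the last stage has no successor."""
    if current is Stage.RECORD_PUBLICATION:
        raise ValueError("publication is the final stage")
    return Stage(current + 1)


stage = Stage.PROPOSAL
history = [stage]
while stage is not Stage.RECORD_PUBLICATION:
    stage = advance(stage)
    history.append(stage)
print([s.name for s in history])
```

Modelling the stages explicitly makes it easy to record which step a given experiment has reached and to refuse out-of-order transitions.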
Communication
Immense Expectations!
• Web enables:
– access to everything
• Everything on‐line
• Interlinking enables:
– Validation of results
– Repetition of experiment
• Discovery enables:
– new knowledge from old
STFC’s “e‐pubs” Institutional Repository has records of 30,000 publications spanning 25 years
“The web has changed everything...”
Collaboration
Technology integration facilitates scientific collaboration
• Cross facility/beamline
• Cross disciplinary
Technology integration improves facility efficiency
• PaN‐data – Photon and Neutron Data infrastructure
– ICAT also used in Australian Synchrotron and Oak Ridge National Lab
• Research Data Alliance
The PaNdata Collaboration
• Established 2007 with 4 partners
• Expanded since to 13 organisations (see next slide)
• Aims:
– “...to construct and operate a shared data infrastructure for Neutron and Photon laboratories...”
[Timeline, 2007 to 2014: EDNS (4), EDNP (10), PaNdata Europe (11), PaNdata ODI (11)]
PaN‐data brings together 13 major European Research Infrastructures
PaN‐data is coordinated by the STFC Department of Scientific Computing
ISIS is the world’s leading pulsed spallation neutron source
ILL operates the most intense slow neutron source in the world
PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser
HZB operates the BER II research reactor and the BESSY II synchrotron
CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor
ESRF is a third generation synchrotron light source jointly funded by 19 European countries
Diamond is a new 3rd generation synchrotron funded by the UK and the Wellcome Trust
DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser
Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007
ELETTRA operates a 2‐2.4 GeV synchrotron and is building the FERMI Free Electron Laser
ALBA is a new 3 GeV synchrotron facility due to become operational in 2010
PaN‐data Partners
JCNS, Juelich Centre for Neutron Science
MaxLab, Max IV Synchrotron
The Science we do ‐ Structure of materials
Fitting experimental data to model
Bioactive glass for bone growth
Structure of cholesterol in crude oil
Hydrogen storage for zero emission vehicles
Magnetic moments in electronic storage
• Over 30,000 user visitors each year:
– physics, chemistry, biology, medicine,
– energy, environmental, materials, culture
– pharmaceuticals, petrochemicals, microelectronics
Longitudinal strain in aircraft wing
Diffraction pattern from sample
Visit facility on research campus
Place sample in beam
• Over 5,000 high impact publications per year
– But so far no integrated data repositories
– Lacking sustainability & traceability
PaNdata ODI ‐ An Open Data Infrastructure for Photon and Neutron Facilities
PaNdata Today
[Diagram: Facility 1, Facility 2 and Facility 3 each show the chain Raw Data, Data Analysis, Analysed Data, Publication Data, Publications. Labels: "Different Infrastructures, Different User Experiences" and "Single Infrastructure, Single User Experience"; Capacity, Storage; Publications Repositories, Data Repositories, Software Repositories; Raw Data Catalogue, Software Catalogue, Analysed Data Catalogue, Publication Data Catalogue, Publications Catalogue]
PaN‐data Standardisation
PaN-data Europe is undertaking 5 standardisation activities:
1. Development of a common data policy framework
2. Agreement on protocols for shared user information exchange
3. Definition of standards for common scientific data formats
4. Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected
5. Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs
PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories
PaNdata ODI Joint Research Activities
PaNdata ODI Service Activities
PaNdata ODI Service Releases
[Diagram labels: Standards from PaNdata Support Action (users, data, s/w, Integ); uCat, dCat, vLabs, Prov, Pres, Scale; Rel 1 to Rel 4; Jun 2013, Sep 2013, Dec 2013, Mar 2014]
PaNdata Data Lifecycle – where are we now?
[Diagram labels: Research Environment; Data; Creation; Archival; Access; Storage, Compute, Network; Services; Information Infrastructure; MetaData/Catalogues; Portals; User Info feed; DAQ feed; Data Analysis feed; EGI; GEANT; Local resources; e‐Infrastructure; Provenanced Data]
The researcher acts through ingest and access
The researcher shouldn’t have to worry about the information infrastructure
Emerging international organization
Currently supported by: EU, NSF, Australian National Data Service
To accelerate data‐driven innovation through research data sharing and exchange.
Infrastructure, Policy, Practice and Standards
3. The Research Data Alliance
Research Data Alliance: Vision and Purpose
Vision: Researchers around the world sharing and using research data without barriers.
Purpose: … to accelerate international data‐driven innovation and discovery by facilitating research data sharing and exchange, use and re‐use, standards harmonization, and discoverability, … through the development and adoption of infrastructure, policy, practice, standards, and other deliverables.
RDA Principles
Openness
• Membership is open to all interested organizations,
• all meetings are public,
• RDA processes are transparent, and
• all RDA products are freely available to the public;
Consensus
• The RDA moves forward by achieving consensus and
• resolves disagreements through appropriate voting mechanisms;
Balance
• The RDA is organized on the principle of balanced representation for individual organizations and stakeholder communities;
Harmonization
• The RDA works to achieve harmonization across standards, policies, technologies, tools, and other data infrastructure elements;
Voluntary
• The RDA is not a government organization or regulatory body and, instead, is a public body responsive to its members; and
Non‐profit
• RDA is not a commercial organization and will not design, promote, endorse, or sell commercial products, technologies, or services.
“Building Bridges”
• Bridges to the future
– data preservation
• Bridges to research partners
• Bridges across disciplines
• Bridges across regions
• Bridges to integration
– to solve new problems
• Bridges across communities
RDA role
Two bridges we can build:
• Connecting Data
• Connecting People
What kind of organisation do we need to do this?
RDA Bodies
[Organisation chart labels: Individual Membership; Organisational Membership; Organisations; Council (Strategy); Technical Advisory Board (Workplan); Secretary General (Operating Plan); Organisational Advisory Board (Procedures); Task Groups; Secretariat; Members of Staff; Technical Domain; Administrative Domain; Procedural Domain]
Online Open Interaction Fora ‐ use for all kinds of activities, open to all RDA members
Administration and Management Team
‐ Implement strategic direction set by council
‐ Support the activities of the RDA
‐ Arrange plenary meetings
‐ Run the on‐line fora
‐ Manage documents
‐ Convene nominating committees for Council and TAC
‐ Monitor and control finances
‐ Prepare reports for Council, funders, ….
Council
‐ Set strategic direction
‐ Final vote on governance matters
‐ Approve new WGs (TAC advised)
‐ Control balanced WG approach
Technical Advice Committee
‐ Advise on WG work activities
‐ Interact directly with working groups
‐ Advise on new WGs and new BoFs
‐ Give implementation suggestions to strategic direction from council
Working Groups
‐ Carry out work of RDA
‐ Reach consensus on outputs
‐ May suggest BoFs about new topics
‐ Open to all but some commitment expected
Plenary
‐ Open to all persons involved in RDA
‐ Hears and comments on reports from WGs
‐ Suggests new BoFs
‐ Hears candidates for TAC
[Diagram labels: Administrative Domain; Data Practitioners Domain]
Some Risks
• Standardisation is easy, I’ve done it a hundred times (apologies to Mark Twain)
• Two easy ways to standardise:
– The Imperial model
– The Esperanto model
• Justify need, define benefit, target standardisation
• Make small steps and reassess
• “Never generalise from one example”
Supporting Projects
Three projects supporting RDA through its first phase:
• iCordi EC Project
• RDA/US NSF Project
• Support in Australia through ANDS
Steering Group to set it up:
• US – Fran Berman, Beth Plale
• EU – Leif Laaksonen, Peter Wittenburg, Juan Bicarregui
• Australia – Ross Wilkinson, Andrew Treloar
Status in January 2013
• Initial meetings held in Munich and Washington
• ~200 Delegates
• Workshops at eIRG, IDCC, ….
• ~12 Working Groups being established
• Website, Forums, Mailing Lists etc.
• Initial Council and Secretariat forming
Launch and first Plenary planned: March 17‐19, Gothenburg
Please get involved by registering and participating in the discussions:
Website: rd‐alliance.org/
Research Data Alliance
The Innovation Lifecycle
The Body of Knowledge
The Government Process
The Research Process
Aggregation of Knowledge lies at the heart of the innovation lifecycle
Enabling Knowledge Creation
Enabling Wealth Creation
Improved Quality of Life
Improved Understanding
Policy Initiatives
Disciplinary Initiatives
RDA
www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
www.pan‐data.eu
www.rd‐alliance.org