Australian Biosciences Data Capability: ABDC
A national research infrastructure providing
bioinformatics resources to life science researchers in Australia
What are we looking to do here?
● The prospect of an NCRIS investment in an Australian Research Data Cloud
creates a requirement for a statement of bioscience data and software
infrastructure needs
● → Setting out the biosciences framework for an Australian Bioscience Data
Capability (ABDC) has become urgent
● So, we are here proposing an approach that should lead to a
community-driven framework; and discussing the nature of the framework in
the context of local and global developments
Key National Research Infrastructure
DIGITAL DATA AND ERESEARCH
● Tier 1 HPC (e.g. NCI, PAWSEY)
● Australian Research Data Cloud
● AREN (Network)
● AAF (Authentication)
HUMANITIES, ARTS, SOCIAL SCIENCE
● Integrated coordinated platform
● Platforms for indigenous
research
● Platforms for social sciences
CHARACTERISATION
● Microscopy network
● Biomedical imaging network
● Neutron stuff (OPAL)
● Synchrotron
ADVANCED FABRICATION &
MANUFACTURING
● Materials on a micro/nano scale
● Bioengineering, fabrication
● New classes of fabricated
devices
ADVANCED PHYSICS AND ASTRONOMY
● Optical and SKA
● International accelerators
● Precision measurement
● Nuclear (OPAL)
EARTH & ENVIRONMENTAL SYSTEMS
● Environmental prediction
● Earth monitoring, exploration
● Remote sensing
● Agriculture networks
● Marine systems
BIOSECURITY
● Prevention of exotic diseases
● Biosecurity testing facility
network
COMPLEX BIOLOGY
● Omics data translation network
● Plant phenomics
● Networked biobanks
● Bioinformatics and automation
THERAPEUTIC DEVELOPMENT
● Drug discovery, manufacturing
● Bioengineering
● Health translation
● Population, tissue, microbial,
genomics datasets
Key National Research Infrastructure
A strategy is needed to relate those in red
Bioscience should benefit from the ARDC
NCRIS Roadmap pg 32 Table 3: Priority Areas for Digital Data and eResearch Platforms
2. Create Australian Research Data Cloud: Enhance existing capability through the integration of existing
capability – ANDS, NeCTAR and RDS to establish an integrated data-intensive infrastructure system,
incorporating physical infrastructure, policies, data, software, tools and support for researchers.
An Australian Research Data Cloud would build on existing eResearch infrastructure to create a cohesive,
seamless experience for researchers that provides a fully integrated system. It should:
● Broadly align with the European Open Science Cloud and other global initiatives.
● Support research data management from creation and discovery, through description and
provenance, integration and storage, manipulation and analysis, and preservation.
● Provide digital platforms that meet specific research requirements and integrate other data rich
research infrastructure.
● Support the sharing of informatics and software techniques to enable the deployment and wide use by
researchers.
The underpinning Australian eResearch infrastructure should include cloud computing, HPC, networks,
access, authentication and trusted data repositories. Data, collaboration and software services, skills and
knowledge provided by the Australian Research Data Cloud will be an essential part of the new system.
Timeline
● Phase 1, June - Aug 2017: Concept development and project planning
● Phase 2, Sep - Dec 2017: Elaboration of requirements, options and consensus building
● Phase 3, Jan - Sep 2018: Engagement with expected NCRIS planning of its investments
● Phase 4, Oct 2018 - ? 2019: Steps to implement results or engage further as needed
Community consultations: Brisbane 8 Aug, Perth 10-11 Oct, Canberra 30 Oct, Sydney 3 Nov, Melbourne 8 Nov & 17 Nov, ABACBS 14 Nov, Adelaide 20 Nov
Bioplatforms Australia
An NCRIS funded national research infrastructure
Platforms: Genomics, Metabolomics, Proteomics, Bioinformatics
Facilities shown on the network map: University of Queensland; Murdoch University; University of Western Australia; WAIMR; University of NSW; Ramaciotti Centre; KCCG; ANU JCSMR; NCI; University of Melbourne (BioScience, Bio21, VLSCI); Monash University; The Australian Wine Research Institute; University of Adelaide; APAF MQ; AGRF
EMBL Australia Bioinformatics Resource: a bioinformatics infrastructure network (supported by BPA)
HUB
● Hosted by Melbourne Bioinformatics @ UoM (ex VLSCI)
● Nodes in all states
● Expands BPA’s network into bioinformatics
GOVERNANCE
International Scientific Advisory Group
● Paul Flicek, Lead, Vertebrate Genomics & ENSEMBL
● Jaap Heringa, Head, ELIXIR-NL
● Vivien Bonazzi, Senior Advisor Data Science Tech & Innovation
● Jason Williams, Education, Outreach and Training Lead
National Reference Group
● Tony Papenfuss, Head, Computational Biology, WEHI, VIC
● Mark Walker, Director, Aust Infectious Disease Res Centre, UQ, QLD
● Delphine Fleury, Aus Centre for Plant Functional Genomics, SA
● Sean Grimmond, Director, Centre for UoM Cancer Research, VIC
● Rebecca Johnson, Director, Australian Museum Research Institute, NSW
National Reference Group
● Prof Jacquie Batley (Plant Genetics & Breeding, UWA)
● Prof Dave Burt (Director Genomics, UQ)
● Prof Peter Cameron (Academic Director, The Alfred Emergency and Trauma Centre/Monash U)
● Prof Joanne Daly (CSIRO Honorary Fellow)
● Prof Frank Gannon (Director, QIMR Berghofer)
● Prof Rob Henry (Director, QAAFI, UQ)
● Prof Ary Hoffmann (Biosciences, Melbourne U)
● Prof Dean Jerry (Dep Director, JCU Centre for Tropical Fisheries and Aquaculture, JCU)
● Prof Ryan Lister (Head, Epigenetics and Genomics, Harry Perkins Inst/UWA)
● Prof John Mattick (Director, Garvan Institute)
● Prof Kathryn North (Director, MCRI)
● Prof Nicki Packer (Macquarie U & Inst for Glycomics, Griffith U)
● A/Prof Tony Papenfuss (President ABACBS, Computational biology WEHI & Petermac)
● Dr Maurizio Rossetto (NSW Royal Bot Gardens)
● Prof Eric Stone (Director, ANU-CSIRO Centre for Genomics, Metabolomics and Bioinformatics, ANU)
● Dr Jen Taylor (Group leader Bioinformatics, CSIRO)
● Prof Steve Wesselingh (Director, SAHMRI)
● Prof James Whisstock (Monash, EMBL-Australia)
● Prof Marc Wilkins (Director, Ramaciotti Centre for Genomics, UNSW)
● Dr Helen Cleugh (Director, CSIRO Climate Science Centre)
Data infrastructure in the genomic era
NIH on the impact of data and informatics in 2013:
“Research in the life sciences has undergone a dramatic transformation in the past two decades. Colossal changes in biomedical research technologies and methods have shifted the bottleneck in scientific productivity from data production to data management, communication, and interpretation. Given the current and emerging needs of the biomedical research community, the NIH has a number of key opportunities to encourage and better support a research ecosystem that leverages data and tools, and to strengthen the workforce of people doing this research. The need for advances in cultivating this ecosystem is particularly evident considering the current and growing deluge of data originating from next-generation sequencing, molecular profiling, imaging, and quantitative phenotyping efforts.”
*http://www.nature.com/nature/journal/v498/n7453/pdf/498255a.pdf
Data infrastructure in the genomic era
ELIXIR
The ELIXIR Platforms comprise:
● Data: Sustaining Europe’s life-science data infrastructure
● Tools: Services and connectors to drive access and exploitation
● Interoperability: Supporting the discovery, integration and analysis of biological data
● Compute: Storage, compute and authentication/access services
● Training: Professional skills for managing and exploiting data
Four Use Cases service domain-specific research communities:
● Human data: Developing long-term strategies for managing and accessing sensitive human data
● Rare diseases: Supporting the development of new therapies for rare diseases
● Marine metagenomics: Developing a sustainable metagenomics infrastructure to nurture research and innovation in marine science
● Plant science: Developing an infrastructure to facilitate genotype-phenotype analyses for crop and tree species
Although the Commons is a complex ecosystem, there are four main components, which fit together as indicated to the left.
● A computing environment, such as the cloud or HPC (High
Performance Computing) resources, which supports access,
utilization and storage of digital objects.
● Publicly available datasets that adhere to a Commons digital
object compliance model.
● Software services and tools to facilitate access to and use of
data, whether in the Commons or elsewhere.
● A digital object compliance model that describes the
properties of digital objects that enable them to be findable,
accessible, interoperable and reusable (FAIR).
Each of these components will require further development and
harmonization while being developed. A series of Commons pilots has
been initiated to develop and test these components in order to
understand and evaluate how well they will contribute to an
ecosystem that will effectively support and facilitate sharing and reuse
of digital objects.
de.NBI - German Network for Bioinformatics Infrastructure
● Provide, expand and improve a repertoire of specialized bioinformatics tools (over 100)
● Provide access to computing and storage capacities (currently 15000 cores and 5PB storage)
● Provide regular training events (~30 such events in 17/18 from 2 to 5 days in length)
● Maintain and develop specific high-quality data resources
SIB - Swiss Institute of Bioinformatics
SIB leads and coordinates the field of bioinformatics in Switzerland. Its data science experts join forces to advance biological and medical research and enhance health by:
1. Providing the national and international life science community with a state-of-the-art bioinformatics infrastructure, including resources, expertise and services
2. Federating world-class researchers and delivering training in bioinformatics.
IFB - French Institute of Bioinformatics
● Provide core bioinformatics resources to the national and international life science research community in key fields…
● Build an academic cloud devoted to bioinformatics
INB - Spanish National Bioinformatics Institute
● Provide world-class core bioinformatics resources to the national
and international life science research community...
● Provide a platform of integrated core bioinformatics services to
the research community...
● Facilitate coordinated participation of bioinformatics groups in
large-scale national and international projects...
The INB consists of 10 nodes, plus a central node that directs and coordinates the activities. Nodes were selected by an international evaluation of proposals: genomics, proteomics, functional genomics, structural biology, population genomics and genome diversity, biomedical informatics, algorithm development and high performance computing.
The computational effort of the INB is localized in the computational node by a special agreement with the Barcelona Supercomputing Centre.
Coordinating node
● Spanish National Cancer Research Centre (CNIO)
Specialist nodes
● Centre for Genomic Regulation, Bioinformatics and Genomics group (CRG)
● Spanish National Cancer Research Centre, Structural Computational Biology Group (CNIO)
● Príncipe Felipe Research Centre, Functional genomics group (CIPF)
● Institute for Research in Biomedicine, Molecular Recognition & Bioinformatics Group (IRB)
● University of Malaga, Bioinformatics and Information Technologies Laboratory (UMA)
● Spanish National Center of Biotechnology, Biocomputing Unit (CNB)
● Pompeu Fabra University, Genomic Diversity and Population Genomics group (UPF)
● Research Programme on Biomedical Informatics (IMIM-UPF)
● National Genome Analysis Centre (CNAG)
● Barcelona Supercomputing Centre (BSC)
Biosciences Data Capability components
Vision - Purpose and Key Steps
● Set out the compelling impact that the realisation of data intense biosciences can deliver
● Agree critical steps and identify the key infrastructure needed to achieve those visions
Biosciences Data Commons: Community Resources - What and Why
Existing and expected digital bioscience resources (systems, data, tools and skills) that are valuable to Australian bioscience:
● Be clear about supply, development and access processes.
● For off-shore resources, identify what, if any, improved means of access or engagement with them would be possible and beneficial and the mechanisms that could achieve that.
● For on-shore resources, identify what improvements to them and their supporting mechanisms could be made to increase their value to Australian researchers.
● Identify how and in what way ‘on-shoring’ adds value.
Biosciences Data Consortium: Mechanisms - Who and How
Models for a cooperative structure suitable to the vision to:
● Strengthen relationships with key international data and software resources.
● Develop agreements, desirable policies and recommended practices that can support improved data access and data linking.
● Provide guidance to supporters, participants and users of the ABDC such as BPA, the ARDC, Universities, MRIs, CSIRO, bioscience service providers.
● Be a vehicle for any ‘Excelerate like’ and ‘Data Commons like’ programmes of work that may be funded to contribute to the ABDC.
● Lead to an Australian Bioinformatics Facility with a federated model such as de.NBI or SIB or several such structures.
Biosciences Data Cloud Resources (used for Australian research)
● Commercial: AWS, Azure, Google Cloud, ...
● Institutional and State resources
● National: ANDS, NeCTAR, RDS, NCI, Pawsey, Future ARDC
● International: EBI cloud (EMBASSY), Jetstream, ...
Summary of our current understanding
30,000 health/biosciences researchers
18,000 health/biosciences RHD students
48,000 health/biosciences PG coursework students
~200,000 health/biosciences UG students (163,000 + 40,000 = 203,000)
1,000 to 1,500 bioinformatician/computational biologists
ABDC users
Estimated number of Australian biology researchers: 30,000 (and perhaps ~1 million worldwide). In 5 years → 31,500.
● Bioinformaticians: 1,000 (in 5 years → 1,500). Research into and application of techniques, tool development. E.g. research generating a new tool or statistical method; bioinformatics core facilities applying complex analyses.
● Bioinformatics-intensive bioscience researchers: 2,000 (→ 3,000). Research is fully dependent on advanced or novel use of bioinformatics. E.g. genomic cancer research, population genomics/agricultural genomics programs.
● Data-intensive bioscience researchers: 7,000 (→ 12,000). ‘Omics data analysis is a critical contributor to, but not definer of, the research outcomes. E.g. RNAseq analysis to identify upregulated genes in a broader research program.
● Biology-focussed bioscience researchers: 20,000 (→ 15,000). Occasional users of bioinformatics web services, where bioinformatics adds value to research outcomes determined by other means. E.g. BLAST, Ensembl.
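The segment estimates above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the dictionary name, keys and helper function are hypothetical labels, and the figures are simply copied from the user breakdown. It confirms that the four groups add up to the stated totals and shows how each group's share of researchers would shift over five years.

```python
# Figures copied from the user breakdown; the names are illustrative labels only.
# Each entry is (current estimate, five-year projection).
user_segments = {
    "bioinformaticians": (1_000, 1_500),
    "bioinformatics-intensive": (2_000, 3_000),
    "data-intensive": (7_000, 12_000),
    "biology-focussed": (20_000, 15_000),
}

current_total = sum(now for now, _ in user_segments.values())
projected_total = sum(later for _, later in user_segments.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for name, (now, later) in user_segments.items():
    print(
        f"{name}: {share(now, current_total)}% now, "
        f"{share(later, projected_total)}% in 5 years"
    )

print(f"totals: {current_total:,} now, {projected_total:,} in 5 years")
```

On these figures the data-intensive group grows from under a quarter to over a third of researchers while the biology-focussed group shrinks, so most of the projected change is movement between groups rather than growth in the total.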
Resources (with examples)
Columns: DATA (Data Collections, Databases) and TOOLS (Software, Informatics Services)
Originating in Australia, Global Audience
● Data Collections: CMRI ProCan, ASPREE
● Databases: Stemformatics, InnateDB, UnicarbKB
● Software: edgeR, Degust, Prokka, Snippy
● Informatics Services: Stemformatics.org, InnateDB.com
Originating in Australia, Australian Audience
● Data Collections: BPA Framework datasets
● Databases: WallaBase, SpiderToxin DB
● Software: GVL
● Informatics Services: Galaxy-MEL, Galaxy-QLD
Originating elsewhere, Global (incl Australian) Audience
● Data Collections: ICGC, TCGA, SRA, 1000 Genomes, Genbank
● Databases: GTex, dbGaP, TOPMed, ENCODE, KEGG, GO, ChEBI, TAIR, Zfin
● Software: Galaxy
● Informatics Services: Ensembl, BLAST, Galaxy-Main
Initial discussions/consultations led to a ‘data complexity+scale’ layering:
1. An accessible ‘service’ platform to which researchers bring
their data, tools, pipelines and research goals
2. A platform for cross-institutional data integration and
collaboration
3. A platform which can support national scale data investment
4. A mechanism for representing Australia’s interests and
participation in global initiatives
International Science Advisory Group (mid-Sept)
● The process is good (and endorsed)
● The user breakdown is realistic and valuable
● The resource categories are comparable to analogous
attempts at this
● Relating to international initiatives is critical
○ But it’s not clear how to make that work best
○ eg EBI vs ELIXIR...
Next: mapping infrastructure to user groups
This also allows impact to be mapped
Improve the quality and uptake of bioinformatics services
Enable ground breaking advances
Empower Migration
Empower Participation
Expand the scope of knowledge researchable using bioscience techniques in Australia
Reference Group (3 Oct) Comments:
1. An accessible ‘service’ platform to which researchers bring
their data, tools, pipelines and research goals
2. A platform for cross-institutional data integration and
collaboration
3. A platform which can support national scale data investment
4. A mechanism for representing Australia’s interests and
participation in global initiatives
✓ Yes, but make training more explicit
✓ May need a new entity
Government agencies will drive this
✓ ✓ Science Enabler; Investment attractor
Common data and compute resources?
Health/Clinical separate?
Non-omics?
Asia connection?
An improved investment/impact map:
Accessibility domain evolution
Collaboration, standards, global participation and expertise
Efficiency + resource sustainability
What else do we need to think about?
A. Imaging - where it fits
B. What not being in Europe/US means in practice
C. Relationship to Asia
D. What a data integrative omics/biology capability looks like
E. others…..?