29
Presented by Managed by UT-Battelle for the U.S. Department of Energy Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham Biosciences Division Oak Ridge National Laboratory NDIA Biosurveillance Conference August 28, 2012

Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Presented by

Managed by UT-Battelle for the U.S. Department of Energy

Bioinformatic Strategies:

Integrating Data and Knowledge to Improve

Biosurveillance

Bob Cottingham

Biosciences Division Oak Ridge National Laboratory

NDIA Biosurveillance Conference

August 28, 2012

Page 2: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Introduction and Motivation

• Biosurveillance has been primarily based on traditional and manual methods such PCR detection, and intelligence gathering and analysis.

• As science and technology advances there is increasing potential for terrorist engineered threats.

• But also increasing potential for computational means to detect both natural and synthetic threats.

2

Page 3: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

What would it take to develop a

standardized biosurveillance system

• Common, Standardized, Scalable

• Computational framework

• Allows rapid, efficient development of new computational analytic methods in a common, integrated data environment.

• Enable common methods for the evaluation and comparison of analytic methods to drive improved performance.

• Provide the basis for computational work environments for biosurveillance analysts.

3

Page 4: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Some relevant topics to consider

• Data intensive computing for analysis of multiple real-time data streams,

• Genomic, transcriptomic, and other -omics analysis of samples,

• Sensor network integration,

• Spatial analysis and visualization,

• Social network analysis

4

Page 5: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Déjà vu all over again

– Yogi Berra

• Next-next gen sequencers e.g. PacBio = 100GB/hr in 2012?

• Is storing and processing data all that’s needed?

• Is computing becoming the bottleneck in research progress?

• Haven’t we been saying this since early 1990s?

• Is Moore’s Law keeping up?

• What about transcriptomics and omics integration?

0

100

200

300

400

500

600

700

800

900

2000 2001 2002 2003 2004 2005 2006

Year

Data

bases

Probes per array

1M

60GB DNA bases in Genbank

Probes tested 1T

Scaling of Bio Data - Both Volume and

Breadth of Data

5

Page 6: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Problem: Discovery of Microbial Pathways

Important to Production of Cellulosic Ethanol

Discover novel microbes

Microbial conversion

of cellulose

Cellulose Digestion Regulon Model

Model microbes and mutants

under relevant stress conditions

Annotation Pipeline

0

100

200

300

400

500

600

700

800

900

2000 2001 2002 2003 2004 2005 2006

Year

Data

bases

Other Bio Databases

High Throughput Assay Data Sets

2nd Gen Sequencing

Microarrays

Mass Spec

Metabolomic

Exponential Growth

PacBio 100GB/hr

Super-Exponential Growth

Scientific Model

Knowledgebase

Data Model

Wiki concept

Conceptual Modeling

Integration

Involve All Researchers

Transition Clusters > HPC 6

Page 7: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Computational Biology & Bioinformatics: Future of Analytical Integration

Microbial Annotation Pipeline

Sequencing

Super-exponential growth: more data to analyze, more often.

Quality assessment of data, analysis tools and results.

Versioning of analysis tools, data and evidence.

Align with all conserved protein domains >1000

Desulfobacula toluolica (AF271773)

U3SP-30 F1SP-21

U3SP-04 U3SU-27

U3SP-09 Desulfosporosinus orientis (AF271767)

Desulfotomaculum putei (AF273032) Desulfotomaculum nigrificans (AY015499)

Desulfotomaculum ruminis (U58118) Desulfotomaculum aeronauticum (AF273033)

Archaeoglobus profundus (AF071499) Archaeoglobus fulgidus (M95624)

U3SP-13 F1SU-14

F1SP-18 F1SU-12

F1SP-34 U3SU-15

U3SU-24 Desulfobacter vibrioformis (AJ250472)

Desulfobacter latus (U58124) F1SU-29

F1SU-06 Desulfonema limicola (U58128)

Desulfococcus multivorans (U58126) Desulfobotulus sapovorans (U58120) Desulfovibrio burkinensis (AB061536)

Desulfovibrio fructosovorans (AB061538) Desulfovibrio vulgaris (U16723)

Desulfovibrio longus (AB061540) Desulfovibrio desulfuricans (AF273034) Desulfovibrio intestinalis (AB061539)

Desulfovibrio simplex (AB061541) Desulfomonile tiedjei (AF334595)

Desulfotomaculum thermobenzoicum (AJ310432) Desulfotomaculum thermoacetoxidans (AF271770) Desulfotomaculum luciae (AY015494) Desulfotomaculum kuznetsovii (AF273031)

Desulfotomaculum thermocisternum (AF074396) Desulfotomaculum geothermicum (AF273029)

Desulfotomaculum thermosapovorans (AF271769) Desulfotomaculum acetoxidans (AF271768)

F1SU-19 F1SP-02

F1SP-04 F1SU-02

F1SP-16 F1SP-03

F1SP-26 F1SU-33

F1SU-32 F1SP-23

Desulfobulbus elongatus (AJ310430) Desulfobulbus propionicus (AF218452)

Desulfobulbus rhabdoformis (AJ250473) F1SU-18

Desulfovirga adipica (AF334591) Thermodesulforhabdus norvegica (AF334597) U3SP-34

U3SP-17 U3SU-08

U3SP-20 Desulfoarculus baarsii (AF334600)

F1SU-17 F1SU-26

U3SP-19 U3SP-11

F1SP-13 F1SP-32

U3SU-03 U3SU-13

Thermodesulfovibrio yellowstonii (AF334599) Thermodesulfovibrio islandicus (AF334599)

0.05 changes

100

53

51 100

100

74

100 100

100

100

82

69

82 76

80

100

100

78

98

77

63

80 100

68

71 96

100

80

100

100

60

100

77 86

87

86

54

60

80

96 100

56

64

99

62

100

83

59 70

100

100 100

100

100

Cluster DSR-4

Cluster DSR-9

Cluster DSR-8

Cluster DSR-2

Cluster DSR-7

Cluster DSR-3

Cluster DSR-6

Cluster DSR-5

Cluster DSR-1

58

Increased data and integration increases computation and storage

Predict protein and function

Correlate unknowns

Extend so researcher can see evidence, quality and history.

Establish open standards for assessing quality and performance of analytical methods.

Lead open source software policy.

Page 8: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

• 2008 Workshop Report – Historically projects developed in

isolation resulting in isolated data and methods.

– Vision: Integrating, community informatics resource enabling a broader and more powerful systems biology research effort.

• Objective: Develop an implementation path toward the vision of the DOE Systems Biology Knowledgebase.

Knowledgebase R&D Project

Background and Objective

8

Page 9: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Experimental Design to Validate

Prediction

Page 10: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

A Model of the Earth – 15th

century

Page 11: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

A Current Model of the Earth

Page 12: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Image with Annotation

Page 13: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

User Interface on

Computational Model

Page 14: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Research Function with User Interface

Page 15: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Experimental Design to Validate

Prediction

Page 16: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Could something like that exist for

biosurveillance?

• The ability to collect, communicate and analyze biological-related data is rapidly changing.

• Diverse sources of data can be used in predicting and mitigating an outbreak.

16

Synthetic Biology

DNA

Sequence

• When a bioterrorism event unfolds, we must rapidly collect, analyze, and filter diverse information to enable the best response and decision-making.

• Information could come from search engines like Google, from scientific labs, field detectors, handheld devices, social networks, blogs, over the counter sales, traditional media or virtually any person or entity that shares information on the Web.

Page 17: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Data Intensive Science – The 4th

Paradigm

• Experimental, Theoretical, Computer Simulation, … next Data Intensive.

• Data discipline based on databases, schemas, ontologies – scientific community generally lacks understanding of these topics.

• Data intensive science requires specialized skills and analysis tools.

• Each piece of data needs have its associated ontological and semantic information.

• Search, analysis and reuse is supported by standard vocabularies.

• IT industry building huge “cloud services” with high bandwidth, low cost storage and computing. No prominent bio examples yet.

• Future scientific progress in biology depends on how well the community acquires the necessary expertise in database, workflow, visualization and cloud computing techniques.

17

Page 18: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Scenario Driven Data Modeling SDDM:

1

ID=SCEN:SEQ_1003_3 SEQ=“gaaaatcgcc…”

ID=SCEN:SEQ_1003_2 SEQ=“tctcgacatg…”

3

ID=SCEN:SEQ_1003_1 SEQ=“cttggtcgtc…”

2

4

hasSequence

hasSequence

hasSequence

5

ID=INTERPRO:IPR000871

6

7

ID= INTERPRO: IPR001279

ID= INTERPRO:IPR001036

matches

matches

matches

8

ID=GO:0016787

annotatedBy

annotatedBy

annotatedBy

9

Beta-lactamase

Owl:sameAs

hasOutbreak

10

ID=SCEN:EVENT 1003_1_2 DATE=8/14/2010 DESC= At least 2 Canadians have become infected with a dangerous new "superbug" from India that is spreading around the world, partly due to medical tourism. COOR=-102.041016,55.727112,0.000000

11

hasEvent

12 ID=SCEN:EVENT 1003_1_1 DATE=8/11/2010 DESC=A new "superbug" that is resistant to even the most powerful antimicrobial agents has entered UK hospitals COOR=-2.065430,53.330872,0.000000

13

ID=SCEN:OUTBREAK_1003_2 START DATE=8/13/2010 END DATE=8/14/2010

14

ID=SCEN:EVENT 1003_2_1 DATE=8/13/2010 A Belgian man died from a drug-resistant so-called superbug originating in South Asia, the 1st reported death from the new health threat. COOR= 4.790039,50.359482,0.000000

hasOutbreak

hasEvent

16

ID=PROMED:20100815.2812

hasEvent hasPROMEDRef

hasPROMEDRef 15

ID=PROMED:20100817.2853

hasPROMEDRef

ID=OUTBREAK_1003_1 START DATE=8/11/2010 END DATE=8/14/2010

ID=SCEN:1003

Page 19: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

• 2008 Workshop Report – Historically projects developed in

isolation resulting in isolated data and methods.

– Vision: Integrating, community informatics resource enabling a broader and more powerful systems biology research effort.

• Objective: Develop an implementation path toward the vision of the DOE Systems Biology Knowledgebase.

Knowledgebase R&D Project

Background and Objective

19

Page 20: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

The Problem

• Biological research projects are becoming larger and more complex, both experimentally and computationally.

• To succeed there is an increasing need to cooperate and standardize.

• Technological advances continue to produce exponentially more and diverse types of data.

• A new kind of computational infrastructure is needed for the overall scientific effort to be productive and successful.

20

Page 21: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Kbase Mission

• To provide a large-scale, open community computational capability for systems biology research data management and analysis.

• Promote openness and sharing of data, code and computational infrastructure.

• Address problems of large scale data management and processing.

• Utilize computational techniques required for community effort at data integration, open development, large scale resource sharing and to meet other research community policies and objectives.

21

Page 22: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

How Would Kbase Be Different?

• How would Kbase be different? – Integrate across projects and laboratories

– Implies a community research effort

– More standardized approach

– More mature software engineering approach

– More involvement of non-informatics researchers

• Reference models – technical: – Open source development, e.g. Linux

– iPhone Apps, Google Apps, Facebook Apps

– Google Maps cloud computing with smart phone app

– Wikipedia shared community reference resource

22

Page 23: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Kbase Guiding Principles

• Science drives Kbase development.

• Community effort integrating data and methods across multiple laboratories to improve research productivity.

• Open access – data and methods are available for anyone to use with perhaps a limited embargo policy.

• Open contribution – data and source code managed in an open environment and can be contributed by anyone with an editorial / peer review process.

• Distributed data and methods, Kbase is not a single, centralized, monolithic system.

23

Page 24: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

DOE Systems Biology

Knowledgebase

Tools

Meta-

data Data

DOE Systems Biology Knowledgebase

Establishing A Systems Biology Modeling Framework

Open-Access Data and Information Exchange

• Flexible user interfaces • Easy data retrieval • Environment for in silico experimentation

Open Development of Open-Source Software and Tools

• Analysis and visualization • In silico experimentation • Tracking and evaluation of tool use

Community-Wide Stewardship

• User, Standards, and Advisory committees

• Value-added analysis • Training, tutorials, and support

Data generators

Data

users

Software and tool

developers

Seamless Submission and Incorporation of Diverse Data

• Standards for data, metadata • Quality control and assurance • Automated data handling

24

Page 25: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

What is a Knowledgebase?

Databases enable the rapid organization and search of data

Knowledgebases should learn a “model” of the data to provide

“conclusions” (hypotheses)

Page 26: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Knowledge = Models

• The Knowledgebase should learn models from data and

human interaction.

• Models and their parts should carry information about

the quality of their data, annotations, and predictions.

• Data, protocols, algorithms and models should be

subject to both calculated and community quality

assessment.

Page 27: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

The KBase Infrastructure and Services

Bulk Data Services Persistent Store Central Data Store

Page 28: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Concept: KBase User Experience

Page 29: Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic Strategies: Integrating Data and Knowledge to Improve Biosurveillance Bob Cottingham

Managed by UT-Battelle for the U.S. Department of Energy

Concept: Interactive Community

Knowledge

DOE Office of Science • Office of Biological and Environmental Research

Narra veIdea

ScenarioA

A.1

A.2

ScenarioB

A.1

A.2

Narra veIdea

ScenarioA

A.1

A.2

ScenarioB

A.1

A.2

Narra veIdea

ScenarioA

A.1

A.2

ScenarioB

A.1

A.2

Narra veCodeVersioningandScenarioBranching

HH

H HE

! "##"$%&'() &"'

*+&, "#-. '/ ''

/ 01'

/ 02'

*+&, "#-. '3'

/ 01'

/ 02'

Narra veQueryUpdate

! "##"$%&'() &"'

*+&, "#-. '/ ''

/ 01'

/ 02'

*+&, "#-. '3'

/ 01'

/ 02'

Narra veDataChange

! "##"$%&'() &"'

*+&, "#-. '/ ''

/ 01'

/ 02'

*+&, "#-. '3'

/ 01'

/ 02'

ProjectLinkageandCita on

! "##"$%&'() &"'

*+&, "#-. '/ ''

/ 01'

/ 02'

*+&, "#-. '3'

/ 01'

/ 02'

ENarra veGraphsProvideMeasuresofAc vityandInfluence

Func onandDatacontentscanbetrackedtoassess“use”