39
Human Genome and Big Data Challenges Philip E. Bourne Ph.D. Associate Director for Data Science National Institutes of Health http://www.slideshare.net/pebourne

Human Genome and Big Data Challenges

Embed Size (px)

DESCRIPTION

Presented to the Big Data in Symptoms Research Boot Camp, National Institute of Nursing Research, July 24, 2014

Citation preview

Page 1: Human Genome and Big Data Challenges

Human Genome and Big Data ChallengesPhilip E. Bourne Ph.D.

Associate Director for Data ScienceNational Institutes of Health

http://www.slideshare.net/pebourne

Page 2: Human Genome and Big Data Challenges

The History of Analytics in Biomedical Research

1980s 1990s 2000s 2010s 2020

Discipline:

Unknown Expt. Driven Emergent Over-sold A Service A Partner A Driver

The Raw Material:

Non-existent Limited /Poor More/Ontologies Big Data/Siloed Open/Integrated

The People:

No name Technicians Industry recognition data scientists Academics

Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol

Page 3: Human Genome and Big Data Challenges

Data Science Timeline

6/12

• U54 Centers of Excellence - under review• U54 BD2K-LINCS– under review• U24 Data Discovery Index– under review• R01, R41, R42, R43, R44, U01 software and

analysis methods grants – on-going• T32, T15, K01, R25 and R26 training awards

– under review

2/14 3/14

Page 4: Human Genome and Big Data Challenges

“It was the best of times, it was the worst of times, it was the age of

wisdom, it was the age of foolishness, it was the epoch of belief, it was the

epoch of incredulity, it was the season of Light, it was the season of

Darkness, it was the spring of hope, it was the winter of despair …”

Page 5: Human Genome and Big Data Challenges

A Tale of Two Numbers

Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830

Page 6: Human Genome and Big Data Challenges

Myriad Data Types

Other ‘Omic

Imaging Phenotypic

Clinical

Genomic

Exposure

Page 7: Human Genome and Big Data Challenges

Growth May Just be Beginning

Evidence:– Google car

– 3D printers

– Waze

– Robotics

From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee

Page 8: Human Genome and Big Data Challenges

There are other drivers of change out there besides economics and an increasing emphasis on data and

analytics

Page 9: Human Genome and Big Data Challenges

Politicians Demand It:G8 open data charter

9http://opensource.com/government/13/7/open-data-charter-g8

Page 10: Human Genome and Big Data Challenges

[from Harlem Krumholtz]

Page 11: Human Genome and Big Data Challenges

The Story of Meredith

http://fora.tv/2012/04/20/Congress_Unplugged_Phil_Bourne

Stephen Friend

Page 12: Human Genome and Big Data Challenges

We have Entered An Era of Deinstitutionalize & Democratization

of Science

Daniel Hulshizer/Associated Press

Page 13: Human Genome and Big Data Challenges

We have Entered An Era of Deinstitutionalize & Democratization

of Science – NIH Should Support This

Daniel Hulshizer/Associated Press

Page 14: Human Genome and Big Data Challenges

47/53 “landmark” publications could not be replicated

[Begley, Ellis Nature, 483, 2012] [Carole Goble]

Page 15: Human Genome and Big Data Challenges

Scholarship is broken

I have a paper with 16,000 citations that no one has ever read

I have papers in PLOS ONE that have more citations than ones in PNAS

I have data sets I am proud of few places to put them

I edited a journal but it did not count for much

Page 16: Human Genome and Big Data Challenges

It was the age when software developers are in the greatest demand

for science..

It was the age when the rewards outside academia are greater than the

rewards inside

Page 17: Human Genome and Big Data Challenges

It was a time when patient data are becoming more available

It is a time when the ability to maintain the anonymity of a patient gets harder

and harder

Page 18: Human Genome and Big Data Challenges

To Summarize Thus Far …

A time of great (unprecedented?) scientific development but limited

funding

A time of upheaval in the way we do science

Page 19: Human Genome and Big Data Challenges

From a funders perspective…

A time to squeeze every cent/penny to maximize the amount of research that

can be done

A time when top down approaches meet bottom up approaches

Page 20: Human Genome and Big Data Challenges

Top Down vs Bottom Up

Top Down– Regulations e.g. US:

Common Rule, FISMA, HIPPA

– Data sharing policies

• OSTP

• GWAS

• Genome data

• Clinical trials

– Digital enablement

– Moves towards reproducibility

Bottom Up– Communities emerge

and crowd source

• Collaboration

• Data shared

• Open source software

• Common principles

• Standards

Page 21: Human Genome and Big Data Challenges

Okay so what are we doing about it?

Page 22: Human Genome and Big Data Challenges

To start with we are thinking about the complete research lifecycle

Page 23: Human Genome and Big Data Challenges

The Research Life Cycle

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

Page 24: Human Genome and Big Data Challenges

Tools and Resources Will Continue To Be Developed

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

AuthoringTools

Lab Notebooks

DataCapture

Software

Analysis Tools

Visualization

ScholarlyCommunication

Page 25: Human Genome and Big Data Challenges

Those Elements of the Research Life Cycle Need to Become More Interconnected Around a Common

Framework

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

AuthoringTools

Lab Notebooks

DataCapture

Software

Analysis Tools

Visualization

ScholarlyCommunication

Page 26: Human Genome and Big Data Challenges

Those Elements of the Research Life Cycle Need to Become More Interconnected Around a

Common Framework

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

AuthoringTools

Lab Notebooks

DataCapture

Software

Analysis Tools

Visualization

ScholarlyCommunication

Commercial &Public Tools

Git-likeResources

By Discipline

Data JournalsDiscipline-

Based MetadataStandards

Community Portals

Institutional Repositories

New Reward Systems

Commercial Repositories

Training

Page 27: Human Genome and Big Data Challenges

Those Elements of the Research Life Cycle Need to Become More Interconnected Around a Common

Framework

IDEAS – HYPOTHESES – EXPERIMENTS – DATA - ANALYSIS - COMPREHENSION - DISSEMINATION

AuthoringTools

Lab Notebooks

DataCapture

Software

Analysis Tools

Visualization

ScholarlyCommunication

Commercial &Public Tools

Git-likeResources

By Discipline

Data JournalsDiscipline-

Based MetadataStandards

Community Portals

Institutional Repositories

New Reward Systems

Commercial Repositories

Training

Page 28: Human Genome and Big Data Challenges

Associate Director for Data Science

CommonsTrainingCenter

BD2KModifiedReview

Sustainability* Education* Innovation* Process

• Cloud – Data & Compute

• Search• Security • Reproducibility

Standards• App Store

• Coordinate• Hands-on• Syllabus• MOOCs

• Community• Centers• Training Grants• Catalogs• Standards• Analysis

• Data Resource Support

• Metrics• Best

Practices• Evaluation• Portfolio

Analysis

The Biomedical Research Digital Enterprise

Communication

Collaboration

Programmatic Theme

Deliverable

Example Features • IC’s• Researchers• Federal

Agencies• International

Partners• Computer

Scientists

Scientific Data Council External Advisory Board

* Hires made

Page 29: Human Genome and Big Data Challenges

 I. Facilitating Broad Use of Biomedical

Big Data II. Developing and Disseminating

Analysis Methods and Software for Biomedical Big Data

 III. Enhancing Training for Biomedical

Big Data IV. Establishing Centers of Excellence

for Biomedical Big Data

BD2K: Four Programmatic Areas

Page 30: Human Genome and Big Data Challenges

What are we proposing as that common framework?

Page 31: Human Genome and Big Data Challenges

The CommonsIs …

A public/private partnership

An agile development starting with the evaluation of a few pilots

An example: porting DbGAP to the cloud

An experiment with new funding strategies

Page 32: Human Genome and Big Data Challenges

What The Commons Is and Is Not

Is Not:– A database

– Confined to one physical location

– A new large infrastructure

– Owned by any one group

Is:– A conceptual framework

– Analogous to the Internet

– A collaboratory

– A few shared rules

• All research objects have unique identifiers

• All research objects have limited provenance

Page 33: Human Genome and Big Data Challenges

Sustainability and Sharing: The Commons

Data

The Long Tail

Core Facilities/HS Centers

Clinical /Patient

The Why:Data Sharing Plans

TheCommons

Government

The How:

DataDiscoveryIndex

SustainableStorage

Quality

Scientific Discovery

Usability

Security/Privacy

Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment

The End Game:

KnowledgeNIHAwardees

PrivateSector

Metrics/Standards

Rest ofAcademia

Software StandardsIndex

BD2KCenters

Cloud, Research Objects,Business Models

Page 34: Human Genome and Big Data Challenges

What Does the Commons Enable?

Dropbox like storage

The opportunity to apply quality metrics

Bring compute to the data

A place to collaborate

A place to discover

http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png

Page 35: Human Genome and Big Data Challenges

[Adapted from George Komatsoulis]

One Possible Commons Business Model

HPC, Institution …

Page 36: Human Genome and Big Data Challenges

Commons Pilots

Define a set of use cases emphasizing:– Openness of the system

– Support for basic statistical analysis

– Embedding of existing applications

– API support into existing resources

Evaluate against the use cases

Review results & business model with NIH leadership

Design a pilot phase with various groups

Conduct pilot for 6-12 months

Evaluate outcomes and determine whether a wider deployment makes sense

Report to NIH leadership summer 2015

Page 37: Human Genome and Big Data Challenges

One Possible End Product

1. User clicks on thumbnail

2. Metadata and a webservices call provide a renderable image that can be annotated

3. Selecting a features provides a database/literature mashup

4. That leads to new papers

1. A link brings up figures from the paper

0. Full text of PLoS papers stored in a database

2. Clicking the paper figure retrievesdata from the PDB which is

analyzed

3. A composite view ofjournal and database

content results

4. The composite view haslinks to pertinent blocks

of literature text and back to the PDB

1.

2.

3.

4.

PLoS Comp. Biol. 2005 1(3) e34

Page 38: Human Genome and Big Data Challenges

Mission Statement

To foster an ecosystem that enables biomedical research to be conducted as a digital enterprise that enhances

health, lengthens life and reduces illness and disability

Page 39: Human Genome and Big Data Challenges

Some Acknowledgements

Eric Green & Mark Guyer (NHGRI)

Jennie Larkin (NHLBI)

Leigh Finnegan (NHGRI)

Vivien Bonazzi (NHGRI)

Michelle Dunn (NCI)

Mike Huerta (NLM)

David Lipman (NLM)

Jim Ostell (NLM)

Andrea Norris (CIT)

Peter Lyster (NIGMS)

All the over 100 folks on the BD2K team