Section 3: Commons: Lessons Learned, current state
The Big Data to Knowledge (BD2K) Guide to the Fundamentals of Data Science

Vivien Bonazzi
Senior Advisor for Data Science & the Data Commons
National Institutes of Health, Bethesda

February 3, 2017

Vivien Bonazzi

Leads the Data Commons efforts within the NIH.
Serves on the NIH Big Data to Knowledge (BD2K) executive committee.
Dr. Bonazzi received a B.Sc. in Medical Laboratory Science from the University of Canberra, Australia, an M.Sc. (prelim) in Pharmacology from the University of Melbourne, Australia, and a Ph.D. in Molecular Pharmacology and Computational Biology, also from the University of Melbourne.
Served as a Program Director for the computational biology and bioinformatics program at the National Human Genome Research Institute (NHGRI).

Was part of the Human Microbiome Project (HMP), a trans-NIH Common Fund initiative, where she was responsible for the bioinformatics and computational aspects of the project and managed several of the computational tools awards. She has held positions as R&D Director for Bioinformatics at Invitrogen and Director of Gene Discovery at Celera Genomics, where she was part of the team that sequenced and annotated the human, mouse, and Drosophila genomes.

Let's Talk About Biomedical Big Data

What Makes Big Data Big?

VOLUME, VELOCITY, VARIETY, VERACITY

It's a signal of the coming Digital Economy
DATA has VALUE
DATA is CENTRAL to the Digital Economy
But it's more than this...

An economy characterized by using data to gain a business advantage

(yes, institutions are a business)

Organizations that are not born digital will be at a disadvantage in the new economy

Organizations will be defined by their digital assets

Scientific digital assets: data, software, workflows, documentation, journal articles

The most successful organizations of the future will be those that can leverage their digital assets and transform them into a digital enterprise

Make data

The currency of an organization

Usable in a digital ecosystem: the Data Commons

The problem with biomedical data

Digital assets include data

Challenges with Biomedical Data

The journal article is the end goal
Data is a means to an end (low value)
Data is not FAIR: Findable, Accessible, Interoperable, Reusable
Limited e-infrastructures to support FAIR data

The Problem With Biomedical DATA

https://www.youtube.com/watch?v=N2zK3sAtr-4

What's Changing?

FAIR principles drive data to become the currency

Policies that promote data sharing via FAIR help change the culture

Currencies don't exist in a vacuum

Buy and sell Goods

We also need a digital ecosystem that allows transactions to occur on FAIR data at scale

The Data Commons is a platform that fosters the development of a digital ecosystem

The Data Commons: a platform that fosters the development of a digital ecosystem

Treats the products of research (data, software, methods, papers, etc.) as digital assets (objects)

Digital objects need to conform to FAIR principles

Digital objects exist in a shared virtual space where digital assets can be found, deposited, managed, shared, and reused

Enables interactions between Producers and Consumers of digital assets

Gives currency to digital assets and the people who develop and support them

The Data Commons is a platform? that fosters the development of a digital ecosystem

A nascent platform

A platform is a plug and play model that allows multiple participants (producers and consumers) to connect to it, interact with each other and create value

Sangeet Paul Choudary, Platform Scale

A lot of what we see today uses a platform approach

Sangeet Paul Choudary, Platform Scale

Platforms that utilize data as a central currency enable transactions between producers and consumers

The goal of a Data Commons Platform is to enable interactions between producers and consumers
Sangeet Paul Choudary, Platform Scale

Producers of digital objects (data, tools, workflows) used by consumers
The Platform enables these transactions
Accommodates bioinformatics and non-bioinformatics users

To understand the Data Commons Platform (and how it works for biomedical data) we need to use a platform stack to help visualize the concept

A framework helps visualize the concept of the platform

Sangeet Paul Choudary, Platform Scale

Platforms have 3 layers

NIH Data Commons - Platform Stack
https://datascience.nih.gov/commons

Technology
Data
Network/marketplace

NIH Data Commons: Digital Asset Compliance
Making things FAIR

Initial Phase
Unique digital object identifiers, resolvable to the original authoritative source
Machine readable
A minimal set of searchable metadata
Clear access rules (especially important for human subjects data)
An entry (with metadata) in one or more indices

Future Phases
Standard, community-based unique digital object identifiers
Conform to community-approved standard metadata and ontologies for enhanced searching
Digital objects accessible via open standard APIs
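To make the initial-phase requirements concrete, here is a minimal sketch of how a record describing a digital object might be checked against them. Everything in it is an assumption for illustration: the `DigitalObject` record, the `initial_phase_gaps` function, the choice of required metadata fields, and the example identifier and URL are hypothetical and do not come from the NIH Data Commons. The URL check only tests that the address is well formed; it does not contact the source.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical minimal metadata set; the slides do not say which fields are required.
REQUIRED_METADATA = ("title", "creator", "description")
ACCESS_LEVELS = ("open", "controlled")


@dataclass
class DigitalObject:
    identifier: str                               # unique ID for the object
    source_url: str                               # authoritative source the ID points to
    metadata: dict = field(default_factory=dict)  # searchable metadata
    access: str = ""                              # "open" or "controlled"
    indices: list = field(default_factory=list)   # indices that list the object


def initial_phase_gaps(obj: DigitalObject) -> list:
    """Return the initial-phase requirements this object does not yet meet."""
    gaps = []
    if not obj.identifier.strip():
        gaps.append("missing unique identifier")
    url = urlparse(obj.source_url)
    if url.scheme not in ("http", "https") or not url.netloc:
        gaps.append("source URL is not a well-formed http(s) address")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("missing metadata: " + ", ".join(missing))
    if obj.access not in ACCESS_LEVELS:
        gaps.append("no clear access rule")
    if not obj.indices:
        gaps.append("not entered in any index")
    return gaps


# Illustrative record with made-up values; it lacks a description.
dataset = DigitalObject(
    identifier="example:0001",
    source_url="https://example.org/objects/0001",
    metadata={"title": "Example dataset", "creator": "Example Lab"},
    access="controlled",
    indices=["example-index"],
)
print(initial_phase_gaps(dataset))
```

Returning a list of gaps, rather than a single pass/fail flag, is one way to show a producer what still has to be fixed before an object counts as compliant.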

Data Commons Platform drives digital ecosystem

The NIH Data Commons Pilot

Co-location of large and/or highly utilized NIH-funded data with storage and computing infrastructure, plus commonly used tools for analyzing and sharing digital objects, to create an interoperable resource for the research community.

Investigators will be able to collaborate and share digital objects within this environment and connect with others

Other Data Commons

An NIH Wide Data Commons Pilot - Example

Indexing

Authorization/authentication layer

Digital Ecosystems

Considerations

Metrics: understanding and accounting of data usage patterns
Cost: cloud storage; pay-for-use cloud compute (NIH credits pilot); indirect costs for cloud
Hybrid clouds: institution (private) and commercial (public) clouds
Managing open vs. controlled access data
Auth: single sign-on, dreams/nightmares?
Archive vs. working copies of data, and versioning
Interoperability with other Commons (clouds)
Standards: metadata, UIDs, APIs
Discoverability: finding digital objects across clouds
Interfaces: for users with different needs and capabilities
Consent: re-consenting data
Policies: data sharing policies that are useful and effective; keep pace with use of technology (e.g. dbGaP data in the cloud)
Incentives: access to, and shareability of, FAIR data as part of NIH grant review criteria
Governance: community involvement in governance models
Sustainability: long-term support

Acknowledgments

ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS), Ron Margolis
NCBI: George Komatsoulis
NHGRI: Valentina di Francesco, Ajay Pillai
NIGMS: Susan Gregurick
CIT: Andrea Norris, Debbie Sinmao
NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
NCI: Ian Fore, Sean Davis, Warren Kibbe, Tony Kerlavage, Tanja Davidsen
NIAID: Maria Giovanni, Alison Yao, Eric Choi, Claire Schulkey
NHLBI: Weiniu Gan, Alastair Thomson
NIH Clinical Centre: Elaine Ayres (BITRIS)
NIBIB: Vinay Pai (DK)
OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke
Research and Industry: Mathew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)

Stay in Touch

Slideshare
Blog (Coming soon!)
Vivien Bonazzi