BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST MOVING, HETEROGENEOUS DATASETS Adina Howe Argonne National Laboratory Michigan State University EPA Air Sensors 2013: Data Quality and Applications March 19, 2013

EPA 2013 Air Sensors Meeting Big Data Talk



https://sites.google.com/site/airsensors2013/final-materials


Page 1: EPA 2013 Air Sensors Meeting Big Data Talk

BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST MOVING, HETEROGENEOUS DATASETS

Adina Howe

Argonne National Laboratory

Michigan State University

EPA Air Sensors 2013: Data Quality and Applications

March 19, 2013

Page 2: EPA 2013 Air Sensors Meeting Big Data Talk

Introduction – My perspective

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Engineering

Microbial Ecology

Bioinformatics

Page 3: EPA 2013 Air Sensors Meeting Big Data Talk

THE DATA DELUGE

An exponential landscape

Page 4: EPA 2013 Air Sensors Meeting Big Data Talk

Next-generation sequencing growth outpacing computational resources

Stein, Genome Biology, 2010

Log scale!

Page 5: EPA 2013 Air Sensors Meeting Big Data Talk

Next-generation sequencing growth outpacing computational resources

Stein, Genome Biology, 2010
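The "outpacing" claim can be made concrete with a little arithmetic. The sketch below assumes both data output and compute capacity grow exponentially with fixed doubling times. The function name `gap_factor` and the doubling times used are hypothetical illustrations, not figures taken from Stein (2010).

```python
def gap_factor(years, data_doubling_years, compute_doubling_years):
    """Factor by which data growth outruns compute growth after `years`.

    Both quantities are assumed to grow as 2 ** (years / doubling_time).
    """
    data_growth = 2 ** (years / data_doubling_years)
    compute_growth = 2 ** (years / compute_doubling_years)
    return data_growth / compute_growth


# Illustrative doubling times only; not values from the cited figure.
print(gap_factor(years=4, data_doubling_years=1, compute_doubling_years=2))
```

On a log-scale plot two such curves look like straight lines with different slopes, so the gap between them widens by a constant factor each year.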

Page 6: EPA 2013 Air Sensors Meeting Big Data Talk

Effects of low cost sequencing…

• 1995: First free-living bacterium sequenced for billions of dollars and years of analysis

• A personal genome can be mapped in a few days for hundreds to a few thousand dollars

Page 7: EPA 2013 Air Sensors Meeting Big Data Talk

Effects of low cost sequencing on research

Sboner et al., Genome Biology, 2011

Page 8: EPA 2013 Air Sensors Meeting Big Data Talk

Effects of low cost sequencing on research

Sboner et al., Genome Biology, 2011

Page 9: EPA 2013 Air Sensors Meeting Big Data Talk

Effects of low cost sequencing on research

Sboner et al., Genome Biology, 2011

Page 10: EPA 2013 Air Sensors Meeting Big Data Talk

RETHINKING

What it takes to deliver

Technology

Core competency

Value added

Page 11: EPA 2013 Air Sensors Meeting Big Data Talk

Technical obstacles in the big data deluge

• Access to the data and its value

• Access to the resources

Democratization of both data and resource access

“80% of awards and 50% of $$ are for grants < $350,000”

Root causes:

• Data volume and velocity “clog”

• Data is very heterogeneous

• Previous efforts are difficult to integrate

• Innovation is necessary but hard

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Page 12: EPA 2013 Air Sensors Meeting Big Data Talk

Social obstacles are the most difficult.

• A shift in costs does not mean a shift in expectations

• “Give me the answer so I can get back to work.”

• A culture of sharing (data, time, and tools)

• Evolution of necessary training

• Creating teams that can communicate across domains

• Incentives are not strong enough

• Patterns for success (useful data sharing and collaboration) are not apparent or well understood.

Page 13: EPA 2013 Air Sensors Meeting Big Data Talk

POSSIBLE SOLUTIONS

Page 14: EPA 2013 Air Sensors Meeting Big Data Talk

Common solutions: been there, done that

http://xkcd.com/927/

Page 15: EPA 2013 Air Sensors Meeting Big Data Talk

What would an ideal solution look like?

• Flexible access to data, tools, and resources

• Cost effective, consistent, reusable (scalable)

• Rapid exploration

• Incentives to participate, share, communicate

• Community sandbox (vs lab-specific)

• Painless

Platform which supports an “ecology” of databases, interfaces, and analysis software.

Page 16: EPA 2013 Air Sensors Meeting Big Data Talk

The success of organization: Amazon

• > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.

• Continually changing/updating data sets.

• Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.

• For example, the Amazon.com website is itself built from over 150 independent services…

• Amazon routinely deploys new services and functionality.

http://highscalability.com/amazon-architecture

https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

Page 17: EPA 2013 Air Sensors Meeting Big Data Talk

Amazon development guideline:

Colloquially said, “You should eat your own dogfood.”

Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone.

To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
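A minimal sketch of what "eating your own dogfood" could look like in code, assuming a toy in-memory data service. `SampleStore`, `mean_reading`, and the sample identifiers are hypothetical names invented for illustration; they are not part of any Amazon or research system. The point is only that the in-house analysis uses the same public methods an outside user would.

```python
from dataclasses import dataclass, field


@dataclass
class SampleStore:
    """Hypothetical in-memory data service exposing one public interface."""
    _records: dict = field(default_factory=dict)

    def put(self, sample_id: str, reading: float) -> None:
        """Public API: store one reading under a sample identifier."""
        self._records[sample_id] = reading

    def get(self, sample_id: str) -> float:
        """Public API: fetch a reading; raises KeyError if it is missing."""
        return self._records[sample_id]

    def ids(self) -> list:
        """Public API: list stored sample identifiers in sorted order."""
        return sorted(self._records)


def mean_reading(store: SampleStore) -> float:
    """In-house analysis written against the public API only.

    It never touches store._records, so an outside collaborator could
    write the same function with the same access.
    """
    ids = store.ids()
    if not ids:
        raise ValueError("store is empty")
    return sum(store.get(i) for i in ids) / len(ids)


store = SampleStore()
store.put("site-a", 2.0)
store.put("site-b", 4.0)
print(mean_reading(store))
```

If the developers find `put`, `get`, and `ids` too limited for their own analysis, that is a signal the public interface needs to grow, rather than a reason to reach into private state.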

Page 18: EPA 2013 Air Sensors Meeting Big Data Talk

If the “customers” aren’t integrated into the development loop:

http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg

Page 19: EPA 2013 Air Sensors Meeting Big Data Talk

DOE Knowledgebase (KBase)

• Emerging software and data environment to enable researchers

• Service-oriented architecture in which biological data are integrated into a single data model, with KBase services loosely coupled to achieve various functions

• Open development environments for community contribution (public data, services, software)

• Provides robust and scalable infrastructure (with some level of support)

https://kbase.us
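The idea of a single data model with loosely coupled services can be sketched as follows. This is not KBase's actual API or data model: `Record`, `SERVICES`, `service`, `call`, and the two example services are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    """Hypothetical shared data model that every service reads."""
    source: str   # where the record came from
    kind: str     # e.g. "sequence" or "measurement"
    value: str    # payload, kept as text for simplicity


# Registry of services; each one is just a function over the shared model.
SERVICES: Dict[str, Callable[[List[Record]], object]] = {}


def service(name: str):
    """Register a function as a named service."""
    def register(func):
        SERVICES[name] = func
        return func
    return register


@service("count_by_kind")
def count_by_kind(records: List[Record]) -> Dict[str, int]:
    """Count how many records of each kind are present."""
    counts: Dict[str, int] = {}
    for r in records:
        counts[r.kind] = counts.get(r.kind, 0) + 1
    return counts


@service("sources")
def sources(records: List[Record]) -> List[str]:
    """List the distinct data sources in sorted order."""
    return sorted({r.source for r in records})


def call(name: str, records: List[Record]):
    """Higher-level code looks services up by name in the registry."""
    return SERVICES[name](records)


data = [
    Record("lab-1", "sequence", "ACGT"),
    Record("lab-2", "measurement", "7.1"),
    Record("lab-2", "sequence", "GGCA"),
]
print(call("count_by_kind", data))
print(call("sources", data))
```

Because the services share only the data model and a name in the registry, a community contributor could add a new service without modifying the existing ones.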

Page 20: EPA 2013 Air Sensors Meeting Big Data Talk

KBase uses a service-oriented architecture

http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf

Higher level functions

Page 21: EPA 2013 Air Sensors Meeting Big Data Talk

DOE KBase Investment

“…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.”

Free tutorials / workshops for the community provided.

Page 22: EPA 2013 Air Sensors Meeting Big Data Talk

Advice for the next round…

Data generator:

• Managing expectations and value

Developer:

• “Eat your own dogfood”

Data analyzer:

• Analyze with reproducibility in mind

Access

Training

Communication

Platform / Teams

Big data is a community problem and solution
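One possible reading of "analyze with reproducibility in mind" is to record, alongside every result, the input fingerprint and parameters needed to rerun it. The sketch below assumes a toy analysis; `run_analysis`, `with_provenance`, and the readings are hypothetical and only illustrate the pattern.

```python
import hashlib
import json
import random


def run_analysis(values, seed, n_draws):
    """Toy analysis: mean of a seeded random resample (with replacement)."""
    rng = random.Random(seed)  # local generator, no hidden global state
    draws = [rng.choice(values) for _ in range(n_draws)]
    return sum(draws) / n_draws


def with_provenance(values, seed, n_draws):
    """Return the result together with what is needed to rerun it."""
    payload = json.dumps(values).encode("utf-8")
    return {
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        "params": {"seed": seed, "n_draws": n_draws},
        "result": run_analysis(values, seed, n_draws),
    }


readings = [1.0, 2.0, 3.0, 4.0]
first = with_provenance(readings, seed=42, n_draws=10)
second = with_provenance(readings, seed=42, n_draws=10)
print(first == second)
```

Sharing the provenance record with the result lets a collaborator confirm they have the same input and settings before comparing numbers.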

Page 23: EPA 2013 Air Sensors Meeting Big Data Talk

Resources

• Amazon interviews

http://highscalability.com/amazon-architecture

• Titus Brown’s blog post on heterogeneous data integration

http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html

• KBase website

http://www.kbase.us

• Software carpentry – “helping scientists build better software”

http://software-carpentry.org

Page 24: EPA 2013 Air Sensors Meeting Big Data Talk

Thanks!

Please feel free to contact me:

http://adina.github.com

[email protected]

http://cheezburger.com/6983817216