Strata Big Data Science Talk on ADAM

ADAM: Fast, Scalable Genome Analysis

Frank Austin NothaftAMPLab, University of California, Berkeley

with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave

Patterson

https://github.com/bigdatagenomics

Wednesday, February 12, 14

Problem

• Whole genome files are large

• Biological systems are complex

• Population analysis requires petabytes of data

• Analysis time is often a matter of life and death

Shredded Book AnalogyDickens accidentally shreds the first printing of A Tale of Two Cities

Text printed on 5 long spools

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/Wednesday, February 12, 14

It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …

It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,

It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …

It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …

It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …

• How can he reconstruct the text?– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments– The short fragments from every copy are mixed together– Some fragments are identical

What is ADAM?

• File formats: columnar file format that allows efficient parallel access to genomes

• API: interface for transforming, analyzing, and querying genomic data

• CLI: a handy toolkit for quickly processing genomes

The Broad Institute “Best Practices” Pipeline

SNAPADAM Avocado

MLBaseGraphX

SiRen CAGE

Design Goals

• Develop processing pipeline that enables efficient, scalable use of cluster/cloud

• Provide data format that has efficient parallel/distributed access across platforms

• Enhance semantics of data and allow more flexible data access patterns

Whole GenomeData Sizes

Input PipelineStage

Output

SNAP 1GB Fasta150GB Fastq

Alignment 250GB BAM

ADAM 250GB BAM Pre-processing

200GB ADAM

Avocado 200GB ADAM

Variant Calling 10MB ADAM

Variants found at about 1 in 1,000 loci

ADAM Stack

Physical

File/Block

Record/Split

‣Commodity Hardware‣Cloud Systems - Amazon, GCE, Azure

‣Hadoop Distributed Filesystem‣Local Filesystem

‣Schema-driven records w/ Apache Avro‣Store and retrieve records using Parquet‣Read BAM Files using Hadoop-BAM

‣Transform records using Apache Spark‣Query with SQL using Shark‣Graph processing with GraphX‣Machine learning using MLBase

Implementation

• Work accelerated at end of September

• 15K lines of Scala code

• 100% Apache-licensed open-source

• Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute

• Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC

• Global Alliance looking at ADAM as potential standard

Commits

AWS EC2 Performance

Picard

ADAM 1 node

ADAM 32 nodes

ADAM 100 nodes

1 10 100 1000 10000

Minutes (log scale)

Sort Mark Duplicates

NA12878 Whole Genome234GB as BAM, 229 GB as ADAM

Mt. Sinai Performance

Picard

ADAM 16 nodes

ADAM 32 nodes

ADAM 64 nodes

ADAM 82 nodes

1 10 100 1000 10000

Minutes (log scale)

Sort Mark Duplicates

NA12878 Whole Genome234GB as BAM, 229 GB as ADAM

Scalability

HG00096 à 16GBNA12878 à 234GB

Acknowledgements

• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl Yeksigian

Acknowledgements

This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA

XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc.,

Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric,

Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and

Yahoo!.

“Before seeing your data, I would not think sorting could be done that fast.”

- research scientist at the Broad Institute

Call for contributions

• As an open source project, we welcome contributions

• We maintain a list of open enhancements at our Github issue tracker

• Enhancements tagged with “Pick me up!” don’t require a genomics background

• Github: https://github.com/bigdatagenomics/

Strata Big Data Science Talk on ADAM

Health & Medicine

Strata - Final_IB_02_17

Strata schemes management act victoria presentation iconic strata management

RENTAL* STRATA RENTAL OR STRATA - Vancouver · 2020-08-06 · RENTAL* STRATA REQUIRED REQUIRED REQUIRED** RENTAL OR STRATA Development Options: General Building Codes Upgrades

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk by Frank Austin Nothaft

Strata Manual

Behavioral Analytics with Smartphone Data. Talk at Strata + Hadoop World 2014, Barcelona

Strata schemes management queensland presentation select strata

Adam - Talk 3 - ing the Message

Topview Strata Insurance PDS - Strata Brokersstratabrokers.com.au/.../11/15_topview_strata_insurance_pds_20112… · Our Topview Strata Insurance ... multinational corporations, in

Strata schemes management queensland challenge strata mangement presentation

Strata Online_road_to_enterprise_data_2011

Strata Cache

Strata schemes management queensland colcan strata

STRATA LIFE SPRING 2015 - vic.strata.communityvic.strata.community/documents/Strata Life/Strata Life Spring 2015... · 1 Minister announces full review of Vic strata laws 2 Awards

Toowoomba strata

Strata - absolutephoneanddata.com.au dp5000 handset a… · • Strata CIX Programming Manual • Strata CIX My Phone Manager User Guide. Strata CIX DP5000-series Telephone User Guide

Strata complete tasmania strata title management

Presto Strata Hadoop SJ 2016 short talk

New Strata Briefing GLS 201015 - Universiti Teknologi …fght.utm.my/tlchoon/files/2016/01/New-Strata-Briefing-GLS24NOV15.pdf · - AMENDMENT OF STRATA TITLE ACT, - NEW STRATA MANAGEMENT

Strata schemes management act canberra, act city strata