This talk was adapted from my presentation at Finishing in the Future 2011, Santa Fe, NM.
BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data
Zhong Wang, Ph.D., Computational Biology Staff Scientist
Cellulase
The deep metagenome approach to discover cellulases for biofuel research
Large data, large reward
http://www.cazy.org/
Only 1% shared (>=95% identity)
50% validated activity
Science. 2011 Jan 28;331(6016):463-7.
Sequence data
More data would be even better
Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb
But, can analysis keep up with data growth?
Ideal solutions for the terabase problem
1. Scalable to 1 Tb?
2. Performance (within hours)?
High-Mem cluster
Input/Output (IO)
Memory
MP/MPI solution: k-mer counting
Figure: raw data is split into data slices; each node/core holds a data slice and a table slice, which are merged into the final count table.
MP/MPI performance
MPI version: 412 Gb, 4.5B reads; 2.7 hours on 128x24 cores (NERSC Hopper II)
MP threaded version: 268 Gb, 3B reads; 5 days on 32 cores (high-mem cluster)
Problems:
• Requires experienced software engineers
• Six months of development time
• One node fails, all fail
Fast, scalable
Hadoop/Map Reduce framework
• Google MapReduce
– Data-parallel programming model for processing petabyte-scale data
– Generally has a map step and a reduce step
• Apache Hadoop
– Distributed file system (HDFS) and job handling for scalability and robustness
– Data locality brings compute to the data, avoiding the network transfer bottleneck
Programmability: Hadoop vs. Pig (finding the top 5 websites that young people visit)
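The slide's query (a well-known Pig tutorial example that takes roughly 10 lines of Pig Latin versus ~100 lines of raw Hadoop Java) can be sketched at a similarly high level in plain Python. The file layout and field names here are assumptions for illustration only:

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the two input files in the Pig example
users = {"ann": 19, "bob": 23, "carol": 40}        # name -> age
visits = [("ann", "site-a"), ("ann", "site-b"),
          ("bob", "site-a"), ("carol", "site-c")]  # (name, url)

# Filter to young users, join on name, count visits per site, take the top 5
young = {name for name, age in users.items() if 18 <= age <= 25}
counts = Counter(url for name, url in visits if name in young)
top5 = counts.most_common(5)
print(top5)  # [('site-a', 2), ('site-b', 1)]
```

The point of the comparison is the same as the slide's: a high-level, declarative style expresses filter/join/group/order in a handful of lines, while the equivalent hand-written MapReduce code must spell out each stage.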
BioPig: design goals
• Flexible
– every dataset is unique; data analysts have domain knowledge that is essential to optimize the analysis
– pluggable modules that analysts can use to build custom analytic pipelines
• High-level
– a domain-specific language enables data analysts to create custom pipelines
– hides the details of parallelism (too complex for most people)
• Scalability
– leverage data parallelism to speed up analytics
– integrate external tools and applications where necessary
– scale from 1 to hundreds of compute nodes with minimal effort and linear scalability
• Robustness
– data and computation are replicated across nodes to combat failures
BioPig
Runs on any hardware supporting Hadoop
• JGI Titanium (commodity Hadoop cluster)
– Up to 20 nodes: 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
– Up to 200 nodes: 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10 Gbit InfiniBand, GPFS
• Amazon AWS
– Elastic MapReduce with cluster compute nodes (23 GB memory, 2 x Intel quad-core "Nehalem", 1690 GB instance storage, 10G Ethernet)
BioPig Modules
Blast
Input/Output (FASTA, FASTQ)
K-mer Counter
Assembly
How k-mer count is implemented
Load: <id1, header, 'attagc'>, <id2, header, 'gttagg'>
Mapper: <id1, 'atta'>, <id1, 'ttag'>; <id2, 'gtta'>, <id2, 'ttag'>
Shuffle/sort: <'atta', id1>, <'ttag', id1, id2>, <'gtta', id2>, <'tagg', id2>
Reducer: <'atta', 1>, <'ttag', 2>, <'gtta', 1>, <'tagg', 1>
Merge: <'atta', 3>, <'ttag', 2>, <'gtta', 2>, <'tagg', 1>
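The map/shuffle/reduce flow above can be sketched as a minimal single-process Python simulation. This is an illustration of the dataflow, not BioPig's actual implementation; the read IDs and k=4 match the slide, though this sketch emits every k-mer of each read (so 'tagc' from 'attagc' also appears), where the slide shows only a subset:

```python
from collections import defaultdict

def kmer_count(reads, k=4):
    """Count k-mers with an explicit map -> shuffle -> reduce flow."""
    # Map: emit a (kmer, read_id) pair for every k-mer in every read
    mapped = [(read[i:i + k], rid)
              for rid, read in reads
              for i in range(len(read) - k + 1)]
    # Shuffle/sort: group read ids by k-mer key
    groups = defaultdict(list)
    for kmer, rid in mapped:
        groups[kmer].append(rid)
    # Reduce: count the occurrences of each k-mer
    return {kmer: len(rids) for kmer, rids in groups.items()}

reads = [("id1", "attagc"), ("id2", "gttagg")]
print(kmer_count(reads))
# {'atta': 1, 'ttag': 2, 'tagc': 1, 'gtta': 1, 'tagg': 1}
```

In Hadoop, the shuffle/sort grouping and the final merge across splits are handled by the framework; the programmer supplies only the map and reduce functions.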
A 7-liner BioPig script for k-mer counting
Rumen metagenome gene discovery pipeline
Read preprocess (remove artifacts)
pigBlast (blast reads against known cellulases)
pigAssembler (assemble reads into contigs)
pigExtender (extend contigs into full-length enzymes)
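The four stages above can be sketched as a simple function chain. The stage bodies here are hypothetical toy placeholders, not BioPig's actual modules; only the chaining pattern is the point:

```python
# Hypothetical stand-ins for the four pipeline stages (toy logic for illustration)
def preprocess(reads):
    # Remove artifacts (toy rule: drop reads containing an ambiguous base 'N')
    return [r for r in reads if "N" not in r]

def pig_blast(reads):
    # Keep reads that "hit" a known cellulase (toy rule: contain the motif 'gga')
    return [r for r in reads if "gga" in r]

def pig_assembler(reads):
    # Assemble reads into contigs (toy rule: pass reads through unchanged)
    return reads

def pig_extender(contigs):
    # Extend contigs into full-length enzymes (toy rule: mark them uppercase)
    return [c.upper() for c in contigs]

def pipeline(reads):
    for stage in (preprocess, pig_blast, pig_assembler, pig_extender):
        reads = stage(reads)
    return reads

print(pipeline(["attggac", "aNtggac", "ggattca"]))
# ['ATTGGAC', 'GGATTCA']
```

Because each stage consumes and produces the same kind of collection, stages can be swapped or reordered, which is the "pluggable modules" flexibility named in the design goals.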
Cloud solution to large data
BioPig-Blaster
BioPig-Assembler
BioPig-Extender
BioPig
BioPig pipeline: 61 lines of code
MPI-extender: ~12,000 lines (vs. 31 lines in BioPig)
Flexibility
Programmability
Scalability
Conclusions
Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data; it is robust and easy to use.
Challenges in application
• IO optimization, e.g., reducing data copying
• Some problems do not fit easily into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy
Acknowledgement
• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @ JGI/NERSC
• Shane Cannon @ NERSC