BigBWA: approaching the Burrows–Wheeler
aligner to Big Data technologies
Dongseo University Division of Computer & Information Engineering
Machine Learning Research Lab
Presented by:
Ahmed A. Absi
Bioinformatics Advance Access published September 5, 2015
Outline
• Introduction
• Motivation
• Proposed work
• Performance Results
• Conclusion
• My opinion
• Current Progress
Evolving scientific instruments and the rapid sophistication of
computing systems have resulted in large-scale scientific
simulations and data analysis workflows.
As more and more scientific data is generated, our ability to
effectively manage and process such data also needs to evolve.
Genomics has become heavily dependent on sequence alignment
tools, which are computationally intensive.
Introduction
Retrieved on 22nd Nov, 2015 from http://epilepsygenetics.net/2014/06/27/when-will-we-have-the-1000-epilepsy-genome/
• Widely used similarity search tool
• Heuristic seed-and-extend approach
• Uses “look-up” tables to shorten search time
• Performs both Global and Local alignment
• Fastest and most frequently used sequence alignment tool
Burrows–Wheeler Aligner (BWA)
Use Burrows-Wheeler Transform to “index” the human genome and allow
memory-efficient and fast string matching between sequence read and
reference genome.
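The indexing idea above can be illustrated with a minimal sketch. It assumes a naive rotation-sort construction of the Burrows-Wheeler Transform and a simple backward search; the function names `bwt` and `count_occurrences` are hypothetical, and this is not BWA's actual implementation, which uses far more compact data structures.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern using backward search over the BWT."""
    # first[c]: row of the first rotation that starts with character c
    first = {}
    for row, char in enumerate(sorted(bwt_str)):
        first.setdefault(char, row)

    def occ(char, row):
        # Occurrences of char in the first `row` symbols of the BWT.
        # A real index stores sampled counts instead of rescanning.
        return bwt_str[:row].count(char)

    lo, hi = 0, len(bwt_str)
    # Extend the match one character at a time, right to left
    for char in reversed(pattern):
        if char not in first:
            return 0
        lo = first[char] + occ(char, lo)
        hi = first[char] + occ(char, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("banana")
print(index, count_occurrences(index, "ana"))
```

Each step of the search narrows a range of sorted rotations, so the cost depends on the read length rather than on the length of the reference.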
BWA: Short-read algorithm; alters the read sequence so that it matches
the reference exactly.
BWA-SW: Long-read algorithm; samples reference subsequences and
performs Smith-Waterman alignment between the subsequences and the
read.
BWA-MEM:
• Similar features to BWA-SW
• Long-read alignment
• Seed and extend with Smith-Waterman (SW)
• Finds larger gaps
• Faster; generally supersedes BWA-SW
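The Smith-Waterman step mentioned for BWA-SW and BWA-MEM can be sketched as a small dynamic program. This is a minimal illustration with a linear gap penalty; the function name `smith_waterman` and the scoring values are assumptions chosen for the example, not the parameters BWA uses.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + step,  # match or mismatch
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

The full matrix costs time proportional to the product of the two lengths, which is why seed-and-extend aligners only run it around candidate seed hits.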
Burrows–Wheeler Aligner (BWA) S/W Package
Motivation
The amount of sequence data is growing rapidly. Such rapid
growth will create obstacles for next-generation sequence
processing.
Sequence alignment is a very time-consuming process. This
problem becomes even more noticeable as millions and billions
of reads need to be aligned.
Therefore, NGS professionals demand scalable solutions to
boost the performance of the aligners in order to obtain the
results in reasonable time.
Proposed Approach: BigBWA
BigBWA is a new tool that uses Hadoop as its Big Data technology to
increase the performance of BWA. The main advantages of the tool are
the following:
• The alignment process is performed in parallel, which reduces
execution times.
• BigBWA is fault tolerant, exploiting the fault tolerance capabilities of
the underlying Big Data technology on which it is based.
• No modifications to BWA are required to use BigBWA. As a
consequence, any release of BWA (future or legacy) will be
compatible with BigBWA.
Proposed Approach: BigBWA
BigBWA divides the computation into Map and Reduce phases.
In the Map phase, BigBWA splits the reads into subsets, mapping
each subset to a mapper process. Each mapper is responsible for
applying the considered BWA algorithm using as input the reads
assigned by BigBWA.
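The splitting of reads into subsets can be sketched as follows. This is a minimal illustration that assumes reads arrive as FASTQ text with four lines per record; the function name `split_fastq` is hypothetical, and BigBWA itself relies on Hadoop to distribute the input rather than on code like this.

```python
def split_fastq(lines, num_splits):
    """Split FASTQ lines into contiguous subsets without breaking records."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    # Group the lines into whole records (header, sequence, '+', qualities)
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    if not records:
        return []
    per_split = -(-len(records) // num_splits)  # ceiling division
    splits = []
    for start in range(0, len(records), per_split):
        subset = records[start:start + per_split]
        # Flatten the records back into plain lines for one mapper
        splits.append([line for record in subset for line in record])
    return splits


reads = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCA", "+", "IIII",
]
for subset in split_fastq(reads, 2):
    print(len(subset) // 4, "reads")
```

Splitting on record boundaries matters because a mapper that received half a record could not pass valid input to the aligner.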
In case any of the mappers fails, BigBWA would automatically launch
another identical mapper process to replace the faulty one.
In the Reduce phase, the output files produced by the mappers are merged
into a single result file.
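The merge step can be sketched under the assumption that each mapper writes its alignments in SAM format, where header lines start with "@". The function name `merge_sam` is hypothetical and the sketch is not taken from BigBWA's source.

```python
def merge_sam(parts):
    """Merge per-mapper SAM outputs, keeping the header of the first part only."""
    merged = []
    for index, part in enumerate(parts):
        for line in part:
            if line.startswith("@"):
                # Header lines repeat in every part; keep one copy
                if index == 0:
                    merged.append(line)
            else:
                merged.append(line)
    return merged


part_a = ["@SQ\tSN:chr1\tLN:1000", "read1\t0\tchr1\t10"]
part_b = ["@SQ\tSN:chr1\tLN:1000", "read2\t0\tchr1\t55"]
for line in merge_sam([part_a, part_b]):
    print(line)
```

Because every mapper aligns against the same reference, the headers are expected to be identical, so dropping the repeats loses no information.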
SEAL (Pireddu et al., 2011) : uses Pydoop, a Python implementation of the
MapReduce programming model that runs on the top of Hadoop. It allows
users to write their programs in Python, calling BWA methods.
pBWA (Peters et al., 2012) : pBWA uses a standard parallel programming
paradigm to parallelize BWA. pBWA lacks fault tolerant mechanisms.
The most important differences between these tools and BigBWA are:
• SEAL and pBWA only work with a particular modified version of BWA, whereas BigBWA
works directly with the original BWA implementation, keeping compatibility with future
and legacy BWA versions.
• Both SEAL and pBWA are based on a BWA version that does not include the new
BWA-MEM algorithm. Therefore, to the best of the authors' knowledge, BigBWA is the first
tool to handle the parallelization of the BWA-MEM algorithm using Big Data technologies.
BigBWA Similar Approaches
Experimental Configuration
• Cluster: 5 nodes on Amazon AWS, r3.4xlarge instance type
• CPU: 16 cores per node, Intel Xeon CPUs at 2.5 GHz
• Memory: 488 GB RAM
• Software: Hadoop version 2.6.0
• Datasets: 1000 Genomes Project, 3.9, 13.4, and 54.7 GB
Evaluation Performance
Evaluation Performance: Datasets
Evaluation Results
Comparison of the performance for the BWA algorithm
Evaluation Results
Comparison of the performance for the BWA-MEM algorithm
Conclusion
This paper introduces up-to-date long-read sequence
alignment algorithms in bioinformatics.
BigBWA is a new tool that uses the Big Data technology
Hadoop to boost the performance of the Burrows–Wheeler
aligner (BWA).
Important reductions in the execution times were observed
when using this tool. In addition, BigBWA is fault tolerant
and it does not require any modification of the original BWA
source code.
My opinion
Q & A Thank You!