BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies Dongseo University Division of Computer & Information Engineering Machine Learning Research Lab Presented by: Ahmed A. Absi Bioinformatics Advance Access published September 5, 2015


Page 1: Ahmed Absi slides  bigbwa

BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies

Dongseo University Division of Computer & Information Engineering

Machine Learning Research Lab

Presented by:

Ahmed A. Absi

Bioinformatics Advance Access published September 5, 2015

Page 2: Ahmed Absi slides  bigbwa

Outline

• Introduction

• Motivation

• Proposed work

• Performance Results

• Conclusion

• My opinion

• Current Progress

Page 3: Ahmed Absi slides  bigbwa

Introduction

Evolving scientific instruments and increasingly sophisticated computing systems have resulted in large-scale scientific simulations and data analysis workflows.

As more and more scientific data is generated, our ability to manage and process it effectively also needs to evolve.

Genomics has become heavily dependent on sequence alignment tools, which are computationally intensive.

Page 4: Ahmed Absi slides  bigbwa

Introduction

Retrieved on 22nd Nov, 2015 from http://epilepsygenetics.net/2014/06/27/when-will-we-have-the-1000-epilepsy-genome/

Page 5: Ahmed Absi slides  bigbwa

Burrows–Wheeler Aligner (BWA)

• Widely used similarity search tool

• Heuristic seed-and-extend approach

• Uses “look-up” tables to shorten search time

• Performs both global and local alignment

• One of the fastest and most frequently used sequence alignment tools

Page 6: Ahmed Absi slides  bigbwa

Burrows–Wheeler Aligner (BWA) Software Package

BWA uses the Burrows-Wheeler Transform to “index” the human genome, allowing memory-efficient and fast string matching between a sequence read and the reference genome.

BWA: short-read algorithm; alters the read sequence so that it matches the reference exactly.

BWA-SW: long-read algorithm; samples reference subsequences and performs Smith-Waterman alignment between the subsequences and the read.

BWA-MEM:

- Similar features to BWA-SW

- Long-read alignment

- Seed-and-extend with Smith-Waterman (SW)

- Finds larger gaps

- Faster; generally supersedes BWA-SW
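To illustrate how a Burrows-Wheeler index supports exact matching of a read against a reference, here is a minimal toy sketch. It is not BWA's implementation: the function names `bwt` and `backward_search` are made up for this example, the suffix array is built by naive sorting, and occurrence counts are recomputed on every step, which would be far too slow for a real genome.

```python
from collections import Counter
from typing import List, Tuple


def bwt(text: str) -> Tuple[str, List[int]]:
    """Return the Burrows-Wheeler transform and suffix array of text + '$'."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps around for i == 0).
    return "".join(text[i - 1] for i in sa), sa


def backward_search(pattern: str, last: str, sa: List[int]) -> List[int]:
    """Return sorted start positions of exact matches of pattern."""
    # first_row[c] = row of the first sorted suffix that starts with c.
    counts = Counter(last)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    lo, hi = 0, len(last)  # half-open range of matching suffix-array rows
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first_row:
            return []
        # Naive occurrence counts; real indexes precompute these.
        lo = first_row[c] + last[:lo].count(c)
        hi = first_row[c] + last[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


reference = "ACGTACGT"  # toy reference, assumed for illustration
last_column, suffix_array = bwt(reference)
print(backward_search("CGT", last_column, suffix_array))
```

The search walks the read from right to left and narrows a range of suffix-array rows at each step, so the cost depends on the read length rather than on a scan of the whole reference.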

Page 7: Ahmed Absi slides  bigbwa

Motivation

The amount of sequence data is growing rapidly. This growth will create an obstacle for next-generation sequence processing.

Sequence alignment is a very time-consuming process. The problem becomes even more noticeable when millions or billions of reads need to be aligned.

Therefore, NGS professionals demand scalable solutions that boost the performance of the aligners in order to obtain results in a reasonable time.

Page 8: Ahmed Absi slides  bigbwa

Proposed Approach: BigBWA

BigBWA is a new tool that uses Hadoop as its Big Data technology to increase the performance of BWA. The main advantages of the tool are the following:

The alignment process is performed in parallel, which reduces execution times.

BigBWA is fault tolerant, exploiting the fault tolerance capabilities of the underlying Big Data technology on which it is based.

No modifications to BWA are required to use BigBWA. As a consequence, any release of BWA (future or legacy) will be compatible with BigBWA.

Page 9: Ahmed Absi slides  bigbwa

Proposed Approach: BigBWA

BigBWA divides the computation into Map and Reduce phases.

In the Map phase, BigBWA splits the reads into subsets and assigns each subset to a mapper process. Each mapper applies the selected BWA algorithm to the reads assigned to it by BigBWA and writes its own output file.

If any of the mappers fails, BigBWA automatically launches another identical mapper process to replace the faulty one.

In the Reduce phase, the output files produced by the mappers are merged into a single result.
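The split, map, relaunch, and merge steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not BigBWA's code and not a Hadoop job: `split_fastq`, `run_mapper`, `reduce_sam`, and `toy_align` are hypothetical names, and `toy_align` is a stand-in that emits fake SAM-like lines instead of calling BWA.

```python
from typing import Callable, List


def split_fastq(lines: List[str], n_splits: int) -> List[List[str]]:
    """Split FASTQ lines into at most n_splits chunks, keeping 4-line records whole."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    per_split = max(1, -(-len(records) // n_splits))  # ceiling division
    return [
        [line for rec in records[i:i + per_split] for line in rec]
        for i in range(0, len(records), per_split)
    ]


def run_mapper(chunk: List[str],
               align: Callable[[List[str]], List[str]],
               max_attempts: int = 2) -> List[str]:
    """Run the aligner on one chunk, relaunching it if an attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return align(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
    return []  # only reached if max_attempts < 1


def reduce_sam(outputs: List[List[str]]) -> List[str]:
    """Merge mapper outputs: header lines ('@...') once, then all alignment lines."""
    header = [l for l in outputs[0] if l.startswith("@")] if outputs else []
    body = [l for out in outputs for l in out if not l.startswith("@")]
    return header + body


def toy_align(chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for BWA: one fake SAM-like line per read."""
    names = [chunk[i][1:] for i in range(0, len(chunk), 4)]
    return ["@HD\tVN:1.0"] + [f"{name}\t4\t*" for name in names]


# Five toy reads, split across two mappers, then merged by the reducer.
fastq: List[str] = []
for i in range(5):
    fastq += [f"@read{i}", "ACGT", "+", "IIII"]
chunks = split_fastq(fastq, 2)
merged = reduce_sam([run_mapper(c, toy_align) for c in chunks])
print(len(chunks), len(merged))
```

In a real Hadoop deployment the framework itself schedules the mappers and relaunches failed tasks; the retry loop here only mimics that behaviour in a single process.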

Page 10: Ahmed Absi slides  bigbwa

BigBWA Similar Approaches

SEAL (Pireddu et al., 2011): uses Pydoop, a Python implementation of the MapReduce programming model that runs on top of Hadoop. It allows users to write their programs in Python, calling BWA methods.

pBWA (Peters et al., 2012): uses a standard parallel programming paradigm to parallelize BWA. pBWA lacks fault tolerance mechanisms.

The most important differences between these tools and BigBWA are:

SEAL and pBWA only work with a particular modified version of BWA, whereas BigBWA works directly with the original BWA implementation, keeping compatibility with future and legacy BWA versions.

Both SEAL and pBWA are based on a BWA version that does not include the new BWA-MEM algorithm. Therefore, to the best of our knowledge, BigBWA is the first tool to handle the parallelization of the BWA-MEM algorithm using Big Data technologies.

Page 11: Ahmed Absi slides  bigbwa

Evaluation Performance

Experimental Configuration

Setup:

• Amazon AWS cluster of 5 nodes, r3.4xlarge instance type

• 16 cores per node, Intel Xeon CPUs at 2.5 GHz

• 488 GB RAM

• Hadoop version 2.6.0

• Datasets from the 1000 Genomes Project: 3.9, 13.4, and 54.7 GB

Page 12: Ahmed Absi slides  bigbwa

Evaluation Performance: Datasets

Experimental Configuration

Page 13: Ahmed Absi slides  bigbwa

Evaluation Results

Comparison of the performance for the BWA algorithm

Page 14: Ahmed Absi slides  bigbwa

Evaluation Results

Comparison of the performance for the BWA-MEM algorithm

Page 15: Ahmed Absi slides  bigbwa

Conclusion

This paper introduces up-to-date long-read sequence alignment algorithms in bioinformatics.

BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows–Wheeler aligner (BWA).

Significant reductions in execution time were observed when using this tool. In addition, BigBWA is fault tolerant and does not require any modification of the original BWA source code.

Page 16: Ahmed Absi slides  bigbwa

My opinion

Page 17: Ahmed Absi slides  bigbwa

Q & A Thank You!