HPC in Bioinformatics and Genomics

Daniel Kahn, Clément Rezvoy and Frédéric Vivien

Lyon 1 University & INRIA HELIX team

LIP-ENS & INRIA GRAAL team

D. Kahn, HPC in Bioinformatics and Genomics IBM Campus Day

Moore’s law in genomics

➢ Exponential increase

➢ Doubling time ~20 months

New high-throughput technologies

➢ Pyrosequencing (Roche 454 GS FLX)
  • 100-400 Mb per run (1 day)
  • Long reads (up to 400 bp)
  • ~15 Gb raw data

➢ Illumina Genome Analyzer
  • 1,500 Mb per run (3 days)
  • Short reads (35 bp)
  • ~1 Tb raw data

➢ Applied Biosystems SOLiD sequencer
  • 3,000 Mb per run (5 days)
  • Short reads (35 bp)
  • ~15 Tb raw data
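The per-run figures above can be turned into daily throughput and approximate read counts. The following is a minimal sketch, assuming that "Mb" means megabases (10^6 bases) and that every read has the quoted length; for the 454 platform the upper bounds of the quoted ranges are used, so its numbers are indicative only. The names `PLATFORMS` and `run_stats` are illustrative and do not come from the slides.

```python
# (platform, megabases per run, days per run, read length in bp), taken from the slide.
# Assumption: "Mb" = 10**6 bases; the 454 entry uses the upper bounds of its ranges.
PLATFORMS = [
    ("Roche 454 GS FLX", 400, 1, 400),
    ("Illumina Genome Analyzer", 1500, 3, 35),
    ("Applied Biosystems SOLiD", 3000, 5, 35),
]


def run_stats(mb_per_run, days, read_len_bp):
    """Return (megabases per day, approximate number of reads per run)."""
    mb_per_day = mb_per_run / days
    reads_per_run = mb_per_run * 1_000_000 / read_len_bp
    return mb_per_day, reads_per_run


for name, mb, days, read_len in PLATFORMS:
    per_day, reads = run_stats(mb, days, read_len)
    print(f"{name}: {per_day:.0f} Mb/day, ~{reads / 1e6:.1f} M reads per run")
```

Under these assumptions the short-read platforms produce tens of millions of reads per run, which is what turns downstream analysis into a computing problem.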

Uses of high throughput sequencing

➢ Population genomics
  • For instance, the 1000 Genomes Project

➢ Individual sequencing

➢ Metagenomics
  • Comprehensive appraisal of microbial communities and gene repertoires in various environments

➢ Phylogenomics
  • Resolving the history of genes and species

➢ …

➢ Each of these is a computing challenge in its own right

Large scale protein sequence analysis

➢ All vs. all

➢ The challenge of protein modularity
  • Most proteins are combinatorial arrangements of conserved modules (domains)

[Figure: modular domain arrangements of the proteins LuxR, GerE, FixJ, OmpR, Spo0A, NtrC and NifA]

The ProDom project

➢ Need for an automated process in order to allow for comprehensive analysis

➢ Automatically decompose proteins into domains and cluster domain families, using MKDOM2

➢ Generate multiple alignments and trees for all families

➢ Automatically generate mutually consistent representations for all proteins

Resolving combinatorial proteins

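[Flowchart: The MKDOM2 program. At the ith iteration, a query goes through internal repeat detection (yes/no), and the resulting query is searched against the database (DB) with PSI-BLAST. Possible outcomes: no match, matches, repeat matches. DB changes: remove newly found domains, split modified sequences, sort by size. The updated DB is used for the (i+1)th iteration.]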
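The iterative loop in the flowchart can be illustrated with a toy version. This is only a sketch under strong assumptions and not the MKDOM2 implementation: exact substring matching stands in for PSI-BLAST, internal repeat detection is omitted, and picking the shortest remaining sequence as the query is an assumption inferred from the "sort by size" step. The function name `greedy_domain_families` and the `min_len` parameter are hypothetical.

```python
def greedy_domain_families(sequences, min_len=3):
    """Toy greedy decomposition of sequences into 'domain families'.

    Assumptions (not from the slides): exact substring matching replaces
    PSI-BLAST, internal repeat detection is skipped, and the shortest
    remaining sequence is taken as the query at each iteration.
    """
    db = sorted((s for s in sequences if len(s) >= min_len), key=len)
    families = []
    while db:
        query = db.pop(0)  # shortest remaining sequence becomes the query
        members = 1        # the query itself
        remaining = []
        for seq in db:
            # Remove every occurrence of the query and split the sequence
            # into the fragments left on either side of each match.
            parts = seq.split(query)
            members += len(parts) - 1
            remaining.extend(p for p in parts if len(p) >= min_len)
        families.append((query, members))
        db = sorted(remaining, key=len)  # re-sort the modified DB by size
    return families


toy_db = ["ACDE", "ACDEFGHI", "KLMNACDEFGHI"]
for domain, count in greedy_domain_families(toy_db):
    print(domain, count)
```

Each iteration depends on the database left by the previous one, which is why the sequential loop is hard to speed up without first establishing which queries cannot interfere with each other.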

Drawbacks of sequential MKDOM2

➢ Greedy algorithm

➢ Scales quadratically

➢ Data follow Moore’s law

→ no longer tractable!
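The tractability argument combines two figures from the slides: data that double roughly every 20 months and an algorithm whose cost grows quadratically with data size. A minimal sketch of that arithmetic follows; the function name `relative_cost` and its parameters are illustrative, and a purely quadratic cost model is assumed.

```python
def relative_cost(months, doubling_months=20.0, exponent=2.0):
    """Cost of an O(n**exponent) analysis after `months`, relative to today.

    Assumes the data size n doubles every `doubling_months` months.
    """
    data_growth = 2.0 ** (months / doubling_months)
    return data_growth ** exponent


for months in (0, 20, 40, 60):
    print(f"after {months} months: cost x{relative_cost(months):.0f}")
```

Under these assumptions the computing cost quadruples each time the data double, so it doubles about every 10 months, which is faster than the data themselves grow.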

Parallelization of MKDOM2

➢ Parallelization of the main loop

➢ Distribute sequences for independent family construction

➢ Difficulties:
  • Heterogeneous run times for the main loop
  • Possible dependencies between families

→ Precalculate an all vs. all comparison in order to select independent queries

→ Send batches of independent sequences before worker nodes are idle

→ Verify family independence a posteriori
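The selection of independent queries and the a posteriori check can be sketched as follows. This is a hypothetical illustration, not the authors' scheduler: the precalculated all vs. all comparison is assumed to be available as a dictionary `hits` mapping each sequence identifier to the set of sequences it matches, and two queries are treated as independent when their hit neighbourhoods do not overlap. Both function names are invented for the example.

```python
def select_independent_batch(candidates, hits, batch_size):
    """Pick up to `batch_size` queries whose hit neighbourhoods are disjoint.

    `hits` is an assumed representation of the precalculated all vs. all
    comparison: sequence id -> set of ids it matches.
    """
    claimed = set()
    batch = []
    for query in candidates:
        neighbourhood = {query} | hits.get(query, set())
        if neighbourhood.isdisjoint(claimed):
            batch.append(query)
            claimed |= neighbourhood
            if len(batch) == batch_size:
                break
    return batch


def families_independent(families):
    """A posteriori check: no sequence belongs to two families of a batch."""
    seen = set()
    for members in families:
        if not seen.isdisjoint(members):
            return False
        seen |= set(members)
    return True


hits = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set(), "e": {"f"}, "f": {"e"}}
print(select_independent_batch(["a", "b", "c", "d", "e", "f"], hits, batch_size=3))
print(families_independent([{"a", "b"}, {"d"}]))
```

The a posteriori check remains necessary because a pairwise comparison computed in advance may miss relationships that only appear once a family has been built iteratively.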

Speed-up on medium scale test set

➢ 32 archaeal genomes

➢ 21.5 M amino acids

Large-scale test set

➢ 263 genomes

➢ 950,216 protein sequences

➢ 339 M amino acids

➢ Run on GRID’5000 (150 nodes)

➢ Half of the data set processed in only 20 hours
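A few derived quantities help put these figures in perspective. The sketch below uses only the numbers quoted on the slide; the node-hour figure additionally assumes that all 150 nodes were busy for the full 20 hours, which the slides do not state.

```python
n_sequences = 950_216       # protein sequences in the large-scale test set
n_residues = 339_000_000    # amino acids in the test set
nodes, hours = 150, 20      # GRID'5000 run that processed half of the data

mean_length = n_residues / n_sequences
pairs = n_sequences * (n_sequences - 1) // 2   # distinct pairs in an all vs. all comparison
node_hours = nodes * hours                     # assumption: every node busy throughout

print(f"mean protein length: {mean_length:.0f} residues")
print(f"all vs. all pairs:   {pairs:.2e}")
print(f"node-hours for half of the data set: {node_hours}")
```

With close to a million sequences, an exhaustive pairwise comparison involves on the order of 10^11 sequence pairs, which gives a sense of the cost of the precalculated all vs. all step.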

Database crunching

Increasing query sizes

Variable sizes of domain families

Heterogeneous run times

➢ ~1000-fold range

Large result queue

… yet efficient node usage

➢ 86% processor usage
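Processor usage under heterogeneous run times can be explored with a small simulation. This is a generic list-scheduling sketch, not the scheduler used for the reported run, and it makes no attempt to reproduce the 86% figure: each task is simply handed to the worker that becomes free first, and usage is defined here as total work divided by the number of workers times the makespan. The function name `simulate_usage` is illustrative.

```python
import heapq


def simulate_usage(run_times, n_workers):
    """Assign each task to the earliest-free worker and return processor usage.

    Usage = total work / (n_workers * makespan); an empty task list gives 0.0.
    """
    if not run_times:
        return 0.0
    free_at = [0.0] * n_workers   # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in run_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    return sum(run_times) / (n_workers * makespan)


# Same tasks, two dispatch orders: long task first, then long task last.
print(simulate_usage([4, 1, 1, 1, 1], n_workers=2))
print(simulate_usage([1, 1, 1, 1, 4], n_workers=2))
```

In this toy case, dispatching the long task first keeps both workers fully busy, whereas dispatching it last leaves one worker idle for a third of the run. With run times spanning several orders of magnitude, dispatch order and batch size have a similar effect on overall usage.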

Full-scale protein domain analysis

➢ To be scaled up 7-fold for full processing of UniProt today!

➢ Will require stable MPI usage of ~1000 processors over the grid

➢ Appropriate infrastructure not yet identified

➢ Another program, MPI_MKDOM3, is envisioned to make full use of the precalculated all vs. all comparison

… required in order to further cope with Moore’s law

INRA Toulouse: Emmanuel COURCELLE, Daniel KAHN

Support: PRABI, EU (EMBRACE & IMPACT), IN2P3, GRID’5000

Lyon 1 University, INRIA HELIX project: Aurélie LAUGRAUD, Lauranne DUQUENNE, Daniel KAHN

LIP-ENS Lyon: Clément REZVOY, Frédéric VIVIEN
