Faculty of Computer Science, Dalhousie University, Canada
Andrew Rau-Chaplin, www.cs.dal.ca/~arc

Parallel Computational Biochemistry

Page 1: Parallel Computational Biochemistry

Parallel Computational Biochemistry

Page 2: Parallel Computational Biochemistry

Proteins, DNA, etc.

DNA encodes the information necessary to produce proteins

Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)

Page 3: Parallel Computational Biochemistry

Proteins, DNA, etc.

• Proteins are formed from a chain of molecules called amino acids

Page 4: Parallel Computational Biochemistry

Proteins, DNA, etc.

• The DNA sequence encodes the amino acid sequence that constitutes the protein

Page 5: Parallel Computational Biochemistry

Proteins, DNA, etc.

• There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...

Page 6: Parallel Computational Biochemistry

Multiple Sequence Alignment

Page 7: Parallel Computational Biochemistry

Databases of Biological Sequences

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides

Swiss-Prot: 104,559 sequences; 38,460,707 residues

PDB: 17,175 structures

Page 8: Parallel Computational Biochemistry

Sequence comparison

• Compare one sequence (target) to many sequences (database search)

• Compare more than two sequences simultaneously

Page 9: Parallel Computational Biochemistry

Applications

• Phylogenetic analysis

• Identification of conserved motifs and domains

• Structure prediction

Page 10: Parallel Computational Biochemistry

Page 11: Parallel Computational Biochemistry

Phylogenetic Analysis

Page 12: Parallel Computational Biochemistry

Structure Prediction

Genomic sequences

> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences

Protein structures

Page 13: Parallel Computational Biochemistry

Clustal W

Page 14: Parallel Computational Biochemistry

Progressive Alignment

Pairwise distance matrix (lower triangle):

                   [1]    [2]    [3]    [4]
Scerevisiae [1]
Celegans    [2]  0.640
Drosophila  [3]  0.634  0.327
Human       [4]  0.630  0.408  0.420
Mouse       [5]  0.619  0.405  0.469  0.289

(Figure: guide tree with leaves S.cerevisiae, C.elegans, Drosophila, Mouse, Human.)

1. Do pairwise alignment of all sequences and calculate the distance matrix

2. Create a guide tree based on this pairwise distance matrix

3. Align progressively following the guide tree
  • start by aligning the most closely related pairs of sequences
  • at each step, align two sequences or one sequence to an existing subalignment
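Steps 1 and 2 can be illustrated with the distance matrix from this slide. The sketch below builds a guide tree by average-linkage (UPGMA-style) clustering. That choice is an assumption made for brevity and is not necessarily the tree-building method Clustal W uses; the names `upgma_merge_order`, `cluster_distance`, and `leaf_distance` are hypothetical.

```python
from itertools import combinations

# Pairwise distances from the slide (symmetric; one triangle is listed).
NAMES = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
DIST = {
    ("Scerevisiae", "Celegans"): 0.640,
    ("Scerevisiae", "Drosophila"): 0.634,
    ("Celegans", "Drosophila"): 0.327,
    ("Scerevisiae", "Human"): 0.630,
    ("Celegans", "Human"): 0.408,
    ("Drosophila", "Human"): 0.420,
    ("Scerevisiae", "Mouse"): 0.619,
    ("Celegans", "Mouse"): 0.405,
    ("Drosophila", "Mouse"): 0.469,
    ("Human", "Mouse"): 0.289,
}

def leaf_distance(a, b):
    """Look up a distance regardless of the order of the two names."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def cluster_distance(c1, c2):
    """Average distance over all leaf pairs (average linkage)."""
    total = sum(leaf_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def upgma_merge_order(names):
    """Return the list of merges; each merge is a pair of leaf-name tuples."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters, then treat them as one cluster.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma_merge_order(NAMES)
for left, right in merges:
    print(left, "+", right)
```

A progressive aligner would then visit the merges in this order, starting with the closest pair and aligning subalignments as clusters are joined.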

Page 15: Parallel Computational Biochemistry

Parallel Clustal

• Parallel pairwise (PW) alignment matrix

• Parallel guide tree calculation

• Parallel progressive alignment

(Figure: the pairwise distance matrix and guide tree for Scerevisiae, Celegans, Drosophila, Human, and Mouse.)
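The pairwise stage parallelizes naturally because each of the n(n-1)/2 comparisons is independent. A minimal sketch follows, assuming a round-robin assignment of pairs to p workers and a stand-in distance (one minus `difflib.SequenceMatcher.ratio()`) in place of a real alignment score. The helper names and toy sequences are hypothetical, and the thread pool only illustrates the task split; an MPI implementation would hand the same pair lists to separate processes.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def pairs_for_worker(n, rank, p):
    """Round-robin share of the n*(n-1)/2 index pairs for worker `rank` of `p`."""
    all_pairs = list(combinations(range(n), 2))
    return all_pairs[rank::p]

def stand_in_distance(a, b):
    """Placeholder for an alignment-based distance (NOT a real alignment score)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def worker(seqs, rank, p):
    """Compute the distances for this worker's share of the pairs."""
    return {(i, j): stand_in_distance(seqs[i], seqs[j])
            for i, j in pairs_for_worker(len(seqs), rank, p)}

def parallel_distance_matrix(seqs, p):
    """Each worker fills its share; the partial results are then merged."""
    matrix = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        for part in pool.map(lambda r: worker(seqs, r, p), range(p)):
            matrix.update(part)
    return matrix

# Toy fragments for illustration only.
seqs = ["MYSFPNSFRF", "MYSFPNTFRF", "GWSQAGFQSE", "MYSWPNSFRG"]
matrix = parallel_distance_matrix(seqs, p=3)
```

Round-robin assignment keeps the shares within one pair of each other in size; when sequence lengths vary widely, a cost-aware split may balance the work better.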

Page 16: Parallel Computational Biochemistry

Parallel Clustal - Improvements

• Optimization of input parameters
  – scoring matrices, gap penalties
  – requires many repetitive Clustal W calculations with various input parameters

• Minimum Vertex Cover
  – use minimum vertex cover to remove erroneous sequences, and identify clusters of highly similar sequences

Page 17: Parallel Computational Biochemistry

Minimum Vertex Cover

Conflict Graph
– vertex: sequence
– edge: conflict (e.g. alignment with very poor score)

TASK: remove smallest number of gene sequences that eliminates all conflicts
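One way to build such a conflict graph is sketched below under stated assumptions: `scores` maps each sequence pair to a pairwise alignment score, and a pair is in conflict when its score falls below a threshold. The score values, the sequence names, and the direction of the comparison are illustrative and not taken from the slides.

```python
def build_conflict_graph(names, scores, threshold):
    """Adjacency sets: one vertex per sequence, one edge per conflicting pair."""
    adj = {name: set() for name in names}
    for (a, b), score in scores.items():
        if score < threshold:          # assumed rule: poor score means conflict
            adj[a].add(b)
            adj[b].add(a)
    return adj

def is_vertex_cover(adj, removed):
    """True if deleting the sequences in `removed` leaves no conflict edge."""
    return all(u in removed or v in removed for u in adj for v in adj[u])

# Hypothetical scores for four sequences; s4 aligns poorly with all others.
names = ["s1", "s2", "s3", "s4"]
scores = {("s1", "s2"): 42, ("s1", "s3"): 38, ("s1", "s4"): 5,
          ("s2", "s3"): 40, ("s2", "s4"): 7, ("s3", "s4"): 3}
graph = build_conflict_graph(names, scores, threshold=10)
```

In this toy graph, removing the single sequence `s4` eliminates every conflict, so `{"s4"}` is a minimum vertex cover.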

Page 18: Parallel Computational Biochemistry

FPT Algorithms

• Phase 1: Kernelization

Reduce problem to size f(k)

• Phase 2: Bounded Tree Search

Exhaustive tree search; exponential in f(k)

Page 19: Parallel Computational Biochemistry

Kernelization

Buss's Algorithm for k-vertex cover

• Let G=(V,E) and let S be the subset of vertices with degree greater than k (each such vertex must belong to every vertex cover of size at most k).

• Remove S and all incident edges: G -> G', k -> k' = k - |S|.

• IF G' has more than k x k' edges THEN no k-vertex cover exists
  ELSE start bounded tree search on G'
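The kernelization step can be sketched in a few lines. This is a single-pass illustration that uses the strict threshold (degree greater than k), which is the condition under which a vertex is forced into every cover of size at most k. The function name `buss_kernel`, the edge-list input format, and the extra check that rejects the instance when more than k vertices are forced are assumptions of this sketch.

```python
def buss_kernel(edges, k):
    """Return (forced, kernel_edges, k_prime), or None if no k-vertex cover exists.

    `edges` is an iterable of (u, v) pairs of a simple undirected graph.
    """
    edge_set = {frozenset(e) for e in edges}
    degree = {}
    for e in edge_set:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # S: vertices that every cover of size <= k must contain.
    forced = {v for v, d in degree.items() if d > k}
    if len(forced) > k:
        return None                    # assumed extra check: budget already exceeded
    k_prime = k - len(forced)

    # G': drop S together with all incident edges.
    kernel = {e for e in edge_set if not (e & forced)}

    # Every remaining vertex has degree <= k, so k' vertices cover at most k*k' edges.
    if len(kernel) > k * k_prime:
        return None
    return forced, kernel, k_prime

# Toy graph: a star with centre 0 and five leaves, plus the edge (1, 2).
edges = [(0, i) for i in range(1, 6)] + [(1, 2)]
result = buss_kernel(edges, k=2)
```

The returned kernel is the input to the bounded tree search, with budget `k_prime` instead of `k`.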

Page 20: Parallel Computational Biochemistry

Bounded Tree Search

(Figure: search tree with root VC={}; every child node extends the cover, VC+=...)

Page 21: Parallel Computational Biochemistry

Case 1: simple path of length 3

in graph G': simple path v, v1, v2, v3

search tree: VC={...} with three children

VC+={v,v2}   VC+={v1,v2}   VC+={v1,v3}

remove selected vertices from G'; k' -= 2

Page 22: Parallel Computational Biochemistry

Case 2: 3-cycle

in graph G': 3-cycle on v, v1, v2

search tree: VC={...} with three children

VC+={v,v1}   VC+={v1,v2}   VC+={v,v2}

remove selected vertices from G'; k' -= 2

Page 23: Parallel Computational Biochemistry

Case 3: simple path of length 2

in graph G': simple path v, v1, v2

search tree: VC={...} with one child

VC+={v1}

remove v1, v2 from G'; k' -= 1

Page 24: Parallel Computational Biochemistry

Case 4: simple path of length 1

in graph G': simple path v, v1

search tree: VC={...} with one child

VC+={v}

remove v, v1 from G'; k' -= 1

Page 25: Parallel Computational Biochemistry

Sequential Tree Search

Depth first search

– backtrack when k'=0 and G' is not empty ("dead end")

– stop when solution found (G'={}, k'>=0)
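The four cases from the preceding slides and the depth-first rule above can be combined into one short routine. The sketch below is an illustration under assumptions: the graph is a dictionary of adjacency sets, the helper names are hypothetical, and only the chosen vertices are deleted (vertices left without edges are simply ignored rather than removed).

```python
def remove_vertices(adj, chosen):
    """Copy of the graph without the chosen vertices and their edges."""
    chosen = set(chosen)
    return {u: nbrs - chosen for u, nbrs in adj.items() if u not in chosen}

def find_path3(adj):
    """Find a simple path v - v1 - v2 - v3 (three edges), or None."""
    for v1 in adj:
        for v2 in adj[v1]:
            for v in adj[v1] - {v2}:
                for v3 in adj[v2] - {v, v1}:
                    return v, v1, v2, v3
    return None

def branches_for(adj):
    """Select the applicable case; return the candidate additions to VC."""
    path = find_path3(adj)
    if path:                                   # Case 1: path of length 3
        v, v1, v2, v3 = path
        return [(v, v2), (v1, v2), (v1, v3)]
    for v1, nbrs in adj.items():
        if len(nbrs) >= 2:
            v, v2 = list(nbrs)[:2]
            if v2 in adj[v]:                   # Case 2: 3-cycle
                return [(v, v1), (v1, v2), (v, v2)]
            return [(v1,)]                     # Case 3: take the middle vertex
    v = next(u for u, nbrs in adj.items() if nbrs)
    return [(v,)]                              # Case 4: a single edge

def vertex_cover(adj, k):
    """Depth-first bounded tree search: a cover of size <= k, or None."""
    if not any(adj.values()):
        return set()                 # no edges left: solution found
    if k <= 0:
        return None                  # dead end: edges remain, budget spent
    for chosen in branches_for(adj):
        if len(chosen) <= k:
            rest = vertex_cover(remove_vertices(adj, chosen), k - len(chosen))
            if rest is not None:
                return rest | set(chosen)
    return None                      # every child failed: backtrack

# Toy input: a cycle on five vertices, whose minimum vertex cover has size 3.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
cover = vertex_cover(cycle5, 3)
```

Because each branch spends at least one unit of the budget, the recursion depth is at most k, which is what bounds the search tree.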

Page 26: Parallel Computational Biochemistry

Parallel Tree Search

Basic Idea:

– Build top log p levels of the search tree (T')

– every processor starts depth-first search at one leaf of T'

– randomize depth-first search by selecting random child

(Figure: search tree whose top log p levels form T'.)
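The basic idea can be sketched on an abstract search tree. The code below is a sequential simulation under assumptions, not a parallel implementation: the toy tree, the `children` and `is_solution` callbacks, and both function names are hypothetical, and the "processors" are simply visited one after another.

```python
import random
from collections import deque

def build_frontier(root, children, p):
    """Expand the tree breadth-first until at least p subtree roots exist (T')."""
    frontier, leaves = deque([root]), []
    while frontier and len(frontier) + len(leaves) < p:
        node = frontier.popleft()
        kids = children(node)
        if kids:
            frontier.extend(kids)
        else:
            leaves.append(node)      # cannot be expanded further
    return leaves + list(frontier)

def randomized_dfs(node, children, is_solution, rng):
    """Depth-first search that visits the children in random order."""
    if is_solution(node):
        return node
    kids = children(node)
    rng.shuffle(kids)
    for kid in kids:
        found = randomized_dfs(kid, children, is_solution, rng)
        if found is not None:
            return found
    return None

# Toy tree: nodes are digit tuples, branching factor 3, depth 4.
def children(node):
    return [node + (d,) for d in range(3)] if len(node) < 4 else []

def is_solution(node):
    return node == (2, 1, 0, 2)

p = 9
starts = build_frontier((), children, p)      # one subtree root per processor
rng = random.Random(0)
results = [randomized_dfs(s, children, is_solution, rng) for s in starts]
```

With branching factor 3 and p = 9, the frontier is the nine nodes at depth 2, and only the processor whose subtree contains the solution reports it. In a real run all processors would search concurrently and stop as soon as any one of them succeeds.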

Page 27: Parallel Computational Biochemistry

Analysis: Balls-in-bins

sequential depth-first search path: total length L, number of solutions m

expected sequential time (rand. distr.): L/(m+1)

(Figure: sequential depth-first search path and parallel search path.)

expected parallel time (rand. distr.): p + L/(p(m+1))

expected speedup: p / (1 + (m+1)/L)

if m << L then expected speedup = p
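The balls-in-bins model is easy to check numerically. The sketch below is a Monte Carlo illustration under assumptions: m solutions are placed uniformly at random among L positions of the search path, the parallel search splits the path into p equal segments scanned in lockstep, and only the search terms are simulated (the additive setup term in the parallel time is not modelled). The function name and parameter values are hypothetical.

```python
import random

def simulate(L, m, p, trials, seed=0):
    """Average steps to the first solution: sequentially and with p searchers."""
    rng = random.Random(seed)
    segment = L // p                     # assumes p divides L
    seq_total = par_total = 0
    for _ in range(trials):
        # m solutions at distinct random positions 1..L on the search path.
        solutions = rng.sample(range(1, L + 1), m)
        # Sequential search stops at the first solution on the path.
        seq_total += min(solutions)
        # Each searcher scans its own segment; the offset inside the
        # segment is the number of steps until that solution is reached.
        par_total += min((s - 1) % segment + 1 for s in solutions)
    return seq_total / trials, par_total / trials

L, m, p = 10_000, 9, 10
seq_time, par_time = simulate(L, m, p, trials=2000)
print(f"sequential: {seq_time:.1f}  (L/(m+1) = {L / (m + 1):.1f})")
print(f"parallel:   {par_time:.1f}  (L/(p(m+1)) = {L / (p * (m + 1)):.1f})")
```

The two averages should land close to L/(m+1) and L/(p(m+1)) respectively, which is the source of the near-linear speedup when the setup cost is small compared with the search time.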

Page 28: Parallel Computational Biochemistry

Simulation Experiment

(Figure: predicted speedup versus number of processors, both axes from 0 to 200, for L = 1,000,000 and m = 10, 100, 1,000, 10,000, 100,000.)

Page 29: Parallel Computational Biochemistry

Implementation

• test platform:
  – 32 node Beowulf cluster
  – each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
  – gcc and LAM/MPI on LINUX Redhat 7.2

• code-s: Sequential k-vertex cover

• code-p: Parallel k-vertex cover

Page 30: Parallel Computational Biochemistry

HPCVL

High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)

Created by parallel computing researchers from
– Carleton U. (Comp. Sci.)
– Queen's (Engineering)
– Ottawa U. (Life Sci./Hospital)

Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants

Page 31: Parallel Computational Biochemistry

Test Data

• Protein sequences

• Same protein from several hundred species

• Each protein sequence a few hundred amino acid residues in length

• Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

Page 32: Parallel Computational Biochemistry

Test Data

• Somatostatin

– neuropeptide involved in the regulation of many functions in different organ systems

– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

Page 33: Parallel Computational Biochemistry

Test Data

• WW

– small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling

– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

Page 34: Parallel Computational Biochemistry

Test Data

• Kinase

– large family of enzymes involved in cellular regulation

– Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

Page 35: Parallel Computational Biochemistry

Test Data

• SH2 (src-homology domain 2)

– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine

– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

Page 36: Parallel Computational Biochemistry

Test Data

• Thrombin

– protease involved in the blood coagulation cascade; promotes blood clotting by converting fibrinogen to fibrin

– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

Page 37: Parallel Computational Biochemistry

Test Data

• PHD (pleckstrin homology domain)

– involved in cellular signaling

– Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

Page 38: Parallel Computational Biochemistry

Test Data

• Random Graph

|V| = 220, |E| = 2155, k = 122, k' = 122

• Grid Graph

|V| = 289, |E| = 544, k = 145, k' = 145

Page 39: Parallel Computational Biochemistry

Test Data

|VC| ~ |V| / 2 k' = k

Page 40: Parallel Computational Biochemistry

Sequential Times

Kinase, SH2, Thrombin: n/a

Page 41: Parallel Computational Biochemistry

Code-p on Virtual Proc.

Page 42: Parallel Computational Biochemistry

Parallel Times

Page 43: Parallel Computational Biochemistry

Speedup: Somatostatin

Page 44: Parallel Computational Biochemistry

Speedup: WW

Page 45: Parallel Computational Biochemistry

Speedup: Rand. Graph

Page 46: Parallel Computational Biochemistry

Speedup: Grid Graph

Page 47: Parallel Computational Biochemistry

Thank You!

• Questions?