Queensland Parallel Supercomputing Foundation 1. Professor Mark Ragan (Institute for Molecular Bioscience) 2. Dr Thomas Huber (Department of Mathematics)



Computational Biology and Bioinformatics Environment (ComBinE)

National Facility Projects


Comparison of protein families among completely sequenced microbial genomes

The scientific problem:

• Handcrafted analyses suggest that gene transfer in nature may be not only from parents to offspring (“vertical”), but also from one lineage to another (“lateral” or “horizontal”)

• From microbial genomics we have complete inventories of genes & proteins in ~80 genomes

• Comparative analysis should identify all cases of vertical and lateral gene transfer


The approach, with the computational requirement for 80 genomes:

• Find all interestingly large protein families in all microbial genomes: 10^12 BLAST comparisons

• Generate structure-sensitive multiple alignments: 5000 T-Coffee alignments

• Infer phylogenetic trees with appropriate statistics: 5000 Bayesian inference trees

• Compare trees, look for topological incongruence: 10^7 topological comparisons
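
The order of magnitude of the topological comparison count can be checked with a short calculation. This is a sketch that assumes every pair of the 5000 inferred trees is compared exactly once; the slides do not state how the figure was derived, and the function name is illustrative.

```python
from math import comb


def pairwise_comparisons(n_items):
    """Number of unordered pairs among n_items objects."""
    return comb(n_items, 2)


n_trees = 5000  # number of Bayesian inference trees quoted on the slide
print(pairwise_comparisons(n_trees))  # 12497500, i.e. about 10^7
```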


Computations on APAC National Facility

Motif-based multiple alignment

• 30-50 sequences = 2-5 hours per run

• Will need ~5000 runs @ 4-60 seqs

Bayesian inference

• Parameterisation of (MC)^3 search

• NF used for trials of up to 10^6 Markov chain generations (~200 hours / run)

• 1.5-2.0 GB RAM per run

Usage of NF:

• Code not yet parallelised

• With each run costing a few tens of hours and thousands of analyses needed, it is more efficient to use many processors simultaneously
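
The argument for running many serial jobs side by side can be made concrete with a rough wall-clock estimate. The sketch below assumes 5000 independent alignment runs of 3.5 hours each (the midpoint of the 2-5 hour range quoted above) and a hypothetical allocation of 64 processors; neither figure comes from the slides, and `estimate_wall_clock` is an illustrative helper, not part of any scheduler.

```python
import heapq


def estimate_wall_clock(run_hours, n_processors):
    """Greedy list scheduling: each run goes to the processor that frees up first.

    Returns the time (in hours) at which the last processor finishes.
    """
    finish_times = [0.0] * n_processors
    heapq.heapify(finish_times)
    for hours in run_hours:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + hours)
    return max(finish_times)


runs = [3.5] * 5000                       # assumed: 5000 runs of 3.5 hours each
print(estimate_wall_clock(runs, 1))       # 17500.0 hours on one processor
print(estimate_wall_clock(runs, 64))      # 276.5 hours on 64 processors
```

Because the runs are independent, no change to the unparallelised code is needed to obtain this speed-up; only the job submission is distributed.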


Parameterisation of Metropolis-coupled Markov chain Monte Carlo optimisation through protein tree space

[Figure: two panels plotting ln-likelihood against the number of Markov chain generations (0 to 500,000 and 0 to 600,000), showing the approach to stationarity under the Jones et al. (1992) and general time-reversible models of protein sequence change.]

Bayesian inference (MrBayes 2.0) applied to a 34-sequence Elongation Factor 1 dataset. Eight simultaneous Markov chains, discrete approximation of gamma distribution (α = 0.29), chain temperature 0.1000.
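
The idea behind Metropolis-coupled MCMC can be illustrated on a toy problem. The sketch below is not MrBayes code: it replaces tree space with a one-dimensional bimodal density, and it assumes the incremental heating scheme beta_i = 1 / (1 + T * i), with eight chains and T = 0.1 to echo the settings above. All function names are illustrative.

```python
import math
import random


def log_target(x):
    """Toy bimodal log-density (modes near -3 and +3), standing in for a
    tree log-likelihood surface. Computed with a stable log-sum-exp."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def heats(n_chains, temperature):
    """Assumed incremental heating: chain 0 is the cold chain (beta = 1)."""
    return [1.0 / (1.0 + temperature * i) for i in range(n_chains)]


def accept(rng, log_ratio):
    """Metropolis acceptance test for a given log acceptance ratio."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def mc3(n_generations, n_chains=8, temperature=0.1, step=1.0, seed=0):
    """Return the states visited by the cold chain, one per generation."""
    rng = random.Random(seed)
    betas = heats(n_chains, temperature)
    states = [0.0] * n_chains
    cold_samples = []
    for _ in range(n_generations):
        # Within-chain Metropolis update on the heated target pi(x)^beta.
        for i, beta in enumerate(betas):
            proposal = states[i] + rng.gauss(0.0, step)
            if accept(rng, beta * (log_target(proposal) - log_target(states[i]))):
                states[i] = proposal
        # Attempt to swap the states of two randomly chosen chains.
        i, j = rng.sample(range(n_chains), 2)
        log_swap = (betas[i] - betas[j]) * (log_target(states[j]) - log_target(states[i]))
        if accept(rng, log_swap):
            states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples


samples = mc3(5000)
print(len(samples), min(samples), max(samples))
```

Heated chains see a flattened surface and cross between modes more easily; swaps pass those states to the cold chain, which is the only chain sampled.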


With thanks to collaborators

Mark Borodovsky, Georgia Tech

Robert Charlebois, NGI Inc. (Ottawa)

Tim Harlow, University of Queensland

Jeffrey Lawrence, University of Pittsburgh

Thomas Rand, St Mary’s University




Protein Structure Prediction

Two Lineages

• The bioinformatics approach

  – Compare sequence to other sequences: huge datasets (0.5 × 10^6 sequences)

  – Match sequence with known structure (low-resolution force field development)

• The biophysics approach

  – Simulations that mimic natural behaviour


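Hardware Requirements:

• Compare sequence to other sequences: CPU minutes/seq, Mem 1 GB

• Match sequence with known structure: CPU hours/seq, Mem 100s MB

• Simulations that mimic natural behaviour: CPU 100s hours, Mem 10s MB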


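Parallelism:

• Compare sequence to other sequences: trivially parallel

• Match sequence with known structure: trivially parallel

• Simulations that mimic natural behaviour: hard to parallelise (high bandwidth + low latency requirement)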


Force splitting and multiple time step integration (Ian Lenane)

MD Simulation: Propagating Molecular Models in Time

• Start with old system state

• Add information on energy and force

• Apply numerical integrator

• New system state

Mechanical description: Newton’s laws of motion

• Time step required: 10^-15 s

• Time scale wanted: >10^-3 s

System is split into different domains

• Fast-varying forces (cheap to calculate) are integrated more frequently

• Slowly varying forces (expensive to calculate) are integrated less frequently

+ More efficient integration

+ Easy to expand to parallel simulations
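
A two-level multiple time step scheme of this kind can be sketched as follows. This is an illustrative r-RESPA-style integrator, not the code used in the project: the slow force is applied as half kicks around an inner velocity Verlet loop over the fast force. The one-dimensional test system (a stiff and a weak harmonic spring) and all constants are assumptions chosen for the example.

```python
def respa_step(x, v, mass, fast_force, slow_force, dt_outer, n_inner):
    """Advance (x, v) by one outer step of a two-level multiple time step scheme."""
    dt_inner = dt_outer / n_inner
    v += 0.5 * dt_outer * slow_force(x) / mass       # half kick from the slow force
    for _ in range(n_inner):                          # velocity Verlet on the fast force
        v += 0.5 * dt_inner * fast_force(x) / mass
        x += dt_inner * v
        v += 0.5 * dt_inner * fast_force(x) / mass
    v += 0.5 * dt_outer * slow_force(x) / mass       # second half kick, slow force
    return x, v


# Assumed toy system: one particle on a stiff (fast) and a weak (slow) spring.
K_FAST, K_SLOW, MASS = 100.0, 1.0, 1.0


def fast_force(x):
    return -K_FAST * x


def slow_force(x):
    return -K_SLOW * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * (K_FAST + K_SLOW) * x * x


def simulate(n_outer_steps, dt_outer=0.05, n_inner=10):
    """Integrate from x = 1, v = 0; the slow force is evaluated once per outer step."""
    x, v = 1.0, 0.0
    for _ in range(n_outer_steps):
        x, v = respa_step(x, v, MASS, fast_force, slow_force, dt_outer, n_inner)
    return x, v


x, v = simulate(1000)
print(total_energy(1.0, 0.0), total_energy(x, v))
```

The expensive slow force is evaluated once per outer step while the cheap fast force is evaluated `n_inner` times, which is where the efficiency gain comes from.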


Path simulations (Ben Gladwin)

What if start (x1, y1) and end (x2, y2) points are given?

• Proteins: unfolded → folded

• Molecular machines: 1 cycle

• Shortest path calculations

  – Floyd, Dijkstra

• Hamilton’s principle of least action

$$\arg\min \{S\}, \qquad S = \int_{t_0}^{t_1} dt \, \left[ 0.5\, m\, v(t)^2 - U(q(t)) \right]$$

+ Computationally very attractive

• Extremely long time steps

• Very well suited for parallel architectures (Floyd algorithm parallelised, but performance problems with >4 PE on -GS NUMA architecture)
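
For reference, a minimal serial sketch of the Floyd all-pairs shortest path algorithm is given below. The four-node graph is made up for illustration, and the parallelised version mentioned above is not reproduced. For a fixed intermediate node `k`, the updates of different rows `i` are independent of one another, which is the usual starting point for distributing the work across processors.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest path lengths for a dense weight matrix.

    weights[i][j] is the edge length from i to j (math.inf if absent,
    0 on the diagonal). Assumes there are no negative cycles.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):                 # allow node k as an intermediate stop
        row_k = dist[k]
        for i in range(n):             # rows are independent for a fixed k
            d_ik = dist[i][k]
            if d_ik == math.inf:
                continue
            row_i = dist[i]
            for j in range(n):
                candidate = d_ik + row_k[j]
                if candidate < row_i[j]:
                    row_i[j] = candidate
    return dist


INF = math.inf
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
for row in floyd_warshall(graph):
    print(row)
# [0, 3, 5, 6]
# [5, 0, 2, 3]
# [3, 6, 0, 1]
# [2, 5, 7, 0]
```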


National Facility supercomputer use

• 2001 CPU quota: 2 × 5250 + 8000 service units

  – Total use: 12000 units (3000 units in parallel)

• 2002 CPU quota: 4 × 6000 service units

  – First quarter: 2000 units

  – Second quarter: 85 units

Collaborators

• Dr A. Torda (ANU): low-resolution force fields / protein structure prediction

• Prof. D. Hume, A/Prof. B. Kobe and Dr J. Martin (UQ): structural genomics project

• Prof. K. Burrage, I. Lenane and B. Gladwin (UQ): numerical integration and path simulations

Special Thanks

• Mrs J. Jenkinson and Dr D. Singleton (NF/ANUSF)