DESCRIPTION
From Sequence Analysis to Simulations: Applications of HPC in Modern Biology. R. Sankararamakrishnan Department of Biological Sciences & Bioengineering IIT-Kanpur. IIT-K REACH Symposium 2010 Oct 9 th 2010. Computers and Computing in Biology. Mathematical Biology Biostatistics - PowerPoint PPT Presentation
From Sequence Analysis to Simulations: Applications of HPC in Modern Biology
R. Sankararamakrishnan
Department of Biological Sciences & Bioengineering
IIT-Kanpur
IIT-K REACH Symposium 2010
Oct 9th 2010
Computers and Computing in Biology
Bioinformatics
Computational Biology
Mathematical Biology
Biostatistics
Biomathematics
Quantitative Biology
Biophysics
What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
- NIH Definition http://www.bisti.nih.gov/
Definitions
HPC Applications: Three examples
Evolutionary relationship among a given set of protein or DNA sequences
Drug Discovery and Design
Structure-function relationship of large biomolecular assemblies
Phylogeny and Phylogenetic tree
Study of evolutionary relationships (sequences/species)
Relationships between organisms with common ancestor
Phylogenetic tree is a graph representing evolutionary history of sequences/species
[Figure: rooted tree and unrooted tree for Human, Chimpanzee, Gorilla and Orangutan; the rooted tree indicates the direction of evolution]
Phylogenetic trees can be represented in two different ways
Rooted tree: has a unique root node, the common ancestor of all sequences/species in the tree
Unrooted tree: makes no assumption about common ancestry
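Number of rooted trees for n species (n >= 2):
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
Number of unrooted trees for n species (n >= 3):
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$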
Species   Number of Rooted Trees               Number of Unrooted Trees
2         1                                    1
3         3                                    1
4         15                                   3
5         105                                  15
10        34,459,425                           2,027,025
15        213,458,046,676,875                  7,905,853,580,625
20        8,200,794,532,637,891,559,375        221,643,095,476,699,771,875
Number of possible unrooted and rooted trees
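The growth in the number of trees can be checked directly from the two counting formulas, N_R = (2n-3)!/(2^(n-2) (n-2)!) and N_U = (2n-5)!/(2^(n-3) (n-3)!). The following is a minimal Python sketch; the function names are illustrative and not part of the original slides, and the n = 2 unrooted case is set to 1 by convention.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Rooted bifurcating trees for n species: (2n-3)! / (2^(n-2) * (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees for n species: (2n-5)! / (2^(n-3) * (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        # The formula needs n >= 3; two species admit exactly one tree.
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print the counts for a few species numbers.
for n in (2, 3, 4, 5, 10, 15, 20):
    print(f"{n:>2}  {rooted_trees(n):>32,}  {unrooted_trees(n):>30,}")
```

Each rooted count for n species equals the unrooted count for n + 1 species, since rooting a tree is equivalent to attaching one extra branch.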
Maximum likelihood phylogeny problem is NP-hard
Very CPU intensive
For trees containing more than 20 to 25 sequences, an exhaustive search for the optimal tree is no longer feasible
Efficient heuristic tree search algorithms are required to reduce the size of the search space
Recently developed algorithms:
IQPNNI, PHYML, GARLI, RAxML
None of these algorithms is guaranteed to find the ML tree; they yield only the best-known ML tree
Computing phylogenetic trees using ML method
RAxML performance in some HPC platforms
Ott et al. (2008)
212 sequences, 566,470 base pairs
One of the largest datasets analyzed under ML
IBM BlueGene/L; 1024 CPUs
7 distinct tree searches in 14 hours
Phylogenetic analysis of plant channel proteins identified new subfamily
Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007)
Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)
“Is there really a case where a drug that is on the market was designed by a computer?”
“The reality is that the use of computers and computer methods permeates all aspects of drug discovery today”
Jorgensen (2004)
Roles of Computation in Drug Discovery
“Drug discovery is complex: Successful teams and companies need to be congratulated, whereas search for one individual or computer program is counterproductive. There is not going to be a voila moment at the computer terminal. Instead, there is systematic use of wide-ranging computational tools to facilitate and enhance the drug discovery process”
Computation in Drug Discovery
Jorgensen (2004)
Structure-based Drug Design – An Introduction
http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html
http://www.biocryst.com/our_science
Lead Generation
Lead optimization
De novo design
Virtual screening
Bleicher et al. (2003)
All drugs presently on the market are estimated to target fewer than 500 biomolecules
Docking & Scoring
Drug targets and Drug discovery: Issues
Issues: Scoring function, solvent effect and protein flexibility
Four proteins: trypsin, HIV PR, CDK2 and AChE
Test set for each protein: 10,000 randomly selected compounds
6000 docking poses were selected for the top 1000 compounds
They served as initial conformations for MD simulations
Combination of docking and MD showed a higher and more stable enrichment performance than docking method used alone
A special purpose computer, MDGRAPE-3, was used for MD simulations
It is a cluster of personal computers, each equipped with 24 MDGRAPE-3 chips and with a peak speed of approximately 2 Tflops
50 such computers were used
Average computational time for a single protein-ligand complex is 2.5 h
For 6,000 protein-ligand conformations, calculations were completed in a week
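The "enrichment performance" mentioned above is commonly quantified with an enrichment factor: the fraction of known actives among the top-ranked compounds divided by the fraction of actives in the whole library. The sketch below assumes that standard definition; the slides do not state which enrichment measure the study used, and the count of 100 actives is a hypothetical number chosen only for illustration.

```python
def enrichment_factor(ranked_is_active, top_fraction):
    """Enrichment factor for a ranked library.

    ranked_is_active: list of booleans, best-scored compound first.
    top_fraction: fraction of the library inspected, e.g. 0.1 for the top 10%.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    n_top = max(1, round(n_total * top_fraction))
    hits_top = sum(ranked_is_active[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), written to avoid rounding noise
    return (hits_top * n_total) / (n_top * n_actives)


# Hypothetical example: 10,000 compounds, 100 actives, 30 of them in the top 1000.
ranked = [True] * 30 + [False] * 970 + [True] * 70 + [False] * 8930
print(enrichment_factor(ranked, 0.1))  # 3.0
```

A value of 1 corresponds to random selection; values above 1 mean the ranking concentrates actives near the top of the list.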
Steered Molecular Dynamics to compute the force required to extract the inhibitors from enzymes
A small spring is connected to the ligand in the complex
The free end of the spring is pulled at constant velocity into the surrounding water
Force is determined from the extension of the spring and recorded as a function of time
Strongly bound inhibitors: higher peak forces
Weaker inhibitors: flatter profiles
Steered MD in Drug Discovery
Jorgensen, 2010
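The force read-out in constant-velocity steered MD can be sketched as follows. The spring anchor moves as x_start + v t, and the force is the spring constant times the spring extension, F(t) = k (x_start + v t - x(t)). This is a one-dimensional toy illustration, not the protocol of the cited work; the function names, the trajectory values and the units are all hypothetical.

```python
def smd_force(k, v, t, x_ligand, x_start=0.0):
    """Spring force on the ligand: k * (anchor position - ligand position)."""
    anchor = x_start + v * t  # the anchor is pulled at constant velocity v
    return k * (anchor - x_ligand)


def force_profile(k, v, times, positions, x_start=0.0):
    """Force as a function of time for a recorded ligand trajectory."""
    return [smd_force(k, v, t, x, x_start) for t, x in zip(times, positions)]


# Hypothetical trajectory: the ligand stays bound (x = 0) while the spring
# stretches, then releases and trails the anchor at a small extension.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
positions = [0.0, 0.0, 0.0, 2.5, 3.5]
profile = force_profile(k=2.0, v=1.0, times=times, positions=positions)
print(profile)       # [0.0, 2.0, 4.0, 1.0, 1.0]
print(max(profile))  # peak force, reached just before unbinding: 4.0
```

In this picture a strongly bound inhibitor holds its position longer while the spring stretches, which gives a higher peak before release; a weakly bound one follows the anchor early and produces a flatter profile.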
Protein-protein interactions in programmed cell death
Lama and Sankararamakrishnan, Proteins (2008)
Lama and Sankararamakrishnan, Biochemistry (2010)
Bcl-2 family complex structures
Total number of atoms: ~50,000 to ~75,000
Simulation period: 50 ns
GlpF: 81,006 atoms
AQP1: 75,057 atoms
PfAQP: 81,503 atoms
A 30 ns production run was performed for each of the three systems.
Each simulation takes ~40 days of CPU time (total CPU time ~120 days).
MD simulations of channel proteins in bilayers
Alok Jain, Ravi Verma and R. Sankararamakrishnan, Manuscript in preparation
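The cost figures quoted for the three channel-protein systems imply a simple throughput estimate. The short sketch below reproduces that arithmetic from the numbers on the slide (30 ns per system, ~40 days per simulation, three systems); the function names are illustrative only.

```python
def throughput_ns_per_day(simulated_ns, days):
    """Simulated nanoseconds obtained per day of compute time."""
    return simulated_ns / days


def total_days(n_systems, days_per_system):
    """Total compute time when each system is simulated separately."""
    return n_systems * days_per_system


print(throughput_ns_per_day(30, 40))  # 0.75 ns per day for each ~80,000-atom system
print(total_days(3, 40))              # 120 days for GlpF, AQP1 and PfAQP together
```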
Complete virus: 1 million atoms (Freddolino et al., 2006)
Arrays of light-harvesting proteins – 1 million atoms (Chandler et al., 2008)
Simulations reaching the million-atom mark
BAR domain proteins – 2.3 million atoms (Yin et al., 2009)
The flagellum – 2.4 million atoms (Kitao et al., 2006)
Minimization and equilibration
Cluster of 48 AMD Athlon 2600+ processors
Simulation
256 Altix nodes at NCSA @UIUC
1.1 ns/day
Complete virus: 1 million atoms
(Freddolino et al., 2006)
Gumbart et al. (2009)
2.7 million atoms
50 ns simulation
MD of protein-conducting channel bound to ribosome
Largest system simulated to date
Bacterial ribosomes are important targets for antibiotics
HPC Platforms for Biology Applications
FPGA boards: Field-programmable gate arrays (FPGAs) are ICs that can be programmed after manufacture. FPGA boards implementing commonly used bioinformatics algorithms are available
Graphics-Processing Unit (GPU): All bioinformatics applications
Grid Computing: Many applications
Distributed Computing: Protein folding, Drug docking
Cloud Computing:
Acknowledgements
Anjali Bansal
Dilraj Lama
Alok Jain
Tuhin Kumar Pal
Priyanka Srivastava
Vivek Modi
Ravi Kumar Verma
Krishna Deepak
Phani Deep
DST, DBT, CSIR, MHRD