CS369 Computational Biology. University of Auckland. Course coordinator: Alexei Drummond. This course provides an overview of algorithms and scientific computing techniques in computational biology and bioinformatics.
CS369 Computational Biology
University of Auckland
Course coordinator: Alexei Drummond
CS369 2007 2
CS369 Computational Biology
• This course provides an overview of algorithms and scientific computing techniques in computational biology and bioinformatics.
• A hands-on introduction to topics including
– Dynamic programming and string algorithms
– Markov models, hidden Markov models
– Heuristic search algorithms
– Pairwise and multiple sequence alignment
– Phylogenetic reconstruction
– Gene finding and secondary structure prediction
Your lecturers
Alexei Drummond
Coordinator
Michael Dinneen
Algorithmics
Patricia Riddle
Learning and pattern discovery
Recommended Texts
• Bioinformatics: Sequence and Genome Analysis, 1st & 2nd Editions
– by David W. Mount (2001; 2004)
– 1st edition is available as an e-book in the library
– 2nd edition is available on short-term loan
• Biological Sequence Analysis
– by Durbin, Eddy, Krogh and Mitchison (1998)
– Available on short-term loan
• Algorithm Design
– by Kleinberg J. and Tardos E. (2006)
Assessment
• 2 x 2 hour labs (3rd April and 29th May)
– 5% each
• 1 assignment (1st May handout)
– 20%
• 1 mid-semester test
– 10%
• 1 exam
– 60%
Lecture schedule - first half
Lecture schedule - second half
What is computational biology?
• Computers and biology?
• The organization and analysis of biological data with computers?
• The new biology of the 21st century?
What is bioinformatics?
• Rapidly growing biological databases contain information about
– Cellular and molecular biology
– Ecology and Evolutionary biology
– Microbiology
– Genomics
– Proteomics
• The application of computational biology to understand this data.
What is computational biology?
• Computational biology combines the tools and techniques of
– mathematics,
– statistics,
– computer science,
– biology
• in order to understand the masses of biological data
Computational biology and the tree of life
The human 18S gene
The frog 18S gene
Pairwise alignment of human and frog
Humans and frogs are 94% similar!
Pairwise sequence alignment
Sequences
x = a c g g t s
y = a w g c c t t
Alignment
x = a – c g g – t s
y = a w – g c c t t
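One common way to summarise an alignment like this one is percent identity: the share of columns in which both rows carry the same symbol. The Python sketch below is illustrative only. The function name `percent_identity`, the ASCII gap symbol, and the choice to count gap columns in the denominator are assumptions, and a similarity figure such as the 94% quoted for human and frog may have been computed under a different convention.

```python
GAP = "-"  # assumed gap symbol (the slides draw gaps with an en dash)


def percent_identity(row_x: str, row_y: str) -> float:
    """Percentage of alignment columns where both rows hold the same non-gap symbol."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    if not row_x:
        raise ValueError("alignment is empty")
    identical = sum(1 for a, b in zip(row_x, row_y) if a == b and a != GAP)
    # Gap columns are counted in the denominator (an assumed convention).
    return 100.0 * identical / len(row_x)


# The toy alignment of x and y, written without spaces.
x_aligned = "a-cgg-ts"
y_aligned = "aw-gcctt"
print(percent_identity(x_aligned, y_aligned))
```

Other conventions divide by the number of ungapped columns or by the length of the shorter sequence, so the convention should always be stated alongside the number.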
Dynamic programming algorithm
• Computation is carried out bottom-up
• Store solutions to subproblems in a table
• All possible subproblems are solved once each, beginning with the smallest subproblems
• Work up to the original problem instance
• Only optimal solutions to subproblems are used to compute the solution to the problem at the next level
• DO NOT carry out the computation in a naive recursive, top-down manner – the same subproblems would be solved many times
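The bottom-up table filling described above can be sketched for pairwise global alignment, in the style of the Needleman-Wunsch algorithm. This is a minimal illustration rather than the course's reference implementation: the function name and the match, mismatch and gap scores are hypothetical values chosen only for the example.

```python
def global_alignment_score(x: str, y: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the best global alignment score of x and y (scores are hypothetical)."""
    n, m = len(x), len(y)
    # table[i][j] holds the best score for aligning the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]

    # Smallest subproblems: a prefix aligned against the empty string is all gaps.
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap

    # Work up to the full problem; each cell uses only optimal smaller solutions.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                table[i - 1][j] + gap,       # x[i-1] against a gap
                table[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return table[n][m]


print(global_alignment_score("acggts", "awgcctt"))
```

Each of the (n+1) x (m+1) cells is computed exactly once, so the running time is proportional to the product of the sequence lengths, whereas a naive recursion would revisit the same prefix pairs many times.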
Principle of Optimality
[Figure labels: Auckland, Te Kuiti, Wellington]
Scoring
• Numeric score associated with each column
• Total score = sum of column scores
• Column types:
(1) Identical (+ve)
(2) Conservative (+ve)
(3) Non-conservative (-ve)
(4) Gap (-ve)
x = a – c g g – t s
y = a w – g c c t t
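The column-by-column scoring above can be sketched as follows. The slide gives only the sign of each column type, so the numeric values here are hypothetical, as is the set of substitutions treated as conservative (a single pair, s and t, chosen purely for illustration).

```python
GAP = "-"  # assumed gap symbol (the slides draw gaps with an en dash)

# Hypothetical scores: only their signs come from the slide.
IDENTICAL_SCORE = 2
CONSERVATIVE_SCORE = 1
NON_CONSERVATIVE_SCORE = -1
GAP_SCORE = -2

# Hypothetical conservative substitutions, stored as unordered pairs.
CONSERVATIVE_PAIRS = {frozenset("st")}


def column_score(a: str, b: str) -> int:
    """Score one alignment column according to its type."""
    if a == GAP or b == GAP:
        return GAP_SCORE
    if a == b:
        return IDENTICAL_SCORE
    if frozenset((a, b)) in CONSERVATIVE_PAIRS:
        return CONSERVATIVE_SCORE
    return NON_CONSERVATIVE_SCORE


def alignment_score(row_x: str, row_y: str) -> int:
    """Total score = sum of the column scores."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(a, b) for a, b in zip(row_x, row_y))


print(alignment_score("a-cgg-ts", "aw-gcctt"))
```

In practice the identical, conservative and non-conservative scores are usually read from a substitution matrix rather than hard-coded as three constants.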
Comparing many species
This is called a “multiple sequence alignment”
Tree of 3000 species
Molecular tracking of the Southern Saltmarsh Mosquito
GTGAATGCCATTGCA
extract, multiply, decode, align, build tree
Bioinformatics and genome evolution
Computational biology and disease
• Bacterial infection can cause serious illnesses, e.g.
– Necrotising fasciitis (“flesh-eating disease”)
– Toxic shock syndrome (TSS)
• (e.g. the illness Lana Cockroft got from the cut in her foot during “Celebrity Treasure Island”)
T: Staphylococcus aureus
B: Streptococcus pyogenes
Computational biology and Disease
Necrotizing fasciitis
Computational biology and Disease
Identify protein structure of superantigens - so that potential drug targets can be uncovered.
Computational biology and sequencing technology
• “Old” Technology
– 1977
– Small scale
– Slow
– Length: long ≈ 700bp
• New Technology: 454
– 2003
– Large scale: whole genome
– Fast
– Length: short ≈ 100bp
– http://www.454.com
Bring the mammoth back
[Figure labels: African, Mammoth, Indian]
Using this new sequencing technology and bioinformatics to compare the sequences to the African elephant genome, scientists sequenced 28 million nucleotides of the mammoth genome from frozen bones!
Where does the DNA come from?
• Scientists Sequence DNA of Woolly Mammoth (extinct 10,000 years ago)
– http://www.science.psu.edu/alert/Schuster12-2005.htm
Conclusion
• Computational biology and bioinformatics are fast becoming central to the biological sciences
– Ecology
– Evolutionary biology (tree of life)
– Health and disease (flesh-eating bacteria)
– Molecular biology and cell biology
– Personalized medicine
• Computational biology is heavily algorithmic and often statistical, drawing from the ‘hard-as-in-rock’ sciences like mathematics, computer science and statistics - this type of knowledge is becoming increasingly important in biology.
Reading
• Chapter 1 of “Biological Sequence Analysis”, Durbin et al.
• Chapter 1 of Mount
– Don’t worry about any biological terms that you don’t immediately understand! This chapter is more to give you a sense of where we are going. The pertinent details will be explained along the way.
Biomolecules and the Central Dogma of Molecular Biology
The central dogma of molecular biology
DNA → RNA (transcription)
RNA → Protein (translation)
DNA → DNA (replication)
What is a macromolecule?
• A chain molecule made up of a large number of repeating units.
– Homo-polymers are made up of the same units
– Hetero-polymers are made up of a number of different units
• Nylon, Polyester, DNA (poly-nucleotide) and Protein (poly-amino acid) are examples
• Macromolecular 3D structure is a question of conformational change rather than of breaking and forming strong bonds.
Biomolecular sequences
DNA: 5’-ACGATCGACTGGTATATCGATGCT-3’, X_i ∈ {A,C,G,T}
RNA: 5’-ACGAUCGACUGGUAUAUCGAUGCU-3’, X_i ∈ {A,C,G,U}
Protein: MFINRWLFSTNHKDIGTLYLLFGAW, X_i ∈ {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
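The three alphabets, and the relationship between the DNA and RNA example sequences, can be sketched in Python. The names `is_valid` and `transcribe` are hypothetical, and the sketch treats the DNA string as the coding strand, so that transcription amounts to writing U in place of T.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ARNDCEQGHILKMFPSTWYV")  # the 20 standard amino acids


def is_valid(sequence: str, alphabet: set) -> bool:
    """True if every symbol X_i of the sequence belongs to the alphabet."""
    return all(symbol in alphabet for symbol in sequence)


def transcribe(dna: str) -> str:
    """Write the RNA version of a DNA coding strand (T becomes U)."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("sequence contains symbols outside {A,C,G,T}")
    return dna.replace("T", "U")


dna = "ACGATCGACTGGTATATCGATGCT"
rna = transcribe(dna)
print(rna)
print(is_valid(rna, RNA_ALPHABET))
print(is_valid("MFINRWLFSTNHKDIGTLYLLFGAW", PROTEIN_ALPHABET))
```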
DNA
5’-ACGATCGACTGGTATATCGATGCT-3’
3’-TGCTAGCTGACCATATAGCTACGA-5’
In nature DNA exists as a double helix of two complementary sequences.
Watson-Crick base pairing: A - T, G - C
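Watson-Crick pairing means either strand determines the other. The short Python sketch below derives the second strand from the first; the function names are hypothetical.

```python
# Translation table for Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Partner strand, read in the opposite chemical direction (3' to 5')."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Partner strand, read in the conventional 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ACGATCGACTGGTATATCGATGCT"  # 5' to 3'
print(complement(top))            # 3' to 5'
print(reverse_complement(top))    # 5' to 3'
```

Because sequences are conventionally written 5’ to 3’, the reverse complement is the form that normally appears in sequence files and databases.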
RNA
5’-ACGAUCGACUGGUAUAUCGAUGCU-3’
Base pairing: A - U, G - C, G - U
[Figure labels: 3 H bonds, 2 H bonds, 1 H bond; 3D structure; Secondary structure]
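The RNA pairing rules listed above (A - U, G - C and G - U) can be sketched as a lookup of unordered pairs. The function names and the example segments are hypothetical, and the sketch only checks whether two segments could form a stem, not whether that stem would be stable.

```python
# Unordered base pairs permitted in RNA secondary structure, as listed on the slide.
RNA_PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def can_pair(a: str, b: str) -> bool:
    """True if bases a and b may pair (the order of the two bases does not matter)."""
    return frozenset((a, b)) in RNA_PAIRS


def can_form_stem(left: str, right: str) -> bool:
    """True if the 5' segment `left` can pair base by base with the 3' segment `right`.

    The two segments run antiparallel, so `left` is compared with `right` reversed.
    """
    if len(left) != len(right):
        return False
    return all(can_pair(a, b) for a, b in zip(left, reversed(right)))


# Hypothetical segments chosen for illustration.
print(can_form_stem("GGAC", "GUCU"))
```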
Proteins
• Proteins are hetero-polymers made up of 20 standard amino acids and are “coded for” by genes.
• They form 3D molecular structures that do work in the cell.
• An open problem is to determine structure computationally from the primary sequence of amino acids.
A cartoon of the 3D structure of the myoglobin protein
The genetic code
Amino acids
The genetic code
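As an illustration of how a genetic code table is used, the Python sketch below builds the standard genetic code and translates a coding sequence three bases at a time. It assumes the standard code (other codes exist, for example in mitochondria), takes the DNA coding strand as input rather than RNA, and uses hypothetical names throughout.

```python
from itertools import product

# Standard genetic code, one amino acid per codon, with codons enumerated in
# T, C, A, G order at each position. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ACGATCGACTGGTATATCGATGCT"))
```

Four bases read in triplets give 64 codons for 20 amino acids plus stop signals, which is why most amino acids are encoded by more than one codon.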