CS369 Computational Biology University of Auckland Course coordinator: Alexei Drummond


Page 1: CS369 Computational Biology

CS369 Computational Biology

University of Auckland

Course coordinator: Alexei Drummond

Page 2: CS369 Computational Biology

CS369 2007 2

CS369 Computational Biology

• This course provides an overview of algorithms and scientific computing techniques in computational biology and bioinformatics.

• A hands-on introduction to topics including
– Dynamic programming and string algorithms
– Markov models and hidden Markov models
– Heuristic search algorithms
– Pairwise and multiple sequence alignment
– Phylogenetic reconstruction
– Gene finding and secondary structure prediction

Page 3: CS369 Computational Biology


Your lecturers

Alexei Drummond

Coordinator

Michael Dinneen

Algorithmics

Patricia Riddle

Learning and pattern discovery

Page 4: CS369 Computational Biology


Recommended Texts

• Bioinformatics: Sequence and Genome Analysis, 1st and 2nd editions
– by David W. Mount (2001; 2004)
– 1st edition is available as an e-book in the library
– 2nd edition is available on short-term loan

• Biological Sequence Analysis
– by Durbin, Eddy, Krogh and Mitchison (1998)
– Available on short-term loan

• Algorithm Design
– by Kleinberg J. and Tardos E. (2006)

Page 5: CS369 Computational Biology


Assessment

• 2 x 2-hour labs (3rd April and 29th May)
– 5% each

• 1 assignment (1st May handout)
– 20%

• 1 mid-semester test
– 10%

• 1 exam
– 60%

Page 6: CS369 Computational Biology


Lecture schedule - first half

Page 7: CS369 Computational Biology


Lecture schedule - second half

Page 8: CS369 Computational Biology


What is computational biology?

• Computers and biology?

• The organization and analysis of biological data with computers?

• The new biology of the 21st century?

Page 9: CS369 Computational Biology


What is bioinformatics?

• Rapidly growing biological databases contain information about

– Cellular and molecular biology

– Ecology and Evolutionary biology

– Microbiology

– Genomics

– Proteomics

• Bioinformatics is the application of computational biology to understand this data.

Page 10: CS369 Computational Biology


What is computational biology?

• Computational biology combines the tools and techniques of
– mathematics,
– statistics,
– computer science,
– biology

• in order to understand the masses of biological data

Page 11: CS369 Computational Biology


Computational biology and the tree of life


Page 13: CS369 Computational Biology

The human 18S gene

Page 14: CS369 Computational Biology

The frog 18S gene

Page 15: CS369 Computational Biology


Pairwise alignment of human and frog

Humans and Frogs are 94% similar!

Page 16: CS369 Computational Biology


Pairwise sequence alignment

Sequences

x = a c g g t s
y = a w g c c t t

Alignment

x = a – c g g – t s
y = a w – g c c t t
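A similarity figure such as "94% similar" is typically read off an alignment as the fraction of identical columns. The sketch below is one hedged way to compute that for the small alignment on this slide. The function name `percent_identity`, the ASCII `-` gap character and the choice of denominator (all alignment columns) are assumptions for illustration; other definitions of percent identity exist.

```python
def percent_identity(x_row: str, y_row: str, gap: str = "-") -> float:
    """Fraction of alignment columns where both rows hold the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(x_row) != len(y_row):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(x_row, y_row) if a == b and a != gap)
    return identical / len(x_row)


# The alignment from the slide, with gaps written as ASCII "-".
x_row = "a-cgg-ts"
y_row = "aw-gcctt"
print(f"{percent_identity(x_row, y_row):.1%}")  # prints 37.5%
```

Three of the eight columns are identical (a/a, g/g, t/t), which gives 37.5% for this toy example.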

Page 17: CS369 Computational Biology


Dynamic programming algorithm

• Computation is carried out bottom-up
• Store solutions to subproblems in a table
• All possible subproblems are solved once each, beginning with the smallest subproblems
• Work up to the original problem instance
• Only optimal solutions to subproblems are used to compute the solution to the problem at the next level
• DO NOT carry out the computation in a recursive, top-down manner – the same subproblems would be solved many times
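The bullets above can be sketched as a bottom-up table fill for global pairwise alignment, in the style usually attributed to Needleman and Wunsch. This is a minimal illustration, not necessarily the algorithm as taught later in the course: the function name and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are assumptions.

```python
def global_alignment_score(x: str, y: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of x and y, computed bottom-up.

    table[i][j] holds the optimal score for aligning x[:i] with y[:j].
    Scoring values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]

    # Smallest subproblems: a prefix aligned against the empty string.
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap

    # Each larger subproblem uses only optimal solutions to smaller ones.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                table[i - 1][j] + gap,       # x[i-1] against a gap
                table[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return table[n][m]


print(global_alignment_score("acggts", "awgcctt"))
```

Every cell is computed exactly once, so the running time is proportional to the product of the two sequence lengths, whereas a plain recursive version without a table would revisit the same prefixes repeatedly.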

Page 18: CS369 Computational Biology

Principle of Optimality

[Figure labels: Auckland, Te Kuiti, Wellington]

Page 19: CS369 Computational Biology


Scoring

• Numeric score associated with each column
• Total score = sum of column scores
• Column types:
(1) Identical (+ve)
(2) Conservative (+ve)
(3) Non-conservative (-ve)
(4) Gap (-ve)

x = a – c g g – t s
y = a w – g c c t t
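The scoring rule on this slide (total score = sum of column scores) can be sketched as follows. The numeric values and the set of "conservative" pairs are hypothetical and chosen only to show the four column types; real scoring schemes use substitution matrices.

```python
# Assumed, illustrative column scores for the four column types.
IDENTICAL, CONSERVATIVE, NON_CONSERVATIVE, GAP = 2, 1, -1, -2

# Hypothetical: treat an s/t substitution as conservative.
CONSERVATIVE_PAIRS = {frozenset("st")}


def column_score(a: str, b: str, gap_char: str = "-") -> int:
    """Score one alignment column according to its type."""
    if a == gap_char or b == gap_char:
        return GAP
    if a == b:
        return IDENTICAL
    if frozenset((a, b)) in CONSERVATIVE_PAIRS:
        return CONSERVATIVE
    return NON_CONSERVATIVE


def alignment_score(x_row: str, y_row: str) -> int:
    """Total score = sum of column scores."""
    if len(x_row) != len(y_row):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(x_row, y_row))


# The alignment from the slide, with gaps written as ASCII "-".
print(alignment_score("a-cgg-ts", "aw-gcctt"))  # prints 0
```

With these assumed values the eight columns score 2, -2, -2, 2, -1, -2, 2 and 1, which sum to 0.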

Page 20: CS369 Computational Biology


Comparing many species

This is called a “multiple sequence alignment”


Page 22: CS369 Computational Biology


Tree of 3000 species

Page 23: CS369 Computational Biology


Molecular tracking of the Southern Saltmarsh Mosquito

GTGAATGCCATTGCA

extract, multiply, decode, align, build tree

Page 24: CS369 Computational Biology


Bioinformatics and genome evolution


Page 25: CS369 Computational Biology


Computational biology and disease

• Bacterial infection can cause serious illnesses, e.g.
– Necrotising Fasciitis (“Flesh eating disease”)
– Toxic Shock Syndrome (TSS)
• (e.g. the illness Lana Cockroft got from the cut in her foot during “Celebrity Treasure Island”).

T: Staphylococcus aureus

B: Streptococcus pyogenes

Page 26: CS369 Computational Biology


Computational biology and Disease

Necrotizing fasciitis

Page 27: CS369 Computational Biology


Computational biology and Disease

Identify protein structure of superantigens - so that potential drug targets can be uncovered.

Page 28: CS369 Computational Biology


Computational biology and sequencing technology

• “Old” technology
– 1977
– Small scale
– Slow
– Length: long ≈ 700bp

• New technology: 454
– 2003
– Large scale: whole genome
– Fast
– Length: short ≈ 100bp
– http://www.454.com

Page 29: CS369 Computational Biology

Bring the mammoth back

[Figure labels: African, Mammoth, Indian]

Using this new sequencing technology and bioinformatics to compare the sequences to the African elephant genome, scientists sequenced 28 million nucleotides of the mammoth genome from frozen bones!

Page 30: CS369 Computational Biology


Where does the DNA come from?

• Scientists Sequence DNA of Woolly Mammoth (extinct 10,000 years ago)
– http://www.science.psu.edu/alert/Schuster12-2005.htm

Page 31: CS369 Computational Biology


Conclusion

• Computational biology and bioinformatics are fast becoming central to the biological sciences
– Ecology
– Evolutionary biology (tree of life)
– Health and disease (flesh-eating bacteria)
– Molecular biology and cell biology
– Personalized medicine

• Computational biology is heavily algorithmic and often statistical, drawing from the ‘hard-as-in-rock’ sciences like mathematics, computer science and statistics - this type of knowledge is becoming increasingly important in biology.

Page 32: CS369 Computational Biology


Reading

• Chapter 1 of “Biological Sequence Analysis”, Durbin et al.

• Chapter 1 of Mount
– Don’t worry about any biological terms that you don’t immediately understand! This chapter is more to give you a sense of where we are going. The pertinent details will be explained along the way.

Page 33: CS369 Computational Biology

Biomolecules and the Central Dogma of Molecular Biology

Page 34: CS369 Computational Biology


The central dogma of molecular biology

DNA → RNA (transcription)
RNA → Protein (translation)
DNA → DNA (replication)

Page 35: CS369 Computational Biology


What is a macromolecule?

• A chain molecule made up of a large number of repeating units.
– Homo-polymers are made up of the same units
– Hetero-polymers are made up of a number of different units

• Nylon, polyester, DNA (poly-nucleotide) and protein (poly-amino acid) are examples

• Macromolecular 3D structure is a question of conformational change rather than breaking and forming strong bonds.

Page 36: CS369 Computational Biology


Biomolecular sequences

DNA
5’-ACGATCGACTGGTATATCGATGCT-3’
X_i ∈ {A,C,G,T}

RNA
5’-ACGAUCGACUGGUAUAUCGAUGCU-3’
X_i ∈ {A,C,G,U}

Protein
MFINRWLFSTNHKDIGTLYLLFGAW
X_i ∈ {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
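The three alphabets on this slide can be written down directly, and the RNA example is the DNA example with U in place of T. The sketch below checks membership in an alphabet and performs that substitution. The names `is_over` and `transcribe` are illustrative, and the code assumes the DNA string is the strand whose sequence the RNA copies.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ARNDCEQGHILKMFPSTWYV")  # the 20 standard amino acids


def is_over(sequence: str, alphabet: set) -> bool:
    """True if every symbol X_i of the sequence belongs to the alphabet."""
    return set(sequence) <= alphabet


def transcribe(dna: str) -> str:
    """Return the RNA string with the same sequence as `dna` (T becomes U)."""
    if not is_over(dna, DNA_ALPHABET):
        raise ValueError("not a DNA sequence over {A,C,G,T}")
    return dna.replace("T", "U")


dna = "ACGATCGACTGGTATATCGATGCT"
print(transcribe(dna))  # prints ACGAUCGACUGGUAUAUCGAUGCU
print(is_over("MFINRWLFSTNHKDIGTLYLLFGAW", PROTEIN_ALPHABET))  # prints True
```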

Page 37: CS369 Computational Biology


DNA

5’-ACGATCGACTGGTATATCGATGCT-3’
3’-TGCTAGCTGACCATATAGCTACGA-5’

In nature DNA exists in a double helix of two complementary sequences.

Watson-Crick base pairing

A - T

G - C
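The Watson-Crick pairing rule determines one strand from the other. As a small sketch (function names are illustrative), the complement of the top strand on this slide reproduces the bottom strand as written 3’ to 5’, and reversing it gives the same strand read in the conventional 5’ to 3’ direction.

```python
# Watson-Crick pairs: A with T, G with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base complement, in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """The complementary strand read 5' to 3'."""
    return complement(strand)[::-1]


top = "ACGATCGACTGGTATATCGATGCT"
print(complement(top))          # prints TGCTAGCTGACCATATAGCTACGA
print(reverse_complement(top))  # prints AGCATCGATATACCAGTCGATCGT
```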

Page 38: CS369 Computational Biology


RNA

5’-ACGAUCGACUGGUAUAUCGAUGCU-3’

A - U

G - C

G - U

3 H bonds

2 H bonds

1 H bonds

3D structure Secondary structure

Page 39: CS369 Computational Biology


Proteins

• Proteins are hetero-polymers made up of 20 standard amino acids and are “coded for” by genes.

• They form 3D molecular structures that do work in the cell.

• An open problem is to determine structure computationally from the primary sequence of amino acids.

A cartoon of the 3D structure of the myoglobin protein

Page 40: CS369 Computational Biology


The genetic code

Amino acids

Page 41: CS369 Computational Biology


The genetic code
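The genetic code table on this slide did not survive extraction. As a hedged illustration, the sketch below builds the standard genetic code (DNA codons, stop written as `*`) from the usual compact 64-letter encoding and translates a sequence codon by codon. The function name `translate` and the choice to read from the first base and ignore a trailing partial codon are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, one amino acid per complete codon.

    Assumption: reading starts at the first base; a trailing partial
    codon is ignored; stop codons appear as "*".
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGTGGTAA"))                 # prints MW*
print(translate("ACGATCGACTGGTATATCGATGCT"))  # prints TIDWYIDA
```

The second call uses the example DNA sequence from the earlier slide purely as input; it is not claimed to be a real gene.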
