
Fundamental Concepts of Bioinformatics

OCCBIO 2006 Tutorial

Michael L. Raymer
Computer Science, Biomedical Sciences
Wright State University
Bioinformatics Research Group

Part I – Background

Some basics of molecular biology, and some of the fundamental problems facing bioinformaticians

The Central Dogma of molecular biology

DNA structure and base pairing

Polymer of:
• Deoxyribose sugar
• Phosphate
• Nitrogenous base

Four bases
• A, C, G, T

Base pairing
• A-T
• G-C

DNA is an information-carrying molecule

Arranged into 23 chromosome pairs in the nucleus of each human cell

Genes: coding information
• < 5% of all DNA
• Instructions for protein synthesis
• Directions on when and where to synthesize proteins (regulatory regions)

The Genetic Code

Redundancy/robustness
• Synonymous codons
• Dual strands
• Diploidy
• Amino acid structure (?)

Transcription

DNA → (transcription) → RNA → (translation) → Protein
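The DNA → RNA → protein flow can be illustrated with a minimal sketch. The codon table below is deliberately partial (only the codons this example uses), and the function names and the input sequence are hypothetical choices for illustration. The four alanine codons also show what "synonymous codons" means.

```python
# Partial codon table (RNA codon -> amino acid); only the entries this example needs.
CODON_TABLE = {
    "AUG": "Met",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # synonymous codons
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(coding_dna):
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Read codons left to right until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        protein.append(residue)
    return protein

mrna = transcribe("ATGGCCGCATAA")
print(mrna, translate(mrna))
```

A full implementation would need all 64 codons; a missing codon raises a KeyError in this sketch.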

Messenger RNA

Carries instructions for a protein out of the nucleus to the ribosome

The ribosome is an RNA-protein complex that synthesizes new proteins

Prokaryotic gene structure

[Figure labels: Promoter: RNA polymerase binding; Operator: regulation; 5' UTR; Coding; Stop Codon; 3' UTR; strand drawn 5' to 3']

[Figure: Yeast RNA Polymerase II, Darst et al., 1991 (Cell 66, pp. 121-128)]

Regulation of transcription

Energy budget
Cellular differentiation & tissue function

[Figure from W. Becker, L. Kleinsmith, and J. Hardin, The World of the Cell, Fourth Edition. Copyright © Addison Wesley Longman, Inc.]

Bioinformatics problems

Shotgun sequencing
Sequence alignment & multiple alignment
• Database searches
Phylogenetic tree induction
Protein structure determination, modeling, and prediction
Ligand screening and docking
Many, many more

Bioinformatics data

DNA sequence information
• Genome projects, etc.
mRNA expression information
• Microarrays, SAGE
Metabolite concentrations
• Mass Spec., NMR Spec., etc.
Protein sequence information
Protein structure information
• X-Ray Crystallography

Part II – Obtaining Sequences

Sanger Sequencing
Primer Walking
Shotgun Approaches
Fragment Assembly Algorithms

Outline

PCR
Sanger Sequencing
Primer Walking
Shotgun Sequencing
• Models
• Algorithms
• Analysis

Polymerase chain reaction (PCR)

Gel electrophoresis

Sanger sequencing

Limitations to sequencing

You must have a primer of known sequence to initiate PCR
Only about 1000 nt can be sequenced in a single reaction
The sequencing process is slow, so it is beneficial to do as much in parallel as possible
• Primer hopping
• Shotgun approach

Shotgun Sequencing

The Ideal Case

Find maximal overlaps between fragments:

Fragments: ACCGT, CGTGC, TTAC, TACCGT

--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
=========
TTACCGTGC

Consensus sequence determined by vote
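The "consensus by vote" step can be sketched as a per-column majority vote over the layout. This is a minimal illustration, assuming each layout row is padded with '-' characters that do not vote; the function name is hypothetical.

```python
from collections import Counter

def consensus(layout, gap="-"):
    """Majority vote per column; gap/padding characters do not vote."""
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        votes = Counter(row[col] for row in layout
                        if col < len(row) and row[col] != gap)
        if votes:  # skip columns that no fragment covers
            result.append(votes.most_common(1)[0][0])
    return "".join(result)

layout = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TACCGT--"]
print(consensus(layout))
```

With enough coverage the vote also absorbs an isolated base call error, since the erroneous base is outvoted in its column.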

Quality Metrics

The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position

[Figure: fragments laid out against the target; a region with no coverage splits the layout into two contigs]
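Coverage as defined here can be computed directly from a layout. The sketch below assumes rows padded with '-' (so '-' means "fragment absent", not an internal gap), and the second function counts contigs as maximal runs of covered columns. The three-fragment layout is a made-up example.

```python
def coverage(layout, gap="-"):
    """Number of fragments covering each column of the layout."""
    width = max(len(row) for row in layout)
    return [sum(1 for row in layout if col < len(row) and row[col] != gap)
            for col in range(width)]

def count_contigs(cov):
    """Number of maximal runs of columns with coverage > 0."""
    contigs, inside = 0, False
    for c in cov:
        if c > 0 and not inside:
            contigs += 1
        inside = c > 0
    return contigs

# Hypothetical layout: column 6 is covered by no fragment.
layout = ["ACGT------", "--GTAC----", "-------TTG"]
cov = coverage(layout)
print(cov, count_contigs(cov))
```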

Quality Metrics

Linkage: the degree of overlap between fragments

[Figure: fragments laid out against the target, showing perfect coverage but poor average linkage and poor minimum linkage]

Real World Complications

Base call errors
Chimeric fragments, contamination (e.g. from the vector)

Base Call Error:

--ACCGT--
----CGTGC
TTAC-----
-TGCCGT--
=========
TTACCGTGC

Insertion Error:

--ACC-GT--
----CAGTGC
TTAC------
-TACC-GT--
==========
TTACC-GTGC

Deletion Error:

--ACCGT--
----CGTGC
TTAC-----
-TAC-GT--
=========
TTACCGTGC

Unknown Orientation

A fragment can come from either strand

Fragments: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA

CACGT
-ACGT
--CGTAGT        (reverse complement of ACTACG)
-----AGTAC      (reverse complement of GTACT)
--------ACTGA
---------CTGA
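Handling unknown orientation comes down to trying each fragment both as given and as its reverse complement. The sketch below checks the layout from this slide. The consensus string and the offsets are read off that layout rather than stated in the slides, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

# (offset, fragment as sequenced, whether it must be flipped to fit)
layout = [(0, "CACGT", False), (1, "ACGT", False), (2, "ACTACG", True),
          (5, "GTACT", True), (8, "ACTGA", False), (9, "CTGA", False)]
consensus = "CACGTAGTACTGA"  # assumption: read off the layout columns

def fits(offset, fragment, flipped):
    placed = reverse_complement(fragment) if flipped else fragment
    return consensus[offset:offset + len(placed)] == placed

print(reverse_complement("ACTACG"), reverse_complement("GTACT"))
print(all(fits(*row) for row in layout))
```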

Repeats

Direct repeats

A X B X C X D
A X C X B X D

Repeats

Direct repeats

A X B Y C X D Y E
A X D Y C X B Y E

Repeats

Inverted repeats

[Figure: two copies of X on opposite strands]

Sequence Alignment Models

Shortest common superstring
• Input: A collection, F, of strings (fragments)
• Output: A shortest possible string S such that for every $f \in F$, S is a superstring of f.

Example:
• F = {ACT, CTA, AGT}
• S = ACTAGT
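For a handful of fragments, the shortest common superstring can be found by exhaustive search: try every ordering and merge neighbours with their maximal overlap. This is only a sketch for small inputs (it takes factorial time), and the helper names are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(s, f):
    """Append f to s, reusing the overlap; skip f if s already contains it."""
    if f in s:
        return s
    return s + f[overlap(s, f):]

def shortest_common_superstring(fragments):
    best = None
    for order in permutations(fragments):
        s = ""
        for f in order:
            s = merge(s, f)
        if best is None or len(s) < len(best):
            best = s
    return best

print(shortest_common_superstring(["ACT", "CTA", "AGT"]))
```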

Problems with the SCS model

[Figure labels: x, x and x, x´]

Directionality of fragments must be known
No consideration of coverage
Some simple consideration of linkage
No consideration of base call errors

Reconstruction

Deals with errors and unknown orientation

Definitions
• f is an approximate substring of S at error level $\epsilon$ when $d_s(f, S) \le \epsilon |f|$
• $d_s$ = substring edit distance: Match = 0, Mismatch = 1, Gap = 1

Reconstruction
• Input: A collection, F, of strings, and a tolerance level, $\epsilon$
• Output: Shortest possible string, S, such that for every $f \in F$: $\min\left(d_s(f, S),\, d_s(\bar{f}, S)\right) \le \epsilon |f|$, where $\bar{f}$ is the reverse complement of f

Reconstruction Example

Input: F = {ATCAT, GTCG, CGAG, TACCA}, $\epsilon$ = 0.25

Output:

ATGAT-----      (reverse complement of ATCAT)
------CGAC      (reverse complement of GTCG)
-CGAG-----
----TACCA-
==========
ACGATACGAC

$d_s$(CGAG, ACGATACGAC) = 1 = 0.25 × 4

So this output is OK for $\epsilon$ = 0.25
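The check in this example can be reproduced with a small dynamic program for the substring edit distance: the fragment may start and end anywhere in S at no cost. The sketch assumes unit costs (match 0, mismatch 1, gap 1) and tries both orientations of each fragment; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def substring_edit_distance(f, s):
    """Minimum edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)              # an empty prefix of f fits anywhere in s
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)           # i characters of f against nothing
        for j in range(1, len(s) + 1):
            cur[j] = min(prev[j - 1] + (f[i - 1] != s[j - 1]),  # match/mismatch
                         prev[j] + 1,                            # gap in s
                         cur[j - 1] + 1)                         # gap in f
        prev = cur
    return min(prev)                       # f may end anywhere in s

def is_reconstruction(s, fragments, eps):
    """True if every fragment, in some orientation, is within eps * |f| of s."""
    return all(min(substring_edit_distance(f, s),
                   substring_edit_distance(reverse_complement(f), s)) <= eps * len(f)
               for f in fragments)

S = "ACGATACGAC"
F = ["ATCAT", "GTCG", "CGAG", "TACCA"]
print(substring_edit_distance("CGAG", S), is_reconstruction(S, F, 0.25))
```

This only verifies a candidate S; finding the shortest such S is the hard part of the problem.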

Gaps in Reconstruction

Reconstruction allows gaps in fragments:

AT-GA-----
ATCGATAGAC

$d_s$ = 1

Limitations of Reconstruction

Models errors and unknown orientation
Doesn’t handle repeats
Doesn’t model coverage
Only handles linkage in a very simple way
Always produces a single contig

Contigs

Sometimes you just can’t put all of the fragments together into one contiguous sequence:

[Figure: two contigs separated by a gap marked "?"]

No way to tell the order of these two contigs.
No way to tell how much sequence is missing between them.

Multicontig

Definitions
• A layout, L, is a multiple alignment of the fragments
  Columns numbered from 1 to |L|
• Endpoints of a fragment: l(f) and r(f)
• An overlap is a link if no other fragment completely covers the overlap

[Figure: one overlap labeled "Link" and one labeled "Not a link"]

Multicontig

More definitions
• The size of a link is the number of overlapping positions
• The weakest link is the smallest link in the layout
• A t-contig has a weakest link of size t
• A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F

ACGTATAGCATGA
  GTA   CATGATCA
ACGTATAG   GATCA

A link of size 5

Perfect Multicontig

Input: F, and t
Output: a minimum number of collections, Ci, such that every Ci admits a t-contig

Let F = {GTAC, TAATG, TGTAA}

t = 3:

--TAATG
TGTAA--

GTAC

t = 1:

TGTAA-----
--TAATG---
------GTAC

Handling errors in Multicontig

The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout

S is an $\epsilon$-consensus for a collection of fragments when the edit distance between each fragment, f, and its image is at most $\epsilon |f|$

   TATAGCATCAT
 CGTC   CATGATCA
ACGGATAG   GTCCA
================
ACGTATAGCATGATCA

An $\epsilon$-consensus for $\epsilon$ = 0.4

Definition of Multicontig

Input: A collection, F, of strings, an integer $t \ge 0$, and an error tolerance $\epsilon$ between 0 and 1

Output: A partition of F into the minimum number of collections Ci such that every Ci admits a t-contig with an $\epsilon$-consensus

Example of Multicontig

Let $\epsilon$ = 0.4, t = 3

   TATAGCATCAT
ACGTC   CATGATCAG
ACGGATAG   GTCCAG
=================
ACGTATAGCATGATCAG

Algorithms

Most of the algorithms to solve the fragment assembly problem are based on a graph model
A graph, G, is a collection of edges, e, and vertices, v.
• Directed or undirected
• Weighted or unweighted

We will discuss representations and other issues shortly…

[Figure: a directed, unweighted graph]

The Maximum Overlap Graph

The text calls it an overlap multigraph
Each directed edge, (u,v), is weighted with the length of the maximal overlap between a suffix of u and a prefix of v

Nodes: a = TACGA, b = ACCC, c = CTAAAG, d = GACA

[Figure: overlap graph on nodes a, b, c, d with edge weights 1, 1, 1, 2, 1; 0-weight edges omitted]
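The edge weights follow mechanically from the definition, so the graph can be built with a few lines of code. In this sketch the assignment of labels a to d to the four fragments is inferred from this slide together with the "path dbc" example that follows, and the function names are hypothetical.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def overlap_graph(fragments):
    """Map each directed edge (u, v) to its overlap weight, omitting 0-weight edges."""
    graph = {}
    for u in fragments:
        for v in fragments:
            if u != v:
                w = overlap(fragments[u], fragments[v])
                if w > 0:
                    graph[(u, v)] = w
    return graph

# Label assignment is an assumption inferred from the slides.
fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
print(overlap_graph(fragments))
```

The result has four edges of weight 1 and one of weight 2, which agrees with the weights shown in the figure.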

Paths and Layouts

The path dbc leads to the alignment:

GACA--------
---ACCC-----
------CTAAAG

[Figure: the overlap graph on nodes a, b, c, d from the previous slide]

Superstrings

Every path that covers every node is a superstring
Zero weight edges result in alignments like:

GACA----------
----GCCC------
--------TTAAAG

Higher weights produce more overlap, and thus shorter strings
The shortest common superstring is the highest weight path that covers every node

Graph formulation of SCS

Input: A weighted, directed graph
Output: The highest-weight path that touches every node of the graph

Does this problem sound familiar?

The Greedy Algorithm

Algorithm greedy
    Sort edges in decreasing weight order
    For each edge in this order
        If the edge does not form a cycle
           and the edge does not start or end at the same node as another edge in the set
        then add the edge to the current set
    End for
End Algorithm

Figure 4.16, page 125
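The pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not the textbook's code: it considers all ordered pairs of fragments (including 0-weight edges, so the result is always a single path), and the names and the four-fragment input are illustrative assumptions.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def greedy_path(fragments):
    names = list(fragments)
    edges = sorted(((overlap(fragments[u], fragments[v]), u, v)
                    for u in names for v in names if u != v),
                   key=lambda e: -e[0])           # decreasing weight
    succ, pred = {}, {}

    def reaches(start, goal):
        """Follow already-chosen edges from start; True if goal is reached."""
        while start in succ:
            start = succ[start]
            if start == goal:
                return True
        return False

    for _, u, v in edges:
        if u in succ or v in pred:                # node already has an out/in edge
            continue
        if reaches(v, u):                         # u -> v would close a cycle
            continue
        succ[u], pred[v] = v, u
    path = [next(n for n in names if n not in pred)]
    while path[-1] in succ:
        path.append(succ[path[-1]])
    return path

def assemble(fragments, path):
    """Merge fragments along the path, reusing each pairwise overlap."""
    s = fragments[path[0]]
    for prev, nxt in zip(path, path[1:]):
        s += fragments[nxt][overlap(fragments[prev], fragments[nxt]):]
    return s

fragments = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
path = greedy_path(fragments)
print(path, assemble(fragments, path))
```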

Greedy Example

[Figure: example graph with edge weights 7, 6, 5, 4, 3, 2, 2, 2, 1]

Greedy does not always find the best path

Nodes: ATGC, TGCAT, GCC

[Figure: overlap graph with edges ATGC → TGCAT (weight 3), TGCAT → ATGC (weight 2), ATGC → GCC (weight 2), and a 0-weight edge]
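The counterexample can be checked numerically. The sketch below compares the ordering greedy commits to (it takes the weight-3 edge first) with the best ordering found by trying all permutations. Helper names are hypothetical.

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def path_weight(order):
    """Total overlap along consecutive fragments in the ordering."""
    return sum(overlap(a, b) for a, b in zip(order, order[1:]))

def superstring(order):
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        s += nxt[overlap(prev, nxt):]
    return s

fragments = ["ATGC", "TGCAT", "GCC"]
greedy_order = ["ATGC", "TGCAT", "GCC"]      # weight-3 edge first, then a 0-weight edge
best_order = max(permutations(fragments), key=path_weight)

print(path_weight(greedy_order), superstring(greedy_order))
print(path_weight(best_order), superstring(best_order))
```

Greedy ends with total weight 3 and a 9-character superstring, while the best path has weight 4 and yields an 8-character superstring.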

Tools for Shotgun Sequencing

Common Difficulty

Each of these problems is a method for modeling fragment assembly
Each of these problems is NP-hard, so no efficient exact algorithm is known
How do we know?

Embedding problems

Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem
• Paths between cities are represented as fragments
• The shortest path is the shortest common superstring of the fragments

If this is true, then there are only two possibilities:
1. This problem is just as intractable as TSP
2. TSP is actually a tractable problem!

NP-Complete Problems

There is a collection of problems that computer scientists believe to be intractable
• TSP is one of them
Each of them has been modeled as one or more of the other NP-complete problems
If you solve one, you solve them all
A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p

NP-Completeness

[Figure: diagram of the classes P and NP, labeled with TSP, 3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, and bin packing]

P = NP?

[Figure: diagram of P and NP, labeled with 3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, and bin packing]

Part III – Sequence Alignments

Needleman-Wunsch
Smith-Waterman
Dynamic Programming

Why align sequences?

The draft human genome is available
Automated gene finding is possible
Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match

Comparing two sequences

Point mutations, easy:

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

Indels are difficult, must align sequences:

ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

Scoring a sequence alignment

Match score: +1
Mismatch score: +0
Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)

Score = +11
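The scoring rule is a simple column-by-column sum. Here is a minimal sketch, with a hypothetical function name, that scores two already-aligned strings of equal length using the slide's parameters as defaults:

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned strings column by column; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

top    = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))
```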

DNA Replication

Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set
DNA polymerase is the enzyme that copies DNA
• Synthesizes in the 5' to 3' direction

Over time, genes accumulate mutations

Environmental factors
• Radiation
• Oxidation
Mistakes in replication or repair
Deletions, Duplications
Insertions
Inversions
Point mutations

Deletions

Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal

Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal

Indels

Comparing two genes, it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:

ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT

Origination and length penalties

We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can achieve this by penalizing more for a new gap than for extending an existing gap

Scoring a sequence alignment (2)

Match/mismatch score: +1/+0
Origination/length penalty: –2/–1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)

Score = +7
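A sketch of this scoring scheme follows, using the slide's convention that each run of gaps pays one origination penalty plus a length penalty per gapped position. The function name is hypothetical. Applied to the two candidate alignments from the previous slide, it shows why the origination penalty matters: both contain 20 matches and 7 gap positions, so a flat gap penalty cannot tell them apart.

```python
def score_affine(top, bottom, match=1, mismatch=0, origination=-2, length=-1):
    """Score an alignment with a per-gap origination penalty plus a per-position length penalty."""
    score = 0
    prev_gap = None                      # row holding the previous column's gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != prev_gap:          # a new gap starts here
                score += origination
            score += length
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score

target = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_affine(target, "----CTGATTCGC---ATCGTCTATCT"))   # example on this slide
print(score_affine(target, "ACGTCTGAT-------ATAGTCTATCT"))   # one long gap
print(score_affine(target, "AC-T-TGA--CG-CGT-TA-TCTATCT"))   # many short gaps
```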

How can we find an optimal alignment?

Finding the alignment is computationally hard:

ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT

C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our work!
Dynamic programming: The Needleman & Wunsch algorithm

What is the optimal alignment?

ACTCG
ACAGTAG

Match: +1
Mismatch: 0
Gap: –1

Needleman-Wunsch: Step 1

Each sequence along one axis
Gap penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded)

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Needleman-Wunsch: Step 2

Vertical/Horiz. move: Score + (simple) gap penalty
Diagonal move: Score + match/mismatch score
Take the MAX of the three possibilities

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Needleman-Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2
a  -3
g  -4
t  -5
a  -6
g  -7

Needleman-Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…
The optimal alignment score is calculated in the lower-right corner

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2   0   2   1   0  -1
a  -3  -1   1   2   1   0
g  -4  -2   0   1   2   2
t  -5  -3  -1   1   1   2
a  -6  -4  -2   0   1   1
g  -7  -5  -3  -1   0   2
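The table-filling steps can be sketched in a few lines. The function name and the argument order (top sequence first, then left sequence) are illustrative assumptions; the scores are the ones used on these slides.

```python
def nw_table(top, left, match=1, mismatch=0, gap=-1):
    """Fill the Needleman-Wunsch table; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):                 # first row: multiples of the gap penalty
        table[0][j] = j * gap
    for i in range(1, rows):                 # first column likewise
        table[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if left[i - 1] == top[j - 1] else mismatch)
            up = table[i - 1][j] + gap       # vertical move
            side = table[i][j - 1] + gap     # horizontal move
            table[i][j] = max(diag, up, side)
    return table

table = nw_table("ACTCG", "ACAGTAG")
for row in table:
    print(" ".join(f"{v:3d}" for v in row))
print("score:", table[-1][-1])
```

Each cell is computed once from three neighbours, so the work is proportional to the product of the two sequence lengths rather than to the number of possible alignments.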

But what is the optimal alignment?

To reconstruct the optimal alignment, we must determine where the MAX at each step came from…

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2   0   2   1   0  -1
a  -3  -1   1   2   1   0
g  -4  -2   0   1   2   2
t  -5  -3  -1   1   1   2
a  -6  -4  -2   0   1   1
g  -7  -5  -3  -1   0   2

A path corresponds to an alignment

↑ = GAP in top sequence
← = GAP in left sequence
↖ = ALIGN both positions

One path from the previous table: [Figure: traceback path through the table]

Corresponding alignment (start at the end):

AC--TCG
ACAGTAG

Score = +2
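The traceback can be sketched by refilling the table and then walking back from the lower-right corner, at each cell asking which move could have produced its value. This version prefers diagonal moves when several moves tie, which is an arbitrary choice; other tie-breaking rules can give different alignments with the same score. The function name is hypothetical.

```python
def needleman_wunsch(top, left, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_top, aligned_left) for a global alignment."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap

    def sub(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(table[i - 1][j - 1] + sub(i, j),
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Traceback: start at the end and ask where each MAX came from.
    i, j = rows - 1, cols - 1
    a_top, a_left = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub(i, j):
            a_top.append(top[j - 1]); a_left.append(left[i - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            a_top.append("-"); a_left.append(left[i - 1]); i -= 1   # gap in top sequence
        else:
            a_top.append(top[j - 1]); a_left.append("-"); j -= 1    # gap in left sequence
    return table[-1][-1], "".join(reversed(a_top)), "".join(reversed(a_left))

print(needleman_wunsch("ACTCG", "ACAGTAG"))
```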

Practice Problem

Find an optimal alignment for these two sequences:

GCGGTT
GCGT

Match: +1
Mismatch: 0
Gap: –1

        g   c   g   g   t   t
    0  -1  -2  -3  -4  -5  -6
g  -1
c  -2
g  -3
t  -4

Practice Problem

Find an optimal alignment for these two sequences:

GCGGTT
GCGT

        g   c   g   g   t   t
    0  -1  -2  -3  -4  -5  -6
g  -1   1   0  -1  -2  -3  -4
c  -2   0   2   1   0  -1  -2
g  -3  -1   1   3   2   1   0
t  -4  -2   0   2   3   3   2

GCGGTT
GCG-T-

Score = +2

Page 75: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

OCCBIO 2006 – Fundamental Bioinformatics 75

g c g0 -1 -2 -3

g -1 1 0 -1g -2 0 1 1c -3 -1 1 1g -4 -2 0 2

Semi-global alignmentSemi-global alignment Suppose we are aligning:GCGGGCG

Which do you prefer?G-CG -GCGGGCG GGCG

Semi-global alignment allows gaps at the ends for free.

Page 76: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

OCCBIO 2006 – Fundamental Bioinformatics 76

Semi-global alignment

        g   c   g
    0   0   0   0
g   0   1   0   1
g   0   1   1   1
c   0   0   2   1
g   0   1   1   3

Semi-global alignment allows gaps at the ends for free.

Initialize first row and column to all 0’s
Allow free horizontal/vertical moves in last row and column
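A minimal Python sketch of the two changes described above, assuming the same scores as the earlier examples (match +1, mismatch 0, gap –1). The function name `semi_global_score` is an assumption. Instead of editing the recurrence for the last row and column, the sketch takes the best value found anywhere in them, which has the same effect as allowing free moves there.

```python
def semi_global_score(top, left, match=1, mismatch=0, gap=-1):
    """Best score when gaps at either end of either sequence are free."""
    rows, cols = len(left) + 1, len(top) + 1
    # First row and first column stay at 0: leading gaps cost nothing.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    # Trailing gaps are free: take the best cell in the last row or column.
    last_row_best = max(table[-1])
    last_col_best = max(row[-1] for row in table)
    return max(last_row_best, last_col_best)


print(semi_global_score("GCG", "GGCG"))
```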


Local alignment

Global alignments – score the entire alignment
Semi-global alignments – allow unscored gaps at the beginning or end of either sequence
Local alignment – find the best matching subsequence

CGATG
AAATGGA

This is achieved by allowing a 4th alternative at each position in the table: zero.


Local alignment

Mismatch = –1 this time

        c   g   a   t   g
    0  -1  -2  -3  -4  -5
a  -1   0   0   0   0   0
a  -2   0   0   1   0   0
a  -3   0   0   1   0   0
t  -4   0   0   0   2   1
g  -5   0   1   0   1   3
g  -6   0   1   0   0   2
a  -7   0   0   2   1   1

CGATG
AAATGGA
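The "4th alternative" can be sketched as follows. This is a hedged illustration with an assumed function name, `local_alignment_score`. It uses the conventional Smith–Waterman setup in which the first row and column are also zero, so boundary cells may differ from the table above, but the best local score is the same.

```python
def local_alignment_score(top, left, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: every cell may also restart at zero."""
    cols = len(top) + 1
    prev = [0] * cols            # zero-initialised first row (assumed convention)
    best = 0
    for i in range(1, len(left) + 1):
        cur = [0] * cols         # zero-initialised first column
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            cur[j] = max(0,                      # 4th alternative: start afresh
                         prev[j - 1] + pair,     # align both
                         prev[j] + gap,          # gap in top
                         cur[j - 1] + gap)       # gap in left
            best = max(best, cur[j])             # answer is the table maximum
        prev = cur
    return best


print(local_alignment_score("CGATG", "AAATGGA"))
```

The best score corresponds to the shared substring ATG.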


Optimal Substructure in Alignments

Consider the alignment:

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Is it true that the alignment in the boxed region must be optimal?


A Greedy Strategy

Consider this pair of sequences:

GAGC
CAGC

GAP = –1   Match = +1   Mismatch = –2

Greedy approach (pick the best-scoring first column):

G      G      -
C  or  -  or  C

Leads to:

GAGC---
---CAGC

Better:

GAGC
CAGC


Breaking apart the problem

Suppose we are aligning:

ACTCG
ACAGTAG

First position choices (first column, its score, and what remains to be aligned):

A         CTCG
A   +1    CAGTAG

A         CTCG
-   -1    ACAGTAG

-         ACTCG
A   -1    CAGTAG


A Recursive Approach to Alignment

Choose the best alignment based on these three possibilities:

align(seq1, seq2) {
    if (both sequences empty) { return 0; }
    if (one sequence empty) {
        return (gapscore * num chars in nonempty seq);
    } else {
        // align the two first characters, or put a gap opposite one of them
        score1 = score(firstchar(seq1), firstchar(seq2)) + align(tail(seq1), tail(seq2));
        score2 = align(tail(seq1), seq2) + gapscore;
        score3 = align(seq1, tail(seq2)) + gapscore;
        return max(score1, score2, score3);   // higher scores are better
    }
}


Time Complexity of RecurseAlign

What is the recurrence equation for the time needed by RecurseAlign?

\[ T(n) = 3\,T(n-1) + 3 \]

[Figure: recursion tree of depth n in which every call makes 3 further calls; successive levels contain 3, 9, 27, …, 3^n calls]


RecurseAlign repeats its work

[Figure: alignment grid with the sequence ACGTATCGCGTATA along the top and GATGCTCTCG down the side]


Dynamic Programming

Remember all the subproblem answers along the way:

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2   0   2   1   0  -1
a  -3  -1   1   2   1   0
g  -4  -2   0   1   2   2
t  -5  -3  -1   1   1   2
a  -6  -4  -2   0   1   1
g  -7  -5  -3  -1   0   2

This is possible for any problem that exhibits optimal substructure
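To make the saving concrete, the sketch below runs the recursive alignment twice, once as plain recursion and once remembering subproblem answers, and counts how many times the function body executes. The name `align_score` and the use of `functools.lru_cache` as the "memory" are choices made for this illustration only. Scores are match +1, mismatch 0, gap –1, as in the table.

```python
from functools import lru_cache


def align_score(s, t, memoize, match=1, mismatch=0, gap=-1):
    """Return (best global score, number of times the recursion body ran)."""
    calls = 0

    def align(i, j):
        # Best score for aligning the suffixes s[i:] and t[j:].
        nonlocal calls
        calls += 1
        if i == len(s) or j == len(t):
            # One suffix is empty: the rest of the other pairs with gaps.
            return gap * ((len(s) - i) + (len(t) - j))
        pair = match if s[i] == t[j] else mismatch
        return max(pair + align(i + 1, j + 1),
                   gap + align(i + 1, j),
                   gap + align(i, j + 1))

    if memoize:
        # Rebinding the name makes the recursive calls go through the cache.
        align = lru_cache(maxsize=None)(align)
    return align(0, 0), calls


plain_score, plain_calls = align_score("ACTCG", "ACAGTAG", memoize=False)
memo_score, memo_calls = align_score("ACTCG", "ACAGTAG", memoize=True)
print(plain_score, plain_calls)
print(memo_score, memo_calls)
```

With memoization each pair (i, j) is solved at most once, so the body runs at most (len(s) + 1) × (len(t) + 1) times.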


Saving Space

Note that we can throw away the previous rows of the table as we fill it in:

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2   0   2   1   0  -1
a  -3  -1   1   2   1   0
g  -4  -2   0   1   2   2
t  -5  -3  -1   1   1   2
a  -6  -4  -2   0   1   1
g  -7  -5  -3  -1   0   2

Each row is based only on the row directly above it
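A short sketch of the row-by-row idea, keeping only the previous row and the row being filled. The function name `last_row` is an assumption; the scores are those of the table (match +1, mismatch 0, gap –1).

```python
def last_row(top, left, match=1, mismatch=0, gap=-1):
    """Final table row: scores for aligning all of `left` with each prefix of `top`."""
    prev = [j * gap for j in range(len(top) + 1)]      # row 0 of the table
    for i, left_char in enumerate(left, start=1):
        cur = [i * gap]                                # first column of row i
        for j, top_char in enumerate(top, start=1):
            pair = match if left_char == top_char else mismatch
            cur.append(max(prev[j - 1] + pair,         # align both
                           prev[j] + gap,              # gap in top
                           cur[j - 1] + gap))          # gap in left
        prev = cur                                     # older rows are discarded
    return prev


print(last_row("ACTCG", "ACAGTAG"))   # bottom row of the table
print(last_row("ACTCG", "ACA"))       # row for the prefix aca
```

This gives the optimal score in memory proportional to one row, but not the alignment itself, because the traceback information has been thrown away.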


Saving Space (2)

Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence:

        a   c   t   c   g
    0  -1  -2  -3  -4  -5
a  -1   1   0  -1  -2  -3
c  -2   0   2   1   0  -1
a  -3  -1   1   2   1   0
g  -4  -2   0   1   2   2
t  -5  -3  -1   1   1   2
a  -6  -4  -2   0   1   1
g  -7  -5  -3  -1   0   2

Third row (a): scores for aligning aca with all prefixes of actcg


Divide and Conquer

By using a recursive approach, we can use only two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence
  (needs two vectors of scores)
• Recursively align the sequences before and after the fixed positions

ACGCTATGCTCATAG   (i = the middle character, G)
CGACGCTCATCG


Finding where i lines up

Find out where i aligns to the bottom sequence
• Needs two vectors of scores

Assuming i lines up with a character:

alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t))

Which character is best?
• Can quickly find out the score for aligning ACGCTAT with every prefix of t.

s: ACGCTATGCTCATAG   (i = the middle character, G)
t: CGACGCTCATCG


Finding where i lines up

But, i may also line up with a gap

Assuming i lines up with a gap:

alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t))

s: ACGCTATGCTCATAG   (i = the middle character, G)
t: CGACGCTCATCG
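The two cases above (i against a character of t, or i against a gap) can be combined into one search using two score vectors. The sketch below is an illustration under assumed names (`scores_vs_prefixes`, `best_split`) and assumed scores (match +1, mismatch 0, gap –1); the suffix vector is obtained by running the same routine on the reversed strings.

```python
def scores_vs_prefixes(fixed, t, match=1, mismatch=0, gap=-1):
    """vec[j] = best score for aligning all of `fixed` with the prefix t[:j]."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, f in enumerate(fixed, start=1):
        cur = [i * gap]
        for j, c in enumerate(t, start=1):
            pair = match if f == c else mismatch
            cur.append(max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def best_split(s, t, match=1, mismatch=0, gap=-1):
    """Return (score, j, kind) for the best placement of the middle character of s."""
    i = len(s) // 2
    n = len(t)
    pre = scores_vs_prefixes(s[:i], t, match, mismatch, gap)
    # suf[k] = best score for s[i+1:] against the last k characters of t.
    suf = scores_vs_prefixes(s[i + 1:][::-1], t[::-1], match, mismatch, gap)
    candidates = []
    for j in range(n):                       # s[i] lines up with t[j]
        pair = match if s[i] == t[j] else mismatch
        candidates.append((pre[j] + pair + suf[n - j - 1], j, "char"))
    for j in range(n + 1):                   # s[i] lines up with a gap after t[:j]
        candidates.append((pre[j] + gap + suf[n - j], j, "gap"))
    return max(candidates, key=lambda cand: cand[0])


s, t = "ACGCTATGCTCATAG", "CGACGCTCATCG"
print(s[len(s) // 2])        # the middle character i
print(best_split(s, t))
```

The best split score equals the score of the full global alignment, which is what justifies fixing that position and recursing on the two halves.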


Recursive Call

Fix the best position for i
Call align recursively for the prefixes and suffixes:

s: ACGCTATGCTCATAG   (i = the middle character, G)
t: CGACGCTCATCG


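Complexity

Let len(s) = m and len(t) = n
Space: 2m
Time:
• Each call to build similarity vector = m′n′
• First call + recursive call:

\[
T(m,n) = mn + T\!\left(\tfrac{m}{2},\, j\right) + T\!\left(\tfrac{m}{2},\, n-j\right)
       = mn + \tfrac{m}{2}\,j + \tfrac{m}{2}\,(n-j) + \dots \le 2mn
\]

s: ACGCTATGCTCATAG   (i = the middle character of s)
t: CGACGCTCATCG      (j = the position in t where i lines up)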


General Gap Penalties

Suppose we are no longer using simple gap penalties:
• Origination = −2
• Length = −1

Consider the last position of the alignment for ACGTA with ACG

We can’t determine the score for

G        -
-   or   G

unless we know the previous positions!


Scoring Blocks

Now we must score a block at a time

A block is a pair of characters, or a maximal group of gaps paired with characters

To score a position, we need to either start a new block or add it to a previous block

A | A | C | --- | A | TATCCG | A | C | T | AC
A | C | T | ACC | T | ------ | C | G | C | --


The Algorithm

Three tables
• a – scores for alignments ending in char-char blocks
• b – scores for alignments ending in gaps in the top sequence (s)
• c – scores for alignments ending in gaps in the left sequence (t)

Scores no longer depend on only three positions, because we can put any number of gaps into the last block


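The Recurrences

\[
a[i,j] = p(i,j) + \max
\begin{cases}
a[i-1,\,j-1] \\
b[i-1,\,j-1] \\
c[i-1,\,j-1]
\end{cases}
\]

\[
b[i,j] = \max
\begin{cases}
a[i,\,j-k] - w(k), & \text{for } 1 \le k \le j \\
c[i,\,j-k] - w(k), & \text{for } 1 \le k \le j
\end{cases}
\]

\[
c[i,j] = \max
\begin{cases}
a[i-k,\,j] - w(k), & \text{for } 1 \le k \le i \\
b[i-k,\,j] - w(k), & \text{for } 1 \le k \le i
\end{cases}
\]

Here p(i,j) is the score for pairing the characters at positions i and j, and w(k) ≥ 0 is the penalty for a gap of length k.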


The Optimal Alignment

The optimal alignment is found by looking at the maximal value in the lower right of all three arrays

The algorithm runs in O(n^3) time
• Uses O(n^2) space
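A compact sketch of the three-table algorithm, written under several assumptions: i indexes s and j indexes t, `w(k)` is a non-negative penalty that is subtracted for a gap of length k, match and mismatch scores are +1 and 0, and the boundary cells are initialised as noted in the comments. The function name `general_gap_score` is hypothetical.

```python
NEG_INF = float("-inf")


def general_gap_score(s, t, w, match=1, mismatch=0):
    """Best global score with an arbitrary gap penalty function w(k) >= 0."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in a char-char block
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in gaps in s
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends in gaps in t
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)          # t[:j] against one gap block in s
    for i in range(1, m + 1):
        c[i][0] = -w(i)          # s[:i] against one gap block in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # The inner loops over k give the cubic running time.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k)
                          for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k)
                          for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])


def affine(k):
    # Origination penalty 2 plus 1 per gap position, matching the slide's -2 / -1.
    return 2 + k


print(general_gap_score("ACGTA", "ACG", affine))
print(general_gap_score("ACTCG", "ACAGTAG", lambda k: k))   # linear gaps
```

With a linear penalty w(k) = k the result agrees with the simple gap model used earlier.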


Part IV – Database Searches

BLAST

Search statistics


Database Searching

How can we find a particular short sequence in a database of sequences (or one HUGE sequence)?

Problem is identical to local sequence alignment, but on a much larger scale.

We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit; how much attention should be paid to the result?


BLAST

BLAST – Basic Local Alignment Search Tool
An approximation of the Smith–Waterman local alignment algorithm
Sacrifices some search sensitivity for speed


Scoring Matrices

DNA
• Identity
• Transition/Transversion

Proteins
• PAM
• BLOSUM

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6   2   4


The BLAST algorithm

Break the search sequence into words
• W = 3 for proteins, W = 12 for DNA

MCGPFILGTYC
MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC

Include in the search all words that score above a certain value (T) for any search word

MCG   CGP   …
MCT   MGP   …
MCN   CTP   …
…     …

This list can be computed in linear time
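The word list and its neighborhood can be sketched as below. The names `words`, `neighborhood`, and `identity_score` are hypothetical, and the toy identity scoring (1 for a match, 0 otherwise) stands in for a real substitution matrix such as PAM or BLOSUM. The sketch keeps candidate words scoring at least T, and it enumerates every possible word by brute force, which is simple but not the linear-time construction mentioned on the slide.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def words(sequence, w=3):
    """All overlapping words of length w in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def identity_score(x, y):
    # Toy scoring for illustration only; a real search would use a matrix.
    return 1 if x == y else 0


def neighborhood(word, threshold, score=identity_score):
    """Every word whose score against `word` is at least `threshold`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


query_words = words("MCGPFILGTYC")
print(query_words)
print(len(neighborhood("MCG", threshold=2)))
```

With the toy scoring and T = 2, the neighborhood of MCG contains MCG itself plus every word that differs from it in exactly one position.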


The BLAST Algorithm (2)

Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string
  (regular expression matching: FSA)

HSP (High Scoring Pair) = A match between a query word and the database

Find a “hit”: Two non-overlapping HSPs on a diagonal within distance A

Extend the hit until the score falls below a threshold value, X
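One common reading of the extension step is an ungapped extension that stops once the running score has dropped X below the best score seen so far. The sketch below follows that reading; it is an assumption about the details, not a description of the BLAST source code, and the function name `extend_right` and the +1/–1 scores are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right; returns (best score, length achieving it)."""
    score = best = best_len = 0
    offset = 0
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        offset += 1
        if score > best:
            best, best_len = score, offset      # new best extension
        elif best - score >= x_drop:
            break                               # fell too far below the best
    return best, best_len


print(extend_right("AAAACCCC", "AAAAGGGG", 0, 0, x_drop=2))
print(extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0, x_drop=2))
```

In the second call a single mismatch costs less than X, so the extension continues through it.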


Results from a BLAST search


Search Significance Scores

A search will always return some hits.

How can we determine how “unusual” a particular alignment score is?
• ORFs

Assumptions


Assessing significance requires a distribution

I have an apple of diameter 5”. Is that unusual?

[Figure: histogram of apple diameters; x-axis: Diameter (cm), y-axis: Frequency]


Is a match significant?

Match scores for aligning my sequence with random sequences.

Depends on:
• Scoring system
• Database
• Sequence to search for
  (length, composition)

How do we determine the random sequences?

[Figure: histogram of match scores; x-axis: Match score, y-axis: Frequency]


Generating “random” sequences

Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
• Doesn’t reflect nature

Use sequences from a database
• Might have genuine homology
  (we want unrelated sequences)

Random shuffling of sequences
• Preserves composition
• Removes true homology
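The shuffling approach can be sketched as follows. Everything here is illustrative: `shuffled`, `identity_score`, and `random_score_sample` are hypothetical names, and the scoring function simply counts identical positions rather than performing a real alignment.

```python
import random


def shuffled(sequence, rng):
    """Random permutation of the sequence: same composition, order destroyed."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def identity_score(a, b):
    # Stand-in for a real alignment score: identical positions, no gaps.
    return sum(1 for x, y in zip(a, b) if x == y)


def random_score_sample(query, target, trials, seed=0, score=identity_score):
    """Scores of the query against shuffled copies of the target."""
    rng = random.Random(seed)
    return [score(query, shuffled(target, rng)) for _ in range(trials)]


sample = random_score_sample("GATTACAGATTACA", "GATTACATTAGACA", trials=1000)
print(min(sample), max(sample), sum(sample) / len(sample))
```

The resulting sample is an empirical background distribution against which a real match score can be compared.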


What distribution do we expect to see?

The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
• Example: Throw n dice and compute the mean.
• Distribution of means:

[Figure: two histograms of the mean, one for n = 2 and one for n = 1000]


The extreme value distribution

This means that if we get the match scores for our sequence with n other sequences, the mean would follow a Gaussian distribution.

The maximum of n (i.i.d.) random events tends towards the extreme value distribution as n grows large.


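Comparing distributions

Extreme Value:

\[ f(x) = \frac{1}{\beta}\, e^{-\frac{x-\mu}{\beta}}\, e^{-e^{-\frac{x-\mu}{\beta}}} \]

Gaussian:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]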


Determining P-values

If we can estimate the location and scale parameters of the extreme value distribution, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.

For sequence matches, a scoring system and database can be characterized by two parameters, K and λ, related to the location and scale parameters.
• It would be nice if we could compare hit significance without regard to the database and scoring system used!


Bit Scores

The expected number of hits with a score of at least S is:

\[ E = K m n\, e^{-\lambda S} \]

• Where m and n are the sequence lengths

Normalize the raw score using:

\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]

Obtains a “bit score” S’, with a standard set of units.

The new E-value is:

\[ E = m n\, 2^{-S'} \]


P values and E values

BLAST reports E-values
E = 5, E = 10 versus P = 0.993 and P = 0.99995
When E < 0.01, P-values and E-values are nearly identical
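The figures above are consistent with the usual relationship P = 1 − e^(−E), which follows from treating the number of chance hits as Poisson distributed. The sketch below applies that relationship together with the standard raw-score and bit-score forms of the E-value. The function names are hypothetical, and the values of K and λ are placeholders chosen only to exercise the arithmetic.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance hits scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalised score that no longer depends on K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit, assuming a Poisson count."""
    return 1.0 - math.exp(-e)


K, LAM = 0.041, 0.267          # illustrative placeholder parameters
raw = 50
print(e_value(raw, 100, 1_000_000, K, LAM))
print(e_value_from_bits(bit_score(raw, K, LAM), 100, 1_000_000))
for e in (10, 5, 0.01):
    print(e, p_value(e))
```

Both E-value routes give the same number, and for small E the P-value is close to E itself.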


BLAST parameters

Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.

Raising the segment extension cutoff (X) returns longer extensions for each hit.

Changing the E-value cutoff changes the threshold for reporting a hit.


Part V – Phylogenies

Preliminaries

Distance-based methods

Parsimony Methods


Phylogenetic Trees

Hypothesis about the relationship between organisms
Can be rooted or unrooted

[Figure: a rooted tree with leaves A, B, C, D, E, a Root node and a Time axis, shown beside an unrooted tree on the same five leaves]


Tree proliferation

\[ N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \qquad N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]

Species   Number of Rooted Trees           Number of Unrooted Trees
2         1                                1
3         3                                1
4         15                               3
5         105                              15
10        34,459,425                       2,027,025
15        213,458,046,676,875              7,905,853,580,625
20        8,200,794,532,637,891,559,375    221,643,095,476,699,771,875
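The two counting formulas can be checked directly with exact integer arithmetic. The function names below are hypothetical; the formulas used are N_R = (2n−3)! / (2^(n−2) (n−2)!) for rooted trees and N_U = (2n−5)! / (2^(n−3) (n−3)!) for unrooted trees, with the single unrooted tree for two species handled as a special case.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 species."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 species."""
    if n < 3:
        return 1                      # two species: a single branch
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10, 15, 20):
    print(f"{n:>2}  {rooted_trees(n):>30,}  {unrooted_trees(n):>28,}")
```

Note that the number of rooted trees for n species equals the number of unrooted trees for n + 1 species.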


Molecular phylogenetics

Specific genomic sequence variations (alleles) are much more reliable than phenotypic characteristics

More than one gene should be considered


An ongoing debate

Pheneticists tend to prefer distance-based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states.

Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.


Distance matrix methods

Species    A     B     C     D
B          9     –     –     –
C          8    11     –     –
D         12    15    10     –
E         15    18    13     5


UPGMA

Similar to average-link clustering
Merge the closest two groups
• Replace the distances for the new, merged group with the average of the distance for the previous two groups
Repeat until all species are joined


UPGMA Step 1

Species    A     B     C     D
B          9     –     –     –
C          8    11     –     –
D         12    15    10     –
E         15    18    13     5

Merge D & E: (D,E)

Species    A      B      C
B          9      –      –
C          8     11      –
DE        13.5   16.5   11.5


UPGMA Step 2

Species    A      B      C
B          9      –      –
C          8     11      –
DE        13.5   16.5   11.5

Merge A & C: (A,C) and (D,E)

Species    B      AC
AC        10      –
DE        16.5   12.5


UPGMA Steps 3 & 4

Species    B      AC
AC        10      –
DE        16.5   12.5

Merge B & AC: ((A,C),B) and (D,E)

Merge ABC & DE:

(((A,C),B),(D,E))
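The merging procedure walked through above can be sketched in a few lines. This sketch follows the slides' rule literally, replacing distances with the plain average of the two merged groups' distances; full UPGMA weights that average by group size, which gives the same numbers for the steps shown here. The names `upgma_simple` and `RAW` are hypothetical.

```python
def upgma_simple(names, distances):
    """Repeatedly merge the closest pair; returns [(new group label, merge distance)]."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    labels = list(names)
    merges = []
    while len(labels) > 1:
        # Find the closest pair of current groups.
        pairs = [(p, q) for idx, p in enumerate(labels) for q in labels[idx + 1:]]
        x, y = min(pairs, key=lambda pq: d[frozenset(pq)])
        new = "(" + x + "," + y + ")"
        merges.append((new, d[frozenset((x, y))]))
        labels = [label for label in labels if label not in (x, y)]
        # Distance to the merged group = average of the two old distances.
        for other in labels:
            d[frozenset((new, other))] = (d[frozenset((x, other))]
                                          + d[frozenset((y, other))]) / 2
        labels.append(new)
    return merges


RAW = {("A", "B"): 9, ("A", "C"): 8, ("B", "C"): 11,
       ("A", "D"): 12, ("B", "D"): 15, ("C", "D"): 10,
       ("A", "E"): 15, ("B", "E"): 18, ("C", "E"): 13, ("D", "E"): 5}

for label, distance in upgma_simple("ABCDE", RAW):
    print(label, distance)
```

The final label lists the groups in a different order from the slide, but it describes the same tree topology.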


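Parsimony approaches

Belong to the broader class of character-based methods of phylogenetics
Emphasize simpler, and thus more likely, evolutionary pathways

I:  GCGGACG
II: GTGGACG

[Figure: two evolutionary pathways relating I (C at position 2) and II (T at position 2): a direct one from an ancestor with C or T, and a longer one that passes through an intermediate A]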


Informative and uninformative sites

         Position
Seq   1   2   3   4   5   6
1     G   G   G   G   G   G
2     G   G   G   A   G   T
3     G   G   A   T   A   G
4     G   A   T   C   A   T

For positions 5 & 6, it is possible to select more parsimonious trees – those that invoke fewer substitutions.
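A site is usually called informative for parsimony when at least two different characters each occur in at least two sequences; only then can some trees explain the column with fewer substitutions than others. The sketch below applies that test to the four sequences in the table. The function name `informative_sites` is hypothetical.

```python
from collections import Counter


def informative_sites(sequences):
    """1-based positions where at least two characters each occur twice or more."""
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        shared = [ch for ch, count in counts.items() if count >= 2]
        if len(shared) >= 2:
            sites.append(position)
    return sites


ALIGNMENT = ["GGGGGG",
             "GGGAGT",
             "GGATAG",
             "GATCAT"]
print(informative_sites(ALIGNMENT))
```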


Parsimony methods

Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
• Can be weighted by transition/transversion probabilities, etc.
Select the most parsimonious


Branch and Bound methods

Key problem – number of possible trees grows enormously as the number of species gets large
Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded
• Requires a “good guess” at the cost of the best tree


Branch and Bound for TSP

Find a minimum cost round-trip path that visits each intermediate city exactly once
NP-complete
Greedy approach: A,G,E,F,B,D,C,A = 251

[Figure: weighted graph connecting the cities A, B, C, D, E, F and G]


Search all possible paths

[Figure: the weighted city graph from the previous slide, with the search tree of partial paths below it]

All paths
  AG (20)
    AGF (88)
      AGFB, AGFE, AGFC
    AGE (55)
  AB (46)
  AC (93)
    ACB (175)
      ACBE (257)
    ACD
    ACF

Best estimate: 251
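The pruning idea can be sketched on a small example: start from the greedy tour as the best estimate, then abandon any partial path whose cost already reaches that estimate. The distance matrix below is hypothetical, since the edge weights of the slide's graph cannot be recovered from the transcript, and the function names are assumptions.

```python
def tour_cost(dist, tour):
    """Cost of visiting the cities in order and returning to the start."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def greedy_tour(dist, start=0):
    """Always move to the nearest unvisited city."""
    tour = [start]
    while len(tour) < len(dist):
        here = tour[-1]
        remaining = [c for c in range(len(dist)) if c not in tour]
        tour.append(min(remaining, key=lambda c: dist[here][c]))
    return tour


def branch_and_bound(dist, start=0):
    best_tour = greedy_tour(dist, start)          # initial best estimate
    best_cost = tour_cost(dist, best_tour)

    def extend(path, cost):
        nonlocal best_tour, best_cost
        if cost >= best_cost:
            return                                # bound: cannot beat the estimate
        if len(path) == len(dist):
            total = cost + dist[path[-1]][start]  # close the round trip
            if total < best_cost:
                best_tour, best_cost = path, total
            return
        for city in range(len(dist)):             # branch on the next city
            if city not in path:
                extend(path + [city], cost + dist[path[-1]][city])

    extend([start], 0)
    return best_tour, best_cost


# Hypothetical symmetric distances between five cities (illustration only).
DIST = [[0, 20, 46, 93, 35],
        [20, 0, 68, 57, 31],
        [46, 68, 0, 82, 15],
        [93, 57, 82, 0, 17],
        [35, 31, 15, 17, 0]]

print(tour_cost(DIST, greedy_tour(DIST)))
print(branch_and_bound(DIST))
```

The bound is only valid because all distances are non-negative: extending a path can never make it cheaper.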


Parsimony – Branch and Bound

Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
Use branch and bound to explore all feasible trees
Replace the best estimate as better trees are found
Choose the most parsimonious


Parsimony example

         Position
Seq   1   2   3   4   5   6
1     G   G   G   G   G   G
2     G   G   G   A   G   T
3     G   G   A   T   A   G
4     G   A   T   C   A   T

Position 5:

All trees
  (1,2) [0]    (1,3) [1]    (1,4) [1]
  Etc.


Part VI – Aligning protein sequences

PAM matrices

BLOSUM matrices


Sequence Alignments Revisited

Scoring nucleotide sequence alignments was easier
• Match score
• Possibly different scores for transitions and transversions
For amino acids, there are many more possible substitutions
How do we score which substitutions are highly penalized and which are moderately penalized?
• Physical and chemical characteristics
• Empirical methods


Scoring Mismatches

Physical and chemical characteristics
• V → I – Both small, both hydrophobic, conservative substitution, small penalty
• V → K – Small → large, hydrophobic → charged, large penalty
• Requires some expert knowledge and judgement

Empirical methods
• How often does the substitution V → I occur in proteins that are known to be related?
  Scoring matrices: PAM and BLOSUM


PAM matrices

PAM = “Point Accepted Mutation”: interested only in mutations that have been “accepted” by natural selection

Starts with a multiple sequence alignment of very similar (>85% identity) proteins, assumed to be homologous

Compute the relative mutability, m_i, of each amino acid
• e.g. m_A = how many times was alanine substituted with anything else?


Relative mutability

ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL

Across all pairs of sequences, there are 28 A → X substitutions

There are 10 ALA residues, so m_A = 2.8
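The two counts can be reproduced with a short script. The sketch compares every pair of sequences column by column and counts the positions where exactly one of the two residues is alanine, then divides by the number of alanine residues in the alignment. The function name `relative_mutability` is hypothetical.

```python
from itertools import combinations


def relative_mutability(alignment, residue):
    """Return (pairwise substitutions involving residue, occurrences, ratio)."""
    substitutions = 0
    for first, second in combinations(alignment, 2):
        for x, y in zip(first, second):
            if x != y and residue in (x, y):
                substitutions += 1
    occurrences = sum(seq.count(residue) for seq in alignment)
    return substitutions, occurrences, substitutions / occurrences


ALIGNMENT = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
print(relative_mutability(ALIGNMENT, "A"))
```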


PAM Matrices, cont’d

Construct a phylogenetic tree for the sequences in the alignment

                        ACGCTAFKI
             A→G /                 \ I→L
          GCGCTAFKI                 ACGCTAFKL
     A→G /       \ A→L         C→S /       \ G→A
GCGCTGFKI     GCGCTLFKI   ASGCTAFKL     ACACTAFKL

Calculate substitution frequencies F_{X,Y}

Substitutions may have occurred either way, so A → G also counts as G → A.

F_{G,A} = 3


Mutation Probabilities

M_{i,j} represents the probability of J → I substitution.

\[ M_{ij} = \frac{m_j\, F_{ij}}{\sum_i F_{ij}} \]

\[ M_{G,A} = \frac{2.7 \times 3}{4} = 2.025 \]

[Figure: the phylogenetic tree of the seven sequences, with branches labelled A→G, I→L, A→G, A→L, C→S and G→A]


The PAM matrix

The entries, R_{i,j}, are the M_{i,j} values divided by the frequency of occurrence, f_i, of residue i.

f_G = 10 GLY / 63 residues = 0.1587

R_{G,A} = log(2.025/0.1587) = log(12.760) = 1.106

The log is taken so that we can add, rather than multiply, entries to get compound probabilities.

Log-odds matrix
Diagonal entries are 1 – m_j


Interpretation of PAM matrices

PAM-1 – one substitution per 100 residues (a PAM unit of time)
Multiply them together to get PAM-100, etc.

“Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”


PAM matrix considerations

If M_{i,j} is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.

PAM-1 – similar sequences, PAM-1000 – very dissimilar sequences

Page 145: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

BLOSUM matrix

Starts by clustering proteins by similarity
Avoids problems with small probabilities by using averages over clusters
Numbering works opposite to PAM: higher numbers are for more similar sequences
• BLOSUM-62 is appropriate for sequences of about 62% identity, while BLOSUM-80 is appropriate for more similar sequences.

Page 146: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Part VII – Protein Structure

Preliminaries

Lattice Models

Protein Folding Algorithms

Illustrations from: C Branden and J Tooze, Introduction to Protein Structure, 2nd ed. Garland Pub. ISBN 0815302703

Page 147: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The many functions of proteins

Mechanoenzymes: myosin, actin
Rhodopsin: allows vision
Globins: transport oxygen
Antibodies: immune system
Enzymes: pepsin, renin, carboxypeptidase A
Receptors: transmit messages through membranes
Vitellogenin: molecular velcro
• And hundreds of thousands more…

Page 148: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Proteins are chains of amino acids

Polymer – a molecule composed of repeating units

Page 149: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Amino acid composition

Basic Amino Acid Structure:
• The side chain, R, varies for each of the 20 amino acids

[Figure: basic amino acid structure, with the amino group, carboxyl group, and side chain (R) labeled.]

Page 150: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Peptide Bond

Dehydration synthesis
Repeating backbone: N–Cα–C(=O)–N–Cα–C(=O)
• Convention – start at amino terminus and proceed to carboxy terminus

Page 151: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Peptidyl polymers

A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.
Since part of the amino acid is lost during dehydration synthesis, we call the units of a protein amino acid residues.

[Figure: peptide backbone, with the carbonyl carbon and amide nitrogen labeled.]

Page 152: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Side chain properties

Recall that the electronegativity of carbon is at about the middle of the scale for light elements
• Carbon does not make hydrogen bonds with water easily – hydrophobic
• O and N are generally more likely than C to h-bond to water – hydrophilic
We group the amino acids into three general groups:
• Hydrophobic
• Charged (positive/basic & negative/acidic)
• Polar

Page 153: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Hydrophobic Amino Acids

Proline severely limits allowable conformations!

Page 154: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Charged Amino Acids

Page 155: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Polar Amino Acids

Page 156: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

More Polar Amino Acids

And then there’s…

Page 157: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Planarity of the peptide bond

Phi (φ) – the angle of rotation about the N–Cα bond.
Psi (ψ) – the angle of rotation about the Cα–C bond.
The planar bond angles and bond lengths are fixed.

Page 158: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Phi and psi

φ = ψ = 180° is the extended conformation
φ: Cα to N–H
ψ: C=O to Cα

[Figure: backbone segment showing Cα, C=O, and N–H.]
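Since φ and ψ are torsion angles, each is defined by four consecutive backbone atoms. The sketch below computes a signed torsion angle from 3-D coordinates using the standard atan2 formulation; the function names are illustrative, and the sample coordinates are copied from the ATOM records on the Protein Data Bank slide (φ of Gly 2 uses C of Ala 1, then N, CA, and C of Gly 2).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Sanity checks on idealized geometry: cis is 0 degrees, trans is 180 degrees.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))

# phi of Gly 2 from the sample PDB records: C(Ala 1), N, CA, C (Gly 2).
phi_gly2 = dihedral((23.572, 46.251, 111.545), (23.656, 45.723, 110.336),
                    (24.216, 44.393, 110.087), (25.653, 44.308, 110.579))
print(round(cis, 1), round(abs(trans), 1), round(phi_gly2, 1))
```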

Page 159: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Ramachandran Plot

G. N. Ramachandran – first calculations of sterically allowed regions of phi and psi
Note the structural importance of glycine

[Figure: Ramachandran plots. Panels: Calculated; Observed (non-glycine); Observed (glycine).]

Page 160: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Primary & Secondary Structure

Primary structure = the linear sequence of amino acids comprising a protein: AGVGTVPMTAYGNDIQYYGQVT…
Secondary structure
• Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet
• The location and direction of these periodic, repeating structures is known as the secondary structure of the protein

Page 161: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The alpha helix

φ ≈ ψ ≈ −60°

Page 162: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Properties of the alpha helix

φ ≈ ψ ≈ −60°
Hydrogen bonds between C=O of residue n and NH of residue n+4
3.6 residues/turn
1.5 Å/residue rise
100°/residue turn
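The three helix numbers on this slide are related by simple arithmetic, sketched below. The constants come from the slide; the function names, the derived pitch per turn, and the 18-residue example are illustrative.

```python
RESIDUES_PER_TURN = 3.6   # from the slide
RISE_PER_RESIDUE = 1.5    # angstroms per residue, from the slide

def helix_geometry(n_residues):
    """Return (degrees of turn per residue, pitch in Å, helix length in Å)."""
    turn_per_residue = 360.0 / RESIDUES_PER_TURN
    pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE
    length = n_residues * RISE_PER_RESIDUE
    return turn_per_residue, pitch, length

def backbone_h_bonds(n_residues):
    """Pairs (n, n+4): C=O of residue n bonded to N-H of residue n+4."""
    return [(n, n + 4) for n in range(1, n_residues - 3)]

turn, pitch, length = helix_geometry(18)
print(round(turn, 1), round(pitch, 1), length)
print(backbone_h_bonds(10))
```

One full turn is 360° / 3.6 residues = 100° per residue, and the rise per turn (the pitch) is 3.6 × 1.5 Å = 5.4 Å.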

Page 163: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Properties of α-helices

4 – 40+ residues in length
Often amphipathic or “dual-natured”
• Half hydrophobic and half hydrophilic
• Mostly when surface-exposed
If we examine many α-helices, we find trends…
• Helix formers: Ala, Glu, Leu, Met
• Helix breakers: Pro, Gly, Tyr, Ser

Page 164: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The beta strand (& sheet)

φ ≈ −135°, ψ ≈ +135°

Page 165: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Properties of beta sheets

Formed of stretches of 5-10 residues in extended conformation
Pleated – each Cα a bit above or below the previous
Parallel/antiparallel, contiguous/non-contiguous

Page 166: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Parallel and anti-parallel β-sheets

Anti-parallel is slightly energetically favored

[Figure: two panels, Anti-parallel and Parallel.]

Page 167: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Turns and Loops

Secondary structure elements are connected by regions of turns and loops
Turns – short regions of non-α, non-β conformation
Loops – larger stretches with no secondary structure. Often disordered.
• “Random coil”
• Sequences vary much more than secondary structure regions

Page 168: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Levels of Protein Structure

Secondary structure elements combine to form tertiary structure
Quaternary structure occurs in multienzyme complexes
• Many proteins are active only as homodimers, homotetramers, etc.

Page 169: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Disulfide Bonds

Two cysteines in close proximity will form a covalent bond
Disulfide bond, disulfide bridge, or dicysteine bond.
Significantly stabilizes tertiary structure.

Page 170: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Protein Structure Examples

Page 171: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Determining Protein Structure

There are O(100,000) distinct proteins in the human proteome.
3D structures have been determined for 14,000 proteins, from all organisms
• Includes duplicates with different ligands bound, etc.
Coordinates are determined by X-ray crystallography

Page 172: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

X-Ray Crystallography

[Figure: protein crystal, ~0.5 mm across.]

• The crystal is a mosaic of millions of copies of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green” thumb) to grow.

Page 173: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

X-Ray diffraction

Image is averaged over:
• Space (many copies)
• Time (of the diffraction experiment)

Page 174: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Electron Density Maps

Resolution is dependent on the quality/regularity of the crystal
R-factor is a measure of “leftover” electron density
Solvent fitting
Refinement

Page 175: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Protein Data Bank

ATOM      1  N   ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA  ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C   ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O   ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB  ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N   GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA  GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C   GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O   GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N   VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA  VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C   VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O   VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB  VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1 VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2 VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228

http://www.rcsb.org/pdb/
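A minimal sketch of reading ATOM records like those above follows. It splits on whitespace, which works for these sample lines; real PDB files are fixed-column, so a production parser should slice by column position instead. The `Atom` type and function names are illustrative, not part of any PDB library.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom(line):
    """Parse one whitespace-separated ATOM record (assumption: no fused fields)."""
    f = line.split()
    if f[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]), float(f[9]), float(f[10]))

def distance(a, b):
    """Euclidean distance between two atoms, in angstroms."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)

n = parse_atom("ATOM 1 N ALA E 1 22.382 47.782 112.975 1.00 24.09 3APR 213")
ca = parse_atom("ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214")
print(n.res_name, n.name, ca.name, round(distance(n, ca), 2))
```

The N to CA distance for the first alanine comes out close to 1.5 Å, a typical covalent bond length, which is a useful sanity check that the coordinates were read in the right order.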

Page 176: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Views of a protein

[Figure: two renderings, Wireframe and Ball and stick.]

Page 177: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Views of a protein

[Figure: two renderings, Spacefill and Cartoon.]

CPK colors
• Carbon = green, black, or grey
• Nitrogen = blue
• Oxygen = red
• Sulfur = yellow
• Hydrogen = white

Page 178: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Protein Folding Problem

Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”

Input: AAVIKYGCAL…
Output: φ₁ψ₁, φ₂ψ₂, … = backbone conformation (no side chains yet)

Page 179: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Forces driving protein folding

It is believed that hydrophobic collapse is a key driving force for protein folding
• Hydrophobic core
• Polar surface interacting with solvent
Minimum volume (no cavities)
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions

Page 180: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Folding help

Proteins are, in fact, only marginally stable
• Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form
Many proteins help in folding
• Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
• Chaperones – break up aggregates and (in theory) unfold misfolded proteins

Page 181: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Hydrophobic Core

Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.
The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin
The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently
Sickle cell anemia was the first identified molecular disease

Page 182: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Sickle Cell Anemia

Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.

Page 183: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Computational Problems in Protein Folding

Two key questions:
• Evaluation – how can we tell a correctly folded protein from an incorrectly folded protein?
    • H-bonds, electrostatics, hydrophobic effect, etc.
    • Derive a function, see how well it does on “real” proteins
• Optimization – once we get an evaluation function, can we optimize it?
    • Simulated annealing/Monte Carlo
    • EC (evolutionary computation)
    • Heuristics
    • We’ll talk more about these methods later…

Page 184: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Fold Optimization

Simple lattice models (HP-models)
• Two types of residues: hydrophobic and polar
• 2-D or 3-D lattice
• The only force is hydrophobic collapse
• Score = number of HH contacts

Page 185: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Scoring Lattice Models

H/P model scoring: count noncovalent hydrophobic interactions.
Sometimes:
• Penalize for buried polar or surface hydrophobic residues
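The basic H/P score can be sketched in a few lines. This version assumes a 2-D square lattice, a fold given as a list of (x, y) coordinates (one per residue), and counts only H–H pairs that are lattice neighbors but not consecutive in the chain; the optional penalties mentioned above are left out. The function name is illustrative.

```python
def hp_score(sequence, coords):
    """Count noncovalent H-H contacts for a fold on a 2-D square lattice."""
    index_at = {xy: i for i, xy in enumerate(coords)}
    if len(index_at) != len(coords):
        raise ValueError("fold has a bump: two residues share a lattice site")
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score

# A four-residue chain folded into a unit square: residues 0 and 3 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_score("HPPH", square), hp_score("HPPP", square))
```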

Page 186: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

What can we do with lattice models?

For smaller polypeptides, exhaustive search can be used
• Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
• Greedy, branch and bound
• Evolutionary computing, simulated annealing
• Graph theoretical methods
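For short chains, exhaustive search is easy to sketch: enumerate every self-avoiding walk on the lattice and keep the fold with the most H–H contacts. The code below assumes a 2-D square lattice and the plain contact-count score; the first bond is fixed to remove rotated duplicates. Names are illustrative, and the running time grows exponentially with chain length.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def hh_contacts(sequence, coords):
    """H-H pairs that are lattice neighbors but not chain neighbors."""
    index_at = {xy: i for i, xy in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        for neighbor in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = index_at.get(neighbor)
            if (j is not None and abs(i - j) > 1
                    and sequence[i] == "H" and sequence[j] == "H"):
                total += 1
    return total

def best_fold(sequence):
    """Exhaustively search self-avoiding walks; return (score, coords)."""
    if len(sequence) < 2:
        return 0, [(0, 0)] * len(sequence)
    best = [-1, None]

    def extend(coords, occupied):
        if len(coords) == len(sequence):
            score = hh_contacts(sequence, coords)
            if score > best[0]:
                best[0], best[1] = score, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:        # self-avoidance: no bumps
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0), (1, 0)], {(0, 0), (1, 0)})
    return best[0], best[1]

score, fold = best_fold("HPPH")
print(score, fold)
```

On a square lattice two residues can only touch if their chain positions differ by an odd number, so a sequence such as HPHP can never score above zero, no matter how it is folded.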

Page 187: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Learning from Lattice Models

The “hydrophobic zipper” effect:

Ken Dill ~ 1997

Page 188: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Representing a lattice model

Absolute directions
• UURRDLDRRU
Relative directions
• LFRFRRLLFFL
• Advantage: immediate reversals (UD or RL), which are possible in the absolute encoding, cannot occur
• Only three directions: LRF
What about bumps? LFRRR
• Bad score
• Use a better representation
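The relative encoding can be decoded into lattice coordinates with a heading vector that is rotated on L and R. In this sketch the chain starts at the origin facing "up" and each letter places one more residue; both the starting heading and the function names are assumptions made for illustration.

```python
def decode_relative(moves, start=(0, 0), heading=(0, 1)):
    """Turn a string of L/R/F moves into a list of lattice coordinates."""
    x, y = start
    dx, dy = heading
    coords = [(x, y)]
    for move in moves:
        if move == "L":
            dx, dy = -dy, dx      # rotate heading 90 degrees left
        elif move == "R":
            dx, dy = dy, -dx      # rotate heading 90 degrees right
        elif move != "F":
            raise ValueError("unknown move: " + move)
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

def has_bump(coords):
    """True if two residues occupy the same lattice site."""
    return len(set(coords)) != len(coords)

print(has_bump(decode_relative("LFRRR")))        # the bump example above
print(has_bump(decode_relative("LFRFRRLLFFL")))  # the valid example above
```

A forward step followed by three right turns always walks around a unit square back onto an occupied site, which is why LFRRR collides regardless of the initial heading.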

Page 189: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Preference-order representation

Each position has two “preferences”
• If it can’t have either of the two, it will take the “least favorite” path if possible
Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}

Page 190: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

More realistic models

Higher resolution lattices (45° lattice, etc.)
Off-lattice models
• Local moves
• Optimization/search methods and φ/ψ representations
    • Greedy search
    • Branch and bound
    • EC, Monte Carlo, simulated annealing, etc.

Page 191: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

The Other Half of the Picture

Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
Theoretical force field:
$G = G_{\text{van der Waals}} + G_{\text{h-bonds}} + G_{\text{solvent}} + G_{\text{coulomb}}$
Empirical force fields
• Start with a database
• Look at neighboring residues – similar to known protein folds?

Page 192: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Threading: Fold recognition

Given:
• Sequence: IVACIVSTEYDVMKAAR…
• A database of molecular coordinates
Map the sequence onto each fold
Evaluate
• Objective 1: improve scoring function
• Objective 2: folding

Page 193: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Secondary Structure Prediction

AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…

Page 194: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Secondary Structure Prediction

Easier than folding
• Current algorithms can predict secondary structure with 70-80% accuracy
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
• Based on frequencies of occurrence of residues in helices and sheets
PHD – Neural network based
• Uses a multiple sequence alignment
• Rost & Sander, Proteins, 1994, 19, 55-72

Page 195: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Chou-Fasman Parameters

Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142   83    66       0.06   0.076   0.035   0.058
Arginine       R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid  D      101   54    146      0.147  0.11    0.179   0.081
Asparagine     N      67    89    156      0.161  0.083   0.191   0.091
Cysteine       C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid  E      151   37    74       0.056  0.06    0.077   0.064
Glutamine      Q      111   110   98       0.074  0.098   0.037   0.098
Glycine        G      57    75    156      0.102  0.085   0.19    0.152
Histidine      H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine     I      108   160   47       0.043  0.034   0.013   0.056
Leucine        L      121   130   59       0.061  0.025   0.036   0.07
Lysine         K      114   74    101      0.055  0.115   0.072   0.095
Methionine     M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  F      113   138   60       0.059  0.041   0.065   0.065
Proline        P      57    55    152      0.102  0.301   0.034   0.068
Serine         S      77    75    143      0.12   0.139   0.125   0.106
Threonine      T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine       Y      69    147   114      0.082  0.065   0.114   0.125
Valine         V      106   170   50       0.062  0.048   0.028   0.053

Page 196: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Chou-Fasman Algorithm

Identify α-helices
• Find 4 out of 6 contiguous amino acids that have P(a) > 100
• Extend the region until 4 contiguous amino acids with P(a) < 100 are found
• Compute P(a) and P(b), summed over the region; if the region is >5 residues and P(a) > P(b), identify it as a helix
Repeat for β-sheets [use P(b)]
If an α and a β region overlap, the overlapping region is predicted according to P(a) and P(b)
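The nucleation and extension steps for helices can be sketched as follows. This is a simplified reading of the rules above, not the full published procedure: it reports only the first region found, extends only to the right, and stops when the next four residues all have P(a) < 100. The P(a) values are copied from the parameter table; the function name is illustrative.

```python
# P(a) values from the Chou-Fasman parameter table.
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}

def first_helix_region(seq, window=6, needed=4, stop_run=4, threshold=100):
    """Return (start, end) of the first candidate helix, end exclusive, or None."""
    n = len(seq)
    for start in range(n - window + 1):
        formers = sum(P_ALPHA[r] > threshold for r in seq[start:start + window])
        if formers < needed:
            continue                      # not a nucleation site
        end = start + window
        while end < n:
            run = seq[end:end + stop_run]
            if len(run) == stop_run and all(P_ALPHA[r] < threshold for r in run):
                break                     # four helix-unfavorable residues ahead
            end += 1
        return start, end
    return None

seq = "CAENKLDHVRGPTCILFMTWYNDGP"
region = first_helix_region(seq)
print(region, seq[region[0]:region[1]])
```

For the example sequence used on the later slide, the window CAENKL nucleates the helix and extension stops in front of RGPT, giving the region CAENKLDHV.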

Page 197: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Chou-Fasman, cont’d

Identify hairpin turns:
• P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
• Predict a hairpin turn starting at positions where:
    • P(t) > 0.000075
    • The average P(turn) for the four residues > 100
    • P(a) < P(turn) > P(b) for the four residues

Accuracy 60-65%

Page 198: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Chou-Fasman Example

CAENKLDHVRGPTCILFMTWYNDGP
CAENKL – Potential helix (all but C and N have P(a) > 100)
Residues with P(a) < 100: RNCGPSTY
• Extend: When we reach RGPT, we must stop
• CAENKLDHV: P(a) = 972, P(b) = 843
• Declare alpha helix

Identifying a hairpin turn
• VRGP: P(t) = 0.000085
• Average P(turn) = 113.25
• Avg P(a) = 79.5, Avg P(b) = 98.25
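The numbers in this example can be reproduced directly from the parameter table. The sketch below includes only the twelve residues that the example needs, with values copied from the table; the dictionary layout and function names are illustrative.

```python
# residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)), from the table
PARAMS = {
    "C": (70, 119, 119, 0.149, 0.05, 0.117, 0.128),
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "E": (151, 37, 74, 0.056, 0.06, 0.077, 0.064),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.07),
    "D": (101, 54, 146, 0.147, 0.11, 0.179, 0.081),
    "H": (100, 87, 95, 0.14, 0.047, 0.093, 0.054),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
    "R": (98, 93, 95, 0.07, 0.106, 0.099, 0.085),
    "G": (57, 75, 156, 0.102, 0.085, 0.19, 0.152),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
}

def region_totals(region):
    """Summed P(a) and P(b) over a candidate region."""
    return (sum(PARAMS[r][0] for r in region),
            sum(PARAMS[r][1] for r in region))

def turn_check(tetrad):
    """Apply the three hairpin-turn criteria to four consecutive residues."""
    p_t = 1.0
    for offset, residue in enumerate(tetrad):
        p_t *= PARAMS[residue][3 + offset]   # f(i), f(i+1), f(i+2), f(i+3)
    avg_a = sum(PARAMS[r][0] for r in tetrad) / 4
    avg_b = sum(PARAMS[r][1] for r in tetrad) / 4
    avg_turn = sum(PARAMS[r][2] for r in tetrad) / 4
    is_turn = p_t > 0.000075 and avg_turn > 100 and avg_a < avg_turn > avg_b
    return p_t, avg_a, avg_b, avg_turn, is_turn

helix_a, helix_b = region_totals("CAENKLDHV")
p_t, avg_a, avg_b, avg_turn, is_turn = turn_check("VRGP")
print(helix_a, helix_b)
print(round(p_t, 6), avg_a, avg_b, avg_turn, is_turn)
```

All three turn criteria hold for VRGP, so a hairpin turn is predicted there, and the summed P(a) for CAENKLDHV exceeds the summed P(b), consistent with the helix call.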

Page 199: Fundamental Concepts of Bioinformatics OCCBIO 2006 Tutorial Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics

Other topics?

Tools and languages
Forensic DNA
Microarray analysis