8/27/20151 Pairwise sequence Alignment. 8/27/20152 Many of the images in this power point presentation are from Bioinformatics and Functional Genomics

04/21/23 1

Pairwise sequence Alignment

04/21/23 2

Many of the images in this power point presentationare from Bioinformatics and Functional Genomicsby Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

Many slides of this power point presentation Are from slides of Dr. Jonathon Pevsner and other people. The Copyright belong to the original authors. Thanks!

Copyright notice

http://he-cda.wiley.com/WileyCDA/HigherEdTitle/productCd-0471210048.html

http://he-cda.wiley.com/WileyCDA/HigherEdTitle/productCd-0471210048.html

04/21/23 3

• It is used to decide if two proteins (or genes) are related structurally or functionally

• It is used to identify domains or motifs that are shared between proteins

• It is the basis of BLAST searching

• It is used in the analysis of genomes

Pairwise sequence alignment is the most fundamental operation of bioinformatics

04/21/23 4

04/21/23 5

Pairwise alignment: protein sequencescan be more informative than DNA

• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties

• codons are degenerate: changes in the third position often do not alter the amino acid that is specified

• DNA sequences can be translated into protein, and then used in pairwise alignments

04/21/23 6

• DNA can be translated into six potential proteins

5’ CAT CAA 5’ ATC AAC 5’ TCA ACT

5’ GTG GGT 5’ TGG GTA 5’ GGG TAG

Pairwise alignment: protein sequencescan be more informative than DNA

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

04/21/23 7

Pairwise alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Definitions

04/21/23 8

HomologySimilarity attributed to descent from a common ancestor.

Definitions

Page 44retinol-binding protein (rbp)

(NP_006735)-lactoglobulin

(P02754)

04/21/23 9

HomologySimilarity attributed to descent from a common ancestor.

Definitions

Identity

The extent to which two (nucleotide or amino acid) sequences are invariant.

Page 44

RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + +GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

04/21/23 10

Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralogs Homologous sequences within a single species that arose by gene duplication.

Definitions: two types of homology

04/21/23 11

04/21/23 12

Definitions• Similarity: The extent to which nucleotide or protein

sequences are related.

• Identity: The extent to which two sequences are invariant.

• Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

• Percent Similarity of two protein sequence: Identity + Conservation.

04/21/23 13

04/21/23 14

Conservation Substitutions

• Basic Amino Acid (K, R, H)

• Acidic Amino Acid (D, E)

• Hydroxylated Amino Acid (S, T)

• Hydrophobic Amino Acid (W, F, Y, L, I, V, M, A)

Pairwise alignment of retinol-binding protein and -lactoglobulin

04/21/23 15

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Identity(bar)


04/21/23 16





Somewhatsimilar

(one dot)

Verysimilar

(two dots)

04/21/23 17

Gaps

• Common Mutations: substitutions, insertions, deletions.

• Insertions and deletion lead to a Gap in the alignment: Positions at which a letter is paired with a null are called gaps. – Gap scores are typically negative. – Since a single mutational event may cause the

insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

– In BLAST, it is rarely necessary to change gap values from the default.

RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + +GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

04/21/23 18






Internalgap

Terminalgap

04/21/23 19

General approach to pairwise alignment

• Choose two sequences• Select an algorithm that generates a score• Allow gaps (insertions, deletions)• Score reflects degree of similarity• Alignments can be global or local• Estimate probability that the alignment occurred by chance

04/21/23 20

Key Issues in Pairwise Alignment

• The scoring system

• Types of alignments (local vs. global)

• The alignment algorithm

• Measuring alignment significance

04/21/23 21

An alignment scoring system is required to evaluate how good an alignment is

• positive and negative values assigned

• gap creation and extension penalties

• positive score for identities

• some partial positive score for conservative substitutions

• use of a substitution matrix

04/21/23 22

Scoring systems

• Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: PAM, Blosum.

• Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.

• Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

04/21/23 23

Gap Scores

• Gap scores are negative (they are “costs”)

• Linear: gap score is proportional to length of gap. f(g) = -g · d

• Affine: gap score is a constant plus a linear score. With affine scores, many small gaps cost more than one large gap. f(g) = -d -(g-1)e, for g>0

04/21/23 24

Calculation of an alignment score

Match Score Tables

• Match scores are computed using a substitution matrix

Log-odds Match Scores

• Match scores are usually log-odds scores.

s(a,b) = log(Pr(a, b | M) / Pr(a, b | R))• Pr(a, b | M) = pab is the probability of the residues

a and b appearing assuming the correct positions are aligned and the sequences descended from a common ancestor.

• Pr(a, b | R) is the probability of a and b if we pick two random sequence positions to align. Assuming independence (null model) gives Pr(a, b | R)=qaqb.

Adding log-odds substitution scores gives the log-odds of the alignment• For ungapped

alignments, the log-odds alignment score of two sequence segments x and y is equal to the log-odds ratio of the segments.

• Let x = x1 x2 ...xn and y = y1 y2 ...yn. )|,Pr(

)|,Pr(log

log

log

),(

Ryx

Myx

qq

p

qq

p

yxsS

i yx

yx

yx

yx

i

ii

i

ii

ii

ii

ii

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.

Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments(or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.

The two major types of substitution matrices arePAM and BLOSUM.

Substitution Matrix

04/21/23 29

Dayhoff Model: Accepted Point mutations

• Accepted Point Mutations (PAM): a replacement of one amino acid in protein by another residue that has been accepted by natural selection.

• When PAM occurs:– A gene undergoes a DNA mutation such that

it encodes a different amino acid– The entire species adopts that changes as the

predominant form.

04/21/23 30

Dayhoff’s 34 protein superfamilies

Protein PAMs per 100 million yearsIg kappa chain 37Kappa casein 33Lactalbumin 27Hemoglobin 12Myoglobin 8.9Insulin 4.4Histone H4 0.10Ubiquitin 0.00

04/21/23 31

A Ala

R Arg

N Asn

D Asp

C Cys

Q Gln

E Glu

G Gly

A

R 30

N 109 17

D 154 0 532

C 33 10 0 0

Q 93 120 50 76 0

E 266 0 94 831 0 422

G 579 10 156 162 10 30 112

H 21 103 226 43 10 243 23 10

Dayhoff’s numbers of “accepted point mutations”:what amino acid substitutions occur in proteins?

Number of Accepted point mutations, multiplied by 10, in 1572 cases

04/21/23 32

Dayhoff et al. described the “relative mutability” of each amino acid as the probability that amino acid will change over a small evolutionary time period. The total number of changes are counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also considered. A ratio is determined.

Relative mutability [changes] / [occurrences]

Example:

sequence 1 ala his val alasequence 2 ala arg ser val

For ala, relative mutability = [1] / [3] = 0.33For val, relative mutability = [2] / [2] = 1.0

The relative mutability of amino acids

04/21/23 33

The relative mutability of amino acids

Asn 134 His 66Ser 120 Arg 65Asp 106 Lys 56Glu 102 Pro 56Ala 100 Gly 49Thr 97 Tyr 41Ile 96 Phe 41Met 94 Leu 40Gln 93 Cys 20Val 74 Trp 18

04/21/23 34

Normalized frequencies of amino acids

Gly 8.9% Arg 4.1%Ala 8.7% Asn 4.0%Leu 8.5% Phe 4.0%Lys 8.1% Gln 3.8%Ser 7.0% Ile 3.7%Val 6.5% His 3.4%Thr 5.8% Cys 3.3%Pro 5.1% Tyr 3.0%Glu 5.0% Met 1.5%Asp 4.7% Trp 1.0%

blue=6 codons; red=1 codon

These values sum to 1.0.

04/21/23 35

PAM matrices are based on global alignments of closely related proteins.

Other PAM matrices are extrapolated from PAM1.

All the PAM data come from closely related proteins(>85% amino acid identity)

PAM matrices:Accepted point mutations

04/21/23 36

Dayhoff’s PAM1

• PAM1: calculated from comparisons of sequences with no more than 1% divergence. – Defined as evolutionary interval.– 1 PAM = PAM1 = 1% average change of all amino

acid positions• Value in PAM1: for one amino acid

– number of “accepted point mutations” × relative mutability × fraction of change from one amino acid to another amino acid over all changes of one amino acid to any other amino acid.

– Normalization over all 20 changes.

04/21/23 37

Dayhoff’s PAM1 mutation probability matrix

A Ala

R Arg

N Asn

D Asp

C Cys

Q Gln

E Glu

G Gly

H His

I Ile

A 9867 2 9 10 3 8 17 21 2 6

R 1 9913 1 0 1 10 0 0 10 3

N 4 1 9822 36 0 4 6 6 21 3

D 6 0 42 9859 0 6 53 6 4 1

C 1 1 0 0 9973 0 0 0 1 1

Q 3 9 4 5 0 9876 27 1 23 1

E 10 0 7 56 0 35 9865 4 2 3

G 21 1 12 11 1 3 7 9935 1 0

H 1 8 18 3 1 20 1 0 9912 0

I 2 2 3 1 2 1 2 0 0 9872

Each element of the matrix shows the probability that an originalamino acid (top) will be replaced by another amino acid (side)

PAM

• After 100 PAMs of evolution, not every residue will have changed– some residues may have mutated several

times– some residues may have returned to their

original state– some residues may not changed at all

04/21/23 39

Dayhoff’s PAM0 mutation probability matrix:the rules for extremely slowly evolving proteins

PAM0 AAla

RArg

NAsn

DAsp

CCys

QGln

EGlu

GGly

A 100% 0% 0% 0% 0% 0% 0% 0%R 0% 100% 0% 0% 0% 0% 0% 0%N 0% 0% 100% 0% 0% 0% 0% 0%D 0% 0% 0% 100% 0% 0% 0% 0%C 0% 0% 0% 0% 100% 0% 0% 0%Q 0% 0% 0% 0% 0% 100% 0% 0%E 0% 0% 0% 0% 0% 0% 100% 0%G 0% 0% 0% 0% 0% 0% 0% 100%

Top: original amino acidSide: replacement amino acid

04/21/23 40

Dayhoff’s PAM2000 mutation probability matrix:the rules for very distantly related proteins

PAM AAla

RArg

NAsn

DAsp

CCys

QGln

EGlu

GGly

A 8.7% 8.7% 8.7% 8.7% 8.7% 8.7% 8.7% 8.7%R 4.1% 4.1% 4.1% 4.1% 4.1% 4.1% 4.1% 4.1%N 4.0% 4.0% 4.0% 4.0% 4.0% 4.0% 4.0% 4.0%D 4.7% 4.7% 4.7% 4.7% 4.7% 4.7% 4.7% 4.7%C 3.3% 3.3% 3.3% 3.3% 3.3% 3.3% 3.3% 3.3%Q 3.8% 3.8% 3.8% 3.8% 3.8% 3.8% 3.8% 3.8%E 5.0% 5.0% 5.0% 5.0% 5.0% 5.0% 5.0% 5.0%G 8.9% 8.9% 8.9% 8.9% 8.9% 8.9% 8.9% 8.9%


04/21/23 41

PAM250 Matrix

• Commonly used

• Describes the frequency of amino acid replacement between distantly related proteins

• An evolutionary distance where proteins share about 20% amino acid identity

04/21/23 42

PAM250 mutation probability matrix A R N D C Q E G H I L K M F P S T W Y V A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9

R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2

N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3

D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3

C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2

Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3

E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3

G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7

H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2

I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9

L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13

K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5

M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2

F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3

P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4

S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6

T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6

W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0

Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2

V 7 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 7 2 4 17


04/21/23 43

A 2 R -2 6 N 0 0 2 D 0 -1 2 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 5 H -1 2 2 1 -3 3 1 -2 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 -2 6 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 A R N D C Q E G H I L K M F P S T W Y V

PAM250 log oddsscoring matrix

04/21/23 44

How do we go from a mutation probabilitymatrix to a log odds matrix?

• The cells in a log odds matrix consist of an “odds ratio”:

the probability that an alignment is authenticthe probability that the alignment was random

The score S for an alignment of residues a,b is given by:

S(a,b) = 10 log10 (Mab/pb)

As an example, for tryptophan,

S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4

04/21/23 45

What do the numbers meanin a log odds matrix?

S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4

A score of +17 for tryptophan means that this alignmentis 50 times more likely than a chance alignment of twoTrp residues.

S(a,b) = 17Probability of replacement (Mab/pb) = xThen17 = 10 log10 x1.7 = log10 x101.7 = x = 50

04/21/23 46

What do the numbers meanin a log odds matrix?

A score of +2 indicates that the amino acid replacementoccurs 1.6 times as frequently as expected by chance.

A score of 0 is neutral.

A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately representshomology (evolutionary descent) is one tenth as frequentas the chance alignment of these amino acids.

04/21/23 47

Comparing two proteins with a PAM1 matrixgives completely different results than PAM250!

Consider two distantly related proteins. A PAM40 matrixis not forgiving of mismatches, and penalizes themseverely. Using this matrix you can find almost no match.

A PAM250 matrix is very tolerant of mismatches.

hsrbp, 136 CRLLNLDGTC btlact, 3 CLLLALALTC * ** * **

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7% hsrbp, 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN * **** * * * * ** *

hsrbp, 86 --CADMVGTFTDTEDPAKFKM btlact, 80 GECAQKKIIAEKTKIPAVFKI ** * ** **

04/21/23 48

04/21/23 49

PAM: “Accepted point mutation”

• Two proteins with 50% identity may have 80 changesper 100 residues. (Why? Because any residue can besubject to back mutations.)

• Proteins with 20% to 25% identity are in the “twilight zone”and may be statistically significantly related.

• PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)

04/21/23 50

Ancestral sequence

Sequence 1ACCGATC

Sequence 2AATAATC

A no change AC single substitution C --> AC multiple substitutions C --> A --> TC --> G coincidental substitutions C --> AT --> A parallel substitutions T --> AA --> C --> T convergent substitutions A --> TC back substitution C --> T --> C

ACCCTAC

Li (1997) p.70

04/21/23 51

Pe

rce

nt

ide

nti

ty

Evolutionary distance in PAMs

Two randomly diverging protein sequences changein a negatively exponential fashion

“twilight zone”

04/21/23 52

Pe

rce

nt

ide

nti

ty

Differences per 100 residues

At PAM1, two proteins are 99% identicalAt PAM10.7, there are 10 differences per 100 residuesAt PAM80, there are 50 differences per 100 residues

At PAM250, there are 80 differences per 100 residues

“twilight zone”

04/21/23 53

PAM matrices reflect different degrees of divergence

PAM250

04/21/23 54

PAM250

• In the “twilight zone”

• At this level of divergence, it is difficulty to assess whether the two proteins are homologous.

• Other techniques may use– Multiple sequence alignment– Structural alignment.

04/21/23 55

Comments

• Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences.

• Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.

04/21/23 56

Henikoff and Henikoff: BLOSUM

• BLOSUM: Block Substitution Matrices– constructed these matrices using multiple alignments

of evolutionarily divergent proteins. – The probabilities used in the matrix calculation are

computed by looking at "blocks" of conserved sequences found in multiple protein alignments.

– These conserved sequences are assumed to be of functional importance within related proteins.

04/21/23 57

BLOSUM Matrices

• The BLOCKS database contains thousands of groups of multiple sequence alignments.

• The BLOSUM62 matrix is calculated from observed substitutions between proteins that share 62% sequence identity or more – the BLOSUM100 matrix is calculated from alignments between proteins

showing 100% identity the proteins in the BLOCKS database – One would use a higher numbered BLOSUM matrix for aligning two

closely related sequences and a lower number for more divergent sequences.

• BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

04/21/23 58

BLOSUM Matrices

100

62

30

Per

cent

am

ino

acid

iden

tity

BLOSUM62

colla

pse

04/21/23 59

BLOSUM Matrices

100

62

30

Per

cent

am

ino

acid

iden

tity

BLOSUM62

100

62

30

BLOSUM30

100

62

30

BLOSUM80

colla

pse

colla

pse

colla

pse

04/21/23 60

A 4R -1 5N -2 0 6D -2 -2 1 6C 0 -3 -3 -3 9Q -1 1 0 0 -3 5E -1 0 0 2 -4 2 5G 0 -2 0 -1 -3 -2 -2 6H -2 0 1 -1 -3 0 0 -2 8I -1 -3 -3 -3 -1 -3 -3 -4 -3 4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4K -1 2 0 -1 -1 1 1 -2 -1 -3 -2 5M -1 -2 -2 -3 -1 0 -2 -3 -2 1 2 -1 5F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Blosum62 scoring matrix

04/21/23 61

Differences between PAM and BLOSUM

• PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.

• The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.

• The method used to count the replacements is different, unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.

• Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than Blosum50.

http://en.wikipedia.org/wiki/Substitution_matrix

04/21/23 62

Rat versus mouse RBP

Rat versus bacteriallipocalin

04/21/23 63

Measuring Alignment Significance

• The statistical significance of a an alignment score is used to try to determine if an alignment is the result of homology or just random chance.

04/21/23 64

True positives False positives

False negatives

Sequences reportedas related

Sequences reportedas unrelated

True negatives

04/21/23 65


False negatives



True negatives

homologoussequences

non-homologoussequences

04/21/23 66


False negatives



True negatives

homologoussequences

non-homologoussequences

Sensitivity:ability to findtrue positives

Specificity:ability to minimize

false positives

04/21/23 67

RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Randomization test: scramble a sequence

• First compare two proteins and obtain a score

• Next scramble the bottom sequence 100 times,and obtain 100 “randomized” scores (+/- S.D. )

• Composition and length are maintained

• If the comparison is “real” we expect the authentic score to be several standard deviations above the mean of the “randomized” scores

04/21/23 68

0

2

4

6

8

10

12

14

16

1 10 19 28 37

100 random shufflesMean score = 8.7Std. dev. = 4.2

Quality score

Num

ber

of in

stan

ces

A randomization test shows that RBP is significantly related to -lactoglobulin

Real comparisonScore = 37

But this test assumes a normal distribution of scores!

04/21/23 69

For align RBP to b-lactoglobulin:

scorerandomizedrealS

Z

7.62.4

7.837

Z

04/21/23 70

The PRSS program performsa scramble test for you(http://fasta.bioch.virginia.edu/fasta/prss.htm)

Badscores

Good scores

But these scores are notnormally distributed!

04/21/23 71

We will first consider the global alignment algorithmof Needleman and Wunsch (1970).

We will then explore the local alignment algorithmof Smith and Waterman (1981).

Finally, we will consider BLAST, a heuristic versionof Smith-Waterman. We will cover BLAST in detaillater.

Two kinds of sequence alignment: global and local

04/21/23 72

• Two sequences can be compared in a matrix along x- and y-axes.

• If they are identical, a path along a diagonal can be drawn

• Find the optimal subpaths, and add them up to achieve the best score. This involves

--adding gaps when needed--allowing for conservative substitutions--choosing a scoring system (simple or complicated)

• N-W is guaranteed to find optimal alignment(s)

Global alignment with the algorithmof Needleman and Wunsch (1970)

04/21/23 73

The Needleman and Wunsch Algorithm: Dynamic Programming

• S. Needleman and C. Wunsch were the first to apply a dynamic programming approach to the problem of sequence alignment.

• The key to understanding the dynamic programming approach to sequence alignment lies in observing how the alignment problem is broken down into sub-problems.

04/21/23 74

[1] set up a matrix

[2] score the matrix

[3] identify the optimal alignment(s)

Three steps to global alignment with the Needleman-Wunsch algorithm

04/21/23 75

Global alignment with the algorithm of Needleman and Wunsch (1970)

• Build a matrix F, index by i and j (the positions in the two sequences), where the F(i,j) is the score of the best alignment of x1..i and y1..j

04/21/23 76

Example: forward approach

• Align sequence x and y.

• F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

djiF

djiF

yxsjiF

jiF

F

ji

1,

,1

,1,1

max,

00,0

04/21/23 77

DP in equation form

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

04/21/23 78

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

A

G

C 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.

04/21/23 79

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0

A

G

C 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


04/21/23 80

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 -5 -10 -15

A -5

G -10

C -15 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


04/21/23 81

A simple example

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 -5 -10 -15

A -5 2 -3 -8

G -10 -3 -3 -1

C -15 -8 -8 -6 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,


04/21/23 82

Trace back

• Start from the lower right corner and trace back to the upper left.

• Each arrow introduces one character at the end of each aligned sequence.

• A horizontal move puts a gap in the left sequence.

• A vertical move puts a gap in the top sequence.• A diagonal move uses one character from each

sequence.

04/21/23 83




• A vertical move puts a gap in the top sequence.

• A diagonal move uses one character from each sequence.

A simple example

A A G

0 -5

A 2 -3

G -1

C -6


04/21/23 84




• A vertical move puts a gap in the top sequence.

• A diagonal move uses one character from each sequence.

A simple example

A A G

0 -5

A 2 -3

G -1

C -6


AAG- AAG--AGC A-GC

04/21/23 85

N-W is guaranteed to find optimal alignments,although the algorithm does not search all possiblealignments.

an optimal path (alignment) is identified byincrementally extending optimal subpaths.Thus, a series of decisions is made at each step of thealignment to find the pair of residues with the best score.

Needleman-Wunsch: dynamic programming

04/21/23 86

Commercial Tools for Sequence Alignment: Accelrys Genetic Computer Group (GCG)

• Formally known as the GCG Wisconsin Package

• GCG contains over 140 programs and utilities covering the cross-disciplinary needs of today’s research environment.

• http://www.accelrys.com/products/gcg/

04/21/23 87

Global alignment versus local alignment

• Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other

• Local alignment finds optimally matching regions within two sequences (“subsequences”)

• Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences

• Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

04/21/23 88

Smith-Waterman Algorithm

• Similar to Needleman-Wunch, but:– Start a new alignment when encountering a

negative score– The alignment can end anywhere in the

matrix– Trace back starts from entry with highest

score in the matrix, and terminates at entry with score 0

04/21/23 89

How the Smith-Waterman algorithm works

Set up a matrix between two proteins (size m+1, n+1)

No values in the scoring matrix can be negative! S > 0

The score in each cell is the maximum of four values:[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)

[2] s(i,j-1) – gap penalty[3] s(i-1,j) – gap penalty[4] zero

04/21/23 90

Local alignment: example

A A G

0 0 0 0

G 0 0 0 2

A 0 2 2 0

A 0 2 4 0

G 0 0 0 6

G 0 0 0 2

C 0 0 0 0

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

Find the optimal local alignment of AAG and GAAGGC.Use a gap penalty of d=-5.

0

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

04/21/23 91

Match: 1Mismatch: -1/3Gap: -4/3

04/21/23 92

Extended Smith & Waterman

To get multiple local alignments:• delete regions around best path

• repeat backtracking

04/21/23 93

Heuristic versions of Smith-Waterman: FASTA and BLAST

• Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment.

• But Smith-Waterman is slow. It requires computer space and time proportional to the product of the two sequences being aligned (or the product of a query against an entire database).

• Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space.

• FASTA and BLAST provide rapid alternatives to S-W

04/21/23 94

FASTA Idea

• Idea: a good alignment probably matches some identical ‘words’ (ktups)

• Example:

Database record:

ACTTGTAGATACAAAATGTG

Aligned query sequence:

A-TTGTCG-TACAA-ATCTGT

Matching words of size 4

04/21/23 95

FASTA Stage I

• A “lookup table” (Database word) is created. It consists of short stretches of amino acids. The length of a stretch is called a k-tuple.

• Upon query:– For each DB record:

• Find matching words• Search for long diagonal runs

of matching words• Finds the ten highest

scoring segments that align to the query.

*= matching word

Position in query

Position in DB record

* * * *

* * *

* * * * *

04/21/23 96

FASTA stage II, III

• II, These ten aligned regions are re-scored with a PAM or BLOSUM matrix.

• III, High-scoring segments are joined

04/21/23 97

FASTA final stage

• Apply an exact algorithm to surviving records, computing the final alignment score.– Usually, Needleman-Wunsch or Smith-

Waterman is then performed.

04/21/23 98

Advantage:

• The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer

Rapid and sensitive sequence comparison with FASTP and FASTA.Pearson WR. Methods Enzymol. 1990;183:63-98.

04/21/23 99

Limits

• Local similarity might be missed because only 10 regions saved at init1 stage.

• Non-identical conserved stretches may be overlooked

Documents

8/27/20151 Pairwise sequence Alignment. 8/27/20152 Many of the images in this power point presentation are from Bioinformatics and Functional Genomics