
Sequence Alignment

Marco Botta
Dipartimento di Informatica
Università di Torino
[email protected]
www.di.unito.it/~botta/didattica/

2Università di Torino

Sequence Comparison

Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences

We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters

3Università di Torino

Sequence Comparison (cont)

• Finding similarity between sequences is important for many biological questions

For example:
• Find genes/proteins with a common origin
  – Allows us to predict function & structure
• Locate common subsequences in genes/proteins
  – Identify common “motifs”
• Locate sequences that might overlap
  – Helps in sequence assembly

4Università di Torino

Sequence Alignment

Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATGGATTGAGCGA
• TGCGCCATTGATGACCA

A possible alignment:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

5Università di Torino

Alignments

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)

6Università di Torino

Choosing Alignments

There are many possible alignments.
For example, compare:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?

7Università di Torino

Scoring Alignments

Rough intuition:
• Similar sequences evolved from a common ancestor
• Evolution changed the sequences from this ancestral sequence by mutations:
  – Replacement: one letter replaced by another
  – Deletion: deletion of a letter
  – Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many operations took place

8Università di Torino

Simple Scoring Rule

Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2

The score of an alignment is the sum of the positional scores

9Università di Torino

Example

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25
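
The two scores above can be checked mechanically. The following is a minimal sketch of the simple scoring rule; the function name `score_alignment` and its keyword parameters are illustrative choices, not part of the original slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the positional scores of two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel        # insertion or deletion
        elif a == b:
            total += match        # perfect match
        else:
            total += mismatch     # replacement
    return total


first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)  # 3 -25
```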

10Università di Torino

More General Scores

• The choice of the +1, -1, and -2 scores was quite arbitrary
• Depending on the context, some changes are more plausible than others
  – Exchange of an amino acid by one with similar properties (size, charge, etc.)
  vs.
  – Exchange of an amino acid by one with opposite properties

11Università di Torino

Additive Scoring Rules

• We define a scoring function by specifying a function
  $\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$
  – σ(x,y) is the score of replacing x by y
  – σ(x,-) is the score of deleting x
  – σ(-,x) is the score of inserting x
• The score of an alignment is the sum of position scores

12Università di Torino

Edit Distance

• The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other
• Computing the edit distance between two sequences is almost equivalent to finding the alignment that minimizes the distance

$d(s_1, s_2) = \max_{\text{alignment of } s_1 \,\&\, s_2} \text{score}(\text{alignment})$

13Università di Torino

Computing Edit Distance

• How can we compute the edit distance?
  – If |s| = n and |t| = m, there are more than $\binom{n+m}{m}$ alignments
• The additive form of the score allows us to use dynamic programming to compute the edit distance efficiently

14Università di Torino

Recursive Argument

• Suppose we have two sequences: s[1..n+1] and t[1..m+1]

The best alignment must be in one of three cases:
1. Last position is (s[n+1], t[m+1]):
   $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$
2. Last position is (s[n+1], -):
   $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$
3. Last position is (-, t[m+1]):
   $d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$

17Università di Torino

Recursive Argument

Define the notation:

$V[i,j] = d(s[1..i], t[1..j])$

• Using the recursive argument, we get the following recurrence for V:

$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$

18Università di Torino

Recursive Argument

• Of course, we also need to handle the base cases in the recursion:

$V[0,0] = 0$
$V[i+1,0] = V[i,0] + \sigma(s[i+1], -)$
$V[0,j+1] = V[0,j] + \sigma(-, t[j+1])$

19Università di Torino

Dynamic Programming Algorithm

We fill the matrix using the recurrence rule

             A    G    C
        0    1    2    3
   0
A  1
A  2
A  3
C  4

20Università di Torino

Dynamic Programming Algorithm

             A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0   -2
A  3   -6   -3   -2   -1
C  4   -8   -5   -4   -1

Conclusion: d(AAAC,AGC) = -1
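
The table can be reproduced with a few lines of code. This is a sketch under the simple scoring rule (match +1, mismatch -1, indel -2); the function name `global_alignment_table` is an illustrative choice, and indices are shifted because Python strings are 0-based.

```python
def global_alignment_table(s, t, match=1, mismatch=-1, indel=-2):
    """Fill V so that V[i][j] is the best score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty string is all indels.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Recurrence: replacement/match, deletion, insertion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    return V


V = global_alignment_table("AAAC", "AGC")
for row in V:
    print(row)
print("d(AAAC,AGC) =", V[4][3])
```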

21Università di Torino

Reconstructing the Best Alignment

• To reconstruct the best alignment, we record which case in the recursive rule maximized the score

             A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0   -2
A  3   -6   -3   -2   -1
C  4   -8   -5   -4   -1

22Università di Torino

Reconstructing the Best Alignment

• We now trace back the path that corresponds to the best alignment

             A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0   -2
A  3   -6   -3   -2   -1
C  4   -8   -5   -4   -1

AAAC
AG-C

23Università di Torino

Reconstructing the Best Alignment

• Sometimes, more than one alignment has the best score

             A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0   -2
A  3   -6   -3   -2   -1
C  4   -8   -5   -4   -1

AAAC
A-GC

24Università di Torino

Complexity

Space: O(mn)
Time: O(mn)
• Filling the matrix: O(mn)
• Backtrace: O(m+n)
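
The fill and backtrace steps can be sketched together as follows. Instead of storing back-pointers, this version re-derives at each cell which case of the recurrence produced the stored value. The function name `global_align` and the tie-breaking order (diagonal first) are assumptions; with ties, a different order would report a different alignment with the same score.

```python
def global_align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * indel
    for j in range(1, m + 1):
        V[0][j] = j * indel
    for i in range(1, n + 1):          # filling the matrix: O(mn)
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:              # backtrace: O(m+n)
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
            continue
        top.append("-"); bottom.append(t[j - 1])
        j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = global_align("AAAC", "AGC")
print(score)
print(a)
print(b)
```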

25Università di Torino

Local Alignment

Consider now a different question:
• Can we find similar substrings of s and t?
• Formally, given s[1..n] and t[1..m], find i, j, k, and l such that d(s[i..j], t[k..l]) is maximal

26Università di Torino

Local Alignment

• As before, we use dynamic programming
• We now want to set V[i,j] to record the best alignment of a suffix of s[1..i] and a suffix of t[1..j]
• How should we change the recurrence rule?

27Università di Torino

Local Alignment

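New option:
• We can start a new match instead of extending a previous alignment

$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \\ 0 \end{cases}$$

The option 0 corresponds to the alignment of empty suffixes.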

28Università di Torino

Local Alignment

• Again, we also need to handle the base cases in the recursion:

$V[0,0] = 0$
$V[i+1,0] = \max(0, V[i,0] + \sigma(s[i+1], -))$
$V[0,j+1] = \max(0, V[0,j] + \sigma(-, t[j+1]))$

29Università di Torino

Local Alignment Example

                  T    A    C    T    A    A
        0    1    2    3    4    5    6
   0
T  1
A  2
A  3
T  4
A  5

s = TAATA
t = TACTAA

30Università di Torino

Local Alignment Example

                  T    A    C    T    A    A
        0    1    2    3    4    5    6
   0    0    0    0    0    0    0    0
T  1    0    1    0    0    1    0    0
A  2    0    0    2    0    0    2    1
A  3    0    0    1    1    0    1    3
T  4    0    1    0    0    2    0    1
A  5    0    0    2    0    0    3    1

s = TAATA
t = TACTAA
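
A sketch of the local-alignment fill for this example follows. The slides do not state the scores used here, so the values match +1, mismatch -1 and indel -2 are an assumption carried over from the simple scoring rule; the function name `local_alignment_table` is also illustrative.

```python
def local_alignment_table(s, t, match=1, mismatch=-1, indel=-2):
    """Return (best_score, best_cell, V) for local alignment of s and t."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: max(0, negative indel sums).
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                      # start a new match
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_cell = V[i][j], (i, j)
    return best, best_cell, V


best, cell, V = local_alignment_table("TAATA", "TACTAA")
print(best, cell)
for row in V:
    print(row)
```

The best local score is the largest entry anywhere in the table, not the bottom-right corner; tracing back from that cell until a 0 is reached yields the aligned substrings.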

33Università di Torino

Sequence Alignment

We have seen two variants of sequence alignment:
• Global alignment
• Local alignment

Other variants:
• Finding the best overlap (exercise)

All are based on the same basic idea of dynamic programming

34Università di Torino

Alignment with Gaps

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA

35Università di Torino

Gaps

• Both alignments have the same number of matches and spaces, but alignment II seems better
• Definition: a gap is any maximal, consecutive run of spaces in a single string
• The length of the gap is the number of spaces in it
• Example I has 11 gaps while example II has only 2 gaps

36Università di Torino

Biological Motivation

• Number of mutational events
  – A single gap: due to a single event that removed a number of residues
  – Each separate gap: due to distinct independent events
• Protein structure
  – Protein secondary structure consists of alpha helices, beta sheets and loops
  – Loops of varying size can lead to very similar structure

37Università di Torino

Biological Motivation

38Università di Torino

cDNA matching

• cDNA is the sequence after splicing and editing, after the introns have been removed
• We expect regions of high similarity separated by long gaps
• These gaps correspond to the introns removed by splicing

39Università di Torino

Gap Penalty Models

• Constant model
  – Gives each gap a constant weight; spaces are free
  – Maximize: $\sum_i s(S'_i, T'_i) + W_g \times \#\text{gaps}$
  – Time: $O(nm)$
  – Works well for cDNA matching
• Affine model
  – There is a penalty for starting a gap and a penalty for each space extending it
  – A single gap of length q contributes $W_g + qW_s$
  – Maximize: $\sum_i s(S'_i, T'_i) + W_g \times \#\text{gaps} + W_s \times \#\text{spaces}$
  – Time: $O(nm)$
  – Most widely used

40Università di Torino

Gap Penalty Models

• Convex model
  – Each extra space contributes less penalty
  – Gap function is convex in length
  – Example: $W_g + \log q$
  – Time: $O(nm \log m)$
  – Better model of biology
• General model
  – The weight of a gap is some arbitrary function $w(q)$
  – Time: $O(nm^2 + n^2 m)$

41Università di Torino

Example Revised

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA

42Università di Torino

Indel model

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
Score: -6

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
Score: -6

Scoring parameters:
• Match: +1
• Indel: -2

43Università di Torino

Constant model

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
Score: -6

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
Score: 12

Scoring parameters:
• Match: +1
• Open gap: -2

44Università di Torino

Affine model

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
Score: -17

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
Score: 1

Scoring parameters:
• Match: +1
• Open gap: -2
• Extend gap: -1

45Università di Torino

Convex model

I:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
Score: -6

II:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
Score: ~7

Scoring parameters:
• Match: +1
• Open gap: -2
• Gap length: -log n
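
The scores quoted for the four models can be recomputed from three numbers per alignment: the match count, the number of gaps and the gap lengths. The sketch below assumes 16 matches for both alignments (the count implied by the quoted scores) and a base-2 logarithm for the convex model, which the slides do not specify; `gap_model_scores` is a hypothetical helper.

```python
import math


def gap_model_scores(matches, gap_lengths, match=1,
                     w_indel=-2, w_open=-2, w_extend=-1):
    """Score one alignment under the indel, constant, affine and convex models."""
    gaps = len(gap_lengths)
    spaces = sum(gap_lengths)
    base = match * matches
    return {
        "indel": base + w_indel * spaces,
        "constant": base + w_open * gaps,
        "affine": base + w_open * gaps + w_extend * spaces,
        # Assumption: log base 2; each gap of length q costs w_open - log2(q).
        "convex": base + w_open * gaps - sum(math.log2(q) for q in gap_lengths),
    }


print(gap_model_scores(16, [1] * 11))  # alignment I: 11 gaps of length 1
print(gap_model_scores(16, [3, 8]))    # alignment II: gaps of length 3 and 8
```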

46Università di Torino

Affine Weight Model

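• We divide the possible alignments of the prefixes S[1..i] and T[1..j] into 3 types:

A(i,j): the last column aligns S_i with T_j
  S_______i
  T_______j

B(i,j): the alignment ends with a gap in S
  S______i------
  T____________j

C(i,j): the alignment ends with a gap in T
  S____________i
  T_______j-----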

47Università di Torino

Affine Weight Model

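Recurrence relations

$$A(i,j) = \max \begin{cases} A(i-1,j-1) + s(i,j) \\ B(i-1,j-1) + s(i,j) \\ C(i-1,j-1) + s(i,j) \end{cases}$$

$$B(i,j) = \max \begin{cases} A(i,j-1) + W_g + W_s \\ B(i,j-1) + W_s \end{cases}$$

$$C(i,j) = \max \begin{cases} A(i-1,j) + W_g + W_s \\ C(i-1,j) + W_s \end{cases}$$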

48Università di Torino

Affine Weight Model

Initial conditions:

$A(i,0) = B(i,0) = W_g + iW_s$
$A(0,j) = C(0,j) = W_g + jW_s$

Complexity
• Time: O(nm); we compute 3 matrices
• Space: O(nm)

Optimal alignment: $V(n,m) = \max\{A(n,m), B(n,m), C(n,m)\}$
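
The three-matrix scheme can be sketched as follows. This is not a transcription of the slide's initial conditions: as an assumption, the sketch uses the common convention in which states that cannot occur on the border (for example, an alignment ending in a match when one prefix is empty) are set to minus infinity, and only B(0,j) and C(i,0) carry the gap cost. Weights are given as negative numbers, and `affine_global_score` is an illustrative name.

```python
NEG = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, w_gap=-2, w_space=-1):
    """Best global alignment score under an affine gap weight W_g + q*W_s."""
    n, m = len(s), len(t)
    A = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column: S_i with T_j
    B = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column: space over T_j
    C = [[NEG] * (m + 1) for _ in range(n + 1)]  # last column: S_i over space
    A[0][0] = 0
    for j in range(1, m + 1):
        B[0][j] = w_gap + j * w_space            # t[:j] against one long gap
    for i in range(1, n + 1):
        C[i][0] = w_gap + i * w_space            # s[:i] against one long gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1], B[i - 1][j - 1], C[i - 1][j - 1]) + sub
            B[i][j] = max(A[i][j - 1] + w_gap + w_space,   # open a gap in S
                          B[i][j - 1] + w_space)           # extend it
            C[i][j] = max(A[i - 1][j] + w_gap + w_space,   # open a gap in T
                          C[i - 1][j] + w_space)           # extend it
    return max(A[n][m], B[n][m], C[n][m])


print(affine_global_score("ACGT", "ACGT"))
print(affine_global_score("AAAA", "AA"))
```

In the second call the best alignment pairs two A's and places the other two against a single gap of length 2, so the score is 2 + (-2) + 2·(-1).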

49Università di Torino

Affine Weight Model

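This model has a natural explanation as a finite state automaton.

[Figure: automaton with states A, B and C. Transitions into A are labeled S(i,j); transitions from A into B and into C are labeled Wg+Ws; the self-loops on B and C are labeled Ws.]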

50Università di Torino

Alignment in Real Life

• One of the major uses of alignments is to find sequences in a “database”
• Such collections contain a massive number of sequences (on the order of 10^6)
• Finding homologies in these databases with dynamic programming can take too long

51Università di Torino

Heuristic Search

• Instead, most searches rely on heuristic procedures
• These are not guaranteed to find the best match
• Sometimes, they will completely miss a high-scoring match

We now describe the main ideas used by some of these procedures
  – Actual implementations often contain additional tricks and hacks

52Università di Torino

Basic Intuition

• Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gap-less matches
• These heuristics try to find significant gap-less matches and then extend them

53Università di Torino

Banded DP

• Suppose that we have two strings s[1..n] and t[1..m] such that n≈m
• If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal

54Università di Torino

Banded DP

• To find such a path, it suffices to search in a diagonal region of the matrix
• If the diagonal band has width k, then the dynamic programming step takes O(kn)
• Much faster than the O(n^2) of standard DP
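
A minimal sketch of banded DP for global alignment follows, assuming the band consists of the cells with |i - j| ≤ k and the simple scores +1/-1/-2. For readability it still allocates the full table (only the band is computed, so the time is O(kn), but the space is not reduced); `banded_global_score` is an illustrative name.

```python
NEG = float("-inf")


def banded_global_score(s, t, k, match=1, mismatch=-1, indel=-2):
    """Best global score over alignments whose path stays within |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the corner cell")
    V = [[NEG] * (m + 1) for _ in range(n + 1)]   # cells outside the band stay -inf
    V[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = match if s[i - 1] == t[j - 1] else mismatch
                best = V[i - 1][j - 1] + sub
            if i > 0:
                best = max(best, V[i - 1][j] + indel)
            if j > 0:
                best = max(best, V[i][j - 1] + indel)
            V[i][j] = best
    return V[n][m]


print(banded_global_score("AAAC", "AGC", k=1))
```

If the optimal path leaves the band, the result is only a lower bound on the true optimum, which is why the band width has to be chosen with the expected number of gaps in mind.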

55Università di Torino

Banded DP

Problem:
• If we know that t[i..j] matches the query s, then we can use banded DP to evaluate the quality of the match
• However, we do not know this a priori!

How do we select which sequences to align using banded DP?

56Università di Torino

FASTA Overview

Main idea:
• Find potential diagonals & evaluate them
• Suppose that we have a relatively long gap-less match

  AGCGCCATGGATTGAGCGA
  TGCGACATTGATCGACCTA

• Can we find “clues” that will let us find it quickly?

57Università di Torino

Signature of a Match

Assumption: good matches contain several “patches” of perfect matches

  AGCGCCATGGATTGAGCGA
  TGCGACATTGATCGACCTA

Since this is a gap-less alignment, all perfect match regions should be on one diagonal

58Università di Torino

FASTA

• Given s and t, and a parameter k
• Find all pairs (i,j) such that s[i..i+k]=t[j..j+k]
• Locate sets of pairs that are on the same diagonal
  – By sorting according to i-j
• Compute the score for the diagonal that contains all of these pairs
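
The first two steps can be sketched with a hash index of the words in t. This is only an illustration of the idea, not the FASTA implementation: words here have exactly k letters, positions are 0-based, and `kmer_hits_by_diagonal` is a hypothetical name.

```python
from collections import defaultdict


def kmer_hits_by_diagonal(s, t, k):
    """Group all exact k-letter word matches (i, j) by their diagonal i - j."""
    index = defaultdict(list)                 # word -> start positions in t
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append((i, j))   # same i - j means same diagonal
    return dict(diagonals)


hits = kmer_hits_by_diagonal("ACGTAC", "TTACGT", k=3)
print(hits)  # {-2: [(0, 2), (1, 3)], 2: [(3, 1)]}
```

Diagonals with many hits are the candidates that would then be scored and passed on to the later stages.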

59Università di Torino

FASTA

Postprocessing steps:
  – Find highest scoring diagonal matches
  – Combine these to potential gapped matches
  – Run banded DP on the region containing these combinations

• Most applications of FASTA use very small k (2 for proteins, and 4-6 for DNA)

60Università di Torino

FASTA Output

SCORES Init1: 1201 Initn: 1844 Opt: 1915

Smith-Waterman score: 1915; 59.3% identity in 496 aa overlap

10 20 30 40

A41264 MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEKIIQAFYNRTL

::::|::|: || |::|||||||| ||||||:|:|: ||:|

A49158 MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQKVIEQSYNETW

10 20 30 40 50 60

50 60 70 80 90 100

A41264 SQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFVNRFGRRNSMLLVNVLAF

|:| :| | ||:||:||||||||||||:|| :::: : :||: :||: ||||

A49158 LGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIGIISQWLGRKRAMLVNNVLAV

70 80 90 100 110 120

. . . . . .

61Università di Torino

BLAST Overview

• BLAST uses similar intuition
• It relies on high scoring matches rather than exact matches
• It is designed to find alignments of a target string s against large databases

62Università di Torino

High-Scoring Pair

• Given parameters: length k, and threshold T
• Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T
• Given a query s[1..n], BLAST constructs all words w such that w is an HSP with a k-substring of s
  – Note that not all substrings of s are HSPs!
• These words serve as seeds for finding longer matches

63Università di Torino

Finding Potential Matches

We can locate seed words in a large databasein a single pass

• Construct an FSA that recognizes seed words
• Use hashing techniques to locate matching words

64Università di Torino

Extending Potential Matches

• Once a seed is found, BLAST attempts to find a local alignment that extends the seed
• Seeds on the same diagonal are combined (as in FASTA)

65Università di Torino

BLAST programs

• BLASTN - Nucleotide query searching a nucleotide database
• BLASTP - Protein query searching a protein database
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database

66Università di Torino

BLAST Search

67Università di Torino

BLAST Output

• List of hits
  – Database accession codes, name, description
  – Score in bits (usually >30 bits is significant)
  – Expectation value E()
• For each hit
  – A header including hit name, description, length
  – Each hit may contain several HSPs
  – Score and expectation value
  – How many identical residues
  – How many residues contributing positively to the score
• The local alignment itself

68Università di Torino

BLAST Output

69Università di Torino

BLAST Output

70Università di Torino

BLAST Output

71Università di Torino

What do we use

• Originally BLAST did not allow gaps
  – Now people use gapped BLAST
  – Gapped BLAST joins different diagonals
• For proteins, BLAST is superior
• For nucleotides, FASTA is better

84Università di Torino

Scoring Matrices

• The alignment depends on the score assigned to pairs of amino acids (scoring matrices) and on the indel penalty function

85Università di Torino

Scoring Matrices

• Amino acid substitution matrices, i.e. tables for comparing symbols
• Tables exist for comparing proteins and for comparing nucleic acids
• A great many matrices have been published, based on different models

86Università di Torino

M. Dayhoff Matrices (1978)

• Percent accepted mutations (PAM)
• This family of matrices is based on the probability that, during evolution, one amino acid is substituted by another in homologous protein sequences
• Each matrix measures the changes expected over a given evolutionary period

87Università di Torino

PAM Matrices

• According to this model, the amino acid substitutions observed over a certain period of time can be extrapolated to longer periods
• In computing PAM matrices, the change at a given site is assumed to be independent of earlier mutational events at the same site

88Università di Torino

Computing PAM Matrices

• Based on 1572 mutations in 71 groups of sequences that are at least 85% similar
• The mutations do not significantly alter the function of the proteins (accepted mutations)
• Similar sequences are organized into phylogenetic trees, from which the mutations are inferred

89Università di Torino

Species A   A W T V A S A V R T S I
Species B   A Y T V A A A V R T S I
Species C   A W T V A A A V L T S I

[Figure: phylogenetic tree with leaves A, B and C; branches labeled L→R and W→Y]

90Università di Torino

Computing PAM Matrices

• $p_i = a_i / a_{tot}$: frequency of amino acid i
• $f_{ij} = n(a_i \to a_j)$: number of mutations $a_i \to a_j$
• $f_i = \sum_j f_{ij}$: number of mutations of $a_i$
• $f = \sum_i f_i$: total number of mutations

91Università di Torino

Computing PAM Matrices

• Define the scale as the evolutionary time needed to incorporate 1 mutated amino acid per 100: 1 PAM
• The relative mutability of $a_i$ is:

$m_i = \dfrac{f_i}{100 \cdot f \cdot p_i}$

92Università di Torino

Computing PAM Matrices

• If $m_i$ is the probability that $a_i$ mutates, then $M_{ii} = 1 - m_i$ is the probability that $a_i$ is conserved
• The probability of the mutation $a_i \to a_j$ is

$M_{ij} = (f_{ij} / f_i)\, m_i$

93Università di Torino

Computing PAM Matrices

• The resulting matrix $M_{ij}$ is a transition matrix
• In general, the probabilities for k evolutionary intervals are given by the k-th matrix power: $(M^k)_{ij}$
• One of the most widely used matrices is PAM250
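
Extrapolating to k PAM units is just repeated matrix multiplication. The sketch below uses a made-up 2-letter transition matrix purely to show the mechanics (a real PAM1 matrix is 20×20 and its entries come from the counts described above); `mat_mul`, `mat_pow` and `toy_pam1` are illustrative names.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """Return the k-th power of M by repeated squaring (k >= 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Hypothetical one-step matrix: row i holds the probabilities a_i -> a_j.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print(row, sum(row))
```

Each row of the powered matrix still sums to 1, since it remains a transition matrix.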

94Università di Torino

95Università di Torino

BLOSUM Matrix (Henikoff & Henikoff, 1992)

• Blocks Amino Acid Substitution Matrices = BLOSUM
• Based on the amino acid substitutions observed in ~2000 conserved blocks of sequences
• These blocks were extracted from a database of 500 protein families
• The amino acid exchanges observed in each column are counted

96Università di Torino

Example of a BLOSUM Matrix Computation

...A...
...A...
...A...
...A...
...A...
...S...
...A...
...A...
...A...
...A...

• 9 A and 1 S
• 36 A→A pairs ($f_{AA}$) and 9 A→S pairs ($f_{AS}$)
• 210 possible pairs of amino acids
• The frequency of A→A is $q_{AA} = f_{AA}/(f_{AA} + f_{AS}) = 0.8$
• The frequency of A→S is $q_{AS} = f_{AS}/(f_{AA} + f_{AS}) = 0.2$

97Università di Torino

Example of a BLOSUM Matrix Computation

• The expected frequency with which A is involved in a pair is $p_A = q_{AA} + q_{AS}/2 = 0.9$
• The expected frequency with which S is involved in a pair is $p_S = q_{AS}/2 = 0.1$
• The expected frequency of an AA pair is $e_{AA} = p_A^2 = 0.81$
• The expected frequency of an AS pair is $e_{AS} = 2 p_A p_S = 0.18$

98Università di Torino

Example of a BLOSUM Matrix Computation

• The value for the AA pair in the matrix is $q_{AA}/e_{AA} = 0.99$ and for AS it is $q_{AS}/e_{AS} = 1.11$
• The values are converted to half-bit units:
  $s_{AA} = 2\log_2(q_{AA}/e_{AA}) = -0.04$
  $s_{AS} = 2\log_2(q_{AS}/e_{AS}) = 0.30$
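
The whole worked example fits in a few lines. This sketch counts the pairs in the single column and reproduces the numbers above; the factor 2 (half-bit units) is an assumption chosen because it matches the quoted values -0.04 and 0.30, and the variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

column = "AAAAASAAAA"                        # 9 A and 1 S

# Count unordered residue pairs within the column.
pairs = Counter("".join(sorted(p)) for p in combinations(column, 2))
total = sum(pairs.values())                  # 45 pairs in this column

q_AA = pairs["AA"] / total                   # observed pair frequencies
q_AS = pairs["AS"] / total
p_A = q_AA + q_AS / 2                        # residue frequencies in pairs
p_S = q_AS / 2
e_AA = p_A ** 2                              # expected pair frequencies
e_AS = 2 * p_A * p_S

s_AA = 2 * math.log2(q_AA / e_AA)            # log-odds in half-bit units
s_AS = 2 * math.log2(q_AS / e_AS)
print(pairs["AA"], pairs["AS"], round(s_AA, 2), round(s_AS, 2))
```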

99Università di Torino

Computing a BLOSUM Matrix

• To compensate for the oversampling of residues coming from very similar sequences, sequences more similar than a given threshold (for example 60% identity) are clustered and the amino acid exchanges within each cluster are averaged. The resulting matrix is called BLOSUM60
• The most widely used matrix is BLOSUM62

100Università di Torino

101Università di Torino

The Hazard of Large Databases

• Define $p_\varepsilon = P(d(s,t) > \varepsilon \mid U)$
• This is the probability that two unrelated sequences will match with score > ε by chance
• Assuming that the N database sequences are independent of each other, and all are unrelated to s, we have

$P(\max_t d(s,t) > \varepsilon) = 1 - (1 - p_\varepsilon)^N \approx 1 - e^{-N p_\varepsilon}$

102Università di Torino

Local Matching

• Question: which local alignment query is expected to give a higher score:
  – To a short sequence?
  – To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix
• The score is the optimum over all these starting points
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials

103Università di Torino

Score Significance-Fasta

• How meaningful is a score?
• Calculate distribution of scores and related scores
• Under reasonable assumptions the scores for ungapped alignment behave according to the Extreme Value Distribution

104Università di Torino

Extreme Value Distribution (BLAST)

• We ask the following question: given a database of size m and a sequence of size n,
• What is the expected number of hits with score at least S? This number is called an E-score

$E(S) = K m n e^{-\lambda S}$

• Notice this is a Poisson distribution
• K corrects for the dependencies
• λ depends on the scoring matrix
• Doubling the length of the sequence doubles the expectation
• Doubling the score causes E to decrease exponentially

105Università di Torino

Blast P-value

• Recall the Poisson distribution:
  – The probability of finding no hits with a score >= S is $e^{-E}$
  – Therefore the probability of finding at least one hit with score >= S is $1 - e^{-E}$
  – This is called the P-value
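
Both formulas are one-liners. In the sketch below the values of K and λ are arbitrary placeholders chosen only for illustration (in practice they depend on the scoring system), and the function names are not taken from any BLAST API.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of hits with score >= S: E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one hit, assuming a Poisson number of hits."""
    return 1.0 - math.exp(-e)


K, lam = 0.1, 0.3          # placeholder parameters, for illustration only
m, n = 1_000_000, 250      # database size and query length
for S in (40, 80):
    E = e_value(S, m, n, K, lam)
    print(S, E, p_value(E))
```

For small E the two quantities are nearly equal, since 1 - e^{-E} ≈ E.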

106Università di Torino

Protein Families

• Consider zinc fingers:
• All have the same function:
  – Bind to DNA
• All have similar structure
• They constitute a protein family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others

107Università di Torino

Multiple Alignment

• Proteins can be classified into families:
  – Common structure
  – Common function
  – Common evolutionary origin
• For a set of sequences belonging to some family
  – Each pair has some differences
  – But, there are some common motifs in almost all sequences of the family
• A multiple alignment carries more information than a pairwise alignment

108Università di Torino

Definition

A multiple alignment of strings S1,S2,…,Sk is a series of strings with blanks S’1,S’2,…,S’k such that:
  – |S’1|=|S’2|=…=|S’k|
  – S’j is an extension of Sj obtained by insertion of blanks

109Università di Torino

Example

AGT..CTT.ACGCG

AGTAGCTT...GCG

..TAGC.T..GGCG

.CTA.C.TAACCCG

ACTA...TAAC...

110Università di Torino

Example

111Università di Torino

Sum of Pairs

• The sum of pairwise distances between all pairs of sequences for some scoring matrix:

$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$

• Not only assumes that alignment of each column is independent, but also each pair of sequences
  – Each sequence is scored as if descended from k-1 sequences instead of one common ancestor
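
A sketch of the sum-of-pairs score, summed over all columns of a multiple alignment. The pair scores (match +1, mismatch -1, character against blank -2, blank against blank 0) are assumptions for illustration, and blanks are written as "-"; `pair_score` and `sum_of_pairs` are hypothetical names.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, indel=-2):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0                  # two blanks: no evidence either way
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum pair scores over every column and every pair of rows k < l."""
    return sum(pair_score(a, b)
               for column in zip(*rows)
               for a, b in combinations(column, 2))


print(sum_of_pairs(["AC", "AC", "AG"]))  # column 1: 3, column 2: 1 - 1 - 1 = -1, total 2
```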

112Università di Torino

Calculation of Multiple Alignment

• The optimal alignment can be calculated exactly using k-dimensional dynamic programming
  – Space complexity: O(n^k)
  – Time complexity: O(2^k n^k)
• A heuristic program called ClustalW quickly finds a good multiple alignment

113Università di Torino

The ClustalW Algorithm

• Three steps:
  – 1. Compare all pairs of sequences to obtain a similarity matrix
  – 2. Based on the similarity matrix, make a guide tree relating all the sequences
  – 3. Perform progressive alignment where the order of the alignments is determined by the guide tree

114Università di Torino

ClustalW - Score of aligning two alignment columns

• Sum the score matrix entry for all pairs of residues
• Weight each pair by the sequences’ weights

1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa

Score: M(t,v) + M(t,i) + M(l,v) + M(l,i)

115Università di Torino

ClustalW - Weighting sequences

• Each sequence is given a weight
• Groups of related sequences receive lower weight

1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa

Weighted score: w1*w3*M(t,v) + w1*w4*M(t,i) + w2*w3*M(l,v) + w2*w4*M(l,i)

116Università di Torino

ClustalW - Similarity matrix

• Distance between sequences (measured from the guide tree) determines which matrix to use
  – 80-100% seq-id -> use Blosum80
  – 60-80% seq-id -> Blosum60
  – 30-60% seq-id -> Blosum45
  – 0-30% seq-id -> Blosum30

117Università di Torino

ClustalW - Gap penalties

• Initial gap penalty
  – GOP
• Gap extension penalty
  – GEP

GTEAKLIVLMANE
GA---------KL

Penalty: GOP + 8*GEP

118Università di Torino

ClustalW - Modifications of gap penalty

• Position specific penalty
  – Gap at position?
    • yes -> lower GOP
    • no, but gap within 8 residues -> increase GOP
  – Hydrophilic residues
    • lower GOP

119Università di Torino

ClustalW - summary

• Does not use a score for the final alignment
• Each pairwise alignment is done using dynamic programming
• Heuristics (e.g., gap-penalty modifications) are used, tailored to globular proteins
• Graphical version: ClustalX

120Università di Torino

Creating a PSSM

• After aligning the sequences we see that there are some conserved regions
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix
• This matrix represents information from a whole family; it is more strict in highly conserved regions

121Università di Torino

PSI-BLAST (Position Specific Iterated)

• BLAST provides a new automatic “profile like” search
• Iterative procedure:
  – Perform BLAST on database
  – Use significant alignments to construct a “position specific” score matrix
  – This matrix replaces the query sequence in the next round of database searching
• The program may be iterated until no new significant alignments are found
• Most commonly used search method today