76
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 4: Pair -wise (2) and Multiple C E N T R F O R I N T E B I O I N F O E Lecture 4: Pair -wise (2) and Multiple sequence alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands ibivu.nl [email protected] E G R A T I V E O R M A T I C S V U

E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

1-month Practical CourseGenome Analysis (Integrative Bioinformatics & Genomics)

Lecture 4: Pair-wise (2) and Multiple

CENTR

FORINTE

BIOINFO

E

Lecture 4: Pair-wise (2) and Multiple sequence alignment

Centre for Integrative Bioinformatics VU (IBIVU)Vrije Universiteit AmsterdamThe Netherlandsibivu.nl [email protected]

EGRATIVE

ORMATICSVU

Page 2: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Alignment input parametersScoring alignments

20×20

A number of different schemes have been developed to compile residue exchange matrices

10 1

Amino Acid Exchange Matrix

Gap penalties (open, extension)

However, there are no formal concepts to calculate corresponding gap penalties

Emperically determined values are recommended for PAM250, BLOSUM62, etc.

Page 3: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

M = BLOSUM62, Po= 0, Pe= 0

Page 4: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

M = BLOSUM62, Po= 12, Pe= 1

Page 5: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

M = BLOSUM62, Po= 60, Pe= 5

Page 6: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

There are three kinds of alignments

n Global alignment (preceding slides)n Semi-global alignmentn Local alignment

Page 7: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Variation on global alignment

n Global alignment: previous algorithm is called global alignment because it uses all letters from both sequences.

CAGCACTTGGATTCTCGGCAGC-----G-T----GGCAGC-----G-T----GG

n Semi-global alignment: uses all letters but does not penalize for end gaps

CAGCA-CTTGGATTCTCGG---CAGCGTGG--------

Page 8: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Semi-global alignment

n Global alignment: all gaps are penalisedn Semi-global alignment: N- and C-terminal (5’

and 3’) gaps (end-gaps) are not penalised

MSTGAVLIY--TS-----

---GGILLFHRTSGTSNS

End-gaps

End-gaps

Page 9: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Semi-global alignment

Applications of semi-global:– Finding a gene in genome– Placing marker onto a chromosome– One sequence much longer than the – One sequence much longer than the

other

Risk: if gap penalties high -- really bad alignments for divergent sequences

Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic

Page 10: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Semi-global alignment

n Ignore 5’ or N-terminal end gaps– First row/column

set to 0set to 0

n Ignore C-terminal or 3’ end gaps– Read the result

from last row/column (select the highest scoring cell) T

G

A

-

GTGAG-

1300-10

-202-210

-1-1-11-10

000000

Page 11: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Semi-global dynamic programming- two examples with different gap penalties -

These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

Page 12: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

There are three kinds of alignments

n Global alignment n Semi-global alignmentn Local alignment

Page 13: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Local dynamic programming(Smith & Waterman, 1981)

LCFVMLAGSTVIVGTREDASTIL

������������

ILCGS

�������������� �������

������������������������������

������������

AGSTVIVGA-STILCG

Page 14: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Local dynamic programming (Smith and Waterman, 1981)

basic algorithm

���j-1 j

����

�����������

���������� ��������

����������� �

����������� �

��������

��������

����������

Page 15: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Example: local alignment of two sequences

n Align two DNA sequences:– GAGTGA– GAGGCGA (note the length – GAGGCGA (note the length

difference)

n Parameters of the algorithm:– Match: score(A,A) = 1

– Mismatch: score(A,T) = -1

– Gap: g = -2 M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

0

Page 16: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The algorithm. Step 1: init

n Create the matrix

n Initiation– No beginning

row/column

M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

654321j→i↓ AGTGAG

0

row/column

– Just apply the equation…

7

6

5

4

3

2

1

i↓

A

G

C

G

G

A

G

AGTGAG

Page 17: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The algorithm. Step 2: fill in

n Perform the forward step…

M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

654321j→i↓ AGTGAG

0

7

6

5

4

3

2

1

i↓

A

G

C

G

G

A

1G

AGTGAG

0 01 1

1

1

0

0

2 0 0 0 2

0 3 1 1 0

0 1 2

Page 18: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The algorithm. Step 2: fill in

n Perform the forward step…

M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

654321j→i↓ AGTGAG

0

7

6

5

4

3

2

1

i↓

A

G

C

G

G

A

1G

AGTGAG

0

0 01 1

1

1

1

0

0

2 0 0 0 2

0 3 1 1 0

0 1 2 2 0

0 0 0 1 1

0 1 0 1 0

0 2 0 0 0 2

Page 19: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The algorithm. Step 2: fill in

n We’re done

n Find the highest cell anywhere in

M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

654321j→i↓ AGTGAG

0

cell anywhere in the matrix

7

6

5

4

3

2

1

i↓

A

G

C

G

G

A

1G

AGTGAG

0

0 01 1

1

1

1

0

0

2 0 0 0 2

0 3 1 1 0

0 1 2 2 0

0 0 0 1 1

0 1 0 1 0

0 2 0 0 0 2

Page 20: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The algorithm. Step 3: trace back

n Reconstruct path leading to highest scoring cell

n Trace back until zero or start of

M[i, j] =M[i, j-1] – 2

M[i-1, j] – 2

M[i-1, j-1] ± 1

max

654321j→i↓ AGTGAG

0

zero or start of sequence: alignment path can begin and terminate anywhere in matrix

n Alignment: GAGGAG

7

6

5

4

3

2

1

i↓

A

G

C

G

G

A

1G

AGTGAG

0

0 01 1

1

1

1

0

0

2 0 0 0 2

0 3 1 1 0

0 1 2 2 0

0 0 0 1 1

0 1 0 1 0

0 2 0 0 0 2

Page 21: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Local dynamic programming(Smith & Waterman, 1981; Gotoh, 1984)

���

j-1

Gap

This is the general DP algorithm, which is suitable for linear, affine and concave

���������

���������� !!������" � #� � ������#$

���������"���"��������������� !%!��� � #� � ���%���#$

Gap opening penalty

Gap extension penalty

and concave penalties, although for the example here affine penalties are used

Page 22: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Measuring Similarity

n Sequence identity (number of identical exchanges per unit length)

n Raw alignment scoren Sequence similarity (alignment score n Sequence similarity (alignment score

normalised to a maximum possible)n Alignment score normalised to a

randomly expected situation (database/homology searching)

Page 23: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Pairwise alignment

n Now we know how to do it:n How do we get a multiple alignment

(three or more sequences)?

n Multiple alignment: much greater combinatorial explosion than with pairwise alignment…..

Page 24: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Multiple alignment idea

• Take three or more related sequences and align them such that the greatest number of similar characters are aligned in the same column of the alignment.

Ideally, the sequences are orthologous, but often include paralogues.

Page 25: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

• You can score a multiple alignment by taking all the pairs of aligned sequences and adding up the pairwise scores:

Scoring a multiple alignment

Sa,b = -�li jbas ),( )(kgpN

kk •�

•This is referred to as the Sum-of-Pairs score

Page 26: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Information content of a multiple alignment

F

C D

Page 27: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

���������������� �

� ����������������������������� ��� ��� ��������������������������������

� ���������������� ���������������� ������������������������

� ����������������� ������

� ������� ���������������

� �����!������������������"�������������

Page 28: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

# ��� ��� �������������!�# ��� ��� �������������!�

q ������������ ���� ��������� ����$�������������������������������� �������%

q &��������"��� �������$���������������� ��������������������������!�����$���������������� ��������������������������!������ ���������������

q '������"��� �������$���������������� ������������������"��� ��������

�������! ���� �������� ����������������

Page 29: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Exhaustive & Heuristicalgorithms

• Exhaustive approaches• Examine all possible aligned positions simultaneously• Look for the optimal solution by (multi-dimensional) DP• Very (very) slow• Very (very) slow

• Heuristic approaches• Strategy to find a near-optimal solution

(by using rules of thumb) • Shortcuts are taken by reducing the search space

according to certain criteria• Much faster

Page 30: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

(�� �������� ��� ��� �������(�� �������� ��� ��� �������# ��# ��))!��������� �!�����������������!��������� �!�����������������

������������� ������

q �&���������������������� �������

Ø �* �����������

q +� ��������������������������������������

Ø �%�%��+ ��������������� ���������������������,���!�+���������� ��������������

q '��������� ��"���������� ��� ��� �����������������

Page 31: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

(������(������))��������� ����������������� �������� ��������&���������� ��������&����������

sequence

sequence

Page 32: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

# ��# ��))!��������� �!������!��������� �!������

�����������������������#��������� %,��#��������� %,�-./0-./0��

Sequence 1

Seq

uenc

e 2

Seq

uenc

e 2

Page 33: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The MSA approach

n Key idea: restrict the computational costs by determining a minimal region within the n-dimensional matrix that contains the optimal path

Lipman et al. 1989

contains the optimal path

Page 34: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The MSA method in detail

1. Let’s consider 3 sequences2. Calculate all pair-wise alignment

scores by Dynamic programming3. Use the scores to predict a tree4. Produce a heuristic multiple align.

based on the tree (quick & dirty)5. Calculate maximum cost for each

sequence pair from multiple

1. .

2. .

3. .

4. .

5. .

1 23

12

13

23

132

132

1sequence pair from multiple alignment (upper bound) &determine paths with < costs.

6. Determine spatial positions that must be calculated to obtain the optimal alignment (intersecting areas or ‘hypersausage’ around matrix diagonal)

7. Perform multi-dimensional DPNote Redundancy caused by highly

correlated sequences is avoided

6. .

1

3

Page 35: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

The DCA (Divide-and-Conquer) approachStoye et al. 1997

n Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.

n This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.families of (shorter) sequences.

n This procedure is re-iterated until the sequences are sufficiently short.

n Optimal alignment by MSA.

n Finally, the resulting short alignments are concatenated.

Page 36: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

So in effect …Sequence 1

Seq

uen

ce 2

Sequence 3

Seq

uen

ce 2

Page 37: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

# ��� ��� �������������!�# ��� ��� �������������!�

qq ������������������ ���� ��������� ���������� ���� ��������� ����$�������������������������������� �������%$�������������������������������� �������%

q &��������"��� �������$���������������� ��������������������������!�����$���������������� ��������������������������!������ ���������������

q '������"��� �������$���������������� ������������������"��� ��������

�������! ���� �������� ����������������

Page 38: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

1������������"��� �������������!1������������"��� �������������!

q 2�!�� �����!��3�����������������!����� ����������� ��������������������������"� ��������� ���!

q &������ �3��������������������������� ���������������������������������� ��� ����!��4��!������5����!������ � !�������� �������� ����������"� ��!!����������������������!�����������!� ���������%

Page 39: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Making a guide tree

�����

���������

���������%����������

1213

45

Score 1-2

Score 1-3

Score 4-5

������ ���������� ����������� ������

��� ����

�������������!

"#"

Page 40: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&��������"��� ��� ��� �������&��������"��� ��� ��� �������

1213

45

Score 1-2

Score 1-3

Score 4-5

Scores Similarity

Guide tree Multiple alignment

Scores Similaritymatrix5×5

Scores to distances Iteration possibilities

Page 41: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&��������"��� ���������������&��������"��� ���������������

-% &�����������)������ ������������� ���������������� �� ���������� 6��%�%������+�+)-�7*�� ���������� #���������!���������)� � � �� �������

*% 2�������� ����������������������������� ���������!���������������2�������� ����������������������������� ���������!���������������

8% 2��������������������!�������!������

9% : ���������������� �������"� ,���!�!� �������!�����!��� �������������!�����!� �����������+)-�� ��������������

Page 42: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

;����� ����������"��� ��� ��;����� ����������"��� ��� ��

� ����������������� ���������������� ��� �����������!��������� �����������!������

13

2

13

dAlign these two

These two are aligned

25

13

13

25

25

root

4

Page 43: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&<:�'+=����������"���������&<:�'+=����������"���������

13

2

13

13

13

25

254

4

At each step, Praline checks which of the pair-wise alignments (sequence-sequence, sequence-profile, profile-profile) has the highest score – this one gets selected

Page 44: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

But how can we align blocks of sequences ?

AB

CD

ABCD

E

?E

n The dynamic programming algorithm performs well for pairwise alignment (two axes).

n So we should try to treat the blocks as a “single” sequence …

Page 45: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

How to represent a block of sequences

n Historically: consensus sequencesingle sequence that best represents the amino acids observed at each alignment amino acids observed at each alignment position.

n Modern methods: alignment profilerepresentation that retains the information about frequencies of amino acids observed at each alignment position.

Page 46: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Consensus sequence

���������� F A T N M G T S D P P T H T R L R K L V S Q

��������� F V T N M N N S D G P T H T K L R K L V S T

�������� F * T N M * * S D * P T H T * L R K L V S *

n Problem: loss of information

n For larger blocks of sequences it “ punishes” more distant members

Page 47: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Alignment profiles

n Advantage: full representation of the sequence alignment (more information retained)

n Not only used in alignment methods, but n Not only used in alignment methods, but also in sequence-database searching (to detect distant homologues)

n Also called PSSM in BLAST (Position-specific scoring matrix)

Page 48: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

# ��� ��� ������������� ��# ��� ��� ������������� ��

AC

ifA..fC..

Core region Core regionGapped region

fA..fC..

fA..fC..

frequencies

ACD•••WY

-

fA..fC..fD..•••fW..fY..Gapo, gapxGapo, gapx

Position-dependent gap penalties

Gapo, gapx

fA..fC..fD..•••fW..fY..

fA..fC..fD..•••fW..fY..

Page 49: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&���� �� � !���&���� �� � !���q =���� �3����������������������!������������� ��!��������� ���������������%

iACD•••WY

Gappenalties

i0.30.10•••0.30.3

0.51.0Position dependent gap penalties

0.500•••00.5

00.50.2•••0.10.2

1.0

Page 50: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&���� �&���� �))��������� ���������������� �������

sequence

ACD……VWY

Page 51: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

(��������������� ��� �������(��������������� ��� �������

::>>�

?%9�:

?%*��

?%9�>

(�����������������!�� ���������������������� ����!������������������� ����������3

(�����@�?%9 A����,�:��B�?%* A����,����B�?%9 A����,�>�

Page 52: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&���� �&���� �))����� ��� ������������ ��� �������

ACD..Y

profile

ACD……VWY

profile

Page 53: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

;����� ������������������ �;����� ������������������ �))����� ������� ����������������

q :������������������� ���������"��!�������������!�������������������������������!�������

ACD..Y

Profile 1ACD..Y

Profile 2

�����������������������������!�������

q '�����!�������������������������������������� �������

q C���������������������� �������������������3

�� ××=*?

*?

DD�D� ���,����������(

Page 54: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&���� ���������� ��� �������&���� ���������� ��� �������

?%9�:

?%*��

?%E0�;

?%*0�(?%*��

?%9�>

#������������������������� ���������� ��������������%�����������������������������!��������� �����������3

(�����@�?%9A?%E0A��:,;��B�?%*A?%E0A���,;��B�?%9A?%E0A��>,;��B

B ?%9A?%*0A��:,(��B�?%*A?%*0A���,(��B�?%9A?%*0A��>,(�

���,�����"� ��������������!����������������������&:#*0?,�F ���G*���������������!��������,�

?%*0�(

Page 55: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&��������"��� ���������������&��������"��� ���������������

Methods:

nBiopat (Hogeweg and Hesper 1984 -- first integrated method ever)

nMULTAL (Taylor 1987)

nDIALIGN (1&2, Morgenstern 1996) – local MSA

nPRRP (Gotoh 1996)

nClustalW (Thompson et al 1994)

nPRALINE (Heringa 1999)

nT-Coffee (Notredame 2000)

nPOA (Lee 2002)

nMUSCLE (Edgar 2004)

nPROBSCONS (Do, 2005)

nMAFFT

Page 56: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Pair-wise alignment quality versussequence identity

(Vogt et al., JMB 249, 816-831,1995)

Page 57: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Clustal, ClustalW, ClustalX

n CLUSTAL W/X (Thompson et al., 1994) uses Neighbour Joining (NJ) algorithm (Saitou and Nei, 1984), widely used in phylogenetic analysis, to construct a guide tree (see lecture on phylogenetic methods).

n Sequence blocks are represented by profile, in which the individual sequences are additionally weighted according to the branch sequences are additionally weighted according to the branch lengths in the NJ tree.

n Further carefully crafted heuristics include: – (i) local gap penalties – (ii) automatic selection of the amino acid substitution matrix, (iii) automatic

gap penalty adjustment– (iv) mechanism to delay alignment of sequences that appear to be distant at

the time they are considered.

n CLUSTAL (W/X) does not allow iteration (Hogeweg and Hesper, 1984; Corpet, 1988, Gotoh, 1996; Heringa, 1999, 2002)

Page 58: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Aligning 13 Flavodoxins + cheY

Flavodoxinfold: doubly

5(βα)

fold: doubly wound βαβstructure

Page 59: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Flavodoxin family - TOPS diagrams

The basic topology of the flavodoxin fold is given below, the other four TOPS diagrams show flavodoxin folds with local insertions of secondary structure

� $%&"

$%&

"

secondary structure elements (David Gilbert)

α-helix

β-strand

Page 60: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Flavodoxin-cheY NJ tree

Page 61: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

ClustalW web-interface

Page 62: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

CLUSTAL X (1.64b) multiple sequence alignment Flavodoxin-cheY

1fx1 -PKALIVYGSTTGNTEYTAETIARQLANAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK

FLAV_DESVH MPKALIVYGSTTGNTEYTAETIARELADAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK

FLAV_DESGI MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-M-ETTVVNVADVTAPGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPLYE-DLDRAGLKDKK

FLAV_DESSA MSKSLIVYGSTTGNTETAAEYVAEAFENKE-I-DVELKNVTDVSVADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPLYD-SLENADLKGKK

FLAV_DESDE MSKVLIVFGSSTGNTESIAQKLEELIAAGG-H-EVTLLNAADASAENLADGYDAVLFGCSAWGMEDLE------MQDDFLSLFE-EFNRFGLAGRK

FLAV_CLOAB -MKISILYSSKTGKTERVAKLIEEGVKRSGNI-EVKTMNLDAVDKKFLQE-SEGIIFGTPTYYAN---------ISWEMKKWID-ESSEFNLEGKL

FLAV_MEGEL --MVEIVYWSGTGNTEAMANEIEAAVKAAG-A-DVESVRFEDTNVDDVAS-KDVILLGCPAMGSE--E------LEDSVVEPFF-TDLAPKLKGKK

4fxn ---MKIVYWSGTGNTEKMAELIAKGIIESG-K-DVNTINVSDVNIDELLN-EDILILGCSAMGDE--V------LEESEFEPFI-EEISTKISGKK

FLAV_ANASP SKKIGLFYGTQTGKTESVAEIIRDEFGNDVVT----LHDVSQAEVTDLND-YQYLIIGCPTWNIGELQ---SD-----WEGLYS-ELDDVDFNGKL

FLAV_AZOVI -AKIGLFFGSNTGKTRKVAKSIKKRFDDETMSD---ALNVNRVSAEDFAQ-YQFLILGTPTLGEGELPGLSSDCENESWEEFLP-KIEGLDFSGKT

2fcr --KIGIFFSTSTGNTTEVADFIGKTLGAKADAP---IDVDDVTDPQALKD-YDLLFLGAPTWNTGADTERSGT----SWDEFLYDKLPEVDMKDLP

FLAV_ENTAG MATIGIFFGSDTGQTRKVAKLIHQKLDGIADAP---LDVRRATREQFLS--YPVLLLGTPTLGDGELPGVEAGSQYDSWQEFTN-TLSEADLTGKT

FLAV_ECOLI -AITGIFFGSDTGNTENIAKMIQKQLGKDVAD----VHDIAKSSKEDLEA-YDILLLGIPTWYYGEAQ-CD-------WDDFFP-TLEEIDFNGKL

3chy --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR---3chy --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR---

. ... : . . :

1fx1 VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG----------------LRIDGDPRAARDDIVGWAHDVRGAI---------------

FLAV_DESVH VACFGCGDSSYEYF--CGAVDAIEEKLKNLGAEIVQDG----------------LRIDGDPRAARDDIVGWAHDVRGAI---------------

FLAV_DESGI VGVFGCGDSSYTYF--CGAVDVIEKKAEELGATLVASS----------------LKIDGEPDSAE--VLDWAREVLARV---------------

FLAV_DESSA VSVFGCGDSDYTYF--CGAVDAIEEKLEKMGAVVIGDS----------------LKIDGDPERDE--IVSWGSGIADKI---------------

FLAV_DESDE VAAFASGDQEYEHF--CGAVPAIEERAKELGATIIAEG----------------LKMEGDASNDPEAVASFAEDVLKQL---------------

FLAV_CLOAB GAAFSTANSIAGGS--DIALLTILNHLMVKGMLVYSGGVA----FGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------

FLAV_MEGEL VGLFGSYGWGSGE-----WMDAWKQRTEDTGATVIGTA----------------IVN-EMPDNAPECKE-LGEAAAKA----------------

4fxn VALFGSYGWGDGK-----WMRDFEERMNGYGCVVVETP----------------LIVQNEPDEAEQDCIEFGKKIANI----------------

FLAV_ANASP VAYFGTGDQIGYADNFQDAIGILEEKISQRGGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------

FLAV_AZOVI VALFGLGDQVGYPENYLDALGELYSFFKDRGAKIVGSWSTDGYEFESSEAVV-DGKFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----

2fcr VAIFGLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVR-DGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_ENTAG VALFGLGDQLNYSKNFVSAMRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------

FLAV_ECOLI VALFGCGDQEDYAEYFCDALGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA

3chy AD--GAMSALPVL-----MVTAEAKKENIIAAAQAGAS----------------GYV-VKPFTAATLEEKLNKIFEKLGM--------------

. . : . .

The secondary structures of 4 sequences are known and can be used to asses the alignment (red is β-strand, blue is α-helix)

Page 63: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

1������������ ����H1������������ ����H

:���������"�������������IIII

q &��������"��� ��� ��� �����������������!��������3�: ���������������!����������������������������#(:�� ������� ���� ����� ���������������������������!������ ��������������"�������%

q J����������������������������� �������!��������������"��� ���������������������������������������������������������%

q '������� � �����!���������� ����������������������������������������������������������������%�%������������� ������������������ ���������� ��!��������� �������������%

Page 64: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

&'������ ������(�%���� ��)

Progressive multiple alignment

*�� �+�,����������"-./

Page 65: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

• Matrix extension (T-coffee)• Profile pre-processing (Praline)

• Secondary structure-induced

Additional strategies for multiple sequence alignment

• Secondary structure-induced alignment

Objective: try to avoid (early) errors

Page 66: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Integrating alignment methods and alignment information with

T-Coffee• Integrating different pair-wise alignment

techniques (NW, SW, ..)techniques (NW, SW, ..)

• Combining different multiple alignment methods (consensus multiple alignment)

• Combining sequence alignment methods with structural alignment techniques

• Plug in user knowledge

Page 67: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Matrix extension

T-CoffeeTree-based Consistency Objective Function

For alignmEnt Evaluation

Cedric Notredame (“Bioinformatics for dummies” )

Des Higgins

Jaap Heringa J. Mol. Biol., J. Mol. Biol., 302, 205302, 205--217217;2000;2000

Page 68: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Using different sources of alignment information

0�1����

,���� �

0�1����

2��� �

���1��1������ ������

���1��,���� � 2��� � ���1��

3�0�44��

Page 69: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

T-Coffee library system

���� ��� ������� ����

5 65" 7 255 " 5 65" 7 255 "

5 65" 8 259 "9

7 255 8 :57 ;"

7 �55 8 <58 57

Page 70: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Matrix extensionMatrix extension

�$

�%

&�

&

$%

$&

%&

Page 71: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Search matrix extension – alignment transitivity

Page 72: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

T-Coffee

Other sequences

Direct alignment

Page 73: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

Search matrix extension

Page 74: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

T-COFFEE web-interface

Page 75: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

3D-COFFEE

n Computes structural alignments

n Structures associated with the sequences are with the sequences are retrieved and the information is used to optimise the MSA

n More accurate … but for many proteins we do not have a structure

Page 76: E N Genome Analysis (Integrative - VUSemi-global dynamic programming-two examples with different gap penalties - ... n Raw alignment score n Sequence similarity (alignment score

but.....T-COFFEE (V1.23) multiple sequence alignmentFlavodoxin-cheY1fx1 ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----

FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----

FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----

FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----

FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----

4fxn ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----

FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----

FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----

2fcr -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----

FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----

FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----

FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----

FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----

3chy ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV

:. . . : . ::

1fx1 ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------

FLAV_DESVH ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------

FLAV_DESGI ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------

FLAV_DESSA ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------

FLAV_DESDE ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------

4fxn ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------

FLAV_MEGEL ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------

FLAV_CLOAB ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------

2fcr ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_ENTAG ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------

FLAV_ANASP ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------

FLAV_AZOVI ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----

FLAV_ECOLI ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA

3chy TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------

.