Biology 224 Instructor: Tom Peavy March 13 & 18, 2008


<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Multiple Sequence Alignment

Multiple sequence alignment: definition

• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned

• Homologous residues are aligned in columns across the length of the sequences

• residues are homologous in an evolutionary sense

• residues are homologous in a structural sense

Multiple sequence alignment: properties

• not necessarily one “correct” alignment of a protein family

• protein sequences evolve...

• ...the corresponding three-dimensional structures of proteins also evolve

• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment

• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures

Multiple sequence alignment: features

• some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved

• there may be conserved motifs such as a transmembrane domain

• there may be conserved secondary structure features

• there may be regions with consistent patterns of insertions or deletions (indels)

Multiple sequence alignment: methods

There are two main ways to make a multiple sequence alignment:

(1) Progressive alignment (Feng & Doolittle), e.g. ClustalW

(2) Iterative approaches.

Use Clustal W to do a progressive MSA

http://www2.ebi.ac.uk/clustalw/

Feng-Doolittle MSA occurs in 3 stages

[1] Do a set of global pairwise alignments (Needleman and Wunsch)

[2] Create a guide tree

[3] Progressively align the sequences
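Stage [1] depends on global pairwise alignment. As a minimal sketch of the Needleman-Wunsch recurrence, the function below computes only the optimal global score. The match, mismatch, and linear gap values are illustrative assumptions, not ClustalW's actual substitution matrices or gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions (simple match/mismatch,
    linear gap penalty), not the parameters used by ClustalW.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align two residues, or gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    return score[-1][-1]


print(needleman_wunsch_score("GTWYA", "GLWYA"))
```

A full implementation would also keep traceback pointers so the alignment itself, and not just its score, can be recovered.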

Progressive MSA stage 1 of 3: generate global pairwise alignments

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96

five closely related lipocalins

best score: sequences (2:3), score 99

Number of pairwise alignments needed

For N sequences, (N-1)(N)/2

For 5 sequences, (4)(5)/2 = 10
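The count above can be checked by enumerating the pairs directly. This is a small sketch; the function name `pair_count` is introduced here for illustration only.

```python
from itertools import combinations


def pair_count(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Enumerate every unordered pair of sequence indices 1..5,
# in the same (i:j) order that ClustalW reports them.
pairs = list(combinations(range(1, 6), 2))

print(pair_count(5), len(pairs))
print(pairs)
```

The count grows quadratically: 100 sequences already require 4,950 pairwise alignments.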

Feng-Doolittle stage 2: guide tree

• Convert similarity scores to distance scores

• A tree shows the distance between objects

• Distance methods are used (e.g. neighbor joining)

• ClustalW provides a syntax to describe the tree

• A guide tree is not a phylogenetic tree
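The first bullet can be sketched with the pairwise scores from stage 1. The conversion used here (distance = 1 - score/100) is an assumption chosen for illustration; ClustalW's own distance calculation may differ. The sketch only finds the closest pair, which is where a distance-based tree method would begin, and does not build the tree.

```python
# Pairwise scores copied from the stage 1 ClustalW output (sequence i, sequence j).
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}

# Assumed conversion: treat each score as percent similarity,
# so a higher score becomes a smaller distance.
distances = {pair: 1 - score / 100 for pair, score in scores.items()}

# The pair with the smallest distance is joined first.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 2))
```

With these scores the closest pair is sequences 2 and 3, the murine and rat RBPs.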

Progressive MSA stage 2 of 3: generate guide tree

five closely related lipocalins

3 (rat RBP)

2 (murine RBP)

4 (porcine RBP)

5 (bovine RBP)

1 (human RBP)

((Human RBP:0.04284,(Mouse RBP:0.00075, Rat RBP:0.00423) :0.10542):0.01900, Pig RBP:0.01924, Bovine RBP:0.01902);
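The parenthesized tree description above stores a branch length after each colon. As a rough sketch, a regular expression can pull out the leaf names and their branch lengths. This is not a full parser for the tree syntax: it ignores the nesting and the internal branch lengths, and it assumes leaf names contain only letters and spaces.

```python
import re

tree = ("((Human RBP:0.04284,(Mouse RBP:0.00075, Rat RBP:0.00423) :0.10542)"
        ":0.01900, Pig RBP:0.01924, Bovine RBP:0.01902);")

# A leaf is a name (letters and spaces, starting with a letter) followed by
# ":" and a number. Internal branches follow ")" and so are not matched.
leaf_pattern = re.compile(r"([A-Za-z][A-Za-z ]*):([0-9.]+)")

leaves = {name: float(length) for name, length in leaf_pattern.findall(tree)}

for name, length in leaves.items():
    print(f"{name:12s} {length:.5f}")
```

The shortest leaf branches belong to the mouse and rat sequences, consistent with their pairwise score of 99.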

Feng-Doolittle stage 3: progressive alignment

• Make an MSA based on the order in the guide tree

• Start with the two most closely related sequences

• Then add the next closest sequence

• Continue until all sequences are added to the MSA

• Rule: “once a gap, always a gap”

Clustal W alignment of 5 closely related lipocalins

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********

Why “once a gap, always a gap”?

• There are many possible ways to make an MSA

• Where gaps are added is a critical question

• Gaps are often added to the first two (closest) sequences

• To change the initial gap choices later on would be to give more weight to distantly related sequences

• To maintain the initial gap choices is to trust that those gaps are most believable
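The rule can be illustrated with a toy example. The sequences and the gap position below are hypothetical, and the helper `insert_gap_column` is a name introduced here. When a later sequence needs extra room, a gap is inserted as a whole column into every sequence that is already aligned, and the gaps chosen earlier are kept rather than reconsidered.

```python
def insert_gap_column(alignment, col):
    """Insert a gap character at column col in every row of an alignment."""
    return [row[:col] + "-" + row[col:] for row in alignment]


# Hypothetical first step: the two closest sequences, with one gap already chosen.
pair = ["GTW-YA",
        "GLWAYA"]

# Hypothetical third sequence that has an extra residue after position 2.
third = "GTXWAYA"

# Widen the existing alignment by one column; the earlier gap in row 0 remains.
widened = insert_gap_column(pair, 2)
msa = widened + [third]

for row in msa:
    print(row)
```

The gap placed when aligning the first two sequences survives unchanged, and the new gap appears in both of them together.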

Multiple sequence alignment to profile HMMs

• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position (column) of a multiple sequence alignment

• HMMs are probabilistic models

• Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignment

GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

An HMM is constructed from an MSA

Example: five lipocalins

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

Prob.  1    2    3    4    5
p(G)   1.0  -    -    -    -
p(T)   -    0.4  -    -    -
p(L)   -    0.2  -    -    -
p(R)   -    0.2  -    -    -
p(E)   -    0.2  -    -    0.4
p(W)   -    -    1.0  -    -
p(Y)   -    -    -    0.8  -
p(F)   -    -    -    0.2  -
p(A)   -    -    -    -    0.4
p(S)   -    -    -    -    0.2


P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064

log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
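The calculation above can be reproduced from the five aligned sequences. This is a minimal sketch of the column probabilities only; it uses raw observed frequencies with no pseudocounts, so any residue not seen in a column gets probability 0, and it does not model the insert or delete states of a full profile HMM. The function names are introduced here for illustration.

```python
import math
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(seqs):
    """Per-column residue frequencies of an ungapped alignment."""
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def sequence_probability(seq, profile):
    """Product of the per-column probabilities of each residue in seq."""
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column.get(residue, 0.0)  # unseen residue: probability 0
    return p


profile = column_probabilities(sequences)
p = sequence_probability("GEWYE", profile)

print(round(p, 3))            # 0.064
print(round(math.log(p), 2))  # -2.75
```

Note that GEWYE is not one of the five input sequences, yet it receives a nonzero probability because each of its residues occurs in the corresponding column.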

Position 1: G:1.0
Position 2: T:0.4 L:0.2 R:0.2 E:0.2
Position 3: W:1.0
Position 4: Y:0.8 F:0.2
Position 5: E:0.4 A:0.4 S:0.2

Databases of multiple sequence alignments

• BLOCKS (HMM)
• CDD (HMM)
• DOMO (Gapped MSA)
• INTERPRO
• iProClass
• MetaFAM
• Pfam (profile HMM library)
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE
• SMART

Query = your favorite protein

Database = set of many PSSMs

CDD is related to PSI-BLAST, but distinct

CDD searches against profiles generated from pre-selected alignments

Purpose: to find conserved domains in the query sequence

You can access CDD via DART at NCBI

CDD uses RPS-BLAST: reverse position-specific BLAST

Multiple sequence alignment algorithms

             Local    Global
Progressive  PIMA     CLUSTAL, PileUp, other
Iterative    DIALIGN  SAGA

Multiple sequence alignment programs

• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE

Clustal X

GCG PileUp

Assessment of alternative multiple sequence alignment algorithms

[1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.

Proteins <25% identity: 65% of residues align well

Proteins <40% identity: 80% of residues align well

[2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.
