77
CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Embed Size (px)

Citation preview

Page 1: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

CS177 Lecture 4 Sequence Alignment

Tom Madej 10.03.05

Page 2: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Overview

• The alignment problem.• Homology, divergence, convergence.• Dynamic programming overview.• Amino acid substitution matrices.• Fast sequence database searching: BLAST.• Multiple sequence alignment.• PSI-BLAST, position specific score matrices.• Databases of multiple sequence alignments.

Page 3: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

What is a sequence alignment?

• A linear, one-to-one correspondence between some of the symbols in one sequence with some of the symbols in another sequence.

• May be DNA or protein sequences.

• A (sequence) alignment can also be derived from a superposition of two protein structures, it is then sometimes called a structure alignment.

Page 4: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Classwork

• In Exercise 1 we will see DNA-DNA alignments produced by the program Basic Local Alignment Search Tool (BLAST)

• In Exercise 2 we will see alignments of protein structures produced by Vector Alignment Search Tool (VAST)

Page 5: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Exercise 1: BLAST this DNA sequence against the Human Genome

> CCR5ATGGATTATCAAGTGTCAAGTCCAATCTATGACATCAATTATTATACATCGGAGCCCTGCCAAAAAATCA ATGTGAAGCAAATCGCAGCCCGCCTCCTGCCTCCGCTCTACTCACTGGTGTTCATCTTTGGTTTTGTGGG CAACATGCTGGTCATCCTCATCCTGATAAACTGCAAAAGGCTGAAGAGCATGACTGACATCTACCTGCTC AACCTGGCCATCTCTGACCTGTTTTTCCTTCTTACTGTCCCCTTCTGGGCTCACTATGCTGCCGCCCAGT GGGACTTTGGAAATACAATGTGTCAACTCTTGACAGGGCTCTATTTTATAGGCTTCTTCTCTGGAATCTT CTTCATCATCCTCCTGACAATCGATAGGTACCTGGCTGTCGTCCATGCTGTGTTTGCTTTAAAAGCCAGG ACGGTCACCTTTGGGGTGGTGACAAGTGTGATCACTTGGGTGGTGGCTGTGTTTGCGTCTCTCCCAGGAA TCATCTTTACCAGATCTCAAAAAGAAGGTCTTCATTACACCTGCAGCTCTCATTTTCCATACAGTCAGTA TCAATTCTGGAAGAATTTCCAGACATTAAAGATAGTCATCTTGGGGCTGGTCCTGCCGCTGCTTGTCATG GTCATCTGCTACTCGGGAATCCTAAAAACTCTGCTTCGGTGTCGAAATGAGAAGAAGAGGCACAGGGCTG TGAGGCTTATCTTCACCATCATGATTGTTTATTTTCTCTTCTGGGCTCCCTACAACATTGTCCTTCTCCT GAACACCTTCCAGGAATTCTTTGGCCTGAATAATTGCAGTAGCTCTAACAGGTTGGACCAAGCTATGCAG GTGACAGAGACTCTTGGGATGACGCACTGCTGCATCAACCCCATCATCTATGCCTTTGTCGGGGAGAAGT TCAGAAACTACCTCTTAGTCTTCTTCCAAAAGCACATTGCCAAACGCTTCTGCAAATGCTGTTCTATTTT CCAGCAAGAGGCTCCCGAGCGAGCAAGCTCAGTTTACACCCGATCCACTGGGGAGCAGGAAATATCTGTG GGCTTGTGA

Page 6: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Exercise 1: cont.

• From NCBI home page follow the “BLAST” link• Then from “Genomes” select “Human”• Paste the sequence from the previous slide into the box

(note that it is in FASTA format)• Click on “Begin Search”• Wait for a short while then try “Format”

Page 7: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Alignment format examples

• Aligned (similar) regions are in rectangles.

Page 8: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Exercise 2

• From http://www.ncbi.nlm.nih.gov/Structure/ go to the MMDB summary page for 1nqp.

• Click on the colored bar for chain “A”.

• Scroll down and select the VAST neighbor 1KR7 A.

• Click on “View 3D Structure” at the top to view the structural superposition using Cn3D.

• The sequence alignment will be shown in a separate window.

Page 9: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Divergence and Convergence

• Two proteins may be similar because of divergence from a common ancestor (i.e. they are homologous).

• … or, perhaps two proteins may be similar because they perform similar functions and are thereby constrained, even though they arose independently (functional convergence hypothesis, they are then called analogous).

Page 10: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Divergence vs. Convergence

• Convergence can occur; e.g. there exist enzymes with different overall structures but remarkably similar arrangements of active site residues to carry out a similar function.

• So how can one establish homology?

Page 11: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Homology

“… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.”

E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003

Page 12: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Two arguments…

• We see a continuous range of sequence similarity. Convergence is extremely unlikely for highly similar protein families. It then appears implausible to invoke it for less similar families.

• The same or very similar functions may be carried out by proteins with very different structures (folds). Therefore, functional constraints cannot force convergence of the sequence or structure between proteins.

Page 13: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Alignments of hemoglobins: human-wolf (83%); human-chicken (59%)

Page 14: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Alignments of hemoglobins: human-hagfish (24%); human-worm (17%)

Page 15: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Sequence identity for VAST neighbors of 1NQP A (a globin)

Page 16: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Issues in sequence alignment…

• How do you quantify amino acid similarity?

• How can you handle gaps in the alignment?

• How do you define the overall similarity score?

• Can you compute an optimal alignment?

• Can you compute an alignment efficiently?

• Can you calculate statistical significance?

Page 17: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Global and local alignment

• Alignment programs may be modified, e.g. by scoring parameters, to produce global or local alignments.

• Local alignments tend to be more useful, as it is highly possible that a significant similarity may encompass only a portion of one or both sequences being compared.

Page 18: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Dynamic Programming overview

• Computational algorithmic method devised in the 1940’s.

• An efficient method to find optimal solutions for certain classes of problems.

• Particularly useful in bioinformatics for protein sequence comparison, etc.

Page 19: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Characteristics of problems that can be solved by DP…

• Optimal substructure: The optimal solution may be constructed from optimal solutions to subproblems.

• Overlapping subproblems: To compute the optimal solution we only need to consider a “small” number of subproblems.

Page 20: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Example: Longest Common Subsequence (LCS)

• Given two sequences of symbols, find a longest subsequence that occurs in both sequences.

• seq1: abcabddcbaabb • seq2: bdacbaabcbaa

• abcabddcbaabb• bdacbaabcbaa• Here is another subsequence of length 8: bcabcbaa

Page 21: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Optimal substructure of an LCS

• X = x1…xm, Y = y1…yn input sequences; let Z = z1…zk be any LCS of X and Y

• If xm = yn then zk = xm = yn and Zk-1 is an LCS for Xm-1 and Yn-1

• If xm yn and zk xm then Z is an LCS for Xm-1 and Y

• If xm yn and zk yn then Z is an LCS for X and Yn-1

• (Note: Xm-1 = x1…xm-1, etc.)

Page 22: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Computing the length of an LCS

• sij = length of an LCS for Xi and Yj

• If xi = yj then sij = si-1,j-1 + 1

• Otherwise sij = max(si,j-1, si-1,j)

• This leads to a bookkeeping scheme where we calculate the scores by filling in an mxn table. The score at position (i, j) depends on the score at (i-1, j-1) if the symbols xi and yj match, otherwise it depends on the scores at positions (i-1, j) and (i, j-1).

• An LCS can be computed by “traceback”.

Page 23: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

DP table initializations: fill in first row and first column

Page 24: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

DP computation: fill in the rest of the tables row by row (or column by column)

Page 25: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Traceback to find an LCS: bab

Page 26: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Complexity of the DP algorithm

• Time O(nm); space O(nm) where n, m are the lengths of the two sequences.

• Space complexity can be reduced to O(n) by not storing the entries of dynamic programming table that are no longer needed for the computation (keep current row and the previous row only).

Page 27: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Using DP for protein sequence alignment

• In protein sequence alignments there are blocks of similarity and dissimilarity (gaps). E.g. look back at the globin alignments earlier in the lecture.

• The scoring is more complicated because there are “degrees of similarity” among the 20 amino acids.

• To enforce the block structure “gap penalties” have to be introduced.

Page 28: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

DNA and proteins

• Much more sensitive comparison is possible between protein sequences than DNA sequences.

• One reason is that the third codon in the genetic code is highly redundant, and introduces noise into DNA comparisons.

• Another reason is that the physico-chemical properties of the amino acid residues provide information that is highly relevant to comparison.

Page 29: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Note the genetic code is degenerate.

Page 30: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Molecular Cell Biology, Lodish et al.Fig. 3-2.

Page 31: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Molecular Cell Biology, Lodish et. alFig. 3-2.

Page 32: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Amino acid substitution matrices

• Ab initio approaches; e.g. assign scores based on number of mutations needed to transform one codon to another, or on other physico-chemical measures of a.a. similarity.

• Empirical, i.e. derive statistics about a.a. substitutions from collections of alignments. (These work the best).

Page 33: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Computation of scores, empirical approach

sij = ln(qij/(pipj))/λ

• sij is the score for substituting amino acid type i by type j.

• qij is the observed frequency of substitutions of a.a. type i by a.a. type j (in the training set).

• pi is the background frequency for a.a. type i in the training set.

• λ is a positive constant.

Page 34: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Example (numbers are fictional)

• N = total number of a.a.’s in training set = 1000• F occurs 15 times• Y occurs 25 times• (F, Y) substitutions occur 200 times

• Np = total number of subst. pairs = 10000

• Then: pF = 0.015, pY = 0.025, qFY = 0.02

• ln(qFY/(pFpY)) ln(53) 4

Page 35: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Substitution matrices

• There are many different matrices available.

• The most commonly used are the BLOSUM or PAM series.

• BLAST defaults to the BLOSUM62 matrix. The BLOSUM matrices have been shown to be more sensitive than the PAM matrices for database searches.

Page 36: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05
Page 37: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Gap penalties

• There is no suitable theory for gap penalties.• The most common type of gap penalty is the affine gap

penalty: g = a + bx, where a is the gap opening penalty, b is the gap extension penalty, and x is the number of gapped-out residues.

• Typical values, e.g. a = 10 and b = 1 for BLAST.• If the gap penalty has different opening and extension

costs, then the DP algorithm becomes a little more complicated (cf. Chapter 3 in Mount).

Page 38: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Dynamic programming algorithm for computing the score of the optimal alignment

For a sequence S = a1, a2, …, an let Sj = a1, a2, …, aj

Align(Si,S’j) = the score of the highest scoring alignment between S1

i,S2j

S(ai, a’j)= similarity score between amino acids ai and a’j given by a scoring matrix like PAM, BLOSUM

g – gap penalty

Align(Si,S’j)= max

Align(Si-1,S’j-1)+ S(ai, a’j)

Align(Si,S’j-1) - g

Align(Si-1,S’j) - g

{

Page 39: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Organizing the computation – dynamic programming table

Align(Si-1,S’j-1)+ s(ai, a’j)

Align(Si-1,S’j) - g

Align(Si,S’j-1) - g

Align(Si,S’j) = max

{

j

i

Align[i, j] =

Align

+s(ai,aj)

max

Page 40: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Example of DP computation with g = 0; match = 1; mismatch = 0

Maximal Common Subsequence

A T T G C G C G C A T

ATGCTTAACCA

+1

max

0 0 0 0 0 0 0 0 0 0 0 0

0 1 1 1 1 1 1 1 1 1 1 1

0 1 2 2 2 2 2 2 2

0 1 2 2 3

0 1

0 1

0 1

0 1

0 1

0 1

0 1

0 1

initialization

Page 41: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Example of DP computation with g = 2; match = 2; mismatch = -1

A T T G C G C G C A T

ATGCTTAACCA

+2

max-2

-2

Initialization (penalty for starting with a gap)

0 -2 -4 -6 -8 -10 -12 -14 -16 -18 -20 -22

-2 2 0 -2 -4

-4 0 4 2

-6 -2 2 3

-8 -4

-10

-12

-14

-16

-18

-20

-22

Page 42: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

The iterative algorithm

m = |S|; n = |S’| for i 0 to m do A[i,0]- i * g for j 0 to n do A[0,j] - j * g for i 0 to m do for j 0 to n A[i,j]max ( A[i-1,j] – g A[i-1,j-1] + s(i,j) A[i,j-1] – g )return(A[m,n])

Page 43: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Exercise 3

• Copy and paste the sequences for 1nqpA and 1kr7A into a notepad.

• Go to the web site: http://pir.georgetown.edu/pirwww/search/textpsd.shtml

• Follow the “Search and Retrieval” link, then “Pairwise alignment”.

• Copy in your sequences (use FASTA format and remove numbers) and then run SSEARCH (Smith-Waterman algorithm, a DP alignment method).

Page 44: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

SSEARCH (pir.georgetown.edu)

Page 45: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

BLAST (Basic Local Alignment Search Tool)

• Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman.

• Method of choice for database searches.

• Statistical theory for significance of results.

• Heuristic; does not guarantee optimal results.

• Many variants, e.g. PHI-, PSI-, RPS-BLAST.

Page 46: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Why database searches?

• Gene finding.

• Assigning likely function to a gene.

• Identifying regulatory elements.

• Understanding genome evolution.

• Assisting in sequence assembly.

• Finding relations between genes.

Page 47: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Issues in database searches

• Speed.

• Relevance of the search results (selectivity).

• Recovering all information of interest (sensitivity).– The results depend on the search parameters, e.g. gap

penalty, scoring matrix.– Sometimes searches with more than one matrix should be

performed.

Page 48: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Main idea of BLAST

• Homologous sequences are likely to contain a short high scoring similarity region, a hit. Each hit gives a seed that BLAST tries to extend on both sides.

Page 49: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Some BLAST terminology…

word – substring of a sequence

word pair – pair of words of the same length

score of a word pair – score of the gapless alignment of the two words:

V A L M R

V A K N S score = 4 + 4 – 2 – 2 - 1 = 3 (BLOSUM62)

HSP – high scoring word pair

Page 50: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Main steps of BLAST

• Parameters: w = length of a hit; T = min. score of a hit (for proteins: w = 3, T = 13 (BLOSUM62)).

• Step 1: Given query sequence Q, compile the list of possible words which form with words in Q high scoring word pairs.

• Step 2: Scan database for exact matching with the list of words compiled in step 1.

• Step 3: Extend the hits from step 2.

• Step 4: Evaluate significance of extended hits from step 3.

Page 51: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Step 1: Find high scoring words

• For every word x of length w in Q make a list of words that when aligned to x score at least T.

• Example: Let x = AIV then the score for AIA is 4+4+0 (dropped) and for AII 4+4+3 (taken).

• The number of words in the list depends on w and T, and is usually much less than 203 (typically about 50).

Page 52: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Step 2 – Finding hits

• Scan database for exact matching with the list of words compiled in step1 :

• Two techniques.– Hash table.– Keyword tree (there is a finite-automaton based method for

exact matching with a set of patterns represented as a tree).

Page 53: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Step 3: Extending hits

• Parameter: X (controlled by a user).

• Extend the hits in both ways along diagonal (ungapped alignment) until score drops more than X relative to the best score yet attained.

• Return the score of the highest scoring segment pair (HSP).

extensions

Page 54: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

E-values, P-values

• E-value, Expectation value; this is the expected number of hits of at least the given score, that you would expect by random chance for the search database.

• P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database.

• E-values are easier to interpret than P-values.

• If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value.

Page 55: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Karlin-Altschul statistics

• Expected number of HSPs with score ≥ S is:

E = KmNe –λS

• m = length of query sequence

• N = database size in residues

• K scales the search space size

• λ a scale for the scoring system

Page 56: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

The bit score

• This formula “normalizes” a raw score:

S’ = (λS – ln K)/ln 2

• The E-value is then given by:

E = mN 2 –S’

• Allows direct comparison of E-values, even with differing scoring matrices.

Page 57: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Karlin and Altschul provided a theory for computing statistical significance

• Assumptions:

– The scoring matrix M must be such that the score for a random alignment is negative.

– n, m (lengths of the aligned sequences) are large.

– The amino acid distribution p(x) is in the query sequence and the data base is the same.

– Positive score is possible (i.e. M has at least one positive entry).

Page 58: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

P(S<x) = exp (-e –(x-) ) thus:P(S>=x) =1- exp (-e –(x-) )

Score of high scoring sequence pairs follows extreme value distribution

normal

Extreme values

– decay constant u – value of the peak

Page 59: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Refinement of the basic algorithm-the two hit method

• Observation: HSPs of interest are long and can contain multiple hits a relatively short distance apart.

• Central idea: Look for non-overlapping pairs of hits that are of distance at most d on the same diagonal.

• Benefits: – Can reduce word size w from 3 two 2 without losing

sensitivity (actually sensitivity of two-hit BLAST is higher).

– Since extending a hit requires a diagonal partner, many fewer hits are extended, resulting in increased speed.

Page 60: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Gapped BLAST

• Find two non-overlapping hits of score at least T and distance at most d from one another.

• Invoke ungapped extension.

• If the HSP generated has normalized score at least Sg bits then start gapped extension.

• Report resulting alignment if it has sufficiently significant (very small) E-value.

Page 61: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Gapped BLAST statistics

• Theory does not work.• Simulations indicate that for the best hits scores for local

alignments follow an extreme value distribution.• Method approximates and to match experimental distribution

- and can be computed from the median and variation of the experimental distribution.

• BLAST approach – simulate the distribution for set of scoring matrices and a number of gap penalties. BLAST offers a choice of parameters from this pre-computed set.

Page 62: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Motivation for multiple alignments

• Look at a multiple alignment of several of the VAST structure neighbors of 1h1b chain A.

Page 63: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Multiple Sequence Alignment

• A multiple alignment of sequences from a protein family makes the conserved features much more apparent.

• Much more difficult to compute than pairwise alignments.

• The most commonly used and newer programs use the “progressive alignment strategy”.

• There are also important databases of multiple alignments for protein families.

Page 64: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Progressive alignment

• Determine an (approximate) phylogenetic tree for the sequences.

• Construct the multiple alignment by merging pairwise alignments based on the phylogenetic tree, the most closely related sequences first, etc.

• The idea is, if you are careless about the order and merge distantly related sequences too soon in the process, then errors in the alignment may occur and propagate.

Page 65: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Multiple sequence alignment programs

• CLUSTALW, Thompson et al. NAR 1994.

• T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation), C. Notredame et al. JMB 2000.

• MUSCLE (Multiple sequence comparison by log expectation), R. Edgar NAR 2004.

Page 66: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05
Page 67: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

PSI-BLAST

• Position Specific Iterated BLAST

• As a first step runs a (regular) BLAST.

• Hits that cross the threshold are used to construct a position specific score matrix (PSSM).

• A new search is done using the PSSM to find more remotely related sequences.

• The last two steps are iterated until convergence.

Page 68: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

PSSM (Position Specific Score Matrix)

• One column per residue in the query sequence.

• Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column.

• There are difficulties; e.g. pseudo-counts are needed if there are not a lot of sequences, the sequences must be weighted to compensate for redundancy.

Page 69: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Two key advantages of PSSMs

• More sensitive scoring because of improved estimates of probabilities for a.a.’s at specific positions.

• Describes the important motifs that occur in the protein family and therefore enhances the selectivity.

Page 70: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Position Specific Substitution Rates

Active site serineWeakly conserved serine

Page 71: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Position Specific Score Matrix (PSSM)

A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 D -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3

Active site nucleophile

Serine scored differently in these two positions

Page 72: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Exercise: using PSI-BLAST

• Go to the NCBI BLAST web site.

• Click on the link to PHI- and PSI- BLAST.

• Choose the search database to be “pdb”.

• Enter the sequence for 1h1bA, Human Neutrophil Elastase.

• How many iterations does it take before 1dleB (Factor B Serine Protease domain) shows up as a significant hit?

Page 73: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Modifying thresholds…

• On occasion, it can prove useful to modify (increase) the inclusion threshold parameter.

• The user can also manually select hits to include in the PSSM that do not meet the threshold, e.g. if one is certain they are homologs to the query.

• Of course, one must always be extremely careful when doing so!

Page 74: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

HMMs

• Hidden Markov Models, a type of statistical model.

• Have numerous applications (including outside of bioinformatics).

• HMMs were used to construct Pfam, a database of multiple alignments for protein families (HMMer).

Page 75: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

CDD Search

• Conserved Domain Database (CDD) at NCBI.

• Contains alignments from Smart, Pfam, COG, KOG, and LOAD databases.

• Many multiple alignments are manually curated.

• PSSMs derived from multiple alignments may be searched with RPS-BLAST (Reverse Position Specific BLAST).

Page 76: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Multiple alignment of globins from CDD

Page 77: CS177 Lecture 4 Sequence Alignment Tom Madej 10.03.05

Thank you to Dr. Teresa Przytycka for slides about dynamic programming and BLAST.