1
Homology and sequence alignment.
HomologyHomology = Similarity between objects due to a common ancestry
Hund = Dog,Schwein = Pig
3
Sequence homology
VLSPAVKWAKVGAHAAGHG||| || |||| | ||||VLSEAVLWAKVEADVAGHG
Similarity between sequences as a result of common ancestry.
4
Sequence alignment
Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
5
Why align?VLSPAVKWAKV||| || |||| VLSEAVLWAKV
1.To detect if two sequence are homologous. If so, homology may indicate similarity in function (and structure).
2.Required for evolutionary studies (e.g., tree reconstruction).
3.To detect conservation (e.g., a tyrosine that is evolutionary conserved is more likely to be a phosphorylation site).
6
Sequence alignment
If two sequences share a common ancestor – for example human and dog hemoglobin, we can represent their evolutionary relationship using a tree
VLSPAV-WAKV||| || |||| VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
7
Perfect match
VLSPAV-WAKV||| || |||| VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).
8
A substitution
VLSPAV-WAKV||| || |||| VLSEAVLWAKV
VLSPAV-WAKV VLSEAVLWAKV
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).
9
Indel
VLSPAV-WAKV||| || |||| VLSEAVLWAKV
VLSPAV-WAKV
VLSEAVLWAKV
Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.
VLSEAVLWAKV
10
Indel
VLSPAV-WAKV||| || |||| VLSEAVLWAKV
VLSPAV-WAKV
VLSEAVWAKV
Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.
VLSEAVLWAKV
L
11
Indel
VLSPAV-WAKV
Normally, given two sequences we cannot tell whether it was an insertion or a deletion, so we term the event as an indel.
VLSEAVLWAKV
Deletion? Insertion?
12
Global vs. Local
• Global alignment – finds the best alignment across the entire two sequences.
• Local alignment – finds regions of similarity in parts of the sequences.
ADLGAVFALCDRYFQ|||| |||| |ADLGRTQN-CDRYYQ
ADLG CDRYFQ|||| |||| |ADLG CDRYYQ
Global alignment:
forces alignment in
regions which differ
Local alignment will
return only regions of
good alignment
13
Global alignment
PTK2 protein tyrosine kinase 2 of human and rhesus monkey
14
Proteins are comprised of domains
Domain B
Protein tyrosine kinase domain
Domain A
Human PTK2 :
15
Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed.
Domain X
Protein tyrosine kinase domain
Domain A
16
Domain X
Protein tyrosine kinase domain
Domain BProtein tyrosine kinase domain
Domain A
Leukocyte TK
PTK2 The sequence similarity is restricted to a single domain
17
Global alignment of PTK and LTK
18
Local alignment of PTK and LTK
19
Conclusions
Use global alignment when the two sequences share the same overall sequence arrangement.
Use local alignment to detect regions of similarity.
20
How alignments are computed
21
Pairwise alignment
AAGCTGAATTCGAAAGGCTCATTTCTGA
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
One possible alignment:
22
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
This alignment includes:2 mismatches 4 indels (gap)
10 perfect matches
23
Choosing an alignment for a pair of sequences
AAGCTGAATTCGAAAGGCTCATTTCTGA
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-
Which alignment is better?
Many different alignments are
possible for 2 sequences:
24
Scoring system (naïve)
AAGCTGAATT-C-GAAAGGCT-CATTTCTGA-
Score: = (+1)x10 + (-2)x2 + (-1)x4 = 2 Score: = (+1)x9 + (-2)x2 + (-1)x6 = -1
A-AGCTGAATTC--GAAAG-GCTCA-TTTCTGA-
Higher score Better alignment
Perfect match: +1
Mismatch: -2
Indel (gap): -1
25
Alignment scoring - scoring of sequence similarity:
Assumes independence between positions:each position is considered separately
Scores each position:• Positive if identical (match)• Negative if different (mismatch or gap)
Total score = sum of position scoresCan be positive or negative
26
Scoring system
•In the example above, the choice of +1 for match,-2 for mismatch, and -1 for gap is quite arbitrary
•Different scoring systems different alignments
•We want a good scoring system…
27
DNA scoring matrices
Can take into account biological phenomena such as:
• Transition-transversion
28
Amino-acid scoring matrices• Take into account physico-chemical properties
29
Scoring gaps (I)
In advanced algorithms, two gaps of one amino-acid are given a different score than one gap of two amino acids. This is solved by giving a penalty to each gap that is opened.
Gap extension penalty < Gap opening penalty
30
Homology versus chance similarity
How to check if the score is significant?
A. Take the two sequences Compute score.
B. Take one sequence randomly shuffle it -> find score with the second sequence. Repeat 100,000 times.
If the score in A is at the top 5% of the scores in B the similarity is significant.
31
How close?
• Rule of thumb:
• Proteins are homologous if they are at least 25% identical (length >100)
• DNA sequences are homologous if they are at least 70% identical
32
Twilight zone
• < 25% identity in proteins – may be homologous and may not be….
• (Note that 5% identity will be obtained completely by chance!)
33
Searching a sequence database
Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs
The same idea in short: Use your sequence as a query to find homologous
sequences in a sequence database
34
Some terminology
• Query sequence - the sequence with which we are searching
• Hit – a sequence found in the database, suspected as homologous
35
Query sequence: DNA or protein?
• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.
• Which is preferable?
36
Protein is better!
• Selection (and hence conservation) works (mostly) at the protein level:
CTTTCA = Leu-SerTTGAGT = Leu-Ser
37
Query type
• Nucleotides: a four letter alphabet
• Amino acids: a twenty letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity
38
Conclusion
The amino-acid sequence is often preferable for homology search
39
How do we search a database?
• If each pairwise alignment takes 1/10 of a second, and if the database contains 107
sequences, it will take 106 seconds = 11.5 days to complete one search.
• 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
40
Conclusion
• Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow
41
Heuristic
•Definition: a heuristic is a design to solve a problem that does not provide an exact solution (but is not too bad) but reduces the time complexity of the exact solution
42
BLAST
43
BLAST
• BLAST - Basic Local Alignment and Search Tool
• A heuristic for searching a database for similar sequences
44
DNA or Protein• All types of searches are possible
Query: DNA Protein
Database: DNA Protein
blastn – nuc vs. nucblastp – prot vs. protblastx – translated query vs. protein databasetblastn – protein vs. translated nuc. DBtblastx – translated query vs. translated database
Translated databases:
trEMBLgenPept
45
BLAST - underlying hypothesis
• The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences
46
How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be searched quickly
WTDFGYPAILKGGTAC
WTDTDFDFGFGYGYP …
47
BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words
• Search the database for consecutive neighboring words
48
Neighbor words
• neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level
GFB
GFC (20)
GPC (11)WAC (5)
49
E-value• The number of times we will theoretically
find an alignment with a score ≥ Y of a random sequence vs. a random database
Theoretically, we could trust
any result with an
E-value ≤ 1
In practice – BLAST uses estimations. E-values of 10-4 and lower indicate a
significant homology.E-values between 10-4 and 10-2 should be checked (similar domains, maybe
non-homologous).E-values between 10-2 and 1 do not
indicate a good homology
Web servers for pairwise alignment
BLAST 2 sequences (bl2Seq) at NCBI
Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment
• Does not use an exact algorithm but a heuristic
Back to NCBIBack to NCBI
BLAST – bl2seqBLAST – bl2seq
Bl2Seq - queryBl2Seq - query
blastnblastn – – nucleotidenucleotide blastpblastp – – proteinprotein
Bl2seq resultsBl2seq results
Bl2seq results
Match Match Dissimilarity Dissimilarity Gaps Gaps Similarity Similarity Low Low
complexity complexity
BLAST – programsBLAST – programs
Query: DNA Protein
Database: DNA Protein
BLAST – BlastpBLAST – Blastp
Blastp - resultsBlastp - results
Blastp – results (cont’)Blastp – results (cont’)
Blast scores:
• Bits score – A score for the alignment according to the number of similarities, identities, etc.
• Expected-score (E-value) –The number of alignments with the same score one can “expect” to see by chance when searching a random database of a particular size. The closer the e-value is to zero, the greater the confidence that the hit is really a homolog
Blastp – acquiring sequencesBlastp – acquiring sequences
blastp – acquiring sequencesblastp – acquiring sequences
64
VTISCTGSSSNIGAG-NHVKWYQQLPGVTISCTGTSSNIGS--ITVNWYQQLPGLRLSCSSSGFIFSS--YAMYWVRQAPGLSLTCTVSGTSFDD--YYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKGFYPSD--IAVEWWSNG--
Similar to pairwise alignment BUT n sequences are aligned instead of just 2
Multiple sequence alignment
65
MSA = Multiple Sequence AlignmentEach row represents an individual sequenceEach column represents the ‘same’ position
VTISCTGSSSNIGAG-NHVKWYQQLPGVTISCTGTSSNIGS--ITVNWYQQLPGLRLSCSSSGFIFSS--YAMYWVRQAPGLSLTCTVSGTSFDD--YYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKGFYPSD--IAVEWWSNG--
Multiple sequence alignment
66
Conserved positions
• Columns in which all the sequences contain the same amino acids or nucleotides
• Important for the function or structure
VTISCTGSSSNIGAG-NHVKWYQQLPGVTISCTGSSSNIGS--ITVNWYQQLPGLRLSCTGSGFIFSS--YAMYWYQQAPGLSLTCTGSGTSFDD-QYYSTWYQQPPG
67
Consensus sequence
A T C T T G T
A A C T T G T
A A C T T C T
A A C T T G T
A consensus sequence holds the most frequent character of the alignment at each column
68
Profile = PSSM = Position Specific Score Matrix
A T C T T G
A A C T T G
A A C T T C
1 2 3 4 5 6
A 1 .67 0 0 0 0
C 0 0 1 0 0 0.33
G 0 0 0 0 0 0.67
T 0 .33 0 1 1 0
69
Alignment methods
There is no available optimal solution for MSA – all methods are heuristics:
• Progressive/hierarchical alignment (Clustal)
• Iterative alignment (mafft, muscle)
70
ABCDE
Compute the pairwise Compute the pairwise alignments for all against alignments for all against
all (6 pairwise alignments).all (6 pairwise alignments).The similarities are The similarities are
converted to distances and converted to distances and stored in a tablestored in a table
First step:
Progressive alignment
A B C D E
A
B 8
C 15 17
D 16 14 10
E 32 31 31 32
71
A
D
C
B
E
Cluster the sequences to create a Cluster the sequences to create a tree (tree (guide treeguide tree):):
•represents the order in which pairs of represents the order in which pairs of sequences are to be alignedsequences are to be aligned•similar sequences are neighbors in the similar sequences are neighbors in the tree tree •distant sequences are distant from distant sequences are distant from each other in the treeeach other in the tree
Second step: A B C D E
A
B 8
C 15 17
D 16 14 10
E 32 31 31 32The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
72
Third step:A
D
C
B
E
1. Align the most similar (neighboring) pairs
sequence
sequence
sequence
sequence
73
Third step:A
D
C
B
E
2. Align pairs of pairs
sequence
profile
74
Third step:A
D
C
B
E sequence
profile
Main disadvantages:
•Sub-optimal tree topology
•Misalignments resulting from globally aligning pairs of sequences.
75
ABCDE
Iterative alignment
Guide tree
MSA
Pairwise distance table
A
DCB
Iterate until the MSA does not change (convergence)
E
76
Case study: Using homology searching
• The human kinome
77
Kinases and phosphatases
78
Multi-tasking enzymes
• Signal transduction• Metabolism• Transcription• Cell-cycle• Differentiation• Function of nervous and
immune system• …• And more
79
How many kinases in the human genome?
• 1950’s, discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase
• 1970’s, advent of cloning and sequencing produced a speculation that the vertebrate genome encodes as many as 1,001 kinases
80
• 2001 – human genome sequence …
• As well – databases of Genbank, Swissprot, and dbEST
• How can we find out how many kinases are out there?
How many kinases in the human genome?
81
The human kinome
• In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:
1. Search and cross-reference all these databases for all kinases
2. Characterize all found kinases
82
ePKs and aPKs
Eukaryotic protein kinases (majority) catalytic domain
Atypical protein kinases
Sequence homology of the catalytic domain; additional regulatory domains are non-homologous
No sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs
83
The search
• Several profiles were built:based on the catalytic domain of:
(a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain
(b) each subfamily of known aPKs
• HMM-profile searches and PSI-BLAST searches were performed
84
The results…
• 478 ePKs • 40 aPKs
• Total of 518 kinases
in the human genome
(half of the prediction
in the 1970’s)
[1.7% of human genes]