8/12/2019 Blast Clustal 4Students
So you have a sequence. What now?
The simplest bioinformatic problem:
Let us assume you have an as yet uncharacterised nucleotide sequence that you obtained from a PCR experiment.
Question: How do you characterise (validate) your PCR product?
Answer:
(1) You interrogate a PRIMARY database (e.g. GenBank) and retrieve all the
sequences that are significantly similar to your query (and are therefore likely to be HOMOLOGOUS).
This is done using the BLAST software.
(2) You generate a multiple sequence alignment of the retrieved (homologous) sequences.
This is done using ClustalW.
1) Create a 2-D matrix and populate it with scores representing the similarities of the
compared sequences.
2) Accumulate the scores in the matrix and
penalize insertions and deletions.
3) Identify the highest scoring path in the matrix.
Similarity matrix: SEQ1 across the columns, SEQ2 down the rows.

  A H C N I R V S G V C L C R P M
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
N 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
R 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
H 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Accumulated matrix (same layout as the similarity matrix):

  A H C N I R V S G V C L C R P M
A 8 7 6 6 5 4 4 4 4 4 3 3 2 1 0 0
I 7 7 6 6 6 4 4 4 4 4 3 3 2 1 0 0
C 6 6 7 6 5 4 4 4 4 4 4 3 3 1 0 0
I 6 6 6 5 6 4 4 4 4 4 3 3 2 1 0 0
N 5 5 5 6 5 4 4 4 4 4 3 3 2 1 0 0
R 4 4 4 4 4 5 4 4 4 4 3 3 2 1 0 0
C 3 3 4 3 3 3 3 3 3 3 4 3 3 1 0 0
K 3 3 3 3 3 3 3 3 3 3 3 3 2 1 0 0
C 2 2 3 2 2 2 2 2 2 2 3 2 3 1 0 0
R 2 1 1 1 1 2 1 1 1 1 1 1 1 2 0 0
H 1 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
The MAX previous score, the one
that has to be added to the current
RED CELL value, is the highest in
the BLUE ROW OR COLUMN.
The matrix is accumulated moving from the bottom right corner to the top left corner!
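The accumulation rule can be sketched in a few lines of Python. This is a minimal illustration, not a full alignment program: it assumes identity scoring (1 for identical residues, 0 otherwise), applies no gap penalty, and skips the traceback that recovers the highest scoring path. The function name `accumulate` is made up for this example. Because every identical pair is scored as 1, individual cells may differ from a hand-built matrix.

```python
def accumulate(seq1, seq2):
    """Return the accumulated similarity matrix (rows = seq2, columns = seq1)."""
    rows, cols = len(seq2), len(seq1)
    # Step 1: identity scores, 1 for identical residues and 0 otherwise.
    m = [[1 if seq2[i] == seq1[j] else 0 for j in range(cols)]
         for i in range(rows)]
    # Step 2: accumulate from the bottom right corner to the top left corner.
    # The last row and the last column have no previous cells and stay as they are.
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            # Candidate previous scores: the next row from column j+1 rightwards
            # and the next column from row i+1 downwards.
            next_row = m[i + 1][j + 1:]
            next_col = [m[k][j + 1] for k in range(i + 1, rows)]
            m[i][j] += max(next_row + next_col)
    return m


matrix = accumulate("AHCNIRVSGVCLCRPM", "AICINRCKCRHP")
print(matrix[0][0])  # accumulated score in the top left corner
```

For the two sequences used on the slide, the top left cell accumulates to 8, the same value as in the accumulated matrix shown above.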
[Figure: Venn diagram of amino acid properties. Sets shown: Small, Tiny, Hydrophobic, Polar, Aliphatic, Aromatic, Charged (+ and -).]
Matrix of scores reflecting how likely each amino acid substitution is. This and other existing
substitution matrices can be used to build more accurate alignments of two sequences.
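To make the use of such a matrix concrete, here is a small sketch that scores an ungapped alignment by summing one substitution score per column. The dictionary is a hand-copied subset of BLOSUM62 covering only Q, E, L and I, chosen for illustration; a real analysis would load the full 20 x 20 matrix. The names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are made up for this example.

```python
# Hand-copied subset of BLOSUM62 (only Q, E, L and I). This is an assumption
# made for brevity; a real analysis would use the complete matrix.
BLOSUM62_SUBSET = {
    ("Q", "Q"): 5, ("E", "E"): 5, ("L", "L"): 4, ("I", "I"): 4,
    ("Q", "E"): 2, ("L", "I"): 2,
    ("Q", "L"): -2, ("Q", "I"): -3, ("E", "L"): -3, ("E", "I"): -3,
}


def pair_score(a, b):
    """Look up the score of aligning residue a with residue b (symmetric)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(s1, s2):
    """Sum the substitution scores column by column, without gaps."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s1, s2))


print(ungapped_score("QL", "QL"))  # identical residues
print(ungapped_score("QL", "EI"))  # similar residues
print(ungapped_score("QL", "LQ"))  # dissimilar residues
```

Identical and chemically similar residues score positively, dissimilar ones negatively, which is what lets a matrix-based score separate plausible alignments from chance matches better than simple identity counting.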
To search databases we use heuristic, similarity-based algorithms
- Similarity-based database searches generate local alignments
to find (within a sequence database) sequences related to the
query sequence.
- Given a query sequence, local alignments of the query sequence
are generated against every sequence in the database. The scores of
the alignments are used to identify sequences that are related to the
query sequence.
- BLAST is the most common heuristic algorithm used to search sequence databases.
BLAST
The Basic Local Alignment Search Tool
- BLAST is the standard database search tool.
- Developed by Stephen Altschul and colleagues in 1990.
- BLAST is a class of related programs that perform a variety of database comparisons. For example, blastn compares a nucleotide query against a nucleotide database, and blastp compares a protein query against a protein database.
- Objective: to find high scoring ungapped alignments between
a query sequence and the sequences in a database.
- These are called High-scoring Segment Pairs (HSPs).
- The existence of such segments above a given similarity threshold indicates pairwise similarity beyond random chance.
- This is used to distinguish related from unrelated sequences in a database.
The Algorithm
Given a query sequence (e.g. QLNFSAGW)
FIRST STEP - SEEDING. Generate all words of length k (e.g. k = 2) in the query sequence.
Words in our example: QL; LN; NF; FS; SA; AG; GW.
SECOND STEP. Identify all words in the sequences in the database.
THIRD STEP. Align every seed against every word generated from the database. Calculate (using BLOSUM62 or another matrix) the score of every ungapped two-letter alignment generated in this way. An alignment is considered a MATCH if its score is above a certain threshold (default = 8 for amino acids).
FOURTH STEP. Matches (only) are extended to generate longer alignments. If no match is found for two sequences, they are not considered any longer. This saves time. If multiple matches are found for two sequences, all matches are extended. The extension of a match continues until mismatches cause the alignment score to drop below a given threshold (22 for proteins, 20 for DNA).
The resulting ungapped alignments are the HSPs.
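The seeding steps can be sketched as follows. This is a simplified illustration: it only reports words that are identical in the query and the database sequence, whereas the procedure described above scores every word pair with a substitution matrix and keeps those above the threshold. The function names and the database sequence `MSAGWQ` are invented for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping words of length k, in sequence order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def seed_hits(query, subject, k):
    """Positions (query_pos, subject_pos) where the two sequences share a word.

    Simplification: only identical words count as hits here.
    """
    index = defaultdict(list)
    for s_pos, word in enumerate(kmers(subject, k)):
        index[word].append(s_pos)
    hits = []
    for q_pos, word in enumerate(kmers(query, k)):
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


print(kmers("QLNFSAGW", 2))                # the seven words of the example
print(seed_hits("QLNFSAGW", "MSAGWQ", 2))  # hypothetical database sequence
```

Indexing the database words once and then looking up each query word is what makes seeding fast: sequences that share no word with the query are never examined further.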
Extending a match
Every MATCH (alignment with a minimal score of 8) is extended until we find the
best extension (alignment of maximal score).
Stop when: score of the current extension < 22.
AGT PYNNGT NNT LTW HKR RRR K
TAG PYNNGT NNT LTW KHK KKK R
- Initial match (or hit).
- Extend as long as the score of the alignment increases.
- Keep extending until the score drops below 22.
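The extension step can be sketched as below. Several things here are assumptions made for illustration and are not taken from the slide: the scoring is +1 per match and -1 per mismatch instead of a substitution matrix, the extension only runs to the right, the seed is taken to be the NNT word at position 9, and the stopping rule is a drop-off relative to the best score seen so far (parameter `x_drop`) rather than the fixed cutoff of 22.

```python
def extend_right(query, subject, q_start, s_start, k,
                 match=1, mismatch=-1, x_drop=3):
    """Extend an exact k-letter seed to the right without gaps.

    Returns the best score seen and the two aligned segments at that point.
    Scoring values and x_drop are illustrative assumptions.
    """
    score = k * match               # the seed itself is an exact match
    best_score, best_len = score, k
    i = k
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        if score > best_score:
            best_score, best_len = score, i + 1
        elif best_score - score >= x_drop:
            break                   # score has fallen too far below the best
        i += 1
    return (best_score,
            query[q_start:q_start + best_len],
            subject[s_start:s_start + best_len])


query = "AGTPYNNGTNNTLTWHKRRRRK"
subject = "TAGPYNNGTNNTLTWKHKKKKR"
print(extend_right(query, subject, q_start=9, s_start=9, k=3))
```

With these settings the seed NNT grows to NNTLTW and then stops, because the three mismatches that follow pull the running score too far below the best score reached. A full implementation would extend to the left in the same way.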
Interpreting BLAST
- The output of BLAST provides a list of pairwise sequence matches ranked by the statistical significance of the scores of their HSPs.
- In BLAST the statistical indicator is the E-value (NOT to be confused with a P value - see below).
- E-values (expectation values) express how many HSPs of a certain score are expected to be observed by chance alone in a database of given dimensions.
- E = m * n * P
  m = total number of residues in the database
  n = number of residues in the query sequence
  P = the probability that an HSP alignment is a result of random
  chance (THIS IS THE PROBABILITY OF THE ALIGNMENT!)
Interpretations of E-values
E <= 1e-50: Extremely high sequence similarity. Very close homologs.
1e-50 < E < 1e-8: Significantly high similarity. Surely homologous.
1e-7 < E < 1e-2 (0.01): Sequences similar but not necessarily homologous. If they are
homologous, they are distant homologs.
0.01 < E < 10: Match not significant.
Generally speaking, as a rule of thumb: E <= 1e-8 is significant.
Calculating E-values: an example
Given:
- a query sequence 100 residues long
- a database containing 10^12 residues
- P = 1 * 10^-20 (of the HSP between 2 sequences)
E-value = 100 * 10^12 * 10^-20 = 10^-6
This will be expressed as 1e-6 in the BLAST output.
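The worked example can be checked with a few lines of Python. This sketch simply evaluates E = m * n * P as defined on the slide and applies the rule of thumb E <= 1e-8; the function and parameter names are chosen for this example.

```python
import math


def e_value(db_residues, query_residues, p_alignment):
    """E = m * n * P, using the simplified definition given on the slide."""
    return db_residues * query_residues * p_alignment


e = e_value(db_residues=1e12, query_residues=100, p_alignment=1e-20)
print(f"{e:.0e}")                    # scientific notation
print(math.isclose(e, 1e-6))         # agrees with the hand calculation
print(e <= 1e-8)                     # rule-of-thumb significance check
```

Note that the same alignment probability gives a larger E-value in a larger database or with a longer query, which is why E-values from searches against different databases are not directly comparable.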
[Figure: BLAST output example, E = 4.2]
Proteins can be classified in families
- Members of a family generally perform similar (or related) tasks and have specific signatures.
- They are identified using BLAST.
- If we can identify a protein as a member of a well-characterised family, we can generally predict its function.
- Signatures of a protein family are referred to as Conserved Motifs.
- Conserved motifs can only be identified by building a multiple sequence alignment.
- If we can identify a conserved motif, we have learned something useful about the protein family under consideration.
- Motifs generally have functional and/or structural relevance.
- Understanding motifs is useful for:
  - Biotech purposes. Proteins with specific functions can be engineered.
  - Clues about the causes of diseases can be revealed.
Building a multiple sequence alignment can be easy or difficult

Easy:
GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****

Difficult due to insertions or deletions (indels):
TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****
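One quick way to see where the difficulty lies is to list the alignment columns that contain at least one gap. The sketch below does this for the two example alignments; the function name `gap_columns` is made up for this illustration, and the spaces between blocks are removed before counting columns.

```python
def gap_columns(rows):
    """Return the 0-based column indices where at least one row has a gap."""
    seqs = [row.replace(" ", "") for row in rows]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all rows must have the same aligned length")
    return [i for i in range(len(seqs[0])) if any(s[i] == "-" for s in seqs)]


easy = [
    "GCGGCCCA TCAGGTAGTT GGTGG",
    "GCGGCCCA TCAGGTAGTT GGTGG",
    "GCGTTCCA TCAGCTGGTT GGTGG",
    "GCGTCCCA TCAGCTAGTT GGTGG",
    "GCGGCGCA TTAGCTAGTT GGTGA",
]
difficult = [
    "TTGACATG CCGGGG---A AACCG",
    "TTGACATG CCGGTG--GT AAGCC",
    "TTGACATG -CTAGG---A ACGCG",
    "TTGACATG -CTAGGGAAC ACGCG",
    "TTGACATC -CTCTG---A ACGCG",
]
print(gap_columns(easy))       # no gapped columns
print(gap_columns(difficult))  # gapped columns fall in the middle block
```

In the first alignment no column needs a gap, so the residues line up directly. In the second, all gapped columns fall inside the middle block (columns 8 to 17), which is the region marked with question marks.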
Multiple Sequence Alignment - Goals
- To generate a concise, information-rich summary of sequence data.
- Sometimes used to illustrate the dissimilarity or similarity between a group of sequences.
- Alignments can be treated as models that can be used to test hypotheses.
- Does this model of events accurately reflect known biological evidence?
Multiple Sequence Alignment with Clustal (Thompson 1996): The Principle
1) Given a set of sequences, the first step of a multiple sequence alignment is
calculating the pairwise distances between the sequences.
2) The pairwise distances are used to build a guide tree, which is used as a guide to perform the multiple sequence alignment.
3) Using the guide tree, sequences are aligned starting from the two most
similar. More distantly related sequences are progressively added.
[Figure: example sequences Seq_a, Seq_b, Seq_c, Seq_d]
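As an illustration of the first step, a pairwise distance can be computed as the fraction of differing residues between two aligned sequences. This is a minimal sketch of one common distance measure (the p-distance), not necessarily the exact calculation Clustal performs. The two sequences are short excerpts of human and horse beta haemoglobin taken from the CLUSTAL W summary slide later in this deck; the function name `p_distance` is chosen for this example.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    differences = sum(1 for x, y in compared if x != y)
    return differences / len(compared)


hbb_human = "PEEKSAVTALWGKVN--VDEVGG"
hbb_horse = "GEEKAAVLALWDKVN--EEEVGG"
print(round(p_distance(hbb_human, hbb_horse), 2))
```

Because the excerpts are only 23 columns long, the value obtained here need not agree with a distance computed from the full-length sequences.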
ClustalW - Guide Tree
- Generate a Neighbor-Joining guide tree from these pairwise distances.
- This guide tree gives the order in which the progressive alignment will be carried out.
ClustalW - First pair
- Align the two most closely related sequences first.
- This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
CLUSTAL W

Quick pairwise alignment: calculate distance matrix

               1    2    3    4    5
Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale; nodes numbered 1 to 4]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure annotation: alpha-helices]
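To show how a distance matrix turns into an alignment order, the sketch below clusters the five sequences using the distances given on this slide. Note the swapped-in technique: for brevity it uses simple average-linkage clustering (UPGMA-style), not the Neighbor-Joining method that ClustalW uses, so it should be read as an illustration of the idea only. The function name `join_order` is made up for this example.

```python
from itertools import combinations

NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
# Pairwise distances copied from the slide's distance matrix.
DIST = {
    frozenset(("Hbb_Human", "Hbb_Horse")): 0.17,
    frozenset(("Hbb_Human", "Hba_Human")): 0.59,
    frozenset(("Hbb_Horse", "Hba_Human")): 0.60,
    frozenset(("Hbb_Human", "Hba_Horse")): 0.59,
    frozenset(("Hbb_Horse", "Hba_Horse")): 0.59,
    frozenset(("Hba_Human", "Hba_Horse")): 0.13,
    frozenset(("Hbb_Human", "Myg_Whale")): 0.77,
    frozenset(("Hbb_Horse", "Myg_Whale")): 0.77,
    frozenset(("Hba_Human", "Myg_Whale")): 0.75,
    frozenset(("Hba_Horse", "Myg_Whale")): 0.75,
}


def join_order(names, dist):
    """Average-linkage clustering: repeatedly merge the two closest clusters.

    Returns a list of (cluster_a, cluster_b, average_distance) in merge order.
    """
    clusters = [(name,) for name in names]

    def average(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    order = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        order.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return order


for a, b, d in join_order(NAMES, DIST):
    print(f"join {a} + {b} at distance {d:.2f}")
```

The two alpha chains (distance .13) are joined first, then the two beta chains (.17), then the alpha and beta groups, and finally the whale myoglobin, which is the most distant sequence.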
Advice on progressive alignment
- Progressive alignment is a mathematical process that is completely independent of biological reality.
- Can be a very good estimate.
- Can be an impossibly poor estimate.
- Requires user input and skill.