
8/12/2019 Blast Clustal 4Students

1/33

So you have a sequence. What now?


2/33

The simplest bioinformatic problem:

Let us assume you have an as-yet uncharacterised nucleotide sequence that you obtained from a PCR experiment.

Question: How do you characterise (validate) your PCR product?

Answer:

(1) You interrogate a PRIMARY database (e.g. GenBank) and retrieve all the sequences that are significantly similar to your query (i.e. that are likely to be HOMOLOGOUS to it). This is done using the BLAST software.

(2) You generate a multiple sequence alignment of the retrieved (homologous) sequences. This is done using ClustalW.


3/33


4/33

1) Create a 2-D matrix and populate it with scores representing the similarities of the compared sequences.

2) Accumulate the scores in the matrix and penalize insertions and deletions.

3) Identify the highest scoring path in the matrix.


5/33

Identity matrix: columns follow SEQ1, rows follow SEQ2 (1 = identical residues, 0 = different residues)

    A H C N I R V S G V C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0


6/33


Accumulated matrix: columns follow SEQ1, rows follow SEQ2

    A H C N I R V S G V C L C R P M
A   8 7 6 6 5 4 4 4 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 4 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 4 4 4 3 3 2 1 0 0
N   5 5 5 6 5 4 4 4 4 4 3 3 2 1 0 0
R   4 4 4 4 4 5 4 4 4 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

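The MAX previous score, the one that has to be added to the value of the current cell (red in the original figure), is the highest value found in the next row or the next column (blue in the original figure), that is, among the cells lying below and to the right of the current cell.

The matrix is accumulated moving from the bottom right corner to the top left corner!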
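The accumulation rule can be sketched in a few lines of Python. This is a minimal illustration, not a full alignment program: it assumes identity scoring (1 for a match, 0 otherwise) and no gap penalty, which is what the example matrix uses, and the function names `dot_matrix` and `accumulate` are made up for this sketch.

```python
def dot_matrix(seq1, seq2):
    """Rows follow seq2, columns follow seq1; 1 marks identical residues."""
    return [[1 if a == b else 0 for a in seq1] for b in seq2]


def accumulate(dots):
    """Accumulate scores from the bottom right corner to the top left corner."""
    n_rows, n_cols = len(dots), len(dots[0])
    acc = [row[:] for row in dots]  # last row and last column keep their values
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            # Candidate previous scores: the next row (from the next column on)
            # and the next column (from the next row on).
            next_row = acc[i + 1][j + 1:]
            next_col = [acc[k][j + 1] for k in range(i + 1, n_rows)]
            acc[i][j] = dots[i][j] + max(next_row + next_col)
    return acc


SEQ1 = "AHCNIRVSGVCLCRPM"
SEQ2 = "AICINRCKCRHP"
acc = accumulate(dot_matrix(SEQ1, SEQ2))
print(acc[0][0])  # 8, the score of the highest scoring path
```

The value in the top left cell is the score of the best path; the path itself would be recovered by walking from that cell towards the bottom right, always moving to the highest-scoring reachable cell.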


7/33

[Figure: Venn diagram of amino acid properties, with overlapping sets labelled Tiny, Small, Hydrophobic, Polar, Aliphatic, Aromatic and Charged (positive and negative).]


8/33

Matrix representing probabilities of amino acid substitutions. This and other existing matrices can be used to build more accurate alignments of two sequences.
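As a rough illustration of how such a matrix is used, the sketch below scores two short ungapped alignments. The dictionary holds only a handful of entries that are believed to match BLOSUM62; treat them as assumptions and check them against the published matrix before relying on them. The peptides and the names `SUBSET`, `pair_score` and `ungapped_score` are hypothetical.

```python
# A few substitution scores, assumed to match BLOSUM62 (verify before use).
SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2, ("K", "R"): 2, ("D", "E"): 2, ("D", "L"): -4,
}


def pair_score(a, b, matrix=SUBSET):
    """Look up a residue pair; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq_a, seq_b, matrix=SUBSET):
    """Sum the substitution scores of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(ungapped_score("KIL", "RLL"))  # 2 + 2 + 4 = 8
print(ungapped_score("KIL", "KID"))  # 5 + 4 - 4 = 5
```

With plain identity scoring the first pair would score 1 and the second 2. The matrix reverses that ranking because K/R and I/L are conservative substitutions, whereas L/D is not.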


9/33


10/33

To search databases we use heuristic, similarity-based algorithms

Similarity-based database searches generate local alignments to find (within a sequence database) sequences related to the query sequence.

Given a query sequence, local alignments of the query sequence are generated against every sequence in the database. The scores of the alignments are used to identify sequences that are related to the query sequence.

BLAST is the most common heuristic algorithm used to search sequence databases.


11/33

BLAST

The Basic Local Alignment Search Tool

BLAST is the standard database search tool. It was developed by Stephen Altschul and colleagues in 1990.

BLAST is a class of related programs that perform a variety of database comparisons. For example:

Objective: To find high-scoring ungapped alignments between a query sequence and the sequences in a database. These are called High-scoring Segment Pairs (HSPs).

The existence of such segments above a given similarity threshold indicates pairwise similarity beyond random chance. This is used to distinguish related from unrelated sequences in a database.


12/33


13/33

The Algorithm

Given a Query sequence (e.g. QLNFSAGW)

FIRST STEP - SEEDING. Generate all words of length k (e.g. k = 2) in the query sequence.

Words in our example: QL; LN; NF; FS; SA; AG; GW.

SECOND STEP. Identify all words in the sequences in the database.

THIRD STEP. Align every seed against every word generated from the database. Calculate (using BLOSUM62 or another matrix) the score of every ungapped two-letter alignment generated in this way. An alignment is considered a MATCH if its score is above a certain threshold (default = 8 for amino acids).

FOURTH STEP. Matches (only) are extended to generate longer alignments. If no match is found for two sequences, they are not considered any longer. This saves time. If multiple matches are found for two sequences, all matches are extended. The extension of a match continues until mismatches cause the alignment score to drop below a given threshold (22 for proteins, 20 for DNA).

The resulting ungapped alignments are the HSPs.
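The first two steps can be sketched as follows. This is a simplified illustration: it only finds identical words, whereas the third step above also accepts similar words whose matrix score passes the threshold. The database sequence `MKLNFSA` and the helper names are hypothetical.

```python
def words(sequence, k):
    """All overlapping words of length k, in sequence order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def word_positions(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = {}
    for position, word in enumerate(words(sequence, k)):
        index.setdefault(word, []).append(position)
    return index


def exact_seed_hits(query, subject, k):
    """(query position, subject position) pairs that share an identical word."""
    subject_index = word_positions(subject, k)
    return [
        (q_pos, s_pos)
        for q_pos, word in enumerate(words(query, k))
        for s_pos in subject_index.get(word, [])
    ]


print(words("QLNFSAGW", 2))
# ['QL', 'LN', 'NF', 'FS', 'SA', 'AG', 'GW']
print(exact_seed_hits("QLNFSAGW", "MKLNFSA", 2))
# [(1, 2), (2, 3), (3, 4), (4, 5)]
```

Indexing the database words once, rather than comparing every query word with every database word, is the kind of shortcut that makes a heuristic search fast.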


14/33


15/33


16/33

Extending a match

Stop when: score of the current extension < 22.

Every MATCH (alignment with a minimal score of 8) is extended until we find the best extension (alignment of maximal score).

AGT PYNNGT NNT LTW HKR RRR K
TAG PYNNGT NNT LTW KHK KKK R

Initial match (or hit)

Extend while the score of the alignment increases

Keep extending until the score drops below 22
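A minimal sketch of ungapped extension, using the two sequences from this slide. Several choices here are assumptions and not BLAST's actual settings: residues are scored +1 for a match and -1 for a mismatch instead of with a substitution matrix, and extension stops once the score has fallen more than `drop` below the best score seen so far (a drop-off rule, used here in place of the fixed cutoff of 22 quoted above).

```python
def extend_one_way(query, subject, qi, si, step, drop, match, mismatch):
    """Walk in one direction; return (best gain, length reaching that gain)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > drop:
            break  # score fell too far below the best extension found
        qi += step
        si += step
    return best_gain, best_length


def extend_seed(query, subject, q_pos, s_pos, k, drop=2, match=1, mismatch=-1):
    """Extend a seed of length k in both directions; return (score, segment)."""
    seed = sum(
        match if query[q_pos + o] == subject[s_pos + o] else mismatch
        for o in range(k)
    )
    right_gain, right_len = extend_one_way(
        query, subject, q_pos + k, s_pos + k, 1, drop, match, mismatch)
    left_gain, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, drop, match, mismatch)
    segment = query[q_pos - left_len:q_pos + k + right_len]
    return seed + left_gain + right_gain, segment


QUERY = "".join("AGT PYNNGT NNT LTW HKR RRR K".split())
SUBJECT = "".join("TAG PYNNGT NNT LTW KHK KKK R".split())
# The seed is the word NNT, which starts at position 9 in both sequences.
print(extend_seed(QUERY, SUBJECT, 9, 9, 3))  # (12, 'PYNNGTNNTLTW')
```

The extension runs through the identical central block and is trimmed back to it once the flanking mismatches pull the score down.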


17/33

Interpreting BLAST

The output of BLAST provides a list of pairwise sequence matches ranked by the statistical significance of the scores of their HSPs.

In BLAST the statistical indicator is the E-value (NOT to be confused with a P value - see below). E-values (expectation values) express how many HSPs of a certain score are expected to be observed by chance alone in a database of given dimensions.

E = m * n * P

m = total number of residues in the database
n = number of residues in the query sequence
P = the probability that an HSP alignment is a result of random chance (THIS IS THE PROBABILITY OF THE ALIGNMENT!)


18/33

Interpretations of E-values

E <= 1e-50: Extremely high sequence similarity. Very close homologs.

1e-50 < E < 1e-8: Significantly high similarity. Almost certainly homologous.

1e-7 < E < 1e-2 (0.01): Sequences similar but not necessarily homologous. If they are homologous, they are distant homologs.

0.01 < E < 10: Match not significant.

Generally speaking, as a rule of thumb: E <= 1e-8 is significant.
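These rules of thumb can be written as a small lookup. The sketch makes one assumption of its own: values that fall between the listed ranges (for example between 1e-8 and 1e-7) are assigned to the next, weaker category. The function name `interpret_e_value` is hypothetical.

```python
def interpret_e_value(e):
    """Rule-of-thumb reading of a BLAST E-value (boundaries are assumptions)."""
    if e <= 1e-50:
        return "extremely high similarity, very close homologs"
    if e <= 1e-8:
        return "significantly high similarity, homologous"
    if e <= 1e-2:
        return "similar, possibly distant homologs"
    return "not significant"


for value in (1e-60, 1e-20, 1e-4, 4.2):
    print(value, "->", interpret_e_value(value))
```

Such cutoffs are conventions, not laws: the same E-value deserves more caution for a short or low-complexity query.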

Calculating E-values: an example

Given:
A query sequence 100 residues long
A database containing 10^12 residues
P = 1 * 10^-20 (of the HSP between 2 sequences)

E-value = 100 * 10^12 * 10^-20 = 10^-6

This will be expressed as 1e-6 in the BLAST output.
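The same arithmetic in Python, following the simplified formula E = m * n * P used in these slides. The function name `e_value` is made up for this sketch.

```python
import math


def e_value(database_residues, query_residues, p):
    """E = m * n * P, the simplified form used in these slides."""
    return database_residues * query_residues * p


e = e_value(1e12, 100, 1e-20)
print(f"{e:.0e}")  # 1e-06
```

Floating-point rounding means the result should be compared with `math.isclose(e, 1e-6)` and not with `==`.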


19/33


20/33

E = 4.2


21/33

Proteins can be classified in families

Members of a family generally perform similar (or related) tasks and have specific signatures. They are identified using BLAST.

If we can identify a protein as a member of a well-characterised family, we can generally predict its function.

Signatures of a protein family are referred to as Conserved Motifs.

Conserved motifs can only be identified by building a multiple sequence alignment.

If we can identify a conserved motif, we have learned something useful about the protein family under consideration.

Motifs generally have functional and/or structural relevance.

Understanding motifs is useful for:
Biotech purposes. Proteins with specific functions can be engineered.
Clues about the causes of diseases can be revealed.


22/33

Building a multiple sequence alignment can be easy or difficult

Easy:

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
******** ********** *****

Difficult due to insertions or deletions (indels):

TTGACATG CCGGGG---A AACCG
TTGACATG CCGGTG--GT AAGCC
TTGACATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
******** ?????????? *****


23/33


24/33

Multiple Sequence Alignment - Goals

To generate a concise, information-rich summary of sequence data.

Sometimes used to illustrate the dissimilarity or similarity between a group of sequences.

Alignments can be treated as models that can be used to test hypotheses.

Does this model of events accurately reflect known biological evidence?


25/33


26/33

Multiple Sequence Alignment with Clustal (Thompson 1996): The Principle

1) Given a set of sequences, the first step of a multiple sequence alignment is calculating the pairwise distances between the sequences.

2) The pairwise distances are used to build a guide tree, which is used as a guide to perform the multiple sequence alignment.

3) Using the guide tree, sequences are aligned starting from the two most similar. More distantly related sequences are progressively added.

[Figure with four sequences labelled Seq_a, Seq_b, Seq_c and Seq_d]
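One simple way to express a pairwise distance is the fraction of differing residues between two aligned sequences. The sketch below uses that measure as an assumption; it is not necessarily how Clustal computes its distances. The two fragments are rows 1 and 2 (Hbb_Human and Hbb_Horse) of the globin example shown later in these slides.

```python
def p_distance(aligned_a, aligned_b, gap="-"):
    """Fraction of differing residues, ignoring columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differences = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


HBB_HUMAN = "PEEKSAVTALWGKVN--VDEVGG"
HBB_HORSE = "GEEKAAVLALWDKVN--EEEVGG"
print(round(p_distance(HBB_HUMAN, HBB_HORSE), 2))  # 0.29
```

The value differs from the 0.17 given for this pair in the distance matrix of the globin example because only a short fragment is compared here; distances computed on full-length sequences are not expected to match.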


27/33


28/33

ClustalW - Guide Tree

Generate a Neighbor-Joining guide tree from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.


29/33

ClustalW - First pair

Align the two most closely related sequences first. This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
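Picking the first pair amounts to finding the smallest entry in the distance matrix. The sketch below uses the pairwise distances of the globin example from these slides, stored as a lower triangle. It only selects the starting pair and does not build the Neighbor-Joining tree. The function name `closest_pair` is hypothetical.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Lower triangle of the distance matrix: row i holds distances to sequences 0..i-1.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
]


def closest_pair(names, lower):
    """Return (distance, name_a, name_b) for the smallest pairwise distance."""
    best = None
    for i, row in enumerate(lower):
        for j, distance in enumerate(row):
            if best is None or distance < best[0]:
                best = (distance, names[j], names[i])
    return best


print(closest_pair(NAMES, LOWER))  # (0.13, 'Hba_Human', 'Hba_Horse')
```

The two alpha-globins are aligned first; the beta-globin pair (distance 0.17) is the next closest.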


30/33


31/33

CLUSTAL W

Quick pairwise alignment: calculate distance matrix

              1    2    3    4    5
Hbb_Human  1  -
Hbb_Horse  2  .17  -
Hba_Human  3  .59  .60  -
Hba_Horse  4  .59  .59  .13  -
Myg_Whale  5  .77  .77  .75  .75  -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Horse, Hba_Human and Myg_Whale; the joining steps are numbered 1 to 4]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure label: alpha-helices]


32/33


33/33

Advice on progressive alignment

Progressive alignment is a mathematical process that is completely independent of biological reality.

Can be a very good estimate.
Can be an impossibly poor estimate.

Requires user input and skill.