Pairwise Alignment

Pairwise Alignment. Sequences are related.. Phylogenetic tree of globin-type proteins found in humans


Page 1: Pairwise Alignment. Sequences are related.. Phylogenetic tree of globin-type proteins found in humans

Pairwise Alignment

Sequences are related.

Phylogenetic tree of globin-type proteins found in humans

Definition of pairwise alignment

The process of lining up two or more sequences to achieve maximal levels of identity (or similarity, in the case of amino acid sequences).

What for? A few examples:

• Determining whether 2 sequences from 2 entries found by a keyword search are similar/identical

• Focusing on differences (genes sequenced in different labs, alternative splicing, SNPs, mutations)

• Finding similar (conserved) regions in two sequences

• More…


How do we align two sequences?

ATTGCAGTGATCG

ATTGCGTCGATCG

Solution 1:

ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG

10 matches ( | ), 3 mismatches

Solution 2:

ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG

12 matches ( | ), 2 gaps ( - )

Which alignment is better?

Solution 1 (10 matches, 3 mismatches):

ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG

Solution 2 (12 matches, 2 gaps):

ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG

We will use a scoring scheme:

               Scheme 1   Scheme 2
Match             +1         +1
Mismatch          -1          0
Indel (gap)       -2         -2

Scheme 1: Solution 1 = 10x1 + 3x(-1) = 7; Solution 2 = 12x1 + 2x(-2) = 8

Scheme 2: Solution 1 = 10x1 + 3x(0) = 10; Solution 2 = 12x1 + 2x(-2) = 8
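The arithmetic on this slide can be checked with a few lines of code. The sketch below is only an illustration: the function name `score_alignment` and its parameter names are assumptions, not part of the original slides. It scores two already-aligned strings column by column, using the match, mismatch and gap values from the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned, equal-length strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

# First scheme: match +1, mismatch -1, gap -2
print(score_alignment(*solution_1), score_alignment(*solution_2))
# Second scheme: match +1, mismatch 0, gap -2
print(score_alignment(*solution_1, mismatch=0), score_alignment(*solution_2, mismatch=0))
```

Under the first scheme Solution 2 wins (8 against 7); under the second, Solution 1 wins (10 against 8). The preferred alignment depends on the scoring scheme.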

Changing the values in the scoring scheme (matrix) can change the final score of a given aligned segment.

So how do we determine our scoring schemes?

The mechanistic rationale

What happens during DNA synthesis?

Biological causes of mismatches

Accumulation of mutations in a segment of the sequence that is less crucial for function can create a stretch of mismatches. (Any residue can be subject to back mutations.)

Very common.

Original sequence   ATTGCAGTGATCG
                    |||||   |||||
Emerging sequence   ATTGCGTCGATCG

Original sequence   ATTGCAGTGATCG
                    ||||| | |||||
Emerging sequence   ATTGCGGCGATCG

May reflect 2 or 4 independent mutations

Biological causes of gaps (indel: insertion / deletion)

• A single mutation can create a gap.

• Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.

• DNA slippage during replication can result in the repetition of a string.

• Retrovirus insertions.

• Translocations of DNA between chromosomes.

Less common than events leading to single mutations.

Are all gaps equal?

Consider the following pair of sequences:

A sequence with a short gap:

ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC

A sequence with a long gap:

ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
|||||||||||||||||||||||||||                    |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGXXXXXXXXXXXXXXXXXXXATTTAGTTCGCTC

Two options for gap scoring:

1. Keep the score the same regardless of gap length: use a zero gap extension penalty and penalize only when a gap is opened.

2. Make the penalty grow as a linear function of gap length: add a gap extension penalty. With a penalty that is purely proportional to length, several small gaps are penalized to the same extent as 1 large gap of the same total length.
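The two gap-scoring options can be written as one small function. This is a sketch under an assumed convention (opening cost plus a per-position extension cost for every gapped position); real programs differ on whether the first gapped position is also charged the extension cost. The name `gap_penalty` and the numeric values are illustrative only.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap spanning `length` positions.

    Assumed convention: one opening cost plus an extension cost for
    every gapped position (including the first).
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# Option 1: zero extension penalty, so a 1-base gap and a 20-base gap cost the same.
print(gap_penalty(1, gap_open=5, gap_extend=0), gap_penalty(20, gap_open=5, gap_extend=0))

# Option 2 with no opening cost: purely linear in gap length, so four
# 5-base gaps cost exactly as much as one 20-base gap.
print(4 * gap_penalty(5, gap_open=0, gap_extend=2), gap_penalty(20, gap_open=0, gap_extend=2))
```

Combining a non-zero opening cost with a smaller extension cost makes one long gap cheaper than several short ones, which fits the observation that a single event can insert or delete a long string.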


Gap penalties can penalize for:

•Gap opening

•Gap extension

•Gap ending (ClustalW – multiple alignment)

•Gap separation (minimum distance between 2 gaps) [ClustalW]


What happens to the alignment if we change the gap penalties?

Gap opening

Gap extension

How will global alignment be affected by:

• High penalties for gap opening

• High penalties for gap extension

Will local alignment be affected in the same way / to the same extent?

ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG

ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG

When comparing nucleotide or amino acid sequences, the comparison score is given by the carrot-and-stick method:

Reward: matches ( | )

Penalties: mismatches; gaps ( - ), with gap opening and gap extension

Allowances: minimal space between two gaps

So far, when nucleotide sequences were considered, all mismatches received the same (negative) score.

Ex: Pairwise alignments

43.2% identity; Global alignment score: 374

               10        20        30        40              50
alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
      : :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :.
beta  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
              10          20        30        40        50

           60        70        80        90       100       110
alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
      .::.:::::  :.....::.:.. .....::.::  ::.::: ::.::.. :. .:: :.
beta  KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
      60        70        80        90       100       110

          120       130       140
alpha PAEFTPAVHASLDKFLASVSTVLTSKYR
        :::: :.:. .: .:.:...:. ::.
beta  GKEFTPPVQAAYQKVVAGVANALAHKYH
     120       130       140

Pairwise alignment

Percent identity is not a good measure of alignment quality

100.000% identity in 3 aa overlap

SPA
:::
SPA
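A short sketch shows why percent identity alone can mislead. The function name `percent_identity` is an assumption, and so is the choice of denominator: here it is the number of alignment columns, gaps included, while other tools divide by the shorter sequence or by the ungapped columns only.

```python
def percent_identity(top, bottom):
    """Percent of alignment columns holding identical residues.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(top) != len(bottom) or not top:
        raise ValueError("expected two non-empty aligned strings of equal length")
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * identical / len(top)


# A 3-residue overlap is 100% identical, yet says almost nothing about relatedness.
print(percent_identity("SPA", "SPA"))
# A longer alignment with 12 identical columns out of 14 scores lower.
print(round(percent_identity("ATTGCAGT-GATCG", "ATTGC-GTCGATCG"), 1))
```

The length of the overlap and the alignment score need to be read together with the percentage.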

Pairwise alignments: alignment score

43.2% identity; Global alignment score: 374

               10        20        30        40              50
alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
      : :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :.
beta  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
              10          20        30        40        50

           60        70        80        90       100       110
alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
      .::.:::::  :.....::.:.. .....::.::  ::.::: ::.::.. :. .:: :.
beta  KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
      60        70        80        90       100       110

          120       130       140
alpha PAEFTPAVHASLDKFLASVSTVLTSKYR
        :::: :.:. .: .:.:...:. ::.
beta  GKEFTPPVQAAYQKVVAGVANALAHKYH
     120       130       140


Global alignment

An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing.

A short example

NLGPSTKDFGKISESREFDNQ
 |      ||||    |
QLNQLERSFGKINMRLEDALV


Local alignment

An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion.

Using the same sequences as above, one could get:

NLGPSTKDDFGKILGPSTKDDQ
         ||||
QNQLERSSNFGKINQLERSSNN
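The slides do not name an algorithm for either kind of alignment. A common way to compute both is dynamic programming: the Needleman-Wunsch recurrence for global alignment and the Smith-Waterman variant for local alignment. The sketch below is a minimal illustration under that assumption, using a simple linear gap penalty. The function name `align_score` and the default scores are illustrative, and only the best score is returned, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are paid for, so the borders accumulate gap costs.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a segment may start anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global reads the corner cell (end to end); local reads the best cell anywhere.
    return best if local else score[-1][-1]


print(align_score("TTTACGTTT", "GGGACGGGG"))              # global: forced end to end
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))  # local: only the shared ACG
```

For this pair the global score is negative because the flanking residues must be aligned too, while the local score reflects only the well-matching ACG segment.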

[Figure: applying GLOBAL versus applying LOCAL alignment. Panels: global alignment, local alignment. Labels: few mismatches, several mismatches.]

If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" one of the two copies if using local alignment, which presents only the best scoring alignment.

Emboss [best solution] vs. Lalign (Embnet) [several solutions]

Pairwise alignments: conservative substitutions

43.2% identity; Global alignment score: 374

               10        20        30        40              50
alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
      : :.: .:. : : ::::  .. : :.::: :... .: :. .:  : :::      :.
beta  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
              10          20        30        40        50

           60        70        80        90       100       110
alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
      .::.:::::  :.....::.:.. .....::.::  ::.::: ::.::.. :. .:: :.
beta  KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
      60        70        80        90       100       110

          120       130       140
alpha PAEFTPAVHASLDKFLASVSTVLTSKYR
        :::: :.:. .: .:.:...:. ::.
beta  GKEFTPPVQAAYQKVVAGVANALAHKYH
     120       130       140

However, in the case of amino acids, not all matches are equal, and not all mismatches are equal!

Amino acid properties

Serine (S) and threonine (T) have similar physicochemical properties.

Aspartic acid (D) and glutamic acid (E) have similar properties.

=> Substitution of S/T or E/D occurs relatively often during evolution.

=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities.

Non-polar hydrophobic

All other amino acids are polar, hydrophilic:

Acidic

Basic

All Amino Acids Are Equal…

http://teachline.ls.huji.ac.il/~72332/mouse/aa-properties.html

Each amino acid is characterized by a combination of features (size, charge, etc.).

The relative importance of each feature may vary according to the amino acid's role in the 3-D structure and function of the protein.

So how can we score matches and mismatches?

To that end, amino acid substitution matrices were developed (BLOSUM, PAM).

Amino Acid Substitution Matrices

The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other.

These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families.

These scoring systems have a probabilistic foundation.

PAM series - Percent Accepted Mutation (accepted by natural selection)

• All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (a total of 1,572 observed amino acid changes).

• PAM matrices are based on global sequence alignments - these include both highly conserved and highly mutable regions.

Some of the protein families are:
Ig kappa chain
Kappa casein
Lactalbumin
Hemoglobin
Myoglobin
Insulin
Histone H4
Ubiquitin

PAM series - Percent Accepted Mutation (accepted by natural selection)

* Varying degrees of conservation

Various degrees of conservation

• The PAM250 matrix is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, i.e. 250 mutations per 100 amino acids of sequence.

• Because of back mutations and silent mutations, this corresponds to sequences that are about 20 percent identical.

Smaller PAM number: less diversity between the compared sequences; better suited for more conserved sequences.

PAM1: 99% identity in sequences

Various degrees of conservation

The PAM1 matrix is calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.

Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.

All the PAM data come from closely related proteins (>85% amino acid identity).
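The extrapolation from PAM1 is usually described as repeated multiplication of the PAM1 mutation probability matrix by itself. The sketch below illustrates the idea on a deliberately tiny, hypothetical two-letter alphabet with a 1% change per step. The numbers are invented for illustration and are not Dayhoff's data, and the helper names `mat_mul` and `mat_power` are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, steps):
    """Multiply m by itself `steps` times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Hypothetical 2-letter "PAM1": each residue has a 1% chance of changing per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)

# Probability that a residue is still (or again) the same after 250 steps.
print(round(toy_pam250[0][0], 3))
```

Even after 250 changes per 100 positions, the probability of seeing the original residue does not fall to zero, because later changes can revert earlier ones. This is why PAM250 still corresponds to a measurable level of identity. Real PAM scoring matrices are then obtained by taking log-odds of such probabilities, which this sketch does not do.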

BLOSUM series - Blocks Substitution Matrix (Henikoff S. & Henikoff JG., PNAS, 1992)

A substitution matrix based on alignments in the BLOCKS database: conserved regions (blocks) of

• Families of proteins

• Family members have identical biochemical functions, and show common motifs

• Common blocks of local alignment not containing gaps

The BLOCKS database contains thousands of groups of multiple sequence alignments. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

Extracting probabilities from Blocks: example

A A C D A A
A A A D C R
D R C G N A
A N C N A R
C R K D A N
A A K N C R

Substitutions counted in column 1:
AA, AD, AA, AC, AA, AD, AA, AC, AA, DA, DC, DA, AC, AA, CA

6 AA (P(AA) = 6/15)
4 AD (P(AD) = 4/15)
4 AC
1 DC
…

Statistics of substitutions and log-odds computation as described for PAM.
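The pair counting for column 1 of the example block can be reproduced as follows. This is a sketch: the helper name `pair_counts` is an assumption, and pairs are treated as unordered, so AD and DA are counted together as on the slide.

```python
from collections import Counter
from itertools import combinations

# Column 1 of the example block, read top to bottom.
column = ["A", "A", "D", "A", "C", "A"]


def pair_counts(col):
    """Count every unordered pair of residues in one block column."""
    return Counter("".join(sorted(pair)) for pair in combinations(col, 2))


counts = pair_counts(column)
total = sum(counts.values())   # 6 sequences give 6*5/2 = 15 pairs
freqs = {pair: n / total for pair, n in counts.items()}

print(counts["AA"], counts["AD"], counts["AC"], counts["CD"], total)
```

The observed pair frequencies in `freqs` are the starting point for the log-odds scores. The log-odds step itself is not shown here.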


Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M  -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Blosum62 scoring matrix

Using an amino acid substitution matrix

Gap penalties (not included in this example) are treated as previously described.

[Figure: example alignment scored with the substitution matrix; two positions are labeled "match" and two are labeled "mismatch".]

Notice that different matches, and different mismatches, do not all have the same values.
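As a rough illustration of scoring with a substitution matrix, the sketch below uses a handful of entries copied from the Blosum62 matrix shown in these slides (S/S, T/T, S/T, D/D, E/E, D/E, W/W, S/W). The dictionary layout and the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are assumptions made for the example. Gaps are not handled, as in the slide.

```python
# A few Blosum62 entries, keyed by alphabetically ordered residue pairs.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("T", "T"): 5, ("S", "T"): 1,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("W", "W"): 11, ("S", "W"): -3,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up the score of one aligned pair; the matrix is symmetric."""
    return matrix[tuple(sorted((x, y)))]


def ungapped_score(top, bottom, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores over an ungapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(top, bottom))


print(ungapped_score("SDW", "SDW"))  # identities: 4 + 6 + 11
print(ungapped_score("SDW", "TEW"))  # conservative substitutions: 1 + 2 + 11
print(ungapped_score("SDS", "SDW"))  # a non-conservative S/W mismatch: 4 + 6 - 3
```

Identities differ in value (W/W scores far more than S/S), and conservative mismatches such as S/T or D/E still score positively.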


A substitution is more likely to occur between amino acids with similar biochemical properties.

The likelihood of a substitution is also affected by the degree of degeneracy of the genetic code for the different amino acids.


How do we choose the most appropriate scoring matrix?

• Blosum matrices are more commonly used than PAM matrices.

•The Blosum matrices are best for detecting local alignments.

•The Blosum62 matrix is the best for detecting the majority of weak protein similarities.

•The Blosum45 matrix is the best for detecting long and weak alignments.

Rat versus mouse RBP

Rat versus bacterial lipocalin

The following matrices are roughly equivalent:

PAM100   BLOSUM90
PAM120   BLOSUM80
PAM160   BLOSUM60
PAM200   BLOSUM52
PAM250   BLOSUM45


Limitations

• Substitution matrices do not take into account long range interactions between residues.

• They assume that identical residues are equal (whereas in real life a residue at the active site has other evolutionary constraints than the same residue outside of the active site)

• They assume evolution rate to be constant.


DNA Substitution Matrices

Purine – Purine

Pyrimidine - Pyrimidine

Purine – Pyrimidine

Pyrimidine - Purine
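One way to turn the purine/pyrimidine classes into a DNA scoring rule is to penalize substitutions within a class (purine to purine or pyrimidine to pyrimidine, usually called transitions) less than substitutions between classes (transversions). The sketch below assumes that design. The function name `dna_score` and the numeric values are hypothetical and are not taken from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, match=1, same_class=-1, cross_class=-2):
    """Score one nucleotide pair; the values are illustrative assumptions."""
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected one of A, C, G, T")
    if x == y:
        return match
    # Purine-purine or pyrimidine-pyrimidine: milder penalty than a class change.
    if (x in PURINES) == (y in PURINES):
        return same_class
    return cross_class


for pair in ("AA", "AG", "CT", "AC", "GT"):
    print(pair, dna_score(pair[0], pair[1]))
```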

Definitions

Conservation: The extent to which nucleotide or protein sequences are related. It can be evaluated by identity and similarity.

Identity ( | ): The extent to which two sequences are invariant.

Similarity ( . : ): Changes at a specific position of an amino acid sequence that preserve the physico-chemical properties of the original residue.


There are many ways to align two sequences.

Several ways to present the pairwise alignment

Do not blindly trust your alignment to be the only truth. In particular, gapped regions may be quite variable.

Sequences sharing less than 20% identity are difficult to align.


Dotplots: visual sequence comparison

1. Place two sequences along axes of plot

2. Place dot at grid points where two sequences have identical residues

3. Diagonals correspond to conserved regions
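The three steps above can be sketched in a few lines. This is a minimal text-only version. The function name `dotplot` and the choice of `*` as the dot character are assumptions, and real dotplot tools usually add a sliding window and a threshold to suppress noise.

```python
def dotplot(seq_x, seq_y):
    """Return one text row per residue of seq_y; '*' marks identical residues."""
    return ["".join("*" if x == y else " " for x in seq_x) for y in seq_y]


seq_x = "ATTGCAGTGATCG"   # along the horizontal axis
seq_y = "ATTGCGTCGATCG"   # along the vertical axis

print("  " + seq_x)
for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(residue + " " + row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree. A shift from one diagonal to a neighbouring one suggests an insertion or deletion.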