Page 1: Bioinformatics

BIOINFORMATICS

Dr. Aladdin Hamwieh
Khalid Al-shamaa
Abdulqader Jighly

2010-2011

Lecture 6
Sequence Alignment

Aleppo University
Faculty of technical engineering
Department of Biotechnology

Page 2: Bioinformatics

GENE PREDICTION: METHODS

Gene Prediction can be based upon:

Coding statistics

Gene structure

Comparison

Statistical approach

Similarity-based approach

Page 4: Bioinformatics

ALIGNMENT

• Sequence alignment involves identifying the correct locations of the deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

• Dynamic programming is the standard approach to sequence alignment.

• Global alignment: optimizes the overall similarity of the two sequences.

• Local alignment: finds only relatively conserved subsequences.

• Pairwise alignment: an alignment between two sequences.

• Multiple alignment: an alignment among more than two sequences.

Page 5: Bioinformatics

METHODS OF ALIGNMENT:

1. DOT MATRIX
2. DISTANCE MATRIX

Page 6: Bioinformatics

DOT PLOT ALGORITHM

• Take two sequences (A and B). Write sequence A out as a row (length = m) and sequence B as a column (length = n).

• Create a table or "matrix" of m columns and n rows.

• Compare each letter of sequence A with every letter of sequence B. If there is a match, mark it with a dot; if not, leave the cell blank.
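The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `render` and the example sequence are assumptions, not part of the lecture.

```python
def dot_plot(seq_a, seq_b):
    """Return the dot matrix: one row per letter of seq_b, one column per letter of seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text: '.' marks a match, a blank marks a mismatch."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(b + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example: a sequence compared with itself gives a full diagonal.
    print(render("ACDEFGHG", "ACDEFGHG"))
```

Comparing a sequence with itself always produces an unbroken main diagonal; off-diagonal dots then point to repeated letters or repeated segments.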

Page 7: Bioinformatics

DOT PLOT ALGORITHM

[Figure: example dot-plot matrix for two short sequences over the letters A, C, D, E, F, G, H. Legend: "X" = not matched; "Complete identity".]

Page 8: Bioinformatics

DOT PLOTS & INTERNAL REPEATS

Page 9: Bioinformatics

Advantages: Highlighting Information

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

Page 10: Bioinformatics

Advantages: Highlighting Information

The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Page 11: Bioinformatics

SCORING MATRICES

• Scoring matrices are created based on biological evidence.

• To generalize scoring for nucleotide sequences, consider a (4+1) x (4+1) scoring matrix δ.

• For an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1).

• The additional 1 accounts for the score of a comparison with the gap character "-".

Page 12: Bioinformatics

SCORING MATRIX ELEMENTS

Input: two sequences over the same alphabet
Output: an alignment of the two sequences

Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

A possible alignment:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Three elements:
• Perfect matches
• Mismatches
• Insertions and deletions (indels)

Page 13: Bioinformatics

SCORING SCHEME

Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2

The score of an alignment is the sum of its position scores.

Example:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25

     A   G   C   T   -
A   +1  -1  -1  -1  -2
G   -1  +1  -1  -1  -2
C   -1  -1  +1  -1  -2
T   -1  -1  -1  +1  -2
-   -2  -2  -2  -2   *
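The position-by-position scoring can be checked with a short Python sketch. The function name `score_alignment` is an assumption for illustration; the scores +1, -1 and -2 are the ones given on this slide, and a column with a gap in both rows is rejected because the matrix leaves that cell undefined (*).

```python
MATCH, MISMATCH, INDEL = 1, -1, -2  # scores taken from the slide


def score_alignment(row1, row2):
    """Sum the column scores of a pairwise alignment written as two gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += INDEL
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


if __name__ == "__main__":
    print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```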

Page 14: Bioinformatics

TRANSITION AND TRANSVERSION

Matrix example:

     A   C   G   T
A   +3  -2  -1  -2
C   -2  +3  -2  -1
G   -1  -2  +3  -2
T   -2  -1  -2  +3
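The matrix follows a simple rule: a transition (purine to purine, A/G, or pyrimidine to pyrimidine, C/T) is penalized less than a transversion (purine to pyrimidine or the reverse). A minimal sketch, assuming the values +3, -1 and -2 from the example above; the function name `substitution_score` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def substitution_score(x, y, match=3, transition=-1, transversion=-2):
    """Score a pair of bases: identical, transition (same class) or transversion."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Rebuild the 4 x 4 matrix from the rule.
matrix = {x: {y: substitution_score(x, y) for y in BASES} for x in BASES}

if __name__ == "__main__":
    for x in BASES:
        print(x, [matrix[x][y] for y in BASES])
```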

Page 15: Bioinformatics

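THE GLOBAL ALIGNMENT PROBLEM

Find the best alignment between two strings under a given scoring scheme.

Input: strings v and w and a scoring scheme
Output: an alignment of maximum score

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$

μ: mismatch penalty
σ: indel penalty

[Figure: the four neighbouring cells s_{i-1,j-1}, s_{i-1,j}, s_{i,j-1} and s_{i,j}, with columns labelled w_{j-1}, w_j and rows labelled v_{i-1}, v_i. The vertical and horizontal arrows (↑, ←) score -σ; the diagonal arrow scores +1 for a match and -μ for a mismatch.]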
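The recurrence can be turned into a small dynamic-programming routine. The sketch below is an illustration under stated assumptions: the name `global_alignment` is hypothetical, the default penalties μ = 1 and σ = 2 are borrowed from the scoring scheme earlier in the lecture, and the first row and column are initialized to -σ·i and -σ·j (the slide does not state the boundary condition).

```python
def global_alignment(v, w, mu=1, sigma=2):
    """Fill the table s[i][j] and trace back one optimal global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # assumed boundary: leading gaps cost sigma each
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j

    def diag(i, j):
        # Diagonal move: +1 for a match, -mu for a mismatch.
        return s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(diag(i, j), s[i - 1][j] - sigma, s[i][j - 1] - sigma)

    # Traceback from the bottom-right cell to the origin.
    row_v, row_w = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == diag(i, j):
            row_v.append(v[i - 1]); row_w.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            row_v.append(v[i - 1]); row_w.append("-"); i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(row_v)), "".join(reversed(row_w))


if __name__ == "__main__":
    score, top, bottom = global_alignment("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA")
    print(score)
    print(top)
    print(bottom)
```

Several alignments can share the maximum score, so the traceback returns one of them; which one depends on the order in which the three moves are tested.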

Page 16: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 1

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0
T  0
C  0
T  0
G  0
A  0
T  0

• V = ATCTGAT
• W = TGCATA
• Mismatches are not allowed (mismatch penalty μ = ∞, i.e. a mismatch scores -∞)
• No indel penalties (σ = 0)
• Matches are rewarded with +1

Page 17: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 2

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0
C  0
T  0
G  0
A  0
T  0

Page 18: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 3

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0
T  0
G  0
A  0
T  0

Page 19: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 4

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0
G  0
A  0
T  0

Page 20: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 5

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0
A  0
T  0

Page 21: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 6

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0
T  0

Page 22: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 7

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
T  0

Page 23: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 8

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
T  0  1  2  2  3  4  4

Page 24: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 9

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
T  0  1  2  2  3  4  4

Page 25: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 10

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
T  0  1  2  2  3  4  4

• Computing similarity: s(V,W) = 4
• Computing distance: d(V,W) = n + m - 2 s(V,W) = 7 + 6 - 2 x 4 = 5
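The table filled in over the practice slides can be reproduced with a few lines of Python. This is a sketch under the stated rules (match +1, no indel penalty, mismatches forbidden); the function name `lcs_table` is an assumption, and W is taken as TGCATA, the six letters shown in the table header.

```python
def lcs_table(v, w):
    """Fill the longest-common-subsequence table: rows follow v, columns follow w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])      # indel moves cost nothing
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)  # diagonal move only on a match
            s[i][j] = best
    return s


V, W = "ATCTGAT", "TGCATA"
table = lcs_table(V, W)
similarity = table[len(V)][len(W)]
distance = len(V) + len(W) - 2 * similarity

if __name__ == "__main__":
    for row in table:
        print(row)
    print(similarity, distance)  # 4 5
```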

Page 26: Bioinformatics

LONGEST COMMON SUBSEQUENCES – PRACTICE 10

• Alignment:

W:  -  T  G  C  A  T  -  A  -
V:  A  T  -  C  -  T  G  A  T

      T  G  C  A  T  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
T  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
T  0  1  1  2  2  3  3
G  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
T  0  1  2  2  3  4  4
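An alignment like the one above is recovered by walking back through the filled table from the bottom-right corner. The sketch below is illustrative only (the name `lcs_alignment` is hypothetical, and W is taken as TGCATA to match the table header); because several paths can be optimal, it may return a different but equally good alignment from the one on the slide.

```python
def lcs_alignment(v, w):
    """Return (row_w, row_v): one alignment with the maximum number of matches."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    row_v, row_w = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            # Diagonal move: the two letters match.
            row_v.append(v[i - 1]); row_w.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j]:
            # Vertical move: letter of v against a gap.
            row_v.append(v[i - 1]); row_w.append("-"); i -= 1
        else:
            # Horizontal move: letter of w against a gap.
            row_v.append("-"); row_w.append(w[j - 1]); j -= 1
    return "".join(reversed(row_w)), "".join(reversed(row_v))


if __name__ == "__main__":
    top, bottom = lcs_alignment("ATCTGAT", "TGCATA")
    print(top)
    print(bottom)
```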

Page 27: Bioinformatics

PROTEIN SUBSTITUTION MATRIX

• Identity Scoring Matrix
• Percent Accepted Mutation (PAM)
• Blocks Substitution Matrix (BLOSUM)

Page 28: Bioinformatics

IDENTITY SCORING MATRIX

   A R N D C Q E G H I L K M F P S T W Y V
A  1
R  0 1
N  0 0 1
D  0 0 0 1
C  0 0 0 0 1
Q  0 0 0 0 0 1
E  0 0 0 0 0 0 1
G  0 0 0 0 0 0 0 1
H  0 0 0 0 0 0 0 0 1
I  0 0 0 0 0 0 0 0 0 1
L  0 0 0 0 0 0 0 0 0 0 1
K  0 0 0 0 0 0 0 0 0 0 0 1
M  0 0 0 0 0 0 0 0 0 0 0 0 1
F  0 0 0 0 0 0 0 0 0 0 0 0 0 1
P  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
S  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
T  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
W  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Y  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
V  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Page 29: Bioinformatics

PERCENT ACCEPTED MUTATION (PAM)

o 1 PAM is the amount of evolutionary change that yields, on average, one substitution per 100 amino acid residues.

o The PAM250 matrix is optimized for sequences separated by 250 PAM, i.e. 250 accepted substitutions per 100 amino acids, which corresponds to a longer evolutionary time. A site can change more than once, so such sequences still share some identical residues.

o To derive the mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (the PAM-N matrix), the PAM-1 matrix is raised to the N-th power.

o PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
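The step from PAM-1 to PAM-N is a repeated matrix multiplication. The sketch below illustrates it on a made-up three-letter alphabet: `TOY_PAM1` is a hypothetical matrix in which every residue has a 1% chance of changing per step, not the real 20 x 20 Dayhoff PAM-1 matrix, and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM-1-like matrix: each row sums to 1, 99% chance of no change.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print([round(p, 4) for p in row])
```

Each row of the result still sums to 1, and the diagonal stays above the off-diagonal entries, which illustrates why sequences 250 PAM apart are not expected to be 0% identical. Scoring matrices such as PAM250 are then derived from these probabilities; that step is not shown here.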

Page 30: Bioinformatics

SELECTING A PAM MATRIX

• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
• PAM60 for close relations (60% identity)
• PAM120 recommended for general use (40% identity)
• PAM250 for distant relations (20% identity)

If uncertain, try several different matrices: PAM40, PAM120 and PAM250 are recommended.

Page 31: Bioinformatics

A BETTER MATRIX - PAM250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Page 32: Bioinformatics

BLOSUM: BLOCKS SUBSTITUTION MATRIX

• Based on the BLOCKS database
• ~2000 blocks from 500 families of related proteins
• Families are groups of proteins with identical function
• Blocks are short conserved patterns, 3 to 60 amino acids long, without gaps
• Each block represents a sequence alignment with a different identity percentage

[Figure: schematic of aligned sequences containing conserved blocks]

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Page 33: Bioinformatics

BLOSUM MATRICES

• For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix.
• The different BLOSUMn matrices are calculated independently from BLOCKS.
• BLOSUMn is built after clustering sequences that share at least n percent identity, so that each cluster counts as a single sequence.
• BLOSUM62 represents closer sequences than BLOSUM45.

Page 34: Bioinformatics

SELECTING A BLOSUM MATRIX

• For BLOSUMn, a higher n is suitable for sequences that are more similar
• BLOSUM62 recommended for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations

Page 35: Bioinformatics

EQUIVALENT PAM AND BLOSUM MATRICES

The following matrices are roughly equivalent:

• PAM100 ≈ Blosum90 (less divergent)
• PAM120 ≈ Blosum80
• PAM160 ≈ Blosum60
• PAM200 ≈ Blosum52
• PAM250 ≈ Blosum45 (more divergent)

Generally speaking:

• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.

Page 36: Bioinformatics

BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

• Common amino acids have low weights
• Rare amino acids have high weights

Page 37: Bioinformatics

BLOSUM62

Positive for more likely substitution

Page 38: Bioinformatics

BLOSUM62

Negative for less likely substitution

Page 39: Bioinformatics

ALIGNMENT SCORE

Scores are read from the BLOSUM62 matrix (Page 36):

..PQG..
..PQG..   7 + 5 + 6 = 18

..PQG..
..PEG..   7 + 2 + 6 = 15

..PQG..
..PQA..   7 + 5 + 0 = 12
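The three sums above can be reproduced by looking up each aligned pair in the matrix. The sketch below stores only the five BLOSUM62 entries needed for these examples, copied from the matrix on Page 36; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are assumptions for illustration.

```python
# Only the entries needed for the three examples, copied from the BLOSUM62 slide.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("E", "Q"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score; raise KeyError if the pair is not stored."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of two equal-length, gap-free aligned segments."""
    if len(seq1) != len(seq2):
        raise ValueError("segments must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(ungapped_score("PQG", "PQG"))  # 18
    print(ungapped_score("PQG", "PEG"))  # 15
    print(ungapped_score("PQG", "PQA"))  # 12
```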

Page 40: Bioinformatics

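AFFINE GAP PENALTIES

• In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events.

• Normal scoring would give the same score to both alignments.

[Figure: two alignments with the same number of indels. Captions: "This is more likely" (the indels form one contiguous gap) and "This is less likely" (the indels are scattered as separate events).]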

Page 41: Bioinformatics

ACCOUNTING FOR GAPS

• Gap: a contiguous sequence of spaces in one of the rows.

• The score for a gap of length x is -(ρ + σx), where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty.

• ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.
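A short numeric sketch shows why the affine formula favours one long gap over several short ones. The values ρ = 10 and σ = 1 and the function name `gap_score` are assumptions chosen for illustration, not values from the lecture.

```python
def gap_score(length, rho=10, sigma=1):
    """Affine score of a single gap of the given length: -(rho + sigma * length)."""
    if length < 1:
        raise ValueError("a gap has length at least 1")
    return -(rho + sigma * length)


# One gap of length 3 versus three separate gaps of length 1.
one_long_gap = gap_score(3)
three_short_gaps = 3 * gap_score(1)

if __name__ == "__main__":
    print(one_long_gap)       # -13
    print(three_short_gaps)   # -33
```

With a purely linear penalty (ρ = 0) the two cases would score the same, which is the problem the affine scheme is meant to address.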

Page 42: Bioinformatics

MULTIPLE SEQUENCE ALIGNMENT

1. All sequences are compared to each other (pairwise alignments).

2. A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).

3. The final multiple alignment is carried out, using the dendrogram as a guide.
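Steps 1 and 2 can be sketched as follows. This is a simplified illustration, not the procedure of any particular program: pairwise distances use the LCS-based distance d = n + m - 2s from earlier in the lecture, the dendrogram is built by average-linkage clustering (an assumed choice), and the four sequences and all function names are hypothetical. Step 3, the final alignment along the tree, is not implemented here.

```python
from itertools import combinations


def lcs_length(v, w):
    """Length of the longest common subsequence, keeping only two table rows."""
    prev = [0] * (len(w) + 1)
    for a in v:
        cur = [0]
        for j, b in enumerate(w, start=1):
            best = max(prev[j], cur[j - 1])
            if a == b:
                best = max(best, prev[j - 1] + 1)
            cur.append(best)
        prev = cur
    return prev[-1]


def distance(v, w):
    """Step 1: pairwise distance d = n + m - 2 s."""
    return len(v) + len(w) - 2 * lcs_length(v, w)


def leaves(tree):
    """List the sequence names below a tree node (a name or a pair of subtrees)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def guide_tree(seqs):
    """Step 2: repeatedly join the two closest clusters; returns nested tuples of names."""
    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    clusters = list(seqs)
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1, c2))
    return clusters[0]


# Hypothetical input sequences.
SEQS = {"s1": "ATCTGAT", "s2": "TGCATA", "s3": "ATCTGAA", "s4": "TGCTTA"}

if __name__ == "__main__":
    print(guide_tree(SEQS))
```

The nested tuples give the order in which a progressive aligner would merge sequences: the closest pairs are aligned first, and the resulting groups are aligned to each other afterwards.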

Page 43: Bioinformatics

APPLICATIONS OF MULTIPLE ALIGNMENTS

Page 44: Bioinformatics

THANK YOU