Lecture 6: Sequence Alignment. Bioinformatics. Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly. Aleppo University, Faculty of Technical Engineering, Department of Biotechnology. 2010-2011.
BIOINFORMATICS
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
2010-2011
Lecture 6: Sequence Alignment
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
GENE PREDICTION: METHODS
Gene Prediction can be based upon:
Coding statistics
Gene structure
Comparison
Statistical approach
Similarity-based approach
ALIGNMENT
Sequence alignment involves identifying the correct location of the deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.
Dynamic programming is the standard approach to sequence alignment.
Global alignment: optimizes the overall similarity of the two sequences
Local alignment: finds only relatively conserved subsequences
Pairwise alignment: the alignment of two sequences
Multiple alignment: the alignment of more than two sequences
METHODS OF ALIGNMENT:
1. DOT MATRIX
2. DISTANCE MATRIX
DOT PLOT ALGORITHM
Take two sequences (A & B); write sequence A out as a row (length = m) and sequence B as a column (length = n).
Create a table or “matrix” of “m” columns and “n” rows.
Compare each letter of sequence A with every letter in sequence B. If there is a match, mark it with a dot; if not, leave the cell blank.
DOT PLOT ALGORITHM
[Figure: example dot plot; row sequence A C D E F G H G; legend: "Complete identity", "X: Not Matched"]
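The dot plot steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `dot_plot` is made up here, and `*` is used instead of a dot only because it is easier to see in plain text.

```python
def dot_plot(seq_a, seq_b):
    """Return the dot matrix as text rows: one row per letter of seq_b (n rows),
    one column per letter of seq_a (m columns), with '*' wherever letters match."""
    return ["".join("*" if a == b else " " for a in seq_a) for b in seq_b]


# Compare a sequence with itself: the main diagonal shows complete identity,
# and off-diagonal marks reveal repeated letters (here the two G's).
seq = "ACDEFGHG"
rows = dot_plot(seq, seq)
print("  " + seq)
for letter, row in zip(seq, rows):
    print(letter + " " + row)
```

Every cell is compared once, so the work grows as m x n, which is why dot plots are used mainly for visual inspection of two sequences.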
DOT PLOTS & INTERNAL REPEATS
Advantages: Highlighting Information
The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.
SCORING MATRICES
Scoring matrices are created based on biological evidence.
To generalize scoring for nucleotide sequences, consider a (4+1) x (4+1) scoring matrix δ.
In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1).
The addition of 1 is to include the score for comparison with a gap character “-”.
SCORING MATRIX ELEMENTS
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
Perfect matches
Mismatches
Insertions & deletions (indels)
Score each position independently:
Match: +1
Mismatch: -1
Indel: -2
The score of an alignment is the sum of the position scores.
SCORING SCHEME
Example:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25

     A   G   C   T   -
A   +1  -1  -1  -1  -2
G   -1  +1  -1  -1  -2
C   -1  -1  +1  -1  -2
T   -1  -1  -1  +1  -2
-   -2  -2  -2  -2   *
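The position-by-position scoring can be checked with a short Python sketch. The function name `score_alignment` and its keyword defaults are illustrative choices; the values +1, -1 and -2 are the ones given in the scoring scheme above.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two aligned rows of equal length.
    A column containing the gap character '-' counts as an indel."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# First example alignment: 13 matches, 2 mismatches, 4 indels.
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```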
TRANSITION AND TRANSVERSION
Matrix example (transitions A↔G and C↔T score -1; transversions score -2):

     A   C   G   T
A   +3  -2  -1  -2
C   -2  +3  -2  -1
G   -1  -2  +3  -2
T   -2  -1  -2  +3
THE GLOBAL ALIGNMENT PROBLEM
Find the best alignment between two strings under a given scoring scheme.
Input: strings v and w and a scoring scheme
Output: alignment of maximum score
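Moves: vertical (↑) or horizontal (←) = -σ; diagonal = +1 if match, -μ if mismatch

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$

μ : mismatch penalty
σ : indel penalty

            W_{j-1}        W_j
V_{i-1}     s_{i-1,j-1}    s_{i-1,j}
V_i         s_{i,j-1}      s_{i,j}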
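The recurrence can be turned into a table-filling routine. The sketch below is illustrative only. The slide does not state the boundary conditions, so the first row and column are assumed to be filled with cumulative indel penalties (the usual choice for global alignment). The defaults `mu=1` and `sigma=2` mirror the match/mismatch/indel scheme used earlier in the lecture.

```python
def global_alignment_score(v, w, mu=1, sigma=2):
    """Best global alignment score of v and w: match +1, mismatch -mu, indel -sigma."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning a prefix against nothing costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal,               # match or mismatch
                          s[i - 1][j] - sigma,    # gap in w
                          s[i][j - 1] - sigma)    # gap in v
    return s[n][m]


print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

Calling it with `mu=float("inf")` and `sigma=0` forbids mismatches and makes indels free, which reduces the problem to the longest common subsequence.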
LONGEST COMMON SUBSEQUENCES – PRACTICE 1
• V = ATCTGAT
• W = TGCATA
• Mismatches are not allowed (μ = ∞, i.e. a mismatch scores -∞)
• No indel penalties (σ = 0)
• Matches are rewarded with +1

         T  G  C  A  T  A
      0  0  0  0  0  0  0
   A  0
   T  0
   C  0
   T  0
   G  0
   A  0
   T  0
LONGEST COMMON SUBSEQUENCES – PRACTICE 2 to 9
The table is filled one row at a time; each practice step adds the row marked on the right.

         T  G  C  A  T  A
      0  0  0  0  0  0  0
   A  0  0  0  0  1  1  1    (Practice 2)
   T  0  1  1  1  1  2  2    (Practice 3)
   C  0  1  1  2  2  2  2    (Practice 4)
   T  0  1  1  2  2  3  3    (Practice 5)
   G  0  1  2  2  2  3  3    (Practice 6)
   A  0  1  2  2  3  3  4    (Practice 7)
   T  0  1  2  2  3  4  4    (Practice 8, unchanged in Practice 9)
LONGEST COMMON SUBSEQUENCES – PRACTICE 10

         T  G  C  A  T  A
      0  0  0  0  0  0  0
   A  0  0  0  0  1  1  1
   T  0  1  1  1  1  2  2
   C  0  1  1  2  2  2  2
   T  0  1  1  2  2  3  3
   G  0  1  2  2  2  3  3
   A  0  1  2  2  3  3  4
   T  0  1  2  2  3  4  4

• Computing similarity: s(V,W) = 4
• Computing distance: d(V,W) = n + m - 2 s(V,W) = 7 + 6 - 2x4 = 5
LONGEST COMMON SUBSEQUENCES – PRACTICE 10
• Alignment:
  W: – T G C A T – A –
  V: A T – C – T G A T
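The practice table and one optimal alignment can be reproduced with the sketch below. The function names `lcs_table` and `traceback` are illustrative. Several alignments reach the same score, so the traceback may return a different but equally good alignment than the one on the slide.

```python
def lcs_table(v, w):
    """Fill the LCS table: match +1, indels free, mismatches forbidden."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])          # free indel moves
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)     # diagonal match
            s[i][j] = best
    return s


def traceback(s, v, w):
    """Walk back from the bottom-right cell and return the rows (w_row, v_row)."""
    i, j = len(v), len(w)
    w_row, v_row = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            w_row.append(w[j - 1]); v_row.append(v[i - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j]:
            w_row.append("-"); v_row.append(v[i - 1]); i -= 1
        else:
            w_row.append(w[j - 1]); v_row.append("-"); j -= 1
    return "".join(reversed(w_row)), "".join(reversed(v_row))


V, W = "ATCTGAT", "TGCATA"
table = lcs_table(V, W)
similarity = table[len(V)][len(W)]
distance = len(V) + len(W) - 2 * similarity
print(similarity, distance)
print(*traceback(table, V, W), sep="\n")
```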
PROTEIN SUBSTITUTION MATRIX
Identity Scoring Matrix
Percent Accepted Mutation (PAM)
Blocks Substitution Matrix (BLOSUM)
IDENTITY SCORING MATRIX

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  1
R  0  1
N  0  0  1
D  0  0  0  1
C  0  0  0  0  1
Q  0  0  0  0  0  1
E  0  0  0  0  0  0  1
G  0  0  0  0  0  0  0  1
H  0  0  0  0  0  0  0  0  1
I  0  0  0  0  0  0  0  0  0  1
L  0  0  0  0  0  0  0  0  0  0  1
K  0  0  0  0  0  0  0  0  0  0  0  1
M  0  0  0  0  0  0  0  0  0  0  0  0  1
F  0  0  0  0  0  0  0  0  0  0  0  0  0  1
P  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
S  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
T  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
W  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
Y  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
V  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
PERCENT ACCEPTED MUTATION (PAM)
o 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues.
o The PAM250 matrix assumes (is optimized for) sequences separated by 250 PAM, i.e. 250 substitutions per 100 amino acids (a longer evolutionary time). Because the same position can change more than once, such sequences still share some identical residues.
o To derive a mutational probability matrix for a protein sequence that has undergone N percent accepted mutations (a PAM-N matrix), the PAM-1 matrix is raised to the N-th power, i.e. multiplied by itself repeatedly.
o PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
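The "multiply PAM-1 by itself" step can be illustrated with a toy example. The 3 x 3 matrix below is NOT a real PAM-1 matrix; it is a made-up three-letter alphabet in which each residue has a 1% chance of changing per PAM unit. It is used only to show how the probabilities drift as N grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical toy "PAM-1": each row sums to 1, with 99% chance of no change.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal of the toy matrix is far below 0.99 but still above the off-diagonal entries, so some signal of the original residue remains even at large PAM distances.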
SELECTING A PAM MATRIX
Low PAM numbers: short sequences, strong local similarities.
High PAM numbers: long sequences, weak similarities.
PAM60 for close relations (60% identity)
PAM120 recommended for general use (40% identity)
PAM250 for distant relations (20% identity)
If uncertain, try several different matrices; PAM40, PAM120 and PAM250 are recommended.
A BETTER MATRIX - PAM250

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  2
R -2  6
N  0  0  2
D  0 -1  2  4
C -2 -4 -4 -5  4
Q  0  1  1  2 -5  4
E  0 -1  1  3 -5  2  4
G  1 -3  0  1 -3 -1  0  5
H -1  2  2  1 -3  3  1 -2  6
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
BLOSUM: BLOCKS SUBSTITUTION MATRIX
Based on the BLOCKS database
~2000 blocks from 500 families of related proteins
Families of proteins with identical function
Blocks are short conserved patterns, 3 to 60 amino acids long, without gaps
Each block represents a sequence alignment with a different identity percentage
[Figure: example block of aligned sequences written in a four-letter toy alphabet (A, B, C, D)]
BLOSUM MATRICES
For each block, the amino acid substitution rates were calculated to create the BLOSUM matrix.
Different BLOSUMn matrices are calculated independently from BLOCKS.
BLOSUMn is built from blocks in which sequences sharing at least n percent identity are clustered and counted as a single sequence.
BLOSUM62 represents closer sequences than BLOSUM45.
SELECTING A BLOSUM MATRIX
For BLOSUMn, a higher n is suitable for sequences that are more similar.
BLOSUM62 recommended for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
EQUIVALENT PAM AND BLOSUM MATRICES
The following matrices are roughly equivalent...
• PAM100  BLOSUM90  (less divergent)
• PAM120  BLOSUM80
• PAM160  BLOSUM60
• PAM200  BLOSUM52
• PAM250  BLOSUM45  (more divergent)
Generally speaking...
• The BLOSUM matrices are best for detecting local alignments.
• The BLOSUM62 matrix is the best for detecting the majority of weak protein similarities.
• The BLOSUM45 matrix is the best for detecting long and weak alignments.
BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights
Rare amino acids have high weights
Positive for more likely substitution
Negative for less likely substitution
ALIGNMENT SCORE
Using BLOSUM62:

…PQG…
…PQG…    7 + 5 + 6 = 18

…PQG…
…PEG…    7 + 2 + 6 = 15

…PQG…
…PQA…    7 + 5 + 0 = 12
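The three sums can be reproduced with a small lookup. The dictionary below holds only the handful of BLOSUM62 entries needed for these examples, copied from the matrix in the slides. The names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are illustrative.

```python
# Only the BLOSUM62 entries needed for the PQG examples (from the slide matrix).
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("E", "Q"): 2,
    ("G", "A"): 0,
}


def pair_score(x, y, table=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score; raises KeyError if the pair is missing."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


def ungapped_score(s1, s2, table=BLOSUM62_SUBSET):
    """Score two equal-length, gap-free segments column by column."""
    if len(s1) != len(s2):
        raise ValueError("segments must have the same length")
    return sum(pair_score(x, y, table) for x, y in zip(s1, s2))


for other in ("PQG", "PEG", "PQA"):
    print("PQG", other, ungapped_score("PQG", other))
```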
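AFFINE GAP PENALTIES
In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events.
Normal scoring would give the same score for both alignments.
[Figure: two alignments with the same number of indels. The one with a single contiguous gap is labeled "This is more likely"; the one with scattered single gaps is labeled "This is less likely".]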
ACCOUNTING FOR GAPS
Gaps: contiguous sequences of spaces in one of the rows.
The score for a gap of length x is -(ρ + σx), where ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty.
ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.
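A quick numeric check shows why the affine score favors one long gap over several short ones. The values ρ = 10 and σ = 1 are hypothetical, chosen only so that ρ is large relative to σ.

```python
def gap_score(length, rho, sigma):
    """Affine gap score -(rho + sigma * length) for one contiguous gap."""
    return -(rho + sigma * length)


RHO, SIGMA = 10, 1  # hypothetical opening and extension penalties

one_long_gap = gap_score(3, RHO, SIGMA)          # a single gap of length 3
three_short_gaps = 3 * gap_score(1, RHO, SIGMA)  # three separate gaps of length 1

print(one_long_gap, three_short_gaps)  # -13 -33
```

Both cases contain three gap characters, but the opening penalty is paid once in the first case and three times in the second.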
MULTIPLE SEQUENCE ALIGNMENT
1. All sequences are compared to each other (pairwise alignments).
2. A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).
3. The final multiple alignment is carried out, using the dendrogram as a guide.
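Step 1 and the start of step 2 can be sketched as follows. This is a simplified illustration, not how any particular alignment program works: it swaps in the LCS-based distance d = n + m - 2s from the practice slides as the pairwise measure, and the three sequences and their names are made up. The closest pair is the one a guide tree would join first.

```python
from itertools import combinations


def lcs_length(v, w):
    """Length of the longest common subsequence, keeping one table row at a time."""
    prev = [0] * (len(w) + 1)
    for a in v:
        cur = [0]
        for j, b in enumerate(w, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def pairwise_distances(seqs):
    """Distance d = n + m - 2 * LCS for every pair of named sequences."""
    return {
        (a, b): len(seqs[a]) + len(seqs[b]) - 2 * lcs_length(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }


# Hypothetical input sequences.
seqs = {"s1": "ATCTGAT", "s2": "TGCATA", "s3": "ATCTGAA"}
dist = pairwise_distances(seqs)
closest = min(dist, key=dist.get)  # first pair to be joined in the guide tree
print(dist)
print("join first:", closest)
```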
APPLICATIONS OF MULTIPLE ALIGNMENTS
THANK YOU