CONCEPT OF SEQUENCE COMPARISON
Natapol Pornputtapong
18 January 2018
SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE
“Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution.”
— Wikipedia —
Pravech Ajawatanawong, Faculty of Science, Mahidol University
COMPARING SEQUENCES
• cornerstone in sequence analysis
• aims for identification of sequence relatedness
• ONLY “homologous sequences” (derived from the same ancestor) can be compared
• homologous sequences usually (but not always) have similar functions and similar sequences
HOMOLOGY IN STRUCTURES
• structures can have similar shapes for two reasons: homology or homoplasy
• homology = the structures share the same ancestor
• homoplasy = the structures are similar but are not derived from the same ancestor
HOMOLOGY IN SEQUENCES
ancestral sequence: ACTGTACTCGCATCG
species A:          ACTATACTCTCATTG
species B:          ACTGTTCTCCCATCA
DEGREES OF HOMOLOGY
Homology is qualitative!
• Paralog: homologous genes that diverged from each other after a gene duplication
• Ortholog: homologous genes that diverged from a single ancestral gene after a speciation event
• Xenolog: homologous genes acquired via Horizontal Gene Transfer (HGT)
Koonin (2005) Annu. Rev. Genet.
SEQUENCE ALIGNMENT
ACTATACTCTCATTG
ACTGTTCTCCCATCA
DOT PLOT
sequence 1: ACCTCGTGCA
sequence 2: ACTTAGTCCA
[Figure: dot plot of Sequence 1 against Sequence 2]
sequence 1 ACCT-CGTGC-A
           || |  || | |
sequence 2 AC-TTAGT-CCA
DOT PLOT
• too many dots (high background) = no information
• How can we handle this problem?
[Figure: dot plot of seq_1 against seq_2]
GENERAL PARAMETERS FOR DOT PLOT
• Window size = length of the subsequences compared at each position
• Window sliding = step by which the window moves along the sequences
• Threshold or mismatch = cut-off for drawing a dot (normally a similarity score is used as the cut-off)
TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTG
TAAATGGCAGTACAGCTGTTAGGCCCATCGTGGCTAAGATCAGGCTCCAAATAGGTCCAGTTCCC
[Figure: windows marked along the two sequences, with window similarities of 70%, 70% and 80%]
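The three parameters above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular dot plot program: the function names `dot_plot` and `render`, and the default values, are assumptions made for this example. It uses the two short sequences from the first dot plot slide.

```python
def dot_plot(seq1, seq2, window=1, step=1, threshold=1.0):
    """Return (i, j) start positions of window pairs that pass the cut-off.

    window    : length of the subsequences compared at each position
    step      : how far the window slides between comparisons
    threshold : minimum fraction of identical positions needed to draw a dot
    """
    dots = []
    for i in range(0, len(seq1) - window + 1, step):
        for j in range(0, len(seq2) - window + 1, step):
            matches = sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))
            if matches / window >= threshold:
                dots.append((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the dots as text: seq1 across the top, seq2 down the side."""
    marked = set(dots)
    lines = ["  " + seq1]
    for j, base in enumerate(seq2):
        row = "".join("*" if (i, j) in marked else "." for i in range(len(seq1)))
        lines.append(base + " " + row)
    return "\n".join(lines)


seq1 = "ACCTCGTGCA"
seq2 = "ACTTAGTCCA"
noisy = dot_plot(seq1, seq2, window=1, threshold=1.0)      # every single-base match
cleaner = dot_plot(seq1, seq2, window=3, threshold=2 / 3)  # at least 2 of 3 bases match
print(render(seq1, seq2, noisy))
print(len(noisy), "dots with window=1;", len(cleaner), "dots with window=3")
```

Raising the window size or the threshold removes isolated chance matches, which is how the background is reduced.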
PRACTICAL HINTS FOR DOT PLOT
• a window of 10-20 residues is a good place to start
• when comparing very large sequences, larger windows (>30 to about 100 residues) may be useful
• a good practical rule is to make plots that have 3–5 times as many dots as the length of the sequences (e.g., 3000-5000 dots for a 1000-base sequence)
DOT PLOT
[Figure: dot plot of Sequence 1 against Sequence 2; horizontal offsets between diagonal segments mark indels]
INTERPRETATION OF DOT PLOT (1)
highly similar
• single diagonal line
• needs noise (or background) reduction
[Figure: dot plot of sequence 1 against sequence 2]
INTERPRETATION OF DOT PLOT (2)
domain identification
[Figure: dot plot of sequence 1 against sequence 2]
EXON AND INTRON
http://myhits.isb-sib.ch/util/dotlet
INTERPRETATION OF DOT PLOT (3)
inversion
[Figure: dot plot of sequence 1 against sequence 2]
INTERPRETATION OF DOT PLOT (4)
repeat
[Figure: dot plot of sequence 1 against sequence 2]
REPEATED PROTEIN DOMAINS
http://myhits.isb-sib.ch/util/dotlet
INTERPRETATION OF DOT PLOT (5)
palindromic sequence
[Figure: dot plot of sequence 1 against sequence 2]
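A small sketch of what "palindromic" means for DNA, assuming the usual definition that a sequence is palindromic when it equals its own reverse complement. The helper names `reverse_complement`, `is_palindromic` and `find_palindromes` are made up for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Complement every base and read the result backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic(seq):
    """A DNA palindrome reads the same on both strands (5' to 3')."""
    return seq == reverse_complement(seq)


def find_palindromes(seq, length):
    """Return start positions of every palindromic window of the given length."""
    return [i for i in range(len(seq) - length + 1)
            if is_palindromic(seq[i:i + length])]


print(is_palindromic("GAATTC"))           # EcoRI recognition site
print(find_palindromes("TTGAATTCAA", 6))  # window start positions
```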
TERMINATORS AND OTHER STEM-LOOP STRUCTURES
http://myhits.isb-sib.ch/util/dotlet
INTERPRETATION OF DOT PLOT (6)
low complexity regions, e.g. AAAAAAAAAAAAAA
[Figure: dot plot of sequence 1 against sequence 2]
LOW-COMPLEXITY REGIONS
Plasmodium falciparum serine-repeat antigen protein precursor
http://myhits.isb-sib.ch/util/dotlet
GAPS IN ALIGNMENT
• gaps do not exist in nature; they are introduced by the alignment
• gaps make the comparison difficult
• a gap in a sequence alignment most likely represents an indel
• the accuracy of the alignment determines the accuracy of the inferred indels
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
gap ~ indel (insertion/deletion)
SCORING PAIRWISE SEQUENCE ALIGNMENT FOR DNA SEQUENCES
• the easiest method to score is match scoring: count the matching positions
seq1 ATTCGTCGTAGCTAGGCTAA
     ||| | |||| |  || |||
seq2 ATTGGCCGTACCATGGATAA
match = 14 positions
∴ similarity score = 14
• Normalized score: divide the number of matches by the alignment length
seq1 ATTCGTCGTAGCTAGGCTAA
     ||| | |||| |  || |||
seq2 ATTGGCCGTACCATGGATAA
match = 14 positions
mismatch = 6 positions
total length = 20 positions
∴ similarity score = 14/20 = 70%
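Both scores can be reproduced with a few lines of Python. This is a minimal sketch for ungapped sequences of equal length; the function names `match_score` and `normalized_score` are chosen for this example only.

```python
def match_score(a, b):
    """Count the positions at which two equal-length sequences are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


def normalized_score(a, b):
    """Match score expressed as a percentage of the alignment length."""
    return 100.0 * match_score(a, b) / len(a)


seq1 = "ATTCGTCGTAGCTAGGCTAA"
seq2 = "ATTGGCCGTACCATGGATAA"
print(match_score(seq1, seq2))       # 14
print(normalized_score(seq1, seq2))  # 70.0
```

The normalized form makes scores comparable between alignments of different lengths, which the raw match count does not.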
SCORING PAIRWISE SEQUENCE ALIGNMENT FOR PROTEIN SEQUENCES
• idea: replacing an amino acid with one that has similar physicochemical properties is unlikely to change the structure of the protein
MAATPTVLLFWKLLDEVFMA
||+|| || ||||+|||||| 90% similarity
MAVTPLVLFFWKLVDEVFMA

MAATPTVLLFWKLLDEVFMA
|| || || |||| |||||| 80% identity
MAVTPLVLFFWKLVDEVFMA
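The difference between the two percentages can be sketched as follows. Note the assumption: instead of a real substitution matrix, this example uses a hand-made grouping of amino acids by physicochemical property (`SIMILAR_GROUPS`), which is hypothetical and chosen only so that the slide's example can be reproduced. In practice similarity is taken from a matrix such as PAM or BLOSUM.

```python
# Hypothetical property groups, for illustration only (not a real matrix).
SIMILAR_GROUPS = ["AVLI", "FWY", "KRH", "DE", "STNQ"]
GROUP_OF = {aa: index for index, group in enumerate(SIMILAR_GROUPS) for aa in group}


def percent_identity(a, b):
    """Percentage of aligned positions with identical residues."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(x, y):
    """Identical residues, or residues from the same property group."""
    if x == y:
        return True
    return x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y)


def percent_similarity(a, b):
    """Percentage of aligned positions with identical or similar residues."""
    return 100.0 * sum(is_similar(x, y) for x, y in zip(a, b)) / len(a)


p1 = "MAATPTVLLFWKLLDEVFMA"
p2 = "MAVTPLVLFFWKLVDEVFMA"
print(percent_identity(p1, p2))    # 80.0
print(percent_similarity(p1, p2))  # 90.0
```

With this grouping the A/V and L/V pairs count as similar (the "+" positions on the slide), while T/L and L/F do not.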
CONFUSING TERMS
Identity
• proportion of pairs of identical characters between 2 sequences
• strongly depends on how two sequences are aligned
Similarity
• proportion of pairs of similar characters between 2 sequences
• similarity is determined by substitution matrix
• strongly depends on how two sequences are aligned and matrix used
Homology
• two sequences are homologs if they share a common ancestor
• homology cannot be scored: two sequences are either homologous or not
ALIGNMENT EVENT AND MUTATION EVENT
• Match -> no mutation
• Mismatch -> substitution
• Gap -> insertion/deletion (InDel)
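The mapping from alignment columns to mutation events can be written down directly. A minimal sketch, assuming gaps are written as "-" and using the gapped alignment from the first dot plot slide; the function name `classify_columns` is made up for this example.

```python
from collections import Counter


def classify_columns(a, b):
    """Label every column of a pairwise alignment with its mutation event."""
    events = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            events.append("indel")         # gap -> insertion/deletion
        elif x == y:
            events.append("no mutation")   # match
        else:
            events.append("substitution")  # mismatch
    return events


aligned1 = "ACCT-CGTGC-A"
aligned2 = "AC-TTAGT-CCA"
counts = Counter(classify_columns(aligned1, aligned2))
print(counts)
```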
SUBSTITUTION MUTATION IN DNA
original DNA seq.   TAC CTG AGC CAA CTA
                    Tyr Leu Ser Gln Leu
nonsense mutation   TAC CTG AGC TAA CTA
                    Tyr Leu Ser STOP
silent mutation     TAC CTC AGC CAA CTA
                    Tyr Leu Ser Gln Leu
missense mutation   TAC CTG CGC CAA CTA
                    Tyr Leu Arg Gln Leu
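The three kinds of substitution can be told apart by translating both sequences and comparing the result. The sketch below is deliberately limited: `CODONS` contains only the codons that occur in the slide's example (it is not the full genetic code), and the function names are assumptions for this illustration.

```python
# Partial codon table: only the codons used in the example above.
CODONS = {
    "TAC": "Tyr", "CTG": "Leu", "CTC": "Leu", "CTA": "Leu",
    "AGC": "Ser", "CGC": "Arg", "CAA": "Gln", "TAA": "Stop",
}


def translate(dna):
    """Translate a coding sequence codon by codon."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]


def classify_substitution(original, mutant):
    """Classify a mutant as silent, missense or nonsense."""
    for before, after in zip(translate(original), translate(mutant)):
        if before != after:
            return "nonsense" if after == "Stop" else "missense"
    return "silent" if original != mutant else "no mutation"


original = "TACCTGAGCCAACTA"
print(classify_substitution(original, "TACCTGAGCTAACTA"))  # CAA -> TAA
print(classify_substitution(original, "TACCTCAGCCAACTA"))  # CTG -> CTC
print(classify_substitution(original, "TACCTGCGCCAACTA"))  # AGC -> CGC
```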
NUCLEOTIDE SUBSTITUTION
• sequences that share a common ancestor will gradually diverge
• direct observation of this process is very difficult
• sequence divergence = proportion (p) of nucleotide sites at which two sequences differ
ACTGTACTCGCATCG
ACTATACTCTCATTG
ACTGTTCTCCCATCA
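The proportion p can be computed directly for the three sequences shown. A minimal sketch for ungapped, equal-length sequences; the names `p_distance`, `s1`, `s2` and `s3` are chosen for this example.

```python
def p_distance(a, b):
    """Proportion of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


s1 = "ACTGTACTCGCATCG"
s2 = "ACTATACTCTCATTG"
s3 = "ACTGTTCTCCCATCA"
print(p_distance(s1, s2))  # 3 of 15 sites differ
print(p_distance(s1, s3))  # 3 of 15 sites differ
print(p_distance(s2, s3))  # 5 of 15 sites differ
```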
EMPIRICAL STUDIES OF AMINO ACID SUBSTITUTION
• several studies have observed amino acid substitutions; the results show that amino acid substitution is not random
• amino acids with similar chemical properties substitute for each other more often
• some amino acids (e.g., cysteine, glycine and tryptophan) are rarely changed
POINT ACCEPTED MUTATION (PAM)
• proposed in 1978 by Margaret Oakley Dayhoff
• the first substitution matrix for amino acid changes
• one PAM is a unit of evolutionary divergence in which 1% of amino acids have been changed
• if there were no selection for fitness (impossible!!), substitution would be one of the main factors driving protein sequence change
• in observations of related protein sequences, the frequencies of amino acid substitutions are biased toward changes that maintain the function of the protein
• these are the point mutations that have been “accepted” during evolution
PAM 250 MATRIX
• the PAM1 matrix was constructed from observed amino acid changes in closely related proteins
• the PAM1 data were then extrapolated to PAM250
• only PAM250 was published by Dayhoff et al. (1978)
• higher PAM matrices are good for highly divergent sequences; lower PAM matrices are good for conserved sequences
BIOINFORMATICS: A Practical Guide to the Analysis of Genes and Proteins
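The extrapolation step can be illustrated with a toy example. The PAM1 mutation probability matrix is multiplied by itself 250 times to model 250 PAM units of divergence. The matrix below is not real PAM data: it is a hypothetical two-letter alphabet in which each letter has a 1% chance of changing per PAM unit, used only to show the mechanics.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1": 99% chance a residue is unchanged, 1% chance it is replaced.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
print(pam250_toy[0][0])  # probability a residue is the same after 250 PAM units
```

Because changes can hit the same site repeatedly or revert, 250 PAM units do not mean that 250% of residues differ; in this toy case the sequences approach the point where the two letters are about equally likely.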
BLOSUM MATRIX
• built from observed amino acid changes, using a different strategy from the PAM matrix construction
• sequence data are derived from the BLOCKS database
• unlike PAM, which used closely related sequences, BLOSUM used more distantly related sequences
• BLOSUM62 matrix (the first BLOSUM matrix): sequences having at least 62% identity are merged into a single sequence
• a higher BLOSUM matrix (e.g., BLOSUM90) is good for comparing very similar sequences; a lower BLOSUM matrix (e.g., BLOSUM30) is for highly divergent sequences
BLOSUM 62 MATRIX
BIOINFORMATICS: A Practical Guide to the Analysis of Genes and Proteins
SUGGESTED USES FOR COMMON SUBSTITUTION MATRICES
Menlove, Clement, and Crandall: Similarity Searching Using BLAST
GAP PENALTY
• assumption: indels are rare events
• gap opening = penalty applied when a gap is introduced into the alignment
• gap extension = penalty for each additional position that lengthens a gap, normally counted from the second position of the gap
CCGTATCGTCTATCTACGTGCACTGAT
CCCAATCTTCAATCTACG---TCTGAT
first gap position = gap opening; following gap positions = gap extension
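The opening/extension scheme can be sketched as a scoring function for an existing alignment. The parameter values (match +1, mismatch -1, gap opening -5, gap extension -1) are arbitrary assumptions for this example, and the function name `score_alignment` is made up here.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score a pairwise alignment with separate gap opening and extension penalties.

    Simplification: a gap in one sequence followed directly by a gap in the
    other is treated as a single continuing gap.
    """
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # first gap position pays the opening penalty, later ones the extension
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


top = "CCGTATCGTCTATCTACGTGCACTGAT"
bottom = "CCCAATCTTCAATCTACG---TCTGAT"
print(score_alignment(top, bottom))  # 19 matches, 5 mismatches, one 3-position gap
```

With these values the three-position gap costs -5 + 2 × (-1) = -7, so one long gap is cheaper than three separate single-position gaps (-15), which reflects the assumption that indels are rare events.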
DYNAMIC PROGRAMMING
Sean R Eddy Nature Biotechnology 22, 909 - 910 (2004)
BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL
Wishard, Introduction to Bioinformatics A theoretical and Practical Approach
PAIRWISE SEQUENCE ALIGNMENT
• aims to compare 2 sequences
• global alignment: finds the best alignment of two sequences across their entire length
• local alignment: finds the highly similar region(s) between two sequences
• overlapping alignment: global alignment of two sequences of different sizes
GLOBAL ALIGNMENT
• end-to-end alignment
• may end up with a lot of gaps in the alignment if the 2 sequences differ greatly in size
• not sensitive to the modular nature of proteins
• very sensitive to gap penalties (gap opening and gap extension)
• Needleman-Wunsch algorithm (1970)
5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
   |||||||||||    |||||||  |||||||||||||| |||||||
5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'
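A compact sketch of the Needleman-Wunsch idea in Python. It is a simplified variant: it uses a single linear gap penalty instead of separate opening and extension penalties, and the scoring values (match +1, mismatch -1, gap -2) are assumptions for this example. When several alignments are equally good, it returns one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # aligning a prefix of a against nothing
        score[i][0] = i * gap
    for j in range(1, cols):          # aligning a prefix of b against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch(
    "ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA",
    "ACTACTAGATTACGGATCGTACTTTAGAGGCTAGCAACCA")
print(best)
print(row1)
print(row2)
```

Because the whole length of both sequences must be aligned, the result changes noticeably when the gap penalty is changed, which is the sensitivity mentioned above.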
LOCAL ALIGNMENT
• finds local regions with high level of similarity
• more sensitive to the modular nature of proteins
• can be used to search databases
• Smith-Waterman algorithm (1981)
Global Alignment
ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA
|||||||||||    |||||||  |||||||||||||| |||||||
ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA

Local Alignment
ACTACTAGATT
|||||||||||
ACTACTAGATT

ACGGATC
|||||||
ACGGATC

GTACTTTAGAGGCTTGCAACCA
|||||||||||||| |||||||
GTACTTTAGAGGCTAGCAACCA
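A minimal sketch of the Smith-Waterman idea, again with a simplified linear gap penalty and assumed scoring values (match +2, mismatch -1, gap -2). The difference from global alignment is that cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell, so only the best local region is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                        # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_score, local1, local2 = smith_waterman(
    "ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA",
    "ACTACTAGATTACGGATCGTACTTTAGAGGCTAGCAACCA")
print(local_score)
print(local1)
print(local2)
```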
MULTIPLE SEQUENCE ALIGNMENT
PROBLEM OF USING PAIRWISE ALIGNMENT
• good for comparing only two sequences
• alignment results are hard to understand and interpret when there are more than 2 sequences
• less evolutionary meaning

ATGCTAGTAAGC
ATTCAA-T--GC
-TTCTAGC--GC

ATGCTAGTAAGC
ATTCAA-T--GC

ATTCAA-TGC
-TTCTAGCGC

ATGCTAGTAAGC
-TTCTAGC--GC
MULTIPLE SEQUENCE ALIGNMENT (MSA)
• most useful object in sequence analysis
• in the mid-1980s, MSAs were generated by hand because dynamic programming was (at that time) too slow when applied to more than 3 sequences
• idea: arrange the homologous residues (nucleotides or amino acids) in the same column
• provides more biological information than pairwise sequence alignment
MSA METHODS
• Exact method
• Progressive methods: Clustal, MUSCLE
• Iterative methods: MAFFT
• Consistency based methods: T-Coffee, ProbCons
• Structure based methods: 3D-Coffee
Multiple sequence alignment methods
MSA METHODS
Sviatopolk-Mirsky Pais et al. (2014) Algorithms for Molecular Biology
PROGRESSIVE ALIGNMENT
[Figure: progressive alignment, labelled “dynamic programming”]
THE CLUSTAL SERIES
• Clustal was first published by Higgins and Sharp in 1988; ClustalW was published by Thompson et al. in 1994
• ClustalW, ClustalX
• the older Clustal programs are obsolete, but their algorithm is good for understanding MSA
• algorithm: generate a guide tree, then do a progressive alignment based on that guide tree
• Latest: Clustal Omega
MUSCLE ALIGNMENT PROGRAM
• MUltiple Sequence Comparison by Log-Expectation (MUSCLE)
• was published by Edgar RC in 2004
• step I: progressive alignment
• step II: improve progressive alignment
• step III: refinement
• very easy command line
• improved speed and accuracy (based on SP method)
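Assuming "SP" on the slide refers to the sum-of-pairs score, a common way to score a multiple alignment, the idea can be sketched as follows: every column is scored by summing the pairwise scores of all residue pairs in it. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) and the function name `sum_of_pairs` are assumptions for this example, not MUSCLE's actual parameters.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # two gaps facing each other score 0
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


msa = ["ATGCTAGTAAGC",
       "ATTCAA-T--GC",
       "-TTCTAGC--GC"]
print(sum_of_pairs(msa))
```

A refinement step can then be judged by whether it raises this score for the alignment as a whole.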
MUSCLE ALIGNMENT PROGRAM
CHOOSING THE RIGHT MSA PROGRAM
Chagoyen M (2013) Sequence Analysis and Structure Prediction Service.
QUESTIONS?