Upload
sydney-brooks
View
215
Download
0
Embed Size (px)
Citation preview
Pairwise Alignment
How do we tell whether two sequences are
similar?
BIO520 Bioinformatics Jim Lund
Assigned reading:Ch 4.1-4.7, Ch 5.1, get what you can out of 5.2, 5.4
Pairwise alignment
• DNA:DNA
• polypeptide:polypeptide
The BASIC Sequence Analysis Operation
Alignments
• Pairwise sequence alignments
–One-to-One
–One-to-Database• Multiple sequence alignments
–Many-to-Many
Origins of Sequence Similarity
• Homology– common evolutionary descent
• Chance– Short similar segments are very
common.
• Similarity in function– Convergence (very rare)
Visual sequence comparison: Dotplot
Visual sequence comparison: Filtered dotplot
4 bp window, 75% identity cutoff
Visual sequence comparison: Dotplot
4 bp windw, 75% identity cutoff
Dotplots of sequence rearrangements
Assessing similarity
GAACAAT||||||| 7/7 OR 100%GAACAAT
GAACAAT | 1/7 or 14%GAACAAT
Which is BETTER?How do we SCORE?
Similarity
GAACAAT||||||| 7/7 OR 100%GAACAAT
GAACAAT||| ||| 6/7 OR 84%GAATAAT
MISMATCH
Mismatches
GAACAAT||| ||| 6/7 OR 84%GAATAAT
GAACAAT||| ||| 6/7 OR 84%GAAGAAT
Terminal Mismatch
GAACAATttttt ||| |||aaaccGAATAAT 6/7 OR 84%
INDELS
GAAgCAAT||| |||| 7/7 OR 100%GAA*CAAT
Indels, cont’d
GAAgCAAT||| ||||GAA*CAAT
GAAggggCAAT||| ||||GAA****CAAT
Similarity Scoring
Common Method: • Terminal mismatches (0)• Match score (1)• Mismatch penalty (-3)• Gap penalty (-1)• Gap extension penalty (-1)
DNA Defaults
DNA Scoring
GGGGGGAGAA
|||||*|*|| 8(1)+2(-3)=22GGGGGAAAAAGGGGG
GGGGGGAGAA--GGG
|||||*|*|| ||| 11(1)+2(-3)+1(-1)+1(-1)=33GGGGGAAAAAGGGGG
Absurdity of Low Gap Penalty
GATCGCTACGCTCAGC A.C.C..C..T
Perfect similarity,Every time!
Sequence alignment algorithms
• Local alignment– Smith-Waterman
• Global alignment– Needleman-Wunsch
Alignment Programs
• Local alignment (Smith-Waterman)– BLAST (simplified Smith-Waterman)
– FASTA (simplified Smith-Waterman)
– BESTFIT (GCG program)
• Global alignment (Needleman-Wunsch)– GAP
Local vs. global alignment
10 gaggc 15 ||||| 3 gaggc 7
1 gggggaaaaagtggccccc 19 || |||| ||1 gggggttttttttgtggtttcc 22
Global alignment: alignment of the full length of the sequences
Local alignment: alignment of regions of substantial similarity
Local vs. global alignment
BLAST Algorithm
Look for local alignment, a High Scoring Pair (HSP)• Finding word (W) in query and subject. Score > T.• Extend local alignment until score reaches
maximum-X.• Keep High Scoring Segment Pairs (HSPs) with
scores > S.• Find multiple HSPs per query if present• Expectation value (E value) using Karlin-Altschul
stats
BLAST statistical significance: assessing the likelihood a match
occurs by chance
Karlin-Altschul statistic:E = k m N exp(-Lambda S)
m = Size of query seqeunceN = Size of databasek = Search space scaling parameterLambda = scoring scaling parameterS = BLAST HSP score
Low E -> good match
BLAST statistical significance:
Rule of thumb for a good match:
•Nucleotide match•E < 1e-6•Identity > 70%
•Protein match•E < 1e-3•Identity > 25%
Protein Similarity Scoring
• Identity - Easy• WEAK Alignments• Chemical Similarity
– L vs I, K vs R…
• Evolutionary Similarity–How do proteins evolve?–How do we infer similarities?
BLOSUM62
C S T P A G N D C 9 -1 -1 -3 0 -3 -3 -3 S -1 4 1 -1 1 0 1 0 T -1 1 4 1 -1 1 0 1 P -3 -1 1 7 -1 -2 -1 -1 A 0 1 -1 -1 4 0 -1 -2 G -3 0 1 -2 0 6 -2 -1 N -3 1 0 -2 -2 0 6 1 D -3 0 1 -1 -2 -1 1 6
Single-base evolution changes the encoded
AACAU=HCAU=H
CAC=H CGU=R UAU=Y
CAA=Q CCU=P GAU=D
CAG=Q CUU=L AAU=N
Substitution Matrices
Two main classes:
• PAM-Dayhoff
• BLOSUM-Henikoff
PAM-Dayhoff
• Built from closed related proteins, substitutions constrained by evolution and function
• “accepted” by evolution (Point Accepted Mutation=PAM)
• 1 PAM::1% divergence• PAM120=closely related proteins
• PAM250=divergent proteins
BLOSUM-Henikoff&Henikoff
• Built from ungapped alignments in proteins: “BLOCKS”
• Merge blocks at given % similar to one sequence
• Calculate “target” frequencies
• BLOSUM62=62% similar blocks– good general purpose
• BLOSUM30– Detects weak similarities, used for distantly related proteins
BLOSUM62
C S T P A G N D C 9 -1 -1 -3 0 -3 -3 -3 S -1 4 1 -1 1 0 1 0 T -1 1 4 1 -1 1 0 1 P -3 -1 1 7 -1 -2 -1 -1 A 0 1 -1 -1 4 0 -1 -2 G -3 0 1 -2 0 6 -2 -1 N -3 1 0 -2 -2 0 6 1 D -3 0 1 -1 -2 -1 1 6
Gapped alignments
• No general theory for significance of matches!!
• G+L(n) – indel mutations rare
– variation in gap length “easy”, G > L
Real Alignments
Phylogeny
1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50 ||||||||||||| |||||||||||||||||||| ||||||||||||||| 1 MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL 50 . . . . . 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100 |.| ||||||||||||||||||||||||||||||||. ||:||| |||| 51 KSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKI 100 . . . . . 101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150 |||||||||:||| || .||| ||||||| |||||||||||||||.|| | 101 PVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKEL 150
151 GFHG 154 || | 151 GFQG 154
Cow-to-Pig Protein
Cow-to-Pig cDNA 1 CAGCTGTCGGAGACAGACACCCAGTCAGTCCCGCCCTTGTTCTTTTTCTC 50 | ||| ||| || | ||||| |||| ||| |||||| 1 .......CAGAGCCAGGACACCCAGTACGCCCGCACTTGCTCTGTTTCTC 43 . . . . . 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100 |||| ||||||| |||||||||||||||||||||||||||||| |||||| 44 TTCTGCAGACTGTGCCATGGGGCTCAGCGACGGGGAATGGCAGCTGGTGC 93 . . . . . 101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150 |||| | ||||||||||||||||||||||||||||||||||||||||||| 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143 . . . . . 151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200 ||||||||||||||||| | ||||| ||||||||||||||||||||||| 144 GTCCTCATCAGGCTCTTTAAGGGTCACCCCGAGACCCTGGAGAAATTTGA 193 . . . . . 201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250 |||||| |||||||||||| |||||| ||||||||||||||| ||||||| 194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243
80% Identity (88% at aa!)
DNA similarity reflects polypeptide similarity
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150 |||| | ||||||||||||||||||||||||||||||||||||||||||| 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143
501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550 | ||||||||| |||||||||||| ||||||||||| | | || | 494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543 . . . . .
Coding vs Non-coding Regions
451 CAGGCTGCCATGAGCAAGGCCCTGGAACTGTTCCGGAATGACATGGCTGC 500 |||| ||||||||||||||||||||||| |||||||| |||||||| || 444 CAGGGAGCCATGAGCAAGGCCCTGGAACTCTTCCGGAACGACATGGCGGC 493 . . . . . 501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550 | ||||||||| |||||||||||| ||||||||||| | | || | 494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543 . . . . . 551 CAC.CCCACCCACCTGGG...........CAGGGTGGGCGGGGACTGAAT 588 | | |||| |||| |||| | || ||| ||| ||||| 544 CCCACCCATCCACTTGGGCCAGGGCCCCCCGCGGAGGGTGGGCGCTGAAG 593 . . . . . 589 CCCAAGTAGTTATAGGGTTTGCTTCTGAGTGTGTGCTTTGTTTAGGAGAG 638 | | |||| | |||||||||||||||||||| ||||||||| | ||||| 594 CTCCTGTAGCTGTAGGGTTTGCTTCTGAGTGT.TGCTTTGTTCATGAGAG 642 . . . . . 639 GTGGGTGGAAGAGGTGGATGGGTTAGGGGTGGAGG............... 673 |||||||| ||||||||| ||| | | ||||| || 643 GTGGGTGGGAGAGGTGGAGGGGCTGGTGGTGGTGGTGGGGGGGTGTTCAG 692
90% in coding (70% in non-coding)
Third Base of Codon is Hypervariable
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250 ||||||*||||||||||||*||||||*|||||||||||||||*||||||| 194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243 . . . . . 251 TGAAGAAGCATGGCAACACGGTGCTCACGGCCCTGGGGGGTATCCTGAAG 300 ||||||||||*||||||||||||||*||*|||||||||||*|||||*||| 244 TGAAGAAGCACGGCAACACGGTGCTGACTGCCCTGGGGGGCATCCTTAAG 293
Cow-to-Fish Protein
1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50 :. :|| || .||| | || || |||| |||||. | || : 1 ....MADFDMVLKCWGPMEADHATHGSLVLTRLFTEHPETLKLFPKFAGI 46 . . . . . 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100 :: . || ||| || :|| :| | | .| |. ||| |||| 47 .AHGDLAGDAGVSAHGATVLNKLGDLLKARGAHAALLKPLSSSHATKHKI 95 . . . . . 101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150 |: . |.: | |: | | | | |: : : || | || | 96 PIINFKLIAEVIGKVMEEKAGLD..AAGQTALRNVMAIIITDMEADYKEL 143
151 GFHG 154 || 144 GFTE 147
42% identity, 51% similarity
Cow-to-Fish DNA
32 .ACAGGACATTTTACTACTCTGCAGATAATGGCTGACTTTGACATGGTAC 80 | | | | | | || | | || | | |||| | 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100 . . . . . 81 TGAAGTGCTGGGGTCCAATGGAGGCGGACCACGCAACCCACGGGAGTCTG 130 |||| |||||| ||||||| || |||| ||| ||| | 101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150 . . . . . 131 GTGCTGACCCGTTTATTCACAGAGCACCCAGAAACCCTAAAGTTATTCCC 180 || || | | | | ||||||| || || || ||||| || ||| 151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200 . . . . . 181 CAAGTTTGCTGGC...ATCGCCCATGGGGACCTGGCCGGGGATGCAGGTG 227 |||||| | | | | | || || | | | 201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
48% similarity
Protein vs. DNAAlignments
• Polypeptide similarity > DNA• Coding DNA > Non-coding
• 3rd base of codon hypervariable• Moderate Distance poor DNA similarity
Rules of Thumb
• DNA-DNA similarities– 50% significant if “long”
– E < 1e-6, 70% identity
• Protein-protein similarities– 80% end-end: same structure, same function
– 30% over domain, similar function, structure overall similar
– 15-30% “twilight zone”
– Short, strong match…could be a “motif”
Basic BLAST Family
• BLASTN– DNA to DNA database
• BLASTP– protein to protein database
• TBLASTN– DNA (translated) to protein database
• BLASTX– protein to DNA database (translated)
• TBLASTX– DNA (translated) to DNA database (translated)
DNA Databases
• nr (non-redundantish merge of Genbank, EMBL, etc…)– EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS
• est (expressed sequence tags)• htgs (high throughput genome seq.)• gss (genome survey sequence)• vector, yeast, ecoli, mito• chromosome (complete genomes)• And more
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
Protein Databases
• nr (non-redundant Swiss-prot, PIR, PDF, PDB, Genbank CDS)
• swissprot
• ecoli, yeast, fly
• month
• And more
BLAST Input
• Program
• Database
• Options - see more
• Sequence– FASTA
– gi or accession#
BLAST Options
• Algorithm and output options– # descriptions, # alignments returned– Probability cutoff– Strand
• Alignment parameters– Scoring Matrix
• PAM30, PAM70, BLOSUM45, BLOSUM62BLOSUM62, BLOSUM80, BLOSUM80
– Filter (low complexity) PPPPP->XXXXX
Extended BLAST Family
• Gapped Blast (default)Gapped Blast (default)• PSI-Blast (Position-specific iterated
blast)– “self” generated scoring matrix
• PHI BLAST (motif plus BLAST)• BLAST2 client (align two seqs)
• megablast (genomic sequence)• rpsblast (search for domains)