Advanced Bioinformatics (MB480/580)
>Sulfolobus virus 1 complete genome 15465 bp.
TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGTACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAG
Learn How to:
● Assemble a genome and predict its:
- ORFs
- Promoters
● Annotate genome:
- Predict protein functions
- Model them if possible
- Re-design them if possible
● Predict functions by inference from a large amount of unrelated data
● Predict ncRNAs
● High-throughput methods and data interpretation
● Prepare the data for presentations & publications
What is Bioinformatics?
• Choices:
– The analysis of biological molecules using computers and statistical techniques
• TRUE
– The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research
• also TRUE, but suits Computational Biology better
More definitions
• The collection, organization and analysis of large amounts of biological data, using networks of computers and databases.
• The process of developing tools and processes to quantify and collect data to study biological systems logically.
• The science of informatics as applied to biological research.
Yet more definitions
• Mark Gerstein’s definition:
– Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
– The manuscript breaking down each part of the above statement will be e-mailed.
– http://wiki.bioinformatics.org/Bioinformatics_FAQ
The important stuff
• Bioinformatics brings together biological data from genome research with the theory and tools of mathematics, computer science and artificial intelligence.
• Bioinformatics includes any application of computer technology and information science to:
– Gather, organize, store and handle data.
– Analyze, interpret and spread data.
– Predict biological structure and function.
What is the information in Molecular Biology?
• Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype
• Molecules
– Sequence, Structure, Function
• Processes
– Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
– Standardized
– Statistical
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Some catalytic activity
• Most cellular functions are performed or facilitated by proteins.
• Primary biocatalyst
• Cofactor transport/storage
• Mechanical motion/support
• Immune protection
• Control of growth/differentiation
This slide is courtesy of Mark Gerstein
Language of biology is not easy to understand
• Just like in spoken language, some words look very different but have the same meaning (car and automobile are synonyms; sequences of distantly related proteins are synonyms)
• Some words look or sound very similar yet have different meaning (complement and compliment; eminent and imminent; allude and elude; decent and descent are homophones; GAG and TAG codons are homophones)
• In spoken language, we came up with the rules ourselves, which is why we can usually trace words back to their origins
• How do we trace the origins of Nature’s language?
Why is Bioinformatics important?
• Supports experimental work
– In some cases, it provides complementary data
• More importantly, guides experimental work
– Predictions based on data
– Extension of experiments in new directions
• To be believable, Bioinformatics predictions have to be verifiable
– Statistical significance, or some other kind of significance score
When did Bioinformatics begin?
• 10-15 years ago?
– This is a common assumption
• Bioinformatics existed even back in the 1970s
– It went by a different name
– It was underused because the amount of biological sequence data was small
Bioinformatics and Genome Biology
• The revolution driving enormous development in Bioinformatics and experimental sciences came from whole genome sequencing
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.(Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 50 K genes...
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)
What can we infer from sequence using Bioinformatics?
Expressed?
• cellular function • physiological function • substrate binding sites • protein-protein interfaces
• activity • specificity • docking • localisation
DNA
ORF
Protein
Active protein
Domains = smallest functional/structural subunits
3D structure
Function
Make sense of subtle differences
[Waterston et al. Nature 2002]
- About 90% of the mouse and human genomes are in syntenic blocks.
What’s in the genome?
• If we are so much alike in terms of genome, why are we so much different?
– Large variation in human population
– Similar genes and similar genome organization between human and chimp (or even human and mouse), yet large phenotypic difference
• The importance of non-coding parts of our genome became more obvious
– Non-coding, regulatory RNAs
– Binding sites for regulatory proteins
– Other possibilities that are not obvious right now
Complexity of biological information
1. Finding regulatory motifs in DNA
2. Increasing the speed and reliability of functional annotation from sequence
The more we know, the better?
So we have a genome sequence …>Sulfolobus virus 1 complete genome 15465 bp.TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGCGGAATACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGGAGGGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGTAAACAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGAAGAAGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAGACTAACGGCTTTGAAAGTGCGATTATTTTCGGGAAACAAGGTACGGGAAAGACTACTTACGCCCTTAAGGTGGCAAAAGAAGTTTACCAGAGATTAGGACATGAACCGGACAAGGCATGGGAACTGGCCCTTGACTCTTTATTCTTTGAGCTTAAAGATGCATTGAGGATAATGAAAATATTCAGGCAAAATGATAGGACAATACCAATAATAATTTTCGACGATGCTGGGATATGGCTTCAAAAATATTTATGGTATAAGGAAGAGATGATAAAGTTTTACCGTATATATAACATTATTAGGAATATAGTAAGCGGGGTGATCTTCACTACCCCTTCCCCTAACGATATAGCGTTTTATGTGAGGGAAAAGGGGTGGAAGCTGATAATGATAACGAGAAACGGAAGACAACCTGACGGTACGCCAAAGGCAGTAGCTAAAATAGCGGTGAATAAGATAACGATTATAAAAGGAAAAATAACAAATAAGATGAAATGGAGGACAGTAGACGATTATACGGTCAAGCTTCCGGATTGGGTATATAAAGAATATGTGGAAAGAAGAAAGGTTTATGAGGAAAAATTGTTGGAGGAGTTGGATGAGGTTTTAGATAGTGATAACAAAACGGAAAACCCGTCAAACCCATCACTACTAACGAAAATTGACGACGTAACAAGATAGTGATACGGGTAATGTCAGACCCCTTTTAGCCATTCCGCATACTTTTTATATTGCTCTTTCGCTATGCCGAAGAGCGATACGTAATGTTGCGTTAAAACGCGTGTCGGTTTACGCCCTTGAATAAAATCGATAATATCTAACGGTACGCTTAGCTCAGCCATCTTAGACGCTACGAATTTGCGGAAGTACTTTATCGCTATAGCGTCCTTATGACGTCGTTCAAAGTCCGCTATTGCCCACTTCGTCACCTCTACTCTCTTCAGAGGCGTTATGTGGAATACATAGAAGACGCCCTTATATCCCCTAGTCCAACTAAGCGGATAATAACAGACGTCGTTACCGCAAATGTCCCTTTCGGGTTCCTTCAGCACTTTCAGTATTTCGCTCAGCCTAACGCCCGACTCGAGAGCGATACGGTAGATGAAGTAGACGTTTTCGCTATAGTCTTTTGCTAATTGTAACGTCCTTTTTATCTCTTCCAACGTTGGAATGTAGATATCAGCGTTCGCCTTCTTCACCTTTACCGCTTTCAATATTTTATCCGCAAATTCATCATGTATGATATTGCGTGACGCTAAGAAACGTGCAAAGAGTCGGTAAGCCTTCTGTGCGTCTCTCGTCTCTTTATACGGCTTTGATATAGCATTGATGTAGTCCTTTGCAGTTTTTTCGCTTATCCCCCTTTCGTTCATGAGATAGTCGTAGAACGCCTTTATGTTGCCGTCCGTCGCGTATTGGCGCAAATTGGCAACCAACGCTATTTTACGTCGTTCAGTTCCCTCTTTTCCGCCTCCGGAGCCGGAGGTCCCGGGTTCAAATCCCGGCGGGTCCGCTTGTAGGGGAGTATCCCCTACGACCCCTAATTTCATTTTTAGATATGATTCAACGACGTCAGCTAAAGGACCCACGTAACGCTCTTTTACCTCACCGTTTTCATACTCTAGCTTGTAAACATAATACCGCCCTTTCCTCTCGCGTA
AAATATAATCCCCGTATTTATAACGCGTCTTATCTTTCGTCATTTCGCCTCACAGTATTATGGTTGCCAAAACGGGCTTATAAGCATTGGCAACCCGTTAATTTTTGCCGTTAAAACACGTTGAATTGAAAGAAGACGGCAAAGAATCCACACAGGTAATACTAAAAAAGTAGTATTACTTACATTAGAAGGACTCATTTGTCCACCTTGTATTCTAGCCATGCTATCTCTGCCTTCAGCTCATCTAGCTTCCCCTTTATGTCTGTCAGGTCAAGGGGAACTCCTCTCATTAACCTGAGTTCGTTTTCGATTTTTTCAAGCTCCTTTTCCAACTCCTCTAGTTTCTCTAATTCCTTTAGTCGTTCTTCCAATTTCTTTTCCAATTTCCCCTTTGCGTCATTTATAATTATGCTTACTACCCAAACAATTCCTAAATCAGAAATAATTATTAACTCCTCTGAGTTGAATATCATTTTCCGCCCCTCGCTAAATACTCCTTAAAGCTCTGATAGAACCCCTTCAGACTAACCCGTAAGTCTGTTAGGTTCTTCCAGTATTGTAATGGGATTAAGTAATAGTAGCTTACTGCATCTCTCTCAAATTTGTCCTTCTTAATCTTTCCTTGCTTTTCTAAGTTGAGTATTTGCAGTGCTGAGATACATTTTAACTTGTCCTCAGCATCTGAATAGTGTATAAACCAAACCCTCCCCATAACCTCATTCTGCTTTGCAACTTCTACTTTAGTGCTTAATATTGCGTAAACGCTTTCGCCGTATCTTTCTTTGCTCTGTTCTTCAGTCCATGAACTTCCCGTAATATCTATCCAAATTAAAGGATAATATTCTGTCTTAGCCTTAACGTATAAAGTCAAATCGTATTTATCTTGCAGACCGCTATAGTATTGCTCATTTATTACATTAGTTAAAGTCCCCACGCCAGTTGGGCGGATATAAACATCAAAGTCTAACAAACCCTTAGCCCGCCACTTTGATAAAGAGATTAAGAGCTTTCCAAAAACTAGGTATTCTCGCCCTAAATAAGTTGAAGGGAGGATATAATCCTCAGCTTGATTACCCCAATACTTTAGCTTAAAATTAGTTTCAGCCATCTCACTCACCATATTGAAACGTGGGCTAGTATGTGAATCAGTACTGATGCTATTGCAAATAACACACTTGCAGTAGCAATTCCTATTACAATCCATTTACCATAATCCACCTTAGTTTGTTGGTCAATATACTCGTTGATGATCTTTAGTATTTCTGGCTTTAGTTCTGATAATGAAAGGAAGACAGAGGCATAAAGTACTAAGGAGGATGTGAACAGATTATCCGCCTTTTCTGAAAGTTTATAAAGCTCATATCTTGCTCTCTCATAATCTTCATAATTAATAATTTCATCAAACTTTTCTACTTGCTCTTCATATTCTTTCTTCAGAGAGTAAGGAGTTGTCTTTTCAATTACTCCTAATTTTATTAACTTCTTAACAGCTTCCTTAAATCCTTGTTTATTGCTAGCATACGCTAAAGGGTCTTTTCCTTCTTGAGAAGCTCTATAGATAACTATAGCACCATAAACAATATTTACAATATCGTATGGTAAGGAATACGCACCGATTTGGGCAATATCTTCAACTCTTCTTTGATCCATCTAGTTCACCTCTTTTTGATTTGTTTGTAGGTTTCTATCGCAGTTTTCAGCGATATCGCAAATAGCTTCCCCTTTTCCGTTAGGTATAGCCTCTTTTCGCCTCTTTCTTGACGCTCTTTCACGAAGCCCTCTTGTATTAGGAACTTTTTTGCATCATAAAAGGTGGCAGTGGACATGGGAAATTCTGCGTTTACTTTCTTGTATAGGTCATATGTTGCTATTCCTTCATTATCATATAGATAAGCCAATACTATGGCTTCGGGGTAGAAGAATGGTGTACTTTTCATATCCTCCTCACTCCTCAGCCTCTAATAGCTTAACTGCCTCCTCTATCAACTGTCCCATTGTCT
TTCCAGTCTTTGCCTTAAGCCTCTGCAGAGTCTCATATGTTTCCTCACTTATTGAAATGTTAAGCCTTTTGACTATCCTATCTTTCCTCTTCTCTATCATTTAGGTCACCTTGTTTATTGTTATTTGAAATACGTATCCGTCTTCGTCACATCGAAGTATAATTTTGTATCCATTATTAGCATATTCTACGTCAAAGTTCCCACAACAATAATTCGGGTCTTCGGACTCGTTATAGACTTTGCTCCAACCATCTTTTTGTAGTGCCTCTTCTAAGTAGTCTACTCTGATGAAGCCTTCATCATATTCGTTCAGTACCCTAAAGCTTATACTATCAATGCCTAATACGTCTAATAGCTTCAACAGATCGAATATAGGAACTTGCACCATCATTTCAGCTCACCTTAATGAGCTGATATAATTCCGCTTCTATCTTTTGAACTTGGAAGTATGCCTTGCCTAGCTTTTGCTTATCCATATTGCCCGTTATTCTATCAATCTTAATCTCGTGGATTAATGATAATAGCTCTCTGACATCCTCATCAAGCATTTCAAATAATTCTTTCTCTAAGACTTCTTTACTCATTGTTTTTCACCTTAGCAAACTCATCTAACGTTGTTTGTCTCAGTTCTCTTTTCTTTATCAAATAAAATTCCGAATGTCCCTTCTTATTGTTATTACTGTACTTCATGTCAGTTCACTGCTTTGCCTTTATAAATCCTTGATCCGTTTGCTCAAAATTTGCGGGCTGGGCAT
Gene finding through learning
atgccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgtaa
gaggatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagatg
Gene
Non-gene
gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
Gene?
atg
tga
ggtgag
ggtgag
ggtgag
caggtg
cagatg
cagttg
caggccggtgag
Map looks better. Is this all?
OK, so we’ll predict protein functions …
… maybe do a few experiments …
… and then enjoy glory (maybe money, too).
Trevor Douglas and Mark Young (2006) Science 312, 873 - 875.
What can we do with Molecular Biology information?
• Different levels of Molecular Biology information
• DNA
– Coding or non-coding
– Meaningful or junk DNA?
• RNA
– Information transfer (mRNA, tRNA, rRNA)
– Regulatory roles
• Protein
– Structure and function
– Modifications
Molecular Biology Information in DNA and RNA
• Raw DNA Sequence
– 4 bases: AGCT
– Coding or Not?
– How do we parse the sequence into genes?
– Because of introns, ~1 kb of coding sequence in a gene can be spread over ~2 Mb of the genome
• Raw RNA Sequence
– 4 bases: AGCU
– mRNA, tRNA, rRNA
– Regulatory RNAs
– Secondary structure
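The question of how to parse a raw sequence into genes can be illustrated with a deliberately naive open reading frame (ORF) scan. This is only a sketch under strong assumptions that are not part of the lecture: forward strand only, no introns, ATG as the only start codon, and the three standard stop codons. The function name `find_orfs` and the toy sequence are hypothetical.

```python
# Naive ORF scan (illustrative only): forward strand, no introns,
# ATG as start codon, TAA/TAG/TGA as stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples, 0-based with end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

toy = "CCATGGCATAAGG"  # hypothetical toy sequence, not lecture data
for start, end, frame in find_orfs(toy):
    print(f"frame {frame}: {start}-{end} {toy[start:end]}")
```

Real gene finders go well beyond this: they scan both strands, handle introns in eukaryotes, and learn statistical signals that separate genes from non-genes.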
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
Molecular Biology information in protein sequences
• 20 letter alphabet, more combinatorial variability than DNA (20^N possible sequences for a protein of N amino acids)
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
• More than 2 million unique protein sequences (more than 5.6 M of total sequences in the database)
• We must be able to “transfer” the function from characterized proteins to uncharacterized ones based on some measure of similarity
Molecular Biology information in macromolecular structures
• DNA/RNA/Protein
– The majority of all structures are of proteins
– Proteins are easier to crystallize and were thought to be more important
Organizing information: Redundancy and multiplicity help …
• Fairly different sequences may have the same structure and function
– Bad news: If they are very different, how do we find this?
– Good news: Once they are found, we learn something more about structure and function
• An organism has many similar genes and non-coding RNAs
– The redundancy present for essential genes and/or RNAs (rRNA)
• A single gene may have multiple functions
– Combining domains in eukaryotes produces large proteins
• Genes are grouped into pathways; this is good
… though sometimes the path is difficult
• Evolutionary distances do not help establish initial relationship
– Large differences (large evolutionary distances) between proteins are hard to identify and defend on statistical grounds without experiment
• Evolutionary distances do help once the relationship is established
– If the relationship between distant proteins is established, their conserved parts provide information about what is vital for function
– Less conserved parts of proteins are less important for function (they act as a scaffold)
• Given all these difficulties, how do we find hidden similarities?
Some things we can do using just sequence
• Sequence (text string) comparisons
– Sequence (text string) search
– Sequence alignment
– Finding short sequences in biological sequences
– Significance statistics
• Databases
– Building, Querying
• Learning patterns
– Artificial Intelligence and Machine Learning
– Mining for patterns and clustering them
• Secondary structure prediction
– Where are helices, strands and loops in proteins?
– Finding trans-membrane helices
• Tertiary structure prediction
– Fold recognition and structure prediction
– Active site identification
How are optimal alignments found? (Should we all pick the one we like?)
Aligning text strings …
Which alignment is the best?

Raw Data
T C A T G
C A T T G

2 matches, 0 gaps
T C A T G
      | |
C A T T G

3 matches (2 end gaps)
T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion
T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion
T C A T - G
  | | |   |
. C A T T G
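The match counts quoted for these candidate alignments can be checked mechanically. The sketch below is illustrative only; the helper name `count_matches` is hypothetical, and `.` and `-` are treated as gap symbols, following the notation on the slide.

```python
def count_matches(top, bottom):
    """Count columns where two gapped, equal-length strings carry the same letter."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    gap_chars = {"-", "."}
    return sum(1 for a, b in zip(top, bottom) if a == b and a not in gap_chars)

candidates = [
    ("TCATG", "CATTG"),    # no gaps
    ("TCATG.", ".CATTG"),  # two end gaps
    ("TCA-TG", ".CATTG"),  # one internal gap
    ("TCAT-G", ".CATTG"),  # one internal gap, placed differently
]
for top, bottom in candidates:
    print(top, bottom, count_matches(top, bottom))
```

Counting matches alone cannot rank the last two candidates, which is one reason a scoring scheme with explicit gap penalties is needed.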
Dynamic Programming to the rescue
• What to do for a bigger string?
SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGG
REGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEPKPNEPRGDILLPTVGHALAFIERLERPELYGVNP
EVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRT
EDFDGVWAS
• Needleman-Wunsch (1970) provided first automatic method
– Dynamic Programming to Find Global Alignment
– Local Alignment is sometimes better than Global
• Needleman-Wunsch Test Data
– ABCNYRQCLCRPM
– AYCYNRCKCRBP
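To make the global versus local distinction concrete, here is a minimal sketch of a local alignment score in the style of Smith-Waterman. That algorithm is not named in the lecture and is swapped in here as the usual local counterpart of Needleman-Wunsch. The scoring values (5 for match, -2 for mismatch, -6 for gap) are the ones used in the worked examples of this lecture; the function name and the toy strings are hypothetical.

```python
def local_score(x, y, match=5, mismatch=-2, gap=-6):
    """Best local alignment score; cells are clamped at 0 so an alignment can start anywhere."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

# The shared word is found even though the flanking letters differ.
print(local_score("XXMONTANAYY", "ZZMONTANA"))
```

A global alignment would be forced to pay for the unrelated flanking letters, whereas the local score reports only the best-matching stretch.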
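Make a dot plot (Similarity matrix)
Put 1's where characters are identical (dots mark non-identical pairs).
  A B C N Y R Q C L C R P M
A 1 . . . . . . . . . . . .
Y . . . . 1 . . . . . . . .
C . . 1 . . . . 1 . 1 . . .
Y . . . . 1 . . . . . . . .
N . . . 1 . . . . . . . . .
R . . . . . 1 . . . . 1 . .
C . . 1 . . . . 1 . 1 . . .
K . . . . . . . . . . . . .
C . . 1 . . . . 1 . 1 . . .
R . . . . . 1 . . . . 1 . .
B . 1 . . . . . . . . . . .
P . . . . . . . . . . . 1 .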
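The dot plot can be generated directly from the two test strings. This is a minimal sketch; the function name `dot_plot` is hypothetical, and printing a dot for non-identical pairs is only a display choice.

```python
def dot_plot(row_seq, col_seq):
    """Similarity matrix: 1 where the row and column characters are identical, else 0."""
    return [[1 if r == c else 0 for c in col_seq] for r in row_seq]

cols = "ABCNYRQCLCRPM"
rows = "AYCYNRCKCRBP"
matrix = dot_plot(rows, cols)

print("  " + " ".join(cols))
for label, line in zip(rows, matrix):
    print(label + " " + " ".join("1" if v else "." for v in line))
```

Diagonal runs of 1's in such a matrix correspond to stretches that align without gaps.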
Scoring the alignment
• The idea is to go through the matrix and find the shortest path to the bottom (in practice the path is traced from the bottom backwards)
• Caveat 1: This path also needs to have the highest score
• Caveat 2: We have to score the gaps (insertions and deletions), since a gap is introduced by the alignment and does not exist in the proteins themselves
Global alignment by dynamic programming
Sequence X: MONTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
          M    O    N    T    A    N    A
     0   -6  -12  -18  -24  -30  -36  -42
M   -6    5   -1   -7  -13  -19  -25  -31
O  -12   -1   10    4   -2   -8  -14  -20
N  -18   -7    4   15    9    3   -3   -9
T  -24  -13   -2    9   20   14    8    2
A  -30  -19   -8    3   14   25   19   13
N  -36  -25  -14   -3    8   19   30   24
A  -42  -31  -20   -9    2   13   24   35

Optimum alignment score: 35
X: MONTANA
Y: MONTANA
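The matrix above can be reproduced with a few lines of code. The sketch below fills the Needleman-Wunsch matrix with the scoring system from this slide; the function name `nw_matrix` is hypothetical, and rows follow the first sequence while columns follow the second.

```python
def nw_matrix(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the global-alignment matrix; F[i][j] is the best score for x[:i] vs y[:j]."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap  # x[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap  # y[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F

F = nw_matrix("MONTANA", "MONTANA")
for label, row in zip(" MONTANA", F):
    print(label, " ".join(f"{v:4d}" for v in row))
print("Optimum alignment score:", F[-1][-1])
```

Each cell depends only on its upper, left, and upper-left neighbours, which is what makes the problem solvable by dynamic programming.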
What about gaps?
Sequence X: MONTTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
          M    O    N    T    A    N    A
     0   -6  -12  -18  -24  -30  -36  -42
M   -6    5   -1   -7  -13  -19  -25  -31
O  -12   -1   10    4   -2   -8  -14  -20
N  -18   -7    4   15    9    3   -3   -9
T  -24  -13   -2    9   20   14    8    2
T  -30  -19   -8    3   14   18   12    6
A  -36  -25  -14   -3    8   19   16   17
N  -42  -31  -20   -9    2   13   24   18
A  -48  -37  -26  -15   -4    7   18   29

Optimum alignment score: 29
X: MONTTANA
Y: MON-TANA
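The alignment itself is recovered by tracing back from the bottom-right cell, as the earlier slide notes ("from the bottom backwards"). The sketch below is one possible traceback; the function name `global_align` is hypothetical. When two moves tie, it prefers the diagonal, then a gap in Y, then a gap in X; this is an arbitrary choice, and another tie-breaking rule can return a different alignment with the same score (for example, the gap placed opposite the other T).

```python
def global_align(x, y, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned_x, aligned_y) for a Needleman-Wunsch global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback: walk from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")  # gap in y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])  # gap in x
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = global_align("MONTTANA", "MONTANA")
print("Optimum alignment score:", score)
print("X:", ax)
print("Y:", ay)
```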
Scoring “real-life” alignments
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
          M    O    N    T    A    N    A    G    R    I    Z    Z    L    I    E    S
     0   -6  -12  -18  -24  -30  -36  -42  -48  -54  -60  -66  -72  -78  -84  -90  -96
M   -6    5   -1   -7  -13  -19  -25  -31  -37  -43  -49  -55  -61  -67  -73  -79  -85
O  -12   -1   10    4   -2   -8  -14  -20  -26  -32  -38  -44  -50  -56  -62  -68  -74
N  -18   -7    4   15    9    3   -3   -9  -15  -21  -27  -33  -39  -45  -51  -57  -63
T  -24  -13   -2    9   20   14    8    2   -4  -10  -16  -22  -28  -34  -40  -46  -52
A  -30  -19   -8    3   14   25   19   13    7    1   -5  -11  -17  -23  -29  -35  -41
N  -36  -25  -14   -3    8   19   30   24   18   12    6    0   -6  -12  -18  -24  -30
A  -42  -31  -20   -9    2   13   24   35   29   23   17   11    5   -1   -7  -13  -19
B  -48  -37  -26  -15   -4    7   18   29   33   27   21   15    9    3   -3   -9  -15
O  -54  -43  -32  -21  -10    1   12   23   27   31   25   19   13    7    1   -5  -11
B  -60  -49  -38  -27  -16   -5    6   17   21   25   29   23   17   11    5   -1   -7
C  -66  -55  -44  -33  -22  -11    0   11   15   19   23   27   21   15    9    3   -3
A  -72  -61  -50  -39  -28  -17   -6    5    9   13   17   21   25   19   13    7    1
T  -78  -67  -56  -45  -34  -23  -12   -1    3    7   11   15   19   23   17   11    5
S  -84  -73  -62  -51  -40  -29  -18   -7   -3    1    5    9   13   17   21   15   16

Optimum alignment score: 16
X: MONTANA--BOBCATS
Y: MONTANAGRIZZLIES
The scoring depends on our choice of parameters
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match (B and G also count as a match); -2 for mismatch; -6 for gap

Optimum alignment score: 23
X: MONTANAB--OBCATS
Y: MONTANAGRIZZLIES

Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -1 for gap

Optimum alignment score: 26
X: MONTANA--------BOBCATS
Y: MONTANAGRIZZLIE------S
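The dependence of the optimum on the scoring parameters can be checked with a score-only version of the global alignment that keeps just two matrix rows in memory. This is a sketch; the function name `global_score` and the `also_match` parameter (a hypothetical way to express "B and G also count as a match") are assumptions made for illustration.

```python
def global_score(x, y, match=5, mismatch=-2, gap=-6, also_match=()):
    """Optimal global alignment score, computed row by row."""
    pairs = {frozenset(p) for p in also_match}  # extra letter pairs scored as matches
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * gap]
        for j, b in enumerate(y, start=1):
            same = a == b or frozenset((a, b)) in pairs
            diag = prev[j - 1] + (match if same else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

x, y = "MONTANABOBCATS", "MONTANAGRIZZLIES"
print("gap -6:        ", global_score(x, y))
print("gap -6, B = G: ", global_score(x, y, also_match=[("B", "G")]))
print("gap -1:        ", global_score(x, y, gap=-1))
```

With a cheap gap (-1), two gaps cost no more than one mismatch (-2), so the optimum avoids mismatches entirely and opens many gaps instead.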
How do we choose good scoring parameters?
• A simple scoring scheme considers only sequence identity
• More realistic scoring schemes consider sequence similarity, which is taken from substitution matrices
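• We measure the frequency of residue substitutions and normalize it by residue frequency in the database (a log-odds score: log2 of the observed frequency over the frequency expected by chance)
• Zero in substitution matrix means that the substitution occurs as often as expected by chance
• Score less than zero means that the substitution occurs less often than expected by chance; score greater than zero means that it occurs more often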
• There is no universally good matrix
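A small numeric sketch of the log-odds idea behind substitution matrices follows. The frequencies are made-up illustrative numbers, not values from any real database, and the scaling factor of 2 (half-bit units, rounded to integers) is an assumption about how such matrices are typically tabulated. The function name is hypothetical.

```python
import math

def log_odds_score(observed, expected, scale=2.0):
    """Rounded scale * log2(observed / expected) for one residue pair."""
    return round(scale * math.log2(observed / expected))

# Hypothetical pair frequencies, each compared with the same chance expectation.
expected = 0.02
for label, observed in [("as expected", 0.02), ("twice as often", 0.04), ("four times rarer", 0.005)]:
    print(f"{label:17s} -> score {log_odds_score(observed, expected):+d}")
```

The sign of the score therefore says whether a substitution is seen more or less often than chance would predict, and the magnitude says by how much.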
BLOSUM62 substitution matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
What is the limit of substitution matrices?
Why did substitution matrices fail?
• The proteins in question are very distantly related and their substitution patterns are not properly rewarded by general matrices
• Substitution matrices do not capture all families equally well because they are meant to be general
• Can we build protein family-specific substitution matrices?
• Yes, these are known as protein family profiles
How do we build a protein family-specific matrix?
Search protein database using BLOSUM62 matrix
Build protein-family specific matrix (profile) and search protein database again
???
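The "build a profile" step can be sketched as follows: take the aligned sequences found in the first search, count residues in each column, and convert the counts into per-position log-odds scores. This is a strong simplification and not how PSI-BLAST actually computes its matrix: a uniform background of 1/20, a flat pseudocount of 1, and no sequence weighting are all assumptions made for illustration. The toy alignment and all function names are hypothetical.

```python
import math

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def build_profile(aligned_seqs, pseudocount=1.0):
    """One dict per alignment column: residue -> log2(column frequency / background)."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    n = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        scores = {}
        for aa in ALPHABET:
            count = sum(1 for c in column if c == aa)
            freq = (count + pseudocount) / (n + pseudocount * len(ALPHABET))
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def profile_score(seq, profile):
    """Sum of position-specific scores for an ungapped sequence of profile length."""
    return sum(col[aa] for col, aa in zip(profile, seq))

family = ["TMDV", "TMEV", "SMDI", "TMDV"]  # hypothetical toy alignment
profile = build_profile(family)
for candidate in ("TMDV", "SMEI", "AAAA"):
    print(candidate, round(profile_score(candidate, profile), 2))
```

Unlike a general substitution matrix, the score for a residue here depends on its position: M is strongly rewarded in the second column and nowhere else.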
Detecting distant relationships using profiles (PSI-BLAST)
Position-specific scoring matrix (PSSM)
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
T  -1   0   1   2  -2   0   1  -1   0  -2  -2   0  -1  -2  -1   0   3  -1  -1  -1
M  -1  -2  -3  -3  -1  -2  -3  -3  -2   1   1  -2   7   0  -3  -3  -2  -1  -1   0
D  -1   2   0   2  -3   3   1  -1   0  -3  -3   1  -2  -3  -1   0   0  -2  -2  -2
V  -1  -4  -4  -5  -1  -4  -4  -5  -3   4   1  -4   0  -1  -4  -4  -1  -2  -2   5
I  -1  -4  -5  -6  -1  -4  -5  -6  -4   5   1  -5   0  -1  -5  -5  -2  -2  -2   4
S   0  -1   0  -1  -1  -1  -1  -1  -1  -3  -3  -1  -2  -3  -2   4   3  -3  -2  -2
F  -2  -4  -4  -5  -2  -4  -5  -5  -2   1   2  -4   1   6  -4  -4  -3   2   2   0
K  -1   3  -1  -1  -4   1   0  -2   0  -3  -3   5  -2  -4  -2  -1  -1  -3  -3  -3
L  -2  -4  -5  -6  -1  -4  -5  -6  -3   3   4  -4   2   1  -4  -5  -3  -1  -1   1
P  -1  -3  -3  -2  -4  -3  -2  -3  -3  -4  -4  -2  -4  -4   7  -2  -2  -4  -4  -3
P  -1  -1  -1   2  -4  -1   0  -1  -1  -4  -4  -1  -4  -4   6  -1  -1  -3  -3  -3
E   0   0   0   2  -4   1   4  -2  -1  -3  -3   1  -2  -4  -1  -1  -1  -3  -3  -2
L  -2  -3  -4  -5  -2  -3  -4  -5  -2   2   4  -4   6   1  -4  -4  -2  -1  -1   1
N  -1   0   3   0  -2   0   2  -1   0  -2   0   1  -1  -2  -1   0   0  -1  -1  -2
A   2   1  -1  -1  -1   0   0  -2   0   0   0   0   0  -1  -1  -1   0   0   0   0
K  -1   1  -1  -1  -3   1   0  -2   0  -2   0   4  -1  -2  -2  -1  -1  -2  -1  -1
L  -2  -4  -5  -5  -2  -4  -5  -5  -3   1   5  -4   1   1  -4  -5  -3  -1  -1   1
E  -1   0   2   4  -4   0   3  -1   0  -5  -4   0  -4  -4  -1   0  -1  -3  -3  -4
S   1   1   0   0  -3   3   1  -1   0  -2  -2   1  -2  -2  -1   1   0  -2  -1  -2
V  -1  -1  -2  -3  -1  -2  -3  -3   0   2   1  -2   1   2  -2  -2  -1   1   3   3
A   5  -3  -3  -2   0  -2  -2   0  -2  -2  -2  -2  -2  -3  -2   0  -1  -3  -3  -1
L  -1   0  -1  -2  -1  -1  -1  -2   0   2   1   0   1   0  -2   0   0   1   2   0
K  -1   2   0   0  -3   1   2  -1   0  -3  -3   3  -2  -4  -1   1   0  -3  -2  -2
E  -1   1   0   0  -2   2   2  -2   3  -2   0   1  -1  -2  -1  -1   0  -1  -1  -1
K  -1   1   1   0  -3   0   0   2   0  -4  -3   3  -3  -3  -1   0  -1  -2  -2  -3
K  -1   0  -1  -1  -2   0   0  -3   0   0  -1   3   3  -1  -2  -1   0  -1   0   1
S  -1  -1   2   0  -2   0   0  -1   0  -3  -3   0  -2  -3  -1   3   2  -2  -2  -3
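Scoring a sequence against a PSSM means looking up, for each position, the score of the residue actually found there and summing the results. The sketch below uses only the first three rows (T, M, D) of the matrix above to stay short; the function names and the example peptides are hypothetical, and gaps are not handled.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
COLUMN = {aa: k for k, aa in enumerate(ALPHABET)}

# First three positions (T, M, D) of the PSSM shown above.
PSSM = [
    [-1, 0, 1, 2, -2, 0, 1, -1, 0, -2, -2, 0, -1, -2, -1, 0, 3, -1, -1, -1],
    [-1, -2, -3, -3, -1, -2, -3, -3, -2, 1, 1, -2, 7, 0, -3, -3, -2, -1, -1, 0],
    [-1, 2, 0, 2, -3, 3, 1, -1, 0, -3, -3, 1, -2, -3, -1, 0, 0, -2, -2, -2],
]

def pssm_score(segment, pssm=PSSM):
    """Sum the position-specific scores of an ungapped segment."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of PSSM rows")
    return sum(row[COLUMN[aa]] for row, aa in zip(pssm, segment))

def best_window(seq, pssm=PSSM):
    """Slide the PSSM along seq and return (best score, start index)."""
    width = len(pssm)
    return max((pssm_score(seq[i:i + width], pssm), i)
               for i in range(len(seq) - width + 1))

print(pssm_score("TMD"), pssm_score("TMQ"), pssm_score("AAA"))
print(best_window("GGTMDGG"))
```

Note that Q scores higher than D at the third position even though the query residue there is D: the matrix reflects what the whole family tolerates, not just the query.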
Combining sequence similarity with SS information
Can we quickly scan for common protein families?
• YES, many databases available
• Instead of comparing our query to other sequences, we compare it to the database of profiles (also called Hidden Markov Models or HMMs)
• Profiles (and HMMs) capture the average preference for residues at all positions; they are probabilistic representations of protein families
• Try these databases:
– PFAM (http://pfam.wustl.edu)
– SMART (http://smart.embl-heidelberg.de)
Profile HMMs have other uses
• Profile HMMs represent a phylogenetic footprint of a given protein family
– Also used for secondary structure prediction
– Predictions of trans-membrane proteins
– Prediction of protein disorder
• Most of these predictors are based on machine-learning algorithms that are trained on known data and can extract subtle patterns
Questions?
Mensur Dlakic
Department of Microbiology
111 Lewis Hall
Tel: 994-6576
[email protected]
office: 109 Lewis Hall