Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA


Learn How to:

● Assemble a genome and predict its:
- ORFs
- Promoters

● Annotate genome:
- Predict protein functions
- Model them if possible
- Re-design them if possible

● Predict functions by inference from a large amount of unrelated data
● Predict ncRNAs
● High-throughput methods and data interpretation
● Prepare the data for presentations & publications

What is Bioinformatics?

• Choices:
– The analysis of biological molecules using computers and statistical techniques
• TRUE
– The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research
• also TRUE, but suits Computational Biology better

More definitions

• The collection, organization and analysis of large amounts of biological data, using networks of computers and databases.

• The process of developing tools and processes to quantify and collect data to study biological systems logically.

• The science of informatics as applied to biological research.

Yet more definitions

• Mark Gerstein’s definition:
– Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

– The manuscript breaking down each part of the above statement will be e-mailed.

– http://wiki.bioinformatics.org/Bioinformatics_FAQ


The important stuff

• Bioinformatics brings together biological data from genome research with the theory and tools of mathematics, computer science and artificial intelligence.

• Bioinformatics includes any application of computer technology and information science to:
– Gather, organize, store and handle data.
– Analyze, interpret and spread data.
– Predict biological structure and function.

What is the information in Molecular Biology?

• Central Dogma of Molecular Biology

DNA -> RNA -> Protein -> Phenotype

• Molecules
– Sequence, Structure, Function

• Processes
– Mechanism, Specificity, Regulation

• Central Paradigm for Bioinformatics

Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype

• Large Amounts of Information
– Standardized
– Statistical
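
The DNA -> RNA -> Protein steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a sequence library: it assumes the standard genetic code, a coding-strand DNA input, and reading frame 0. The function names `transcribe` and `translate` are made up for this example.

```python
# Standard genetic code written compactly: codons are enumerated with
# bases in the order T, C, A, G at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein in one-letter code, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("atggcaattaaaattggt")
    print(mrna)             # AUGGCAAUUAAAAUUGGU
    print(translate(mrna))  # MAIKIG
```

The sketch ignores ambiguous bases, alternative genetic codes and the reverse strand, all of which matter for real genomes.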

•Genetic material

•Information transfer (mRNA)

•Protein synthesis (tRNA/mRNA)

•Some catalytic activity

•Most cellular functions are performed or facilitated by proteins.

•Primary biocatalyst

•Cofactor transport/storage

•Mechanical motion/support

•Immune protection

•Control of growth/differentiation

This slide is courtesy of Mark Gerstein

Language of biology is not easy to understand

• Just like in spoken language, some words look very different but have the same meaning (car and automobile are synonyms; sequences of distantly related proteins are synonyms)

• Some words look or sound very similar yet have different meaning (complement and compliment; eminent and imminent; allude and elude; decent and descent are homophones; GAG and TAG codons are homophones)

• In spoken language, we made up the rules ourselves, which is why most of the time we can trace the origins of words

• How do we trace the origins of Nature’s language?

Why is Bioinformatics important?

• Supports experimental work
– In some cases, it provides complementary data

• More importantly, guides experimental work
– Predictions based on data
– Extension of experiments in new directions

• To be believable, Bioinformatics predictions have to be verifiable
– Statistical significance, or some other kind of significance score

When did Bioinformatics begin?

• 10-15 years ago?
– This is a common assumption

• Bioinformatics existed as far back as the 1970s
– It went by different names
– It was underused because the amount of biological sequence data was small

Bioinformatics and Genome Biology

• The revolution driving enormous development in Bioinformatics and experimental sciences came from whole genome sequencing

Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512.

(Picture adapted from TIGR website, http://www.tigr.org)

• Integrative Data
– 1995, H. influenzae (bacterium): 1.8 Mb & ~1700 genes done
– 1997, yeast: 13 Mb & ~6000 genes
– 1998, worm: ~100 Mb with 19 K genes
– 1999: >30 completed genomes!
– 2003, human: 3 Gb & ~20-25 K protein-coding genes ...

Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Petsko, Nature 401: 115-116 (1999)

What can we infer from sequence using Bioinformatics?

[Figure: DNA -> ORF (expressed?) -> Protein -> Active protein -> 3D structure -> Function]

• Domains = smallest functional/structural subunits

• Properties to infer: cellular function, physiological function, substrate binding sites, protein-protein interfaces, activity, specificity, docking, localisation

Make sense of subtle differences

[Waterston et al. Nature 2002]

- About 90% of the mouse and human genomes are in syntenic blocks.

What’s in the genome?

• If we are so much alike in terms of genome, why are we so different?
– Large variation in human population
– Similar genes and similar genome organization between human and chimp (or even human and mouse), yet large phenotypic difference

• The importance of non-coding parts of our genome became more obvious
– Non-coding, regulatory RNAs
– Binding sites for regulatory proteins
– Other possibilities that are not obvious right now

Complexity of biological information

1. Finding regulatory motifs in DNA

2. Increasing the speed and reliability of functional annotation from sequence

The more we know, the better?

So we have a genome sequence …>Sulfolobus virus 1 complete genome 15465 bp.TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGCGGAATACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGGAGGGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGTAAACAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGAAGAAGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAGACTAACGGCTTTGAAAGTGCGATTATTTTCGGGAAACAAGGTACGGGAAAGACTACTTACGCCCTTAAGGTGGCAAAAGAAGTTTACCAGAGATTAGGACATGAACCGGACAAGGCATGGGAACTGGCCCTTGACTCTTTATTCTTTGAGCTTAAAGATGCATTGAGGATAATGAAAATATTCAGGCAAAATGATAGGACAATACCAATAATAATTTTCGACGATGCTGGGATATGGCTTCAAAAATATTTATGGTATAAGGAAGAGATGATAAAGTTTTACCGTATATATAACATTATTAGGAATATAGTAAGCGGGGTGATCTTCACTACCCCTTCCCCTAACGATATAGCGTTTTATGTGAGGGAAAAGGGGTGGAAGCTGATAATGATAACGAGAAACGGAAGACAACCTGACGGTACGCCAAAGGCAGTAGCTAAAATAGCGGTGAATAAGATAACGATTATAAAAGGAAAAATAACAAATAAGATGAAATGGAGGACAGTAGACGATTATACGGTCAAGCTTCCGGATTGGGTATATAAAGAATATGTGGAAAGAAGAAAGGTTTATGAGGAAAAATTGTTGGAGGAGTTGGATGAGGTTTTAGATAGTGATAACAAAACGGAAAACCCGTCAAACCCATCACTACTAACGAAAATTGACGACGTAACAAGATAGTGATACGGGTAATGTCAGACCCCTTTTAGCCATTCCGCATACTTTTTATATTGCTCTTTCGCTATGCCGAAGAGCGATACGTAATGTTGCGTTAAAACGCGTGTCGGTTTACGCCCTTGAATAAAATCGATAATATCTAACGGTACGCTTAGCTCAGCCATCTTAGACGCTACGAATTTGCGGAAGTACTTTATCGCTATAGCGTCCTTATGACGTCGTTCAAAGTCCGCTATTGCCCACTTCGTCACCTCTACTCTCTTCAGAGGCGTTATGTGGAATACATAGAAGACGCCCTTATATCCCCTAGTCCAACTAAGCGGATAATAACAGACGTCGTTACCGCAAATGTCCCTTTCGGGTTCCTTCAGCACTTTCAGTATTTCGCTCAGCCTAACGCCCGACTCGAGAGCGATACGGTAGATGAAGTAGACGTTTTCGCTATAGTCTTTTGCTAATTGTAACGTCCTTTTTATCTCTTCCAACGTTGGAATGTAGATATCAGCGTTCGCCTTCTTCACCTTTACCGCTTTCAATATTTTATCCGCAAATTCATCATGTATGATATTGCGTGACGCTAAGAAACGTGCAAAGAGTCGGTAAGCCTTCTGTGCGTCTCTCGTCTCTTTATACGGCTTTGATATAGCATTGATGTAGTCCTTTGCAGTTTTTTCGCTTATCCCCCTTTCGTTCATGAGATAGTCGTAGAACGCCTTTATGTTGCCGTCCGTCGCGTATTGGCGCAAATTGGCAACCAACGCTATTTTACGTCGTTCAGTTCCCTCTTTTCCGCCTCCGGAGCCGGAGGTCCCGGGTTCAAATCCCGGCGGGTCCGCTTGTAGGGGAGTATCCCCTACGACCCCTAATTTCATTTTTAGATATGATTCAACGACGTCAGCTAAAGGACCCACGTAACGCTCTTTTACCTCACCGTTTTCATACTCTAGCTTGTAAACATAATACCGCCCTTTCCTCTCGCGTA
AAATATAATCCCCGTATTTATAACGCGTCTTATCTTTCGTCATTTCGCCTCACAGTATTATGGTTGCCAAAACGGGCTTATAAGCATTGGCAACCCGTTAATTTTTGCCGTTAAAACACGTTGAATTGAAAGAAGACGGCAAAGAATCCACACAGGTAATACTAAAAAAGTAGTATTACTTACATTAGAAGGACTCATTTGTCCACCTTGTATTCTAGCCATGCTATCTCTGCCTTCAGCTCATCTAGCTTCCCCTTTATGTCTGTCAGGTCAAGGGGAACTCCTCTCATTAACCTGAGTTCGTTTTCGATTTTTTCAAGCTCCTTTTCCAACTCCTCTAGTTTCTCTAATTCCTTTAGTCGTTCTTCCAATTTCTTTTCCAATTTCCCCTTTGCGTCATTTATAATTATGCTTACTACCCAAACAATTCCTAAATCAGAAATAATTATTAACTCCTCTGAGTTGAATATCATTTTCCGCCCCTCGCTAAATACTCCTTAAAGCTCTGATAGAACCCCTTCAGACTAACCCGTAAGTCTGTTAGGTTCTTCCAGTATTGTAATGGGATTAAGTAATAGTAGCTTACTGCATCTCTCTCAAATTTGTCCTTCTTAATCTTTCCTTGCTTTTCTAAGTTGAGTATTTGCAGTGCTGAGATACATTTTAACTTGTCCTCAGCATCTGAATAGTGTATAAACCAAACCCTCCCCATAACCTCATTCTGCTTTGCAACTTCTACTTTAGTGCTTAATATTGCGTAAACGCTTTCGCCGTATCTTTCTTTGCTCTGTTCTTCAGTCCATGAACTTCCCGTAATATCTATCCAAATTAAAGGATAATATTCTGTCTTAGCCTTAACGTATAAAGTCAAATCGTATTTATCTTGCAGACCGCTATAGTATTGCTCATTTATTACATTAGTTAAAGTCCCCACGCCAGTTGGGCGGATATAAACATCAAAGTCTAACAAACCCTTAGCCCGCCACTTTGATAAAGAGATTAAGAGCTTTCCAAAAACTAGGTATTCTCGCCCTAAATAAGTTGAAGGGAGGATATAATCCTCAGCTTGATTACCCCAATACTTTAGCTTAAAATTAGTTTCAGCCATCTCACTCACCATATTGAAACGTGGGCTAGTATGTGAATCAGTACTGATGCTATTGCAAATAACACACTTGCAGTAGCAATTCCTATTACAATCCATTTACCATAATCCACCTTAGTTTGTTGGTCAATATACTCGTTGATGATCTTTAGTATTTCTGGCTTTAGTTCTGATAATGAAAGGAAGACAGAGGCATAAAGTACTAAGGAGGATGTGAACAGATTATCCGCCTTTTCTGAAAGTTTATAAAGCTCATATCTTGCTCTCTCATAATCTTCATAATTAATAATTTCATCAAACTTTTCTACTTGCTCTTCATATTCTTTCTTCAGAGAGTAAGGAGTTGTCTTTTCAATTACTCCTAATTTTATTAACTTCTTAACAGCTTCCTTAAATCCTTGTTTATTGCTAGCATACGCTAAAGGGTCTTTTCCTTCTTGAGAAGCTCTATAGATAACTATAGCACCATAAACAATATTTACAATATCGTATGGTAAGGAATACGCACCGATTTGGGCAATATCTTCAACTCTTCTTTGATCCATCTAGTTCACCTCTTTTTGATTTGTTTGTAGGTTTCTATCGCAGTTTTCAGCGATATCGCAAATAGCTTCCCCTTTTCCGTTAGGTATAGCCTCTTTTCGCCTCTTTCTTGACGCTCTTTCACGAAGCCCTCTTGTATTAGGAACTTTTTTGCATCATAAAAGGTGGCAGTGGACATGGGAAATTCTGCGTTTACTTTCTTGTATAGGTCATATGTTGCTATTCCTTCATTATCATATAGATAAGCCAATACTATGGCTTCGGGGTAGAAGAATGGTGTACTTTTCATATCCTCCTCACTCCTCAGCCTCTAATAGCTTAACTGCCTCCTCTATCAACTGTCCCATTGTCT
TTCCAGTCTTTGCCTTAAGCCTCTGCAGAGTCTCATATGTTTCCTCACTTATTGAAATGTTAAGCCTTTTGACTATCCTATCTTTCCTCTTCTCTATCATTTAGGTCACCTTGTTTATTGTTATTTGAAATACGTATCCGTCTTCGTCACATCGAAGTATAATTTTGTATCCATTATTAGCATATTCTACGTCAAAGTTCCCACAACAATAATTCGGGTCTTCGGACTCGTTATAGACTTTGCTCCAACCATCTTTTTGTAGTGCCTCTTCTAAGTAGTCTACTCTGATGAAGCCTTCATCATATTCGTTCAGTACCCTAAAGCTTATACTATCAATGCCTAATACGTCTAATAGCTTCAACAGATCGAATATAGGAACTTGCACCATCATTTCAGCTCACCTTAATGAGCTGATATAATTCCGCTTCTATCTTTTGAACTTGGAAGTATGCCTTGCCTAGCTTTTGCTTATCCATATTGCCCGTTATTCTATCAATCTTAATCTCGTGGATTAATGATAATAGCTCTCTGACATCCTCATCAAGCATTTCAAATAATTCTTTCTCTAAGACTTCTTTACTCATTGTTTTTCACCTTAGCAAACTCATCTAACGTTGTTTGTCTCAGTTCTCTTTTCTTTATCAAATAAAATTCCGAATGTCCCTTCTTATTGTTATTACTGTACTTCATGTCAGTTCACTGCTTTGCCTTTATAAATCCTTGATCCGTTTGCTCAAAATTTGCGGGCTGGGCAT

Gene finding through learning

atgccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgtaa

gaggatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagatg

Gene

Non-gene

gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag

Gene?

[Figure labels: atg (start codon), tga (stop codon), and the short motifs ggtgag, caggtg, cagatg, cagttg, caggccggtgag]
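
Before any learning is involved, the simplest sequence-only baseline is to scan each reading frame for an atg followed in-frame by a stop codon. The sketch below is that naive baseline, not the learning-based gene finder these slides refer to. It ignores introns, scans only the strand it is given, and uses made-up names (`find_orfs`, `min_codons`).

```python
STOP_CODONS = {"taa", "tag", "tga"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def find_orfs(dna, min_codons=3):
    """Find atg...stop open reading frames on the given strand.

    Returns (frame, start, end) tuples; positions are 0-based, end is
    exclusive and includes the stop codon. min_codons counts the codons
    from atg up to, but not including, the stop codon.
    """
    dna = dna.lower()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "atg":
                start = pos                      # first atg opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None                     # look for the next ORF
    return orfs


if __name__ == "__main__":
    toy = "ccatggcaattaaatgacc"
    print(find_orfs(toy))                        # [(2, 2, 17)]
    print(find_orfs(reverse_complement(toy)))    # scan the other strand too
```

To cover both strands, run `find_orfs` on the sequence and on its `reverse_complement`. A trained gene finder additionally weighs codon usage, promoters and, in eukaryotes, splice signals.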

Map looks better. Is this all?

OK, so we’ll predict protein functions …

… maybe do a few experiments …

… and then enjoy glory (maybe money, too).

Trevor Douglas and Mark Young (2006) Science 312, 873 - 875.

What can we do with Molecular Biology information?

• Different levels of Molecular Biology information

• DNA
– Coding or non-coding
– Meaningful or junk DNA?

• RNA
– Information transfer (mRNA, tRNA, rRNA)
– Regulatory roles

• Protein
– Structure and function
– Modifications

Molecular Biology Information in DNA and RNA

• Raw DNA Sequence
– 4 bases: AGCT
– Coding or Not?
– How do we parse the sequence into genes?
– Because of introns, ~1 K in a gene could mean ~2 M in genome

• Raw RNA Sequence
– 4 bases: AGCU
– mRNA, tRNA, rRNA
– Regulatory RNAs
– Secondary structure

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Molecular Biology information in protein sequences


• 20 letter alphabet, more combinatorial variability than DNA (20^N possible sequences for N amino acids, versus 4^N for N bases)

– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

• More than 2 million unique protein sequences (more than 5.6 M of total sequences in the database)

• We must be able to “transfer” the function from characterized proteins to uncharacterized ones based on some measure of similarity

Molecular Biology information in macromolecular structures

• DNA/RNA/Protein
– The majority of all structures are of proteins
– Proteins are easier to crystallize and were thought to be more important

Organizing information: Redundancy and multiplicity help …

• Fairly different sequences may have the same structure and function
– Bad news: If they are very different, how do we find this?
– Good news: Once they are found, we learn something more about structure and function

• An organism has many similar genes and non-coding RNAs
– Redundancy is present for essential genes and/or RNAs (rRNA)

• A single gene may have multiple functions
– Combining domains in eukaryotes produces large proteins

• Genes are grouped into pathways; this is good

… though sometimes the path is difficult

• Evolutionary distances do not help establish initial relationship
– Large differences (large evolutionary distances) between proteins are hard to identify and defend on statistical grounds without experiment

• Evolutionary distances do help once the relationship is established
– If the relationship between distant proteins is established, their conserved parts provide information about what is vital for function
– Less conserved parts of proteins are less important for function - scaffold

• Given all these difficulties, how do we find hidden similarities?

Some things we can do using just sequence

• Sequence (text string) comparisons
– Sequence (text string) search
– Sequence alignment
– Finding short sequences in biological sequences
– Significance statistics

• Databases
– Building, Querying

• Learning patterns
– Artificial Intelligence and Machine Learning
– Mining for patterns and clustering them

• Secondary structure prediction
– Where are helices, strands and loops in proteins?
– Finding trans-membrane helices

• Tertiary structure prediction
– Fold recognition and structure prediction
– Active site identification

How are optimal alignments found? (Should we all pick the one we like?)

Aligning text strings … Which alignment is the best?

Raw Data ???
T C A T G
C A T T G

2 matches, 0 gaps

T C A T G
      | |
C A T T G

3 matches (2 end gaps)

T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion

T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion

T C A T - G
  | | |   |
. C A T T G
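
The match counts quoted for the four candidate alignments can be checked mechanically. The sketch below assumes each alignment is written as two rows of equal length, with '-' for an internal gap and '.' for an end gap, as on the slide. The helper name `count_matches` is made up for this example.

```python
def count_matches(row_x, row_y):
    """Count identical residues in two equal-length alignment rows.

    '-' (internal gap) and '.' (end gap) never count as matches.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    return sum(a == b and a not in "-." for a, b in zip(row_x, row_y))


if __name__ == "__main__":
    candidates = [
        ("TCATG", "CATTG"),    # no gaps
        ("TCATG.", ".CATTG"),  # two end gaps
        ("TCA-TG", ".CATTG"),  # one insertion
        ("TCAT-G", ".CATTG"),  # one insertion, shifted by one column
    ]
    for row_x, row_y in candidates:
        print(row_x, row_y, count_matches(row_x, row_y))
```

Counting matches alone cannot separate the last two candidates, which is one reason a scoring scheme with explicit gap penalties is needed.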

Dynamic Programming to the rescue

•What to do for Bigger String?

SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGG

REGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEPKPNEPRGDILLPTVGHALAFIERLERPELYGVNP

EVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRT

EDFDGVWAS

•Needleman-Wunsch (1970) provided first automatic method
– Dynamic Programming to Find Global Alignment
– Local Alignment is sometimes better than Global

•Needleman-Wunsch Test Data
– ABCNYRQCLCRPM
– AYCYNRCKCRBP

Make a dot plot (Similarity matrix)

Put 1's where characters are identical.

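(Dots mark cells where the characters differ.)

    A B C N Y R Q C L C R P M
A   1 . . . . . . . . . . . .
Y   . . . . 1 . . . . . . . .
C   . . 1 . . . . 1 . 1 . . .
Y   . . . . 1 . . . . . . . .
N   . . . 1 . . . . . . . . .
R   . . . . . 1 . . . . 1 . .
C   . . 1 . . . . 1 . 1 . . .
K   . . . . . . . . . . . . .
C   . . 1 . . . . 1 . 1 . . .
R   . . . . . 1 . . . . 1 . .
B   . 1 . . . . . . . . . . .
P   . . . . . . . . . . . 1 .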
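
The dot plot above can be generated directly from the two Needleman-Wunsch test strings. A minimal sketch, where the function names `dot_plot` and `render` are made up for this example:

```python
def dot_plot(seq_x, seq_y):
    """Identity matrix: one row per character of seq_y, one column per
    character of seq_x, 1 where the characters are identical, else 0."""
    return [[1 if a == b else 0 for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Format the dot plot as text, with '.' for non-identical cells."""
    lines = ["    " + " ".join(seq_x)]
    for label, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(label + "   " + " ".join("1" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("ABCNYRQCLCRPM", "AYCYNRCKCRBP"))
```

Runs of 1's along a diagonal correspond to stretches that align without gaps; a jump between diagonals corresponds to an insertion or deletion.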

Scoring the alignment

• The idea is to go through the matrix and find a shortest path to the bottom (it is actually done from the bottom backwards)

• Caveat 1: This path also needs to have the highest score

• Caveat 2: We have to score the gaps, since a gap is not a residue in either protein; it stands for an insertion or deletion

Global alignment by dynamic programming

Sequence X: MONTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:

         M   O   N   T   A   N   A
     0  -6 -12 -18 -24 -30 -36 -42
M   -6   5  -1  -7 -13 -19 -25 -31
O  -12  -1  10   4  -2  -8 -14 -20
N  -18  -7   4  15   9   3  -3  -9
T  -24 -13  -2   9  20  14   8   2
A  -30 -19  -8   3  14  25  19  13
N  -36 -25 -14  -3   8  19  30  24
A  -42 -31 -20  -9   2  13  24  35

Optimum alignment score: 35
X: MONTANA
Y: MONTANA

What about gaps?

Sequence X: MONTTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:

         M   O   N   T   A   N   A
     0  -6 -12 -18 -24 -30 -36 -42
M   -6   5  -1  -7 -13 -19 -25 -31
O  -12  -1  10   4  -2  -8 -14 -20
N  -18  -7   4  15   9   3  -3  -9
T  -24 -13  -2   9  20  14   8   2
T  -30 -19  -8   3  14  18  12   6
A  -36 -25 -14  -3   8  19  16  17
N  -42 -31 -20  -9   2  13  24  18
A  -48 -37 -26 -15  -4   7  18  29

Optimum alignment score: 29
X: MONTTANA
Y: MON-TANA
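
The two matrices above can be reproduced with a short global-alignment routine. This is a sketch of Needleman-Wunsch-style dynamic programming with a linear gap penalty and the slide's scoring values as defaults. The function name and the tie-breaking order in the traceback (diagonal first, then a gap in Y) are choices made for this example, so other equally scoring alignments may exist.

```python
def needleman_wunsch(x, y, match=5, mismatch=-2, gap=-6):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x

    # Trace back from the bottom-right corner to recover one optimal path.
    aligned_x, aligned_y = [], []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return F[len(x)][len(y)], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    for seq_x, seq_y in [("MONTANA", "MONTANA"), ("MONTTANA", "MONTANA")]:
        score, ax, ay = needleman_wunsch(seq_x, seq_y)
        print(score)
        print("X:", ax)
        print("Y:", ay)
```

Filling the matrix costs time and memory proportional to the product of the two sequence lengths, which is fine for single proteins but too slow for searching whole databases.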

Scoring “real-life” alignments

Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:

         M   O   N   T   A   N   A   G   R   I   Z   Z   L   I   E   S
     0  -6 -12 -18 -24 -30 -36 -42 -48 -54 -60 -66 -72 -78 -84 -90 -96
M   -6   5  -1  -7 -13 -19 -25 -31 -37 -43 -49 -55 -61 -67 -73 -79 -85
O  -12  -1  10   4  -2  -8 -14 -20 -26 -32 -38 -44 -50 -56 -62 -68 -74
N  -18  -7   4  15   9   3  -3  -9 -15 -21 -27 -33 -39 -45 -51 -57 -63
T  -24 -13  -2   9  20  14   8   2  -4 -10 -16 -22 -28 -34 -40 -46 -52
A  -30 -19  -8   3  14  25  19  13   7   1  -5 -11 -17 -23 -29 -35 -41
N  -36 -25 -14  -3   8  19  30  24  18  12   6   0  -6 -12 -18 -24 -30
A  -42 -31 -20  -9   2  13  24  35  29  23  17  11   5  -1  -7 -13 -19
B  -48 -37 -26 -15  -4   7  18  29  33  27  21  15   9   3  -3  -9 -15
O  -54 -43 -32 -21 -10   1  12  23  27  31  25  19  13   7   1  -5 -11
B  -60 -49 -38 -27 -16  -5   6  17  21  25  29  23  17  11   5  -1  -7
C  -66 -55 -44 -33 -22 -11   0  11  15  19  23  27  21  15   9   3  -3
A  -72 -61 -50 -39 -28 -17  -6   5   9  13  17  21  25  19  13   7   1
T  -78 -67 -56 -45 -34 -23 -12  -1   3   7  11  15  19  23  17  11   5
S  -84 -73 -62 -51 -40 -29 -18  -7  -3   1   5   9  13  17  21  15  16

Optimum alignment score: 16
X: MONTANA--BOBCATS
Y: MONTANAGRIZZLIES

The scoring depends on our choice of parameters

Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match (B=G); -2 for mismatch; -6 for gap

Optimum alignment score: 23
X: MONTANAB--OBCATS
Y: MONTANAGRIZZLIES

Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -1 for gap

Optimum alignment score: 26
X: MONTANA--------BOBCATS
Y: MONTANAGRIZZLIE------S
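
The effect of the parameters can be checked by making the scoring rule a plug-in. The sketch below computes only the optimum global score (no traceback) and keeps just two matrix rows in memory. The function names and the callback interface are assumptions for this example; treating B and G as equivalent is the slide's own illustration.

```python
def global_score(x, y, pair_score, gap):
    """Optimum global alignment score with a linear gap penalty.

    pair_score(a, b) returns the score for aligning residues a and b.
    """
    prev = [j * gap for j in range(len(y) + 1)]       # row for the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i * gap]
        for j, b in enumerate(y, start=1):
            curr.append(max(prev[j - 1] + pair_score(a, b),  # align a with b
                            prev[j] + gap,                   # gap in y
                            curr[j - 1] + gap))              # gap in x
        prev = curr
    return prev[-1]


def identity(a, b):
    """5 for identical residues, -2 otherwise."""
    return 5 if a == b else -2


def identity_b_equals_g(a, b):
    """Same as identity, but B and G also count as a match."""
    return 5 if a == b or {a, b} == {"B", "G"} else -2


if __name__ == "__main__":
    x, y = "MONTANABOBCATS", "MONTANAGRIZZLIES"
    print(global_score(x, y, identity, gap=-6))             # 16
    print(global_score(x, y, identity_b_equals_g, gap=-6))  # 23
    print(global_score(x, y, identity, gap=-1))             # 26
```

With a gap penalty of -1, two gaps cost the same as one mismatch, so the optimum can open many gaps to avoid every mismatch. That is what produces the stretched-out alignment shown above.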

How do we choose good scoring parameters?

• A simple scoring scheme considers only sequence identity

• More realistic scoring schemes consider sequence similarity, which is taken from substitution matrices

• We measure the frequency of residue substitutions and normalize it by the frequency expected from residue abundance in the database (a log-odds score: log2 of observed over expected)

• Zero in a substitution matrix means that the substitution occurs about as often as expected by chance

• A score less than zero means that the substitution occurs less often than expected by chance

• There is no universally good matrix
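
The log-odds idea behind substitution matrices fits in a few lines. The sketch below follows the BLOSUM convention as I understand it: the expected frequency of an unordered pair is p_i^2 for identical residues and 2 p_i p_j otherwise, and scores are rounded values of 2 x log2(observed / expected), i.e. half-bit units. All frequencies in the example are invented for illustration and are not taken from any real database.

```python
import math


def expected_pair_freq(freq_a, freq_b, same_residue):
    """Chance frequency of an unordered residue pair, given the two
    background residue frequencies."""
    return freq_a * freq_b if same_residue else 2 * freq_a * freq_b


def log_odds(observed, expected, scale=2.0):
    """Rounded log-odds score; scale=2.0 gives half-bit units."""
    return round(scale * math.log2(observed / expected))


if __name__ == "__main__":
    # Hypothetical numbers: two residues with background frequency 0.05 each,
    # whose pair is seen in 2% of aligned columns.
    expected = expected_pair_freq(0.05, 0.05, same_residue=False)  # 0.005
    print(log_odds(0.02, expected))    # seen more often than chance: positive
    print(log_odds(0.005, expected))   # seen as often as chance: zero
    print(log_odds(0.001, expected))   # seen less often than chance: negative
```

Which alignments the observed frequencies are counted from is what separates one matrix from another, and it is why no single matrix suits every comparison.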

BLOSUM62 substitution matrix

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

What is the limit of substitution matrices?

Why did substitution matrices fail?

• The proteins in question are very distantly related and their substitution patterns are not properly rewarded by general matrices

• Substitution matrices do not capture all families equally well because they are meant to be general

• Can we build protein family-specific substitution matrices?

• Yes, these are known as protein family profiles

How do we build protein family-specific matrix?

Search protein database using BLOSUM62 matrix

Build protein-family specific matrix (profile) and search protein database again

???

Detecting distant relationships using profiles (PSI-BLAST)

Position-specific scoring matrix (PSSM)

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
T  -1  0  1  2 -2  0  1 -1  0 -2 -2  0 -1 -2 -1  0  3 -1 -1 -1
M  -1 -2 -3 -3 -1 -2 -3 -3 -2  1  1 -2  7  0 -3 -3 -2 -1 -1  0
D  -1  2  0  2 -3  3  1 -1  0 -3 -3  1 -2 -3 -1  0  0 -2 -2 -2
V  -1 -4 -4 -5 -1 -4 -4 -5 -3  4  1 -4  0 -1 -4 -4 -1 -2 -2  5
I  -1 -4 -5 -6 -1 -4 -5 -6 -4  5  1 -5  0 -1 -5 -5 -2 -2 -2  4
S   0 -1  0 -1 -1 -1 -1 -1 -1 -3 -3 -1 -2 -3 -2  4  3 -3 -2 -2
F  -2 -4 -4 -5 -2 -4 -5 -5 -2  1  2 -4  1  6 -4 -4 -3  2  2  0
K  -1  3 -1 -1 -4  1  0 -2  0 -3 -3  5 -2 -4 -2 -1 -1 -3 -3 -3
L  -2 -4 -5 -6 -1 -4 -5 -6 -3  3  4 -4  2  1 -4 -5 -3 -1 -1  1
P  -1 -3 -3 -2 -4 -3 -2 -3 -3 -4 -4 -2 -4 -4  7 -2 -2 -4 -4 -3
P  -1 -1 -1  2 -4 -1  0 -1 -1 -4 -4 -1 -4 -4  6 -1 -1 -3 -3 -3
E   0  0  0  2 -4  1  4 -2 -1 -3 -3  1 -2 -4 -1 -1 -1 -3 -3 -2
L  -2 -3 -4 -5 -2 -3 -4 -5 -2  2  4 -4  6  1 -4 -4 -2 -1 -1  1
N  -1  0  3  0 -2  0  2 -1  0 -2  0  1 -1 -2 -1  0  0 -1 -1 -2
A   2  1 -1 -1 -1  0  0 -2  0  0  0  0  0 -1 -1 -1  0  0  0  0
K  -1  1 -1 -1 -3  1  0 -2  0 -2  0  4 -1 -2 -2 -1 -1 -2 -1 -1
L  -2 -4 -5 -5 -2 -4 -5 -5 -3  1  5 -4  1  1 -4 -5 -3 -1 -1  1
E  -1  0  2  4 -4  0  3 -1  0 -5 -4  0 -4 -4 -1  0 -1 -3 -3 -4
S   1  1  0  0 -3  3  1 -1  0 -2 -2  1 -2 -2 -1  1  0 -2 -1 -2
V  -1 -1 -2 -3 -1 -2 -3 -3  0  2  1 -2  1  2 -2 -2 -1  1  3  3
A   5 -3 -3 -2  0 -2 -2  0 -2 -2 -2 -2 -2 -3 -2  0 -1 -3 -3 -1
L  -1  0 -1 -2 -1 -1 -1 -2  0  2  1  0  1  0 -2  0  0  1  2  0
K  -1  2  0  0 -3  1  2 -1  0 -3 -3  3 -2 -4 -1  1  0 -3 -2 -2
E  -1  1  0  0 -2  2  2 -2  3 -2  0  1 -1 -2 -1 -1  0 -1 -1 -1
K  -1  1  1  0 -3  0  0  2  0 -4 -3  3 -3 -3 -1  0 -1 -2 -2 -3
K  -1  0 -1 -1 -2  0  0 -3  0  0 -1  3  3 -1 -2 -1  0 -1  0  1
S  -1 -1  2  0 -2  0  0 -1  0 -3 -3  0 -2 -3 -1  3  2 -2 -2 -3
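
A position-specific matrix like the one above gives every alignment column its own row of 20 scores. The sketch below builds such a matrix from a small, ungapped set of aligned family members and then slides it along a query. It is a deliberately simplified illustration and not PSI-BLAST's actual procedure, which also weights sequences and derives pseudocounts from a substitution matrix. The uniform background, the flat pseudocount of 1 and the toy sequences are all assumptions made for this example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """One {residue: score} dict per alignment column.

    Scores are rounded 2 * log2(column frequency / background frequency),
    with a uniform background of 1/20 (an assumption of this sketch).
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in aligned_seqs:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: round(2 * math.log2((counts[aa] / total) / background))
                     for aa in ALPHABET})
    return pssm


def score_window(pssm, window):
    """Sum of position-specific scores for a window as wide as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, seq):
    """(score, start) of the best-scoring ungapped window in seq."""
    width = len(pssm)
    return max((score_window(pssm, seq[s:s + width]), s)
               for s in range(len(seq) - width + 1))


if __name__ == "__main__":
    family = ["TMDV", "TMEV", "TLDV", "SMDV"]   # toy alignment, invented
    pssm = build_pssm(family)
    print(score_window(pssm, "TMDV"))
    print(best_hit(pssm, "GGTMDVGG"))
```

Unlike a general matrix such as BLOSUM62, the same residue can score differently at different positions, so conserved columns dominate the total.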

Combining sequence similarity with SS information

Can we quickly scan for common protein families?

• YES, many databases available

• Instead of comparing our query to other sequences, we compare it to the database of profiles (also called Hidden Markov Models or HMMs)

• Profiles (and HMMs) capture the average preference for residues at all positions; They are probabilistic representations of protein families

• Try these databases:
– PFAM (http://pfam.wustl.edu)
– SMART (http://smart.embl-heidelberg.de)


Profile HMMs have other uses

• Profile HMMs represent a phylogenetic footprint of a given protein family
– Also used for secondary structure prediction
– Predictions of trans-membrane proteins
– Prediction of protein disorder

• Most of these predictors are based on machine-learning algorithms that are trained on known data and can extract subtle patterns

Questions?

Mensur Dlakic
Department of Microbiology

111 Lewis Hall
Tel: 994-6576

[email protected]
office: 109 Lewis Hall
