A BioInformatics Survey . . . just a taste.

Steve Thompson

Florida State University School of Computational Science and Information Technology (CSIT)

Florida State University School of Information Studies

LIS 4722: Information Representation

Shawne Miksa

Spring Semester, March 20 & 22, 2001

What is bioinformatics, genomics, sequence analysis, computational molecular biology . . . ?

The Reverse Biochemistry Analogy.

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA!

The computer and molecular databases are a necessary, integral part of this entire process.

High quality training is essential!

Why: graduates need to be competitive on the world biotechnology market.

A perusal of employment listings in scientific journals or e-news groups (e.g. http://net.bio.net/hypermail/biojobs/ and http://www.genomeWeb.com/careers/jobs.asp) clearly illustrates this point; over half are often bioinformatics/biocomputing type positions. An Alfred P. Sloan Foundation Report from a couple of years ago, "Hiring Patterns Experienced by Students Enrolled in Bioinformatics/Computational Biology Programs" (http://www.sloan.org/programs/scitech_page1.htm, May 1999), provides some early insights into the trend.

The biotechnology sector, either in academia or in commerce, especially the pharmaceutical industry, is the obvious employer, but opportunities abound in fields as diverse as hospital administration and genetic counseling, large scale sequencing centers, and software development companies. There is no lack of incentive and the situation is unlikely to change for quite some time, especially with the completion of so many genome projects on the horizon. All that newly sequenced DNA needs to be analyzed and annotated; it is, and will continue to be, an enormous job. This is the essence of 'data-mining.'

Definitions:

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database (more later).

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

The exponential growth of molecular sequence databases & cpu power.

Year    BasePairs  Sequences
1982       680338        606
1983      2274029       2427
1984      3368765       4175
1985      5204420       5700
1986      9615371       9978
1987     15514776      14584
1988     23800000      20579
1989     34762585      28791
1990     49179285      39533
1991     71947426      55627
1992    101008486      78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984    1021211
1997   1160300687    1765847
1998   2008761784    2837897
1999   3841163011    4864570
2000  11101066288   10106023

Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
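To put a number on "exponential," the base pair column can be turned into an average doubling time. The following is only a minimal sketch: the function name doubling_time_years is illustrative, and it assumes steady exponential growth between the two endpoint years, which the table only approximately supports.

```python
import math

# Base pair totals by year, copied from the GenBank growth table.
GENBANK_BASE_PAIRS = {
    1982: 680338, 1983: 2274029, 1984: 3368765, 1985: 5204420,
    1986: 9615371, 1987: 15514776, 1988: 23800000, 1989: 34762585,
    1990: 49179285, 1991: 71947426, 1992: 101008486, 1993: 157152442,
    1994: 217102462, 1995: 384939485, 1996: 651972984, 1997: 1160300687,
    1998: 2008761784, 1999: 3841163011, 2000: 11101066288,
}


def doubling_time_years(totals, start_year, end_year):
    """Average doubling time between two years.

    Assumes steady exponential growth between the endpoints,
    which is an idealization of the real curve.
    """
    growth_factor = totals[end_year] / totals[start_year]
    doublings = math.log2(growth_factor)
    return (end_year - start_year) / doublings


if __name__ == "__main__":
    years = doubling_time_years(GENBANK_BASE_PAIRS, 1982, 2000)
    print(f"Average doubling time 1982-2000: {years:.2f} years")
```

Over 1982 to 2000 this works out to a doubling time of roughly 1.3 years.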

Database Growth (cont.)

The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of February 2001, 45 complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published in mid-February of this year in the journals Science and Nature.

Some neat stuff from the papers:

We, Homo sapiens, aren't nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional, textbook estimates of the number of genes were often in the 100,000 range; it turns out we've only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of the genome is only about 1% or so; much of the remainder is "jumping," "selfish DNA," of which much may be involved in regulation and control.

Between 100 and 200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome!

What are these databases like?

What are primary sequences? (Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the "symbol" information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence; however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.
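As a small illustration of "one letter codes and their ambiguity codes," here is a hedged sketch that checks a nucleotide string against the standard IUPAC alphabet. The names IUPAC_NUCLEOTIDE, invalid_symbols and is_ambiguous are made up for this example and are not taken from any database software.

```python
# One letter nucleotide codes, including the standard IUPAC ambiguity codes.
# Each code maps to the set of bases it can stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def invalid_symbols(sequence):
    """Return the sorted symbols in `sequence` that are not IUPAC codes."""
    return sorted({s for s in sequence.upper() if s not in IUPAC_NUCLEOTIDE})


def is_ambiguous(sequence):
    """True if any symbol stands for more than one possible base.

    Assumes the sequence has already passed invalid_symbols().
    """
    return any(len(IUPAC_NUCLEOTIDE[s]) > 1 for s in sequence.upper())


if __name__ == "__main__":
    print(invalid_symbols("GATTACA"))   # no invalid symbols
    print(is_ambiguous("GATNACA"))      # N can be any base
```

Protein sequences would need their own alphabet; only the nucleotide case is sketched here.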

What are sequence databases?

These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely 'mirror' one another.

North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. Also Georgetown University's NBRF Protein Identification Resource: PIR & NRL_3D.

Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot.

Asia: The DNA Data Bank of Japan (DDBJ).

Content & Organization.

Most sequence databases are examples of complex ASCII/Binary databases, but usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help 'glue together' all of these other files by providing indexing functions. Software is usually required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure. Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical).
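The idea of an index that 'glues together' long text files can be sketched in a few lines. This is an illustration only, not how GenBank or EMBL actually lay out their files: it assumes a FASTA-style flat file in which each entry starts with a ">" title line, and the function names build_offset_index and fetch are hypothetical.

```python
import io


def build_offset_index(handle):
    """Map each entry name to the byte offset where its record starts.

    Assumes a FASTA-style layout: a title line starting with '>'
    followed by one or more sequence lines.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode("ascii")
            index[name] = offset
    return index


def fetch(handle, index, name):
    """Jump straight to one entry and return its sequence as a string."""
    handle.seek(index[name])
    handle.readline()  # skip the title line
    pieces = []
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        pieces.append(line.strip())
    return b"".join(pieces).decode("ascii")


if __name__ == "__main__":
    # A tiny in-memory stand-in for a very long flat file.
    flat_file = io.BytesIO(b">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n")
    index = build_offset_index(flat_file)
    print(index)
    print(fetch(flat_file, index, "seq2"))
```

The index is built once by scanning the whole file; after that, any entry can be read with a single seek instead of a scan, which is the job the binary index files do for the real databases.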

What about other types of biological databases?

Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.

Still more; these can be considered 'non-molecular':

Reference Databases: e.g. OMIM (Online Mendelian Inheritance in Man) and PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that most biocomputing people don't even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how does one do Bioinformatics?

Often on the InterNet over the World Wide Web:

Site                         URL (Uniform Resource Locator)                            Content
Nat'l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/                              databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/                           protein sequence database
Johns Hopkins BioInfo'       http://www.bis.med.jhmi.edu/bioInformatics.html           databases/analysis/software
Harvard Bio' Laboratories    http://golgi.harvard.edu/                                 databases/analysis/software
IUBIO Biology Archive        http://iubio.bio.indiana.edu/                             database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/                          database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                                  databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/                            databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                                     databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                                  databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                                     databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                                  3D mol' structure database
Molecules R Us               http://molbio.info.nih.gov/cgi-bin/pdb                    3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                                       The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/                           various genome projects
Inst. for Genomic Res'rch    http://www.tigr.org/                                      esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                                  HIV epidemiology seq' DB
The Baylor Search Launch     http://searchlauncher.bcm.tmc.edu/                        sequence search launcher
Pedro's BioMol Res' Tools    http://www.public.iastate.edu/~pedro/research_tools.html  big bookmark list
BioToolKit                   http://www.biosupplynet.com/cfdocs/btk/btk.cfm            annotated molbio tool links
Felsenstein's PHYLIP site    http://evolution.genetics.washington.edu/phylip.html      phylogenetic inference
The Tree of Life             http://phylogeny.arizona.edu/tree/phylogeny.html          overview of all phylogeny
Ribosomal Database Proj'     http://www.cme.msu.edu/RDP/                               databases/analysis/software
WIT Metabolism               http://wit.mcs.anl.gov/WIT2                               metabolic reconstructions
BIOSCI/BIONET                http://net.bio.net                                        biologists' news groups
Access Excellence            http://www.accessexcellence.org/                          biology teaching and learning
CELLS alive!                 http://www.cellsalive.com/                                animated microphotography
Genetics Computer Group      http://www.gcg.com/                                       Wisconsin S.A. Package

NCBI's BLAST & Entrez, EMBL's SRS, + GCG's SeqLab and LookUp, phylogenetics . . .

What about Homology?

Inference through homology is a fundamental principle of biology!

What is homology? In this context it is similarity great enough that common ancestry is implied. Walter Fitch, a famous molecular evolutionist, likes to relate the analogy: homology is like pregnancy, you either are or you're not; there's no such thing as 65% pregnant!

How to see similarities: Pairwise Comparisons.

The Dot Matrix Method.

Provides a 'Gestalt' of all possible alignments between two sequences.

To begin, use a very simple zero:one (no match:match) identity scoring function.

Put a dot wherever symbols match.

Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

Noise due to random composition effects contributes to confusion. To 'clean up' the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some 'stringency' is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
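The windowed dot matrix described above can be sketched in a few lines. This is a minimal illustration, not the code behind the plots on the slides; the function names dot_plot and render are hypothetical, and dots are only placed where a full window fits inside both sequences.

```python
def dot_plot(seq_a, seq_b, window=1, stringency=1):
    """Return a grid of booleans; grid[i][j] is True where a dot belongs.

    With window=1 and stringency=1 this is the plain identity plot.
    Otherwise a dot goes at the middle of every window position in which
    at least `stringency` of the `window` symbol pairs match.
    """
    grid = [[False] * len(seq_b) for _ in seq_a]
    middle = window // 2
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            if matches >= stringency:
                grid[i + middle][j + middle] = True
    return grid


def render(grid, seq_a, seq_b):
    """Draw the grid as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for symbol, row in zip(seq_a, grid):
        lines.append(symbol + " " + "".join("." if hit else " " for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(dot_plot("GATTACA", "GATTACA"), "GATTACA", "GATTACA"))
    print()
    filtered = dot_plot("GATTACA", "GATTACA", window=3, stringency=2)
    print(render(filtered, "GATTACA", "GATTACA"))
```

Running both calls on the same pair of sequences shows the effect of the filter: off-diagonal dots from chance single-symbol matches drop out, while the main diagonal survives.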

The phenylalanine transfer RNA molecule from yeast plotted against itself using a window size of 7 and a stringency value of 5. As a general guide, pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is filtered away just enough to enable you to see that desired feature.

RNA comparisons of the reverse-complement of a sequence to itself can often be very informative. The yeast tRNA sequence is compared to its reverse-complement using the same 5 out of 7 stringency setting as previously. The stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal running in the opposite direction from before.

    22 GAGCGCCAGACT G
       || | ||||| |  A
    48 CTGGAGGTCTAG A

Base position 22 through position 33 base pairs with (think: is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker's RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB's 1TRA shows 'reality' lies somewhere in between.
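The "think: reverse-complement" hint can be checked directly on the two stem fragments in the figure. This is a small sketch under stated assumptions: the fragments are typed in from the slide (the lower strand is reversed so that both read 5' to 3'), and the helper names reverse_complement and identities are illustrative.

```python
# Complement table for the DNA-style alphabet used on the slide.
# U is accepted on input and treated like T.
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(sequence):
    """Complement every base, then reverse the order."""
    return sequence.translate(COMPLEMENT)[::-1]


def identities(seq_a, seq_b):
    """Count positions at which two equal-length strings agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


# Stem fragments as read from the figure.
STEM_22_TO_33 = "GAGCGCCAGACT"
STEM_37_TO_48 = "GATCTGGAGGTC"  # the figure shows this strand from 48 down to 37

if __name__ == "__main__":
    partner = reverse_complement(STEM_37_TO_48)
    print(STEM_22_TO_33)
    print(partner)
    print(identities(STEM_22_TO_33, partner), "of", len(partner), "identical")
```

A plain complement table finds 8 identities out of 12 here. The figure marks one more pairing, a T opposite a G in the last column, which strict complementarity does not count.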

Pairwise Comparisons: Dynamic Programming.

A 'brute force' approach just won't work. The computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences, without considering gaps at all. If the two sequences are approximately the same length (N), this is an N^2 problem. To include gaps, the calculation needs to be repeated 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem.

Therefore, an optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

1) you maximize the number of matching symbols between 1 and 2;

2) you minimize the number of indels within 1 and 2; and

3) you minimize the number of mismatched symbols between 1 and 2.

Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

Where S_ij is the score for the alignment ending at i in sequence 1 and j in sequence 2,

s_ij is the score for aligning i with j,

w_x is the score for making an x long gap in sequence 1,

w_y is the score for making a y long gap in sequence 2,

allowing gaps to be any length in either sequence.

An oversimplified example:

total penalty = gap opening penalty {zero here} + ([length of gap] × [gap extension penalty {one here}])
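With a gap opening penalty of zero, the gap cost is simply proportional to gap length, and the general recurrence collapses to the familiar three-way choice at each cell (diagonal, up, left). The sketch below uses that simplified Needleman-Wunsch form rather than the full any-length-gap recurrence. The match and mismatch values (1 and 0), the penalizing of end gaps, and the function name global_align are assumptions for illustration, not the scoring used for the example on the slides.

```python
def global_align(seq1, seq2, match=1, mismatch=0, gap_extend=1):
    """Global alignment with a gap cost of gap_extend per gap position.

    Returns (score, aligned_seq1, aligned_seq2), with '.' marking gaps.
    Only one optimal alignment is reported, even if several exist.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = -i * gap_extend      # leading gap in seq2
    for j in range(1, cols):
        score[0][j] = -j * gap_extend      # leading gap in seq1

    # Fill the matrix from the top left corner downwards.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,       # align the two symbols
                score[i - 1][j] - gap_extend,     # gap in seq2
                score[i][j - 1] - gap_extend,     # gap in seq1
            )

    # Trace back from the bottom right corner, preferring the diagonal.
    out1, out2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap_extend:
            out1.append(seq1[i - 1])
            out2.append(".")
            i -= 1
        else:
            out1.append(".")
            out2.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    total, top, bottom = global_align("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

The traceback here always tries the diagonal first, so it reports a single path. Changing that preference order is one way to surface other, equally scoring alignments.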

Optimum Alignments

There will probably be more than one best path through the matrix, and none of them may be the biologically CORRECT alignment. Starting at the top and working down as we did, then tracing back, I found two optimum alignments:

    cTATAtAagg
    |  |||||
    cg.TAtAaT.

    cTATAtAagg
    |   ||||
    cgT.AtAaT.

Each of these solutions yields a traceback total score of 22. This is the number optimized by the algorithm, not any type of similarity or identity score! Even though one of these alignments has 6 exact matches and the other has 5, they are both optimal according to the relatively strange criteria by which we solved the algorithm. Software will report only one of these solutions. Do you have any ideas about how others could be discovered? Answer: often, if you reverse the solution of the entire dynamic programming process, other solutions can be found!

Global versus local solution: use negative numbers in the match matrix and pick the best diagonal within the overall graph.
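One common way to realize the local idea is the Smith-Waterman scheme, named here because the slide itself does not name a method: mismatches score negative, no cell is allowed to fall below zero, and the best cell anywhere in the matrix marks the end of the best local alignment. The sketch below is a minimal version under those assumptions; the scoring values and the function name local_align_score are illustrative.

```python
def local_align_score(seq1, seq2, match=1, mismatch=-1, gap=1):
    """Smith-Waterman style local alignment score.

    Returns (best_score, (end_in_seq1, end_in_seq2)) using 1-based end
    positions, or (0, (0, 0)) when nothing scores above zero.
    """
    best, best_end = 0, (0, 0)
    previous = [0] * (len(seq2) + 1)
    for i in range(1, len(seq1) + 1):
        current = [0]
        for j in range(1, len(seq2) + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cell = max(
                0,                       # start a fresh local alignment
                previous[j - 1] + pair,  # extend along the diagonal
                previous[j] - gap,       # gap in seq2
                current[j - 1] - gap,    # gap in seq1
            )
            current.append(cell)
            if cell > best:
                best, best_end = cell, (i, j)
        previous = current
    return best, best_end


if __name__ == "__main__":
    print(local_align_score("GGGACGTCCC", "TTTACGTAAA"))
```

In this example the two sequences share only the stretch ACGT, so the best local score is 4, ending at position 7 in both sequences; a global alignment would be forced to pay for the unrelated flanks.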

What about proteins? Conservative replacements and similarity as opposed to identity. The nitrogenous bases, A, C, T, G, are either the same or they're not, but amino acids can be similar, genetically and structurally!

On the original slide, values whose magnitude is at least 4 are drawn in outline characters to make them easier to recognize. Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as -4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

BLOSUM62 amino acid substitution matrix.

Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

{GAP_CREATE 12  GAP_EXTEND 4}

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
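To see how a substitution matrix and the two gap parameters combine, here is a hedged sketch that scores an already aligned pair of protein strings. Only a handful of BLOSUM62 values are copied in from the table (for W, C, P, Y and F), the gap cost follows the form "creation penalty plus length times extension penalty" with the GAP_CREATE 12 and GAP_EXTEND 4 values shown above, and the names BLOSUM62_SUBSET and score_alignment are made up for this example.

```python
# A few BLOSUM62 values copied from the table (not the full matrix).
_PAIRS = {
    ("W", "W"): 11, ("C", "C"): 9, ("P", "P"): 7, ("Y", "Y"): 7, ("F", "F"): 6,
    ("W", "C"): -2, ("W", "P"): -4, ("W", "Y"): 2, ("W", "F"): 1,
    ("C", "P"): -3, ("C", "Y"): -2, ("C", "F"): -2,
    ("P", "Y"): -3, ("P", "F"): -4, ("Y", "F"): 3,
}

# Store both orders so lookups do not depend on which row a residue is in.
BLOSUM62_SUBSET = {}
for (first, second), value in _PAIRS.items():
    BLOSUM62_SUBSET[(first, second)] = value
    BLOSUM62_SUBSET[(second, first)] = value


def score_alignment(row1, row2, matrix=BLOSUM62_SUBSET,
                    gap_create=12, gap_extend=4, gap_symbols=".-"):
    """Score two aligned rows of equal length.

    Each run of gap symbols in one row costs
    gap_create + (run length * gap_extend).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(row1, row2):
        if x in gap_symbols or y in gap_symbols:
            row = 0 if x in gap_symbols else 1
            if row != gap_row:
                total -= gap_create   # a new gap is being opened
                gap_row = row
            total -= gap_extend
        else:
            total += matrix[(x, y)]
            gap_row = None
    return total


if __name__ == "__main__":
    print(score_alignment("WCPY", "WCPY"))
    print(score_alignment("WCPY", "WC.F"))
```

Note how expensive a single gap is compared with even the best identity score: one gap of length one costs 16, more than a conserved tryptophan earns.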

Significance: When is an Alignment Worth Anything Biologically?

Monte Carlo simulations:

Z score = [ ( actual score ) - ( mean of randomized scores ) ] / ( standard deviation of randomized score distribution )

Many Z scores measure the distance from a mean using a simplistic Monte Carlo model assuming a normal distribution, in spite of the fact that ‘sequence-space’ actually follows what is known as an ‘extreme value distribution;’ however, the Monte Carlo method does approximate significance estimates pretty well.
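The Z score calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity score is a toy stand-in (length of the longest common substring) rather than a real alignment score, the two sequences are made up, and the function names and the number of shuffles are hypothetical choices.

```python
import random
import statistics


def match_score(seq_a, seq_b):
    """Toy similarity score: length of the longest common substring.

    Stands in for a real alignment score purely for illustration.
    """
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def monte_carlo_z(seq_a, seq_b, shuffles=200, seed=0):
    """Z score of the actual score against scores of shuffled copies of seq_b."""
    rng = random.Random(seed)
    actual = match_score(seq_a, seq_b)
    letters = list(seq_b)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        randomized.append(match_score(seq_a, "".join(letters)))
    mean = statistics.mean(randomized)
    stdev = statistics.stdev(randomized)
    if stdev == 0:
        raise ValueError("randomized scores do not vary; Z score undefined")
    return (actual - mean) / stdev


# Two made-up sequences sharing the stretch GHVDHGKSTL.
z = monte_carlo_z("MGHVDHGKSTLTAALTKVLAEKG", "AKGHVDHGKSTLVGRLLYDTG")
print(round(z, 1))
```

Shuffling keeps the composition of the second sequence but destroys its order, so the randomized scores estimate what chance alone would produce.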

Pairwise Comparisons: Database Searching

BLAST: Basic Local Alignment Search Tool, developed at NCBI.

1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);

2) Prefilters repeat and “low complexity” sequence regions;

4) Can find more than one region of gapped similarity;

5) Very fast heuristic and parallel implementation;

6) Restricted to precompiled, specially formatted databases;

FastA, and its family of relatives, developed by Bill Pearson at the University of Virginia.

1) Works well for DNA against DNA searches (within limits of possible sensitivity);

2) Can find only one gapped region of similarity;

3) Relatively slow, should usually be run in the background;

4) Does not require specially prepared, preformatted databases.

Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database searching the heuristic restricts search space by calculating a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued.
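The look-up table idea can be sketched as follows. This is a simplified illustration, not how BLAST or FastA are actually coded: the function names, the word size of 3, and the example sequences are all assumptions, and real programs key words to numbers rather than using strings directly.

```python
from collections import defaultdict


def build_word_table(sequence, k):
    """Map every k-letter word in the sequence to its start positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table


def word_hits(query, subject, k):
    """Return (query_pos, subject_pos) pairs where a k-tuple matches exactly."""
    table = build_word_table(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def hits_per_diagonal(hits):
    """Count word hits on each diagonal (subject_pos - query_pos)."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)


hits = word_hits("GHVDHGKS", "AAGHVDHGKSTT", k=3)
print(hits_per_diagonal(hits))
```

The subject is scanned once, one dictionary look-up per word, which is where the saving over filling an entire dynamic programming matrix comes from. Many hits on one diagonal flag a region that deserves further scrutiny.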

Versions of each are available for DNA-DNA, DNA-protein, protein-DNA, and protein-protein searches. Translations are done ‘on the fly’ for mixed searches.

The algorithms:

BLAST:

Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.

Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

FastA:

Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.

Combine non-overlapping init regions on different diagonals: initn.

Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
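The ungapped extension step described for BLAST can be sketched roughly like this. It extends in one direction only, and the +1/-1 match and mismatch scores and the drop-off value are assumptions chosen for readability; BLAST itself scores protein extensions with a substitution matrix and has its own thresholds. The function name and example sequences are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, word_len,
                 match=1, mismatch=-1, drop_off=2):
    """Extend an exact word hit to the right without gaps.

    Extension stops once the running score falls more than `drop_off`
    below the best score seen so far. Returns (best_score, best_length),
    where best_length counts from the start of the seed word.
    """
    score = best = word_len * match  # the seed itself is an exact match
    length = best_len = word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop_off:
            break  # not improving enough; give up on this diagonal
        i += 1
        j += 1
    return best, best_len


# Seed word GHV at position 0 of both made-up sequences.
print(extend_right("GHVDHGKSAAAA", "GHVDHGKSTTTT", 0, 0, 3))
```

The segment reported is the best-scoring extent, not the point where extension stopped, so the trailing mismatches are trimmed off.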

Histogram Key:
Each histogram symbol represents 604 search set sequences
Each inset symbol represents 21 search set sequences
z-scores computed from opt scores

z-score    obs    exp
           (=)    (*)

 < 20      650      0:==
   22        0      0:
   24        3      0:=
   26       22      8:*
   28       98     87:*
   30      289    528:*
   32     1714   2042:===*
   34     5585   5539:=========*
   36    12495  11375:==================*==
   38    21957  18799:===============================*=====
   40    28875  26223:===========================================*====
   42    34153  32054:=====================================================*===
   44    35427  35359:==========================================================*
   46    36219  36014:===========================================================*
   48    33699  34479:======================================================== *
   50    30727  31462:=================================================== *
   52    27288  27661:=============================================*
   54    22538  23627:====================================== *
   56    18055  19736:============================== *
   58    14617  16203:========================= *
   60    12595  13125:=====================*
   62    10563  10522:=================*
   64     8626   8368:=============*=
   66     6426   6614:==========*
   68     4770   5203:========*
   70     4017   4077:======*
   72     2920   3186:=====*
   74     2448   2484:====*
   76     1696   1933:===*
   78     1178   1503:==*
   80      935   1167:=*
   82      722    893:=*
   84      454    707:=*
   86      438    547:*
   88      322    423:*
   90      257    328:*
   92      175    253:* :========= *
   94      210    196:* :=========*
   96      102    152:* :===== *
   98       63    117:* :=== *
  100       58     91:* :=== *
  102       40     70:* :== *
  104       30     54:* :==*
  106       17     42:* :=*
  108       14     33:* :=*
  110       14     25:* :=*
  112       12     20:* :*
  114        9     15:* :*
  116        6     12:* :*
  118        8      9:* :*
 >120     1030      7:*= :*=======================================

These are the best hits, those most similar sequences with a Pearson z-score greater than 120 in this search.

‘Sequence-space’ actually follows the ‘extreme value distribution.’

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated. The particulars of how BLAST and FastA do this differ, but the ‘take-home’ message is the same:

The higher the E value is, the more probable it is that the observed match is due to chance in a search of a database of the same size, and the lower its Z score will be, i.e. the match is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant the match is and the higher its Z score will be! The E value is the number that really matters.
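As a hedged illustration of how an E value behaves, the sketch below uses the Karlin-Altschul form E = K·m·n·e^(-λS) that underlies BLAST statistics, where m is the query length, n the database length, and S the raw score. The default λ and K values are illustrative assumptions only, since real programs derive them from the scoring system in use, and FastA estimates its expectation differently. The function names and the example search sizes are hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S) for a local alignment raw score.

    lam and k are illustrative defaults, not values from any real search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit scoring this well."""
    return 1.0 - math.exp(-e_value)


# The same query against the same made-up database size, two different scores.
for score in (50, 100):
    e = expect_value(score, query_len=300, db_len=100_000_000)
    print(score, e, p_value(e))
```

A higher score gives an exponentially smaller E value, and for small E the probability of a chance hit is nearly equal to E itself.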

Multiple Sequence Analysis: Multiple Sequence Alignment.

Dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix ideas . . . .

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.

Conserved regions can be visualized with a sliding window approach and appear as peaks. Let’s concentrate on the first peak seen here.
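A sliding window conservation plot can be approximated with the sketch below. The conservation measure (fraction of sequences sharing the most common residue in a column), the window width, and the three-sequence alignment are all assumptions for illustration; plotting programs typically use substitution-matrix-based similarity rather than simple identity.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # a column of nothing but gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def sliding_window(values, width):
    """Average the values over every window of the given width."""
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]


# A made-up alignment with one conserved stretch in the middle.
alignment = [
    "MAGHVDHGKSTL",
    "LTGHVDHGKSAV",
    "QKGHVDHGKTSI",
]
smoothed = sliding_window(column_conservation(alignment), width=4)
peak_start = smoothed.index(max(smoothed))
print(peak_start, max(smoothed))
```

The peak marks the first window that lies entirely inside the conserved stretch; wider windows smooth the plot at the cost of blurring short motifs.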

Motifs

GHVDHGKS

A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one-dimensional ‘pattern descriptors.’

PROSITE Database of protein families and domains - over 1,000 motifs.

This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine.

But motifs cannot convey any degree of the ‘importance’ of the residues.
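A one-dimensional pattern descriptor like the P-loop maps directly onto a regular expression. The sketch below translates (A,G)x4GK(S,T) by hand; the function name and the longer test sequence are hypothetical, and this is not the syntax PROSITE itself uses.

```python
import re

# (A,G)x4GK(S,T): A or G, any four residues, G, K, then S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(sequence):
    """Return (start, matched_text) for each non-overlapping P-loop match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(sequence)]


print(find_p_loops("GHVDHGKS"))       # the consensus from the slide
print(find_p_loops("MKKGHVDHGKSTL"))  # a made-up sequence containing it
print(find_p_loops("MKKGHVDHGRSTL"))  # K replaced by R: no match
```

The last example shows the limitation mentioned above: a pattern either matches or it does not, so a single conservative substitution in the invariant pair is treated the same as a completely unrelated sequence.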

Enter the Profile

Given a multiple sequence alignment, how can we use all of the information contained in it to find ever more remotely similar sequences, that is those “Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those BLAST/FastA E values above ~10^-5 or so?

Use a position specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter!

The threonine at position 27 is absolutely conserved; it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment; the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
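The core of the profile idea, substitution score times how often each residue occurs at a position, can be sketched as below. This is a deliberately stripped-down version: it uses a three-letter alphabet with BLOSUM62 values for D, E, and W, a made-up two-column alignment, and plain frequencies. Real profile software adds sequence weighting, scaling (hence scores like 150 and -87), and position-specific gap penalties, none of which is shown here. All names are hypothetical.

```python
from collections import Counter

ALPHABET = "DEW"
# BLOSUM62 values for aspartate, glutamate, and tryptophan.
SUBST = {
    ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("D", "E"): 2, ("D", "W"): -4, ("E", "W"): -3,
}


def subst(a, b):
    """Look up a residue pair in either order."""
    return SUBST[(a, b)] if (a, b) in SUBST else SUBST[(b, a)]


def build_profile(alignment):
    """One row per alignment column: residue -> frequency-weighted score."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        n = len(column)
        row = {a: sum(counts[b] / n * subst(a, b) for b in counts)
               for a in ALPHABET}
        profile.append(row)
    return profile


def score_against_profile(profile, sequence):
    """Ungapped score of a sequence laid directly onto the profile."""
    return sum(row[res] for row, res in zip(profile, sequence))


# Column 1 is an absolutely conserved D; column 2 is variable.
profile = build_profile(["DD", "DE", "DE", "DW"])
print(profile[0])  # conserved column: strong reward for D, strong penalty for W
print(profile[1])  # variable column: all scores close to zero
print(score_against_profile(profile, "DE"), score_against_profile(profile, "WE"))
```

The conserved column reproduces the slide's point in miniature: tryptophan in place of a fully conserved aspartate gets the most negative score, while the variable column barely distinguishes between residues.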

Advanced methodologies

Many wondrous things can be accomplished based on combinations of all the previous techniques.

PSI-BLAST uses profile methods to iterate database searches.

Profiles can be discovered in unaligned sequences to discover motifs using expectation maximization and/or hidden Markov model statistical methods.

Secondary structure can be predicted in many cases. See http://www.embl-heidelberg.de/predictprotein/predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at http://www.expasy.ch/swissmod/SWISS-MODEL.html.

Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide.

So what about training?

How do you get training in this field; what are you supposed to do?

Read all you can and explore the Web sites and, if you’re serious, get involved in one of the training programs, usually at the graduate level, around the country.

See the URL’s coming up . . .

What you can do here at FSU . . .

BioComputing Education

Six Major Proposal Foci at FSU:

Current Workshops: continue to offer and further expand GCG SeqLab workshop series; each session currently offered twice per semester.

Modules (such as this one): incorporate across the curricula within existing courses, interdisciplinary by nature.

Graduate Course: practical, project-oriented approach. Collaborate with proposed Math course.

Undergraduate Genomics Course: survey, practical WWW techniques, implications, & ethics.

Computational Molecular Biology Program: in association and cooperation with students’ present major department. Pros and Cons . . .

Summer Short Course: long-range. Participants from world-wide disparate disciplines learning bioinformatics techniques and theory.

GCG SeqLab Workshop Series

Presently four different sessions:

Intro to SeqLab & Multiple Sequence Analysis and its supplement

Rational Primer Design

Database Searching & Pairwise Comparisons: Significance

Molecular Evolutionary Phylogenetics

http://bio.fsu.edu/~stevet/workshop.html

FOR MORE INFO...

Education/Training Programs

I helped to develop one of the first at Washington State University. They are still relatively rare, but more appear all the time.

Biocomputing education URL’s:

http://linkage.rockefeller.edu/wli/bioinfocourse/

http://www.techfak.uni-bielefeld.de/bcd/Curric/syllabi.html

http://biotech.icmb.utexas.edu/pages/bioinform/biprograms_us.html

http://iscb.org/univ.html

http://bozeman.genome.washington.edu/compbio

http://www.csc.liv.ac.uk/~martyn/biosystems

http://130.88.90.2:8900

http://www.snarkware.org/bioedusoft/

See the listed references and WWW sites and participate in my bioinformatics workshop series.

FOR MORE INFO...

Contact CSIT (http://www.csit.fsu.edu/) for general questions; me (stevet@[email protected]) for specific bioinformatics assistance and/or collaboration.

Conclusions

Gunnar von Heijne in his quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.1, Madison, Wisconsin, USA 53711.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.

Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS, 10, 671-675.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.

von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.

Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.

Laboratory Exercise:

A structural variant of this simple protein molecule causes the notorious condition “Mad-Cow Disease.” Use the Web sites you learned about in this lecture to investigate:

What is the molecule’s name?

What is the name of the disease in human beings?

Is it caused by a virus or bacteria or other pathogen?

Do humans have a gene for this protein, and, if so, what is its name, where is it located, and what does it do?

What is the name of one of the most similar genes or proteins in non-vertebrates? Is this similarity significant?

Just explore; no credit.
