THE FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

DEVELOPING A BIOINFORMATICS UTILITY BELT TO ELIMINATE

SEARCH REDUNDANCY FROM THE EVER-GROWING DATABASES

By

MISHA TAYLOR

A Thesis submitted to the Department of Computer Science

in partial fulfillment of the requirements for the degree of

Master of Science

Degree Awarded: Spring Semester, 2003

Copyright © 2003 Misha Taylor

All Rights Reserved

The members of the Committee approve the thesis of Misha Taylor defended on April 1, 2003.

______________________________
Robert van Engelen
Professor Directing Thesis

______________________________
David Swofford
Committee Member

______________________________
Theodore Baker
Committee Member

______________________________
Steven M. Thompson
Committee Member

The Office of Graduate Studies has verified and approved the above named committee members

This thesis is dedicated to my dear friend Jane Sinagub, without whose support this work could not have been produced. We’ve been through some tough times, but I’m glad that we were able to make it through the years together as friends. I love you dearly, Jane, and I’m happy that you decided to let me play a bit part in your life.

To Paul, “Brother” Max, and “Ma Petite” Flora: you’re as much my family as Jane is, and I love you just as much.

Also, thanks to Deena Westbrook for keeping me company after bioinformatics class and keeping me entertained with chitchat while I was writing my thesis. It made the process a lot easier and more fun than it should have been.

ACKNOWLEDGMENTS

Steven Thompson taught me more about bioinformatics in six months than I could have learned on my own in six years. I am still amazed that he was willing to take on a computer science student with absolutely no background in biology whatsoever. He took me out to dinner and slowly introduced me to the field of bioinformatics. All this without me being enrolled in his classes or being one of his students, but just wanting to know more about his world. Steve is one of the most extraordinary teachers I’ve ever met. I also appreciate his patience in re-explaining biological concepts to me two or three times as it was very difficult for me to start acquiring a “biological mindset”.

Thanks to Theresa Thompson for helping me understand the “tao of Steve”, when things were getting difficult at one point, and also getting to the bottom of the mystery of the pink pants. Her advice on literature and politics was also very helpful and insightful.

Robert van Engelen was my major professor throughout this process. I appreciate his guidance and advice, and his willingness to work with an interdisciplinary project. I also would like to thank my original advisor, Ernest McDuffie, who sadly left The Florida State University before I was able to complete my original thesis.

I also appreciate having taken a semester of Dave Swofford’s bioinformatics class while completing this thesis. I think it really helped to give me a solid grounding in some of the fundamental algorithms. He is a very knowledgeable and accomplished instructor.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

INTRODUCTION

BIOLOGICAL DATABASES
    Biological Databases are massive
    Biological data can be noisy and redundant, with unclear features
    Another Artificial Source of Redundancy

A REAL-WORLD PROBLEM
    Process of the human expert
    Sorting out redundancies
    Taxonomy Analysis

SIMILARITY SEARCHES
    BLAST Output
        Part 1 – Overview of the Query
        Part 2 – Descriptions of each significant alignment
        Part 3 – Pairwise Alignments
        Part 4 – Statistical Summary
    Evolution
    Three fundamental algorithm types
        Dot plots
        The need for quantitative methods
        Pairwise sequence alignment
        Global Alignment
        Local Alignments
        Multiple sequence alignment

GCG WISCONSIN PACKAGE

MUMS AND SUFFIX TREES
    MUMs
    Longest Increasing Subsequence

CONSENSUS ALGORITHMS
    Clustering in MUMmer 2.1
    The Union-Find Algorithm

IMPLEMENTATION RECOMMENDATIONS: C++ OR JAVA?
    NINJA
    The Cost of Being Object-Oriented
    Further adventures of the Rice Research Team

CONCLUSION

EPILOGUE

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 1 - Dot plot example
Table 2 - The 4 DNA Nucleotide Sequences and Their Official Codes
Table 3 - The 20 Amino Acids and Their Official Codes

LIST OF FIGURES

Figure 1 - Growth of GenBank
Figure 2 - Align ESTs to cDNAs
Figure 3 - Example of a sequence region which contains a gene model
Figure 4 - BLAST Output, Part 1 - Overview of the Query
Figure 5 - BLAST Output, Part 2 - Description of each significant alignment
Figure 6 - BLAST Output, Part 3 – Pairwise Alignments
Figure 7 - BLAST Output, Part 4 - Statistical Summary
Figure 8 - Origin of genes having a similar sequence[1]
Figure 9 - Dot plot window example 1
Figure 10 - Dot plot window example 2
Figure 11 - Dot plot of two paralogues
Figure 12 - Key observation for dynamic programming
Figure 13 - Dynamic programming always looks back to previous diagonal
Figure 14 - Recurrence relation
Figure 15 - Global alignment, part 1
Figure 16 - Global alignment, part 2
Figure 17 - Global alignment, part 3
Figure 18 - Global alignment, part 4
Figure 19 - Global alignment, part 5
Figure 20 - Suffix tree
Figure 21 - Aligning Genome A and Genome B after locating MUMs
Figure 22 - Crossing anchors
Figure 23 - Aligned anchors
Figure 24 - Clustering two-dimensional data

ABSTRACT

Biological databases are growing at an exponential rate. Designing algorithms that cope with the inherent redundancy in these databases, and with the overwhelming amount of data returned from similarity searches, is an active area of research.

This paper presents an overview of a real-world problem related to biological database searching, outlining how a human expert solves this problem. Then, several bioinformatics approaches are presented from the literature, forming a “utility belt” which might be used to solve the problem computationally.

INTRODUCTION

Bioinformatics tools are, in essence, a set of computer algorithms and procedures for analyzing and sifting through biological data[1]. The field of bioinformatics has been around for about 20 or 30 years. Biologists, realizing how valuable computers could be, started applying computer technology to their scientific endeavors as soon as computers became available to the general public. According to Stanford professor Mark A. Musen[2], it is only the term bioinformatics that is relatively new, not the field itself.

The bio-prefix of the word should be fairly straightforward to understand. Bioinformatics deals with biology and biological data.

It is the –informatics suffix that is more interesting. The word informatics is derived from the French word, informatique, which can be translated into English as “data processing”[3].

Why is an uncommon word like informatics used in the term bioinformatics? Mark Musen[2] claims that it is in the European tradition to incorporate elements of data processing into the fundamental computer science curriculum, whereas in the United States it is not. Instead, if a student wishes to pursue a serious study of “information, its structure, its acquisition, and its use,” that student historically enrolled in a business program, or more recently, an information sciences program. Apparently in Europe, it is more common that the equivalent of information sciences is integrated into the computer science curriculum and not as a separate field of study or departmental entity.

A key difference between the informatics perspective and the traditional computer science perspective is focus. In informatics, information and knowledge are the central focus, whereas in computer science, the focus is mostly on algorithms and writing programs[2]. From an informatics perspective “you can’t do anything without the data – knowledge is power,” whereas from a computer science perspective one might say “you can’t do anything without a program – after all, it is the code actually performing the action.” Both statements are valid; which one carries more weight depends on your perspective.

In fact, bioinformatics incorporates both data and algorithms. The term used for the field is a reminder that the biology and the data are paramount.

That is a brief explanation of the word bioinformatics and the etymology of the term. As mentioned earlier, it is a fairly new term for the field.

So, our task, should we choose to accept it, is to do the research necessary to prepare for the development of a bioinformatics utility belt. This virtual utility belt, explained in this document, will help us tackle the information overload from the ever-growing biological databases.

Biological databases are filled with redundant sequence information for various reasons, which will be covered in more detail in the next chapter. When searches are performed against these databases, the result sets also contain redundant “hits”. In this document, it is our goal to explore the nature of this redundancy problem more fully, putting together a list of possible solutions and approaches to address this issue – the utility belt. We won’t be able to offer a solution to the redundancy problem here, but rather it is our goal to offer a starting point for a potential solution.

CHAPTER 1

BIOLOGICAL DATABASES

While one can argue that even in the Internet age, most new biological information is still published as text in the form of journal articles, papers and annotations[4] – nothing is going to keep the academic engine away from the “publish or perish” mantra anytime soon – there has been an explosion in the amount of biological data “published” to computer databases over the past 20 years. Researchers routinely publish their biological findings to Internet databases such as GenBank, SWISS-PROT, Pfam, and SMART.

GenBank is one of the largest and oldest biological databases. It contains all publicly available DNA sequences[5]. Obviously, DNA sequences are important to biologists because DNA sequences contain the unique instructions that indicate how to create a particular organism. DNA consists of varying sequences (typically very long sequences of thousands or millions of molecules) out of 4 possible molecules called “bases” or “nucleotides”[6].

SWISS-PROT is another one of the older biological databases. It contains “annotated protein sequences,” including shape and sequence information. Proteins are important because they are responsible for performing most of the functions of life. Information about the three-dimensional (3D) shape of a protein is important, because shape determines the function of a protein within an organism. Proteins are sequences of 20 possible amino acids. As far as biologists know, proteins with identical sequences fold into the same three-dimensional shape.

The Pfam and SMART databases are newer, smaller, more specialized protein databases, classifying and categorizing proteins into families and domains. They contain much more detailed information, complementing the SWISS-PROT protein database.

GenBank, SWISS-PROT, Pfam, and SMART are just a few selected examples of popular biological databases. There are many others available that are also widely used. Biologists rarely perform research without referring to these biological databases. Even if results are published in the form of a report or paper, most journals require that researchers post their findings to at least one of these databases. In addition, most of these database efforts are funded by government-sponsored grants, so many of these databases are also publicly accessible, including all of the databases mentioned above[5].

Biological Databases are massive

One characteristic of these biological databases is that they are massive. As of August 2002, there were approximately 18,197,000 sequence records in GenBank[7]. As of February 22, 2003, Release 40.44 of SWISS-PROT contained 122,214 annotated protein sequence records comprising 44,864,044 amino acids[8]. Pfam contains 3,735 alignments and 3-D models for protein families, and the SMART database contains 500 domains encompassing more than 54,000 proteins[5].

Also, biological databases are growing rapidly. For example, GenBank, one of the oldest and largest biological databases, is doubling in size every 15 months[5]. The following diagram shows the almost exponential growth of GenBank from 1982 to 2002, with the number of bases plotted against the release date[9].

Figure 1 - Growth of GenBank
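To make the 15-month doubling time concrete, the following minimal Python sketch projects a database size forward under the assumption that the doubling rate stays constant. The function name and the projection horizons are illustrative only. The starting figure is the August 2002 GenBank record count quoted above, and applying the doubling time to record counts (rather than bases) is a simplification made purely for illustration.

```python
def projected_size(current_size, months_ahead, doubling_months=15.0):
    """Project a size forward, assuming it doubles every `doubling_months`."""
    return current_size * 2 ** (months_ahead / doubling_months)


# Approximate number of GenBank sequence records as of August 2002.
genbank_records_aug_2002 = 18_197_000

for months in (15, 30, 60):
    estimate = projected_size(genbank_records_aug_2002, months)
    print(f"after {months:2d} months: about {estimate:,.0f} records")
```

Under this assumption the size quadruples in two and a half years and grows sixteenfold in five, which is why result sets from searches keep getting harder to inspect by hand.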

These databases are enormous. Any researcher wanting to search through this data would have to scan through tens of thousands to millions of records to find data relevant to their research. At times, this can seem like looking for a needle in a haystack.

Biological data can be noisy and redundant, with unclear features

Another characteristic of biological data is that it can be fuzzy. Any quantitative measure of a biological system is merely an approximation. Biological researchers use a wide variety of procedures to collect their data, and the results are typically an “average value of several independent experiments”[10].

In addition to being fuzzy, biological data also tends to have a lot of noise in it. While some of this noise is due to experimental error, this noise can also be produced by nature itself in the form of mutations. As an organism evolves over time, physical changes get expressed in its genes and chromosomes. A gene is a specific sequence of DNA that produces a protein. A chromosome is a collection of genes and is the basic unit of how traits get passed from parents to children[11]. Two genes are matched based on their sequence similarity, not on the exact sequence, given that one must account for evolution in an organism, and the overall “fuzziness” in this whole process.
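The idea of matching on similarity rather than on exact equality can be illustrated with a minimal Python sketch. It assumes two sequences of equal length that have already been lined up position by position, and it reports the percentage of positions that agree. The function name and the example sequences are invented for illustration, and real similarity searches use more elaborate scoring.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


original = "GATTACAGATTACA"
mutated = "GATTACAGCTTACA"   # hypothetical copy with a single substitution

print(original == mutated)                            # exact comparison fails
print(round(percent_identity(original, mutated), 1))  # similarity stays high
```

An exact comparison treats the two sequences as unrelated, while the identity measure shows that they differ at only one of fourteen positions.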

Another issue that contributes to noise in the data is that biologists do not yet understand the relevance of many components in DNA sequences and protein sequences. For example, certain sections in a DNA sequence code for proteins; these are called exons. Other components do not seem to affect an organism’s protein-making machinery in any way that biologists can understand. These sections are known as “spacer DNA” or “noncoding DNA”. Spacer DNA within a gene is also known as an intron[12]. While looking for a sequence match, one must take into account only the exons (the coding regions) in a sequence. In humans, only 3% of a genetic sequence may consist of exons. As with mutations, this means that sequence matching involves similarity and inexact matching techniques rather than exact techniques[6].

Another Artificial Source of Redundancy

In addition to the previously mentioned sources of redundancy, there is an artificial source: sequences that are already in the database are entered again when they are re-sequenced with newer and more advanced techniques. While effort is made to purge old, duplicate entries from databases, this process is not perfect. Because experimental error is involved, the older, shorter copies might not “exhibit the same quality” as the longer version, and it may be difficult to identify an older sequence as a duplicate entry. As the technology improves to gather longer nucleotide sequences, there may be multiple “copies” of a gene – a longer version, and perhaps several other older, shorter copies that were sequenced earlier. Thus, there is more redundancy and noise in biological databases as a result[13].
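The simplest form of this purge can be sketched in Python: drop every sequence that is contained, character for character, in a longer sequence that is being kept. This is an illustration only, with a hypothetical function name and made-up sequences. As noted above, experimental error means that older copies rarely match the longer version exactly, so exact containment catches only the easy cases.

```python
def drop_contained(sequences):
    """Keep only sequences that are not exact substrings of a longer kept one."""
    kept = []
    # Visit longest sequences first so shorter copies are tested against them.
    for seq in sorted(set(sequences), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


entries = [
    "GATTACAGATTACA",  # hypothetical newer, longer version
    "GATTACA",         # older, shorter copy
    "TTACAG",          # another older fragment
    "CCCC",            # unrelated entry
]
print(drop_contained(entries))
```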

CHAPTER 3

A REAL-WORLD PROBLEM

In order to provide focus for our research in the development of a bioinformatics utility belt, we have been provided a real-world biological problem by Steven M. Thompson, a colleague at The Florida State University School of Computational Science & Information Technology. We won’t be able to solve the problem here. Rather, we need it to investigate the requirements of the utility belt. In addition, having a real-world problem or application area in mind when one performs research can help maintain focus.

Keeping the constraints and stresses of a real-world application in mind also forces the research to account for some of the noisy aspects of real-world problems in bioinformatics.

The real-world problem is this – with the exponential growth in modern biological databases (which are doubling every 15 months), many real-world database searches return too much data to deal with at one time. Several computer algorithms have been developed to return results from biological databases quickly and effectively, but still, biological researchers are required to further sift through the results of a search manually, sorting out redundancies because these algorithms don’t do this automatically.

Our aim is to develop a computer program which automates the process of sorting out these search result redundancies, producing smaller, more manageable result sets. The ultimate goal would be to develop a tool which is as good at sorting out redundancies as a human expert, but a tool which can shrink the list of possibilities in a short period of time would suffice, as this process is time-consuming for a human to perform.

Our goal is to help, providing a bioinformatics utility belt to accomplish this task.

Process of the human expert

In order to get useful information from the results of a search performed on a biological database, a human expert will often first go through a list of sequences returned and perform two processes in parallel:

1. Sort out redundancies

2. Taxonomy analysis

While performing these two processes, the human expert distills a subset of the resultant sequences, those that are redundant or duplicate, down to just one sequence. By iteratively going through this process, the list of candidate sequences gets progressively smaller.

This process is somewhat analogous to contig assembly, in which the short sequences produced by DNA sequencing machines are assembled into one long, unambiguous sequence without overlaps. It simply operates on a much larger scale than contig assembly: instead of forming one unambiguous sequence from a bunch of shorter sequences, any kind of sequence in the DNA database is fair game.

If one performs these two processes in sequence rather than in parallel, the order should be the reverse of the one listed above. One should first perform taxonomy analysis, then sort out redundancies, since taxonomy analysis will reduce the size of the sequence list more quickly.

Sorting out redundancies

Let us delve into the “sorting out redundancies” step in a little more detail. For the types of searches that this problem deals with, the search results will usually contain three basic types of sequence data:

1. Genomic sequence – Full sequences with introns (non-coding), exons (coding), and “nongenic” (non-coding) information.

2. cDNA – Complementary DNA from mRNA, roughly represents the coding components of a genomic sequence. May be partial or complete.

3. EST – Expressed sequence tag. Fairly short cDNA sequences that do not cover the entire coding sequence of a gene. Used for quick identification of genes. ESTs are fairly error prone because an EST is generated in a single pass; however, EST sequencing “allows [for] gene RNA products to be observed more directly than [with] genomic sequencing”[14].

The human expert performs the following steps to sort out redundancies:

1. Cluster analysis – Align ESTs to cDNAs to genomic sequences (as shown in the figure below).

2. Splice out introns.

3. Use the consensus.

Figure 2 - Align ESTs to cDNAs

The most difficult problem in redundancy analysis is aligning cDNA to genomic sequences. Since genomic sequences contain introns (non-coding regions) and spacer sequences, it can be difficult to determine whether or not a genomic sequence and a cDNA sequence match up.
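Steps 2 and 3 of the list above can be sketched in Python under strong simplifying assumptions: the exon coordinates are already known (in practice they come out of the difficult alignment step just described), and the sequences being merged are already aligned and of equal length. The function names, coordinates, and sequences are hypothetical, and a per-column majority vote is only one simple way to form a consensus.

```python
from collections import Counter


def splice_out_introns(genomic, exons):
    """Concatenate exon regions; `exons` holds 0-based (start, end) pairs, end exclusive."""
    return "".join(genomic[start:end] for start, end in sorted(exons))


def consensus(aligned):
    """Majority vote per column over equal-length aligned sequences ('-' marks a gap)."""
    result = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


# Hypothetical genomic fragment: exons in upper case, intron in lower case.
genomic = "ATGGCCgtaagtttcagGATTGA"
spliced = splice_out_introns(genomic, [(0, 6), (17, 23)])
print(spliced)

# Hypothetical redundant copies of the same region, one with a read error
# and one that covers only the second half.
copies = [spliced, "ATGGCCGATTGA", "ATGGCAGATTGA", "------GATTGA"]
print(consensus(copies))
```

The redundant copies collapse into a single sequence, which is the effect the human expert is after when distilling a cluster of hits down to one representative.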

The following figure explains some of the symbols used in Figure 2 in more detail. The diagram shows a sequence region which contains a gene model from the Sanger Institute’s WormBase[13]. “Exons are shown as block boxes linked by introns”[13]. The portion of the sequence that codes for a protein is shaded in red. As shown above in Figure 2 and in this diagram, coordinate start/stop “spans” for sequence features are marked off as horizontal bars. These are indicated by the lettered features: [A] sequence, [B] exon, [C] CDS coding exon, and [D] introns.

Figure 3 - Example of a sequence region which contains a gene model

Taxonomy Analysis

In order to perform taxonomy analysis, the human expert could do the following:

1. Use a text-based taxonomy tool, like GCG’s LookUp program, a European Molecular Biology Laboratory (EMBL) Sequence Retrieval System (SRS) derivative, “to sort out the desired taxonomic level from the similarity search output” or from the databases initially.

2. Identify homologous genes.

A basic issue with taxonomy analysis is determining which homologues to compare, the paralogues or the orthologues. Paralogues are duplicated genes within the same species (e.g. alpha and beta hemoglobin in human beings). Orthologues are the same gene in different species (e.g. alpha hemoglobin in humans and chimpanzees).

It is up to the biologist to decide whether or not to use both. Often, if the final goal is to ascertain molecular phylogenies, it is best only to use one or the other set depending on the question being asked – organismal taxonomy versus gene phylogeny.

A good first-pass approximation of the analysis can be executed by performing the following for a particular sequence:

1. Estimate orthologues – most similar sequences from different taxonomies

2. Estimate paralogues – most similar sequences from same taxonomy
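The first-pass approximation above can be sketched in Python. The sketch assumes each search hit has already been reduced to an identifier, a taxon label, and a similarity score where higher means more similar, and that the query sequence itself has been removed from the list. The `Hit` record, the identifiers, and the scores are all hypothetical. The code only applies the rule of thumb above; it does not establish true orthology or paralogy.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    seq_id: str
    taxon: str
    score: float


def best_hit(hits):
    """Return the highest-scoring hit, or None if the list is empty."""
    return max(hits, key=lambda h: h.score) if hits else None


def estimate_homologues(query_taxon, hits):
    """Return (orthologue, paralogue) estimates for a query from `query_taxon`."""
    same_taxon = [h for h in hits if h.taxon == query_taxon]
    other_taxa = [h for h in hits if h.taxon != query_taxon]
    return best_hit(other_taxa), best_hit(same_taxon)


# Made-up hits for a human alpha hemoglobin query.
hits = [
    Hit("human_beta_globin", "Homo sapiens", 120.0),
    Hit("human_zeta_globin", "Homo sapiens", 150.0),
    Hit("chimp_alpha_globin", "Pan troglodytes", 280.0),
    Hit("mouse_alpha_globin", "Mus musculus", 240.0),
]
orthologue, paralogue = estimate_homologues("Homo sapiens", hits)
print("orthologue estimate:", orthologue.seq_id)
print("paralogue estimate: ", paralogue.seq_id)
```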

CHAPTER 4

SIMILARITY SEARCHES

The process of searching for matches against a particular genetic sequence in a biological database (or databases) is called a similarity search. One might think that this is similar to performing a search on the World Wide Web with an Internet search engine like Google[15]. In both cases an input is entered, here the sequence to search for; a bioinformatics tool then searches all the requested biological databases and returns a list of “hits”, typically biological sequences.

However, the information returned from a similarity search is very different from an Internet search, since you are dealing with sequence data and not textual data. According to Bioinformatics for Dummies, the information returned from a similarity search includes:

The Expectation value (E-Value), which tells you how likely it is that the similarity between your sequence and a database sequence is due to chance

The length of the segment similarity between the two sequences

The patterns of amino acid conversion

The number of insertion/deletions[13]

BLAST Output

Excerpts from the output of a similarity search using BLAST, a popular bioinformatics database searching algorithm[16], follow:

Part 1 – Overview of the Query

Query: = gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577 (162 letters)
Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
437,713 sequences; 134,605,311 total letters

Figure 4 - BLAST Output, Part 1 - Overview of the Query

The first item in the similarity search output is an overview of the query, which summarizes the query sequence and the database searched. In this example, the query is the protein MJ0577, which has 162 letters.

Part 2 – Descriptions of each significant alignment.

1. sp|Q57997|Y577_METJA PROTEIN MJ0577 >gi|2128018|pir||A64372... 314 2e-85
2. pdb|1MJH| Structure-Based Assignment Of The Biochemical F... 272 1e-72
3. dbj|BAA29916| (AP000003) 170aa long hypothetical protein [P... 107 6e-23
4. sp|Q57951|Y531_METJA HYPOTHETICAL PROTEIN MJ0531 >gi|212801... 91 4e-18
5. gi|2622094 (AE000872) conserved protein [Methanobacterium t... 85 4e-16
6. gi|2621993 (AE000865) conserved protein [Methanobacterium t... 81 4e-15
7. gi|2621194 (AE000803) conserved protein [Methanobacterium t... 80 7e-15
8. gi|2622163 (AE000877) conserved protein [Methanobacterium t... 79 2e-14
9. sp|P42297|YXIE_BACSU HYPOTHETICAL 15.9 KD PROTEIN IN BGLH-W... 76 1e-13
10. sp|Q50777|YB54_METTM HYPOTHETICAL 16.1 KD PROTEIN IN MTR RE... 66 2e-10
11. gi|2648791 (AE000981) conserved hypothetical protein [Archa... 65 3e-10
12. gi|2648610 (AE000970) conserved hypothetical protein [Archa... 64 5e-10

Figure 5 - BLAST Output, Part 2 - Description of each significant alignment

The overview of the query is followed by a listing of descriptions of each significant alignment. This is usually a very long listing for most query sequences, given the amount of redundancy in the database. For this query sequence, the list had to be truncated to make the figure brief. Each entry in the list contains the sequence database code, its name, the alignment score, and an expectation or E-value.
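As a minimal sketch of how these four fields could be pulled out of a description line, the Python below splits a line from Figure 5 into its database code, name, score, and E-value. It assumes the layout shown in the figure (whitespace-separated, with the score and E-value as the last two fields, and an optional list number in front). The function name is hypothetical, and this is not a general-purpose BLAST parser.

```python
import re


def parse_hit(line):
    """Split one description line into (database_code, name, score, e_value)."""
    text, score, e_value = line.rsplit(None, 2)   # last two fields are numeric
    text = re.sub(r"^\d+\.\s*", "", text)         # drop the list number, if any
    code, _, name = text.partition(" ")           # code runs up to the first space
    return code, name, int(score), float(e_value)


lines = [
    "1. sp|Q57997|Y577_METJA PROTEIN MJ0577 >gi|2128018|pir||A64372... 314 2e-85",
    "5. gi|2622094 (AE000872) conserved protein [Methanobacterium t... 85 4e-16",
]
for line in lines:
    print(parse_hit(line))
```

Once the hits are in this form, they can be sorted, filtered by E-value, or grouped by database code before any redundancy analysis is attempted.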

Alignment scores alone are not a sufficient indicator of how similar two sequences are, and they can vary between different searching algorithms. Quoting one is like giving a number for a person’s height without the units. Expectation values are used instead to judge sequence similarity. An E-value is “the expected number of sequences of score >= S that would be found by random chance.”[17] This value has no biological significance in itself; it is used only for sequence similarity and comparison. “Low values of E indicate that the search result was unlikely to have been obtained by random chance, and thus is likely to bear an evolutionary relationship to the query sequence. E values of 10^-3 and below are often consider indicator of statistically significant results.”[17]
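The relationship between a raw alignment score, its bit score, and the E-value can be sketched with the standard Karlin-Altschul statistics. These formulas are an assumption here, since the text above does not spell them out: the bit score is S' = (lambda * S - ln K) / ln 2, and E = (search space) * 2^(-S'). The Python sketch below plugs in the raw score (796) of the first pairwise alignment in Part 3, together with the gapped Lambda, K, and effective search space values listed in the statistical summary in Part 4 of the same BLAST output. The function names are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, search_space):
    """Expected number of chance alignments scoring at least `bits`."""
    return search_space * 2.0 ** (-bits)


# Raw score of the top hit, and gapped Lambda and K from the BLAST summary.
bits = bit_score(796, lam=0.270, k=0.0470)
print(f"bit score: {bits:.1f}")
print(f"E-value:   {e_value(bits, 11_675_476_716):.1e}")
```

Because the printed parameters are rounded, the result agrees with the reported 314 bits and 2e-85 only approximately, but it lands on the same bit score after rounding and on the same order of magnitude for the E-value.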

Part 3 – Pairwise Alignments

sp|Q57997|Y577_METJA MJ0577 - Methanococcus jannaschii
>gi|5107801|pdb|1MJH|A Chain A, Structure-Based Assignment Of The Biochemical Function Of Hypothetical Protein Mj0577: A Test Case Of Structural Genomics
>gi|5107802|pdb|1MJH|B Chain B, Structure-Based Assignment Of The Biochemical Function Of Hypothetical Protein Mj0577: A Test Case Of Structural Genomics
>gi|1591284 (U67506) conserved hypothetical protein [Methanococcus jannaschii]
Length = 162
Score = 314 bits (796), Expect = 2e-85
Identities = 162/162 (100%), Positives = 162/162 (100%)

Query: 1   MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60
           MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA
Sbjct: 1   MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60

Query: 61  GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG 120
           GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG
Sbjct: 61  GLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEG 120

Query: 121 VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162
           VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
Sbjct: 121 VDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162

pdb|1MJH| Structure-Based Assignment Of The Biochemical Function Of Hypothetical Protein Mj0577: A Test Case Of Structural Genomics
Length = 287
Score = 272 bits (687), Expect = 1e-72
Identities = 145/161 (90%), Positives = 145/161 (90%), Gaps = 16/161 (9%)

Query: 2   SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAG 61
           SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIK
Sbjct: 143 SVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIK------------- 189

Query: 62  LNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV 121
              SVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV
Sbjct: 190 ---SVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGV 246

Query: 122 DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 162
           DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
Sbjct: 247 DIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS 287

Figure 6 - BLAST Output, Part 3 – Pairwise Alignments

Following the descriptions of each significant alignment are the actual pairwise alignments used to generate the individual alignment scores against the query sequence and the database entry. As with Part 2, this output has been abbreviated for clarity.

Part 4 – Statistical Summary

1. Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
2. Posted date: Feb 26, 2000 10:08 PM
3. Number of letters in database: 142,135,17
   Number of sequences in database: 461,162
4. Lambda K H
   0.313 0.135 0.349
   Gapped Lambda K H
   0.270 0.0470 0.230
5. Matrix: BLOSUM62
6. Gap Penalties: Existence: 11, Extension: 1
7. Number of Hits to DB: 39862250
   Number of Sequences: 461162
   Number of extensions: 1595704
   Number of successful extensions: 8417
   Number of sequences better than 1.0: 86
   Number of HSP's better than 1.0 without gapping: 57
   Number of HSP's successfully gapped in prelim test: 29
   Number of HSP's that attempted gapping in prelim test: 8293
   Number of HSP's gapped (non-prelim): 121
8. length of query: 162
   length of database: 142,135,178
   effective HSP length: 60
   effective length of query: 102
   effective length of database: 114,465,458
   effective search space: 11675476716
   effective search space used: 11675476716
9. T: 11
   A: 40
   X1: 16 ( 7.4 bits)
   X2: 38 (14.8 bits)
   X3: 64 (24.9 bits)
   S1: 42 (21.9 bits)
   S2: 75 (33.6 bits)

Figure 7 - BLAST Output, Part 4 - Statistical Summary

Finally the BLAST output concludes with statistical details on the search.

Evolution

It should seem obvious why you might want to perform a similarity search against textual information on the Internet – when you do not know the exact information you are looking for, but you would like to look for information that is related-to or similar to your search term. After all, to a computer, a genetic sequence is just a string of characters. If a similarity search were performed on biological databases in the exact same manner search engines worked with documents and text, what would happen? Actually it wouldn’t work at all, due to the nature of biological data.

In biology, sequence similarity means a lot more than just mere comparison of the characters in the sequence – how many characters are the same and how many are different, and so forth. Why? Because of evolution. Over time, as a species reproduces and produces offspring, their genetic sequences can be changed by:

Insertions
Deletions
Substitutions

Figure 8 - Origin of genes having a similar sequence[1]

This might mean that, say, a great-great-great-grandparent and a grandchild are related genetically, but many things might have happened to change their genetic sequence between the generations, so that if one sequenced their DNA it wouldn’t be a perfect match. We would still say that they are similar if we could somehow determine all the possible insertions, deletions, and/or substitutions that might have happened along the way. Note that in the great-great-great-grandparent to grandchild relationship, chances are low that there is great sequence divergence, but if this ancestor were the proverbial Eve and the grandchild a modern-day human, it might be a different story. And if we talk about a more evolutionary time scale, like millions of years, then there is definitely enough time for changes of this sort to happen.

Another note about similarity – biologists like to distinguish between quantitative and qualitative similarity. Quantitative similarity is just called similarity. Typically some sort of index is used to quantify the “likeness” of two or more genetic sequences.

Biologists have other special terms for similarity that can only be inferred, but cannot be observed or derived quantitatively – homology and homoplasy. Homology is “the presence of a similar feature due to inheritance from a common ancestor”[18]. Homoplasy is “the shared presence of a similar feature for reasons other than ancestry, [perhaps due to] convergence, reversal, [or] parallelism”[18].

Note that a human expert used “homologues” – sequences that exhibit the property of homology. Homologues are sequences that are similar because of a common ancestor, inferred after performing “taxonomy analysis”. A basic issue with taxonomy analysis is determining which homologues to compare: the paralogues or orthologues”.

Taxonomy is “the science of classifying living organisms specifically named categorized based on shared characteristics and natural relationships”. “Orthologue sequences are the same sequences that can be found in different species. Paralogue sequences are similar sequences within the same genome that have evolved through gene duplication, or the same sequence found in the same species”.

When you assess similarity after performing a biological database search, it is important to decide whether your definition of similarity involves just a single species or several. An example of this is performing a search for the alpha and beta hemoglobin in human beings vs. the alpha and beta hemoglobin in human beings and chimpanzees.

Searching in biological databases involves not only simple lookup and searching methods, but also more sophisticated methods that take into account the ancestry and possible evolution of a sequence in order to find matches in a database.

Three fundamental algorithm types

There are three fundamental types of algorithms used in most biological sequence analysis methods: dot plots, pairwise sequence alignment, and multiple sequence alignment. You’ll find them mentioned over and over again in bioinformatics literature. These types of algorithms are the fundamental building blocks for assembling more sophisticated searching algorithms, so understanding them is essential.

Dot plots

Dot plots are also known as dot matrix plots. A dot plot is a qualitative method for comparing two sequences that can take into account complex things like rearrangements or repeated sequences, since it relies on visual perception and interpretation.

A two-dimensional matrix is used in a dot plot. The first sequence is placed along the top row, the second sequence is placed along the first column. Then, for every row and column in the matrix, a dot is placed in the row and column where there is a match between the elements in each sequence. Diagonal lines, or diagonal patterns of dots, indicate matching areas between the two sequences.

Here is an example of a dot plot comparing the two strings “I LOVE DOT PLOTS” and “I HATE DOT PLOTS”:

Table 1 - Dot plot example

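    I _ L O V E _ D O T _ P L O T S
I   ● . . . . . . . . . . . . . . .
_   . ● . . . . ● . . . ● . . . . .
H   . . . . . . . . . . . . . . . .
A   . . . . . . . . . . . . . . . .
T   . . . . . . . . . ● . . . . ● .
E   . . . . . ● . . . . . . . . . .
_   . ● . . . . ● . . . ● . . . . .
D   . . . . . . . ● . . . . . . . .
O   . . . ● . . . . ● . . . . ● . .
T   . . . . . . . . . ● . . . . ● .
_   . ● . . . . ● . . . ● . . . . .
P   . . . . . . . . . . . ● . . . .
L   . . ● . . . . . . . . . ● . . .
O   . . . ● . . . . ● . . . . ● . .
T   . . . . . . . . . ● . . . . ● .
S   . . . . . . . . . . . . . . . ●

(“_” marks the space character and “.” an empty cell.)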

By looking for the large, almost unbroken, diagonal line of dots down the center, it should be easy to tell that these two strings are very similar. The only break on the diagonal is just a bit at the beginning, between the “HAT” and the “LOV” part. Herein lies the great power in the visual method of dot plots. You can determine similarity qualitatively merely by scanning for diagonals.

Note that the computational complexity of the dot plot algorithm is O(N²), as it must iteratively traverse every row and every column in the matrix, performing a comparison to decide whether or not to place a dot. That is N×N or N² operations.
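The one-to-one dot plot described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_plot` and `render` are our own, and the rendering uses “*” for a dot and “.” for an empty cell rather than the symbols in Table 1.

```python
def dot_plot(seq_top, seq_side):
    """Return a matrix of booleans; rows follow seq_side, columns follow seq_top.

    A cell is True (a dot) when the two characters are identical.
    Every row/column pair is compared once, hence the N x N cost.
    """
    return [[side_ch == top_ch for top_ch in seq_top] for side_ch in seq_side]


def render(matrix, seq_top, seq_side, dot="*", blank="."):
    """Format the matrix as text with the sequences as row and column labels."""
    header = "  " + " ".join(seq_top)
    rows = [
        f"{label} " + " ".join(dot if cell else blank for cell in row)
        for label, row in zip(seq_side, matrix)
    ]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    top = "I LOVE DOT PLOTS"
    side = "I HATE DOT PLOTS"
    print(render(dot_plot(top, side), top, side))
```

Run on the two example strings, the sketch places a dot wherever a row character equals a column character, so the long diagonal is broken only where “HAT” meets “LOV”.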

Sometimes this straightforward, one-to-one method for generating the dots in a dot plot will generate very noisy matrices where it is difficult to sort out clear diagonals for matches, particularly with very long sequences. In those cases, some sort of filtering algorithm might be used. One technique is to use a sliding window, calculating a score for a number of characters which then will be used in determining whether or not to place a dot, rather than just looking at one character at a time. Then, the decision about whether or not to place a dot in the box for the character depends on whether or not the score for the window exceeds a particular threshold. After the decision about whether or not to place a dot is made, the window is moved forward by one for placement of the next dot in the matrix. Here is an example with a window size of 5:

I H A T E D O T P L O T S

I L O V E D O T P L O T S

Figure 9 - Dot plot window example 1

Every pair of letters in the window, II, LH, OA, VT, EE, must have a defined score. One such system is something like “put a dot if there are c / w matches, where w is the window size, and c is the threshold count for matches”. After the decision to put a dot (or not put a dot) is made for that particular entry in the matrix, the window is moved forward, and the process is repeated for the next element in the matrix:

I H A T E D O T P L O T S

I L O V E D O T P L O T S

Figure 10 - Dot plot window example 2
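The sliding-window filter can be sketched as follows. This is a minimal sketch under our own assumptions: the window is anchored at its first character and read along the diagonal, every identical pair scores 1, and a dot is placed when at least `threshold` of the `window` pairs match (the “c / w” rule). The function name and parameter names are hypothetical, and unlike the figures this sketch treats the space as an ordinary character.

```python
def windowed_dot_plot(seq_top, seq_side, window=5, threshold=4):
    """Dot plot in which each cell summarizes a diagonal window of characters.

    Cell (r, c) is True when at least `threshold` of the `window` pairs
    seq_side[r + k], seq_top[c + k] (k = 0 .. window - 1) are identical.
    """
    n_rows = max(len(seq_side) - window + 1, 0)
    n_cols = max(len(seq_top) - window + 1, 0)
    plot = []
    for r in range(n_rows):
        line = []
        for c in range(n_cols):
            matches = sum(
                seq_side[r + k] == seq_top[c + k] for k in range(window)
            )
            line.append(matches >= threshold)
        plot.append(line)
    return plot


if __name__ == "__main__":
    plot = windowed_dot_plot("I LOVE DOT PLOTS", "I HATE DOT PLOTS")
    for row in plot:
        print("".join("*" if cell else "." for cell in row))
```

With a window of 5 and a threshold of 4, isolated single-character matches no longer produce dots, which is the noise reduction the filter is meant to provide.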

Please note that while English characters have been used in the previous examples for illustrative purposes, real dot plots, of course, use sequence data: DNA and proteins. Here is an example of what a dot plot with real sequence data looks like. The following figure is a dot plot of two DNA sequences; the alpha chain of human hemoglobin is on the horizontal axis and the beta chain of human hemoglobin is on the vertical axis (two paralogues) [19]:

Figure 11 - Dot plot of two paralogues

The need for quantitative methods

While dot plots are a valuable technique in similarity analysis, other techniques are also necessary. One might note the need for an extra technique because dot plots require that a human being visually analyze the diagram for similarities. While this is a powerful technique, it doesn’t lend itself easily to something that could be automated. Other issues pointed out by Professor Russ Altman in his Computational Molecular Biology course are the following[20]:

Difficult to find optimal alignments
Need scoring schemes more sophisticated than “identical match”
Difficult to estimate significance of alignments

So a more quantitative way of analyzing the similarity between two sequences has been developed. This process is called sequence alignment. Given two (or more) sequences and a scoring method for counting matches between sequences, one finds the optimal pairing of the nucleotides/residues in one sequence to the other sequence.

The author and researcher Dan Gusfield[21] points out that this is an incredible insight: “biologically meaningful results could come from considering DNA as a one-dimensional string, abstracting away the reality of DNA as a flexible three dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which the classic linear chromosome exists for only a fraction of the time. A similar, but strong, assumption existed for protein, holding, for example that all the information needed for correct three dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in.”[21]

This allows computer scientists the ability to apply a whole host of string, tree and other computational algorithms to sequence data which might solve biological problems without having to understand a whole lot of biology. Gusfield goes on to say that there are really only three basic pieces of information a computer scientist needs to know before starting to tackle the sequence alignment problem:

1. DNA is composed of a 4-letter alphabet (see figure below). Protein is composed of a 20 letter alphabet (see figure below). The information represented by these strings “underlies biochemistry, cell biology, and development…” and “is the root of an organism’s biology”[22].

2. Even though a lot of complex phenomena occur at the biological level, molecular biology, which studies life at the cellular level, “is all about sequences”. It is actually not a bad thing (at times) to abstract away the biology and just look at the sequence. That is good news for the computer scientist, who might not have as much formal training in biology, but does have a lot of training in algorithms.

3. “The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequences of residues in polypeptide chains” – proteins. “…It is at this level of organization that the secret of life (if there is one) is to be found.”

Table 2 - The 4 DNA Nucleotide Sequences and Their Official Codes

The 4 DNA Nucleotide Sequences and Their Official Codes

#   1-Letter Code   Nucleotide Name
1   A               Adenine
2   C               Cytosine
3   G               Guanine
4   T               Thymine

Table 3 - The 20 Amino Acids and Their Official Codes

The 20 Amino Acids and Their Official Codes

#    1-Letter Code   3-Letter Code   Name
1    A               Ala             Alanine
2    R               Arg             Arginine
3    N               Asn             Asparagine
4    D               Asp             Aspartic acid
5    C               Cys             Cysteine
6    Q               Gln             Glutamine
7    E               Glu             Glutamic acid
8    G               Gly             Glycine
9    H               His             Histidine
10   I               Ile             Isoleucine
11   L               Leu             Leucine
12   K               Lys             Lysine
13   M               Met             Methionine
14   F               Phe             Phenylalanine
15   P               Pro             Proline
16   S               Ser             Serine
17   T               Thr             Threonine
18   W               Trp             Tryptophan
19   Y               Tyr             Tyrosine
20   V               Val             Valine

So a lot of complex biochemistry can be abstracted away to just sequences and strings. One doesn’t have to take into account all the complexity of a molecule while performing sequence analysis, just the information contained within it. That this information is carried in the sequence is one of the most fascinating discoveries made in the 20th century by Watson, Crick and Hershey[12].

Pairwise sequence alignment

First, let us talk about pairwise sequence alignment. In pairwise sequence alignment, only two sequences are compared at a time. Here is a statement of the problem, according to Russ B. Altman[20]:

Given:

2 sequences
scoring system for evaluating match (or mismatch) of two characters
penalty function for gaps in sequences

Produce an optimal pairing of sequences that retains the order of characters in each sequence, perhaps introducing gaps, such that the total score is optimal.

To calculate the score, one iterates through the pair of sequences comparing each pair of characters, adding up scores. In the simplest case, without gaps, the alignment score is:

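\[
\text{Score} = \sum_{i=1}^{N} s_i, \qquad
s_i = \begin{cases}
\text{match score} & \text{if } seq1_i = seq2_i \\
\text{mismatch score} & \text{if } seq1_i \neq seq2_i
\end{cases}
\]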

When gaps are introduced, then there is just one extension to the score – a gap penalty:

\[
\text{Score} = \sum_{i=1}^{N} s_i, \qquad
s_i = \begin{cases}
\text{match score} & \text{if } seq1_i = seq2_i \\
\text{mismatch score} & \text{if } seq1_i \neq seq2_i \\
\text{gap penalty} & \text{if } seq1_i \text{ is a gap or } seq2_i \text{ is a gap}
\end{cases}
\]

Some examples follow. I’m going to follow Professor Altman’s lead and sometimes use the English alphabet for examples rather than using DNA or protein sequences. Sequences like “TTTGACAC” or “RKVA--GMAKPNM” are confusing. So for any English examples, our alphabet will be the 26 letters in the English language, plus the ‘ ‘ (space) character, and gaps will be represented by the “#” (pound) symbol.

Keep in mind that real-world examples will use DNA and protein sequences. When DNA and protein sequences are used, only the 4 letter DNA alphabet or the 20 letter protein alphabet is permissible, and gap characters are usually represented by a ‘-‘ character (hyphen). Note: the 4-letter DNA alphabet sometimes permits many algorithmic shortcuts that would not be possible if biology actually used 26 letters like the English language.
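The scoring rule above, with its gap extension, can be sketched as a small Python function. This is only a sketch under stated assumptions: the function name `alignment_score` and its parameters are our own, the two rows are assumed to be already aligned (equal length, with “#” marking gaps as in the English examples), and the gap penalty is a signed value that is added to the total, so a real penalty would be passed as a negative number.

```python
GAP = "#"  # gap symbol used for the English-alphabet examples


def alignment_score(row1, row2, match=1, mismatch=0, gap_penalty=0, gap=GAP):
    """Sum per-position scores over two already-aligned rows.

    A position scores `gap_penalty` if either row has a gap there,
    `match` if the characters are identical, and `mismatch` otherwise.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            total += gap_penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    # Ungapped example: 13 of the 16 positions are identical.
    print(alignment_score("I LOVE DOT PLOTS", "I HATE DOT PLOTS"))
    # Gapped example with a penalty of -1 per gap position.
    print(alignment_score("ABCDE", "AB#DE", gap_penalty=-1))
```

The gap case is tested first so that a gap symbol is never counted as an ordinary mismatch.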

So here is an example of a pairwise sequence alignment. Given the two strings:

Sequence alignment is incredibly fun
Sequence alignment is not fun

The score for a match is +1
The score for a mismatch is 0

One way we could align these strings is like so:

Sequence alignment is incredibly fun
Sequence alignment is not####### fun

So for this alignment, the score is 26 (the summation of the match and/or mismatch scores for each character, with the gap positions contributing nothing).

Is the alignment in the previous example optimal? Well, one way that we could quantitatively prove this is to try all the possible alignments and see which alignment has a maximum score. Notice the one thing which makes this problem hard is the introduction of gaps anywhere in the string/sequence. This means that a naïve approach, in which one generates every placement of gaps at every position and of every length, is far more expensive than a dot plot, which has a computational complexity of O(N²): the number of candidate alignments grows exponentially with the length of the sequences.

This notion of gaps comes directly from biology. Recall that sequences can be similar, even if there may have been any number of insertions, deletions or substitutions between one or more nucleotides. The alignment algorithms must take this biological notion into account. Well, it is more than a “notion”; it is a biological fact.

Dynamic programming[23] is similar to the concept of “divide-and-conquer”. When a problem is too difficult or large to solve by itself, break it down into subproblems, solve those problems individually (or recursively), then combine all the individual solutions to solve the original, larger problem. Simple enough.

Dynamic programming is just a bit of a twist on divide-and-conquer in that the subproblems cannot be solved independently – the “subproblems share subsubproblems”. In order to “solve every subproblem only once” dynamic programming algorithms keep track of the temporary subproblem solution in a temporary “traceback area” or table, so it won’t have to go through “the work of recomputing the answer every time the subproblem is encountered”.

Dynamic programming algorithms are typically used for optimization problems like our pairwise alignment problem, where we are trying to find the alignment(s) which produce an optimal score. The dynamic programming approach will handle multiple solutions, if there is more than one optimal score. There is a good description of dynamic programming in the Algorithm Design Manual: “[it] considers all possible decisions and always selects the one[s] that proves to be the best. By storing the consequences of all possible decisions to date and using this information in a systematic way, the total amount of work is minimized.”[24]

So how do you break this pairwise alignment problem down into subproblems? Well, first a key observation is “the score of the best possible alignment that ends at a given pair of positions (i..j) in the two sequences is the score of the best alignment previous to those two positions PLUS the score for alignment those two positions.”[20]

Figure 12 - Key observation for dynamic programming

Another way to look at the problem is to go back to dot plot diagrams, and note the following:

Figure 13 - Dynamic programming always looks back to previous diagonal

So this is exactly what we do when implementing a pairwise algorithm using dynamic programming. The overall approach is the following:

1. Fill out a matrix with the first sequence along the top row and the second sequence along the first column (just as with a dot plot). This matrix holds all the best possible scores for the alignments. You’ll also keep a second set of pointers (or a second matrix) called a “trace back” to keep track of the dependencies of the scores.

2. Find the best score in the entire matrix. The best score depends on whether or not you are doing a “global” or “local” alignment. More details will be given later.

3. Go back through the trace back matrix to reconstruct position-by-position the elements of sequences in the alignment, including gaps.

To figure out the scores in the matrix we use something called a mathematical recurrence relation, which looks something like this:

Figure 14 - Recurrence relation

Global Alignment

If we compare the two sequences in their entirety and do not distinguish whether or not gaps are located at the ends of one or both sequences, then a global alignment is being performed. Since a global alignment is the “best alignment of the entirety of two sequences,” the best score will be in the final row or final column in the matrix, depending on the algorithm.

One relatively simple algorithm for performing a pairwise global alignment of two sequences is the Needleman and Wunsch algorithm[25]. In fact, Needleman and Wunsch were the very first to apply the concept of dynamic programming to the problem of sequence alignment.

We demonstrate an example using the Needleman and Wunsch algorithm for performing a global alignment. Note that we still use the three general steps outlined above. In this case, the two sequences are AACG and AGGC, where the gap cost is -1, the match bonus is +1, and the mismatch score is 0.

As outlined in the previous section, first we must create and fill a matrix with the first sequence along the top row and the second sequence along the first column. In the Needleman and Wunsch algorithm, the first row and column of numbers are initially filled with multiples of the gap cost, as shown in the following figure:

          A    A    C    G
     0   -1   -2   -3   -4
A   -1
G   -2
G   -3
C   -4

Figure 15 - Global alignment, part 1

Next, we continue to fill out the matrix using the recurrence relation, making a note of the best score in the matrix. BestScore[i,j] = BestScore[<i,<j]+Match[i,j]+Gap Cost.
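Written out in full, one standard textbook form of this recurrence is given below. It is offered as a sketch rather than a transcription of Figure 14: the symbols F (the best-score matrix), a and b (the two sequences), s (the match or mismatch score) and g (the gap cost) are our own notation, and a single linear gap cost is assumed.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,g, \qquad F(0,j) = j\,g, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{(align } a_i \text{ with } b_j\text{)} \\
F(i-1,\,j) + g & \text{(gap in one sequence)} \\
F(i,\,j-1) + g & \text{(gap in the other sequence)}
\end{cases}
\end{aligned}
```

With the values used in this example, g = -1, s = +1 for a match and s = 0 for a mismatch. Here F(i,j) counts the characters consumed from each sequence, so the corner cell holding 0 is F(0,0) and the first cell to be computed is F(1,1) = max(0 + 1, -1 - 1, -1 - 1) = 1.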

We start filling out the matrix using the recurrence relation, starting at position (2,2). Using the recurrence relation, we compare three possible values:

1. First, we can look at the value at (1,1). We either add the match bonus or the mismatch cost depending on whether or not the actual sequences in the row/column heading match or not. In this case they do: A with an A: So the score is 0+1=1, the value at (1,1) plus the match bonus.

2. Second, we take the value from the left at (2,1) and add the gap cost. The score is -1-1=-2

3. Third, we do the same thing for the value above at (1,2) and add the gap cost. That score is also -1-1=-2

Out of these three values 1, -2, and -2, the maximum value is 1. We write that value in the matrix, as shown in the figure below:

          A    A    C    G
     0   -1   -2   -3   -4
A   -1    1
G   -2
G   -3
C   -4


As we’re filling in the matrix, we also keep track of pointers to the previous best as a “trace back”. This will be used in step 3 to print out the final alignment. Some implementations of the algorithm keep these trace back pointers in a separate matrix entirely, in order to be efficient. For representational purposes, our “trace back” pointers will just be pointers – arrows in the matrix. Since the best score came from (1,1), we will place a trace back pointer pointing there, like so:

          A    A    C    G
     0   -1   -2   -3   -4
A   -1   ↖1
G   -2
G   -3
C   -4


Here is the whole matrix filled out:

          A    A    C    G
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
G   -2    0    1    0    0
G   -3   -1    0    1    1
C   -4   -2   -1    1    1


After filling in the matrix using the recurrence relation, we move on to step two, finding the best score. In this case, the best score is in the lower right-hand corner: 1. This is the optimal score.

On to step 3, converting the “trace back” pointers to an alignment. If the trace back pointer is diagonal, it represents an alignment between the two sequences at that position. If the trace back pointer is vertical, it represents a gap in the sequence along the top row. If the trace back pointer is horizontal, it represents a gap in the sequence along the first column.

The following diagram shows the allowed paths from the lower right-hand corner:

          A    A    C    G
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
G   -2    0    1    0    0
G   -3   -1    0    1    1
C   -4   -2   -1    1    1


Following the pointers from right to left, the following optimal alignment, with an alignment score of 1 is obtained:

AACG
AGGC

There may be many possible paths, and finding all of the optimal ones can be quite computationally expensive. Most algorithms pick a heuristic, or offer to pick the high road or the low road (for example, favoring the higher path over the lower path)[17].
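The three steps of the global alignment can be sketched in Python as follows. This is a minimal sketch, not the original Needleman and Wunsch formulation: it assumes a single linear gap cost, the function name and return values are our own, and ties are broken by preferring the diagonal, then the horizontal, then the vertical move, which is just one possible heuristic.

```python
def needleman_wunsch(top, side, match=1, mismatch=0, gap=-1):
    """Global alignment of `top` (columns) against `side` (rows).

    Returns (score_matrix, aligned_top, aligned_side), using '-' for gaps.
    """
    n_rows, n_cols = len(side) + 1, len(top) + 1
    score = [[0] * n_cols for _ in range(n_rows)]
    trace = [[None] * n_cols for _ in range(n_rows)]

    # Step 1a: the first row and column hold multiples of the gap cost.
    for j in range(1, n_cols):
        score[0][j], trace[0][j] = j * gap, "left"
    for i in range(1, n_rows):
        score[i][0], trace[i][0] = i * gap, "up"

    # Step 1b: fill the rest of the matrix with the recurrence relation.
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            pair = match if side[i - 1] == top[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            # max() keeps the first maximal candidate, so ties favor "diag".
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])

    # Steps 2 and 3: start at the lower right-hand corner and trace back.
    i, j = n_rows - 1, n_cols - 1
    out_top, out_side = [], []
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "diag":      # align the two characters
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":    # horizontal: gap in the first-column sequence
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
        else:                   # vertical: gap in the top-row sequence
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
    return score, "".join(reversed(out_top)), "".join(reversed(out_side))


if __name__ == "__main__":
    matrix, a, b = needleman_wunsch("AACG", "AGGC")
    for row in matrix:
        print(" ".join(f"{v:3d}" for v in row))
    print(a)
    print(b)
```

For the sequences AACG and AGGC with the scores used in this chapter, the sketch reproduces the filled matrix shown earlier and the alignment of AACG over AGGC with a score of 1.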

Local Alignments

A local alignment differs from a global alignment in that the best score can be the best score ANYWHERE in the matrix, rather than just being the best score in the final row or final column, because a local alignment does not necessarily look at the best alignment of the two sequences in their entirety.

A popular local alignment algorithm is the Smith-Waterman[26] algorithm. Like the Needleman and Wunsch algorithm, it is also a fundamental algorithm in bioinformatics.

Smith-Waterman merely tweaks the global alignment algorithm slightly to provide local alignments. It offers a fourth option when filling out the matrix (in the recurrence relation) – there must be no negative values in the matrix – instead, if a value is negative, we write a 0 in the matrix.

Once the matrix is completed in this fashion, to find a partial alignment, one just looks for a maximum score in the matrix and follows the trace back pointers until a zero value is reached instead of going to the end of both sequences[17].
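The local alignment variant can be sketched by changing three things in a global alignment routine: zero is added as a fourth candidate, the best score is tracked anywhere in the matrix, and the trace back stops at a zero. This is only an illustrative sketch. The function name is our own, a linear gap cost is assumed, and the default mismatch score of -1 is our assumption (local alignment is usually run with a negative mismatch score so that alignments do not extend through unrelated regions).

```python
def smith_waterman(top, side, match=1, mismatch=-1, gap=-1):
    """Local alignment of `top` (columns) against `side` (rows).

    Returns (best_score, aligned_top, aligned_side), using '-' for gaps.
    """
    n_rows, n_cols = len(side) + 1, len(top) + 1
    score = [[0] * n_cols for _ in range(n_rows)]
    trace = [[None] * n_cols for _ in range(n_rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n_rows):
        for j in range(1, n_cols):
            pair = match if side[i - 1] == top[j - 1] else mismatch
            candidates = [
                (0, None),  # fourth option: never let a cell go negative
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero value is reached.
    i, j = best_pos
    out_top, out_side = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        move = trace[i][j]
        if move == "diag":
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
        else:
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
    return best, "".join(reversed(out_top)), "".join(reversed(out_side))


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "ACGT"))
```

In the usage example the short sequence ACGT is found inside the longer one, and the flanking T characters are left out of the reported alignment.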

Multiple sequence alignment

Multiple sequence alignment is basically the same process as pairwise sequence alignment, except that more than two sequences can be compared at a time. Other than that, the basic concepts are the same. Multiple sequence alignment algorithms seek to find an optimal score between multiple sequences, which represents an alignment between multiple sequences.

In the general case, the multiple alignment problem is NP-hard. In practice, multiple alignment algorithms can normally only be run on a modest number of sequences. However, it is an active research area in bioinformatics. In particular, multiple sequence alignment algorithms are incredibly useful in the production of phylogenetic trees used in taxonomy analysis [27].

CHAPTER 5

GCG WISCONSIN PACKAGE

Another component in our utility belt is an oldie, but a goodie: the GCG Wisconsin package by Accelrys. GCG stands for the Genetics Computer Group. The Wisconsin Package for Sequence Analysis was a project that began in 1982 to serve Oliver Smithies’ lab at the University of Wisconsin Department of Genetics[28]. In 1990, that project became a private company, the Genetics Computer Group (GCG). The company was “one of the pioneers of bioinformatics and its Wisconsin Package sequence analysis tools are widely used”. In 1997, GCG was acquired by the British chemical informatics company, the Oxford Molecular Group. Then in September 2000, the drug discovery and chemical development corporation Pharmacopeia bought Oxford Molecular Group (and thus, GCG)[29]. On June 1, 2001, Pharmacopeia, Inc. merged together all its software acquisitions (there were several more in addition to Oxford Molecular Group and GCG) under one umbrella company, Accelrys[29]. Since the GCG name has brand recognition, the company chose not to rename the product, so it is still officially called the “GCG Wisconsin package” or “GCG” or the “Wisconsin package”, even though the original company that started the business no longer exists.

Of particular interest to us, GCG provides implementations of many bioinformatics database searching algorithms. One would always rather reuse than rewrite. There is no need to reinvent the wheel.

Plus, it is not corporate press release hyperbole that the GCG Wisconsin package is widely used. In fact, if anything, Accelrys blunted the effectiveness of their “widely used” statement in recent changes made to their GCG web page. In his GCG workshops, Steven Thompson points out that they used to publish more quantitative usage numbers on the web – “Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries”[30]. For what exact reason, I’m not sure, but I’ve seen it done time and time again in software companies that I’ve worked for when they were acquired by a large corporate entity – that the large corporate entity becomes suddenly unwilling to publish any quantitative numbers on any of their subcorporations. This might be a tactic used to make it more difficult to tell which entities are losing money vs. gaining money. It also makes it more difficult for competitors to extrapolate the exact costs of doing business by combining these sales numbers with information from quarterly financial reports.

GCG has the same popularity in the bioinformatics community that a software package like SAS has with statisticians. Any serious statistician will require that SAS be available to them, even if they have to pay an annual license fee. Public domain equivalents just don’t compare: the software is that useful, it has so many necessary features, and it is used by so many of their colleagues that it would be almost impossible to share work or collaborate without it. It is almost the same situation with GCG as with SAS. Florida State University is a good example. We have SAS installed on the garnet computer system. Of course, GCG is used by the bioinformatics researchers working on mendel. Would Florida State University researchers and scientists ever make bad decisions about choices in software?

The latest version of GCG, version 10.3, as of this writing, is geared towards the UNIX operating environment. It runs on RedHat 7.1/7.2, Sun Solaris 2.6, 7 or 8 (SPARC only), IBM AIX 5.1 and above, HP/Compaq Tru64 UNIX 4.0E or later (Digital Unix) and 5.0, and Silicon Graphics IRIX 6.5 (RISC). Basically the recommended platforms for GCG are any popular version of UNIX, though not necessarily the absolute-latest versions; preferably one version back from the latest release.

The recommended minimum disk space for GCG is 80GB. This storage is mostly for biological databases. With GCG, one keeps a copy of the biological databases locally on the machine running the software. At an extra cost, one can purchase DVDs for database updates rather than downloading them from the Internet. The DVD databases are shipped in a compressed format, as a DVD-R only holds 4.7GB of data. The recommended amount of memory is 128MB of core memory and 256MB of virtual memory. Please note that these are only the recommended minimum amounts of storage and memory. At these minimums, the configuration can realistically be used by only one user on her personal workstation, not by multiple users.

An ideal setup for GCG would have lots of disk space and as many CPUs as possible, as GCG is both CPU- and disk-intensive. The workstation might need several gigabytes of memory to run effectively. For example, at the time of this writing, the computer system mendel, serving the students and researchers at FSU has 8GB of memory and 604GB of disk space to store databases. The running of bioinformatics algorithms, which was covered in the previous chapter, is by nature, CPU-intensive. The biological database search algorithms are disk-intensive. So GCG, like many scientific applications has the worst of both worlds from a performance perspective, being both I/O-bound and CPU-bound. Most business applications are just I/O-bound, so they have an easier-time of it.

The GCG application consists of three interfaces:

SeqLab – a high-level graphical user interface, which requires X-Windows

Command-line – a low-level command-line suite of almost 150 bioinformatics programs and tools

SeqWeb – an add-on product which provides a web front-end for GCG in either the Microsoft or Netscape web browsers

As a bioinformatics tool designer, one can learn a lot from the design of the GCG product. It appears that the GCG designers held to two design principles in the evolution of GCG[31]:

1. The primary customers of GCG are scientists, not computer scientists

2. Built-in support for “continuous evolution” and adaptation to new technologies while maintaining backwards compatibility (and even cross-compatibility with competitors’ products).

These are some very good principles indeed and bear a little more explanation. First, let’s talk about the first principle – the primary customers of GCG are scientists, not computer scientists.

At the moment, UNIX is the choice of platform in scientific computing. In their original paper describing GCG, the designers also mention that the “software tools” approach in UNIX influenced their design, even though the original target platform was the VMS operating system and GCG was originally written in Fortran[31].

The next part of the statement, that the primary customers are NOT computer scientists, again seems terribly important to understand – at least to any computer scientist trying to write a bioinformatics tool. To my knowledge, there is no traditional third-party application programming interface (API) for GCG, and yet, you can use the software as a toolbox of low-level routines to build new bioinformatics algorithms. How is this possible? It is done through the command-line interface. All the low-level command-line interfaces for GCG work like the UNIX file utilities, in that they are a loose collection of single-purpose commands that can be run from the command-line. In addition, just like the UNIX file utilities (or at least the good UNIX file utilities), they all share a common file format and their inputs and outputs can be redirected, piped, and chained together in as simple or as complicated a fashion as necessary to perform the task. In this way, the low-level command-line interface can be used both by developers writing programs and end-users alike. What is startling is that this “toolbox” design for an API was built into the product back in 1982. This was a fairly radical notion for bioinformatics tools back then.

What is interesting to note is that with GCG, there doesn’t seem to be a hidden, back-door programming interface to call these command-line utilities. Instead developers wanting to write programs spawn a command-line shell and execute the same utilities programmatically that an end-user would – something that didn’t happen in the UNIX file utility world. A GCG developer merely works with the input and output files of the command-line tools or “screen-scrapes” the output, rather than working with programming interfaces. This means that high-level string processing languages like Perl can be used, a favorite among the system administrator types who maintain GCG servers, in addition to more traditional programming languages like C++ or Java.
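The pattern described above, in which a program spawns the same command-line utilities an end-user would run and then works with their textual output, can be sketched in Python. This is only a minimal sketch under stated assumptions: `run_tool` and `first_column` are hypothetical helper names, and the usage lines invoke the Python interpreter itself as a stand-in command, because no actual GCG program names or options are assumed here.

```python
import subprocess
import sys


def run_tool(argv, stdin_text=""):
    """Run a command-line tool, feed it text on stdin, and return its stdout.

    argv is the command and its arguments as a list. Nothing here is
    specific to GCG; this is a hypothetical helper for illustration.
    """
    result = subprocess.run(
        argv,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout


def first_column(report):
    """'Screen-scrape' a report: keep the first field of each non-empty line."""
    return [line.split()[0] for line in report.splitlines() if line.strip()]


if __name__ == "__main__":
    # Stand-in for a real tool: a child Python process that upper-cases stdin.
    stand_in = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    report = run_tool(stand_in, "seq1 acgt\nseq2 ttga\n")
    print(first_column(report))
```

Because the helper only deals in argument lists and text, the output of one call can be passed as the input of the next, which mirrors the piping and chaining of single-purpose commands described earlier.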

The key difference here is that the primary customer is different, as mentioned earlier, in that they are not computer scientists. Source code was originally included at no extra cost with GCG, but this was the only programming API, beyond a macro language in SeqLab. Now, even this source code license is no longer an option. Accelrys does not currently redistribute GCG source code.

Even with all this customer-centric focus, GCG missed the boat on their first try at having a good Graphical User Interface (GUI) front-end to the program. Their first attempt was called the Wisconsin Package Interface (WPI). Like many command-line to GUI translations, it merely allowed menu-driven access to the GCG command-line, hardly worth the bother of running the GUI.

Admittedly, it is very difficult to design a useful GUI, and it is even harder to translate a command-line program to a GUI, because you also have to deal with translating an existing program from one “paradigm” to another. A classic failure at this was WordPerfect Corporation when they made the initial switch from DOS to Windows, although some argue that mismanagement and market forces ultimately doomed the company, so the misstep in the translation of their GUI didn’t really matter. WordPerfect is still a dominant force in the legal industry, but it is not the widely-used word processing application it once was in the 1970s and 1980s[32].

Meanwhile, a researcher named Steve Smith[30] had developed a wonderful editor for genetic sequences called the Genetic Data Environment (GDE). GCG liked his user interface ideas so much they decided to hire him. They realized that they were losing market share by not having a useful GUI and that a key to having a useful bioinformatics GUI was having a sequence editor. So GCG revamped their GUI around Steve Smith’s ideas, making the sequence editor the core of the GUI while retaining the ability to invoke the powerful suite of GCG tools from its menus. Thus, the user interface portion of GCG, SeqLab, was born[28].

This leads us to the second design principle that GCG learned and heeded. Things change fast in the world of bioinformatics software development. The chief executive officer and bottle washer of Letovsky Associates, Stanley Letovsky, likened the challenge of writing bioinformatics software to a “task [which] closely resembles that of an auto mechanic trying to redesign a car while driving it”[14]. The low-level command-line interface of GCG is simple and modular, and yet it has been able to support over 20 years of evolutionary changes without a significant external redesign of the interface. Part of that was borrowing from the best ideas developed in UNIX, like a set of loosely-integrated command-line utilities. Another component was shedding parts of the UNIX legacy that didn’t work in a bioinformatics context, like having an API. Yet another component is “pervasive compatibility”. GCG tries to work with every commonly used bioinformatics file format in existence. They add support for new file formats all the time. All this was done without adopting trendy, modular, software engineering concepts like Common Object/Request Broker Architecture (CORBA) and/or Microsoft Component Object Model (COM), which seem like fads by comparison, since it does not look likely either will have anywhere near a 20-year-plus useful life. All that GCG did 20 years ago was to adopt a then-trendy “tool based” design approach.

Even so, the designers who are now at Accelrys cannot feel totally smug and assume that all user interface problems have been licked in GCG. There is a need to evolve the program interface for GCG, adapting it to the world-wide-web. Alas, they missed the boat in their first attempt at a web interface for GCG – SeqWeb. A major problem with the SeqWeb interface is that it is “baby GCG”. The web interface has been stripped down and simplified so much that you just can’t do nearly as much in SeqWeb as you can in SeqLab. This is not uncommon as a user interface is translated from one paradigm to another – one often might miss a critical customer usability priority as one is grappling with the technical requirements of the platform shift, as in this case, from the X-Windows interface to the web.

Fortunately, as with the genesis of SeqLab, some freely available tools have sprung up in the GCG-user community which address many of the deficiencies of the SeqWeb tool – the most popular are W2H[33] and WWW2GCG[34].

W2H was initially developed by Martin Senger at the Deutsches Krebsforschungszentrum, Heidelberg, Germany (DKFZ) in 1997. It has since been maintained as a collaborative project between DKFZ and the European Bioinformatics Institute, Hinxton, UK (EMBL-EBI) with a colleague, Peter Ernst. Both Peter Ernst and Martin Senger have developed W2H as a free web interface to a number of popular bioinformatics tools – not only is GCG included in the supported tool list, but it also supports other tools like the European Molecular Biology Software Suite (EMBOSS) and the Heidelberg Unix Sequence Analysis Resources (HUSAR). The goal of the designers was to “cover as much functionality as possible, and do it as user-friendly as we could”[33]. The suite provides access to over 100 programs and requires only a JavaScript-enabled web browser.

WWW2GCG was written by Marc Colet, of the Bioinformatics Unit in the Département de Biologie Moléculaire de l'Université Libre de Bruxelles. Rather than trying to be a generic web interface to many types of bioinformatics tools, WWW2GCG was written specifically for GCG. It provides a web version of the SeqLab editor and what it calls the “WebShell” – an interface to the UNIX command line so GCG commands can be executed[35].

Both W2H and WWW2GCG are very popular alternatives to the SeqWeb add-in offered as the “official” web interface for GCG.

CHAPTER 6

MUMS AND SUFFIX TREES

In order to solve Steven Thompson’s problem, the idea is to use multiple sequence alignment algorithms to align ESTs and cDNAs returned from database searches. Then, perhaps after that is complete, use a pairwise alignment algorithm to align the genomic sequences with the ESTs and the cDNAs. Many pairwise and multiple sequence analysis algorithms exist, and to boot, there are some good implementations in the GCG Wisconsin package.

We may have to deal with potentially large sequences and potentially a large list of resultant sequences to sift through. Nearly all traditional pairwise and multiple sequence alignment algorithm implementations rely on some form of dynamic programming algorithm. According to the paper, Alignment of whole genomes[36], these algorithms were designed to work with gene sequences on the order of tens of thousands of nucleotides in length. These algorithms were never designed to process millions of nucleotides in a reasonable amount of time. These sequence alignment techniques never get any better than O(n^2) for running time, although hashing can be used to optimize the space complexity to O(n)[36].

The Institute for Genomic Research (TIGR) developed MUMmer. MUMmer performs a pairwise alignment, but does not use a dynamic programming algorithm to perform the heavy lifting of the job of the alignment process. Instead it only resorts to a dynamic programming algorithm to align the gaps in the two sequences, not the nucleotides.

So what does perform most of the alignment duties in their algorithm? Most of the heavy lifting of the algorithm is performed by suffix trees – a data structure that has been under-utilized in bioinformatics until recently, according to Dan Brown, “because they have a reputation for being confusing and taking lots of room”[37]. Despite the reputation, with the aid of suffix trees the authors claim that MUMmer performs the process of pairwise sequence alignment essentially in linear time (with some restrictions on the input).

In addition to suffix trees there are two other concepts necessary to understand MUMmer:

Maximal unique match (MUM)

Longest increasing subsequence (LIS)

The original version, MUMmer 1.0 was fast, but consumed a bit too much memory. MUMmer 2.1 introduced several algorithmic improvements to reduce the memory consumption of the system by one-third while increasing the speed of the main algorithm by a factor of three[38]. The same basic algorithmic building blocks are used in both versions of the program.

So let’s start with some background on the fundamental components of MUMmer – MUMs, suffix trees, LIS, and Smith-Waterman. Fortunately we’ve already discussed Smith-Waterman in Chapter 4 – it is just a kind of pairwise sequence alignment algorithm that performs local alignment.

MUMs

A MUM is a maximal unique match. MUMmer identifies all maximal unique matching subsequences between two input strings (or rather genomes, but it is the same thing for our discussion purposes). A MUM is a string that occurs exactly once in each sequence and that cannot be extended in either direction and still be a match.[39]

Here’s an example of a MUM directly from the MUMmer paper:

Genome A: tcgatcGACGATCGCGGCCGTAGATCGAATAACGAGCATAAcgactta

Genome B: gcattaGACGATCGCGGCCGTAGATCGAATAACGAGCATAAtccagag

The maximal unique matching subsequence (MUM) is the 39 nucleotides shown in uppercase. Anything longer would include a mismatch, and thus would not be a MUM. Anything shorter could be extended.

A point made in the MUMmer paper[36] is that there is some intuition around using MUMs. “If a long, perfectly matching sequence occurs exactly once in each genome, it is almost certain to be part of the global alignment”[36].
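The MUM definition can be made concrete with a brute-force sketch. This is deliberately not the suffix-tree method MUMmer uses; it simply checks the three conditions (match, maximal, unique) directly and is far too slow for genomes. The function names, the `min_len` parameter, and the two short toy sequences are assumptions made for illustration.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=1):
    """Return (pos_a, pos_b, length) for every maximal unique match.

    Positions are 0-based. Brute force, for illustration only.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not left-maximal if the preceding characters also match.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Unique: exactly one occurrence in each sequence.
            if count_overlapping(a, match) == 1 and count_overlapping(b, match) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    genome_a = "ccgattacatt"  # toy data, not from the MUMmer paper
    genome_b = "aggattacagc"
    for pos_a, pos_b, length in naive_mums(genome_a, genome_b):
        print(pos_a, pos_b, genome_a[pos_a:pos_a + length])
```

In the toy data the shared run "gattaca" is the only match that is both maximal and unique; shorter shared strings such as "att" are rejected because they occur more than once in one of the sequences.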

MUMs are obtained from suffix trees. A suffix tree is a data structure that represents all the suffixes of a string, and therefore all of its substrings, since every substring is a prefix of some suffix. Lloyd Allison presents some good examples of suffix trees, plus an interactive suffix tree simulator on his web page[40]:

If one views a string T as a sequence of characters $t_1 t_2 \ldots t_i \ldots t_n$, then $T_i = t_i t_{i+1} \ldots t_n$ is the suffix of the string that starts at position i. Here’s an example from his web page:

T1 = mississippi
T2 = ississippi
T3 = ssissippi
T4 = sissippi
T5 = issippi
T6 = ssippi
T7 = sippi
T8 = ippi
T9 = ppi
T10 = pi
T11 = i
T12 = (empty)

A suffix tree stores all these suffixes in a tree-like data structure. A suffix tree is an optimized form of a data structure called a “suffix trie”, which is a “type of digital search tree…[representing] a set of pattern strings, or keys, over a finite alphabet. The term was coined by Fredkin (1959, 1960), from ‘information retrieval’”[41]. While not obvious or straightforward, it is possible to build a suffix tree in linear O(n) time.

If the non-empty suffixes are sorted:

T11 = i
T8 = ippi
T5 = issippi
T2 = ississippi
T1 = mississippi
T10 = pi
T9 = ppi
T7 = sippi
T4 = sissippi
T6 = ssippi
T3 = ssissippi

You can start to notice that some of the suffixes share common prefixes. For instance, there are “substrings starting with ‘i’, ‘m’, ‘p’ and ‘s,’ but all of those starting ‘is’, in fact start ‘issi’”.
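The suffix listing and its sorted order are easy to reproduce, which makes the shared prefixes visible. The following is a small sketch; the helper names are hypothetical and positions are 1-based to match the T1 to T11 labels used above.

```python
def suffixes(text):
    """Return (position, suffix) pairs for every non-empty suffix, 1-based."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def sorted_suffixes(text):
    """Return the same pairs ordered alphabetically by suffix."""
    return sorted(suffixes(text), key=lambda pair: pair[1])


def common_prefix(a, b):
    """Return the longest prefix shared by two strings."""
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return a[:length]


if __name__ == "__main__":
    ordered = sorted_suffixes("mississippi")
    for position, suffix in ordered:
        print(f"T{position} = {suffix}")
    # Adjacent sorted suffixes reveal the prefixes a suffix tree will share.
    for (_, first), (_, second) in zip(ordered, ordered[1:]):
        print(repr(common_prefix(first, second)))
```

Sorting is used here only to expose the common prefixes; it is not how a linear-time suffix tree construction proceeds.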

From these observations, you can construct a suffix tree from these substrings, like so:

Tree                          Substrings

tree
  mississippi                 m .. mississippi
  i
    ssi
      ssippi                  i .. ississippi
      ppi                     issip, issipp, issippi
    ppi                       ip, ipp, ippi
  s
    si
      ssippi                  s .. ssissippi
      ppi                     ssip, ssipp, ssippi
    i
      ssippi                  si .. sissippi
      ppi                     sip, sipp, sippi
  p
    pi                        p, pp, ppi
    i                         p, pi

Figure 20 - Suffix tree

In a suffix tree implementation, rather than using the substrings like “ssi” on the edges (arcs) as labels, one merely uses integer indexes denoting the start and end of the substring. For example, with the substring “ssi”, the start and end indexes would be <3,5> in the string “mississippi”.
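To make the structure tangible, here is a minimal sketch of the unoptimized form, a suffix trie built from nested dictionaries, together with a helper that recovers an edge label from a start and end index pair. It is a naive quadratic construction, not the linear-time suffix tree algorithm, and all function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (O(n^2) work)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern is a substring of the text iff it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def edge_label(text, start, end):
    """Recover an edge label from a 1-based, inclusive <start,end> pair."""
    return text[start - 1:end]


if __name__ == "__main__":
    text = "mississippi"
    trie = build_suffix_trie(text)
    print(sorted(trie))              # first characters of the root's branches
    print(contains(trie, "ssi"))     # a substring of the text
    print(contains(trie, "ssp"))     # not a substring of the text
    print(edge_label(text, 3, 5))    # the label stored as <3,5>
```

Storing index pairs instead of copied strings is what keeps the labels of a real suffix tree at constant size per edge, regardless of how long the labelled substring is.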

MUMmer 1.0 constructed a suffix tree for genome A, then added the suffixes for genome B to the original suffix tree, by “adding one suffix at a time to the portion of the tree that has already been constructed”[36]. In the original paper, the authors mention that an alternate way of achieving this effect would be to concatenate the two genomes with a dummy character like a “$” between the two, constructing a suffix tree from the single concatenated string.

MUMmer needs to keep track of one extra critical piece of information in the suffix tree to discover MUMs – every leaf node must indicate whether the suffix belongs to genome A or to genome B.

First, a string is a unique match only if it is a node in the suffix tree that has “exactly two descendants, one from each sequence.”[37] Further, a unique match is maximal if it is “followed and preceded by different letters in each sequence.”[37] So that is how MUMs are discovered: “it must be a node with two children, which must have different left characters and be from both sequences.”[37]

The beauty of this approach is that these MUMs can be identified all in a single pass through the suffix tree. To top it off, the suffix tree itself can be built in linear time.

Now we have a list of maximal unique matches (MUMs) from genome A and genome B, with each MUM containing the positions in each sequence. The next step is to “stitch together” the two sequences into a full alignment. However, there’s a problem, the positions going from each sequence might “cross each other”. The following diagram from the MUMmer paper illustrates this point:

Figure 21 - Aligning Genome A and Genome B after locating MUMs

It is reasonable to mention that in the area of possible application, this problem with “position crossing” won’t be an issue, because it will be a precondition that one will be dealing with nearly identical sequences, and thus there will not be any “translocations” in the sequences. However, this is a problem in the general pairwise alignment case.

In his lecture on MUMmer, Professor Brown makes this point a little more clear by calling this an “anchor-based” alignment approach, where one breaks the sequence into “chunks” and those chunks have several “anchors”, and that one must choose a good set of anchors as a general approach[37].

In particular, with MUMmer, as shown above, the “anchors” for our MUMs might cross each other. Here’s another diagram from Dr. Brown’s lecture:

Figure 22 - Crossing anchors

Dr. Brown points out something that isn’t made completely clear in the paper – their solution is to “choose the largest set of anchors”[37].

Figure 23 - Aligned anchors

This is done by “sorting the positions in one sequence, then finding the longest increasing subsequence in the other sequence.” This step “closes the gaps” and performs a global alignment of the two sequences[36]. Dr. Brown also points this out a little more clearly – “the LIS approach reduces the number of subalignments by a lot”[37].

Longest Increasing Subsequence

From traversing the suffix tree, we are given N MUMs (or what Dr. Brown calls “anchors”). Each is denoted by a pair of positions $(x_i, y_i)$, one in each sequence.

When one tries to find the longest increasing subsequence, one tries to find “the longest monotonically increasing sequence in a sequence of n numbers”[24]. Or how Dr. Brown puts it, “each point is higher in both x and y coordinates than its predecessor”.

Here’s an example from the Algorithm Design Manual. If we take a look at the sequence:

S = (9, 5, 2, 8, 7, 3, 1, 6, 4)

The longest increasing subsequence has a length of 3 and can be (2, 3, 6) or (2, 3, 4).

The same thing happens with MUMmer, just using both x and y coordinates with a modified LIS algorithm. While the most straightforward approach to coding the LIS algorithm is to use dynamic programming[24], an O(n log n) version of the LIS algorithm has been developed, and that is what MUMmer uses[23].
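An O(n log n) LIS can be sketched with binary search over the smallest known tail of each run length. This is the plain textbook technique, not MUMmer's modified variant, whose details are not given here; `lis_indices` and `chain_anchors` are hypothetical names, and the anchor list in the usage lines is made-up data.

```python
from bisect import bisect_left


def lis_indices(values):
    """Return indices of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: smallest tail value of any run of length k + 1
    tails_idx = []  # index in values of the element stored in tails[k]
    prev = [-1] * len(values)  # predecessor links for reconstruction
    for i, value in enumerate(values):
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


def chain_anchors(anchors):
    """Sort (x, y) anchors by x, then keep a longest run with increasing y."""
    ordered = sorted(anchors)
    keep = lis_indices([y for _, y in ordered])
    return [ordered[i] for i in keep]


if __name__ == "__main__":
    s = (9, 5, 2, 8, 7, 3, 1, 6, 4)
    print([s[i] for i in lis_indices(s)])
    # Hypothetical anchors: the second one "crosses" the others and is dropped.
    print(chain_anchors([(1, 1), (2, 5), (3, 2), (4, 3), (5, 4)]))
```

On the sequence S above this sketch happens to return (2, 3, 4); (2, 3, 6) is an equally valid answer of the same length.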

The LIS algorithm “closes the gaps” between the two sequences and performs a global alignment, but now a local alignment must be performed to close the “local gaps” and complete the alignment. This is where MUMmer employs Smith-Waterman on the two sequences. Smith-Waterman is a standard pairwise alignment dynamic programming algorithm for performing local sequence alignment.
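For reference, the core of Smith-Waterman fits in a few lines when only the best local score is needed. This is a minimal sketch: the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration, not the parameters MUMmer uses, and no traceback is performed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses the standard recurrence with a floor of zero, keeping only two
    rows of the table, so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The zero lets a local alignment restart anywhere.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The quadratic table is the reason this step is reserved for the short gaps between anchors rather than applied to whole genomes.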

CHAPTER 7

CONSENSUS ALGORITHMS

If you recall, the final step in the proposed “human expert” algorithm is to “use the consensus” between the aligned sequences. To generate a consensus amongst sequences, one resorts to a class of algorithms known as clustering algorithms.

Clustering algorithms attempt to classify data into different groups. A clustering algorithm will segment the data into different groups, with each group containing members which are similar. There is no need to know in advance what the patterns will be or how the data will be grouped together – instead these clusters are discovered dynamically.

Below is an example of clustering using two-dimensional data. In the figure below there are two blocks with dots. In the block on the left, the dots are displayed “unclustered”. On the right, the same dots are displayed, surrounded by cluster boundaries[42].

Figure 24 - Clustering two-dimensional data

If we do something analogous with the sequences returned from a search result – partitioning them into clusters – we can treat all sequences in the same cluster as if they were a single representative sequence, thus reducing the number of search results returned. We have one major problem, though – not all our sequence data is the same – we’re dealing with genomic sequences, ESTs, and cDNAs. This makes things a bit more difficult in the design of a potential clustering algorithm, but not impossible. It certainly means that a clustering algorithm must at least perform some kind of alignment algorithm before trying to compare two sequences. And even then, there is going to be a decision as to what the threshold should be in deciding how similar two sequences are, particularly between high-quality and low-quality data, like say, between a genomic sequence or cDNA sequence against an EST.

Clustering in MUMmer 2.1

One of the many improvements in MUMmer 2 is the addition of a clustering module. The algorithm used in their clustering module appears to be representative of the way in which clustering is handled in general in a program that deals with sequence data. Admittedly, my sampling of clustering algorithm sources for sequence handling was three programs in all: d2_cluster[43], Unigene[43], and MUMmer, but MUMmer was indeed consistent and representative, so the statement fits, although this was definitely not an exhaustive search of the clustering field. The only real difference in clustering methodology between these three programs was the heuristic chosen for the clustering criteria, not the actual clustering algorithm.

At the core of the clustering algorithm is a procedure called Union-Find and a data structure called a collection of disjoint sets[44]. Naturally, a set data structure must be at the core of a clustering algorithm, since one is grouping clusters or “sets” of data.

The Union-Find Algorithm

The Union-Find algorithm works on data that is in a data structure called a collection of “disjoint sets”. A collection of disjoint sets is a collection of sets that are disjoint – no individual piece of data is in more than one set. Another name for the collection of sets is a partition, because each piece of data is partitioned among the sets.

The Union-Find algorithm limits itself to just two operations: “Union” and “Find”, interestingly enough. The Union operation occurs when two sets are merged into one. The Find operation occurs when a query is made for a piece of data – a “find”. The “Find” operation takes some data and returns the set to which it belongs.

Initially the Union-Find algorithm begins with all the data items each in a separate set. Then, according to some criteria, it will merge some of those sets together with the “union” operation. Those merged sets are clusters.

Here is an example. Suppose our data is a list of strings representing produce bought at a store:

“Banana”, “Papaya”, “Avocado”, “Pineapple”, “Corn”, “Lettuce”, “Okra”

So in the first step of the algorithm, we begin with every item in a separate set:

{“Banana”} {“Papaya”} {“Avocado”} {“Pineapple”} {“Corn”} {“Lettuce”} {“Okra”}

Now, suppose our criterion for a merge operation is whether the items are both fruits or both vegetables – this will be our clustering criterion.

We can first iteratively go through the initial partition, comparing each pair of subsets to see if they are both fruits or both vegetables, and if so, then we need to merge them. Here are the results of that first pass:

{“Banana”, “Papaya”} {“Avocado”} {“Pineapple”} {“Corn”, “Lettuce”, “Okra”}

With N items in our initial partition, we’d have to perform N iterations before the clustering would be finished. After that is complete, the final, clustered partition[43], would be this:

{“Banana”, “Papaya”, “Pineapple”} {“Avocado”, “Corn”, “Lettuce”, “Okra”}

Then, with the Find operation, we can answer meta-questions about the clusters. For example, find(“Papaya”) returns {“Banana”, “Papaya”, “Pineapple”}, but that also says “A ‘Papaya’ is a fruit and so are ‘Banana’ and ‘Pineapple’”.[45]
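The produce example can be sketched with a small disjoint-set class. This is one common way to implement Union-Find (parent links with path compression and union by size), offered as an illustration rather than as the implementation any of the cited programs use. The class and function names and the `kind` lookup table are hypothetical; the table follows the final partition given above, which places “Avocado” with the vegetables.

```python
class DisjointSets:
    """A collection of disjoint sets supporting union and find."""

    def __init__(self, items):
        # Initially every item is alone in its own set.
        self.parent = {item: item for item in items}
        self.size = {item: 1 for item in items}

    def find(self, item):
        """Return the representative of the set containing item."""
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        """Merge the sets containing a and b (smaller set joins larger)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def members(self, item):
        """Return every item in the same set as item."""
        root = self.find(item)
        return {other for other in self.parent if self.find(other) == root}


def cluster(items, same_cluster):
    """Union every pair of items for which same_cluster(a, b) is true."""
    sets = DisjointSets(items)
    for index, a in enumerate(items):
        for b in items[index + 1:]:
            if same_cluster(a, b):
                sets.union(a, b)
    return sets


if __name__ == "__main__":
    # Hypothetical lookup table standing in for the clustering criterion.
    kind = {"Banana": "fruit", "Papaya": "fruit", "Pineapple": "fruit",
            "Avocado": "vegetable", "Corn": "vegetable",
            "Lettuce": "vegetable", "Okra": "vegetable"}
    produce = list(kind)
    sets = cluster(produce, lambda a, b: kind[a] == kind[b])
    print(sorted(sets.members("Papaya")))
```

Here `find` returns a representative item rather than the whole set; `members` is the convenience that answers the question posed above about which items share a cluster with “Papaya”.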

There’s only one extra twist to the way that MUMmer implements Union-Find. Naturally, in order to be efficient, its set implementation is tree-based. However, it does one extra bit of optimization in its partitioning scheme – it maintains the connected components of a graph to build partitions. Each vertex belongs to exactly one component in this partitioning scheme. What “connected components” means is that it maintains a list of paths between two nodes. The paths are a list of nodes and edges in the tree. MUMmer needs to maintain this list of connected components, because that list of paths is the list of partitions, if the internal representation for the set data structure is a tree.[24]

The only thing left to mention is just a little bit about the clustering criteria. Naturally because DNA sequences are involved, MUMmer performs an alignment. Basically it performs a variant of the dot plot algorithm between the two sequences, looking for “sufficiently similar diagonals”, according to a threshold set by the user. If the two sequences are “sufficiently similar”, then they are placed in the same cluster.
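The idea of a threshold on “sufficiently similar diagonals” can be illustrated with a crude stand-in. The sketch below scores two sequences by the fraction of matching characters on their best dot-plot diagonal and merges sequences whose score meets a user-supplied threshold. It is an assumption-laden illustration, not MUMmer's actual criterion: the function names, the scoring rule, and the toy sequences are all hypothetical, and sequences are assumed to be non-empty.

```python
def best_diagonal_fraction(a, b):
    """Fraction of matches on the best dot-plot diagonal of a versus b.

    Each shift corresponds to one diagonal of the dot plot; the result is
    the largest match count divided by the length of the shorter sequence.
    """
    best = 0
    for shift in range(-(len(a) - 1), len(b)):
        matches = sum(
            1
            for i in range(len(a))
            if 0 <= i + shift < len(b) and a[i] == b[i + shift]
        )
        best = max(best, matches)
    return best / min(len(a), len(b))


def cluster_sequences(seqs, threshold):
    """Group sequence indices whose pairwise score meets the threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if best_diagonal_fraction(seqs[i], seqs[j]) >= threshold:
                parent[find(j)] = find(i)  # union the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]  # made-up sequences
    print(cluster_sequences(toy, threshold=0.8))
```

Changing only `best_diagonal_fraction` changes the clustering heuristic while the Union-Find machinery stays the same, which matches the earlier observation that the programs surveyed differ mainly in their clustering criteria.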

CHAPTER 8

IMPLEMENTATION RECOMMENDATIONS: C++ OR JAVA?

Our choice of implementation language for the bioinformatics utility belt is largely a matter of choosing between C++ and Java.

The C/C++ language has a long history of performance. It is used in numerous scientific applications and is a staple in computer science programs everywhere. For scientific applications, it is probably second only to FORTRAN in popularity, although perhaps that is waning as fewer and fewer students even learn the language these days. Also, there have been some adequate numerical libraries developed for the C/C++ language.

C/C++ might be the language of choice for implementation, judging by the choice that the TIGR developers made – what language did they choose for the MUMmer implementation? C++. But why? Perhaps it was merely the language with which the authors were most familiar? Could that have been it? Why not Java, the other most popular choice of language in computer science these days?

As an experiment, I tried implementing a simplified version of the MUMmer algorithm, using BioJava, a popular Java-based class library of biological processing routines[46]. It failed miserably, throwing a runtime exception saying that it ran out of memory when I tried to run the BioJava suffix tree class on even a relatively small gene of about 15,000 nucleotides. There is a note in the BioJava implementation that while the implementation is “needed to be as space-efficient as possible…more work could be done on the space issue.”[47] A more thorough perusal of their code reveals that a lot more work needs to be done on the space issue, actually. Not only that, there are a lot of issues that need to be addressed for running time. They actually use a more generic suffix tree implementation that performs miserably for both space and time on large gene sequences. It makes no assumptions about the size of the alphabet, and it sorts the “left child, right sibling” nodes to optimize for subsequent searching later on. That would be great if a) we had a large alphabet and a large number of child nodes, and b) we were going to do much searching. Neither holds: we’re working with DNA, which has four letters in its alphabet, and we only plan on iterating through the string once if we’re using an algorithm like the one MUMmer uses. And this sorting algorithm is a huge time killer. Not to mention that their representation for the substrings takes an enormous amount of space, relatively speaking. Their implementation uses whole objects instead of trying to pack the data efficiently into structures. Also, they use a “real” tree implementation for their suffix trees instead of using “suffix arrays” – easy to follow from an object-oriented programming perspective, but space-filling.

However, this just seems like an implementation issue. The BioJava coders didn’t write a good implementation for the kind of objects that we need for our utility belt.

It appears that Java is not always the best choice for demanding scientific applications – at least, not yet.

NINJA

It is the goal of the members of the Numerically Intensive Java (NINJA) Project to promote Java as a good environment for scientific computing. In their seminal paper describing the project, they point out three major issues with the Java language that act as roadblocks to getting good performance in numerical code:[48]

“The Java exception model” – More specifically, all the validation that Java requires be performed on arrays, to make sure that any references are non-null and are not out-of-bounds, causes a lot of extra “runtime overhead”. In addition “code reordering…and loop iteration reordering…is prohibited, thus preventing almost all optimizations for numerical codes.”

“Arrays in Java” – “Java has no direct support for truly rectangular multidimensional arrays”. The only way Java supports multidimensional arrays is through arrays of arrays which is not quite the same thing. Since arrays of arrays are not necessarily rectangular, Java must determine the shape of an array at runtime, which is extra overhead. Also not having rectangular arrays prevents some optimizations.

“Complex numbers in Java” – Java doesn’t support complex numbers in an efficient manner, only real numbers. In Java, one must use the Complex object, which is fairly expensive, due to the overhead in object creation, allocation and garbage collection of intermediate values.
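The difference between an array of arrays and a truly rectangular array can be shown with a language-neutral sketch. It is written in Python rather than Java, so it is only an analogy for the layout issue and says nothing about Java performance; `RectMatrix` is a hypothetical class, not part of any NINJA library.

```python
from array import array

# An "array of arrays": nothing stops the rows from having different
# lengths, so the shape can only be discovered row by row at runtime.
jagged = [[1.0, 2.0, 3.0], [4.0]]
row_lengths = [len(row) for row in jagged]


class RectMatrix:
    """A rectangular matrix stored in one contiguous row-major buffer."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = array("d", [0.0]) * (rows * cols)

    def _index(self, i, j):
        # One bounds check against a shape that is known up front.
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError("matrix index out of range")
        return i * self.cols + j

    def get(self, i, j):
        return self.data[self._index(i, j)]

    def set(self, i, j, value):
        self.data[self._index(i, j)] = value


if __name__ == "__main__":
    print(row_lengths)
    m = RectMatrix(2, 3)
    m.set(1, 2, 7.5)
    print(m.get(1, 2), len(m.data))
```

With a fixed shape, element (i, j) is always at offset i * cols + j, which is the kind of regularity the rectangular-array argument above depends on.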

Naturally, the NINJA Project members go on to describe all the various ways to circumvent these problems in Java: postprocessors that transform Java code that performs array processing into a new Java program that deals with arrays and complex numbers more efficiently. It is their thesis that while Java wasn’t originally designed with scientific computing in mind and has some deficiencies that can be overcome, there is no reason why it can’t be a viable scientific computing platform.

The Cost of Being Object-Oriented

In 1998, a study was performed at Rice University, comparing the performance of a true object-oriented Java program against the original implementation of the popular LINPACK library developed in Fortran.[49] According to author James W. Cooper, “the LINPACK Library is primarily a set of FORTRAN modules for matrix manipulation…Computational scientists in physics, chemistry, and engineering have used these FORTRAN libraries for years and have invested a great deal of effort in optimizing the code and the compilers that produce the object modules.”[50]

Just to make things interesting, a third version of the program was written in Java for comparison – a “lite” version – “designed for higher performance”, where no objects were instantiated. This made it possible to quantitatively benchmark the “overhead” of being object-oriented, and it also provided a more ordinary benchmark for Java compilers and virtual machines, one that would apply to any sort of Java implementation.

Note that these researchers at Rice used many of the optimizations that the NINJA project group espouse. In particular, the Rice group used a postprocessor to make their array handling code more efficient.

James W. Cooper noted that the researchers at Rice University found that on a Sun machine, the “lite” Java version was only about 70-130% slower than the pure, compiled FORTRAN program. What is “depressing” is that the “well-structured OO” version was “on average 10-20 times slower than that”[50].

So in their paper, the researchers at Rice found several more subtle effects that limited the performance of their Java code, even with some of the NINJA team’s optimizations:

Every number that is part of a computation is allocated on the heap as a separated object, with additional overhead for instantiation and garbage collection

Numbers that are elements of a matrix are scattered all over the heap, eliminating cache performance benefits that standard matrices can use.

All numeric operations were done through method calls on corresponding objects, which incurred additional overhead as well as dynamic dispatch to determine which method is invoked.

Each number takes up more memory as an object

Objects and method calls prevent or limit some common compiler optimizations, such as code motion, that can be performed in the FORTRAN and Lite OO versions[50].
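The memory cost of treating every number as a separate heap object can be observed by analogy in CPython, where each float is itself an object. This sketch is not a measurement of Java and assumes the CPython interpreter; the exact byte counts vary by version and platform, and the function names are hypothetical.

```python
import sys
from array import array


def boxed_bytes(values):
    """Approximate bytes for a list holding individually allocated floats."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def packed_bytes(values):
    """Bytes for the same numbers packed contiguously as raw doubles."""
    return sys.getsizeof(array("d", values))


if __name__ == "__main__":
    numbers = [float(i) for i in range(1000)]
    print("one object per number:", boxed_bytes(numbers))
    print("packed doubles:       ", packed_bytes(numbers))
```

The packed form also keeps neighbouring elements adjacent in memory, which relates to the cache point made in the list above.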

These results are interesting. They lend a little more credence to the idea that one can go overboard with object-orientation – you can take a metaphor too far. Making every number an object, as the fully object-oriented version did, was probably a bad design idea.

One should still write well-structured, object-oriented programs, because they are easier to understand and maintain. One might advocate doing scientific coding in Java over C++ because it forces you to write better object-oriented programs, in the expectation that the supporting tools and compilers will eventually deliver performant programs. Or one can take a more pragmatic position and use C++ for now, since the tools are more established. It seems one has to understand a whole lot more about the internals of the Java language and compiler to take on a scientific computing project in this language.

Further adventures of the Rice Research Team

Since the Rice paper was written in 1998, I wondered if anything more recent had been done and if anything had changed since then. Four years is a lot of time in the computer science world, and this was with the 1.1 & 1.2 versions of the Java Virtual Machine (JVM), positively ancient in the Java world. In further work performed by the same group, they developed a postprocessor for the fully object-oriented Java programs that allowed them to perform within 75% of the Fortran code. This speedup was about the same as the lite OO version in their original study. This postprocessor is called JaMake.

In other words, the postprocessor took a fully object-oriented Java program and made it perform as well as the “lite” OO Java version. Well, almost as well. Even so, the group continued to use JVM 1.1 and 1.2 for testing. I would like to see whether the more recent 1.3 or 1.4.x JVMs bring any performance increase.

One can conclude that for the portions of your code where you want every last bit of performance, you still really need to resort to C++ or Fortran. But perhaps Java is becoming a viable option for scientific programming.

CONCLUSION

The logical next step might be to use the tools in this bioinformatics utility belt as the basic building blocks for an automated solution to this search result redundancy problem. The components of the utility belt include:

The three fundamental bioinformatics algorithm types: dot plots, pairwise alignment, and multiple sequence alignment

Implementations of the bioinformatics algorithms incorporated in the GCG Wisconsin package

The optimized pairwise alignment and clustering algorithm from MUMmer.

Hints towards the development of an approach towards a consensus algorithm.

Clustering algorithms.

The recommendation that C++ would be the preferred implementation language over Java.

One might ask, “Why is this necessary? Why not eliminate the source of the redundancy in the first place, in the databases themselves?” It is a reasonable question. Pre-elimination of the redundancy at the database-level would certainly be a more efficient solution to the problem. However, this kind of preprocessing would require write-access to the database (or a high-bandwidth connection) and it would require filtering of literally billions of sequences. Post-processing the redundancy at the “similarity search result”-level does not require the same kind of intensive processing. One only needs to filter a few hundred or perhaps a few thousand sequences at a time.

The simple answer is that, at least for the time being, you will never get biologists to do it that way and create databases that are completely redundancy-free. Efforts have been made to create “redundancy-free” databases, like NCBI’s UniGene[51]; however, GenBank’s redundancy has not driven biologists to switch from GenBank to UniGene in droves. Part of the reason is that even UniGene does not address the redundancy problem in the way that all biologists want it addressed. There is no attempt to produce contig assembly sequences or consensus sequences for the UniGene clusters[52]. Also, UniGene focuses only on mRNA and EST data[13]. The lack of a consensus sequence and the concentration on mRNA and EST sequences mean that UniGene cannot be the single redundancy-free biological database that every biologist would use.

I am sure there are other, non-technical reasons why biologists prefer post-processing as well. According to Steven Thompson, the post-processing approach for eliminating redundancy is also a cultural preference amongst bioinformaticians. In addition, post-processing a few search results allows one to offer a wide variety of customization options to tweak the heuristic parts of the algorithm. If one were to pre-process an entire database, the customization possibilities would probably be more limited. Perhaps it makes sense to choose the less efficient option of post-processing, for its customization possibilities and because it deals with smaller chunks of data at a time, over pre-processing, which is the more efficient solution but requires dealing with the entire database at once and offers fewer customization options.

Here is a sketch of the steps that the automated solution would follow to tackle this redundancy problem using the utility belt components:

1. Issue a similarity search to query a particular gene sequence against one or more biological databases. This similarity search could be performed using the GCG BLAST or FASTA family of programs. The similarity search would return a set of sequences and expectation scores.

2. Perform a multiple sequence alignment, aligning ESTs to cDNA to genomic sequences, of all the data in the result set. MUMmer’s fast pairwise alignment algorithm would be used. The heuristic is that one can perform a multiple sequence alignment with a pairwise alignment algorithm: the pairwise algorithm is executed “progressively” on two sequences at a time, until all the sequences in the result set have been aligned.

3. After the sequences have been aligned, the sequences are partitioned into clusters, using both the aligned sequence data and the expectation scores to guide the clustering algorithm.

4. As a precursor to generating the consensus sequence in step 5, first splice out all the introns in each sequence, because introns are unnecessary in the consensus.

5. To generate a consensus sequence, stitch together the longest possible consensus sequence from the exons of the aligned EST, cDNA and genomic sequence data. Unfortunately consensus sequence generation might be stymied a bit by errors in the data. cDNA sequences are fairly low-quality, whereas ESTs and genomic sequences are high-quality. That is one reason why a database like Unigene focuses on EST and mRNA data. The computer algorithm might end up in a quandary where it can’t stitch together the sequences properly, due to errors in the data. This will have to be dealt with somehow in the consensus-building algorithm.

6. Steps 2-5 focus on sorting out redundancies, but dealing with the introns in genomic sequences and aligning the short cDNA sequences is going to be a huge problem for the algorithm. Taxonomy analysis can help filter results from the database. First the program must present a choice to the user: do they want the program to “estimate orthologues” or “estimate paralogues”? This is based on a heuristic from biology: as a good first-pass approximation, the most similar sequences come from different taxonomies if one is estimating orthologues, and from the same taxonomy if one is estimating paralogues. Basically this tells the program which default to use when comparing for similarity: duplicated genes within the same species or duplicated genes within different species.

7. The program will identify the homologous genes according to the user’s selection preference, so that the genomic sequences may be more properly aligned to cDNA sequences.

Technically steps 6-7 would occur first in the program, or rather a default for orthologues or paralogues would probably be chosen. Then step 7 would occur whenever a cDNA sequence needed to be aligned with a genomic sequence.
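Steps 3 and 6 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a worked-out design: the Hit record, the functions filter_by_taxonomy and cluster_hits, the 0.9 identity threshold and the greedy single-pass clustering are all hypothetical choices of ours, and the expectation score is used only to decide the order in which hits seed clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One similarity-search result (hypothetical record layout)."""
    name: str
    species: str
    aligned: str     # sequence after alignment, '-' marks a gap
    e_value: float   # expectation score reported by the search


def filter_by_taxonomy(hits, query_species, mode):
    """Step 6: keep other-species hits for orthologues, same-species for paralogues."""
    if mode == "orthologues":
        return [h for h in hits if h.species != query_species]
    if mode == "paralogues":
        return [h for h in hits if h.species == query_species]
    raise ValueError("mode must be 'orthologues' or 'paralogues'")


def identity(a, b):
    """Fraction of aligned columns, ignoring gaps, where both residues match."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_hits(hits, threshold=0.9):
    """Step 3: greedy clustering; the threshold value is an assumption."""
    clusters = []
    # Hits with the lowest expectation score are visited first and seed clusters.
    for hit in sorted(hits, key=lambda h: h.e_value):
        for cluster in clusters:
            # Compare against the cluster's seed, its first member.
            if identity(hit.aligned, cluster[0].aligned) >= threshold:
                cluster.append(hit)
                break
        else:
            clusters.append([hit])
    return clusters
```

A caller would first apply filter_by_taxonomy with the user's choice and then pass the surviving hits to cluster_hits. Comparing each hit only against a cluster seed keeps the sketch short; a real implementation would more likely use one of the clustering algorithms surveyed earlier.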

It is with this sketch of an algorithm that the next phase of development would most likely proceed in solving this redundancy problem. We see the biggest problem area being the generation of the consensus sequence. Generation of consensus sequences appears to be such a huge problem in the general case that the database designers for Unigene[51] skirted the issue entirely: no consensus sequences are generated for this database. We hope that the incorporation of the taxonomy heuristic will address the problem in a reasonable fashion. This remains to be seen in a working implementation of the algorithm.
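As a point of reference for the consensus step, the simplest possible approach is a per-column majority vote over sequences that are already aligned. The sketch below assumes equal-length aligned strings with '-' as the gap character and breaks ties alphabetically; the function name majority_consensus is ours. It deliberately ignores the hard parts discussed above, namely sequencing errors, introns and conflicting data.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Return the per-column majority residue of equal-length aligned sequences.

    Gaps do not vote, and a column made only of gaps is dropped.
    Ties are broken alphabetically so that the result is deterministic.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            continue  # nothing but gaps in this column
        best = max(counts.values())
        consensus.append(min(r for r, c in counts.items() if c == best))
    return "".join(consensus)
```

For example, majority_consensus(["ACG-T", "ACGAT", "TCG-T"]) returns "ACGAT": the first column is decided two to one, and the fourth column is filled by the only sequence that has a residue there.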

EPILOGUE

“Holy information overload!” is a cry you might hear from any researcher looking at the results of a search on a modern biological database. Much like when one tries to perform a search on the world-wide-web, a biological “search engine” must now sift through millions of pieces of data in a biological database, with perhaps a large subset of the “hits” from that biological database search being redundant or similar. According to an article in the March 2002 issue of The Scientist, “GenBank, one of the oldest and largest...biological databases...doubles in size about every 15 months”[5].

So what is a researching superhero to do? And believe me, those doing biological research are doing the work of superheroes. Anyone seeking out cures for cancer, searching for the causes of genetic diseases, or doing forensic research to solve a crime is doing pretty amazing stuff! And to top it off, they can tell you where to order the best sushi in town. Now that is a superhero if I’ve ever seen one! So, back to the original question, what is a researching superhero to do?

Well, in this paper, we’re going to pattern an answer for our researcher from the pages of Batgirl, a popular graphic novel by DC Comics[53]. What, you might scoff, do superheroes and Batgirl have to do with biology and databases? Quite a bit, actually. If you do not care to read all this nonsense about superheroes and Batgirl, skip to the next chapter. If you do not care about the author or biology or databases, go ahead and just skip the entire thesis.[54] *sniff*

In superhero tradition, we need to pick a secret identity for our researcher. We’ll call her Biogirl. A superhero must have a costume, right? Nah, biology researchers are always a well-dressed lot, so that is unnecessary, don’t you think? A sidekick? Of course! Being a modern woman, Biogirl’s sidekick is a talking computer with gigabit-speed network access to all the biological databases for her research. Her superpowers? Well, here’s where Biogirl gets inspiration from her alter-ego Batgirl. First and foremost, Batgirl’s primary superpower is her mind. Batgirl tries hard to think of all the possibilities to get herself out of any sticky situation, and of course, does, because the writers guarantee that. So does Biogirl, but she doesn’t have any writers. Or... does she?

OK, so Biogirl has a brain, that helps, but Biogirl still has to deal with this overwhelming amount of biological data, so what else? Again, we look to Batgirl for inspiration. Batgirl has a utility belt with all sorts of helpful gadgets to get her out of a jam, helping her to defeat the evil legions of doom. Biogirl does too. In this case, Biogirl’s utility belt is filled with biological tools – tools which help Biogirl cope with an overload of information.

So what is in Biogirl’s utility belt? Well, this is the primary focus of this paper. Rather than bombs, grappling hooks and lockpicks, all very useful in Batgirl’s line of work[55], Biogirl’s utility belt contains a set of bioinformatics tools.

So there you have it: some research on a list of techniques one might need to have handy in a bioinformatics utility belt. We hope Biogirl will be pleased. We haven’t solved any problems in this paper; rather, we’ve just outlined the problem area and a list of possible solutions.

REFERENCES

1. Mount, D.W., Bioinformatics: Sequence and Genome Analysis. 1st ed. 2001: Cold Spring Harbor Laboratory. 564.

2. Musen, M.A., Fundamental Methods for Informatics BMI 210A/CS270A - Introduction. 2001, Stanford University School of Medicine: Stanford, California USA.

3. Correard, M.-H., V. Grundy, and J.-B. Ormal-Grenon, Oxford-Hachette French Dictionary: French-English English-French. 2001: Oxford University Press.

4. Hirschman, L., et al. Literature Data Mining for Biology. in Proceedings of Pacific Symposium on Biocomputing 2002. 2002. Hawaii: World Scientific Publishing Company, Incorporated.

5. Wilson, J.F., The Rise of Biological Databases, in The Scientist. 2002. p. 34.

6. Campbell, N.A. and J.B. Reece, Biology. 4th ed. Benjamin/Cummings series in the life sciences. 1996, Menlo Park, California.

7. NCBI, GenBank Overview. 2003, NCBI.

8. NCSC, SWISS-PROT Protein Knowledgebase Release 40.44 Statistics. 2003, NCSC.

9. NCBI, Growth of GenBank. 2003, NCBI.

10. Brusic, V., et al. Data Learning: Understanding Biological Data. in Knowledge Sharing Across Biological and Medical Knowledge Based Systems: Papers from the 1998 AAAI Workshop. 1998: AAAI Press.

11. U.S. Department of Energy Human Genome Program, Genomics and Its Impact on Medicine and Society: A 2001 Primer. 2001.

12. Tagliaferro, L. and M.V. Bloom, The Complete Idiot's Guide to Decoding Your Genes. 1999: Alpha Books.

13. Claverie, J.-M. and C. Notredame, Bioinformatics for Dummies. 2003, New York, NY: John Wiley & Sons. 432.

14. Letovsky, S., ed. Bioinformatics: Databases and Systems. Hardcover ed. 1999, Kluwer Academic Publishers: Norwell, MA. 304.

15. Brin, S. and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. in Proceedings 7th International World Wide Web Conference. 1998.

16. NCBI, BLAST Tutorial. 2000, NCBI.

17. Krane, D.E. and M.L. Raymer, Fundamental Concepts of Bioinformatics. September 12, 2002 ed. 2000. 320.

18. Swofford, D., BSC4933 Computational Methods in Bioinformatics, C. Notes, Editor. 2003.

19. Vingron, M., Online Lectures in Bioinformatics - Pairwise sequence comparison - Dot plots. 2000, German Cancer Research Center.

20. Altman, R.B., Representations and Algorithms for Computational Molecular Biology BM214/CS274 - Pairwise Sequence Alignment using Dynamic Programming. 2001, Stanford University School of Medicine: Stanford, California USA.

21. Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. 1997, New York, NY: Cambridge University Press. 534.

22. Olson, M.V., A Time to Sequence. Science, 1995. 270(5235): p. 394.

23. Cormen, T.H., et al., Introduction to Algorithms. 2nd ed. 2001, Cambridge, Massachusetts: MIT Press. 1184.

24. Skiena, S.S., The Algorithm Design Manual. 1997, Stony Brook, NY: Telos Pr. 504.

25. Needleman, S.B. and C.D. Wunsch, A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol., 1970(48): p. 443-453.

26. Smith, T.F. and M.S. Waterman, Identification of Common Molecular Subsequences. J. Mol. Biol., 1981(147): p. 195-197.

27. Vingron, M., Online Lectures in Bioinformatics - Multiple sequence analysis. 2000, German Cancer Research Center.

28. Thompson, S.M., BSC4993(4) & BSC5936(2): Computational Methods in Bioinformatics: BioComputing Basics. 2003, Steven M. Thompson.

29. Accelrys Inc., GCG is now Accelrys. 2001-2003, Accelrys.

30. Thompson, S.M., BioInformatics: A SeqLab Introduction. 2003, The Florida State University: Tallahassee, Florida.

31. Devereux, J., P. Haeberli, and O. Smithies, A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Research, 1984. 12(1): p. 387-395.

32. Peterson, W.E.P., Almost Perfect: How a Bunch of Regular Guys Built WordPerfect Corporation. 1994: Diane Publishing.

33. Senger, M., Welcome to W2H: the WWW Interface to Sequence Analysis Software Tools. 2002, Peter Ernst and Martin Senger.

34. Spegelaere, P. and M. Colet, Welcome to WWW2GCG. 1997, Marc Colet.

35. Herzog, R., WWW2GCG: a Web interface to the GCG package. 1997, embnet.news Vol 4 No 2 (31 July 1997).

36. Delcher, A.L., et al., Alignment of whole genomes. Nucleic Acids Research, 1999. 27(11): p. 2369-2376.

37. Brown, P.D., CS 882: Computational techniques in genome sequence analysis - Lecture 8. 2003, University of Waterloo.

38. Delcher, A.L., et al., Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Research, 2002. 30(11): p. 2478-2483.

39. Brown, P.D., CS 882: Computational techniques in genome sequence analysis, Winter 2003 - Lecture 8. 2003.

40. Allison, L., Suffix Trees. 2002, Monash University.

41. Stephen, G.A., String Searching Algorithms. Lecture Notes Series on Computing - Vol. 3. 1994, River Edge, NJ: World Scientific Pub Co. 256.

42. Gibas, C. and P. Jambeck, Developing Bioinformatics Computer Skills. 1st ed. 2001: O'Reilly & Associates.

43. Burke, J., D. Davison, and W. Hide, d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences. Genome Research, 1999. 9(11): p. 1135-1142.

44. Delcher, A.L., mgaps.cc. 1994, Arthur L. Delcher.

45. Shewchuk, J., CS61S: Lecture 33. 2002, University of California at Berkeley.

46. Pocock, M., T. Down, and C.A. Technology, BioJava.org. 2002, BioJava.org.

47. Pocock, M. and T. Down, biojava API. 2002, BioJava.org.

48. Moreira, J.E., et al., The Ninja Project. Communications of the ACM, 2001. 44(10): p. 102.

49. Budimlic, Z., K. Kennedy, and J. Piper, The Cost of Being Object-Oriented: A Preliminary Study. Scientific Computing, 1999. 7(2): p. 87.

50. Cooper, J.W., Is Java Fast Enough? JavaPro Magazine, 2002. 6(3).

51. Pontius, J.U., L. Wagner, and G.D. Schuler, UniGene: A Unified View of the Transcriptome. 2002, The NCBI Handbook.

52. Biology, G., UniGene. 2000: Bethesda, USA.

53. Puckett, K., et al., Batgirl: A Knight Alone. 2002: DC Comics.

54. Egremont, C., III, Mr. Bunny's Guide to ActiveX. 1998: Addison Wesley Longman.

55. Gresh, L.H. and R. Weinberg, The Science of Superheroes. Hardcover ed. 2002: John Wiley & Sons. 224.

BIOGRAPHICAL SKETCH

Educational Background

Master of Science, Computer Science (To be awarded, Spring, 2003) The Florida State University, Tallahassee, Florida. Honors: Induction to Upsilon Pi Epsilon, Computer Honor Society

Bachelor of Science, Computer Science (Spring 1993) The University of Wisconsin, Milwaukee, Wisconsin, Honors: Sophomore Honor Student

Professional Experience

Academic

Research Assistant, Florida State University 6/02-12/02, Office for Distance and Distributed Learning

Instructor, Florida State University, 9/01-5/02, Department of Computer Science

Lab Assistant, University of Wisconsin, 8/91-4/93, Center for Cryptography, Computer and Network Security

Non-Academic

Engineer, Hummingbird, 3/98-Present, Tallahassee, Florida

President, Moondaughter Productions, 4/00-Present, Tallahassee, Florida

Senior Engineer, CyberSafe, 8/96-1/98, Issaquah, WA

Software Engineer, BindView Development 12/93-7/96, Houston, TX

Systems Engineer, Allied Computer Group, 9/93-12/93, Milwaukee, WI

Network Consultant, Computer Power Group, 5/93-8/93, Milwaukee, WI

Publications and Presentations

Tomlinson, P. and Kretzer M., In Windows Developer Journal, April 1999, Volume 10, Number 4. [Reader email reprinted. Idea was the basis for Tomlinson’s article]

Taylor, M., Java Development Brown Bag Lunch – A 6-Week Short Course. January 15, 2003-February 19, 2003 at Hummingbird, Inc.

Taylor, M., COM Development Brown Bag Lunch – A 6-Week Short Course. March 7, 2003-April 11, 2003 at Hummingbird, Inc.

Taylor, M., R&D Brown Bag: Aspect-Oriented Programming. March 17, 2003 at Hummingbird, Inc.

Professional Memberships

Women in Technology International (WITI)

American Association of University Women (AAUW)

Association for Computing Machinery (ACM)

IEEE Computer Society
