
An introduction to

BIOinformatics AlgoRITHMS

S. Parthasarathy

National Institute of Technology

Tiruchirappalli – 620 015

(E-mail: [email protected])


Contents

1. Algorithms
   1.1 Analyzing algorithms
   1.2 Input data
   1.3 Input size
   1.4 Running time
   1.5 Memory requirements
   1.6 Worst-case and average-case analysis
   1.7 Order of growth
   1.8 Examples

2. Similarity: basic concepts
   2.1 Distances, similarity scores
   2.2 String distances
   2.3 Approximate string matching; finding optimal alignments
   2.4 Motifs as elements of similarity

References

1. Algorithms


Informally, an algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform the input into the output.

We can also view an algorithm as a tool for solving a well-specified computational problem. The statement of the problem specifies in general terms the desired input/output relationship. The algorithm describes a specific computational procedure for achieving that input/output relationship. A good algorithm is like a sharp knife: it does exactly what it is supposed to do with a minimum amount of applied effort. Using the wrong algorithm to solve a problem is like trying to cut a steak with a screwdriver: you may eventually get a digestible result, but you will expend considerably more effort than necessary, and the result is unlikely to be aesthetically pleasing. Algorithms devised to solve the same problem often differ dramatically in their efficiency. These differences can be much more significant than the difference between a personal computer and a supercomputer.

1.1. Analyzing algorithms

Analyzing an algorithm has come to mean predicting the resources that the algorithm requires. Occasionally, resources such as memory, communication bandwidth, or logic gates are of primary concern, but most often it is computational time and memory requirements that we want to measure. Generally, by analyzing several candidate algorithms for a problem, the most efficient one can be easily identified. Such analysis may indicate more than one viable candidate, but several inferior algorithms are usually discarded in the process.

Before we can analyze an algorithm, we must have a model of the implementation technology that will be used, including a model for the resources of that technology and their costs. Generally one assumes a simple idealized computer, a generic one-processor, random-access machine (RAM) model of computation. In the RAM model, instructions are executed one after another, with no concurrent operations.

Analyzing even a simple algorithm can be a challenge. The mathematical tools required may include discrete combinatorics, elementary probability theory, algebraic dexterity, and the ability to identify the most significant terms in a formula. Because the behaviour of an algorithm may be different for each possible input, we need a means of summarizing that behaviour in simple, easily understood formulas.

1.2. Input data


1.2.1. Structures, sequences

Biocomputing algorithms are quite varied in scope and computational tools. The majority of algorithms presented in this course are concerned with sequences. Sequences are a special kind of structure representation.

But what is a structure? Protein structures and gene structures share a common architecture: they are composed of elements (substructures) and connections (relationships) between them. In the 3D structure of a molecule we have the atoms as elements and the chemical bonds as connections between them, plus the 3D coordinates. This is in fact a rich representation. We can eliminate some of the information and get more abstract and concise representations. E.g. omitting the 3D coordinates we get atoms and the connectivities between them; this is a structural formula (a graph). Omitting the information on the chemical bonds as well, the structure collapses to a handful of atoms that we can count piece by piece to get a composition (a vector). Naturally, atoms are not the only substructures we can use; there is a great variety of larger structural units available. It suffices to think of a schematic representation of a protein structure in terms of helices and beta sheets.

Now, an amino acid sequence is a special representation in which the substructures are the 20 naturally occurring amino acids and the only type of connection or relationship we consider is sequential vicinity. As a result, a protein sequence is a character string written in a 20-letter alphabet; DNA sequences are character strings over a 4-letter alphabet.

Other structural representations used in biology use different substructures. For example, protein structures are often depicted as a simple graph of helices, beta sheets and turns. Or, sometimes, we refer to a particular motif as a four helix bundle, i.e. we describe it in terms of composition.

So we have two metaphors for sequences: chemical structures and texts. When should we use each? Sequence analysis, such as database searching, can be much better understood with the text metaphor. This holds for all problems where we keep the standard amino acid residue representation. If we need more versatile descriptions, the chemical structure metaphor is more useful. For example, the structure of multidomain proteins can be described as a series of complex elements, such as signal peptides, immunoglobulin domains etc. The chemical structure analogy is especially conspicuous when dealing with the 3D structure of proteins composed of a few α-helices and β-sheets in a special arrangement.


1.2.2. Databases

Sequence databanks are collections of DNA or protein sequences/sequence motifs. The data are organized into records (entries). Current databases are simple series of sequence entries, i.e. they are produced in the form of "flat files", not ready for practical use. For the analysis programs that perform e.g. data retrieval and similarity searches, the flat files are reorganized into special internal formats that best correspond to the use. The major program packages have their own internal databank formats, not seen by the users.

The machinery of computer databanks was developed first for industrial/financial applications (payroll management, banking). Biological sequence databanks are quite simple as compared to these.

Sequence records or "entries" contain data on one single molecule (protein, DNA fragment). The record consists of fields that have internal structures (beginning of field, end of field, subfields). The sequence field contains the sequence data as a character string. The ID (identification) field is a (more-or-less) mnemonic identification code that sometimes changes, while the AC (accession number) field is an unequivocal registration number. These record identifiers are useful when retrieving a record.

In addition to identifiers and sequence, the records contain an optional number of additional fields, generally summarized as "annotations". Annotations may contain simple information such as bibliographic references, functional descriptions, cross-references to other databases etc. Special annotation fields contain structural information organized into a "feature table". This is a special representation of the sequence (in addition to the sequence field), in which sequence segments are associated with descriptors, e.g. "from position 20 to position 30: signal peptide".

In a more formal way, a database record is a collection of data structures, and there is no canonical definition of what it should contain. We can attempt to describe some of the general features of sequence databases as they are today (1998), but even this small field changes very fast. In any case, a sequence record must contain one canonical structure representation, the sequence, and this is usually given in a character string format. In addition there is a large variety of data given, such as a) the name and various identifiers of the sequence, b) the organism the sequence is derived from, c) the function of the sequence, d) bibliographic references, e) cross-references to records in other databases (e.g. to 3D data), and f) a feature table, which is a collection of sequence segments with some function. Items a) to d) are textual information given in natural language. In contrast, the feature table is a structure representation, but instead of residues we now have functional or structural domains, i.e. complex substructures (as compared to simple amino acids).
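To make the record layout concrete, here is a minimal Python sketch of reading one such flat-file-style entry. The two-letter field codes and the example record are invented for illustration; real databank formats (EMBL, SWISS-PROT, GenBank) differ in detail.

    def parse_record(lines):
        """Parse one flat-file-style entry into a dictionary: a two-letter
        field code in the first columns, the value from column 6 onwards,
        and '//' closing the record."""
        record = {"features": [], "sequence": ""}
        for line in lines:
            code, value = line[:2], line[5:].strip()
            if code == "ID":
                record["id"] = value
            elif code == "AC":
                record["ac"] = value.rstrip(";")
            elif code == "FT":                 # feature table: segment + descriptor
                record["features"].append(value)
            elif code == "  ":                 # sequence data lines carry a blank code
                record["sequence"] += value.replace(" ", "")
            elif code == "//":                 # end-of-record marker
                break
        return record

    example = [
        "ID   TOY_PROTEIN",
        "AC   A00001;",
        "FT   SIGNAL        1     20  signal peptide",
        "     MKTAYIAKQR QISFVKSHFS",
        "//",
    ]
    print(parse_record(example))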

When we speak about sequence analysis algorithms, the overwhelming majority deals with the sequence (character strings) only and the evaluation of the annotation part is left to the user.


1.3. Input size

The best notion of input size depends on the problem being studied. For many problems, such as sorting or computing discrete Fourier transforms, the most natural and general measure is the number of items in the input. Specifically, sequence-processing algorithms are characterized in terms of sequence length and alphabet size. Some algorithms may run very well for short DNA sequences but less well for large protein sequences.

The linearity of the sequences is a very important aspect. Many algorithms would run well for character strings but would take impracticably long times for more complex structures such as graphs. For instance, if the input to an algorithm is a graph, the input size can be described by the numbers of vertices and edges in the graph. We shall indicate which input size measure is being used with each problem we study.

Many biological algorithms operate on two kinds of input: a sequence input, and a database. In a first approximation, the sequence database can be considered as a very long string which is concatenated from each sequence entry within the database. So, a database can be characterized by the total number of amino acids or nucleotides.

1.4. Running time

The running time of an algorithm on a particular input is the number of primitive operations or "steps" executed. It is convenient to define the notion of step so that it is as machine-independent as possible. For the moment, let us adopt the following view. A constant amount of time is required to execute each line of our algorithm code. One line may take a different amount of time than another line, but we shall assume that each execution of the i-th line takes time c_i, where c_i is a constant. This viewpoint is in keeping with the RAM model, and it also reflects how the algorithm code would be implemented on most actual computers.

Clearly, the above notion of running time is symbolic, and can only serve for the comparison of algorithms rather than to estimate real time necessary to run a program. Such "real computing times" may depend much more on input/output operations and the actual architecture of the machine. But if these factors are comparable for two algorithms, then the number of steps is the indicator we can use best.


1.5. Memory requirements

Memory requirements can best be defined as the number of values (numbers, strings etc.) we need to store during the calculations. This can be measured in terms of the megabytes of disk space or RAM necessary. Typically, a program may use a set of computed values that we can pre-compute and store for later use. Or, we can compute the values "on the fly" each time the program runs. If computing a value takes less time than reading it from the disk, we may want to compute it afresh every time. So there is a trade-off between storage space and running time.
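As a small illustration of this trade-off, the Python sketch below contrasts a pre-computed table with on-the-fly, cached computation; math.lgamma is only a stand-in for any expensive per-value computation, and the table size is an arbitrary choice for the example.

    from functools import lru_cache
    import math

    # 1) Pre-compute and store: costs memory up front, every later use is a
    #    fast look-up.
    log_factorial_table = {n: math.lgamma(n + 1) for n in range(1000)}

    # 2) Compute on the fly, remembering only the values actually requested.
    @lru_cache(maxsize=None)
    def log_factorial(n):
        return math.lgamma(n + 1)      # log(n!) computed when first needed

    print(log_factorial_table[20], log_factorial(20))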

1.6. Worst-case and average-case analysis

Running time and memory requirements can vary a great deal depending on the quality of the algorithm as well as on the input. It is very important to find out how an algorithm will behave in the average case and how it will fare with strange cases. We refer to these as the average-case and worst-case scenarios. The best case has not occurred yet...

• The worst-case running time of an algorithm is an upper bound on the running time for any input. Knowing it gives us a guarantee that the algorithm will never take any longer. We need not make some educated guess about the running time and hope that it never gets much worse.

• For some algorithms, the worst case occurs fairly often. For example, in searching a database for a particular piece of information, the searching algorithm's worst case will often occur when the information is not present in the database. In some searching applications, searches for absent information may be frequent. Therefore, for many algorithms the "average case" is often roughly as bad as the worst case.

• In some particular cases, we shall be interested in the average-case or expected running time of an algorithm. One problem with performing an average-case analysis, however, is that it may not be apparent what constitutes an "average" input for a particular problem. Often, we shall assume that all inputs of a given size are equally likely.

Biological sequence analysis deals with natural protein and gene sequences. As computational objects, these sequences are fairly uniform, i.e. strange sequences, like several hundred identical amino acids in a row, do not occur. This is why for many purposes these can be considered as random character strings of a given composition. For example one can generate random strings whose composition corresponds to the composition of the database. In other words we can base our average case analysis on an input of a protein sequence of 100 or 500 residues which we generate randomly.

In some cases, however, random sequences are not good models of natural sequences. In particular, both protein and nucleic acid sequences contain regions of biased composition. These are sometimes referred to as "low complexity regions" (to be discussed later). We have to remember that some algorithms do not perform well on the biased sequence parts; for example, sequence similarity searches are often misled if a sequence contains long runs of identical residues. In many cases, however, the compositional bias does not influence the running time estimate.

1.7. Order of growth

Once we know the running time and memory requirements for a given input, the next step is to find out how these change if we have a different (larger) input size. In terms of sequences, we are particularly interested in how running time depends on sequence length. In the case of algorithms operating on a database, we also have to know how running time depends on database size. Usually the size of the alphabet (20 or 4) is also an important factor.

Generally we express order of growth as a proportionality. Simply put, we now ignore our original estimate of the running time and are only concerned with how it increases with sequence length. For example, some algorithms have running times that grow linearly with the length l of a sequence, O(l) (pronounced "order of l"). Some have running time estimates of O(l^2) (pronounced "order of l-squared"). Obviously, algorithms with linear growth rates are better than O(l^2) ones. We shall use the O-notation informally. However, this is a very important notion, since some algorithms whose running times are proportional to higher powers of the sequence length may give impracticably long running times for sequences that are only slightly longer than the "average sequence" we used to test the algorithm.

We usually consider one algorithm to be more efficient than another if its worst-case running time has a lower order of growth. This evaluation may be in error for small inputs, but for large enough inputs an O(l^2) algorithm, for example, will run more quickly in the worst case than an O(l^3) algorithm.

Here we have to mention a trivial fact. Linear inputs like character strings of sequences have shallow growth rates in terms of computer times. In contrast, higher order representations, like graphs of molecular topology or atomic detail 3D structures, often require higher analysis times and their size increases running time in a much more dramatic way. So there are many algorithms that can run on sequences, while very often the same type of analysis on a 3D structure would take frighteningly long times. For example, finding a sequence segment in a large sequence is quite trivial. However, finding a particular 3D arrangement in a protein structure is much more time-consuming.

1.8 Examples

1.8.1. Sorting numbers

Let's take one example from computer science, the problem of sorting n numbers. One of the simplest algorithms, Insertion-sort, solves the problem the way we sort cards: to sort the cards in your hand you extract a card, shift the remaining cards, and then insert the extracted card in the correct place. Shell-sort and Quicksort are improved variants of Insertion-sort. The algorithms differ in their length as well as in their estimated running times:

Comparison of sorting methods

Method           Statements   Average-case time   Worst-case time
Insertion sort    9           O(n^2)              O(n^2)
Shell sort       17           O(n^1.25)           O(n^1.5)
Quicksort        21           O(n lg n)           O(n^2)
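For concreteness, here is a minimal Python sketch of Insertion-sort as described above (extract the next "card", shift the larger values to the right, and insert it in its place); the input list is arbitrary.

    def insertion_sort(numbers):
        """Sort a list in place the way one sorts cards in hand."""
        for i in range(1, len(numbers)):
            card = numbers[i]                      # extract the next card
            j = i - 1
            while j >= 0 and numbers[j] > card:    # shift larger values right
                numbers[j + 1] = numbers[j]
                j -= 1
            numbers[j + 1] = card                  # insert it in the correct place
        return numbers

    print(insertion_sort([5, 2, 9, 1, 7]))         # [1, 2, 5, 7, 9]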

1.8.2 Substring matching: Finding a part in a large sequence

A large sequence is a given series of characters (proteins have a 20-letter alphabet, DNA only a 4-letter alphabet). Very often we want to find a sequence motif, say ATGC, in a large sequence. Let us suppose that the large sequence has 1000 residues. We can design various algorithms that will perform the job. Let's say that the large sequence is l residues long (l = 1000) and the small part we want to find is w residues long (w = 4). Let's call the large string L and the small string W.


A naive algorithm may work as follows:

a) Identify the first residue of the small string W (here A).

b) Find the first A in the long sequence. If there is no A, all we have done is to read the whole string of (l - w + 1) residues and the procedure is finished.

c) If we find a residue A at position i, we stop there and see if residue i+1 corresponds to position 2 of the small string (i.e. T). If yes, we check whether residue i+2 corresponds to residue 3 of the small string, and so on.

From this we can estimate the number of reading and comparison operations necessary to find all occurrences. If the long string does not contain any substring corresponding to the small one, we had to perform l reading+comparison operations, i.e. the performance of the algorithm will be proportional to O(l). This is the best case. Now let us take a worst case, e.g. the large string contains exactly n A-s and all of them correspond to an exact copy of the W string. In this case we will again have to read and compare l times, and each time an A is found we start a procedure in which we make a maximum of w-1 further reading and comparison operations. This is a very crude approach of course, but we know that we will need l operations plus n times (w-1) comparisons; l + n(w-1) is a maximum. Since n is smaller than l, we can say that the worst-case scenario is less than l + l(w-1) = lw. So our algorithm has an upper bound of O(lw), which means that it is a linear function of the sequence length. However, if both l and w are very large, this can still mean a lot of steps. Now, the number of comparisons needed to find out if two letters are equal is at most equal to the size of the alphabet, S. This comes in as a multiplier, so the growth rate can be written as S·O(lw).
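A minimal Python sketch of this naive procedure is given below; the long sequence and the motif in the example are made up for illustration. The two nested loops make the O(lw) upper bound visible.

    def naive_find_all(long_seq, word):
        """Naive substring search: at every start position of the long
        sequence, compare character by character against the short word."""
        hits = []
        l, w = len(long_seq), len(word)
        for i in range(l - w + 1):               # l - w + 1 candidate positions
            for j in range(w):                   # at most w comparisons each
                if long_seq[i + j] != word[j]:
                    break
            else:
                hits.append(i)                   # full match found at position i
        return hits

    print(naive_find_all("GGATGCCATGCA", "ATGC"))    # [2, 7]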

A more educated scenario is the Boyer-Moore (BM) algorithm. Like the naive algorithm, BM starts by identifying the first character, but it then uses two new functions (compute-last-occurrence and good-suffix) which make it more efficient than the other string-matching algorithms. Without going into details, we mention that this algorithm has a worst-case running time of O((l-w+1)w + S), which is better than S·O(lw).

1.8.3 Calculating a numeric property of a long sequence

Numeric properties are often calculated from sequence data. In most cases we have a small table that assigns a numeric value to each residue (character) in the sequence. A typical example is a hydrophobicity profile. A hydrophobicity value is an experimental value assigned to each of the 20 amino acids, which means we have a small look-up table of 20 items.


Taking a long sequence of l residues, we can convert it into a hydrophobicity profile by exactly l reading and look-up operations. The result is a numeric profile, a series of l numbers taken from our hydrophobicity table. Let's denote each value in the profile as p_i.

Very often one has to average the property (hydrophobicity) within a window w. For example, the Kyte-Doolittle plot uses a window size of 21. Given the numeric profile of length l, a naive algorithm would take the p_i at each position in the sequence and average over a window of w values starting there. This amounts to (l-w+1) averaging processes: we only have l-w+1 positions to think about, and at each position we make w operations, summing w numbers and dividing the sum by w. Since in this case we have to calculate each value, i.e. we cannot introduce further shortcuts, the growth rate of such parametric algorithms is O((l-w+1)w) ~ O(lw). A "clever", i.e. more ingenious, algorithm would use the following consideration: if the sum at position i is known, then the sum at position i+1 can be calculated not with w but with exactly two operations, namely subtract p_i and add p_(i+w). For this to be implemented one has to calculate the full sum at position 1 (w operations) and then perform 2(l-w) operations, which is roughly ~2l. If w is not very small, this will be much faster than the naive algorithm.

How does this algorithm depend on the alphabet size (protein or DNA)? This dependence is hidden in the look-up procedure: finding a value in a table of 20 takes longer than finding it in a table of 4. So this is an independent factor, and we can approximately say that if the alphabet is of size S, then the naive algorithm will have a running time of O(Sw(l-w+1)) and the "clever" algorithm will have a running time of O(S(2l-w)). The latter will be more favourable than the naive algorithm in all practical cases.
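The following Python sketch implements the "clever" running-sum version of the windowed average described above; the look-up table values below are illustrative stand-ins, not the Kyte-Doolittle scale, and the window size is arbitrary.

    def window_average(seq, table, w):
        """Convert a sequence to a numeric profile via a look-up table, then
        average it in a sliding window of width w, updating a running sum
        (subtract the value leaving the window, add the one entering it)."""
        profile = [table[c] for c in seq]          # l look-up operations
        if len(profile) < w:
            return []
        window_sum = sum(profile[:w])              # full sum once: w operations
        averages = [window_sum / w]
        for i in range(len(profile) - w):          # ~2 operations per step
            window_sum += profile[i + w] - profile[i]
            averages.append(window_sum / w)
        return averages

    toy_table = {"A": 1.8, "G": -0.4, "L": 3.8, "K": -3.9}   # illustrative values
    print(window_average("AGLKAGLKA", toy_table, 3))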


2. Similarity: basic concepts

The concept of similarity is deeply rooted in human language and psychology and it is a complex and ambiguous notion. This is in great part true for the similarity of chemical and biological structures even if the structures are represented in a formal mathematical way.

There are two basic concepts behind the practical computation of the similarity of sequences and structures. One of them is the mathematical concept of metrics or distance functions (this is related to similarity scores and edit distances). The other is the concept of motifs, which is a more biological or chemical notion. Closely similar sequences can be compared in terms of distance measures (global similarity), while those sharing only a few common domains are better compared in terms of common motifs.

A trivial example. Let's say we have a number of cars of different types, several of each, and we want to make a small computer program that will tell whether they are similar or different. We can randomly choose almost any physical parameter (weight, size etc.) and use it without problems. For example, we can say that cars that have the same weight (i.e. a weight difference below a certain threshold) belong to the same group. In most cases this will work perfectly. What happens if we now take another class of objects, say motorbikes, and try to compare them with cars? Simple physical concepts may not be quite as useful. For example, some motorbikes have larger engines than some cars. However there are important similarities too: motorbikes and cars both have engines and wheels, i.e. they have identical substructures. For example, the statement "Cars and bikes are similar because they both have wheels" makes sense. In addition, we can make numerical comparisons between wheels.

In comparing proteins, we face these two very problems. Members of the same protein family are like cars: they all have the same components. Members of a superfamily have only a number of common domains. The solutions we use are also similar. If we compare homologous proteins (that are evolutionarily related and perform the same function), we can use a numerical similarity measure to compare and classify them. We can do the same with protein domains (i.e. we can compare and classify wheels among themselves quite well). However, classifying widely divergent proteins that have different types of domains may not be as straightforward. This situation is also reminiscent of comparing languages: closely related languages (like members of the Indo-European family) can be easily compared, and their similarities/distances will be found to be the same using almost any measure. Distantly related languages, however, cannot be so easily classified; depending on what measure we take, we can find Japanese closer to Finnish or to Chinese.

The take home message is that we first use some handles (features, motifs) to determine if two objects belong to the same equivalence class, and if they do then we go ahead with numeric comparison. The features we use very often are domains which are identified from local similarities. Once identified, we can subject the domains of the same type to global comparison (e.g. phylogenetic analysis) as we often do with entire sequences.

2.1. Distances, similarity scores.

The basic concept is a simple geometric distance of two points, a and b, often referred to as a Euclidean distance. Their distance in the plane is:

D(a,b) = sqrt( (x_a - x_b)^2 + (y_a - y_b)^2 )                (1)

This distance is positive and has the trivial properties that

1. Distance is positive, D(a,b) >= 0, and the distance from a point to itself is zero, D(a,a) = 0.

2. Distance is the same in both directions, D(a,b) = D(b,a).

3. Triangle inequality: one side of a triangle is not longer than the sum of the other two, i.e. D(a,b) + D(b,c) >= D(a,c).

The concept of distance was generalized into the concept of metrics which is defined by these three properties, plus

4. D preserves the topology of the space. Here "space" means a collection of points or objects between which we want to calculate D.

What does generalization mean?

- One way to generalize the mathematical formula of the distance is to allow for more dimensions. One can calculate a distance in space, with 3 dimensions, but one can also calculate a distance between any two vectors with many more parameters (i.e. in an "n-dimensional" space). This is a trivial but important fact, since in chemistry and biology one frequently characterizes an object (a compound, for instance) by an arbitrary number of parameters. E.g. the amino acid composition of a protein is a vector of 20 parameters, and proteins can be very conveniently compared by the distances between their amino acid composition vectors.

- One may want to weight the parameters in the distance function. Staying with the example of amino acid compositions, one may want to give a higher weight to some "important" amino acids. In the example of the Euclidean distance (equation 1) this would mean that the coordinate y (amino acid y) is multiplied by a factor greater than 1 because, for some reason, we consider it more important than coordinate x (amino acid x).


- The distance in the plane uses square and square-root functions. Even this can be generalized, as suggested by the famous mathematician Minkowski (and later used by Einstein in his relativity theory):

D(a,b) = ( SUM_i |x_i(a) - x_i(b)|^n )^(1/n)                (2)

The Euclidean distance is the Minkowski distance with n = 2. There are a great many distance functions used in chemistry and biology, and most of them are related to the Minkowski distances. One interesting distance is the so-called city-block or Manhattan distance with n = 1. This corresponds to the distance between points on the map of a city built in a checkerboard-like manner:

D(a,b) = SUM_i |x_i(a) - x_i(b)|                (3)

This distance is the sum of parametric differences. All of these can naturally be weighted (if we "prefer" some of the parameters for some reasons).

Since all structures can be transformed into vectors by listing some of their numeric properties in a given serial order, one can easily calculate distances between them. Even a long string of characters can be turned into a vector of its character composition. In addition, we can calculate the composition in terms of two-letter words, e.g. one can speak of the dinucleotide composition of DNA. Here we also have to mention the problem of alphabet size, since the number of parameters involved will depend both on the word size (dimers, trimers, tetramers) and on the alphabet size. In the case of DNA we also have to think about strand symmetry (e.g. AC = GT).


Word size    Protein    DNA    DNA with ds symmetry
    1             20      4        2
    2            400     16       10
    3           8000     64       32
    4         160000    256      136

One thing is noteworthy: if we have too many parameters, the sequence we want to describe will often not contain all the words (e.g. it is quite likely that many of the 160,000 possible tetrapeptides will be missing from an actual protein, or even from the current database of known sequences). In other terms, many of the vector coordinates may become zeroes if we choose too many parameters. Simple amino acid composition vectors can be used quite well for classifying proteins (Cornish-Bowden et al). ds-Trinucleotide distances, on the other hand, were used to classify genes (Karle et al). One can conclude that - like in every situation - there is an optimal level of complexity for every problem.
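As a sketch of this idea, the Python fragment below turns sequences into word-composition vectors and compares them with a Minkowski distance (n = 1 gives the city-block distance, n = 2 the Euclidean distance of equations (1)-(3)); the example sequences are arbitrary.

    from itertools import product

    def word_composition(seq, alphabet, word_size):
        """Composition vector: count every possible word of the given size.
        word_size=1 gives the plain residue composition, word_size=2 the
        dinucleotide/dipeptide composition, and so on."""
        words = ["".join(p) for p in product(alphabet, repeat=word_size)]
        counts = dict.fromkeys(words, 0)
        for i in range(len(seq) - word_size + 1):
            counts[seq[i:i + word_size]] += 1
        return [counts[w] for w in words]

    def minkowski(u, v, n=2):
        # Minkowski distance between two composition vectors.
        return sum(abs(a - b) ** n for a, b in zip(u, v)) ** (1.0 / n)

    x = word_composition("ATGCATGC", "ACGT", 2)
    y = word_composition("ATATATAT", "ACGT", 2)
    print(minkowski(x, y, n=1), minkowski(x, y, n=2))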

2.2. String distances

One can construct distance functions between character strings (i.e. protein and DNA sequences) that - unlike character composition - take the order of the characters into consideration. The simplest of these is the edit distance, the number of insertions, deletions and replacements needed to turn one word into the other. The best way to see this is to "align" the two character strings on top of each other, like:

B I R D
. . | |
W O R D

The vertical bars mark identities while the dots mark the non-identities, which correspond to replacements. If the words are more complex, it is easy to see that there are more ways to align them, and there are a great many possibilities to include insertions/deletions (sometimes called gaps or indels):

L A R R Y B I R D          L A R R Y B I R D
. . | |                    . . | |
W O R - - - - - - D   or   W O R - - - - - D

An edit distance refers to the best of these alignments, i.e. the one created by the minimum number of replacement and insertion/deletion operations. We have to make arbitrary decisions about how we define the distance, since a string can contain multiple copies of a word:


T H E B I R D I S T H E W O R D
      . . | |           | | | |
      W O R D           W O R D

It is a good idea to use some kind of numeric quality index for the edit distances; for example, we can use it in computer programs that look for the best alignment. We can do this very easily: we assign a numeric cost value to insertions/deletions and replacements. There is a problem, however. We have to assign more or less arbitrary cost parameters both for gaps and for replacements, so there is no guarantee that we arrive at an optimal solution.
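A minimal sketch of this idea is the classical dynamic-programming edit distance, shown below in Python with arbitrary unit costs for replacements and gaps.

    def edit_distance(a, b, replace_cost=1, gap_cost=1):
        """Minimum total cost of replacements and insertions/deletions
        (gaps) needed to turn string a into string b."""
        rows, cols = len(a) + 1, len(b) + 1
        d = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            d[i][0] = i * gap_cost                  # delete all of a
        for j in range(cols):
            d[0][j] = j * gap_cost                  # insert all of b
        for i in range(1, rows):
            for j in range(1, cols):
                subst = 0 if a[i - 1] == b[j - 1] else replace_cost
                d[i][j] = min(d[i - 1][j - 1] + subst,   # identity / replacement
                              d[i - 1][j] + gap_cost,    # gap in b
                              d[i][j - 1] + gap_cost)    # gap in a
        return d[-1][-1]

    print(edit_distance("BIRD", "WORD"))    # 2: two replacements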

The replacement cost parameters for amino acids are the well-known Dayhoff and BLOSUM matrices that will be discussed later in this course. These rely on a statistical evaluation of a large number of alignments.

The gap parameters are even more arbitrary - they differ from program to program. One can also define gaps in a length-dependent manner, i.e. the cost of introducing two separate gaps may differ from the cost of introducing two consecutive gaps.

One important feature of biological sequence analysis is that here we use similarity scores rather than distances. The similarity score S calculated between two sequences is inversely related to the distance, i.e. it will be maximal for identical sequences (and not zero, like a distance function). In mathematical terms, S ~ 1/D.

The simplest form of a similarity score looks like this:

S = SUM (replacement values) - SUM (gap penalties)                (4)

This means that for a given alignment we sum the values of the replacements (in this respect identities are a special kind of replacement, by the same character) and subtract a penalty for the gaps. The negative sign of this penalty expresses the common-sense view that a gap is a bad thing to have, i.e. the more gaps, the worse the "quality" of the alignment.

The collection of replacement weights is called a replacement matrix, such as the Dayhoff matrix for amino acid replacements to be discussed later in this course. We mention that all kinds of heuristic weighting schemes can be used; for example, the weight of some structurally conserved amino acids (e.g. cysteines, tryptophans) can be set higher in order to stress their importance in structures. In this way one gets a modified score value that stresses structural similarities in a rudimentary way.
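As a sketch of equation (4), the following Python fragment scores a given (already gapped) alignment with a toy replacement matrix and a simple per-gap penalty; the numbers are invented for illustration and are not Dayhoff or BLOSUM values.

    def alignment_score(top, bottom, subst, gap_penalty):
        """Sum the replacement values over aligned pairs and subtract a
        penalty for every gap symbol, as in equation (4)."""
        score = 0
        for x, y in zip(top, bottom):
            if x == "-" or y == "-":
                score -= gap_penalty                 # gap: subtract the penalty
            else:
                score += subst.get((x, y), subst.get((y, x), 0))
        return score

    toy_subst = {("R", "R"): 5, ("D", "D"): 5, ("B", "W"): -2, ("I", "O"): -1}
    print(alignment_score("BIRD", "WORD", toy_subst, gap_penalty=4))    # 7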


2.3. Approximate string matching; finding optimal alignments

Similarity scores and distances are the main tools for finding optimal alignments between two biological sequences. There are algorithms that are based on maximizing similarity and others that minimize distance. An "exhaustive search" would then take all possible alignments, determine the similarity score for them and then take the one with the highest value.

One trivial problem of the alignments is that most of the mathematically possible alignments make no sense, e.g.

T H E - - - - B I R D I S T H E W O R D
      W O R D

These nonsense alignments should be eliminated since they only consume computer time. Many of the possible improvements are based either on intuitive tricks or on statistical knowledge of the problem. The latter is especially important in biological applications, since biological sequences are not random, so one can make good probability estimates of how frequently certain strings are expected to occur.

The currently used algorithms have been developed over many years. They are used in commercial applications such as spell checkers in word processors, dictionaries etc. They were discovered and further developed almost independently in several application fields, including biology, where they constitute the basis of database searching. (Database searching - as will be discussed during this course - simply means to calculate a similarity score between a query sequence and all items in a database, and then list the best scoring entries of the database). As this is the central topic of this course and the programs used will be presented in later lectures, here we only show a brief comparison of the algorithms in terms of their running time requirements.

Comparison of algorithms for comparing protein and DNA sequences
------------------------------------------------------------------------------------------------
Algorithm          Value calculated     Scoring matrix   Gap penalty            Time growth rate   Reference
------------------------------------------------------------------------------------------------
Needleman-Wunsch   global similarity    arbitrary        penalty/gap, q         O(l^2)             Needleman & Wunsch, 1970
Sellers            (global) distance    unity            penalty/residue, rk    O(l^2)             Sellers, 1974
Smith-Waterman     local similarity     Sij < 0.0        affine, q + rk         O(l^2)             Smith & Waterman, 1981; Gotoh, 1982
FASTA              approx. local        Sij < 0.0        limited gap size       O(l^2)/K           Lipman & Pearson, 1985;
                   similarity                                                                      Pearson & Lipman, 1988
BLASTP             maximum segment      Sij < 0.0        (-) multiple segments  O(l^2)/K           Altschul et al., 1990
                   score
CLUSTAL            multiple alignment   arbitrary        versatile (secondary   n^2 O(l^2)         Higgins et al., 1994
                   (global)                              structure dependent)
------------------------------------------------------------------------------------------------

The time requirement O(l^2) refers to the length of the input sequence and can be better written as O(l*D), where l is the length of the input query sequence and D is the length of the database (i.e. the other input). It is conspicuous that the time growth rate could not be improved. The dramatic speed differences between the programs all result from shortening the basic running time of the algorithm (we can picture this as a constant of proportionality which is very small for some of the programs, especially FASTA and BLAST). This can be achieved either by clever programming or, more importantly, by preprocessing the input (the query and the database) into a fast-to-read format, like a hash table. However, the most important "trick" is that the fast algorithms are not exhaustive but use statistical considerations to pre-screen the alignments. This takes care of the problem of testing many nonsense alignments. And since biological sequences have no "real extremes" among themselves, the statistics can be considered quite reliable.
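To picture where the l*D term comes from, the sketch below scores a query against every entry of a toy database and lists the best hits. The stand-in scoring function is deliberately naive; a real search would use one of the alignment algorithms in the table above, so that each entry costs O(l x entry length) and the whole scan O(l*D).

    def identity_score(query, entry):
        # Stand-in scoring function: count matching positions of an
        # ungapped, left-anchored comparison.
        return sum(q == e for q, e in zip(query, entry))

    def search_database(query, database, top=3):
        """Baseline database search: score the query against every entry
        and list the best-scoring ones."""
        hits = sorted(((identity_score(query, seq), name) for name, seq in database),
                      reverse=True)
        return hits[:top]

    toy_db = [("entry1", "ATGCATGC"), ("entry2", "ATATATAT"), ("entry3", "ATGGATCC")]
    print(search_database("ATGC", toy_db))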

Alphabet size affects only the amount of space required by the algorithms; it does not change the order of the running time (though it obviously changes the constant). The alphabet-size effect is independent of sequence length and can be kept modest on modern computers.

Many multiple alignment programs are based on pairwise alignments, so their theoretical growth rate is O(l^2). However, for n sequences there are n(n-1)/2 pairwise alignments, so the CPU time grows quite steeply with the number of sequences.

2.4 Motifs as elements of similarity

Until now we talked about a numerical description of similarity: the similarity score or distance function.


Now we switch to another way of describing similarity. Previously we mentioned that all chemical structures, including protein and DNA sequences, can be considered as assemblies of substructures with relationships between them. Continuing this line of argument we can tentatively say that two structures can be called similar if they have common substructures. Of course not just any substructures: for example, we would not consider two protein sequences similar if both of them contain alanine. But we certainly would start thinking about similarity if they have a common domain, say an EGF (epidermal growth factor like) domain. In fact, the domains or modules of multidomain proteins were the origin of defining sequence motifs. One can tentatively say that two protein sequences are distantly related if they share a local homology domain, especially if the same type of homology has been found among other proteins. In other terms, here we speak about local similarities that may not include the entire sequence, only parts of it. A trivial metaphor (mentioned earlier) is the similarity of the car and the bicycle: they both have wheels, so they are in a way similar. The wheels can be compared among themselves in terms of numerical indices like weight, size etc., but once we have decided that the wheel is there (local similarity region identified), we do not actually care about the differences that exist between bicycle wheels and car wheels. On the other hand, if we tried to cluster objects according to their weight and size (i.e. performed a global similarity search), then cars and bicycles would never end up in the same cluster.

This is the reason why we have to deal with local similarities separately: these similarities cannot always be detected simply by calculating an "overall similarity score" for an entire sequence. Namely, the contribution of a local similarity to the overall similarity score can be quite small and can easily be blurred by the spurious similarities randomly found between sequences. For this reason, the detection of local similarities has developed into a separate art with its own tools. The most important achievement of this development is our knowledge of protein modules or domains, which are small, structurally conserved protein segments found in various protein families.

When we speak about motifs, domains etc., we speak about two different kinds of problems:

i) One is to find out whether a sequence contains any of the known domain types. This can best be done by comparing the sequence with one or several of the existing domain collections. Most current domain collections contain a consensus representation of certain domain types. These consensus representations can be given in terms of a consensus sequence, a regular expression, a hidden Markov model, a "profile" etc.


All these are common descriptions of a group of sequences. For example, for the EGF domain we have over one thousand known examples, and we can develop one common mathematical description for these. We can use this "consensus representation" to look for new EGF modules, which means one single comparison for the whole group. Naturally, somebody has to develop and maintain these consensus representations, which is not an easy task.
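As a small illustration of scanning with a consensus representation, the Python sketch below uses a regular-expression style pattern; the pattern and the sequence are invented for illustration and are not the real PROSITE pattern of the EGF domain.

    import re

    # Toy consensus pattern: three cysteines separated by short spacers.
    toy_pattern = re.compile(r"C.{3,8}C.{3,8}C")

    sequence = "MKALCAAALCWTRCDEAAAAGTT"      # invented example sequence
    for match in toy_pattern.finditer(sequence):
        print(match.start(), match.group())   # position and matching segment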

ii) The other, quite different task is to find out whether a new sequence contains any new kind of domain. In this case we do a database search, try to find a local similarity region, and then, using our biological insight, try to establish whether the newly found similarity region meets the criteria of our biological knowledge of modules (e.g. it is found in a well-defined group of protein families, one can build a "good consensus representation" which will fish out only members of the similarity group, etc.). In other terms, we have to ascertain the biological significance of the local similarity.

Summarizing, we accept a local similarity as important either if i) it belongs to a well-established group of sequences (known protein modules), or if ii) we ourselves can prove that the similarity is biologically significant, e.g. it defines a new type of protein module. As one can see, this aspect of sequence similarity is not strictly mathematical but involves a great deal of biological knowledge.

Finally, we mention that BLAST, the most frequently used database search program, returns a collection of individual local similarities, which - as we will show later during this course - can be easily checked for domain similarities.

This lecture is an introduction to the main concepts of bioinformatics (algorithm evaluation, data types and the principles of similarity). These lecture notes were written and compiled by Sándor Pongor, Subbiah Parthasarathy and Kristian Vlahoviček for the course "Bioinformatics: Computer methods in molecular biology" held in Trieste, 3 - 10 July 1998.


References

1. N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005).

2. Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003).

3. D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000).

4. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK (1998).

5. A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, Hoboken, NJ (1998).

6. Michael S. Waterman, Introduction to Computational Biology, Chapman & Hall (1995).

7. J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995).
