Bioinformatics

Lec-6 1

BIOINFORMATICS
Ayesha M. Khan
Spring 2013

Some statistics of local sequence comparison (BLAST)

Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is “good” and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone.

BLAST uses statistical theory to produce a bit score and expect value (E-value) for each alignment pair (query to hit).

BLAST Results: Scores and Values

Max score = highest alignment score (bit score) between the query sequence and a segment of the database sequence.

Total score = sum of the alignment scores of all segments from the same database sequence that match the query sequence (calculated over all segments). This score differs from the max score if several parts of the database sequence match different parts of the query sequence.

Query coverage = percentage of the query length that is included in the aligned segments. This coverage is calculated over all segments.

E-value = number of alignments with a particular score or better that are expected to occur by chance.

Some details: Bit score

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.

In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.

Key element: the substitution matrix.
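As a rough illustration of how such a score is assembled, the sketch below sums substitution scores over aligned residue pairs and subtracts gap penalties. The match/mismatch values and the gap penalties are illustrative assumptions (not BLOSUM62 and not BLAST defaults), and the function name is hypothetical. The result is a raw score, which BLAST then normalizes into a bit score.

```python
def raw_alignment_score(row_a, row_b, match=2, mismatch=-1,
                        gap_open=5, gap_extend=1):
    """Score two aligned rows of equal length ('-' marks a gap).

    Scoring values are toy assumptions, not a real substitution matrix.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_owner = None  # which row the currently open gap is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a pairwise alignment column cannot be all gaps")
        if a == "-" or b == "-":
            owner = "a" if a == "-" else "b"
            # First gapped position pays the opening penalty, later ones the extension.
            score -= gap_extend if owner == gap_owner else gap_open
            gap_owner = owner
        else:
            score += match if a == b else mismatch
            gap_owner = None
    return score


print(raw_alignment_score("ACGT--A", "ACCTGGA"))
```

With a real protein search, the `match`/`mismatch` lookup would be replaced by entries from a substitution matrix such as BLOSUM62.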

Bit score (contd.)

BLOSUM62 is the default matrix for most BLAST programs, the exceptions being blastn, megaBLAST and discontiguous megaBLAST (programs that perform nucleotide–nucleotide comparisons and hence do not use protein-specific matrices).

Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.

Some details: E-value

The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An E-value of 0.05 means that 0.05 alignments with this score or better are expected by chance in a search of this size; for values this small, that is roughly a 5 in 100 (1 in 20) chance that the similarity arose by chance alone.

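$$E = K m n e^{-\lambda S}$$

m and n give the size of the search space (n is the length of the query sequence, m is the length of the database)

K is a scale parameter for the size of the search space

λ is a scale parameter for the scoring method

S is the raw alignment score; the bit score is its normalized form, $S' = (\lambda S - \ln K)/\ln 2$, so that $E = m n 2^{-S'}$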
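The relationship between score and E-value can be checked numerically. The sketch below assumes that S in the formula is the raw alignment score, as in Karlin-Altschul statistics, and that the bit score is S' = (λS − ln K)/ln 2. The values of λ and K and the search-space sizes are illustrative assumptions, not values from the lecture.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lam * S), with S the raw score."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Equivalent form in terms of the bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions).
LAM, K = 0.267, 0.041
m, n = 5_000_000, 250  # database length, query length

for raw in (40, 60, 80):
    bits = bit_score(raw, LAM, K)
    print(raw, round(bits, 1), e_value(raw, m, n, LAM, K))
```

The E-value falls exponentially as the score rises and grows linearly with the size of the search space, which is why the same alignment is less significant in a larger database.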

Difference between BLOSUM and PAM matrices

BLOSUM is derived from blocks: alignments of short, ungapped sequence segments that match each other at some defined level of similarity. The BLOSUM method thereby incorporates much more data into its matrices and is therefore presumably more accurate. PAM is derived from global alignments of closely related proteins.

BLOSUM matrices tend to be more sensitive to distant relationships than PAM.

BLOSUM tends to give higher scores than PAM to substitutions involving hydrophilic amino acids and lower scores to substitutions involving hydrophobic amino acids.

Substitutions of rare amino acids are more tolerated by BLOSUM.

General rules:

- Use higher PAM or lower BLOSUM matrices for more divergent sequences
- Use lower PAM or higher BLOSUM matrices for more closely related sequences

Concept of Gaps in Alignment

Sequences may have diverged from a common ancestor through various types of mutations:
• Substitutions
• Insertions
• Deletions

The latter two will result in gaps in alignments.

Gap Penalty

Gap penalties are used during sequence alignment to penalize gaps.

The gap extension penalty is usually much smaller than the gap opening penalty: for instance, 10 separate insertions of one nucleotide each should be penalized more heavily than one insertion of 10 nucleotides.

That is, opening a new gap is less probable than extending an existing gap over more than one nucleotide. Hence a single mutation event (causing insertion or deletion of more than one nucleotide) is more probable than multiple mutation events.

Gap Penalty

Linear gap penalties
• Simplest type of gap penalty
• The overall penalty for one large gap is the same as for many small gaps of the same total length
• $w_k = cL$, where L is the length of the gap

Affine gap penalties
• Have a gap opening penalty, c, and a gap extension penalty, e
• $w_k = c + (L-1)e$
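A small numeric comparison makes the difference between the two schemes concrete. This is a minimal sketch of the two formulas, taking L as the gap length; the penalty values c = 10 and e = 1 are hypothetical.

```python
def linear_gap_penalty(length, c):
    """w = c * L: every gapped position costs the same."""
    return c * length


def affine_gap_penalty(length, c, e):
    """w = c + (L - 1) * e: opening costs c, each further position costs e."""
    return c + (length - 1) * e if length > 0 else 0


c, e = 10, 1  # hypothetical opening and extension penalties

one_long_gap = affine_gap_penalty(10, c, e)        # one gap of 10 positions
ten_short_gaps = 10 * affine_gap_penalty(1, c, e)  # ten gaps of 1 position each
print(one_long_gap, ten_short_gaps)

# Under the linear scheme the two cases cost the same.
print(linear_gap_penalty(10, c), 10 * linear_gap_penalty(1, c))
```

Under the affine scheme one long gap is far cheaper than many short ones, which matches the idea that a single insertion or deletion event is more probable than many independent ones.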

BLAST & FASTA: heuristic methods

BLAST & FASTA use heuristic methods that attempt to approximate the optimal local similarity shared by two sequences.

Use word or k-tuple methods

They align two sequences very quickly by first searching for identical short stretches of sequence (called words, or k-tuples) and then joining these words into an alignment by the dynamic programming method.

BLAST…

The BLAST programs are used to find high-scoring local alignments between a query sequence and a target database.

The BLAST algorithm is based on the fact that true match alignments are very likely to contain short stretches of identities, or very high-scoring matches, somewhere within them.

So BLAST initially looks for such short stretches and uses them as ‘seeds’ from which it extends out in search of a good longer alignment.

Main stages of BLAST

1. Remove (filter) low-complexity regions from the query (Q)
2. Harvest k-tuples (triples) from Q
3. Expand each triple into ~50 high-scoring words
4. Seed a set of possible alignments
5. Generate high-scoring segment pairs (HSPs) from the seeds
6. Test the significance of matches from the HSPs
7. Report the alignments found from the HSPs
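The seeding and extension stages can be sketched in a much simplified form. The code below is not the BLAST implementation: it uses exact word matches only (no expansion into neighbourhood words scored with a substitution matrix), a toy match/mismatch scoring, ungapped extension, and no significance test. The function names, the word length k and the drop-off threshold `x_drop` are assumptions made for illustration.

```python
from collections import defaultdict


def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Extension in each direction stops at the sequence end or when the running
    score falls more than x_drop below the best score seen so far.
    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def extend(qi, sj, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            running += match if query[qi] == subject[sj] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + k, j + k, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, i + k + right_len


query, subject = "GATTACAGG", "CCATTACATT"
seeds = find_seeds(query, subject, k=4)
print(seeds)
score, start, end = extend_seed(query, subject, *seeds[0], k=4)
print(score, query[start:end])
```

The three seeds found here lie on the same diagonal, so extending any one of them recovers the same segment; a real implementation would merge such seeds before extension.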

Multiple Sequence Alignment

Why do we need to carry out multiple sequence alignments?

To make connections between more than two family members

To reveal conserved family characteristics

An MSA is a 2D table: rows represent individual sequences and columns represent residue positions.

Absolute position: property of the sequence
Relative position: property of the alignment

MSA: computational complexity

Aligning two sequences by dynamic programming takes time $O(m_1 m_2)$, where O denotes the order of the time taken by the algorithm and $m_1$ and $m_2$ are the sequence lengths.

When considering more sequences, the time complexity becomes $O(m_1 m_2 m_3 \cdots m_l)$, where $m_l$ is the length of the last sequence in the comparison set.
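To get a feel for how quickly this grows, the sketch below counts the cells of the full dynamic-programming lattice, taken here as the product of (length + 1) over all sequences. The sequence length of 100 is an arbitrary assumption, and the count ignores the extra work done per cell.

```python
import math


def dp_cells(lengths):
    """Number of cells in the full dynamic-programming lattice."""
    return math.prod(m + 1 for m in lengths)


for count in (2, 3, 4, 5):
    print(count, dp_cells([100] * count))
```

Each additional sequence multiplies the table size by roughly its length, which is why exact simultaneous alignment is practical only for a handful of short sequences.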

Simultaneous methods vs progressive methods

Simultaneous methods: align all the sequences in a given set at once
• Extension of a 2D matrix to three or more dimensions
• The number of dimensions reflects the number of sequences to be aligned
• Work best on small sets of short sequences

Progressive methods: align pairs of sequences, building up sequence clusters
• Use heuristics to reach an alignment in a timely and cost-efficient manner

MSA models

There are several models for assessing the score of a given multiple sequence alignment. The most popular ones are sum-of-pairs (SP), tree alignment, and consensus alignment.

Note: which of the above models are progressive alignments and which are based on dynamic programming? (You should be able to answer after a few slides.)

Sum-of-pairs (SP)

Recall that:

The standard computational formulation of the pairwise problem is to identify the alignment that maximizes protein sequence similarity, which is typically defined as the sum of substitution matrix scores for each aligned pair of residues, minus some penalties for gaps.

The mathematically (though not necessarily biologically) exact solution can be found in a fraction of a second for a pair of proteins. This approach is generalized to the multiple sequence case by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, i.e. the sum-of-pairs, or SP, score.

Sum-of-pairs (SP)...

The SP score for the complete alignment M is the sum of the scores for each column ($m_i$) in the alignment:

$$SP(M) = \sum_i S(m_i)$$

where the score $S(m_i)$ of a column is the sum of the substitution scores over all pairs of residues in that column.

Example: We wish to align the following three DNA sequences:
S1 = TGCG
S2 = AGCTG
S3 = AGCG

We wish to use the SP method to score the following alignments of these three sequences:

Alignment #1    Alignment #2
T-GC-G          TGC-G
-AGCTG          AGCTG
-AGC-G          AGC-G

Sum-of-pairs (SP)...

We will use the following simplified DNA substitution matrix:
• s(x,y) = 1 when x = y [match]
• s(x,y) = -1 when x ≠ y [mismatch]
• s(x,-) = -2 [gap]
• s(-,y) = -2 [gap]
• s(-,-) = 0 to prevent double counting of gaps

We will construct the following matrices M for each alignment:

Sum-of-pairs (SP)...

The SP score for each alignment is calculated by summing the individual scores for each column in the matrix.

Using the simplified substitution matrix, the Sum of Pairs method ranks the second alignment as the higher scoring alignment.
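The column-by-column calculation can be reproduced with a few lines of code. This is a minimal sketch that assumes the two alignments read row by row as T-GC-G / -AGCTG / -AGC-G and TGC-G / AGCTG / AGC-G, and uses the simplified scoring scheme given earlier (match 1, mismatch -1, gap -2, gap against gap 0). The function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y):
    """Simplified DNA scheme: match 1, mismatch -1, gap -2, gap/gap 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def column_scores(alignment):
    """SP score of each column: sum of pair_score over all pairs of rows."""
    return [sum(pair_score(x, y) for x, y in combinations(col, 2))
            for col in zip(*alignment)]


def sp_score(alignment):
    """SP score of the whole alignment: sum of its column scores."""
    return sum(column_scores(alignment))


alignment_1 = ["T-GC-G", "-AGCTG", "-AGC-G"]
alignment_2 = ["TGC-G", "AGCTG", "AGC-G"]

print(column_scores(alignment_1), sp_score(alignment_1))  # [-4, -3, 3, 3, -4, 3] -2
print(column_scores(alignment_2), sp_score(alignment_2))  # [-1, 3, 3, -4, 3] 4
```

Under these assumptions the second alignment scores 4 against -2 for the first, in line with the ranking stated above.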

Consensus alignment

Now create a consensus of the first three sequences and align that to the fourth most similar sequence. This process continues until it has worked its way through all sequences and/or sets of clusters.
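One simple way to derive a consensus from sequences that are already aligned is a majority vote per column. The sketch below assumes that rule and ignores gaps when voting; both choices are assumptions for illustration, and other consensus definitions (for example with ambiguity codes or frequency thresholds) are possible.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most common non-gap residue in each column (gap if the column is all gaps)."""
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        result.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(result)


print(consensus(["TGC-G", "AGCTG", "AGC-G"]))  # AGCTG
```

The resulting consensus string can then be aligned to the next sequence with an ordinary pairwise method.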

Tree alignment

Progressive alignment

It is a heuristic method!

Up until about 1987, multiple alignments would typically be constructed manually, although a few computer methods did exist. Around that time, algorithms based on the idea of progressive alignment appeared. In this approach, a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then the next most similar one to that pair, and so on.

The rule “once a gap, always a gap” was implemented, on the grounds that the positions and lengths of gaps introduced between more similar pairs of sequences should not be affected by more distantly related ones.

Progressive alignment: CLUSTALW

The three basic steps in the CLUSTAL W approach are shared by all progressive alignment algorithms*:

A. Calculate a matrix of pairwise distances based on pairwise alignments between the sequences

B. Use the result of A to build a guide tree, which is an inferred phylogeny for the sequences

C. Use the tree from B to guide the progressive alignment of the sequences
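Steps A and B can be sketched as follows. This is not the CLUSTAL W procedure itself: for brevity the distances are taken as the fraction of differing positions between equal-length toy sequences (a stand-in for distances derived from real pairwise alignments), and the tree is built by a simple UPGMA-style average-linkage clustering, whereas CLUSTAL W uses neighbour-joining. All names and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Join the two closest clusters repeatedly; return the join order as nested tuples."""
    trees = {name: name for name in seqs}   # cluster id -> subtree
    sizes = {name: 1 for name in seqs}      # cluster id -> number of sequences
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    while len(trees) > 1:
        pair = min(dist, key=dist.get)      # closest pair of clusters
        a, b = sorted(pair)
        new = a + "+" + b
        for c in trees:
            if c not in pair:
                # Size-weighted average distance from the merged cluster to c.
                dist[frozenset((new, c))] = (
                    sizes[a] * dist[frozenset((a, c))]
                    + sizes[b] * dist[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        # Drop all distances that involve the two clusters just merged.
        dist = {key: d for key, d in dist.items() if a not in key and b not in key}
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(trees.values()))


toy = {"S1": "ACGTACGT", "S2": "ACGTACGA", "S3": "ACCAACGA", "S4": "TTCAAGGA"}
print(guide_tree(toy))
```

In step C the nested tuples would be read from the innermost pair outwards: align S1 with S2 first, then add S3 to that alignment, then S4.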

Progressive alignment: CLUSTALW

http://www.ebi.ac.uk/clustalw/

The basic idea is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order of the guide tree. We proceed from the tips of the rooted tree towards the root.

At each stage a full dynamic programming algorithm is used, with a residue scoring matrix (e.g., a PAM or a BLOSUM matrix) and gap opening and extension penalties. Each step consists of aligning two existing alignments.

Scores at a position are averages of all pairwise scores for residues in the two sets of sequences, using matrices with only positive values.

Pairwise progressive dynamic programming: liabilities

(1) dependence on initial pairwise sequence alignments and the order of alignment

- ordering them from most similar to least similar usually makes biological sense and works very well.

(2) dependence on substitution matrices and gap penalties

Common usage of MSA

• Detecting similarities between sequences (closely/distantly related)
• Detecting conserved regions/motifs in the sequences
• Detecting structural homologies, patterns of hydrophobicity/hydrophilicity, gaps etc., thus assisting the improved prediction of secondary and tertiary structures, loops and variable regions
• Predicting features of aligned sequences, like conserved positions, which may have structural or functional importance
• Making patterns or profiles that can be further used to predict new sequences falling in a given family
• Computing a consensus sequence
• Inferring evolutionary trees or linkage (phylogenetic analysis etc.)
• Deriving profiles or hidden Markov models (HMMs) that can be used to remove distant sequences (outliers) from protein families

Applicability of MSA

• Very useful in the development of PCR primers and hybridization probes;
• Great for producing annotated, publication-quality graphics and illustrations;
• Invaluable in structure/function studies through homology inference;
• Recognizable structural conservation between true homologues extends way beyond statistically significant sequence similarity.

Applicability of MSA (contd.)

• Essential for building “profiles” for remote homology similarity searching; and
• Required for molecular evolutionary phylogenetic inference programs.

For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.” (Albert Einstein)
