Heuristic Approaches for Sequence Alignments

Page 1: Heuristic Approaches for Sequence Alignments

Heuristic Approaches for Sequence Alignments

/course/eleg667-01-f/Topic-2b.ppt 2

Outline

  • Sequence Alignment
  • Database Search
  • FASTA
  • BLAST

Sequence Alignment

Dynamic Programming (gives optimal solution(s))
  • Needleman-Wunsch (Global Alignment)
  • Smith-Waterman (Local Alignment)

Heuristics (give approximate solution(s))
  • Trade precision for speed (good for DB search)
  • FASTA (finds local alignments)
  • BLAST (Basic Local Alignment Search Tool)

Database Search

One of the major uses of alignments is to find similar sequences in a database, i.e. compare one input sequence with all sequences in the database and obtain the most similar ones;

Current databases contain a massive number of sequences;

Finding homologies in these databases optimally with dynamic programming can take a long time.

Database Search using Heuristic Sequence Comparison Algorithms

Most database search algorithms rely on heuristic procedures

These are not guaranteed to find the best match

Sometimes, they will completely miss a high-scoring match

Database Search and PAM Matrices - Motivation

A simple scoring scheme (e.g. +1 for a match, 0 for a mismatch, -1 for a gap) is not enough, especially for protein sequences

Amino Acids: must consider their relative replacement features in an evolutionary scenario

(Cont’d)

Factors affecting such mutual substitution are numerous (size, chemical properties, etc.)

PAM (Point Accepted Mutation) matrices are widely used; they are derived by direct observation of actual substitution rates.

PAM Matrices (Cont’d)

1-PAM matrix: reflects an amount of evolution producing on average one mutation per hundred amino acids

How to build a 1-PAM matrix?
  • A probability transition matrix $M$: each entry $M_{ab}$ denotes the probability of $a$ changing into $b$
  • A scoring matrix $S$: $S$ is derived from $M$

How to Build a Probability Transition Matrix M?

We need:
  • A list of accepted mutations
  • The probability of occurrence $P_a$ for each amino acid $a$

$M_1$ ($M$ for 1-PAM) can be computed by simple probability arguments

$M_k$ ($M$ for $k$-PAM) $= M_1^k$
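The relation between the 1-PAM and k-PAM transition matrices can be illustrated with a small sketch. The 3-letter alphabet and the entries of `M1` below are hypothetical values chosen only so that each row sums to 1; they are not real PAM data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, k):
    """Return m raised to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM-like matrix: entry M1[a][b] is the probability
# of letter a changing into letter b; each row sums to 1.
M1 = [
    [0.990, 0.006, 0.004],
    [0.003, 0.990, 0.007],
    [0.005, 0.005, 0.990],
]

# The k-PAM matrix is the k-th power of the 1-PAM matrix.
M250 = mat_pow(M1, 250)
for row in M250:
    print([round(x, 3) for x in row])
```

After 250 steps the diagonal entries are much smaller than in `M1`, which matches the idea that high PAM numbers model more distant relationships.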

How to Derive S from M?

Question: Given an amino acid a paired with b, how much more likely is it that this pair is a mutation rather than a random occurrence? (This quantity is called a likelihood ratio.)

Answer:

$$\text{ratio} = \frac{M_{ab}}{P_b}$$

where $P_b$ is the probability of a random occurrence of $b$.
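As a sketch, the likelihood ratio M_ab / P_b can be turned into an integer score by taking a scaled logarithm. The slide only says that S is derived from M; the `10 * log10` scaling used here is the convention commonly used for PAM matrices and is stated as an assumption. The two-letter alphabet and all numbers are hypothetical.

```python
import math

# Hypothetical transition probabilities M[a][b] and background
# frequencies P[b] for a toy two-letter alphabet.
M = {"A": {"A": 0.9, "B": 0.1},
     "B": {"A": 0.2, "B": 0.8}}
P = {"A": 2 / 3, "B": 1 / 3}


def likelihood_ratio(a, b):
    """Ratio of 'a mutated into b' versus 'b occurred at random'."""
    return M[a][b] / P[b]


def log_odds_score(a, b):
    """Scaled, rounded log of the likelihood ratio (assumed 10*log10 scaling)."""
    return round(10 * math.log10(likelihood_ratio(a, b)))


for a in "AB":
    print(a, [log_odds_score(a, b) for b in "AB"])
```

A ratio above 1 gives a positive score (the pair is more likely a mutation than chance), and a ratio below 1 gives a negative score.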

How to Pick a PAM Matrix to Use

  • Use the default one, but know what it is
  • Select several to cover a wide range if little is known about the sequences
  • In general, low PAM numbers are good for finding local, strong similarities, while large PAM numbers are good for detecting long, weak ones.

A Note on FAST Algorithms

FAST is a family of algorithms, e.g. FASTP, FASTA, TFASTA, LFASTA, ...

In this lecture we use FAST or FASTA interchangeably

References: [Pearson90,91, PearsonLipman88, etc.]

FASTA (Pearson and Lipman, 1988)

Determine k-tuples (exact matches) common to both sequences (with two parameters: ktup and offset).

Join k-tuples that are in the same diagonal and not very far apart – creates regions;

Find region with best score – “initial score” to rank the sequences;

Compute an “optimized score”, using DP, restricted to a band around the region.

Parameters ktup and offset

ktup (k = 1, 2) specifies the length of a common segment

offset determines a relative displacement between the query sequence and a database sequence (hint: under a DP method, an offset can be viewed as a diagonal in the similarity matrix)

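FASTA - Determine k-tuples (ktup = 1)

Query sequence:
  position   1  2  3  4  5  6  7  8  9 10 11
  residue    H  A  R  F  Y  A  A  Q  I  V  L

Database sequence:
  position   1  2  3  4  5  6  7  8
  residue    V  D  M  A  A  Q  I  A

Lookup table (residue: positions in the query sequence):
  A: 2, 6, 7
  F: 4
  H: 1
  I: 9
  L: 11
  Q: 8
  R: 3
  V: 10
  Y: 5

Offsets (query position minus database position) for each database residue:
  V (1): +9
  D (2): none
  M (3): none
  A (4): -2, +2, +3
  A (5): -3, +1, +2
  Q (6): +2
  I (7): +2
  A (8): -6, -2, -1

Offset vector (number of hits at each offset):
  offset   -7  -6  -5  -4  -3  -2  -1   0  +1  +2  +3  +4  +5  +6  +7  +8  +9 +10
  hits      0   1   0   0   1   2   1   0   1   4   1   0   0   0   0   0   1   0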
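The ktup = 1 example above can be reproduced with a short sketch. The function names are illustrative only (they do not come from the FASTA program), and offsets are taken as query position minus database position, which is the convention that matches the numbers on the slide.

```python
from collections import Counter, defaultdict


def build_lookup(query, ktup=1):
    """Map each k-tuple of the query to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i + 1)
    return table


def offset_vector(query, db, ktup=1):
    """Count k-tuple hits per offset (query position - database position)."""
    table = build_lookup(query, ktup)
    counts = Counter()
    for j in range(len(db) - ktup + 1):
        for i in table.get(db[j:j + ktup], []):
            counts[i - (j + 1)] += 1
    return counts


counts = offset_vector("HARFYAAQIVL", "VDMAAQIA")
for offset in sorted(counts):
    print(f"{offset:+d}: {counts[offset]}")
```

The offset with the most hits (+2 here, with 4 hits) points at the diagonal most likely to contain a good local alignment.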

FASTA – Diagonal method

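[Figure: similarity matrix with the query sequence (H A R F Y A A Q I V L) along the top and the database sequence (V D M A A Q I A) down the side. Each diagonal corresponds to one offset, from -7 to +10, and the offset vector counts the k-tuple matches on each diagonal. The per-residue offsets and the offset vector are the same as in the ktup = 1 example; offset +2 has the most hits, with 4.]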

FASTA - Join k-tuples

Determine k-tuples (exact matches) common to both sequences;

Join k-tuples that are in the same diagonal and not very far apart – creates regions;

The larger ktup, the faster the program

Typically ktup = 1 or 2 for proteins and ktup = 4 or 6 for DNA sequences

Note: a region should be gapless, and is created by a certain heuristic
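The slides do not spell out the joining heuristic, so the following is only one possible sketch: hits on the same diagonal are merged into a gapless region when their start positions are at most `max_gap` apart, and a region is ranked by its number of hits. Both `max_gap` and the hit-count ranking are assumptions made for illustration; FASTA itself scores regions differently.

```python
from collections import defaultdict


def join_hits(hits, ktup=1, max_gap=2):
    """Group (query_pos, db_pos) hits into regions on a shared diagonal.

    Returns tuples (offset, query_start, query_end, n_hits).
    """
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[q - d].append(q)

    regions = []
    for offset, starts in by_diag.items():
        starts.sort()
        start = end = starts[0]
        n_hits = 1
        for q in starts[1:]:
            if q - end <= max_gap:      # close enough: extend the region
                end = q
                n_hits += 1
            else:                       # too far apart: close the region
                regions.append((offset, start, end + ktup - 1, n_hits))
                start = end = q
                n_hits = 1
        regions.append((offset, start, end + ktup - 1, n_hits))
    return regions


query, db = "HARFYAAQIVL", "VDMAAQIA"
hits = [(i + 1, j + 1)
        for i, a in enumerate(query)
        for j, b in enumerate(db) if a == b]
regions = join_hits(hits)
best = max(regions, key=lambda r: r[3])
print(best)
```

For the example sequences the best region lies on offset +2 and covers query positions 6 to 9 (A A Q I).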

FASTA - Compute an optimized score for highest score region

Find region with best score – “initial score”;

Compute an “optimized score”, using DP, restricted to a band around the region.
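A minimal sketch of dynamic programming restricted to a band, assuming a local (Smith-Waterman style) recurrence with a linear gap penalty. The match, mismatch and gap values and the band width are hypothetical; a real search would score with a substitution matrix such as PAM. The offset is again query position minus database position.

```python
def banded_local_score(query, db, offset, band=2,
                       match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells near one diagonal.

    A cell (i, j) is filled only if |(i - j) - offset| <= band.
    Cells outside the band are treated as score 0 and never filled.
    """
    H = {}
    best = 0
    for i in range(1, len(query) + 1):
        lo = max(1, i - offset - band)
        hi = min(len(db), i - offset + band)
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            score = max(0,
                        H.get((i - 1, j - 1), 0) + s,   # align the two residues
                        H.get((i - 1, j), 0) + gap,     # gap in the database sequence
                        H.get((i, j - 1), 0) + gap)     # gap in the query sequence
            H[(i, j)] = score
            best = max(best, score)
    return best


print(banded_local_score("HARFYAAQIVL", "VDMAAQIA", offset=2, band=2))
```

Only about (2 * band + 1) cells per row are computed instead of a full row, which is where the saving over unrestricted dynamic programming comes from.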

Some Issues of FAST Algorithms

Selectivity vs. Sensitivity
  • Larger ktup: higher selectivity
  • Smaller ktup: higher sensitivity

Statistical significance of the scores

BLAST (Altschul et al, 1990)

Compile a list of high-scoring words based on the query sequence;

Scan the database for hits; each hit gives a seed;

Extend seeds for each sequence;

Report high-scoring segments.

BLAST (Basic Local Alignment Search Tool)

  • Segment: a substring of a sequence
  • Segment pair: a pair of segments with the same length
  • Segment pairs are gapless local alignments

[S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman: Basic Local Alignment Search Tool, J. Mol. Biol. (1990) 215, 403-410]

Input to BLAST: query sequence and database

Output of BLAST: a list of high-scoring “segment pairs” between the query and database sequences with scores above a certain threshold

A maximal segment pair (MSP) is a segment pair of maximum score.

A segment pair is locally optimal if its score cannot be improved by either extending or shortening both segments.

Note: Local similarity is useful for finding conserved regions (e.g. in a protein)

BLAST is interested in finding only those sequences with MSP scores over some cutoff score S.

The main strategy of BLAST is to seek only segment pairs that contain a word pair with a score of at least T.

BLAST- Compile list of high-scoring words

From the query sequence (length N), take the words of length w: w1, w2, ..., wk (a maximum of N-w+1 words), and find the list of words with score > T.

w, T: program parameters. Typically w = 3 for proteins and w = 11 for DNA sequences.

Example: w = 3, T = 15

  A   N   S
  A   N   S
  2 + 2 + 2 = 6 < T

  C    R   Y
  C    R   Y
  12 + 6 + 10 = 28 > T

PAM matrices can be used to compute the scores
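The word-list step can be sketched as follows, using only the per-residue scores that appear in the example above. The query string `ANSCRY` is a hypothetical input built from those residues. This sketch scores each query word against itself, as the slide does; a full implementation would also consider similar (neighboring) words that score above T against a query word, which is not shown here.

```python
# Scores for pairing each residue with itself, taken from the slide example.
SELF_SCORE = {"A": 2, "N": 2, "S": 2, "C": 12, "R": 6, "Y": 10}


def word_score(word, scores=SELF_SCORE):
    """Sum of per-residue scores for a word aligned with itself."""
    return sum(scores[c] for c in word)


def high_scoring_words(query, w=3, T=15, scores=SELF_SCORE):
    """Return (1-based position, word) for the query words scoring above T."""
    words = [query[i:i + w] for i in range(len(query) - w + 1)]
    return [(i + 1, word) for i, word in enumerate(words)
            if word_score(word, scores) > T]


print(word_score("ANS"), word_score("CRY"))
print(high_scoring_words("ANSCRY"))
```

With w = 3 and T = 15 the word ANS (score 6) is dropped and CRY (score 28) is kept, as on the slide.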

BLAST- Search for hits, each hit gives a seed

Seeds: exact matches of words from the word list to the database sequences

BLAST- Search for hits, each hit gives a seed

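[Figure: lookup (hash) table with entries 1 to 8 holding the words w1 ... w8 of the word list; a word w from the database sequence is mapped to its table entry by F(w).]

DNA sequences: each base is encoded in two bits (A: 00, C: 01, G: 10, T: 11), so one byte holds four bases, e.g. A C G T = 00 01 10 11.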
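A small sketch of the two ideas on this slide: packing DNA bases into two bits each, and using the packed value as the key of a lookup table for exact word matches. The function names, the word list and the database string are hypothetical, and a Python dict stands in for the hash table.

```python
# Two-bit encoding of DNA bases, as given on the slide.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(word):
    """Pack a DNA word into an integer, two bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def find_seeds(word_list, db, w):
    """Return (word, 1-based database position) for every exact match."""
    table = {pack(word): word for word in word_list}
    seeds = []
    for j in range(len(db) - w + 1):
        key = pack(db[j:j + w])
        if key in table:
            seeds.append((table[key], j + 1))
    return seeds


print(format(pack("ACGT"), "08b"))
print(find_seeds(["ACG", "GTA"], "TACGTACG", w=3))
```

Because four bases fit in one byte, words can be compared and hashed as small integers rather than as strings.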

BLAST- Extend seeds for each sequence

For each exact word match, the alignment is extended in both directions to find high-scoring segments: maximal segment pairs (MSPs)

  Query sequence:     L  P  S  L  D  L  L
  Database sequence:  M  P  S  L  D  L  L
  Scores:             2  7  4  4  6  4  4

The 3-letter word found initially is S L D, with word score 4 + 4 + 6 = 14. Extension to the left adds 2 + 7 = 9 and extension to the right adds 4 + 4 = 8, so the maximal segment pair has score 14 + 9 + 8 = 31.
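The extension example can be checked with a short sketch. The pair scores are the ones shown in the example; the default score of -4 for any other pair is a hypothetical placeholder. For simplicity this sketch scans to the ends of both sequences and keeps the best-scoring extension on each side, whereas a production implementation would normally stop early once the running score falls too far below the best seen.

```python
# Pair scores taken from the slide example; other pairs get a
# hypothetical default penalty.
PAIR_SCORE = {("L", "M"): 2, ("P", "P"): 7, ("S", "S"): 4,
              ("L", "L"): 4, ("D", "D"): 6}


def score(a, b):
    return PAIR_SCORE.get((a, b), PAIR_SCORE.get((b, a), -4))


def best_extension(query, db, i, j, step):
    """Best cumulative score walking from (i, j) in direction step (+1 or -1)."""
    best = run = 0
    while 0 <= i < len(query) and 0 <= j < len(db):
        run += score(query[i], db[j])
        best = max(best, run)
        i += step
        j += step
    return best


def extend_seed(query, db, q_start, d_start, w):
    """Score of the gapless segment pair grown from a seed (0-based starts)."""
    seed = sum(score(query[q_start + k], db[d_start + k]) for k in range(w))
    left = best_extension(query, db, q_start - 1, d_start - 1, -1)
    right = best_extension(query, db, q_start + w, d_start + w, +1)
    return seed, left, right, seed + left + right


print(extend_seed("LPSLDLL", "MPSLDLL", q_start=2, d_start=2, w=3))
```

The result reproduces the slide: seed 14, left extension 9, right extension 8, total 31.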

BLAST- report high scoring segments

Choose high score segments: scores > S

Why is BLAST Fast?

Because:

the alignments are gapless!

Statistical Significance of BLAST Results

Question: If a match is found by BLAST, what is the probability that such a match is due to chance alone?

A well-founded statistical theory is used by BLAST in assessing the matching scores.

Questions

Q1: What proportion of segment pairs with a given score contain a word pair with a score of at least T?

Answer: [Karlin91]

Q2: What is the probability q that an MSP with score S will fail to contain a seed word pair (of score >= T)?

Answer: See plot [Altschul et al. 90]

Note: PAM-120 scores are used, w=4 and T=8

[Plot; axes: score S and -ln q]

Improvements to Basic BLAST: Gapped BLAST and PSI-BLAST

Objectives
  • Speed up the execution substantially
  • Enhance the sensitivity to weak similarities

[S. F. Altschul et al., Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs, Nucleic Acids Research, 1997, Vol. 25, No. 17, 3389-3402]

Major Extensions/Changes to BLAST

Add the ability to generate gapped alignments, using dynamic programming to extend a seed in both directions

Use a “two-hit” method to filter the candidate pairs for extension

The search may be iterated: round i generates a new position-specific score matrix from the significant alignments found, to be used in round i+1 (this process involves the construction of a multiple sequence alignment; see Topic 2C)

The Two-Hit Method

Observation: an HSP (high-scoring segment pair) of interest is much longer than a single word pair, and thus may contain multiple hits on the same diagonal within a relatively short distance of one another.

Method: choose a window length A, and do an extension only when two non-overlapping hits are found within distance A of one another on the same diagonal

Effectiveness: reduces the candidate pairs for extension substantially (by 86%)
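The two-hit test can be sketched as a filter over word hits. The hit coordinates below are hypothetical, and two details are assumptions of this sketch rather than statements from the slides: two hits count as non-overlapping when their start positions are at least w apart, and "within distance A" is taken to mean a start-to-start distance of at most A.

```python
from collections import defaultdict
from itertools import combinations


def two_hit_triggers(hits, w, A):
    """Return (diagonal, first_hit, second_hit) for every qualifying pair.

    hits: (query_pos, db_pos) start positions of word hits of length w.
    """
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[q - d].append(q)

    triggers = []
    for diag, starts in by_diag.items():
        starts.sort()
        for first, second in combinations(starts, 2):
            distance = second - first
            if w <= distance <= A:      # non-overlapping and inside the window
                triggers.append((diag, first, second))
    return triggers


# Hypothetical hits: four on diagonal 6 and one on diagonal -10.
hits = [(10, 4), (12, 6), (30, 24), (90, 84), (20, 30)]
print(two_hit_triggers(hits, w=3, A=40))
```

Isolated hits, such as the one on diagonal -10 or the far-away hit at query position 90, trigger no extension, which is how the method cuts down the number of candidate pairs.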

An Example

The BLAST comparison of broad bean leghemoglobin I (87) (SWISS-PROT accession no. P02232) and horse beta-globin (88) (SWISS-PROT accession no. P02062). The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots.

Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T=11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T=13.