28
Michael Schroeder Biotechnology Center TU Dresden BLAST

BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Michael Schroeder

Biotechnology CenterTU Dresden

BLAST

Page 2: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Contents Why to compare and align sequences?

How to judge an alignment? Z-score, E-value, P-value, structure and function

How to compare and align sequences? Levensthein distance, scoring schemes, longest common subsequence, global

and local alignment, substitution matrix, How to compute an alignment?

Dynamic programming How to compute an alignment fast?

BLAST How to align many sequences

Multiple sequence alignment, phylogenetic trees Alignments and structure

How to predict protein structure from protein sequence

2

Page 3: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Motivation

Two sequences of length n Dynamic programming:

matrix = time proportional to n2

Database of m sequences: time proportional to m n2

Manual search: How long does it take? 1 cell = 10 sec 1.000.000 sequences Sequence = 100 amino acids

3

Page 4: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Motivation

How can we reduce the database size?

How can we reduce the matrix size?

4

Page 5: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

lev(petra, peter) = ?

5

Page 6: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

6

i \ j p e t e r

0 1 2 3 4 5

p 1 0 1 2 3 4

e 2 1 0 1 2 3

t 3 2 1 0 1 2

r 4 3 2 1 1 1

a 5 4 3 2 2 2

Page 7: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

7

i \ j p e t e r

0 1 2 3 4 5

p 1 0 1 2 3 4

e 2 1 0 1 2 3

t 3 2 1 0 1 2

r 4 3 2 1 1 1

a 5 4 3 2 2 2

For the two alignments, we only used 8 out of 36 cells.Can we discard the other cells beforehand? How?

not used

maybe used

used

Page 8: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

8

i \ j

0 1 2 3 4 5

1 1

2 2

3 3

4 4

5 5

What is the worst possible distance for two strings of size 5?5 mismatches. This means all paths of length >5 can be excluded

Page 9: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

9

i \ j

0 1 2 3 4 5

1 1

2 2

3 3

4 4

5 5

Paths through red cells have all length >5Only 24 out of 36 can contribute to results.

not used

maybe used

used

Page 10: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Are we solving the right problem?

10

Are all alignments useful?

Only results with reasonable edit distance.

For size 5 strings, let‘s say that‘s 3.

Page 11: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

11

i \ j

0 1 2 3 4 5

1 1

2 2

3 3

4 4

5 5

Paths through red cells have all length >3Only 16 out of 36 can contribute to results.

not used

maybe used

used

Page 12: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

12

Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer

Page 13: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Are we Solving the Right Problem?

13

Page 14: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Are we Solving the Right Problem?

Aligning completely unrelated sequences not needed

A reasonable alignment has some perfect matches

BLAST Idea:1. Identify small perfect matches (termed words)

2. Extend perfect matches on the same diagonal

3. Merge extended perfect matches (with dynamic programming)

14

Page 15: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

BLAST Idea

Do we need the alignments for all sequences in the database?

No, only for “reasonable” hits introduce a ➞threshold

A “reasonable” alignment will contain short stretches of perfect matches

Find these first, then extend them to connect them as best possible

15

Page 16: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Perfect Matches (Words) Formally

16

Page 17: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

BLAST: Filtration Algorithm

detects all matching words of length p (of a query string a in a target string b, both of length n), combining it to an alignment of a and b with no more than k mismatches

Potential match detection: Find all matches of p-tuples between a and b (can be done in linear time by inserting them into a hash table)

Potential match verification: Verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of the sequences are found

17

Page 18: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

BLAST Mechanism Summary

18

word length p(here: p = 4)

no mismatchesor gaps allowed

only within thegrey areas

Page 19: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Zipf's Law

19

Page 20: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST

20

SWISS_PROT:C79A_HUMAN P11912

Search SWISSPROT for Ig-alpha:

Page 21: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST

21

Page 22: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST

22

Page 23: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST

23

Distribution of Hits:

Page 24: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST: Alignment>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chainprecursor (IG-alpha) (MB-1 membrane glycoprotein)(Surface-IGM-associated protein) (Membrane-boundimmunoglobulin associated protein) (CD79A)Length = 220

Score = 278 bits (711), Expect = 5e-75Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG + LL LS LGPGCQAL + P SL V+LGE+A C N+ Sbjct: 1 MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61 NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120 N N+TWW L N TWPP LGPG+ G L VNK+ G C+V E N ++SCGSbjct: 56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180 TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D DSbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226 +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKPSbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220

24

Page 25: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

Example for BLAST: TaxonomyLineage Report

root. cellular organisms. . Eukaryota [eukaryotes]. . . Fungi/Metazoa group [eukaryotes]. . . . Bilateria [animals]. . . . . Coelomata [animals]. . . . . . Gnathostomata [vertebrates]. . . . . . . Tetrapoda [vertebrates]. . . . . . . . Amniota [vertebrates]. . . . . . . . . Eutheria [mammals]. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor . . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate . . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase). Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN

25

Page 26: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

PSI-Blast

Globin family (oxygen transport) of proteins occurs in many species

Proteins have same function and structure and But there are pairs of members of the family sharing

less than 10% identical residues

A B C

PSI-BLAST idea: score via intermediaries may be better than score from direct comparison

26

50%

Only 10%

50%

Page 27: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

PSI-BLAST

PSI-BLAST 1. BLAST 2. Collect top hits 3. Build multiple sequence alignment from significant

local matches 4. Build profile 5. Re-probe database with profile 6. Go back to 2.

27

Page 28: BLAST - biotec.tu-dresden.de · Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align

PSI-BLAST

But beware of PSI-BLAST: False positives propagate and spread through

iterations If protein A consists of domains D and E, and protein B

of domains E and F and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain

28