Good solutions are advantageous

Part 5: modular proteins. Motifs, profiles, structures.

Page 1: Good solutions are advantageous

Christophe Roos - MediCel ltd - christophe.roos@medicel.fi

Similarity is a tool in understanding the information in a sequence

Evolution changes sequences

Page 2: Good solutions are advantageous

Spring 2002, Christophe Roos - 5/6 Profiles, motifs, structures

Proteins share similar domains

By comparing several related sequences to each other, one can distinguish segments with a higher level of conservation. Usually these segments have a key role in the function of the protein.

BLAST identifies related sequences quickly, but only roughly.

Page 3: Good solutions are advantageous

Refine the comparison

• A multiple sequence alignment of the best-scoring sequences found by BLAST (or in some other way) is made with a more sensitive algorithm.

• Example: the eyeless gene of the fruit fly is also found in many other animals: birds, mammals, reptiles, fish, invertebrates. There it is called PAX6.

Page 4: Good solutions are advantageous

Visualise the relationship

• Once a multiple sequence alignment is done, it can also be used to estimate relationships (evolutionary distances).

• The distance is calculated as the number of mutations needed to evolve from a putative ancestor to each of the ‘present-day’ sequences used. A tree connecting all sequences is then computed. Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
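
As a rough illustration of how a distance can be read off a multiple alignment, the sketch below computes a simple p-distance (the fraction of aligned, ungapped positions at which two sequences differ) for every pair. The toy alignment and the function names are assumptions made for this example only. Real tree-building programs use more elaborate substitution models and then apply parsimony or likelihood criteria on top.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, ungapped columns in which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Toy alignment (hypothetical sequences, not real PAX6 data).
toy_alignment = {
    "seq_a": "MKRPCCVS",
    "seq_b": "MRRPCCVS",
    "seq_c": "MKRP-CIS",
}
distances = distance_matrix(toy_alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

The resulting pairwise distances are the usual input to distance-based tree methods; parsimony and likelihood methods work on the alignment columns directly.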

Page 5: Good solutions are advantageous

Visualise the output of aligned domains

First all sequence pairs are aligned and scored, then in a second round a multiple sequence alignment is built up.
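
The first round (aligning and scoring all pairs) can be sketched with a plain Needleman-Wunsch global alignment score. The match, mismatch and gap values below are arbitrary assumptions for illustration; actual alignment programs use substitution matrices and more refined gap penalties, and the toy sequences are hypothetical.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment (Needleman-Wunsch, linear gap cost)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(sequences):
    """Score every pair in a dict of name -> unaligned sequence."""
    return {
        (m, n): global_alignment_score(sequences[m], sequences[n])
        for m, n in combinations(sequences, 2)
    }


# Toy sequences (hypothetical, for illustration only).
toy_sequences = {"seq1": "KRPCAVS", "seq2": "KRPCVS", "seq3": "RRPCAVS"}
scores = pairwise_scores(toy_sequences)
print(scores)
```

In a progressive strategy, the pairwise scores decide the order in which sequences are merged into the multiple alignment during the second round.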

In this case (PAX6 proteins from vertebrates and fruit fly), two domains are more conserved than the rest of the sequence. The most conserved areas have been highlighted by the use of black or gray background and white text.

Only part of the alignment is shown.
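
The highlighting of conserved columns can be approximated in a few lines. This is a sketch under simple assumptions: conservation is taken as the fraction of sequences sharing the most common residue in a column, gaps count against conservation, and the toy alignment is invented for the example. Alignment viewers typically also group chemically similar residues, which is not done here.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences that share the most common residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


def highlight(alignment, threshold=1.0):
    """Mask line with '*' under columns whose conservation reaches threshold."""
    return "".join(
        "*" if score >= threshold else " "
        for score in column_conservation(alignment)
    )


# Toy alignment (hypothetical sequences).
toy_alignment = ["KRPCAVS", "RRPCGVS", "KRPC-VS"]
for seq in toy_alignment:
    print(seq)
print(highlight(toy_alignment))
```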

Page 6: Good solutions are advantageous

Profiles and motifs

• A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences.

• The term motif refers to any sequence pattern that is predictive of a molecule’s function, a structural feature, or a family membership.

• Motifs can be detected in proteins, DNA and RNA sequences, but they most commonly refer to protein motifs.

• Motifs can be represented for computational purposes as

– Flexible patterns, e.g. [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database at www.expasy.org)

– Position-specific scoring matrices (PSSM, see next page)

– Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
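
A flexible pattern of this kind maps directly onto a regular expression. The sketch below converts the paired-box pattern into a Python regex and searches a sequence with it. The converter handles only a small subset of the pattern syntax (single residues, bracketed alternatives, braces for excluded residues, x wildcards and repeat counts), it tolerates the comma-separated form [K,R] used here (Prosite itself writes [KR]), and the test sequence is made up.

```python
import re

# One pattern element: a residue, [alternatives], {exclusions} or x,
# optionally followed by a repeat count such as (11) or (2,4).
ELEMENT = re.compile(
    r"(\[[A-Z,]+\]|\{[A-Z,]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?"
)


def pattern_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                                   # any residue
        elif core.startswith("["):
            core = "[" + core[1:-1].replace(",", "") + "]"
        elif core.startswith("{"):
            core = "[^" + core[1:-1].replace(",", "") + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = pattern_to_regex("[K,R]-R-P-C-x(11)-C-V-S")
print(regex)

# Hypothetical test sequence containing one occurrence of the pattern.
sequence = "AAKRPC" + "G" * 11 + "CVSAA"
hit = re.search(regex, sequence)
print(hit.start(), hit.group())
```

Such a pattern either matches or does not; unlike a PSSM it gives no score, which is why it is described as qualitative and unweighted.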

Page 7: Good solutions are advantageous

Position specific scoring matrix

• This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S

Pos   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
  1 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
  2 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
  3 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
  4 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
  5 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
  6 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
  7 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
  8 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  9  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
 10  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
 11  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
 12  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 13 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 14 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 15 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 16 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 17 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 18 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
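
To show how such a matrix is used, the sketch below scores sequence windows against the first four rows of the matrix above (the [K,R]-R-P-C part of the pattern). The score of a window is the sum of the matrix entries selected by its residues, one per position. The function names and the test sequence are assumptions for this example; only the numbers are taken from the matrix.

```python
# Column order of the matrix, as given in its header line.
COLUMNS = "ABCDEFGHIKLMNPQRSTVWXYZ*-"

# First four rows of the matrix above (positions [K,R], R, P, C).
PSSM_HEAD = [
    [-22, -22, -35, -26, -15, -37, -30, -9, -38, 35, -36, -23, -16,
     -34, -5, 53, -23, -24, -35, -40, -19, -31, -9, 0, 0],
    [-51, -52, -62, -57, -46, -64, -59, -33, -66, -16, -63, -49, -44,
     -64, -34, 70, -51, -53, -63, -64, -46, -57, -40, 0, 0],
    [-42, -58, -59, -55, -53, -68, -59, -54, -63, -51, -65, -57, -62,
     73, -54, -56, -50, -53, -59, -72, -51, -69, -54, 0, 0],
    [-42, -69, 99, -75, -84, -49, -66, -72, -43, -76, -54, -53, -62,
     -79, -74, -75, -51, -48, -42, -65, -58, -59, -79, 0, 0],
]


def score_window(pssm, window):
    """Sum the matrix entries picked out by each residue of the window."""
    if len(window) != len(pssm):
        raise ValueError("window length must equal the number of matrix rows")
    return sum(row[COLUMNS.index(res)] for row, res in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


print(score_window(PSSM_HEAD, "RRPC"))   # R is preferred at position 1
print(score_window(PSSM_HEAD, "KRPC"))   # K is allowed but scores lower
print(best_hit(PSSM_HEAD, "MARRPCG"))    # hypothetical test sequence
```

Unlike the flexible pattern, the matrix is quantitative: both K and R are accepted at the first position, but they contribute different scores.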

Page 8: Good solutions are advantageous

Motif and databases – mode of use

• Motifs can be used to search sequence databases
– take a family of related sequences
– align and define motifs
– use the motifs to search a database of sequences to find novel family members
– motifs can also be generated from unaligned sequences (e.g. MEME, see next page)

• Motif databases can be searched with sequences
– take one sequence and ask what known motifs it contains
– deduce its function using knowledge about those motifs in other sequences

• Databases
– Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
– COG, clusters of orthologous groups, NCBI (21 complete genomes)
– Pfam, Sanger Center (gapped profiles, curated)
– Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
– Prosite, Univ. Geneva (consensus patterns, expert-curated)
– SMART, EMBL-Heidelberg
– InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. (see two pages forward)

Page 9: Good solutions are advantageous

Motif discovery tools and PSSM creators

• The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters such as
– Min-max length
– Number per sequence
– Number per set

• MEME also generates PSSMs for the motifs it finds.

• MAST is a tool for searching databases with PSSMs
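
To make the link between motif occurrences and a PSSM concrete, the sketch below builds a log-odds matrix from a handful of already aligned motif instances. This is a common textbook construction, not MEME's own procedure (MEME discovers the occurrences in unaligned sequences by expectation maximisation). The uniform background, the pseudocount of 1 and the toy occurrences are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids


def build_pssm(occurrences, pseudocount=1.0):
    """Log-odds PSSM (one dict per position) from aligned motif occurrences."""
    width = len(occurrences[0])
    if any(len(occ) != width for occ in occurrences):
        raise ValueError("all occurrences must have the same length")
    background = 1.0 / len(ALPHABET)          # assumed uniform background
    pssm = []
    for column in zip(*occurrences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return pssm


# Toy motif occurrences (hypothetical).
occurrences = ["KRPC", "RRPC", "KRPC", "RRPC"]
pssm = build_pssm(occurrences)
for position, row in enumerate(pssm, start=1):
    best = max(row, key=row.get)
    print(position, best, round(row[best], 2))
```

A matrix built this way can then be slid along database sequences, which is the kind of search MAST performs with the PSSMs that MEME produces.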

Page 10: Good solutions are advantageous

The InterPro database of motifs at EBI

• The InterPro release of Nov 2001 was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data. This release of InterPro contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.

Page 11: Good solutions are advantageous

Scan the InterPro database - example

• The InterPro database was scanned with the PAX6 sequence from the fruit fly.

Page 12: Good solutions are advantageous

Protein 3D structure

• 3D is better than linear strings of letters...

• Protein folding is critical for function

• Protein folding is ordered

• Structures consist of folds

• 3D structure can be measured experimentally, but computational ab initio structure prediction is a tough task and nearly impossible above a certain protein size (CPU and rule limits)

Page 13: Good solutions are advantageous

Protein 3D structure building blocks

• Primary structure: the linear array of amino acids

• Secondary structures
– Alpha helix
– Beta-strand

• Tertiary structures

Figure: DNA-binding protein (DNA helix, white; helices, pink; sheets of beta-strands, ochre)