26
Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Embed Size (px)

Citation preview

Page 1: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Sequence analysis:Macromolecular motif recognition

Sylvia Nagl

Page 2: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Amino acid primary sequence

2. Homologue(s) with known 3D structure?

Homology modellingavailable

1. Search for sequence homologue(s) and construct an alignment

3. Motif recognition: Search secondary databases

Secondary structure prediction

Fold assignment

Physico-chemical properties

(e. g., using EMBOSS suite)

DNA sequence

Automatic translation

Primary db searches

FASTA, BLAST

Page 3: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

•Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation site etc.

•Pattern: a qualitative motif description based on a regular expression-like syntax

•Profile: a quantitative motif description - assigns a degree of similarity to a potential match

TerminologyTerminology

Page 4: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

EXAMPLE: CATHEPSIN A

PEPTIDASE FAMILY S10

EC # 3.4.16.5

3-D representation

3D profile

(PROCAT)

Active site recognition

Page 5: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

419IAFLTIKGAGHMVPTDKP436

1ac5

1ivy

438LTFVSVYNASHMVPFDKS455

Active site motifs

Conserved seq patterns

Page 6: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Domain recognition

Kringle domain from plasminogen protein

EGF-like domain from coagulation factor X

Page 7: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Why search for motifs?

•to find “homologous” sequences

apply existing information to new sequence

find functionally important sites

•to find templates for homology modelling -lecture on homology modelling

Macromolecular motif recognition

Page 8: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Percent identity Method

100

90

80

70

60

50

40

30

20

10

0

Twilight zone

Midnight zone

Automatic pairwise

Alignment BLAST, Fasta)

Macromolecular motif recognition

Structure prediction

Different analysis methods

Page 9: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

What do we need?

•Method for defining motifs

•Algorithm for finding them

•Statistics to evaluate matches

Macromolecular motif recognition

Page 10: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Methods for defining motifs:

•Regular expression (patterns)

•Profiles

•Hidden Markov Model (HMM)

Macromolecular motif recognition

Page 11: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

1-D representation: Primary amino acid sequenceMIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTEVAQSNFEALQDFFRLFPEYKNNKL...

Computational sequence analysis

Query secondary databases over the

Internet

Macromolecular motif recognition

http://www.ebi.ac.uk/interpro/

Page 12: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Macromolecular motif recognition

full domain alignment

single motif

multiple motifs

exact regular expression (PROSITE)

residue frequency matrices (PRINTS)

profile (PROSITE)

Hidden Markov Model (Pfam, PROSITE)

Page 13: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

419IAFLTIKGAGHMVPTDKP436

1ac5

1ivy

438LTFVSVYNASHMVPFDKS455

Active site motifs

Conserved seq patterns

Page 14: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Prosite: Regular expressions

CARBOXYPEPT_SER_HIS

[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]

Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched.

 

Motif modelling methods

Page 15: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Basic rules for regular expressions

• Each position is separated by a hyphen “-”

• A symbol X is a regular expression matching itself

• x means ‘any residue’

• [ ] surround ambiguities - a string [XYZ] matches any of the enclosed symbols

• A string [R]* matches any number of strings that match

• { } surround forbidden residues

• ( ) surround repeat counts

Model formation

•Restricted to key conserved features in order to reduce the “noise” level

•Built by hand in a stepwise fashion from multiple alignments

Regular expressions contd.

Page 16: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Regular expressions contd.

Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata.

“all-or-none”

Page 17: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Prints: Residue frequency matrices

Motif 1 NPESWTNFANMLWNPYSWVNLTNVLW REYSWHQNHHMIY NEGSWISKGDLLF NPYSWTNLTNVVY NEYSWNKMASVVY NDFGWDQESNLIY NENSWNNYANMIY NEYGWDQVSNLLY NPYAWSKVSTMIY NPYSWNGNASIIY NEYAWNKFANVLF NPYSWNRVSNILY NPYSWNLIANVLY NEYRWNKVANVLF

Motif 2 LDQPFGTGYSQ VDNPVGAGFSY VDQPVGTGFSL VDQPGGTGFSS IDNPVGTGFSF IDQPTGTGFSV VDQPLGTGYSY IDQPAGTGFSP LESPIGVGFSY LDQPVGSGFSY LDQPVGSGFSY LDQPINTGFSN LDQPIGAGFSY LDAPAGVGFSY LDQPVGAGFSY

Motif 3 FFQHFPEYQTNDFHIAGESYAGHYIP

FFNKFPEYQNRPFYITGESYGGIYVP WVERFPEYKGRDFYIVGESYAGNGLM FLSKFPEYKGRDFWITGESYAGVYIP WFQLYPEFLSNPFYIAGESYAGVYVP FFEAFPHLRSNDFHIAGESYAGHYIP FFRLFPEYKDNKLFLTGESYAGIYIP FLTRFPQFIGRETYLAGESYGGVYVP FFNEFPQYKGNDFYVTGESYGGIYVP WMSRFPQYQYRDFYIVGESYAGHYVP

FFRLFPEYKNNKLFLTGESYAGIYIP FFRLFPEYKNNKLFLTGESYAGIYIP WLERFPEYKGREFYITGESYAGHYVP WMSRFPQYRYRDFYIVGESYAGHYVP WFEKFPEHKGNEFYIAGESYAGIYVP

Motif 4 LAFTLSNSVGHMAP

LQFWWILRAGHMVA LMWAETFQSGHMQP LTYVRVYNSSHMVP LQEVLIRNAGHMVP LTFVSVYNASHMVP LTFARIVEASHMVP LTFSSVYLSGHEIP IDVVTVKGSGHFVP MTFATIKGSGHTAE MTFATIKGGGHTAE FGYLRLYEAGHMVP MTFATVKGSGHTAE ITLISIKGGGHFPA MTFATVKGSGHTAE

Motif modelling methods

•a collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures•motifs are encoded in ungapped ”raw” sequence format•different scoring methods may be superimposed onto the data, e. .g. BLAST•improved diagnostic reliability•mutual context provided by motif neighbours

Page 18: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Prosite: Profiles

Feature is represented as a matrix with a score for every possible character.

Matrix is derived from a sequence alignment, e.g.:

F K L L S H C L L V F K A F G Q T M F QY P I V G Q E L L GF P V V K E A I L KF K V L A A V I A DL E F I S E C I I Q

Motif modelling methods

Page 19: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Derived matrix:

A -18 -10 -1 -8 8 -3 3 -10 -2 -8 C -22 -33 -18 -18 -22 -26 22 -24 -19 -7 D -35 0 -32 -33 -7 6 -17 -34 -31 0 E -27 15 -25 -26 -9 23 -9 -24 -23 -1 F 60 -30 12 14 -26 -29 -15 4 12 -29 G -30 -20 -28 -32 28 -14 -23 -33 -27 -5 H -13 -12 -25 -25 -16 14 -22 -22 -23 -10 I 3 -27 21 25 -29 -23 -8 33 19 -23 K -26 25 -25 -27 -6 4 -15 -27 -26 0 L 14 -28 19 27 -27 -20 -9 33 26 -21 M 3 -15 10 14 -17 -10 -9 25 12 -11 N -22 -6 -24 -27 1 8 -15 -24 -24 -4 P -30 24 -26 -28 -14 -10 -22 -24 -26 -18 Q -32 5 -25 -26 -9 24 -16 -17 -23 7 R -18 9 -22 -22 -10 0 -18 -23 -22 -4 S -22 -8 -16 -21 11 2 -1 -24 -19 -4 T -10 -10 -6 -7 -5 -8 2 -10 -7 -11 V 0 -25 22 25 -19 -26 6 19 16 -16 W 9 -25 -18 -19 -25 -27 -34 -20 -17 -28 Y 34 -18 -1 1 -23 -12 -19 0 0 -18

Alignment positions

Profiles contd.

Page 20: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Profiles contd.

•inclusion of all possible information to maximise overall signal of protein/domain

i. e., a full representation of features in the aligned sequences

•can detect distant relationships with only few well conserved residues

•position-dependent weights/penalties for all 20 amino acids -- BASED ON AMINO ACID SUBSTITUTION MATRICES -- and for gaps and insertions

•dynamic programming algorithms for scoring hits

Page 21: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Pfam and Prosite: Hidden Markov Models(HMMs)

•Feature is represented by a probabilistic model of interconnecting match, delete or insert states•contains statistical information on observed and expected positional variation - “platonic ideal of protein family”

B EMi

Di

Ii

Macromolecular motif recognition

Page 22: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

Pfam and Prosite: Hidden Markov Models(HMMs)

B EMi

Di

Ii

Macromolecular motif recognition

P of a given amino acid to occurs in a particular state (M, I, D) - at particular position in sequence (for all 20, profile-like)

P of transition state

Page 23: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

•Statistical tests aim to assess the likelihood that a match of a query sequence to a profile, regular expression, HMM, etc, is the result of chance.

•They control for such factors as sequence (match) length, amino acid composition and size of the database searched.

Statistical significance

Page 24: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

•log-odds score: this number is the log of the ratio between two probabilities - P that the sequence belongs to the positive set, and P that the result was obtained by chance due to the amino acid distribution in the positive set (random model).

•Z-score: one needs to estimate an average score and a standard deviation as a function of sequence length. Then, one uses the number of standard deviations each sequence is away from the average as the score.

•e-value (Expect value): given a database search result with alignment score S, the e-value is the expected number of sequences of score >= S that would be found by random chance.

•p-value: the probability that one or more sequences of score >= S would have been found randomly.

Statistical significance

Page 25: Sequence analysis: Macromolecular motif recognition Sylvia Nagl

INTERPRO

•The InterPro database allows efficient searching

•An integrated annotation resource for protein families, domains and functional sites that amalgamates the efforts of the PROSITE, PRINTS, Pfam, ProDom, SMART and TIGRFAMs secondary database projects.

http://www.ebi.ac.uk/interpro

Page 26: Sequence analysis: Macromolecular motif recognition Sylvia Nagl