Protein signatures, classification and functional analysis

Menu: Introduction (some definitions). How to model domains? Pattern, Profile, HMM. Domain/family databases (InterPro).

Protein domain/family: some definitions. Most proteins have modular conserved structures. Estimation: ~3 domains per protein. -> Predicting the domain content of an unknown protein sequence may help to find its function. Estimation: ~80% of proteins have at least one known domain.

Number of domains per protein (http://prodom.prabi.fr/prodom/current/archives/2006.1/stat.html): ~100 protein sequences with 50 domains.

Example of conserved regions (PPID family). Figure labels: CSA_PPIASE; TPR; Cys 181: active site residue; binding cleft (motif).
- 1 CSA_PPIASE domain (cyclophilin-type peptidyl-prolyl cis-trans isomerase)
- 3 TPR repeats (tetratricopeptide repeat)
- 1 active site
- 1 binding cleft (motif)

InterPro scan results ?

General definitions of conserved sequence signatures. Conserved regions in biological sequences can be classified into 5 different groups:
- Domains: a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
- Families: groups of proteins that have the same domain arrangement or that are conserved along the whole sequence.
- Repeats: structural units always found in two or more copies that assemble in a specific fold. Assemblies of repeats might also be thought of as domains.
- Motifs: regions of domains containing conserved active or binding residues, or short conserved regions present outside domains that may adopt a folded conformation only in association with their binding ligands.
- Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues).

What makes Bee special?

Measures of conservation
- Identity: the proportion of pairs of identical residues between two aligned sequences, generally expressed as a percentage. This value depends on how the two sequences are aligned.
- Similarity: the proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix (e.g. BLOSUM62). This value depends strongly on the scoring system used.
- Homology is not one of them: two sequences are homologous if and only if they have a common ancestor. This is not a measure of conservation and there is no percentage of homology (it is either yes or no). Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not.
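To make the two measures concrete, here is a minimal sketch that computes percent identity and percent similarity for two already-aligned sequences. It rests on two assumptions that are not in the slides: columns containing a gap are ignored (other conventions for the denominator exist), and similarity is decided by a small hypothetical table of residue groups rather than by a real substitution matrix such as BLOSUM62.

```python
# Hypothetical similarity groups, for illustration only; a real analysis
# would take similarity from a substitution matrix such as BLOSUM62.
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG")


def identity_and_similarity(seq_a, seq_b, groups=SIMILAR_GROUPS):
    """Return (% identity, % similarity) over gap-free aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Map each residue to the index of its similarity group.
    group_of = {res: idx for idx, grp in enumerate(groups) for res in grp}
    compared = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif a in group_of and group_of[a] == group_of.get(b):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


identity, similarity = identity_and_similarity("ACDE-K", "ACEEGR")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

Changing the grouping table or the gap convention changes the similarity value, which is the dependence on the scoring system noted above.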

How to measure conservation? Pairwise vs multiple sequence alignments; BLAST vs modelled MSA.

Murcia 2011, Domain Family databases. Detect conservation using pairwise alignments. A popular way to identify similarities between proteins is to perform a pairwise alignment (BLAST, FASTA). When the identity is higher than 40%, this method gives good results. However, the weakness of pairwise alignment is that it makes no distinction between an amino acid at a crucial position (such as an active site) and an amino acid with no critical role: two sequences alone do not carry enough information.

Pairwise alignment

Detect conservation using MSA. A multiple sequence alignment (MSA) gives a more general view of a conserved region by providing a better picture of the most conserved residues, which are usually essential for the protein function. An MSA has a higher information content than pairwise alignments.
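As a rough illustration of how an MSA exposes the most conserved residues, the sketch below reports, for each alignment column, the most frequent residue and the fraction of sequences carrying it. The toy alignment and the function name are made up for this example; real tools use more refined conservation scores.

```python
from collections import Counter


def column_conservation(msa):
    """For each column, return (most frequent residue, fraction of rows).

    `msa` is a list of equal-length aligned sequences; '-' marks a gap.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*msa):
        residues = [res for res in column if res != "-"]
        if not residues:
            result.append(("-", 0.0))  # column made of gaps only
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Gaps stay in the denominator, so gappy columns score lower.
        result.append((residue, count / len(column)))
    return result


toy_msa = ["ACDK", "ACEK", "ACDR", "AC-K"]  # hypothetical alignment
for position, (residue, fraction) in enumerate(column_conservation(toy_msa), 1):
    print(position, residue, f"{fraction:.2f}")
```

Columns with a fraction of 1.0 are the invariant positions that a pairwise alignment cannot single out.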

How to use an MSA to look for conservation? 1. Model the MSA using various methods. 2. Align the model with your sequence (InterProScan).

Methods to build models of MSA. Consensus: consensus sequences, patterns. Profile: Position Specific Scoring Matrices (PSSMs), generalized profiles, Hidden Markov Models (HMMs), PSI-BLAST. A pattern, PSSM or profile specific to a family is called a descriptor, descriptor motif, discriminator or predictor.

Why do we need models of MSA? Why do we need classifiers?
- to summarize in a single "descriptor" the differences and similarities observed in each column of the MSA;
- to use the model/descriptor to search for similar sequences;
- to classify similar sequences;
- to align important residues correctly and detect variations in active sites and other important regions of a protein (e.g. SNPs);
- to build databases of models/descriptors which can be used to annotate new proteomes.
MSA models are more sensitive than BLAST (pairwise alignment).

Consensus - pattern

Consensus sequences. Useful to detect proteins belonging to a specific family or containing a specific protein domain; much less useful at the DNA level due to the small alphabet (4 letters) and the low sequence conservation of DNA sequence elements (except for the detection of restriction enzyme sites). Patterns do not attempt to describe a complete domain or protein family, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They focus on the most highly conserved residues in a protein family (motifs, sites).

Use of patterns. Patterns are used to describe small functional regions:
- enzyme catalytic sites;
- prosthetic group attachment sites (heme, PLP, biotin, etc.);
- amino acids involved in binding a metal ion;
- cysteines involved in disulfide bonds;
- regions involved in binding a molecule (ATP, calcium, DNA, etc.) or a protein;
- N-glycosylation sites.

How to build a PROSITE pattern: start with a multiple sequence alignment (MSA).

Consensus sequences: PROSITE pattern syntax. The PROSITE patterns are described using the following conventions: ex:
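The slide's own list of conventions did not survive in this transcript. As a hedged illustration of the standard PROSITE syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends), the sketch below translates a pattern into a Python regular expression. The function name and the test sequence are invented for this example, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re

# One PROSITE element: optional '<', a residue spec, optional repeat, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


# Pattern of PROSITE entry PS00518 (zinc finger RING-type signature).
regex = prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA].")
print(regex)
hit = re.search(regex, "AACPHALCKKCVAA")  # hypothetical test sequence
print(hit.start(), hit.group())
```

Because a pattern either matches or does not, this kind of search returns no score, which is why patterns are excluded from the E-value discussion.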

Algorithm and software to build and use generalized profiles. pftools is a package that performs the different steps of constructing a profile and searching a protein (or DNA) database with a profile. http://www.isrec.isb-sib.ch/ftp-server/pftools Searching algorithm: dynamic programming (similar to the Smith-Waterman algorithm). -> It is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
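The dynamic programming idea can be sketched with a plain pairwise Smith-Waterman scorer. This is not the pftools implementation: it assumes a toy scoring system (fixed match and mismatch scores, a linear gap penalty) chosen only for illustration. A profile search would replace the match/mismatch lookup with position-specific scores and gap costs.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between two sequences.

    Toy scoring (an assumption): fixed match/mismatch scores and a
    linear gap penalty. Only two rows of the DP matrix are kept.
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in seq_b
                current[j - 1] + gap,            # gap in seq_a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Every cell is filled from its three neighbours, so the optimum for the chosen scoring system cannot be missed; the cost is time proportional to the product of the two lengths, which is one reason profile searches are CPU expensive.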

http://www.expasy.org/tools/scanprosite/

Statistical significance of sequence similarities. Each method (except patterns) gives a similarity score between the query sequence and the subject sequence or model. One needs to estimate whether this raw score could occur by chance. This is done with the E-value (expected value): the number of matches with a score equal to or greater than the observed score that are expected to occur by chance.
- An E-value of 1 is not considered significant.
- An E-value of 0.1 is possibly significant.
- An E-value of 0.01 is most likely significant.
Pitfall: the E-value depends on the size of the searched database, as the number of false positives expected above a given score threshold usually increases proportionally with the size of the database.
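The database-size pitfall can be shown with a small sketch. It assumes the common approximation E = p x N, where p is the probability that a single random comparison reaches the observed score and N is the number of sequences searched. Both names and the numbers below are illustrative, not values from the slides.

```python
def e_value(p_value, database_size):
    """Expected number of chance matches at or above the observed score.

    Assumes E = p * N, with p the per-comparison probability of reaching
    the score by chance and N the number of sequences searched.
    """
    return p_value * database_size


# The same hit (same score, same p) searched against growing databases.
for size in (10_000, 100_000, 1_000_000):
    print(f"N = {size:>9,}  E = {e_value(1e-6, size):.2f}")
```

With these made-up numbers the same raw score moves from most likely significant to not significant as the database grows a hundredfold, so E-values from searches against different databases are not directly comparable.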

Advantages and limitations of generalized profiles. Strengths: very sensitive in detecting similarities (close to the twilight zone); good scoring system. Weaknesses: require some expertise to use efficiently; very CPU expensive.

Generalized Profiles can be represented in a probabilistic framework named Hidden Markov Models (HMMs).

HMM profiles. Each position in an HMM consists of a Match, an Insert and a Deletion state. Parameters describing an HMM profile:
- Emission probability: the probability of emitting amino acid x while in state q. Amino acid emission probabilities are estimated from observed frequencies, as for a PSSM.
- Transition probability: the probability of moving between the 3 states, Match (M), Deletion (D) and Insertion (I), e.g. M->I, M->D, I->M, I->D. Transition probabilities are estimated from observed transition frequencies.
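A minimal sketch of the emission side of this estimate: match-state emission probabilities are derived from the residues observed in one MSA column. Two choices here are assumptions rather than anything stated in the slides: a simple add-one pseudocount (real packages such as HMMER use more elaborate priors) and a uniform background of 1/20 for the log-odds score.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probabilities(column, pseudocount=1.0):
    """Estimate match-state emission probabilities from one MSA column.

    Gaps and non-standard symbols are ignored. The add-one pseudocount
    (an assumption) keeps unseen residues from getting probability zero.
    """
    counts = Counter(res for res in column.upper() if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_odds_bits(probability, background=1 / 20):
    """Log-odds score in bits against a uniform background (an assumption)."""
    return math.log2(probability / background)


probabilities = emission_probabilities("GGGGA")  # hypothetical column
for aa in "GAW":
    p = probabilities[aa]
    print(aa, f"p = {p:.2f}", f"score = {log_odds_bits(p):+.2f} bits")
```

Transition probabilities can be estimated the same way, by counting how often aligned sequences move between match, insert and delete states from one column to the next.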

Hidden Markov Models (HMM). Each position in an HMM consists of a Match, Insert and Deletion state. [Figure: profile HMM architecture with match states M1 to M10, insert states I1 to I9 and delete states D2 to D9. M = match state, I = insert state, D = delete state.]

HMMER HMM profile (on the slide, the first row of each position is labelled M for match emissions and the second row I for insert emissions):
NAME  ig
ACC   PF00047.15
LENG  65
GA    25.1 13.4
TC    25.1 13.4
NC    25.0 25.0
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -28.914425 0.238245
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -16 * -6461
1     -2647 -5115 -567 223 -5436 3047 164 -5186 -1236 -2912 -4204 524 -2643 -554 -178 -319 -622 -4737 -5298 -4615 1
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -1 -11609 -12651 -894 -1115 -701 -1378 -16 *
2     -972 -498 831 1649 -5434 884 766 -2367 62 -5129 -1 765 -1114 1363 -2178 904 -1046 -4735 -5296 -1449 2
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -1 -11609 -12651 -894 -1115 -701 -1378 * *
3     -1011 -5113 411 -343 -1695 -2365 989 -5184 60 -699 50 -278 850 -148 400 1625 1230 -856 -5296 -4613 3
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -1 -11609 -12651 -894 -1115 -701 -1378 * *

HMM profile software. HMMER is a package to build and use HMMs (http://hmmer.janelia.org/). Used by the Pfam, SMART and TIGRFAM databases. SAM is a similar package (http://www.cse.ucsc.edu/research/compbio/sam.html). Used by SCOP Superfamily and Gene3D.

Advantages and limitations of HMM profiles. Advantage: a solid theoretical basis, which makes them more effective than generalized profiles at estimating insertion and deletion penalties. The other advantages and limitations are the same as for generalized profiles.

Generalized profiles and HMM profiles. The format of generalized profiles is equivalent to that of HMM profiles, so it is easy to convert one into the other without losing information. htop program: converts an HMM profile (HMMER) into a generalized profile. ptoh program: converts a generalized profile into an HMM profile (HMMER).

Domain/Family databases

MSA models are stored in databases (Prosite, PRINTS, Pfam and InterPro)

Signature methods: pattern, fingerprint, sequence clustering, profile, HMM.

InterProScan results? Part of the protein sequence which has been recognized by different modelled MSAs.

[Figure labels: protein folding; InterPro hits; InterPro domain architecture]

PROSITE. PROSITE is a database containing patterns and generalized profiles. http://www.expasy.org/prosite Contains ~1300 patterns and ~1000 generalized profiles. Good documentation. PROSITE is also used to annotate UniProtKB/Swiss-Prot.

PROSITE documentation page

PROSITE pattern page
ID   ZF_RING_1; PATTERN.
AC   PS00518;
DT   DEC-1991 (CREATED); JUN-1994 (DATA UPDATE); DEC-2005 (INFO UPDATE).
DE   Zinc finger RING-type signature.
PA   C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA].
NR   /RELEASE=48.7,204086;
NR   /TOTAL=354(354); /POSITIVE=352(352); /UNKNOWN=0(0); /FALSE_POS=2(2);
NR   /FALSE_NEG=375; /PARTIAL=2;
CC   /TAXO-RANGE=??E?V; /MAX-REPEAT=1;
CC   /VERSION=1;
DR   Q02084, A33_PLEWA, T; Q09654, ARD1_CAEEL, T; P36406, ARD1_HUMAN, T;
DR   Q8BGX0, ARD1_MOUSE, T; P36407, ARD1_RAT, T; O76924, ARI2_DROME, T;
DR   O95376, ARI2_HUMAN, T; Q9Z1K6, ARI2_MOUSE, T; Q99728, BARD1_HUMAN, T;
DR   O70445, BARD1_MOUSE, T; Q9QZH2, BARD1_RAT, T; Q9NZS9, BFAR_HUMAN, T;
DR   Q8R079, BFAR_MOUSE, T; Q5PQN2, BFAR_RAT, T; Q96CA5, BIRC7_HUMAN, T;
...
DR   P18541, ZNFP_LYCVA, N; P19326, ZNFP_LYCVP, N; P19325, ZNFP_LYCVT, N;
DR   Q88470, ZNFP_TACV, N; Q8NEG5, ZSWM2_HUMAN, N; Q9D9X6, ZSWM2_MOUSE, N;
DR   Q6UY11, EGFL9_HUMAN, F; P30735, VE6_MNPV, F;
3D   1BOR; 1CHC; 1FBV; 1G25; 1JM7; 1RMD;
DO   PDOC00449;
//

PROSITE profile page

ScanProsite web page

ScanProsite output. The PROSITE database is now complemented by a series of rules that can give more precise information about specific residues.

ProRule

Pfam. The largest collection of curated domains and families (~10000). Very good descriptors (few false positives and false negatives), but ~3000 motifs have fewer than 10 matches in UniProtKB. Uses HMM profiles (HMMER3). http://pfam.sanger.ac.uk/

Pfam entry page

SMART. ~800 descriptors. Concentrates on large domain families and the identification of new domains. Uses HMM profiles (HMMER2). Weak annotation. Good tools for genomic analysis. http://smart.embl.de/smart/set_mode.cgi?NORMAL=1

SMART homepage

ProDom. ProDom is a database of protein domain families generated automatically from the global comparison of all available protein sequences (last release in 2008!). Descriptors are built with PSI-BLAST. No annotation. http://prodom.prabi.fr/prodom/current/html/home.php Used to define new Pfam families.

Family databases: PRINTS. Fingerprints are combinations of ungapped PSSMs. As gaps are not allowed, they are usually directed against well-conserved short motifs. The PRINTS database specialises in subfamily classification (the GPCR family was divided into more than 100 subfamilies). Contains 12000 motifs. http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

PRINTS homepage

Other family databases
- PANTHER: was developed to annotate the human genome. Contains many models for mammalian proteins, but very few for plants, fungi or bacteria. Family/subfamily classification, with more than 5000 families and 25 000 subfamilies. Automatically generated. http://www.pantherdb.org
- PIRSF: good annotation for functional residues. ~30000 automatically generated HMM profiles. http://pir.georgetown.edu/pirsf/
- TIGRFAM: only for prokaryotic proteins. 3500 HMM profiles. http://www.tigr.org/TIGRFAMs/

SCOP Superfamily and CATH. SCOP Superfamily and CATH are structural domain databases with a hierarchical classification of domains. They use HMM profiles (SAM). Domain boundaries are extracted semi-automatically. Very sensitive methods (often more matches for a given domain than Pfam or PROSITE). Useful for structure prediction but dangerous for functional prediction: they tend to group domains that are structurally related but have no functional relationship. (Example: the TPR repeat consists only of alpha helices, so SCOP or CATH TPR repeat profiles pick up many conserved regions that are rich in alpha helices but not evolutionarily linked to TPR.) http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ http://www.cathdb.info/

InterPro integrates MSA models from various databases and organizes them and their annotation so that relationships emerge.

InterPro. InterPro is an attempt to group a number of protein databases: Pfam, PROSITE, PRINTS, ProDom, SMART, TIGRFAM, SCOP Superfamily, Gene3D. http://www.ebi.ac.uk/interpro InterPro tries to provide and maintain high-quality annotation. The database and a stand-alone package are available to run a complete InterPro analysis locally. ftp://ftp.ebi.ac.uk/pub/databases/interpro/

InterProScan

InterProScan output

InterPro protein coverage: 96.0% of UniProtKB/Swiss-Prot, 78.6% of UniProtKB/TrEMBL.

Murcia, February, 2011 Protein Sequence Databases

Never forget that computational sequence analysis tools are naive about real biology and the complex relationships between molecular elements and proteins. We should therefore be critical about what we can achieve with such tools. So, again, be critical and understand the biology.

Many thanks to Lorenzo Cerruti, Nicolas Hulo, Jennifer McDonald. And you!

Further reading
- Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
- Attwood TK, Parry-Smith DJ. Introduction to Bioinformatics. Addison Wesley Longman Limited, 1999.
- Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 1994 Feb 4;235(5):1501-31.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.
