1 Protein signatures, classification and functional
analysis
Slide 2
Menu Introduction: some definitions How to model domains ?
Pattern Profile HMM Domain/family databases (InterPro)
Slide 3
Protein domain/family: some definitions Most proteins have
modular conserved structures Estimation: ~ 3 domains / protein
-> Prediction of domain content of a unkown protein sequence may
help to find a function Estimation: ~ 80% of protein have at least
a known domain
Slide 4
Number of domains per protein
http://prodom.prabi.fr/prodom/current/archives/2006.1/stat.html
~100 protein sequences with 50 domains
Slide 5
CSA_PPIASE TPR Cys 181: active site residue Binding cleft
(motif) Example of conserved regions (PPID family) - 1 CSA_PPIASE
(cyclophilin-type peptydil-prolyl cis-trans isomerase) (domain) - 3
TPR repeats (tetratrico peptide repeat). - 1 active site - Binding
cleft (motif)
Slide 6
InterPro scan results ?
Slide 7
General definitions of conserved sequence signatures Conserved
regions in biological sequences can be classified into 5 different
groups: Domains: specific combination of secondary structures
organized into a characteristic three dimensional structure or
fold. Families: groups of proteins that have the same domain
arrangement or that are conserved along the whole sequence.
Repeats: structural units always found in two or more copies that
assemble in a specific fold. Assemblies of repeats might also be
thought of as domains. Motifs: region of domains containing
conserved active or binding residues, or short conserved regions
present outside domains that may adopt folded conformation only in
association with their binding ligands. Sites: functional residues
(active sites, disulfide bridges, post-translation modified
residues).
Slide 8
CSA_PPIASE TPR Cys 181: active site residue Binding cleft
(motif) Example of conserved regions (PPID family) - 1 CSA_PPIASE
(cyclophilin-type peptydil-prolyl cis-trans isomerase) (domain) - 3
TPR repeats (tetratrico peptide repeat). - 1 active site - Binding
cleft (motif)
Slide 9
Slide 10
What makes Bee special?
Slide 11
Measures of Conservation Identity: Proportion of pairs of
identical residues between two aligned sequences. Generally
expressed as a percentage. This value depends on how the two
sequences are aligned. Similarity: Proportion of pairs of similar
residues between two aligned sequences. If two residues are similar
can determined by a substitution matrix (e.g. BLOSUM62). This value
depends strongly on the scoring system used. !!! But not Homology:
Two sequences are homologous if and only if they have a common
ancestor. This is not a measure of conservation and there is no
percentage of homology! (It's either yes or no). Homologous
sequences do not necessarily serve the same function, nor are they
always highly similar: structure may be conserved while sequence is
not.
Slide 12
How to measure conservation ? Pairwise vs multiple sequence
alignments Blast vs modelled MSA
Slide 13
Murcia 2011Domain Family databases13 Detect conservation using
pairwise alignments A popular way to identify similarities between
proteins is to perform a pairwise alignment (Blast, Fasta). When
the identity is higher than 40% this method gives good results.
However, the weakness of the pairwise alignment is that no
distinction is made between an amino acid at a crucial position
(like an active site) and an amino acid with no critical role (not
enough information).
Slide 14
Pairwise alignment
Slide 15
Detect conservation using MSA A multiple sequence alignment
(MSA) gives a more general view of a conserved region by providing
a better picture of the most conserved residues, which are usually
essential for the protein function. MSA contains higher information
content than pairwise alignments
Slide 16
Slide 17
Slide 18
How to use MSA to look for conservation ? -> 1- Model MSA
using various methods -> 2- Align the model with your sequence
(InterPro scan)
Slide 19
Murcia 2011Domain Family databases19 Methods to Build Models of
MSA Consensus: Consensus, Patterns Profile: Position Speficic
Scoring Matrices (PSSMs), Generalized Profiles, Hidden Markov
Models (HMMs), PSI-BLAST. pattern or PSSM/profile specific is
called descriptor, descriptor motif, discriminator or
predictor
Slide 20
Why do we need models of MSA? Why do we need classifiers ? to
resume in a single descriptor" the differences and similarities
observed in each column of the MSA; to use the model/descriptor to
search for similar sequences; to classify similar sequences; to
align correctly important residues and detect variations in active
sites and other important regions of one protein (i.e. SNP); to
build databases of models/descriptors which can be used to annotate
new proteomes MSA models are more sensitive than Blast (pairwise
alignment)
Slide 21
Consensus - pattern
Slide 22
Murcia 2011Domain Family databases22 Consensus Sequences Useful
to detect protein belonging to a specific family or a protein
domain; much less useful at the DNA level due to the small alphabet
(4 letters) and the low sequence conservation of DNA sequence
elements (except for the detection of enzyme restriction sites).
Patterns do not attempt to describe a complete domain or protein
family, but simply try to identify the most important residue
combinations, such as the catalytic site of an enzyme. They focus
on the most highly conserved residues in a protein family (motifs,
sites).
Slide 23
Murcia 2011Domain Family databases23 Use of pattern Patterns
are used to describe small functional regions: Enzyme catalytic
sites; Prosthetic group attachment sites (heme, PLP, biotin, etc.);
Amino acids involved in binding a metal ion; Cysteines involved in
disulfide bonds; Regions involved in binding a molecule (ATP,
calcium, DNA etc.) or a protein. N-glycosylation sites
Slide 24
Murcia 2011Domain Family databases24 How to Build a PROSITE
Pattern Start with a multiple sequence alignment (MSA)
Slide 25
Murcia 2011Domain Family databases25 Consensus Sequences:
PROSITE Patterns syntax The PROSITE patterns are described using
the following conventions: ex:
Murcia 2011Domain Family databases49 Algorithm and Software to
buid and use Generalized Profiles Pftools is a package to perform
the different steps of the construction of a profile and to search
a database of protein (or DNA) with a profile.
http://www.isrec.isb-sib.ch/ftp-server/pftools Searching algorithm:
dynamic programming (similar to Smith- Waterman algorithm). ->
guaranteed to find the optimal local alignment with respect to the
scoring system being used (which includes the substitution matrix
and the gap-scoring scheme)
Slide 50
http://www.expasy.org/tools/scanpro site/
Slide 51
Murcia 2011Domain Family databases51 Statistical Significance
of Sequence Similarities Each method (except patterns) gives a
score of similarity between the query sequence and the subject
sequence or the method. Ones need to estimate if this raw score can
occure by chance. This is done by the E-value or expected value The
E-value is the number of matches with a score equal to or greater
than the observed score that are expected to occur by chance. An
E-value of 1 is considered not to be significant. An E-value of 0.1
possibly to be significant. An E-value of 0.01 most likely to be
significant. Pitfall: The E-value depends on the size of the
searched database, as the number of false positives expected above
a given score threshold usually increases proportionally with the
size of the database.
Slide 52
Murcia 2011Domain Family databases52 Advantage and Limitation
of Generalized Profiles Strenghs: Very sensitive to detect
similarities (close to the twilight zone). Good scoring system.
Weaknesses: Require some expertise to use efficiently. Very CPU
expensive.
Slide 53
Slide 54
HMM
Slide 55
Generalized Profiles can be represented in a probabilistic
framework named Hidden Markov Models (HMMs).
Slide 56
Murcia 2011Domain Family databases56 HMM profiles Each position
in an HMM consists of a Match, Insert and Deletion state Parameters
describing a HMM profile: Emission probability: the probability of
emitting an amino acid x being in state q (Amino acid emission
probabilities are evaluated from observed frequencies as for PSSM).
Transition probability: 3 states: Match (M), Deletion (D),
Insertion (I). Transitions: M->I, M->D, I->M, I->D
Transition probabilities are evaluated from observed transition
frequencies.
Slide 57
M1M2M3M4M5M6M7M8M9M10M4M5M6M7M8M9M10 I = insert state
I1I2I3I4I5I6I7I8I9 D = delete state D2D3D4D5D6D7D8D9 Hidden Markov
Models (HMM) Each position in an HMM consists of a Match, Insert
and Deletion state M = match state
Murcia 2011Domain Family databases59 HMM Profile softwares
HMMER is a package to build and use HMMs
(http://hmmer.janelia.org/)http://hmmer.janelia.org/ Used by Pfam,
SMART and TIGRfam databases. SAM is a similar package
(http://www.cse.ucsc.edu/research/compbio/sam.html).http://www.cse.ucsc.edu/research/compbio/sam.html
Used by SCOP superfamily and gene3D.
Slide 60
Murcia 2011Domain Family databases60 Advantage and Limitation
of HMM Profiles Advantage: Solid theoretical basis: more efficient
than generalized profile to estimate insertion and deletion
penalties. Other advantages and limitations just like generalized
profiles.
Slide 61
Murcia 2011Domain Family databases61 Generalized Profiles and
HMM Profiles The format of generalized profiles is equivalent to
the one of HMM profiles. It is easy to convert a generalized
profile in a HMM profile without loosing information: htop program:
convert a HMM profile (HMMER) in generalized profile. ptoh program:
convert a generalized profile in HMM profile (HMMER).
Slide 62
Domain/Family databases
Slide 63
MSA models are stored in databases (Prosite, PRINTS, Pfam and
InterPro)
InterPro scan results ? Part of the protein sequence wich has
been recognized by different modelled MSA
Slide 66
protein folding InterPro hits InterPro domain architecture
Slide 67
Murcia 2011Domain Family databases67 PROSITE PROSITE is a
database containing patterns and generalized profiles.
http://www.expasy.org/prosite Contains ~1300 patterns and ~1000
generalized profiles. Good documentation. PROSITE is also use to
annotate UniProtKB/Swiss-Prot.
Slide 68
Murcia 2011Domain Family databases68 PROSITE Documentation
Page
Slide 69
Murcia 2011Domain Family databases69 PROSITE Pattern Page ID
ZF_RING_1; PATTERN. AC PS00518; DT DEC-1991 (CREATED); JUN-1994
(DATA UPDATE); DEC-2005 (INFO UPDATE). DE Zinc finger RING-type
signature. PA C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]. NR
/RELEASE=48.7,204086; NR /TOTAL=354(354); /POSITIVE=352(352);
/UNKNOWN=0(0); /FALSE_POS=2(2); NR /FALSE_NEG=375; /PARTIAL=2; CC
/TAXO-RANGE=??E?V; /MAX-REPEAT=1; CC /VERSION=1; DR Q02084,
A33_PLEWA, T; Q09654, ARD1_CAEEL, T; P36406, ARD1_HUMAN, T; DR
Q8BGX0, ARD1_MOUSE, T; P36407, ARD1_RAT, T; O76924, ARI2_DROME, T;
DR O95376, ARI2_HUMAN, T; Q9Z1K6, ARI2_MOUSE, T; Q99728,
BARD1_HUMAN, T; DR O70445, BARD1_MOUSE, T; Q9QZH2, BARD1_RAT, T;
Q9NZS9, BFAR_HUMAN, T; DR Q8R079, BFAR_MOUSE, T; Q5PQN2, BFAR_RAT,
T; Q96CA5, BIRC7_HUMAN, T;... DR P18541, ZNFP_LYCVA, N; P19326,
ZNFP_LYCVP, N; P19325, ZNFP_LYCVT, N; DR Q88470, ZNFP_TACV, N;
Q8NEG5, ZSWM2_HUMAN, N; Q9D9X6, ZSWM2_MOUSE, N; DR Q6UY11,
EGFL9_HUMAN, F; P30735, VE6_MNPV, F; 3D 1BOR; 1CHC; 1FBV; 1G25;
1JM7; 1RMD; DO PDOC00449; //
Slide 70
Murcia 2011Domain Family databases70 PROSITE profile Page
Slide 71
Murcia 2011Domain Family databases71 Scanprosite Web Page
Slide 72
Murcia 2011Domain Family databases72 Scan Prosite Output The
PROSITE database is now complemented by a series of rules that can
give more precise information about specific residues.
Slide 73
Murcia 2011Domain Family databases73 ProRule
Slide 74
Murcia 2011Domain Family databases74 Pfam The largest
collection of curated domains and families (~10000). Very good
descriptors (Few false positives and false negatives). But ~3000
motifs have less than 10 matches on UniProtKB. Uses HMM profiles
(HMMER3). http://pfam.sanger.ac.uk/
Slide 75
Murcia 2011Domain Family databases75 Pfam entry page
Slide 76
Murcia 2011Domain Family databases76 SMART ~ 800 descriptors.
Concentrates on large domain families and the identification of new
domains. Uses HMM profiles (HMMER2). Weak annotation. Good tools
for genomic analysis.
http://smart.embl.de/smart/set_mode.cgi?NORMAL=1
Slide 77
Murcia 2011Domain Family databases77 SMART homepage
Slide 78
Murcia 2011Domain Family databases78 ProDom ProDom is a
database of protein domain families generated automatically from
the global comparison of all available protein sequences (last
release in 2008 !!). Descriptors are built with PSI-BLAST No
annotation http://prodom.prabi.fr/prodom/current/html/home.php Used
to defined new pfam families
Slide 79
Murcia 2011Domain Family databases79 Family databases: PRINTS
Fingerprints are combination of ungapped PSSM. As gaps are not
allowed they are usually directed against well conserved short
motifs. The PRINTS database is specialised in subfamily
classification. (The GPCR family was divided in more than 100
sub-families) Contains 12000 motifs.
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Slide 80
Murcia 2011Domain Family databases80 PRINTS homepage
Slide 81
Murcia 2011Domain Family databases81 Other family databases
PANTHER: was developed to annotate the human genome. Contains a lot
of models for mammalian proteins, but very few for plant, fungi or
bacteria. Family/subfamily classification, more than 5000 families
and 25 000 subfamilies. Automatically generated.
http://www.pantherdb.org PIRSF: good annotation for functional
residues. ~30000 automatically generated HMM profiles.
http://pir.georgetown.edu/pirsf/ TIGRFAM only for prokaryotic
proteins. 3500 HMM profiles http://www.tigr.org/TIGRFAMs/
Slide 82
Murcia 2011Domain Family databases82 Scop superfamily and CATH
Scop Superfamily and CATH are structural domain database using HMM
profiles. Hierarchical classification of domains. Use HMM profiles
(SAM). Domain boundaries are semi-automatically extracted. Very
sensitive methods (often more matches for a given domain than Pfam
or PROSITE). Usefull for structure prediction but dangerous for
functional prediction. Tends to group structurally related domains
but with no functional relationship. (ex: tpr repeat: only alpha
helices. SCOP or CATH tpr repeat profiles picked-up a lot of
conserved regions rich in alpha helices but not evolutively link to
tpr) http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
http://www.cathdb.info/
Slide 83
InterPro integrates MSA models from various databases and
organize them and their annotation so relationships emerge.
Slide 84
Murcia 2011Domain Family databases84 InterPro Interpro is an
attempt to group a number of protein databases:Pfam, PROSITE,
PRINTS, ProDom, SMART TIGRFAM, SCOP superfamily, Gene3D.
http://www.ebi.ac.uk/interpro InterPro tries to have and maintain a
high quality annotation. The database and a stand-alone package are
available to locally run a complete InterPro analysis.
ftp://ftp.ebi.ac.uk/pub/databases/interpro/
Slide 85
Murcia 2011Domain Family databases85 InterProScan
Slide 86
Murcia 2011Domain Family databases86 InterProScan Output
Slide 87
Slide 88
InterPro protein coverage 96.0% of UniProtKB/SwissProt 78.6% of
UniProtKB/TrEMBL
Slide 89
Murcia, February, 2011 Protein Sequence Databases
Slide 90
Murcia, February, 2011
Slide 91
Protein Sequence DatabasesMurcia, February, 2011
Slide 92
Never forget that: The computational sequence analysis tools
are nave about real biology and the complex relationships between
molecular elements and proteins. Therefore we should be critical
about what we can achieve with such computational sequence analysis
tools. So, again, be critical and understand the biology.
Slide 93
Many thanks to Lorenzo Cerruti Nicolas Hulo Jennifer McDonald
And you !
Slide 94
Murcia 2011Domain Family databases94 Further Reading Durbin,
Eddy, Mitchison, Krog. Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic acids. Cambridge University Press,
1998. Attwood TK, Parry-Smith DJ. Introduction to bioinformatics.
Addison Wesley Longman Limited, 1999 Krogh A, Brown M, Mian IS,
Sjolander K, Haussler D. Hidden Markov models in computational
biology. Applications to protein modeling. J Mol Biol. 1994 Feb
4;235(5):1501-31. Altschul SF, Madden TL, Schaffer AA, Zhang J,
Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
1997 Sep 1;25(17):3389-402.