Pro Site


{PDOC00000} {BEGIN} ********************************** *** PROSITE documentation file *** ********************************** Release : 16.0 of July 1999 and updates up to September 2000. Copyright: Amos Bairoch Swiss Institute of Bioinformatics (SIB) CMU University of Geneva 1, Rue Michel Servet, 1211 Geneva 4 Switzerland Email : [email protected] Telephone: +41-22-702 54 77 Fax : +41-22-702 55 02 Acknowledgements: - To all those mentioned in this document who have reviewed the entry(ies) for which they are listed as experts. With specific thanks to Rein Aasland, Mark Boguski, Peer Bork, Josh Cherry, Andre Chollet, Frank Kolakowski, David Landsman, Bernard Henrissat, Eugene Koonin, Steve Henikoff, Manuel Peitsch and Jonathan Reizer. - Brigitte Boeckmann is the author of the PDOC00691, PDOC00703, PDOC00829, PDOC00796, PDOC00798, PDOC00799, PDOC00906, PDOC00907, PDOC00908, PDOC00912, PDOC00913, PDOC00924, PDOC00928, PDOC00929, PDOC00955, PDOC00961, PDOC00966, PDOC00988 and PDOC50020 entries. - Philipp Bucher is the author of the PDOC50001 and PDOC50002 entries. - Kay Hofmann is the author of the PDOC50003, PDOC50006, PDOC50007 and PDOC50017 entries. - Keith Robison is the author of the PDOC00830 and PDOC00861 entries. - Chantal Hulo is the author of the PDOC00987 entry. - Vivienne Baillie Gerritsen for undertaking the major task of correcting the grammar and style of this document. -----------------------------------------------------------------------PROSITE is copyright. It is produced by the Swiss Institute of Bioinformatics (SIB). There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to [email protected]. 
-----------------------------------------------------------------------{END} {PDOC00001} {PS00001; ASN_GLYCOSYLATION} {BEGIN} ************************ * N-glycosylation site * ************************ It has been known for a long time [1] that potential N-glycosylation sites are specific to the consensus sequence Asn-Xaa-Ser/Thr. It must be noted that the presence of the consensus tripeptide is not sufficient to conclude that an asparagine residue is glycosylated, due to the fact that the folding of the protein plays an important role in the regulation of N-glycosylation [2]. It has been shown [3] that the presence of proline between Asn and Ser/Thr will

inhibit N-glycosylation; this has been confirmed by a recent [4] statistical analysis of glycosylation sites, which also shows that about 50% of the sites that have a proline C-terminal to Ser/Thr are not glycosylated. It must also be noted that there are a few reported cases of glycosylation sites with the pattern Asn-Xaa-Cys; an experimentally demonstrated occurrence of such a non-standard site is found in the plasma protein C [5]. -Consensus pattern: N-{P}-[ST]-{P} [N is the glycosylation site] -Last update: May 1991 / Text revised. [ 1] Marshall R.D. Annu. Rev. Biochem. 41:673-702(1972). [ 2] Pless D.D., Lennarz W.J. Proc. Natl. Acad. Sci. U.S.A. 74:134-138(1977). [ 3] Bause E. Biochem. J. 209:331-336(1983). [ 4] Gavel Y., von Heijne G. Protein Eng. 3:433-442(1990). [ 5] Miletich J.P., Broze G.J. Jr. J. Biol. Chem. 265:11397-11404(1990). {END} {PDOC00002} {PS00002; GLYCOSAMINOGLYCAN} {BEGIN} ************************************* * Glycosaminoglycan attachment site * ************************************* Proteoglycans [1] are complex glycoconjugates containing a core protein to which a variable number of glycosaminoglycan chains (such as heparin sulfate, chondroitin sulfate, etc.) are covalently attached. The glycosaminoglycans are attached to the core proteins through a xyloside residue which is in turn linked to a serine residue of the protein. A consensus sequence for the attachment site seems to exist [2]. However, it must be noted that this consensus is only based on the sequence of three proteoglycan core proteins. -Consensus pattern: S-G-x-G [S is the attachment site] Additional rule: There must be at least two acidic amino acids from -2 to -4 relative to the serine. -Last update: June 1988 / First entry. [ 1] Hassel J.R., Kimura J.H., Hascall V.C. Annu. Rev. Biochem. 55:539-567(1986). [ 2] Bourdon M.A., Krusius T., Campbell S., Schwarz N.B. Proc. Natl. Acad. Sci. U.S.A. 84:3194-3198(1987). 
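The additional rule for the glycosaminoglycan attachment site (at least two acidic residues at positions -2 to -4 relative to the serine) cannot be written into the S-G-x-G pattern itself. The following Python sketch shows one way the pattern and the rule might be applied together. The function name `gag_sites` is hypothetical, and acidic residues are taken to mean Asp and Glu.

```python
import re

ACIDIC = set("DE")  # assumption: "acidic" means Asp (D) or Glu (E)

def gag_sites(seq: str) -> list:
    """Return 0-based indices of serines that match S-G-x-G and have at
    least two acidic residues among positions -4, -3 and -2."""
    sites = []
    # A lookahead is used so that overlapping candidates are all visited.
    for m in re.finditer(r"(?=SG.G)", seq):
        s = m.start()
        # Residues at -4..-2; the slice is shortened near the N-terminus.
        upstream = seq[max(0, s - 4):max(0, s - 1)]
        if sum(res in ACIDIC for res in upstream) >= 2:
            sites.append(s)
    return sites

print(gag_sites("AEEASGAGA"))  # [4]
print(gag_sites("AAAASGAGA"))  # []
```

Since the consensus is based on only three core proteins, any hit from such a scan is best read as a candidate site.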
{END} {PDOC00003} {PS00003; SULFATION} {BEGIN} *************************** * Tyrosine sulfation site * *************************** The consensus features of a tyrosine sulfation site have been described in a number of reviews [1,2,3]: - The presence of an acidic (Glu or Asp) amino acid within two residues of

the tyrosine (typically at -1).
- The presence of at least three acidic residues from -5 to +5.
- No more than one basic residue and three hydrophobic residues from -5 to +5.
- Presence of turn-inducing amino acids: at least one Pro or Gly (the amino acids with the strongest turn potential) from -7 to -2 and from +1 to +7, or at least two or three Asp, Ser or Asn from -7 to +7.
- Absence of disulfide-bonded cysteine residues from -7 to +7.
- Absence of N-linked glycans near the tyrosine.

It must be noted that tyrosine sulfation is physiologically relevant only to proteins or domains passing through or located in the Golgi lumen.

-Last update: September 2000 / Text revised.

[ 1] Huttner W.B. Trends Biochem. Sci. 12:361-363(1987).
[ 2] Rosenquist G.L., Nicholas H.B. Jr. Protein Sci. 2:215-222(1993).
[ 3] Nicholas H.B. Jr., Chan S.S., Rosenquist G.L. Endocrine 11:285-292(1999).
{END}
{PDOC00004}
{PS00004; CAMP_PHOSPHO_SITE}
{BEGIN}
****************************************************************
* cAMP- and cGMP-dependent protein kinase phosphorylation site *
****************************************************************

There have been a number of studies on the specificity of cAMP- and cGMP-dependent protein kinases [1,2,3]. Both types of kinases appear to share a preference for the phosphorylation of serine or threonine residues found close to at least two consecutive N-terminal basic residues. It is important to note that there are quite a number of exceptions to this rule.

-Consensus pattern: [RK](2)-x-[ST]
                    [S or T is the phosphorylation site]
-Last update: June 1988 / First entry.

[ 1] Feramisco J.R., Glass D.B., Krebs E.G. J. Biol. Chem. 255:4240-4245(1980).
[ 2] Glass D.B., Smith S.B. J. Biol. Chem. 258:14797-14803(1983).
[ 3] Glass D.B., El-Maghrabi M.R., Pilkis S.J. J. Biol. Chem. 261:2987-2993(1986).
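Consensus patterns such as [RK](2)-x-[ST] follow the PROSITE pattern syntax: `x` is any residue, `[...]` lists allowed residues, `{...}` lists forbidden residues, `(n)` or `(n,m)` is a repeat count, and `<` or `>` anchors the pattern to the N- or C-terminus. The sketch below translates that subset into a Python regular expression. The helper names `prosite_to_regex` and `find_sites` are hypothetical, and the converter is a simplified illustration, not the official PROSITE scanning software.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern (subset of the syntax) into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_sites(pattern: str, seq: str) -> list:
    """Return (0-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]

print(prosite_to_regex("[RK](2)-x-[ST]"))        # [RK]{2}.[ST]
print(find_sites("[RK](2)-x-[ST]", "AARRASAA"))  # [(2, 'RRAS')]
```

Short patterns like this one match very often by chance, which fits the entry's remark that there are many exceptions to the rule.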
{END} {PDOC00005} {PS00005; PKC_PHOSPHO_SITE} {BEGIN} ***************************************** * Protein kinase C phosphorylation site * ***************************************** In vivo, protein kinase C exhibits a preference for the phosphorylation of serine or threonine residues found close to a C-terminal basic residue [1,2]. The presence of additional basic residues at the N- or C-terminal of the target amino acid enhances the Vmax and Km of the phosphorylation reaction. -Consensus pattern: [ST]-x-[RK] [S or T is the phosphorylation site]

-Last update: June 1988 / First entry. [ 1] Woodget J.R., Gould K.L., Hunter T. Eur. J. Biochem. 161:177-184(1986). [ 2] Kishimoto A., Nishiyama K., Nakanishi H., Uratsuji Y., Nomura H., Takeyama Y., Nishizuka Y. J. Biol. Chem. 260:12492-12499(1985). {END} {PDOC00006} {PS00006; CK2_PHOSPHO_SITE} {BEGIN} ***************************************** * Casein kinase II phosphorylation site * ***************************************** Casein kinase II (CK-2) is a protein serine/threonine kinase whose activity is independent of cyclic nucleotides and calcium. CK-2 phosphorylates many different proteins. The substrate specificity [1] of this enzyme can be summarized as follows: (1) Under comparable conditions Ser is favored over Thr. (2) An acidic residue (either Asp or Glu) must be present three residues from the C-terminal of the phosphate acceptor site. (3) Additional acidic residues in positions +1, +2, +4, and +5 increase the phosphorylation rate. Most physiological substrates have at least one acidic residue in these positions. (4) Asp is preferred to Glu as the provider of acidic determinants. (5) A basic residue at the N-terminal of the acceptor site decreases the phosphorylation rate, while an acidic one will increase it. -Consensus pattern: [ST]-x(2)-[DE] [S or T is the phosphorylation site] -Note: this pattern is found in most of the known physiological substrates. -Last update: May 1991 / Text revised. [ 1] Pinna L.A. Biochim. Biophys. Acta 1054:267-284(1990). {END} {PDOC00007} {PS00007; TYR_PHOSPHO_SITE} {BEGIN} **************************************** * Tyrosine kinase phosphorylation site * **************************************** Substrates of tyrosine protein kinases are generally characterized by a lysine or an arginine seven residues to the N-terminal side of the phosphorylated tyrosine. An acidic residue (Asp or Glu) is often found at either three or four residues to the N-terminal side of the tyrosine [1,2,3]. 
There are a number of exceptions to this rule such as the tyrosine phosphorylation sites of enolase and lipocortin II. -Consensus pattern: [RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y [Y is the phosphorylation site] -Last update: June 1988 / First entry. [ 1] Patschinsky T., Hunter T., Esch F.S., Cooper J.A., Sefton B.M.

Proc. Natl. Acad. Sci. U.S.A. 79:973-977(1982). [ 2] Hunter T. J. Biol. Chem. 257:4843-4848(1982). [ 3] Cooper J.A., Esch F.S., Taylor S.S., Hunter T. J. Biol. Chem. 259:7835-7841(1984). {END} {PDOC00008} {PS00008; MYRISTYL} {BEGIN} ************************* * N-myristoylation site * ************************* An appreciable number of eukaryotic proteins are acylated by the covalent addition of myristate (a C14-saturated fatty acid) to their N-terminal residue via an amide linkage [1,2]. The sequence specificity of the enzyme responsible for this modification, myristoyl CoA:protein N-myristoyl transferase (NMT), has been derived from the sequence of known N-myristoylated proteins and from studies using synthetic peptides. It seems to be the following: - The N-terminal residue must be glycine. - In position 2, uncharged residues are allowed. Charged residues, proline and large hydrophobic residues are not allowed. - In positions 3 and 4, most, if not all, residues are allowed. - In position 5, small uncharged residues are allowed (Ala, Ser, Thr, Cys, Asn and Gly). Serine is favored. - In position 6, proline is not allowed. -Consensus pattern: G-{EDRKHPFYW}-x(2)-[STAGCN]-{P} [G is the N-myristoylation site] -Note: we deliberately include as potential myristoylated glycine residues, those which are internal to a sequence. It could well be that the sequence under study represents a viral polyprotein precursor and that subsequent proteolytic processing could expose an internal glycine as the N-terminal of a mature protein. -Last update: October 1989 / Pattern and text revised. [ 1] Towler D.A., Gordon J.I., Adams S.P., Glaser L. Annu. Rev. Biochem. 57:69-99(1988). [ 2] Grand R.J.A. Biochem. J. 258:625-638(1989). 
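Because the N-myristoylation pattern deliberately reports internal glycines as well as N-terminal ones, a scan needs to return every match, including overlapping ones, and it can be useful to flag which hits sit at the start of the sequence. The Python sketch below does this with a hand-translated regular expression. The function name `myristoylation_candidates` and the example sequence are made up for illustration.

```python
import re

# G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}, wrapped in a lookahead so that
# overlapping matches are reported as well.
MYRISTYL = re.compile(r"(?=G[^EDRKHPFYW]..[STAGCN][^P])")

def myristoylation_candidates(seq: str) -> list:
    """Return (0-based index of the glycine, is_n_terminal) for each match."""
    return [(m.start(), m.start() == 0) for m in MYRISTYL.finditer(seq)]

# Illustrative sequence, not taken from a real protein.
print(myristoylation_candidates("GNAASAGGCAASL"))
# [(0, True), (6, False), (7, False)]
```

Internal hits such as those at indices 6 and 7 only become relevant if proteolytic processing exposes that glycine as a new N-terminus, as the entry's note explains.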
{END} {PDOC00009} {PS00009; AMIDATION} {BEGIN} ****************** * Amidation site * ****************** The precursor of hormones and other active peptides which are C-terminally amidated is always directly followed [1,2] by a glycine residue which provides the amide group, and most often by at least two consecutive basic residues (Arg or Lys) which generally function as an active peptide precursor cleavage site. Although all amino acids can be amidated, neutral hydrophobic residues such as Val or Phe are good substrates, while charged residues such as Asp or Arg are much less reactive. C-terminal amidation has not yet been shown to occur in unicellular organisms or in plants.

-Consensus pattern: x-G-[RK]-[RK] [x is the amidation site] -Last update: June 1988 / First entry. [ 1] Kreil G. Meth. Enzymol. 106:218-223(1984). [ 2] Bradbury A.F., Smyth D.G. Biosci. Rep. 7:907-916(1987). {END} {PDOC00010} {PS00010; ASX_HYDROXYL} {BEGIN} *************************************************** * Aspartic acid and asparagine hydroxylation site * *************************************************** Post-translational hydroxylation of aspartic acid or asparagine [1] to form erythro-beta-hydroxyaspartic acid or erythro-beta-hydroxyasparagine has been identified in a number of proteins with domains homologous to epidermal growth factor (EGF). Examples of such proteins are the blood coagulation protein factors VII, IX and X, proteins C, S, and Z, the LDL receptor, thrombomodulin, etc. Based on sequence comparisons of the EGF-homology region that contains hydroxylated Asp or Asn, a consensus sequence has been identified that seems to be required by the hydroxylase(s). -Consensus pattern: C-x-[DN]-x(4)-[FY]-x-C-x-C [D or N is the hydroxylation site] -Note: this consensus pattern is located in the N-terminal of EGF-like domains, while our EGF-like cysteine pattern signature (see the relevant entry ) is located in the C-terminal. -Last update: January 1989 / First entry. [ 1] Stenflo J., Ohlin A.-K., Owens W.G., Schneider W.J. J. Biol. Chem. 263:21-24(1988). {END} {PDOC00011} {PS00011; GLU_CARBOXYLATION} {BEGIN} ******************************************** * Vitamin K-dependent carboxylation domain * ******************************************** Vitamin K-dependent carboxylation [1,2] is the post-translational modification of glutamic residues to form gamma-carboxyglutamate (Gla). Proteins known to contain Gla are listed below. - A number of plasma proteins involved in blood coagulation. These proteins are prothrombin, coagulation factors VII, IX and X, proteins C, S, and Z. 
- Two proteins that occur in calcified tissues: osteocalcin (also known as bone-Gla protein, BGP), and matrix Gla-protein (MGP). - Cone snail venom peptides: conantokin-G and -T, and conotoxin GS [3]. With the exception of the snail toxins, all these proteins contain an N-terminal module of about forty amino acids where the majority of the Glu residues are carboxylated. This domain is responsible for the high-affinity binding of calcium ions. The Gla-domain starts at the N-terminal extremity of the mature form of these proteins and ends with a conserved aromatic residue; a conserved Gla-x(3)-Gla-x-Cys motif [4] is found in the middle of the domain

which seems to be important for substrate recognition by the carboxylase. -Consensus pattern: x(12)-E-x(3)-E-x-C-x(6)-[DEN]-x-[LIVMFY]-x(9)-[FYW] -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: 7. -Note: all glutamic residues present in the domain are potential carboxylation sites; in coagulation proteins, all are modified to Gla, while in BGP and MGP some are not. -Expert(s) to contact by email: Price P.A.; [email protected] -Last update: December 1992 / Text revised. [ 1] Friedman P.A., Przysiecki C.T. Int. J. Biochem. 19:1-7(1987). [ 2] Vermeer C. Biochem. J. 266:625-636(1990). [ 3] Haack J.A., Rivier J.E., Parks T.N., Mena E.E., Cruz L.J., Olivera B.M. J. Biol. Chem. 265:6025-6029(1990). [ 4] Price P.A., Fraser J.D., Metz-Virca G. Proc. Natl. Acad. Sci. U.S.A. 84:8335-8339(1987). {END} {PDOC00012} {PS00012; PHOSPHOPANTETHEINE} {PS50075; ACP_DOMAIN} {BEGIN} ************************************** * Phosphopantetheine attachment site * ************************************** Phosphopantetheine (or pantetheine 4' phosphate) is the prosthetic group of acyl carrier proteins (ACP) in some multienzyme complexes where it serves as a 'swinging arm' for the attachment of activated fatty acid and amino-acid groups [1]. Phosphopantetheine is attached to a serine residue in these proteins [2]. ACP proteins or domains have been found in various enzyme systems which are listed below (references are only provided for recently determined sequences). - Fatty acid synthetase (FAS), which catalyzes the formation of long-chain fatty acids from acetyl-CoA, malonyl-CoA and NADPH. Bacterial and plant chloroplast FAS are composed of eight separate subunits which correspond to the different enzymatic activities; ACP is one of these polypeptides. Fungal FAS consists of two multifunctional proteins, FAS1 and FAS2; the ACP domain is located in the N-terminal section of FAS2. 
Vertebrate FAS consists of a single multifunctional enzyme; the ACP domain is located between the beta-ketoacyl reductase domain and the C-terminal thioesterase domain [3].
- Polyketide antibiotics synthase enzyme systems. Polyketides are secondary metabolites produced from simple fatty acids by microorganisms and plants. ACP is one of the polypeptidic components involved in the biosynthesis of the Streptomyces polyketide antibiotics actinorhodin, curamycin, granaticin, monensin, oxytetracycline and tetracenomycin C.
- Bacillus subtilis putative polyketide synthases pksK, pksL and pksM which respectively contain three, five and one ACP domains.
- The multifunctional 6-methylsalicylic acid synthase (MSAS) from Penicillium patulum. This is a multifunctional enzyme involved in the biosynthesis of a polyketide antibiotic and which contains an ACP domain in the C-terminal extremity.

- Multifunctional mycocerosic acid synthase (gene mas) from Mycobacterium bovis.
- Gramicidin S synthetase I (gene grsA) from Bacillus brevis. This enzyme catalyzes the first step in the biosynthesis of the cyclic antibiotic gramicidin S.
- Tyrocidine synthetase I (gene tycA) from Bacillus brevis. The reaction carried out by tycA is identical to that catalyzed by grsA.
- Gramicidin S synthetase II (gene grsB) from Bacillus brevis. This enzyme is a multifunctional protein that activates and polymerizes proline, valine, ornithine and leucine. GrsB contains four ACP domains.
- Erythronolide synthase proteins 1, 2 and 3 from Saccharopolyspora erythraea which are involved in the biosynthesis of the polyketide antibiotic erythromycin. Each of these proteins contains two ACP domains.
- Conidial green pigment synthase from Aspergillus nidulans.
- ACV synthetase from various fungi. This enzyme catalyzes the first step in the biosynthesis of penicillin and cephalosporin. It contains three ACP domains.
- Enterobactin synthetase component F (gene entF) from Escherichia coli. This enzyme is involved in the ATP-dependent activation of serine during enterobactin (enterochelin) biosynthesis.
- Cyclic peptide antibiotic surfactin synthase subunits 1, 2 and 3 from Bacillus subtilis. Subunits 1 and 2 contain three related domains while subunit 3 only contains a single domain.
- HC-toxin synthetase (gene HTS1) from Cochliobolus carbonum. This enzyme synthesizes HC-toxin, a cyclic tetrapeptide. HTS1 contains four ACP domains.
- Fungal mitochondrial ACP [9], which is part of the respiratory chain NADH dehydrogenase (complex I).
- Rhizobium nodulation protein nodF, which probably acts as an ACP in the synthesis of the nodulation Nod factor fatty acyl chain.

The sequence around the phosphopantetheine attachment site is conserved in all these proteins and can be used as a signature pattern. A profile was also developed that spans the complete ACP-like domain.
-Consensus pattern: [DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-[LIVGSTACR]-x(2)-[LIVMFA]
                    [S is the pantetheine attachment site]
-Sequences known to belong to this class detected by the pattern: ALL, except C.paradoxa ACP.
-Other sequence(s) detected in SWISS-PROT: 86.
-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so.
-Last update: November 1997 / Pattern and text revised; profile added.

[ 1] Concise Encyclopedia Biochemistry, Second Edition, Walter de Gruyter, Berlin New-York (1988).
[ 2] Pugh E.L., Wakil S.J. J. Biol. Chem. 240:4727-4733(1965).
[ 3] Witkowski A., Rangan V.S., Randhawa Z.I., Amy C.M., Smith S. Eur. J. Biochem. 198:571-579(1991).
[ 6] Scotti C., Piatti M., Cuzzoni A., Perani P., Tognoni A., Grandi G., Galizzi A., Albertini A.M.

Gene 130:65-71(1993).
[ 9] Sackmann U., Zensen R., Rohlen D., Jahnke U., Weiss H. Eur. J. Biochem. 200:463-469(1991).
{END}
{PDOC00013}
{PS00013; PROKAR_LIPOPROTEIN}
{BEGIN}
**********************************************************
* Prokaryotic membrane lipoprotein lipid attachment site *
**********************************************************

In prokaryotes, membrane lipoproteins are synthesized with a precursor signal peptide, which is cleaved by a specific lipoprotein signal peptidase (signal peptidase II). The peptidase recognizes a conserved sequence and cuts upstream of a cysteine residue to which a glyceride-fatty acid lipid is attached [1].

Some of the proteins known to undergo such processing currently include (for recent listings see [1,2,3]):

- Major outer membrane lipoprotein (murein-lipoproteins) (gene lpp).
- Escherichia coli lipoprotein-28 (gene nlpA).
- Escherichia coli lipoprotein-34 (gene nlpB).
- Escherichia coli lipoprotein nlpC.
- Escherichia coli lipoprotein nlpD.
- Escherichia coli osmotically inducible lipoprotein B (gene osmB).
- Escherichia coli osmotically inducible lipoprotein E (gene osmE).
- Escherichia coli peptidoglycan-associated lipoprotein (gene pal).
- Escherichia coli rare lipoproteins A and B (genes rlpA and rlpB).
- Escherichia coli copper homeostasis protein cutF (or nlpE).
- Escherichia coli plasmids traT proteins.
- Escherichia coli Col plasmids lysis proteins.
- A number of Bacillus beta-lactamases.
- Bacillus subtilis periplasmic oligopeptide-binding protein (gene oppA).
- Borrelia burgdorferi outer surface proteins A and B (genes ospA and ospB).
- Borrelia hermsii variable major protein 21 (gene vmp21) and 7 (gene vmp7).
- Chlamydia trachomatis outer membrane protein 3 (gene omp3).
- Fibrobacter succinogenes endoglucanase cel-3.
- Haemophilus influenzae proteins Pal and Pcp.
- Klebsiella pullulanase (gene pulA).
- Klebsiella pullulanase secretion protein pulS.
- Mycoplasma hyorhinis protein p37.
- Mycoplasma hyorhinis variant surface antigens A, B, and C (genes vlpABC).
- Neisseria outer membrane protein H.8.
- Pseudomonas aeruginosa lipopeptide (gene lppL).
- Pseudomonas solanacearum endoglucanase egl.
- Rhodopseudomonas viridis reaction center cytochrome subunit (gene cytC).
- Rickettsia 17 Kd antigen.
- Shigella flexneri invasion plasmid proteins mxiJ and mxiM.
- Streptococcus pneumoniae oligopeptide transport protein A (gene amiA).
- Treponema pallidum 34 Kd antigen.
- Treponema pallidum membrane protein A (gene tmpA).
- Vibrio harveyi chitobiase (gene chb).
- Yersinia virulence plasmid protein yscJ.
- Halocyanin from Natronobacterium pharaonis [4], a membrane-associated copper-binding protein. This is the first archaebacterial protein known to be modified in such a fashion.

From the precursor sequences of all these proteins, we derived a consensus pattern and a set of rules to identify this type of post-translational modification.
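The lipoprotein pattern for this entry, {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C, comes with two positional rules: the cysteine must lie between positions 15 and 35, and the first seven positions must contain at least one Lys or Arg. A rough Python sketch of how the pattern and both rules might be combined is given here. The function name `lipid_attachment_sites` is hypothetical, and the example sequences are illustrative only.

```python
import re

# Hand-translated pattern; the lookahead keeps overlapping matches.
LIPOBOX = re.compile(r"(?=[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C)")
PATTERN_LENGTH = 11  # 6 + 2 + 1 + 1 + 1 residues, cysteine last

def lipid_attachment_sites(seq: str) -> list:
    """Return 1-based positions of cysteines that satisfy the pattern
    and both additional rules."""
    # Rule 2: at least one Lys or Arg in the first seven positions.
    if not re.search(r"[KR]", seq[:7]):
        return []
    sites = []
    for m in LIPOBOX.finditer(seq):
        cys_pos = m.start() + PATTERN_LENGTH  # 1-based position of the C
        # Rule 1: the cysteine must be between positions 15 and 35.
        if 15 <= cys_pos <= 35:
            sites.append(cys_pos)
    return sites

# Illustrative precursor-like sequence.
print(lipid_attachment_sites("MKATKLVLGAVILGSTLLAGCSSNAKIDQ"))  # [21]
```

Positions are reported 1-based to match the way the rules are stated in the entry.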

-Consensus pattern: {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C [C is the lipid attachment site] Additional rules: 1) The cysteine must be between positions 15 and 35 of the sequence in consideration. 2) There must be at least one Lys or one Arg in the first seven positions of the sequence. -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: some 100 prokaryotic proteins. Some of them are not membrane lipoproteins, but at least half of them could be. -Last update: November 1995 / Pattern and text revised. [ 1] Hayashi S., Wu H.C. J. Bioenerg. Biomembr. 22:451-471(1990). [ 2] Klein P., Somorjai R.L., Lau P.C.K. Protein Eng. 2:15-20(1988). [ 3] von Heijne G. Protein Eng. 2:531-534(1989). [ 4] Mattar S., Scharf B., Kent S.B.H., Rodewald K., Oesterhelt D., Engelhard M. J. Biol. Chem. 269:14939-14945(1994). {END} {PDOC00342} {PS00409; PROKAR_NTER_METHYL} {BEGIN} ******************************************* * Prokaryotic N-terminal methylation site * ******************************************* A number of bacteria express filamentous adhesins known as pili. The pili are polar flexible filaments of about 5.4 nm diameter and 2500 nm average length; they consist of a single polypeptide chain (called pilin or fimbrial protein) arranged in a helical configuration of five subunits per turn in the assembled pilus. Gram-negative bacteria produce pilin which are characterized by the presence of a very short leader peptide of 6 to 7 residues, followed by a methylated N-terminal phenylalanine residue and by a highly conserved sequence of about 24 hydrophobic residues. This class of pilin is often referred to as NMePhe or type-4 pili [1,2]. 
Recently a number of bacterial proteins have been sequenced which share the following structural characteristics with type-4 pili [3]:

a) The N-terminal residue, which is methylated, is hydrophobic (generally a phenylalanine or a methionine);
b) The leader peptide is hydrophilic, consists of 5 to 10 residues (with two exceptions, see below) and ends with a glycine;
c) The fifth residue of the mature sequence is a glutamate which seems to be required for the methylation step;
d) The first twenty residues of the mature sequence are highly hydrophobic.

These proteins are listed below:

- Four proteins in an operon involved in a general secretion pathway (GSP) for the export of proteins (also called the type II pathway) [4]. These proteins have been assigned a different gene name in each of the species where they have been sequenced:

   Species                   Gene names
   ------------------------  -------------------------
   Aeromonas hydrophila      exeG   exeH   exeI   exeJ
   Erwinia chrysanthemi      outG   outH   outI   outJ
   Escherichia coli          hofG   hofH   yheH   yheI
   Klebsiella pneumoniae     pulG   pulH   pulI   pulJ
   Pseudomonas aeruginosa    xcpT   xcpU   xcpV   xcpW
   Vibrio cholerae           epsG   epsH   epsI   epsJ
   Xanthomonas campestris    xpsG   xpsH   xpsI   xpsJ

- Vibrio cholerae toxin co-regulated pilin (gene tcpA). This pilin has a much longer putative leader peptide (25 residues).
- Bacillus subtilis comG competence operon proteins 3, 4, and 5 which are involved in the uptake of DNA by competent Bacillus subtilis cells.
- ppdA, ppdB and ppdC, three Escherichia coli hypothetical proteins found in the thyA-recC intergenic region.
- ppdA, a hypothetical protein near the groeLS operon of Clostridium perfringens. The putative leader peptide is 23 residues long.

We developed a signature pattern based on the N-terminal conserved region of all these proteins.

-Consensus pattern: [KRHEQSTAG]-G-[FYLIVM]-[ST]-[LT]-[LIVP]-E-[LIVMFWSTAG](14)
                    [The residue after the G is methylated]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: November 1995 / Text revised.

[ 1] Paranchych W., Frost L.S. Adv. Microb. Physiol. 29:53-114(1988).
[ 2] Dalrymple B., Mattick J.S. J. Mol. Evol. 25:261-269(1987).
[ 3] Hobbs M., Mattick J.S. Mol. Microbiol. 10:233-243(1993).
[ 4] Salmond G.P.C., Reeves P.J. Trends Biochem. Sci. 18:7-12(1993).
{END}
{PDOC00266}
{PS00294; PRENYLATION}
{BEGIN}
****************************************
* Prenyl group binding site (CAAX box) *
****************************************

A number of eukaryotic proteins are post-translationally modified by the attachment of either a farnesyl or a geranyl-geranyl group to a cysteine residue [1,2,3,4]. The modification occurs on cysteine residues that are three residues away from the C-terminal extremity; the two residues that separate this cysteine from the C-terminal residue are generally aliphatic. This Cys-Ali-Ali-X pattern is generally known as the CAAX box. Proteins known or strongly presumed to be the target of this modification are listed below.

- Ras proteins, and ras-like proteins such as Rho, Rab, Rac, Ral, and Rap.
- Nuclear lamins A and B.
- Some G protein alpha subunits.
- G protein gamma subunits (see ).
- 2',3'-cyclic nucleotide 3'-phosphodiesterase (EC 3.1.4.37).
- Rhodopsin-sensitive cGMP 3',5'-cyclic phosphodiesterase alpha and beta chains (EC 3.1.4.17).
- Rhodopsin kinase (EC 2.7.1.125).
- Some dnaJ-like proteins (such as yeast MAS5/YDJ1).
- A number of fungal mating factors (such as M-factor or rhodotorucine A).

-Consensus pattern: C-{DENQ}-[LIVM]-x> [C is the prenylation site] -Last update: November 1997 / Text revised. [ 1] Glomset J.A., Gelb M.H., Farnsworth C.C. Trends Biochem. Sci. 15:139-142(1990). [ 2] Lowy D.R., Willumsen B.M. Nature 341:384-385(1989). [ 3] Imagee A.I. Biochem. Soc. Trans. 17:875-876(1989). [ 4] Powers S. Curr. Biol. 1:114-116(1991). {END} {PDOC00687} {PS00881; PROTEIN_SPLICING} {BEGIN} ****************************** * Protein splicing signature * ****************************** Protein splicing [1,2,3,4,E1] is a mechanism by which an internal segment (called intein [5] or spacer) in a protein precursor is excised and the flanking regions (called exteins [5]) are religated to create a functional protein. Currently, such a mechanism has been found in the following proteins: - Vacuolar ATP synthase catalytic subunit A (gene VMA1 or TFP1) from budding yeast and from Candida tropicalis - recA protein from Mycobacterium tuberculosis and leprae. - DNA polymerase from the archaebacteria Thermococcus litoralis and Pyrococcus strains GB-D and KOD1. - clpP protease from the chloroplast of Chlamydomonas eugametos. - dnaB protein from the chloroplast of Porphyra purpurea. - gyrA (DNA gyrase subunit A) from various Mycobacterial species. - Methanococcus jannaschii TFIIB. - Methanococcus jannaschii hypothetical protein MJ0682. In most of these cases the intein seems to be an endonuclease. It has been proposed that the splicing initiates at the C-terminal splice junction. The delta-nitrogen group of a conserved asparagine residue makes a nucleophilic attack on the peptide bond that links this asparagine to the next residue. The next residue (a Cys, Ser or Thr) is then free to attack the peptide bond at the N-terminal splice junction by a transpeptidation reaction that releases the intein and creates a new peptide bond. Such a mechanism is briefly schematized in the following figures. 
 1) Primary translation product

        +---------------+  +-------------+  +--------------+
   NH2-     Extein 1    x--y   Intein    N--z   Extein 2     -COOH
        +---------------+  +-------------+  +--------------+

 2) Breakage of the peptide bond at the C-terminal splice junction by nucleophilic attack of the asparagine.

        +---------------+  +-------------+      +--------------+
   NH2-     Extein 1    x--y   Intein    N  NH2-z   Extein 2     -COOH
        +---------------+  +-------------+      +--------------+

 3) Transpeptidation to produce the final products.

        +---------------+  +--------------+             +-------------+
   NH2-     Extein 1    x--z   Extein 2     -COOH   NH2-y   Intein    N
        +---------------+  +--------------+             +-------------+

In the proteins known to undergo protein splicing, the residues close to the asparagine involved in the nucleophilic attack are conserved and can be used as a signature pattern. We are aware that such a signature is probably going to evolve as soon as new examples are discovered; nevertheless, we believe that it can be useful.

-Consensus pattern: [DNEG]-x-[LIVFA]-[LIVMY]-[LVAST]-H-N-[STC]
-Sequences known to belong to this class detected by the pattern: ALL, except for clpP from Chlamydomonas eugametos.
-Other sequence(s) detected in SWISS-PROT: 16.
-Last update: July 1999 / Text revised.

[ 1] Shub D.A., Goodrich-Blair H. Cell 71:183-186(1992).
[ 2] Cooper A.A., Chen Y.-J., Lindorfer M.A., Stevens T.H. EMBO J. 12:2575-2583(1993).
[ 3] Cooper A.A., Stevens T.H. BioEssays 15:667-674(1993).
[ 4] Hickey D.A. Trends Genet. 10:147-149(1994).
[ 5] Perler F.B., Davis E.O., Dean G.E., Gimble F.S., Jack W.E., Neff N., Noren C.J., Thorner J., Belfort M. Nucleic Acids Res. 22:1125-1127(1994).
[E1] http://www.neb.com/neb/inteins/intein_intro.html
{END}
{PDOC00014}
{PS00014; ER_TARGET}
{BEGIN}
********************************************
* Endoplasmic reticulum targeting sequence *
********************************************

Proteins that permanently reside in the lumen of the endoplasmic reticulum (ER) seem to be distinguished from newly synthesized secretory proteins by the presence of the C-terminal sequence Lys-Asp-Glu-Leu (KDEL) [1,2]. While KDEL is the preferred signal in many species, variants of that signal are used by different species. This situation is described in the following table.
Signal  Species
----------------------------------------------------------------
KDEL    Vertebrates, Drosophila, Caenorhabditis elegans, plants
HDEL    Saccharomyces cerevisiae, Kluyveromyces lactis, plants
DDEL    Kluyveromyces lactis
ADEL    Schizosaccharomyces pombe (fission yeast)
SDEL    Plasmodium falciparum

The signal is usually very strictly conserved in major ER proteins but some
minor ER proteins have divergent sequences (probably because efficient
retention of these proteins is not crucial to the cell).

Proteins bearing the KDEL-type signal are not simply held in the ER, but are
selectively retrieved from a post-ER compartment by a receptor and returned
to their normal location.

The currently known ER luminal proteins are listed below.

- Protein disulphide-isomerase (PDI) (also known as the beta-subunit of prolyl 4-hydroxylase, as a component of oligosaccharyl transferase, as glutathione-insulin transhydrogenase and as a thyroid hormone binding protein). - ERp60, ERp72, and P5, three minor isoforms of PDI. - Trypanosoma brucei bloodstream-specific protein 2, a probable PDI. - hsp70 related protein GRP78 (also known as the immunoglobulin heavy chain binding protein (BiP), and as KAR2, in fungi). - hsp90 related protein 'endoplasmin' (also known as GRP94, Erp99 or Hsp108). - Calreticulin, a calcium-binding protein (also known as calregulin, CRP55, or HACBP). - ERC-55, a calcium-binding protein. - Reticulocalbin, a calcium-binding protein. - Hsp47, a heat-shock protein that binds strongly to collagen and could act as a chaperone in the collagen biosynthetic pathway. - A receptor for a plant hormone, auxin. - Thiol proteases from rice bean (SH-EP) and kidney bean (EP-C1). - Esterases from mammalian liver and from nematodes. - Alpha-2-macroglobulin receptor-associated protein (RAP). - Yeast peptidyl-prolyl cis-trans isomerase D (CYPD). - Yeast protein KRE5, a protein required for (1->6)-beta-D-glucan synthesis. - Yeast protein SEC20, required for the transport of proteins from the endoplasmic reticulum to the Golgi apparatus. - Yeast protein SCJ1, involved in protein sorting. -Consensus pattern: [KRHQSA]-[DENQ]-E-L> -Sequences known to belong to this class detected by the pattern: ALL, except for liver esterases which have H-[TVI]-E-L. -Other sequence(s) detected in SWISS-PROT: 24 proteins which are clearly not located in the ER (because they are of bacterial or viral origin, for example) and a protein which can be considered as valid candidate: human 80KH protein. -Last update: November 1997 / Text revised. [ 1] Munro S., Pelham H.R.B. Cell 48:899-907(1987). [ 2] Pelham H.R.B. Trends Biochem. Sci. 15:483-486(1990). 
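As an illustration of how the C-terminal anchor '>' in the consensus pattern above restricts matches, the following minimal sketch translates a PROSITE pattern into a Python regular expression. The names prosite_to_regex and matches_er_signal, and the test sequences, are hypothetical and not part of PROSITE. Only the syntax elements that appear in this document are handled (x, [..], {..}, repetition counts, '<' and '>').

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repetition count such as (4) or (6,8).
ELEMENT = re.compile(
    r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex string."""
    body = pattern.strip()
    at_start = body.startswith("<")   # '<' anchors to the N-terminus
    at_end = body.endswith(">")       # '>' anchors to the C-terminus
    body = body.lstrip("<").rstrip(">")
    out = []
    for element in body.split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, lo, hi = m.groups()
        if residue == "x":
            token = "."                           # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"    # forbidden residues
        else:
            token = residue                       # 'E' or '[DENQ]'
        if lo is not None:
            token += "{" + lo + ("," + hi if hi else "") + "}"
        out.append(token)
    return ("^" if at_start else "") + "".join(out) + ("$" if at_end else "")


ER_REGEX = re.compile(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>"))


def matches_er_signal(sequence):
    """True if the sequence ends with a KDEL-type signal (illustrative)."""
    return ER_REGEX.search(sequence) is not None
```

With this translation a sequence ending in KDEL or HDEL is reported, whereas an internal KDEL or the H-[TVI]-E-L terminus of liver esterases is not, in line with the exception noted above.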
{END}
{PDOC00299}
{PS00342; MICROBODIES_CTER}
{BEGIN}
*******************************************
* Microbodies C-terminal targeting signal *
*******************************************

Microbodies are a class of small, single-membraned organelles that include
peroxisomes, glyoxysomes, and glycosomes. Microbody proteins are synthesized
on free polysomes and imported into the organelle post-translationally.
Unlike the import of proteins into mitochondria, chloroplasts or the
ER/secretion pathway, import into microbodies does not generally require the
removal of a presequence [1].

It has been experimentally shown [2,3,4] that, in some peroxisomal proteins,
the targeting signal (PTS) resides in the last three amino acids of the
C-terminus. This consensus sequence is known as 'S-K-L' (Ser-Lys-Leu),
although some variations are allowed in all three positions. As the
peroxisomal targeting signal also seems to be recognized by other
microbodies, it is now [1] known as the C-terminal microbody targeting
signal (CMTS).

It must be noted that not all microbody proteins contain a CMTS; some seem
to contain an internal CMTS-like sequence, but it is not yet known if it is
active as such. Finally, a few proteins are synthesized with an N-terminal
presequence which is cleaved off during import.

Microbody proteins known or thought to contain a CMTS are listed below.

 - Mammalian D-amino acid oxidase.
 - Mammalian acyl-coenzyme A oxidase (but not the fungal enzymes).
 - Mammalian and yeast (S. cerevisiae) carnitine o-acetyltransferase.
 - Mammalian trifunctional fatty acid beta oxidation pathway enzyme.
 - Mammalian, insect, plant, and Aspergillus uricase.
 - Mammalian sterol carrier protein-2 high molecular form (SCP-X).
 - Mammalian long chain alpha-hydroxy acid oxidase.
 - Mammalian soluble epoxide hydrolase (sEH).
 - Firefly luciferase.
 - Plant glycolate oxidase.
 - Plant glyoxysomal isocitrate lyase.
 - Plant and fungal glyoxysomal malate synthase.
 - Trypanosoma glycosomal glucose-6-phosphate isomerase.
 - Trypanosoma glycosomal glyceraldehyde 3-phosphate dehydrogenase.
 - Yeast (H. polymorpha and Pichia pastoris) alcohol oxidase (AOX).
 - Yeast (H. polymorpha) dihydroxy-acetone synthase (DHAS).
 - Yeast (S. cerevisiae) catalase A.
 - Yeast (S. cerevisiae) citrate synthase.
 - Yeast (S. cerevisiae) peroxisomal malate dehydrogenase.
 - Yeast (C. boidinii) peroxisomal protein PMP20.
 - Yeast (C. tropicalis) hydratase-dehydrogenase-epimerase (HDE) from fatty
   acid beta oxidation pathway.
 - Yeast (C. tropicalis) isocitrate lyase.
 - Aspergillus niger monoamine oxidase N.
 - Candida albicans vacuolar aspartic protease PRA1.

-Consensus pattern: [STAGCN]-[RKH]-[LIVMAFY]>
-Last update: November 1997 / Pattern and text revised.

[ 1] De Hoop M.J., Ab G.
     Biochem. J. 286:657-669(1992).
[ 2] Gould S.J., Keller G.-A., Subramani S.
     J. Cell Biol. 107:897-905(1988).
[ 3] Gould S.J., Keller G.-A., Hosken N., Wilkinson J., Subramani S.
     J. Cell Biol. 108:1657-1664(1989).
[ 4] Gould S.J., Keller G.-A., Schneider M., Howell S.H., Garrard L.J.,
     Goodman J.M., Distel B., Tabak H., Subramani S.
     EMBO J. 9:85-90(1990).
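Because the CMTS pattern covers only the last three residues, it can be checked without any pattern-matching machinery. The following minimal sketch tests each of the three C-terminal positions against the residue sets of the consensus pattern above. The function name has_cmts and the example sequences are hypothetical. A positive result only indicates a candidate signal, since the entry itself notes that variations exist and that not all microbody proteins carry a CMTS.

```python
# Allowed residues at the last three positions, copied from the
# consensus pattern [STAGCN]-[RKH]-[LIVMAFY]>
CMTS_POSITIONS = ("STAGCN", "RKH", "LIVMAFY")


def has_cmts(sequence):
    """Return True if the last three residues fit the CMTS pattern."""
    tail = sequence.upper()[-3:]
    if len(tail) < 3:
        return False  # too short to carry a tripeptide signal
    return all(res in allowed for res, allowed in zip(tail, CMTS_POSITIONS))


# Hypothetical sequences, for illustration only.
print(has_cmts("MAAAGGSKL"))   # canonical S-K-L at the C-terminus
print(has_cmts("MSKLAAAGG"))   # S-K-L present, but not at the C-terminus
```

The first call prints True and the second prints False, which shows that an internal S-K-L is not treated as a CMTS by this check.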
{END}
{PDOC00373}
{PS00343; GRAM_POS_ANCHORING}
{BEGIN}
****************************************************************
* Gram-positive cocci surface proteins 'anchoring' hexapeptide *
****************************************************************

Surface proteins from Gram-positive cocci contain a conserved hexapeptide
located a few residues upstream of a hydrophobic C-terminal membrane anchor
region which is followed by a cluster of basic amino acids [1]. This
structure is represented in the following schematic representation:

+--------------------------------------------+-+--------+-+
| Variable length extracellular domain       |H| Anchor |B|
+--------------------------------------------+-+--------+-+

'H': conserved hexapeptide.
'B': cluster of basic residues.

It has been proposed that this hexapeptide sequence is responsible for a
post-translational modification necessary for the proper anchoring of the
proteins which bear it to the cell wall.

Proteins known to contain such a hexapeptide are listed below:

 - Aggregation substance from Streptococcus faecalis (asa1).
 - C5a peptidase from Streptococcus pyogenes (scpA).
 - C protein alpha-antigen from Streptococcus agalactiae (bca).
 - Cell surface antigen I/II (PAC) from Streptococcus mutans.
 - Dextranase from Streptococcus downei (dex).
 - Fibronectin-binding protein from Staphylococcus aureus (fnbA).
 - Fimbrial subunits from Actinomyces naeslundii and viscosus.
 - IgA binding protein from Streptococcus pyogenes (arp4).
 - IgA binding protein (B antigen) from Streptococcus agalactiae (bag).
 - IgG binding proteins from Streptococci and Staphylococcus aureus.
 - Internalin A from Listeria monocytogenes (inlA).
 - M proteins from streptococci.
 - Muramidase-released protein from Streptococcus suis (mrp).
 - Nisin leader peptide processing protease from Lactococcus lactis (nisP).
 - Protein A from Staphylococcus aureus.
 - Trypsin-resistant surface T protein from streptococci.
 - Wall-associated protein from Streptococcus mutans (wapA).
 - Wall-associated serine proteinases from Lactococcus lactis.

-Consensus pattern: L-P-x-T-G-[STGAVDE] -Sequences known to belong to this class detected by the pattern: ALL, except for C5a peptidase which has L-P-T-T-N-D. -Other sequence(s) detected in SWISS-PROT: 200 proteins from Gram negative bacteria, archaebacteria, eukaryotes, or viruses. -Last update: November 1995 / Text revised. [ 1] Schneewind O., Jones K.F., Fischetti V.A. J. Bacteriol. 172:3310-3317(1990). {END} {PDOC00015} {PS00015; NUCLEAR} {BEGIN} **************************************** * Bipartite nuclear targeting sequence * **************************************** The uptake of protein by the nucleus is extremely selective and nuclear proteins must therefore contain within their final structure a signal that specifies selective accumulation in the nucleus [1,2]. Studies on some nuclear proteins, such as the large T antigen of SV40, have indicated which part of the sequence is required for nuclear translocation. The known nuclear targeting sequences are generally basic, but there seems to be no clear common denominator between all the known sequences. Although some consensus sequence patterns have been proposed (see for example [3]), the current best strategy to detect a nuclear targeting sequence is based [4] on the following definition of what is called a 'bipartite nuclear targeting sequence': (1) Two adjacent basic amino acids (Arg or Lys). (2) A spacer region of any 10 residues. (3) At least three basic residues (Arg or Lys) in the five positions

after the spacer region. -Sequences known to belong to this class detected by the pattern: 56% of known nuclear proteins according to [4]. -Other sequence(s) detected in SWISS-PROT: about 4.2% of non-nuclear proteins according to [4]. -Last update: October 1993 / Text revised. [ 1] Dingwall C., Laskey R.A. Annu. Rev. Cell Biol. 2:367-390(1986). [ 2] Garcia-Bustos J.F., Heitman J., Hall M.N. Biochim. Biophys. Acta 1071:83-101(1991). [ 3] Gomez-Marquez J., Segade F. FEBS Lett. 226:217-219(1988). [ 4] Dingwall C., Laskey R.A. Trends Biochem. Sci. 16:478-481(1991). {END} {PDOC00016} {PS00016; RGD} {BEGIN} **************************** * Cell attachment sequence * **************************** The sequence Arg-Gly-Asp, found in fibronectin, is crucial for its interaction with its cell surface receptor, an integrin [1,2]. What has been called the 'RGD' tripeptide is also found in the sequences of a number of other proteins, where it has been shown to play a role in cell adhesion. These proteins are: some forms of collagens, fibrinogen, vitronectin, von Willebrand factor (VWF), snake disintegrins, and slime mold discoidins. The 'RGD' tripeptide is also found in other proteins where it may also, but not always, serve the same purpose. -Consensus pattern: R-G-D -Last update: December 1991 / Text revised. [ 1] Ruoslahti E., Pierschbacher M.D. Cell 44:517-518(1986). [ 2] d'Souza S.E., Ginsberg M.H., Plow E.F. Trends Biochem. Sci. 16:246-250(1991). {END} {PDOC00017} {PS00017; ATP_GTP_A} {BEGIN} ***************************************** * ATP/GTP-binding site motif A (P-loop) * ***************************************** From sequence comparisons and crystallographic data analysis it has been shown [1,2,3,4,5,6] that an appreciable proportion of proteins that bind ATP or GTP share a number of more or less conserved sequence motifs. The best conserved of these motifs is a glycine-rich region, which typically forms a flexible loop between a beta-strand and an alpha-helix. 
This loop interacts with one of the phosphate groups of the nucleotide. This sequence motif is generally referred to as the 'A' consensus sequence [1] or the 'P-loop' [5]. There are numerous ATP- or GTP-binding proteins in which the P-loop is found. We list below a number of protein families for which the relevance of the presence of such motif has been noted:

 - ATP synthase alpha and beta subunits (see ).
 - Myosin heavy chains.
 - Kinesin heavy chains and kinesin-like proteins (see ).
 - Dynamins and dynamin-like proteins (see ).
 - Guanylate kinase (see ).
 - Thymidine kinase (see ).
 - Thymidylate kinase (see ).
 - Shikimate kinase (see ).
 - Nitrogenase iron protein family (nifH/frxC) (see ).
 - ATP-binding proteins involved in 'active transport' (ABC transporters)
   [7] (see ).
 - DNA and RNA helicases [8,9,10].
 - GTP-binding elongation factors (EF-Tu, EF-1alpha, EF-G, EF-2, etc.).
 - Ras family of GTP-binding proteins (Ras, Rho, Rab, Ral, Ypt1, SEC4,
   etc.).
 - Nuclear protein ran (see ).
 - ADP-ribosylation factors family (see ).
 - Bacterial dnaA protein (see ).
 - Bacterial recA protein (see ).
 - Bacterial recF protein (see ).
 - Guanine nucleotide-binding proteins alpha subunits (Gi, Gs, Gt, G0,
   etc.).
 - DNA mismatch repair proteins mutS family (see ).
 - Bacterial type II secretion system protein E (see ).
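As a rough illustration of how a sequence can be scanned for the 'A' motif, the sketch below searches for the consensus pattern of this entry, [AG]-x(4)-G-K-[ST], written as a Python regular expression. The function name find_p_loops is hypothetical, and the example sequence is a short Ras-like N-terminal fragment used purely for illustration. A lookahead is used so that overlapping occurrences would also be reported.

```python
import re

# [AG]-x(4)-G-K-[ST] as a regular expression; the lookahead lets the scan
# report overlapping occurrences instead of skipping past each match.
P_LOOP = re.compile(r"(?=([AG].{4}GK[ST]))")


def find_p_loops(sequence):
    """Return (1-based start position, matched segment) for each hit."""
    return [(m.start() + 1, m.group(1)) for m in P_LOOP.finditer(sequence)]


# Illustrative fragment only; not taken from this document.
print(find_p_loops("MTEYKLVVVGAGGVGKSALT"))   # -> [(10, 'GAGGVGKS')]
```

A hit only shows that the motif is present; it does not by itself establish nucleotide binding.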

Not all ATP- or GTP-binding proteins are picked-up by this motif. A number of proteins escape detection because the structure of their ATP-binding site is completely different from that of the P-loop. Examples of such proteins are the E1-E2 ATPases or the glycolytic kinases. In other ATP- or GTP-binding proteins the flexible loop exists in a slightly different form; this is the case for tubulins or protein kinases. A special mention must be reserved for adenylate kinase, in which there is a single deviation from the P-loop pattern: in the last position Gly is found instead of Ser or Thr. -Consensus pattern: [AG]-x(4)-G-K-[ST] -Sequences known to belong to this class detected by the pattern: a majority. -Other sequence(s) detected in SWISS-PROT: in addition to the proteins listed above, the 'A' motif is also found in a number of other proteins. Most of these proteins probably bind a nucleotide, but others are definitively not ATP- or GTP-binding (as for example chymotrypsin, or human ferritin light chain). -Expert(s) to contact by email: Koonin E.V.; [email protected] -Last update: July 1999 / Text revised. [ 1] Walker J.E., Saraste M., Runswick M.J., Gay N.J. EMBO J. 1:945-951(1982). [ 2] Moller W., Amons R. FEBS Lett. 186:1-7(1985). [ 3] Fry D.C., Kuby S.A., Mildvan A.S. Proc. Natl. Acad. Sci. U.S.A. 83:907-911(1986). [ 4] Dever T.E., Glynias M.J., Merrick W.C. Proc. Natl. Acad. Sci. U.S.A. 84:1814-1818(1987). [ 5] Saraste M., Sibbald P.R., Wittinghofer A. Trends Biochem. Sci. 15:430-434(1990). [ 6] Koonin E.V. J. Mol. Biol. 229:1165-1174(1993). [ 7] Higgins C.F., Hyde S.C., Mimmack M.M., Gileadi U., Gill D.R., Gallagher M.P.

J. Bioenerg. Biomembr. 22:571-592(1990). [ 8] Hodgman T.C. Nature 333:22-23(1988) and Nature 333:578-578(1988) (Errata). [ 9] Linder P., Lasko P., Ashburner M., Leroy P., Nielsen P.J., Nishi K., Schnier J., Slonimski P.P. Nature 337:121-122(1989). [10] Gorbalenya A.E., Koonin E.V., Donchenko A.P., Blinov V.M. Nucleic Acids Res. 17:4713-4730(1989). {END} {PDOC00691} {PS00888; CNMP_BINDING_1} {PS00889; CNMP_BINDING_2} {PS50042; CNMP_BINDING_3} {BEGIN} *********************************************************** * Cyclic nucleotide-binding domain signatures and profile * *********************************************************** Proteins that bind cyclic nucleotides (cAMP or cGMP) share a structural domain of about 120 residues [1-3]. The best studied of these proteins is the prokaryotic catabolite gene activator (also known as the cAMP receptor protein) (gene crp) where such a domain is known to be composed of three alpha-helices and a distinctive eight-stranded, antiparallel beta-barrel structure. Such a domain is known to exist in the following proteins: - Prokaryotic catabolite gene activator protein (CAP). - cAMP- and cGMP-dependent protein kinases (cAPK and cGPK). Both types of kinases contains two tandem copies of the cyclic nucleotide-binding domain. The cAPK's are composed of two different subunits: a catalytic chain and a regulatory chain which contains both copies of the domain. The cGPK's are single chain enzymes that include the two copies of the domain in their Nterminal section. The nucleotide specificity of cAPK and cGPK is due to an amino acid in the conserved region of beta-barrel 7: a threonine that is invariant in cGPK is an alanine in most cAPK. - Vertebrate cyclic nucleotide-gated ion-channels. Two such cations channels have been fully characterized. One is found in rod cells where it plays a role in visual signal transduction. 
It specifically binds to cGMP leading to an opening of the channel and thereby causing a depolarization of rod photoreceptors. In olfactory epithelium a similar, cAMP-binding, channel plays a role in odorant signal transduction. There are six invariant amino acids in this domain, three of which are glycine residues that are thought to be essential for maintenance of the structural integrity of the beta-barrel. We developed two signature patterns for this domain. The first pattern is located within beta-barrels 2 and 3 and contains the first two conserved Gly. The second pattern is located within betabarrels 6 and 7 and contains the third conserved Gly as well as the three other invariant residues. -Consensus pattern: [LIVM]-[VIC]-x(2)-G-[DENQTA]-x-[GAC]-x(2)-[LIVMFY](4)x(2)-G -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Consensus pattern: [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x[STACV] -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: 1. -Sequences known to belong to this class detected by the profile: ALL.

-Other sequence(s) detected in SWISS-PROT: 8. -Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so. -Last update: November 1997 / Patterns and text revised; profile added. [ 1] Weber I.T., Shabb J.B., Corbin J.D. Biochemistry 28:6122-6127(1989). [ 2] Kaupp U.B. Trends Neurosci. 14:150-157(1991). [ 3] Shabb J.B., Corbin J.D. J. Biol. Chem. 267:5723-5726(1992). {END} {PDOC00018} {PS00018; EF_HAND} {BEGIN} ********************************** * EF-hand calcium-binding domain * ********************************** Many calcium-binding proteins belong to the same evolutionary family and share a type of calcium-binding domain known as the EF-hand [1 to 5]. This type of domain consists of a twelve residue loop flanked on both side by a twelve residue alpha-helical domain. In an EF-hand loop the calcium ion is coordinated in a pentagonal bipyramidal configuration. The six residues involved in the binding are in positions 1, 3, 5, 7, 9 and 12; these residues are denoted by X, Y, Z, -Y, -X and -Z. The invariant Glu or Asp at position 12 provides two oxygens for liganding Ca (bidentate ligand). We list below the proteins which are known to contain EF-hand regions. For each type of protein we have indicated between parenthesis the total number of EF-hand regions known or supposed to exist. This number does not include regions which clearly have lost their calcium-binding properties, or the atypical low-affinity site (which spans thirteen residues) found in the S-100/ ICaBP family of proteins [6]. Aequorin and Renilla luciferin binding protein (LBP) (Ca=3). Alpha actinin (Ca=2). Calbindin (Ca=4). Calcineurin B subunit (protein phosphatase 2B regulatory subunit) (Ca=4). Calcium-binding protein from Streptomyces erythraeus (Ca=3?). Calcium-binding protein from Schistosoma mansoni (Ca=2?). 
Calcium-binding proteins TCBP-23 and TCBP-25 from Tetrahymena thermophila (Ca=4?). Calcium-dependent protein kinases (CDPK) from plants (Ca=4). Calcium vector protein from amphoxius (Ca=2). Calcyphosin (thyroid protein p24) (Ca=4?). Calmodulin (Ca=4, except in yeast where Ca=3). Calpain small and large chains (Ca=2). Calretinin (Ca=6). Calcyclin (prolactin receptor associated protein) (Ca=2). Caltractin (centrin) (Ca=2 or 4). Cell Division Control protein 31 (gene CDC31) from yeast (Ca=2?). Diacylglycerol kinase (EC 2.7.1.107) (DGK) (Ca=2). FAD-dependent glycerol-3-phosphate dehydrogenase (EC 1.1.99.5) from mammals (Ca=1). Fimbrin (plastin) (Ca=2). Flagellar calcium-binding protein (1f8) from Trypanosoma cruzi (Ca=1 or 2).

- Guanylate cyclase activating protein (GCAP) (Ca=3). - Inositol phospholipid-specific phospholipase C isozymes gamma-1 and delta-1 (Ca=2) [10]. - Intestinal calcium-binding protein (ICaBPs) (Ca=2). - MIF related proteins 8 (MRP-8 or CFAG) and 14 (MRP-14) (Ca=2). - Myosin regulatory light chains (Ca=1). - Oncomodulin (Ca=2). - Osteonectin (basement membrane protein BM-40) (SPARC) and proteins that contains an 'osteonectin' domain (QR1, matrix glycoprotein SC1) (see the entry ) (Ca=1). - Parvalbumins alpha and beta (Ca=2). - Placental calcium-binding protein (18a2) (nerve growth factor induced protein 42a) (p9k) (Ca=2). - Recoverins (visinin, hippocalcin, neurocalcin, S-modulin) (Ca=2 to 3). - Reticulocalbin (Ca=4). - S-100 protein, alpha and beta chains (Ca=2). - Sarcoplasmic calcium-binding protein (SCPs) (Ca=2 to 3). - Sea urchin proteins Spec 1 (Ca=4), Spec 2 (Ca=4?), Lps-1 (Ca=8). - Serine/threonine protein phosphatase rdgc (EC 3.1.3.16) from Drosophila (Ca=2) - Sorcin V19 from hamster (Ca=2). - Spectrin alpha chain (Ca=2). - Squidulin (optic lobe calcium-binding protein) from squid (Ca=4). - Troponins C; from skeletal muscle (Ca=4), from cardiac muscle (Ca=3), from arthropods and molluscs (Ca=2). There has been a number of attempts [7,8] to develop patterns that pick-up EFhand regions, but these studies were made a few years ago when not so many different families of calcium-binding proteins were known. We therefore developed a new pattern which takes into account all published sequences. This pattern includes the complete EF-hand loop as well as the first residue which follows the loop and which seem to always be hydrophobic. -Consensus pattern: D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC][DENQSTAGC]-x(2)-[DE]-[LIVMFYW] -Sequences known to belong to this class detected by the pattern: ALL, except for a few sequences. 
-Other sequence(s) detected in SWISS-PROT: 52 proteins which are probably not calcium-binding and a few proteins for which we have reason to believe that they bind calcium: a number of endoglucanases and a xylanase from the cellulosome complex of Clostridium [9]. -Note: positions 1 (X), 3 (Y) and 12 (-Z) are the most conserved. -Note: the 6th residue in an EF-hand loop is, in most cases a Gly, but the number of exceptions to this 'rule' has gradually increased and we felt that the pattern should include all the different residues which have been shown to exist in this position in functional Ca-binding sites. -Note: the pattern will, in some cases, miss one of the EF-hand regions in some proteins with multiple EF-hand domains. -Expert(s) to contact by email: Cox J.A.; [email protected] Kretsinger R.H.; [email protected] -Last update: July 1999 / Text revised. [ 1] Kawasaki H., Kretsinger R.H. Protein Prof. 2:305-490(1995). [ 2] Kretsinger R.H. Cold Spring Harbor Symp. Quant. Biol. 52:499-510(1987).

[ 3] Moncrief N.D., Kretsinger R.H., Goodman M. J. Mol. Evol. 30:522-562(1990). [ 4] Nakayama S., Moncrief N.D., Kretsinger R.H. J. Mol. Evol. 34:416-448(1992). [ 5] Heizmann C.W., Hunziker W. Trends Biochem. Sci. 16:98-103(1991). [ 6] Kligman D., Hilt D.C. Trends Biochem. Sci. 13:437-443(1988). [ 7] Strynadka N.C.J., James M.N.G. Annu. Rev. Biochem. 58:951-98(1989). [ 8] Haiech J., Sallantin J. Biochimie 67:555-560(1985). [ 9] Chauvaux S., Beguin P., Aubert J.-P., Bhat K.M., Gow L.A., Wood T.M., Bairoch A. Biochem. J. 265:261-265(1990). [10] Bairoch A., Cox J.A. FEBS Lett. 269:454-456(1990). {END} {PDOC00019} {PS00019; ACTININ_1} {PS00020; ACTININ_2} {BEGIN} ************************************************ * Actinin-type actin-binding domain signatures * ************************************************ Alpha-actinin is a F-actin cross-linking protein which is thought to anchor actin to a variety of intracellular structures [1]. The actin-binding domain of alpha-actinin seems to reside in the first 250 residues of the protein. A similar actin-binding domain has been found in the N-terminal region of many different actin-binding proteins [2,3]: - In the beta chain of spectrin (or fodrin). - In dystrophin, the protein defective in Duchenne muscular dystrophy (DMD) and which may play a role in anchoring the cytoskeleton to the plasma membrane. - In the slime mold gelation factor (or ABP-120). - In actin-binding protein ABP-280 (or filamin), a protein that link actin filaments to membrane glycoproteins. - In fimbrin (or plastin), an actin-bundling protein. Fimbrin differs from the above proteins in that it contains two tandem copies of the actinbinding domain and that these copies are located in the C-terminal part of the protein. We selected two conserved regions as signature patterns for this type of domain. 
The first of this region is located at the beginning of the domain, while the second one is located in the central section and has been shown to be essential for the binding of actin. -Consensus pattern: [EQ]-x(2)-[ATV]-[FY]-x(2)-W-x-N -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: 25. -Consensus pattern: [LIVM]-x-[SGN]-[LIVM]-[DAGHE]-[SAG]-x-[DNEAG]-[LIVM]-x[DEAG]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-W-x[LIVM](2) -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Last update: November 1997 / Patterns and text revised.

[ 1] Schleicher M., Andre E., Harmann A., Noegel A.A. Dev. Genet. 9:521-530(1988). [ 2] Matsudaira P. Trends Biochem. Sci. 16:87-92(1991). [ 3] Dubreuil R.R. BioEssays 13:219-226(1991). {END} {PDOC00906} {PS01177; ANAPHYLATOXIN_1} {PS01178; ANAPHYLATOXIN_2} {BEGIN} ********************************************** * Anaphylatoxin domain signature and profile * ********************************************** Anaphylatoxins [1] are mediators of local inflammatory process that act by inducing smooth muscle contraction. There are three different anaphylatoxins: C3a, C4a and C5a. They are peptides of about 75 amino-acid residues that are derived from the proteolytic degradation of complement C3, C4 and C5 and which contains six disulfide-bonded cysteines [2] (see the schematic representation below). +--------------------+ +------------- ------------+ xxxCCxxxxxxxxxxxxCxxxxxxxxxxxxCxxxxxxCCxxx +--------------------------------+ 'C': conserved cysteine involved in a disulfide bond. This cysteine-rich region shares similarity with a three times repeated domain found in the mammalian extracellular matrix proteins fibulins 1 and 2 [3,4]. The three disulfide bonds are conserved in the first and last repeats, but the first disulfide bond is missing in the second repeat. Our consensus pattern spans the entire cysteine-rich domain. -Consensus pattern: [CSH]-C-x(2)-[GAP]-x(7,8)-[GASTDEQR]-C-[GASTDEQL]-x(3,9)[GASTDEQN]-x(2)-[CE]-x(6,7)-C-C [Cs are involved in disulfide bonds] -Sequences known to belong to this class detected by the pattern: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Sequences known to belong to this class detected by the profile: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so. -Last update: November 1997 / First entry. [ 1] Hugli T.E. 
Complement 3:111-127(1986). [ 2] Huber R., Scholze H., Paques E.P., Deisenhofer J. Hoppe-Seyler's Z. Physiol. Chem. 361:1389-1399(1980). [ 3] Argraves W.S., Tran H., Burgess W.H., Dickerson K. J. Cell Biol. 111:3155-3164(1990).

[ 4] Pan T.-C., Sasaki T., Zhang R.-Z., Faessler R., Timpl R.
     J. Cell Biol. 123:1269-1277(1993).
{END}
{PDOC50088}
{PS50088; ANK_REPEAT}
{PS50297; ANK_REP_REGION}
{BEGIN}
*****************************************************
* Ankyrin repeat and ankyrin repeat region profiles *
*****************************************************

Ankyrin repeats (ANK) are tandemly repeated modules of about 33 amino acids.
They occur in a large number of functionally diverse proteins mainly from
eukaryotes. The few known examples from prokaryotes and viruses may be the
result of horizontal gene transfers [1]. Many ankyrin repeat regions are
known to function as protein-protein interaction domains.

The conserved fold of the ankyrin repeat unit is known from several crystal and solution structures, e.g. from: p53-binding protein 53BP2 [2], Cyclin-dependent kinase inhibitor p19Ink4d [3], Transcriptional regulator GABP-beta [4], NF-kappaB inhibitory protein IkB-alpha [5].

It has been described as an L-shaped structure consisting of a beta-hairpin and two alpha-helices [2]. Two profiles were developed for this module, the first one picks up ANK repeats while the second profile is 'circular' and will thus detect a region containing adjacent ANK repeats. -Sequences known to belong to this class detected by the repeat profile: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Sequences known to belong to this class detected by the circular profile: ALL. -Other sequence(s) detected in SWISS-PROT: NONE. -Expert(s) to contact by email: Hofmann K.O.; [email protected] -Last update: September 2000 / First entry. [ 1] Bork P. Proteins 17:363-374(1993). [ 2] Gorina S., Pavletich N.P. Science 274:1001-1005(1996). [ 3] Luh F.Y., Archer S.J., Domaille P.J., Smith B.O., Owen D., Brotherton D.H., Raine A.R., Xu X., Brizuela L., Brenner S.L., Laue E.D. Nature 389:999-1003(1997). [ 4] Batchelor A.H., Piper D.E., De La Brousse F.C., McKnight S.L., Wolberger C. Science 279:1037-1041(1998). [ 5] Jacobs M.D., Harrison S.C. Cell 95:749-758(1998). {END} {PDOC00376}

{PS00495; APPLE}
{BEGIN}
****************
* Apple domain *
****************

Plasma kallikrein (EC 3.4.21.34) and coagulation factor XI (EC 3.4.21.27)
are two related plasma serine proteases activated by factor XIIA and which
share the same domain topology: an N-terminal region that contains four
tandem repeats of about 90 amino acids and a C-terminal catalytic domain.

The 90 amino-acid repeated domain contains 6 conserved cysteines. It has
been shown [1,2] that three disulfide bonds link the first and sixth, second
and fifth, and third and fourth cysteines. The domain can be drawn in the
shape of an apple (see below) and has been accordingly called the 'apple
domain'.

[Schematic representation of an apple domain.]

Apart from the cysteines, there are a number of other conserved positions in
the apple domain. We have developed a pattern that spans the complete domain
and includes these conserved positions.

-Consensus pattern: C-x(3)-[LIVMFY]-x(5)-[LIVMFY]-x(3)-[DENQ]-[LIVMFY]-x(10)-
                    C-x(3)-C-T-x(4)-C-x-[LIVMFY]-F-x-[FY]-x(13,14)-C-x-
                    [LIVMFY]-[RK]-x-[ST]-x(14,15)-S-G-x-[ST]-[LIVMFY]-x(2)-C
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: June 1992 / Pattern and text revised.

[ 1] McMullen B.A., Fujikawa K., Davie E.W.
     Biochemistry 30:2050-2056(1991).
[ 2] McMullen B.A., Fujikawa K., Davie E.W.
     Biochemistry 30:2056-2060(1991).
{END}
{PDOC50176}
{PS50176; ARM_REPEAT}
{BEGIN}
********************************************
* Armadillo/plakoglobin ARM repeat profile *
********************************************

The armadillo repeat is an approximately 40 amino acids long tandemly
repeated sequence motif first identified in the Drosophila segment polarity
gene product armadillo, a protein that mediates cell adhesion. Similar
repeats were later found in the mammalian armadillo homolog beta-catenin,
the junctional plaque protein plakoglobin, the adenomatous polyposis coli
(APC) tumor suppressor protein, and a number of other proteins [1]. These
proteins exert several functions through interactions of their tandem
armadillo repeat domain with diverse binding partners. The proteins combine
structural roles as cell-contact and cytoskeleton-associated proteins and
signaling functions by generating and transducing signals affecting gene
expression [1,2,3].

The three-dimensional fold of an armadillo repeat is known from the crystal
structure of beta-catenin [4]. There, the 12 repeats form a superhelix of
alpha-helices, with three helices per unit. The cylindrical structure
features a positively charged groove which presumably interacts with the
acidic surfaces of the known interaction partners of beta-catenin.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: 2.
-Expert(s) to contact by email: Hofmann K.O.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Peifer M., Berg S., Reynolds A.B.
     Cell 76:789-791(1994).
[ 2] Groves M.R., Barford D.
     Curr. Opin. Struct. Biol. 9:383-389(1999).
[ 3] Hatzfeld M.
     Int. Rev. Cytol. 186:179-224(1999).
[ 4] Huber A.H., Nelson W.J., Weis W.I.
     Cell 90:871-882(1997).
{END}
{PDOC00566}
{PS00660; BAND_41_1}
{PS00661; BAND_41_2}
{PS50057; BAND_41_3}
{BEGIN}
*************************************************
* Band 4.1 family domain signatures and profile *
*************************************************

A number of cytoskeletal-associated proteins that associate with various
proteins at the interface between the plasma membrane and the cytoskeleton
contain a conserved N-terminal domain of about 150 amino-acid residues
[1,2,3]. The proteins in which such a domain is known to exist are listed
below.

 - Band 4.1, which links the spectrin-actin cytoskeleton of erythrocytes to
   the plasma membrane. Band 4.1 binds with a high affinity to glycophorin
   and with lower affinity to band 3 protein.
 - Ezrin (cytovillin or p81), a component of the undercoat of the microvilli
   plasma membrane.
 - Moesin, which is probably involved in binding major cytoskeletal
   structures to the plasma membrane.
- Radixin, which seems to play a crucial role in the binding of the barbed end of actin filaments to the plasma membrane in the undercoat of the cell-to-cell adherens junction (AJ).
- Talin, which binds with high affinity to vinculin and with low affinity to integrins. Talin is a high molecular weight (270 Kd) cytoskeletal protein concentrated in regions of cell-substratum contact and, in lymphocytes, of cell-cell contacts.
- Filopodin, a slime mold protein that binds actin and which is involved in the control of cell motility and chemotaxis.
- Merlin (or schwannomin). Defects in this protein are the cause of type 2 neurofibromatosis (NF2), a predisposition to tumors of the nervous system.
- Protein NBL4.

- Protein-tyrosine phosphatases PTPN3 (PTP-H1) and PTPN4 (PTP-MEG1). Structurally these two very similar enzymes are composed of an N-terminal band 4.1-like domain followed by a central segment of unknown function and a C-terminal catalytic domain (see ). They could act at junctions between the membrane and the cytoskeleton.
- Protein-tyrosine phosphatases PTPN14 (PEZ or PTP36) and PTP-D1, PTP-RL10 and PTP2E. These phosphatases also consist of an N-terminal band 4.1-like domain and a C-terminal catalytic domain. The central domain seems to contain an SH3-binding domain.
- Caenorhabditis elegans protein phosphatase ptp-1.

Ezrin, moesin, and radixin are highly related proteins, but the other proteins in which this domain is found do not share any region of similarity outside of the domain. In band 4.1 this domain is known to be important for the interaction with glycophorin, an integral membrane protein.

We have developed two signature patterns for this domain: the first is based on the conserved positions found at the N-terminal extremity of the domain, the second is located in the C-terminal section.

-Consensus pattern: W-[LIV]-x(3)-[KRQ]-x-[LIVM]-x(2)-[QH]-x(0,2)-[LIVMF]-x(6,8)-[LIVMF]-x(3,5)-F-[FY]-x(2)-[DENS]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Consensus pattern: [HYW]-x(9)-[DENQSTV]-[SA]-x(3)-[FY]-[LIVM]-x(2)-[ACV]-x(2)-[LM]-x(2)-[FY]-G-x-[DENQST]-[LIVMFYS]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: 7.

-Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so.
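The consensus patterns above can be tried out with an ordinary regular-expression engine. The following is a minimal sketch, not part of PROSITE itself: the function name `prosite_to_regex` and the test peptide are hypothetical, the pattern strings are copied from this entry with a '-' assumed between every pair of elements, and only the basic syntax (residue, x, [allowed], {forbidden}, repeat counts, '<' and '>' anchors) is handled.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regex string."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        start = "^" if element.startswith("<") else ""
        end = "$" if element.endswith(">") else ""
        m = _ELEMENT.fullmatch(element.lstrip("<").rstrip(">"))
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(start + core + end)
    return "".join(pieces)


# The two Band 4.1 signature patterns from this entry.
BAND_41_1 = re.compile(prosite_to_regex(
    "W-[LIV]-x(3)-[KRQ]-x-[LIVM]-x(2)-[QH]-x(0,2)-[LIVMF]-x(6,8)-"
    "[LIVMF]-x(3,5)-F-[FY]-x(2)-[DENS]"))
BAND_41_2 = re.compile(prosite_to_regex(
    "[HYW]-x(9)-[DENQSTV]-[SA]-x(3)-[FY]-[LIVM]-x(2)-[ACV]-x(2)-"
    "[LM]-x(2)-[FY]-G-x-[DENQST]-[LIVMFYS]"))

# Synthetic peptide built element by element to satisfy the first pattern;
# it is NOT a real protein sequence.
synthetic_peptide = ("WL" + "AAA" + "K" + "A" + "L" + "AA" + "Q" + "L"
                     + "AAAAAA" + "L" + "AAA" + "FF" + "AA" + "D")

print(BAND_41_1.pattern)
print(BAND_41_1.fullmatch(synthetic_peptide) is not None)
```

A plain regex scan only reproduces the pattern match; it does not replace the profile, which the note above describes as much more sensitive.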
-Expert(s) to contact by email: Rees J.; [email protected]
-Last update: November 1997 / Patterns and text revised; profile added.

[ 1] Rees D.J.G., Ades S.A., Singer S.J., Hynes R.O. Nature 347:685-689(1990).
[ 2] Funayama N., Nagafuchi A., Sato N., Tsukita S., Tsukita S. J. Cell Biol. 115:1039-1048(1991).
[ 3] Takeuchi K., Kawashima A., Nagafuchi A., Tsukita S. J. Cell Sci. 107:1921-1928(1994).
{END}
{PDOC50197}
{PS50197; BEACH}
{BEGIN}
************************
* BEACH domain profile *
************************

The BEACH (BEIge And CHS) domain is a highly conserved region of about 300 residues that is found in otherwise distinct proteins over a wide species range [1,2]. The function of the BEACH domain is unknown. The BEACH

domain is usually followed by a series of WD repeats (see ). Proteins known to contain a BEACH domain are listed below:

- Human lysosomal trafficking regulator or Chediak-Higashi Syndrome (CHS) protein. Defects in the CHS protein are the cause of Chediak-Higashi Syndrome, a rare autosomal disorder characterized by hypopigmentation, severe immunologic deficiency, a bleeding tendency and neurologic abnormalities. As an important hallmark of several tissues derived from CHS patients is the occurrence of giant inclusion bodies and organelles and protein sorting defects in these organelles, the CHS protein is thought to be involved in vesicle fusion or fission. The CHS protein contains 7 WD-repeats. The mouse ortholog of CHS is called 'beige'.
- Mammalian FAN (Factor Associated with Neutral-sphingomyelinase activation). It binds to the N-SMase activation domain (NSD) of the p55 TNF-receptor and mediates the activation of N-SMase after ligand binding. FAN contains five WD-repeats.
- Human CDC4-like protein.
- Caenorhabditis elegans hypothetical proteins F52C9.2 and F52C9.3.
- Yeast hypothetical protein YCR032W.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Hofmann K.O.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Nagle D.L., Karim M.A., Woolf E.A., Holmgren L., Bork P., Misumi D.J., McGrail S.H., Dussault B.J. Jr., Perou C.M., Boissy R.E., Duyk G.M., Spritz R.A., Moore K.J. Nat. Genet. 14:307-311(1996).
[ 2] Adam-Klages S., Schwandner R., Adam D., Kreder D., Bernardo K., Kronke M. J. Leukoc. Biol. 63:678-682(1998).
{END}
{PDOC50138}
{PS50138; BRCA2_REPEAT}
{BEGIN}
************************
* BRCA2 repeat profile *
************************

The BRCA2 repeat [1] is a small domain (34 residues) currently found only in the breast cancer susceptibility gene protein 2 (BRCA2).
These repeats were shown to be required for direct interaction with DNA repair protein RAD51 [2,3,4]. BRCA2 was also shown to functionally interact with p53 and RAD51, two key components of the cell cycle control and DNA repair pathways [5]. As BRCA1 and BRCA2 display a stable interaction both with each other [6] and with RAD51 [2,3,4,5,7], they may participate in a pathway associated with the activation of double-strand break repair and/or homologous recombination.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Falquet L.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Bork P., Blomberg N., Nilges M. Nat. Genet. 13:22-23(1996).
[ 2] Wong A.K.C., Pero R., Ormonde P.A., Tavtigian S.V., Bartel P.L. J. Biol. Chem. 272:31941-31944(1997).
[ 3] Katagiri T., Saito H., Shinohara A., Ogawa H., Kamada N., Nakamura Y., Miki Y. Genes Chromosomes Cancer 21:217-222(1998).
[ 4] Chen P.L., Chen C.F., Chen Y., Xiao J., Sharp Z.D., Lee W.H. Proc. Natl. Acad. Sci. U.S.A. 95:5287-5292(1998).
[ 5] Marmorstein L.Y., Ouchi T., Aaronson S.A. Proc. Natl. Acad. Sci. U.S.A. 95:13869-13874(1998).
[ 6] Scully R., Chen J., Plug A., Xiao Y., Weaver D., Feunteun J., Ashley T., Livingston D.M. Cell 88:265-275(1997).
[ 7] Chen J., Silver D.P., Walpita D., Cantor S.B., Gazdar A.F., Tomlinson G., Couch F.J., Weber B.L., Ashley T., Livingston D.M., Scully R. Mol. Cell 2:317-328(1998).
{END}
{PDOC50172}
{PS50172; BRCT}
{BEGIN}
***********************
* BRCT domain profile *
***********************

The breast cancer susceptibility gene contains at its C-terminus two copies of a conserved domain that was named BRCT for BRCA1 C-terminus. This domain of about 95 amino acids is found in a large variety of proteins involved in DNA repair, recombination and cell cycle control [1,2,3]. The BRCT domain is not limited to the C-termini of protein sequences and can be found in multiple copies or in a single copy as in RAP1 and TdT. Recent data [4] indicate that the BRCT domain functions as a protein-protein interaction module.

The structure of the first of the two C-terminal BRCT domains of the human DNA repair protein XRCC1 has recently been determined by X-ray crystallography [4].

Some proteins known to contain a BRCT domain are listed below:

- Mammalian breast cancer type 1 susceptibility protein. It may regulate gene expression.
- Human P53-binding protein 1.
- Human poly(ADP-ribose) polymerase (PARP) (EC 2.4.2.30). It modifies various nuclear proteins by poly(ADP-ribosyl)ation.
- Vertebrate nucleotidylexotransferase (EC 2.7.7.31).
It adds nucleotides at the junction (N region) of rearranged Ig heavy chain and T cell receptor gene segments during the maturation of B and T cells.
- Mammalian DNA-repair protein XRCC1.
- Human DNA ligase III (EC 6.5.1.1). It is involved in DNA strand-break repair.
- Human DNA ligase IV (EC 6.5.1.1).
- Drosophila germline transcription factor 1. A putative transcription factor active during oogenesis and embryogenesis.
- Baker's yeast DNA ligase II (EC 6.5.1.1).
- Baker's yeast RAD9 protein. It is essential for cell cycle arrest following DNA damage by X-irradiation or inactivation of DNA ligase.
- Baker's yeast RAP1. It is involved in telomeric and HM loci silencing.
- Escherichia coli DNA ligase (EC 6.5.1.2). It is essential for DNA

replication and repair of damaged DNA.
- Synechocystis sp. DNA ligase (EC 6.5.1.2). It is probably essential for DNA replication and repair of damaged DNA.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Hofmann K.O.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Koonin E.V., Altschul S.F., Bork P. Nat. Genet. 13:266-268(1996).
[ 2] Bork P., Hofmann K., Bucher P., Neuwald A.F., Altschul S.F., Koonin E.V. FASEB J. 11:68-76(1997).
[ 3] Callebaut I., Mornon J.-P. FEBS Lett. 400:25-30(1997).
[ 4] Zhang X., Morera S., Bates P.A., Whitehead P.C., Coffer A.I., Hainbucher K., Nash R.A., Sternberg M.J., Lindahl T., Freemont P.S. EMBO J. 17:6404-6411(1998).
{END}
{PDOC50097}
{PS50097; BTB}
{BEGIN}
**********************
* BTB domain profile *
**********************

The BTB domain (Broad-Complex, Tramtrack and Bric a brac) is also known as the POZ domain (POxvirus and Zinc finger). It is a homodimerization domain occurring at the N-terminus of proteins containing multiple copies of either zinc fingers of the C2H2 type (see ) or Kelch repeats [1,2]. Many BTB proteins are transcriptional regulators that are thought to act through the control of chromatin structure.

The structure of the BTB domain of the promyelocytic leukemia zinc finger (PLZF) protein has been determined by X-ray crystallography and reveals a tightly intertwined dimer with an extensive hydrophobic interface [3]. A surface-exposed groove lined with conserved amino acids is formed at the dimer interface, suggesting a peptide-binding site.

Some proteins known to contain a BTB/TTK domain are listed below:

- Mammalian Bach proteins. These transcriptional regulators act as repressors or activators.
- Mammalian B-cell lymphoma 6 protein (BCL-6). A transcriptional regulator.
- Mammalian calicin. A possible morphogenic cytoskeletal element in spermiogenic differentiation.
- Human promyelocytic leukemia zinc finger (PLZF). A probable transcription factor.
- Drosophila GAGA transcription factor.
- Drosophila Mod(MDG4) or E(VAR)3-93D. A chromatin protein involved in gene silencing in position effect variegation, the control of gypsy insulator sequence, maintenance of gene expression, and apoptosis.
- Drosophila Tramtrack (TTK) protein. A probable transcriptional repressor.
- Drosophila Kelch protein. A component of ring canals that regulates the flow of cytoplasm between cells.
- Vaccinia virus proteins A55, C2, C4, C13, and F3.

- Myxoma virus proteins M-T8 and M-T9.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Hofmann K.O.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Zollman S., Godt D., Prive G.G., Couderc J.L., Laski F.A. Proc. Natl. Acad. Sci. U.S.A. 91:10717-10721(1994).
[ 2] Bardwell V.J., Treisman R. Genes Dev. 8:1664-1677(1994).
[ 3] Ahmad K.F., Engel C.K., Prive G.G. Proc. Natl. Acad. Sci. U.S.A. 95:12123-12128(1998).
{END}
{PDOC00857}
{PS01113; C1Q}
{BEGIN}
************************
* C1q domain signature *
************************

C1q is a subunit of the C1 enzyme complex that activates the serum complement system. It is composed of 9 disulfide-linked dimers of the chains A, B and C, which share a common structure that consists of an N-terminal nonhelical region, a triple helical (collagenous) region and a C-terminal globular head, which is called the C1q domain. That domain consists of about 136 amino acids which probably form ten beta strands interspersed by beta-turns and/or loops [1].

Such a domain has been found in the C-terminus of vertebrate secreted or membrane-bound proteins which are mostly short-chain collagens and collagen-like molecules [1-4]. These proteins are listed below:

- Complement C1q subcomponent chains A, B and C. Efficient activation of C1 takes place on interaction of the globular heads of C1q with the Fc regions of IgG or IgM antibody present in immune complexes.
- Vertebrate short-chain collagen type VIII, the major component of the basement membrane of corneal endothelial cells. It is composed of a triple helical domain in between a short N-terminal and a larger C-terminal globule which contains the C1q domain.
- Vertebrate collagen type X, which has the same structure as collagen type VIII. It is a product of hypertrophic chondrocytes.
- Bluegill inner-ear specific structural protein.
This short-chain collagen forms a microstructural matrix within the otolithic membrane.
- Chipmunk hibernation-associated plasma proteins HP-20, HP-25 and HP-27. These proteins disappear from blood specifically during hibernation. They contain a collagen-like domain near the N-terminus and a C-terminal C1q domain.
- Human precerebellin, which is located within postsynaptic structures of Purkinje cells, probably membrane-bound. Cerebellin is involved in synaptic activity.
- Rat precerebellin-like glycoprotein, a probable membrane protein. The C1q domain is located at the C-terminal extracellular extremity.
- Human endothelial cell multimerin (ECM), a carrier protein for platelet factor V/VA.
- Vertebrate 30 Kd adipocyte complement-related protein (ACRP30), also known as ApM1 or AdipoQ.

The C-terminal globular domain of the C1q subcomponents and collagen types

VIII and X is important both for the correct folding and alignment of the triple helix and for protein-protein recognition events [5,6]. For collagen type X it has been suggested that the domain is important for initiation and maintenance of the correct assembly of the protein [7].

There are two well conserved regions within the C1q domain: an aromatic motif is located within the first half of the domain, the other conserved region is located near the C-terminal extremity. We derived the signature pattern from the aromatic motif.

-Consensus pattern: F-x(5)-[ND]-x(4)-[FYWL]-x(6)-F-x(5)-G-x-Y-x-F-x-[FY]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Reid K.B.M.; [email protected]
-Last update: July 1998 / Text revised.

[ 1] Smith K.F., Haris P.I., Chapman D., Reid K.B.M., Perkins S.J. Biochem. J. 301:249-256(1994).
[ 2] Brass A., Kadler K.E., Thomas J.T., Grant M.E., Boot-Handford R.P. FEBS Lett. 303:126-128(1992).
[ 3] Petry F., Reid K.B.M., Loos M. Eur. J. Biochem. 209:129-134(1992).
[ 4] Bork P. Unpublished observations (1995).
[ 5] Rosenbloom J., Endo R., Harsch M. J. Biol. Chem. 251:2070-2076(1976).
[ 6] Engel J., Prockop D.J. Annu. Rev. Biophys. Chem. 20:137-152(1991).
[ 7] Kwan A.P.L., Cummings C.E., Chapman J.A., Grant M.E. J. Cell Biol. 114:597-604(1991).
{END}
{PDOC50209}
{PS50209; CARD}
{BEGIN}
*******************************************
* CARD caspase recruitment domain profile *
*******************************************

The apoptotic signal coming from ligand-induced oligomerization of death receptors is mediated by a number of adaptor proteins containing specialized interaction domains. Besides the caspase recruitment domain (CARD), this group is formed by the death domain (DD) (see ) and the death effector domain (DED) (see ).
The CARD domain was first described as a homology region in the N-terminus of the death adaptor protein RAIDD and the caspases ced-3 and CASP2 [1]. Recently, it was shown that this domain is widespread among apoptotic signaling molecules and a function in caspase-recruitment has been proposed [2]. The CARD domain typically associates with other CARD-containing proteins, forming either dimers or trimers [1,2,3,4]. It has been predicted that CARD is related in structure and sequence to both DD and DED domains, which work in similar pathways and show similar interaction properties [2].

Important members of the CARD family occur in the following proteins:

- RAIDD death adaptor protein [1].
- Caspases: Ced-3, CASP1, CASP2, CASP4, CASP5, CASP9 and CASP12.

- Inhibitor of apoptosis (IAP) proteins c-IAP1 and c-IAP2.
- Caenorhabditis elegans cell death protein Ced-4 and its mammalian homologue Apaf-1.
- Equine herpes virus protein E10.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Expert(s) to contact by email: Hofmann K.O.; [email protected]
-Last update: September 2000 / First entry.

[ 1] Duan H., Dixit V.M. Nature 385:86-89(1997).
[ 2] Hofmann K., Bucher P., Tschopp J. Trends Biochem. Sci. 22:155-156(1997).
[ 3] Chinnaiyan A.M., Chaudhary D., O'Rourke K., Koonin E.V., Dixit V.M. Nature 388:728-729(1997).
[ 4] Irmler M., Hofmann K., Vaux D., Tschopp J. FEBS Lett. 406:189-190(1997).
{END}
{PDOC50021}
{PS50021; CH}
{BEGIN}
************************************
* Calponin homology domain profile *
************************************

A number of actin-binding proteins, including spectrin, alpha-actinin and fimbrin, contain a 250 amino acid stretch called the actin binding domain (ABD). The ABD has probably arisen from duplication of a domain which is also found in a single copy in a number of other proteins like calponin or the vav proto-oncogene and has been called calponin homology (CH) domain [1,2].

A detailed analysis of the CH domain-containing proteins has shown that they can be divided into three groups [1]:

- The fimbrin family of monomeric actin cross-linking molecules containing two ABDs
- Dimeric cross-linking proteins (alpha-actinin, beta-spectrin, filamin, etc.) and monomeric F-actin binding proteins (dystrophin, utrophin) each containing one ABD
- Proteins containing only a single amino terminal CH domain.

Each single ABD, comprising two CH domains, is able to bind one actin monomer in the filament. The amino terminal CH domain has the intrinsic ability to bind actin, albeit with lower affinity than the complete ABD, whereas the carboxy terminal CH domain binds actin extremely weakly or not at all.
Nevertheless both CH domains are required for a fully functional ABD, with the C-terminal CH domain contributing to the overall stability of the complete ABD through inter-domain helix-helix interactions [1]. Some of the proteins containing a single CH domain also bind to actin, although this has not been shown to be via the single CH domain alone [2]. In addition, the CH domain also occurs in a number of proteins not known to bind actin, a notable example being the vav proto-oncogene.

The resolution of the 3D structure of various CH domains has shown that the conserved core consists of four major alpha-helices [2].

Some proteins known to contain a CH domain are listed below:

- Alpha-actinins. F-actin cross-linking proteins which are thought to anchor actin to a variety of intracellular structures.
- Calponins (see ). Thin filament-associated proteins which are implicated in the regulation and modulation of smooth muscle contraction.
- Human dystrophin. This protein is defective in Duchenne muscular dystrophy and Becker muscular dystrophy. It may play a role in anchoring the cytoskeleton to the plasma membrane.
- Mammalian vav proto-oncogene. It could be an exchange factor for a small ras-like GTP-binding protein.
- Human ras GTPase activating-like protein IQGAP1.
- Spectrin beta chain. Spectrin is the major constituent of the cytoskeletal network underlying the erythrocyte plasma membrane.
- Transgelins. These are actin cross-linking/gelling proteins involved in calcium interactions and contractile properties of the cell that may contribute to replicative senescence.
- T-plastin, an actin-bundling protein found in intestinal microvilli, hair cell stereocilia, and fibroblast filopodia.
- Drosophila muscle-specific protein 20. It might be the calcium-binding protein of synchronous muscle.
- Yeast fimbrin. It binds to actin.
- Slime mold gelation factor. It is an F-actin cross-linking protein.

-Sequences known to belong to this class detected by the profile: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: September 2000 / First entry.

[ 1] Stradal T., Kranewitter W., Winder S.J., Gimona M. FEBS Lett. 431:134-137(1998).
[ 2] Keep N.H., Norwood F.L., Moores C.A., Winder S.J., Kendrick-Jones J. J. Mol. Biol. 285:1257-1264(1999).
{END}
{PDOC00912}
{PS01185; CTCK_1}
{PS01225; CTCK_2}
{BEGIN}
*************************************************
* C-terminal cystine knot signature and profile *
*************************************************

The structures of transforming growth factor-beta (TGF-beta), nerve growth factor (NGF), platelet-derived growth factor (PDGF) and gonadotropin have been shown to be similar [1,2 and references therein]: these proteins are folded into two highly twisted antiparallel pairs of beta-strands and contain three disulfide bonds, of which two form a cystine ring through which the third bond passes (see the schematic representation below). This structure is called a cystine knot [3].

[Schematic representation of the cystine knot.]

... 3' double-stranded DNA exonuclease that could act in a pathway that corrects mismatched base pairs.
- Yeast EXO1 (DHS1), a protein with probably the same function as exo1.
- Yeast DIN7.

Sequence alignment of this family of proteins reveals that similarities are largely confined to two regions. The first is located at the N-terminal extremity (N-region) and corresponds to the first 95 to 105 amino acids. The second region is internal (I-region) and found towards the C-terminus; it spans about 140 residues and contains a highly conserved core of 27 amino acids that includes a conserved pentapeptide (E-A-[DE]-A-[QS]). It is possible that the conserved acidic residues are involved in the catalytic mechanism of DNA excision repair in XPG. The amino acids linking the N- and I-regions are not conserved; indeed, they are largely absent from proteins belonging to the second subset.

We have developed two signature patterns for these proteins. The first corresponds to the central part of the N-region, the second to part of the I-region and includes the putative catalytic core pentapeptide.

-Consensus pattern: [VI]-[KRE]-P-x-[FYIL]-V-F-D-G-x(2)-[PIL]-x-[LVC]-K
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Consensus pattern: [GS]-[LIVM]-[PER]-[FYS]-[LIVM]-x-A-P-x-E-A-[DE]-[PAS]-[QS]-[CLM]

-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
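As an illustration of how the two signature patterns above could be scanned against a sequence, here is a minimal sketch using Python's `re` module. The regexes are hand translations of the two patterns (assuming a '-' between [PAS] and [QS] in the second one), the names `XPG_N`, `XPG_I` and `scan` are hypothetical, and the test sequence is synthetic, built only to contain one hit for each pattern. A lookahead is used so that overlapping hits would also be reported.

```python
import re

# Hand translation: x -> ".", x(2) -> ".{2}", [ABC] stays a character class.
# Each pattern is wrapped in a lookahead with a capture group so that
# finditer() also reports hits that overlap one another.
XPG_N = re.compile(r"(?=([VI][KRE]P.[FYIL]VFDG.{2}[PIL].[LVC]K))")
XPG_I = re.compile(r"(?=([GS][LIVM][PER][FYS][LIVM].AP.EA[DE][PAS][QS][CLM]))")


def scan(regex, sequence):
    """Return (1-based start position, matched segment) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Synthetic peptides written to satisfy each pattern; NOT real XPG sequences.
n_region_peptide = "VKPAFVFDGAAPALK"
i_region_peptide = "GLPFLAAPAEADAQC"
sequence = "MM" + n_region_peptide + "GG" + i_region_peptide

print(scan(XPG_N, sequence))
print(scan(XPG_I, sequence))
```

Positions are reported 1-based, which is the usual convention for residue numbering in sequence databases.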