
Aligning Subunits of Internally Symmetric Proteins with CE-Symm


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Background

Proteins can have quaternary symmetry and/or internal symmetry

Symmetry is widespread in proteins and can be observed at a number of levels, from crystal symmetry within complexes to pseudo-symmetry in individual chains and domains. Symmetry is known to play a role in protein evolution,1 allosteric regulation,2 DNA binding,3 and cooperative enzyme effects.4 Symmetry has also been utilized to understand protein folding5 and to aid the computational design of large proteins.6

Quaternary symmetry consists of multiple identical polypeptide chains arranged in a symmetric fashion. Such symmetry is extremely common in proteins, occurring in approximately 80% of structures in the Protein Data Bank (PDB). Detecting quaternary symmetry relies on accurate assignment of the correct biological assembly for each protein. The PDB now annotates protein structures with their quaternary symmetry (Peter Rose et al., in preparation).

Proteins can also have internal or ternary symmetry, when a single chain contains two or more equivalent subunits. The subunits generally will differ in the exact sequence, but have substantially similar structures. Internal symmetry is sometimes styled as pseudosymmetry to reflect that the equivalence between subunits is generally at the level of residues or secondary structure elements rather than atoms or electron density, as is common with quaternary symmetry.

Internal symmetry can arise from quaternary symmetry by gene duplication or fusion. Thus, in addition to the many functional implications of symmetry, identifying protein symmetry can provide information about the evolutionary history of a protein. Such fission and fusion events often preserve the overall structure and function of the active complex.

Existing methods for finding internal symmetry

Several computational methods are available to detect symmetry. Some methods search for periodic sequences or structure (e.g. DAVROS7). These are generally limited in their ability to handle large insertions. Methods based on structural alignment algorithms (SymD,8 GANGSTA+9) can tolerate large insertions, but produce pairwise alignments between adjacent symmetric subunits rather than a global alignment of all subunits. This leads to ambiguous alignments, where a single residue could be aligned to several residues in each other subunit, depending on the order in which rotation operations are performed.

Conclusion

CE-Symm was run over a large hand-curated benchmark, and is able to detect symmetric proteins with a high degree of accuracy, even in the presence of large insertions. The resulting alignment includes exactly one residue from each subunit, as expected for a multiple alignment. It runs quickly and is able to detect symmetry broadly across a variety of folds.

The refinement stage can also be used as an independent tool in conjunction with seed alignments from other tools. This allows the circularly permuted alignments from tools such as SymD8 to be refined into multiple alignments between individual subunits.

Because symmetry is hypothesized to derive from gene duplications and fusions,12 aligning subunits within symmetric proteins can reveal ancient homologies and conserved sequences. CE-Symm is useful both for identifying symmetric proteins and for aligning the subunits for further study.

Availability:

CE-Symm source code is available under the LGPL license from https://github.com/rcsb/symmetry

An online server is available at http://source.rcsb.org/jfatcatserver/symmetry.jsp

Aligning Subunits of Internally Symmetric Proteins with CE-Symm

Spencer Bliven
Bioinformatics and Systems Biology Program
University of California San Diego

Douglas Myers-Turnbull
Dept. of Computer Science & Engineering
University of California San Diego

Philip Bourne
Skaggs School of Pharmacy and Pharmaceutical Sciences
University of California San Diego

Andreas Prlić
San Diego Supercomputer Center
University of California San Diego

(Left) Beta-carbonic anhydrase from Porphyridium purpureum [1I6O] is a tetramer with D2 quaternary symmetry. (Right) The beta-carbonic anhydrase in E. coli [1DDZ] consists of only two chains, each of which has internal C2 symmetry in addition to the C2 quaternary symmetry. The two halves of the chain have 68% sequence identity, strongly indicating that a duplication and fusion event has occurred in the evolution of this enzyme.

D5 quaternary symmetry of GTP cyclohydrolase I [1A8R]. The main 5-fold axis is shown in red; the five 2-fold axes are in blue.

Methods

The CE-Symm program is able to detect internal symmetry in proteins. It first identifies structurally similar regions within the protein structure. It then refines this alignment to improve the correspondence between subunits.

1. Identify structurally similar regions

The CE-Symm algorithm starts by identifying a non-trivial structural alignment between a protein and itself using Combinatorial Extension10 (CE). This uses the dynamic programming and progressive refinement of CE, but with two modifications.

1. A strong penalty term is added to self-aligned residues to prevent the trivial 0° rotation from dominating.

2. The alignment matrix is duplicated in the manner of Uliel et al.11 to account for the circular permutation which is introduced when comparing a symmetric protein against a rotated copy of itself.
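The two modifications can be illustrated on a toy score matrix. The following is a minimal sketch, not the CE-Symm implementation: the function names `penalize_diagonal` and `duplicate_columns`, the penalty value, the band width, and the choice to duplicate along the column axis only are all assumptions made for illustration.

```python
def penalize_diagonal(scores, band=2, penalty=100.0):
    """Return a copy of a square self-similarity matrix with `penalty`
    subtracted wherever |i - j| <= band (the trivial self-alignment)."""
    n = len(scores)
    return [
        [scores[i][j] - penalty if abs(i - j) <= band else scores[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def duplicate_columns(scores):
    """Append a second copy of each row to itself, so column j + n refers
    to the same residue as column j. A circularly permuted alignment then
    appears as one unbroken diagonal instead of two fragments."""
    return [row + row for row in scores]


# Toy example: 6 "residues" in two equivalent halves, (0, 1, 2) ~ (3, 4, 5).
n = 6
scores = [[1.0 if (j - i) % 3 == 0 else 0.0 for j in range(n)] for i in range(n)]
matrix = duplicate_columns(penalize_diagonal(scores, band=0))

# Residue i pairs with residue (i + 3) % n. In the doubled matrix this is
# the single diagonal j = i + 3 for every i, with no wrap-around.
diagonal = [matrix[i][i + 3] for i in range(n)]
```

In this toy case the off-diagonal line `j = i + 3` keeps its full score for all six residues, while the self-aligned cells `matrix[i][i]` are pushed far below it by the penalty.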

2. Refinement to ensure transitivity

The structural alignment from the first step is then refined to produce a residue-level equivalence map between subunits. Refinement produces a consistent multiple alignment between all identified subunits.

The order, k, of rotational symmetry present in the protein (if any) is determined by successively applying the seed alignment until the original orientation is found.
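One way to picture this step is to treat the seed alignment as a map from residue index to residue index and count how many applications are needed before most residues return to their starting position. The sketch below assumes that residue-level view; the names `estimate_order`, `max_order`, `tolerance`, and `min_fraction` are hypothetical, and the thresholds are arbitrary illustrative values rather than those used by CE-Symm.

```python
def apply_repeatedly(f, i, times):
    """Follow the alignment map f from residue i `times` times.
    Returns None if the chain reaches an unaligned residue."""
    for _ in range(times):
        i = f.get(i)
        if i is None:
            return None
    return i


def estimate_order(f, max_order=8, tolerance=0, min_fraction=0.8):
    """Smallest k >= 2 for which at least `min_fraction` of the aligned
    residues satisfy |f^k(i) - i| <= tolerance; None if there is none."""
    if not f:
        return None
    for k in range(2, max_order + 1):
        closed = 0
        for i in f:
            j = apply_repeatedly(f, i, k)
            if j is not None and abs(j - i) <= tolerance:
                closed += 1
        if closed >= min_fraction * len(f):
            return k
    return None


# Toy 3-fold example: 9 residues, each aligned to the residue 3 positions
# ahead (circularly), as for a 120 degree rotation of three subunits.
f = {i: (i + 3) % 9 for i in range(9)}
order = estimate_order(f)
```

For the toy map, two applications move every residue by six positions, and only the third application brings each residue back to itself.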

Let $f$ be a function over all residues in the protein, such that $f(i)=j$ when $i$ is aligned to $j$. The goal is to modify $f$ such that $k$ applications of $f$ (i.e. rotations of the protein) give a trivial alignment. Formally, $\forall i:\ f^k(i)=i$. To constrain the modifications, we introduce a penalty function $\sigma(i)$ which goes to zero when the previous condition is met. Two such penalty functions were considered:

1. $\sigma(i) = |f^k(i)-i|$. This measures the number of insertions or deletions which would need to be made in order to bring residue $i$ into alignment.

2. $\sigma(i) = |d(f^{k-1}(i), f^k(i)) - d(i, f^{k-1}(i))|$, where $d(i,j)$ gives the distance between alpha carbons of residues $i$ and $j$. This minimizes the changes in RMSD required during refinement.
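Both penalties can be written down directly from their definitions. The sketch below assumes the alignment is stored as a dictionary from residue index to residue index and that alpha-carbon coordinates are available as 3-tuples; the names `iterate`, `sigma_sequence`, and `sigma_distance` and the example coordinates are hypothetical.

```python
import math


def iterate(f, i, times):
    """Apply the alignment map f to residue i `times` times
    (None if any step is undefined)."""
    for _ in range(times):
        if i is None:
            return None
        i = f.get(i)
    return i


def sigma_sequence(f, i, k):
    """Penalty 1: |f^k(i) - i|, or None when f^k(i) is undefined."""
    j = iterate(f, i, k)
    return None if j is None else abs(j - i)


def sigma_distance(f, coords, i, k):
    """Penalty 2: |d(f^{k-1}(i), f^k(i)) - d(i, f^{k-1}(i))|,
    with d the Euclidean distance between alpha carbons."""
    prev = iterate(f, i, k - 1)
    last = iterate(f, i, k)
    if prev is None or last is None:
        return None
    return abs(math.dist(coords[prev], coords[last])
               - math.dist(coords[i], coords[prev]))


# Made-up coordinates for three residues (a 3-4-5 right triangle).
coords = {0: (0.0, 0.0, 0.0), 1: (3.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0)}

f_closed = {0: 1, 1: 2, 2: 0}   # f^3(0) = 0: the cycle closes
f_open = {0: 1, 1: 2, 2: 1}     # f^3(0) = 1: off by one residue

closed_penalties = (sigma_sequence(f_closed, 0, 3),
                    sigma_distance(f_closed, coords, 0, 3))
open_penalties = (sigma_sequence(f_open, 0, 3),
                  sigma_distance(f_open, coords, 0, 3))
```

When $f^k(i)=i$, the two distances in the second penalty are between the same pair of residues, so both penalties vanish together, as required.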

The algorithm works by choosing the residue with minimal score and modifying the alignment such that $f^k(i)=i$. To ensure that the alignment remains sequential and well-formed, the selection of residue to modify is limited by the following “eligibility criteria.”

1. $f^{k-1}(i)$ is defined ($f^k(i)$ may be undefined)

2. $\sigma(i) > 0$

3. $\sigma(f^{k-1}(i)) > 0$

4. $\forall j$ s.t. $\sigma(j)=0$: $\operatorname{sign}(f^{k-1}(i)-j) = \operatorname{sign}(i-f(j))$

Eligible residues are chosen in order of increasing score, and the alignment modified to set $f^{k-1}(i) \leftarrow i$, so that $f^k(i)=i$. This process is repeated until no eligible residues remain, at which point remaining residues are removed from the alignment.

This algorithm terminates in a multiple alignment between the symmetric subunits with exactly one residue per subunit in each aligned column. The process can also be interleaved with structure-based refinement to iteratively improve the alignment RMSD while preserving the multiple alignment property.
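The multiple alignment property itself is easy to check once the order $k$ is known. The sketch below is a deliberate simplification and not the refinement procedure described above: it keeps only residues whose chain of equivalences already closes after exactly $k$ steps and drops the rest, without attempting to repair them. The names `orbit` and `closed_columns` and the toy alignment are hypothetical.

```python
def orbit(f, i, k):
    """Residues [i, f(i), ..., f^{k-1}(i)], or None if the chain breaks."""
    members = [i]
    for _ in range(k - 1):
        nxt = f.get(members[-1])
        if nxt is None:
            return None
        members.append(nxt)
    return members


def closed_columns(f, k):
    """Columns of a multiple alignment built from residues satisfying
    f^k(i) = i with k distinct members. Residues that do not close are
    simply dropped here rather than realigned."""
    columns = []
    seen = set()
    for i in sorted(f):
        if i in seen:
            continue
        members = orbit(f, i, k)
        if members is None or f.get(members[-1]) != i:
            continue  # chain breaks or does not return to i
        if len(set(members)) != k:
            continue  # closes early, so not a genuine k-cycle
        seen.update(members)
        columns.append(members)
    return columns


# Toy 3-fold alignment over 9 residues. The chains through residues
# 0 and 1 close after three steps; the chain through residue 2 does
# not, because residue 8 is aligned to 1 instead of 2.
f = {0: 3, 3: 6, 6: 0,
     1: 4, 4: 7, 7: 1,
     2: 5, 5: 8, 8: 1}
columns = closed_columns(f, 3)
```

Each surviving column holds exactly one residue per subunit, which is the property the full refinement guarantees on termination; the difference is that the full procedure first tries to bring residues such as 2, 5, and 8 into a consistent column before discarding them.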

(Left) Fibroblast growth factor 1 [3JUT], colored to show internal symmetry. (Right) Dot plot showing equivalent residues within the protein. Red lines correspond to a 120° clockwise rotation of the protein around the 3-fold axis, and cyan to the 240° rotation. After duplicating the matrix, each alignment forms a sequential diagonal line which can be fully detected by CE. Gray shading indicates regions near the diagonal which are penalized by the scoring function.

Results

Symmetry detection

Refinement

Poster first presented at the 21st Annual International Conference on Intelligent Systems for Molecular Biology (2013).

The RCSB PDB is supported by the National Science Foundation [NSF DBI 0829586]; National Institute of General Medical Sciences; Office of Science, Department of Energy; National Library of Medicine; National Cancer Institute; National Institute of Neurological Disorders and Stroke; and the National Institute of Diabetes & Digestive & Kidney Diseases. The RCSB PDB is a member of the wwPDB.

Trypanosoma sialidase [SCOP domain d2agsa2], a six-bladed beta propeller. The alignment shown corresponds to a 120° rotation, permuting the structure by two blades. Superposition of the structure with itself (a) prior to refinement, and (b) after one iteration of refinement. A number of extraneous loops not shared by all blades are marked as unaligned by the refinement procedure. (c) Multiple alignment of the three two-blade subunits considered here.

(c)

SSRVE---LFKRKNSTVPFEESNGTIRERVVH---SFRIPT-IVNVD----GVMVAIADARYETSFDNSFIETAVKYSVDDGA
GKPVS---LKP--LFPAEFDGI------LTKE---FIGGVGAAIVASN---GNLVYPVQIADMG----GRVFTKIMYSEDDGN
WVEALGTLSHV--WTN------------SPTSNQQDCQSS--FVAVTIEGKRVMLFTHPLNLKGRW--MRDRLHLWMTD--NQ

TWNTQIAIKNSRASSVSRVMDATVIVKGNKLYILVGSFNKTRNSWTQHRDGSDWEPLLVVGE-----VTKSAANGKTTATISW
TWKFAEGRSKF------GCSEPAVLEWEGKLIINNRVD--------------GNRRLVYESS-----DMGKT-----------
RIFDVGQISIGDE----NSGYSSVLYKDDKLYSLHEINTND-----------VYSLVFVRLIGELQLM---------------

Percentage of SCOP superfamilies with internal symmetry, as detected by CE-Symm

SCOP class      Number of superfamilies   % symmetric
α               503                       17.4%
β               354                       17.5%
α/β             244                       17.6%
α+β             549                       12.5%
multi-domain    66                        3.0%
membrane        108                       22.0%
All classes     1,832                     16.0%

ROC curves showing the performance of CE-Symm for detecting symmetry, on a benchmark of 1000 randomly selected and manually annotated SCOP superfamilies. Two scoring functions were considered for classification power: TM-Score,13 and an alternate score incorporating the detection of symmetry order. The TM-Score classifier has an AUC of 0.94.
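For readers unfamiliar with the AUC figure, it can be read as the probability that a randomly chosen symmetric protein receives a higher score than a randomly chosen asymmetric one. The sketch below computes that quantity by direct pair counting; the function name `roc_auc` is hypothetical, and the scores and labels are made-up values, not benchmark data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Tied pairs count as half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for five proteins and their manual annotations
# (True = annotated as symmetric). These are not values from the benchmark.
scores = [0.62, 0.55, 0.48, 0.30, 0.25]
labels = [True, True, False, True, False]
auc = roc_auc(scores, labels)
```

In this made-up example five of the six positive/negative pairs are ranked correctly, so the AUC is 5/6. The pair-counting form is quadratic in the number of proteins, which is acceptable for a benchmark of about 1000 entries.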

Abstract

The CE-Symm algorithm has been developed to detect internal symmetry within protein chains. Symmetry is common across protein fold space and is tied to a number of important biological functions. Using CE-Symm we find that 16% of SCOP superfamilies contain internal symmetry.

The algorithm can produce unambiguous multiple alignments between symmetric subunits. It can also be applied to the output of other symmetry detection algorithms to refine alignments and identify conserved regions between all subunits.

References

1. Lee, J. & Blaber, M. PNAS 108, 126–130 (2011).

2. Monod, J. et al. J Mol Biol 12, 88–118 (1965).

3. Juo, Z. S. et al. J Mol Biol 261, 239–254 (1996).

4. Goodsell, D. S. & Olson, A. J. Annu Rev Biophys Biomol Struct 29, 105–153 (2000).

5. Gosavi, S. et al. J Mol Biol 357, 986–996 (2006).

6. Fortenberry, C. et al. J Am Chem Soc 133, 18026–18029 (2011).

7. Murray, K. B. et al. J Mol Biol 316, 341–363 (2002).

8. Kim, C. et al. BMC Bioinformatics 11, 303 (2010).

9. Guerler, A. et al. J Chem Inf Model 49, 2147–2151 (2009).

10. Shindyalov, I. N. & Bourne, P. E. Protein Eng 11, 739–747 (1998).

11. Uliel, S. et al. Bioinformatics 15, 930–936 (1999).

12. Abraham, A.-L. et al. J Mol Biol 394, 522–534 (2009).

13. Zhang, Y. & Skolnick, J. Proteins 57, 702–710 (2004).

Screenshot of the CE-Symm interface, showing a two-fold axis of EPSP synthase [1G6S].