1
Materials & Methods Utilizing fDETECT, we performed a first-of- its-kind analysis of predicted crystallization propensity of 8,652,940 non-redundant proteins from 1,953 fully sequenced proteomes (106 archaea, 1,043 bacteria, 265 eukaryotes and 539 viruses) collected from release 2012_07 of UniProt. 0.1 0.2 0.3 0.4 0.5 0.6 Crystallization propensity (median +/- 25 centile) resolution [Å] Crystallization score 3rd degree polynomial Empirical tests show that fDETECT predicts crystallization propensity with competitive predictive performance while being six orders of magnitude faster than similarly accurate methods. Our evaluation also shows an interesting trend between structure resolution and the predicted crystallization propensity score; higher score implies, on average, that the crystal resolution of the structure is better. There are 1,734,048 protein families clustered at 30% sequence identity level (modeling families), in the analyzed 1,953 proteomes. Each superkingdom show very different overall propensity for crystallization. 0% 20% 40% 60% 80% 100% 0 0.2 0.4 0.6 0.8 1 Crystallization propensity score All Superkingdoms Archaea Bacteria Eukaryota Viruses Archaea+Bacteria Viruses Eukaryota 25 th centile score on PDB median score on PDB 75 th centile score on PDB structural coverage [% of clusters above the score] Structural coverage using X-ray crystallography for a current snapshot of the protein universe Marcin J Mizianty 1 , Xiao Fan 1 , Jing Yan 1 , Eric Chalmers 1 , Christopher Woloschuk 1 , Andrzej Joachimiak 2 & Lukasz Kurgan 1,* 1 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada and 2 Midwest Center for Structural Genomics, Biosciences Division, Argonne National Laboratory, USA * [email protected] MCSG Can three-dimensional structures of all protein families be determined using X-ray crystallography? Utilizing an accurate and time-efficient method for Fast DEtermination of Targets’ Eligibility for CrysTalization (fDETECT), we performed a first-of-its-kind analysis of crystallization propensity for all proteins encoded in nearly 2,000 fully sequenced genomes. We computed the attainable structural coverage that combines use of current crystallization protocols, and homology modeling, by utilizing crystallization propensity scores for target selection. Introduction We would like to gratefully acknowledge the travel award from Genome Canada to M.J.M. This work was supported by the Dissertation fellowship awarded by the University of Alberta to M.J.M.; and by the Discovery grant from the NSERC of Canada to L.K. Acknowledgments Analysis of the predicted crystallization propensity reveals that three superkingdoms show very different overall propensity for crystallization, with archaea being the easiest to crystallize and eukaryota the hardest. It appears that crystallization propensities of viral proteomes have properties similar to their host proteomes. We also show that current X-ray crystallography know-how combined with homology modeling (using 30% sequence identity cut off) can provide an average structural coverage of 73%. The structural coverage could be increased to 96% by improving homology modeling, if it would produce reliable predictions at 20% sequence identity. Moreover, the use of the knowledge-based target selection strategy significantly increases structural coverage. Results Using homology modeling at 30% sequence identity cut-off virtually all bacterial and archaeal proteomes as well as bacteriophages can be structurally covered at above 70%. However, majority of the eukaryotes and eukaryotic viruses have coverage well below 70%. 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 0 200 400 600 800 1000 1200 1400 proteomes Bacteria Archaea Eukaryota Viruses Bacteria Viruses Eukaryota Homo sapiens structural coverage [% of clusters above the median PDB score] *with homology modeling at 30% seq. ident. Bacterial proteomes have the widest distribution of crystallization scores that overlap with scores for archaea proteomes and, to a smaller extend, with eukaryotes. Propensities for archaea and eukaryotes show virtually no overlap. 0 20 40 60 80 100 120 140 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 number of proteomes median crystallization propensity score Bacteria Archaea Eukaryota Viruses 0 2 4 6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Viruses - Bacteria Viruses - Eukaryota Method Avg. time [ms] AUC fDETECT 0.8 0.754 PPCpred 152912.9 0.741 XtalPred 70624.4 0.665 CRYSTALP2 0.3 0.658 SVMcrys 153.3 0.615 OBScore 64 0.569 ParCrys N/A 0.557

Structural coverage using X-ray crystallography for MCSGbiomine.cs.vcu.edu/papers/poster3DSig2013.pdf · The structural coverage could be increased to 96% by improving homology modeling,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Structural coverage using X-ray crystallography for MCSGbiomine.cs.vcu.edu/papers/poster3DSig2013.pdf · The structural coverage could be increased to 96% by improving homology modeling,

Materials & Methods

Utilizing fDETECT, we performed a first-of-its-kind analysis of predicted crystallization propensity of 8,652,940 non-redundant proteins from 1,953 fully sequenced proteomes (106 archaea, 1,043 bacteria, 265 eukaryotes and 539 viruses) collected from release 2012_07 of UniProt.

0.1

0.2

0.3

0.4

0.5

0.6

Cry

stal

lizat

ion

pro

pe

nsi

ty

(me

dia

n +

/- 2

5 c

en

tile

)

resolution [Å]

Crystallization score3rd degree polynomial

Empirical tests show that fDETECT predicts crystallization propensity with competitive predictive performance while being six orders of magnitude faster than similarly accurate methods. Our evaluation also shows an interesting trend between structure resolution and the predicted crystallization propensity score; higher score implies, on average, that the crystal resolution of the structure is better.

There are 1,734,048 protein families clustered at 30% sequence identity level (modeling families), in the analyzed 1,953 proteomes. Each superkingdom show very different overall

propensity for crystallization.

0%

20%

40%

60%

80%

100%

0 0.2 0.4 0.6 0.8 1Crystallization propensity score

All SuperkingdomsArchaeaBacteriaEukaryotaViruses Archaea+BacteriaViruses Eukaryota

25

th c

enti

le s

core

on

PD

B

med

ian

sco

re o

n P

DB

75

th c

enti

le s

core

on

PD

B

stru

ctu

ral c

ove

rage

[%

of

clu

ster

s a

bo

ve t

he

sco

re]

Structural coverage using X-ray crystallography for a current snapshot of the protein universe

Marcin J Mizianty1, Xiao Fan1, Jing Yan1, Eric Chalmers1, Christopher Woloschuk1, Andrzej Joachimiak2 & Lukasz Kurgan1,*

1Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada and 2Midwest Center for Structural Genomics, Biosciences Division, Argonne National Laboratory, USA *[email protected]

MCSG

Can three-dimensional structures of all protein families be determined using X-ray crystallography? Utilizing an accurate and time-efficient method for Fast DEtermination of Targets’ Eligibility for CrysTalization (fDETECT), we performed a first-of-its-kind analysis of crystallization propensity for all proteins encoded in nearly 2,000 fully sequenced genomes. We computed the attainable structural coverage that combines use of current crystallization protocols, and homology modeling, by utilizing crystallization propensity scores for target selection.

Introduction

We would like to gratefully acknowledge the travel award from Genome Canada to M.J.M. This work was supported by the Dissertation fellowship awarded by the University of Alberta to M.J.M.; and by the Discovery grant from the NSERC of Canada to L.K.

Acknowledgments

Analysis of the predicted crystallization propensity reveals that three superkingdoms show very different overall propensity for crystallization, with archaea being the easiest to crystallize and eukaryota the hardest. It appears that crystallization propensities of viral proteomes have properties similar to their host proteomes. We also show that current X-ray crystallography know-how combined with homology modeling (using 30% sequence identity cut off) can provide an average structural coverage of 73%. The structural coverage could be increased to 96% by improving homology modeling, if it would produce reliable predictions at 20% sequence identity. Moreover, the use of the knowledge-based target selection strategy significantly increases structural coverage.

Results

Using homology modeling at 30% sequence identity cut-off virtually all bacterial and archaeal proteomes as well as bacteriophages can be structurally covered at above 70%. However,

majority of the eukaryotes and eukaryotic viruses have coverage well below 70%.

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

0 200 400 600 800 1000 1200 1400proteomes

BacteriaArchaeaEukaryotaViruses BacteriaViruses EukaryotaHomo sapiens

stru

ctu

ral c

ove

rage

[%

of

clu

ster

s ab

ove

th

e m

edia

n P

DB

sco

re]

*with homology modeling at 30% seq. ident.

Bacterial proteomes have the widest distribution of crystallization scores that overlap with scores for archaea proteomes and, to a smaller extend, with eukaryotes. Propensities for

archaea and eukaryotes show virtually no overlap.

0

20

40

60

80

100

120

140

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60

nu

mb

er o

f p

rote

om

es

median crystallization propensity score

BacteriaArchaeaEukaryotaViruses 0

2

4

6

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Viruses - BacteriaViruses - Eukaryota

Method Avg. time [ms] AUC

fDETECT 0.8 0.754 PPCpred 152912.9 0.741 XtalPred 70624.4 0.665

CRYSTALP2 0.3 0.658 SVMcrys 153.3 0.615 OBScore 64 0.569 ParCrys N/A 0.557