METHOD: Family Classification Scheme

METHOD: Family Classification Scheme

1) Set for a model building: 67 microbial genomes with identified protein sequences (Table 1)

2) Set for a model estimation: 73 microbial genomes (Table 2)

1) PSI-BLAST search against all sequences in Genbank.

2) Domain parsing.

3) Domain clustering.

The model was tuned on 50,000 randomly chosen full-length SwissProt protein sequences containing domains present in a 721 family subset of PfamA.

E-value (e-4) and number of rounds (6) for (1) and rules for (3) have been chosen to maximize the agreement between the generated and PfamA families.

METHOD: Domain boundaries are identified based on the location of indels in the PSI-BLAST alignment

The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly.

METHOD: Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds)

1) SCOP domains were clustered into families using the authors’ procedure (PSI-BLAST against nr and clustering).

2) For comparison, the same SCOP domains were clustered using BLAST, PSI-BLAST, SAM-T99 (HMM) and PRC (profile-profile) methods.

2) Generated families were compared with the SCOP superfamilies: any pair of domains in a generated family presented in the same SCOP superfamily is considered as a true positive, presented in different SCOP fold – as a false positive.

METHOD: Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds)

The threshold of 1% ratio of false positives versus true positives

BLAST detects 9% of true positives

HMM and PRC detect 28% of TPs

PSI-BLAST detects 18% of true positives

Authors’ method gives 32% of TPs

RESULT 1: Distribution of domain family size for 67 genomes (178,310 proteins, 249,574 domains, 31,874 families)

87.7% coverage would be provided by 6072 structures

RESULT 2: Estimation of the number of families in 1000 microbial genomes

1) Examination of a number of families as a function of the number of genomes through 100 times repeated procedure of randomly chosen genomes among 67.

67 genomes


2) Extrapolation of the data on 1000 genomes.

The model predicts ~250,000 families for 1000 genomes

The model predicts ~140,000 singletons for 1000 genomes

~1,000,000 families for 10,000 genomes


3) Testing of the extrapolation model on the complete set of 140 genomes.

RESULT 3: Structural coverage of protein families for 67 genomes

1) For each non-membrane family with more than 2 members (4907 families), a sequence profile (PSSM) was obtained using blastgp.

2) PDB sequences was run against PSSMs using RPS-BLAST. Profile with E-value e-2 was considered to represent a family. Such family was considered to be structurally covered.

Overall, 20% of all families have representative structures. 3926 structures would be required to complete the coverage.

RESULT 3: Structural coverage of protein families for 67 genomes

Suggested strategy of obtaining representative structures for the largest families first: ~4000 structures are required to obtain complete coverage of all families with three or more members, covering 88% of the domains in these genomes.

RESULT 4: Estimation of structural coverage for 1000 microbial genomes

1) Examination of a number of structures needed to obtain coverage of various fractions of proteins (assuming structures for the largest families are obtained first) through 100 times repeated simulation of randomly chosen genomes among 67.



.

A total of 250,000 structures would be required to obtain 100% coverage of protein domains families from 1000 genomes



Representative structures for about 8,000 families will provide 70% coverage


3) Testing of the extrapolation model on the complete set of 140 genomes.

?

DISCUSSION

• Domain boundaries detection method. It doesn’t split at many domain boundaries that are obvious at the structure level. The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly.

• Clusterization procedure was tuned on a set of PfamA domains. PfamA omits a large number of evolutionary relationships (placing related proteins in different families). PfamA also focuses on larger families, whereas the genome data are dominated by smaller families. As a result, a better method for PfamA may not necessarily be optimum on genome data.

• The final procedure of family generation was benchmarked on a set of SCOP superfamilies. SCOP may be unrepresentative of proteins in genomes as a whole (not including disorder proteins and that form part of complexes).

Documents

METHOD: Family Classification Scheme