Searching for Protein Classification Features

Boise State UniversityScholarWorksElectrical and Computer Engineering FacultyPublications and Presentations

Department of Electrical and ComputerEngineering

1-1-2005

Searching for Protein Classification FeaturesJennifer A. SmithBoise State University

This document was originally published by IEEE in The 2005 IEEE Congress on Evolutionary Computation. Copyright restrictions may apply. DOI:10.1109/CEC.2005.1554744

https://scholarworks.boisestate.edu

https://scholarworks.boisestate.edu/electrical_facpubs

https://scholarworks.boisestate.edu/electrical_facpubs

https://scholarworks.boisestate.edu/electrical

https://scholarworks.boisestate.edu/electrical

http://dx.doi.org/10.1109/CEC.2005.1554744

648

Searching for Protein Classification Features

Scott F. SmithDepartment of Electrical and Computer Engineering

Boise State UniversityBoise, ID [email protected]

Abstract- A genetic algorithm is used to search for aset of classification features for a protein superfamilywhich is as unique as possible to the superfamily.These features may then be used for very fastclassification of a query sequence into a proteinsuperfamily. The features are based on windows ontomodified consensus sequences of multiple alignedmembers of a training set for the protein superfamily.The efficacy of the method is demonstrated usingreceiver operating characteristic (ROC) values andthe performance of resulting algorithm is comparedwith other database search algorithms.

1 Introduction

The need for faster sequence classification algorithms hasbeen demonstrated by (Cameron et al 2004). They notethat search times for the GenBank (Benson et al 2004)sequence database have increased by seven fold over theperiod 1999 to 2003 due to the exponential increase in thenumber of available sequences. This seven-fold increasewould have been even greater if not for the increase inspeed of available processors over the same period. Thistrend is expected to continue.

In this paper, we attempt to increase the speed ofdatabase search algorithms by finding features in thesequences of a protein superfamily which are as unique aspossible. The discovered features will then be used forungapped matching to a query sequence. These featureswill be defined as subsequences of a modified version ofthe consensus sequence of the multiple alignment of thesequences used to train for a superfamily.

In order to determine how well the superfamilyclassification works, a set of superfamily classificationswill be used that are determined independently of theprotein primary sequence. This can be done for thoseproteins that have known three dimensional structure.Such a classification is available in the form of theASTRAL Compendium for Sequence Structure Analysisdatabase (Chandonia et al 2004). ASTRAL containssequences annotated with superfamily classification(among other things) for those proteins represented in theProtein Data Bank (Berman et al 2000). Superfamilyclassifications correspond to those used by the StructuralClassification of Proteins (SCOP) database (Andreeva etal 2004). Version 1.67 of ASTRAL is used with

sequences containing more than 40% sequence identityeliminated. Sequence redundancy elimination to 40%identity is done automatically by the ASTRAL webserver.

The metric used for classification accuracydetermination is the receiver operating characteristic(ROC). This value summarizes the sensitivity andspecificity of the algorithm in a single value between 0and 1, where higher values are better. The use of theROC to measure search accuracy is discussed in(Gribskov and Robinson 1996).

Comparisons of the speed of a protein classificationalgorithm based on the classification features is comparedwith the Smith-Waterman (Smith and Waterman 1981),Fasta (Pearson and Lipman 1985), and BLAST (Altschulet al 1997) scoring methods and the proposed algorithm isfound to be faster. The speed improvement comes fromthe small number of ungapped comparisons that arerequired.

The form of the features used is discussed in Section 2.The use of evolutionary computation to search the verylarge space of possible features for a good feature set isshown in Section 3. In Section 4, an algorithm forscoring possible superfamily membership of a querystring is introduced. The effectiveness of the algorithm interms of ROC scores for the ASTRAL database is shownin Section 5. Section 6 discusses the speed of thealgorithm and Section 7 concludes.

2 Definition of Features

The usual definition of a consensus sequence is asequence which has the symbol that occurs mostfrequently at each position of a multiple alignment. Thedefinition of a consensus sequence which is used in thispaper will instead be a sample sequence drawn from apopulation of sequences with the probability of aparticular symbol at each position equal to the observedfrequency of that symbol in the multiple alignment. Thiscan also be thought of as a sample from the position-specific scoring matrix (PSSM) of the multiple alignment(Mount 2001, pp. 192-198). The mode of the consensussequence definition in this paper is therefore the usualdefinition of consensus sequence.A feature set for a particular protein family will be

defined by a particular consensus sequence draw asdefined above in combination with a binary window

0-7803-9363-5/05/$20.00 ©2005 IEEE.

Authorized licensed use limited to: Boise State University. Downloaded on April 24, 2009 at 18:23 from IEEE Xplore. Restrictions apply.

649

function which selects which ranges of the consensussequence to include and which to omit. The windowfunction will contain one or more groups of ones whichrepresent regions of the consensus sequence which mustbe matched in an ungapped manner when producing thescore for a query sequence.

The number of combinations of possible consensussequences and window functions is very large and there isno hope of trying them all. The measure of fitness of agiven combination of consensus sequence and windowfunction will be the ROC score for protein classificationtaken using the training set for a protein superfamily.Calculation of this ROC score is itself not trivial, makingexhaustive search completely infeasible.

3 Method for Searching for Good Features

A genetic algorithm is used to find good features. Anindividual consists of a consensus sequence and a binarywindow function. A mutation in the consensus sequencetakes the form of redrawing a representative residue at arandomly selected multiple alignment position from theobserved frequency of residues at that position in thetraining set. A mutation in the window function takes theform of changing a 0 to a 1 or vice versa, where thechange is constrained to take place at one of the twopositions adjacent to a change location in the initialwindow function. This is equivalent to allowing a groupof ones to lengthen or shorten by one position. It alsoallows groups to merge and to disappear. Insertions anddeletions are also allowed to take place in the windowfunction by changing a contiguous range of positions allto ones or all to zeros.

An initial population of size N is created by generatingN consensus sequences from the observed distribution ofresidues at each multiple alignment position in thetraining set for the superfamily. The window functionsfor each member of the initial population are generated byrandomly selecting one, two, or three groups of lengthuniformly distributed between three and ten positions tobe set to one and all other positions remain zero.

At each generation, each individual is scored againstevery sequence in the database and a ROC valuecalculated. The scores are found using the BLOSUM62substitution matrix (Henikoff and Henikoff 1993) andmatching the consensus sequence groups determined bythe window function to the query sequence in anungapped fashion. The consensus sequence group ispaired with every possible remaining substring of thequery sequence with the same length and the maximum-valued pairing selected. A remaining substring is part ofthe database sequence starting at a higher position numberthan the last position number which generated themaximum value for the previous feature. The partialscore determined from each feature is summed to get thefinal score.

Figure 1 shows the ranges used for ungappedalignment for the partial score associated with each

feature. In the figure "Feature #1 Range" is shown tocover all the residue positions in the entire querysequence. If the range for feature #1 is residue positions1 through M and feature #1 is labeled with residuepositions 1 through K, then feature #1 positions will firstbe paired with query sequence positions 1 through K, then2 though K+1, and so on until feature #1 positions 1through K are finally paired with query sequencepositions M-K+l through M. Over the M-K-l pairings,the maximum valued pairing is found and is shown inFigure 1 as "Feature #1 Max." The range for feature #2then starts one residue toward the C-terminal end of thequery sequence and goes to the end of the sequence(protein sequences are conventionally written starting atthe N-terminal end of the protein and ending with the C-terminal end).

MFeature#lMax]

Feature #1 Range

Full Query Sequence

Feature #2 Range

Figure 1: Scoring of a Database Sequence

The ROC value is found using the equation

ROCn =1 un

nL<F<n

where n is a false-positive cutoff (with n - 50 used in thisstudy) and L is the number of true positives found beforen false positives are found. The value UF is the number oftrue positives ranked before false positive number F. AROC50 value of 1.0 means that all 50 false positives rankbelow all of the true positives (up to the cutoff) and ascore of 0.0 means that 50 false positives are ranked at thetop of the list.

The population size used to generate the results of thispaper is N = 101. At each new generation, the bestindividual was retained and five copies each of the twentybest individuals were generated and subjected to possiblemutation. The consensus sequences of each individualwere mutated by redrawing the consensus residue fromthe observed distribution with probability 0.001 perresidue position. The window functions were thenmutated with the following five operations in the orderstated: crossover, delete group, insert group, lengthengroup, shorten group. With probability 0.02 perindividual, the window function of an individual wascrossed with another individual with crossover pointuniformly distributed over the length of the windowfunction (which is the same for all individuals). Withprobability 0.02 per individual, one group of contiguous

649


650

Is was changed to Os within the individual with the groupdeleted selected uniformly. If an individual only had onegroup, the delete operation was skipped. With probability0.02 per individual, a range of window function positionswas set to 1 where the length of the range was uniformlydistributed between three and ten and the start positionuniformly distributed along the window function length.This group insertion operation may or may not increasethe number of groups depending on whether the insertedgroup ends in the middle of an existing group or not.With probability 0.2 per group end (0 to 1 or 1 to 0transition), the 0 adjacent to the group end was changedto 1 (group lengthening operation). With probability 0.2per group end, the 1 adjacent to a group end was changedto a 0 (group shortening operation). The GA was iteratedfor twenty generations and the best individual fromgeneration twenty taken as the solution.

4 Search Algorithm Based on the Features

one-time preprocessing of the database to find thefeatures.

5 Efficacy of the Search Algorithm

The effectiveness of the feature-based search algorithmfor finding members of a superfamily in the ASTRAL1.67 database limited to 40% sequence identity is shownin Table 1. All superfamilies in the database with at least40 members are shown. This was done so that 20superfamily members could be used to estimate thefeatures and at least 20 superfamily members would beleft as possible true positives when finding ROC values.The superfamily members used to estimate features wereremoved from the database so that the training set and testset would be independent. In actual application of thealgorithm all known members of a superfamily would beused to estimate the features. The column labeled N is thenumber of superfamily members in the database.

Searching for a superfamily similar to a query string is atwo step process. In the first step, a database ofsequences annotated with superfamily information is usedto find a set of good features for each superfamily in thedatabase. This step needs to be done only once for thedatabase and can therefore be rather computationallyexpensive, as long as it is not so expensive as to beinfeasible. The second step needs to be repeated for everyquery sequence. In this step, the set of features for eachsuperfamily must be scored against the query sequence.This step needs to be fast.

The second step is done exactly as the scoring wasdone in the feature search of step one. The features for agiven superfamily are an ordered set. The first feature iscompared to every possible overlap range of the fullquery sequence. The highest valued comparison is takenas the partial score for feature #1. The comparisons aredone as an ungapped alignment using any standardsubstitution matrix. The location in the query sequence isnoted and feature #2 is compared using ungappedalignment in the portion of the query sequence startingone position beyond the C-terminal end of the feature #1location. This process is continued for all features of thesuperfamily. If any feature does not fit into the remainingquery sequence range it and all higher numberedsequences receive a partial score of zero. The overallscore is the sum of the partial scores.A large speed advantage is expected from the

algorithm since the number of residues in the feature setsfor each superfamily will be much less than the totalnumber of residues in the database and the matching is anungapped alignment. The number of residues is muchsmaller than the database since the sum of the lengths ofthe features will generally be less than the average lengthof the sequences in the superfamily and there is only oneset of features covering all the sequences in thesuperfamily. The cost of this speed advantage will be a

TABLE 1ROCso FOR SUPERFAMILIES WITH AT LEAST 40 MEMBERSSuperfamilya.4.1a.4.5a.39.1b.l.lb.1.18b.1.2b.29.1b.40.4c.1.8c.2.1c.3.1c.37.1c.47.1c.66.1c.67.1c.69.1Mean:

N43844112249524565771325315864534361

SWROC5o0.6400.5600.9300.5730.7870.5050.8330.8950.8800.9030.3400.0000.8340.7230.8870.8200.694

BLASTROCso0.8500.6100.8370.5180.0000.6600.9250.2100.6460.8200.5700.4050.5000.7970.8850.4820.607

QuerySeq.dlhlva2dlpp7u_dlhqva_dlcid 2dlsoxaldlf42a2dldypa_dlpyba_dlug6a_dlvjOa2dllqtaldl qhxa_dleejaldlfpld2dluula_dlrlda

Feat.ROC500.8000.8000.9420.8060.0000.6300.4400.3200.7400.4730.6850.6780.9840.7480.4700.5400.629

# ofFeat.121312

11

4331232

The ability of the proposed algorithm to findsuperfamily members is compared with that of the Smith-Waterman and BLAST algorithms. One superfamilymember was chosen at random from each superfamily tobe a query sequence for the Smith-Waterman and BLASTsearches. The ASTRAL database name for this sequenceis given in the "Query Seq." column of Table 1 for thesake of reproducibility. The ASTRAL database was thensearched using the SSEARCH and BLAST programs andROC values calculated from the results. These ROCvalues are listed in the "SW ROC50" and "BLASTROC50" columns of Table 1. The mean ROC value overthe sixteen superfamilies is 0.694 using Smith-Watermanand 0.607 using BLAST.

The ROC values for the proposed algorithm werecalculated by searching the ASTRAL database (with the20 superfamilies used to estimate the features excluded)with the set of features for that superfamily. The resultingROC values are shown in the Features ROC5o column ofTable 1. The mean ROC value using the proposed

650


651

algorithm is 0.629, which is not very much less than theSmith-Waterman value and slightly more than theBLAST value. The ROC value difference of 0.065between Smith-Waterman and the proposed algorithm isequivalent to moving the true positives in a ranked listincluding all true positives and fifty false positives downan average of a little more than three positions.

TABLE 2FEATURE #1 FOR EACH SUPERFAMILY

Super- Feature #1familya.4.1 VWFQNRRa.4.5 LDAEEFKLLTLISIIEEGELTVKEIAEALa.39.1 AEVDEMFKELDTNGDGEIDFEEFLRLVb.1.1 SLLVKEGETVTLSCbi. 18 VGAASTIGGNVITVLGKGDVKASALLYFGb. 1.2 NSTITGYRVTYVPKNb.29.1 IVNGPWHNVGWGREERSVVLPFDSGKb.40.4 GKEAEVVAELLLKEc. 1.8 VVGADPDKAVTFVDNHc.2. 1 GPGGSVVVVGAAGAVGLVAVQIAKALGc.3.1 IIVGAGLSGLAAAYRLSEAGKNVLLVEc.37.1 RIVIEGPPGAGKSTc.47. 1 PWCGPCKAAKPc.66.1 ALFLLLPLNADARLLKDVLLEAFDDDKLSALVKK

L DLc.67.1 PNNPTGLVPPLELGEDIVDHAATKGINGHSDAAYG

GFAAGc.69.1 SYDGDYLAAGENVIVV

TABLE 3FEATURE #2 FOR EACH SUPERFAMILY THAT HAS ONE

Super- Feature #2familya.4.5 AVVRAIKKLEDKGLISRb. 1.1 STLTITSAQPEDSATYYCb.1.2 YEVSVIALNGRGESc.2.1 VDFALDTVGc.3. 1 LVGAKLAAAGREILSVARKEDQEIQSLRc.37. 1 LLRDAEEIGKDLGFPAARYLDGDMIIADLLLELAL

LKc.66. 1 LNGKGLLSILPMRRAIDATAREDDGAREIKLSKEA

GFc.67.1 AVLSAFSKAFGLRGc.69. 1 GLEDQLAALEWLKENAAAFGGDPKRITIFGESAG

GLSVAALLLLPLDK

TABLE 4FEATURE #3/#4 FOR EACH SUPERFAMILY THAT HAS THEMSuper- Feature #3familyb. 1.1 DKLIFGQGc.2.1 VGAPGGALTAPc.3.1 IQTADGSKGANIVVSADGTFc.37.1 FPRRARQAEALDEALc.67. 1 LRISVGIEDLDDLLADLEAAL

Feature #4c.2.1 NKSKFEEALDFLAQG

The discovered features for each superfamily areshown in Table 2 (first feature), Table3 (second feature),and Table 4 (third and fourth features). None of thesixteen cases resulted in more than four features and onlyone case had a fourth feature. The average number ofresidues in a feature is 19.1 and the average number of

features is just under 2. The features tend to come frompositions in the multiple alignment which are highlyconserved. That is, features tend to come from portionswith few gaps and where substitution matrix values aregenerally large and positive.

The multiple alignments of the twenty trainingsequences from each superfamily were found usingCLUSTALW (Higgins and Sharp 1988).

6 Performance of the Search Algorithm

The inner loop of the proposed search algorithm involvestaking a sum of substitution matrix values and thencomparing the sum to a previous maximum sum. For anaverage length feature, this means looking up 19substitution matrix values, summing the values, andcomparing the sum to the maximum sum generated at allpreceding N-terminal positions. The 19 table look-upsand additions dominate the computation.

The Smith-Waterman algorithm also requires asubstitution matrix value look-up and an addition for eachamino-acid match investigated as well as many additionalcomputations at each attempted match. As a result,counting the number of compared amino acids in eachalgorithm should give an indication of the relative speedof the two algorithms which underestimates thecomputation time of Smith-Waterman relative to theproposed algorithm. The equations used by the Smith-Waterman algorithm are:

Iij= max {Ii Ij - c, Mi,j- g}

Dij = max {Di,, - c, Mij-1 - g}

Mij = max {Iij1I + d(ai,bj), Di-,j-1+ d(ai,bj), Mi-I>1l + d(ai,bj), 0}.

I is the score if the current sub-alignment ends with aninsert, D is the core if the sub-alignment ends with adelete, andM is the score if the sub-alignment ends with amatch or mutation. The penalty for initiating a gap is gand for continuing a gap is c.

The value d(a1,bj) is the substitution matrix lookup-value which is added to three different values. Fouradditional subtractions, a four-way maximum, and twotwo-way maximums are also performed. All of theseoperations are done every time two amino acids arecompared. The single maximum done in the proposedalgorithm is done once for every sum (an average of onceper 19 amino acid comparisons).

The number of amino acid comparisons for the Smith-Waterman algorithm is relatively easy to estimate. Smith-Waterman does a compare for every possible pair ofresidues in the query string versus the database string.For an average case we expect this algorithm to do anumber of amino acid compares equal to the square ofaverage length of a protein sequence. This is actually anunderestimate since the square of an expected value is

651


652

less than the expected value of the square of the samerandom variable that produces values greater than one.

The number of amino acid comparisons done by theproposed algorithm is harder to estimate since the numberof comparisons done for feature #2 or higher depends onwhere the best match was found for feature #1. In orderto continue overestimating the compute time of theproposed algorithm relative to Smith-Waterman, we shallassume that the best match is at the first (N-terminal)position of the query sequence. For an average lengthfeature this means 19*(L-19) comparisons for feature #1,where L is the length of the query string. For feature #2,there would be 19*(L-2*19) comparisons when the bestmatch for feature #1 was N-terminal. For feature #n, thenumber of matches is 19*(L-n*19) for average lengthfeatures.

The average sequence length in the ASTRAL 1.67database limited to no more than 40% sequence identity is181 (L = 181). There are 6600 sequences in the databaserepresenting 1445 superfamilies. The average number ofdiscovered features in the sixteen cases examined abovewas two, so the number of amino acid comparisons for atypical case using the proposed algorithm is estimated tobe 19*(181-19) + 19*(181-38) = 5795. For the Smith-Waterman algorithm, the estimated number ofcomparisons is L2 = 32,761. In addition to a more thanfive-fold decrease in the number of amino acidcomparisons for each score generated, the number oftimes that the query sequence needs to be scored isreduced. With Smith-Waterman, the query string isscored against every database sequence (with multiplesequences existing for most superfamilies). With theproposed algorithm, the query string is scored onceagainst one set of features for each superfamily. For theASTRAL database figures above, this means that 6600 /1445 = 4.57 times as many scores are generated withSmith-Waterman. Since most sequence databases withsuperfamily annotation include far more sequences persuperfamily, this advantage of the proposed algorithm isbeing underestimated here. For example, the Pfamdatabase (Bateman et al 2004) has thousands of membersassociated with many of its classifications. The overallresult in this study is that the Smith-Waterman algorithmdoes 4.57 * (32,761 / 5795) = 25.8 times as many aminoacid comparisons. It is likely that the additionalcomputation time for Smith-Waterman is much more that25.8 times as great.

Table 5 shows the relative execution times of theproposed algorithm as well as several popular pair-wisedatabase search algorithms. The relative speed of theproposed algorithm ("Features") and the Smith-Watermanalgorithm were determined above. The speed of twoflavors of Fasta and of BLAST relative to Smith-Waterman are from (Brenner et al 1998).

TABLE 5RELATIVE EXECUTION TIMES

Features 1.0*Smith-Waterman 25.5Fasta (ktup = 1) 3.9Fasta (ktup = 2) 1.4BLAST 1.0

* Very conservative estimate. Expected to be much lower in practice.

7 Conclusions

A new algorithm for classifying a query sequence into aprotein superfamily has been introduced. The algorithmis faster than the Smith-Waterman algorithm at theexpense of significant one-time processing of the databaseto find superfamily features which can be used forungapped alignment during query sequence scoring. Thealgorithm has been found to be only slightly less effectiveat superfamily classification than the Smith-Watermanalgorithm. Determination of a good set of features for asuperfamily during the preprocessing phase isaccomplished using a genetic algorithm.

Future work on this algorithm will examine ways tomake the preprocessing step converge more rapidly on asolution either by altering the way the initial population ischosen or changing the way new generations are selectedand modified. Also, the fitness function may be modifiedto favor shorter and fewer features in an attempt to furtherreduce the time required for scoring at a hopefully smallpenalty in reduced ROC value.

Acknowledgments

The project described was supported by NIH GrantNumber P20 RRO16454 from the INBRE Program of theNational Center for Research Resources.

BibliographyS. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang,W. Miller, and D. Lipman (1997) "Gapped BLAST andPSI-BLAST: A New Generation of Protein DatabaseSearch Programs," Nucleic Acids Research, vol. 25, no.17, pp. 3389-3402.

A. Andreeva, D. Howorth, S. Brenner, T. Hubbard, C.Chothia, and A. Murzin (2004) "SCOP Database in 2004:Refinements to Integrate Structure and Sequence FamilyData," Nucleic Acids Research, vol. 32, no. 1, pp. D226-D229.

A. Bateman, L. Coin, R. Durbin, R. Finn, V. Hollich, S.Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.Sonnhammer, D. Holme, C. Yeats and S. Eddy (2004)"The Pfam Protein Families Database," Nucleic AcidsRes., vol. 32, pp. D138-D141.

652


653

D. Benson, I. Karsch-Mizrachi, D. Lipman, J. Ostell andD. Wheeler (2004) "GenBank: Update," Nucleic AcidsResearch, vol. 32, no. 1, pp. D23-D26.

H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat,H. Weissig, I. Shindyalov, and P. Borne (2000) "TheProtein Data Bank," Nucleic Acids Research, vol. 28, pp.235-242.

S. Brenner, C. Chothia, and T. Hubbard (1998)"Assessing Sequence Comparison Methods with ReliableStructurally Identified Distant EvolutionaryRelationships," Proc. of the National Academy ofSciences, vol. 95, pp. 6073-6078.

M. Cameron, H. Williams, and A. Cannane (2004)"Improved Gapped Alignment in BLAST," IEEE/ACMTrans. on Computational Biology and Bioinformatics,vol. 1, no. 3, pp. 116-129.

J. Chandonia, G. Hon, N. Walker, L. Lo Conte, P. Koehl,M. Levitt, and S. Brenner (2004) "The ASTRALCompendium in 2004," Nucleic Acids Research, vol. 32,no. 1, pp. D189-D192.

M. Gribskov and N. Robinson (1996) "Use of ReceiverOperating Characteristic (ROC) Analysis to EvaluateSequence Matching," Computers and Chemistry, vol. 20,pp. 25-33.

S. Henikoff and J. Henikoff (1993) "PerformanceEvaluation of Amino Acid Substitution Matrices,"Proteins Struct. Funct. Genet., vol. 17, pp. 49-61.

D. Higgins and P. Sharp (1988) "CLUSTAL: A Packagefor Performing Multiple Sequence Alignment on aMicrocomputer," Gene, vol. 73, pp. 237-244.

D. Mount (2001) Bioinformatics: Sequence and GenomeAnalysis, Cold Spring Harbor Laboratory Press.

W. Pearson and D. Lipman (1985) "Improved Tools forBiological Sequence Comparison," Proceedings of theNational Academy of Sciences, vol. 85, no. 8, pp. 2444-2448.

T. Smith and M. Waterman (1981) "Identification ofCommon Molecular Subsequences," Journal ofMolecular Biology, vol. 147, no. 1, pp. 195-197.

653


Documents

Searching for Protein Classification Features