15
Article Probabilistic identification of spin systems and their assignments including coil–helix inference as output (PISTACHIO) Hamid R. Eghbalnia a,b,d, *, Arash Bahrami a,b,c , Liya Wang a,b,c , Amir Assadi d & John L. Markley a,b,c a Biochemistry Department, National Magnetic Resonance Facility at Madison, 433, Babcock Drive, Madison, WI, 53706, USA; b Biochemistry Department, Center for Eukaryotic Structural Genomics; c Graduate Pro- gram in Biophysics, University of Wisconsin-Madison, Madison, WI, 53706, USA; d Mathematics Depart- ment, University of Wisconsin-Madison, 811, Van Vleck Hall, 480 Lincoln Drive, Madison, WI, 53706, USA Received 9 March 2005; Accepted 12 May 2005 Key words: automation, backbone assignments, particle interaction model, sidechain assignments Abstract We present a novel automated strategy (PISTACHIO) for the probabilistic assignment of backbone and sidechain chemical shifts in proteins. The algorithm uses peak lists derived from various NMR experiments as input and provides as output ranked lists of assignments for all signals recognized in the input data as constituting spin systems. PISTACHIO was evaluate00000000d by comparing its performance with raw peak-picked data from 15 proteins ranging from 54 to 300 residues; the results were compared with those achieved by experts analyzing the same datasets by hand. As scored against the best available independent assignments for these proteins, the first-ranked PISTACHIO assignments were 80–100% correct for backbone signals and 75–95% correct for sidechain signals. The independent assignments benefited, in a number of cases, from structural data (e.g. from NOESY spectra) that were unavailable to PISTACHIO. Any number of datasets in any combination can serve as input. Thus PISTACHIO can be used as datasets are collected to ascertain the current extent of secure assignments, to identify residues with low assignment probability, and to suggest the types of additional data needed to remove ambiguities. The current implementation of PISTACHIO, which is available from a server on the Internet, supports input data from 15 standard double- and triple-resonance experiments. The software can readily accommodate additional types of experiments, including data from selectively labeled samples. The assignment probabilities can be carried forward and refined in subsequent steps leading to a structure. The performance of PISTACHIO showed no direct dependence on protein size, but correlated instead with data quality (completeness and signal-to-noise). PISTACHIO represents one component of a comprehensive probabilistic approach we are developing for the collection and analysis of protein NMR data. Introduction NMR resonance assignment, the key step in data analysis in which chemical shifts are assigned to individual nuclei in a covalent structure, has per- sisted as one of the main challenges in solving structures of proteins from NMR data. Automa- tion of this step is desirable from the standpoints of speeding up the process of structure determination and of providing a more objective analysis of the input data. Most assignment strategies analyze data from multidimensional, multinuclear datasets by a stepwise approach that derives from one * To whom correspondence should be addressed. E-mail: [email protected] Journal of Biomolecular NMR (2005) 32: 219–233 Ó Springer 2005 DOI 10.1007/s10858-005-7944-6

Probabilistic identiï¬cation of spin systems and their

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Probabilistic identiï¬cation of spin systems and their

Article

Probabilistic identification of spin systems and their assignments includingcoil–helix inference as output (PISTACHIO)

Hamid R. Eghbalniaa,b,d,*, Arash Bahramia,b,c, Liya Wanga,b,c, Amir Assadid &John L. Markleya,b,caBiochemistry Department, National Magnetic Resonance Facility at Madison, 433, Babcock Drive, Madison,WI, 53706, USA; bBiochemistry Department, Center for Eukaryotic Structural Genomics; cGraduate Pro-gram in Biophysics, University of Wisconsin-Madison, Madison, WI, 53706, USA; dMathematics Depart-ment, University of Wisconsin-Madison, 811, Van Vleck Hall, 480 Lincoln Drive, Madison, WI, 53706, USA

Received 9 March 2005; Accepted 12 May 2005

Key words: automation, backbone assignments, particle interaction model, sidechain assignments

Abstract

We present a novel automated strategy (PISTACHIO) for the probabilistic assignment of backbone andsidechain chemical shifts in proteins. The algorithm uses peak lists derived from various NMR experimentsas input and provides as output ranked lists of assignments for all signals recognized in the input data asconstituting spin systems. PISTACHIO was evaluate00000000d by comparing its performance with rawpeak-picked data from 15 proteins ranging from 54 to 300 residues; the results were compared with thoseachieved by experts analyzing the same datasets by hand. As scored against the best available independentassignments for these proteins, the first-ranked PISTACHIO assignments were 80–100% correct forbackbone signals and 75–95% correct for sidechain signals. The independent assignments benefited, in anumber of cases, from structural data (e.g. from NOESY spectra) that were unavailable to PISTACHIO.Any number of datasets in any combination can serve as input. Thus PISTACHIO can be used as datasetsare collected to ascertain the current extent of secure assignments, to identify residues with low assignmentprobability, and to suggest the types of additional data needed to remove ambiguities. The currentimplementation of PISTACHIO, which is available from a server on the Internet, supports input data from15 standard double- and triple-resonance experiments. The software can readily accommodate additionaltypes of experiments, including data from selectively labeled samples. The assignment probabilities can becarried forward and refined in subsequent steps leading to a structure. The performance of PISTACHIOshowed no direct dependence on protein size, but correlated instead with data quality (completeness andsignal-to-noise). PISTACHIO represents one component of a comprehensive probabilistic approach we aredeveloping for the collection and analysis of protein NMR data.

Introduction

NMR resonance assignment, the key step in dataanalysis in which chemical shifts are assigned toindividual nuclei in a covalent structure, has per-

sisted as one of the main challenges in solvingstructures of proteins from NMR data. Automa-tion of this step is desirable from the standpoints ofspeeding up the process of structure determinationand of providing a more objective analysis of theinput data. Most assignment strategies analyzedata from multidimensional, multinuclear datasetsby a stepwise approach that derives from one* To whom correspondence should be addressed. E-mail:

[email protected]

Journal of Biomolecular NMR (2005) 32: 219–233 � Springer 2005DOI 10.1007/s10858-005-7944-6

Page 2: Probabilistic identiï¬cation of spin systems and their

described by Kurt Wuthrich and coworkers fortwo-dimensional proton spectra (Billeter et al.,1982; Wider et al., 1982). First, signals from dif-ferent experiments are aligned in comparabledimensions. Individual resonances are groupedinto spin systems that are used in the typing stage toscore the amino acid identity of spin systems. Inthe local assembly step, sequentially related spinsystems are identified. These are then mapped ontothe primary sequence in the global assembly (ormapping) stage. Different automation programsimplement each step with varying degrees of success.

A number of computerized approaches to thebackbone assignment problem have been described(Gronwald and Kalbitzer, 2004). Some softwarepackages achieve partial automation of sidechainassignments, for example, GARANT (Bartelset al., 1997) or the combination of GARANT andAUTOPSY (Koradi et al., 1998). However, to ourunderstanding, a satisfactory, flexible, and trulyautomated approach to the overall assignmentproblem has yet to be presented.Many laboratoriescontinue to invest considerable effort in manualassignments to ensure quality.

The need for a mechanism for modifying theallowed input data to include results from newNMR experiments and combinations of experi-ments can be understood by considering theproblem of experiment selection. In larger pro-teins, differences in resonance transfer efficienciesmake it difficult to obtain signals from all side-chain carbons in a single spectrum. Completesidechain assignments may require the collectionof data from specialized experiments (Celda andMontelione, 1993; Lin and Wagner, 1999), whosechoice will depend on the size of the protein, theamino acid sequence, the labeling, pattern, and theinstrumentation available for data collection. Forthese reasons, a workable approach for automatedassignment should be capable of utilizing anycombination of experiments deemed necessary bythe experimenter (Bax et al., 1990a, b; Fesik et al.,1990; Gronwald and Kalbitzer, 2004). Therefore,the algorithm used for automated assignmentshould allow for easy extension of experiment listsand should not impose any restrictions on thecombination of experiments used.

Noise is a common factor in most real-worldoptimization problems. Sources of noise includelimitations in instrument sensitivity and precision,incomplete sampling, and inconsistent human–

computer interactions. Major sources of noise inthe NMR assignment problem are variability indata quality (peak overlaps and peak widths),variability in peak picking (extra peaks or missingpeaks), and variability in spin-system scoring. Theformal theoretical underpinning we seek requiresapproaches that deal with noisy data and quantifythe effects of noise on the output. Our approach isto build methods within the well-establishedstructure of probability theory, which offers theformal theoretical ‘‘glue’’ that can connect indi-vidual pieces of the structure analysis process.Probabilistic methods have the advantage of beingable to deal with noise and uncertainty within thesame rigorous framework.

Resonance assignment programs can be cate-gorized by the methods they use in the mappingsteps. These methods include stochasticapproaches, such as simulated annealing/MonteCarlo algorithms (Buchler et al., 1997; Lukinet al., 1997; Leutner et al., 1998), genetic algo-rithms (Bartels et al., 1997), exhaustive searchalgorithms (Atreya et al., 2000; Andrec and Levy,2002; Coggins and Zhou, 2003; Jung and Zweck-stetter, 2004), heuristic comparison to predictedchemical shifts derived from homologous proteins(Gronwald et al., 1998), and heuristic best-firstalgorithms (Zimmerman et al., 1994; Li andSanctuary, 1997; Hyberts and Wagner, 2003). Forexample, AutoAssign (Moseley et al., 2001) is aconstraint-based expert system that uses a heuris-tic best-first mapping algorithm. Among thesealgorithms, those that rely on stochasticapproaches have the best potential for satisfacto-rily addressing issues arising from noise.

As we discuss in the next section, mathematicalarguments indicate that a general solution to theassignment problem requires a new approach sep-arate from those realized to date. The straightfor-ward combinatorial approach to the NMRassignment problem is to consider the set of allpossible configurations for the assignment and tofind the one with the lowest ‘‘cost’’. This approachhas two notable drawbacks. From a practicalcomputing point of view, exploring the set of allconfigurations is computationally intractable –particularly for large proteins. The more formida-ble difficulty lies in the fact that all acceptable costfunctions involve noisy, local experimental data.

The approach we use in PISTACHIO recaststhe cost function as a measure of ‘‘system energy’’

220

Page 3: Probabilistic identiï¬cation of spin systems and their

and restates the optimization problem in thephysical terms of finding a ‘‘ground state’’. Thisrestatement enables us to ask questions that turnout to have tractable solutions with practicalapplications. For example, we can ask for a ‘‘typ-ical low energy configuration’’ and find a compu-tationally feasible (polynomial time) solution thatgives a great deal of information about the‘‘ground state’’. This is pertinent to the NMRassignment problem, since the available data maynot support a unique set of assignments. An addi-tional advantage to this approach is that it yields aset of assignment configurations with their associ-ated probabilities that can be explored and refinedfurther as additional data become available.

Another difficulty arises from the local natureof the data provided by the relevant NMRexperiments. Even for an individual amino acidresidue, the database of available chemical shifts isinsufficient to construct the accurate multidimen-sional distributions necessary for building anoptimal score function. Our solution to this chal-lenge has been to build highly accurate analyticexpressions for one-dimensional chemical shiftdistributions, to use correction factors to extendthese to multidimensional score functions, and totake advantage of the sharpening of probabilitiesafforded by the multiple dimensions. This last stepis achieved by parsing the sequence of the proteininto overlapping tripeptides and by computingscores and overlap functions for these. The noveltyof our approach comes from the analysis ofoverlapping tripeptides and the algorithms wehave devised for scoring each tripeptide and forassembling them into a single scored sequence.

Probabilistic Identification of Spin Systems andtheir Assignments including coil–helix Inference asOutput (PISTACHIO) is part of a larger effort onthe automated analysis of protein NMR data. Asingle software platform has been developed thatuses as input the sequence of the protein and peaklists derived from various experimental multidi-mensional, multinuclear magnetic resonancedatasets and that provides as output chemical shift

assignments and secondary structure analysis. Theoutput is conveniently reported in NMR-STARformat that can be readily deposited in BMRB.PISTACHIO uses the package PECAN (Eghbal-nia et al., 2005) to identify secondary structureelements from the protein sequence and associatedassigned chemical shifts. Both packages are avail-able for general use from a server at http://bija.nmrfam.wisc.edu

Materials and methods

Analysis of the NMR assignment problem

NMR experiments used for backbone assignmentsyield two types of chemical shift correlations:inter-residue and intra-residue. These correlationsare used to construct intra- and inter-residue spinsystems, consisting of connected nuclei and theirchemical shifts. After intra-residue spin systemshave been constructed, the process of spin systemtyping assigns a positive score to each spin system.The score signifies a measure of correspondencebetween spin system j and residue i.

If we consider the cost of assignment, C(j,i), asthe negative of the score, then the followingassignment strategy can be formulated. Numberthe residues and spin systems with integers from 1to n as shown in the first two rows of Table 1. As-sume that asmany spin systems have been identifiedas there are residues – we will relax this assumptionshortly. Next, rearrange the residue numbers to getthe perfect assignment by minimizing a total costfunction. This is shown, for example, by the cor-respondence between the first and the third row.Note that the third row is obtained by a permuta-tion of the second row in which the numbers shownin bold are not permuted.

To formalize this problem, let r(i)=j denote apermutation of i to j. Then we can describe theoptimal assignment as a search in the space ofpermutation matrices to obtain the minimum costas follows:

Table 1. Conventional method of assigning protein NMR signals

Spin systems 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...

Residues 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...

Assigned residues 6 2 9 3 5 4 1 7 8 11 10 12 14 13 15 ...

221

Page 4: Probabilistic identiï¬cation of spin systems and their

argminr

X

i

CðrðiÞ; iÞ

In the classical mathematical setting, this problemis also known as the ‘‘assignment problem’’ wherethe term ‘‘assignment’’ describes the problemconcerned, typically finding the best way to assignn persons to n jobs. The permutation matrix,which minimizes the cost expression and deter-mines the maximum matching by use of an effi-cient algorithm of O(n3), was first demonstrated byKuhn in 1955. This was a rediscovery of work bytwo Hungarian mathematicians, D. Konig and E.Egervary, that predated the birth of linear pro-gramming by more than 15 years. The approach,therefore, is known as the ‘‘Hungarian method’’.Starting with the work of Edmonds (1965) andbuilding on earlier work of Tutte (1947), this ap-proach has been used to address satisfactorily anumber of optimization problems, in particularthe ‘‘optimal weighted graph matching problem’’.

In NMR, the inter-residue connectivity infor-mation constrains the minimization problem out-lined above by restricting the set of permutations tothose that do not violate the observed data in theexperiments. To take this into account, we couldreformulate the problem by associating a ‘‘con-nectivity cost’’ to each permutation as given by

argminr2Sn

Xn

i¼1

Xn

j¼1lijBðrðiÞ; rðjÞÞ þ

Xn

i¼1CðrðiÞ; iÞ

The second term is the same as before, whereasthe new term B represents the cost of assigning thespin system pair (i,j) to residues (r(i),r(j)). Thiscost is weighted by lij, which is a ‘‘small number’’ ifthe connectivity between pair (i,j) is observed anda ‘‘large number’’ otherwise. Although the ‘‘bestmatching problem’’ illustrated in Table 1 has anO(n3) solution, the more complicated optimizationproblem formulated above, commonly known asthe ‘‘quadratic assignment problem’’ (QAP) or the‘‘weighted bipartite graph matching problem’’(a restricted case of QAP), is known to be NP-hard (Gonzalez, 1996).

The constrained weighted bipartite graph-matching formulation approach to automatedNMR assignments (Ikura et al., 1990; Xu et al.,1993; Geerestein-Ujah et al., 1996; Moseley andMontelione, 1999; Bailey-Kellogg et al., 2000; Pe-

rmi and Annila, 2001; Andrec and Levy, 2002;Coggins and Zhou, 2003; Malmodin et al., 2003),can be considered as a specialized version of theQAP approach in which the values for lij and B arespecified in advance. The remarkable number ofpractical applications formulated as QAP hasspurred research in a wide community and has ledto a number of specialized solutions. Several usefulprotein NMR software tools attempt to solve theQAP problem by the use of various approxima-tions (Eccles et al., 1991; Nelson et al., 1991;Leopold et al., 1994; Bartels et al., 1997; Leutneret al., 1998; Atreya et al., 2000; Chen et al., 2003;Hitchens et al., 2003), possibly inspired by existingspecialized solutions of the QAP. The best-knownapproximations for finding solutions to typicalQAP problems are variations of the ‘‘branch andbound’’ algorithms. Other approximations havebeen proposed, including genetic algorithms, neu-ral networks, and simulated annealing. Theseapproximations address the practical issue offinding solutions; however, the important questionof determining the quality of solutions has re-mained unanswered. Theoretical work (Burkardand Fincke, 1985) has shown that the landscape ofscore functions for QAP problems of practical sizeis essentially flat. An inescapable consequence ofthis theory is that any deterministic algorithm,such as branch and bound or any of its heuristicmodifications, can yield a solution that is trappedin a local minimum without providing any infor-mation about the global minimum. These consid-erations place serious practical limits on theapplicability of branch and bound-type methodsto the NMR assignment problem.

The nature of noise in determining the NMR‘‘cost function’’ makes the application of deter-ministic combinatorial methods even less attrac-tive. The difficulty arises at the spin system typingand assembly stages, where one must assign a costor probability that a spin system (or group of spinsystems) belongs to a particular amino acid (orsequence of amino acids) in the protein sequence.After the time-domain data have been processed,the peak-picking step identifies the chemical shiftsassociated with each peak in the spectrum. Theaccuracy of this step depends on the resolution andsensitivity of the data collected, as well as choicesmade by the expert. The intermediate constructionof spin systems and their scores depend on theinformation content, noise, and ambiguity of the

222

Page 5: Probabilistic identiï¬cation of spin systems and their

data, and these vary from dataset to dataset. Noiseand ambiguity are inherent features of derivedNMR data; they are ever present and vary only indegree. In addition, the noise in the scoring de-pends on the model distributions of chemicalshifts. These problems are complicated furtherupon lifting the earlier assumption regarding theequality of resides in the sequence and intra-resi-due spin systems identified in the experimentaldata. In reality, the number of intra-residue spinsystems identified can exceed or fall short of thenumber of residues. All the above sources of noisewill make it difficult to identify the single globalminimum, particularly within the context ofdeterministic algorithms.

Evolutionary algorithms (EA) are general,nature-inspired heuristics for numerical search andoptimization that are frequently observed to beparticularly robust with regard to the effects ofnoise. This class of techniques, initiated by thework of John Holland (1975), includes searchalgorithms inspired by the process of naturalselection (biological evolution). The approach beenused for NMR assignments (Bartels et al., 1997).The operation of EA depends on a multitude ofparameters in combination with fitness environ-ments; together they form stochastic dynamicalsystems that are not easily analyzed or understood.The class of evolutionary algorithms for optimi-zation, which encompass the methods of geneticalgorithms, evolutionary programming, and evo-lutionary strategies (Davis, 1987, 1991; Goldberg,1989; Holland, 1992; Koza, 1996), have been re-ported to perform well under noisy conditions, assupported by numerical results in test cases (Ranaet al., 1996; Nissen and Propach, 1998; Stroud,2001). However, the successful implementations ofEA (Baum et al., 2001) in the presence of noisehave been limited to problems in which the noisehas a very special structure; and, as has beenpointed out (Michalewicz and Fogel, 2000), ‘‘therereally are no effective heuristics to guide the choicesto be made that will work in general’’.

The NFL (no free lunch) theory (Wolpert andMacready, 1997) presents a formal analysis ofsearch algorithms for optimization, including theEA, random search, and simulated annealingapproaches. The essence of NFL analysis is that inthe absence of prior knowledge or models, theaverage performance of a given optimizationalgorithm across all problems is as good as any

other algorithm. This sobering statement reaffirmsthat no general algorithm can replace carefulmodeling of the specific problem.

Our earlier attempt to address these issues andto develop a statistical approach to automatedassignments was embodied in the CONTRASTsoftware package (Olson and Markley, 1994; Ol-son, 1995), which provides a ranked list ofassignment probabilities. CONTRAST attemptedto model noise, but is subject to the general limi-tations discussed above. PISTACHIO, the newapproach we lay out here, addresses these chal-lenges by transforming the deterministic, combi-natorial optimization problem into a search for theground state configuration of a statistical system.The local state of the statistical system is con-structed in three steps. First, we parse peak listsassociated with particular experiments into the setof all possible tripeptide spin systems specified bythe peptide sequence. Second, we score the list ofpossible tripeptide assignments on the basis of ourprior analysis of the BMRB database of chemicalshifts, which has allowed us to create models forthe chemical shifts for each nucleus and each pairof nuclei within amino acids taken singly and inpairs. This makes use of a formula we have derivedthat accurately computes probability scores formatching chemical shift data to tripeptides. Third,we assemble the overlapping tripeptides to matchthe sequence and to achieve the optimal proba-bilities for correct assignments. Each of the threesteps serves to restrict the size of the combinatorialsearch space and to minimize the impact of noiseon the analysis. The impact of noise is furtherameliorated because the global ground state of themodel corresponds to a state that is ‘‘statisticallyclosest’’ to the reported observations of data (seethe supporting information for an outlined proofand analysis of noise impact).

Mathematical model

Our starting point is to define a ‘‘local cost model’’for our system that has ‘‘suitable emergent globalproperties’’. By a local cost model, we mean thatthe overall cost of assignment can be obtained asthe sum of costs for local assignment. By suitableemergent global properties, we mean that oursolution for the global cost will satisfy all localconstraints and cost functions in the most optimalway. We start by defining a set of local cost

223

Page 6: Probabilistic identiï¬cation of spin systems and their

functions JN as a function of the configurations YN

that represent the spin system for a given tripep-tide. We define the global cost as:

JðYÞ ¼X

YN�YJNðYNÞ

To minimize the overall cost J,we can equivalentlymaximize the following expression:

e�JðYÞ ¼ expðX

YN�Y�JNðYNÞÞ

¼Y

YN�Yexpð�JNðYNÞÞ

We can treat this as a probability by dividing by anormalization factor:

PðYÞ ¼ 1

Z

Y

YN�Yexpð�JNðYNÞÞ

We base the cost J on a QAP cost formula that werestate as:

J ¼Xn

i¼1

Xn

j¼1lijBðrðiÞ; rðjÞÞ

þXn

i¼1CðrðiÞ; iÞ ¼

XWþ

XU

e�J ¼ e�P

We�P

U

ð1Þ

Upon dividing by the normalization factor toobtain probabilities, and by denoting the set of allpermutations as R, we obtain:

PðRÞ ¼ 1

Z

Y

i;j

i 6¼j

expð�WðrðiÞ; rðjÞÞ

�Y

i

expð�UðrðiÞ; iÞÞ ð2Þ

In the expression above, the second partmeasures the merit of assigning a tripeptide spinsystem F to a given amino acid triple, whereas thefirst part Y provides a measure of the compati-bility of adjacent and overlapped spin systems. Wedescribe below how the value of F for each trip-eptide is obtained. We define the value of Y by

Wðx;yÞ¼u¼ �khCnx;Cnyið Þ if u\�e x 6¼y

�e if u��e x 6¼y

B ifx¼y

8<

:

In the above formula, k is a fixed and empir-ically obtained positive constant that sets the scale

of local costs. The parameter e is used as a controlparameter to test the local ‘‘flatness’’ of the solu-tion surface, and B is set to a value much largerthan k to inhibit duplicate selection. The expres-sions Cnx and Cny represent projection of thechemical shift coordinates onto the last n and firstn coordinates, respectively. Intuitively, when nrepresents the spin-system size for a single residueoverlap and the neighboring tripeptides are AKCand CDE, with the overlapping residue being C,we assign a score to the compatibility of thechemical shift values of C in each triplet pair. Thenumber of projected coordinates (or overlappingresidues) is changed in a controlled manner toobtain information regarding the consistency oflocal scores. The above prescription for the valueof Y encodes our objective to make assignmentsthat violate connectivity information ‘‘energeti-cally unfavorable’’. One major practical advantageof this formulation is its generality. It does not relyon a specific experiment or set of experiments andthus enables easy and efficient incorporation ofnew experiments into the framework. The resultingmodel can be represented in the form of a graphG=(V,W), with verticesV and edgesW (Figure 1).In this graphical representation of our model, twooverlapping residues in neighboring tripeptides areconnected by an edge in order to enforce theneighborhood constraint. Tripeptides and the trip-let spin systems (also represented by vertices) areconnected by an edge that represents the possibilitythat the spin system may correspond to the given

Figure 1. Graphical representation of the energetic model forthe case of two overlapping residues in neighboring tripeptides.The tripeptides are the upper vertices, and the triplet spinsystems are the lower vertices. An edge between a tripeptide anda triplet spin system represents the possibility that the spinsystem belongs to that residue.

224

Page 7: Probabilistic identiï¬cation of spin systems and their

residue. The weights along the edges or those forindividual vertices represent local information, andthe task of the algorithm (see below) is to find themost consistent global assignment described by theprobability distribution.

To achieve our goal of deriving probabilitiesfor the pairing of pair spin systems with residues(the F term), we must calculate the marginalprobability instead of the joint probability givenby Equation 2. Whereas the joint probabilityprovides the frequencies of pairing of spin systemsand residues occurring simultaneously across theamino acid sequence, the marginal probabilityindicates the fidelity achieved in pairing a specificspin system with a particular residue.

Brute force computational approaches toderiving marginal probabilities grow exponentiallyin computational time with the number of assign-ments to be made, and, therefore, are impracticaleven for small proteins. Our approach is to derivethe marginal probability by an iterative approxi-mation algorithm. The algorithm exploits parallelsbetween the formula above and particle interac-tion problems in statistical physics. To achieveglobal consistency, the algorithm works on clustersof residues that share common residues and eval-uates the cumulative measure of merit (joint andmarginal probabilities) for each cluster. Values inevery overlap cluster are adjusted in a way thatmoves the global cost in the direction of the min-imum. In each iterative step, the error gradients inlocal clusters are transmitted to larger clusters, andrelative weight adjustments are performed. Thisprocess of adjusting the values in the overlapclusters continues until values computed from twostarting points differ by only a small e value.

In recent years, a number of techniques basedin probability, geometry, and functional analysishave yielded great insight into probability distri-butions in multi-dimensional settings (Talagrand,1995). In intuitive language, the probability dis-tribution of joint events tends to become moreconcentrated around its mean (or median) as weconsider more and more events. In the assignmentproblem, one can take advantage of this propertyby considering tripeptide resonance systemswhenever an accurate estimate of their probabili-ties can be obtained. The tripeptide spin systemsare assembled from spectral data that provide in-tra-residue and inter-residue connectivities. One ofthe advantages of this approach is that it accom-

modates conditional probabilities: neglecting endeffects, each residue forms a part of three over-lapping tripeptides.

To obtain accurate joint probability values, weproceed as follows. Let A, B, and C be subsetsrepresenting events with probabilities of occur-rence jAj; jBj, and jCj, respectively. The followingbasic relationship and its generalization to subsetsdenoted by jAij follow a standard counting argu-ment and the joint probability is obtained byconsidering the value of the complement:

jA[B[Cj¼jAjþjBjþjCj�jA\Bj�jA\Cj�jB\CjþjA\B\Cj

ð3Þ

jAj¼jA1[A2[...[Anj¼Xn

i¼1jAij�

X

pairsðijÞjAi\Ajj

þX

triplesðijkÞjAi\Aj\Akj�...

þð�1Þnþ1jAi\Aj\Ak\...\Anj ð4Þ

This formula, however, does not lead to suf-ficiently accurate probabilities for the tripeptidespin systems or its complement 1� jAj. It is clearfrom Equation 4 that an accurate estimate ofindividual probabilities jAij in the first term of theexpansion is required and that the first- andhigher-order correlations (dependent events) alsoneed to be evaluated. However, we have sufficientdatabase information to build only the first twoterms in the expansion. For this reason, we haverewritten the expression in a way that yields agood estimate from first two terms alone.

jAj � 1�Y

i

Ni

Y

j>i

Nij

Ni ¼ 1� jAij; Nij ¼ N�1i þN�1j

�N�1i N�1j ð1� jAi \ AjjÞ

Computational aspects

The input data consist of peak lists derived from aset of multidimensional NMR experiments. This setcan include experiments such as HSQC, HNCO,CBCA(CO)NH, HNCACB, HN(CO)CACB,

225

Page 8: Probabilistic identiï¬cation of spin systems and their

HNCA, HN(CO)CA, HN(CA)CO, HBHA(-CO)NH, C(CO)NH, HN(CO)(CA)CB,HN(CA)CB, H(CCO)NH, N15-TOCSY, andHCCH-TOCSY. Any combination of these 15experiments can be used to generate candidate spinsystems. The most sensitive experiment (typicallyHSQC or HNCO) is used as the reference startingpoint for spin system generation. For missingchemical shifts, an empty value is assigned to thenucleus. In cases where assignments are ambiguous,all chemical shift possibilities for a given nucleus arecarried forward throughout the computation. PIS-TACHIO uses a grid search for the best tolerancefor generating spin systems (see below). The valuestypically searched are in the range of 0.02–0.03 ppmfor 1H, 0.2–0.275 ppm for 15N, and 0.2–0.3 ppm for13C dimensions. User can specify a search rangedifferent from the default. PISTACHIO treats peaklists from different experiments in an unbiased wayand uses all information entered to create the set ofcandidate spin systems. By constructing profiles(experiment description matrixes), one can cus-tomize PISTACHIO so as to: (1) add additionalexperiments, (2) assign different weightings toexperiments, (3) handle data from selectively la-beled proteins, or (4) incorporate prior assignmentsmade by other means. Only (1) and (3) are currentlyavailable from the PISTACHIO web server usingspecial instructions; (2) and (4) are achievable butmust be coded in by hand.

PISTACHIO performs preprocessing stepsprior to spin system generation. Of these, the mostimportant are alignment and best density and prob-ability estimates. These are discussed briefly below.

Alignment

An alignment algorithm is used to compensate forpossible referencing errors or other source of glo-bal shifting errors. Note that the matching ofpeaks from different experiments is an ill-posedproblem, because the peaks expected from nucleicommon to more than one experiment cannot beassumed to have been observed in each case. Inaddition, non-linear peak displacements in differ-ent spectra have been observed in real data. Ouralgorithm attempts to obtain a global alignment ofcommon chemical shift axes in all datasets in anautomated fashion without trying to match indi-vidual peaks. A linear shift is tried first, and if thisfails, then local adjustments are made. The intui-

tion behind the approach is to replace each peaklocation with a probability distribution and toattempt to match the distributions.

Best density and probability estimates

The chemical shift distribution model used byPISTACHIO for an individual atom, amino acidresidue, or tripeptide is not Gaussian. Ourapproach to deriving an accurate estimate for theprobability distribution is to obtain densities fromthe largest class of distributions that (1) satisfy thelaw of large numbers in probability, (2) have themost parsimonious form, and (3) permit robustestimates of their parameters from Monte Carlosimulations. We have used data available fromBMRB to parameterize a set of distributions foreach nucleus in each amino acid. The details of themethods involved will be presented in a separatepublication. The resulting distributions properlycapture the current statistics for the available dataand providemore accurate local probably estimatesthan those derived from histograms (Figure 2).

Description of the PISTACHIO algorithm

The free energy minimization approach used byPISTACHIO employs an iterative strategy that isguaranteed to converge. The algorithm uses asequence of non-increasing energy states, calledstrata, to control the combinatorial search in theexponentially large number of states. For eachstratum, a sequence of sub-strata is used to controlthe descent in energy by iteratively ‘‘squeezing’’ theenergy between a lower and upper bound. Thefinal ‘‘minimal energy state’’ specifies the configu-ration corresponding to ‘‘highest probabilitystates’’ for assignments ‘‘most consistent’’ with theobserved data in the presence of noise. This cor-respondence utilizes the well-known equivalence ofminimum free energy state and the minimal sta-tistical distance. The state that the algorithmconverges to is exact if the model described byEquation 1 is correct. The consequences of noiseor ambiguities in the data are naturally reflected inthe assignment probabilities.

In order to describe the PISTACHIO algorithmin a simple and accessible form, a number of nota-tional conventions are necessary. Equation 2 isrewritten as PðXÞ ¼ Z�1

Qu2U wuðXuÞ, where u can

be a single or a multi-index and U is a set of multi-

226

Page 9: Probabilistic identiï¬cation of spin systems and their

indices. We refer to u as a cluster and to U as acluster set. In the context of the graph notation(G=(V,W)) introduced above, a cluster is a subsetof vertices of V and a cluster set is a set of thesesubsets. For example, when u ¼ fig;wuðXuÞrepresents /(i,r(i)), and when u ¼ fi; jg;wuðXuÞrepresents Y(r(i),r(j)). The multi-index notation isimplied whenever an index is used with a capitalizedvariable such as X or a function such as w. Thenotation Xu is the shorthand for the randomvector (xi1,xi2,..., xik), where the indices ij form themulti-index or cluster u. The free energy forour model (Figure 2), F �

PuhlogPuðXuÞ

� logwuðXuÞipuðXuÞþPvavhlogPvðxvÞiPvðXvÞ repre-

sents a refined approximation to the ‘‘full’’ freeenergy term (Landau and Lifshitz, 1980). A lower

case x ismeant to denote our interest in the values ofthe function for a cluster of size exactly one. In ourformulation some clusters are not disjoint, and weneed to correctly reflect the number of times eachcluster and the vertices in the intersection of clustersare counted. The constants av are used to count thenumber of times a vertex v appears in the intersec-tion of clusters in a cluster setU and the constants bvare used to count the number of clusters in the setU.PISTACHIO also selects amulti-index setU in eachiteration, which is defined in the display describingthe PISTACHIO algorithm below. The outer loopof the algorithm repeats the iteration until no fur-ther changes in the priority of assignments is ob-served. A change in the priority of assignments isobserved if a previously unassigned spin system isassigned or a previously assigned spin systemchanges its assignment. Details are shown inFigure 3.

Results

In the early testing stage of PISTACHIO, we usedas input simulated datasets obtained by decon-structing BMRB entries into peak lists. Simulatedpeak lists for one small protein (8.6 kDa), twomedium sized proteins (12–18 kDa), and one largeprotein (40.6 kDa) led to average assignmentsaccuracies of 98% (results not shown). The BMRBdeposition for one of these proteins (BMRB 5106)contained the raw peak lists, which enabled us tocompare the use of real vs. simulated data.Whereas the simulated data yielded 69 spin systemswith 99% of the spin systems assigned, a subset ofthe raw data (N15-HSQC, HNCO, HNCACB,CBCA(CO)NH) yielded 66 spin systems with 90%correctly assigned. This result confirmed ourexpectation that true quality measurements canonly be obtained from tests on raw data.

Raw data in the form of peak lists wereobtained from collaborators at the NationalMagnetic Resonance Facility at Madison (NMR-FAM) and the Center for Eukaryotic StructuralGenomics (CESG). To get the best cross section ofresults for testing our system, we adopted thesingle requirement that input data be reported ineither the XEASY (Bartels et al., 1995) or NMR-STAR (defined on the BMRB website http://www.bmrb.wisc.edu) format. Users were free tochoose their experimental dataset combinations,

Figure 2. Results from the classification of a random sample of1000 13Ca chemical shifts for alanine to one of the 20 aminoacids by three different distribution models: the analyticaldistribution used by PISTACHIO, a normal distribution, andan optimal histogram. The bars indicate the percentage ofclassification for each residue (vertical axis) based on ouranalytical distribution. The blue line in the main figure indicatesthe percentage of classification for each residue based onoptimal histograms. The approach used by PISTACHIOfollows the trend of results by an optimal histogram and leadsto a small but noticeable improvement in the classificationalong this single chemical shift dimension (13Ca). When spreadover multiple nuclei, enhancements of this type lead to evidentcumulative improvements in the assignments. The inset com-pares the classification of our analytical model with oneobtained by fitting a normal distribution to the histogram data.When fit with a normal distribution, the distribution of choicesis more entropic (less selective) and alanine is no longerrecognized as the most likely residue.

227

Page 10: Probabilistic identiï¬cation of spin systems and their

Figure 3. PISTACHIO algorithm (left column) and a corresponding description of the steps (right column).

228

Page 11: Probabilistic identiï¬cation of spin systems and their

the method they selected for peak picking (manualvs. automatic), and the number of peaks reported.The PISTACHIO assignments were scored againstthe best set of assignments achieved by othermeans. In some cases, the manual assignment usedfor comparison benefited from corrections occur-ring farther down the structure determinationpipeline. Nonetheless, our criterion for correctnesswas agreement of the PISTACHIO results with thebest known assignments for the protein at hand.

In 10 of the 15 datasets tested (Table 2), PIS-TACHIO achieved better than 90% correctassignments for backbone resonances. Variouspeak picking software packages (NMR_VIEW,SPARKY, NMRPipe) were used in generating theraw peak lists. The datasets contained differentfractions of noise peaks. Depending on theexperiment and method of peak picking, the peakcounts varied from approximately 0.7 times to 7times the expected number of peaks.

The quality of raw peak list data was the keyfactor in determining the extent and correctness ofthe assignments as well as the run times. In order toassess the quality of our data, we use two notions ofquality. First, we use a post-analysis quality num-ber that represents the percentage of peaksassigned from a single typical experiment, generallythe HNCACB experiment, which produces amoderately large number of ‘‘information rich’’peaks. We also use a pre-analysis quality measurethat reports a statistical expectation for the numberof assignments (see supporting information forfurther discussion). To define this pre-analysisquality, let SA be the combinatorial set of vectorsenumerating chemical shift possibilities for spinsystem ‘A’ (for example, two different CO choicesfor one NH). SA can be viewed as a matrix SA(k,l)where the columns are chemical shift vectors androws are chemical shift values for a given nucleus.In the same way, we define SABC for the triplet spin

Table 2. Proteins assigned with PISTACHIO for which assignments made by other means were available for comparison

Protein

designator

Backbone Sidechain Experiments

represented in the

input peak listsa

Residues correctly

assignedb/assignable

residues (number of

Pro residues)

CPU

time

(h)

P>0.95 %

correctbCPU

time

(h)

%

correctb1 2 3 4 5 6 7 8

At2g24940 106/106 (3) 0.2 100% 100% 1 95% PP PPAt1g77540 96/97 (6) 0.2 100% 99% 0.1 95% PP PAt2g23090 82/84 (2) 0.1 98% 97% c c PPMm202773 92/94 (4) 0.1 97% 97% 0.1 85% PP PPPAt5g22580 99/108 (3) 4 95% 92% 1 86% PP PPPCE5073 108/118 (2) 4 95% 92% c c PPPPAt3g17210 96/107 (5) 5 95% 90% 1 90% PPPPPPAt3g51030 108/120 (4) 4 95% 90% 1 85% PPPPPPAt5g01610 95/115(5) 6 90% 83% c c PPPAt3g16450d 232/291 (8) 5 85%b 80% 2 75% PPPPPP PAt1g23750 122/152 (5) 6 85% 80% c c PPAcyl carrier 70/72 (0) 0.7 99% 97% c c PPBMRB 5106 61/68 (2) 1 95% 90% c c PPYggX 71/89 (2) 4 80% 80% 1 75% PP PPP-gamma 62/77 (10) 4 85% 80% c c PP

aEach dataset included an HSQC or HNCO experiment; other experiments are indicated by numbers: 1 – CBCA(CO)NH orHN(CO)CACB; 2 – HNCACB; 3 – HNCA; 4 – HN(CO)CA or CA(CO)NH; 5 – HN(CA)CO; 6 – H(CCO)NH or N15 TOCSY; 7 –C(CO)NH; 8 – HBHA(CO)NH. bCorrect assignments were achieved independently of PISTACHIO and based on the best availableassignment model; in most cases these assignments made use of additional information not available to PISTACHIO, such asstructural constraints. cSidechain data had not been collected yet in these cases. dStereo array isotope labeled (SAIL) protein; isotopeshifts due to labeling were not accounted for.

229

Page 12: Probabilistic identiï¬cation of spin systems and their

system in a tripeptide ABC. The quality factors b(spin system quality factor), c (connectivity qualityfactor), and overall quality factor q are defined as:

b ¼ hIkðSAÞiA;k; c ¼ hðImðSABCÞiABC;m;q ¼

ffiffiffiffiffibc

4p

ð5Þ

Ik(SA)(defined below) represents a count of uniquechemical shifts for spin system A.h�iA;K denotesthe expectation taken over all k in the spin systemsA. For residue B, the index m runs over all chem-ical shift entries that belong only to the residues Aand C. I is defined by the following function:

IkðSAÞ ¼1 SAðk; lÞ ¼ SAðk; l0Þ8l; l00 Otherwise

ImðSABCÞ ¼1 SABCðm; lÞ ¼ SABCðm; l0Þ 8l; l00 Otherwise

All proteins with datasets of average or goodquality yielded assignments ‡ 90% correct. Pro-teins with PISTACHIO assignment scores <90%were ones with spectra characterized as being of

low quality or with many overlapped and missingpeaks. Also in this category was a stereo arrayisotope labeled (SAIL) protein, discussed below,for which no corrections were made for isotopeeffects on chemical shifts. Input datasets contain-ing high noise (many more peaks than expected inparticular experiments) were not necessarily morecomplex to assign than those with low noise. Themost difficult cases were ones in which ‘‘real’’ and‘‘noise’’ peaks were interspersed within the toler-ances for separate peak recognition. For example,spectra from protein At1g23750 contained manyoverlapping peaks in a disordered region of theprotein. The overall assignment result forAt1g23750 was 80% correct, with assignment dif-ficulties confined mostly to the unstructured region(Table 3).

We found that the workstation time requiredwas inversely related the quality of the input dataand did not depend necessarily on the size of theprotein. According to the protein and data avail-able, the computation required minutes to hourson a single workstation (Table 2). The computa-tionally demanding part of the assignment is theiterative algorithm used to determine global con-sistency described above.

Table 3. Completeness and quality of data used for test of automated assignments by PISTACHIO

Name PEAKS Quality

% Reported % ‘‘True’’ % Noise

At2g24940 78 71 7 0.85

At1g77540 93 63 30 0.78

At2g23090 102 70 31 0.71

Mm202773 109 76 32 0.73

At5g22580 93 79 13 0.88

CE5073 102 82 20 0.76

At3g17210 105 91 13 0.75

At3g51030 105 62 42 0.69

At5g01610 166 76 89 0.65

At3g16450 99 72 27 0.72

At1g23750 95 52 42 0.60

Acyl carrier 153 61 92 0.60

BMRB 5106 113 90 23 0.78

YggX 216 47 168 0.63

P-gamma 100 80 20 0.78

Percentages of peaks under the heading PEAKS are for HNCACB experiment. % Reported peaks is the ratio of total reported peaks tothe maximum number of theoretically possible peaks. % ‘‘True’’ peaks is the ratio of total assigned peaks to the maximum number oftheoretically possible peaks. This number is indicative of data completeness. % Noise peaks is the difference between % reported peaksand % ‘‘true’’ peaks. This number is indicative of one source of noise in the input data and may be larger than 100%. Quality is anindicator computed according to formulas (Eq. 5) and is a statistical measure of quality.

230

Page 13: Probabilistic identiï¬cation of spin systems and their

The initial steps of alignment and preliminaryquality check of the data prior are performedrapidly, and these are used on an interactive basisto decide whether the assignment should proceed.PISTACHIO rapidly computes the number of spinsystems represented in the data and quality scoresfor backbone and sidechain assignments. Thesecan be used to immediately predict, in advance ofthe longer computation, the minimum number ofspin systems that likely will be assigned (see sup-porting information). These quality factors canindicate possible sources of error, such as align-ment problems, an unacceptable number of noisepeaks, or data provided in unacceptable format. Ifno difficulties are detected or if they can be cor-rected automatically, PISTACHIO proceeds to theiteration step. Otherwise, the user must interveneto provide improved input data.

The workstation time needed for sidechainassignment is substantially less than that forbackbone assignments. This is because of the costfunction Equation 1 is substantially simplified andbecause an exact calculation of Equation 2 can beperformed. The time usually is linearly propor-tional to the number of residues in protein. Inpractice, our observations show a strong correla-tion between the quality of sidechain and back-bone assignments.

One of the advantages of PISTACHIO is that itprovides a ranked list of possible assignments forambiguous cases. In the case of the P-gammaprotein (Table 2), the presence of multiple peaksresulted in PISTACHIO assignments to three res-idues that later were corrected manually fromadditional information. The corrected assignmentswere those ranked second in the PISTACHIO re-sults. The reporting of alternatives allows experi-menters or structure determination software torapidly find alternative assignments that satisfyadditional information as it is added.

Protein At3g16450 contained SAIL aminoacids. In this labeling pattern, each methyl group is)13C1H(2H)2 and each methylene group is)13C1H2H–. As a result, different nuclei are af-fected by isotope shifts that can be conforma-tionally dependent. For the purposes of testingrobustness, the resulting shifts can be viewed asrandom noise present in the chemical shift data.The range of isotope shifts (now viewed as randomnoise) had varied effects (shifts of 0.4–0.8 ppm) onthe 13Ca and 13Cb chemical shifts. Shifts beyond

0.5 ppm generally are considered to be significantfor the purpose of assignment and structuredetermination. Even though no corrections wereintroduced for these isotope shifts, PISTACHIOachieved 80% correct assignments for this 299-residue protein. These results demonstrate thatPISTACHIO behaves robustly with respect tonoise. With the introduction of appropriate cor-rections for expected isotope shifts, we expect thatthe performance of PISTACHIO with proteinscontaining SAIL amino acids will improvemeasurably.

Discussion

Automation afforded by PISTACHIO has benefitsbeyond improvements in the efficiency of theassignment step. The introduction of formal the-oretical protocols at each computational process-ing step should lead to outcomes that are lesssubjective and more reproducible. This will enableobjective comparisons that exclude variabilityarising from the interpretive evaluation of experts.

PISTACHIO provides an extensible and robustsystem for assignment. Experiments are describedby means of an experiment description matrix, andmodels for assignment are constructed on the basisof this matrix. Therefore, the addition of newexperiments does not require rewriting of thesoftware. PISTACHIO maintains internal recordsof hard to assign regions and their correspondingexperimental data. This database, when sufficientlypopulated, will be used as a statistical knowledgebase to identify difficult regions in new proteinssubmitted for assignment.

PISTACHIO is part of a larger effort aimed atintroducing automated probabilistic analysis ateach step in a protein NMR structural study.Automation procedures at each step should lead toconsistency and verifiability for the overall pro-cess. The tools should also be flexible enough toallow improvements in each step separately and toallow the introduction of new steps that mayamend or replace existing steps. Since the overallrobustness and accuracy of the process is gated bythe weakest step, establishment of formal theo-retical protocols at each step is paramount to theoverall success of the effort. In its current imple-mentation, PISTACHIO uses peak lists suppliedby the user for experiments that report through-

231

Page 14: Probabilistic identiï¬cation of spin systems and their

bond connectivities. Our goal is to derive thesepeak lists in a probabilistic manner, as achieved bythe HIFI-NMR approach (H. R. Eghbalnia et al.,submitted). Once an initial set of assignments isavailable, these can be evaluated according to thelinear analysis of chemical shifts (LACS) algo-rithm (Wang et al., 2005) to identify and correctreferencing problems and to identify outliers thatmay indicate misassignments or unusual secondarystructure. In addition, assignments can be refinedby making use of probabilistic secondary structuredeterminations from the set of assigned chemicalshifts and the peptide sequence as provided by thePECAN approach (Eghbalnia et al., 2005) to re-fine the chemical shift models used in scoring thetripeptide assignments. One can envision a furtherassignment iteration that makes use of NOESYdata and the consistency of assignments withstructural models. These refinements are includedin our future goals for this project.

Supporting information available

Intuitive discussion of the assignment algorithm,examples of additional datasets analyzed by thesoftware, predictive value of the initially derivedquality factors (1 table and 3 figures) at http://dx.doi.org/10.1007/s10858-005-7944-6.

Acknowledgements

This research was carried out at the NationalMagnetic Facility at Madison (NMRFAM), whichis supported by NIH Grant P41 RR02301 fromthe Biomedical Research Technology Program,National Center for Research Resources. A.B. andL.W. had partial support from Grant 1 P50GM64598 from the National Institute of GeneralMedical Science’s Protein Structure Initiative,which supports the Center for Eukaryotic Struc-tural Genomics (CESG). During part of this workH.E. was supported as a postdoctoral trainee bythe National Library of Medicine under Grant5T15LM005359. We thank Eldon L. Ulrich andWilliam M. Westler for advice and encouragementand the various persons at NMRFAM and CESGwho provided the peak lists and assignments usedin this study. Data on the SAIL protein wereobtained in collaboration with Masatsune Kaino-

sho and members of his group at Tokyo Metro-politan University. This work made extensive useof the BioMagResBank and the Protein DataBank.

References

Andrec,M. and Levy, R.M. (2002) J. Biomol. NMR 23, 263–270.Atreya, H.S., Sahu, S.C., Chary, K.V. and Govil, G. (2000)

J. Biomol. NMR 17, 125–136.Bailey-Kellogg, C., Widge, A., Kelley, J.J., Berardi, M.J.,

Bushweller, J.H. and Donald, B.R. (2000) J. Comput. Biol. 7,537–558.

Bartels, C., Guntert, P., Billeter, M. and Wuthrich, K. (1997)J. Comput. Chem. 18, 139–149.

Bartels, C., Xia, T.-H., Billeter, M., Guntert, P. and Wuthrich,K. (1995) J. Biomol. NMR 5, 1–10.

Baum, E.B., Boneh, D. and Garrett, C. (2001) Evol. Comput. 9,93–124.

Bax, A., Clore, G.M. and Gronenborn, A.M. (1990a) J. Magn.Reson. 88, 425–431.

Bax, A., Clore, G.M., Driscoll, P.C., Gronenborn, A.M., Ikura,M. and Kay, L.E. (1990b) J. Magn. Reson. 87, 620–627.

Billeter, M., Braun, W. and Wuthrich, K. (1982) J. Mol. Biol.155, 21–346.

Buchler, N.E.G., Zuiderweg, E.R.P., Wang, H. and Goldstein,R.A. (1997) Biophys. J., 72, WP447.

Burkard, R.E. and Fincke, U. (1985) Discrete Appl. Math. 12,21–29.

Celda, B. and Montelione, G.T. (1993) J. Magn. Reson. SeriesB 101, 189–193.

Chen, Z.Z., Jiang, T., Lin, G., Wen, J., Xu, D., Xu, J. and Xu,Y. (2003) Theor. Comput. Sci. 299, 1–3.

Coggins, B.E. and Zhou, P. (2003) J. Biomol. NMR 26, 93–111.Davis, L. (1987) Genetic Algorithms and Simulated Annealing

Pitman, London.Davis, L. (1991)Handbook of Genetic Algorithms Van Nostrand

Reinhold, New York.Eccles, C., Guntert, P., Billeter, M. and Wuthrich, K. (1991)

J. Biomol. NMR 1, 111–130.Edmonds, J. (1965) J. Res. Nat. Bur. Standards Sec. B 69, 125–

130.Eghbalnia, H., Wang, L., Bahrami, A., Assadi, A. and

Markley, J.L. (2005) J. Biomol. NMR (in press).Fesik, S.W., Eaton, H.L., Olejniczak, E.T. and Gampe, R.T.

(1990) J. Am. Chem. Soc. 112, 5370–5371.Geerestein-Ujah, E.C., Mariani, M., Vis, H., Boelens, R. and

Kaptein, R. (1996) Biopolymers 39, 691–707.Goldberg, D. (1989) Genetic Algorithms in Optimization Search

and Machine Learning, Addison Wesley, New York.Gonzalez, T.F. (1996) Multi-message multicasting: complexity

and approximation. Proceeding of 30th Hawaii Interna-tional Conference on System Sciences HICSS-30.

Gronwald, W. and Kalbitzer, H.R. (2004) Prog. Nuc. Magn.Reson. Spectr. 44, 33–96.

Gronwald, W., Willard, L., Jellard, T., Boyko, R.E.,Rajarathnam, K., Wishart, D.S., Sonnichsen, F.D. and Sy-kes, B.D. (1998) J. Biomol. NMR 12, 395–405.

Hitchens, T.K., Lukin, J.A., Zhan, Y., McCallum, S.A. andRule, G.S. (2003) J. Biomol. NMR 25, 1–9.

232

Page 15: Probabilistic identiï¬cation of spin systems and their

Holland, J.H. (1975) Adaptation in Natural and Artificial Sys-tems an Introductory Analysis with Applications to BiologyControl, and Artificial Intelligence, University of MichiganPress, Ann Arbor.

Holland, J.H. (1992)Adaptation in Natural andArtificial Systemsan Introductory Analysis with Applications to Biology Control,and Artificial Intelligence, MIT Press, Cambridge, Mass.

Hyberts, S.G. and Wagner, G. (2003) J. Biomol. NMR 26, 335–344.

Ikura, M., Kay, L.E. and Bax, A. (1990) Biochemistry 29, 4659–4667.

Jung, Y.S. and Zweckstetter, M. (2004) J. Biomol. NMR 30, 11–23.

Koradi, R., Billeter, M., Engeli, M., Guntert, P. and Wuthrich,K. (1998) J. Magn. Reson. 135, 288–297.

Koza, J.R. (1996) Genetic Programming: Proceedings of theFirst Annual Conference 1996, MIT Press, Cambridge Mass.

Leopold, M.F., Urbauer, J.L. and Wand, A.J. (1994) Mol.Biotech. 2, 61–93.

Landau, L.D. and Lifshitz., E.M. (1980) Statistical Physics(Part 1) (3rd edn.). Pergamon Press Oxford New York.

Leutner, M., Gschwind, R.M., Liermann, J., Schwarz, C.,Gemmecker, G. and Kessler, H. (1998) J. Biomol. NMR 11,31–43.

Li, K.B. and Sanctuary, B.C. (1997) J. Chem. Inform. ComputerSci. 37, 467–477.

Lin, Y. and Wagner, G. (1999) J. Biomol. NMR 15, 227–239.Lukin, J.A., Gove, A.P., Talukdar, S.N. and Ho, C. (1997)

J. Biomol. NMR 9, 51–166.Malmodin, D., Papavoine, C.H. and Billeter, M. (2003)

J. Biomol. NMR 27, 69–79.

Michalewicz, Z. and Fogel, D.B. (2000) How to Solve it:Modern Heuristics Springer, Berlin.

Moseley, H.N. and Montelione, G.T. (1999) Curr. Opin. Struct.Biol. 9, 635–642.

Moseley, H.N., Monleon, D. and Montelione, G.T. (2001)NMR Biol. Macromol. Pt B, 339, 91–108.

Nelson, S.J., Schneider, D.M. and Wand, A.J. (1991) Biophys.J. 59, 1113–1122.

Nissen, V. and Propach, J. (1998) Parallel Problem Solvingfrom Nature, Vol. 5, pp. 159–168.

Olson, J.B., Jr. (1995) Ph.D. thesis, University of Wisconsin-Madison.

Olson, J.B. Jr. and Markley, J.L. (1994) J. Biomol. NMR 4,385–410.

Permi, P. and Annila, A. (2001) J. Biomol. NMR 20, 127–133.Rana, S., Whitley, L.D. and Cogswell, R. (1996) Lecture Notes

in Computer Science (LNCS) 1141, 196–207.Stroud, P.D. (2001) IEEE Trans. Evol. Comput. 5, 66–77.Talagrand, M. (1995) Publications Mathematiques de l’I. H. E.

S. 81, 73–205.Tutte, W.T. (1947) J. London Math. Soc. 22, 107–11.Wang, L., Eghbalnia, H., Bahrami, A. and Markley, J.L. (2005)

J. Biomol. NMR (in press).Wider, G., Lee, K.H. and Wuthrich, K. (1982) J. Mol. Biol.

155, 367–388.Wolpert, D.H. and Macready, W.G. (1997) IEEE Trans. Evol.

Comput. 1, 67–82.Xu, J., Straus, S.K., Sanctuary, B.C. and Trimble, L. (1993) J.

Chem. Inf. Comput. Sci. 33, 668–682.Zimmerman, D., Kulikowski, C., Wang, L.Z., Lyons, B. and

Montelione, G.T. (1994) J. Biomol. NMR 4, 241–256.

233