
Determination of Protein Secondary Structure from Infra-Red Spectra Using Partial Least Squares Regression

Kieaibi E. Wilcox, Ewan W. Blanch†, and Andrew J. Doig*

*Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester

M1 7DN, UK

†Present Address: School of Applied Sciences, RMIT University, 124a La Trobe Street, Melbourne,

VIC 3001, Australia

ABSTRACT Infra-red (IR) spectra contain substantial information on protein structure. This has

previously most often been exploited by using known band assignments. Here, we convert spectral

intensities in bins within Amide I and II regions to vectors and apply machine learning methods to

determine protein secondary structure. Partial Least Squares was performed on spectra of 90 proteins in

H2O. After preprocessing and removal of outliers, 84 proteins were used for this work. Standard Normal

Variate and 2nd derivative preprocessing methods on the combined Amide I and II data generally gave

the best performance, with root mean square values for prediction of ~12% for α-helix, ~7% for

β-sheet, 7% for anti-parallel β-sheet and ~8% for other conformations. Analysis of FTIR spectra of 16

proteins in D2O showed that secondary structure determination was slightly poorer than in H2O. iPLS

was used to identify the critical regions within spectra for secondary structure prediction and showed

that the sides of bands were most valuable, rather than their peak maxima. In conclusion, we have


shown that multivariate analysis of protein FTIR spectra can give α-helix, β-sheet, other and

anti-parallel β-sheet contents to a good accuracy, comparable to circular dichroism, which is widely used for

this purpose.

A rapid assessment of a protein’s secondary structure is of great value in determining whether a

protein is folded or has folded correctly. Circular dichroism (CD) is the most widely used method for

this purpose, as it can quantify helix and sheet contents, though we have also shown that Raman

spectroscopy or Raman Optical Activity (ROA) could be even more accurate.1 Infra-red (IR)

spectroscopy is used even more widely than CD or Raman, and is potentially of great value in

determining protein structure in H2O, D2O and cellular and tissue samples, since the spectra are so

information rich.2 While IR is often used to study protein structure, this is typically achieved by looking

at intensities and wavenumbers of known bands for particular structures, such as the Amide I and III

bands for helix and sheet.3 However, the large number of overlapping IR bands, coupled with variations

of marker bands cited within the literature, typically makes accurate protein structural analysis through

the deconvolution of FTIR spectra both complex and time consuming. Here, we take a similar approach

to the one that we used for protein Raman and ROA spectra1 to analyse protein IR spectra. IR spectra

are turned into vectors of intensities within bins and Partial Least Squares (PLS) and Principal

Component Regression (PCR) methods are used to determine secondary structure contents. We also

investigated whether we can use this approach to go beyond a three way assignment of residues into

helix, sheet or other, and get more detailed information on different types of sheet (parallel, anti-parallel

or mixed), helix (310 or α) and turns.

The Amide I region of FTIR spectra is normally used for the analysis of protein secondary structure

because the frequencies of the Amide I modes are known to correlate closely to a protein’s secondary

structure elements.4 However, the Amide I bands of proteins display extensive overlap


of the underlying component bands of α-helix, β-sheet, β-turns and coils, which are

not instrumentally resolvable.5 Mathematical methods, such as second derivatives, are therefore often

used to enhance and resolve the individual band components.4

Although FTIR is a preferred technique for the rapid determination of protein secondary structure, its

spectra are multicollinear in nature and require calibration before protein structural information can be

extracted from them.6 Multivariate analysis removes the near multicollinearity found in spectral

measurements and derives the relationships between samples through statistical modelling.7

Multivariate regression analysis methods such as Partial Least Squares (PLS) and Principal

Component Regression (PCR) are applied to our dataset of 84 FTIR protein multivariate spectra with 16

proteins also measured in D2O. All spectra were pre-treated by means of intensity normalization and

SNV (Standard Normal Variate). Calculation of the second derivative was also done in order to enhance

the separation of the underlying spectral bands.
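As a minimal sketch of the second derivative step, the Python function below applies the standard 5-point Savitzky-Golay second-derivative weights (2, -1, -2, -1, 2)/7 for a quadratic fit. The window width, the polynomial order and the name `second_derivative` are assumptions chosen for illustration; they are not the settings used in this work, which relied on Matlab.

```python
def second_derivative(intensities, spacing=1.0):
    """Savitzky-Golay second derivative (5-point window, quadratic fit).

    The two points at each end are dropped because the window does not fit.
    `spacing` is the distance between data points (e.g. 1 cm-1).
    """
    weights = (2.0, -1.0, -2.0, -1.0, 2.0)   # standard tabulated weights
    scale = 7.0 * spacing ** 2               # normalisation for this window
    result = []
    for centre in range(2, len(intensities) - 2):
        window = intensities[centre - 2:centre + 3]
        result.append(sum(w * v for w, v in zip(weights, window)) / scale)
    return result


# Hypothetical check data: a parabola has a constant second derivative of 2.
parabola = [float(x * x) for x in range(-3, 4)]
curvature = second_derivative(parabola)
```

Because a band maximum becomes a sharp negative feature in the second derivative, overlapping components that appear as shoulders in the raw spectrum are easier to separate after this step.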

60 spectra were used to develop the regression model. Models were cross-validated with 10-fold

segments specified. An optimal number of latent variables, indicated by a minimum root mean square

error of cross-validation (RMSECV), was selected to avoid overfitting.8 The optimal number of

components was used to fit the spectra of new samples.

The predictive ability of the model was assessed by testing on new data (the reliability of prediction of

structure from different but related spectra) as well as already seen data (the reproducibility of

prediction using spectra from the same measurements and conditions).9 This was accomplished by assigning

44 samples to the test set, of which 24 samples were new to the model and 20 samples were already seen

by the model. This also increased the number of proteins in our test set. The partitioning of the data into

training and test set was done by an algorithm called Kennard Stone (KS) which puts the most different

spectra in the calibration set.

A version of PLS called interval PLS (iPLS) uses subintervals of the full spectrum. Local models are

built with intervals of the spectral region; the performance of the local models is compared to that of the


full spectrum model.10 By using intervals of the spectrum, regions specific to each secondary structure

motif can be identified. The prediction ability of sub-spectral regions can possibly reduce spectral

measurement time and reduce production cost by only employing a few significant spectral regions.11

PLS/PCR multivariate calibration is used to quantify new variables ‘y’ from the matrix of the FTIR

measured protein spectra X, via a mathematical model that can relate X to ‘y’. X holds the predictor

variables (FTIR spectra of protein samples) and ‘y’ is the reference variable (DSSP values). This model

can then be used to predict the structure of unknown protein samples.12 PLS and PCR solve the near

multi-collinearity often found in spectral measurements, where two or more independent (predictor)

variables are correlated and so provide redundant information to the model. FTIR spectra are

multivariate in nature, leading to a need for dimension reduction which can be achieved by PLS

regression.
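To make the relation between X and ‘y’ concrete, the following is a minimal sketch of single-response PLS (PLS1) using the NIPALS iteration in plain Python. It is an illustration under stated assumptions, not the Matlab routine used in this work; the function names `pls1_fit` and `pls1_predict` and the toy data are hypothetical.

```python
import math


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns column means, y mean and the components."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left in y to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict y for one new spectrum x."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


# Hypothetical toy data with an exact linear relation y = 2*x1 + 3*x2.
toy_X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
toy_y = [2.0, 3.0, 5.0, 7.0]
toy_model = pls1_fit(toy_X, toy_y, n_components=2)
toy_prediction = pls1_predict(toy_model, [3.0, 2.0])
```

Each component is chosen using both X and ‘y’, which is why PLS typically needs fewer components than a decomposition driven by X alone.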

The PCR method is based on principal component analysis (PCA). PCR performs data decomposition

into loading and score variables. For PCR, the estimated scores matrix consists of the most dominating

principal components of X. These components are linear combinations of X measurements determined

by their ability to account for the variability in X.13 The first principal component is the linear

combination of the original X-variables with the highest possible variance; PCR uses only the X

variables for the analysis without employing the y variable. PLS regression can give good prediction

results with fewer components than PCR because the response variable is employed in the regression.14

The number of components needed for interpreting the information in X (spectra) which is related to ‘y’

(secondary structure) is, therefore, smaller for PLS than for PCR. This may lead to a simpler

interpretation, though the two methods often give comparable results.4

■ MATERIALS AND METHODS

Spectral Measurement and Processing. 84 proteins were measured in H2O; 28 of these FTIR

protein spectra were provided by Dr. Parvez Haris from DeMontfort University, UK (SI Table 1). The

remaining proteins were bought from Sigma-Aldrich and used without further purification. Proteins


were chosen based on criteria such as: a wide range of helix and sheet contents, crystal structure quality,

secondary structure contents, fold classes, solubility and stability. Spectra were collected using the ATR

accessory, made with a ZnSe crystal, with a Bruker-Tensor FTIR spectrometer and MCT detector. 30

µL of each protein solution at 50 mg/ml in distilled H2O was placed on the ATR cell. A background

spectrum of H2O was used for automated background signal subtraction using a customized routine in

the OPUS 6.2 software. IR measurements were made with 4 cm-1 spectral resolution, data points were

measured every 1 cm-1, and 32 scans were collected per sample, over a range of 4000–950 cm-1.

Acquisition times for each spectrum were less than two minutes. The minimum absorbance for all

spectra in the dataset was subtracted to remove background absorbance and light scattering effects. No

smoothing function was applied to the data.

ATR correction. Attenuated total reflectance (ATR) spectroscopy15 was used. ATR is based on the

concept of internal reflection. The infrared beam enters the ATR crystal and strikes the crystal surface in

contact with the sample, which has a lower refractive index, at an incidence angle of 45°.16 If the angle

of incidence is greater than the critical angle, the beam undergoes total internal reflection.16 The

absorption of the evanescent wave that penetrates into the sample is measured. The path length of a

transmission experiment is the same across the spectrum because it is defined by the thickness of the

sample. In the ATR experiment, however, the depth to which the sample is penetrated by the infrared

beam increases with the wavelength;17 for this reason, the relative intensity of bands in an ATR

spectrum increases with wavelength, that is, towards lower wavenumber. ATR spectra can also be

affected by anomalous dispersion, which distorts the spectral peak shape. As a

result, the infrared spectrum of a sample obtained by ATR, when compared to its transmission

measurement, shows some significant differences, such as higher absorbance bands around the Amide II

region than in the Amide I region. It is advised to do ATR correction where quantitative analysis is

necessary.18 ATR correction was performed on protein spectra used for this analysis to bring each

spectrum as close as possible to its transmission counterpart. Using the ‘Advanced ATR Correction’ feature on


Bruker’s OPUS software, these abnormalities were partially corrected on all in-house collected spectra,

though we did observe some residual changes in the ratios of Amide I to Amide II peaks compared to IR

transmission spectra. For example, ATR-corrected carbonic anhydrase had a 0.9 to 1 ratio for Amide I

to Amide II peak intensity, while the transmission spectrum of the same protein had a 0.6 to 1 ratio for

Amide I to II intensities. This ATR-corrected ratio can differ from protein to protein; ATR correction is

often less satisfactory for samples with strong absorption peaks.19
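The wavelength dependence of the penetration depth can be illustrated with the standard expression for the evanescent wave, d_p = λ / (2π n1 sqrt(sin²θ − (n2/n1)²)). The Python sketch below evaluates it at representative Amide I and Amide II positions. The refractive indices (about 2.4 for ZnSe and 1.33 for an aqueous sample) are assumed round values for illustration only, and the function name is hypothetical; this is not the correction applied by the OPUS software.

```python
import math


def penetration_depth_um(wavenumber_cm1, n_crystal=2.4, n_sample=1.33,
                         angle_deg=45.0):
    """Evanescent-wave penetration depth in micrometres.

    n_crystal and n_sample are assumed, illustrative refractive indices.
    """
    wavelength_um = 1.0e4 / wavenumber_cm1        # convert cm-1 to micrometres
    theta = math.radians(angle_deg)
    term = math.sin(theta) ** 2 - (n_sample / n_crystal) ** 2
    if term <= 0:
        raise ValueError("angle of incidence is below the critical angle")
    return wavelength_um / (2.0 * math.pi * n_crystal * math.sqrt(term))


depth_amide_i = penetration_depth_um(1650.0)    # near the Amide I maximum
depth_amide_ii = penetration_depth_um(1550.0)   # near the Amide II maximum
```

With these assumptions the beam samples slightly more material at 1550 cm-1 than at 1650 cm-1, which is consistent with Amide II appearing relatively stronger in uncorrected ATR spectra than in transmission spectra.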

84 proteins measured in H2O and 16 selected proteins measured in D2O are listed in SI Table 1. Data

were normalized for intensity and autoscaled using the Standard Normal Variate (SNV) approach. This

centres each spectrum by subtracting the mean value of the spectrum from each absorbance; SNV then

scales each spectrum by its standard deviation to give the spectrum a variance of 1 and mean of 0.

Second derivatives were then determined for each spectrum. The normalized data were subjected to

second derivative analysis using Matlab’s Savitzky-Golay algorithm with 5 degrees; this reveals the

underlying absorption changes.20
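The SNV transformation described above can be sketched in a few lines of Python. This is an illustration only: the function name `snv` and the example intensities are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre one spectrum and scale it to unit SD."""
    mean = statistics.fmean(spectrum)
    spread = statistics.stdev(spectrum)   # sample SD (n - 1), an assumption
    if spread == 0:
        raise ValueError("a flat spectrum cannot be scaled")
    return [(value - mean) / spread for value in spectrum]


# Hypothetical absorbance values for a single spectrum.
raw = [0.10, 0.35, 0.80, 0.55, 0.20]
scaled = snv(raw)
```

Note that SNV acts on each spectrum (row) separately, so it removes differences in overall intensity between samples without using information from any other spectrum.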

Data Selection Scheme. 60 H2O spectra were used as a calibration set for the PLS regression

model. The remaining 24 were used as part of an independent test set for assessment of predictive

accuracy after training. 20 spectra from the calibration set were also included in the independent test set,

giving 44 samples in total. Selection was made using the Kennard Stone (KS) algorithm,21,22 a

well-known method for the selection of a sample subset for calibration. It chooses objects from the X-matrix

(the measured spectra) that provide a uniform coverage of the data set. The algorithm does this by first

assigning the sample closest to the mean of the entire data set to the calibration set; each subsequent

sample is then chosen based on its squared distance to the samples already assigned, using Euclidean or

Mahalanobis distances.23 The sample furthest from the already selected samples is added to the

calibration set. The algorithm was set to select the first 60 samples for the calibration set, with the rest

left for validation (Figure 1).
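The selection procedure can be sketched as follows in Python. This is a minimal illustration of the variant described here (starting from the sample closest to the mean and using Euclidean distance); the function names and the toy one-column "spectra" are hypothetical, and other implementations of KS start from the two most distant samples instead.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(spectra, n_select):
    """Return indices of the samples chosen for the calibration set, in order."""
    n = len(spectra)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and the number of samples")
    n_cols = len(spectra[0])
    mean = [sum(row[j] for row in spectra) / n for j in range(n_cols)]
    # Start with the sample closest to the mean spectrum.
    first = min(range(n), key=lambda i: euclidean(spectra[i], mean))
    selected = [first]
    # Distance from every sample to its nearest already-selected sample.
    nearest = [euclidean(s, spectra[first]) for s in spectra]
    while len(selected) < n_select:
        candidates = [i for i in range(n) if i not in selected]
        # Add the sample that is furthest from everything selected so far.
        chosen = max(candidates, key=lambda i: nearest[i])
        selected.append(chosen)
        for i in range(n):
            nearest[i] = min(nearest[i], euclidean(spectra[i], spectra[chosen]))
    return selected


# Hypothetical toy data: five "spectra" with a single intensity each.
toy_spectra = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration_idx = kennard_stone(toy_spectra, 3)
test_idx = [i for i in range(len(toy_spectra)) if i not in calibration_idx]
```

In the toy example the sample nearest the mean is picked first, followed by the two most extreme samples, so the calibration set spans the data while the remaining samples fall inside its range.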


Figure 1. Schematic representation of 84 protein samples in H2O used in the training set and the

test set. Matrix X represents the FTIR spectra while vector ‘y’ represents the percentage fraction of

secondary structure.

PLS Modeling. Amide I and II regions, whose bands originate from C=O stretching and NH

bending vibrations, were used for this analysis. The Amide III region was not analysed as it was not

recorded in many of the protein spectra. Amide I and II regions were used separately and combined.

The FTIR spectra were put into a matrix containing 84 rows of protein spectra having 101 columns

for the spectral wavenumbers from 1600–1700 cm-1 in the Amide I region and a separate matrix of 121

columns for the Amide II region from 1480–1600 cm-1. The algorithm for PLS was written using Matlab

R2010 and Matlab’s PLS regression. PLS requires a priori knowledge about the protein structural

groups, so relative secondary structure contents determined by DSSP24 from the PDB were put into a

matrix form ‘y’. Secondary structure fractions are listed in SI Table 1. Our parameters for the

plsregress function in Matlab are available on request.

■ RESULTS AND DISCUSSION

Partial Least Squares. PLS was first applied to the calibration set and Amide I data. It was first

necessary to find the number of PLS components required to explain the data. PLS extracts components

in order of their relevance to the structure type in question. SI Figure 1 shows how the percentage of y

explained variance increases with the number of components used when predicting α-helix contents


from Amide I data. The plot was used to choose a minimum number of components that give rise to the

maximum explained variance. After 10 components the gap between points is small and the RMSE is

low (SI Figure 2), so 10 components for the analysis of α-helix from the Amide I region

were selected to give the maximum explanation of variance and a low Root Mean Square Error for

Calibration (RMSEC).

Similarly, 10 components gave a low RMSEC of 0.020 for analysis of β-sheet content

from the amide I data. Beyond 10 components, the decrease in the RMSEC is

negligible (not shown). Similar numbers of components were used for other

secondary structure types (Table 1).

Cross-Validation. PLS is sensitive to overfitting, that is, the RMSEC will continue to decrease with

additional components; however, the RMSEP will increase due to overfitting.25 For this reason 10-fold

cross-validation of the model was carried out. An optimal number of components, indicated by a

minimum root mean square error of cross-validation (RMSECV), was selected to avoid overfitting. For

PLS, this error measurement, which is the standard deviation of the unexplained variance of the cross-

validation, is more useful than RMSEC, because RMSEC does not indicate when the model is

overfitted. Plots of the cross-validation results are used to visualize the number of components that give

the lowest RMSECV. SI Figure 3 shows that 4 components were sufficient for α-helix using Amide I

data. The number of components required for other secondary structures was determined in the same

way.
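The cross-validation loop can be sketched generically in Python. The function below computes RMSECV for any model supplied as a pair of `fit` and `predict` callables; repeating the call for each candidate number of components and taking the minimum gives the selection rule described above. The way samples are assigned to segments (every k-th sample) and all names are assumptions for illustration, and the mean-only model is a hypothetical stand-in for PLS.

```python
import math


def rmsecv(X, y, fit, predict, n_folds=10):
    """Root mean square error of k-fold cross-validation."""
    n = len(y)
    if not 2 <= n_folds <= n:
        raise ValueError("n_folds must be between 2 and the number of samples")
    squared_error = 0.0
    for fold in range(n_folds):
        # Assumed segment layout: sample i belongs to segment i mod n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            squared_error += (predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(squared_error / n)


# Hypothetical stand-in model that always predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


demo_error = rmsecv([[0.0]] * 4, [1.0, 2.0, 3.0, 4.0],
                    fit_mean, predict_mean, n_folds=2)
```

Because every sample is predicted by a model that never saw it, this error stops falling, and eventually rises, once extra components only fit noise, unlike RMSEC.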

Independent Test Set. Cross-validation performs an internal test with the calibration model and

is generally adequate to assess the fitting ability of a model. However, it is always useful to additionally

test a model against an independent test set. We did this by using 44 samples of which 24 have not been

previously seen by the model. Ideally, the independent test set would only include proteins that were not

in the calibration set. However, in our case, this would only be 24 proteins, a number too small to give

accurate analysis. The RMSEP of the independent test set is obtained for the judgment of the


performance of the calibration set (Table 1).25 Fitted response values are computed by the PLS model

and an R2 value is generated, comparing predicted values to those known from crystal structures.
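For reference, the two summary statistics can be computed as in the Python sketch below. The R2 shown is the coefficient of determination, 1 − SSres/SStot; whether this definition or a squared correlation coefficient was used here is an assumption, and the example fractions are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between reference and predicted fractions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical helix fractions: reference values and model predictions.
observed = [0.10, 0.20, 0.30]
predicted = [0.10, 0.20, 0.40]
error = rmse(observed, predicted)
fit_quality = r_squared(observed, predicted)
```

The same `rmse` function gives RMSEC, RMSECV or RMSEP depending only on which samples and which predictions are passed to it.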

PCR. The major difference between PLS and PCR is that PCR uses only the x variables for the

analysis without employing the y variable, although the y response variable is needed for the calculation

of the residual. A comparison of PLS and PCR, using the SNV Amide I data, was made to determine

which method works best.

Plots of RMSECV as a function of number of components showed that 4 PLS and 5 PCR components

are optimal to minimise the RMSECV (not shown). PLS was chosen for future use, since it usually has a

better performance on the calibration set and uses fewer components with cross-validation.

PLS with 2nd Derivative. Amide I data using PLS was compared using SNV or second derivative

spectra (Table 1). The second derivative usually gave a better performance, so was used for further

analysis of Amide II data. The plotted observed vs. fitted response in Figure 2 for second derivative data

shows that the regression line fits the data very well for α-helix for the Amide I region

with the calibration set and less well for the independent test set, as expected (Figure 3). Datasets

computed from the second derivatives of the FTIR spectra were used for secondary structure analysis

with PLS. The R2 and RMSECV values for each calibration were used in choosing the

best fit model for α-helix, β-sheet, parallel β-sheet, anti-parallel β-sheet, β-turns

and other (Table 1). Goormaghtigh et al.3a reported that 310-helix content was too low to give

statistically meaningful results in their study on 50 proteins and we also found a poor correlation for 310-

helix.


Figure 2. Predicted versus training set values for the calibration set for the analysis of α-helix

content from 2nd derivative Amide I data in H2O, with line of best fit.

Figure 3. Independent test set plot for analysis of α-helix from the Amide I 2nd derivative

data: PLS fitted vs response on 44 proteins in H2O. Proteins in circles were in the calibration set.

The PLS analysis of the 2nd derivative of Amide I data used 10 components for the

prediction of α-helix content, giving an R2 of 0.96 with a low RMSEC of 0.040 which,

after cross-validation, gave an RMSECV value of 0.181 with only 4 components.

These 4 components were used to predict α-helix content in the independent test

set, giving a RMSEP of 0.120 with an R2 of 0.71. Figures 4 and 5 show the fitted vs


observed plots for the β-sheet calibration and independent test sets using the Amide I

region and 2nd derivatives. Data on additional structure types are listed in Table 1.

Figure 4. Predicted versus training set values for the calibration set for the analysis of β-sheet

content from 2nd derivative Amide I data in H2O, with line of best fit.

Figure 5. Independent test set plot for analysis of β-sheet from the Amide I 2nd derivative data: PLS

fitted vs response on 44 proteins in H2O. Proteins in circles were in the calibration set.

ζ Scores. Calibration of each of these secondary structural contents was done for both the entire

Amide I and II spectral regions separately, and for the combined Amide I and II regions together (Table

1). Cross-validation R2 results of 0.85 (α-helix), 0.94 (β-sheet), 0.44 (parallel β-sheet),

0.94 (antiparallel β-sheet), 0.79 (mixed β-sheet), 0.53 (310-helix), 0.60 (turns), and 0.62


(other) for the Amide I region using PLS 2nd derivative data indicate the percentage explained variance

for each predicted structure-type. The predictors work best for α-helix, β-sheet and antiparallel

β-sheet. The RMSECV values for α-helix (0.181) and β-sheet (0.109) are higher than for

other structures, such as parallel β-sheet (0.050), 310-helix (0.037), and turns (0.051). This

is not because the predictive values of the latter are better, but rather that in our samples,

the variations of parallel β sheet, 310-helix and turns are a lot smaller. To assess

the predictive accuracy of this statistical analysis, a determinant ‘ζ’ is calculated.

ζ is the ratio of RMSECV (δ) from the FTIR divided by the standard deviation (σ)

of the protein secondary structure contents in the reference set of crystal structures:

ζ = δFTIR / σ (1)

This score compares the distributed width of the reference structure to that of the RMSECV. It is used

to compare the prediction accuracies, as it accounts for the natural variation of the crystal structure data

compared to the calculated structure obtained from the FTIR model.16 A ζ value less than one indicates

that the FTIR prediction values for that structure are better than guesswork and a value around one

means both methods give answers that are comparable. A value higher than one indicates that the FTIR

method is of no value.1,16-17 The ζ scores for α-helix, β-sheet, and anti-parallel β-sheet

from the amide I data are all lower than one, apart from PLS 2nd derivative Amide II for

sheet, showing that the predictors are successful (Table 1). The ζ scores for the other

structures are generally less than one for PCR only. The two scores, RMSECV and ζ, should

both be considered because a secondary structure with few samples and a small

range in the data space could give a low RMSE by chance.
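Equation 1 is simple enough to evaluate directly. The Python sketch below does so for hypothetical reference fractions; the function name and the use of the sample standard deviation are assumptions, not details taken from this work.

```python
import statistics


def zeta_score(rmsecv, reference_fractions):
    """Equation 1: cross-validation error divided by the spread of the reference."""
    sigma = statistics.stdev(reference_fractions)  # sample SD, an assumption
    return rmsecv / sigma


# Hypothetical example: a structure type whose content varies widely ...
zeta_wide = zeta_score(0.10, [0.1, 0.3, 0.5])
# ... and one whose content barely varies, predicted with a smaller error.
zeta_narrow = zeta_score(0.05, [0.02, 0.04, 0.06])
```

In this made-up case the second structure type has the lower RMSECV but the higher ζ, which is the situation the ζ score is designed to expose: a small error is uninformative when the reference values themselves hardly vary.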

Analysis of Protein FTIR Spectra in D2O. Protein absorption peaks in the Amide I region

overlap the absorption of water bands in the same region (~1643 cm-1), making it difficult to obtain

informative spectra from H2O. One solution is to use D2O as a solvent because it does not absorb in the

same region as the water absorbance band in the Amide I region.4, 26 However, whereas the Amide I


region is easily observed, the Amide II region can then be affected by the D2O solvent absorption26 and

D2O may change the protein spectral signature.

Selected proteins in both solvents were used to investigate and compare their Amide I data. The same

techniques and validation steps applied to the proteins dissolved in H2O were applied to 16 proteins in

D2O, using PLS (SI Table 2).

Overall, the predicted values of α-helix and β-sheet for the 16 proteins measured

in both H2O and D2O are similar. The RMSEP values for α-helix from proteins in H2O vs

proteins in D2O are 0.066 and 0.069 respectively, while the RMSEP values for their β-

sheet contents are 0.044 and 0.045, showing slightly better performance with H2O.

We emphasise, however, the significantly smaller datasets used for the D2O studies.

PLS and other quantitative analysis methods have previously been used to quantify secondary

structure in H2O and D2O.27 α-Helix and β-sheet bands within the Amide I region have

been published by Kong and Yu for secondary structure assignments for both H2O

and D2O.4 Dousseau and Pezolet used PLS on 13 proteins in both H2O and D2O.28 Their D2O results

were poorer than for H2O overall, especially for myoglobin.

Interval Partial Least Squares. Interval Partial Least Squares (iPLS) is a variant of PLS used to

perform regression analysis on sub-intervals of the spectra. Local models are obtained from chosen

interval(s) based on RMSECV performance. Methods developed from sub-intervals may give better

predictions, as they may include less noise.27 Knowledge of the prediction ability of sub-regions of FTIR

spectra may also lead to the improvement of instruments that can reduce production cost by only

employing a few significant spectral regions.27

The KS algorithm was used to split spectra into calibration and test sets. Norgaard’s iPLS_ToolBox

for Matlab was used in the implementation of this work. This allows the matrix of the entire spectrum to

be split into wavenumber intervals; iPLS is performed on the entire spectrum (global) and on the


intervals (local) simultaneously. The interval with the lowest error indicates the area of most importance

in terms of structural information. Proteins used are listed in SI Table 3.

7 spectral intervals were initially assigned for building the model. PLS models were calculated for

each of the sub-intervals, using 5 latent variables and ‘leave 5% out’ cross-validation across all sub-

intervals. Two components gave the minimum RMSECV; therefore two components were chosen to

model all sub-intervals.
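The division of the Amide I region into intervals can be sketched in Python as below. Splitting the 101 data points from 1700 to 1600 cm-1 into 7 near-equal contiguous blocks, with the longer blocks first, gives ranges for intervals 5 and 6 that agree with those quoted in this section; that the toolbox uses exactly this rule is an assumption, and the function name is hypothetical.

```python
def split_intervals(wavenumbers, n_intervals):
    """Split a list of wavenumbers into contiguous, near-equal intervals."""
    base, extra = divmod(len(wavenumbers), n_intervals)
    intervals, start = [], 0
    for k in range(n_intervals):
        size = base + (1 if k < extra else 0)   # longer intervals come first
        intervals.append(wavenumbers[start:start + size])
        start += size
    return intervals


amide_i = list(range(1700, 1599, -1))   # 1700 down to 1600 cm-1, 101 points
intervals = split_intervals(amide_i, 7)
bounds = [(chunk[0], chunk[-1]) for chunk in intervals]
# bounds[4] is interval number 5 and bounds[5] is interval number 6.
```

A local PLS model is then fitted to the columns of each interval in turn, and its RMSECV is compared with that of the global model built on all 101 columns.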

An RMSECV was recalculated for each sub-interval and for the full spectrum. This variable selection

and the results are shown in Figures 6 and 7 for α-helix and β-sheet; the columns represent

the individual spectral regions (the intervals), and the numbers in italics denote

the optimal number of components for each interval.29 Interval number 5 (1628-1641 cm-1) has the

lowest RMSECV for α-helix, though its performance is still poorer than for the global model which

uses the entire Amide I spectral region. A calibration model based on interval number 5 using 3

components was used for the independent test set.

[Figure 6 plot: RMSECV (0 to 0.2) against iPLS interval number (1 to 7); PLS components per interval: 1, 4, 2, 3, 3, 3, 1.]

Figure 6. iPLS for α-helix. Each bar in the plot is a spectral interval from 1700-1600 cm-1, with

1700 cm-1 to the left; Interval number 5 corresponds to frequency factors in the range 1641–1628 cm-1.

The dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval

bar signify the number of PLS components used in fitting each local model. See SI Table 4 for

intervals and their wavenumber ranges. The curve is the mean spectral intensity for the Amide I region.


[Figure 7 plot: RMSECV (0 to 0.1) against iPLS interval number (1 to 7); PLS components per interval: 1, 4, 2, 5, 3, 3, 3.]

Figure 7. iPLS for β-sheet. Each bar in the plot is a spectral interval from 1700-

1600 cm-1; Interval number 6 corresponds to frequency factors in the range 1614 –1627 cm-1. The

dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval bar

signify the number of PLS components used in fitting each local model. The curve represents the mean

spectral intensity for the Amide I range.

For β-sheet, interval number 6 had the lowest RMSECV out of 7 intervals, used 3

components and was superior to the global model using all Amide I data (Figure 7).

A calibration model based on the 3 components using interval number 6, which covers 1627-1614 cm-1,

was developed. For both α-helix and β-sheet, we found that the interval that contains data from the side

of a peak was most useful, rather than the interval containing the peak maximum. The intervals

containing the peak maxima are less useful, since they demonstrate less variation between spectra.

iPLS α-helix. The predicted fraction of α-helix from the Amide I region was

compared to that from crystal structures. The RMSECV of 0.144 is a little better than that of the

standard PLS model with RMSECV of 0.181, but the iPLS R2 for cross-validation (0.80) is lower than

the value of 0.85 for PLS. Interval number 5 model (1628-1641 cm-1) with 3 components gave an

RMSECV of 0.149 and an R2 of 0.78 which is below the performance of the global model.

iPLS β-sheet. The RMSECV for the β-sheet global model is 0.092 with an R2 of 0.81, while the

RMSECV from the standard PLS model was 0.050 with a particularly high R2 of 0.94. The global


iPLS model for β-sheet thus did much more poorly than the traditional PLS model,

but this could be due to the number of components used for iPLS global for Amide I (3

vs. 9). A model of interval number 6 (1614-1627 cm-1) with 3 components gave an RMSECV of 0.087

and an R2 of 0.83 which is much better than the performance of the iPLS global model.

iPLS Independent Test Set. The KS algorithm was used for splitting the protein samples into

training and test sets. Although the number of proteins in the independent test set was the same for both

PLS and iPLS, a few factors were different between them. The proteins in the training sets for the two

methods were not completely identical, as several proteins in the calibration set of the PLS method were

in the test set of the iPLS method. This is because for iPLS, data selection for the calibration set and test set

was based on regions of the spectra, instead of the full spectra. The numbers of components used for the

prediction test set models were based on the error of the cross-validation performed for calibration

models. The results for the 44 proteins in the iPLS independent test set were as good as those of

traditional PLS. This comparison is presented in Table 4. Figure 8 shows that the model for

α-helix fits the data and the RMSEP is 0.134 with an R2 of 0.78. The β-sheet model

also fits the data well with an RMSEP value of 0.089 and an R2 of 0.79 (Figure 9). These

results are comparable to the standard PLS results. iPLS calibration results, cross-validation results, and

independent test set results are listed in SI Table 5.

[Figure 8 plot: predicted α-helix fraction (FTIR/iPLS) against experimental α-helix fraction (X-ray crystallography) for the numbered test set proteins; R2 = 0.7893, RMSEP = 0.1349.]

Figure 8. Local model line of fit for the prediction of α-helix in the Amide I region (1628–1641

cm-1). 3 PLS components for the 5th spectral interval were used for the prediction of proteins in the test

set. The numbers in the plot are the different proteins in the dataset which are named in SI Table 6.

[Figure 9 plot: predicted β-sheet fraction (FTIR/iPLS) against experimental β-sheet fraction (X-ray crystallography) for the numbered test set proteins; R2 = 0.7927, RMSEP = 0.0894.]

Figure 9. Local model line of fit for the prediction of β-sheet in the Amide I

region (1614-1627 cm-1). 3 PLS components for the 6th spectral interval were used for the

prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset

which are named in SI Table 6.

Navea et al. used iPLS to analyse 24 protein IR spectra using the Amide I, Amide

II and Amide III bands, finding that similar sections within the Amide I region were

most useful.27

■ CONCLUSION

IR spectra of proteins are potentially a rich source of information on protein structure. We therefore

explored whether PLS or PCR methods can be used to determine secondary structure contents from

FTIR spectra. 28 previously acquired spectra, plus spectra from an additional 68 proteins that we

obtained ourselves gave a suitably sized data set. Data for the Amide I and II regions were analysed,

since these were available for all proteins and are the regions most widely used to give information on

secondary structure. A calibration set of 60 proteins was used to optimize the parameters of a method;

cross-validation within this set and application to an independent test set were used to assess each

method’s performance without over-fitting.
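The calibration, cross-validation and independent-test workflow described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it uses a textbook single-response PLS (NIPALS) written in NumPy, synthetic stand-in spectra, and hypothetical set sizes, fold count and function names (`pls1_fit`, `pls1_predict`, `rmsecv`).

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # weights: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """k-fold cross-validated RMSE computed within the calibration set only."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        pred[fold] = pls1_predict(pls1_fit(X[train], y[train], n_components), X[fold])
    return rmse(y, pred)


# Synthetic stand-in data (assumption): one latent "structure fraction" per protein.
rng = np.random.default_rng(1)
n_cal, n_test, n_bins = 60, 24, 40
y = rng.uniform(0.0, 0.6, n_cal + n_test)
X = np.outer(y, rng.normal(size=n_bins)) + 0.05 * rng.normal(size=(n_cal + n_test, n_bins))
X_cal, y_cal, X_test, y_test = X[:n_cal], y[:n_cal], X[n_cal:], y[n_cal:]

# Choose the number of components by cross-validation on the calibration set.
cv_errors = {k: rmsecv(X_cal, y_cal, k) for k in range(1, 9)}
best_k = min(cv_errors, key=cv_errors.get)

# Refit on the whole calibration set and score the untouched test set once.
model = pls1_fit(X_cal, y_cal, best_k)
rmsec = rmse(y_cal, pls1_predict(model, X_cal))
rmsep = rmse(y_test, pls1_predict(model, X_test))
print(best_k, rmsec, cv_errors[best_k], rmsep)
```

The test set plays no part in choosing the number of components, which is what allows RMSEP to act as an honest estimate of performance on unseen proteins.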


We first studied the Amide I region using PCR and PLS. This showed that PLS gave the better

performance. We then compared PLS using SNV and 2nd derivative preprocessing methods and found

that 2nd derivatives were better. Finally, we tested whether using the Amide II regions improved a

model’s performance and found that Amide I data was more useful on its own.
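As an illustration of the two preprocessing steps compared above, the sketch below applies Standard Normal Variate scaling and a second derivative to a synthetic band. It is a hedged example: the function names are hypothetical, the band is a made-up Gaussian, and the derivative is a plain central finite difference, whereas the text does not state which derivative algorithm was actually used (a smoothing filter such as Savitzky-Golay is a common alternative).

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def second_derivative(spectra, spacing=1.0):
    """Central finite-difference 2nd derivative along the wavenumber axis.

    The output has two fewer columns than the input (end points are dropped).
    """
    spectra = np.asarray(spectra, dtype=float)
    return (spectra[:, 2:] - 2.0 * spectra[:, 1:-1] + spectra[:, :-2]) / spacing**2


# Synthetic example (assumption): one Gaussian band centred at 1655 cm-1,
# recorded twice with a different intensity scale and baseline offset.
wn = np.arange(1600.0, 1701.0, 1.0)
band = np.exp(-0.5 * ((wn - 1655.0) / 8.0) ** 2)
spectra = np.vstack([band, 2.5 * band + 0.3])

snv_spectra = snv(spectra)                     # scale and offset differences vanish
d2 = second_derivative(spectra, spacing=1.0)   # offset vanishes, band is sharpened
peak_wavenumber = wn[1 + np.argmin(d2[0])]     # +1 because the first point is dropped
print(peak_wavenumber)
```

SNV removes multiplicative and additive differences between spectra, while the second derivative removes constant and linear baselines and turns each band maximum into a sharp negative peak.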

Models gave an excellent performance within their calibration sets, with RMSEC values as low as a few %. These models were over-fitted, however: when they were applied to data not seen before, or tested rigorously with cross-validation, the RMS errors were around 12% for α-helix, 7% for β-sheet and 8% for other.

We explored whether FTIR data could be used more deeply than for just helix, sheet or coil, subdividing β-sheets into parallel, anti-parallel and mixed forms, helices into α- and 3₁₀-, and coil into β-turns and other. We found that anti-parallel β-sheet contents could be predicted to an RMS of 7% using PLS on 2nd derivative Amide I and II data. It was not possible to predict contents of parallel β-sheets, mixed β-sheets, 3₁₀-helices or β-turns, however. This is presumably because the abundance of these structures is low, giving a small range of contents and weak signals in the IR spectra attributable to these structures.

FTIR spectra are often acquired in D2O instead of H2O to avoid the strong water peaks that might obscure valuable information. Analysis of FTIR spectra of 16 proteins in D2O showed that secondary structure determination was slightly better for proteins in H2O than for proteins in D2O. Goormaghtigh et al.2a found that an FTIR band in the Amide II region at ~1545 cm-1 was most informative for α-helix prediction in their dataset, and a similar band at ~1545 cm-1 for random structure (equivalent to the “other” category used here), which differs from our results. Furthermore, they reported that their best marker band for β-sheet structure occurred at ~1656 cm-1, which is normally reported as a marker band of α-helix, and conjectured that this might be due to the strong anti-correlation between α-helix and β-sheet composition in most proteins. By contrast, we found that 1614–1627 cm-1, the low wavenumber shoulder of the commonly assigned marker band for β-sheet at ~1630 cm-1, was most important for quantification of sheet content. Using Amide I and II data combined was usually more accurate than using the Amide I data alone. We previously used similar methods on Raman spectra and obtained the best results when combining Amide I, II and III data.1

Identifying critical regions within spectra for any given secondary structure motif can reduce analysis time and improve the accuracy of models.30 Local iPLS models of the Amide I FTIR protein bands generally gave slightly poorer results than the global models for all structural types. For quantitative analysis, iPLS is a valuable tool, especially for the identification and qualitative analysis of the spectral regions that are more significant for each structural motif, but the increased complexity of the models, coupled with a poorer performance, discouraged the further use of the iPLS method for this project.
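The idea behind interval PLS discussed above (split the spectrum into intervals, fit a local PLS model on each, and compare the cross-validated errors with a global model) can be sketched as below. This is an assumption-laden illustration rather than the procedure used in this work: the PLS routine is a textbook NIPALS implementation, the data are synthetic with the informative signal deliberately placed in one interval, and the interval count, fold count and all function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        c = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q), x_mean, y_mean


def cv_rmse(X, y, n_components, n_folds=5, seed=0):
    """k-fold cross-validated RMSE of a PLS model on the given columns."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        coef, x_mean, y_mean = pls1_fit(X[train], y[train], n_components)
        pred[fold] = (X[fold] - x_mean) @ coef + y_mean
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def ipls(X, y, n_intervals, n_components):
    """Return (first_bin, last_bin, RMSECV) for each equal-width spectral interval."""
    results = []
    for cols in np.array_split(np.arange(X.shape[1]), n_intervals):
        error = cv_rmse(X[:, cols], y, n_components)
        results.append((int(cols[0]), int(cols[-1]), error))
    return results


# Synthetic spectra (assumption): noise everywhere, signal only in bins 28-34.
rng = np.random.default_rng(0)
n_proteins, n_bins = 60, 56
y = rng.uniform(0.0, 0.6, n_proteins)
X = 0.05 * rng.normal(size=(n_proteins, n_bins))
band_profile = np.array([0.2, 0.6, 1.0, 1.0, 1.0, 0.6, 0.2])
X[:, 28:35] += np.outer(y, band_profile)

local = ipls(X, y, n_intervals=8, n_components=3)
global_rmsecv = cv_rmse(X, y, n_components=3)
best = min(local, key=lambda r: r[2])
print("best interval bins:", best[0], "-", best[1], "RMSECV:", best[2])
print("global RMSECV:", global_rmsecv)
```

Ranking the intervals by RMSECV shows which part of the spectrum carries the information; whether the best local model then outperforms the global one depends on the data, and here the local models were generally slightly poorer.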

In conclusion, we have shown that multivariate analysis of protein FTIR spectra can give α-helix, β-sheet, other and anti-parallel β-sheet contents to a good accuracy, comparable to CD, which is widely used for this purpose.


Table 1. Prediction Results. Results for PCR and PLS methods on ‘Normalized + SNV’ data are shown in the first and second rows of each structure section. For all methods, 60 protein samples were used for calibration and for internal cross-validation, and 44 samples were used for the independent test set. R2 = correlation coefficient of the model for that structure. RMSEC = calculated standard error of the structure distribution in the calibration set. RMSECV = calculated standard error of the structure distribution from the cross-validation of the calibration set. RMSEP = calculated standard error of the structure distribution in the independent test set. ζ = RMSE/STDDEV (Standard Deviation).

DSSP structure                       | Calibration Set (60)           | Cross-Validation (60)           | Independent Test Set (44) |
Method  Preprocessing  Amide region  | Components  R2    RMSEC  ζ     | Components  R2    RMSECV  ζ     | R2      RMSEP    ζ        | STDDEV

α-helix
PCR     SNV            I             | 10          0.73  0.122  0.51  | 5           -     0.104   0.44  | 0.70    0.123    0.517    | 0.238
PLS     SNV            I             | 10          0.88  0.081  0.34  | 4           0.73  0.143   0.60  | 0.69    0.126    0.529    | 0.238
PLS     2nd Deriv      I             | 10          0.96  0.040  0.18  | 4           0.85  0.181   0.79  | 0.71    0.120    0.545    | 0.238
PLS     2nd Deriv      II            | 15          0.98  0.031  0.14  | 4           0.65  0.222   0.97  | 0.63    0.164    0.729    | 0.238
PLS     2nd Deriv      I & II        | 10          0.98  0.028  0.12  | 5           0.91  0.176   0.77  | 0.67    0.132    0.592    | 0.238

β-Sheet
PCR     SNV            I             | 10          0.78  0.072  0.46  | 5           -     0.014   0.09  | 0.59    0.088    0.564    | 0.156
PLS     SNV            I             | 10          0.90  0.047  0.30  | 3           0.77  0.088   0.57  | 0.69    0.077    0.496    | 0.156
PLS     2nd Deriv      I             | 10          0.95  0.020  0.13  | 9           0.94  0.109   0.70  | 0.78    0.070    0.473    | 0.156
PLS     2nd Deriv      II            | 14          0.96  0.031  0.20  | 4           0.68  0.179   1.15  | 0.53    0.101    0.649    | 0.156
PLS     2nd Deriv      I & II        | 10          0.99  0.016  0.10  | 7           0.96  0.100   0.64  | 0.77    0.071    0.472    | 0.156

Parallel β-Sheet
PCR     SNV            I             | 10          0.05  0.035  0.92  | 6           -     0.006   0.16  | 0.01    0.039    1.091    | 0.038
PLS     SNV            I             | 15          0.73  0.018  0.47  | 3           0.73  0.061   1.59  | 0.03    0.034    0.944    | 0.038
PLS     2nd Deriv      I             | 15          0.96  0.007  0.18  | 3           0.44  0.050   1.30  | 0.08    0.027    0.711    | 0.038
PLS     2nd Deriv      II            | 11          0.92  0.920  2.41  | 3           0.43  0.045   1.19  | 0.08    0.031    0.822    | 0.038
PLS     2nd Deriv      I & II        | 10          0.96  0.008  0.21  | 3           0.49  0.049   1.28  | 0.16    0.030    0.777    | 0.038

Anti-Parallel β-Sheet
PCR     SNV            I             | 10          0.74  0.077  0.51  | 5           -     0.005   0.04  | 0.58    0.085    0.558    | 0.152
PLS     SNV            I             | 12          0.91  0.045  0.30  | 3           0.73  0.094   0.62  | 0.65    0.077    0.507    | 0.152
PLS     2nd Deriv      I             | 10          0.94  0.035  0.23  | 8           0.94  0.109   0.72  | 0.78    0.063    0.426    | 0.152
PLS     2nd Deriv      II            | 15          0.97  0.024  0.16  | 2           0.37  0.173   1.14  | 0.23    0.123    0.811    | 0.152
PLS     2nd Deriv      I & II        | 10          0.99  0.016  0.11  | 8           0.97  0.101   0.66  | 0.75    0.067    0.445    | 0.152

Mixed β-Sheet
PCR     SNV            I             | 10          0.22  0.012  0.88  | 5           -     0.000   0.01  | 0.14    0.013    0.984    | 0.014
PLS     SNV            I             | 17          0.79  0.006  0.44  | 3           0.18  0.013   0.96  | 0.16    0.013    0.985    | 0.014
PLS     2nd Deriv      I             | 15          0.91  0.004  0.26  | 9           0.79  0.024   1.53  | 0.38    0.013    0.829    | 0.016
PLS     2nd Deriv      II            | 15          0.94  0.003  0.24  | 3           0.41  0.016   1.16  | 0.32    0.015    1.092    | 0.014
PLS     2nd Deriv      I & II        | 11          0.95  0.003  0.22  | 3           0.50  0.014   1.05  | 0.27    0.014    1.032    | 0.014

3₁₀-helix
PCR     SNV            I             | 10          0.12  0.030  0.93  | 5           -     0.001   0.02  | 0.01    0.029    0.905    | 0.032
PLS     SNV            I             | 15          0.60  0.020  0.63  | 2           0.09  0.033   1.03  | 0.05    0.030    0.937    | 0.032
PLS     2nd Deriv      I             | 15          0.92  0.008  0.25  | 3           0.53  0.037   1.18  | 0.26    0.029    0.973    | 0.032
PLS     2nd Deriv      II            | 11          0.90  0.010  0.31  | 2           0.27  0.037   1.16  | 0.28    0.015    0.493    | 0.032
PLS     2nd Deriv      I & II        | 10          0.95  0.007  0.21  | 3           0.41  0.038   1.20  | 0.16    0.026    0.859    | 0.032

β-turns
PCR     SNV            I             | 13          0.15  0.039  0.91  | 4           -     0.001   0.03  | 0.03    0.044    1.044    | 0.042
PLS     SNV            I             | 15          0.60  0.027  0.63  | 3           0.13  0.046   1.09  | 0.08    0.045    1.075    | 0.042
PLS     2nd Deriv      I             | 15          0.94  0.010  0.24  | 4           0.60  0.051   1.22  | 0.23    0.043    1.036    | 0.042
PLS     2nd Deriv      II            | 12          0.91  0.013  0.30  | 4           0.45  0.051   1.20  | 0.05    0.042    0.974    | 0.042
PLS     2nd Deriv      I & II        | 10          0.97  0.007  0.18  | 4           0.65  0.053   1.26  | 0.12    0.041    0.981    | 0.042

Other
PCR     SNV            I             | 10          0.49  0.076  0.53  | 3           -     0.006   0.04  | 0.18    0.142    1.315    | 0.144
PLS     SNV            I             | 15          0.88  0.037  0.26  | 3           0.47  0.097   0.67  | 0.20    0.140    1.300    | 0.144
PLS     2nd Deriv      I             | 10          0.90  0.045  0.31  | 4           0.62  0.157   1.10  | 0.65    0.082    0.567    | 0.144
PLS     2nd Deriv      II            | 12          0.94  0.034  0.24  | 1           0.26  0.164   1.14  | 0.63    0.066    0.457    | 0.144
PLS     2nd Deriv      I & II        | 10          0.97  0.024  0.17  | 4           0.72  0.198   1.38  | 0.69    0.082    0.568    | 0.145
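The ζ statistic defined in the caption of Table 1 (RMSE divided by the standard deviation of the structure content) can be illustrated with a short sketch. The function names are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text does not say which form was used. The two numerical checks reuse values quoted in the α-helix PCR/SNV/Amide I row of Table 1.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted contents."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def zeta(rmse_value, stddev):
    """zeta = RMSE / STDDEV, as defined in the Table 1 caption."""
    return rmse_value / stddev


# Values quoted from Table 1 (alpha-helix, PCR, SNV, Amide I).
zeta_cal = round(zeta(0.122, 0.238), 2)     # calibration set
zeta_test = round(zeta(0.123, 0.238), 3)    # independent test set
print(zeta_cal, zeta_test)

# Reference point: always predicting the mean content gives zeta = 1
# (assumption: population standard deviation, ddof=0, on the same set).
y_ref = np.array([0.10, 0.25, 0.40, 0.55, 0.05])
y_mean_only = np.full_like(y_ref, y_ref.mean())
zeta_baseline = zeta(rmse(y_ref, y_mean_only), y_ref.std(ddof=0))
print(zeta_baseline)
```

Under this reading, a ζ close to or above 1 means the model does no better than quoting the average content, which is consistent with the poor test-set ζ values for the low-abundance structures in Table 1.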

ASSOCIATED CONTENT

Supporting Information

SI Table 1. Protein FTIR Spectra

SI Table 2. Comparison of PLS results of the secondary structure of proteins in H2O and D2O.

SI Table 3. Proteins used for iPLS

SI Table 4. iPLS intervals and their corresponding frequencies.

SI Table 5. Results for iPLS models for both global and local regression analyses.

SI Table 6. Proteins used for calibration and test sets in iPLS

SI Figure 1. Explained y variance of α-helix content for analysis of the Amide I region as a function of number of components.

SI Figure 2. RMSEC vs. number of components for α-helix prediction from the Amide I region.

SI Figure 3. Cross-validation: 4 PLS components only are required to fit the α-helix model for proteins in H2O using Amide I data.

This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

Email: [email protected]


Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street,

Manchester M1 7DN, UK

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. ‡These authors contributed equally.

ACKNOWLEDGMENT

We thank Dr. Parvez Haris for providing FTIR protein spectra.

REFERENCES

1. Kinalwa, M. N.; Blanch, E. W.; Doig, A. J. (2010) Accurate determination of protein secondary structure content from Raman and Raman Optical Activity spectra, Anal. Chem. 82, 6347-6349.

2. (a) Goormaghtigh, E.; Ruysschaert, J. M.; Raussens, V. (2006) Evaluation of the information content in infrared spectra for protein secondary structure determination, Biophys. J. 90, 2946-2957; (b) Stuart, B. H. (1996) A Fourier transform infrared spectroscopic study of P2 protein in reconstituted myelin, Biochem. Mol. Biol. Int. 39, 629-634.

3. (a) Susi, H.; Byler, D. M. (1986) Resolution-Enhanced Fourier-Transform Infrared-Spectroscopy of Enzymes, Method Enzymol. 130, 290-311; (b) Barth, A. (2007) Infrared spectroscopy of proteins, Biochim. Biophys. Acta-Bioenerg. 1767, 1073-1101; (c) Manning, M. C. (2005) Use of infrared spectroscopy to monitor protein structure and stability, Expert Rev. Proteomics 2, 731-743; (d) Cai, S. W.; Singh, B. R. (1999) Identification of beta-turn and random coil amide III infrared bands for secondary structure estimation of proteins, Biophys. Chem. 80, 7-20.

4. Kong, J.; Yu, S. (2007) Fourier transform infrared spectroscopic analysis of protein secondary structures, Acta Biochim. Biophys. Sin. 39, 549-559.

5. Haris, P. I.; Severcan, F. (1999) FTIR spectroscopic characterization of protein structure in aqueous and non-aqueous media, J. Mol. Catal. B-Enzym. 7, 207-221.

6. (a) Candolfi, A.; De Maesschalck, R.; Jouan-Rimbaud, D.; Hailey, P. A.; Massart, D. L., The influence of data pre-processing in the pattern recognition of excipients near-infrared spectra. 1999; Vol. 21, p 115-32; (b) Ge, Y.-S.; Jin, C.; Song, Z.; Zhang, J.-Q.; Jiang, F.-L.; Liu, Y. (2014) Multi-spectroscopic analysis and molecular modeling on the interaction of curcumin and its derivatives with human serum albumin: A comparative study, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 124, 265-276.

7. Wold, S. (1991) Chemometrics, why, what and where to next?, Journal of pharmaceutical and biomedical analysis 9, 589-596.

8. (a) Al-Ghouti, M. A.; Al-Degs, Y. S.; Amer, M. (2008) Determination of motor gasoline adulteration using FTIR spectroscopy and multivariate calibration, Talanta 76, 1105-1112; (b) Bjelanovic, M.; Sorheim, O.; Slinde, E.; Puolanne, E.; Isaksson, T.; Egelandsdal, B. (2013) Determination of the myoglobin states in ground beef using non-invasive reflectance spectrometry and multivariate regression analysis, Meat science 95, 451-7; (c) Macdonald, J. R.; Johnson, W. C., Jr. (2001) Environmental features are important in determining protein secondary structure, Protein Sci 10, 1172-7.

9. (a) Bartlett, J. W.; Frost, C. (2008) Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables, Ultrasound in Obstetrics and Gynecology 31, 466-475; (b) Sonich-Mullin, C.; Fielder, R.; Wiltse, J.; Baetcke, K.; Dempsey, J.; Fenner-Crisp, P.; Grant, D.; Hartley, M.; Knaap, A.; Kroese, D.; Mangelsdorf, I.; Meek, E.; Rice, J. M.; Younes, M. (2001) IPCS Conceptual Framework for Evaluating a Mode of Action for Chemical Carcinogenesis, Regulatory Toxicology and Pharmacology 34, 146-152.

10. Zou, X.; Zhao, J.; Mao, H.; Shi, J.; Yin, X.; Li, Y. (2010) Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves, Applied spectroscopy 64, 786-94.

11. Navea, S.; Tauler, R.; de Juan, A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure, Anal Biochem 336, 231-42.

12. (a) Martens, H.; Naes, T., Multivariate Calibration. Wiley: 1991; (b) Wang, Y. Q.; Boysen, R. I.; Wood, B. R.; Kansiz, M.; McNaughton, D.; Hearn, M. T. W. (2008) Determination of the secondary structure of proteins in different environments by FTIR-ATR spectroscopy and PLS regression, Biopolymers 89, 895-905.

13. Depczynski, U.; Frost, V. J.; Molt, K. (2000) Genetic algorithms applied to the selection of factors in principal component regression, Analytica Chimica Acta 420, 217-227.

14. Haaland, D. M.; Jones, H. D. T.; Thomas, E. V. (1997) Multivariate classification of the infrared spectra of cell and tissue samples, Applied Spectroscopy 51, 340-345.

15. Smith, B. C., Fundamentals of Fourier transform infrared spectroscopy. 2nd ed.; Taylor & Francis: 2011.

16. (a) Smith, B. C., Fundamentals of Fourier Transform Infrared Spectroscopy, Second Edition. Taylor & Francis: 2011; (b) Stuart, B. H. (1996) A Fourier transform infrared spectroscopic study of the secondary structure of myelin basic protein in reconstituted myelin, Biochemistry and molecular biology international 38, 839-45.

17. Glassford, S. E.; Byrne, B.; Kazarian, S. G. (2013) Recent applications of ATR FTIR spectroscopy and imaging to proteins, Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1834, 2849-2858.

18. Ramer, G.; Lendl, B., Attenuated Total Reflection Fourier Transform Infrared Spectroscopy. In Encyclopedia of Analytical Chemistry, John Wiley & Sons, Ltd: 2006.

19. Milosevic, M., Internal reflection and ATR spectroscopy. Wiley: 2012.

20. Agnès, T.; Diane, R.; Yves, D.; Dieter, N.; Vincent, F. (2000) Transient non-native secondary structures during the refolding of α-lactalbumin detected by infrared spectroscopy, Nature Structural & Molecular Biology 7, 78-86.

21. Kennard, R. W.; Stone, L. A. (1969) Computer Aided Design of Experiments, Technometrics 11, 137-148.

22. Perez-Guaita, D.; Ventura-Gayete, J.; Perez-Rambla, C.; Sancho-Andreu, M.; Garrigues, S.; de la Guardia, M. (2012) Protein determination in serum and whole blood by attenuated total reflectance infrared spectroscopy, Anal. Bioanal. Chem. 404, 649-656.

23. Perez-Guaita, D.; Ventura-Gayete, J.; Pérez-Rambla, C.; Sancho-Andreu, M.; Garrigues, S.; Guardia, M. (2012) Protein determination in serum and whole blood by attenuated total reflectance infrared spectroscopy, Analytical and bioanalytical chemistry 404, 649-656.

24. Kabsch, W.; Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22, 2577-637.

25. Faber, N. M. (1999) Estimating the uncertainty in estimates of root mean square error of prediction: application to determining the size of an adequate test set in multivariate calibration, Chemometrics and Intelligent Laboratory Systems 49, 79-89.

26. Arrondo, J. L. R.; Goni, F. M. (1999) Structure and dynamics of membrane proteins as studied by infrared spectroscopy, Prog. Biophys. Mol. Biol. 72, 367-405.

27. Navea, S.; Tauler, R.; de Juan, A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure, Analytical Biochemistry 336, 231-242.

28. Dousseau, F.; Pezolet, M. (1990) Determination of the secondary structure-content of proteins in aqueous-solutions from their Amide-I and Amide-II infrared bands - Comparison between classical and partial least-squares methods, Biochemistry 29, 8771-8779.

29. Zou, X. B.; Zhao, J. W.; Mao, H. P.; Shi, J. Y.; Yin, X. P.; Li, Y. X. (2010) Genetic Algorithm Interval Partial Least Squares Regression Combined Successive Projections Algorithm for Variable Selection in Near-Infrared Quantitative Analysis of Pigment in Cucumber Leaves, Applied Spectroscopy 64, 786-794.

30. Norgaard, L.; Saudland, A.; Wagner, J.; Nielsen, J. P.; Munck, L.; Engelsen, S. B. (2000) Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy, Applied Spectroscopy 54, 413-419.

Table of Contents Graphic
