Determination of Protein Secondary Structure from
Infra-Red Spectra Using Partial Least Squares
Regression
Kieaibi E. Wilcox, Ewan W. Blanch†, and Andrew J. Doig*
*Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester
M1 7DN, UK
†Present Address: School of Applied Sciences, RMIT University, 124a La Trobe Street, Melbourne,
VIC 3001, Australia
ABSTRACT Infra-red (IR) spectra contain substantial information on protein structure. This has
previously most often been exploited by using known band assignments. Here, we convert spectral
intensities in bins within Amide I and II regions to vectors and apply machine learning methods to
determine protein secondary structure. Partial Least Squares was performed on spectra of 90 proteins in
H2O. After preprocessing and removal of outliers, 84 proteins were used for this work. Standard Normal
Variate and 2nd derivative preprocessing methods on the combined Amide I and II data generally gave
the best performance, with root mean square values for prediction of ~12% for α-helix, ~7% for β-sheet,
7% for anti-parallel β-sheet and ~8% for other conformations. Analysis of FTIR spectra of 16
proteins in D2O showed that secondary structure determination was slightly poorer than in H2O. iPLS
was used to identify the critical regions within spectra for secondary structure prediction and showed
that the sides of bands were most valuable, rather than their peak maxima. In conclusion, we have
shown that multivariate analysis of protein FTIR spectra can give α-helix, β-sheet, other and
anti-parallel β-sheet contents to a good accuracy, comparable to circular dichroism, which is widely used for
this purpose.
A rapid assessment of a protein’s secondary structure is of great value in determining whether a
protein is folded or has folded correctly. Circular dichroism (CD) is the most widely used method for
this purpose, as it can quantify helix and sheet contents, though we have also shown that Raman
spectroscopy or Raman Optical Activity (ROA) could be even more accurate.1 Infra-red (IR)
spectroscopy is used even more widely than CD or Raman, and is potentially of great value in
determining protein structure in H2O, D2O and cellular and tissue samples, since the spectra are so
information rich.2 While IR is often used to study protein structure, this is typically achieved by looking
at intensities and wavenumbers of known bands for particular structures, such as the Amide I and III
bands for helix and sheet.3 However, the large number of overlapping IR bands, coupled with variations
of marker bands cited within the literature, typically makes accurate protein structural analysis through
the deconvolution of FTIR spectra both complex and time consuming. Here, we take a similar approach
to the one that we used for protein Raman and ROA spectra1 to analyse protein IR spectra. IR spectra
are turned into vectors of intensities within bins and Partial Least Squares (PLS) and Principal
Component Regression (PCR) methods are used to determine secondary structure contents. We also
investigated whether we can use this approach to go beyond a three way assignment of residues into
helix, sheet or other, and get more detailed information on different types of sheet (parallel, anti-parallel
or mixed), helix (310 or α) and turns.
The Amide I region of FTIR spectra is normally used for the analysis of protein secondary structure
because the frequencies of the Amide I modes are known to correlate closely to a protein’s secondary
structure elements.4 However, the Amide I bands of proteins display extensive overlap
of the underlying component bands of α-helix, β-sheet, β-turns and coils, which are
not instrumentally resolvable.5 Mathematical methods, such as second derivatives, are therefore often
used to enhance and resolve the individual band components.4
Although FTIR is a preferred technique for the rapid determination of protein secondary structure, its
spectra are multicollinear in nature, so calibration is required to extract protein structural information
from them.6 Multivariate analysis removes the near multicollinearity found in spectral
measurements and derives the relationships between samples through statistical modelling.7
Multivariate regression analysis methods such as Partial Least Squares (PLS) and Principal
Component Regression (PCR) were applied to our dataset of FTIR spectra of 84 proteins, 16 of which
were also measured in D2O. All spectra were pre-treated by intensity normalization and
SNV (Standard Normal Variate). The second derivative was also calculated in order to enhance
the separation of the underlying spectral bands.
60 spectra were used to develop the regression model. Models were cross-validated using 10
segments (10-fold cross-validation). An optimal number of latent variables, indicated by a minimum root mean square
error of cross-validation (RMSECV), was selected to avoid overfitting.8 The optimal number of
components was used to fit the spectra of new samples.
The predictive ability of the model was assessed by testing on new data (the reliability of prediction of
structure from different but related spectra) as well as already seen data (the reproducibility of
prediction using spectra from the same measurements and conditions).9 This was accomplished by assigning
44 samples to the test set, of which 24 were new to the model and 20 had already been seen
by the model. This also increased the number of proteins in our test set. The data were partitioned into
training and test sets by the Kennard-Stone (KS) algorithm, which puts the most different
spectra in the calibration set.
A version of PLS called interval PLS (iPLS) uses subintervals of the full spectrum. Local models are
built with intervals of the spectral region; the performance of the local models is compared to that of the
full spectrum model.10 By using intervals of the spectrum, regions specific to each secondary structure
motif can be identified. Knowing the predictive ability of sub-spectral regions could reduce spectral
measurement time and production cost by employing only a few significant spectral regions.11
PLS/PCR multivariate calibration is used to quantify new variables ‘y’ from the matrix of the FTIR
measured protein spectra X, via a mathematical model that relates X to ‘y’. X holds the predictor
variables (FTIR spectra of protein samples) and ‘y’ is the reference variable (DSSP values). This model
can then be used to predict the structure of unknown protein samples.12 PLS and PCR address the near
multicollinearity often found in spectral measurements, where two or more independent (predictor)
variables are correlated and so provide redundant information to the model. FTIR spectra are
multivariate in nature, leading to a need for dimension reduction, which can be achieved by PLS
regression.
The PCR method is based on principal component analysis (PCA). PCR performs data decomposition
into loading and score variables. For PCR, the estimated scores matrix consists of the most dominating
principal components of X. These components are linear combinations of X measurements determined
by their ability to account for the variability in X.13 The first principal component is the linear
combination of the original X-variables with the highest possible variance; PCR uses only the X
variables for the analysis without employing the y variable. PLS regression can give good prediction
results with fewer components than PCR because the response variable is employed in the regression.14
The number of components needed for interpreting the information in X (spectra) which is related to ‘y’
(secondary structure) is, therefore, smaller for PLS than for PCR. This may lead to a simpler
interpretation, though the two methods often give comparable results.4
■ MATERIALS AND METHODS
Spectral Measurement and Processing. 84 proteins were measured in H2O; 28 of these FTIR
protein spectra were provided by Dr. Parvez Haris from De Montfort University, UK (SI Table 1). The
remaining proteins were bought from Sigma-Aldrich and used without further purification. Proteins
were chosen based on criteria such as: a wide range of helix and sheet contents, crystal structure quality,
secondary structure contents, fold classes, solubility and stability. Spectra were collected using the ATR
accessory, made with a ZnSe crystal, with a Bruker-Tensor FTIR spectrometer and MCT detector. 30
µL of each protein solution at 50 mg/ml in distilled H2O was placed on the ATR cell. A background
spectrum of H2O was used for automated background signal subtraction using a customized routine in
the OPUS 6.2 software. IR measurements were made with 4 cm-1 spectral resolution, data points were
measured every 1 cm-1, and 32 scans were collected per sample, over a range of 4000–950 cm-1.
Acquisition times for each spectrum were less than two minutes. The minimum absorbance for all
spectra in the dataset was subtracted to remove background absorbance and light scattering effects. No
smoothing function was applied to the data.
ATR correction. Attenuated total reflectance (ATR) spectroscopy15 was used. ATR is based on the
concept of internal reflection. The infrared beam enters the ATR crystal and strikes the crystal surface
in contact with the sample at an incidence angle of 45°.16 The sample has a lower refractive index
than the crystal; if the angle of incidence at this interface is greater than the
critical angle, the beam undergoes total internal reflection.16 The absorption of the evanescent wave
that penetrates into the sample is measured. The path length of a transmission experiment is the
same across the spectrum because it is defined by the thickness of the sample. In the ATR experiment,
however, the depth to which the sample is penetrated by the infrared beam is a function of the
wavelength;17 for this reason, the relative intensity of bands in an ATR spectrum increases with
wavelength, that is, towards lower wavenumber. This effect can cause anomalous dispersion which affects the spectral peak shape. As a
result, the infrared spectrum of a sample obtained by ATR, when compared to its transmission
measurement, shows some significant differences, such as higher absorbance bands around the Amide II
region than in the Amide I region. ATR correction is advisable where quantitative analysis is
necessary.18 ATR correction was performed on protein spectra used for this analysis to bring each
spectrum as close as possible to its transmission counterpart. Using the ‘Advanced ATR Correction’ feature in
Bruker’s OPUS software, these abnormalities were partially corrected on all in-house collected spectra,
though we did observe some residual changes in the ratios of Amide I to Amide II peaks compared to IR
transmission spectra. For example, ATR-corrected carbonic anhydrase had a 0.9 to 1 ratio for Amide I
to Amide II peak intensity, while the transmission spectrum of the same protein had a 0.6 to 1 ratio for
Amide I to II intensities. This ATR-corrected ratio can differ from protein to protein; ATR correction is
often less satisfactory for samples with strong absorption peaks.19
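The wavelength dependence of the ATR penetration depth can be illustrated with the standard evanescent-wave expression d_p = λ / (2π n1 sqrt(sin²θ − (n2/n1)²)). The sketch below is illustrative only: the refractive indices (about 2.4 for ZnSe and 1.33 for an aqueous sample) are assumed round values, not parameters reported in this work, and the function name is hypothetical.

```python
import math

def penetration_depth_um(wavenumber_cm1, n_crystal=2.4, n_sample=1.33, angle_deg=45.0):
    """Evanescent-wave penetration depth in micrometres.

    n_crystal and n_sample are assumed values (ZnSe and water-like sample).
    """
    wavelength_um = 1.0e4 / wavenumber_cm1  # convert cm-1 to micrometres
    theta = math.radians(angle_deg)
    term = math.sin(theta) ** 2 - (n_sample / n_crystal) ** 2
    if term <= 0:
        raise ValueError("angle is below the critical angle: no total internal reflection")
    return wavelength_um / (2.0 * math.pi * n_crystal * math.sqrt(term))

# Compare a typical Amide I position with a typical Amide II position.
depth_amide_i = penetration_depth_um(1650.0)
depth_amide_ii = penetration_depth_um(1550.0)
print(f"Amide I: {depth_amide_i:.3f} um, Amide II: {depth_amide_ii:.3f} um")
```

Under these assumptions the beam penetrates slightly deeper at the Amide II position than at the Amide I position, which is the kind of band-ratio distortion that ATR correction is meant to reduce.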
84 proteins measured in H2O and 16 selected proteins measured in D2O are listed in SI Table 1. Data
were normalized for intensity and autoscaled using the Standard Normal Variate (SNV) approach. This
centres each spectrum by subtracting the mean value of the spectrum from each absorbance; SNV then
scales each spectrum by its standard deviation to give the spectrum a variance of 1 and mean of 0.
The normalized data were then subjected to
second derivative analysis using Matlab’s Savitzky-Golay algorithm with 5 degrees; this reveals the
underlying absorption changes.20
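The SNV step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Matlab routine used in this work; the use of the sample standard deviation and the example absorbances are assumptions.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard Normal Variate: centre one spectrum on its own mean and
    scale it by its own standard deviation (sample SD assumed here)."""
    m = mean(spectrum)
    s = stdev(spectrum)
    if s == 0:
        raise ValueError("a flat spectrum cannot be SNV-scaled")
    return [(a - m) / s for a in spectrum]

# Hypothetical absorbances for one spectrum, not measured data.
example = [0.10, 0.20, 0.40, 0.30]
scaled = snv(example)
print(scaled)
```

Each spectrum is scaled independently of the others, so SNV needs no information from the rest of the dataset.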
Data Selection Scheme. 60 H2O spectra were used as a calibration set for the PLS regression
model. The remaining 24 were used as part of an independent test set for assessment of predictive
accuracy after training. 20 spectra from the calibration set were also included in the independent test set,
giving 44 samples in total. Selection was made using the Kennard-Stone (KS) algorithm,21,22 a well-known
method for the selection of a sample subset for calibration. It chooses objects from the X-matrix
(the measured spectra) that provide a uniform distribution over the data set. The algorithm does this by
first assigning to the calibration set the sample closest to the mean of the entire data set; the next sample is
then chosen based on the squared distance to the samples already assigned, using Euclidean or
Mahalanobis distances.23 The sample furthest from the already selected samples is added to the
calibration set. The algorithm was set to select the first 60 samples for the calibration set, with the rest
left for validation (Figure 1).
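The selection rule described above can be sketched as follows. This is a hedged illustration, not the code used in this work: it assumes Euclidean distance, seeds the calibration set with the sample closest to the mean, and then repeatedly adds the sample whose nearest already-selected neighbour is furthest away. The function name and toy data are hypothetical.

```python
import math

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by a Kennard-Stone style rule."""
    n = len(X)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and len(X)")
    n_vars = len(X[0])
    centre = [sum(row[j] for row in X) / n for j in range(n_vars)]
    # Seed: the sample closest to the mean spectrum.
    first = min(range(n), key=lambda i: math.dist(X[i], centre))
    selected = [first]
    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = [math.dist(X[i], X[first]) for i in range(n)]
    while len(selected) < n_select:
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(X[i], X[nxt]))
    return selected

# Toy one-variable "spectra" (hypothetical values).
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration = kennard_stone(X, 3)
test_set = [i for i in range(len(X)) if i not in calibration]
print(calibration, test_set)  # [3, 4, 0] [1, 2]
```

In the toy example the extreme samples are drawn into the calibration set early, which is why the remaining test samples tend to lie inside the range spanned by the calibration data.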
Figure 1. Schematic representation of 84 protein samples in H2O used in the training set and the
test set. Matrix X represents the FTIR spectra while vector ‘y’ represents the percentage fraction of
secondary structure.
PLS Modeling. Amide I and II regions, whose bands originate from C=O stretching and NH
bending vibrations, were used for this analysis. The Amide III region was not analysed as it was not
recorded in many of the protein spectra. Amide I and II regions were used separately and combined.
The FTIR spectra were put into a matrix containing 84 rows of protein spectra having 101 columns
for the spectral wavenumbers from 1600–1700 cm-1 in the Amide I region and a separate matrix of 121
columns for the Amide II region from 1480–1600 cm-1. The algorithm for PLS was written using Matlab
R2010 and Matlab’s PLS regression. PLS requires a priori knowledge about the protein structural
groups, so relative secondary structure contents determined by DSSP24 from the PDB were put into a
matrix form ‘y’. Secondary structure fractions are listed in SI Table 1. Our parameters for the
PLSregress tool in Matlab are available on request.
■ RESULTS AND DISCUSSION
Partial Least Squares. PLS was first applied to the calibration set and Amide I data. It was first
necessary to find the number of PLS components required to explain the data. PLS extracts components
in order of their relevance to the structure type in question. SI Figure 1 shows how the percentage of y
explained variance increases with the number of components used when predicting α-helix contents
from Amide I data. The plot was used to choose a minimum number of components that give rise to the
maximum explained variance. After 10 components the gap between points is small and the RMSE is
low (SI Figure 2), so 10 components for the analysis of α-helix from the Amide I region
were selected to give the maximum explanation of variance and a low Root Mean Square Error for
Calibration (RMSEC).
Similarly, 10 components gave a low RMSEC of 0.020 for analysis of β-sheet content
from the amide I data. Beyond 10 components, the decrease in the RMSEC is
negligible (not shown). Similar numbers of components were used for other
secondary structure types (Table 1).
Cross-Validation. PLS is sensitive to overfitting, that is, the RMSEC will continue to decrease with
additional components; however, the RMSEP will increase due to overfitting.25 For this reason 10-fold
cross-validation of the model was carried out. An optimal number of components, indicated by a
minimum root mean square error of cross-validation (RMSECV), was selected to avoid overfitting. For
PLS, this error measurement, which is the standard deviation of the unexplained variance of the cross-
validation, is more useful than RMSEC, because RMSEC does not indicate when the model is
overfitted. Plots of the cross-validation results are used to visualize the number of components that give
the lowest RMSECV. SI Figure 3 shows that 4 components were sufficient for α-helix using Amide I
data. The number of components required for other secondary structures was determined in the same
way.
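The component-selection procedure can be sketched as below. This is an illustrative Python stand-in for the Matlab workflow, under stated assumptions: a single-response PLS fitted by NIPALS deflation, interleaved (not random) folds, and synthetic data in which y is an exact linear function of three variables. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, n_comp):
    """Fit a single-response PLS model by NIPALS deflation."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    comps = []
    for _ in range(n_comp):
        w = [dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break  # nothing left in y to explain
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                 # scores
        tt = dot(t, t)
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = dot(f, t) / tt
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    x_mean, y_mean, comps = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat

def rmsecv(X, y, n_comp, k=10):
    """Root mean square error of k-fold cross-validation (interleaved folds)."""
    n = len(X)
    squared_error = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = pls1_fit([X[i] for i in train], [y[i] for i in train], n_comp)
        for i in range(fold, n, k):
            squared_error += (pls1_predict(model, X[i]) - y[i]) ** 2
    return math.sqrt(squared_error / n)

# Synthetic data: three variables, y exactly linear in them (not protein data).
rng = random.Random(0)
X = [[rng.gauss(0, 1), 3 * rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
y = [0.3 + 0.5 * a - 0.2 * b + 0.1 * c for a, b, c in X]

errors = {a: rmsecv(X, y, a) for a in (1, 2, 3)}
best = min(errors, key=errors.get)
print(errors, best)
```

With real, noisy spectra the RMSECV typically passes through a minimum and then rises again as extra components start to fit noise, which is the behaviour used above to choose the number of components.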
Independent Test Set. Cross-validation performs an internal test with the calibration model and
is generally adequate to assess the fitting ability of a model. However, it is always useful to additionally
test a model against an independent test set. We did this by using 44 samples of which 24 have not been
previously seen by the model. Ideally, the independent test set would only include proteins that were not
in the calibration set. However, in our case, this would only be 24 proteins, a number too small to give
accurate analysis. The RMSEP of the independent test set is obtained to judge the
performance of the calibration set (Table 1).25 Fitted response values are computed by the PLS model
and an R2 value is generated, comparing predicted values to those known from crystal structures.
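The two test-set statistics can be written out explicitly. The sketch below assumes that R2 is the coefficient of determination, 1 − SS_res/SS_tot; the text does not state which definition was used, and a squared correlation coefficient would give somewhat different values. The fractions are made-up numbers, not results from Table 1.

```python
import math

def rmsep(observed, predicted):
    """Root mean square error of prediction."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical helix fractions from crystal structures and from a model.
observed = [0.10, 0.40, 0.60]
predicted = [0.20, 0.40, 0.50]
error = rmsep(observed, predicted)
fit = r_squared(observed, predicted)
print(round(error, 4), round(fit, 4))
```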
PCR. The major difference between PLS and PCR is that PCR uses only the x variables for the
analysis without employing the y variable, although the y response variable is needed for the calculation
of the residual. A comparison of PLS and PCR, using the SNV Amide I data, was made to determine
which method works best.
Plots of RMSECV as a function of number of components showed that 4 PLS and 5 PCR components
are optimal to minimise the RMSECV (not shown). PLS was chosen for future use, since it usually has a
better performance on the calibration set and uses fewer components with cross-validation.
PLS with 2nd Derivative. PLS models of Amide I data built from SNV and from second derivative
spectra were compared (Table 1). The second derivative usually gave a better performance, so was used for further
analysis of Amide II data. The plot of observed vs. fitted response in Figure 2 for second derivative data
shows that the regression line fits the data very well for α-helix for the Amide I region
with the calibration set and less well for the independent test set, as expected (Figure 3). Datasets
computed from the second derivatives of the FTIR spectra were used for secondary structure analysis
with PLS. The R2 and RMSECV values for each calibration were used in choosing the
best fit model for α-helix, β-sheet, parallel β-sheet, anti-parallel β-sheet, β-turns
and other (Table 1). Goormaghtigh et al.3a reported that 310-helix content was too low to give
statistically meaningful results in their study on 50 proteins and we also found a poor correlation for 310-
helix.
Figure 2. Predicted versus training set values for the calibration set for the analysis of α-helix
content from 2nd derivative Amide I data in H2O, with line of best fit.
Figure 3. Independent test set plot for analysis of α-helix from the Amide I 2nd derivative
data: PLS fitted vs response on 44 proteins in H2O. Proteins in circles were in the calibration set.
The PLS analysis of the 2nd derivative of Amide I data used 10 components for the
prediction of α-helix content, giving an R2 of 0.96 with a low RMSEC of 0.040 which,
after cross-validation, gave an RMSECV value of 0.181 with only 4 components.
These 4 components were used to predict α-helix content in the independent test
set, giving an RMSEP of 0.120 with an R2 of 0.71. Figures 4 and 5 show the fitted vs
observed plots for the β-sheet calibration and independent test sets using the Amide I
region and 2nd derivatives. Data on additional structure types are listed in Table 1.
Figure 4. Predicted versus training set values for the calibration set for the analysis of β-sheet
content from 2nd derivative Amide I data in H2O, with line of best fit.
Figure 5. Independent test set plot for analysis of β-sheet from the Amide I 2nd derivative data: PLS
fitted vs response on 44 proteins in H2O. Proteins in circles were in the calibration set.
ζ Scores. Calibration of each of these secondary structural contents was done for both the entire
Amide I and II spectral regions separately, and for the combined Amide I and II regions together (Table
1). Cross-validation R2 results of 0.85 (α-helix), 0.94 (β-sheet), 0.44 (parallel β-sheet),
0.94 (antiparallel β-sheet), 0.79 (mixed β-sheet), 0.53 (310-helix), 0.60 (turns), and 0.62
(other) for the Amide I region using PLS 2nd derivative data indicate the fraction of variance explained
for each predicted structure type. The predictors work best for α-helix, β-sheet and antiparallel β-sheet.
The RMSECV values for α-helix (0.181) and β-sheet (0.109) are higher than for
other structures, such as parallel β-sheet (0.050), 310-helix (0.037), and turns (0.051). This
is not because the predictive values of the latter are better, but rather that in our samples,
the variations of parallel β sheet, 310-helix and turns are a lot smaller. To assess
the predictive accuracy of this statistical analysis, a score ‘ζ’ is calculated.
ζ is the RMSECV (δ) from the FTIR model divided by the standard deviation (σ)
of the protein secondary structure contents in the reference set of crystal structures:
ζ = δ_FTIR / σ    (1)
This score compares the RMSECV with the width of the distribution of the reference structure contents. It is used
to compare prediction accuracies, as it accounts for the natural variation of the crystal structure data
compared to the calculated structure obtained from the FTIR model.16 A ζ value less than one indicates
that the FTIR prediction values for that structure are better than guesswork, and a value around one
means that the FTIR prediction is comparable to guesswork. A value higher than one indicates that the FTIR
method is of no value.1,16-17 The ζ scores for α-helix, β-sheet, and anti-parallel β-sheet
from the amide I data are all lower than one, apart from PLS 2nd derivative Amide II for
sheet, showing that the predictors are successful (Table 1). The ζ scores for the other
structures are generally less than one for PCR only. The two scores, RMSECV and ζ, should
both be considered because a secondary structure with few samples and a small
range in the data space could give a low RMSE by chance.
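Equation 1 is simple to evaluate, as the following sketch shows. The reference fractions and the RMSECV are invented for illustration, and the choice of the sample standard deviation (rather than the population form) is an assumption, since the text does not specify it.

```python
from statistics import stdev

def zeta_score(rmsecv, reference_fractions):
    """Equation 1: RMSECV divided by the standard deviation of the
    reference secondary structure contents (sample SD assumed)."""
    return rmsecv / stdev(reference_fractions)

# Hypothetical reference contents for one structure type across three proteins.
reference = [0.1, 0.3, 0.5]   # standard deviation is 0.2
zeta = zeta_score(0.1, reference)
print(round(zeta, 3))
```

A structure type whose contents barely vary across the reference set has a small σ, so even a small RMSECV can give ζ close to or above one; this is why a low RMSECV alone is not evidence of a useful predictor.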
Analysis of Protein FTIR Spectra in D2O. Protein absorption peaks in the Amide I region
overlap the absorption of water bands in the same region (~1643 cm-1), making it difficult to obtain
informative spectra from H2O. One solution is to use D2O as a solvent because it does not absorb
strongly in the Amide I region.4,26 However, whereas the Amide I
region is easily observed, the Amide II region can then be affected by the D2O solvent absorption26 and
D2O may change the protein spectral signature.
Selected proteins in both solvents were used to investigate and compare their Amide I data. The same
techniques and validation steps applied to the proteins dissolved in H2O were applied to 16 proteins in
D2O, using PLS (SI Table 2).
Overall, the predicted values of α-helix and β-sheet for the 16 proteins measured
in both H2O and D2O are similar. The RMSEP values for α-helix from proteins in H2O vs
proteins in D2O are 0.066 and 0.069 respectively, while the RMSEP values for their β-
sheet contents are 0.044 and 0.045, showing slightly better performance with H2O.
We emphasise, however, the significantly smaller datasets used for the D2O studies.
PLS and other quantitative analysis methods have previously been used to quantify secondary
structure in H2O and D2O.27 α-Helix and β-sheet band assignments within the Amide I region have
been published by Kong and Yu for both H2O
and D2O.4 Dousseau and Pezolet used PLS on 13 proteins in both H2O and D2O.28 Their D2O results
were poorer than for H2O overall, especially for myoglobin.
Interval Partial Least Squares. Interval Partial Least Squares (iPLS) is a variant of PLS used to
perform regression analysis on sub-intervals of the spectra. Local models are obtained from chosen
interval(s) based on RMSECV performance. Methods developed from sub-intervals may give better
predictions, as they may include less noise.27 Knowledge of the prediction ability of sub-regions of FTIR
spectra may also lead to the improvement of instruments that can reduce production cost by only
employing a few significant spectral regions.27
The KS algorithm was used to split spectra into calibration and test sets. Norgaard’s iPLS_ToolBox
for Matlab was used in the implementation of this work. This allows the matrix of the entire spectrum to
be split into wavenumber intervals; iPLS is performed on the entire spectrum (global) and on the
intervals (local) simultaneously. The interval with the lowest error indicates the area of most importance
in terms of structural information. Proteins used are listed in SI Table 3.
7 spectral intervals were initially assigned for building the model. PLS models were calculated for
each of the sub-intervals, using 5 latent variables and ‘leave 5% out’ cross-validation across all sub-
intervals. Two components gave the minimum RMSECV; therefore two components were chosen to
model all sub-intervals.
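The division of the Amide I axis into sub-intervals can be sketched as follows. The splitting rule here (near-equal widths, with the leftover points given to the first intervals) is an assumption rather than a documented detail of the toolbox; it is shown because it reproduces the two interval ranges quoted in this paper (1641–1628 cm-1 for interval 5 and 1627–1614 cm-1 for interval 6). The complete list of ranges is given in SI Table 4.

```python
def split_intervals(wavenumbers, n_intervals):
    """Split an axis into n_intervals contiguous blocks of near-equal size.
    The first (len % n_intervals) blocks each receive one extra point."""
    base, extra = divmod(len(wavenumbers), n_intervals)
    intervals, start = [], 0
    for k in range(n_intervals):
        size = base + (1 if k < extra else 0)
        intervals.append(wavenumbers[start:start + size])
        start += size
    return intervals

# Amide I axis at 1 cm-1 spacing, 1700 cm-1 first: 101 points.
amide_i = list(range(1700, 1599, -1))
intervals = split_intervals(amide_i, 7)
bounds = [(block[0], block[-1]) for block in intervals]
print(bounds[4], bounds[5])  # (1641, 1628) (1627, 1614)
```

A local PLS model is then fitted to the columns of X that fall in each block, and its RMSECV is compared with that of the model built on all 101 columns.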
An RMSECV was recalculated for each sub-interval and for the full spectrum. This variable selection
and the results are shown in Figures 6 and 7 for α-helix and β-sheet; the columns represent
the individual spectral regions (the intervals), and the numbers in italics denote
the optimal number of components for each interval.29 Interval number 5 (1628-1641 cm-1) has the
lowest RMSECV for α-helix, though its performance is still poorer than for the global model which
uses the entire Amide I spectral region. A calibration model based on interval number 5 using 3
components was used for the independent test set.
[Figure 6 plot: bar chart of RMSECV (y-axis, 0 to 0.2) against iPLS interval number (x-axis, 1 to 7); PLS components per interval: 1, 4, 2, 3, 3, 3, 1.]
Figure 6. iPLS for α-helix. Each bar in the plot is a spectral interval from 1700-1600 cm-1, with
1700 cm-1 to the left; Interval number 5 corresponds to frequency factors in the range 1641–1628 cm-1.
The dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval
bar signify the number of PLS components used in fitting each local model. See SI Table 4 for
intervals and their wavenumber ranges. The curve is the mean spectral intensity for the Amide I region.
[Figure 7 plot: bar chart of RMSECV (y-axis, 0 to 0.1) against iPLS interval number (x-axis, 1 to 7); PLS components per interval: 1, 4, 2, 5, 3, 3, 3.]
Figure 7. iPLS for β-sheet. Each bar in the plot is a spectral interval from 1700-
1600 cm-1; Interval number 6 corresponds to frequency factors in the range 1614–1627 cm-1. The
dotted line represents the RMSECV for the global iPLS model. Numbers in italics in each interval bar
signify the number of PLS components used in fitting each local model. The curve represents the mean
spectral intensity for the Amide I range.
For β-sheet, interval number 6 had the lowest RMSECV out of 7 intervals, used 3
components and was superior to the global model using all Amide I data (Figure 7).
A calibration model based on the 3 components using interval number 6, which covers 1627-1614 cm-1,
was developed. For both α-helix and β-sheet, we found that the interval that contains data from the side
of a peak was most useful, rather than the interval containing the peak maximum. The intervals
containing the peak maxima are less useful, since they demonstrate less variation between spectra.
iPLS α-helix. The predicted fraction of α-helix from the Amide I region was
compared to that from crystal structures. The RMSECV of 0.144 is a little better than that of the
standard PLS model with RMSECV of 0.181, but the iPLS R2 for cross-validation (0.80) is lower than
the value of 0.85 for PLS. Interval number 5 model (1628-1641 cm-1) with 3 components gave an
RMSECV of 0.149 and an R2 of 0.78 which is below the performance of the global model.
iPLS β-sheet. The RMSECV for the β-sheet global model is 0.092 with an R2 of 0.81, while the
RMSECV from the standard PLS model was 0.050 with a particularly high R2 of 0.94. The global
iPLS model for β-sheet thus did much more poorly than the traditional PLS model,
but this could be due to the number of components used for iPLS global for Amide I (3
vs. 9). A model of interval number 6 (1614-1627 cm-1) with 3 components gave an RMSECV of 0.087
and an R2 of 0.83 which is much better than the performance of the iPLS global model.
iPLS Independent Test Set. The KS algorithm was used for splitting the protein samples into
training and test sets. Although the number of proteins in the independent test set was the same for both
PLS and iPLS, a few factors were different between them. The proteins in the training sets for the two
methods were not completely identical, as several proteins in the calibration set of the PLS method were
in the test set of the iPLS method. This is because, for iPLS, data selection for the calibration and test sets
was based on regions of the spectra, instead of the full spectra. The numbers of components used for the
prediction test set models were based on the error of the cross-validation performed for calibration
models. The results for the 44 proteins in the iPLS independent test set were as good as those of
traditional PLS. This comparison is presented in Table 4. Figure 8 shows that the model for
α-helix fits the data and the RMSEP is 0.134 with an R2 of 0.78. The β-sheet model
also fits the data well with an RMSEP value of 0.089 and an R2 of 0.79 (Figure 9). These
results are comparable to the standard PLS results. iPLS calibration results, cross-validation results, and
independent test set results are listed in SI Table 5.
[Figure 8 plot: predicted (FTIR/iPLS) against experimental (X-ray crystallography) α-helix fraction, both axes from 0 to 0.8, with points labelled by protein number; R2 = 0.7893, RMSEP = 0.1349.]
Figure 8. Local model line of fit for the prediction of α-helix in the Amide I region (1628–1641
cm-1). 3 PLS components for the 5th spectral interval were used for the prediction of proteins in the test
set. The numbers in the plot are the different proteins in the dataset which are named in SI Table 6.
[Figure 9 plot: predicted (FTIR/iPLS) against experimental (X-ray crystallography) β-sheet fraction, both axes from 0 to 0.5, with points labelled by protein number; R2 = 0.7927, RMSEP = 0.0894.]
Figure 9. Local model line of fit for the prediction of β-sheet in the Amide I
region (1614-1627 cm-1). 3 PLS components for the 6th spectral interval were used for the
prediction of proteins in the test set. The numbers in the plot are the different proteins in the dataset
which are named in SI Table 6.
Navea et al. used iPLS to analyse 24 protein IR spectra using the Amide I, Amide
II and Amide III bands, finding that similar sections within the Amide I region were
most useful.27
■ CONCLUSION
IR spectra of proteins are potentially a rich source of information on protein structure. We therefore
explored whether PLS or PCR methods can be used to determine secondary structure contents from
FTIR spectra. 28 previously acquired spectra, plus spectra from an additional 68 proteins that we
obtained ourselves, gave a suitably sized data set. Data for the Amide I and II regions were analysed,
since these were available for all proteins and are the regions most widely used to give information on
secondary structure. A calibration set of 60 proteins was used to optimize the parameters of a method;
cross-validation within this set and application to an independent test set were used to assess each
method’s performance without over-fitting.
We first studied the Amide I region using PCR and PLS. This showed that PLS gave the better
performance. We then compared PLS using SNV and 2nd derivative preprocessing methods and found
that 2nd derivatives were better. Finally, we tested whether using the Amide II regions improved a
model’s performance and found that Amide I data was more useful on its own.
Models performed excellently within their calibration sets, with RMSEC values as low as a
few %. This reflected over-fitting, however: when the models were applied to data not seen before, or
tested rigorously with cross-validation, the RMS errors were around 12% for α-helix, 7% for β-sheet
and 8% for other.
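The error measures quoted here and in Table 1 can be computed as in the following sketch. It assumes that STDDEV is the sample standard deviation of the reference (DSSP) contents and uses made-up fractions rather than data from this work; rmse, sample_stddev and zeta are hypothetical helper names.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between reference and predicted fractions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def sample_stddev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def zeta(observed, predicted):
    """zeta = RMSE / STDDEV: the error relative to the spread of the
    reference contents (assumed here to be their sample std. deviation)."""
    return rmse(observed, predicted) / sample_stddev(observed)


# Made-up helix fractions for four proteins (illustration only).
observed = [0.10, 0.30, 0.50, 0.70]
predicted = [0.15, 0.25, 0.55, 0.65]

model_rmse = rmse(observed, predicted)
model_zeta = zeta(observed, predicted)

# Baseline: predict the mean content for every protein.
mean_content = sum(observed) / len(observed)
baseline_zeta = zeta(observed, [mean_content] * len(observed))

print(f"RMSE = {model_rmse:.3f}, zeta = {model_zeta:.3f}")
print(f"zeta when always predicting the mean = {baseline_zeta:.3f}")
```

Applying the same rmse function to calibration fits, cross-validation predictions and test-set predictions gives RMSEC, RMSECV and RMSEP respectively. A ζ close to 1 indicates a model that does little better than predicting the mean content for every protein.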
We explored whether FTIR data could be used more deeply than for just helix, sheet or coil,
subdividing β-sheets into parallel, anti-parallel and mixed forms, helices into α- and 3₁₀-, and coil into
β-turns and other. We found that anti-parallel β-sheet contents could be predicted to an RMS of 7%
using PLS on 2nd derivative Amide I and II data. It was not possible to predict contents of parallel
β-sheets, mixed β-sheets, 3₁₀-helices or β-turns, however. This is presumably because the abundance of
these structures is low, giving a small range of contents and weak signals in the IR spectra attributable
to these structures.
FTIR spectra are often acquired in D2O instead of H2O to avoid the strong water peaks that might
obscure valuable information. Analysis of FTIR spectra of 16 proteins in D2O showed that secondary
structure determination was slightly better for proteins in H2O than for proteins in D2O. Goormaghtigh et al.2a
found that an FTIR band in the Amide II region at ~1545 cm-1 was most informative for α-helix
prediction in their dataset, and a similar band at ~1545 cm-1 for random structure (equivalent to the “other”
category used here), which differs from our results. Furthermore, they also reported that their best
marker band for β-sheet structure occurred at ~1656 cm-1, which is normally reported as a marker band
of α-helix, and conjectured that this might be due to the strong anti-correlation between α-helix and
β-sheet composition in most proteins. By contrast, we found that 1614-1627 cm-1, the low wavenumber
shoulder of the commonly assigned marker band for β-sheet at ~1630 cm-1, was most important for
quantification of sheet content. Using Amide I and II data combined was usually more accurate than
using the Amide I data alone. We previously used similar methods on Raman spectra and obtained the
best results when combining Amide I, II and III data.1
Identifying critical regions within spectra for any given secondary structure motif can reduce analysis
time and improve the accuracy of models.30 iPLS models applied to the local models of Amide I FTIR
protein bands generally showed slightly poorer results when compared to the global models for all
structural types. For quantitative analysis, iPLS is a valuable tool, especially for the identification and
qualitative analysis of the spectral regions that are more significant for each structural motif, but the
increased complexity of the models, coupled with a poorer performance, discouraged the further use of
the iPLS method for this project.
In conclusion, we have shown that multivariate analysis of protein FTIR spectra can give α-helix,
β-sheet, other and anti-parallel β-sheet contents to a good accuracy, comparable to CD, which is widely
used for this purpose.
Table 1. Prediction Results. Results for PCR and PLS methods on ‘Normalized + SNV’ data are also shown in the first and second rows of each
structure section. For all methods, 60 protein samples were used for calibration and for internal cross-validation; 44 samples were used for the
independent test set. R2 = correlation coefficient of the model for that structure. RMSEC = calculated standard error of the structure distribution in the
calibration set. RMSECV = calculated standard error of the structure distribution from the cross-validation of the calibration set. RMSEP =
calculated standard error of the structure distribution in the independent test set. ζ = RMSE/STDDEV (Standard
Deviation).
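Column groups: Cal. = Calibration Set (60); CV = Cross-Validation (60); Test = Independent Test Set (44); STDDEV is from DSSP.

| Structure | Method | PreProcessing | Amide Spectral Region | Cal. Components | Cal. R2 | RMSEC | Cal. ζ | CV Components | CV R2 | RMSECV | CV ζ | Test R2 | RMSEP | Test ζ | DSSP STDDEV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α-helix | PCR | SNV | I | 10 | 0.73 | 0.122 | 0.51 | 5 | - | 0.104 | 0.44 | 0.70 | 0.123 | 0.517 | 0.238 |
| α-helix | PLS | SNV | I | 10 | 0.88 | 0.081 | 0.34 | 4 | 0.73 | 0.143 | 0.60 | 0.69 | 0.126 | 0.529 | 0.238 |
| α-helix | PLS | 2nd Deriv | I | 10 | 0.96 | 0.040 | 0.18 | 4 | 0.85 | 0.181 | 0.79 | 0.71 | 0.120 | 0.545 | 0.238 |
| α-helix | PLS | 2nd Deriv | II | 15 | 0.98 | 0.031 | 0.14 | 4 | 0.65 | 0.222 | 0.97 | 0.63 | 0.164 | 0.729 | 0.238 |
| α-helix | PLS | 2nd Deriv | I & II | 10 | 0.98 | 0.028 | 0.12 | 5 | 0.91 | 0.176 | 0.77 | 0.67 | 0.132 | 0.592 | 0.238 |
| β-Sheet | PCR | SNV | I | 10 | 0.78 | 0.072 | 0.46 | 5 | - | 0.014 | 0.09 | 0.59 | 0.088 | 0.564 | 0.156 |
| β-Sheet | PLS | SNV | I | 10 | 0.90 | 0.047 | 0.30 | 3 | 0.77 | 0.088 | 0.57 | 0.69 | 0.077 | 0.496 | 0.156 |
| β-Sheet | PLS | 2nd Deriv | I | 10 | 0.95 | 0.020 | 0.13 | 9 | 0.94 | 0.109 | 0.70 | 0.78 | 0.070 | 0.473 | 0.156 |
| β-Sheet | PLS | 2nd Deriv | II | 14 | 0.96 | 0.031 | 0.20 | 4 | 0.68 | 0.179 | 1.15 | 0.53 | 0.101 | 0.649 | 0.156 |
| β-Sheet | PLS | 2nd Deriv | I & II | 10 | 0.99 | 0.016 | 0.10 | 7 | 0.96 | 0.100 | 0.64 | 0.77 | 0.071 | 0.472 | 0.156 |
| Parallel β-Sheet | PCR | SNV | I | 10 | 0.05 | 0.035 | 0.92 | 6 | - | 0.006 | 0.16 | 0.01 | 0.039 | 1.091 | 0.038 |
| Parallel β-Sheet | PLS | SNV | I | 15 | 0.73 | 0.018 | 0.47 | 3 | 0.73 | 0.061 | 1.59 | 0.03 | 0.034 | 0.944 | 0.038 |
| Parallel β-Sheet | PLS | 2nd Deriv | I | 15 | 0.96 | 0.007 | 0.18 | 3 | 0.44 | 0.050 | 1.30 | 0.08 | 0.027 | 0.711 | 0.038 |
| Parallel β-Sheet | PLS | 2nd Deriv | II | 11 | 0.92 | 0.920 | 2.41 | 3 | 0.43 | 0.045 | 1.19 | 0.08 | 0.031 | 0.822 | 0.038 |
| Parallel β-Sheet | PLS | 2nd Deriv | I & II | 10 | 0.96 | 0.008 | 0.21 | 3 | 0.49 | 0.049 | 1.28 | 0.16 | 0.030 | 0.777 | 0.038 |
| Anti-Parallel β-Sheet | PCR | SNV | I | 10 | 0.74 | 0.077 | 0.51 | 5 | - | 0.005 | 0.04 | 0.58 | 0.085 | 0.558 | 0.152 |
| Anti-Parallel β-Sheet | PLS | SNV | I | 12 | 0.91 | 0.045 | 0.30 | 3 | 0.73 | 0.094 | 0.62 | 0.65 | 0.077 | 0.507 | 0.152 |
| Anti-Parallel β-Sheet | PLS | 2nd Deriv | I | 10 | 0.94 | 0.035 | 0.23 | 8 | 0.94 | 0.109 | 0.72 | 0.78 | 0.063 | 0.426 | 0.152 |
| Anti-Parallel β-Sheet | PLS | 2nd Deriv | II | 15 | 0.97 | 0.024 | 0.16 | 2 | 0.37 | 0.173 | 1.14 | 0.23 | 0.123 | 0.811 | 0.152 |
| Anti-Parallel β-Sheet | PLS | 2nd Deriv | I & II | 10 | 0.99 | 0.016 | 0.11 | 8 | 0.97 | 0.101 | 0.66 | 0.75 | 0.067 | 0.445 | 0.152 |
| Mixed β-Sheet | PCR | SNV | I | 10 | 0.22 | 0.012 | 0.88 | 5 | - | 0.000 | 0.01 | 0.14 | 0.013 | 0.984 | 0.014 |
| Mixed β-Sheet | PLS | SNV | I | 17 | 0.79 | 0.006 | 0.44 | 3 | 0.18 | 0.013 | 0.96 | 0.16 | 0.013 | 0.985 | 0.014 |
| Mixed β-Sheet | PLS | 2nd Deriv | I | 15 | 0.91 | 0.004 | 0.26 | 9 | 0.79 | 0.024 | 1.53 | 0.38 | 0.013 | 0.829 | 0.016 |
| Mixed β-Sheet | PLS | 2nd Deriv | II | 15 | 0.94 | 0.003 | 0.24 | 3 | 0.41 | 0.016 | 1.16 | 0.32 | 0.015 | 1.092 | 0.014 |
| Mixed β-Sheet | PLS | 2nd Deriv | I & II | 11 | 0.95 | 0.003 | 0.22 | 3 | 0.50 | 0.014 | 1.05 | 0.27 | 0.014 | 1.032 | 0.014 |
| 3₁₀-helix | PCR | SNV | I | 10 | 0.12 | 0.030 | 0.93 | 5 | - | 0.001 | 0.02 | 0.01 | 0.029 | 0.905 | 0.032 |
| 3₁₀-helix | PLS | SNV | I | 15 | 0.60 | 0.020 | 0.63 | 2 | 0.09 | 0.033 | 1.03 | 0.05 | 0.030 | 0.937 | 0.032 |
| 3₁₀-helix | PLS | 2nd Deriv | I | 15 | 0.92 | 0.008 | 0.25 | 3 | 0.53 | 0.037 | 1.18 | 0.26 | 0.029 | 0.973 | 0.032 |
| 3₁₀-helix | PLS | 2nd Deriv | II | 11 | 0.90 | 0.010 | 0.31 | 2 | 0.27 | 0.037 | 1.16 | 0.28 | 0.015 | 0.493 | 0.032 |
| 3₁₀-helix | PLS | 2nd Deriv | I & II | 10 | 0.95 | 0.007 | 0.21 | 3 | 0.41 | 0.038 | 1.20 | 0.16 | 0.026 | 0.859 | 0.032 |
| β-turns | PCR | SNV | I | 13 | 0.15 | 0.039 | 0.91 | 4 | - | 0.001 | 0.03 | 0.03 | 0.044 | 1.044 | 0.042 |
| β-turns | PLS | SNV | I | 15 | 0.60 | 0.027 | 0.63 | 3 | 0.13 | 0.046 | 1.09 | 0.08 | 0.045 | 1.075 | 0.042 |
| β-turns | PLS | 2nd Deriv | I | 15 | 0.94 | 0.010 | 0.24 | 4 | 0.60 | 0.051 | 1.22 | 0.23 | 0.043 | 1.036 | 0.042 |
| β-turns | PLS | 2nd Deriv | II | 12 | 0.91 | 0.013 | 0.30 | 4 | 0.45 | 0.051 | 1.20 | 0.05 | 0.042 | 0.974 | 0.042 |
| β-turns | PLS | 2nd Deriv | I & II | 10 | 0.97 | 0.007 | 0.18 | 4 | 0.65 | 0.053 | 1.26 | 0.12 | 0.041 | 0.981 | 0.042 |
| Other | PCR | SNV | I | 10 | 0.49 | 0.076 | 0.53 | 3 | - | 0.006 | 0.04 | 0.18 | 0.142 | 1.315 | 0.144 |
| Other | PLS | SNV | I | 15 | 0.88 | 0.037 | 0.26 | 3 | 0.47 | 0.097 | 0.67 | 0.20 | 0.140 | 1.300 | 0.144 |
| Other | PLS | 2nd Deriv | I | 10 | 0.90 | 0.045 | 0.31 | 4 | 0.62 | 0.157 | 1.10 | 0.65 | 0.082 | 0.567 | 0.144 |
| Other | PLS | 2nd Deriv | II | 12 | 0.94 | 0.034 | 0.24 | 1 | 0.26 | 0.164 | 1.14 | 0.63 | 0.066 | 0.457 | 0.144 |
| Other | PLS | 2nd Deriv | I & II | 10 | 0.97 | 0.024 | 0.17 | 4 | 0.72 | 0.198 | 1.38 | 0.69 | 0.082 | 0.568 | 0.145 |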
ASSOCIATED CONTENT
Supporting Information
SI Table 1. Protein FTIR Spectra
SI Table 2. Comparison of PLS results of the secondary structure of proteins in H2O and D2O.
SI Table 3. Proteins used for iPLS
SI Table 4. iPLS intervals and their corresponding frequencies.
SI Table 5. Results for iPLS models for both global and local regression analyses.
SI Table 6. Proteins used for calibration and test sets in iPLS
SI Figure 1. Explained y variance of α-helix content for analysis of the
Amide I region as a function of number of components.
SI Figure 2. RMSEC vs. number of components for α-helix prediction
from the Amide I region.
SI Figure 3. Cross-validation: 4 PLS components only are required to fit
the α-helix model for proteins in H2O using Amide I data.
This material is available free of charge via the Internet at http://pubs.acs.org.
AUTHOR INFORMATION
Corresponding Author
Email: [email protected]
Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street,
Manchester M1 7DN, UK
The manuscript was written through contributions of all authors. All authors have given approval
to the final version of the manuscript. ‡These authors contributed equally.
ACKNOWLEDGMENT
We thank Dr. Parvez Haris for providing FTIR protein spectra.
REFERENCES
1. Kinalwa, M. N.; Blanch, E. W.; Doig, A. J. (2010) Accurate determination of protein secondary structure content from Raman and Raman Optical Activity spectra, Anal. Chem. 82, 6347-6349.
2. (a) Goormaghtigh, E.; Ruysschaert, J. M.; Raussens, V. (2006) Evaluation of the information content in infrared spectra for protein secondary structure determination, Biophys. J. 90, 2946-2957; (b) Stuart, B. H. (1996) A Fourier transform infrared spectroscopic study of P2 protein in reconstituted myelin, Biochem. Mol. Biol. Int. 39, 629-634.
3. (a) Susi, H.; Byler, D. M. (1986) Resolution-Enhanced Fourier-Transform Infrared-Spectroscopy of Enzymes, Method Enzymol. 130, 290-311; (b) Barth, A. (2007) Infrared spectroscopy of proteins, Biochim. Biophys. Acta-Bioenerg. 1767, 1073-1101; (c) Manning, M. C. (2005) Use of infrared spectroscopy to monitor protein structure and stability, Expert Rev. Proteomics 2, 731-743; (d) Cai, S. W.; Singh, B. R. (1999) Identification of beta-turn and random coil amide III infrared bands for secondary structure estimation of proteins, Biophys. Chem. 80, 7-20.
4. Kong, J.; Yu, S. (2007) Fourier transform infrared spectroscopic analysis of protein secondary structures, Acta Biochim. Biophys. Sin. 39, 549-559.
5. Haris, P. I.; Severcan, F. (1999) FTIR spectroscopic characterization of protein structure in aqueous and non-aqueous media, J. Mol. Catal. B-Enzym. 7, 207-221.
6. (a) Candolfi, A.; De Maesschalck, R.; Jouan-Rimbaud, D.; Hailey, P. A.; Massart, D. L., The influence of data pre-processing in the pattern recognition of excipients near-infrared spectra. 1999; Vol. 21, p 115-32; (b) Ge, Y.-S.; Jin, C.; Song, Z.; Zhang, J.-Q.; Jiang, F.-L.; Liu, Y. (2014) Multi-spectroscopic analysis and molecular modeling on the interaction of curcumin and its derivatives with human serum albumin: A comparative study, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 124, 265-276.
7. Wold, S. (1991) Chemometrics, why, what and where to next?, Journal of pharmaceutical and biomedical analysis 9, 589-596.
8. (a) Al-Ghouti, M. A.; Al-Degs, Y. S.; Amer, M. (2008) Determination of motor gasoline adulteration using FTIR spectroscopy and multivariate calibration, Talanta 76, 1105-1112; (b) Bjelanovic, M.; Sorheim, O.; Slinde, E.; Puolanne, E.; Isaksson, T.; Egelandsdal, B. (2013) Determination of the myoglobin states in ground beef using non-invasive reflectance spectrometry and multivariate regression analysis, Meat science 95, 451-7; (c) Macdonald, J. R.; Johnson, W. C., Jr. (2001) Environmental features are important in determining protein secondary structure, Protein Sci 10, 1172-7.
9. (a) Bartlett, J. W.; Frost, C. (2008) Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables, Ultrasound in Obstetrics and Gynecology 31, 466-475; (b) Sonich-Mullin, C.; Fielder, R.; Wiltse, J.; Baetcke, K.; Dempsey, J.; Fenner-Crisp, P.; Grant, D.; Hartley, M.; Knaap, A.; Kroese, D.; Mangelsdorf, I.; Meek, E.; Rice, J. M.; Younes, M. (2001) IPCS Conceptual Framework for Evaluating a Mode of Action for Chemical Carcinogenesis, Regulatory Toxicology and Pharmacology 34, 146-152.
10. Zou, X.; Zhao, J.; Mao, H.; Shi, J.; Yin, X.; Li, Y. (2010) Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves, Applied spectroscopy 64, 786-94.
11. Navea, S.; Tauler, R.; de Juan, A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure, Anal Biochem 336, 231-42.
12. (a) Martens, H.; Naes, T., Multivariate Calibration. Wiley: 1991; (b) Wang, Y. Q.; Boysen, R. I.; Wood, B. R.; Kansiz, M.; McNaughton, D.; Hearn, M. T. W. (2008) Determination of the secondary structure of proteins in different environments by FTIR-ATR spectroscopy and PLS regression, Biopolymers 89, 895-905.
13. Depczynski, U.; Frost, V. J.; Molt, K. (2000) Genetic algorithms applied to the selection of factors in principal component regression, Analytica Chimica Acta 420, 217-227.
14. Haaland, D. M.; Jones, H. D. T.; Thomas, E. V. (1997) Multivariate classification of the infrared spectra of cell and tissue samples, Applied Spectroscopy 51, 340-345.
15. Smith, B. C., Fundamentals of Fourier transform infrared spectroscopy. 2nd ed.; Taylor & Francis: 2011.
16. (a) Smith, B. C., Fundamentals of Fourier Transform Infrared Spectroscopy, Second Edition. Taylor & Francis: 2011; (b) Stuart, B. H. (1996) A Fourier transform infrared spectroscopic study of the secondary structure of myelin basic protein in reconstituted myelin, Biochemistry and molecular biology international 38, 839-45.
17. Glassford, S. E.; Byrne, B.; Kazarian, S. G. (2013) Recent applications of ATR FTIR spectroscopy and imaging to proteins, Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1834, 2849-2858.
18. Ramer, G.; Lendl, B., Attenuated Total Reflection Fourier Transform Infrared Spectroscopy. In Encyclopedia of Analytical Chemistry, John Wiley & Sons, Ltd: 2006.
19. Milosevic, M., Internal reflection and ATR spectroscopy. Wiley: 2012.
20. Agnès, T.; Diane, R.; Yves, D.; Dieter, N.; Vincent, F. (2000) Transient non-native secondary structures during the refolding of α-lactalbumin detected by infrared spectroscopy, Nature Structural & Molecular Biology 7, 78-86.
21. Kennard, R. W.; Stone, L. A. (1969) Computer Aided Design of Experiments, Technometrics 11, 137-148.
22. Perez-Guaita, D.; Ventura-Gayete, J.; Perez-Rambla, C.; Sancho-Andreu, M.; Garrigues, S.; de la Guardia, M. (2012) Protein determination in serum and whole blood by attenuated total reflectance infrared spectroscopy, Anal. Bioanal. Chem. 404, 649-656.
23. Perez-Guaita, D.; Ventura-Gayete, J.; Pérez-Rambla, C.; Sancho-Andreu, M.; Garrigues, S.; Guardia, M. (2012) Protein determination in serum and whole blood by attenuated total reflectance infrared spectroscopy, Analytical and bioanalytical chemistry 404, 649-656.
24. Kabsch, W.; Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22, 2577-637.
25. Faber, N. M. (1999) Estimating the uncertainty in estimates of root mean square error of prediction: application to determining the size of an adequate test set in multivariate calibration, Chemometrics and Intelligent Laboratory Systems 49, 79-89.
26. Arrondo, J. L. R.; Goni, F. M. (1999) Structure and dynamics of membrane proteins as studied by infrared spectroscopy, Prog. Biophys. Mol. Biol. 72, 367-405.
27. Navea, S.; Tauler, R.; de Juan, A. (2005) Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure, Analytical Biochemistry 336, 231-242.
28. Dousseau, F.; Pezolet, M. (1990) Determination of the secondary structure-content of proteins in aqueous-solutions from their Amide-I and Amide-II infrared bands - Comparison between classical and partial least-squares methods, Biochemistry 29, 8771-8779.
29. Zou, X. B.; Zhao, J. W.; Mao, H. P.; Shi, J. Y.; Yin, X. P.; Li, Y. X. (2010) Genetic Algorithm Interval Partial Least Squares Regression Combined Successive Projections Algorithm for Variable Selection in Near-Infrared Quantitative Analysis of Pigment in Cucumber Leaves, Applied Spectroscopy 64, 786-794.
30. Norgaard, L.; Saudland, A.; Wagner, J.; Nielsen, J. P.; Munck, L.; Engelsen, S. B. (2000) Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy, Applied Spectroscopy 54, 413-419.