Journal of Biotechnology 151 (2011) 159–165

Predicting cell-specific productivity from CHO gene expression

Colin Clarke a,∗,1, Padraig Doolan a,1, Niall Barron a, Paula Meleady a, Finbarr O’Sullivan a, Patrick Gammell b, Mark Melville c, Mark Leonard c, Martin Clynes a

a National Institute for Cellular Biotechnology, Dublin City University, Glasnevin, Dublin 9, Ireland
b Bio-Manufacturing Sciences Group, Pfizer Inc., Grange Castle International Business Park, Clondalkin, Dublin 22, Ireland
c Bioprocess R&D, Pfizer Inc., Andover, MA 01810, USA

∗ Corresponding author. Tel.: +353 1 7005700/5692; fax: +353 1 7005484. E-mail address: [email protected] (C. Clarke).
1 Both authors contributed equally to this publication.

Article history: Received 28 July 2010; Received in revised form 26 October 2010; Accepted 20 November 2010; Available online 27 November 2010

Keywords: Chinese hamster ovary; Productivity; Microarray; Partial least squares; Cross model validation; Variable selection

Abstract

Improving the rate of recombinant protein production in Chinese hamster ovary (CHO) cells is an important consideration in controlling the cost of biopharmaceuticals. We present the first predictive model of productivity in CHO bioprocess culture based on gene expression profiles. The dataset used to construct the model consisted of transcriptomic data from 70 stationary phase, temperature-shifted CHO production cell line samples, for which the cell-specific productivity had been determined. These samples were utilised to investigate gene expression over a range of high to low monoclonal antibody and Fc-fusion-producing CHO cell lines. We utilised a supervised regression algorithm, partial least squares (PLS) incorporating jackknife gene selection, to produce a model of cell-specific productivity (Qp) capable of predicting Qp to within 4.44 pg/cell/day root mean squared error in cross model validation (RMSE_CMV). The final model, consisting of 287 genes, was capable of accurately predicting Qp in a further panel of 10 additional samples which were incorporated as an independent validation. Several of the genes constituting the model are linked with biological processes relevant to protein metabolism.

© 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jbiotec.2010.11.016

1. Introduction

Cell and process engineering approaches to improve productivity in bioreactors have largely focussed on reactor design and culture strategies such as clonal selection, stability, medium formulation, culture temperature and cell engineering for controlled proliferation and increased resistance to apoptosis (Altamirano et al., 2000; Butler, 2005; Prentice et al., 2007; Wurm, 2004). Using this approach, key cell line characteristics, including cell growth rate, achievable cell densities and correct product processing, are identified only following a lengthy labour-intensive screening process. To complement these strategies, previous attempts have been made to modify or improve the performance of these lines in the bioreactor using cellular engineering strategies (reviewed in Mohan et al., 2008). However, these studies have demonstrated only incremental improvements in productivity and the cellular processes underpinning Qp remain poorly understood in Chinese hamster ovary (CHO) and other bioprocess-relevant cell lines.

The development of expression profiling methodologies such as microarrays and proteomics offers the prospect of examining the molecular phenotypes underlying productivity in CHO and their application in bioprocess research has already been extensively reviewed (Griffin et al., 2007). Previous microarray expression profiling studies focussing on productivity in CHO (Doolan et al., 2008; Schaub et al., 2010; Trummer et al., 2008; Kantardjieff et al., 2010; Yee et al., 2007) and in the commercially used mouse myeloma NS0 cell line (Charaniya et al., 2009; Khoo et al., 2007; Seth et al., 2007) have identified several crucial pathways and processes. These microarray-based productivity studies have also been complemented by proteomics studies in CHO (Carlage et al., 2009; Meleady et al., 2008; Nissom et al., 2006) and NS0 (Seth et al., 2007; Smales et al., 2004; Alete et al., 2005; Dinnis et al., 2006).

To date, profiling studies in CHO have been characterised by relatively small numbers of samples (typically < 20) compared in a case/control format. Interesting genes and protein candidates are generally prioritised via the traditional paradigm of differential expression (i.e. fold change). A significant drawback of this approach is the need to select an appropriate threshold (considering the inherently noisy nature of microarrays), which can result in too few or too many genes being identified and makes comparison with studies on similar biological systems inconsistent. This limitation is further compounded by the observation that changes in productivity levels are usually accompanied by only modest changes in gene expression levels (Smales et al., 2004; Yee et al., 2009). Larger sample numbers in combination with more sophisticated algorithms can therefore make a significant contribution to identifying the molecular mechanisms underpinning productivity in CHO.



Multivariate statistics and machine learning algorithms for classification and regression allow relationships between genes to be considered and have previously been advocated over univariate gene selection methods (Boulesteix and Strimmer, 2007). Partial least squares (PLS) is a statistical modelling technique closely related to principal component analysis (PCA) and is used to construct predictive models for complex multidimensional datasets. PLS components, known as latent variables (LVs), are derived from linear combinations of the original variables to maximise the covariance between a matrix of independent variables (e.g. gene expression) and dependent variable(s) (e.g. productivity). By retaining only those LVs containing the majority of information on the relationship between predictor and response variables (thus removing a substantial amount of noise and measurement error) a model can then be formed between these LVs and cell-specific productivity. Detailed treatments of the PLS algorithm have been previously described (Martens and Naes, 1989).

Previous examples of PLS predictive model generation from microarrays include regression (Gidskehaug et al., 2007; Huang et al., 2004; Misra et al., 2007), the development of models for classification (Aaroe et al., 2010; Nguyen and Rocke, 2002a) and proportional hazard models for survival analysis (Nguyen and Rocke, 2002b). Apart from microarrays, the technique is utilised across a variety of fields and has previously been applied to various aspects of bioprocessing including mass spectrometry-based proteomic profiling, process monitoring and process analytical technology (PAT) (Sellick et al., 2010; Stansfield et al., 2007; Thomassen et al., 2010).

In this paper, we construct a regression model using the PLS algorithm to capture the relationship between gene expression and a quantitative phenotypic variable (cell-specific productivity). We aim to produce a model for prediction of Qp from gene expression measurements with a potential application in bioprocess development. The use of a gene selection routine coupled with rigorous statistical validation was incorporated to reduce PLS model complexity and decrease the error rate. The algorithm may also provide a vehicle for the identification of subsets of genes relevant to the biology underlying productivity of recombinant proteins in CHO. This work represents one of the largest studies of CHO transcriptomic datasets published to date.

2. Materials and methods

2.1. Determination of cell-specific productivity

The concentration of recombinant protein product in conditioned media samples (volumetric titre) was determined by Protein-A HPLC. Cell viability was determined using the trypan blue dye-exclusion viability assay and hemocytometer counting (for shake flask samples) or a Cedex Automated Cell Culture Analyzer (Roche Innovatis) (for bioreactor samples). Cell-specific productivity was determined as shown below.

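$$
Q_p\ (\text{pg/cell/day}) = \left[\frac{\text{titre}_2 - \text{titre}_1}{\text{density}_2 - \text{density}_1}\right] \times \text{daily growth rate} \tag{1}
$$

where

$$
\text{daily growth rate} = \frac{\ln(\text{density}_2) - \ln(\text{density}_1)}{\text{time}_2 - \text{time}_1} \times 24
$$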
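As a worked illustration of Eq. (1), the calculation can be sketched in a few lines of Python. This is a minimal sketch and not the authors' code: the function names are hypothetical, and the units are assumptions (times in hours, as the factor of 24 suggests; titre in pg/mL; viable cell density in cells/mL).

```python
import math


def daily_growth_rate(density_1, density_2, time_1_h, time_2_h):
    """Specific growth rate per day; times are ASSUMED to be in hours."""
    return (math.log(density_2) - math.log(density_1)) / (time_2_h - time_1_h) * 24.0


def cell_specific_productivity(titre_1, titre_2, density_1, density_2,
                               time_1_h, time_2_h):
    """Qp as in Eq. (1): titre change per cell-density change times growth rate.

    ASSUMED units: titre in pg/mL and density in cells/mL, giving pg/cell/day.
    """
    titre_per_cell = (titre_2 - titre_1) / (density_2 - density_1)
    return titre_per_cell * daily_growth_rate(density_1, density_2,
                                              time_1_h, time_2_h)


# Illustrative numbers only: density doubles over 24 h while titre rises
# by 2e7 pg/mL (20 mg/L).
qp = cell_specific_productivity(0.0, 2.0e7, 1.0e6, 2.0e6, 0.0, 24.0)
print(round(qp, 2))
```

Note that the expression is undefined when the two densities are equal, so a production script would need to handle samples with no net growth separately.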


2.2. Cell line selection and experimental design

A total of 80 fed-batch, temperature-shifted CHO production cell line samples displaying a range of cell-specific productivity values (0.81–50.4 pg protein/cell/day) were selected for transcriptional profiling using a proprietary CHO-specific Affymetrix microarray (WyeHamster2a). All cell line samples were grown in serum-free suspension culture in the temperature-shifted range of 29.5 °C to 31 °C (culture temperature shift time-point varied between 24 and 72 h according to process design) and were collected during the stationary growth phase (5–10 days) at the following time-points: Day 5 (23 samples), Day 7 (42 samples), Day 8 (7 samples) and Day 10 (8 samples). The entire sample set comprised 42 CHO DUX and 38 CHO K1 samples, from 10 production cell lines expressing a variety of monoclonal antibody (60 samples) and Fc-fusion protein products (20 samples). 18 of the samples were isolated from a total of 14 shake flasks (11 of which were carefully maintained to a pH setpoint using CO2 and base addition as required); the remaining samples were isolated from 40 individual bioreactor cultures. The sample set was split into 70 microarrays for PLS model construction and validation (calibration data). 10 samples from 5 CHO DUX and 5 CHO K1 cultures producing monoclonal antibody and Fc-fusion proteins were held back from model building and gene selection to serve as an independent test set evaluation (test data).

2.3. Microarray analysis and data preprocessing

The methods and criteria used for total RNA purification, cRNA sample processing and hybridisation to hamster microarrays have been previously described (Doolan et al., 2008). The study presented here utilises a proprietary WyeHamster2a oligonucleotide microarray, which has been described previously (Doolan et al., 2008), representing an estimated 10–15% of the CHO transcriptome. All microarray data were pre-processed in the statistical software environment R (www.r-project.org) and the aroma.affymetrix package using the robust multichip average (RMA) algorithm (Bolstad et al., 2003; Irizarry et al., 2003a,b).

2.4. Partial least squares implementation

PLS model construction and jackknife variable selection were carried out within R using the ‘pls’ package (Mevik and Wehrens, 2007). Cross model validation was implemented using a script written in-house (available on request).
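The study itself used the R ‘pls’ package, and the in-house script is not reproduced here. To make the latent-variable idea concrete, the following is a minimal pure-Python sketch of single-response PLS (PLS1) fitted by NIPALS-style deflation. It is an illustration under stated assumptions rather than the authors' implementation: the function names `fit_pls1` and `predict_pls1` are hypothetical, variables are mean-centred but not scaled, and the toy data are invented.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_components):
    """Fit PLS1 by successive deflation; returns (x_mean, y_mean, components)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between X and y.
        w = [_dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]        # latent variable scores
        tt = _dot(t, t)
        if tt < 1e-12:
            break
        p_load = [_dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = _dot(f, t) / tt                    # regression of y on the scores
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p_load, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Predict one sample by applying the same deflation steps to it."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p_load, q in components:
        t = _dot(e, w)
        y_hat += q * t
        e = [ev - t * pv for ev, pv in zip(e, p_load)]
    return y_hat


# Toy data (invented): y = 2*x1 - x2 + 3, so two LVs recover it exactly.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [2.0 * a - b + 3.0 for a, b in X]
model = fit_pls1(X, y, n_components=2)
print(round(predict_pls1(model, [6.0, 2.0]), 6))
```

With as many latent variables as the rank of the predictor matrix, PLS1 reproduces ordinary least squares; the benefit on microarray data comes from retaining far fewer LVs than genes.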

2.5. Jackknife gene selection

The elimination of genes which do not contribute significantly to the model should simplify and improve the accuracy of PLS and possibly reveal biologically important genes related to cell-specific productivity. During the construction of the PLS model, each gene is interrogated within the inner loop of model validation (Fig. 1) as to its importance during the model building process. The resampling method known as ‘jackknifing’ (JK) (Efron and Stein, 1981) was employed to assess the significance of variables and to remove uninformative or “noisy” genes which have no contribution to the final model. The selection of important genes from the analysis is achieved by initially considering the entire complement of the array and constructing a model. Each PLS regression coefficient is perturbed and its approximate “significance” determined using a t-test (as the distribution of PLS regression coefficients and the degrees of freedom are unknown, it is recommended to treat the resulting p-value as a measure of non-significance (Mevik and Wehrens, 2007)). The least “significant” gene (i.e. the gene with the largest p-value) in the model is eliminated from the dataset. The backward elimination of genes from the model continues until all remaining genes have a value of p < 0.1 (Anderssen et al., 2006; Martens and Martens, 2000). The use of such a gene selection routine requires stringent validation to avoid overfitting the model and offset the risk of selecting uninformative genes.

Fig. 1. Schematic of microarray data preparation and analysis using partial least squares. Cross model validation is applied using an inner LOOCV and outer (10-fold) cross validation loop in order to avoid overly optimistic performance estimates of the procedure. A backward elimination of genes within the inner CV loop is achieved through jackknife significance testing.
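The jackknife significance test of Section 2.5 can be sketched as follows. This is a hedged illustration, not the routine in the ‘pls’ package: it assumes the standard delete-one jackknife variance estimate, takes the regression coefficients of the full model and of each leave-one-out sub-model as given, and uses hypothetical function names. Because every gene is tested with the same degrees of freedom, the gene with the largest p-value is the one with the smallest absolute t-value, so the sketch ranks genes by |t| and does not compute p-values.

```python
import math


def jackknife_t_values(full_coefs, loo_coefs):
    """Approximate t-value for each regression coefficient.

    full_coefs: coefficients of the model fitted on all N samples.
    loo_coefs:  list of N coefficient lists, one per leave-one-out sub-model.
    """
    n = len(loo_coefs)
    t_values = []
    for j, b in enumerate(full_coefs):
        # Delete-one jackknife variance: ((N - 1) / N) * sum of squared deviations.
        ss = sum((sub[j] - b) ** 2 for sub in loo_coefs)
        se = math.sqrt((n - 1) / n * ss)
        if se == 0.0:
            t_values.append(0.0 if b == 0.0 else math.copysign(math.inf, b))
        else:
            t_values.append(b / se)
    return t_values


def least_significant(t_values):
    """Index of the candidate for elimination (smallest |t|, i.e. largest p)."""
    return min(range(len(t_values)), key=lambda j: abs(t_values[j]))


# Invented coefficients for two genes across three sub-models.
full = [2.0, 0.1]
subs = [[2.1, 0.5], [1.9, -0.3], [2.0, 0.1]]
t = jackknife_t_values(full, subs)
print(least_significant(t))
```

In a backward elimination, the gene returned by `least_significant` would be dropped, the model refitted, and the test repeated until every remaining gene passes the chosen threshold.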

2.6. PLS cross model validation

The development of relevant models requires conservative measures to determine effectiveness in the prediction of future samples. As the number of samples in gene expression analysis is generally limited in comparison to the number of variables, a technique known as cross validation (CV) is often used to estimate model performance. Here, a portion of the data (held-out data) is kept independent from the model building. Once the model has been constructed, the held-out portion is presented as a test sample and the error rate assessed. In the next iteration of the CV procedure, another portion is held back as the testing set and the routine continues until all samples have been designated as the held-out set. The average error can be determined from the cross validation and used as an indication of how the model will perform on future samples (generalisation ability). However, it has been noted that the use of standard cross validation routines yields overly optimistic results in terms of feature selection (overfitting), known as selection bias in the context of microarrays (Ambroise and McLachlan, 2002).

Cross model validation (CMV), an expansion of standard CV, has been advocated as a means to alleviate bias in feature selection and several examples for PLS regression have been demonstrated (Gidskehaug et al., 2008; Westad et al., 2008). CMV, also known as double cross validation (Filmoser et al., 2009), assesses the entire model development process including gene selection and the model parameters (in the case of PLS the number of LVs to retain). The CMV algorithm consists of an inner and outer loop; variable selection is conducted in the inner loop while a model constructed using the selected variables is tested in the outer loop using unseen data (Fig. 1). In the CMV outer loop a portion (known as a fold) of the data is held back from the model development phase for testing purposes (in this case 10%). The remaining data (90%) are subjected to variable selection within the inner loop where standard CV is carried out. Here, the CV process utilised in the CMV inner loop is known as leave-one-out cross validation (LOOCV); as its name suggests a single sample is held out for testing after model development. Single genes are iteratively eliminated following LOOCV; the process continues until all jackknife p-values are < 0.1. Upon completion of the inner loop, a model is constructed with the selected genes, tested against the held-out set in the outer loop and the performance determined (see Section 2.7). The CMV and gene selection process continues until all folds have been held out and the average error recorded.

Here, we repeated the entire validation and gene selection procedure a total of 35 times; the calibration data were randomised prior to initiating each independent CMV (to offset bias from dataset partitioning). The union of all genes selected from the 350 iterations (35 independent runs × 10 CMV folds) was calculated and only those genes retained in two thirds or more of the inner loop gene selections (selection frequency ≥ 66%) were utilised to build the final PLS model. This final PLS model was then assessed using LOOCV (for comparison to the starting model built on the full geneset) and also subjected to an independent test set validation with 10 unseen samples.
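The structure of the outer CMV loop can be sketched as below. This is a minimal skeleton under stated assumptions, not the in-house script: gene selection and model fitting are passed in as callables (`select_features` and `fit_predict`, both hypothetical names), the reported error is the mean of the per-fold RMSE values (one reading of "average error"), and the demonstration uses a trivial mean predictor in place of PLS.

```python
import math
import random


def cross_model_validation(X, y, select_features, fit_predict, k=10, seed=0):
    """Outer loop of cross model validation.

    select_features(X_train, y_train) -> list of column indices (inner loop);
    fit_predict(X_train, y_train, X_test) -> list of predictions.
    Returns (mean per-fold RMSE, list of per-fold gene selections).
    """
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)          # randomise the partitioning
    folds = [order[i::k] for i in range(k)]
    fold_rmse, selections = [], []
    for test_idx in folds:
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Gene selection sees the training portion only, never the held-out fold.
        genes = select_features(X_train, y_train)
        selections.append(genes)
        X_train_sel = [[row[g] for g in genes] for row in X_train]
        X_test_sel = [[X[i][g] for g in genes] for i in test_idx]
        preds = fit_predict(X_train_sel, y_train, X_test_sel)
        sq = [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
        fold_rmse.append(math.sqrt(sum(sq) / len(sq)))
    return sum(fold_rmse) / len(fold_rmse), selections


# Trivial stand-ins for demonstration only.
def keep_all(X_train, y_train):
    return list(range(len(X_train[0])))


def mean_predictor(X_train, y_train, X_test):
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_test)


X_demo = [[float(i), float(i % 3)] for i in range(20)]
y_demo = [5.0] * 20
error, chosen = cross_model_validation(X_demo, y_demo, keep_all, mean_predictor)
print(error, len(chosen))
```

The key design point is that `select_features` is called inside the loop on the training portion alone; selecting genes once on the full dataset and then cross-validating would reintroduce the selection bias that CMV is meant to remove.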

2.7. PLS regression evaluation

The success of each PLS model constructed was evaluated using a number of commonly used measures. Firstly, the root mean square error (RMSE) in LOOCV, CMV and independent test set validation is calculated to yield the RMSE_LOOCV, RMSE_CMV and RMSE_prediction respectively.

$$
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \tag{2}
$$

In addition, the cross-validated coefficient of determination in CMV and LOOCV (Q²) was also calculated.

$$
Q^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{3}
$$

where ŷ_i are the predicted measurements, y_i are the observed measurements, ȳ is the mean of the observed measurements and N is the number of samples tested. Where applicable the standard deviation (σ) of these measures is also presented.
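Eqs. (2) and (3) translate directly into code. The sketch below is illustrative only; the function names and the toy numbers are assumptions, not values from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, Eq. (2)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def q_squared(observed, predicted):
    """Q2 as in Eq. (3): 1 - (residual sum of squares / total sum of squares).

    The predictions are expected to come from cross validation, i.e. each
    sample is predicted by a model that did not see it during fitting.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented example: four observed Qp values and cross-validated predictions.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 1.5, 3.5, 3.5]
print(rmse(obs, pred), q_squared(obs, pred))
```

A Q² of 1 indicates perfect prediction, while a value of 0 means the model does no better than predicting the mean of the observed values for every sample.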

Fig. 2. Plot of Qp values of 70 production DUX and K1 CHO cell line samples which were used to develop the predictive model (calibration data). Samples used displayed Qp values in the range of 0.81–50.4 pg/cell/day.

2.8. Bioinformatics analysis and literature mining

To gain a further understanding of the jackknife selected genes and to develop a biological hypothesis encompassing the PLS-identified genes, the 287-member model genelist was annotated to GenBank gene symbols using an in-house annotation tool, yielding a total of 212 annotated transcripts on which ontology and literature mining analysis was carried out using PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/). Additionally we undertook literature mining analysis using Pathway Studio (Ariadne Genomics, Rockville, MD) on this list to determine previously established links to productivity-related cell processes and analyses.

3. Results and discussion

3.1. Production CHO cell line sample dataset

Fig. 2 illustrates the range of Qp measurements (pg protein/cell/day) for each of the 70 CHO production cell line samples for which gene expression measurements were obtained using the WyeHamster2a microarray and which form the calibration data used to build the PLS model. As can be seen, there is a relatively uniform spread across the range of Qp values up to 30 pg protein/cell/day, whereafter the seven remaining samples constituted the range of Qp values from ∼30 to 50 pg protein/cell/day (the calibration data were deliberately selected to cover the entire range of Qp observed in production from the industrial cell lines utilised). The 10 samples comprising the independent testing set were acquired from CHO cell cultures with a similar range of characteristics to that of the calibration data. The Qp of these samples ranged from 0.92 to 36.90 pg protein/cell/day.

3.2. Model construction and performance

Using the methodology outlined above, the optimum subset of genes from the microarray was chosen with respect to cell-specific productivity for model construction. A rigorous method of cross validation, CMV, was applied to avoid overly optimistic assumptions about the future performance of the model, providing a realistic appreciation of the generalisation ability; moreover this method allows validation of the gene selection itself. In order to avoid bias from partitioning of the dataset, the entire CMV method was carried out 35 times in total. Perhaps the most obvious drawback to the CMV evaluation of the PLS modelling is the time required. However, such a cross validation attempts to avoid incorrect assessment of the model performance when the dataset is small, when the use of an independent test set is not applicable or when a variable selection scheme is incorporated.

Fig. 3. RMSE_LOOCV plot for all 3714 genes (dashed) and for 287 jackknife-selected genes (solid). As can be seen, upon the removal of genes from the analysis model accuracy was increased while the complexity of the model was decreased.

The average RMSE_CMV for all iterations was 4.44 Qp units (pg/cell/day), σ = 0.24. Considering the system under investigation and the small number of samples used for calibration, the model shows reasonable performance. The Q²_CMV (analogous to R², but calculated from cross validation) was determined to be 0.72, σ = 0.07, and further demonstrates the validity of the model. To our knowledge, this is the first published predictive model derived from microarray analysis to accurately forecast the CHO productivity phenotype.

Finally, the number of times a gene is selected over the 350 internal LOOCV iterations is determined. We refer to this measure as the selection frequency, and genes that were retained in two thirds or more of LOOCV iterations were considered to be significant. In this study we discovered a total of 287 genes above the frequency threshold to yield the optimum predictive model, 212 of which were annotated using an in-house annotation system and which are displayed in Supplementary Table I. The number of LVs to be retained for the optimum model is shown in Fig. 3. As can be seen, the RMSE_LOOCV is reduced while the number of LVs retained is also decreased, allowing for a simplified model and decreasing the risk of overfitting.
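The selection-frequency filter can be illustrated with a short sketch. It assumes that each inner-loop run returns the list of genes it retained; the function name and the gene labels are hypothetical. The two-thirds threshold is compared as an exact fraction to avoid floating-point edge cases at the boundary.

```python
from collections import Counter
from fractions import Fraction


def frequently_selected(selections, min_fraction=Fraction(2, 3)):
    """Genes retained in at least `min_fraction` of the selection runs.

    selections: one list of gene identifiers per inner-loop run
    (350 runs in the study: 35 repeats x 10 CMV folds).
    """
    runs = len(selections)
    counts = Counter(gene for run in selections for gene in set(run))
    return sorted(g for g, c in counts.items() if Fraction(c, runs) >= min_fraction)


# Invented example with three runs and placeholder gene labels.
runs = [["geneA", "geneB"], ["geneA", "geneB"], ["geneA", "geneC"]]
print(frequently_selected(runs))
```

Here `geneA` (3 of 3 runs) and `geneB` (2 of 3 runs) pass the threshold, while `geneC` (1 of 3 runs) does not.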

Fig. 4 displays the values of the measured samples plotted against those predicted by the model during a LOOCV analysis of the final 287 selected genes with 3 LVs retained. The LOOCV yields Q²_LOOCV = 0.88 and a corresponding RMSE_LOOCV = 3.95 pg protein/cell/day. As mentioned previously, RMSE_LOOCV provides an optimistic and therefore less reliable viewpoint; nevertheless Figs. 3 and 4 provide a means of evaluating the gene selection procedure (i.e. the removal of genes simplifies the PLS model and reduces the RMSE_LOOCV). Although the analysis time associated with cross model validation is significant, it increases confidence in model development and variable selection.

3.3. Independent test set evaluation

As a further evaluation of the model, 10 microarray samples were held back from the model construction and gene selection process. Such a testing procedure is known as an independent or blinded set evaluation, as the data used in this part of the analysis remain completely independent from the construction phase and therefore provide a measure of the model’s future prediction ability.

Fig. 4. Representative plot of predicted Qp vs. measured Qp for LOOCV of the final subset of retained genes. The RMSE_LOOCV = 3.95 pg/cell/day and Q²_LOOCV = 0.88 for the 287 gene dataset, #LVs retained = 3.

All test set microarrays analysed were sampled from temperature-shifted stationary phase cultures with Qp values ranging from 0.92 to 36.9 pg protein/cell/day to ensure consistency with the samples used for calibration. The microarrays underwent the same pre-processing procedure as the microarrays used to develop the model. The 287 genes identified during the gene selection process were extracted from the resulting data matrix and presented to the constructed PLS model. As can be seen from Table 1, the model was capable of successfully predicting the cell-specific productivity of this blind testing set of 10 CHO cell line samples. For example, the actual measured Qp of Sample 2 was 31.30 pg/cell/day; the PLS-predicted Qp for this sample was 29.47 pg/cell/day, an error of 1.82 Qp units. The largest error deviated 5.62 Qp units from the measured value (Table 1). An overall prediction error rate (RMSE_prediction) of 3.11 pg/cell/day was observed for the whole blind testing set, indicating that the model is capable of successfully separating CHO cell line samples into low, medium and high Qp categories. This independent validation confirms that the cell-specific productivity of stationary phase CHO cells can be predicted from gene expression measurements and this model represents the first such predictive model published for CHO bioprocess culture.
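The reported RMSE_prediction can be checked against the actual and predicted Qp values listed in Table 1. The short script below is a verification sketch only (the variable names are hypothetical); because the tabulated values are rounded to two decimals, the recomputed figure is expected to agree with the reported 3.11 pg/cell/day to within rounding rather than exactly.

```python
import math

# Actual and model-predicted Qp (pg/cell/day) for samples 1-10, from Table 1.
actual = [36.90, 31.30, 26.90, 23.00, 21.80, 13.21, 6.40, 6.00, 5.70, 0.92]
predicted = [31.27, 29.47, 26.95, 18.24, 18.65, 11.32, 5.30, 5.41, 2.44, 4.65]

errors = [a - p for a, p in zip(actual, predicted)]
rmse_prediction = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
worst_sample = max(range(len(errors)), key=lambda i: abs(errors[i])) + 1

print(f"RMSE_prediction = {rmse_prediction:.2f} pg/cell/day")
print(f"Largest absolute error: sample {worst_sample}")
```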

Table 1. Independent test set evaluation of PLS model. The table shows the actual and predicted Qp for the 10 samples in the testing set. The RMSE_prediction for the test set was 3.11 pg protein/cell/day.

Sample #   Actual Qp   Model-predicted Qp   Error
1          36.90       31.27                 5.62
2          31.30       29.47                 1.82
3          26.90       26.95                −0.05
4          23.00       18.24                 4.75
5          21.80       18.65                 3.14
6          13.21       11.32                 1.88
7           6.40        5.30                 1.09
8           6.00        5.41                 0.58
9           5.70        2.44                 3.25
10          0.92        4.65                −3.73

The methodology outlined here may be utilised to develop a predictive method which could contribute to the selection of high Qp clones for progression to bioreactor scale-up and facilitate the study of cell-specific productivity in CHO at earlier stages of cell line selection. In addition, a transcriptomic based analysis could provide a means of enriching for clones that maintain stable Qp. Current Qp measurement techniques for clonal selection such as ELISA or flow cytometry select subclones based on performance at the time of measurement and do not guarantee future productivity or production stability (Pichler et al., 2010). Given the diminishing cost of microarray technology with the emergence of next generation sequencing (Blow, 2009), a custom microarray or RT-PCR based screen similar to that described by Lee et al. (2009) could be developed utilising the 287 genes selected during this study. However, it is important to note that the model presented here is derived from temperature-shifted stationary phase cultures and forms a proof of concept study. Future development would incorporate gene expression and Qp stability data from cells undergoing clonal selection right through to bioreactor scale-up for model construction.

3.4. Biological relevance of the identified genes toproductivity-related cell processes

At first glance, the list of 212 genes contains several controlgenes which we would expect to have been identified stronglyassociated with productivity in our system and which function asinternal validation of the chosen method of gene selection. Thesegenes include the selection markers NEO (Neomycin phosphotrans-ferase II (resistance gene)), DHFR (dihydrofolate reductase) as wellas a vector-related expressed sequence and a probeset designed todetect expression of the secreted protein product. It is also impor-tant to note at this point that four genes (CD36/SCARB1; HNRPK;HSPA8/HSP70 and HPRT1/hprT) are each represented twice inthe model. This was discovered only following substantial re-annotation efforts made on the 287-member list and is the onlyknown redundancy within the selected genes. Where necessary,the gene is referenced in the text via the former nomenclature.

As the JK–PLS algorithm accentuates those genes which have the closest relationship to Qp, it is likely that this list is composed of a mix of productivity biomarkers and potential targets for cell line engineering. It is also reasonable to assume that the cellular functions of the genes comprising the model developed using this method will provide insight into the biological processes underpinning productivity in CHO. Ontology and literature mining analysis revealed that this list constituted a mixed group of genes, impacting a diverse collection of cellular processes. For example, we identified 64 unique proteins using PANTHER analysis (see Supplementary Table II) and 88 unique proteins using the literature mining tool Pathway Studio (Supplementary Table III) to be involved in various cell processes directly impacting protein metabolism. Interesting candidates that emerge from these analyses include seven genes which overlapped several PANTHER categories (highlighted in Supplementary Table II). Also, as can be seen from Supplementary Table III, 43 proteins were represented in more than one process, with 4 proteins (APP, CANX, H-RAS and MAPK1) involved in five of the seven processes outlined and a further 6 proteins (CTSL, F2R, HMGB1, HSPA8, NSF and VIM) involved in four of the seven processes outlined. Noteworthy proteins in this list include APP and MAPK1, which have previously been extensively linked with protein secretion, and PLAUR, whose role in proteolysis has been comprehensively documented.

Additionally, the protein products of 27 genes (highlighted in Supplementary Table I) from the 212-member annotated list have been shown to be membrane-localised in previous studies in other organisms, offering the prospect that these proteins may be utilised as surface markers for productivity in CHO. Alternatively, given that the rate-limiting steps in protein secretion may well be at the translational or post-translational process level, attention may be focussed on the ER-localised protein products (highlighted in Supplementary Table I).

3.5. Overlap of 212-member list with targets identified from productivity profiling and functional studies in bioprocess culture

The 212-member annotated list was overlapped with genes and proteins identified from previous microarray and proteomics profiling studies which have been associated with productivity in CHO and NS0 cells. The results of the overlap analysis are summarised in Supplementary Table IV. As can be seen, a total of 44 genes from the 212-member list were identified as differentially expressed at either (or both) the gene and protein level in at least one of these studies; 16 of these genes, found in two or more independent productivity-related studies on mammalian cell lines, are highlighted in Supplementary Table IV.
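The bookkeeping behind an overlap analysis of this kind can be illustrated with a short Python sketch. It is not the procedure used to build Supplementary Table IV; the function name `overlap_with_studies`, the placeholder gene symbols and the three toy study lists are all hypothetical, and matching on upper-cased symbols is an assumed simplification of the re-annotation work that cross-study comparison would really require.

```python
from collections import Counter

def overlap_with_studies(model_genes, studies):
    """Count, for each model gene, how many prior studies reported it.

    model_genes: iterable of gene symbols in the model list.
    studies: dict mapping a study label to the gene symbols it reported.
    """
    model = {g.upper() for g in model_genes}
    counts = Counter()
    for reported in studies.values():
        # A gene counts once per study, however often that study lists it.
        for gene in {g.upper() for g in reported} & model:
            counts[gene] += 1
    return counts

# Hypothetical toy inputs (placeholder symbols, not data from this study).
model_genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
studies = {
    "study_1": ["geneA", "geneB", "geneX"],
    "study_2": ["GENEA", "geneC"],
    "study_3": ["geneY"],
}

counts = overlap_with_studies(model_genes, studies)
in_any = sorted(counts)                                       # found in >= 1 study
in_two_or_more = sorted(g for g, n in counts.items() if n >= 2)
print(len(in_any), "genes in at least one study:", in_any)
print(len(in_two_or_more), "genes in two or more studies:", in_two_or_more)
```

Counting each gene once per study keeps the "two or more independent studies" tally from being inflated by a single study that reports the same gene at both the transcript and protein level.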

In addition, three genes identified here as associated with the production phenotype in CHO have been previously shown to functionally impact productivity in bioprocess culture. A previous study found that simultaneous overexpression of the lectin-binding chaperones calnexin (CANX; here identified on the 212-member list) and calreticulin resulted in a 1.9-fold increase in the specific production rate of human thrombopoietin (hTPO) (Chung et al., 2004). A follow-up study by the same group (Mohan and Lee, 2009) confirmed that CANX overexpression, under the control of a tetracycline-inducible system, enhanced the productivity of CHO cells producing tumour necrosis factor receptor by 1.7-fold over the 2-fold increase already obtained by sodium butyrate treatment. An additional chaperone cell engineering study in NS0 cells (Downham et al., 1996) previously correlated an increased production rate of a mouse-human chimeric antibody with the expression of the molecular chaperone ERp72, a protein disulfide isomerase (PDI) isoform encoded by the PDIA4 gene (identified on the 212-member list), although this protein has recently been demonstrated not to affect productivity in CHO (Hayes et al., 2010). Finally, increased (2.5-fold) production of IFN-γ in CHO was also observed following overexpression of heat shock protein 70 (HSPA8; identified on the 212-member list) (Lee et al., 2009).

4. Conclusions

In conclusion, we describe a substantial (70-sample calibration set, 10-sample testing set) production cell line transcriptomic study which has been used to develop the first predictive model for specific productivity in CHO to within 4.44 Qp units (pg/cell/day). A multivariate regression algorithm has been applied to the data in order to construct a proof of concept model for the prediction of Qp based on 287 (212 annotated) key genes retained during an intensive selection procedure. The study described herein demonstrates the applicability of PLS for the prediction of cell-specific productivity from gene expression data to within 4.44 Qp units in RMSECMV with Q2,CMV = 0.72. Upon presentation of the test set, the model constructed returned a RMSEprediction of 3.11 Qp units.

The identity of the genes that constitute the model has also provided some insight into some of the biological processes utilised in regulating productivity in CHO, and the model is most likely composed of a mix of biomarkers for productivity as well as targets for future cell line engineering strategies. The use of diverse bioinformatics strategies of biological process mapping, literature mining and overlapping of productivity-associated transcriptomic and proteomic results has resulted in the prioritisation of several genes for future studies, particularly the ANXA2, APP, CANX, CDC20, CTSL, HSPA8, LMAN2, NEDD4, NPC1, NSF, PDIA4, PPARBP, PPID, PSMD4, RAB6A and RTN3 genes, which were identified by all three methods and of which three (CANX, HSPA8 and PDIA4) have previously been demonstrated to functionally impact productivity in bioprocess culture.
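For readers less familiar with the two error statistics quoted in these conclusions, the Python sketch below shows how a root mean squared error and a Q2 value can be computed from observed and predicted Qp. It assumes the common definition Q2 = 1 - PRESS/TSS, which may differ in detail from the formula used in this study; the function names and the four toy Qp values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the response (pg/cell/day for Qp)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def q_squared(observed, predicted):
    """Q2 = 1 - PRESS / TSS (assumed definition).

    PRESS is the sum of squared prediction errors on held-out samples;
    TSS is the total sum of squares of the observed values about their mean.
    """
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / tss

# Hypothetical toy values, not data from this study.
qp_observed = [10.0, 20.0, 30.0, 40.0]
qp_predicted = [12.0, 18.0, 33.0, 39.0]

print(round(rmse(qp_observed, qp_predicted), 3))
print(round(q_squared(qp_observed, qp_predicted), 3))
```

Whether the result is a cross model validation statistic or a test-set statistic depends only on where the predictions come from: held-out samples inside the validation loops in the first case, the independent samples in the second.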

The use of CMV provides a robust estimation of future performance of the model and offsets bias in gene selection. While we present here an accurate model for predicting productivity in stationary-phase, temperature-shifted CHO cultures, we envisage that this proof-of-concept method may be utilised in the future to develop a predictive model for productivity (and possibly alternative bioprocess-relevant phenotypes) in CHO at the clonal selection stage, thereby increasing the potential to identify and select the best performers from a panel of clones during early stage selection. Further studies could facilitate the development of future predictive models using different phenotypes and the refinement of the methodology to yield a better understanding of bioprocessing variables and CHO biology.
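The point that CMV offsets bias in gene selection can be made concrete with a minimal Python sketch. It is not the JK–PLS procedure of this study: a correlation filter stands in for jackknife gene selection, an average of univariate least-squares fits stands in for PLS, and the outer loop is leave-one-out. All function names and the simulated data are hypothetical. What the sketch does preserve is the essential feature of CMV, namely that genes are re-selected inside every outer split without access to the held-out sample.

```python
import math
import random

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def select_genes(X, y, k):
    """Stand-in selector: indices of the k genes most correlated with y."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: -scores[j])[:k]

def fit_predict(X_train, y_train, x_new, genes):
    """Stand-in learner: mean of univariate least-squares predictions."""
    my = sum(y_train) / len(y_train)
    preds = []
    for j in genes:
        col = [row[j] for row in X_train]
        mx = sum(col) / len(col)
        sxx = sum((a - mx) ** 2 for a in col)
        slope = sum((a - mx) * (b - my) for a, b in zip(col, y_train)) / sxx if sxx else 0.0
        preds.append(my + slope * (x_new[j] - mx))
    return sum(preds) / len(preds)

def loo_rmse(X, y, k, reselect):
    """Leave-one-out RMSE.

    reselect=True  -> genes are chosen afresh without the held-out sample (CMV-style).
    reselect=False -> genes are chosen once from all samples (prone to selection bias).
    """
    fixed = None if reselect else select_genes(X, y, k)
    sq_err = 0.0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_tr, y_tr, k) if reselect else fixed
        sq_err += (fit_predict(X_tr, y_tr, X[i], genes) - y[i]) ** 2
    return math.sqrt(sq_err / len(y))

# Simulated data with no real signal: 20 samples, 200 "genes", random response.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(200)] for _ in range(20)]
y = [random.gauss(0, 1) for _ in range(20)]

cmv_style = loo_rmse(X, y, k=5, reselect=True)
biased = loo_rmse(X, y, k=5, reselect=False)
print("RMSE with selection inside the loop:", cmv_style)
print("RMSE with selection on all samples: ", biased)
```

On pure-noise data such as this, selecting genes once from all samples typically yields the more optimistic error estimate, because the held-out sample has already influenced which genes were kept; repeating the selection inside each split removes that leak.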

Acknowledgements

This work was supported by funding from Science Foundation Ireland (SFI) grant number 07/IN.1/B1323.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jbiotec.2010.11.016.

References

Aaroe, J., Lindahl, T., Dumeaux, V., Saebo, S., Tobin, D., Hagen, N., Skaane, P., Lonneborg, A., Sharma, P., Borresen-Dale, A.L., 2010. Gene expression profiling of peripheral blood cells for early detection of breast cancer. Breast Cancer Research 12, R7.

Alete, D.E., Racher, A.J., Birch, J.R., Stansfield, S.H., James, D.C., Smales, C.M., 2005. Proteomic analysis of enriched microsomal fractions from GS-NS0 murine myeloma cells with varying secreted recombinant monoclonal antibody productivities. Proteomics 5, 4689–4704.

Altamirano, C., Paredes, C., Cairo, J.J., Godia, F., 2000. Improvement of CHO cell culture medium formulation: simultaneous substitution of glucose and glutamine. Biotechnology Progress 16, 69–75.

Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of gene-expression data. Proceedings of the National Academy of Sciences of the United States of America 99, 6562–6566.

Anderssen, E., Dyrstad, K., Westad, F., Martens, H., 2006. Reducing over-optimism in variable selection by cross-model validation. Chemometrics and Intelligent Laboratory Systems 84, 69–74.

Blow, N., 2009. Transcriptomics: the digital generation. Nature 458, 239–242.

Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P., 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.

Boulesteix, A.L., Strimmer, K., 2007. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 8, 32–44.

Butler, M., 2005. Animal cell cultures: recent achievements and perspectives in the production of biopharmaceuticals. Applied Microbiology and Biotechnology 68, 283–291.

Carlage, T., Hincapie, M., Zang, L., Lyubarskaya, Y., Madden, H., Mhatre, R., Hancock, W.S., 2009. Proteomic profiling of a high-producing Chinese hamster ovary cell culture. Analytical Chemistry 81, 7357–7362.

Charaniya, S., Karypis, G., Hu, W.S., 2009. Mining transcriptome data for function-trait relationship of hyper productivity of recombinant antibody. Biotechnology and Bioengineering 102, 1654–1669.

Chung, J.Y., Lim, S.W., Hong, Y.J., Hwang, S.O., Lee, G.M., 2004. Effect of doxycycline-regulated calnexin and calreticulin expression on specific thrombopoietin productivity of recombinant Chinese hamster ovary cells. Biotechnology and Bioengineering 85, 539–546.

Dinnis, D.M., Stansfield, S.H., Schlatter, S., Smales, C.M., Alete, D., Birch, J.R., Racher, A.J., Marshall, C.T., Nielsen, L.K., James, D.C., 2006. Functional proteomic analysis of GS-NS0 murine myeloma cell lines with varying recombinant monoclonal antibody production rate. Biotechnology and Bioengineering 94, 830–841.

Doolan, P., Melville, M., Gammell, P., Sinacore, M., Meleady, P., McCarthy, K., Francullo, L., Leonard, M., Charlebois, T., Clynes, M., 2008. Transcriptional profiling of gene expression changes in a PACE-transfected CHO DUKX cell line secreting high levels of rhBMP-2. Molecular Biotechnology 39, 187–199.

Downham, M.R., Farrell, W.E., Jenkins, H.A., 1996. Endoplasmic reticulum protein expression in recombinant NS0 myelomas grown in batch culture. Biotechnology and Bioengineering 51, 691–696.

Efron, B., Stein, C., 1981. The jackknife estimate of variance. Annals of Statistics 9, 586–596.


Filzmoser, P., Liebmann, B., Varmuza, K., 2009. Repeated double cross validation. Journal of Chemometrics 23, 160–171.

Gidskehaug, L., Anderssen, E., Alsberg, B.K., 2008. Cross model validation and optimisation of bilinear regression models. Chemometrics and Intelligent Laboratory Systems 93, 1–10.

Gidskehaug, L., Anderssen, E., Flatberg, A., Alsberg, B., 2007. A framework for significance analysis of gene expression data using dimension reduction methods. BMC Bioinformatics 8, 346.

Griffin, T.J., Seth, G., Xie, H.W., Bandhakavi, S., Hu, W.S., 2007. Advancing mammalian cell culture engineering using genome-scale technologies. Trends in Biotechnology 25, 401–408.

Hayes, N.V., Smales, C.M., Klappa, P., 2010. Protein disulfide isomerase does not control recombinant IgG4 productivity in mammalian cell lines. Biotechnology and Bioengineering 105, 770–779.

Huang, X., Pan, W., Park, S., Han, X., Miller, L.W., Hall, J., 2004. Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares. Bioinformatics 20, 888–894.

Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P., 2003a. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31, e15.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003b. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.

Kantardjieff, A., Jacob, N.M., Yee, J.C., Epstein, E., Kok, Y.J., Philp, R., Betenbaugh, M., Hu, W.S., 2010. Transcriptome and proteome analysis of Chinese hamster ovary cells under low temperature and butyrate treatment. Journal of Biotechnology 145, 143–159.

Khoo, S.H., Falciani, F., Al-Rubeai, M., 2007. A genome-wide transcriptional analysis of producer and non-producer NS0 myeloma cell lines. Biotechnology and Applied Biochemistry 47, 85–95.

Lee, Y.Y., Wong, K.T.K., Tan, J., Toh, P.C., Mao, Y.Y., Brusic, V., Yap, M.G.S., 2009. Overexpression of heat shock proteins (HSPs) in CHO cells for extended culture viability and improved recombinant protein production. Journal of Biotechnology 143, 34–43.

Martens, H., Martens, M., 2000. Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Quality and Preference 11, 5–16.

Martens, H., Naes, T., 1989. Multivariate Calibration. J. Wiley & Sons Ltd., Chichester, UK.

Meleady, P., Henry, M., Gammell, P., Doolan, P., Sinacore, M., Melville, M., Francullo, L., Leonard, M., Charlebois, T., Clynes, M., 2008. Proteomic profiling of CHO cells with enhanced rhBMP-2 productivity following co-expression of PACEsol. Proteomics 8, 2611–2624.

Mevik, B.H., Wehrens, R., 2007. The PLS package: principal component and partial least squares regression in R. Journal of Statistical Software 18, 1–24.

Misra, J., Alevizos, I., Hwang, D., Stephanopoulos, G., Stephanopoulos, G., 2007. Linking physiology and transcriptional profiles by quantitative predictive models. Biotechnology and Bioengineering 98, 252–260.

Mohan, C., Kim, Y.G., Koo, J., Lee, G.M., 2008. Assessment of cell engineering strategies for improved therapeutic protein production in CHO cells. Biotechnology Journal 3, 624–630.

Mohan, C., Lee, G.M., 2009. Calnexin overexpression sensitizes recombinant CHO cells to apoptosis induced by sodium butyrate treatment. Cell Stress & Chaperones 14, 49–60.

Nguyen, D.V., Rocke, D.M., 2002a. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18, 1216–1226.

Nguyen, D.V., Rocke, D.M., 2002b. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 18, 1625–1632.

Nissom, P.M., Sanny, A., Kok, Y.J., Hiang, Y.T., Chuah, S.H., Shing, T.K., Lee, Y.Y., Wong, K.T.K., Hu, W.S., Sim, M.Y.G., Philp, R., 2006. Transcriptome and proteome profiling to understanding the biology of high productivity CHO cells. Molecular Biotechnology 34, 125–140.

Pichler, J., Galosy, S., Mott, J., Borth, N., 2010. Selection of CHO host cell subclones with increased specific antibody production rates by repeated cycles of transient transfection and cell sorting. Biotechnology and Bioengineering, doi:10.1002/bit.22946.

Prentice, H.L., Ehrenfels, B.N., Sisk, W.P., 2007. Improving performance of mammalian cells in fed-batch processes through "bioreactor evolution". Biotechnology Progress 23, 458–464.

Schaub, J., Clemens, C., Schorn, P., Hildebrandt, T., Rust, W., Mennerich, D., Kaufmann, H., Schulz, T.W., 2010. CHO gene expression profiling in biopharmaceutical process analysis and design. Biotechnology and Bioengineering 105, 431–438.

Sellick, C.A., Hansen, R., Jarvis, R.M., Maqsood, A.R., Stephens, G.M., Dickson, A.J., Goodacre, R., 2010. Rapid monitoring of recombinant antibody production by mammalian cell cultures using Fourier transform infrared spectroscopy and chemometrics. Biotechnology and Bioengineering 106, 432–442.

Seth, G., Philp, R.J., Lau, A., Jiun, K.Y., Yap, M., Hu, W.S., 2007. Molecular portrait of high productivity in recombinant NS0 cells. Biotechnology and Bioengineering 97, 933–951.

Smales, C.M., Dinnis, D.M., Stansfield, S.H., Alete, D., Sage, E.A., Birch, J.R., Racher, A.J., Marshall, C.T., James, D.C., 2004. Comparative proteomic analysis of GS-NS0 murine myeloma cell lines with varying recombinant monoclonal antibody production rate. Biotechnology and Bioengineering 88, 474–488.

Stansfield, S.H., Allen, E.E., Dinnis, D.M., Racher, A.J., Birch, J.R., James, D.C., 2007. Dynamic analysis of GS-NS0 cells producing a recombinant monoclonal antibody during fed-batch culture. Biotechnology and Bioengineering 97, 410–424.

Thomassen, Y.E., van Sprang, E.N.M., van der Pol, L.A., Bakker, W.A.M., 2010. Multivariate data analysis on historical IPV production data for better process understanding and future improvements. Biotechnology and Bioengineering 107 (1), 96–104.

Trummer, E., Ernst, W., Hesse, F., Schriebl, K., Lattenmayer, C., Kunert, R., Vorauer-Uhl, K., Katinger, H., Muller, D., 2008. Transcriptional profiling of phenotypically different Epo-Fc expressing CHO clones by cross-species microarray analysis. Biotechnology Journal 3, 924–937.

Westad, F., Schmidt, A., Kermit, M., 2008. Incorporating chemical band-assignment in near infrared spectroscopy regression models. Journal of Near Infrared Spectroscopy 16, 265–273.

Wurm, F.M., 2004. Production of recombinant protein therapeutics in cultivated mammalian cells. Nature Biotechnology 22, 1393–1398.

Yee, J.C., Gatti, M.D., Philp, R.J., Yap, M., Hu, W.S., 2007. Genomic and proteomic exploration of CHO and hybridoma cells under sodium butyrate treatment. Biotechnology and Bioengineering 99, 1186–1204.

Yee, J.C., Gerdtzen, Z.P., Hu, W.S., 2009. Comparative transcriptome analysis to unveil genes affecting recombinant protein productivity in mammalian cells. Biotechnology and Bioengineering 102, 246–263.