99
Introduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University Linz Lecture Notes Institute of Bioinformatics Johannes Kepler University Linz A-4040 Linz, Austria Tel. +43 732 2468 8880 Fax +43 732 2468 9511 http://www.bioinf.jku.at

Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

Introduction to Rwith applications to bioinformatics

PART II

Sepp Hochreiter

Institute of Bioinformatics, Johannes Kepler University LinzLecture Notes

Institute of BioinformaticsJohannes Kepler University LinzA-4040 Linz, Austria

Tel. +43 732 2468 8880Fax +43 732 2468 9511

http://www.bioinf.jku.at

Page 2: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

2 Contents

Contents1 Analysis of Gene Expression Data 3

1.1 ALL Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.1.1 Preprocessing and Filtering . . . . . . . . . . . . . . . . . . . . . 71.1.2 Clustering of Gene Expression Data . . . . . . . . . . . . . . . . 81.1.3 Classification of Gene Expression Data . . . . . . . . . . . . . . 151.1.4 Gene Selection for Gene Expression Data . . . . . . . . . . . . . 23

1.2 Tumorigenic Breast-Cancer Cells . . . . . . . . . . . . . . . . . . . . . . 63

A New Feature Selection Methods 85A.1 Multidimensional Wald and Score Test . . . . . . . . . . . . . . . . . . . 85

A.1.1 Wald Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85A.1.2 Score Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

A.2 Optimal Brain Surgeon on Fisher Information . . . . . . . . . . . . . . . 86A.3 Fisher Information for Additive Gaussian Noise . . . . . . . . . . . . . . 90A.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91A.5 Fisher Information for Logistic Regression . . . . . . . . . . . . . . . . . 92A.6 Constraint Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . 93

A.6.1 Ball Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . 93A.6.2 Variation Constraint . . . . . . . . . . . . . . . . . . . . . . . . 94A.6.3 Deletion in Ball Constraint . . . . . . . . . . . . . . . . . . . . . 94A.6.4 Deletion Variation Constraint . . . . . . . . . . . . . . . . . . . 96

A.7 Constraint Least Square . . . . . . . . . . . . . . . . . . . . . . . . . . . 97A.8 Matrix inversion lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . 97A.9 Compute Inverse Iteratively . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.9.1 Backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98A.9.2 Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.10 Compute Bilinear Form Iteratively . . . . . . . . . . . . . . . . . . . . . 99

Page 3: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 3

1 Analysis of Gene Expression Data

We need following libraries in this section.

> library("Biobase")> library("bioDist")> library("genefilter")> library("MLInterfaces")> library("annotate")> library("GOstats")> library("multtest")> library("Rgraphviz")> library("hgu95av2.db")> library("hgu95av2")> library("ALL")> library("randomForest")> library("kernlab")> library("varSelRF")> library("lars")> library("svmpath")> library("RColorBrewer")> library("class")> library("xtable")

1.1 ALL Data

We analyze the acute lymphoblastic leukemia (ALL) gene expression data set:

Sabina Chiaretti, Xiaochun Li, Robert Gentleman, Antonella Vitale, MarcoVignetti, Franco Mandelli, Jerome Ritz, and Robin Foa Gene expression pro-file of adult T-cell acute lymphocytic leukemia identifies distinct subsets ofpatients with different response to therapy and survival. Blood, 1 April 2004,Vol. 103, No. 7.

Let’s load the data set.

> data(ALL)

The data consist of microarrays from 128 different individuals with acute lymphoblasticleukemia (ALL). The data have been normalized and summarized with RMA. RMA storesthe data are presented in an “exprSet” object which contains experimental informationsadditional to the summarized expression values. The experimental information can beaccess via

Page 4: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

4 1 Analysis of Gene Expression Data

> experimentData(ALL)

Experiment dataExperimenter name: Chiaretti et al.Laboratory: Department of Medical Oncology, Dana-Farber Cancer Institute, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA.Contact information:Title: Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival.URL:PMIDs: 14684422 16243790

Abstract: A 187 word abstract is available. Use 'abstract' method.

The title of the paper is

> aa <- strsplit(experimentData(ALL)@title, " ")> laa <- length(aa[[1]])> i <- laa%/%8 + 1> for (ii in 1:i) {+ if (ii == 1) {+ cat("######## BEGIN TITLE ########\n")+ }+ start <- (ii - 1) * 8 + 1+ end <- ii * 8+ if (end > laa) {+ end <- laa+ }+ cat(aa[[1]][start:end], "\n")+ if (ii == i) {+ cat("######### END TITLE #########\n")+ }+ }

######## BEGIN TITLE ########Gene expression profile of adult T-cell acute lymphocyticleukemia identifies distinct subsets of patients with differentresponse to therapy and survival.######### END TITLE #########

The abstract of the paper is

> aa <- strsplit(abstract(ALL), " ")> laa <- length(aa[[1]])> i <- laa%/%8 + 1

Page 5: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 5

> for (ii in 1:i) {+ if (ii == 1) {+ cat("######## BEGIN ABSTRACT ########\n")+ }+ start <- (ii - 1) * 8 + 1+ end <- ii * 8+ if (end > laa) {+ end <- laa+ }+ cat(aa[[1]][start:end], "\n")+ if (ii == i) {+ cat("######### END ABSTRACT #########\n")+ }+ }

######## BEGIN ABSTRACT ########Gene expression profiles were examined in 33 adultpatients with T-cell acute lymphocytic leukemia (T-ALL). Nonspecificfiltering criteria identified 313 genes differentially expressed inthe leukemic cells. Hierarchical clustering of samples identified2 groups that reflected the degree of T-celldifferentiation but was not associated with clinical outcome.Comparison between refractory patients and those who respondedto induction chemotherapy identified a single gene, interleukin8 (IL-8), that was highly expressed in refractoryT-ALL cells and a set of 30 genesthat was highly expressed in leukemic cells frompatients who achieved complete remission. We next identified19 genes that were differentially expressed in T-ALLcells from patients who either had a relapseor remained in continuous complete remission. A modelbased on the expression of 3 of thesegenes was predictive of duration of remission. The3-gene model was validated on a further setof T-ALL samples from 18 additional patients treatedon the same clinical protocol. This study demonstratesthat gene expression profiling can identify a limitednumber of genes that are predictive of responseto induction therapy and remission duration in adultpatients with T-ALL.######### END ABSTRACT #########

More than 90% of patients with chronic myeloid leukaemia (CML) have a genetic abnor-mality known as the Philadelphia (Ph) chromosome. It is a translocation between chro-

Page 6: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

6 1 Analysis of Gene Expression Data

Figure 1: The Philadelphia (Ph) chromosome. It results from a reciprocal chromosomaltranslocation which fuses the ABL gene on chromosome 9 to the BCR gene on chromo-some 22.5.

mosomes 9 and 22 and produces a new, abnormal gene called BCR-ABL, which fuses theABL gene on chromosome 9 to the BCR gene on chromosome 22.5. Fig. 1 shows thistranslocation. The resulting oncogene (BCR-ABL) encodes a constitutively active formof ABL tyrosine kinase that causes the excess of white blood cells typical of CML. InB-cell lineage ALL, the poor prognosis of adult patients is partly due to the presence ofspecific genetic translocations, such as the BCR/ABL gene rearrangements. The BCR-ABL transcript is constitutively active, i.e. it does not require activation by other cellularmessaging proteins. In turn, BCR-ABL activates a number of cell cycle-controlling pro-teins and enzymes, speeding up cell division. Moreover, it inhibits DNA repair, causinggenomic instability and potentially causing the feared blast crisis in CML.

Page 7: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 7

We want to diagnose BCR-ABL based on gene expression.

Toward this end, we select the B-cells and divide them into the ones which show theBCR/ABL translocation and the ones which do not possess this translocation.

> bcell = grep("^B", as.character(ALL$BT))> moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))> ALL_bcrneg = ALL[, intersect(bcell, moltyp)]> ALL_bcrneg$mol.biol = factor(ALL_bcrneg$mol.biol)

The data set looks as follows:

> table(ALL_bcrneg$mol.biol)

BCR/ABL NEG37 42

1.1.1 Preprocessing and Filtering

Next we filter out probe sets (gene expression values) with small variance because theirsignal is too small relative to the measurement error and, therefore, not reliable.

> ALL_bcrneg@annotation <- "hgu95av2.db"> ALLfilt_bcrneg = nsFilter(ALL_bcrneg, var.cutoff = 0.75)$eset

The filtered data set is also of the class "ExpressionSet".

> class(ALLfilt_bcrneg)

[1] "ExpressionSet"attr(,"package")[1] "Biobase"

Next we define a function for computing a robust standard deviation, where “rowQ” isthe empirical row quantile and “rowIQRs” is the upper quantile minus the lower quantile.We use the upper 75%-quantile and the lower 25%-quantile.

> rowIQRs = function(eSet) {+ numSamp = ncol(eSet)+ lowQ = rowQ(eSet, floor(0.25 * numSamp))+ upQ = rowQ(eSet, ceiling(0.75 * numSamp))+ upQ - lowQ+ }

Page 8: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

8 1 Analysis of Gene Expression Data

Using the robust standard deviation and instead of the mean the robust median, we stan-dardize the expression values per gene. Now all genes have the same median and the samedeviation and can be treated on the same level.

> standardize = function(x) (x - rowMedians(x))/rowIQRs(x)> exprs(ALLfilt_bcrneg) = standardize(exprs(ALLfilt_bcrneg))

1.1.2 Clustering of Gene Expression Data

Our first approach is unsupervised learning, where we analyze the ALL gene expressiondata by clustering and show the result as a heatmap.

We already have seen other down-projection methods like correspondence analysis (CA),principal coordinates analysis (PCoA or PCO), or multidimensional scaling (MDS). Fur-ther methods are principal component analysis (PCA), independent component analysis(ICA), and factor analysis (FA).

For filtered and standardized gene expression values mutual distances between two B-cells can be computed. The Euclidean distance is the most straightforward approach tocompute the distance.

> eucD = dist(t(exprs(ALLfilt_bcrneg)))> eucM = as.matrix(eucD)> dim(eucM)

[1] 79 79

Using this distance we can plot a heatmap as a false color image. If we arrange simi-lar objects as clusters, then the clusters are visible as blocks at the main diagonal in theheatmap. Objects are clustering by hierarchical clustering which is visualized as a dendro-gram at the left side and at the top of the heatmap. Fig. 2 shows the heatmap of clusteredEuclidean distances.

> hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)> hmcol <- rev(hmcol)

> heatmap(eucM, sym = TRUE, col = hmcol, distfun = as.dist)

Page 9: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 9

6400

216

009

2500

668

001

0400

820

002

1202

626

001

0600

222

009

2401

862

003

2401

108

024

2201

337

013

1401

612

012

2800

122

010

2700

309

017

3101

162

002

2600

348

001

2401

701

010

3300

504

010

4300

428

044

1500

509

008

2801

928

035

2803

730

001

2802

149

006

6500

557

001

2804

228

031

4301

228

023

6200

128

047

2800

728

024

0400

784

004

0300

211

005

2800

628

005

2700

424

022

2804

328

036

0401

612

019

4300

725

003

3600

208

012

2400

122

011

0800

168

003

2401

043

001

1200

624

008

1500

164

001

1200

701

005

0801

1

08011010051200764001150012400812006430012401068003080012201124001080123600225003430071201904016280362804324022270042800528006110050300284004040072802428007280476200128023430122803128042570016500549006280213000128037280352801909008150052804443004040103300501010240174800126003620023101109017270032201028001120121401637013220130802424011620032401822009060022600112026200020400868001250061600964002

Figure 2: Heatmap of clustered Euclidean distances for the B cell expression values.Blocks at the main diagonal indicate clusters.

Page 10: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

10 1 Analysis of Gene Expression Data

Next we use Spearman’s rank correlation as distance and plot again the heatmap afterclustering in Fig. 3.

> spD <- spearman.dist(ALLfilt_bcrneg)> spM <- as.matrix(spD)

> heatmap(spM, sym = TRUE, col = hmcol, distfun = function(x) as.dist(x))

Page 11: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 11

4300

712

019

2500

336

002

4800

124

018

4900

662

002

2401

108

011

0100

503

002

1100

509

017

0800

168

003

2803

528

037

2801

928

021

3000

104

010

3300

501

010

6800

125

006

2400

108

012

2201

108

024

2804

443

004

2800

627

003

2800

184

004

3701

322

013

1201

214

016

2803

128

023

4301

212

007

1202

606

002

2600

131

011

2200

962

001

6500

564

001

2804

728

005

2401

728

042

2803

626

003

2402

204

008

2000

227

004

2804

316

009

6400

204

016

0900

815

005

0400

715

001

2400

822

010

5700

162

003

2800

728

024

4300

112

006

2401

0

24010120064300128024280076200357001220102400815001040071500509008040166400216009280432700420002040082402226003280362804224017280052804764001650056200122009310112600106002120261200743012280232803114016120122201337013840042800127003280064300428044080242201108012240012500668001010103300504010300012802128019280372803568003080010901711005030020100508011240116200249006240184800136002250031201943007

Figure 3: Heatmap of clustered Spearman’s rank correlations as distances for the B cellexpression values. Blocks at the main diagonal indicate clusters.

Page 12: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

12 1 Analysis of Gene Expression Data

The next distance measure is the mutual information which is computed via binning. InFig. 4 the according heatmap after clustering is shown.

> cD <- MIdist(ALLfilt_bcrneg)> cM <- as.matrix(cD)

> heatmap(cM, sym = TRUE, col = hmcol, distfun = function(x) as.dist(x))

Page 13: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 13

0400

762

003

5700

122

010

4300

712

019

2400

108

012

2201

109

008

1500

536

002

4300

124

010

1200

625

003

0802

412

007

1202

606

002

2600

131

011

2200

962

001

6500

564

001

2803

128

023

4301

262

002

1500

124

008

4800

124

018

4900

624

011

0100

508

011

4300

428

044

2800

627

003

2800

184

004

3701

304

016

2800

728

024

2201

312

012

1401

624

017

2804

228

047

2800

504

008

2000

228

043

2700

424

022

2600

328

036

0300

211

005

0901

708

001

6800

328

035

2803

728

019

3000

128

021

1600

964

002

0401

033

005

0101

068

001

2500

6

25006680010101033005040106400216009280213000128019280372803568003080010901711005030022803626003240222700428043200020400828005280472804224017140161201222013280242800704016370138400428001270032800628044430040801101005240114900624018480012400815001620024301228023280316400165005620012200931011260010600212026120070802425003120062401043001360021500509008220110801224001120194300722010570016200304007

Figure 4: Heatmap of clustered mutual informations as distances for the B cell expressionvalues. Blocks at the main diagonal indicate clusters.

Page 14: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

14 1 Analysis of Gene Expression Data

Page 15: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 15

1.1.3 Classification of Gene Expression Data

We analyzed gene expression data by clustering which is shown as a heatmap. Howeverfor this data set we also have labels by whether the sample is BCR/ABL positive or neg-ative. We can construct a classifier which is able to classify a new gene expression ALLexample into BCR/ABL positive or negative.

The construction of a classifier is done by machine learning methods where we focus onsupport vector machines (SVMs).

First we have to define the classes on the whole data set. The we have to divide the wholedata set into a training set and a test set. For the training set 20 BCR/ABL and 20 NEGare randomly sampled and the rest of the data is used for testing.

> Negs <- which(ALLfilt_bcrneg$mol.biol == "NEG")> Bcr <- which(ALLfilt_bcrneg$mol.biol == "BCR/ABL")> S1 <- sample(Negs, 20, replace = FALSE)> S2 <- sample(Bcr, 20, replace = FALSE)> TrainInd <- c(S1, S2)> TestInd <- setdiff(1:79, TrainInd)

As an first approach, we use the machine learning interface MLInterfaces and call themethods by MLearn.

1) k-nearest neighbor

The first approach is k-nearest neighbor where a data point is classified according to themajority class of the k closest training data points. We use k = 1, that is a new data pointis classified according to the class of its closest training point.

> kans = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, knnI(k = 1,+ l = 0), TrainInd)> confuMat(kans)

predictedgiven BCR/ABL NEG

BCR/ABL 10 7NEG 6 16

2) diagonal linear discriminant analysis

If both classes are normally distributed with the same covariance matrix Σ and class 1has mean µ1 and class 2 has mean µ2, then the Bayes optimal classification is given bythe sign of

xT Σ−1 (µ1 − µ2) + c ,

Page 16: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

16 1 Analysis of Gene Expression Data

where c is some threshold.

Linear discriminant analysis (LDA) constructs a classifier according to this formula. If Σis restricted to a diagonal matrix then this is diagonal linear discriminant analysis (DLDA)which is considered here.

> dldans <- MLearn(mol.biol ~ ., ALLfilt_bcrneg, dldaI, TrainInd)> confuMat(dldans)

predictedgiven BCR/ABL NEG

BCR/ABL 13 4NEG 7 15

3) linear discriminant analysis

An now the full covariance matrix leading to linear discriminant analysis (LDA) which isconsidered now.

> ldaans = MLearn(mol.biol ~ ., ALLfilt_bcrneg, ldaI, TrainInd)> confuMat(ldaans)

predictedgiven BCR/ABL NEG

BCR/ABL 13 4NEG 6 16

As already mentioned we focus on support vector machines (SVMs). Now we present thefirst example. We consider

C-SVMs and

ν-SVMs

where the latter allows for easier adjustment of the hyperparameter (here ν). However,the latter is not suited for highly unbalanced data sets.

An SVM requires a kernel which calculates the similarity between data vectors or objects.As kernels we consider

radial basis function (RBF) kernel – the most common kernel

dot product kernel – the most simple kernel leading to a linear classifier

Laplace kernel.

Page 17: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 17

4) C-SVM with RBF kernel C = 1, γ = 1/dim

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 12 5NEG 6 16

5) C-SVM with RBF kernel C = 1, γ = 0.01

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, gamma = 0.01,+ TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 5 12NEG 1 21

6) C-SVM with RBF kernel C = 1, γ = 0.001

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, gamma = 0.001,+ TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 10 7NEG 7 15

7) C-SVM with RBF kernel C = 1, γ = 0.0001

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, gamma = 1e-04,+ TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 11 6NEG 5 17

Page 18: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

18 1 Analysis of Gene Expression Data

8) ν-SVM with RBF kernel ν = 0.5, γ = 0.001

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, type = "nu-classification",+ nu = 0.5, gamma = 0.001, TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 11 6NEG 7 15

9) ν-SVM with RBF kernel ν = 0.1, γ = 0.001

> svmrbf1 = MLearn(mol.biol ~ ., data = ALLfilt_bcrneg, svmI, type = "nu-classification",+ nu = 0.1, gamma = 0.001, TrainInd)> confuMat(svmrbf1)

predictedgiven BCR/ABL NEG

BCR/ABL 11 6NEG 7 15

The SVM calls for the package MLearn are given in the following.

The call for the SVM:

svm(x, y = NULL, scale = TRUE, type = NULL,kernel = "radial", degree = 3,gamma = if (is.vector(x)) 1 else 1 / ncol(x),coef0 = 0, cost = 1, nu = 0.5, class.weights = NULL,cachesize = 40, tolerance = 0.001, epsilon = 0.1,shrinking = TRUE, cross = 0, probability = FALSE,fitted = TRUE, ..., subset, na.action = na.omit)

The SVM types for classification, one-class SVM, and regression:

type =C-classificationnu-classificationone-classification

eps-regressionnu-regression

The kernels:

Page 19: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 19

kernel=linear: u'*v

polynomial: (gamma*u'*v + coef0)^degreeradial basis: exp(-gamma*|u-v|^2)sigmoid: tanh(gamma*u'*v + coef0)

Now we leave the machine learning interface MLInterfaces and use the kernel packagekernlab. We now have to define training and test set for kernlab.

> y <- is.vector(79)> y[Negs] <- -1> y[Bcr] <- 1> etrain <- t(exprs(ALLfilt_bcrneg))[TrainInd, ]> ytrain <- y[TrainInd]> etest <- t(exprs(ALLfilt_bcrneg))[TestInd, ]> ytest <- y[TestInd]

For kernlab we define a function which give the confusion matrix.

> confusion_sepp <- function(tr, pr, labels = unique(tr)) {+ tab <- table(factor(tr, levels = unique(tr)), factor(pr,+ levels = unique(tr)))+ tab <- rbind(tab, colSums(tab))+ tab <- cbind(tab, rowSums(tab))+ dimnames(tab)[[1]][1] <- paste(labels[1], "(true)", sep = "")+ dimnames(tab)[[1]][2] <- paste(labels[2], "(true)", sep = "")+ dimnames(tab)[[2]][1] <- paste(labels[1], "(pred)", sep = "")+ dimnames(tab)[[2]][2] <- paste(labels[2], "(pred)", sep = "")+ dimnames(tab)[[2]][3] <- "total(true)"+ dimnames(tab)[[1]][3] <- "total(pred)"+ tab+ }

We have first to train a model by ksvm and the predict on the test set with predict. Theconfusion matrix is given on the prediction of the test set.

10) C-SVM with dot kernel C = 1

> svmrbf1_model = ksvm(etrain, ytrain, type = "C-svc", kernel = "vanilladot",+ C = 1)

Setting default kernel parameters

Page 20: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

20 1 Analysis of Gene Expression Data

> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 13 4 17NEG(true) 9 13 22total(pred) 22 17 39

11) C-SVM with RBF kernel C = 5, σ = 0.05

> svmrbf1_model = ksvm(etrain, ytrain, type = "C-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.05), C = 5)> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 1 16 17NEG(true) 1 21 22total(pred) 2 37 39

12) C-SVM with RBF kernel C = 0.5, σ = 0.05

> svmrbf1_model = ksvm(etrain, ytrain, type = "C-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.05), C = 0.5)> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 10 7 17NEG(true) 6 16 22total(pred) 16 23 39

13) ν-SVM with RBF kernel ν = 0.2, σ = 0.01

> svmrbf1_model = ksvm(etrain, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.01), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 5 12 17NEG(true) 1 21 22total(pred) 6 33 39

Page 21: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 21

14) ν-SVM with dot kernel ν = 0.2

> svmrbf1_model = ksvm(etrain, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 13 4 17NEG(true) 9 13 22total(pred) 22 17 39

15) ν-SVM with Laplacian kernel (exponential of the distance) ν = 0.2, σ = 0.01

> svmrbf1_model = ksvm(etrain, ytrain, type = "nu-svc", kernel = "laplacedot",+ kpar = list(sigma = 0.01), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 13 4 17NEG(true) 8 14 22total(pred) 21 18 39

The training call for the SVM in package kernlab is

ksvm(x, y = NULL, scaled = TRUE, type = NULL,kernel ="rbfdot", kpar = "automatic",C = 1, nu = 0.2,epsilon = 0.1, prob.model = FALSE, class.weights = NULL, cross = 0,fit = TRUE, cache = 40, tol = 0.001, shrinking = TRUE, ..., subset, na.action = na.omit)

The following types of SVMs are supplied:

type = C-svc C classification eps-svr epsilon regressionnu-svc nu classification nu-svr nu regressionC-bsvc bound-constraint classification eps-bsvr bound-constraint regressionspoc-svc Crammer, Singer multi-class one-svc novelty detectionkbb-svc Weston, Watkins multi-class

Page 22: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

22 1 Analysis of Gene Expression Data

The following kernels are possible:

kernel= vanilladot Linear kernel rbfdot Radial Basis kernel "Gaussian"polydot Polynomial kernel tanhdot Hyperbolic tangent kernellaplacedot Laplacian kernel anovadot ANOVA RBF kernelbesseldot Bessel kernel splinedot Spline kernelstringdot String kernel

The kernel parameters are as follows:

kpar = (1)sigma: "rbfdot" and "laplacedot" (1)degree, (2)scale, (3)offset: "polydot"(1)scale, (2)offset: "tanhdot" (1)sigma, (2)order, (3)degree: "besseldot"(1)sigma, (2)degree: "anovadot" (1)length, (2)lambda, (3)norm.: "stringdot"for the latter:"length": of the strings,"lambda": decay factor,"normalized": logical whether normalized

Attention: the string kernel “stringdot” may have bugs.

Page 23: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 23

1.1.4 Gene Selection for Gene Expression Data

Now we select relevant genes, that is genes that are important for classification and genesthat indicate the class membership. We select the genes which expression values areindicative to separate BCR/ABL from the rest of the genes.

1) Random Forest

Our first feature or gene selection method is random forest. It builds many small deci-sion trees with subsamples of the data set and subsamples of the features. Afterward thefeatures are ranked according to whether they are selected and how good was the perfor-mance on the left out samples.

> rf.vs1 <- varSelRF(etrain, factor(ytrain), ntree = 200, ntreeIterat = 100,+ vars.drop.frac = 0.2)> rf.vs1

Backwards elimination on random forest; ntree = 200 ; mtryFactor = 1

Selected variables:[1] "1096_g_at" "1635_at" "316_g_at" "32542_at" "32607_at" "33462_at"[7] "33774_at" "34345_at" "36502_at" "37015_at" "37027_at" "37043_at"

[13] "37411_at" "39089_at" "39120_at" "39122_at" "39712_at" "40076_at"[19] "40198_at" "40202_at" "879_at"

Number of selected variables: 21

Fig. 5 shows the result of the random forest procedure. For different number of selectedfeatures the error is plotted.

> plot(rf.vs1, which = 2)

Page 24: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

24 1 Analysis of Gene Expression Data

●●

2 5 10 20 50 100 200 500 2000

0.0

0.1

0.2

0.3

0.4

0.5

Number of variables used

OO

B e

rror

Figure 5: Result of the random forest procedure where for different number of featuresthe corresponding error is given.

Page 25: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 25

Next we reduce the components of the training and test set to the selected features.

> etrain_rf <- etrain[, which(match(colnames(etrain), rf.vs1[[3]]) >+ 0)]> etest_rf <- etest[, which(match(colnames(etrain), rf.vs1[[3]]) >+ 0)]

Using only the selected features SVMs are now trained and tested.

First we use a C-SVM with a dot product kernel.

> svmrbf1_model = ksvm(etrain_rf, ytrain, type = "C-svc", kernel = "vanilladot",+ C = 1)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_rf)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 13 4 17NEG(true) 1 21 22total(pred) 14 25 39

Then we use a ν-SVM with a dot product kernel.

> svmrbf1_model = ksvm(etrain_rf, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_rf)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 4 18 22total(pred) 19 20 39

Then we use a ν-SVM with an RBF kernel with σ = 0.01 and ν = 0.2.

Page 26: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

26 1 Analysis of Gene Expression Data

> svmrbf1_model = ksvm(etrain_rf, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.01), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest_rf)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 14 3 17NEG(true) 3 19 22total(pred) 17 22 39

Then we use a ν-SVM with an RBF kernel with σ = 0.1 and ν = 0.2.

> svmrbf1_model = ksvm(etrain_rf, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.1), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest_rf)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 4 18 22total(pred) 19 20 39

Now we select more genes because of the larger steps (dropping fraction) of the randomforest procedure.

> rf.vs2 <- varSelRF(etrain, factor(ytrain), ntree = 200, ntreeIterat = 100,+ vars.drop.frac = 0.4)> rf.vs2

Backwards elimination on random forest; ntree = 200 ; mtryFactor = 1

Selected variables:[1] "1635_at" "37015_at"

Number of selected variables: 2

Fig. 6 shows the result of the random forest procedure for larger step size. Again, fordifferent number of selected features the error is plotted.

> plot(rf.vs2, which = 2)

Page 27: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 27

●●

●●

●●

●●

●●

2 5 10 20 50 100 200 500 2000

0.0

0.1

0.2

0.3

0.4

0.5

Number of variables used

OO

B e

rror

Figure 6: Result of the random forest procedure with larger step size. For different numberof features the corresponding error is given.

Page 28: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

28 1 Analysis of Gene Expression Data

Again we project the data set to the selected features.

> etrain_rf2 <- etrain[, which(match(colnames(etrain), rf.vs2[[3]]) >+ 0)]> etest_rf2 <- etest[, which(match(colnames(etrain), rf.vs2[[3]]) >+ 0)]

We check the performance with a C-SVM with a dot product kernel.

> svmrbf1_model = ksvm(etrain_rf2, ytrain, type = "C-svc", kernel = "vanilladot",+ C = 1)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_rf2)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 3 19 22total(pred) 18 21 39

Increasing the step size gives even more genes.

> rf.vs3 <- varSelRF(etrain, factor(ytrain), ntree = 200, ntreeIterat = 100,+ vars.drop.frac = 0.6)> rf.vs3

Backwards elimination on random forest; ntree = 200 ; mtryFactor = 1

Selected variables:[1] "1635_at" "37015_at"

Number of selected variables: 2

Fig. 7 shows the result of the random forest procedure for a very large step size.

> plot(rf.vs3, which = 2)

Page 29: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 29

●●

●●●

●●

2 5 10 20 50 100 200 500 2000

0.0

0.1

0.2

0.3

0.4

0.5

Number of variables used

OO

B e

rror

Figure 7: Result of the random forest procedure with very large step size. Again, fordifferent number of features the corresponding error is given.

Page 30: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

30 1 Analysis of Gene Expression Data

Again we project the data set to the selected features.

> etrain_rf3 <- etrain[, which(match(colnames(etrain), rf.vs3[[3]]) >+ 0)]> etest_rf3 <- etest[, which(match(colnames(etrain), rf.vs3[[3]]) >+ 0)]

We check the performance with a C-SVM with a dot product kernel.

> svmrbf1_model = ksvm(etrain_rf3, ytrain, type = "C-svc", kernel = "vanilladot",+ C = 1)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_rf3)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 3 19 22total(pred) 18 21 39

Another check of the performance is done with a ν-SVM with an RBF kernel with σ =0.1, ν = 0.2.

> svmrbf1_model = ksvm(etrain_rf3, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.1), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest_rf3)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 4 18 22total(pred) 19 20 39

2) Recursive Feature Elimination (RFE; Backward selection with a linear SVM)

With RFE first a linear SVM is trained, then the features with smallest absolute weights(note, the weight is known for linear SVMs) are removed. Then another linear SVM istrained and the procedure starts again. The procedure is stopped if a minimal number offeatures remain.

We use our own simple implementation of RFE.

Page 31: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 31

> rfe_sepp <- function(etrain_r, etest_r, cycs = 40, ratio = 0.95,+ nu = 0.2) {+ for (j in 1:cycs) {+ svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc",+ kernel = "vanilladot", nu = nu)+ al <- unlist(alpha(svmrbf1_model))+ n <- length(al)+ xx <- unlist(xmatrix(svmrbf1_model))+ mn <- length(xx)+ m <- mn/n+ xx <- matrix(xx, nrow = n, ncol = m)+ w <- t(xx) %*% al+ rat <- ceiling(m * (1 - ratio))+ u <- abs(w)+ ma <- max(u)+ for (i in 1:rat) {+ u[which.min(u)] <- ma + 2+ }+ etrain_r <- etrain_r[, which(u < (ma + 1))]+ etest_r <- etest_r[, which(u < (ma + 1))]+ }+ res <- list(traind = etrain_r, testd = etest_r, rel = u)+ return(res)+ }

> ratio <- 0.95> cycs <- 40> etrain_r <- etrain> etest_r <- etest> nu = 0.2

> res_rfe <- rfe_sepp(etrain_r, etest_r, cycs, ratio, nu)

> etrain_r <- res_rfe$traind> etest_r <- res_rfe$testd

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

Page 32: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

32 1 Analysis of Gene Expression Data

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 14 3 17NEG(true) 9 13 22total(pred) 23 16 39

> colnames(etest_r)

[1] "32589_at" "34878_at" "35831_at" "40990_at" "33291_at"[6] "35140_at" "2031_s_at" "32961_at" "35166_at" "1599_at"

[11] "37964_at" "41187_at" "34609_g_at" "38353_at" "33827_at"[16] "39800_s_at" "36012_at" "35341_at" "33362_at" "33447_at"[21] "38693_at" "31786_at" "34237_at" "33865_at" "37304_at"[26] "37044_at" "36577_at" "37659_at" "38385_at" "39142_at"[31] "38223_at" "36008_at" "33244_at" "36968_s_at" "36131_at"[36] "1097_s_at" "38077_at" "34777_at" "33232_at" "37902_at"[41] "37493_at" "1160_at" "36199_at" "40049_at" "39837_s_at"[46] "32621_at" "32168_s_at" "529_at" "1044_s_at" "34813_at"[51] "32562_at" "1467_at" "1638_at" "36313_at" "40019_at"[56] "36543_at" "37015_at" "38052_at" "35916_s_at" "1971_g_at"[61] "32542_at" "36002_at" "39109_at" "39696_at" "38336_at"[66] "38487_at" "37306_at" "36144_at" "39929_at" "39614_at"[71] "40971_at" "40423_at" "39021_at" "38424_at" "34808_at"[76] "35973_at" "38797_at" "39967_at" "34780_at" "1635_at"[81] "1249_at" "33891_at" "34714_at" "41784_at" "38256_s_at"[86] "34990_at" "40832_s_at" "33392_at" "40928_at" "32134_at"[91] "39793_at" "31763_at" "33895_at" "38684_at" "39363_at"[96] "39884_g_at" "34311_at" "763_at" "34074_s_at" "38207_at"

[101] "36093_at" "1135_at" "37965_at" "40876_at" "37403_at"[106] "38570_at" "38096_f_at" "32035_at" "1717_s_at" "32195_at"[111] "425_at" "35718_at" "1185_at" "404_at" "36624_at"[116] "35303_at" "36412_s_at" "33304_at" "232_at" "33412_at"[121] "963_at" "37377_i_at" "36493_at" "37280_at" "40117_at"[126] "36599_at" "41138_at" "37283_at" "37716_at" "33284_at"[131] "32207_at" "37014_at" "879_at" "39072_at" "37724_at"[136] "40571_at" "40634_at" "41235_at" "34865_at" "37544_at"[141] "544_at" "39089_at" "39790_at" "35320_at" "40584_at"[146] "39379_at" "1884_s_at" "35872_at" "33705_at" "37676_at"[151] "36829_at" "39175_at" "36502_at" "32210_at" "37392_at"[156] "36943_r_at" "37328_at" "37326_at" "525_g_at" "33543_s_at"[161] "38323_at" "35583_at" "35823_at" "36898_r_at" "41015_at"[166] "37506_at" "40167_s_at" "40916_at" "41490_at" "828_at"[171] "33716_at" "36117_at" "37038_at" "32963_s_at" "32755_at"[176] "38265_at" "41683_i_at" "36766_at" "38311_at" "33668_at"

Page 33: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 33

[181] "33424_at" "36676_at" "38087_s_at" "39338_at" "36083_at"[186] "37147_at" "36674_at" "34826_at" "34283_at" "33383_f_at"[191] "1303_at" "41352_at" "40290_f_at" "38997_at" "39070_at"[196] "39436_at" "39708_at" "40075_at" "40202_at" "142_at"[201] "39779_at" "36605_at" "32025_at" "33873_at" "946_at"[206] "39729_at" "35699_at" "41097_at" "32806_at" "33852_at"[211] "38631_at" "1710_s_at" "442_at" "36591_at" "33300_at"[216] "32706_at" "37899_at" "38705_at" "36653_g_at" "40103_at"[221] "1674_at" "40569_at" "932_i_at" "32628_at" "32139_at"[226] "35368_at" "32034_at" "36958_at" "649_s_at" "38656_s_at"[231] "37027_at" "39045_at" "37162_at" "32186_at" "40347_at"[236] "40127_at" "37001_at" "34472_at" "32980_f_at" "34964_at"[241] "33774_at" "39317_at" "38182_at" "38864_at" "106_at"[246] "41592_at" "39329_at" "41189_at" "35331_at" "40506_s_at"[251] "1825_at" "41549_s_at" "32263_at" "40490_at" "35162_s_at"[256] "39762_at" "32967_at" "33206_at" "32394_s_at" "35694_at"[261] "36802_at" "40775_at" "1107_s_at" "37003_at" "40279_at"[266] "37645_at" "40051_at" "38116_at" "37598_at" "37363_at"[271] "35633_at" "35352_at" "34801_at"

> svmrbf1_model

Support Vector Machine object of class "ksvm"

SV type: nu-svc (classification)parameter : nu = 0.2

Linear (vanilla) kernel function.

Number of Support Vectors : 36

Objective Function Value : 0.0266Training error : 0

> ratio <- 0.95> cycs <- 44> etrain_r <- etrain> etest_r <- etest> nu = 0.2

> res_rfe <- rfe_sepp(etrain_r, etest_r, cycs, ratio, nu)

> etrain_r <- res_rfe$traind> etest_r <- res_rfe$testd

Page 34: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

34 1 Analysis of Gene Expression Data

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 13 4 17NEG(true) 9 13 22total(pred) 22 17 39

> colnames(etest_r)

[1] "34878_at" "35831_at" "40990_at" "33291_at" "35140_at"[6] "2031_s_at" "32961_at" "35166_at" "1599_at" "37964_at"

[11] "41187_at" "38353_at" "36012_at" "35341_at" "33362_at"[16] "33447_at" "38693_at" "31786_at" "33865_at" "37304_at"[21] "37044_at" "36577_at" "37659_at" "38385_at" "38223_at"[26] "36008_at" "33244_at" "36131_at" "1097_s_at" "38077_at"[31] "33232_at" "37902_at" "1160_at" "36199_at" "40049_at"[36] "39837_s_at" "32621_at" "32168_s_at" "529_at" "1044_s_at"[41] "34813_at" "32562_at" "1467_at" "1638_at" "36313_at"[46] "40019_at" "36543_at" "37015_at" "38052_at" "1971_g_at"[51] "32542_at" "36002_at" "39109_at" "39696_at" "38487_at"[56] "37306_at" "36144_at" "39929_at" "39614_at" "40971_at"[61] "40423_at" "39021_at" "38424_at" "34808_at" "39967_at"[66] "34780_at" "1635_at" "1249_at" "33891_at" "34714_at"[71] "41784_at" "34990_at" "40832_s_at" "33392_at" "40928_at"[76] "32134_at" "39793_at" "31763_at" "33895_at" "38684_at"[81] "39884_g_at" "34311_at" "38207_at" "37965_at" "40876_at"[86] "37403_at" "38570_at" "38096_f_at" "32035_at" "32195_at"[91] "425_at" "35718_at" "1185_at" "36624_at" "35303_at"[96] "36412_s_at" "33304_at" "33412_at" "37377_i_at" "37280_at"

[101] "36599_at" "37283_at" "33284_at" "37014_at" "879_at"[106] "39072_at" "37724_at" "40571_at" "40634_at" "41235_at"[111] "34865_at" "37544_at" "39089_at" "39790_at" "40584_at"[116] "39379_at" "35872_at" "33705_at" "37676_at" "39175_at"[121] "36502_at" "32210_at" "37392_at" "36943_r_at" "37328_at"[126] "37326_at" "33543_s_at" "38323_at" "35583_at" "35823_at"[131] "41015_at" "37506_at" "40167_s_at" "40916_at" "41490_at"[136] "828_at" "33716_at" "36117_at" "37038_at" "32963_s_at"

Page 35: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 35

[141] "32755_at" "38265_at" "36766_at" "38311_at" "33668_at"[146] "33424_at" "36676_at" "38087_s_at" "39338_at" "36083_at"[151] "37147_at" "33383_f_at" "1303_at" "41352_at" "40290_f_at"[156] "38997_at" "39070_at" "39436_at" "39708_at" "40075_at"[161] "40202_at" "142_at" "39779_at" "36605_at" "33873_at"[166] "946_at" "39729_at" "35699_at" "33852_at" "38631_at"[171] "36591_at" "33300_at" "37899_at" "38705_at" "36653_g_at"[176] "40103_at" "1674_at" "40569_at" "32628_at" "32139_at"[181] "35368_at" "36958_at" "649_s_at" "38656_s_at" "37027_at"[186] "39045_at" "37162_at" "32186_at" "40347_at" "40127_at"[191] "37001_at" "34472_at" "32980_f_at" "34964_at" "33774_at"[196] "39317_at" "38182_at" "38864_at" "41592_at" "39329_at"[201] "41189_at" "35331_at" "40506_s_at" "1825_at" "41549_s_at"[206] "32263_at" "35162_s_at" "39762_at" "33206_at" "32394_s_at"[211] "35694_at" "36802_at" "40775_at" "1107_s_at" "40279_at"[216] "37645_at" "40051_at" "38116_at" "37598_at" "35352_at"[221] "34801_at"

> svmrbf1_model

Support Vector Machine object of class "ksvm"

SV type: nu-svc (classification)parameter : nu = 0.2

Linear (vanilla) kernel function.

Number of Support Vectors : 35

Objective Function Value : 0.0304Training error : 0

3) LASSO regression and LARS.

Now we perform feature selection and regression by LASSO, Least Angle Regression(LARs), and Infinitesimal Forward Stagewise regression models.

> lars_res <- lars(etrain, ytrain, use.Gram = FALSE, type = "lar")> ab <- summary(lars_res)> last <- length(ab[, 1])

Fig. 8 shows the results of Least Angle Regression (LARs) as a solution path. This solu-tion path can be used for feature selection.

> plot(lars_res)

Page 36: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

36 1 Analysis of Gene Expression Data

******************************************** * ***** * ** * ** * * **

0.0 0.2 0.4 0.6 0.8 1.0

−4

−2

02

4

|beta|/max|beta|

Sta

ndar

dize

d C

oeffi

cien

ts

******************************************** * ***** * ** * ** * * **

*****************************************

**** *****

* **

*

***

***

******************************************** * ***** * ** * ** * * ********************************************** * ***** * ** * ** * * **

******************************************** * ***** * ** * ** * * **

******************************************** * ****** **

***

**

**

******************************************** * ***** * ** * ** * * **

********************************************* *****

* ***

***

***

******************************************** * ***** * ** * ** * * ********************************************** * ***** * ** * ** * * ********************************************** * ***** * ** * ** * * ******

**************************************** * ***** * ** * ** * * **********************************************

* ***** * ** * ** * * ********************************************** * ***** * **

***

* ***

****

****************************************

* ***** * ***

** * * **

******************************************** * ***** * ** * ** * * ********************************************** * ***** * ** * ** * * **

******************************************** * ***** * **

*** * * **

******************************************** * ***** * ** * ** * * ********************************************** * ***** * ** * ** * * **********

****************

******************** * ***** * ** * ** * * **

******************************************** * ****** **

***

**

**

********************************************* *****

* ***

*** *

**

******************************************** * ***** * ** *** * * **

******************************************** * ***** * ** * ** * * **

******************************************** * ***** * ** * ** * * **

******************************************** * ***** * ** * ** * * **

********************************************* ***** * ** * ** * * *******************************

*************** * ***** * ** * ** * * **

******************************************** * ****** **

***

**

**

**************************

******************* *****

* ***

***

***

********************************************* ***** * ** * ** * * **

********************************************* *****

* **

*

***

***

*************************************

******** ***** * ** * ** * * **

******************************************** * ***** * ** * ** * * **

*****************************

**********

*****

****** * **

***

* ***

********************************************* ***** * **

*** * * **

******************************************** * ***** * ** * ** * * **

LAR

298

2026

677

724

1702

649

1795

0 6 22 27 33 36 37 37 37 37 38

Figure 8: Results of Least Angle Regression (LARs) as a solution path which is used forfeature selection.

Page 37: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 37

> cl <- coef(lars_res)> as <- which(cl[last, ] != 0)> as

1599_at 33362_at 31786_at 33865_at 38385_at 36008_at 33244_at76 139 150 171 217 233 243

36131_at 33232_at 40049_at 1467_at 40019_at 37015_at 39929_at267 298 339 425 439 444 588

38797_at 1635_at 40928_at 40831_at 37403_at 879_at 33705_at649 677 724 725 839 1089 1209

36502_at 40955_at 38323_at 32165_at 39436_at 40202_at 142_at1240 1269 1291 1544 1592 1626 1627

38631_at 36591_at 36653_g_at 37027_at 32186_at 33774_at 39317_at1683 1702 1729 1795 1831 1873 1877

38182_at 35162_s_at 39762_at 1107_s_at1889 2025 2026 2107

> ac <- cl[last, which(cl[last, ] != 0)]> ac

1599_at 33362_at 31786_at 33865_at 38385_at-0.0618439068 -0.3459363140 0.9428169253 0.0990451265 0.3118650924

36008_at 33244_at 36131_at 33232_at 40049_at-0.1348736113 -1.1339255534 -0.2944958709 -1.2579586646 0.0172597580

1467_at 40019_at 37015_at 39929_at 38797_at-0.1015055560 0.0001525672 -0.0231442394 0.0394011422 0.6371631054

1635_at 40928_at 40831_at 37403_at 879_at-0.4160185674 -0.0834372727 -0.2377418918 -0.5777919700 -0.0415393411

33705_at 36502_at 40955_at 38323_at 32165_at-0.1356399992 0.7852543533 0.7424980342 0.4006435518 -0.5101923493

39436_at 40202_at 142_at 38631_at 36591_at-0.3084776079 -0.1790726998 -0.5610535923 -0.0655593514 0.2521319798

36653_g_at 37027_at 32186_at 33774_at 39317_at-0.5837221419 1.2964262178 0.2940397507 1.2476742154 0.6329767404

38182_at 35162_s_at 39762_at 1107_s_at-0.1638327079 0.9433938423 -0.9775439094 -0.0292518489

> etrain_r <- etrain[, as]> etest_r <- etest[, as]

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Page 38: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

38 1 Analysis of Gene Expression Data

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 14 3 17NEG(true) 5 17 22total(pred) 19 20 39

> as <- which(cl[4, ] != 0)> as

37015_at 1635_at 37027_at444 677 1795

> ac <- cl[4, which(cl[4, ] != 0)]> ac

37015_at 1635_at 37027_at0.19051061 0.20125239 0.06960437

> etrain_r <- etrain[, as]> etest_r <- etest[, as]

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 2 20 22total(pred) 17 22 39

> as <- which(cl[10, ] != 0)> as

Page 39: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 39

33232_at 37015_at 1635_at 37403_at 33705_at 36502_at 40202_at 37027_at298 444 677 839 1209 1240 1626 1795

38182_at1889

> ac <- cl[10, which(cl[10, ] != 0)]> ac

33232_at 37015_at 1635_at 37403_at 33705_at 36502_at0.04679555 0.29836544 0.36708690 0.06386749 -0.01398280 0.15366319

40202_at 37027_at 38182_at0.07369897 0.17890755 -0.04120743

> etrain_r <- etrain[, as]> etest_r <- etest[, as]

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 16 1 17NEG(true) 5 17 22total(pred) 21 18 39

> as <- which(cl[3, ] != 0)> as

37015_at 1635_at444 677

> ac <- cl[3, which(cl[3, ] != 0)]> ac

37015_at 1635_at0.1549612 0.1451858

Page 40: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

40 1 Analysis of Gene Expression Data

> etrain_r <- etrain[, as]> etest_r <- etest[, as]

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 7 10 17NEG(true) 5 17 22total(pred) 12 27 39

> xs <- scaling(svmrbf1_model)> etest_nn <- etest_r> etest_nn[, 1] <- (etest_r[, 1] - xs$x.scale$`scaled:center`[[1]]) *+ xs$x.scale$`scaled:scale`[[1]]> etest_nn[, 2] <- (etest_r[, 2] - xs$x.scale$`scaled:center`[[2]]) *+ xs$x.scale$`scaled:scale`[[2]]

Fig. 9 shows the two first ranked LARS features in a 2-dimensional plot.

> plot(svmrbf1_model, data = etrain_r)> points(etest_nn[which(ytest < 0), ], pch = 24, col = "green")> points(etest_nn[which(ytest > 0), ], pch = 21, col = "green")

Page 41: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 41

−5

0

5

−0.5 0.0 0.5 1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

● ●

SVM classification plot

1635_at

3701

5_at

●●

●● ●●

Figure 9: The two first ranked LARS features in a 2-dimensional plot.

Page 42: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

42 1 Analysis of Gene Expression Data

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.1), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 4 18 22total(pred) 19 20 39

> xs <- scaling(svmrbf1_model)> etest_nn <- etest_r> etest_nn[, 1] <- (etest_r[, 1] - xs$x.scale$`scaled:center`[[1]]) *+ xs$x.scale$`scaled:scale`[[1]]> etest_nn[, 2] <- (etest_r[, 2] - xs$x.scale$`scaled:center`[[2]]) *+ xs$x.scale$`scaled:scale`[[2]]

Fig. 10 shows the two first ranked LARS features in a 2-dimensional plot for this SVM.

> plot(svmrbf1_model, data = etrain_r)> points(etest_nn[which(ytest < 0), ], pch = 24, col = "green")> points(etest_nn[which(ytest > 0), ], pch = 21, col = "green")

Page 43: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 43

−2

−1

0

1

2

3

4

5

−0.5 0.0 0.5 1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

● ●

SVM classification plot

1635_at

3701

5_at

●●

●● ●●

Figure 10: For another SVM, the two first ranked LARS features in a 2-dimensional plot.

Page 44: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

44 1 Analysis of Gene Expression Data

> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "rbfdot",+ kpar = list(sigma = 0.01), nu = 0.2)> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 15 2 17NEG(true) 4 18 22total(pred) 19 20 39

> xs <- scaling(svmrbf1_model)> etest_nn <- etest_r> etest_nn[, 1] <- (etest_r[, 1] - xs$x.scale$`scaled:center`[[1]]) *+ xs$x.scale$`scaled:scale`[[1]]> etest_nn[, 2] <- (etest_r[, 2] - xs$x.scale$`scaled:center`[[2]]) *+ xs$x.scale$`scaled:scale`[[2]]

Fig. 11 shows the two first ranked LARS features in a 2-dimensional plot for this SVM.

> plot(svmrbf1_model, data = etrain_r)> points(etest_nn[which(ytest < 0), ], pch = 24, col = "green")> points(etest_nn[which(ytest > 0), ], pch = 21, col = "green")

Page 45: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 45

0

5

10

15

−0.5 0.0 0.5 1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

● ●

SVM classification plot

1635_at

3701

5_at

●●

●● ●●

Figure 11: For another SVM, the two first ranked LARS features in a 2-dimensional plot.

Page 46: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

46 1 Analysis of Gene Expression Data

Mow we want to interpret the selected features with respect to their biological role.

> as <- which(cl[10, ] != 0)> as

33232_at 37015_at 1635_at 37403_at 33705_at 36502_at 40202_at 37027_at298 444 677 839 1209 1240 1626 1795

38182_at1889

> ac <- cl[10, which(cl[10, ] != 0)]> ac

33232_at 37015_at 1635_at 37403_at 33705_at 36502_at0.04679555 0.29836544 0.36708690 0.06386749 -0.01398280 0.15366319

40202_at 37027_at 38182_at0.07369897 0.17890755 -0.04120743

> nna <- unique(unlist(mget(names(ac), env = hgu95av2GENENAME,+ ifnotfound = NA)))> nna

[1] "cysteine-rich protein 1 (intestinal)"[2] "aldehyde dehydrogenase 1 family, member A1"[3] "c-abl oncogene 1, receptor tyrosine kinase"[4] "annexin A1"[5] "phosphodiesterase 4B, cAMP-specific (phosphodiesterase E4 dunce homolog, Drosophila)"[6] "PFTAIRE protein kinase 1"[7] "Kruppel-like factor 9"[8] "AHNAK nucleoprotein"[9] "chromosome Y open reading frame 15B"

"c-abl oncogene 1, receptor tyrosine kinase" is the ABL1 gene, which is supposed to beoverexpressed in BCR/ABL patients.

Due to the development of novel drugs able to target the enhanced tyrosine kinase activ-ity of BCR-ABL. The first of these therapies is Imatinib Mesylate (Gleevec) which hasbecome the first line therapy for all patients with CML.

> nna1 <- unique(unlist(mget(rf.vs1$selected.vars, env = hgu95av2GENENAME,+ ifnotfound = NA)))> nna1

Page 47: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 47

[1] "CD19 molecule"[2] "c-abl oncogene 1, receptor tyrosine kinase"[3] "PR domain containing 2, with ZNF domain"[4] "four and a half LIM domains 1"[5] "brain abundant, membrane attached signal protein 1"[6] "purinergic receptor P2Y, G-protein coupled, 14"[7] "caspase 8, apoptosis-related cysteine peptidase"[8] "PRP6 pre-mRNA processing factor 6 homolog (S. cerevisiae)"[9] "PFTAIRE protein kinase 1"

[10] "aldehyde dehydrogenase 1 family, member A1"[11] "AHNAK nucleoprotein"[12] "inhibitor of DNA binding 3, dominant negative helix-loop-helix protein"[13] "centaurin, beta 1"[14] "non-metastatic cells 4, protein expressed in"[15] "metallothionein 1X"[16] "glucose phosphate isomerase"[17] "S100 calcium binding protein A13"[18] "tumor protein D52-like 2"[19] "voltage-dependent anion channel 1"[20] "Kruppel-like factor 9"[21] "myxovirus (influenza virus) resistance 2 (mouse)"

Intersection of random forest and LASSO gene selection:

> intersect(nna1, nna)

[1] "c-abl oncogene 1, receptor tyrosine kinase"[2] "PFTAIRE protein kinase 1"[3] "aldehyde dehydrogenase 1 family, member A1"[4] "AHNAK nucleoprotein"[5] "Kruppel-like factor 9"

Next we want to plot the GO graph for the selected features.

> fsall <- unlist(mget(names(ac), hgu95av2ENTREZID))> fsall <- as.character(fsall)> fsallGO <- makeGOGraph(fsall, "MF", chip = "hgu95av2.db")> nAgo <- makeNodeAttrs(fsallGO, shape = "ellipse", label = substr(nodes(fsallGO),+ 7, 10))> cache(gGopen <- agopen(fsallGO, recipEdges = "distinct", layoutType = "neato",+ nodeAttrs = nAgo, name = ""))

Fig. 12 shows the GO graph with ENTREZ names which are basically only numbers.

> plot(gGopen)

Page 48: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

48 1 Analysis of Gene Expression Data

5515

3677

6872

8270

3676

5488

3169

6914

3674

3167 1758

4029

5099

5497

6491

3824

5083

5096

5496

66202562

8047

8289

6903

0695

0234

0589

0166

0287

4715

5524

8022

6740

0145

4713

2559

4672

05542555

1883

6301

6773

70762553

1882

6772

1948

28021015

5080

5516

3779

9901

8092

9900

9899

46934674

4063

4064

6787

4091

6791

6788

2578

3700

0528

4872

8503

5485

4871

0594

2934

2165

0089

Figure 12: GO graph with ENTREZ names (numbers) for the selected features.

Page 49: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 49

We also want to know whether there were gene set enrichments in the selected features.That means we check whether the selected features belong to certain biological categoriesor participate in the same pathway or have similar molecular function or are found insimilar locations in the cell.

We choose molecular function “MF” in our test.

> gNsLL <- unique(unlist(mget(names(ac), env = hgu95av2ENTREZID,+ ifnotfound = NA)))> gparams <- new("GOHyperGParams", geneIds = gNsLL, universeGeneIds = NULL,+ annotation = "hgu95av2", pvalueCutoff = 0.2, ontology = "MF",+ testDirection = "over")> gGhyp <- hyperGTest(gparams)> aa <- summary(gGhyp)> aa

GOMFID Pvalue OddsRatio ExpCount Count Size1 GO:0001758 0.002006395 1137.428571 0.002007276 1 22 GO:0019834 0.003008271 568.642857 0.003010915 1 33 GO:0005497 0.004009266 379.047619 0.004014553 1 44 GO:0004115 0.005009383 284.250000 0.005018191 1 55 GO:0004029 0.008004461 162.367347 0.008029106 1 86 GO:0004859 0.009001066 142.053571 0.009032744 1 97 GO:0055102 0.010991648 113.614286 0.011040020 1 118 GO:0005544 0.017931160 66.773109 0.018065487 1 189 GO:0016620 0.017931160 66.773109 0.018065487 1 1810 GO:0004114 0.020892169 56.735714 0.021076402 1 2111 GO:0004112 0.021877435 54.027211 0.022080040 1 2212 GO:0004693 0.024828029 47.255952 0.025090955 1 2513 GO:0016903 0.024828029 47.255952 0.025090955 1 2514 GO:0042562 0.025809828 45.360000 0.026094593 1 2615 GO:0004715 0.034607173 33.315126 0.035127337 1 3516 GO:0008289 0.035287756 8.398026 0.307113286 2 30617 GO:0005099 0.046228793 24.586957 0.047170995 1 4718 GO:0008081 0.057727813 19.470443 0.059214653 1 5919 GO:0004672 0.059882766 6.188370 0.410488019 2 40920 GO:0030674 0.068161828 16.343685 0.070254673 1 7021 GO:0016773 0.078131322 5.278365 0.476728140 2 47522 GO:0005496 0.084088300 13.084718 0.087316522 1 8723 GO:0008022 0.084088300 13.084718 0.087316522 1 8724 GO:0030234 0.084294871 5.039811 0.497804541 2 49625 GO:0016301 0.089401477 4.861057 0.514866391 2 51326 GO:0030145 0.090576706 12.089094 0.094341990 1 9427 GO:0005543 0.096106149 11.347763 0.100363819 1 100

Page 50: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

50 1 Analysis of Gene Expression Data

28 GO:0016772 0.111772134 4.227377 0.586124702 2 58429 GO:0005096 0.118830133 9.031106 0.125454774 1 12530 GO:0046872 0.124796832 2.882496 2.062476477 4 205531 GO:0046914 0.127425289 3.092272 1.301718730 3 129732 GO:0004713 0.128669260 8.283598 0.136494794 1 13633 GO:0043169 0.129035324 2.839441 2.085560156 4 207834 GO:0043167 0.138122840 2.752592 2.133734789 4 212635 GO:0005083 0.139293091 7.595724 0.148538452 1 14836 GO:0004857 0.156747736 6.668948 0.168611216 1 16837 GO:0008047 0.182348979 5.631617 0.198720361 1 19838 GO:0042578 0.199862705 5.075360 0.219796763 1 219

Term1 retinal dehydrogenase activity2 phospholipase A2 inhibitor activity3 androgen binding4 3',5'-cyclic-AMP phosphodiesterase activity5 aldehyde dehydrogenase (NAD) activity6 phospholipase inhibitor activity7 lipase inhibitor activity8 calcium-dependent phospholipid binding9 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor10 3',5'-cyclic-nucleotide phosphodiesterase activity11 cyclic-nucleotide phosphodiesterase activity12 cyclin-dependent protein kinase activity13 oxidoreductase activity, acting on the aldehyde or oxo group of donors14 hormone binding15 non-membrane spanning protein tyrosine kinase activity16 lipid binding17 Ras GTPase activator activity18 phosphoric diester hydrolase activity19 protein kinase activity20 protein binding, bridging21 phosphotransferase activity, alcohol group as acceptor22 steroid binding23 protein C-terminus binding24 enzyme regulator activity25 kinase activity26 manganese ion binding27 phospholipid binding28 transferase activity, transferring phosphorus-containing groups29 GTPase activator activity30 metal ion binding31 transition metal ion binding

Page 51: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 51

32 protein tyrosine kinase activity33 cation binding34 ion binding35 small GTPase regulator activity36 enzyme inhibitor activity37 enzyme activator activity38 phosphoric ester hydrolase activity

In the graph we want to color the significant genes, that is genes which are enriched incertain molecular functions.

> gGopenP <- gGopen> agnd <- AgNode(gGopenP)> for (i in seq(along = agnd)) {+ nm <- name(agnd[[i]])+ mt <- match(nm, aa[[1]])+ if (!is.na(mt)) {+ if (aa[[2]][mt] < 0.05) {+ agnd[[i]]@fillcolor <- "#e31a1c"+ }+ else {+ agnd[[i]]@fillcolor <- "#edf8fb"+ }+ agnd[[i]]@txtLabel@labelText <- paste(signif(aa[[2]][mt],+ 2))+ }+ else {+ agnd[[i]]@fillcolor <- "white"+ agnd[[i]]@txtLabel@labelText <- ""+ }+ }> gGopenP@AgNode <- agnd> gc()

used (Mb) gc trigger (Mb) max used (Mb)Ncells 6593424 176.1 10591793 282.9 10591793 282.9Vcells 29026666 221.5 76040177 580.2 76031529 580.1

Fig. 13 shows the GO graph with gene set enrichment for molecular functions colored.

> plot(gGopenP)

Page 52: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

52 1 Analysis of Gene Expression Data

0.12

0.13

0.13

0.14 0.002

0.008

0.046

0.004

0.14

0.12

0.084

0.0180.026

0.18

0.035

0.025

0.084

0.035

0.084

0.091

0.13

0.06

0.089

0.078

0.110.025

0.2

Figure 13: GO graph with gene set enrichment for molecular functions colored.

Page 53: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 53

4) New methods: Optimal Brain Surgeon (OBS) on the Fisher information matrix forleast square regression and logistic regression.

Appendix A gives the theory for the new feature selection methods.

Here the procedure for least square written in R .

> ls_obs_sepp <- function(X, y, eps = 0.001, ratio = 0.99) {+ library("MASS")+ l <- length(y)+ if (l != nrow(X)) {+ if (l == ncol(X)) {+ cat("Size of X and y do not match: I use X transposed.",+ "\n")+ X <- t(X)+ }+ else {+ cat("Stop: size of X and y do not match. ", "\n")+ return()+ }+ }+ p <- ncol(X)+ pp <- p+ tX <- t(X)+ xy <- tX %*% y+ val <- rep(p, 0)+ rank <- 1:p+ irank <- 1:p+ if (p > l) {+ FIM <- 1/eps * (diag(p) - tX %*% solve(eps * diag(l) ++ X %*% tX) %*% X)+ }+ else {+ FIM <- solve(eps * diag(p) + tX %*% X)+ }+ i = p+ bml <- FIM %*% xy+ while (i > 1) {+ a <- bml * bml/diag(FIM)+ rat <- max(1, ceiling(i * (1 - ratio)))+ ma <- max(a)+ wa <- rank[which.max(a)]+ rem <- rep(0, rat)+ for (t in 1:rat) {+ re <- which.min(a)

Page 54: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

54 1 Analysis of Gene Expression Data

+ if (re < i) {+ help <- rank[re]+ for (j in re:(i - 1)) {+ rank[j] <- rank[j + 1]+ }+ rank[i] <- help+ irank[i] <- wa+ }+ rem[t] <- re+ val[i] <- min(a)+ cat(i, "\n")+ i <- i - 1+ a[re] <- ma+ }+ if (i == 1) {+ val[1] <- max(a)+ cat("1\n")+ }+ else {+ xy <- xy[-rem]+ u <- as.matrix(FIM[, rem])+ s <- as.matrix(u[rem, ])+ u <- as.matrix(u[-rem, ])+ FIM <- FIM[-rem, -rem]+ is <- solve(s)+ FIM <- FIM - u %*% is %*% t(u)+ bml <- FIM %*% xy+ }+ }+ return(list(rank = rank, val = val, irank = irank))+ }

This procedure is not used to select the features.

> res <- ls_obs_sepp(etrain, ytrain, eps = 100, ratio = 0.99)

> sel <- res$rank[1:2]> etrain_r <- etrain[, sel]> etest_r <- etest[, sel]> rownames(etrain) <- 1:40

Fig. 14 shows the feature selection results of OBS on the Fisher information matrix bythe first two selected genes.

Page 55: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 55

●●

−0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0

−2.

0−

1.5

−1.

0−

0.5

0.0

0.5

1.0

37015_at

3810

5_at

Figure 14: The first two selected genes of OBS on the Fisher information matrix for leastsquare fit.

> plot(etrain_r)> points(etrain_r[which(ytrain < 0), ], pch = 24, col = "red")> points(etrain_r[which(ytrain > 0), ], pch = 21, col = "blue")

Next let us make the classification with an SVM.

Page 56: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

56 1 Analysis of Gene Expression Data

> sel <- res$rank[1:10]> colnames(etrain[, sel])

[1] "37015_at" "38105_at" "31786_at" "37014_at" "36808_at"[6] "35831_at" "32775_r_at" "35300_at" "41322_s_at" "1635_at"

> sel <- res$rank[1:4]> etrain_r <- etrain[, sel]> etest_r <- etest[, sel]> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 14 3 17NEG(true) 7 15 22total(pred) 21 18 39

Here the procedure for logistic regression written in R .

> lr_obs_sepp <- function(X, y, eps = 0.01, epsg = 0.01, epsf = 0.01,+ ratio = 0.99) {+ library("MASS")+ iepsf <- 1/epsf+ l <- length(y)+ if (l != nrow(X)) {+ if (l == ncol(X)) {+ cat("Size of X and y do not match: I use X transposed.",+ "\n")+ X <- t(X)+ }+ else {+ cat("Stop: size of X and y do not match. ", "\n")+ return()+ }+ }+ p <- ncol(X)+ pp <- p

Page 57: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 57

+ tX <- t(X)+ xy <- tX %*% y+ val <- rep(p, 0)+ rank <- 1:p+ irank <- 1:p+ epsg1 <- 1 - epsg+ if (p > l) {+ FIM <- 1/eps * (diag(p) - tX %*% solve(eps * diag(l) ++ X %*% tX) %*% X)+ bml <- FIM %*% xy+ py <- 1/(1 + exp(-y * (X %*% bml)))+ py[which(py > epsg1)] <- epsg1+ py[which(py < epsg)] <- epsg+ da <- (1 - py) * y+ xy <- tX %*% da+ i = p+ bml <- FIM %*% xy+ while (i > l) {+ a <- bml * bml/diag(FIM)+ rat <- max(1, ceiling(i * (1 - ratio)))+ if (i - rat <= l) {+ rat <- i - l+ }+ ma <- max(a)+ wa <- rank[which.max(a)]+ rem <- rep(0, rat)+ for (t in 1:rat) {+ re <- which.min(a)+ if (re < i) {+ help <- rank[re]+ for (j in re:(i - 1)) {+ rank[j] <- rank[j + 1]+ }+ rank[i] <- help+ }+ rem[t] <- re+ irank[i] <- wa+ val[i] <- min(a)+ cat(i, "\n")+ i <- i - 1+ a[re] <- ma+ }+ bml <- bml[-rem]

Page 58: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

58 1 Analysis of Gene Expression Data

+ X <- X[, -rem]+ tX <- t(X)+ py <- 1/(1 + exp(-y * (X %*% bml)))+ py[which(py > epsg1)] <- epsg1+ py[which(py < epsg)] <- epsg+ da <- (1 - py) * y+ xy <- tX %*% da+ u <- as.matrix(FIM[, rem])+ s <- as.matrix(u[rem, ])+ u <- as.matrix(u[-rem, ])+ FIM <- FIM[-rem, -rem]+ is <- solve(s)+ FIM <- FIM - u %*% is %*% t(u)+ bml <- FIM %*% xy+ }+ pp <- l+ cat("END PART1: ", l, " <-> ", ncol(X), "\n")+ }+ for (i in pp:2) {+ a <- bml * bml/diag(FIM)+ mm <- min(a)+ rem <- which.min(a)+ if (rem < i) {+ help <- rank[rem]+ for (j in rem:(i - 1)) {+ rank[j] <- rank[j + 1]+ }+ rank[i] <- help+ }+ irank[i] <- rank[which.max(a)]+ val[i] <- mm+ cat(i, "\n")+ if (i == 2) {+ val[1] <- max(a)+ cat("1\n")+ irank[1] <- rank[1]+ }+ else {+ bml <- bml[-rem]+ X <- X[, -rem]+ II <- epsf * diag(i - 1)+ tX <- t(X)+ py <- 1/(1 + exp(-y * (X %*% bml)))

Page 59: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 59

+ py[which(py > epsg1)] <- epsg1+ py[which(py < epsg)] <- epsg+ dd <- as.vector(py * (1 - py))+ xy <- tX %*% da+ FIM <- solve(II + tX %*% (X/dd))+ bml <- iepsf * xy+ }+ }+ return(list(rank = rank, val = val, irank = irank))+ }

This procedure is not used to select the features.

> res <- lr_obs_sepp(etrain, ytrain, eps = 100, epsg = 0.001, epsf = 100,+ ratio = 0.99)

> sel <- res$rank[1:2]> etrain_r <- etrain[, sel]> etest_r <- etest[, sel]> rownames(etrain) <- 1:40

Fig. 15 shows the feature selection results of OBS on the Fisher information matrix bythe first two selected genes.

> plot(etrain_r)> points(etrain_r[which(ytrain < 0), ], pch = 24, col = "red")> points(etrain_r[which(ytrain > 0), ], pch = 21, col = "blue")

Again let us check the selection by the classification with an SVM.

> sel <- res$rank[1:10]> colnames(etrain[, sel])

[1] "39793_at" "35912_at" "33944_at" "37558_at" "38738_at"[6] "35300_at" "33543_s_at" "35831_at" "37015_at" "40953_at"

> sel <- res$rank[1:4]> etrain_r <- etrain[, sel]> etest_r <- etest[, sel]> svmrbf1_model = ksvm(etrain_r, ytrain, type = "nu-svc", kernel = "vanilladot",+ nu = 0.2)

Page 60: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

60 1 Analysis of Gene Expression Data

●●

●●

●● ●

●●

● ●●

●●

● ●

● ●

−1.0 −0.5 0.0 0.5 1.0

01

23

39793_at

3591

2_at

●●

● ●

● ●

Figure 15: The first two selected genes of OBS on the Fisher information matrix forlogistic regression.

Page 61: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 61

Setting default kernel parameters

> svmrbf1_predict = predict(svmrbf1_model, etest_r)> confusion_sepp(ytest, svmrbf1_predict, c("BCR/ABL", "NEG"))

BCR/ABL(pred) NEG(pred) total(true)BCR/ABL(true) 11 6 17NEG(true) 2 20 22total(pred) 13 26 39

Page 62: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

62 1 Analysis of Gene Expression Data

Page 63: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 63

1.2 Tumorigenic Breast-Cancer Cells

Article “The New England Journal of Medicine”:

The Prognostic Role of a Gene Signature from Tumorigenic Breast-CancerCells,Rui Liu, Ph.D., Xinhao Wang, Ph.D., Grace Y. Chen, M.D., Ph.D., PieroDalerba, M.D., Austin Gurney, Ph.D., Timothy Hoey, Ph.D., Gavin Sherlock,Ph.D., John Lewicki, Ph.D., Kerby Shedden, Ph.D., and Michael F. Clarke,M.D.,The New England Journal of Medicine,January 18, 2007 vol. 356 no. 3.

Background:Breast cancers contain a minority population of cancer cells characterized by CD44 ex-pression but low or undetectable levels of CD24 (CD44+CD24-/low) that have highertumorigenic capacity than other subtypes of cancer cells. Methods We compared thegene-expression profile of CD44+CD24-/low tumorigenic breastcancer cells with that ofnormal breast epithelium. Differentially expressed genes were used to generate a 186-gene “invasiveness” gene signature (IGS), which was evaluated for its association withoverall survival and metastasis-free survival in patients with breast cancer or other typesof cancer.

Results:There was a significant association between the IGS and both overall and metastasis-free survival (P<0.001, for both) in patients with breast cancer, which was independentof established clinical and pathological variables. When combined with the prognosticcriteria of the National Institutes of Health, the IGS was used to stratify patients withhigh-risk early breast cancer into prognostic categories (good or poor); among patientswith a good prognosis, the 10-year rate of metastasis-free survival was 81%, and amongthose with a poor prognosis, it was 57%. The IGS was also associated with the prognosisin medulloblastoma (P = 0.004), lung cancer (P = 0.03), and prostate cancer (P = 0.01).The prognostic power of the IGS was increased when combined with the wound-response(WR) signature.

Conclusions:The IGS is strongly associated with metastasis-free survival and overall survival for fourdifferent types of tumors. This genetic signature of tumorigenic breast-cancer cells waseven more strongly associated with clinical outcomes when combined with the WR sig-nature in breast cancer.

In this paper following gene signature has been found:

Page 64: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

64 1 Analysis of Gene Expression Data

Apoptosis DPF2, CASP8, BCL2Calcium-ion binding SCGN, SWAP70, KIAA0276Cell cycle C10orf9, C10orf7, ALKBH, TOB2Cell-surface receptor XPR1, CD59, LRP2Chemotaxis PLP2, MAPK14, CXCL2Collagen catabolism MMP7Differentiation MGP, MLF1, FLNBIon-channel activity SCNM1Membrane protein HSPC163, C5orf18, MGC4399, CDW92, TMC4, ZDHHC2, TICAM2, KDELR3Metabolism GNPDA1, THEM2, DBR1, FLJ90709, FLJ10774, C16orf33, GAPD, LDHA, MR-1, LARS, GTPBP1,

PRSS16, WFDC2, AIM1, DHRS6, DHRS4, MGC15429, MGC45840, ECHDC2, GOLGIN-67,AFURS1, KIAA0436, CYP4V2, JTV1

Methyltransferase ICMT, DNMT3A, HNMT, METTL7A, METTL2Morphology VIL2, TPD52, ARPC5Nucleotide binding NOL8, NSF, RAD23B, SRP54, HSPA2, PBP, THAP2, CIRBP, SNRPN, KIAA0052Phosphatase DUSP10Proliferation SSR1, ERBB4, EMP1, CHPT1, LRPAP1Protein binding FLJ11752, CSTF1, KLHL20, DNAJC13, APLP2, ARGBP2, DNAJB1, NEBL, SH3BGRL, NUDT5,

GABARAPL1, MAPT, DCBLD1Protein kinase STK39, PAK2, CSNK2A1, PILRB, ERN1, SGKL, WEE1, MAST4, C11orf17Protein transport NUP37, CLTC, COPB2, SLC25A25Signal transduction ECOP, PDE8A, STAM, TUBB, SNX6, RAB23, PLAA, STC2, LTFTranscription factor ISGF3G, ATXN3, GTF3C3, GSK3B, KLF10, ELL2, ZBTB20, IRX3, ETS1, SERTAD1, MGC4251,

MAFF, SFPQ, CITED4, CEBPD, EIF4E2Transferase HS2ST1, AGPS, PGK1, ATIC, ETNK1, ALG2Ubiquitination NCE2, MARCH8, CNOT4, RNF8, PSMA5, DPF2Function unknown AMMECR1, KIAA1287, LOC144233, LOC286505, PNAS-4, FLJ20530, THUMPD3, MGC45564,

CAP350, ETAA16, HAN11, DNAPTP6, C7orf25, FLJ37953, FLJ10587, C7orf36, ELP4, NDEL1,NPD014, DKFZP564D172, FAM53C, IER5, LOC255783, KIAA0146, KIAA0792, LOC439994,LOC283481, CG018, LOC130576, NGFRAP1L1, KIAA1217, C4orf7, C21orf86, C9orf64,FLJ13456, KIAA1600, B7-H4, LOC80298, C7orf2, NUCKS, DKFZP566D1346, LOC388279,FLJ31795, C6orf107, FLJ12439, FLJ12806, FLJ39370

We need following libraries:

> library(hgu133ahsentrezg.db)> library(hgu133ahsentrezgcdf)> library(hgu133ahsentrezgprobe)> library(hgu133bhsentrezg.db)> library(hgu133bhsentrezgcdf)> library(hgu133bhsentrezgprobe)> library(affy)> library(farms)> library(multtest)

Page 65: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 65

> library(limma)> library(geneplotter)

Change to the working directory:

> setwd("c:/sepp/work/analyze_transcripts/test1")

Next read the annotation file, where the first column is the CEL file name, the secondcolumn is the label from which we contruct a matrix of labels “lab1” and a matrix ofnames.

> lat <- read.table("annot.txt", header = FALSE, sep = " ")> lat1 <- factor(lat[, 2])> nsamples <- length(lat1)> lev <- levels(lat1)> lev

[1] "Non-tumorigenic" "Normal" "Tumorigenic"

> lv <- length(lev)> lab1 <- matrix(0, nrow = lv, ncol = nsamples)> for (i in 1:lv) {+ lab1[i, which(lat1 == lev[i])] <- 1+ }> names_exp <- as.character(lat[, 1])> lab_names <- list()> for (i in 1:lv) {+ lab_names[[i]] <- names_exp[as.logical(lab1[i, ])]+ }> names(lab_names) <- lev

We first consider the hgu133a arrays and then the complementary arrays hgu133b.

We read the CEL files for further processing.

> rawdata <- ReadAffy(celfile.path = "c:/sepp/work/analyze_transcripts/test1/133a",+ cdfname = "HGU133A_Hs_ENTREZG")

Now we extract the names of the CEL files as they are read in:

> all_names_t <- strsplit(sampleNames(rawdata), "[\\.,_]")> l1 <- length(all_names_t)> all_names <- rep("", l1)> for (i in 1:l1) {+ all_names[i] <- all_names_t[[i]][1]+ }

Page 66: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

66 1 Analysis of Gene Expression Data

Now constract a label matrix:

> lab <- matrix(FALSE, nrow = lv, ncol = l1)> for (i in 1:lv) {+ lab[i, ] <- all_names %in% lab_names[[i]]+ }> rownames(lab) <- lev> colnames(lab) <- all_names

At this step we can start the data analysis. First step is normalization and summarizationof the raw data.

We now use our method called “FARMS” which is a latent variable model for which theparameters are estimated by an EM algorithms.

> farmsdata_t1 <- q.farms(rawdata)

background correction: nonenormalization: quantilesPM/MM correction : pmonlyexpression values: farmsbackground correcting...done.normalizing...done.11979 ids to be processed| ||####################|

The FARMS method also supplies informative/non-informative calls, that is which probesets (genes) have a good signal and which are noisy.

We select the genes which obtained an informative call.

> ini_t1 <- INIcalls(farmsdata_t1)> summary(ini_t1)

SummaryInformative probe sets : 41.65%Non-Informative probe sets : 58.35%

> ini_data_t1 <- getI_Eset(ini_t1)> ini_data_t1

Page 67: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 67

ExpressionSet (storageMode: lockedEnvironment)assayData: 4989 features, 12 samples

element names: exprs, se.exprsphenoData

sampleNames: GSM158635.CEL, GSM158636.CEL, ..., GSM158646.CEL (12 total)varLabels and varMetadata description:

sample: arbitrary numberingfeatureData

featureNames: 10001_at, 10005_at, ..., AFFX-r2-P1-cre-5_at (4989 total)fvarLabels and fvarMetadata description: none

experimentData: use 'experimentData(object)'Annotation: hgu133ahsentrezg

Now let us define the first classification task. We classify “tumor” vs. “normal”.

For this classification task we now define the labels: tumor is either 1 = Non-tumorigenicor 3 = Tumorigenic.

> lv2 <- 2> sel <- c(1, 3)> lab_t <- rep(FALSE, ncol(lab))> for (i in 1:lv2) {+ lab_t <- lab_t | lab[sel[i], ]+ }> labels_t1 <- rep(0, l1)> labels_t1[lab_t] <- 1

We use the limma package to compute p-values and q-values (p-values adjusted for multi-ple testing). Limma does a linear fit and computes the p-values according to the explainedvariance by the single genes. This variance is determined by the linear factor that eachgene obtained by the linear fit.

> design_t1 <- cbind(rep(1, ncol(ini_data_t1)), labels_t1)> colnames(design_t1) <- c("Tumor", "Normal")> rownames(design_t1) <- colnames(ini_data_t1)> Fit_t1 <- lmFit(ini_data_t1, design_t1)> Fit_t1 <- eBayes(Fit_t1)> PValues_t1 <- as.matrix(Fit_t1$p.value[, "Normal"])> colnames(PValues_t1) <- "rawp"> AdjP_t1 <- mt.rawp2adjp(PValues_t1[, "rawp"], proc = c("BH"))> AdjP_t1.ordered <- AdjP_t1$adjp[order(AdjP_t1$index), ]> p_t1.idx <- AdjP_t1$index

Now let us look at the significantly differentially expressed genes. These are the geneswhich differ between tumor and normal.

Page 68: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

68 1 Analysis of Gene Expression Data

> pv1_t1 <- as.matrix(PValues_t1[which(PValues_t1 <= 0.05)])> rownames(pv1_t1) <- rownames(which(PValues_t1 <= 0.05, arr.ind = TRUE))> ss <- length(pv1_t1)> ss

[1] 744

We next plot the significantly differentially expressed genes along the chromosomes.

First we build the chromosome location and color the genes by wether they are overex-pressed or underexpressed in the tumor samples.

> ccla <- buildChromLocation("hgu133ahsentrezg")> i1 <- rownames(pv1_t1)> a1 <- exprs(ini_data_t1)[which(match(rownames(exprs(ini_data_t1)),+ i1) > 0), ]> sii <- rep(0, ss)> for (i in 1:ss) {+ class1 <- median(a1[i, labels_t1])+ class2 <- median(a1[i, !labels_t1])+ sii[i] <- sign(class1 - class2)+ }> re <- i1[which(sii > 0)]> bl <- i1[which(sii < 0)]

> cPlot(ccla)> cColor(re, "red", ccla)> cColor(bl, "blue", ccla)

Fig. 16 shows the plot of over- and underexpressed genes in tumor compared to normalsamples on different chromosomes.

In the next task we look closer at the tumor samples. We want to see what genes are dif-ferentially expressed in tumorigenic samples, that is in samples with low or undetectablelevels of CD24 (CD44+CD24-/low) leading to higher tumorigenic capacity.

We classify “tumorigenic” vs. “non-tumorigenic” (here 1 = Non-tumorigenic and 3 =Tumorigenic). First, we exclude the normals samples.

> lv2 <- 2> sel <- c(1, 3)> exp_t2 <- rep(FALSE, ncol(lab))> labt2_names <- list()> for (i in 1:lv2) {

Page 69: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 69

Homo sapiens

Chr

omos

ome

110111213141516171819

2202122

3456789MXY

Figure 16: Plot of over- and underexpressed genes in tumor compared to normal sampleson different chromosomes.

Page 70: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

70 1 Analysis of Gene Expression Data

+ exp_t2 <- exp_t2 | lab[sel[i], ]+ labt2_names[[i]] <- all_names[lab[sel[i], ]]+ }> t2_names <- all_names[exp_t2]> l2 <- length(t2_names)> lab_t2 <- matrix(FALSE, nrow = lv2, ncol = l2)> for (i in 1:lv2) {+ lab_t2[i, ] <- t2_names %in% labt2_names[[i]]+ }> rownames(lab_t2) <- lev[c(1, 3)]> colnames(lab_t2) <- t2_names

Now we do summarization only on the tumor samples. Therefore genes that are differen-tially expressed between tumor and normal may not obtain an I/NI call.

> farmsdata_t2 <- q.farms(rawdata[exp_t2])

background correction: nonenormalization: quantilesPM/MM correction : pmonlyexpression values: farmsbackground correcting...done.normalizing...done.11979 ids to be processed| ||####################|

> ini_t2 <- INIcalls(farmsdata_t2)> summary(ini_t2)

SummaryInformative probe sets : 45.1%Non-Informative probe sets : 54.9%

> ini_data_t2 <- getI_Eset(ini_t2)> ini_data_t2

ExpressionSet (storageMode: lockedEnvironment)assayData: 5402 features, 9 samples

element names: exprs, se.exprsphenoData

sampleNames: GSM158635.CEL, GSM158636.CEL, ..., GSM158643_1.CEL (9 total)

Page 71: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 71

varLabels and varMetadata description:sample: arbitrary numbering

featureDatafeatureNames: 10001_at, 10005_at, ..., AFFX-r2-P1-cre-5_at (5402 total)fvarLabels and fvarMetadata description: none

experimentData: use 'experimentData(object)'Annotation: hgu133ahsentrezg

We now look whether the genes are different compared to our first task.

> length(featureNames(ini_data_t2))

[1] 5402

> length(featureNames(ini_data_t1))

[1] 4989

> length(intersect(featureNames(ini_data_t1), featureNames(ini_data_t2)))

[1] 4716

We do the labelling for our second task.

> labels_t2 <- rep(0, l2)> labels_t2[lab_t2[2, ]] <- 1

Again we perform the procedure from limma to find differentially expressed genes.

> design_t2 <- cbind(rep(1, ncol(ini_data_t2)), labels_t2)> colnames(design_t2) <- c("Tumorigenic", "Non-tumorigenic")> rownames(design_t2) <- colnames(ini_data_t2)> Fit_t2 <- lmFit(ini_data_t2, design_t2)> Fit_t2 <- eBayes(Fit_t2)> PValues_t2 <- as.matrix(Fit_t2$p.value[, "Non-tumorigenic"])> colnames(PValues_t2) <- "rawp"> AdjP_t2 <- mt.rawp2adjp(PValues_t2[, "rawp"], proc = c("BH"))> AdjP_t2.ordered <- AdjP_t2$adjp[order(AdjP_t2$index), ]> p_t2.idx <- AdjP_t2$index> pv1_t2 <- as.matrix(PValues_t2[which(PValues_t2 <= 0.05)])> rownames(pv1_t2) <- rownames(which(PValues_t2 <= 0.05, arr.ind = TRUE))> ss <- length(pv1_t2)> ss

Page 72: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

72 1 Analysis of Gene Expression Data

[1] 85

We prepare again the chromosome plot for significantly differentially expressed genes.

> ccla <- buildChromLocation("hgu133ahsentrezg")> i1 <- rownames(pv1_t2)> a1 <- exprs(ini_data_t2)[which(match(rownames(exprs(ini_data_t2)),+ i1) > 0), ]> sii <- rep(0, ss)> for (i in 1:ss) {+ class1 <- median(a1[i, labels_t2])+ class2 <- median(a1[i, !labels_t2])+ sii[i] <- sign(class1 - class2)+ }> re <- i1[which(sii > 0)]> bl <- i1[which(sii < 0)]

> cPlot(ccla)> cColor(re, "red", ccla)> cColor(bl, "blue", ccla)

Fig. 17 shows the plot of over- and underexpressed genes in tumorigenic compared tonon-tumorigenic samples on different chromosomes.

We now want to see if there is an overlap between the genes we identified and the signaturefound in the original publication.

> nna <- unique(unlist(mget(i1, env = hgu133ahsentrezgGENENAME,+ ifnotfound = NA)))> nna

[1] "dehydrogenase/reductase (SDR family) member 2"[2] "ATP-binding cassette, sub-family C (CFTR/MRP), member 4"[3] "N-myc downstream regulated 1"[4] "CCAAT/enhancer binding protein (C/EBP), beta"[5] "CDC42 effector protein (Rho GTPase binding) 3"[6] "protein phosphatase 1, regulatory (inhibitor) subunit 13 like"[7] "KDEL (Lys-Asp-Glu-Leu) endoplasmic reticulum protein retention receptor 3"[8] "egl nine homolog 3 (C. elegans)"[9] "protein kinase C, delta binding protein"

[10] "AHNAK nucleoprotein 2"[11] "perilipin 2"[12] "7-dehydrocholesterol reductase"

Page 73: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 73

Homo sapiens

Chr

omos

ome

110111213141516171819

2202122

3456789MXY

Figure 17: Plot of over- and underexpressed genes in tumorigenic compared to non-tumorigenic samples on different chromosomes.

Page 74: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

74 1 Analysis of Gene Expression Data

[13] "dual specificity phosphatase 1"[14] "dual specificity phosphatase 2"[15] "early growth response 1"[16] "FAT tumor suppressor homolog 1 (Drosophila)"[17] "aldehyde dehydrogenase 1 family, member A3"[18] "EGF-containing fibulin-like extracellular matrix protein 1"[19] "MLX interacting protein"[20] "SET domain containing 1B"[21] "forkhead box O1"[22] "synemin, intermediate filament protein"[23] "fibronectin 1"[24] "zinc finger protein 629"[25] "hairy/enhancer-of-split related with YRPW motif 1"[26] "tetratricopeptide repeat domain 9"[27] "v-maf musculoaponeurotic fibrosarcoma oncogene homolog F (avian)"[28] "GATA binding protein 2"[29] "GATA binding protein 6"[30] "tumor necrosis factor receptor superfamily, member 21"[31] "zinc finger protein 544"[32] "glutathione peroxidase 7"[33] "nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor)"[34] "coiled-coil domain containing 106"[35] "nuclear receptor subfamily 4, group A, member 1"[36] "homeobox C6"[37] "interferon, gamma-inducible protein 16"[38] "insulin-like growth factor binding protein 1"[39] "apoptosis-inducing, TAF9-like domain 1"[40] "leukemia inhibitory factor (cholinergic differentiation factor)"[41] "DnaJ (Hsp40) homolog, subfamily B, member 9"[42] "midline 1 (Opitz/BBB syndrome)"[43] "methionine sulfoxide reductase A"[44] "myosin, heavy chain 10, non-muscle"[45] "NGFI-A binding protein 1 (EGR1 binding protein 1)"[46] "activating transcription factor 3"[47] "nuclear factor, interleukin 3 regulated"[48] "dicarbonyl/L-xylulose reductase"[49] "immediate early response 5"[50] "phosphoglucomutase 1"[51] "serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 2"[52] "procollagen-lysine, 2-oxoglutarate 5-dioxygenase 2"[53] "plastin 3 (T isoform)"[54] "DNA-damage-inducible transcript 4"[55] "EPS8-like 1"

Page 75: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 75

[56] "mitochondrial ribosomal protein L20"[57] "zinc finger protein 444"[58] "chromosome 19 open reading frame 66"[59] "DENN/MADD domain containing 4C"[60] "prion protein"[61] "parathyroid hormone-like hormone"[62] "MKL/myocardin-like 2"[63] "KIAA1467"[64] "sulfide quinone reductase-like (yeast)"[65] "3-hydroxybutyrate dehydrogenase, type 1"[66] "basonuclin 1"[67] "fibronectin type III domain containing 3B"[68] "SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 1"[69] "RAS-like, family 11, member B"[70] "transcobalamin I (vitamin B12 binding protein, R binder family)"[71] "tropomyosin 2 (beta)"[72] "chromosome 11 open reading frame 10"[73] "zinc finger protein 8"[74] "myeloid zinc finger 1"[75] "UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 12 (GalNAc-T12)"[76] "zinc finger protein 552"[77] "caldesmon 1"[78] "NUAK family, SNF1-like kinase, 2"[79] "tubulin, beta 6"[80] "cold shock domain protein A"[81] "caveolin 1, caveolae protein, 22kDa"[82] "ATP-binding cassette, sub-family G (WHITE), member 1"[83] "synuclein, alpha interacting protein"[84] "RAB GTPase activating protein 1-like"[85] "glutamine-fructose-6-phosphate transaminase 2"

> re1 <- cbind(nna, pv1_t2)> nns <- unique(unlist(mget(i1, env = hgu133ahsentrezgSYMBOL, ifnotfound = NA)))> re2 <- cbind(nns, pv1_t2)> signature <- c("DPF2", "CASP8", "BCL2", "SCGN", "SWAP70", "KIAA0276",+ "C10orf9", "C10orf7", "ALKBH", "TOB2", "XPR1", "CD59", "LRP2",+ "PLP2", "MAPK14", "CXCL2", "MMP7", "MGP", "MLF1", "FLNB",+ "SCNM1", "HSPC163", "C5orf18", "MGC4399", "CDW92", "TMC4",+ "ZDHHC2", "TICAM2", "KDELR3", "GNPDA1", "THEM2", "DBR1",+ "FLJ90709", "FLJ10774", "C16orf33", "GAPD", "LDHA", "MR-1",+ "LARS", "GTPBP1,", "PRSS16", "WFDC2", "AIM1", "DHRS6", "DHRS4",+ "MGC15429", "MGC45840", "ECHDC2", "GOLGIN-67,", "AFURS1",+ "KIAA0436", "CYP4V2", "JTV1", "ICMT", "DNMT3A", "HNMT", "METTL7A",+ "METTL2", "VIL2", "TPD52", "ARPC5", "NOL8", "NSF", "RAD23B",

Page 76: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

76 1 Analysis of Gene Expression Data

+ "SRP54", "HSPA2", "PBP", "THAP2", "CIRBP", "SNRPN", "KIAA0052",+ "DUSP10", "SSR1", "ERBB4", "EMP1", "CHPT1", "LRPAP1", "FLJ11752",+ "CSTF1", "KLHL20", "DNAJC13", "APLP2", "ARGBP2", "DNAJB1",+ "NEBL", "SH3BGRL", "NUDT5,", "GABARAPL1", "MAPT", "DCBLD1",+ "STK39", "PAK2", "CSNK2A1", "PILRB", "ERN1", "SGKL", "WEE1",+ "MAST4", "C11orf17", "NUP37", "CLTC", "COPB2", "SLC25A25",+ "ECOP", "PDE8A", "STAM", "TUBB", "SNX6", "RAB23", "PLAA",+ "STC2", "LTF", "ISGF3G", "ATXN3", "GTF3C3", "GSK3B", "KLF10",+ "ELL2", "ZBTB20", "IRX3", "ETS1", "SERTAD1", "MGC4251,",+ "MAFF", "SFPQ", "CITED4", "CEBPD", "EIF4E2", "HS2ST1", "AGPS",+ "PGK1", "ATIC", "ETNK1", "ALG2", "NCE2", "MARCH8", "CNOT4",+ "RNF8", "PSMA5", "DPF2", "AMMECR1", "KIAA1287", "LOC144233",+ "LOC286505", "PNAS-4", "FLJ20530", "THUMPD3", "MGC45564,CAP350",+ "ETAA16", "HAN11", "DNAPTP6", "C7orf25", "FLJ37953", "FLJ10587",+ "C7orf36", "ELP4", "NDEL1,\tNPD014", "DKFZP564D172", "FAM53C",+ "IER5", "LOC255783", "KIAA0146", "KIAA0792", "LOC439994",+ "LOC283481", "CG018", "LOC130576", "NGFRAP1L1", "KIAA1217",+ "C4orf7", "C21orf86", "C9orf64", "FLJ13456", "KIAA1600",+ "B7-H4", "LOC80298", "C7orf2", "NUCKS", "DKFZP566D1346",+ "LOC388279", "FLJ31795", "C6orf107", "FLJ12439", "FLJ12806",+ "FLJ39370")> intersect(signature, nns)

[1] "KDELR3" "MAFF" "IER5"

The complementary array to the investigated hgu133a is hgu133b. Now we also want toinvestigate this array.

> rawdata_b <- ReadAffy(celfile.path = "c:/sepp/work/analyze_transcripts/test1/133b",+ cdfname = "HGU133B_Hs_ENTREZG")> all_names_t_b <- strsplit(sampleNames(rawdata_b), "[\\.,_]")> l1_b <- length(all_names_t_b)> all_names_b <- rep("", l1_b)> for (i in 1:l1_b) {+ all_names_b[i] <- all_names_t_b[[i]][1]+ }> lab_b <- matrix(FALSE, nrow = lv, ncol = l1_b)> for (i in 1:lv) {+ lab_b[i, ] <- all_names_b %in% lab_names[[i]]+ }> rownames(lab_b) <- lev> colnames(lab_b) <- all_names_b

We next do the summarization and the INI calls.

Page 77: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 77

> farmsdata_t1_b <- q.farms(rawdata_b)

background correction: nonenormalization: quantilesPM/MM correction : pmonlyexpression values: farmsbackground correcting...done.normalizing...done.6669 ids to be processed| ||####################|

> ini_t1_b <- INIcalls(farmsdata_t1_b)> summary(ini_t1_b)

SummaryInformative probe sets : 36.12%Non-Informative probe sets : 63.88%

> ini_data_t1_b <- getI_Eset(ini_t1_b)> ini_data_t1_b

ExpressionSet (storageMode: lockedEnvironment)assayData: 2409 features, 12 samples

element names: exprs, se.exprsphenoData

sampleNames: GSM158647.CEL, GSM158648.CEL, ..., GSM158658.CEL (12 total)varLabels and varMetadata description:

sample: arbitrary numberingfeatureData

featureNames: 1_at, 100093630_at, ..., AFFX-r2-P1-cre-5_at (2409 total)fvarLabels and fvarMetadata description: none

experimentData: use 'experimentData(object)'Annotation: hgu133bhsentrezg

We start again with the classification of tumor against normal.

> lv2 <- 2> sel <- c(1, 3)> lab_t_b <- rep(FALSE, ncol(lab_b))> for (i in 1:lv2) {+ lab_t_b <- lab_t_b | lab_b[sel[i], ]+ }> labels_t1_b <- rep(0, l1)> labels_t1_b[lab_t_b] <- 1

Page 78: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

78 1 Analysis of Gene Expression Data

And again the limma analysis.

> design_t1_b <- cbind(rep(1, ncol(ini_data_t1_b)), labels_t1_b)> colnames(design_t1_b) <- c("Tumor", "Normal")> rownames(design_t1_b) <- colnames(ini_data_t1_b)> Fit_t1_b <- lmFit(ini_data_t1_b, design_t1_b)> Fit_t1_b <- eBayes(Fit_t1_b)> PValues_t1_b <- as.matrix(Fit_t1_b$p.value[, "Normal"])> colnames(PValues_t1_b) <- "rawp"> AdjP_t1_b <- mt.rawp2adjp(PValues_t1_b[, "rawp"], proc = c("BH"))> AdjP_t1_b.ordered <- AdjP_t1_b$adjp[order(AdjP_t1_b$index), ]> p_t1_b.idx <- AdjP_t1_b$index> pv1_t1_b <- as.matrix(PValues_t1_b[which(PValues_t1_b <= 0.05)])> rownames(pv1_t1_b) <- rownames(which(PValues_t1_b <= 0.05, arr.ind = TRUE))> ss <- length(pv1_t1_b)> ss

[1] 350

Now again the plot of significantly differentially expressed genes.

> ccla <- buildChromLocation("hgu133bhsentrezg")> i1_b <- rownames(pv1_t1_b)> a1_b <- exprs(ini_data_t1_b)[which(match(rownames(exprs(ini_data_t1_b)),+ i1_b) > 0), ]> sii <- rep(0, ss)> for (i in 1:ss) {+ class1 <- median(a1_b[i, labels_t1_b])+ class2 <- median(a1_b[i, !labels_t1_b])+ sii[i] <- sign(class1 - class2)+ }> re <- i1_b[which(sii > 0)]> bl <- i1_b[which(sii < 0)]

> cPlot(ccla)> cColor(re, "red", ccla)> cColor(bl, "blue", ccla)

Fig. 18 shows the plot of over- and underexpressed genes in tumor compared to normalsamples on different chromosomes.

The next classification task is tumorigenic vs. non-tumorigenic but now on hgu133b.

Again we only consider tumor CEL files.

Page 79: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 79

Homo sapiens

Chr

omos

ome

110111213141516171819

2202122

3456789MXY

Figure 18: Plot of over- and underexpressed genes in tumor compared to normal sampleson different chromosomes.

Page 80: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

80 1 Analysis of Gene Expression Data

> lv2 <- 2> sel <- c(1, 3)> exp_t2_b <- rep(FALSE, ncol(lab_b))> labt2_names_b <- list()> for (i in 1:lv2) {+ exp_t2_b <- exp_t2_b | lab_b[sel[i], ]+ labt2_names_b[[i]] <- all_names_b[lab[sel[i], ]]+ }> t2_names_b <- all_names_b[exp_t2_b]> l2_b <- length(t2_names_b)> lab_t2_b <- matrix(FALSE, nrow = lv2, ncol = l2_b)> for (i in 1:lv2) {+ lab_t2_b[i, ] <- t2_names_b %in% labt2_names_b[[i]]+ }> rownames(lab_t2_b) <- lev[c(1, 3)]> colnames(lab_t2_b) <- t2_names_b

Summarization and INI call but only on the tumor CEL files.

> farmsdata_t2_b <- q.farms(rawdata_b[exp_t2_b])

background correction: nonenormalization: quantilesPM/MM correction : pmonlyexpression values: farmsbackground correcting...done.normalizing...done.6669 ids to be processed| ||####################|

> ini_t2_b <- INIcalls(farmsdata_t2_b)> summary(ini_t2_b)

SummaryInformative probe sets : 39.26%Non-Informative probe sets : 60.74%

> ini_data_t2_b <- getI_Eset(ini_t2_b)> ini_data_t2_b

Page 81: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 81

ExpressionSet (storageMode: lockedEnvironment)assayData: 2618 features, 9 samples

element names: exprs, se.exprsphenoData

sampleNames: GSM158647.CEL, GSM158648.CEL, ..., GSM158655_1.CEL (9 total)varLabels and varMetadata description:

sample: arbitrary numberingfeatureData

featureNames: 1_at, 10004_at, ..., AFFX-r2-P1-cre-5_at (2618 total)fvarLabels and fvarMetadata description: none

experimentData: use 'experimentData(object)'Annotation: hgu133bhsentrezg

Let us compare the called genes with the previously called genes.

> length(featureNames(ini_data_t2_b))

[1] 2618

> length(featureNames(ini_data_t1_b))

[1] 2409

> length(intersect(featureNames(ini_data_t1_b), featureNames(ini_data_t2_b)))

[1] 2254

Next we define the labels.

> labels_t2_b <- rep(0, l2)> labels_t2_b[lab_t2_b[2, ]] <- 1

We apply again the limma procedure to find significantly differentially expressed genesbetween tumorigenic vs. non-tumorigenic samples.

> design_t2_b <- cbind(rep(1, ncol(ini_data_t2_b)), labels_t2_b)> colnames(design_t2_b) <- c("Tumorigenic", "Non-tumorigenic")> rownames(design_t2_b) <- colnames(ini_data_t2_b)> Fit_t2_b <- lmFit(ini_data_t2_b, design_t2_b)> Fit_t2_b <- eBayes(Fit_t2_b)> PValues_t2_b <- as.matrix(Fit_t2_b$p.value[, "Non-tumorigenic"])> colnames(PValues_t2_b) <- "rawp"

Page 82: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

82 1 Analysis of Gene Expression Data

> AdjP_t2_b <- mt.rawp2adjp(PValues_t2_b[, "rawp"], proc = c("BH"))> AdjP_t2.ordered <- AdjP_t2_b$adjp[order(AdjP_t2_b$index), ]> p_t2_b.idx <- AdjP_t2_b$index> pv1_t2_b <- as.matrix(PValues_t2_b[which(PValues_t2_b <= 0.05)])> rownames(pv1_t2_b) <- rownames(which(PValues_t2_b <= 0.05, arr.ind = TRUE))> ss <- length(pv1_t2_b)> ss

[1] 35

Now we plot the significantly differentially expressed genes along the chromosomes.

> ccla <- buildChromLocation("hgu133bhsentrezg")> i1_b <- rownames(pv1_t2_b)> a1_b <- exprs(ini_data_t2_b)[which(match(rownames(exprs(ini_data_t2_b)),+ i1_b) > 0), ]> sii <- rep(0, ss)> for (i in 1:ss) {+ class1 <- median(a1_b[i, labels_t2_b])+ class2 <- median(a1_b[i, !labels_t2_b])+ sii[i] <- sign(class1 - class2)+ }> re <- i1_b[which(sii > 0)]> bl <- i1_b[which(sii < 0)]

> cPlot(ccla)> cColor(re, "red", ccla)> cColor(bl, "blue", ccla)

Fig. 19 shows the plot of over- and underexpressed genes in tumorigenic compared tonon-tumorigenic samples on different chromosomes.

We now want to see if there is an overlap between the genes we identified and the signaturefound in the original publication.

> nna_b <- unique(unlist(mget(i1_b, env = hgu133bhsentrezgGENENAME,+ ifnotfound = NA)))> nna_b

[1] "Yes-associated protein 1, 65kDa"[2] "cofilin 2 (muscle)"[3] "chromosome 6 open reading frame 192"[4] "leucine-rich repeats and immunoglobulin-like domains 3"

Page 83: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

1 Analysis of Gene Expression Data 83

Homo sapiens

Chr

omos

ome

110111213141516171819

2202122

3456789MXY

Figure 19: Plot of over- and underexpressed genes in tumorigenic compared to non-tumorigenic samples on different chromosomes.

Page 84: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

84 1 Analysis of Gene Expression Data

[5] "tandem C2 domains, nuclear"[6] "solute carrier family 16, member 14 (monocarboxylic acid transporter 14)"[7] "retinol dehydrogenase 10 (all-trans)"[8] "zinc finger protein 92"[9] "ribosomal protein L22-like 1"

[10] "SAM and SH3 domain containing 1"[11] "placenta-specific 2 (non-protein coding)"[12] "LATS, large tumor suppressor, homolog 2 (Drosophila)"[13] "heparan sulfate 6-O-sulfotransferase 3"[14] "SH3-domain kinase binding protein 1"[15] "homeobox C9"[16] "TMF1-regulated nuclear protein 1"[17] "mutated in colorectal cancers"[18] "ArfGAP with SH3 domain, ankyrin repeat and PH domain 1"[19] "kelch-like 5 (Drosophila)"[20] "peptidylprolyl isomerase (cyclophilin)-like 3"[21] "family with sequence similarity 120C"[22] "zinc finger and SCAN domain containing 2"[23] "ribosomal protein S6 kinase, 90kDa, polypeptide 3"[24] "transporter 2, ATP-binding cassette, sub-family B (MDR/TAP)"[25] "ubiquitin-conjugating enzyme E2E 2 (UBC4/5 homolog, yeast)"[26] "upstream binding transcription factor, RNA polymerase I"[27] "meteorin, glial cell differentiation regulator"[28] "frizzled homolog 8 (Drosophila)"[29] "anthrax toxin receptor 1"[30] "chromosome 6 open reading frame 105"[31] "asparagine-linked glycosylation 2, alpha-1,3-mannosyltransferase homolog (S. cerevisiae)"[32] "pleckstrin homology-like domain, family B, member 2"[33] "claudin 1"[34] "neurexin 3"[35] "phosphatase and actin regulator 2"

> re1_b <- cbind(nna_b, pv1_t2_b)> nns_b <- unique(unlist(mget(i1_b, env = hgu133bhsentrezgSYMBOL,+ ifnotfound = NA)))> re2_b <- cbind(nns_b, pv1_t2_b)> intersect(signature, nns_b)

[1] "ALG2"

Page 85: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 85

A New Feature Selection Methods

A.1 Multidimensional Wald and Score Test

A.1.1 Wald Test

Let βML be the maximum likelihood estimator and IF (βML) the Fisher Information ma-trix.

The multivariate Wald test assumes that

(β − βML)T IF (βML) (β − βML) ∼ χ2

d , (1)

where d is the number of parameters.

This test can be used to test that some the d parameters of βML are zero.

A.1.2 Score Test

For the score test the score is

U(β) =∂ logL(β|x)

∂β. (2)

Suppose that βML is the maximum likelihood estimate of β then

UT (βML)(IF (βML)

)−1U(βML) ∼ χ2

k (3)

asymptotically, where k is the number of constraints imposed by the null hypothesis and

U(βML) =∂ logL(βML|x)

∂β. (4)

For additive Gaussian noise

U(βML) =∂f(x;β)

∂β

1

σ2(y − f(x;β)) (5)

which is in the linear case f(x;β) = xTβ

U(βML) =1

σ2XT (y − Xβ) (6)

and

IF =∂f(x;β)

∂β

∫ ∞−∞

1

p(y | x;β)∇fp(y | x;β)∇fp(y | x;β)T dy

∂f(x;β)

∂β

T

(7)

Page 86: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

86 A New Feature Selection Methods

which is in the linear case

IF =1

σ2XTX . (8)

When the data follows a normal distribution, the score statistic is the same as the t-statistic.

For logistic regression we have

∂L

∂β= yTD(1 − p(yi | xi;β))X . (9)

and

IFts = XTDp X , (10)

where

Dp = diag(p(yi)

(1 − p(yi)

)). (11)

A.2 Optimal Brain Surgeon on Fisher Information

The task is to remove input variables by setting parameters that scale the input variablesto zero.

The variables which should be set to zero are given by a diagonal matrixD which containsones for the variables that are set to zero and zeros otherwise.

D =

0 . . . 0 0 0 . . . 0 0 0 . . . 0

0. . . 0 0 0 . . . 0 0 0 . . . 0

0 . . . 0 0 0 . . . 0 0 0 . . . 00 . . . 0 1 0 . . . 0 0 0 . . . 00 . . . 0 0 0 . . . 0 0 0 . . . 0

0 . . . 0 0 0. . . 0 0 0 . . . 0

0 . . . 0 0 0 . . . 0 0 0 . . . 00 . . . 0 0 0 . . . 0 1 0 . . . 00 . . . 0 0 0 . . . 0 0 0 . . . 0

0 . . . 0 0 0 . . . 0 0 0. . . 0

0 . . . 0 0 0 . . . 0 0 0 . . . 0

(12)

Let βML be the maximum likelihood estimator and IF (βML) the Fisher Information ma-trix.

Page 87: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 87

The result of the maximum likelihood estimator is asymptotically normal distributed withthe “true” parameter β as mean and covariance

(IF (β)

)−1:βML ∼ N

(β,(IF (β)

)−1)(13)

The random variable is the additive noise on the training data.

Goal is to remove those variables from the maximum likelihood estimator which are atleast determined by the data or which extracted the smallest amount of information fromthe data.

We do not know β, therefore we will use our best guess which can be the maximumlikelihood estimator, a solution of a constraint optimization problem, or a Bayes approach.Let us assume that we have an estimator β̂ for β.

minβ

1

2

(β − β̂

)TIF (β̂)

(β − β̂

)(14)

s.t. D β = 0 (15)

We compute the Lagrangian for this convex optimization problem

LA =1

2

(β − β̂

)TIF (β̂)

(β − β̂

)+ αT D β (16)

For the optimal value the derivative of the Lagrangian with respect to the parameter mustbe zero:

∂LA

∂β= IF (β̂)

(β − β̂

)+ D α = 0 (17)

It follows that

β = −(IF (β̂)

)−1D α + β̂ (18)

We insert this equation into the constraint:

−D(IF (β̂)

)−1D α + D β̂ = 0 (19)

−D(IF (β̂)

)−1D D α + D β̂ = 0 (20)

Let us define the submatrix of(IF (β̂)

)−1of the columns and rows whereD is one as[(IF (β̂)

)−1]D

(21)

Page 88: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

88 A New Feature Selection Methods

and the subvectors of α and β̂ of components whereD is one as

αD and β̂D . (22)

With this definition we have

αD =

[(IF (β̂)

)−1]−1D

β̂D . (23)

Now we complete αD by zeros to obtain α.

α =

{αD for Dii = 10 for Dii = 0

=

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

(24)

Inserting α gives

β = β̂ −(IF (β̂)

)−1D

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

. (25)

Page 89: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 89

Now we want to compute the value of the objective:

1

2

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

T

D(IF (β̂)

)−1(26)

(IF (β̂)

) (IF (β̂)

)−1D

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

=

1

2

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

T

D(IF (β̂)

)−1D

0...0[(

IF (β̂))−1]−1

D

β̂D

0...0

=

1

2

(β̂D

)T [(IF (β̂)

)−1]−1D

β̂D

We will remove those variables for which(β̂D

)T [(IF (β̂)

)−1]−1D

β̂D (27)

is minimal.

For one variable i that is

β̂2i[(

IF (β̂))−1]

ii

. (28)

Page 90: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

90 A New Feature Selection Methods

Note that for least square the Wald test optimization and the Likelihood ratio optimizationare the same.

Note that for the likelihood ratio test the likelihood can be expanded into a Taylor seriesaround βML and the Fisher information matrix is the Hession for the logistic regression.

However deletions are only valid if higher order terms vanish. Therefore only variableswith small absolute values can be removed. That means for small absolute values theWald test and the likelihood ratio test are similar.

A.3 Fisher Information for Additive Gaussian Noise

We want to optimize the statistics for the Wald test.

For a regression setting with additive Gaussian iid noise we have with regression functionf

y = f(x;β) + ε (29)ε ∼ N

(0, σ2I

)(30)

That means

p(y | x;β) ∼ N(f(x;β), σ2I

)(31)

The Fisher information matrix is

IF = Ep(y)

(∂ ln p(y | x;β)

∂β

∂ ln p(y | x;β)∂β

T)

= (32)

Ep(y)

(∂f(x;β)

∂β

1

σ4(y − f(x;β)) (y − f(x;β))T

∂f(x;β)

∂β

T)

=

∂f(x;β)

∂β

1

σ4Ep(y)

((y − f(x;β)) (y − f(x;β))T

) ∂f(x;β)

∂β

T

=

∂f(x;β)

∂β

1

σ4σ2 I

∂f(x;β)

∂β

T

=

1

σ2

∂f(x;β)

∂β

∂f(x;β)

∂β

T

If the error distribution is not Gaussian we use

∂ ln p(y | x;β)∂β

=∇fp(y | x;β)p(y | x;β)

∂f(x;β)

∂β(33)

to obtain

IF =∂f(x;β)

∂β

∫ ∞−∞

1

p(y | x;β)∇fp(y | x;β)∇fp(y | x;β)T dy

∂f(x;β)

∂β

T

.(34)

Page 91: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 91

In the linear case f(x;β) = xTβ have

IF =1

σ2XTX (35)

A.4 Logistic Regression

We consider y ∈ +1,−1.

A linear functionXβ can be transformed into a probability by the sigmoid function

1

1 + e− XTβ(36)

Note that

1 − 1

1 + e− XTβ=

e− XTβ

1 + e− XTβ. (37)

We set

p(y = 1 | xi;β) =1

1 + e− (xi)Tβ(38)

and

p(y = −1 | xi;β) =e− (xi)Tβ

1 + e− (xi)Tβ. (39)

Therefore we have

− ln p(y = yi | xi;β) = ln(1 + e− yi (xi)Tβ

)(40)

and the objective which minimization maximizes the likelihood is

L = −l∑

i=1

ln p(yi | xi;β) = −l∑

i=1

ln(1 + e− yi (xi)Tβ

)(41)

The derivatives of the objective with respect to the parameters are

∂L

∂βj= −

l∑i=1

yi∂ (xi)Tβ

∂βj

e− yi (xi)Tβ

1 + e− yi (xi)Tβ= (42)

−l∑

i=1

yi xij(1 − p(yi | xi;β)

).

Page 92: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

92 A New Feature Selection Methods

In matrix notation we have

∂L

∂β= −

l∑i=1

yi(1 − p(yi | xi;β)

)xi = (43)

yTD(1 − p(yi | xi;β))X .

The second derivatives of the objective L that is minimized are

Hjk =∂L

∂βj ∂βk=

l∑i=1

(yi)2xij xik p(y

i | xi;β)(1 − p(yi | xi;β)

), (44)

whereH is the Hessian.

Using

Dp = diagp(yi)(1 − p(yi)

)(45)

we have

H = XTDp X . (46)

An update rule for β using the Newton method is

βt+1 = βt +(XTDp X

)−1XTDp y . (47)

A.5 Fisher Information for Logistic Regression

The Fisher information matrix is

IFts = E

(∂L

∂βt

∂L

∂βs

)= (48)

∑y

p(y)l∑

i=1

l∑k=1

yi(1 − p(yi | xi;β)

)xity

k(1 − p(yk | xk;β)

)xks =

∑y

l∏j=1

p(yj)l∑

i=1

l∑k=1

yiyk(1 − p(yi | xi;β)

) (1 − p(yk | xk;β)

)xitxks

Because p(1) = (1− p(−1)) we have∑yk

p(yk)yk(1 − p(yk)

)= (49)

{− p(yk)

(1 − p(yk)

)+(1 − p(yk)

)p(yk) for yk = −1

p(yk)(1 − p(yk)

)−(1 − p(yk)

)p(yk) for yk = 1

}= 0

Page 93: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 93

Therefore the linear terms in p(yk)yk(1 − p(yk)

)vanish and only terms like p(yk)

(yk(1 − p(yk)

))2remain.

IFts =∑y

l∑i=1

p(yi)(yi(1 − p(yi)

))2xitxis =

∑y

l∑i=1

p(yi)((1 − p(yi)

))2xitxis =(50)

l∑i=1

(p(yi)

(1 − p(yi)

)2+ p(−yi)

(1 − p(−yi)

)2)xitxis =

l∑i=1

(p(yi)

(1 − p(yi)

)2+(1 − p(yi)

) (p(yi)

)2)xitxis =

l∑i=1

p(yi)(1 − p(yi)

) ((1 − p(yi)

)+(p(yi)

))xitxis =

l∑i=1

p(yi)(1 − p(yi)

)xitxis

Using

Dp = diag(p(yi)

(1 − p(yi)

))(51)

we have

IF = XTDp X . (52)

A.6 Constraint Logistic Regression

A.6.1 Ball Constraint

minβL (53)

s.t. βTβ ≤ 2γ (54)

With the Lagrange multiplier α ≥ 0 the Lagrangian is

LA = L + α(βTβ − 2γ

)(55)

The derivative of the Lagrangian is

∂LA

∂β= −

l∑i=1

yi(1 − p(yi | xi;β)

)xi + α β = (56)

−XT Dp y + α β = 0

Page 94: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

94 A New Feature Selection Methods

If we assume thatDp does not depend on β then we have

β ≈ 1

αXT Dp y . (57)

A.6.2 Variation Constraint

For the maximum likelihood estimate, the variation of β is(D0.5

p X β)T (

D0.5p Xβ

)(58)

according to page 709 3.3 (c) of Daryl Pregibon, “Logistic Regression Diagnostics”, TheAnnals of Statistics, vol. 9, no. 4, pp 705-724, 1981. That is the Fisher informationmatrix.

minβL (59)

s.t.(D0.5

p X β)T (

D0.5p Xβ

)≤ 2γ (60)

With the Lagrange multiplier α ≥ 0 the Lagrangian is

LA = L + α((D0.5

p X β)T (

D0.5p Xβ

)− 2γ

)(61)

The derivative of the Lagrangian is

∂LA

∂β= −XT Dp y + αXTDpX β = 0 (62)

If we assume thatDp does not depend on β then we have

β ≈ 1

α

(XTDpX

)−1XT Dp y . (63)

Note, that is the approximation of the maximum likelihood according to Daryl Pregibonpage 709 equation on the top and to David W. Hosmer and Stanly Lemeshow, “AppliedLogistic Regressioin”, Wiley, 1989 page 119 and page 120.

A.6.3 Deletion in Ball Constraint

For the likelihood ratio test we want to maximize the likelihood if some variables are notused.

Page 95: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 95

minβL (64)

s.t. βTβ ≤ 2γ (65)D β = 0 (66)

With the Lagrange multiplier α ≥ 0 the Lagrangian is

LA = L + α1

(βTβ − 2γ

)+αT

2D β (67)

The derivative of the Lagrangian is

∂LA

∂β= −XT Dp y + α1 β + Dα2 = 0 (68)

If we assume thatDp does not depend on β then we have

β ≈ 1

α1

(XT Dp y − D α2

). (69)

D β =1

α1

(DXT Dp y − D α2

)= 0 (70)

from which follows that

D α2 = DXT Dp y (71)

and

β ≈ 1

α1

(XT Dp y − DXT Dp y

). (72)

Here α1 can be computed from

βTβ = 2γ . (73)

Under the assumption that Dp does not change, that means only the according variablesare removed and β is rescaled.

Page 96: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

96 A New Feature Selection Methods

A.6.4 Deletion Variation Constraint

For the likelihood ratio test we want to maximize the likelihood if some variables are notused.

The problem is

minβL (74)

s.t.(D0.5

p X β)T (

D0.5p Xβ

)≤ 2γ (75)

D β = 0 (76)

The derivative of the Lagrangian is

∂LA

∂β= −XT Dp y + α1 X

TDpX β + Dα2 = 0 (77)

and solving it for β gives

β =1

α1

((XTDpX

)−1XT Dp y −

(XTDpX

)−1D α2

). (78)

UsingD β = 0 we get

D β = (79)1

α1

(D(XTDpX

)−1XT Dp y − D

(XTDpX

)−1D α2

)= 0

Solving that forα2 we introduce the submatrix of(XTDpX

)−1 of the columns and rowswhereD is one as [(

XTDpX)−1]

D(80)

and the subvector α2,D of α2 consisting of components whereD is one which gives

α2,D =[(XTDpX

)−1]−1DD(XTDpX

)−1XT Dp y . (81)

Now we complete α2,D by zeros to obtain α2.

β =1

α1

(XTDpX

)−1(82)(

XT Dp y − D[(XTDpX

)−1]−1DD(XTDpX

)−1XT Dp y

).

Page 97: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 97

If we set

β̂ =1

α1

(XTDpX

)−1XT Dp y (83)

then we obtain

β = β̂ −(XTDpX

)−1D[(XTDpX

)−1]−1DD β̂ . (84)

This equation is equal to eq. (25), where the maximum likelihood estimator is β̂ and theFisher information matrix is

(XTDpX

).

Because we assume to be in a minimum of the likelihood, where the first order derivativesare zero, the Hessian of the log-likelihood determines the minimal increase of the log-likelihood. The Hessian is equal to the Fisher information matrix.

Therefore we can proceed as with Optimal Brain Surgeon for optimizing the Wald statis-tics. Note, however, that the Taylor expansion of the log-likelihood only holds for smalldeviations. That means only for deleting small weights.

A.7 Constraint Least Square

minβ

(y − X β)T (y − X β) (85)

s.t. βTβ ≤ γ (86)

With the Lagrange multiplier α ≥ 0 the Lagrangian is

LA = βTXTX β − 2 yTX β + yTy + α(βTβ − γ

)(87)

∂LA

∂β= 2XTX β − 2XTy + 2 α β = 0 (88)

β =(XTX + αI

)−1XTy (89)

The value α can be determined by

βTβ = γ . (90)

A.8 Matrix inversion lemma

(ε I + XTX

)−1=

1

ε

(I − XT

(ε I + XXT

)−1X)

(91)

Page 98: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

98 A New Feature Selection Methods

A.9 Compute Inverse Iteratively

A.9.1 Backward

The following equations show how to compute the inverse of a sub-matrix if the inverseof the whole matrix is known.(

A VV T Z

)−1=

(K UUT S

)= (92)

Then

K = A−1 + A−1V(Z − V TA−1V

)−1V TA−1 (93)

U = −A−1V(Z − V TA−1V

)−1(94)

UT = −(Z − V TA−1V

)−1V TA−1 (95)

S =(Z − V TA−1V

)−1(96)

which implies that

A−1 = K − U S−1UT (97)

A.9.2 Forward

We want to compute

(ε I + XTX

)−1(98)

and know that

XTX =l∑

i=1

xi(xi)T (99)

We set

IF0 = ε I (100)IFi = IFi−1 + xi(xi)T (101)

The matrix inversion lemma gives

(IFi)−1

=(IFi−1

)−1+

(IFi−1

)−1xi (xi)T

(IFi−1

)−11 + (xi)T

(IFi−1

)−1xi

, (102)

where only one matrix by vector multiplication is necessary:(IFi−1

)−1xi . (103)

Page 99: Introduction to R with applications to bioinformaticsIntroduction to R with applications to bioinformatics PART II Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University

A New Feature Selection Methods 99

A.10 Compute Bilinear Form Iteratively

The bilinear form (β̂D

)T [(IF (β̂)

)−1]−1D

β̂D (104)

should be minimized.

If(IF (β̂)

)−1is known then a greedy algorithm can optimize above critereon.

Note that (A vvT z

)−1=

(A−1 + c u uT − c u− c uT c

), (105)

where

c =(z − vT A−1 v

)−1(106)

u = A−1 v . (107)

(q s

) ( A vvT z

)−1(qs

)= (108)

qTA−1 q + c(qTu uTq − s uT q − s qTu + s2

)=

qTA−1 q +(qTu − s

)2