29
Using the a4 package Willem Talloen, Tobias Verbeke October 27, 2020 Contents 1 Introduction 2 2 Preparation of the Data 2 2.1 ExpressionSet object ................................... 2 2.2 Some data manipulation ................................. 2 3 Unsupervised data exploration 3 4 Filtering 4 5 Detecting differential expression 5 5.1 T-test ........................................... 5 5.2 Limma for comparing two groups ............................ 6 5.3 Limma for linear relations with a continuous variable ................. 7 6 Class prediction 8 6.1 PAM ............................................ 8 6.2 Random forest ...................................... 8 6.3 Forward filtering with various classifiers ........................ 9 6.4 Penalized regression ................................... 11 6.5 Logistic regression .................................... 13 6.6 Receiver operating curve ................................. 15 7 Visualization of interesting genes 16 7.1 Plot the expression levels of one gene .......................... 16 7.2 Plot the expression levels of two genes versus each other ............... 20 7.3 Plot expression line profiles of multiple genes/probesets across samples ....... 21 7.4 Smoothscatter plots ................................... 23 7.5 Gene lists of log ratios .................................. 26 8 Pathway analysis 28 8.1 Minus log p ........................................ 28 9 Software used 29 1

Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

Using the a4 package

Willem Talloen, Tobias Verbeke

October 27, 2020

Contents

1 Introduction 2

2 Preparation of the Data 22.1 ExpressionSet object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Some data manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3 Unsupervised data exploration 3

4 Filtering 4

5 Detecting differential expression 55.1 T-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55.2 Limma for comparing two groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65.3 Limma for linear relations with a continuous variable . . . . . . . . . . . . . . . . . 7

6 Class prediction 86.1 PAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86.2 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86.3 Forward filtering with various classifiers . . . . . . . . . . . . . . . . . . . . . . . . 96.4 Penalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116.5 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136.6 Receiver operating curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

7 Visualization of interesting genes 167.1 Plot the expression levels of one gene . . . . . . . . . . . . . . . . . . . . . . . . . . 167.2 Plot the expression levels of two genes versus each other . . . . . . . . . . . . . . . 207.3 Plot expression line profiles of multiple genes/probesets across samples . . . . . . . 217.4 Smoothscatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237.5 Gene lists of log ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

8 Pathway analysis 288.1 Minus log p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

9 Software used 29

1

Page 2: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

1 Introduction

The a4 suite of packages is a suite for convenient analysis of Affymetrix microarray experimentswhich supplements Goehlmann and Talloen (2010). The suite currently consists of several packageswhich are centered around particular tasks:

� a4Preproc: package for preprocessing of microarray data. Currently the only function inthe package adds complementary annotation information to the ExpressionSet objects (infunction addGeneInfo). Many of the subsequent analysis functions rely on the presence ofsuch information.

� a4Core: package made to allow for easy interoperability with the nlcv package which iscurrently being developed on R-Forge at http://r-forge.r-project.org/projects/nlcv.

� a4Base: all basic functionality of the a4 suite

� a4Classif: functionality for classification work that has been split off a.o. in order to reducea4Base loading time

� a4Reporting: a package which provides reporting functionality and defines xtable-methodsthat are foreseen for tables with hyperlinks to public gene annotation resources.

This document provides an overview of the typical analysis workflow for such microarray ex-periments using functionality of all of the mentioned packages.

2 Preparation of the Data

First we load the package a4 and the example real-life data set ALL.

R> library(a4)

R> require(ALL)

R> data(ALL, package = "ALL")

For illustrative purposes, simulated data sets can also be very valuable (but not used here).

R> require(nlcv)

R> esSim <- simulateData(nEffectRows=50, betweenClassDifference = 5,

nNoEffectCols = 5, withinClassSd = 0.2)

2.1 ExpressionSet object

The data are assumed to be in an expressionSet object. Such an object structure combines differentsources of information into a single structure, allowing easy data manipulation (e.g., subsetting,copying) and data modelling.

The textttfeatureData slot is typically not yet containing all relevant information about thegenes. This interesting extra gene information can be added using addGeneInfo.

R> ALL <- addGeneInfo(ALL)

2.2 Some data manipulation

The ALL data consists out of samples obtained from two types of cells with very distinct expressionprofiles; B-cells and T-cells. To have a more subtle signal, gene expression will also be comparedbetween the BCR/ABL and the NEG group within B-cells only. To this end, we create theexpressionSet bcrAblOrNeg containing only B-cells with BCR/ABL or NEG.

R> Bcell <- grep("^B", as.character(ALL$BT)) # create B-Cell subset for ALL

R> subsetType <- "BCR/ABL" # other subsetType can be "ALL/AF4"

R> bcrAblOrNegIdx <- which(as.character(ALL$mol) %in% c("NEG", subsetType))

R> bcrAblOrNeg <- ALL[, intersect(Bcell, bcrAblOrNegIdx)]

R> bcrAblOrNeg$mol.biol <- factor(bcrAblOrNeg$mol.biol)

2

Page 3: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

3 Unsupervised data exploration

Spectral maps are very powerful techniques to get an unsupervised picture of how the data looklike. A spectral map of the ALL data set shows that the B- and the T-subtypes cluster togetheralong the x-axis (the first principal component). The plot also indicates which genes contribute inwhich way to this clustering. For example, the genes located in the same direction as the T-cellsamples are higher expressed in these T-cells. Indeed, the two genes at the left (TCF7 and CD3D)are well known to be specifically expressed by T-cells (Wetering 1992, Krissansen 1986).

R> spectralMap(object = ALL, groups = "BT")

R>

R> # optional argument settings

R> # plot.mpm.args=list(label.tol = 12, zoom = c(1,2), do.smoothScatter = TRUE),

R> # probe2gene = TRUE)

PC1 16%

PC

2 12

%

TCF7 CD74

CCN2

HLA−DRAHLA−DPB1HLA−DPB1

CD3D

CD9YBX3

01005

01010

04007

06002

08001

08024

11005

12007

1201912026

14016

15005

16009

20002

22011220132400124008

2401024011

24017

24018

25003

25006

2600126005

27003270043600237013

48001

4900664001

64002

650056800303002

24019

2402226003

28003

28019

36001

43007

43012

62001

62002

62003

04006

04008

04010

04016

15001

15004

16004

19005

24005

26008

28024

28028

28031

28032

31007

33005

43001

63001

68001

08011

08012

08018

09008

12006

1201228001

28005

2800628007

28021

28023

28035

28036

28037

28042

28043

28044

28047

30001

31011

43004

57001

09017

2200922010

84004

LAL5

0100311002

17003

26009

LAL4

01007

09002

16002

16007

19002

28009

44001

49004

560076500302020

04018

10005

15006

18001

19008

19014

1901728008

31015

37001

43006

43015

64005

83001

12008

24006

20005

A spectral map of the bcrAblOrNeg data subset does not show a clustering of BCR/ABL orNEG cells.

R> spectralMap(object = bcrAblOrNeg, groups = "mol.biol", probe2gene = TRUE)

3

Page 4: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

PC1 15%

PC

2 12

%

HBBHBB

HBD

MAFF

IGLL1

TCL1A

KLF9

SNRK

01005

03002

0800108011

0900811005

12006

12007

12012

1202614016

1500520002

22010

22013

2400124010

24011 24017

24022

26003

27003

27004

28019

2802128036

30001

31011

37013

4300149006

62001

62002

6200365005

68003

84004

0101004007

04008

04010

04016

06002

08012

08024

09017

12019

15001

16009

22009

22011

24008

24018

25003

25006

2600128001

28005

28006

28007

28023

28024

28031

2803528037

28042

28043

28044

28047

33005

36002

43004

43007

43012

48001

57001

64001 64002

68001BCR/ABLNEG

4 Filtering

The data can be filtered, for instance based on variance and intensity, in order to reduce thehigh-dimensionality.

R> selBcrAblOrNeg <- filterVarInt(object = bcrAblOrNeg)

R> propSelGenes <- round((dim(selBcrAblOrNeg)[1]/dim(bcrAblOrNeg)[1])*100,1)

This filter selected 18.9 % of the genes (2391 of the in total 12625 genes).

4

Page 5: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

5 Detecting differential expression

5.1 T-test

R> tTestResult <- tTest(selBcrAblOrNeg, "mol.biol")

R> histPvalue(tTestResult[,"p"], addLegend = TRUE)

R> propDEgenesRes <- propDEgenes(tTestResult[,"p"])

0.0 0.2 0.4 0.6 0.8 1.0

050

100

150

200

250

32.7% DE genes

Using an ordinary t-test, there are 171 genes significant at a FDR of 10%. The proportion ofgenes that are trully differentially expressed is estimated to be around 32.7.

The toptable and the volcano plot show that three most significant probe sets all target ABL1.This makes sense as the main difference between BCR/ABL and NEG cells is a mutation in thisparticular ABL gene.

R> tabTTest <- topTable(tTestResult, n = 10)

R> print(xtable(tabTTest,

caption="The top 5 features selected by an ordinary t-test.",

label ="tablassoClass"))

R> volcanoPlot(tTestResult, topPValues = 5, topLogRatios = 5)

5

Page 6: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

gSymbol p logRatio pBH tStat1636 g at ABL1 0.00 -1.10 0.00 9.2639730 at ABL1 0.00 -1.15 0.00 8.691635 at ABL1 0.00 -1.20 0.00 7.28

40202 at KLF9 0.00 -1.78 0.00 6.1837027 at AHNAK 0.00 -1.35 0.00 5.65

39837 s at ZNF467 0.00 -0.48 0.00 5.5040480 s at FYN 0.00 -0.87 0.00 5.30

33774 at CASP8 0.00 -1.00 0.00 5.2636591 at TUBA4A 0.00 -1.15 0.00 5.2537014 at MX1 0.00 1.41 0.00 -5.04

Table 1: The top 5 features selected by an ordinary t-test.

1

4e−14

−3 −2 −1 0 1 2 3

ABL1

ABL1

ABL1

KLF9

AHNAK

JCHAINCCN2

IGLL1

MX1

5.2 Limma for comparing two groups

In this particular data set, the modified t-test using limmaTwoLevels provides very similar results.This is because the sample size is relatively large.

R> limmaResult <- limmaTwoLevels(selBcrAblOrNeg, "mol.biol")

R> volcanoPlot(limmaResult)

R> # histPvalue(limmaResult)

R> # propDEgenes(limmaResult)

6

Page 7: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

1

3e−14

−3 −2 −1 0 1 2 3

ABL1

ABL1

ABL1

KLF9

AHNAK

TUBA4AFYNCASP8ZNF467 MX1

JCHAINCCN2

IGLL1

CRIP1

ZEB1

CNN3

It is very useful to put lists of genes in annotated tables where the genes get hyperlinks toEntrezGene.

R> tabLimma <- topTable(limmaResult, n = 10, coef = 2) # 1st is (Intercept)

Gene logFC AveExpr P.Value adj.P.Val GENENAMEABL1 -1.10 9.20 0.00 0.00 ABL proto-oncogene 1, non-receptor tyrABL1 -1.15 9.00 0.00 0.00 ABL proto-oncogene 1, non-receptor tyrABL1 -1.20 7.90 0.00 0.00 ABL proto-oncogene 1, non-receptor tyrKLF9 -1.78 8.62 0.00 0.00 Kruppel like factor 9AHNAK -1.35 8.44 0.00 0.00 AHNAK nucleoproteinTUBA4A -1.15 9.23 0.00 0.00 tubulin alpha 4aFYN -0.87 7.76 0.00 0.00 FYN proto-oncogene, Src family tyrosinCASP8 -1.00 8.04 0.00 0.00 caspase 8ZNF467 -0.48 7.14 0.00 0.00 zinc finger protein 467MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1

5.3 Limma for linear relations with a continuous variable

Testing for (linear) relations of gene expression with a (continuous) variable is typically doneusing regression. A modified t-test approach improves the results by penalizing small slopes. Themodified regressions can be applied using limmaReg.

R>

7

Page 8: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

6 Class prediction

There are many classification algorithms with profound conceptual and methodological differences.Given the differences between the methods,there’s probably no single classification method thatalways works best, but that certain methods perform better depending on the characteristics ofthe data.

On the other hand, these methods are all designed for the same purpose, namely maximizingclassification accuracy. They should consequently all pick up (the same) strong biological signalwhen present, resulting in similar outcomes.

Personally, we like to apply four different approaches; PAM, RandomForest, forward filteringin combination with various classifiers, and LASSO.

All four methods have the property that they search for the smallest set of genes while havingthe highest classification accuracy. The underlying rationale and algorithm is very different betweenthe four approaches, making their combined use potentially complementary.

6.1 PAM

PAM (Tibshirani 2002) applies univariate and dependent feature selection.

R> resultPam <- pamClass(selBcrAblOrNeg, "mol.biol")

R> plot(resultPam)

R> featResultPam <- topTable(resultPam, n = 15)

R> xtable(head(featResultPam$listGenes),

caption = "Top 5 features selected by PAM.")

0.1

0.2

0.3

0.4

0.5

number of genes

Mis

clas

sific

atio

n er

ror

0 3 3 4 4 5 5 6 8 14 15 18 23 33 43 58 76 98 116

159

213

294

390

553

752

1045

1340

1663

2021

2391

x

6.2 Random forest

Random forest with variable importance filtering (Breiman 2001, Diaz-Uriarte 2006) applies multi-variate and dependent feature selection. Be cautious when interpreting its outcome, as the obtainedresults are unstable and sometimes overoptimistic.

8

Page 9: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

R> resultRF <- rfClass(selBcrAblOrNeg, "mol.biol")

R> plot(resultRF, which = 2)

R> featResultRF <- topTable(resultRF, n = 15)

R> xtable(head(featResultRF$topList),

caption = "Features selected by Random Forest variable importance.")

GeneSymbol1635 at ABL1

1636 g at ABL133284 at MPO34216 at KLF737043 at ID337351 at UPP1

Table 2: Features selected by Random Forest variable importance.

R>

0.05

0.10

0.15

0.20

0.25

number of genes

Mis

clas

sific

atio

n er

ror

2 4 6 9 14 22 34 54 85 132

206

321

501

783

1224

1913

x

6.3 Forward filtering with various classifiers

Forward filtering in combination with various classifiers (like DLDA, SVM, random forest, etc.)apply an independent feature selection. The selection can be either univariate or multivariatedepending on the chosen selection algorithm; we usually choose Limma as a univariate althoughrandom forest variable importance could also be used as a multivariate selection criterium.

R> mcrPlot_TT <- mcrPlot(nlcvTT, plot = TRUE, optimalDots = TRUE,

layout = TRUE, main = "t-test selection")

9

Page 10: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

t−test selectionM

iscl

assi

ficat

ion

Rat

e

Number of features2 5 7 10 15 20 25 30 35

0.0

0.2

0.4

0.6

0.8

1.0 dldarandomForestbaggpamsvm

nFeat optim mean MCR sd MCRdlda 7.00 0.14 0.05

randomForest 3.00 0.15 0.05bagg 25.00 0.16 0.05pam 2.00 0.16 0.07svm 2.00 0.13 0.07

Table 3: Optimal number of genes per classification method together with the respective misclas-sification error rate (mean and standard deviation across all CV loops).

R> scoresPlot(nlcvTT, tech = "svm", nfeat = 2)

10

Page 11: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

0100

508

011

1200

612

026

2000

224

001

2401

727

003

2802

131

011

4900

662

003

8400

404

008

0600

209

017

1600

924

008

2500

628

005

2802

328

035

2804

333

005

4300

757

001

6800

1

0.0

0.2

0.4

0.6

0.8

1.0

prop

ortio

nFreq. of being correctly classified (svm, 2 feat.)

BCR/ABLNEG

0.5 <= score <= 1 0 <= score < 0.5

6.4 Penalized regression

LASSO (Tibshirani 2002) or elastic net (Zou 2005) apply multivariate and dependent featureselection.

R> resultLasso <- lassoClass(object = bcrAblOrNeg, groups = "mol.biol")

R> plot(resultLasso, label = TRUE,

main = "Lasso coefficients in relation to degree of penalization.")

R> featResultLasso <- topTable(resultLasso, n = 15)

11

Page 12: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

0 5 10 15 20

−3

−2

−1

0

Lasso coefficients in relation to degree of penalization.

L1 Norm

Coe

ffici

ents

0 20 30 34 35

389714

756

957

11732456

2772

3006

3236

3393

3437

3724

4034

41844250

4568

4723

4752

4840

5132

521154635887

6303

6565

6958

7082

70947352

7390

74748129

8190

8465

9230

9823

9930

10604

10683

10976

11353

11632

11927

Gene CoefficientITGA7 -3.80ABL1 -2.76TCL1B -2.26RAB32 -1.01CHD3 -0.77SERPINE2 -0.65NFATC1 -0.64ZNF467 -0.60YES1 -0.58ANXA1 -0.58PTDSS1 0.53PTPRJ -0.51F13A1 -0.49DSTN 0.47ALDH1A1 -0.46

Table 4: Features selected by Lasso, ranked from largest to smallest penalized coefficient.

12

Page 13: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

6.5 Logistic regression

Logistic regression is used for predicting the probability to belong to a certain class in binaryclassification problems.

R> logRegRes <- logReg(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol")

ABL1 (1635_at)

log2 intensity

Pro

babi

lity

of b

eing

NE

G

6 7 8 9

0.0

0.2

0.4

0.6

0.8

1.0

BCR/ABL

NEG

The obtained probabilities can be plotted with ProbabilitiesPlot. A horizontal line indicatesthe 50% threshold, and samples that have a higher probability than 50% are indicated with bluedots. Apparently, using the expression of the gene ABL1, quite a lot of samples predicted to witha high probability to be NEG, are indeed known to be NEG.

R> probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y,

sampleNames = rownames(logRegRes), main = "Probability of being NEG")

13

Page 14: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

2201

024

001

2700

312

0228

036

1200

643

001

2802

128

0103

0030

0049

0015

005

6200

262

001

0100

1200

784

0065

0028

024

1600

6800

124

008

1500

104

010

2804

228

043

2800

528

007

0802

2800

643

007

2200

2803

504

016

0801

228

023

2804

436

0026

001

0.0

0.2

0.4

0.6

0.8

1.0

Probability of being NEG

R> probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y, barPlot= TRUE,

sampleNames = rownames(logRegRes), main = "Probability of being NEG")

14

Page 15: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

2201

068

003

1202

624

022

4300

108

001

0300

227

004

1500

562

003

0100

520

002

6500

564

001

6800

125

006

0401

012

019

2800

533

005

2800

657

001

2803

528

047

2802

328

031

2600

1

0.0

0.2

0.4

0.6

0.8

6.6 Receiver operating curve

A ROC curve plots the fraction of true positives (TPR = true positive rate) versus the fraction offalse positives (FPR = false positive rate) for a binary classifier when the discrimination thresholdis varied. Equivalently, one can also plot sensitivity versus (1 - specificity).

R> ROCres <- ROCcurve(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol")

15

Page 16: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

ABL1 (1635_at)

Average false positive rate

Ave

rage

true

pos

itive

rat

e

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

6.09

6.84

7.59

8.34

9.09

9.84

7 Visualization of interesting genes

7.1 Plot the expression levels of one gene

Some potentially interesting genes can be visualized using plot1gene. Here the most significantgene is plotted.

R> plot1gene(probesetId = rownames(tTestResult)[1],

object = selBcrAblOrNeg, groups = "mol.biol", legendPos = "topright")

16

Page 17: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

ABL1 (1636_g_at)lo

g 2 in

tens

ity

7.5

8.0

8.5

9.0

9.5

10.0

10.501

005

0800

109

008

1200

612

012

1401

620

002

2201

324

010

2401

726

003

2700

428

021

3000

137

013

4900

662

002

6500

584

004

0400

704

010

0600

208

024

1201

916

009

2201

124

018

2500

628

001

2800

628

023

2803

128

037

2804

328

047

3600

243

007

4800

164

001

6800

1

labe

ls M

eans

BCR/ABLNEG

There are some variations possible on the default plot1gene function. For example, the labelsof x-axis can be changed or omitted.

R> plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg,

groups = "mol.biol", sampleIDs = "mol.biol", legendPos = "topright")

17

Page 18: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

ABL1 (1636_g_at)lo

g 2 in

tens

ity

7.5

8.0

8.5

9.0

9.5

10.0

10.5B

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LB

CR

/AB

LN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

NE

GN

EG

labe

ls M

eans

BCR/ABLNEG

Another option is to color the samples by another categorical variable than used for ordering.

R> plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg,

groups = "mol.biol", colgroups = 'BT', legendPos = "topright")

18

Page 19: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

ABL1 (1636_g_at)lo

g 2 in

tens

ity

7.5

8.0

8.5

9.0

9.5

10.0

10.501

005

0800

109

008

1200

612

012

1401

620

002

2201

324

010

2401

726

003

2700

428

021

3000

137

013

4900

662

002

6500

584

004

0400

704

010

0600

208

024

1201

916

009

2201

124

018

2500

628

001

2800

628

023

2803

128

037

2804

328

047

3600

243

007

4800

164

001

6800

1

labe

ls M

eans

BB1B2B3B4

The above graphs plot one sample per tickmark in the x-axis. This is very useful to explore thedata as one can directly identify interesting samples. If it is not interesting to know which samplehas which expression level, one may want to plot in the x-axis not the samples but the groups ofinterest. It is possible to pass arguments to the boxplot function to custopmize the graph. Forexample the boxwex argument allows to reduce the width of the boxes in the plot.

R> boxPlot(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, boxwex = 0.3,

groups = "mol.biol", colgroups = "BT", legendPos = "topright")

19

Page 20: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

BCR/ABL NEG

7.5

8.0

8.5

9.0

9.5

10.0

10.5

ABL1 (1636_g_at)

groups

log 2

con

cent

ratio

n

BB1B2B3B4

7.2 Plot the expression levels of two genes versus each other

R> plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1],

geneSymbol2 = featResultLasso$topList[2, 1],

object = bcrAblOrNeg, groups = "mol.biol",

main = "Combination of\nfirst and second gene", addLegend = TRUE,

legendPos = "topright")

20

Page 21: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

6.0 6.5 7.0

6

7

8

9

Combination offirst and second gene

ITGA7

AB

L1

BCR/ABLNEG

7.3 Plot expression line profiles of multiple genes/probesets across sam-ples

Multiple genes can be plotted simultaneously on a graph using line profiles. Each line reflects onegene and are colored differenly. As an example, here three probesets that measure the gene LCK,found to be differentially expressed between B- and T-cells. Apparently, one probeset does notmeasure the gene appropriately.

R> myGeneSymbol <- "LCK"

R> probesetPos <- which(myGeneSymbol == featureData(ALL)$SYMBOL)

R> myProbesetIds <- featureNames(ALL)[probesetPos]

R> profilesPlot(object = ALL, probesetIds = myProbesetIds,

orderGroups = "BT", sampleIDs = "BT")

21

Page 22: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

log 2

con

cent

ratio

n

5

6

7

8

9

10

11

B B B1

B1

B1

B1

B1

B1

B2

B2

B2

B2

B2

B2

B2

B2

B2

B2

B2

B2

B3

B3

B3

B3

B3

B3

B3

B3

B4

B4

B4

B4 T T T2

T2

T2

T2

T2

T3

T3

T3

T4

1266_s_at2059_s_at33238_at

22

Page 23: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

7.4 Smoothscatter plots

It may be of interest to look at correlations between samples. As each dot represents a gene, thereare typically many dots. It is therefore wise to color the dots in a density dependent way.

R> plotComb2Samples(ALL, "11002", "01003",

xlab = "a T-cell", ylab = "another T-cell")

Figure 1: Correlations in gene expression profiles between two T-cell samples (samples 11002 and01003).

23

Page 24: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

If there are outlying genes, one can label them by their gene symbol by specifying the expressionintervals (X- or Y- axis or both) that contain the genes to be highlighted using trsholdX andtrsholdY.

R> plotComb2Samples(ALL, "84004", "01003",

trsholdX = c(10,12), trsholdY = c(4,6),

xlab = "a B-cell", ylab = "a T-cell")

Figure 2: Correlations in gene expression profiles between a B-cell and a T-cell (samples 84004 and01003). Some potentially interesting genes are indicated by their gene symbol.

24

Page 25: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

One can also show multiple pairwise comparisons in a pairwise scatterplot matrix.

R> plotCombMultSamples(exprs(ALL)[,c("84004", "11002", "01003")])

R> # text.panel= function(x){x, labels = c("a B-cell", "a T-cell", "another T-cell")})

Figure 3: Correlations in gene expression profiles between a B-cell and two T-cell samples (respec-tively samples 84004, 11002 and 01003).

25

Page 26: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

7.5 Gene lists of log ratios

When analyzing treatments that are primarily interesting relative to a control treatment, it maybe of value to look at the log ratios of several treatments (in columns) for a selected list of genes(in rows).

R> ALL$BTtype <- as.factor(substr(ALL$BT,0,1))

R> ALL2 <- ALL[,ALL$BT != 'T1'] # omit subtype T1 as it only contains one sample

R> ALL2$BTtype <- as.factor(substr(ALL2$BT,0,1)) # create a vector with only T and B

R> # Test for differential expression between B and T cells

R> tTestResult <- tTest(ALL, "BTtype", probe2gene = FALSE)

R> topGenes <- rownames(tTestResult)[1:20]

R> # plot the log ratios versus subtype B of the top genes

R> LogRatioALL <- computeLogRatio(ALL2, reference = list(var="BT", level="B"))

R> a <- plotLogRatio(e = LogRatioALL[topGenes,], openFile = FALSE, tooltipvalues = FALSE,

device = "pdf", filename = "GeneLRlist",

colorsColumnsBy = "BTtype",

main = 'Top 20 genes most differentially between T- and B-cells',

orderBy = list(rows = "hclust"), probe2gene = TRUE)

Top 20 genes most differentially between T− and B−cells

B1 B2 B3 B4 T T2 T3 T4

HLA−DPB1 − major histocompatibility complex, class

HLA−DPB1 − major histocompatibility complex, class

CD19 − CD19 molecule

BLNK − B cell linker

CD79B − CD79b molecule

CD9 − CD9 molecule

NA − NA

CD74 − CD74 molecule

HLA−DRA − major histocompatibility complex, class

IGHM − immunoglobulin heavy constant mu

HLA−DPA1 − major histocompatibility complex, class

HLA−DMA − major histocompatibility complex, class

LCK − LCK proto−oncogene, Src family tyrosine kina

LCK − LCK proto−oncogene, Src family tyrosine kina

TRAT1 − T cell receptor associated transmembrane a

PRKCQ − protein kinase C theta

CD3G − CD3g molecule

CD3D − CD3d molecule

SH2D1A − SH2 domain containing 1A

YME1L1 − YME1 like 1 ATPase

Error bars show the pooled standard deviation

Tue Oct 27 18:15:12 2020 ; R version 4.0.3 (2020−10−10) ; Biobase version 2.50.0

Figure 4: Log ratios of the 20 genes that are most differentially expressed between B-cell and twoT-cells.

The following example demonstrates how to display log ratios for four compounds for whichgene expression was measured on four timepoints.

R> load(system.file("extdata", "esetExampleTimeCourse.rda", package = "a4"))

R> logRatioEset <- computeLogRatio(esetExampleTimeCourse, within = "hours",

reference = list(var = "compound", level = "DMSO"))

R> # re-order

R> idx <- order(pData(logRatioEset)$compound, pData(logRatioEset)$hours)

R> logRatioEset <- logRatioEset[,idx]

R> # plot LogRatioEset across all

R> cl <- "TEST"

R> compound <- "COMPOUND"

R> shortvarnames <- unique(interaction(pData(logRatioEset)$compound, pData(logRatioEset)$hours))

R> shortvarnames <- shortvarnames[-grep("DMSO", shortvarnames), drop=TRUE]

R> plotLogRatio(e = logRatioEset, mx = 1, filename = "logRatioOverallTimeCourse.pdf",

gene.fontsize = 8,

orderBy = list(rows = "hclust", cols = NULL), colorsColumnsBy = c('compound'),

within = "hours", shortvarnames = shortvarnames, exp.width = 1,

main = paste("Differential Expression (trend at early time points) in",

cl, "upon treatment with", compound),

reference = list(var = "compound", level = "DMSO"), device = 'pdf')

26

Page 27: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

Differential Expression (trend at early time points) in TEST upon treatment with COMPOUND

A.2 A.4 A.6 A.22 B.2 B.4 B.6 B.22 C.2 C.4 C.6 C.22 D.2 D.4 D.6 D.22

ARRDC4 − arrestin domain containing 4EGR1 − early growth response 1SPRY4 − sprouty homolog 4 (Drosophila)BHLHE40 − basic helix−loop−helix family, member e4ZFP36 − zinc finger protein 36, C3H type, homolog DUSP5 − dual specificity phosphatase 5PLAUR − plasminogen activator, urokinase receptorSERPINE1 − serpin peptidase inhibitor, clade E (neINSIG1 − insulin induced gene 1IER3 − immediate early response 3NPC1 − Niemann−Pick disease, type C1LDLR − low density lipoprotein receptorHMGCS1 − 3−hydroxy−3−methylglutaryl−Coenzyme A synSC4MOL − sterol−C4−methyl oxidase−likeLPIN1 − lipin 1HMGCR − 3−hydroxy−3−methylglutaryl−Coenzyme A reduFDFT1 − farnesyl−diphosphate farnesyltransferase 1DHCR7 − 7−dehydrocholesterol reductaseIDI1 − isopentenyl−diphosphate delta isomerase 1PCSK9 − proprotein convertase subtilisin/kexin typ

Error bars show the pooled standard deviation

Tue Oct 27 18:15:12 2020 ; R version 4.0.3 (2020−10−10) ; Biobase version 2.50.0

Figure 5: Log ratios for four compounds at four time points (for 20 genes).

27

Page 28: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

8 Pathway analysis

8.1 Minus log p

The MLP method is one method of pathway analysis that is commonly used by the a4 suite userbase. Although the method is explained in detail in the MLP package vignette we briefly walkthrought the analysis steps using the same example dataset used in the preceding parts of theanalysis. In order to detect whether certain gene sets are enriched in genes with low p values, weobtain the vector of p values for the genes and the corresponding relevant gene sets:

R> require(MLP)

R> # create groups

R> labels <- as.factor(ifelse(regexpr("^B", as.character(pData(ALL)$BT))==1, "B", "T"))

R> pData(ALL)$BT2 <- labels

R> # generate p-values

R> limmaResult <- limmaTwoLevels(object = ALL, group = "BT2")

R> pValues <- limmaResult@MArrayLM$p.value

R> pValueNames <- fData(ALL)[rownames(pValues), 'ENTREZID']

R> pValues <- pValues[,2]

R> names(pValues) <- pValueNames

R> pValues <- pValues[!is.na(pValueNames)]

R> geneSet <- getGeneSets(species = "Human",

geneSetSource = "GOBP",

entrezIdentifiers = names(pValues)

)

R> tail(geneSet, 3)

$`GO:2001303`

[1] "239" "246" "247"

$`GO:2001304`

[1] "239"

$`GO:2001306`

[1] "239"

Next, we run the MLP analysis:

R> mlpOut <- MLP(

geneSet = geneSet,

geneStatistic = pValues,

minGenes = 5,

maxGenes = 100,

rowPermutations = TRUE,

nPermutations = 50,

smoothPValues = TRUE,

probabilityVector = c(0.5, 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999),

df = 9)

The results can be visualized in many ways, but for Gene Ontology based gene set definitions,the following graph may be useful:

R> library(Rgraphviz)

R> library(GOstats)

R> pdf(file = "GOgraph.pdf")

R> plot(mlpOut, type = "GOgraph", nRow = 25)

R> dev.off()

28

Page 29: Using the a4 package - rdrr.io · 2021. 3. 10. · MX1 1.41 6.73 0.00 0.00 MX dynamin like GTPase 1 5.3 Limma for linear relations with a continuous variable Testing for (linear)

9 Software used

� R version 4.0.3 (2020-10-10), x86_64-pc-linux-gnu

� Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C,LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8,LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8,LC_IDENTIFICATION=C

� Running under: Ubuntu 18.04.5 LTS

� Matrix products: default

� BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so

� LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

� Base packages: base, datasets, grDevices, graphics, methods, parallel, stats, stats4, utils

� Other packages: ALL 1.31.0, AnnotationDbi 1.52.0, Biobase 2.50.0, BiocGenerics 0.36.0,GO.db 3.12.0, IRanges 2.24.0, MASS 7.3-53, MLInterfaces 1.70.0, MLP 1.38.0, Rcpp 1.0.5,S4Vectors 0.28.0, XML 3.99-0.5, a4 1.38.0, a4Base 1.38.0, a4Classif 1.38.0, a4Core 1.38.0,a4Preproc 1.38.0, a4Reporting 1.38.0, affy 1.68.0, annotate 1.68.0, cluster 2.1.0,gdata 2.18.0, gmodels 2.18.1, gplots 3.1.0, gtools 3.8.2, hgu95av2.db 3.2.3, nlcv 0.3.5,org.Hs.eg.db 3.12.0, plotrix 3.7-8, xtable 1.8-4

� Loaded via a namespace (and not attached): BiocManager 1.30.10, DBI 1.1.0,KernSmooth 2.23-17, Matrix 1.2-18, R6 2.4.1, RColorBrewer 1.1-2, ROCR 1.0-11,RSQLite 2.2.1, affyio 1.60.0, annaffy 1.62.0, bit 4.0.4, bit64 4.0.5, bitops 1.0-6, blob 1.2.1,caTools 1.18.0, class 7.3-17, codetools 0.2-16, compiler 4.0.3, digest 0.6.27, e1071 1.7-4,foreach 1.5.1, genefilter 1.72.0, glmnet 4.0-2, grid 4.0.3, httr 1.4.2, ipred 0.9-9,iterators 1.0.13, kernlab 0.9-29, lattice 0.20-41, lava 1.6.8, limma 3.46.0, memoise 1.1.0,mpm 1.0-22, multtest 2.46.0, nnet 7.3-14, pamr 1.56.1, pkgconfig 2.0.3,preprocessCore 1.52.0, prodlim 2019.11.13, randomForest 4.6-14, rlang 0.4.8, rpart 4.1-15,shape 1.4.5, splines 4.0.3, survival 3.2-7, tools 4.0.3, varSelRF 0.7-8, vctrs 0.3.4,zlibbioc 1.36.0

29