26
Gene set enrichment analysis with topGO Adrian Alexa, J¨ org Rahnenf¨ uhrer December 16, 2019 http://www.mpi-sb.mpg.de/alexa Contents 1 Introduction 2 2 Instalation 2 3 Quick start guide 3 3.1 Data preparation ........................................... 3 3.2 Performing the enrichment tests .................................. 4 3.3 Analysis of results .......................................... 5 4 Loading genes and annotations data 8 4.1 Getting started ............................................ 8 4.2 The topGOdata object ........................................ 8 4.3 Custom annotations ......................................... 9 4.4 Predefined list of interesting genes ................................. 10 4.5 Using the genes score ......................................... 11 4.6 Filtering and missing GO annotations ............................... 12 5 Working with the topGOdata object 14 6 Running the enrichment tests 16 6.1 Defining and running the test .................................... 17 6.2 The adjustment of p-values ..................................... 19 6.3 Adding a new test .......................................... 19 6.4 runTest: a high-level interface for testing ............................. 19 7 Interpretation and visualization of results 20 7.1 The topGOresult object ....................................... 20 7.2 Summarising the results ....................................... 21 7.3 Analysing individual GOs ...................................... 22 7.4 Visualising the GO structure .................................... 23 8 Session Information 26 References 26 1

Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

Gene set enrichment analysis with topGO

Adrian Alexa, Jorg Rahnenfuhrer

December 16, 2019http://www.mpi-sb.mpg.de/∼alexa

Contents

1 Introduction 2

2 Instalation 2

3 Quick start guide 3

3.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3.2 Performing the enrichment tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.3 Analysis of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4 Loading genes and annotations data 8

4.1 Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.2 The topGOdata object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.3 Custom annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4.4 Predefined list of interesting genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.5 Using the genes score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.6 Filtering and missing GO annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

5 Working with the topGOdata object 14

6 Running the enrichment tests 16

6.1 Defining and running the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

6.2 The adjustment of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6.3 Adding a new test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6.4 runTest: a high-level interface for testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

7 Interpretation and visualization of results 20

7.1 The topGOresult object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

7.2 Summarising the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

7.3 Analysing individual GOs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7.4 Visualising the GO structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

8 Session Information 26

References 26

1

Page 2: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

1 Introduction

The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology (GO)terms. The process consists of input of normalised gene expression measurements, gene-wise correlationor differential expression analysis, enrichment analysis of GO terms, interpretation and visualisation of theresults.

One of the main advantages of topGO is the unified gene set testing framework it offers. Besides providingan easy to use set of functions for performing GO enrichment analysis, it also enables the user to easilyimplement new statistical tests or new algorithms that deal with the GO graph structure. This unifiedframework also facilitates the comparison between different GO enrichment methodologies.

There are a number of test statistics and algorithms dealing with the GO graph structured ready to use intopGO. Table 1 presents the compatibility table between the test statistics and GO graph methods.

fisher ks t globaltest sumclassicelimweightweight01leaparentchild

Table 1: Algorithms currently supported by topGO.

The elim and weight algorithms were introduced in Alexa et al. (2006). The default algorithm used by thetopGO package is a mixture between the elim and the weight algorithms and it will be referred as weight01.The parentChild algorithm was introduced by Grossmann et al. (2007).

We assume the user has a good understanding of GO, see Consortium (2001), and is familiar with gene setenrichment tests. Also this document requires basic knowledge of R language.

The next section presents a quick tour into topGO and is thought to be independent of the rest of thismanuscript. The remaining sections provide details on the functions used in the sample section as well asshowing more advance functionality implemented in the topGO package.

2 Instalation

This section briefly describe the necessary to get topGO running on your system. We assume that the userhave the R program (see the R project at http://www.r-project.org) already installed and its familiar withit. You will need to have R 2.10.0 or later to be able to install and run topGO.

The topGO package is available from the Bioconductor repository at http://www.bioconductor.org To beable to install the package one needs first to install the core Bioconductor packages. If you have alreadyinstalled Bioconductor packages on your system then you can skip the two lines below.

> if (!requireNamespace("BiocManager", quietly=TRUE))

+ install.packages("BiocManager")

> BiocManager::install()

Once the core Bioconductor packages are installed, we can install the topGO package by

> if (!requireNamespace("BiocManager", quietly=TRUE))

+ install.packages("BiocManager")

> BiocManager::install("topGO")

Page 3: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

3 Quick start guide

This section describes a simple working session using topGO. There are only a handful of commands necessaryto perform a gene set enrichment analysis which will be briefly presented below.

A typical session can be divided into three steps:

1. Data preparation: List of genes identifiers, gene scores, list of differentially expressed genes or a criteriafor selecting genes based on their scores, as well as gene-to-GO annotations are all collected and storedin a single R object.

2. Running the enrichment tests: Using the object created in the first step the user can perform enrichmentanalysis using any feasible mixture of statistical tests and methods that deal with the GO topology.

3. Analysis of the results: The results obtained in the second step are analysed using summary functionsand visualisation tools.

Before going through each of those steps the user needs to decide which biological question he would liketo investigate. The aim of the study, as well as the nature of the available data, will dictate which teststatistic/methods need to used.

In this section we will test the enrichment of GO terms with differentially expressed genes using two statisticaltests, namely Kolmogorov-Smirnov test and Fisher’s exact test.

3.1 Data preparation

In the first step a convenient R object of class topGOdata is created containing all the information requiredfor the remaining two steps. The user needs to provide the gene universe, GO annotations and either acriteria for selecting interesting genes (e.g. differentially expressed genes) from the gene universe or a scoreassociated with each gene.

In this session we will test the enrichment of GO terms with differentially expressed genes. Thus, the startingpoint is a list of genes and the respective p-values for differential expression. A toy example of a list of genep-values is provided by the geneList object.

> library(topGO)

> library(ALL)

> data(ALL)

> data(geneList)

The geneList data is based on a differential expression analysis of the ALL(Acute Lymphoblastic Leukemia)dataset that was extensively studied in the literature on microarray analysis Chiaretti, S., et al. (2004). Ourtoy example contains just a small amount, 323, of genes and their corresponding p-values. The next dataone needs are the gene groups itself, the GO terms in our case, and the mapping that associate each genewith one or more GO term(s). The information on where to find the GO annotations is stored in the ALL

object and it is easily accessible.

> affyLib <- paste(annotation(ALL), "db", sep = ".")

> library(package = affyLib, character.only = TRUE)

The microarray used in the experiment is the hgu95av2 from Affymetrix, as we can see from the affyLib

object. When we loaded the geneList object a selection function used for defining the list of differentiallyexpressed genes is also loaded under the name of topDiffGenes. The function assumes that the providedargument is a named vector of p-values. With the help of this function we can see that there are 50 geneswith a raw p-value less than 0.01 out of a total of 323 genes.

> sum(topDiffGenes(geneList))

[1] 50

Page 4: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

We now have all data necessary to build an object of type topGOdata. This object will contain all geneidentifiers and their scores, the GO annotations, the GO hierarchical structure and all other informationneeded to perform the desired enrichment analysis.

> sampleGOdata <- new("topGOdata",

+ description = "Simple session", ontology = "BP",

+ allGenes = geneList, geneSel = topDiffGenes,

+ nodeSize = 10,

+ annot = annFUN.db, affyLib = affyLib)

The names of the arguments used for building the topGOdata object should be self-explanatory. We quicklymention that nodeSize = 10 is used to prune the GO hierarchy from the terms which have less than 10annotated genes and that annFUN.db function is used to extract the gene-to-GO mappings from the affyLibobject. Section 4.2 describes the parameters used to build the topGOdata in details.

A summary of the sampleGOdata object can be seen by typing the object name at the R prompt. Having allthe data stored into this object facilitates the access to identifiers, annotations and to basic data statistics.

> sampleGOdata

3.2 Performing the enrichment tests

Once we have an object of class topGOdata we can start with the enrichment analysis. We will use twotypes of test statistics: Fisher’s exact test which is based on gene counts, and a Kolmogorov-Smirnov liketest which computes enrichment based on gene scores. We can use both these tests since each gene has ascore (representing how differentially expressed a gene is) and by the means of topDiffGenes functions thegenes are categorized into differentially expressed or not differentially expressed genes. All these are storedinto sampleGOdata object.

The function runTest is used to apply the specified test statistic and method to the data. It has three mainarguments. The first argument needs to be an object of class topGOdata. The second and third argumentare of type character; they specify the method for dealing with the GO graph structure and the test statistic,respectively.

First, we perform a classical enrichment analysis by testing the over-representation of GO terms within thegroup of differentially expressed genes. For the method classic each GO category is tested independently.

> resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")

runTest returns an object of class topGOresult. A short summary of this object is shown below.

> resultFisher

Description: Simple session

Ontology: BP

'classic' algorithm with the 'fisher' test

1110 GO terms scored: 48 terms with p < 0.01

Annotation data:

Annotated genes: 310

Significant genes: 46

Min. no. of genes annotated to a GO: 10

Nontrivial nodes: 988

Next we will test the enrichment using the Kolmogorov-Smirnov test. We will use the both the classic andthe elim method.

> resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")

> resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")

Page 5: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

GO.ID Term Annotated Significant Expected Rank in classicFisher classicFisher classicKS elimKS1 GO:0051301 cell division 145 16 21.52 952 0.97383 1.0e-07 3.1e-072 GO:0031668 cellular response to extracellular stimu... 12 8 1.78 1 4.2e-05 0.00013 0.000133 GO:0010389 regulation of G2/M transition of mitotic... 30 7 4.45 260 0.13535 0.00019 0.000194 GO:0051726 regulation of cell cycle 134 17 19.88 812 0.86271 2.2e-05 0.000675 GO:0140014 mitotic nuclear division 90 6 13.35 982 0.99838 0.00177 0.001776 GO:0050851 antigen receptor-mediated signaling path... 11 7 1.63 7 0.00021 0.00208 0.002087 GO:0051276 chromosome organization 88 7 13.06 969 0.99261 0.00218 0.002188 GO:0048638 regulation of developmental growth 13 3 1.93 426 0.30050 0.00261 0.002619 GO:1900221 regulation of amyloid-beta clearance 10 5 1.48 42 0.00827 0.00287 0.00287

10 GO:0000278 mitotic cell cycle 147 14 21.81 978 0.99648 8.6e-07 0.00339

Table 2: Significance of GO terms according to classic and elim methods.

Please note that not all statistical tests work with every method. The compatibility matrix between themethods and statistical tests is shown in Table 1.

The p-values computed by the runTest function are unadjusted for multiple testing. We do not advocateagainst adjusting the p-values of the tested groups, however in many cases adjusted p-values might bemisleading.

3.3 Analysis of results

After the enrichment tests are performed the researcher needs tools for analysing and interpreting the results.GenTable is an easy to use function for analysing the most significant GO terms and the corresponding p-values. In the following example, we list the top 10 significant GO terms identified by the elim method. Atthe same time we also compare the ranks and the p-values of these GO terms with the ones obtained by theclassic method.

> allRes <- GenTable(sampleGOdata, classicFisher = resultFisher,

+ classicKS = resultKS, elimKS = resultKS.elim,

+ orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 10)

The GenTable function returns a data frame containing the top topNodes GO terms identified by the elimalgorithm, see orderBy argument. The data frame includes some statistics on the GO terms and the p-valuescorresponding to each of the topGOresult object specified as arguments. Table 2 shows the results.

For accessing the GO term’s p-values from a topGOresult object the user should use the score functions.As a simple example, we look at the differences between the results of the classic and the elim methods inthe case of the Kolmogorov-Smirnov test. The elim method was design to be more conservative then theclassic method and therefore one expects the p-values returned by the former method are lower boundedby the p-values returned by the later method. The easiest way to visualize this property is to scatter plotthe two sets of p-values against each other.

> pValue.classic <- score(resultKS)

> pValue.elim <- score(resultKS.elim)[names(pValue.classic)]

> gstat <- termStat(sampleGOdata, names(pValue.classic))

> gSize <- gstat$Annotated / max(gstat$Annotated) * 4

> gCol <- colMap(gstat$Significant)

> plot(pValue.classic, pValue.elim, xlab = "p-value classic", ylab = "p-value elim",

+ pch = 19, cex = gSize, col = gCol)

We can see in Figure 1 that there are indeed differences between the two methods. Some GO terms foundsignificant by the classic method are less significant in the elim, as expected. However, in some cases, we canvisible identify a few GO terms for which the elim p-value is less conservative then the classic p-value. If suchGO terms exist, we can identify them and find the number of annotated genes:

> sel.go <- names(pValue.classic)[pValue.elim < pValue.classic]

> cbind(termStat(sampleGOdata, sel.go),

+ elim = pValue.elim[sel.go],

+ classic = pValue.classic[sel.go])

Page 6: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●●

●●

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

p−value classic

p−va

lue

elim

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

● ●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●●●

●●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●●●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●●

●●

●●●

●●

●●

●●●

●●●

●●●

●●

1e−11 1e−08 1e−05 1e−02

1e−

061e

−04

1e−

021e

+00

log(p−value) classic

log(

p−va

lue)

elim

Figure 1: p-values scatter plot for the classic (x axis) and elim (y axis) methods. On the right panel the p-valuesare plotted on a linear scale. The left panned plots the same p-values on a logarithmic scale. The size of the dotis proportional with the number of annotated genes for the respective GO term and its coloring represents thenumber of significantly differentially expressed genes, with the dark red points having more genes then the yellowones.

[1] Annotated Significant Expected elim classic

<0 rows> (or 0-length row.names)

It is quite interesting that such cases appear - the above table will not be empty. These GO terms are rathergeneral (having many annotated genes) and their p-values are not significant at the 0.05 level. Thereforethese GO terms would not appear in the list of top significant terms. More significant GO terms are lesslikely to be influenced by this non monotonic behavior.

Another insightful way of looking at the results of the analysis is to investigate how the significant GO termsare distributed over the GO graph. Figure 2 shows the the subgraph induced by the 5 most significant GOterms as identified by the elim algorithm. Significant nodes are represented as rectangles. The plotted graphis the upper induced graph generated by these significant nodes.

> showSigOfNodes(sampleGOdata, score(resultKS.elim), firstSigNodes = 5, useInfo = 'all')

Page 7: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

GO:0000086G2/M transition of m...

0.5526447 / 39

GO:0000278mitotic cell cycle

0.00338714 / 147

GO:0000280nuclear division

0.2555957 / 102

GO:0006996organelle organizati...

0.04883920 / 175

GO:0007049cell cycle0.00461825 / 195

GO:0007154cell communication

0.51866725 / 134

GO:0007346regulation of mitoti...

0.03480311 / 90

GO:0008150biological_process

1.00000046 / 310

GO:0009605response to external...

0.48269417 / 69

GO:0009987cellular process

0.86030346 / 309

GO:0009991response to extracel...

0.46290310 / 20

GO:0010389regulation of G2/M t...

0.0001907 / 30

GO:0010564regulation of cell c...

0.01367012 / 103

GO:0016043cellular component o...

0.26616526 / 202

GO:0022402cell cycle process

0.00814418 / 153

GO:0031668cellular response to...

0.0001318 / 12

GO:0044770cell cycle phase tra...

0.02264013 / 82

GO:0044772mitotic cell cycle p...

0.03557710 / 77

GO:0044839cell cycle G2/M phas...

0.4191827 / 43

GO:0048285organelle fission

0.2555957 / 102

GO:0050789regulation of biolog...

0.85368839 / 241

GO:0050794regulation of cellul...

0.88548338 / 236

GO:0050896response to stimulus

0.55520132 / 183

GO:0051301cell division3.09e−0716 / 145

GO:0051716cellular response to...

0.37210829 / 158

GO:0051726regulation of cell c...

0.00067417 / 134

GO:0065007biological regulatio...

0.91169240 / 251

GO:0071496cellular response to...

0.8073188 / 15

GO:0071840cellular component o...

0.33512027 / 204

GO:0140014mitotic nuclear divi...

0.0017706 / 90

GO:1901987regulation of cell c...

0.15597010 / 59

GO:1901990regulation of mitoti...

0.15597010 / 59

GO:1902749regulation of cell c...

0.7913527 / 33

GO:1903047mitotic cell cycle p...

0.05039512 / 129

Figure 2: The subgraph induced by the top 5 GO terms identified by the elim algorithm for scoring GO terms forenrichment. Rectangles indicate the 5 most significant terms. Rectangle color represents the relative significance,ranging from dark red (most significant) to bright yellow (least significant). For each node, some basic informationis displayed. The first two lines show the GO identifier and a trimmed GO name. In the third line the raw p-valueis shown. The forth line is showing the number of significant genes and the total number of genes annotated tothe respective GO term.

Page 8: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

4 Loading genes and annotations data

4.1 Getting started

To demonstrate the package functionality we will use the ALL(Acute Lymphoblastic Leukemia) gene expres-sion data from Chiaretti, S., et al. (2004). The dataset consists of 128 microarrays from different patientswith ALL measured using the HGU95aV2 Affymetrix chip. Additionally, custom annotations and artificialdatasets will be used to demonstrate specific features.

We first load the required libraries and data:

> library(topGO)

> library(ALL)

> data(ALL)

When the topGO package is loaded three environments GOBPTerm, GOMFTerm and GOCCTerm are created andbound to the package environment. These environments are build based on the GOTERM environment frompackage GO.db. They are used for fast recovering of the information specific to each of the three ontologies:BP, MF and CC. In order to access all GO groups that belong to a specific ontology, e.g. Biological Process(BP), one can type:

> BPterms <- ls(GOBPTerm)

> head(BPterms)

[1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000011" "GO:0000012" "GO:0000017"

Usually one needs to remove probes/genes with low expression value as well as probes with very smallvariability across samples. Package genefilter provides tools for filtering genes. In this analysis we chooseto filter as many genes as possible for computational reasons; working with a smaller gene universe allows usto exemplify more of the functionalities implemented in the topGO package and at the same time allows thisdocument to be compiled in a relatively short time. The effect of gene filtering is discussed in more detailsin Section 4.6.

> library(genefilter)

> selProbes <- genefilter(ALL, filterfun(pOverA(0.20, log2(100)), function(x) (IQR(x) > 0.25)))

> eset <- ALL[selProbes, ]

The filter selects only 4101 probesets out of 12625 probesets available on the hgu95av2 array.

The gene universe and the set of interesting genesThe set of all genes from the array/study will be referred from now on as the gene universe. Having thegene universe, the user can define a list of interesting genes or to compute gene-wise scores that quantify thesignificance of each gene. When gene-wise scores are available the list of interesting genes is defined to bethe set of gene with a significant score. The topGO package deals with these two cases in a unified way oncethe main data container, the topGOdata object, is constructed. The only time the user needs to distinguishbetween these two cases is during the construction of the data container.

Usually, the gene universe is defined as all feasible genes measured by the microarray. In the case of the ALLdataset we have 4101 feasible genes, the ones that were not removed by the filtering procedure.

4.2 The topGOdata object

The central step in using the topGO package is to create a topGOdata object. This object will contain allinformation necessary for the GO analysis, namely the list of genes, the list of interesting genes, the genescores (if available) and the part of the GO ontology (the GO graph) which needs to be used in the analysis.The topGOdata object will be the input of the testing procedures, the evaluation and visualisation functions.

To build such an object the user needs the following:

Page 9: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

� A list of gene identifiers and optionally the gene-wise scores. The score can be the t-test statistic (orthe p-value) for differential expression, correlation with a phenotype, or any other relevant score.

� A mapping between gene identifiers and GO terms. In most cases this mapping is directly available inBioconductor as a microarray specific annotation package. In this case the user just needs to specifythe name of the annotation to be used. For example, the annotation package needed for the ALLdataset is hgu95av2.db.

Of course, Bioconductor does not include up-to-date annotation packages for all platforms. Users whowork with custom arrays or wish to use a specific mapping between genes and GO terms, have thepossibility to load custom annotations. This is described in Section 4.3.

� The GO hierarchical structure. This structure is obtained from the GO.db package. At the momenttopGO supports only the ontology definition provided by GO.db.

We further describe the arguments of the initialize function (new) used to construct an instance of thisdata container object.

ontology: character string specifying the ontology of interest (BP, MF or CC)

description: character string containing a short description of the study [optional].

allGenes: named vector of type numeric or factor. The names attribute contains the genes identifiers. Thegenes listed in this object define the gene universe.

geneSelectionFun: function to specify which genes are interesting based on the gene scores. It should bepresent iff the allGenes object is of type numeric.

nodeSize: an integer larger or equal to 1. This parameter is used to prune the GO hierarchy from the termswhich have less than nodeSize annotated genes (after the true path rule is applied).

annotationFun: function which maps genes identifiers to GO terms. There are a couple of annotationfunction included in the package trying to address the user’s needs. The annotation functions takethree arguments. One of those arguments is specifying where the mappings can be found, and needsto be provided by the user. Here we give a short description of each:

annFUN.db this function is intended to be used as long as the chip used by the user has an annotationpackage available in Bioconductor.

annFUN.org this function is using the mappings from the ”org.XX.XX”annotation packages. Currently,the function supports the following gene identifiers: Entrez, GenBank, Alias, Ensembl, GeneSymbol, GeneName and UniGene.

annFUN.gene2GO this function is used when the annotations are provided as a gene-to-GOs mapping.

annFUN.GO2gene this function is used when the annotations are provided as a GO-to-genes mapping.

annFUN.file this function will read the annotationsof the type gene2GO or GO2genes from a textfile.

...: list of arguments to be passed to the annotationFun

4.3 Custom annotations

This section describes how custom GO annotations can be used for building a topGOdata object.

Annotations need to be provided either as gene-to-GOs or as GO-to-genes mappings. An example of suchmapping can be found in the ”topGO/examples” directory. The file ”geneid2go.map” contains gene-to-GOsmappings. For each gene identifier are listed the GO terms to which this gene is specifically annotated. Weuse the readMappings function to parse this file.

> geneID2GO <- readMappings(file = system.file("examples/geneid2go.map", package = "topGO"))

> str(head(geneID2GO))

Page 10: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

List of 6

$ 068724: chr [1:5] "GO:0005488" "GO:0003774" "GO:0001539" "GO:0006935" ...

$ 119608: chr [1:6] "GO:0005634" "GO:0030528" "GO:0006355" "GO:0045449" ...

$ 049239: chr [1:13] "GO:0016787" "GO:0017057" "GO:0005975" "GO:0005783" ...

$ 067829: chr [1:16] "GO:0045926" "GO:0016616" "GO:0000287" "GO:0030145" ...

$ 106331: chr [1:10] "GO:0043565" "GO:0000122" "GO:0003700" "GO:0005634" ...

$ 214717: chr [1:7] "GO:0004803" "GO:0005634" "GO:0008270" "GO:0003677" ...

The object returned by readMappings is a named list of character vectors. The list names give the genesidentifiers. Each element of the list is a character vector and contains the GO identifiers annotated to thespecific gene. It is sufficient for the mapping to contain only the most specific GO annotations. However,topGO can also take as an input files in which all or some ancestors of the most specific GO annotations areincluded. This redundancy is not making for a faster running time and if possible it should be avoided.

The user can read the annotations from text files or they can build an object such as geneID2GO directlyinto R. The text file format required by the readMappings function is very simple. It consists of one line foreach gene with the following syntax:

gene_ID<TAB>GO_ID1, GO_ID2, GO_ID3, ....

Reading GO-to-genes mappings from a file is also possible using the readMappings function. However, itis the user responsibility to know the direction of the mappings. The user can easily transform a mappingfrom gene-to-GOs to GO-to-genes (or vice-versa) using the function inverseList:

> GO2geneID <- inverseList(geneID2GO)

> str(head(GO2geneID))

List of 6

$ GO:0000122: chr "106331"

$ GO:0000139: chr [1:6] "133103" "111846" "109956" "161395" ...

$ GO:0000166: chr [1:10] "067829" "157764" "100302" "074582" ...

$ GO:0000186: chr "181104"

$ GO:0000209: chr "159461"

$ GO:0000228: chr "214717"

4.4 Predefined list of interesting genes

If the user has some a priori knowledge about a set of interesting genes, he can test the enrichment of GOterms with regard to this list of interesting genes. In this scenario, when only a list of interesting genes isprovided, the user can use only tests statistics that are based on gene counts, like Fisher’s exact test, Z scoreand alike.

To demonstrate how custom annotation can be used this section is based on the toy dataset, the geneID2GO

data, from Section 4.3. The gene universe in this case is given by the list names:

> geneNames <- names(geneID2GO)

> head(geneNames)

Since for the available genes we do not have any measurement and thus no criteria to select interesting genes,we randomly select 10% genes from the gene universe and consider them as interesting genes.

> myInterestingGenes <- sample(geneNames, length(geneNames) / 10)

> geneList <- factor(as.integer(geneNames %in% myInterestingGenes))

> names(geneList) <- geneNames

> str(geneList)

Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...

- attr(*, "names")= chr [1:100] "068724" "119608" "049239" "067829" ...

Page 11: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

The geneList object is a named factor that indicates which genes are interesting and which not. It shouldbe straightforward to compute such a named vector in a real case situation, where the user has his ownpredefined list of interesting genes.

We now have all the elements to construct a topGOdata object.

To build the topGOdata object, we will use the MF ontology. The mapping is given by the geneID2GO listwhich will be used with the annFUN.gene2GO function.

> GOdata <- new("topGOdata", ontology = "MF", allGenes = geneList,

+ annot = annFUN.gene2GO, gene2GO = geneID2GO)

The building of the GOdata object can take some time, depending on the number of annotated genes and onthe chosen ontology. In our example the running time is quite fast given that we have a rather small sizegene universe which also imply a moderate size GO ontology, especially since we are using the MF ontology.

The advantage of having (information on) the gene scores (or better genes measurements) as well as a wayto define which are the interesting genes, in the topGOdata object is that one can apply various group testingprocedure, which let us test multiple hypothesis or tune with different parameters.

By typing GOdata at the R prompt, the user can see a summary of the data.

> GOdata

------------------------- topGOdata object -------------------------

Description:

-

Ontology:

- MF

100 available genes (all genes from the array):

- symbol: 068724 119608 049239 067829 106331 ...

- 10 significant genes.

87 feasible genes (genes that can be used in the analysis):

- symbol: 068724 119608 049239 067829 106331 ...

- 8 significant genes.

GO graph (nodes with at least 1 genes):

- a graph with directed edges

- number of nodes = 248

- number of edges = 331

------------------------- topGOdata object -------------------------

One important point to notice is that not all the genes that are provided by geneList, the initial geneuniverse, can be annotated to the GO. This can be seen by comparing the number of all available genes,the genes present in geneList, with the number of feasible genes. We are therefore forced at this point torestrict the gene universe to the set of feasible genes for the rest of the analysis.

The summary on the GO graph shows the number of GO terms and the relations between them of thespecified GO ontology. This graph contains only GO terms which have at least one gene annotated to them.

4.5 Using the genes score

In many cases the set of interesting genes can be computed based on a score assigned to all genes, for examplebased on the p-value returned by a study of differential expression. In this case, the topGOdata object canstore the genes score and a rule specifying the list of interesting genes. The advantage of having both the

Page 12: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

scores and the procedure to select interesting genes encapsulated in the topGOdata object is that the usercan choose different types of tests statistics for the GO analysis without modifying the input data.

A typical example for the ALL dataset is the study where we need to discriminate between ALL cells deliveredfrom either B-cell or T-cell precursors.

> y <- as.integer(sapply(eset$BT, function(x) return(substr(x, 1, 1) == 'T')))

> table(y)

There are 95 B-cell ALL samples and 95 T-cell ALL samples in the dataset. A two-sided t-test can by appliedusing the function getPvalues (a wraping function for the mt.teststat from the multtest package). Bydefault the function computes FDR (false discovery rate) adjusted p-value in order to account for multipletesting. A different type of correction can be specified using the correction argument.

> geneList <- getPvalues(exprs(eset), classlabel = y, alternative = "greater")

geneList is a named numeric vector. The gene identifiers are stored in the names attribute of the vector.This set of genes defines the gene universe.

Next, a function for specifying the list of interesting genes must be defined. This function needs to selectgenes based on their scores (in our case the adjusted p-values) and must return a logical vector specifyingwhich gene is selected and which not. The function must have one argument, named allScore and mustnot depend on any attributes of this object. In this example we will consider as interesting genes all geneswith an adjusted p-value lower than 0.01. This criteria is implemented in the following function:

> topDiffGenes <- function(allScore) {

+ return(allScore < 0.01)

+ }

> x <- topDiffGenes(geneList)

> sum(x) ## the number of selected genes

With all these steps done, the user can now build the topGOdata object. For a short description of thearguments used by the initialize function see Section 4.4

> GOdata <- new("topGOdata",

+ description = "GO analysis of ALL data; B-cell vs T-cell",

+ ontology = "BP",

+ allGenes = geneList,

+ geneSel = topDiffGenes,

+ annot = annFUN.db,

+ nodeSize = 5,

+ affyLib = affyLib)

It is often the case that many GO terms which have few annotated genes are detected to be significantlyenriched due to artifacts in the statistical test. These small sized GO terms are of less importance for theanalysis and in many cases they can be omitted. By using the nodeSize argument the user can control thesize of the GO terms used in the analysis. Once the genes are annotated to the each GO term and the truepath rule is applied the nodes with less than nodeSize annotated genes are removed from the GO hierarchy.We found that values between 5 and 10 for the nodeSize parameter yield more stable results. The defaultvalue for the nodeSize parameter is 1, meaning that no pruning is performed.

Note that the only difference in the initialisation of an object of class topGOdata to the case in which westart with a predefined list of interesting genes is the use of the geneSel argument. All further analysisdepends only on the GOdata object.

4.6 Filtering and missing GO annotations

Before going further with the enrichment analysis we analyse which of the probes available on the array canbe used in the analysis.

Page 13: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

Variance

Log

of p

−va

lues

−60

−50

−40

−30

−20

−10

0

0 2 4 6

●●●●●●

●● ●● ●●● ● ●● ●●● ●●●● ●● ● ● ●●●● ●● ●● ●●●

●● ● ● ●●●● ● ●●●● ●●

●●●●● ●

●●● ●● ● ●●● ●● ●●●

●●● ●● ●●●

●●●● ●●●●●●●● ●● ●●●● ●● ●●

●● ●●●●

●● ●●●● ●●● ●●● ●●● ● ● ●● ●● ● ●● ●●● ●● ●●

●● ● ●● ●●● ●● ●●●●● ●●

● ●●● ● ● ●● ●●●●●●● ●●● ● ●● ●●●● ●●● ●● ●● ● ●●●●● ●● ●● ●●●●●●●●●●● ●● ●●●●

●●● ●●●

●●

●● ●●● ●●●● ●● ●●● ●● ●●●

●●●● ● ●● ● ●●● ●● ●● ●●●●● ●●●● ●●

●●●● ● ●● ●● ● ●● ●● ●●● ●●●●● ●●● ● ●●● ●● ●●● ●● ●● ●● ●●●●● ●● ● ●● ●● ●● ● ●●●● ●●●●●● ● ●●●● ●

●● ●● ●●● ●●●● ●●●● ●● ●● ● ● ●● ●●●●●

●●●●●●●●●●●●●● ●●●● ●●● ● ●● ●● ●● ● ●● ●●● ● ●● ● ●● ●●●● ●●●●●● ●●●● ●●

●●●● ●●●●

●●● ●● ●●● ● ●●●●●●●●●●●

●●●● ●●●●●●● ●●●●●●●● ● ●● ● ●●● ●●● ●●● ●●● ●●●

●●●●● ●●●●●● ●●● ●● ●●●●●● ●●● ● ●●

●● ●● ● ●●●●●●●●

● ●●● ●●● ●●●● ●●● ●● ●●● ●●●●

●●● ● ●●●● ●●●●

● ●●● ●●● ● ●● ●● ●●● ●● ●● ●●● ●●●●●●●● ●● ● ●● ●●●●●●● ●● ●●● ●● ● ●●●● ●●●●● ●●● ●●● ●●●●● ●● ●● ●●●● ●●

● ● ●● ●●

●●●● ●●●● ●●●●●●● ● ● ●●●●● ●● ● ●● ●● ●● ●● ●● ●●●●

●●●

●●● ●● ●●● ●●

●●● ●●●●

●●

●● ●●

●● ●●

●●●●●● ●●●

● ● ●● ● ●● ● ●●●● ●●●●● ●●●● ●

● ●●●●●●● ●●● ● ●●● ●●●● ●●● ●●●●●●●●●● ●●●●●●●●●●●●

●●● ●● ●●●●●● ●●●●●●●●●●

● ● ●●● ●● ●● ●●● ●●● ●●●● ●● ●●●● ●●● ●● ●●●●●●

●● ●●●●● ●● ●●●

●● ●

● ● ●●●●●●●●

● ●● ●●● ● ●● ● ●●● ●● ● ●● ●● ●●●● ●●●●●●● ●●●●●●●●● ●●●●●●●●●

●●●● ●●●●● ●● ● ● ●●●●●●●●●● ●● ●● ● ●●

●● ●●● ●● ●●●● ●● ●● ●●●●

●●

●● ●●●●● ●●

●●

●●

●●● ●●●● ●●●● ●

● ●●

●● ●● ●●● ● ●●● ●● ●●● ●●●●●● ●●●

●●●●● ● ●●●●●●●●● ●● ● ●●●●●●●● ●●● ●● ●●●● ●●●● ●● ●● ●●●●● ●●● ●●●●●●●

● ●● ● ●● ●● ●● ● ●●

●● ●● ●● ●●● ●● ●● ●● ●●● ●●● ●● ● ●● ●●● ●●●● ●●● ●

●●●●●●●●●● ●●●● ●●●●●●●●●●●● ●●●●●●●● ●●● ●●●●●● ●●●●● ●●●●● ●● ●●● ●●●●●● ●●●●

●●

●● ● ●● ●● ● ●● ●● ●●

●●●● ●● ●

●● ●●● ●●●● ●●●●

●●● ●●●●●●● ●● ●●● ● ●●

●● ●●●

●● ● ●●● ●● ● ●●●●●●●● ● ●●● ●●●●● ●●●●●● ●● ●●●●●● ●●● ● ●●●● ● ●●●● ●● ●● ●● ●●●

●●●●●● ●●●●

●●● ●● ● ●●●● ●● ●● ●●●

●●● ●● ●●● ●● ●● ●●

●●●●● ●● ● ●●● ●●● ●●●●● ●●● ●●●

●●● ●●●●●●●●●● ●● ●●

●●●● ●●●●●●●●● ●●● ●● ●●●●

●●● ●●●●● ●●●●● ●●●●● ●●● ●●●

●●●●● ●●●● ●●●● ●●●

●●● ●● ●●● ● ●●● ●● ●● ●●● ●●●● ●●●●●●● ● ●●●●●

●●●

●● ● ●●●● ●

●●●

●● ●● ●● ● ●●●● ●●●●●●●●●●● ●● ●● ● ● ●●●● ●●●●

●●● ●●● ●●● ●●● ●● ●●●●

●●● ● ●● ●●● ●●●●●● ●● ●●● ●●●

●●● ● ●●●● ●●●●●●●●● ●●● ●● ● ●

●●● ●●●●● ●●●● ●●● ●●●● ●●

●● ● ●●●● ●●●●●●●●●●●●● ●● ●● ●

●●● ●●●● ●●●●● ●●●● ●●● ●●●● ●● ●●●●● ●●●●● ●●●● ●●●●●● ●●

●● ●●● ●● ● ● ●●●● ●● ●● ●● ● ●● ● ● ● ●●● ● ● ●● ●● ● ● ●● ●●●●

●●●

●● ●● ●● ● ●● ●●●● ●● ●●●●● ●●● ● ●●● ●●●● ●● ●● ●●● ● ●●●●● ●●

●●

●●●●

●●● ●●●● ●● ●● ● ● ●●●●●●●●●● ●● ●●●●● ●●● ●● ●●●●●●●

●●●●● ●●●● ●● ●●● ● ●●● ●●● ●●●

● ●●●● ●●●●● ●●●● ● ●●●●●●●● ●● ●●● ● ●●●●● ●●●●●●●●●●

●●●●●● ●● ●● ●● ●●● ●●●● ●● ● ●●● ●●●●● ●●

●●● ●●● ● ●● ●

●● ●● ●●●● ●●●●● ●

●●● ●●●●● ●● ●● ●● ●●● ●●●● ●●●●●● ●●●●●●●●● ●● ●●●●

●●

●● ●●● ●

●●●● ●● ●● ●●●● ●●● ● ●●● ●● ●● ● ●●● ●●●●●● ●●●● ●●● ●●●● ● ●●● ●●●● ●● ●●● ●●●

●● ● ●●

● ●●

● ●● ●● ●● ●●●● ●●●● ●●●● ●●●●●●● ●●●● ●●●

●●●

● ●● ●●●●●

●● ●●●● ●● ● ●● ● ●●●●● ●● ● ●●●●●●●●

●● ●●●● ● ●●●● ●● ●● ●●● ●● ●● ●● ●

●●●● ● ● ●●

●● ●●● ●●● ●● ● ●●●● ●●●● ●●● ●● ●●●●● ●●●●

●●●●● ●●●●●●●

●● ● ●● ● ●●● ●●●

● ●●●●●

●● ●●●● ● ●●● ● ● ●● ●●● ● ●●●●●●●●● ●● ●●●●●●

●●

● ●●● ●●●● ● ●●●● ●● ●●●●●● ● ●● ●●●● ●●● ●●●● ●● ●● ● ● ● ●●●●● ●● ●● ●● ●●● ●●

●●● ●●●● ●●●●●● ● ●●● ● ● ●● ●● ●●●●● ●●●●●● ●● ●●● ●●● ●●● ●●●●●●● ●

●● ●●●●● ● ●●● ●

●● ●● ●●●● ●●●● ●●●● ●●● ●●●

●●● ●●● ●● ●●

●● ● ●●● ●● ●●●● ●●● ●●●● ●●●●●● ● ●● ●●●● ●●

●●● ●●

● ● ●● ●●●●●●

●●●●

●●●●● ●●●

● ● ●●● ●●● ●●●● ●● ●●●● ● ●●● ●●●●

● ●●● ● ●●● ●● ●●●●

●●●● ●●●

● ●● ●● ●● ● ●● ●●●

●● ● ●● ●●●●● ●●●●●●●●●●● ●● ●● ● ●

●● ● ●●●●●● ●●● ● ●●

● ●●●● ●●● ● ●●●●

●●●

●● ●●● ●●● ●●● ●●● ●● ●●● ●● ●●●●● ●●●●● ●● ●●●●● ●●●●●● ●● ●●● ●●●● ●● ●●

●●

●● ●●● ●● ●

●● ●●● ●●●● ●●● ● ●●● ●●●● ●●●●

●● ● ● ●● ●

● ●●●● ● ●● ● ●●●

●●●

●●● ●● ●●●● ● ●●● ●●●● ●●

●●●● ●●●● ●●●●● ●●●●●●● ●●● ●● ●●●● ●●●● ●●

●●●● ●●●● ● ●●●● ● ● ●● ●●● ●● ●●● ● ●●● ●●●●●●●●●●

●●● ● ●●● ●●●●● ●●●● ●● ●●● ●●●●●

●● ● ●●●● ●●● ●● ●●●● ●●● ●●

●● ●●

●●●● ●● ●●●● ●● ●●

●●●●● ●●

● ●●●●●●● ● ●●●●● ●● ●● ●● ●●● ●●●

● ●●●●● ●● ●● ●● ●●●●●●

● ● ●● ●●●● ●● ●

● ●●●● ●●

● ●●● ●● ● ●●● ●●●● ●●

●● ●●● ●●

●● ●●●●● ●●●● ●●●● ●●● ●● ●● ●●●●●●●

●●●● ●●

● ●●●

● ●●● ●●●● ●●●

●●●●● ●●● ●●

●●●● ●●

●●●

● ●●●● ●●●● ●

●●

● ●●● ●●●● ●●●● ● ●●●●

● ●● ●●●●●● ● ●●●●●

●● ●● ●●●●●●●●●

●●●● ● ●●● ● ●●●●● ●●●●●● ●●● ●●● ●●

● ●●●● ●●● ●●● ●●● ● ●●●● ●●●● ●●● ●● ●● ●●●●● ●●●● ●●●●●●

●● ●● ●●●● ●●● ●● ●●●● ● ●●● ●●●● ● ● ●● ●● ●

●●●●●●● ●● ●●● ●● ●●●

●●

●●●● ●● ● ●● ●● ●● ●●● ●● ●●

●●● ●● ●●●●●●

●● ●●● ●●● ● ●

●●● ●●●●●● ●● ●●● ●● ●●●●●● ●●● ●● ●●

●●● ●● ●●●●● ● ● ●●● ●● ●●●●● ●●

●● ●●●●●● ●●●

●●●● ●●●●●●●●

● ●●●●●●

●● ●● ●●●●●● ● ●● ●●● ●● ● ●● ●●● ●● ●●● ●● ● ●●

● ●●● ●●●● ●●● ●●●

● ●● ●● ●● ●●●●

●● ●● ●● ●●● ● ●● ● ●●●●●●●

●●

●●●

●●●●● ●●● ●● ●●●● ● ●●●● ●●●●●●● ●●● ●●● ●●● ● ●●●●●●●● ●●●● ● ●●●● ●●

●●●●● ●● ●●●●●● ●

●●

● ●●● ●●

●● ● ●●● ●●● ●

Used●●● ●● ●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●

●●● ● ●● ● ●●●●●●●●●● ●●●● ●●●●●●●● ● ●●●●●●●●●● ● ● ●●● ●●●●●●●● ●●

●●● ●● ●●●●●●●● ●●●● ●● ●●●●●

●●● ●●●●●● ●● ●●●●●● ●● ●● ●●●●●●●●●●●●●●●●●● ●●●● ●●●●● ●●●●●

●●● ● ●●●●●● ●● ●●●● ●●●●●● ●● ●● ●● ●●

Not annotated

−60

−50

−40

−30

−20

−10

0●●●●●● ●●●●● ●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●● ●●●●● ●● ● ●●●●●●●●●●●●●●●● ●●●●●●●●● ●● ●● ●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●● ●●● ●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●● ●●● ●●●●●●● ●●● ●●●●●●●●● ●●● ●● ●●

●●●●●●●● ● ●●● ●●●● ●●●●● ●●●●●●●●●●●●●●● ●●●● ●● ●●● ●● ●●●●●●● ●●●●● ● ●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●● ●●● ●●

●●●●● ●● ●●●●●●● ●●●● ●● ●●●● ●● ●●● ●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●

●●●●●● ●● ●●●●●●● ● ●●●●●●● ●●● ●●●●● ●● ●●●●●●● ●

●●● ●●●● ● ●

●●●●●● ●●●●●●●●●●●●● ●●●

●●●● ●●●●●●●● ● ●●●●●●●●

●●●●●

●●● ●●●●●● ●●● ●●●●●●●● ● ●●●

●●●●●●●●●●●●● ●●●●●●● ●●●●●● ●●●● ●●

●● ●●●●● ●●●●● ●●●●● ●●●●●●● ●●●● ●●●●●●●● ●●●●● ●●●●●●●●● ●●● ●● ● ●●●●●●●●●●●●

●●●● ●●●● ●●●●●●● ●●●●●●●● ●●●●●● ●● ● ●●●● ●● ●● ●●●● ●●●

● ●●●●●●●●●●● ●●●●●●● ●● ●●●●●● ●● ●●●●●●●●●● ●●

●●●●

● ●●●●●● ●●●●●● ●●●

●●●●●

●● ●●● ●●● ●●● ●● ●●●● ● ●●●●●● ●● ●●● ●●●●● ● ●●●● ●●● ●●●●●●●● ●●●●● ●● ●● ●● ●●●● ●●● ● ●●●● ●●●●●●●● ●●●●●●● ●●●●●●● ●● ●●●●● ●●●●●●

● ●●●●●●●●● ●●● ●●●●●●●●● ● ●●●●●● ●●●●●

●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●● ●●

●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●● ●● ●●●●●●● ●●● ● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●● ●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●● ●● ●●●●●●● ●●● ●●●●● ●●●●● ●●●●●●●●●

●●●●●●●●●●●● ●

● ●●●●●●●●

● ●● ●●●

●●●● ●●●● ●●●

●● ●●● ●●●●● ●●●●

●●●●●●●● ●●●●●●●● ●●● ●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●

●● ●●●●●●●●●●●●●●●●●●● ●●●●● ●●●● ●●●● ●● ●●●●●● ●● ● ●●●●

●●●

●●●●

●●

●● ●●●●● ● ●●●●● ●●●●● ●●●● ●● ●●●● ●● ●●● ●● ●● ●●●●● ●● ●●● ●●●●●

●●● ●●●●●●●● ●●●●● ●●●●●● ●

●●●●● ●● ●●●

●● ●●● ●●●●●●●●●●● ●●●●●●●● ●●●

●●●●● ●●●●●●●●●●●●●● ● ●●● ●●●●●●●●

●● ●●●●●●● ●●●●●●●●●●●● ●●●●●●● ●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●●● ●●●

●●●● ●● ●●● ●●●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●

●●●●●●● ●● ●● ●●●● ●●●●●● ●●●●● ●

●●●● ●●●

●●●●●●● ●●●●●●●●●●●● ●●● ●●●●●

●● ●●●● ●● ●●●● ●●●●● ●●●● ●●●● ●●● ●

●●

●● ●●●●●●● ●●

●●●●●● ●●●

●● ●●● ●●●●● ●●●●●

●●● ●●●●●●●● ●●●●●● ●●● ●● ●● ●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●

●●●● ●●●●●●●●●

●●●

● ● ●●●●●●●

●●●●●●● ●●● ●●●●●●●●●●●● ●● ●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●

● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●● ●●

●●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●● ●

●●●●●●●●●●●●●● ●● ●●● ●●●●

● ●●●●●●●●●●●● ●●●●●● ●●●●● ●●● ●●●●●●●●●●●●●

●●● ●● ●●●●● ●●●●●●● ●● ●●●●●●●●●● ●● ●●●●●●

●●●●●●

●●●● ●●●●●●●

● ●●●●●●●●● ● ●●● ●● ●●●●●●

●●●● ●●

●●●● ●●

●● ●●●●●

● ●●● ● ●●●●●●● ●●●●●●●● ●●●● ●●●●●●● ●●●

● ●●●● ●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●

●●●●●●●●●● ●●●●●●●●●●●●

●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●● ●●●●●● ●●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●● ●●●●●●●●● ●●●● ●●●● ●●●●●

●●●●●●●●●●●●●● ●●●● ●

●● ●●●●●● ●●●●●●● ●●●●●●●●●●●●●●

●●● ●●●●●●●●●●●●

●●●●●●●●●●● ●●● ●●●●●● ●● ●●● ●●●● ●●●●● ●● ●●●● ●● ●● ● ●●●●● ●●

●●

●● ●●● ●● ●●●●●● ● ●●●●●●●●●●●● ●● ● ●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●● ●●●● ●●● ●●●●● ●● ●● ●●●●●●●●●●●●●●●

●●●

●●●

●●●●●●●● ●●●●●●●●●●●●●

●●●●●●●●●●● ●●●●●●●●●●● ●●●●● ●●●●●●●● ●●●

●●●●●●● ●●●●●●●●●●● ●●●●

●●●●●●●●●●●●●●● ●●●●●●●●●●●●

●●●●●●●●●●●●●●● ●●●●●●● ●●● ●●●●● ●●●●●●●●●●●● ●●● ●● ●●●●●● ●● ●●● ●● ●●●●●●●●●●●●●●

●●●●●●●● ●● ●●●●●● ●●●●● ●●●●● ●●●●●●●●

●●●●

●● ●● ●● ●●

●●● ●●●

●● ●●● ●● ●● ●●●● ●●●●●● ●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●

●●●●●●●●● ●●

●●●● ●●● ●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●●● ●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●

●●●●●●● ● ●●●●●●●●● ●●●●● ●●● ●●●●●●●●●●●●

●● ●●●●● ●●● ●●●●●●●●●●

●●● ●● ●● ●● ●●●● ● ●●●●● ●●● ●●●●●● ●●● ●●

● ●●● ●●● ●●●●

● ●● ●●● ●● ●● ●●

● ●●●●●●●● ●● ● ●● ●

●●●●

●● ●●●●●●●● ●●●● ●●● ●● ●● ●●● ●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●● ●●●●●●●

● ●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●

●●●●●●●●●

● ●●●●●●●● ●●●● ●●●● ●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●● ●●●●● ● ●●●●●●●● ●●●●● ●●●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●

●●●● ●

●●●●●●●●●●●● ●●

●● ●●●●●● ●● ●●

●● ●●●●●

●●●●●●●● ●●●●●● ●●●●●●●● ● ●●● ●●●● ●●●●● ●● ●● ●●●

●●●●● ●●● ●● ●●

●●●● ●●●●

● ●● ●●●●● ●●●●

●●●●●●●

● ●●●●●● ● ●●●●● ●● ●●● ●●

●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●

●●● ●● ●●●● ●●●●●●●●● ●●●●●●

●●●●●●●

●●●●● ●●●●●●●●

●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ●●● ●●●●● ●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●

●●●●●●●●●●●●●

●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●● ●● ●●● ●●●● ● ●●● ●●●●●●●●●● ●●● ●● ●●●● ● ●●●

●●●●● ●● ●●●●

●●● ●●● ● ●●●●●●●● ●● ●●●● ● ●●●●● ●● ●●● ●● ●● ●●●● ●●●●●●●●

●●● ●●●●●●●● ●●●●●●● ●●● ●●●●●●●●●●● ●●●●●●●

●●●●●●●●

●● ●●● ●●●●● ●● ●●●●●●●●●●

●●●● ●●●●●● ●●●●●● ●●●●●●● ●●●●●●●●

●●●●●● ●● ●●●● ●●●● ●●●● ●●●●●●●●●● ●● ●●●●● ●●●● ●●● ●●●●●●●●

●●●●●●

● ●●●●● ●●●●●●● ●●●● ●●● ●●●● ●●● ●●●● ● ●●●●●●● ●●●●

●● ●●●● ●● ●●●● ●●●● ●●●●●●●●●●

● ●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●

●●● ●●●●●●●●●●● ●●●●●●●●●●● ● ●●●●●●●

●●●●●●● ●●●● ●●●● ●●●●●●●●●●●●●●●● ●●●●

●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●●●●●●

●●●●●●●●●●● ● ●●●●●●●●● ●●● ● ●● ●●●●●●●●●●● ●●●● ●●● ● ●●● ●● ●●● ●●●● ● ●●●● ●● ●●●● ●●●●● ●●

●●●

●● ●●●● ● ●●●

●●●●●● ●●●●●●● ●● ●●●●● ●●

●●● ●●●● ●● ●●●● ● ●● ●●●●●●●●● ●● ●● ●●●●●

● ●●

●●●●●

● ●● ●●●●●●●

●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●

●●● ●●

●● ●● ●●●● ●●●●●●● ●● ●●●● ●●●●● ●●●●●

●●●●● ●●● ●●● ●● ●●●● ●●●● ●●●● ●●●●●●●● ●●●●●●● ●●

●●●●●●● ●● ●

●●●●●

●● ●●

●●●●● ●● ●●●● ●●●●●●

● ● ●●●●●● ●● ●●

● ●● ●●●●● ●●● ●●●

●●●●●

●● ●●●●●● ●● ●● ●● ●● ●● ●●●●●● ●●●● ●● ●●●●●●●

●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●● ●●●●●●●●●●●●●●●●●●

●●●●●● ●●●●●● ●●●● ●●●●●● ●●●●●●

●● ●●● ●●●● ●●● ●●

● ●● ●●●●●

●●●●●●● ● ●●

●● ●●● ●●●● ●●●●●●●●●●●●●●●● ●● ●

●●●

●●●●●● ●●●●●●● ●●● ●●●●●●●●●

●●● ●● ●●● ●●● ●●●

●●● ●●●●●

● ●●●● ● ●●●● ●● ●●● ●● ●●●● ●●●●●●

●●●●●●●●● ●●●● ●●●●●●●●

●●●●●

●●●●●● ●●●●●●● ●●●●●●● ●● ●● ●

●●●●●● ●●●●● ●●●●●●●● ●●●●●

●●●●●●●●●●●

●●●● ●●●●

●●●●●● ●●●●

●●●●

●●●● ●● ●●●●● ●●●● ●●● ●●●●●●●●●●●●●●● ●●●●● ●● ●●●●●●●●

● ●●

● ●●●●● ●● ●●●

●●●●

● ● ●●●● ●●

●● ●●●●●●●

● ●●●● ●● ●● ●●●●● ●●● ●● ●●

●●●● ●●

●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●● ●●●●● ●●●

●●●●●● ● ●

●●●● ●● ●●

●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●

●●●

●● ●●●●● ●●

●●●● ●●●● ●●●● ●●● ●●●●●●●●●●● ●●● ●●●

●●●●●

●●● ●●● ●●●●●● ●●● ●● ●●●●● ● ●●● ● ●●●●● ●● ●● ●● ●● ●●●

●●●●●

●● ●●

● ●●●● ●●● ●●●

● ●●● ●●● ●●●● ●● ●● ●●● ●●●●●●●

●●●

●●●●●

●●●●

●●●

●●●● ●●●

●●●●●●●●●●●●● ● ●●●●●●●●●●●

●●●●●● ●●

●● ●●●● ●● ●●●●● ●●●●●●●●●●●●●●● ●●●●●●

●● ●● ●●● ●●

●●●●

●●●● ●● ●●● ●●●●●●●●●● ●●●● ●●●●●● ●● ●● ●●● ●●●● ●●● ●●●●●●

●●●● ●●●●●●

●●●●●●●●● ●●

●●●● ●●●●●

●●●●●

●●●● ●●● ●●● ●●● ● ●●●●● ●●●●● ●● ●● ●●●● ●● ●●●●● ●●●●● ●●●● ●● ●●●●●● ●●● ●●●●

●●●

●●●●●●● ●●● ●●●●●●●●●●●●●●● ●● ●●●● ● ●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●● ●● ●●●

●●●

●●●● ●

●●●●●

● ●● ● ●●●●●●● ●●●● ●●●●●●●●●●● ●●

●●●●●●●●● ●●●●●● ●●●●

●● ●●●●● ●●●●●

●● ●● ●●

●●●●●●

● ●●●●●●●

●●●●●●● ●● ●●●

● ●●●●● ●●●●●●●●●● ● ●●●● ●●●●●●● ●●●

● ●●●●●●●●●●●●●●●●● ●●●●●●●

●●●●●●●●●●●●● ●●●●●● ●●●●●

●●● ●● ●● ●●●●●●●● ●●●● ● ●●●●●●●●●

● ●● ●●●

●●● ● ●●●● ●●●●●●● ●●●●●●●● ●●● ●●●● ●●●●●● ●● ●●● ●●● ●●● ●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●● ●● ●●● ●●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●

●●●●●●●●● ●●●●●● ●●●●●●●

●●●●● ●●●●●●●● ●●●

●●●●

● ●●●●●

●● ●●●● ●●●●● ●●●●●●●●●●

●●●● ●●●●● ●●●● ●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●

●●●●●● ●●●●●●● ●● ●●● ●● ●●● ●●●●●●●●● ●● ●●●●●● ●●●●●● ●●●●

● ●●●●●●●●●●● ●●●●●●●●●● ●●●●●●● ●● ●●● ●●● ●● ● ●

●●●●●

●●●●

●●●●● ●● ● ●●●●● ●●● ●●●

●●●●●●●●●●●● ●●●●●●● ●●●● ●●●● ●●●●●● ●● ●● ●●●●● ●●● ● ●●●●●● ●● ● ●

●●● ●●●●●● ●●●●●●● ● ●●●●●●●

●●●●●●●●●● ●●●

●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●● ●●

●●● ●●●●●●●●●●●

●●●●●●●●●●●●●● ●●●● ●● ●●●●●●●

●●● ●●●●

●●●●●●●● ●● ●● ●●

●●

●●●●●●●●●●●●●

● ●●●●●●●● ●● ●●● ●●● ● ●

●●

●●●●

●●● ●●●●● ●●●●● ● ●● ●● ●●●●●●●● ●●●●● ●●●●●● ●●● ●●●●●● ●● ●●● ●●●●●● ●●●●● ●●●●● ●●●●●●●●●●●

●●●● ●●●●●●●●●● ●●●●●●●● ● ●●●●●●●●●●●●●● ●● ●●● ●●● ●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●● ●●

●●●●

●●

●●●●● ●●●● ●●●●● ●●●

●●●●●●●●●●●●●

●●●●●●●●●●● ●●● ●●● ●●●●●● ●●●●●●

● ●●● ●●●●● ●●●● ●●●●● ●●

●● ●●●● ●●●● ●●●●●● ●●

●●●●● ●●●●● ●● ●●●●●●●●●●●●●●● ●●●●●●● ●●

●● ●●●●●

●●● ● ●●●●●●

● ●●●●●●●●● ●● ●

●●●●● ●●●● ●●●●●●● ●

●●●●●●● ● ●●●● ●● ●●●●● ●● ●●● ●●●● ●● ●

●●●●● ●●●●●●●● ● ●●● ●● ●●●●●● ●●●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●● ●●●●● ●●

●●● ●●●● ●●●●●●●

●● ●●●● ●●●

●●● ●●●●●

● ● ●●

● ●●●● ●●●● ●●●●● ●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ● ●●● ●●●●●●●● ●● ●●●●● ●●●●●●●●●

● ●●●

●●●●●●●●●●●●● ●●●●●●● ●●● ●●●●●●● ●●●●●●

●●●●●●● ●

●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●● ● ●●

● ●●● ●● ●● ●●● ●●●●●●● ● ●●●●● ●●●

●●●●●● ●●●●●● ● ●●●●● ●●●● ● ●●

●●●● ●●●●●●●●●●●●●●●●● ●● ●●●

●●●●●●●●●●● ●●●●

●●●●●● ●●●●●● ●●●●●● ●●●●● ●

●●●● ●●●●●●●● ●●●●●●●● ●●●● ●● ●●●● ●●●●● ●●●●●

●●●●● ● ●●●

● ●●●●●

● ●●●●●● ●●● ●● ●●●● ● ●●● ●●●●●●● ●●●●●● ●●●● ●●●●●●●●● ●●●●●●● ●● ●●● ●●●●●●●●● ●●●●●●●●●● ●●●●● ●●●● ●●●●●● ●● ●●●●● ●●● ●● ●●●

●●●●●●● ●● ●●●●●

●●

●●● ●●●●●●●●●● ●● ●●●●●●●●●

●●●●

● ●● ●●●●● ●●● ●●●●

●●●●●●● ●●●●●●●●

● ●● ●● ●●●●● ●●●●●● ●●●●●●●●●

●●● ●●●● ●●●● ●●● ●● ●●●● ●●●●●● ●

●●●●● ●●●●●● ●●●●●●●●●● ●●●● ●

●●●●●● ●●●●●●●● ●●● ●● ●●● ●● ●●●●●●●●●● ●●● ●●●●● ●●●●●●●● ●●●●● ● ●●●●● ●●●●●●● ●●●

●●●●●● ●●●● ● ●● ●●● ●●●●●●●●● ●● ●●●●●●●

●●●●●●●

●●

● ●●●● ●●● ●●● ●●●●●●●●●●●●●●●●●● ●●●● ●● ●●● ●●●

●●● ●●●●●●●●● ●●● ●●●●●● ●●●●● ●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●

●● ●●●●●●● ●● ●●●● ●●●● ●●●●●●●●●●● ●●●●● ●●● ●●●●● ●●●●●●●●●●● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●

●● ●●●●●● ●●●●● ●● ●●●●●● ●●● ●●●●● ●●● ●●● ●●● ●●●●●●● ●●

●●●●●●●● ●●● ●●●●● ●● ●●● ●●● ●● ●●●●● ●● ●●●●● ●●●● ● ●● ●●●●●●●●●●●●●●● ●●●● ●● ●●●●●●●●●●●●●●●●●●●●● ●●

Filtered

Used (#3888)Not annotated (#213)Filtered (#8524)

Figure 3: Scatter plot of FDR adjusted p-values against variance of probes. Points below the horizontal line aresignificant probes.

We want to see if the filtering performed in Section 4.1 removes important probes. There are a total of 12625probes on the hgu95av2 chip. One assumes that only the noisy probes, probes with low expression valuesor small variance across samples are filtered out from the analysis.

The number of probes have a direct effect on the multiple testing adjustment of p-values. Too many probeswill result in too conservative adjusted p-values which can bias the result of tests like Fisher’s exact test.Thus it is important to carefully analyse this step.

> allProb <- featureNames(ALL)

> groupProb <- integer(length(allProb)) + 1

> groupProb[allProb %in% genes(GOdata)] <- 0

> groupProb[!selProbes] <- 2

> groupProb <- factor(groupProb, labels = c("Used", "Not annotated", "Filtered"))

> tt <- table(groupProb)

> tt

groupProb

Used Not annotated Filtered

3888 213 8524

Out of the filtered probes only 95% have annotation to GO terms. The filtering procedure removes 8524probes which is a very large percentage of probes (more than 50%), but we did this intentionally to reducethe expression set for computational purposes.

We perform a differential expression analysis on all available probes and we check if differentially expressedgenes are leaved out from the enrichment analysis.

> pValue <- getPvalues(exprs(ALL), classlabel = y, alternative = "greater")

> geneVar <- apply(exprs(ALL), 1, var)

> dd <- data.frame(x = geneVar[allProb], y = log10(pValue[allProb]), groups = groupProb)

> xyplot(y ~ x | groups, data = dd, groups = groups)

Page 14: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

Figure 3 shows for the three groups of probes the adjusted p-values and the gene-wise variance. Probes withlarge changes between conditions have large variance and low p-value. In an ideal case, one would expectto have a large density of probes in the lower right corner of Used panel and few probes in this region inthe other two panels. We can see that the filtering process throws out some significant probes and in areal analysis a more conservative filtering needs to be applied. However, there are also many differentiallyexpressed probes without GO annotation which cannot be used in the analysis.

5 Working with the topGOdata object

Once the topGOdata object is created the user can use various methods defined for this class to access theinformation encapsulated in the object.

The description slot contains information about the experiment. This information can be accessed orreplaced using the method with the same name.

> description(GOdata)

> description(GOdata) <- paste(description(GOdata), "Object modified on:", format(Sys.time(), "%d %b %Y"), sep = " ")

> description(GOdata)

Methods to obtain the list of genes that will be used in the further analysis or methods for obtaining all genescores are exemplified below.

> a <- genes(GOdata) ## obtain the list of genes

> head(a)

[1] "1000_at" "1005_at" "1007_s_at" "1008_f_at" "1009_at" "100_g_at"

> numGenes(GOdata)

[1] 3888

Next we describe how to retrieve the score of a specified set of genes, e.g. a set of randomly selected genes.If the object was constructed using a list of interesting genes, then the factor vector that was provided atthe building of the object will be returned.

> selGenes <- sample(a, 10)

> gs <- geneScore(GOdata, whichGenes = selGenes)

> print(gs)

If the user wants an unnamed vector or the score of all genes:

> gs <- geneScore(GOdata, whichGenes = selGenes, use.names = FALSE)

> print(gs)

> gs <- geneScore(GOdata, use.names = FALSE)

> str(gs)

The list of significant genes can be accessed using the method sigGenes().

> sg <- sigGenes(GOdata)

> str(sg)

> numSigGenes(GOdata)

Another useful method is updateGenes which allows the user to update/change the list of genes (and theirscores) from a topGOdata object. If one wants to update the list of genes by including only the feasible ones,one can type:

Page 15: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

> .geneList <- geneScore(GOdata, use.names = TRUE)

> GOdata ## more available genes

> GOdata <- updateGenes(GOdata, .geneList, topDiffGenes)

> GOdata ## the available genes are now the feasible genes

There are also methods available for accessing information related to GO and its structure. First, we wantto know which GO terms are available for analysis and to obtain all the genes annotated to a subset of theseGO terms.

> graph(GOdata) ## returns the GO graph

A graphNEL graph with directed edges

Number of Nodes = 6021

Number of Edges = 13666

> ug <- usedGO(GOdata)

> head(ug)

[1] "GO:0000002" "GO:0000003" "GO:0000018" "GO:0000027" "GO:0000028" "GO:0000038"

We further select 10 random GO terms, count the number of annotated genes and obtain their annotation.

> sel.terms <- sample(usedGO(GOdata), 10)

> num.ann.genes <- countGenesInTerm(GOdata, sel.terms) ## the number of annotated genes

> num.ann.genes

> ann.genes <- genesInTerm(GOdata, sel.terms) ## get the annotations

> head(ann.genes)

When the sel.terms argument is missing all GO terms are used. The scores for all genes, possibly annotatedwith names of the genes, can be obtained using the method scoresInTerm().

> ann.score <- scoresInTerm(GOdata, sel.terms)

> head(ann.score)

> ann.score <- scoresInTerm(GOdata, sel.terms, use.names = TRUE)

> head(ann.score)

Finally, some statistics for a set of GO terms are returned by the method termStat. As mentioned previously,if the sel.terms argument is missing then the statistics for all available GO terms are returned.

> termStat(GOdata, sel.terms)

Annotated Significant Expected

GO:0051186 187 25 16.30

GO:0051709 12 0 1.05

GO:1901215 96 5 8.37

GO:0050885 24 1 2.09

GO:0060440 5 1 0.44

GO:0031668 116 16 10.11

GO:0045103 17 1 1.48

GO:0051385 18 0 1.57

GO:0045667 37 1 3.23

GO:0006629 354 40 30.87

Page 16: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

6 Running the enrichment tests

In this section we explain how we can run the desired enrichment method once the topGOdata object isavailable.

topGO package was designed to work with various test statistics and various algorithms which take the GOdependencies into account. At the base of this design stands a S4 class mechanism which facilitates definingand executing a (new) group test. Three types of enrichment tests can be distinguish if we look at the dataused by the each test.

� Tests based on gene counts. This is the most popular family of tests, given that it only requires thepresence of a list of interesting genes and nothing more. Tests like Fisher’s exact test, Hypegeometrictest and binomial test belong to this family. Draghici et al. (2006)

� Tests based on gene scores or gene ranks. It includes Kolmogorov-Smirnov like tests (also known asGSEA), Gentleman’s Category, t-test, etc. Ackermann and Strimmer (2009)

� Tests based on gene expression. Tests like Goeman’s globaltest or GlobalAncova separates from theothers since they work directly on the expression matrix. Goeman and Buhlmann (2007)

There are also a number of strategies/algorithms to account for the GO topology, see Table 1, each of themhaving specific requirements.

For each test type described above and for each algorithm there is S4 class defined in the package. Themain idea is to have a class (container) which can store, for a specified gene set(GO category), all datanecessary for computing the desired test statistic, and a method that will iterate over all GO categories. Insuch a design the user needs to instantiate an object from the class corresponding to the chosen method(teststatistic and algorithm) and then run the iterator function on this object.

The defined S4 classes are organised in a hierarchy which is showed in Figure 4.

There are two possibilities(or interfaces) for applying a test statistic to an object of class topGOdata. Thebasic interface, which provides the core of the testing procedure in topGO, offers more flexibility to theexperienced R user allowing him to implement new test statistics or new algorithms. The second interface ismore user friendly but at the same time more restrictive in the choice of the tests and algorithms used. Wewill further explain how these two interfaces work.

Figure 4: The test statistics class structure.

Page 17: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

6.1 Defining and running the test

The main function for running the GO enrichment is getSigGroups() and it takes two arguments. The firstargument is an instance of class topGOdata and the second argument is an instance of class groupStats orany of its descendents.

To better understand this principle consider the following example. Assume we decided to apply the classicalgorithm. The two classes defined for this algorithm are classicCount and classicScore. If an object ofthis class is given as a argument to getSigGroups() than the classic algorithm will be used. The getSig-

Groups() function can take a while, depending on the size of the graph (the ontology used), so be patient.

The groupStats classes

Next we show how an instance of the groupStats class can represent a gene set and how the test statisticis performed.

We compute the enrichment of cellular lipid metabolic process (GO:0044255) term using Fisher’s exact test.In order to do this we need to define the gene universe, to obtain the genes annotated to GO:0044255, andto define the set of significant genes.

> goID <- "GO:0044255"

> gene.universe <- genes(GOdata)

> go.genes <- genesInTerm(GOdata, goID)[[1]]

> sig.genes <- sigGenes(GOdata)

Now we can instantiate an object of class classicCount. Once the object is constructed we can get the 2×2contingency table or apply the test statistic.

> my.group <- new("classicCount", testStatistic = GOFisherTest, name = "fisher",

+ allMembers = gene.universe, groupMembers = go.genes,

+ sigMembers = sig.genes)

> contTable(my.group)

sig notSig

anno 29 233

notAnno 310 3316

> runTest(my.group)

[1] 0.1023809

The slot testStatistic contains the function (or to be more precise, the method) which computes the teststatistic. We used the GOFisherTest function which is available in the topGO package and as the name statesit implements Fisher’s exact test. The user can define his own test statistic function and then apply it usingthe preferred algorithm. The function, however, should use the methods defined for the groupStats classto access the data encapsulated in such an object. (For example a function which computes the Z score canbe easily implemented using as an example the GOFisherTest method.)

The runTest method is defined for the groupStats class and its used to run/compute the test statistic,by calling the testStatistic function. The value returned by the runTest method in this case is thevalue returned by GOFisherTest method, which is the Fisher’s exact test p-value. The contTable method,showed in the example above, is only defined for the classes based on gene counts and its used to compilethe two-dimensional contingency table based on the object.

To show how the same interface is used for the classes based on gene counts we next build an instance forthe elimCount class. We randomly select 25% of the annotated genes as genes that should be removed.

> set.seed(123)

> elim.genes <- sample(go.genes, length(go.genes) / 4)

Page 18: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

> elim.group <- new("elimCount", testStatistic = GOFisherTest, name = "fisher",

+ allMembers = gene.universe, groupMembers = go.genes,

+ sigMembers = sig.genes, elim = elim.genes)

> contTable(elim.group)

sig notSig

anno 23 174

notAnno 310 3316

> runTest(elim.group)

[1] 0.08669903

We see that the interface accounts for the genes that need to be eliminated, once the object is instantiated.The same mechanism applies for the other hierarchy of classes (the score based and expression based classes),except that each hierarchy has its own specialised methods for computing statistics from the data.

Please note that the groupStats class or any descendent class does not depend on GO, and an object ofsuch a class can be instantiated using any gene set.

Performing the test

According to the mechanism described above, one first defines a test statistic for the chosen algorithm,meaning that an instance of object specific for the algorithm is constructed in which only the test statisticmust be specified, and then calls a generic function (interface) to run the algorithm.

According to this mechanism, one first defines a test statistic for the chosen algorithm, in this case classic andthen runs the algorithm (see the second line). The slot testStatistic contains the test statistic function.In the above example GOFisherTest function which implements Fisher’s exact test and is available in thetopGO package was used. A user can define his own test statistic function and then apply it using the classicalgorithm. (For example a function which computes the Z score can be implemented using as an example theGOFisherTest method.)

> test.stat <- new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")

> resultFisher <- getSigGroups(GOdata, test.stat)

A short summary on the used test and the results is printed at the R console.

> resultFisher

Description: GO analysis of ALL data; B-cell vs T-cell Object modified on: 16 Dec 2019

Ontology: BP

'classic' algorithm with the 'Fisher test' test

6021 GO terms scored: 154 terms with p < 0.01

Annotation data:

Annotated genes: 3888

Significant genes: 339

Min. no. of genes annotated to a GO: 5

Nontrivial nodes: 4236

To use the Kolmogorov-Smirnov (KS) test one needs to provide the gene-wise scores and thus we need toinstantiate an object of a class which is able to deal of the scores. Such a class is the classicScore class,see Figure 4 which will let us run the classic algorithm.

> test.stat <- new("classicScore", testStatistic = GOKSTest, name = "KS tests")

> resultKS <- getSigGroups(GOdata, test.stat)

Page 19: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

The mechanism presented above for classic also hold for elim and weight. The user should pay attentionto the compatibility between the chosen class and the function for computing the test statistic, since noincompatibility test are made when the object is instantiated. For example the weight algorithm will notwork with classes based on gene-wise scores. To run the elim algorithm with KS test one needs to type:

> test.stat <- new("elimScore", testStatistic = GOKSTest, name = "Fisher test", cutOff = 0.01)

> resultElim <- getSigGroups(GOdata, test.stat)

Similarly, for the weight algorithm with Fisher’s exact test one types:

> test.stat <- new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio")

> resultWeight <- getSigGroups(GOdata, test.stat)

6.2 The adjustment of p-values

The p-values return by the getSigGroups function are row p-values. There is no multiple testing correctionapplied to them, unless the test statistic directly incorporate such a correction. Of course, the researchercan perform an adjustment of the p-values if he considers it is important for the analysis. The reason fornot automatically correcting for multiple testing are:

� In many cases the row p-values return by an enrichment analysis are not that extreme. A FDR/FWERadjustment procedure can in this case produce very conservative p-values and declare no, or very few,terms as significant. This is not necessary a bad thing, but it can happen that there are interestingGO terms which didn’t make it over the cutoff but they are omitted and thus valuable informationlost. In this case the researcher might be interested in the ranking of the GO terms even though notop term is significant at a specify FDR level.

� One should keep in mind that an enrichment analysis consist of many steps and there are manyassumptions done before applying, for example, Fisher’s exact test on a set of GO terms. Performinga multiple testing procedure accounting only on the number of GO terms is far from being enough tocontrol the error rate.

� For the methods that account for the GO topology like elim and weight, the problem of multipletesting is even more complicated. Here one computes the p-value of a GO term conditioned on theneighbouring terms. The tests are therefore not independent and the multiple testing theory does notdirectly apply. We like to interpret the p-values returned by these methods as corrected or not affectedby multiple testing.

6.3 Adding a new test

Example for the Category test ....

6.4 runTest: a high-level interface for testing

Over the basic interface we implemented an abstract layer to provide the users with a higher level interfacefor running the enrichment tests. The interface is composed by a function, namely the runTest function,which can be used only with a predefined set of test statistics and algorithms. In fact runTest is a warpingfunction for the set of commands used for defining and running a test presented in Section 6.1.

There are three main arguments that this function takes. The first argument is an object of class topGOdata.The second argument, named algorithm, is of type character and specifies which method for dealing withthe GO graph structure will be used. The third argument, named statistic, specifies which group teststatistic will be used.

To perform a classical enrichment analysis by using the classic method and Fisher’s exact test, the user needsto type:

Page 20: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

> resultFis <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

Various algorithms can be easily combine with various test statistics. However not all the combinationswill work, as seen in Table 1. In the case of a mismatch the function will throw an error. The algorithm

argument is optional and if not specified the weight01 method will be used. Bellow we can see more examplesusing the runTest function.

> weight01.fisher <- runTest(GOdata, statistic = "fisher")

> weight01.t <- runTest(GOdata, algorithm = "weight01", statistic = "t")

> elim.ks <- runTest(GOdata, algorithm = "elim", statistic = "ks")

> weight.ks <- runTest(GOdata, algorithm = "weight", statistic = "ks") #will not work!!!

The last line will return an error because we cannot use the weight method with the Kolmogorov-Smirnovtest. The methods and the statistical tests which are accessible via the runTest function are available viathe following two functions:

> whichTests()

[1] "fisher" "ks" "t" "globaltest" "sum" "ks.ties"

> whichAlgorithms()

[1] "classic" "elim" "weight" "weight01" "lea" "parentchild"

There is no advantage of using the runTest() over getSigGroups() except that it is more user friendly and itgives cleaner code. However, if the user wants to define his own test statistic or implement a new algorithmbased on the available groupStats classes, then it would be not possible to use the runTest function.

Finally, the function can pass extra arguments to the initialisation method for an groupStats object. Thus,one can specify different cutoffs for the elim method, or arguments for the weight method.

7 Interpretation and visualization of results

This section present the available tools for analysing and interpreting the results of the performed tests.Both getSigGroups and runTest functions return an object of type topGOresult, and most of the followingfunctions work with this object.

7.1 The topGOresult object

The structure of the topGOresult object is quite simple. It contains the p-values or the statistics returned bythe test and basic informations on the used test statistic/algorithm. The information stored in the topGOdataobject is not carried over this object, and both of these objects will be needed by the diagnostic tools.

Since the test statistic can return either a p-value or a statistic of the data, we will refer them as scores!

To access the stored p-values, the user should use the function score. It returns a named numeric vector,were the names are GO identifiers. For example, we can look at the histogram of the results of the Fisher’sexact test and the classic algorithm.

By default, the score function does not warranty the order in which the p-values are returned, as we cansee if we compare the resultFis object with the resultWeight object:

> head(score(resultWeight))

GO:1902894 GO:0048812 GO:1902895 GO:0048813 GO:0071870 GO:0070669

0.6339088 0.9059787 0.4227148 1.0000000 1.0000000 1.0000000

Page 21: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

> pvalFis <- score(resultFis)

> head(pvalFis)

GO:0000002 GO:0000003 GO:0000018 GO:0000038 GO:0000041 GO:0000070

0.7218090 0.7923403 0.1422357 0.0107636 0.8462353 0.9200612

> hist(pvalFis, 50, xlab = "p-values")

Histogram of pvalFis

p−values

Fre

quen

cy

0.0 0.2 0.4 0.6 0.8 1.0

050

010

0015

0020

00

However, the score method has a parameter, whichGO, which takes a list of GO identifiers and returns thescores for these terms in the specified order. Only the scores for the terms found in the intersection betweenthe specified GOs and the GOs stored in the topGOresult object are returned. To see how this work letscompute the correlation between the p-values of the classic and weight methods:

> pvalWeight <- score(resultWeight, whichGO = names(pvalFis))

> head(pvalWeight)

GO:0000002 GO:0000003 GO:0000018 GO:0000038 GO:0000041 GO:0000070

0.7218090 0.9911929 1.0000000 0.0107636 0.8538853 1.0000000

> cor(pvalFis, pvalWeight)

[1] 0.5415396

Basic information on input data can be accessed using the geneData function. The number of annotatedgenes, the number of significant genes (if it is the case), the minimal size of a GO category as well as thenumber of GO categories which have at least one significant gene annotated are listed:

> geneData(resultWeight)

Annotated Significant NodeSize SigTerms

3888 339 5 4236

7.2 Summarising the results

We can use the GenTable function to generate a summary table with the results from one or more testsapplied to the same topGOdata object. The function can take a variable number of topGOresult objectsand it returns a data.frame containing the top topNodes GO terms identified by the method specified withthe orderBy argument. This argument allows the user decide which p-values should be used for ordering theGO terms.

Page 22: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

GO.ID Term Annotated Significant Expected Rank in classic classic KS weight1 GO:0045061 thymic T cell selection 12 8 1.05 1 1.1e-06 0.00016 1.1e-062 GO:0031580 membrane raft distribution 6 5 0.52 4 2.7e-05 0.00099 2.7e-053 GO:0043651 linoleic acid metabolic process 9 6 0.78 8 2.8e-05 0.00117 2.8e-054 GO:0010863 positive regulation of phospholipase C a... 13 7 1.13 10 3.9e-05 0.00453 3.9e-055 GO:0042473 outer ear morphogenesis 7 5 0.61 13 8.9e-05 0.00383 8.9e-056 GO:0036109 alpha-linolenic acid metabolic process 5 4 0.44 20 0.00026 0.00257 0.000267 GO:0043378 positive regulation of CD8-positive, alp... 5 4 0.44 21 0.00026 0.00403 0.000268 GO:0042474 middle ear morphogenesis 9 5 0.78 27 0.00046 0.01766 0.000469 GO:0034030 ribonucleoside bisphosphate biosynthetic... 18 7 1.57 28 0.00049 0.00087 0.00049

10 GO:0034033 purine nucleoside bisphosphate biosynthe... 18 7 1.57 29 0.00049 0.00087 0.0004911 GO:0042102 positive regulation of T cell proliferat... 42 11 3.66 37 0.00068 0.03758 0.0006812 GO:0045086 positive regulation of interleukin-2 bio... 6 4 0.52 40 0.00074 0.01221 0.0007413 GO:0048935 peripheral nervous system neuron develop... 6 4 0.52 41 0.00074 0.01163 0.0007414 GO:0090270 regulation of fibroblast growth factor p... 6 4 0.52 42 0.00074 0.01656 0.0007415 GO:0050854 regulation of antigen receptor-mediated ... 37 10 3.23 49 0.00091 0.00446 0.0009116 GO:0090181 regulation of cholesterol metabolic proc... 20 7 1.74 52 0.00103 0.00352 0.0010317 GO:0045830 positive regulation of isotype switching 16 6 1.40 64 0.00158 0.00318 0.0015818 GO:0010453 regulation of cell fate commitment 7 4 0.61 65 0.00161 0.03502 0.0016119 GO:0032305 positive regulation of icosanoid secreti... 8 4 0.70 77 0.00299 0.06485 0.0029920 GO:0035898 parathyroid hormone secretion 8 4 0.70 78 0.00299 0.06119 0.00299

Table 3: Significance of GO terms according to different tests.

> allRes <- GenTable(GOdata, classic = resultFis, KS = resultKS, weight = resultWeight,

+ orderBy = "weight", ranksOf = "classic", topNodes = 20)

Please note that we need to type the full names (the exact name) of the function arguments: topNodes,rankOf, etc. This is the price paid for flexibility of specifying different number of topGOresults objects. Thetable includes statistics on the GO terms plus the p-values returned by the other algorithms/test statistics.Table 3 shows the informations included in the data.frame.

7.3 Analysing individual GOs

Next we want to analyse the distribution of the genes annotated to a GO term of interest. In an enrichmentanalysis one expects that the genes annotated to a significantly enriched GO term have higher scores thanthe average gene’ score of the gene universe.

One way to check this hypothesis is to compare the distribution of the gene scores annotated to the specifiedGO term with the distribution of the scores of the complementary gene set (all the genes in the gene universewhich are not annotated to the GO term). This can be easily achieved using the showGroupDensity function.For example, lets look at the distribution of the genes annotated to the most significant GO term w.r.t. theweight algorithm.

> goID <- allRes[1, "GO.ID"]

> print(showGroupDensity(GOdata, goID, ranks = TRUE))

Gene's rank

Den

sity

0.0000

0.0005

0.0010

0.0015

0.0020

0.0025

0 500 1000

●● ●●●● ●●● ● ●● ●●● ●● ● ●●● ●● ●● ●● ● ●●● ● ●● ●● ●● ● ● ● ●●●●●● ● ●●● ●●● ● ● ●●● ●● ●●●● ● ●● ● ●● ●● ●● ●●●● ●●● ●●● ●● ●● ● ● ●● ●●●●● ●● ●● ●● ●● ●●● ● ●● ● ●● ●●●● ●●● ●●● ● ●●● ●● ● ●●● ●●● ● ●● ●● ●●● ●● ●●●● ●● ●●● ● ●●● ●● ● ● ●● ●●● ● ●●●● ●● ●●● ●●●●● ●● ● ●● ● ● ●●● ● ●● ●● ●● ●●● ●●●●●● ●●●● ● ●●● ●● ●● ●● ●● ● ●● ● ● ●●●● ●● ●● ● ●● ● ●●● ● ●● ●●●● ● ● ●● ●● ●● ●●● ● ● ●●● ●● ●●● ●● ● ●●● ●● ●●●● ●●● ●● ●● ●● ●● ●● ●● ●● ●●●●● ● ●● ● ●●● ●● ●●● ●● ● ●● ● ●● ●● ●●● ●●● ● ●●● ●●● ●● ● ●● ●●●● ● ●● ●●● ● ●●● ●● ●● ●●● ●● ●● ● ●● ●● ●● ●● ●● ●●●● ● ●●●● ●● ● ●●● ●●● ●● ●● ● ●●● ● ●● ●● ●● ●● ●● ●●● ● ●● ●● ●●● ●● ● ● ● ●● ● ●● ●●● ● ●● ●● ●● ● ● ●● ●● ●●● ●●● ●●● ● ●● ● ●●● ●● ●●● ●● ●● ● ●● ●● ●● ● ●●● ●● ●● ●● ●●●● ●● ●●●● ●●● ●● ● ●● ●●●● ●● ●●● ● ●● ● ●●●● ●● ● ● ●●● ● ●●● ●● ● ●● ●●●● ● ●● ●● ●● ●●●● ● ●● ●●● ●● ●● ● ●●● ●● ●● ● ●●●● ●●● ●● ●● ●● ●● ●●● ●●● ●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●● ●●● ● ●●● ●●● ●● ●●● ●● ●●● ●● ● ●● ●● ● ● ●●●● ● ●● ●● ● ●● ●●● ●● ●●● ●● ●● ● ●●●● ●● ●●● ● ●● ●● ●●● ●● ●● ●● ●● ●● ●●● ●●● ●●● ●● ●● ● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ● ●●● ●● ●● ● ●●● ●● ●●● ●●● ●● ● ●● ●●● ● ● ●●●● ●● ●●● ● ●●●● ● ●● ●●●● ●● ●●● ●●● ●● ●●●● ● ●● ●● ● ●●●●● ● ● ● ●● ● ● ●●

complementary (832)

0.0000

0.0005

0.0010

0.0015

0.0020

0.0025

●● ●●● ●● ● ●

GO:0045061 (9)

Figure 5: Distribution of the gene’ rank from GO:0045061, compared with the null distribution.

Page 23: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

We can see in Figure 5 that the genes annotated to GO:0045061 have low ranks (genes with low p-value ofthe t-test). The distribution of the ranks is skewed on the left side compared with the reference distributiongiven by the complementary gene set. This is a nice example in which there is a significant difference in thedistribution of scores between the gene set and the complementary set, and we see from Table 3 that thisGO is found as significantly enriched by all methods used.

In the above example, the genes with a p-value equal to 1 were omitted. They can be included using thevalue FALSE for the rm.one argument in the showGroupDensity function.

Another useful function for analysing terms of interest is printGenes. The function will generate a tablewith all the genes/probes annotated to the specified GO term. Various type of identifiers, the gene nameand the p-values/statistics are provided in the table.

> goID <- allRes[10, "GO.ID"]

> gt <- printGenes(GOdata, whichTerms = goID, chip = affyLib, numChar = 40)

Chip ID LL.id Symbol.id Gene name raw p-value33821 at 33821 at 60481 ELOVL5 ELOVL fatty acid elongase 5 3.53e-2135117 at 35117 at 60481 ELOVL5 ELOVL fatty acid elongase 5 1.45e-0934411 at 34411 at 9061 PAPSS1 3’-phosphoadenosine 5’-phosphosulfate sy... 1.96e-0837000 at 37000 at 25874 MPC2 mitochondrial pyruvate carrier 2 5.04e-0536720 at 36720 at 5165 PDK3 pyruvate dehydrogenase kinase 3 0.00041638109 at 38109 at NANA 0.00084041440 at 41440 at 7923 HSD17B8 hydroxysteroid 17-beta dehydrogenase 8 0.00265038429 at 38429 at 2194 FASN fatty acid synthase 0.01556240881 at 40881 at 47 ACLY ATP citrate lyase 0.02212337945 at 37945 at 11332 ACOT7 acyl-CoA thioesterase 7 0.05557134774 at 34774 at 5538 PPT1 palmitoyl-protein thioesterase 1 1.000000

36127 g at 36127 g at 80347 COASY Coenzyme A synthase 1.00000038108 at 38108 at NANA 1.00000038845 at 38845 at 5164 PDK2 pyruvate dehydrogenase kinase 2 1.00000038966 at 38966 at 9524 TECR trans-2,3-enoyl-CoA reductase 1.00000038997 at 38997 at 6576 SLC25A1 solute carrier family 25 member 1 1.000000

38998 g at 38998 g at 6576 SLC25A1 solute carrier family 25 member 1 1.00000039160 at 39160 at 5162 PDHB pyruvate dehydrogenase E1 beta subunit 1.000000

Table 4: Genes annotated to GO:0034033.

The data.frame containing the genes annotated to GO:0034033 is shown in Table 4. One or more GOidentifiers can be given to the function using the whichTerms argument. When more than one GO is specified,the function returns a list of data.frames, otherwise only one data.frame is returned. The function has aargument file which, when specified, will save the results into a file using the CSV format.

For the moment the function will work only when the chip used has an annotation package available inBioconductor. It will not work with other type of custom annotations.

7.4 Visualising the GO structure

An insightful way of looking at the results of the analysis is to investigate how the significant GO terms aredistributed over the GO graph. We plot the subgraphs induced by the most significant GO terms reportedby classic and weight methods. There are two functions available. The showSigOfNodes will plot the inducedsubgraph to the current graphic device. The printGraph is a warping function of showSigOfNodes and willsave the resulting graph into a PDF or PS file.

> showSigOfNodes(GOdata, score(resultFis), firstSigNodes = 5, useInfo = 'all')

> showSigOfNodes(GOdata, score(resultWeight), firstSigNodes = 5, useInfo = 'def')

> printGraph(GOdata, resultFis, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "all", pdfSW = TRUE)

> printGraph(GOdata, resultWeight, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "def", pdfSW = TRUE)

Page 24: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

GO:0001775cell activation

0.05278667 / 640

GO:0002376immune system proces...

0.93706791 / 1179

GO:0002520immune system develo...

0.15064545 / 444

GO:0002521leukocyte differenti...

0.01243832 / 246

GO:0007275multicellular organi...

0.983846113 / 1499

GO:0008150biological_process

1.000000339 / 3888

GO:0009987cellular process

0.921628321 / 3733

GO:0016043cellular component o...

0.997379146 / 1950

GO:0030097hemopoiesis

0.22041940 / 406

GO:0030098lymphocyte different...

0.00050528 / 168

GO:0030154cell differentiation

0.90737395 / 1207

GO:0030217T cell differentiati...

1.31e−0526 / 123

GO:0031579membrane raft organi...

0.0034975 / 13

GO:0031580membrane raft distri...

2.73e−055 / 6

GO:0032501multicellular organi...

0.948648156 / 1948

GO:0032502developmental proces...

0.993651130 / 1734

GO:0033077T cell differentiati...

4.38e−0513 / 43

GO:0042110T cell activation

0.00011636 / 220

GO:0043368positive T cell sele...

0.0001528 / 20

GO:0043383negative T cell sele...

2.73e−055 / 6

GO:0045058T cell selection

0.0001529 / 25

GO:0045059positive thymic T ce...

8.75e−067 / 11

GO:0045061thymic T cell select...

1.12e−068 / 12

GO:0045321leukocyte activation

0.11784859 / 585

GO:0046649lymphocyte activatio...

0.00338943 / 328

GO:0048513animal organ develop...

0.95726979 / 1053

GO:0048534hematopoietic or lym...

0.23567241 / 420

GO:0048731system development

0.962818107 / 1393

GO:0048856anatomical structure...

0.996929118 / 1618

GO:0048869cellular development...

0.92888898 / 1256

GO:0051179localization0.631168164 / 1909

GO:0051641cellular localizatio...

0.90328079 / 1015

GO:0051665membrane raft locali...

2.73e−055 / 6

GO:0051668localization within ...

0.1189398 / 57

GO:0061024membrane organizatio...

0.46294529 / 323

GO:0071840cellular component o...

0.997681150 / 2000

Figure 6: The subgraph induced by the top 5 GO terms identified by the classic algorithm for scoring GO termsfor enrichment. Boxes indicate the 5 most significant terms. Box color represents the relative significance, rangingfrom dark red (most significant) to light yellow (least significant). Black arrows indicate is-a relationships andred arrows part-of relationships.

In the plots, the significant nodes are represented as rectangles. The plotted graph is the upper inducedgraph generated by these significant nodes. These graph plots are used to see how the significant GO termsare distributed in the hierarchy. It is a very useful tool to realize behaviour of various enrichment methodsand to better understand which of the significant GO terms are really of interest.

We can emphasise differences between two methods using the printGraph function:

> printGraph(GOdata, resultWeight, firstSigNodes = 10, resultFis, fn.prefix = "tGO", useInfo = "def")

> printGraph(GOdata, resultElim, firstSigNodes = 15, resultFis, fn.prefix = "tGO", useInfo = "all")

Page 25: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

GO:0001676long−chain fatty aci...

GO:0001775cell activation

GO:0002376immune system proces...

GO:0002520immune system develo...

GO:0002521leukocyte differenti...

GO:0006082organic acid metabol...

GO:0006629lipid metabolic proc...

GO:0006631fatty acid metabolic...

GO:0007275multicellular organi...

GO:0007423sensory organ develo...

GO:0008150biological_process

GO:0008152metabolic process

GO:0009653anatomical structure...

GO:0009790embryo development

GO:0009887animal organ morphog...

GO:0009987cellular process

GO:0010517regulation of phosph...

GO:0010518positive regulation ...

GO:0010863positive regulation ...

GO:0016043cellular component o...

GO:0019752carboxylic acid meta...

GO:0030097hemopoiesis

GO:0030098lymphocyte different...

GO:0030154cell differentiation

GO:0030217T cell differentiati...

GO:0031579membrane raft organi...

GO:0031580membrane raft distri...

GO:0032501multicellular organi...

GO:0032502developmental proces...

GO:0032787monocarboxylic acid ...

GO:0033077T cell differentiati...

GO:0033559unsaturated fatty ac...

GO:0042110T cell activation

GO:0042471ear morphogenesis

GO:0042473outer ear morphogene...

GO:0043085positive regulation ...

GO:0043436oxoacid metabolic pr...

GO:0043583ear development

GO:0043651linoleic acid metabo...

GO:0044093positive regulation ...

GO:0044237cellular metabolic p...

GO:0044238primary metabolic pr...

GO:0044255cellular lipid metab...

GO:0044281small molecule metab...

GO:0045058T cell selection

GO:0045061thymic T cell select...

GO:0045321leukocyte activation

GO:0046649lymphocyte activatio...

GO:0048513animal organ develop...

GO:0048534hematopoietic or lym...

GO:0048562embryonic organ morp...

GO:0048568embryonic organ deve...

GO:0048598embryonic morphogene...

GO:0048731system development

GO:0048856anatomical structure...

GO:0048869cellular development...

GO:0050790regulation of cataly...

GO:0051179localization

GO:0051336regulation of hydrol...

GO:0051345positive regulation ...

GO:0051641cellular localizatio...

GO:0051665membrane raft locali...

GO:0051668localization within ...

GO:0060191regulation of lipase...

GO:0060193positive regulation ...

GO:0061024membrane organizatio...

GO:0065007biological regulatio...

GO:0065009regulation of molecu...

GO:0071704organic substance me...

GO:0071840cellular component o...

GO:0090596sensory organ morpho...

GO:1900274regulation of phosph...

Figure 7: The subgraph induced by the top 5 GO terms identified by the weight algorithm for scoring GO termsfor enrichment. Boxes indicate the 5 most significant terms. Box color represents the relative significance, rangingfrom dark red (most significant) to light yellow (least significant). Black arrows indicate is-a relationships andred arrows part-of relationships.

Page 26: Gene set enrichment analysis with topGO - Bioconductor · 2019-12-17 · 1 Introduction The topGO package is designed to facilitate semi-automated enrichment analysis for Gene Ontology

8 Session Information

The version number of R and packages loaded for generating the vignette were:

� R version 3.6.1 (2019-07-05), x86_64-pc-linux-gnu

� Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C,LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C,LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C

� Running under: Ubuntu 18.04.3 LTS

� Matrix products: default

� BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so

� LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so

� Base packages: base, datasets, grDevices, graphics, grid, methods, parallel, stats, stats4, utils

� Other packages: ALL 1.28.0, AnnotationDbi 1.48.0, Biobase 2.46.0, BiocGenerics 0.32.0,GO.db 3.10.0, IRanges 2.20.1, Rgraphviz 2.30.0, S4Vectors 0.24.1, SparseM 1.78, genefilter 1.68.0,graph 1.64.0, hgu95av2.db 3.2.3, lattice 0.20-38, multtest 2.42.0, org.Hs.eg.db 3.10.0, topGO 2.38.1,xtable 1.8-4

� Loaded via a namespace (and not attached): DBI 1.1.0, MASS 7.3-51.4, Matrix 1.2-18,RCurl 1.95-4.12, RSQLite 2.1.4, Rcpp 1.0.3, XML 3.98-1.20, annotate 1.64.0, backports 1.1.5,bit 1.1-14, bit64 0.9-7, bitops 1.0-6, blob 1.2.0, compiler 3.6.1, crayon 1.3.4, digest 0.6.23,matrixStats 0.55.0, memoise 1.1.0, pillar 1.4.2, pkgconfig 2.0.3, rlang 0.4.2, splines 3.6.1,survival 3.1-8, tibble 2.1.3, tools 3.6.1, vctrs 0.2.0, zeallot 0.1.0

References

Ackermann, M. and Strimmer, K. (2009). A general modular framework for gene set enrichment analysis.BMC Bioinformatics, 10:47. 10.1186/1471-2105-10-47.

Alexa, A., Rahnenfuhrer, J., and Lengauer, T. (2006). Improved scoring of functional groups from geneexpression data by decorrelating go graph structure. Bioinformatics (Oxford, England), 22:1600–1607.10.1093/bioinformatics/btl140.

Chiaretti, S., et al. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifiesdistinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771–2778.

Consortium, G. O. (2001). Creating the Gene Ontology Resource: Design and Implementation. GenomeResearch, 11:1425–1433. Cold Spring Harbor Laboratory Press.

Draghici, S., Sellamuthu, S., and Khatri, P. (2006). Babel’s tower revisited: a universal resource forcross-referencing across annotation databases. Bioinformatics (Oxford, England), 22:btl372v1–2939.10.1093/bioinformatics/btl372.

Goeman, J. J. and Buhlmann, P. (2007). Analyzing gene expression data in terms of gene sets: methodologicalissues. Bioinformatics (Oxford, England), 23:980–987. 10.1093/bioinformatics/btm051.

Grossmann, S., Bauer, S., Robinson, P. N., and Vingron, M. (2007). Improved detection of overrepresentationof gene-ontology annotations with parent child analysis. Bioinformatics, 23:3024. 10.1093/bioinformat-ics/btm440.

Yon Rhee, S., Wood, V., Dolinski, K., and Draghici, S. (2008). Use and misuse of the gene ontologyannotations. Nature reviews. Genetics, advanced online publication:509–515. 10.1038/nrg2363.