Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Negative Binomial Additive Model for RNA-Seq Data
Analysis
Xu Ren and Pei Fen Kuan⇤
Department of Applied Mathematics and Statistics,
Stony Brook University,
Stony Brook, NY 11794, USA
April 4, 2019
SUMMARY. High-throughput sequencing experiments followed by di↵erential expression
analysis is a widely used approach for detecting genomic biomarkers. A fundamental step
in di↵erential expression analysis is to model the association between gene counts and co-
variates of interest. Existing models assume linear e↵ect of covariates, which is restrictive
and may not be su�cient for some phenotypes. In this paper, we introduce NBAMSeq, a
flexible statistical model based on the generalized additive model and allows for information
sharing across genes in variance estimation. Specifically, we model the logarithm of mean
gene counts as sums of smooth functions with the smoothing parameters and coe�cients
estimated simultaneously within a nested iterative method. The variance is estimated by
the Bayesian shrinkage approach to fully exploit the information across all genes. Based
on extensive simulation and case studies of RNA-Seq data, we show that NBAMSeq o↵ers
improved performance in detecting nonlinear e↵ect and maintains equivalent performance in
detecting linear e↵ect compared to existing methods. Our proposed NBAMSeq is available
for download at https://github.com/reese3928/NBAMSeq and in submission to Biocon-
ductor repository.
KEY WORDS: Bayesian shrinkage; Di↵erential expression analysis; Generalized additive
model; RNA-Seq; Spline model.
⇤Correspondence to [email protected]
1
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
1 Introduction
In recent years, RNA-Seq experiment has become the state-of-the-art method for quantifying
mRNAs levels by measuring gene expression digitally in biological samples. An RNA-Seq
experiment usually starts with isolating RNA sequences from biological samples using the Il-
lumina Genome Analyzer, the most commonly used platform for generating high-throughput
sequencing data. These mRNA sequences are reverse transcribed into cDNA fragments. To
reduce the sequencing cost and increase the speed of reading the cDNA fragments (typically
a few thousands bp), these fragments are sheared into short reads (50-450bp). These reads
are mapped back to the original reference genome and the number of read counts mapping
to each gene/transcript region are computed. RNA-Seq experiment is usually summarized
as a count table with each row representing a gene/transcript and each column representing
a sample.
An important aspect of statistical inference in RNA-Seq data is the di↵erential expression
(DE) analysis, which is to perform statistical test on each gene to ascertain whether it is
DE or not. Several methods have been developed for DE test using gene counts data,
including DESeq2 [1], edgeR [2], which are based on negative binomial regression model
and BBSeq [3], which is based on beta-binomial regression model. DESeq2 [1] performs
DE analysis in a three-step procedure. The normalization factors for sequencing depth
adjustment are first estimated using median-to-ratio method. Both the coe�cients and
the dispersion parameters in negative binomial distribution are estimated by the Bayesian
shrinkage approach to e↵ectively borrow the information across all genes. Finally, the Wald
tests or the likelihood ratio tests are performed to identify DE genes. edgeR [2] also uses the
negative binomial distribution to model gene counts. It assumes that the variance of gene
counts depends on two dispersion parameters, namely the negative binomial dispersion and
the quasi-likelihood dispersion. The negative binomial dispersion is estimated by fitting a
mean-dispersion trend across all genes whereas the quasi-likelihood dispersion are estimated
by Bayesian shrinkage approach [4]. The DE statistical tests are conducted based on either
the likelihood ratio tests [5, 6] or the quasi-likelihood F-tests [7] [8]. Both the DESeq2 [1] and
edgeR [2] are widely used due in part to their shrinkage estimators which can improve the
stability of DE test in a large range of RNA-Seq data or other technologies that generates read
counts genomics data. BBSeq [3] is an alternative tool for DE analysis. It assumes that the
count data follows a beta-binomial distribution where the beta distribution is parameterized
in a way such that the variance accounts for over dispersion. The parameters are estimated
by the maximum likelihood approach and DE analysis is based on either the Wald test or
the likelihood ratio test. Both the negative binomial and beta-binomial regression models
2
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
belong to the generalized linear model (GLM) class. GLM is popular owing to the ease of
implementation and interpretation, by assuming the link function is a linear combination of
covariates. However, as illustrated in the motivating dataset below, the linearity assumption
is not always appropriate and nonlinear models for linking the phenotype to gene expression
in RNA-Seq data may be important. Stricker et al. [9] proposed GenoGAM, which is a
generalized additive model for ChIP-Seq data. However, GenoGAM was developed for a
di↵erent purpose, i.e., the nonlinear component was to smooth the read count frequencies
across genome, and not for relating a nonlinear association between the phenotype and read
counts.
In this paper, we introduce NBAMSeq, a generalized additive model for RNA-Seq data.
NBAMSeq brings together the negative binomial additive model [10] and information sharing
across genes for variance estimation which enhances the power for detecting DE genes, since
treating each gene independently su↵ers from power loss due to the high uncertainty of
variance, especially when the sample size is small [1]. By borrowing information across
genes, NBAMSeq is able to model the nonlinear association while improving the accuracy
of variance estimation and shows significant gain in power for detecting DE genes when the
nonlinear relationship exists.
2 Motivating Datasets
We demonstrate the existence of nonlinear relationship between gene counts and covari-
ates/phenotypes of interest on two real RNA-Seq datasets. The first dataset is obtained
from The Cancer Genome Atlas (TCGA) consortium, a rich repository consisting of high-
throughput sequencing omics data for multiple types of cancers; downloadable from the
Broad GDAC Firehose. For ease of exposition, we focus on the Glioblastoma multiforme
(GBM), an aggressive form of brain cancer. The dataset consists of RNA-Seq count data on
20,501 genes and 158 samples generated using the Illumina HiSeq2000 platform, along with
the clinical information of these samples. 16,092 genes were retained after filtering for genes
with more than 10 samples having count per million (CPM) less than one. Gene counts are
normalized using median-of-ratios as described in DESeq [11] and DESeq2 [1] to account for
di↵erent library sizes across all samples. Age has been identified as an important prognos-
tic marker for malignant cancers including GBM [12]. To better understand the prognostic
value of age in GBM, here we aim to identify age specific gene expression signature. For
each individual gene i, three models were fitted: (i) base model: yij = �0 + �1x1j + �2x2j,
(ii) linear model: yij = �0 + �1x1j + �2x2j + �3x3j, (iii) cubic spline regression with 3 knots:
yij = �0+�1x1j+�2x2j+f(x3j), where yij denotes logarithm transformed normalized counts
3
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
8.5
9.0
9.5
10.0
10.5
20 40 60 80
age
log(c
ounts
+1)
MTPAP
0.0
2.5
5.0
7.5
10.0
20 40 60 80
age
log(c
ounts
+1)
RANBP17
10
12
14
20 40 60 80
age
log(c
ounts
+1)
CNOT2
10
11
12
13
14
20 40 60 80
age
log(c
ounts
+1)
CPSF6
8.5
9.0
9.5
10.0
10.5
20 40 60 80
age
log(c
ounts
+1)
MTPAP
0.0
2.5
5.0
7.5
10.0
20 40 60 80
age
log(c
ounts
+1)
RANBP17
10
12
14
20 40 60 80
age
log(c
ounts
+1)
CNOT2
10
11
12
13
14
20 40 60 80
age
log(c
ounts
+1)
CPSF6
Figure 1: Nonlinear relationship between gene counts and age.
of sample j and x1j, x2j, x3j denotes gender, race, age respectively. Race is a binary variable
indicating if the sample is Caucasian or non-Caucasian. We then compared the three models
using ANOVA F-test, and the p-values are adjusted using the Benjamini &Hochberg [13]
false discovery rate (FDR). The test statistics, degrees of freedom of F-statistics, number of
significant genes at FDR < 0.05 are summarized in Web Table 1, where SSE1, SSE2, SSE3
denote sum of squared error for model (i), (ii), (iii) respectively. 250 genes are significant
when comparing model (i) to model (iii). We further investigated the p-values of these 250
genes when comparing model (i) to model (ii) and found that 30 out of these 250 genes
have FDR > 0.2 (see Web Table 2), implying that a simple linear regression is unable to
identify these 30 genes even if a less stringent FDR threshold is used. Moreover, 9 genes
are significant when comparing model (ii) to model (iii). Figure 1 shows the scatterplots of
counts versus age of the most significant genes. For visualization, a local regression is fitted
on each scatterplot, suggesting significant nonlinear relationships between normalized gene
counts and age. Although only GBM is shown here, we observed similar nonlinear age e↵ect
for other types of cancer from Broad GDAC Firehose database (see Web Table 3).
The second dataset comes from our study of gene expression associated with posttrau-
matic stress disorder (PTSD) in the World Trade Center (WTC) responders [14]. The dataset
consists of RNA-Seq read counts of 25,830 genes in whole blood profiled in 324 WTC male
responders along with their clinical information. The phenotype of interest is the Posttrau-
matic Stress Disorder Checklist-Specific Version (PCL), a self-report questionnaire assessing
4
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
the severity of WTC-related DSM-IV PTSD symptoms [15]. To better understand the non-
linear e↵ect of PTSD severity on gene expression, here we aim to identify gene expression
signature related to PCL score. The gene filtering criterion and normalization method are
similar to the framework used in the first motivating dataset and samples with missing clini-
cal information were omitted from the analysis. 16,193 genes and 321 samples were retained
for DE analysis. In addition, cell type proportions have been implicated in the analysis of
whole blood samples. The proportions of CD8T, CD4T, natural killer, B-cell and mono-
cytes were estimated as described in our previous paper [16]. Similar to the first motivating
dataset, the base model, linear model, and cubic spline regression model were fitted on each
gene to study the e↵ect of PCL score. Age, race, cell proportions of CD8T, CD4T, natural
killer, B-cell and monocytes were adjusted in each model. When comparing the cubic spline
regression model with base model using ANOVA F-test, 6 genes (AP1G1, NFYA, SCAF11,
SPTLC2, TAOK1, ZNF106) are significant at FDR < 0.05. Web Figure 1 shows the scat-
terplots of counts versus PCL score of these 6 genes with a local regression fitted on each
scatterplot. These plots show a consistent pattern, i.e., gene expression first decrease and
then gradually increase with the minimum at around PCL score 30.
These observations indicate that the relationship between gene counts and covariate of
interest may be nonlinear, and it can be challenging to capture the true underlying non-
linear relationship using parametric methods. Thus, more flexible statistical methods to
characterize the nonlinear pattern in RNA-Seq data will improve the DE analysis of com-
plex phenotypes. To this end, we propose a negative binomial additive model to capture the
nonlinear association in RNA-Seq data analysis.
3 Model Specification
3.1 Sequencing Depth Normalization
For RNA-Seq data, the raw counts of each sample are proportional to its sequencing depth
i.e. the total number of reads. The raw counts are not directly comparable across samples
and therefore normalization factors are needed to adjust for gene counts. Several methods
have been developed for RNA-Seq data normalization including total count, upper quar-
tile [17], median, quantile [18, 19], conditional quantile [20], reads per kilobase per million
mapped reads (RPKM) [21], median-of-ratios [11], and trimmed mean of M values [22]. A
comprehensive comparisons of these normalization methods are provided in Dillies et al.
[23] and Li et al. [24]. By comparing the spearman correlation between normalized counts
and quantitative reverse transcription polymerase chain reaction (qRT-PCR), Li et al. [24]
5
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
showed that RPKM yields the best normalization results when alignment accuracy is low.
However, by considering the e↵ect of normalization method on DE analysis, Dillies et al.
[23] showed that RPKM is ine↵ective and not recommended. They also showed that the
median-of-ratios [11] and trimmed mean of M values [22] maintain low false-positive rate
without compromising the power in DE analysis.
Throughout this paper, we adapt the median-of-ratios used by DESeq [11], DEXSeq [25]
and DESeq2 [1] for sequencing depth adjustment. Let Kij be the read count for gene i in
sample j. The normalization factor of sample j is defined as:
sj =i:K⇤
i 6=0median
Kij
K⇤i
, whereK⇤i =
⇣ mY
j=1
Kij
⌘1/m
(1)
where m is the total number of samples.
3.2 Generalized Additive Model
For gene i and sample j, we assume the gene count Kij follows the generalized additive
model:
Kij ⇠ NB(mean = µij, dispersion=↵i)
log(µij) = log(sj) +X
r
fr(xjr)
where µij = E(Kij) is the mean count and ↵i is dispersion parameter that relates the mean
to the variance by Var(Kij) = µij + ↵iµ2ij, and xjr’s are the r covariates of interest. The
logarithm of normalization factor log(sj) is an o↵set in the model and sj is computed by
Equation (1) in Section 3.1. fr(·), r = 1, 2, ..., R represents the spline or smooth function
which capture the nonlinear associations between covariate r and gene expression. Since our
statistical goal is to conduct inference on genes to ascertain whether they are DE with respect
to the covariate/phenotype of interest, we choose fr(·) to be spline functions to facilitate the
inference, i.e., fr(xjr) =PQ
q=1 �rqBrq(xjr), where �rq are unknown coe�cients and B(·) areB-spline basis functions. The coe�cients � = (�11, ..., �1Q, ..., �R1, ..., �RQ)T are estimated
by the penalized log-likelihood maximization to avoid overfitting. In particular, the model
is estimated by maximizing
L(�) = l(�)� 1
2
X
r
�r�TSr� (2)
with respect to �, where l(·) is the log-likelihood function of negative binomial distribution,
�r is smoothing parameter which controls the smoothness of fr(·), and Sr is a positive
6
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
semidefinite matrix.
3.3 Model Fitting
Several methods have been proposed for fitting the generalized additive model. Hastie and
Tibshirani [26, 27] proposed a general backfitting algorithm, whereas Thurston et al [10]
adapted the backfitting algorithm for negative binomial distribution. In the backfitting
algorithm, given the smoothing parameters �r and dispersion parameters ↵i, the coe�cients
� can be computed by the penalized iteratively reweighted least squares. However, the most
challenging part is in the estimation of the smoothing parameters [28], which is not easy to be
incorporated in the backfitting algorithm. Wood et al [29, 30, 31] proposed a nested iterative
algorithm which estimates the smoothing parameters and coe�cients simultaneously. Their
algorithm estimates the smoothing parameters using the cross validation framework in the
outer iteration and the coe�cients using penalized iteratively reweighted least squares in the
inner iteration. We adapted the algorithm implemented in the mgcv package [31, 30, 32, 28,
33] for model fitting, in which various options for model selection criteria and computational
strategies are provided. Specifically in NBAMSeq, we adapted the gam function from mgcv
[31, 30, 32, 28, 33] for model fitting, in for fitting the generalized additive model on each
gene.
The smoothing parameter tuning criteria can be divided into two types, namely (a) the
prediction error based criteria including AIC, cross-validation or generalized cross-validation
(GCV), and (b) the likelihood based criteria including maximum likelihood or restricted max-
imum likelihood [30]. The likelihood based methods o↵er some advantageous over the predic-
tion error based methods since they impose a larger penalty on overfitting and exhibit faster
convergence of smoothing parameters estimation [34]. In NBAMSeq, we use the restricted
maximum likelihood (REML) method. Maximizing the penalized log-likelihood (2) can be
viewed as a Bayesian approach with Gaussian prior � ⇠ N(0,S�� ), where S� =
P�rSr and
S�� is the inverse or pseudoinverse [35] of S�. Under the Bayesian framework, estimating the
smoothing parameters can be viewed as maximizing the log-marginal likelihood [29, 30, 31]
V (�) = log
Zf(y|�)f(�)d�. (3)
The Laplacian approximation of (3) is given by:
V⇤(�) ' L(�) +
1
2log |S�|+ � 1
2log |XTWX+ S�|+
M
2log(2⇡),
where � is the maximizer of (2), X is the model matrix of covariates i.e. X = [X1,X2
, ...,XR]
7
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
with Xrjk = Brk(xjr). |S�|+ denotes the product of positive eigenvalues of S�, M is the
number of zero eigenvalues of S�, andW is the diagonal weight matrix. For negative binomial
distribution, the link function is logarithm link and the variance is given by µij +↵iµ2ij, thus
the diagonal elements of W is given by wjj = 1(1/µij)2(µij+↵iµ2
ij)= 1
1/µij+↵i. Let �i1, ...,�iR
denote the smoothing parameters of gene i. The nested iteration implemented in mgcv
[31, 30, 32, 28, 33] can be summarized as Algorithms 1 and 2.
Algorithm 1 Outer iteration for � and ↵i
1: Initialize ⇢ = (log↵i, log �i1, ..., log �iR) and �.2: Calculate � by Algorithm 2.3: Calculate @V
@⇢xand @2V
@⇢x@⇢yfor all x, y.
4: Denote gradient by rV and the Hessian matrix by H, where Hxy = @2V@⇢x@⇢y
. Calculate
� = �H�1rV .5: If V (⇢+�) < V (⇢), set � to be �/2 until V (⇢+�) > V (⇢).6: Set ⇢ to be ⇢+�.7: Repeat step 2 to step 6 until @V
@⇢x' 0 for all x and �H is positive semidefinite.
Algorithm 2 Inner iteration for �1: Initialize �.2: Update � by �new = (XTWX + S�)�1XTWz, where z is a vector with each element
zj = log(µij/sj) + (Kij � µij)/µij.3: Update W by wjj =
11/µij+↵i
.4: Repeat step 2 and 3 until the change of � below threshold.
3.4 Dispersion Estimation
Information sharing across genes in the estimation of dispersion parameter has been shown to
increase accuracy compared to treating each gene independently, especially when the number
of replicates is small. Numerous methods have been proposed to incorporate information
sharing, and most of them were based on the empirical Bayesian approach which shrinks the
gene-wise dispersion estimates toward a common dispersion parameter across all genes.
Following DESeq2 [1], we apply the empirical Bayesian shrinkage dispersion parame-
ter estimate. For each individual gene i, we assume a log-normal prior for the dispersion
parameters ↵i:
log(↵i) ⇠ N(log↵tr(µi), �2d).
where dispersion trend relates dispersion to mean of normalized µi =1m
PjKij
sjcounts by
↵tr(µi) = b1µi
+ b0 and �2d is the prior variance, which is assumed to be constant across all
8
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
genes. A Gamma regression is fitted to estimate the parameters b1 and b0. The prior variance
�2d is estimated by subtracting the expected sampling variance of log-dispersion estimator
from the variance estimate of residual log(↵i)� log↵tr(µi) as described in DESeq2 [1]
We first obtain gene-wise dispersion estimates ↵i by Algorithms 1 and 2. These gene-wise
dispersion estimates are used to fit the dispersion trend and estimate the variance in prior
distribution. The final dispersion parameter is estimated by maximizing
lCR(↵)�(log↵� log↵tr(µi))2
2�2d
, (4)
where lCR is Cox-Reid adjusted log-likelihood [36]. Note that maximizing (4) can be viewed
as a maximize a posterior (MAP) approach. The entire workflow implemented in NBAMSeq
is summarized in algorithm 3.
Algorithm 3 Workflow of NBAMSeq1: Estimate normalization factors sj.2: Estimate smoothing parameters and gene-wise dispersions by Algorithms 1 and 2.3: Estimate dispersion trend and prior variance using gene-wise dispersion estimates from
Step 2.4: Calculate MAP estimate of dispersions.5: Estimate coe�cients � by algorithm 2 using the smoothing parameters in Step 2 and
dispersion parameters in Step 4.
4 Inference
Our main objective is to identify the genes that exhibit either linear or nonlinear association
with the covariate/phenotype of interest. We formulate this as a hypothesis testH0 : fr(xr) =
0 versus H1 : fr(xr) 6= 0, where fr(·) is a smooth function of the covariate of interest. Since
we use a Bayesian framework in the coe�cients estimation, we have f(�|y) / f(y|�)f(�).In addition, � ⇠ N(0,S�
� ), thus log f(�|y) / log f(y|�)� 12�
TS��. By applying a second
order Taylor expansion to �, we obtain
log f(y|�)� 1
2�TS��
= log f(y|�)� 1
2�
TS�� +
1
2(� � �)T(D2 log f(y|�)� S�)(� � �),
(5)
where D2 log f(y|�) is Hessian matrix of the log-likelihood of negative binomial distribution
with respect to �. We use the expected information matrix to replace the Hessian matrix
9
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
(see Web Appendix A) and equation (5) simplifies to
log f(y|�)� 1
2�
TS�� � 1
2(� � �)T(XTWX+ S�)(� � �). (6)
The large sample limit of (6) is the probability density function of multivariate normal
distribution, thus �|y ⇠ N(�, (XTWX + S�)�1) and V� = (XTWX + S�)�1 is Bayesian
covariance matrix of �. Let fr be the values of fr(xr) evaluated at the observed values of xr
and X denote the matrix such that fr = X�. Wood [37] showed that fr ⇠ N(fr, XV�XT)
approximately and under the null hypothesis the test statistics fT
r (XV�XT)�1fr follows a
�2 distribution with the degrees of freedom equals to the e↵ective degrees of freedom (edf)
corresponding to variable xr if the edf is an integer. If the edf of xr is not an integer, Wood
[37] showed that the distribution of fT
r (XV�XT)�1fr can be approximated by a non-central
�2 distribution using the methods of Liu et al [38]. Specifically, the degrees of freedom
and the non-central parameter in �2 distribution are determined so that the skewness of
fT
r (XV�XT)�1fr and target �
2 distribution are equal, whereas the di↵erence between the
kurtosis of fT
r (XV�XT)�1fr and target �2 distribution is minimized. We utilize this result
to calculate a p-value for each gene. The p-values are further adjusted using FDR procedure
to account for multiple hypothesis testing.
5 Simulation
To evaluate the performance of NBAMSeq, we designed simulation studies in which NBAM-
Seq was compared with DESeq2 [1] and edgeR [2], two of the most popular softwares for
identifying DE genes in RNA-Seq data. Two scenarios were considered, the first was to evalu-
ate the accuracy of dispersion estimates and power of detecting DE genes when the nonlinear
e↵ect is true association, the second was to investigate the robustness of NBAMSeq when
the linear e↵ect is the true association.
In both scenarios, for gene i and sample j, we assumed that the gene counts follow
negative binomial distribution. i.e. Kij ⇠ NB(mean=µij, dispersion=↵i), µij = sjqij. For
simplicity and without loss of generality, we set all normalization factors sj to be 1. Since
the dispersion parameter decreases with increasing mean counts, we assumed that the true
mean dispersion relationship is ↵i = a/µi + 0.05. The performance under di↵erent degrees
of dispersion (a = 1, 3, 5) were studied. In practice, although a could be larger than 5, the
larger value of a is compensated by the larger mean of normalized counts. The magnitude of
the dispersion ↵i is in similar scale regardless of a, and thus only a = 1, 3, 5 were studied. The
covariate of interest xj was simulated from a uniform distribution U(u1, u2). For simplicity,
10
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
we used u1 = 20 and u2 = 80 because 20 to 80 covers the range of age and PCL score in our
motivating datasets.
5.1 Scenario I
The aim of this scenario was two-fold: (i) to evaluate the accuracy of NBAMSeq dispersion
estimates when nonlinear e↵ect is significant, and (ii) to investigate the power and error rate
control of NBAMSeq.
We simulated several count matrices which contain the same number of genes (n =15,000)
across di↵erent sample sizes (m = 15, 20, 25, 30, 35, 40). We randomly selected 5% of the
15,000 genes to be DE. Within each count matrix, the mean of negative binomial distribution
µij was simulated from
⌘ij = log2(µij) =
8<
:bi0 + bi1B1(xj) + bi2B2(xj) + ...+ bi5B5(xj), i 2 {DE indices}
bi0 + c, otherwise
and µij was calculated by µij = 2⌘ij , where B(·) are cubic B-Spline basis functions to induce anonlinear e↵ect on DE genes, and c is a constant to ensure that the count distribution of DE
and non-DE genes comparable. The intercepts bi0 were simulated from normal distribution,
and the other coe�cients bi1, bi2, ..., bi5 were simulated from a uniform distribution. The
mean and standard deviation of normal distribution, as well as the lower/upper bound of
uniform distribution were chosen such that the distribution of µi =1m
Pmj=1 µij across all
genes mimicked our second motivating dataset. Finally, the count of gene i in sample j
was simulated from a negative binomial distribution with mean µij and dispersion ↵i. The
simulation was repeated 50 times for each combination of dispersion and sample size.
To evaluate the accuracy of dispersion estimates, three methods were compared: (a) DE-
Seq2 [1]; (b) edgeR [2]; (c)NBAMSeq. To illustrate that information sharing across genes
produces more accurate dispersion estimates, the gene-wise dispersion estimated by NBAM-
Seq (NBAMSeq gene-wise) without incorporating prior dispersion trend was also included
in the comparison. The average mean squared error (MSE) across 50 repetitions was calcu-
lated and the results are shown in Figure 2. Among the DE genes in which the nonlinear
e↵ect is significant, NBAMSeq outperforms DESeq2 [1] and edgeR [2] across di↵erent sample
sizes. On the other hand, among the non-DE genes, the MSE of edgeR [1] is lower than DE-
Seq2 [2] across all degrees of dispersion and achieves comparable performance to NBAMSeq.
NBAMSeq also has lower MSE than NBAMSeq (gene-wise), which shows that the Bayesian
shrinkage estimation is better than gene-wise dispersion estimation. This simulation scenario
shows that NBAMSeq is able to maintain good dispersion accuracy for non-DE genes and
11
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
0.00
0.01
0.02
0.03
15 20 25 30 35 40
MS
E (
DE
genes)
a = 1
0.00
0.01
0.02
0.03
15 20 25 30 35 40
MS
E (
non−
DE
genes)
a = 1
0.000
0.025
0.050
0.075
0.100
15 20 25 30 35 40
MS
E (
DE
genes)
a = 3
0.000
0.025
0.050
0.075
0.100
15 20 25 30 35 40
MS
E (
non−
DE
genes)
a = 3
0.00
0.05
0.10
0.15
0.20
15 20 25 30 35 40
sample size
MS
E (
DE
genes)
a = 5
0.00
0.05
0.10
0.15
0.20
15 20 25 30 35 40
sample size
MS
E (
non−
DE
genes)
a = 5
DESeq2 edgeR NBAMSeq NBAMSeq(gene−wise)
Figure 2: Scenario I: MSE of dispersion estimates.
shows improvement in the accuracy over DESeq2 [1] and edgeR [2] for DE genes when the
nonlinear association is significant.
Figure 3 shows the average true positive rate (TPR), empirical false discovery rate (FDR),
empirical false non-discovery rate (FNR) and area under curve (AUC) of DESeq2 [1], edgeR
[2] and NBAMSeq. For edgeR [2], two approaches for DE are available, namely the quasi-
likelihood (QL) and the likelihood ratio test (LRT) approach for calculating the test statistics
as described in Lund et al [7]. The authors further showed that detecting DE genes using
QL approach controls FDR better than the LRT approach. When calculating TPR, FDR,
and FNR, genes are declared to be significant if nominal FDR is < 0.05. The TPR results
indicate that our proposed method NBAMSeq has the highest power for detecting nonlinear
DE genes regardless of the degrees of dispersion and sample sizes. In addition, NBAMSeq
controls the FDR at 0.05 in all cases except in high dispersion (a = 5) and low sample size
(m = 15) in which the FDR is slightly inflated (empirical FDR 0.0576). The FNR results
show that NBAMSeq has lower FNR compared to other methods. To ascertain that the
12
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
0.25
0.50
0.75
1.00
15 20 25 30 35 40
TP
R
a = 1
0.000
0.025
0.050
0.075
0.100
15 20 25 30 35 40
FD
R
a = 1
0.00
0.01
0.02
0.03
0.04
0.05
15 20 25 30 35 40
FN
R
a = 1
0.7
0.8
0.9
1.0
15 20 25 30 35 40
AU
C
a = 1
0.25
0.50
0.75
1.00
15 20 25 30 35 40
TP
R
a = 3
0.000
0.025
0.050
0.075
0.100
15 20 25 30 35 40
FD
R
a = 3
0.00
0.01
0.02
0.03
0.04
0.05
15 20 25 30 35 40
FN
Ra = 3
0.7
0.8
0.9
1.0
15 20 25 30 35 40
AU
C
a = 3
0.25
0.50
0.75
1.00
15 20 25 30 35 40
sample size
TP
R
a = 5
0.000
0.025
0.050
0.075
0.100
15 20 25 30 35 40
sample size
FD
R
a = 5
0.00
0.01
0.02
0.03
0.04
0.05
15 20 25 30 35 40
sample size
FN
R
a = 5
0.7
0.8
0.9
1.0
15 20 25 30 35 40
sample size
AU
C
a = 5
DESeq2 edgeR_LRT edgeR_QL NBAMSeq
Figure 3: Scenario I: TPR, FDR, FNR, AUC of DESeq2, edgeR and NBAMSeq.
13
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
power of NBAMSeq is not sensitive to the nominal FDR cuto↵ for DE genes, the AUCs
comparison shows NBAMSeq is consistently more powerful in all cases. The observation
also shows that the performance of DESeq2 [1] and edgeR [2] LRT approach are comparable,
whereas edgeR [2] QL approach is conservative as shown by the higher FNR. Additional
comparisons of the QL approach for NBAMSeq and Gaussian additive models on logarithm
transformed count data are provided in Web Figure 2 and 3.
5.2 Scenario II
The goal of this scenario was to evaluate the robustness of NBAMSeq when the true associ-
ation is a linear e↵ect. The simulation setup in this scenario is similar to Scenario I except
that ⌘ij is given by:
⌘ij = log2(µij) =
8<
:bi1xj + c1, i 2 {linear DE indices}
bi0 + c2, otherwise
where c1 and c2 are constants to ensure that the DE and non-DE genes counts distribution
are comparable to avoid bias in detecting DE genes. The linear coe�cient bi1 was simulated
from normal distribution. Similar to Scenario I, the mean and standard deviation of normal
distribution were chosen such that the distribution of µi mimicked the distribution of mean
normalized counts in real data. The simulation was repeated 50 times and the average MSE
of dispersion estimates is shown in Web Figure 4. The accuracy of NBAMSeq is comparable
to the accuracy of DESeq2 [1] and edgeR [2] in almost all cases. This simulation shows that
although NBAMSeq is designed to detect nonlinear e↵ect, the method is robust and yields
accurate dispersion estimates when the true association is linear.
To evaluate the power and error rate control of our proposed model NBAMSeq in de-
tecting DE genes with linear e↵ect, we compare the TPR, FDR, FNR, and AUC at nominal
FDR 0.05. Web Figure 5 illustrates that NBAMSeq is a robust approach and maintains
comparable performance to other methods.
6 Real Data Analysis
In this section, we applied DESeq2 [1] and our proposed method NBAMSeq to our motivating
datasets (Section 2), namely the TCGA GBM dataset and WTC dataset. In TCGA GBM
dataset, the association between age and each gene expression were investigated, adjusting
for sex and race. At FDR 0.05, 845 and 1,439 genes were detected by DESeq2 [1] and
14
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
Table 1: KEGG pathways selected by NBAMSeq.ID Description pvalue p.adjusthsa04310 Wnt signaling pathway 0.000252 0.033951hsa04216 Ferroptosis 0.000334 0.033951hsa04060 Cytokine-cytokine receptor interaction 0.000344 0.033951hsa04974 Protein digestion and absorption 0.000463 0.034281
our proposed NBAMSeq, respectively. 818 genes were in common between DESeq2 [1] and
NBAMSeq, 621 genes were unique to NBAMSeq and 27 genes were unique to DESeq2 [1]
(Figure 4). To account for the fact that significant genes are sensitive to the FDR threshold,
for the genes unique to NBAMSeq, we investigated their adjusted p-value in DESeq2 [1]
(Figure 5A), and vice versa for genes unique to DESeq2 (Figure 5B). A large proportion
of the 621 genes unique to NBAMSeq have FDR above 0.1 in DESeq2 [1], implying that
DESeq2 [1] is unable to detect the nonlinear genes even if a less stringent FDR threshold is
used. However, for the 27 genes unique to DESeq2 [1], their adjusted p-values in NBAMSeq
are < 0.1, implying that NBAMSeq is able to detect all these 27 genes at FDR < 0.1. Next,
we performed the pathway analysis on DE genes using the hypergeometric tests implemented
in the clusterProfiler software [39] for both the Kyoto Encyclopedia of Genes and Genomes
(KEGG) [40] and Gene Ontology [41] gene sets. Significant pathways were chosen at FDR
0.05. Gene sets with fewer than 15 genes or greater than 500 genes were omitted from the
analysis. For the KEGG pathways, no gene set was significant among DESeq2 DE genes
whereas 4 genes sets were significant among NBAMSeq DE genes (Table 1). This result is
particularly interesting as mutations in Wnt signaling pathway is implicated in a variety of
developmental defects in animals and associated with aging [42, 43]. For the Gene Ontology,
a total of 5,000 gene sets were tested, 45 and 78 gene sets were identified by DESeq2 and
NBAMSeq, respectively and 34 gene sets were in common between them. A full list of
significant gene sets identified by these two models are provided in Web Table 4 and 5.
Specifically, among the top gene sets unique to NBAMSeq DE genes (Web Table 6) are
extracellular matrix pathways and catabolic process pathways. Changes to extracellular
matrix changes with aging has been implicated in the progression of cancer [44], whereas
Barzilai et al. [45] showed that metabolic pathways are critical in aging process. These
observations suggest that gene sets detected by the NBAMSeq provide additional insights
to the development of cancer and aging as a risk factor.
In the WTC dataset, the relationship between PCL and gene expression were investi-
gated, adjusting for age, race and cell proportions (CD8T, CD4T, natural killer, B-cell,
monocytes). At FDR 0.05, 836 and 775 genes were detected by DESeq2 and NBAMSeq,
respectively (Web Figure 6). Unlike the TCGA GBM dataset, the genes detected by the two
15
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
621
27
818
NBAM
DESeq2
Figure 4: Number of significant genes by DESeq2 and NBAMSeq (GBM dataset).
0
50
100
150
200
0.00 0.25 0.50 0.75 1.00
padj
count
A
0
1
2
3
4
5
0.05 0.06 0.07 0.08 0.09 0.10
padj
count
B
0
50
100
150
200
0.00 0.25 0.50 0.75 1.00
padj
count
A
0
1
2
3
4
5
0.05 0.06 0.07 0.08 0.09 0.10
padj
count
B
Figure 5: Histogram of adjusted p-values. The blue dashed line is FDR cut-o↵ 0.1.
16
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
methods have a large overlap, implying that the primary e↵ect of PCL on gene expression
is linear. Although NBAMSeq identified fewer PCL gene expression signature compared to
DESeq2 [1], the 69 genes unique to DESeq2 were marginal significant in NBAMSeq (Web
Figure 7). However, for the 8 genes unique to NBAMSeq, they were not necessarily sig-
nificant in DESeq2 (Web Table 7). We performed pathway analysis on KEGG [40], and
Web Table 8 shows the significant KEGG pathways among NBAMSeq DE genes. As ex-
pected, these KEGG pathways were also significant among DESeq2 DE genes due to the
large overlapping genes.
7 Discussion
This paper introduced a flexible negative binomial additive model combined with Bayesian
shrinkage dispersion estimates for RNA-Seq data (NBAMSeq). Motivated by the nonlinear
e↵ect of age in TCGA GBM data and PTSD severity (PCL score) in WTC data, our model
aimed to detect genes which exhibit nonlinear association with the phenotype of interest. The
smoothing parameters and coe�cients in NBAMSeq were estimated e�ciently by adapting
the nested iterative method implemented in mgcv [31, 30, 32, 28, 33]. To increase the
accuracy of the dispersion parameter estimation in small sample size scenario, the Bayesian
shrinkage approach was applied to model the dispersion trend across all genes. Hypothesis
tests to identify DE genes were based on chi-squared approximations.
Although our model was developed for RNA-Seq data, it can be extended to identify
nonlinear associations in other genomics biomarkers generated from high throughput se-
quencing which share similar features as RNA-Seq (e.g., over dispersion in count data). Our
simulation studies show that the proposed NBAMSeq is powerful when the underlying true
association in nonlinear and robust when the underlying true association in linear compared
to DESeq2 [1] and edgeR [2]. In addition, the real data analysis showed that NBAMSeq o↵ers
additional biological insights to the role of aging in the development of cancer by modeling
the nonlinear age e↵ect. Both the simulation and case studies suggest the advantageous of
investigating potential nonlinear associations to detect more e↵ective biomarkers, in order
to better understand the biological underpinnings of complex diseases.
NBAMSeq exhibits slight empirical FDR inflation (⇠0.06 at nominal FDR 0.05) in small
sample size (n < 15) and large dispersion scenario, attributed to the chi-squared approxi-
mation in the hypothesis tests. One way to circumvent this issue is by using a permutation
test to obtain the null distribution of the test statistic, but this approach is computational
intensive. Future work includes developing an e�cient parallel computing framework for the
permutation test. In addition, similar to other softwares for RNA-Seq DE analysis, currently
17
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
NBAMSeq is developed for inference at gene level in the absence of prior information. Recent
methods which take into account the gene network structure in DE analysis have been shown
to a powerful approach [46]. Thus, an interesting future research direction is to incorporate
the network topology in NBAMSeq to detect nonlinear association and decipher the collec-
tive dynamics and regulatory e↵ects of gene expression. Finally, NBAMSeq is available for
download at https://github.com/reese3928/NBAMSeq and in submission to Bioconductor
repository.
Acknowledgements
This work was supported in part by the CDC/NIOSH award U01 OH011478.
18
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
References
[1] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold
change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014.
[2] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a bioconductor
package for di↵erential expression analysis of digital gene expression data. Bioinformat-
ics, 26(1):139–140, 2010.
[3] Yi-Hui Zhou, Kai Xia, and Fred A Wright. A powerful and flexible approach to the
analysis of rna sequence count data. Bioinformatics, 27(19):2672–2678, 2011.
[4] Mark D Robinson and Gordon K Smyth. Moderated statistical tests for assessing dif-
ferences in tag abundance. Bioinformatics, 23(21):2881–2887, 2007.
[5] Davis J McCarthy, Yunshun Chen, and Gordon K Smyth. Di↵erential expression analy-
sis of multifactor rna-seq experiments with respect to biological variation. Nucleic acids
research, 40(10):4288–4297, 2012.
[6] Yunshun Chen, Aaron TL Lun, and Gordon K Smyth. Di↵erential expression analysis
of complex rna-seq experiments using edger. In Statistical analysis of next generation
sequencing data, pages 51–74. Springer, 2014.
[7] Steven P Lund, Dan Nettleton, Davis J McCarthy, and Gordon K Smyth. Detect-
ing di↵erential expression in rna-sequence data using quasi-likelihood with shrunken
dispersion estimates. Statistical applications in genetics and molecular biology, 11(5),
2012.
[8] Aaron TL Lun, Yunshun Chen, and Gordon K Smyth. It?s de-licious: a recipe for
di↵erential expression analyses of rna-seq experiments using quasi-likelihood methods
in edger. In Statistical Genomics, pages 391–416. Springer, 2016.
[9] Georg Stricker, Alexander Engelhardt, Daniel Schulz, Matthias Schmid, Achim Tresch,
and Julien Gagneur. Genogam: genome-wide generalized additive models for chip-seq
analysis. Bioinformatics, 33(15):2258–2265, 2017.
[10] Sally W Thurston, MP Wand, and John K Wiencke. Negative binomial additive models.
Biometrics, 56(1):139–144, 2000.
[11] Simon Anders and Wolfgang Huber. Di↵erential expression analysis for sequence count
data. Genome biology, 11(10):R106, 2010.
19
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
[12] Serdar Bozdag, Aiguo Li, Gregory Riddick, Yuri Kotliarov, Mehmet Baysan, Fabio M
Iwamoto, Margaret C Cam, Svetlana Kotliarova, and Howard A Fine. Age-specific
signatures of glioblastoma at the genomic, genetic, and epigenetic levels. PloS one,
8(4):e62982, 2013.
[13] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical
and powerful approach to multiple testing. Journal of the royal statistical society. Series
B (Methodological), pages 289–300, 1995.
[14] Pei-Fen Kuan, Monika A Waszczuk, Roman Kotov, Sean Clouston, Xiaohua Yang,
Prashant K Singh, Sean T Glenn, Eduardo Cortes Gomez, Jianmin Wang, Evelyn
Bromet, et al. Gene expression associated with ptsd in world trade center responders:
An rna sequencing study. Translational psychiatry, 7(12):1297, 2017.
[15] Frank WWeathers, Brett T Litz, Debra S Herman, Jennifer A Huska, Terence M Keane,
et al. The ptsd checklist (pcl): Reliability, validity, and diagnostic utility. In annual
convention of the international society for traumatic stress studies, San Antonio, TX,
volume 462. San Antonio, TX., 1993.
[16] Pei Fen Kuan, Monika Waszczuk, Roman Kotov, Carmen Marsit, Guia Gu↵anti, Xi-
aohua Yang, Karestan Koenen, Evelyn Bromet, and Benjamin Luft. Dna methylation
associated with ptsd and depression in world trade center responders: An epigenome-
wide study. Biological Psychiatry, 81(10):S365, 2017.
[17] James H Bullard, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. Eval-
uation of statistical methods for normalization and di↵erential expression in mrna-seq
experiments. BMC bioinformatics, 11(1):94, 2010.
[18] Benjamin M Bolstad, Rafael A Irizarry, Magnus Astrand, and Terence P. Speed. A
comparison of normalization methods for high density oligonucleotide array data based
on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[19] Yee Hwa Yang and Natalie P Thorne. Normalization for two-color cdna microarray
data. Lecture Notes-Monograph Series, pages 403–418, 2003.
[20] Kasper D Hansen, Rafael A Irizarry, and Zhijin Wu. Removing technical variability
in rna-seq data using conditional quantile normalization. Biostatistics, 13(2):204–216,
2012.
20
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
[21] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schae↵er, and Barbara
Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature meth-
ods, 5(7):621, 2008.
[22] Mark D Robinson and Alicia Oshlack. A scaling normalization method for di↵erential
expression analysis of rna-seq data. Genome biology, 11(3):R25, 2010.
[23] Marie-Agnes Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine
Jeanmougin, Nicolas Servant, Celine Keime, Guillemette Marot, David Castel, Jordi
Estelle, et al. A comprehensive evaluation of normalization methods for illumina high-
throughput rna sequencing data analysis. Briefings in bioinformatics, 14(6):671–683,
2013.
[24] Peipei Li, Yongjun Piao, Ho Sun Shon, and Keun Ho Ryu. Comparing the normalization
methods for the di↵erential analysis of illumina high-throughput rna-seq data. BMC
bioinformatics, 16(1):347, 2015.
[25] Simon Anders, Alejandro Reyes, and Wolfgang Huber. Detecting di↵erential usage of
exons from rna-seq data. Genome research, 22(10):2008–2017, 2012.
[26] Trevor Hastie and Robert Tibshirani. Generalized additive models. Statist. Sci.,
1(3):297–310, 08 1986.
[27] TJ Hastie and RJ Tibshirani. Generalized additive models. Chapman & Hall: CRC
Monographs on Statistics & Applied Probability. London, 1990.
[28] Simon N Wood. Generalized additive models: an introduction with R. CRC press, 2017.
[29] Simon N Wood. Generalized additive models: an introduction with R. Chapman and
Hall/CRC, 2006.
[30] Simon N Wood. Fast stable restricted maximum likelihood and marginal likelihood
estimation of semiparametric generalized linear models. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 73(1):3–36, 2011.
[31] Simon N Wood, Natalya Pya, and Benjamin Safken. Smoothing parameter and model
selection for general smooth models. Journal of the American Statistical Association,
111(516):1548–1563, 2016.
[32] Simon N Wood. Stable and e�cient multiple smoothing parameter estimation for gen-
eralized additive models. Journal of the American Statistical Association, 99(467):673–
686, 2004.
21
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
[33] Simon N Wood. Thin plate regression splines. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 65(1):95–114, 2003.
[34] Wolfgang Hardle, Peter Hall, and James Stephen Marron. How far are automatically
chosen regression smoothing parameters from their optimum? Journal of the American
Statistical Association, 83(401):86–95, 1988.
[35] Adi Ben-Israel and Thomas NE Greville. Generalized inverses: theory and applications,
volume 15. Springer Science & Business Media, 2003.
[36] David Roxbee Cox and Nancy Reid. Parameter orthogonality and approximate con-
ditional inference. Journal of the Royal Statistical Society. Series B (Methodological),
pages 1–39, 1987.
[37] Simon NWood. On p-values for smooth components of an extended generalized additive
model. Biometrika, 100(1):221–228, 2012.
[38] Huan Liu, Yongqiang Tang, and Hao Helen Zhang. A new chi-square approximation to
the distribution of non-negative definite quadratic forms in non-central normal variables.
Computational Statistics & Data Analysis, 53(4):853–856, 2009.
[39] Guangchuang Yu, Li-Gen Wang, Yanyan Han, and Qing-Yu He. clusterprofiler: an
r package for comparing biological themes among gene clusters. Omics: a journal of
integrative biology, 16(5):284–287, 2012.
[40] Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes.
Nucleic acids research, 28(1):27–30, 2000.
[41] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler,
J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al.
Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25, 2000.
[42] Tannishtha Reya and Hans Clevers. Wnt signalling in stem cells and cancer. Nature,
434(7035):843, 2005.
[43] Jan Gruber, Zhuangli Yee, and Nicholas Tolwinski. Developmental drift and the role of
wnt signaling in aging. Cancers, 8(8):73, 2016.
[44] Cynthia C Sprenger, Stephen R Plymate, and May J Reed. Aging-related alterations in
the extracellular matrix modulate the microenvironment and influence tumor progres-
sion. International journal of cancer, 127(12):2739–2748, 2010.
22
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint
[45] Nir Barzilai, Derek M Hu↵man, Radhika H Muzumdar, and Andrzej Bartke. The critical
role of metabolic pathways in aging. Diabetes, 61(6):1315–1322, 2012.
[46] Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit. More power via graph-structured
tests for di↵erential expression of gene networks. The Annals of Applied Statistics,
6(2):561–600, 2012.
23
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 4, 2019. ; https://doi.org/10.1101/599811doi: bioRxiv preprint