Probabilistic Models in Functional Neuroimaging
Dirk Ostwald
Contents
Introduction
Mathematical Preliminaries
Sets
(1) Definition of sets
(2) Union, intersection, difference
(3) Selected sets of numbers
Functions
(1) Sums, products, and exponentiation
(2) General functions
(3) Basic properties of functions
(4) Elementary functions
Matrix Algebra
(1) Matrix definition
(2) Matrix addition and subtraction
(3) Matrix scalar multiplication
(4) Matrix multiplication
(5) Matrix inversion
(6) Inversion of small matrices by hand
(7) Matrix transposition
(8) Matrix determinants
(9) Rank of a matrix
(10) Matrix symmetry and positive-definiteness
Differential Calculus
(1) Intuition and definition of derivatives of univariate functions
(2) Derivatives of important functions
(3) Rules of differentiation
(4) Analytical optimization
(5) Multivariate, real-valued functions and partial derivatives
(6) Higher-order partial derivatives
(7) Gradient, Hessian, and Jacobian
(8) Taylor’s theorem
Integral Calculus
(1) Definite integrals – The integral as the signed area under a function’s graph
(2) Indefinite integrals – Integration as the inverse of differentiation
Sequences and Series
(1) Sequences
(2) Series
Ordinary Differential Equations
(1) Differential equations
(2) Initial value problems for systems of first-order ordinary differential equations
(3) Numerical approaches for initial value problems
An Introduction to Fourier Analysis
(1) Generalized cosine and sine functions
(2) Linear combinations of generalized cosine and sine functions
(3) The real Fourier series
(4) Periodic and protracted functions
(5) Complex Numbers
(6) Euler’s identities
(7) The complex form of the Fourier series
(8) The polar form of the Fourier series
(9) The Fourier transform
(10) The discrete Fourier transform
(11) Fast Fourier Transforms in Matlab
An Introduction to Numerical Optimization
(1) Gradient methods
(2) The Newton-Raphson method
Foundations of Probabilistic Models
Probability Theory
(1) Random variables
(2) Joint and marginal probability distributions
(3) Conditional probabilities
(4) Bayes Theorem
(5) Independent random variables
(6) Discrete random variables and probability mass functions
(7) Continuous random variables and probability density functions
(8) Expected value and variance of univariate random variables
Multivariate Gaussian distributions
(1) The bivariate Gaussian distribution
(2) The multivariate Gaussian distribution
(3) Independent Gaussian random variables and spherical covariance matrices
(4) The linear transformation theorems for Gaussian distributions
(5) The Gaussian joint and conditional distribution theorem
Information Theory
(1) Entropy
(2) Kullback-Leibler Divergence
Principles of Probabilistic Inference
(1) Maximum Likelihood Estimation
(2) Maximum likelihood estimation of the parameters of a univariate Gaussian distribution
(3) Numerical maximum likelihood estimation and Fisher-Scoring
(4) Bayesian Estimation
(5) Bayesian estimation of the expectation of a univariate Gaussian
(6) Principles of variational Bayes
Probability distributions in classical inference
(1) The Standard normal distribution
(2) The chi-squared distribution
(3) The 𝑡-distribution
(4) The 𝑓-distribution
Probability distributions in Bayesian inference
(1) The gamma distribution
(2) The inverse Gamma distribution
(3) The Wishart distribution
(4) The inverse Wishart distribution
(5) The normal-gamma distribution and the normal-inverse gamma distribution
(6) The univariate non-central 𝑡-distribution
(7) The multivariate non-central 𝑡-distribution
Basic Theory of the General Linear Model
Structural and probabilistic aspects
(1) Experimental design
(2) A verbose introduction
(3) Simple linear regression
(4) The Gaussian assumption
(5) Equivalent formulations
(6) Sampling a simple linear regression model
Maximum likelihood estimation
(1) Maximum likelihood and least-squares beta parameter estimation
(2) Least-squares beta parameter estimation for simple linear regression
(3) General beta parameter estimation
(4) Maximum likelihood variance parameter estimation
Frequentist Parameter Estimator Distributions
(1) The intuitive background for parameter estimator distributions
(2) The sampling distribution of the beta parameter estimator
(3) The sampling distribution of the scaled variance parameter estimator
(4) Overview of the frequentist GLM distribution theory
T- and F-Statistics
(1) Significance and hypothesis testing in frequentist statistics
(2) Definition and intuition of the T-statistic
(3) The T-statistic null distribution
(4) The T-statistic and null hypothesis significance testing
(5) Definition and intuition of the F-Statistic
(6) The F-Statistic Null Distribution
(7) The F-statistic and null hypothesis significance testing
(8) Classical variance partitioning formula and the GLM formulation of the F-Statistic
Bayesian estimation
(1) Model Formulation
(2) Bayesian estimation of the beta parameters
(3) Examples for Bayesian beta parameter estimation
(4) Bayesian joint estimation of beta and variance parameters
Fundamental Designs
(1) A spectrum of designs
(2) Simple linear regression
(3) Multiple linear regression
(4) One-sample T-test
(5) Independent two-sample T-test
(6) One-way ANOVA
(7) Multifactorial designs and two-way ANOVA
(8) Analysis of Covariance
Advanced theory of the General Linear Model
The generalized least-squares estimator and whitening
(1) Motivation
(2) Derivation of the generalized least-squares estimator
(3) Whitening
Restricted Maximum Likelihood
(1) Motivation
(2) The REML objective function and REML variance parameter estimators
(3) Derivatives of the REML objective function
(4) Fisher-Scoring for the REML objective function and covariance basis matrices
FMRI applications of the General Linear Model
The mass-univariate GLM-FMRI approach
(1) FMRI data acquisition and preprocessing
(2) Brain mapping using the GLM-FMRI approach
First-level regressors
(1) Discrete time-signals
(2) Discrete time-systems
(3) Convolution
(4) The canonical hemodynamic response function
(5) Stimulus onset convolution in GLM-FMRI
First-level design matrices
(1) Parameterizing event-related FMRI designs
(2) Measuring event-related FMRI design efficiency
(3) Finite impulse response designs
(4) Psychophysiological interaction designs
First-level covariance matrices and model estimation
(1) Models of FMRI serial correlations
(2) Estimation of mass-univariate GLMs with serial correlations
Second-level models
(1) The “summary-statistics” approach
(2) A hierarchical GLM
(3) An equivalent beta parameter estimates model
The multiple testing problem
(1) An introduction to the multiple testing problem in GLM-FMRI
(2) Type I error rates
(3) Exact, weak, and strong control of family-wise error rates
(4) The Bonferroni procedure and its “conservativeness” in GLM-FMRI
Multivariate Approaches
Classification Approaches
(1) Generative learning - Linear Discriminant Analysis
(2) Discriminative Learning - Logistic Regression
(3) Support Vector Classification
Deterministic Dynamical Models
Deterministic dynamic causal models for FMRI
(1) The structural form of DCM for FMRI
(2) The neural state evolution function
(3) Interpretation of the hemodynamic evolution and observer functions
(4) The probabilistic form of DCM for FMRI
Variational Bayesian inversion of deterministic dynamical models
(1) Model formulation
(2) Fixed-form evaluation of the variational free energy
(3) Optimization of the variational free energy
Introduction
The aim of this Section is to review some informal basics of quantitative data analysis and some
terminology of experimental design, in order to establish general terms for PMFN as a whole and
for the theory of the General Linear Model (GLM) in particular.
When designing any cognitive neuroscience experiment, it is essential to have at least a vague idea
about the data analytical procedures that are going to be used on the collected data, be it behavioral,
functional magnetic resonance imaging (FMRI) or magneto-/electroencephalographic (M/EEG) measures.
The aim of the following section is to provide a brief overview of the data analytical strategies employed in
noninvasive cognitive neuroimaging experiments and, more generally, in probabilistic quantitative data
analysis.
Data analysis as data reduction
Any cognitive experiment generates a wealth of data (numbers). For example, when conducting a
psychophysical experiment, one could present each stimulus of each condition multiple times to the
participant and gather reaction times and the correctness of the response on each trial. For reaction times
alone, with four conditions and 100 trials per condition, this would amount to 400 positive real numbers per
participant. Normally, one would acquire data from more than a single participant, and would thus deal with
400 data points times the number of participants. If one concomitantly acquires any form of neurophysiological data (e.g. FMRI data
over many voxels) the number of data points grows very large very quickly. Nevertheless, one would like to
know in what way the experimental manipulation has affected the recorded data. Any data analytical
method hence projects large sets of numbers onto a smaller set of numbers (sometimes also referred to as
“statistics”) that allow for the experimental effects to be more readily evaluated. While many data-analytical
techniques appear very different on the surface, the reduction of the “data dimensionality” is a ubiquitous
characteristic of all data analyses (Figure 1).
Figure 1. Raw data usually come in the form of large data matrices, here represented by a 100 × 150 array of different colours on the left. Usually, the raw data themselves are not reported in scientific reports, but rather a smaller set of numbers (such as T- or p-values in classical statistics, or log evidence ratios in Bayesian statistics). This smaller set of numbers is represented by the 2 × 2 array of different colours on the right.
The ubiquity of model-based data analysis
Another ubiquitous characteristic of any data analytical strategy is that it embodies some
assumptions about how the data were generated and which data aspects are important. In essentially every
data analytical approach, the key step is to compare how well a given set of quantitative assumptions or
“model” (often also referred to as “computational model”, or more generally, “theory”) can explain a set of
observed data. When studying a data analytical approach, it is always helpful to aim to identify the following
three components of the scientific method, which we will refer to as “model formulation”, “model
estimation”, and “model evaluation” (Figure 2).
Figure 2. On the relationship of reality and the scientific method.
By “model formulation” we understand the formalization of informal ideas about the generation of
empirical measurements. Important questions that are answered during probabilistic model formulation are
the following. Which intuitive idea is used to explain observed data? What are the deterministic and what
are the probabilistic parts of its formalized version? What are fixed parameters of the model, and which
parameters are allowed to vary and be determined by the data?
By “model estimation” we understand the adaptation of the model parameters (and, in the Bayesian
scenario, the approximation of the model’s plausibility for explaining the data) in light of observed data.
Interestingly, models are readily conceivable for which the estimation of their parameters is a non-trivial
task. In PMFN, we will be concerned both with models for which explicit and relatively straightforward
methods for model estimation exist (such as the GLM) and others, for which these methods become much
more involved (such as dynamic causal models in FMRI).
Finally, “model evaluation” refers to evaluating the obtained parameter estimates in some
meaningful sense and to drawing conclusions about the experimental hypothesis based on the parameters
and/or the overall model plausibility in light of the data.
Note that upon model evaluation, the scientific method proceeds by going back to the model
formulation step. At least two aims may be addressed during model reformulation: to conceive a
model formulation that captures the observed data in a more “meaningful” or “better” manner, or,
for example, to relax the assumptions of the model so as to derive a more general theory.
The classical statistics approach vs. Bayesian approaches
Modern, probabilistic model-based approaches to data analysis as encountered in cognitive
neuroimaging commonly comprise both “deterministic” (also referred to as “structural”) and “probabilistic”
aspects. The probabilistic aspects usually model the "noise" in the data, i.e. that part of the data variance
that is not explained by the deterministic aspects. Approaches from classical statistics and Bayesian
approaches differ in the way that the probabilistic aspects are dealt with and at which level probabilistic
concepts are invoked to model “uncertainty”.
Classical inference
In the most basic terms, in the classical statistics approach variants of the GLM are combined with the notion
of “null-hypothesis testing”. Informally, one assumes that if there is no experimental effect of interest, the
statistic that one is interested in (for example a group mean) has a certain distribution, namely the “null
distribution”. Once data are observed, one can use the null distribution of the statistic of interest to
compute the conditional probability of observing these (or more extreme) data given that the null hypothesis
is true, or, in other words, given that the null distribution is the true data distribution. If this probability,
known as the 𝑝-value, is small, one concludes that the data do not support the null hypothesis, and rejects it.
Bayesian approach
The Bayesian approach to data analysis takes a different viewpoint. In simplified terms, it uses the same
formalism as classical inference (namely probability theory), but does not interpret probabilities as
“objective” large sample limits, but as measures of “subjective uncertainty”. Accordingly, in the absence of any
experimental data, one may quantify one’s uncertainty about the parameter value of interest using a so-
called “prior distribution”. Further, one explicitly considers the data likelihood, i.e. the distribution of the
data given a specific value of the parameter. Using Bayes' theorem one may then compute the “posterior
distribution” of the parameter given the data, resulting in an updated belief about what the real parameter
underlying the data generation might be. At the same time, one aims to quantify the probability of the data
under the model assumptions employed.
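As a toy illustration of the two viewpoints, consider inferring the mean of Gaussian data with known standard deviation. The following sketch is not part of PMFN; the data values, the prior parameters, and all variable names are illustrative assumptions (a one-sample 𝑧-test on the classical side, a conjugate Gaussian prior on the Bayesian side).

```python
import math

# Hypothetical data: n observations, assumed Gaussian with known sigma
y = [0.8, 1.3, 0.2, 1.1, 0.9, 1.5, 0.4, 1.0]
n, sigma = len(y), 1.0
ybar = sum(y) / n

# Classical inference: under the null hypothesis mu = 0, the sample mean
# follows the null distribution N(0, sigma^2/n); the two-sided p-value is
# the probability of data at least this extreme under that distribution.
z = ybar / (sigma / math.sqrt(n))
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Bayesian inference: with the prior mu ~ N(m0, s0^2), Bayes' theorem
# yields a Gaussian posterior over mu (the conjugate update for known sigma).
m0, s0 = 0.0, 10.0
post_var = 1.0 / (1.0 / s0**2 + n / sigma**2)
post_mean = post_var * (m0 / s0**2 + sum(y) / sigma**2)

print(f"two-sided p-value under mu = 0: {p_value:.4f}")
print(f"posterior for mu: N({post_mean:.3f}, {post_var:.3f})")
```

Note how the classical branch returns a single probability about the data given a fixed parameter value, whereas the Bayesian branch returns a full distribution over the parameter given the data.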
It should be noted that the dichotomy between the classical and Bayesian approaches to data analysis is
not a strict one, and that mixed forms exist. Up to now, most of cognitive neuroscience (and in fact
quantitative science in general) has been dominated by classical statistics. Especially with the current interest
in, and the increasing availability of, large data sets (“big data”) combined with high-performance computing
solutions, however, the Bayesian approach appears to be gaining popularity.
It should not be surprising if the notions of classical statistical inference and Bayesian inference
appear somewhat nebulous at this point. One aim of PMFN is to develop a better intuition about both
frameworks, their commonalities, and their differences by putting them into action with respect to the same
underlying model, the GLM.
Univariate vs. multivariate approaches
Another way to classify data analytical procedures is according to whether they are “univariate” or
“multivariate”. The key difference between the two approaches is the dimensionality of the “dependent
experimental variable” or “outcome measure”. If this dimensionality is one, i.e. for each trial of each
experimental condition a single, scalar number (for example a reaction time, the BOLD signal in a given
voxel, the EEG frequency power in a specific frequency band) is observed and modeled by the approach,
one speaks of a “univariate” data analytical approach. Typical examples of univariate approaches are the
variants of the standard GLM encountered in PMFN.
On the other hand, if two or more numbers comprise the dependent experimental variable (e.g. a
reaction time and a verbal response, or the multi-voxel activation pattern of a given brain region) and are
modeled by the approach, one speaks of a multivariate approach. Typical examples of multivariate
approaches are the multivariate analysis of variance (MANOVA), canonical correlation, multi-dimensional
scaling and multivariate classification approaches.
Encoding vs. decoding approaches
Yet another form of classification of data analytical approaches often encountered in cognitive
neuroscience is along the lines of “encoding” and “decoding”. Encoding approaches like the GLM rest on an
explicit formulation of experimental circumstances that generate observed responses. Decoding approaches
on the other hand decode, according to some algorithm which is trained on a subset of the data, the
experimental circumstances from the observed response. It is important to note, however, that the distinction between
encoding- and decoding-based approaches is somewhat artificial: after training, any decoding algorithm is based on a
generative model of the response patterns associated with specific experimental circumstances. In this sense
both encoding and decoding approaches achieve the same thing: they test for the statistical non-
independence between independent and dependent experimental variables.
In summary, the GLM, which we will elucidate from both the classical and Bayesian viewpoint in this
Section of PMFN, corresponds to a univariate data dimensionality reduction technique that encodes simple
assumptions about how observed data are generated.
Study Questions
1. Give brief definitions of the terms Model Formulation, Model Estimation, and Model Evaluation.
2. Provide a brief overview of the differences and commonalities between the classical inference and Bayesian statistical approaches.
3. Why is the General Linear Model as discussed in the course referred to as a “univariate” data analysis method?
4. Define the terms “independent variable”, “dependent variable”, “categorical variable”, and “continuous variable”.
5. Explain the difference between within- and between-subject experimental designs.
Study Question Answers
1. By “Model Formulation” we understand the formalization of informal ideas about the generation of empirical measurements.
Important questions that we need to answer during probabilistic model formulation are the following. Which intuitive idea is
used to explain observed data? What are the deterministic and what are the probabilistic parts of its formalized version? What
are fixed parameters of the model, and which parameters are allowed to vary and be determined by the data? By “Model
estimation” we understand the adaptation of the model parameters and, in the Bayesian scenario, the approximation of the
model’s plausibility for explaining the data in light of observed data. “Model Evaluation” refers to evaluating the obtained
parameter estimates in some meaningful sense and to draw conclusions about the experimental hypothesis based on the
parameters or the overall model plausibility.
2. Both the classical and the Bayesian approach formulate probabilistic models, which may have the same structural and
probabilistic (likelihood) form. In classical statistics the notion of “null-hypothesis testing” by means of 𝑝-values is prevalent
and probabilities are interpreted as “objective large sample frequencies”. The Bayesian approach to data analysis uses the
same formalism as classical inference, namely probability theory, but does not interpret probabilities as “objective” large
sample limits, but as measures of “subjective uncertainty”. To this end, in the absence of any experimental data, one may quantify
one’s uncertainty about the parameter value of interest using a so-called “prior distribution”. Further, one explicitly considers
the data likelihood, i.e. the distribution of the data given a specific value of the parameter. Using Bayes' theorem one may then
compute the “posterior distribution” of the parameter given the data, resulting in an updated belief about what the real
parameter underlying the data generation might be. At the same time, one aims to quantify the probability of the data under
the model assumptions used.
3. The General Linear Model as discussed in the course is referred to as a “univariate” data analysis method because the
dimensionality of the observations at each trial/time-point is one.
4. An independent experimental variable refers to an aspect of an experimental design that is intentionally manipulated by the
experimenter and that is hypothesized to cause changes in the dependent variables. A dependent experimental variable is a
quantity that is measured by the experimenter in order to evaluate the effect of one or more independent variables. A
categorical variable is a variable that can take on one of several discrete values. A continuous variable is one that can take on
any value within a pre-specified range, usually described mathematically by a real number.
5. In a between-subject manipulation, different subject groups reflect different values of the independent variable, while in a (full)
within-subject design, subjects are exposed to all values of the independent variable.
Mathematical Preliminaries
Sets
(1) Definition of Sets
Sets may be defined according to Georg Cantor (1845 – 1918) as follows:
“A set is a gathering together into a whole of definite, distinct objects of our perception
[Anschauung] or of our thought – which are called elements of the set.”
In PMFN we primarily use sets as a means to denote which kind of mathematical objects we are
dealing with. Sets are usually denoted using curly brackets. For example, the set 𝐴 comprising the first five
lowercase letters of the Roman alphabet is denoted as
𝐴 ≔ {𝑎, 𝑏, 𝑐, 𝑑, 𝑒} (1)
We use the symbol “ ≔” to denote a definition and the symbol “=” to denote an equality.
There are essentially three ways to define sets: (a) by listing the elements of the set as in (1); (b) by
specifying the properties of the elements in a set, for example as
𝐵 ≔ {𝑥 | 𝑥 is one of the first five lowercase letters of the Roman alphabet} (2)
where the variable 𝑥 before the vertical bar denotes the elements of the set in a generic fashion, and the
statement after the vertical bar denotes the defining properties of the set; and finally (c) by defining a set to
correspond to another, well known set, for example
𝐶 ≔ ℕ (3)
where ℕ denotes the set of natural numbers (see below). To indicate that e.g. the letter 𝑏 is an element of
𝐴, we write
𝑏 ∈ 𝐴 (4)
which may be read as “𝑏 is in 𝐴” or “𝑏 is an element of 𝐴”. To indicate that, for example, the number 2 is not
an element of 𝐴, we write
2 ∉ 𝐴 (5)
which may be read as “2 is not in 𝐴” or “2 is not an element of 𝐴”. The number of elements of a set is
referred to as the “cardinality” of a set and is denoted by vertical bars. For example
|𝐴| = |𝐵| = 5 (6)
is the cardinality of the sets 𝐴 and 𝐵, because both contain five elements.
If a set 𝐴 contains all elements of another set 𝐵 and 𝐴 contains some additional elements, i.e., the
sets are not equal, then 𝐵 is said to be a “subset” of 𝐴, denoted by 𝐵 ⊂ 𝐴, and 𝐴 is said to be a “superset” of
𝐵, denoted by 𝐴 ⊃ 𝐵. Usually, denoting that a set is a subset of another set suffices. For example, if
𝐴 ≔ {1,2, 𝑎, 𝑏} and 𝐵 ≔ {1, 𝑎}, then 𝐵 ⊂ 𝐴, because all elements of 𝐵 are also in 𝐴. If a set 𝐵 may be a
subset of or may equal another set 𝐴, the notation 𝐵 ⊆ 𝐴 is also used.
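As an aside, the set notions introduced above map directly onto the built-in set type of many programming languages. The following Python sketch merely restates the examples from the text and is not part of the formal development.

```python
# The set A from (1) and the membership statements (4) and (5)
A = {"a", "b", "c", "d", "e"}
assert "b" in A       # b ∈ A
assert 2 not in A     # 2 ∉ A

# Cardinality (6): |A| = 5
assert len(A) == 5

# Subset and superset: for A := {1, 2, a, b} and B := {1, a},
# B is a proper subset of A and A a proper superset of B
A2 = {1, 2, "a", "b"}
B2 = {1, "a"}
assert B2 < A2    # B ⊂ A (proper subset)
assert A2 > B2    # A ⊃ B (proper superset)
assert B2 <= A2   # B ⊆ A also holds
```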
(2) Union, Intersection, Difference
Three operations on sets are sometimes helpful: (a) the union of two sets, (b) the intersection of two
sets, and (c) the difference of two sets. These may be defined as follows.
Let 𝑀 and 𝑁 be two arbitrary sets. Then
𝑀 ∪ 𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 or 𝑥 ∈ 𝑁} (1)
denotes the “union” of the two sets 𝑀 and 𝑁. The union of two sets is a set that comprises elements that
are either in 𝑀 (only), in 𝑁 (only), or in both 𝑀 and 𝑁. The “or” in the definition of 𝑀 ∪ 𝑁 is thus to be
understood in an inclusive (“and/or”), rather than exclusive, way. As an example, for 𝑀 ≔ {1,2,3} and
𝑁 ≔ {2,3,5,7} we have
𝑀 ∪𝑁 = {1,2,3,5,7} (2)
The “intersection” of two arbitrary sets 𝑀 and 𝑁 is defined as
𝑀 ∩ 𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 and 𝑥 ∈ 𝑁} (3)
The intersection 𝑀 ∩𝑁 is thus a set that only comprises elements that are both in 𝑀 and 𝑁. For the
example 𝑀 ≔ {1,2,3} and 𝑁 ≔ {2,3,5,7} we have
𝑀 ∩𝑁 = {2,3} (4)
because 2 and 3 are the only numbers that are both in 𝑀 and 𝑁. If the intersection of two sets does not
contain any elements (a set referred to as the “empty set” and denoted by ∅) the two sets are said to be
“disjoint”.
Finally, the “difference” between two sets, also referred to as the “difference set” of two sets 𝑀 and
𝑁, is defined as
𝑀\𝑁 ≔ {𝑥 | 𝑥 ∈ 𝑀 and 𝑥 ∉ 𝑁} (5)
The set 𝑀\𝑁 thus comprises all elements with the property that they are in 𝑀, but not in 𝑁. For the
example 𝑀 ≔ {1,2,3} and 𝑁 ≔ {2,3,5,7} we have
𝑀\𝑁 = {1} (6)
because the elements 2,3 ∈ 𝑀 are also in 𝑁. Note that the other elements of 𝑁 do not play a role in the
difference set. The difference of two sets is not symmetric, as we have for example
𝑁\𝑀 = {5,7} (7)
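The three set operations can likewise be checked numerically. The following Python sketch reuses the example sets 𝑀 and 𝑁 from above.

```python
M = {1, 2, 3}
N = {2, 3, 5, 7}

# Union (2), intersection (4), and the two difference sets (6) and (7)
assert M | N == {1, 2, 3, 5, 7}   # M ∪ N
assert M & N == {2, 3}            # M ∩ N
assert M - N == {1}               # M \ N
assert N - M == {5, 7}            # N \ M: the difference is not symmetric

# Two sets are disjoint if their intersection is the empty set
assert {1, 2} & {5, 7} == set()
```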
(3) Selected Sets of Numbers
In this section, we briefly introduce a selection of important sets of numbers that will be required
throughout PMFN.
The set of natural numbers (or positive integers) is denoted by ℕ and defined as
ℕ ≔ {1,2,3,… } (1)
where the dots “…” denote “to infinity”. Subsets of the set of natural numbers are the sets of natural numbers
of order 𝑛 ∈ ℕ, which are defined as
ℕ𝑛 ≔ {1,2,… , 𝑛} (2)
The union of the set of natural numbers and zero will be denoted by ℕ0, i.e. ℕ0 ≔ ℕ∪ {0}. If 0 and the
“negative natural numbers” are added to the set ℕ, one obtains the set of integers defined by
ℤ ≔ {… ,−3,−2,−1,0,1,2,3,… } (3)
Adding also the ratios of integers yields the set of rational numbers, defined as

ℚ ≔ {𝑝/𝑞 | 𝑝, 𝑞 ∈ ℤ, 𝑞 ≠ 0} (4)
The most important basic set of numbers for our purposes is the set of real numbers, denoted by ℝ.
The real numbers ℝ are a superset of the rational numbers, i.e. ℚ ⊂ ℝ. In addition to the rational numbers,
the real numbers include the solutions of certain algebraic equations (e.g. √2, the solution of the equation
𝑥^2 = 2, which is not an element of ℚ). These numbers are called irrational numbers. Additionally, the real
numbers include the limits of convergent sequences of rational numbers, such as 𝜋 ≈ 3.14… . Intuitively, the
set of real numbers is the set one thinks of when referring to “continuous” numbers. Interestingly, between any two
real numbers there exist infinitely many more real numbers, while ℝ also extends to negative and positive
infinity. One can show that there are more real numbers than natural numbers; the set of real numbers is
therefore said to be uncountable. In practice, the real numbers are the scalar (i.e. “single”) numbers that are used to
model most “continuous” data formats in cognitive neuroimaging. Sometimes, one is interested in only the
non-negative real numbers. These are denoted by
ℝ+ ≔ {𝑥 ∈ ℝ|𝑥 ≥ 0} (5)
In general, contiguous subsets of the real numbers are referred to as “intervals”. We will mainly
deal with closed intervals, i.e., subsets of ℝ that are delimited by two numbers 𝑎, 𝑏 ∈ ℝ and defined as

[𝑎, 𝑏] ≔ {𝑥 ∈ ℝ | 𝑎 ≤ 𝑥 ≤ 𝑏} (6)

where ≤ denotes the property “less than or equal to”. Note that 𝑎 and 𝑏 are elements of the interval [𝑎, 𝑏]
and that [𝑎, 𝑏] is the empty set if 𝑏 < 𝑎. Three further kinds of intervals can be defined by means of the
“less than” property <, referred to as “left-semi-open”, “right-semi-open”, and “open” intervals, respectively
]𝑎, 𝑏] ≔ {𝑥 ∈ ℝ | 𝑎 < 𝑥 ≤ 𝑏} (7)
[𝑎, 𝑏[ ≔ {𝑥 ∈ ℝ | 𝑎 ≤ 𝑥 < 𝑏} (8)
]𝑎, 𝑏[ ≔ {𝑥 ∈ ℝ | 𝑎 < 𝑥 < 𝑏} (9)
Note that 𝑎 and 𝑏 are not elements of the interval ]𝑎, 𝑏[.
The set of scalar real numbers can readily be generalized to the set of real 𝑛-tuples, or 𝑛-
dimensional vectors, denoted by ℝ^𝑛. For 𝑛 ∈ ℕ, the set ℝ^𝑛, with the special case ℝ^1 = ℝ, denotes the set of
𝑛-tuples with real entries. The 𝑛-tuples are usually denoted as lists, or “column vectors”, of the form
𝑥 ≔ (𝑥1, 𝑥2, … , 𝑥𝑛)^𝑇 (10)

where the superscript 𝑇 indicates that the listed row of numbers is to be read as a column (“transposition”).
We will usually write 𝑥 ∈ ℝ^𝑛 to denote that 𝑥 is a list or column vector of 𝑛 real numbers. An example for
𝑥 ∈ ℝ^4 is

𝑥 ≔ (0.161, 1.762, −0.203, 𝜋)^𝑇 (11)
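For readers who like to experiment numerically, interval membership according to definitions (6) and (9) and the example 4-tuple from (11) may be sketched in Python as follows; the function names are illustrative only.

```python
import math

def in_closed(x, a, b):
    """x ∈ [a, b] := {x ∈ ℝ | a ≤ x ≤ b}"""
    return a <= x <= b

def in_open(x, a, b):
    """x ∈ ]a, b[ := {x ∈ ℝ | a < x < b}"""
    return a < x < b

assert in_closed(0.0, 0.0, 1.0)      # the endpoints belong to [a, b] ...
assert not in_open(0.0, 0.0, 1.0)    # ... but not to ]a, b[

# An element of ℝ^4 as in (11), represented as a tuple of four real numbers
x = (0.161, 1.762, -0.203, math.pi)
assert len(x) == 4
```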
Study Questions
1. Give brief explanations of the symbols ℕ, ℕ𝑛, ℤ, ℚ, ℝ, ℝ^𝑛 and provide a numerical example of an 𝑥 ∈ ℝ^5.
2. Consider the sets 𝐴 ≔ {1,2,3} and 𝐵 ≔ {3,4,5}. Write down the sets 𝐶 ≔ 𝐴 ∪ 𝐵 and 𝐷 ≔ 𝐴 ∩ 𝐵
3. Write down the definition of the interval [0,1] ⊂ ℝ. Is 0 an element of this interval?
Study Question Answers
1. ℕ is the set of natural numbers, i.e. the positive integers from 1 to ∞. ℕ𝑛 is the set of natural numbers from 1 to 𝑛, i.e. the set
of positive integers 1, 2, … , 𝑛. ℤ is the set of integers, i.e. the set of positive and negative integers and zero,
… , −3, −2, −1, 0, 1, 2, 3, … , ranging from −∞ to ∞. ℚ is the set of rational numbers, i.e. all numbers that can be written as
ratios 𝑝/𝑞, where 𝑝 and 𝑞 are integers with 𝑞 ≠ 0. ℝ is the set of real numbers, i.e., the set of rational numbers and those
numbers which cannot be written as such ratios, such as 𝜋. ℝ^𝑛 is the space of 𝑛-tuples with real entries (or, equivalently, the
set of 𝑛-dimensional vectors with real entries). An example of 𝑥 ∈ ℝ^5 is 𝑥 = (1, 2, 3, 4, 5)^𝑇.
2. 𝐶 = {1,2,3} ∪ {3,4,5} = {1,2,3,4,5} and 𝐷 = {1,2,3} ∩ {3,4,5} = {3}
3. The interval [0,1] is defined as [0,1] ≔ {𝑥 ∈ ℝ|0 ≤ 𝑥 ≤ 1}. Yes, 0 is an element of this interval.
Functions
Before studying functions, we introduce three helpful concepts often encountered in higher
mathematics, the sum symbol, the product symbol, and the rules of exponentiation.
(1) Sums, products, and exponentiation
In mathematics, one often has to add numbers. A concise way to represent sums is afforded by the
sum symbol,
∑ (1)
The sum symbol is reminiscent of the Greek letter Sigma (Σ), corresponding to the Roman capital S
and thus mnemonic for “Sum”. Under the sum symbol one denotes the terms summed over, usually with the
help of indices. For example, for 𝑥1, 𝑥2, 𝑥3 ∈ ℝ we can write the equation
𝑥1 + 𝑥2 + 𝑥3 = 𝑦 (2)
in shorthand notation as
∑_{𝑖=1}^{3} 𝑥𝑖 = 𝑦 (3)
Note that the subscript at the sum symbol indicates the running index and its initial value (here 𝑖 and
1, respectively) and that the superscript at the sum symbol denotes the final value of the running index
(here 𝑖 = 3). It is an often encountered (and lamentable) behaviour of authors to not include these sub- and
superscripts, which then render the sum symbol somewhat meaningless. We will usually use the index sub-
and superscripts in PMNF.
To obtain some familiarity with the sum symbol, consider the following examples
𝑎 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 (4)
𝑏 = 1 ⋅ 𝑥1 + 2 ⋅ 𝑥2 + 3 ⋅ 𝑥3 + ⋯+ 𝑛 ⋅ 𝑥𝑛 (5)
𝑐 = 2 + 2 + 2 + 2 + 2 (6)
Using the sum symbol, these may be written as follows: For 𝑎 we add all squares of the natural numbers
from 1 to 10. We thus write
𝑎 = ∑_{𝑖=1}^{10} 𝑖^2 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 (7)
For 𝑏, we are given the numbers 𝑥1, … , 𝑥𝑛 ∈ ℝ and have to multiply each one with its index and then add
them all up. We thus write
𝑏 = ∑_{𝑖=1}^{𝑛} 𝑖 ⋅ 𝑥𝑖 = 1 ⋅ 𝑥1 + 2 ⋅ 𝑥2 + 3 ⋅ 𝑥3 + ⋯ + 𝑛 ⋅ 𝑥𝑛 (8)
For 𝑐, we have to add the number 2 five times. For this we can write
𝑐 = ∑_{𝑖=1}^{5} 2 = 2 + 2 + 2 + 2 + 2 (9)
The naming of the index is irrelevant; for example, we have

∑_{𝑖=1}^{10} 𝑖^2 = ∑_{𝑗=1}^{10} 𝑗^2 (10)
One property of sums that we will encounter frequently is that constant factors (i.e. factors that do
not depend on the sum index) may either be written under the sum symbol or outside of it. Consider the
arithmetic mean of 𝑛 ∈ ℕ real numbers 𝑥1, 𝑥2, … , 𝑥𝑛. The arithmetic mean corresponds to the sum of the 𝑛
numbers divided by 𝑛. We may write this as

(𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛)/𝑛 = (∑_{𝑖=1}^{𝑛} 𝑥𝑖)/𝑛 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 (11)
or, equivalently, as
(𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛)/𝑛 = 𝑥1/𝑛 + 𝑥2/𝑛 + ⋯ + 𝑥𝑛/𝑛 = ∑_{𝑖=1}^{𝑛} 𝑥𝑖/𝑛 = ∑_{𝑖=1}^{𝑛} (1/𝑛) 𝑥𝑖 (12)
We have thus shown that
∑_{𝑖=1}^{𝑛} (1/𝑛) 𝑥𝑖 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 (13)
or, informally, that “we may take out the constant factor 1/𝑛 from under the sum and instead multiply the
whole sum by it”.
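The sum examples and the constant-factor property above can be checked numerically. The following sketch (in Python, for illustration; the numbers in xs are hypothetical) evaluates equation (7) and verifies that the two forms of the arithmetic mean in (13) coincide.

```python
# Numerical check of the sum examples and the constant-factor property (13).
xs = [3.0, 1.0, 4.0, 1.0, 5.0]   # hypothetical numbers x_1, ..., x_n
n = len(xs)

a = sum(i ** 2 for i in range(1, 11))   # equation (7): squares of 1, ..., 10
mean_inside = sum(x / n for x in xs)    # factor 1/n kept under the sum
mean_outside = sum(xs) / n              # factor 1/n taken out of the sum

print(a)                                       # 385
print(abs(mean_inside - mean_outside) < 1e-12) # True
```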
Another common mathematical operation is the multiplication of numbers. To this end, the product
symbol Π (the Greek capital Pi, mnemonic for “Product”) allows for writing products of multiple factors in a concise manner.
In complete analogy to the sum symbol, the product symbol has the following semantics
∏_{𝑖=1}^{𝑛} 𝑎𝑖 = 𝑎1 ⋅ 𝑎2 ⋅ … ⋅ 𝑎𝑛 (14)
Closely related to the product is the exponentiation operation, essentially a product of a number
with itself. Recall that for 𝑎 ∈ ℝ and 𝑛 ∈ ℕ0 “𝑎 to the power of 𝑛” is defined (recursively) as
𝑎^0 ≔ 1 and 𝑎^{𝑛+1} ≔ 𝑎^𝑛 ⋅ 𝑎 (15)
Further, “𝑎 to the power of minus 𝑛” is defined for 𝑎 ∈ ℝ\{0} and 𝑛 ∈ ℕ by
𝑎^{−𝑛} ≔ (𝑎^𝑛)^{−1} ≔ 1/𝑎^𝑛 (16)
In the term 𝑎𝑛 the number 𝑎 is sometimes referred to as “base” and the number 𝑛 as “exponent” or
“power”. Based on the definition in (15) the following familiar “laws of exponentiation” can be derived,
which hold for all 𝑎, 𝑏 ∈ ℝ and 𝑛,𝑚 ∈ ℤ (given that 𝑎 ≠ 0 for negative powers)
𝑎^𝑛 𝑎^𝑚 = 𝑎^{𝑛+𝑚} (17)
(𝑎^𝑛)^𝑚 = 𝑎^{𝑛𝑚} (18)
(𝑎𝑏)^𝑛 = 𝑎^𝑛 𝑏^𝑛 (19)
Further note that the 𝑛th root of a number 𝑎 ∈ ℝ is defined as a number 𝑟 ∈ ℝ whose 𝑛th power
equals 𝑎,
𝑟^𝑛 = 𝑎 (20)
From this definition it follows that the 𝑛th root may equivalently be written using a rational exponent,
𝑟 = 𝑎^{1/𝑛} (21)
because from (20) and (17) it then follows that
𝑟^𝑛 = (𝑎^{1/𝑛})^𝑛 = 𝑎^{1/𝑛} ⋅ 𝑎^{1/𝑛} ⋅ ⋯ ⋅ 𝑎^{1/𝑛} = 𝑎^{∑_{𝑖=1}^{𝑛} 1/𝑛} = 𝑎^1 = 𝑎 (22)
The familiar square root of a number 𝑎 ∈ ℝ, 𝑎 ≥ 0 may thus equivalently be written as
√𝑎 = 𝑎^{1/2} (23)
which, together with the laws of exponentiation (17) – (19) often significantly simplifies the handling of
square roots.
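The product symbol, the laws of exponentiation (17)–(19), and the root identity (21) can be illustrated numerically. The following sketch (Python, for illustration; the test values for 𝑎, 𝑏, 𝑛, 𝑚 are arbitrary) checks each in turn.

```python
import math

# Checks of the product symbol and the laws of exponentiation (17)-(19).
a, b = 2.0, 3.0
n, m = 4, 5

factorial_5 = math.prod(range(1, 6))        # product symbol: 1 * 2 * 3 * 4 * 5
law_17 = a ** n * a ** m == a ** (n + m)    # a^n a^m = a^(n+m)
law_18 = (a ** n) ** m == a ** (n * m)      # (a^n)^m = a^(nm)
law_19 = (a * b) ** n == a ** n * b ** n    # (ab)^n = a^n b^n
fifth_root = 32.0 ** (1 / 5)                # equation (21): 32^(1/5) = 2

print(factorial_5, law_17, law_18, law_19)  # 120 True True True
print(abs(fifth_root - 2.0) < 1e-12)        # True
```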
(2) General functions
A “function” (also referred to as “mapping”) 𝑓 is generally specified in the form
𝑓:𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥) (1)
where the set 𝐷 is called the “domain” of the function 𝑓 and the set 𝑅 is called the “range” or “co-domain”
of the function. In concordance with contemporary mathematical language, the terms “function” and
“mapping” are used interchangeably in these notes.
In (1) “𝑓: 𝐷 → 𝑅” may be read as “the function (or rule) 𝑓 maps all elements of the set 𝐷 onto
elements of the set 𝑅”. While this may appear artificial and cumbersome at first, it is of immense practical use
when dealing with functions and helps to be precise about their “input” set 𝐷 as well as their “output” set
𝑅. In (1) “𝑥 ↦ 𝑓(𝑥)” denotes the mapping of the domain element 𝑥 ∈ 𝐷 onto the range element 𝑓(𝑥) ∈ 𝑅
and may be read as “𝑥, which is an element of 𝐷, is mapped by 𝑓 onto 𝑓(𝑥), which is an element of 𝑅”.
Note that the arrow “→” is used to denote the mapping between the two sets and the arrow “↦” is used to
denote the mapping between an element in the domain of 𝑓 and an element in the range of 𝑓. The most
important thing to note about (1) is that this notation distinguishes between the function 𝑓 proper, which
corresponds to an “abstract rule”, and elements 𝑓(𝑥) of its range. A function can thus be understood as a
rule that relates two sets of quantities, the inputs and the outputs. Importantly, each input 𝑥 to a function is
related to an output of the function 𝑓(𝑥) in a deterministic fashion.
Usually in definitions of functions, the specification of the function abbreviation, its domain and its
range, as in (1) is followed by a definition of the functional form linking the domain elements to the range
elements. Consider the example of a familiar function, the square of a real number. Using the notation
introduced above, this function can be written as
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2 (2)
(3) Basic Properties of Functions
Functions can be categorized according to a basic set of properties that specify how a function
relates elements of its domain to elements of its range. To define these properties it is helpful to introduce
the notions of an “image” and a “preimage” first. To this end, let 𝑓: 𝐷 → 𝑅 be a function and let 𝑥 ∈ 𝐷. Then
𝑓(𝑥), i.e., the element of 𝑅 that 𝑥 is mapped onto, is called “the image of 𝑥 under 𝑓”. The entire subset of
the range 𝑅 for which images under 𝑓 exist, i.e., the set
𝑓(𝐷) ≔ {𝑦 ∈ 𝑅 | there exists an 𝑥 ∈ 𝐷 with 𝑓(𝑥) = 𝑦} ⊆ 𝑅 (1)
is called “the image of 𝐷 under 𝑓” and is notated by “𝑓(𝐷)”. Note that the image 𝑓(𝐷) and the range 𝑅 are
not necessarily identical and that the image can be a subset of the range. If 𝑦 is an element of 𝑓(𝐷), then an
𝑥 ∈ 𝐷 for which 𝑓(𝑥) = 𝑦 holds is called a “preimage” of 𝑦 under 𝑓. The following relationships between
images and preimages are important to note: (1) Every element in the domain of a function is allocated
exactly one image in the range under 𝑓. This is a more precise definition of the notion of a function. (2) Not
every element in the range of a function has to be a member of the image of 𝑓. (3) If 𝑦 is an element of
𝑓(𝐷), then there may exist multiple preimages of 𝑦.
The standard example to understand these properties is the square function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2 (2)
While every real number 𝑥 ∈ ℝ may be multiplied by itself, and thus 𝑦 ≔ 𝑥2 always exists, many 𝑦 ∈ ℝ do
not have preimages under 𝑓, namely all negative numbers. This simply follows as the square of 0 is 0, the
square of a positive number is a positive number, and the square of a negative number is a positive number.
Finally, this function also has the property that for one 𝑦 ∈ 𝑓(ℝ) there exist multiple preimages: for example
4 has the preimages 2 and −2 under 𝑓.
The relations between images and preimages of functions as sketched above are formalized by the
notions of “injective”, “surjective”, and “bijective” functions. These are defined as follows. Let 𝑓: 𝐷 → 𝑅 be a
function. 𝑓 is called surjective, if every element 𝑦 ∈ 𝑅 is a member of the image of 𝑓, or in other words, if
𝑓(𝐷) = 𝑅. If this is not the case, 𝑓 is not surjective. 𝑓 is called injective, if every element in the image of 𝑓
has exactly one preimage under 𝑓. If this is not the case, 𝑓 is not injective. Finally, a function 𝑓 that is
surjective and injective is called a “bijective” or “one-to-one” mapping. Figure 1 below illustrates a non-
surjective function, a non-injective function, and a bijective function.
Figure 1. Non-surjective, non-injective, and bijective functions.
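For functions on finite sets, surjectivity and injectivity can be checked directly by comparing images, which makes the definitions above concrete. The following sketch (Python, for illustration; the finite domain D and range R are hypothetical) applies these checks to a restricted square function.

```python
# Direct checks of surjectivity and injectivity on finite sets.
def is_surjective(f, D, R):
    # f is surjective iff the image f(D) equals the range R
    return {f(x) for x in D} == set(R)

def is_injective(f, D):
    # f is injective iff no two domain elements share the same image
    images = [f(x) for x in D]
    return len(images) == len(set(images))

D = [-2, -1, 0, 1, 2]
R = [0, 1, 4]
square = lambda x: x * x

print(is_surjective(square, D, R))   # True: every element of R has a preimage
print(is_injective(square, D))       # False: square(-2) == square(2)
```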
(4) Elementary Functions
Because the GLM embeds the assumption of normally distributed error terms, and because the
normal or Gaussian distribution belongs to the family of exponential probability density functions, we will
require a number of properties of the exponential function and its inverse, the logarithmic function, in
these lecture notes. We will not prove that the properties hold, but rather collect them here for reference.
Proofs of these properties may be found in any undergraduate real analysis textbook.
The identity function
The identity function is defined as
𝑖𝑑 ∶ ℝ → ℝ, 𝑥 ↦ 𝑖𝑑(𝑥) ≔ 𝑥 (1)
The identity function thus maps values 𝑥 ∈ ℝ onto themselves. The derivative of the identity function is 1
𝑑/𝑑𝑥 id(𝑥) = 𝑑/𝑑𝑥 𝑥 = 1 (2)
The exponential function
The exponential function is defined as
exp: ℝ → ℝ, 𝑥 ↦ exp(𝑥) ≔ 𝑒^𝑥 ≔ ∑_{𝑛=0}^{∞} 𝑥^𝑛/𝑛! = 1 + 𝑥 + 𝑥^2/2! + 𝑥^3/3! + 𝑥^4/4! + ⋯ (1)
The mathematical object ∑_{𝑛=0}^{∞} 𝑥^𝑛/𝑛!, an infinite sum, is called a “series” and is actually quite intricate. For now
it suffices to recall that 𝑒 = 𝑒^1 ≈ 2.71… is called “Euler’s number” and that exp(𝑥) thus corresponds to Euler’s
number to the power of 𝑥. The graph of the exponential function is depicted in Figure 2.
A defining property of the exponential function is that it is equal to its own derivative
𝑑/𝑑𝑥 exp(𝑥) = exp′(𝑥) = exp(𝑥) (2)
Below we note some properties of the exponential function which are often helpful in algebraic
manipulations:
Special values: exp(0) = 1 and exp(1) = 𝑒 (3)
Value ranges: 𝑥 ∈ ]−∞, 0[ ⇒ 0 < exp(𝑥) < 1 and 𝑥 ∈ ]0, ∞[ ⇒ 1 < exp(𝑥) < ∞ (4)
The exponential function thus assumes only strictly positive values, exp(ℝ) = ]0, ∞[. The brackets ]𝑎, 𝑏[
denote the “open interval” 𝑎 < 𝑥 < 𝑏, and the brackets [𝑎, 𝑏] denote the closed interval 𝑎 ≤ 𝑥 ≤ 𝑏.
Monotonicity: The exponential function is strictly monotonically increasing. In other words, if 𝑥 < 𝑦,
then it follows that exp(𝑥) < exp(𝑦).
Exponentiation identity (“product property”): exp(𝑎 + 𝑏) = exp(𝑎) ⋅ exp(𝑏) (5)
The exponentiation identity, i.e. the fact that the product of the exponential of two arguments 𝑎 and 𝑏
corresponds to the exponential of the sum of 𝑎 and 𝑏 will be exploited many times in the context of
independent normally distributed random variables in PMNF. From it, it follows that we have
exp(𝑎 − 𝑏) = exp(𝑎)/exp(𝑏) and exp(𝑎) ⋅ exp(−𝑎) = exp(𝑎)/exp(𝑎) = 1 (6)
Note that the term “product property” is a convention used in these notes, but not a general label.
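The series definition (1) and the product property (5) can be verified numerically. The following sketch (Python, for illustration; the test values 𝑥, 𝑎, 𝑏 are arbitrary) compares a truncated series against the library exponential and checks the product property.

```python
import math

# Checks of the series definition (1) and the product property (5) of exp.
x = 0.5
partial_sum = sum(x ** k / math.factorial(k) for k in range(20))  # truncated series

a, b = 1.3, -0.7
product_property = abs(math.exp(a + b) - math.exp(a) * math.exp(b)) < 1e-12

print(abs(partial_sum - math.exp(x)) < 1e-12)   # True: the series converges to exp(x)
print(math.exp(0.0) == 1.0, product_property)   # True True
```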
The natural logarithm
The natural logarithm may be defined as the inverse function of the exponential function as follows
ln: ]0, ∞[ → ℝ, 𝑥 ↦ ln(𝑥) (7)
Figure 2. Visualization of the graphs of the exponential function and the natural logarithm. Note that 𝑒𝑥𝑝(0) = 1 and
that 𝑙𝑛(1) = 0.
where ln(𝑥) is characterized by the fact that
ln(exp(𝑥)) = 𝑥 for all 𝑥 ∈ ℝ and exp(ln(𝑥)) = 𝑥 for all 𝑥 ∈ ]0,∞[ (8)
Note that the natural logarithm is only defined for positive values 𝑥 ∈]0,∞[. The graph of the natural
logarithm is depicted in Figure 2. An important feature of the natural logarithm, which we will often exploit
in these notes, is its derivative
𝑑/𝑑𝑥 ln(𝑥) = ln′(𝑥) = 1/𝑥 (9)
Below we note some properties of the natural logarithm which are often helpful in algebraic manipulations.
Special values: ln(1) = 0 and ln(𝑒) = 1 (10)
Value ranges: 𝑥 ∈ ]0, 1[ ⇒ ln(𝑥) < 0 and 𝑥 ∈ ]1, ∞[ ⇒ ln(𝑥) > 0 (11)
The natural logarithm thus assumes values in the entire range of the real numbers, but is only defined on the
set of positive real numbers: ln(]0, ∞[) = ℝ.
Monotonicity. The natural logarithm is strictly monotonically increasing. In other words, if 𝑥 < 𝑦,
then it follows that ln(𝑥) < ln(𝑦).
Inverse property: ln(1/𝑥) = −ln(𝑥) for all 𝑥 ∈ ]0, ∞[ (12)
Product property: ln(𝑥𝑦) = ln(𝑥) + ln(𝑦) for all 𝑥, 𝑦 ∈ ]0, ∞[ (13)
Power property: ln(𝑥^𝑘) = 𝑘 ln(𝑥) for all 𝑥 ∈ ]0, ∞[ and 𝑘 ∈ ℚ (14)
Especially the facts that the natural logarithm “turns multiplication into addition”, ln(𝑥𝑦) = ln(𝑥) + ln(𝑦), and
that it “turns exponentiation into multiplication”, ln(𝑥^𝑘) = 𝑘 ⋅ ln(𝑥), will be exploited many times in PMNF. Note
that the terms “inverse property”, “product property” and “power property” are conventions used in PMNF,
but not general labels.
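The logarithm properties (12)–(14) and the inverse relation (8) can likewise be checked numerically. The following sketch (Python, for illustration; the test values 𝑥, 𝑦, 𝑘 are arbitrary) verifies each property within floating-point tolerance.

```python
import math

# Checks of the logarithm properties (12)-(14) and the inverse relation (8).
x, y, k = 2.5, 4.0, 3

inverse_property = abs(math.log(1 / x) + math.log(x)) < 1e-12          # (12)
product_property = abs(math.log(x * y) - (math.log(x) + math.log(y))) < 1e-12  # (13)
power_property = abs(math.log(x ** k) - k * math.log(x)) < 1e-12       # (14)
inverse_of_exp = abs(math.log(math.exp(x)) - x) < 1e-12                # (8)

print(inverse_property, product_property, power_property, inverse_of_exp)  # True True True True
```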
The cosine, sine, and tangent functions
We eschew a discussion of the geometric intuitions of the trigonometric functions (the interested
reader is referred to [Spivak 1994] for discussion of these basic properties) and introduce the cosine and sine
functions by means of their series representations
cos: ℝ → [−1, 1], 𝑥 ↦ cos(𝑥) ≔ ∑_{𝑛=0}^{∞} (−1)^𝑛 𝑥^{2𝑛}/(2𝑛)! (1)
sin: ℝ → [−1, 1], 𝑥 ↦ sin(𝑥) ≔ ∑_{𝑛=0}^{∞} (−1)^𝑛 𝑥^{2𝑛+1}/(2𝑛+1)! (2)
The graphs of the sine and cosine functions are shown in Figure 3. Notably, these functions vary in a
prescribed way between +1 and −1 and repeat themselves every 2𝜋.
Figure 3. Sine and cosine on the interval [−2𝜋, 4𝜋].
The following properties of the cosine and sine function, which we state without proofs, are useful
to remember (the interested reader is referred to [Spivak 1994] for a derivation of these properties from the
series definition of sine and cosine)
Derivatives: sin′ = cos and cos′ = −sin (3)
Special values: sin(0) = 0 and cos(0) = 1 (4)
Functional equations: cos(−𝑥) = cos(𝑥), sin(−𝑥) = −sin(𝑥), and cos^2(𝑥) + sin^2(𝑥) = 1 (5)
Addition theorems: For all 𝑥, 𝑦 ∈ ℝ
cos(𝑥 + 𝑦) = cos(𝑥) cos(𝑦) − sin(𝑥) sin(𝑦) and sin(𝑥 + 𝑦) = sin(𝑥) cos(𝑦) + cos(𝑥) sin(𝑦) (6)
Definition of 𝜋: The zero of cos in [0,2] multiplied by 2 is called 𝜋 (7)
Periodicity of cosine and sine with period 2𝜋: cos(𝑥 + 2𝜋) = cos(𝑥), sin(𝑥 + 2𝜋) = sin(𝑥) (8)
Zeros: cos(𝑥) = 0 ⇒ 𝑥 ∈ {𝜋/2 + 𝑘𝜋 | 𝑘 ∈ ℤ}, sin(𝑥) = 0 ⇒ 𝑥 ∈ {𝑘𝜋 | 𝑘 ∈ ℤ} (9)
Relation of cosine and sine: cos(𝑥 + 𝜋/2) = −sin(𝑥), sin(𝑥 + 𝜋/2) = cos(𝑥) (10)
The tangent function is defined as
tan: ℝ\{𝜋/2 + 𝑘𝜋 | 𝑘 ∈ ℤ} → ℝ, 𝑥 ↦ tan(𝑥) ≔ sin(𝑥)/cos(𝑥) (11)
The following properties of the tangent function, which we state without proofs, are useful (again
the interested reader is referred to [Spivak 1994] for a derivation of these properties)
Derivative: tan′(𝑥) = 1/cos^2(𝑥) = 1 + tan^2(𝑥) (𝑥 ≠ 𝜋/2 + 𝑘𝜋, 𝑘 ∈ ℤ) (12)
Special values: tan(0) = 0, tan(𝜋/4) = 1, tan(−𝜋/4) = −1 (13)
Functional equations: tan(−𝑥) = −tan(𝑥), tan(𝑥 + 𝜋) = tan(𝑥) (𝑥 ≠ 𝜋/2 + 𝑘𝜋, 𝑘 ∈ ℤ) (14)
The trigonometric functions are not injective and hence not invertible. However, on intervals on which
the functions are strictly monotonic, for example on [−𝜋/2, 𝜋/2] for the sine, [0, 𝜋] for the cosine, and
]−𝜋/2, 𝜋/2[ for the tangent function, one may define their inverse functions. These functions are referred to as
the “arc” functions and are defined as follows, where by 𝑓|_{[𝑎,𝑏]} we denote the restriction of a function 𝑓
to the interval [𝑎, 𝑏]:
arccos ≔ (cos|_{[0,𝜋]})^{−1}: [−1, 1] → [0, 𝜋] with cos(arccos(𝑥)) = 𝑥 (𝑥 ∈ [−1, 1]) (15)
arcsin ≔ (sin|_{[−𝜋/2,𝜋/2]})^{−1}: [−1, 1] → [−𝜋/2, 𝜋/2] with sin(arcsin(𝑥)) = 𝑥 (𝑥 ∈ [−1, 1]) (16)
and
arctan ≔ (tan|_{]−𝜋/2,𝜋/2[})^{−1}: ℝ → ]−𝜋/2, 𝜋/2[ with tan(arctan(𝑥)) = 𝑥 (𝑥 ∈ ℝ) (17)
Study Questions
1. Evaluate the sum 𝑦 ≔ ∑_{𝑖=1}^{4} 𝑎𝑖𝑥𝑖 for 𝑎1 = −1, 𝑎2 = 0, 𝑎3 = 2, 𝑎4 = −2 and 𝑥1 = 3, 𝑥2 = 2, 𝑥3 = 5, 𝑥4 = −2.
2. Explain the meaning of 𝑓: 𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥) and its components 𝑓,𝐷, 𝑅, 𝑥, 𝑓(𝑥) using an example of your choice.
3. Is the function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 2𝑥 + 2 a linear function? Justify your answer.
Study Questions Answers
1. With the definitions in the question we have
𝑦 = 𝑎1𝑥1 + 𝑎2𝑥2 + 𝑎3𝑥3 + 𝑎4𝑥4
= −1 ⋅ 3 + 0 ⋅ 2 + 2 ⋅ 5 + (−2) ⋅ (−2)
= −3 + 0 + 10 + 4
= 11
2. In 𝑓: 𝐷 → 𝑅, 𝑥 ↦ 𝑓(𝑥), 𝑓 denotes a function, i.e. a rule that allocates elements of 𝑅 to all elements of 𝐷. 𝐷 is a set and referred to
as the domain of the function (or its input set), 𝑅 is a set and referred to as the range of the function (or its output set). The
statement “𝑓: 𝐷 → 𝑅” defines the label of the function, its domain and its range. 𝑥 is an element of 𝐷 and denotes an input
argument of the function, ↦ is read as “maps to” and defines which function value 𝑓(𝑥) ∈ 𝑅 is allocated to an input argument 𝑥. As
an example, consider 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑥2. Here, the function value for the input argument 𝑥 is defined as the square of 𝑥.
3. The function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 2𝑥 + 2 on the vector space ℝ is not a linear function. For example, we have 𝑓(1) ≔ 2 ⋅ 1 +
2 = 4 and 𝑓(2) ≔ 2 ⋅ 2 + 2 = 6, but 𝑓(1 + 2) ≔ 2 ⋅ 3 + 2 = 8 ≠ 10 and thus 𝑓(1 + 2) ≠ 𝑓(1) + 𝑓(2).
Matrix Algebra
This Section reviews some aspects of elementary matrix algebra. The interested reader will find a
much more exhaustive and principled treatment in [Strang 2009].
(1) Matrix Definition
A matrix is a rectangular collection of numbers. Matrices are usually denoted by capital letters as
follows:
𝐴 = ( 𝑎11  𝑎12  …  𝑎1𝑚
      𝑎21  𝑎22  …  𝑎2𝑚
       ⋮    ⋮    ⋱   ⋮
      𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 )  = (𝑎𝑖𝑗)_{1≤𝑖≤𝑛, 1≤𝑗≤𝑚} (1)
An entry 𝑎𝑖𝑗 in a matrix 𝐴 is indexed by its row index 𝑖 and its column index 𝑗. For example, the entry 𝑎32 in
the matrix
𝐴 ≔ ( 2  7  8  2
      5  2  5  6
      6  4  9  2
      0  9  1  6 ) (2)
is 4.
The size (sometimes also called dimensionality) of a matrix is determined by its number of rows
𝑛 ∈ ℕ and its number of columns 𝑚 ∈ ℕ. If a matrix has the same number of rows and columns (that is, if
𝑛 = 𝑚) the matrix is called a “square matrix”. When describing a matrix, it is very helpful to mention its
number of rows and columns and the properties of its entries. In our treatment, the entries of a matrix will
usually be elements of the set of real numbers ℝ. To denote that a given matrix 𝐴 has entries from the set of
real numbers, and that it has 𝑛 rows and 𝑚 columns, we will write
𝐴 ∈ ℝ𝑛×𝑚 (3)
(3) may be expressed in words as “The matrix 𝐴 consists of 𝑛 rows and 𝑚 columns and the entries in 𝐴 are
real numbers”. For example, for the matrix 𝐴 in (2) we write
𝐴 ∈ ℝ^{4×4} (4)
because it has four rows and four columns (and is, in fact, a square matrix).
Above, we have introduced the set of real 𝑛-tuples (𝑛-dimensional vectors) ℝ𝑛. In the context of
matrix algebra, we can identify the 𝑛-dimensional vectors with the set of 𝑛 × 1 matrices. In other words, for
most of our purposes, we can treat 𝑛-dimensional vectors and matrices with 𝑛 rows and a single column as
equivalent. This also implies that we can set ℝ𝑛 ≔ ℝ𝑛×1 for 𝑛 ∈ ℕ.
Just as one can do algebra with real numbers, one can calculate with matrices. More specifically, one can
(1) Add and subtract two matrices of the same dimensionality (Matrix Addition and Subtraction)
(2) Multiply a matrix with a scalar (Matrix Scalar Multiplication)
(3) Multiply two matrices, but only if certain conditions hold (Matrix Multiplication)
(4) Divide by a matrix, or, more precisely, multiply by a matrix inverse (Matrix Inversion)
We will study each type of operation below. Addition, scalar multiplication, matrix multiplication, and matrix
inversion are not the only operations that can be performed with matrices, but will suffice for most of these
lecture notes. One additional concept we will require, however, is the notion of
(5) Matrix Transposition
(2) Matrix Addition and Subtraction
Two matrices of the same size, i.e. with the same number of rows and columns, can be added or subtracted
by element-wise adding or subtracting their entries. Formally, we can write this as
𝐴 + 𝐵 = 𝐶 (1)
with 𝐴, 𝐵, 𝐶 ∈ ℝ𝑛 ×𝑚, and the matrix 𝐶 is given by
𝐴 + 𝐵 = ( 𝑎11 + 𝑏11   𝑎12 + 𝑏12   …   𝑎1𝑚 + 𝑏1𝑚
          𝑎21 + 𝑏21   𝑎22 + 𝑏22   …   𝑎2𝑚 + 𝑏2𝑚
               ⋮            ⋮       ⋱        ⋮
          𝑎𝑛1 + 𝑏𝑛1   𝑎𝑛2 + 𝑏𝑛2   …   𝑎𝑛𝑚 + 𝑏𝑛𝑚 )
      = ( 𝑐11  𝑐12  …  𝑐1𝑚
          𝑐21  𝑐22  …  𝑐2𝑚
           ⋮    ⋮    ⋱   ⋮
          𝑐𝑛1  𝑐𝑛2  …  𝑐𝑛𝑚 )
      = 𝐶 (2)
The analogue element-wise operation is defined for subtraction
𝐴 − 𝐵 = 𝐶 (3)
Example
As an example, consider the 2 × 3 matrices 𝐴, 𝐵 ∈ ℝ^{2×3}
𝐴 ≔ ( 2  −3  0
      1   6  5 )   and   𝐵 ≔ (  4  1  0
                               −4  2  0 ) (4)
Since they both have the same size, we can add them, as below
𝐴 + 𝐵 = ( 2 + 4   −3 + 1   0 + 0
          1 − 4    6 + 2   5 + 0 )  = (  6  −2  0
                                        −3   8  5 ) (5)
and we can subtract them, as below
𝐴 − 𝐵 = ( 2 − 4   −3 − 1   0 − 0
          1 + 4    6 − 2   5 − 0 )  = ( −2  −4  0
                                         5   4  5 ) (6)
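The addition and subtraction in equations (5) and (6) can be reproduced with array software. The following sketch (Python/NumPy, for illustration) uses the same matrices 𝐴 and 𝐵 as the example.

```python
import numpy as np

# Element-wise matrix addition and subtraction, equations (5) and (6).
A = np.array([[2, -3, 0],
              [1,  6, 5]])
B = np.array([[ 4, 1, 0],
              [-4, 2, 0]])

print(A + B)   # [[ 6 -2  0] [-3  8  5]]
print(A - B)   # [[-2 -4  0] [ 5  4  5]]
```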
(3) Matrix Scalar Multiplication
One can also multiply a matrix 𝐴 ∈ ℝ^{𝑛×𝑚} by a number 𝑐 ∈ ℝ. Note that the set ℝ^{1×1} of
1 × 1 matrices with real entries is usually identified with the set of real numbers ℝ. The operation of
multiplying a matrix by a number is called “scalar multiplication” and is performed element-wise. Formally,
we have for 𝐴 ∈ ℝ^{𝑛×𝑚} and 𝑐 ∈ ℝ
𝑐𝐴 = 𝐵 (1)
where 𝐵 ∈ ℝ^{𝑛×𝑚} is evaluated according to
𝑐𝐴 = 𝑐 ( 𝑎11  𝑎12  …  𝑎1𝑚
         𝑎21  𝑎22  …  𝑎2𝑚
          ⋮    ⋮    ⋱   ⋮
         𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 )
   = ( 𝑐𝑎11  𝑐𝑎12  …  𝑐𝑎1𝑚
       𝑐𝑎21  𝑐𝑎22  …  𝑐𝑎2𝑚
        ⋮     ⋮    ⋱    ⋮
       𝑐𝑎𝑛1  𝑐𝑎𝑛2  …  𝑐𝑎𝑛𝑚 )
   = ( 𝑏11  𝑏12  …  𝑏1𝑚
       𝑏21  𝑏22  …  𝑏2𝑚
        ⋮    ⋮    ⋱   ⋮
       𝑏𝑛1  𝑏𝑛2  …  𝑏𝑛𝑚 )
   = 𝐵 (2)
As an example, consider the matrix 𝑋 ∈ ℝ^{4×3}
𝑋 ≔ ( 3  1  1
      5  2  5
      2  7  1
      3  4  2 ) (3)
and let 𝑐 ≔ −3. Then
𝑐𝑋 = −3 ( 3  1  1      ( −3 ∙ 3   −3 ∙ 1   −3 ∙ 1       (  −9   −3   −3
          5  2  5    =   −3 ∙ 5   −3 ∙ 2   −3 ∙ 5   =     −15   −6  −15
          2  7  1        −3 ∙ 2   −3 ∙ 7   −3 ∙ 1          −6  −21   −3
          3  4  2 )      −3 ∙ 3   −3 ∙ 4   −3 ∙ 2 )        −9  −12   −6 ) (4)
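The scalar multiplication in equation (4) can be reproduced numerically. The following sketch (Python/NumPy, for illustration) multiplies each entry of 𝑋 by 𝑐 = −3 in one operation.

```python
import numpy as np

# Element-wise scalar multiplication, equation (4).
X = np.array([[3, 1, 1],
              [5, 2, 5],
              [2, 7, 1],
              [3, 4, 2]])
c = -3

print(c * X)   # [[ -9  -3  -3] [-15  -6 -15] [ -6 -21  -3] [ -9 -12  -6]]
```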
(4) Matrix Multiplication
In addition to adding and subtracting matrices of the same size, and multiplying a matrix by a scalar,
one can also multiply two matrices. However, matrix multiplication is not an element-wise operation, but
has a special definition. Importantly, two matrices 𝐴 and 𝐵 can only be multiplied as
𝐴𝐵 (1)
if the first matrix 𝐴 has as many columns as the second matrix 𝐵 has rows, or in other words, if
𝐴 ∈ ℝ𝑛×𝑚 and 𝐵 ∈ ℝ𝑚×𝑝 (2)
with 𝑛,𝑚, 𝑝 ∈ ℕ. This condition is sometimes referred to as the equality of the “inner dimensions” with
respect to the product 𝐴𝐵. If this equality does not hold, the two matrices cannot be multiplied. However, if
the condition holds, the matrix product for 𝐴 ∈ ℝ𝑛×𝑚, 𝐵 ∈ ℝ𝑚×𝑝 can be written as
𝐴𝐵 = 𝐶 (3)
where 𝐶 ∈ ℝ^{𝑛×𝑝} is evaluated according to
𝐴𝐵 = ( 𝑎11  …  𝑎1𝑚        ( 𝑏11  …  𝑏1𝑝
        ⋮   ⋱   ⋮      ⋅     ⋮   ⋱   ⋮
       𝑎𝑛1  …  𝑎𝑛𝑚 )        𝑏𝑚1  …  𝑏𝑚𝑝 )
   = ( ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎1𝑘𝑏𝑘𝑝
       ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎2𝑘𝑏𝑘𝑝
                ⋮                       ⋮          ⋱             ⋮
       ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘1   ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘2   …   ∑_{𝑘=1}^{𝑚} 𝑎𝑛𝑘𝑏𝑘𝑝 )
   = ( 𝑐11  …  𝑐1𝑝
        ⋮   ⋱   ⋮
       𝑐𝑛1  …  𝑐𝑛𝑝 )
   = 𝐶 (4)
The expression for the matrix product is a bit unwieldy, but very important. What the expression says is that
the entry in the 𝑖-th row and 𝑗-th column of the matrix product of the two matrices 𝐴 ∈ ℝ^{𝑛×𝑚} and
𝐵 ∈ ℝ^{𝑚×𝑝} is given by overlaying the entries in the 𝑖-th row of matrix 𝐴 with the entries in the 𝑗-th column of
matrix 𝐵, multiplying the overlaid numbers, and then adding them all up. It takes some time to get used to
matrix multiplication, so do not worry if it is not completely clear now. Consider the example below next.
Example
Let 𝐴 ∈ ℝ^{2×3} and 𝐵 ∈ ℝ^{3×2} be defined as
𝐴 ≔ ( 2  −3  0
      1   6  5 )   and   𝐵 ≔ (  4  2
                               −1  0
                                1  3 ) (5)
We first consider the size of the matrix 𝐶 ≔ 𝐴𝐵. We know that 𝐴 has two rows and three columns, while 𝐵
has three rows and two columns. Because 𝐵 has the same number of rows as 𝐴 has columns, the matrix
product 𝐴𝐵 = 𝐶 is defined. The expression (4) tells us that the resulting matrix 𝐶 has two rows and two
columns, because the number of rows of the resulting matrix 𝐶 is determined by the number of rows of the
first matrix 𝐴, and the number of columns of the resulting matrix 𝐶 is determined by the number of columns
of the second matrix 𝐵. We thus know that 𝐶 is a 2 × 2 matrix, or 𝐶 ∈ ℝ^{2×2}. Overlaying the rows of 𝐴 on the
columns of 𝐵, multiplying the entries, and adding the results up yields
𝐴𝐵 = ( 2  −3  0     (  4  2
       1   6  5 )  ⋅  −1  0
                       1  3 ) (6)
   = ( (2 ∙ 4) + (−3 ∙ −1) + (0 ∙ 1)    (2 ∙ 2) + (−3 ∙ 0) + (0 ∙ 3)
       (1 ∙ 4) + (6 ∙ −1) + (5 ∙ 1)     (1 ∙ 2) + (6 ∙ 0) + (5 ∙ 3) )
   = ( 8 + 3 + 0    4 + 0 + 0
       4 − 6 + 5    2 + 0 + 15 )
   = ( 11   4
        3  17 )
   = 𝐶
It is essential to always keep track of the sizes of the matrices that are involved in a multiplication.
Specifically, if matrix 𝐴 is of size 𝑛 × 𝑚 and matrix 𝐵 is of size 𝑚 × 𝑝, then the product 𝐴𝐵 will always be of
size 𝑛 × 𝑝. This can be visualized as
(𝑛 × 𝑚) ∙ (𝑚 × 𝑝) = (𝑛 × 𝑝) (7)
i.e. the inner numbers 𝑚 disappear.
If one is calculating with scalars, multiplication is commutative, that is for 𝑎, 𝑏 ∈ ℝ, we have
𝑎𝑏 = 𝑏𝑎. In general, matrix multiplication is not commutative, i.e. the side on which 𝐴 and 𝐵 stand matters.
As an example, consider the matrices from above. We have just seen that
𝐶 ≔ 𝐴𝐵 = ( 2  −3  0     (  4  2           ( 11   4
           1   6  5 )  ⋅  −1  0       =      3  17 ) (8)
                           1  3 )
On the other hand, we have
𝐷 ≔ 𝐵𝐴 = (  4  2      ( 2  −3  0        ( 10   0  10
           −1  0    ⋅   1   6  5 )  =    −2   3   0
            1  3 )                        5  15  15 ) (9)
as you may convince yourself as an exercise.
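The products in equations (8) and (9), and hence the non-commutativity of matrix multiplication, can be verified numerically. The following sketch (Python/NumPy, for illustration) reuses the matrices from the example.

```python
import numpy as np

# Matrix multiplication and non-commutativity, equations (8) and (9).
A = np.array([[2, -3, 0],
              [1,  6, 5]])       # 2 x 3
B = np.array([[ 4, 2],
              [-1, 0],
              [ 1, 3]])          # 3 x 2

C = A @ B                        # 2 x 2
D = B @ A                        # 3 x 3 -- not even the same shape as AB

print(C)   # [[11  4] [ 3 17]]
print(D)   # [[10  0 10] [-2  3  0] [ 5 15 15]]
```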
(5) Matrix Inversion
In order to motivate the concept of matrix inversion consider the equation
𝐴𝑋 = 𝑌 (1)
with 𝐴 ∈ ℝ^{𝑛×𝑛}, 𝑋 ∈ ℝ^{𝑛×𝑛}, and thus 𝑌 ∈ ℝ^{𝑛×𝑛}. We now simplify the above by assuming 𝑛 = 1, i.e. the
matrices in (1) are in fact scalar numbers. To remind us of this assumption, we represent (1) using lower-case
letters 𝑎, 𝑥, 𝑦 ∈ ℝ as
𝑎𝑥 = 𝑦 (2)
Let us further assume that we know that 𝑎 = 2 and 𝑦 = 6, that is, we have the equation
2 ⋅ 𝑥 = 6 (3)
In school mathematics, we were taught how to solve equations of the form (3) for the “unknown variable
𝑥” for many years. The general strategy was to isolate 𝑥 by the appropriate operations (like addition and
multiplication) on the left-hand side and observe the outcome on the right-hand side. This strategy would
look similar to something like this
2 ⋅ 𝑥 = 6        | divide both sides by 2 (4)
(2/2) ⋅ 𝑥 = 6/2  | evaluate the ratios
𝑥 = 3
and we have learned that the unknown value of the variable 𝑥 is equal to 3. We also remember from high
school algebra that dividing by a scalar number 𝑎 corresponds to multiplication with its inverse, where the
inverse is understood as that number that yields 1 if multiplied with 𝑎, that is, 1/𝑎.
We may thus also write (3) as
2 ⋅ 𝑥 = 6                | multiply both sides by 1/2 (5)
(1/2) ⋅ 2 ⋅ 𝑥 = (1/2) ⋅ 6 | evaluate the products
𝑥 = 3
Of course, the strategy used does not change the result. You may also remember that the inverse of a scalar 𝑎,
i.e. 1/𝑎, can also be denoted as 𝑎^{−1}. In general, 𝑎^{−1} is called the inverse of 𝑎 if, and only if,
𝑎𝑎^{−1} = 𝑎^{−1}𝑎 = 1 (6)
where 1 denotes the “neutral element” of multiplication, that is, for all 𝑎
𝑎 ∙ 1 = 1 ∙ 𝑎 = 𝑎 (7)
To recapitulate (5) in more abstract terms, we have carried out the following operation
𝑎𝑥 = 𝑦              | multiply both sides by 𝑎^{−1} (8)
𝑎^{−1}𝑎𝑥 = 𝑎^{−1}𝑦  | evaluate the product
𝑥 = 𝑎^{−1}𝑦
We now return to the case that 𝐴 ∈ ℝ^{𝑛×𝑛}, 𝑋 ∈ ℝ^{𝑛×𝑝}, and 𝑌 ∈ ℝ^{𝑛×𝑝} are in fact matrices. Very often we will
encounter analogous statements to (2) for matrices, which are referred to as “Linear Systems of Equations”
𝐴𝑋 = 𝑌 (9)
where we know 𝐴 and 𝑌 and would like to evaluate 𝑋. In analogy to the strategy described in (8), we can
multiply both sides of (9) with the inverse of 𝐴, denoted as 𝐴−1 and obtain
𝐴−1𝐴𝑋 = 𝐴−1𝑌 (10)
In analogy to 𝑎−1𝑎 = 1 for scalars, the matrix product 𝐴−1𝐴 yields, by definition, a very specific matrix,
namely the identity matrix 𝐼𝑛 ∈ ℝ𝑛×𝑛. The identity matrix has ones on its main diagonal and zeros
everywhere else. That is, by definition, we have
𝐴^{−1}𝐴 = 𝐼𝑛 ≔ ( 1  0  …  0
                 0  1  …  0
                 ⋮  ⋮  ⋱  ⋮
                 0  0  …  1 ) (11)
In analogy to 𝑎 ∙ 1 = 1 ∙ 𝑎 = 𝑎 for scalars, the product of a matrix 𝐴 with the identity always yields the
matrix 𝐴 again
𝐴𝐼 = 𝐼𝐴 = 𝐴 (12)
If we consider (10) again, we thus see that we can evaluate the left-hand side of (10) as follows
𝐴−1𝐴𝑋 = 𝐴−1𝑌 ⇔ 𝐼𝑛𝑋 = 𝐴−1𝑌 ⇔ 𝑋 = 𝐴−1𝑌 (13)
In other words, if we have a way to evaluate the inverse 𝐴−1 of the matrix 𝐴, we can solve for the unknown
matrix 𝑋 just as we can solve for the unknown variable 𝑥 in 𝑎𝑥 = 𝑦.
There is a lot more to learn about matrix inversion. For example, so far, we have no idea how to
actually evaluate 𝐴−1 for, say
𝐴 ≔ ( 1   0  2
      2  −1  3
      4   1  8 ) (14)
or in other words, to find a matrix 𝐴−1 such that the matrix product 𝐴−1𝐴 evaluates as
𝐴^{−1}𝐴 = 𝐴^{−1} ( 1   0  2        ( 1  0  0
                   2  −1  3    =     0  1  0
                   4   1  8 )        0  0  1 )  = 𝐼3 (15)
Also, specific conditions exist under which a matrix may not even be invertible (just as one cannot evaluate
the inverse of the scalar 0, i.e. 1/0 is “not allowed”). There is also a deep relation between matrix inversion and
the concept of matrix determinants, which you may recall from high-school mathematics. Further note that
we have been specific about the dimensions of 𝐴, 𝑋, and 𝑌: the matrix 𝐴 to be inverted must be square. For non-
square matrices, things become more complex, and “proper” matrix inversion is not defined.
One example of an inverse matrix will be helpful with respect to multivariate Gaussian distributions:
the inverse of a diagonal matrix, i.e. a square matrix that has non-zero elements on its main diagonal, and
zeros everywhere else:
𝐷 = ( 𝑑1   0   ⋯   0
       0  𝑑2   ⋯   0
       ⋮   ⋮   ⋱   ⋮
       0   0   ⋯  𝑑𝑛 )  ∈ ℝ^{𝑛×𝑛} (16)
The inverse of 𝐷 is given by the diagonal matrix with the scalar inverses 𝑑𝑖^{−1} = 1/𝑑𝑖 on its diagonal:
𝐷^{−1} = ( 1/𝑑1    0    ⋯    0
             0   1/𝑑2   ⋯    0
             ⋮     ⋮    ⋱    ⋮
             0     0    ⋯  1/𝑑𝑛 )  ∈ ℝ^{𝑛×𝑛} (17)
as one may convince oneself by evaluating the matrix product 𝐷−1𝐷 for, say 𝑛 = 4.
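The suggested check of equations (16) and (17) for 𝑛 = 4 can be carried out numerically. The following sketch (Python/NumPy, for illustration; the diagonal entries are hypothetical) inverts the diagonal entries element-wise and verifies that the product gives the identity matrix.

```python
import numpy as np

# Inverse of a diagonal matrix, equations (16) and (17), checked for n = 4.
D = np.diag([2.0, 4.0, 5.0, 10.0])
D_inv = np.diag(1.0 / np.diag(D))   # invert each diagonal entry

print(np.allclose(D_inv @ D, np.eye(4)))   # True
print(np.allclose(D @ D_inv, np.eye(4)))   # True
```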
In the following section, the interested reader will find a procedure that allows for evaluating the
inverse of small (e.g. 𝑛 = 2, 3, 4) square matrices 𝐴 ∈ ℝ^{𝑛×𝑛} by hand. This is helpful for becoming familiar
with the theory of matrix inversion. However, in real life, matrix inverses are usually evaluated
algorithmically with the help of a computer, so being proficient in manual matrix inversion is not an essential
skill.
(6) Inversion of small matrices by hand
While the introduction of matrix inversion has so far explained the main principles, it has not been
discussed how to actually invert a matrix, i.e. how to compute 𝐴−1 if 𝐴 is known. In fact, there are different
ways to do that, and we will only discuss one approach here for the remainder of the chapter.
We will demonstrate this approach by trying to solve the following matrix equation
𝐴𝑋 = ( 1   0  2     ( 𝑥11  𝑥12  𝑥13       ( 2  −2  3
       2  −1  3   ⋅   𝑥21  𝑥22  𝑥23   =     1   4  2
       4   1  8 )     𝑥31  𝑥32  𝑥33 )       2   1  1 )  = 𝐵 (1)
with 𝐴 ∈ ℝ^{3×3} and 𝐵 ∈ ℝ^{3×3} and unknown 𝑋 ∈ ℝ^{3×3}. Recall that we can solve the equation for the
matrix 𝑋 ∈ ℝ^{3×3} if we find an inverse 𝐴^{−1} ∈ ℝ^{3×3} for the matrix 𝐴 ∈ ℝ^{3×3}, because then
𝑋 = 𝐴−1𝐵 (2)
Also recall, that for an inverse of 𝐴 ∈ ℝ3×3, the following equation holds
𝐴−1𝐴 = 𝐼 (3)
where
𝐼 = ( 1  0  0
      0  1  0
      0  0  1 )  ∈ ℝ^{3×3} (4)
A practical approach to finding the inverse of 𝐴 is the following
1. Write down the matrix 𝐴 and next to it the identity matrix 𝐼
( 1   0  2 | 1  0  0
  2  −1  3 | 0  1  0
  4   1  8 | 0  0  1 ) (5)
2. Now use three kinds of operations on the rows of 𝐴 to transform 𝐴 into the identity matrix. Apply the
same operations to the identity matrix 𝐼 in parallel. The operations allowed are
a. Exchanging two rows of 𝐴
b. Multiplying a row of 𝐴 by a number
c. Adding or subtracting a multiple of another row of 𝐴 from any row of 𝐴
Adding −2 times the first row to the second row, and −4 times the first row to the third row, yields
( 1   0   2 |  1  0  0
  0  −1  −1 | −2  1  0
  0   1   0 | −4  0  1 ) (6)
Exchanging the second and third rows yields
( 1   0   2 |  1  0  0
  0   1   0 | −4  0  1
  0  −1  −1 | −2  1  0 ) (7)
Adding the second row to the third row yields
( 1  0   2 |  1  0  0
  0  1   0 | −4  0  1
  0  0  −1 | −6  1  1 ) (8)
Adding 2 times the third row to the first row yields
( 1  0   0 | −11  2  2
  0  1   0 |  −4  0  1
  0  0  −1 |  −6  1  1 ) (9)
Multiplying the third row by −1 yields
( 1  0  0 | −11   2   2
  0  1  0 |  −4   0   1
  0  0  1 |   6  −1  −1 ) (10)
Having transformed the matrix on the left into the identity matrix, the matrix now standing on the right is
the inverse 𝐴^{−1} ∈ ℝ^{3×3}, as can be verified by computing
( 1   0  2     ( −11   2   2       ( 1  0  0
  2  −1  3   ⋅    −4   0   1   =     0  1  0
  4   1  8 )       6  −1  −1 )       0  0  1 ) (11)
Having obtained
𝐴^{−1} = ( −11   2   2
            −4   0   1
             6  −1  −1 ) (12)
the solution to
𝐴𝑋 = ( 1   0  2     ( 𝑥11  𝑥12  𝑥13       ( 2  −2  3
       2  −1  3   ⋅   𝑥21  𝑥22  𝑥23   =     1   4  2
       4   1  8 )     𝑥31  𝑥32  𝑥33 )       2   1  1 )  = 𝐵 (13)
is then
𝑋 = 𝐴^{−1}𝐵 = ( −11   2   2     ( 2  −2  3       ( −16   32  −27
                 −4   0   1   ⋅   1   4  2   =      −6    9  −11
                  6  −1  −1 )     2   1  1 )         9  −17   15 ) (14)
(7) Matrix Transposition
A very basic, and often useful, matrix operation is the transposition of a matrix. By the transposition
of a matrix we understand the exchange of its row and column elements. The transposition of a matrix 𝐴 is
denoted by 𝐴^𝑇 and implies that, if 𝐴 ∈ ℝ^{𝑛×𝑚}, then 𝐴^𝑇 ∈ ℝ^{𝑚×𝑛}. Formally, we have for 𝐴 ∈ ℝ^{𝑛×𝑚} and
𝐵 ≔ 𝐴^𝑇, where
𝐴 = ( 𝑎11  𝑎12  …  𝑎1𝑚
      𝑎21  𝑎22  …  𝑎2𝑚
       ⋮    ⋮    ⋱   ⋮
      𝑎𝑛1  𝑎𝑛2  …  𝑎𝑛𝑚 ) (1)
the following form of 𝐵 ∈ ℝ^{𝑚×𝑛}:
𝐵 = ( 𝑏11  𝑏12  …  𝑏1𝑛           ( 𝑎11  𝑎21  …  𝑎𝑛1
      𝑏21  𝑏22  …  𝑏2𝑛       ≔     𝑎12  𝑎22  …  𝑎𝑛2
       ⋮    ⋮    ⋱   ⋮              ⋮    ⋮    ⋱   ⋮
      𝑏𝑚1  𝑏𝑚2  …  𝑏𝑚𝑛 )           𝑎1𝑚  𝑎2𝑚  …  𝑎𝑛𝑚 ) (2)
For example, if
𝐴 ≔ ( 2  −3  0
      1   6  5 ) (3)
then
𝐴^𝑇 = (  2  1
        −3  6
         0  5 ) (4)
If the transpose of a matrix is identical to the matrix, which can only be the case for square matrices,
because otherwise the size of the matrix changes, then the matrix is called “symmetric”. Covariance matrices
encountered later are typical examples of symmetric matrices, as are identity matrices. Finally note that the
transpose of a 1 × 1 matrix, i.e. a scalar, is just the same scalar again.
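Transposition and the symmetry property mentioned above can be illustrated numerically. The following sketch (Python/NumPy, for illustration) reproduces equations (3) and (4) and checks that a product of the form 𝐴𝐴^𝑇 is symmetric.

```python
import numpy as np

# Transposition, equations (3) and (4), and a symmetry check.
A = np.array([[2, -3, 0],
              [1,  6, 5]])

print(A.T)                       # [[ 2  1] [-3  6] [ 0  5]]

S = A @ A.T                      # products of this form are always symmetric
print(np.array_equal(S, S.T))    # True
```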
(8) Matrix Determinants
The theory of matrix determinants is quite involved and has deep connections to the theory of linear
systems of equations and vector space theory. Historically, determinants were used to determine whether
systems of linear equations of the form 𝐴𝑥 = 𝑏, where 𝐴 ∈ ℝ𝑛×𝑛, 𝑥 ∈ ℝ𝑛, 𝑏 ∈ ℝ𝑛 have a unique solution,
which may be determined based on the value of the determinant of 𝐴. Today, it is actually debatable
whether determinants are really of much value in linear algebra. Nevertheless, we will encounter a matrix
determinant in the context of multivariate Gaussian distributions, so we briefly review some special
determinants in this section.
Determinants are defined for square matrices, are scalars, and are usually written as det(𝐴) or, as in
these notes, as |𝐴|. For a 2 × 2 matrix 𝐴 ∈ ℝ^{2×2}, the determinant is defined as follows
|𝐴| = | ( 𝑎  𝑏
          𝑐  𝑑 ) | ≔ 𝑎𝑑 − 𝑏𝑐 (1)
For a 3 × 3 matrix 𝐴 ∈ ℝ^{3×3}, the determinant is defined as
|𝐴| = | ( 𝑎  𝑏  𝑐
          𝑑  𝑒  𝑓
          𝑔  ℎ  𝑖 ) | ≔ 𝑎𝑒𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑏𝑑𝑖 − 𝑎𝑓ℎ − 𝑐𝑒𝑔 (2)
which one may either memorize using Sarrus’ rule, or based on the determinant development
|𝐴| = 𝑎 | ( 𝑒  𝑓        − 𝑏 | ( 𝑑  𝑓        + 𝑐 | ( 𝑑  𝑒
            ℎ  𝑖 ) |            𝑔  𝑖 ) |            𝑔  ℎ ) | (3)
    = 𝑎𝑒𝑖 − 𝑎𝑓ℎ − 𝑏𝑑𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑐𝑒𝑔
    = 𝑎𝑒𝑖 + 𝑏𝑓𝑔 + 𝑐𝑑ℎ − 𝑏𝑑𝑖 − 𝑎𝑓ℎ − 𝑐𝑒𝑔
For $n > 3$, the determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is more complicated to evaluate. A general formula for matrix determinants exists (the "Leibniz formula"). However, understanding and using this formula requires some group theory, and especially the notion of permutation groups, which we do not require (and hence will not cover) in these notes. We merely state it here for completeness. For $A \in \mathbb{R}^{n \times n}$, the determinant is given by

$$|A| = \left| (a_{ij})_{1 \le i,j \le n} \right| = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma_i} \quad (4)$$

where $S_n$ denotes the symmetric group on $n$ elements, $\mathrm{sgn}(\sigma)$ the sign of an element $\sigma$ of the symmetric group, and $\sigma_i$ the value at the $i$th position upon application of $\sigma$.
More important than the Leibniz formula in the context of these notes is the determinant of a special matrix, namely a diagonal matrix with non-zero elements along its main diagonal and zeros everywhere else. For such matrices, the determinant is given by the product of the diagonal elements: let $A \in \mathbb{R}^{n \times n}$ be diagonal, i.e., $A$ is of the form

$$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{pmatrix} \in \mathbb{R}^{n \times n} \quad (5)$$

Then

$$|A| = \prod_{i=1}^{n} a_{ii} = a_{11} \cdot a_{22} \cdot \ldots \cdot a_{nn} \quad (6)$$
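The $2 \times 2$ formula (1) and the diagonal rule (6) can be verified against a numerical routine. A minimal sketch in Python with NumPy (illustrative only):

```python
# Determinant of a 2x2 matrix via |A| = ad - bc, and of a diagonal matrix
# via the product of its diagonal elements, cross-checked with numpy.linalg.det.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
det_formula = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]   # ad - bc

D = np.diag([2.0, 3.0, 4.0])       # diagonal 3x3 matrix
det_diag = np.prod(np.diag(D))     # product of the diagonal elements
```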
(9) Rank of a Matrix
The rank of a matrix is an important concept and intuitively serves as a measure of the "degenerateness" of the system of linear equations or the linear transformation encoded by a matrix. There are multiple ways to define the rank of a matrix, and all rely on additional theory with respect to the concepts covered thus far in this Section. Here, we introduce the rank of a matrix as the cardinality of the largest set of "linearly independent columns" of a matrix, which requires the notion of linearly independent columns, a concept from the theory of vector spaces.
Intuitively, the columns of a matrix 𝐴 ∈ ℝ𝑛×𝑛 can be conceived as “𝑛-dimensional vectors”, i.e., lists
of numbers with 𝑛 entries. For example, the matrix
$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 5 \\ 3 & 4 & 0 \end{pmatrix} \in \mathbb{R}^{3 \times 3} \quad (1)$$

can be conceived as the concatenation of the three 3-dimensional vectors (or matrices with a single column)

$$v_1 = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}, \; v_2 = \begin{pmatrix} 1 \\ 4 \\ 4 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 3 \\ 5 \\ 0 \end{pmatrix} \in \mathbb{R}^{3 \times 1} =: \mathbb{R}^3 \quad (2)$$
Now, let $v_1, v_2, \ldots, v_n \in \mathbb{R}^n$ denote a set of $n$-dimensional vectors with real entries and let $a_1, a_2, \ldots, a_n \in \mathbb{R}$ denote a set of real scalars. Using the definitions of matrix scalar multiplication and matrix addition, we may then evaluate the sum

$$v := a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = \sum_{i=1}^{n} a_i v_i \quad (3)$$
The resulting vector $v \in \mathbb{R}^n$ is referred to as a "linear combination" of the vectors $v_1, v_2, \ldots, v_n$. The vectors $v_1, v_2, \ldots, v_n$ are called linearly dependent, if there exist scalars $a_1, a_2, \ldots, a_n$, which are not all zero, such that

$$a_1 v_1 + a_2 v_2 + \cdots + a_n v_n = 0 \quad (4)$$

The linear combination with $a_1 = a_2 = \cdots = a_n = 0$ of a vector set $v_1, v_2, \ldots, v_n$, i.e.,

$$0 v_1 + 0 v_2 + \cdots + 0 v_n = 0 \quad (5)$$

is called the "trivial representation" of the zero vector $0 \in \mathbb{R}^n$. We may thus state that $v_1, v_2, \ldots, v_n$ are linearly dependent, if there exists a "non-trivial" representation of $0 \in \mathbb{R}^n$, and are linearly independent, if the only linear combination of $v_1, v_2, \ldots, v_n$ that results in the zero vector $0 \in \mathbb{R}^n$ is the trivial one.
Consider the following examples. Let
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \; v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \in \mathbb{R}^{3 \times 1} \quad (6)$$

Then $v_1, v_2, v_3$ are linearly dependent, because if we choose $a_1 = a_2 = 1$ and $a_3 = -1$, we can write the zero vector $0 \in \mathbb{R}^3$ using $a_i$'s ($i = 1,2,3$) that are not all zero:

$$1 \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 1 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} - 1 \cdot \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = 0 \quad (7)$$
On the other hand, let
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \; v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \text{ and } v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \in \mathbb{R}^{3 \times 1} \quad (8)$$

Then $v_1, v_2, v_3$ are linearly independent, which can be made intuitive by considering their linear combination

$$a_1 \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + a_2 \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + a_3 \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} a_1 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ a_2 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ a_3 \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \quad (9)$$
As soon as we choose any of the $a_1, a_2, a_3$ to be non-zero, this linear combination no longer results in the zero vector, and hence the only representation of the zero vector by means of the vectors defined in (8) is the trivial one. In general, it is not easy to determine whether a given set of vectors is linearly dependent or not, and many specialized approaches exist to evaluate this.
Returning to the notion of a matrix rank, the column rank of a matrix was introduced above as "the cardinality of the largest set of linearly independent columns (i.e., column vectors)" of a matrix. From the discussion above, we may now infer that the rank of the matrix

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{3 \times 3} \quad (10)$$
is 3. If the rank of a matrix equals its number of columns, the matrix is said to be of “full-rank”.
Importantly, it can be shown that if a square matrix is of full-rank, it can be inverted, while it cannot be inverted if it is not of full-rank. To show that this is actually the case requires additional theory, and the interested reader is referred to [Strang 2009] in this regard.
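Numerically, the rank is readily evaluated. A minimal sketch in Python with NumPy (illustrative only), using the two vector sets from (6) and (8) as matrix columns:

```python
# Rank as the number of linearly independent columns, evaluated with
# numpy.linalg.matrix_rank.
import numpy as np

# Columns v1 = (1,0,0)^T, v2 = (0,1,0)^T, v3 = (1,1,0)^T from (6):
# v3 = v1 + v2, so only two columns are linearly independent.
A_dependent = np.array([[1, 0, 1],
                        [0, 1, 1],
                        [0, 0, 0]])
rank_dependent = np.linalg.matrix_rank(A_dependent)

# The identity matrix from (10) has three independent columns: full rank.
I3 = np.eye(3)
rank_identity = np.linalg.matrix_rank(I3)
```

Consistent with the statement above, a full-rank square matrix is invertible; `np.linalg.inv(A_dependent)` would raise a `LinAlgError` here, while `np.linalg.inv(I3)` succeeds.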
(10) Matrix symmetry and positive-definiteness
A square matrix $A \in \mathbb{R}^{n \times n}$ is called symmetric, if it equals its transpose, i.e., if $A = A^T$. For example, the matrix

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 5 \\ 3 & 5 & 0 \end{pmatrix} \quad (1)$$

is symmetric, which can be checked by writing down its transpose. Square, symmetric matrices can have an additional property, referred to as positive-definiteness. The covariance matrix of a multivariate Gaussian distribution, which is an essential component of PMFN, has to be positive-definite, which motivates the introduction of this term in the current Section. As for the rank of a matrix, positive-definiteness can be approached from multiple perspectives, and it is usually not trivial to check whether a given matrix is in fact positive-definite. We here introduce a very basic notion of positive-definiteness that mainly serves to introduce the term itself.
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive-definite, if the scalar quantity $x^T A x > 0$ for all non-zero $x \in \mathbb{R}^n$, and is called positive semi-definite if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$.
As an example, we consider the real symmetric matrix

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \in \mathbb{R}^{2 \times 2} \quad (2)$$

If we consider a non-zero vector (or "matrix with a single column")

$$x := \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad (3)$$

we find that

$$x^T A x = \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 & x_1 + 2x_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (2x_1 + x_2)x_1 + (x_1 + 2x_2)x_2 \quad (4)$$

If we consider the sum on the right-hand side of (4) further, we see

$$(2x_1 + x_2)x_1 + (x_1 + 2x_2)x_2 = 2x_1^2 + x_2 x_1 + x_1 x_2 + 2x_2^2 \quad (5)$$
$$= 2x_1^2 + 2x_1 x_2 + 2x_2^2$$
$$= x_1^2 + 2x_1 x_2 + x_2^2 + x_1^2 + x_2^2$$
$$= (x_1 + x_2)^2 + x_1^2 + x_2^2$$

For any non-zero $x = (x_1, x_2)^T$ we have thus obtained a sum of squares, which cannot be zero. Hence, the matrix $A$ is positive-definite.
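In practice, positive-definiteness is usually checked numerically rather than by completing squares. A minimal sketch in Python with NumPy (illustrative only; it uses the standard eigenvalue criterion, which is not derived in these notes: a symmetric matrix is positive-definite exactly if all its eigenvalues are positive):

```python
# Positive-definiteness of the example matrix (2) via its eigenvalues,
# plus a direct spot-check of x^T A x > 0 for some non-zero vectors.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
is_positive_definite = bool(np.all(eigenvalues > 0))

def quadratic_form(x):
    return float(x @ A @ x)                  # x^T A x

values = [quadratic_form(np.array(v))
          for v in [(1.0, 0.0), (1.0, -1.0), (-2.0, 3.0)]]
```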
Study Questions

1. Let
$$A := \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \text{ and } B := \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}$$
Evaluate the following matrices: $C := A + B^T$, $D := A - B$, $E := AB$ and $F := BA$.
2. Let $X \in \mathbb{R}^{10 \times 3}$. What is the size of the matrix $Y := X^T X$?
3. Explain the concept of a matrix inverse.
4. Evaluate the inverse $A^{-1}$ for
$$A := \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix}$$
5. Evaluate the determinants of the matrices
$$M := \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} \text{ and } N := \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$$
6. Write down the definition of the rank of a matrix.
7. Write down the definition of a positive-definite matrix.

Study question answers
1. With the definitions of the question, we have

$$C = A + B^T = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 4 & 6 \end{pmatrix}$$

$$D = A - B = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} - \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 3 & 2 \end{pmatrix}$$

$$E = AB = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 5 \\ 3 & 11 \end{pmatrix}$$

$$F = BA = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 6 & 8 \end{pmatrix}$$
2. Using the "matrix size formula" $(m \times n) \cdot (n \times p) = m \times p$ for a matrix product $AB \in \mathbb{R}^{m \times p}$ of matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, we have with $X^T \in \mathbb{R}^{3 \times 10}$ that $(3 \times 10) \cdot (10 \times 3) = 3 \times 3$ and thus $Y = X^T X \in \mathbb{R}^{3 \times 3}$.
3. In analogy to the division of a number $x \in \mathbb{R}$ by itself, which yields 1, the inverse $A^{-1} \in \mathbb{R}^{n \times n}$ of a square matrix $A \in \mathbb{R}^{n \times n}$ is that matrix which, if (pre- or post-)multiplied by $A$, yields the identity matrix $I_n$: $A^{-1}A = AA^{-1} = I_n$.
4. The inverse of $A$ as defined in the question is given by

$$A^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix}$$

because

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -4 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \\ 0 & 0 & -\frac{1}{4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
5. For $2 \times 2$ matrices, the determinant is given by $|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} := ad - bc$. We thus have

$$|M| = 1 \cdot 1 - 2 \cdot 0 = 1 \quad \text{and} \quad |N| = 2 \cdot 1 - 1 \cdot 2 = 0$$
6. The rank of a matrix is the cardinality of the largest set of linearly independent columns of the matrix.
7. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive-definite, if $x^T A x > 0$ for all non-zero $x \in \mathbb{R}^n$.
Differential Calculus
Here we summarize some basic results from uni- and multivariate calculus. We eschew a discussion
of concepts from real analysis, such as continuity and differentiability of functions, which are assumed to
hold for the functions of interest covered in PMFN. Interested readers may find [Spivak 1994] a helpful
resource for further study.
(1) Intuition and definition of derivatives of univariate functions
We first concern ourselves with derivatives of univariate, real-valued functions, by which we
understand functions that map real, scalar numbers onto real, scalar numbers, in other words, functions 𝑓 of
the type
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
The derivative $f'(x_0) \in \mathbb{R}$ of a function $f$ at the location $x_0 \in \mathbb{R}$ conveys two basic and familiar intuitions:
(1) It is a measure of the rate of change of $f$ at location $x_0$.
(2) It is the slope of the tangent line of $f$ at the point $(x_0, f(x_0)) \in \mathbb{R}^2$.
Formally, this may be expressed using the “differential quotient” of 𝑓. The differential quotient (sometimes
also referred to as “Newton’s difference quotient”) expresses the difference between two values of the
function 𝑓(𝑥 + ℎ) and 𝑓(𝑥), where 𝑥, ℎ ∈ ℝ with respect to the difference between the two locations 𝑥 and
𝑥 + ℎ for ℎ approaching zero:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \quad (2)$$
While (2) represents the formal definition of the derivative of 𝑓 and forms the basis for proofs of the rules of
differentiation we will discuss in the following, its practical importance for our purposes is rather negligible.
Before discussing some basic rules of differentiation, we note that the derivative can either be
considered at a specific point 𝑥0, which is often denoted as
𝑓′(𝑥)|𝑥=𝑥0 (3)
or, if evaluated for all possible values, be considered as a function
𝑓′:ℝ → ℝ, 𝑥0 ↦ 𝑓′(𝑥0) ≔ 𝑓′(𝑥)|𝑥=𝑥0 (4)
Intuitively, (4) just means that the derivative of a differentiable function may be evaluated at any point of
the real line.
The derivative $f'$ of a function $f$ is also referred to as the "first-order" derivative of a function, where the "zeroth-order" derivative of a function just corresponds to the function itself. Higher-order derivatives (i.e., second-order, third-order, and so on) can be evaluated by recursively forming the derivative of the respective lower-order derivative. For example, the second-order derivative of a function corresponds to the (first-order) derivative of the (first-order) derivative of a function. To this end, the $\frac{d}{dx}$ operator notation for derivatives is useful. Intuitively, $\frac{d}{dx} f$ can be understood as the imperative to evaluate the derivative of $f$, or simply as an alternative notation for $f'$. By itself, $\frac{d}{dx}$ carries no meaning. We thus have for the first-order derivative

$$\frac{d}{dx} f(x) = f'(x) \quad (5)$$

and for the second-order derivative

$$\frac{d^2}{dx^2} f(x) = \frac{d}{dx}\left(\frac{d}{dx} f(x)\right) = f''(x) \quad (6)$$
Intuitively, the second-order derivative measures the rate of change of the first-order derivative in the
vicinity of 𝑥. If these first-order derivatives, which may be visualized as tangent lines, change relatively
quickly in the vicinity of 𝑥 , the second-order derivative is large, and the function is said to have a high
“curvature”.
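The limit definition (2) also suggests a numerical approximation: for a small but finite $h$, Newton's difference quotient is close to $f'(x)$. A minimal sketch in Python (illustrative only), using $f(x) = x^2$ with $f'(x) = 2x$:

```python
# The difference quotient (f(x+h) - f(x)) / h approaches f'(x) as h -> 0,
# illustrated for f(x) = x^2 at x = 1, where f'(1) = 2.
def f(x):
    return x ** 2

def difference_quotient(f, x, h):
    """Newton's difference quotient of f at location x with step size h."""
    return (f(x + h) - f(x)) / h

# Shrinking h improves the approximation of f'(1) = 2.
approximations = [difference_quotient(f, 1.0, h) for h in (1e-1, 1e-3, 1e-6)]
```

For this particular $f$, the quotient equals $2x + h$ exactly, so the approximation error shrinks linearly with $h$.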
(2) Derivatives of important functions
In this section we collect without proof the derivatives of a number of commonly encountered
functions.
Constant function
The derivative of any constant function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑎 (𝑎 ∈ ℝ) (1)
is zero:
𝑓′: ℝ → ℝ, 𝑥 ↦ 𝑓′(𝑥) = 0 (2)
For example, the derivative of 𝑓(𝑥) ≔ 2 is 𝑓′(𝑥) = 0.
Single-term polynomials
Let $f$ be a "single-term polynomial function" of the form

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := a x^b \quad (a, b \in \mathbb{R}) \quad (3)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) = ab\, x^{b-1} \quad (4)$$
For example, the derivative of $f(x) := 2x^3$ is $f'(x) = 6x^2$, and the derivative of $g(x) := \sqrt{x} = x^{1/2}$ is

$$g'(x) = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}$$
Sine and cosine

Let $f$ be the sine function

$$f: \mathbb{R} \to [-1,1], \; x \mapsto f(x) := \sin(x) \quad (5)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to [-1,1], \; x \mapsto f'(x) = \cos(x) \quad (6)$$

Further, let $g$ be the cosine function

$$g: \mathbb{R} \to [-1,1], \; x \mapsto g(x) := \cos(x) \quad (7)$$

Then the derivative of $g$ is given by

$$g': \mathbb{R} \to [-1,1], \; x \mapsto g'(x) = -\sin(x) \quad (8)$$
Exponential and logarithm

Let $f$ be the exponential function

$$f: \mathbb{R} \to \mathbb{R}^+, \; x \mapsto f(x) := \exp(x) \quad (9)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}^+, \; x \mapsto f'(x) = \exp(x) \quad (10)$$

Further, let $g$ be the natural logarithm

$$g: \mathbb{R}^+ \to \mathbb{R}, \; x \mapsto g(x) := \ln(x) \quad (11)$$

Then the derivative of $g$ is given by

$$g': \mathbb{R}^+ \to \mathbb{R}, \; x \mapsto g'(x) = \frac{1}{x} \quad (12)$$
(3) Rules of differentiation
In the following, we state important rules for computing derivatives of univariate functions without proof. For a formal derivation of these rules, the interested reader may consult [Spivak 1994].
Summation rule
Let
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \sum_{i=1}^{n} g_i(x) \quad (13)$$

be the sum of $n$ arbitrary functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2,\ldots,n$). Then the derivative of $f$ is given by the sum of the derivatives of the $g_i$:

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \sum_{i=1}^{n} g_i'(x) \quad (14)$$

For example, the derivative of $f(x) = x^2 + 2x$ with $g_1(x) := x^2$ and $g_2(x) := 2x$ is

$$f'(x) = g_1'(x) + g_2'(x) = 2x + 2 \quad (15)$$
47
Chain Rule
Let $h$ be the concatenation of two functions $f: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}$, i.e.
ℎ:ℝ → ℝ, 𝑥 ↦ 𝑔(𝑓(𝑥)) (16)
Then the derivative of ℎ is given by
ℎ′: ℝ → ℝ, 𝑥 ↦ ℎ′(𝑥) ≔ 𝑔′(𝑓(𝑥))𝑓′(𝑥) (17)
In words: the derivative of a function that can be written as the concatenation of a first function 𝑓 with a
second function 𝑔 is given by the derivative of the second function 𝑔 “at the location of the function 𝑓”
multiplied with the derivative of the first function. For example, the derivative of ℎ(𝑥) ≔ exp(sin(𝑥)),
which can be written as the concatenation of a function 𝑓(𝑥) ≔ exp(𝑥) with derivative 𝑓′(𝑥) = exp(𝑥)
and a function 𝑔(𝑥) ≔ sin(𝑥) with derivative 𝑔′(𝑥) = cos(𝑥) is given by ℎ′(𝑥) = exp(sin(𝑥)) cos(𝑥).
Product Rule
Let $f$ be the product of two functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2$), i.e.

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := g_1(x) g_2(x) \quad (18)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := g_1'(x) g_2(x) + g_1(x) g_2'(x) \quad (19)$$

where $g_1'$ and $g_2'$ denote the derivatives of $g_1$ and $g_2$, respectively. In words: if a function can be written as the product of a first and a second function, its derivative corresponds to the product of the derivative of the first function with the second function "plus" the product of the first function with the derivative of the second function. For example, the derivative of $f(x) := x^2 \exp(x)$ can be found by writing $f$ as $g_1 \cdot g_2$ with $g_1(x) := x^2$ and $g_2(x) := \exp(x)$ with derivatives $g_1'(x) = 2x$ and $g_2'(x) = \exp(x)$, respectively. This then yields for the derivative of $f$ that $f'(x) = 2x \exp(x) + x^2 \exp(x)$.
Quotient Rule
Let $f$ be the quotient of two functions $g_i: \mathbb{R} \to \mathbb{R}$ ($i = 1,2$), i.e.

$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \frac{g_1(x)}{g_2(x)} \quad (20)$$

Then the derivative of $f$ is given by

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \frac{g_1'(x) g_2(x) - g_1(x) g_2'(x)}{g_2^2(x)} \quad (21)$$

In words: the derivative of a function that can be written as the quotient of a first function in the numerator and a second function in the denominator is given by the difference of the product of the derivative of the first function with the second function and the product of the first function with the derivative of the second function, divided by the square of the second function, i.e., the function in the denominator of the original function. For example, the derivative of $f(x) := \frac{\sin(x)}{x^2+1}$ can be evaluated by considering $g_1(x) := \sin(x)$ with derivative $g_1'(x) = \cos(x)$ and $g_2(x) := x^2 + 1$ with derivative $g_2'(x) = 2x$ to yield

$$f'(x) = \frac{\cos(x) \cdot (x^2+1) - \sin(x) \cdot 2x}{(x^2+1)^2}$$
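These rules can be spot-checked numerically against a central difference approximation of the derivative. A minimal sketch in Python (illustrative only), using the three worked examples above:

```python
# Numerical check of the product, chain, and quotient rule examples via
# the central difference (f(x+h) - f(x-h)) / (2h) ~ f'(x).
import math

def numeric_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7

# Product rule: f(x) = x^2 exp(x), f'(x) = 2x exp(x) + x^2 exp(x)
product_exact = 2 * x * math.exp(x) + x ** 2 * math.exp(x)
product_numeric = numeric_derivative(lambda t: t ** 2 * math.exp(t), x)

# Chain rule: h(x) = exp(sin(x)), h'(x) = exp(sin(x)) cos(x)
chain_exact = math.exp(math.sin(x)) * math.cos(x)
chain_numeric = numeric_derivative(lambda t: math.exp(math.sin(t)), x)

# Quotient rule: f(x) = sin(x) / (x^2 + 1)
quotient_exact = (math.cos(x) * (x ** 2 + 1) - math.sin(x) * 2 * x) / (x ** 2 + 1) ** 2
quotient_numeric = numeric_derivative(lambda t: math.sin(t) / (t ** 2 + 1), x)
```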
(4) Analytical Optimization
First- and second-order derivatives can be used to find (local) maxima and minima of functions.
Finding maxima and minima of functions is a fundamental aspect of applied mathematics and is referred to
in general as “optimization”. It is helpful to distinguish clearly between two aspects of optimization: on the
one hand, when finding a maximum or a minimum, one finds a value $f(x)$ of the function, say $f: D \to R$, in its range $R$ for which the condition $f(x) \ge f(\tilde{x})$ or $f(x) \le f(\tilde{x})$, respectively, holds for all $\tilde{x}$ at least in the vicinity of $x \in D$.
These values in the range of 𝑓 are called maxima or minima, and are sometimes abbreviated by
“max𝑥∈𝐷 𝑓 (𝑥)” and “min𝑥∈𝐷 𝑓 (𝑥)”. On the other hand, one simultaneously finds those points 𝑥 in the
domain of 𝑓, for which 𝑓(𝑥) assumes a maximum or minimum. These points, which are often more
interesting than the corresponding values 𝑓(𝑥) themselves, are referred to as “extremal points” and
sometimes abbreviated by “argmax𝑥∈𝐷 𝑓 (𝑥)” or “argmin𝑥∈𝐷 𝑓 (𝑥)” for extremal points that correspond
to maxima and minima of 𝑓, respectively. Note the difference between max𝑥∈𝐷 𝑓 (𝑥) and argmax𝑥∈𝐷 𝑓 (𝑥):
the former refers to a point (or a set of points) in the range of 𝑓, the latter to a point (or a set of points) in
the domain of 𝑓.
When using first- and second-order derivatives to find extremal points and their corresponding
maxima or minima, it is helpful to distinguish (a) “necessary” and (b) “sufficient” conditions for extremal
points.
Necessary condition for an extremum
The necessary condition for an extremum of a function 𝑓:ℝ → ℝ (i.e., a maximum or minimum) at a
point 𝑥 ∈ 𝐷 is that the first-derivative “vanishes”, or more precisely, is equal to zero: 𝑓′(𝑥) = 0. Intuitively,
this can be made transparent by considering a maximum of 𝑓 at a point 𝑥𝑚𝑎𝑥: for all values of 𝑥 which are
smaller than 𝑥𝑚𝑎𝑥, the derivative is positive, because the function is increasing, leading to the maximum. For
the values of 𝑥 which are larger than 𝑥𝑚𝑎𝑥, the derivative is negative, because the function is decreasing. At
the location of the maximum, the function is neither increasing nor decreasing, and thus 𝑓′(𝑥) = 0. The
reverse is true for a minimum of $f$ at a point $x_{min}$: for all values of $x$ which are smaller than $x_{min}$, the
derivative is negative, because the function is decreasing towards the minimum. For the values of 𝑥 which
are larger than 𝑥𝑚𝑖𝑛, the derivative is positive, because the function is increasing again and recovering from
the minimum. Again, at the location of the minimum, the function is neither increasing nor decreasing, and
thus 𝑓′(𝑥) = 0.
Because in both cases $f'(x) = 0$, one cannot decide, based on finding a point $x^*$ for which $f'(x^*) = 0$ holds, whether $x^*$ corresponds to a maximum or a minimum. On the other hand, if a minimum or maximum exists at a point $x^*$, it necessarily follows that $f'(x^*) = 0$, hence the nomenclature "necessary condition". In fact, there is a third possibility for points at which $f'(x^*) = 0$: the function may be increasing both for $x < x^*$ and for $x > x^*$, or decreasing on both sides (consider, e.g., $f(x) = x^3$ at $x^* = 0$). In this case, there is neither a maximum nor a minimum at $x^*$, but what is referred to as a "saddle point".
49
Sufficient condition for an extremum
The second-order derivative $f''(x)$, which intuitively refers to "the slope of the tangent line of the slope of the tangent line of $f: \mathbb{R} \to \mathbb{R}$ at $x$", allows one to test whether a critical point $x^*$ for an extremum (i.e., a point for which $f'(x^*) = 0$) is a maximum or a minimum. In brief, if $f''(x^*) < 0$, there is a maximum at $x^*$, and if $f''(x^*) > 0$, there is a minimum at $x^*$. If $f''(x^*) = 0$, the test is inconclusive: $x^*$ may be a maximum, a minimum, or a saddle point (consider, e.g., $f(x) = x^4$, which has a minimum at $0$ with $f''(0) = 0$).
Together with the condition 𝑓′(𝑥∗) = 0, these conditions are referred to as sufficient conditions for an
extremum. This can be made intuitive by considering a maximum at 𝑥∗. For points 𝑥 < 𝑥∗, the slope of the
tangent line at 𝑓(𝑥) must be positive, because 𝑓 is increasing towards 𝑥∗. Likewise for points 𝑥 > 𝑥∗, the
slope of the tangent line at 𝑓(𝑥) must be negative, because 𝑓 is decreasing after assuming its maximum in
$x^*$. In other words, $f'(x) > 0$ (positive) for $x < x^*$, $f'(x) < 0$ (negative) for $x > x^*$, and $f'(x^*) = 0$.
We now consider the change of 𝑓′, i.e. 𝑓′′: In the region around the maximum, 𝑓′ decreases from a positive
value, to zero, to a negative value, as just stated. Because 𝑓′(𝑥) is positive just before (to the left of) 𝑥∗ and
negative just after (i.e. to the right of) 𝑥∗, it obviously decreases from just before to just after 𝑥∗ . But this
means that its own rate of change, 𝑓′′, is negative in 𝑥∗. The reverse obviously holds for a minimum of 𝑓 in
𝑥∗. Figure 1 visualizes the behavior of 𝑓, 𝑓′ and 𝑓′′ for three different functions in the neighborhood of a
maximum or minimum of 𝑓.
To recapitulate, we have established the following conditions that use the derivatives of a function $f: \mathbb{R} \to \mathbb{R}$ to determine its extrema:
Necessary condition
If there is a maximum or minimum of $f: \mathbb{R} \to \mathbb{R}$ at $x^*$, then $f'(x^*) = 0 \quad (1)$
Sufficient conditions for an extremum
If for 𝑓:ℝ → ℝ 𝑓′(𝑥∗) = 0 and 𝑓′′(𝑥∗) > 0, then there is a minimum at 𝑥∗ (2)
If for 𝑓:ℝ → ℝ 𝑓′(𝑥∗) = 0 and 𝑓′′(𝑥∗) < 0, then there is a maximum at 𝑥∗ (3)
In the following, we discuss three examples in which we use the conditions above to determine the location
of extremal points (see Figure 1).
Figure 1. Analytical optimization of simple functions. For a detailed discussion, please refer to the main text.
Example 1
Consider the function (Figure 1, left panel, blue curve)
$$f: [0, \pi] \to [0,1], \; x \mapsto \sin(x) \quad (4)$$

The first derivative of $f$ is given by (Figure 1, left panel, red curve)

$$f': [0, \pi] \to [-1,1], \; x \mapsto \frac{d}{dx} \sin(x) = \cos(x) \quad (5)$$

and the cosine function assumes a zero at $\frac{\pi}{2}$ in the interval $[0, \pi]$. We thus have the critical point $x^* = \pi/2$ for an extremum. The second derivative of $f$ is given by (Figure 1, left panel, red dashed curve)

$$f'': [0, \pi] \to [-1,0], \; x \mapsto \frac{d}{dx} \cos(x) = -\sin(x) \quad (6)$$

and because $-\sin(\pi/2) = -1 < 0$, we can conclude that there is a maximum of $f$ at $x = \pi/2$. Of course, this is also obvious from the graph of $f$.
Example 2
Consider the function (Figure 1, middle panel, blue curve)
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto (x-1)^2 \quad (7)$$

The first derivative of $f$ is given by (Figure 1, middle panel, red curve)

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}\left((x-1)^2\right) = 2x - 2 \quad (8)$$

Setting this derivative to zero and solving for $x$ yields

$$2x - 2 = 0 \Leftrightarrow 2x = 2 \Leftrightarrow x = 1 \quad (9)$$

We thus have the critical point $x^* = 1$ for an extremum. The second derivative of $f$ is given by (Figure 1, middle panel, red dashed curve)

$$f'': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(2x - 2) = 2 \quad (10)$$

The second derivative is thus a constant function, and $f''(x^*) = 2 > 0$. We thus conclude that there is a minimum of $f$ at $x = 1$. Again, this is also obvious from the graph of $f$.
Example 3
Consider the function (Figure 1, right panel, blue curve)
$$f: \mathbb{R} \to \mathbb{R}, \; x \mapsto -x^2 \quad (11)$$

The first derivative of $f$ is given by (Figure 1, right panel, red curve)

$$f': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-x^2) = -2x \quad (12)$$

Setting this derivative to zero and solving for $x$ yields

$$-2x = 0 \Leftrightarrow x = 0 \quad (13)$$

We thus have the critical point $x^* = 0$ for an extremum. The second derivative of $f$ is given by (Figure 1, right panel, red dashed curve)

$$f'': \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-2x) = -2 \quad (14)$$

The second derivative is thus a constant function, and $f''(x^*) = -2 < 0$. We thus conclude that there is a maximum of $f$ at $x = 0$.
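The necessary and sufficient conditions can also be checked numerically when analytical derivatives are unavailable. A minimal sketch in Python (illustrative only), applying finite-difference approximations of $f'$ and $f''$ to Example 2, $f(x) = (x-1)^2$:

```python
# Finite-difference check of the extremum conditions for f(x) = (x-1)^2:
# the critical point x* = 1 satisfies f'(x*) = 0 and f''(x*) = 2 > 0.
def f(x):
    return (x - 1.0) ** 2

def first_derivative(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)             # ~ f'(x) = 2x - 2

def second_derivative(x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2   # ~ f''(x) = 2

x_star = 1.0                                   # critical point found in (9)
necessary_holds = abs(first_derivative(x_star)) < 1e-8
is_minimum = second_derivative(x_star) > 0     # sufficient condition (2)
```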
(5) Multivariate, real-valued functions and partial derivatives
So far, we have considered functions of the form 𝑓 ∶ ℝ → ℝ, which map single numbers (scalars)
𝑥 ∈ ℝ onto single numbers (scalars) 𝑓(𝑥) ∈ ℝ. Another function type that is commonly encountered in
PMFN are functions of the form
$$f: \mathbb{R}^n \to \mathbb{R}, \; x \mapsto f(x), \quad \text{where } x := \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n \text{ are "$n$-tuples" or "$n$-dimensional vectors"} \quad (1)$$
We will call 𝑥 as defined above a “column vector”, and, if we write the vector as a “row vector” (𝑥1, … , 𝑥𝑛)
call this the “transpose” of 𝑥, denoted by a “𝑇”-superscript, 𝑥 = (𝑥1, … , 𝑥𝑛)𝑇. By default, all vectors we
consider are column vectors. In physics, functions of the type (1) are referred to as “scalar fields”, because
they allocate scalars 𝑓(𝑥) ∈ ℝ to “points in (𝑛-dimensional) space” (𝑥1, … , 𝑥𝑛)𝑇. Because the functional
value $f(x)$ is a real number, these kinds of functions are also called "real-valued" functions, in
contradistinction to “vector-valued functions” discussed in a later section. An example for a function of the
type (1) is the following
$$f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) = f\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := x_1^2 + x_2^2 \quad (2)$$

which is visualized in two different ways in the upper panels of Figure 2. Another example is the function
$$g: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) = g\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \quad (3)$$

which is visualized in two different ways in the lower panels of Figure 2. Note that functions defined on spaces $\mathbb{R}^n$ with $n > 2$ are not easily visualized.
Just as for univariate, real-valued functions, one can ask how much a change in the input argument
at a specific point in ℝ𝑛 of a multivariate, real-valued function affects the value of that function. If one asks
this question for each of the subcomponents $x_i$ ($i = 1,\ldots,n$) of $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ independently of the
remaining 𝑛 − 1 subcomponents, one is led to the concept of a “partial derivative”: the partial derivative of
a multivariate, real-valued function 𝑓 ∶ ℝ𝑛 → ℝ with respect to a variable 𝑥𝑖 (𝑖 = 1,… , 𝑛) captures how
much the function value changes “in the direction” of 𝑥𝑖, i.e., in the cross-section through the space ℝ𝑛
defined by the variable of interest. Stated differently, the partial derivative of a function 𝑓 ∶ ℝ𝑛 → ℝ in a
point 𝑥 ∈ ℝ𝑛 with respect to a variable 𝑥𝑖 (𝑖 = 1,… , 𝑛) is the derivative of the function 𝑓 with respect to 𝑥𝑖
while all other variables $x_j$ ($j \in \{1,2,\ldots,n\}, j \ne i$) are held constant. The partial derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ at a point $x \in \mathbb{R}^n$ with respect to a variable $x_i$ is denoted by $\frac{\partial}{\partial x_i} f(x)$, where the "$\partial$" symbol is used to distinguish the notion of a partial derivative from a standard derivative. This notation is somewhat redundant, because the subscript $i$ on the $x$ in $\frac{\partial}{\partial x_i}$ already makes it clear that the derivative is with respect to $x_i$ only; however, it is commonly used, and if the subcomponents of $x$ are not denoted by $x_1, \ldots, x_n$, but by, say, $a := x_1, b := x_2, \ldots, e := x_n$, it is, in fact, helpful.
Because, as for the derivative of a univariate, real-valued function, one may evaluate the partial derivative for all $x \in \mathbb{R}^n$, one can also view the partial derivative of a multivariate, real-valued function as a function:

$$\frac{\partial}{\partial x_i} f: \mathbb{R}^n \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_i} f(x) \quad (6)$$
Figure 2. Visualization of multivariate (here: bivariate), real-valued functions. Real-valued functions of multiple variables are routinely visualized in a three-dimensional way, as in the left panels of the Figure. Note that although this is a 3D plot, the function is bivariate, i.e., a function of two variables. The same information can be conveyed by using "isocontour" plots, which visualize the "isocontours" of functions in a two-dimensional way. Isocontours are the lines along which the function assumes equal values in its range. Usually, isocontour plots suffice to convey all relevant information about a bivariate function.
Examples
We first consider the example
$$f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) = f\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := x_1^2 + x_2^2 \quad (7)$$

Because this function has a 2-dimensional domain, one can evaluate 2 different partial derivatives, namely

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) \quad \text{and} \quad \frac{\partial}{\partial x_2} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) \quad (8)$$
To evaluate the partial derivative $\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}$, one considers the function

$$f_{x_2}: \mathbb{R} \to \mathbb{R}, \; x_1 \mapsto f_{x_2}(x_1) := x_1^2 + x_2^2 \quad (9)$$

where $x_2$ assumes the role of a constant. To indicate that $x_2$ is no longer an input argument of the function, but the function is still dependent on the constant $x_2$, we have used the subscript notation "$f_{x_2}(x_1)$". To evaluate the partial derivative $\frac{\partial}{\partial x_1} f$, one evaluates the standard (univariate) derivative of $f_{x_2}$:

$$f_{x_2}'(x_1) = 2x_1 \quad (10)$$

We thus have

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) = \frac{\partial}{\partial x_1}\left(x_1^2 + x_2^2\right) = f_{x_2}'(x_1) = 2x_1 \quad (11)$$

and, accordingly, with the corresponding definition of $f_{x_1}$:

$$\frac{\partial}{\partial x_2} f: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) = \frac{\partial}{\partial x_2}\left(x_1^2 + x_2^2\right) = f_{x_1}'(x_2) = 2x_2 \quad (12)$$
We next consider the example

$$g: \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) = g\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \quad (13)$$

Again, there are two partial derivatives $\frac{\partial}{\partial x_1} g$ and $\frac{\partial}{\partial x_2} g$. Using the chain rule of (univariate) differentiation and the logic of treating the variable(s) with respect to which the derivative is not performed as constants in standard univariate differentiation, we obtain, without making the function properties of the partial derivatives explicit,

$$\frac{\partial}{\partial x_1} g(x) = \frac{\partial}{\partial x_1} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \cdot \frac{\partial}{\partial x_1}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_1 - 1) \quad (14)$$

and

$$\frac{\partial}{\partial x_2} g(x) = \frac{\partial}{\partial x_2} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \cdot \frac{\partial}{\partial x_2}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)$$
$$= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_2 - 1) \quad (15)$$
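The analytical results (14) and (15) can be spot-checked by finite differences, holding the other variable constant. A minimal sketch in Python (illustrative only):

```python
# Central-difference check of the partial derivatives (14) and (15) of
# g(x) = exp(-((x1-1)^2 + (x2-1)^2)/2) at an arbitrary point.
import math

def g(x1, x2):
    return math.exp(-0.5 * ((x1 - 1) ** 2 + (x2 - 1) ** 2))

def dg_dx1(x1, x2):               # analytical result (14)
    return -g(x1, x2) * (x1 - 1)

def dg_dx2(x1, x2):               # analytical result (15)
    return -g(x1, x2) * (x2 - 1)

h = 1e-6
x1, x2 = 0.3, 1.7
numeric_dx1 = (g(x1 + h, x2) - g(x1 - h, x2)) / (2 * h)   # x2 held constant
numeric_dx2 = (g(x1, x2 + h) - g(x1, x2 - h)) / (2 * h)   # x1 held constant
```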
(6) Higher-order partial derivatives¹
Like for the standard derivative of a univariate real-valued function 𝑓 ∶ ℝ → ℝ, higher-order partial
derivatives can be formulated and evaluated by taking partial derivatives of partial derivatives. Because
multivariate real-valued functions of the form 𝑓 ∶ ℝ𝑛 → ℝ are functions of multiple input arguments, more
possibilities exist for higher-order derivatives compared to the univariate case. For example, given the partial
derivative $\frac{\partial}{\partial x_1} f$ of a function $f: \mathbb{R}^3 \to \mathbb{R}$, one may next form the partial derivative again with respect to $x_1$, yielding the second-order partial derivative $\frac{\partial^2}{\partial x_1^2} f$, which is equivalent to the second-order derivative of a univariate function. However, one may also next form the partial derivative with respect to $x_2$, $\frac{\partial^2}{\partial x_2 \partial x_1} f$, or with respect to $x_3$, $\frac{\partial^2}{\partial x_3 \partial x_1} f$. Note that the numerator of the partial derivative sign increases its power with the order of the derivative, and the denominator denotes the variables with respect to which the derivative is taken. If the derivative is taken multiple times with respect to the same variable, the variable in the denominator is notated with the corresponding power. Again, note that these are mere conventions to signal the form of the partial derivative; the symbols themselves do not have any meaning besides the implicit encouragement to the reader to evaluate the corresponding partial derivative.
To clarify the notation, we evaluate the first and second-order partial derivatives of the function
$$f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto f(x) := x_1^2 + x_1 x_2 + x_2 \sqrt{x_3} \quad (1)$$
We have for the first-order derivatives

$$\frac{\partial}{\partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) = \frac{\partial}{\partial x_1}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = 2x_1 + x_2 \quad (2)$$

$$\frac{\partial}{\partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) = \frac{\partial}{\partial x_2}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = x_1 + \sqrt{x_3} \quad (3)$$

$$\frac{\partial}{\partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_3} f(x) = \frac{\partial}{\partial x_3}\left(x_1^2 + x_1 x_2 + x_2 \sqrt{x_3}\right) = x_2 \cdot \frac{1}{2} x_3^{-1/2} = \frac{x_2}{2\sqrt{x_3}}, \quad (4)$$
for the second-order derivatives with respect to $x_1$

$$\frac{\partial^2}{\partial x_1^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1^2} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_1}(2x_1 + x_2) = 2 \quad (5)$$

$$\frac{\partial^2}{\partial x_2 \partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2 \partial x_1} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_2}(2x_1 + x_2) = 1 \quad (6)$$

$$\frac{\partial^2}{\partial x_3 \partial x_1} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3 \partial x_1} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_3}(2x_1 + x_2) = 0, \quad (7)$$
¹ This section requires some familiarity with matrix notation, as covered in the Section "Matrix Algebra".
for the second-order derivatives with respect to $x_2$

$$\frac{\partial^2}{\partial x_1 \partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1 \partial x_2} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_1}\left(x_1 + \sqrt{x_3}\right) = 1 \quad (8)$$

$$\frac{\partial^2}{\partial x_2^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2^2} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_2}\left(x_1 + \sqrt{x_3}\right) = 0 \quad (9)$$

$$\frac{\partial^2}{\partial x_3 \partial x_2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3 \partial x_2} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_3}\left(x_1 + \sqrt{x_3}\right) = \frac{1}{2} x_3^{-1/2} = \frac{1}{2\sqrt{x_3}} \quad (10)$$
and for the second-order derivatives with respect to $x_3$

$$\frac{\partial^2}{\partial x_1 \partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_1 \partial x_3} f(x) = \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_1}\left(\frac{x_2}{2\sqrt{x_3}}\right) = 0 \quad (11)$$

$$\frac{\partial^2}{\partial x_2 \partial x_3} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_2 \partial x_3} f(x) = \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_2}\left(\frac{x_2}{2\sqrt{x_3}}\right) = \frac{1}{2\sqrt{x_3}} \quad (12)$$

$$\frac{\partial^2}{\partial x_3^2} f: \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto \frac{\partial^2}{\partial x_3^2} f(x) = \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_3} f(x)\right) = \frac{\partial}{\partial x_3}\left(x_2 \cdot \frac{1}{2} x_3^{-1/2}\right) = -\frac{1}{2} \cdot \frac{1}{2} x_2 x_3^{-3/2} = -\frac{1}{4} x_2 x_3^{-3/2} \quad (13)$$
Note from the above that it does not matter in which order the second derivatives are taken, as

$$\frac{\partial^2}{\partial x_1 \partial x_2} f(x) = \frac{\partial^2}{\partial x_2 \partial x_1} f(x) = 1, \quad \frac{\partial^2}{\partial x_1 \partial x_3} f(x) = \frac{\partial^2}{\partial x_3 \partial x_1} f(x) = 0 \quad \text{and} \quad \frac{\partial^2}{\partial x_2 \partial x_3} f(x) = \frac{\partial^2}{\partial x_3 \partial x_2} f(x) = \frac{1}{2\sqrt{x_3}} \quad (14)$$
This is a general property of partial derivatives known as "Schwarz' Theorem", which we state here without proof: for a multivariate real-valued function f : ℝⁿ → ℝ, x ↦ f(x), the following identity holds

$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x) \quad \text{for } i, j \in \{1, 2, \ldots, n\} \qquad (15)$$

Schwarz' Theorem is helpful when evaluating partial derivatives: on the one hand, one can save some work by relying on it; on the other hand, it can help to validate analytical results, because if the identity fails for certain second-order partial derivatives, something must have gone wrong in the calculation.
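Schwarz' Theorem can also serve as a numerical sanity check. The following sketch (an illustration added here, not part of the original text) approximates the mixed second-order partial derivatives of the example function f(x) = x₁² + x₁x₂ + x₂√x₃ by central differences and compares them with the analytic values from equations (6) and (10):

```python
import math

def f(x):
    # the example function f(x) = x1^2 + x1*x2 + x2*sqrt(x3)
    x1, x2, x3 = x
    return x1**2 + x1*x2 + x2*math.sqrt(x3)

def mixed_partial(f, x, i, j, h=1e-4):
    """Central-difference estimate of d^2 f / (dx_i dx_j) at x, for i != j."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

x = [1.0, 2.0, 4.0]
d21 = mixed_partial(f, x, 1, 0)  # d^2 f / (dx2 dx1), analytically 1
d12 = mixed_partial(f, x, 0, 1)  # d^2 f / (dx1 dx2), analytically 1
d32 = mixed_partial(f, x, 2, 1)  # d^2 f / (dx3 dx2), analytically 1/(2*sqrt(4)) = 0.25
d23 = mixed_partial(f, x, 1, 2)
print(d21, d12, d32, d23)
```

Both orders of differentiation agree up to rounding error, exactly as Schwarz' Theorem predicts.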
(7) Gradient, Hessian, and Jacobian
First- and second-order partial derivatives of multivariate real-valued functions can be summarized
in two entities known as the “gradient” and the “Hessian” or “Hessian matrix” of the function 𝑓 . The
gradient of a function
𝑓:ℝ𝑛 → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
at a location 𝑥 ∈ ℝ𝑛 is defined as the 𝑛-dimensional vector of the function’s partial derivatives evaluated at
this location and denoted by the ∇ (nabla) sign:
$$\nabla f : \mathbb{R}^n \to \mathbb{R}^n,\; x \mapsto \nabla f(x) := \begin{pmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \vdots \\ \frac{\partial}{\partial x_n} f(x) \end{pmatrix} \in \mathbb{R}^n \qquad (2)$$
Intuitively, the gradient evaluated at x ∈ ℝⁿ is a vector that points in the direction of the greatest rate of increase (steepest ascent) of the function in its domain space ℝⁿ. That this is in fact the case is not easy to prove, so we omit the proof here and content ourselves with the intuition. Note that the gradient is a vector-valued function: it takes a vector x ∈ ℝⁿ as input and returns a vector ∇f(x) ∈ ℝⁿ.
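To make the gradient concrete, the following sketch (added here for illustration) approximates ∇f(x) of the example function from the previous section by central differences and compares it with the analytic partial derivatives (2)–(4):

```python
import math

def f(x):
    # the example function f(x) = x1^2 + x1*x2 + x2*sqrt(x3)
    x1, x2, x3 = x
    return x1**2 + x1*x2 + x2*math.sqrt(x3)

def gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [1.0, 2.0, 4.0]
g = gradient(f, x)
# analytic gradient: (2*x1 + x2, x1 + sqrt(x3), x2/(2*sqrt(x3))) = (4, 3, 0.5)
print(g)
```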
Second-order partial derivatives are summarized in the so-called Hessian matrix of a multivariate real-valued function
𝑓:ℝ𝑛 → ℝ, 𝑥 ↦ 𝑓(𝑥) (3)
which is denoted by
$$Hf : \mathbb{R}^n \to \mathbb{R}^{n \times n},\; x \mapsto Hf(x) := \begin{pmatrix} \frac{\partial^2}{\partial x_1^2} f(x) & \frac{\partial^2}{\partial x_1 \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(x) \\ \frac{\partial^2}{\partial x_2 \partial x_1} f(x) & \frac{\partial^2}{\partial x_2^2} f(x) & \cdots & \frac{\partial^2}{\partial x_2 \partial x_n} f(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n \partial x_1} f(x) & \frac{\partial^2}{\partial x_n \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_n^2} f(x) \end{pmatrix} =: \left(\frac{\partial^2}{\partial x_i \partial x_j} f(x)\right)_{i,j = 1, \ldots, n} \qquad (4)$$
Note that in each row of the Hessian, the second (in the order of differentiation, not in the order of
notation) of the two partial derivatives is constant, while the first varies from 1 to 𝑛 over columns, and the
reverse is true for each column. The Hessian matrix can be viewed as a matrix-valued function: it takes a
vector 𝑥 ∈ ℝ𝑛 as input and returns an 𝑛 × 𝑛 matrix 𝐻𝑓(𝑥) ∈ ℝ𝑛×𝑛. Also note that due to Schwarz’
Theorem, the Hessian matrix is symmetric, i.e.,
$$\left(Hf(x)\right)^T = Hf(x) \qquad (5)$$
Functions of the form f : ℝⁿ → ℝ map vectors x ∈ ℝⁿ onto single numbers (scalars) f(x) ∈ ℝ. A further function type commonly encountered in applied mathematics is that of functions which map vectors onto vectors. These functions are of the general form

$$f : \mathbb{R}^n \to \mathbb{R}^m,\; x \mapsto f(x) := \left(f_1(x), \ldots, f_m(x)\right)^T = \begin{pmatrix} f_1(x_1, \ldots, x_n) \\ f_2(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{pmatrix} \qquad (6)$$
and, in physics, are referred to as “vector fields”. The first derivative of these functions evaluated at 𝑥 ∈ ℝ𝑛
in the direction of the canonical basis vector set is given by the so-called Jacobian matrix, which we denote
by
$$Jf(x) := \left(J_{ij} f\right)(x) := \begin{pmatrix} \frac{\partial}{\partial x_1} f_1(x) & \cdots & \frac{\partial}{\partial x_n} f_1(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} f_m(x) & \cdots & \frac{\partial}{\partial x_n} f_m(x) \end{pmatrix} \in \mathbb{R}^{m \times n} \qquad (7)$$
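The Jacobian can likewise be approximated by finite differences. The sketch below is an added illustration; the vector field chosen, f(x₁, x₂) = (x₁²x₂, 5x₁ + sin x₂), is a hypothetical example, not one from the text:

```python
import math

def f(x):
    # a hypothetical example vector field f: R^2 -> R^2
    x1, x2 = x
    return [x1**2 * x2, 5 * x1 + math.sin(x2)]

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian with entries J[i][j] ~ d f_i / d x_j at x."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [1.0, 0.0]
J = jacobian(f, x)
# analytic Jacobian: [[2*x1*x2, x1**2], [5, cos(x2)]] = [[0, 1], [5, 1]]
print(J)
```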
(9) Taylor’s Theorem
Loosely speaking, Taylor’s theorem states that any 𝑘 times differentiable function 𝑓 ∶ ℝ → ℝ can be
locally approximated in the region of a given point of its domain by means of a polynomial. In the “Peano
form”, Taylor’s theorem takes the following form: let 𝑘 ∈ ℕ and let a univariate real-valued function
𝑓 ∶ ℝ → ℝ be 𝑘 times differentiable in a point 𝑎 ∈ ℝ. Then there exists a function ℎ𝑘:ℝ → ℝ such that
$$f(x) = f(a) + f'(a)(x-a) + \frac{1}{2!} f''(a)(x-a)^2 + \cdots + \frac{1}{k!} f^{(k)}(a)(x-a)^k + h_k(x)(x-a)^k \qquad (1)$$
with lim𝑥→𝑎 ℎ𝑘(𝑥) = 0. In other words, close to the expansion point 𝑎, the approximation of 𝑓(𝑥) by the
𝑘th order polynomial involving the 𝑘 derivatives of 𝑓 becomes exact. In the applied literature, Taylor’s
theorem is often used without the remainder term, and referred to as a “𝑘th order Taylor approximation” of
the function 𝑓 in (or around) the expansion point 𝑎. This is commonly denoted by
$$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2!} f''(a)(x-a)^2 + \cdots + \frac{1}{k!} f^{(k)}(a)(x-a)^k \qquad (2)$$
First- and second-order Taylor approximations, which are perhaps the most frequently applied approximations, are thus given for x, a ∈ ℝ by

$$f(x) \approx f(a) + f'(a)(x-a) \qquad (3)$$

and

$$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2} f''(a)(x-a)^2 \qquad (4)$$
Note that Taylor approximations are local and not global function approximations, i.e. they are good
approximations close to the expansion point 𝑎 ∈ ℝ, but bad approximations far away from the expansion
point.
In analogy to the univariate case, Taylor's theorem can also be formulated for functions of the form f : ℝⁿ → ℝ. Because f is now a function of multiple variables, partial derivatives come into play. We do not state Taylor's theorem for multivariate real-valued functions explicitly, but, in analogy to the univariate approximations above, content ourselves with stating first- and second-order Taylor approximations around an expansion point a ∈ ℝⁿ for multivariate real-valued functions. Notably, in these approximations the gradient vector takes on the role of the first derivative and the Hessian matrix takes the role of the second derivative. For the first-order Taylor approximation of a differentiable function f : ℝⁿ → ℝ in a ∈ ℝⁿ, we have

$$f(x) \approx f(a) + \nabla f(a)^T (x-a) \qquad (5)$$

and for the second-order Taylor approximation of f : ℝⁿ → ℝ in a ∈ ℝⁿ, we have

$$f(x) \approx f(a) + \nabla f(a)^T (x-a) + \frac{1}{2}(x-a)^T Hf(a)(x-a) \qquad (6)$$
Finally, also in the case of multivariate, vector-valued functions of the form f : ℝⁿ → ℝᵐ, local approximations around expansion points a ∈ ℝⁿ can be evaluated using an appropriately generalized Taylor's theorem. We content ourselves here with stating the first-order approximation of a function f : ℝⁿ → ℝᵐ in a point a ∈ ℝⁿ, in analogy to the first-order approximations stated above. Notably, the Jacobian matrix now takes on the role of the first derivative or gradient:

$$f(x) \approx f(a) + Jf(a)(x-a) \qquad (7)$$
Note that here 𝑓(𝑎) ∈ ℝ𝑚, 𝐽𝑓(𝑎) ∈ ℝ𝑚×𝑛 and (𝑥 − 𝑎) ∈ ℝ𝑛. Of course, for nonlinear 𝑓 this approximation
will only be reasonable in the close vicinity of 𝑎.
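The quality of these local approximations can be probed numerically. The sketch below (an added illustration, using the same function and expansion point as Panel A of Figure 3) compares first- and second-order Taylor approximations of ln x around a = 2.5:

```python
import math

a = 2.5
fa, f1, f2 = math.log(a), 1/a, -1/a**2  # ln(a), f'(a), f''(a)

def taylor1(x):
    # first-order approximation f(a) + f'(a)(x - a)
    return fa + f1 * (x - a)

def taylor2(x):
    # second-order approximation adds (1/2) f''(a) (x - a)^2
    return taylor1(x) + 0.5 * f2 * (x - a)**2

x = 2.6  # close to the expansion point
err1 = abs(math.log(x) - taylor1(x))
err2 = abs(math.log(x) - taylor2(x))
print(err1, err2)
```

Close to a, the second-order approximation is markedly more accurate; far from a, both approximations degrade, reflecting the local nature of Taylor approximations.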
Figure 3. Taylor approximations. Panel A depicts first and second order approximations of the univariate, real-valued logarithmic function 𝑓: ℝ+ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ 𝑙𝑛 𝑥 around the expansion point 𝑎 = 2.5. Note that in the vicinity of 𝑎 the second-order approximation captures the behaviour of the function 𝑓 better than the first order approximation. Panel B depicts a first order approximation of the multivariate (here bivariate) real-valued function 𝑓:ℝ2 → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ −𝑥𝑇𝑥 around the expansion point 𝑎 = (0.75, 0.75)𝑇. Note that the approximation is only reasonable in the vicinity of the expansion point.
Study Questions
1. Give a brief explanation of the notion of a derivative of a univariate function 𝑓 in a point 𝑥.
2. Provide brief explanations of the symbols $\frac{d}{dx}$, $\frac{d^2}{dx^2}$, $\frac{\partial}{\partial x}$, and $\frac{\partial^2}{\partial x^2}$.
3. Compute the first derivatives of the following functions:

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := 2\exp(-x^3)$$

$$g : \mathbb{R} \to \mathbb{R},\; x \mapsto g(x) := (x^2 + 2x - a)^3$$
4. Compute the partial derivatives of the function

$$f : \mathbb{R}^2 \to \mathbb{R},\; (x, y) \mapsto f(x, y) := \log x + \sum_{i=1}^{n} (y-3)^2$$

with respect to x and y.
5. Determine the minimum of the function $f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := x^2 + x + 2$.
Study Question Answers
1. The derivative f′(x) ∈ ℝ of a function f at the location x ∈ ℝ conveys two basic intuitions: it is a measure of the rate of change of f at location x, and it is the slope of the tangent line of f at the point (x, f(x)) ∈ ℝ².
2. If written in front of a function with input argument x, $\frac{d}{dx}$ is understood as the imperative to evaluate the (first) derivative of the function, and $\frac{d^2}{dx^2}$ as the imperative to evaluate the second derivative of the function (the derivative of the derivative of the function). Likewise, if written in front of a function with multiple input arguments, for example x, y, and z, $\frac{\partial}{\partial x}$ is best understood as the imperative to evaluate the (first) derivative of the function with respect to the input variable x (the partial derivative with respect to x), while $\frac{\partial^2}{\partial x^2}$ is understood as the imperative to evaluate the second derivative of the function with respect to x, i.e. the second partial derivative with respect to x.

3. With the chain rule, we have

$$f'(x) = \frac{d}{dx}\, 2\exp(-x^3) = 2 \frac{d}{dx} \exp(-x^3) = 2\exp(-x^3) \cdot \frac{d}{dx}(-x^3) = -6x^2 \exp(-x^3)$$

$$g'(x) = \frac{d}{dx}(x^2 + 2x - a)^3 = 3(x^2 + 2x - a)^2 \frac{d}{dx}(x^2 + 2x - a) = 3(x^2 + 2x - a)^2 (2x + 2)$$
4. With the linearity of differentiation, we have

$$\frac{\partial}{\partial x} f(x, y) = \frac{\partial}{\partial x}\left(\log x + \sum_{i=1}^{n}(y-3)^2\right) = \frac{\partial}{\partial x}\log x + \frac{\partial}{\partial x}\sum_{i=1}^{n}(y-3)^2 = \frac{1}{x} + 0 = \frac{1}{x}$$

and with the linearity and the chain rule of differentiation

$$\frac{\partial}{\partial y} f(x, y) = \frac{\partial}{\partial y}\left(\log x + \sum_{i=1}^{n}(y-3)^2\right) = \frac{\partial}{\partial y}\log x + \frac{\partial}{\partial y}\sum_{i=1}^{n}(y-3)^2 = 0 + \sum_{i=1}^{n}\frac{\partial}{\partial y}(y-3)^2 = \sum_{i=1}^{n} 2(y-3)\cdot\frac{\partial}{\partial y}(y-3) = \sum_{i=1}^{n} 2(y-3)\cdot 1 = 2\sum_{i=1}^{n}(y-3)$$
5. We first compute the first derivative of the function and then set it to zero, solving for a critical point. For the first derivative, we have

$$f'(x) := \frac{d}{dx}(x^2 + x + 2) = 2x + 1$$

Setting this to zero and solving for the critical point x* then yields

$$f'(x^\ast) = 0 \;\Leftrightarrow\; 2x^\ast + 1 = 0 \;\Leftrightarrow\; x^\ast = -\frac{1}{2}$$

Because f″(−1/2) = 2 > 0, we know that −1/2 is indeed a minimum of f.
Integral Calculus
We assume that the reader has a basic familiarity with integrals from high school mathematics and focus
the discussion on two features of integration: (1) the intuition of the definite Riemann integral as the signed
area under a function’s graph, and (2) indefinite integration as the inverse of differentiation.
(1) Definite integrals - The integral as the signed area under a function’s graph
We will denote the “definite integral” of a univariate real-valued function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
on an interval [𝑎, 𝑏] ⊂ ℝ by the real number
$$I := \int_a^b f(x)\, dx \in \mathbb{R} \qquad (2)$$

Two things are important to note with respect to the notation above: first, the definite integral is simply a real number, and second, the right-hand side is merely notational and to be understood as the task of "integrating the function f on the interval [a, b]". In other words, there is no mathematical meaning associated with the "dx" or the ∫ₐᵇ beyond the definition of the integral boundaries a and b. The term "definite" is used here to distinguish this integral from the "indefinite" integral discussed below. Put simply, definite integrals are those integrals for which the integral boundaries appear at the integral sign (although due to sloppy notation, they are sometimes omitted, for example, if the interval of integration is the entire real line).
Intuitively, the definite integral ∫ₐᵇ f(x) dx is best understood as the continuous generalization of the sum

$$\sum_{i=1}^{n} f(x_i)\, \Delta x \qquad (3)$$

where

$$a =: x_1,\quad x_2 := x_1 + \Delta x,\quad x_3 := x_2 + \Delta x,\; \ldots,\; x_n := b \qquad (4)$$

corresponds to an equipartition of the interval [a, b], i.e. a partition of the interval [a, b] into n − 1 bins of equal size Δx. The term f(xᵢ)Δx (i = 1, …, n) corresponds to the area of the rectangle formed by the value of the function f at xᵢ (i.e. the upper left corner of the rectangle) as height, and the bin width Δx as width. Summing over all rectangles then yields an approximation of the area under the graph of the function f, where terms with negative values of f(xᵢ) enter the sum with a negative sign. Letting the bin width Δx in the sum (3) approach zero (i.e. making it smaller and smaller) then approximates the integral of f on the interval [a, b]:

$$\int_a^b f(x)\, dx \approx \sum_{i=1}^{n} f(x_i)\, \Delta x \quad \text{for } \Delta x \to 0 \qquad (5)$$
See Figure 1 below for a visualization.
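This limiting construction is easy to mimic numerically. The sketch below (an addition, not from the original text) evaluates the sum (3) for an equipartition with a small bin width and compares it with a known integral, ∫₀¹ x² dx = 1/3:

```python
def riemann_sum(f, a, b, n):
    """Sum of f(x_i) * dx over an equipartition of [a, b] into n bins."""
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(n)) * dx

# integral of x^2 on [0, 1]; the exact value is 1/3
approx = riemann_sum(lambda x: x * x, 0.0, 1.0, 100000)
print(approx)
```

As the bin width Δx shrinks, the sum approaches the value of the integral.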
Definite integrals have the "linearity property", which is often useful when evaluating integrals analytically. We will briefly elucidate this property. Based on the intuition that for a function f : ℝ → ℝ, x ↦ f(x)

$$\int_a^b f(x)\, dx \approx \sum_{i=1}^{n} f(x_i)\, \Delta x \qquad (6)$$

and the fact that for a second function g : ℝ → ℝ, x ↦ g(x)

$$\sum_{i=1}^{n} \left(f(x_i) + g(x_i)\right)\Delta x = \sum_{i=1}^{n} \left(f(x_i)\Delta x + g(x_i)\Delta x\right) = \sum_{i=1}^{n} f(x_i)\Delta x + \sum_{i=1}^{n} g(x_i)\Delta x \qquad (7)$$

and for a constant c ∈ ℝ

$$\sum_{i=1}^{n} c f(x_i)\Delta x = c \sum_{i=1}^{n} f(x_i)\Delta x \qquad (8)$$

we can infer the following linearity properties of the Riemann integral:

$$\int_a^b \left(f(x) + g(x)\right) dx = \int_a^b f(x)\, dx + \int_a^b g(x)\, dx \qquad (9)$$

and

$$\int_a^b c f(x)\, dx = c \int_a^b f(x)\, dx \qquad (10)$$
In words: firstly, the integral of the sum of two functions f + g over an interval [a, b] corresponds to the sum of the integrals of the individual functions f and g on [a, b]. Secondly, the integral of a function f multiplied by a constant c on an interval [a, b] corresponds to the integral of the function f on the interval [a, b] multiplied by the constant. Both properties are very useful when evaluating integrals analytically: the first allows for decomposing integrals of composite functions into sums of integrals of less complex functions, while the second allows for removing constants from the integration.
Figure 1. Visualization of the definite integral as area under a function’s graph
(2) Indefinite Integrals - Integration as the inverse of differentiation
Consider again a univariate real-valued function
62
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
and a second function defined by means of making the upper integration boundary of an integral of 𝑓 its
input variable:
$$F : \mathbb{R}^+ \to \mathbb{R},\; x \mapsto F(x) := \int_0^x f(s)\, ds \qquad (2)$$
From the discussion above, we have that the value of F at x can be considered the signed area under the graph of the function f on the interval from 0 to x. Notably, the derivative of the function F at x, denoted by F′(x), corresponds to the value of the function f, i.e.

$$F'(x) = \frac{d}{dx}\left(\int_0^x f(s)\, ds\right) = f(x) \qquad (3)$$
Intuitively, the above statement says that integrating is the inverse of differentiation, in the sense
that first integrating 𝑓 from 0 to 𝑥 and then computing the derivative with respect to 𝑥 yields 𝑓. Any
function 𝐹 with the property 𝐹′(𝑥) = 𝑓(𝑥) for a function 𝑓 is called an anti-derivative or “indefinite
integral” of 𝑓. An indefinite integral is denoted by
$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) = \int f(s)\, ds \qquad (4)$$
Note that the definite integral defined above corresponds to a real scalar number, while the indefinite integral is a function.
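Equation (3) can also be probed numerically. The sketch below (an added illustration) builds F(x) := ∫₀ˣ cos(s) ds via a midpoint-rule sum and differentiates it by central differences, recovering f(x) = cos(x):

```python
import math

def integral(f, a, b, n=20000):
    """Midpoint-rule approximation of the definite integral of f on [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

def F(x):
    # F(x) := integral of f from 0 to x, with f = cos
    return integral(math.cos, 0.0, x)

x, h = 1.0, 1e-3
F_prime = (F(x + h) - F(x - h)) / (2 * h)
print(F_prime, math.cos(x))  # the two values agree closely
```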
Proof of (3)
While the statement of equation (3), that the derivative of a function's antiderivative corresponds to the function itself, is familiar and intuitive, it is not necessarily formally easy to grasp. Here we thus provide a proof of equation (3) for the interested reader, based on [Leithold 1996]. This proof makes use of limiting processes and the mean value theorem for integration, which the interested reader may review prior to studying the proof. Let f : ℝ → ℝ, s ↦ f(s) be a univariate, real-valued function, and define another function

$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) := \int_a^x f(s)\, ds \qquad (3.1)$$

For any two numbers x₁ and x₁ + Δx in the (closed) interval [a, b] ⊂ ℝ, we then have

$$F(x_1) = \int_a^{x_1} f(s)\, ds \quad \text{and} \quad F(x_1 + \Delta x) = \int_a^{x_1 + \Delta x} f(s)\, ds \qquad (3.2)$$

Subtraction of these two equalities yields

$$F(x_1 + \Delta x) - F(x_1) = \int_a^{x_1 + \Delta x} f(s)\, ds - \int_a^{x_1} f(s)\, ds \qquad (3.3)$$

From the intuition of the integral as the area between the function f and the x-axis it follows naturally that the sum of the areas of two adjacent regions is equal to the area of both regions combined, i.e.,

$$\int_a^{x_1} f(s)\, ds + \int_{x_1}^{x_1 + \Delta x} f(s)\, ds = \int_a^{x_1 + \Delta x} f(s)\, ds \qquad (3.4)$$

From this it follows that the difference above evaluates to

$$F(x_1 + \Delta x) - F(x_1) = \int_{x_1}^{x_1 + \Delta x} f(s)\, ds \qquad (3.5)$$

According to the "mean value theorem for integration", there exists a real number c_Δx ∈ [x₁, x₁ + Δx] (the dependence of which on Δx we have denoted by the subscript) such that

$$\int_{x_1}^{x_1 + \Delta x} f(s)\, ds = f(c_{\Delta x})\, \Delta x \qquad (3.6)$$

and we hence obtain

$$F(x_1 + \Delta x) - F(x_1) = f(c_{\Delta x})\, \Delta x \qquad (3.7)$$

Division by Δx then yields

$$\frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = f(c_{\Delta x}) \qquad (3.8)$$

where the left-hand side corresponds to "Newton's difference quotient". Taking the limit Δx → 0 on both sides then yields

$$\lim_{\Delta x \to 0} \frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = \lim_{\Delta x \to 0} f(c_{\Delta x}) \;\Leftrightarrow\; F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) \qquad (3.9)$$

by definition of the derivative as the limit of Newton's difference quotient. The limit on the right-hand side of the above remains to be evaluated. To this end, we recall that c_Δx ∈ [x₁, x₁ + Δx], or in other words, that x₁ ≤ c_Δx ≤ x₁ + Δx. Notably, lim_{Δx→0} x₁ = x₁ and lim_{Δx→0} (x₁ + Δx) = x₁. Therefore, we can conclude that lim_{Δx→0} c_Δx = x₁, as c_Δx is "squeezed" between these two limits. We thus find

$$F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) = f(x_1) \qquad (3.10)$$

which concludes the proof. □
Indefinite integrals (or antiderivatives) allow for the evaluation of definite integrals ∫ₐᵇ f(s) ds by means of the fundamental theorem of calculus:

$$\int_a^b f(s)\, ds = F(b) - F(a) \qquad (5)$$
In words: to evaluate the integral of a univariate real-valued function f on the interval [a, b], one first computes an antiderivative F of f, and then forms the difference between the antiderivative evaluated at the upper integral boundary b and the antiderivative evaluated at the lower integral boundary a. Equation (5) is very familiar, so we postpone a formal derivation and first consider properties and examples.
Without proof we note that the linearity properties of the definite integral also hold for the
indefinite integral, i.e., for functions 𝑓, 𝑔:ℝ → ℝ and constant 𝑐 ∈ ℝ we have
∫(𝑓(𝑥) + 𝑔(𝑥)) 𝑑𝑥 = ∫𝑓(𝑥) 𝑑𝑥 + ∫𝑔(𝑥) 𝑑𝑥 (6)
and
∫ 𝑐𝑓(𝑥) 𝑑𝑥 = 𝑐 ∫𝑓(𝑥) 𝑑𝑥 (7)
Just as for differentiation, it is useful to know the antiderivatives of a handful of commonly encountered univariate functions f : ℝ → ℝ. We present a selection without proofs below; they can readily be verified by evaluating the derivatives of the respective antiderivatives to recover the original functions. Note that the derivative of the constant function f(x) := c with c ∈ ℝ is zero, which is why an arbitrary additive constant c appears in each antiderivative. We have
$$f(x) := a \;\Rightarrow\; F(x) = ax + c$$

$$f(x) := x^a \;\Rightarrow\; F(x) = \frac{1}{a+1}\, x^{a+1} + c \quad (a \ne -1)$$

$$f(x) := x^{-1} \;\Rightarrow\; F(x) = \ln x + c$$

$$f(x) := \exp(x) \;\Rightarrow\; F(x) = \exp(x) + c$$

$$f(x) := \sin(x) \;\Rightarrow\; F(x) = -\cos(x) + c$$

$$f(x) := \cos(x) \;\Rightarrow\; F(x) = \sin(x) + c \qquad (8)$$
The interested reader may find proofs of the above in [Spivak 1994].
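As noted above, each row of (8) can be verified by differentiating the antiderivative. The sketch below (an addition, not from the original text) does this numerically for two rows of the table:

```python
import math

def deriv(F, x, h=1e-6):
    """Central-difference derivative of F at x."""
    return (F(x + h) - F(x - h)) / (2 * h)

x = 0.7
# row f(x) = sin(x): F(x) = -cos(x), so F'(x) should equal sin(x)
d_sin = deriv(lambda t: -math.cos(t), x)
# row f(x) = x**a with a = 2: F(x) = x**3/3, so F'(x) should equal x**2
d_pow = deriv(lambda t: t**3 / 3, x)
print(d_sin, math.sin(x))
print(d_pow, x**2)
```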
Example
To illustrate the theoretical discussion above, we evaluate the antiderivative of the function

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) := 2x^2 + x + 1 \qquad (9)$$

and use this antiderivative to evaluate the definite integral of this function on the interval [1, 2], i.e. ∫₁² f(x) dx. To this end, we first use the linearity property of the indefinite integral, which yields for the antiderivative

$$F : \mathbb{R} \to \mathbb{R},\; x \mapsto F(x) := \int f(x)\, dx = \int (2x^2 + x + 1)\, dx = 2\int x^2\, dx + \int x\, dx + \int 1\, dx \qquad (10)$$

We then make use of (8) to evaluate the remaining integral terms:

$$F(x) = \frac{2}{3} x^3 + \frac{1}{2} x^2 + x + C \qquad (11)$$
where the constant C ∈ ℝ comprises all constant terms. Importantly, this constant term vanishes once we evaluate a definite integral:

$$\int_1^2 f(x)\, dx = F(2) - F(1) = \left(\frac{2}{3}\, 2^3 + \frac{1}{2}\, 2^2 + 2 + C\right) - \left(\frac{2}{3}\, 1^3 + \frac{1}{2}\, 1^2 + 1 + C\right) = \frac{16}{3} + \frac{4}{2} + 2 + C - \frac{2}{3} - \frac{1}{2} - 1 - C = \frac{32}{6} + \frac{12}{6} + \frac{12}{6} - \frac{4}{6} - \frac{3}{6} - \frac{6}{6} = \frac{43}{6} \qquad (12)$$
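A quick numerical check of this result (added here for illustration): evaluating the antiderivative at the boundaries, and independently summing a fine Riemann sum, both reproduce 43/6:

```python
def f(x):
    return 2 * x**2 + x + 1

def F(x):
    # antiderivative from (11); the constant C cancels in F(2) - F(1)
    return (2/3) * x**3 + (1/2) * x**2 + x

exact = F(2) - F(1)
dx = 1e-5
riemann = sum(f(1 + i * dx) * dx for i in range(100000))
print(exact, riemann, 43/6)
```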
In the remainder of this Section, we formally justify equation (5) for the interested reader.
Proof of (5)
Like equation (3), equation (5) is very familiar, but again, a formal derivation is somewhat more involved. The following proof makes use of limiting processes and the mean value theorem of differentiation. We first consider the quantity F(b) − F(a). To this end, we select numbers x₁, …, xₙ such that

$$a := x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n =: b \qquad (5.1)$$

It follows that F(b) − F(a) = F(xₙ) − F(x₀). Now each F(xᵢ), i = 1, …, n − 1 is added to the quantity F(b) − F(a) together with its additive inverse:

$$F(b) - F(a) = F(x_n) + \left(-F(x_{n-1}) + F(x_{n-1})\right) + \cdots + \left(-F(x_1) + F(x_1)\right) - F(x_0) = \left(F(x_n) - F(x_{n-1})\right) + \left(F(x_{n-1}) - F(x_{n-2})\right) + \cdots + \left(F(x_1) - F(x_0)\right) = \sum_{i=1}^{n} \left(F(x_i) - F(x_{i-1})\right) \qquad (5.2)$$

The mean value theorem of differentiation states that for a function F : [a, b] → ℝ there exists (under certain constraints) a number c ∈ ]a, b[ such that

$$F'(c) = \frac{F(b) - F(a)}{b - a} \qquad (5.3)$$

From the mean value theorem of differentiation it thus follows that for the terms of the sum above we have, with appropriately chosen cᵢ ∈ ]x_{i−1}, xᵢ[ (i = 1, …, n),

$$F(x_i) - F(x_{i-1}) = F'(c_i)(x_i - x_{i-1}) \qquad (5.4)$$

and substitution yields

$$F(b) - F(a) = \sum_{i=1}^{n} F'(c_i)(x_i - x_{i-1}) \qquad (5.5)$$

By definition we have F′(cᵢ) = f(cᵢ), and setting Δx_{i−1} := xᵢ − x_{i−1} yields

$$F(b) - F(a) = \sum_{i=1}^{n} f(c_i)\, \Delta x_{i-1} \qquad (5.6)$$

By taking the limit of the above, we obtain the Riemann integral:

$$\lim_{\Delta x_{i-1} \to 0} \left(F(b) - F(a)\right) = \lim_{\Delta x_{i-1} \to 0} \sum_{i=1}^{n} f(c_i)\, \Delta x_{i-1} \qquad (5.7)$$

F(b) and F(a) are independent of the xᵢ, and the left-hand side of the above thus evaluates to F(b) − F(a). For the right-hand side, we note that x_{i−1} ≤ cᵢ ≤ x_{i−1} + Δx_{i−1}, and thus lim_{Δx_{i−1}→0} x_{i−1} = x_{i−1} and lim_{Δx_{i−1}→0} (x_{i−1} + Δx_{i−1}) = x_{i−1}, from which it follows that lim_{Δx_{i−1}→0} cᵢ = x_{i−1}. We thus have

$$F(b) - F(a) = \lim_{\Delta x_{i-1} \to 0} \sum_{i=1}^{n} f(x_{i-1})\, \Delta x_{i-1} = \lim_{\Delta x_i \to 0} \sum_{i=0}^{n-1} f(x_i)\, \Delta x_i = \int_a^b f(s)\, ds \qquad (5.8)$$

where the last equality holds by the definition of the Riemann integral, under the generalization that the Δxᵢ need not be equally spaced. □
Study Questions
1. State the intuitions for (a) the definite integral ∫ₐᵇ f(x) dx of a function f on an interval [a, b] ⊂ ℝ and (b) the indefinite integral ∫ f(x) dx of a function.

2. Evaluate the integral $\int_1^3 1\, ds$.
Study question answers
1. The definite integral ∫ₐᵇ f(x) dx of a function f on an interval [a, b] ⊂ ℝ is intuitively understood as the signed area between the function's graph and the x-axis. The indefinite integral ∫ f(x) dx of a function is that function F whose derivative is the function f, i.e. F′(x) = d/dx (∫ f(x) dx) = f(x).

2. We first evaluate the antiderivative of the constant function 1 : ℝ → ℝ, x ↦ 1(x) := 1, which is given by the identity function id : ℝ → ℝ, x ↦ id(x) := x, because d/dx id(x) = 1 for all x ∈ ℝ. Using the fundamental theorem of calculus, we thus have

$$\int_1^3 1\, ds = id(3) - id(1) = 3 - 1 = 2$$
Sequences and Series
(1) Sequences
Intuitively, a sequence is an infinite ordered list. More formally, let 𝑀 be a nonempty set, for
example the real line ℝ. A “sequence in 𝑀” is a function 𝑓, which allocates to each natural number 𝑛 ∈ ℕ a
unique element 𝑓(𝑛) ∈ 𝑀. If 𝑀 ≔ ℝ, the sequence is called a “real sequence”. The elements 𝑎𝑛 ≔ 𝑓(𝑛)
are called the “terms” of the sequence 𝑓. Sequences are usually denoted without reference to 𝑓, and the
following notations are commonly encountered
(𝑎𝑛)𝑛∈ℕ or (𝑎𝑛) or (𝑎1, 𝑎2, 𝑎3, … ) (1)
It is important to note that because sequences are defined as functions of the form
𝑓: ℕ → 𝑀, 𝑛 ↦ 𝑓(𝑛) ≔ 𝑎𝑛 (2)
and there are infinitely many natural numbers, sequences have infinitely many terms 𝑎𝑛. Examples of
sequences are
$$(q^n)_{n \in \mathbb{N}} = (q^1, q^2, q^3, \ldots) \quad \text{for } q \in \mathbb{R} \qquad (3)$$

$$(n^k)_{n \in \mathbb{N}} = (1^k, 2^k, 3^k, \ldots) \quad \text{for } k \in \mathbb{Z} \qquad (4)$$

and

$$(\sqrt{n})_{n \in \mathbb{N}} = (\sqrt{1}, \sqrt{2}, \sqrt{3}, \ldots) \qquad (5)$$
An important concept associated with sequences is the question of their convergence. Intuitively, a sequence is said to converge if, as n grows, its terms approach a fixed finite real number (the limit of the sequence) arbitrarily closely.
(2) Series
Intuitively, series are infinite sums. More formally, series are defined as a special kind of sequence: let (aₙ)_{n∈ℕ} be a sequence in ℝ. Then the series

$$\sum_{n=1}^{\infty} a_n = a_1 + a_2 + a_3 + \cdots \qquad (1)$$

is the sequence (sₙ)_{n∈ℕ} of partial sums, where

$$s_n := \sum_{k=1}^{n} a_k \qquad (2)$$

The aₙ are referred to as the "terms" of the series, whereas the sₙ are referred to as its "partial sums". If the series converges, i.e., if the sequence of partial sums approaches a finite limit as n goes to infinity, then the symbol Σ_{n=1}^∞ aₙ is also used to denote this limiting value.
Examples of series are the so-called "geometric series"

$$\sum_{n=0}^{\infty} \frac{1}{z^n} = 1 + \frac{1}{z} + \frac{1}{z^2} + \cdots \qquad (3)$$

which converges for |z| > 1 (i.e., for |1/z| < 1), and the so-called "harmonic series"

$$\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots \qquad (4)$$

which does not converge.
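The different behaviour of these two series is visible directly in their partial sums; a small sketch (added here for illustration, taking z = 2 in the geometric series):

```python
# partial sums of the geometric series for z = 2 (terms (1/2)**n, n = 0, 1, ...)
geo = [sum(0.5**n for n in range(N)) for N in (10, 20, 40)]
# partial sums of the harmonic series
har = [sum(1/n for n in range(1, N + 1)) for N in (10, 100, 10000)]
print(geo)  # approaches 2 = 1/(1 - 1/2)
print(har)  # keeps growing without bound
```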
Series can be used to define real-valued univariate functions of the form

$$f : \mathbb{R} \to \mathbb{R},\; x \mapsto f(x) \qquad (5)$$

To achieve this, the value f(x) is defined as a series in x, i.e. as a series that depends on the function's input argument x. An important example is the series definition of the exponential function

$$\exp : \mathbb{R} \to \mathbb{R},\; x \mapsto \exp(x) := \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \qquad (6)$$

where n! := Π_{k=1}^n k denotes the factorial. Importantly, the series definition of the exponential function allocates the value of the convergent series Σ_{n=0}^∞ xⁿ/n! ∈ ℝ to each x ∈ ℝ. Further important examples of functions defined in terms of series are the sine and cosine functions introduced below. Finally, the Fourier series, too, is a function with domain ℝ that is defined in terms of a series.
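The series definition (6) lends itself to a direct numerical sketch (an addition, not from the original text); the partial sums converge quickly to the library exponential:

```python
import math

def exp_series(x, k=30):
    """Partial sum of the series sum_{n=0}^{k} x**n / n!."""
    term, s = 1.0, 1.0
    for n in range(1, k + 1):
        term *= x / n  # x**n / n!, built up incrementally
        s += term
    return s

print(exp_series(1.0), math.exp(1.0))
print(exp_series(2.0), math.exp(2.0))
```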
Study Questions
1. Write down the definition of a sequence.
2. Write down the definition of a series.
Study Questions Answers
1. For a nonempty set M, a "sequence in M" is a function f which allocates to each natural number n ∈ ℕ a unique element f(n) ∈ M, and for aᵢ ∈ M it is usually denoted by (aₙ)_{n∈ℕ} or (aₙ) or (a₁, a₂, a₃, …).

2. Let (aₙ)_{n∈ℕ} be a sequence in ℝ. Then the series Σ_{n=1}^∞ aₙ = a₁ + a₂ + a₃ + ⋯ is the sequence (sₙ)_{n∈ℕ} of partial sums sₙ := Σ_{k=1}^n aₖ.
Ordinary differential equations
(1) Differential equations
Differential equations specify sets of functions that model real-world phenomena. These sets of functions are defined implicitly: a differential equation specifies a relation between the function it defines, the function's input argument, and, importantly, the function's derivatives up to arbitrary order. In this manner, differential equations can be used to model real-world phenomena by explicitly describing their dynamics, i.e. the way the phenomena change, and the changes of this change. In contrast to algebraic equations, the solutions of which are numbers, the solutions of differential equations and associated initial and boundary value problems are functions. The classic theory of differential equations is primarily concerned with finding explicit solutions to differential equations and establishing criteria under which such solutions exist and are unique. The more modern viewpoint of dynamical systems theory is more interested in the qualitative properties of systems of differential equations.
Three classes of differential equations are of interest for probabilistic models of functional
neuroimaging data: (1) ordinary differential equations, (2) partial differential equations, and (3) stochastic
differential equations. Ordinary differential equations (ODEs) are characterized by the fact that only
derivatives with respect to a single scalar input variable (usually modelling time or space) appear. ODEs are
thus used to model systems that evolve either in time or space. Partial differential equations (PDEs) are
characterized by the fact that they comprise partial derivatives, i.e. derivatives with respect to two or more
scalar input variables. Often, these variables refer to time and space, and PDEs are used to model systems
that evolve in both time and space. Both ODEs and PDEs specify sets of "deterministic" functions (all functions are deterministic by definition), i.e. they do not account for random fluctuations in the evolution of the system that they model. Stochastic differential equations, which can be either ordinary or partial, explicitly take random innovations in the dynamics of systems into account. The solutions of stochastic differential equations, if they exist, are stochastic processes. In this section, we will be concerned with ordinary differential equations.
Ordinary differential equations specify functions of scalar variables. We begin our treatment of ODEs
with some preliminary remarks on notation. In the context of neuroimaging data analyses, the most
prevalent use of ODEs is to describe the evolution of systems over time. We will thus denote the input
argument to the functions specified by ODEs using the letter 𝑡 and assume that 𝑡 ∈ 𝐼 ⊆ ℝ, i.e. that 𝑡 is a real
scalar value in some interval 𝐼 of the real line. The functions specified by ODEs can be real-valued or vector-
valued, i.e. with 𝑛 ≥ 1 they are of the form
𝑥 ∶ 𝐼 ⊆ ℝ → ℝ𝑛, 𝑡 ↦ 𝑥(𝑡) (1)
Note that the xs that appear in the ODEs below always refer to functions, and not, as in algebraic equations, to numbers or vectors. Because we imply that the input variable t describes time, we will use the "dot notation" to indicate the derivatives of the function x with respect to its input argument:
$$\dot{x}(t) = \frac{d}{dt} x(t),\quad \ddot{x}(t) = \frac{d^2}{dt^2} x(t),\quad \dddot{x}(t) = \frac{d^3}{dt^3} x(t),\; \ldots \qquad (2)$$
Note that the dot notation only really works for lower order derivatives and that for higher order derivatives,
(𝑘 > 3), we use the more general notation
$$x^{(k)}(t) = \frac{d^k}{dt^k} x(t) \qquad (3)$$
In general, we will refrain from using the “prime” notation 𝑥′(𝑡), 𝑥′′(𝑡),… for derivatives in the context of
ODEs. Furthermore, we will usually use the notation �̇�(𝑡), �̈�(𝑡), … to denote the derivatives of the function 𝑥
evaluated at 𝑡, while we use the notation �̇�, �̈�, … to denote the derivative of the function 𝑥 proper, i.e. the
functions
𝑥(𝑘) ∶ 𝐼 ⊆ ℝ → ℝ𝑛, 𝑡 ↦ 𝑥(𝑘)(𝑡) (𝑘 = 1,2,… ) (4)
which allocate the respective derivatives to all input arguments 𝑡 ∈ 𝐼.
Based on these preliminary remarks, we can now give a first formalization of ODEs. An ODE is an
equation that can comprise a function 𝑥, its input argument 𝑡, derivatives of the function 𝑥 up to arbitrary
order, and a function of these entities. In general ODEs can be written as
$$F\left(t, x, \dot{x}, \ddot{x}, \ldots, x^{(k)}\right) = 0 \qquad (5)$$

where we leave the range and domain of the function F unspecified for the moment. The order k ∈ ℕ of the highest derivative occurring in an ODE is referred to as the "order" of the ODE. If the ODE can be written in the form

$$x^{(k)} = f\left(t, x, \dot{x}, \ddot{x}, \ldots, x^{(k-1)}\right) \qquad (6)$$
by appropriately reformulating the function 𝐹 for the function 𝑓, the ODE is referred to as an “explicit” ODE.
Otherwise, it is referred to as an “implicit ODE”. Finally, if the specifying function 𝐹 or 𝑓 is not a function of 𝑡,
the ODE is referred to as “autonomous”, otherwise, it is referred to as “non-autonomous”.
To illustrate the nomenclature we give two examples of ODEs. Let
𝑥 ∶ 𝐼 ⊆ ℝ → ℝ, 𝑡 ↦ 𝑥(𝑡), �̇� ∶ 𝐼 ⊆ ℝ → ℝ, 𝑡 ↦ �̇�(𝑡), … (7)
denote a function and its derivatives. Then
(1) ẋ = x is an explicit, autonomous, first-order ODE (8)
(2) ẋ = 2tx is an explicit, non-autonomous, first-order ODE (9)
(3) ẍ = sin(t) x is an explicit, non-autonomous, second-order ODE (10)
(2) Initial value problems for systems of first-order ordinary differential equations
Explicit systems of first-order differential equations are of eminent importance in the theory of ODEs
and dynamical systems, because many higher-order problems can be rewritten as such.
Systems of first-order ODEs
To introduce systems of first-order ODEs, let 𝐼 ⊆ ℝ be an interval of the real line and 𝐷 ⊆ ℝ𝑛 a subset of the
𝑛-dimensional space of vectors with real entries. Further, let
𝑥 ∶ 𝐼 → 𝐷, 𝑡 ↦ 𝑥(𝑡) (1)
denote a function (or, more specifically, a curve in ℝ𝑛) and let the first derivative of 𝑥 be denoted by
$$\dot{x} := \begin{pmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \vdots \\ \dot{x}_n \end{pmatrix} \qquad (3)$$
where �̇�𝑖 (𝑖 = 1,… , 𝑛) denote the first derivatives of the 𝑛 component functions of 𝑥. Further, let
𝑓 ∶ 𝐼 × 𝐷 → ℝ𝑛, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) (4)
denote a vector-valued function of a real and a vector-valued argument, sometimes referred to as the
“evolution” function of the system. Then
�̇� = 𝑓(𝑡, 𝑥) (5)
denotes an 𝑛-dimensional system of first-order ODEs. For a given time-point 𝑡 ∈ 𝐼, the system of first-order
ODEs (5) can be written more explicitly as
$$\begin{pmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \\ \vdots \\ \dot{x}_n(t) \end{pmatrix} = \begin{pmatrix} f_1\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \\ f_2\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \\ \vdots \\ f_n\left(t, (x_1(t), x_2(t), \ldots, x_n(t))^T\right) \end{pmatrix} \qquad (6)$$
where the

$$f_i : I \times D \to \mathbb{R},\; (t, x) \mapsto f_i(t, x) \quad \text{for } i = 1, \ldots, n \qquad (7)$$

are the component functions of the function f.
□
Before introducing the notion of an initial value problem for systems of first-order ODEs, we consider
an example. Let 𝑛 = 1 and
𝑓 ∶ 𝐼 × 𝐷 → ℝ, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) ≔ 𝑥 (8)
Then (5) specifies the one-dimensional system of first-order ODEs
�̇� = 𝑥 (9)
In words (9) specifies the set of functions for which the rate of change �̇�(𝑡) at each point in time 𝑡 ∈ 𝐼
corresponds to the value of the function 𝑥(𝑡) at the same time point. Without stating how the solution is
derived, we postulate that
𝑦 ∶ 𝐼 → ℝ, 𝑡 ↦ 𝑦(𝑡) ≔ 𝑐 exp(𝑡) for 𝑐 ∈ ℝ (10)
denotes a solution of (9). To prove this postulate, we evaluate the derivative of 𝑦 for each 𝑡 ∈ 𝐼
$$\dot{y}(t) = \frac{d}{dt} y(t) = \frac{d}{dt}\left(c \exp(t)\right) = c \frac{d}{dt} \exp(t) = c \exp(t) = y(t) \qquad (11)$$
and thus the function 𝑦 as defined in (10) fulfills the ODE (9).
Note that there are an infinite number of functions 𝑦 of the form (10) due to the coefficient 𝑐 ∈ ℝ. The ODE
in (9) thus specifies an infinite set of functions. Often, one is interested in specific members of this set,
which, in addition to the ODE also fulfil additional conditions such as taking on a specific value for a given
input argument. This gives rise to the concept of “initial value problems”. We define the initial value
problem for a system of first-order ODEs next.
Initial value problem for a system of first-order ODEs
In the context of a system of first-order ODEs as discussed above let 𝑡0 ∈ 𝐼 be a fixed and specified
input argument of the function x. Then an initial value problem for a system of first-order ODEs corresponds to the task of finding a solution y : I → ℝⁿ of the system
�̇� = 𝑓(𝑡, 𝑥) (12)
which satisfies the initial value
𝑥(𝑡0) = 𝑥0 (13)
where 𝑥0 ∈ ℝ𝑛 is a fixed and specified value referred to as “initial value”.
□
An initial value problem for the example one-dimensional system discussed above would be the
following
�̇� = 𝑥, 𝑥(0) = 2 (14)
In words (14) specifies the set of functions for which the rate of change �̇�(𝑡) at each point in time 𝑡 ∈ 𝐼
corresponds to the value of the function 𝑥(𝑡) at the same time point and for which the value at time-point
𝑡0 ≔ 0 is given by 2. Based on the solution above, we find that
𝑦(0) = 2 ⇔ 𝑐 exp(0) = 2 ⇔ 𝑐 ⋅ 1 = 2 (15)
Thus, the function
𝑦 ∶ 𝐼 → ℝ, 𝑡 ↦ 𝑦(𝑡) ≔ 2exp(𝑡) (16)
satisfies the ODE �̇� = 𝑥 and the initial value 𝑥(0) = 2.
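This verification can also be performed numerically. A minimal sketch (in Python; the finite-difference check and all numerical choices are ours, not part of the text):

```python
import numpy as np

def y(t):
    # candidate solution of the IVP dx/dt = x, x(0) = 2
    return 2.0 * np.exp(t)

# the initial value is satisfied
print(y(0.0))  # 2.0

# check dy/dt = y via central finite differences on a grid
t = np.linspace(-1.0, 1.0, 201)
h = 1e-6
dy = (y(t + h) - y(t - h)) / (2 * h)  # numerical derivative of y
print(np.max(np.abs(dy - y(t))))       # close to zero
```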
One-dimensional initial value problems can be visualized by plotting the derivative of the function 𝑥
specified by 𝑓 as linear slope at a number of locations in the (𝑡, 𝑥) plane (Figure 1). An intuition about the
solutions to initial value problems can then be gained by conceiving the resulting vector field as the flow
profile of a liquid, and the initial condition as the location of where a floating particle is added to the fluid.
Over time, this particle will move in the (𝑡, 𝑥) plane, and the solution of the initial value problem specifies
the route of this movement.
Figure 1. Visualization of one-dimensional initial value problems. The left panel depicts the flow field corresponding to the ODE of the example �̇� = 𝑥 discussed in the main text, together with the solution path of an initial value problem where 𝑡0 ≔ 0 and 𝑥(𝑡0) ≔ 2. Note that the direction of the flow at a given location in the (𝑡, 𝑥) plane is a function of 𝑥 only, as evident from the definition 𝑓(𝑡, 𝑥) ≔ 𝑥 of the ODE. The right panel depicts a more intricate flow field, which according to the definition 𝑓(𝑡, 𝑥) ≔ 𝑥² − 𝑡 is a function of both 𝑡 and 𝑥. Here, a solution for the initial condition 𝑡0 ≔ −1.5 and 𝑥(𝑡0) ≔ −1.8 is depicted.
(3) Numerical approaches for initial value problems
The mathematical theory of ODEs and initial value problems is primarily concerned with (a) finding
and characterizing analytical approaches to solve ODEs and initial value problems, and (b) establishing
analytic criteria under which solutions of ODEs exist and to determine whether these solutions are unique.
Real-world applications of ODEs, however, often result in rather complicated dynamical systems, for which
often no analytical treatment exists. In these cases, numerical procedures can be used to evaluate
trajectories described by initial value problems. In this section, we introduce a number of basic numerical
approaches. We here consider the initial value problem for systems of first-order ODEs in the following
formulation
�̇�(𝑡) = 𝑓(𝑡, 𝑥(𝑡)) for all 𝑡 ∈ [𝑎, 𝑏], 𝑥(𝑎) = 𝑥𝑎 and 𝑓 ∶ [𝑎, 𝑏] × ℝ𝑛 → ℝ𝑛 (1)
Euler methods
To introduce the explicit and modified Euler methods for the numerical solution of initial value
problems of the form (1), we specifically consider the scalar case, i.e. 𝑛 = 1. Numerical approaches are
characterized by the fact that they replace continuous time by discrete time: instead of generating
solutions for the uncountably many values 𝑡 ∈ [𝑎, 𝑏] ⊂ ℝ, they typically generate solutions for a
discrete set of “support points” 𝑡𝑘, where 𝑘 = 0,… ,𝑚 with 𝑚 ∈ ℕ and 𝑚 < ∞. We thus consider a
discretization of the interval 𝐼 of the form
𝐼Δ ≔ {𝑎 =: 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑚 ≔ 𝑏} (2)
To simplify proceedings, we define the distance between two adjacent support points by
ℎ𝑘 ≔ 𝑡𝑘+1 − 𝑡𝑘 (𝑘 = 0,1,… ,𝑚 − 1) (3)
Note that the ℎ𝑘 are not all required to be the same, i.e. the support points need not be “equidistant”. This is,
however, often the case in implementations of numerical methods. The central idea of basically all
numerical approaches for the solution of initial value problems is to find good approximations for the values
𝑥(𝑡) of the sought solution function in the form of values 𝑥𝑘 at the support points 𝑡𝑘. In other words,
numerical approaches find values 𝑥𝑘 such that
𝑥𝑘 ≈ 𝑥(𝑡𝑘) for 𝑘 = 1,… ,𝑚 (4)
For 𝑘 = 0, the initial condition 𝑥(𝑡0) = 𝑥(𝑎) = 𝑥𝑎 is usually employed. Recursion equations for the
remaining 𝑥𝑘 are then usually motivated from the perspective of Taylor approximations. Recall that for small
ℎ𝑘 = 𝑡𝑘+1 − 𝑡𝑘 a first-order Taylor approximation of 𝑥(𝑡𝑘+1) is given by
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + �̇�(𝑡𝑘)(𝑡𝑘+1 − 𝑡𝑘) (5)
= 𝑥(𝑡𝑘) + ℎ𝑘�̇�(𝑡𝑘)
In words: the value of the sought function 𝑥 at the support point 𝑡𝑘+1 is approximately equal to the value of
the function at the previous support point plus the derivative (“slope”) of the function 𝑥 at support point 𝑡𝑘
multiplied by the step-size ℎ𝑘. Importantly, the derivative of 𝑥 at the support point 𝑡𝑘 is specified by the ODE
of the initial value problem (1), such that we have
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + ℎ𝑘𝑓(𝑡𝑘 , 𝑥(𝑡𝑘)) (6)
In (6), the values 𝑥(𝑡𝑘+1) and 𝑥(𝑡𝑘) are unknown. If one replaces them by approximations 𝑥𝑘+1 and 𝑥𝑘,
respectively, one obtains the explicit Euler method algorithm
1. Set 𝑥0 ≔ 𝑥𝑎
2. For 𝑘 = 0,… ,𝑚 − 1 set
𝑥𝑘+1 ≔ 𝑥𝑘 + ℎ𝑘𝑓(𝑡𝑘, 𝑥𝑘) (7)
Figure 2 depicts the application of Euler’s method for the one-dimensional initial value problem discussed in
the previous section.
Figure 2. Application of the Euler method algorithm for the numerical solution of the initial value problem �̇� = 𝑥 with 𝑡0 ≔ 0 and 𝑥(𝑡0) = 2. Note that the deviation between the analytical and numerical solutions decreases as the number of support points 𝑚 increases.
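The explicit Euler recursion translates directly into code. The following sketch applies it to the example problem ẋ = x, x(0) = 2 on [0, 1] with equidistant support points and compares the result to the analytical solution 2 exp(t). It is written in Python rather than the Matlab used elsewhere in PMFN, and the function name and parameter choices are our own:

```python
import numpy as np

def explicit_euler(f, a, b, xa, m):
    """Explicit Euler method for the IVP x'(t) = f(t, x(t)), x(a) = xa,
    using m equidistant steps on [a, b]."""
    t = np.linspace(a, b, m + 1)  # support points t_0, ..., t_m
    h = (b - a) / m               # constant step size h_k = h
    x = np.empty(m + 1)
    x[0] = xa                     # initial condition x_0 := x_a
    for k in range(m):
        x[k + 1] = x[k] + h * f(t[k], x[k])
    return t, x

# example IVP: x' = x, x(0) = 2, with analytical solution 2*exp(t)
t, x = explicit_euler(lambda t, x: x, 0.0, 1.0, 2.0, 100)
err = np.max(np.abs(x - 2.0 * np.exp(t)))
print(err)  # global error shrinks roughly linearly in the step size
```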
74
Modified Euler method
A fundamental property of Euler’s method is that for an interval [𝑡𝑘 , 𝑡𝑘+1] the derivative at the left
interval boundary is used as the basis for the approximation. An alternative approach is to consider the
derivative at the center of the interval, i.e. to use the derivative at 𝑡𝑘 + ℎ𝑘/2. Using the same approach as
above, a first-order Taylor approximation of 𝑥(𝑡𝑘+1) takes the form
𝑥(𝑡𝑘+1) ≈ 𝑥(𝑡𝑘) + �̇�(𝑡𝑘 + ℎ𝑘/2)(𝑡𝑘+1 − 𝑡𝑘) = 𝑥(𝑡𝑘) + ℎ𝑘 �̇�(𝑡𝑘 + ℎ𝑘/2) = 𝑥(𝑡𝑘) + ℎ𝑘 𝑓(𝑡𝑘 + ℎ𝑘/2, 𝑥(𝑡𝑘 + ℎ𝑘/2)) (8)
Replacing the unknown values 𝑥(𝑡𝑘) and 𝑥(𝑡𝑘+1) by the approximations 𝑥𝑘 and 𝑥𝑘+1 and evaluating
the unknown value 𝑥(𝑡𝑘 + ℎ𝑘/2) by means of Euler’s method, i.e., by setting

𝑥(𝑡𝑘 + ℎ𝑘/2) ≈ 𝑥𝑘 + (ℎ𝑘/2) 𝑓(𝑡𝑘, 𝑥𝑘) (9)
then yields the “modified Euler method” algorithm
1. Set 𝑥0 ≔ 𝑥𝑎
2. For 𝑘 = 0,… ,𝑚 − 1 set
𝑥𝑘+1 ≔ 𝑥𝑘 + ℎ𝑘 𝑓(𝑡𝑘 + ℎ𝑘/2, 𝑥𝑘 + (ℎ𝑘/2) 𝑓(𝑡𝑘, 𝑥𝑘)) (10)
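A corresponding sketch of the modified Euler recursion for the same example problem (again Python, with our own naming); because the slope at the interval midpoint is used, the deviation from the analytical solution is considerably smaller than for the explicit Euler method at the same number of support points:

```python
import numpy as np

def modified_euler(f, a, b, xa, m):
    """Modified (midpoint) Euler method for x'(t) = f(t, x(t)), x(a) = xa."""
    t = np.linspace(a, b, m + 1)
    h = (b - a) / m
    x = np.empty(m + 1)
    x[0] = xa
    for k in range(m):
        # Euler predictor for the midpoint value x(t_k + h/2)
        x_mid = x[k] + 0.5 * h * f(t[k], x[k])
        # step with the slope evaluated at the interval midpoint
        x[k + 1] = x[k] + h * f(t[k] + 0.5 * h, x_mid)
    return t, x

# same example IVP as before: x' = x, x(0) = 2
t, x = modified_euler(lambda t, x: x, 0.0, 1.0, 2.0, 100)
err = np.max(np.abs(x - 2.0 * np.exp(t)))
print(err)  # much smaller than the explicit Euler error for the same m
```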
Study Questions
1. What do differential equations specify?
2. Verbally discuss the notions of (1) ordinary differential equations, (2) partial differential equations, and (3) stochastic differential
equations.
3. What is the order of the ordinary differential equation 𝐹 = 𝑚�̈�(𝑡), known as Newton’s law?
4. Formulate the general initial value problem for a system of first-order differential equations.
5. What is the difference between an analytical solution and a numerical solution of an initial value problem?
Study Questions Answers
1. Differential equations specify sets of functions.
2. Ordinary differential equations (ODEs) are characterized by the fact that only derivatives with respect to a single scalar input
variable (usually modelling time or space) appear. ODEs are thus used to model systems that evolve either in time or space. Partial
differential equations (PDEs) are characterized by the fact that they comprise partial derivatives, i.e. derivatives with respect to two
or more scalar input variables. Often, these variables refer to time and space, and PDEs are used to model systems that evolve in
both time and space. Both ODEs and PDEs specify sets of “deterministic” functions (all functions are deterministic by definition), i.e.
they do not account for random fluctuations in the evolution of the system that they model. Stochastic differential equations, which
can be either ordinary or partial, explicitly take random innovations in the dynamics of systems into account. The solutions of
stochastic differential equations, if they exist, are stochastic processes.
3. The highest derivative occurring in the differential equation is of second-order, Newton’s law is thus a second-order ordinary
differential equation.
4. Let 𝐼 ⊆ ℝ be an interval of the real line and 𝐷 ⊆ ℝ𝑛 a subset of the 𝑛-dimensional space of vectors with real entries. Let
𝑥 ∶ 𝐼 → 𝐷, 𝑡 ↦ 𝑥(𝑡) denote a function (or, more specifically, a curve in ℝ𝑛) and let the first derivative of 𝑥 be denoted by �̇�. Further,
let 𝑓 ∶ 𝐼 × 𝐷 → ℝ𝑛, (𝑡, 𝑥) ↦ 𝑓(𝑡, 𝑥) denote a vector-valued function of a real and a vector-valued argument specifying the first-order
system of differential equations �̇� = 𝑓(𝑡, 𝑥) and let 𝑡0 ∈ 𝐼 be a fixed and specified input argument of the function 𝑥. Then an initial
value problem for a system of first-order ODEs corresponds to the task of finding a solution 𝑦 ∶ 𝐼 → ℝ𝑛 of the system �̇� = 𝑓(𝑡, 𝑥) which
satisfies the initial value 𝑥(𝑡0) = 𝑥0, where 𝑥0 ∈ ℝ𝑛 is a fixed and specified value referred to as “initial value”.
5. An analytical solution of an initial value problem is derived by theoretical considerations. A numerical solution is derived by
approximating the solution function of an initial value problem with the help of a computer which implements a recursive algorithm.
An Introduction to Fourier Analysis
The ability to represent time- or space-domain signals in the frequency domain is a fundamental
aspect of modern signal processing. In functional neuroimaging, the representation of frequency content is
ubiquitous: methodological approaches in the analysis of neural oscillations in cognitive tasks, the
implementation of filters in EEG data processing, the visualization of electromagnetic MRI signals, the
analysis of FMRI resting-state networks, as well as the analysis of task-related FMRI data in the context of
the GLM all involve the transformation of signals from the time- or space-domain to the frequency domain.
Representing signals as linear combinations of sine and cosine functions has its origins in the theory of partial
differential equations, most prominently in Fourier’s work on heat diffusion [Fourier (1822) “Théorie
analytique de la chaleur”]. The modern reliance on frequency decompositions in digital signal processing
owes much to the development of the fast Fourier transform (FFT) algorithm [Cooley & Tukey (1965) “An
algorithm for the machine calculation of complex Fourier series”]. In this and the following sections, we
develop the theory of Fourier analysis following [Weaver (1983) “Applications of discrete and continuous
Fourier Analysis”]. We eschew a discussion of the origins of Fourier analysis in the theory of partial
differential equations and instead focus on the notion of frequency and phase content of time- or space-
domain signals. To this end, we first develop the representation of a univariate real-valued function by
means of a Fourier series. We then discuss alternative representations of the real Fourier series (the
complex-exponential and polar forms) that are less intuitive, but more closely related to technical
implementations of Fourier transforms. Finally, we consider the notions of a Fourier transform of a function,
and, from the perspective of digital signal processing more importantly, the concept of the discrete Fourier
transform. We close with a discussion of the fast Fourier transform algorithm. Fourier analysis rests on a
number of mathematical preliminaries (e.g. sequences and series, the trigonometric functions, complex
numbers, Riemann integration) on which we base our discussion.
(1) Generalized cosine and sine functions
In this subsection we consider the fundamental building blocks of Fourier analysis, namely the
following two functions, referred to here as the “generalized” sine and cosine functions:
𝑓:ℝ → [−𝑎, 𝑎], 𝑥 ↦ 𝑓(𝑥) ≔ 𝑎 cos(2𝜋𝜔𝑥 − 𝜑) (1)
𝑔:ℝ → [−𝑏, 𝑏], 𝑥 ↦ 𝑔(𝑥) ≔ 𝑏 sin(2𝜋𝜔𝑥 − 𝜙) (2)
We discuss the components of (1) and (2) in turn, starting from the behavior of “pure” sine and cosine
functions.
The factors 𝑎 and 𝑏 in the definition of the generalized cosine and sine functions are referred to as
“amplitudes”. They are simply constants that scale the height of the cosine and sine functions and cause
them to vary between −𝑎 and 𝑎, and between −𝑏 and 𝑏, rather than between −1 and +1, respectively. Note that the
pre-multiplication with 𝑎 or 𝑏 does not alter the fact that sine and cosine repeat themselves every 2𝜋 and
does not alter their zero crossings.
We now consider the terms cos(2𝜋𝜔𝑥) and sin(2𝜋𝜔𝑥) in the definition of the generalized cosine
and sine functions. To simplify the discussion we define 𝜇 ≔ 2𝜋𝜔. 𝜇 is called the “radial frequency” of the
sine and cosine function and is a measure of how often the functions repeat themselves with respect to
units of 𝜋. Specifically, cos(𝜇𝑥) and sin(𝜇𝑥) repeat themselves every 2𝜋/𝜇, or, in other words, perform 𝜇
revolutions within every 2𝜋 (Figure 1, left panels). Usually, oscillations are not expressed with respect to 𝜋,
but with respect to a temporal unit such as seconds, or a spatial unit such as meters. To this end, 𝜔, the
so-called “circular frequency” of cosine and sine allows for expressing these functions in a meaningful way
without an 𝑥-axis measured in units of 𝜋. Specifically, if measured in circular frequency, the period of the
cosine and sine functions is 1/𝜔, i.e. they perform a full revolution every 1/𝜔 𝑥-units (Figure 1 right panels).
In other words, the period 𝑇 of sin(2𝜋𝜔𝑥) is defined as the number of 𝑥-units required to complete one
cycle of a sine function. It is given by 𝑇 = 1/𝜔. If the variable 𝑥 represents a time variable, 𝜔 is measured in
1/sec, referred to as “Hertz”. When the variable 𝑥 represents a spatial measurement, the period 𝑇 is called
the wavelength and denoted by 𝜆.
Figure 1 Radial and circular frequency for the sine function. If expressed in the form 𝑠𝑖𝑛(𝜇𝑥), 𝜇 is referred to as “radial frequency” and expresses the number of revolutions of the sine function over an interval of length 2𝜋. In other words, the sine function repeats itself every 2𝜋/𝜇. Note that the radial frequency is useful if the x-axis is measured in units of 𝜋. Usually, the x-axis is not measured in units of 𝜋, but in some physical measure, such as seconds or meters. In this case, the sine function is sensibly expressed as 𝑠𝑖𝑛(2𝜋𝜔𝑥), where 𝜔 is referred to as the “circular frequency” and measures the number of revolutions over an interval of length 1. If 𝑥 is measured in seconds, the number of revolutions per second is referred to as “Hertz”. 𝑇 = 1/𝜔 is referred to as the period and measures the number of 𝑥-units required for a full revolution of the sine function.
Finally, we consider the terms 𝜙 and 𝜑. For 𝑥 = 0 we have sin(2𝜋𝜔𝑥) = 0 and cos(2𝜋𝜔𝑥) = 1. It
is often convenient to be able to shift the functions so that at 𝑥 = 0 they may take on other values. This is
accomplished by the addition of a “phase” term to the argument. For example, the function sin(2𝜋𝜔𝑥 − 𝜑)
is the same as sin(2𝜋𝜔𝑥) shifted to the right by an amount 𝜑/(2𝜋𝜔). Because a sine function can be transformed
into a cosine function by shifting it by 𝜋/2 either to the right or to the left, and vice versa, phase terms are
usually referred to as “phase angles” and are considered only in the interval [−𝜋/2, 𝜋/2]. In Figure 2 below, we
depict generalized sine functions with different phase angles and identical amplitude 𝑎 = 1 and circular
frequency 𝜔 = 1/(2𝜋), i.e. generalized sine functions of the form sin(𝑥 − 𝜙).
Figure 2. Influence of the phase parameter ϕ of a generalized sine function. Note that inclusion of the additive term −ϕ for ϕ ∈ [0, π/2] shifts the original function sin(x) to the right and that the function sin(x − π/2) is equivalent to the negative cosine function −cos(x).
(2) Linear combinations of generalized cosine and sine functions
Using a number 𝑛 ∈ ℕ of generalized cosine and sine functions with phase 0, one can create a “linear
combination of generalized cosine and sine functions” of the form

𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) ≔ ∑_{𝑘=0}^{𝑛−1} (𝑎𝑘 cos(2𝜋𝜔𝑘𝑥) + 𝑏𝑘 sin(2𝜋𝜔𝑘𝑥)) (1)
where we introduced 𝑛 amplitude coefficients 𝑎𝑘 (𝑘 = 0,1,…,𝑛−1) for the cosine functions, 𝑛 amplitude
coefficients 𝑏𝑘 (𝑘 = 0,1,…,𝑛−1) for the sine functions, and 𝑛 circular frequency values 𝜔𝑘 (𝑘 = 0,1,…,𝑛−1)
which are associated with the respective amplitude coefficients. Linear combinations of
generalized cosine and sine functions are a very versatile formulation which allows for representing a wide
variety of functions.
We illustrate this concept with two examples. For a first example (Figure 1), let 𝑛 = 2,𝜔0 =
10, 𝑎0 = 0, 𝑏0 = 3,𝜔1 = 15, 𝑎1 = 2, 𝑏1 = 0. Then 𝑓 is given by
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) = 0 cos(2𝜋10𝑥) + 3 sin(2𝜋10𝑥) + 2 cos(2𝜋15𝑥) + 0 sin(2𝜋15𝑥) (2)
In Figure 1 we plot the sine and cosine functions comprising this example and the resulting sum. Note that
the resulting sum is a periodic function, but does not have the “simple” form of a sine or cosine.
As a second example (Figure 2), let 𝑛 = 3, 𝜔0 = 1, 𝜔1 = 4, 𝜔2 = 10, 𝑎0 = 2, 𝑎1 = 4, 𝑎2 = 0, 𝑏0 = 0, 𝑏1 = −1, 𝑏2 = 3. Then 𝑓 is given by

𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) = 2 cos(2𝜋1𝑥) + 0 sin(2𝜋1𝑥) + 4 cos(2𝜋4𝑥) − 1 sin(2𝜋4𝑥) + 0 cos(2𝜋10𝑥) + 3 sin(2𝜋10𝑥) (3)
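The second example can be evaluated directly in code. A minimal Python sketch (the helper `lincomb` is our own construction; the coefficient values are those given in the text):

```python
import numpy as np

def lincomb(x, omegas, a, b):
    """Evaluate sum_k a_k*cos(2*pi*omega_k*x) + b_k*sin(2*pi*omega_k*x)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for w, ak, bk in zip(omegas, a, b):
        out += ak * np.cos(2 * np.pi * w * x) + bk * np.sin(2 * np.pi * w * x)
    return out

# example 2: n = 3, omega = (1, 4, 10), a = (2, 4, 0), b = (0, -1, 3)
x = np.linspace(0.0, 1.0, 1001)
fx = lincomb(x, [1, 4, 10], [2, 4, 0], [0, -1, 3])
print(fx[0])   # f(0) = 2 + 4 + 0 = 6
print(fx[-1])  # f(1) equals f(0), since all frequencies are integers
```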
Figure 1 Linear combinations of generalized cosine and sine functions, example 1.
Figure 2 Linear combinations of generalized cosine and sine functions, example 2.
(3) The real Fourier series
In the previous section we have introduced the generalized cosine and sine functions and have seen how
these can be combined by choosing their frequencies and amplitudes to yield quite arbitrary functions. In
the current section we consider the reverse approach: we start from an arbitrary function 𝑓 and ask which
frequencies and coefficients we may choose for generalized cosine and sine functions to reconstruct 𝑓 by an
infinite sum of generalized cosine and sine functions. More specifically, in this section, we consider the
following problem: Given a function 𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥), can we find real numbers 𝑎𝑘, 𝑏𝑘, 𝜔𝑘 such that 𝑓(𝑥)
can be represented by an infinite sum of generalized cosine and sine functions for all 𝑥 ∈ ℝ? In other words:
can we find 𝑎𝑘 , 𝑏𝑘, 𝜔𝑘 ∈ ℝ such that
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋𝜔𝑘𝑥) + 𝑏𝑘 sin(2𝜋𝜔𝑘𝑥)) for all 𝑥 ∈ ℝ (1)
To simplify proceedings, we will focus on functions for which we assume that they can be expressed in the
Fourier series form above, i.e., those functions for which the Fourier series converges for all 𝑥 ∈ ℝ.
Functions for which this is the case are said to satisfy the so-called “Dirichlet conditions”. The first Dirichlet
condition requires the function 𝑓 to be defined on the domain ℝ and periodic with period 𝑇, i.e.
𝑓(𝑥 + 𝑇) = 𝑓(𝑥) for all 𝑥 ∈ ℝ (2)
Understanding the further Dirichlet conditions requires some familiarity with basic mathematical concepts
such as continuity, which are not covered in PMFN. We hence omit a further discussion of them here (the
interested reader is referred to [Weaver, 1983] for a full discussion of the Dirichlet conditions). Instead we
focus on the derivation of formulas which, given the specification of a periodic function 𝑓 with period 𝑇,
yield the values of 𝑎𝑘, 𝑏𝑘 and 𝜔𝑘 for 𝑘 = 0, 1, 2, … in equation (1). We will state these formulas next and then
discuss how they are obtained.
The frequencies 𝜔𝑘 for 𝑘 = 0,1,2,… in equation (1) are given by
𝜔𝑘 = 𝑘/𝑇 (3)
where 𝑇 is the period of 𝑓. The cosine coefficients are given by the definite integrals
𝑎0 = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 and 𝑎𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (4)
The sine coefficients are given by
𝑏0 = 0 and 𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (5)
In words: Given a periodic function 𝑓 with period 𝑇, the circular frequencies 𝜔𝑘 in expressions (1) are merely
the integer indices 𝑘 = 0,1,2,… divided by the period of 𝑓. Note that all these frequencies are integer
multiples of a basic frequency 𝜔 ≔ 1/𝑇. The cosine coefficient associated with the frequency 𝜔0 = 0, i.e. a
flat line, is given by the integral of the function 𝑓 on the interval [−𝑇/2, 𝑇/2], while all other cosine
coefficients 𝑎𝑘 , 𝑘 = 1,2, … are given by the integral of the function 𝑓 multiplied by a cosine term of circular
frequency 𝑘/𝑇. The sine coefficient associated with 𝑘 = 0, which corresponds to a sine wave of circular
frequency zero 𝜔0 = 0, i.e. a flat line, is 𝑏0 = 0, i.e., this term vanishes. All further sine coefficients
𝑏𝑘 , 𝑘 = 1,2, … are given by the integral of the function 𝑓 multiplied by a sine term of circular frequency 𝑘/𝑇
on [−𝑇/2, 𝑇/2]. To make the statements (3) – (5) explicit, we substitute them in equation (1) and obtain
𝑓(𝑥) = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∑_{𝑘=1}^{∞} [((2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥) cos(2𝜋(𝑘/𝑇)𝑥) + ((2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥) sin(2𝜋(𝑘/𝑇)𝑥)] (6)
Note that the right hand side of the above is fully specified in terms of the function 𝑓, its period 𝑇, the
cosine and sine functions, the evaluation of infinitely many integrals, and an infinite summation.
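To make the coefficient formulas concrete, the following Python sketch evaluates the integrals (4) and (5) by midpoint-rule quadrature for a T-periodic square wave (our own worked example, not from the text; for this function the classical result is b_k = 4/(πk) for odd k, with all other coefficients vanishing):

```python
import numpy as np

T = 2.0  # period of the example function

def f(x):
    # T-periodic square wave: +1 on (0, T/2), -1 on (-T/2, 0)
    return np.sign(np.sin(2 * np.pi * x / T))

def fourier_coeffs(f, T, k_max, n=20000):
    """Approximate the coefficients a_k, b_k of equations (4) and (5)
    by midpoint-rule quadrature on [-T/2, T/2]."""
    dx = T / n
    x = -T / 2 + (np.arange(n) + 0.5) * dx  # midpoint grid
    fx = f(x)
    a = np.empty(k_max + 1)
    b = np.empty(k_max + 1)
    a[0] = np.sum(fx) * dx / T              # a_0 = (1/T) * integral of f
    b[0] = 0.0                               # b_0 := 0 by convention
    for k in range(1, k_max + 1):
        a[k] = 2 / T * np.sum(fx * np.cos(2 * np.pi * k * x / T)) * dx
        b[k] = 2 / T * np.sum(fx * np.sin(2 * np.pi * k * x / T)) * dx
    return a, b

a, b = fourier_coeffs(f, T, 5)
print(np.round(b, 4))  # b_k close to 4/(pi*k) for odd k, 0 otherwise
```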
Proof of equations (4) and (5)
We now verify that the formulas given in (3) – (5) indeed fulfill equation (1) for the circular frequencies 𝜔𝑘 = 𝑘/𝑇, 𝑘 = 0,1,2,…. To do so, we require the following “orthogonality” properties of sine and cosine products for 𝑘, 𝑗 ∈ ℕ0:
∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 for 𝑘 ≠ 𝑗, and = 𝑇/2 for 𝑘 = 𝑗 ≠ 0 (4.1)

∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 for 𝑘 ≠ 𝑗, and = 𝑇/2 for 𝑘 = 𝑗 ≠ 0 (4.2)

∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 0 (4.3)
While these properties can readily be confirmed in a formal way based on the series definitions of sine and cosine and the notions of even and odd functions, they are perhaps more intuitively appreciated by inspection of Figure 1 and the intuition of the definite integral as the signed area under a function’s graph. A formal derivation of these orthogonality properties is given, for example, in Coleman MP (2013) “An Introduction to Partial Differential Equations with Matlab”, CRC Press (pp. 84 – 86).
Figure 1 Orthogonality properties of sine and cosine functions. For 𝑘, 𝑗 ∈ {1,2} the panels depict product functions of sine and cosine functions. From the intuition of the integral as the signed area between a function’s graph and the x-axis (dashed line), the orthogonality properties (4.1) – (4.3) may be appreciated.
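The orthogonality properties (4.1) – (4.3) can also be checked numerically. A small Python sketch using midpoint-rule quadrature over one period (the grid size and period are arbitrary choices of ours):

```python
import numpy as np

T = 2.0
n = 10000
dx = T / n
x = -T / 2 + (np.arange(n) + 0.5) * dx  # midpoint grid on [-T/2, T/2]

def inner(u, v):
    # midpoint-rule approximation of the integral of u*v on [-T/2, T/2]
    return np.sum(u * v) * dx

c = lambda k: np.cos(2 * np.pi * k * x / T)
s = lambda k: np.sin(2 * np.pi * k * x / T)

print(inner(c(2), c(3)))  # close to 0 (k != j)
print(inner(c(2), c(2)))  # close to T/2 (k = j != 0)
print(inner(s(1), s(1)))  # close to T/2
print(inner(s(1), c(1)))  # close to 0 (sine-cosine products)
```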
To derive the expression for the cosine coefficients in equation (4), we first substitute the frequencies 𝜔𝑘 = 𝑘/𝑇 on the right hand side of equation (1), which yields
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) (4.4)
We next choose an arbitrary 𝑗 ∈ ℕ and multiply both sides of equation (4.4) by cos(2𝜋(𝑗/𝑇)𝑥), yielding
𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) = (∑_{𝑘=0}^{∞} 𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) cos(2𝜋(𝑗/𝑇)𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥)) (4.5)
We next integrate both sides on the interval [−𝑇/2, 𝑇/2]:
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥)) 𝑑𝑥 (4.6)
Assuming that we may exchange the order of integration and summation on the right hand side of the above, we then obtain with the linearity property of the definite integral
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} (𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 + 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥) (4.7)
From the orthogonality properties of sine and cosine functions, we now know that the latter terms involving sine and cosine of either the same (for 𝑘 = 𝑗) or different (for 𝑘 ≠ 𝑗) frequencies evaluate to zero and we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} 𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (4.8)
We further know that all integrals with 𝑘 ≠ 𝑗 in the infinite sum on the right-hand side above evaluate to zero, and we thus have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑎𝑗 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑗/𝑇)𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑎𝑗 𝑇/2 (4.9)
Dividing by 𝑇/2, we thus obtain

𝑎𝑗 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (4.10)

which, upon exchanging the arbitrary index 𝑗 for 𝑘, yields equation (4). For the special case 𝑗 = 0, we have in equation (4.8)
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(0/𝑇)𝑥) 𝑑𝑥 = 𝑎0 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(0/𝑇)𝑥) cos(2𝜋(0/𝑇)𝑥) 𝑑𝑥
⇔ ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 = 𝑎0 ∫_{−𝑇/2}^{𝑇/2} 1 𝑑𝑥 = 𝑎0 (𝑇/2 − (−𝑇/2)) = 𝑎0𝑇
⇔ 𝑎0 = (1/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (4.11)
and thus obtained equation (4).
Likewise, we may obtain the formula for the sine coefficients 𝑏𝑘. We first substitute 𝜔𝑘 = 𝑘/𝑇 for 𝑘 = 1,2,… and multiply both sides of equation (1) by sin(2𝜋(𝑗/𝑇)𝑥) for 𝑗 ∈ ℕ. We then integrate both sides from −𝑇/2 to 𝑇/2. In this case all terms involving 𝑎𝑘 vanish and in the remaining infinite sum only the term for 𝑗 = 𝑘 remains. As the corresponding right hand side of (4.10) now involves the integral of 𝑓(𝑥) multiplied by sin(2𝜋(𝑗/𝑇)𝑥), we find equation (5).
In detail, we first substitute the frequencies 𝜔𝑘 = 𝑘/𝑇 on the right hand side of equation (1)
𝑓(𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) (5.1)
We next multiply both sides of the above by sin(2𝜋(𝑗/𝑇)𝑥), where we choose an arbitrary 𝑗 ∈ ℕ0:
𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) = (∑_{𝑘=0}^{∞} 𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥)) sin(2𝜋(𝑗/𝑇)𝑥) = ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥)) (5.2)
Next we integrate both sides of the above on the interval [−𝑇/2, 𝑇/2]:
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} ∑_{𝑘=0}^{∞} (𝑎𝑘 cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) + 𝑏𝑘 sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥)) 𝑑𝑥 (5.3)
Again assuming that we may exchange the order of integration and summation on the right hand side of (5.3) above, we then obtain with the linearity property of the definite integral
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} (𝑎𝑘 ∫_{−𝑇/2}^{𝑇/2} cos(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 + 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥) (5.4)
As in the evaluation of the cosine coefficients, we know from the orthogonality properties of the sine and cosine functions, that all the integrals involving a multiplication of sine and cosine evaluate to zero, and thus we obtain
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = ∑_{𝑘=0}^{∞} 𝑏𝑘 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑘/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.5)
We also know that all integrals involving the multiplication of sin(2𝜋(𝑘/𝑇)𝑥) with sin(2𝜋(𝑗/𝑇)𝑥) for which 𝑘 ≠ 𝑗 evaluate to zero, thus the infinite sum in (5.5) simplifies to a single term, i.e. the case 𝑘 = 𝑗 for our chosen 𝑗 ∈ ℕ0. If 𝑗 > 0, we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑏𝑗 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(𝑗/𝑇)𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.6)
and thus
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 = 𝑏𝑗 𝑇/2 ⇔ 𝑏𝑗 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑗/𝑇)𝑥) 𝑑𝑥 (5.7)
By changing the arbitrary index 𝑗 on the right hand side of the above to 𝑘, we have
𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 for 𝑘 = 1,2,3,… (5.8)
and thus derived equation (5). If 𝑗 = 0, we have
∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(0/𝑇)𝑥) 𝑑𝑥 = 𝑏0 ∫_{−𝑇/2}^{𝑇/2} sin(2𝜋(0/𝑇)𝑥) sin(2𝜋(0/𝑇)𝑥) 𝑑𝑥 ⇔ 0 = 𝑏0 ⋅ 0 (5.9)
and we may arbitrarily define 𝑏0, which we do by setting 𝑏0 = 0.
□
In summary, the expressions for the Fourier series cosine and sine coefficients (4) and (5) result from
choosing circular frequencies as integer multiples of 𝜔 = 1/𝑇, multiplication of equation (1) by either a
cosine or a sine term, integration over the interval −𝑇/2 to 𝑇/2 and the orthogonality properties of the
cosine and sine functions. In the next section, we clarify why for periodic functions with period 𝑇 the interval
of integration −𝑇/2 to 𝑇/2 is a sensible choice.
(4) Periodic and protracted functions
When introducing the formulas for the sine and cosine frequencies and amplitude coefficients for a
Fourier series of a function 𝑓 in the previous section, we required that this function was periodic with period
𝑇. We also performed the respective integrals from −𝑇/2 to 𝑇/2 which implied that a period of the function
is centered at 𝑥 = 0. In this section, we relax these conditions and introduce the notion of a “protracted
function”. In data analytical problems one is usually interested in functions over a finite domain (for example
the event-related potential for a fixed peristimulus time window in EEG data analysis, or the BOLD time-
course of a voxel over an experimental run). Above the Fourier series was introduced based on functions
over the infinite domain ℝ, but with period 𝑇. If we have a function with finite domain, which is not periodic,
we may create from it a periodic function with infinite domain ℝ by simply putting many identical copies of
the function of finite domain next to each other. This new function, called the “protracted function of 𝑓“ is
then periodic, where the period 𝑇 corresponds to the length of the finite domain of the original function,
and, by construction, has an infinite domain. In other words: from a non-periodic function with finite domain
it is easy to create a periodic function with infinite domain, which may be accessible to Fourier series
representation as in equation (1). There are mathematical subtleties associated with this approach, like the
introduction of discontinuities in the protracted function; however, we will not discuss them here. The
interested reader is referred to [Weaver, 1983] for a mathematically sound discussion of the idea of
protracted functions. We visualize the idea of protraction of a function in Figure 1.
Figure 1 Protraction of a function with finite domain yielding a periodic function with infinite domain.
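The construction of a protracted function is easy to sketch in code by laying identical copies of a finite-domain signal next to each other (a toy Python illustration; the quadratic signal and all parameters are our own choices):

```python
import numpy as np

# a non-periodic "signal" on the finite domain [0, 2) (so T = 2)
T = 2.0
n = 200
x = np.linspace(0.0, T, n, endpoint=False)
signal = x ** 2  # an arbitrary function of finite domain

# protraction: lay identical copies of the signal next to each other,
# yielding a T-periodic function on a (here: five times) larger domain
n_copies = 5
protracted = np.tile(signal, n_copies)

# the protracted function is T-periodic by construction: its values on
# [0, T) and on [T, 2T) coincide
print(protracted.shape)                                 # (1000,)
print(bool(np.allclose(protracted[:n], protracted[n:2 * n])))  # True
```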
The concept of a protracted function also clarifies the integral boundaries −𝑇/2 and 𝑇/2 used in the
previous section: they refer to the left and right endpoint of the period of the function and not necessarily to
the specific values of −𝑇/2 and 𝑇/2. For the example shown in Figure 1, this means that, as the period of
the function is 𝑇 = 2, the Fourier series integrals may be performed from 0 to 2 rather than from −1 to 1.
Note, however, that by construction also the protracted function defined on [−1,1] is a periodic function of
period 𝑇.
More generally, the periodicity of functions with period 𝑇 allows for performing the integrals over
intervals of length 𝑇 centered anywhere without changing the value of the integral. Formally, for a periodic
function 𝑓 with period 𝑇 and an arbitrary number 𝛿 < 𝑇, we have
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (1)
In words the right hand side of equation (1) specifies an integral of the periodic function on a domain of
length 𝑇 shifted from a center at 0 to a center at 𝛿. The equality to the integral of this function on a domain
of length 𝑇 centered at 0 then implies the shift-invariance of integrals over domains of the length of the
period for periodic functions. Figure 2 visualizes this result.
Figure 2 Shift invariance of the integral of periodic functions over intervals of length 𝑇.
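Equation (1) is straightforward to check numerically. A Python sketch using an example T-periodic function and midpoint-rule quadrature (the function and the shift values δ are our own choices):

```python
import numpy as np

T = 2.0

def f(x):
    # an example T-periodic function
    return np.cos(2 * np.pi * x / T) ** 2 + 0.5 * np.sin(2 * np.pi * x / T)

def integral(lo, hi, n=100000):
    # midpoint-rule quadrature of f on [lo, hi]
    x = lo + (np.arange(n) + 0.5) * (hi - lo) / n
    return np.sum(f(x)) * (hi - lo) / n

centered = integral(-T / 2, T / 2)
for delta in (0.3, 0.75, 1.9):
    shifted = integral(-T / 2 + delta, T / 2 + delta)
    print(abs(shifted - centered))  # close to 0 for every shift delta < T
```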
Proof of equation (1)
Equation (1) may be seen as follows. Let 𝛿 < 𝑇 be an arbitrary number. We first note that performing an integral whose lower boundary 𝑏 + 𝛿 is larger than its upper boundary 𝑏 corresponds to the negative of the integral with lower boundary 𝑏 and upper boundary 𝑏 + 𝛿. For 𝑎 < 𝑏 this may be seen with the fundamental theorem of calculus as follows
∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑑𝑥 = 𝐹(𝑏) − 𝐹(𝑎) ⇔ −∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑑𝑥 = 𝐹(𝑎) − 𝐹(𝑏) = ∫_{𝑏}^{𝑎} 𝑓(𝑥) 𝑑𝑥 (1.1)
Then, by inspection of Figure 2, we note that by the linearity of the definite integral
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥) 𝑑𝑥 + ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 (1.2)
where the second term on the right hand side is negative. Further, with 𝑓(𝑥 + 𝑇) = 𝑓(𝑥), we have
∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2+𝛿}^{−𝑇/2} 𝑓(𝑥 + 𝑇) 𝑑𝑥 = ∫_{𝑇/2+𝛿}^{𝑇/2} 𝑓(𝜉) 𝑑𝜉 = −∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝜉) 𝑑𝜉 (1.3)
where we have changed the integration boundaries by substitution of 𝜉 ≔ 𝑥 + 𝑇 in the second equality. We thus have
∫_{−𝑇/2+𝛿}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 − ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 + ∫_{𝑇/2}^{𝑇/2+𝛿} 𝑓(𝑥) 𝑑𝑥 = ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥 (1.4)
We thus have shown that the integral of a periodic function 𝑓 with period 𝑇 over an interval of length 𝑇 may be performed from an arbitrary lower boundary.
□
In summary, the formulas derived in the previous section for the coefficients of the real Fourier series representation can (ignoring mathematical subtleties) be applied to any function with finite domain and to all functions with infinite domain and period 𝑇.
(5) Complex Numbers
In the following we introduce two alternative formulations of the real Fourier series: the “complex
exponential form” and the “polar form”. From an intuitive viewpoint, the Fourier series in its real form is
probably the most accessible, while from a mathematical viewpoint, the complex form is the most convenient and the one most commonly encountered in applications. The complex exponential representation of the Fourier series also
forms the basis for the Fourier transform. Before we can introduce the complex form of the Fourier series,
we will have to briefly discuss the notion of complex numbers, and the relation of cosine, sine, and the
exponential by means of complex numbers and Euler’s identities.
The theory of complex numbers can be motivated from the desire to solve quadratic equations of
the form
ax² + bx + c = 0 (1)
for the unknown variable x and with known constants a, b, c ∈ ℝ. From high-school mathematics we know
that (1) has the general solutions
x_{1,2} = (−b ± √(b² − 4ac)) / (2a) (2)
If for the constants in equation (1) we have b² − 4ac < 0, then (2) requires us to take the square root of a
negative number. This is not defined when computing with real numbers. An example for such a case is the
quadratic equation
x² − 2x + 2 = 0 (3)
where
b² − 4ac = (−2)² − 4 ⋅ 1 ⋅ 2 = 4 − 8 = −4 (4)
To nevertheless define solutions of equations such as (3), one can define the number “𝑖”, the so called
“imaginary unit”, as follows
i ≔ √−1 ⇒ i² = (√−1)² = −1 (5)
In words: 𝑖 is defined as the square root of −1, or, stated differently, the number that if taken to the power
of 2, yields −1. This definition allows for expressing square roots of negative numbers such as −4, as
√−4 = √(−1 ⋅ 4) = √4 ⋅ √−1 = 2i (6)
The quadratic equation above then can be said to have the solutions
x₁ = (2 − √−4)/2 = (2 − 2i)/2 = 1 − i and x₂ = (2 + 2i)/2 = 1 + i (7)
A number of the form 𝑎 + 𝑏𝑖, where 𝑎, 𝑏 ∈ ℝ and 𝑖 = √−1 is called a “complex number”. The set of all
complex numbers is denoted by ℂ, i.e.,
ℂ ≔ {𝑎 + 𝑖𝑏|𝑎, 𝑏 ∈ ℝ, 𝑖 = √−1} (8)
Note that for 𝑏 = 0, we have 𝑎 + 𝑖0 = 𝑎 ∈ ℝ, which shows that the real numbers are a subset of the
complex numbers.
In calculations, complex numbers may be treated just like real numbers, while attending to the fact that i² = −1. For example, we can show that x₂ is a solution of the quadratic equation (3) by substitution as follows
(1 + i)² − 2(1 + i) + 2 = 0 (9)
⇔ 1² + 2 ⋅ 1 ⋅ i + i² − 2 ⋅ 1 − 2 ⋅ i + 2 = 0
⇔ 1 + 2i − 1 − 2 − 2i + 2 = 0
⇔ 0 = 0
We next define the notions of “real and imaginary parts”, “absolute values and arguments”, and “conjugates” of complex numbers. A complex number of the form z = a + ib has two components, a “real part” and an “imaginary part”, which can be written as
𝑅𝑒(𝑧) = 𝑎 and 𝐼𝑚(𝑧) = 𝑏 (10)
The “absolute value” of a complex number is
R = Abs(z) = √(a² + b²) (11)
and the “argument” of a complex number is
θ = Arg(z) = arctan(b/a) (12)
The “(complex) conjugate” of a complex number
𝑧 = 𝑎 + 𝑖𝑏 (13)
is denoted by 𝑧̅ and defined as
𝑧̅ = 𝑎 − 𝑖𝑏 (14)
Notably, the multiplication of a complex number with its conjugate always yields a real number
z z̄ = (a + ib)(a − ib) = a² − aib + aib − i²b² = a² − (√−1)²b² = a² + b² (15)
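Complex arithmetic of this kind can be explored directly with Python’s built-in complex type (a check added for illustration, not part of the original text; the example value z = 3 + 4i is assumed). The snippet verifies the roots from equation (7) and the conjugate product from equation (15):

```python
# verify the roots of x^2 - 2x + 2 = 0 from equation (7)
for root in (1 - 1j, 1 + 1j):
    assert root**2 - 2 * root + 2 == 0

# conjugate, absolute value, and the real-valued product z * conj(z), equation (15)
z = 3 + 4j
assert z.conjugate() == 3 - 4j
assert (z * z.conjugate()).real == 3**2 + 4**2   # a^2 + b^2 = 25
assert (z * z.conjugate()).imag == 0
print("Abs(z) =", abs(z))                        # sqrt(3^2 + 4^2)
```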
(6) Euler’s identities
By means of complex numbers, the exponential, sine and cosine functions can be shown to be
related according to what are known as “Euler’s identities”
exp(𝑖𝑥) = cos(𝑥) + 𝑖 sin(𝑥) and exp(−𝑖𝑥) = cos(𝑥) − 𝑖 sin(𝑥) (1)
which we prove below.
Proof of equation (1)
Recall that the cosine, sine, and exponential functions are defined by means of the series
cos(x) ≔ ∑_{n=0}^∞ (−1)^n x^{2n}/(2n)! = 1 − x²/2! + x⁴/4! − x⁶/6! + ⋯ (1.1)
sin(x) ≔ ∑_{n=0}^∞ (−1)^n x^{2n+1}/(2n + 1)! = x − x³/3! + x⁵/5! − x⁷/7! + ⋯ (1.2)
exp(x) ≔ ∑_{n=0}^∞ x^n/n! = 1 + x/1! + x²/2! + x³/3! + x⁴/4! + ⋯ (1.3)
Euler’s identity can then be shown by considering the series expression of exp(𝑖𝑥) given by
exp(ix) = ∑_{n=0}^∞ (ix)^n/n! = 1 + ix/1! + i²x²/2! + i³x³/3! + i⁴x⁴/4! + i⁵x⁵/5! + ⋯ (1.4)
Noting that for n = 0, 1, 2, … the powers of i cycle through the values 1, i, −1, −i,
i⁰ = +1, i¹ = +i, i² = −1, i³ = −i, i⁴ = +1, i⁵ = +i, i⁶ = −1, i⁷ = −i, … (1.5)
we obtain
exp(ix) = 1 + ix/1! − x²/2! − ix³/3! + x⁴/4! + ix⁵/5! + ⋯ (1.6)
= (1 − x²/2! + x⁴/4! − ⋯) + i(x/1! − x³/3! + x⁵/5! − ⋯)
= ∑_{n=0}^∞ (−1)^n x^{2n}/(2n)! + i ∑_{n=0}^∞ (−1)^n x^{2n+1}/(2n + 1)!
= cos(x) + i sin(x)
Similar considerations yield the second identity. □
From equations (1), we may express the cosine and sine function in terms of the exponential
function with a complex argument as follows
cos(x) = (exp(ix) + exp(−ix))/2 =: (e^{ix} + e^{−ix})/2 and sin(x) = (exp(ix) − exp(−ix))/(2i) =: (e^{ix} − e^{−ix})/(2i) (2)
Note that we introduced the notation 𝑒𝑥 ≔ exp(𝑥) in (2) to keep the notation concise. Based on the
relations (1) and (2) we are now in the position to derive the complex exponential form of the Fourier series.
Proof of equation (2)
The first equation derives from the fact that
exp(𝑖𝑥) + exp(−𝑖𝑥) = (cos(𝑥) + 𝑖 sin(𝑥)) + (cos(𝑥) − 𝑖 sin(𝑥)) = 2 cos(𝑥) (2.1)
and the second equation derives from the fact that
exp(𝑖𝑥) − exp(−𝑖𝑥) = (cos(𝑥) + 𝑖 sin(𝑥)) − (cos(𝑥) − 𝑖 sin(𝑥)) = 2𝑖 sin(𝑥) (2.2)
□
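Both Euler’s identities and the resulting expressions for cosine and sine can be confirmed numerically. The following Python/NumPy sketch (added for illustration; the grid of evaluation points is an assumption) checks them on a range of arguments:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 101)
# Euler's identities, equation (1)
assert np.allclose(np.exp(1j * x), np.cos(x) + 1j * np.sin(x))
assert np.allclose(np.exp(-1j * x), np.cos(x) - 1j * np.sin(x))
# cosine and sine via complex exponentials, equation (2)
assert np.allclose(np.cos(x), (np.exp(1j * x) + np.exp(-1j * x)) / 2)
assert np.allclose(np.sin(x), (np.exp(1j * x) - np.exp(-1j * x)) / (2 * 1j))
print("Euler identities verified numerically")
```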
(7) The complex form of the Fourier series
Before introducing the complex form of the Fourier series, we will make a small adjustment to the
notation of the real form of the Fourier series, which shortens the ensuing expressions. In the previous
Section we have seen that if we have a function 𝑓 that is periodic with period of 𝑇, we can decompose it into
a sum of generalized cosine and sine functions
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(2π(k/T)x) + b_k sin(2π(k/T)x)] (1)
where the coefficients are given by the following integrals
a₀ = (1/T) ∫_{−T/2}^{T/2} f(x) dx and a_k = (2/T) ∫_{−T/2}^{T/2} f(x) cos(2π(k/T)x) dx, b_k = (2/T) ∫_{−T/2}^{T/2} f(x) sin(2π(k/T)x) dx (2)
for 𝑘 = 1,2, …. If we consider 𝑘 = 1, the input argument of the cosine and sine functions takes the form
2π(k/T)x = k(2π/T)x = (2π/T)x (3)
If we define
ω₀ ≔ 2π/T (4)
(note that the subscript “0” does not refer to k) we may replace every cosine and sine argument “2π(k/T)x” with the more concise argument “kω₀x”:
2π(k/T)x = k(2π/T)x = kω₀x (5)
In other words, with 𝜔0 as defined in equation (4), we can write the Fourier series in its real form as
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)] (6)
We now rewrite the Fourier series in its complex form. To this end, we first note that from the
previous Section
cos(x) = (e^{ix} + e^{−ix})/2 and sin(x) = (e^{ix} − e^{−ix})/(2i) = i(e^{ix} − e^{−ix})/(2i²) = −i(e^{ix} − e^{−ix})/2 (7)
Substitution of the complex exponential expressions for cosine and sine in the real form of the Fourier series
(6) then yields
f(x) = a₀ + ∑_{k=1}^∞ [a_k ((e^{ikω₀x} + e^{−ikω₀x})/2) + b_k (−i(e^{ikω₀x} − e^{−ikω₀x})/2)] (8)
= a₀ + ∑_{k=1}^∞ [(a_k/2) e^{ikω₀x} + (a_k/2) e^{−ikω₀x} − (ib_k/2) e^{ikω₀x} + (ib_k/2) e^{−ikω₀x}] (9)
In two terms on the right-hand side of the above, we find the expression “−ikω₀x”, which we may interpret as “(−k) ⋅ iω₀x”. We can thus rewrite the above using two sums, one with indices k = 1, 2, … and one with indices k = −∞, …, −3, −2, −1 as follows
f(x) = a₀ + ∑_{k=1}^∞ ((a_k/2) e^{ikω₀x} − (ib_k/2) e^{ikω₀x}) + ∑_{k=−∞}^{−1} ((a_{−k}/2) e^{ikω₀x} + (ib_{−k}/2) e^{ikω₀x})
= a₀ + ∑_{k=1}^∞ (1/2)(a_k − ib_k) e^{ikω₀x} + ∑_{k=−∞}^{−1} (1/2)(a_{−k} + ib_{−k}) e^{ikω₀x} (10)
Note that a_{−k} for, e.g., k = −3 corresponds to a_{−(−3)} = a₃ and is thus identical to one of the coefficients in expression (9).
Expression (10) can be written more compactly by defining a new set of coefficients 𝑐𝑘 , 𝑘 ∈ ℤ as
follows
c_k ≔ { (1/2)(a_{−k} + ib_{−k}) for k < 0; a₀ for k = 0; (1/2)(a_k − ib_k) for k > 0 } (11)
Substitution of these definitions in (10) then yields
f(x) = a₀ + ∑_{k=1}^∞ (1/2)(a_k − ib_k) e^{ikω₀x} + ∑_{k=−∞}^{−1} (1/2)(a_{−k} + ib_{−k}) e^{ikω₀x} (12)
= c₀ + ∑_{k=1}^∞ c_k e^{ikω₀x} + ∑_{k=−∞}^{−1} c_k e^{ikω₀x}
= ∑_{k=−∞}^{−1} c_k e^{ikω₀x} + c₀ e^{0⋅iω₀x} + ∑_{k=1}^∞ c_k e^{ikω₀x}
= ∑_{k=−∞}^∞ c_k e^{ikω₀x}
The expression
f(x) = ∑_{k=−∞}^∞ c_k e^{ikω₀x} (13)
where, in general, 𝑐𝑘 ∈ ℂ is referred to as the complex form of the Fourier series. Note that in contrast to
the real form, the indices 𝑘 are signed integers.
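The passage from the real to the complex form can be checked numerically. The Python/NumPy sketch below (an illustration added here; the square-wave-like example coefficients a₀ = 0.5, b_k = 2/(πk) for odd k are assumed values) builds the truncated series in both forms using the coefficient definitions in (11) and verifies that they agree:

```python
import numpy as np

# assumed example: real coefficients a_0 = 0.5, b_k = 2/(pi*k) for odd k
T = 2.0
w0 = 2 * np.pi / T
K = 25                                   # truncation order
a = np.zeros(K + 1)
b = np.zeros(K + 1)
a[0] = 0.5
for k in range(1, K + 1, 2):
    b[k] = 2 / (np.pi * k)

x = np.linspace(-1.0, 1.0, 201)
# truncated real form: f(x) = a_0 + sum_k a_k cos(k w0 x) + b_k sin(k w0 x)
f_real = a[0] + sum(a[k] * np.cos(k * w0 * x) + b[k] * np.sin(k * w0 * x)
                    for k in range(1, K + 1))

# complex coefficients per equation (11), summed over k = -K, ..., K
f_cplx = a[0] * np.ones_like(x, dtype=complex)          # c_0 = a_0
for k in range(1, K + 1):
    ck = 0.5 * (a[k] - 1j * b[k])                       # c_k for k > 0
    cmk = 0.5 * (a[k] + 1j * b[k])                      # c_{-k} = (1/2)(a_k + i b_k)
    f_cplx += ck * np.exp(1j * k * w0 * x) + cmk * np.exp(-1j * k * w0 * x)

assert np.allclose(f_cplx.imag, 0, atol=1e-12)          # imaginary parts cancel
assert np.allclose(f_cplx.real, f_real)
print("complex and real Fourier series agree")
```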
(8) The polar form of the Fourier series
In the complex exponential form of the Fourier series, the coefficients 𝑐𝑘 contain information about
the “frequency magnitude” and “frequency phase“ for a given frequency 𝑘𝜔0 in the original function 𝑓. As
discussed in the next Section, numerical approaches that find the Fourier series coefficients for a given
function or sequence usually return the complex coefficients 𝑐𝑘. Here, we consider the interpretation of
these coefficients in terms of frequency magnitude and phase. To this end, we first note that the coefficients
c_k with negative k and those with positive k refer to the identical frequency |k|ω₀ in the original sine and cosine representation of the Fourier series. We may thus consider only coefficients with k ≥ 0, which, in general, are complex numbers that we write in terms of their real and imaginary parts as
c_k = a_k + ib_k (1)
To see how these complex numbers relate to frequency magnitude and phase, we first consider the so-called
“polar form” of the Fourier series. The polar form of the Fourier series is a reformulation of the real form of
the Fourier series. Specifically, based on the coefficients of the real Fourier series 𝑎0, 𝑎1, 𝑎2, … , 𝑏1, 𝑏2, … the
polar form of the Fourier series is given by
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k) (2)
where the “polar form coefficients” 𝑑0, 𝑑1, 𝑑2, … and “phase angles” 𝜙1, 𝜙2, … can be obtained from the real
form coefficients by
d₀ = a₀, d_k = √(a_k² + b_k²) and φ_k = arctan(b_k/a_k) for k = 1, 2, … (3)
Proof of equation (2)
To show that the above holds, we capitalize on the following identities for trigonometric functions
cos(𝑥 + 𝑦) = cos(𝑥) cos(𝑦) − sin(𝑥) sin(𝑦), cos(−𝑥) = cos(𝑥) and sin(−𝑥) = − sin(𝑥) (3.1)
and the definition of the tangent function as
tan(x) ≔ sin(x)/cos(x) (3.2)
We then have
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k) (3.3)
= d₀ + ∑_{k=1}^∞ d_k (cos(kω₀x) cos(−φ_k) − sin(kω₀x) sin(−φ_k))
= d₀ + ∑_{k=1}^∞ d_k (cos(kω₀x) cos(φ_k) + sin(kω₀x) sin(φ_k))
= d₀ + ∑_{k=1}^∞ [d_k cos(φ_k) cos(kω₀x) + d_k sin(φ_k) sin(kω₀x)]
The latter statement is equivalent to the real form of the Fourier series as defined in the previous section given that
𝑎𝑘 = 𝑑𝑘 cos(𝜙𝑘) and 𝑏𝑘 = 𝑑𝑘 sin(𝜙𝑘) (3.4)
To evaluate φ_k in terms of a_k and b_k we may thus set
b_k/a_k = (d_k sin(φ_k))/(d_k cos(φ_k)) = sin(φ_k)/cos(φ_k) = tan(φ_k) ⇒ φ_k = arctan(b_k/a_k), φ_k ∈ (−π/2, π/2) (3.5)
Squaring and adding the statements in (3.4) above to find d_k, we obtain
a_k² + b_k² = (d_k cos(φ_k))² + (d_k sin(φ_k))² = d_k²(cos²(φ_k) + sin²(φ_k)) = d_k² ⇒ d_k = √(a_k² + b_k²) (3.6)
In summary, the polar form coefficients and phase angles can be derived from the real form coefficients by means of trigonometric identities and the definition of the tangent function.
□
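The conversion from real form to polar form coefficients can be illustrated numerically. In the following Python/NumPy sketch (added here for illustration), the values of a_k, b_k, k, and ω₀ are assumed example values:

```python
import numpy as np

# assumed example coefficients for a single harmonic
ak, bk, k, w0 = 1.2, 0.7, 3, 2 * np.pi
dk = np.sqrt(ak**2 + bk**2)        # d_k, equation (3.6)
phik = np.arctan(bk / ak)          # phi_k, equation (3.5) (valid here since ak > 0)

x = np.linspace(0.0, 1.0, 301)
lhs = dk * np.cos(k * w0 * x - phik)                      # polar form term
rhs = ak * np.cos(k * w0 * x) + bk * np.sin(k * w0 * x)   # real form term
assert np.allclose(lhs, rhs)
print("d_k =", round(dk, 4))
```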
The polar representation of the Fourier series thus shows how the real and imaginary parts of the coefficients c_k (i.e., Re(c_k) = a_k and Im(c_k) = b_k) relate to a generalized cosine function of frequency kω₀: the square root of their sum of squares yields the amplitude coefficient d_k, and the arctangent of their ratio yields the phase parameter φ_k. Neglecting that the underlying function of the polar Fourier series is a cosine function (as one may equally express the polar Fourier series in terms of a sine function), one may thus refer to “frequency-specific amplitudes and phases”.
A commonly used approach to visualize the relation between the real and imaginary parts of Fourier series coefficients and frequency magnitude and phase is to plot the complex number c_k as a point in the two-dimensional plane spanned by the values of its real part Re(c_k) = a_k and its imaginary part Im(c_k) = b_k (Figure 1). Basic geometry then relates the “frequency magnitude” d_k to the length of the line from the origin to the point (a_k, b_k)ᵀ and the “frequency phase” to the angle spanned by this line and the a_k-axis.
Figure 1. Representation of the complex Fourier series coefficients in the “real-imaginary plane”. Visualizing the real and imaginary
part of such a coefficient as a point in this plane, relates the amplitude coefficients of the polar form of the Fourier series to the
length of the line from the origin to the point, and phase parameters in the polar form of the Fourier series to the angle spanned by
this line and the real axis.
(9) The Fourier transform
The Fourier transform is an example of an integral transform which transforms one function defined
on one domain into a second function defined on another domain. In the case of the Fourier transform, the
former domain is usually time or space, i.e., the arguments of the input function to the Fourier transform
represent temporal or spatial units, while the latter domain is the frequency domain, i.e. the input
arguments of the Fourier transformed function are frequencies. More formally, for a function
𝑓:ℝ → ℝ, 𝑥 ↦ 𝑓(𝑥) (1)
the Fourier transform is defined as the function F, for which
F: ℝ → ℂ, ω ↦ F(ω) ≔ ∫_{−∞}^∞ f(x) exp(−2πiωx) dx (2)
The inverse Fourier transform is defined as
f: ℝ → ℝ, x ↦ f(x) ≔ ∫_{−∞}^∞ F(ω) exp(2πiωx) dω (3)
The definition of the Fourier transform given in equation (2) can be motivated as the “continuous frequency approximation” of the “discrete frequency form” of the Fourier series. In order to see this, we first note that we can write the coefficients c_k (k ∈ ℤ) of the complex exponential form of the Fourier series directly in terms of the function f and a complex exponential. That is, for a function f as defined in (1) and periodic with period T, the coefficients in its complex exponential Fourier series representation
f(x) = ∑_{k=−∞}^∞ c_k exp(2πi(k/T)x) (4)
can be written as
c_k = (1/T) ∫_{−T/2}^{T/2} f(x) exp(−2πi(k/T)x) dx (k ∈ ℤ) (5)
which avoids recourse to the real form coefficients a₀ and a_k, b_k, k ∈ ℕ.
Proof of equation (5)
To show that the coefficients 𝑐𝑘 (𝑘 ∈ ℤ) of the complex exponential form of the Fourier series may be written as in
equation (5) we follow a similar line of reasoning as for the derivation of the real form coefficients: we first establish an
orthogonality relation for complex exponentials, multiply the complex form of the Fourier series by a complex exponential and
integrate. More specifically, we require the following complex exponential orthogonality property:
∫_{−T/2}^{T/2} exp(2πinx/T) exp(−2πimx/T) dx = { 0 for n ≠ m; T for n = m } (5.1)
Note the similarity of this property to the cosine and sine orthogonality properties above. Again, we will eschew a formal proof of (5.1), but capitalize on it in the following. Multiplying the complex form of the Fourier series
f(x) = ∑_{k=−∞}^∞ c_k exp(2πikx/T) (5.2)
by exp(−2πimx/T) for an arbitrary m ∈ ℤ and subsequently integrating over the interval [−T/2, T/2] yields
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = ∫_{−T/2}^{T/2} ∑_{k=−∞}^∞ c_k exp(2πikx/T) exp(−2πimx/T) dx (5.3)
Assuming that we may exchange the order of summation and integration on the right-hand side of (5.3), we obtain
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = ∑_{k=−∞}^∞ ∫_{−T/2}^{T/2} c_k exp(2πikx/T) exp(−2πimx/T) dx (5.4)
Using the orthogonality property (5.1), we hence obtain
∫_{−T/2}^{T/2} f(x) exp(−2πimx/T) dx = c_m ∫_{−T/2}^{T/2} exp(2πimx/T) exp(−2πimx/T) dx = c_m T (5.5)
and exchanging the (arbitrary) index 𝑚 ∈ ℤ for 𝑘 ∈ ℤ, we get
c_k = (1/T) ∫_{−T/2}^{T/2} f(x) exp(−2πikx/T) dx (k ∈ ℤ) (5.6)
□
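For a concrete check of equation (5.6), consider f(x) = cos(2πx/T), whose complex coefficients should be c₁ = c₋₁ = 1/2 with all others zero. The following Python/NumPy sketch (a numerical illustration added here; the period and grid size are assumed values) evaluates the integral numerically:

```python
import numpy as np

# f(x) = cos(2*pi*x/T) should have c_1 = c_{-1} = 1/2 and all other c_k = 0
T = 2.0
n = 200000
x = np.linspace(-T / 2, T / 2, n, endpoint=False) + T / (2 * n)  # midpoints

def c(k):
    # c_k = (1/T) * integral over one period of f(x) exp(-2*pi*i*k*x/T), eq. (5.6)
    fx = np.cos(2 * np.pi * x / T)
    # (1/T) * dx = (1/T) * (T/n) = 1/n, so the midpoint sum is divided by n
    return np.sum(fx * np.exp(-2j * np.pi * k * x / T)) / n

assert abs(c(1) - 0.5) < 1e-6
assert abs(c(-1) - 0.5) < 1e-6
assert abs(c(0)) < 1e-6 and abs(c(2)) < 1e-6
print("c_1 =", round(c(1).real, 3))
```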
Having established the direct complex exponential form of 𝑐𝑘, we now return to the view of the
Fourier transform as “continuous frequency generalization” of the Fourier series. We recall that the
constants 𝑐𝑘 are a measure of the “amount” of the discrete frequencies 𝑘𝜔0 = 2𝜋𝑘/𝑇 that are combined to
represent the periodic function 𝑓(𝑥). We also note that although we have an infinite number of these
discrete frequencies, they are all multiples of a basic frequency given by ω₀ = 2π/T. As the period increases, this basic frequency decreases, and therefore the discrete frequencies move closer together, until in the limit T → ∞ they form a continuous frequency spectrum. Formally, we may consider
the distance Δ𝜔𝑘 between two discrete frequencies given by
Δω_k = ω_{k+1} − ω_k = (k + 1)/T − k/T = 1/T (k ∈ ℤ) (6)
We may analogously write
lim_{T→∞} k/T = ω (7)
as 𝑇 → ∞, Δ𝜔𝑘 → 0 and we arrive at a continuous frequency representation 𝜔 ∈ ℝ. Consider now the
Fourier series coefficient 𝑐𝑘, which we multiply with 𝑇:
T c_k = ∫_{−T/2}^{T/2} f(x) exp(−2πi(k/T)x) dx (k ∈ ℤ) (8)
Replacing the discrete frequency k/T by the continuous frequency ω and letting T → ∞, we obtain
T c_k = ∫_{−∞}^∞ f(x) exp(−2πiωx) dx =: F(ω) (9)
i.e. the Fourier transform formula. Note that 𝐹(𝜔) can be interpreted as the discrete contribution 𝑐𝑘
(tending to zero, as more and more frequencies become available) multiplied by the (constant) period of the
input function (tending to infinity, as the function becomes less and less periodic).
From a data analytical perspective, one usually computes frequency spectra of observed data, not of
analytical functions. We will thus omit a discussion of the properties and implications of the Fourier
transform and instead introduce the discrete Fourier transform next. The interested reader will find much
material on the analytical properties of the Fourier transform and examples for Fourier transforms of a
selection of functions in [Weaver, 1983].
(10) The discrete Fourier transform
We may regard the Fourier series in its complex exponential form as an operation that takes a
function 𝑓 and returns a sequence (i.e., an infinite ordered list) of coefficients 𝑐𝑘 , 𝑘 ∈ ℤ. Likewise, the
Fourier transform may be considered an operation that maps a function 𝑓 onto another function 𝐹. To
determine either the Fourier series or the Fourier transform, integrals have to be evaluated. Thus, it is only possible to consider functions that can be described analytically, and even then these functions must be relatively simple. In the real world, we rarely find such functions and therefore must turn to a computer for help. A computer, however, does not represent continuous functions, but sequences of numbers that may
represent functions or empirical data on finite domains. The discrete Fourier transform can be viewed as
the discretized analogue of the continuous Fourier transform as defined in the previous subsection. To
introduce the discrete Fourier transform, we first consider the notion of an “𝑛th-order sequence”.
A sequence of a finite number of n terms, or an “nth-order sequence”, is defined as a function whose domain is the set of integers ℕ_{n−1}^0 ≔ {0, 1, 2, …, n − 1} for n ∈ ℕ and whose range is the set of function values {f(0), f(1), …, f(n − 1)}. Alternatively, an nth-order sequence may be conceived as the set of ordered pairs {(0, f(0)), (1, f(1)), …, (n − 1, f(n − 1))}. Here we use a shorter notation and follow the common practice of denoting the sequence by (f_k)_{k=0,1,…,n−1} and the kth term of the sequence by f_k. For example, we may have a sequence (f_k)_{k=0,1,…,n−1} whose terms are defined by
example, we may have a sequence (𝑓𝑘)𝑘=0,1,…,𝑛−1 whose terms are defined by
𝑓𝑘 ≔1
𝑘+1, 𝑘 = 0,1, … , 𝑛 − 1 (1)
and thus
(𝑓𝑘)𝑘∈ℕ𝑛−10 = {1,1
2,1
3, … ,
1
𝑛} (2)
Notably, we do not require a formula or equation for f_k in terms of k to define a sequence. In other words, any empirically observed set of n univariate data points y_i ∈ ℝ, i = 0, 1, …, n − 1 can be regarded as an nth-order sequence, and we may identify data vectors y ∈ ℝⁿ with nth-order sequences (y_k)_{k∈ℕ_{n−1}^0}. In order to emphasize the data analytical aspect of the discrete Fourier transform, we will use
y = (y_k)_{k∈ℕ_{n−1}^0} (3)
in the following to denote an nth-order sequence. We call such a sequence “bounded” if all of its terms are finite valued.
By analogy to the Fourier transform of a function defined on the continuous domain ℝ given by
F: ℝ → ℂ, ω ↦ F(ω) ≔ ∫_{−∞}^∞ f(x) exp(−2πiωx) dx (4)
the discrete Fourier transform of a bounded nth-order sequence is defined as
Y_j ≔ (1/n) ∑_{k=0}^{n−1} y_k exp(−2πi(j/n)k) for j = 0, 1, …, n − 1 (5)
Further, in analogy to the inverse Fourier transform, the inverse discrete Fourier transform is defined as
y_k = ∑_{j=0}^{n−1} Y_j exp(2πi(j/n)k) for k = 0, 1, …, n − 1 (6)
Like the Fourier transform and the inverse Fourier transform, the discrete Fourier transform and its inverse are reciprocal: having obtained a sequence Y = (Y_j)_{j∈ℕ_{n−1}^0} from a sequence (y_k)_{k∈ℕ_{n−1}^0} and applying the inverse discrete Fourier transform equation (6) to it yields the original sequence (y_k)_{k∈ℕ_{n−1}^0}.
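This reciprocity of equations (5) and (6) can be verified with a direct (non-fast) implementation of both transforms. The following Python/NumPy sketch (added here for illustration; the sequence length and random data are assumptions) uses the 1/n normalization adopted in this text:

```python
import numpy as np

def dft(y):
    # Y_j = (1/n) * sum_k y_k exp(-2*pi*i*(j/n)*k), equation (5)
    n = len(y)
    j = np.arange(n)
    W = np.exp(-2j * np.pi * np.outer(j, j) / n)   # W[j, k] = exp(-2*pi*i*j*k/n)
    return W @ y / n

def idft(Y):
    # y_k = sum_j Y_j exp(2*pi*i*(j/n)*k), equation (6)
    n = len(Y)
    j = np.arange(n)
    W = np.exp(2j * np.pi * np.outer(j, j) / n)    # W[k, j] = exp(+2*pi*i*k*j/n)
    return W @ Y

rng = np.random.default_rng(0)
y = rng.standard_normal(16)
assert np.allclose(idft(dft(y)), y)    # the two transforms are reciprocal
print("inverse DFT recovers the original sequence")
```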
Note that we use Y_j to denote the (usually complex) values of the discrete Fourier transform of the nth-order sequence y. Further notice that, with respect to the Fourier transform, the frequency resolution depends on the order of the sequence, i.e., the number of data points: ω ∈ ℝ is replaced by j/n for j = 0, 1, …, n − 1. Usually, the values of y refer to data sampled equidistantly over time or space. Each value thus refers to a given time or space increment, and the reciprocal value of this increment is the data sampling frequency f_s. Dividing the sampling frequency by the number of data points and multiplying with j = 0, 1, …, n − 1 then yields the discrete support frequencies of the discrete Fourier transform. In other words: the complex number Y_j contains magnitude and phase information about the frequency (j/n) ⋅ f_s in the signal of interest. Finally, note that in the definition of the complex exponential Fourier series, the index of the values c_k covered the negative and positive integers. The first n/2 + 1 values Y_j correspond to the c_k values for k = 0, …, n/2, while the latter n/2 − 1 values Y_j correspond to the c_k values for k = −n/2 + 1, …, −1. Because for real-valued data these latter terms are redundant, it suffices to consider the first n/2 + 1 terms Y_j for characterizing frequency magnitude and phase.
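The correspondence between the index j, the support frequencies (j/n)·f_s, and the redundancy of the second half of the coefficients for real-valued data can be illustrated as follows (a Python/NumPy sketch with assumed values of n and f_s; note that numpy’s fft, like Matlab’s, omits the 1/n factor, which is therefore applied explicitly):

```python
import numpy as np

n, fs = 8, 100.0                         # assumed: 8 samples at 100 Hz
t = np.arange(n) / fs
y = np.sin(2 * np.pi * 12.5 * t) + 0.5   # real-valued signal
Y = np.fft.fft(y) / n                    # DFT with the 1/n convention used here

freqs = np.arange(n) * fs / n            # support frequencies (j/n) * fs
assert freqs[1] == fs / n                # frequency resolution fs/n = 12.5 Hz

# for real-valued data the second half is redundant: Y_{n-j} = conj(Y_j)
for j in range(1, n // 2):
    assert np.allclose(Y[n - j], np.conj(Y[j]))
print("Y_0 (mean):", round(Y[0].real, 3))
```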
For notational simplicity, one may define a so-called “weighting kernel” 𝑤𝑛 by
w_n ≔ exp(2πi/n) (7)
Note that for a scalar a ∈ ℝ, taking w_n to the power of a yields
w_n^a ≔ (w_n)^a = (exp(2πi/n))^a = exp(2πai/n) (8)
Rewriting the definitions of the discrete Fourier transform and the inverse discrete Fourier transform above
using the weighting kernel 𝑤𝑛 we have
Y_j = (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} for j = 0, 1, …, n − 1 (9)
and
y_k = ∑_{j=0}^{n−1} Y_j w_n^{kj} for k = 0, 1, …, n − 1 (10)
We next turn to the question of how to compute the discrete Fourier transform (5) of a given data sequence (y_k)_{k∈ℕ_{n−1}^0} by means of the so-called Fast Fourier Transform algorithm.
(11) The Fast Fourier Transform Algorithm
The Fast Fourier Transform algorithm computes the values Y_j for j = 0, 1, …, n − 1 from a given data sequence (y_k)_{k∈ℕ_{n−1}^0} according to the definition of the discrete Fourier transform
Y_j ≔ (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} (1)
In principle, these values can be computed readily using basic programming tools, because each value Y_j is merely given by the products of n terms of the form y_k w_n^{−kj} and their subsequent summation. The problem with a direct computation of the values Y_j according to formula (1) is that for long data sequences, i.e., large values of n, many numerical operations have to be performed. More specifically, evaluating the definition of Y_j for j = 0, 1, …, n − 1 requires n² “numerical operations” (= multiplications and additions): for each Y_j, n multiplications and additions have to be performed, and there are n terms Y_j, thus resulting in a total of n² operations. Doubling the number of data points from, say, n₁ = 100 to n₂ = 200 thus yields a fourfold increase in the number of necessary computations (from 100² = 10,000 to 200² = 40,000). The fast Fourier transform algorithm reduces the number of computations necessary to evaluate the discrete Fourier transform of a given nth-order data sequence to n ⋅ log₂ n. Given that each numerical operation performed on a computer takes a non-zero amount of time, the fast Fourier transform algorithm thus returns the values Y_j much more quickly than the direct evaluation of the discrete Fourier transform according to equation (1). Figure 1 depicts the values n² and n ⋅ log₂ n as a function of the data sequence order n.
Figure 1. Number of numerical operations (multiplications and additions) necessary to evaluate the discrete Fourier transform of an 𝑛th order data sequence. Note that the fast Fourier transform algorithm massively reduces the computational demand.
To see how the fast Fourier transform achieves this reduction of computational demand, we follow the discussion in [Weaver, 1983]. To keep the notation concise, we will write y(k) ≔ y_k for the kth element of a sequence (y_k)_{k∈ℕ_{n−1}^0} in the following. For a given nth-order data sequence we assume that n is an even integer, and we can thus split the sequence (y_k)_{k∈ℕ_{n−1}^0} into two new subsequences (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} by defining
y_{1k} = y₁(k) ≔ y(2k) and y_{2k} = y₂(k) ≔ y(2k + 1) for k = 0, 1, …, m − 1 where m = n/2 (2)
To make the above transparent, consider for example the 6th-order sequence
(y_k)_{k∈ℕ_5^0} ≔ {y(0), y(1), y(2), y(3), y(4), y(5)} = {0, 1, 2, 14, 16, 20} (3)
Then (2) defines the two subsequences
(y_{1k})_{k∈ℕ_2^0} = {y₁(0), y₁(1), y₁(2)} = {y(0), y(2), y(4)} = {0, 2, 16} (4)
and
(y_{2k})_{k∈ℕ_2^0} = {y₂(0), y₂(1), y₂(2)} = {y(1), y(3), y(5)} = {1, 14, 20} (5)
The first subsequence thus comprises the “even” terms of the original sequence (starting from the zeroth term), while the second subsequence comprises the “odd” terms of the original sequence (starting from the
first term). Note that both new subsequences as defined in (2) are periodic sequences with periodicity 𝑚:
𝑦1(𝑘 + 𝑚) = 𝑦(2(𝑘 + 𝑚)) = 𝑦(2𝑘 + 𝑛) = 𝑦(2𝑘) = 𝑦1(𝑘) (6)
and
y₂(k + m) = y(2(k + m) + 1) = y(2k + n + 1) = y(2k + 1) = y₂(k) (7)
Since (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} are mth-order sequences, we can determine their discrete Fourier transforms according to the definition in equation (1):
Y₁(j) = (1/m) ∑_{k=0}^{m−1} y_{1k} w_m^{−kj} for j = 0, 1, …, m − 1 (8)
Y₂(j) = (1/m) ∑_{k=0}^{m−1} y_{2k} w_m^{−kj} for j = 0, 1, …, m − 1 (9)
Now let us consider the discrete Fourier transform of the original nth-order sequence (y_k)_{k∈ℕ_{n−1}^0}:
Y(j) = (1/n) ∑_{k=0}^{n−1} y_k w_n^{−kj} for j = 0, 1, …, n − 1 (10)
By splitting the summation into even- and odd-indexed terms, we can write the preceding equation as
Y(j) = (1/n) ∑_{k=0}^{m−1} y(2k) w_n^{−2kj} + (1/n) ∑_{k=0}^{m−1} y(2k + 1) w_n^{−(2k+1)j} (11)
However, we note that
w_n^{−2kj} = exp(−(2 ⋅ 2πikj)/n) = exp(−2πikj/(n/2)) = w_m^{−kj} (12)
and
w_n^{−(2k+1)j} = exp(−2πi(2k + 1)j/n) = exp(−(2 ⋅ 2πikj)/n) exp(−2πij/n) = w_m^{−kj} w_n^{−j} (13)
Therefore, equation (11) can be written as
Y(j) = (1/n) ∑_{k=0}^{m−1} y_{1k} w_m^{−kj} + (w_n^{−j}/n) ∑_{k=0}^{m−1} y_{2k} w_m^{−kj} for j = 0, …, n − 1 (14)
Comparison to the discrete Fourier transforms of the subsequences yields
Y(j) = Y₁(j)/2 + w_n^{−j} Y₂(j)/2 for j = 0, …, n − 1 (15)
Because Y₁(j) and Y₂(j) are periodic with periodicity m, and because w_n^{−(j+m)} = −w_n^{−j}, we have
Y(j) = (1/2)(Y₁(j) + Y₂(j) w_n^{−j}) and Y(j + m) = (1/2)(Y₁(j) − Y₂(j) w_n^{−j}) for j = 0, …, m − 1 (16)
As we have noted, calculating the discrete Fourier transform of (y_k)_{k∈ℕ_{n−1}^0} requires n² numerical operations (additions and multiplications), whereas calculating the discrete Fourier transforms of (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} requires only m² = (n/2)² = n²/4 complex operations each. When using equation (16) to obtain Y(j) based on the values Y₁(j) and Y₂(j), we first require 2(n²/4) = n²/2 operations to calculate the two Fourier transforms Y₁(j) and Y₂(j), and then we require n additional operations as prescribed by equation (16). The total number of operations for computing the discrete Fourier transform coefficients has thus been reduced from n² to n²/2 + n.
Now suppose n is divisible by 4, i.e., m = n/2 is divisible by 2. Then the subsequences (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} can be further subdivided into four sequences of order m/2,
g₁(k) = y₁(2k), g₂(k) = y₁(2k + 1), h₁(k) = y₂(2k), h₂(k) = y₂(2k + 1) (17)
for k = 0, 1, …, m/2 − 1. Thus, we can also use the scheme above to obtain the Fourier transforms of (y_{1k})_{k∈ℕ_{m−1}^0} and (y_{2k})_{k∈ℕ_{m−1}^0} and use these results to obtain the values Y(j). A little thought reveals that this requires 2n + n²/4 operations in total. Thus, when subdividing a sequence twice, we reduce the number of operations from n² to 2n + n²/4. The 2n term is the result of applying the scheme above twice, whereas the n²/4 term is the result of transforming the four reduced sequences. For the case n = 4, we note that we completely reduce the sequence to four first-order sequences that are their own transforms, and therefore we do not need the additional n²/4 transform operations. The formula then becomes 2n. The smallest value of n that does not result in complete reduction of the sequence is 8. For this case we have a reduction factor of 1/2, whereas for large n the factor approaches 1/4. Continuing in this way, one can show that if n is divisible by 2^p, then the number of operations required to compute the discrete Fourier transform of the nth-order sequence by repeated subdivision is
pn + n²/2^p (18)
Again, for complete reduction, i.e., n = 2^p, the n²/2^p term is not required and we obtain pn for the number of operations. The number of required operations in this case is thus pn = n log₂ n.
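The repeated subdivision described above is naturally expressed as a recursion. The following Python/NumPy sketch (an illustration, not the text's reference implementation) implements the radix-2 case n = 2^p with the 1/n normalization used in this text and compares it against numpy's built-in FFT:

```python
import numpy as np

def fft_recursive(y):
    # radix-2 decimation-in-time FFT for n = 2^p, with the 1/n normalization
    n = len(y)
    if n == 1:
        return np.asarray(y, dtype=complex)   # a first-order sequence is its own transform
    Y1 = fft_recursive(y[0::2])               # even-indexed subsequence
    Y2 = fft_recursive(y[1::2])               # odd-indexed subsequence
    w = np.exp(-2j * np.pi * np.arange(n // 2) / n)   # w_n^{-j}
    # butterfly recombination, equation (16)
    return 0.5 * np.concatenate([Y1 + w * Y2, Y1 - w * Y2])

rng = np.random.default_rng(1)
y = rng.standard_normal(64)                   # n = 2^6
assert np.allclose(fft_recursive(y), np.fft.fft(y) / len(y))
print("recursive FFT matches numpy.fft.fft up to the 1/n factor")
```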
(12) Fast Fourier Transforms in Matlab
Below we include Matlab code that evaluates the discrete Fourier transform of a simulated time-series using Matlab’s fast Fourier transform implementation. Note that this implementation does not include the normalization factor 1/n used in the definition of the discrete Fourier transform above. The complex Fourier coefficients are returned in the order j = 0, 1, …, n − 1, such that the second half of the coefficients is redundant for real-valued data. To obtain the coefficients c_k in the order k = …, −1, 0, 1, …, the function fftshift.m is applied. Figure 2 below visualizes the simulation.
% Matlab DFT/FFT example
% -------------------------------------------------------------------------
% simulation parameters
n = 2^9 ; % number of samples (window length)
fs = 200 ; % sampling frequency (Hz)
nf = fs/2 ; % Nyquist frequency
dt = 1/fs ; % time increment per sample
t = (0:n-1)/fs ; % time vector
s = 1.5 ; % noise standard deviation
% sample data with 5 and 20 Hz components and Gaussian noise
y = 2*sin(2*pi*5*t) + 3*sin(2*pi*20*t) + s*randn(1,length(t));
% use Matlabs fft to compute the DFT of y and its power
f = (0:n-1)*(fs/n) ; % frequency range of DFT
Y = fft(y) ; % DFT Y_j
p = Y.*conj(Y)/n ; % power of the DFT
% reorder coefficients according to k = -3,-2,-1,0,1,2,3,…
Y0 = fftshift(Y) ; % coefficient reordering
f0 = (-n/2:n/2-1)*(fs/n) ; % frequency range for c_k
p0 = Y0.*conj(Y0)/n ; % power of the DFT
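For comparison, a hypothetical equivalent of the Matlab script in Python/NumPy (the random seed and the printed diagnostic are assumptions of this sketch):

```python
import numpy as np

# simulation parameters (mirroring the Matlab script above)
n = 2 ** 9                     # number of samples (window length)
fs = 200.0                     # sampling frequency (Hz)
t = np.arange(n) / fs          # time vector
s = 1.5                        # noise standard deviation

rng = np.random.default_rng(0) # assumed seed for reproducibility
# sample data with 5 and 20 Hz components and Gaussian noise
y = (2 * np.sin(2 * np.pi * 5 * t) + 3 * np.sin(2 * np.pi * 20 * t)
     + s * rng.standard_normal(n))

f = np.arange(n) * fs / n          # frequency range of the DFT
Y = np.fft.fft(y)                  # DFT Y_j (numpy, like Matlab, omits the 1/n factor)
p = (Y * np.conj(Y)).real / n      # power of the DFT

# reorder coefficients according to k = ..., -2, -1, 0, 1, 2, ...
Y0 = np.fft.fftshift(Y)
f0 = np.arange(-n // 2, n // 2) * fs / n
p0 = (Y0 * np.conj(Y0)).real / n

# the largest power in the first half of the spectrum sits near the 20 Hz component
print(round(f[np.argmax(p[: n // 2])], 1))
```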
Figure 2. Visualization of Matlab’s fast Fourier transform implementation. The upper panel depicts 512 samples of a noisy simulated time-series at a sampling frequency of 200 Hz comprising two sine components of frequency 5 Hz and 20 Hz, respectively. The lower left panel depicts the power of all returned discrete Fourier transform coefficients centred at zero, while the lower right panel depicts only the first half of the coefficients.
Study questions
1. Write down the generalized cosine function and discuss the intuitive meaning of its components.
2. Sketch the functions 𝑓(𝑥) ≔ 3 sin(2𝜋𝜔𝑥) for 𝜔 ≔ 1 and 𝑔(𝑥) ≔ 0.5 sin(2𝜋𝜔𝑥) for 𝜔 ≔ 3 by hand.
3. Write down the formulas for 𝜔𝑘 , 𝑎𝑘 , 𝑏𝑘 in the Fourier series representation
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(2πω_k x) + b_k sin(2πω_k x)]
for a function 𝑓 with period 𝑇 satisfying the Dirichlet conditions.
4. Verbally sketch the derivation of the Fourier series coefficient formulas.
5. Verbally define the orthogonality properties of sine and cosine functions.
6. Verbally explain the idea of “protracting” a periodic function.
7. Verbally explain why the lower integral boundary for a protracted periodic function with period 𝑇 for an integral over an
interval of length 𝑇 is irrelevant.
8. Determine the real and imaginary part, the complex conjugate, the absolute value, and the argument of the complex number
𝑧 ≔ 2 + 3𝑖
9. For 𝑧1 = 1 + 𝑖 and 𝑧2 = 2 − 3𝑖 evaluate the sum 𝑧3 = 𝑧1 + 𝑧2 and the product 𝑧4 = 𝑧1𝑧2̅
10. Write down Euler’s identity. Why does it hold?
11. Write down the definitions of the coefficients c_k and the frequency ω₀ in the complex form of the Fourier series of a periodic function f given by
f(x) = ∑_{k=−∞}^∞ c_k exp(ikω₀x)
in terms of the coefficients a_k, b_k of the real form of the Fourier series
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)]
and the period T of f.
12. Let the real form of the Fourier series of a periodic function f be given by
f(x) = a₀ + ∑_{k=1}^∞ [a_k cos(kω₀x) + b_k sin(kω₀x)]
and the polar form of the Fourier series be given by
f(x) = d₀ + ∑_{k=1}^∞ d_k cos(kω₀x − φ_k)
Express the polar form coefficients d_k, k = 0, 1, 2, … and phase angles φ_k, k = 1, 2, … in terms of the real form coefficients a_k, k = 0, 1, 2, … and b_k, k = 1, 2, …
Study Question Answers
1. The generalized cosine function is given by f: ℝ → ℝ, x ↦ f(x) ≔ a cos(2πωx − φ). Here, a ∈ ℝ is an “amplitude coefficient”, which scales the function to vary between −a and a, rather than between −1 and 1 as the cosine does. ω is a “circular frequency” term; the higher this term, the higher the frequency of the generalized cosine function. More specifically, the generalized cosine function repeats itself every 1/ω units of the x-axis, i.e., it is periodic with period T = 1/ω. The smaller the period, the higher the frequency. φ is a “phase angle”, which describes how much the standard cosine function is shifted on the x-axis (to the right).
2. The important points for the sketch are that (1.) the function 𝑓 varies between −3 and 3 and performs a full revolution every 1/𝜔 = 1/1 = 1 𝑥-units, and (2.) that, equivalently, 𝑔 varies between −0.5 and 0.5 and performs a full revolution every 1/𝜔 = 1/3 𝑥-units, or in other words, performs three revolutions every 𝑥-unit.
3. The frequencies are given by 𝜔𝑘 = 𝑘/𝑇 and the coefficients are given by
𝑎0 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) 𝑑𝑥, 𝑎𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) cos(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥 and 𝑏𝑘 = (2/𝑇) ∫_{−𝑇/2}^{𝑇/2} 𝑓(𝑥) sin(2𝜋(𝑘/𝑇)𝑥) 𝑑𝑥
Note that with the convention 𝑓(𝑥) = (1/2)𝑎0 + ∑_{𝑘=1}^{∞}(𝑎𝑘 cos(𝑘𝜔0𝑥) + 𝑏𝑘 sin(𝑘𝜔0𝑥)), the constant coefficient 𝑎0 carries the same 2/𝑇 prefactor as the remaining coefficients.
4. The Fourier series coefficient formulas can be derived by (1.) substituting the frequencies 𝜔𝑘 = 𝑘/𝑇 in the Fourier series, (2.) multiplying the Fourier series representation with the cosine or sine function of a chosen frequency for the cosine or sine coefficients, respectively, (3.) subsequently integrating from −𝑇/2 to 𝑇/2, and finally (4.) solving for the coefficients.
5. The orthogonality properties of sine and cosine functions correspond to the following three statements: (1) The integral of the product of two sine or two cosine functions with different circular frequencies 𝑘/𝑇 and 𝑗/𝑇 (𝑘 ≠ 𝑗, 𝑘, 𝑗 ∈ ℕ) over an interval of length 𝑇 is zero. (2.) The integral of the product of two sine or two cosine functions with the same circular frequencies 𝑘/𝑇 (𝑘 ∈ ℕ) over an interval of length 𝑇 is 𝑇/2. (3) The integral of the product of a sine and a cosine function with different circular frequencies 𝑘/𝑇 and 𝑗/𝑇 (𝑘 ≠ 𝑗, 𝑘, 𝑗 ∈ ℕ) , or the same circular frequency 𝑘/𝑇 (𝑘 ∈ ℕ) over an interval of length 𝑇 is zero.
6. A non-periodic function 𝑓 on a finite domain 𝐷 ⊂ ℝ can be rendered a periodic function on the infinite domain ℝ by putting many identical copies of 𝑓 next to each other.
7. The lower integral boundary for an integral of a protracted function with period 𝑇 for an integral on an interval of length 𝑇 is irrelevant, because the part of the function that is not integrated at the lower end due to the integral boundary shift is integrated at the upper end due to the periodicity of the function.
8. The real part of 𝑧 ≔ 2 + 3𝑖 is 𝑅𝑒{𝑧} = 2 and the imaginary part is 𝐼𝑚{𝑧} = 3. The complex conjugate is given by 𝑧̅ = 2 − 3𝑖, the absolute value is given by 𝑅 = √(2² + 3²) = √13, and the argument is given by 𝜃 = arctan(3/2).
9. We have
𝑧3 = 𝑧1 + 𝑧2 = 1 + 𝑖 + 2 − 3𝑖 = 3 − 2𝑖
and, using the complex conjugate 𝑧2̅ = 2 + 3𝑖,
𝑧4 = 𝑧1𝑧2̅ = (1 + 𝑖)(2 + 3𝑖) = 1 ⋅ 2 + 1 ⋅ 3𝑖 + 𝑖 ⋅ 2 + 𝑖 ⋅ 3𝑖 = 2 + 3𝑖 + 2𝑖 − 3 = −1 + 5𝑖
10. Euler’s identity is given by exp(𝑖𝑥) = cos(𝑥) + 𝑖 sin(𝑥) and it holds due to the series definitions of the exponential, cosine, and sine functions, and the fact that 𝑖² = −1.
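Answers 8 to 10 can be cross-checked with Python's built-in complex numbers; the following is a quick sketch of ours, not part of the original text:

```python
import cmath
import math

# Answer 8: z = 2 + 3i
z = 2 + 3j
z_conj = z.conjugate()        # 2 - 3i
R = abs(z)                    # absolute value sqrt(2^2 + 3^2) = sqrt(13)
theta = cmath.phase(z)        # argument arctan(3/2)

# Answer 9: sum, and product with the complex conjugate of z2
z1, z2 = 1 + 1j, 2 - 3j
z3 = z1 + z2                  # 3 - 2i
z4 = z1 * z2.conjugate()      # (1 + i)(2 + 3i) = -1 + 5i

# Answer 10: Euler's identity exp(ix) = cos(x) + i sin(x), checked at a sample point
x = 0.7
lhs = cmath.exp(1j * x)
rhs = math.cos(x) + 1j * math.sin(x)
```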
11. The coefficients in the complex form of the Fourier series are given in terms of the real form coefficients 𝑎0, 𝑎1, 𝑎2, …, 𝑏1, 𝑏2, … by
𝑐𝑘 ≔ (1/2)(𝑎−𝑘 + 𝑖𝑏−𝑘) for 𝑘 < 0, 𝑐0 ≔ (1/2)𝑎0, and 𝑐𝑘 ≔ (1/2)(𝑎𝑘 − 𝑖𝑏𝑘) for 𝑘 > 0,
and the fundamental frequency 𝜔0 is given in terms of the period 𝑇 of the function 𝑓 by 𝜔0 = 2𝜋/𝑇.
12. The coefficients of the polar form of the Fourier series in terms of the real form coefficients are given by 𝑑0 = (1/2)𝑎0 and 𝑑𝑘 = √(𝑎𝑘² + 𝑏𝑘²), and the phase angles are given by 𝜙𝑘 = arctan(𝑏𝑘/𝑎𝑘) for 𝑘 = 1,2,…
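The real-to-polar conversion of question 12 can be sketched in Python. One implementation detail of ours: the sketch uses atan2(b_k, a_k) instead of a plain arctan(b_k/a_k), so that the phase angle also comes out correctly when a_k ≤ 0; the conventions assumed are those of the series in question 12.

```python
import math

def real_to_polar(a, b):
    """Convert real Fourier coefficients a = [a0, a1, a2, ...] and b = [b1, b2, ...]
    to polar form coefficients d = [d0, d1, ...] and phase angles phi = [phi1, ...],
    assuming the conventions
      f(x) = a0/2 + sum_k (a_k cos(k w0 x) + b_k sin(k w0 x))
           = d0 + sum_k d_k cos(k w0 x - phi_k)."""
    d = [a[0] / 2.0] + [math.hypot(a[k], b[k - 1]) for k in range(1, len(a))]
    phi = [math.atan2(b[k - 1], a[k]) for k in range(1, len(a))]
    return d, phi
```

For example, a_1 = 3, b_1 = 4 yields d_1 = 5 and phi_1 = atan2(4, 3), and 3 cos(θ) + 4 sin(θ) = 5 cos(θ − phi_1) holds for all θ.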
An Introduction to Numerical Optimization
(1) Gradient methods
The necessary condition for a local extremal point (i.e., the location of a local maximum or minimum in the domain of a function) is that the first derivative of the function under study vanishes. For multivariate real-valued functions
𝑓: Θ ⊆ ℝ^𝑝 → ℝ, 𝜃 ↦ 𝑓(𝜃) (1)
this corresponds to the condition that ∇𝑓(𝜃∗) = 0 ∈ ℝ^𝑝 at the location of an extremal point 𝜃∗ ∈ ℝ^𝑝. The gradient ∇𝑓 of a multivariate real-valued function evaluated at a point of its domain points into the direction of steepest ascent of the function. To find a local minimum of the function, one can thus take a step proportional to the negative gradient, in which case one performs a gradient descent. Alternatively, to find a local maximum of the function, one can take a step proportional to the (positive) gradient. While a fairly obvious method, the length of the step is crucial for the success of a gradient method. This length is usually referred to as the step-size. In its simplest form, the step-size is set to a constant 𝜅 > 0. In Table 1, we describe a gradient descent scheme that tests on each iteration whether the necessary condition of a vanishing gradient is fulfilled for the current iterand.
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ^𝑝 and set 𝑘 ≔ 0. If ∇𝑓(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of ∇𝑓. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) = 𝜃^(𝑘) − 𝜅 ∇𝑓(𝜃^(𝑘))
2. If ∇𝑓(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of ∇𝑓. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 1. A fixed step-size gradient descent scheme for finding an extremal point of a multivariate real-valued function.
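The scheme of Table 1 is easily sketched in code. The following is a minimal Python sketch of ours, not from the text: in practice the exact condition ∇𝑓(𝜃) = 0 is replaced by a tolerance on the gradient norm, and the example function, gradient, and step-size are illustrative choices.

```python
def gradient_descent(grad, theta0, kappa=0.1, delta=1e-8, max_iter=10_000):
    """Fixed step-size gradient descent: theta^(k+1) = theta^(k) - kappa * grad f(theta^(k)).
    Stops once the gradient norm falls below delta (numerical stand-in for a vanishing gradient)."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)
        if sum(gi * gi for gi in g) ** 0.5 < delta:
            break
        theta = [ti - kappa * gi for ti, gi in zip(theta, g)]
    return theta

# Illustrative example: f(theta) = (theta_1 - 1)^2 + 2*(theta_2 + 3)^2, minimum at (1, -3)
grad_f = lambda t: [2.0 * (t[0] - 1.0), 4.0 * (t[1] + 3.0)]
theta_min = gradient_descent(grad_f, [0.0, 0.0])
```

For a maximum, one would step along the positive gradient instead; too large a 𝜅 makes the iteration diverge, too small a 𝜅 makes it very slow.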
(2) The Newton-Raphson method
In brief, the Newton-Raphson method is an algorithm that allows for finding zeros (or “roots”) of
functions, and in the context of numerical optimization is used to find zeros of the derivatives of a target
function. In the following, we first consider the univariate Newton-Raphson method from two perspectives: (1) as a method of finding the root of the derivative of a function of interest, and (2) as the minimization of a quadratic approximation to the function of interest. Both perspectives are based on the notion of Taylor approximations, which the interested reader may find useful to review. Finally, we extend the concept of the Newton-Raphson method to the multivariate case.
If we assume that a univariate function
𝑓: ℝ → ℝ, 𝜃 ↦ 𝑓(𝜃) (1)
of interest is differentiable and that at a zero 𝜃∗ of its first derivative, 𝑓′(𝜃∗) = 0, its second derivative is non-zero, i.e., 𝑓′′(𝜃∗) ≠ 0, then 𝑓 has a maximum or a minimum at 𝜃∗. The idea of the Newton-Raphson method is to first guess an initial value 𝜃^(0) and set an iteration index 𝑘 to zero, such that 𝑘 ≔ 0 and thus 𝜃^(𝑘) = 𝜃^(0). To find an approximate value for the zero of 𝑓′, the derivative 𝑓′ is then approximated using a first-order Taylor series around the current value 𝜃^(𝑘):
𝑓′(𝜃) ≈ 𝑓̃′(𝜃) ≔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) (2)
In brief, equation (2) states that the value of the derivative 𝑓′ at the location 𝜃 ∈ ℝ is approximated by the sum of (1) the value of the derivative 𝑓′ at the location 𝜃^(𝑘) ∈ ℝ and (2) the product of the rate of change of the function 𝑓′ at 𝜃^(𝑘) and the distance (𝜃 − 𝜃^(𝑘)) between the location of interest 𝜃 ∈ ℝ and the current location 𝜃^(𝑘) (see Figure 1 for an illustration).
Figure 1. Visualization of the univariate Newton-Raphson method for two iterations. Aspects of the first iteration are colored red, aspects of the second iteration are colored blue. For details, see the main text.
The Newton-Raphson method then proposes to approximate the zero of 𝑓′ by the zero of 𝑓̃′, which is readily analytically evaluated:
𝑓̃′(𝜃) = 0 ⇔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = 0 (3)
⇔ 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = −𝑓′(𝜃^(𝑘))
⇔ 𝜃 − 𝜃^(𝑘) = −𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
⇔ 𝜃 = 𝜃^(𝑘) − 𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
Note that the right-hand side of the last line in (3) only involves known terms, as 𝜃^(𝑘) has been defined above, and the first and second derivatives of 𝑓 are known (provided the function 𝑓 is accessible to analytical differentiation, or the derivatives can be evaluated numerically). The full Newton-Raphson method then takes the form of the following iterative algorithm
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ and set 𝑘 ≔ 0. If 𝑓′(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of 𝑓′. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) ≔ 𝜃^(𝑘) − 𝑓′(𝜃^(𝑘))/𝑓′′(𝜃^(𝑘))
2. If 𝑓′(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of 𝑓′. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 2. The Newton-Raphson method for finding an extremal point of a univariate real-valued function.
Here, “convergence” denotes the case that a zero 𝜃^(𝑘) of 𝑓′ has been found, or, alternatively, that a value of 𝜃^(𝑘) has been found such that the distance between 𝑓′(𝜃^(𝑘)) and zero is smaller than some small value 𝛿 > 0. From the theory of nonlinear optimization it is known that the Newton-Raphson method converges from any starting point 𝜃^(0) as long as 𝑓′ is twice differentiable, is a convex function, and does, in fact, have a zero.
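Table 2 translates directly into code. A minimal Python sketch of ours, with the exact stopping condition 𝑓′(𝜃) = 0 replaced by |𝑓′(𝜃)| < 𝛿 as discussed above; the example function is an illustrative choice:

```python
def newton_raphson(fprime, fsecond, theta0, delta=1e-10, max_iter=100):
    """Univariate Newton-Raphson: theta^(k+1) = theta^(k) - f'(theta^(k)) / f''(theta^(k)),
    stopping once |f'(theta^(k))| < delta."""
    theta = theta0
    for _ in range(max_iter):
        if abs(fprime(theta)) < delta:
            break
        theta = theta - fprime(theta) / fsecond(theta)
    return theta

# Illustrative example: f(theta) = theta^4 - 2*theta^2 with f'(theta) = 4*theta^3 - 4*theta
# and f''(theta) = 12*theta^2 - 4; started at theta = 2, the iteration converges to the
# local minimum of f at theta = 1, a zero of f'.
theta_star = newton_raphson(lambda t: 4 * t**3 - 4 * t, lambda t: 12 * t**2 - 4, 2.0)
```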
Rather than a method for finding a zero of the first derivative of a function 𝑓 of interest, the Newton-Raphson method may also be interpreted as a maximization/minimization of a second-order approximation to the original function 𝑓 of interest. The second-order Taylor approximation of a twice differentiable, real-valued, univariate function 𝑓 at a location 𝜃^(𝑘) ∈ ℝ is given by
𝑓(𝜃) ≈ 𝑓̃(𝜃) ≔ 𝑓(𝜃^(𝑘)) + 𝑓′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) + (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))² (4)
Note the differences to equation (2): firstly, the approximation is formulated here for 𝑓(𝜃), while in (2) it was formulated for 𝑓′(𝜃), and secondly, the approximation 𝑓̃(𝜃) also includes the second-order term (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))². To find an extremal point of 𝑓̃(𝜃) in lieu of 𝑓(𝜃), the usual approach familiar from analytical optimization is chosen: the derivative of 𝑓̃(𝜃) is evaluated and set to zero.
Here, we have for the derivative of 𝑓̃(𝜃):
𝑓̃′(𝜃) = 𝑑/𝑑𝜃 𝑓̃(𝜃) (5)
= 𝑑/𝑑𝜃 (𝑓(𝜃^(𝑘)) + 𝑓′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) + (1/2)𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))²)
= 𝑑/𝑑𝜃 𝑓(𝜃^(𝑘)) + 𝑑/𝑑𝜃 (𝑓′(𝜃^(𝑘))𝜃) − 𝑑/𝑑𝜃 (𝑓′(𝜃^(𝑘))𝜃^(𝑘)) + 𝑑/𝑑𝜃 ((1/2)𝑓′′(𝜃^(𝑘))𝜃²) − 𝑑/𝑑𝜃 (𝑓′′(𝜃^(𝑘))𝜃𝜃^(𝑘)) + 𝑑/𝑑𝜃 ((1/2)𝑓′′(𝜃^(𝑘))(𝜃^(𝑘))²)
= 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))𝜃 − 𝑓′′(𝜃^(𝑘))𝜃^(𝑘)
= 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘))
Setting 𝑓̃′(𝜃) to zero then results in
𝑓̃′(𝜃) = 0 ⇔ 𝑓′(𝜃^(𝑘)) + 𝑓′′(𝜃^(𝑘))(𝜃 − 𝜃^(𝑘)) = 0 (6)
which is equivalent to equation (3). The discussion following (3) is thus applicable also from the perspective of a second-order approximation to 𝑓.
So far, we have only considered the numerical optimization of univariate real-valued functions of the form
𝑓: ℝ → ℝ, 𝜃 ↦ 𝑓(𝜃) (7)
We next generalize the Newton-Raphson method to the case of multivariate, real-valued functions
𝑓: ℝ^𝑝 → ℝ, 𝜃 ↦ 𝑓(𝜃) (8)
Note that the output value of these functions of interest is still a scalar value.
Above, we have seen that the Newton-Raphson method requires the first and second derivatives of the function of interest. The multivariate extension of the Newton-Raphson method takes exactly the same form as the univariate case, if one replaces the notion of the function’s first-order derivative 𝑓′(𝜃^(𝑘)) by its gradient ∇𝑓(𝜃^(𝑘)) and the notion of the function’s second-order derivative 𝑓′′(𝜃^(𝑘)) by its Hessian 𝐻𝑓(𝜃^(𝑘)) = ∇²𝑓(𝜃^(𝑘)). Noting that division by a scalar corresponds to multiplication with the inverse of a matrix then yields the multivariate form of the Newton-Raphson method as shown in Table 3.
Initialization
0. Define a starting point 𝜃^(0) ∈ ℝ^𝑝 and set 𝑘 ≔ 0. If ∇𝑓(𝜃^(0)) = 0, stop! 𝜃^(0) is a zero of ∇𝑓. If not, proceed to iterations.
Until Convergence
1. Set 𝜃^(𝑘+1) = 𝜃^(𝑘) − (∇²𝑓(𝜃^(𝑘)))⁻¹ ∇𝑓(𝜃^(𝑘))
2. If ∇𝑓(𝜃^(𝑘+1)) = 0, stop! 𝜃^(𝑘+1) is a zero of ∇𝑓. If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 3. The Newton-Raphson method for finding an extremal point of a multivariate real-valued function.
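For 𝑝 = 2, the update of Table 3 can be sketched without a linear algebra library by inverting the 2×2 Hessian by hand (cf. the section on inversion of small matrices). The quadratic test function below is an illustrative choice of ours, for which the Newton-Raphson method converges in a single step:

```python
def newton_2d(grad, hess, theta0, delta=1e-10, max_iter=100):
    """Multivariate Newton-Raphson for p = 2:
    theta^(k+1) = theta^(k) - (H_f(theta^(k)))^{-1} grad f(theta^(k))."""
    t1, t2 = theta0
    for _ in range(max_iter):
        g1, g2 = grad(t1, t2)
        if (g1 * g1 + g2 * g2) ** 0.5 < delta:
            break
        (a, b), (c, d) = hess(t1, t2)
        det = a * d - b * c             # inverse of a 2x2 matrix by hand
        s1 = (d * g1 - b * g2) / det    # first component of H^{-1} grad
        s2 = (-c * g1 + a * g2) / det   # second component of H^{-1} grad
        t1, t2 = t1 - s1, t2 - s2
    return t1, t2

# Illustrative example: f(theta) = theta_1^2 + theta_1*theta_2 + 2*theta_2^2, minimum at (0, 0)
grad_f = lambda t1, t2: (2.0 * t1 + t2, t1 + 4.0 * t2)
hess_f = lambda t1, t2: ((2.0, 1.0), (1.0, 4.0))
theta_star = newton_2d(grad_f, hess_f, (3.0, -2.0))
```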
Foundations of Probabilistic Models
Probability Theory
In this Section we will start out with a pragmatic approach to probabilistic concepts and introduce mathematically more rigorous concepts at a later stage. It is important to keep in mind that probability theory is a mathematical model for real-world phenomena that appear unpredictable (i.e., random) to humans. In PMFN we describe this theory, not the real world itself. In other words, the statement “A fair coin comes up heads with a probability of 0.5” is a statement within probability theory and is to be understood as a definition within this theory. Whether fair coins exist in the real world, and whether in the limit of an infinite number of observations of their tossing behaviour they actually come up heads in half of the cases, is a different question, and is not addressed here. The interested reader may find fine-grained discussions of the philosophical underpinnings of probability theory in works such as [Jaynes 2003] and [DeFinetti 1981].
(1) Random variables
Random variables can be introduced in at least two different ways: first, in an intuitive, mathematically highly imprecise way, and second, in a non-intuitive, mathematically precise way, which uses modern measure theory. In the latter sense, random variables are measurable mappings from a probability space into a measurable space. As such, random variables are neither “random” nor are they “variables”; they are deterministic mappings or functions. We introduce the measure-theoretic foundations of probability theory below; for now, however, it is helpful to use the following intuitive concept of a random variable:
“𝑥 is a random variable, if it takes on random values” (1)
Although this concept may almost be considered a tautology, it is helpful nonetheless. It tells us that a random variable 𝑥 can take on different values. For a given random variable 𝑥, it is always helpful to think about what kind of values it can take on. For example, if the random variable 𝑥 is used to describe a coin toss, we may choose the values “heads” and “tails” as the values that 𝑥 can take on. Alternatively, if 𝑥 is used to describe a die, we may choose the values {1,2,3,4,5,6} as the values it can take on.
Consider for the moment a random variable that can take on 𝑛 ∈ ℕ different values. For each value 𝑥𝑖 (𝑖 = 1,…,𝑛) that 𝑥 can take on, we can define a probability that 𝑥 takes on the value 𝑥𝑖. The probability that 𝑥 takes on value 𝑥𝑖 may be written as
𝑝(𝑥 = 𝑥𝑖) (2)
(2) should be read as “the probability that the random variable 𝑥 takes on the value 𝑥𝑖”. Probabilities of
random variables have two well-known properties: they lie between 0 and 1, and for all possible values that
𝑥 can take on, they sum up to 1. For example, if we believe that all sides of a six-sided die have the same
probability of being the result of a throw of the die, we could write down the following:
𝑝(𝑥 = 𝑖) = 1/6 for 𝑖 = 1,…,6 (3)
In (3) we allocate to each value that 𝑥 can take on a probability, namely 1/6. Because there are six possible values that 𝑥 can take on (corresponding to the sides of the six-sided die), the sum of these probabilities is 1.
More abstractly, we have specified a mapping from the “outcome space” of 𝑥 to the interval [0,1] in
the form
ℕ6 → [0,1], 𝑥𝑖 ↦ 𝑝(𝑥 = 𝑥𝑖) = 1/6 (4)
In general, in probability theory, probabilities, i.e., numbers in the interval [0,1], are assigned to the outcomes of random variables.
(2) Joint and marginal probability distributions
A very important concept we require is the notion of joint probabilities. To this end, we first
consider two random variables 𝑥 and 𝑦. Each of the random variables may take on a set of different values
𝑥1, … , 𝑥𝑛 and 𝑦1, … , 𝑦𝑚 with 𝑛,𝑚 ∈ ℕ. Joint probabilities then describe the probabilities that 𝑥 and 𝑦
simultaneously take on values 𝑥𝑖 and 𝑦𝑗 (𝑖 = 1,… , 𝑛, 𝑗 = 1,… ,𝑚).
To obtain an intuition for joint probabilities, consider the following example: Assume that 𝑥 can take
on 3 different values 𝑥1 = 1, 𝑥2 = 2 and 𝑥3 = 3. Further assume that 𝑦 can take on 4 different values
𝑦1 = 1, 𝑦2 = 2, 𝑦3 = 3, 𝑦4 = 4. Then there are 3 × 4 = 12 different combinations of the values of 𝑥 and 𝑦:
(𝑥 = 𝑥1, 𝑦 = 𝑦1) (𝑥 = 𝑥1, 𝑦 = 𝑦2) (𝑥 = 𝑥1, 𝑦 = 𝑦3) (𝑥 = 𝑥1, 𝑦 = 𝑦4)
(𝑥 = 𝑥2, 𝑦 = 𝑦1) (𝑥 = 𝑥2, 𝑦 = 𝑦2) (𝑥 = 𝑥2, 𝑦 = 𝑦3) (𝑥 = 𝑥2, 𝑦 = 𝑦4)
(𝑥 = 𝑥3, 𝑦 = 𝑦1) (𝑥 = 𝑥3, 𝑦 = 𝑦2) (𝑥 = 𝑥3, 𝑦 = 𝑦3) (𝑥 = 𝑥3, 𝑦 = 𝑦4)
(1)
A joint probability distribution now allocates a probability, i.e., a number in the interval [0,1] to each
possible combination of the values for 𝑥 and 𝑦. As in the case for single random variables (also called
marginal random variables, because they can be considered the “marginal projections” of multivariate
random variables), all these probabilities must sum up to one. A joint probability distribution for the
example above may take the following form
𝑝(𝑥 = 1, 𝑦 = 1) = 2/12 𝑝(𝑥 = 1, 𝑦 = 2) = 0 𝑝(𝑥 = 1, 𝑦 = 3) = 0 𝑝(𝑥 = 1, 𝑦 = 4) = 1/12
𝑝(𝑥 = 2, 𝑦 = 1) = 0 𝑝(𝑥 = 2, 𝑦 = 2) = 3/12 𝑝(𝑥 = 2, 𝑦 = 3) = 1/12 𝑝(𝑥 = 2, 𝑦 = 4) = 2/12
𝑝(𝑥 = 3, 𝑦 = 1) = 1/12 𝑝(𝑥 = 3, 𝑦 = 2) = 0 𝑝(𝑥 = 3, 𝑦 = 3) = 2/12 𝑝(𝑥 = 3, 𝑦 = 4) = 0
(2)
If we sum up the probabilities from all cells, we obtain
∑_{𝑖=1}^{3} ∑_{𝑗=1}^{4} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 𝑦𝑗) = 2/12 + 0 + 0 + 1/12 + 0 + 3/12 + 1/12 + 2/12 + 1/12 + 0 + 2/12 + 0 = 12/12 = 1 (3)
and thus, the probabilities sum up to one, as required for probability distributions.
We may also sum up the probabilities row-wise, i.e., over the different values that 𝑦 can take on. We obtain the following
𝑝(𝑥 = 1) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 1, 𝑦 = 𝑦𝑗) = 2/12 + 0 + 0 + 1/12 = 3/12
𝑝(𝑥 = 2) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 2, 𝑦 = 𝑦𝑗) = 0 + 3/12 + 1/12 + 2/12 = 6/12
𝑝(𝑥 = 3) ≔ ∑_{𝑗=1}^{4} 𝑝(𝑥 = 3, 𝑦 = 𝑦𝑗) = 1/12 + 0 + 2/12 + 0 = 3/12
(4)
The probabilities of 𝑥 taking on the values 1, 2, or 3 obtained in this manner sum up to one again
∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖) = 3/12 + 6/12 + 3/12 = 12/12 = 1 (5)
This is no coincidence, but reflects the fact that if we have specified a joint probability distribution of two random variables, or in other words a bivariate distribution, we have also specified the marginal distributions of the two univariate random variables. Above, we have seen the marginal distribution of 𝑥. The marginal distribution of 𝑦 is obtained by summing over columns:
𝑝(𝑦 = 1) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 1) = 2/12 + 0 + 1/12 = 3/12
𝑝(𝑦 = 2) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 2) = 0 + 3/12 + 0 = 3/12
𝑝(𝑦 = 3) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 3) = 0 + 1/12 + 2/12 = 3/12
𝑝(𝑦 = 4) ≔ ∑_{𝑖=1}^{3} 𝑝(𝑥 = 𝑥𝑖, 𝑦 = 4) = 1/12 + 2/12 + 0 = 3/12
(6)
Again, the probabilities of 𝑦 taking on the values 1, 2, 3, and 4 sum up to one, as required:
∑_{𝑗=1}^{4} 𝑝(𝑦 = 𝑦𝑗) = 3/12 + 3/12 + 3/12 + 3/12 = 12/12 = 1 (7)
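The row- and column-wise summations above are mechanical enough to be checked in code. A small Python sketch of the example joint distribution using exact fractions (the variable names are ours):

```python
from fractions import Fraction

F = Fraction  # shorthand; all probabilities are given in twelfths
joint = {
    (1, 1): F(2, 12), (1, 2): F(0),     (1, 3): F(0),     (1, 4): F(1, 12),
    (2, 1): F(0),     (2, 2): F(3, 12), (2, 3): F(1, 12), (2, 4): F(2, 12),
    (3, 1): F(1, 12), (3, 2): F(0),     (3, 3): F(2, 12), (3, 4): F(0),
}

total = sum(joint.values())  # all cells sum to one
p_x = {i: sum(joint[i, j] for j in range(1, 5)) for i in range(1, 4)}  # row-wise: marginal of x
p_y = {j: sum(joint[i, j] for i in range(1, 4)) for j in range(1, 5)}  # column-wise: marginal of y
```

p_x then equals the marginal distribution of 𝑥 in (4), and p_y the marginal distribution of 𝑦 in (6).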
(3) Conditional Probabilities
Consider the following joint probability distribution
𝑝(𝑥, 𝑦) (1)
over the random quantities 𝑥 and 𝑦. As discussed above, a joint probability distribution of two random variables allocates to each possible combination of values that the two random variables can take on a probability mass (for discrete random variables) or a probability density (for continuous random variables).
For example, for discrete random variables, the probability that 𝑥 = 1 and 𝑦 = 2 may be specified by (1) as
𝑝(𝑥 = 1, 𝑦 = 2) = 0.1 (2)
If we have specified a joint distribution, we may ask the following questions: (1) What is the probability of
each marginal variable (i.e., either 𝑥 or 𝑦) to take on specific values, irrespective of the values of the other
marginal variable? (2) If we know the value of one marginal variable, say 𝑥 = 1, what is the probability for
the other marginal variable to take on a specific value? The answer to the first question relates to the notion
of marginal probability distributions as discussed above, while the answer to the second question relates to
the notion of conditional probability distributions.
As an example, consider the following joint probability mass function
𝑝(𝑥 = 1, 𝑦 = 1) = 0.1 𝑝(𝑥 = 1, 𝑦 = 2) = 0.3 𝑝(𝑥 = 1, 𝑦 = 3) = 0.2
𝑝(𝑥 = 2, 𝑦 = 1) = 0.2 𝑝(𝑥 = 2, 𝑦 = 2) = 0.1 𝑝(𝑥 = 2, 𝑦 = 3) = 0.1
Table 1. A probability mass function over two random variables 𝑥 and 𝑦.
Here, 𝑥 can take on the values 1 and 2 and 𝑦 can take on the values 1, 2, and 3. Summing probabilities over all possible combinations of values that the random variables can take on yields 1, as required for probability distributions. Consider now the marginal variable 𝑥. Summing over the values of 𝑦 as discussed above yields the probability distribution 𝑝(𝑥), which is defined by
𝑝(𝑥 = 1) = 0.6 and 𝑝(𝑥 = 2) = 0.4 (1)
and is referred to as the marginal probability distribution of 𝑥. Using the same summing procedure with respect to 𝑥 to obtain the marginal distribution 𝑝(𝑦), we get
𝑝(𝑦 = 1) = 0.3, 𝑝(𝑦 = 2) = 0.4 and 𝑝(𝑦 = 3) = 0.3 (2)
We now consider the second question posed above. For the probability mass function example of Table 1, what is the probability that 𝑦 = 2, if we know that 𝑥 = 1? Looking into the table, one may be tempted to say 0.3, because we have 𝑝(𝑥 = 1, 𝑦 = 2) = 0.3. However, note that 𝑝(𝑥 = 1, 𝑦 = 1) = 0.1 and 𝑝(𝑥 = 1, 𝑦 = 3) = 0.2. Apparently, the values 0.1, 0.3, and 0.2 allocated to 𝑦 = 1, 𝑦 = 2, and 𝑦 = 3 do not sum to 1 and hence do not define a probability distribution over 𝑦. Nevertheless, the intuition that the probability of 𝑦 taking on the value 2, if we know that 𝑥 = 1, should be higher than the probability of 𝑦 taking on the values 1 and 3, respectively, as indicated in the table, is perfectly valid. Based on the fact that, over all combinations of values of 𝑥 and 𝑦, the above probability mass function sums to one, it turns out that the values specified in the table for 𝑝(𝑥 = 1, 𝑦 = 1), 𝑝(𝑥 = 1, 𝑦 = 2) and 𝑝(𝑥 = 1, 𝑦 = 3) can be turned into a proper probability distribution over 𝑦, if we divide them by the marginal probability of 𝑥 taking on the value 1, i.e., 𝑝(𝑥 = 1) = 0.6. Intuitively, this step merely corresponds to “normalizing” (i.e., setting to one) the marginal probability 𝑝(𝑥 = 1) = 0.6:
𝑝(𝑥 = 1)/0.6 = 0.6/0.6 = 1 ⇔ (∑_{𝑖=1}^{3} 𝑝(𝑥 = 1, 𝑦 = 𝑖))/0.6 = (0.1 + 0.3 + 0.2)/0.6 = 0.1/0.6 + 0.3/0.6 + 0.2/0.6 (3)
We then have
𝑝(𝑥 = 1, 𝑦 = 1)/𝑝(𝑥 = 1) = 0.1/0.6 = 1/6, 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑥 = 1) = 0.3/0.6 = 3/6 and 𝑝(𝑥 = 1, 𝑦 = 3)/𝑝(𝑥 = 1) = 0.2/0.6 = 2/6 (4)
and these values sum to 1 and thus represent a probability distribution over 𝑦. We can now answer the
question posed above: The probability that 𝑦 takes on the value 2, if we know that 𝑥 takes on the value 1 is
𝑝(𝑦 = 2|𝑥 = 1) = 0.5 (5)
The “|” in the statement above should be read as “given that”, such that the whole statement reads in verbal
terms: “The probability that 𝑦 takes on the value 2 given that 𝑥 takes on the value 1 is 0.5”. The statement
above is a statement of conditional probability, i.e., it describes the probability of 𝑦 taking on a specific value
conditioned on the fact that 𝑥 takes on a specific value. Based on how we computed the probabilities above,
the general specification of a conditional probability for 𝑦 taking on a value 𝑦∗ and 𝑥 taking on a value 𝑥∗ is
given by
𝑝(𝑦 = 𝑦∗|𝑥 = 𝑥∗) = 𝑝(𝑥 = 𝑥∗, 𝑦 = 𝑦∗)/𝑝(𝑥 = 𝑥∗) (6)
Because the rule above holds for any values 𝑦∗ and 𝑥∗ that the random variables 𝑦 and 𝑥 may take on, the specific arguments are usually suppressed, leading to the general definition of the conditional probability distribution of 𝑦 given 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) (7)
The statement above says that to compute the conditional probability of 𝑦 taking on an arbitrary value in its
outcome space 𝒴, given that 𝑥 takes on an arbitrary value in its outcome space 𝒳, we have to look up the
joint probability 𝑝(𝑥, 𝑦) of 𝑥 and 𝑦 taking on these values, and divide this probability by the marginal
probability of 𝑥 taking on the specified value. As above, we may evaluate the marginal probability
distribution of 𝑥 by marginalizing over 𝑦. It is very important to note that the conditional probability distribution 𝑝(𝑦|𝑥 = 𝑥∗) and the joint distribution 𝑝(𝑥 = 𝑥∗, 𝑦), for a given value 𝑥∗, are virtually identical: they differ only by the multiplicative constant 1/𝑝(𝑥 = 𝑥∗), which merely renders 𝑝(𝑥 = 𝑥∗, 𝑦) a proper probability distribution over 𝑦. In other words, the relative differences between the probabilities for different values of 𝑦 specified in the joint distribution of 𝑥 and 𝑦 are maintained in the conditional distribution of 𝑦 given a specific value of 𝑥.
So far, we have considered the probability of 𝑦 taking on a specific value, given that 𝑥 takes on a
specific value. Of course we may ask the symmetric question with respect to the probability of 𝑥 taking on a
specific value, given that we know that 𝑦 takes on a prespecified value. In this case, we merely have to
exchange the roles of 𝑦 and 𝑥 in the definition of conditional probability distribution above
𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) (8)
Note that there is nothing to exchange with respect to the joint distribution 𝑝(𝑥, 𝑦), because the order of its
arguments is irrelevant.
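The computation that led from Table 1 to 𝑝(𝑦 = 2|𝑥 = 1) = 0.5 can be written out as a small Python sketch of definition (7) (the helper name is ours):

```python
# Joint probability mass function of the example in Table 1
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

def conditional_y_given_x(joint, x_star):
    """p(y | x = x*) = p(x = x*, y) / p(x = x*), where p(x = x*) is obtained by
    marginalizing the joint distribution over y."""
    p_x_star = sum(p for (x, _), p in joint.items() if x == x_star)
    return {y: p / p_x_star for (x, y), p in joint.items() if x == x_star}

cond = conditional_y_given_x(joint, 1)  # division by p(x = 1) = 0.6 normalizes the row
```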
(4) Bayes Theorem
Bayes theorem is a statement about conditional probabilities. In itself, it is thus a mere corollary of the definition of conditional probability and is, by itself, silent on whether probabilities are interpreted as observed frequencies in a large data limit (corresponding to the view of classical statistics) or as degrees of subjective uncertainty (corresponding to the view of the “Bayesian” school of thought). Bayes theorem is readily derived as follows:
From equation (7) of the previous section we have
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦|𝑥) (1)
if we multiply both sides by 𝑝(𝑥). Likewise, from equation (8) of the previous section we have
𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑦)𝑝(𝑥|𝑦) (2)
if we multiply both sides by 𝑝(𝑦). We thus have two different ways to write the joint distribution 𝑝(𝑥, 𝑦)
based on the definitions of the conditional probability distributions 𝑝(𝑦|𝑥) and 𝑝(𝑥|𝑦). Of course, the joint
distribution 𝑝(𝑥, 𝑦) is equal to itself. We may thus write
𝑝(𝑥, 𝑦) = 𝑝(𝑥, 𝑦) (3)
Substituting (1) on the left-hand side, and (2) on the right-hand side of the equality above, we obtain
𝑝(𝑥)𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦) (4)
If we divide both sides of the above by 𝑝(𝑥), we obtain the conditional probability distribution of 𝑦 given 𝑥
in the following form
𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) (5)
Equation (5) above is known as “Bayes Theorem”. Resubstituting 𝑝(𝑥, 𝑦) for 𝑝(𝑦)𝑝(𝑥|𝑦), we see that (5) is identical to equation (7) of the previous section, our initial definition of the conditional probability distribution of 𝑦 given 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) (6)
In other words, Bayes Theorem is a rule to compute conditional probability distributions based on joint
distributions.
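Since Bayes Theorem is a corollary of the definition of conditional probability, computing 𝑝(𝑦|𝑥) via 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥) must agree with the direct definition 𝑝(𝑥, 𝑦)/𝑝(𝑥). A quick numerical check in Python, reusing the joint distribution of the previous section (names ours):

```python
# Joint distribution of the example of the previous section (Table 1)
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (1, 2)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (1, 2, 3)}

def p_y_given_x_direct(y, x):
    """Direct definition: p(y|x) = p(x, y) / p(x)."""
    return joint[x, y] / p_x[x]

def p_y_given_x_bayes(y, x):
    """Bayes Theorem: p(y|x) = p(y) p(x|y) / p(x), with p(x|y) = p(x, y) / p(y)."""
    return p_y[y] * (joint[x, y] / p_y[y]) / p_x[x]
```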
(5) Independent random variables
In probability theory, the notion of “stochastic independence” refers to the factorization of joint distributions into the products of their marginal distributions. In other words, two random variables 𝑥 and 𝑦 are said to be stochastically independent if and only if their joint probability equals the product of their marginal probabilities
𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦) (1)
Notably, (1) applies in equivalent form to probability mass and density functions. To make the factorization
definition of stochastic independence intuitive, consider the conditional probability 𝑝(𝑥|𝑦) for the case of two random variables 𝑥 and 𝑦. The conditional probability 𝑝(𝑥|𝑦) implies that the distribution over the values that the random variable 𝑥 may take on depends on the values that 𝑦 takes on. For example, for
a random variable 𝑥 taking on values in {1,2,3} and a random variable 𝑦 taking on values in {1,2}, we could
have the following
𝑝(𝑥 = 1|𝑦 = 1) = 0.2, 𝑝(𝑥 = 2|𝑦 = 1) = 0.3, 𝑝(𝑥 = 3|𝑦 = 1) = 0.5 (2)
𝑝(𝑥 = 1|𝑦 = 2) = 0.4, 𝑝(𝑥 = 2|𝑦 = 2) = 0.4, 𝑝(𝑥 = 3|𝑦 = 2) = 0.2 (3)
Notably, the probability that 𝑥 takes on, say, the value 2 is dependent on the value of 𝑦: if 𝑦 is 1, this
probability is 0.3 and hence lower than if 𝑦 is 2, in which case this probability is 0.4. In the special case that, given two random variables 𝑥 and 𝑦, the distribution of the random variable 𝑥 does not depend on the value that the random variable 𝑦 takes on, 𝑥 is said to be “stochastically independent” of 𝑦. In this case, the conditioning statement “|𝑦” is redundant, and we may simply write
𝑝(𝑥|𝑦) = 𝑝(𝑥) (4)
However, from the definition of the conditional probability 𝑝(𝑥|𝑦) this implies that
𝑝(𝑥|𝑦) = 𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥) (5)
The latter equation, however, is only possible if 𝑝(𝑦) can be cancelled out on the left-hand side, from which it follows that 𝑝(𝑥, 𝑦) must factorize into 𝑝(𝑥)𝑝(𝑦)
𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦) (6)
because in this case
𝑝(𝑥, 𝑦)/𝑝(𝑦) = 𝑝(𝑥)𝑝(𝑦)/𝑝(𝑦) = 𝑝(𝑥) (7)
In other words, the assumption that the probability distribution over the random variable 𝑥 is not affected by the random variable 𝑦 implies, by means of the definition of the conditional probability, the factorization of the joint distribution 𝑝(𝑥, 𝑦) into the product 𝑝(𝑥)𝑝(𝑦). Note that if 𝑝(𝑥|𝑦) = 𝑝(𝑥) and hence 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦), the conditional probability of 𝑦 given 𝑥 is likewise independent of 𝑥
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) = 𝑝(𝑥)𝑝(𝑦)/𝑝(𝑥) = 𝑝(𝑦) (8)
Below, we depict the joint distribution for independent random variables 𝑥 and 𝑦 based on the marginal distributions of the example in Section (2) “Joint and marginal probability distributions”. Note that each cell entry is derived merely by multiplication of the respective row and column marginal entries
𝑝(𝑥 = 1, 𝑦 = 1) = 1/16 𝑝(𝑥 = 1, 𝑦 = 2) = 1/16 𝑝(𝑥 = 1, 𝑦 = 3) = 1/16 𝑝(𝑥 = 1, 𝑦 = 4) = 1/16 | 𝑝(𝑥 = 1) = 3/12
𝑝(𝑥 = 2, 𝑦 = 1) = 2/16 𝑝(𝑥 = 2, 𝑦 = 2) = 2/16 𝑝(𝑥 = 2, 𝑦 = 3) = 2/16 𝑝(𝑥 = 2, 𝑦 = 4) = 2/16 | 𝑝(𝑥 = 2) = 6/12
𝑝(𝑥 = 3, 𝑦 = 1) = 1/16 𝑝(𝑥 = 3, 𝑦 = 2) = 1/16 𝑝(𝑥 = 3, 𝑦 = 3) = 1/16 𝑝(𝑥 = 3, 𝑦 = 4) = 1/16 | 𝑝(𝑥 = 3) = 3/12
𝑝(𝑦 = 1) = 3/12 𝑝(𝑦 = 2) = 3/12 𝑝(𝑦 = 3) = 3/12 𝑝(𝑦 = 4) = 3/12
(9)
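The construction of (9) — every cell as the product of its row and column marginals — can be verified in code. A Python sketch with exact fractions (names ours):

```python
from fractions import Fraction

F = Fraction
p_x = {1: F(3, 12), 2: F(6, 12), 3: F(3, 12)}               # marginal distribution of x
p_y = {1: F(3, 12), 2: F(3, 12), 3: F(3, 12), 4: F(3, 12)}  # marginal distribution of y

# Joint distribution of two independent random variables: product of the marginals
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

def is_independent(joint, p_x, p_y):
    """True iff p(x, y) = p(x) p(y) holds for every combination of values."""
    return all(joint[x, y] == p_x[x] * p_y[y] for x in p_x for y in p_y)
```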
(6) Discrete random variables and probability mass functions
The univariate and bivariate probability distributions of random variables that we have considered so far are examples of so-called “discrete outcome random variables” or “discrete random variables”. The defining feature of discrete random variables is that they can only take on a finite (or countably infinite) set of discrete values. Each of these values is assigned a probability, and this assignment will be referred to as a “probability mass function”. Intuitively, the complete probability mass of 1 is partitioned and distributed over the values that the discrete random variable can take on. Note that this does not imply uniform distributions: an example of a probability mass function is the “Bernoulli distribution” of a random variable 𝑥, which can take on the values 𝑥 = 1 and 𝑥 = 0 with probabilities 𝑝 ∈ [0,1] and (1 − 𝑝) ∈ [0,1], respectively. The Bernoulli distribution is often used to model the outcome of a coin toss, where, for an unbiased coin, one would set 𝑝 = 0.5.
(7) Continuous random variables and probability density functions
In contrast to neurophysiology, where one commonly encounters discrete probability distributions when modelling data like neuronal “spikes” (action potentials), neuroimaging data is usually assumed to be continuous. For example, the values of the MR signal at a given brain location may fluctuate around some baseline value, say 200, and can take on continuous values around that, like, e.g., 159.34, 177.67, 221.89, and so on. Likewise, the potential measured at a specific EEG electrode at a specific time point fluctuates continuously around, say, 0 𝜇𝑉 with respect to a reference electrode, taking on values in the range of −100 𝜇𝑉 to 100 𝜇𝑉. Because we are interested in describing these kinds of signals using probabilistic concepts, the focus in PMFN will be on “continuous” random variables, or in other words, random variables that take on values in the set of real numbers ℝ. From a probability theory viewpoint, this is quite a complication, because one now needs to distribute the probability mass of 1 over an infinity of values that the random variable can take on, both in the large value limit ]−∞,∞[ and in the very small value limit, i.e., over the infinity of real numbers that lie between any two real numbers. We will not strive for too much formal correctness with respect to these mathematical subtleties. For our purposes, it is enough to acknowledge that for “continuous” or “real random variables” probability is not assigned to specific values, but to intervals. Informally, the function that assigns probability to intervals in the case of real random variables is called a “probability density function”. One may conceive of this in analogy to the density concept in Archimedean physics: mass density is defined as mass per volume. This implies that if one has zero volume, one has zero mass. With respect to probability density, if one does not have an interval in the real numbers, but only one of its boundaries, i.e., a scalar in ℝ, this scalar is allocated zero probability.
The most important univariate probability density function required in PMFN is the “Gaussian” or
“normal” probability density function, defined as
𝑝𝜇,𝜎²: ℝ → ℝ+, 𝑥 ↦ 𝑝(𝑥; 𝜇, 𝜎²) ≔ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) (1)
The univariate Gaussian probability density function (or, informally, “Gaussian distribution”) has two parameters: the “expectation parameter” or “mean parameter” 𝜇 ∈ ℝ and the “variance parameter” 𝜎² ∈ ℝ+. Note that the variance parameter is by definition positive (i.e., strictly larger than zero). The expectation parameter determines the location of the Gaussian bell curve on the 𝑥-axis, and the width of the curve increases with increasing 𝜎². We will sometimes use the following notation to indicate that a random variable is distributed according to a univariate Gaussian distribution (or, in other words, that it is associated with a Gaussian probability density function)
𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²) (2)
where the tilde ~ should be read as “is distributed according to”. Note that the symbol before the semicolon
indicates the random variable, the symbols behind the semicolon denote the parameters of the associated
probability density function. Likewise, we may write
𝑝(𝑥) = 𝑁(𝑥; 𝜇, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) (3)
which is somewhat sloppy, but is understood in the same way as (1) and (2).
Informally, what expression (1) says is that the probability of a value of the random variable
𝑥 falling into an interval on the real line ℝ which is close to the value of 𝜇 is higher than the probability of it falling into an
interval on ℝ which is far from 𝜇. Simultaneously, expression (1) states that the “dispersion” or
“variability” or “probability of a deviation” from 𝜇 increases with higher values of 𝜎². This is most readily
appreciated by inspecting the probability density functions corresponding to different values of the
parameters 𝜇 and 𝜎² (Figure 1, uppermost panel). Note that the value of 𝜇 changes the location of the
curve, while the value of 𝜎² changes the width of the curve. Intuitively, the probability of the random
variable 𝑥 assuming a value in a very small interval around, say, 𝑥 = 2.5 varies with the value of 𝑝(𝑥 = 2.5).
In Figure 1 (uppermost panel), values around 𝑥 = 3 have a higher probability under the univariate normal
distribution with 𝜇 = 3, 𝜎² = 0.3 than under the other two probability distributions.
Figure 1. The univariate Gaussian distribution. The uppermost panel depicts the probability density functions of three different Gaussian distributions, represented by different settings of the parameters of their probability density functions. The middle panel depicts a sample of 20 values from the Gaussian distribution with the depicted probability density function. Note that these samples cluster where the probability density function assumes high values in 𝑥 space. The lowermost panel depicts a “histogram estimate”, based on a large sample from the Gaussian distribution, of the underlying probability density function. While we have not introduced the theory of histogram estimates, intuitively, these may be understood as frequency counts of sampled values falling into intervals in 𝑥 space. The frequency counts are appropriately scaled and depicted as grey bars over the respective intervals on the 𝑥 axis.
A good way to obtain an intuitive understanding of the theory discussed in this section is to explore
sampling from Gaussian distributions using the random number generator implemented in Matlab. Random
number generators allow one to sample numerical values from specified distributions. Matlab and Matlab’s
Statistics Toolbox comprise a large variety of random number generators for many different distributions
(uniform, binomial, Student t, and so on). In addition Matlab provides functions that return probability
densities, which may be seen as the “analytic” counterpart to the “empirical” random number generators.
The middle and lowermost panels in Figure 1 were generated by capitalizing on the random number
generators and probability density functions implemented in Matlab.
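In the same spirit, the sampling-plus-histogram comparison underlying Figure 1 can be sketched in a few lines. The snippet below uses Python/NumPy in place of Matlab purely for illustration, and the parameter values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 3.0, 0.3              # expectation and variance parameters (illustrative values)
sigma = np.sqrt(sigma2)

# draw a large sample from N(x; mu, sigma^2) with a random number generator
sample = rng.normal(loc=mu, scale=sigma, size=100_000)

# "histogram estimate": scaled frequency counts over intervals in x space
counts, edges = np.histogram(sample, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# analytic probability density at the bin centers
pdf = np.exp(-(centers - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# the histogram estimate closely tracks the analytic density
print(np.max(np.abs(counts - pdf)))
```

Plotting `centers` against `counts` and `pdf` reproduces the qualitative picture of the lowermost panel of Figure 1.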
(8) Expected value and variance of univariate random variables
Two further concepts from probability theory we require are the “expected value” or “expectation”
of a random variable, and the “variance” of a random variable. The reader will likely have an intuition about
what these terms refer to: the expected value is the value one “expects” to see “most of the time”, or “in
the long-run average” when sampling from the random variable. Likewise, the variance is a measure of the
“variability” around this average value in the long-run.
From a mathematical viewpoint, it is very important to clearly differentiate between the notions of
“expected value” and “variance” on the one hand, and “empirical mean” and “empirical variance” on the
other. The former two terms are theoretical constructs that can be evaluated as soon as some probability
mass or density function has been defined for a given random variable. The latter two terms refer to
constructs that can only be evaluated once realizations of a random variable have been observed. Of course,
expectation and variance as theoretical constructs carry a clear intuition with respect to sample mean and
sample variance. However, they are not the same and one should always be clear about whether a given
statement refers to the “theoretical” expectation of a random variable, or the “empirical” evaluation of a
sample from the random variable (and likewise for variances).
Because expected value/expectation and variance are theoretical constructs, they can be evaluated
solely on the basis of a probability mass or density function. For discrete random variables, the expected
value is defined as
𝐸(𝑥) = ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)𝑥𝑖 (1)
Note that we assumed that the discrete random variable 𝑥 can take on 𝑛 discrete values 𝑥1, … , 𝑥𝑛. For
example, the expected value of a random variable modelling a fair die using the probability mass function
ℕ6 → [0,1], 𝑥𝑖 ↦ 𝑝(𝑥 = 𝑥𝑖) = 1/6 (2)
is given by
𝐸(𝑥) = (1/6)⋅1 + (1/6)⋅2 + (1/6)⋅3 + (1/6)⋅4 + (1/6)⋅5 + (1/6)⋅6 = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5 (3)
Note two important features of expected values based on this example: they are single numbers, and they
take on values in the range spanned by the values of the random variable under consideration (even if, as for
the die, the expected value is not itself a possible outcome). Also note that the expected value has a clear intuition: it corresponds to the
sum over all values that the random variable can take on, where each value is weighted by a number in the
interval [0,1]. Notably, the entire set of weights sums to one.
For continuous real random variables and their associated probability density functions, the
summation in the definition of the expected value is replaced by integration, but the intuition remains the
same: the outcome values that the random variable can take on are multiplicatively weighted by their
associated probability densities and “added up”
𝐸(𝑥) = ∫ℝ 𝑝(𝑥) 𝑥 𝑑𝑥 (4)
As an example, the expected value of a univariate Gaussian random variable is given by its expectation
parameter (which we will not prove here): if 𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²), then
𝐸(𝑥) = ∫ℝ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) ⋅ 𝑥 𝑑𝑥 = 𝜇 (5)
The variance of a random variable corresponds to the expected squared deviation from its expected
value. Formally, we write
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) (6)
The variance can thus be viewed as a measure of the “variability” of a random variable in the large
sample limit. For probability mass functions of discrete random variables with finite outcome spaces, the
expected values in (6) can again be written as weighted sums over the outcome space:
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = 𝐸((𝑥 − ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)𝑥𝑖)²) = ∑_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖)(𝑥𝑖 − ∑_{𝑗=1}^{𝑛} 𝑝(𝑥𝑗)𝑥𝑗)² (7)
As an example, consider again the fair die. Above, we evaluated the expectation as
𝐸(𝑥) = ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)𝑥𝑖 = 3.5 (8)
To evaluate the variance of this random variable, we have to subtract 𝐸(𝑥) from each value 𝑥𝑖 that 𝑥 can
take on (i.e. 1,2,3,4,5,6), square each result, weight it with the probability 𝑝(𝑥 = 𝑥𝑖) and finally sum over all
values we obtained:
𝑉(𝑥) = ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)(𝑥𝑖 − ∑_{𝑗=1}^{6} 𝑝(𝑥𝑗)𝑥𝑗)²
= ∑_{𝑖=1}^{6} 𝑝(𝑥𝑖)(𝑥𝑖 − 3.5)²
= (1/6)⋅(1 − 3.5)² + (1/6)⋅(2 − 3.5)² + (1/6)⋅(3 − 3.5)² + (1/6)⋅(4 − 3.5)² + (1/6)⋅(5 − 3.5)² + (1/6)⋅(6 − 3.5)²
= (1/6)⋅(−2.5)² + (1/6)⋅(−1.5)² + (1/6)⋅(−0.5)² + (1/6)⋅(0.5)² + (1/6)⋅(1.5)² + (1/6)⋅(2.5)²
= (1/6)⋅6.25 + (1/6)⋅2.25 + (1/6)⋅0.25 + (1/6)⋅0.25 + (1/6)⋅2.25 + (1/6)⋅6.25
= 17.5/6
≈ 2.92
(9)
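The fair-die computation above can be checked in a few lines of code (a sketch; Python is used here rather than the Matlab environment referenced earlier):

```python
import numpy as np

values = np.arange(1, 7)        # the outcomes 1, ..., 6 of the die
pmf = np.full(6, 1 / 6)         # fair die: p(x = x_i) = 1/6 for all i

# expected value: probability-weighted sum over the outcomes
E = np.sum(pmf * values)

# variance: probability-weighted sum of squared deviations from E
V = np.sum(pmf * (values - E) ** 2)

print(E, V)  # 3.5 and 17.5/6 ≈ 2.9167
```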
For a continuous random variable, summation is again replaced by integration, and the following
equation for the variance applies
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = ∫ℝ 𝑝(𝑥)(𝑥 − 𝐸(𝑥))² 𝑑𝑥 (10)
As an example, the variance of a univariate Gaussian random variable is given by its variance parameter
(which we will not prove here): if 𝑥 ~ 𝑁(𝑥; 𝜇, 𝜎²), then
𝑉(𝑥) = ∫ℝ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥 − 𝜇)²) ⋅ (𝑥 − 𝜇)² 𝑑𝑥 = 𝜎² (11)
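Statements (5) and (11) can be verified numerically by approximating the integrals with a Riemann sum on a fine grid (a sketch with arbitrary illustrative parameter values):

```python
import numpy as np

mu, sigma2 = 2.0, 1.5                            # illustrative parameter values
x = np.linspace(mu - 12.0, mu + 12.0, 200_001)   # grid covering virtually all probability mass
dx = x[1] - x[0]

# univariate Gaussian probability density on the grid
p = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

E = np.sum(p * x) * dx               # expected value: integral of p(x)*x
V = np.sum(p * (x - E) ** 2) * dx    # variance: integral of p(x)*(x - E)^2

print(E, V)  # close to mu = 2.0 and sigma2 = 1.5
```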
An often encountered variant of the variance of a random variable expresses the variance in terms of the
expectation of the square of the random variable and the square of the expectation. Specifically, we have
from (6)
𝑉(𝑥) = 𝐸((𝑥 − 𝐸(𝑥))²) = 𝐸(𝑥² − 2𝑥𝐸(𝑥) + 𝐸(𝑥)²) (12)
Applying the expectation to each term on the right-hand side of (12), and noting that 𝐸(𝑥) is not a
random variable but a constant, whose expectation is thus again 𝐸(𝑥), we have
𝐸(𝑥² − 2𝑥𝐸(𝑥) + 𝐸(𝑥)²) = 𝐸(𝑥²) − 2𝐸(𝑥)𝐸(𝑥) + 𝐸(𝑥)² = 𝐸(𝑥²) − 𝐸(𝑥)² (13)
We thus have
𝑉(𝑥) = 𝐸(𝑥²) − 𝐸(𝑥)² (14)
In summary, expectation and variance are theoretical concepts that can be applied to any random
variable to measure its “mean tendency” and its “dispersion” about this mean tendency. In the context of
PMFN it is essential to differentiate between three mathematical objects that evoke similar connotations,
but refer to different intuitions: (1) expectation/variance, (2) empirical mean/empirical variance, and (3)
the expectation parameter/variance parameter of Gaussian distributions. Again, (1) refers to measures of mean
tendency and dispersion, which depend on a random variable’s probability mass or density function, and are
“theoretical” or “analytical” measures. (2) refers to summary statistics of samples or realizations of random
variables. These can only be evaluated once samples of a random variable have been obtained. Finally, (3)
refers to numbers that describe the functional form of Gaussian distributions. Conveniently, the expectation
parameter of a Gaussian corresponds to its expectation, and the variance parameter corresponds to its
variance. This, however, need not be the case: consider for example the Bernoulli distribution introduced
above. Here, the probability mass function has a single parameter 𝑝. One can show that the expectation of a
Bernoulli distributed random variable is given by 𝐸(𝑥) = 𝑝, while its variance is given by 𝑉(𝑥) = 𝑝(1 − 𝑝).
One may thus refer to 𝑝 as the expectation parameter of 𝑥; however, unlike the Gaussian, the Bernoulli
distribution does not have a “variance parameter”. It nevertheless has a variance.
Study Questions
1. Provide a verbal explanation of the following statement: 𝑝(𝑥 = 2, 𝑦 = 4) = 0.4.
2. Explain the notions of probability mass functions and probability density functions using examples.
3. Compute the marginal distributions 𝑝(𝑥) and 𝑝(𝑦) and the conditional distributions 𝑝(𝑥|𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) of the
distribution given by the following probability mass function for random variables 𝑥 and 𝑦
𝑝(𝑥 = 1, 𝑦 = 1) = 2/10  𝑝(𝑥 = 1, 𝑦 = 2) = 1/10  𝑝(𝑥 = 1, 𝑦 = 3) = 3/10
𝑝(𝑥 = 2, 𝑦 = 1) = 1/10  𝑝(𝑥 = 2, 𝑦 = 2) = 1/10  𝑝(𝑥 = 2, 𝑦 = 3) = 2/10
4. Derive Bayes theorem based on a joint distribution 𝑝(𝑥, 𝑦) and the definition of the conditional distributions 𝑝(𝑥|𝑦) and
𝑝(𝑦|𝑥).
5. Evaluate the expectation of the distribution given by the following probability mass function for the random variable 𝑥 taking
on values in the set {1,4,10}.
𝑝(𝑥 = 1) = 0.3, 𝑝(𝑥 = 4) = 0.2, 𝑝(𝑥 = 10) = 0.5
6. Evaluate the joint distribution resulting, under the assumption of stochastic independence, from the marginal distributions of the
random variable 𝑥 with probability mass function
𝑝(𝑥 = 1) = 0.4, 𝑝(𝑥 = 2) = 0.6
and random variable 𝑦 with probability mass function
𝑝(𝑦 = 1) = 0.2, 𝑝(𝑦 = 2) = 0.8.
Study Questions Answers
1. 𝑝(𝑥 = 2, 𝑦 = 4) = 0.4 can be read as “the probability that the random variable 𝑥 takes on the value 2 and that the random
variable 𝑦 (simultaneously) takes on the value 4 is 0.4 ”.
2. A probability mass function allocates to each possible value that a random variable can take on a probability mass, such that over
all values that the random variable can take on, these masses add to 1. For example, if 𝑥 is a random variable taking on the values
0 and 1, then a probability mass function for 𝑥 is 𝑝(𝑥 = 0) = 0.2, 𝑝(𝑥 = 1) = 0.8. A probability density function allocates to
each possible value that a random variable can take on a probability density, which, if considered over a small interval around a
value, may be multiplied with the length of the interval to yield (approximately) the probability of the random variable taking on values in that
interval. A typical example of a probability density function is the Gaussian.
3. For the marginal distributions, see the rightmost column for 𝑝(𝑥) and the lowermost row
for 𝑝(𝑦) below. Note that the marginal distribution probability masses each sum to 1.
𝑝(𝑥 = 1, 𝑦 = 1) = 2/10  𝑝(𝑥 = 1, 𝑦 = 2) = 1/10  𝑝(𝑥 = 1, 𝑦 = 3) = 3/10  𝑝(𝑥 = 1) = 2/10 + 1/10 + 3/10 = 6/10
𝑝(𝑥 = 2, 𝑦 = 1) = 1/10  𝑝(𝑥 = 2, 𝑦 = 2) = 1/10  𝑝(𝑥 = 2, 𝑦 = 3) = 2/10  𝑝(𝑥 = 2) = 1/10 + 1/10 + 2/10 = 4/10
𝑝(𝑦 = 1) = 2/10 + 1/10 = 3/10  𝑝(𝑦 = 2) = 1/10 + 1/10 = 2/10  𝑝(𝑦 = 3) = 3/10 + 2/10 = 5/10
For the conditional distributions we have by definition
𝑝(𝑥|𝑦 = 2) = 𝑝(𝑥, 𝑦 = 2)/𝑝(𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦)/𝑝(𝑥 = 1)
As 𝑥 takes on the values 1 and 2 in the current example, the conditional distribution of 𝑥 given 𝑦 = 2 is given by
𝑝(𝑥 = 1|𝑦 = 2) = 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑦 = 2) = (1/10)⋅(10/2) = 1/2
𝑝(𝑥 = 2|𝑦 = 2) = 𝑝(𝑥 = 2, 𝑦 = 2)/𝑝(𝑦 = 2) = (1/10)⋅(10/2) = 1/2
And as 𝑦 takes on the values 1, 2 and 3 in the current example, the conditional distribution of 𝑦 given 𝑥 = 1 is given by
𝑝(𝑦 = 1|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 1)/𝑝(𝑥 = 1) = (2/10)⋅(10/6) = 2/6
𝑝(𝑦 = 2|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 2)/𝑝(𝑥 = 1) = (1/10)⋅(10/6) = 1/6
𝑝(𝑦 = 3|𝑥 = 1) = 𝑝(𝑥 = 1, 𝑦 = 3)/𝑝(𝑥 = 1) = (3/10)⋅(10/6) = 3/6
Note that the probabilities of both 𝑝(𝑥|𝑦 = 2) and 𝑝(𝑦|𝑥 = 1) sum to 1 over the values of 𝑥 and 𝑦, respectively.
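The marginalization and conditioning steps of question 3 can also be carried out programmatically; in the sketch below (Python/NumPy for illustration), rows index the values of 𝑥 and columns the values of 𝑦:

```python
import numpy as np

# joint pmf p(x, y): rows are x in {1, 2}, columns are y in {1, 2, 3}
joint = np.array([[2, 1, 3],
                  [1, 1, 2]]) / 10

p_x = joint.sum(axis=1)   # marginal p(x): sum over y
p_y = joint.sum(axis=0)   # marginal p(y): sum over x

# conditioning: p(x | y = 2) = p(x, y = 2) / p(y = 2), and analogously for p(y | x = 1)
p_x_given_y2 = joint[:, 1] / p_y[1]
p_y_given_x1 = joint[0, :] / p_x[0]

print(p_x)            # ≈ [0.6, 0.4]
print(p_y)            # ≈ [0.3, 0.2, 0.5]
print(p_x_given_y2)   # ≈ [0.5, 0.5]
print(p_y_given_x1)   # ≈ [2/6, 1/6, 3/6]
```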
4. From the definition of conditional probabilities, we see that we can write the joint distribution 𝑝(𝑥, 𝑦) in two ways
𝑝(𝑦|𝑥) = 𝑝(𝑥, 𝑦)/𝑝(𝑥) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦|𝑥) and 𝑝(𝑥|𝑦) = 𝑝(𝑥, 𝑦)/𝑝(𝑦) ⇒ 𝑝(𝑥, 𝑦) = 𝑝(𝑦)𝑝(𝑥|𝑦)
Setting the joint distribution 𝑝(𝑥, 𝑦) equal to itself then yields the following equivalences
𝑝(𝑥, 𝑦) = 𝑝(𝑥, 𝑦) ⇔ 𝑝(𝑥)𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦) ⇔ 𝑝(𝑦|𝑥) = 𝑝(𝑦)𝑝(𝑥|𝑦)/𝑝(𝑥)
where the last equality is known as “Bayes’ theorem”.
5. The expectation is given by
𝐸(𝑥) = 1 ⋅ 𝑝(𝑥 = 1) + 4 ⋅ 𝑝(𝑥 = 4) + 10 ⋅ 𝑝(𝑥 = 10)
= 1 ⋅ 0.3 + 4 ⋅ 0.2 + 10 ⋅ 0.5
= 0.3 + 0.8 + 5
= 6.1.
6. Under the assumption of independent marginal distributions, the entries in the respective cells of the joint distribution result
from the multiplication of the corresponding marginal probabilities. We thus have
𝑝(𝑥 = 1, 𝑦 = 1) = 8/100  𝑝(𝑥 = 1, 𝑦 = 2) = 32/100  𝑝(𝑥 = 1) = 4/10
𝑝(𝑥 = 2, 𝑦 = 1) = 12/100  𝑝(𝑥 = 2, 𝑦 = 2) = 48/100  𝑝(𝑥 = 2) = 6/10
𝑝(𝑦 = 1) = 2/10  𝑝(𝑦 = 2) = 8/10
Note that the joint distribution sums to 1: 8/100 + 32/100 + 12/100 + 48/100 = 100/100 = 1.
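Question 6's joint distribution is simply the outer product of the two marginal pmfs; a minimal sketch:

```python
import numpy as np

p_x = np.array([0.4, 0.6])   # p(x = 1), p(x = 2)
p_y = np.array([0.2, 0.8])   # p(y = 1), p(y = 2)

# under stochastic independence, p(x = i, y = j) = p(x = i) * p(y = j) for every cell
joint = np.outer(p_x, p_y)

print(joint)         # ≈ [[0.08, 0.32], [0.12, 0.48]]
print(joint.sum())   # ≈ 1
```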
Multivariate Gaussian distributions
(1) The bivariate Gaussian distribution
Multivariate Gaussian distributions extend the idea of univariate Gaussians to more than one
random variable. We will start by considering the most commonly discussed multivariate Gaussian
distribution, the 2-dimensional, or bivariate Gaussian distribution. Consider first the univariate Gaussian
probability density function
𝑝(𝑥 = 𝑥∗) = 𝑁(𝑥 = 𝑥∗; 𝜇, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥∗ − 𝜇)²) (1)
As discussed previously, the univariate Gaussian probability density function 𝑁(𝑥; 𝜇, 𝜎²) is a formula that
returns for each possible value 𝑥∗ of the random variable 𝑥 the associated probability density, or, more
intuitively, the probability of the random variable 𝑥 assuming a value in an infinitesimally small interval
around 𝑥∗. As noted previously, this probability depends on the parameters of the univariate Gaussian
probability density function: the “expectation parameter” 𝜇 ∈ ℝ, indexing the centre of the bell-shaped
Gaussian curve, and the “variance parameter” 𝜎² > 0 (which has to be positive), indexing the width of the
bell-shaped curve. As a generalization of this idea, the bivariate Gaussian distribution is a formula that
returns for each pair of values 𝑥1∗ and 𝑥2∗ of the random variables 𝑥1 and 𝑥2 the probability of 𝑥1 and 𝑥2
simultaneously falling into a very small square around (𝑥1∗, 𝑥2∗) in the space ℝ². For example, based
on the multivariate Gaussian formula, we can evaluate the probability density at 𝑥1∗ = 0.7 and (simultaneously)
𝑥2∗ = 0.5. The values 𝑥1∗ and 𝑥2∗ may be summarized in a two-dimensional vector
𝑥∗ = (𝑥1∗, 𝑥2∗)^𝑇 (2)
The formula for the probability density of the random vector 𝑥 at a specific value 𝑥∗, i.e. the bivariate
Gaussian probability density function, assumes a form that is very similar to (1) and is written as
𝑝(𝑥 = 𝑥∗) = 𝑁(𝑥 = 𝑥∗; 𝜇, Σ) = (2𝜋)^(−1) |Σ|^(−1/2) exp(−(1/2)(𝑥∗ − 𝜇)^𝑇 Σ^(−1) (𝑥∗ − 𝜇)) (3)
where |Σ| denotes the determinant and Σ−1 denotes the inverse of Σ . As for the univariate case, the
important part is the expression in the exponential function, whereas the factor (2𝜋)^(−1)|Σ|^(−1/2) serves the
purpose of normalization, i.e. integration of the probability density function to 1 over the outcome space
ℝ². A number of bivariate Gaussian probability density functions and their parameters are visualized in
Figure 1. Note the similarity between (1) and (3): instead of the mean or expectation parameter 𝜇 ∈ ℝ in (1),
the bivariate Gaussian distribution introduces the “mean” or “expectation vector” 𝜇 ∈ ℝ². The intuition for
𝜇 is the same in both cases: it represents the center of mass of the distribution, or in other words, the center
location of the bell-curve. Likewise, instead of a variance parameter 𝜎² > 0, equation (3) introduces the
notion of a “covariance matrix (parameter)” Σ ∈ ℝ^(2×2) that governs the width of the corresponding bell-
curve by means of the entries on its main diagonal. The off-diagonal elements have a somewhat different
interpretation, which we will examine in more detail below. Beforehand, however, we will consider the
example
𝑥∗ = (𝑥1∗, 𝑥2∗)^𝑇 = (0.7, 0.5)^𝑇 (4)
introduced above in a bit more detail: to evaluate the probability of the random vector 𝑥 = (𝑥1, 𝑥2)^𝑇
falling into a very small square in the vicinity of (0.7, 0.5)^𝑇 (i.e. the probability density at 𝑥∗ = (0.7, 0.5)^𝑇)
based on (3), we need to know the values of 𝜇 and Σ. Let us assume that
𝜇 = (𝜇1, 𝜇2)^𝑇 ≔ (1, 1)^𝑇 and Σ = (𝜎11² 𝜎12²; 𝜎21² 𝜎22²) ≔ (0.1 0.07; 0.07 0.1) (5)
Equation (3) then simply means the following: to evaluate the probability density at 𝑥∗ = (0.7, 0.5)^𝑇,
compute
𝑝(𝑥 = 𝑥∗) = (2𝜋)^(−1) |Σ|^(−1/2) exp(−(1/2)(𝑥∗ − 𝜇)^𝑇 Σ^(−1) (𝑥∗ − 𝜇))
= (2𝜋)^(−1) |(𝜎11² 𝜎12²; 𝜎21² 𝜎22²)|^(−1/2) exp(−(1/2)((𝑥1∗, 𝑥2∗)^𝑇 − (𝜇1, 𝜇2)^𝑇)^𝑇 (𝜎11² 𝜎12²; 𝜎21² 𝜎22²)^(−1) ((𝑥1∗, 𝑥2∗)^𝑇 − (𝜇1, 𝜇2)^𝑇))
= (2𝜋)^(−1) |(0.1 0.07; 0.07 0.1)|^(−1/2) exp(−(1/2)((0.7, 0.5)^𝑇 − (1, 1)^𝑇)^𝑇 (0.1 0.07; 0.07 0.1)^(−1) ((0.7, 0.5)^𝑇 − (1, 1)^𝑇)) (6)
Note that the value 𝑝(𝑥 = 𝑥∗) returned by the somewhat lengthy expression (6) (which is usually evaluated
on a computer and not by hand) is merely a positive scalar number and corresponds to the color at
𝑥∗ = (0.7, 0.5)^𝑇 ∈ ℝ² in Figure 1. Intuitively, as stated above, the warmth of this color corresponds to the
probability of the random variable 𝑥 falling into a small square centered at (0.7, 0.5)^𝑇.
Figure 1. Bivariate Gaussian distributions with identical expectation parameter 𝜇 = (1, 1)^𝑇 and varying covariance matrices. The white cross indicates the point (0.7, 0.5)^𝑇 ∈ ℝ². The respective covariance matrices are, for the left panel, 𝛴 ≔ (0.1 0.07; 0.07 0.1), for the middle panel 𝛴 ≔ (0.1 −0.07; −0.07 0.1), and for the right panel 𝛴 ≔ (0.1 0; 0 0.1).
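Expression (6) is indeed usually evaluated on a computer; a minimal sketch (Python/NumPy in place of Matlab) that implements equation (3) directly:

```python
import numpy as np

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Bivariate Gaussian probability density, equation (3)."""
    d = x - mu
    return (2 * np.pi) ** -1 * np.linalg.det(Sigma) ** -0.5 \
        * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

mu = np.array([1.0, 1.0])
Sigma = np.array([[0.10, 0.07],
                  [0.07, 0.10]])
x_star = np.array([0.7, 0.5])

# the density at x* is a single positive scalar (approximately 0.62 here)
print(bivariate_gaussian_pdf(x_star, mu, Sigma))

# the density is maximal at the expectation parameter, where it equals (2*pi)^-1 |Sigma|^-1/2
print(bivariate_gaussian_pdf(mu, mu, Sigma))
```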
We now consider the general covariance matrix of a bivariate Gaussian distribution
Σ ≔ (𝜎11² 𝜎12²; 𝜎21² 𝜎22²) (7)
in more detail. The elements 𝜎11² and 𝜎22² correspond to the variances of the (univariate) marginal variables
𝑥1 and 𝑥2. Together with the entries 𝜇1 and 𝜇2 of the mean vector 𝜇 ∈ ℝ², these values uniquely specify the
shape of the so-called marginal distributions of 𝑥1 and 𝑥2, respectively. The values 𝜎12² and 𝜎21² have a
different connotation: they specify the covariation (intuitively, the reader may think of the “correlation”) of
𝑥1 and 𝑥2. This covariation is always symmetric (like correlation is always symmetric). In other words, the
covariation of 𝑥1 and 𝑥2 is the same as the covariation of 𝑥2 and 𝑥1, and thus 𝜎12² = 𝜎21². Because the off-
diagonal elements of Σ are equal, the transpose of Σ equals Σ, i.e. we have
Σ^𝑇 = Σ (8)
Covariance matrices are thus always symmetric. Like the variance parameter 𝜎² of a univariate Gaussian has
to be positive, the covariance matrix of a multivariate Gaussian has to be “positive-definite”. By inspecting
Figure 1 and the associated covariance matrices, one realizes that positive values of the covariation
𝜎12² = 𝜎21² imply that high values of 𝑥1 are associated with a high probability of 𝑥2 also being high
and a low probability of 𝑥2 being low. Conversely, negative values of 𝜎12² = 𝜎21² imply that high
values of 𝑥1 are associated with a high probability of 𝑥2 being low and a low probability of 𝑥2 being high.
Finally, if 𝜎12² = 𝜎21² = 0, high values of 𝑥1 are associated with the same probabilities
for low and high values of 𝑥2. In this case, knowledge of the value of 𝑥1 does not give us any information
about the likely value of 𝑥2, which intuitively corresponds to the idea that 𝑥1 and 𝑥2 are “stochastically
independent”. This last case is very important for the theory of the general linear model, such that it comes
with its own label: if a bivariate Gaussian covariance matrix is of the form
Σ ≔ (𝜎11² 0; 0 𝜎22²) (9)
it is called “diagonal”. If, in addition, the values of 𝜎11² and 𝜎22² are both equal to a common value 𝜎² > 0, the
covariance matrix takes the form
Σ ≔ (𝜎² 0; 0 𝜎²) = 𝜎² (1 0; 0 1) = 𝜎²𝐼2 (10)
where 𝐼2 denotes the 2 × 2 identity matrix. The corresponding appearance of the bivariate Gaussian
probability density function is round or “spherical”. A covariance matrix of the form 𝜎²𝐼2 is thus called a
“spherical covariance matrix”. Spherical covariance matrices of 𝑛-dimensional multivariate Gaussian
distributions are of fundamental importance for the theory of the general linear model. We will thus next
extend the notion of bivariate Gaussian distributions to “𝑛-variate” or “𝑛-dimensional” Gaussian
distributions.
(2) The multivariate Gaussian distribution
The extension of the “𝑛-variate” (or “𝑛-dimensional”) Gaussian distribution and its associated
probability density function from 𝑛 = 2 to arbitrary 𝑛 ∈ ℕ, and thus to “multivariate” Gaussian distributions, is
relatively straightforward: 𝑛-dimensional Gaussian distributions describe the joint distribution of 𝑛
univariate random variables 𝑥1, … , 𝑥𝑛, or, equivalently, the distribution of 𝑛-dimensional “random vectors”
𝑥 ≔ (𝑥1, … , 𝑥𝑛)^𝑇 ∈ ℝ^𝑛 (1)
The probability density functions of 𝑛-dimensional Gaussian distributions have the general form
𝑝(𝑥) = (2𝜋)^(−𝑛/2) |Σ|^(−1/2) exp(−(1/2)(𝑥 − 𝜇)^𝑇 Σ^(−1) (𝑥 − 𝜇)) (2)
where
𝜇 = (𝜇1, … , 𝜇𝑛)^𝑇 ∈ ℝ^𝑛 (3)
is called the “mean vector” or “expectation parameter” and
Σ ≔ (𝜎11² 𝜎12² ⋯ 𝜎1𝑛²; 𝜎21² 𝜎22² ⋯ 𝜎2𝑛²; ⋮ ⋮ ⋱ ⋮; 𝜎𝑛1² 𝜎𝑛2² ⋯ 𝜎𝑛𝑛²) ∈ ℝ^(𝑛×𝑛) (4)
is called the covariance matrix. Like in the bivariate case, the covariance matrix of an 𝑛-dimensional Gaussian
probability density function is required to be symmetric and positive-definite. Further, like in the bivariate
case, the 𝑖th diagonal element (𝑖 = 1,… , 𝑛) represents the variance of the 𝑖th marginal variable 𝑥𝑖 and the
(𝑖, 𝑗)th off-diagonal element (𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑛) represents the covariance of the 𝑖th marginal variable
𝑥𝑖 with the 𝑗th marginal variable 𝑥𝑗 (which, of course, is equal to the covariance of the 𝑗th marginal variable
𝑥𝑗 with the 𝑖th marginal variable 𝑥𝑖). For the formulation of the general linear model, the 𝑛-dimensional
generalization of the bivariate spherical covariance matrix is of fundamental importance. In analogy to the
above, for 𝜎² > 0 it takes the general form
Σ ≔ (𝜎² 0 ⋯ 0; 0 𝜎² ⋯ 0; ⋮ ⋮ ⋱ ⋮; 0 0 ⋯ 𝜎²) = 𝜎² (1 0 ⋯ 0; 0 1 ⋯ 0; ⋮ ⋮ ⋱ ⋮; 0 0 ⋯ 1) = 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛) (5)
where 𝐼𝑛 denotes the 𝑛-dimensional identity matrix.
Figure 2: (Left panel) Probability density function of a bivariate Gaussian with 𝜇 ≔ (1, 1)^𝑇 and covariance matrix 𝛴 ≔ (0.1 0.07; 0.07 0.1). (Right panel) 200 samples drawn from the distribution characterized by the probability density function on the left.
To summarize, it is important to be clear about the fact that the only thing we have discussed in this
section is a specific way of specifying a probabilistic model. It is very important not to confuse the “analytic”
or “probability-theoretical” notions of “mean vectors” or “covariance matrices” introduced in the current
section with “empirical” or “statistical concepts” such as the “sample mean” or the “sample correlation”.
These concepts, which may be very familiar from undergraduate statistics, have not been touched upon in
the current section. In other words, with respect to the undergraduate statistics curriculum, the 𝜇’s and Σ’s
introduced in the current section might best be thought of as what is often referred to as “population
parameters” in undergraduate statistics. Finally, like for the univariate Gaussian distribution, the probability
density functions discussed in the current section come with a clear intuition: if samples are drawn from, for
example, a bivariate Gaussian distribution described by a given parameter setting of its associated
probability density function, the majority of samples will fall into regions of the outcome space that are
associated with high probability density, and only a minority of samples will fall into regions associated with
low probability density. Figure 2 above visualizes this for the case of a bivariate Gaussian probability density
function.
(3) IID Gaussian random variables and spherical covariance matrices
In this part we introduce a theorem that is fundamental for the importance of multivariate Gaussian
distributions in the theory of the general linear model. Intuitively, this theorem may be stated as
follows: the joint distribution of 𝑛 independent univariate Gaussian random variables with (not necessarily
identical) expectation parameters and a common variance parameter is identical to the distribution of an
𝑛-dimensional random vector distributed according to a multivariate Gaussian distribution with an
expectation vector given by the concatenation of the individual univariate expectation parameters and a
spherical covariance matrix resulting from the multiplication of the 𝑛 × 𝑛 identity matrix with the common
variance parameter of the univariate Gaussian random variables. This result is fundamental because it
allows the assumption of independently and identically distributed error terms for the observations in
a general linear model to be concisely expressed by a single distribution, and all classical and Bayesian
inference schemes for the general linear model have tight connections to this basic property.
To express the above more formally, first recall that the notion of stochastic independence
intuitively corresponds to the idea that the value of, say, 𝑥𝑖 has no influence on the value of, say, 𝑥𝑗
(1 ≤ 𝑖, 𝑗 ≤ 𝑛, 𝑖 ≠ 𝑗). Above we have seen that stochastic independence is mathematically expressed (or
modelled) by noting that the joint probability of 𝑥𝑖 = 𝑎 and 𝑥𝑗 = 𝑏 (for 𝑖 ≠ 𝑗) can be evaluated as the
product of the probability of 𝑥𝑖 = 𝑎 and the probability of 𝑥𝑗 = 𝑏:
𝑝(𝑥𝑖 = 𝑎, 𝑥𝑗 = 𝑏) = 𝑝(𝑥𝑖 = 𝑎)𝑝(𝑥𝑗 = 𝑏) (1)
The central theorem of this section now claims that the joint distribution of
a) 𝑛 marginal variables 𝑥𝑖 (𝑖 = 1, … , 𝑛) that are independently distributed according to univariate normal
distributions with expectation parameters 𝜇𝑖 (𝑖 = 1, … , 𝑛) (where 𝜇1, 𝜇2, … , 𝜇𝑛 are in general not
identical) and variance parameters 𝜎1² = 𝜎2² = ⋯ = 𝜎𝑛² = 𝜎² (i.e., the variance parameter is the same
for all variables), and
b) an 𝑛-dimensional random vector 𝑥 = (𝑥1, … , 𝑥𝑛)^𝑇 that is distributed according to a multivariate
Gaussian distribution with expectation parameter vector
𝜇 ≔ (𝜇1, … , 𝜇𝑛)^𝑇 (2)
where the entries correspond to the 𝜇𝑖’s of a), and spherical covariance matrix
Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛) (3)
where 𝜎² corresponds to the variance parameter of a),
are identical. From a sampling perspective, this may be rephrased as follows: it does not matter whether one
samples 𝑛 values from independent, univariate normally distributed variables with individual expectation
parameters 𝜇𝑖 (𝑖 = 1, … , 𝑛) and common variance parameter 𝜎² “one after the other”, or simultaneously
samples an 𝑛-dimensional vector of variables 𝑥 = (𝑥1, … , 𝑥𝑛)^𝑇 from a multivariate Gaussian distribution
with expectation parameter 𝜇 ∈ ℝ^𝑛 with the entries 𝜇𝑖 (𝑖 = 1, … , 𝑛) and spherical covariance matrix
Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛). The distributions of the sampled values are the same.
Below we derive this insight formally. To this end, we show that the probability density functions
describing (a) 𝑛 independent univariate Gaussian random variables distributed as described above, and
describing (b) an 𝑛-dimensional Gaussian random vector with the parameters described above, are identical,
in short
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = 𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) (4)
with the notation above.
Proof of (4)
Let 𝑥 ≔ (𝑥1, … , 𝑥𝑛)^𝑇 ∈ ℝ^𝑛 be an 𝑛-dimensional random vector, let 𝜇 ≔ (𝜇1, … , 𝜇𝑛)^𝑇 ∈ ℝ^𝑛 be an expectation parameter,
and let Σ ≔ 𝜎²𝐼𝑛 ∈ ℝ^(𝑛×𝑛), 𝑝. 𝑑., be a spherical covariance matrix. Then statement (4) claims that
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = 𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) (4.1)
To see this, we consider the left-hand side of (4.1). The univariate Gaussian probability density functions forming the terms of the
product are given by
𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥𝑖 − 𝜇𝑖)²) (𝑖 = 1, … , 𝑛) (4.2)
For the product ∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²), we thus have
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥1 − 𝜇1)²) ⋅ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥2 − 𝜇2)²) ⋅ … ⋅ (1/√(2𝜋𝜎²)) exp(−(1/(2𝜎²))(𝑥𝑛 − 𝜇𝑛)²) (4.3)
In (4.3), the factor 1/√(2𝜋𝜎²) occurs 𝑛 times, which we may summarize as
(1/√(2𝜋𝜎²))^𝑛 = ((2𝜋𝜎²)^(−1/2))^𝑛 = (2𝜋𝜎²)^(−𝑛/2) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) (4.4)
Further, we can use the property
exp(𝑎) exp(𝑏) = exp (𝑎 + 𝑏) (4.5)
of the exponential function to write the product of exponentials in (4.3) more compactly as
exp(−(1/(2𝜎²))(𝑥1 − 𝜇1)²) ⋅ … ⋅ exp(−(1/(2𝜎²))(𝑥𝑛 − 𝜇𝑛)²) = exp(−∑_{𝑖=1}^{𝑛} (1/(2𝜎²))(𝑥𝑖 − 𝜇𝑖)²) = exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.6)
These simplifications allow for re-expressing the left-hand side of (4) as
∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.7)
Next consider the right-hand side of (4). By definition, we have
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = (2𝜋)^(−𝑛/2) |𝜎²𝐼𝑛|^(−1/2) exp(−(1/2)(𝑥 − 𝜇)^𝑇 (𝜎²𝐼𝑛)^(−1) (𝑥 − 𝜇)) (4.8)
From linear algebra, we know that the determinant of a diagonal matrix, i.e. a matrix that has non-zero entries only on its main
diagonal, is given by the product of its diagonal elements. We thus have
|𝜎²𝐼𝑛|^(−1/2) = (∏_{𝑖=1}^{𝑛} 𝜎²)^(−1/2) = (𝜎²)^(−𝑛/2) = 𝜎^(−𝑛) (4.9)
Because the inverse of a diagonal matrix corresponds to the matrix with the multiplicative inverses of the diagonal entries along its
main diagonal, we further have
(𝜎²𝐼𝑛)^(−1) = (1/𝜎²) 𝐼𝑛 (4.10)
Finally,
(1/𝜎²)(𝑥 − 𝜇)^𝑇 𝐼𝑛 (𝑥 − 𝜇) = (1/𝜎²)(𝑥 − 𝜇)^𝑇 (𝑥 − 𝜇) = (1/𝜎²) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²
which is readily seen by considering the matrix product 𝐴^𝑇𝐴 of a vector 𝐴 ∈ ℝ^(𝑛×1) with itself. Summarizing the above, we thus have
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = (2𝜋)^(−𝑛/2) 𝜎^(−𝑛) exp(−(1/(2𝜎²)) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝜇𝑖)²) (4.11)
Comparing (4.7) and (4.11) now shows that indeed
𝑁(𝑥; 𝜇, 𝜎²𝐼𝑛) = ∏_{𝑖=1}^{𝑛} 𝑁(𝑥𝑖; 𝜇𝑖, 𝜎²) (4.12)
□
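The identity in (4) is easily checked numerically at an arbitrary point: the sketch below (Python/NumPy, with arbitrary illustrative parameters) evaluates both sides and finds them equal up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
mu = rng.normal(size=n)      # expectation parameters mu_1, ..., mu_n (arbitrary)
sigma2 = 0.8                 # common variance parameter (arbitrary)
x = rng.normal(size=n)       # an arbitrary evaluation point in R^n

# left-hand side of (4): product of n univariate Gaussian densities
lhs = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

# right-hand side of (4): n-dimensional Gaussian density with spherical covariance
Sigma = sigma2 * np.eye(n)
d = x - mu
rhs = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** -0.5 \
    * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

print(lhs, rhs)  # the two values agree up to floating-point error
```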
(4) The linear transformation theorem for Gaussian distributions
The “linear transformation theorem for Gaussian distributions” is a fundamental result in the theory
of multivariate Gaussian distributions. Intuitively, it corresponds to the statement that the matrix product of
a multivariate Gaussian random variable 𝑥 with a given matrix 𝐴, plus a second multivariate
Gaussian random variable 𝜀, (1) yields a third multivariate Gaussian random variable 𝑦, and
that (2) the parameters of the distribution of 𝑦 can be evaluated based on the parameters of 𝑥 and 𝜀 and the
matrix 𝐴.
Formally, the transformation theorem states that if
𝑝(𝑥) = 𝑁(𝑥; 𝜇𝑥, Σ𝑥), where 𝑥, 𝜇𝑥 ∈ ℝ^𝑑, Σ𝑥 ∈ ℝ^(𝑑×𝑑) positive-definite, (1)
𝑝(𝜀) = 𝑁(𝜀; 𝜇𝜀, Σ𝜀), where 𝜀, 𝜇𝜀 ∈ ℝ^𝑑, Σ𝜀 ∈ ℝ^(𝑑×𝑑) positive-definite, (2)
the covariation of 𝑥 and 𝜀 is zero, i.e. ℂ(𝑥, 𝜀) = (ℂ(𝜀, 𝑥))^𝑇 = 0 ∈ ℝ^(𝑑×𝑑), and (3)
𝐴 ∈ ℝ^(𝑑×𝑑) is a matrix, (4)
then the random variable
𝑦 ≔ 𝐴𝑥 + 휀 (5)
is distributed according to a multivariate Gaussian distribution
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σy) where 𝑦 ∈ ℝ𝑑, 𝜇𝑦 ∈ ℝ𝑑 , Σ𝑦 ∈ ℝ
𝑑×𝑑 positive definite (6)
and specifically,
127
𝜇𝑦 = 𝐴𝜇𝑥 + 𝜇 and Σ𝑦 = 𝐴Σ𝑥𝐴𝑇 + Σ (7)
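A minimal Monte Carlo sketch of the theorem (all names, the sample size, and the tolerances are our choices): drawing many realizations of $x$ and $\varepsilon$, forming $y=Ax+\varepsilon$, and comparing the sample moments of $y$ with the parameters predicted by (7):

```python
import numpy as np

# Monte Carlo check of (7): for y = A x + eps with x, eps independent
# Gaussians, E(y) = A mu_x + mu_eps and Cov(y) = A Sigma_x A^T + Sigma_eps.
rng = np.random.default_rng(0)
n = 200_000

mu_x, Sigma_x = np.array([1.0, -1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_e, Sigma_e = np.array([0.5, 0.0]), np.array([[1.0, 0.0], [0.0, 0.3]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])

x = rng.multivariate_normal(mu_x, Sigma_x, size=n)
eps = rng.multivariate_normal(mu_e, Sigma_e, size=n)
y = x @ A.T + eps                        # y_i = A x_i + eps_i per sample row

mu_y = A @ mu_x + mu_e                   # theoretical expectation of y
Sigma_y = A @ Sigma_x @ A.T + Sigma_e    # theoretical covariance of y

print(np.abs(y.mean(axis=0) - mu_y).max())   # close to 0
print(np.abs(np.cov(y.T) - Sigma_y).max())   # close to 0
```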
(5) The Gaussian joint and conditional distribution theorem
Joint distribution
Given a Gaussian marginal distribution

$$p(z)=N(z;\mu_z,\Sigma_z), \text{ where } z,\mu_z\in\mathbb{R}^p,\ \Sigma_z\in\mathbb{R}^{p\times p} \text{ p.d.} \qquad (1)$$

and a Gaussian conditional distribution

$$p(y|z)=N(y;Az,\Sigma_{y|z}), \text{ where } y\in\mathbb{R}^n,\ A\in\mathbb{R}^{n\times p},\ \Sigma_{y|z}\in\mathbb{R}^{n\times n} \text{ p.d.} \qquad (2)$$

the joint distribution of $y$ and $z$

$$p(y,z)=p(y|z)p(z) \qquad (3)$$

is given by

$$p(y,z)=N\left(\begin{pmatrix}y\\z\end{pmatrix};\mu_{y,z},\Sigma_{y,z}\right) \text{ with } (y,z)^T,\mu_{y,z}\in\mathbb{R}^{n+p},\ \Sigma_{y,z}\in\mathbb{R}^{(n+p)\times(n+p)} \qquad (4)$$

where

$$\mu_{y,z}=\begin{pmatrix}A\mu_z\\\mu_z\end{pmatrix} \quad\text{and}\quad \Sigma_{y,z}=\begin{pmatrix}\Sigma_{y|z}+A\Sigma_z A^T & A\Sigma_z\\ \Sigma_z A^T & \Sigma_z\end{pmatrix}. \qquad (5)$$
Conditional and marginal distributions
The conditional distribution

$$p(z|y)=\frac{p(y,z)}{p(y)} \qquad (6)$$

is given by

$$p(z|y)=N(z;\mu_{z|y},\Sigma_{z|y}) \qquad (7)$$

where

$$\mu_{z|y}=\Sigma_{z|y}\left(\Sigma_z^{-1}\mu_z+A^T\Sigma_{y|z}^{-1}y\right)\in\mathbb{R}^p \quad\text{and}\quad \Sigma_{z|y}=\left(\Sigma_z^{-1}+A^T\Sigma_{y|z}^{-1}A\right)^{-1}\in\mathbb{R}^{p\times p} \qquad (8)$$

and the marginal distribution

$$p(y)=\int p(y,z)\,dz \qquad (9)$$

is given by

$$p(y)=N(y;\mu_y,\Sigma_y), \text{ where } \mu_y=A\mu_z\in\mathbb{R}^n \text{ and } \Sigma_y=\Sigma_{y|z}+A\Sigma_z A^T\in\mathbb{R}^{n\times n} \qquad (10)$$
□
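The "precision form" of the conditional parameters in (8) can be cross-checked against the equivalent "covariance form" obtained by standard Gaussian conditioning on the joint distribution (5); the two forms agree by the matrix inversion lemma. A sketch with arbitrary example parameters (all names are ours):

```python
import numpy as np

# Cross-check of (8): the posterior parameters in precision form agree with
# standard Gaussian conditioning applied to the joint covariance blocks of (5).
rng = np.random.default_rng(1)
p_dim, n_dim = 2, 3

A = rng.standard_normal((n_dim, p_dim))
mu_z = rng.standard_normal(p_dim)
Sigma_z = np.eye(p_dim) * 1.5
Sigma_yz = np.eye(n_dim) * 0.5           # Sigma_{y|z}
y = rng.standard_normal(n_dim)

# precision form, equation (8)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_z) + A.T @ np.linalg.inv(Sigma_yz) @ A)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_z) @ mu_z + A.T @ np.linalg.inv(Sigma_yz) @ y)

# covariance form via the joint covariance blocks of (5)
Sigma_y = Sigma_yz + A @ Sigma_z @ A.T
K = Sigma_z @ A.T @ np.linalg.inv(Sigma_y)
mu_post2 = mu_z + K @ (y - A @ mu_z)
Sigma_post2 = Sigma_z - K @ A @ Sigma_z

print(np.allclose(mu_post, mu_post2), np.allclose(Sigma_post, Sigma_post2))
```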
As an example of the theorem, we consider the special case that $y,z\in\mathbb{R}$, i.e., that the joint distribution $p(y,z)$ is a bivariate Gaussian distribution. We thus have the marginal distribution

$$p(z)=N(z;\mu_z,\sigma_z^2), \text{ where } z,\mu_z\in\mathbb{R},\ \sigma_z^2>0 \qquad (11)$$

and the conditional distribution

$$p(y|z)=N(y;az,\sigma_{y|z}^2), \text{ where } y,a\in\mathbb{R},\ \sigma_{y|z}^2>0 \qquad (12)$$

From (5), we thus have the bivariate Gaussian joint distribution

$$p(y,z)=N\left(\begin{pmatrix}y\\z\end{pmatrix};\mu_{y,z},\Sigma_{y,z}\right) \text{ with } \mu_{y,z}=\begin{pmatrix}a\mu_z\\\mu_z\end{pmatrix},\ \Sigma_{y,z}=\begin{pmatrix}\sigma_{y|z}^2+a^2\sigma_z^2 & a\sigma_z^2\\ a\sigma_z^2 & \sigma_z^2\end{pmatrix} \qquad (13)$$

the conditional distribution

$$p(z|y)=N(z;\mu_{z|y},\sigma_{z|y}^2) \qquad (14)$$

where

$$\sigma_{z|y}^2=\left((\sigma_z^2)^{-1}+a^2(\sigma_{y|z}^2)^{-1}\right)^{-1}>0 \quad\text{and}\quad \mu_{z|y}=\sigma_{z|y}^2\left((\sigma_z^2)^{-1}\mu_z+a(\sigma_{y|z}^2)^{-1}y\right)\in\mathbb{R} \qquad (15)$$

and the marginal distribution

$$p(y)=N(y;\mu_y,\sigma_y^2), \text{ where } \mu_y=a\mu_z\in\mathbb{R} \text{ and } \sigma_y^2=\sigma_{y|z}^2+a^2\sigma_z^2>0 \qquad (16)$$
Figure 1 below depicts Bayesian inference for the expectation parameter of a univariate Gaussian based on a single observation with a “tight”, i.e. low-variance, prior (upper panels) and a “loose”, i.e. high-variance, prior (lower panels) for the unobserved variable $z$.
Figure 1. Bayesian inference for a univariate Gaussian distribution based on a single observation and with known variance $\sigma_{y|z}^2$. Note that for a narrow prior distribution over the latent variable, the posterior distribution over the latent variable is dominated by the prior, while it is dominated by the data for a wide prior.
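The effect depicted in Figure 1 can be reproduced numerically from equation (15), here with $a=1$ (the helper `posterior` and the parameter values are illustrative choices):

```python
import numpy as np

# Posterior parameters of z given a single observation y, per equation (15).
# A "tight" prior pulls the posterior mean toward the prior mean; a "loose"
# prior lets the observation dominate.
def posterior(mu_z, s2_z, y, s2_yz, a=1.0):
    s2_post = 1.0 / (1.0 / s2_z + a ** 2 / s2_yz)
    mu_post = s2_post * (mu_z / s2_z + a * y / s2_yz)
    return mu_post, s2_post

y_obs, s2_yz = 2.0, 1.0
mu_tight, _ = posterior(mu_z=0.0, s2_z=0.1, y=y_obs, s2_yz=s2_yz)
mu_loose, _ = posterior(mu_z=0.0, s2_z=10.0, y=y_obs, s2_yz=s2_yz)
print(mu_tight, mu_loose)   # tight prior: near prior mean 0; loose: near y_obs
```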
Study Questions
1. Write down the probability density function of a multivariate Gaussian distribution and provide explanations for its components.
2. Assume you know the following covariance matrix of a bivariate Gaussian distribution

$$\Sigma=\begin{pmatrix}\sigma_{11}^2 & \sigma_{12}^2\\ \sigma_{21}^2 & \sigma_{22}^2\end{pmatrix}:=\begin{pmatrix}1 & -0.5\\ -0.5 & 2\end{pmatrix}$$
What can you say about the variances of the marginal variables and their correlation?
Study Questions Answers
1. The probability density function of a multivariate Gaussian distribution of an $n$-dimensional random vector $x$ is given by

$$p(x)=N(x;\mu,\Sigma)=(2\pi)^{-\frac{n}{2}}|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

Its core components are an expectation parameter vector $\mu\in\mathbb{R}^n$, which defines where the realizations of the random vector $x$ are centered in $\mathbb{R}^n$, and a symmetric positive-definite covariance matrix parameter $\Sigma\in\mathbb{R}^{n\times n}$, which specifies on its diagonal how much the realizations of $x$ vary about $\mu$ in each dimension, and on its off-diagonal elements how much the components of $x$ covary.
2. From the diagonal elements of $\Sigma$ we can conclude that the first component $x_1$ of the bivariate random vector $x=(x_1,x_2)$ has a smaller variance than the second component. From the off-diagonal elements we can infer that $x_1$ and $x_2$ are negatively correlated, i.e., if $x_1$ takes on a high value, $x_2$ tends to take on a low value, and vice versa.
Information Theory
Information theory can be viewed as a collection of information theoretic quantities. Well known
examples of information theoretic quantities are information entropy and mutual information. The common
and defining feature of information theoretic quantities is that they are functions that map probability
distributions onto scalar numbers. Because probability distributions (or more precisely, probability density
functions and probability mass functions) are themselves functions, information theoretic quantities are also
referred to as functionals (functions of functions). We will highlight this special mathematical nature of information theoretic quantities by using calligraphic symbols for them, for example $\mathcal{F}$ instead of $F$. In this Section, we briefly review two central information theoretic quantities in the context of parametric Bayesian inference: entropy and the Kullback-Leibler divergence. We conceive these quantities merely as functionals of probability density functions and make no attempt to relate them to their origins in statistical physics. Also, in contrast to common introductions to information theory, we consider these quantities only for probability density functions. In basic information theoretic terms, the quantities we are concerned with here may thus be referred to as “differential entropy” and “differential Kullback-Leibler divergence”.
(1) Entropy
The differential entropy of a random variable $x$ governed by a probability density function $p(x)$ is defined as

$$\mathcal{H}(p(x)):=-\int p(x)\ln p(x)\,dx \qquad (1)$$
The entropy of a random variable is a measure of its variability and becomes maximal for a uniformly
distributed random variable. Its value is independent of the actual values taken on by 𝑥, and only depends
on the probability density function of 𝑥. This distinguishes entropy from other measures of random variable
variability: for example, the variance of a random variable is given by the expectation of the squared deviation of the values the random variable can take on from the expectation of the random variable. It is
important to note that if the probability density function of a random variable 𝑥 is of known functional form,
the entropy of the random variable can be evaluated analytically. Without proof, we note that for a 𝑑-
dimensional random vector 𝑥 ∈ ℝ𝑑 governed by the 𝑑-dimensional Gaussian distribution 𝑁(𝑥; 𝜇, Σ), the
entropy integral evaluates to
$$\mathcal{H}(N(x;\mu,\Sigma))=\frac{1}{2}\ln|\Sigma|+\frac{d}{2}(1+\ln 2\pi)=\frac{1}{2}\ln\left((2\pi e)^d|\Sigma|\right) \qquad (2)$$

which, for a univariate Gaussian probability density function, simplifies to

$$\mathcal{H}(N(x;\mu,\sigma^2))=\frac{1}{2}\ln(2\pi e\sigma^2) \qquad (3)$$
Notably, the entropy of a Gaussian random variable is thus monotonically increasing with its variance
parameter as shown in Figure 1.
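Equation (3) can be verified by numerical integration of the defining integral (1) for a Gaussian density (the grid width and resolution are arbitrary choices):

```python
import numpy as np

# Numerical check of (3): the differential entropy of N(x; mu, s2) obtained
# by integrating -p ln p over a wide grid matches (1/2) ln(2*pi*e*s2).
mu, s2 = 0.0, 2.5
x = np.linspace(mu - 20 * np.sqrt(s2), mu + 20 * np.sqrt(s2), 200_001)
p = np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

integrand = -p * np.log(p)
H_numeric = np.sum(integrand) * (x[1] - x[0])   # simple Riemann sum
H_closed = 0.5 * np.log(2 * np.pi * np.e * s2)
print(H_numeric, H_closed)   # the two values agree closely
```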
(2) Kullback-Leibler Divergence
The Kullback-Leibler divergence of two probability distributions $q(x)$ and $p(x)$ is defined as

$$\mathcal{KL}(q(x)\,||\,p(x)):=\int q(x)\ln\left(\frac{q(x)}{p(x)}\right)dx \qquad (4)$$
Figure 1. Entropy of Gaussian probability density functions. Panels A and B relate to the entropy of univariate Gaussian distributions, panels C and D relate to the entropy of bivariate Gaussian distributions. Specifically, Panel A depicts Gaussian probability density functions with expectation parameter $\mu=0$ and three variance parameter settings. Panel B depicts the differential entropy of Gaussian probability density functions as a function of the variance parameter. The entropies of the three Gaussian densities shown in Panel A are indicated using colored markers. Entropy increases nonlinearly as a function of variance. Panel C depicts two bivariate Gaussian densities with expectation parameter $\mu=(0,0)^T$ and covariance matrix $\Sigma=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}$. Larger off-diagonal elements increase the correlation between the marginal variables $x_1$ and $x_2$ and reduce the entropy of bivariate Gaussians nonlinearly, as shown in Panel D. The markers depict the entropies of the bivariate Gaussians shown in Panel C.
Intuitively, the Kullback-Leibler divergence is a measure of the “distance” between two probability density functions (it is, however, not a metric on the space of probability density functions). Without proof, we note that the KL divergence between two Gaussian probability density functions on a $d$-dimensional random vector $x\in\mathbb{R}^d$, $N(x;\mu_q,\Sigma_q)$ and $N(x;\mu_p,\Sigma_p)$, is given by
$$\mathcal{KL}\left(N(x;\mu_q,\Sigma_q)\,||\,N(x;\mu_p,\Sigma_p)\right)=\frac{1}{2}\left(\ln\frac{|\Sigma_p|}{|\Sigma_q|}+\mathrm{tr}\left(\Sigma_p^{-1}\Sigma_q\right)+(\mu_q-\mu_p)^T\Sigma_p^{-1}(\mu_q-\mu_p)-d\right) \qquad (5)$$

where $\mathrm{tr}(A)=\sum_{i=1}^{d}a_{ii}$ denotes the trace of a matrix $A\in\mathbb{R}^{d\times d}$. For the special univariate case $x\in\mathbb{R}$, the above simplifies to

$$\mathcal{KL}\left(N(x;\mu_q,\sigma_q^2)\,||\,N(x;\mu_p,\sigma_p^2)\right)=\frac{1}{2}\left(\ln\frac{\sigma_p^2}{\sigma_q^2}+\frac{\sigma_q^2}{\sigma_p^2}+\frac{1}{\sigma_p^2}(\mu_q-\mu_p)^2-1\right) \qquad (6)$$
The KL divergence between two univariate Gaussian probability density functions is thus a function of the squared distance of their expectation parameters and the ratio of their variance parameters (Figure 2). Note that for $\sigma_q^2=\sigma_p^2$ and $\mu_q=\mu_p$, (6) evaluates to zero.
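Equation (6) can likewise be checked numerically against the defining integral (4) (the helper `kl_gauss` and the parameter values are our choices):

```python
import numpy as np

# Equation (6): closed-form KL divergence between two univariate Gaussians.
def kl_gauss(mu_q, s2_q, mu_p, s2_p):
    return 0.5 * (np.log(s2_p / s2_q) + s2_q / s2_p
                  + (mu_q - mu_p) ** 2 / s2_p - 1.0)

kl_same = kl_gauss(0.0, 1.0, 0.0, 1.0)   # identical densities
kl_diff = kl_gauss(0.0, 1.0, 2.0, 1.5)   # shifted, wider reference density

# cross-check against numerical integration of the definition (4)
x = np.linspace(-15, 15, 300_001)
q = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p = np.exp(-(x - 2.0) ** 2 / 3.0) / np.sqrt(2 * np.pi * 1.5)
kl_numeric = np.sum(q * np.log(q / p)) * (x[1] - x[0])
print(kl_same, kl_diff, kl_numeric)   # kl_same is 0; kl_diff matches kl_numeric
```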
Figure 2. Kullback-Leibler divergences for univariate Gaussians. Panel A depicts the KL divergence between a reference univariate Gaussian density with parameters $\mu_q=0$ and $\sigma_q^2=1$ and a “test” Gaussian density with parameters $\mu_p$ and $\sigma_p^2$ as a function of the test Gaussian density’s parameters. The colored markers refer to specific cases of the test Gaussian parameters: the red marker refers to the parameter setting in the upper subpanel of Panel B, the green marker refers to the center subpanel parameter settings of Panel B, and the blue marker refers to the lowermost subpanel parameter settings of Panel B. Note that the KL divergence between two univariate Gaussians is a nonlinear function of the differences in expectation and variance parameters.
Two properties of the KL divergence are central. Firstly, the KL divergence is non-negative for all choices of probability densities $q(x)$ and $p(x)$,

$$\mathcal{KL}(q(x)\,||\,p(x))\ge 0 \text{ for all } q(x),p(x) \qquad (7)$$

and secondly, it is zero if and only if the two densities are identical, i.e. $q(x)=p(x)$,

$$\mathcal{KL}(q(x)\,||\,p(x))=0 \text{ for } q(x)=p(x) \qquad (8)$$
Proof of (7) and (8)
We follow (Bishop, 2007) (pp. 55-56) to show why (7) and (8) hold. Two aspects are central: the negative logarithm is a convex function, and for convex functions Jensen’s inequality applies. Recall that convex functions are defined by the property that every straight line connecting two points on the function’s graph lies above it, or formally: for $f:[x_1,x_2]\subset\mathbb{R}\to\mathbb{R}$ and $q\in[0,1]$,

$$f(qx_1+(1-q)x_2)\le qf(x_1)+(1-q)f(x_2) \qquad (8.1)$$

does hold. Intuitively, (8.1) can be extended to more than two points $x_i,\ i=1,\ldots,n$ with $q_i\ge 0$ and $\sum_{i=1}^{n}q_i=1$ in the form

$$f\left(\sum_{i=1}^{n}q_i x_i\right)\le \sum_{i=1}^{n}q_i f(x_i) \qquad (8.2)$$

(8.2) can in turn, intuitively, be extended to a continuum of points $x$ and associated values $q(x)$, where $q(x)\ge 0$ and $\int q(x)\,dx=1$, as

$$f\left(\int q(x)x\,dx\right)\le \int q(x)f(x)\,dx \qquad (8.3)$$

From a probabilistic viewpoint, $\int q(x)x\,dx$ corresponds to the expectation of $x$ under $q(x)$ and $\int q(x)f(x)\,dx$ to the expectation of $f(x)$ under $q(x)$, i.e., for convex $f$ we have

$$\mathbb{E}_{q(x)}(f(x))\ge f\left(\mathbb{E}_{q(x)}(x)\right) \qquad (8.4)$$
The results (8.2)-(8.4) are known as Jensen’s inequality. Noting from real analysis that the logarithm is a concave function, and $f:=-\ln$ hence a convex function, we thus have for the KL divergence as defined in equation (4):

$$\mathcal{KL}(q(x)\,||\,p(x)):=\int q(x)\ln\left(\frac{q(x)}{p(x)}\right)dx=-\int q(x)\ln\left(\frac{p(x)}{q(x)}\right)dx\ge-\ln\int q(x)\frac{p(x)}{q(x)}\,dx=-\ln 1=0 \qquad (8.5)$$

Also note that the KL divergence $\mathcal{KL}(q(x)\,||\,p(x))$ vanishes for $p(x):=q(x)$, because in this case the logarithmic term in the integral evaluates to zero for all $x$.
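Jensen's inequality (8.4) for the convex function $f:=-\ln$ can be illustrated with samples from an arbitrary density on the positive reals (the sampling distribution and sample size are arbitrary choices):

```python
import numpy as np

# Numerical illustration of Jensen's inequality (8.4) for f = -ln:
# E_q(-ln(x)) >= -ln(E_q(x)) for a positive random variable x.
rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=100_000)   # samples from an arbitrary q(x)

lhs = np.mean(-np.log(x))                 # E_q(f(x))
rhs = -np.log(np.mean(x))                 # f(E_q(x))
print(lhs >= rhs)                         # True
```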
Principles of Probabilistic Inference
(1) Maximum Likelihood Estimation
The maximum likelihood method is a fairly general principle for deriving estimators in probabilistic models that was popularized by Ronald A. Fisher between 1912 and 1922, but found application already in the works of Pierre-Simon Laplace (1749-1827) and Carl Friedrich Gauss (1777-1855). It is based on the following intuition: in the context of a probabilistic model, the most likely parameter value to underlie an observed data set is the parameter value for which the probability of the data under the model is maximal.
To render this intuition more precise, we first introduce the so-called “likelihood function”. Consider a probabilistic model which specifies the probability of data $y$ based on a family of parameterized probability density (or mass) functions $p(y;\theta)$, where $\theta\in\Theta$ denotes the model's parameter and $\Theta$ denotes the model's parameter space. Then the function

$$L:\Theta\times\mathbb{R}^n\to\mathbb{R}_+,\ (\theta,y)\mapsto L(\theta,y):=p(y;\theta) \qquad (1)$$

where $\theta\in\mathbb{R}^k$ and $y\in\mathbb{R}^n$, is called the likelihood function of the parameter $\theta$ for the observation $y$. The important thing to note about the likelihood function is that it is (primarily) viewed as a function of the parameter value $\theta$, while, in a concrete case, the value of $y$ is fixed. This contrasts with the notion of a probability density function $p(y;\theta)$, which is a function of the random variable $y$. Less formally, this may be expressed as follows: the input argument of a probability density (or mass) function is the value of a random variable, and its output is the probability density (or mass) for this value given a fixed parameter value; the input argument of a likelihood function, on the other hand, is a parameter value, and its output is the probability density (or mass) for a fixed value of the random variable given this parameter value. If the random variable value and parameter value submitted to a probability density (or mass) function and its corresponding likelihood function are identical, so is the output of both functions.
The maximum likelihood estimator for a given probabilistic model $p(y;\theta)$ is that value $\hat\theta_{ML}$ of $\theta$ which maximizes the likelihood function. Formally, this can be expressed as

$$\hat\theta_{ML}:=\mathrm{argmax}_{\theta\in\Theta}\,L(\theta,y) \qquad (2)$$

(2) should be read as “$\hat\theta_{ML}$ is that argument of the likelihood function $L$ for which $L(\theta,y)$ assumes its maximal value over all possible parameter values $\theta$ in the parameter space $\Theta$”. A standard approach to find “closed-form” or analytical expressions for ML estimators is to maximize the likelihood function with respect to $\theta$ by means of the analytical determination of critical values at which its first derivative (or gradient for multivariate $\theta$) vanishes. We will see examples of this approach below. Another approach, often encountered in practical numerical computing, is to automatically shift the values of $\theta$ around while monitoring the value of the likelihood function and to stop once this value is considered to be maximal based on some sensible stopping criterion. For now, we concern ourselves with the analytical approach. Candidate values for the ML estimator $\hat\theta_{ML}$ fulfill the following requirement:

$$\frac{\partial}{\partial\theta_i}L(\theta,y)\Big|_{\theta=\hat\theta_{ML}}=0 \quad (i=1,\ldots,k) \qquad (3)$$
(3) should be read as “at the location of the ML estimator $\hat\theta_{ML}$ ($|_{\theta=\hat\theta_{ML}}$), the partial derivatives $\frac{\partial}{\partial\theta_i}L$ of the likelihood function with respect to the entries of the parameter vector $\theta:=(\theta_1,\ldots,\theta_k)^T$ vanish, i.e. they are equal to zero”, or alternatively, “at the location of the ML estimator $\hat\theta_{ML}$, the gradient of $L$ with respect to $\theta$ is equal to the zero vector”. This condition merely corresponds to the necessary condition for function minima or maxima, which states that at the location of a minimum or maximum, the slope of the function is zero. By evaluating the derivatives and setting them to zero, one obtains a set of equations which can be solved for the ML estimator.
To simplify this approach, one usually considers the logarithm of the likelihood function, the so-called “log likelihood function”. The log likelihood function is defined as

$$\ell:\Theta\times\mathbb{R}^n\to\mathbb{R},\ (\theta,y)\mapsto \ell(\theta,y):=\ln L(\theta,y)=\ln p(y;\theta) \qquad (4)$$

The use of the logarithm in the context of ML estimation is pragmatic: first, one often considers probability density functions which have an exponential term, which is simplified by a log transform. Second, one often assumes independent observations, under which the factorization of the probability density function is rendered a summation under the log transform. Finally, the logarithm is a monotonically increasing function, which implies that the location in parameter space at which the likelihood function assumes its maximal value corresponds to the location in parameter space at which the log likelihood function assumes its maximal value. In analogy to (3), the “log likelihood equation” for the ML estimator is given as

$$\frac{\partial}{\partial\theta_i}\ell(\theta,y)\Big|_{\theta=\hat\theta_{ML}}=0 \quad (i=1,\ldots,k) \qquad (5)$$

which, like (3), can be solved for $\hat\theta_{ML}$.
Before we demonstrate the idea of ML estimation in a first example, we note two assumptions that simplify the application of the ML method considerably: firstly, the assumption of a concave log likelihood function, and secondly, the assumption of independent observables.

If the log likelihood function is “concave”, then the necessary condition for a maximum of the likelihood function is also sufficient. A multivariate real-valued function $f:\mathbb{R}^n\to\mathbb{R}$ is referred to as “concave” if for all input arguments $x,y\in\mathbb{R}^n$ the straight line connecting $f(x)$ and $f(y)$ lies below the function's graph (see Figure 1). Formally, this may be expressed by the inequality

$$f(tx+(1-t)y)\ge tf(x)+(1-t)f(y) \quad (x,y\in\mathbb{R}^n,\ t\in[0,1]) \qquad (6)$$

Note that $tx+(1-t)y$ $(t\in[0,1])$ describes a straight line in the domain of the function, while $tf(x)+(1-t)f(y)$ $(t\in[0,1])$ describes a straight line in the range of the function. Leaving mathematical subtleties aside, it is roughly correct that concave functions have a single maximum, or in other words, that a critical point at which the gradient vanishes is guaranteed to be a maximum. In other words, if the log likelihood function is concave, finding a parameter value at which the log likelihood equation of vanishing partial derivatives holds is sufficient to know that there is indeed a maximum at this location. In principle, for every log likelihood function that we discuss below, we would have to show that it indeed fulfills condition (6) and thus that a maximum can be found by merely setting its gradient to zero and solving for the critical point. However, because this goes beyond the formalism strived for in PMFN, we content ourselves with noting without proof that the log likelihood functions we will encounter are all concave.
We now consider the assumption of independent observed variables. If the observed variables $y:=(y_1,\ldots,y_n)$ are stochastically independent and each variable is governed by a probability density function parameterized by the same parameter vector $\theta$, i.e. it has the probability density function $p(y_i;\theta)$ for $i=1,\ldots,n$, then the joint probability density function is given as the product of the individual probability density functions

$$p(y;\theta)=p(y_1,\ldots,y_n;\theta)=p(y_1;\theta)\cdot p(y_2;\theta)\cdot\ldots\cdot p(y_n;\theta)=\prod_{i=1}^{n}p(y_i;\theta) \qquad (7)$$

This may be viewed from two angles: one may either conceive the $y_i$ $(i=1,\ldots,n)$ to be governed by one and the same underlying probability distribution, from which one can “sample with replacement”, or one may conceive each $y_i$ $(i=1,\ldots,n)$ to be governed by its individual probability density function, which, however, is the same for all $i=1,\ldots,n$. For the purposes of PMFN, these two stances are equivalent, while the latter feels somewhat closer to the formal developments below. In the case of independent observed variables $y_1,\ldots,y_n$, the log likelihood function is given by

$$\ell(\theta,y)=\ln p(y;\theta)=\ln\left(\prod_{i=1}^{n}p(y_i;\theta)\right) \qquad (8)$$

Repeated application of the “product property” of the logarithm, i.e. the fact that for $a,b\in\mathbb{R}_+$ we have $\ln(ab)=\ln a+\ln b$, then allows for writing the right-hand side of (8) as

$$\ln\left(\prod_{i=1}^{n}p(y_i;\theta)\right)=\sum_{i=1}^{n}\ln p(y_i;\theta) \qquad (9)$$

In other words, the evaluation of the logarithm of a product of probability density functions $p(y_i;\theta)$ is simplified to the summation over the logarithms of the individual probability density functions $p(y_i;\theta)$ $(i=1,\ldots,n)$.
The developments above, together with the assumptions of independent observed variables and the concavity of the log likelihood function, suggest the following three-step procedure for the analytical derivation of maximum likelihood estimators in a given probabilistic model scenario.

1. Formulation of the log likelihood function. This step corresponds to (a) writing down the probability of a data sample under the probabilistic model, i.e., formulating the likelihood function, where special attention has to be paid to the number of observed variables considered and their independence properties, and (b) taking the logarithm.

2. Evaluation of the log likelihood function gradient. Usually, probabilistic models of interest have more than one parameter, and maximum likelihood estimators for each parameter are required. To this end, the partial derivatives of the log likelihood function with respect to the parameters have to be evaluated, which is usually eased by the use of the log likelihood function in the case of probability density functions belonging to the exponential family of distributions, and by the assumption of observed variable independence, resulting in sums rather than products under the logarithm.

3. Equating the log likelihood function gradient with zero and solving for a critical value of the parameters, which, under the assumption of a concave log likelihood function, corresponds to the location of a maximum of the log likelihood function in parameter space. The parameter value obtained, which is usually a function of the observed variables, then corresponds to a maximum likelihood estimator for the given parameter of the probabilistic model under consideration.
(2) Maximum likelihood estimation of the parameters of a univariate Gaussian distribution
To obtain an intuition of how the maximum likelihood estimation procedure outlined above works in practice, we consider the case of obtaining $n\in\mathbb{N}$ independent and identically distributed observations $y_1,\ldots,y_n$ from a univariate Gaussian distribution with parameters $\mu\in\mathbb{R}$ and $\sigma^2>0$. Our aim is to derive maximum likelihood estimators $\hat\mu_{ML}$ and $\hat\sigma^2_{ML}$ for $\mu$ and $\sigma^2$.

Equivalently, we may view this example as a “one-sample” GLM of the form

$$p(y):=N(y;X\beta,\sigma^2 I_n) \qquad (1)$$

where $y:=(y_1,\ldots,y_n)\in\mathbb{R}^n$, $\sigma^2>0$, $\beta:=\mu\in\mathbb{R}$, and $X\in\mathbb{R}^{n\times 1}$ is given by

$$X:=\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix} \qquad (2)$$

i.e., the design matrix corresponds to a vector of $n$ ones, and the single parameter $\beta$ corresponds to the expectation parameter of $p(y_i)$. The $i$th variable $y_i$ $(i=1,\ldots,n)$ is distributed according to

$$p(y_i)=N(y_i;\mu,\sigma^2) \qquad (3)$$
and the covariances of variables 𝑦𝑖 and 𝑦𝑗 for 𝑖 ≠ 𝑗, 1 ≤ 𝑖, 𝑗 ≤ 𝑛 are assumed to be zero, corresponding to
the assumption of independent and identically distributed observations.
The first step in the application of the maximum likelihood principle is the determination of the log
likelihood function. In the current example, the 𝑖th random variable 𝑦𝑖 is distributed according to
𝑁(𝑦𝑖; 𝜇, 𝜎2) for all 𝑖 = 1,… , 𝑛, and the random variables 𝑦1, … , 𝑦𝑛 are assumed to be independent. The
probability density for the joint observation of 𝑦1, … , 𝑦𝑛 thus corresponds to the product of the probability
density for each individual observation, which we may write as
$$p(y_1,\ldots,y_n)=p(y_1)\cdot p(y_2)\cdot\ldots\cdot p(y_n) \qquad (4)$$

The individual probability density functions are given by the univariate Gaussian $N(y_i;\mu,\sigma^2)$ $(i=1,\ldots,n)$, and we thus have

$$p_{\mu,\sigma^2}(y)=p_{\mu,\sigma^2}(y_1,\ldots,y_n)=N(y_1;\mu,\sigma^2)\cdot N(y_2;\mu,\sigma^2)\cdot\ldots\cdot N(y_n;\mu,\sigma^2) \qquad (5)$$

In (5) we have made the dependence of the probability density function on the parameters $\mu$ and $\sigma^2$ explicit by means of subscripts. Because

$$N(y_i;\mu,\sigma^2):=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right) \quad (i=1,\ldots,n) \qquad (6)$$

we may re-express (5) as

$$p_{\mu,\sigma^2}(y)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7)$$
Proof of (7)
We consider the product

$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_1-\mu)^2\right)\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_2-\mu)^2\right)\cdots\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_n-\mu)^2\right) \qquad (7.1)$$

that comprises $n$ factors of the form $(2\pi\sigma^2)^{-\frac{1}{2}}$ and $n$ factors of the form $\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)$. From the rules of exponentiation, we know that

$$\left((2\pi\sigma^2)^{-\frac{1}{2}}\right)^n=(2\pi\sigma^2)^{-\frac{1}{2}\cdot n}=(2\pi\sigma^2)^{-\frac{n}{2}} \qquad (7.2)$$

From the fundamental properties of the exponential function, we know that

$$\exp(a)\exp(b)=\exp(a+b) \qquad (7.3)$$

and thus a product of $n$ exponential factors is equivalent to the exponential of the sum over the exponential inputs

$$\prod_{i=1}^{n}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)=\exp\left(-\sum_{i=1}^{n}\frac{1}{2\sigma^2}(y_i-\mu)^2\right)=\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7.4)$$

We thus found that

$$\prod_{i=1}^{n}N(y_i;\mu,\sigma^2)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (7.5)$$

□
Based on the above, we may now formally write down the likelihood function for the case of $n$ independent and identically distributed observations $y=(y_1,\ldots,y_n)$

$$L:(\mathbb{R}\times\mathbb{R}_+\setminus\{0\})\times\mathbb{R}^n\to\mathbb{R}_+,\ ((\mu,\sigma^2),y)\mapsto L((\mu,\sigma^2),y):=p_{\mu,\sigma^2}(y)=(2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right) \qquad (8)$$

Note that in the current scenario the parameter space $\Theta$ is given by $\Theta:=\mathbb{R}\times\mathbb{R}_+\setminus\{0\}$ and the dimensionality of the parameter vector $\theta:=(\mu,\sigma^2)$ is given by $k=2$. The corresponding log likelihood function then evaluates to

$$\ell:(\mathbb{R}\times\mathbb{R}_+\setminus\{0\})\times\mathbb{R}^n\to\mathbb{R},\ ((\mu,\sigma^2),y)\mapsto \ell((\mu,\sigma^2),y)=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9)$$
Derivation of (9)
Equation (9) follows from basic properties of the logarithm function, specifically

$$\ln(ab)=\ln a+\ln b,\quad \ln a^b=b\ln a \quad\text{and}\quad \ln(\exp(a))=a \qquad (9.1)$$

We have

$$\ln\left((2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right)=\ln(2\pi\sigma^2)^{-\frac{n}{2}}+\ln\left(\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right) \qquad (9.2)$$

with the first of these properties,

$$\ln(2\pi\sigma^2)^{-\frac{n}{2}}+\ln\left(\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\right)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9.3)$$

with the second and third of these properties, and finally

$$-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (9.4)$$

again with the first of these properties. □
The second step in the analytical derivation of ML estimators is the evaluation of the log likelihood equations, i.e. the gradient of the log likelihood function. In the current case, we have $\theta:=(\mu,\sigma^2)$. To identify critical points of the log likelihood function, we thus have to evaluate the two derivatives constituting the gradient of the log likelihood function

$$\nabla\ell((\mu,\sigma^2),y)=\begin{pmatrix}\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)\\[4pt]\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)\end{pmatrix} \qquad (10)$$

and subsequently set the gradient (i.e., each partial derivative) to zero. For the partial derivative with respect to the expectation parameter $\mu$, we obtain

$$\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu) \qquad (11)$$

and for the partial derivative with respect to the variance parameter $\sigma^2$, we obtain

$$\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2 \qquad (12)$$
Derivation of (11) and (12)
We consider the derivative of $\ell((\mu,\sigma^2),y)$ with respect to $\mu$ first. Using the summation and chain rules of differential calculus, we obtain

$$\begin{aligned}\frac{\partial}{\partial\mu}\ell((\mu,\sigma^2),y)&=\frac{\partial}{\partial\mu}\left(-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=\frac{\partial}{\partial\mu}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{\partial}{\partial\mu}(y_i-\mu)^2\\&=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}2(y_i-\mu)\frac{\partial}{\partial\mu}(-\mu)\\&=-\left(-\frac{2}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\right)\\&=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\end{aligned} \qquad (11.1)$$

We next consider the partial derivative with respect to the variance parameter $\sigma^2$. Using the summation rule of differential calculus, the fact that $(\ln x)'=x^{-1}$, and the fact that $(x^{-1})'=-x^{-2}$, we obtain

$$\begin{aligned}\frac{\partial}{\partial\sigma^2}\ell((\mu,\sigma^2),y)&=\frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\ln 2\pi-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{n}{2}\frac{\partial}{\partial\sigma^2}(\ln\sigma^2)-\frac{\partial}{\partial\sigma^2}\left(\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\\&=-\frac{n}{2}\frac{1}{\sigma^2}-\frac{1}{2}\left(\frac{\partial}{\partial\sigma^2}(\sigma^2)^{-1}\right)\sum_{i=1}^{n}(y_i-\mu)^2\\&=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2\end{aligned} \qquad (12.1)$$

□
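The analytical partial derivatives (11) and (12) can be verified against central finite differences of the log likelihood (9) at an arbitrary point (all names and parameter values are illustrative):

```python
import numpy as np

# Finite-difference check of the analytical gradient (11), (12) of the
# Gaussian log likelihood l((mu, s2), y), equation (9).
def loglik(mu, s2, y):
    n = y.size
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s2)
            - np.sum((y - mu) ** 2) / (2 * s2))

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=50)
mu, s2, h = 0.7, 1.3, 1e-6

# analytical partial derivatives, equations (11) and (12)
d_mu = np.sum(y - mu) / s2
d_s2 = -y.size / (2 * s2) + np.sum((y - mu) ** 2) / (2 * s2 ** 2)

# central finite differences
d_mu_num = (loglik(mu + h, s2, y) - loglik(mu - h, s2, y)) / (2 * h)
d_s2_num = (loglik(mu, s2 + h, y) - loglik(mu, s2 - h, y)) / (2 * h)
print(d_mu - d_mu_num, d_s2 - d_s2_num)   # both differences are tiny
```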
The log likelihood equations for the case of independent and identical sampling from the univariate Gaussian are thus given by

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)=0 \qquad (13)$$

$$-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2=0 \qquad (14)$$

Notably, these log likelihood equations display a dependence between the maximum likelihood estimator for $\mu$ and that for $\sigma^2$, as both parameters appear in both equations. To solve the log likelihood equations for $\hat\mu_{ML}$ and $\hat\sigma^2_{ML}$, a standard approach is to first solve equation (13) for $\hat\mu_{ML}$ and then use this solution to solve equation (14) for $\hat\sigma^2_{ML}$. We obtain the following results

$$\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (15)$$

$$\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (16)$$
Derivation of (15) and (16)
Equation (13) states that

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})=0 \qquad (15.1)$$

This implies that $\frac{1}{\sigma^2}$ equals zero or $\sum_{i=1}^{n}(y_i-\hat\mu_{ML})$ equals zero. Because, by definition, $\sigma^2>0$ and thus also $\frac{1}{\sigma^2}>0$, the equation can only hold if $\sum_{i=1}^{n}(y_i-\hat\mu_{ML})$ equals zero. We hence obtain for $\hat\mu_{ML}$

$$\sum_{i=1}^{n}(y_i-\hat\mu_{ML})=0 \Leftrightarrow \sum_{i=1}^{n}y_i-\sum_{i=1}^{n}\hat\mu_{ML}=0 \Leftrightarrow \sum_{i=1}^{n}y_i-n\hat\mu_{ML}=0 \Leftrightarrow n\hat\mu_{ML}=\sum_{i=1}^{n}y_i \Leftrightarrow \hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (15.2)$$

To find the maximum likelihood estimator for $\sigma^2$, we substitute the result above in equation (14) and solve for $\hat\sigma^2_{ML}$

$$-\frac{n}{2\hat\sigma^2_{ML}}+\frac{1}{2\hat\sigma^4_{ML}}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=0 \Leftrightarrow \frac{1}{2\hat\sigma^4_{ML}}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=\frac{n}{2\hat\sigma^2_{ML}} \Leftrightarrow \sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2=\frac{2n\hat\sigma^4_{ML}}{2\hat\sigma^2_{ML}} \Leftrightarrow \hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (16.1)$$

□
With equation (15), we thus derived the following result: the ML estimator for the expectation parameter $\mu$ of a univariate Gaussian based on $n$ independent and identically distributed observations $y_1,\ldots,y_n$ is given by

$$\hat\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (17)$$

Notably, the formula for this ML estimator corresponds to the well-known sample mean, usually denoted as

$$\bar y:=\frac{1}{n}\sum_{i=1}^{n}y_i \qquad (18)$$

for observations $y_1,\ldots,y_n$. Note that a sample mean can be computed irrespective of whether one assumes the data to correspond to independent and identically distributed samples from a univariate Gaussian. If one makes this assumption, however, in classical point estimation schemes, the sample mean corresponds to the best guess for the expectation parameter of the assumed underlying Gaussian.

With equation (16), on the other hand, we derived the following result: the maximum likelihood estimator for the variance parameter $\sigma^2$ of a univariate Gaussian based on $n$ independent and identically distributed observations and the ML estimator $\hat\mu_{ML}$ is given by

$$\hat\sigma^2_{ML}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\mu_{ML})^2 \qquad (19)$$

Note that the ML estimator for the variance does not correspond to the familiar sample variance, which is given by

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y)^2 \qquad (20)$$
In other words, if one computes the sample variance of any sample, one does not need to assume that the sample was generated by independent and identical sampling from a univariate Gaussian distribution, and even if one does, the result is not the maximum likelihood estimator for the variance parameter of the assumed underlying Gaussian. Using classical frequentist estimator quality theory, it can be shown that the maximum likelihood estimator for the variance parameter of a univariate Gaussian is not “ideal”, or, more formally, it is not “bias-free”. However, the notion of “biased estimators” and a principled method for deriving bias-free estimators using a modified maximum likelihood method, referred to as “restricted maximum likelihood”, is beyond the scope of the current section and is covered in more depth in the Section “Restricted Maximum Likelihood” of PMFN.
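The relation between the ML variance estimator (19) and the sample variance (20), namely $\hat\sigma^2_{ML}=\frac{n-1}{n}s^2$, follows directly from the formulas and can be confirmed numerically (data and names are illustrative choices):

```python
import numpy as np

# ML estimators (17) and (19) for a univariate Gaussian, compared with the
# sample variance (20): sigma2_ml equals ((n-1)/n) * s2.
rng = np.random.default_rng(4)
y = rng.normal(3.0, 1.5, size=100)
n = y.size

mu_ml = y.mean()                             # equation (17), the sample mean
s2_ml = np.sum((y - mu_ml) ** 2) / n         # equation (19), divisor n
s2 = np.sum((y - y.mean()) ** 2) / (n - 1)   # equation (20), divisor n - 1
print(s2_ml, s2, (n - 1) / n * s2)           # s2_ml equals ((n-1)/n) * s2
```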
(3) Numerical maximum likelihood estimation and Fisher-Scoring
Depending on the probabilistic model of interest, the likelihood and log likelihood functions may take arbitrarily complex forms that often render a direct analytical optimization as in the previous example impossible. In this case, numerical optimization offers an alternative. As discussed above, numerical optimization procedures initialize the input arguments of an objective function to some suitably chosen initial value and then iteratively modify this value such that the objective function approaches an extremal point. In the context of maximum likelihood estimation, the objective function corresponds to the log likelihood function, which is numerically optimized with respect to the model parameters. A popular approach for numerical maximum likelihood estimation is based on an adaptation of the multivariate Newton-Raphson approach, in which the Hessian matrix of the log likelihood function is approximated by its expected value under repeated sampling from the model. This expected Hessian is known as Fisher's information matrix, and the resulting algorithm is known as “Fisher-Scoring”.

Formally, for a parametric probabilistic model $p(y;\theta)$ with a $p$-dimensional parameter vector $\theta:=(\theta_1,\ldots,\theta_p)^T\in\Theta\subset\mathbb{R}^p$, we consider the problem of finding a maximizing value for its log likelihood function, which we denote by

$$\ell(\cdot;y):\Theta\subset\mathbb{R}^p\to\mathbb{R},\ \theta\mapsto\ell(\theta;y):=\ln p(y;\theta) \qquad (1)$$
In the context of parametric statistics, the gradient of the log likelihood function at a location 휃 ∈ Θ is
referred to as a “score vector” or a “score function”
S : Θ → ℝᵖ, θ ↦ S(θ) ≔ ∇ℓ(θ; y) = (∂ℓ(θ; y)/∂θ_1, ∂ℓ(θ; y)/∂θ_2, …, ∂ℓ(θ; y)/∂θ_p)ᵀ   (2)
Further, the negative Hessian of the log-likelihood function at a location 휃 ∈ Θ is known as “Fisher’s
Information Matrix” or simply as “Fisher Information”
I : Θ → ℝᵖˣᵖ, θ ↦ I(θ) ≔ −∇²ℓ(θ; y), i.e., the p × p matrix whose (i, j)-th entry is given by −∂²ℓ(θ; y)/(∂θ_i ∂θ_j) for i, j = 1, …, p   (3)
Note that for a fixed observation of the data 𝑦, both 𝑆 and 𝐼 are functions of the parameter 휃 ∈ Θ only.
However, because the data 𝑦 is a random variable the values of 𝑆(휃) and 𝐼(휃) can also be conceived as
realizations of random variables. One may thus form an expected value of the score vector and Fisher’s
information matrix under the distribution of the data, denoted by 𝐸(𝑆(휃)) and 𝐸(𝐼(휃)), respectively.
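To build intuition for these expectations, the following sketch (an illustrative assumption, not code from PMFN) checks by Monte Carlo simulation that, for the univariate Gaussian with known variance, the score of the expectation parameter averages to zero when evaluated at the true parameter value:

```python
import numpy as np

# Monte Carlo sketch (illustrative): for y_1, ..., y_n ~ N(mu, sigma^2) with known
# sigma^2, the score of the expectation parameter is S(mu) = sum_i (y_i - mu) / sigma^2.
# Averaging S(mu) over repeated samples from the model approximates E(S(mu)) = 0
# at the true parameter value.
rng = np.random.default_rng(1)
mu, sigma2, n = 1.0, 2.0, 10

scores = [
    (rng.normal(mu, np.sqrt(sigma2), n) - mu).sum() / sigma2
    for _ in range(100000)
]
mean_score = np.mean(scores)  # close to zero up to Monte Carlo error
```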
Initialization
0. Define a starting point θ^(0) ∈ Θ and set k ≔ 0. If S(θ^(0)) = ∇ℓ(θ^(0); y) = 0, stop! θ^(0) is a zero of ∇ℓ(⋅; y). If not, proceed to the iterations.
Until Convergence
1. Set
θ^(k+1) = θ^(k) + E(I(θ^(k)))⁻¹ S(θ^(k))
2. If S(θ^(k+1)) = 0, stop! θ^(k+1) is a zero of ∇ℓ(⋅; y). If not, go to 3.
3. Set 𝑘 ≔ 𝑘 + 1 and go to 1.
Table 1. The Fisher-Scoring method for numerical maximum likelihood estimation.
Based on the definitions above, we can now introduce the notion of Fisher-Scoring: If the function of
interest of the Newton-Raphson procedure corresponds to a log likelihood function (i.e. the aim of the
optimization is to find maximum likelihood parameter estimates), and the Hessian of the log-likelihood
function is replaced by the negative expected Fisher-Information (which is often easier to evaluate for the
current setting of the parameter 휃(𝑘)), then this procedure is referred to as “Fisher Scoring” or “Fisher
Scoring Algorithm”. Replacing the objective function by the log likelihood function ℓ(⋅; y), the gradient of
the objective function by the score vector S, and the Hessian of the objective function by the negative
expected Fisher information in the multivariate Newton-Raphson method thus yields the Fisher-Scoring
algorithm (Table 1).
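As a minimal numerical sketch of the iteration in Table 1 (an illustrative implementation, not code from PMFN), consider maximum likelihood estimation of the expectation parameter of a univariate Gaussian with known variance; the `score` and `expected_info` functions below are the assumed closed forms for this model:

```python
import numpy as np

def fisher_scoring(score, expected_info, theta0, max_iter=100, tol=1e-10):
    """Fisher-Scoring iteration of Table 1: theta <- theta + E(I(theta))^{-1} S(theta)."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        s = np.atleast_1d(score(theta))
        if np.max(np.abs(s)) < tol:  # S(theta) approximately zero: stop
            break
        theta = theta + np.linalg.solve(np.atleast_2d(expected_info(theta)), s)
    return theta

# Example: ML estimation of the expectation mu of N(mu, sigma^2) with known sigma^2.
rng = np.random.default_rng(0)
sigma2 = 2.0
y = rng.normal(1.5, np.sqrt(sigma2), size=100)
n = y.size

score = lambda th: np.array([np.sum(y - th[0]) / sigma2])  # gradient of the log likelihood
expected_info = lambda th: np.array([[n / sigma2]])        # expected Fisher information

mu_ml = fisher_scoring(score, expected_info, theta0=[0.0])
# For this model the iteration recovers the sample mean (here in a single step).
```

For this simple model the expected information does not depend on θ and the algorithm converges immediately; for models with curved likelihoods, several iterations are required.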
(4) Bayesian Estimation
There are at least two reasons to introduce the Bayesian approach to the estimation and evaluation
of the GLM. The first reason is data-analytic: Bayesian inference provides a principled, mathematically
grounded, and coherent framework for dealing with uncertainty in scientific theories (“models”) in light of data.
In contrast to classical statistics, it does not comprise a collection of special-purpose tools (which, as discussed so far,
can be summarized under the classical framework of the GLM), but provides a general framework for probabilistic
inference on any model class. Although historically inaccurate, Bayesian data analysis still carries the connotation of
“modern data analysis” and hence is becoming increasingly popular in the psychological and neuroscientific
literature. For example, the Dynamic Causal Modelling approach to EEG and fMRI is an explicitly Bayesian
approach to neuroimaging data analysis that is receiving increasing attention. There is an important
second reason: it is fairly safe to say that, at the beginning of the 21st century, the dominant view of the
human brain is that of a dynamic system implementing Bayesian statistical inference. The core idea here is
that the brain encodes a model of the world and matches its sensory input and motor output to adhere to its
own predictions. This “Bayesian brain hypothesis” has a long history, dating back centuries, and currently
enjoys popularity under the name “free energy principle”. To understand this leading brain theory (and/or
to come up with sensible alternative hypotheses), it is necessary to understand the formal framework that the
Bayesian brain hypothesis rests on. A fundamental building block of Bayesian inference is the notion of
conditional probability, which the reader is encouraged to review at this point.
Bayesian parameter estimation – Prior, posterior, and likelihood
For a joint distribution p(y, θ) of random entities y and θ (i.e., random variables or vectors),
“Bayesian inference” corresponds to a specific interpretation of y and θ and their respective marginal and
conditional distributions. Specifically, the random entity y is interpreted as “the data”, the random entity
θ is interpreted as “the parameter”, and the joint distribution p(y, θ) is interpreted as “the (“generative” or
“probabilistic”) model”. Based on these interpretations, Bayes theorem in the form

p(θ|y) = p(θ)p(y|θ) / p(y)   (1)

allows one to determine the conditional probability distribution of the parameter θ “given” (or
“conditioned on”) the data y. Note that p(θ|y) reflects the conditional probability distributions over the
parameter values θ for all possible values that the data y may take on. In a given experimental context, a
specific data point (or data set) y* is usually observed. In this case, the conditional probability distribution
over parameter values may be determined from

p(θ|y = y*) = p(θ)p(y = y*|θ) / p(y = y*)   (2)
To determine the conditional distribution 𝑝(휃|𝑦) of the parameter given the data and use it for
probabilistic statements about the conditional probabilities of 휃 is the aim of “Bayesian parameter
estimation”. Note that in contrast to the parameter estimation schemes we have discussed so far, Bayesian
parameter estimation does not aim to estimate a single value for the true, but unknown, parameter value,
but a probability distribution over the possible parameter values. The data conditional parameter
distribution is also called the “posterior distribution”, because it can be conceived as the probability
distribution over parameters “once the data have been observed”. Note however, that the conditional
parameter distribution is inherent in the specification of the model 𝑝(𝑦, 휃) and is thus not created de novo
once the data is observed.
Figure 1. Fundamentals of Bayesian Inference. The left panel depicts a joint probability distribution (or, more precisely, joint probability density function (PDF)) over two scalar random variables, 𝑦 and 휃. The right panels show marginal and conditional distributions inherent in the joint distribution of the left panel: The upper right panels depict the marginal distributions over 휃 and 𝑦, respectively, while the lower right panels depict examples for conditional distributions of 휃 given 𝑦, here for 𝑦 = 0 and 𝑦 = −2. The parameters (expectation and covariance) of the marginal and conditional probability density functions can be evaluated as functions of the parameters (expectation and covariance) of the joint distribution. The details of this procedure will be discussed in the following Sections.
In principle, the posterior parameter distribution 𝑝(휃|𝑦) for a given observation 𝑦 = 𝑦∗ may be
directly determined from the joint probability distribution 𝑝(𝑦 = 𝑦∗, 휃) and the marginal probability
𝑝(𝑦 = 𝑦∗). However, the Bayesian paradigm usually proceeds by explicitly stating the marginal distribution
𝑝(휃) and the conditional probability distribution 𝑝(𝑦|휃), whose product results in 𝑝(𝑦, 휃). The marginal
distribution 𝑝(휃) over the parameter is called the “prior” distribution in the Bayesian paradigm. It
corresponds to the probability distribution one specifies over the parameter values independent of any data.
To form the posterior distribution 𝑝(휃|𝑦), the distribution 𝑝(𝑦, 휃) and thus also 𝑝(휃) has to be already
specified, which explains the notion of a “prior” distribution over 휃.
The term p(y|θ) in the numerator on the right-hand side of (1) is referred to as “the (data)
likelihood”. For each specific parameter value θ = θ*, this term defines a conditional probability distribution
over y. We have encountered a “likelihood” on many occasions previously. Specifically, we have repeatedly
stated that the GLM specifies a probability distribution over data y of the form

p(y) = N(y; Xβ, σ²I_n)   (3)

where we referred to β and σ² as the fixed true, but unknown, values. Let these specific fixed values be
denoted by β = β* and σ² = σ²*, and think of β and σ² as random variables. From a Bayesian viewpoint, we
may define a generative model in which θ ≔ (β, σ²)ᵀ is governed by a probability distribution
corresponding to

p(θ) ≔ p((β, σ²)ᵀ)   (4)

In this case we may write the data likelihood as
p(y|θ) ≔ p(y|(β, σ²)ᵀ) = N(y; Xβ, σ²I_n)   (5)
In more general terms, the likelihood specifies a probability distribution over the data given specific
parameter values. How these parameter values are linked to specific data distributions is dependent on the
“functional” (or “structural”) form of the generative model, and we will study different scenarios in the
sections to come.
Consider again Bayes theorem for a specific data observation

p(θ|y = y*) = p(θ)p(y = y*|θ) / p(y = y*)   (6)

The denominator on the right-hand side, p(y = y*), which we will discuss in more depth below, corresponds
to a constant multiplicative factor for p(θ)p(y = y*|θ). In other words, the posterior distribution
p(θ|y = y*) is proportional to the product of the prior parameter distribution p(θ) and the likelihood
p(y = y*|θ). A common mnemonic for Bayes theorem from the perspective of the Bayesian paradigm is
thus

Posterior ∝ Prior × Likelihood   (7)
The posterior distribution over the parameter given the data is one of the (two) main outcomes of a
Bayesian data analysis. Often, it is helpful to summarize this distribution using some meaningful numbers.
One example would be the parameter value with the highest posterior probability, which intuitively
corresponds to the “most likely” posterior parameter value. This value, i.e. the mode of the posterior
probability distribution, is known as the maximum-a-posteriori (MAP) parameter estimate. Another quantity
of interest may be the probability for the parameter to fall into a specific interval or to be larger than a given
pre-specified value. For example, for a posterior distribution corresponding to a univariate Gaussian (Figure 1),
the probability of the parameter falling in the interval [0,1] may be of interest. Intervals and their
associated posterior probabilities of this sort are known as “credible intervals”.
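As a sketch of such posterior summaries (using hypothetical posterior parameter values, chosen here to match the univariate Gaussian example developed in the next section), the MAP estimate and the posterior probability of an interval can be evaluated as follows:

```python
from scipy.stats import norm

# Posterior summaries for a hypothetical Gaussian posterior p(theta|y); the values
# mean 4/3 and variance 2/3 are illustrative (they arise in the univariate example below).
mu_post, var_post = 4.0 / 3.0, 2.0 / 3.0
sd_post = var_post ** 0.5

map_estimate = mu_post  # for a Gaussian posterior, the mode (MAP estimate) equals the mean
p_interval = norm.cdf(1.0, mu_post, sd_post) - norm.cdf(0.0, mu_post, sd_post)
# p_interval is the posterior probability that theta falls in the interval [0, 1]
```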
Bayesian model comparison – Evidence
To determine the posterior parameter distribution and to make inferences based on it is the first principal aim
of Bayesian inference. The second aim is “Bayesian model comparison”. Bayesian model comparison is
usually performed by posing the following question: given two models p₁(y, θ) and p₂(y, θ), under which
model is the observed data y* more likely? As such, Bayesian model comparison is based on the ratio of the
marginal probabilities of the observed data y* under each model

p₁(y = y*) / p₂(y = y*)   (1)
Note that for any model p(y, θ), the probability p(y = y*) corresponds to the normalization factor on the
right-hand side of Bayes theorem as specified in equation (1) of the previous section. In the Bayesian
paradigm, this marginal probability is also known as the “model evidence”. Based on their respective
generative models, the marginal probabilities p₁(y = y*) and p₂(y = y*) may be obtained from the joint
probabilities p₁(y = y*, θ) and p₂(y = y*, θ) by “integrating out” or “summing over” all possible
parameter values. We may thus rewrite the above as

p₁(y = y*) / p₂(y = y*) = ∑_{θ*∈Θ₁} p₁(y = y*, θ = θ*) / ∑_{θ*∈Θ₂} p₂(y = y*, θ = θ*)   (2)
where we used Θ1 and Θ2 to denote the parameter spaces of the generative models 𝑝1(𝑦, 휃) and 𝑝2(𝑦, 휃),
respectively. Before we return to a general discussion of Bayesian parameter inference and Bayesian model
comparison with respect to the statistical inference machinery introduced for the GLM so far, it is helpful to
consider a basic concrete example of Bayesian inference.
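For illustration of the evidence ratio in (2), the sums over parameter values can be approximated on a dense grid; the two priors and the observation y* = 2 below are hypothetical choices, not values from PMFN:

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of the evidence ratio in (2) for two hypothetical generative models
# that share the likelihood p(y|theta) = N(y; theta, 1) but differ in their prior over theta.
y_star = 2.0
theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]

prior1 = norm.pdf(theta, 0.0, np.sqrt(2.0))    # model 1: prior N(0, 2)
prior2 = norm.pdf(theta, -3.0, np.sqrt(2.0))   # model 2: prior N(-3, 2)
lik = norm.pdf(y_star, theta, 1.0)             # shared likelihood, evaluated at y* = 2

evidence1 = np.sum(prior1 * lik) * dtheta      # ~ p1(y = y*), analytically N(2; 0, 3)
evidence2 = np.sum(prior2 * lik) * dtheta      # ~ p2(y = y*), analytically N(2; -3, 3)
bayes_factor = evidence1 / evidence2           # > 1: the observed data favor model 1
```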
(5) Bayesian estimation of the expectation of a univariate Gaussian
Consider the following likelihood

p(y|θ) ≔ N(y; θ, σ_y²) = (1/√(2πσ_y²)) exp(−(1/(2σ_y²))(y − θ)²)   (1)
where y ∈ ℝ is a scalar random variable and σ_y² > 0 is a fixed and known constant. According to the
likelihood statement above, the random variable y is normally distributed with expectation parameter
θ ∈ ℝ and variance parameter σ_y² > 0. In contrast to our previous discussion of the univariate Gaussian, in
the current example the expectation parameter θ is considered a random variable. A standard choice for
the marginal distribution p(θ) is a univariate Gaussian with expectation parameter μ_θ ∈ ℝ and variance
parameter σ_θ² > 0:
p(θ) ≔ N(θ; μ_θ, σ_θ²) = (1/√(2πσ_θ²)) exp(−(1/(2σ_θ²))(θ − μ_θ)²)   (2)
Here, we used the subscript θ to emphasize that the parameters μ_θ and σ_θ² govern the distribution of θ.
From a Bayesian perspective, the distribution 𝑝(휃) corresponds to the prior distribution over 휃 before we
observe a value of 𝑦, and the likelihood 𝑝(𝑦|휃) provides the distribution over 𝑦 for each value that 휃 may
take on. Together, 𝑝(휃) and 𝑝(𝑦|휃) specify a generative model, i.e. a joint distribution over both 𝑦 ∈ ℝ and
휃 ∈ ℝ of the form
p(y, θ) = p(θ)p(y|θ) = N(θ; μ_θ, σ_θ²) N(y; θ, σ_y²)   (3)
In the Bayesian formulation of Gaussian probability density function models, working with the inverse of the variance (or
covariance) parameter, referred to as the “precision” parameter, is often preferred. By defining the precision
parameters

λ_y ≔ 1/σ_y² and λ_θ ≔ 1/σ_θ²   (4)
we can rewrite the generative model as follows

p(y, θ) = N(θ; μ_θ, λ_θ⁻¹) N(y; θ, λ_y⁻¹) = (√λ_θ/√(2π)) exp(−(λ_θ/2)(θ − μ_θ)²) ⋅ (√λ_y/√(2π)) exp(−(λ_y/2)(y − θ)²)   (5)
Note that Gaussians with a high variance have a low precision and vice versa.
Below we will be concerned with general properties of and Bayesian inference in such generative
Gaussian models. For the moment, we merely state that based on the definition of the prior 𝑝(휃) in (2) and
the likelihood 𝑝(𝑦|휃) in (1), we may evaluate the posterior distribution 𝑝(휃|𝑦) by applying Bayes theorem in
the form
p(θ|y) = p(θ)p(y|θ) / p(y)   (6)
Substituting (1) and (2) into (6), it can be shown (and we will do so in the following Section) that the
posterior distribution over θ conditioned on the data y is given by a Gaussian distribution of the form

p(θ|y) = N(θ; μ_θ|y, λ_θ|y⁻¹)   (7)

where

λ_θ|y ≔ λ_y + λ_θ and μ_θ|y ≔ (λ_y/(λ_y + λ_θ)) y + (λ_θ/(λ_y + λ_θ)) μ_θ   (8)
We use the subscript ⋅𝜃|𝑦 to denote parameters of conditional distributions, here the distribution of 휃 given
𝑦. A couple of things are worth noting about equations (7) and (8).
Figure 2. Bayesian inference for the expectation parameter of the univariate Gaussian with prior parameters μ_θ ≔ 0 and σ_θ² ≔ 2, likelihood parameter σ_y² ≔ 1, and observed data value y ≔ 2. The left panel depicts the generative model p(y, θ) as specified by equation (5). The white triangle depicts an observation y = 2 and the corresponding generative model “section” in θ space (dashed white line). The right panel depicts the prior distribution over θ and the posterior distribution over θ given the observation y = 2 as specified by the parameter update equations (8).
First, the posterior distribution over θ is a Gaussian distribution, as is the prior distribution over θ.
Second, the parameters of the posterior distribution over θ result from relatively simple formulas that
involve the parameters of the prior distribution and the likelihood (λ_θ, λ_y, μ_θ) and the data y. Distributions
with the property that the posterior distribution belongs to the same class as the prior distribution, but with
(potentially) different parameters, are called “conjugate distributions”. For the expectation parameter of a
univariate Gaussian likelihood, the univariate Gaussian is thus the conjugate prior (and hence posterior)
distribution. Third, the parameter update equations (8) for the parameters of the posterior distribution in
terms of the prior and likelihood parameters and the data can be memorized very well: to obtain the precision
parameter λ_θ|y of the posterior distribution, the precision parameters of the prior λ_θ and likelihood λ_y are
added; to obtain the expectation parameter μ_θ|y, a weighted average is formed between the observed data
y ∈ ℝ and the prior parameter μ_θ, where the weighting factors are given by their precisions relative to the
posterior precision, λ_y/λ_θ|y and λ_θ/λ_θ|y, respectively. From a Bayesian perspective, the combined
precisions of likelihood and prior lead to an increased precision in our knowledge about the value of θ.
Further, the expectation parameter of the posterior distribution thus offers a sensible compromise between
our prior knowledge about the value of θ and the data observed. This is visualized in Figure 2.
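The update equations (8) can be evaluated directly for this example (prior μ_θ = 0, σ_θ² = 2, likelihood σ_y² = 1, observation y = 2); the snippet below is a plain transcription of the formulas, not code from PMFN:

```python
# Direct transcription of the precision-weighted update equations (8) for the example
# of Figure 2: prior N(0, 2), likelihood N(theta, 1), observed data y = 2 (illustrative).
mu_theta, sigma2_theta = 0.0, 2.0   # prior expectation and variance
sigma2_y = 1.0                      # known likelihood variance
y = 2.0                             # observed data point

lam_y, lam_theta = 1.0 / sigma2_y, 1.0 / sigma2_theta
lam_post = lam_y + lam_theta                               # posterior precision: 1.5
mu_post = (lam_y * y + lam_theta * mu_theta) / lam_post    # posterior expectation: 4/3
```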
We can reformulate the equations for the posterior distribution parameters using variances instead
of precisions to gain further insight into the analytic properties of the posterior distribution. Based on (8) we
have

μ_θ|y = c μ_θ + (1 − c) y with c ≔ σ_y²/(σ_y² + σ_θ²) and σ_θ|y² = σ_y²σ_θ²/(σ_y² + σ_θ²)   (9)
Proof of (9)

For the posterior variance parameter we have

σ_θ|y² = 1/λ_θ|y = 1/(λ_y + λ_θ) = 1/(1/σ_y² + 1/σ_θ²) = ((σ_θ² + σ_y²)/(σ_y²σ_θ²))⁻¹ = σ_y²σ_θ²/(σ_θ² + σ_y²).   (9.1)

For the posterior distribution expectation parameter we have

μ_θ|y = (λ_y/(λ_y + λ_θ)) y + (λ_θ/(λ_y + λ_θ)) μ_θ = (1/σ_y²)(σ_y²σ_θ²/(σ_θ² + σ_y²)) y + (1/σ_θ²)(σ_y²σ_θ²/(σ_θ² + σ_y²)) μ_θ = (σ_θ²/(σ_θ² + σ_y²)) y + (σ_y²/(σ_θ² + σ_y²)) μ_θ = (σ_y²μ_θ + σ_θ²y)/(σ_y² + σ_θ²).   (9.2)

To bring the above into the form (9), we rewrite it as follows

μ_θ|y = (σ_y²μ_θ + σ_θ²y)/(σ_y² + σ_θ²) = (σ_y²/(σ_y² + σ_θ²)) μ_θ + (σ_θ²/(σ_y² + σ_θ²)) y = (σ_y²/(σ_y² + σ_θ²)) μ_θ + ((σ_y² + σ_θ² − σ_y²)/(σ_y² + σ_θ²)) y = c μ_θ + (1 − c) y   (9.3)

where we defined

c ≔ σ_y²/(σ_y² + σ_θ²).   (9.4)

□
We first consider the expression for the posterior expectation in (9). Because σ_y² > 0 and σ_θ² > 0,
the constant c takes on values in the (open) interval ]0,1[. Note that if σ_θ² is very small in comparison to σ_y²,
c tends towards 1, while if σ_θ² is relatively large with respect to σ_y², c tends towards 0. Further, because
0 < c < 1, we have c + (1 − c) = 1. From these properties of c, we see that the posterior expectation
parameter μ_θ|y corresponds to a weighted average of the prior expectation parameter μ_θ and the observed
data point y (assume, for example, that c = 0.4; then μ_θ|y = 0.4 ⋅ μ_θ + 0.6 ⋅ y. For c = 0.5, μ_θ|y
corresponds to the standard (unweighted) average of μ_θ and y). If σ_θ² is large relative to σ_y²,
corresponding to vague prior information, c is small, and more weight is given to the data y compared to
the prior expectation parameter μ_θ. Conversely, if σ_θ² is small compared to σ_y², in other words, if the prior is
highly informative, c is large and the posterior expectation parameter μ_θ|y is close to the prior expectation
parameter μ_θ. The constant c is sometimes referred to as a shrinkage factor, because it gives the proportion of the
distance by which the posterior expectation parameter is “shrunk back” from the ordinary estimate y towards
the prior mean μ_θ.
We next consider the expression for the posterior variance in (9). The posterior variance can be written
both as σ_θ|y² = (σ_y²/(σ_y² + σ_θ²)) ⋅ σ_θ² and as σ_θ|y² = (σ_θ²/(σ_y² + σ_θ²)) ⋅ σ_y². Because both fractions
σ_y²/(σ_y² + σ_θ²) and σ_θ²/(σ_y² + σ_θ²) are larger than 0 but smaller than 1, the posterior variance
parameter is always smaller than both the variance parameter of the prior and that of the likelihood.
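These variance-form expressions can be checked numerically for the running example (σ_y² = 1, σ_θ² = 2, μ_θ = 0, y = 2); the snippet is an illustrative transcription of equation (9):

```python
# Variance-form view of the posterior parameters (equation (9)) for the running example
# sigma_y^2 = 1, sigma_theta^2 = 2, mu_theta = 0, y = 2 (illustrative transcription).
sigma2_y, sigma2_theta = 1.0, 2.0
mu_theta, y = 0.0, 2.0

c = sigma2_y / (sigma2_y + sigma2_theta)                         # shrinkage factor: 1/3
mu_post = c * mu_theta + (1 - c) * y                             # shrunk towards the prior mean
var_post = sigma2_y * sigma2_theta / (sigma2_y + sigma2_theta)   # 2/3: below both variances
```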
In summary, we have seen that Bayesian statistical inference is based on joint distributions over
parameters θ and data y. For most of the models discussed in PMFN these joint distributions and their
marginal and conditional distributions are represented by parameterized probability density functions. In
order not to confuse the model parameter θ of interest with these parameters, the latter are sometimes referred
to as “hyperparameters”. However, in our opinion, it is more helpful to think of θ and y as “unobserved and
observed random variables”, and of the parameters of their joint distributions simply as parameters. In the following
Section, we explore what happens if we consider the beta parameter β ∈ ℝᵖ of the GLM as an unobserved
random variable, i.e., if we, based on a prior distribution p(β), aim to infer the posterior distribution p(β|y)
and the model evidence p(y). Note that the discussion of the univariate Gaussian case above corresponds to a
GLM with a single observation (n ≔ 1) and a design matrix X ≔ 1.
(6) Principles of variational Bayes
Variational Bayes (VB) is a statistical framework for probabilistic models comprising unobserved
variables and has received increasing attention in the machine learning, theoretical neuroscience, and
neuroimaging literature since the late 1990s. The general starting point of a Bayesian approach is a joint
distribution over observed random variables 𝑦 and unobserved random variables 𝜗
𝑝(𝑦, 𝜗) = 𝑝(𝜗)𝑝(𝑦|𝜗) (1)
where 𝑝(𝜗) is usually referred to as the prior distribution and 𝑝(𝑦|𝜗) as the likelihood. Joint distributions
over observed and unobserved variables are sometimes referred to as “generative models”, a convention we
will follow here. Given an observed value 𝑦∗ of 𝑦, the first aim of a Bayesian approach is to determine the
conditional distribution of 𝜗 given 𝑦∗, referred to as the posterior distribution. The second aim of a Bayesian
approach is to evaluate the logarithm of the marginal probability of the observed data 𝑦 denoted by
ln 𝑝(𝑦) = ln ∫ 𝑝(𝑦, 𝜗) 𝑑𝜗 (2)
If a model comprises only non-random quantities, classically referred to as “parameters”, the left-hand side
of (2) is referred to as the “log likelihood” and no integration as on the right-hand side of (2) is required.
However, if a model comprises unobserved random variables, which are integrated out as on the right-hand
side of (2), then (2) is referred to as the “log marginal likelihood” or “log model evidence”. The log model
evidence allows for comparing different models in their plausibility to explain observed data. It thus forms
the necessary prerequisite for Bayesian model comparison. In the VB framework it is not the log model
evidence itself which is evaluated, but rather a lower bound approximation to it. This is due to the fact that if
a model comprises many unobserved variables 𝜗, the integration on the right-hand side of equation (2) can
become analytically burdensome or even intractable. To nevertheless achieve the two aims of a Bayesian
approach (posterior parameter estimation and model evidence evaluation), VB in effect replaces an
integration problem with an optimization problem. To this end, VB exploits a set of information theoretic
quantities, as introduced previously and below.
The following log model evidence decomposition forms the core of the VB approach (Figure 1):

ln p(y) = ℱ(q(𝜗)) + 𝒦ℒ(q(𝜗)||p(𝜗|y))   (3)

where q(𝜗) denotes an arbitrary probability distribution over the unobserved variables, which is used as an
approximation of the posterior distribution p(𝜗|y). In the following, q(𝜗) is referred to as the “variational
distribution”. In words, equation (3) states that for an arbitrary variational distribution q(𝜗) over the
unobserved variables, the log model evidence comprises the sum of two information theoretic quantities:
the so-called “variational free energy”, which is defined here as

ℱ(q(𝜗)) ≔ ∫ q(𝜗) ln(p(y, 𝜗)/q(𝜗)) d𝜗   (4)

and the Kullback-Leibler (KL) divergence between the variational distribution q(𝜗) and the true posterior
distribution p(𝜗|y), where for general densities q and p

𝒦ℒ(q(x)||p(x)) ≔ ∫ q(x) ln(q(x)/p(x)) dx   (5)
Proof of (3)

Based on the definitions of ℱ(q(𝜗)) and 𝒦ℒ(q(𝜗)||p(𝜗|y)) it is easy to show that the decomposition of the log model evidence
formally holds: by the definition of the variational free energy in (4) and the factorization p(y, 𝜗) = p(y)p(𝜗|y), we have

ℱ(q(𝜗)) = ∫ q(𝜗) ln(p(y)p(𝜗|y)/q(𝜗)) d𝜗   (3.1)

Using the properties of the logarithm and the linearity of integrals, it follows that

ℱ(q(𝜗)) = ∫ q(𝜗) ln p(y) d𝜗 + ∫ q(𝜗) ln(p(𝜗|y)/q(𝜗)) d𝜗   (3.2)

With the linearity of integrals, we then also have

ℱ(q(𝜗)) = ln p(y) ∫ q(𝜗) d𝜗 + ∫ q(𝜗) ln(p(𝜗|y)/q(𝜗)) d𝜗   (3.3)

and because q(𝜗) is a probability distribution (and thus integrates to 1), and again with the properties of the logarithm, we obtain

ℱ(q(𝜗)) = ln p(y) − ∫ q(𝜗) ln(q(𝜗)/p(𝜗|y)) d𝜗   (3.4)

The definition of the KL divergence then allows writing (3.4) as

ℱ(q(𝜗)) = ln p(y) − 𝒦ℒ(q(𝜗)||p(𝜗|y))   (3.5)

from which (3) follows immediately.

□
The non-negativity property of the KL divergence has the consequence that the variational free
energy ℱ(q(𝜗)) is always smaller than or equal to the log model evidence, that is

ℱ(q(𝜗)) ≤ ln p(y)   (6)

This fact is exploited in the numerical application of the VB approach to probabilistic models: because the log
model evidence is a fixed quantity, which depends only on the choice of p(y, 𝜗) and a specific data
realization y*, manipulating the variational distribution q(𝜗) for a given data set in such a manner that the
variational free energy increases has two consequences: first, the lower bound to the log model evidence
becomes tighter, and the variational free energy becomes a better approximation to the log model evidence. Second,
because the left-hand side of (3) remains constant, the KL divergence between the true posterior and its
variational approximation decreases, which renders the variational distribution q(𝜗) a better approximation
to the true posterior distribution p(𝜗|y) (Figure 2).
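The decomposition (3) can be verified numerically for the conjugate Gaussian example of the previous section (prior N(0, 2), likelihood N(θ, 1), observation y = 2); the variational density chosen below is arbitrary and purely illustrative:

```python
import numpy as np
from scipy.stats import norm

# Numerical check of ln p(y) = F(q) + KL(q || p(theta|y)) for the conjugate Gaussian
# model: prior N(0, 2), likelihood N(theta, 1), observation y = 2. The variational
# density q is an arbitrary Gaussian; all integrals are approximated on a dense grid.
y = 2.0
theta = np.linspace(-12.0, 12.0, 40001)
d = theta[1] - theta[0]

prior = norm.pdf(theta, 0.0, np.sqrt(2.0))
lik = norm.pdf(y, theta, 1.0)
joint = prior * lik                                     # p(y, theta) on the grid
log_evidence = np.log(norm.pdf(y, 0.0, np.sqrt(3.0)))   # ln p(y): marginal is N(0, 2 + 1)
posterior = norm.pdf(theta, 4.0 / 3.0, np.sqrt(2.0 / 3.0))

q = norm.pdf(theta, 1.0, np.sqrt(0.5))                  # an arbitrary variational density
free_energy = np.sum(q * (np.log(joint) - np.log(q))) * d    # F(q), equation (4)
kl = np.sum(q * (np.log(q) - np.log(posterior))) * d         # KL(q || posterior)
# free_energy + kl recovers ln p(y), and free_energy <= ln p(y) as in (6).
```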
Figure 1. Visualization of the log model evidence decomposition that lies at the heart of the VB approach. The upper vertical bar is meant to represent the log model evidence, which is a function of the generative model 𝑝(𝑦, 𝜗) and is constant for any observation 𝑦∗ of 𝑦. As shown in the main text, the log model evidence can readily be rewritten into the sum of the variational free energy term
ℱ(𝑞(𝜗)) and a KL-divergence term 𝒦ℒ(𝑞(𝜗)||𝑝(𝜗|𝑦)), if one introduces an arbitrary variational distribution over the unobserved
variables 𝜗. Maximizing the variational free energy hence minimizes the KL divergence between the variational distribution 𝑞(𝜗) and the true posterior distribution 𝑝(𝜗|𝑦) and renders the variational free energy a better approximation of the log model evidence. Equivalently, minimizing the KL divergence between the variational distribution 𝑞(𝜗) and the true posterior distribution 𝑝(𝜗|𝑦) maximizes the free energy and also renders it a tighter approximation to the log model evidence ln 𝑝(𝑦).
The log model evidence decomposition in terms of a variational free energy and a Kullback-Leibler
divergence induced by a variational distribution, as discussed above, is a fairly general approach for Bayesian
parameter estimation and model evidence approximation. For concrete generative models, it serves
as a guiding principle rather than a concrete numerical algorithm: algorithms that make use of the log
model evidence decomposition are jointly referred to as variational Bayesian algorithms, but many variants
exist. One may roughly classify these algorithms along two dimensions: (1) whether or not they employ
parametric assumptions about the variational distribution q(𝜗), referred to as “fixed-form” and “free-form”
variational Bayes, respectively, and (2) whether or not they assume that the variational distribution q(𝜗)
factorizes over groups of unobserved variables. The latter is usually referred to as the mean-field assumption
and, for s groups of unobserved random variables, is denoted by q(𝜗) = ∏_{i=1}^s q(𝜗_i). The variational Bayes
variant often employed for the estimation of differential equation models of neuroimaging data corresponds
to a fixed-form variational Bayesian approach with a mean-field assumption.
Figure 2. The log model evidence decomposition of Figure 1 is exploited in numerical algorithms for free-form VB inference: based on a mean-field approximation q(𝜗) = q(𝜗_s)q(𝜗_\s), the variational free energy can be maximized in a coordinate-wise fashion. Maximizing the variational free energy in turn has two implications: it decreases the KL divergence between q(𝜗) and the true posterior p(𝜗|y) and renders the variational free energy a closer approximation to the log model evidence. This holds true because the log model evidence for a given observation y* is constant (represented by the constant length of the vertical bar) and the KL divergence is non-negative.
Study Questions
1. Write down the general form of the likelihood function 𝐿 and name and explain its components.
2. Write down the general form of a maximum likelihood estimator 휃̂𝑀𝐿 and explain the definition.
3. Name two approaches for obtaining maximum likelihood estimators given a likelihood function.
4. Write down the log likelihood function for 𝑛 independent and identically distributed univariate Gaussian random variables
5. Write down the maximum likelihood estimator for the expectation and variance parameter of a univariate Gaussian based on 𝑛
independent and identical observations and verbally explain how these are derived from the corresponding likelihood function.
6. Based on a joint probability distribution 𝑝(𝑦, 휃) over data 𝑦 and parameters 휃, write down the formal equivalent to the
mnemonic 𝑃𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 ∝ 𝑃𝑟𝑖𝑜𝑟 × 𝐿𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑.
7. Write down the definition of the precision parameter for a univariate Gaussian distribution with variance parameter 𝜎2.
8. Write down the parameter update equations for the conjugate posterior distribution p(θ|y) = N(θ; μ_θ|y, λ_θ|y⁻¹) arising in expectation parameter estimation for the univariate Gaussian p(y|θ) ≔ N(y; θ, σ_y²), using precision parameters, and provide a verbal explanation of their intuition.
9. Write down the definition of the variational free energy and name its constituents.
10. Why is the variational free energy useful for Bayesian inference?
Study Question Answers
1. For a probabilistic model which specifies the probability of data y by means of parameterized probability density functions p(y; θ),
where θ ∈ Θ denotes the model’s parameter and Θ ⊂ ℝᵖ denotes the model’s parameter space, the function

L: Θ × ℝⁿ → ℝ₊, (θ, y) ↦ L(θ, y) ≔ p(y; θ)

where θ ∈ ℝᵖ and y ∈ ℝⁿ, is called the likelihood function of the parameter θ for the observation y. Notably, the likelihood function
is a function of the parameter value θ, while, in the case of an available realization of the random variable y, the value of y is fixed.
This contrasts with the notion of a probability density function p(y; θ), which is a function of the random variable y.
2. The maximum likelihood estimator for a given probabilistic model p(y; θ) is that value θ̂_ML of θ which maximizes the likelihood
function. Formally, this can be expressed as

θ̂_ML ≔ argmax_{θ∈Θ} L(θ, y)

The above should be read as “θ̂_ML is that argument of the likelihood function L for which L(θ, y) assumes its maximal value over all
possible parameter values θ in the parameter space Θ”.
3. One approach to find “closed-form” or analytical expressions for ML estimators is to maximize the likelihood function with
respect to 휃 by means of the analytical determination of critical values at which its derivative vanishes and checking, whether its
second derivative is negative. Another approach, often encountered in practical numerical computing is to automatically shift the
values of 휃 around by means of an algorithm, while monitoring the value of the likelihood function, and to stop, once this value is
considered to be maximal based on a sensible stopping criterion.
4. The log likelihood function for n independent and identically distributed univariate Gaussian random variables is given by

ℓ: (ℝ × ℝ₊\{0}) × ℝⁿ → ℝ, ((μ, σ²), y) ↦ ℓ((μ, σ²), y) ≔ ln p(y; μ, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^n (y_i − μ)²
5. The maximum likelihood estimators for the expectation and variance parameter of a univariate Gaussian distribution based on 𝑛 independent and identically distributed observations are given by
𝜇̂𝑀𝐿 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 and 𝜎̂2𝑀𝐿 = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝜇̂𝑀𝐿)^2
and are derived from the corresponding likelihood function by computing the respective partial derivatives, setting them to zero, and solving for the critical points (which, due to the concavity of the log likelihood function, correspond to maxima).
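The two estimators above can be checked numerically. The following Python sketch simulates data from a Gaussian with made-up parameter values (𝜇 = 2, 𝜎2 = 4) and evaluates the ML formulas:

```python
import random

random.seed(0)
n = 100_000
# Simulated data: n i.i.d. draws from a Gaussian with made-up mu = 2, sigma^2 = 4
y = [random.gauss(2.0, 2.0) for _ in range(n)]

# ML estimators from the text: sample mean and 1/n-normalized sample variance
mu_ml = sum(y) / n
sigma2_ml = sum((yi - mu_ml) ** 2 for yi in y) / n
print(mu_ml, sigma2_ml)  # close to 2 and 4 for large n
```

For large 𝑛, the estimates approach the true parameter values, as expected from the consistency of ML estimators.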
6. The formal equivalent is given by Bayes’ theorem applied for the current joint distribution
𝑝(휃|𝑦 = 𝑦∗) = 𝑝(휃)𝑝(𝑦 = 𝑦∗|휃)/𝑝(𝑦 = 𝑦∗)
7. A precision parameter is the multiplicative inverse of a variance parameter. We thus have 𝜆 ≔ 1/𝜎2.
8. The parameters of 𝑝(휃|𝑦) are given by
𝜆𝜃|𝑦 ≔ 𝜆𝑦 + 𝜆𝜃 and 𝜇𝜃|𝑦 ≔ (𝜆𝑦/(𝜆𝑦 + 𝜆𝜃)) 𝑦 + (𝜆𝜃/(𝜆𝑦 + 𝜆𝜃)) 𝜇𝜃
Intuitively, the posterior certainty 𝜆𝜃|𝑦 about 휃 is larger than the prior certainty 𝜆𝜃 due to the contribution of 𝜆𝑦 > 0, and the posterior expectation of 휃 is a weighted sum of the data 𝑦 and the prior expectation 𝜇𝜃, with weights given by their relative precisions, i.e. their respective precisions normalized by the sum of precisions.
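The update equations can be evaluated directly for concreteness; all numerical values below are illustrative:

```python
# Illustrative values: prior N(mu_theta, 1/lambda_theta) and one datum y with precision lambda_y
mu_theta, lambda_theta = 0.0, 1.0   # prior expectation and precision (made up)
y, lambda_y = 4.0, 3.0              # observed datum and data precision (made up)

# Posterior precision and expectation, as given in the text
lambda_post = lambda_y + lambda_theta
mu_post = (lambda_y / (lambda_y + lambda_theta)) * y \
        + (lambda_theta / (lambda_y + lambda_theta)) * mu_theta
print(lambda_post, mu_post)  # 4.0 3.0
```

With the data three times as precise as the prior, the posterior expectation lies three quarters of the way from the prior expectation towards the datum.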
9. The variational free energy is defined as the functional (function of a function)
ℱ(𝑞(𝜗)) ≔ ∫ 𝑞(𝜗) ln(𝑝(𝑦, 𝜗)/𝑞(𝜗)) 𝑑𝜗
It allocates a real number ℱ(𝑞(𝜗)) to a probability density function 𝑞(𝜗), which is an arbitrary distribution over the unobserved variables and serves as an approximation to the posterior distribution 𝑝(𝜗|𝑦) in a generative model 𝑝(𝑦, 𝜗), i.e. the joint distribution of observed random variables 𝑦 and unobserved random variables 𝜗.
10. Due to the decomposition of the log marginal likelihood ln 𝑝(𝑦) = ℱ(𝑞(𝜗)) + 𝒦ℒ(𝑞(𝜗)||𝑝(𝜗|𝑦)) into a variational free energy term and the Kullback-Leibler divergence between the variational distribution 𝑞(𝜗) and the posterior distribution 𝑝(𝜗|𝑦), maximization of the variational free energy implies (1) minimization of the KL divergence between 𝑞(𝜗) and 𝑝(𝜗|𝑦), rendering 𝑞(𝜗) an approximation to the posterior distribution, and (2) minimization of the difference between the variational free energy and the log marginal likelihood, rendering ℱ(𝑞(𝜗)) an approximation to the log marginal likelihood ln 𝑝(𝑦), also known as the “log model evidence”, an important constituent of Bayesian model comparison.
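The decomposition can be verified numerically for a toy discrete model, with the integral replaced by a sum; all probabilities below are made up:

```python
import math

# Toy generative model: observed y fixed, discrete theta in {0, 1}
p_joint = [0.1, 0.3]                     # p(y, theta=0), p(y, theta=1) (made-up values)
p_y = sum(p_joint)                       # marginal likelihood p(y)
p_post = [pj / p_y for pj in p_joint]    # posterior p(theta | y)

q = [0.5, 0.5]                           # an arbitrary variational distribution

# Variational free energy and KL divergence (sums replace the integral)
F = sum(qi * math.log(pj / qi) for qi, pj in zip(q, p_joint))
KL = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p_post))

# The decomposition: ln p(y) = F + KL
print(math.log(p_y), F + KL)  # the two numbers coincide
```

Because the KL divergence is non-negative, ℱ(𝑞(𝜗)) is a lower bound on ln 𝑝(𝑦) for any choice of 𝑞(𝜗).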
Probability distributions in classical inference
(1) The standard normal distribution
The standard normal distribution is the univariate normal distribution 𝑁(𝑥; 𝜇, 𝜎2) with parameters
𝜇 = 0 and 𝜎2 = 1.
(2) The chi-squared distribution
The distribution of the scalar random variable
𝜉 ≔ ∑_{𝑖=1}^{𝑛} 𝑋𝑖^2, where 𝑝(𝑋𝑖) ≔ 𝑁(𝑋𝑖; 0,1) (𝑖 = 1,… , 𝑛) (1)
i.e. the sum of 𝑛 squared univariate random variables 𝑋𝑖 (𝑖 = 1,… , 𝑛), each distributed according to a
“standard normal distribution” (i.e. a normal distribution with expectation parameter 𝜇 = 0 and variance
parameter 𝜎2 = 1) is called a chi-squared distribution with 𝑛 degrees of freedom and is denoted by
𝜒2(𝜉; 𝑛). A probability density function of the chi-squared distribution 𝜒2(𝜉; 𝑛) is given by
𝑓𝑛: ℝ+ → ℝ+, 𝜉 ↦ 𝑓𝑛(𝜉) ≔ (2^(𝑛/2) 𝛤(𝑛/2))^(−1) 𝜉^(𝑛/2 − 1) exp(−𝜉/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
Probability density functions of the chi-squared distribution for 𝑛 = 1,2,3 degrees of freedom are shown in Figure 1. Note that the expectation of a chi-squared distributed variable with 𝑛 ∈ ℕ degrees of freedom is 𝑛 and its variance is 2𝑛.
Figure 1 Chi-squared distribution probability density functions for 𝑛 = 1,2,3
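Definition (1) and the stated moments can be checked by simulation. The following Python sketch uses 𝑛 = 3 and a made-up number of realizations:

```python
import random
from statistics import fmean, pvariance

random.seed(1)
n = 3            # degrees of freedom
N = 100_000      # number of simulated realizations (made up)

# xi = sum of n squared standard normal random variables, as in definition (1)
xi = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(N)]

mean_xi = fmean(xi)
var_xi = pvariance(xi, mean_xi)
print(mean_xi, var_xi)  # close to n = 3 and 2n = 6
```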
(3) The 𝒕-distribution
Let 𝑋 and 𝑌 be two independent scalar random variables. Let 𝑋 be distributed according to a “standard” normal distribution 𝑁(𝑋; 0,1) with expectation parameter 𝜇 = 0 and variance parameter 𝜎2 = 1, and let 𝑌 be distributed according to a chi-squared distribution with 𝑛 degrees of freedom, 𝜒2(𝑌; 𝑛). Then the distribution of the scalar random variable
𝑇 ≔ 𝑋/√(𝑌/𝑛) (1)
is called 𝑡-distribution with 𝑛 degrees of freedom, denoted as 𝑡(𝑇; 𝑛). A probability density function of 𝑡 is
given by
𝑓𝑛: ℝ → ℝ+, 𝑥 ↦ 𝑓𝑛(𝑥) ≔ (Γ((𝑛 + 1)/2)/(√(𝑛𝜋) Γ(𝑛/2))) ⋅ (1 + 𝑥^2/𝑛)^(−(𝑛+1)/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
For large 𝑛, 𝑡(𝑇; 𝑛) asymptotically approaches the standard normal distribution 𝑁(𝑇; 0,1). For 𝑛 = 1, the expectation is not defined; for 𝑛 = 2,3, …, the expectation is given by 0. For 𝑛 = 1,2, the variance is not defined, and for 𝑛 = 3,4,… the variance is given by 𝑛/(𝑛 − 2). Figure 1 depicts the probability density function of 𝑡(𝑇; 𝑛) for 𝑛 = 2,… ,5.
Figure 1 The 𝑡-distribution with varying degrees of freedom
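The construction in (1) and the variance formula 𝑛/(𝑛 − 2) can be checked by simulation; degrees of freedom and sample size below are illustrative:

```python
import math
import random
from statistics import fmean, pvariance

random.seed(2)
n = 5            # degrees of freedom
N = 100_000      # number of simulated realizations (made up)

def t_draw():
    x = random.gauss(0.0, 1.0)                              # X ~ N(0, 1)
    y = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))  # Y ~ chi-squared(n)
    return x / math.sqrt(y / n)                             # T = X / sqrt(Y/n)

T = [t_draw() for _ in range(N)]
mean_T = fmean(T)
var_T = pvariance(T, mean_T)
print(mean_T, var_T)  # close to 0 and n/(n-2) = 5/3
```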
(4) The 𝒇-distribution
The distribution of a random variable 𝑋, where 𝑋 is given by
𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛) with 𝑝(𝑌1) = 𝜒2(𝑌1;𝑚) and 𝑝(𝑌2) = 𝜒2(𝑌2; 𝑛) (1)
is called 𝑓-distribution with (𝑚, 𝑛) degrees of freedom and is denoted as 𝑓(𝑋;𝑚, 𝑛). A probability density
function for the 𝑓 -distribution is given by
𝑔𝑚,𝑛: ℝ+\{0} → ℝ+, 𝑥 ↦ 𝑔𝑚,𝑛(𝑥) ≔ (Γ((𝑚 + 𝑛)/2)/(Γ(𝑚/2)Γ(𝑛/2))) 𝑚^(𝑚/2) 𝑛^(𝑛/2) ⋅ 𝑥^((𝑚−2)/2)/(𝑚𝑥 + 𝑛)^((𝑚+𝑛)/2) (2)
where 𝛤 denotes the Gamma function
𝛤: ℝ+ → ℝ+, 𝑥 ↦ 𝛤(𝑥) ≔ ∫_0^∞ exp(−𝑡) 𝑡^(𝑥−1) 𝑑𝑡 (3)
Some probability density functions of 𝑓-distributions with varying degrees of freedom are visualized in Figure 1.
Figure 1 𝑓-distribution probability density functions for varying degrees of freedom
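Definition (1) can likewise be simulated. The expectation used in the check below, 𝑛/(𝑛 − 2) for 𝑛 > 2, is a standard fact about the 𝑓-distribution that is not stated in the text above; the degrees of freedom are made up:

```python
import random
from statistics import fmean

random.seed(3)
m, n = 4, 10     # numerator and denominator degrees of freedom (illustrative)
N = 100_000

def chi2_draw(k):
    # Sum of k squared standard normal random variables
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# X = (Y1/m) / (Y2/n) with Y1 ~ chi^2(m) and Y2 ~ chi^2(n)
X = [(chi2_draw(m) / m) / (chi2_draw(n) / n) for _ in range(N)]

mean_X = fmean(X)
print(mean_X)  # close to n/(n-2) = 1.25
```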
Probability distributions in Bayesian inference
(1) The gamma distribution
The gamma distribution is a useful distribution to describe uncertainty about univariate random
variables with strictly positive domain, such as the precision parameter of a Gaussian distribution. A
probability density function for the gamma distribution is given, in its “shape and scale” parameterization, by
𝐺: ℝ+ → ℝ+, 𝜆 ↦ 𝐺(𝜆; 𝑎, 𝑏) ≔ (1/Γ(𝑎)) (1/𝑏^𝑎) 𝜆^(𝑎−1) exp(−𝜆/𝑏) for 𝑎, 𝑏 > 0 (1)
where 𝑎 is referred to as the “shape parameter” and 𝑏 is referred to as the “scale parameter”. Γ(𝑥) denotes
the gamma function which is defined for 𝑥 > 0 as:
Γ: ℝ+\{0} → ℝ, 𝑥 ↦ Γ(𝑥) ≔ ∫_0^∞ 𝑡^(𝑥−1) 𝑒^(−𝑡) 𝑑𝑡 (2)
The expectation and variance of 𝜆 under 𝐺(𝜆; 𝑎, 𝑏) are expressed in terms of the parameters as
𝐸(𝜆) = 𝑎𝑏 and 𝑉(𝜆) = 𝑎𝑏2 (3)
Figure 1 below depicts the gamma distribution for a range of shape and scale parameters.
Figure 1. Gamma distribution probability density functions for varying shape and scale parameters. If the shape parameter 𝑎 is smaller than or equal to 1, the mode of the density function is at 0; otherwise it is larger than 0. The scale parameter changes the scale of the distribution, i.e. an increase in 𝑏 stretches the distribution, shifting probability mass away from zero.
In addition to the shape and scale parameterization, the gamma distribution is also often
characterized by probability density functions using the “shape and rate” parameterization. For a positive
random variable 𝜆, these take the form
𝐺: ℝ+ → ℝ+, 𝜆 ↦ 𝐺(𝜆; 𝛼, 𝛽) ≔ (𝛽^𝛼/𝛤(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) for 𝛼, 𝛽 > 0 (4)
where 𝛼 is referred to as the “shape” parameter and 𝛽 is referred to as the “rate” parameter. In terms of
these parameters, the expectation and variance of 𝜆 under 𝐺(𝜆; 𝛼, 𝛽) are given by
𝐸(𝜆) = 𝛼/𝛽 and 𝑉(𝜆) = 𝛼/𝛽^2 (5)
Figure 2 below depicts the gamma distribution for a range of shape and rate parameters.
Figure 2. Gamma distribution probability density functions for varying shape and rate parameters. If the shape parameter 𝛼 is smaller than or equal to 1, the mode of the density function is at 0; otherwise it is larger than 0. The rate parameter also changes the scale of the distribution: an increase in 𝛽 pushes the mass of the distribution towards zero and upwards. Note that the case 𝛼 = 1 corresponds to exponential distribution probability density functions and the case 𝛼 = 2 to Erlang distribution probability density functions.
The shape and rate parameterization of the gamma distribution is closely related to the exponential
distribution, which is defined as
𝐸𝑥𝑝(𝜆; 𝛽) ≔ 𝐺(𝜆; 1, 𝛽) (6)
i.e., the gamma distribution with shape parameter 𝛼 ≔ 1, whose mode is zero. The exponential distribution is the governing distribution of the inter-event times of a Poisson process, which is itself in turn defined by the rate parameter 𝛽.
The gamma distribution has close ties with at least three other distributions. Firstly, if the shape
parameter is set to 𝛼 ≔ 2 it corresponds to the one-parameter Erlang distribution
𝐸𝑟𝑙𝑎𝑛𝑔(𝜆; 𝛽) = 𝐺(𝜆; 2, 𝛽) (7)
Notably, the mode of the Erlang distribution is larger than zero. The Erlang distribution is used in
probabilistic models of queuing processes, where it takes the role of modelling inter-event times like the
exponential distribution for the Poisson process. Secondly, the chi-squared distribution, introduced above as
the distribution of the sum of squared univariate Gaussian random variables, is a special case of the gamma
distribution. Specifically, the probability density function of a chi-squared distribution with 𝑛 degrees of
freedom corresponds to a shape and rate parameterized Gamma distribution with 𝛼 ≔ 𝑛/2 and 𝛽 ≔ 1/2:
𝜒2(𝜉; 𝑛) = 𝐺(𝜉; 𝑛/2, 1/2) (8)
Finally, it is closely related to the inverse gamma distribution and is generalized to a multivariate distribution
in the form of the Wishart distribution.
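The equivalence of the two parameterizations (a rate 𝛽 corresponds to a scale 𝑏 = 1/𝛽) and the chi-squared relation (8) can be checked by evaluating the density functions directly; the evaluation points below are arbitrary:

```python
import math

def gamma_scale_pdf(lam, a, b):
    # Shape-and-scale parameterization, equation (1)
    return (1.0 / math.gamma(a)) * (1.0 / b ** a) * lam ** (a - 1) * math.exp(-lam / b)

def gamma_rate_pdf(lam, alpha, beta):
    # Shape-and-rate parameterization, equation (4)
    return (beta ** alpha / math.gamma(alpha)) * lam ** (alpha - 1) * math.exp(-beta * lam)

def chi2_pdf(xi, n):
    # Chi-squared density, equation (2) of the previous section
    return (2 ** (n / 2) * math.gamma(n / 2)) ** -1 * xi ** (n / 2 - 1) * math.exp(-xi / 2)

lam, a, b = 1.5, 2.0, 0.5
same = abs(gamma_scale_pdf(lam, a, b) - gamma_rate_pdf(lam, a, 1 / b)) < 1e-12
chi2_as_gamma = abs(chi2_pdf(lam, 3) - gamma_rate_pdf(lam, 3 / 2, 1 / 2)) < 1e-12
print(same, chi2_as_gamma)  # True True
```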
(2) The inverse Gamma distribution
The Gamma distribution is traditionally used to model uncertainty over the precision parameter
𝜆 ≔ (𝜎2)−1 of a Gaussian distribution. A more direct way is to formulate uncertainty directly over the
variance parameter 𝜎2. To this end, one can show that if a random variable 𝜆 > 0 is distributed according to
a Gamma distribution, then its inverse 𝜎2 = 𝜆−1 is distributed according to an “inverse gamma distribution”
for which a probability density function is given by
𝐼𝐺: ℝ+ → ℝ+, 𝜎2 ↦ 𝐼𝐺(𝜎2; 𝑎, 𝑏) ≔ (𝑏^𝑎/𝛤(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−𝑏/𝜎2) (1)
The expectation and variance of the inverse Gamma distribution in terms of its “shape parameter” 𝑎 and its
“scale parameter” 𝑏 are given by
𝐸(𝜎2) = 𝑏/(𝑎 − 1) and 𝑉(𝜎2) = 𝑏^2/((𝑎 − 1)^2(𝑎 − 2)) (2)
Note that the expectation only exists if 𝑎 > 1, and the variance only exists if 𝑎 > 2. Figure 1 depicts gamma distribution probability densities for a range of parameter settings (left panel) and their corresponding inverse gamma distribution probability densities (right panel).
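The transformation and expectation formula (2) can be checked by sampling 𝜆 from a gamma distribution and inspecting 1/𝜆; parameter values and sample size below are illustrative:

```python
import random
from statistics import fmean

random.seed(4)
a, beta = 3.0, 2.0   # gamma shape and rate (made-up values)
N = 100_000

# random.gammavariate takes (shape, scale); a rate of beta corresponds to a scale of 1/beta
sigma2 = [1.0 / random.gammavariate(a, 1.0 / beta) for _ in range(N)]

# If lambda ~ G(a, rate beta), then sigma^2 = 1/lambda is inverse-gamma with
# shape a and scale b = beta, so E(sigma^2) = b/(a-1) = 2/2 = 1 per equation (2)
mean_sigma2 = fmean(sigma2)
print(mean_sigma2)  # close to 1
```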
Figure 1. Gamma (left panel) and inverse gamma (right panel) distribution probability density functions for varying parameters. Note that if 𝛼 = 𝑎 = 1, the mode of the distribution of 𝜆 is zero, while the mode of the distribution of 𝜎2 = 1/𝜆 is larger than zero.
(3) The Wishart distribution
The Wishart distribution is the multivariate generalization of the Gamma distribution. Whereas the Gamma distribution is defined for random variables taking on only positive values, the Wishart distribution is defined for positive-definite matrices. Just as the Gamma distribution is classically used to describe uncertainty over the
precision parameter 𝜆 ≔ (𝜎2)−1 of (univariate or multivariate) Gaussian probability density functions, the
Wishart distribution is used to model uncertainty about precision matrices Λ = Σ−1 of multivariate Gaussian
probability density functions with covariance matrix parameter Σ. A probability density function of the
Wishart distribution is given by
𝑊: ℝ𝑝.𝑑.𝑛×𝑛 → ℝ+, Λ ↦ 𝑊(Λ; 𝑆, 𝜈) ≔ (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) |Λ|^((𝜈−𝑛−1)/2) exp(−(1/2) tr(Λ𝑆⁻¹)) (1)
In (1) we denote by ℝ𝑝.𝑑.𝑛×𝑛 the set of positive-definite matrices of size 𝑛 × 𝑛, by 𝑆 ∈ ℝ𝑛×𝑛 the “scale matrix”
parameter of the Wishart distribution, and by 𝜈 ∈ ℝ the “degrees of freedom parameter” of the Wishart
distribution. Finally, Γ𝑛 denotes the “multivariate gamma function”, defined in terms of the Gamma function
Γ as
Γ𝑛: ℝ+\{0} → ℝ, 𝑥 ↦ Γ𝑛(𝑥) ≔ 𝜋^(𝑛(𝑛−1)/4) ∏_{𝑖=1}^{𝑛} Γ(𝑥 + (1 − 𝑖)/2) (2)
Note that the multivariate Gamma function is referred to as “multivariate”, because it is useful in
multivariate statistics, not because it is defined on a multivariate domain.
The Wishart distribution is only defined for 𝜈 > 𝑛 − 1, as otherwise the normalization constant (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) of its probability density function does not exist. The expectation of a Wishart
distribution with scale matrix 𝑆 and degrees of freedom 𝜈 is given by
𝐸(Λ) = 𝜈𝑆 (3)
For 𝑛 = 1, 𝜆 ≔ Λ ∈ ℝ𝑝.𝑑.1×1, scale parameter 𝑠⁻¹ and degrees of freedom parameter 𝜈, the Wishart distribution corresponds to the Gamma distribution with shape parameter 𝛼 = 𝜈/2 and rate parameter 𝛽 = 𝑠/2; thus the following equivalence of probability density functions holds
𝑊(𝜆; 𝑠⁻¹, 𝜈) = 𝐺(𝜆; 𝜈/2, 𝑠/2) (4)
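Expectation formula (3) can be checked by simulation. For integer 𝜈, a Wishart matrix with scale 𝑆 and 𝜈 degrees of freedom can be constructed as the sum of 𝜈 outer products of 𝑁(0, 𝑆) vectors (a standard fact assumed here, not stated in the text); all numerical values are made up:

```python
import math
import random

random.seed(5)
S = [[2.0, 0.5], [0.5, 1.0]]   # illustrative 2x2 positive-definite scale matrix
nu = 5                          # degrees of freedom (integer, for the construction below)
R = 20_000                      # number of simulated Wishart draws (made up)

# Cholesky factor L of S (S = L L^T), computed by hand for the 2x2 case
l11 = math.sqrt(S[0][0])
l21 = S[1][0] / l11
l22 = math.sqrt(S[1][1] - l21 ** 2)

mean_lambda = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(R):
    lam = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(nu):
        z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        x1, x2 = l11 * z1, l21 * z1 + l22 * z2   # x ~ N(0, S)
        lam[0][0] += x1 * x1; lam[0][1] += x1 * x2
        lam[1][0] += x2 * x1; lam[1][1] += x2 * x2
    for i in range(2):
        for j in range(2):
            mean_lambda[i][j] += lam[i][j] / R

print(mean_lambda)  # close to nu * S = [[10, 2.5], [2.5, 5]]
```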
(4) The inverse Wishart distribution
The inverse Wishart distribution is useful to describe uncertainty directly about covariance matrix
parameters instead of precision matrix parameters in Gaussian models. Based on the definition of the
Wishart distribution 𝑊(Λ; 𝑆, 𝜈) over a positive-definite matrix Λ ∈ ℝ𝑝.𝑑.𝑛×𝑛, it can be shown that the inverse Σ ≔ Λ⁻¹ of Λ is distributed according to the following probability density function
𝐼𝑊: ℝ𝑝.𝑑.𝑛×𝑛 → ℝ+, Σ ↦ 𝐼𝑊(Σ; 𝑆, 𝜈) ≔ (2^(𝜈𝑛/2) Γ𝑛(𝜈/2) |𝑆|^(𝜈/2))^(−1) |Σ|^(−(𝜈+𝑛+1)/2) exp(−(1/2) tr(𝑆⁻¹Σ⁻¹)) (1)
The expectation of the inverse Wishart distribution over Σ is given in terms of its parameters by
𝐸(Σ) = (𝜈 − 𝑛 − 1)−1𝑆−1 (2)
Just as the Wishart distribution reduces to the Gamma distribution in the case 𝑛 = 1, the inverse Wishart distribution reduces to the inverse Gamma distribution for 𝑛 = 1 and 𝜎2 ∈ ℝ𝑝.𝑑.1×1, i.e.
𝐼𝑊(𝜎2; 𝑠⁻¹, 𝜈) = 𝐼𝐺(𝜎2; 𝜈/2, 𝑠/2) (3)
(5) The normal-gamma distribution and the normal-inverse gamma distribution
The normal-gamma distribution is the conjugate prior distribution for inference about Gaussian
models with unknown (univariate) precision parameter. It is a joint distribution over an 𝑛-dimensional
random vector 𝑥 and a univariate random variable 𝜆 that factorizes as follows
𝑝(𝑥, 𝜆) = 𝑝(𝑥|𝜆)𝑝(𝜆) (1)
for which the marginal distribution over 𝜆 is given by a gamma distribution in its shape and rate
parameterization
𝑝(𝜆) ≔ 𝐺(𝜆; 𝛼, 𝛽) ≔ (𝛽^𝛼/Γ(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) (2)
and the conditional distribution is given by
𝑝(𝑥|𝜆) ≔ 𝑁(𝑥; 𝜇, (𝜏𝜆)⁻¹𝐼𝑛) ≔ (2𝜋)^(−𝑛/2) (𝜏𝜆)^(𝑛/2) exp(−(𝜏𝜆/2) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (3)
With (1), a probability density function for the normal-gamma distribution is thus given by
𝑁𝐺(𝑥, 𝜆; 𝜇, 𝜏, 𝛼, 𝛽) ≔ (𝛽^𝛼/Γ(𝛼)) 𝜆^(𝛼−1) exp(−𝛽𝜆) ⋅ (2𝜋)^(−𝑛/2) (𝜏𝜆)^(𝑛/2) exp(−(𝜏𝜆/2) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (4)
Notably, the marginal distribution of 𝑥 under a normal-gamma distribution 𝑝(𝑥, 𝜆) is given by a
multivariate non-central t-distribution, and the marginal distribution of 𝜆 under a normal-gamma
distribution 𝑝(𝑥, 𝜆) is given by a gamma distribution. Figure 1 depicts the normal-gamma distribution for a
univariate variable 𝑥 ∈ ℝ and a number of parameter settings.
Figure 1. Normal-gamma distribution probability density functions for 𝑥 ∈ ℝ as a function of the expectation parameter 𝜇 ∈ ℝ (panel rows) and the shape parameter 𝛼 > 0 (panel columns). The white dots depict the expectations of the respective distributions.
In analogy to the normal-gamma distribution, the normal-inverse gamma distribution is obtained by assuming an inverse gamma distribution over the variance parameter of the Gaussian conditional distribution. In other words, the normal-inverse gamma distribution is a joint distribution over an 𝑛-dimensional random vector 𝑥 and a univariate positive random variable 𝜎2 that factorizes according to
𝑝(𝑥, 𝜎2) = 𝑝(𝑥|𝜎2)𝑝(𝜎2) (5)
where
𝑝(𝜎2) ≔ 𝐼𝐺(𝜎2; 𝑎, 𝑏) = (𝑏^𝑎/𝛤(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−𝑏/𝜎2) (6)
and
𝑝(𝑥|𝜎2) ≔ 𝑁(𝑥; 𝜇, 𝜎2𝐼𝑛) ≔ (2𝜋)^(−𝑛/2) (𝜎2)^(−𝑛/2) exp(−(1/(2𝜎2)) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇)) (7)
The normal-inverse gamma distribution probability density function is thus given as
𝑁𝐼𝐺(𝑥, 𝜎2; 𝜇, 𝑎, 𝑏) = (2𝜋)^(−𝑛/2) (𝜎2)^(−𝑛/2) (𝑏^𝑎/Γ(𝑎)) (𝜎2)^(−(𝑎+1)) exp(−(1/(2𝜎2)) (𝑥 − 𝜇)𝑇(𝑥 − 𝜇) − 𝑏/𝜎2) (8)
Figure 2 depicts the normal-inverse gamma distribution for a univariate variable 𝑥 ∈ ℝ and a number of parameter settings.
Figure 2. Normal-inverse gamma distribution probability density functions for 𝑥 ∈ ℝ as a function of the expectation parameter 𝜇 ∈ ℝ (panel rows) and the scale parameter 𝑏 (panel columns). The white dots depict the expectations of the respective distributions.
(6) The univariate non-central 𝒕 -distribution
The univariate non-central t-distribution is a generalization of the t-distribution. A probability density
function for the non-central t-distribution is given by
𝑡: ℝ → ℝ+, 𝑥 ↦ 𝑡(𝑥; 𝜇, 𝜎2, 𝜈) ≔ (Γ((𝜈 + 1)/2)/Γ(𝜈/2)) ⋅ (1/√(𝜈𝜋𝜎2)) ⋅ (1 + (1/𝜈)((𝑥 − 𝜇)/𝜎)^2)^(−(𝜈+1)/2) (1)
where Γ denotes the Gamma function. The non-central t-distribution has three parameters: the expectation
parameter 𝜇 ∈ ℝ, the scale parameter 𝜎2 > 0, and the degrees of freedom parameter 𝜈 > 0.
The expectation of the univariate non-central t-distribution for 𝜈 > 1 is given by
𝐸(𝑥) = 𝜇 (2)
For 𝜈 = 1, the univariate non-central t-distribution is referred to as the “Cauchy distribution” and has the peculiar property that the expectation does not exist, because the respective integral diverges. The variance of the univariate non-central t-distribution for 𝜈 > 2 is given by
𝑉(𝑥) = (𝜈/(𝜈 − 2)) 𝜎2 (3)
Figure 1 depicts the univariate non-central t-distribution for two choices of 𝜇 and a number of choices of 𝜎2
and 𝜈.
Figure 1. Examples of univariate non-central t-distribution probability densities
(7) The multivariate non-central 𝒕-distribution
The multivariate non-central t-distribution is the multivariate generalization of the univariate non-
central t-distribution. Let 𝑥 ∈ ℝ𝑛 denote a random vector. A probability density function of the multivariate
noncentral t-distribution is given by
𝑡: ℝ𝑛 → ℝ+, 𝑥 ↦ 𝑡(𝑥; 𝜇, Σ, 𝜈) ≔ (Γ((𝜈 + 𝑛)/2)/Γ(𝜈/2)) ⋅ (1/((𝜋𝜈)^(𝑛/2)|Σ|^(1/2))) ⋅ (1 + (1/𝜈)(𝑥 − 𝜇)𝑇Σ⁻¹(𝑥 − 𝜇))^(−(𝜈+𝑛)/2) (1)
where Γ denotes the Gamma function. The multivariate non-central t-distribution has three parameters: the expectation parameter 𝜇 ∈ ℝ𝑛, the scale matrix parameter Σ ∈ ℝ𝑝.𝑑.𝑛×𝑛, and the degrees of freedom parameter 𝜈 > 0. In terms of its parameters, the expectation of the multivariate non-central t-distribution for 𝜈 > 1 is given
by
𝐸(𝑥) = 𝜇 (2)
and its covariance for 𝜈 > 2 is given by
𝐶(𝑥) = (𝜈/(𝜈 − 2)) Σ (3)
Figure 1 below depicts examples of the multivariate non-central t-distribution for the special case 𝑛 = 2.
Figure 1. Examples of multivariate non-central t-distribution probability densities. The expectation and degrees of freedom parameters of both probability densities shown are 𝜇 = (1,1)𝑇 and 𝜈 = 2, respectively. The shape matrix of the left probability density is Σ1 ≔ (1 0.5; 0.5 1), while the shape matrix of the right probability density is Σ2 ≔ (2 0; 0 2).
Basic Theory of the General Linear Model
Structural and probabilistic aspects
(1) Experimental design
The aim of the following section is to briefly review some of the key terms in experimental design.
Experiment
A scientific experiment can be defined as the controlled test of a “hypothesis” or “theory”.
Experiments manipulate some aspect of the world and then measure the outcome of that manipulation. In
cognitive neuroimaging experiments, scientists often manipulate some aspect of a stimulus, e.g. showing a
face or an object visually, or manipulating whether a word is easy or difficult to remember, and then
measure the observer's behavior and/or brain activity using FMRI or M/EEG.
Experimental Design
Experimental design refers to the organization of an experiment to allow for the effective
investigation of the research hypothesis. All well-designed experiments share several characteristics: they
test specific hypotheses, rule out alternative explanations for the data, and minimize the cost of running the
experiment.
Experimental variables
An experimental variable can be defined as a measured or manipulated quantity that varies within
an experiment. Two classes of experimental variables are central: independent and dependent variables.
Independent experimental variables are aspects of the experimental design that are intentionally
manipulated by the experimenter and that are hypothesized to cause changes in the dependent variables.
Independent variables in cognitive neuroscience experiments include for example different forms of sensory
stimulation, different cognitive contexts, or different motor tasks. The different values of an independent
variable are often referred to as “conditions” or “levels”. Usually, independent variables are explicitly
controlled. Mathematically, they are thus not represented by random variables, but rather by known
constants.
Dependent experimental variables are quantities that are measured by the experimenter in order to
evaluate the effect of the independent variables. Examples for dependent variables in cognitive
neuroscientific experiments are the response accuracy and reaction time on a given psychophysical task, the
BOLD signal at a given voxel in an FMRI experiment, or the frequency composition at a specific channel in an
M/EEG experiment. Mathematically, dependent experimental variables are usually modeled by random
variables.
Categorical and continuous variables
In principle, both independent and dependent variables can be either categorical or continuous. A
categorical variable is one that can take one of several discrete values, for example sensory stimulation vs.
no sensory stimulation, or different stimulus categories, e.g. faces and houses. Such categorical variables are
also often referred to as “factors”, which take on different “levels”. Mathematically, categorical variables are
usually represented as elements of the natural numbers or signed integers. A continuous variable is one that
can take on any value within a pre-specified range. Examples of continuous variables include
different levels of contrast of a visual stimulus, as well as most observed signals in noninvasive neuroimaging
such as the BOLD signal or the electrical potential in EEG. Mathematically, continuous variables are usually
elements of the real numbers. One defining feature of the GLM is that it accommodates both scenarios of
categorical and continuous independent variables, while the dependent variable is usually continuous.
Between-subjects and within-subject (repeated measures) designs
Experimental designs can further be classified according to whether the independent variable
treatments are applied to the same group of participants or to different groups of participants. In a between-subjects manipulation, different subject groups reflect different values of the independent variable. More
common in cognitive neuroscience are within-subject designs where each subject participates in all
experimental conditions. These designs are commonly referred to as “repeated-measures designs”.
After the introductory remarks of the previous section we now begin with the introduction of the
general linear model (GLM). In brief, the GLM is a unifying perspective on many parametric statistical
methods such as simple and multiple linear regression, one- and two-sample t-tests, and the many variants of the analyses of variance and covariance. Typically, in undergraduate statistics courses, these methods are introduced one after the other. Here we take a different route: in the Section “Theory of the GLM”, we will discuss the generalization of these methods in the form of the GLM, and only after this has been achieved in some depth will we re-introduce the aforementioned special cases of the GLM in the subsequent Section “Applications of the GLM”.
(2) A verbose introduction
The GLM is often written as
𝑋𝛽 + 휀 = 𝑦 (1)
where 𝑋 denotes a “design matrix”, 𝛽 denotes a set of “parameters”, 휀 denotes a probabilistic Gaussian
distributed “error” term, and 𝑦 denotes “the data”. The aim of the current section is to introduce basic
aspects of equation (1). To this end, we will ignore the error term 휀 and focus on the product 𝑋𝛽 and the
data 𝑦. The importance of the error term 휀 will be studied in the subsequent Section.
From elementary statistics we are familiar with the concept of an independent variable, which we
will denote as “𝑥” for the moment, and the concept of a dependent variable, denoted as “𝑦” for the
moment. The idea is that variable 𝑥 is under control of the experimenter, while the variable 𝑦 is the
phenomenon that is being observed. 𝑦 is not under direct control of the experimenter, but in some way
related to 𝑥. In real world experiments, both 𝑥 and 𝑦 come about in a number of ways. For example 𝑥 could
represent a set of qualitatively different groupings, or treatments, as for example pharmacological or
behavioural treatments of depression. In other contexts, the independent variable may be quantitative, for
example representing the dosage of a drug given, or the duration that a visual stimulus is presented to a
human observer. Likewise, the variable 𝑦 can take different forms. For example, 𝑦 could be the number of
rats that die during a poison experiment, the time it takes a neuropsychological patient to identify an object,
or an MR contrast value observed at specific brain location at a specific time.
In any case, there will be two sets of quantities for which we postulate some kind of relationship. For
reasons of simplicity and flexibility (and because many smooth functional relationships are at least locally approximately linear), it is common to represent such relationships in terms of linear models. In verbose terms, a (noise-free) linear
model states that “an observed value of the dependent variable 𝑦 is equal to a weighted sum of values
associated with one or more independent variables 𝑥.” In order to render the last statement more precisely,
we will have to introduce some formal notation:
Let 𝑦𝑖 denote one observation of the dependent variable 𝑦, where 𝑖 = 1,… , 𝑛. Likewise let
𝑥𝑖𝑗 , 𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑝 denote the values of a number of independent variables that are supposed to be
associated with the observation 𝑦𝑖. Here, 𝑝 represents the number of “predictors” or independent variables.
The statement that the value 𝑦𝑖 equals the weighted sum of the values of the independent variables
𝑥𝑖𝑗 associated with this observation can then be written as
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 + 𝑥𝑖3𝛽3 +⋯+ 𝑥𝑖𝑝𝛽𝑝 = 𝑦𝑖 (1)
The 𝛽𝑗 (“beta 𝑗”) values are multiplicative coefficients that quantify the contribution of the
independent variable 𝑥𝑖𝑗 to the observed effect 𝑦𝑖. All variables in the expression above can be thought of as
real scalar numbers. A numerical example of the expression above, for the seventh observation 𝑦7 in a set of observations with 𝑝 = 4 independent variables (and, correspondingly, 𝑝 = 4 beta parameters), would be
𝑥71𝛽1 + 𝑥72𝛽2 + 𝑥73𝛽3 + 𝑥74𝛽4 = 𝑦7 (2)
A numerical instance of (2) could be
16 ∙ 0.25 + 1 ∙ 2 + 3 ∙ 0.5 + 2.5 ∙ 1 = 10 (3)
Here, the values of the independent variables are
𝑥71 = 16, 𝑥72 = 1, 𝑥73 = 3, 𝑥74 = 2.5 (4)
the beta parameter values are
𝛽1 = 0.25, 𝛽2 = 2, 𝛽3 = 0.5, 𝛽4 = 1 (5)
and the observed value of the dependent variable is
𝑦7 = 10 (6)
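The arithmetic of this example can be reproduced directly as a weighted sum:

```python
# Values of the independent variables for observation 7, and the beta parameters
x7 = [16, 1, 3, 2.5]
beta = [0.25, 2, 0.5, 1]

# Weighted sum x71*b1 + x72*b2 + x73*b3 + x74*b4
y7 = sum(x * b for x, b in zip(x7, beta))
print(y7)  # 10.0
```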
It is very important to be always clear about which variables are known at what point in an
experiment. The independent variables 𝑥𝑖𝑗 are specified by the experimenter, thus they are known as soon
as the experimenter has decided how to set up the experiment. The observation values 𝑦𝑖 are known as
soon as the experimenter has collected data points in response to the fixed set of independent variables
𝑥𝑖1, … , 𝑥𝑖𝑝. What the experimenter does not know in advance (but might have a hypothesis about, or some
interest in), is how much each of the variables 𝑥𝑖1, … , 𝑥𝑖𝑝 contributes to the sum. That is, the weighting
coefficients 𝛽1, … , 𝛽𝑝 are not known in advance and can only be determined by repeatedly observing the
dependent variable in the presence of the fixed set of independent variables.
In terms of mathematical modelling, model values that are determined from the data are called
“parameters” or “free parameters”. Usually, the free parameters are adjusted in a way that the
mathematical model is able to predict an observed response in the best possible way. The process of finding
these parameter values is called “model estimation”, “parameter fitting”, or “model fitting”. We will discuss
the details of model estimation and the assumptions behind it for the GLM in more detail in subsequent
sections.
(3) Simple linear regression
Expressions like
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 + 𝑥𝑖3𝛽3 +⋯+ 𝑥𝑖𝑝𝛽𝑝 = 𝑦𝑖 (1)
are usually introduced in undergraduate statistics courses under the label of “multiple linear
regression”. Multiple linear regression is usually introduced as a generalization of “simple linear regression”
to more than two (or one, depending on how simple linear regression is treated) independent variables. The
defining feature of multiple regression as introduced in undergraduate statistics is that the independent
variables represent “continuous variables”. In this section, we will consider simple linear regression as a special case of the multiple linear regression problem, in order to link the intuitions about simple linear regression with the matrix notation of the GLM.
Most likely, the reader will have encountered simple linear regression in the form of the following
equation
𝑦 = 𝑎 + 𝑏𝑥 (1)
where 𝑎 was referred to as the “offset” and 𝑏 as the “slope”, 𝑥 was called the “independent variable” and 𝑦
was called the “dependent variable”.
Let us ponder the meaning of (1) for a bit. This equation says that, if we know the values of 𝑥, 𝑏 and 𝑎 (which are all real scalar numbers), we can compute the value of 𝑦. Let us assume
that we would like to compute the value of 𝑦 for five different values of 𝑥, namely,
𝑥12 = 0.2, 𝑥22 = 1.4, 𝑥32 = 2.3, 𝑥42 = 0.7 and 𝑥52 = 0.5. (2)
The reader may remember from undergraduate statistics that the values of 𝑥 and 𝑦 were allowed to vary,
whereas 𝑎 and 𝑏 were “fixed” or “always the same”. Let us assume that 𝑎 = 0.8 and that 𝑏 = 1.3. We may
thus write the five values of 𝑦 corresponding to the five values of 𝑥 as:
𝑦1 = 𝑎 + 𝑏𝑥12 = 1 ⋅ 0.8 + 1.3 ⋅ 0.2 (3)
𝑦2 = 𝑎 + 𝑏𝑥22 = 1 ⋅ 0.8 + 1.3 ⋅ 1.4
𝑦3 = 𝑎 + 𝑏𝑥32 = 1 ⋅ 0.8 + 1.3 ⋅ 2.3
𝑦4 = 𝑎 + 𝑏𝑥42 = 1 ⋅ 0.8 + 1.3 ⋅ 0.7
𝑦5 = 𝑎 + 𝑏𝑥52 = 1 ⋅ 0.8 + 1.3 ⋅ 0.5
If the reader is familiar with matrix notation and especially matrix multiplication it will be readily apparent
that expression (3) can also be written as
(𝑦1)   (𝑥11 𝑥12)         (1 0.2)         (1 ⋅ 0.8 + 1.3 ⋅ 0.2)
(𝑦2)   (𝑥21 𝑥22)         (1 1.4)         (1 ⋅ 0.8 + 1.3 ⋅ 1.4)
(𝑦3) = (𝑥31 𝑥32) (𝑎)  =  (1 2.3) (0.8) = (1 ⋅ 0.8 + 1.3 ⋅ 2.3)   (4)
(𝑦4)   (𝑥41 𝑥42) (𝑏)     (1 0.7) (1.3)   (1 ⋅ 0.8 + 1.3 ⋅ 0.7)
(𝑦5)   (𝑥51 𝑥52)         (1 0.5)         (1 ⋅ 0.8 + 1.3 ⋅ 0.5)
Note that in (4) we have introduced another “independent variable” 𝑥𝑖1, 𝑖 = 1,… , 𝑛 which takes on the
value 1 for all values 𝑦𝑖 , 𝑖 = 1,… , 𝑛.
What did we gain from rewriting (3) as (4)? Not much yet, but we can express the relatively large expression (3) more compactly in matrix notation, if we additionally define
    (𝑦1)       (𝑥11 𝑥12)
    (𝑦2)       (𝑥21 𝑥22)
𝑦 ≔ (𝑦3), 𝑋 ≔ (𝑥31 𝑥32) and 𝛽 ≔ (𝑎)   (5)
    (𝑦4)       (𝑥41 𝑥42)         (𝑏)
    (𝑦5)       (𝑥51 𝑥52)
The last definition in (5) can be simplified and put into the context of the standard GLM by setting
𝛽1 ≔ 𝑎 and 𝛽2 ≔ 𝑏 (6)
i.e., by setting
𝛽 ≔ (𝛽1, 𝛽2)𝑇 (7)
The reader may take note of the dimensions of 𝑦, 𝑋, and 𝛽: 𝑦 is a 5 × 1 real vector, which we write as 𝑦 ∈ ℝ5; 𝑋 is a 5 × 2 real matrix, written as 𝑋 ∈ ℝ5×2; and 𝛽 is a 2 × 1 real vector, written as 𝛽 ∈ ℝ2. In matrix form, with 𝑦 ∈ ℝ5, 𝑋 ∈ ℝ5×2 and 𝛽 ∈ ℝ2, we can thus write (3) very compactly as
𝑦 = 𝑋𝛽 (8)
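The compact form 𝑦 = 𝑋𝛽 can be reproduced numerically for the five-observation example above (𝑎 = 0.8, 𝑏 = 1.3):

```python
# Design matrix: a column of ones (for the offset) and the independent variable values
X = [[1, 0.2], [1, 1.4], [1, 2.3], [1, 0.7], [1, 0.5]]
beta = [0.8, 1.3]   # (a, b)

# Matrix-vector product y = X beta, computed row by row
y = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
print(y)
```

Each entry of the resulting vector equals 𝑎 + 𝑏𝑥, i.e. the straight-line equation evaluated at the corresponding value of the independent variable.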
To conclude this section, consider equation (1) again. In comparison with equation (8), it is apparent that so far we did not consider the error term 휀. In fact, equation (8) merely describes the “deterministic” or “structural” aspect of the simple linear regression model. This can be formulated in two ways: the simple linear regression model corresponds to the straight-line equation “under the addition of observation noise”. More sensibly, however, the “observation noise” is appreciated as an integral part of the model of interest here, namely the GLM, and thus the GLM is a “probabilistic model” with some “deterministic aspects”. So far, we have considered the deterministic aspects; next, we will consider the probabilistic aspects.
(4) The Gaussian assumption
In the previous Section and the treatment of matrix notation and matrix multiplication in the
mathematical preliminaries we have considered the terms 𝑋𝛽 and 𝑦 in the GLM equation
𝑋𝛽 + 휀 = 𝑦 (1)
in some detail. For the special case of simple linear regression we have seen that 𝑦 corresponds to a vector
of real values with 𝑛 ∈ ℕ entries, where 𝑛 corresponds to the number of data points. We also have seen that
the design matrix 𝑋 ∈ ℝ𝑛×2 has two columns, as there are two parameters 𝛽1 and 𝛽2 in a simple linear regression model (the “offset” and the “slope”), and thus that 𝛽 ≔ (𝛽1, 𝛽2)𝑇 ∈ ℝ2. One fundamental aspect
of the GLM equation (1) is that it generalizes many special cases such as simple linear regression by allowing
for different numbers of parameters 𝑝 ∈ ℕ and different forms of design matrices.
Using matrix notation we can express (1) more precisely by stating the GLM equation as
𝑋𝛽 + 휀 = 𝑦 (2)
where 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, and 𝑦 ∈ ℝ𝑛. It is important to note that the design matrix 𝑋 ∈ ℝ𝑛×𝑝 always has as
many rows as there are data points (𝑛) and as many columns as there are parameters (𝑝). 𝛽 ∈ ℝ𝑝 is usually
referred to as the “parameter vector” and 𝑦 ∈ ℝ𝑛 the “data vector”. In general, we can thus unpack
equation (1) in the following form
(
𝑥11 𝑥12 ⋯ 𝑥1𝑝
𝑥21 𝑥22 ⋯ 𝑥2𝑝
𝑥31 𝑥32 ⋯ 𝑥3𝑝
⋮ ⋮ ⋱ ⋮
𝑥𝑛1 𝑥𝑛2 ⋯ 𝑥𝑛𝑝
)(
𝛽1
𝛽2
⋮
𝛽𝑝
) + 휀 = (
𝑦1
𝑦2
𝑦3
⋮
𝑦𝑛
) (3)
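As a numerical illustration, the matrix product in (3) can be checked against the row-wise sums it abbreviates. The following is a minimal sketch in Python/NumPy (rather than the Matlab environment referenced elsewhere in this text); the matrix entries are illustrative assumptions:

```python
import numpy as np

# Hypothetical small design matrix (n = 3 data points, p = 2 parameters)
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
beta = np.array([0.5, 1.5])

# Matrix product X beta as in equation (3)
Xb = X @ beta

# The i-th entry equals the row-wise weighted sum x_i1*beta_1 + ... + x_ip*beta_p
rowwise = np.array([sum(X[i, j] * beta[j] for j in range(X.shape[1]))
                    for i in range(X.shape[0])])

print(Xb)                        # [3.5 6.5 9.5]
print(np.allclose(Xb, rowwise))  # True
```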
We now consider 휀 in (3). Because 𝑋 ∈ ℝ𝑛×𝑝 and 𝑦 ∈ ℝ𝑛, 휀 must also be an 𝑛-dimensional real vector, i.e.
휀 ∈ ℝ𝑛, and we thus have
(
𝑥11 𝑥12 ⋯ 𝑥1𝑝
𝑥21 𝑥22 ⋯ 𝑥2𝑝
𝑥31 𝑥32 ⋯ 𝑥3𝑝
⋮ ⋮ ⋱ ⋮
𝑥𝑛1 𝑥𝑛2 ⋯ 𝑥𝑛𝑝
)(
𝛽1
𝛽2
⋮
𝛽𝑝
) + (
휀1
휀2
휀3
⋮
휀𝑛
) = (
𝑦1
𝑦2
𝑦3
⋮
𝑦𝑛
) (4)
To introduce the meaning of 휀 in the context of equation (1), we consider the 𝑖th row of (3), which reads
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 + 휀𝑖 = 𝑦𝑖 (5)
The left-hand side of (5) corresponds to the modelling assumption about the 𝑖th “data point” 𝑦𝑖 and
comprises two categorically different objects. The first part
𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (6)
is “deterministic”, by which we understand that if we know the values of the 𝑥𝑖𝑗 (𝑖 = 1,… , 𝑛, 𝑗 = 1,… , 𝑝)
and the values of the 𝛽𝑗 (𝑗 = 1,… , 𝑝), we can uniquely compute the value of (6). The second part, 휀𝑖, is
different: 휀𝑖 is conceived of as a random variable. This means that 휀𝑖 assumes values according to a probability
distribution. We might know some parameters of this probability distribution, such as its mean and
variance, but the exact value that 휀𝑖 takes on does not uniquely follow from this knowledge. Equation
(5) thus implies that the value of 𝑦𝑖 is given by the sum of a “deterministic” term and a “probabilistic” or
“random” term. Informally, consider obtaining samples from the distribution of 휀𝑖 and adding them to a
constant value
𝜇𝑖 ≔ 𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (7)
in the form
𝑦𝑖 = 𝜇𝑖 + 휀𝑖 (8)
In (8), 𝜇𝑖 is constant and 휀𝑖 is a random variable, for which we now imagine drawing samples from a
univariate Gaussian distribution with mean (expectation) parameter 0 and variance parameter 𝜎2 = 1. Most
of the time, these values will be close to zero, but on occasion somewhat positive or somewhat negative.
Consider drawing the samples 휀1 = 0.2, 휀2 = −0.001, 휀3 = 0.05 and consider 𝜇1 = 𝜇2 = 𝜇3 = 1. If we
evaluate (8) for these values, we obtain
𝑦1 = 𝜇1 + 휀1 = 1 + 0.2 = 1.2 (9)
𝑦2 = 𝜇2 + 휀2 = 1 − 0.001 = 0.999
𝑦3 = 𝜇3 + 휀3 = 1 + 0.05 = 1.05
The most important thing to realize about (9) is that despite the fact that each 𝑦𝑖 (𝑖 = 1,2,3) has the same
deterministic aspect 𝜇𝑖 = 1 (𝑖 = 1,2,3), the values that the 𝑦𝑖 take on vary, because something random is
added to the deterministic aspect. Most importantly, this renders the 𝑦𝑖 themselves “random”, or, more
formally, random variables. We can also infer how they are distributed: because the 휀𝑖 have a mean of zero,
the expectation of the 𝑦𝑖 corresponds to the deterministic aspects 𝜇𝑖. The variance of the 𝑦𝑖, on the other
hand, corresponds to the variance of the 휀𝑖.
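The little computation in (9) can be reproduced directly; a minimal Python/NumPy sketch with the same assumed values:

```python
import numpy as np

# The worked example: mu_i = 1 for i = 1, 2, 3 and fixed "noise samples"
mu = np.array([1.0, 1.0, 1.0])
eps = np.array([0.2, -0.001, 0.05])   # imagined draws from N(0, 1)

y = mu + eps                          # equation (8), applied componentwise
print(y)                              # [1.2   0.999 1.05 ]

# Fresh draws from N(0, sigma^2 = 1) give different y values each time,
# although the deterministic aspect mu stays fixed
rng = np.random.default_rng(0)
y_new = mu + rng.normal(loc=0.0, scale=1.0, size=3)
```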
There are two ways to express the above more formally. We can either state that
𝑦𝑖 = 𝜇𝑖 + 휀𝑖 (10)
where the distribution of 휀𝑖 is given by a univariate Gaussian distribution with expectation 0 and variance 𝜎2
𝑝(휀𝑖) = 𝑁(휀𝑖; 0, 𝜎2) (11)
Likewise (and more intuitively) we can simply state that the distribution of 𝑦𝑖 is given by a univariate
Gaussian distribution with expectation 𝜇𝑖 and variance 𝜎2
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖 , 𝜎2) (12)
Recall that
𝜇𝑖 = 𝑥𝑖1𝛽1 + 𝑥𝑖2𝛽2 +⋯+ 𝑥𝑖𝑝𝛽𝑝 (13)
We may denote (13) using matrix multiplication as
𝜇𝑖 = 𝑥𝑖𝛽 (14)
where we defined 𝑥𝑖 ∈ ℝ1×𝑝 as the row vector
𝑥𝑖 ≔ (𝑥𝑖1 𝑥𝑖2 … 𝑥𝑖𝑝) ∈ ℝ1×𝑝 (15)
which, importantly, corresponds to the 𝑖th row of the design matrix 𝑋 ∈ ℝ𝑛×𝑝. 𝛽 ∈ ℝ𝑝 corresponds to the
parameter vector
𝛽 ≔ (𝛽1, 𝛽2, … , 𝛽𝑝)𝑇 (16)
We thus can rewrite (12) as
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖 , 𝜎2) = 𝑁(𝑦𝑖; 𝑥𝑖𝛽, 𝜎2) (17)
Let us summarize what we have achieved so far: starting from the GLM equation
𝑋𝛽 + 휀 = 𝑦 (18)
we have seen that 𝑋𝛽 corresponds to a matrix product, each row 𝑖 (where 𝑖 = 1,… , 𝑛) of which specifies a
deterministic contribution 𝜇𝑖 ∈ ℝ to the corresponding data value 𝑦𝑖. Each entry in the “noise vector” 휀, on
the other hand, contributes a random component 휀𝑖 to the value of 𝑦𝑖. It is important to note that these
ideas are assumptions, or in other words, they amount to the formulation of a probabilistic mathematical
model for some data. They need not correspond to the true data-generating process, which remains unknown. We have also
seen that we can re-express these ideas as a univariate Gaussian distribution 𝑁(𝑦𝑖; 𝑥𝑖𝛽, 𝜎2) that describes
the probability distribution of a single dependent variable 𝑦𝑖. In analogy to equation (17), one may
formulate a probability distribution for the “data vector” 𝑦 ∈ ℝ𝑛 (or in other words, for all dependent
variables 𝑦1, … , 𝑦𝑛) in the form of a multivariate Gaussian distribution
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (19)
However, to understand the implications of (19) one requires some familiarity with multivariate Gaussian
distributions, which the reader is encouraged to review next.
(5) Equivalent formulations of the GLM
In this section, we consider the equivalence
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
from the perspective of the linear transformation theorem for Gaussian distributions.
Recall that the linear transformation theorem for Gaussian distributions states that if
𝑝(𝑥) = 𝑁(𝑥; 𝜇𝑥 , Σ𝑥) (𝑥, 𝜇𝑥 ∈ ℝ𝑑 , Σ𝑥 ∈ ℝ𝑑×𝑑 p.d.) and 𝑝(휀) = 𝑁(휀; 𝜇휀 , Σ휀) (휀, 𝜇휀 ∈ ℝ𝑑 , Σ휀 ∈ ℝ𝑑×𝑑 p.d.) (2)
and 𝐴 ∈ ℝ𝑑×𝑑 is a matrix, then the random variable 𝑦 ≔ 𝐴𝑥 + 휀 is distributed according to
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦 , Σ𝑦), where 𝑦 ∈ ℝ𝑑 , 𝜇𝑦 = 𝐴𝜇𝑥 + 𝜇휀 ∈ ℝ𝑑 and Σ𝑦 = 𝐴Σ𝑥𝐴𝑇 + Σ휀 ∈ ℝ𝑑×𝑑 (3)
Applying this theorem in the current context with 𝑑 ≔ 𝑛, we first note that with 𝐴 ≔ 𝐼𝑛 and 𝑥 ≔ 𝑋𝛽 we have 𝜇𝑥 = 𝑋𝛽
and Σ𝑥 = 0, or in other words, 𝑥 is not a random variable. Further, setting 𝜇휀 = 0 ∈ ℝ𝑛 and Σ휀 = 𝜎2𝐼𝑛, we
have
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦), where 𝑦 ∈ ℝ𝑛, 𝜇𝑦 = 𝐼𝑛𝑋𝛽 + 0 = 𝑋𝛽 ∈ ℝ𝑛 and Σ𝑦 = 𝐼𝑛 ⋅ 0 ⋅ 𝐼𝑛𝑇 + 𝜎2𝐼𝑛 = 𝜎2𝐼𝑛 ∈ ℝ𝑛×𝑛 (4)
and thus
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇒ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (5)
as postulated previously. In words: the Gaussian distribution of the error term 휀 with expectation parameter
0 renders 𝑦 = 𝑋𝛽 + 휀 a random variable which is distributed with expectation parameter 𝑋𝛽 and
covariance matrix parameter corresponding to the error covariance matrix.
(6) Sampling a simple linear regression model
Based on the ability to sample 𝑛-variate Gaussian distributions by means of random number
generators as implemented for example in Matlab, we may now draw a sample from a simple linear
regression model with 𝑛 data points. To this end, we define the expectation parameter 𝜇 ∈ ℝ𝑛 of a 𝑛-variate
Gaussian using the matrix product of the simple linear regression design matrix 𝑋 ∈ ℝ𝑛×2 (recall that this
comprises a column of ones and a column of the independent variable values 𝑥1, … , 𝑥𝑛) and the parameter
vector 𝛽 ∈ ℝ2. To embed the notion of independent samples, we define the 𝑛-variate Gaussian covariance
matrix 𝛴 ∈ ℝ𝑛×𝑛 as the 𝑛 × 𝑛 spherical covariance matrix 𝜎2𝐼𝑛, where 𝜎2 > 0 is the variance parameter of
the ensuing GLM, as discussed in the Section on the multivariate Gaussian. Figure 1 visualizes the result.
Figure 1 A sample of a simple linear regression model, obtained by sampling an 𝑛-variate Gaussian.
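As a sketch of this sampling procedure (in Python/NumPy rather than the Matlab environment mentioned above; the design, parameter, and variance values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10
x = np.linspace(0.0, 1.0, n)           # assumed independent variable values
X = np.column_stack([np.ones(n), x])   # design matrix: column of ones, column of x
beta = np.array([0.5, 2.0])            # assumed offset and slope
sigma2 = 0.25                          # assumed variance parameter

mu = X @ beta                          # expectation parameter of the n-variate Gaussian
Sigma = sigma2 * np.eye(n)             # spherical covariance matrix sigma^2 I_n

# One sample from N(y; X beta, sigma^2 I_n)
y = rng.multivariate_normal(mean=mu, cov=Sigma)
print(y.shape)   # (10,)
```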
Study Questions
1. Consider the GLM equation 𝑋𝛽 + 휀 = 𝑦. Which of the symbols 𝑋, 𝛽, 휀, 𝑦 represents independent experimental variables, which
of the symbols represents dependent experimental variables?
2. Consider the GLM equation 𝑋𝛽 + 휀 = 𝑦. In an experimental context, which of the components are known to the experimenter
before performing the experiment, and which of the components are known to the experimenter after performing the experiment,
before estimating the model?
3. Write the following matrix statement as a set of equations
𝑋𝛽 = 𝑦, where 𝑋 ∈ ℝ4×3, 𝛽 ∈ ℝ3 and 𝑦 ∈ ℝ4
4. Write the following set of equations in matrix product notation
2 ⋅ 3 + 3 ⋅ 4 − 1 ⋅ 2 = 16
1 ⋅ 3 + 3 ⋅ 4 + 2 ⋅ 2 = 19
0 ⋅ 3 + 0 ⋅ 4 + 7 ⋅ 2 = 14
5. The design matrix 𝑋 of a GLM is of dimensionality 𝑛 × 𝑝. What do 𝑛 ∈ ℕ and 𝑝 ∈ ℕ represent, respectively?
6. Name and explain the components and their properties in the GLM equation 𝑋𝛽 + 휀 = 𝑦.
7. In words, explain the equivalence of the statements
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) and 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛)
Study Questions Answers
1. 𝑋, the design matrix, represents values of independent experimental variables, 𝑦, the data vector represents values of the
dependent variable
2. Before the experiment, the design matrix 𝑋 is known to the experimenter, after performing the experiment, the design matrix 𝑋
and the data vector 𝑦 are known to the experimenter.
3. The matrix notation corresponds to the following set of (“one-dimensional”) equations
𝑥11𝛽1 + 𝑥12𝛽2 + 𝑥13𝛽3 = 𝑦1
𝑥21𝛽1 + 𝑥22𝛽2 + 𝑥23𝛽3 = 𝑦2
𝑥31𝛽1 + 𝑥32𝛽2 + 𝑥33𝛽3 = 𝑦3
𝑥41𝛽1 + 𝑥42𝛽2 + 𝑥43𝛽3 = 𝑦4
4. In matrix notation, the equations may be expressed as
(
2 3 −1
1 3 2
0 0 7
)(
3
4
2
) = (
16
19
14
)
5. 𝑛 represents the number of data points, 𝑝 represents the number of parameters.
6. In the GLM equation 𝑋𝛽 + 휀 = 𝑦, 𝑋 ∈ ℝ𝑛×𝑝 is referred to as the “design matrix” and encodes the values of 𝑝 independent variables for each of 𝑛 observations, which are concatenated in the “data vector” 𝑦 ∈ ℝ𝑛. 𝛽 ∈ ℝ𝑝 is the “parameter vector”, which encodes how much each independent variable (corresponding to the columns of 𝑋) contributes to the observed values. 휀 ∈ ℝ𝑛 is an 𝑛-dimensional random vector, which is distributed according to an 𝑛-dimensional Gaussian distribution with expectation parameter 0 ∈ ℝ𝑛 and, under standard assumptions, covariance matrix parameter 𝛴 = 𝜎2𝐼𝑛, where 𝜎2 encodes the variance of each component of 𝑦.
7. The first statement formulates an observed data vector 𝑦 as the sum of a deterministic component 𝑋𝛽 and a random error vector 휀 ∈ ℝ𝑛, which conforms to a Gaussian distribution whose expectation is zero and whose components have variance 𝜎2. Because 𝑦 results from the addition of a “deterministic” and a “random” entity, one may equally consider 𝑦 a random vector distributed according to a multivariate Gaussian with expectation parameter 𝑋𝛽 ∈ ℝ𝑛 and covariance matrix parameter 𝜎2𝐼𝑛, whose components have variance 𝜎2.
Maximum Likelihood Estimation
In the previous Sections, we have introduced the GLM in the following form: Let 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈
ℝ𝑝, 𝑦 ∈ ℝ𝑛 and 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛), where 0 ∈ ℝ𝑛, 𝜎2 > 0 and 𝐼𝑛 ∈ ℝ𝑛×𝑛 is the 𝑛 × 𝑛 identity matrix. Then
the GLM is defined by the equation
𝑋𝛽 + 휀 = 𝑦 (1)
We have seen that the probabilistic nature of the error term 휀 renders this model a multivariate Gaussian
distribution for data 𝑦 ∈ ℝ𝑛, which we may write as
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (2)
Further, we saw that the spherical covariance matrix 𝜎2𝐼𝑛 embeds the assumptions of independent and
homoscedastic (equal variance) noise contributions. In the current Section we now turn to the problem of obtaining
parameter point estimates of 𝛽 ∈ ℝ𝑝 and 𝜎2 > 0 based on the observation of a data set 𝑦 = 𝑦∗. We use the
asterisk to denote that 𝑦∗ ≔ (𝑦1∗, … , 𝑦𝑛∗)𝑇 is a sample of the 𝑛-dimensional random vector 𝑦. However,
because the theory is developed for all possible values 𝑦∗ that 𝑦 may take on, we will rarely emphasize that
it also applies to a specific data set 𝑦∗ and will use the more general notation “𝑦” most of the time. The
point estimates for 𝛽 and 𝜎2 will be denoted by �̂� and �̂�2 (beta hat and sigma square hat).
Classical parameter point estimation unfolds upon the following intuition: we assume that “in
reality” there are fixed parameter values 𝛽 = �̅� and 𝜎2 = �̅�2 that govern (in interaction with the design
matrix) the probability distribution of the data 𝑦. We cannot observe �̅� and �̅�2 directly, but only infer their
values based on the observed data 𝑦. �̅� and �̅�2 denote the “true, but unknown,” parameter values of 𝛽 and
𝜎2 that we assume to give rise to observed data. Note that the true, but unknown, parameter values �̅� and
�̅�2 remain unknown after the evaluation of the point estimates �̂� and �̂�2. �̂� and �̂�2 merely represent our
“best guess” of what the value of the true, but unknown, values �̅� and �̅�2 are based on the observation of
𝑦. The notion of “true, but unknown” parameter values roughly corresponds to the notion of “population
parameters” as familiar from undergraduate statistics. Note however, that the idea of a “population”
conveys some notion of finiteness of the underlying process, which is not required in our case. Because, as
for the data, we develop the theory for all possible values that 𝛽 and 𝜎2 may assume as “true, but unknown”
values, we will mostly implicitly refer to the “the true, but unknown, values of 𝛽 and 𝜎2” and only rarely
explicitly denote them by �̅� and �̅�2.
To develop the classical theory of parameter point estimation we proceed as follows. We first
introduce a very general procedure to obtain parameter estimates in the context of probabilistic models, the
concept of maximum likelihood (ML) estimation, in Section (1). Based on this general development, we will then discuss
a first application of ML estimation to a familiar model, the univariate Gaussian in Section (2). In Section (3),
we study the equivalence between ML and least-squares estimation in the context of the GLM and in Section
(4) explicitly derive the ordinary least-squares beta parameter estimator for the simple linear regression
GLM. In Section (5), we then generalize this estimator for all kinds of specific GLMs. Finally, in Section (6) we
discuss the maximum likelihood estimation of the variance parameter estimator of the GLM.
(1) Maximum likelihood and least-squares beta parameter estimation
In the context of simple or multiple linear regression the reader may have encountered the concept
of “least-squares” estimation. The idea of least-squares estimation is to minimize the squared distance
between the observed data points and the fitted regression line/plane. In the context of the GLM, least-
squares estimation and maximum-likelihood estimation of the effect size parameters 𝛽 are equivalent. In
the current section, we will firstly spell out this equivalence for the “general” GLM case, and secondly derive
the least-squares estimator for the parameters 𝛽 ≔ (𝛽1, 𝛽2)𝑇 of a simple linear regression model. This
derivation will form the basis for the ordinary least-squares estimators of the GLM introduced in the next
Section. In the current section, we assume that the value 𝜎2 > 0 is known, and we only would like to derive
an estimator for the 𝛽 parameter.
Consider the GLM in the form
𝑋𝛽 + 휀 = 𝑦, 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑦 ~ 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix. As discussed in the
previous Section, the assumption of a spherical covariance matrix 𝜎2𝐼𝑛 renders the error terms 휀1, … , 휀𝑛
independently distributed. Note that in contrast to the univariate Gaussian example discussed above, the
data variables are not identically distributed: the expectation parameters given by the 𝑖th row of the matrix
product 𝑋𝛽 may well differ. Here, we will use the notation (𝑋𝛽)𝑖 to denote the 𝑖th row of 𝑋𝛽. From (1), the
distribution of the 𝑖th random variable 𝑦𝑖 is given by
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; (𝑋𝛽)𝑖, 𝜎2) (2)
We may thus write down the likelihood function of the parameter 𝛽 for the observation 𝑦 as
𝐿:ℝ𝑝 × ℝ𝑛 → ℝ+, (𝛽, 𝑦) ↦ 𝐿(𝛽, 𝑦) ≔ 𝑝(𝑦; 𝛽) (3)
Note that in the current scenario the parameter space 𝛩 is given by 𝛩 ≔ ℝ𝑝 and the dimensionality of the
parameter vector 휃 ≔ 𝛽 ∈ ℝ𝑝 is given by 𝑝 ∈ ℕ. Due to the independence assumption, the probability
density function of the observation 𝑦 ∈ ℝ𝑛 may be written as the product of its individual components
𝑝(𝑦1, … , 𝑦𝑛) = 𝑝(𝑦1) ⋅ 𝑝(𝑦2)⋯𝑝(𝑦𝑛) = 𝑁(𝑦1; (𝑋𝛽)1, 𝜎2) ⋅ 𝑁(𝑦2; (𝑋𝛽)2, 𝜎2) ⋅ ⋯ ⋅ 𝑁(𝑦𝑛; (𝑋𝛽)𝑛, 𝜎2) (4)
Substitution of the univariate Gaussian probability density functions
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; (𝑋𝛽)𝑖, 𝜎2) = (1/√(2𝜋𝜎2)) 𝑒𝑥𝑝 (−(1/(2𝜎2))(𝑦𝑖 − (𝑋𝛽)𝑖)2) (5)
then yields the likelihood function
𝐿(𝛽, 𝑦) = 𝑝(𝑦; 𝛽) = (2𝜋𝜎2)−𝑛/2 𝑒𝑥𝑝 (−(1/(2𝜎2)) ∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2) (6)
The necessary algebraic manipulations to get from (5) to (6) are equivalent to those considered in the
previous section and are thus omitted here for brevity. Notably, the likelihood function 𝐿(𝛽, 𝑦) in (6) contains
the sum of squared deviations between the data points 𝑦𝑖 (𝑖 = 1,… , 𝑛) and the model predictions (𝑋𝛽)𝑖 in
the exponential term. Because this “sum-of-squares” ∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2 is non-negative (i.e., zero or
positive), and it enters with a minus sign, the exponential term in (6) becomes maximal if the squared
deviations between model prediction and data become minimal. In other words, the likelihood function
𝐿(𝛽, 𝑦) is maximized if the sum of squared deviations between data and model prediction is minimized,
which is just the least-squares estimation principle. Note that we have assumed a fixed and known variance
parameter 𝜎2 in these considerations. To summarize: whether one uses the technique of maximum
likelihood estimation or the minimization of the squared deviations between data and model prediction, the
resulting estimators for the 𝛽 parameters of a GLM are the same.
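The equivalence can be illustrated numerically: because the log likelihood is an affine, decreasing function of the sum of squares, the candidate 𝛽 minimizing the sum of squares is exactly the candidate maximizing the likelihood. A minimal Python/NumPy sketch under assumed example values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from an assumed simple linear regression model
n = 50
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 3.0])
sigma2 = 0.5  # fixed and known variance parameter, as assumed in the text
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), n)

def sse(beta):
    # sum of squared deviations between data and model prediction
    r = y - X @ beta
    return r @ r

def log_likelihood(beta):
    # log of equation (6); an affine, decreasing function of sse(beta)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - sse(beta) / (2 * sigma2)

# Candidate beta values on a grid around the true values
candidates = [np.array([b1, b2])
              for b1 in np.linspace(0, 2, 21)
              for b2 in np.linspace(2, 4, 21)]
i_min_sse = int(np.argmin([sse(b) for b in candidates]))
i_max_ll = int(np.argmax([log_likelihood(b) for b in candidates]))
print(i_min_sse == i_max_ll)  # True
```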
(2) Least-squares beta parameter estimation for simple linear regression
In this section, we will explicitly derive the “least-squares” (or, as shown in the previous section,
maximum likelihood) 𝛽 parameter estimator for the case of the simple linear regression GLM. This serves as
a preparation for the introduction of the general so-called “ordinary least-squares estimator” �̂� in the
subsequent section. The central result of the current section is that the least-squares estimator for the 𝛽
parameter in a simple linear regression model can be written as
�̂� = (𝑋𝑇𝑋)−1 𝑋𝑇𝑦 (1)
To see this, first consider the simple linear regression GLM
𝑋𝛽 + 휀 = 𝑦, 휀 ~ 𝑁(휀; 0, 𝜎2𝐼𝑛) (2)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×2, 𝛽 ∈ ℝ2, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix. Recall that the design
matrix, parameter vector, and data vector in the special GLM case of simple linear regression take the forms
𝑋 ≔ (
1 𝑥1
1 𝑥2
⋮ ⋮
1 𝑥𝑛
) , 𝛽 ≔ (𝛽1, 𝛽2)𝑇 and 𝑦 ≔ (𝑦1, 𝑦2, … , 𝑦𝑛)𝑇 (3)
where the 𝑥𝑖 (𝑖 = 1,… , 𝑛) carry the notion of “independent variables”, 𝛽1 corresponds to the simple linear
regression “offset” and 𝛽2 to the simple linear regression “slope”. The 𝑖th row of the matrix product 𝑋𝛽 is
given in this case as
(𝑋𝛽)𝑖 = 𝛽1 + 𝑥𝑖𝛽2 (4)
To minimize the sum of squared deviations between model prediction and data
∑𝑖=1𝑛 (𝑦𝑖 − (𝑋𝛽)𝑖)2 (5)
we view this sum as a function of 𝛽1 and 𝛽2 and write
𝑓: ℝ2 → ℝ, (𝛽1, 𝛽2) ↦ 𝑓(𝛽1, 𝛽2) = ∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 (6)
As for a maximum, the necessary condition for a minimum is the vanishing of the partial derivatives of this
function, i.e. at a location of a minimum of 𝑓 (corresponding to a maximum of the log likelihood function),
the gradient of 𝑓
𝛻𝑓(𝛽1, 𝛽2) = (
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽1
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽2
) (7)
equals the zero vector (0,0)𝑇. Evaluation of the respective partial derivatives yields
𝛻𝑓(𝛽1, 𝛽2) = (
2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)
2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖
) (8)
Derivation of (8)
We substitute the functional form of 𝑓(𝛽1, 𝛽2) and evaluate the partial derivatives with respect to 𝛽1 and 𝛽2. Using the summation and chain rules of differential calculus, we obtain
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽1 = 𝜕/𝜕𝛽1 (∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2) = ∑𝑖=1𝑛 𝜕/𝜕𝛽1 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 = ∑𝑖=1𝑛 2(𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)(−1) = 2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) (8.1)
and
𝜕𝑓(𝛽1, 𝛽2)/𝜕𝛽2 = 𝜕/𝜕𝛽2 (∑𝑖=1𝑛 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2) = ∑𝑖=1𝑛 𝜕/𝜕𝛽2 (𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)2 = ∑𝑖=1𝑛 2(𝑦𝑖 − 𝛽1 − 𝑥𝑖𝛽2)(−𝑥𝑖) = 2∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 (8.2)
respectively. □
We have thus derived the following necessary condition for a minimum of the sum of squared deviations
between model prediction and observed data:
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) = 0 (9)
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 = 0 (10)
To derive the standard form of the least-squares estimator, we reformulate equations (9) and (10) as
𝑛𝛽1 + (∑𝑖=1𝑛 𝑥𝑖)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖 (11)
(∑𝑖=1𝑛 𝑥𝑖)𝛽1 + (∑𝑖=1𝑛 𝑥𝑖2)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 (12)
Derivation of (11) and (12)
We rewrite equation (9) as follows
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖) = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2) − ∑𝑖=1𝑛 𝑦𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2) = ∑𝑖=1𝑛 𝑦𝑖 ⇔ 𝑛𝛽1 + (∑𝑖=1𝑛 𝑥𝑖)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖 (11.1)
and we rewrite equation (10) as
∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2 − 𝑦𝑖)𝑥𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2)𝑥𝑖 − ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 = 0 ⇔ ∑𝑖=1𝑛 (𝛽1 + 𝑥𝑖𝛽2)𝑥𝑖 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 ⇔ ∑𝑖=1𝑛 (𝛽1𝑥𝑖 + 𝑥𝑖2𝛽2) = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 ⇔ (∑𝑖=1𝑛 𝑥𝑖)𝛽1 + (∑𝑖=1𝑛 𝑥𝑖2)𝛽2 = ∑𝑖=1𝑛 𝑦𝑖𝑥𝑖 (12.1)
□
We may now express the linear equations (11) and (12) in matrix notation as
(
𝑛 ∑𝑖=1𝑛 𝑥𝑖
∑𝑖=1𝑛 𝑥𝑖 ∑𝑖=1𝑛 𝑥𝑖2
)(
𝛽1
𝛽2
) = (
∑𝑖=1𝑛 𝑦𝑖
∑𝑖=1𝑛 𝑦𝑖𝑥𝑖
) (13)
Finally, using the definitions of design matrix and data vector for simple linear regression as in equation (3)
we see that
(
𝑛 ∑𝑖=1𝑛 𝑥𝑖
∑𝑖=1𝑛 𝑥𝑖 ∑𝑖=1𝑛 𝑥𝑖2
) = (
1 1 ⋯ 1
𝑥1 𝑥2 ⋯ 𝑥𝑛
)(
1 𝑥1
1 𝑥2
⋮ ⋮
1 𝑥𝑛
) = 𝑋𝑇𝑋 (14)
and that
(
∑𝑖=1𝑛 𝑦𝑖
∑𝑖=1𝑛 𝑦𝑖𝑥𝑖
) = (
1 1 ⋯ 1
𝑥1 𝑥2 ⋯ 𝑥𝑛
)(
𝑦1
𝑦2
⋮
𝑦𝑛
) = 𝑋𝑇𝑦 (15)
In other words, the necessary condition for a minimum of the sum of squared deviations as given by the
system of linear equations (13) can be written very compactly in matrix notation as
𝑋𝑇𝑋𝛽 = 𝑋𝑇𝑦 (16)
The system of equations represented by (16) is called the “system of normal equations” and represents a
necessary condition for the function 𝑓 to assume a minimum in 𝛽 (or, equivalently, for the log likelihood
function of the simple linear regression model to assume a maximum with respect to 𝛽). Note that 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is a
square matrix and thus its inverse can be computed, if it is in fact invertible, which depends on the entries in
the columns of 𝑋 ∈ ℝ𝑛×𝑝.
For the moment, we assume that the matrix 𝑋𝑇𝑋 is invertible, and may thus multiply both sides of
(16) by (𝑋𝑇𝑋)−1 and solve for the “least-squares” (or maximum likelihood) parameter estimator �̂� as
follows:
𝑋𝑇𝑋�̂� = 𝑋𝑇𝑦 ⇔ (𝑋𝑇𝑋)−1(𝑋𝑇𝑋)�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ⇔ �̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (17)
In summary, we have shown that for the case of simple linear regression, the maximum likelihood or least-
squares estimator for the beta parameter vector is given by
�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (18)
Note again that in the current section we assumed that 𝜎2 > 0 is known and that the design matrix
𝑋 ∈ ℝ𝑛×2 and the parameter vector 𝛽 ∈ ℝ2 have a specific form – they correspond to the simple linear
regression case. However, in the next section, we will see that the form of the beta parameter estimator (18)
can be transferred to the case of general GLMs, i.e., arbitrary design matrices 𝑋 ∈ ℝ𝑛×𝑝 and parameter
vectors 𝛽 ∈ ℝ𝑝.
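As a numerical check of (18), the estimator can be evaluated in Python/NumPy with assumed, noise-free example values, so that it recovers 𝛽 exactly:

```python
import numpy as np

# Noise-free check: with y = X beta and no error term, the least-squares
# estimator recovers beta exactly (values are illustrative assumptions)
n = 6
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones(n), x])   # simple linear regression design matrix
beta = np.array([2.0, 0.5])            # offset beta_1 and slope beta_2
y = X @ beta

# beta_hat = (X^T X)^{-1} X^T y, equation (18)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)   # approximately [2.0, 0.5]

# beta_hat satisfies the system of normal equations X^T X beta = X^T y, equation (16)
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))  # True
```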
(3) General beta parameter estimation
To derive the so-called “ordinary least squares (OLS)” estimator for 𝛽 parameters of a “general”
GLM we generalize the result we obtained for the simple linear regression case, i.e.
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ2 (1)
where 𝑋 ∈ ℝ𝑛×2 consists of a column of ones and a column of the values of the independent variable
𝑥1, … , 𝑥𝑛, to the general case of (almost) arbitrary design matrices 𝑋 ∈ ℝ𝑛×𝑝 and 𝛽 ∈ ℝ𝑝. Note that from the
considerations above, we may equivalently refer to the OLS estimator for the 𝛽 parameters of the GLM as the
“least-squares” or “maximum-likelihood” estimator; however, the term OLS estimator seems to be most
commonly used in the literature.
To generalize (1), we make the assumption that the design matrix 𝑋 ∈ ℝ𝑛×𝑝 is of rank
𝑝 ∈ ℕ, sometimes referred to as 𝑋 being of “full column rank”. Importantly, this assumption guarantees that
the matrix product 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is of rank 𝑝 as well, and thus invertible. In addition, we require the
following two results from linear algebra
(a) (𝐴𝐵)𝑇 = 𝐵𝑇𝐴𝑇 for all matrices 𝐴 ∈ ℝ𝑚×𝑛 and 𝐵 ∈ ℝ𝑛×𝑝. (3)
(b) If a matrix 𝐵 ∈ ℝ𝑚×𝑚 is of the form 𝐵 = 𝐴𝐴𝑇 for 𝐴 ∈ ℝ𝑚×𝑛 of full row rank, the matrix 𝐵 is positive-definite. (4)
Based on these assumptions, we can now show that the OLS estimator for a general GLM is of the same form as
for the simple linear regression scenario. Because this is a fundamental result, we present it in the form of a
theorem, i.e. a factual statement followed by its mathematical proof.
Theorem. The Ordinary Least Squares �̂� estimator
Let 𝑋𝛽 + 휀 = 𝑦, where 𝑋 ∈ ℝ𝑛×𝑝 is of full-column rank, 𝛽 ∈ ℝ𝑝, 𝑦 ∈ ℝ𝑛 and 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛)
for 𝑛, 𝑝 ∈ ℕ and 𝜎2 > 0 denote the GLM. Based on an observation of the random vector 𝑦 ∈ ℝ𝑛 the
ordinary least-squares estimator (= maximum likelihood estimator) is given by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (5)
□
Proof of (5)
The theorem implies that if �̂� is defined by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 (5.1)
the so-called “sum-of-error-squares” (SES)
휀𝑇휀 ≔ (𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) ∈ ℝ+ (5.2)
becomes minimal. To show that this is indeed the case, we will first show that
𝑋𝑇(𝑦 − 𝑋�̂�) = 0 ∈ ℝ𝑝 and (𝑦 − 𝑋�̂�)𝑇𝑋 = 0 ∈ ℝ1×𝑝 (5.3)
This may be seen as follows: by definition, we have
�̂� = (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ⇔ 𝑋𝑇𝑋�̂� = 𝑋𝑇𝑦 ⇔ 0 = 𝑋𝑇𝑦 − 𝑋𝑇𝑋�̂� ⇔ 0 = 𝑋𝑇(𝑦 − 𝑋�̂�) (5.4)
Note that above 0 ∈ ℝ𝑝. Forming the transpose of 𝑋𝑇(𝑦 − 𝑋�̂�) = 0, we see that
0𝑇 = (𝑋𝑇(𝑦 − 𝑋�̂�))𝑇 = (𝑦 − 𝑋�̂�)𝑇𝑋 (5.5)
and thus also (𝑦 − 𝑋�̂�)𝑇𝑋 corresponds to 0 ∈ ℝ1×𝑝.
We now use (5.3) to show that �̂� minimizes the SES. To this end, we first reformulate the SES as follows:
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂� + 𝑋�̂� − 𝑋𝛽)𝑇(𝑦 − 𝑋�̂� + 𝑋�̂� − 𝑋𝛽) = ((𝑦 − 𝑋�̂�) + 𝑋(�̂� − 𝛽))𝑇((𝑦 − 𝑋�̂�) + 𝑋(�̂� − 𝛽)) (5.6)
Resolving the brackets then yields
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) + (𝑦 − 𝑋�̂�)𝑇𝑋(�̂� − 𝛽) + (�̂� − 𝛽)𝑇𝑋𝑇(𝑦 − 𝑋�̂�) + (�̂� − 𝛽)𝑇𝑋𝑇𝑋(�̂� − 𝛽) (5.7)
With (5.3), the second and the third terms in expression (5.7) are zero, and we have for the SES
(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) + (�̂� − 𝛽)𝑇𝑋𝑇𝑋(�̂� − 𝛽) (5.8)
Based on this expression, we can conclude that the OLS estimator �̂� minimizes the sum-of-error-squares: the second term is always larger than or equal to zero, because 𝑋𝑇𝑋 ∈ ℝ𝑝×𝑝 is positive-definite, and it can become zero (and thus minimize the right-hand side of (5.8)) only for the choice 𝛽 = �̂� ∈ ℝ𝑝. □
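The content of the theorem can also be checked numerically: the SES evaluated at �̂� never exceeds the SES evaluated at any perturbed parameter vector. A Python/NumPy sketch under assumed example values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated GLM data (design matrix, parameters, and noise level are
# illustrative assumptions)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# OLS estimator beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

def ses(beta):
    # sum-of-error-squares (y - X beta)^T (y - X beta)
    r = y - X @ beta
    return r @ r

# SES(beta_hat) <= SES(beta) for randomly perturbed parameter vectors
ok = all(ses(beta_hat) <= ses(beta_hat + rng.normal(size=p))
         for _ in range(100))
print(ok)  # True
```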
(4) Maximum likelihood variance parameter estimation
So far we have only considered estimation of the beta parameter vector 𝛽 ∈ ℝ𝑝 and have assumed
that 𝜎2 is known. Using the equivalence of least-squares and maximum likelihood estimation in the case
of the beta parameter, we have found the so-called “ordinary least-squares” �̂� ∈ ℝ𝑝 estimator for 𝛽 ∈ ℝ𝑝.
In this section, we consider the maximum likelihood estimation of the GLM parameter 𝜎2 > 0. As in the
maximum likelihood derivation of the variance estimator for a univariate Gaussian, we will proceed as
follows: We will first write down the likelihood and log likelihood function of the GLM as a function of both
𝛽 ∈ ℝ𝑝 and 𝜎2 > 0, and then substitute �̂� ∈ ℝ𝑝 to treat the log likelihood function as a function of 𝜎2 > 0
only. We then use the standard maximum likelihood approach: we evaluate the derivative of the log
likelihood function with respect to 𝜎2, set this derivative to zero, and solve for the maximum likelihood
estimator �̂�2 > 0 of 𝜎2 > 0.
Recapitulating the above, we consider the GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑦, 휀 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 and 𝐼𝑛 denotes the 𝑛 × 𝑛 identity matrix, and its associated
“two parameter” likelihood function
𝐿: ℝ𝑝 × ℝ+\{0} × ℝ𝑛 → ℝ+, ((𝛽, 𝜎2), 𝑦) ↦ 𝐿((𝛽, 𝜎2), 𝑦) ≔ (2𝜋𝜎2)−𝑛/2 𝑒𝑥𝑝 (−(1/(2𝜎2))(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽)) (2)
where (𝑋𝛽)𝑖 ∈ ℝ denotes the 𝑖th entry of the 𝑛 × 1 vector 𝑋𝛽. Logarithmic transformation then yields the
log likelihood function
ℓ: ℝ𝑝 × ℝ+\{0} × ℝ𝑛 → ℝ, ((𝛽, 𝜎2), 𝑦) ↦ ℓ((𝛽, 𝜎2), 𝑦) ≔ −(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽) (3)
Substitution of the ordinary least-squares estimator �̂� then renders this function a function of the parameter
𝜎2 only
ℓ�̂�: ℝ+\{0} × ℝ𝑛 → ℝ, (𝜎2, 𝑦) ↦ ℓ�̂�(𝜎2, 𝑦) ≔ ℓ((�̂�, 𝜎2), 𝑦) = −(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (4)
The derivative of ℓ�̂� with respect to 𝜎2 evaluates to
𝑑ℓ�̂�(𝜎2, 𝑦)/𝑑𝜎2 = −(1/2)(𝑛/𝜎2) + (1/2)(1/(𝜎2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (5)
Derivation of equation (5)
We have
𝑑ℓ�̂�(𝜎2, 𝑦)/𝑑𝜎2 = 𝑑/𝑑𝜎2 (−(𝑛/2) 𝑙𝑛 2𝜋 − (𝑛/2) 𝑙𝑛 𝜎2 − (1/(2𝜎2))(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�))
= −(𝑛/2) 𝑑/𝑑𝜎2 𝑙𝑛 𝜎2 − (1/2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) 𝑑/𝑑𝜎2 (𝜎2)−1
= −(𝑛/2)(1/𝜎2) − (1/2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)(−1)(𝜎2)−2
= −(1/2)(𝑛/𝜎2) + (1/2)(1/(𝜎2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) (5.1)
□
Setting the derivative of ℓ�̂� to zero and solving for the value of 𝜎2 then yields
�̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/𝑛 (6)
Derivation of equation (6)
We have
𝑑ℓ�̂�(�̂�2, 𝑦)/𝑑𝜎2 = 0 ⇔ −(1/2)(𝑛/�̂�2) + (1/2)(1/(�̂�2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = 0 ⇔ (1/2)(1/(�̂�2)2)(𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = (1/2)(𝑛/�̂�2)
⇔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) = (�̂�2)2𝑛/�̂�2 ⇔ 𝑛�̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�) ⇔ �̂�2 = (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/𝑛 (6.1)
□
The numerator of the expression for �̂�2, given by (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�), is referred to as the
“residual-sum-of-squares” (the analogous quantity for general 𝛽, (𝑦 − 𝑋𝛽)𝑇(𝑦 − 𝑋𝛽), is referred to as the
“error-sum-of-squares”). The quantity (𝑦 − 𝑋�̂�) corresponds to the difference between the observed data
and the data prediction obtained using the OLS beta estimator �̂� ∈ ℝ𝑝. In other words, the
estimator for the error variance corresponds to a scaled version of the remaining mismatch between
observed data and OLS-based data prediction, which may seem somewhat surprising, but is inherent in the theory of
the GLM.
Unfortunately, however, as for the univariate Gaussian, the maximum likelihood estimator �̂�2
in the form of equation (6) is not “bias-free”. This can readily be rectified, however, by dividing the RSS not
by 𝑛, but by 𝑛 − 𝑝. This yields the “corrected” or “restricted maximum likelihood” estimator for the error
variance
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) (7)
For details on the motivation and derivation of (7) the reader is referred to the Section “Restricted maximum
likelihood estimation of the GLM”.
In summary, we have established the following classical point estimators for the parameters 𝛽 ∈ ℝ𝑝
and 𝜎2 of a GLM of the form
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (8)
1. The ordinary least-squares beta estimator
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (9)
2. The variance parameter estimator
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) > 0 (10)
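Both estimators can be evaluated on simulated data; a Python/NumPy sketch with assumed true parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data with assumed true parameter values
n, p = 200, 2
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_bar = np.array([1.0, 2.0])
sigma2_bar = 0.09
y = X @ beta_bar + rng.normal(0.0, np.sqrt(sigma2_bar), n)

# Equation (9): ordinary least-squares beta estimator
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Residuals and the two variance estimators discussed above
r = y - X @ beta_hat
sigma2_hat_ml = (r @ r) / n        # biased ML estimator, equation (6)
sigma2_hat = (r @ r) / (n - p)     # bias-corrected estimator, equation (10)

print(beta_hat)     # close to [1.0, 2.0]
print(sigma2_hat)   # close to 0.09
```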
Study Questions
1. Write down the formula for the least-squares or beta-parameter estimator �̂� of the GLM, explain its components, and, verbally,
sketch its derivation.
2. Write down the formula for the variance parameter estimator �̂�2 of the GLM and explain its components.
Study Questions Answers
1. The least-squares or beta-parameter estimator for the GLM is given by
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝
where 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix, 𝑦 ∈ ℝ𝑛 denotes the data vector (and the −1 and 𝑇 superscripts denote matrix inversion and
transposition, respectively). The formula for the least-squares estimator follows from evaluating a critical point of the GLM log
likelihood function or minimizing the sum of squared deviations between the GLM prediction 𝑋𝛽 and the observed data 𝑦.
2. The variance parameter estimator of the GLM is given by
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝)
where 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix, 𝑦 ∈ ℝ𝑛 denotes the data vector, and �̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 denotes the least-squares
parameter estimator of the GLM. 𝑛 corresponds to the number of data points, and 𝑝 to the number of parameters.
Frequentist Parameter Estimator Distributions
(1) The intuitive background for parameter estimator distributions
In the previous sections, we have seen that parameter point estimators for the GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
with 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 휀 ∈ ℝ𝑛, 𝑦 ∈ ℝ𝑛, 𝜎2 > 0, 𝐼𝑛 ∈ ℝ𝑛×𝑛 can be derived by means of (restricted)
maximum likelihood estimation in the forms
�̂� ≔ (𝑋𝑇𝑋)−1𝑋𝑇𝑦 ∈ ℝ𝑝 (2)
and
�̂�2 ≔ (𝑦 − 𝑋�̂�)𝑇(𝑦 − 𝑋�̂�)/(𝑛 − 𝑝) ∈ ℝ (3)
In this section, we discuss the “sampling distributions” of the estimators �̂� and �̂�2. To this end, it is helpful to
initially review the fundamental “sampling” intuition associated with the GLM. As discussed previously, if the
GLM is applied in a given data analytical context, the following implicit assumptions are made in a
frequentist scenario: In the real world, there exist values �̅� ∈ ℝ𝑝 and �̅�2 > 0, which are fixed and true, but
unknown. The data vector 𝑦(1) ∈ ℝ𝑛 that forms the basis for data analysis is assumed to be a single sample
from the distribution 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛), and based on this sample, point estimates for the values �̅� and �̅�2
can be obtained by substituting 𝑦(1) in the equations for �̂� and �̂�2, i.e., setting
�̂�(1) = (𝑋𝑇𝑋)−1𝑋𝑇𝑦(1) and �̂�2(1)≔
(𝑦(1)−𝑋�̂�)𝑇(𝑦(1)−𝑋�̂�)
𝑛−𝑝 (4)
where the “(1)“- superscripts are meant to indicate that these estimator values are obtained based on the
data realization 𝑦(1). If the experiment were to be repeated under identical circumstances, a second sample
from 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛) could be obtained, which we denote by 𝑦(2). The values of 𝑦(2) will not be identical to
those of 𝑦(1), but largely similar, as both are derived from the same underlying distribution. Again, point
estimates for the values of �̅� and �̅�2 could be obtained by means substitution of 𝑦(2) in (2) and (3). Because
the design matrix 𝑋 and its properties 𝑛 and 𝑝 will remain identical, the values for �̂�(2) and �̂�2(2)
will show
some variation with respect to �̂�(1) and �̂�2(1)
, but, as 𝑦(2) is largely similar to 𝑦(1), not too much. One can
readily imagine taking more samples from 𝑁(𝑦; 𝑋�̅�, �̅�2𝐼𝑛), resulting in more values 𝑦(3), 𝑦(4), 𝑦(5)… ∈ ℝ𝑛,
which in turn give rise to more values �̂�(3), �̂�(4), �̂�(5), … and �̂�2(2), �̂�2
(3), �̂�2
(4), ….. In this section, we
investigate the following question: given that we know the distribution of the data values 𝑦(𝑖), 𝑖 = 1,2,3,…,
what can we say about the distribution of the corresponding �̂�(𝑖) and �̂�2(𝑖)
values? These distributional
properties form the theoretical basis for the statistical testing theory based on the 𝑇 and 𝐹 statistics
discussed in subsequent sections.
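The repeated-sampling scheme just described can be sketched in a few lines. The following minimal simulation (an illustrative sketch, not part of the original text) assumes an arbitrary simple linear regression design with true parameter values β̄ = (0, 1)ᵀ and σ̄² = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# simple linear regression design: intercept and a linear trend
n = 20
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
p = X.shape[1]

beta_true = np.array([0.0, 1.0])   # "true, but unknown" beta parameters
sigma2_true = 0.5                  # "true, but unknown" variance parameter

# draw repeated samples y^(1), y^(2), ... and evaluate the estimators on each
n_samples = 1000
beta_hats = np.empty((n_samples, p))
sigma2_hats = np.empty(n_samples)
for i in range(n_samples):
    y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    beta_hats[i] = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hats[i]
    sigma2_hats[i] = (resid @ resid) / (n - p)

# the estimator values vary from sample to sample, scattering around the truth
print(beta_hats.mean(axis=0), sigma2_hats.mean())
```

Each row of `beta_hats` corresponds to one β̂⁽ⁱ⁾ and each entry of `sigma2_hats` to one σ̂²⁽ⁱ⁾ in the notation above.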
Before introducing the sampling distributions of β̂ and σ̂², we remark that these estimators, and in fact any “statistic”, can be conceived as a mapping from the data space into the parameter space. This is a useful perspective for studying properties of estimators. For example, we may write the β̂ and σ̂² estimators as the following functions

β̂_X : ℝⁿ → ℝᵖ, y ↦ β̂_X(y) ≔ (XᵀX)⁻¹Xᵀy  (5)

σ̂²_X : ℝⁿ → ℝ, y ↦ σ̂²_X(y) ≔ (y − Xβ̂_X(y))ᵀ(y − Xβ̂_X(y)) / (n − p)  (6)
(2) The sampling distribution of the beta parameter estimator
The fundamental assumption of the GLM is that the distribution of the data random vector y ∈ ℝⁿ is given by a multivariate Gaussian distribution, N(y; Xβ, σ²Iₙ). Intuitively, when forming the OLS estimator β̂ ∈ ℝᵖ, one multiplies the data with the fixed matrix (XᵀX)⁻¹Xᵀ, i.e., one stretches or squeezes the respective values by a fixed amount, but leaves them unchanged otherwise. One may thus expect the OLS estimator β̂ to be distributed according to a multivariate Gaussian as well, which is indeed the case. Formally, the parameters of the normal distribution of β̂ can be inferred based on the linear transformation theorem for multivariate Gaussian distributions.
Specifically, as introduced previously, this theorem states that if (1) a random vector x ∈ ℝⁿ is distributed according to a Gaussian distribution with expectation parameter μₓ ∈ ℝⁿ and covariance matrix parameter Σₓ ∈ ℝⁿˣⁿ, (2) A ∈ ℝᵐˣⁿ is a matrix of rank m, (3) ε ∈ ℝᵐ is a Gaussian random vector with expectation parameter μ_ε ∈ ℝᵐ and covariance matrix parameter Σ_ε ∈ ℝᵐˣᵐ, and (4) the covariance between x and ε is zero, then

z ≔ Ax + ε  (1)

is distributed according to a Gaussian distribution with expectation parameter μ_z = Aμₓ + μ_ε ∈ ℝᵐ and covariance matrix parameter Σ_z = AΣₓAᵀ + Σ_ε ∈ ℝᵐˣᵐ. In the current scenario, we are concerned with a special case of this theorem, namely the case ε ≡ 0, i.e., the case in which ε does not exist as a random variable. In this case, we may omit the contributions of ε to the parameters of z. Formulated in general terms, we thus obtain the following simplified linear transformation theorem for multivariate Gaussian distributions.
Theorem. Simplified linear transformation theorem for multivariate Gaussian distributions.
Let x ∈ ℝⁿ be distributed according to an n-dimensional Gaussian distribution with expectation parameter μ ∈ ℝⁿ and covariance parameter Σ ∈ ℝⁿˣⁿ. Let further A ∈ ℝᵐˣⁿ be a matrix of rank m ∈ ℕ. Then the product Ax ∈ ℝᵐ is distributed according to an m-dimensional Gaussian distribution with expectation parameter Aμ ∈ ℝᵐ and covariance parameter AΣAᵀ ∈ ℝᵐˣᵐ. In other words, for p(x) = N(x; μ, Σ) and z ≔ Ax we have p(z) = N(z; Aμ, AΣAᵀ).
□
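As an illustrative numerical check of the theorem (a sketch added here, not part of the original text), one can draw many samples of x, apply A, and compare the empirical moments of Ax with Aμ and AΣAᵀ; the Gaussian parameters and the rank-2 matrix A below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary illustrative Gaussian parameters (n = 3) and a rank-2 matrix A
mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, -1.0]])

# sample x ~ N(mu, Sigma) and transform to z = Ax
x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T

# empirical moments of z approach the theorem's parameters A mu and A Sigma A^T
print(z.mean(axis=0), A @ mu)          # expectation parameter
print(np.cov(z.T), A @ Sigma @ A.T)    # covariance parameter
```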
To apply this theorem in the current case, we note without proof that (XᵀX)⁻¹Xᵀ ∈ ℝᵖˣⁿ is of rank p, and obtain the following result:

p(y) = N(y; Xβ, σ²Iₙ) ⇒ p(β̂) = N(β̂; β, σ²(XᵀX)⁻¹)  (2)
Derivation of (2)
Upon setting

x ≔ y ∈ ℝⁿ, μ ≔ Xβ ∈ ℝⁿ, Σ ≔ σ²Iₙ ∈ ℝⁿˣⁿ, A ≔ (XᵀX)⁻¹Xᵀ ∈ ℝᵖˣⁿ, and z ≔ β̂ = (XᵀX)⁻¹Xᵀy ∈ ℝᵖ  (2.1)

application of the simplified linear transformation theorem for multivariate Gaussian distributions yields the following p-dimensional Gaussian distribution for the estimator β̂:

p(β̂) = N(β̂; (XᵀX)⁻¹XᵀXβ, (XᵀX)⁻¹Xᵀ(σ²Iₙ)((XᵀX)⁻¹Xᵀ)ᵀ)  (2.2)

The parameters of this distribution can be simplified as follows:

μ_β̂ ≔ (XᵀX)⁻¹XᵀXβ = β ∈ ℝᵖ  (2.3)

and

Σ_β̂ ≔ (XᵀX)⁻¹Xᵀ(σ²Iₙ)((XᵀX)⁻¹Xᵀ)ᵀ = (XᵀX)⁻¹Xᵀ(σ²Iₙ)X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ  (2.4)

In (2.4) we used in the second equality that both XᵀX and its inverse (XᵀX)⁻¹ are symmetric matrices, and thus ((XᵀX)⁻¹)ᵀ = (XᵀX)⁻¹. In the third equality, the scalar σ² was moved to the front of the product and the identity factor Iₙ was suppressed. Finally, in the fourth equality, the matrix product XᵀX(XᵀX)⁻¹ was evaluated to the identity Iₚ.

□
In verbose form, we obtained the following result: from the assumption of error terms ε ∈ ℝⁿ distributed according to a multivariate Gaussian distribution with expectation 0 ∈ ℝⁿ and spherical covariance matrix σ²Iₙ ∈ ℝⁿˣⁿ and the GLM equation y = Xβ + ε, it first follows that the data y ∈ ℝⁿ is distributed according to a Gaussian distribution with expectation parameter Xβ ∈ ℝⁿ and covariance matrix σ²Iₙ ∈ ℝⁿˣⁿ. From the Gaussian distribution of the data, in turn, it follows with the (simplified) linear transformation theorem for Gaussian distributions that the OLS estimator β̂ ∈ ℝᵖ is distributed according to a p-dimensional Gaussian distribution with expectation parameter β ∈ ℝᵖ, i.e., the true, but unknown, value of the parameters, and covariance matrix σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ. Notably, the covariance matrix of the beta parameter estimator β̂ thus depends on the error variance parameter σ² and the design matrix X ∈ ℝⁿˣᵖ. Figure 1 below visualizes the Gaussian sampling distribution of the two-dimensional OLS estimator β̂ ∈ ℝ² in the case of a simple linear regression GLM.
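This result can also be checked by simulation (an illustrative sketch, not part of the original text): the empirical covariance of many β̂ samples should approach σ²(XᵀX)⁻¹. The design matrix and true parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)

# arbitrary illustrative design matrix and true parameters
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([0.5, -0.3])
sigma2 = 2.0

# sample many data realizations and evaluate the OLS beta estimator for each
Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
B = Y @ (np.linalg.inv(X.T @ X) @ X.T).T     # rows are beta_hat samples

# empirical moments approach beta and sigma^2 (X'X)^{-1}
print(B.mean(axis=0), beta)
print(np.cov(B.T), sigma2 * np.linalg.inv(X.T @ X))
```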
Figure 1. Visualization of the distributional properties of the OLS β estimator for the case of a simple linear regression GLM. The panel in the first row, first column shows 20 samples from simple linear regression GLMs with true, but unknown, parameter values β̄ ≔ (0, 1)ᵀ as red dots around their respective expectation (black line), where the true, but unknown, variance parameter was set to σ̄² ≔ 0.5. The second column depicts the corresponding evaluations of the OLS β estimator in parameter space ℝ². The third column depicts the analytical probability density function of the OLS β estimator derived by means of the linear transformation theorem for Gaussian distributions. The panels in the second row depict the equivalent entities for β̄ ≔ (2, −1)ᵀ.
(3) The sampling distribution of the scaled variance parameter estimator
The derivation of the sampling distribution of the variance parameter estimator σ̂² is, compared to the derivation of the OLS beta estimator distribution, somewhat more involved. We thus content ourselves with stating the result here and provide some intuition and a sampling-based demonstration of it. Interested readers may refer to the Section “Probability Distributions” for a formal derivation of the following result:

Let Xβ + ε = y, where X ∈ ℝⁿˣᵖ, β ∈ ℝᵖ, y ∈ ℝⁿ and ε ~ N(ε; 0, σ²Iₙ) for n, p ∈ ℕ and σ² > 0, denote the GLM, and let

σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂) / (n − p)  (1)

denote the unbiased estimator for the variance parameter σ² > 0. Then the distribution of the “scaled variance parameter estimator”

((n − p)/σ²) σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂) / σ²  (2)

is given by a chi-squared distribution with n − p degrees of freedom, denoted by

p(((n − p)/σ²) σ̂²) = χ²(((n − p)/σ²) σ̂²; n − p)  (3)
What is the intuition behind this result? First, from the previous discussion of the distribution of the OLS estimator for the beta parameters β̂, it should be clear that based on data samples y⁽¹⁾, y⁽²⁾, y⁽³⁾, … (and values of the corresponding β̂⁽¹⁾, β̂⁽²⁾, β̂⁽³⁾, …), we may evaluate values σ̂²⁽¹⁾, σ̂²⁽²⁾, σ̂²⁽³⁾, … based on the formula in equation (1) and consider their distribution. In contrast to the distributions encountered so far, the distribution of the estimator σ̂² is not a normal distribution. This can intuitively be made plausible by considering the term (y − Xβ̂)ᵀ(y − Xβ̂), i.e., the so-called “residual sum-of-squares” in the numerator of the right-hand side of equation (1). From the definition of the GLM and the discussion so far, we know that y is normally distributed with expectation Xβ and that β̂ is normally distributed with expectation β. Intuitively, the difference z ≔ y − Xβ̂ ∈ ℝⁿ is thus normally distributed with expectation Xβ − XE(β̂) = 0 ∈ ℝⁿ. The numerator of the right-hand side of (2) thus corresponds to computing the sum of squares zᵀz of a normally distributed random vector z with expectation 0.
We next consider sampling values of the random variable z from its distribution. One may obtain values equally likely to be positive or negative and mainly close to zero. Now consider squaring these values. Most notably, all the negative values will become positive, and thus the distribution of zᵀz, i.e., the sum of squared entries of z, cannot be a normal distribution centered on zero. In fact, as noted in the definition of the chi-squared distribution in the Section “Probability Distributions”, it is a standard result that the sum of squared univariate random variables xᵢ (i = 1, …, n), each distributed according to a “standard” normal distribution with expectation μ = 0 and variance σ² = 1, i.e., ζ ≔ Σᵢ₌₁ⁿ xᵢ², is distributed according to a chi-squared distribution with n degrees of freedom, i.e., χ²(ζ; n). This fact, the covariance matrix of the random variable z = y − Xβ̂, and the denominator of the estimator σ̂² all contribute to the fact that ((n − p)/σ²) σ̂² is distributed according to a chi-squared distribution with n − p degrees of freedom. The reason that the product ((n − p)/σ²) σ̂² is considered here, instead of the “pure” variance parameter estimator σ̂², lies merely in the mathematical fact that it is this product which is chi-squared distributed. To recover the distribution of the “pure” σ̂², one would have to use the “transformation theorem for probability density functions”, which we eschew here for simplicity. We visualize the sampling distributions of σ̂² and ((n − p)/σ²) σ̂² in Figure 2 below. Note that these entities are one-dimensional, and their distribution is hence always readily visualized, irrespective of the dimensionality of the beta parameter vector.
Figure 2. Visualization of the sampling distributional properties of the σ̂² estimator for the case of a simple linear regression GLM. The panel in the first row, first column shows 100 samples from simple linear regression GLMs with true, but unknown, parameter values β̄ ≔ (1, 1)ᵀ and σ̄² ≔ 0.1. The panel in the second column depicts the frequency counts of observations of σ̂² falling into equally spaced bins of width ≈ 0.08. Finally, the third column depicts the histogram of the product ((n − p)/σ²) σ̂² and the probability density function of the chi-squared distribution with the appropriate degrees of freedom. The panels in the second row depict the same entities for the case σ̄² ≔ 1. Note that while the distribution of σ̂² changes, the distribution of ((n − p)/σ²) σ̂² differs only up to sampling error from the case σ̄² ≔ 0.1, due to the scaling by σ².
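A minimal sampling-based check of this result (an illustrative sketch, not part of the original text): simulate the scaled variance estimator repeatedly and compare its empirical moments with those of a χ²(n − p) distribution, whose mean is n − p and whose variance is 2(n − p). The design and true parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# simple linear regression design, arbitrary true parameters
n = 12
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
p = X.shape[1]
beta_true = np.array([1.0, 1.0])
sigma2_true = 0.1

# repeatedly evaluate the scaled variance parameter estimator
n_samples = 50_000
scaled = np.empty(n_samples)
for i in range(n_samples):
    y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    scaled[i] = (resid @ resid) / sigma2_true   # = ((n-p)/sigma^2) * sigma2_hat

# chi-squared(n - p): mean n - p, variance 2(n - p)
print(scaled.mean(), n - p)
print(scaled.var(), 2 * (n - p))
```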
(4) Overview of the frequentist GLM distribution theory
In brief, we may summarize the sampling distribution theory of the GLM and its parameter point estimators, with fixed quantities X ∈ ℝⁿˣᵖ, β ∈ ℝᵖ, n, p ∈ ℕ, σ² > 0, Iₙ ∈ ℝⁿˣⁿ and random vectors y, ε ∈ ℝⁿ, β̂ ∈ ℝᵖ and σ̂² ∈ ℝ, as follows:

General Linear Model: Xβ + ε = y, p(ε) = N(ε; 0, σ²Iₙ)  (1)
Data distribution: p(y) = N(y; Xβ, σ²Iₙ)  (2)
Beta parameter estimator distribution: p(β̂) = N(β̂; β, σ²(XᵀX)⁻¹)  (3)
Scaled variance parameter estimator distribution: p(((n − p)/σ²) σ̂²) = χ²(((n − p)/σ²) σ̂²; n − p)  (4)
Study Questions

1. Discuss the frequentist sampling perspective of the GLM and address the role of the data that a GLM analysis is applied to.
2. Write down the “simplified linear transformation theorem for Gaussian distributions”.
3. Verbally explain the fact that the beta parameter and variance parameter estimators have a probability distribution.
4. Write down the probability distribution of the beta parameter estimator β̂ and explain its components.
5. Write down the probability distribution of the scaled variance parameter estimator ((n − p)/σ²) σ̂² and explain its components.
Study Questions Answers

1. From a frequentist sampling perspective, there exist fixed true, but unknown, values of the GLM beta and variance parameters. Based on these parameters (and a given design matrix), data vectors y⁽¹⁾, y⁽²⁾, … can be sampled from the GLM. The distribution of these data vectors is described by p(y) = N(y; Xβ, σ²Iₙ). If a given data set is analyzed by means of the GLM, it is assumed that this data set is one of the many potential samples taken from the underlying GLM.

2. Let x ∈ ℝⁿ be distributed according to an n-dimensional Gaussian distribution with expectation parameter μ ∈ ℝⁿ and covariance parameter Σ ∈ ℝⁿˣⁿ. Let further A ∈ ℝᵐˣⁿ be a matrix of rank m ∈ ℕ. Then the product Ax ∈ ℝᵐ is distributed according to an m-dimensional Gaussian distribution with expectation parameter Aμ ∈ ℝᵐ and covariance parameter AΣAᵀ ∈ ℝᵐˣᵐ. In other words, for p(x) = N(x; μ, Σ) and z ≔ Ax we have p(z) = N(z; Aμ, AΣAᵀ).

3. From a frequentist viewpoint, data vectors can be sampled from a GLM many times. By means of the formulas for the beta and variance parameter estimators, these data sample vectors can be converted into samples of the beta and variance parameter estimators. From the distributional assumptions about the error terms in the GLM, and the ensuing distribution of the data samples, the probability distributions of the beta and variance parameter estimators then follow. In an experimental context, there is only one data realization and thus only one realization of the beta and variance parameter estimators. The distributions of the beta and variance parameter estimators are thus “hypothetical”.

4. The probability distribution of the beta parameter estimator is given by the normal distribution N(β̂; β, σ²(XᵀX)⁻¹), with expectation parameter β ∈ ℝᵖ, i.e., the true, but unknown, parameter value, and positive-definite covariance matrix parameter σ²(XᵀX)⁻¹ ∈ ℝᵖˣᵖ, where σ² > 0 is the true, but unknown, variance parameter of the GLM noise term (i.e., p(ε) = N(ε; 0, σ²Iₙ)) and X ∈ ℝⁿˣᵖ is the design matrix. The (co)variance of the beta parameter estimator distribution is thus given by the combination of the noise term variance parameter and the experimental design.

5. The probability distribution of the scaled variance parameter estimator is given by the chi-squared distribution χ²(((n − p)/σ²) σ̂²; n − p), where the “degrees of freedom” n − p refer to the difference between the number of data points and the number of parameters of the underlying GLM, and σ² > 0 is the true, but unknown, variance parameter of the GLM noise term (i.e., p(ε) = N(ε; 0, σ²Iₙ)).
T- and F-Statistics
(1) Significance and hypothesis testing in frequentist statistics
In classical (= frequentist) statistics, model evaluation corresponds to statistical testing. Informally,
statistical testing works as follows: based on some initial assumptions about the distribution of the observed
data, the distribution of estimated model parameters under a null hypothesis of assumed parameter values
is determined and summarized in the null distribution of a “test statistic”. Based on some observed data and
the corresponding parameter estimates, the point estimate of this statistic is computed next. The value of
the test statistic is then compared to its distribution under the null hypothesis. If the value of the test
statistic falls into a region that, under the null hypothesis, is associated with a very small probability of occurring,
it is inferred that it is unlikely that the observed data was generated from the null distribution and the null
hypothesis is rejected. The probability of observing the data-based test statistic (or a more extreme value),
given that the null hypothesis is true, is known as the p-value. If the p-value falls below some conventional
threshold, say p < 0.05, the result is labelled statistically significant, otherwise it is not.
Commonly, in addition to this notion of “significance testing”, the following set-up is used to describe the test situation: if the null hypothesis, denoted as H₀, is in fact true, it may be rejected erroneously with a given probability α ∈ [0,1], known as the Type I error rate. If, to the contrary, an alternative hypothesis H₁ is true, this may erroneously be rejected (and the null hypothesis “accepted”) with a probability β ∈ [0,1], known as the Type II error rate. Table 1 below illustrates this situation.
                     H₀ is true                        H₁ is true
H₀ is not rejected   Correct decision                  Type II error with probability β
H₀ is rejected       Type I error with probability α   Correct decision

Table 1: The problem of statistical testing
Pondering this set-up of statistical testing raises many questions: Why is a null hypothesis only ever rejected, and the alternative hypothesis never accepted? Why is a value of p < 0.05 so important for having a “statistically significant” result, and thus, in terms of academia, the opportunity to write a paper, but not p = 0.08? How should one evaluate the “statistical power” 1 − β in a set-up where one has no idea about the effect size difference between H₀ and H₁? Why does one always want to reject the null hypothesis of “no effect”, when really one is interested in the probability of an experimental manipulation having resulted in an experimental effect? Finally, why does the p-value have such a counterintuitive definition that takes a long time to get used to?
To obtain an intuition about why statistical testing (and thus, in fact, much of contemporary empirical science) is dominated by p-values and uninformed use of statistical software instead of straightforward probabilistic reasoning, it is helpful to consider the history of “significance testing” as put forward by Ronald Fisher and of “hypothesis testing” as developed by Jerzy Neyman and Egon Pearson, documented in the following excerpt from Wikipedia.
“Significance testing is largely the product of Karl Pearson (p-value, Pearson’s chi-squared test), William
Sealy Gosset (Student’s t-distribution) and Ronald Fisher (“null hypothesis”, analysis of variance,
“significance test”), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson, the son of
Karl Pearson. Ronald Fisher, mathematician and biologist, began his life in statistics as a Bayesian, but soon
grew disenchanted with the subjectivity involved (namely use of the principle of indifference, i.e. uniform
probabilities, when determining prior probabilities), and sought to provide a more “objective” approach to
inductive inference.
Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to
extract results from few samples assuming Gaussian distributions. Neyman, who teamed with the younger
Pearson, emphasized mathematical rigor and methods to obtain more results from many samples and a
wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs. Neyman-Pearson formulations, methods, and terminology developed in the early 20th century. While hypothesis
testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s
Laplace considered the statistics of half a million births. The statistics showed an excess of boys compared to
girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.
Fisher popularized the “significance test”. He required a null hypothesis (corresponding to a population
frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-
hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a
Type II error [i.e. the error of “not rejecting the null hypothesis, if the alternative hypothesis is actually
true”]. The p-value [i.e. “the probability of observing a data set as extreme or more extreme than the
actually observed one, given that the null hypothesis is true”] was devised as an informal, but objective,
index meant to help a researcher determine whether to modify further experiments or strengthen one’s
faith in the null hypothesis.
Neyman and Pearson considered a different problem, which they called “hypothesis testing”. They
initially considered two simple hypotheses [A and B], both with [specified] frequency distributions. They
calculated [the] two probabilities [of an observed data sample under each hypothesis] and typically selected
the hypothesis with the higher probability, i.e. the hypothesis more likely to have generated the sample.
Hypothesis testing and the associated Type I and Type II errors [i.e. deciding, based on the observed data,
that hypothesis B is true, when in fact hypothesis A is true, and vice versa, deciding that hypothesis A is true,
when in fact hypothesis B is true, respectively] were devised as a more “objective” alternative to Fisher’s p-
value. Their method thus always selected a hypothesis and allowed for the calculation of both types of error
probabilities.
Fisher and Neyman-Pearson clashed bitterly. Neyman-Pearson considered their formulation to be an
improved generalization of significance testing. Fisher thought that it was not applicable to scientific
research because often, during the course of the experiment, it is discovered that the initial assumptions
about the null hypothesis are questionable due to unexpected sources of errors. He believed that the rigid
reject/accept decisions based on models formulated before data is collected was incompatible with this
common scenario faced by scientists and attempts to apply this method to scientific research would lead to
mass confusion.
The dispute between Fisher and Neyman-Pearson was waged on philosophical grounds, characterized by
a philosopher as a dispute over the proper role of models in statistical inference.
Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership
with Pearson and separating disputants, who had occupied the same building, by much of the planetary
diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman
terminated unresolved after 27 years with Fisher’s death in 1962. Neyman wrote a well-regarded eulogy.
Some of Neyman’s later publications reported p-values and significance levels.
The modern version of hypothesis testing is a hybrid of the two approaches that resulted from
confusion by writers of statistical textbooks (as predicted by Fisher) in the 1940s. Great conceptual
differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson
provided the stronger terminology, the more rigorous mathematics, and the more consistent philosophy,
but the subject taught today in introductory statistics has more similarities with Fisher’s method than
theirs. This history explains the inconsistent terminology (example: the null hypothesis is never accepted,
but there is a region of acceptance).
Sometime around 1940, in an apparent effort to provide researchers with a “non-controversial” way
to have their cake and eat it too, the authors of statistical text books began anonymously combining the two
strategies by using the p-value in place of the test statistic (or data) to test against the Neyman-Pearson
“significance level”. Thus, researchers were encouraged to infer the strength of their data against some null
hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity provided
by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic
research hypothesis, to be used almost solely as a strawman “nil” hypothesis (one where a treatment has no
effect, regardless of the context).
[The two different approaches developed by Fisher and Neyman-Pearson may be summarized as follows]
Fisher’s significance testing
1. Set up a statistical null hypothesis. The null hypothesis need not be a nil hypothesis [in the sense of a
difference from zero, but may be with respect to any value].
2. Report the exact level of significance (e.g. 𝑝 = 0.051 or 𝑝 = 0.049). Do not use a conventional 5%
level and do not talk about accepting or rejecting hypotheses. If the result is “not significant”, draw no
conclusions and make no decisions, but suspend judgment until further data is available.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional
conclusions in the context of an attempt to understand the experimental situation.
Neyman-Pearson’s hypothesis testing
1. Set up two [formalized] statistical hypotheses [“probabilistic models”], H₁ and H₂, and decide on desirable probabilities α and β for Type I and Type II errors, respectively, and on the sample size, before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of 𝐻1, accept 𝐻2, otherwise accept 𝐻1. Note that accepting a
hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited, among others, to situations where one has a disjunction of
hypotheses [e.g. two alternative univariate Gaussians with expectation parameters 𝜇1 ≠ 𝜇2 and only
one of them is true] and where one can make meaningful cost-benefit trade-offs for choosing 𝛼 and
1 − 𝛽.”
(recovered from http://en.wikipedia.org/w/index.php?oldid=581635341)
Some of the differences between Fisher’s significance testing and Neyman-Pearson’s hypothesis testing
frameworks are summarized in Figure 1.
In the following, we will consider the T- and F-statistics and the associated T- and F-tests, with the aim of providing some background on the common use of these statistics in neuroimaging and psychological research. Based on the discussion above, it will become evident that the T-statistic and T-test conform more closely to the idea of null hypothesis significance testing and are best motivated from a null hypothesis in a scenario where no alternative hypothesis exists. The F-statistic and F-test, on the other hand, form a stronger hybrid between both approaches and are best motivated from the concept of a likelihood ratio test, in which two hypotheses exist explicitly but are finally evaluated using a null hypothesis significance testing approach.

For both the T- and F-statistics, we will first introduce their formulas and discuss their intuitive meaning irrespective of the framework of statistical testing. In a second step, we will then discuss their distribution under assumed null hypotheses and briefly introduce how they are used for statistical testing.
Figure 1. Significance testing vs. hypothesis testing.
(2) Definition and intuition of the T-statistic
For a “contrast vector” c ∈ ℝᵖ, the T-statistic is defined as

T : ℝᵖ → ℝ, β̂ ↦ T(β̂) ≔ cᵀβ̂ / √(σ̂² cᵀ(XᵀX)⁻¹c)  (1)
The first thing to note about this definition is that T is a function of the data y by means of the beta and variance estimators β̂ and σ̂². To get at the intuition behind this definition, we first consider the case of the GLM representing independent and identical sampling of n data points y₁, …, yₙ from a univariate Gaussian distribution, in the form

p(y) = N(y; Xβ, σ²Iₙ)  (2)
where y ∈ ℝⁿ, β ∈ ℝ, σ² > 0, and the design matrix is given by a vector of ones, X = (1, 1, …, 1)ᵀ ∈ ℝⁿˣ¹. As discussed previously, in this case the beta parameter can be interpreted as the expectation parameter of the univariate Gaussian distribution from which the n samples are obtained, and it is estimated by the OLS beta estimator according to the sample mean ȳ:

β̂ ≔ (XᵀX)⁻¹Xᵀy = (1/n) Σᵢ₌₁ⁿ yᵢ =: ȳ  (3)
We further assume the value of c ∈ ℝ to be c ≔ 1. In this case, the T-statistic defined in (1) evaluates to

T ≔ cᵀβ̂ / √(σ̂² cᵀ(XᵀX)⁻¹c) = (1 · β̂) / √(σ̂² · 1 · n⁻¹ · 1) = ȳ / (σ̂/√n)

⇔ T = Sample mean / Standard error of the sample mean  (4)

which may be recognized as the T-statistic for “one-sample t-tests” familiar from undergraduate courses in statistics.
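This equivalence is easy to verify numerically. The following sketch (illustrative, not part of the original text; the data values are arbitrary) evaluates the GLM form of the T-statistic for a design matrix of ones and compares it with the classical one-sample formula ȳ / (s/√n):

```python
import numpy as np

rng = np.random.default_rng(3)

# n i.i.d. samples from a univariate Gaussian; design matrix of ones
n = 15
y = rng.normal(0.4, 1.0, size=n)
X = np.ones((n, 1))
c = np.array([1.0])

# GLM-based T-statistic: c'beta_hat / sqrt(sigma2_hat * c'(X'X)^{-1}c)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (n - X.shape[1])
T = (c @ beta_hat) / np.sqrt(sigma2_hat * c @ np.linalg.inv(X.T @ X) @ c)

# classical one-sample t-statistic: sample mean / standard error of the mean
t_classical = y.mean() / (y.std(ddof=1) / np.sqrt(n))
print(T, t_classical)
```

Both expressions yield the same value, since for X = (1, …, 1)ᵀ we have XᵀX = n and n − p = n − 1.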
A more general intuition behind the T-statistic expressed in this form is

T = Effect size / Sample-size-scaled data variability  (5)

because ȳ corresponds to the sample average and its distance from the (implicit) “null assumption” that this expectation is zero, and σ̂ corresponds to the sample standard deviation, which is a measure of the data variability. High absolute T values thus have the following interpretation: in comparison to the variability of the data, the effect size is large, either in the positive or negative direction. Conversely, low absolute T values, i.e., T values close to zero, speak to a small effect size in comparison to the data variability. Another way to frame this is to say that “the T-statistic measures the effect size with respect to the yardstick of data variability”.
This interpretation carries over to the more general case of more than one beta parameter, i.e., β ∈ ℝᵖ, p > 1. We first consider the case in which c ∈ ℝᵖ contains all zeros, except for a one at the jth entry. In this case, the T-statistic is given by

T = β̂ⱼ / √(σ̂² ((XᵀX)⁻¹)ⱼⱼ)  (6)

where ((XᵀX)⁻¹)ⱼⱼ denotes the (j, j)th entry of the matrix (XᵀX)⁻¹. Notably, as we will see below, the denominator in (6) is a measure of the variance of β̂ⱼ, which scales with the general variance parameter estimate σ̂². (Note that also in the familiar T-statistic as given in equation (4), the denominator is in fact the “standard error of the mean” σ̂/√n, which is, intuitively, a measure of the variance of the sample average ȳ. However, for fixed n it scales with the general variance parameter σ².) In other words, as for the univariate Gaussian case, with a contrast vector singling out a specific beta parameter estimate β̂ⱼ (j ∈ ℕₚ), the T-statistic is a measure of the effect size associated with the jth independent variable relative to the data variability.

Finally, the general form of the T-statistic as in (1) allows for the evaluation of any kind of linear combination (i.e., weighted sum) of beta parameter estimates. Commonly encountered contrasts in FMRI data analysis are contrasts that test whether one parameter estimate β̂ᵢ is larger than another parameter estimate β̂ⱼ: if β̂ᵢ is larger than β̂ⱼ, then β̂ᵢ − β̂ⱼ > 0, and thus the corresponding contrast vector c ∈ ℝᵖ contains all zeros, except at the ith position, where it contains a 1, and at the jth position, where it contains a −1.
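A general contrast-based T-statistic routine following equation (1) might look as follows (an illustrative sketch; the design matrix, data, and contrast below are arbitrary):

```python
import numpy as np

def t_statistic(X, y, c):
    """T-statistic c'beta_hat / sqrt(sigma2_hat * c'(X'X)^{-1}c) for contrast c."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = (resid @ resid) / (n - p)
    return (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)

# two-condition design: one column of ones per group of 10 observations
rng = np.random.default_rng(4)
X = np.kron(np.eye(2), np.ones((10, 1)))      # 20 x 2 design matrix
y = X @ np.array([2.0, 1.0]) + rng.normal(0.0, 1.0, size=20)

# contrast c = (1, -1)' testing whether beta_1 is larger than beta_2
c = np.array([1.0, -1.0])
print(t_statistic(X, y, c))
```

Note that negating the contrast vector simply flips the sign of the statistic, reflecting the direction of the tested difference.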
(3) The T-statistic null distribution
How can the T-statistic as defined in equation (1) of the previous section be used to implement a null
hypothesis test as outlined in the beginning of this Section? To this end, the T-statistic must first be
expanded with an assumed parameter value 𝛽0 ∈ ℝ𝑝 representing a null hypothesis 𝐻0 (note that 𝛽0 ∈ ℝ
𝑝
is not required to be zero, and hence that a “null hypothesis” is not required to be a “nil hypothesis” of”),
and second, the sampling distribution of this “extended”, i.e., more general, T-statistics must be assessed.
Formally, a “null hypothesis” corresponds to the assumption that the data 𝑦 is sampled from the
following “null hypothesis distribution”
𝐻0 ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽0, 𝜎2𝐼𝑛) (1)
The “extended” T-statistic takes the form
\[
T : \mathbb{R}^p \to \mathbb{R}, \quad \hat{\beta} \mapsto T(\hat{\beta}) := \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\hat{\sigma}^2\, c^T (X^T X)^{-1} c}} \qquad (2)
\]
Note that this definition corresponds to the T-statistic introduced in equation (1) of the previous section for the case $\beta_0 := 0 \in \mathbb{R}^p$. The introduction of the term $-c^T\beta_0$ above merely ensures that the expectation of $T$ is zero if the null hypothesis is true and $c^T\hat{\beta}$ indeed has the expected value $c^T E(\hat{\beta}) = c^T\beta_0$.
Having established a more general form of the T-statistic, we now ask how this statistic is distributed under the null hypothesis (1). To this end, we briefly rehearse what we have learned about the data and parameter estimator distributions of the GLM so far. First, the probability distribution of the data is given by
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (3)
Based on (3), the distribution of the OLS beta estimator is given by the linear transformation theorem for Gaussian random variables as
𝑝(�̂�) = 𝑁(�̂�; 𝛽, 𝜎2(𝑋𝑇𝑋)−1 ) (4)
Applying the linear transformation theorem for Gaussian random variables to $\hat{\beta}$ (and noting that $(c^T)^T = c$), one sees that the distribution of the OLS beta parameter contrast $c^T\hat{\beta}$ is given by the univariate Gaussian distribution
𝑝(𝑐𝑇�̂� ) = 𝑁(𝑐𝑇�̂� ; 𝑐𝑇𝛽, 𝜎2𝑐𝑇(𝑋𝑇𝑋)−1𝑐) (5)
Finally, we know that the scaled variance parameter estimator $\frac{n-p}{\sigma^2}\hat{\sigma}^2$ is distributed according to a chi-squared distribution

\[
p\left(\frac{n-p}{\sigma^2}\hat{\sigma}^2\right) = \chi^2\left(\frac{n-p}{\sigma^2}\hat{\sigma}^2;\, n-p\right) \qquad (6)
\]
Based on these results, one can now ask how the ratio between linear combinations of the parameter estimates and the variance, i.e., the extended T-statistic, is distributed. To this end, the reader is first encouraged to review the definition of the $t$-distribution. Most importantly, the $t$-distribution describes the probability distribution of a random variable $T$ that is the ratio of a standard normally distributed random variable and the square root of a chi-squared distributed random variable with $n$ degrees of freedom divided by $n$. To establish that the T-statistic as defined in (2) is in fact distributed according to a $t$-distribution, we continue from the distribution of the contrasted OLS beta estimator (5)
and the distribution of the variance estimator (6), and ask how these random variables have to be modified to match those appearing in the definition of the $t$-distribution.
Figure 1 Analytical and sampling distributions for a GLM representing independent and identical sampling from a univariate Gaussian distribution. For the upper panels, the null hypothesis corresponds to $\beta_0 = 0$ and for the lower panels to $\beta_0 = 2$. The first panel in each row depicts the univariate Gaussian probability density function $N(y; \beta_0, 1)$ from which a realization of $n = 10$ data points is drawn. The second panels depict, for 100 draws of $n = 10$ data points, the ensuing sampling distribution (histogram estimate) of $c^T\hat{\beta}$, where $c^T = 1$, and the corresponding analytical distribution. Note that the beta parameter estimates (sample means) are centered around the corresponding null hypothesis parameters $\beta_0$. The third panels depict the empirical and analytical distributions of the ensuing $\hat{\sigma}^2$ estimates multiplied by $\frac{n-p}{\sigma^2} = 9$, which are both distributed according to a chi-squared distribution with 9 degrees of freedom. Finally, the last panels depict the resulting empirical and analytical distributions of the "extended" T-statistic as defined in equation (2). Note that irrespective of the underlying null hypothesis parameters $\beta_0$, both are centered around 0.
Firstly, to render $c^T\hat{\beta}$ a random variable that is distributed according to a standard normal distribution, we have to $z$-transform it. Recall that the $z$-transform maps a normally distributed random variable $x$ to a standard normally distributed random variable $z$. In other words, if

\[
p(x) = N(x; \mu, \sigma^2) \qquad (7)
\]

then

\[
p\left(\frac{x-\mu}{\sigma}\right) =: p(z) = N(z; 0, 1) \qquad (8)
\]
$z$-transformation of $c^T\hat{\beta}$ yields the standard normally distributed random variable

\[
X := \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\sigma^2\, c^T (X^T X)^{-1} c}} \qquad (9)
\]
Note that this is not the T-statistic as defined in (2), because the denominator contains the true, but unknown, GLM variance parameter $\sigma^2$ rather than the variance estimator $\hat{\sigma}^2$. However, division of $X$ by the square root of the chi-squared random variable $Y := \frac{n-p}{\sigma^2}\hat{\sigma}^2$ divided by its degrees of freedom $n - p$ yields

\[
\frac{X}{\sqrt{Y/(n-p)}}
= \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\sigma^2\, c^T (X^T X)^{-1} c}} \cdot \frac{\sqrt{n-p}}{\sqrt{\frac{n-p}{\sigma^2}\hat{\sigma}^2}}
= \frac{\left(c^T\hat{\beta} - c^T\beta_0\right)\sqrt{n-p}}{\sqrt{\sigma^2/\sigma^2}\,\sqrt{c^T (X^T X)^{-1} c}\,\sqrt{n-p}\,\sqrt{\hat{\sigma}^2}}
= \frac{c^T\hat{\beta} - c^T\beta_0}{\sqrt{\hat{\sigma}^2\, c^T (X^T X)^{-1} c}} =: T \qquad (10)
\]
In summary, we have seen that the T-statistic as defined in (2) is, under the null hypothesis as defined in (1), distributed according to a $t$-distribution with $n - p$ degrees of freedom. Figure 1 above visualizes this result for the case of a GLM representing independent and identical sampling from a univariate Gaussian distribution, and Figure 2 visualizes this result for the case of a GLM representing a simple linear regression model.
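This distributional result can also be checked numerically by Monte Carlo simulation. The following sketch (assuming numpy and scipy are available; the settings mirror the Figure 1 scenario, but the seed and sample counts are arbitrary choices) repeatedly samples data under the null hypothesis and compares the resulting T-statistics with the analytical $t(n-p)$ distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# GLM for i.i.d. sampling from a univariate Gaussian: X is a column of ones
n, p = 10, 1
X = np.ones((n, 1))
beta_0 = np.array([2.0])      # null hypothesis value; need not be zero
c = np.array([1.0])
XtX_inv = np.linalg.inv(X.T @ X)

T_vals = []
for _ in range(2000):
    y = X @ beta_0 + rng.standard_normal(n)   # sample under H0 (sigma^2 = 1)
    beta_hat = XtX_inv @ X.T @ y
    sigma2_hat = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
    T_vals.append((c @ beta_hat - c @ beta_0)
                  / np.sqrt(sigma2_hat * (c @ XtX_inv @ c)))

# Compare the sampling distribution with the analytical t(n - p) = t(9) distribution
ks = stats.kstest(T_vals, stats.t(df=n - p).cdf)
print(ks.pvalue)   # typically large: samples are consistent with t(9)
```

A large Kolmogorov-Smirnov p-value indicates that the simulated T-statistics are consistent with the analytical $t(9)$ distribution, as in the last panels of Figure 1.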
Figure 2 Analytical and sampling distributions for a GLM representing a simple linear regression model. The first panel depicts a simple linear regression model based on the null hypothesis parameter setting $\beta_0 = (-1, 1)^T$ and variance parameter $\sigma^2 = 1$, together with a single realization from this model. The second panel depicts the beta parameter estimators obtained by sampling the model in the first panel 100 times. Note that the first component of these $\hat{\beta}$'s is distributed around $-1$ and the second component is distributed around $1$, in accordance with the null hypothesis parameter setting. The third panel depicts the analytical and empirical distribution of the product $c^T\hat{\beta}$ for $c = (0, 1)^T$. Note that while the distribution of $\hat{\beta}$ can be high-dimensional, the distribution of a product $c^T\hat{\beta}$ is always one-dimensional. The fourth panel depicts the empirical and analytical distributions of the ensuing $\hat{\sigma}^2$ estimates multiplied by $\frac{n-p}{\sigma^2} = 8$. Finally, the last panel depicts the ensuing empirical and analytical distribution of the T-statistic as defined in equation (2) for the current scenario.
(4) The T-statistic and null hypothesis significance testing
In summary, we have shown that the distribution of the T-statistic, as defined in equation (2) of the previous section, under a suitably chosen "null" assumption about the true, but unknown, parameter values $\beta \in \mathbb{R}^p$ is analytically available. In a given experimental context, one usually observes a single data set $y^* \in \mathbb{R}^n$, from which one can compute a single T-statistic value $T^* \in \mathbb{R}$ for a given "null hypothesis" $\beta_0 \in \mathbb{R}^p$. The logic of null hypothesis significance testing is then as follows: if the observed T-statistic value and more extreme values have a very small probability of occurring under the assumed null hypothesis/distribution (for example, a probability for it or more extreme values of less than 0.05), one may infer that it is not very likely that the data that gave rise to the computed T-statistic value was actually generated from a GLM for which the null hypothesis $H_0: \beta = \beta_0 \in \mathbb{R}^p$ holds true. Informally, we may state this as follows: if the observed data (i.e., the data observations proper and all summaries thereof, such as beta and variance estimators, and the T-statistic) is allocated a low probability density value under the assumption that the null hypothesis is true, one may reject the null hypothesis. In general, it is sensible not only to report whether one rejected or did not reject a null hypothesis based on some (arbitrarily chosen) threshold value, but to report the value of the T-statistic itself and the associated probability mass for more extreme values (the so-called "p-value"). Put simply, more information is conveyed by stating, e.g., that a t-test with $n$ degrees of freedom resulted in a T-value of $T = \cdots$ and an associated p-value of $p = \cdots$, than by merely stating that a null hypothesis was rejected because the p-value was below, e.g., 0.05.
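As a minimal illustration of reporting a T-value together with its two-sided p-value (assuming scipy is available; the observed T-value and degrees of freedom below are arbitrary placeholders):

```python
from scipy import stats

# Report a T-value with its two-sided p-value (numbers are arbitrary placeholders)
T_obs, dof = 2.31, 18
p_value = 2 * stats.t.sf(abs(T_obs), df=dof)
print(f"t({dof}) = {T_obs:.2f}, p = {p_value:.4f}")
```

Here `stats.t.sf` gives the probability mass above the observed value, which is doubled for a two-sided test.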
(5) Definition and intuition of the F-Statistic
In this section we first introduce the F-statistic and its associated distribution under a null hypothesis. From a GLM perspective, F-statistics and F-tests correspond to a model comparison procedure for "nested" GLMs, i.e., the comparison of a GLM with another GLM that is nested within the first. In a second part, we relate this formulation of the F-statistic to the perhaps more familiar notion of F-statistics in classical variance partitioning schemes of one-factorial ANOVA designs. The benefit of the first perspective is that it is based on a clearly defined set of modelling assumptions and corresponds to a fairly general view that is easily applied in "general" GLMs, while the second view is formulated ad hoc and readily applicable only in the case of a one-factorial ANOVA design.
In both approaches, various "sums-of-squares" play an important role. We thus first define the so-called "error sum-of-squares" and the "residual sum-of-squares" for a GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) ⇔ 𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (1)
where 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝑛, 𝑝 ∈ ℕ, 𝜎2 > 0, 𝐼𝑛 ∈ ℝ𝑛×𝑛 and random vectors 𝑦, 휀 ∈ ℝ𝑛.
The “error sum-of-squares” is a theoretical quantity that relates to the deviations of the random
variables 𝑦𝑖 from their expectation (𝑋𝛽)𝑖 (i.e. the 𝑖th entry in the 𝑛-dimensional vector 𝑋𝛽 ∈ ℝ𝑛). Of course,
these quantities correspond by definition to the error terms 휀𝑖, 𝑖 = 1,… , 𝑛. Squaring the deviations
휀𝑖 = 𝑦𝑖 − (𝑋𝛽)𝑖 and summing over all 𝑛 deviations yields the “error sum-of-squares”, which can be written
in matrix notation as
\[
\varepsilon^T\varepsilon = (y - X\beta)^T(y - X\beta) = \sum_{i=1}^{n}\left(y_i - (X\beta)_i\right)^2 = \sum_{i=1}^{n}\varepsilon_i^2 \qquad (2)
\]
The "residual sum-of-squares" is a different entity. It refers to the sum of squared deviations between the realized data points $y_i$ and their GLM predictions based on the OLS estimates of the $\beta$ parameters, i.e., the deviations $e_i := y_i - (X\hat{\beta})_i$ for $i = 1, \ldots, n$. Squaring these deviations and summing over all $n$ yields the "residual sum-of-squares", which can be written in matrix notation as

\[
e^T e = (y - X\hat{\beta})^T(y - X\hat{\beta}) = \sum_{i=1}^{n}\left(y_i - (X\hat{\beta})_i\right)^2 = \sum_{i=1}^{n}e_i^2 \qquad (3)
\]
Below, we will be mainly concerned with various forms of “residual sum-of-squares”.
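The distinction between the two quantities can be made concrete numerically. In the following sketch (assuming numpy is available; the design and parameter values are arbitrary), the error sum-of-squares uses the true $\beta$, which is accessible only in simulation, while the residual sum-of-squares uses the OLS estimate; because OLS minimizes the sum of squares, the latter can never exceed the former:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simple linear regression GLM with known true beta (available only in simulation)
n = 50
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta_true = np.array([1.0, 2.0])
eps = 0.5 * rng.standard_normal(n)
y = X @ beta_true + eps

# Error sum-of-squares: squared deviations from the true expectation X beta
ess = (y - X @ beta_true) @ (y - X @ beta_true)     # equals eps @ eps

# Residual sum-of-squares: squared deviations from the OLS prediction X beta_hat
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
rss = e @ e

# OLS minimizes the sum of squares, so the residual SS never exceeds the error SS
print(rss <= ess)   # → True
```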
The F-Statistic in a GLM context can be motivated from the perspective of likelihood ratio-based
model comparison. To make this transparent, consider again the simple linear regression GLM
𝑋𝛽 + 휀 = 𝑦, 𝑝(휀) = 𝑁(휀; 0, 𝜎2𝐼𝑛) (1)
where $y, \varepsilon \in \mathbb{R}^n$, $X \in \mathbb{R}^{n\times 2}$, $\beta \in \mathbb{R}^2$, $\sigma^2 > 0$, and $I_n$ denotes the $n \times n$ identity matrix, and where the design matrix, parameter vector, and data vector take the form

\[
X := \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\beta := \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \quad\text{and}\quad
y := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad (2)
\]
and where the $x_i$ ($i = 1, \ldots, n$) carry the notion of "independent variables", $\beta_1$ corresponds to the simple linear regression "offset" and $\beta_2$ to the simple linear regression "slope". The central question we would like to address in this section is whether the inclusion of the second column of the design matrix (the $x_i$'s) is beneficial in explaining an observed data set $y \in \mathbb{R}^n$ whose generative process is assumed to be unknown.
To treat the general case from the outset, we consider a GLM with 𝑝 ∈ ℕ design matrix
columns/beta parameters, which we will refer to as the “full model”. We aim for a comparison of models
including the set of all 𝑝 regressors and their associated effect sizes with a model comprising 𝑝1 < 𝑝
regressors, which we will refer to as the “reduced model”. Note that we can partition any design matrix
𝑋 ∈ ℝ𝑛×𝑝 with 𝑝 > 1 into two components
\[
X = (X_1\;\; X_2), \quad\text{where } X_1 \in \mathbb{R}^{n\times p_1},\; X_2 \in \mathbb{R}^{n\times p_2} \text{ and } p_1 + p_2 = p \qquad (3)
\]
and obtain two separate models, if we also partition 𝛽 ∈ ℝ𝑝 accordingly into 𝛽 ≔ (𝛽1, 𝛽2)𝑇. In the simple
linear regression example, the corresponding design matrix partition would be
\[
X_1 := \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^{n\times p_1}
\quad\text{and}\quad
X_2 := \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^{n\times p_2},
\quad\text{where } p_1 = p_2 = 1 \text{ and } p_1 + p_2 = 2 \qquad (4)
\]
In words, the central question in this case is whether the variability in the independent variable (the 𝑥𝑖’s) is
actually important to describe the observed data variability, or whether the assumption of independent and
identical sampling from a univariate Gaussian is sufficient to explain it.
The Neyman-Pearson notion of "likelihood-ratio testing" offers a principled means to address this question. In general terms, the idea of likelihood ratio testing is to fit two alternative models $m_1$ and $m_2$ to a given data set using the maximum-likelihood method and then to compute the ratio of the two maximized likelihoods. In other words, one compares the probabilities $p_{m_1}(y^*)$ and $p_{m_2}(y^*)$ of both models to account for the same data $y^*$ under the optimal parameter settings of each model. If the probability of observing the data $y^*$ under, say, the optimized model $m_1$ is much higher than under the optimized model $m_2$, then one concludes that $m_1$ is a better model for the data. Note that because logarithmic transforms turn ratios into differences, and we actually work with the log likelihood function below, we will formulate "likelihood-ratio testing" as "log-likelihood difference testing" in what follows.
To apply the idea of likelihood ratios (= log likelihood differences) in the context of the GLM,
suppose that one estimates the beta parameters of a GLM comprising only the design matrix subset 𝑋1 using
the maximum likelihood principle, where we assume for the moment that $\sigma^2 > 0$ is known. As we saw above, the maximized log likelihood is given by

\[
\max_{\beta_1 \in \mathbb{R}^{p_1}} \ell(\beta_1)
= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X_1\hat{\beta}_1)^T(y - X_1\hat{\beta}_1)
= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_1^T e_1 \qquad (5)
\]
where we defined
𝑒1 ≔ 𝑦 − 𝑋1�̂�1 (6)
and $e_1^T e_1$ is the "residual sum-of-squares term of the reduced model". Consider next maximizing the log likelihood function of the full model $X = (X_1\; X_2)$ with all predictors. In this case, the maximized log likelihood is given by

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^{p}} \ell(\beta)
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\hat{\beta})^T(y - X\hat{\beta}) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix}\right)^T \left(y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix}\right) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_{12}^T e_{12}
\end{aligned} \qquad (7)
\]
where we defined

\[
e_{12} := y - (X_1\; X_2)\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix} \qquad (8)
\]
and refer to $e_{12}^T e_{12}$ as the "residual sum-of-squares term of the full model". Taking the difference of the two maximized log-likelihoods then yields a basic statistic for model comparison according to the likelihood ratio principle:
\[
\begin{aligned}
\Delta\ell &:= \max_{\beta \in \mathbb{R}^p} \ell(\beta) - \max_{\beta_1 \in \mathbb{R}^{p_1}} \ell(\beta_1) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_{12}^T e_{12} - \left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}e_1^T e_1\right) \\
&= \frac{1}{2\sigma^2}\left(e_1^T e_1 - e_{12}^T e_{12}\right)
\end{aligned} \qquad (9)
\]
The important thing to note about (9) is that $e_1^T e_1$ corresponds to the residual sum-of-squares resulting from fitting the reduced (smaller) model, comprising fewer regressors, to the data, while $e_{12}^T e_{12}$ corresponds to the residual sum-of-squares resulting from fitting the full model, in which, in addition to the first $p_1$ regressors, the additional $p_2$ regressors are used to model the data.
The log-likelihood difference $\Delta\ell$ is the major building block of the F-statistic, which is defined as

\[
F : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+, \quad (e_1, e_{12}) \mapsto F(e_1, e_{12}) := \frac{(e_1^T e_1 - e_{12}^T e_{12})/p_2}{e_{12}^T e_{12}/(n-p)} \qquad (10)
\]
To obtain an intuition about the definition of the F-statistic, we first consider its numerator. To this end, it is helpful to consider two scenarios: firstly, that the data on which it is based is in fact generated from the reduced model, and secondly, that the data is in fact generated by the full model. If the data is generated from the reduced model, the parameter estimates for the additional regressors of the full model will tend to zero, and the residual errors $e_1$ and $e_{12}$ will both be normally distributed around zero. If the data is generated from the full model, the residual errors $e_1$ resulting from fitting the reduced model will be larger than the residual errors resulting from fitting the full model. The numerator of the F-statistic thus represents the reduction in the residual errors resulting from including the additional $p_2$ regressors, per additional regressor included. If this reduction is small, i.e., the additional regressors do not explain much data variance, the F-value will be small. If this reduction is large, and the full model is a much better model of the data, the F-value will be large. Regardless of the generating model, the denominator of the F-statistic is an estimator of the variance parameter $\sigma^2$: if the data is in fact generated from the reduced model, the beta parameter estimates for the regressors of the full model that are not part of the reduced model will tend to zero, and the residual sum-of-squares is evaluated as for the reduced model. The F-statistic thus measures the reduction of the residual sum-of-squares attributable to the inclusion of the $p_2$ additional regressors with respect to the reduced model, per regressor, normalized by the estimated variance of the data.
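The F-statistic of equation (10) can be evaluated directly from the residual sums-of-squares of the reduced and full models. A minimal sketch (assuming numpy is available; the simple linear regression design and parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Full model: intercept + slope (p = 2); reduced model: intercept only (p1 = 1)
n = 30
x = np.linspace(-1, 1, n)
X1 = np.ones((n, 1))
X = np.column_stack([X1, x])
y = X @ np.array([1.0, 0.8]) + rng.standard_normal(n)

def rss(M, y):
    """Residual sum-of-squares of the OLS fit of y on M."""
    beta_hat = np.linalg.lstsq(M, y, rcond=None)[0]
    e = y - M @ beta_hat
    return e @ e

p = X.shape[1]
p2 = p - X1.shape[1]
e1Te1, e12Te12 = rss(X1, y), rss(X, y)
F = ((e1Te1 - e12Te12) / p2) / (e12Te12 / (n - p))
print(F >= 0.0)   # → True (the full model's residual SS is never larger)
```

A large $F$ value here would indicate that including the slope regressor substantially reduces the residual sum-of-squares relative to the intercept-only model.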
(6) The F-Statistic Null Distribution
As for the T-statistic, one may ask how the F-statistic as defined above is distributed if a "null hypothesis" about the full model holds. To establish this distribution, we first consider the distributions of the F-statistic's numerator and denominator under a null hypothesis, to be defined, separately.

Assume, as in the set-up above with $\varepsilon, y \in \mathbb{R}^n$, that we partition a design matrix $X \in \mathbb{R}^{n\times p}$ in the form $X = (X_1\; X_2)$, where $X_1 \in \mathbb{R}^{n\times p_1}$ and $X_2 \in \mathbb{R}^{n\times p_2}$, and that the correspondingly partitioned beta parameter vector is given by $\beta := (\beta_1, \beta_2)^T \in \mathbb{R}^{p_1+p_2}$. Now, if the null hypothesis
𝐻0: 𝛽2 = 0 ∈ ℝ𝑝2 (1)
is true, the full model
𝑋𝛽 + 휀 = 𝑦 (2)
reduces to the model
𝑋1𝛽1 + 휀 = 𝑦 (3)
As above, let $e_{12}^T e_{12}$ denote the residual sum-of-squares of the full model, and let $e_1^T e_1$ denote the residual sum-of-squares of the reduced model. Under the null assumption $\beta_2 = 0$, sampling from the full model yields a distribution of the variance-parameter-scaled "extra sum-of-squares" $e_1^T e_1 - e_{12}^T e_{12}$ that corresponds to a $\chi^2$-distribution with $p_2$ degrees of freedom

\[
p\left(\frac{e_1^T e_1 - e_{12}^T e_{12}}{\sigma^2}\right) = \chi^2\left(\frac{e_1^T e_1 - e_{12}^T e_{12}}{\sigma^2};\, p_2\right) \qquad (4)
\]
Note that we have a $\chi^2$ distribution for $(e_1^T e_1 - e_{12}^T e_{12})/\sigma^2$, while the actual numerator of the F-statistic is $(e_1^T e_1 - e_{12}^T e_{12})/p_2$. We will return to this below, but first consider the denominator of the F-statistic.
The denominator of the F-statistic, $e_{12}^T e_{12}/(n-p)$, corresponds to the standard variance parameter estimator $\hat{\sigma}^2$ as discussed previously, where we have seen that $\frac{n-p}{\sigma^2}\hat{\sigma}^2$ is distributed according to a chi-squared distribution with $n - p$ degrees of freedom

\[
p\left(\frac{n-p}{\sigma^2}\cdot\frac{e_{12}^T e_{12}}{n-p}\right) = p\left(\frac{e_{12}^T e_{12}}{\sigma^2}\right) = \chi^2\left(\frac{e_{12}^T e_{12}}{\sigma^2};\, n-p\right) \qquad (5)
\]
Taken together, we see that, under the null hypothesis 𝐻0: 𝛽2 ≔ 0 the F-statistic corresponds roughly to a
ratio of chi-squared distributed random variables with 𝑝2 and 𝑛 − 𝑝 degrees of freedom.
The distribution of ratios of chi-squared random variables is called an $f$-distribution, as the reader may now review in the Mathematical Preliminaries section. To use the $f$-distribution as defined therein, we have to reformulate the distributional results (4) and (5) in accordance with it, which will eventually yield the F-statistic as defined in the previous section. Firstly, forming the ratio of the chi-squared distributed random variables above yields
\[
\frac{\left(e_1^T e_1 - e_{12}^T e_{12}\right)/\sigma^2}{e_{12}^T e_{12}/\sigma^2} = \frac{e_1^T e_1 - e_{12}^T e_{12}}{e_{12}^T e_{12}} \qquad (6)
\]
The right-hand side of (6) is a ratio of two chi-squared distributed random variables. In the definition of the $f$-distribution, these are divided by their respective degrees of freedom. Dividing the numerator by $p_2$ and the denominator by $n - p$ then yields the F-statistic in the form (10):
\[
F = \frac{(e_1^T e_1 - e_{12}^T e_{12})/p_2}{e_{12}^T e_{12}/(n-p)} \qquad (7)
\]
In other words, under the null hypothesis $H_0: \beta_2 = 0 \in \mathbb{R}^{p_2}$, the F-statistic as defined in (7) is distributed according to an $f$-distribution with $(p_2, n-p)$ degrees of freedom.
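This null distribution result can again be checked by simulation. The following sketch (assuming numpy and scipy are available; the settings mirror the simple linear regression scenario with $p_2 = 1$ and $n - p = 8$, but the seed and sample count are arbitrary) repeatedly samples data under $H_0: \beta_2 = 0$ and compares the resulting F-statistics with the analytical $f$-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simple linear regression under H0: beta_2 = 0, so p2 = 1 and n - p = 8
n, p, p2 = 10, 2, 1
X1 = np.ones((n, 1))
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])

def rss(M, y):
    """Residual sum-of-squares of the OLS fit of y on M."""
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    r = y - M @ b
    return r @ r

F_vals = []
for _ in range(2000):
    y = 1.0 + rng.standard_normal(n)          # offset only: the null hypothesis holds
    e1, e12 = rss(X1, y), rss(X, y)
    F_vals.append(((e1 - e12) / p2) / (e12 / (n - p)))

# The sampling distribution should match the analytical f(p2, n - p) distribution
ks = stats.kstest(F_vals, stats.f(dfn=p2, dfd=n - p).cdf)
print(ks.pvalue)   # typically large: samples are consistent with f(1, 8)
```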
(7) The F-statistic and null hypothesis significance testing
In summary, if we observe an F-statistic value which under this null distribution has a very small probability of occurring (for example, a probability for it or more extreme values of less than 0.05), we can infer that it is not very likely that the data that gave rise to the computed F-statistic value was actually generated from a GLM for which the null hypothesis $H_0: \beta_2 = 0 \in \mathbb{R}^{p_2}$ holds true. The idea of the F-statistic and its distribution under a null hypothesis is visualized for the case of a simple linear regression model in Figure 1.
Figure 1 This figure visualizes the idea of the null distribution of the F-statistic as defined in (6.28) for the special case of the simple linear regression GLM. Specifically, the full model for simple linear regression corresponds to a design matrix $X := (X_1\; X_2) \in \mathbb{R}^{n\times 2}$, where $X_1 \in \mathbb{R}^{n\times 1}$ is a column of ones, modelling the regression line offset, and $X_2 \in \mathbb{R}^{n\times 1}$ comprises the values of the univariate independent variable, modelling the regression line slope. For the upper left panel, 200 samples were obtained from a simple linear regression model for $n = 10$ data points, for which the true, but unknown, $\beta_1$ parameter was set to $\beta_1 = 1$ and the true, but unknown, $\beta_2$ parameter was set to $\beta_2 = 0$, conforming to the null hypothesis $H_0: \beta_2 = 0$. The lower left panel depicts the empirical distribution (gray histogram bars) of the F-statistic evaluated over the 200 samples. Additionally, it depicts the $f$-distribution probability density function (red line) for $p_2 = 1$ and $n - p = 10 - 2$ degrees of freedom. In correspondence with the theory discussed in section 6.5, the empirical distribution conforms to the analytical $f$-distribution. The right panels depict the case in which the null hypothesis $H_0: \beta_2 = 0$ does not hold. Specifically, here the true, but unknown, beta parameter vector was set to $\beta = (1, 1.1)^T$ and 200 samples were obtained as on the left side. The lower right panel shows the empirical and analytical distribution of the F-statistic. Because the null hypothesis $H_0: \beta_2 = 0$ does not hold in this scenario, the empirical distribution does not conform to the $f$-distribution.
(8) Classical variance partitioning formula and the GLM formulation of the F-statistic
In conventional undergraduate statistics courses, the F-statistic is usually introduced in the context of single-factor ANOVA designs. In this context, the F-statistic refers to the ratio of "between-group variance" (also referred to as "treatment variance") and "within-group variance" (also referred to as "error variance"). The aim of the current section is to link this "traditional ANOVA" view of the F-statistic to the "full and reduced GLM model comparison" view of the previous sections. To this end, we will first review the single-factor ANOVA design with independent measures and the associated variance partitioning approach of classical ANOVA. We will next relate the variance partitioning scheme to the structural form of the corresponding full and reduced GLMs and finally demonstrate the equivalence of both approaches.
Conventional treatments introduce single-factor ANOVA as the extension of a two-sample T-test to more than two groups. This is usually motivated by stating that to assess "significant differences" between, say, three group means, a series of two-sample T-tests would be required (group 1 vs. group 2, group 2 vs. group 3, and group 1 vs. group 3), resulting in a multiple testing procedure and the problem of an increased probability of Type I errors. From this view, the F-test offers an alternative by providing a single test procedure that allows for the assessment of the null hypothesis that the population means of all three groups are in fact identical, or, equivalently, that the three treatment groups were sampled from the same population. The categorical independent variable in the context of ANOVA designs is referred to as a "factor", and the different values that it may assume (for example treatment 1, treatment 2, and treatment 3) are referred to as "levels". Data acquired in a one-factorial design with a balanced scheme, i.e., with equal group sizes, then takes the following form
Group 1      Group 2      ⋯      Group p
y_{11}       y_{21}       ⋯      y_{p1}
y_{12}       y_{22}       ⋯      y_{p2}
y_{13}       y_{23}       ⋯      y_{p3}
⋮            ⋮            ⋱      ⋮
y_{1n_1}     y_{2n_2}     ⋯      y_{pn_p}
Table 1. Data layout of a one-factorial ANOVA design.
In the table, the entry $y_{ij} \in \mathbb{R}$ refers to the data obtained from the $j$th experimental unit ("subject") in the $i$th experimental group ("level of the experimental factor"), where $j = 1, \ldots, n_i$ and $i = 1, \ldots, p$. Note that the subscript indices in Table 1 do not correspond to matrix row and column indices, but are reversed in this respect. $n_i \in \mathbb{N}$ is the number of experimental units in group $i$, and the assumption of a balanced design implies that $n_1 = n_2 = \cdots = n_p$. $p \in \mathbb{N}$ is the number of experimental groups, or levels of the experimental factor, and has to be larger than 1. The total number of data points (or experimental units) is given by $n = \sum_{i=1}^{p} n_i$. In conventional discussions of one-factorial designs it is usually additionally stated that the data points were obtained from "independent experimental units", for example, that each data point represents data acquired from a single human participant who contributed data only in this specific condition and in no other.
The fundamental idea of a one-factorial ANOVA is to assess whether the variability in the observed
data points is largely due to the variability in the independent variable, i.e. the differences between the
levels of the factor, or due to “inherent noise” in the dependent variable. This is achieved by assessing the
relative contributions of “treatment related variance” and “noise-related variance” in a partitioning of the
“overall variance“, which, intuitively, takes the following form
Overall data variance = Treatment-related variance + Noise-related variance (1)
If the ratio of treatment-related variance and noise-related variance

F-value ≈ Treatment-related variance / Noise-related variance (2)

(which roughly corresponds to the F-statistic, as we will see below) is large, one may infer that the treatment had some effect on the observed values, and the null hypothesis that the treatment had no effect on the observed data may be rejected. We next formalize these intuitions by introducing a number of quantities that can be computed from data represented as in Table 1.
Firstly, we can compute the average of all data points included in the table, that is, an "overall mean" or "grand mean"

\[
\bar{y} = \frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_i} y_{ij} \qquad (3)
\]
Of course, this quantity represents the average of all data points over all treatment levels. Likewise, we can compute $p$ treatment-level-specific averages or "group means"

\[
\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} \quad\text{for } i = 1, 2, \ldots, p \qquad (4)
\]
Note that there are 𝑝 group means, which reflect the average of all data points corresponding to a given
treatment level.
Secondly, we can use sums of squared differences between averages and individual data points to assess the variability about the means introduced above. To this end, we first define the sum of squared deviations from the grand mean, a quantity referred to as the "total sum-of-squares"

\[
SS_{Total} := \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2 \qquad (5)
\]
We next define the sum of squared deviations of the individual group means from the overall mean, weighted by the number of data points in each group, a quantity referred to as the "between-groups sum-of-squares" or "treatment sum-of-squares"

\[
SS_{between} := \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 \qquad (6)
\]
Finally, we can define the sum of squared deviations of all individual data points from their respective group means, a quantity referred to as the "within-groups sum-of-squares" or "error sum-of-squares"

\[
SS_{within} := \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 \qquad (7)
\]
Based on these definitions, it can be shown that the total sum-of-squares can be written as the sum of the between- and within-group sums-of-squares, i.e.

\[
SS_{Total} = SS_{between} + SS_{within} \qquad (8)
\]

or, in other words, that

\[
\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2 = \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 \qquad (9)
\]
Proof of equations (8) and (9)

We have

\[
\begin{aligned}
\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2
&= \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i + \bar{y}_i - \bar{y}\right)^2 \\
&= \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(\left(y_{ij} - \bar{y}_i\right) + \left(\bar{y}_i - \bar{y}\right)\right)^2 \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + \sum_{j=1}^{n_i} 2\left(y_{ij} - \bar{y}_i\right)\left(\bar{y}_i - \bar{y}\right) + \sum_{j=1}^{n_i}\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 2\left(\bar{y}_i - \bar{y}\right)\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right) + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 2\left(\bar{y}_i - \bar{y}\right)\left(\sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} y_{ij}\right) + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p}\left(\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + 0 + n_i\left(\bar{y}_i - \bar{y}\right)^2\right) \\
&= \sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2
\end{aligned}
\]

where the cross term vanishes because $\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right) = \sum_{j=1}^{n_i} y_{ij} - n_i\bar{y}_i = 0$. With the definitions (5) – (7) above, we thus have

\[
SS_{Total} = SS_{between} + SS_{within}
\]

□
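The partition of equation (8) holds exactly for any data set, which can also be verified numerically. A minimal sketch (assuming numpy is available; the balanced layout and group offsets are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical balanced one-factorial layout: p groups with n_i observations each
p, n_i = 3, 8
y = rng.standard_normal((p, n_i)) + np.array([[0.0], [1.0], [2.0]])  # group offsets

grand_mean = y.mean()
group_means = y.mean(axis=1)

ss_total = ((y - grand_mean) ** 2).sum()
ss_between = (n_i * (group_means - grand_mean) ** 2).sum()
ss_within = ((y - group_means[:, None]) ** 2).sum()

# The variance partition SS_Total = SS_between + SS_within holds exactly
print(np.isclose(ss_total, ss_between + ss_within))   # → True
```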
In order to define the F-statistic in this classical treatment of one-factorial ANOVA, the additional concepts of "degrees of freedom" and "mean squares" are introduced. In this context, the notion of degrees of freedom refers to the number of "independent data points" that remain once a mean over a group of data points has been computed. Specifically, for the case of the grand mean and its associated total sum-of-squares, we have $n$ values and, if the grand mean is known, $n - 1$ free choices for the data points. In other words, the number of degrees of freedom of $SS_{Total}$ is $df_{total} = n - 1$. Likewise, for the case of the grand mean, the group means, and the associated between-group sum-of-squares: if the grand mean is known, $p - 1$ of the group means may be chosen freely. The number of degrees of freedom of $SS_{between}$ is thus $df_{between} = p - 1$. Finally, for the case of the within-group sum-of-squares, we consider the $p$ group means and the individual data points. For each group data set, we have $n_i - 1$ degrees of freedom if the group mean is known. Because we have $p$ groups, the total degrees of freedom in this within sum-of-squares scenario are thus $df_{within} = p(n_i - 1) = n - p$. In summary, we obtain for the degrees of freedom of the sums-of-squares introduced above

\[
df_{total} = df_{between} + df_{within} \;\Leftrightarrow\; n - 1 = (p - 1) + (n - p) \qquad (10)
\]
Division of the sums-of-squares terms by their respective degrees of freedom then yields estimators for the total, between-group, and within-group variances, referred to as "mean squares"

\[
MS_{Total} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}\right)^2}{n-1}, \quad
MS_{Between} = \frac{\sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2}{p-1} \quad\text{and}\quad
MS_{Within} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2}{n-p} \qquad (11)
\]
Finally, the F-statistic is introduced as the ratio between the "between-group mean square" and the "within-group mean square"

\[
F = \frac{MS_{Between}}{MS_{Within}} = \frac{\sum_{i=1}^{p} n_i\left(\bar{y}_i - \bar{y}\right)^2 / (p-1)}{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 / (n-p)} \qquad (12)
\]
and is declared to be distributed according to an 𝑓-distribution with 𝑝 − 1, 𝑛 − 𝑝 degrees of freedom.
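The classical variance partitioning formula (12) is what standard ANOVA routines compute. A sketch (assuming numpy and scipy are available; the group data are arbitrary simulated values) compares a manual evaluation of (12) with scipy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical balanced one-factorial data: p = 3 groups with 10 units each
groups = [rng.normal(loc=m, size=10) for m in (0.0, 0.5, 1.0)]
p = len(groups)
n = sum(len(g) for g in groups)

# Classical variance partitioning: mean squares as in equation (11)
grand_mean = np.mean(np.concatenate(groups))
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (p - 1)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - p)
F_manual = ms_between / ms_within

# scipy's one-way ANOVA implements the same computation
F_scipy, p_value = stats.f_oneway(*groups)
print(np.isclose(F_manual, F_scipy))   # → True
```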
This treatment of one-factorial ANOVA is somewhat ad hoc. Specifically, an ANOVA data table is introduced without specifying the random variables of which the data are to be considered realizations, yet, at the very end, the distribution of the derived $F$ statistic is presented as known and given by a well-defined probability density function. Likewise, the definition of the three sums-of-squares and three mean squares does not feel well motivated, as other definitions may appear equally appropriate. That the computational scheme discussed above is nevertheless a sensible approach can be established by viewing the one-factorial ANOVA as a probabilistic model, namely a specific GLM, from the outset. In the following, we will establish this model and various aspects of it, and finally relate full and reduced versions of this model to the quantities evaluated in the previous section.
Assume again that an experimental design has resulted in a data layout as in Table 1. However, in
contrast to the previous section, now assume that the data points are realizations of independent univariate
Gaussian variables, such that
𝑝(𝑦1𝑗) = 𝑁(𝑦1𝑗; 𝜇1, 𝜎2) for 𝑗 = 1,… , 𝑛1 (13)
𝑝(𝑦2𝑗) = 𝑁(𝑦2𝑗; 𝜇2, 𝜎2) where 𝜇2 = 𝜇1 + 𝛼2 for 𝑗 = 1,… , 𝑛2 (14)
⋯
𝑝(𝑦𝑝𝑗) = 𝑁(𝑦𝑝𝑗; 𝜇𝑝, 𝜎2) where 𝜇𝑝 = 𝜇1 + 𝛼𝑝 for 𝑗 = 1,… , 𝑛𝑝 (15)
where $\mu_1, \alpha_2, \ldots, \alpha_p \in \mathbb{R}$ and $\sigma^2 > 0$. In other words, the data points in the first group of the ANOVA table are assumed to be realizations of independent Gaussian random variables, all with the same expectation parameter $\mu_1$ and the same variance parameter $\sigma^2$. The data points in the $i$th group, for $i = 2, \ldots, p$, are assumed to be realizations of independent Gaussian random variables with expectation parameter given by $\mu_i := \mu_1 + \alpha_i$, where $\mu_1$ corresponds to the expectation parameter of the first group, while $\alpha_i = \mu_i - \mu_1$ refers to a treatment-level-specific additional "effect", and with the same variance parameter $\sigma^2$ as for the first group. Notably, the null hypothesis that all data points are realizations from the same population corresponds to the assumption that $\alpha_2 = \alpha_3 = \cdots = \alpha_p = 0$. Further, there is a single variance parameter $\sigma^2$ that governs the variability of each random variable. Of course, the above may equivalently be written in structural form as
𝑦1𝑗 = 𝜇1 + 𝜀1𝑗 with 𝑝(𝜀1𝑗) = 𝑁(𝜀1𝑗; 0, 𝜎2) for 𝑗 = 1, …, 𝑛1 (16)
and
𝑦𝑖𝑗 = 𝜇1 + 𝛼𝑖 + 𝜀𝑖𝑗 with 𝑝(𝜀𝑖𝑗) = 𝑁(𝜀𝑖𝑗; 0, 𝜎2) for 𝑗 = 1, …, 𝑛𝑖 and 𝑖 = 2, …, 𝑝 (17)
As always, the 𝑛 univariate independent Gaussian random variables can be formulated as the univariate
projections of an 𝑛-dimensional Gaussian probability distribution (i.e. a GLM) by defining an appropriate
data vector, design matrix, beta parameter vector, and spherical covariance matrix. Specifically, the
equivalent GLM representation of the above is given by defining
𝑝(𝑦) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (18)
where 𝑛 = ∑_{𝑖=1}^{𝑝} 𝑛𝑖, 𝜎2 > 0 and
𝑦 ≔ (𝑦11, …, 𝑦1𝑛1, 𝑦21, …, 𝑦2𝑛2, …, 𝑦𝑝1, …, 𝑦𝑝𝑛𝑝)^𝑇 ∈ ℝ𝑛,
𝑋 ≔
( 1 0 ⋯ 0 ) (𝑛1 identical rows)
( 1 1 ⋯ 0 ) (𝑛2 identical rows)
( ⋮ ⋮ ⋱ ⋮ )
( 1 0 ⋯ 1 ) (𝑛𝑝 identical rows)
∈ ℝ𝑛×𝑝, 𝛽 ≔ (𝜇1, 𝛼2, …, 𝛼𝑝)^𝑇 ∈ ℝ𝑝 (19)
We now consider partitioning the GLM above according to
𝑋 = (𝑋1 𝑋2) ∈ ℝ𝑛×𝑝, where 𝑋1 ∈ ℝ𝑛×1 and 𝑋2 ∈ ℝ𝑛×(𝑝−1) (20)
and
𝛽 = (𝛽1, 𝛽2^𝑇)^𝑇, where 𝛽1 ≔ 𝜇1 ∈ ℝ and 𝛽2 ≔ (𝛼2, 𝛼3, …, 𝛼𝑝)^𝑇 ∈ ℝ𝑝−1 (21)
For this partitioning, the “reduced model” corresponds to
𝑝(𝑦) ≔ 𝑁(𝑦; 𝑋1𝛽1, 𝜎2𝐼𝑛) with 𝑦 ∈ ℝ𝑛 as in (19), 𝑋1 ≔ (1, …, 1)^𝑇 ∈ ℝ𝑛×1, 𝛽1 ≔ 𝜇 ∈ ℝ (22)
or, equivalently, in structural form to
𝑦𝑖𝑗 = 𝜇 + 𝜀𝑖𝑗 with 𝑝(𝜀𝑖𝑗) = 𝑁(𝜀𝑖𝑗; 0, 𝜎2) (𝑖 = 1, …, 𝑝, 𝑗 = 1, …, 𝑛𝑖) (23)
while the full model merely corresponds to the equivalent forms of equations (13) – (19).
Next, we consider the OLS 𝛽 estimators and the residual sums-of-squares of these reduced and full
models, respectively. It is straightforward to show that the OLS 𝛽 estimator for the reduced model is given
by the average over all data points, i.e.
𝛽̂1 = (1/𝑛) ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} 𝑦𝑖𝑗 =: ȳ (24)
Further, the residual sum-of-squares is given by
𝑒1^𝑇𝑒1 ≔ (𝑦 − 𝑋1𝛽̂1)^𝑇(𝑦 − 𝑋1𝛽̂1) = ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} (𝑦𝑖𝑗 − ȳ)² (25)
Notably, this quantity is identical to the total sum-of-squares as defined in equation (5). We thus have
𝑆𝑆Total = 𝑒1^𝑇𝑒1 (26)
or, in other words, the equality of the total sum-of-squares as defined in the classical variance partitioning
of the one-factorial ANOVA and the “residual sum-of-squares” of the reduced model in the GLM model
partitioning scheme of a one-factorial ANOVA GLM.
For the full model, the 𝛽 parameter estimator takes the form
𝛽̂ = (𝛽̂1, 𝛽̂2, …, 𝛽̂𝑝)^𝑇 = (𝜇̂1, 𝜇̂2 − 𝜇̂1, …, 𝜇̂𝑝 − 𝜇̂1)^𝑇
= ((1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗, (1/𝑛2) ∑_{𝑗=1}^{𝑛2} 𝑦2𝑗 − (1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗, …, (1/𝑛𝑝) ∑_{𝑗=1}^{𝑛𝑝} 𝑦𝑝𝑗 − (1/𝑛1) ∑_{𝑗=1}^{𝑛1} 𝑦1𝑗)^𝑇
=: (ȳ1, ȳ2 − ȳ1, …, ȳ𝑝 − ȳ1)^𝑇 (27)
Further, the residual error vector corresponds to
𝑒12 = 𝑦 − 𝑋𝛽̂
= (𝑦11 − ȳ1, …, 𝑦1𝑛1 − ȳ1, 𝑦21 − (ȳ1 + ȳ2 − ȳ1), …, 𝑦2𝑛2 − (ȳ1 + ȳ2 − ȳ1), …, 𝑦𝑝1 − (ȳ1 + ȳ𝑝 − ȳ1), …, 𝑦𝑝𝑛𝑝 − (ȳ1 + ȳ𝑝 − ȳ1))^𝑇
= (𝑦11 − ȳ1, …, 𝑦1𝑛1 − ȳ1, 𝑦21 − ȳ2, …, 𝑦2𝑛2 − ȳ2, …, 𝑦𝑝1 − ȳ𝑝, …, 𝑦𝑝𝑛𝑝 − ȳ𝑝)^𝑇 (28)
such that
𝑒12^𝑇𝑒12 = ∑_{𝑖=1}^{𝑝} ∑_{𝑗=1}^{𝑛𝑖} (𝑦𝑖𝑗 − ȳ𝑖)² (29)
Notably, this is identical to the within sum-of-squares as defined in equation (7). We thus have
𝑆𝑆Within = 𝑒12^𝑇𝑒12 (30)
Further, it follows by means of the equality (8), that
𝑆𝑆Total = 𝑆𝑆Between + 𝑆𝑆Within ⇔ 𝑆𝑆Between = 𝑆𝑆Total − 𝑆𝑆Within = 𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12 (31)
With 𝑝2 ≔ 𝑝 − 1 in the current scenario and
𝐹 = 𝑀𝑆Between/𝑀𝑆Within = (𝑆𝑆Between/(𝑝 − 1)) / (𝑆𝑆Within/(𝑛 − 𝑝)) = ((𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12)/𝑝2) / (𝑒12^𝑇𝑒12/(𝑛 − 𝑝)) (32)
we thus have the equivalence of the F-statistic as formulated in the classical variance partitioning scheme of
a one-factorial ANOVA design and the F-statistic as formulated as the comparison of a full and a reduced
model in the context of the equivalent one-factorial ANOVA GLM formulation.
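This equivalence can also be checked numerically. The following Python sketch is a minimal illustration, assuming numpy is available; the group sizes, group means, and random seed are arbitrary illustrative choices. It computes the F-statistic once via the classical variance partitioning as in equation (12), and once via the residual sums-of-squares of the reduced and full GLM as in equation (32):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-factor design: p = 3 groups with unequal group sizes n_i
ns = [4, 5, 6]
p, n = len(ns), sum(ns)
y = np.concatenate([rng.normal(mu, 1.0, size=ni)
                    for mu, ni in zip([1.0, 2.0, 0.5], ns)])

# Classical variance partitioning, equation (12)
ybar = y.mean()
groups = np.split(y, np.cumsum(ns)[:-1])
ss_between = sum(ni * (g.mean() - ybar) ** 2 for g, ni in zip(groups, ns))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F_classical = (ss_between / (p - 1)) / (ss_within / (n - p))

# Full design matrix as in equation (19) and reduced design matrix X1
X = np.zeros((n, p))
X[:, 0] = 1.0                         # constant first column
row = ns[0]
for i, ni in enumerate(ns[1:], start=1):
    X[row:row + ni, i] = 1.0          # indicator column for group i + 1
    row += ni
X1 = np.ones((n, 1))

def rss(D, y):
    """Residual sum-of-squares of the OLS fit for design matrix D."""
    e = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    return e @ e

# Full-versus-reduced model comparison, equation (32)
e1, e12 = rss(X1, y), rss(X, y)
F_glm = ((e1 - e12) / (p - 1)) / (e12 / (n - p))

print(np.isclose(F_classical, F_glm))
```

Both formulations yield the same value, since 𝑒1^𝑇𝑒1 is the total sum-of-squares and 𝑒12^𝑇𝑒12 the within sum-of-squares.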
Study Questions
1. Discuss commonalities and differences between Fisher’s significance testing and Neyman-Pearson’s hypothesis testing
frameworks.
2. Write down the formal definition of the (non-extended) T-Statistic and discuss its intuitive meaning.
3. The t-distribution is defined as the distribution of the random variable 𝑡 ≔ 𝑋/√(𝑌/𝑛). What do 𝑋, 𝑌 and 𝑛 refer to, and how are these entities related to the GLM framework?
4. Explain how the T-statistic can be used to test a null hypothesis.
5. Write down the formal definition of the F-Statistic and discuss its intuitive meaning.
6. The f-distribution is defined as the distribution of the random variable 𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛). What do 𝑌1, 𝑌2, 𝑚 and 𝑛 refer to, and how are these entities related to the GLM framework?
Study Questions Answers
1. Both Fisher’s and Neyman-Pearson’s frameworks are rooted in frequentist assumptions about distributions of data and ensuing
statistics given a true, but unknown, parameterized probabilistic model. Fisher’s framework requires the specification of a null
hypothesis (not necessarily a nil hypothesis in the sense of “no effect”) and uses the probability of observed data under this null
hypothesis to decide whether sufficient evidence against the null hypothesis has been gathered. In contrast, Neyman-Pearson’s
framework requires the explicit specification of two hypotheses and a commitment to acceptable Type I and Type II error rates, which
together define a decision criterion.
2. The T-Statistic in its “simple form” is given by
𝑇: ℝ𝑝 → ℝ, 𝛽̂ ↦ 𝑇(𝛽̂) ≔ 𝑐^𝑇𝛽̂ / √(𝜎̂2 𝑐^𝑇(𝑋^𝑇𝑋)^−1𝑐).
Intuitively, the numerator of the T-statistic, 𝑐^𝑇𝛽̂, is a measure of the effect size encoded in a linear combination of the beta parameter
estimates 𝛽̂ ∈ ℝ𝑝, where the type of the linear combination (for example, the selection of a specific subcomponent of 𝛽̂) is encoded
by 𝑐 ∈ ℝ𝑝. The denominator √(𝜎̂2 𝑐^𝑇(𝑋^𝑇𝑋)^−1𝑐) is a measure of the variance associated with the beta parameter estimate 𝛽̂, which scales with the estimated GLM variance parameter 𝜎̂2. Intuitively, the T-statistic is thus a ratio between an effect size and its variance. The larger the estimated effect in comparison to its estimated variance, the more “reliable” the effect may be considered.
3. In the definition of the random variable 𝑡 ≔ 𝑋/√(𝑌/𝑛), 𝑋 and 𝑌 refer to two independent scalar random variables. Specifically, 𝑋
refers to a random variable distributed according to a standard normal distribution 𝑁(𝑋; 0, 1), whereas 𝑌 refers to a random variable distributed according to a chi-square distribution with 𝑛 degrees of freedom, 𝜒2(𝑌; 𝑛). With respect to the GLM, 𝑋 corresponds roughly to a standardized parameter estimator contrast, 𝑌 roughly to an estimated variance parameter, and 𝑛 to the number of data points.
4. In a given experimental context, one usually observes a single data set 𝑦 ∈ ℝ𝑛 from which one can compute a single T-statistic
value 𝑇 ∈ ℝ for a given “null hypothesis” 𝛽0 ∈ ℝ𝑝. The logic of null hypothesis significance testing is then as follows: if the observed
T-statistic value and more extreme values have a very small probability of occurring under the assumed null hypothesis/distribution
(for example, a probability of less than 0.05 for it or more extreme values), one may infer that it is not very likely that the data that gave
rise to the computed T-statistic value was actually generated from a GLM for which the null hypothesis 𝐻0: 𝛽 = 𝛽0 ∈ ℝ𝑝 holds true,
and one would “reject the null hypothesis”.
5. The formal definition of the F-statistic is given by
𝐹: ℝ𝑛 × ℝ𝑛 → ℝ, (𝑒1, 𝑒12) ↦ 𝐹(𝑒1, 𝑒12) ≔ ((𝑒1^𝑇𝑒1 − 𝑒12^𝑇𝑒12)/𝑝2) / (𝑒12^𝑇𝑒12/(𝑛 − 𝑝))
where 𝑒1^𝑇𝑒1 corresponds to the residual sum-of-squares obtained under a reduced GLM “nested” within a complete GLM with
residual sum-of-squares 𝑒12^𝑇𝑒12, 𝑛 corresponds to the number of data points, 𝑝 to the number of parameters in the full GLM, and 𝑝2
to the number of regressors/parameters that are added to the reduced model to obtain the full GLM. If the reduction in the residual sum-of-squares afforded by the 𝑝2 additional regressors is small compared to the residual sum-of-squares achieved under the full model, the added regressors (compared to the reduced model) are not very valuable. In other words: an F-statistic value of 0 indicates that the residual sums-of-squares of both the reduced and the full model are identical, and thus, that the additional regressors of the full model do not contribute much to the explanation of the observed data.
6. In the F-statistic variable 𝑋 ≔ (𝑌1/𝑚)/(𝑌2/𝑛), 𝑌1 refers to a chi-squared distributed random variable with 𝑚 degrees of freedom and 𝑌2 refers to a chi-squared distributed random variable with 𝑛 degrees of freedom. Within the GLM framework, 𝑌1 takes on the role of the difference in residual sums-of-squares between a reduced and a full model and 𝑚 the role of the additional degrees of freedom, i.e., parameters incorporated in the full model with respect to the reduced model. 𝑌2 takes on the role of the full model’s residual sum-of-squares and 𝑛 the degrees of freedom of the full model.
Bayesian Estimation
(1) Model Formulation
In this section we apply the Bayesian paradigm to the GLM. Specifically, we derive a posterior
distribution 𝑝(𝛽|𝑦) for the beta parameter and an expression for the evidence (marginal probability) 𝑝(𝑦) in
the Bayesian context. These derivations, in the current section, are conditional on two assumptions. Firstly,
we assume that we know the variance parameter of the GLM. Secondly, we assume that the marginal (prior)
distribution of the beta parameter is given by a multivariate Gaussian distribution. Notably, in contrast to the
classical estimation of the beta parameter by means of parameter point estimation, we treat 𝛽 as an
unobserved random variable. We thus rewrite our familiar form of the GLM data distribution
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎2 > 0 (1)
as a conditional distribution
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, 𝛽 ∈ ℝ𝑝, 𝜎𝑦|𝛽^2 > 0 (2)
Note that from now on we use 𝜎𝑦|𝛽^2 instead of 𝜎2 to denote the variance parameter of this distribution, the
reason for which will become clear immediately. To simplify the notation, we will write Σ𝑦|𝛽 ≔ 𝜎𝑦|𝛽^2 𝐼𝑛, and
note that the matrix Σ𝑦|𝛽 ∈ ℝ𝑛×𝑛 is positive-definite. The assumption of a Gaussian prior distribution for the
beta parameter is formalized by defining
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, Σ𝛽), where 𝛽 ∈ ℝ𝑝, 𝜇𝛽 ∈ ℝ𝑝, Σ𝛽 ∈ ℝ𝑝×𝑝 positive-definite (3)
Note that by specifying the
“likelihood” 𝑝(𝑦|𝛽) and the “prior” 𝑝(𝛽), we implicitly define a joint distribution (or “generative model”)
over both the data 𝑦 and the parameters 𝛽 given by
𝑝(𝑦, 𝛽) = 𝑝(𝑦|𝛽)𝑝(𝛽) (4)
We will further investigate this joint distribution below. For the moment we state without proof that the
posterior distribution over 𝛽 given the data 𝑦 is again a Gaussian distribution, the parameters of which we
denote by 𝜇𝛽|𝑦 ∈ ℝ𝑝 and Σ𝛽|𝑦 ∈ ℝ𝑝×𝑝, such that we can write
𝑝(𝛽|𝑦) = 𝑁(𝛽; 𝜇𝛽|𝑦, Σ𝛽|𝑦) (5)
Likewise, we state without proof that the marginal distribution of the data 𝑦 is a Gaussian distribution, the
parameters of which we denote by 𝜇𝑦 ∈ ℝ𝑛 and Σ𝑦 ∈ ℝ𝑛×𝑛, such that we can write
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦) (6)
Our principal aim in this section is thus to derive equations for the parameters 𝜇𝛽|𝑦, Σ𝛽|𝑦, 𝜇𝑦 and Σ𝑦 in terms
of the design matrix 𝑋, the likelihood variance parameter 𝜎𝑦|𝛽^2, and the prior parameters 𝜇𝛽 and Σ𝛽. This
endeavor is an example of a “parametric Bayesian approach with conjugate priors”. Intuitively, this means
that all distributions involved are determined by parameters (i.e. we only need to specify the expectation
parameter and covariance parameter for all distributions involved), and the prior distribution and posterior
distribution are of the same functional type, i.e. both are Gaussians, such that the posterior parameters can
be determined in terms of “parameter update equations”. This is not the only conceivable case for applying
the Bayesian paradigm, but with respect to the GLM, a very important one.
(2) Bayesian estimation of the beta parameters
As a first step, we evaluate the functional form and parameters of the joint distribution of 𝑦 and 𝛽.
Because both distributions over the observed random variable 𝑦 and the unobserved random variable 𝛽 are
Gaussian distributions, we can immediately apply the results for the parameters of joint Gaussian
distributions specified in terms of a Gaussian marginal and a Gaussian conditional distribution derived
above. For the current scenario, we have the Gaussian marginal distribution
𝑝(𝛽) = 𝑁(𝛽; 𝜇𝛽, Σ𝛽), where 𝛽, 𝜇𝛽 ∈ ℝ𝑝, Σ𝛽 ∈ ℝ𝑝×𝑝 and positive-definite (1)
and the conditional distribution over the GLM data vector 𝑦 ∈ ℝ𝑛 given 𝛽 ∈ ℝ𝑝
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, Σ𝑦|𝛽) where 𝑦 ∈ ℝ𝑛, 𝑋 ∈ ℝ𝑛×𝑝, Σ𝑦|𝛽 ∈ ℝ𝑛×𝑛 and positive-definite (2)
From the Gaussian joint distribution and conditioning theorem we can read off that the joint distribution of
the unobserved random variable 𝛽 and the observed random variable 𝑦 is a Gaussian distribution
𝑝(𝑦, 𝛽) = 𝑁((𝑦, 𝛽)^𝑇; 𝜇𝑦,𝛽, Σ𝑦,𝛽) (3)
where
𝜇𝑦,𝛽 = (𝑋𝜇𝛽, 𝜇𝛽)^𝑇 ∈ ℝ𝑛+𝑝 (4)
and
Σ𝑦,𝛽 = ( Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋^𝑇   𝑋Σ𝛽
         Σ𝛽𝑋^𝑇           Σ𝛽 ) ∈ ℝ(𝑛+𝑝)×(𝑛+𝑝) (5)
The formula for the expectation parameter of the joint distribution (4) reveals that the expectation for the 𝛽
subpart of the random vector (𝑦, 𝛽)^𝑇 is identical to the prior expectation of 𝛽, while the expectation for
the 𝑦 subpart corresponds to the projection of the prior expectation of 𝛽 into the data space by means of
the design matrix 𝑋. The formula for the covariance matrix parameter of the joint distribution (5) reveals
that the “variance” parameter of the 𝛽 subpart of the random vector (𝑦, 𝛽)^𝑇 is identical to the prior
“variance” parameter of 𝛽. The “variance” parameter of the 𝑦 subpart on the other hand is given as the sum
of the conditional variance Σ𝑦|𝛽 and the marginal variance Σ𝛽 projected into the data space by means of the
design matrix. In other words, in the Bayesian conjugate prior scenario, the data covariance matrix is
affected by the parameter prior covariance – higher prior variance thus implies a higher data variance.
Finally, the covariance of the subparts 𝑦 and 𝛽 of the random vector (𝑦, 𝛽)^𝑇 results from an interaction
between the design matrix and the prior covariance parameter of 𝛽, but is unaffected by the data-conditional
covariance Σ𝑦|𝛽.
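The marginal moment formulas in (4) and (5) lend themselves to a quick Monte Carlo sanity check via ancestral sampling. The sketch below is a minimal illustration, assuming numpy; the two-point design, prior values, and seed are hypothetical choices. It samples 𝛽 from 𝑝(𝛽), then 𝑦 from 𝑝(𝑦|𝛽), and compares the empirical moments of 𝑦 with 𝑋𝜇𝛽 and Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋^𝑇:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minimal GLM: n = 2 data points, p = 1 beta parameter
X = np.array([[1.0], [2.0]])
mu_b, s2_b = 0.5, 2.0          # prior expectation and variance of beta
s2_y = 0.25                    # conditional data variance sigma^2_{y|beta}

# Ancestral sampling: beta ~ p(beta), then y ~ p(y | beta)
S = 200_000
betas = rng.normal(mu_b, np.sqrt(s2_b), size=S)
ys = betas[:, None] * X.T + rng.normal(0.0, np.sqrt(s2_y), size=(S, 2))

# Marginal moments of y predicted by equations (4) and (5)
mu_y_theory = (X * mu_b).ravel()                      # X mu_beta
Sigma_y_theory = s2_y * np.eye(2) + s2_b * (X @ X.T)  # Sigma_y|b + X Sigma_b X^T

print(np.allclose(ys.mean(axis=0), mu_y_theory, atol=0.05))
print(np.allclose(np.cov(ys.T), Sigma_y_theory, atol=0.1))
```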
Because we have derived the joint distribution 𝑝(𝑦, 𝛽) from the specifications of the prior
distribution 𝑝(𝛽) and the conditional distribution 𝑝(𝑦|𝛽) above, we can immediately apply the Gaussian
joint distribution and conditioning theorem to the parameters of this joint distribution in order to derive the
parameters of the posterior distribution 𝑝(𝛽|𝑦) and the marginal distribution 𝑝(𝑦).
For the conditional distribution, we obtain
𝑝(𝛽|𝑦) ≔ 𝑁(𝛽; 𝜇𝛽|𝑦, Σ𝛽|𝑦) (6)
where
Σ𝛽|𝑦 ≔ (Σ𝛽^−1 + 𝑋^𝑇Σ𝑦|𝛽^−1 𝑋)^−1 ∈ ℝ𝑝×𝑝 𝑝.𝑑. (7)
and
𝜇𝛽|𝑦 ≔ Σ𝛽|𝑦(Σ𝛽^−1𝜇𝛽 + 𝑋^𝑇Σ𝑦|𝛽^−1 𝑦) ∈ ℝ𝑝 (8)
We see that the posterior covariance matrix of 𝛽 results from a mixture of the prior covariance matrix and
the conditional data covariance filtered by the design matrix. Notably, the posterior covariance matrix of the
parameter 𝛽 is unaffected by the data. In other words, whatever realization of the data is observed, the
posterior covariance matrix of the parameter 𝛽 is unaffected by this outcome. The expectation parameter of
the posterior parameter distribution, on the other hand, results from a mixture of the prior expectation
parameter and the data, weighted by their respective precisions (inverse covariance matrices) and scaled by
the posterior covariance parameter. High prior precision of 𝛽 and high data variability (i.e. low conditional
precision of 𝑦) thus assigns more weight to the prior expectation, and vice versa.
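The posterior parameter equations (7) and (8) translate directly into code. The following sketch is a minimal numpy illustration for a hypothetical simple linear regression design with known variance parameter; the true parameters, prior values, and seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simple linear regression GLM: intercept and slope
n, p = 20, 2
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, -2.0])
sigma2 = 0.25                               # known likelihood variance
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), n)

# Prior p(beta) = N(mu_b, Sigma_b), a loose (high-variance) choice
mu_b = np.zeros(p)
Sigma_b = 10.0 * np.eye(p)
Sigma_y_b = sigma2 * np.eye(n)

# Posterior covariance and expectation, equations (7) and (8)
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_b)
                           + X.T @ np.linalg.inv(Sigma_y_b) @ X)
mu_post = Sigma_post @ (np.linalg.inv(Sigma_b) @ mu_b
                        + X.T @ np.linalg.inv(Sigma_y_b) @ y)
```

With this loose prior, 𝜇𝛽|𝑦 lies close to the OLS estimate; tightening the prior (smaller Σ𝛽) pulls it toward 𝜇𝛽.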
For the marginal distribution of the data 𝑦, we have
𝑝(𝑦) = 𝑁(𝑦; 𝜇𝑦, Σ𝑦) (9)
where
𝜇𝑦 = 𝑋𝜇𝛽 ∈ ℝ𝑛 (10)
and
Σ𝑦 = Σ𝑦|𝛽 + 𝑋Σ𝛽𝑋𝑇 ∈ ℝ𝑛×𝑛 (11)
As noted above, the marginal expectation of 𝑦 depends on the prior expectation of 𝛽 and the design
matrix 𝑋. From a model evidence perspective, this implies that data can achieve the highest probability
under a given model if both the prior assumptions of 𝛽 as well as the “data-generating mechanism” (i.e., the
design matrix) match their true, but unknown, counterparts.
In the following, we apply the formulas for the conditional distribution of 𝛽 in three low-dimensional
scenarios, which are readily visualized: Bayesian inference for the expectation parameter of the univariate
Gaussian based on a single observation and on multiple observations, and Bayesian inference for the offset
and slope parameters in simple linear regression.
(3) Examples for Bayesian beta parameter estimation
Bayesian inference for the expectation of a univariate Gaussian with a single observation
Based on the expression for the parameters of the conditional distribution 𝑝(𝛽|𝑦) stated above, we
can now verify the expressions for the posterior parameters in the context of Bayesian inference for the
expectation of a univariate Gaussian with a single observation as discussed in the previous Section. Notably,
in this case, the design matrix corresponds to the scalar 1, and the expectation parameter of the univariate
Gaussian corresponds to the only beta parameter. We thus have for the marginal distribution of the
unknown parameter
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, 𝜎𝛽^2) = 𝑁(𝛽; 𝜇𝛽, 𝜆𝛽^−1), where 𝛽, 𝜇𝛽 ∈ ℝ, 𝜆𝛽 = 1/𝜎𝛽^2 > 0 (1)
and for the conditional distribution of the single data point
𝑝(𝑦|𝛽) ≔ 𝑁(𝑦; 𝛽, 𝜎𝑦|𝛽^2) = 𝑁(𝑦; 𝛽, 𝜆𝑦|𝛽^−1), where 𝑦, 𝛽 ∈ ℝ, 𝜆𝑦|𝛽 = 1/𝜎𝑦|𝛽^2 > 0 (2)
Application of the formulae for the posterior distribution over 𝛽 then yields for the posterior precision
parameter
𝜆𝛽|𝑦 = (𝜆𝛽 + 1^𝑇 ⋅ 𝜆𝑦|𝛽 ⋅ 1) = 𝜆𝛽 + 𝜆𝑦|𝛽 > 0 (3)
and for the posterior expectation parameter
𝜇𝛽|𝑦 = (1/(𝜆𝛽 + 𝜆𝑦|𝛽))(𝜆𝛽𝜇𝛽 + 1^𝑇𝜆𝑦|𝛽𝑦) = (𝜆𝛽/(𝜆𝛽 + 𝜆𝑦|𝛽))𝜇𝛽 + (𝜆𝑦|𝛽/(𝜆𝛽 + 𝜆𝑦|𝛽))𝑦 ∈ ℝ (4)
By working through the general (𝑛 + 𝑝)-dimensional joint distribution case of GLM beta parameters and
data, we have thus justified the Bayesian inference scheme for the univariate Gaussian discussed in the
previous section.
Bayesian inference for the expectation of a univariate Gaussian with 𝑛 observations
We have repeatedly seen that inference for the expectation of a univariate Gaussian based on 𝑛
independent and identically distributed observations can be cast in GLM form by defining a design matrix
comprising a single column of 𝑛 1′𝑠 and identifying the single 𝛽 parameter in this GLM with the expectation
parameter of the univariate Gaussian to be inferred. In the Bayesian conjugate prior context with known
variance parameter 𝜎𝑦|𝛽^2 we thus have the following marginal parameter and conditional data distributions
𝑝(𝛽) ≔ 𝑁(𝛽; 𝜇𝛽, 𝜎𝛽^2) = 𝑁(𝛽; 𝜇𝛽, 𝜆𝛽^−1), where 𝛽, 𝜇𝛽 ∈ ℝ, 𝜆𝛽 = 1/𝜎𝛽^2 > 0 (5)
and
𝑝(𝑦|𝛽) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) = 𝑁(𝑦; 𝑋𝛽, 𝜆𝑦|𝛽^−1 𝐼𝑛), where 𝑦 ∈ ℝ𝑛, 𝑋 = (1, …, 1)^𝑇 ∈ ℝ𝑛, 𝜆𝑦|𝛽 = 1/𝜎𝑦|𝛽^2 > 0 (6)
Application of the formulae for the posterior distribution over 𝛽 in terms of precisions then yields for the
posterior precision parameter
𝜆𝛽|𝑦 = 𝜆𝛽 + (1, …, 1) 𝜆𝑦|𝛽 (1, …, 1)^𝑇 = 𝜆𝛽 + 𝑛𝜆𝑦|𝛽 > 0 (7)
and for the posterior expectation parameter
𝜇𝛽|𝑦 = (1/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽))(𝜆𝛽𝜇𝛽 + (1, …, 1)𝜆𝑦|𝛽(𝑦1, …, 𝑦𝑛)^𝑇) = (𝜆𝛽/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽))𝜇𝛽 + (𝜆𝑦|𝛽/(𝜆𝛽 + 𝑛𝜆𝑦|𝛽)) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 ∈ ℝ (8)
Note that for a prior expectation of 𝜇𝛽 = 0 and 𝜆𝛽 approaching zero, i.e. infinitely high prior variance, we
recover the ML point estimator (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑦𝑖 for the posterior expectation parameter.
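The precision-weighted update in equations (7) and (8), and the recovery of the sample mean in the flat-prior limit, can be sketched in a few lines of Python. This is a minimal illustration assuming numpy; the data, known likelihood precision, and prior values are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# n i.i.d. observations from a univariate Gaussian; variance assumed known
y = rng.normal(2.0, 1.0, size=50)
n = y.size
lam_y = 1.0              # likelihood precision lambda_{y|beta}, assumed known
mu_b, lam_b = 0.0, 0.5   # prior expectation and precision (hypothetical values)

# Posterior precision and expectation, equations (7) and (8)
lam_post = lam_b + n * lam_y
mu_post = (lam_b * mu_b + lam_y * y.sum()) / lam_post

# In the limit lam_b -> 0 the posterior expectation recovers the sample mean
mu_flat = (0.0 * mu_b + lam_y * y.sum()) / (0.0 + n * lam_y)
print(np.isclose(mu_flat, y.mean()))
```

The posterior expectation is a convex combination of the prior expectation and the sample mean, with weights given by the respective precisions.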
(4) Bayesian Estimation of the beta and variance parameters
In the current section we assume that both the beta parameter vector 𝛽 ∈ ℝ𝑝 and the likelihood
variance parameter 𝜎𝑦|𝛽^2 > 0 are unknown and are treated as unobserved random variables in a joint
distribution model of the form
𝑝(𝑦, 𝛽, 𝜎𝑦|𝛽^2) = 𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) 𝑝(𝛽, 𝜎𝑦|𝛽^2) (1)
where 𝑝(𝛽, 𝜎𝑦|𝛽^2) denotes the joint marginal (prior) distribution over both 𝛽 ∈ ℝ𝑝 and 𝜎𝑦|𝛽^2 > 0. Because
the unobserved random variable 𝜎𝑦|𝛽^2 is required to be strictly positive for the Gaussian likelihood
𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) (2)
to be well-defined, the marginal and conditional distributions of 𝜎𝑦|𝛽^2 cannot be set to Gaussian
distributions. A number of approaches for modeling uncertainty about 𝜎𝑦|𝛽^2 can be found in the literature. A
common approach is to formulate the Gaussian likelihood in terms of a precision parameter 𝜆 ≔ (𝜎𝑦|𝛽^2)^−1
and assume a gamma marginal distribution for 𝜆. Another approach, which we discuss below, is to formulate
the Gaussian likelihood in terms of the variance parameter 𝜎𝑦|𝛽^2, and use the inverse gamma distribution to
model its uncertainty. Yet another approach is to use a so-called “Jeffreys” or “reference” prior distribution
for 𝜎𝑦|𝛽^2, which takes the role of an “uninformative” prior distribution. We discuss these approaches in turn.
For a normal-inverse gamma prior distribution, we assume the following factorization of the joint
distribution over 𝑦, 𝛽 and 𝜎𝑦|𝛽^2
𝑝(𝑦, 𝛽, 𝜎𝑦|𝛽^2) = 𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) 𝑝(𝛽|𝜎𝑦|𝛽^2) 𝑝(𝜎𝑦|𝛽^2) (1)
We thus assume that the marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 factorizes into the conditional
distribution 𝑝(𝛽|𝜎𝑦|𝛽^2) and the marginal distribution 𝑝(𝜎𝑦|𝛽^2). Notably, this induces a dependence of the
probability density function values for the beta parameter 𝛽 on the value of the likelihood variance
parameter 𝜎𝑦|𝛽^2. While this may not necessarily be the most natural scenario, it is nevertheless (due to its
mathematical tractability) commonly encountered in the literature. As usual, the data likelihood is set to
𝑝(𝑦|𝛽, 𝜎𝑦|𝛽^2) ≔ 𝑁(𝑦; 𝑋𝛽, 𝜎𝑦|𝛽^2 𝐼𝑛) (2)
while the marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 is set to the product of a Gaussian conditional
distribution of 𝛽 given 𝜎𝑦|𝛽^2 and an inverse gamma distribution over 𝜎𝑦|𝛽^2, i.e., the normal-inverse gamma
distribution
𝑝(𝛽, 𝜎𝑦|𝛽^2) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽, Σ𝛽, 𝑎𝜎², 𝑏𝜎²) = 𝑁(𝛽; 𝜇𝛽, 𝜎𝑦|𝛽^2 Σ𝛽) 𝐼𝐺(𝜎𝑦|𝛽^2; 𝑎𝜎², 𝑏𝜎²) (3)
With this marginal distribution and likelihood, one can show that the data-conditional (posterior)
distribution over 𝛽 and 𝜎𝑦|𝛽2 is given again by a normal-inverse Gamma distribution, i.e., the normal-inverse
Gamma prior distribution is a conjugate prior distribution for the Gaussian likelihood:
𝑝(𝛽, 𝜎𝑦|𝛽^2|𝑦) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽|𝑦, Σ𝛽|𝑦, 𝑎𝜎²|𝑦, 𝑏𝜎²|𝑦) (4)
The parameters of this distribution are given in terms of the prior distribution parameters and the data as
follows
Σ𝛽|𝑦 = (Σ𝛽^−1 + 𝑋^𝑇𝑋)^−1 (5)
𝜇𝛽|𝑦 = Σ𝛽|𝑦(Σ𝛽^−1𝜇𝛽 + 𝑋^𝑇𝑦) (6)
𝑎𝜎²|𝑦 = 𝑎𝜎² + 𝑛/2 (7)
𝑏𝜎²|𝑦 = 𝑏𝜎² + (1/2)(𝜇𝛽^𝑇Σ𝛽^−1𝜇𝛽 + 𝑦^𝑇𝑦 − 𝜇𝛽|𝑦^𝑇Σ𝛽|𝑦^−1𝜇𝛽|𝑦) (8)
The posterior parameters Σ𝛽|𝑦 and 𝜇𝛽|𝑦 are thus similar to the case that 𝜎𝑦|𝛽^2 is known. The posterior
parameter 𝑎𝜎²|𝑦 is determined by the prior parameter on 𝜎𝑦|𝛽^2 and the number of data points. Finally, the
posterior parameter 𝑏𝜎²|𝑦 is determined by the prior parameter 𝑏𝜎² and three sums-of-squares: the sum-
of-squares of the prior expectation of 𝛽, the empirical sum-of-squares 𝑦^𝑇𝑦, and the sum-of-squares of the
posterior expectation of 𝛽.
From the properties of the normal-inverse gamma distribution, we can infer that the posterior
marginal distributions over 𝛽 and 𝜎𝑦|𝛽^2 are given by the following non-central multivariate t-distribution and
inverse gamma distribution, respectively
𝑝(𝛽|𝑦) = 𝑡(𝛽; 𝜇𝛽|𝑦, (𝑏𝜎²|𝑦/𝑎𝜎²|𝑦) Σ𝛽|𝑦, 2𝑎𝜎²|𝑦) (9)
𝑝(𝜎𝑦|𝛽^2|𝑦) = 𝐼𝐺(𝜎𝑦|𝛽^2; 𝑎𝜎²|𝑦, 𝑏𝜎²|𝑦) (10)
The two upper rows of Figure 1 visualize the Bayesian estimation of the GLM for the special case of
independent and identical sampling from a univariate Gaussian with unknown expectation (beta) and
variance parameter and a normal-inverse gamma prior distribution for these parameters.
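The normal-inverse gamma parameter update equations (5)–(8) can be written down directly. The following is a minimal numpy sketch; the design matrix, data, and prior parameter values are hypothetical illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical GLM data: intercept and slope design
n, p = 30, 2
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(0, 0.6, n)

# Normal-inverse gamma prior parameters (hypothetical choices)
mu_b = np.zeros(p)
Sigma_b = 5.0 * np.eye(p)
a0, b0 = 2.0, 1.0

# Posterior parameters, equations (5)-(8)
Sigma_b_inv = np.linalg.inv(Sigma_b)
Sigma_post = np.linalg.inv(Sigma_b_inv + X.T @ X)
mu_post = Sigma_post @ (Sigma_b_inv @ mu_b + X.T @ y)
a_post = a0 + n / 2
b_post = b0 + 0.5 * (mu_b @ Sigma_b_inv @ mu_b + y @ y
                     - mu_post @ np.linalg.inv(Sigma_post) @ mu_post)

# Posterior marginal over the variance is IG(a_post, b_post);
# for a_post > 1 its expectation is b_post / (a_post - 1)
sigma2_post_mean = b_post / (a_post - 1)
```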
Another choice of marginal (prior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 commonly encountered is
𝑝(𝛽, 𝜎𝑦|𝛽^2) ∝ 1/𝜎𝑦|𝛽^2 (1)
This choice of marginal distribution is referred to as an “uninformative”, “Jeffreys”, or “reference” prior,
because it can be argued that the distribution thus defined conveys minimal information about 𝛽 and 𝜎𝑦|𝛽^2 in
a well-defined sense.
In addition, (1) is an example of a so-called “improper” prior distribution. Improper prior distributions are
characterized by density functions that do not integrate to 1, i.e., which are not probability density functions
in the strict sense. They are usually denoted using proportionality statements as in (1). The marginal
distribution (1) may also be expressed as an improper normal-inverse gamma distribution
𝑝(𝛽, 𝜎𝑦|𝛽^2) = 𝑁𝐼𝐺(𝛽, 𝜎𝑦|𝛽^2; 𝜇𝛽, Σ𝛽, 𝑎𝜎², 𝑏𝜎²) (2)
with parameters
𝜇𝛽 = 0, Σ𝛽 = ∞𝐼𝑝, 𝑎𝜎² = −1/2 and 𝑏𝜎² = 0 (3)
Despite the fact that the marginal distribution thus defined is not a probability density function, it can be
shown that the data-conditional (posterior) distribution over 𝛽 and 𝜎𝑦|𝛽^2 is given by a proper probability
density function, namely a normal-inverse gamma distribution. This normal-inverse gamma distribution has
the following parameters
Σ𝛽|𝑦 = (𝑋𝑇𝑋)−1 (4)
𝜇𝛽|𝑦 = (𝑋^𝑇𝑋)^−1𝑋^𝑇𝑦 (5)
𝑎𝜎²|𝑦 = (𝑛 − 𝑝)/2 (6)
𝑏𝜎²|𝑦 = (1/2)(𝑦 − 𝑋𝜇𝛽|𝑦)^𝑇(𝑦 − 𝑋𝜇𝛽|𝑦) (7)
In this case, the data-conditional marginal distribution of 𝛽 is given by a multivariate non-central 𝑡-distribution
of the following form
𝑝(𝛽|𝑦) = 𝑡(𝛽; 𝜇𝛽|𝑦, ((𝑦 − 𝑋𝜇𝛽|𝑦)^𝑇(𝑦 − 𝑋𝜇𝛽|𝑦)/(𝑛 − 𝑝))(𝑋^𝑇𝑋)^−1, 𝑛 − 𝑝) (8)
Notably, for this choice of prior, the posterior parameters have strong similarities to the classical maximum
likelihood point estimation scenario: the posterior expectation parameter 𝜇𝛽|𝑦 corresponds to the maximum
likelihood estimator 𝛽̂. The posterior covariance parameter Σ𝛽|𝑦 corresponds closely to the covariance
parameter of the sampling distribution of 𝛽̂. Finally, the residual sum-of-squares directly enters the rate
parameter of the posterior distribution over 𝜎𝑦|𝛽^2.
If we consider the marginal posterior distribution 𝑝(𝛽𝑖|𝑦) for a single beta parameter 𝛽𝑖 with 𝑖 ∈ {1, …, 𝑝}, we have
an even stronger similarity between Bayesian and classical point estimation: for the case of the improper
prior distribution (1), the posterior marginal distribution of 𝛽𝑖 and the sampling distribution of 𝛽̂𝑖 are
equivalent. From (8), we have with
𝜎̂^2 = (𝑦 − 𝑋𝛽̂)^𝑇(𝑦 − 𝑋𝛽̂)/(𝑛 − 𝑝) (9)
that
𝑝(𝛽𝑖|𝑦) = 𝑡(𝛽𝑖; 𝛽̂𝑖, 𝜎̂^2 ((𝑋^𝑇𝑋)^−1)𝑖𝑖, 𝑛 − 𝑝) (10)
In other words, with
𝑇𝐵 ≔ (𝛽𝑖 − 𝛽̂𝑖)/√(𝜎̂^2 ((𝑋^𝑇𝑋)^−1)𝑖𝑖) (11)
We thus have
𝑝(𝑇𝐵|𝑦) = 𝑡(𝑇𝐵; 𝑛 − 𝑝) (12)
Note, however, that while the formulas for the random variables 𝑇 and 𝑇𝐵 and their distributions are
equivalent, they arise from fundamentally different assumptions about 𝛽𝑖: in the classical frequentist
scenario, 𝛽𝑖 is a true, but unknown, fixed value, not a random variable. On the other hand, 𝛽̂𝑖 is a random
variable in both scenarios, owing to the common assumption of the Gaussian distribution of the data 𝑦.
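The correspondence between the reference-prior posterior parameters and the classical point estimation quantities can be checked numerically. The sketch below is a minimal numpy illustration with a hypothetical design and data; it compares the posterior parameters of equations (4)–(7) against the OLS estimate and the classical variance estimator:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical GLM data
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 1.0, n)

# Posterior parameters under the reference prior, equations (4)-(7)
XtX_inv = np.linalg.inv(X.T @ X)
mu_post = XtX_inv @ X.T @ y           # posterior expectation parameter
resid = y - X @ mu_post
a_post = (n - p) / 2
b_post = 0.5 * resid @ resid

# Classical counterparts: OLS estimate and variance estimate
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = resid @ resid / (n - p)

print(np.allclose(mu_post, beta_hat))        # posterior mean matches OLS
print(np.isclose(b_post / a_post, sigma2_hat))  # b/a matches sigma^2 hat
```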
The third row of Figure 1 visualizes the Bayesian estimation of the GLM for the special case of
independent and identical sampling from a univariate Gaussian with unknown expectation (beta) and
variance parameter and a reference prior distribution for these parameters.
Figure 1. Bayesian estimation of the GLM 𝛽 and 𝜎𝑦|𝛽2 parameters for the special univariate Gaussian case. Rows depict three different
prior scenarios: the first and second rows depict the case of a tight and loose normal-inverse gamma distribution, respectively, while the third row depicts the case of a reference prior. Columns depict components of the probabilistic model: the first column depicts
the prior distribution over 𝛽 and 𝜎𝑦|𝛽2 , the second column depicts the true, but unknown, Gaussian likelihood over the 𝑛 data
points, and a sample of 𝑛 = 10. The third column depicts the inferred posterior distributions, while the two last columns depict the
posterior marginal distributions over the 𝛽 and 𝜎𝑦|𝛽2 parameter, respectively.
Fundamental designs
The General Linear Model is a mathematical unification of a number of data modeling procedures.
Specifically, it unites the following concepts: simple linear regression, multiple linear regression, T-tests, the
multifactorial analysis of variance (ANOVA) and the multifactorial analysis of covariance (ANCOVA). All these
approaches instantiate specific examples of the GLM, i.e., they are characterized by their specific design
matrix and beta parameter interpretation.
To exemplify these approaches in the following, we will use one common artificial data set. In this
data set, we conceive the data vector 𝑦 ∈ ℝ30 as representing the “anatomical volume” of a brain structure
(in mm³) (for example the dorsolateral prefrontal cortex (DLPFC)) of each of 𝑛 = 30 participants (or
“experimental units”) that took part in an anatomical MRI study. The study interest concerns the question
whether the experimental factors (= independent variables) “Age” (measured in years) and “Alcohol” (intake
measured in units (= 7.9 gram) of pure alcohol) influence the dependent variable “anatomical volume”. A
fictitious data set of this kind is shown in Table 1. Note that we are dealing with a between-subject design
which justifies the assumption that the error components 𝜀𝑖, 𝑖 = 1, …, 𝑛 in the GLM formulation are
independent. In the following we will consider this data set from different perspectives of experimental
designs and statistical inference procedures.
Participant Age [years] Alcohol [units] DLPFC Volume
1 15 3 178.7708
2 16 6 168.4660
3 17 5 169.9513
4 18 7 162.0778
5 19 4 170.1884
6 20 8 156.9287
7 21 1 175.4092
8 22 2 173.3972
9 23 7 154.4907
10 24 5 158.3642
11 25 1 172.1033
12 26 3 162.6648
13 27 2 165.4449
14 28 8 142.2121
15 29 4 154.3557
16 30 6 145.6544
17 31 3 155.5286
18 32 4 150.5144
19 33 7 137.8262
20 34 1 160.1183
21 35 2 155.4419
22 36 8 127.1715
23 37 5 138.0237
24 38 6 133.4589
25 39 4 139.3813
26 40 3 145.1997
27 41 7 123.7259
28 42 5 130.7300
29 43 8 114.1148
30 44 1 151.1943
31 45 6 121.7235
32 46 2 140.9424
Table 1. An example data set.
(1) A spectrum of GLM designs
GLM designs, i.e. specific design matrices and their corresponding beta parameter vectors, for any
data set may be conceived as lying within a spectrum between two extremes: on one side of the spectrum, one
may assume that there is no systematic variation in the outcome measure whatsoever. This would
correspond to the notion that each data point 𝑦𝑖 ∈ ℝ (𝑖 = 1, …, 𝑛) has been sampled from a univariate
Gaussian, and all these Gaussians have an identical mean. In other words, all observed data variability over
experimental units may be explained as “pure Gaussian noise” around a common expectation 𝜇 ∈ ℝ
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇, 𝜎2) ⇔ 𝑦𝑖 = 𝜇 + 𝜀𝑖, 𝑝(𝜀𝑖) ≔ 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, …, 𝑛) (1)
As discussed previously, this “null model” corresponds to the case of a design matrix formed by a column of
ones and a single beta parameter. The 𝑖th entry in the matrix product of these, (𝑋𝛽)𝑖, represents the
participant-independent expectation parameter 𝜇 ∈ ℝ of the univariate Gaussian:
𝑋 ≔ (1, …, 1)^𝑇 ∈ ℝ𝑛×1, 𝛽 ∈ ℝ, 𝜇 ≔ (𝑋𝛽)𝑖 (𝑖 = 1, …, 𝑛) (2)
On the other side of the spectrum, one may conceive a case in which there is complete and
unsystematic variability over experimental units in the sense that each participant’s data actually
corresponds to a sample from a participant-specific univariate Gaussian distribution. In other words, each
experimental unit is modelled by an experimental-unit-specific expectation parameter 𝜇𝑖 (𝑖 = 1, …, 𝑛):
𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) ≔ 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, …, 𝑛) (3)
In terms of GLM designs, this would correspond to the square identity matrix as design matrix and a
beta parameter vector comprising as many parameters as there are data points. This renders the 𝑖th entry in
the matrix product 𝑋𝛽 the participant-specific expectation parameter 𝜇𝑖 ∈ ℝ
𝑋 ≔ 𝐼𝑛 ∈ ℝ𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, …, 𝑛) (4)
Note that in both cases no interesting statements can be made about the relation between the columns of the design matrix and the outcome data variable, in the sense that (2) assumes that all participants are "the same", and (4) assumes that all participants are "mutually different". Most of the designs we will encounter in the following lie somewhere between (2) and (4) and thus constrain the "differences" between participants or data points in some meaningful way which lends itself to an interpretation in terms of the independent experimental variables that the design matrix columns represent.
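The two extreme designs can be checked numerically. Below is a minimal sketch in Python/NumPy (the outcome values are assumed for illustration, not taken from the text): for the column-of-ones design the OLS estimate (XᵀX)⁻¹Xᵀy reduces to the sample mean, while for the identity design it simply reproduces the data.

```python
import numpy as np

# Assumed example outcome data, n = 8 (not from the text).
rng = np.random.default_rng(1)
y = rng.normal(160.0, 10.0, size=8)

# "Null model" (2): X is a single column of ones; the OLS estimate
# (X^T X)^{-1} X^T y reduces to the sample mean of y.
X_null = np.ones((8, 1))
beta_null = np.linalg.solve(X_null.T @ X_null, X_null.T @ y)

# "Saturated model" (4): X is the identity matrix; the OLS estimate
# reproduces y exactly, one expectation parameter per data point.
X_sat = np.eye(8)
beta_sat = np.linalg.solve(X_sat.T @ X_sat, X_sat.T @ y)
```

The first estimator collapses all variability into one number; the second leaves no variability unexplained, which is exactly why neither design supports interesting inferences.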
Design matrices can further be classified according to whether their columns are formed by
continuously varying real numbers or by so-called “indicator” or “dummy” variables, i.e. ones and zeros. In
the first case, the designs are referred to as “continuous” or “regression” designs, in which the independent
variables represent continuous experimental factors, and the design matrix columns are usually referred to
as “regressors”, “predictors”, and sometimes as “(co)variates”. In the second case, the designs are referred
to as “categorical” or “ANOVA-type” designs. In categorical designs the independent experimental variables
are usually referred to as “experimental factors” and the values that they can take on as “levels of the
experimental factor”.
The difference between the "continuous" and "categorical" approaches lies in the expectation about changes in the dependent variable as the independent variable changes. If one treats an independent experimental variable as a continuous variate, one assumes a linear effect of this predictor on the dependent variable. If one treats an independent experimental variable as a discrete variate, one does not need to assume that every unit change in the independent variable produces a scaled unit change in the value of the dependent variable. In other words, one allows for arbitrary changes in the response from one category
to another. This approach has the advantage of a simple interpretation and may be viewed as a prediction of
“qualitative” differences. On the other hand, by grouping independent variable values into discrete
categories, we are discarding information contained in the continuous covariation of independent and
dependent variables.
In the following we will discuss two forms of continuous GLM designs, simple and multiple linear
regression, two forms of categorical GLM designs, T-tests and ANOVA designs, and one mixed form, the
ANCOVA design. For each design, we will select the relevant aspects of the example data set given in Table
1, write down the corresponding GLM in structural and design matrix form, show a typical visualization, and
discuss its estimation and interpretation from a classical and a Bayesian viewpoint.
(2) Simple Linear Regression
The central idea of simple linear regression is that the expectation of the i-th data variable y_i (i = 1,…,n) is given by a constant offset a ∈ ℝ plus the value of a single "predictor" or "regressor" independent experimental variable x_i multiplied by a slope coefficient b ∈ ℝ:

p(y_i) = N(y_i; a + bx_i, σ²) ⇔ y_i = a + bx_i + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)
for 𝑖 = 1,… , 𝑛. In its design matrix formulation the above corresponds to
p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1 x_1; 1 x_2; ⋮ ⋮; 1 x_n) ∈ ℝ^{n×2}, β ∈ ℝ² and σ² > 0 (2)
Here the first entry of the parameter vector β := (β₁, β₂)ᵀ assumes the role of the offset a, and the second entry assumes the role of the slope b.
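The OLS estimate for this design follows from the general GLM result β̂ = (XᵀX)⁻¹Xᵀy. A sketch in Python/NumPy using the Age and DLPFC-volume values of Table 2 below (the use of NumPy rather than the text's own tooling is an assumption for illustration):

```python
import numpy as np

# Age and DLPFC volume data from Table 2.
x = np.arange(15, 47, dtype=float)               # ages 15, ..., 46
y = np.array([178.7708, 168.4660, 169.9513, 162.0778, 170.1884, 156.9287,
              175.4092, 173.3972, 154.4907, 158.3642, 172.1033, 162.6648,
              165.4449, 142.2121, 154.3557, 145.6544, 155.5286, 150.5144,
              137.8262, 160.1183, 155.4419, 127.1715, 138.0237, 133.4589,
              139.3813, 145.1997, 123.7259, 130.7300, 114.1148, 151.1943,
              121.7235, 140.9424])

X = np.column_stack([np.ones_like(x), x])        # n x 2 design matrix of (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
a_hat, b_hat = beta_hat                          # offset and slope estimates
```

The estimated slope is negative, consistent with the downward trend of DLPFC volume over age visible in Table 2.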
As an example, we reconsider the example data set above by ignoring the alcohol intake variable,
which results in the data set shown in Table 2. Here “Age” corresponds to the independent experimental
variable encoded in the values 𝑥𝑖 and “DLPFC Volume” corresponds to the dependent experimental variables
𝑦𝑖. Simple linear regression designs are typically visualized by plotting the values of the independent
experimental variable on the x-axis and the values of the dependent experimental variable on the y-axis.
𝒊 𝒙𝒊: Age 𝒚𝒊 : DLPFC Volume
1 15 178.7708
2 16 168.4660
3 17 169.9513
4 18 162.0778
5 19 170.1884
6 20 156.9287
7 21 175.4092
8 22 173.3972
9 23 154.4907
10 24 158.3642
11 25 172.1033
12 26 162.6648
13 27 165.4449
14 28 142.2121
15 29 154.3557
16 30 145.6544
17 31 155.5286
18 32 150.5144
19 33 137.8262
20 34 160.1183
21 35 155.4419
22 36 127.1715
23 37 138.0237
24 38 133.4589
25 39 139.3813
26 40 145.1997
27 41 123.7259
28 42 130.7300
29 43 114.1148
30 44 151.1943
31 45 121.7235
32 46 140.9424
Table 2. The example data set considered as a simple linear regression design.
Figure 1. Visualization of a simple linear regression design.
(3) Multiple Linear Regression
Multiple linear regression may be viewed as the most general application of the GLM in the sense that all columns of the design matrix are allowed to take on arbitrary values, and there may be arbitrarily many of them. Here, we explicitly state the multiple linear regression design applicable to the data set above, which includes two predictor variables. For the special case of two independent experimental variables, the model takes the form

p(y_i) = N(y_i; a + b₁x_{1i} + b₂x_{2i}, σ²) ⇔ y_i = a + b₁x_{1i} + b₂x_{2i} + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)
for (𝑖 = 1,… , 𝑛). In its design matrix formulation the above corresponds to
p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1 x_{11} x_{21}; 1 x_{12} x_{22}; ⋮ ⋮ ⋮; 1 x_{1n} x_{2n}) ∈ ℝ^{n×3}, β ∈ ℝ³ and σ² > 0 (2)
Here, the first entry of the beta parameter vector β := (β₁, β₂, β₃)ᵀ assumes the role of the offset a, the second entry assumes the role of the slope with respect to the first independent experimental variable x_{1i}, and the third entry assumes the role of the slope with respect to the second independent experimental variable x_{2i}.
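The same closed-form OLS estimator applies with the three-column design matrix. A sketch in Python/NumPy with assumed, noise-free example values (so that the estimator recovers the assumed coefficients exactly):

```python
import numpy as np

# Assumed true parameters and predictors (for illustration only).
n = 10
x1 = np.arange(n, dtype=float)
x2 = np.array([3., 1., 4., 1., 5., 9., 2., 6., 5., 3.])
a, b1, b2 = 200.0, -1.5, -2.0
y = a + b1 * x1 + b2 * x2                        # noise-free outcome data

X = np.column_stack([np.ones(n), x1, x2])        # n x 3 design matrix of (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
```

With noiseless data the estimate equals (a, b₁, b₂) up to numerical precision; with Gaussian noise added to y it would scatter around these values.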
As an example, we consider the data set of Table 1 and relabel the columns in accordance with the multiple linear regression design as shown in Table 3. Multiple linear regression designs are not easily visualized, especially if the number of independent experimental variables is larger than 2. In Figure 2 below we visualize the multiple linear regression design of the current example; note, however, that this kind of graph is rarely seen in the literature.
𝒊 𝒙𝟏𝒊 : Age 𝒙𝟐𝒊: Alcohol 𝒚𝒊 : DLPFC Volume
1 15 3 178.7708
2 16 6 168.4660
3 17 5 169.9513
4 18 7 162.0778
5 19 4 170.1884
6 20 8 156.9287
7 21 1 175.4092
8 22 2 173.3972
9 23 7 154.4907
10 24 5 158.3642
11 25 1 172.1033
12 26 3 162.6648
13 27 2 165.4449
14 28 8 142.2121
15 29 4 154.3557
16 30 6 145.6544
17 31 3 155.5286
18 32 4 150.5144
19 33 7 137.8262
20 34 1 160.1183
21 35 2 155.4419
22 36 8 127.1715
23 37 5 138.0237
24 38 6 133.4589
25 39 4 139.3813
26 40 3 145.1997
27 41 7 123.7259
28 42 5 130.7300
29 43 8 114.1148
30 44 1 151.1943
31 45 6 121.7235
32 46 2 140.9424
Table 3. The example data set considered as a multiple linear regression design with two independent experimental variables.
Figure 2. Visualization of a multiple linear regression design with two independent experimental variables.
(4) One-sample T-Test
The one-sample T-Test is usually portrayed as a procedure to evaluate the null hypothesis that all data points were generated from univariate Gaussian distributions with identical expectation parameters. From a GLM viewpoint it corresponds to a categorical design with a single experimental factor taking on a single level. Specifically, each data point is modelled by a univariate Gaussian variable with identical expectation parameter over data points

p(y_i) = N(y_i; μ, σ²) ⇔ y_i = μ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)

where i = 1,…,n. In its design matrix formulation, (1) corresponds to

p(y) := N(y; Xβ, σ²I_n), where y ∈ ℝ^n, X := (1, …, 1)ᵀ ∈ ℝ^{n×1}, β ∈ ℝ and σ² > 0 (2)
Here the single entry of the parameter vector β assumes the role of the true, but unknown, expectation μ. If applied to the example data set of Table 1, the one-sample T-Test allows for evaluating the null hypothesis that all observed data points were generated from univariate Gaussians with identical expectations.
Table 4 shows the example data set viewed from the perspective of a one-sample T-Test.
𝒊 𝒚𝒊 : DLPFC Volume
1 178.7708
2 168.4660
3 169.9513
4 162.0778
5 170.1884
6 156.9287
7 175.4092
8 173.3972
9 154.4907
10 158.3642
11 172.1033
12 162.6648
13 165.4449
14 142.2121
15 154.3557
16 145.6544
17 155.5286
18 150.5144
19 137.8262
20 160.1183
21 155.4419
22 127.1715
23 138.0237
24 133.4589
25 139.3813
26 145.1997
27 123.7259
28 130.7300
29 114.1148
30 151.1943
31 121.7235
32 140.9424
Table 4. The example data set considered as a one-sample T-test design.
Figure 3. Visualization of a one-sample T-Test
One-sample T-Test data are not usually visualized. However, in line with the visualization of other categorical designs, it is most appropriate to visualize them by means of their sample mean and sample standard deviation or standard error of the mean (Figure 3).
For the model in (2), the classical beta parameter point estimator evaluates to

β̂ = (1/n) Σ_{i=1}^n y_i (3)

and is thus given by the so-called sample mean ȳ := (1/n) Σ_{i=1}^n y_i. The variance parameter estimator evaluates to

σ̂² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² (4)

and is thus given by the so-called sample variance s² := (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)². Specification of the null hypothesis H₀: μ = μ₀ then yields the familiar formula for the T-statistic

T = √n (ȳ − μ₀)/s (5)

where we defined the unbiased sample standard deviation as

s := √s² (6)

Based on a data realization we may thus evaluate the T-statistic, which under the null hypothesis is distributed according to a t-distribution with n − 1 degrees of freedom. If the observed T-statistic value (or a more extreme one) is associated with a low probability under the null hypothesis, we may consider revising the null hypothesis.
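Equations (3) – (6) translate directly into code. A sketch in Python/NumPy with a deliberately tiny assumed data set so the result can be verified by hand (ȳ = 2, s² = 1, T = 2√3):

```python
import numpy as np

# Tiny assumed data set (for illustration only).
y = np.array([1.0, 2.0, 3.0])
mu0 = 0.0                                   # null hypothesis H0: mu = mu0

n = y.size
y_bar = y.mean()                            # beta estimate = sample mean (3)
s2 = ((y - y_bar) ** 2).sum() / (n - 1)     # sample variance (4)
T = np.sqrt(n) * (y_bar - mu0) / np.sqrt(s2)   # T-statistic (5)
```

Under H₀ this T would be compared against a t-distribution with n − 1 = 2 degrees of freedom.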
Proof of (3) – (5)

For the current GLM design, the beta parameter estimator is given by

β̂ = (XᵀX)⁻¹Xᵀy (3.1)
  = ((1 ⋯ 1)(1, …, 1)ᵀ)⁻¹ (1 ⋯ 1)(y_1, …, y_n)ᵀ
  = n⁻¹ Σ_{i=1}^n y_i
  = (1/n) Σ_{i=1}^n y_i

corresponding to the arithmetic mean ȳ of the data points in y ∈ ℝ^n. With p = 1, the sigma parameter estimator is given by

σ̂² = (1/(n−1)) (y − Xβ̂)ᵀ(y − Xβ̂) (4.1)
   = (1/(n−1)) Σ_{i=1}^n (y_i − (1/n) Σ_{j=1}^n y_j)²
   = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)²

corresponding to the sample variance s² of the data points in y ∈ ℝ^n. Finally, with a contrast vector c := 1 and true, but unknown, beta parameter β₀ = μ₀, the T-statistic evaluates to

T = (cᵀβ̂ − cᵀβ₀)/√(σ̂² cᵀ(XᵀX)⁻¹c) (5.1)
  = (ȳ − μ₀)/√(s² n⁻¹)
  = √n (ȳ − μ₀)/s

where we defined the sample standard deviation as s := √s².

□
(5) Independent two-sample T-Test
The independent two-sample T-Test is a procedure to evaluate whether two groups of data points y_1 ∈ ℝ^{n₁} and y_2 ∈ ℝ^{n₂} were generated from the same underlying univariate Gaussian distribution, i.e. Gaussians with the same expectation parameters, μ₁ = μ₂, corresponding to the null hypothesis, or from two different ones, μ₁ ≠ μ₂, corresponding to the alternative hypothesis. In terms of GLM designs, this case thus corresponds to one experimental factor taking on two discrete levels. The assumed probability distribution for data points collected at the first level takes the form

p(y_{1i}) = N(y_{1i}; μ₁, σ²) ⇔ y_{1i} = μ₁ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (1)

where i = 1,…,n₁, and the assumed probability distribution for data points collected at the second level takes the form

p(y_{2i}) = N(y_{2i}; μ₂, σ²) ⇔ y_{2i} = μ₂ + ε_i, p(ε_i) := N(ε_i; 0, σ²) (2)

where i = 1,…,n₂. In its design matrix formulation, the two-sample T-test corresponds to

p(y) := N(y; Xβ, σ²I_n), where y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂})ᵀ ∈ ℝ^n, X := (1_{n₁} 0_{n₁}; 0_{n₂} 1_{n₂}) ∈ ℝ^{n×2}, β ∈ ℝ² and σ² > 0 (3)

where n := n₁ + n₂, 1_{n_j} and 0_{n_j} denote n_j-dimensional vectors of ones and zeros, respectively, and, notably, the data points and ones in the design matrix have to be arranged in such a manner that they identify the corresponding group membership.
As an example, consider regrouping the example data set into two groups of observations corresponding to participants younger than 31 years of age and participants of age 31 years or older, as shown in Table 5.
Group 1 Group 2
Participant 𝒊𝒋 𝒚𝟏𝒊 : DLPFC Volume Participant 𝒊𝒋 𝒚𝟐𝒊: DLPFC Volume
1 11 178.7708 17 21 155.5286
2 12 168.4660 18 22 150.5144
3 13 169.9513 19 23 137.8262
4 14 162.0778 20 24 160.1183
5 15 170.1884 21 25 155.4419
6 16 156.9287 22 26 127.1715
7 17 175.4092 23 27 138.0237
8 18 173.3972 24 28 133.4589
9 19 154.4907 25 29 139.3813
10 110 158.3642 26 210 145.1997
11 111 172.1033 27 211 123.7259
12 112 162.6648 28 212 130.7300
13 113 165.4449 29 213 114.1148
14 114 142.2121 30 214 151.1943
15 115 154.3557 31 215 121.7235
16 116 145.6544 32 216 140.9424
Table 5. The example data set of Table 1 rearranged for an independent two-sample T-test. The column Participant comprises the original data labels as in Table 1, while the columns i_j comprise the indices of the data points y_{1i} and y_{2i} after relabeling for the independent two-sample T-test.
Independent two-sample T-Test designs are usually visualized by portraying their group sample means and
associated standard deviations or standard errors of mean.
Figure 4. Visualization of an independent two-sample T-Test. Errorbars depict the within-group standard deviations.
For the model formulated in (1) – (3), the beta parameter estimator evaluates to

β̂ = ((1/n₁) Σ_{i=1}^{n₁} y_{1i}, (1/n₂) Σ_{i=1}^{n₂} y_{2i})ᵀ = (ȳ₁, ȳ₂)ᵀ (4)

In other words, the two entries of the beta parameter estimator correspond to the two sample averages ȳ₁ and ȳ₂. The variance parameter estimator is given by

σ̂² = (Σ_{i=1}^{n₁} (y_{1i} − ȳ₁)² + Σ_{i=1}^{n₂} (y_{2i} − ȳ₂)²)/((n₁ − 1) + (n₂ − 1)) =: s₁₂² (5)

where we defined the "pooled" or "averaged" sample variance as s₁₂², the square root of which corresponds to the "pooled" sample standard deviation s₁₂ := √s₁₂². Reformulating the null hypothesis H₀: μ₁ = μ₂ as H₀: μ₁ − μ₂ = 0, specifying the contrast vector as c := (1, −1)ᵀ and setting β₀ := 0 ∈ ℝ² then yields the familiar formula for the T-statistic of the independent two-sample T-test

T = (ȳ₁ − ȳ₂)/√((1/n₁ + 1/n₂) s₁₂²) (6)
Based on a data realization we may thus evaluate the T-statistic, which under the null hypothesis is distributed according to a t-distribution with n − p = n₁ + n₂ − 2 degrees of freedom. If the observed T-statistic value (or a more extreme value) is associated with a low probability under the null hypothesis, we may consider revising the null hypothesis.
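Equations (4) – (6) can again be checked with a tiny assumed data set. In the sketch below (Python/NumPy, values chosen so the arithmetic is easy to verify by hand), the pooled variance is 1 and T = −√(3/2):

```python
import numpy as np

# Tiny assumed two-group data set (for illustration only).
y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([2.0, 3.0, 4.0])
n1, n2 = y1.size, y2.size

y1_bar, y2_bar = y1.mean(), y2.mean()       # beta estimator entries (4)
s12_sq = (((y1 - y1_bar) ** 2).sum()
          + ((y2 - y2_bar) ** 2).sum()) / (n1 + n2 - 2)   # pooled variance (5)
T = (y1_bar - y2_bar) / np.sqrt((1.0 / n1 + 1.0 / n2) * s12_sq)  # (6)
```

Under H₀ this T would be referred to a t-distribution with n₁ + n₂ − 2 = 4 degrees of freedom.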
Proof of (4) – (6)

For the current GLM design, the beta parameter estimator is given by

β̂ = (XᵀX)⁻¹Xᵀy (4.1)
  = (n₁ 0; 0 n₂)⁻¹ (Σ_{i=1}^{n₁} y_{1i}, Σ_{i=1}^{n₂} y_{2i})ᵀ
  = (1/n₁ 0; 0 1/n₂) (Σ_{i=1}^{n₁} y_{1i}, Σ_{i=1}^{n₂} y_{2i})ᵀ
  = ((1/n₁) Σ_{i=1}^{n₁} y_{1i}, (1/n₂) Σ_{i=1}^{n₂} y_{2i})ᵀ

which corresponds to the sample averages of the data points y_{11}, …, y_{1n₁}, i.e. ȳ₁, and y_{21}, …, y_{2n₂}, i.e. ȳ₂, respectively. With n = n₁ + n₂ and p = 2, the variance estimator for the current GLM design is given by

σ̂² = (1/(n₁ + n₂ − 2)) (y − Xβ̂)ᵀ(y − Xβ̂) (5.1)
   = (1/(n₁ + n₂ − 2)) ((y_{11} − ȳ₁, …, y_{1n₁} − ȳ₁, y_{21} − ȳ₂, …, y_{2n₂} − ȳ₂)ᵀ)ᵀ (y_{11} − ȳ₁, …, y_{1n₁} − ȳ₁, y_{21} − ȳ₂, …, y_{2n₂} − ȳ₂)ᵀ
   = (Σ_{i=1}^{n₁} (y_{1i} − ȳ₁)² + Σ_{i=1}^{n₂} (y_{2i} − ȳ₂)²)/(n₁ + n₂ − 2)

which corresponds to the pooled sample variance s₁₂². Finally, setting c := (1, −1)ᵀ and β₀ := (0, 0)ᵀ, the T-statistic evaluates to

T = (cᵀβ̂ − cᵀβ₀)/√(σ̂² cᵀ(XᵀX)⁻¹c) (6.1)
  = (ȳ₁ − ȳ₂)/√(s₁₂² (1, −1)(1/n₁ 0; 0 1/n₂)(1, −1)ᵀ)
  = (ȳ₁ − ȳ₂)/√((1/n₁ + 1/n₂) s₁₂²)

□
(6) One-way ANOVA
The simplest way to think about the one-way analysis of variance (ANOVA) is to consider it the
extension of an (independent) two-sample t-test to more than two groups. It is helpful to modify the GLM
notation to reflect the one-way ANOVA layout of the data explicitly. Let 𝑚 ∈ ℕ denote the number of
groups or levels of the factor, let n_i ∈ ℕ denote the number of observations in group i ∈ ℕ_m, and let y_{ij} ∈ ℝ denote the dependent variable of the j-th unit in the i-th group (i ∈ ℕ_m, j ∈ ℕ_{n_i}). As usual, we conceive of y_{ij} as a realization of a random variable with a univariate Gaussian distribution. In the one-way ANOVA case, this distribution takes the form

p(y_{ij}) = N(y_{ij}; μ_i, σ²) ⇔ y_{ij} = μ_i + ε_{ij}, p(ε_{ij}) := N(ε_{ij}; 0, σ²), σ² > 0 (1)
where the variance parameter 𝜎2 is identical for all observations. The underlying assumption of the one-way
ANOVA model in terms of the deterministic GLM aspect is
𝜇𝑖 ≔ 𝜇0 + 𝛼𝑖 (2)
where 𝜇0 ∈ ℝ takes on the role of a common offset for all data groups, and 𝛼𝑖 ∈ ℝ represents the effect of
level i of the experimental factor. The one-way ANOVA model (1) can be formulated as a special case of the GLM. To this end, the design matrix X ∈ ℝ^{n×p} comprises p = m + 1 columns: a column of ones representing the constant offset (as in simple linear regression) and m columns of "indicator" variables. These indicator variables take on the value 1 for dependent variables corresponding to level i of the factor and the value 0 otherwise. For the exemplary case of 4 experimental groups with n_i (i ∈ ℕ₄) data points each, we have X ∈ ℝ^{n×5} with n := Σ_{i=1}^4 n_i. Using this design matrix and parameter formulation, we can write the GLM form of the one-way ANOVA layout as follows:
p(y) = N(y; Xβ, σ²I_n) (3)

where

y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂}, y_{31}, …, y_{3n₃}, y_{41}, …, y_{4n₄})ᵀ ∈ ℝ^n,

X := (1_{n₁} 1_{n₁} 0 0 0; 1_{n₂} 0 1_{n₂} 0 0; 1_{n₃} 0 0 1_{n₃} 0; 1_{n₄} 0 0 0 1_{n₄}) ∈ ℝ^{n×5}, β := (μ₀, α₁, α₂, α₃, α₄)ᵀ ∈ ℝ⁵ and σ² > 0 (4)

where 1_{n_i} and 0 denote n_i-dimensional vectors of ones and zeros, respectively.
To illustrate the use of one-way ANOVA in the context of the example data set of Table 1, we ignore the factor "Alcohol" and consider only the experimental factor "Age". We define four discrete levels (L) for this factor: L1: 15 – 22 years, L2: 23 – 30 years, L3: 31 – 38 years, and L4: 39 – 46 years. Regrouping the data accordingly results in the data layout of Table 6. In the example, we have m = 4 and y_{ij} is the DLPFC volume measure of the j-th subject in the i-th category of the age factor, where i ∈ ℕ₄ and j = 1,…,n_i with n₁ = n₂ = n₃ = n₄ = 8.
L1 L2 L3 L4
P 𝒊𝒋 𝒚𝟏𝒊: DLPFC Volume P 𝒊𝒋 𝒚𝟐𝒊: DLPFC Volume P 𝒊𝒋 𝒚𝟑𝒊 : DLPFC Volume P 𝒊𝒋 𝒚𝟒𝒊: DLPFC Volume
1 11 178.7708 9 21 154.4907 17 31 155.5286 25 41 139.3813
2 12 168.4660 10 22 158.3642 18 32 150.5144 26 42 145.1997
3 13 169.9513 11 23 172.1033 19 33 137.8262 27 43 123.7259
4 14 162.0778 12 24 162.6648 20 34 160.1183 28 44 130.7300
5 15 170.1884 13 25 165.4449 21 35 155.4419 29 45 114.1148
6 16 156.9287 14 26 142.2121 22 36 127.1715 30 46 151.1943
7 17 175.4092 15 27 154.3557 23 37 138.0237 31 47 121.7235
8 18 173.3972 16 28 145.6544 24 38 133.4589 32 48 140.9424
Table 6. The example data set in a one-way ANOVA layout. The participant labels in the P column correspond to the labels in Table 1 and the ij-indices correspond to the relabelled dependent variables y_{ij} (1 ≤ i ≤ 4, 1 ≤ j ≤ 8).
One-way ANOVA designs are usually visualized by depicting the group sample means and the associated standard deviations or standard errors of the mean.
Figure 5. Visualization of a one-way ANOVA design. Errorbars depict the within-group standard deviations.
Unfortunately, the GLM formulation of the one-way ANOVA design as described above requires a reformulation in order to enable the estimation of its parameters. As it stands, the design is "overparameterized". This problem may be viewed from at least three perspectives. From a data-analytical perspective, we have "more unknowns than observed variables": we can obtain average data from m groups, but have m + 1 parameters, α₁, …, α_m and the offset μ₀, to determine. From the perspective of the design matrix formulation (4), the design matrix is rank-deficient, because the first column is the sum of the last m columns. It can be shown that the rank-deficiency of X ∈ ℝ^{n×(m+1)} causes the cross-product matrix XᵀX ∈ ℝ^{(m+1)×(m+1)} to be rank-deficient as well. This in turn corresponds to XᵀX being non-invertible, which implies that the OLS beta estimator is not defined for the one-way ANOVA GLM formulation as defined so far. Finally, from the perspective of systems of linear equations, "overparameterization" implies that we have more unknown parameters than equations from which to determine them. A simple example is the system of linear equations

p₁ + p₂ + p₃ = 0
2p₁ + p₂ + p₃ = 1
⇔ (1 1 1; 2 1 1)(p₁, p₂, p₃)ᵀ = (0, 1)ᵀ (5)
In (5) we have two equations (and thus two "outcome measures" associated with a specific parameter combination) and three "parameters" p₁, p₂ and p₃. The problem is that different parameter values for p₁, p₂ and p₃ solve the system (or model) described by (5), and we thus cannot uniquely infer the parameters from the measurement outcomes. For example, both parameter vectors p_a = (1, 1, −2)ᵀ and p_b = (1, −1, 0)ᵀ solve the system of linear equations.
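The non-uniqueness of (5) can be verified directly. A sketch in Python/NumPy showing that both stated parameter vectors reproduce the same outcome vector, and that the matrix rank (2) is smaller than the number of unknowns (3):

```python
import numpy as np

# The coefficient matrix and outcome vector of equation (5).
A = np.array([[1., 1., 1.],
              [2., 1., 1.]])
target = np.array([0., 1.])

# The two parameter vectors given in the text.
p_a = np.array([1., 1., -2.])
p_b = np.array([1., -1., 0.])

same_a = np.allclose(A @ p_a, target)       # p_a solves the system
same_b = np.allclose(A @ p_b, target)       # ... and so does p_b
rank = np.linalg.matrix_rank(A)             # 2 equations for 3 unknowns
```

The same rank check applied to the one-way ANOVA design matrix of (4) would return m rather than m + 1 columns' worth of rank, which is the overparameterization in matrix form.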
To nevertheless obtain a useful one-way ANOVA model, the model of equations (1) – (4) needs to be reformulated. There are several ways in which this can be done. One approach is to set μ₀ = 0, i.e. to simply drop the constant offset μ₀. If this approach is chosen, the α_i (i ∈ ℕ_m) become the factor level expectations, and α_i represents the expected response at factor level i. While simple and attractive, this approach does not generalize well to models with more than one factor, and the so-called "reference cell method" is thus preferred. This approach entails, instead of setting the overall offset μ₀ to zero, setting one of the α_i's to zero. Conventionally, one sets α₁ := 0, but any of the groups could be chosen as the "reference cell" (or "reference level"). Importantly, in this approach, the parameter μ₀ becomes the expected response
of the reference cell, and 𝛼𝑖 becomes the effect of level 𝑖 of the factor compared to the reference cell. In
other words 𝛼𝑖 becomes the expected difference between a given level of the experimental factor and the
reference cell. The table below illustrates the reformulation at the single-data point level implementing the
reference cell method for the example case of four levels.
Original Formulation Reference Cell Formulation
Level 1 𝜇0 + 𝛼1 𝜇0
Level 2 𝜇0 + 𝛼2 𝜇0 + 𝛼2
Level 3 𝜇0 + 𝛼3 𝜇0 + 𝛼3
Level 4 𝜇0 + 𝛼4 𝜇0 + 𝛼4
Table 7 The reference cell reformulation of the one-way ANOVA model.
At the level of the GLM formulation, the reference cell method is equivalent to removing one of the indicator variables representing the levels of the factor. If we choose to set the first level's effect to zero, we obtain the following reformulation

p(y) := N(y; Xβ, σ²I_n) (6)

where

y := (y_{11}, …, y_{1n₁}, y_{21}, …, y_{2n₂}, y_{31}, …, y_{3n₃}, y_{41}, …, y_{4n₄})ᵀ ∈ ℝ^n,

X := (1_{n₁} 0 0 0; 1_{n₂} 1_{n₂} 0 0; 1_{n₃} 0 1_{n₃} 0; 1_{n₄} 0 0 1_{n₄}) ∈ ℝ^{n×4}, β := (μ₀, α₂, α₃, α₄)ᵀ ∈ ℝ⁴ and σ² > 0 (7)
We have now formulated an explicit GLM for the case of the one-way ANOVA. Parameter estimation and model inference can then proceed using the results of the general GLM theory.
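The reference-cell design matrix is of full column rank, so the OLS estimator is now defined. A sketch in Python/NumPy with m = 4 groups of two assumed, noise-free observations each, showing that β̂ recovers the reference-cell mean and the three group differences:

```python
import numpy as np

# Assumed group expectations and two noise-free observations per group.
means = np.array([160., 155., 145., 135.])
y = np.repeat(means, 2)
group = np.repeat(np.arange(4), 2)

# Reference-cell design: a column of ones plus indicators for groups 2-4.
X = np.column_stack([np.ones(8)]
                    + [(group == i).astype(float) for i in (1, 2, 3)])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (mu0, alpha2, alpha3, alpha4)
```

Here μ̂₀ equals the first group's mean and each α̂_i equals the i-th group mean minus the reference-cell mean, exactly as the reference cell method prescribes.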
(7) Multifactorial designs and two-way ANOVA
Multifactorial designs are characterized by the fact that two or more independent experimental
factors are manipulated and all possible combinations of both factors assessed. They are usually referred to
simply as “factorial designs”. For example, a typical 2 × 2-factorial design employed in cognitive
neuroimaging could involve a stimulus manipulation (e.g. low and highly degraded visual stimuli) and a
cognitive manipulation (e.g. attended and unattended visual stimuli). Any form of two-dimensional 𝑛 ×𝑚 or
higher-dimensional 𝑛 ×𝑚 × 𝑝 × 𝑞 × … factorial design is conceivable. Due to experimental constraints and
the aim to measure each factorial combination from the same amount of experimental trials, 2 × 2-factorial
designs are probably the most prevalent designs in cognitive neuroimaging. Factorial designs allow for
measuring (1) the main effect of each factor, i.e. the differential variability in the outcome measure due to
different levels of this factor averaged over the other factors, (2) the interaction between factors. In intuitive
terms, an interaction in a 2 × 2-factorial design refers to a difference in a difference.
To illustrate the concept of a 2 × 2-factorial design before concerning ourselves with its GLM formulation, we consider the example data set of Table 1. Here, we define the experimental factors "Age" and "Alcohol" and allow each of these factors to take on only two levels: "younger than 31 years" (factor Age, level 1) and "31 years or older" (factor Age, level 2), and "less than 5 units of alcohol consumption" (factor Alcohol, level 1) and "5 or more units of alcohol consumption" (factor Alcohol, level 2). Each combination of a specific level of one factor with a specific level of the other factor is referred to as a "cell" of the design. 2 × 2-factorial designs are sensibly depicted using a square lattice (Figure 5), and average data from the different cells of the design are commonly depicted as a bar graph (Figure 6). According to the layout in the square lattice, the factors may also be referred to as "row" and "column" factors, respectively.
Figure 5. Conceptual visualization of a two-way ANOVA experimental design
Figure 6. Visualization of data obtained in a two-way ANOVA design. Errorbars depict the within-group standard deviations.
Based on the average observed DLPFC volume for each cell of the design and the various forms of variability about the means, the following set of questions may now be investigated:

1. Does DLPFC volume change with the age of the participant, irrespective of (i.e. averaged over) whether the participant consumes more or fewer units of alcohol? Colloquially, the answer to this question is referred to as the "main effect of age".

2. Does DLPFC volume change with the alcohol consumption of the participant, irrespective of (i.e. averaged over) whether the participant belongs to the young or the old age group? Colloquially, the answer to this question is referred to as the "main effect of alcohol".

3. Does the difference in DLPFC volume observed for the different levels of the age factor change with the different levels of the alcohol factor? Or, the other way round, does the difference in DLPFC volume observed for the different levels of the alcohol factor change between the old and young age groups? This difference of differences is colloquially referred to as the "interaction between age and alcohol".
In the current section, we first discuss the two-way ANOVA GLM formulation that applies if only the first two questions are of interest, and then extend this formulation to the case that all three questions are of interest. To formulate the two-way ANOVA GLM, it is helpful to first adapt the notation to the 2 × 2-factorial design. We start by considering the dependent variables. Specifically, we now use the notation

y_{ijk} ∈ ℝ, where i = 1,…,r, j = 1,…,c and k = 1,…,n_{ij} (1)

to denote the k-th data point in the cell corresponding to the combination of the i-th level of the "row factor" (i = 1,…,r) with the j-th level of the "column factor" (j = 1,…,c). Each cell comprises n_{ij} ∈ ℕ data points. For the special case of a 2 × 2 ANOVA design, we have r = c = 2.
Low (L1) High (L2)
P (𝑖𝑗𝑘) 𝒚𝟏𝟏𝒌: DLPFC Volume P (𝑖𝑗𝑘) 𝒚𝟏𝟐𝒌: DLPFC Volume
1 111 178.7708 2 121 168.4660
5 112 170.1884 3 122 169.9513
7 113 175.4092 4 123 162.0778
Young (L1) 8 114 173.3972 6 124 156.9287
11 115 172.1033 9 125 154.4907
12 116 162.6648 10 126 158.3642
13 117 165.4449 14 127 142.2121
15 118 154.3557 16 128 145.6544
P (𝑖𝑗𝑘) 𝒚𝟐𝟏𝒌: DLPFC Volume P (𝑖𝑗𝑘) 𝒚𝟐𝟐𝒌: DLPFC Volume
17 211 155.5286 19 221 137.8262
18 212 150.5144 22 222 127.1715
20 213 160.1183 23 223 138.0237
Old (L2) 21 214 155.4419 24 224 133.4589
25 215 139.3813 27 225 123.7259
26 216 145.1997 28 226 130.7300
30 217 151.1943 29 227 114.1148
32 218 140.9424 31 228 121.7235
Table 8 The example data set of Table 1 in a 2 × 2 ANOVA layout with row factor “Age”, taking on the levels “Young” and “Old” and the column factor “Alcohol” taking on the levels “Low” and “High”. Note that the column P denotes the original participant label, while the column (𝑖𝑗𝑘) denotes the relabelled dependent variable index.
For the 2 × 2 ANOVA design without interaction, we conceive of the expectation of each data variable y_{ijk} as the sum of the effects of the levels of each factor. In other words, we conceive of y_{ijk} as a realization of a random variable with a univariate Gaussian distribution of the form

p(y_{ijk}) = N(y_{ijk}; μ_{ij}, σ²) ⇔ y_{ijk} = μ_{ij} + ε_{ijk}, p(ε_{ijk}) := N(ε_{ijk}; 0, σ²) (2)

for k = 1,…,n_{ij}, where

μ_{ij} := μ₀ + α_i + β_j (3)

In this formulation, μ₀ represents a constant offset common to all cells, α_i (i = 1,…,r) represents the effect of the i-th level of the row factor, and β_j (j = 1,…,c) represents the effect of the j-th level of the column factor. In the design matrix formulation of the GLM defined in (2) and (3), the design matrix X ∈ ℝ^{n×(1+r+c)} comprises a column of ones representing the constant offset and two sets of indicator variables representing the r ∈ ℕ levels of the row factor and the c ∈ ℕ levels of the column factor, while the beta parameter vector encodes the effect of each level of each factor. That is, we have
p(y) = N(y; Xβ, σ²I_n) (4)

where

y := (y_{111}, …, y_{11n₁₁}, y_{121}, …, y_{12n₁₂}, y_{211}, …, y_{21n₂₁}, y_{221}, …, y_{22n₂₂})ᵀ ∈ ℝ^n,

X := (1 1 0 1 0; 1 1 0 0 1; 1 0 1 1 0; 1 0 1 0 1) ∈ ℝ^{n×5}, β := (μ₀, α₁, α₂, β₁, β₂)ᵀ ∈ ℝ⁵ and σ² > 0 (5)

where each row of the displayed pattern for X stands for a block of n_{ij} identical design matrix rows corresponding to the cells (1,1), (1,2), (2,1) and (2,2), respectively.
As for the one-way ANOVA, the model defined above is overparameterized. Effectively, we have five parameters and only four equations for the respective cell expectations. Viewed differently, on the level of equation (3) we could add a constant either to each of the α_i's or to each of the β_j's and subtract it from μ₀ without altering any of the expected responses. We thus require two constraints to identify the model. These correspond to setting

α₁ := β₁ := 0 (6)

and thus identifying the combination of the first level of the row factor and the first level of the column factor as the reference cell. The meaning of the remaining parameters is then as provided in Tables 9 and 10. The entries in these tables depict the expected dependent variable responses for each combination of levels of the row and column factors in terms of the initial formulation of the additive 2 × 2 ANOVA and in terms of its reference cell reformulation, respectively.
In the reference cell formulation of Table 10, μ₀ ∈ ℝ represents the expected response in the reference cell, α_i (i = 1, 2) represents the effect of level i of the row factor (compared to level 1) for any fixed level of the column factor, and β_j (j = 1, 2) represents the effect of level j of the column factor (compared to level 1) for any fixed level of the row factor. As for the one-way ANOVA, the parameters α₂ and β₂ encode the differences in expected values between the design cells. Note that the model is additive in the sense that the effect of each factor is the same at all levels of the other factor. To see this, consider moving from the first to the second row: the expected response increases by α₂, regardless of whether one moves down the first or the second column.
1 2
1 𝜇0 + 𝛼1 + 𝛽1 𝜇0 + 𝛼1 + 𝛽2
2 𝜇0 + 𝛼2 + 𝛽1 𝜇0 + 𝛼2 + 𝛽2
Table 9 Initial formulation of an over-parameterized two-way additive ANOVA GLM model
Table 10 Reference cell method reformulation of the two-way additive ANOVA GLM model
1 2
1 𝜇0 𝜇0 + 𝛽2
2 𝜇0 + 𝛼2 𝜇0 + 𝛼2 + 𝛽2
Equivalently, the design matrix defined in (5) is not of full column rank, because the row factor indicator variables (columns 2 and 3) as well as the column factor indicator variables (columns 4 and 5) add up to the constant offset indicator (column 1). The two required constraints correspond to dropping the indicator variables for the first row factor level and the first column factor level. This results in the following reformulation of the GLM for the 2 × 2 ANOVA layout:

p(y) = N(y; Xβ, σ²I_n) (7)

where

y := (y_{111}, …, y_{11n₁₁}, y_{121}, …, y_{12n₁₂}, y_{211}, …, y_{21n₂₁}, y_{221}, …, y_{22n₂₂})ᵀ ∈ ℝ^n,

X := (1 0 0; 1 0 1; 1 1 0; 1 1 1) ∈ ℝ^{n×3}, β := (μ₀, α₂, β₂)ᵀ ∈ ℝ³ and σ² > 0 (8)

where each row of the displayed pattern for X stands for a block of n_{ij} identical design matrix rows corresponding to the cells (1,1), (1,2), (2,1) and (2,2), respectively.
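The identified additive design can be checked numerically. A sketch in Python/NumPy with one assumed observation per cell and assumed parameter values, showing that the reference-cell design matrix has full column rank and that OLS recovers μ₀, α₂ and β₂ from noise-free cell data:

```python
import numpy as np

# Reference-cell 2x2 additive design; rows correspond to the cells
# (1,1), (1,2), (2,1), (2,2), one (assumed) observation per cell.
X = np.array([[1., 0., 0.],     # cell (1,1): mu0
              [1., 0., 1.],     # cell (1,2): mu0 + beta2
              [1., 1., 0.],     # cell (2,1): mu0 + alpha2
              [1., 1., 1.]])    # cell (2,2): mu0 + alpha2 + beta2

# Assumed parameter values (for illustration only).
theta = np.array([170.0, -20.0, -10.0])      # (mu0, alpha2, beta2)
y = X @ theta                                # noise-free cell data

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that the row-factor effect α₂ enters both second-row cells identically: this additivity is exactly what the interaction term of the next formulation relaxes.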
Based on the formulation of the two-way ANOVA design in (8) we may test for significant main effects of
either experimental factor. We cannot, however, test for a significant interaction, because this has not been
modelled by the GLM. In order to allow for the modelling of interaction effects in 2 × 2-factorial designs, the
GLM of the previous section is modified as follows
$$
p(y_{ijk}) = N(y_{ijk}; \mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \quad p(\varepsilon_{ijk}) := N(\varepsilon_{ijk}; 0, \sigma^2) \quad (10)
$$
for 𝑘 = 1,… , 𝑛𝑖𝑗 where we now define
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 (11)
In this formulation the first three terms are familiar: 𝜇0 is a constant, and 𝛼𝑖 and 𝛽𝑗 are the main effects of level 𝑖 ∈ ℕ𝑟 of the row factor and level 𝑗 ∈ ℕ𝑐 of the column factor, respectively. The new term (𝛼𝛽)𝑖𝑗 is an interaction effect. It represents the effect of the combination of levels 𝑖 and 𝑗 of the row and column factors. The notation (𝛼𝛽) should be understood as a single symbol, not as a product. One could have chosen 𝛾𝑖𝑗 to denote this interaction effect, but the notation (𝛼𝛽)𝑖𝑗 is more suggestive and reminds us that the term corresponds to an effect due to the combination of the levels 𝑖 and 𝑗 of each factor. Table 11 visualizes the two-way ANOVA with interaction parameters.
                 Column factor 1                  Column factor 2
Row factor 1     𝜇0 + 𝛼1 + 𝛽1 + (𝛼𝛽)11           𝜇0 + 𝛼1 + 𝛽2 + (𝛼𝛽)12
Row factor 2     𝜇0 + 𝛼2 + 𝛽1 + (𝛼𝛽)21           𝜇0 + 𝛼2 + 𝛽2 + (𝛼𝛽)22

Table 11 Initial formulation of an over-parameterized two-way ANOVA GLM model with interaction.
As always, equations (10) and (11) can be reformulated in design matrix form:
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛)
where
$$
y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 9}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \\ (\alpha\beta)_{11} \\ (\alpha\beta)_{12} \\ (\alpha\beta)_{21} \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^9 \text{ and } \sigma^2 > 0
$$
As can be seen, addition of the second and third columns, of the fourth and fifth columns, and of the sixth to ninth columns each results in the first column, yielding a multiply rank-deficient design matrix. We thus re-express (10) and (11) in terms of an extended reference cell method, which sets to zero all parameters involving the first row or the first column of the two-way layout, such that
𝛼1 ≔ 𝛽1 ≔ (𝛼𝛽)1𝑗 ≔ (𝛼𝛽)𝑖1 ≔ 0 (𝑗 ∈ ℕ𝑐 , 𝑖 ∈ ℕ𝑟) (12)
The meaning of the remaining parameters can then be read off from Table 12. Here, 𝜇0 ∈ ℝ is the expected response in the reference cell, just as before. The main effects are now more specialized: 𝛼2 is the expected difference in response due to level 2 of the row factor, compared to level 1, when the column factor is at level 1. 𝛽2 is the expected difference due to level 2 of the column factor, compared to level 1, when the row factor is at level 1. The interaction term (𝛼𝛽)22 is the additional effect of level 2 of the row factor, compared to level 1, when the column factor is at level 2 rather than 1. This term can also be interpreted as the additional effect of level 2 of the column factor, compared to level 1, when the row factor is at level 2 rather than 1. The key feature of this model is that the effect of a factor now depends on the level of the other. For example, the effect of level 2 of the row factor, compared to level 1, is 𝛼2 in the first column and 𝛼2 + (𝛼𝛽)22 in the second column.
The design matrix formulation of the 2 × 2 ANOVA GLM with interaction after reformulation based on the extended reference cell method then takes the following form: The design matrix is of size

𝑛 × (1 + (𝑟 − 1) + (𝑐 − 1) + (𝑟 − 1) ⋅ (𝑐 − 1)) (13)

Specifically, it comprises a column of ones to represent the constant offset 𝜇0, a set of (𝑟 − 1) ∈ ℕ indicator variables representing the row effects, a set of (𝑐 − 1) ∈ ℕ indicator variables representing the column effects, and a set of (𝑟 − 1) ⋅ (𝑐 − 1) ∈ ℕ indicator variables representing the interactions. The easiest way to compute the values of the interaction indicator variables is as products of the row and column indicator variable values. In other words, if 𝑟𝑖 takes the value 1 for observations in row 𝑖 and 0 otherwise, and 𝑐𝑗 takes the value 1 for observations in column 𝑗 and 0 otherwise, then the product 𝑟𝑖𝑐𝑗 takes the value 1 for observations that are in row 𝑖 and column 𝑗, and is 0 for all others.

                 Column factor 1        Column factor 2
Row factor 1     𝜇0                     𝜇0 + 𝛽2
Row factor 2     𝜇0 + 𝛼2                𝜇0 + 𝛼2 + 𝛽2 + (𝛼𝛽)22

Table 12 Reference cell method reformulation of the two-way ANOVA GLM model with interaction.
For the case of a 2 × 2 ANOVA model with interaction, the design matrix formulation after reference cell
reformulation thus takes the following form
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (14)
where
$$
y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0 \quad (15)
$$
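The interaction column of this design matrix can be generated as the element-wise product of the two main-effect indicator columns. A minimal numpy sketch, with hypothetical cell sizes:

```python
import numpy as np

# Hypothetical cell sizes for the 2 x 2 layout with interaction
n_cells = [3, 3, 3, 3]
n = sum(n_cells)

row2 = np.repeat([0, 0, 1, 1], n_cells)   # row factor level-2 indicator
col2 = np.repeat([0, 1, 0, 1], n_cells)   # column factor level-2 indicator
inter = row2 * col2                       # element-wise product of indicators
X = np.column_stack([np.ones(n), row2, col2, inter])

# Full column rank: (mu_0, alpha_2, beta_2, (ab)_22) are all identifiable
assert np.linalg.matrix_rank(X) == 4

# The interaction indicator equals 1 only for observations in cell (2, 2)
assert inter.sum() == n_cells[3]
```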
(8) Analysis of Covariance
The simplest way to think about the analysis of covariance (ANCOVA) is as a combination between
the categorical ANOVA-type and the continuous regression-type approaches. To illustrate this we again
consider the example data set and now conceive the participant’s age as a categorical factor with levels
young (< 31 years of age) and old (≥ 32 years of age) and alcohol consumption as a continuous factor. As
above, we modify our notation to reflect the structure of the data. To this end let 𝑚 ∈ ℕ denote the number
of groups or levels of the discrete factor, 𝑛𝑖 the number of observations in group 𝑖 ∈ ℕ𝑚, 𝑦𝑖𝑗 the value of the
dependent variable for the 𝑗th unit in the 𝑖th group, and 𝑥𝑖𝑗 the value
of the continuous factor for the 𝑗th unit in the 𝑖th group, where 𝑗 = 1, . . . , 𝑛𝑖 and 𝑖 = 1, . . . , 𝑚. This
notation is illustrated for the example data set in Table 13.
𝒊 = 𝟏: Young Group 𝒊 = 𝟐: Old Group
P 𝒙𝟏𝒋 : Alcohol 𝒚𝟏𝒋: DLPFC Volume P 𝒙𝟐𝒋 : Alcohol 𝒚𝟐𝒋: DLPFC Volume
1 3 178.7708 17 3 155.5286
2 6 168.4660 18 4 150.5144
3 5 169.9513 19 7 137.8262
4 7 162.0778 20 1 160.1183
5 4 170.1884 21 2 155.4419
6 8 156.9287 22 8 127.1715
7 1 175.4092 23 5 138.0237
8 2 173.3972 24 6 133.4589
9 7 154.4907 25 4 139.3813
10 5 158.3642 26 3 145.1997
11 1 172.1033 27 7 123.7259
12 3 162.6648 28 5 130.7300
13 2 165.4449 29 8 114.1148
14 8 142.2121 30 1 151.1943
15 4 154.3557 31 6 121.7235
16 6 145.6544 32 2 140.9424
Table 13 The example data set of Section 7.2 rearranged in an analysis of covariance layout. Note that the column P denotes the original participant label.
As usual each data point 𝑦𝑖𝑗 is conceived as a realization of a random variable with univariate
Gaussian distribution. In the ANCOVA case with one discrete factor taking on two levels and one continuous
factor, this distribution takes the form
$$
p(y_{ij}) = N(y_{ij}; \mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ij} = \mu_{ij} + \varepsilon_{ij}, \quad p(\varepsilon_{ij}) := N(\varepsilon_{ij}; 0, \sigma^2), \quad \sigma^2 > 0 \quad (1)
$$
where 𝑖 = 1, 2 and 𝑗 = 1,… , 𝑛𝑖 for each value of 𝑖. Then, to express the dependence of the expected
response 𝜇𝑖𝑗 on the discrete factor with two levels we use an ANOVA-type model of the form
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 (𝑖 = 1,2) (2)
whereas to model the effect of a continuous predictor, we use a regression-type model of the form
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛽1𝑥𝑖𝑗 (3)
where 𝑗 = 1,… , 𝑛𝑖 for each value of 𝑖 = 1,2. Combining these models we obtain the additive ANCOVA model
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽1𝑥𝑖𝑗 (4)
Essentially, this model defines a set of straight-line regressions, one for each level of the discrete
factor. These lines have different offsets 𝜇0 + 𝛼𝑖, but a common slope 𝛽1; in other words, they are parallel. The common slope 𝛽1 represents the effect of the continuous covariate at any level of the factor, and the differences in the offsets 𝛼𝑖 (𝑖 = 1,2) represent the effects of the discrete factor at any given value of the covariate.
The ANCOVA model can be formulated in design matrix form by letting the design matrix 𝑋 ∈
ℝ𝑛×(𝑚+2) have a column of 1's representing the constant, a set of 2 indicator variables representing the
levels of the discrete factor, and a column with the values of the continuous variate. Together, this amounts
to the following formulation of the ANCOVA model with one discrete, two-level factor, and one continuous
factor:
𝑝(𝑦) = 𝑁(𝑦; 𝑋𝛽, 𝜎2𝐼𝑛) (5)
where
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & x_{1,n_1} \\
1 & 0 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0 \quad (6)
$$
As for the ANOVA models discussed above, the model as defined by the equations above is not identified: one could add a constant to each 𝛼𝑖 and subtract it from 𝜇0 without changing any of the expected values. To solve this problem, we apply the reference cell method and set 𝛼1 ≔ 0, so that 𝜇0 becomes the intercept of the reference cell, and 𝛼𝑖 becomes the difference in intercepts between level 𝑖 and level 1 of the factor. At the level of the GLM formulation, the model is not of full column rank because the indicator variables add up to the constant, so one of them is removed to obtain the reference cell parameterization shown in expression (7) below.
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\
1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \quad (7)
$$
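The parallel-lines interpretation of this parameterization can be checked by simulation. The following numpy sketch uses hypothetical group sizes, covariate values, and parameter values (not the example data set) and recovers the parameters by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical group sizes and covariate values for the two factor levels
n1 = n2 = 20
x1 = rng.uniform(1, 8, n1)
x2 = rng.uniform(1, 8, n2)

# Reference cell design matrix: constant, level-2 indicator, covariate
X = np.column_stack([np.ones(n1 + n2),
                     np.concatenate([np.zeros(n1), np.ones(n2)]),
                     np.concatenate([x1, x2])])

# Simulate data with known (mu_0, alpha_2, beta_1) and recover them by OLS;
# the model corresponds to two parallel regression lines with offsets
# mu_0 and mu_0 + alpha_2 and common slope beta_1
beta_true = np.array([175.0, -20.0, -3.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n1 + n2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_true, atol=2.0)
```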
Figure 7. Visualization of a purely additive ANCOVA analysis of the example data set with the categorical factor “age” and the continuous factor “alcohol consumption”. Note that in the purely additive design, a common slope parameter is fitted to all levels of the categorical factor.
Finally, we consider an ANCOVA model which allows for group-specific slopes, i.e. a model that
explicitly models the interaction between the categorical and the continuous experimental factor. In
structural form, we now assume that the expectation of the 𝑗th outcome measure on the 𝑖th level of the
categorical factor is given by
𝜇𝑖𝑗 ≔ (𝜇0 + 𝛼𝑖) + (𝛽0 + 𝛾𝑖)𝑥𝑖𝑗 (8)
for 𝑖 = 1,… ,𝑚 and 𝑗 = 1,… , 𝑛𝑖. In the model described by (8), (𝜇0 + 𝛼𝑖) represents the group specific
offset of the regression lines of each group, while (𝛽0 + 𝛾𝑖) represents the group specific regression line
slopes. Both are described by the combination of a common offset and slope, 𝜇0 and 𝛽0, respectively, and a
group specific additive effect to offset and slope, 𝛼𝑖 and 𝛾𝑖 (𝑖 = 1,… ,𝑚), respectively.
We now consider the model described by (8) in more detail for the case of 𝑚 ≔ 2. Resolving the brackets in equation (8) yields

$$
\mu_{1j} = \mu_0 + \alpha_1 + \beta_0 x_{1j} + \gamma_1 x_{1j} \quad (9)
$$

and

$$
\mu_{2j} = \mu_0 + \alpha_2 + \beta_0 x_{2j} + \gamma_2 x_{2j} \quad (10)
$$
In design matrix form, we can write (9) and (10) as

$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 1 & 0 & x_{1,1} & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & x_{1,n_1} & x_{1,n_1} & 0 \\
1 & 0 & 1 & x_{2,1} & 0 & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & x_{2,n_2} & 0 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 6}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_0 \\ \gamma_1 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^6 \quad (11)
$$
In this design matrix, the first column is the sum of the second and third columns, while the fourth column is the sum of the fifth and sixth columns. The model is thus overparameterized, and we will use the reference cell restrictions to render it estimable. Specifically, we set 𝛼1 ≔ 𝛾1 ≔ 0. This changes the interpretation of the remaining parameters as follows: 𝜇0 and 𝛽0 correspond to the offset and slope of the reference cell, i.e., the first group, and 𝛼2 and 𝛾2 correspond to the expected differences in offset and slope when the categorical factor is at level 2 rather than 1. In design matrix form, we obtain
$$
y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad
X := \begin{pmatrix}
1 & 0 & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} & 0 \\
1 & 1 & x_{2,1} & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_0 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^4 \quad (12)
$$
Note that the design matrix column that models the difference in slopes between the levels of the categorical factor, i.e., the last column, can be conceived as the element-wise product of the column modelling the difference in offset between levels of the categorical factor (the second column) and the column modelling the common slope (the third column). In this representation, 𝛾2 is also referred to as the “interaction effect”, because it models how the effect of the continuous factor differs between levels of the categorical factor. A common application of this type of ANCOVA design with interaction in cognitive neuroimaging are so-called “psychophysiological interaction” (PPI) analyses. In Figure 8, we visualize the ANCOVA design with interaction applied to the example data set of Table 1.
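The rank deficiency of the six-column formulation and the product construction of the interaction column can both be verified numerically. A short numpy sketch with hypothetical group sizes and covariate values:

```python
import numpy as np

rng = np.random.default_rng(1)
n1 = n2 = 10
x = np.concatenate([rng.uniform(1, 8, n1), rng.uniform(1, 8, n2)])
g1 = np.concatenate([np.ones(n1), np.zeros(n2)])   # level-1 indicator
g2 = 1.0 - g1                                      # level-2 indicator
ones = np.ones(n1 + n2)

# Over-parameterized six-column model: column 1 = column 2 + column 3
# and column 4 = column 5 + column 6, so only four columns are independent
X_full = np.column_stack([ones, g1, g2, x, g1 * x, g2 * x])
assert np.linalg.matrix_rank(X_full) == 4

# Reference cell version: the slope-difference column is the element-wise
# product of the level-2 indicator and the covariate column
X_ref = np.column_stack([ones, g2, x, g2 * x])
assert np.linalg.matrix_rank(X_ref) == 4
```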
Figure 8. Visualization of an ANCOVA analysis with interaction of the example data set with the categorical factor “age” and the continuous factor “alcohol consumption”. Note that in the ANCOVA design with interaction, different slopes result for different levels of the categorical factor.
Study questions
1. Explain the notions of categorical, continuous, and multifactorial experimental designs.
2. Write down the design matrix formulation for the independent and identical sampling from a univariate Gaussian distribution.
3. Write down the design matrix formulation of a simple linear regression model. What do the two beta parameters refer to?
4. Write down the design matrix formulation of a multiple linear regression model comprising three parametric regressors.
5. Write down the design matrix formulation of a one-sample t-test. In which experimental situation is a one-sample t-test
appropriate?
6. Write down the design matrix formulation of an independent two-sample t-test. In which experimental situation is an
independent two-sample t-test appropriate?
7. Verbally, explain the notion of the reference cell method reformulation of ANOVA GLM designs.
8. Verbally explain the notion of a statistical interaction using a 2 x 2 factorial experimental design of your choice.
9. Discuss the commonalities and differences between one-way ANOVA designs, additive two-way ANOVA design, and two-way
ANOVA designs with interaction.
10. Write down the design matrix formulation of a 2 x 2 ANOVA design with interaction after its reference cell formulation.
11. Write down the design matrix formulation of an additive ANCOVA model with one two-level discrete and one parametric
experimental factor after its reference cell reformulation.
12. Write down the design matrix formulation of an ANCOVA model with interaction with one two-level discrete and one
parametric experimental factor after its reference cell reformulation.
Study Questions Answers
1. In categorical designs the independent variable, also referred to as the experimental factor, takes on discrete levels that usually
do not bear a quantitative relationship to one another. In continuous designs, the experimental variable or experimental
variable takes on quantitative values over a specific range with explicit quantitative relationships. In multifactorial designs,
typically multiple categorical experimental factors are crossed, such that the dependent variable is observed for all
combinations of all levels of all factors.
2. Sampling 𝑛 times independently and identically from a univariate Gaussian distribution can be written in design matrix form as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}, \; \beta \in \mathbb{R} \text{ and } \sigma^2 > 0
$$
3. Simple linear regression for a data set comprising 𝑛 observations can be written in design matrix form as
$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \in \mathbb{R}^{n \times 2}, \; \beta \in \mathbb{R}^2 \text{ and } \sigma^2 > 0
$$
The first component of the beta parameter vector 𝛽 ∈ ℝ2 corresponds to the offset of the simple linear regression line, i.e. its
crossing of the y-axis, while the second component of the beta parameter vector corresponds to the slope of the simple linear
regression line.
4. For three parametric regressors, denoted here by the 𝑛-dimensional vectors 𝑥(1), 𝑥(2), 𝑥(3) ∈ ℝ𝑛, the multiple regression model for a data set 𝑦 ∈ ℝ𝑛 can be written in design matrix form as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & x_1^{(1)} & x_1^{(2)} & x_1^{(3)} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & x_n^{(3)} \end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
5. The design matrix form of the GLM underlying the one-sample T-Test can be written as

$$
p(y) := N(y; X\beta, \sigma^2 I_n) \text{ where } y \in \mathbb{R}^n, \; X := (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}, \; \beta \in \mathbb{R} \text{ and } \sigma^2 > 0
$$

It is appropriate in single-factor, single-categorical-level designs and can be used to evaluate whether the null hypothesis that all data points were generated from univariate Gaussian distributions with an identical expectation parameter of a specified value (typically zero) can be rejected.
6. In its design matrix formulation, the two-sample T-Test for independent, equally-sized samples under the assumption of equality of the respective group variances for data of two groups 𝐴 and 𝐵 with 𝑛 ≔ 𝑛𝐴 + 𝑛𝐵 can be written as

$$
p(y) := N(y; X\beta, \sigma^2 I_n), \text{ where } y := \begin{pmatrix} y_1^A \\ \vdots \\ y_{n_A}^A \\ y_1^B \\ \vdots \\ y_{n_B}^B \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \in \mathbb{R}^{n \times 2}, \; \beta \in \mathbb{R}^2 \text{ and } \sigma^2 > 0
$$
Two-sample T-tests as formulated above are appropriate in experimental situations in which data from two groups of different
experimental subjects are available, the variances of the two groups can be assumed to be equal, and the interest lies in whether the underlying distributions have identical expectations.
7. In an ANOVA design, the “reference cell method” reformulation corresponds to setting one of the experimental effects to zero, usually the first effect of the first cell. Using this approach, the offset parameter becomes the expected response of the reference cell, and all other effects become the level-dependent effects compared to the reference cell. In other words, all other effects become the expected differences between a given level of the experimental factor and the reference cell.
8. A statistical interaction in a 2 x 2 factorial design refers to a difference in a difference. For example, if both the visual coherence (low/high) and the contrast (low/high) of a stimulus are manipulated and a data pattern suggests faster stimulus recognition times for low visual coherence than high visual coherence when the contrast is high, but slower stimulus recognition times for low visual coherence than high visual coherence when the contrast is low, one would speak of an interaction.
9. One-way ANOVA designs, additive two-way ANOVA design, and two-way ANOVA designs with interaction can all be formulated
using the GLM notation and the reference cell method reformulation. They differ with respect to how many columns are included in
the GLM design matrix.
10. Let 𝑛𝑖𝑗 refer to the number of data points 𝑦𝑖𝑗𝑘 (𝑘 = 1,… , 𝑛𝑖𝑗) of the 𝑖th (𝑖 = 1,2) level of the first and 𝑗th (𝑗 = 1,2) level of the
second factor. Then, in its design matrix formulation, the 2 x 2 ANOVA design with interaction takes the following form
$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
11. For data vectors 𝑦1 ∈ ℝ𝑛1 and 𝑦2 ∈ ℝ𝑛2, corresponding to the data observed for the first and second level of the two-level discrete experimental factor, and corresponding vectors 𝑥1 ∈ ℝ𝑛1 and 𝑥2 ∈ ℝ𝑛2 for the values of the parametric/continuous factor, the design matrix formulation of an additive ANCOVA model with one two-level discrete and one parametric experimental factor after its reference cell reformulation is given by

$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\
1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 3}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \text{ and } \sigma^2 > 0
$$
12. Let 𝑦𝑖,𝑗 denote the 𝑗th data point (𝑗 = 1,… , 𝑛𝑖) at the 𝑖th level of the discrete factor and 𝑥𝑖,𝑗 (𝑖 = 1,2, 𝑗 = 1,… , 𝑛𝑖) denote the corresponding value of the parametric factor. Then the design matrix formulation of an ANCOVA model with interaction with one two-level discrete and one parametric experimental factor after its reference cell reformulation is given by

$$
p(y) = N(y; X\beta, \sigma^2 I_n) \text{ where } y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \; X := \begin{pmatrix}
1 & 0 & x_{1,1} & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} & 0 \\
1 & 1 & x_{2,1} & x_{2,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} & x_{2,n_2}
\end{pmatrix} \in \mathbb{R}^{n \times 4}, \; \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_0 \\ \gamma_2 \end{pmatrix} \in \mathbb{R}^4 \text{ and } \sigma^2 > 0
$$
Advanced Theory of the GLM
The generalized least-squares estimator and whitening
(1) Motivation
A central assumption in the estimation and inference theory discussed for the GLM
$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ and } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n) \quad (1)
$$
thus far has been the sphericity of the error covariance matrix, i.e. the fact that
$$
\mathrm{Cov}(\varepsilon) = \sigma^2 I_n \quad (2)
$$
In applications of the GLM, non-spherical error covariance matrices arise frequently, for example in the context of repeated measures designs (see Section “Repeated measures ANOVA”), mixed linear models (see Sections “Mixed Linear Models” and “Estimation of Mixed Linear Models”), or in the application of the GLM to time-series data with serial error correlations in the context of FMRI data analysis (see Section “FMRI serial correlations”). The aim of the current section is to establish why non-spherical error covariance matrices necessitate a modification of the GLM estimation and inference theory, and to discuss “generalized least-squares” estimation, a fundamental approach that allows the results of the GLM derived under the assumption of spherical error covariance matrices to be used also in the context of non-spherical error covariance matrices. Specific forms of non-spherical error covariance matrices will be elucidated in subsequent sections. Notably, throughout we assume that the error distribution parameters are known.
Formally, we consider the estimation of the beta parameter vector in a general GLM with non-spherical covariance matrix, i.e., the model

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ and } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (3)
$$

where 𝑉 ∈ ℝ𝑛×𝑛 is a known symmetric and positive-definite, but not necessarily spherical, matrix. In other words, 𝑉 is not necessarily of the form 𝐼𝑛, which we abbreviate below as “𝑉 ≠ 𝐼𝑛”.

To understand the implications of 𝑉 ≠ 𝐼𝑛, we first investigate the consequences of using the standard OLS beta parameter and variance estimators for the GLM specified in equation (3). If we consider the expectation of the OLS beta parameter estimator
$$
\hat{\beta} := (X^T X)^{-1} X^T y \quad (4)
$$
we see that it is unaffected by the covariance matrix of the error terms, and thus, in the sense of estimation bias, 𝑉 ≠ 𝐼𝑛 does not affect the estimation of the beta parameters:

$$
E(\hat{\beta}) = E((X^T X)^{-1} X^T y) = (X^T X)^{-1} X^T E(y) = (X^T X)^{-1} X^T X \beta = \beta \quad (5)
$$
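This unbiasedness can be illustrated by simulation. The following numpy sketch uses an illustrative AR(1)-like correlation structure and a hypothetical simple linear regression design; it shows that the average OLS estimate stays at the true beta even though the errors are correlated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.5])

# A non-spherical, AR(1)-like error covariance matrix (illustrative)
V = np.fromfunction(lambda i, j: 0.5 ** np.abs(i - j), (n, n))

# Average the OLS estimator over many draws of correlated errors:
# its mean stays at the true beta, illustrating E(beta_hat) = beta
B = np.linalg.pinv(X)                                 # (X^T X)^{-1} X^T
eps = rng.multivariate_normal(np.zeros(n), V, size=20000)
beta_hats = (X @ beta + eps) @ B.T
assert np.allclose(beta_hats.mean(axis=0), beta, atol=0.05)
```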
The classical approach to beta parameter inference as discussed in previous sections would now proceed by estimating the variance parameter 𝜎2 by

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n - p} \quad (6)
$$

and use $\hat{\sigma}^2 (X^T X)^{-1}$ as the estimator for the beta parameter covariance $\sigma^2 (X^T X)^{-1}$, which enters the denominator of the T-statistic. However, the beta parameter covariance for the GLM with 𝑉 ≠ 𝐼𝑛 does not
correspond to $\sigma^2 (X^T X)^{-1}$, as we show below, and thus classical inference is misguided. Using the linear transformation theorem for Gaussian distributions, we see that the covariance of the OLS beta estimator is

$$
\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} X^T (\sigma^2 V) ((X^T X)^{-1} X^T)^T = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1} \quad (7)
$$

and because 𝑉 ≠ 𝐼𝑛, the right-hand side of (7) does not correspond to $\sigma^2 (X^T X)^{-1}$. Hence, the T-statistic formed by the ratio of contrasts of $\hat{\beta}$ and $\hat{\sigma}^2 (X^T X)^{-1}$ is no longer guaranteed to be distributed according to the 𝑡-distribution. This result is illustrated in Figure 1. Similarly, it can be shown that in the case of a non-spherical error covariance matrix, the F-statistic is not distributed according to the 𝑓-distribution.
Figure 1 Effect of non-spherical error covariance matrices on the distribution of the T-statistic. The upper left-hand panel depicts a spherical covariance matrix of the form 𝜎2𝐼𝑛 for a simple linear regression model with 𝑛 = 10 data points. The lower left-hand panel depicts the familiar result that repeated sampling from the model specified using the spherical error covariance matrix yields a distribution of empirical 𝑇 values that follows the analytical 𝑡-distribution. The upper right-hand panel depicts a non-spherical covariance matrix for the same model as on the left-hand side, where the 𝑖, 𝑗th entry of the matrix is given by exp(−|𝑖 − 𝑗|). The interpretation of this covariance matrix will be explored in the sections on FMRI serial correlations. For now, we note that the empirical 𝑇 values sampled from a simple linear regression model with this non-spherical covariance matrix do not follow the corresponding 𝑡-distribution. Moreover, it can be seen that extremely large or small 𝑇 values are more likely to occur than predicted by the analytical 𝑡-distribution, which is based on the assumption of a spherical covariance matrix. Assuming a spherical covariance matrix when there are in fact error serial correlations as implemented in the non-spherical covariance matrix thus inflates the probability of detecting “significant results”.
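The variance inflation described above can be made explicit for the simplest design, the one-sample layout, where the sandwich expression in equation (7) can be evaluated in closed form. A numpy sketch with an illustrative correlation parameter (the value 0.8 is a hypothetical choice):

```python
import numpy as np

n = 10
sigma2 = 1.0
X = np.ones((n, 1))     # one-sample design: beta_hat is the sample mean

# Serially correlated errors with entries rho^|i-j| (illustrative rho)
V = np.fromfunction(lambda i, j: 0.8 ** np.abs(i - j), (n, n))

XtX_inv = np.linalg.inv(X.T @ X)                              # = 1/n
cov_assumed = (sigma2 * XtX_inv).item()                       # spherical theory
cov_true = (sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv).item()  # equation (7)

# Positive serial correlations inflate the true variance of beta_hat
# relative to the spherical assumption, making T values too liberal
assert cov_true > cov_assumed
```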
(2) Derivation of the generalized least-squares estimator
Under the assumption that the error covariance matrix 𝜎2𝑉 is known, there exists a conceptually simple solution to the problem of using the classical inference theory results in the context of non-spherical error covariance matrices. This solution is known as “generalized least-squares” estimation and is based on the following intuition: if classical inference is only valid in the case of spherical error covariance matrices, then a GLM with a non-spherical error covariance matrix is simply transformed into a GLM with a spherical error covariance matrix, and inference is performed based on this transformed GLM.
To introduce the approach, we assume for the moment that a matrix 𝐴 ∈ ℝ𝑛×𝑛 is known which transforms the (non-spherical) GLM

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (1)
$$

into the (spherical) GLM

$$
y^* = X^* \beta + \varepsilon^* \quad (2)
$$

where

$$
y^* := Ay \in \mathbb{R}^n, \; X^* := AX \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; \varepsilon^* := A\varepsilon, \; p(\varepsilon^*) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (3)
$$
Then the GLM

$$
y^* = X^* \beta + \varepsilon^* \quad (4)
$$

is of the standard form with a spherical error covariance matrix, and classical inference on the parameters can proceed as usual using the OLS beta estimator of the transformed model

$$
\hat{\beta} := (X^{*T} X^*)^{-1} X^{*T} y^* = ((AX)^T (AX))^{-1} (AX)^T A y \quad (5)
$$
The question is then how to specify 𝐴 ∈ ℝ𝑛×𝑛 so that this transformation works. To this end, assume that a matrix 𝐾 ∈ ℝ𝑛×𝑛 exists, such that we can write the positive-definite, but non-spherical, error covariance matrix 𝑉 ∈ ℝ𝑛×𝑛 as

$$
V = K K^T \quad (6)
$$

where 𝐾 ∈ ℝ𝑛×𝑛 is a (triangular) matrix with the properties

$$
(K^{-1})^T = (K^T)^{-1} \text{ and } V^{-1} = (K^{-1})^T (K^{-1}) \quad (7)
$$

If we set 𝐴 ≔ 𝐾−1 ∈ ℝ𝑛×𝑛 and use the linear transformation theorem for Gaussian distributions, we see that the covariance matrix of 𝜀∗ is given by the spherical covariance matrix 𝜎2𝐼𝑛, i.e.,

$$
p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V), \; V = KK^T, \; A = K^{-1}, \; \varepsilon^* := A\varepsilon \;\Rightarrow\; p(\varepsilon^*) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (8)
$$
Proof of (8)

We have

$$
\begin{aligned}
p(\varepsilon^*) &= p(K^{-1}\varepsilon) \\
&= N(K^{-1}\varepsilon; K^{-1} 0, K^{-1}(\sigma^2 V)(K^{-1})^T) \\
&= N(\varepsilon^*; 0, \sigma^2 K^{-1} K K^T (K^{-1})^T) \\
&= N(\varepsilon^*; 0, \sigma^2 K^{-1} K K^T (K^T)^{-1}) \\
&= N(\varepsilon^*; 0, \sigma^2 I_n)
\end{aligned} \quad (8.1)
$$

□
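The existence of a factor 𝐾 with 𝑉 = 𝐾𝐾ᵀ and the whitening effect of 𝐴 = 𝐾⁻¹ stated in (8) can be verified numerically. A short numpy sketch with an illustrative choice of 𝑉 and 𝜎²:

```python
import numpy as np

n = 8
sigma2 = 2.0

# A symmetric positive-definite, non-spherical V (illustrative values) and
# a triangular "square root" V = K K^T obtained via Cholesky factorization
V = np.fromfunction(lambda i, j: 0.7 ** np.abs(i - j), (n, n))
K = np.linalg.cholesky(V)
assert np.allclose(K @ K.T, V)

# Transforming eps ~ N(0, sigma^2 V) by A = K^{-1} yields a spherical
# covariance, as stated in (8): A (sigma^2 V) A^T = sigma^2 I_n
A = np.linalg.inv(K)
assert np.allclose(A @ (sigma2 * V) @ A.T, sigma2 * np.eye(n))
```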
248
We next consider the OLS estimator of the transformed GLM 𝑦∗ = 𝑋∗𝛽 + 𝜀∗ for this choice of 𝐴. Here, we find that

$$
\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y \quad (9)
$$

Proof of (9)

With 𝐴 ≔ 𝐾−1 we have from (5)

$$
\begin{aligned}
\hat{\beta} &= ((AX)^T (AX))^{-1} (AX)^T A y \\
&= ((K^{-1}X)^T (K^{-1}X))^{-1} (K^{-1}X)^T K^{-1} y \\
&= (X^T (K^{-1})^T K^{-1} X)^{-1} X^T (K^{-1})^T K^{-1} y \\
&= (X^T V^{-1} X)^{-1} X^T V^{-1} y
\end{aligned} \quad (9.1)
$$

where the last equality follows with the properties of 𝐾 ∈ ℝ𝑛×𝑛. □
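The equivalence proven above — OLS on the transformed model equals the closed-form expression (9) — can also be checked numerically. A numpy sketch with hypothetical design, data, and covariance values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 3
X = rng.normal(size=(n, p))   # hypothetical design matrix
y = rng.normal(size=n)        # hypothetical data vector

# Non-spherical V (illustrative), its Cholesky factor K, and A = K^{-1}
V = np.fromfunction(lambda i, j: 0.6 ** np.abs(i - j), (n, n))
K = np.linalg.cholesky(V)
A = np.linalg.inv(K)

# OLS applied to the transformed model y* = X* beta + eps* ...
Xs, ys = A @ X, A @ y
beta_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# ... coincides with the closed-form expression (9)
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
assert np.allclose(beta_trans, beta_gls)
```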
The OLS estimator for 𝛽 of the transformed GLM 𝑦∗ = 𝑋∗𝛽 + 𝜀∗ in (9) is known as the generalized least-squares estimator and is usually abbreviated as $\hat{\beta}_{GLS}$. Note that in the case of a spherical error covariance matrix, the generalized least-squares estimator reduces to the ordinary least-squares estimator:

$$
\hat{\beta}_{GLS} = (X^T (I_n)^{-1} X)^{-1} X^T (I_n)^{-1} y = (X^T X)^{-1} X^T y = \hat{\beta} \quad (10)
$$

For the special case that 𝑉 is merely “heteroscedastic”, i.e., diagonal with possibly unequal diagonal entries, the estimator is also known as the “weighted least-squares estimator”.
A remaining question is how a matrix 𝐾 ∈ ℝ𝑛×𝑛 with the desired properties (cf. equations (7) and (8)) can be obtained, and, in fact, whether it is ensured to exist at all. The answers to these questions fall into the mathematical theory of linear algebra. In short, if 𝑉 ∈ ℝ𝑛×𝑛 is positive-definite, a matrix 𝐾 ∈ ℝ𝑛×𝑛 with the desired properties is guaranteed to exist and can be computed from 𝑉 ∈ ℝ𝑛×𝑛 using, for example, a Cholesky decomposition.
(3) Whitening
A notion related to the concept of the generalized or weighted least-squares estimator is the concept of “data (pre)whitening”, as prevalent in the neuroimaging literature. Here, the emphasis is on pre-multiplying a GLM with non-spherical error covariance matrix

$$
y = X\beta + \varepsilon, \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V) \quad (1)
$$

with an adequate whitening matrix 𝑊 ∈ ℝ𝑛×𝑛, such that the resulting GLM, referred to as the “whitened” GLM, has a spherical covariance matrix. Given the discussion above, the adequate matrix for pre-multiplying the GLM is of course again the matrix 𝑊 ≔ 𝐾−1 ∈ ℝ𝑛×𝑛, where

$$
V = K K^T \quad (2)
$$

By the discussion above, multiplication of the GLM in (1) by 𝑊,

$$
Wy = WX\beta + W\varepsilon \quad (3)
$$

then yields with
$$
y^* := Wy, \; X^* := WX \text{ and } \varepsilon^* := W\varepsilon \quad (4)
$$

that

$$
p(\varepsilon^*) := p(W\varepsilon) = p(K^{-1}\varepsilon) = N(\varepsilon^*; 0, \sigma^2 I_n) \quad (5)
$$

and the classical distribution theory for the ordinary least-squares estimator $\hat{\beta} \in \mathbb{R}^p$ of $\beta \in \mathbb{R}^p$ and its derivatives such as 𝑇 and 𝐹 values will be appropriate.
Study Questions

1. Why do non-spherical error covariance matrices require modifications to GLM estimation and inference?
2. Verbally sketch the derivation of the generalized least-squares estimator.
3. Write down the generalized least-squares estimator for a GLM of the form
$$
y = X\beta + \varepsilon, \; y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \; p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V)
$$
4. Show that the ordinary least-squares estimator is a special case of the generalized least-squares estimator for the case of a spherical error covariance matrix.
5. Given a GLM with non-spherical error covariance matrix 𝜎2𝑉, with which matrix does the GLM have to be pre-multiplied in order to whiten it?
Study Questions Answers

1. The classical distribution theory for the GLM, i.e., the analytical forms of the distributions of the ordinary least-squares beta estimator and the variance parameter estimator, and the ensuing distributions of T and F statistics, all assume independently and identically distributed error terms. If this assumption does not hold, the analytical statements about these distributions become invalid. Practically, using the standard distributional results in the case of correlated error terms increases the risk of false positives.

2. The generalized least-squares estimator can be derived by asking which form a matrix must have such that, if multiplied with both sides of the GLM equation, it renders the error terms independently and identically distributed. To this end, one finds with the linear transformation theorem for Gaussian distributions that an appropriate matrix corresponds to the inverse of the “square root” of the non-spherical covariance matrix of the original GLM.

3. The generalized least-squares estimator for the GLM denoted in the question has the form

$$
\hat{\beta} := (X^T V^{-1} X)^{-1} X^T V^{-1} y
$$

4. In the case of a spherical covariance matrix, the error terms are distributed according to $p(\varepsilon) = N(\varepsilon; 0, \sigma^2 V)$, where 𝑉 ≔ 𝐼𝑛. Substitution in the generalized least-squares estimator formula yields

$$
\begin{aligned}
\hat{\beta} &= (X^T (\sigma^2 I_n)^{-1} X)^{-1} X^T (\sigma^2 I_n)^{-1} y \\
&= (X^T (\sigma^2)^{-1} (I_n)^{-1} X)^{-1} X^T (\sigma^2)^{-1} (I_n)^{-1} y \\
&= (X^T (\sigma^2)^{-1} X)^{-1} X^T (\sigma^2)^{-1} y \\
&= ((\sigma^2)^{-1})^{-1} (\sigma^2)^{-1} (X^T X)^{-1} X^T y \\
&= \sigma^2 (\sigma^2)^{-1} (X^T X)^{-1} X^T y \\
&= (X^T X)^{-1} X^T y
\end{aligned}
$$

5. The whitening matrix 𝑊 ∈ ℝ𝑛×𝑛 for the GLM corresponds to the inverse of the square root of the non-spherical covariance matrix, i.e., 𝑊 ≔ 𝐾−1, where 𝑉 = 𝐾𝐾𝑇 ∈ ℝ𝑛×𝑛.
Restricted Maximum Likelihood
(1) Motivation
Above we introduced the maximum likelihood (ML) approach as a general method for the derivation
of estimators in probabilistic models. For the GLM
$$
y = X\beta + \varepsilon, \; y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p \text{ with } p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n), \; \sigma^2 > 0 \quad (1)
$$
we used the ML approach to derive the ML/OLS beta parameter estimator

$$
\hat{\beta} := (X^T X)^{-1} X^T y \quad (2)
$$

and the ML variance parameter estimator

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n} \quad (3)
$$
We noted above that the ML variance parameter estimator is biased and showed that the modified estimator

$$
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n - p} \quad (4)
$$

offers a bias-free alternative. The notion of the estimation bias of the ML variance estimator is illustrated in Figure 1.
Figure 1 The figure depicts the empirical expectation (i.e., average) of both the REML and the ML variance estimators for a simple linear regression model as a function of the number of data points 𝑛. For each 𝑛, the second to fifth columns of the design matrix 𝑋 ∈ ℝ𝑛×5 comprised 𝑛 equally spaced values between 0 and 1 raised to the power of the design matrix column number minus 1. Data were generated based on the true, but unknown, parameter values 𝛽 ≔ (1,1,1,1,1)𝑇 and 𝜎2 ≔ 1. Per number of data points 𝑛, 1000 samples were taken from the model, and the corresponding REML and ML estimators for the variance parameter were evaluated. The average and standard error of the mean of these values are plotted as blue and red curves, respectively. The ML variance estimator is clearly biased downward with respect to the true, but unknown, value of 𝜎2, while the REML variance estimator is not.
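The bias pattern described in the figure can be reproduced in a few lines. The following numpy sketch uses a design matrix analogous to the one described in the figure caption, but with a single, fixed number of data points and fewer Monte Carlo samples for brevity:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 5
# Columns are powers 0, ..., 4 of n equally spaced values in [0, 1]
X = np.column_stack([np.linspace(0, 1, n) ** k for k in range(p)])
beta, sigma2 = np.ones(p), 1.0

# Monte Carlo estimate of the expectations of the 1/n (ML) and the
# 1/(n - p) (unbiased, REML-type) variance estimators
P = np.eye(n) - X @ np.linalg.pinv(X)     # residual-forming matrix
ml, reml = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    rss = float((P @ y) @ (P @ y))
    ml.append(rss / n)
    reml.append(rss / (n - p))

# The ML estimator is biased downward by the factor (n - p)/n = 0.75
assert abs(np.mean(reml) - sigma2) < 0.05
assert np.mean(ml) < np.mean(reml)
```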
Above, equation (4) was introduced without much explanation. In fact, the ML approach has the disadvantage that the variance estimators it yields are biased. This motivated statisticians to search for another principled method that allows for the derivation of better estimators, leading to the introduction of the “restricted” (also known as “residual”) maximum likelihood (REML) approach. In brief, REML is an approach that enables the derivation of unbiased estimators for variance parameters in linear models by using a modified likelihood function. This likelihood function is known as the REML objective function and is derived in the next paragraph.
(2) The REML objective function and REML variance parameter estimators
The REML objective function can be motivated based on the probabilistic model
𝑝(𝑦, 𝛽) = 𝑝(𝑦|𝛽)𝑝(𝛽) (1)
where
𝑝(𝑦|𝛽) = 𝑁(𝑦; 𝑋𝛽, 𝑉𝜆) (2)
𝑋 ∈ ℝ𝑛×𝑝 is a known design matrix, 𝑝(𝛽) is the marginal distribution of 𝛽 and 𝑉𝜆 is a positive-definite
covariance matrix parameter that depends on an unknown parameter 𝜆 ∈ ℝ𝑞. Examples of 𝑉𝜆 are the
spherical covariance matrix
𝑉𝜆 ≔ 𝜎2𝐼𝑛 (3)
in which case $\lambda := \sigma^2 > 0$, or the formulation of the error covariance matrix as a linear combination of covariance basis matrices $Q_i \in \mathbb{R}^{n\times n}$ ($i = 1,\ldots,q$), such that $\lambda := (\lambda_1, \ldots, \lambda_q)^T \in \mathbb{R}^q$ and
$$V_\lambda := \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}\ \text{p.d.} \qquad (4)$$
As discussed above, maximum likelihood estimation of 𝜆 corresponds to maximizing the log likelihood
function
ℓ ∶ Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ(𝜆) ≔ ln𝑝(𝑦|𝛽) (5)
with respect to 𝜆. Restricted maximum likelihood estimation of 𝜆 corresponds to maximizing the log
restricted likelihood function
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) ≔ ln∫ 𝑝(𝑦|𝛽)𝑑𝛽 (6)
with respect to 𝜆. Equivalently, this may be viewed as maximization of the marginal log likelihood function
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) ≔ ln∫ 𝑝(𝑦, 𝛽)𝑑𝛽 (7)
under the assumption of a constant (uniform) improper prior distribution
$$p(\beta) = c, \quad c > 0 \qquad (8)$$
over 𝛽, such that
ℓ𝑟: Λ ⊂ ℝ𝑞 → ℝ, 𝜆 ↦ ℓ𝑟(𝜆) = ln∫ 𝑝(𝑦, 𝛽)𝑑𝛽 = ln∫ 𝑝(𝑦|𝛽)𝑝(𝛽)𝑑𝛽 = ln 𝑐 + ln ∫ 𝑝(𝑦|𝛽)𝑑𝛽 (9)
Because the first term on the right-hand side of (9) is not a function of 𝜆, (9) is equivalent to (6).
Given the generalized least-squares estimator
$$\hat{\beta}_\lambda := \left(X^T V_\lambda^{-1} X\right)^{-1} X^T V_\lambda^{-1} y \qquad (10)$$
of $\beta$, the log restricted likelihood function evaluates to
$$\ell_r(\lambda) = -\left(\frac{n-p}{2}\right)\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\ln\left|X^T V_\lambda^{-1} X\right| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (11)$$

Proof of (11)

For ease of notation, we set $\hat{\beta} := \hat{\beta}_\lambda$ and $V := V_\lambda$. Using the identity
$$(y - X\beta)^T V^{-1}(y - X\beta) = (y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta}) + (\beta - \hat{\beta})^T\left(X^T V^{-1} X\right)(\beta - \hat{\beta}) \qquad (11.1)$$
which follows from expanding both sides and noting that $X^T V^{-1}(y - X\hat{\beta}) = 0$, and the normalizing integral of the Gaussian distribution, we have
$$
\begin{aligned}
\ell_r(\lambda) &= \ln \int p(y|\beta)\,d\beta \\
&= \ln \int N(y; X\beta, V)\,d\beta \\
&= \ln\left(\int (2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\beta)^T V^{-1}(y - X\beta)\right) d\beta\right) \\
&= \ln\left(\int (2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) \exp\left(-\tfrac{1}{2}(\beta - \hat{\beta})^T \left(X^T V^{-1} X\right)(\beta - \hat{\beta})\right) d\beta\right) \\
&= \ln\left((2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) \int \exp\left(-\tfrac{1}{2}(\beta - \hat{\beta})^T \left(X^T V^{-1} X\right)(\beta - \hat{\beta})\right) d\beta\right) \\
&= \ln\left((2\pi)^{-\frac{n}{2}} |V|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})\right) (2\pi)^{\frac{p}{2}} \left|X^T V^{-1} X\right|^{-\frac{1}{2}}\right) \\
&= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V| - \frac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta}) + \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln\left|X^T V^{-1} X\right| \\
&= -\left(\frac{n-p}{2}\right)\ln 2\pi - \frac{1}{2}\ln|V| - \frac{1}{2}\ln\left|X^T V^{-1} X\right| - \frac{1}{2}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})
\end{aligned}
$$
□
Having established the REML objective function, we can now state explicitly the notions of an ML and a REML estimator for the variance parameter of a GLM. As previously, an ML variance parameter estimator is defined as a value $\lambda_{ML}$ that maximizes the function
$$\ell : \Lambda \subset \mathbb{R}^q \to \mathbb{R},\ \lambda \mapsto \ell(\lambda) := -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (12)$$
A REML variance parameter estimator, on the other hand, is defined as a value $\lambda_{REML}$ that maximizes the function
$$\ell_r : \Lambda \subset \mathbb{R}^q \to \mathbb{R},\ \lambda \mapsto \ell_r(\lambda) = -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\ln\left|X^T V_\lambda^{-1} X\right| - \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (13)$$
Note that ℓ corresponds to the log likelihood function for the variance parameter of the standard GLM as introduced above after estimation of the beta parameter vector, with the modification that the Gaussian covariance matrix $V_\lambda$ and the beta parameter estimator $\hat{\beta}_\lambda$ have been made explicit functions of the variance parameter, and that additive constants, which do not affect the location of extremal points, have been omitted. Further note that the REML objective function $\ell_r$ is identical to $\ell$, except for the extra term $-\frac{1}{2}\ln|X^T V_\lambda^{-1} X|$.
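The role of the extra REML term can be made concrete for the spherical case 𝑉𝜆 = 𝜎2𝐼𝑛: numerically maximizing ℓ and ℓ𝑟 over 𝜎2 recovers the biased estimator RSS/𝑛 and the unbiased estimator RSS/(𝑛 − 𝑝), respectively. A sketch (the polynomial design matrix and sample size are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, p = 30, 5
X = np.column_stack([np.linspace(0, 1, n) ** j for j in range(p)])
y = X @ np.ones(p) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS = GLS when V = s2 * I
rss = np.sum((y - X @ beta_hat) ** 2)

def neg_ll(s2):
    # ML objective (12) for V = s2 * I, sign-flipped, constants dropped
    return 0.5 * n * np.log(s2) + 0.5 * rss / s2

def neg_rll(s2):
    # REML objective (13): ln|X^T (s2 I)^-1 X| = -p ln s2 + ln|X^T X|
    return 0.5 * (n - p) * np.log(s2) + 0.5 * rss / s2

s2_ml = minimize_scalar(neg_ll, bounds=(1e-6, 10.0), method="bounded").x
s2_reml = minimize_scalar(neg_rll, bounds=(1e-6, 10.0), method="bounded").x

assert np.isclose(s2_ml, rss / n, atol=1e-4)
assert np.isclose(s2_reml, rss / (n - p), atol=1e-4)
```

Here the closed-form maximizers are known, so the numerical optimization merely confirms that the extra −½ ln|𝑋𝑇𝑉𝜆−1𝑋| term shifts the maximizer from RSS/𝑛 to RSS/(𝑛 − 𝑝).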
(3) Derivatives of the REML objective function
The derivatives of the REML objective function, which are required for finding its maximum, can be computed analytically. Specifically, with
$$P := V_\lambda^{-1} - V_\lambda^{-1} X \left(X^T V_\lambda^{-1} X\right)^{-1} X^T V_\lambda^{-1} \in \mathbb{R}^{n\times n} \qquad (1)$$
the components of the gradient $\nabla \ell_r(\lambda) \in \mathbb{R}^q$ of the REML objective function are given by
$$\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} \left(\frac{\partial V_\lambda}{\partial \lambda_i}\right) V_\lambda^{-1} \left(y - X\hat{\beta}_\lambda\right) \qquad (2)$$
for $i = 1,\ldots,q$, and the components of the Hessian $\nabla^2 \ell_r(\lambda) \in \mathbb{R}^{q\times q}$ of the REML objective function are given by
$$\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial^2 V_\lambda}{\partial \lambda_i \partial \lambda_j} - P\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(\frac{\partial^2 V_\lambda}{\partial \lambda_i \partial \lambda_j} - 2\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) V_\lambda^{-1} \left(y - X\hat{\beta}_\lambda\right) \qquad (3)$$
for $i, j = 1,\ldots,q$. The expected value of the $(i,j)$-th component of the Hessian matrix under the data distribution can also be evaluated analytically and results in
$$E\left(\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda)\right) = -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) \qquad (4)$$
(4) Fisher-Scoring for the REML objective function and covariance basis matrices
Maximization of the REML objective function with respect to the variance parameter $\lambda \in \Lambda \subset \mathbb{R}^q$, which parameterizes the covariance matrix $V_\lambda$, can also be achieved numerically. In this case, the Fisher-Scoring algorithm introduced in the context of maximum likelihood estimation can be adapted for the REML objective function. Recall that, essentially, Fisher-Scoring corresponds to a multivariate Newton-Raphson algorithm for the numerical optimization of an objective function (here the log likelihood function) with the modification that the Hessian matrix of the objective function is replaced by its expected value under the probabilistic model of interest. To develop a Fisher-Scoring analogue for the REML objective function, we thus require its gradient and its expected Hessian matrix with respect to the components of the variance component vector $\lambda$. These were derived in their general form above. Notably, the gradient vector and Hessian matrix of the REML objective function simplify considerably if one assumes that the covariance matrix is a linear combination of covariance basis matrices. Specifically, if we assume that
$$V_\lambda = \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}\ \text{p.d.} \qquad (1)$$
then
$$\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) = -\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} Q_i V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (2)$$
Proof of (2)
For an error covariance matrix of the form (1), we have
$$\frac{\partial}{\partial \lambda_i} V_\lambda = \frac{\partial}{\partial \lambda_i}\left(\sum_{j=1}^{q} \lambda_j Q_j\right) = \sum_{j=1}^{q} \left(\frac{\partial}{\partial \lambda_i}\lambda_j\right) Q_j = Q_i \qquad (2.1)$$
Substitution into the general gradient expression then yields
$$
\begin{aligned}
\frac{\partial}{\partial \lambda_i}\ell_r(\lambda) &= -\frac{1}{2}\mathrm{tr}\left(P \frac{\partial V_\lambda}{\partial \lambda_i}\right) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1}\left(\frac{\partial V_\lambda}{\partial \lambda_i}\right) V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right) \qquad (2.2)\\
&= -\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_\lambda\right)^T V_\lambda^{-1} Q_i V_\lambda^{-1}\left(y - X\hat{\beta}_\lambda\right)
\end{aligned}
$$
Likewise, the expectation of the $(i,j)$-th component of the Hessian matrix of the REML objective function simplifies to
$$E\left(\frac{\partial^2}{\partial \lambda_i \partial \lambda_j}\ell_r(\lambda)\right) = -\frac{1}{2}\mathrm{tr}\left(P\frac{\partial V_\lambda}{\partial \lambda_i} P \frac{\partial V_\lambda}{\partial \lambda_j}\right) = -\frac{1}{2}\mathrm{tr}(P Q_i P Q_j) \qquad (3)$$
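A quick way to gain confidence in the simplified gradient formula (2) is a finite-difference check against a direct implementation of the REML objective. The two covariance basis matrices below (an identity matrix and an exponential-decay matrix) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
idx = np.arange(n)
Q = [np.eye(n), np.exp(-np.abs(np.subtract.outer(idx, idx)))]  # assumed bases
lam = np.array([1.0, 0.5])

def l_r(lam):
    """Direct implementation of the REML objective (constants dropped)."""
    V = lam[0] * Q[0] + lam[1] * Q[1]
    Vinv = np.linalg.inv(V)
    A = X.T @ Vinv @ X
    e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
    return (-0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * np.linalg.slogdet(A)[1]
            - 0.5 * e @ Vinv @ e)

# Analytic gradient, equation (2): -tr(P Q_i)/2 + e^T Vinv Q_i Vinv e / 2
V = lam[0] * Q[0] + lam[1] * Q[1]
Vinv = np.linalg.inv(V)
A = X.T @ Vinv @ X
P = Vinv - Vinv @ X @ np.linalg.solve(A, X.T @ Vinv)
e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
grad = np.array([-0.5 * np.trace(P @ Qi) + 0.5 * e @ Vinv @ Qi @ Vinv @ e
                 for Qi in Q])

# Central differences of l_r agree with the analytic gradient
h = 1e-6
num = np.array([(l_r(lam + h * d) - l_r(lam - h * d)) / (2 * h)
                for d in np.eye(2)])
assert np.allclose(grad, num, atol=1e-4)
```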
In summary, under the assumption that the error covariance matrix decomposes into a linear combination of covariance basis matrices, one may thus use the following numerical approach for REML objective function maximization:

Initialization

0. Define a starting point $\lambda^{(0)} \in \mathbb{R}^q$ and set $k := 0$. If $\nabla \ell_r(\lambda^{(0)}) = 0$, stop! $\lambda^{(0)}$ is a zero of $\nabla \ell_r$. If not, proceed to the iterations.

Until Convergence

1. For $i = 1,\ldots,q$ set
$$\lambda_i^{(k+1)} := \lambda_i^{(k)} + \left(\frac{1}{2}\mathrm{tr}(P Q_i P Q_i)\right)^{-1}\left(-\frac{1}{2}\mathrm{tr}(P Q_i) + \frac{1}{2}\left(y - X\hat{\beta}_{\lambda^{(k)}}\right)^T V_{\lambda^{(k)}}^{-1} Q_i V_{\lambda^{(k)}}^{-1}\left(y - X\hat{\beta}_{\lambda^{(k)}}\right)\right)$$
where $P$ is evaluated at $\lambda^{(k)}$; that is, each component is updated by a Newton-type step in which the $(i,i)$-th entry of the negative expected Hessian takes the place of the Hessian.

2. If $\nabla \ell_r(\lambda^{(k+1)}) = 0$, stop! $\lambda^{(k+1)}$ is a zero of $\nabla \ell_r$. If not, go to 3.

3. Set $k := k + 1$ and go to 1.

Table 1. A Fisher-Scoring algorithm analogue for the numerical optimization of the REML objective function for error covariance matrices that decompose into linear combinations of covariance basis matrices.
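The scheme of Table 1 can be sketched in a few lines of NumPy. This is a bare-bones illustration: it takes a full Newton-type step with the negative expected Hessian matrix rather than only its diagonal entries, and it includes no safeguard for keeping 𝑉𝜆 positive-definite. For the spherical special case with a single basis matrix 𝑄1 = 𝐼𝑛, the iteration reproduces the unbiased estimator RSS/(𝑛 − 𝑝):

```python
import numpy as np

def reml_fisher_scoring(y, X, Q, lam0, max_iter=50, tol=1e-10):
    """Fisher-Scoring for the REML objective with V = sum_i lam_i Q_i (sketch)."""
    lam = np.array(lam0, dtype=float)
    for _ in range(max_iter):
        V = sum(l * Qi for l, Qi in zip(lam, Q))
        Vinv = np.linalg.inv(V)
        A = X.T @ Vinv @ X
        P = Vinv - Vinv @ X @ np.linalg.solve(A, X.T @ Vinv)
        e = y - X @ np.linalg.solve(A, X.T @ Vinv @ y)
        # Gradient of the REML objective (equation (2))
        g = np.array([-0.5 * np.trace(P @ Qi)
                      + 0.5 * e @ Vinv @ Qi @ Vinv @ e for Qi in Q])
        # Negative expected Hessian (equation (3))
        F = np.array([[0.5 * np.trace(P @ Qi @ P @ Qj) for Qj in Q] for Qi in Q])
        step = np.linalg.solve(F, g)
        lam = lam + step
        if np.max(np.abs(step)) < tol:
            break
    return lam

# Spherical sanity check: with Q = [I_n] the REML estimate equals RSS/(n - p)
rng = np.random.default_rng(3)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)
lam_hat = reml_fisher_scoring(y, X, [np.eye(n)], [2.0])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_ols) ** 2)
assert np.isclose(lam_hat[0], rss / (n - p))
```

In this special case the very first update already lands on RSS/(𝑛 − 𝑝), because the Fisher-Scoring step solves the stationarity condition exactly.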
Study Questions

1. State the motivation for the introduction of the restricted maximum likelihood framework.
2. Under what kind of probabilistic model can the restricted maximum likelihood objective function be derived as a marginal log likelihood function?
3. What is the difference between the maximum likelihood and the restricted maximum likelihood objective function for variance parameter estimation?
4. How can the restricted maximum likelihood objective function be used for statistical inference?
5. Why is the assumption of a covariance matrix that can be written as a linear combination of covariance basis matrices helpful in the context of REML?

Study Question Answers
1. The variance parameter estimator derived under the maximum likelihood framework is biased, i.e. its analytical expectation does not correspond to the true, but unknown, value that it is supposed to estimate. The restricted maximum likelihood framework can be conceived of as originating from the desire to establish a general framework under which unbiased variance parameter estimators can be derived.
2. The restricted maximum likelihood objective function can be viewed as the marginal log likelihood function of a GLM with a uniform (improper) prior distribution over the beta parameter, i.e. as the logarithm of the data distribution with the beta parameter integrated out.
3. The difference is the additional term $-\frac{1}{2}\ln|X^T V_\lambda^{-1} X|$, which appears in the restricted maximum likelihood objective function, but not in the maximum likelihood objective function.
4. The restricted maximum likelihood objective function can be used for statistical inference like the maximum likelihood objective function (i.e. the log likelihood function): it is viewed as a function of a parameter of a probabilistic model, and the parameter is optimized so as to maximize the probability of the observed data under the model. As in the maximum likelihood framework, such an optimization can either proceed analytically, by finding the zeros of the derivative (or gradient) of the restricted maximum likelihood objective function explicitly, or numerically, for example using the Fisher-Scoring algorithm.
5. Under the assumption of a covariance matrix of the form $V_\lambda = \sum_{i=1}^{q} \lambda_i Q_i \in \mathbb{R}^{n\times n}$ p.d., where $Q_i \in \mathbb{R}^{n\times n}$ are suitably chosen covariance basis matrices, the gradient and expected Hessian of the restricted maximum likelihood objective function simplify considerably.
FMRI applications of the General Linear Model
The mass-univariate GLM-FMRI approach
The application of the GLM to FMRI signal time-series of single voxel data, referred to as the “mass-
univariate” GLM approach, hereafter GLM-FMRI, is a surprisingly complex topic. In the current section we
briefly review the acquisition and preprocessing of FMRI data to become familiar with the data organization
that is used for GLM-FMRI. We will next use a simple two-condition/two-voxel example to obtain an
intuition of how the GLM is used for the purpose of cognitive process brain mapping in FMRI research.
(1) FMRI data acquisition and preprocessing
The fundamental idea of GLM-FMRI studies is to map cognitive processes onto brain areas. To this end, two fortuitous facts are exploited: firstly, because of its metabolic demands, local neural activity results in a local alteration of the ratio of deoxygenated to oxygenated haemoglobin. Secondly, the local displacement of deoxygenated haemoglobin alters the local magnetic susceptibility of brain tissue and can be detected as an increase of the local magnetic resonance signal by an MR scanner. The changes in the MR signal induced by local neural activity are referred to as the blood-oxygen level dependent (BOLD) contrast signal. Based on this interaction between physics and biology, the idea of GLM-FMRI is to induce specific psychological states in human participants, which presumably are reflected in specific neural activity states, resulting in metabolic demands, which in turn can be detected by means of FMRI. If two different psychological processes are reflected in anatomically different brain structures, GLM-FMRI, i.e. the statistical evaluation of where an MR signal difference occurred post-stimulus, thus allows for mapping cognitive processes onto the anatomy of the human brain. A full discussion of the fundamentals of FMRI is beyond the scope of this
PMFN. For an excellent introduction to this topic, please refer to (Huettel et al., 2014). Below, we provide an
overview of the steps that are involved in a standard FMRI study before the data are modelled using the
GLM.
The MRI signal arises from a complex interplay between a strong magnetic field, the magnetic
properties of atomic particles and the behaviour of the latter in the presence of electromagnetic stimulation.
The quantitative characterization of this process falls into the realm of quantum mechanics and even in
introductory discussions requires some familiarity with ordinary differential equations (Huettel et al., 2014).
For the purpose of PMFN, it suffices to know that the MRI scanner allows for taking images, i.e. three-
dimensional arrays of numbers, of the brain. The process of image generation using the electromagnetic MR
signal is based on concepts from Fourier analysis. Depending on the specific parameters that are used to
take MR images, known as sequence parameters, different image types result. So-called T1-weighted
images, which take approximately 10 minutes to acquire, have high spatial resolution and reveal fine
anatomical detail. T2*-weighted images, on the other hand, take only 1 to 3 seconds to acquire and are sensitive to variations in the MR signal due to local deoxyhemoglobin changes, that is, the BOLD contrast signal. T2*-weighted images are the images used for FMRI. They are usually acquired using an MR sequence type known as “echo-planar imaging”, hence these images are also known as “EPI” images. The GLM-FMRI approach essentially converts EPI image time-series into statistical maps that indicate where local activations occurred. These maps in turn are referred to as “statistical parametric maps” (SPMs). Here, “parametric” refers to the fact that the statistics are evaluated using parametric assumptions about their underlying distribution.
In general, FMRI data is organized as follows. A single human participant is usually scanned in a
single “session”, which comprises multiple “runs”. A run is usually about 10 to 15 minutes long. During a run,
the participant carries out a cognitive task (for example responding to visually presented stimuli by button
presses) while a series of EPI images is acquired simultaneously and continuously. Each run thus comprises a
sequence of EPI “images” or “volumes”. The time it takes to acquire a single volume corresponds to the
time-resolution of FMRI and is called “time-to-repetition” or in brief “TR”. Each volume comprises a number
of “slices” (usually in the order of 30 for whole-brain imaging), each of which contains a number of “voxels”.
Voxels, the three-dimensional analogue of the two-dimensional concept of a pixel, thus make up the entire image. It is very helpful to simply think of EPI images as 3D arrays of numbers representing the grey values of the image's voxels. It is the sequence of these grey values of a particular voxel over the course of an
experimental run to which the GLM is applied below. Importantly, the same design matrix is applied to
model the time-series data of each and every voxel. Before the GLM is applied, however, the data undergo a
significant amount of “preprocessing” to limit the influence of artefacts on the results. We will briefly review
these preprocessing steps and their purposes in the following.
FMRI data preprocessing usually comprises a sequence of steps known as (1) distortion correction, (2)
realignment, (3) slice-time correction, (4) normalization, and (5) smoothing. We briefly review each of these
steps in turn.
(1) Distortion correction. Due to inhomogeneities of the magnetic field, certain image parts of EPI
volumes may be distorted with respect to the object the image is taken of. Correcting these distortions
based on the knowledge of the field inhomogeneities is referred to as “distortion correction”. The aim of
distortion correction is thus to render the image a more veridical representation of the imaged object.
(2) Realignment. In order to allocate an observed effect after analysis of the FMRI data to a specific brain
region, one has to be sure that the time-course of a voxel actually refers to the same region during the
course of the experiment. The MR scanner’s voxel grid is overlaid over the subject’s brain in a fixed position.
Thus, if the subject moves during a run, voxels will represent different brain regions over time. For this
reason, subjects are usually fixated as much as possible and encouraged not to move during scanning.
However, some residual motion (for example by the pulsation of the blood in the brain’s vasculature) cannot
be avoided and is corrected during the realignment step, usually using the first image of the first run as
reference.
(3) Slice-time correction. The slices comprising a single EPI volume are acquired one after the other.
Because of this fact and because during data analysis EPI volumes are typically considered as data samples
for a single time point, temporal interpolation can be used to resample each slice with respect to a single EPI
volume onset time.
(4) Normalization. Normalization refers to the transformation of the subject-specific three-dimensional
voxel time-series into a standard group anatomical space. This transformation is performed by translating
and rotating the acquired data in three-dimensions and possibly also stretching and squeezing it.
Normalization of FMRI data is required, if the experiment aims at comparing data between different
subjects. Because all brains are a bit different, normalization will never really bring the same regions of two
subjects into full alignment (the question is also what “same regions” really means), but it is a useful
approach for group studies.
(5) Smoothing. “Smoothing” refers to the spatial weighted averaging of individual voxel data with data
from voxels in its vicinity. Intuitively, smoothing removes random signal fluctuations over space.
After data acquisition and data pre-processing (which does not alter the data format, but only the
data content) FMRI data correspond to “voxel time-courses”. In other words, for each three-dimensional
brain location (voxel) a time-course of MR signal values is obtained (Figure 1).
Figure 1 A visualization of “voxel time-courses”.
Numerically, these values may look like the table below.
          Volume 1   Volume 2   Volume 3   Volume 4   ...   Volume 𝑛𝑇𝑅
Voxel 1     97.3       90.2       86.1       89.9     ...     85.3
Voxel 2     98.2       91.1       87.0       89.5     ...     86.2
...         ...        ...        ...        ...      ...     ...
Voxel 𝑛𝑉    98.2       91.1       87.0       89.5     ...     86.2
Table 1 Tabular representation of voxel time courses of a single FMRI run.
In Table 1 and Figure 1, 𝑛𝑉 denotes the total number of voxels of the image/volume, and each line contains as many values as the experimental run had sampling points, i.e. EPI volume acquisitions. The
fundamental idea of GLM-FMRI is to treat each voxel in isolation and apply the same GLM to each voxel, one
after the other. This is called the “mass-univariate” approach, because the dependent variable is one-
dimensional (= univariate): it represents the MR signal time-course of a single voxel. In all following
discussion of this Section, we will hence deal with the analysis of the data of a single voxel, for which we
model the observed signal value time course using the GLM. Because we are dealing with a single voxel, we
will not use an index referencing different voxels.
(2) Brain mapping using the GLM-FMRI approach
To link the FMRI data format with the GLM theory developed until now, we next discuss how the
FMRI data and knowledge about the timing of experimental stimulation is formulated in the standard GLM
form given by
$$X\beta + \varepsilon = y, \ \text{where}\ y \in \mathbb{R}^n,\ X \in \mathbb{R}^{n\times p},\ \beta \in \mathbb{R}^p,\ p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n),\ \sigma^2 > 0 \qquad (1)$$
To this end, we consider the data first. Instead of writing the data of a voxel as a line, we can also
write it as a column, where each line of the column indicates the value of that voxel at the respective TR
(Table 2).
Volume Variable MR Signal
1 𝑦1 87.3
2 𝑦2 90.2
3 𝑦3 86.1
... ... ...
𝑛𝑇𝑅 𝑦𝑛𝑇𝑅 85.3
Table 2. Tabular representation of a single voxel time course as GLM data vector 𝑦 ∈ ℝ𝑛𝑇𝑅. The data displayed correspond to the data of Voxel 1 in Table 1.
In other words, in GLM-FMRI the 𝑖th data point of a voxel time-course corresponds to the 𝑖th dependent variable 𝑦𝑖. The entire data set of a single voxel's MR signal time course written as a column vector thus corresponds to the data vector 𝑦 ∈ ℝ𝑛, where 𝑛 corresponds to the number of volumes acquired in a given run, i.e. 𝑛 ≔ 𝑛𝑇𝑅.
The GLM for FMRI takes the form of multiple linear regression over time. In the design matrix, we
thus include what we believe might have had an influence on the value of 𝑦 at a given time-point. Like
always, the 𝑖th MR signal is thus modelled as weighted sum of the values of the independent variables at the
𝑖th time-point plus a time-point specific noise term:
$$x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \cdots + x_{ip}\beta_p + \varepsilon_i = y_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2),\ \sigma^2 > 0 \qquad (2)$$
However, two things are important now: First, the values of the independent variables 𝑥 change over time,
as they represent for example the presence or the absence of an external stimulus during the course of an
experimental run. If we consider a discrete index 𝑡 representing time, and replace the index 𝑖 with it, we may
rewrite the above as
$$x_1(t)\beta_1 + x_2(t)\beta_2 + x_3(t)\beta_3 + \cdots + x_p(t)\beta_p + \varepsilon(t) = y(t), \quad \varepsilon(t) \sim N(\varepsilon(t); 0, \sigma^2),\ \sigma^2 > 0 \qquad (3)$$
where 𝑡 = 1,2, … , 𝑛. Note that the independent variables and the dependent variables are indexed by time,
but the parameters are not. As usual, we assume that the weighting coefficients 𝛽𝑗 do not change over time
(or experimental units), but represent the fixed contribution that the independent variable 𝑥𝑗(𝑡) makes to
the observed value 𝑦(𝑡) at all times. Because we are dealing with discrete time, the explicit time indexing 𝑡 = 1,2,…,𝑛 is in fact rarely carried along, however, and we represent (3) in standard matrix notation as
$$X\beta + \varepsilon = y, \quad p(\varepsilon) = N(\varepsilon; 0, \sigma^2 I_n),\ \sigma^2 > 0 \qquad (4)$$
It should be noted, however, that for GLM-FMRI the rows of the design matrix, error vector, and the data are
functions of time as opposed to functions of independent experimental units.
We next explore, how the formulation of the mass-univariate GLM in (4) and the intuition of the
specificity of a brain region’s response to a cognitive process or experimental condition are related. To this
end, consider an FMRI experiment with one experimental factor (e.g. the category of a visual stimulus)
comprising two levels (e.g. category 1: pictures of faces, category 2: pictures of houses). For generality, we
will just refer to the two different levels as condition 1 and condition 2. In a typical experiment, one presents
the participant with each of the conditions over the course of an experimental run. During each run FMRI
data is acquired continuously, for example for 10 minutes, in discrete samples taken every TR = 2 s (time-to-repetition). As described above, the fundamental idea of GLM analyses of FMRI data is that in response to a
stimulus or a cognitive process, neurons become active in a region that is “specialized” for this stimulus or
cognitive process. This neural activity causes a metabolic cascade which in turn leads to a local increase in
the level of oxygenated hemoglobin in the area of the neural activity. Hemodynamic responses to a variety
of stimuli have been measured, and the concept of the hemodynamic response function, i.e. a mathematical
model that describes the ideal change in the MR signal upon a neural event, has been formulated. The
particularities of hemodynamic response functions will be discussed in more detail later; for now, it suffices
to note that the MR signal at a specific voxel in response to a single neural event at time point 𝑡 = 0
approximately looks like the function shown in Figure 2.
Figure 2. The hemodynamic response function. The hemodynamic response function reflects the idealized MR signal response to a brief neural event. It functions as the basis for GLM modelling of FMRI data in a temporal convolution framework. The Figure shows the “canonical hemodynamic response function” with the default parameters as implemented in the SPM software toolbox.
Assume now that an observer was presented with conditions 1 and 2 in random order over the
course of an experimental run of 260 𝑠 at the times shown by the “stick functions” in Figure 3 below.
Whenever the stick function of a condition is 1, the respective condition was presented. For example at
times 𝑡 = 0 and 𝑡 = 16 condition 1 was presented, at time 𝑡 = 32 condition 2 was presented, and so on.
Next, assume that while these conditions were presented, FMRI data were collected every 2 s from two
voxels A and B representing different brain areas, and the data shown in Figure 4 was extracted. Comparing
the event time points of Figure 3 to the voxel-time courses in Figure 4 above, one would come to the
conclusion that voxel A always shows an excursion of the MR signal, when condition 1 is presented and no
excursion when condition 2 is presented. For voxel B, on the other hand, the MR signal is responsive to both
condition 1 and 2.
Figure 3 Example condition onsets (stimulus timing) in an experiment with two conditions 1 and 2.
Figure 4 Example MR signal time-courses from two different voxels A and B.
Figure 5. Stimulus onset functions of Figure 3 convolved with a canonical hemodynamic response function.
In GLM-FMRI, this voxel-specific “responsiveness to a specific condition” is represented by the beta
parameter values of the predictor variable representing the respective condition. To see this, consider first
the predicted time-courses for voxels which are solely and ideally responsive to conditions 1 or 2,
respectively. These time-courses are obtained by replacing the stick functions of Figure 3 with the assumed
hemodynamic response functions, yielding Figure 5. Technically, this replacement is achieved by convolving
the stimulus stick functions with the hemodynamic response function, the details of which will be discussed
in the next section.
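The construction described above can be sketched end-to-end: build stick functions from onset times, convolve them with an HRF, stack the results as design matrix columns, and verify that least squares recovers 𝛽𝐴 ≈ (1,0)𝑇 and 𝛽𝐵 ≈ (1,1)𝑇 for simulated voxels A and B. The onset times beyond 𝑡 = 0, 16, 32 s, the double-gamma HRF parameters, and the noise level are assumptions for illustration; SPM's exact canonical HRF parameters are not reproduced here.

```python
import numpy as np
from scipy.stats import gamma

TR, dur = 2.0, 260.0                        # run timing from the example
t = np.arange(0, dur, TR)
n = t.size

# Stick functions; onsets after t = 32 s are made up for illustration
onsets1, onsets2 = [0, 16, 48, 96, 144, 192], [32, 64, 112, 160, 208]
u1 = np.isin(t, onsets1).astype(float)
u2 = np.isin(t, onsets2).astype(float)

# A double-gamma HRF sketch (parameters assumed, not SPM's canonical ones)
th = np.arange(0, 32, TR)
hrf = gamma.pdf(th, 6) - gamma.pdf(th, 16) / 6
hrf /= hrf.max()

# Convolve, truncate to run length; the results are the design matrix columns
x1 = np.convolve(u1, hrf)[:n]
x2 = np.convolve(u2, hrf)[:n]
X = np.column_stack([x1, x2])

# Simulate voxel A (condition 1 only) and voxel B (both conditions)
rng = np.random.default_rng(4)
yA = X @ np.array([1.0, 0.0]) + 0.05 * rng.normal(size=n)
yB = X @ np.array([1.0, 1.0]) + 0.05 * rng.normal(size=n)

beta_A = np.linalg.solve(X.T @ X, X.T @ yA)   # approximately (1, 0)
beta_B = np.linalg.solve(X.T @ X, X.T @ yB)   # approximately (1, 1)
```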
The predicted time-series for each condition are now identified with the columns of the design
matrix 𝑋 ∈ ℝ𝑛×2 for an observed time-series 𝑦 ∈ ℝ𝑛. That is, the number that is given for predictor 1 and the number that is given for predictor 2 at time-point 𝑖 are concatenated to a row vector with two entries and entered in the 𝑖th row of the design matrix, where 𝑖 = 1,…,𝑛 and 𝑛 is the total number of time-points (=
volumes). Alternatively, one can imagine transposing the predicted MR signal of condition 1 into a column
vector and entering this as the first column of the design matrix, and likewise transposing the predicted MR
signal of condition 2 into a column vector and entering this as the second column of the design matrix.
Commonly, the design matrix is represented as a grey-scale image of its entries, which is shown in the left panel of Figure 6. Recall that the design matrix has as many rows as there are observed data points 𝑛 and as many
columns as there are predictor variables.
Now consider again the MR signal time course of voxel A in Figure 4 above. Transposing this row
vector to a column vector 𝑦 ∈ ℝ𝑛, we see that we can write down the GLM equation for this voxel time-
series quite well, if we chose the true, but unknown, parameter vector to be approximately 𝛽𝐴 ≔ (1,0)𝑇.
Intuitively, we can represent the corresponding GLM matrix multiplication graphically as in Figure 6.
Figure 6. Graphical representation of the GLM matrix product for 𝛽𝐴.
Likewise consider the MR signal time course of voxel B in Figure 4 above. Here we see that we can
recreate the voxel time-series using the same design matrix as for voxel A but setting the true, but
unknown, parameter vector to 𝛽𝐵 ≔ (1,1)𝑇 as shown in Figure 7. Note that as usual, the observed signal
results from the outcome of the design matrix and parameter multiplication plus a stochastic noise vector,
here denoted as 휀𝐴/휀𝐵.
Figure 7. Graphical representation of the GLM matrix product for 𝛽𝐵.
Equivalently, we can overlay the predicted (= estimated) time-series 𝑋�̂�𝐴 and 𝑋�̂�𝐵 and the observed
time-series to confirm that our choice of the beta parameter estimates as shown in the figure yields a good approximation of the voxel time-courses (Figure 8). Here, the parameter estimates resulted in 𝛽̂𝐴 = (1.0027, −0.0471)𝑇 and 𝛽̂𝐵 = (1.0033, 1.0025)𝑇.
Figure 8 Time course representation of predicted and observed voxel time-courses based on the parameter choices �̂�𝐴
and �̂�𝐵.
In summary, the value of the (true, but unknown, or estimated) beta parameter that belongs to a specific onset regressor tells us something about the voxel's preference with respect to the experimental condition: for voxel A, the first entry in 𝛽𝐴/𝛽̂𝐴 is large (≈ 1) and the second entry in 𝛽𝐴/𝛽̂𝐴 is small (≈ 0).
From the discussion above we see that this means that voxel A is responsive to condition 1, but not to
condition 2. Likewise, for voxel B, the two entries in 𝛽𝐵/�̂�𝐵 are very similar (both are ≈ 1), and as discussed
above, voxel B responds equally well to both conditions.
Finally, performing statistical tests on the estimated parameters for all voxels over the brain as discussed in Section 7 yields the so-called “statistical parametric maps”. For example, for voxel A, the T-statistic using a contrast vector of the form 𝑐 = (1, −1)𝑇 will yield a value deviating substantially from zero, while for voxel B the difference between the entries in 𝛽̂𝐵 is around zero, and the T-statistic is thus also close to zero (given that the estimated variance parameter is not also close to zero). Marking voxels with large positive T-statistic values (e.g. > 2) with hot colors and voxels with large negative T-statistic values (e.g. < −2) with cold colors then results in statistical parametric maps. Note that these maps are usually thresholded based on the corresponding p-values, so that, for example, voxels with T-statistic values corresponding to p-values larger than 0.001 are not colored at all.
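The contrast-based T-statistic used here follows the standard GLM formula 𝑡 = 𝑐𝑇𝛽̂ / √(𝜎̂2 𝑐𝑇(𝑋𝑇𝑋)−1𝑐). The design matrix, noise level, and the two simulated “voxels” below are illustrative assumptions:

```python
import numpy as np

def glm_t_statistic(y, X, c):
    """T-statistic for the contrast c under the standard GLM."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)   # unbiased/REML
    return (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 2))          # two illustrative condition regressors
c = np.array([1.0, -1.0])            # condition 1 minus condition 2

y_A = X @ np.array([1.0, 0.0]) + 0.5 * rng.normal(size=n)   # "voxel A"
y_B = X @ np.array([1.0, 1.0]) + 0.5 * rng.normal(size=n)   # "voxel B"

t_A = glm_t_statistic(y_A, X, c)     # large in magnitude: effects differ
t_B = glm_t_statistic(y_B, X, c)     # typically near zero: effects are equal
```

Applying this computation voxel-by-voxel and thresholding the resulting values is what produces a statistical parametric map.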
Figure 9. A statistical parametric map
Study Questions

1. What does it mean that the GLM is used for FMRI data analysis in a “mass-univariate” fashion?
2. Describe the FMRI data organization after FMRI data preprocessing.
3. What is the difference between a hemodynamic response and a hemodynamic response function?
4. Which GLM design category is used for the analysis of FMRI time-series data?
5. What do the beta parameter estimates obtained in a GLM of the FMRI time-series data of a single voxel reflect?
Study Question Answers
1. In standard GLM analyses of FMRI data sets, the time-series data from each individual voxel are modelled using the same design matrix, and parameter estimation and inference are performed in a voxel-by-voxel fashion. The signal modelled is thus “univariate”, i.e. it corresponds to the scalar voxel-specific MR signal over time. However, the same procedure is used as many times as there are voxels in the data set, hence “mass-univariate”.
2. After data acquisition and data preprocessing (which does not alter the data format, but only the data content), FMRI data correspond to “voxel time-courses”. In other words, for each three-dimensional brain location (voxel) a time-course of MR signal values is obtained.
3. A hemodynamic response is an empirical entity. In the context of FMRI it corresponds to the MR signal change measured at a given brain location, usually in response to some stimulation. A hemodynamic response function, on the other hand, is a theoretical
entity. It corresponds to an idealized, mathematical model of a hemodynamic response and is often employed in GLM-based FMRI data analyses.
4. FMRI time-series data is usually analysed using multiple linear regression models. In these models, the design matrix columns correspond to continuous “regressors” or “predictors”, which in their weighted summation correspond to the (deterministic) model of observed MR signal changes at a given voxel.
5. Beta parameter estimates in GLM analyses of FMRI data reflect the estimated effect that a given regressor has on the observed signal of a given voxel. For example, if the regressor encodes the temporal evolution of a specific experimental condition, the beta parameter estimate serves as the measure of the effect size of this condition for the given voxel.
First-Level Regressors
In the previous Section we have seen that the stimulus-onset times are converted into “predicted MR signal time-courses” which form the columns of the voxel time-course GLM. In the current Section, we explore the rationale and the technical details of this conversion, commonly referred to as the “convolution of stimulus onset functions with a hemodynamic response function”. In general, this convolution-based approach is motivated by a “linear system” view of the experimental perturbation and the evoked BOLD response, which we discuss below.
(1) Discrete time-signals
Discrete-time signals can be represented mathematically as “real-valued sequences”. A sequence of numbers, in which the 𝑖th number in the sequence is denoted 𝑥𝑖, is formally written as

𝑥 ≔ (𝑥𝑖) (1)
If the number of elements 𝑥𝑖 is finite, we call the sequence a “finite sequence”. Sequences of numbers
acquired in an experimental context are usually finite. In this case, if 𝑛 ∈ ℕ denotes the number of sequence
elements, a sequence is identical to an element of ℝ𝑛. If the number of elements 𝑥𝑖 of a sequence is
infinite, we call the sequence an “infinite sequence”. Infinite sequences are usually theoretical constructs,
that can serve as approximations to data sequences acquired in experimental settings. In the current context
we allow for both negative and positive discrete indices 𝑖, i.e., we set 𝑖 ∈ ℤ. Infinite sequences with integer
indices are identical to elements of ℝℤ. The value 𝑥𝑖 of a sequence is often referred to as the “𝑖th sample of
the sequence”. An example of a real-valued finite sequence with 𝑛 = 7 elements is the following
𝑥 = (𝑥−3, 𝑥−2, 𝑥−1, 𝑥0, 𝑥1, 𝑥2, 𝑥3) = (5.2, 1.0, 4.2, 𝜋, √2, 3.8, 1.0) (2)
Discrete sequences are most sensibly visualized using “stem plots”; however, they are often visualized more conventionally using line plots for convenience (Figure 1).
For two infinite sequences or two finite sequences of equal length 𝑥′ = (𝑥𝑖′) and 𝑥′′ = (𝑥𝑖′′), we define their sum by element-wise addition, i.e., if

𝑦 = 𝑥′ + 𝑥′′ ⇔ (𝑦𝑖) = (𝑥𝑖′) + (𝑥𝑖′′) (3)

then

𝑦𝑖 = 𝑥𝑖′ + 𝑥𝑖′′ for all 𝑖 ∈ ℤ or 𝑖 ∈ ℕ𝑛. (4)
Likewise, we define the scalar multiplication of a sequence (𝑥𝑖) with a scalar 𝑎 ∈ ℝ as the multiplication of
all elements of the sequence with that scalar. In other words, if
𝑦 = 𝑎𝑥 ⇔ (𝑦𝑖) = 𝑎(𝑥𝑖) (5)
then
𝑦𝑖 = 𝑎𝑥𝑖 for all 𝑖 ∈ ℤ or 𝑖 ∈ ℕ𝑛 (6)
An important example of a sequence is the so-called “unit sample sequence”. It is defined as the infinite
sequence
𝛿 ≔ (𝛿𝑖), where 𝛿𝑖 ≔ {1, 𝑖 = 0; 0, 𝑖 ≠ 0} for all 𝑖 ∈ ℤ (7)

The unit sample sequence thus takes on the value 1 at sequence index 𝑖 = 0, and takes on the value 0 for all other values of 𝑖 ∈ ℤ. It can be visualized as a “stick function” with a single stick at 𝑖 = 0 (Figure 1). The unit
sample sequence plays the same role for discrete-time signals and systems that the unit impulse function or
“Dirac delta function” does for continuous-time signals and systems. For convenience, we often refer to the
unit sample sequence as a “discrete-time impulse” or simply as an “impulse”. It is important to note that a discrete-time impulse does not suffer from the mathematical complications of the continuous-time impulse; its definition is simple and precise.
Figure 1. Discrete time-signals. The upper left panel depicts the example sequence (2) in form of a “stem” plot, while the upper right panel depicts the same sequence in more conventional, but less precise, form. The lower panel depicts the unit sample sequence 𝛿.
One of the important aspects of the unit sample sequence is that the elements of an arbitrary
sequence can be represented as a sum of scaled, delayed impulse functions. For example, the elements of
the sequence 𝑥 given by
𝑥 ≔ (𝑥−3, 𝑥−2, 𝑥−1, 𝑥0, 𝑥1, 𝑥2, 𝑥3) = (𝑎−3, 0, 0, 0, 𝑎1, 𝑎2, 0) with 𝑎−3, 𝑎1, 𝑎2 ∈ ℝ (8)
can be expressed as
𝑥𝑖 = 𝑎−3 ⋅ 𝛿𝑖+3 + 0 ⋅ 𝛿𝑖+2 + 0 ⋅ 𝛿𝑖+1 + 0 ⋅ 𝛿𝑖 + 𝑎1 ⋅ 𝛿𝑖−1 + 𝑎2 ⋅ 𝛿𝑖−2 + 0 ⋅ 𝛿𝑖−3 (9)
for 𝑖 = −3, −2, −1, 0, 1, 2, 3. As an example, consider the case 𝑖 = 2. Then (9) results in
𝑥2 = 𝑎−3𝛿2+3 + 0 ⋅ 𝛿2+2 + 0 ⋅ 𝛿2+1 + 0 ⋅ 𝛿2 + 𝑎1 ⋅ 𝛿2−1 + 𝑎2 ⋅ 𝛿2−2 + 0 ⋅ 𝛿2−3
= 𝑎−3 ⋅ 𝛿5 + 0 ⋅ 𝛿4 + 0 ⋅ 𝛿3 + 0 ⋅ 𝛿2 + 𝑎1 ⋅ 𝛿1 + 𝑎2 ⋅ 𝛿0 + 0 ⋅ 𝛿−1
= 𝑎−3 ⋅ 0 + 0 ⋅ 0 + 0 ⋅ 0 + 0 ⋅ 0 + 𝑎1 ⋅ 0 + 𝑎2 ⋅ 1 + 0 ⋅ 0
= 𝑎2 (10)
More generally speaking, the elements of any sequence can be expressed as

𝑥𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝛿𝑖−𝑘 (11)

Equation (11) is important, because it expresses the sequence values 𝑥𝑖 as an infinite sum of the products of shifted impulse sequences 𝛿𝑖−𝑘 (recall that 𝛿𝑖−𝑘 is one for 𝑖 = 𝑘 and zero everywhere else) with the original sequence values 𝑥𝑘 as coefficients. This representation of a sequence will be used below to
express the transformation of a sequence under a linear system.
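The impulse representation (11) can be checked numerically. The following is a minimal Python sketch (the document's own numerical examples use Matlab, so the language choice is an assumption here): it reconstructs every element of the finite example sequence (2) as a weighted sum of shifted unit sample sequences.

```python
def delta(i):
    """Unit sample sequence of equation (7): 1 at index 0, 0 elsewhere."""
    return 1.0 if i == 0 else 0.0

# A finite sequence with support on indices -3 .. 3, stored as a dict from
# index to value (values approximate those of equation (2)).
x = {-3: 5.2, -2: 1.0, -1: 4.2, 0: 3.14159, 1: 1.41421, 2: 3.8, 3: 1.0}

# Equation (11): each element x_i equals the sum over k of x_k * delta(i - k),
# i.e. a weighted sum of shifted unit sample sequences.
reconstructed = {i: sum(x[k] * delta(i - k) for k in x) for i in x}

print(reconstructed == x)  # True: the impulse representation recovers x
```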
(2) Discrete-time systems
A discrete-time system can be defined mathematically as a function 𝑇 that maps values of an input
sequence 𝑥 onto values of an output sequence 𝑦.
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ (𝑦𝑖) ≔ 𝑇((𝑥𝑖)) (1)
As usual 𝑇 represents a rule or formula for computing the output sequence value from the input sequence
values. It is important to note that the value of the output sequence at each value of the index 𝑖 ∈ ℤ can
depend on all or part of the entire sequence 𝑥. Usually, systems are defined by means of the definition of
their values, i.e. the general definition of a system as in (1) is followed by a definition of the values 𝑦𝑖 of (𝑦𝑖)
in terms of the values 𝑥𝑖 of (𝑥𝑖). An example is the so-called “accumulator system” which is defined by the
transformation
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ 𝑇((𝑥𝑖)) = (𝑦𝑖), where 𝑦𝑖 ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘 (2)
In the current context, we are concerned with specific discrete time-systems: linear and time-invariant
discrete-time systems, also referred to as “LTI systems”. We next define the concept of a linear system and
then the concept of a time-invariant system.
Linear Systems
Linear systems are systems that are additive and homogeneous. A system 𝑇 is said to be “additive” if and only if

𝑇((𝑥𝑖′) + (𝑥𝑖′′)) = 𝑇((𝑥𝑖′)) + 𝑇((𝑥𝑖′′)) (3)

for all input sequences (𝑥𝑖′) and (𝑥𝑖′′). In other words, a system is said to be additive if its output for a sum of two input sequences is equal to the sum of the system’s outputs for each individual input sequence. A system is said to be “homogeneous” if and only if
𝑇(𝑎(𝑥𝑖)) = 𝑎𝑇((𝑥𝑖)) (4)
for all scalars 𝑎 ∈ ℝ and all sequences (𝑥𝑖). A system 𝑇 that is both additive and homogeneous is referred to as a “linear system”.
Linear systems fulfill the following “superposition principle”. If a system 𝑇 is linear, then for all sequences (𝑥𝑖′) and (𝑥𝑖′′) and all scalars 𝑎, 𝑏 ∈ ℝ

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5)
Proof of (5)

Because the system is additive, we have

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑇(𝑎(𝑥𝑖′)) + 𝑇(𝑏(𝑥𝑖′′)) (5.1)

Because the system is homogeneous, we further have

𝑇(𝑎(𝑥𝑖′)) + 𝑇(𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5.2)

Thus

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (5.3)

i.e., the superposition principle holds.

□
As an example, we show that the accumulator system is a linear system. To this end, we consider two sequences (𝑥𝑖′) and (𝑥𝑖′′), which the accumulator system maps onto the two sequences

𝑇((𝑥𝑖′)) = (𝑦𝑖′), where 𝑦𝑖′ ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′ (6)

and

𝑇((𝑥𝑖′′)) = (𝑦𝑖′′), where 𝑦𝑖′′ ≔ ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′′ (7)

respectively. Using the definition of the multiplication of sequences with scalars, the definition of sequence addition, and the definition of the accumulator system, we have

(𝑦𝑖) = 𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)), where 𝑦𝑖 ≔ ∑_{𝑘=−∞}^{𝑖} (𝑎𝑥𝑘′ + 𝑏𝑥𝑘′′) (8)

Considering the right-hand side of the above, we have with the well-known properties of sums

𝑦𝑖 = ∑_{𝑘=−∞}^{𝑖} (𝑎𝑥𝑘′ + 𝑏𝑥𝑘′′) = ∑_{𝑘=−∞}^{𝑖} 𝑎𝑥𝑘′ + ∑_{𝑘=−∞}^{𝑖} 𝑏𝑥𝑘′′ = 𝑎 ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′ + 𝑏 ∑_{𝑘=−∞}^{𝑖} 𝑥𝑘′′ = 𝑎𝑦𝑖′ + 𝑏𝑦𝑖′′ (9)

Because the above holds for all 𝑖 ∈ ℤ, we have found

(𝑦𝑖) = 𝑎(𝑦𝑖′) + 𝑏(𝑦𝑖′′) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (10)

and thus

𝑇(𝑎(𝑥𝑖′) + 𝑏(𝑥𝑖′′)) = 𝑎𝑇((𝑥𝑖′)) + 𝑏𝑇((𝑥𝑖′′)) (11)
Time-invariant systems
A time-invariant system (also referred to as a “shift-invariant” system) is a system for which a time
shift or delay of the input sequence causes a corresponding shift in the output sequence. Formally, suppose
that a system transforms the input sequence (𝑥𝑖) into the output sequence (𝑦𝑖). Then, the system is said to be time-invariant if, for all 𝑖0 ∈ ℤ, the input sequence (𝑥𝑖′), defined by 𝑥𝑖′ ≔ 𝑥𝑖−𝑖0 for all 𝑖 ∈ ℤ, produces the output sequence (𝑦𝑖′), where 𝑦𝑖′ ≔ 𝑦𝑖−𝑖0 for all 𝑖 ∈ ℤ.
In the next section, we exploit the linearity and time-invariance properties of LTI systems to introduce the notion of discrete-time convolution.
(3) Convolution
For LTI systems, the output or “response” to an arbitrary input sequence (𝑥𝑖) can be obtained by convolving the input sequence with the system’s “impulse response function”, which is the system’s output sequence for a unit sample sequence. In other words, for an LTI system
𝑇:ℝℤ → ℝℤ, (𝑥𝑖) ↦ (𝑦𝑖) ≔ 𝑇((𝑥𝑖)) (1)
with impulse response function
(ℎ𝑖) ≔ 𝑇((𝛿𝑖)) (2)
the elements of (𝑦𝑖) can be written as

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (3)
Proof of (3)

To see that the above holds, first consider the representation of a sequence 𝑥 as an infinite sum of the products of shifted impulse sequences 𝛿𝑖−𝑘 with the values of its elements, i.e.

(𝑥𝑖) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 (𝛿𝑖−𝑘) (3.1)

Because the system 𝑇 considered is linear and obeys the superposition principle, its output to (3.1) can be written as

(𝑦𝑖) = 𝑇((𝑥𝑖)) = 𝑇(∑_{𝑘=−∞}^{∞} 𝑥𝑘 (𝛿𝑖−𝑘)) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝑇((𝛿𝑖−𝑘)) (3.2)

In other words, the output of a linear system to an input sequence 𝑥 can be written as the infinite sum of the outputs of the system to the unit sample sequences (𝛿𝑖−𝑘), weighted by the values 𝑥𝑘 of the input sequence 𝑥. Let (ℎ𝑖𝑘) denote the output sequence of the system in response to the unit sample sequence (𝛿𝑖−𝑘), i.e.

(ℎ𝑖𝑘) ≔ 𝑇((𝛿𝑖−𝑘)) (3.3)

For a time-invariant system, if (ℎ𝑖) denotes the response to the unit sample sequence (𝛿𝑖), then (ℎ𝑖−𝑘) corresponds to the response to the shifted unit sample sequence (𝛿𝑖−𝑘), such that we can write

(ℎ𝑖−𝑘) = 𝑇((𝛿𝑖−𝑘)) (3.4)

We can thus write

(𝑦𝑖) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 𝑇((𝛿𝑖−𝑘)) = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 (ℎ𝑖−𝑘) (3.5)

where, notably,

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (3.6)

□
The formation of a system’s output sequence by means of computing its values via the sum

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 for all 𝑖 ∈ ℤ (4)

where 𝑥𝑘 denotes the 𝑘th value of the input sequence 𝑥 and ℎ𝑖−𝑘 denotes the (𝑖 − 𝑘)th value of the system’s impulse response function, i.e., the response of the system to the unit sample sequence 𝛿, is referred to as the “convolution of the input sequence with the impulse response function” and denoted by

𝑦 = 𝑥 ⨂ ℎ (5)

The operation of discrete-time convolution thus takes two sequences (𝑥𝑖) and (ℎ𝑖) and produces a third sequence (𝑦𝑖), the values 𝑦𝑖 of which are given by the convolution sum (4).
Note that the evaluation of each value of the output sequence in (4) requires the computation of an infinite sum. In practical scenarios, this can often be avoided by considering systems with “finite impulse response functions”. Finite impulse response functions are impulse response functions that have non-zero values only on a finite support set −𝑚, −𝑚 + 1, … , 𝑚 − 1, 𝑚 of indices. In this case, the values of the output sequence (𝑦𝑖) can be computed based on the finite sums

𝑦 = ℎ ⨂ 𝑥 ⇒ 𝑦𝑖 = ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 for all 𝑖 ∈ ℤ (6)
Proof of (6)

To show that for finite impulse response functions ℎ the infinite sum for each element 𝑦𝑖 of the output sequence 𝑦 of a system can be evaluated based on a finite sum comprising as many terms as the impulse response function contains non-zero elements, we first show that the convolution operation is commutative, i.e., that

𝑦 = 𝑥 ⨂ ℎ = ℎ ⨂ 𝑥 (6.1)

Specifically, substituting 𝑗 ≔ 𝑖 − 𝑘 (and thus 𝑘 = 𝑖 − 𝑗) in (4), we have

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑥𝑘 ℎ𝑖−𝑘 = ∑_{𝑗=−∞}^{∞} 𝑥𝑖−𝑗 ℎ𝑗 = ∑_{𝑗=−∞}^{∞} ℎ𝑗 𝑥𝑖−𝑗 for all 𝑖 ∈ ℤ ⇔ 𝑦 = ℎ ⨂ 𝑥 (6.2)

Now, for a finite impulse response function ℎ, i.e., a sequence ℎ that takes on non-zero values only on a finite support set −𝑚, −𝑚 + 1, … , 𝑚 − 1, 𝑚, one may thus use a finite summation for the evaluation of 𝑦:

𝑦 = ℎ ⨂ 𝑥 ⇒ 𝑦𝑖 = ∑_{𝑗=−∞}^{∞} ℎ𝑗 𝑥𝑖−𝑗 = ⋯ + 0 + 0 + ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 + 0 + 0 + ⋯ = ∑_{𝑗=−𝑚}^{𝑚} ℎ𝑗 𝑥𝑖−𝑗 (6.3)

because for all other values of 𝑗 we have ℎ𝑗 = 0, and thus the contribution ℎ𝑗 𝑥𝑖−𝑗 to the sum is zero.

□
As an example, consider Figure 2. Here, the impulse response function is non-zero for the indices 𝑗 = 0, … , 7 (i.e., it is of length 8) and zero elsewhere. The system input function comprises all zeros, except for 𝑖 = 0, 𝑖 = 9, 𝑖 = 17, 𝑖 = 21, and 𝑖 = 29. Note that the input function is also finite in the current case, comprising 40 elements. Evaluation of

𝑦𝑖 = ∑_{𝑗=0}^{7} ℎ𝑗 𝑥𝑖−𝑗 for 𝑖 = 0, 1, 2, … , 39 (7)

results in the sequence shown in the lower panel. Note that for 𝑖 < 7, some of the sequence elements 𝑥𝑖−𝑗 in (7) are not defined. To nevertheless evaluate 𝑦𝑖 for 𝑖 < 7, one usually “pads” the input sequence with zeros. In the current case, this amounts to setting 𝑥𝑘 ≔ 0 for 𝑘 = −7, −6, … , −1.
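The finite convolution sum (7), including the zero-padding convention for small 𝑖, can be sketched in Python as follows. The kernel values below are arbitrary illustrative numbers standing in for the length-8 impulse response of the example; only the impulse onsets match the text.

```python
def conv_finite(x, h):
    """Finite convolution y_i = sum_j h_j * x_{i-j} (eq. (7)), with the input
    implicitly zero-padded for indices i - j outside 0 .. len(x) - 1."""
    return [sum(h[j] * x[i - j] for j in range(len(h)) if 0 <= i - j < len(x))
            for i in range(len(x))]

# Arbitrary length-8 impulse response standing in for the h of the example.
h = [0.0, 0.3, 0.8, 1.0, 0.7, 0.3, 0.1, 0.0]

# Input of 40 elements with unit impulses at the onsets used in the example.
x = [0.0] * 40
for onset in (0, 9, 17, 21, 29):
    x[onset] = 1.0

y = conv_finite(x, h)
print(y[:8] == h)  # True: an isolated unit impulse reproduces a copy of h
```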
Figure 2. Convolution. The upper right panel depicts a finite impulse response function ℎ. The upper left panel depicts an input sequence 𝑥. The lower panel depicts the response of an LTI system with impulse response function ℎ to the input sequence 𝑥, or, equivalently, the convolution of the sequence 𝑥 with the sequence ℎ.
(4) The canonical hemodynamic response function
The canonical hemodynamic response function used to model the MR signal time-course in response to an instantaneous neural event is based on the gamma probability density function. However, in the current context, the gamma probability density function is not viewed as a probability density, but merely as a function, i.e., it carries no probabilistic connotation. The input argument of this function is peri-stimulus time, and the function is parameterized in its shape and rate form

𝐺 ∶ ℝ+ → ℝ+, 𝑡 ↦ 𝐺(𝑡; 𝛼, 𝛽) ≔ (𝛽^𝛼/𝛤(𝛼)) 𝑡^{𝛼−1} exp(−𝛽𝑡) for 𝛼, 𝛽 > 0 (1)
Recall that in (1),

𝛤 ∶ ℝ+\{0} → ℝ, 𝛼 ↦ 𝛤(𝛼) = ∫_0^∞ 𝜏^{𝛼−1} 𝑒^{−𝜏} 𝑑𝜏 (2)

denotes the Gamma function.
The canonical hemodynamic response function is given by the difference of two gamma probability density functions. The first density function describes the main response of an MR signal increase following stimulation, while the second density function, which is subtracted from the first, describes the post-stimulus undershoot. For an onset at 𝑡 = 0, a length of 𝑇 > 0 seconds, and a time-bin width of 𝛥𝑡 > 0, the canonical hemodynamic response function is parameterized by five parameters 𝜃1, … , 𝜃5 > 0 and given by

𝐻 ∶ ℕ𝑛0 → ℝ, 𝑢 ↦ 𝐻(𝑢) ≔ 𝐺(𝑢; 𝜃1/𝜃3, 𝛥𝑡/𝜃3) − (1/𝜃5) 𝐺(𝑢; 𝜃2/𝜃4, 𝛥𝑡/𝜃4) (3)

where 𝑢 is a time-bin index in the support interval and 𝑛 ≔ 𝑇/𝛥𝑡 is the number of support points, rounded to the nearest integer. In this parameterization, the parameters have the following interpretations: 𝜃1 corresponds to the delay of the peak of the hemodynamic response function with respect to its onset, and 𝜃2 corresponds to the delay of the post-stimulus undershoot with respect to its onset. 𝜃3 and 𝜃4 describe the widths of the main response and the undershoot, respectively. Finally, 𝜃5 encodes the ratio of the main response to the undershoot.
Commonly chosen values for the parameters of the hemodynamic response function, specifically in the context of high temporal resolution convolution (see below), are 𝑇 = 32 seconds and 𝛥𝑡 = 𝑇𝑅/16 seconds, where 𝑇𝑅 is the scan repetition time used in the experimental data acquisition in seconds, and 𝜃1 = 6, 𝜃2 = 16, 𝜃3 = 1, 𝜃4 = 1, and 𝜃5 = 6. Figure 1 visualizes the canonical hemodynamic response function for four different parameter settings with 𝑇 = 32 and 𝑇𝑅 = 2, i.e., 𝛥𝑡 = 0.125 seconds.
Figure 1. The canonical hemodynamic response function for four different parameter settings of 𝜃 ≔ (𝜃1, 𝜃2, 𝜃3, 𝜃4, 𝜃5).
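Under the parameterization of equation (3), the canonical HRF can be evaluated in a few lines. The following is a minimal Python sketch (an assumption, since the document's numerical examples use Matlab), with math.gamma supplying the Gamma function of equation (2); with the commonly chosen parameter values stated above, the main response peaks near 5 seconds of peri-stimulus time.

```python
import math

def gamma_pdf(t, alpha, beta):
    """Gamma density of equation (1) in shape (alpha) / rate (beta) form."""
    if t <= 0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * t ** (alpha - 1) * math.exp(-beta * t)

def canonical_hrf(tr=2.0, total=32.0, theta=(6.0, 16.0, 1.0, 1.0, 6.0)):
    """Double-gamma HRF of equation (3) on time bins of width dt = TR / 16."""
    t1, t2, t3, t4, t5 = theta
    dt = tr / 16.0
    n = round(total / dt)
    hrf = [gamma_pdf(u, t1 / t3, dt / t3) - gamma_pdf(u, t2 / t4, dt / t4) / t5
           for u in range(n + 1)]
    return hrf, dt

hrf, dt = canonical_hrf()
peak_bin = max(range(len(hrf)), key=lambda u: hrf[u])
print(peak_bin * dt)  # peri-stimulus time of the main response peak, close to 5 s
```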
(5) Stimulus onset convolution in GLM-FMRI
In this section, we discuss some details of the generation of predicted MR signal time-courses that result from the combination of linear time-invariant system theory with stimulus onset functions and the canonical hemodynamic response function. We first note that the convolution of stimulus onset functions with the canonical HRF is commonly performed at a higher temporal resolution than that given by the MR sampling rate (inverse TR) (Figure 1).
We next consider the effect of different stimulus durations and amplitudes on the predicted MR signal time-course, assuming no temporal overlap. As evident from the upper panel of Figure 2, increasing the stimulus duration has two effects: firstly, up to a critical duration, it scales the predicted hemodynamic response, i.e., longer-duration stimuli predict an overall larger MR signal change. Secondly, if the stimulus duration exceeds the duration of the canonical HRF kernel, no further signal increase is predicted, but the return to baseline is predicted to be delayed. On the other hand, keeping the stimulus duration constant and changing its amplitude leads to proportional changes in the amplitude of the predicted MR signal, but no change in its temporal evolution (Figure 2, lower panel).
Figure 1. Microtime resolution convolution
Figure 2. Effects of stimulus duration and amplitude
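The duration effect described above can be illustrated numerically: convolving boxcar stimulus functions of increasing duration with a fixed kernel increases the predicted peak response until the boxcar outlasts the kernel. The kernel below is an arbitrary HRF-like shape chosen for illustration, not the canonical HRF.

```python
def convolve(x, h):
    """Zero-padded discrete convolution of the input x with the kernel h."""
    return [sum(h[j] * x[i - j] for j in range(len(h)) if 0 <= i - j)
            for i in range(len(x))]

# Arbitrary HRF-like kernel (not the canonical HRF) on a coarse time grid.
h = [0.0, 0.2, 0.6, 1.0, 0.8, 0.4, 0.1, -0.1, -0.05, 0.0]

# Boxcar stimulus functions of duration 1 and 5 time bins, respectively.
short = [1.0 if i < 1 else 0.0 for i in range(60)]
long_ = [1.0 if i < 5 else 0.0 for i in range(60)]

peak_short = max(convolve(short, h))
peak_long = max(convolve(long_, h))
print(peak_short < peak_long)  # True: longer stimuli predict larger responses
```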
Finally, we consider the effect of zero-duration, constant-amplitude stimulus onset functions which evoke temporal overlap in the corresponding MR signal (Figure 3). As evident from Figure 3, short stimulus onset asynchronies prevent the predicted MR signal from returning to baseline between stimulus onsets. Up to a certain rate, the stimulus variability is still propagated to the MR signal; at high rates, however, the predicted MR signal takes the form of that predicted under single stimuli of long duration.
Figure 3. Effects of stimulus onset asynchronies.
Study Questions
1. How are the columns of a design matrix in the application of the GLM to FMRI commonly generated?
2. On which signal processing framework is the generation of GLM-FMRI regressors based?
3. Which mathematical function forms the basis of the “canonical HRF” and what does the canonical HRF describe?
4. What is the effect of increasing the duration of a stimulus in the stimulus-onset function/HRF-convolution framework for the GLM
analysis of FMRI data?
5. What is the effect of decreasing the inter-stimulus times for short duration stimulus onset functions in the stimulus-onset
function/HRF-convolution framework for the GLM analysis of FMRI data?
Study Questions Answers
1. Commonly, stimulus-onset times are converted into predicted MR signal time-courses which form the columns of the voxel time-
course GLM. Technically, the process corresponds to the convolution of stimulus onset functions with a hemodynamic response
function
2. The generation of GLM-FMRI regressors is based on the framework of discrete-time linear time-invariant system theory.
3. The canonical hemodynamic response function is based on the difference of two gamma probability density functions. The first
density function describes the main response of an MR signal increase following stimulation, while the second density function,
which is subtracted from the first, describes the post-stimulus undershoot.
4. Increasing the stimulus duration has two effects: firstly, up to a critical duration, it scales the predicted hemodynamic response,
i.e. the prediction of longer duration stimuli is an overall increase of the MR signal change. Secondly, if the stimulus duration exceeds
the duration of the canonical HRF kernel, no further signal increase is predicted, but the return to baseline is predicted to be
delayed.
5. Short inter-stimulus onset intervals prevent the predicted MR signal from returning to baseline between stimulus onsets. Up to a certain rate, the stimulus variability is still propagated to the MR signal; at high rates, however, the predicted MR signal takes the form of that predicted under single stimuli of long duration.
First-Level Design Matrices
(1) Parameterizing event-related FMRI designs
Experimental design in convolution-based GLM-FMRI conforms to two fundamental questions: (1)
which conditions should be included in the experiment and (2) when should which condition be presented?
In the current section, we consider the latter question, i.e., we assume that the number of conditions and
their properties have been decided upon and the question is how to distribute them over the experimental
time-course in order to achieve a design that is in some sense “good”. More specifically, the question of what defines a good design will be dealt with in the next section. In the current section, we consider the more
fundamental question of how designs can be parameterized. To this end, we make a number of simplifying
assumptions. Firstly, we do not consider psychological factors, such as the perceived randomness of the
design. Secondly, we consider only designs comprising one or two conditions.
In the case of a single experimental condition, the only question with respect to experimental design
amounts to the question of when to present the trials of the condition. Here, we assume that a trial of an
experimental condition comprises a single event and that the minimal time between onsets of events 𝑡∗,
referred to as “minimal stimulus-onset asynchrony” is constant. In this case, the design can be
parameterized in terms of an “event-probability function” which assigns to each possible event-time a
probability of the event happening at this time or not. More formally, let
𝑆 = {0, 𝑡∗, 2𝑡∗, 3𝑡∗, … , 𝑛𝑡∗} = {𝑘𝑡∗|𝑘 ∈ ℕ𝑛0} (1)
denote a partition of the total time of an experimental run 𝑇 ≔ 𝑛𝑡∗ ∈ ℝ+. Then an “event-probability
function” is a function
𝑓 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓(𝑘𝑡∗) (2)
An event-probability function induces a set of random variables {𝑒𝑘𝑡∗ | 𝑘 ∈ ℕ𝑛0} that can take on the values 0 and 1, encoding the states that an event happens at time 𝑘𝑡∗ or that it does not, respectively. Notably, the
event probability function 𝑓 assigns to each time-point 𝑘𝑡∗ (𝑘 ∈ ℕ𝑛0) the probability of the event 𝑒𝑘𝑡∗ = 1,
such that
𝑓 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓(𝑘𝑡∗) ≔ 𝑝(𝑒𝑘𝑡∗ = 1) (3)
Many event-probability functions are conceivable. An example of an event-probability function is the following. Assuming that 𝑛 is an even integer, an event-probability function can be defined as

𝑓1 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓1(𝑘𝑡∗) ≔ {1, 𝑘 ≤ 𝑛/2; 0, 𝑘 > 𝑛/2} (4)

This function assigns a probability of 1 to the first 𝑛/2 + 1 minimal stimulus-onset asynchrony time-points, and a probability of 0 to the last 𝑛/2 minimal stimulus-onset asynchrony time-points (see Figure 1, uppermost panel). Another example of an event-probability function is

𝑓2 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓2(𝑘𝑡∗) ≔ 1 (5)
which assigns an event-occurrence probability of 1 to each minimal stimulus-onset asynchrony time-point (see Figure 1, lowermost panel). A third example of an event-probability function is the following

𝑓3 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ 𝑓3(𝑘𝑡∗) ≔ 0.5 (cos(2𝜋𝜔𝑘𝑡∗/𝑇) + 1) (6)

which parameterizes a time-varying event-probability by means of the frequency parameter 𝜔, representing the number of cycles per experimental run length 𝑇 (see Figure 1, middle panels, for 𝜔 = 2, 𝜔 = 4, and 𝜔 = 6, respectively).
In the case of two experimental conditions, the event-probability function has to encode the probabilities of trials of either condition occurring at each minimal stimulus-onset asynchrony time-point. In other words, the random variable 𝑒𝑘𝑡∗ now takes on four values 1, 2, 3, 4, encoding the events that a trial of the first condition is either happening (1) or not (2), and that a trial of the second condition is either happening (3) or not (4). A simple event-probability function in this case is given by

𝑓4 ∶ 𝑆 → [0,1], 𝑘𝑡∗ ↦ {𝑝(𝑒𝑘𝑡∗ = 1) = 0.5; 𝑝(𝑒𝑘𝑡∗ = 2) = 0.0; 𝑝(𝑒𝑘𝑡∗ = 3) = 0.5; 𝑝(𝑒𝑘𝑡∗ = 4) = 0.0} (7)

assuming that (1) a trial of either condition is presented at each 𝑘𝑡∗, and (2) the probability that this trial is of either condition is 0.5 (Figure 1 of the next section).
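Sampling a concrete two-condition design from the event-probability function 𝑓4 amounts to flipping a fair coin at every minimal stimulus-onset asynchrony time-point. The following Python sketch does this; the values of t_star and n are arbitrary illustrative choices.

```python
import random

random.seed(1)
t_star = 4   # minimal stimulus-onset asynchrony in seconds (arbitrary choice)
n = 20       # number of inter-onset intervals, so n + 1 candidate time-points

# Event-probability function f4 of equation (7): at every time k * t_star a
# trial occurs, and it belongs to condition 1 or condition 2 with probability
# 0.5 each.
onsets_c1, onsets_c2 = [], []
for k in range(n + 1):
    if random.random() < 0.5:
        onsets_c1.append(k * t_star)
    else:
        onsets_c2.append(k * t_star)

print(len(onsets_c1) + len(onsets_c2))  # n + 1: one trial per time-point
```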
Figure 1. Parameterization of single-condition convolution-based GLM designs in terms of event-probability functions. The red lines depict sampled onset functions according to the event-probability functions specified in the main text and visualized here as grey lines, and the blue lines depict the resulting design matrix regressors upon convolution with the canonical hemodynamic response function. In the current one-condition scenario, the design matrix comprises a single column only, such that 𝜉 = 𝑋𝑇𝑋. The uppermost and lowermost panels show that a blocked presentation design is more efficient than an equispaced design.
(2) Measuring event-related FMRI design efficiency
The “goodness” of a GLM design can be measured according to a variety of criteria. Here, we focus on a simple criterion that has been employed in the GLM-FMRI literature and relates to the variance of the beta parameter estimates. Specifically, recall that the distribution of the OLS beta estimator 𝛽̂ ∈ ℝ𝑝 is given by

𝑝(𝛽̂) = 𝑁(𝛽̂; 𝛽, 𝜎2(𝑋𝑇𝑋)−1) (1)

where 𝛽 ∈ ℝ𝑝 and 𝜎2 > 0 are the true, but unknown, GLM parameters, and 𝑋 ∈ ℝ𝑛×𝑝 is the design matrix. Based on the definition of the multivariate Gaussian distribution, the covariance matrix of the OLS beta estimator is thus

𝐶𝑜𝑣(𝛽̂) = 𝜎2(𝑋𝑇𝑋)−1 (2)

Intuitively, the diagonal elements of this covariance matrix encode how, for fixed 𝛽, 𝜎2, and 𝑋, the effect estimates 𝛽̂𝑗 (𝑗 = 1, … , 𝑝) of each experimental condition vary over repeated sampling from the GLM, or, in other words, how reliable these estimates are over repeated sampling of the identical GLM. According to (2), this variability is a function of the GLM parameter 𝜎2 and the inverse of the “design matrix correlation matrix” 𝑋𝑇𝑋. Assuming that 𝜎2 is constant, the variability of the effect size estimate 𝛽̂𝑗 is thus a function of the diagonal entry

((𝑋𝑇𝑋)−1)𝑗𝑗 = 𝑒𝑗𝑇(𝑋𝑇𝑋)−1𝑒𝑗 (3)
where 𝑒𝑗 denotes the 𝑗th canonical unit vector, i.e. the vector with all zeros except a one at the 𝑗th entry. For
a contrast vector 𝑐 ∈ ℝ𝑝, this motivates the following measure of design efficiency
𝜉 ∶ (𝑐, 𝑋) ↦ 𝜉(𝑐, 𝑋) ≔ (𝑐𝑇(𝑋𝑇𝑋)−1𝑐)−1 (4)
𝜉(𝑐, 𝑋) thus increases with decreasing variability of linear compounds of the parameter estimate 𝛽̂ over
repeated sampling of the GLM. Importantly, it depends on both the design matrix 𝑋 and the contrast of
interest 𝑐. In other words, according to the criterion (4) the same FMRI design can, in principle, be efficient
with respect to one contrast of interest and inefficient with respect to another.
A different interpretation of (4) is afforded by recalling the definition of the T-statistic as
𝑇 ≔ (𝑐𝑇𝛽̂ − 𝑐𝑇𝛽0) / √(𝜎2𝑐𝑇(𝑋𝑇𝑋)−1𝑐) (5)
for a null hypothesis represented by 𝛽0 ∈ ℝ𝑝 and the assumption of a known variance parameter 𝜎2. Assuming identical effect sizes, i.e., true, but unknown, values of 𝛽 ∈ ℝ𝑝, adopting the design efficiency criterion (4) corresponds to favoring larger 𝑇 values.
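The efficiency measure (4) is straightforward to evaluate for a two-column design matrix, since the 2 × 2 inverse can be written out by hand. The toy regressor values below are arbitrary illustrative numbers; they merely show that the same design matrix can be efficient for one contrast and inefficient for another.

```python
def efficiency(X, c):
    """Design efficiency xi(c, X) = (c^T (X^T X)^{-1} c)^{-1} of equation (4),
    for a two-column design matrix X; the 2x2 inverse is written out by hand."""
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    quad = sum(c[i] * inv[i][j] * c[j] for i in range(2) for j in range(2))
    return 1.0 / quad

# Toy regressors (arbitrary values): the two columns are strongly correlated.
X = [[1.0, 0.9], [0.8, 0.7], [0.2, 0.3], [0.5, 0.4], [0.9, 1.0]]
c_sum, c_diff = (1.0, 1.0), (1.0, -1.0)

# Correlated regressors are efficient for detecting common activation but
# inefficient for detecting differential activation.
print(efficiency(X, c_sum) > efficiency(X, c_diff))  # True
```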
Examples for the efficiency of GLM-FMRI designs with two conditions as a function of the minimal stimulus-onset asynchrony are shown in Figure 1. Specifically, for the two-condition event-probability function discussed in the previous section, two sampled designs for a fixed number of 20 events are shown for 𝑡∗ values of 4 and 12 seconds, respectively. In the lower right panel, the design efficiency 𝜉 as a function of 𝑡∗ is evaluated for two different contrasts of interest, 𝑐1 ≔ (1,1)𝑇, i.e., detecting activation over both conditions, and 𝑐2 ≔ (1,−1)𝑇, i.e., detecting differential activation between both conditions. Notably, short
stimulus onset asynchrony values are more efficient for detecting activation across both conditions, while
intermediate stimulus onset asynchronies of around 10 sec are most efficient for detecting differential
activations.
Figure 1. Measures of design efficiency in convolution-based GLM-FMRI. See main text for a detailed description.
(3) Finite impulse response designs
So far we have considered the standard approach to GLM-FMRI event-related design modelling using a pre-specified and fixed canonical hemodynamic response function as “basis function”. In principle, many different basis functions, i.e., abstract representations of the expected MR signal response at a given voxel, are conceivable. In the current Section, we discuss the most flexible approach, which in fact corresponds to an estimation of the HRF shape itself at a given voxel. In other words, the finite impulse response (FIR) approach may be seen as a GLM implementation of event-related hemodynamic response averaging. Event-related hemodynamic response averaging (also referred to as “selective averaging”) works by partitioning the data into peri-stimulus time-courses and averaging the corresponding data. FIR modelling, on the other hand, uses a combination of the notion of unit sample responses and the GLM formulation to achieve a similar goal.
To understand FIR designs, we first consider estimating the shape of a single peri-stimulus time-course using the GLM. To this end, reconsider the formulation of a finite discrete-time signal (𝑦𝑖) (𝑖 = 1, … , 𝑛) as a sum of weighted unit sample sequences 𝛿𝑖. We saw previously that we can rewrite a signal as

𝑦𝑖 = ∑_{𝑘=−∞}^{∞} 𝑦𝑘 𝛿𝑖−𝑘 (1)

Consider the specific case that the sum index 𝑘 runs from 𝑘 = 1 to 𝑘 = 𝑛, i.e., a finite impulse response:

𝑦𝑖 = ∑_{𝑘=1}^{𝑛} 𝑦𝑘 𝛿𝑖−𝑘 (2)
Then we have for the values 𝑦𝑖 (𝑖 = 1, … , 𝑛)

𝑦1 = 𝑦1𝛿1−1 + 𝑦2𝛿1−2 + 𝑦3𝛿1−3 + ⋯ + 𝑦𝑛𝛿1−𝑛 (3)

𝑦2 = 𝑦1𝛿2−1 + 𝑦2𝛿2−2 + 𝑦3𝛿2−3 + ⋯ + 𝑦𝑛𝛿2−𝑛

𝑦3 = 𝑦1𝛿3−1 + 𝑦2𝛿3−2 + 𝑦3𝛿3−3 + ⋯ + 𝑦𝑛𝛿3−𝑛

…

𝑦𝑛 = 𝑦1𝛿𝑛−1 + 𝑦2𝛿𝑛−2 + 𝑦3𝛿𝑛−3 + ⋯ + 𝑦𝑛𝛿𝑛−𝑛

which, because 𝛿𝑖−𝑘 = 1 for 𝑘 = 𝑖 and 𝛿𝑖−𝑘 = 0 otherwise, is equivalent to

𝑦1 = 𝑦1 ⋅ 1 (4)

𝑦2 = 𝑦2 ⋅ 1

𝑦3 = 𝑦3 ⋅ 1

…

𝑦𝑛 = 𝑦𝑛 ⋅ 1
Assume now that we aim to estimate the coefficients 𝑦𝑘 on the right-hand side of (4), i.e., we treat them as parameters and replace them by the symbols 𝛽𝑘:

𝑦𝑖 = ∑_{𝑘=1}^{𝑛} 𝛽𝑘 𝛿𝑖−𝑘 (5)

Then, by analogy to (4) and exchanging the order of multiplication, we have the following system of equations
𝑦1 = 1 ⋅ 𝛽1 (6)
𝑦2 = 1 ⋅ 𝛽2
𝑦3 = 1 ⋅ 𝛽3
…
𝑦𝑛 = 1 ⋅ 𝛽𝑛
We may thus rewrite (6) in matrix notation as

𝑦 = 𝑋𝛽 (7)

where 𝑦 ∈ ℝ𝑛, 𝑋 ≔ 𝐼𝑛 ∈ ℝ𝑛×𝑛, and 𝛽 ∈ ℝ𝑛. Assuming additive, independent and identically distributed zero-mean Gaussian noise, we arrive at the GLM formulation

𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) = 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, … , 𝑛) (8)
and
𝑋 ≔ 𝐼𝑛 = (
1 0 ⋯ 0
0 1 ⋯ 0
⋮ ⋮ ⋱ ⋮
0 0 ⋯ 1
) ∈ ℝ𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, … , 𝑛) (9)
In other words, to estimate the coefficients of a finite impulse response (1) based on a single observation of
this function, we can formulate a GLM using the (𝑛 × 𝑛) identity matrix as design matrix in a GLM which has
the same number of beta parameters as there are data points. Figure 1 below illustrates this process.
Figure 1. Estimation of a finite impulse response based on a single observation. The upper panels depict two true, but unknown, finite impulse responses, corresponding to the expectation parameters 𝜇 ∈ ℝ20 of a GLM. Taking single samples from the GLM with these expectations results in the center panels. Finally, the lower panels depict the estimated beta parameters 𝛽 ≔ (𝛽1, … , 𝛽𝑛)𝑇 based on the samples. As the design matrix corresponds to the identity matrix, the beta parameter values equal the sample values.
Consider next the case of observing the impulse response twice, i.e., observing 𝑦 ∈ ℝ2𝑛. In this case, the GLM formulation for averaging the corresponding data points into the correct peri-stimulus time bins corresponds to

𝑝(𝑦𝑖) = 𝑁(𝑦𝑖; 𝜇𝑖, 𝜎2) ⇔ 𝑦𝑖 = 𝜇𝑖 + 𝜀𝑖, 𝑝(𝜀𝑖) = 𝑁(𝜀𝑖; 0, 𝜎2) (𝑖 = 1, … , 2𝑛) (10)

and

𝑋 ≔ (𝐼𝑛
𝐼𝑛) ∈ ℝ2𝑛×𝑛, 𝛽 ∈ ℝ𝑛, 𝜇𝑖 = (𝑋𝛽)𝑖 (𝑖 = 1, … , 2𝑛) (11)
More generally speaking, for the FIR GLM design for FMRI data analyses, the stimulus impulse function design matrix as discussed above is replaced by identity matrices of the length of the expected impulse response function. Figure 2 below depicts this process. On the left, the stimulus impulse functions for two stimuli are shown. The center panel depicts the stimulus impulse functions convolved with a canonical HRF as discussed previously. Finally, the right-most panel depicts the FIR model design matrix for an expected HRF length of 16 TRs. In other words, each stimulus impulse on the left and its 15 post-stimulus entries were replaced by a (16 × 16) identity matrix. Correspondingly, if this design matrix is employed in a GLM, the first 16 parameters in the beta parameter vector 𝛽 ∈ ℝ32 will correspond to the finite impulse response coefficients for stimulus 1 and the second 16 parameters will correspond to the finite impulse response coefficients for stimulus 2.
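The construction of such an FIR design matrix from stimulus onsets can be sketched as follows (a Python/NumPy sketch; the onset times, scan count, and FIR order are illustrative assumptions, not values from the text):

```python
import numpy as np

def fir_design(onsets, n_scans, order):
    """FIR design matrix: one column per post-stimulus time bin."""
    X = np.zeros((n_scans, order))
    for t in onsets:                  # stimulus onset scan indices
        for j in range(order):        # post-stimulus bins 0 .. order-1
            if t + j < n_scans:
                X[t + j, j] = 1.0
    return X

# hypothetical example: two stimulus conditions, a 16-bin FIR basis each
X1 = fir_design([0, 40, 80], n_scans=120, order=16)
X2 = fir_design([20, 60, 100], n_scans=120, order=16)
X = np.hstack([X1, X2])               # full design matrix with 32 columns
assert X.shape == (120, 32)
```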
Figure 2 Illustration of the FIR model GLM implementation. The leftmost panel depicts the stimulus impulse functions in design matrix form, and the center panel the same functions upon convolution with a canonical double gamma HRF. The rightmost panel depicts the FIR model GLM implementation corresponding to the stimulus impulse functions on the left.
Consider sampling MR signal time-courses from the GLM formulated by the design matrix in the center of Figure 2 for true, but unknown, beta parameter values of 𝛽1 ≔ (0,0)𝑇, 𝛽2 ≔ (1,0)𝑇 and 𝛽3 ≔ (1,1)𝑇, resulting in the time-course shown in Figure 3. While the data shown in Figure 3 were sampled from a GLM with the design matrix shown in the center panel of Figure 2, one may nevertheless conceive of it as a realization of the FIR design matrix model shown in the rightmost panel of Figure 2, or, put simply, “analyse it with the FIR design matrix”. Estimating the beta parameter vector for this model,
comprising 2 ⋅ 16 = 32 entries, where the first 16 entries correspond to the finite impulse response for
stimulus 1 and the second 16 entries correspond to the finite impulse response for stimulus 2, results in the
BOLD time-courses shown in Figure 4. Note that the blue curve corresponds to the first 16 estimated beta parameters and the red curve to the second 16 estimated beta parameters of the GLM parameter vector 𝛽̂ ∈ ℝ32.
Figure 3 Simulated voxel time-series based on the canonical HRF design matrix shown in the center panel of Figure 2 for beta parameter vectors set to 𝛽1 ≔ (0,0)𝑇, 𝛽2 ≔ (1,0)𝑇 and 𝛽3 ≔ (1,1)𝑇.
Figure 4 FIR model results (= estimated beta parameters) for the data shown in Figure 3.
In summary, the FIR model formulation of the GLM allows one to estimate the hemodynamic response to stimulus conditions without the need to partition the data. It should be noted, however, that the FIR model procedure does not implicitly correct for the parameter uncertainty that results from overlapping hemodynamic response functions. In other words, it is most appropriate if the trial onsets are well separated in time (e.g. by 20 seconds) to allow the hemodynamic response to return to baseline. FIR models are not the standard way to analyse whole-brain GLM data, but can sometimes be useful to evaluate the hemodynamic response time-courses that gave rise to an interesting statistical effect.
(4) Psychophysiological interaction designs
Psychophysiological interaction (PPI) GLM designs for mass-univariate FMRI data analyses may be
viewed as the application of Analysis of Covariance (ANCOVA) designs to FMRI data acquired from a single
participant. Recall that we introduced the ANCOVA model as a combination of discrete-factorial and
continuous-parametric GLM designs. Specifically, we discussed the following additive ANCOVA design. For
𝑖 = 1,2 and 𝑗 = 1, … , 𝑛𝑖, we conceived the observation variables 𝑦𝑖𝑗 as realizations of random variables with univariate Gaussian distribution of the form
$$p(y_{ij}) = N(y_{ij}; \mu_{ij}, \sigma^2) \Leftrightarrow y_{ij} = \mu_{ij} + \varepsilon_{ij}, \quad p(\varepsilon_{ij}) = N(\varepsilon_{ij}; 0, \sigma^2), \quad \sigma^2 > 0 \qquad (1)$$
for which the dependence of the expected value for each 𝑦𝑖𝑗 was given by
𝜇𝑖𝑗 ≔ 𝜇0 + 𝛼𝑖 + 𝛽1𝑥𝑖𝑗 (2)
Here, 𝜇0 ∈ ℝ models an offset, 𝛼𝑖 the effect of the discrete factor taking on two levels (𝑖 = 1,2) and 𝛽1 ∈ ℝ
the contribution of the value 𝑥𝑖𝑗 ∈ ℝ of the continuous factor. In its GLM formulation, this model, upon its
reference cell reformulation was written as
$$y := \begin{pmatrix} y_{1,1} \\ \vdots \\ y_{1,n_1} \\ y_{2,1} \\ \vdots \\ y_{2,n_2} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 0 & x_{1,1} \\ \vdots & \vdots & \vdots \\ 1 & 0 & x_{1,n_1} \\ 1 & 1 & x_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_{2,n_2} \end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3 \text{ and } \sigma^2 > 0 \qquad (3)$$
In PPI for GLM-FMRI, the data 𝑦, offset 𝜇0, main effect of the discrete factor 𝛼2, and the independent
variable values 𝑥𝑖𝑗 take on the following meaning:
- 𝑦 ∈ ℝ𝑛 represents the MR signal time-course of a specific voxel. The same GLM design is chosen for each voxel, hence we do not index the voxel number.
- 𝜇0 ∈ ℝ models the run-specific MR signal offset.
- The discrete experimental factor corresponds to a “psychological state”. For example, 𝛼2 may model the difference in MR signal observed under Task A with respect to Task B. An example for Task A could be “attend to the colour of a visual stimulus”, an example for Task B could be “attend to the shape of a visual stimulus”. Another example could be: Task A “Imagine a tactile stimulus” and Task B “Perceive a tactile stimulus”. In other words, 𝛼2 models the MR signal difference in response to different experimental conditions.
- 𝑥𝑖𝑗 ∈ ℝ represents the MR signal time-course of a fixed and pre-determined “seed region” or “seed voxel”. Without the psychological factor, the PPI design would thus correspond to a simple linear regression of the MR signal of each and every voxel onto the MR signal of a pre-determined voxel. We have seen above that simple linear regression and correlation are closely related. In the simplest terms, a PPI design without an experimental factor would thus correspond to a correlation of the voxel MR signal time-series of all voxels (one after the other) with a single seed voxel.
Based on the specific interpretations of the components introduced above, (3) then takes the following
structural form
$$p(y_i) = N(y_i; \mu_i, \sigma^2) \Leftrightarrow y_i = \mu_i + \varepsilon_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2), \quad \sigma^2 > 0 \quad (i = 1, \dots, n) \qquad (4)$$

for which the dependence of the expected value for each 𝑦𝑖 is given by

$$\mu_i := \mu_0 + \alpha_2 + \beta_1 x_i \qquad (5)$$
and 𝑛 corresponds to the number of EPI volumes acquired per experimental run. In their traditional formulation, PPI designs model the main effect of the discrete factor using 1’s and -1’s for task blocks and 0’s for fixation blocks and do not discriminate between the independent variable values for each of the two conditions. In their GLM formulation for FMRI, additive ANCOVA designs then take the following exemplary form
$$y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n_{TR}} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 1 & x_1 \\ \vdots & \vdots & \vdots \\ 1 & 1 & x_i \\ 1 & -1 & x_{i'} \\ \vdots & \vdots & \vdots \\ 1 & -1 & x_{n_{TR}} \end{pmatrix} \in \mathbb{R}^{n \times 3}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \end{pmatrix} \in \mathbb{R}^3, \quad \sigma^2 > 0 \qquad (6)$$
So far, we considered an additive ANCOVA model only. Crucially, the PPI design corresponds to an
ANCOVA design with interaction. On the level of the GLM, the design matrix column corresponding to the
interaction parameter is formed by a nonlinear combination of the discrete factor and continuous factor
time-series. The usual way to form the interaction design matrix column is by the point-wise (Hadamard)
multiplication of the corresponding main effect columns. The same approach was initially chosen for PPI,
resulting in the following structural and GLM formulation:
$$p(y_i) = N(y_i; \mu_i, \sigma^2) \Leftrightarrow y_i = \mu_i + \varepsilon_i, \quad p(\varepsilon_i) = N(\varepsilon_i; 0, \sigma^2), \quad \sigma^2 > 0 \quad (i = 1, \dots, n) \qquad (7)$$
where
𝜇𝑖 ≔ 𝜇0 + 𝛼2 + 𝛽1𝑥𝑖 + 𝛽2(𝛼2𝛽1𝑥𝑖) (8)
Correspondingly, we have the following GLM formulation
$$y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n_{TR}} \end{pmatrix} \in \mathbb{R}^n, \quad X = \begin{pmatrix} 1 & 1 & x_1 & x_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & x_i & x_i \\ 1 & -1 & x_{i'} & -x_{i'} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & -1 & x_{n_{TR}} & -x_{n_{TR}} \end{pmatrix} \in \mathbb{R}^{n \times 4}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix} \in \mathbb{R}^4, \quad \sigma^2 > 0 \qquad (9)$$
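The formation of the four design matrix columns, including the point-wise (Hadamard) product for the interaction column, can be sketched as follows (a Python/NumPy sketch; the block structure and the toy seed signal are illustrative assumptions):

```python
import numpy as np

n = 200
rng = np.random.default_rng(2)
psy = np.where(np.arange(n) % 40 < 20, 1.0, -1.0)  # psychological factor (+1/-1 blocks)
phys = np.cumsum(rng.normal(0, 1, n)) * 0.05       # seed-voxel time course (toy signal)
ppi = psy * phys                                   # point-wise (Hadamard) product
X = np.column_stack([np.ones(n), psy, phys, ppi])  # design matrix as in (9)
assert X.shape == (n, 4)
```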
Figure 1 below shows a design matrix 𝑋 ∈ ℝ𝑛×4 of the kind introduced in equation (9). The first column in the Figure corresponds to the indicator variable for the offset, the second column models a “psychological” main effect (e.g. attention vs. no attention), the third column a “physiological” main effect corresponding to an MR signal time series, and the fourth column models the psychophysiological interaction. Based on the design matrix shown in Figure 1 and a beta parameter setting of 𝛽 ≔ (1, 0.1, 0.1, 1)𝑇, Figure 2 shows a simulated PPI analysis. Here, the seed voxel time-course was generated based on an Ornstein-Uhlenbeck process. The upper panel of Figure 2 depicts the simulated seed voxel time-course and a sampled target voxel time-course based on the corresponding PPI design. The lower
panels depict the correlation between the MR signal in the seed and target voxels as a function of the
psychological factor. Note that the correlation changes with the psychological factor, corresponding to an
interaction.
Figure 1. A design matrix for a psychophysiological interaction analysis.
Figure 2 A PPI-GLM-FMRI simulation. The upper panel depicts a simulated seed and a resulting target voxel MR time-course based on the design matrix shown in Figure 1 and a beta parameter setting of 𝛽 ≔ (1, 0.1, 0.1, 1)𝑇. The lowermost panel depicts a correlation analysis between the seed and target voxel time-courses of the middle panel separately for blocks of the psychological main effect.
Why are PPI designs for mass-univariate GLM-FMRI analyses interesting (and in fact, somewhat surprisingly, increasingly being used)? In the simplest terms, the detection of a statistically significant value for the interaction parameter 𝛽2 ∈ ℝ at a given voxel can be interpreted as a difference in the slope of the regression between its time-series and the seed-voxel time-series under a modulation of the “psychological context”, i.e. under Task A vs. Task B. If one interprets the correlation between two voxel time-series as an indicator of their coupling or “connectivity”, a significant interaction parameter indicates that regions of the brain modulate their coupling depending on the current task being carried out. This speaks to a current interest in the “dynamics of brain function”. Note, however, that a significant interaction parameter may be interpreted in two ways: a) as evidence that the contribution of one area to another is altered by the experimental context, or b) as evidence that the response of an area to an experimental context is altered due to activity variation in the seed region.
Study Questions
1. Discuss, why for a design matrix 𝑋 ∈ ℝ𝑛×𝑝 and contrast vector 𝑐 ∈ ℝ𝑝 the function
𝜉 ∶ (𝑐, 𝑋) ↦ 𝜉(𝑐, 𝑋) ≔ (𝑐𝑇(𝑋𝑇𝑋)−1𝑐)−1
has some merit as a measure of the “efficiency” of the design encoded in 𝑋 for the contrast of interest 𝑐.
2. Explain the rationale for performing a FIR analysis of FMRI data.
3. Discuss which standard GLM model PPI GLM-FMRI analyses correspond to and which specific meaning the regressors and
parameters take on in the PPI context.
Study Question Answers
1. The variability of the GLM beta estimator is proportional to (𝑋𝑇𝑋)−1, where the variance parameter 𝜎2 corresponds to the proportionality constant. Assuming constant 𝜎2, and noting that the pre- and post-multiplication with 𝑐 extracts the relevant components of (𝑋𝑇𝑋)−1 with respect to the beta estimator contrast 𝑐𝑇𝛽̂, the function above thus increases with decreasing variability of the beta estimator due to the −1 exponent. In other words, if the variability in the contrasted beta estimator value is small, the function 𝜉 yields a large value and is hence an intuitive measure of “design efficiency”.
2. A GLM-FIR model allows for the estimation of the hemodynamic response to different stimulus conditions without the need to partition the data and perform event-related averaging.
3. PPI GLM-FMRI analyses correspond to an ANCOVA GLM with interaction. The data to be modelled correspond to the MR signal time-course of a specific voxel. The same GLM design is chosen for each voxel, referred to as a mass-univariate approach. The discrete experimental factor in the ANCOVA GLM for PPI corresponds to a “psychological state”. For example, it may model the difference in MR signal observed under Task A with respect to Task B. An example for Task A could be “attend to the colour of a visual stimulus”, an example for Task B could be “attend to the shape of a visual stimulus”. The parametric/continuous regressor corresponds to the MR signal time-course of a fixed and pre-determined “seed region” or “seed voxel”. Finally, the interaction regressor and its associated parameter can be interpreted a) as capturing how the contribution of one area to another is altered by the experimental context, or b) as capturing that the response of an area to an experimental context is altered due to activity variation in the seed region.
First-level covariance matrices
(1) Serial correlations in FMRI
We have seen in a previous section that classical inference rests on the assumption of spherical error covariance matrices and that the violation of this assumption results in an increase of the false-positive risk. Likewise, we have seen that the REML approach can be used to estimate the parameters of covariance matrices, specifically if the error covariance matrix decomposes into a linear combination of known covariance matrix basis functions
$$V = \sum_{i=1}^{q} \lambda_i Q_i \qquad (1)$$

where 𝑄𝑖 ∈ ℝ𝑛×𝑛 (𝑖 = 1, … , 𝑞) denote known covariance basis functions.
In the analysis of FMRI data, non-spherical error covariance matrices are estimated routinely based
on the notion of “error serial correlations”. The fundamental idea is that FMRI time-series comprise
correlations of values adjacent in time that are not captured by the deterministic aspect of the GLM, i.e. the
expectation parameter 𝑋𝛽. Assumed physiological origins for such serial correlations are for example the
breathing cycle or the heartbeat, which are assumed to induce fluctuations in the local deoxyhemoglobin
content that are unrelated to experimental stimulation. Note that if these fluctuations were monitored, they
could enter the GLM analysis in its deterministic part, i.e. could be used to derive additional regressors.
Common covariance basis functions used in the analysis of FMRI data for 𝑞 ≔ 2 and 𝜏 ≔ 1 are

$$Q_1 := I_n \qquad (2)$$

$$(Q_2)_{ij} := \begin{cases} \exp\left(-\frac{1}{\tau}|i - j|\right), & i \neq j \\ 0, & i = j \end{cases} \quad 1 \leq i, j \leq n \qquad (3)$$
Figure 2 depicts the resulting error covariance matrices for different choices of 𝜆1, 𝜆2 and 𝜏.
Figure 2. Covariance matrices resulting from the combination of the covariance basis functions 𝑄1 and 𝑄2.
The first covariance basis function in (2) models independently and identically distributed errors. The second covariance basis function in (3) models short-range correlations based on a first-order autoregressive model of the error terms. Note that the extent of these short-term correlations is governed by a time constant 𝜏, which is usually assumed to be fixed, such that it does not form an additional parameter of the GLM framework.
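The two covariance basis functions can be constructed in a few lines (a Python/NumPy sketch; the dimension and the weights 𝜆1, 𝜆2 are illustrative):

```python
import numpy as np

def covariance_basis(n, tau=1.0):
    """Covariance basis functions Q1 (iid) and Q2 (off-diagonal exponential decay)."""
    Q1 = np.eye(n)
    i, j = np.indices((n, n))
    Q2 = np.exp(-np.abs(i - j) / tau)
    np.fill_diagonal(Q2, 0.0)        # zero diagonal, as in equation (3)
    return Q1, Q2

Q1, Q2 = covariance_basis(5, tau=1.0)
V = 1.0 * Q1 + 0.5 * Q2              # V = lambda1*Q1 + lambda2*Q2, as in (1)
assert np.allclose(V, V.T)           # a covariance candidate must be symmetric
```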
In Figure 3 we illustrate the effect of error serial correlations on observed time-series. In three simulations, the identical expectation parameter 𝑋𝛽 was used and the error covariance was created based on (1), (2) and (3). The first two panels show the sampled time-series data for independent error terms (upper panel) and the residual error (lower panel). In the same layout, the third and fourth panels depict the case of short-range error correlations with a fast time constant, and the fifth and the sixth panel depict the case of short-range error correlations with a slower decay. As the range of the error serial correlations increases, the residual error assumes a “smoother” profile.
Figure 3. The effect of error serial correlations on time-series realizations.
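The simulation logic behind such illustrations can be sketched as follows (Python/NumPy, with illustrative variance weights and time constants; this is not the exact simulation used for the figure):

```python
import numpy as np

n = 100
rng = np.random.default_rng(3)
row, col = np.indices((n, n))

def sample_errors(lam1, lam2, tau):
    """Draw one error time-series from N(0, lam1*Q1 + lam2*Q2)."""
    Q2 = np.exp(-np.abs(row - col) / tau)
    np.fill_diagonal(Q2, 0.0)
    V = lam1 * np.eye(n) + lam2 * Q2
    return rng.multivariate_normal(np.zeros(n), V)

e_iid = sample_errors(1.0, 0.0, 1.0)   # independent errors
e_cor = sample_errors(1.0, 0.9, 4.0)   # serially correlated errors, slower decay
# the correlated draw varies less from scan to scan, i.e. looks "smoother"
```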
Second-level models
(1) The “summary-statistics” approach
Most GLM-FMRI studies comprise a group of participants. This leads to at least two sources of variation in observed condition-specific MR signals: within-participant variation and between-participant
variation. Commonly, analyses that ignore between-participant variation (for example, because they refer to
data of a single participant only) are referred to as “fixed effects” or “first-level” analyses, and account for
scan-to-scan variance. Analyses that explicitly take into account between-participant (or between-session)
variation and aim to make inferences about the population that the participants were sampled from, are
referred to as “random effects” or “second-level” analyses. Usually, these models comprise both fixed and
random effects and are thus better referred to as “mixed effects” models.
A typical FMRI group analysis in the framework of the GLM uses the so-called “summary-statistics” approach, which we sketch in the following. Note that before second-level analyses are performed, the FMRI data are typically spatially warped into a common group brain space such that voxel coordinates correspond to (approximately) the same brain regions across participants. The summary-statistics approach to group inference then proceeds as follows: First, the data from individual participants are analysed separately, usually using a common GLM design that is motivated from the study’s experimental design. In this manner, for each voxel, beta parameter estimates for each of the model’s regressors are obtained for each participant. To isolate specific participant-specific beta parameter estimates or to combine them in a linear fashion, an effect of interest is specified for each subject using a contrast vector. This generates a “contrast image” comprising the contrast of beta parameter estimates across voxels for each participant. Upon contrast formation, there exists a single scalar number for each voxel and each participant which reflects the participant-specific effect for this contrast and can be viewed as a scalar, participant-specific outcome measure. Finally, the contrast images are subjected to voxel-by-voxel one-sample t-tests in order to infer which voxels display a significant average effect size when compared to the between-participant variation.
In the following we consider this approach from the perspective of a hierarchical GLM and detail what kind of assumptions this approach entails about the covariance structure of the underlying mixed-effects model. To this end, we first formulate a first-level “all-in-one” GLM and then introduce a second-level random effects model. Finally, we consider a set of assumptions in this framework that renders it equivalent to the summary-statistics approach sketched above. In all of the below, we assume that the respective variance parameters are known and focus on the estimation of the beta parameters.
(2) A hierarchical GLM
For 𝑘 = 1,… , 𝐾 subjects, let the 𝑘th first-level individual-subject GLM be denoted by
$$y_k = X_k \beta_k + \varepsilon_k \qquad (1)$$

where $y_k \in \mathbb{R}^{n_k}$, $X_k \in \mathbb{R}^{n_k \times p}$, $\beta_k \in \mathbb{R}^p$, and

$$p(\varepsilon_k) = N(\varepsilon_k; 0, \sigma_k^2 V_{n_k}) \qquad (2)$$

where $0 \in \mathbb{R}^{n_k}$, $\sigma_k^2 > 0$, $V_{n_k} \in \mathbb{R}^{n_k \times n_k}$. Then the $k$th first-level single-subject effects $\beta_k \in \mathbb{R}^p$ can be estimated using the generalized least-squares estimator

$$\hat{\beta}_k = \left(X_k^T V_{n_k}^{-1} X_k\right)^{-1} X_k^T V_{n_k}^{-1} y_k \qquad (3)$$
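The generalized least-squares estimator translates directly into code (a Python/NumPy sketch with simulated data; for illustration we use a spherical 𝑉, in which case GLS reduces to OLS):

```python
import numpy as np

def gls(X, y, V):
    """Generalized least-squares estimate: (X'V^-1 X)^-1 X'V^-1 y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
V = np.eye(n)                          # spherical case for illustration
y = X @ beta + rng.normal(0, 0.1, n)
beta_hat = gls(X, y, V)
# with V = I the GLS estimate coincides with OLS
ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_hat, ols)
```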
By concatenation, the 𝐾 models

$$y_k = X_k \beta_k + \varepsilon_k, \quad k = 1, \dots, K \qquad (4)$$

specified above may be formulated as a large, subject-separable, first-level model

$$y_s = X_s \beta_s + \varepsilon_s \qquad (5)$$

with $n = \sum_{k=1}^{K} n_k$ as follows:

$$y_s := \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n, \quad X_s := \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_K \end{pmatrix} \in \mathbb{R}^{n \times Kp}, \quad \beta_s := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} \in \mathbb{R}^{Kp} \qquad (6)$$

where the zeros in $X_s$ denote appropriately sized matrices with all zero entries and

$$\varepsilon_s := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_K \end{pmatrix} \in \mathbb{R}^n, \quad p(\varepsilon_s) = N(\varepsilon_s; 0, V_s), \quad 0 \in \mathbb{R}^n, \quad V_s := \begin{pmatrix} \sigma_1^2 V_1 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 V_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 V_K \end{pmatrix} \in \mathbb{R}^{n \times n} \qquad (7)$$
The “𝑠” subscripts on the variables involved in this “all-in-one” model are mnemonic for “subjects” or “single-units” and are meant to remind the reader that these variables correspond to concatenated variables of subjects or single-units. See Figure 1 for an example of an “all-in-one” GLM. Note that due to the block-diagonal structure of $X_s$ and $V_s$, the generalized least-squares estimator for this all-in-one model corresponds to

$$\hat{\beta}_s := \left(X_s^T V_s^{-1} X_s\right)^{-1} X_s^T V_s^{-1} y_s \in \mathbb{R}^{Kp} \qquad (8)$$
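The block-diagonal concatenation of the subject-specific design matrices can be sketched with a small helper function (Python/NumPy; the subject design matrices here are toy placeholders):

```python
import numpy as np

def block_diag(*mats):
    """Stack matrices block-diagonally, with zeros elsewhere."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

# three single-subject designs concatenated into an all-in-one model
X1, X2, X3 = np.ones((4, 2)), np.ones((4, 2)), np.ones((4, 2))
Xs = block_diag(X1, X2, X3)
assert Xs.shape == (12, 6)    # n = sum(n_k) rows, K*p columns
```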
In the “all-in-one” GLM above we have introduced the concatenated participant effects beta parameter 𝛽𝑠 ∈ ℝ𝐾𝑝. On a second level, we now model this parameter vector as the result of a linear combination of population parameters 𝛽𝑝 under additive Gaussian noise. We thus relate subject-specific effects or parameters in 𝛽𝑠 ∈ ℝ𝐾𝑝 to population parameters 𝛽𝑝 in the form

$$\beta_s = X_p \beta_p + \varepsilon_p \qquad (9)$$

where $X_p \in \mathbb{R}^{Kp \times q}$ denotes the second-level design matrix, $\beta_p \in \mathbb{R}^q$ a set of unknown population effects, and $\varepsilon_p \in \mathbb{R}^{Kp}$ denotes the population error term with distribution

$$p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p), \quad 0 \in \mathbb{R}^{Kp}, \quad \sigma_p^2 > 0, \quad V_p \in \mathbb{R}^{Kp \times Kp} \qquad (10)$$
The “𝑝” subscripts on the variables involved in the second-level model are mnemonic for “population” and are meant to remind the reader that these variables model structural aspects of the population from which the first-level units are derived. See Figure 2 for an example.
In summary, we have formulated the following hierarchical linear Gaussian model
$$\beta_s = X_p \beta_p + \varepsilon_p \qquad (11a)$$

$$y_s = X_s \beta_s + \varepsilon_s \qquad (11b)$$

where on the second level (11a) $\beta_s \in \mathbb{R}^{Kp}$, $X_p \in \mathbb{R}^{Kp \times q}$, $\beta_p \in \mathbb{R}^q$, $p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p)$, $0 \in \mathbb{R}^{Kp}$, $\sigma_p^2 > 0$, $V_p \in \mathbb{R}^{Kp \times Kp}$ and, with $n = \sum_{k=1}^{K} n_k$, on the first level (11b) $y_s \in \mathbb{R}^n$, $X_s \in \mathbb{R}^{n \times Kp}$, $p(\varepsilon_s) = N(\varepsilon_s; 0, V_s)$, $0 \in \mathbb{R}^n$, $V_s \in \mathbb{R}^{n \times n}$. Note that both $y_s$ and $\beta_s$ are random variables. They are, however, distinct: $y_s$ is an observable random variable which models the concatenated FMRI data time series over participants, while $\beta_s$ is an unobservable random variable for which only an estimate can be obtained.
Figure 1. An all-in-one GLM as conceptualization of a first-level FMRI model. The upper right panel depicts an exemplary single-session/single-subject design matrix for 𝑛𝑘 = 180 data points (scans) comprising three regressors (one constant regressor modelling the MR signal offset, and two event-related condition regressors). The upper left panel depicts a non-spherical error covariance matrix for the same data set. The lower panel depicts the block-wise concatenation of the design and covariance matrices into an “all-in-one” GLM for the concatenated data across 𝐾 = 3 participants.
(3) An equivalent beta parameter estimates model
In the summary statistics approach outline above, we first obtain subject-specific beta parameter
estimate contrasts 𝑐𝑇�̂�𝑘 (𝑘 = 1,… , 𝐾) and then model these at the second-level using a one-sample t-test
GLM. We next consider the beta parameter estimates model implied by the hierarchical GLM formulated in
the previous section and show, how we can estimate the population parameter vector 𝛽𝑝 ∈ ℝ𝑞 at the
second level based on the generalized least-squares estimate �̂�𝑠 ∈ ℝ𝐾𝑝 of the first level. To this end, we
Figure 2. An exemplary second-level design for the all-in-one model shown in the lower panels of Figure 1. Note that there are 3 ⋅ 3 = 9 regressors at the first level, corresponding to an offset regressor and two condition-specific regressors per subject. According to the second-level framework discussed in the main text, these are assumed to be the result of a set of 𝑞 basis parameters at the second level. The design matrix 𝑋𝑝 shown in the left panel implies that there are 3 population beta parameters (one offset, two condition-specific), which are mapped onto the three subject-specific parameters in the partitioning of 𝛽𝑠. The covariance matrix shown in the right panel in addition implies that there is some degree of within-participant covariation in the beta parameters, but no between-participant correlation.
have the following result: the hierarchical GLM introduced in the previous section implies a GLM for the first-level generalized least-squares estimator $\hat{\beta}_s \in \mathbb{R}^{Kp}$ of the following form

$$\hat{\beta}_s = X_p \beta_p + \varepsilon_{\hat{\beta}_s} \qquad (1)$$

where $\hat{\beta}_s \in \mathbb{R}^{Kp}$, $X_p \in \mathbb{R}^{Kp \times q}$, $\beta_p \in \mathbb{R}^q$ as previously, and the error term $\varepsilon_{\hat{\beta}_s} \in \mathbb{R}^{Kp}$ is distributed according to

$$p(\varepsilon_{\hat{\beta}_s}) = N(\varepsilon_{\hat{\beta}_s}; 0, V_{\hat{\beta}_s}), \text{ where } 0 \in \mathbb{R}^{Kp} \text{ and } V_{\hat{\beta}_s} = \sigma_p^2 V_p + \left(X_s^T V_s^{-1} X_s\right)^{-1} \in \mathbb{R}^{Kp \times Kp} \qquad (2)$$

If we know the values of $\sigma_p^2$, $V_p$ and $V_s$, and thus $V_{\hat{\beta}_s}$, we may use the generalized least-squares estimator for $\beta_p$ in (1) to estimate the population effects from the single-subject generalized least-squares estimates:

$$\hat{\beta}_p^{GLS} := \left(X_p^T V_{\hat{\beta}_s}^{-1} X_p\right)^{-1} X_p^T V_{\hat{\beta}_s}^{-1} \hat{\beta}_s \qquad (3)$$
Proof of (1)

From the second-level model for the subject-specific effects $\beta_s \in \mathbb{R}^{Kp}$ we can derive a model for the subject-specific beta estimates $\hat{\beta}_s$. To this end, we add $\hat{\beta}_s$ to both sides of (11a) and rearrange to obtain

$$\beta_s = X_p \beta_p + \varepsilon_p \Leftrightarrow \beta_s + \hat{\beta}_s = X_p \beta_p + \varepsilon_p + \hat{\beta}_s \Leftrightarrow \hat{\beta}_s = X_p \beta_p + \varepsilon_p + (\hat{\beta}_s - \beta_s) \qquad (1.1)$$

We next consider the distribution of the term

$$\tilde{\varepsilon} := \varepsilon_p + (\hat{\beta}_s - \beta_s) \in \mathbb{R}^{Kp} \qquad (1.2)$$

in detail to write the right-hand side of (1.1) in standard GLM form. For fixed $\beta_s$, both $\varepsilon_p$ and $\hat{\beta}_s$ are Gaussian random variables, thus their sum is a Gaussian random variable as well, and we derive expressions for its expectation and covariance parameters. The expectation of $\tilde{\varepsilon}$, due to the unbiasedness of the generalized least-squares estimator, is given by

$$E(\tilde{\varepsilon}) = E\left(\varepsilon_p + (\hat{\beta}_s - \beta_s)\right) = E(\varepsilon_p) + E(\hat{\beta}_s - \beta_s) = 0 + 0 = 0 \in \mathbb{R}^{Kp} \qquad (1.3)$$

Under the additional assumption of $Cov(\varepsilon_p, \hat{\beta}_s) = 0$, the covariance of $\tilde{\varepsilon}$ evaluates to

$$Cov(\tilde{\varepsilon}) = Cov\left(\varepsilon_p + (\hat{\beta}_s - \beta_s)\right) = Cov(\varepsilon_p) + Cov(\hat{\beta}_s) = \sigma_p^2 V_p + \left(X_s^T V_s^{-1} X_s\right)^{-1} \in \mathbb{R}^{Kp \times Kp} \qquad (1.4)$$

□
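The second-level generalized least-squares estimator can be sketched numerically as follows (Python/NumPy; the second-level design, the variance values, and the stand-in for the first-level precision term are illustrative assumptions):

```python
import numpy as np

K, p, q = 3, 2, 2
rng = np.random.default_rng(5)
Xp = np.vstack([np.eye(p)] * K)         # second-level design matrix (Kp x q), here q = p
Vp = np.eye(K * p)                      # assumed spherical second-level covariance
XtVX_inv = 0.1 * np.eye(K * p)          # stand-in for (Xs' Vs^-1 Xs)^-1
V_bs = 0.5 * Vp + XtVX_inv              # covariance of the estimates model, sigma_p^2 = 0.5
beta_s_hat = rng.normal(size=K * p)     # simulated first-level GLS estimates
Vi = np.linalg.inv(V_bs)
beta_p_hat = np.linalg.solve(Xp.T @ Vi @ Xp, Xp.T @ Vi @ beta_s_hat)
# with spherical covariances the GLS estimate is the across-subject mean
assert np.allclose(beta_p_hat, beta_s_hat.reshape(K, p).mean(axis=0))
```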
In the summary-statistics approach one is primarily interested in a weighted linear combination of parameter estimates on the first level of the form $c^T \hat{\beta}_k$, $k = 1, \dots, K$, where $c \in \mathbb{R}^p$. In this case the following modifications of the second-level model for $\beta_s$ and $\hat{\beta}_s$ ensue:

$\hat{\beta}_s$ takes on the form

$$\hat{\beta}_c = \begin{pmatrix} c^T \hat{\beta}_1 \\ \vdots \\ c^T \hat{\beta}_K \end{pmatrix} \in \mathbb{R}^K \qquad (1)$$
The covariance of $\hat{\beta}_c$ becomes a diagonal matrix (rather than a block-diagonal matrix)

$$Cov(\hat{\beta}_c) := \begin{pmatrix} \sigma_1^2 c^T (X_1^T X_1)^{-1} c & 0 & \cdots & 0 \\ 0 & \sigma_2^2 c^T (X_2^T X_2)^{-1} c & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 c^T (X_K^T X_K)^{-1} c \end{pmatrix} \in \mathbb{R}^{K \times K} \qquad (2)$$

Under the further assumption of homogeneous, i.e. equal, variance over subjects, the above simplifies to

$$Cov(\hat{\beta}_c) := \sigma_s^2 I_K \in \mathbb{R}^{K \times K} \qquad (3)$$
and under the assumption of a spherical covariance matrix on the second level, we have

$$\tilde{V}_p = \sigma_s^2 I_K + \sigma_p^2 I_K = (\sigma_s^2 + \sigma_p^2) I_K = \sigma_{sp}^2 I_K \qquad (4)$$

In this case, we may use the ordinary least-squares estimator to estimate the population effects

$$\hat{\beta}_p^{OLS} := \left(X_p^T X_p\right)^{-1} X_p^T \hat{\beta}_c \qquad (5)$$

based on the first-level subject-specific beta parameter estimate contrasts $c^T \hat{\beta}_1, \dots, c^T \hat{\beta}_K$.
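Under these simplifying assumptions, the second-level OLS estimator for a one-sample design is simply the across-subject mean of the contrast values, as the following Python/NumPy sketch confirms (the contrast values are simulated):

```python
import numpy as np

rng = np.random.default_rng(6)
K = 12
contrasts = rng.normal(0.5, 1.0, K)     # c' beta_hat_k per subject (toy values)
Xp = np.ones((K, 1))                    # one-sample second-level design matrix
beta_p = np.linalg.solve(Xp.T @ Xp, Xp.T @ contrasts)  # OLS population estimate
# OLS under the one-sample design is just the across-subject mean
assert np.allclose(beta_p, contrasts.mean())
```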
Study Questions

1. Verbally, discuss the notions of “fixed” and “random/mixed” effects analyses in GLM-FMRI.
2. Verbally, describe the one-sample t-test approach to second-level inference in GLM-FMRI.
3. Write down the all-in-one formulation of a first-level group GLM-FMRI model.
4. Write down the hierarchical model underlying two-level group inference in GLM-FMRI.

Study Question Answers

1. Fixed effects analyses usually refer to analyses that discount any between-subject variation, and thus are usually applicable to the data of a single participant (possibly over multiple FMRI sessions). Random/mixed-effects analyses on the other hand take into account between-subject variation for making statistical inferences that allow one to generalize to the population that the participants were sampled from.
2. The one-sample t-test approach to second-level inference in GLM-FMRI corresponds to the three step procedure of (1) fitting an individual voxel-wise GLM to the FMRI data of each participant, (2) defining an effect of interest for each participant using a contrast vector, resulting in a participant specific “contrast image” and (3) evaluating the contrast images over participants using a one-sample t-test of the null hypothesis that the population mean of the contrast of interest is zero.
3. For $k = 1, \dots, K$ first-level GLMs $y_k = X_k \beta_k + \varepsilon_k$ with $y_k \in \mathbb{R}^{n_k}$, $X_k \in \mathbb{R}^{n_k \times p}$, $\beta_k \in \mathbb{R}^p$, and $p(\varepsilon_k) = N(\varepsilon_k; 0, \sigma_k^2 V_k)$, the all-in-one formulation of the first-level model is given with $n = \sum_{k=1}^{K} n_k$ by $y_s = X_s \beta_s + \varepsilon_s$, where

$$y_s := \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n, \quad X_s := \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_K \end{pmatrix} \in \mathbb{R}^{n \times Kp}, \quad \beta_s := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} \in \mathbb{R}^{Kp}$$

and

$$\varepsilon_s := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_K \end{pmatrix} \in \mathbb{R}^n, \quad p(\varepsilon_s) = N(\varepsilon_s; 0, V_s), \quad 0 \in \mathbb{R}^n, \quad V_s := \begin{pmatrix} \sigma_1^2 V_1 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 V_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_K^2 V_K \end{pmatrix} \in \mathbb{R}^{n \times n}$$
4. The hierarchical model that is used to relate subject-specific effects or parameters in $\beta_s \in \mathbb{R}^{Kp}$ to population parameters $\beta_p$ is given by $\beta_s = X_p \beta_p + \varepsilon_p$, where $X_p \in \mathbb{R}^{Kp \times q}$ denotes the second-level design matrix, $\beta_p \in \mathbb{R}^q$ a set of unknown population effects, and $\varepsilon_p \in \mathbb{R}^{Kp}$ denotes the population error term with distribution $p(\varepsilon_p) = N(\varepsilon_p; 0, \sigma_p^2 V_p)$, $0 \in \mathbb{R}^{Kp}$, $\sigma_p^2 > 0$, $V_p \in \mathbb{R}^{Kp \times Kp}$.
The multiple testing problem
(1) An introduction to the multiple testing problem in GLM-FMRI
In this and the next Section we are concerned with the problem of classical inference in mass-
univariate (voxel-by-voxel) applications of the GLM. Intuitively, the problem may be framed as follows. Upon
classical point estimation of the beta and variance parameters of a GLM, these can be combined in order to
produce a test statistic, such as a 𝑇 or an 𝐹 value. For example, based on a contrast vector $c \in \mathbb{R}^p$, the 𝑇 statistic for a test of the null hypothesis $H_0: c^T\beta = 0$ is given by

$$T := \frac{c^T \hat{\beta}}{\sqrt{\hat{\sigma}^2 c^T (X^T X)^{-1} c}} \qquad (1)$$
An observed value 𝑇∗ of 𝑇 at a given voxel may then be compared to the distribution of the random variable
𝑇 under the null hypothesis, which is given by the 𝑡-distribution. If the probability of the observed value 𝑇∗ is
low under the null hypothesis, for example 𝑝(𝑇 > 𝑇∗) < 0.05, this is taken as evidence against the null
hypothesis, which may then be rejected. In terms of GLM-FMRI for a given voxel, if the test statistic exceeds
a given threshold, and thus its associated probability under the null hypothesis falls below a given
probability, the corresponding voxel may be labelled “activated” for the experimental effect expressed by
𝑐𝑇𝛽. Likewise, if the test statistic falls below a given threshold, and thus its associated probability under the
null hypothesis exceeds a given probability, the corresponding voxel may be labelled “not activated”. It is
this categorical decision of a voxel being “activated” or “not-activated” based on a given statistical threshold
that classical inference for GLM-FMRI is concerned with. Note that statistical thresholds in terms of “critical
values” of a test statistic are, based on the properties of the null distribution, always associated with
probabilities for this value or a more extreme value to occur, i.e. their corresponding significance levels.
More formally, we may restate the above as follows: when testing a single null hypothesis 𝐻0, the
probability of a Type I error, i.e. rejecting the null hypothesis when it is true, is usually controlled at some
designated “significance level” 𝛼 ∈ [0,1]. This can be achieved by choosing a critical value 𝑐𝛼, such that
𝑝(𝐻0=0)(|𝑇| > 𝑐𝛼) ≤ 𝛼 (2)
We here consider “𝑇” in a generic sense, i.e. 𝑇 may refer to either the 𝑇- or the 𝐹-, or some other suitably chosen test statistic. Note that we indicate by the subscript (𝐻0 = 0) that the probability statement above
refers to the case that the null hypothesis is true and refrain from conditioning on 𝐻0 = 0 as often seen,
because 𝐻0 = 0 is not a random event in the context considered here. Also note that we consider the
absolute value |𝑇| of 𝑇 to allow for the evaluation of two-sided tests. Based on the above, 𝐻0 is rejected, if
|𝑇| > 𝑐𝛼. Notably, in the neuroimaging literature 𝑐𝛼 is often referred to as a “threshold” and denoted by
“𝑢”.
We now consider testing not only a single, but multiple null hypotheses, which we refer to as “multiple testing”. Above we described that if one defines a significance level of 𝛼 (e.g. 𝛼 = 0.05), one will, if the null hypothesis is true, declare voxels “activated” with a Type I error probability of 0.05. In other words, the probability to declare a voxel “not activated”, given that for this voxel the null hypothesis is true, corresponds to 1 − 0.05 = 0.95. Imagine now a set of 5 voxels, for each of which the null hypothesis is true, and which, in some meaningful sense, can be regarded as stochastically independent entities. In this case, based on a significance level of 𝛼 = 0.05, the probability to declare the first voxel “not activated” is 0.95.
Likewise, the probability to declare the second voxel “not activated” is 0.95. However, the probability to
declare both voxels “not activated”, assuming independence, is only 0.95 ⋅ 0.95 = 0.9025. The probability
of making at least one Type I error, i.e. declaring a voxel “activated”, although its null hypothesis is true, thus
increased from 1 − 0.95 = 0.05 to 1 − 0.9025 = 0.0975 over the two voxels. Consider now all five voxels.
The probability to declare all voxels “not activated”, assuming for all that the null hypothesis is true, and
that they are independent entities, is given by 0.95 ⋅ 0.95 ⋅ 0.95 ⋅ 0.95 ⋅ 0.95 = 0.95⁵ ≈ 0.7738. The probability
of making at least one Type I error over the set of five voxels thus corresponds to 1 − 0.7738 = 0.2262.
Given that a typical FMRI data set may comprise around 40,000 voxels, the probability of at least one
Type I error, based on an individual voxel significance-level of 𝛼 = 0.05, approaches certainty (see Figure
1).
Figure 1. The probability of at least one Type 1 error as function of the number of tests for two different significance levels 𝛼.
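The curves of Figure 1 follow directly from the independence calculation just described: for 𝑚 independent tests, each at level 𝛼, the probability of at least one Type I error is 1 − (1 − 𝛼)^𝑚. A minimal illustrative sketch in Python (not part of the text's own examples):

```python
# Probability of at least one Type I error across m independent tests,
# each performed at significance level alpha
def fwer_independent(m, alpha):
    return 1.0 - (1.0 - alpha) ** m

p2 = fwer_independent(2, 0.05)        # 0.0975, as in the two-voxel example
p5 = fwer_independent(5, 0.05)        # ~0.2262, as in the five-voxel example
p40k = fwer_independent(40000, 0.05)  # practically 1 for a whole-brain analysis
```

Evaluating the function over a grid of 𝑚 values at 𝛼 = 0.05 and 𝛼 = 0.01 reproduces the two curves of Figure 1.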
A different view of the problem is provided by the following simulation. Consider the case of second-
level GLM-FMRI analysis. Assume that from the first level the values of 𝑐𝑇�̂�𝑘𝑣 for voxels 𝑣 = 1, …, 𝑉 and
participants 𝑘 = 1, …, 𝐾, where 𝑉 = 1000 and 𝐾 = 12, have been taken to the second level, and that a one-
sample T-test of the null hypothesis 𝐻0: 𝑐𝑇𝛽𝑣 = 0 is performed for each voxel. Assume further that, for all 12
participants, the null hypothesis is in fact true for every voxel, and that the between-subject variance is
identical across voxels and given by 𝜎2 = 1. Assuming stochastic independence over voxels, the
variables 𝑐𝑇�̂�𝑘𝑣 in this case correspond to 12000 independent and identically distributed univariate Gaussian variables
𝑥𝑣𝑘 whose probability density functions are given by

𝑝(𝑥𝑣𝑘) = 𝑁(𝑥𝑣𝑘; 0,1) (3)

where 𝑣 = 1, …, 1000 and 𝑘 = 1, …, 12. As numerically evaluated in Figure 2, performing one-sample T-tests
and rejecting the null hypothesis at a significance-level of 𝛼 = 0.05 results in approximately 1000 ⋅ 0.05 =
50 erroneous rejections, with associated T- and p-values distributed according to the histograms shown in
the lower two panels.
Figure 2. Simulated Type I errors numbers.
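The simulation underlying Figure 2 can be reproduced along the following lines (an illustrative Python sketch assuming NumPy; the critical value 2.201 is the two-sided 𝛼 = 0.05 cutoff of the 𝑡-distribution with 𝐾 − 1 = 11 degrees of freedom):

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, alpha = 1000, 12, 0.05          # voxels, participants, significance level
x = rng.standard_normal((V, K))       # all voxel null hypotheses are true

# One-sample T statistics, one per voxel: sqrt(K) * sample mean / sample std
t = np.sqrt(K) * x.mean(axis=1) / x.std(axis=1, ddof=1)

c_alpha = 2.201                       # two-sided critical value of t(11) at alpha = 0.05
n_rejected = int(np.sum(np.abs(t) > c_alpha))
# n_rejected fluctuates around V * alpha = 50 over repeated realizations
```

Histograms of `t` and of the associated 𝑝-values correspond to the lower two panels of Figure 2.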
We have seen above that in the case of testing multiple null hypotheses simultaneously, controlling
the Type I error rate for single tests at conventional thresholds can lead to a large number of false positive
results, if the null hypothesis is in fact true for all tests considered. Together, these issues may informally be
referred to as the “multiple testing problem” (we prefer “multiple testing” over “multiple comparisons”,
because the latter appears to be more inspired by the notion of evaluating many contrasts in a given GLM
scenario, rather than evaluating the same contrast over many repeats, i.e. voxels). To avoid concluding that
there are important effects from high test statistics and associated low p-values in the case of multiple testing for
subsets of tests in large “families” of tests, the significance-level 𝛼 may be lowered, such that it becomes
increasingly “difficult” for a single test statistic to exceed the associated critical value 𝑐𝛼. This is the basic
tenet of all multiple testing procedures. Note, however, that there exists no common definition for the notion
of a family of tests - in neuroimaging, the “family of tests” of interest usually refers to the collection of tests
over voxels, and not, for example, to the assessment of multiple contrasts, or to the number of significance tests a given
researcher has performed throughout her career.
(2) Type I error rates
In the case of a single statistical test, there exists only a single “Type I error rate”, namely the
probability of rejecting the null hypothesis when it is in fact true. As soon as multiple testing scenarios are
considered, it is not immediately clear, what the term “Type I error rate” refers to, and a careful
differentiation is required. In this Section we introduce a number of commonly employed error rates in the
GLM-FMRI literature. Before doing so, we formalize the multiple testing problem.
Consider the problem of simultaneously testing 𝑚 ∈ ℕ null hypotheses 𝐻0(𝑖) (𝑖 ∈ ℕ𝑚). For 𝑖 ∈ ℕ𝑚, let
𝐻0(𝑖) = 0 denote the case that the null hypothesis 𝐻0(𝑖) is true, and 𝐻0(𝑖) = 1 denote the case that the null
hypothesis 𝐻0(𝑖) is not true (false). The number 𝑚 ∈ ℕ of null hypotheses is assumed to be known, while
the sets

𝑀0 ≔ {𝑖 ∈ ℕ𝑚 | 𝐻0(𝑖) = 0} and 𝑀1 ≔ {𝑖 ∈ ℕ𝑚 | 𝐻0(𝑖) = 1} (1)

i.e. the (sub)sets of elements 𝑖 of ℕ𝑚 for which the null hypothesis is true (𝑀0) or not true (𝑀1), are assumed
to be unknown. For simplicity, we define

𝑀 ≔ 𝑀0 ∪ 𝑀1 = ℕ𝑚 (2)

and denote the numbers of elements of 𝑀0 and 𝑀1 (i.e. their cardinalities) by 𝑚0 ≔ |𝑀0| and 𝑚1 ≔ |𝑀1|,
respectively, where 𝑚0, 𝑚1 ∈ ℕ𝑚0 ≔ {0, 1, …, 𝑚}. Note that by choosing the set ℕ𝑚0 for both 𝑚0 and 𝑚1, we allow for the
case that either set is the empty set, i.e. 𝑀0 = ∅ or 𝑀1 = ∅. Further note that 𝑚0 and 𝑚1 correspond to
true, but unknown, non-random quantities.
Based on observed test statistics 𝑇𝑖 (𝑖 ∈ 𝑀) (which are random variables, as they are derived from
the random variable “data”) and a fixed critical value 𝑐𝛼, each null hypothesis 𝐻0(𝑖) can either be not rejected
or rejected. Note that, as in the case of the single test, we here consider the 𝑇𝑖 in a generic sense, i.e. the 𝑇𝑖
refer to either 𝑇- or 𝐹-, or some other suitably chosen test statistics (but all of the same kind). Let 𝑊 ∈ ℕ𝑚0 denote
the number of not rejected null hypotheses and 𝑅 ∈ ℕ𝑚0 the number of rejected null hypotheses, where
𝑊 + 𝑅 = 𝑚. Due to the random nature of the 𝑇𝑖 (𝑖 ∈ 𝑀), 𝑊 and 𝑅 are observed random entities.
From these definitions, four unobservable random variables follow:
(1) the number 𝑈 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 0 and 𝐻0(𝑖) is not rejected
(2) the number 𝑉 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 0 and 𝐻0(𝑖) is rejected
(3) the number 𝑇 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 1 and 𝐻0(𝑖) is not rejected
(4) the number 𝑆 ∈ ℕ𝑚0 of null hypotheses 𝐻0(𝑖) for which 𝐻0(𝑖) = 1 and 𝐻0(𝑖) is rejected
For an overview, consider Table 1.
                NHs not rejected for 𝑐𝛼    NHs rejected for 𝑐𝛼    Total
True NHs        𝑈                          𝑉                      𝑚0
Non-true NHs    𝑇                          𝑆                      𝑚1
Total           𝑊                          𝑅                      𝑚
Table 1. Numbers of importance in the multiple testing scenario.
Note that
𝑈 + 𝑉 = 𝑚0, 𝑇 + 𝑆 = 𝑚1, 𝑈 + 𝑇 = 𝑊, 𝑉 + 𝑆 = 𝑅 and 𝑈 + 𝑉 + 𝑇 + 𝑆 = 𝑚 (3)
Note that the rejection or non-rejection numbers refer to a fixed, but arbitrary, critical value 𝑐𝛼. In the table,
all variables refer to “numbers of”, and NH abbreviates “null hypothesis”. Note again that 𝑚 is a fixed and
known number (the total number of performed tests, e.g. voxels in the neuroimaging context), that 𝑚0 and 𝑚1
are true, but unknown, fixed numbers (in the neuroimaging context referring to the numbers of truly, but
unknowably, non-activated and activated voxels, respectively), that 𝑊 and 𝑅 are observed random variables, whose
outcomes depend on 𝑚0 and 𝑚1, the underlying samples, and the chosen and fixed critical value 𝑐𝛼, and finally, that
𝑈, 𝑉, 𝑇 and 𝑆 are unobserved random variables.
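The quantities of Table 1 can be made concrete by simulation. In the following illustrative Python sketch (assuming NumPy; the effect size and the split into 𝑚0 = 800 true and 𝑚1 = 200 non-true null hypotheses are arbitrary hypothetical choices), the counts 𝑈, 𝑉, 𝑇, 𝑆 are computed from the simulated ground truth, while only 𝑊 and 𝑅 would be observable in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, m1 = 800, 200               # numbers of true and non-true null hypotheses
c_alpha = 1.96                  # two-sided critical value at alpha = 0.05

# Test statistics: standard normal for true nulls, mean-shifted for non-true nulls
stats = np.concatenate([rng.standard_normal(m0), rng.standard_normal(m1) + 3.0])
null_true = np.concatenate([np.ones(m0, bool), np.zeros(m1, bool)])
rejected = np.abs(stats) > c_alpha

U = int(np.sum(null_true & ~rejected))       # true NHs not rejected
V = int(np.sum(null_true & rejected))        # true NHs rejected (Type I errors)
T_cnt = int(np.sum(~null_true & ~rejected))  # non-true NHs not rejected
S = int(np.sum(~null_true & rejected))       # non-true NHs rejected
W, R = U + T_cnt, V + S                      # observed totals of Table 1
```

The identities 𝑈 + 𝑉 = 𝑚0, 𝑇 + 𝑆 = 𝑚1 and 𝑊 + 𝑅 = 𝑚 of equation (3) hold by construction in every realization.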
As noted above, for a single null hypothesis 𝐻0, the probability of a Type I error, i.e. rejecting the null
hypothesis when it is true, is usually controlled at some designated significance-level 𝛼 ∈ [0,1]. This can be
achieved by choosing a critical value 𝑐𝛼 such that 𝑝(𝐻0=0)(|𝑇| > 𝑐𝛼) ≤ 𝛼 and rejecting 𝐻0 if |𝑇| > 𝑐𝛼. In
the multiple testing situation, a variety of generalizations of Type I error rates are possible. We next
introduce a selection of Type I error rates based on the layout of Table 1.
Per-comparison error rate. The per-comparison error rate 𝑃𝐶𝐸𝑅 is defined as the expected value of the
number of Type I errors per total number of hypotheses, i.e.

𝑃𝐶𝐸𝑅 = 𝐸(𝑉)/𝑚 (4)
Per-family error rate. The per-family error rate 𝑃𝐹𝐸𝑅 is not a “rate” but corresponds to the expected
number of Type I errors
𝑃𝐹𝐸𝑅 = 𝐸(𝑉) (5)
Family-wise error rate. The family-wise (sometimes also referred to as “experiment-wise”) error rate 𝐹𝑊𝐸𝑅
is defined as the probability of at least one Type I error over the family of 𝑚 hypotheses
𝐹𝑊𝐸𝑅 = 𝐸(1{𝑉>0}) = 𝑃(𝑉 > 0) (6)
where 1{𝑉>0} denotes the indicator function, i.e. the random variable 1{𝑉>0} that takes on the value 1 if
𝑉 > 0 and the value 0 otherwise².
False-discovery rate. The most straightforward way to define the false-discovery rate 𝐹𝐷𝑅 would be as the
expectation of the ratio between the number 𝑉 of rejections of the null hypothesis when it is true and the
total number 𝑅 of rejections,

𝐹𝐷𝑅 = 𝐸(𝑉/𝑅) (7)

However, as 𝑅 is a random variable, and it is in principle possible that 𝑅 = 0 (no rejections), this ratio may
not be defined. Defining

𝑉/𝑅 ≔ 0 if 𝑅 = 0 (8)

results in the FDR definition of Benjamini & Hochberg (1995):

𝐹𝐷𝑅 = 𝐸((𝑉/𝑅)1{𝑅>0}) = 𝐸(𝑉/𝑅 | 𝑅 > 0) ⋅ 𝑃(𝑅 > 0) (9)
²Unfortunately, appreciation of the equality 𝐸(1{𝑉>0}) = 𝑃(𝑉 > 0) is conditional on familiarity with measure-theoretic
approaches to probability theory. For those readers who fulfil this criterion, we recall that for a probability space (Ω, 𝒜, 𝑃) and a set 𝐴 ∈ 𝒜 (i.e. an event), the indicator function is defined as

1𝐴 : Ω → {0,1}, 𝜔 ↦ 1𝐴(𝜔) ≔ 1 if 𝜔 ∈ 𝐴, and 0 if 𝜔 ∉ 𝐴

Informally, the random variable 1𝐴 takes on the value 1 in the case that the event 𝐴 occurred, and the value 0 in the case that the event 𝐴 did not occur. Notably, the expected value of the random variable 1𝐴 corresponds to the probability of the event 𝐴, because, using the Lebesgue integral, we have

𝐸(1𝐴) = ∫Ω 1𝐴(𝜔) 𝑑𝑃(𝜔) = ∫𝐴 1 𝑑𝑃(𝜔) = 𝑃(𝐴)
Positive false discovery rate. If at least one null hypothesis 𝐻0(𝑖) is rejected, we have 𝑅 > 0. In this case, the
conditional expectation of the proportion of Type I errors among the rejected hypotheses, given that at least
one hypothesis is rejected, is referred to as the positive false discovery rate 𝑝𝐹𝐷𝑅

𝑝𝐹𝐷𝑅 = 𝐸(𝑉/𝑅 | 𝑅 > 0) (10)
We next consider some basic relationships between the Type I error rates defined above, which
follow from their definitions and the formulation of the problem as summarized in Table 1. Firstly, note that,
by the relationships in (3),

0 ≤ 𝑉 ≤ 𝑅 ≤ 𝑚 and 𝑅 = 0 ⇒ 𝑉 = 0 (11)

We thus have

𝑉/𝑚 ≤ (𝑉/𝑅)1{𝑅>0} ≤ 1{𝑉>0} ≤ 𝑉 (12)

Taking expectations of the above results in

𝐸(𝑉/𝑚) = 𝐸(𝑉)/𝑚 ≤ 𝐸((𝑉/𝑅)1{𝑅>0}) ≤ 𝐸(1{𝑉>0}) ≤ 𝐸(𝑉) ⇔ 𝑃𝐶𝐸𝑅 ≤ 𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅 ≤ 𝑃𝐹𝐸𝑅 (13)
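The chain of inequalities (13) can be illustrated by Monte Carlo simulation (an illustrative Python sketch assuming NumPy; the scenario with 𝑚0 = 18 true and 𝑚1 = 2 non-true null hypotheses, and the effect size, are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m0, m1 = 18, 2                      # true and non-true null hypotheses
m, c_alpha, n_sim = m0 + m1, 1.96, 20000

V_sum, fwe_cnt, fdp_sum = 0.0, 0.0, 0.0
for _ in range(n_sim):
    T = np.concatenate([rng.standard_normal(m0), rng.standard_normal(m1) + 3.0])
    rej = np.abs(T) > c_alpha
    V = int(np.sum(rej[:m0]))       # Type I errors (the first m0 nulls are true)
    R = int(np.sum(rej))            # total number of rejections
    V_sum += V
    fwe_cnt += (V > 0)
    fdp_sum += V / R if R > 0 else 0.0

PCER, PFER = V_sum / (n_sim * m), V_sum / n_sim
FWER, FDR = fwe_cnt / n_sim, fdp_sum / n_sim
# The chain PCER <= FDR <= FWER <= PFER holds for the estimates as well,
# because the underlying pointwise inequality (12) holds in every realization
```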
In the neuroimaging literature, the most popular Type I error rates are arguably the family-wise error
rate 𝐹𝑊𝐸𝑅 and the false discovery rate 𝐹𝐷𝑅. We thus consider their relationship in more detail. From (13),
the false discovery rate is less than, or equal to, the family-wise error rate. Equality holds in the special case that
all null hypotheses 𝐻0(𝑖) are true, i.e. 𝐻0(𝑖) = 0 for all 𝑖 ∈ 𝑀. In other words: for 𝑚 = 𝑚0, the inequality
𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅 reduces to the equality 𝐹𝐷𝑅 = 𝐹𝑊𝐸𝑅. This may be seen from the relationships in Table 1 as
follows: From 𝑚0 = 𝑚, it follows that 𝑚1 = 0. 𝑚1 = 0 implies that 𝑇 + 𝑆 is zero, and as 𝑇, 𝑆 ≥ 0, this
implies that both 𝑇 = 0 and 𝑆 = 0. From 𝑆 = 0 it follows that 𝑅 = 𝑉. There are now two possible scenarios:
the number of rejected null hypotheses 𝑅 is zero, or it is not zero (and all rejections, if any, are necessarily
wrong). In the first scenario, 𝑅 = 𝑉 = 0, such that

(𝑉/𝑅)1{𝑅>0} = 0 = 1{𝑉>0} (14)

In the second scenario, 𝑅 = 𝑉 > 0, such that

(𝑉/𝑅)1{𝑅>0} = 1 ⋅ 1 = 1 = 1{𝑉>0} (15)

In both scenarios, the random variables (𝑉/𝑅)1{𝑅>0} and 1{𝑉>0} thus take on identical values, and hence

𝐹𝐷𝑅 = 𝐸((𝑉/𝑅)1{𝑅>0}) = 𝐸(1{𝑉>0}) = 𝐹𝑊𝐸𝑅 (16)

In other words: under the assumption that all null hypotheses are true, 𝐹𝐷𝑅 and 𝐹𝑊𝐸𝑅 are equivalent.
(3) Exact, weak, and strong control of family-wise error rates
In this section, we define what we understand by the exact, weak, or strong control of a family-wise
error rate. This is important in the context of mass-univariate GLM-FMRI because, intuitively, it allows one to
associate the ability to localize an observed effect with different forms of control of family-wise error rates.
To this end, we first introduce the notion of a liberal, conservative, or exact statistical test.
Liberal, exact, and conservative statistical tests
Given a null hypothesis 𝐻0 and a test statistic 𝑇 ∈ 𝒯, where 𝒯 denotes the set of values that the test
statistic can take on, a statistical test is said to be liberal, conservative, or exact, if, for any given significance
level 𝛼 ∈ [0,1] and corresponding rejection region 𝑅𝛼 ⊂ 𝒯 the probability that 𝑇 belongs to the rejection
region 𝑅𝛼, denoted by
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) (1)
is greater than, less than, or equal to 𝛼, respectively. Appropriate control of the Type I error rate requires an
exact or conservative test. In other words, for a liberal test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) > 𝛼 (2)
For an exact test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) = 𝛼 (3)
and for a conservative test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) < 𝛼 (4)
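These definitions can be checked empirically. In the following illustrative Python sketch (assuming NumPy), a two-sided z-test with cutoff 1.96 is approximately exact at 𝛼 = 0.05, while an inflated cutoff (here 2.5, a hypothetical choice) yields a conservative test:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.standard_normal(200_000)   # test statistic under H0 = 0

# Exact test: the rejection region {t : |t| > 1.96} has probability ~alpha = 0.05
rejection_rate = float(np.mean(np.abs(T) > 1.96))

# Conservative test: an inflated cutoff rejects with probability below alpha
conservative_rate = float(np.mean(np.abs(T) > 2.5))
```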
Weak and strong control of family-wise error rates
In order to handle the multiple testing problem appropriately, the rejection criteria, i.e. the rejection
regions 𝑅𝛼 of the given tests, have to be chosen so that the probability of rejecting one or more of the null
hypotheses when the rejected null hypotheses are actually true is sufficiently small. Let the search volume

Ω ≔ {𝑣1, …, 𝑣𝐾} (5)

consist of 𝐾 ∈ ℕ voxels 𝑣1, …, 𝑣𝐾 and let 𝐻01, …, 𝐻0𝐾 be the null hypotheses for each voxel. The omnibus null
hypothesis 𝐻Ω is the logical conjunction of the events (𝐻01 = 0), …, (𝐻0𝐾 = 0), that is

𝐻Ω ≔ (𝐻01 = 0) ∩ (𝐻02 = 0) ∩ … ∩ (𝐻0𝐾 = 0) (6)

In words: 𝐻Ω corresponds to the case that 𝐻01 and 𝐻02 and … and 𝐻0𝐾 are true. To test each of the 𝐻01, …, 𝐻0𝐾
we use a set or “family” of tests 𝑇1, …, 𝑇𝐾. For all 𝑗 ∈ {1, …, 𝐾}, let 𝐸𝑗 be the event that the test 𝑇𝑗
incorrectly rejects 𝐻0𝑗, that is,

𝐸𝑗 ≔ {𝑇𝑗 ∈ 𝑅𝛼𝑗 and 𝐻0𝑗 = 0} (7)

where 𝑅𝛼𝑗 denotes the corresponding rejection region and 𝛼𝑗 the corresponding significance level. Suppose each test is exact or possibly conservative, i.e.

𝑝𝐻Ω(𝐸𝑗) ≤ 𝛼𝑗 (8)
In the context of the family {𝑇𝑗}𝑗=1,…,𝐾 of tests, the family-wise error rate 𝐹𝑊𝐸𝑅 is defined as the probability
of falsely rejecting any of the null hypotheses {𝐻0𝑗}𝑗=1,…,𝐾. Let 𝐸Ω denote the event that the omnibus hypothesis
is rejected, that is,

𝐸Ω ≔ 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝐾 = ∪𝑗=1𝐾 𝐸𝑗 (9)

Note that this union of events is to be understood as the event that at least one 𝐸𝑗 of the 𝐾 events has
occurred, or, in other words, that 𝐸1 and/or 𝐸2 and/or … and/or 𝐸𝐾 has occurred. If even a single null
hypothesis 𝐻0𝑗 is rejected, the omnibus hypothesis is rejected.
Weak control of the 𝐹𝑊𝐸𝑅 requires that the probability of falsely rejecting the omnibus null
hypothesis 𝐻Ω is, at most, the test level 𝛼, i.e. that

𝑝𝐻Ω(𝐸Ω) ≤ 𝛼 (10)

Evidence against the omnibus null hypothesis 𝐻Ω indicates that at least one of the null hypotheses
𝐻0𝑗, 𝑗 = 1, …, 𝐾, is false, or, in brain imaging terms, that there is “some activation somewhere”. Informally, this
implies that the test has no “localizing power”, meaning that the Type I error rate for individual voxels is not
controlled. Tests that have only weak control over the 𝐹𝑊𝐸𝑅 are called “omnibus tests” and are useful for
detecting whether there is any experimentally induced effect at all, regardless of location. If, on the other
hand, there is interest in not only detecting an experimentally induced signal but also reliably locating the
effect, a test procedure with “strong control” over the 𝐹𝑊𝐸𝑅 is required.
Strong control over the 𝐹𝑊𝐸𝑅 requires that the 𝐹𝑊𝐸𝑅 be controlled not just under 𝐻Ω, but also
under any subset of hypotheses. Specifically, for any subset of voxels 𝐵 ⊆ Ω and corresponding omnibus
hypothesis

𝐻𝐵 ≔ ∩𝑗∈𝐵 (𝐻0𝑗 = 0) (11)

the probability of the event of rejecting the omnibus hypothesis 𝐻𝐵,

𝐸𝐵 ≔ ∪𝑗∈𝐵 𝐸𝑗 (12)

is smaller than or equal to 𝛼, i.e.

𝑝𝐻𝐵(𝐸𝐵) ≤ 𝛼 (13)
Note again, that this inequality is required to hold for all possible choices of the subset 𝐵 from the original
set Ω. In other words, all possible subsets of hypotheses are tested with weak control over the 𝐹𝑊𝐸𝑅. This
ensures that the test is valid (i.e. exact or conservative) at every voxel and, from a neuroimaging perspective,
that the validity of the test in any given region is not affected by the truth of the null hypothesis elsewhere.
Thus, a test procedure with strong control over the 𝐹𝑊𝐸𝑅 can be said to have “localizing power”.
(4) The Bonferroni procedure and its “conservativeness” in GLM-FMRI
Broadly speaking, there are two classes of multiple testing procedures commonly used in the
literature: single-step and stepwise procedures. In single-step procedures, identical adjustments are made
for the tests of all hypotheses, regardless of the observed test statistics or raw 𝑝-values. In stepwise
procedures, the rejection of particular null hypotheses is based not only on the total number of hypotheses,
but also on the outcomes of the tests of the other hypotheses. One may further distinguish step-down
procedures, which order the raw 𝑝-values or associated test statistics starting with the most extreme under
the null hypothesis, from step-up procedures, which use the reverse strategy. We here consider the Bonferroni
procedure as an example of a single-step approach controlling the 𝐹𝑊𝐸𝑅 in the strong sense.
The Bonferroni procedure
The Bonferroni procedure may now be formulated as follows. Given a family of null hypotheses
𝐻0(𝑖), 𝑖 = 1, …, 𝑚, and an interest in a family-wise Type I error rate of less than or equal to a significance-level
𝛼 ∈ [0,1], each individual hypothesis 𝐻0(𝑖) is tested at a reduced significance-level 𝛼𝑖 ∈ [0,1], such that
∑𝑖=1𝑚 𝛼𝑖 = 𝛼. To this end, one sets the adjusted 𝛼-values 𝛼𝑖 to

𝛼𝑖 ≔ 𝛼/𝑚 for 𝑖 = 1, …, 𝑚 (1)

and rejects the null hypothesis 𝐻0(𝑖) for 𝑝𝑖 < 𝛼𝑖, where 𝑝𝑖 denotes the raw 𝑝-value obtained for the test of
hypothesis 𝐻0(𝑖). One can show that the Bonferroni procedure controls the 𝐹𝑊𝐸𝑅 in the strong sense
by capitalizing on Boole’s inequality

𝑃(∪𝑖=1𝑛 𝐴𝑖) ≤ ∑𝑖=1𝑛 𝑃(𝐴𝑖) (2)
as follows. Given a set of 𝑚 null hypotheses 𝐻0(1), 𝐻0(2), …, 𝐻0(𝑚) and associated observed raw 𝑝-values
𝑝1, 𝑝2, …, 𝑝𝑚, we have

𝐹𝑊𝐸𝑅 = 𝑃(𝑉 > 0) ≤ 𝑃(∪𝑖=1𝑚0 {𝑝𝑖 ≤ 𝛼/𝑚}) ≤ ∑𝑖=1𝑚0 𝑃({𝑝𝑖 ≤ 𝛼/𝑚}) ≤ 𝑚0 ⋅ 𝛼/𝑚 ≤ 𝑚 ⋅ 𝛼/𝑚 = 𝛼 (3)
Less formally, consider the event 𝑝𝑖 ≤ 𝛼/𝑚, i.e. that the 𝑖th observed raw 𝑝-value is smaller than or equal to the
adjusted significance level 𝛼/𝑚. The probability that this holds for at least one of the (arbitrarily chosen) 𝑚0 ≤ 𝑚 true null
hypotheses is, by Boole’s inequality, less than or equal to the sum of the probabilities of each of these events,
each of which is at most 𝛼/𝑚. Since the sum runs over all tested null hypotheses which are true, the sum is less than or
equal to 𝑚0 ⋅ 𝛼/𝑚, and as 𝑚0 ≤ 𝑚, this value is less than or equal to 𝛼. Because the above holds true for any choice
of 𝑚0, strong control of the 𝐹𝑊𝐸𝑅 follows.
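Because raw 𝑝-values of true null hypotheses are uniformly distributed on [0,1] for exact tests, the strong 𝐹𝑊𝐸𝑅 control of the Bonferroni procedure can be checked directly by simulation (an illustrative Python sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
m, alpha, n_sim = 100, 0.05, 10000

# Raw p-values of m tests whose null hypotheses are all true: uniform on [0, 1]
p = rng.uniform(size=(n_sim, m))

# FWER without correction vs. with the Bonferroni-adjusted level alpha / m
uncorrected_fwer = float(np.mean((p < alpha).any(axis=1)))
bonferroni_fwer = float(np.mean((p < alpha / m).any(axis=1)))
# uncorrected_fwer is near 1 - 0.95**100; bonferroni_fwer stays at or below ~alpha
```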
It is often stated that the Bonferroni procedure is “too conservative” for controlling Type I error
rates in mass-univariate GLM-FMRI. We next consider how this statement may be understood. Recall that
one way to achieve strong 𝐹𝑊𝐸𝑅 control is to adjust the level of significance at which the different
hypotheses 𝐻01, …, 𝐻0𝐾 are tested. As seen above, the single-step Bonferroni procedure is an illustrative
example of such a strategy. Suppose that the set of hypotheses {𝐻0𝑗}𝑗=1,…,𝐾 is tested at an equal level
𝑏 ∈ [0,1], such that

𝑝𝐻Ω(𝐸𝑗) ≤ 𝑏 (4)
In general,

𝑝𝐻Ω(𝐸Ω) = 𝑝𝐻Ω(∪𝑗=1𝐾 𝐸𝑗) ≤ 𝑝𝐻Ω(𝐸1) + ⋯ + 𝑝𝐻Ω(𝐸𝐾) ≤ 𝐾𝑏 (5)

If 𝑏 is chosen such that 𝐾𝑏 = 𝛼, i.e. 𝑏 = 𝛼/𝐾, it follows that

𝑝𝐻Ω(𝐸Ω) ≤ 𝐾𝑏 = 𝐾 ⋅ 𝛼/𝐾 = 𝛼 (6)
In the case that
𝑝𝐻Ω(𝐸Ω) < 𝑝𝐻Ω(𝐸1) + ⋯+ 𝑝𝐻Ω(𝐸𝐾) (7)
the Bonferroni correction will thus correspond to a conservative test. To see when such a situation might
occur, consider the case 𝐾 = 2. Then, from elementary probability theory, we have
𝑃(𝐸1 ∪ 𝐸2) = 𝑃(𝐸1) + 𝑃(𝐸2) − 𝑃(𝐸1 ∩ 𝐸2) (8)
Note that for disjoint events 𝐸1 ∩ 𝐸2 = ∅ and thus
𝑃(𝐸1 ∩ 𝐸2) = 𝑃(∅) = 0 (9)
If the events 𝐸1 and 𝐸2 are not disjoint, i.e. if 𝑃(𝐸1 ∩ 𝐸2) > 0, then we have

𝑃(𝐸1 ∪ 𝐸2) < 𝑃(𝐸1) + 𝑃(𝐸2) (10)
and thus indeed a conservative test, as
𝑝𝐻Ω(𝐸Ω) = 𝑝𝐻Ω(𝐸1 ∪ 𝐸2) < 𝑝𝐻Ω(𝐸1) + 𝑝𝐻Ω(𝐸2) ≤ 𝛼 (11)
Note, however, that disjoint events are defined by 𝐸1 ∩ 𝐸2 = ∅, while independent events are
defined by 𝑃(𝐸1 ∩ 𝐸2) = 𝑃(𝐸1)𝑃(𝐸2). Why are the events 𝐸1 and 𝐸2 not disjoint? Because it is possible that
both 𝐸1 and 𝐸2, i.e. the events that the test 𝑇1 incorrectly rejects 𝐻01 and that the test 𝑇2 incorrectly rejects 𝐻02,
occur simultaneously. What is the probability that this will happen? That depends on the dependence
structure between the test statistics. For independent tests, it is given by

𝑃(𝐸1 ∩ 𝐸2) = 𝑃(𝐸1)𝑃(𝐸2) = 𝑃(𝑇1 ∈ 𝑅𝛼1)𝑃(𝑇2 ∈ 𝑅𝛼2) (12)
We conclude that, under the formalization above, it is the fact that the events 𝐸𝑗, 𝑗 = 1, …, 𝐾 are not
mutually exclusive (i.e. not disjoint) that renders the Bonferroni procedure conservative. For two events, how
conservative it will be depends on the probability of their joint occurrence, 𝑃(𝐸1 ∩ 𝐸2). This sense of
conservativeness of the Bonferroni correction has motivated the search for alternative approaches to Type I
error rate control in mass-univariate GLM-FMRI, such as topological inference based on the concept of
Gaussian random fields, and non-parametric approaches to multiple testing control.
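For the special case of stochastically independent tests, the degree of conservativeness can be quantified exactly: the probability of at least one of the 𝐾 events 𝐸𝑗 is then 1 − (1 − 𝛼/𝐾)^𝐾, which lies slightly below the Bonferroni bound 𝐾 ⋅ 𝛼/𝐾 = 𝛼. A small illustrative Python sketch (the choice of 𝐾 values is arbitrary):

```python
# Exact family-wise error probability of K independent tests, each at level
# b = alpha / K, versus the Bonferroni bound K * b = alpha
def exact_fwer(K, alpha):
    return 1.0 - (1.0 - alpha / K) ** K

gaps = {K: 0.05 - exact_fwer(K, 0.05) for K in (2, 10, 1000)}
# All gaps are positive but small: under independence the Bonferroni procedure
# is only mildly conservative; positive dependence between test statistics
# (as induced by the spatial smoothness of FMRI data) can widen the gap
```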
Study Questions
1. Why is the multiple testing problem of relevance for classical GLM-FMRI?
2. Define the family-wise error rate and the false discovery rate for a multiple testing problem.
3. How are the false-discovery rate and the family-wise error rate related?
4. Define the notions of liberal, exact, and conservative statistical tests.
5. Discuss the notions of weak and strong control of a multiple testing error rate in the context of GLM-FMRI.
6. Formulate the Bonferroni procedure.
Study Questions Answers
1. GLM-FMRI analyses are typically mass-univariate, i.e. a statistical test is performed at each voxel. Because there are usually
many voxels, the probability of falsely rejecting the null hypothesis when it is in fact true, i.e. of committing a “Type I error”, at least
once is very high.
2. The family-wise error rate is defined as the probability of at least one Type I error over a family of hypotheses. The false-
discovery rate corresponds (roughly) to the expectation of the ratio between the number of rejections of the null hypothesis,
when it is true, and the total number of rejections over a family of hypotheses.
3. In general, 𝐹𝐷𝑅 ≤ 𝐹𝑊𝐸𝑅, that is, the false-discovery rate is smaller than or equal to the family-wise error rate. Equality of the false-
discovery rate and the family-wise error rate holds in the special case that all null hypotheses are true over the whole set of
statistical tests.
4. Given a null hypothesis 𝐻0 and a test statistic 𝑇 of the data, a statistical test is said to be liberal, conservative, or exact, if, for any
given significance level 𝛼 ∈ [0,1] and corresponding rejection region 𝑅𝛼, the probability that 𝑇 belongs to the rejection region
𝑅𝛼, denoted by 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼), is greater than, less than, or equal to 𝛼, respectively. Appropriate control of the Type I error rate
requires an exact or conservative test. In other words, for a liberal test, we have 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) > 𝛼; for an exact test, we have
𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) = 𝛼; and for a conservative test, we have 𝑝(𝐻0=0)(𝑇 ∈ 𝑅𝛼) < 𝛼.
5. Controlling a Type I error rate under the assumption of the “omnibus null hypothesis”, i.e. the assumption that the null
hypothesis is true for all tests considered, is referred to as weak control. In the context of GLM-FMRI with voxel null hypotheses
of no activation at a given voxel, the omnibus null hypothesis may be expressed as “no activation at any voxel in the volume of
the brain under examination”. Evidence against this null hypothesis indicates the presence of “some activation somewhere”.
Informally, such a test has no “localizing power”, in that the Type I error for individual voxels is not controlled. In other words: if the
omnibus null hypothesis has been rejected, then any set of voxels could be declared “activated”. Controlling a Type I error
rate for every possible choice of subset of tests for which the null hypothesis holds is referred to as “strong control”. A test
procedure with strong control over a Type I error rate has “localizing power”.
6. Given a family of null hypotheses 𝐻0(𝑖), 𝑖 = 1, …, 𝑚, and an interest in a family-wise Type I error rate of less than or equal to a
significance-level 𝛼 ∈ [0,1], the Bonferroni procedure corresponds to testing each individual hypothesis 𝐻0(𝑖) at a reduced
significance-level 𝛼𝑖 ∈ [0,1], such that ∑𝑖=1𝑚 𝛼𝑖 = 𝛼. To this end, one sets the adjusted 𝛼-values 𝛼𝑖 to 𝛼𝑖 ≔ 𝛼/𝑚 for 𝑖 = 1, …, 𝑚
and rejects the null hypothesis 𝐻0(𝑖) for 𝑝𝑖 < 𝛼𝑖, where 𝑝𝑖 denotes the raw 𝑝-value obtained for the test of hypothesis 𝐻0(𝑖). In
other words, instead of testing against the significance level 𝛼, one tests the 𝑚 individual hypotheses against 𝛼/𝑚.
Classification Approaches
Binary classification methods for fMRI and M/EEG data analysis posit the existence of a “training
data set”

{(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑛 (1)

comprising “training examples” (𝑥(𝑖), 𝑦(𝑖)), where 𝑥(𝑖) ∈ ℝ𝑚 is an 𝑚-dimensional data vector, usually
referred to as “feature vector” or “input vector” (or, more generally, “feature variable”), and 𝑦(𝑖) ∈ ℝ is a
univariate “target variable”, also referred to as “output variable”. The superscript 𝑖 = 1, …, 𝑛 indexes the
training examples. For binary classification schemes, we may have 𝑦(𝑖) ∈ {0,1} or 𝑦(𝑖) ∈ {−1,1}. As an
example, 𝑥(𝑖) may be a vector of the MR signal from 𝑚 voxels of a cortical region of interest, while the
corresponding 𝑦(𝑖) may code the condition the participant was exposed to when the data 𝑥(𝑖) was recorded.
The goal of classification approaches is to learn a mapping from an observed data pattern 𝑥(𝑖) to the
corresponding value of the target variable. In terms of fMRI, the goal is thus to predict the stimulation
condition from the observed pattern of voxel activation. Note that, compared to the theory of the GLM, the
intuitive meanings of “𝑥” and “𝑦” are reversed: in the GLM “𝑥” refers to aspects of the experimental design,
i.e., the independent variable, and is used to predict data “𝑦”. In the current scenario “𝑥” refers to data that
is used to predict the experimental design “𝑦”. In other words, while in the familiar GLM scenario data “𝑦” is
regressed on predictors “𝑥”, in the current section, data “𝑥” is used to predict experimental conditions “𝑦”.
In general, there exist at least three different learning approaches (Bishop, 2006). The first
approach is referred to as a generative learning approach and models the joint distribution of feature and
target variables 𝑝(𝑥(𝑖), 𝑦(𝑖)), such that the conditional probability 𝑝(𝑦(𝑖)|𝑥(𝑖)) can explicitly be evaluated.
An example of a generative learning method is linear discriminant analysis, which will be discussed in Section
1 below. The second approach is referred to as a discriminative learning approach and models the
probability of 𝑦(𝑖) given the value of 𝑥(𝑖), but not formally conditioned on 𝑥(𝑖), as the feature vector is not considered a
random variable. An example of a discriminative learning method is logistic regression, which we consider
as special case of a larger model class known as “generalized linear models” in Section 2. Finally, a third
approach makes use of linear discriminant functions based on geometrical considerations without reference
to probabilistic concepts. An example of such an approach is the popular support vector classification
technique, which we will discuss in Section 3.
The most common use of these procedures in FMRI is based on the idea that, if the prediction accuracy
of the learning model is larger than chance, then the underlying data (for example, from a cortical region of
interest) must represent some “information” about the experimental condition of interest. To this end, a
full data set {(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑛 is often partitioned into a “training data set” {(𝑥(𝑖), 𝑦(𝑖))}𝑖=1𝑟 and a mutually
exclusive “test data set” {(𝑥(𝑗), 𝑦(𝑗))}𝑗=1𝑠, such that 𝑟 + 𝑠 = 𝑛; the learning model is trained on the training
data set and used to predict the experimental condition 𝑦(𝑗) of the associated test data set feature vector
𝑥(𝑗) for 𝑗 = 1, …, 𝑠. By repeatedly changing the allocation of examples from the full data set to the training
and test data sets, different combinations of training and test data are explored. This procedure is referred
to as “cross-validation”.
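The partitioning scheme described above can be sketched as follows (illustrative Python, assuming NumPy; the 4-fold split, the nearest-class-mean classifier, and all data-generating settings are hypothetical stand-ins for an actual learning model):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: n examples with m features and two classes of equal size
n, m = 40, 5
y = np.repeat([0, 1], n // 2)
x = rng.standard_normal((n, m)) + 1.5 * y[:, None]   # class-dependent mean shift

def nearest_mean_predict(x_train, y_train, x_test):
    """Allocate each test vector to the class with the closer training mean."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(x_test - m0, axis=1)
    d1 = np.linalg.norm(x_test - m1, axis=1)
    return (d1 < d0).astype(int)

# 4-fold cross-validation: each example serves exactly once as a test datum
folds = np.tile(np.arange(4), n // 4)
accuracies = []
for f in range(4):
    train, test = folds != f, folds == f
    y_hat = nearest_mean_predict(x[train], y[train], x[test])
    accuracies.append(float(np.mean(y_hat == y[test])))
cv_accuracy = float(np.mean(accuracies))   # well above the 0.5 chance level here
```

A cross-validated accuracy markedly above chance would, in the informal sense described above, indicate that the feature vectors carry information about the experimental condition.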
A simple discriminative learning approach for binary classification is afforded by logistic regression.
In brief, upon learning a set of parameters 𝛽, the logistic regression model outputs the probability of a
feature vector 𝑥(𝑖) ∈ ℝ𝑚 belonging to one of two classes encoded by the target variable 𝑦(𝑖) ∈ {0,1}. To classify the
feature vector, the following decision rule is usually employed: if the output probability of 𝑥(𝑖) belonging to the
second class is less than or equal to 0.5, then 𝑥(𝑖) is allocated to the first class (i.e. the class membership prediction is set
to 𝑦(𝑖) ≔ 0), and else it is allocated to the second class (i.e. the class membership prediction is set to 𝑦(𝑖) ≔ 1).
Comparing the allocation of feature vectors to classes with their known class memberships then allows for
(informally) inferring how much information the feature vectors carry with respect to their class membership.
(1) Generative learning - Linear Discriminant Analysis
The linear discriminant analysis model posits that the multivariate data points 𝑥(𝑖) (𝑖 = 1, …, 𝑛)
correspond to realizations of one of two possible multivariate Gaussian random variables governed by
distinct expectation vectors 𝜇0 ∈ ℝ𝑚 and 𝜇1 ∈ ℝ𝑚, but with the same positive-definite covariance matrix Σ ∈ ℝ𝑚×𝑚.
The class membership (i.e. the expectation vector) of the 𝑖th data point is assumed to be determined by an
associated Bernoulli random variable 𝑦(𝑖) with parameter 𝜇 ∈ [0,1]. Note that 𝜇0, 𝜇1 refer to expectations
of Gaussian random variables, while 𝜇 corresponds to the expectation of a Bernoulli variable. Formally, for
𝑖 = 1, …, 𝑛
𝑝(𝑥(𝑖), 𝑦(𝑖)) = 𝑝(𝑦(𝑖))𝑝(𝑥(𝑖)|𝑦(𝑖)) (1)
where
𝑝(𝑦(𝑖)) = 𝐵𝑒𝑟𝑛(𝑦(𝑖); 𝜇) (2)
and
𝑝(𝑥(𝑖)|𝑦(𝑖)) = 𝑁(𝑥(𝑖); 𝜇𝑦(𝑖), Σ) (3)
This model is best understood from an “ancestral” sampling perspective: first, a value 𝑦(𝑖) ∈ {0,1} is sampled
from a Bernoulli distribution with probability 𝜇 ∈ [0,1] of being 1. Next, depending on the outcome of 𝑦(𝑖),
the associated multivariate data point 𝑥(𝑖) ∈ ℝ𝑚 is sampled from the multivariate normal distribution with
expectation 𝜇0, if 𝑦(𝑖) = 0, and expectation 𝜇1, if 𝑦(𝑖) = 1. Figure 1 below depicts 100 samples (or “training
data points”) (𝑥(𝑖), 𝑦(𝑖)), 𝑖 = 1, …, 100 from an LDA model with two-dimensional data points 𝑥(𝑖) ∈ ℝ2 and
with parameters 𝜇 = 0.7, 𝜇0 = (−1, −1)𝑇, 𝜇1 = (1,1)𝑇 and covariance matrix Σ = 𝐼2. Note that the outcome of
𝑦(𝑖) is depicted as the color of the 𝑥(𝑖) data points.
The LDA model can be used for classification of novel feature vectors, denoted here by 𝑥∗ ∈ ℝ𝑚, as
follows. First, based on a training data set {(𝑥(𝑖), 𝑦(𝑖)) | 𝑖 = 1, …, 𝑛}, the parameters of the LDA model are
learned by means of analytical maximum likelihood estimation. We omit the full derivation of the ML
estimators here for brevity and only provide a sketch of it: as usual, the log likelihood function of the
training data is formulated as

ℓ(𝜇, 𝜇0, 𝜇1, Σ) = ln(∏𝑖=1𝑛 𝑝(𝑥(𝑖), 𝑦(𝑖))) = ln(∏𝑖=1𝑛 𝑝(𝑥(𝑖)|𝑦(𝑖))𝑝(𝑦(𝑖))) (4)
and evaluated based on the functional forms of 𝑝(𝑦(𝑖)) and 𝑝(𝑥(𝑖)|𝑦(𝑖)) specified in (2) and (3). Next, the log
likelihood function is maximized with respect to the parameters 𝜇, 𝜇0, 𝜇1 and Σ by computing the
corresponding partial derivatives of the log likelihood function, setting them to zero, and solving for the ML
estimators. Using the indicator function

1𝐴 ≔ 1 if 𝐴 is true, and 0 if 𝐴 is false (5)

the resulting ML estimators for the parameters of the LDA model are given by

�̂� = (1/𝑛) ∑𝑖=1𝑛 1{𝑦(𝑖)=1} (6)

�̂�0 = (∑𝑖=1𝑛 1{𝑦(𝑖)=0} 𝑥(𝑖)) / (∑𝑖=1𝑛 1{𝑦(𝑖)=0}) (7)

�̂�1 = (∑𝑖=1𝑛 1{𝑦(𝑖)=1} 𝑥(𝑖)) / (∑𝑖=1𝑛 1{𝑦(𝑖)=1}) (8)

Σ̂ = (1/𝑛) ∑𝑖=1𝑛 (𝑥(𝑖) − �̂�𝑦(𝑖))(𝑥(𝑖) − �̂�𝑦(𝑖))𝑇 (9)
Proof of (6) – (9)
Figure 1 Linear discriminant analysis realization and classification
Note that these estimators have very intuitive interpretations: the Bernoulli parameter 𝜇 is
estimated by the proportion of observed 1’s in the training data set, the Gaussian parameter 𝜇0 is
estimated by the average of all 𝑥(𝑖) belonging to training points with 𝑦(𝑖) = 0, and likewise for 𝜇1, while the
common covariance Σ is estimated by the pooled empirical covariance of the class-mean-centered data
points of the first and second classes. For the training set realized in Figure 1, for which the true, but unknown,
parameters are given by

𝜇 = 0.7, 𝜇0 = (−1, −1)𝑇, 𝜇1 = (1,1)𝑇 and Σ = (1 0; 0 1) (10)

these parameter estimates are given by

�̂� = 0.70, �̂�0 = (−1.15, −0.93)𝑇, �̂�1 = (0.95, 0.88)𝑇 and Σ̂ = (0.86 0.06; 0.06 1.09) (11)
Once estimated, the LDA model can be used for the classification of a novel feature vector $x^* \in \mathbb{R}^m$ as follows: to determine the class membership of $x^*$, the probability of $y^*$ being 1 conditional on $x^*$ is evaluated. To this end, Bayes' theorem

$p(y^*|x^*) \propto p(y^*)\,p(x^*|y^*)$   (12)

is applied, which, upon some algebraic manipulation, results in the following expression for the probability of $y^*$ being 1 given $x^*$:

$p(y^* = 1|x^*) = \frac{1}{1+\exp(-x^{*T}\hat{\beta})}$   (13)
As will be seen below, expression (13) corresponds to the logistic regression mean function. Notably, in expression (13), $x^*$ refers to the "augmented" feature vector, i.e. $x^* := (1, x^{*T})^T$, and $\hat{\beta}$ corresponds to the following vector function of the ML estimates

$\hat{\beta} := \begin{pmatrix} \frac{1}{2}\hat{\mu}_0^T\hat{\Sigma}^{-1}\hat{\mu}_0 - \ln(1-\hat{\mu}) - \frac{1}{2}\hat{\mu}_1^T\hat{\Sigma}^{-1}\hat{\mu}_1 + \ln(\hat{\mu}) \\ -\hat{\Sigma}^{-1}(\hat{\mu}_0 - \hat{\mu}_1) \end{pmatrix}$   (14)
Based on the probability $p(y^* = 1|x^*)$, a simple decision rule can then be used to classify the novel feature vector $x^*$: if the probability $p(y^* = 1|x^*)$ is larger than 0.5, $x^*$ is classified as a member of the class $y = 1$; otherwise, it is classified as a member of the class $y = 0$.
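The estimation equations (6)-(9) and the classification rule (13)-(14) can be illustrated numerically. The following is a minimal Python/NumPy sketch (Python is used here for illustration only; the parameter values follow equation (10), and all variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a training set from the LDA generative model with the parameters
# of equation (10): mu = 0.7, mu_0 = (-1,-1)^T, mu_1 = (1,1)^T, Sigma = I_2
n, mu, mu_0, mu_1, Sigma = 100, 0.7, np.array([-1., -1.]), np.array([1., 1.]), np.eye(2)
y = (rng.random(n) < mu).astype(float)                     # y^(i) ~ Bern(mu)
X = np.where(y[:, None] == 1, mu_1, mu_0) \
    + rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # x^(i)|y^(i) ~ N(mu_y, Sigma)

# ML estimation, equations (6)-(9)
mu_hat  = y.mean()
mu0_hat = X[y == 0].mean(axis=0)
mu1_hat = X[y == 1].mean(axis=0)
R       = X - np.where(y[:, None] == 1, mu1_hat, mu0_hat)  # class-specific residuals
Sig_hat = (R.T @ R) / n

# Classification of a novel feature vector, equations (13)-(14)
Si = np.linalg.inv(Sig_hat)
beta_hat = np.concatenate((
    [0.5 * mu0_hat @ Si @ mu0_hat - np.log(1 - mu_hat)
     - 0.5 * mu1_hat @ Si @ mu1_hat + np.log(mu_hat)],
    -Si @ (mu0_hat - mu1_hat)))

def classify(x_star):
    """Return p(y*=1|x*) via (13) on the augmented x*, and the class decision."""
    p = 1.0 / (1.0 + np.exp(-np.concatenate(([1.0], x_star)) @ beta_hat))
    return p, int(p > 0.5)

p1, c1 = classify(np.array([1.0, 1.0]))    # a point near mu_1
p0, c0 = classify(np.array([-1.0, -1.0]))  # a point near mu_0
```

A point near $\mu_1$ obtains a class-1 probability above 0.5 and is assigned to class 1, and conversely for a point near $\mu_0$.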
(2) Discriminative Learning - Logistic Regression
Here we consider logistic regression, which we formulate in terms of a generalized linear model.
Consider now the case of a Bernoulli distributed target variable $y^{(i)}$ with

$p(y^{(i)}) = \mathrm{Bern}(y^{(i)}; \mu^{(i)})$   (1)
where $E(y^{(i)}) = \mu^{(i)} \in [0,1]$ and the Bernoulli distribution is a member of the exponential family. To formulate a generalized linear model, we require a link function $g$, which maps the expectation of the response variable $y^{(i)}$ onto the linear predictor $\eta^{(i)}$. For the logistic regression model, this link function is defined by the so-called "logit function", given by

$g: [0,1] \to \mathbb{R},\ E(y^{(i)}) = \mu^{(i)} \mapsto \ln\frac{\mu^{(i)}}{1-\mu^{(i)}} =: \eta^{(i)}$   (2)
The functional role of the link function becomes somewhat clearer if we consider its inverse $g^{-1}$, which maps the familiar linear predictor $\eta^{(i)} := \beta_0 + x_1^{(i)}\beta_1 + \dots + x_m^{(i)}\beta_m$ onto the expectation of $y^{(i)}$, and which is given by

$g^{-1}: \mathbb{R} \to [0,1],\ \eta^{(i)} \mapsto \frac{1}{1+\exp(-\eta^{(i)})} = \mu^{(i)} = E(y^{(i)})$   (3)
This mapping is found by solving the defining equation (2) of 𝑔 for 𝜇(𝑖) as shown below.
Proof of (3)

We solve the defining equation of $g$ for $\mu^{(i)}$. From (2), we have

$\eta^{(i)} = \ln\frac{\mu^{(i)}}{1-\mu^{(i)}} \Leftrightarrow -\eta^{(i)} = -\ln\frac{\mu^{(i)}}{1-\mu^{(i)}} \Leftrightarrow -\eta^{(i)} = \ln\frac{1-\mu^{(i)}}{\mu^{(i)}}$   (3.1)

where the last equality follows from the properties of the logarithm. Taking the exponential of the last equality then results in

$\exp(-\eta^{(i)}) = \exp\left(\ln\frac{1-\mu^{(i)}}{\mu^{(i)}}\right) \Leftrightarrow \exp(-\eta^{(i)}) = \frac{1-\mu^{(i)}}{\mu^{(i)}}$   (3.2)

which we may now solve for $\mu^{(i)}$:

$\exp(-\eta^{(i)}) = \frac{1-\mu^{(i)}}{\mu^{(i)}} \Leftrightarrow \mu^{(i)}\exp(-\eta^{(i)}) + \mu^{(i)} = 1 \Leftrightarrow \mu^{(i)}\left(1+\exp(-\eta^{(i)})\right) = 1$   (3.3)

and thus

$g^{-1}: \mathbb{R} \to [0,1],\ \eta^{(i)} \mapsto \frac{1}{1+\exp(-\eta^{(i)})} = \mu^{(i)} = E(y^{(i)})$   (3.4)

□
The mean function $g^{-1}$ (also referred to as the "logistic function" or, more generally because of its S-shape, a "sigmoid function") may thus be regarded as a nonlinear function of the linear predictor $x^{(i)T}\beta$, which maps the unbounded values $x^{(i)T}\beta$ onto values in the range $[0,1]$, which in turn serve as the expectations of a Bernoulli random variable. Figure 1 below depicts the link function and its inverse, the mean function.

Figure 1. The logistic regression link function (left panel) and its inverse, the mean function (right panel).
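The inverse relationship between the logit link (2) and the logistic mean function (3) is easy to verify numerically. The following small Python/NumPy sketch (illustrative only) checks the round trip $g^{-1}(g(\mu)) = \mu$ and the range of $g^{-1}$:

```python
import numpy as np

# Logit link function g, equation (2), and logistic mean function g^{-1}, equation (3)
g     = lambda mu:  np.log(mu / (1.0 - mu))
g_inv = lambda eta: 1.0 / (1.0 + np.exp(-eta))

# g^{-1} undoes g on the interior of [0, 1]
mu  = np.array([0.1, 0.5, 0.9])
eta = g(mu)                                  # unbounded linear-predictor scale
round_trip_ok = np.allclose(g_inv(eta), mu)

# g^{-1} maps arbitrary real values into the open interval (0, 1)
eta_grid = np.linspace(-10, 10, 101)
in_range = np.all((g_inv(eta_grid) > 0) & (g_inv(eta_grid) < 1))
```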
For simplicity, we will denote the function $g^{-1}$ in the following by

$f_\beta: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f_\beta(x) := \frac{1}{1+\exp(-x^T\beta)}$   (4)

to emphasize its character as a nonlinear transformation of the input feature vector $x \in \mathbb{R}^m$ and its dependence on a parameter vector $\beta \in \mathbb{R}^m$. Note that with
$\mu^{(i)} = f_\beta(x^{(i)})$   (5)

the Bernoulli probability mass function of $y^{(i)}$ may be written as

$p(y^{(i)}) = \mathrm{Bern}\left(y^{(i)}; f_\beta(x^{(i)})\right) = \left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1 - f_\beta(x^{(i)})\right)^{1-y^{(i)}}$   (6)
As derived below, the log likelihood function for the parameters of the logistic regression model
based on 𝑛 training examples is given by
$\ell(\beta) = \sum_{i=1}^{n} y^{(i)}\ln\left(\frac{1}{1+\exp(-x^{(i)T}\beta)}\right) + (1-y^{(i)})\ln\left(1-\frac{1}{1+\exp(-x^{(i)T}\beta)}\right)$   (7)
Proof of (7)

Assuming that the $n$ training examples were generated independently, we can write the likelihood function of the parameters $\beta \in \mathbb{R}^{m+1}$ as

$L(\beta) = p_\beta(y^{(1)},\dots,y^{(n)}) = \prod_{i=1}^{n}\mathrm{Bern}\left(y^{(i)}; f_\beta(x^{(i)})\right) = \prod_{i=1}^{n}\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}$   (7.1)

Taking the logarithm, we obtain the log likelihood function (7) as follows:

$\ell(\beta) := \ln L(\beta)$   (7.2)
$= \ln\left(\prod_{i=1}^{n}\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n}\ln\left(\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n}\ln\left(\left(f_\beta(x^{(i)})\right)^{y^{(i)}}\right) + \ln\left(\left(1-f_\beta(x^{(i)})\right)^{1-y^{(i)}}\right)$
$= \sum_{i=1}^{n} y^{(i)}\ln\left(f_\beta(x^{(i)})\right) + (1-y^{(i)})\ln\left(1-f_\beta(x^{(i)})\right)$
□
A common way to maximize the log likelihood function in the case of logistic regression is by means
of gradient ascent. Based on the analytical evaluation of the partial derivatives of the log likelihood function
with respect to 𝛽𝑗, a gradient ascent algorithm for logistic regression is given by
1. Select an initial $\beta^0$ and a learning rate $\alpha > 0$ appropriately.
2. For $k = 0, 1, 2, \dots$ set

$\beta^{k+1} = \beta^k + \alpha\nabla\ell(\beta^k) = \begin{pmatrix} \beta_0^k \\ \beta_1^k \\ \vdots \\ \beta_m^k \end{pmatrix} + \alpha\begin{pmatrix} \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_0^{(i)} \\ \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_m^{(i)} \end{pmatrix}$   (8)

Note that the magnitude of the update of the $j$th entry of $\beta^{k+1}$ is proportional to the summed, feature-weighted "prediction error" $\sum_{i=1}^{n}\left(y^{(i)} - f_{\beta^k}(x^{(i)})\right)x_j^{(i)}$, i.e. the summed differences between the observed data points $y^{(i)},\ i = 1,\dots,n$ and the data point predictions $f_{\beta^k}(x^{(i)})$ based on the previous parameter setting $\beta^k$, each weighted by the corresponding feature entry $x_j^{(i)}$.
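The gradient ascent scheme (8) can be sketched in a few lines. The following Python/NumPy illustration (not the notes' own code; data, seed, and step size are our own choices) fits a logistic regression model with one intercept and one feature on synthetic Bernoulli data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data with augmented features, x_0^(i) := 1
n = 200
X = np.column_stack((np.ones(n), rng.normal(size=n)))  # shape (n, 2)
beta_true = np.array([-0.5, 2.0])
f = lambda X, b: 1.0 / (1.0 + np.exp(-X @ b))          # mean function f_beta, eq. (4)
y = (rng.random(n) < f(X, beta_true)).astype(float)    # y^(i) ~ Bern(f_beta(x^(i)))

# Gradient ascent, equation (8): beta^{k+1} = beta^k + alpha * X^T (y - f_beta(X))
beta, alpha = np.zeros(2), 0.01
for _ in range(2000):
    beta = beta + alpha * X.T @ (y - f(X, beta))

# Log likelihood (7) at the estimate; it should exceed the chance-level value n*ln(0.5)
ll = np.sum(y * np.log(f(X, beta)) + (1 - y) * np.log(1 - f(X, beta)))
```

Note how the update is exactly the "prediction error" $y - f_\beta(x)$ projected onto each feature column, as in (8).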
Proof of (8)

We evaluate the partial derivatives $\frac{\partial}{\partial\beta_j}\ell(\beta),\ j = 0,1,\dots,m$ of the log likelihood function (7). To this end, it is helpful to first rewrite the function

$f_\beta: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f_\beta(x) := \frac{1}{1+\exp(-x^T\beta)}$   (8.1)

as a function of the dot product $x^T\beta$, i.e. as

$f: \mathbb{R} \to \mathbb{R},\ x^T\beta \mapsto f(x^T\beta) := \frac{1}{1+\exp(-x^T\beta)}$   (8.2)

We further note that the derivative of the function $f$ evaluates to

$f': \mathbb{R} \to \mathbb{R},\ x^T\beta \mapsto f'(x^T\beta) = f(x^T\beta)\left(1 - f(x^T\beta)\right)$   (8.3)

This may be seen as follows: for a function

$h: \mathbb{R} \to \mathbb{R},\ z \mapsto h(z) = \frac{1}{1+\exp(-z)}$   (8.4)

we have, using the chain rule of differentiation,

$\frac{d}{dz}h(z) = \frac{d}{dz}\left(1+\exp(-z)\right)^{-1} = \left(-(1+\exp(-z))^{-2}\right)\exp(-z)(-1)$   (8.5)

and thus

$\frac{d}{dz}h(z) = \frac{\exp(-z)}{\left(1+\exp(-z)\right)^2}$   (8.6)

The right-hand side may now be reformulated as $h(z)(1-h(z))$:

$\frac{\exp(-z)}{(1+\exp(-z))^2} = \frac{1+\exp(-z)}{(1+\exp(-z))^2} - \frac{1}{(1+\exp(-z))^2} = \frac{1}{1+\exp(-z)} - \frac{1}{(1+\exp(-z))^2} = \frac{1}{1+\exp(-z)}\left(1 - \frac{1}{1+\exp(-z)}\right) = h(z)\left(1-h(z)\right)$   (8.7)
Using this result, we can now evaluate the entries $\frac{\partial}{\partial\beta_j}\ell(\beta),\ j = 0,1,\dots,m$:

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \frac{\partial}{\partial\beta_j}\left(\sum_{i=1}^{n} y^{(i)}\ln\left(f(x^{(i)T}\beta)\right) + (1-y^{(i)})\ln\left(1-f(x^{(i)T}\beta)\right)\right)$   (8.8)
$= \sum_{i=1}^{n} y^{(i)}\frac{\partial}{\partial\beta_j}\ln\left(f(x^{(i)T}\beta)\right) + (1-y^{(i)})\frac{\partial}{\partial\beta_j}\ln\left(1-f(x^{(i)T}\beta)\right)$
$= \sum_{i=1}^{n} y^{(i)}\frac{1}{f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta) + (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}\left(1-f(x^{(i)T}\beta)\right)$
$= \sum_{i=1}^{n} y^{(i)}\frac{1}{f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta) - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta)$
$= \sum_{i=1}^{n}\left(y^{(i)}\frac{1}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\right)\frac{\partial}{\partial\beta_j}f(x^{(i)T}\beta)$

Using (8.3) together with the chain rule, we then have

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \sum_{i=1}^{n}\left(y^{(i)}\frac{1}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{1}{1-f(x^{(i)T}\beta)}\right)f(x^{(i)T}\beta)\left(1-f(x^{(i)T}\beta)\right)\frac{\partial}{\partial\beta_j}\left(x^{(i)T}\beta\right)$   (8.9)
Finally, with

$\frac{\partial}{\partial\beta_j}\left(x^{(i)T}\beta\right) = \frac{\partial}{\partial\beta_j}\left(\beta_0 + x_1^{(i)}\beta_1 + \dots + x_m^{(i)}\beta_m\right) = \frac{\partial}{\partial\beta_j}\beta_0 + \frac{\partial}{\partial\beta_j}x_1^{(i)}\beta_1 + \dots + \frac{\partial}{\partial\beta_j}x_m^{(i)}\beta_m = x_j^{(i)}$   (8.10)

we then further have

$\frac{\partial}{\partial\beta_j}\ell(\beta) = \sum_{i=1}^{n}\left(y^{(i)}\frac{f(x^{(i)T}\beta)(1-f(x^{(i)T}\beta))}{f(x^{(i)T}\beta)} - (1-y^{(i)})\frac{f(x^{(i)T}\beta)(1-f(x^{(i)T}\beta))}{1-f(x^{(i)T}\beta)}\right)x_j^{(i)}$   (8.11)
$= \sum_{i=1}^{n}\left(y^{(i)}\left(1-f(x^{(i)T}\beta)\right) - (1-y^{(i)})f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - y^{(i)}f(x^{(i)T}\beta) - f(x^{(i)T}\beta) + y^{(i)}f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - f(x^{(i)T}\beta)\right)x_j^{(i)}$
$= \sum_{i=1}^{n}\left(y^{(i)} - f_\beta(x^{(i)})\right)x_j^{(i)}$
□
(3) Support Vector Classification
In contrast to linear discriminant analysis and logistic regression (and, in fact, to the great majority of models discussed herein), the popular support vector classification approach is not a probabilistic model. As will be seen below, estimating the parameters of a support vector classifier corresponds to solving a constrained quadratic programming problem which is derived from geometric intuitions. While support vector classifiers may thus in practice achieve very good classification performance, the interpretation of this performance is somewhat limited by the fact that the target value assigned to a feature vector has only geometric, but no probabilistic, meaning. Support vector classifiers hence cannot serve as models that quantify the remaining uncertainty about a phenomenon of interest given a set of observations.

To develop the theory of support vector classification, we proceed as follows. First, we review the geometric intuitions and properties of linear discriminant functions. Second, we discuss the cases of linearly separable classes and of soft-margin classification. Third, we review the formulation of these geometric problems as constrained quadratic programming problems and their solution. Finally, we discuss the notion of kernel functions for support vector classification, which gives rise to the notion of a support vector machine.
Geometry of linear discriminant functions
Consider a training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ comprising $n$ training examples $(x^{(i)}, y^{(i)})$, where $x^{(i)} \in \mathbb{R}^m$ is an $m$-dimensional feature vector and $y^{(i)} \in \{-1,1\}$ is the corresponding target variable signifying the class membership of $x^{(i)}$. The aim of support vector classification is to learn the parameters of a linear discriminant function that can be used for the classification of feature vectors. The structural model that is used to relate feature vectors to target variables in support vector classification is given by a "linear classification function". We define a linear classification function as a function

$f: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f(x) := w^Tx + w_0$   (1)

i.e., as an affine-linear function that maps a feature vector $x \in \mathbb{R}^m$ onto an output value $f(x) \in \mathbb{R}$ based on a "weight vector" $w \in \mathbb{R}^m$ and a "bias parameter" $w_0 \in \mathbb{R}$. To achieve a binary classification of the feature vector $x$ based on the output of the linear classification function, it is augmented with a decision function

$g: \mathbb{R} \to \{-1,1\},\ f(x) \mapsto g(f(x)) := \begin{cases} -1, & f(x) < 0 \\ 1, & f(x) \geq 0 \end{cases}$   (2)
In other words, if the output of the classification function $f$ for a feature vector is negative, the feature vector is assigned to class $-1$; otherwise, it is assigned to class $1$. It is critical to note that, for a fixed feature vector, the linear classification function's (and thus also the decision function's) output behaviour depends on the values of the weight vector $w$ and the bias $w_0$. The aim of training a support vector classifier represented by linear classification and decision functions thus corresponds to selecting $w$ and $w_0$ according to some criterion. Before discussing approaches of how this is achieved, we review the geometry induced by a training set and by the combination of linear classification and decision functions. While the geometric intuitions discussed below hold for arbitrary feature space dimensions, the basic concepts are best visualized in the plane, i.e. for $m := 2$ (Figure 1).
Figure 1. Basic concepts of binary linear classification. Note that the hyperplane 𝐻 divides the feature space (here ℝ2) into two
decision regions 𝐷+1 and 𝐷−1.
Consider the case of a training set of $n$ examples which is "linearly separable", i.e., one may find a line such that all feature vectors of class $-1$ lie on one side of that line and all feature vectors of class $+1$ lie on the other side (Figure 1). Such a line is also referred to as a "decision surface", which separates the
underlying feature space into two “decision regions”. Notably, the dimensionality of the decision surface is
𝑚 − 1, which in the case of a planar feature space corresponds to a 1-dimensional subspace of the plane,
i.e. a line. The decision surface is defined as the set of points in feature space for which the output value of a
linear discriminant function is 0. Such a decision surface is also referred to as a “hyperplane”, which is
defined in terms of (1) as
𝐻𝑤 ≔ {𝑥 ∈ ℝ𝑚|𝑓(𝑥) = 𝑤𝑇𝑥 + 𝑤0 = 0} ⊂ ℝ𝑚 (3)
Note that the hyperplane, and thus, intuitively, the location and orientation of the decision surface, is a function of the weight vector $w$ and the bias parameter $w_0$ of the linear discriminant function. In other words, changing the weight vector and the bias changes the hyperplane (3). Three geometric characteristics of hyperplanes and the associated parameters of the underlying linear discriminant function are crucial for the theory of support vector classification (Figure 2):

(1) The weight vector $w$ is always orthogonal to the hyperplane. Changing the weight vector thus allows for changing the orientation of the decision surface, while a specific value of the weight vector corresponds to a specific orientation of the decision surface in feature space.

(2) The output of the linear discriminant function $f$ for a feature vector $x$ provides a measure of the distance of $x$ from the hyperplane.

(3) The bias parameter $w_0$ of the linear discriminant function corresponds to the distance of the hyperplane from the origin of feature space. Changing the bias parameter thus shifts the hyperplane in feature space, and a specific choice of the bias parameter corresponds to a specific location of the hyperplane in feature space.
Figure 2. Geometric properties of hyperplanes and linear classification functions.
In the following, we restate the properties above more formally and provide proofs.
Orthogonality of the weight vector and the hyperplane. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Then the weight vector $w$ is orthogonal to any vector $y$ pointing in the direction of the hyperplane.   (4)

Proof of (4)

Let $x_a, x_b \in H_w$ be arbitrary points on the hyperplane. Then the following system of affine-linear equations holds

$w^Tx_a + w_0 = 0$   (4.1)
$w^Tx_b + w_0 = 0$   (4.2)

Subtracting (4.2) from (4.1) yields

$w^Tx_a - w^Tx_b = 0 \Leftrightarrow w^T(x_a - x_b) = 0$   (4.3)

and thus the weight vector is orthogonal to the vector $y := x_a - x_b$, which points in the direction of the hyperplane.
□
Linear discriminant function output as distance from the hyperplane. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Then $f(x)$ is proportional to the signed Euclidean distance $d \in \mathbb{R}$ of the point $x$ from the hyperplane, with proportionality constant $\|w\|_2^{-1}$:

$d = \frac{1}{\|w\|_2}f(x)$   (5)

Proof of (5)

Consider the decomposition of a point $x \in \mathbb{R}^m$ into its orthogonal projection $x_p \in \mathbb{R}^m$ onto the hyperplane and a multiple $d \in \mathbb{R}$ of the unit vector $\frac{w}{\|w\|_2}$:

$x = x_p + d\frac{w}{\|w\|_2}$   (5.1)

Note that this decomposition is possible because $w$ is orthogonal to any vector pointing in the direction of the hyperplane and $\left\|\frac{w}{\|w\|_2}\right\|_2 = 1$ (Figure 2). Consider now the evaluation of the so-decomposed $x$ under the linear discriminant function:

$f(x) = w^Tx + w_0 = w^T\left(x_p + d\frac{w}{\|w\|_2}\right) + w_0 = w^Tx_p + w_0 + d\frac{w^Tw}{\|w\|_2}$   (5.2)

Because $x_p \in H_w$, we have $w^Tx_p + w_0 = 0$ and thus

$f(x) = d\frac{w^Tw}{\|w\|_2} = d\frac{\|w\|_2^2}{\|w\|_2} = d\|w\|_2 \Rightarrow d = \frac{1}{\|w\|_2}f(x)$   (5.3)
□
Bias parameter as distance of the hyperplane from the origin. Let $f$ denote a linear discriminant function and $H_w$ the associated $(m-1)$-dimensional hyperplane induced by the choice of the weight vector $w \in \mathbb{R}^m$. Let $d_0$ denote the signed distance of the hyperplane from the origin. Then

$d_0 = \frac{w_0}{\|w\|_2}$   (6)

Proof of (6)

Consider the evaluation of the distance of the origin $x_0 = (0,\dots,0)^T \in \mathbb{R}^m$ from the hyperplane by means of (5). Then

$d_0 = \frac{1}{\|w\|_2}f(x_0) = \frac{1}{\|w\|_2}\left(w^Tx_0 + w_0\right) = \frac{1}{\|w\|_2}\left(w^T\begin{pmatrix}0\\\vdots\\0\end{pmatrix} + w_0\right) = \frac{w_0}{\|w\|_2}$   (6.1)
□
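The three geometric properties (4)-(6) can also be checked numerically for a concrete hyperplane. The following Python/NumPy sketch (illustrative values of $w$ and $w_0$ chosen by us) verifies orthogonality, the distance formula (5), and the origin distance (6) in the plane:

```python
import numpy as np

# A concrete hyperplane in the plane: f(x) = w^T x + w_0, with ||w||_2 = 5
w, w0 = np.array([3.0, 4.0]), -5.0
f = lambda x: w @ x + w0

# Two points on H_w (f = 0); property (4): w is orthogonal to x_a - x_b
x_a = np.array([3.0, -1.0])                   # 9 - 4 - 5 = 0
x_b = np.array([-1.0, 2.0])                   # -3 + 8 - 5 = 0
orthogonal = np.isclose(w @ (x_a - x_b), 0.0)

# Property (5): shifting x_a by 2 units along the unit normal gives signed distance 2
x = x_a + 2.0 * w / np.linalg.norm(w)
d = f(x) / np.linalg.norm(w)

# Property (6): signed distance of H_w from the origin is w0 / ||w||_2
d0 = w0 / np.linalg.norm(w)                   # the origin lies on the f < 0 side
```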
Hyperplane margin, support vectors, and canonical hyperplane
Note again that (4) and (6) imply that the weight vector $w$ and the bias parameter $w_0$ of the linear discriminant function together uniquely determine the orientation and location of the hyperplane $H_w$. In other words, by adjusting the values of these parameters, one can determine the classification of all possible input feature vectors. We next consider the question of how (in some well-defined sense) "good" values for $w$ and $w_0$ can be determined based on a training set. To this end, we first introduce the notions of the training-set dependent "hyperplane margin" and "support vector set".
For a given training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a given linear discriminant function specified in terms of the values of $w$ and $w_0$, the distance $d^{(i)}$ of each feature vector $x^{(i)}$ from the induced hyperplane can be evaluated based on (5). In order to measure this distance in an absolute sense, i.e., irrespective of on which side of the hyperplane the feature vector $x^{(i)}$ is located, we multiply (5) by the label $y^{(i)} \in \{-1,1\}$, such that all thus measured distances of correctly classified feature vectors are positive:

$d^{(i)} = \frac{y^{(i)}}{\|w\|_2}f(x^{(i)}) = \frac{y^{(i)}(w^Tx^{(i)} + w_0)}{\|w\|_2} \quad (i = 1,\dots,n)$   (7)

Based on (7), the "margin" of the hyperplane induced by $f$ for a given training set is defined as the minimal distance of a training feature vector from the hyperplane (Figure 3):

$d^* = \min_{x^{(i)}}\{d^{(i)}\} = \min_{x^{(i)}}\left\{\frac{y^{(i)}(w^Tx^{(i)} + w_0)}{\|w\|_2}\right\}$   (8)
Figure 3. Hyperplane margin.
The points $x^{(i)}$ for which the equality $d^{(i)} = d^*$ holds, i.e., which lie on the margin of the hyperplane, are called "support vectors". With the help of support vectors, we may specify the notion of a "canonical hyperplane". So far, equivalent hyperplanes can be obtained by multiplying the linear discriminant function $f$ by a scalar $a \neq 0$, as

$f(x) = 0 \Leftrightarrow af(x) = aw^Tx + aw_0 = a(w^Tx + w_0) = 0$   (9)

To obtain a unique representation, we now scale $w$ and $w_0$ such that the output of $f$ for a support vector, multiplied by its class label, equals 1. Let $x^*$ be a support vector and $y^*$ the associated class label. Then this convention means that, by definition,

$y^*f(x^*) = y^*(w^Tx^* + w_0) = 1$   (10)

Moreover, as $x^*$ achieves the minimum in (8), the margin is given by

$d^* = \frac{1}{\|w\|_2}$   (11)
This convention has the benefit that now, by definition, any feature vector that is not a support vector, say $x^{(i)}$, satisfies $y^{(i)}(w^Tx^{(i)} + w_0) > 1$, and that the margin $d^*$ is a simple function of the weight vector $w$. We next consider how this weight vector $w$ is determined based on the intuition of "maximizing the margin". To this end, we consider two scenarios: a linearly separable training data set, i.e., the case in which a hyperplane can be found that results in a correct classification of all training feature vectors, and a non-linearly separable training set, in which no such hyperplane can be found.
Linearly separable training sets and maximum margin classification
Having introduced the margin of a canonical hyperplane, we can now state a basic criterion for deciding which settings of $w$ and $w_0$ are "good" for a given training set in a well-defined sense. We first consider the case of a "linearly separable training set", i.e. a training set for which a linear discriminant function can be found such that all training data points are classified correctly (Figure 4). A "good" linear discriminant function can be conceived of as satisfying two conditions: first, it results in the correct classification of all training data points, and second, it maximizes the margin. The latter condition can be formalized in terms of an optimal $w^*$ as

$w^* = \arg\max_w\left(\frac{1}{\|w\|_2}\right)$   (12)

while the former can be formalized as the set of linear constraints

$y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 \text{ for } i = 1,\dots,n$   (13)
Figure 4. Maximum margin classification for linearly separable training sets.
Note that instead of maximizing $\|w\|_2^{-1}$, one may equivalently minimize $\frac{1}{2}\|w\|_2^2$, a formulation which has some benefits with respect to the standard theory of quadratic programming. The set of $n$ inequalities (13) corresponds to the fact that all training data feature vectors $x^{(i)}$ either are support vectors, i.e., $y^{(i)}(w^Tx^{(i)} + w_0) = 1$, or lie beyond the margin of the canonical hyperplane on the correct side, i.e. $y^{(i)}(w^Tx^{(i)} + w_0) > 1$. In summary, for linearly separable training sets and the intuitive notion of maximum margin classification, learning of the parameters $w$ and $w_0$ can be cast as the linearly constrained quadratic programming problem

$\min_w\left(\frac{1}{2}\|w\|_2^2\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 \text{ for } i = 1,\dots,n$   (14)
As discussed in more detail below, (14) can be solved for 𝑤∗ and 𝑤0∗ using techniques from the nonlinear
optimization literature.
Nonlinearly separable training sets and soft-margin classification
We next generalize the formulation of learning the linear discriminant function parameters from the
linearly separable case with maximum margin to the non-linearly separable case with “soft margin”.
Consider the scenario of overlapping class distributions (Figure 5). In this case, it is not possible to find a setting of $w$ and $w_0$ such that all training data feature vectors are correctly classified, and the maximum margin problem (14) has no solution. To establish the notion of a "good" hyperplane also in this case, the linear constraints (13) are augmented with "slack variables" $\xi_i \geq 0$, such that the constraints in this case are given by

$y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 - \xi_i \text{ for } i = 1,\dots,n$   (15)
To make sense of (15), first note that there is a slack variable $\xi_i$ for each training data feature vector $x^{(i)}$. If $\xi_i = 0$, (15) corresponds to (13), and the constraint corresponds to the maximum margin constraint as above. Second, if $0 < \xi_i < 1$, then the training feature vector is still correctly classified, but lies closer to the hyperplane than in the maximum margin case. Finally, if $\xi_i > 1$, the training feature vector $x^{(i)}$ is misclassified. The fundamental aims of soft-margin classification are hence, first, to minimize $\frac{1}{2}\|w\|_2^2$ so as to maximize the margin, and second, to minimize the sum of the slack variables $\xi_i$ over training data points in order to achieve good classification performance on the training data set. In terms of a constrained quadratic programming problem, this intuition can be formalized as

$\min_{w,\xi_i}\left(\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i^k\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) \geq 1 - \xi_i,\ \xi_i \geq 0 \text{ for } i = 1,\dots,n$   (16)
Figure 5. Soft-margin classification for nonlinearly separable training sets.
Note that the objective function captures the intuition of hyperplane margin maximization by means of minimization of $\frac{1}{2}\|w\|_2^2$, while simultaneously attempting to minimize the "loss" $\sum_{i=1}^{n}\xi_i^k$. Here $k \in \mathbb{N}$ is a constant that determines the particular kind of loss that is considered for the optimization. If $k = 1$, the "loss term" $\sum_{i=1}^{n}\xi_i$ is referred to as "hinge loss"; if $k = 2$, $\sum_{i=1}^{n}\xi_i^2$ is referred to as "quadratic loss". Finally, the constant $C$ determines the relative contribution of the margin maximization and loss terms and is usually chosen empirically.
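For the hinge loss case $k = 1$, the optimal slack variables are $\xi_i = \max(0, 1 - y^{(i)}(w^Tx^{(i)} + w_0))$, so (16) is equivalent to an unconstrained hinge-loss minimization that can be tackled by (sub)gradient descent. The following Python/NumPy sketch illustrates this equivalent formulation on synthetic overlapping classes; it is a simplified illustration, not the constrained quadratic programming procedure developed in the text, and all data and step sizes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two overlapping 2-D Gaussian classes with labels y in {-1, +1}
n = 100
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.normal(size=(n, 2)) + np.outer(y, [1.5, 1.5])

# Unconstrained form of (16) with k = 1:
# minimize 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w_0))
C, w, w0, lr = 1.0, np.zeros(2), 0.0, 1e-3

def objective(w, w0):
    xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))   # optimal slack variables
    return 0.5 * w @ w + C * xi.sum()

obj_start = objective(w, w0)
for _ in range(3000):
    violated = (y * (X @ w + w0) < 1.0)            # active hinge terms
    grad_w  = w - C * (y[violated, None] * X[violated]).sum(axis=0)
    grad_w0 = -C * y[violated].sum()
    w, w0 = w - lr * grad_w, w0 - lr * grad_w0

acc = np.mean(np.sign(X @ w + w0) == y)            # training accuracy of g(f(x))
```

The objective decreases from its starting value and the resulting decision function classifies most training points correctly despite the class overlap.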
Dual Lagrangian formulation of soft-margin support vector classifiers with hinge loss
Above we have seen that for a training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ the parameters $w \in \mathbb{R}^m$ and $w_0 \in \mathbb{R}$ of a "good" linear discriminant function

$f: \mathbb{R}^m \to \mathbb{R},\ x \mapsto f(x) := w^Tx + w_0$   (17)

can be found by solving the constrained quadratic programming problem

$\min_{w,\xi_i}\left(\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i^k\right) \text{ subject to } y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i \geq 0,\ \xi_i \geq 0 \text{ for } i = 1,\dots,n$   (18)
In principle, any constrained quadratic programming technique may be employed to solve (18). Traditionally,
however, the constrained quadratic programming problem above is transformed into its dual Lagrangian
form. This has the advantage that the feature vectors 𝑥(𝑖) enter the procedure only by means of their inner
products, which reduces the dimensionality of the problem and lends itself nicely to a generalization by
means of “kernel functions”. In this section, we consider the transformation of (18) into its dual Lagrangian
form.
From the general theory of constrained nonlinear optimization, we know that for a general
constrained nonlinear optimization problem of the form

$\min_{x\in\mathbb{R}^\nu} f(x) \text{ subject to } c_i(x) = 0\ (i \in E) \text{ and } c_i(x) \geq 0\ (i \in I)$   (19)

with $|E \cup I| = m$, the Lagrangian function is given by

$L: \mathbb{R}^\nu \times \mathbb{R}^m \to \mathbb{R},\ (x, \lambda_1,\dots,\lambda_m) \mapsto L(x,\lambda_1,\dots,\lambda_m) := f(x) - \sum_{i\in E\cup I}\lambda_i c_i(x)$   (20)

and the dual objective function is given by

$q: \mathbb{R}^m \to \mathbb{R},\ \lambda \mapsto q(\lambda) := \min_x L(x,\lambda)$   (21)

Setting $w := (w_1,\dots,w_m)^T$, $\xi := (\xi_1,\dots,\xi_n)^T$, $\omega := (w^T, w_0, \xi^T)^T$ and $\lambda := (\alpha^T, \beta^T)^T$, where $\alpha := (\alpha_1,\dots,\alpha_n)^T$ and $\beta := (\beta_1,\dots,\beta_n)^T$, the Lagrangian function for the hinge loss case $k = 1$ of (18) is thus given by
$L_{primal}: \mathbb{R}^{m+1+n} \times \mathbb{R}^{2n} \to \mathbb{R},\ (\omega,\lambda) \mapsto L_{primal}(\omega,\lambda) :=$   (22)
$\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i$
Analytical evaluation of the minimum of the function $L_{primal}$ with respect to $\omega$ then yields the dual Lagrangian (objective) function

$L_{dual}: \mathbb{R}^n \to \mathbb{R},\ \alpha \mapsto L_{dual}(\alpha) := \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}x^{(i)T}x^{(j)}$   (23)
Proof of (23)

Computing the partial derivatives of $L_{primal}$ with respect to $w$, $w_0$ and $\xi_i$, setting them to zero, and solving for the extremal points yields the following conditions:

$\frac{\partial}{\partial w}L_{primal} = 0 \Leftrightarrow \frac{\partial}{\partial w}\left(\frac{1}{2}w^Tw\right) - \frac{\partial}{\partial w}\sum_{i=1}^{n}\alpha_i y^{(i)}w^Tx^{(i)} = 0 \Leftrightarrow w - \sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)} = 0 \Leftrightarrow w = \sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)}$   (23.1)

$\frac{\partial}{\partial w_0}L_{primal} = 0 \Leftrightarrow -\frac{\partial}{\partial w_0}\sum_{i=1}^{n}\alpha_i y^{(i)}w_0 = 0 \Leftrightarrow -\sum_{i=1}^{n}\alpha_i y^{(i)} = 0$   (23.2)

and

$\frac{\partial}{\partial \xi_i}L_{primal} = 0 \Leftrightarrow C\frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\xi_j - \frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\alpha_j\xi_j - \frac{\partial}{\partial \xi_i}\sum_{j=1}^{n}\beta_j\xi_j = 0 \Leftrightarrow C - \alpha_i - \beta_i = 0 \Leftrightarrow \beta_i = C - \alpha_i \quad (i = 1,\dots,n)$   (23.3)

We next reformulate the primal Lagrangian functional form as

$\frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y^{(i)}(w^Tx^{(i)} + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i$   (23.4)
$= \frac{1}{2}w^Tw + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i y^{(i)}w^Tx^{(i)} - w_0\sum_{i=1}^{n}\alpha_i y^{(i)} + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i\xi_i - \sum_{i=1}^{n}\beta_i\xi_i$
$= \frac{1}{2}w^Tw - w^T\sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)} - w_0\sum_{i=1}^{n}\alpha_i y^{(i)} + \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}(C - \alpha_i - \beta_i)\xi_i$

Substitution of the conditions (23.1)-(23.3) then yields

$L_{dual}(\alpha)$   (23.5)
$= \frac{1}{2}w^Tw - w^Tw - w_0 \cdot 0 + \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}0\cdot\xi_i$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}w^Tw$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\left(\sum_{i=1}^{n}\alpha_i y^{(i)}x^{(i)}\right)^T\left(\sum_{j=1}^{n}\alpha_j y^{(j)}x^{(j)}\right)$
$= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}x^{(i)T}x^{(j)}$
□
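The central algebraic step of the proof, namely that with $w = \sum_i \alpha_i y^{(i)}x^{(i)}$ the double sum in (23) equals $w^Tw$, can be verified numerically for arbitrary values. The following Python/NumPy sketch (random illustrative data, not a QP solver) checks this identity and evaluates $L_{dual}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random feature vectors, labels, and dual variables
n, m = 20, 3
X = rng.normal(size=(n, m))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.random(n)

# Stationarity condition (23.1): w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Double sum in (23): sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
G = X @ X.T                                   # Gram matrix of inner products
double_sum = (alpha * y) @ G @ (alpha * y)

# The identity used in the proof: the double sum equals w^T w, so
# L_dual(alpha) = sum_i alpha_i - 0.5 * w^T w
identity_holds = np.isclose(double_sum, w @ w)
L_dual = alpha.sum() - 0.5 * double_sum
```

Note that the dual depends on the feature vectors only through the Gram matrix $G$ of inner products, which is the entry point for kernel functions.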
Study Questions
1. Explain the concept of “cross-validation” in classification approaches to fMRI data analysis.
2. Verbally describe the maximum likelihood estimators for the parameters 𝜇 ∈ [0,1], 𝜇0, 𝜇1 ∈ ℝ𝑚 and Σ ∈ ℝ𝑚×𝑚 of a linear
discriminant analysis model.
3. According to which distribution is the target variable 𝑦 distributed in a logistic regression model?
4. Name a commonality and a difference between logistic regression and linear discriminant analysis.
Study Questions Answers
1. Classification approaches in fMRI data analysis are based on the idea that if the condition prediction accuracy of a learning model is larger than chance, then the underlying data (for example, BOLD signal feature vectors of a cortical region of interest) must represent some "information" about the experimental condition of interest. To this end, a full data set $\{(x^{(i)}, y^{(i)})\,|\,i = 1,\dots,n\}$ of data feature vectors $x^{(i)}$ and condition labels $y^{(i)}$ is often partitioned into a "training data set" $\{(x^{(i)}, y^{(i)})\,|\,i = 1,\dots,m\}$ and a mutually exclusive "test data set" $\{(x^{(j)}, y^{(j)})\,|\,j = 1,\dots,q\}$, such that $m + q = n$. The classification-learning model is trained on the training data set and used to predict the experimental conditions $y^{(j)}$ of the associated test data set feature vectors $x^{(j)}\ (j = 1,\dots,q)$, and the prediction accuracy is recorded. By repeatedly changing the allocation of examples from the full data set to the training and test data sets, all possible combinations of training and test data are explored, and the prediction accuracies are averaged over combinations to yield a final prediction accuracy.
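The partitioning logic behind cross-validation can be sketched in a few lines. The following Python/NumPy illustration uses a $k$-fold rotation (a common simplification of exploring all possible allocations) and a placeholder accuracy in place of an actual trained classifier:

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 20, 5                          # n examples, k cross-validation folds
indices = rng.permutation(n)          # random allocation of examples to folds
folds = np.array_split(indices, k)

accuracies = []
for f in range(k):
    test_idx = folds[f]                                   # held-out test data set
    train_idx = np.concatenate([folds[g] for g in range(k) if g != f])
    # training and test sets are mutually exclusive and jointly exhaustive
    assert len(set(train_idx) & set(test_idx)) == 0
    assert len(train_idx) + len(test_idx) == n
    # ... train a classifier on train_idx, record its accuracy on test_idx ...
    accuracies.append(1.0)            # placeholder accuracy for this sketch
final_accuracy = np.mean(accuracies)  # averaged over folds
```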
2. The maximum likelihood estimator for the Bernoulli parameter $\mu$ is given by the proportion of observed 1's in the training data set, the Gaussian parameter $\mu_0 \in \mathbb{R}^m$ is estimated by the average of all $x^{(i)}$ belonging to training points with $y^{(i)} = 0$, and likewise for $\mu_1 \in \mathbb{R}^m$, while the common covariance $\Sigma \in \mathbb{R}^{m\times m}$ is estimated by the average of the empirical covariances of the first and second classes.
3. The target variable 𝑦 ∈ {0,1} of a logistic regression model is distributed according to a Bernoulli distribution.
4. Both logistic regression and linear discriminant analysis associate multivariate feature vectors $x^{(i)} \in \mathbb{R}^m$ with values $y^{(i)} \in \{0,1\}$ of a random variable distributed according to a Bernoulli distribution. While logistic regression treats the data points $x^{(i)}$ as non-random entities, linear discriminant analysis treats them as realizations of two multivariate Gaussian distributions with expectation parameters $\mu_0 \in \mathbb{R}^m$ and $\mu_1 \in \mathbb{R}^m$ associated with the values $y^{(i)}$ being either 0 or 1.
Deterministic Dynamical Models
Deterministic dynamic causal models for FMRI
(1) The structural form of dynamic causal models for FMRI
In structural form, dynamic causal models (DCMs) for FMRI take the following form
$\dot{x} = f(x, u, \theta_f)$   (1)
$y = g(x, \theta_g)$   (2)
where

$u: I \to U \subseteq \mathbb{R}^l,\ t \mapsto u(t)$   (3)

is a vector-valued function of system inputs, usually corresponding to the temporal evolution of $l \in \mathbb{N}$ experimental conditions,

$x: I \to X \subseteq \mathbb{R}^n,\ t \mapsto x(t)$   (4)

is a vector-valued function of latent system states describing the evolution of neural and hemodynamic variables, and

$y: I \to Y \subseteq \mathbb{R}^m,\ t \mapsto y(t) := g(x(t), \theta_g)$   (5)

is a vector-valued function of system observables in continuous time, usually representing the MR signal in $m \in \mathbb{N}$ regions of interest, and $\theta \in \Theta \subseteq \mathbb{R}^p$ is a vector of parameters. While the function $u$ is usually explicitly formalized based on the experimental manipulation, the function $x$ is specified in terms of a first-order $n$-dimensional system of ordinary differential equations. The function

$f: X \times U \times \Theta_f \to X,\ (x(t), u(t), \theta_f) \mapsto f(x(t), u(t), \theta_f) := \dot{x}(t)$   (6)

specifies the rate of change of the system state and is usually referred to as the system's "evolution function". The function

$g: X \times \Theta_g \to Y,\ (x(t), \theta_g) \mapsto g(x(t), \theta_g) := y(t)$   (7)

maps latent states onto system observables and is referred to as the "observation function".
In standard formulations of deterministic DCMs, each of the 𝑚 ∈ ℕ regions is equipped with five
latent states, such that the hidden state space of the system corresponds to 𝑋 ⊆ ℝ5𝑚. One of these latent
states describes the evolution of a lumped neural activity process, while four of these latent states describe
the dynamics of regionally-specific neuro-vascular coupling processes. In this formulation, the evolution
function $f$ takes the following form:

$f: X \times U \times \Theta_f \to X,\ (x, u, \theta_f) \mapsto f(x, u, \theta_f) := \left(f^{(i)}(x, u, \theta_f)\right)_{i=1,\dots,m} = \begin{pmatrix} f^{(1)}(x, u, \theta_f) \\ f^{(2)}(x, u, \theta_f) \\ \vdots \\ f^{(m)}(x, u, \theta_f) \end{pmatrix}$   (8)
In other words, the evolution function partitions region-of-interest-wise. To ensure positivity, the state variables $x_3$, $x_4$ and $x_5$ are exponentiated, which we will denote by defining $\tilde{x}_3 := \exp(x_3)$, $\tilde{x}_4 := \exp(x_4)$ and $\tilde{x}_5 := \exp(x_5)$, respectively.
Each region-of-interest specific evolution function then takes the following form:

$f^{(i)}: X \times U \times \Theta_f \to \mathbb{R}^5,\ (x, u, \theta_f) \mapsto f^{(i)}(x, u, \theta_f) := \begin{pmatrix} a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right)x_1 + c_i u \\ x_1^{(i)} - \kappa_s x_2^{(i)} - \kappa_f\left(\tilde{x}_3^{(i)} - 1\right) \\ x_2^{(i)}/\tilde{x}_3^{(i)} \\ \frac{1}{\tau_0\tilde{x}_4^{(i)}}\left(\tilde{x}_3^{(i)} - \left(\tilde{x}_4^{(i)}\right)^{1/\alpha}\right) \\ \frac{1}{\tau_0\tilde{x}_5^{(i)}}\left(\tilde{x}_3^{(i)}\,\frac{1-(1-E_0)^{1/\tilde{x}_3^{(i)}}}{E_0} - \left(\tilde{x}_4^{(i)}\right)^{1/\alpha}\frac{\tilde{x}_5^{(i)}}{\tilde{x}_4^{(i)}}\right) \end{pmatrix}$   (9)

where $\{a_i \in \mathbb{R}^{1\times m}, b_i^k \in \mathbb{R}^{1\times m}, c_i \in \mathbb{R}^{1\times l}, \kappa_s, \kappa_f, \tau_0, \alpha, E_0 \in \mathbb{R}\} \subset \theta_f$ are parameters, $x_1 := (x_1^{(1)},\dots,x_1^{(m)})^T$ denotes the vector of neural state variables of all regions, and $x_j^{(i)}\ (j = 1,\dots,5)$ are the five state variables that together form the state vector $x^{(i)} := (x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, x_4^{(i)}, x_5^{(i)})^T$ of the $i$th region of interest. We discuss the differential equations for each state variable in more detail below.
Like the evolution function, the observation function partitions region-wise, such that it takes the form

$g: X \times \Theta_g \to Y,\ (x, \theta_g) \mapsto \left(g^{(i)}(x, \theta_g)\right)_{i=1,\dots,m} = \begin{pmatrix} g^{(1)}(x, \theta_g) \\ g^{(2)}(x, \theta_g) \\ \vdots \\ g^{(m)}(x, \theta_g) \end{pmatrix}$   (10)

with

$g^{(i)}: X \times \Theta_g \to \mathbb{R},\ (x, \theta_g) \mapsto g^{(i)}(x, \theta_g) := V_0\left(c_1\left(1 - \tilde{x}_5^{(i)}\right) + c_2^{(i)}\left(1 - \frac{\tilde{x}_5^{(i)}}{\tilde{x}_4^{(i)}}\right) + c_3^{(i)}\left(1 - \tilde{x}_4^{(i)}\right)\right)$   (11)

where $c_1 = 4.3\nu_0 E_0 T_E$ is a region-independent constant, and $c_2^{(i)} = \epsilon_0^{(i)}r_0 E_0 T_E$ and $c_3^{(i)} = 1 - \epsilon_0^{(i)}$ are region-dependent constants.
(2) The neural state evolution function
The form of the evolution function for the neural state variable,

$$ \dot{x}_1^{(i)} = a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right) x_1 + c_i u, \qquad (1) $$

is motivated by the following considerations. Let $x_1 := (x_1^{(1)},\ldots,x_1^{(m)})^T \in S \subseteq \mathbb{R}^m$ denote a state vector comprising the neural state variables of a set of $m \in \mathbb{N}$ regions, and let $\Theta_1 \subset \Theta$ denote the parameter subspace for parameters governing the evolution of the first state variable of each regional system. Consider the system's evolution function in the form

$$ \dot{x}_1 = \phi(x_1, u, \theta_1) \qquad (2) $$
In DCM for FMRI, the evolution function for the neural state $x_1$ is defined as

$$ \phi : S \times U \times \Theta_1 \to \mathbb{R}^m,\; (x_1,u,\theta_1) \mapsto \phi(x_1,u,\theta_1) := A x_1 + \left(\sum_{k=1}^{l} u_k B^k\right) x_1 + C u \qquad (3) $$

where $\theta_1 := \{A, B^1, \ldots, B^l \in \mathbb{R}^{m\times m}, C \in \mathbb{R}^{m\times l}\}$ is a set of matrices. The first term in (3) encodes a (possible) contribution of each system variable's current state to the rate of change of each system variable, independent of the system input function $u$; the second term in (3) encodes a (possible) interactive contribution of the system's current state and the system inputs; and the third term encodes a (possible) contribution of the system's input function to the rate of change of each system variable. Intuitively, if one assumes that the input functions $u_k$ take on only the values 1 and 0, then the last term in (3) represents the contribution of the inputs to the rate of change of the state variables, scaled by the values in the matrix $C$. Likewise, the middle term represents a condition-dependent contribution of the neural state variables, scaled by the values in the condition-specific matrices $B^1, \ldots, B^l$, to the rate of change of the neural state vector. Together, the parameters in $\theta_1$ thus specify how neural activity at a given time-point and in a given region-of-interest influences the evolution of other neural activities. Intuitively, these parameters thus capture the notion of "effective connectivity", informally defined as the "effect neural activity in one brain region has on neural activity in another brain region".
As an example, consider $m = 3$ and $l = 2$, i.e. a neural system comprising three regions, and thus three neural state variables, and two input functions (e.g. the evolution of two experimental conditions over time). Then (1) takes the form

$$ \dot{x}_1(t) = A x_1(t) + \left(u_1(t)B^1 + u_2(t)B^2\right)x_1(t) + C u(t) \qquad (4) $$

where $x_1(t) \in \mathbb{R}^3$, $u(t) \in \mathbb{R}^2$, $A, B^1, B^2 \in \mathbb{R}^{3\times3}$ and $C \in \mathbb{R}^{3\times2}$. Writing out the above explicitly results in
$$ \begin{pmatrix} \dot{x}_1^{(1)}(t) \\ \dot{x}_1^{(2)}(t) \\ \dot{x}_1^{(3)}(t) \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1^{(1)}(t) \\ x_1^{(2)}(t) \\ x_1^{(3)}(t) \end{pmatrix} + \left( u_1(t) \begin{pmatrix} b_{11}^1 & b_{12}^1 & b_{13}^1 \\ b_{21}^1 & b_{22}^1 & b_{23}^1 \\ b_{31}^1 & b_{32}^1 & b_{33}^1 \end{pmatrix} + u_2(t) \begin{pmatrix} b_{11}^2 & b_{12}^2 & b_{13}^2 \\ b_{21}^2 & b_{22}^2 & b_{23}^2 \\ b_{31}^2 & b_{32}^2 & b_{33}^2 \end{pmatrix} \right) \begin{pmatrix} x_1^{(1)}(t) \\ x_1^{(2)}(t) \\ x_1^{(3)}(t) \end{pmatrix} + \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{pmatrix} \begin{pmatrix} u_1(t) \\ u_2(t) \end{pmatrix} \qquad (5) $$
and we see that indeed

$$ \dot{x}_1^{(i)} = a_i x_1 + \left(\sum_{k=1}^{l} u_k b_i^k\right) x_1 + c_i u \qquad (6) $$

for vectors $a_i \in \mathbb{R}^{1\times m}$, $b_i^k \in \mathbb{R}^{1\times m}$ $(k = 1,\ldots,l)$ and $c_i \in \mathbb{R}^{1\times l}$ as specified in the previous section.
To obtain an intuition about the dynamic repertoire of neural systems specified as in (3), we first consider the case of $m = 2$ and $l = 1$ with $B^1 = 0$. We thus obtain

$$ \dot{x}_1(t) = A x_1(t) + C u(t) \qquad (7) $$

where $x_1(t) \in \mathbb{R}^2$, $A \in \mathbb{R}^{2\times2}$, $C \in \mathbb{R}^{2\times1}$ and $u(t) \in \mathbb{R}$. For $C := (1,0)^T$, modelling a positive influence of the input function on the rate of change of the first neural state variable $x_1^{(1)}$, a "Gaussian-like" input function

$$ u : [0,T] \to \mathbb{R}_+,\; t \mapsto u(t) := \exp\left(-\frac{1}{\sigma^2}(t-\mu)^2\right) \qquad (8) $$

and the initial condition $x_1(0) = (0,0)^T$, we consider the scenarios
$$ A^{(1)} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix},\quad A^{(2)} = \begin{pmatrix} -1 & 0 \\ 0.9 & -1 \end{pmatrix},\quad A^{(3)} = \begin{pmatrix} -1 & 0.9 \\ 0 & -1 \end{pmatrix} \quad\text{and}\quad A^{(4)} = \begin{pmatrix} -1 & 0.9 \\ 0.9 & -1 \end{pmatrix} \qquad (9) $$
Intuitively, the diagonal entries of the matrices in (9) model a self-inhibitory dynamic, that is, the rate of change of each neural state is negatively proportional to its current state. In addition, $A^{(1)}$ models the case of no connections between the neural state variables, $A^{(2)}$ models the case of a positive influence of neural state $x_1^{(1)}$ on the rate of change of neural state $x_1^{(2)}$, $A^{(3)}$ models the case of a positive influence of neural state $x_1^{(2)}$ on neural state $x_1^{(1)}$, and $A^{(4)}$ models the case of positive influences of both state variables on each other. The dynamics resulting from these scenarios are visualized in Figure 1.
Figure 1. Neural system scenarios based on equations (7)–(9). The left panels depict the intuitive connectivity structure that is represented by the $A^{(i)} \in \mathbb{R}^{2\times2}$ and $C \in \mathbb{R}^{2\times1}$ matrices of the current example system. Note that the self-inhibitory values on the diagonal of the $A^{(i)}$ result in the dissipation of neural state activity over time. Further note that from the structure of $A^{(1)}$ and $A^{(3)}$ it follows that $x_1^{(2)}$ remains at the baseline level of 0.
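The scenarios of Figure 1 can be reproduced with a few lines of code. The following is a minimal sketch that integrates the example system (7) with the input (8) using a simple Euler scheme; the step size and the input parameters μ and σ are illustrative choices, not values prescribed in the text.

```python
import numpy as np

# Euler integration of the two-region example system (7)-(9):
# x1'(t) = A x1(t) + C u(t), driven by a "Gaussian-like" input u(t).
# Step size dt and input parameters mu, sigma are illustrative choices.

def simulate(A, T=20.0, dt=0.01, mu=2.0, sigma=1.0):
    C = np.array([1.0, 0.0])          # input drives the first region only
    n = int(T / dt)
    x = np.zeros(2)                   # initial condition x1(0) = (0, 0)^T
    trajectory = np.zeros((n, 2))
    for k in range(n):
        t = k * dt
        u = np.exp(-(t - mu) ** 2 / sigma ** 2)   # input function (8)
        x = x + dt * (A @ x + C * u)              # Euler step for (7)
        trajectory[k] = x
    return trajectory

A1 = np.array([[-1.0, 0.0], [0.0, -1.0]])  # scenario A(1): no connections
A2 = np.array([[-1.0, 0.0], [0.9, -1.0]])  # scenario A(2): region 1 drives region 2

traj1 = simulate(A1)
traj2 = simulate(A2)

# In scenario A(1) the second region is never driven and stays at baseline 0,
# while in scenario A(2) it transiently deviates from baseline.
print(np.max(np.abs(traj1[:, 1])), np.max(np.abs(traj2[:, 1])))
```

Swapping in the remaining matrices $A^{(3)}$ and $A^{(4)}$ reproduces the other two panels of Figure 1.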
(3) Interpretation of the hemodynamic evolution and observer functions
In the following, we discuss the intuition of the state variables $x_2^{(i)}$, $x_3^{(i)}$, $x_4^{(i)}$ and $x_5^{(i)}$ for each of the $m$ regions of interest. Because the intuition is identical over regions, we drop the superscript $(i)$ in the following discussion. The reader should nevertheless be aware that these state variables exist for each of the regions and that the parameters governing the evolution of these states can either be assumed to be regionally specific or common to all regions.
For each region, the second and third state variables model the dynamics of neurovascular coupling. State variable $x_2$ models the evolution of a vasodilatory signal and state variable $x_3$ models the evolution of the regional blood flow. More specifically, the system of ODEs

$$ \text{Vasodilatory signal:}\quad \dot{x}_2 = x_1 - \kappa_s x_2 - \kappa_f(\tilde{x}_3 - 1) \qquad (7) $$
$$ \text{Blood flow:}\quad \dot{x}_3 = x_2 / \tilde{x}_3 \qquad (8) $$
establishes a link between the value of the regional neural state variable and the regional blood flow. As evident from (7), the change of the vasodilatory signal is positively related to the neural activity state and negatively related to the signal itself, modelling a signal decay. Finally, the change of the vasodilatory signal is negatively related to the regional blood flow. The change of regional blood flow $\dot{x}_3$ is positively related to the vasodilatory signal state $x_2$ and scales inversely with the exponentiated flow state $\tilde{x}_3$.
The fourth and fifth state variables of each region model the blood-flow-induced changes of blood volume and deoxyhemoglobin content. More specifically, this subsystem describes the behaviour of the post-capillary venous compartment by analogy to an inflated balloon, and models the evolution of blood volume $x_4$ and deoxyhemoglobin content $x_5$. The evolution equations are given by

$$ \text{Blood volume:}\quad \dot{x}_4 = \frac{1}{\tau_0 \tilde{x}_4}\left(\tilde{x}_3 - \tilde{x}_4^{1/\alpha}\right) \qquad (9) $$
$$ \text{Deoxyhemoglobin:}\quad \dot{x}_5 = \frac{\tilde{x}_3\left(1-(1-E_0)^{1/\tilde{x}_3}\right)}{E_0} - \tilde{x}_4^{1/\alpha}\,\frac{\tilde{x}_5}{\tilde{x}_4} \qquad (10) $$
The motivation for these equations is as follows. The rate of change of volume $x_4$ is given by the difference of the current inflow $\tilde{x}_3$ and a volume-dependent outflow $\tilde{x}_4^{1/\alpha}$. Notably, this form of outflow models the balloon-like characteristic of the venous compartment to expel blood at a greater rate when distended. Finally, the rate of change of deoxyhemoglobin reflects the delivery of deoxyhemoglobin into the venous compartment minus that expelled. More specifically, the first term in (10) reflects the product of the current blood inflow $\tilde{x}_3$ and a blood-flow-dependent model of oxygen extraction given by $(1-(1-E_0)^{1/\tilde{x}_3})/E_0$, where $E_0$ denotes a resting oxygen extraction fraction. The second term in (10) denotes the product of the volume-dependent blood outflow $\tilde{x}_4^{1/\alpha}$ and the concentration of deoxyhemoglobin per blood volume, $\tilde{x}_5/\tilde{x}_4$.
Finally, the observer function

$$ \text{BOLD signal:}\quad g : \mathcal{X} \times \Theta \to \mathbb{R},\; (x,\theta) \mapsto g(x,\theta) := V_0\left(c_1(1-\tilde{x}_5) + c_2\left(1-\frac{\tilde{x}_5}{\tilde{x}_4}\right) + c_3(1-\tilde{x}_4)\right) \qquad (11) $$

with constants $c_1 := 4.3\nu_0 E_0 T_E$, $c_2 := \epsilon_0 r_0 E_0 T_E$, and $c_3 := 1-\epsilon_0$ relates the state vector of the regional neurovascular system to the observed MR signal change, or "BOLD signal" for short. Here, $\nu_0 = 40.3$ denotes a frequency offset at the outer surface of magnetized vessels in Hz, $E_0 = 0.4$ denotes the resting-state oxygen extraction fraction, $T_E = 0.04$ denotes the echo time in seconds, $\epsilon_0$ denotes the ratio of intra- to extravascular MR signal and is estimated for each region, and $r_0 = 25$ denotes the slope of the intravascular relaxation rate as a function of oxygen saturation.
In Figure 2 we depict the evolution of the hemodynamic and BOLD signal variables for a pre-specified evolution of the neural system state of a single region. Notably, an increase in neural activity results in an increase of the vasodilatory signal and blood flow, and a delayed increase in blood volume, while these three mechanisms are accompanied by a decrease of the local deoxyhemoglobin concentration. Based on the observer function, the predicted BOLD signal change exhibits the typical increase upon stimulation and a post-stimulus undershoot.
Figure 2. Regional hemodynamic and BOLD signal components of DCMs for FMRI. The uppermost panel depicts the evolution of the neural system variable $x_1(t)$ of a single region, here simulated by a Gaussian centered at time 0. The middle panel depicts the resulting evolution of the hemodynamic system state variables $x_2(t)$, $x_3(t)$, $x_4(t)$, and $x_5(t)$, modelling the region-specific vasodilatory signal, blood flow, blood volume, and deoxyhemoglobin content, respectively. Note that neural activation induces an increase in the vasodilatory signal, followed by an increase in blood flow and, with some delay, an increase in regional blood volume, while it induces a decrease in deoxyhemoglobin content. The changes in blood volume and deoxyhemoglobin in turn result in an increase of the regional BOLD signal.
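A response of the kind depicted in Figure 2 can be reproduced qualitatively with a simple Euler integration of equations (7)-(11) for a single region. The rate constants κ_s, κ_f, τ_0, α, V_0 and ε_0 used below are illustrative values, not values given in the text; only ν_0, E_0, T_E and r_0 are taken from the constants stated above.

```python
import numpy as np

# Euler integration of the hemodynamic equations (7)-(10) and the observer
# function (11) for a single region, driven by a Gaussian neural activity
# pulse. kappa_s, kappa_f, tau_0, alpha, V_0, epsilon_0 are illustrative.
ks, kf, tau0, alpha, V0, eps0 = 0.65, 0.41, 0.98, 0.32, 0.04, 1.0
nu0, E0, TE, r0 = 40.3, 0.4, 0.04, 25.0          # constants from the text
c1, c2, c3 = 4.3 * nu0 * E0 * TE, eps0 * r0 * E0 * TE, 1.0 - eps0

dt, T = 0.01, 30.0
n = int(T / dt)
x = np.zeros(4)               # states x2..x5; x3..x5 live in log space
bold = np.zeros(n)
for k in range(n):
    t = k * dt
    x1 = np.exp(-(t - 3.0) ** 2)                  # prescribed neural pulse
    x2, x3, x4, x5 = x
    f, v, q = np.exp(x3), np.exp(x4), np.exp(x5)  # exponentiated states
    dx2 = x1 - ks * x2 - kf * (f - 1.0)           # vasodilatory signal (7)
    dx3 = x2 / f                                  # blood flow (8)
    dx4 = (f - v ** (1.0 / alpha)) / (tau0 * v)   # blood volume (9)
    dx5 = f * (1.0 - (1.0 - E0) ** (1.0 / f)) / E0 - v ** (1.0 / alpha) * q / v  # dHb (10)
    x = x + dt * np.array([dx2, dx3, dx4, dx5])
    bold[k] = V0 * (c1 * (1.0 - q) + c2 * (1.0 - q / v) + c3 * (1.0 - v))  # observer (11)

print(bold.max(), bold.min())
```

As in Figure 2, the neural pulse drives a transient increase in flow and volume, a washout of deoxyhemoglobin, and hence a positive predicted BOLD signal change.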
(4) The probabilistic form of DCM for FMRI
To render DCM for FMRI a probabilistic model that allows for parameter inference and model
comparison based on observed data, the structural form discussed above is embedded into a parametric
Gaussian framework in discrete time. To this end, let

$$ Y := \begin{pmatrix} Y_1 \\ \vdots \\ Y_m \end{pmatrix} \qquad (1) $$

denote the concatenated time-series data from $m$ regions of interest, where $Y_i \in \mathbb{R}^n$ corresponds to the $n \in \mathbb{N}$ measurements taken for region $i$. We may conceive the structural form of DCM for FMRI as discussed previously as a mapping

$$ h : U \times \Theta_f \times \Theta_g \to \mathbb{R}^{mn},\; (u,\theta_f,\theta_g) \mapsto h(u,\theta_f,\theta_g) \qquad (2) $$

which, using an appropriately chosen discretization method, maps an input function $u$ and parameters $\theta_f$ and $\theta_g$ onto the concatenated predicted discrete MR signal time-series of the $m$ regions. The likelihood model of DCM for FMRI then takes the following form

$$ Y = h(u,\theta_f,\theta_g) + X\theta_h + \varepsilon \qquad (3) $$
In (3),

$$ X := \mathrm{diag}(X_1, X_2, \ldots, X_m) \in \mathbb{R}^{mn \times mq} \qquad (4) $$

denotes a block-diagonal matrix comprising a set of $q \in \mathbb{N}$ "nuisance" regressors for each region. Typically, these regressors are chosen as a set of discrete-time, discrete-frequency cosine functions which are used to model slow drifts of the observed signals $Y_1, \ldots, Y_m$, and $\theta_h \in \mathbb{R}^{mq}$ is a further parameter vector. $\varepsilon$ is an additive Gaussian error term, i.e. $p(\varepsilon) = N(\varepsilon; 0, \Sigma)$ with expectation $0 \in \mathbb{R}^{mn}$ and positive-definite covariance matrix $\Sigma \in \mathbb{R}^{mn\times mn}$, which is often chosen to be of the form $\Sigma := \sum_{i=1}^{k} \lambda_i Q_i$ for parameters $\lambda_i \in \mathbb{R}$ and covariance basis matrices $Q_i \in \mathbb{R}^{mn\times mn}$.
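As a sketch of how the nuisance regressor matrix in (4) may be constructed, the following builds a discrete cosine set per region and stacks it block-diagonally; the choices of n, q and m, as well as the particular cosine basis, are illustrative.

```python
import numpy as np

n, q, m = 100, 3, 2          # scans per region, regressors, regions (illustrative)
t = np.arange(n)
# discrete-time, discrete-frequency cosine set modelling slow signal drifts
X_i = np.stack([np.cos(np.pi * k * (2 * t + 1) / (2 * n)) for k in range(q)], axis=1)
# block-diagonal stacking over regions: X = diag(X_1, ..., X_m) as in (4)
X = np.zeros((m * n, m * q))
for i in range(m):
    X[i * n:(i + 1) * n, i * q:(i + 1) * q] = X_i
print(X.shape)   # (m*n, m*q)
```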
To simplify the variational treatment of (3) below, we set $\Theta := \{\Theta_f, \Theta_g, \Theta_h\}$ and summarize the structural DCM model function and the nuisance term in a single, input-function-specific function

$$ \psi_u : \Theta \to \mathbb{R}^{mn},\; \theta \mapsto \psi_u(\theta) \qquad (5) $$

Based on these conventions, the likelihood-embedding of DCM for FMRI thus takes the following form

$$ p(Y|\theta) = N(Y; \psi_u(\theta), \Sigma) \qquad (6) $$
Study questions
1. Which mathematical framework underlies deterministic dynamic causal models for FMRI?
2. DCM for FMRI is often portrayed as a framework to measure "effective connectivity". How is the notion of effective connectivity built into the DCM for FMRI framework?
3. DCM for FMRI is a model for the BOLD signal time-series of $m$ regions of interest. How many latent state variables are used to model the dynamics of each region and what do they model?
4. What does the observer function of DCM for FMRI describe?
Study Question Answers
1. Deterministic dynamical causal models for FMRI are based on latent systems of ordinary differential equations equipped with
nonlinear observation functions.
2. The evolution function of DCM for FMRI corresponds to a system of ordinary differential equations, some of which specify the
evolution of lumped neural activity variables in selected regions of interest. These equations are governed by parameters which can
be viewed as representations of the effect that neural state activity (or input activity) has on the rate of change of itself or on other
regions.
3. Five state variables are used to model the dynamics of each region. The first state variable models lumped neural activity, the
second state variable describes a vasodilatory signal, the third state variable models the region’s blood (in)flow, the fourth variable
models the region’s blood volume, and the fifth state variable models the regional concentration of deoxyhemoglobin.
4. The observer function of DCM for FMRI describes how local blood volume and deoxyhemoglobin concentration give rise to the
observed BOLD signal change.
Variational Bayesian inversion of deterministic dynamical models
(1) Model formulation
In this section, we consider the variational Bayesian estimation of the following static two-level
nonlinear Gaussian model
𝑦 = 𝑓(휃) + 휀 (1)
휃 = 𝜇𝜃 + 휂 (2)
where $y \in \mathbb{R}^n$, $\varepsilon \in \mathbb{R}^n$, $\theta \in \mathbb{R}^m$, $\mu_\theta, \eta \in \mathbb{R}^m$ $(n, m \in \mathbb{N})$, $\varepsilon$ and $\eta$ are distributed according to $p(\varepsilon) = N(\varepsilon; 0, \Sigma_y)$ and $p(\eta) = N(\eta; 0, \Sigma_\theta)$ with positive-definite $\Sigma_y \in \mathbb{R}^{n\times n}$ and $\Sigma_\theta \in \mathbb{R}^{m\times m}$, respectively, and
𝑓 ∶ ℝ𝑚 → ℝ𝑛, 휃 ↦ 𝑓(휃) (3)
is a differentiable, not necessarily linear, function3. To apply the principles of variational Bayes to the model
problem given by (1) and (2), we first consider the joint distribution corresponding to the structural form of
(1) and (2). To this end, we make the simplifying assumption that the parameters Σ𝑦 and Σ𝜃 that govern the
covariance of the probabilistic error 휀 and 휂 are known (these parameters are sometimes referred to as
“hyperparameters”). We further assume that the expectation parameter 𝜇𝜃 of the unobserved random
variable 휃 is known as well. We then have the random variables 𝑦 and 휃 which are governed by the
following joint distribution
𝑝(𝑦, 휃) = 𝑝(𝑦|휃)𝑝(휃) (4)
The conditional distribution of the observed random variable 𝑦 is specified by (1), and the marginal
distribution of the unobserved random variable 휃 is specified by (2). In functional form, we can thus write
the joint distribution (4) as the product of two multivariate Gaussian distributions
𝑝(𝑦, 휃) = 𝑁(𝑦; 𝑓(휃), Σ𝑦)𝑁(휃; 𝜇𝜃, Σ𝜃) (5)
To summarize, we have specified a joint distribution over one observed random variable 𝑦 and one
unobserved random variable 휃 that is given by the product of two multivariate Gaussian distributions and
parameterized by an additional set of known parameters Σ𝑦, 𝜇𝜃 and Σ𝜃. Given an observed value 𝑦∗ of the
observed random variable, the aim is now to obtain an approximation to the posterior distribution of the
unobserved random variables 휃.
(2) Fixed-form evaluation of the variational free energy
To achieve this aim, the standard approach in the DCM literature is to approximate the posterior distribution $p(\theta|y)$ (a) using a variational mean-field approximation, which is redundant here, as there is only a single unobserved random variable, and (b) using a fixed-form VB approach, by setting the variational distribution to a Gaussian distribution from the outset:
3 Note that with respect to deterministic models for fMRI, we have set 𝑛 ≔ 𝑛𝑚, 𝑦 ≔ 𝑌, 𝑓 ≔ 𝜓𝑢 and have assumed that Θ = ℝ𝑚. In
addition, we have introduced a Gaussian distribution over the parameter vector 휃.
𝑞(휃) ≔ 𝑁(휃;𝑚𝜃, 𝑆𝜃) (1)
(note that in this Section we use roman letters to distinguish the variational parameters from the
parameters of the marginal distributions of the joint distribution). Somewhat unfortunately, this latter
assumption is referred to in the DCM literature as “Laplace approximation”. This is unfortunate, because the
term “Laplace approximation” is used in the general machine learning and statistics literature for the
approximation of an arbitrary probability density function with a Gaussian density and not for the definition
of a variational distribution in terms of a Gaussian distribution. As discussed above the latter is usually
referred to as a “fixed form” variational distribution and the ensuing inversion scheme as “fixed-form
variational Bayes”. Above we noted that fixed-form VB has the benefit that a probabilistic calculus problem
is reformulated as a nonlinear optimization problem, as we will sketch in the following.
Recall that the principal mechanism of variational Bayes is to maximize the variational free energy

$$ \mathcal{F}(q(\vartheta)) = \int q(\vartheta) \ln\left(\frac{p(y,\vartheta)}{q(\vartheta)}\right) d\vartheta \qquad (2) $$
with respect to the variational distribution 𝑞(𝜗) over the unobserved random variables 𝜗. Also recall that
this is a problem of variational calculus, because the free energy as defined in (2) is a function of a
(probability density) function. Under a fixed-form assumption as in (1), however, all probability densities
involved in the definition of the variational energy are known and can be substituted. In effect, this renders
the variational free energy a function of the variational parameters and transforms a problem of variational
calculus to a nonlinear optimization problem in multivariate calculus, which can be addressed using the
standard machinery of nonlinear optimization algorithms. In the current case, the unobserved random
variables 𝜗 correspond to 휃 and substitution yields the following form of the variational free energy
$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (3) $$

where

$$ F(m_\theta, S_\theta) := \int N(\theta; m_\theta, S_\theta) \ln\left(\frac{N(y; f(\theta), \Sigma_y)\,N(\theta; \mu_\theta, \Sigma_\theta)}{N(\theta; m_\theta, S_\theta)}\right) d\theta \qquad (4) $$
Note that we have changed the symbol for the variational free energy from $\mathcal{F}$ to $F$ to indicate that instead of a functional we are now dealing with a function of real-valued parameters. From a mathematical perspective, it is worth noting that the fixed-form reformulation of the variational free energy by no means results in a trivial problem. First, the value of the function $F$ is defined by an integral term involving the nonlinear function $f$ which, as discussed below, can be evaluated analytically only approximately. This is important, because it calls into question the validity of the optimized free energy as an approximation to the log marginal probability. However, as of now, the magnitude of the ensuing approximation error as a function of the degree of nonlinearity of $f$ does not seem to have been systematically studied in the literature. Second, the function $F$ is not a "simple" real-valued multivariate function, in the sense that its arguments are not only real vectors, but also covariance parameters, which have a predefined structure, i.e. they are positive-definite matrices. Fortunately, optimization of the function $F$ with respect to these parameters can be achieved analytically, as discussed below. We next provide the functional form of the free energy function (4), including its derivation (which, surprisingly, is largely absent from the DCM literature), and then proceed to discuss its optimization with respect to the variational parameters $m_\theta$ and $S_\theta$.
Using a multivariate first-order Taylor approximation of the nonlinear function $f$ and setting $\Sigma_y := \lambda_y^{-1} I_n$, the function defined in (3) and (4) can be approximated by

$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (5) $$

where

$$ F(m_\theta, S_\theta) := -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) - \frac{1}{2}\ln|\Sigma_\theta| + \frac{1}{2}\ln|S_\theta| \qquad (6) $$

where $\mathrm{tr}$ denotes the trace operator, $J_f(m_\theta)$ denotes the Jacobian matrix of the function $f$ evaluated at the variational expectation parameter, and a number of constant terms have been removed for ease of presentation. Note that, formally, we should have used different symbols for the function defined in (2) and (4) and its approximation provided in (5) and (6). We provide a derivation of (6) from (4) below.
Proof of (6)
For the derivation of (6), we require the following property of expectations of multivariate random variables $x \in \mathbb{R}^d$ under Gaussian distributions, which we state without proof.

Gaussian expectation theorem. For $x, m, \mu \in \mathbb{R}^d$, positive-definite $\Sigma \in \mathbb{R}^{d\times d}$ and $A \in \mathbb{R}^{d\times d}$,

$$ \left\langle (x-m)^T A (x-m) \right\rangle_{N(x;\mu,\Sigma)} = (\mu-m)^T A (\mu-m) + \mathrm{tr}(A\Sigma) \qquad (6.1) $$

where

$$ \mathrm{tr} : \mathbb{R}^{d\times d} \to \mathbb{R},\; M := (m_{ij})_{1\le i,j\le d} \mapsto \mathrm{tr}(M) := \sum_{i=1}^{d} m_{ii} \qquad (6.2) $$

denotes the trace operator, i.e. the sum of the diagonal elements of its argument. Proofs of the above can be found for example in (Petersen et al., 2006) and in the references therein. We are concerned with the following joint distribution
𝑝(𝑦, 휃) = 𝑝(𝑦|휃)𝑝(휃) (6.3)
where
𝑝(𝑦|휃) ≔ 𝑁(𝑦; 𝑓(휃), Σ𝑦) and 𝑝(휃) ≔ 𝑁(휃; 𝜇𝜃 , Σ𝜃) (6.4)
and the variational distribution
𝑞(휃) = 𝑁(휃;𝑚𝜃 , 𝑆𝜃) (6.5)
Based on this notational simplification, we now consider the variational free energy integral. Using the properties of the logarithm and the linearity of the integral, we first note that

$$ \mathcal{F}(q(\theta)) = \int q(\theta) \ln\left(\frac{p(y,\theta)}{q(\theta)}\right) d\theta = \int q(\theta)\left(\ln p(y,\theta) - \ln q(\theta)\right) d\theta = \int q(\theta) \ln p(y,\theta)\, d\theta - \int q(\theta) \ln q(\theta)\, d\theta \qquad (6.6) $$

Of the remaining two integral terms, the latter corresponds to the negative differential entropy of a multivariate Gaussian distribution and is well known to correspond to a nonlinear function of the variational covariance parameter $S_\theta$:

$$ \int q(\theta) \ln q(\theta)\, d\theta = -\mathcal{H}\left(N(\theta; m_\theta, S_\theta)\right) = -\frac{1}{2}\ln|S_\theta| - \frac{m}{2}\ln(2\pi e) \qquad (6.7) $$
There thus remains the evaluation of the first integral term, which corresponds to the expectation of the log joint probability density of the observed and unobserved random variables under the variational distribution of the unobserved random variables:

$$ \int q(\theta) \ln p(y,\theta)\, d\theta = \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} \qquad (6.8) $$
Substitution of the functional form of $p(y,\theta)$ (cf. (6.3) and (6.4)) then results, using the current notation for expectations, in

$$ \begin{aligned} \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} &= \left\langle \ln\left(N(y; f(\theta), \Sigma_y)\,N(\theta; \mu_\theta, \Sigma_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \ln N(y; f(\theta), \Sigma_y) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle \ln N(\theta; \mu_\theta, \Sigma_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \ln\left((2\pi)^{-\frac{n}{2}} |\Sigma_y|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(y-f(\theta))^T \Sigma_y^{-1} (y-f(\theta))\right)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\quad + \left\langle \ln\left((2\pi)^{-\frac{m}{2}} |\Sigma_\theta|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta)\right)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}\left\langle (y-f(\theta))^T \Sigma_y^{-1} (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\quad - \frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}\left\langle (\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \end{aligned} \qquad (6.9) $$
There thus remain two expectation terms. Of these, the latter can be evaluated readily using the Gaussian expectation properties introduced above. Specifically, we have

$$ \begin{aligned} \left\langle (\theta-\mu_\theta)^T \Sigma_\theta^{-1} (\theta-\mu_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= \left\langle \theta^T \Sigma_\theta^{-1}\theta - \theta^T \Sigma_\theta^{-1}\mu_\theta - \mu_\theta^T \Sigma_\theta^{-1}\theta + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle \theta^T \Sigma_\theta^{-1}\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - 2\mu_\theta^T \Sigma_\theta^{-1}\left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \\ &= \mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + m_\theta^T \Sigma_\theta^{-1} m_\theta - 2\mu_\theta^T \Sigma_\theta^{-1} m_\theta + \mu_\theta^T \Sigma_\theta^{-1}\mu_\theta \\ &= \mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + (m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \end{aligned} \qquad (6.10) $$
There thus remains the evaluation of the first expectation term on the right-hand side of (6.9). To simplify proceedings, we assume in the following that $\Sigma_y := \lambda_y^{-1} I_n$, such that we only need to consider the term

$$ \left\langle (y-f(\theta))^T (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle y^T y - y^T f(\theta) - f(\theta)^T y + f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} = y^T y - 2y^T \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \qquad (6.11) $$
We are hence led to the evaluation of the expectation of a Gaussian random variable $\theta$ under the nonlinear transformation $f$. The (apparent) idea of the DCM literature is to approximate the function $f$ by a multivariate first-order Taylor expansion in order to evaluate the remaining expectations (see (Chappell, Groves, Whitcher, & Woolrich, 2009) for an explicit discussion of this approach). Denoting the Jacobian matrix of $f$ evaluated at the variational expectation parameter $m_\theta$ as the function

$$ J_f : \mathbb{R}^m \to \mathbb{R}^{n\times m},\; m_\theta \mapsto J_f(m_\theta) := \left.\left(\frac{d}{d\theta} f\right)\right|_{\theta=m_\theta} := \begin{pmatrix} \frac{\partial}{\partial\theta_1} f_1(m_\theta) & \cdots & \frac{\partial}{\partial\theta_m} f_1(m_\theta) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial\theta_1} f_n(m_\theta) & \cdots & \frac{\partial}{\partial\theta_m} f_n(m_\theta) \end{pmatrix} \qquad (6.12) $$
we thus write

$$ f(\theta) \approx f(m_\theta) + J_f(m_\theta)(\theta - m_\theta) \qquad (6.13) $$

By replacing $f(\theta)$ in the first expectation on the right-hand side of (6.11) with the approximation (6.13), we then obtain

$$ \begin{aligned} \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &\approx \left\langle f(m_\theta) + J_f(m_\theta)(\theta - m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= f(m_\theta) + J_f(m_\theta)\left(\left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - m_\theta\right) \\ &= f(m_\theta) + J_f(m_\theta)(m_\theta - m_\theta) \\ &= f(m_\theta) \end{aligned} \qquad (6.14) $$
Further, replacing $f(\theta)$ in the second expectation on the right-hand side of (6.11) with the approximation (6.13), we obtain

$$ \begin{aligned} \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &\approx \left\langle \left(f(m_\theta) + J_f(m_\theta)(\theta-m_\theta)\right)^T \left(f(m_\theta) + J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= \left\langle f(m_\theta)^T f(m_\theta) + 2 f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) + \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &= f(m_\theta)^T f(m_\theta) + 2 f(m_\theta)^T J_f(m_\theta)\left\langle \theta-m_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} \end{aligned} \qquad (6.15) $$

Considering the first of the remaining expectations yields

$$ \left\langle \theta - m_\theta \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle \theta \right\rangle_{N(\theta;m_\theta,S_\theta)} - m_\theta = m_\theta - m_\theta = 0 \qquad (6.16) $$
To evaluate the second remaining expectation, we first rewrite it as

$$ \left\langle \left(J_f(m_\theta)(\theta-m_\theta)\right)^T \left(J_f(m_\theta)(\theta-m_\theta)\right) \right\rangle_{N(\theta;m_\theta,S_\theta)} = \left\langle (\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \qquad (6.17) $$

and note that $(\theta-m_\theta)^T \in \mathbb{R}^{1\times m}$, $J_f(m_\theta)^T \in \mathbb{R}^{m\times n}$, $J_f(m_\theta) \in \mathbb{R}^{n\times m}$ and $(\theta-m_\theta) \in \mathbb{R}^{m\times 1}$. Application of the Gaussian expectation theorem (6.1) then yields

$$ \begin{aligned} \left\langle (\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(\theta-m_\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= (m_\theta-m_\theta)^T J_f(m_\theta)^T J_f(m_\theta)(m_\theta-m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ &= \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \end{aligned} \qquad (6.18) $$
We thus have

$$ \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \approx f(m_\theta)^T f(m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \qquad (6.19) $$

In summary, we obtain the following approximation for the expectation on the left-hand side of (6.11):

$$ \begin{aligned} \left\langle (y-f(\theta))^T (y-f(\theta)) \right\rangle_{N(\theta;m_\theta,S_\theta)} &= y^T y - 2y^T \left\langle f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} + \left\langle f(\theta)^T f(\theta) \right\rangle_{N(\theta;m_\theta,S_\theta)} \\ &\approx y^T y - 2y^T f(m_\theta) + f(m_\theta)^T f(m_\theta) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ &= (y-f(m_\theta))^T (y-f(m_\theta)) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \end{aligned} \qquad (6.20) $$
Concatenating the results (6.10) and (6.20), we have thus obtained the following approximation of the expectation of the log joint density of the observed and unobserved random variables under the variational distribution

$$ \begin{aligned} \left\langle \ln p(y,\theta) \right\rangle_{q(\theta)} \approx\; & -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln \lambda_y - \frac{\lambda_y}{2}\left((y-f(m_\theta))^T(y-f(m_\theta)) + \mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right)\right) \\ & -\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + (m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta)\right) \end{aligned} \qquad (6.21) $$
Together with the result for the entropy term in (6.7), we have thus found the following approximation for the variational free energy functional under the Gaussian fixed-form assumption about $q(\theta)$:

$$ \begin{aligned} \mathcal{F}(q(\theta)) \approx\; & -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln \lambda_y - \frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) \\ & -\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_\theta| - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + \frac{1}{2}\ln|S_\theta| + \frac{m}{2}\ln(2\pi e) \end{aligned} \qquad (6.22) $$

Notably, upon evaluation of the defining integrals, the variational free energy can now be expressed as a function of the variational parameters $m_\theta$ and $S_\theta$. Because the presence of additive constants does not change the location of the optima of this free energy function, we omit these constants for ease of presentation in the following and in the main text, and write

$$ F : \mathbb{R}^m \times \mathbb{R}^{m\times m} \to \mathbb{R},\; (m_\theta, S_\theta) \mapsto F(m_\theta, S_\theta) \qquad (6.23) $$
where

$$ F(m_\theta, S_\theta) = -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) - \frac{1}{2}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) - \frac{1}{2}\ln|\Sigma_\theta| + \frac{1}{2}\ln|S_\theta| \qquad (6.24) $$

Equation (6.24) corresponds to equation (6) of the main text.
□
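As a numerical sanity check of the derivation above, the free energy expression (6.22) (including the additive constants) can be compared against a Monte Carlo estimate of the defining integral (4) for a linear model f(θ) = Aθ, for which the first-order Taylor step is exact. All numbers below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.normal(size=(n, m))               # linear model: f(theta) = A theta
lam_y = 2.0                               # Sigma_y = lam_y^{-1} I_n
mu, Sig = np.zeros(m), np.eye(m)          # prior N(mu_theta, Sigma_theta)
y = rng.normal(size=n)
m_t = rng.normal(size=m)                  # variational expectation m_theta
S_t = np.array([[0.5, 0.1], [0.1, 0.3]])  # variational covariance S_theta

# Analytic fixed-form free energy (6.22), with J_f = A for linear f:
r = y - A @ m_t
F = (-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(lam_y)
     - 0.5 * lam_y * (r @ r) - 0.5 * lam_y * np.trace(A.T @ A @ S_t)
     - 0.5 * m * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sig))
     - 0.5 * (m_t - mu) @ np.linalg.solve(Sig, m_t - mu)
     - 0.5 * np.trace(np.linalg.solve(Sig, S_t))
     + 0.5 * np.log(np.linalg.det(S_t)) + 0.5 * m * np.log(2 * np.pi * np.e))

# Monte Carlo estimate of the integral (4): E_q[ln p(y,theta) - ln q(theta)]
th = rng.multivariate_normal(m_t, S_t, size=200_000)
d1 = y - th @ A.T
lp_y = -0.5 * (n * np.log(2 * np.pi) - n * np.log(lam_y) + lam_y * (d1 ** 2).sum(1))
d2 = th - mu                              # prior covariance is the identity here
lp_th = -0.5 * (m * np.log(2 * np.pi) + (d2 ** 2).sum(1))
d3 = th - m_t
S_inv = np.linalg.inv(S_t)
lq = -0.5 * (m * np.log(2 * np.pi) + np.log(np.linalg.det(S_t))
             + np.einsum('ij,jk,ik->i', d3, S_inv, d3))
F_mc = np.mean(lp_y + lp_th - lq)
print(F, F_mc)   # should agree up to Monte Carlo error
```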
(3) Optimization of the variational free energy
Optimizing, i.e. finding a minimum or a maximum of, a nonlinear multivariate real-valued function of the form $h : \mathbb{R}^n \to \mathbb{R}$, such as the free energy function $F$ above, with respect to its input arguments is a central problem in the mathematical theory of nonlinear optimization. Intuitively, many methods of nonlinear optimization are based on a simple premise: from basic calculus we know that a necessary condition for an extremal point at a given location in the input space of a function is that the first derivative evaluates to zero at this point, i.e. the function is neither increasing nor decreasing. If one extends this idea to functions of multidimensional entities, one can show that one may maximize the function $F$ with respect to its input argument $S_\theta$ based on a simple formula. Omitting all terms of the function $F$ that do not depend on $S_\theta$, and which hence do not contribute to changes in the value of $F$ as $S_\theta$ changes, we write the first derivative of $F$ with respect to $S_\theta$ suggestively as

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} \qquad (1) $$

Setting the derivative of $F$ with respect to $S_\theta$ to zero and solving for the extremal argument $S_\theta^*$ then yields the following update rule for the variational covariance parameter:

$$ S_\theta^* = \left(\lambda_y J_f(m_\theta)^T J_f(m_\theta) + \Sigma_\theta^{-1}\right)^{-1} \qquad (2) $$
Proof of (1)

We only provide a heuristic proof to demonstrate the general idea. A formal mathematical proof would also require the characterization of the function $F$ as a concave function and a sensible notation for derivatives of functions of multivariate entities such as vectors and positive-definite matrices. Here, we use the notation for partial derivatives. We have

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2}\frac{\partial}{\partial S_\theta}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}\frac{\partial}{\partial S_\theta}\mathrm{tr}\left(\Sigma_\theta^{-1} S_\theta\right) + \frac{1}{2}\frac{\partial}{\partial S_\theta}\ln|S_\theta| \qquad (1.1) $$

which, using the following rules for matrix derivatives involving the trace operator and logarithmic determinants (cf. equations (103) and (57) in (Petersen et al., 2006))

$$ \frac{\partial}{\partial X}\mathrm{tr}(A X^T) = A \quad\text{and}\quad \frac{\partial}{\partial X}\ln|X| = (X^T)^{-1} \qquad (1.2) $$

with $S_\theta = S_\theta^T$, yields

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} \qquad (1.3) $$

Setting the above to zero then yields the equivalent relations

$$ \frac{\partial}{\partial S_\theta} F(m_\theta, S_\theta) = 0 \;\Leftrightarrow\; -\frac{\lambda_y}{2} J_f(m_\theta)^T J_f(m_\theta) - \frac{1}{2}\Sigma_\theta^{-1} + \frac{1}{2} S_\theta^{-1} = 0 \;\Leftrightarrow\; S_\theta^* = \left(\lambda_y J_f(m_\theta)^T J_f(m_\theta) + \Sigma_\theta^{-1}\right)^{-1} \qquad (1.4) $$

□
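The update rule (2) can be checked numerically: for a fixed Jacobian, the S_θ-dependent part of the free energy should attain its maximum at S_θ*. The Jacobian, precision and perturbation scales below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
J = rng.normal(size=(n, m))      # stands in for the Jacobian J_f(m_theta)
lam_y = 1.5
Sig_inv = np.eye(m)              # prior precision Sigma_theta^{-1}

def F_S(S):
    """The S_theta-dependent terms of the free energy function (6)."""
    _, logdet = np.linalg.slogdet(S)
    return (-0.5 * lam_y * np.trace(J.T @ J @ S)
            - 0.5 * np.trace(Sig_inv @ S)
            + 0.5 * logdet)

S_star = np.linalg.inv(lam_y * J.T @ J + Sig_inv)   # update rule (2)

# F is strictly concave in S_theta on the positive-definite cone, so random
# positive-definite perturbations of S* should never increase the value.
ok = True
for _ in range(100):
    E = rng.normal(size=(m, m)) * 0.01
    S_pert = S_star + 0.5 * (E + E.T) + 0.05 * np.eye(m)
    ok = ok and F_S(S_pert) <= F_S(S_star)
print(ok)
```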
In contrast to the variational covariance parameter, maximization of the variational free energy function with respect to $m_\theta$ cannot be achieved analytically, but requires an iterative numerical optimization algorithm. In the DCM literature, the algorithm employed to this end is fairly specific, but related to standard nonlinear optimization algorithms such as gradient and Newton methods. To simplify the notational complexity of the discussion below, we rewrite the function $F$ as a function of only the variational expectation parameter, assuming that it has been maximized with respect to the variational covariance parameter $S_\theta$ previously. The function of interest then takes the form

$$ F : \mathbb{R}^m \to \mathbb{R},\; m_\theta \mapsto F(m_\theta) \qquad (3) $$

where

$$ F(m_\theta) = -\frac{\lambda_y}{2}(y-f(m_\theta))^T(y-f(m_\theta)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}(m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \qquad (4) $$

Numerical optimization schemes usually work by "guessing" an initial value for the maximization argument of the nonlinear function under study and then iteratively updating this guess according to some update rule. A basic gradient ascent scheme for the function specified in (3) is provided in Table 1.
Initialization
0. Define a starting point $m_\theta^{(0)} \in \mathbb{R}^m$, a step-size $\kappa > 0$, and set $k := 0$. If $\nabla F(m_\theta^{(0)}) = 0$, stop: $m_\theta^{(0)}$ is a zero of $\nabla F$. If not, proceed to the iterations.

Until convergence
1. Set $m_\theta^{(k+1)} := m_\theta^{(k)} + \kappa \nabla F(m_\theta^{(k)})$.
2. If $\nabla F(m_\theta^{(k+1)}) = 0$, stop: $m_\theta^{(k+1)}$ is a zero of $\nabla F$. If not, go to 3.
3. Set $k := k+1$ and go to 1.

Table 1. Gradient ascent algorithm for the determination of an optimal variational expectation parameter $m_\theta$.
A prerequisite for the application of the algorithm described in Table 1 is the availability of the gradient of the function $F$ evaluated at $m_\theta^{(k)}$ for $k = 0, 1, 2, \ldots$. In the DCM literature, it is proposed to approximate this gradient analytically by omitting higher derivatives of the function $f$ with respect to $m_\theta$. The function (4) comprises first derivatives of the function $f$ with respect to $m_\theta$ in the form of the Jacobian $J_f(m_\theta)$ in the second term. If this term is omitted, the gradient of $F$ evaluates to

$$ \nabla F(m_\theta) = \lambda_y J_f(m_\theta)^T (y - f(m_\theta)) - \Sigma_\theta^{-1}(m_\theta - \mu_\theta) \qquad (5) $$

and the update rule for the variational expectation parameter takes the form

$$ m_\theta^{(k+1)} = m_\theta^{(k)} + \kappa\left(\lambda_y J_f\left(m_\theta^{(k)}\right)^T \left(y - f\left(m_\theta^{(k)}\right)\right) - \Sigma_\theta^{-1}\left(m_\theta^{(k)} - \mu_\theta\right)\right) \quad (k = 0, 1, \ldots) \qquad (6) $$
Proof of (5)

$$ \nabla F(m_\theta) = -\frac{\lambda_y}{2}\frac{\partial}{\partial m_\theta}\left((y-f(m_\theta))^T(y-f(m_\theta))\right) - \frac{\lambda_y}{2}\frac{\partial}{\partial m_\theta}\mathrm{tr}\left(J_f(m_\theta)^T J_f(m_\theta) S_\theta\right) - \frac{1}{2}\frac{\partial}{\partial m_\theta}\left((m_\theta-\mu_\theta)^T \Sigma_\theta^{-1}(m_\theta-\mu_\theta)\right) \qquad (5.1) $$

Notably, the second term above involves second-order derivatives of the function $f$ with respect to $m_\theta$. Following the DCM literature, we neglect this term and obtain, using the rules of the calculus of multivariate real-valued functions (Petersen et al., 2006) and the chain rule (note that $\frac{\partial}{\partial m_\theta}(y-f(m_\theta)) = -J_f(m_\theta)$),

$$ \nabla F(m_\theta) = -\frac{\lambda_y}{2}\cdot 2\left(-J_f(m_\theta)\right)^T (y-f(m_\theta)) - \frac{1}{2}\cdot 2\,\Sigma_\theta^{-1}(m_\theta-\mu_\theta) = \lambda_y J_f(m_\theta)^T (y-f(m_\theta)) - \Sigma_\theta^{-1}(m_\theta-\mu_\theta) \qquad (5.2) $$

□
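For a linear model f(θ) = Aθ, the maximizer of the free energy with respect to m_θ has a closed form, so the gradient ascent of Table 1 with the gradient (5) can be checked directly. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Gradient ascent (Table 1) with the approximate gradient (5) for a linear
# model f(theta) = A theta, for which the maximizer is available in closed
# form: m* = (lam_y A^T A + Sigma^{-1})^{-1} (lam_y A^T y + Sigma^{-1} mu).
rng = np.random.default_rng(2)
n, m = 6, 2
A = rng.normal(size=(n, m))
lam_y = 1.0
Sig_inv = np.eye(m)                   # prior precision Sigma_theta^{-1}
mu = np.zeros(m)
y = rng.normal(size=n)

def grad_F(m_t):
    # approximate gradient (5); J_f = A for linear f
    return lam_y * A.T @ (y - A @ m_t) - Sig_inv @ (m_t - mu)

m_t = np.zeros(m)                     # starting point m_theta^{(0)}
kappa = 0.02                          # illustrative step size
for _ in range(2000):                 # update rule (6)
    m_t = m_t + kappa * grad_F(m_t)

m_star = np.linalg.solve(lam_y * A.T @ A + Sig_inv, lam_y * A.T @ y + Sig_inv @ mu)
print(np.max(np.abs(m_t - m_star)))  # should be numerically zero
```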
From a technical viewpoint, plain gradient ascent schemes for the maximization of nonlinear functions are suboptimal and thus rarely employed in numerical computing. Furthermore, as can be shown in simple univariate examples, the approximated gradient can easily fail to reliably identify the necessary condition for an extremal point. A more robust method is a globalized Newton scheme with Hessian modification and numerically evaluated gradients and Hessians, as shown in Table 2.
Initialization
0. Define a starting point m_θ^(0) ∈ ℝ^m and set k := 0. If ∇F(m_θ^(0)) = 0, stop! m_θ^(0) is a zero of ∇F. If not, proceed to iterations.
Until Convergence
1. Evaluate the Newton search direction p_k := (H_F(m_θ^(k)))^{-1} ∇F(m_θ^(k)).
2. If p_k^T ∇F(m_θ^(k)) < 0, p_k is a descent direction. In this case, modify H_F(m_θ^(k)) to render it positive-definite and re-evaluate p_k.
3. Evaluate a step-size t_k fulfilling the sufficient increase condition (the first Wolfe condition) using the following backtracking algorithm: set t_k := 1 and select ρ ∈ ]0,1[, c ∈ ]0,1[. Until F(m_θ^(k) + t_k p_k) ≥ F(m_θ^(k)) + c t_k ∇F(m_θ^(k))^T p_k, set t_k := ρ t_k.
4. Set m_θ^(k+1) := m_θ^(k) + t_k p_k.
5. If ∇F(m_θ^(k+1)) = 0, stop! m_θ^(k+1) is a zero of ∇F. If not, go to 6.
6. Set k := k + 1 and go to 1.
Table 2. A globalized Newton method with Hessian modification
Intuitively, the algorithm described in Table 2 works by approximating the target function F at the current iterand m_θ^(k) by a second-order Taylor expansion and analytically determining an extremal point of this approximation at each iteration. The location of this extremal point corresponds to the search direction p_k. If, however, the Hessian H_F(m_θ^(k)) is not positive-definite, there is no guarantee that p_k is an ascent direction. This is often the case in regions far away from a local extremal point. Thus, a number of modification techniques have been developed that minimally change H_F(m_θ^(k)), but render it positive-definite, such that an ascent direction is obtained. Finally, on each iteration the Newton step-size t_k is determined so as to yield an increase in the target function while avoiding overly short steps. This approach is referred to as backtracking, and the conditions for sensible step-lengths are given by the Wolfe conditions. Notably, the algorithm shown in Table 2 is a standard approach for the iterative optimization of a nonlinear function, and hence analytical results on its performance bounds are available.
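A sketch of the scheme in Table 2 with numerically evaluated gradients and Hessians is given below. The eigenvalue-based Hessian modification and the concave quadratic toy target are illustrative choices, standing in for the more careful modification techniques discussed in the numerical optimization literature.

```python
import numpy as np

def num_grad(F, m, h=1e-5):
    """Central finite-difference gradient of F at m."""
    return np.array([(F(m + h * e) - F(m - h * e)) / (2 * h)
                     for e in np.eye(m.size)])

def num_hess(F, m, h=1e-4):
    """Central finite-difference Hessian of F at m, symmetrized."""
    E = np.eye(m.size)
    H = np.array([[(F(m + h * ei + h * ej) - F(m + h * ei - h * ej)
                    - F(m - h * ei + h * ej) + F(m - h * ei - h * ej))
                   / (4 * h * h) for ej in E] for ei in E])
    return 0.5 * (H + H.T)

def newton_ascent(F, m0, rho=0.5, c=1e-4, tol=1e-5, max_iter=100):
    """Globalized Newton ascent as in Table 2."""
    m = np.asarray(m0, dtype=float)
    for _ in range(max_iter):
        g = num_grad(F, m)
        if np.linalg.norm(g) < tol:          # steps 0/5: stationary point
            break
        H = num_hess(F, m)
        p = np.linalg.solve(H, g)            # step 1: Newton direction
        if p @ g < 0:                        # step 2: descent direction, so
            w, V = np.linalg.eigh(H)         # modify H: flip/floor its
            w = np.maximum(np.abs(w), 1e-8)  # eigenvalues to render it
            p = V @ ((V.T @ g) / w)          # positive-definite
        t = 1.0                              # step 3: backtracking
        while F(m + t * p) < F(m) + c * t * (g @ p):
            t *= rho
        m = m + t * p                        # step 4
    return m

# Concave quadratic toy target with maximum at (1, -0.5)
F_toy = lambda m: -(m[0] - 1.0) ** 2 - 2.0 * (m[1] + 0.5) ** 2
m_star = newton_ascent(F_toy, np.array([3.0, 2.0]))
```

For this concave quadratic, the modified Newton direction equals the exact Newton ascent step, so the scheme reaches the maximum essentially in one iteration.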
The optimization scheme actually employed in much of the DCM simulation literature, however, is
more specific. It derives from the local linearization method for nonlinear stochastic dynamical systems and
is usually formulated in differential equation form. In Table 4, we provide a basic numerical optimization
formulation of this scheme.
Initialization
0. Define a starting point m_θ^(0) ∈ ℝ^m and set k := 0. If ∇F(m_θ^(0)) = 0, stop! m_θ^(0) is a zero of ∇F. If not, proceed to iterations.
Until Convergence
1. Set m_θ^(k+1) := m_θ^(k) + (exp(τ H_F(m_θ^(k))) − I) H_F(m_θ^(k))^{-1} ∇F(m_θ^(k)).
2. If ∇F(m_θ^(k+1)) = 0, stop! m_θ^(k+1) is a zero of ∇F. If not, go to 3.
3. Set k := k + 1 and go to 1.
Table 4. Local linearization based gradient ascent algorithm
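A sketch of the update in Table 4 is given below, with the matrix exponential computed via an eigendecomposition of the (symmetric) Hessian. The concave quadratic toy target and the value τ = 1 are illustrative assumptions.

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def ll_ascent(grad_F, hess_F, m0, tau=1.0, tol=1e-8, max_iter=1_000):
    """Local linearization update of Table 4:
    m <- m + (exp(tau * H_F(m)) - I) H_F(m)^{-1} grad_F(m)."""
    m = np.asarray(m0, dtype=float)
    I = np.eye(m.size)
    for _ in range(max_iter):
        g = grad_F(m)
        if np.linalg.norm(g) < tol:
            break
        H = hess_F(m)
        m = m + (expm_sym(tau * H) - I) @ np.linalg.solve(H, g)
    return m

# Concave quadratic toy target F(m) = -0.5 (m - a)^T C (m - a):
# grad F(m) = -C (m - a), H_F(m) = -C, maximum at a
C = np.diag([1.0, 3.0])
a = np.array([2.0, -1.0])
m_star = ll_ascent(lambda m: -C @ (m - a), lambda m: -C, np.zeros(2))
```

For this quadratic target the iterates satisfy m_θ^(k+1) − a = exp(τ H_F)(m_θ^(k) − a), so with a negative-definite Hessian the scheme contracts toward the maximum for any τ > 0.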